Monk Datastore Overview

Version History

Version 1.0.21. April 21, 2009

We added the Monk licensing agreement to the source code and the documentation and revised the datastore overview documentation.

Version 1.0.20. April 11, 2009

We fixed a bug which caused null pointer exceptions when using aggregate counters.

Version 1.0.19. March 28, 2009

We added the ECCO and DocSouth corpora (Eighteenth Century Collections Online and Documenting the American South).

There were a number of changes to NUPOS.

We now include special "gap marker" words in the datastore for "gap" elements in the TEI-A source files.

Version 1.0.18. February 11, 2009

Precomputed counts and frequencies are now available for lemma bigrams and part of speech trigrams, using the new feature classes LemmaBigram and PosTrigram.

We added two new examples to find lemma bigrams and to find collocates.

The 30 old individual search criteria for container summary counts (e.g., AuthorNumWordsCriterion, WorkPartNumSentencesNonCumMainCriterion, etc.) have been replaced by four new parameterized search criteria.

Version 1.0.17. January 12, 2009

All counts and frequencies are now available over just the main text (non-paratext) in the container in addition to over all the text in the container. To get counts and frequencies over just main text, append "Main" to the method name. For example:

Print the number of words in main text for a work:

System.out.println(work.getNumWordsMain());

For a work part lemma counter named counter, the following prints the non-cumulative frequency of the lemma in the work part over just main text (non-paratext):

System.out.println(counter.getFreqMain(CumKind.NON_CUM));

Version 1.0.16. December 16, 2008

In the datastore test web application, search results can be saved to tab-delimited text files, and multiple values can be selected or entered for many of the search criteria.

The colOrdinal, puncBefore, and puncAfter attributes of words were being set incorrectly. One symptom was that concordances were incorrect. The bugs have been fixed.

Version 1.0.15. December 10, 2008

There are six new word attributes and associated search criteria.

Spellings are now mapped to lower case. All spelling searches and spelling counts are now case-insensitive. The new token attribute provides case-sensitive spellings of words.

The authors attribute for works is now an ordered list rather than an unordered collection.

When corpora, works, and work parts are sorted by title, any "The", "An" and "A" prefixes are now ignored.
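The title ordering described above can be sketched as follows. This is only an illustration, not the datastore's own code: the class TitleSortSketch and its methods are hypothetical, and the case-insensitive comparison is an assumption that the release note does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of the Monk API.
class TitleSortSketch {

    // Returns the title with one leading "The ", "An " or "A " removed.
    // Lower-casing the key is an assumption made here so that the
    // comparison is case-insensitive.
    static String sortKey(String title) {
        String lower = title.trim().toLowerCase(Locale.ROOT);
        for (String article : new String[] {"the ", "an ", "a "}) {
            if (lower.startsWith(article)) {
                return lower.substring(article.length()).trim();
            }
        }
        return lower;
    }

    // Returns a new list sorted by the article-free key; the input is not modified.
    static List<String> sortTitles(List<String> titles) {
        List<String> sorted = new ArrayList<>(titles);
        sorted.sort(Comparator.comparing(TitleSortSketch::sortKey));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> titles = Arrays.asList(
            "The Tempest", "Hamlet", "A Midsummer Night's Dream", "An Essay on Man");
        // prints [An Essay on Man, Hamlet, A Midsummer Night's Dream, The Tempest]
        System.out.println(sortTitles(titles));
    }
}
```

The original titles are kept for display; only the comparison key drops the article.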

The HTML text is formatted a little more nicely.

Version 1.0.14. December 4, 2008

The datastore now models sentences. There are several new attributes and associated search criteria.

New word attributes:

New corpus attribute:

New author attribute:

New work part attributes:

The "Clever, handsome and rich" example was modified to only find and print pattern matches which do not cross sentence boundaries.

The new "Get random sentences" example finds and prints random sentences from a work.

We added new search criteria for work and work part tag patterns.

Acolyte now outputs an xmlns attribute on the monkHeader element to declare a namespace binding.

The new tool Val can be used to validate Monk texts using Jing and RELAX NG schemata for TEI-A files, morphadorned files, and bibadorned files. See Val for details.

Version 1.0.13. November 26, 2008

We made some minor changes to Acolyte and Prior and did some editing of the documentation.

Version 1.0.12. November 23, 2008

We added documentation on the data ingest pipeline, Acolyte, Prior, and cdb.csh.

Version 1.0.11. November 22, 2008

We added the Shakespeare and Early American Fiction corpora.

We removed seven bad texts from the Wright corpus which had been included by mistake.

Version 1.0.10. November 15, 2008

We added fourteen large EEBO texts.

Version 1.0.9. November 13, 2008

There are four new work attributes and associated search criteria.

There are three new author attributes and associated search criteria.

All bibliographic attributes, including the new items listed above, are now obtained from curator-supplied data. We no longer use the teiHeader to get this information.

The new ingest program named Acolyte merges curator-supplied bibliographic data with TEI-A XML files. The bibliographic data is added to the XML files in a new monkHeader element at the beginning of the file. In the Monk ingest pipeline, Acolyte is run after Abbot and MorphAdorner and before Prior.

There are several new texts, and the format and tagging data for many of the texts have been improved.

The TCP corpus has been renamed EEBO.

For the author birth and death year attributes and for the work circulation year attribute, we now use null instead of the value -1 to represent an unknown or missing value. These attributes now have type Integer instead of type int. This makes it possible to accommodate negative years. For example, the author Lucius Annaeus Seneca has a birth year of -4 (representing 4 B.C.).
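Client code that displays these years now has to handle null. The following is a minimal sketch of one way to do that, assuming a year held in an Integer as described above; the class YearSketch and the method formatYear are hypothetical and are not part of the datastore.

```java
// Hypothetical helper for illustration only; not part of the Monk API.
class YearSketch {

    // Formats a nullable year value.
    // null means unknown or missing; a negative value is shown as a year B.C.,
    // following the Seneca example (-4 represents 4 B.C.).
    static String formatYear(Integer year) {
        if (year == null) {
            return "unknown";
        }
        if (year < 0) {
            return (-year) + " B.C.";
        }
        return String.valueOf(year);
    }

    public static void main(String[] args) {
        System.out.println(formatYear(-4));    // prints 4 B.C.
        System.out.println(formatYear(null));  // prints unknown
        System.out.println(formatYear(1564));  // prints 1564
    }
}
```

With the old int representation, -1 could not be distinguished from the year 1 B.C.; the null check removes that ambiguity.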

In all of the numeric range search criteria, we now use null instead of negative numbers to represent no constraint.
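The idea of a null bound meaning "no constraint" can be sketched like this. The class RangeSketch and the method inRange are hypothetical and only illustrate the convention; they do not reproduce the actual search criterion classes or their constructors.

```java
// Hypothetical helper for illustration only; not part of the Monk API.
class RangeSketch {

    // Tests a value against an inclusive range.
    // A null bound means that side of the range is unconstrained,
    // so negative bounds remain available for real negative values.
    static boolean inRange(int value, Integer min, Integer max) {
        if (min != null && value < min) {
            return false;
        }
        if (max != null && value > max) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(inRange(-4, null, 0));     // prints true
        System.out.println(inRange(-4, -1, null));    // prints false
        System.out.println(inRange(1850, null, null)); // prints true
    }
}
```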

The new method getTeiHeader in the Work class gets the XML for the teiHeader for the work.

The getXml method in the WorkPart class has been renamed getAdornedXml.

The unadorned XML for each work part is now precomputed and stored in the database. This makes the getUnadornedXml method in the WorkPart class faster.

The AuthorBirthDateCriterion and AuthorDeathDateCriterion search criteria have been renamed AuthorBirthYearCriterion and AuthorDeathYearCriterion.

The TEI elements front, body and back now create work parts.

There is no longer an initial preconstructed "title page" work part in each work. Title pages, if desired, can be formatted using the TEI headers.

Version 1.0.8. October 22, 2008

We fixed a bug which could cause the Concordance.find method to hang on very large collections of words.

Version 1.0.7. September 10, 2008

We added a new method getUnadornedXml to the WorkPart class.

Version 1.0.6. August 29, 2008

This version of the datastore was built with the new Prior tool. It contains the NCF, TCP and Wright collections, as tagged by MorphAdorner with the latest version of NUPOS.

You can now get XML for work parts with the new getXml method.

The HTML for work parts is formatted differently. There is no longer any font style or line justification formatting. The only formatting is line and paragraph breaks at the appropriate places.

Version 1.0.5. February 19, 2008

We upgraded the MySQL connector to version 5.0.8 and fixed a bug in the unit tests.

Version 1.0.4. February 15, 2008

Spelling counts were incorrect because contractions were counted twice. We fixed the bug.

Version 1.0.3. February 12, 2008

The static method Work.find did not work when passed a collection of search criteria due to what we think is a bug in Apple's Java compiler version 1.5.0_13. We found a workaround for the compiler bug.

Words with accent marks in form fields in the test servlet did not work properly. We fixed the bug.

Version 1.0.2. January 23, 2008

We fixed a bug which could cause some kinds of searches to take a very long time, especially counter searches with large result sets.

Version 1.0.1. December 22, 2007

Work part objects have a new type attribute and search criterion. The type attribute of a work part is the string value of the type attribute on the div or other element in the TEI source from which the work part was created, or the value **missing** if there was no such attribute.

The test servlet has a new operation in the "Examples" section named "Work Part Types" which displays a summary of work part types for a given corpus.

Spelling objects have a new length attribute and search criterion.

You may now call ModelInit.init more than once. All calls after the first one are ignored.
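One common way to make an initializer safe to call repeatedly is a guard flag, sketched below. This is an assumption about the general pattern, not the actual implementation of ModelInit.init; the class InitOnceSketch and its members are hypothetical.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical class for illustration only; not the real ModelInit.
class InitOnceSketch {

    private static final AtomicBoolean initialized = new AtomicBoolean(false);
    private static int setupRuns = 0;

    // Runs the setup work on the first call only; later calls return immediately.
    static void init() {
        if (!initialized.compareAndSet(false, true)) {
            return; // already initialized, so this call is ignored
        }
        setupRuns++; // stands in for the real one-time setup work
    }

    // Reports how many times the setup work actually ran.
    static int getSetupRuns() {
        return setupRuns;
    }

    public static void main(String[] args) {
        init();
        init();
        init();
        System.out.println(getSetupRuns()); // prints 1
    }
}
```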

Each new release of the datastore has a version number, and the documentation has a new "version history" section which records all the changes and new features introduced in each new version.

Version 1.0. November 16, 2007

The first release of the Monk datastore.
