Monk Datastore Overview
We added the Monk licensing agreement to the source code and the documentation and revised the datastore overview documentation.
We fixed a bug which caused null pointer exceptions when using aggregate counters.
We added the ECCO and DocSouth corpora (Eighteenth Century Collections Online and Documenting the American South).
There were a number of changes to NUPOS.
We now include special "gap marker" words in the datastore for "gap" elements in the TEI-A source files.
Precomputed counts and frequencies are now available for lemma bigrams and part of speech trigrams, using the new feature classes LemmaBigram and PosTrigram.
We added two new examples to find lemma bigrams and to find collocates.
The 30 old individual search criteria for container summary counts (e.g., AuthorNumWordsCriterion, WorkPartNumSentencesNonCumMainCriterion, etc.) have been replaced by four new parameterized search criteria:
- CorpusSummaryCountCriterion
- AuthorSummaryCountCriterion
- WorkSummaryCountCriterion
- WorkPartSummaryCountCriterion
All counts and frequencies are now available over just the main text (non-paratext) in the container in addition to over all the text in the container. To get counts and frequencies over just main text, append "Main" to the method name. For example:
Print the number of words in main text for a work:
    System.out.println(work.getNumWordsMain());
For a work part lemma counter named counter, the following prints the non-cumulative frequency of the lemma in the work part over just main text (non-paratext):
    System.out.println(counter.getGetFreqMain(CumKind.NON_CUM));
In the datastore test web application, search results can be saved to tab-delimited text files, and multiple values can be selected or entered for many of the search criteria.
The colOrdinal, puncBefore, and puncAfter attributes of words were being set incorrectly. One symptom was that concordances were incorrect. The bugs have been fixed.
There are six new word attributes and associated search criteria:
- context: The TEI "context" of the word. This attribute can be used to answer questions like "Is this word in a paragraph?" It is the path of TEI element names leading from and including the element that created the work part containing the word down to but not including the w element for the word, separated by slashes, with leading and trailing slashes. E.g., "/div/quote/p/hi/".
- verse: True if the word appears in verse, false if it appears in prose.
- paratext: True if the word appears in paratext, false if it appears in main text.
- standardSpelling: The standard spelling of the word mapped to lower case.
- token: The spelling of the word as it appears in the text.
- parBreak: True if the word is followed by a paragraph break. Paragraph breaks are represented in concordances by "//".
Spellings are now mapped to lower case. All spelling searches and spelling counts are now case-insensitive. The new token attribute provides case-sensitive spellings of words.
The authors attribute for works is now an ordered list rather than an unordered collection.
When corpora, works, and work parts are sorted by title, any "The", "An" and "A" prefixes are now ignored.
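Because the context attribute is a slash-delimited path with leading and trailing slashes, a question such as "Is this word in a paragraph?" can be answered with a substring test. The following is a minimal sketch in plain Java, not part of the Monk API: the ContextPaths class and its isInside method are hypothetical names, and the context string is assumed to have already been read from a word.

```java
/** Hypothetical helper for TEI context paths such as "/div/quote/p/hi/". */
final class ContextPaths {

    private ContextPaths() {}

    /**
     * Returns true if the given element name occurs anywhere in the context path.
     * Wrapping the name in slashes keeps "p" from matching inside "sp" or "pb".
     */
    static boolean isInside(String context, String element) {
        if (context == null || element == null || element.isEmpty()) {
            return false;
        }
        return context.contains("/" + element + "/");
    }

    public static void main(String[] args) {
        String context = "/div/quote/p/hi/"; // example path from the attribute description
        System.out.println(isInside(context, "p"));     // is the word in a paragraph?
        System.out.println(isInside(context, "quote")); // is the word in a quotation?
        System.out.println(isInside(context, "l"));     // is the word in a verse line?
    }
}
```

The test relies on the leading and trailing slashes that the attribute description guarantees. Without them, the first and last element names in the path would need special handling.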
The HTML text formatting has been improved slightly.
The datastore now models sentences. There are several new attributes and associated search criteria.
New word attributes:
- endOfSentence: True if the word is at the end of a sentence.
- sentenceInWorkOrdinal: The ordinal of the word's sentence within its work.
- wordInSentenceOrdinal: The ordinal of the word within its sentence.
New corpus attribute:
- numSentences: The number of sentences in the corpus.
New author attribute:
- numSentences: The number of sentences by the author.
New work part attributes:
- numSentencesCum: The cumulative number of sentences in the work part.
- numSentencesNonCum: The non-cumulative number of sentences in the work part.
The "Clever, handsome and rich" example was modified to only find and print pattern matches which do not cross sentence boundaries.
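A pattern match stays within one sentence exactly when every matched word has the same sentence ordinal. The sketch below illustrates that check in plain Java. It is not the code of the modified example: the Word class is a hypothetical stand-in that carries only a spelling and a field modeled on the sentenceInWorkOrdinal attribute, and all words are assumed to come from the same work.

```java
import java.util.Arrays;
import java.util.List;

final class SentenceBoundaries {

    /** Hypothetical stand-in for a datastore word; not the Monk Word class. */
    static final class Word {
        final String spelling;
        final int sentenceInWorkOrdinal;

        Word(String spelling, int sentenceInWorkOrdinal) {
            this.spelling = spelling;
            this.sentenceInWorkOrdinal = sentenceInWorkOrdinal;
        }
    }

    private SentenceBoundaries() {}

    /** Returns true if all words of the match share one sentence ordinal. */
    static boolean withinOneSentence(List<Word> match) {
        if (match.isEmpty()) {
            return true; // an empty match cannot cross a boundary
        }
        int first = match.get(0).sentenceInWorkOrdinal;
        for (Word word : match) {
            if (word.sentenceInWorkOrdinal != first) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // All three words lie in sentence 12 of the work.
        List<Word> same = Arrays.asList(
            new Word("clever", 12), new Word("handsome", 12), new Word("rich", 12));
        // The last word falls in the following sentence.
        List<Word> crossing = Arrays.asList(
            new Word("clever", 12), new Word("handsome", 12), new Word("rich", 13));
        System.out.println(withinOneSentence(same));
        System.out.println(withinOneSentence(crossing));
    }
}
```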
The new "Get random sentences" example finds and prints random sentences from a work.
We added new search criteria for work and work part tag patterns.
Acolyte now outputs an xmlns attribute on the monkHeader element to declare a namespace binding.
The new tool Val can be used to validate Monk texts using Jing and RELAX NG schemata for TEI-A files, morphadorned files, and bibadorned files. See Val for details.
We made some minor changes to Acolyte and Prior and did some editing of the documentation.
We added documentation on the data ingest pipeline, Acolyte, Prior, and cdb.csh.
We added the Shakespeare and Early American Fiction corpora.
We removed seven bad texts from the Wright corpus which had been included by mistake.
We added fourteen large EEBO texts.
There are four new work attributes and associated search criteria:
- circulation year: The year the work was first circulated. This new attribute and search criterion replaces the old publication date start and end attributes and search criteria.
- genre: The work's genre (e.g., "prose", "play", etc.)
- subgenre: The work's subgenre, if any (e.g., "sermon", "witchcraft", etc.)
- availability: E.g., "restricted", "unrestricted", etc.
There are three new author attributes and associated search criteria:
- gender: The author's gender (e.g., "M", "F", "U").
- origin: The author's origin (e.g., "British Isles", "American", etc.)
- flourished: When the author flourished (e.g., "1612-1653", "n/a", etc.)
All bibliographic attributes, including the new items listed above, are now obtained from curator-supplied data. We no longer use the teiHeader to get this information.
The new ingest program named Acolyte merges curator-supplied bibliographic data with TEI-A XML files. The bibliographic data is added to the XML files in a new monkHeader element at the beginning of the file. In the Monk ingest pipeline, Acolyte is run after Abbot and MorphAdorner and before Prior.
There are several new texts, and the format and tagging data for many of the texts have been improved.
The TCP corpus has been renamed EEBO.
For the author birth and death year attributes and for the work circulation year attribute, we now use null instead of the value -1 to represent an unknown or missing value. These attributes now have type Integer instead of type int. This makes it possible to accommodate negative years. For example, the author Lucius Annaeus Seneca has a birth year of -4 (representing 4 B.C.).
In all of the numeric range search criteria, we now use null instead of negative numbers to represent no constraint.
The new method getTeiHeader in the Work class gets the XML for the teiHeader for the work.
The getXml method in the WorkPart class has been renamed getAdornedXml.
The unadorned XML for each work part is now precomputed and stored in the database. This makes the getUnadornedXml method in the WorkPart class faster.
The AuthorBirthDateCriterion and AuthorDeathDateCriterion search criteria have been renamed AuthorBirthYearCriterion and AuthorDeathYearCriterion.
The TEI elements front, body and back now create work parts.
There is no longer an initial preconstructed "title page" work part in each work. Title pages, if desired, can be formatted using the TEI headers.
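The switch from -1 to null for unknown years, and from negative numbers to null for unconstrained range bounds, affects any client code that compares or prints these values. The following sketch shows one way to handle both cases in plain Java. The NullableYears class and its methods are hypothetical and are not part of the datastore API. The rule that an unknown year fails any constrained range is an assumption made here, not documented behavior.

```java
/** Hypothetical helpers for year values where null means "unknown" or "no constraint". */
final class NullableYears {

    private NullableYears() {}

    /** Formats a year for display; negative years are shown as B.C. */
    static String formatYear(Integer year) {
        if (year == null) {
            return "unknown";
        }
        return year < 0 ? (-year) + " B.C." : String.valueOf(year);
    }

    /**
     * Tests a year against an inclusive range. A null bound means no constraint
     * on that side. Assumption: an unknown (null) year matches only when
     * neither bound is set.
     */
    static boolean inRange(Integer year, Integer min, Integer max) {
        if (min == null && max == null) {
            return true;
        }
        if (year == null) {
            return false;
        }
        boolean aboveMin = (min == null) || year >= min;
        boolean belowMax = (max == null) || year <= max;
        return aboveMin && belowMax;
    }

    public static void main(String[] args) {
        Integer birthYear = -4; // Seneca's birth year, as given in the text above
        System.out.println(formatYear(birthYear));
        System.out.println(formatYear(null));
        System.out.println(inRange(birthYear, null, 0)); // any year up to 0
        System.out.println(inRange(birthYear, 0, null)); // year 0 or later
    }
}
```

With the old int representation, a birth year of -4 would have been indistinguishable in kind from the -1 sentinel, which is why a separate null value is needed once negative years are legal.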
We fixed a bug which could cause the Concordance.find method to hang on very large collections of words.
We added a new method getUnadornedXml to the WorkPart class.
This version of the datastore was built with the new Prior tool. It contains the NCF, TCP and Wright collections, as tagged by MorphAdorner with the latest version of NUPOS.
You can now get XML for work parts with the new getXml method.
The HTML for work parts is formatted differently. There is no longer any font style or line justification formatting. The only formatting is line and paragraph breaks at the appropriate places.
We upgraded the MySQL connector to version 5.0.8 and fixed a bug in the unit tests.
Spelling counts were incorrect because contractions were counted twice. We fixed the bug.
The static method Work.find did not work when passed a collection of search criteria due to what we think is a bug in Apple's Java compiler version 1.5.0_13. We found a workaround for the compiler bug.
Words with accent marks in form fields in the test servlet did not work properly. We fixed the bug.
We fixed a bug which could cause some kinds of searches to take a very long time, especially counter searches with large result sets.
Work part objects have a new type attribute and search criterion. The type attribute of a work part is the string value of the type attribute on the div or other element in the TEI source from which the work part was created, or the value **missing** if there was no such attribute.
The test servlet has a new operation in the "Examples" section named "Work Part Types" which displays a summary of work part types for a given corpus.
Spelling objects have a new length attribute and search criterion.
You may now call ModelInit.init more than once. All calls after the first one are ignored.
Each new release of the datastore has a version number, and the documentation has a new "version history" section which records all the changes and new features introduced in each new version.
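The call-once behavior described for ModelInit.init is a common initialization pattern. These notes do not say how ModelInit implements it, so the following is only a generic sketch of the idea in plain Java, using a hypothetical InitOnce class with a guarded static initializer.

```java
/** Hypothetical illustration of an initializer that ignores repeated calls. */
final class InitOnce {

    private static boolean initialized = false;
    private static int setupRuns = 0; // counts how often the real setup ran

    private InitOnce() {}

    /**
     * Performs setup on the first call only. Returns true if this call did
     * the setup and false if it was ignored. The method is synchronized so
     * that concurrent callers cannot both run the setup.
     */
    static synchronized boolean init() {
        if (initialized) {
            return false;
        }
        setupRuns++; // stand-in for expensive setup work
        initialized = true;
        return true;
    }

    static synchronized int getSetupRuns() {
        return setupRuns;
    }

    public static void main(String[] args) {
        init();
        init(); // ignored: setup already ran
        System.out.println("setup ran " + getSetupRuns() + " time(s)");
    }
}
```

The benefit for callers is that independent components can each call the initializer defensively without coordinating who goes first.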
The first release of the Monk datastore.