Monk Datastore Overview


Core Objects and Attributes

Core objects are loaded into memory when the model is initialized. These are all the objects in the model except for words, word parts, and counters. All core objects have a "tag", a short unique string identifier for the object.

Many of the summary count attributes listed below are counted over "main text (non-paratext)" only. For the definitions of main text and paratext, see Words and Word Searching.

Corpus

tag A short unique string identifier for the corpus.
title The corpus title.
numWorks The number of works in the corpus.
works Collection of works in the corpus.
numAuthors The number of authors who have works in the corpus.
authors Collection of authors who have works in the corpus.
numWords The number of words in the corpus.
numWordParts The number of word parts in the corpus.
numSentences The number of sentences in the corpus.
numWordsMain The number of words in main text (non-paratext) in the corpus.
numWordPartsMain The number of word parts in main text (non-paratext) in the corpus.
numSentencesMain The number of sentences in main text (non-paratext) in the corpus.

Note that the number of word parts is typically greater than the number of words, with the difference being the number of contractions.

Work

corpus The corpus to which the work belongs.
circulationYear The year the work was first circulated, or null if unknown.
genre The work's genre. E.g., "prose", "play", etc.
subgenre The work's subgenre, if any. E.g., "sermon", "witchcraft", etc.
availability The work's availability. E.g., "restricted", "unrestricted", etc.
numAuthors The number of authors of the work.
authors List of authors of the work.
teiHeader The teiHeader XML for the work, or null if none.

Genre, subgenre, and availability may be empty strings, but they are never null.

Works are also work parts. Each work is the root of the tree of its parts. In addition to the attributes displayed in the table above, each work also has all the attributes in the table displayed below for work parts. In the Java code, Work is a subclass of WorkPart.
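The subclass relationship can be sketched as follows. This is an illustration only, not the actual Monk source: the class names SketchWorkPart and SketchWork, the constructors, and the small subset of fields shown are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a work part is a node in the work part tree.
class SketchWorkPart {
    final String tag;
    final String title;
    final SketchWorkPart parent;   // null for the root of a tree (a work)
    final List<SketchWorkPart> children = new ArrayList<>();

    SketchWorkPart(String tag, String title, SketchWorkPart parent) {
        this.tag = tag;
        this.title = title;
        this.parent = parent;
        if (parent != null) {
            parent.children.add(this);
        }
    }

    // The work to which this part belongs: the root of its tree.
    // Assumes every root is a SketchWork, as the text states for Monk.
    SketchWork getWork() {
        SketchWorkPart node = this;
        while (node.parent != null) {
            node = node.parent;
        }
        return (SketchWork) node;
    }

    // Level in the tree; the root (the work) has level 0.
    int getLevel() {
        int level = 0;
        for (SketchWorkPart node = parent; node != null; node = node.parent) {
            level++;
        }
        return level;
    }
}

// Hypothetical sketch: a work is a parentless work part with extra attributes.
class SketchWork extends SketchWorkPart {
    final String genre;   // may be an empty string, never null

    SketchWork(String tag, String title, String genre) {
        super(tag, title, null);
        this.genre = (genre == null) ? "" : genre;
    }
}
```

Because a work is itself a work part, code that walks the tree can treat the root and its descendants uniformly, and only needs the Work type when it reads work-level attributes such as genre.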

WorkPart

tag A short unique string identifier for the work part.
title The title of the work part.
type The type of the work part.
parent The parent of the work part in the work part tree. Null for the root of a tree (a work).
numChildren The number of children of the work part in the work part tree.
children List of the children of the work part in the work part tree.
work The work to which the work part belongs.
workOrdinal The ordinal of the work part within its work, in preorder traversal of the work part tree, with the root of the tree (the work) having ordinal 0.
level The level of the work part in the work part tree. The root of the tree (the work) has level 0.
hasChildren True if the work part has child parts. False if it is a leaf node of the work part tree. This condition is the same as numChildren > 0.
hasText True if the work part has text. This condition is the same as htmlText != null.
hasWords True if the work part has words. This condition is the same as numWordsNonCum > 0.
numWordsCum The cumulative number of words in the work part.
numWordsNonCum The non-cumulative number of words in the work part.
numWordPartsCum The cumulative number of word parts in the work part.
numWordPartsNonCum The non-cumulative number of word parts in the work part.
numSentencesCum The cumulative number of sentences in the work part.
numSentencesNonCum The non-cumulative number of sentences in the work part.
numWordsCumMain The cumulative number of words in main text (non-paratext) in the work part.
numWordsNonCumMain The non-cumulative number of words in main text (non-paratext) in the work part.
numWordPartsCumMain The cumulative number of word parts in main text (non-paratext) in the work part.
numWordPartsNonCumMain The non-cumulative number of word parts in main text (non-paratext) in the work part.
numSentencesCumMain The cumulative number of sentences in main text (non-paratext) in the work part.
numSentencesNonCumMain The non-cumulative number of sentences in main text (non-paratext) in the work part.
htmlText The text for the work part formatted as HTML, or null if none.
adornedXML The morphadorned TEI-A XML for the work part, or null if none.
unadornedXML The unadorned TEI-A XML for the work part, or null if none.
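
The workOrdinal attribute in the table above is defined by a preorder traversal, which can be sketched as below. The PreorderOrdinals class and its Node type are hypothetical helpers written for this example; they are not part of Monk, and the sketch relies on tags being unique, as the document states for all core objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PreorderOrdinals {
    // Hypothetical minimal node: a unique tag plus an ordered list of children.
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag, Node... kids) {
            this.tag = tag;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    // Returns a map from tag to ordinal, in preorder (parent before children).
    static Map<String, Integer> assign(Node root) {
        Map<String, Integer> ordinals = new LinkedHashMap<>();
        visit(root, ordinals);
        return ordinals;
    }

    private static void visit(Node node, Map<String, Integer> ordinals) {
        // The map size is the next unused ordinal, so the root receives 0.
        ordinals.put(node.tag, ordinals.size());
        for (Node child : node.children) {
            visit(child, ordinals);
        }
    }
}
```

For a work with two volumes, where the first volume has two chapters, the ordinals run: work 0, first volume 1, its chapters 2 and 3, second volume 4.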

Work parts have both cumulative and non-cumulative counts of the number of words, word parts, and sentences which they contain. Cumulative counts include the work part itself and all of its descendants. Non-cumulative counts include just the work part itself, not any of its descendants.
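
The relationship between the two kinds of count can be sketched as a recursive sum. The CumulativeCounts class and its Part type are hypothetical and exist only for this illustration. How Monk actually computes or stores these counts is not specified here.

```java
import java.util.ArrayList;
import java.util.List;

final class CumulativeCounts {
    // Hypothetical minimal work part: its own word count plus its children.
    static final class Part {
        final int numWordsNonCum;
        final List<Part> children = new ArrayList<>();

        Part(int numWordsNonCum, Part... kids) {
            this.numWordsNonCum = numWordsNonCum;
            for (Part kid : kids) {
                children.add(kid);
            }
        }
    }

    // Cumulative count = the part's own words plus those of all descendants.
    static int numWordsCum(Part part) {
        int total = part.numWordsNonCum;
        for (Part child : part.children) {
            total += numWordsCum(child);
        }
        return total;
    }
}
```

For a leaf part the two counts coincide. For example, a volume with 40 words of its own and two chapters of 1000 and 1200 words has a non-cumulative count of 40 and a cumulative count of 2240.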

Note that work parts may contain both words and children. It is not the case that all words are contained within leaf nodes of the work part tree. For example, consider a novel organized into volumes at level 1 and chapters at level 2. A volume may contain words at the beginning which precede the first chapter of the volume. The work part for this volume contains both words and child parts. Such a work part has hasWords=true and hasChildren=true. A work part which contains both words and children is called "odd".

A work part may also have text in which none of the words are tagged, for example, on a title page or dramatis personae page. Such parts have hasText=true and hasWords=false.
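
The three boolean attributes and the "odd" condition follow directly from the equivalences given in the WorkPart table. A minimal sketch, in which the WorkPartFlags class and its isOdd and isUntaggedText helpers are hypothetical names introduced for this example:

```java
final class WorkPartFlags {
    // Same as numChildren > 0.
    static boolean hasChildren(int numChildren) {
        return numChildren > 0;
    }

    // Same as htmlText != null.
    static boolean hasText(String htmlText) {
        return htmlText != null;
    }

    // Same as numWordsNonCum > 0.
    static boolean hasWords(int numWordsNonCum) {
        return numWordsNonCum > 0;
    }

    // "Odd": the part contains both words of its own and child parts.
    static boolean isOdd(int numChildren, int numWordsNonCum) {
        return hasWords(numWordsNonCum) && hasChildren(numChildren);
    }

    // Text with no tagged words, e.g. a title page.
    static boolean isUntaggedText(String htmlText, int numWordsNonCum) {
        return hasText(htmlText) && !hasWords(numWordsNonCum);
    }
}
```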

Tagged words in the HTML text are marked up as:


<span id="wordTag">word</span>

In the HTML text, notes (as encoded in TEI note elements) are moved to the end of the text and numbered. In the main text, each note is referenced by its number (e.g., "[13]").
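
Given the word markup shown earlier in this section, word tags can be recovered from a work part's htmlText with a simple pattern match. The sketch below is hypothetical: the TaggedWordSpans class is not part of Monk, and it assumes each tagged word appears exactly in the form shown, with id as the only attribute and no nested markup inside the span. A real HTML parser would be more robust.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TaggedWordSpans {
    // Matches <span id="wordTag">word</span> as described in the text.
    private static final Pattern SPAN =
        Pattern.compile("<span id=\"([^\"]+)\">([^<]*)</span>");

    // Returns word tag -> word text, in document order.
    static Map<String, String> extract(String html) {
        Map<String, String> words = new LinkedHashMap<>();
        Matcher m = SPAN.matcher(html);
        while (m.find()) {
            words.put(m.group(1), m.group(2));
        }
        return words;
    }
}
```

Text outside any span, such as punctuation or untagged title-page text, is ignored by this sketch.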

The type attribute of a work part is the string value of the type attribute on the div or other element in the TEI source from which the work part was created, or the value **missing** if there was no such attribute.

Author

tag A short unique string identifier for the author, currently the same as the author's name.
name The author's name.
birthYear The author's birth year, or null if unknown.
deathYear The author's death year, or null if unknown.
flourished When the author flourished. E.g., "1612-1653", "n/a", etc.
origin The author's origin. E.g., "British Isles", "American", etc.
gender The author's gender. E.g., "M", "F", "U".
numWorks The number of works by the author.
works Collection of works by the author.
numCorpora The number of corpora in which the author has works.
corpora Collection of corpora in which the author has works.
numWords The number of words by the author.
numWordParts The number of word parts by the author.
numSentences The number of sentences by the author.
numWordsMain The number of words in main text (non-paratext) by the author.
numWordPartsMain The number of word parts in main text (non-paratext) by the author.
numSentencesMain The number of sentences in main text (non-paratext) by the author.

Flourished, origin, and gender may be empty strings, but they are never null.

If two authors have the same name, by convention they are distinguished by appending a number to the end of the name. For example,

Burton, William(1)
Burton, William(2)
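
Code that needs the plain name without the disambiguating number could split it off as sketched below. The AuthorNames class is a hypothetical helper, not a Monk API, and it assumes that any trailing parenthesized digits are a disambiguation number, following the convention just described.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AuthorNames {
    // A name followed immediately by "(digits)" at the very end.
    private static final Pattern SUFFIX = Pattern.compile("^(.*)\\((\\d+)\\)$");

    // The name with any trailing "(n)" removed.
    static String baseName(String name) {
        Matcher m = SUFFIX.matcher(name);
        return m.matches() ? m.group(1) : name;
    }

    // The disambiguation number, or 0 if the name carries none.
    static int number(String name) {
        Matcher m = SUFFIX.matcher(name);
        return m.matches() ? Integer.parseInt(m.group(2)) : 0;
    }
}
```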

Lemma

tag A short unique string identifier for the lemma.
headWord The lemma's headword.
wordClass The lemma's word class.
homonym The lemma's homonym number, or 0 if the lemma is not a homonym.

Lemma tags are formed from the headword and word class tag in parentheses. For example, the lemma tag for the verb "love" is:

love (v)

There may be more than one lemma with the same headword. For example, the lemma tag for the noun "love" is:

love (n)

The current version of Monk does not disambiguate homonyms. The homonym attribute is always 0.

Homonyms could be disambiguated with a number in parentheses at the end of the lemma tag. For example, consider the following two uses of the noun "temple" in Shakespeare's A Midsummer Night's Dream:

"For she his hairy temples then had rounded" (Act 4, Scene 1). The lemma's tag could be:

temple (n) (1)

"Ay, in the temple, in the town, the field," (Act 2, Scene 1). The lemma's tag could be:

temple (n) (2)

In the first example, the noun "temple" is the body part. In the second example, it is the place of worship.

In the current version of Monk, in this example, there is only one lemma with the tag:

temple (n)
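
The tag formats in this section can be sketched as a small formatting function. The LemmaTags class is hypothetical and not part of Monk. The homonym suffix follows the scheme the text says could be used, which the current version of Monk does not implement. With homonym always 0, only the first form occurs in practice.

```java
final class LemmaTags {
    // Builds "headWord (wordClass)", plus " (n)" for a nonzero homonym number.
    static String format(String headWord, String wordClassTag, int homonym) {
        String tag = headWord + " (" + wordClassTag + ")";
        if (homonym > 0) {
            // Possible disambiguated form; not produced by the current version.
            tag = tag + " (" + homonym + ")";
        }
        return tag;
    }
}
```

For example, format("love", "v", 0) gives the tag for the verb "love", and format("temple", "n", 2) gives the possible disambiguated tag for the place-of-worship sense.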

Pos

tag A short unique string identifier for the part of speech.
wordClass The part of speech's word class.
syntaxCategory Syntax category, or null if none.
tenseCategory Tense category, or null if none.
moodCategory Mood category, or null if none.
caseCategory Case category, or null if none.
personCategory Person category, or null if none.
numberCategory Number category, or null if none.
degreeCategory Degree category, or null if none.
negativeCategory Negative category, or null if none.

WordClass

tag A short unique string identifier for the word class.
description A description of the word class.
majorWordClass The major word class.

Spelling

tag A short unique string identifier for the spelling.
length The length of the spelling.

The tag for a spelling object is the spelling itself. For example, for the spelling "loveth", the tag is simply:

loveth

Spellings are always lower case.
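
A minimal sketch of a spelling object built from a word as it appears in a text. The SpellingSketch class is hypothetical. That Monk lowercases in a locale-independent way, and that length is counted in Java string characters, are assumptions made for this example.

```java
import java.util.Locale;

final class SpellingSketch {
    final String tag;    // the spelling itself, always lower case
    final int length;    // the length of the spelling

    SpellingSketch(String surfaceForm) {
        // Locale.ROOT avoids locale-specific case mappings.
        this.tag = surfaceForm.toLowerCase(Locale.ROOT);
        this.length = this.tag.length();
    }
}
```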

MajorWordClass
SyntaxCategory
TenseCategory
MoodCategory
CaseCategory
PersonCategory
NumberCategory
DegreeCategory
NegativeCategory

tag A short unique string identifier for the object.

Each of these objects has only its tag as an attribute.
