Monk Datastore Overview
The Counter class implements counts and count searching.
A Counter object has the following attributes:
container          The count container.
feature            The count feature.
countCum           The cumulative count of the number of occurrences of the feature in the container.
countNonCum        The non-cumulative count of the number of occurrences of the feature in the container.
freqCum            The cumulative frequency of the feature in the container.
freqNonCum         The non-cumulative frequency of the feature in the container.
countCumMain       The cumulative count of the number of occurrences of the feature in main text (non-paratext) in the container.
countNonCumMain    The non-cumulative count of the number of occurrences of the feature in main text (non-paratext) in the container.
freqCumMain        The cumulative frequency of the feature in main text (non-paratext) in the container.
freqNonCumMain     The non-cumulative frequency of the feature in main text (non-paratext) in the container.
A container is an object over which features are counted. There are four container classes:
    Corpus, Work, WorkPart, Author
A feature is an object being counted within a container. There are 18 feature classes:
    Lemma, Pos, Spelling, WordClass, MajorWordClass, LemPos, WordForm, HeadWord, LemmaBigram,
    PosTrigram, SyntaxCategory, TenseCategory, MoodCategory, CaseCategory, PersonCategory,
    NumberCategory, DegreeCategory, NegativeCategory
There are 72 different kinds of counters, for the 4 containers times the 18 features.
LemPos is a feature which combines lemmas and parts of speech.
WordForm is a feature which combines lemmas, parts of speech, and spellings.
HeadWord is a feature which is just the headword part of lemmas.
The LemmaBigram and PosTrigram features represent runs of two lemmas
and three parts of speech respectively. The precomputed counts for these features do not include
runs which cross sentence boundaries.
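To illustrate what it means for bigram counts to stop at sentence boundaries, here is a minimal sketch. It is not the Monk implementation: the class BigramSketch, its methods, and the representation of a text as a list of sentences (each a list of lemma strings) are assumptions made for this example only.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of lemma bigram counting; not part of the Monk API. */
final class BigramSketch {

    /**
     * Counts runs of two adjacent lemmas. Each sentence is processed on its
     * own, so no bigram pairs the last lemma of one sentence with the first
     * lemma of the next.
     */
    static Map<String, Long> countBigrams(List<List<String>> sentences) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i + 1 < sentence.size(); i++) {
                String bigram = sentence.get(i) + " " + sentence.get(i + 1);
                counts.merge(bigram, 1L, Long::sum);
            }
        }
        return counts;
    }

    /** Two short example sentences, already reduced to lemmas. */
    static List<List<String>> example() {
        return Arrays.asList(
            Arrays.asList("the", "king", "be"),
            Arrays.asList("the", "king"));
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> e : countBigrams(example()).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

In the example data, the run "be the" spans the boundary between the two sentences and is therefore never counted.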
For the Work and WorkPart containers, the cumulative
count and frequency include both the container object itself and all of its descendant
work parts. The non-cumulative count and frequency include only the container itself.
For the Corpus and Author containers, the non-cumulative
count and frequency are always zero.
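The difference between cumulative and non-cumulative counts can be sketched with a small tree of parts. The Part class below is a hypothetical stand-in invented for this illustration, not the Monk WorkPart class, and the counts in it are made-up numbers.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of cumulative vs. non-cumulative counts; not the Monk API. */
final class CumulativeCountSketch {

    /** Stand-in for a work or work part with descendant work parts. */
    static final class Part {
        final String tag;
        final long ownCount;                       // occurrences directly in this part
        final List<Part> children = new ArrayList<>();

        Part(String tag, long ownCount) {
            this.tag = tag;
            this.ownCount = ownCount;
        }

        Part add(Part child) {
            children.add(child);
            return this;
        }

        /** Non-cumulative count: the container itself only. */
        long countNonCum() {
            return ownCount;
        }

        /** Cumulative count: the container itself plus all of its descendants. */
        long countCum() {
            long total = ownCount;
            for (Part child : children) {
                total += child.countCum();
            }
            return total;
        }
    }

    /** A play with two acts; the first act contains one scene. */
    static Part example() {
        Part scene = new Part("act1.scene1", 4);
        Part act1 = new Part("act1", 1).add(scene);
        Part act2 = new Part("act2", 3);
        return new Part("play", 0).add(act1).add(act2);
    }

    public static void main(String[] args) {
        Part play = example();
        System.out.println(play.tag + " nonCum=" + play.countNonCum()
            + " cum=" + play.countCum());
    }
}
```

Here the play itself has no text of its own, so its non-cumulative count is 0, while its cumulative count is the sum over the acts and the scene.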
Counts and frequencies are available over all the text in the container, and over just the main text (non-paratext) in the container. For the definitions of main text and paratext see Words and Word Searching.
Frequencies are measured relative to the total number of words, word parts, word part bigrams, or word part trigrams in the container, in units of parts per 10,000. The formula used to compute a frequency is:
freq = count * 10,000 / divisor
For spellings, the divisor is the total number of words in the container. For lemma bigrams and pos trigrams, the divisor is the total number of word part 2-grams or 3-grams in the container respectively. For all the other features, the divisor is the total number of word parts in the container.
Counts have type long. Frequencies have type double.
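As a worked instance of the formula, the sketch below computes a frequency from a long count and a long divisor and returns a double. The class and method names are made up for this example, and returning zero for an empty container is an assumption, since the text does not say what happens when the divisor is zero.

```java
/** Hypothetical sketch of the frequency formula; not part of the Monk API. */
final class FrequencySketch {

    /** Frequencies are expressed in parts per 10,000. */
    static final double PER = 10_000.0;

    /**
     * Computes freq = count * 10,000 / divisor.
     * Assumption of this sketch: an empty container (divisor 0) yields 0.
     */
    static double freq(long count, long divisor) {
        if (divisor == 0) {
            return 0.0;
        }
        return count * PER / divisor;
    }

    public static void main(String[] args) {
        // A lemma that occurs 25 times in a container of 50,000 word parts
        // has a frequency of 25 * 10,000 / 50,000 = 5 per 10,000.
        System.out.println(String.format("%10.2f", freq(25, 50_000)));
    }
}
```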
The Counter<T,S> class is parameterized by the container type T
and the feature type S. For example, a count of a lemma in a corpus has the
type:
Counter<Corpus,Lemma>
A collection of such counts has the type:
Collection<Counter<Corpus,Lemma>>
The static methods Counter.find find counts.
    /** Prints corpus spelling counts and frequencies for all the corpora.
     *
     *  @throws ModelException
     */

    void printCorpusSpellingCountsAndFrequencies ()
        throws ModelException
    {
        Collection<Counter<Corpus,Spelling>> result =
            Counter.find(Corpus.class, Spelling.class, new SearchCriterion[0]);
        for (Counter<Corpus,Spelling> counter : result) {
            Corpus corpus = counter.getContainer();
            Spelling spelling = counter.getFeature();
            long count = counter.getCount(CumKind.CUM);
            double freq = counter.getFreq(CumKind.CUM);
            System.out.println(
                corpus.getTag() + " " +
                spelling.getTag() + " " +
                count + " " +
                String.format("%10.2f", freq)
            );
        }
    }

Note that to specify an empty collection of search criteria we use:
    new SearchCriterion[0]
    /** Prints the top 10 nouns used by an author in main text (non-paratext).
     *
     *  @param author   Author.
     *
     *  @throws ModelException
     */

    void topTenNouns (Author author)
        throws ModelException
    {
        Collection<Counter<Author,Lemma>> counters =
            Counter.find(Author.class, Lemma.class,
                new AuthorCriterion(author),
                new MajorWordClassCriterion("noun"));
        Counter<Author,Lemma>[] sortedCounters =
            Counter.sort(counters, Counter.SortOption.COUNT_CUM_MAIN_DESCENDING);
        int k = 0;
        for (Counter<Author,Lemma> counter : sortedCounters) {
            Lemma lemma = counter.getFeature();
            long count = counter.getCountMain(CumKind.CUM);
            System.out.println(lemma.getTag() + " " + count);
            k++;
            if (k == 10) break;
        }
    }

In this example, we use the static utility method Counter.sort to sort the results of the search in descending order by main text cumulative count. This sorts the most frequently used nouns by the author in main text to the beginning of the list, ready to be printed.
Note the extensive use of parameterized types in this example. It's verbose, but worth the trouble.
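The sort-then-take-ten pattern used above can be illustrated without the datastore. The following is a minimal sketch using only the standard library. The LemmaCount class and the sample counts are hypothetical, and this is not how Counter.sort is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of sorting counts in descending order; not the Monk API. */
final class TopCountsSketch {

    /** Stand-in for a counter: a lemma tag and its main text cumulative count. */
    static final class LemmaCount {
        final String lemma;
        final long countCumMain;

        LemmaCount(String lemma, long countCumMain) {
            this.lemma = lemma;
            this.countCumMain = countCumMain;
        }
    }

    /** Returns the n entries with the highest counts, highest first. */
    static List<LemmaCount> top(List<LemmaCount> counts, int n) {
        List<LemmaCount> sorted = new ArrayList<>(counts);   // leave the input untouched
        sorted.sort(
            Comparator.comparingLong((LemmaCount c) -> c.countCumMain).reversed());
        return sorted.subList(0, Math.min(n, sorted.size()));
    }

    /** Made-up sample counts. */
    static List<LemmaCount> example() {
        return Arrays.asList(
            new LemmaCount("time", 12),
            new LemmaCount("man", 30),
            new LemmaCount("day", 7));
    }

    public static void main(String[] args) {
        for (LemmaCount c : top(example(), 2)) {
            System.out.println(c.lemma + " " + c.countCumMain);
        }
    }
}
```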
Also note that whenever you get a count or a frequency, you must specify whether you want a cumulative or non-cumulative count or frequency, using one of the constants in the CumKind enum type. This may seem bothersome, but forcing you to always make a deliberate decision about which one you want is valuable in practice, especially with work parts, when sometimes you want one kind of count or frequency and sometimes you want the other kind.
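The design idea of requiring an explicit cumulative or non-cumulative choice can be sketched as follows. The enum below only mirrors the two constants named in the text (CUM and NON_CUM). The surrounding PartCount class is a hypothetical stand-in, not the real Counter class.

```java
/** Hypothetical sketch of an accessor that requires an explicit kind; not the Monk API. */
final class CumKindSketch {

    /** Stand-in for the CumKind enum type with its two constants. */
    enum CumKind { CUM, NON_CUM }

    /** Stand-in for a counter that stores both kinds of count. */
    static final class PartCount {
        private final long countCum;
        private final long countNonCum;

        PartCount(long countCum, long countNonCum) {
            this.countCum = countCum;
            this.countNonCum = countNonCum;
        }

        /** There is no no-argument getter, so callers must always pick a kind. */
        long getCount(CumKind kind) {
            return kind == CumKind.CUM ? countCum : countNonCum;
        }
    }

    public static void main(String[] args) {
        // Made-up numbers: 8 occurrences including descendants, 1 in the part itself.
        PartCount act = new PartCount(8, 1);
        System.out.println(act.getCount(CumKind.CUM) + " " + act.getCount(CumKind.NON_CUM));
    }
}
```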
    /** Extracts a large sparse matrix of work part lemma frequencies for data mining
     *  or other analysis.
     *
     *  @param corpus   Corpus.
     *
     *  @throws ModelException
     */

    void miner (Corpus corpus)
        throws ModelException
    {
        for (Work work : corpus.getWorks()) {
            Collection<Counter<WorkPart,Lemma>> counters =
                Counter.find(WorkPart.class, Lemma.class,
                    new WorkCriterion(work)
                );
            for (Counter<WorkPart,Lemma> counter : counters) {
                WorkPart workpart = counter.getContainer();
                Lemma lemma = counter.getFeature();
                double freq = counter.getFreqMain(CumKind.NON_CUM);
                // Populate row "workPart" and column "lemma" of your
                // sparse matrix with the number "freq"
            }
        }
        // Use analysis software to do your thing.
    }

Note that we search for counts one work at a time, to avoid running out of memory. As it is, this example requires a large Java heap setting! The example could also be done using an additional inner loop to process just a single work part at a time.
To prevent runaway requests for extremely large numbers of objects from swamping the server and depleting its memory, all searches are limited to a maximum of five million results. A ModelException is thrown if a search generates more results than that.
Note that we use frequencies over main text only (non-paratext). Ignoring paratext is usually what you want for data mining or other kinds of text analysis.
Finally, note the use of the constant CumKind.NON_CUM. We definitely want non-cumulative counts in this context!
We provide the four pre-defined "natural" containers for corpora, works, work parts, and authors. What about other user-defined containers over which we might wish to compute counts and frequencies, such as sets of works, work parts, or authors? In these kinds of problems we need to aggregate counts and frequencies over the individual containers.
    /** Prints work set lemma counts and frequencies.
     *
     *  @param works    Collection of works.
     *
     *  @throws ModelException
     */

    void printWorkSetLemmaCountsAndFrequencies (Collection<Work> works)
        throws ModelException
    {
        SearchCriterion c = new WorkCriterion(works);
        Map<Lemma,AggregateCounter> map =
            Counter.findAndAggregateCounts(Work.class, Lemma.class, works, c);
        for (Lemma lemma : map.keySet()) {
            AggregateCounter counter = map.get(lemma);
            long count = counter.getCount(CumKind.CUM);
            double freq = counter.getFreq(CumKind.CUM);
            System.out.println(lemma.getTag() + " " + count + " " +
                String.format("%10.2f", freq));
        }
    }

This example uses the static method Counter.findAndAggregateCounts. This method first calls Counter.find to find the work lemma counts, then aggregates the counts over all the works in the work set. The result returned by this method is a map from lemmas to aggregated counter objects.
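The aggregation step can be pictured as summing per-work counts for each lemma. The sketch below is a hedged illustration and not the Monk implementation: the WorkLemmaCount class and the sample data are invented. The rule for the aggregate frequency (summed count over the summed word parts of all works in the set, per 10,000) is an assumption, since the text does not spell out how aggregated frequencies are derived.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of aggregating counts over a set of works; not the Monk API. */
final class AggregateSketch {

    /** Stand-in for one work lemma count. */
    static final class WorkLemmaCount {
        final String work;
        final String lemma;
        final long count;

        WorkLemmaCount(String work, String lemma, long count) {
            this.work = work;
            this.lemma = lemma;
            this.count = count;
        }
    }

    /** Sums the counts for each lemma over all the works in the set. */
    static Map<String, Long> aggregateCounts(List<WorkLemmaCount> counts) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (WorkLemmaCount c : counts) {
            totals.merge(c.lemma, c.count, Long::sum);
        }
        return totals;
    }

    /** ASSUMPTION: aggregate frequency = summed count * 10,000 / summed word parts. */
    static double aggregateFreq(long totalCount, long totalWordParts) {
        return totalWordParts == 0 ? 0.0 : totalCount * 10_000.0 / totalWordParts;
    }

    /** Made-up counts for two works, A and B. */
    static List<WorkLemmaCount> example() {
        return Arrays.asList(
            new WorkLemmaCount("A", "love", 30),
            new WorkLemmaCount("B", "love", 20),
            new WorkLemmaCount("B", "sea", 10));
    }

    public static void main(String[] args) {
        long totalWordParts = 40_000 + 60_000;   // made-up sizes of works A and B
        for (Map.Entry<String, Long> e : aggregateCounts(example()).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue() + " "
                + String.format("%10.2f", aggregateFreq(e.getValue(), totalWordParts)));
        }
    }
}
```

Note that in this sketch the divisor covers every work in the set, including works in which a given lemma never occurs.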
Aggregation proves useful once again in the following example, which compares word use before and after a given year.
    /** Compares word use over time.
     *
     *  <p>Prints a list of all the verbs in a corpus which are used at least
     *  twice as frequently before a given date as after the date, and are used
     *  at least 10 times both before and after the date.
     *
     *  @param corpus   Corpus.
     *
     *  @param year     Year.
     *
     *  @throws ModelException
     */

    void compareTime (Corpus corpus, int year)
        throws ModelException
    {
        SearchCriterion beforeCriterion = new CirculationYearCriterion(-1, year);
        SearchCriterion afterCriterion = new CirculationYearCriterion(year+1, -1);
        SearchCriterion corpusCriterion = new CorpusCriterion(corpus);
        SearchCriterion verbCriterion = new MajorWordClassCriterion("verb");
        Collection<Work> beforeWorks = Work.find(corpusCriterion, beforeCriterion);
        Collection<Work> afterWorks = Work.find(corpusCriterion, afterCriterion);
        Map<Lemma,AggregateCounter> beforeMap =
            Counter.findAndAggregateCounts(Work.class, Lemma.class, beforeWorks,
                corpusCriterion, beforeCriterion, verbCriterion);
        Map<Lemma,AggregateCounter> afterMap =
            Counter.findAndAggregateCounts(Work.class, Lemma.class, afterWorks,
                corpusCriterion, afterCriterion, verbCriterion);
        for (Lemma lemma : beforeMap.keySet()) {
            AggregateCounter beforeCounter = beforeMap.get(lemma);
            AggregateCounter afterCounter = afterMap.get(lemma);
            if (beforeCounter.getCount(CumKind.CUM) >= 10 &&
                afterCounter != null &&
                afterCounter.getCount(CumKind.CUM) >= 10 &&
                beforeCounter.getFreq(CumKind.CUM) >=
                    2.0 * afterCounter.getFreq(CumKind.CUM))
                System.out.println(lemma.getTag());
        }
    }