Monk Datastore Overview


Counts and Count Searching

The Counter class implements counts and count searching.

A Counter object has the following attributes:

container        The count container.
feature          The count feature.
countCum         The cumulative count of occurrences of the feature in the container.
countNonCum      The non-cumulative count of occurrences of the feature in the container.
freqCum          The cumulative frequency of the feature in the container.
freqNonCum       The non-cumulative frequency of the feature in the container.
countCumMain     The cumulative count of occurrences of the feature in main text (non-paratext) in the container.
countNonCumMain  The non-cumulative count of occurrences of the feature in main text (non-paratext) in the container.
freqCumMain      The cumulative frequency of the feature in main text (non-paratext) in the container.
freqNonCumMain   The non-cumulative frequency of the feature in main text (non-paratext) in the container.

A container is an object over which features are counted. There are four container classes:

  1. Corpus
  2. Work
  3. WorkPart
  4. Author

A feature is an object being counted within a container. There are 18 feature classes:

  1. Lemma
  2. Pos
  3. Spelling
  4. WordClass
  5. MajorWordClass
  6. LemPos
  7. WordForm
  8. HeadWord
  9. LemmaBigram
  10. PosTrigram
  11. SyntaxCategory
  12. TenseCategory
  13. MoodCategory
  14. CaseCategory
  15. PersonCategory
  16. NumberCategory
  17. DegreeCategory
  18. NegativeCategory

There are 72 different kinds of counters, for the 4 containers times the 18 features.

LemPos is a feature which combines lemmas and parts of speech. WordForm is a feature which combines lemmas, parts of speech, and spellings. HeadWord is a feature which is just the headword part of lemmas.

The LemmaBigram and PosTrigram features represent runs of two lemmas and three parts of speech respectively. The precomputed counts for these features do not include runs which cross sentence boundaries.
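
The sentence-boundary rule can be illustrated with a small standalone sketch. It does not use the Monk API: the class BigramSketch and its method countBigrams are hypothetical names, and lemmas are represented as plain strings. Each sentence is scanned on its own, so the last lemma of one sentence is never paired with the first lemma of the next.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: hypothetical helper, not part of the Monk API. */
final class BigramSketch {

    /**
     * Counts runs of two adjacent lemmas within each sentence.
     * No run spans two sentences, because each sentence is scanned separately.
     */
    static Map<String, Long> countBigrams(List<List<String>> sentences) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (List<String> sentence : sentences) {
            for (int i = 0; i + 1 < sentence.size(); i++) {
                String bigram = sentence.get(i) + " " + sentence.get(i + 1);
                counts.merge(bigram, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> sentences = Arrays.asList(
            Arrays.asList("the", "cat", "sleep"),
            Arrays.asList("the", "cat"));
        // "sleep the" is never counted: it would cross a sentence boundary.
        System.out.println(countBigrams(sentences));
    }
}
```

The same loop with a window of three instead of two would give the trigram case.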

For the Work and WorkPart containers, the cumulative count and frequency include both the container object itself and all of its descendant work parts. The non-cumulative count and frequency include only the container itself.

For the Corpus and Author containers, the non-cumulative count and frequency are always zero.
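
To make the cumulative and non-cumulative distinction concrete, here is a minimal sketch over a tree of parts. PartNode is a hypothetical class written for this illustration, not the Monk WorkPart class, and the counts are made up. The non-cumulative count covers only the node itself; the cumulative count adds the counts of all descendants.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a hypothetical tree node, not the Monk WorkPart class. */
final class PartNode {
    final String tag;
    final long ownCount;   // occurrences directly in this part
    final List<PartNode> children = new ArrayList<>();

    PartNode(String tag, long ownCount) {
        this.tag = tag;
        this.ownCount = ownCount;
    }

    /** Adds a child part and returns this node, to allow chaining. */
    PartNode add(PartNode child) {
        children.add(child);
        return this;
    }

    /** Non-cumulative count: this part only. */
    long countNonCum() {
        return ownCount;
    }

    /** Cumulative count: this part plus all of its descendant parts. */
    long countCum() {
        long total = ownCount;
        for (PartNode child : children) {
            total += child.countCum();
        }
        return total;
    }

    public static void main(String[] args) {
        PartNode play = new PartNode("play", 0)
            .add(new PartNode("act1", 2)
                .add(new PartNode("scene1", 5))
                .add(new PartNode("scene2", 3)))
            .add(new PartNode("act2", 4));
        System.out.println(play.countNonCum() + " " + play.countCum());
    }
}
```

In this sketch the root holds no text of its own, so its non-cumulative count is zero while its cumulative count is the sum over the whole tree.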

Counts and frequencies are available over all the text in the container, and over just the main text (non-paratext) in the container. For the definitions of main text and paratext see Words and Word Searching.

Frequencies are measured relative to the total number of words, word parts, word part bigrams, or word part trigrams in the container, in units of parts per 10,000. The formula used to compute a frequency is:


freq = count * 10,000 / divisor

For spellings, the divisor is the total number of words in the container. For lemma bigrams and pos trigrams, the divisor is the total number of word part 2-grams or 3-grams in the container respectively. For all the other features, the divisor is the total number of word parts in the container.
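
As a worked instance of the formula, the following standalone sketch computes a frequency in parts per 10,000. FrequencySketch and its freq method are hypothetical names, not part of the Monk API, and returning 0.0 for an empty container is an assumption, since the text above does not say how a zero divisor is handled.

```java
/** Illustrative only: hypothetical helper, not part of the Monk API. */
final class FrequencySketch {

    /** Frequencies are expressed in parts per 10,000. */
    static final double SCALE = 10_000.0;

    /**
     * Computes freq = count * 10,000 / divisor.
     * Assumption: an empty container (divisor of zero or less) yields 0.0.
     */
    static double freq(long count, long divisor) {
        if (divisor <= 0) {
            return 0.0;
        }
        return count * SCALE / divisor;
    }

    public static void main(String[] args) {
        // A lemma occurring 25 times in a container of 50,000 word parts.
        System.out.println(String.format("%10.2f", freq(25, 50_000)));
    }
}
```

The caller chooses the divisor according to the feature: total words for spellings, total 2-grams or 3-grams for lemma bigrams and pos trigrams, and total word parts otherwise.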

Counts have type long. Frequencies have type double.

The Counter<T,S> class is parameterized by the container type T and the feature type S. For example, a count of a lemma in a corpus has the type:

Counter<Corpus,Lemma>

A collection of such counts has the type:

Collection<Counter<Corpus,Lemma>>

The static Counter.find methods search for counts.

Example 1. Print corpus spelling counts and frequencies for all the corpora.


/** Prints corpus spelling counts and frequencies for all the corpora.
 *
 *  @throws ModelException
 */

void printCorpusSpellingCountsAndFrequencies ()
    throws ModelException
{
    Collection<Counter<Corpus,Spelling>> result =
        Counter.find(Corpus.class, Spelling.class, new SearchCriterion[0]);
    for (Counter<Corpus,Spelling> counter : result) {
        Corpus corpus = counter.getContainer();
        Spelling spelling = counter.getFeature();
        long count = counter.getCount(CumKind.CUM);
        double freq = counter.getFreq(CumKind.CUM);
        System.out.println(
            corpus.getTag() + " " + 
            spelling.getTag() + " " +
            count + " " +
            String.format("%10.2f", freq)
        );
    }
}

Note that to specify an empty collection of search criteria we use:


    new SearchCriterion[0]

Example 2. Print the top 10 nouns used by an author.


/** Prints the top 10 nouns used by an author in main text (non-paratext).
 *
 *  @param  author      Author.
 *
 *  @throws ModelException
 */
 
void topTenNouns (Author author) 
    throws ModelException
{
    Collection<Counter<Author,Lemma>> counters =
        Counter.find(Author.class, Lemma.class, 
            new AuthorCriterion(author),
            new MajorWordClassCriterion("noun"));
    Counter<Author,Lemma>[] sortedCounters =
        Counter.sort(counters, Counter.SortOption.COUNT_CUM_MAIN_DESCENDING);
    int k = 0;
    for (Counter<Author,Lemma> counter : sortedCounters) {
        Lemma lemma = counter.getFeature();
        long count = counter.getCountMain(CumKind.CUM);
        System.out.println(lemma.getTag() + " " + count);
        k++;
        if (k == 10) break;
    }
}

In this example, we use the static utility method Counter.sort to sort the results of the search in descending order by main text cumulative count. This sorts the most frequently used nouns by the author in main text to the beginning of the list, ready to be printed.

Note the extensive use of parameterized types in this example. It's verbose, but worth the trouble.

Also note that whenever you get a count or a frequency, you must specify whether you want a cumulative or non-cumulative count or frequency, using one of the constants in the CumKind enum type. This may seem bothersome, but forcing you to always make a deliberate decision about which one you want is valuable in practice, especially with work parts, when sometimes you want one kind of count or frequency and sometimes you want the other kind.

Example 3. Extract a large sparse matrix of work part lemma frequencies for data mining or other analysis.


/** Extracts a large sparse matrix of work part lemma frequencies for data mining
 *  or other analysis.
 *
 *  @param  corpus      Corpus.
 *
 *  @throws ModelException
 */
 
void miner (Corpus corpus) 
    throws ModelException
{
    for (Work work : corpus.getWorks()) {
        Collection<Counter<WorkPart,Lemma>> counters =
            Counter.find(WorkPart.class, Lemma.class,
                new WorkCriterion(work)
            );
        for (Counter<WorkPart,Lemma> counter : counters) {
            WorkPart workPart = counter.getContainer();
            Lemma lemma = counter.getFeature();
            double freq = counter.getFreqMain(CumKind.NON_CUM);
            // Populate row "workPart" and column "lemma" of your
            // sparse matrix with the number "freq"
        }
    }
    // Use analysis software to do your thing.
}

Note that we search for counts one work at a time, to avoid running out of memory. As it is, this example requires a large Java heap setting! The example could also be done using an additional inner loop to process just a single work part at a time.

To avoid having runaway requests for extremely large numbers of objects swamp the server and deplete its memory resources, all searches are limited to a maximum of five million results. A ModelException is thrown if more results are generated when trying to execute a search.

Note that we use frequencies over main text only (non-paratext). Ignoring paratext is usually what you want for data mining or other kinds of text analysis.

Finally, note the use of the constant CumKind.NON_CUM. We definitely want non-cumulative counts in this context!

Example 4. Print work set lemma counts and frequencies.

We provide the four pre-defined "natural" containers for corpora, works, work parts, and authors. What about other user-defined containers over which we might wish to compute counts and frequencies, such as sets of works, work parts, or authors? In these kinds of problems we need to aggregate counts and frequencies over the individual containers.


/** Prints work set lemma counts and frequencies.
 *
 *  @param  works       Collection of works.
 *
 *  @throws ModelException
 */
 
void printWorkSetLemmaCountsAndFrequencies (Collection<Work> works)
    throws ModelException
{
    SearchCriterion c = new WorkCriterion(works);
    Map<Lemma,AggregateCounter> map = 
        Counter.findAndAggregateCounts(Work.class, Lemma.class, works, c);
    for (Map.Entry<Lemma,AggregateCounter> entry : map.entrySet()) {
        Lemma lemma = entry.getKey();
        AggregateCounter counter = entry.getValue();
        long count = counter.getCount(CumKind.CUM);
        double freq = counter.getFreq(CumKind.CUM);
        System.out.println(lemma.getTag() + " " + count + " " +
            String.format("%10.2f", freq));
    }
}

This example uses the static method Counter.findAndAggregateCounts. This method first calls Counter.find to find the work lemma counts, then aggregates the counts over all the works in the work set. The result returned by this method is a map from lemmas to aggregated counter objects.
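
The idea of aggregation can be sketched without the Monk API. The class AggregateSketch and its nested Tally type below are hypothetical, and the rule used for the aggregated frequency is an assumption: the text above does not say how Counter.findAndAggregateCounts combines frequencies. The sketch pools the counts and the divisors across the set and applies the frequency formula to the totals, which generally differs from averaging the per-work frequencies.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: hypothetical helper, not part of the Monk API. */
final class AggregateSketch {

    /** Hypothetical per-work tally: count of one lemma and the work's total word parts. */
    static final class Tally {
        final long count;
        final long totalWordParts;

        Tally(long count, long totalWordParts) {
            this.count = count;
            this.totalWordParts = totalWordParts;
        }
    }

    /** Sums the lemma's count over all works in the set. */
    static long aggregateCount(List<Tally> tallies) {
        long total = 0;
        for (Tally t : tallies) {
            total += t.count;
        }
        return total;
    }

    /**
     * Assumption: the aggregated frequency is recomputed from pooled totals
     * (summed counts over summed divisors), in parts per 10,000.
     */
    static double aggregateFreq(List<Tally> tallies) {
        long divisor = 0;
        for (Tally t : tallies) {
            divisor += t.totalWordParts;
        }
        if (divisor == 0) {
            return 0.0;
        }
        return aggregateCount(tallies) * 10_000.0 / divisor;
    }

    public static void main(String[] args) {
        List<Tally> workSet = Arrays.asList(
            new Tally(10, 10_000),    // short work
            new Tally(10, 90_000));   // long work
        System.out.println(aggregateCount(workSet) + " " + aggregateFreq(workSet));
    }
}
```

Under this pooling assumption a long work weighs more than a short one, which is why the per-work frequencies are not simply averaged.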

Example 5. Compare word use over time.

Aggregation proves useful once again in this example.


/** Compares word use over time.
 *
 *  <p>Prints a list of all the verbs in a corpus which are used at least
 *  twice as frequently before a given date as after the date, and are used
 *  at least 10 times both before and after the date.
 *
 *  @param  corpus      Corpus.
 *
 *  @param  year        Year.
 *
 *  @throws ModelException
 */

void compareTime (Corpus corpus, int year) 
    throws ModelException
{
    SearchCriterion beforeCriterion = new CirculationYearCriterion(-1, year);
    SearchCriterion afterCriterion = new CirculationYearCriterion(year+1, -1);
    SearchCriterion corpusCriterion = new CorpusCriterion(corpus);
    SearchCriterion verbCriterion = new MajorWordClassCriterion("verb");
    Collection<Work> beforeWorks = Work.find(corpusCriterion, beforeCriterion);
    Collection<Work> afterWorks = Work.find(corpusCriterion, afterCriterion);
    Map<Lemma,AggregateCounter> beforeMap = 
        Counter.findAndAggregateCounts(Work.class, Lemma.class, beforeWorks,
            corpusCriterion, beforeCriterion, verbCriterion);
    Map<Lemma,AggregateCounter> afterMap = 
        Counter.findAndAggregateCounts(Work.class, Lemma.class, afterWorks,
            corpusCriterion, afterCriterion, verbCriterion);
    for (Lemma lemma : beforeMap.keySet()) {
        AggregateCounter beforeCounter = beforeMap.get(lemma);
        AggregateCounter afterCounter = afterMap.get(lemma);
        if (beforeCounter.getCount(CumKind.CUM) >= 10 &&
            afterCounter != null &&
            afterCounter.getCount(CumKind.CUM) >= 10 &&
            beforeCounter.getFreq(CumKind.CUM) >=
                2.0 * afterCounter.getFreq(CumKind.CUM)) {
            System.out.println(lemma.getTag());
        }
    }
}
