ActualScan Tech

Human-usable collocations

Posted on Jan 2, 2020

Collocations are magically simple. They are a tool in language processing that just counts co-occurring words and gives you the important phrases and relationships.

Trying to keep it plain

Collocations consist of word structures, often just simple phrases, like modern approach or stroke of luck. Let’s assume we are talking about bigrams, or two-word phrases. Most measures of association strength for bigram collocations use four frequencies: the number of two-word sequences where word1 and word2 appear together, the numbers of sequences where word1 appears but word2 does not (and vice versa), and finally the number of sequences containing neither of them.

This has the advantage of being simple and deterministic.
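To make those four counts concrete, here is a minimal, NLTK-free sketch that gathers them for a bigram; the token list and the helper name are made up just for illustration:

```python
tokens = "the cat sat on the mat and the cat slept".split()
bigrams = list(zip(tokens, tokens[1:]))

def contingency(w1, w2):
    """Return the four counts used by most bigram association measures."""
    n_ii = sum(1 for a, b in bigrams if a == w1 and b == w2)  # both words
    n_io = sum(1 for a, b in bigrams if a == w1 and b != w2)  # word1 only
    n_oi = sum(1 for a, b in bigrams if a != w1 and b == w2)  # word2 only
    n_oo = len(bigrams) - n_ii - n_io - n_oi                  # neither
    return n_ii, n_io, n_oi, n_oo

contingency("the", "cat")  # (2, 1, 0, 6)
```

Statisticians call these four numbers a contingency table, and each metric below is just a different recipe for boiling them down to one score.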

Since I do some research in the humanities (mainly history), I am especially motivated to find methods that are powerful and also easily explainable without a mathematical background. Finding such things is difficult. Nowadays you probably have the reflex to say something like ‘neural embeddings’, but these give results that are fuzzy and hard to interpret.

Just to illustrate the point, here are two runs of the word2vec algorithm I performed on the same small corpus of psychology papers.

Two 2D projections of differing word2vec semantic spaces.

Notice how the outlier words stay mostly the same, but the main cluster changes its relative positions. The generated scale also seems to be expanded more than 2x in the second trial compared to the first. There are cases where all this is fine, just as the personal language knowledge (idiolect) of each individual person is slightly different. Vector representations are a little like the personal language knowledge we now have for computers. But sometimes you want something that is more constant and fully interpretable, and that’s when something like collocations becomes very appealing.

Now, admittedly, there are almost a hundred ways of defining and computing collocation strength reported in the specialist literature. Many tackle the task from the standpoint of some paradigm of statistics, others from probability or information theory, and so on. (If you want to delve into the subject by yourself, I put some recommendations at the end of the post¹.)

You can try to escape all these complications with raw frequency: just counting how often a phrase appears in the corpus of texts. If some word combination appears frequently, it is clearly important to the authors, or at least to the convention they are writing in.
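As a toy, NLTK-free illustration (the sentence here is made up), raw frequency is just a bigram’s count divided by the total number of bigrams:

```python
from collections import Counter

tokens = "to be or not to be that is the question".split()
bigrams = list(zip(tokens, tokens[1:]))
counts = Counter(bigrams)

# Raw frequency: the share of all bigrams taken up by this one phrase.
raw_freq = counts[("to", "be")] / len(bigrams)  # 2/9, about 0.2222
```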

But the results will be dominated by words that are the most frequent overall. As anyone who ever did some NLP knows, these are the highly abstract functional words, the “stopwords”, such as the, be, have.

Here’s some Python code if you want to try it yourself on a sample from the Europarl corpus (European Parliament proceedings). Make sure you have NLTK installed and the europarl_raw corpus downloaded (for example through the nltk.download() dialog).

import nltk
from nltk.collocations import BigramCollocationFinder
from nltk.corpus import europarl_raw

measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(europarl_raw.english.words())

# We want to format the value of the metric with four digits after the decimal point.
def fmt_freq(w1, w2):
    return '{:0.4f}'.format(finder.score_ngram(measures.raw_freq, w1, w2))

# Find the 20 most frequent collocations.
[(w1, w2, fmt_freq(w1, w2)) for (w1, w2) in finder.nbest(measures.raw_freq, 20)]

The result list looks like this: [('of', 'the', '0.0107'), ('in', 'the', '0.0057'), ('.', 'The', '0.0040'), ... ('the', 'Commission', '0.0031'), ('the', 'European', '0.0030') ...]. The numbers here are each bigram’s share of all bigrams in the corpus. (Multiply them by the total bigram count, or finder.N, to get the actual frequency.) Some of these results should impress us a little: I always like to appreciate that invisible phrases, like of the, can take up 1% of language, while the “true” meanings represented by nouns or verbs form such a small part of it. But these are definitely not the insights we are looking for.

Methods of acquiring more usable collocations include taking only some parts of speech (POS), for example nouns and adjectives. Although our corpus here is not POS-tagged, we can hack our way forward by excluding short words (3 characters or less).

# Exclude words that have 3 characters or less.
finder.apply_word_filter(lambda w: len(w) <= 3)

[(w1, w2, fmt_freq(w1, w2)) for (w1, w2) in finder.nbest(measures.raw_freq, 20)]

The results are becoming more interesting, with bigrams like ('European', 'Union', '0.0017'), ('Member', 'States', '0.0013'), ('Madam', 'President', '0.0005') or ('human', 'rights', '0.0004'). We can pick up the most important elements of language used in the European Parliament and infer that the sample comes mostly from 1999-2001, when Nicole Fontaine, a woman, was the President (of course, in a world with more female Parliament presidents Madam President would be a less informative bigram).

Metrics and how to understand them

Even though you can make raw-frequency collocations work somehow, and it’s great how much they can reveal while being so simple, they have their shortcomings. We can say that human rights forms 0.04% of bigrams in the corpus, but how strongly associated are these words, really? How strong is the association of rights with being human, compared to other kinds of rights?

A simple metric called the Jaccard index divides the joint frequency by the number of bigrams containing either of the words (that is, the sum of the words’ overall frequencies, minus the joint count so it isn’t counted twice). This way we can mostly ignore how frequent, in magnitude, the words are. Instead we try to focus purely on how probable it is that when one appears, the other appears as well.
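In terms of the four contingency counts from earlier, this works out to the joint count over the count of bigrams containing at least one of the two words. A sketch with made-up counts:

```python
def jaccard(n_ii, n_io, n_oi):
    """Joint count over the number of bigrams containing word1 or word2.

    n_ii: bigrams with both words; n_io: with word1 only; n_oi: with word2 only.
    """
    return n_ii / (n_ii + n_io + n_oi)

# Words that only ever appear together score 1.0:
jaccard(5, 0, 0)     # 1.0
# Words that mostly appear apart score low:
jaccard(10, 20, 30)  # about 0.1667
```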

# Remove all capitalized words. Otherwise we will get mostly first names + last names.
finder.apply_word_filter(lambda w: w[0].upper() == w[0])
# A function similar to fmt_freq for the Jaccard index.
def fmt_jacc(w1, w2):
    return '{:0.4f}'.format(finder.score_ngram(measures.jaccard, w1, w2))
# Get the 20 strongest bigrams.
[(w1, w2, fmt_jacc(w1, w2)) for (w1, w2) in finder.nbest(measures.jaccard, 20)]

There is a whole lot of bigrams that have a Jaccard index of 1: these are the words that always appear together in the corpus. The code above gives us some examples like ('adult', 'literacy', '1.0000'), ('air-raid', 'shelters', '1.0000'), ('all-important.', 'co-president', '1.0000') and ('apprentice', 'dictator', '1.0000').

We can suspect that some of those contain rare words, and decide we are interested in strong associations of reasonably frequent words. NLTK has a built-in facility for that. Let’s remove every bigram that occurs fewer than three times.

# Skip every bigram that has fewer than three occurrences.
finder.apply_freq_filter(3)

[(w1, w2, fmt_jacc(w1, w2)) for (w1, w2) in finder.nbest(measures.jaccard, 20)]

There are still some extremely strong collocations with an index equal to 1, such as cubic metres and inter alia, but there are also milder examples, like ('one-way', 'street', '0.7500'), ('depleted', 'uranium', '0.6429'), ('genetically-modified', 'organisms', '0.4762'). I am selecting here the cases where the collocation can be interpreted as saying something about the meaning of words as they were used in the European Parliament proceedings. The MPs were talking about dealing with uranium waste and about organisms in the context of genetic modifications.

But just how strong are these associations? We can look up the raw frequencies for all these bigrams, for example with our fmt_freq helper function, but this is clunky and not satisfactory: now we’d have to manually interpret a combination of two numbers.

That is, except when we have selected some interesting word beforehand, as with the rights example we mentioned before.

finder_rights = BigramCollocationFinder.from_words(europarl_raw.english.words())
# Require that one of the words in the bigram is 'rights'.
finder_rights.apply_ngram_filter(lambda w1, w2: w1 != 'rights' and w2 != 'rights')
finder_rights.apply_word_filter(lambda w: len(w) <= 3)

# A version of fmt_jacc using the finder_rights collocation finder.
def fmt_jacc_r(w1, w2):
    return '{:0.4f}'.format(finder_rights.score_ngram(measures.jaccard, w1, w2))

[(w1, w2, fmt_jacc_r(w1, w2)) for (w1, w2) in finder_rights.nbest(measures.jaccard, 20)]

The list begins with [('human', 'rights', '0.3957'), ('fundamental', 'rights', '0.0439'), ('rights', 'violations', '0.0157') .... Notice how, at the beginning of the list, each successive bigram is so much weaker than the previous one. Somewhat farther down come phrases such as ('social', 'rights', '0.0100'), ('equal', 'rights', '0.0090'), ('civil', 'rights', '0.0087'). Here the numbers are actually very similar, so the conclusion that social rights were more important to the MPs than civil rights would be rather weak.

The log likelihood ratio

There are many measures of collocation strength that are statistical, such as chi-square. But it is not widely advertised that a related metric, the log likelihood ratio, has an intuitive interpretation. Most of these measures aim at answering whether there is a statistically significant link between the words occurring together. But that is basically a binary question. If you want to ask not whether, but how much, it is again difficult to say what the numbers mean if you are not a statistician.

The log likelihood ratio takes some math to compute (again, look at the end of this post if you want sources with equations), but we can explain its meaning like so. Let’s say we have two hypotheses: one (A) says that the words in the phrase appear together completely at random, and there is no factor linking them. The other hypothesis (B) is that there is a special tendency for these specific words to appear together.

It turns out that the log likelihood ratio says how much more probable the observed corpus data is under hypothesis B than under hypothesis A (the null hypothesis). For example, with a likelihood ratio of 10, the estimated probability under hypothesis B is 10 times higher than the one given by hypothesis A.

Note that we are not saying that hypothesis A or B is more probable in itself given the data; only which one better explains our observations. What is important is that we now have a metric where we can say what a score of 0.5 (i.e., not significant at all) means when compared to a score of ten, or of a million.
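For the curious, here is a sketch of what I believe NLTK computes under the hood, on made-up contingency counts. The score it reports is 2·Σ O·ln(O/E), which is the −2 ln λ form from Manning and Schütze’s chapter, so the plain “how many times more probable” ratio discussed above is exp(score / 2):

```python
import math

def likelihood_ratio(n_ii, n_io, n_oi, n_oo):
    """Log likelihood ratio score (-2 ln lambda) from contingency counts.

    n_ii: bigrams with both words; n_io / n_oi: with only one of them;
    n_oo: with neither. Zero observed counts contribute nothing to the sum.
    """
    n = n_ii + n_io + n_oi + n_oo
    # Expected counts if the two words occurred independently.
    e_ii = (n_ii + n_io) * (n_ii + n_oi) / n
    e_io = (n_ii + n_io) * (n_io + n_oo) / n
    e_oi = (n_oi + n_oo) * (n_ii + n_oi) / n
    e_oo = (n_oi + n_oo) * (n_io + n_oo) / n
    observed = (n_ii, n_io, n_oi, n_oo)
    expected = (e_ii, e_io, e_oi, e_oo)
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

score = likelihood_ratio(10, 20, 30, 940)  # about 30.07
ratio = math.exp(score / 2)                # data is millions of times more
                                           # probable under hypothesis B
```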

Let’s return with this new tool to our Python session.

def fmt_ll(w1, w2):
    return '{:0.4f}'.format(finder.score_ngram(measures.likelihood_ratio, w1, w2))

# Get 20 strongest collocations measured by log likelihood ratio.
[(w1, w2, fmt_ll(w1, w2)) for (w1, w2) in finder.nbest(measures.likelihood_ratio, 20)]

At first blush, the results may seem “worse”: we are back to having stopword phrases (('would', 'like', '4827.3332'), ('have', 'been', '2928.0059')) as the strongest ones. But notice that this is not the whole story. The bigram ('human', 'rights', '3039.9230') is up there with these grammatical phrases. Dig a little deeper and you can find other important constructs, such as ('common', 'position', '1061.9026'), ('internal', 'market', '927.8537') and ('precautionary', 'principle', '839.4803'). They tell us much (quantitatively!) about the interests and internal processes of the European Union, which operates by seeking consensus – common positions – and is very concerned with economic relations between its member states.

In my experience the most common phrases have an LL ratio in the thousands, other important ones in the hundreds, with a long tail of diminishing importance starting around a hundred. These are, of course, rules of thumb. It is easy but important to notice that this metric is significantly influenced by word frequencies (much more so than the Jaccard index), even though it’s not strictly designed to reflect them. So scores that wouldn’t be so impressive for frequent words can be more significant for rarer ones.

In short

  1. Collocation detection gives us a simple way to automatically extract information about the language used in texts. Collocations are especially useful when you need methods that are deterministic and interpretable.
  2. There are many association strength measures for collocations. You can use raw frequency, but it will be noisy.
  3. Jaccard index can be useful when comparing different associations of one word.
  4. Log likelihood ratio works as a handy all-round metric.
  5. Don’t forget to preprocess your data, ideally by performing lemmatization and POS annotation.

So, that’s enough basic word-association wisdom for today. Sometime in the future we may look at measures related to information theory, and at the practical topic of comparing collocations of differing lengths: for example, the bigram luxury goods with the larger trigram fake luxury goods. For now, happy collocating!

  1. The term collocation was primarily used in linguistics. You can find a very simple tutorial in the NLTK docs (then take a look at the API reference). Stefan Evert’s website is good for an overview and quick reference, while his PhD thesis and, of course, the relevant chapter in Manning and Schütze can serve as deeper introductions. I’d use the former for deeper explanations of specifics. [return]