Also refer to http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize
Frequently the texts we have are not those we want to analyze. We may have a single file containing the collected works of an author even though we are only interested in a single work. Or we may be given a large work broken up into volumes (this is the case for Les Misérables, as we will see later) where the division into volumes is not important to us.
If we are interested in an author’s style, we likely want to break up a long text (such as a book-length work) into smaller chunks so we can get a sense of the variability in an author’s writing. If we are comparing one group of writers to a second group, we may wish to aggregate information about writers belonging to the same group. This will require merging documents or other information that were initially separate. This section illustrates these two common preprocessing steps: splitting long texts into smaller “chunks” and aggregating texts together.
Another important preprocessing step is tokenization. This is the process of splitting a text into individual words or sequences of words (n-grams). Decisions regarding tokenization will depend on the language(s) being studied and the research question. For example, should the phrase
"her father's arm-chair" be tokenized as ["her", "father", "s", "arm", "chair"] or as ["her", "father's", "arm-chair"]? Tokenization patterns that work for one language may not be appropriate for another (what is the appropriate tokenization of “Qu’est-ce que c’est?”?). This section begins with a brief discussion of tokenization before covering splitting and merging texts.
Each tutorial is self-contained and should be read through in order. Variables and functions introduced in one subsection will be referenced and used in subsequent subsections. For example, the NumPy library numpy is imported once and then used later without being imported a second time.
There are many ways to tokenize a text. Often ambiguity is inescapable. Consider the following lines of Charlotte Brontë’s Villette:
Does the appropriate tokenization include “armchair” or “arm-chair”? While it would be strange to see “arm-chair” in print today, the hyphenated version predominates in Villette and other texts from the same period. “gentleman”, however, seems preferable to “gentle-man,” although the latter occurs in early nineteenth century English-language books. This is a case where a simple tokenization rule (resolve end-of-line hyphens) will not cover all cases. For very large corpora containing a diversity of authors, idiosyncrasies resulting from tokenization tend not to be particularly consequential (“arm-chair” is not a high frequency word). For smaller corpora, however, decisions regarding tokenization can make a profound difference.
Languages that do not mark word boundaries present an additional challenge. Chinese and Classical Greek provide two important examples. Consider the following sequence of Chinese characters: 爱国人. This sequence could be broken up into the following tokens: [“爱”, “国人”] (to love one’s compatriots) or [“爱国”, “人”] (a country-loving person). Resolving this kind of ambiguity (when it can be resolved) is an active topic of research. For Chinese and for other languages with this feature there are a number of tokenization strategies in circulation.
Here are a number of examples of tokenizing functions:
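The sketches below illustrate the ambiguity using only the standard library’s re module (the function names are our own, chosen for illustration); each one resolves the example phrase differently:

```python
import re

def tokenize_on_whitespace(text):
    """Split on runs of whitespace; punctuation stays attached to words."""
    return text.split()

def tokenize_words(text):
    """Keep runs of word characters, splitting at apostrophes and hyphens."""
    return re.findall(r"\w+", text)

def tokenize_words_keep_internal(text):
    """Keep apostrophes and hyphens that occur inside a word."""
    return re.findall(r"\w+(?:['\u2019-]\w+)*", text)

phrase = "her father's arm-chair"
print(tokenize_on_whitespace(phrase))       # ['her', "father's", 'arm-chair']
print(tokenize_words(phrase))               # ['her', 'father', 's', 'arm', 'chair']
print(tokenize_words_keep_internal(phrase)) # ['her', "father's", 'arm-chair']
```

NLTK’s nltk.tokenize module (linked above) offers ready-made tokenizers covering these and other strategies.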
Often we want to count inflected forms of a word together. This procedure is referred to as stemming. Stemming a German text treats the following words as instances of the word “Wald”: “Wald”, “Walde”, “Wälder”, “Wäldern”, “Waldes”, and “Walds”. Analogously, in English the following words would be counted as “forest”: “forest”, “forests”, “forested”, “forest’s”, “forests’”. As stemming reduces the number of unique vocabulary items that need to be tracked, it speeds up a variety of computational operations. For some kinds of analyses, such as authorship attribution or fine-grained stylistic analyses, stemming may obscure differences among writers. For example, one author may be distinguished by the use of a plural form of a word.
NLTK offers stemming for a variety of languages in the nltk.stem package. The following code illustrates the use of the popular Snowball stemmer:
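A minimal sketch of that usage, assuming NLTK is installed (the Snowball stemmers ship with NLTK and require no additional data downloads):

```python
from nltk.stem.snowball import SnowballStemmer

stemmer_de = SnowballStemmer("german")
stemmer_en = SnowballStemmer("english")

# The German forms of "Wald" from the text all reduce to a common stem.
print([stemmer_de.stem(w) for w in
       ["Wald", "Walde", "Wälder", "Wäldern", "Waldes", "Walds"]])

# Likewise for the English forms of "forest".
print([stemmer_en.stem(w) for w in ["forest", "forests", "forested"]])
```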
Splitting a long text into smaller samples is a common task in text analysis. As most kinds of quantitative text analysis take as input an unordered list of words, breaking a text up into smaller chunks allows one to preserve context that would otherwise be discarded; observing two words together in a paragraph-sized chunk of text tells us much more about the relationship between those two words than observing two words occurring together in a 100,000-word book. Or, as we will be using a selection of tragedies as our examples, we might consider the difference between knowing that two character names occur in the same scene versus knowing that the two names occur in the same play.
To demonstrate how to divide a large text into smaller chunks, we will be working with the corpus of French tragedies. The following shows the first plays in the corpus:
Every 1,000 words
One way to split a text is to read through it and create a chunk every n words, where n is a number such as 500, 1,000 or 10,000. The following function accomplishes this:
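One way to write such a function is sketched below (the tutorial’s own function may differ in details; the name split_text is ours):

```python
def split_text(text, n_words=1000):
    """Split `text` into chunks of at most `n_words` words each.

    The final chunk may contain fewer than `n_words` words.
    """
    words = text.split()
    return [' '.join(words[i:i + n_words])
            for i in range(0, len(words), n_words)]

# A 2,500-word stand-in text splits into chunks of 1000 + 1000 + 500 words.
sample = ' '.join(str(i) for i in range(2500))
chunks = split_text(sample, 1000)
print(len(chunks))   # 3
```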
To divide up the plays, we simply apply this function to each text in the corpus. We do need to be careful to record the original file name and chunk number as we will need them later. One way to keep track of these details is to collect them in a list of Python dictionaries. There will be one dictionary for each chunk, containing the original filename, a number for the chunk, and the text of the chunk.
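A sketch of this bookkeeping, using a tiny in-memory stand-in for the corpus (the filenames are hypothetical; in practice the texts would be read from the files of the French tragedy corpus):

```python
# Stand-in corpus: filename -> text. Each "text" is just a repeated word
# so the chunk counts are easy to check.
corpus = {
    'Racine_Phedre.txt': 'mot ' * 2300,
    'Voltaire_Zaire.txt': 'mot ' * 1200,
}

def split_text(text, n_words=1000):
    words = text.split()
    return [' '.join(words[i:i + n_words])
            for i in range(0, len(words), n_words)]

chunks = []
for filename, text in sorted(corpus.items()):
    for number, chunk in enumerate(split_text(text, 1000)):
        chunks.append({'filename': filename,
                       'number': number,
                       'text': chunk})

# 2300 words -> 3 chunks; 1200 words -> 2 chunks
print([(c['filename'], c['number']) for c in chunks])
```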
Writing chunks to a directory
These chunks may be saved in a directory for reference or for analysis in another program (such as MALLET or R).
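One way to write the chunks out, assuming the list-of-dictionaries format built above (the output directory and filename pattern here are our own choices, not prescribed by the tutorial):

```python
import os
import tempfile

# Two toy chunks in the format described above (filenames hypothetical).
chunks = [
    {'filename': 'Racine_Phedre.txt', 'number': 0, 'text': 'premier morceau'},
    {'filename': 'Racine_Phedre.txt', 'number': 1, 'text': 'second morceau'},
]

output_dir = tempfile.mkdtemp()   # stand-in for a directory such as 'data/chunks'

for chunk in chunks:
    basename, _ = os.path.splitext(chunk['filename'])
    # e.g. Racine_Phedre0000.txt: original name plus a zero-padded chunk number
    outname = '{}{:04d}.txt'.format(basename, chunk['number'])
    with open(os.path.join(output_dir, outname), 'w', encoding='utf-8') as f:
        f.write(chunk['text'])

print(sorted(os.listdir(output_dir)))
# ['Racine_Phedre0000.txt', 'Racine_Phedre0001.txt']
```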
(A stand-alone script for splitting texts is also available.)
It is possible to split a document into paragraph-length chunks. Finding the appropriate character (sequence) that marks a paragraph boundary requires familiarity with how paragraphs are encoded in the text file. For example, the version of Jane Eyre provided in the austen-brontë corpus contains no line breaks within paragraphs inside chapters, so the paragraph marker in this case is simply the newline. Using the split string method with the newline as the argument (split('\n')) will break the text into paragraphs. That is, if the text of Jane Eyre is contained in the variable text, then the following sequence will split the document into paragraphs:
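A sketch with a stand-in for the novel’s text (in the austen-brontë corpus each paragraph occupies a single line):

```python
# Stand-in for the text of Jane Eyre: one paragraph per line.
text = "first paragraph\nsecond paragraph\nthird paragraph"

paragraphs = text.split('\n')
print(len(paragraphs))   # 3
```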
By contrast, in the Project Gutenberg edition of Brontë’s novel, paragraphs are set off by two newlines in sequence. We still use the split method, but we will use two newlines \n\n as our delimiter:
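The same idea with a stand-in text in the Project Gutenberg layout, where lines wrap inside a paragraph and a blank line separates paragraphs:

```python
# Stand-in text: paragraphs separated by a blank line (two newlines).
text = "It was a long paragraph\nwrapped across lines.\n\nA second paragraph."

paragraphs = text.split('\n\n')
print(len(paragraphs))   # 2
```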
When comparing groups of texts, we often want to aggregate information about the texts that comprise each group. For instance, we may be interested in comparing the works of one author with the works of another author. Or we may be interested in comparing texts published before 1800 with texts published after 1800. In order to do this, we need a strategy for collecting information (often word frequencies) associated with every text in a group.
As an illustration, consider the task of grouping word frequencies in French tragedies by author. We have four authors (Crébillon, Corneille, Racine, and Voltaire) and 60 texts. Typically the first step in grouping texts together is determining what criterion or “key” defines a group. In this case the key is the author, which is conveniently recorded at the beginning of each filename in our corpus. So our first step will be to associate each text (the contents of each file) with the name of its author. As before we will use a list of dictionaries to manage our data.
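A sketch of that first step, using hypothetical filenames (in the corpus the author’s name begins each filename, before an underscore):

```python
import os

filenames = ['Corneille_Cinna.txt', 'Racine_Phedre.txt', 'Racine_Bajazet.txt']
texts = ['<text of Cinna>', '<text of Phedre>', '<text of Bajazet>']

corpus = []
for filename, text in zip(filenames, texts):
    author = os.path.basename(filename).split('_')[0]  # author precedes '_'
    corpus.append({'author': author, 'filename': filename, 'text': text})

print([d['author'] for d in corpus])   # ['Corneille', 'Racine', 'Racine']
```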
The easiest way to group the data is to use NumPy’s array indexing. This method is more concise than the alternatives and it should be familiar to those comfortable with R or Octave/Matlab. (Those for whom this method is unfamiliar will benefit from reviewing the introductions to NumPy mentioned in Getting started.)
Recall that gathering together the sum of the entries along columns is performed with X.sum(axis=0). This is the NumPy equivalent of R’s apply(X, 2, sum) (or colSums(X)).
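A minimal sketch of grouping by array indexing, with a toy document-term matrix (in practice the matrix and author labels would come from the corpus):

```python
import numpy as np

# Rows are texts, columns are word counts; one author label per row.
authors = np.array(['Corneille', 'Racine', 'Racine'])
dtm = np.array([[1, 2, 0],
                [0, 1, 3],
                [2, 0, 1]])

# For each distinct author, select the matching rows with a boolean mask
# and sum the counts down each column (axis=0).
authors_unique = sorted(set(authors))
dtm_authors = np.array([dtm[authors == a, :].sum(axis=0)
                        for a in authors_unique])
print(authors_unique)   # ['Corneille', 'Racine']
print(dtm_authors)      # [[1 2 0]
                        #  [2 1 4]]
```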
Grouping data together in this manner is such a common problem in data analysis that there are packages devoted to making the work easier. For example, if you have the pandas library installed, you can accomplish what we just did in two lines of code:
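A sketch of those two lines, assuming pandas is installed and reusing the toy matrix from above:

```python
import numpy as np
import pandas as pd

authors = ['Corneille', 'Racine', 'Racine']
dtm = np.array([[1, 2, 0],
                [0, 1, 3],
                [2, 0, 1]])

# The two lines: wrap the matrix in a DataFrame, then group rows by author.
dtm_authors = pd.DataFrame(dtm).groupby(authors).sum()
print(dtm_authors)
```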
A more general strategy for grouping data together makes use of the groupby function in the Python standard library module itertools. This method has the advantage of being fast and memory efficient. As a warm-up exercise, we will group just the filenames by author using groupby:
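A sketch of the warm-up, again with hypothetical filenames; note that itertools.groupby only groups *consecutive* items, so the input must be sorted by the grouping key first:

```python
import itertools
import os

filenames = ['Corneille_Cinna.txt', 'Corneille_Horace.txt',
             'Racine_Phedre.txt']

def author_from_filename(fn):
    """The grouping key: the author's name that begins each filename."""
    return os.path.basename(fn).split('_')[0]

grouped = {}
for author, group in itertools.groupby(
        sorted(filenames, key=author_from_filename),
        key=author_from_filename):
    grouped[author] = list(group)

print(grouped)
# {'Corneille': ['Corneille_Cinna.txt', 'Corneille_Horace.txt'],
#  'Racine': ['Racine_Phedre.txt']}
```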
The preceding lines of code demonstrate how to group filenames by author. Now we want to aggregate document-term frequencies by author. The process is similar. We use the same strategy of creating a collection of dictionaries with the information we want to aggregate and the key—the author’s name—that identifies each group.
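A sketch of the aggregation, representing each text’s word frequencies with a Counter (the toy counts are invented; real ones would come from tokenizing the texts):

```python
import collections
import itertools
import operator

# One dictionary per text: the grouping key ('author') plus the
# frequencies to aggregate.
records = [
    {'author': 'Corneille', 'counts': collections.Counter(le=4, roi=1)},
    {'author': 'Racine', 'counts': collections.Counter(le=2, mer=3)},
    {'author': 'Racine', 'counts': collections.Counter(le=1, roi=2)},
]

records.sort(key=operator.itemgetter('author'))  # groupby needs sorted input

counts_by_author = {}
for author, group in itertools.groupby(records,
                                       key=operator.itemgetter('author')):
    total = collections.Counter()
    for record in group:
        total.update(record['counts'])   # add this text's counts to the group
    counts_by_author[author] = total

print(counts_by_author['Racine'])   # Counter({'le': 3, 'mer': 3, 'roi': 2})
```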
Now that we have done the work of grouping these texts together, we can examine the relationships among the four authors using the exploratory techniques we learned in Working with text.
Note that it is possible to group texts by any feature they share in common. If, for instance, we had wanted to organize our texts into 50 year periods (1650-1699, 1700-1749, ...) rather than by author, we would begin by extracting the publication year from the filename.
Then we would create a list of group identifiers based on the periods that interest us:
Finally we would group the texts together using the same procedure as we did with authors.
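A sketch of those first two steps, assuming (hypothetically) that the year is the last underscore-separated field of each filename:

```python
filenames = ['Corneille_Cinna_1643.txt', 'Racine_Phedre_1677.txt',
             'Voltaire_Zaire_1732.txt']

# Step 1: extract the publication year from each filename.
years = [int(fn.split('.')[0].split('_')[-1]) for fn in filenames]

# Step 2: map each year onto a 50-year period identifier,
# e.g. 1650-1699 -> 1650, 1700-1749 -> 1700.
periods = [(year // 50) * 50 for year in years]
print(periods)   # [1600, 1650, 1700]
```

The periods list then plays exactly the role the author names played above: it is the key used to group the texts.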
- Write a tokenizer that, as it tokenizes, also transforms uppercase words into lowercase words. Consider using the string method lower.
- Using your tokenizer, count the number of times green occurs in the following text sample.
- Personal names that occur in lowercase form in the dictionary illustrate one kind of information that is lost by ignoring case. Provide another example of useful information lost when lowercasing all words.