text preprocessing with python

From https://de.dariah.eu/tatom/preprocessing.html

Also refer to http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize

Preprocessing

Frequently the texts we have are not those we want to analyze. We may have a single file containing the collected works of an author although we are only interested in a single work. Or we may be given a large work broken up into volumes (this is the case for Les Misérables, as we will see later) where the division into volumes is not important to us.
If we are interested in an author’s style, we likely want to break up a long text (such as a book-length work) into smaller chunks so we can get a sense of the variability in an author’s writing. If we are comparing one group of writers to a second group, we may wish to aggregate information about writers belonging to the same group. This will require merging documents or other information that were initially separate. This section illustrates these two common preprocessing steps: splitting long texts into smaller “chunks” and aggregating texts together.
Another important preprocessing step is tokenization. This is the process of splitting a text into individual words or sequences of words (n-grams). Decisions regarding tokenization will depend on the language(s) being studied and the research question. For example, should the phrase "her father's arm-chair" be tokenized as ["her", "father", "s", "arm", "chair"] or ["her", "father's", "arm-chair"]? Tokenization patterns that work for one language may not be appropriate for another (What is the appropriate tokenization of “Qu’est-ce que c’est?”?). This section begins with a brief discussion of tokenization before covering splitting and merging texts.
Note
Each tutorial is self-contained and should be read through in order. Variables and functions introduced in one subsection will be referenced and used in subsequent subsections. For example, the NumPy library numpy is imported and then used later without being imported a second time.

Tokenizing

There are many ways to tokenize a text. Often ambiguity is inescapable. Consider the following lines of Charlotte Brontë’s Villette:
whose walls gleamed with foreign mirrors. Near the hearth
appeared a little group: a slight form sunk in a deep arm-
chair, one or two women busy about it, the iron-grey gentle-
man anxiously looking on. ...
Does the appropriate tokenization include “armchair” or “arm-chair”? While it would be strange to see “arm-chair” in print today, the hyphenated version predominates in Villette and other texts from the same period. “gentleman”, however, seems preferable to “gentle-man,” although the latter occurs in early nineteenth century English-language books. This is a case where a simple tokenization rule (resolve end-of-line hyphens) will not cover all cases. For very large corpora containing a diversity of authors, idiosyncrasies resulting from tokenization tend not to be particularly consequential (“arm-chair” is not a high frequency word). For smaller corpora, however, decisions regarding tokenization can make a profound difference.
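The "resolve end-of-line hyphens" rule can be sketched in a few lines. This is only an illustration under stated assumptions: the function name join_line_end_hyphens and the keep_hyphen set are hypothetical and not part of any library, and the set stands in for an editorial decision that has to be made for each corpus.

```python
import re

def join_line_end_hyphens(text, keep_hyphen=frozenset()):
    """Rejoin words that were split by a hyphen at the end of a line.

    Words whose hyphenated form appears in `keep_hyphen` keep the hyphen;
    every other split word is joined without it.
    """
    def rejoin(match):
        first, second = match.group(1), match.group(2)
        hyphenated = first + '-' + second
        if hyphenated.lower() in keep_hyphen:
            return hyphenated
        return first + second

    # word fragment, hyphen, line break (plus any indentation), word fragment
    return re.sub(r'(\w+)-\n\s*(\w+)', rejoin, text)

lines = ("appeared a little group: a slight form sunk in a deep arm-\n"
         "chair, one or two women busy about it, the iron-grey gentle-\n"
         "man anxiously looking on.")

# keep "arm-chair" hyphenated, let every other split word be joined
print(join_line_end_hyphens(lines, keep_hyphen={'arm-chair'}))
```

With this choice "arm-chair" survives and "gentle-" plus "man" becomes "gentleman", while "iron-grey" is untouched because its hyphen does not fall at a line end. The rule itself is simple; deciding what belongs in the exception set is the hard part.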
Languages that do not mark word boundaries present an additional challenge. Chinese and Classical Greek provide two important examples. Consider the following sequence of Chinese characters: 爱国人. This sequence could be broken up into the following tokens: [“爱”, “国人”] (to love one’s compatriots) or [“爱国”, “人”] (a country-loving person). Resolving this kind of ambiguity (when it can be resolved) is an active topic of research. For Chinese and for other languages with this feature there are a number of tokenization strategies in circulation.
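To make the ambiguity concrete, the following sketch lists every way of splitting a string into words from a given vocabulary. The function segmentations and the four-entry vocabulary are made up for this illustration; real segmenters rely on much larger dictionaries and on statistical models to choose among the candidates.

```python
def segmentations(text, vocabulary):
    """Return every way of splitting `text` into words found in `vocabulary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in vocabulary:
            # keep this word and segment whatever is left over
            for rest in segmentations(text[end:], vocabulary):
                results.append([head] + rest)
    return results

# a toy vocabulary chosen for this example only
vocabulary = {'爱', '爱国', '国人', '人'}

for tokens in segmentations('爱国人', vocabulary):
    print(tokens)
# ['爱', '国人']
# ['爱国', '人']
```

Both readings are valid with respect to the vocabulary, so the vocabulary alone cannot settle the question.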
Here are a number of examples of tokenizing functions:
# note: there are three spaces between "at" and "her" to make the example more
# realistic (texts are frequently plagued by such idiosyncrasies)
In [1]: text = "She looked at   her father's arm-chair."

In [2]: text_fr = "Qu'est-ce que c'est?"

# tokenize on spaces
In [3]: text.split(' ')
Out[3]: ['She', 'looked', 'at', '', '', 'her', "father's", 'arm-chair.']

In [4]: text_fr.split(' ')
Out[4]: ["Qu'est-ce", 'que', "c'est?"]

# scikit-learn
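# note that CountVectorizer discards "words" that contain only one character, such as "s"
# CountVectorizer also lowercases text by default, but it does so in a separate
# preprocessing step, so the tokenizer returned by build_tokenizer() leaves case unchanged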
In [5]: from sklearn.feature_extraction.text import CountVectorizer

In [6]: CountVectorizer().build_tokenizer()(text)
Out[6]: ['She', 'looked', 'at', 'her', 'father', 'arm', 'chair']

In [7]: CountVectorizer().build_tokenizer()(text_fr)
Out[7]: ['Qu', 'est', 'ce', 'que', 'est']

# nltk word_tokenize uses the TreebankWordTokenizer and needs to be given
# a single sentence at a time.
In [8]: from nltk.tokenize import word_tokenize

In [9]: word_tokenize(text)
Out[9]: ['She', 'looked', 'at', 'her', 'father', "'s", 'arm-chair', '.']

In [10]: word_tokenize(text_fr)
Out[10]: ["Qu'est-ce", 'que', "c'est", '?']

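# nltk PunktWordTokenizer
# (this class has been removed from recent NLTK releases, so the import below needs an older version)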
In [11]: from nltk.tokenize.punkt import PunktWordTokenizer

In [12]: tokenizer = PunktWordTokenizer()

In [13]: tokenizer.tokenize(text)
Out[13]: ['She', 'looked', 'at', 'her', 'father', "'s", 'arm-chair.']

In [14]: tokenizer.tokenize(text_fr)
Out[14]: ['Qu', "'est-ce", 'que', 'c', "'est", '?']

# use of maketrans to tokenize on spaces, stripping punctuation
# see python documentation for str.translate
# string.punctuation is simply a string containing the ASCII punctuation characters
In [15]: import string

In [16]: table = str.maketrans({ch: None for ch in string.punctuation})

In [17]: [s.translate(table) for s in text.split(' ') if s != '']
Out[17]: ['She', 'looked', 'at', 'her', 'fathers', 'armchair']

In [18]: [s.translate(table) for s in text_fr.split(' ') if s != '']
Out[18]: ['Questce', 'que', 'cest']
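
The two candidate tokenizations of "her father's arm-chair" discussed at the start of this section can each be expressed as a regular expression. The sketch below uses only the standard library module re; the two patterns are our own illustrations, not the patterns used by scikit-learn or NLTK.

```python
import re

text = "She looked at   her father's arm-chair."
text_fr = "Qu'est-ce que c'est?"

# pattern A: runs of word characters only, so apostrophes and hyphens split words
split_everything = re.compile(r"\w+")

# pattern B: word characters, optionally continued by an apostrophe or hyphen
# that is itself followed by more word characters
keep_internal_marks = re.compile(r"\w+(?:['-]\w+)*")

print(split_everything.findall(text))
# ['She', 'looked', 'at', 'her', 'father', 's', 'arm', 'chair']

print(keep_internal_marks.findall(text))
# ['She', 'looked', 'at', 'her', "father's", 'arm-chair']

print(keep_internal_marks.findall(text_fr))
# ["Qu'est-ce", 'que', "c'est"]
```

Neither pattern is correct in general. Pattern B suits the English example but keeps "Qu'est-ce" as a single token, which may or may not be what a study of French requires.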

Stemming

Often we want to count inflected forms of a word together. This procedure is referred to as stemming. Stemming a German text treats the following words as instances of the word “Wald”: “Wald”, “Walde”, “Wälder”, “Wäldern”, “Waldes”, and “Walds”. Analogously, in English the following words would be counted as “forest”: “forest”, “forests”, “forested”, “forest’s”, “forests’”. As stemming reduces the number of unique vocabulary items that need to be tracked, it speeds up a variety of computational operations. For some kinds of analyses, such as authorship attribution or fine-grained stylistic analyses, stemming may obscure differences among writers. For example, one author may be distinguished by the use of a plural form of a word.
NLTK offers stemming for a variety of languages in the nltk.stem package. The following code illustrates the use of the popular Snowball stemmer:
In [19]: from nltk.stem.snowball import GermanStemmer

In [20]: stemmer = GermanStemmer()

# note that the stem function works one word at a time
In [21]: words = ["Wald", "Walde", "Wälder", "Wäldern", "Waldes","Walds"]

In [22]: [stemmer.stem(w) for w in words]
Out[22]: ['wald', 'wald', 'wald', 'wald', 'wald', 'wald']

# note that the stemming algorithm "understands" grammar to some extent and that if "Waldi" were to appear in a text, it would not be stemmed.
In [23]: stemmer.stem("Waldi")
Out[23]: 'waldi'
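
To see how stemming shrinks the vocabulary, the following sketch counts the "forest" forms mentioned above before and after stemming. The function crude_stem is a deliberately naive suffix stripper written for this illustration only. It is not the Snowball algorithm and will mangle many real words; in practice one would pass tokens through an NLTK stemmer instead.

```python
from collections import Counter

def crude_stem(word):
    """Strip a few English suffixes (toy rule set, for illustration only)."""
    word = word.lower()
    for suffix in ("s'", "'s", "ed", "s"):
        # only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

tokens = ["forest", "forests", "forested", "forest's", "forests'", "Forest"]

raw_counts = Counter(t.lower() for t in tokens)
stem_counts = Counter(crude_stem(t) for t in tokens)

print(len(raw_counts))   # 5 distinct vocabulary items without stemming
print(stem_counts)       # Counter({'forest': 6})
```

The same collapse that makes counting cheaper is what hides an author's preference for, say, the plural form.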

Chunking

Splitting a long text into smaller samples is a common task in text analysis. As most kinds of quantitative text analysis take as inputs an unordered list of words, breaking a text up into smaller chunks allows one to preserve context that would otherwise be discarded; observing two words together in a paragraph-sized chunk of text tells us much more about the relationship between those two words than observing two words occurring together in a 100,000-word book. Or, as we will be using a selection of tragedies as our examples, we might consider the difference between knowing that two character names occur in the same scene versus knowing that the two names occur in the same play.
To demonstrate how to divide a large text into smaller chunks, we will be working with the corpus of French tragedies. The following shows the first plays in the corpus:
In [24]: import os

In [25]: import numpy as np

# plays are in the directory data/french-tragedy
# gather all the filenames, sorted alphabetically
In [26]: corpus_path = os.path.join('data', 'french-tragedy')

# look at the first few filenames
# (we are sorting because different operating systems may list files in different orders)
In [27]: sorted(os.listdir(path=corpus_path))[0:5]
Out[27]: 
['Crebillon_TR-V-1703-Idomenee.txt',
 'Crebillon_TR-V-1707-Atree.txt',
 'Crebillon_TR-V-1708-Electre.txt',
 'Crebillon_TR-V-1711-Rhadamisthe.txt',
 'Crebillon_TR-V-1717-Semiramis.txt']

# we will need the entire path, e.g., 'data/french-tragedy/Crebillon_TR-V-1703-Idomenee.txt'
# rather than just 'Crebillon_TR-V-1703-Idomenee.txt' alone.
In [28]: tragedy_filenames = [os.path.join(corpus_path, fn) for fn in sorted(os.listdir(corpus_path))]

# alternatively, using the Python standard library package 'glob'
In [29]: import glob

In [30]: tragedy_filenames = glob.glob(corpus_path + os.sep + '*.txt')

Every 1,000 words

One way to split a text is to read through it and create a chunk every n words, where n is a number such as 500, 1,000 or 10,000. The following function accomplishes this:
In [31]: def split_text(filename, n_words):
   ....:     """Split a text into chunks approximately `n_words` words in length."""
   ....:     with open(filename, 'r') as f:
   ....:         words = f.read().split(' ')
   ....:     chunks = []
   ....:     current_chunk_words = []
   ....:     current_chunk_word_count = 0
   ....:     for word in words:
   ....:         current_chunk_words.append(word)
   ....:         current_chunk_word_count += 1
   ....:         if current_chunk_word_count == n_words:
   ....:             chunks.append(' '.join(current_chunk_words))
   ....:             current_chunk_words = []
   ....:             current_chunk_word_count = 0
   ....:     chunks.append(' '.join(current_chunk_words) )
   ....:     return chunks
   ....: 
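Two details of split_text are worth knowing. Because it splits on a single space, runs of spaces produce empty "words" and newlines do not separate words. And when the number of words is an exact multiple of n_words, the final append adds an empty chunk. The variant below is a sketch that avoids both by splitting on any whitespace and slicing the word list. The name chunk_words is ours, it takes a string rather than a filename, and it will generally yield a different number of chunks than split_text on the same text.

```python
def chunk_words(text, n_words):
    """Split `text` into chunks of at most `n_words` whitespace-separated words."""
    if n_words < 1:
        raise ValueError("n_words must be a positive integer")
    words = text.split()  # splits on any run of whitespace, including newlines
    return [' '.join(words[i:i + n_words])
            for i in range(0, len(words), n_words)]

sample = "one two   three\nfour five six seven eight"
print(chunk_words(sample, 3))
# ['one two three', 'four five six', 'seven eight']
```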
To divide up the plays, we simply apply this function to each text in the corpus. We do need to be careful to record the original file name and chunk number as we will need them later. One way to keep track of these details is to collect them in a list of Python dictionaries. There will be one dictionary for each chunk, containing the original filename, a number for the chunk, and the text of the chunk.
In [32]: tragedy_filenames = [os.path.join(corpus_path, fn) for fn in sorted(os.listdir(corpus_path))]

# alternatively, using glob
In [33]: tragedy_filenames = glob.glob(corpus_path + os.sep + '*.txt')

# for consistency across platforms (Linux, OS X, Windows) we must sort the filenames
In [34]: tragedy_filenames.sort()

In [35]: chunk_length = 1000

In [36]: chunks = []

In [37]: for filename in tragedy_filenames:
   ....:     chunk_counter = 0
   ....:     texts = split_text(filename, chunk_length)
   ....:     for text in texts:
   ....:         chunk = {'text': text, 'number': chunk_counter, 'filename': filename}
   ....:         chunks.append(chunk)
   ....:         chunk_counter += 1
   ....: 

# we started with this many files ...
In [38]: len(tragedy_filenames)
Out[38]: 59

# ... and now we have this many
In [39]: len(chunks)
Out[39]: 2740

# from the triples we can create a document-term matrix
In [40]: from sklearn.feature_extraction.text import CountVectorizer

In [41]: vectorizer = CountVectorizer(min_df=5, max_df=.95)

In [42]: dtm = vectorizer.fit_transform([c['text'] for c in chunks])

# (in scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead)
In [43]: vocab = np.array(vectorizer.get_feature_names())
filename                                              number  accable  accablent  accabler  accablez  accablé  accablée  accablés  accents
data/french-tragedy/Crebillon_TR-V-1703-Idomenee.txt  0       0        0          1         0         0        0         0         0
data/french-tragedy/Crebillon_TR-V-1703-Idomenee.txt  1       1        0          0         0         0        0         1         0
data/french-tragedy/Crebillon_TR-V-1703-Idomenee.txt  2       0        0          0         0         0        0         0         0

Writing chunks to a directory

These chunks may be saved in a directory for reference or for analysis in another program (such as MALLET or R).
# make sure the directory exists
In [44]: output_dir = '/tmp/'

In [45]: for chunk in chunks:
   ....:     basename = os.path.basename(chunk['filename'])
   ....:     fn = os.path.join(output_dir,
   ....:                       "{}{:04d}".format(basename, chunk['number']))
   ....:     with open(fn, 'w') as f:
   ....:         f.write(chunk['text'])
   ....: 
(A stand-alone script for splitting texts is available: split-text.py.)
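
The loop above assumes that the output directory already exists, and it appends the chunk number after the file extension (for example Crebillon_TR-V-1703-Idomenee.txt0000). The sketch below creates the directory if needed and places the number before the extension. The helper chunk_filename, the directory name chunks-example, and the two sample chunks are all made up for this illustration.

```python
import os
import tempfile

def chunk_filename(original_path, number):
    """Build a name such as 'play-0003.txt' from 'some/dir/play.txt' and 3."""
    stem, extension = os.path.splitext(os.path.basename(original_path))
    return "{}-{:04d}{}".format(stem, number, extension)

# two made-up chunks standing in for the `chunks` list built earlier
example_chunks = [
    {'text': 'first chunk', 'number': 0, 'filename': 'data/french-tragedy/example.txt'},
    {'text': 'second chunk', 'number': 1, 'filename': 'data/french-tragedy/example.txt'},
]

output_dir = os.path.join(tempfile.gettempdir(), 'chunks-example')
os.makedirs(output_dir, exist_ok=True)  # create the directory if it is missing

for chunk in example_chunks:
    path = os.path.join(output_dir, chunk_filename(chunk['filename'], chunk['number']))
    with open(path, 'w', encoding='utf-8') as f:
        f.write(chunk['text'])
```

Keeping the extension last means that other programs which select files by extension will still find the chunks.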

Every paragraph

It is possible to split a document into paragraph-length chunks. Finding the appropriate character (sequence) that marks a paragraph boundary requires familiarity with how paragraphs are encoded in the text file. For example, the version of Jane Eyre provided in the austen-brontë corpus contains no line breaks within paragraphs inside chapters, so the paragraph marker in this case is simply the newline. Using the split string method with the newline as the argument (split('\n')) will break the text into paragraphs. That is, if the text of Jane Eyre is contained in the variable text then the following sequence will split the document into paragraphs:
In [46]: text = "There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.\nI was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the coming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the chidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed."

In [47]: text
Out[47]: 'There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.\nI was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the coming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the chidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed.'

In [48]: paragraphs = text.split('\n')

In [49]: paragraphs
Out[49]: 
['There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.',
 'I was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the coming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the chidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed.']
By contrast, in the Project Gutenberg edition of Brontë’s novel, paragraphs are set off by two newlines in sequence. We still use the split method but we will use two newlines \n\n as our delimiter:
In [50]: text = "There was no possibility of taking a walk that day.  We had been\nwandering, indeed, in the leafless shrubbery an hour in the morning; but\nsince dinner (Mrs. Reed, when there was no company, dined early) the cold\nwinter wind had brought with it clouds so sombre, and a rain so\npenetrating, that further out-door exercise was now out of the question.\n\nI was glad of it: I never liked long walks, especially on chilly\nafternoons: dreadful to me was the coming home in the raw twilight, with\nnipped fingers and toes, and a heart saddened by the chidings of Bessie,\nthe nurse, and humbled by the consciousness of my physical inferiority to\nEliza, John, and Georgiana Reed."

In [51]: text
Out[51]: 'There was no possibility of taking a walk that day.  We had been\nwandering, indeed, in the leafless shrubbery an hour in the morning; but\nsince dinner (Mrs. Reed, when there was no company, dined early) the cold\nwinter wind had brought with it clouds so sombre, and a rain so\npenetrating, that further out-door exercise was now out of the question.\n\nI was glad of it: I never liked long walks, especially on chilly\nafternoons: dreadful to me was the coming home in the raw twilight, with\nnipped fingers and toes, and a heart saddened by the chidings of Bessie,\nthe nurse, and humbled by the consciousness of my physical inferiority to\nEliza, John, and Georgiana Reed.'

In [52]: paragraphs = text.split('\n\n')

In [53]: paragraphs
Out[53]: 
['There was no possibility of taking a walk that day.  We had been\nwandering, indeed, in the leafless shrubbery an hour in the morning; but\nsince dinner (Mrs. Reed, when there was no company, dined early) the cold\nwinter wind had brought with it clouds so sombre, and a rain so\npenetrating, that further out-door exercise was now out of the question.',
 'I was glad of it: I never liked long walks, especially on chilly\nafternoons: dreadful to me was the coming home in the raw twilight, with\nnipped fingers and toes, and a heart saddened by the chidings of Bessie,\nthe nurse, and humbled by the consciousness of my physical inferiority to\nEliza, John, and Georgiana Reed.']
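
When paragraphs are separated by blank lines, the chunks returned by split('\n\n') still contain the hard line breaks of the original file, and a run of three or more newlines produces empty strings. The following sketch, which is our own addition, splits on any blank line and then rejoins the wrapped lines. The function name split_paragraphs and the sample text are made up for the example.

```python
import re

def split_paragraphs(text):
    """Split on blank lines and undo hard line wrapping inside each paragraph."""
    blocks = re.split(r'\n\s*\n', text.strip())
    # str.split() with no argument collapses newlines and repeated spaces
    return [' '.join(block.split()) for block in blocks if block.strip()]

wrapped = ("First paragraph, line one.\n"
           "First paragraph, line two.\n"
           "\n"
           "\n"
           "Second paragraph.")

print(split_paragraphs(wrapped))
# ['First paragraph, line one. First paragraph, line two.', 'Second paragraph.']
```

Note that collapsing whitespace also removes the double space after a full stop found in some editions, which may or may not matter for a given analysis.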

Grouping

When comparing groups of texts, we often want to aggregate information about the texts that comprise each group. For instance, we may be interested in comparing the works of one author with the works of another author. Or we may be interested in comparing texts published before 1800 with texts published after 1800. In order to do this, we need a strategy for collecting information (often word frequencies) associated with every text in a group.
As an illustration, consider the task of grouping word frequencies in French tragedies by author. We have four authors (Crébillon, Corneille, Racine, and Voltaire) and 59 texts. Typically the first step in grouping texts together is determining what criterion or “key” defines a group. In this case the key is the author, which is conveniently recorded at the beginning of each filename in our corpus. So our first step will be to associate each text (the contents of each file) with the name of its author. As before we will use a list of dictionaries to manage our data.
# in every filename the author's last name is followed by an underscore ('_'),
# for example: Voltaire_TR-V-1764-Olympie.txt
# os.path.basename(...) gets us the filename from a path, e.g.,
In [54]: os.path.basename('french-tragedy/Voltaire_TR-V-1764-Olympie.txt')
Out[54]: 'Voltaire_TR-V-1764-Olympie.txt'

# using the split method we can break up the string on the underscore ('_')
In [55]: os.path.basename('french-tragedy/Voltaire_TR-V-1764-Olympie.txt').split('_')
Out[55]: ['Voltaire', 'TR-V-1764-Olympie.txt']

# putting these two steps together
In [56]: author = os.path.basename('french-tragedy/Voltaire_TR-V-1764-Olympie.txt').split('_')[0]

In [57]: author
Out[57]: 'Voltaire'

# and for all the authors
In [58]: authors = [os.path.basename(filename).split('_')[0] for filename in tragedy_filenames]

In [59]: authors
Out[59]: 
['Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire']

# to ignore duplicates we can transform the list into a set (which only records unique elements)
In [60]: set(authors)
Out[60]: {'Crebillon', 'PCorneille', 'Racine', 'Voltaire'}

# as there is no guarantee about the ordering in a set (or a dictionary) we will typically
# first drop duplicates and then save our unique names as a sorted list. Because there are
# no duplicates in this list, we can be confident that the ordering is the same every time.
In [61]: sorted(set(authors))
Out[61]: ['Crebillon', 'PCorneille', 'Racine', 'Voltaire']

# and we have a way of finding which indexes in authors correspond to each author using array indexing
In [62]: authors = np.array(authors)  # convert from a Python list to a NumPy array

In [63]: first_author = sorted(set(authors))[0]

In [64]: first_author
Out[64]: 'Crebillon'

In [65]: authors == first_author
Out[65]: 
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False], dtype=bool)

In [66]: np.nonzero(authors == first_author)  # if we want the actual indexes
Out[66]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8]),)

In [67]: authors[np.nonzero(authors == first_author)]
Out[67]: 
array(['Crebillon', 'Crebillon', 'Crebillon', 'Crebillon', 'Crebillon',
       'Crebillon', 'Crebillon', 'Crebillon', 'Crebillon'], 
      dtype='<U10')

# alternatively, we can select the entries for texts *not* written by `first_author`
In [68]: authors[authors != first_author]
Out[68]: 
array(['PCorneille', 'PCorneille', 'PCorneille', 'PCorneille',
       'PCorneille', 'PCorneille', 'PCorneille', 'PCorneille',
       'PCorneille', 'PCorneille', 'PCorneille', 'PCorneille',
       'PCorneille', 'PCorneille', 'PCorneille', 'PCorneille',
       'PCorneille', 'PCorneille', 'PCorneille', 'PCorneille', 'Racine',
       'Racine', 'Racine', 'Racine', 'Racine', 'Racine', 'Racine',
       'Racine', 'Racine', 'Racine', 'Racine', 'Voltaire', 'Voltaire',
       'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire',
       'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire',
       'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire',
       'Voltaire', 'Voltaire'], 
      dtype='<U10')
The easiest way to group the data is to use NumPy’s array indexing. This method is more concise than the alternatives and it should be familiar to those comfortable with R or Octave/Matlab. (Those for whom this method is unfamiliar will benefit from reviewing the introductions to NumPy mentioned in Getting started.)
# first get a document-term-matrix of word frequencies for our corpus
In [69]: vectorizer = CountVectorizer(input='filename')

In [70]: dtm = vectorizer.fit_transform(tragedy_filenames).toarray()

# (in scikit-learn 1.0 and later, use vectorizer.get_feature_names_out() instead)
In [71]: vocab = np.array(vectorizer.get_feature_names())
In [72]: authors = np.array([os.path.basename(filename).split('_')[0] for filename in tragedy_filenames])

# allocate an empty array to store our aggregated word frequencies
In [73]: authors_unique = sorted(set(authors))

In [74]: dtm_authors = np.zeros((len(authors_unique), len(vocab)))

In [75]: for i, author in enumerate(authors_unique):
   ....:     dtm_authors[i, :] = np.sum(dtm[authors==author, :], axis=0)
   ....: 
Note
Recall that gathering together the sum of the entries along columns is performed with np.sum(X, axis=0) or X.sum(axis=0). This is the NumPy equivalent of R’s apply(X, 2, sum) (or colSums(X)).
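For readers who would like to see what the loop over authors_unique computes without any NumPy machinery, here is the same aggregation written with plain Python lists. It is a sketch for intuition only: the function sum_rows_by_group and the tiny count table are made up, and the NumPy version is far faster on a real document-term matrix.

```python
def sum_rows_by_group(rows, labels):
    """Add up the rows of a count table that share the same label."""
    if len(rows) != len(labels):
        raise ValueError("need exactly one label per row")
    totals = {}
    for row, label in zip(rows, labels):
        if label not in totals:
            totals[label] = [0] * len(row)
        # element-wise addition: the column sum, built up one row at a time
        totals[label] = [a + b for a, b in zip(totals[label], row)]
    # return the groups in sorted order so the result is reproducible
    return [(label, totals[label]) for label in sorted(totals)]

# three "documents" and three "vocabulary items" (toy counts)
rows = [[1, 0, 2],
        [0, 3, 1],
        [4, 1, 0]]
labels = ['Racine', 'Voltaire', 'Racine']

print(sum_rows_by_group(rows, labels))
# [('Racine', [5, 1, 2]), ('Voltaire', [0, 3, 1])]
```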
Grouping data together in this manner is such a common problem in data analysis that there are packages devoted to making the work easier. For example, if you have the pandas library installed, you can accomplish what we just did in two lines of code:
In [76]: import pandas

In [77]: authors = [os.path.basename(filename).split('_')[0] for filename in tragedy_filenames]

In [78]: dtm_authors = pandas.DataFrame(dtm).groupby(authors).sum().values
A more general strategy for grouping data together makes use of the groupby function in the Python standard library module itertools. This method has the advantage of being fast and memory efficient. As a warm-up exercise, we will group just the filenames by author using the groupby function.
In [79]: import itertools

In [80]: import operator

In [81]: texts = []

In [82]: for filename in tragedy_filenames:
   ....:     author = os.path.basename(filename).split('_')[0]
   ....:     texts.append(dict(filename=filename, author=author))
   ....: 

# groupby requires that the list be sorted by the 'key' with which we will be doing the grouping
In [83]: texts = sorted(texts, key=operator.itemgetter('author'))

# if d is a dictionary, operator.itemgetter(key)(d) does d[key]
In [84]: d = {'number': 5}

In [85]: d['number']
Out[85]: 5

In [86]: operator.itemgetter('number')(d)
Out[86]: 5
In [87]: grouped_data = {}

In [88]: for author, grouped in itertools.groupby(texts, key=operator.itemgetter('author')):
   ....:     grouped_data[author] = ','.join(os.path.basename(t['filename']) for t in grouped)
   ....: 

In [89]: grouped_data
Out[89]: 
{'Crebillon': 'Crebillon_TR-V-1703-Idomenee.txt,Crebillon_TR-V-1707-Atree.txt,Crebillon_TR-V-1708-Electre.txt,Crebillon_TR-V-1711-Rhadamisthe.txt,Crebillon_TR-V-1717-Semiramis.txt,Crebillon_TR-V-1726-Pyrrhus.txt,Crebillon_TR-V-1749-Catilina.txt,Crebillon_TR-V-1749-Xerces.txt,Crebillon_TR-V-1754-Triumvirat.txt',
 'PCorneille': 'PCorneille_TR-V-1639-Medee.txt,PCorneille_TR-V-1639-Nicomede.txt,PCorneille_TR-V-1641-Horace.txt,PCorneille_TR-V-1643-Cinna.txt,PCorneille_TR-V-1643-Polyeucte.txt,PCorneille_TR-V-1644-Pompee.txt,PCorneille_TR-V-1644-Rodogune.txt,PCorneille_TR-V-1645-Theodore.txt,PCorneille_TR-V-1647-Heraclius.txt,PCorneille_TR-V-1651-Andromede.txt,PCorneille_TR-V-1653-Pertharite.txt,PCorneille_TR-V-1659-Oedipe.txt,PCorneille_TR-V-1661-Toisondor.txt,PCorneille_TR-V-1662-Sertorius.txt,PCorneille_TR-V-1663-Sophonisbe.txt,PCorneille_TR-V-1665-Othon.txt,PCorneille_TR-V-1666-Agesilas.txt,PCorneille_TR-V-1668-Attila.txt,PCorneille_TR-V-1672-Pulcherie.txt,PCorneille_TR-V-1674-Surena.txt',
 'Racine': 'Racine_TR-V-1664-Thebaide.txt,Racine_TR-V-1666-Alexandre.txt,Racine_TR-V-1668-Andromaque.txt,Racine_TR-V-1670-Britannicus.txt,Racine_TR-V-1671-Berenice.txt,Racine_TR-V-1672-Bajazet.txt,Racine_TR-V-1673-Mithridate.txt,Racine_TR-V-1674-Iphigenie.txt,Racine_TR-V-1677-Phedre.txt,Racine_TR-V-1689-Esther.txt,Racine_TR-V-1691-Athalie.txt',
 'Voltaire': 'Voltaire_TR-V-1718-Oedipe.txt,Voltaire_TR-V-1724-Mariamne.txt,Voltaire_TR-V-1730-Brutus.txt,Voltaire_TR-V-1732-Alzire.txt,Voltaire_TR-V-1732-Zaire.txt,Voltaire_TR-V-1734-Agathocle.txt,Voltaire_TR-V-1741-Fanatisme.txt,Voltaire_TR-V-1743-Merope.txt,Voltaire_TR-V-1743-MortCesar.txt,Voltaire_TR-V-1749-LoisMinos.txt,Voltaire_TR-V-1750-RomeSauvee.txt,Voltaire_TR-V-1751-DucDAlencon.txt,Voltaire_TR-V-1755-OrphelinChine.txt,Voltaire_TR-V-1764-Olympie.txt,Voltaire_TR-V-1766-AdelaideDuGuesclin.txt,Voltaire_TR-V-1769-Guebres.txt,Voltaire_TR-V-1771-Tancrede.txt,Voltaire_TR-V-1774-Sophonisbee.txt,Voltaire_TR-V-1778-Irene.txt'}
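The sorting step is not optional. itertools.groupby only merges consecutive items that share a key, so an unsorted list yields the same key more than once, and a dictionary built from those groups silently keeps only the last run. The short sketch below shows the effect on three records that are deliberately out of order (the records list is a stand-in for the texts list used above).

```python
import itertools
import operator

records = [
    {'author': 'Racine', 'title': 'Phedre'},
    {'author': 'Voltaire', 'title': 'Zaire'},
    {'author': 'Racine', 'title': 'Esther'},
]
get_author = operator.itemgetter('author')

# without sorting, 'Racine' shows up as two separate groups
unsorted_keys = [author for author, _ in itertools.groupby(records, key=get_author)]
print(unsorted_keys)  # ['Racine', 'Voltaire', 'Racine']

# a dictionary built from the unsorted groups loses the first Racine record
lossy = {}
for author, group in itertools.groupby(records, key=get_author):
    lossy[author] = [r['title'] for r in group]
print(lossy['Racine'])  # ['Esther']

# sorting by the same key first gives one group per author
complete = {}
for author, group in itertools.groupby(sorted(records, key=get_author), key=get_author):
    complete[author] = [r['title'] for r in group]
print(complete['Racine'])  # ['Phedre', 'Esther']
```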
The preceding lines of code demonstrate how to group filenames by author. Now we want to aggregate document-term frequencies by author. The process is similar. We use the same strategy of creating a collection of dictionaries with the information we want to aggregate and the key—the author’s name—that identifies each group.
In [90]: texts = []

# we will use the index i to get the corresponding row of the document-term matrix
In [91]: for i, filename in enumerate(tragedy_filenames):
   ....:     author = os.path.basename(filename).split('_')[0]
   ....:     termfreq = dtm[i, :]
   ....:     texts.append(dict(filename=filename, author=author, termfreq=termfreq))
   ....: 

# groupby requires that the list be sorted by the 'key' according to which we are grouping
In [92]: texts = sorted(texts, key=operator.itemgetter('author'))

In [93]: texts = sorted(texts, key=operator.itemgetter('author'))

In [94]: termfreqs = []

In [95]: for author, group in itertools.groupby(texts, key=operator.itemgetter('author')):
   ....:     termfreqs.append(np.sum(np.array([t['termfreq'] for t in group]), axis=0))
   ....: 

In [96]: dtm_authors = np.array(termfreqs)  # creates matrix out of a list of arrays

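# check that the groupby result agrees with the NumPy array-indexing approach used earlier
In [97]: np.testing.assert_array_almost_equal(dtm_authors, np.array([dtm[np.array(authors) == a].sum(axis=0) for a in sorted(set(authors))]))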
Now that we have done the work of grouping these texts together, we can examine the relationships among the four authors using the exploratory techniques we learned in Working with text.
In [98]: import matplotlib

In [99]: import matplotlib.pyplot as plt

In [100]: from sklearn.manifold import MDS

In [101]: from sklearn.metrics.pairwise import cosine_similarity

In [102]: dist = 1 - cosine_similarity(dtm_authors)

In [103]: mds = MDS(n_components=2, dissimilarity="precomputed")

In [104]: pos = mds.fit_transform(dist)  # shape (n_samples, n_components)
In [105]: xs, ys = pos[:, 0], pos[:, 1]

In [106]: names = sorted(set(authors))

In [107]: for x, y, name in zip(xs, ys, names):
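   .....:     # scale the index to [0, 1) so that the authors get visibly different colors
   .....:     color = matplotlib.cm.summer(names.index(name) / len(names))
   .....:     plt.scatter(x, y, color=color)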
   .....:     plt.text(x, y, name)
   .....: 

In [108]: plt.show()
[Figure: MDS plot of the four authors (_images/plot_preprocessing_authors_mds.png)]
Note that it is possible to group texts by any feature they share in common. If, for instance, we had wanted to organize our texts into 50-year periods (1650-1699, 1700-1749, ...) rather than by author, we would begin by extracting the publication year from the filename.
# extract year from filename
In [109]: years = [int(os.path.basename(fn).split('-')[2]) for fn in tragedy_filenames]

# using a regular expression
In [110]: import re

In [111]: years = [int(re.findall('[0-9]+', fn)[0]) for fn in tragedy_filenames]
Then we would create a list of group identifiers based on the periods that interest us:
# all the texts are published between 1600 and 1800
# periods will be numbered 0, 1, 2, 3
# periods correspond to: year < 1650, 1650 <= year < 1700, ...
In [112]: period_boundaries = list(range(1650, 1800 + 1, 50))

In [113]: period_names = ["{}-{}".format(yr - 50, yr) for yr in period_boundaries]

In [114]: periods = []

In [115]: for year in years:
   .....:     for i, boundary in enumerate(period_boundaries):
   .....:         if year < boundary:
   .....:             periods.append(i)
   .....:             break
   .....: 

# examine how many texts appear in each period
In [116]: list(zip(period_names, np.bincount(periods)))
Out[116]: [('1600-1650', 9), ('1650-1700', 22), ('1700-1750', 18), ('1750-1800', 10)]
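The nested loop above can also be written with the standard library module bisect, which finds the position of a value in a sorted list. The sketch below is an alternative of our own, not the method used in the rest of this section. It also raises an error for a year at or beyond the last boundary, whereas the loop would skip such a year silently and leave periods shorter than years. The function name period_index and the list example_years are made up for the illustration.

```python
import bisect

period_boundaries = [1650, 1700, 1750, 1800]

def period_index(year):
    """Return the index of the first boundary that is greater than `year`."""
    index = bisect.bisect_right(period_boundaries, year)
    if index == len(period_boundaries):
        raise ValueError("year {} is not below any boundary".format(year))
    return index

example_years = [1639, 1650, 1699, 1703, 1778]
print([period_index(y) for y in example_years])
# [0, 1, 1, 2, 3]
```

A year equal to a boundary, such as 1650, falls into the later period, which matches the `year < boundary` test in the loop.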
Finally we would group the texts together using the same procedure as we did with authors.
In [117]: periods_unique = sorted(set(periods))

In [118]: dtm_periods = np.zeros((len(periods_unique), len(vocab)))

In [119]: for i, period in enumerate(periods_unique):
   .....:     # periods is a Python list; convert it so that == compares element-wise
   .....:     dtm_periods[i,:] = np.sum(dtm[np.array(periods) == period, :], axis=0)
   .....: 

Exercises

  1. Write a tokenizer that, as it tokenizes, also transforms uppercase words into lowercase words. Consider using the string method lower.
  2. Using your tokenizer, count the number of times green occurs in the following text sample.
"I find," Mr. Green said, "that there are many members here who do not know
me yet,--young members, probably, who are green from the waste lands and
road-sides of private life.
  3. Personal names that occur in lowercase form in the dictionary illustrate one kind of information that is lost by ignoring case. Provide another example of useful information lost when lowercasing all words.
