text preprocessing with python

Posted by jeffy Posted on 9:20 PM with 145 comments

From https://de.dariah.eu/tatom/preprocessing.html

Also refer to http://www.nltk.org/api/nltk.tokenize.html#module-nltk.tokenize

Preprocessing

Frequently the texts we have are not those we want to analyze. We may have a single file containing the collected works of an author although we are only interested in a single work. Or we may be given a large work broken up into volumes (this is the case for Les Misérables, as we will see later) where the division into volumes is not important to us.

If we are interested in an author’s style, we likely want to break up a long text (such as a book-length work) into smaller chunks so we can get a sense of the variability in an author’s writing. If we are comparing one group of writers to a second group, we may wish to aggregate information about writers belonging to the same group. This will require merging documents or other information that were initially separate. This section illustrates these two common preprocessing steps: splitting long texts into smaller “chunks” and aggregating texts together.

Another important preprocessing step is tokenization. This is the process of splitting a text into individual words or sequences of words (n-grams). Decisions regarding tokenization will depend on the language(s) being studied and the research question. For example, should the phrase "her father's arm-chair" be tokenized as ["her", "father", "s", "arm", "chair"] or ["her", "father's", "arm-chair"]? Tokenization patterns that work for one language may not be appropriate for another (What is the appropriate tokenization of “Qu’est-ce que c’est?”?). This section begins with a brief discussion of tokenization before covering splitting and merging texts.
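To make the notion of n-grams concrete, here is a minimal sketch that builds them from an already tokenized phrase. The helper name `ngrams` and the sample token list are illustrative assumptions, not part of the tutorial's own code.

```python
def ngrams(tokens, n):
    """Return the n-grams (as tuples) found in a list of tokens."""
    # zip stops at the shortest slice, so incomplete n-grams are dropped
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["her", "father's", "arm-chair"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(bigrams)
```

With three tokens there are two bigrams; asking for n-grams longer than the token list simply yields an empty list.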

Note

Each tutorial is self-contained and should be read through in order. Variables and functions introduced in one subsection will be referenced and used in subsequent subsections. For example, the NumPy library numpy is imported and then used later without being imported a second time.

Tokenizing

There are many ways to tokenize a text. Often ambiguity is inescapable. Consider the following lines of Charlotte Brontë’s Villette:

whose walls gleamed with foreign mirrors. Near the hearth
appeared a little group: a slight form sunk in a deep arm-
chair, one or two women busy about it, the iron-grey gentle-
man anxiously looking on. ...

Does the appropriate tokenization include “armchair” or “arm-chair”? While it would be strange to see “arm-chair” in print today, the hyphenated version predominates in Villette and other texts from the same period. “gentleman”, however, seems preferable to “gentle-man,” although the latter occurs in early nineteenth-century English-language books. This is a case where a simple tokenization rule (resolve end-of-line hyphens) will not cover all cases. For very large corpora containing a diversity of authors, idiosyncrasies resulting from tokenization tend not to be particularly consequential (“arm-chair” is not a high-frequency word). For smaller corpora, however, decisions regarding tokenization can make a profound difference.
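The "resolve end-of-line hyphens" rule can be sketched with a regular expression. This is only one possible reading of the rule; the function name `join_line_end_hyphens` is a hypothetical helper introduced here.

```python
import re

lines = (
    "appeared a little group: a slight form sunk in a deep arm-\n"
    "chair, one or two women busy about it, the iron-grey gentle-\n"
    "man anxiously looking on."
)


def join_line_end_hyphens(text):
    """Delete a hyphen that ends a line and join the two word halves."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


joined = join_line_end_hyphens(lines)
print(joined)
```

The rule produces "gentleman" as hoped, but it also produces "armchair" even though "arm-chair" is the form that predominates in the novel. Hyphens inside a line, as in "iron-grey", are left untouched.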

Languages that do not mark word boundaries present an additional challenge. Chinese and Classical Greek provide two important examples. Consider the following sequence of Chinese characters: 爱国人. This sequence could be broken up into the following tokens: [“爱”, “国人”] (to love one’s compatriots) or [“爱国”, “人”] (a country-loving person). Resolving this kind of ambiguity (when it can be resolved) is an active topic of research. For Chinese and for other languages with this feature there are a number of tokenization strategies in circulation.
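As an illustration of how two strategies can disagree, the sketch below applies greedy dictionary matching from the left and from the right. Maximum matching is just one simple strategy, chosen here for illustration; the tiny lexicon and both function names are assumptions made for this example.

```python
def forward_max_match(text, lexicon):
    """Greedy left-to-right segmentation: take the longest known word each step."""
    tokens, i = [], 0
    max_len = max(len(w) for w in lexicon)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # fall back to a single character if nothing longer is known
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon):
    """Greedy right-to-left segmentation: take the longest known word each step."""
    tokens, j = [], len(text)
    max_len = max(len(w) for w in lexicon)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in lexicon:
                tokens.insert(0, candidate)
                j -= size
                break
    return tokens


toy_lexicon = {"爱", "国", "人", "爱国", "国人"}  # hypothetical word list
forward = forward_max_match("爱国人", toy_lexicon)
backward = backward_max_match("爱国人", toy_lexicon)
print(forward, backward)
```

Scanning from the left yields the "country-loving person" reading, scanning from the right yields the "to love one's compatriots" reading. Neither direction is correct in general, which is why the ambiguity cannot be settled by a mechanical rule alone.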

Here are a number of examples of tokenizing functions:

# note: there are three spaces between "at" and "her" to make the example more
# realistic (texts are frequently plagued by such idiosyncrasies)
In [1]: text = "She looked at   her father's arm-chair."

In [2]: text_fr = "Qu'est-ce que c'est?"

# tokenize on spaces
In [3]: text.split(' ')
Out[3]: ['She', 'looked', 'at', '', '', 'her', "father's", 'arm-chair.']

In [4]: text_fr.split(' ')
Out[4]: ["Qu'est-ce", 'que', "c'est?"]

# scikit-learn
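# note that CountVectorizer's default tokenizer discards "words" that contain only one character, such as "s"
# CountVectorizer also transforms all words into lowercase, but it does so in a separate
# preprocessing step, which is why the bare tokenizer used below preserves case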
In [5]: from sklearn.feature_extraction.text import CountVectorizer

In [6]: CountVectorizer().build_tokenizer()(text)
Out[6]: ['She', 'looked', 'at', 'her', 'father', 'arm', 'chair']

In [7]: CountVectorizer().build_tokenizer()(text_fr)
Out[7]: ['Qu', 'est', 'ce', 'que', 'est']

# nltk word_tokenize uses the TreebankWordTokenizer and needs to be given
# a single sentence at a time.
In [8]: from nltk.tokenize import word_tokenize

In [9]: word_tokenize(text)
Out[9]: ['She', 'looked', 'at', 'her', 'father', "'s", 'arm-chair', '.']

In [10]: word_tokenize(text_fr)
Out[10]: ["Qu'est-ce", 'que', "c'est", '?']

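# nltk PunktWordTokenizer
# (this class may be unavailable in recent NLTK releases; the example assumes a version that provides it)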
In [11]: from nltk.tokenize.punkt import PunktWordTokenizer

In [12]: tokenizer = PunktWordTokenizer()

In [13]: tokenizer.tokenize(text)
Out[13]: ['She', 'looked', 'at', 'her', 'father', "'s", 'arm-chair.']

In [14]: tokenizer.tokenize(text_fr)
Out[14]: ['Qu', "'est-ce", 'que', 'c', "'est", '?']

# use of maketrans to tokenize on spaces, stripping punctuation
# see python documentation for str.translate
# string.punctuation is simply a string of punctuation characters
In [15]: import string

In [16]: table = str.maketrans({ch: None for ch in string.punctuation})

In [17]: [s.translate(table) for s in text.split(' ') if s != '']
Out[17]: ['She', 'looked', 'at', 'her', 'fathers', 'armchair']

In [18]: [s.translate(table) for s in text_fr.split(' ') if s != '']
Out[18]: ['Questce', 'que', 'cest']
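A further option is a hand-written regular expression, which makes the tokenization decision explicit. The sketch below keeps word-internal hyphens and apostrophes; the pattern and the name `regex_tokenize` are assumptions chosen for this example, not a recommended standard.

```python
import re

# one or more word characters, optionally continued by "-" or "'" plus more
# word characters (so "father's" and "arm-chair" stay in one piece)
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*")


def regex_tokenize(s):
    """Return the tokens matched by TOKEN_PATTERN, dropping other punctuation."""
    return TOKEN_PATTERN.findall(s)


english_tokens = regex_tokenize("She looked at   her father's arm-chair.")
french_tokens = regex_tokenize("Qu'est-ce que c'est?")
print(english_tokens)
print(french_tokens)
```

Note that the pattern only recognizes the ASCII apostrophe; a text that uses the typographic apostrophe (’) would need that character added to the character class.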

Stemming

Often we want to count inflected forms of a word together. This procedure is referred to as stemming. Stemming a German text treats the following words as instances of the word “Wald”: “Wald”, “Walde”, “Wälder”, “Wäldern”, “Waldes”, and “Walds”. Analogously, in English the following words would be counted as “forest”: “forest”, “forests”, “forested”, “forest’s”, “forests’”. As stemming reduces the number of unique vocabulary items that need to be tracked, it speeds up a variety of computational operations. For some kinds of analyses, such as authorship attribution or fine-grained stylistic analyses, stemming may obscure differences among writers. For example, one author may be distinguished by the use of a plural form of a word.
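The effect of counting inflected forms together can be shown with a deliberately naive suffix stripper. This is not the Snowball algorithm or any real stemmer; `naive_stem` is a hypothetical toy that handles only the handful of English endings in the example above.

```python
from collections import Counter


def naive_stem(word):
    """Strip a possessive marker and then one of a few plain suffixes."""
    word = word.lower()
    if word.endswith("'s"):
        word = word[:-2]
    elif word.endswith("'"):
        word = word[:-1]
    for suffix in ("ed", "s"):
        # keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[:-len(suffix)]
            break
    return word


forms = ["forest", "forests", "forested", "forest's", "forests'"]
counts = Counter(naive_stem(w) for w in forms)
print(counts)
```

Five distinct surface forms collapse into a single vocabulary item, which is exactly the reduction (and the loss of detail) described above.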

NLTK offers stemming for a variety of languages in the nltk.stem package. The following code illustrates the use of the popular Snowball stemmer:

In [19]: from nltk.stem.snowball import GermanStemmer

In [20]: stemmer = GermanStemmer()

# note that the stem function works one word at a time
In [21]: words = ["Wald", "Walde", "Wälder", "Wäldern", "Waldes","Walds"]

In [22]: [stemmer.stem(w) for w in words]
Out[22]: ['wald', 'wald', 'wald', 'wald', 'wald', 'wald']

# note that the stemming algorithm "understands" grammar to some extent and that if "Waldi" were to appear in a text, it would not be stemmed.
In [23]: stemmer.stem("Waldi")
Out[23]: 'waldi'

Chunking

Splitting a long text into smaller samples is a common task in text analysis. As most kinds of quantitative text analysis take as inputs an unordered list of words, breaking a text up into smaller chunks allows one to preserve context that would otherwise be discarded; observing two words together in a paragraph-sized chunk of text tells us much more about the relationship between those two words than observing two words occurring together in a 100,000-word book. Or, as we will be using a selection of tragedies as our examples, we might consider the difference between knowing that two character names occur in the same scene versus knowing that the two names occur in the same play.

To demonstrate how to divide a large text into smaller chunks, we will be working with the corpus of French tragedies. The following shows the first plays in the corpus:

In [24]: import os

In [25]: import numpy as np

# plays are in the directory data/french-tragedy
# gather all the filenames, sorted alphabetically
In [26]: corpus_path = os.path.join('data', 'french-tragedy')

# look at the first few filenames
# (we are sorting because different operating systems may list files in different orders)
In [27]: sorted(os.listdir(path=corpus_path))[0:5]
Out[27]: 
['Crebillon_TR-V-1703-Idomenee.txt',
 'Crebillon_TR-V-1707-Atree.txt',
 'Crebillon_TR-V-1708-Electre.txt',
 'Crebillon_TR-V-1711-Rhadamisthe.txt',
 'Crebillon_TR-V-1717-Semiramis.txt']

# we will need the entire path, e.g., 'data/french-tragedy/Crebillon_TR-V-1703-Idomenee.txt'
# rather than just 'Crebillon_TR-V-1703-Idomenee.txt' alone.
In [28]: tragedy_filenames = [os.path.join(corpus_path, fn) for fn in sorted(os.listdir(corpus_path))]

# alternatively, using the Python standard library package 'glob'
In [29]: import glob

In [30]: tragedy_filenames = glob.glob(corpus_path + os.sep + '*.txt')

Every 1,000 words

One way to split a text is to read through it and create a chunk every n words, where n is a number such as 500, 1,000 or 10,000. The following function accomplishes this:

In [31]: def split_text(filename, n_words):
   ....:     """Split a text into chunks approximately `n_words` words in length."""
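   ....:     # a with-block closes the file for us and avoids shadowing the built-in input()
   ....:     with open(filename, 'r') as f:
   ....:         # splitting on ' ' only: newlines stay inside some "words"
   ....:         words = f.read().split(' ')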
   ....:     chunks = []
   ....:     current_chunk_words = []
   ....:     current_chunk_word_count = 0
   ....:     for word in words:
   ....:         current_chunk_words.append(word)
   ....:         current_chunk_word_count += 1
   ....:         if current_chunk_word_count == n_words:
   ....:             chunks.append(' '.join(current_chunk_words))
   ....:             current_chunk_words = []
   ....:             current_chunk_word_count = 0
   ....:     chunks.append(' '.join(current_chunk_words) )
   ....:     return chunks
   ....: 
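The same chunking logic can be written more compactly with list slicing. This is a minimal sketch that assumes the text has already been read and split into a list of words; `chunk_words` and the sample data are hypothetical names introduced here.

```python
def chunk_words(words, n_words):
    """Split a list of words into consecutive chunks of at most n_words words."""
    return [words[i:i + n_words] for i in range(0, len(words), n_words)]


sample_words = "a b c d e f g".split()
sample_chunks = chunk_words(sample_words, 3)
print(sample_chunks)
```

As with the loop-based version, the final chunk holds whatever is left over and may be shorter than the others. Unlike the loop-based version, an empty input produces no chunks at all.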

To divide up the plays, we simply apply this function to each text in the corpus. We do need to be careful to record the original file name and chunk number as we will need them later. One way to keep track of these details is to collect them in a list of Python dictionaries. There will be one dictionary for each chunk, containing the original filename, a number for the chunk, and the text of the chunk.

In [32]: tragedy_filenames = [os.path.join(corpus_path, fn) for fn in sorted(os.listdir(corpus_path))]

# alternatively, using glob
In [33]: tragedy_filenames = glob.glob(corpus_path + os.sep + '*.txt')

# for consistency across platforms (Linux, OS X, Windows) we must sort the filenames
In [34]: tragedy_filenames.sort()

In [35]: chunk_length = 1000

In [36]: chunks = []

In [37]: for filename in tragedy_filenames:
   ....:     chunk_counter = 0
   ....:     texts = split_text(filename, chunk_length)
   ....:     for text in texts:
   ....:         chunk = {'text': text, 'number': chunk_counter, 'filename': filename}
   ....:         chunks.append(chunk)
   ....:         chunk_counter += 1
   ....: 

# we started with this many files ...
In [38]: len(tragedy_filenames)
Out[38]: 59

# ... and now we have this many
In [39]: len(chunks)
Out[39]: 2740

# from the triples we can create a document-term matrix
In [40]: from sklearn.feature_extraction.text import CountVectorizer

In [41]: vectorizer = CountVectorizer(min_df=5, max_df=.95)

In [42]: dtm = vectorizer.fit_transform([c['text'] for c in chunks])

# (in scikit-learn 1.0 and later this method is named get_feature_names_out)
In [43]: vocab = np.array(vectorizer.get_feature_names())

                                                          accable  accabler  accablés
data/french-tragedy/Crebillon_TR-V-1703-Idomenee.txt0           0         1         0
data/french-tragedy/Crebillon_TR-V-1703-Idomenee.txt1           1         0         1
data/french-tragedy/Crebillon_TR-V-1703-Idomenee.txt2           0         0         0

Writing chunks to a directory

These chunks may be saved in a directory for reference or for analysis in another program (such as MALLET or R).

# make sure the directory exists
In [44]: output_dir = '/tmp/'

In [45]: for chunk in chunks:
   ....:     basename = os.path.basename(chunk['filename'])
   ....:     fn = os.path.join(output_dir,
   ....:                       "{}{:04d}".format(basename, chunk['number']))
   ....:     with open(fn, 'w') as f:
   ....:         f.write(chunk['text'])
   ....: 

(A stand-alone script for splitting texts is available: split-text.py.)

Every paragraph

It is possible to split a document into paragraph-length chunks. Finding the appropriate character (sequence) that marks a paragraph boundary requires familiarity with how paragraphs are encoded in the text file. For example, the version of Jane Eyre provided in the austen-brontë corpus contains no line breaks within paragraphs inside chapters, so the paragraph marker in this case is simply the newline. Using the split string method with the newline as the argument (split('\n')) will break the text into paragraphs. That is, if the text of Jane Eyre is contained in the variable text then the following sequence will split the document into paragraphs:

In [46]: text = "There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.\nI was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the coming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the chidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed."

In [47]: text
Out[47]: 'There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.\nI was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the coming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the chidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed.'

In [48]: paragraphs = text.split('\n')

In [49]: paragraphs
Out[49]: 
['There was no possibility of taking a walk that day. We had been wandering, indeed, in the leafless shrubbery an hour in the morning; but since dinner (Mrs. Reed, when there was no company, dined early) the cold winter wind had brought with it clouds so sombre, and a rain so penetrating, that further out-door exercise was now out of the question.',
 'I was glad of it: I never liked long walks, especially on chilly afternoons: dreadful to me was the coming home in the raw twilight, with nipped fingers and toes, and a heart saddened by the chidings of Bessie, the nurse, and humbled by the consciousness of my physical inferiority to Eliza, John, and Georgiana Reed.']

By contrast, in the Project Gutenberg edition of Brontë’s novel, paragraphs are set off by two newlines in sequence. We still use the split method but we will use two newlines \n\n as our delimiter:

In [50]: text = "There was no possibility of taking a walk that day.  We had been\nwandering, indeed, in the leafless shrubbery an hour in the morning; but\nsince dinner (Mrs. Reed, when there was no company, dined early) the cold\nwinter wind had brought with it clouds so sombre, and a rain so\npenetrating, that further out-door exercise was now out of the question.\n\nI was glad of it: I never liked long walks, especially on chilly\nafternoons: dreadful to me was the coming home in the raw twilight, with\nnipped fingers and toes, and a heart saddened by the chidings of Bessie,\nthe nurse, and humbled by the consciousness of my physical inferiority to\nEliza, John, and Georgiana Reed."

In [51]: text
Out[51]: 'There was no possibility of taking a walk that day.  We had been\nwandering, indeed, in the leafless shrubbery an hour in the morning; but\nsince dinner (Mrs. Reed, when there was no company, dined early) the cold\nwinter wind had brought with it clouds so sombre, and a rain so\npenetrating, that further out-door exercise was now out of the question.\n\nI was glad of it: I never liked long walks, especially on chilly\nafternoons: dreadful to me was the coming home in the raw twilight, with\nnipped fingers and toes, and a heart saddened by the chidings of Bessie,\nthe nurse, and humbled by the consciousness of my physical inferiority to\nEliza, John, and Georgiana Reed.'

In [52]: paragraphs = text.split('\n\n')

In [53]: paragraphs
Out[53]: 
['There was no possibility of taking a walk that day.  We had been\nwandering, indeed, in the leafless shrubbery an hour in the morning; but\nsince dinner (Mrs. Reed, when there was no company, dined early) the cold\nwinter wind had brought with it clouds so sombre, and a rain so\npenetrating, that further out-door exercise was now out of the question.',
 'I was glad of it: I never liked long walks, especially on chilly\nafternoons: dreadful to me was the coming home in the raw twilight, with\nnipped fingers and toes, and a heart saddened by the chidings of Bessie,\nthe nurse, and humbled by the consciousness of my physical inferiority to\nEliza, John, and Georgiana Reed.']
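Real files are often less tidy than these two cases: paragraphs may be separated by three or more newlines, or by "blank" lines that contain stray spaces. A regular expression tolerates both. The following is a sketch under that assumption; `split_paragraphs` and the sample string are hypothetical.

```python
import re


def split_paragraphs(text):
    """Split on runs of blank lines, ignoring whitespace on those lines."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return [p for p in paragraphs if p]


sample = "First paragraph,\nsecond line.\n\n\nSecond paragraph.\n \nThird paragraph.\n"
sample_paragraphs = split_paragraphs(sample)
print(sample_paragraphs)
```

Single newlines inside a paragraph are preserved, so this only suits files in which a blank line marks the paragraph boundary, such as the Project Gutenberg edition described above.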

Grouping

When comparing groups of texts, we often want to aggregate information about the texts that comprise each group. For instance, we may be interested in comparing the works of one author with the works of another author. Or we may be interested in comparing texts published before 1800 with texts published after 1800. In order to do this, we need a strategy for collecting information (often word frequencies) associated with every text in a group.

As an illustration, consider the task of grouping word frequencies in French tragedies by author. We have four authors (Crébillon, Corneille, Racine, and Voltaire) and 59 texts. Typically the first step in grouping texts together is determining what criterion or “key” defines a group. In this case the key is the author, which is conveniently recorded at the beginning of each filename in our corpus. So our first step will be to associate each text (the contents of each file) with the name of its author. As before we will use a list of dictionaries to manage our data.

# in every filename the author's last name is followed by an underscore ('_'),
# for example: Voltaire_TR-V-1764-Olympie.txt
# os.path.basename(...) gets us the filename from a path, e.g.,
In [54]: os.path.basename('french-tragedy/Voltaire_TR-V-1764-Olympie.txt')
Out[54]: 'Voltaire_TR-V-1764-Olympie.txt'

# using the split method we can break up the string on the underscore ('_')
In [55]: os.path.basename('french-tragedy/Voltaire_TR-V-1764-Olympie.txt').split('_')
Out[55]: ['Voltaire', 'TR-V-1764-Olympie.txt']

# putting these two steps together
In [56]: author = os.path.basename('french-tragedy/Voltaire_TR-V-1764-Olympie.txt').split('_')[0]

In [57]: author
Out[57]: 'Voltaire'

# and for all the authors
In [58]: authors = [os.path.basename(filename).split('_')[0] for filename in tragedy_filenames]

In [59]: authors
Out[59]: 
['Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'Crebillon',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'PCorneille',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Racine',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire',
 'Voltaire']

# to ignore duplicates we can transform the list into a set (which only records unique elements)
In [60]: set(authors)
Out[60]: {'Crebillon', 'PCorneille', 'Racine', 'Voltaire'}

# as there is no guarantee about the ordering in a set (or a dictionary) we will typically
# first drop duplicates and then save our unique names as a sorted list. Because there are
# no duplicates in this list, we can be confident that the ordering is the same every time.
In [61]: sorted(set(authors))
Out[61]: ['Crebillon', 'PCorneille', 'Racine', 'Voltaire']

# and we have a way of finding which indexes in authors correspond to each author using array indexing
In [62]: authors = np.array(authors)  # convert from a Python list to a NumPy array

In [63]: first_author = sorted(set(authors))[0]

In [64]: first_author
Out[64]: 'Crebillon'

In [65]: authors == first_author
Out[65]: 
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False], dtype=bool)

In [66]: np.nonzero(authors == first_author)  # if we want the actual indexes
Out[66]: (array([0, 1, 2, 3, 4, 5, 6, 7, 8]),)

In [67]: authors[np.nonzero(authors == first_author)]
Out[67]: 
array(['Crebillon', 'Crebillon', 'Crebillon', 'Crebillon', 'Crebillon',
       'Crebillon', 'Crebillon', 'Crebillon', 'Crebillon'], 
      dtype='<U10')

# alternatively, we can select the entries for texts *not* written by `first_author`
In [68]: authors[authors != first_author]
Out[68]: 
array(['PCorneille', 'PCorneille', 'PCorneille', 'PCorneille',
       'PCorneille', 'PCorneille', 'PCorneille', 'PCorneille',
       'PCorneille', 'PCorneille', 'PCorneille', 'PCorneille',
       'PCorneille', 'PCorneille', 'PCorneille', 'PCorneille',
       'PCorneille', 'PCorneille', 'PCorneille', 'PCorneille', 'Racine',
       'Racine', 'Racine', 'Racine', 'Racine', 'Racine', 'Racine',
       'Racine', 'Racine', 'Racine', 'Racine', 'Voltaire', 'Voltaire',
       'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire',
       'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire',
       'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire', 'Voltaire',
       'Voltaire', 'Voltaire'], 
      dtype='<U10')

The easiest way to group the data is to use NumPy’s array indexing. This method is more concise than the alternatives and it should be familiar to those comfortable with R or Octave/Matlab. (Those for whom this method is unfamiliar will benefit from reviewing the introductions to NumPy mentioned in Getting started.)

# first get a document-term-matrix of word frequencies for our corpus
In [69]: vectorizer = CountVectorizer(input='filename')

In [70]: dtm = vectorizer.fit_transform(tragedy_filenames).toarray()

# (use get_feature_names_out instead in scikit-learn 1.0 and later)
In [71]: vocab = np.array(vectorizer.get_feature_names())

In [72]: authors = np.array([os.path.basename(filename).split('_')[0] for filename in tragedy_filenames])

# allocate an empty array to store our aggregated word frequencies
In [73]: authors_unique = sorted(set(authors))

In [74]: dtm_authors = np.zeros((len(authors_unique), len(vocab)))

In [75]: for i, author in enumerate(authors_unique):
   ....:     dtm_authors[i, :] = np.sum(dtm[authors==author, :], axis=0)
   ....: 

Note

Recall that gathering together the sum of the entries along columns is performed with np.sum(X, axis=0) or X.sum(axis=0). This is the NumPy equivalent of R’s apply(X, 2, sum) (or colSums(X)).
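A toy example may help to see the two steps (selecting rows with a boolean mask, then summing along columns) in isolation. The matrix and author labels below are made-up values for illustration only.

```python
import numpy as np

# three "documents" (rows) and three "words" (columns); values are invented
toy_dtm = np.array([[1, 0, 2],
                    [0, 3, 1],
                    [4, 1, 0]])
toy_authors = np.array(["Racine", "Voltaire", "Racine"])

# the boolean mask keeps only the rows whose author matches
racine_rows = toy_dtm[toy_authors == "Racine", :]

# summing along axis 0 adds the selected rows together, column by column
racine_totals = np.sum(racine_rows, axis=0)
print(racine_totals)
```

The comparison only produces an element-wise mask because `toy_authors` is a NumPy array; comparing a plain Python list with a string yields a single `False`.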

Grouping data together in this manner is such a common problem in data analysis that there are packages devoted to making the work easier. For example, if you have the pandas library installed, you can accomplish what we just did in two lines of code:

In [76]: import pandas

In [77]: authors = [os.path.basename(filename).split('_')[0] for filename in tragedy_filenames]

In [78]: dtm_authors = pandas.DataFrame(dtm).groupby(authors).sum().values

A more general strategy for grouping data together makes use of the groupby function in the Python standard library itertools. This method has the advantage of being fast and memory efficient. As a warm-up exercise, we will group just the filenames by author using the groupby function.

In [79]: import itertools

In [80]: import operator

In [81]: texts = []

In [82]: for filename in tragedy_filenames:
   ....:     author = os.path.basename(filename).split('_')[0]
   ....:     texts.append(dict(filename=filename, author=author))
   ....: 

# groupby requires that the list be sorted by the 'key' with which we will be doing the grouping
In [83]: texts = sorted(texts, key=operator.itemgetter('author'))

# if d is a dictionary, operator.itemgetter(key)(d) does d[key]
In [84]: d = {'number': 5}

In [85]: d['number']
Out[85]: 5

In [86]: operator.itemgetter('number')(d)
Out[86]: 5

In [87]: grouped_data = {}

In [88]: for author, grouped in itertools.groupby(texts, key=operator.itemgetter('author')):
   ....:     grouped_data[author] = ','.join(os.path.basename(t['filename']) for t in grouped)
   ....: 

In [89]: grouped_data
Out[89]: 
{'Crebillon': 'Crebillon_TR-V-1703-Idomenee.txt,Crebillon_TR-V-1707-Atree.txt,Crebillon_TR-V-1708-Electre.txt,Crebillon_TR-V-1711-Rhadamisthe.txt,Crebillon_TR-V-1717-Semiramis.txt,Crebillon_TR-V-1726-Pyrrhus.txt,Crebillon_TR-V-1749-Catilina.txt,Crebillon_TR-V-1749-Xerces.txt,Crebillon_TR-V-1754-Triumvirat.txt',
 'PCorneille': 'PCorneille_TR-V-1639-Medee.txt,PCorneille_TR-V-1639-Nicomede.txt,PCorneille_TR-V-1641-Horace.txt,PCorneille_TR-V-1643-Cinna.txt,PCorneille_TR-V-1643-Polyeucte.txt,PCorneille_TR-V-1644-Pompee.txt,PCorneille_TR-V-1644-Rodogune.txt,PCorneille_TR-V-1645-Theodore.txt,PCorneille_TR-V-1647-Heraclius.txt,PCorneille_TR-V-1651-Andromede.txt,PCorneille_TR-V-1653-Pertharite.txt,PCorneille_TR-V-1659-Oedipe.txt,PCorneille_TR-V-1661-Toisondor.txt,PCorneille_TR-V-1662-Sertorius.txt,PCorneille_TR-V-1663-Sophonisbe.txt,PCorneille_TR-V-1665-Othon.txt,PCorneille_TR-V-1666-Agesilas.txt,PCorneille_TR-V-1668-Attila.txt,PCorneille_TR-V-1672-Pulcherie.txt,PCorneille_TR-V-1674-Surena.txt',
 'Racine': 'Racine_TR-V-1664-Thebaide.txt,Racine_TR-V-1666-Alexandre.txt,Racine_TR-V-1668-Andromaque.txt,Racine_TR-V-1670-Britannicus.txt,Racine_TR-V-1671-Berenice.txt,Racine_TR-V-1672-Bajazet.txt,Racine_TR-V-1673-Mithridate.txt,Racine_TR-V-1674-Iphigenie.txt,Racine_TR-V-1677-Phedre.txt,Racine_TR-V-1689-Esther.txt,Racine_TR-V-1691-Athalie.txt',
 'Voltaire': 'Voltaire_TR-V-1718-Oedipe.txt,Voltaire_TR-V-1724-Mariamne.txt,Voltaire_TR-V-1730-Brutus.txt,Voltaire_TR-V-1732-Alzire.txt,Voltaire_TR-V-1732-Zaire.txt,Voltaire_TR-V-1734-Agathocle.txt,Voltaire_TR-V-1741-Fanatisme.txt,Voltaire_TR-V-1743-Merope.txt,Voltaire_TR-V-1743-MortCesar.txt,Voltaire_TR-V-1749-LoisMinos.txt,Voltaire_TR-V-1750-RomeSauvee.txt,Voltaire_TR-V-1751-DucDAlencon.txt,Voltaire_TR-V-1755-OrphelinChine.txt,Voltaire_TR-V-1764-Olympie.txt,Voltaire_TR-V-1766-AdelaideDuGuesclin.txt,Voltaire_TR-V-1769-Guebres.txt,Voltaire_TR-V-1771-Tancrede.txt,Voltaire_TR-V-1774-Sophonisbee.txt,Voltaire_TR-V-1778-Irene.txt'}

The preceding lines of code demonstrate how to group filenames by author. Now we want to aggregate document-term frequencies by author. The process is similar. We use the same strategy of creating a collection of dictionaries with the information we want to aggregate and the key—the author’s name—that identifies each group.

In [90]: texts = []

# we will use the index i to get the corresponding row of the document-term matrix
In [91]: for i, filename in enumerate(tragedy_filenames):
   ....:     author = os.path.basename(filename).split('_')[0]
   ....:     termfreq = dtm[i, :]
   ....:     texts.append(dict(filename=filename, author=author, termfreq=termfreq))
   ....: 

# groupby requires that the list be sorted by the 'key' according to which we are grouping
In [92]: texts = sorted(texts, key=operator.itemgetter('author'))

In [93]: texts = sorted(texts, key=operator.itemgetter('author'))

In [94]: termfreqs = []

In [95]: for author, group in itertools.groupby(texts, key=operator.itemgetter('author')):
   ....:     termfreqs.append(np.sum(np.array([t['termfreq'] for t in group]), axis=0))
   ....: 

In [96]: dtm_authors = np.array(termfreqs)  # creates matrix out of a list of arrays

# check that the itertools.groupby result matches the result of the pandas approach
In [97]: np.testing.assert_array_almost_equal(dtm_authors, pandas.DataFrame(dtm).groupby(authors).sum().values)
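The requirement that the list be sorted before calling groupby is easy to overlook, because groupby raises no error on unsorted input. It only merges adjacent items with equal keys, as this small sketch with made-up keys shows.

```python
import itertools

keys = ["Racine", "Voltaire", "Racine"]

# without sorting, the two "Racine" entries end up in separate groups
unsorted_groups = [(k, len(list(g))) for k, g in itertools.groupby(keys)]

# after sorting, equal keys are adjacent and are grouped together
sorted_groups = [(k, len(list(g))) for k, g in itertools.groupby(sorted(keys))]

print(unsorted_groups)
print(sorted_groups)
```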

Now that we have done the work of grouping these texts together, we can examine the relationships among the four authors using the exploratory techniques we learned in Working with text.

In [98]: import matplotlib

In [99]: import matplotlib.pyplot as plt

In [100]: from sklearn.manifold import MDS

In [101]: from sklearn.metrics.pairwise import cosine_similarity

In [102]: dist = 1 - cosine_similarity(dtm_authors)

In [103]: mds = MDS(n_components=2, dissimilarity="precomputed")

In [104]: pos = mds.fit_transform(dist)  # shape (n_samples, n_components)

In [105]: xs, ys = pos[:, 0], pos[:, 1]

In [106]: names = sorted(set(authors))

In [107]: for x, y, name in zip(xs, ys, names):
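   .....:     # colormaps expect a float in [0, 1]; small integers map to nearly identical colors
   .....:     color = matplotlib.cm.summer(names.index(name) / len(names))
   .....:     plt.scatter(x, y, color=color)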
   .....:     plt.text(x, y, name)
   .....: 

In [108]: plt.show()

[Figure: two-dimensional MDS plot of the four authors (plot_preprocessing_authors_mds.png)]

Note that it is possible to group texts by any feature they share in common. If, for instance, we had wanted to organize our texts into 50-year periods (1650-1699, 1700-1749, ...) rather than by author, we would begin by extracting the publication year from the filename.

# extract year from filename
In [109]: years = [int(os.path.basename(fn).split('-')[2]) for fn in tragedy_filenames]

# using a regular expression
In [110]: import re

In [111]: years = [int(re.findall('[0-9]+', fn)[0]) for fn in tragedy_filenames]

Then we would create a list of group identifiers based on the periods that interest us:

# all the texts are published between 1600 and 1800
# periods will be numbered 0, 1, 2, 3
# periods correspond to: year < 1650, 1650 <= year < 1700, ...
In [112]: period_boundaries = list(range(1650, 1800 + 1, 50))

In [113]: period_names = ["{}-{}".format(yr - 50, yr) for yr in period_boundaries]

In [114]: periods = []

In [115]: for year in years:
   .....:     for i, boundary in enumerate(period_boundaries):
   .....:         if year < boundary:
   .....:             periods.append(i)
   .....:             break
   .....: 

# examine how many texts appear in each period
In [116]: list(zip(period_names, np.bincount(periods)))
Out[116]: [('1600-1650', 9), ('1650-1700', 22), ('1700-1750', 18), ('1750-1800', 10)]
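The nested loop that assigns each year to a period can also be expressed with the standard library's bisect module, which finds the position of a value in a sorted list. This is a sketch with made-up years; the `toy_` names are assumptions for this example.

```python
import bisect

toy_boundaries = [1650, 1700, 1750, 1800]
toy_years = [1639, 1650, 1699, 1703, 1778]

# bisect_right returns the index of the first boundary strictly greater than
# the year, i.e. the first i for which year < toy_boundaries[i]
toy_periods = [bisect.bisect_right(toy_boundaries, year) for year in toy_years]
print(toy_periods)
```

One difference from the loop is worth keeping in mind: a year of 1800 or later would be assigned the index 4 here, whereas the loop silently appends nothing for such a year.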

Finally we would group the texts together using the same procedure as we did with authors.

In [117]: periods_unique = sorted(set(periods))

In [118]: dtm_periods = np.zeros((len(periods_unique), len(vocab)))

In [119]: for i, period in enumerate(periods_unique):
   .....:     # periods is a Python list, so convert it to an array to get an element-wise mask
   .....:     dtm_periods[i,:] = np.sum(dtm[np.array(periods)==period,:], axis=0)
   .....: 

Exercises

Write a tokenizer that, as it tokenizes, also transforms uppercase words into lowercase words. Consider using the string method lower.
Using your tokenizer, count the number of times green occurs in the following text sample.

"I find," Mr. Green said, "that there are many members here who do not know
me yet,--young members, probably, who are green from the waste lands and
road-sides of private life.

Personal names that occur in lowercase form in the dictionary illustrate one kind of information that is lost by ignoring case. Provide another example of useful information lost when lowercasing all words.

All materials are published under a Creative Commons Attribution 4.0 International license (CC-BY 4.0).

Comments are welcome, as are reports of bugs and typos. Please use the project’s issue tracker.

These tutorials have been developed with support from the DARIAH-DE initiative, the German branch of DARIAH-EU, the European Digital Research Infrastructure for the Arts and Humanities consortium. Funding has been provided by the German Federal Ministry for Research and Education (BMBF) under the identifier 01UG1110J.


Medical Coding Courses in Pune
Medical Coding Courses in Amaravati
Medical Coding Courses in Banglore
Medical Coding Courses In Nagpur; 10:35 AM
Anonymous said...: Your content never disappoints. Looking forward to more from you!
Medical Coding Courses in Pune
Medical Coding Courses in Amaravati
Medical Coding Courses in Banglore
Medical Coding Courses In Nagpur
Pharmacovigilance Courses in Mumbai; 11:28 PM
Muneera Gul said...: That’s really sweet of you to say! 😊 I'm always happy to chat and help however I can. Looking forward to more great conversations with you! 💙 Mounjaro And Diet Plans; 6:38 AM
Anonymous said...: Rely on the Most Reliable Company for Any and All Door Requirements ADA Handicap Door Repair; 5:22 AM
Anonymous said...: What a fantastic piece! The variety of upscale facilities offered by contemporary limos amazed me. Is it possible to compare party buses and stretch limos? Naperville Limo; 10:41 PM
Mason Olivia said...: To perform a bank reconciliation in Sage 50, navigate to the "Bank Accounts" module, select the account, click "Reconcile," enter the ending balance from your bank statement, match transactions, and reconcile the account; 3:23 AM
patilgaurav said...: "Really enjoyed reading this—very well explained!"

Eligibility for Digital Marketing Courses

Career in Pharmacovigilance

Best pharmacovigilance training institute

Eligibility for Medical Coding Courses; 2:10 AM
ryansmith said...: Not everyone writes like an academic journal, and that’s okay. If you’re better at understanding than expressing, help exists. That’s the space where Online Assignment Help Australia really makes a difference — because students deserve to learn and pass; 4:01 AM
Ethan clark said...: Corporate law isn’t easy, and sometimes I need extra support to keep up. I started using Corporate Law Assignment Help halfway through the semester, and it really improved how I understand the material. Having some assistance helps me stay focused and do better in my assignments without feeling overwhelmed; 4:08 AM
Clifton Roob said...: Understanding preprocessing steps like text chunking and tokenization is essential for effective textual analysis. These insights truly enhance research quality. I appreciated how clearly it's explained—especially for newcomers. For academic tasks like these, I’d rather hire someone to take my proctor exam to stay focused.; 7:09 AM
kisaki said...: I find this information very useful for everyone.
The Maxhub BM21E is the perfect choice for improving audio quality in the workplace and ensuring every meeting runs smoothly and efficiently. With 360° technology, the Maxhub BM21E Bluetooth speaker is a reliable partner for all your important meetings.
Learn more: https://thietbihop.com/loa-hoi-nghi-khong-day-bluetooth-maxhub-bm21e/; 9:15 AM
Muneera Gul said...: Thank you so much for your kind words! I'm really glad to hear that you found the blog simple, interesting, and easy to understand. It means a lot to know that it inspired you and introduced you to new content. If you ever have questions, feedback, or topics you'd like to explore further, feel free to share — I'm here to help! 😊 Maria B; 1:21 PM
Muneera Gul said...: Amazing post and also great suggestions! This publication is very beneficial and great for us. Thanks for discussing valuable details. best weight loss treatment uk; 2:13 PM
Tarun Sharma said...: Hi, I’m Tarun from Gradding. Studying at humber college is a step toward academic success, global exposure, and professional opportunities.; 5:19 AM
samantha said...: Text preprocessing with Python is the process of cleaning and preparing raw text data for analysis or machine learning tasks. It involves steps like tokenization, lowercasing, removing stopwords, and stemming/lemmatization. Python libraries such as NLTK, spaCy, and re are commonly used for these tasks. Proper preprocessing improves the accuracy and efficiency of natural language processing (NLP) models. Text preprocessing with Python is essential for extracting meaningful insights from unstructured text data.
Foot Detox
Hair Rejuvenation; 4:24 AM
Jessica Adams said...: Whether you're building a chatbot, sentiment analysis model, or just exploring NLP, this guide offers practical tips to help improve data quality and model performance. Perfect for beginners or anyone brushing up on the basics, Text Preprocessing with Python gets you ready to work smarter with text data.

felony lawyer; 2:30 AM
Mia ethan said...: "Text Preprocessing with Python" explores the essential steps to prepare raw text for analysis. From removing punctuation and stopwords to normalizing case and correcting spelling, the blog guides readers through transforming messy data into structured input.

criminal lawyer in new york; 3:24 AM
Jesonlee said...: online khula has made the legal process of separation easier and more accessible, especially for overseas Pakistanis. With the help of a legal representative, women can now obtain Khula efficiently from anywhere in the world.; 9:37 PM
harryleo said...: An Information Retrieval blog explores methods for searching, organizing, and retrieving data efficiently. It covers topics like search engines, indexing, ranking algorithms, and text mining. Readers gain insights into IR research, tools, and practical applications. It serves as a resource for students, developers, and data professionals.
TOP BETTING SITES
Mega Casino World betting; 12:48 AM
Company registration in India said...: Great article! ISO Certification in Coimbatore; 9:08 PM
Roy Butler said...: I've always struggled with managing multiple deadlines, so I eventually turned to Native Assignment Help just to survive the workload. I didn't expect it to make such a huge difference, but it honestly did. They helped me get my workload under control and, for the first time, actually understand what each assignment was asking for. Later, when my coursework started getting tougher, I needed help understanding the clarks shoe company, especially how Clarks has built success through strategic branding, market positioning, supply chain management, product development, and adapting to changing footwear trends, and it honestly became a complete lifesaver. Instead of feeling overwhelmed by criteria, units, and endless task requirements, I finally had someone guiding me through everything in a way that actually made sense.; 10:10 PM
ahmad1234 said...: I appreciate the thoughtful insights in this post. Similarly, anyone looking for the best perfumes for men should consider their personal style and scent preferences.; 8:36 PM
ahmad1234 said...: I really appreciate the focus on professional skincare advice. Similarly, anyone considering a hydra facial Islamabad should rely on expert consultation and accurate information.; 4:14 AM
jamesstone said...: I've always struggled with managing multiple deadlines, so I eventually turned to Native Assignment Help just to survive the workload. I didn't expect it to make such a huge difference, but it honestly did. They helped me get my workload under control and, for the first time, actually understand what each assignment was asking for. Later, when my coursework started getting tougher, I needed help understanding jack ma leadership style, especially how Jack Ma's visionary, people-centred approach, emphasis on empowerment, and adaptive strategy shaped Alibaba's culture and success, and it honestly became a complete lifesaver. Instead of feeling overwhelmed by criteria, units, and endless task requirements, I finally had someone guiding me through everything in a way that actually made sense.; 8:27 PM
Bad credit Loans Guaranteed said...: Well-structured and easy to understand. Excellent effort by the writer. Cheap Towing Manhattan NY; 12:45 AM
jamesstone said...: I've always struggled with managing multiple deadlines, so I eventually turned to Native Assignment Help just to survive the workload. I didn't expect it to make such a huge difference, but it honestly did. They helped me get my workload under control and, for the first time, actually understand what each assignment was asking for. Later, when my coursework started getting tougher, I decided to try an inventory management assessment Popeyes, especially to understand how Popeyes manages stock levels, supply chain flow, waste control, and demand forecasting in a busy restaurant environment, and it honestly became a complete lifesaver. Instead of feeling overwhelmed by confusing requirements and technical terms, I finally had someone guiding me through everything in a way that actually made sense.; 10:05 PM
AnToNy said...: Wrap yourself in bold luxury with the BMF Fur Coat where timeless elegance meets daring style. Every detail is crafted to turn heads and make a statement that’s unmistakably you. Elevate your wardrobe with a piece as fearless, unique, and unforgettable as you are.; 4:39 AM
Company registration in India said...: Start your business journey with reliable Trademark Services, Copyright Services, Patent Services, Design Services, GST Services, Private Limited Registration, Partnership Registration, LLP Company Registration, OPC Company Registration, and Section 8 Company Setup in Coimbatore today.; 8:35 PM
Ethan clark said...: Strong assignments are built on clear research planning and logical structure. Students who seek Help With Assignments UK often gain a better understanding of academic writing standards and research organisation. In addition, learning about tools like saunders research onion can help learners understand how different research layers work together, enabling them to design more effective and academically sound research projects.; 3:29 AM
Gokul said...: Nice blog! For more learning, visit Digital Marketing Training in Chennai and Data Science Courses in Chennai.; 2:43 AM
Gokul said...: This helped me understand analytics basics clearly.
Also check Data Analytics for Beginners and Data Analytics Course in Coimbatore.; 12:19 AM
Discount Offers said...: The post explains Kernel Density Estimation (KDE) using SciPy to create a smooth probability distribution from data, rather than rough histograms. It focuses on using gaussian_kde and shows how bandwidth selection impacts the accuracy and smoothness of results. The guide is practical for data analysis and visualization tasks. It helps users better understand patterns in datasets. For tech tools and resources, Shopping Spout US can also help find useful deals online.; 6:49 AM
Company registration in India said...: Explore services in Coimbatore: Trademark, Copyright, Patent, Design, GST, Private Limited, Partnership, LLP, OPC, Metrology for business compliance solutions today.; 9:42 PM
smithjamessmith said...: Thanks for sharing your insights on text preprocessing with Python. Your explanation of tokenization, stop word removal, stemming, and data cleaning techniques provides valuable guidance for beginners and professionals working on natural language processing and machine learning projects efficiently.
speeding ticket lawyer goochland county; 5:32 AM
Company registration in India said...: Great insights on business compliance services. I highly recommend Trademark Registration in Chennai, Copyright Registration in Chennai, Patent Registration in Chennai, Design Registration in Chennai, GST Registration in Chennai, Private Limited Company Registration in Chennai and Partnership Firm Registration in Chennai for startups and growing businesses.; 10:08 PM
deniel said...: Producing high-quality academic work requires strong research, critical thinking, and organisation skills. Many learners explore services like assignment helper resources to gain insights into formatting, argument development, and presenting information clearly within their coursework.; 4:15 AM
cssivasankars said...: Excellent compliance insights! For business growth, explore Registration Services, Corporate Compliance, and IPR Services today.; 9:24 PM