In this blog, we will see how to clean text: strip whitespace, remove punctuation, tokenize text into words and sentences, remove stop words, stem words, tag parts of speech, and finally encode text as numerical features.
text_data = [" John is a nice boy ",
"He is in class 10th student.",
" Today he has pre board exam. "]
text_data
strip_whitespace = [string.strip() for string in text_data]
strip_whitespace
remove_periods = [string.replace(".", "") for string in strip_whitespace]
remove_periods
def remove_capital_letter(string):
    return string.lower()
cleaned_data = [remove_capital_letter(string) for string in remove_periods]
cleaned_data
def capitalization(string):
    return string.upper()
[capitalization(string) for string in cleaned_data]
import unicodedata
import sys
text_data = ['Hello!!!', 'How are you????????????????', 'Are you coming to my place, is that 100% confirmed!!!!',
'We will have lot of fun, right!!??', 'Bye!!!!!!!!!!']
To strip punctuation, we first build a translation table that maps every Unicode punctuation character to None. We use dict.fromkeys(iterable[, value]), a class method that creates a new dictionary with keys taken from iterable and every value set to value (which defaults to None). Note that all keys share the single value instance, so it generally doesn't make sense for value to be a mutable object such as an empty list; to get distinct values, use a dict comprehension instead.
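A quick illustration of that shared-value caveat (a minimal standalone sketch, not part of the cleaning pipeline):
shared = dict.fromkeys(['a', 'b'], [])      # both keys refer to the SAME list object
shared['a'].append(1)
shared                                      # {'a': [1], 'b': [1]}
distinct = {key: [] for key in ['a', 'b']}  # a dict comprehension gives each key its own list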
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
punctuation
With the table ready, we apply str.translate(table), which returns a copy of the string in which each character has been mapped through the given translation table. The table must be an object that implements indexing via __getitem__(), typically a mapping or sequence. When indexed by a Unicode ordinal (an integer), the table can return a Unicode ordinal or a string (to map the character to one or more other characters), return None (to delete the character from the result), or raise a LookupError (to map the character to itself).
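Here is a tiny self-contained sketch of that behaviour with a hand-made table (the sample string and table are only for illustration):
demo_table = {ord('!'): None, ord('?'): None}   # map '!' and '?' to None, i.e. delete them
'Hello!!! How are you?'.translate(demo_table)   # 'Hello How are you'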
[string.translate(punctuation) for string in text_data]
from bs4 import BeautifulSoup
html = """
<html><head><title>Alice's Adventures in Wonderland Story</title></head>
<body>
<p class="title"><b>Alice's Adventures in Wonderland</b></p>
<p class="story">Alice's Adventures in Wonderland (commonly Alice in Wonderland) is an 1865 English novel by Lewis Carroll.</p>
<p class="story">A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures.</p>
<p class="story">It is seen as an example of the literary nonsense genre. </p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")  # pass a parser explicitly to avoid BeautifulSoup's warning
soup.find("p", {"class": "title"}).text
for s in soup.find_all('p', {"class": "story"}):
    print(s.text)
Tokenization is the process of breaking raw text into smaller pieces, called tokens, which are typically words or sentences.
from nltk.tokenize import word_tokenize
string = "Alice in Wonderland is an 1865 English novel by Lewis Carroll"
tokenized_words = word_tokenize(string)
tokenized_words
from nltk.tokenize import sent_tokenize
sentences = "Alice in Wonderland is an 1865 English novel by Lewis Carroll. \
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures. \
It is seen as an example of the literary nonsense genre."
sent_tokenize(sentences)
Some extremely common words appear to be of little value in helping select documents that match a user's need, so they are excluded from the vocabulary entirely. These words are called stop words; examples include "a", "an", "the", "is", "from", and "and".
from nltk.corpus import stopwords
We have already tokenized our words, so let us use that.
tokenized_words
stop_words = stopwords.words('english')
len(stop_words)
type(stop_words)
stop_words[0:10]
[word for word in tokenized_words if word not in stop_words]
We can see that the words "is", "in", "an", and "by" have been removed.
In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form—generally a written word form.
A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish.
The stem need not be a word itself; for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.
nltk.stem package
NLTK Stemmers: interfaces used to remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove the affixes required for, e.g., grammatical role, tense, and derivational morphology, leaving only the stem of the word. This is a difficult problem due to irregular words (e.g. common verbs in English), complicated morphological rules, and part-of-speech and sense ambiguities (e.g. ceil- is not the stem of ceiling).
Submodules
nltk.stem.api module
nltk.stem.arlstem module
nltk.stem.arlstem2 module
nltk.stem.cistem module
nltk.stem.isri module
nltk.stem.lancaster module
nltk.stem.porter module
nltk.stem.regexp module
nltk.stem.rslp module
nltk.stem.snowball module
nltk.stem.util module
nltk.stem.wordnet module
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
import nltk
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
tokenized_words = ['cats', 'catlike', 'catty', 'playing', 'played', 'plays']
[porter.stem(word) for word in tokenized_words]
Snowball is a small string-processing language designed for creating stemming algorithms for use in information retrieval; several useful stemmers have been implemented with it.
Snowball Stemmer supports the following languages: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.
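Because one class covers all of these languages, switching is only a matter of the constructor argument. A small sketch (the German words below are purely illustrative):
from nltk.stem import SnowballStemmer
SnowballStemmer.languages                    # tuple of supported language names
german_stemmer = SnowballStemmer("german")
[german_stemmer.stem(word) for word in ['katzen', 'laufen', 'gelaufen']]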
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
tokenized_words
[snowball_stemmer.stem(word) for word in tokenized_words]
nltk.tag package
NLTK Taggers: This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.
A "tag" is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (tag, token). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):
from nltk import pos_tag, word_tokenize
text_data = "John's big idea isn't all that bad."
tagged_text = pos_tag(word_tokenize(text_data))
tagged_text
sentences = "Alice in Wonderland is an 1865 English novel by Lewis Carroll. \
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures. \
It is seen as an example of the literary nonsense genre."
tokenize_sentences = sent_tokenize(sentences)
tokenize_sentences
tagged_texts = []
for sentence in tokenize_sentences:
    tagged_sentence = pos_tag(word_tokenize(sentence))
    tagged_texts.append(tagged_sentence)
tagged_texts
Binarization is the process of transforming data features into binary values (0 or 1).
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
string = "Alice in Wonderland is an 1865 English novel by Lewis Carroll"
tokenized_words = word_tokenize(string)
tokenized_words
lb.fit_transform(tokenized_words)
from sklearn.preprocessing import MultiLabelBinarizer
sentences = "Alice in Wonderland is an 1865 English novel by Lewis Carroll. \
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures. \
It is seen as an example of the literary nonsense genre."
tagged_sentences = []
for sentence in sent_tokenize(sentences):   # iterate over sentences, not over the characters of the string
    tagged_sentence = nltk.pos_tag(word_tokenize(sentence))
    tagged_sentences.append([tag for word, tag in tagged_sentence])
mlb = MultiLabelBinarizer()
mlb.fit_transform(tagged_sentences)
Raw text, being a sequence of symbols, cannot be fed directly to most algorithms, which expect numerical feature vectors of fixed size rather than raw documents of variable length.
Vectorization is the general process of turning a collection of text documents into numerical feature vectors. The specific strategy of tokenization, counting, and normalization is called the Bag of Words (or "bag of n-grams") representation: documents are described by word occurrences while completely ignoring the relative positions of the words in the document.
CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.
from sklearn.feature_extraction.text import CountVectorizer
text_data = ['Hello', 'How are you?', 'Are you coming to my place, is that 100% confirmed',
'We will have lot of fun, right.', 'Bye']
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
bag_of_words
bag_of_words.toarray()
count.vocabulary_.keys()
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
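As a quick sketch of what word n-grams look like before we hand the job to CountVectorizer, NLTK's ngrams helper can list the bigrams of a short sentence (the helper is just for illustration and reuses word_tokenize from above; it is not part of the pipeline):
from nltk import ngrams
list(ngrams(word_tokenize("Alice falls through a rabbit hole"), 2))
# [('Alice', 'falls'), ('falls', 'through'), ('through', 'a'), ('a', 'rabbit'), ('rabbit', 'hole')]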
cv = CountVectorizer(ngram_range=(2,2))
bag_of_words = cv.fit_transform(text_data)
bag_of_words
bag_of_words.toarray()
cv.vocabulary_.keys()
Tf–idf term weighting
In a large text corpus, some words appear very frequently (e.g. "the", "a", "is" in English) and hence carry very little meaningful information about the actual contents of a document. If we fed the raw count data directly to a classifier, those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating-point values suitable for use by a classifier, it is very common to apply the tf–idf transform.
Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency.
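The sketch below spells out that product for a single term, assuming scikit-learn's default smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term (TfidfVectorizer additionally L2-normalizes each document vector by default, which this toy calculation skips):
import math
n_documents = 5          # total documents in a toy corpus
df_term = 2              # documents that contain the term
tf_term = 3              # occurrences of the term in one document
idf = math.log((1 + n_documents) / (1 + df_term)) + 1
tf_idf = tf_term * idf   # un-normalized tf-idf weight of that term in that document
tf_idf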
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
transformer = tfidf.fit_transform(text_data)
transformer
transformer.toarray()
tfidf.vocabulary_
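To see which terms were weighted up or down, we can pair each vocabulary term with its learned idf value (a small sketch; get_feature_names_out needs a reasonably recent scikit-learn, older versions expose get_feature_names instead):
dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))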
Each document is now represented by a vector in which every word is weighted by its tf–idf importance rather than its raw count.