In this blog, we will see how to clean text: strip whitespace, remove punctuation, tokenize text into words and sentences, remove stop words, stem words, tag parts of speech, and finally encode text as numerical features.
text_data = [" John is a nice boy ",
"He is in class 10th student.",
" Today he has pre board exam. "]
text_data
strip_whitespace = [string.strip() for string in text_data]
strip_whitespace
remove_periods = [string.replace(".", "") for string in strip_whitespace]
remove_periods
def remove_capital_letter(string):
    return string.lower()
cleaned_data = [remove_capital_letter(string) for string in remove_periods]
cleaned_data
def capitalization(string):
    return string.upper()
[capitalization(string) for string in cleaned_data]
import unicodedata
import sys
text_data = ['Hello!!!', 'How are you????????????????', 'Are you coming to my place, is that 100% confirmed!!!!',
'We will have lot of fun, right!!??', 'Bye!!!!!!!!!!']
To strip punctuation, we first build a translation table that maps every Unicode punctuation character to None. We use dict.fromkeys(iterable[, value]), a class method that creates a new dictionary with keys taken from iterable and every value set to value (which defaults to None). Note that all keys share the single value instance, so it generally doesn't make sense for value to be a mutable object such as an empty list; to get distinct values, use a dict comprehension instead.
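A quick illustration of that shared-value caveat (a minimal standalone sketch, not part of the cleaning pipeline):
shared = dict.fromkeys(['a', 'b'], [])      # both keys refer to the SAME list object
shared['a'].append(1)
shared                                      # {'a': [1], 'b': [1]}
distinct = {key: [] for key in ['a', 'b']}  # a dict comprehension gives each key its own list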
punctuation = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith('P'))
punctuation
With the table ready, we apply str.translate(table), which returns a copy of the string in which each character has been mapped through the given translation table. The table must be an object that implements indexing via __getitem__(), typically a mapping or sequence. When indexed by a Unicode ordinal (an integer), the table can return a Unicode ordinal or a string (to map the character to one or more other characters), return None (to delete the character from the result), or raise a LookupError (to map the character to itself).
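Here is a tiny self-contained sketch of that behaviour with a hand-made table (the sample string and table are only for illustration):
demo_table = {ord('!'): None, ord('?'): None}   # map '!' and '?' to None, i.e. delete them
'Hello!!! How are you?'.translate(demo_table)   # 'Hello How are you'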
[string.translate(punctuation) for string in text_data]
from bs4 import BeautifulSoup
html = """
<html><head><title>Alice's Adventures in Wonderland Story</title></head>
<body>
<p class="title"><b>Alice's Adventures in Wonderland</b></p>
<p class="story">Alice's Adventures in Wonderland (commonly Alice in Wonderland) is an 1865 English novel by Lewis Carroll.</p>
<p class="story">A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures.</p>
<p class="story">It is seen as an example of the literary nonsense genre. </p>
</body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")  # pass a parser explicitly to avoid BeautifulSoup's warning
soup.find("p", {"class": "title"}).text
for s in soup.find_all('p', {"class": "story"}):
    print(s.text)
Tokenization is the process of breaking raw text into smaller pieces, called tokens, which are typically words or sentences.
from nltk.tokenize import word_tokenize
string = "Alice in Wonderland is an 1865 English novel by Lewis Carroll"
tokenized_words = word_tokenize(string)
tokenized_words
from nltk.tokenize import sent_tokenize
sentences = "Alice in Wonderland is an 1865 English novel by Lewis Carroll. \
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures. \
It is seen as an example of the literary nonsense genre."
sent_tokenize(sentences)
Some extremely common words appear to be of little value in helping select documents that match a user's need, so they are excluded from the vocabulary entirely. These words are called stop words; examples include "a", "an", "the", "is", "from", and "and".
from nltk.corpus import stopwords
We have already tokenized our words, so let us use that.
tokenized_words
stop_words = stopwords.words('english')
len(stop_words)
type(stop_words)
stop_words[0:10]
[word for word in tokenized_words if word not in stop_words]
We can see that the words "is", "in", "an", and "by" have been removed.
In linguistic morphology and information retrieval, stemming is the process of reducing inflected words to their word stem, base or root form—generally a written word form.
A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish.
The stem need not be a word itself; for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.
nltk.stem package
NLTK Stemmers: interfaces used to remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove the affixes required for, e.g., grammatical role, tense, and derivational morphology, leaving only the stem of the word. This is a difficult problem due to irregular words (e.g. common verbs in English), complicated morphological rules, and part-of-speech and sense ambiguities (e.g. ceil- is not the stem of ceiling).
Submodules
nltk.stem.api module
nltk.stem.arlstem module
nltk.stem.arlstem2 module
nltk.stem.cistem module
nltk.stem.isri module
nltk.stem.lancaster module
nltk.stem.porter module
nltk.stem.regexp module
nltk.stem.rslp module
nltk.stem.snowball module
nltk.stem.util module
nltk.stem.wordnet module
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner morphological and inflexional endings from words in English. Its main use is as part of a term normalisation process that is usually done when setting up Information Retrieval systems.
import nltk
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()
tokenized_words = ['cats', 'catlike', 'catty', 'playing', 'played', 'plays']
[porter.stem(word) for word in tokenized_words]
Snowball is a small string-processing language designed for creating stemming algorithms for use in information retrieval; several useful stemmers have been implemented with it.
Snowball Stemmer supports the following languages: Arabic, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish and Swedish.
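Because one class covers all of these languages, switching is only a matter of the constructor argument. A small sketch (the German words below are purely illustrative):
from nltk.stem import SnowballStemmer
SnowballStemmer.languages                    # tuple of supported language names
german_stemmer = SnowballStemmer("german")
[german_stemmer.stem(word) for word in ['katzen', 'laufen', 'gelaufen']]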
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer("english")
tokenized_words
[snowball_stemmer.stem(word) for word in tokenized_words]
nltk.tag package
NLTK Taggers: This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.
A "tag" is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (tag, token). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):
from nltk import pos_tag, word_tokenize
text_data = "John's big idea isn't all that bad."
tagged_text = pos_tag(word_tokenize(text_data))
tagged_text
sentences = "Alice in Wonderland is an 1865 English novel by Lewis Carroll. \
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures. \
It is seen as an example of the literary nonsense genre."
tokenize_sentences = sent_tokenize(sentences)
tokenize_sentences
tagged_texts = []
for sentence in tokenize_sentences:
    tagged_sentence = pos_tag(word_tokenize(sentence))
    tagged_texts.append(tagged_sentence)
tagged_texts
Binarization is the process of transforming data features into binary values (0 or 1).
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
string = "Alice in Wonderland is an 1865 English novel by Lewis Carroll"
tokenized_words = word_tokenize(string)
tokenized_words
lb.fit_transform(tokenized_words)
from sklearn.preprocessing import MultiLabelBinarizer
sentences = "Alice in Wonderland is an 1865 English novel by Lewis Carroll. \
A young girl named Alice falls through a rabbit hole into a fantasy world of anthropomorphic creatures. \
It is seen as an example of the literary nonsense genre."
tagged_sentences = []
for sentence in sent_tokenize(sentences):   # iterate over sentences, not over the characters of the string
    tagged_sentence = nltk.pos_tag(word_tokenize(sentence))
    tagged_sentences.append([tag for word, tag in tagged_sentence])
mlb = MultiLabelBinarizer()
mlb.fit_transform(tagged_sentences)
Raw text, being a sequence of symbols, cannot be fed directly to most algorithms, which expect numerical feature vectors of fixed size rather than raw documents of variable length.
Vectorization is the general process of turning a collection of text documents into numerical feature vectors. The specific strategy of tokenization, counting, and normalization is called the Bag of Words (or "bag of n-grams") representation: documents are described by word occurrences while completely ignoring the relative positions of the words in the document.
CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.
from sklearn.feature_extraction.text import CountVectorizer
text_data = ['Hello', 'How are you?', 'Are you coming to my place, is that 100% confirmed',
'We will have lot of fun, right.', 'Bye']
count = CountVectorizer()
bag_of_words = count.fit_transform(text_data)
bag_of_words
bag_of_words.toarray()
count.vocabulary_.keys()
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
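As a quick sketch of what word n-grams look like before we hand the job to CountVectorizer, NLTK's ngrams helper can list the bigrams of a short sentence (the helper is just for illustration and reuses word_tokenize from above; it is not part of the pipeline):
from nltk import ngrams
list(ngrams(word_tokenize("Alice falls through a rabbit hole"), 2))
# [('Alice', 'falls'), ('falls', 'through'), ('through', 'a'), ('a', 'rabbit'), ('rabbit', 'hole')]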
cv = CountVectorizer(ngram_range=(2,2))
bag_of_words = cv.fit_transform(text_data)
bag_of_words
bag_of_words.toarray()
cv.vocabulary_.keys()
Tf–idf term weighting
In a large text corpus, some words appear very frequently (e.g. "the", "a", "is" in English) and hence carry very little meaningful information about the actual contents of a document. If we fed the raw count data directly to a classifier, those very frequent terms would shadow the frequencies of rarer yet more interesting terms.
In order to re-weight the count features into floating-point values suitable for use by a classifier, it is very common to apply the tf–idf transform.
Tf means term-frequency while tf–idf means term-frequency times inverse document-frequency.
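The sketch below spells out that product for a single term, assuming scikit-learn's default smoothed idf, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the term (TfidfVectorizer additionally L2-normalizes each document vector by default, which this toy calculation skips):
import math
n_documents = 5          # total documents in a toy corpus
df_term = 2              # documents that contain the term
tf_term = 3              # occurrences of the term in one document
idf = math.log((1 + n_documents) / (1 + df_term)) + 1
tf_idf = tf_term * idf   # un-normalized tf-idf weight of that term in that document
tf_idf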
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
transformer = tfidf.fit_transform(text_data)
transformer
transformer.toarray()
tfidf.vocabulary_
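To see which terms were weighted up or down, we can pair each vocabulary term with its learned idf value (a small sketch; get_feature_names_out needs a reasonably recent scikit-learn, older versions expose get_feature_names instead):
dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))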
Each document is now represented by a vector in which every word is weighted by its tf–idf importance rather than its raw count.