In this blog, we will discuss methods for converting text into features. We will cover:
One Hot Encoding
Count Vectorizer
N-grams
Hashing Vectorizer
Term Frequency-Inverse Document Frequency (TF-IDF)
import pandas as pd
# tokenize the sentence and one-hot encode each token
text = "I love my family"
tokenize_words = text.split()
pd.get_dummies(tokenize_words)
The output has 4 features because the input contained 4 unique words.
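For reference, the resulting one-hot matrix looks roughly like the sketch below; the exact dtype (0/1 integers vs. True/False booleans) depends on the pandas version, so the values are cast to int here.
print(pd.get_dummies(tokenize_words).astype(int))
#    I  family  love  my
# 0  1       0     0   0   <- "I"
# 1  0       0     1   0   <- "love"
# 2  0       0     0   1   <- "my"
# 3  0       1     0   0   <- "family"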
from sklearn.feature_extraction.text import CountVectorizer
class sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
Convert a collection of text documents to a matrix of token counts.
CountVectorizer implements both tokenization and occurrence counting in a single class. Vectorization is the general process of turning a collection of text documents into numerical feature vectors.
text = ["I love my family and in my family there are ten memebers."]
vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(text)
vector
vectorizer.get_feature_names_out()
vectorizer.vocabulary_
vector.toarray()
We can see that the words 'my' and 'family' each appear twice.
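To make the counts easier to read, the sparse matrix can be combined with the learned vocabulary into a DataFrame (a small convenience sketch; get_feature_names_out needs scikit-learn 1.0 or newer):
import pandas as pd
pd.DataFrame(vector.toarray(),
             columns=vectorizer.get_feature_names_out())
# one column per vocabulary token; the row holds its count in the document,
# so the 'family' and 'my' columns both contain 2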
text1 = [
"I love my family",
"In my family there are ten memebers",
"All of them are stay together"
]
vectorizer1 = CountVectorizer()
vector1 = vectorizer1.fit_transform(text1)
vector1
vectorizer1.get_feature_names_out()
vectorizer1.vocabulary_
vector1.toarray()
text = ["I love my family and in my family there are ten memebers."]
ngram_range: tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
vectorizer2 = CountVectorizer(ngram_range=(2, 2))
vector2 = vectorizer2.fit_transform(text)
vector2
vectorizer2.vocabulary_
vector2.toarray()
We can see "my family" bigram came twice.
text1 = [
"I love my family",
"In my family there are ten memebers",
"All of them are stay together"
]
vectorizer3 = CountVectorizer(ngram_range=(2, 2))
vector3 = vectorizer3.fit_transform(text1)
vector3
vectorizer3.vocabulary_
vector3.toarray()
vectorizer4 = CountVectorizer(ngram_range=(3, 3))
vector4 = vectorizer4.fit_transform(text)
vector4
vectorizer4.vocabulary_
vector4.toarray()
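The two ends of the range do not have to be equal; for example, ngram_range=(1, 2) extracts unigrams and bigrams together. A quick sketch on the same sentence (the variable names below are new, not from the snippets above):
vectorizer_uni_bi = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vector_uni_bi = vectorizer_uni_bi.fit_transform(text)
vectorizer_uni_bi.vocabulary_
vector_uni_bi.toarray()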
from sklearn.feature_extraction.text import HashingVectorizer
class sklearn.feature_extraction.text.HashingVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)
Convert a collection of text documents to a matrix of token occurrences.
This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.
This strategy has several advantages:
I. it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory.
II. it is fast to pickle and un-pickle as it holds no state besides the constructor parameters.
III. it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
I. there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
II. there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
III. no IDF weighting as this would render the transformer stateful.
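To build intuition for the hashing trick, the sketch below maps a few tokens to column indices using scikit-learn's MurmurHash utility. This is only an illustration of the idea, not HashingVectorizer's exact internal index computation, which also involves an alternating sign and normalization.
from sklearn.utils.murmurhash import murmurhash3_32
n_features = 10
for token in ["love", "family", "members"]:
    # hash the token string and fold the hash into the fixed number of columns
    index = murmurhash3_32(token, positive=True) % n_features
    print(token, "->", index)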
text = ["I love my family and in my family there are ten memebers."]
vectorizer5 = HashingVectorizer(n_features=10)
vector5 = vectorizer5.fit_transform(text)
print(vector5.shape)
vector5.toarray()
It created a vector of size 10 (the value we set for n_features). Note that the values are not raw counts: by default HashingVectorizer l2-normalizes each row, and alternate_sign=True means some entries can be negative.
text1 = [
"I love my family",
"In my family there are ten memebers",
"All of them are stay together"
]
vectorizer6 = HashingVectorizer(n_features=10)
vector6 = vectorizer6.fit_transform(text1)
print(vector6.shape)
vector6.toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
class sklearn.feature_extraction.text.TfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
Convert a collection of raw documents to a matrix of TF-IDF features.
text = ["I love my family", "In my family there are ten memebers."]
vectorizer7 = TfidfVectorizer()
vectorizer7.fit_transform(text)
vectorizer7.vocabulary_
vectorizer7.idf_
If we observe, the words "my" and "family" appear in both documents. That is why their IDF value is 1.0, which is lower than the IDF of all the other tokens, each of which appears in only one document.
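These idf_ values can be reproduced by hand with scikit-learn's default smoothed IDF formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the token (a small verification sketch):
import numpy as np
n = 2                              # number of documents
for df in (2, 1):                  # document frequency of a token
    idf = np.log((1 + n) / (1 + df)) + 1
    print(f"df={df} -> idf={idf:.3f}")
# df=2 ("my", "family")   -> idf=1.000
# df=1 (all other tokens) -> idf=1.405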
text1 = [
"I love my family",
"In my family there are ten memebers",
"All of them are stay together"
]
vectorizer8 = TfidfVectorizer()
vectorizer8.fit_transform(text1)
vectorizer8.vocabulary_
vectorizer8.idf_
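TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer; the sketch below builds the same TF-IDF matrix for text1 from those two pieces.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
counts = CountVectorizer().fit_transform(text1)    # raw token counts
tfidf = TfidfTransformer().fit_transform(counts)   # apply IDF weighting and l2 norm
tfidf.toarray()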