In this blog, we will discuss methods for converting text into features. We will cover:
One Hot Encoding
Count Vectorizer
N-grams
Hashing Vectorizer
Term Frequency-Inverse Document Frequency (TF-IDF)
import pandas as pd
# tokenize the sentence and one-hot encode each token
text = "I love my family"
tokenize_words = text.split()
pd.get_dummies(tokenize_words)
The output has 4 features because the input contained 4 unique words.
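For reference, the resulting one-hot matrix looks roughly like the sketch below; the exact dtype (0/1 integers vs. True/False booleans) depends on the pandas version, so the values are cast to int here.
print(pd.get_dummies(tokenize_words).astype(int))
#    I  family  love  my
# 0  1       0     0   0   <- "I"
# 1  0       0     1   0   <- "love"
# 2  0       0     0   1   <- "my"
# 3  0       1     0   0   <- "family"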
from sklearn.feature_extraction.text import CountVectorizer
class sklearn.feature_extraction.text.CountVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>)
Convert a collection of text documents to a matrix of token counts.
CountVectorizer implements both tokenization and occurrence counting in a single class. Vectorization is the general process of turning a collection of text documents into numerical feature vectors.
text = ["I love my family and in my family there are ten memebers."]
vectorizer = CountVectorizer()
vector = vectorizer.fit_transform(text)
vector
vectorizer.get_feature_names_out()
vectorizer.vocabulary_
vector.toarray()
We can see that the words 'my' and 'family' each appear twice.
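To make the counts easier to read, the sparse matrix can be combined with the learned vocabulary into a DataFrame (a small convenience sketch; get_feature_names_out needs scikit-learn 1.0 or newer):
import pandas as pd
pd.DataFrame(vector.toarray(),
             columns=vectorizer.get_feature_names_out())
# one column per vocabulary token; the row holds its count in the document,
# so the 'family' and 'my' columns both contain 2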
text1 = [
"I love my family",
"In my family there are ten memebers",
"All of them are stay together"
]
vectorizer1 = CountVectorizer()
vector1 = vectorizer1.fit_transform(text1)
vector1
vectorizer1.get_feature_names_out()
vectorizer1.vocabulary_
vector1.toarray()
text = ["I love my family and in my family there are ten memebers."]
ngram_range: tuple (min_n, max_n), default=(1, 1)
The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example an ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. Only applies if analyzer is not callable.
vectorizer2 = CountVectorizer(ngram_range=(2, 2))
vector2 = vectorizer2.fit_transform(text)
vector2
vectorizer2.vocabulary_
vector2.toarray()
We can see "my family" bigram came twice.
text1 = [
"I love my family",
"In my family there are ten memebers",
"All of them are stay together"
]
vectorizer3 = CountVectorizer(ngram_range=(2, 2))
vector3 = vectorizer3.fit_transform(text1)
vector3
vectorizer3.vocabulary_
vector3.toarray()
vectorizer4 = CountVectorizer(ngram_range=(3, 3))
vector4 = vectorizer4.fit_transform(text)
vector4
vectorizer4.vocabulary_
vector4.toarray()
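The two ends of the range do not have to be equal; for example, ngram_range=(1, 2) extracts unigrams and bigrams together. A quick sketch on the same sentence (the variable names below are new, not from the snippets above):
vectorizer_uni_bi = CountVectorizer(ngram_range=(1, 2))  # unigrams and bigrams
vector_uni_bi = vectorizer_uni_bi.fit_transform(text)
vectorizer_uni_bi.vocabulary_
vector_uni_bi.toarray()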
from sklearn.feature_extraction.text import HashingVectorizer
class sklearn.feature_extraction.text.HashingVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), analyzer='word', n_features=1048576, binary=False, norm='l2', alternate_sign=True, dtype=<class 'numpy.float64'>)
Convert a collection of text documents to a matrix of token occurrences.
This text vectorizer implementation uses the hashing trick to find the token string name to feature integer index mapping.
This strategy has several advantages:
I. it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory.
II. it is fast to pickle and un-pickle as it holds no state besides the constructor parameters.
III. it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.
There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):
I. there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
II. there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
III. no IDF weighting as this would render the transformer stateful.
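To build intuition for the hashing trick, the sketch below maps a few tokens to column indices using scikit-learn's MurmurHash utility. This is only an illustration of the idea, not HashingVectorizer's exact internal index computation, which also involves an alternating sign and normalization.
from sklearn.utils.murmurhash import murmurhash3_32
n_features = 10
for token in ["love", "family", "members"]:
    # hash the token string and fold the hash into the fixed number of columns
    index = murmurhash3_32(token, positive=True) % n_features
    print(token, "->", index)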
text = ["I love my family and in my family there are ten memebers."]
vectorizer5 = HashingVectorizer(n_features=10)
vector5 = vectorizer5.fit_transform(text)
print(vector5.shape)
vector5.toarray()
It created a vector of size 10 (the value we set for n_features). Note that the values are not raw counts: by default HashingVectorizer l2-normalizes each row, and alternate_sign=True means some entries can be negative.
text1 = [
"I love my family",
"In my family there are ten memebers",
"All of them are stay together"
]
vectorizer6 = HashingVectorizer(n_features=10)
vector6 = vectorizer6.fit_transform(text1)
print(vector6.shape)
vector6.toarray()
from sklearn.feature_extraction.text import TfidfVectorizer
class sklearn.feature_extraction.text.TfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
Convert a collection of raw documents to a matrix of TF-IDF features.
text = ["I love my family", "In my family there are ten memebers."]
vectorizer7 = TfidfVectorizer()
vectorizer7.fit_transform(text)
vectorizer7.vocabulary_
vectorizer7.idf_
If we observe, the words "my" and "family" appear in both documents. That is why their IDF value is 1.0, which is lower than the IDF of all the other tokens, each of which appears in only one document.
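These idf_ values can be reproduced by hand with scikit-learn's default smoothed IDF formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing the token (a small verification sketch):
import numpy as np
n = 2                              # number of documents
for df in (2, 1):                  # document frequency of a token
    idf = np.log((1 + n) / (1 + df)) + 1
    print(f"df={df} -> idf={idf:.3f}")
# df=2 ("my", "family")   -> idf=1.000
# df=1 (all other tokens) -> idf=1.405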
text1 = [
"I love my family",
"In my family there are ten memebers",
"All of them are stay together"
]
vectorizer8 = TfidfVectorizer()
vectorizer8.fit_transform(text1)
vectorizer8.vocabulary_
vectorizer8.idf_
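TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer; the sketch below builds the same TF-IDF matrix for text1 from those two pieces.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
counts = CountVectorizer().fit_transform(text1)    # raw token counts
tfidf = TfidfTransformer().fit_transform(counts)   # apply IDF weighting and l2 norm
tfidf.toarray()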