Table of Contents
- What is sentiment analysis
- Load Data
- Remove missing data
- Clean, tokenize and get frequencies of positive reviews
- Visualize positive reviews
- Clean, tokenize and get frequencies of negative reviews
- Visualize negative reviews
- Convert Text into numeric matrix
- Convert sentiment column to numeric
- Split data into train and test
- Create model
- Compile model and print summary
- Train model
- The accuracy of the trained model
- Predict sentiment with new reviews
Sentiment analysis is a text analysis technique that detects the emotion expressed in text, such as positive, negative or neutral. It is often used in business to detect sentiment in social data, carry out market research, understand customers, improve customer service, and analyze product reviews and feedback.
In this document we are going to analyze hotel reviews and determine the customers' sentiment.
I am using the reviews.csv file; you can download it if you want to follow along.
inputFolder = "input/"
filePath = inputFolder + 'reviews.csv'
filePath
read_csv() is an important pandas function which reads a comma-separated values (csv) file and converts it into a DataFrame.
import pandas as pd
df = pd.read_csv(filePath)
df.head()
print(df.shape)
print(df.dtypes)
DataFrame.isna(): Detect missing values. It returns a boolean same-sized object indicating whether each value is NA. NA values, such as None or numpy.NaN, get mapped to True; everything else gets mapped to False.
df['Sentiment'].isna()
Chaining sum() counts the NaN values in the df['Sentiment'] column.
df['Sentiment'].isna().sum()
Pandas dropna() method allows the user to analyze and drop Rows/Columns with Null values.
df = df.dropna()
print(df.shape)
print(df.head())
#select the rows where df['Sentiment'] == 'positive'
positive_reviews = df.loc[df['Sentiment'] == 'positive']
print(positive_reviews.head())
#import regular expression module
import re
re module provides regular expression matching operations similar to those found in Perl. A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).
\w matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched.
[^\w] will match any character that is not a word character, i.e. anything outside [a-zA-Z0-9_] in ASCII terms.
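As a quick illustration on a made-up sample sentence (not taken from the dataset), the pattern replaces every character that is neither a word character nor a space with a space:
import re
sample = "The room was great!! Wi-Fi didn't work :("
print(re.sub(r"[^\w ]", " ", sample.lower()))
#-> 'the room was great   wi fi didn t work   '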
positive_reviews = [re.sub(r"[^\w ]", " ", x.lower()) for x in positive_reviews['Text']]
positive_reviews = ' '.join(positive_reviews)
positive_reviews[:500]
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
import collections
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
Stopwords are the most common words in a language. NLTK supports stop word removal, and we can find the lists of stop words in its corpus module. To remove stop words from a sentence, split the text into words and drop every word that exists in the list of stop words provided by NLTK. NLTK has stop word lists for 16 languages, which can be downloaded and then used to filter out these words.
The tok-tok tokenizer is a simple, general tokenizer that expects one sentence per line of input; thus only the final period is tokenized.
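If the NLTK stop word lists are not already installed, the filtering step below will fail with a LookupError; a minimal one-time setup (assuming an internet connection):
import nltk
#download the stop word lists once (a no-op if they are already present)
nltk.download('stopwords')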
#create the instance of ToktokTokenizer()
tokenizer = ToktokTokenizer()
#tokenize positive reviews
positive_tokens = tokenizer.tokenize(positive_reviews)
#print ten words
print(positive_tokens[:10])
stop_words = set(stopwords.words('english'))
positive_tokens = [word for word in positive_tokens if word not in stop_words]
positive_tokens[:20]
word_counts = collections.Counter(positive_tokens)
print(word_counts)
import matplotlib.pyplot as plt
plt.figure(figsize=(18, 10))
top_words = word_counts.most_common(20)
plt.bar([word for word, count in top_words], [count for word, count in top_words])
plt.show()
NLTK in Python has a function FreqDist which gives you the frequency of words within a text. FreqDist runs on an iterable of tokens (or a mapping of token counts, such as the Counter built above), not on a raw string, so the input must be tokenized first. The plot() method can be called to draw the frequency distribution as a graph for the most common tokens in the text.
import nltk
plt.figure(figsize=(18, 10))
fd = nltk.FreqDist(word_counts)
fd.plot(30, title='Frequency distribution for top 30 words')
plt.show()
from wordcloud import WordCloud, STOPWORDS
import numpy as np
from PIL import Image
A word cloud is a visual representation of text data, typically used to depict keyword metadata (tags) on websites, or to visualize free form text. Tags are usually single words, and the importance of each tag is shown with font size or color.
pic_mask = np.array(Image.open("input/rooms.jpg"))
wordcloud = WordCloud(stopwords=STOPWORDS,
background_color='white',
width=1600, height=800,
mask=pic_mask
).generate(positive_reviews)
plt.figure( figsize=(18,8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
negative_reviews = df.loc[df['Sentiment'] == 'negative']
negative_reviews.head()
negative_reviews = [re.sub(r"[^\w ]", " ", x.lower()) for x in negative_reviews['Text']]
negative_reviews = ' '.join(negative_reviews)
negative_reviews[:500]
tokenizer = ToktokTokenizer()
negative_tokens = tokenizer.tokenize(negative_reviews)
print(negative_tokens[:10])
stop_words = set(stopwords.words('english'))
negative_tokens = [word for word in negative_tokens if word not in stop_words]
negative_tokens[:20]
word_counts = collections.Counter(negative_tokens)
print(word_counts)
plt.figure(figsize=(18, 10))
top_words = word_counts.most_common(20)
plt.bar([word for word, count in top_words], [count for word, count in top_words])
plt.show()
plt.figure(figsize=(18, 10))
fd = nltk.FreqDist(word_counts)
fd.plot(30, title='Frequency distribution for top 30 words')
plt.show()
pic_mask = np.array(Image.open("input/rooms.jpg"))
negative_wordcloud = WordCloud(stopwords=STOPWORDS,
background_color='white',
width=1600, height=800,
mask=pic_mask
).generate(negative_reviews)
plt.figure( figsize=(18,8))
plt.imshow(negative_wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
print(df['Text'].values)
The Tokenizer class allows you to vectorize a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token can be binary, based on word count, based on tf-idf, etc.
Arguments:
- num_words: the maximum number of words to keep, based on word frequency. Only the most common `num_words-1` words will be kept.
- filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the `'` character.
- lower: boolean. Whether to convert the texts to lowercase.
- split: str. Separator for word splitting.
- char_level: if True, every character will be treated as a token.
- oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.
By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the `'` character). These sequences are then split into lists of tokens, which are then indexed or vectorized.
fit_on_texts(texts): Updates internal vocabulary based on a list of texts. In the case where texts contains lists, we assume each entry of the lists to be a token.
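A quick illustration, using the Tokenizer imported above on two made-up sentences (not part of the dataset), of how fit_on_texts and texts_to_sequences behave:
toy_texts = ['the room was clean', 'the staff was rude']
toy_tokenizer = Tokenizer(num_words=20, split=' ')
toy_tokenizer.fit_on_texts(toy_texts)
#word_index maps each word to an integer, most frequent words first
print(toy_tokenizer.word_index)                    #e.g. {'the': 1, 'was': 2, 'room': 3, ...}
print(toy_tokenizer.texts_to_sequences(toy_texts)) #e.g. [[1, 3, 2, 4], [1, 5, 2, 6]]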
max_features = 2000
tokenizer = Tokenizer(num_words = max_features, split = ' ')
tokenizer.fit_on_texts(df['Text'].values)
texts_to_sequences(texts): Transforms each text in texts to a sequence of integers.
X = tokenizer.texts_to_sequences(df['Text'].values)
X = pad_sequences(X)  #pad each sequence with leading zeros so all reviews have the same length
X
X.ndim
X.shape
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None): Convert categorical variable into dummy/indicator variables.
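A toy example with made-up labels (using the pandas imported above) showing that the dummy columns come out in alphabetical order, which is what lets us interpret the model's output columns later:
toy_labels = pd.Series(['positive', 'negative', 'positive'])
toy_dummies = pd.get_dummies(toy_labels)
print(toy_dummies.columns.tolist())  #['negative', 'positive'] - alphabetical order
print(toy_dummies.values)            #0/1 indicators (booleans in recent pandas versions)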
y = pd.get_dummies(df['Sentiment']).values
y[:10]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 50)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM, SpatialDropout1D, Dropout
Keras offers an Embedding layer that can be used in neural networks on text data. It requires the input to be integer encoded, so that each word is represented by a unique integer.
When we create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on).
We must specify three arguments:
- input_dim: Integer. Size of the vocabulary, i.e. maximum integer index + 1.
- output_dim: Integer. Dimension of the dense embedding.
- input_length: Length of input sequences, when it is constant. This argument is required if you are going to connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).
model = Sequential()
output_dim = 128
model.add(Embedding(max_features, output_dim, input_length = X.shape[1]))
model.add(Dropout(0.4))
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam', metrics = ['accuracy'])
print(model.summary())
By setting verbose to 0, 1 or 2 you choose how you want to 'see' the training progress for each epoch.
verbose=0 will show you nothing (silent)
verbose=1 will show you an animated progress bar
verbose=2 will just print one line per epoch, like this:
Epoch 1/10
- 0s - loss: 0.6853 - accuracy: 0.5764
model.fit(X_train, y_train, epochs = 20, batch_size = 25, verbose = 2)
# evaluate the model on the held-out test data
loss, accuracy = model.evaluate(X_test, y_test)
print('Loss: %f' % loss)
print('Accuracy: %f' % (accuracy*100))
reviews = ['Excellent service', 'Very uncomfortable thin mattress very bad', 'worst service']
reviews
reviews = tokenizer.texts_to_sequences(reviews)
reviews = pad_sequences(reviews, maxlen=X.shape[1], dtype='int32', value=0)  #pad to the same length the model was trained with
print(reviews)
sentiments = model.predict(reviews, batch_size=10, verbose = 1)
sentiments
import numpy as np
Each prediction is a pair of scores; because the dummy columns were created in alphabetical order, index 0 corresponds to "negative" and index 1 to "positive".
def printSentiment(sentiment):
    #index 0 = 'negative', index 1 = 'positive' (alphabetical order of the dummy columns)
    if np.argmax(sentiment) == 0:
        return "negative"
    return "positive"
[printSentiment(sentiment) for sentiment in sentiments]