In this blog, we will build a sentiment analysis model in PyTorch. For that, we will use the Sentiment140 dataset, which you can download from kaggle.com.
Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter.
The data is a CSV file with the emoticons removed. The data file format has 6 fields:
the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
the id of the tweet (2087)
the date of the tweet (Sat May 16 23:58:44 UTC 2009)
the query (lyx). If there is no query, then this value is NO_QUERY.
the user that tweeted (robotickilldozr)
the text of the tweet (Lyx is cool)
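As a quick aside, you can also load the file with descriptive column names. Here is a minimal sketch (the column names are my own choice based on the field list above; the rest of this post keeps pandas' default integer column labels instead):
import pandas as pd

# hypothetical column names matching the 6 fields described above
cols = ["polarity", "id", "date", "query", "user", "text"]
df_named = pd.read_csv("input/training.1600000.processed.noemoticon.csv",
                       header=None, names=cols, encoding="latin-1")
df_named.head()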
import torch
torch.__version__

# use the GPU if one is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
device
import pandas as pd
input_file = 'input/training.1600000.processed.noemoticon.csv'
df = pd.read_csv(input_file, header=None, encoding="latin-1")  # the file is not valid UTF-8
df.head()
df.tail()
df.shape
pandas.Series.value_counts
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
Return a Series containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
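For example, on a small hand-made Series:
import pandas as pd

s = pd.Series([4, 0, 0, 4, 4])
s.value_counts()
# 4    3
# 0    2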
df[0].value_counts()
df.info()
df["sentiment_category"] = df[0].astype('category')
df["sentiment_category"]
DataFrame after adding sentiment_category:
df.head()
df.tail()
df["sentiment_category"].cat.codes
df["sentiment"] = df["sentiment_category"].cat.codes
df["sentiment"]
df.head()
df.tail()
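To see what cat.codes is doing, here is a tiny example with made-up values: pandas sorts the distinct category values and assigns each an integer code, so polarity 0 maps to code 0 and polarity 4 maps to code 1.
import pandas as pd

s = pd.Series([0, 4, 4, 0]).astype("category")
s.cat.codes.tolist()  # [0, 1, 1, 0]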
We have too many records (1.6 million), so let us reduce the dataset by saving a random sample of 10,000 records.
df.sample(10000).to_csv("train.csv", header = None, index = None)
Installing torchtext with pip:
pip install torchtext
Installing with conda:
conda install -c derickl torchtext
Note that the torchtext.legacy API used below exists only in torchtext 0.9 through 0.11 (it was removed in 0.12), so you may need to pin an older release, for example pip install torchtext==0.11.0.
import torchtext
from torchtext.legacy import data
torchtext.__version__
We need only two columns: the label and the text of the tweet. We define LABEL as a LabelField, and TWEET as a Field object that uses the spaCy tokenizer and converts all the text to lowercase.
Field
class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='&lt;pad&gt;', unk_token='&lt;unk&gt;', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
Defines a datatype together with instructions for converting to Tensor.
Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.
If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
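A minimal sketch of that shared-vocabulary behaviour, assuming a hypothetical QA dataset with question and answer text columns:
from torchtext.legacy import data

TEXT = data.Field(lower=True)
# attaching the same Field object to both columns gives them one shared vocabulary
qa_fields = [("question", TEXT), ("answer", TEXT)]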
LABEL = data.LabelField()
TWEET = data.Field(tokenize='spacy', tokenizer_language = 'en_core_web_sm', lower = True)
fields = [('score', None), ('id', None), ('date', None), ('query', None), ('name', None),
          ('tweet', TWEET), ('category', None), ('label', LABEL)]
fields
twitterDataset = data.dataset.TabularDataset(
    path="train.csv",
    format="CSV",
    fields=fields,
    skip_header=False)
split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
Create train-test(-valid?) splits from the instance’s examples.
Parameters:
split_ratio (float or List of floats) – a number [0, 1] denoting the amount of data to be used for the training split (rest is used for validation), or a list of numbers denoting the relative sizes of train, test and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default is 0.7 (for the train set).
stratified (bool) – whether the sampling should be stratified. Default is False.
strata_field (str) – name of the examples Field stratified over. Default is ‘label’ for the conventional label field.
random_state (tuple) – the random seed used for shuffling. A return value of random.getstate().
Returns:
Datasets for train, validation, and test splits in that order, if the splits are provided.
Return type: Tuple[Dataset]
(train, validation, test) = twitterDataset.split(split_ratio=[0.8, 0.1, 0.1],
                                                 stratified=True, strata_field='label')
len(train)
len(validation)
len(test)
vocab_size = 20000
TWEET.build_vocab(train, max_size = vocab_size)
len(TWEET.vocab)
LABEL.build_vocab(train)
TWEET.vocab.freqs.most_common(10)
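If you check len(TWEET.vocab), you will see 20,002 rather than 20,000: build_vocab adds the special tokens &lt;unk&gt; (for out-of-vocabulary words) and &lt;pad&gt; (for padding) on top of max_size:
len(TWEET.vocab)      # 20002
TWEET.vocab.itos[:2]  # ['<unk>', '<pad>']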
train_dataloader, valid_dataloader, test_dataloader = data.BucketIterator.splits(
    (train, validation, test),
    batch_size=32,
    device=device,
    sort_key=lambda x: len(x.tweet),
    sort_within_batch=False)
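A quick sanity check on the iterators: batches expose the fields by name, and because we did not set batch_first=True on the Field, tensors come out shaped [sequence_length, batch_size] (the sequence length below is indicative):
batch = next(iter(train_dataloader))
batch.tweet.shape   # e.g. torch.Size([31, 32])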
import torch.nn as nn

class MyLSTMModel(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size):
        super(MyLSTMModel, self).__init__()
        # map token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # a single-layer LSTM encoder
        self.encoder = nn.LSTM(input_size=embedding_dim,
                               hidden_size=hidden_size, num_layers=1)
        # project the final hidden state onto the two classes
        self.predictor = nn.Linear(hidden_size, 2)

    def forward(self, seq):
        output, (hidden, _) = self.encoder(self.embedding(seq))
        preds = self.predictor(hidden.squeeze(0))
        return preds

# size the embedding from len(TWEET.vocab) (= max_size + 2, because of the
# <unk> and <pad> tokens) so the model never sees an out-of-range index
model = MyLSTMModel(hidden_size=100, embedding_dim=300, vocab_size=len(TWEET.vocab))
model.to(device)
A loss function computes a value that estimates how far the output is from the target. The main objective is to reduce this value by changing the weights through backpropagation. The loss function tells us how well our model behaves after each iteration of optimization on the training set.
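As a small, self-contained illustration of the criterion we are about to create (the scores and targets below are made up for the example):
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5], [0.2, 1.5]])  # raw scores: 2 examples, 2 classes
targets = torch.tensor([0, 1])                   # the correct class for each example
loss_fn(logits, targets)  # a small loss, since both rows favour the right class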
criterion = nn.CrossEntropyLoss()
criterion
import torch.optim as optim
Optimizers define how the weights of the neural network are updated; they take the model parameters and a learning rate as input arguments.
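A toy sketch of what a single optimizer step does (the parameter, target value, and learning rate are made up for illustration):
import torch
import torch.optim as optim

w = torch.tensor([1.0], requires_grad=True)
opt = optim.Adam([w], lr=0.1)
loss = (w - 3.0) ** 2   # a loss that is minimized at w = 3
loss.backward()         # compute the gradient of the loss w.r.t. w
opt.step()              # w moves from 1.0 toward 3.0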
optimizer = optim.Adam(model.parameters(), lr=2e-2)
optimizer
epochs = 10
def train(epochs, model, optimizer, criterion, train_dataloader, valid_dataloader):
    for epoch in range(1, epochs + 1):
        # reset the running losses for this epoch
        training_loss = 0.0
        valid_loss = 0.0
        # put the model in training mode
        model.train()
        for batch_idx, batch in enumerate(train_dataloader):
            # each batch unpacks into [tweet, label]
            tweet, label = batch
            # clear the gradients left over from the previous step
            optimizer.zero_grad()
            # forward pass
            predict = model(tweet)
            # find the loss
            loss = criterion(predict, label)
            # calculate gradients
            loss.backward()
            # update weights
            optimizer.step()
            # accumulate the per-batch loss
            training_loss += loss.item()
        training_loss /= len(train_dataloader)
        # put the model in evaluation mode; no gradients needed for validation
        model.eval()
        with torch.no_grad():
            for batch_idx, batch in enumerate(valid_dataloader):
                tweet, label = batch
                predict = model(tweet)
                loss = criterion(predict, label)
                valid_loss += loss.item()
        valid_loss /= len(valid_dataloader)
        print('Epoch: {}, Training Loss: {:.2f}, Validation Loss: {:.2f}'.format(
            epoch, training_loss, valid_loss))
train(epochs, model, optimizer, criterion, train_dataloader, valid_dataloader)
To classify a new tweet, we first call preprocess(), which performs our spaCy-based tokenization. Because torchtext expects a batch of strings, we wrap the tokenized tweet in a list, giving a list of lists. After that, we call process() to turn the tokens into a tensor based on our already-built vocabulary, and then we feed it into the model.
def classifyTweet(tweet):
    # LabelField orders labels by frequency, so check LABEL.vocab.itos to
    # confirm which index corresponds to which polarity before trusting this map
    categories = {0: "Negative", 1: "Positive"}
    # tokenize, wrap in a batch of one, and numericalize against the vocabulary
    processed = TWEET.process([TWEET.preprocess(tweet)])
    processed = processed.to(device)
    model.eval()
    with torch.no_grad():
        prediction = model(processed)
    print("Prediction: ", prediction)
    pred_cat = categories[prediction.argmax().item()]
    return pred_cat
The model returns a tensor of raw scores, so we take the highest one: argmax() gives its index as a zero-dimensional tensor, and item() turns that into a Python integer, which we use to index into our categories dictionary.
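For example, with made-up scores:
import torch

scores = torch.tensor([0.2, 1.7])
scores.argmax()         # tensor(1)
scores.argmax().item()  # 1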
classifyTweet("Just woke up. Having no school is the best thing")
classifyTweet("Bullshit")
classifyTweet("excellent")