In this blog, we will build a sentiment analysis model in PyTorch. For that, we will use the Sentiment140 dataset, which you can download from kaggle.com.
Sentiment140 allows you to discover the sentiment of a brand, product, or topic on Twitter.
The data is a CSV file with the emoticons removed. The data file format has 6 fields:
the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
the id of the tweet (2087)
the date of the tweet (Sat May 16 23:58:44 UTC 2009)
the query (lyx). If there is no query, then this value is NO_QUERY.
the user that tweeted (robotickilldozr)
the text of the tweet (Lyx is cool)
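As a quick aside, you can also load the file with descriptive column names. Here is a minimal sketch (the column names are my own choice based on the field list above; the rest of this post keeps pandas' default integer column labels instead):
import pandas as pd

# hypothetical column names matching the 6 fields described above
cols = ["polarity", "id", "date", "query", "user", "text"]
df_named = pd.read_csv("input/training.1600000.processed.noemoticon.csv",
                       header=None, names=cols, encoding="latin-1")
df_named.head()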
import torch
torch.__version__

# use the GPU if one is available, otherwise fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
device
import pandas as pd
input_file = 'input/training.1600000.processed.noemoticon.csv'
df = pd.read_csv(input_file, header=None, encoding="latin-1")  # the file is not valid UTF-8
df.head()
df.tail()
df.shape
pandas.Series.value_counts
Series.value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)
Return a Series containing counts of unique values.
The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
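For example, on a small hand-made Series:
import pandas as pd

s = pd.Series([4, 0, 0, 4, 4])
s.value_counts()
# 4    3
# 0    2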
df[0].value_counts()
df.info()
df["sentiment_category"] = df[0].astype('category')
df["sentiment_category"]
DataFrame after adding sentiment_category:
df.head()
df.tail()
df["sentiment_category"].cat.codes
df["sentiment"] = df["sentiment_category"].cat.codes
df["sentiment"]
df.head()
df.tail()
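To see what cat.codes is doing, here is a tiny example with made-up values: pandas sorts the distinct category values and assigns each an integer code, so polarity 0 maps to code 0 and polarity 4 maps to code 1.
import pandas as pd

s = pd.Series([0, 4, 4, 0]).astype("category")
s.cat.codes.tolist()  # [0, 1, 1, 0]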
We have too many records (1.6 million), so let us reduce the dataset by saving a random sample of 10,000 records.
df.sample(10000).to_csv("train.csv", header = None, index = None)
Installing torchtext with pip:
pip install torchtext
Installing with conda:
conda install -c derickl torchtext
Note that the torchtext.legacy API used below exists only in torchtext 0.9 through 0.11 (it was removed in 0.12), so you may need to pin an older release, for example pip install torchtext==0.11.0.
import torchtext
from torchtext.legacy import data
torchtext.__version__
We need only two columns: the label and the text of the tweet. We define LABEL as a LabelField, and TWEET as a Field object that uses the spaCy tokenizer and converts all the text to lowercase.
Field
class torchtext.data.Field(sequential=True, use_vocab=True, init_token=None, eos_token=None, fix_length=None, dtype=torch.int64, preprocessing=None, postprocessing=None, lower=False, tokenize=None, tokenizer_language='en', include_lengths=False, batch_first=False, pad_token='&lt;pad&gt;', unk_token='&lt;unk&gt;', pad_first=False, truncate_first=False, stop_words=None, is_target=False)
Defines a datatype together with instructions for converting to Tensor.
Field class models common text processing datatypes that can be represented by tensors. It holds a Vocab object that defines the set of possible values for elements of the field and their corresponding numerical representations. The Field object also holds other parameters relating to how a datatype should be numericalized, such as a tokenization method and the kind of Tensor that should be produced.
If a Field is shared between two columns in a dataset (e.g., question and answer in a QA dataset), then they will have a shared vocabulary.
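A minimal sketch of that shared-vocabulary behaviour, assuming a hypothetical QA dataset with question and answer text columns:
from torchtext.legacy import data

TEXT = data.Field(lower=True)
# attaching the same Field object to both columns gives them one shared vocabulary
qa_fields = [("question", TEXT), ("answer", TEXT)]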
LABEL = data.LabelField()
TWEET = data.Field(tokenize='spacy', tokenizer_language = 'en_core_web_sm', lower = True)
fields = [('score', None), ('id', None), ('date', None), ('query', None), ('name', None),
          ('tweet', TWEET), ('category', None), ('label', LABEL)]
fields
twitterDataset = data.dataset.TabularDataset(
    path="train.csv",
    format="CSV",
    fields=fields,
    skip_header=False)
split(split_ratio=0.7, stratified=False, strata_field='label', random_state=None)
Create train-test(-valid?) splits from the instance’s examples.
Parameters:
split_ratio (float or List of floats) – a number [0, 1] denoting the amount of data to be used for the training split (rest is used for validation), or a list of numbers denoting the relative sizes of train, test and valid splits respectively. If the relative size for valid is missing, only the train-test split is returned. Default is 0.7 (for the train set).
stratified (bool) – whether the sampling should be stratified. Default is False.
strata_field (str) – name of the examples Field stratified over. Default is ‘label’ for the conventional label field.
random_state (tuple) – the random seed used for shuffling. A return value of random.getstate().
Returns:
Datasets for train, validation, and test splits in that order, if the splits are provided.
Return type: Tuple[Dataset]
(train, validation, test) = twitterDataset.split(split_ratio=[0.8, 0.1, 0.1],
                                                 stratified=True, strata_field='label')
len(train)
len(validation)
len(test)
vocab_size = 20000
TWEET.build_vocab(train, max_size = vocab_size)
len(TWEET.vocab)
LABEL.build_vocab(train)
TWEET.vocab.freqs.most_common(10)
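If you check len(TWEET.vocab), you will see 20,002 rather than 20,000: build_vocab adds the special tokens &lt;unk&gt; (for out-of-vocabulary words) and &lt;pad&gt; (for padding) on top of max_size:
len(TWEET.vocab)      # 20002
TWEET.vocab.itos[:2]  # ['<unk>', '<pad>']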
train_dataloader, valid_dataloader, test_dataloader = data.BucketIterator.splits(
    (train, validation, test),
    batch_size=32,
    device=device,
    sort_key=lambda x: len(x.tweet),
    sort_within_batch=False)
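A quick sanity check on the iterators: batches expose the fields by name, and because we did not set batch_first=True on the Field, tensors come out shaped [sequence_length, batch_size] (the sequence length below is indicative):
batch = next(iter(train_dataloader))
batch.tweet.shape   # e.g. torch.Size([31, 32])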
import torch.nn as nn

class MyLSTMModel(nn.Module):
    def __init__(self, hidden_size, embedding_dim, vocab_size):
        super(MyLSTMModel, self).__init__()
        # map token indices to dense vectors
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # a single-layer LSTM encoder
        self.encoder = nn.LSTM(input_size=embedding_dim,
                               hidden_size=hidden_size, num_layers=1)
        # project the final hidden state onto the two classes
        self.predictor = nn.Linear(hidden_size, 2)

    def forward(self, seq):
        output, (hidden, _) = self.encoder(self.embedding(seq))
        preds = self.predictor(hidden.squeeze(0))
        return preds

# size the embedding from len(TWEET.vocab) (= max_size + 2, because of the
# <unk> and <pad> tokens) so the model never sees an out-of-range index
model = MyLSTMModel(hidden_size=100, embedding_dim=300, vocab_size=len(TWEET.vocab))
model.to(device)
A loss function computes a value that estimates how far the output is from the target. The main objective is to reduce this value by changing the weights through backpropagation. The loss function tells us how well our model behaves after each iteration of optimization on the training set.
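As a small, self-contained illustration of the criterion we are about to create (the scores and targets below are made up for the example):
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = torch.tensor([[2.0, 0.5], [0.2, 1.5]])  # raw scores: 2 examples, 2 classes
targets = torch.tensor([0, 1])                   # the correct class for each example
loss_fn(logits, targets)  # a small loss, since both rows favour the right class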
criterion = nn.CrossEntropyLoss()
criterion
import torch.optim as optim
Optimizers define how the weights of the neural network are updated; they take the model parameters and a learning rate as input arguments.
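A toy sketch of what a single optimizer step does (the parameter, target value, and learning rate are made up for illustration):
import torch
import torch.optim as optim

w = torch.tensor([1.0], requires_grad=True)
opt = optim.Adam([w], lr=0.1)
loss = (w - 3.0) ** 2   # a loss that is minimized at w = 3
loss.backward()         # compute the gradient of the loss w.r.t. w
opt.step()              # w moves from 1.0 toward 3.0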
optimizer = optim.Adam(model.parameters(), lr=2e-2)
optimizer
epochs = 10
def train(epochs, model, optimizer, criterion, train_dataloader, valid_dataloader):
    for epoch in range(1, epochs + 1):
        # reset the running losses for this epoch
        training_loss = 0.0
        valid_loss = 0.0
        # put the model in training mode
        model.train()
        for batch_idx, batch in enumerate(train_dataloader):
            # each batch unpacks into [tweet, label]
            tweet, label = batch
            # clear the gradients left over from the previous step
            optimizer.zero_grad()
            # forward pass
            predict = model(tweet)
            # find the loss
            loss = criterion(predict, label)
            # calculate gradients
            loss.backward()
            # update weights
            optimizer.step()
            # accumulate the per-batch loss
            training_loss += loss.item()
        training_loss /= len(train_dataloader)
        # put the model in evaluation mode; no gradients needed for validation
        model.eval()
        with torch.no_grad():
            for batch_idx, batch in enumerate(valid_dataloader):
                tweet, label = batch
                predict = model(tweet)
                loss = criterion(predict, label)
                valid_loss += loss.item()
        valid_loss /= len(valid_dataloader)
        print('Epoch: {}, Training Loss: {:.2f}, Validation Loss: {:.2f}'.format(
            epoch, training_loss, valid_loss))
train(epochs, model, optimizer, criterion, train_dataloader, valid_dataloader)
To classify a new tweet, we first call preprocess(), which performs our spaCy-based tokenization. Because torchtext expects a batch of strings, we wrap the tokenized tweet in a list, giving a list of lists. After that, we call process() to turn the tokens into a tensor based on our already-built vocabulary, and then we feed it into the model.
def classifyTweet(tweet):
    # LabelField orders labels by frequency, so check LABEL.vocab.itos to
    # confirm which index corresponds to which polarity before trusting this map
    categories = {0: "Negative", 1: "Positive"}
    # tokenize, wrap in a batch of one, and numericalize against the vocabulary
    processed = TWEET.process([TWEET.preprocess(tweet)])
    processed = processed.to(device)
    model.eval()
    with torch.no_grad():
        prediction = model(processed)
    print("Prediction: ", prediction)
    pred_cat = categories[prediction.argmax().item()]
    return pred_cat
The model returns a tensor of raw scores, so we take the highest one: argmax() gives its index as a zero-dimensional tensor, and item() turns that into a Python integer, which we use to index into our categories dictionary.
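For example, with made-up scores:
import torch

scores = torch.tensor([0.2, 1.7])
scores.argmax()         # tensor(1)
scores.argmax().item()  # 1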
classifyTweet("Just woke up. Having no school is the best thing")
classifyTweet("Bullshit")
classifyTweet("excellent")