Twitter Sentiment Analysis in Python using Transformers
Twitter is a social media platform whose posts, when analyzed, can provide plenty of useful information. In this article, we will show you how to conduct Twitter sentiment analysis using Python and the most advanced neural networks of today, transformers, with the sentiment140 dataset as an example.
Transformers
The transformer architecture was created in 2017, and since then, models based on it have become the most popular and widely used. Earlier, we gave a detailed overview of transformers in Computer Vision. Today we will be using one of the main transformer-based models, the Robustly Optimized BERT Pretraining Approach (RoBERTa).
About RoBERTa
The RoBERTa model was proposed in RoBERTa: A Robustly Optimized BERT Pretraining Approach by Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, Veselin Stoyanov. It is based on Google’s BERT model released in 2018.
It builds on BERT and modifies key hyperparameters, removing the next-sentence pre-training objective and training with larger mini-batches and learning rates.
Twitter Sentiment Analysis
Now that we know the basics, we can start the tutorial. Here's what we need to do to train a sentiment analysis model:
- Install the transformers library;
- Download the pre-trained RoBERTa model and prepare it for fine-tuning;
- Process the sentiment140 dataset and tokenize it using the RoBERTa tokenizer;
- Make predictions with the fine-tuned model and examine the cases where it gets them wrong.
We will use the RoBERTa model from the Transformers library, with PyTorch as the main framework.
Install the Transformers Library
To install the Transformers library, simply run the following pip line in a Google Colab cell:
!pip install transformers
After the installation is completed, we will import torch to add some layers for fine-tuning, along with RoBERTa's model (the RoBERTa tokenizer will be imported later, in the tokenization step). After that, we will create a model class where we load the pre-trained model and add the new layers:
import torch
from transformers import RobertaModel

# Use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class RobertaClass(torch.nn.Module):
    def __init__(self):
        super(RobertaClass, self).__init__()
        # Pre-trained RoBERTa body
        self.l1 = RobertaModel.from_pretrained("roberta-base")
        # New layers for fine-tuning: a hidden layer and a 2-class output layer
        self.pre_classifier = torch.nn.Linear(768, 768)
        self.classifier = torch.nn.Linear(768, 2)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        hidden_state = output_1[0]        # last hidden states, [batch, seq_len, 768]
        pooler = hidden_state[:, 0]       # representation of the first (<s>) token
        pooler = self.pre_classifier(pooler)
        pooler = torch.nn.Tanh()(pooler)
        output = self.classifier(pooler)  # raw logits for the two classes
        return output

model = RobertaClass()
model.to(device)
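Before moving on, it can be useful to sanity-check the freshly initialized classifier with a dummy batch. This is only a sketch with made-up shapes, not part of the original pipeline:

# Optional sanity check (a sketch): run a dummy batch through the untrained head.
# Shapes are made up for illustration: a batch of 2 sequences of length 10.
dummy_ids = torch.randint(0, 1000, (2, 10)).to(device)
dummy_mask = torch.ones_like(dummy_ids)
dummy_token_types = torch.zeros_like(dummy_ids)  # RoBERTa expects all-zero token types

with torch.no_grad():
    logits = model(dummy_ids, dummy_mask, dummy_token_types)

print(logits.shape)  # expected: torch.Size([2, 2]) - one logit per class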
The “sentiment140” Dataset
For our research, we will use this dataset.
The dataset is a table with 6 columns, of which we will use only 2: “text” and “label”, containing the tweet and the target label respectively. The dataset holds 1.6 million tweets, with 800 thousand positive and 800 thousand negative examples, i.e. the data is already perfectly balanced. The lengths of the tweets are distributed as follows:
As we can see, tweets longer than 140 characters are rare enough to be discarded as statistically insignificant.
Also, since 1.6M tweets is excessively large for fine-tuning RoBERTa, we take only a fraction of the examples of each class (see NUM_SAMPLES in the code below).
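If you want to reproduce the length histogram yourself, a minimal sketch could look like the following. It assumes the full_data DataFrame with a "text" column, as prepared in the "Preparing the Data" section below:

import matplotlib.pyplot as plt

# Character lengths of all tweets (assumes full_data from the data-preparation step below)
lengths = full_data['text'].str.len()
lengths.hist(bins=50)
plt.xlabel("Tweet length (characters)")
plt.ylabel("Number of tweets")
plt.show()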
Let’s Load the Dataset
We will load the dataset from Google Drive; for this, we use the corresponding Google Colab library:
from google.colab import drive

# connect with your google drive
drive.mount('/content/drive')

import pandas as pd

# paste your path to the dataset
!cp '/content/drive/MyDrive/dataset.zip' dataset.zip
We then need to unzip the files from the archive:
import zipfile

with zipfile.ZipFile("dataset.zip", 'r') as zip_ref:
    zip_ref.extractall("./")
Preparing the Data
Let’s drop unnecessary columns and rename the remaining ones:
# The CSV has no header, so pandas uses the first row's values as column names;
# we drop the id, date, query, and user columns by those values and keep label and text
full_data = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding='latin-1') \
    .drop(["1467810369", "Mon Apr 06 22:19:45 PDT 2009", "NO_QUERY", "_TheSpecialOne_"], axis=1) \
    .dropna()
columns_names = list(full_data)
full_data.rename(columns={columns_names[0]: "label", columns_names[1]: "text"}, inplace=True)
In the dataset, positive tweets are marked with the number 4, but the loss function expects class indices from 0 to the number of classes minus one, so a label of 4 would imply that we are predicting 5 classes or more. Since we only have two classes, we have to replace the labels:
NUM_SAMPLES = 30000

# Take an equal number of negative (label 0) and positive (label 4) tweets
negative_samples = full_data[full_data["label"] == 0][:NUM_SAMPLES]
positive_samples = full_data[full_data["label"] == 4][:NUM_SAMPLES].copy()

# Relabel the positive class from 4 to 1 so the targets are 0 and 1
positive_samples["label"] = [1] * NUM_SAMPLES

full_data = pd.concat([negative_samples, positive_samples])
We end up with the following table:
Train and Test Split
We split the data into training and test sets. Don't forget to reset the indices:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(full_data, test_size=0.3)
train_data = train_data.reset_index(drop=True)
test_data = test_data.reset_index(drop=True)
Tokenization
To process the text, we need to convert it into tokens. RoBERTa uses its own tokenizer. For each text, it returns three fields: the token IDs, the attention mask, and the token type IDs. After tokenization, the training and test data are stored in the train_tokenized_data and test_tokenized_data variables, respectively:
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base', truncation=True, do_lower_case=True)
MAX_LEN = 130

# padding='max_length' is the current equivalent of the deprecated pad_to_max_length=True
train_tokenized_data = [tokenizer.encode_plus(
    text,
    None,
    add_special_tokens=True,
    max_length=MAX_LEN,
    padding='max_length',
    return_token_type_ids=True,
    truncation=True
) for text in train_data['text']]

test_tokenized_data = [tokenizer.encode_plus(
    text,
    None,
    add_special_tokens=True,
    max_length=MAX_LEN,
    padding='max_length',
    return_token_type_ids=True,
    truncation=True
) for text in test_data['text']]
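To see what the tokenizer produces, you can inspect a single encoded example (a small sketch; the exact tokens depend on the tweet):

# Peek at one encoded example (output values depend on the tweet)
example = train_tokenized_data[0]
print(example.keys())                     # input_ids, token_type_ids, attention_mask
print(len(example['input_ids']))          # padded/truncated to MAX_LEN (130)
print(tokenizer.convert_ids_to_tokens(example['input_ids'][:10]))  # first few tokens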
Prepare the Dataset
Dataset Class
For convenient access to the data, we will create a dataset class that stores the tokenized inputs, the targets, and the source texts:
from torch.utils.data import Dataset, DataLoader

TRAIN_BATCH_SIZE = 32
TEST_BATCH_SIZE = 32
LEARNING_RATE = 1e-05

class SentimentData(Dataset):
    def __init__(self, data, inputs_tokenized):
        self.inputs = inputs_tokenized
        self.text = data['text']
        self.targets = data['label']

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())
        encoded = self.inputs[index]
        ids = encoded['input_ids']
        mask = encoded['attention_mask']
        token_type_ids = encoded['token_type_ids']
        return {
            'sentence': text,
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            'targets': torch.tensor(self.targets[index], dtype=torch.float)
        }

train_dataset = SentimentData(train_data, train_tokenized_data)
test_dataset = SentimentData(test_data, test_tokenized_data)
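To verify that the dataset returns what the training loop expects, you can pull a single item (a quick sketch, just for inspection):

# Sanity-check one item from the training dataset
sample = train_dataset[0]
print(sample['sentence'][:80])   # the raw tweet text (truncated for display)
print(sample['ids'].shape)       # torch.Size([130]) - token IDs padded to MAX_LEN
print(sample['targets'])         # tensor(0.) or tensor(1.)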
DataLoader
Let’s wrap the datasets in DataLoaders so that we can draw shuffled batches:
train_params = {'batch_size': TRAIN_BATCH_SIZE, 'shuffle': True}
test_params = {'batch_size': TEST_BATCH_SIZE, 'shuffle': True}

train_loader = DataLoader(train_dataset, **train_params)
test_loader = DataLoader(test_dataset, **test_params)
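A quick look at one batch confirms the shapes the model will receive (again, only a sketch):

# Inspect one batch from the train loader
batch = next(iter(train_loader))
print(batch['ids'].shape)      # torch.Size([32, 130])
print(batch['mask'].shape)     # torch.Size([32, 130])
print(batch['targets'].shape)  # torch.Size([32])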
Fine-Tuning the RoBERTa Model
Now let's write a training function and run it. During training, we will store the test texts together with their true and predicted labels, so that we can later inspect the ones the network classifies incorrectly:
from tqdm import tqdm
import matplotlib.pyplot as plt
from IPython.display import clear_output

train_loss = []
test_loss = []
train_accuracy = []
test_accuracy = []
# test_answers[target][prediction] collects test sentences from the last epoch
test_answers = [[[], []], [[], []]]

def train_loop(epochs):
    for epoch in range(epochs):
        for phase in ['Train', 'Test']:
            if phase == 'Train':
                model.train()
                loader = train_loader
            else:
                model.eval()
                loader = test_loader

            epoch_loss = 0
            epoch_acc = 0
            for steps, data in tqdm(enumerate(loader, 0)):
                sentence = data['sentence']
                ids = data['ids'].to(device, dtype=torch.long)
                mask = data['mask'].to(device, dtype=torch.long)
                token_type_ids = data['token_type_ids'].to(device, dtype=torch.long)
                targets = data['targets'].to(device, dtype=torch.long)

                outputs = model(ids, mask, token_type_ids)
                loss = loss_function(outputs, targets)
                epoch_loss += loss.item()

                _, max_indices = torch.max(outputs.data, dim=1)
                batch_acc = (max_indices == targets).sum().item() / targets.size(0)
                epoch_acc += batch_acc

                if phase == 'Train':
                    train_loss.append(loss.item())
                    train_accuracy.append(batch_acc)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
                else:
                    test_loss.append(loss.item())
                    test_accuracy.append(batch_acc)
                    # On the last epoch, store test sentences grouped by (target, prediction)
                    if epoch == epochs - 1:
                        for i in range(len(targets)):
                            test_answers[targets[i].item()][max_indices[i].item()].append(
                                [sentence[i], targets[i].item(), max_indices[i].item()])

            print(f"{phase} Loss: {epoch_loss / len(loader)}")
            print(f"{phase} Accuracy: {epoch_acc / len(loader)}")
The long-awaited moment: we start training our model. Since the model is not trained from scratch, a few epochs will be enough:
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(params = model.parameters(), lr=LEARNING_RATE)
EPOCHS = 4
train_loop(EPOCHS)
Code for Visualizing the Results
We will evaluate the results using graphs showing loss and accuracy:
plt.plot(train_loss, color='blue')
plt.title("Train Loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.show()

plt.plot(test_loss, color='orange')
plt.title("Test Loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.show()

plt.plot(train_accuracy, color='blue')
plt.title("Train Accuracy")
plt.xlabel("Batch")
plt.ylabel("Accuracy")
plt.show()

plt.plot(test_accuracy, color='orange')
plt.title("Test Accuracy")
plt.xlabel("Batch")
plt.ylabel("Accuracy")
plt.show()
Here’s the confusion matrix:
import seaborn as sn
import pandas as pd
import matplotlib.pyplot as plt

len_num = len(test_dataset)
# test_answers[target][prediction]: fractions of the test set in each cell
tp = len(test_answers[1][1]) / len_num
fn = len(test_answers[1][0]) / len_num
fp = len(test_answers[0][1]) / len_num
tn = len(test_answers[0][0]) / len_num

# Rows: True / False classifications, columns: Positive / Negative predictions,
# so the cells read TP, TN, FP, FN
array_matrix = [[tp, tn],
                [fp, fn]]
df_cm = pd.DataFrame(array_matrix, index=['T', 'F'], columns=['P', 'N'])

plt.figure(figsize=(5, 5))
sn.heatmap(df_cm, annot=True)
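From the same tp, fp, fn, and tn fractions we can also derive the usual classification metrics. This is a minimal sketch and not part of the original notebook:

# Derive standard metrics from the confusion-matrix fractions above
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = tp + tn  # tp, tn, fp, fn are already normalized by the test-set size
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")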
In the end, we display a few of the wrongly predicted sentences:
print('False Negative:\n', test_answers[1][0][:3],
      '\nFalse Positive:\n', test_answers[0][1][:3])
Results
The model quickly converges to good values and then improves much more slowly, with the loss fluctuating within a limited range:
The confusion matrix shows that the model rarely makes mistakes, and when it does, it most often misclassifies negative phrases as positive:
Let's look at the predictions where our network made a mistake. As we can see, only a few of them are genuine model errors; the rest are due to incorrect or debatable labeling in the dataset:
Saving the Model
Finally, we can save our trained model (point save_path at a folder in your mounted Google Drive if you want to keep it there):
save_path = "./"
torch.save(model, save_path + 'trained_roberta.pt')
print('All files saved')
print('Congratulations, you have completed this tutorial')
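Later, the saved file can be loaded back and used for inference on new tweets. Below is a minimal sketch under the assumptions of this notebook (the RobertaClass definition, tokenizer, MAX_LEN, and device from above must be in scope); the example tweet is made up:

# Reload the saved model (the RobertaClass definition must be in scope)
model = torch.load(save_path + 'trained_roberta.pt', map_location=device)
model.to(device)
model.eval()

def predict_sentiment(text):
    # Tokenize a single tweet the same way as the training data
    enc = tokenizer.encode_plus(text, add_special_tokens=True, max_length=MAX_LEN,
                                padding='max_length', return_token_type_ids=True,
                                truncation=True, return_tensors='pt')
    with torch.no_grad():
        logits = model(enc['input_ids'].to(device),
                       enc['attention_mask'].to(device),
                       enc['token_type_ids'].to(device))
    return 'positive' if logits.argmax(dim=1).item() == 1 else 'negative'

print(predict_sentiment("I love this tutorial!"))  # hypothetical example tweet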
Congratulations
You have successfully built a transformer classifier based on the RoBERTa model. Despite the imperfect labeling of the dataset, we managed to train the model to 87% accuracy, which is a great result for tweet sentiment analysis. You can repeat the experiment with our Google Drive file.