Twitter is a social media platform, and analyzing it can provide plenty of useful information. In this article, we will show you, using the sentiment140 dataset as an example, how to conduct Twitter Sentiment Analysis with Python and the most advanced neural networks of today: transformers.
The transformer architecture was introduced in 2017, and models based on it have since become the most popular and widely used. Earlier, we gave a detailed overview of transformers in Computer Vision. Today we will be using one of the main transformer-based models, the Robustly Optimized BERT Pre-training Approach (RoBERTa).
It builds on BERT and modifies key hyperparameters, removing the next-sentence pre-training objective and training with larger mini-batches and learning rates.
Twitter Sentiment Analysis
Now that we know the basics, we can start the tutorial. Here's what we need to do to train a sentiment analysis model:
Install the transformers library;
Download the RoBERTa model and prepare it for fine-tuning;
Process the sentiment140 dataset and tokenize it using the RoBERTa tokenizer;
Make predictions with the fine-tuned model and examine the cases where it predicts incorrectly.
We will use the RoBERTa model from the Transformers library, with PyTorch as the main framework.
Install the Transformers Library
To install the Transformers library, simply run the following pip line on a Google Colab cell:
!pip install transformers
After the installation is completed, we will import torch to add some layers for fine-tuning, along with the RoBERTa model and tokenizer. Then we will create a model class that loads the pre-trained model and adds new layers:
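The original class definition is not shown here; below is a minimal sketch of such a class. The class name, the dropout rate, and the choice of a single linear head are illustrative assumptions, not the article's exact architecture.

```python
import torch
from transformers import RobertaModel

class RobertaSentimentClassifier(torch.nn.Module):
    """Pre-trained RoBERTa encoder with a small classification head on top."""

    def __init__(self, n_classes=2, dropout=0.3):
        super().__init__()
        # load the pre-trained RoBERTa body
        self.roberta = RobertaModel.from_pretrained('roberta-base')
        # new layers added for fine-tuning
        self.dropout = torch.nn.Dropout(dropout)
        self.classifier = torch.nn.Linear(self.roberta.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        # use the hidden state of the first (<s>) token as the sequence summary
        cls_state = outputs.last_hidden_state[:, 0]
        return self.classifier(self.dropout(cls_state))
```

During fine-tuning, gradients flow through both the new head and the pre-trained encoder, which is what lets the model adapt to tweet-specific language.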
The dataset is a table with 6 columns, of which we will use only 2: “text” and “label”, containing the tweet and the target label respectively. The dataset holds 1.6 million tweets, with 800 thousand positive and 800 thousand negative examples, i.e. the data is already perfectly balanced. The lengths of the tweets are distributed as follows:
As we can see, tweets longer than 140 characters can be discarded as statistically insignificant.
Also, since 1.6M examples is excessively large for fine-tuning RoBERTa, we take only 100k of each class.
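These two preprocessing decisions can be sketched in pandas. The dataframe below is a toy stand-in for the full dataset, and the column names follow the article:

```python
import pandas as pd

# toy stand-in for the full dataframe; in the article it holds all 1.6M tweets
df = pd.DataFrame({
    'text': ['good day', 'bad day', 'x' * 150, 'nice', 'awful'],
    'label': [4, 0, 4, 4, 0],
})

# discard the statistically insignificant tweets longer than 140 characters
df = df[df['text'].str.len() <= 140]

# take an equal subsample of each class (100_000 in the article; 1 here)
n_per_class = 1
balanced = df.groupby('label').sample(n=n_per_class, random_state=42)
```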
Let’s Load the Dataset
We will load the dataset from Google Drive, and for this, we will use the corresponding Google Drive library:
# connect with your Google Drive
from google.colab import drive
drive.mount('/content/drive')

import pandas as pd

# paste your path to the dataset
!cp '/content/drive/MyDrive/dataset.zip' dataset.zip
We need to unzip files from the archive:
import zipfile

with zipfile.ZipFile("dataset.zip", 'r') as zip_ref:
    zip_ref.extractall()
Preparing the Data
Let’s drop unnecessary columns and rename the remaining ones:
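The original code for this step is not shown; here is a minimal sketch using toy rows in the sentiment140 layout. The six-column order (target, id, date, flag, user, text) is an assumption based on the public dataset:

```python
import io
import pandas as pd

# two toy rows in the assumed sentiment140 column order
raw = io.StringIO('0,1,Mon,NO_QUERY,alice,bad day\n4,2,Tue,NO_QUERY,bob,great day\n')
df = pd.read_csv(raw, names=['label', 'id', 'date', 'flag', 'user', 'text'])

# drop everything except the tweet and its target label
df = df[['text', 'label']]
```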
In the dataset, positive tweets are marked with the number 4, but RoBERTa will perceive the number 4 as if we were predicting 5 classes or more. Since we only have two classes, we will have to replace the labels:
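The remapping itself is one line of pandas; a small sketch with toy labels:

```python
import pandas as pd

df = pd.DataFrame({'text': ['bad', 'good'], 'label': [0, 4]})

# map positive tweets from 4 to 1 so the targets are 0/1 for two classes
df['label'] = df['label'].replace(4, 1)
```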
To process the text, we need to convert it into tokens using RoBERTa's own tokenizer. For each text it returns 3 values: the token ids, the attention mask, and the token type ids. After tokenization, the training and test data are stored in the train_tokenized_data and test_tokenized_data variables, respectively:

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

MAX_LEN = 130

# the exact encode_plus arguments were cut off in the original; these are typical
train_tokenized_data = [tokenizer.encode_plus(
        text,
        max_length=MAX_LEN,
        padding='max_length',
        truncation=True,
        return_token_type_ids=True)
    for text in train_data['text']]

test_tokenized_data = [tokenizer.encode_plus(
        text,
        max_length=MAX_LEN,
        padding='max_length',
        truncation=True,
        return_token_type_ids=True)
    for text in test_data['text']]
For the convenience of using the data, we will create a dataset class, which will store all data for processing, targets, and source texts:
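The original class is not shown; below is a minimal sketch of such a dataset class, assuming the dict-like output of `encode_plus` above. The class and key names are illustrative:

```python
import torch
from torch.utils.data import Dataset

class TweetDataset(Dataset):
    """Stores tokenized data for processing, the targets, and the source texts."""

    def __init__(self, tokenized_data, targets, texts):
        self.tokenized_data = tokenized_data
        self.targets = targets
        self.texts = texts

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        item = self.tokenized_data[idx]
        return {
            'input_ids': torch.tensor(item['input_ids'], dtype=torch.long),
            'attention_mask': torch.tensor(item['attention_mask'], dtype=torch.long),
            'targets': torch.tensor(self.targets[idx], dtype=torch.long),
            'text': self.texts[idx],  # raw text kept for inspecting errors later
        }
```

Keeping the raw text alongside the tensors makes it easy to print the tweets the model gets wrong after evaluation.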
The model quickly converges to good loss values, then improves much more slowly, with the loss fluctuating within a narrow range:
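This curve comes from a standard PyTorch fine-tuning loop. A runnable miniature of such a loop is sketched below, with a toy linear model and random tensors standing in for the RoBERTa classifier and the tokenized tweets; the optimizer choice and learning rate are common defaults, not the article's exact settings:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# toy stand-ins: in the article, `model` is the RoBERTa classifier and
# `loader` yields batches of tokenized tweets from the dataset class
model = torch.nn.Linear(8, 2)
loader = DataLoader(TensorDataset(torch.randn(16, 8),
                                  torch.randint(0, 2, (16,))), batch_size=4)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(2):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()   # backprop through the head (and encoder, in the article)
        optimizer.step()
```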
The confusion matrix shows that the model rarely makes mistakes, and when it does, it most often misclassifies negative phrases as positive:
Let's look at the predictions where our network made a mistake. As we can see, only a few of them are genuine model errors; the rest stem from incorrect or debatable labels in the dataset:
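Selecting the misclassified tweets is a simple boolean filter; a sketch with toy predictions (in the article these come from the fine-tuned model):

```python
import numpy as np
import pandas as pd

# toy predictions vs. labels standing in for the real evaluation output
test_df = pd.DataFrame({'text': ['love it', 'hate it', 'meh'],
                        'label': [1, 0, 0]})
preds = np.array([1, 1, 0])

# keep only the rows where the prediction disagrees with the label
errors = test_df[preds != test_df['label']]
```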
Saving the Model
Finally, we can save our model to Google Drive:
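The save itself is a single `torch.save` call; a minimal sketch with a toy model (in Colab, point the path at your mounted Drive, e.g. `/content/drive/MyDrive/model.pt`):

```python
import torch

# toy model standing in for the fine-tuned RoBERTa classifier
model = torch.nn.Linear(4, 2)

# local path here; replace with your Drive folder when running in Colab
save_path = 'model.pt'
torch.save(model.state_dict(), save_path)
```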
print('All files saved')
print('Congratulations, you have completed this tutorial')
You have successfully built a transformer classifier based on the RoBERTa model. Despite the dataset's imperfect labeling, we managed to train the model to 87% accuracy, which is a great result for tweet sentiment analysis. You can repeat the experiment with our Google Drive file.