The ability to categorize opinions expressed in the text of tweets—and especially to determine whether the writer's attitude is positive, negative, or neutral—is highly valuable. In this guide, we will use the process known as sentiment analysis to categorize the opinions of people on Twitter towards a hypothetical topic called #hashtag.
There are different ordinal scales used to categorize tweets. A five-point ordinal scale includes five categories: Highly Negative, Slightly Negative, Neutral, Slightly Positive, and Highly Positive. A three-point ordinal scale includes Negative, Neutral, and Positive; and a two-point ordinal scale includes Negative and Positive. In this guide, we will use a three-point ordinal scale to categorize tweets with #hashtag.
Sentiment analysis involves natural language processing because it deals with human-written text. You'll have to download a few Python libraries to work with the code. Use pip install <library> to install them.
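Before running the rest of the guide, it can help to check which libraries are still missing. The following is a minimal sketch using only the standard library; the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names introduced here, and the package list simply mirrors the imports used in this guide.

```python
import importlib.util

# Import name -> name to pass to pip. This list mirrors the imports used in
# this guide; adjust it if your environment differs.
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "nltk": "nltk",
    "sklearn": "scikit-learn",
}

def missing_packages(required=REQUIRED):
    """Return the pip names of packages that cannot be imported."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]

if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip install " + " ".join(missing))
    else:
        print("All required libraries are available.")
```

Note that the import name and the pip name can differ, as with `sklearn` and `scikit-learn`.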
To train a machine learning model, we need data. You can download the dataset to use in this guide here.
Importing the required libraries.
import pandas as pd
import numpy as np
import re
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
# ML Libraries
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Global Parameters
# Requires the NLTK 'stopwords' corpus and 'punkt' tokenizer data; fetch them
# once with nltk.download('stopwords') and nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).
stop_words = set(stopwords.words('english'))
After you download the CSV, you'll see that there are 1.6 million tweets already coded into three categories by hand.
This dataset encodes the target variable on a three-point ordinal scale: 0 = negative, 2 = neutral, 4 = positive.
def load_dataset(filename, cols):
    dataset = pd.read_csv(filename, encoding='latin-1')
    dataset.columns = cols
    return dataset
The dataset has six columns: 'target', 't_id', 'created_at', 'query', 'user', and 'text'. We are only interested in 'text' and 'target', although you can keep other columns if you like. A small helper function makes the column removal reusable.
def remove_unwanted_cols(dataset, cols):
    for col in cols:
        del dataset[col]
    return dataset
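As an alternative to loading every column and deleting the unwanted ones, pandas can skip them at read time through the `usecols` parameter of `read_csv`. The sketch below assumes the CSV has no header row, which is an assumption about the file rather than something this guide states; `load_selected_columns` and the two sample rows are hypothetical and exist only for illustration.

```python
import io

import pandas as pd

COLUMNS = ['target', 't_id', 'created_at', 'query', 'user', 'text']

def load_selected_columns(source, keep=('target', 'text'), **read_csv_kwargs):
    """Read only the columns in `keep`; assumes the file has no header row."""
    return pd.read_csv(source, header=None, names=COLUMNS,
                       usecols=list(keep), **read_csv_kwargs)

# Two invented rows standing in for the real file.
sample_csv = io.StringIO(
    '0,1,Mon Apr 06,NO_QUERY,alice,"bad day"\n'
    '4,2,Mon Apr 06,NO_QUERY,bob,"great day"\n'
)
df = load_selected_columns(sample_csv)
print(df)
```

For a file on disk, you would pass the path instead of the `StringIO` object, plus `encoding='latin-1'` as in `load_dataset`. Skipping columns at read time keeps memory use lower on a file with 1.6 million rows.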
Preprocessing is one of the essential steps in any natural language processing (NLP) task. Data scientists rarely get filtered, ready-to-use data, so a lot of processing has to happen before the text is workable.
The stop-word list can come from the nltk library, or it can be business-specific.
def preprocess_tweet_text(tweet):
    # Lowercase the tweet (str.lower returns a new string, so reassign it)
    tweet = tweet.lower()
    # Remove urls
    tweet = re.sub(r"http\S+|www\S+|https\S+", '', tweet, flags=re.MULTILINE)
    # Remove user @ references and '#' from tweet
    tweet = re.sub(r'\@\w+|\#', '', tweet)
    # Remove punctuations
    tweet = tweet.translate(str.maketrans('', '', string.punctuation))
    # Remove stopwords
    tweet_tokens = word_tokenize(tweet)
    filtered_words = [w for w in tweet_tokens if w not in stop_words]

    # Optional normalization: uncomment ONE of the two options below
    # and join its result instead of filtered_words.
    #ps = PorterStemmer()
    #stemmed_words = [ps.stem(w) for w in filtered_words]
    #lemmatizer = WordNetLemmatizer()
    #lemma_words = [lemmatizer.lemmatize(w, pos='a') for w in filtered_words]

    return " ".join(filtered_words)
Stemming is faster than lemmatization. You can uncomment the code and see how results change. Note: Do not apply both. Remember that stemming and lemmatization are normalization techniques, and it is recommended to use only one approach to normalize. Let your project requirements guide your decision, or you can always do experiments and see which one gives better results. In this case, stemming and lemmatizing yield almost the same accuracy.
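To see what the cleaning steps do to a tweet before any stemming or lemmatization is applied, here is a simplified sketch that runs without NLTK. It uses a tiny stand-in stop-word set and whitespace splitting instead of `word_tokenize`; both are assumptions made for the illustration, and `clean_tweet` is a hypothetical name.

```python
import re
import string

# A tiny stand-in stop-word list (an assumption for this sketch); the guide
# uses the much longer list from nltk.corpus.stopwords.
SAMPLE_STOP_WORDS = {"the", "is", "a", "at", "so", "this"}

def clean_tweet(tweet, stop_words=SAMPLE_STOP_WORDS):
    tweet = tweet.lower()                                 # normalize case
    tweet = re.sub(r"http\S+|www\S+", "", tweet)          # drop URLs
    tweet = re.sub(r"@\w+|#", "", tweet)                  # drop @mentions and '#'
    tweet = tweet.translate(str.maketrans("", "", string.punctuation))
    tokens = tweet.split()                                # whitespace tokenization
    return " ".join(w for w in tokens if w not in stop_words)

sample = "@user This is SO good!!! Check https://example.com #hashtag"
print(clean_tweet(sample))  # prints: good check hashtag
```

The mention, the URL, the punctuation, and the stop words are gone, while the hashtag text survives as an ordinary word.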
In this guide, you'll implement vectorization using tf-idf. There are other techniques as well, such as Bag of Words and N-grams.
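To make the weighting concrete, the sketch below reproduces in plain Python the formula that scikit-learn documents for `TfidfVectorizer` with `sublinear_tf=True` and its default settings (`smooth_idf=True`, `norm='l2'`). It is an illustration rather than the library's implementation: the `tfidf` helper is hypothetical, and whitespace tokenization is a simplification of the vectorizer's own token pattern.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Sublinear tf: 1 + ln(count) instead of the raw count.
        weights = {t: (1 + math.log(c)) * idf[t] for t, c in counts.items()}
        # L2-normalize so every document vector has unit length.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

docs = ["good good movie", "bad movie", "good day"]
vectors = tfidf(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in sorted(vec.items())})
```

In the second document, "bad" receives a higher weight than "movie" because it appears in fewer documents, which is the behavior that makes tf-idf useful for separating classes.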
def get_feature_vector(train_fit):
    vector = TfidfVectorizer(sublinear_tf=True)
    vector.fit(train_fit)
    return vector
Important note: I am using the whole dataset as the corpus to fit the tf-idf vectorizer. The same fitted vectorizer must be used to transform both the training data and the test data, so that both share one vocabulary and one set of weights.
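A common variation, different from what this guide does, is to split the data first and fit the vectorizer on the training portion only, so that no information from the test tweets leaks into the vocabulary or the idf weights. The sketch below shows that variation with a scikit-learn `Pipeline`; the eight toy texts and their labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
texts = ["love this", "great day", "so happy", "really good",
         "hate this", "awful day", "so sad", "really bad"]
labels = [4, 4, 4, 4, 0, 0, 0, 0]

# Split the raw text first, then let the pipeline fit tf-idf on the
# training texts only.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=30, stratify=labels)

model = make_pipeline(TfidfVectorizer(sublinear_tf=True),
                      LogisticRegression(solver="lbfgs"))
model.fit(train_texts, y_train)

# predict() reuses the vocabulary learned from the training texts.
predictions = model.predict(test_texts)
print(list(zip(test_texts, predictions)))
```

Because the vectorizer lives inside the pipeline, the same fitted object is applied automatically at prediction time, which enforces the "same vector structure" rule.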
The target column consists of the integer values 0, 2, and 4, but users do not usually want their results in this form. A small helper function converts the integer results into labels that are easy to read.
def int_to_string(sentiment):
    if sentiment == 0:
        return "Negative"
    elif sentiment == 2:
        return "Neutral"
    else:
        return "Positive"
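One possible variation is a dictionary lookup that rejects unexpected codes instead of treating everything other than 0 and 2 as positive. This is a sketch, not a replacement the guide depends on; `SENTIMENT_LABELS` and `label_for` are hypothetical names.

```python
SENTIMENT_LABELS = {0: "Negative", 2: "Neutral", 4: "Positive"}

def label_for(sentiment):
    """Map a sentiment code to its label, failing loudly on unknown codes."""
    try:
        # int() also accepts NumPy integer types returned by model.predict().
        return SENTIMENT_LABELS[int(sentiment)]
    except KeyError:
        raise ValueError(f"Unexpected sentiment code: {sentiment!r}") from None

print([label_for(code) for code in (0, 2, 4)])
```

Failing on an unknown code makes a mislabeled or corrupted target column visible early rather than silently counting it as positive.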
In this section, we will call all the functions that you have created. You'll see Naive Bayes and Logistic Regression algorithms for predictions. These two algorithms are quite popular in NLP, although you can try out other options too.
# Load dataset
dataset = load_dataset("data/training.csv", ['target', 't_id', 'created_at', 'query', 'user', 'text'])
# Remove unwanted columns from dataset (the DataFrame is modified in place)
dataset = remove_unwanted_cols(dataset, ['t_id', 'created_at', 'query', 'user'])
# Preprocess data
dataset['text'] = dataset['text'].apply(preprocess_tweet_text)

# Same tf vector will be used for testing sentiments on unseen trending data
tf_vector = get_feature_vector(np.array(dataset.iloc[:, 1]).ravel())
X = tf_vector.transform(np.array(dataset.iloc[:, 1]).ravel())
y = np.array(dataset.iloc[:, 0]).ravel()
# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=30)

# Training Naive Bayes model
NB_model = MultinomialNB()
NB_model.fit(X_train, y_train)
y_predict_nb = NB_model.predict(X_test)
print(accuracy_score(y_test, y_predict_nb))

# Training Logistic Regression model
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
print(accuracy_score(y_test, y_predict_lr))
Naive Bayes gives nearly 76% accuracy, and Logistic Regression gives nearly 79%. These figures were recorded without stemming or lemmatization. Using better techniques, you might get better accuracy.
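A single accuracy figure can hide how each class performs. As a minimal sketch, the snippet below computes overall accuracy and per-class recall by hand on a small invented set of labels; `per_class_recall` is a hypothetical helper, and scikit-learn's `classification_report` in `sklearn.metrics` offers a fuller version of the same idea.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that the model predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Invented labels for illustration: 0 = negative, 2 = neutral, 4 = positive.
y_true = [0, 0, 0, 4, 4, 4, 4, 2]
y_pred = [0, 0, 4, 4, 4, 4, 0, 4]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
print(accuracy)  # 0.625
print(recall)
```

Here the overall accuracy is 62.5%, yet the neutral class is never predicted correctly, which is the kind of imbalance that accuracy alone would not reveal.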
This step is completely optional and will only apply if you have read and implemented the guide Building a Twitter Bot with Python.
test_file_name = "trending_tweets/08-04-2020-1586291553-tweets.csv"
test_ds = load_dataset(test_file_name, ["t_id", "hashtag", "created_at", "user", "text"])
test_ds = remove_unwanted_cols(test_ds, ["t_id", "created_at", "user"])

# Creating text feature
test_ds["text"] = test_ds["text"].apply(preprocess_tweet_text)
test_feature = tf_vector.transform(np.array(test_ds.iloc[:, 1]).ravel())

# Using Logistic Regression model for prediction
test_prediction_lr = LR_model.predict(test_feature)

# Aggregating the result per hashtag
# (max keeps the highest sentiment code predicted for each hashtag)
test_result_ds = pd.DataFrame({'hashtag': test_ds.hashtag, 'prediction': test_prediction_lr})
test_result = test_result_ds.groupby(['hashtag']).max().reset_index()
test_result.columns = ['hashtag', 'predictions']
test_result.predictions = test_result['predictions'].apply(int_to_string)

print(test_result)
Replace the file name in the test_file_name variable with your own.
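Note that `groupby(...).max()` reports the highest sentiment code seen for each hashtag, so a single positive tweet is enough to label a hashtag as positive. If you would rather report the most frequent prediction, a majority vote is one option. The sketch below uses only the standard library; `majority_sentiment` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def majority_sentiment(hashtags, predictions):
    """Return the most frequent predicted code for each hashtag."""
    votes = defaultdict(Counter)
    for tag, pred in zip(hashtags, predictions):
        votes[tag][pred] += 1
    # On a tie, most_common keeps the code that was seen first.
    return {tag: counts.most_common(1)[0][0] for tag, counts in votes.items()}

# Invented data: '#a' has two negative tweets and one positive tweet.
hashtags = ["#a", "#a", "#a", "#b", "#b"]
predictions = [0, 0, 4, 4, 4]
result = majority_sentiment(hashtags, predictions)
print(result)  # {'#a': 0, '#b': 4}
```

With `max()`, hashtag `#a` would be reported as positive; with the majority vote it is reported as negative.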
I hope you enjoyed reading this guide. Sentiment analysis is a popular project that almost every data scientist will do at some point. It can solve a lot of problems, depending on how you want to use it.
I highly recommend trying different vectorizing techniques and applying feature extraction and feature selection to the dataset. Try to implement more machine learning models and you might be able to get accuracy over 85%.
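As one possible starting point for such experiments, the sketch below combines word n-grams with chi-squared feature selection in a scikit-learn pipeline. The toy texts, the `ngram_range=(1, 2)` setting, and `k=5` are assumptions chosen for illustration, not tuned values, and nothing here says this combination will reach any particular accuracy.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
texts = ["love this", "great day", "so happy", "really good",
         "hate this", "awful day", "so sad", "really bad"]
labels = [4, 4, 4, 4, 0, 0, 0, 0]

model = make_pipeline(
    # Unigrams and bigrams instead of single words only.
    TfidfVectorizer(sublinear_tf=True, ngram_range=(1, 2)),
    # Keep the k features most associated with the labels (chi-squared test).
    SelectKBest(chi2, k=5),
    MultinomialNB(),
)
model.fit(texts, labels)

new_predictions = model.predict(["what a great day", "what an awful day"])
print(new_predictions)
```

On the full dataset you would pick `k` and the n-gram range by cross-validation, for example with `GridSearchCV`, rather than fixing them by hand.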
If you have any questions, feel free to reach out to me at CodeAlphabet.