Sunday, December 14, 2014

Sentiment Analysis Using Machine Learning Techniques- Comparison of Classifiers

ARTIFICIAL INTELLIGENCE PROJECT PORTFOLIO

  1. What is Sentiment Analysis?
A sentiment is a thought, view, or opinion held by a person about a topic or an object. “Sentiment analysis which is also known as opinion mining refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.” [1] It is an Artificial Intelligence technique, as it automates the classification of sentiments in unstructured text, reducing the human effort needed to evaluate the sentiment of a given text. Online movie reviews are one common set of documents used for determining “polarity” about specific objects. Sentiment analysis is a text classification technique that primarily determines the binary polarity (positive or negative) of a document, that is, the opinion of a speaker or writer with respect to a topic, as shown in Fig. 1. It is very helpful for surveying trends of public opinion in social media for marketing purposes.
Text classification based on sentiment is more challenging than text classification based on topic. Using the large movie review dataset of the Internet Movie Database (IMDB)[2][3], I have focused on two classification categories, positive and negative, using three machine learning techniques: Multinomial Naive Bayes, Support Vector Machine, and Logistic Regression (Max Entropy). I have evaluated their performance in classifying movie reviews by sentiment, and compared the techniques based on accuracy, recall, and f1-score. Through parameter tuning and variations in features, I achieved an accuracy (approximately 91%) that is better than the results reported in [4].

Fig. 1 Sentiment Analysis : Positive or Negative? [17]

1.1 Why is Sentiment Analysis Important?
The rise of Web 2.0 has enabled users to interact with each other and post opinions that can be shared worldwide through the Internet; examples include social networking sites, blogs, and news websites. This has produced an enormous amount of unstructured data, making automatic text categorization and classification an active area of research for computer scientists. Recently, there has been increasing growth of online expression of views through comments, ratings, reviews, recommendations, and online discussion groups. The most important characteristic of such information is the sentiment or opinion of the user, for example, whether a movie review is positive or negative. Such an application of sentiment analysis is presented in Fig. 2. Sentiment classification can thus be used to gather overall user opinion about a product or a movie, supporting a marketing strategy that adds value to it. It can also be used in recommendation systems and intelligent systems. For example, Siri makes use of sentiment classification to identify the tone of the user, making phones nearly as capable as humans at understanding natural written and spoken language[11]. It has also proved useful in politics, making it handy for politicians to gauge voters’ views and the number of supporters. However, sentiment classification is not 100% accurate due to ambiguity in the English language, such as comparisons, mixed emotions, and vague positive or negative opinions.
Sentiment Analysis and Artificial Intelligence
Overall, Sentiment Analysis acts as an Artificial Intelligence technique to incorporate human psychological knowledge into machines, training them to automatically identify the type of emotion in a text[12]. Areas where this technique can be applied include mass opinion estimation, psychiatric treatment, corporate reputation measurement, political orientation categorization, stock market prediction, customer preference study, public opinion study, and so on[12].

Fig. 2 Brand Monitoring in Social Media - Application of Sentiment Analysis [16]

  2. My Work
Given the wide range of applications of sentiment classification, I performed sentiment classification on the large Internet Movie Database (IMDB)[2][3] movie review dataset, using three supervised machine learning techniques: Multinomial Naive Bayes, Linear Support Vector Machine, and Logistic Regression (Max Entropy). I also evaluated and compared the techniques based on precision, recall, f1-score, accuracy, and confusion matrix. I further improved the accuracy by experimenting with combinations of features, such as unigrams and bigrams, and with parameter tuning, such as using sublinear TF for vector transformations.

  3. Related Work
While researching this topic, I found the paper “Thumbs up? Sentiment Classification using Machine Learning Techniques” to be a good starting point for learning about sentiment-based text classification. The authors experimented with and evaluated the effectiveness of machine learning techniques for sentiment classification. They also noted that sentiment classification is more challenging than traditional topic-based classification, as the machine learning techniques yielded lower accuracy on it. However, the authors used a small corpus consisting of just 752 negative and 1,301 positive reviews. The datasets from their work are available here. I also read the papers “Machine Learning approaches to Sentiment Analysis using Dutch Netlog Corpus” and “Sentiment Classification using Machine Learning Techniques”.

  4. Movie Review Dataset
I have used the large movie review dataset of the Internet Movie Database (IMDB)[2][3]. This dataset was collected and labelled by the authors of the paper “Learning Word Vectors for Sentiment Analysis” (ACL 2011). It consists of movie reviews along with binary sentiment polarity labels: 50,000 movie reviews split into 25,000 training examples (12,500 positive and 12,500 negative reviews) and 25,000 test examples (12,500 positive and 12,500 negative reviews). In the entire collection, the authors used no more than 30 reviews for any given movie, as reviews of the same movie tend to have correlated ratings. Further, the training and test sets contain disjoint sets of movies, so performance cannot be improved by memorizing movie-unique terms and their associated labels. In the manually labeled dataset, a review with a score less than or equal to 4 out of 10 is assigned a negative label, and a review with a score greater than or equal to 7 out of 10 is assigned a positive label. Reviews with more neutral ratings are thus not included in the train/test sets.
The top-level directory of the dataset is ‘aclImdb’. The training set is inside a folder named ‘train’ and the test set inside a folder named ‘test’. Each folder contains sub-folders named ‘pos’ and ‘neg’ holding the positive and negative review files. The text files stored in these folders are named in the pattern <movieID>_<rating>.txt, where the rating is on a scale of 1-10.

  5. Background and Definitions
5.1 What are various approaches for Sentiment Analysis?
Sentiment Analysis is the automated process of extracting subjective content from text and classifying that content into classes such as positive or negative. Subjective opinions can be split into classes such as positive, negative, or neutral, or onto an n-point scale such as ‘excellent’, ‘good’, ‘bad’, ‘worst’, etc. In my work, I have focused on bipolar classification into positive and negative classes.
The main approaches to implementing sentiment analysis are lexicon-based, supervised learning, and unsupervised learning. The lexicon-based approach relies on information about words or phrases that indicates positivity or negativity; SentiWordNet is an example of a publicly available lexical resource. Supervised and unsupervised learning approaches are usually based on corpora, or sets of documents. Sentiment analysis can be done at the word, sentence, or document level. Web opinion mining is important nowadays because customers survey reviews of a product, for example, before making a purchase.
I have experimented with various supervised classification techniques for sentiment analysis. Let me first introduce the basics of supervised learning and the techniques I used.

5.2 Introduction to Supervised Learning
“Supervised learning is a machine learning task of predicting a label or class for the given test input, using the learning model trained on the labeled training inputs.”[18] The training data in supervised learning usually consists of a set of training examples, where each example is a pair of an input vector and an associated output label.[19] Supervised learning is illustrated in Fig. 3. The feature extractor converts each input datum into a feature set. In the first step, pairs of feature sets and labels are fed into the machine learning algorithm to train a model. In the second step, the prediction phase, features extracted from the test data by the same feature extractor are fed into the model to generate the predicted labels.[18]
Fig. 3 Supervised Learning [18]
I have used three techniques of supervised learning - Multinomial Naive Bayes, Logistic Regression  and Support Vector Machine (SVM).

5.3 What is Naive Bayes Classifier?
“The Naive Bayes classifier is a simple probabilistic classifier which is based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features.”[20] Despite its naive design and oversimplified assumptions, Naive Bayes is a suitable classifier for complex real-world problems. It is also very efficient, because it requires little CPU and memory and can perform well on small amounts of training data. The training time required by Naive Bayes is also small compared to other methods. [21]
A theoretical explanation of Naive Bayes follows (according to [20]):
For a class variable y and a feature vector x1 through xn whose features are assumed to be independent (the naive Bayes assumption), Bayes’ theorem gives the following relationship:
P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

\Downarrow

\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y),
and Maximum A Posteriori (MAP) estimation can be used to estimate P(y) and P(xi|y), where P(y) is then the relative frequency of class y in the training set.
The different naive Bayes classifiers differ mainly by the assumptions made regarding the distribution of P(xi|y).

5.4 Multinomial Naive Bayes
According to the theory given in [20]: “Multinomial Naive Bayes is a variant of Naive Bayes for multinomial distributed data. It is one of the two classic naive Bayes variants used in text classification where the data are typically represented as word vector counts or tf-idf vectors.” The underlying algorithm of multinomial Naive Bayes can be explored further in [20].
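As a minimal sketch of this idea (on a tiny made-up corpus, not the IMDB data), MultinomialNB can be trained directly on word-count vectors:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-corpus, for illustration only
train_docs = ["a great and wonderful movie", "terrible boring plot",
              "wonderful acting", "boring and terrible"]
train_labels = ["pos", "neg", "pos", "neg"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)       # sparse word-count matrix
classifier = MultinomialNB().fit(X_train, train_labels)

X_test = vectorizer.transform(["wonderful movie", "boring plot"])
print(classifier.predict(X_test))                    # -> ['pos' 'neg']
```

The classifier scores each test document under the word distribution learned for each class and picks the more probable class.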

5.5 Logistic Regression (Max Entropy)
The name “Logistic Regression” can be misleading: it is a classification technique, not regression. “Regression” appears in the name because a linear model is fit to the feature space. It is a statistical, probabilistic model of classification, mostly used when the class (dependent) variable is binary, based on one or more features. The probabilities of the outcomes are modeled as a function of the predictor variables known as the logistic function. The logistic function takes as input any value from negative infinity to positive infinity, whereas its output always lies between zero and one. A basic logistic function is shown in Fig. 4. For models with many categories or classes, multinomial logistic regression or ordered logistic regression is more suitable. [23]
Fig. 4. Basic Logistic Curve  [23]
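The mapping from the whole real line into (0, 1) described above is easy to verify numerically; a small sketch:

```python
import math

def logistic(t):
    """Basic logistic function: 1 / (1 + e^(-t)); output always lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

print(logistic(0.0))    # 0.5, the midpoint of the curve in Fig. 4
print(logistic(10.0))   # close to 1
print(logistic(-10.0))  # close to 0
```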

5.6 Support Vector Machine (Linear SVM)
Support Vector Machines are supervised learning models used for classification and regression. The linear SVM is a binary non-probabilistic linear classifier. [24] The model takes labeled training data as input and produces an optimal hyperplane as output. The SVM algorithm aims to find the hyperplane with the largest minimum distance to the training examples; the margin is defined as twice this distance. Thus, the hyperplane that maximizes the margin of the training data is the optimal separating hyperplane.[26] This is illustrated in Fig. 5. SVMs generally perform well in text classification and often outperform other classifiers.
Fig. 5 Optimal Hyperplane for Linear SVM [26]
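A minimal sketch of this behaviour on made-up 2-D points (not the review data): LinearSVC learns a separating hyperplane from the labeled inputs and classifies new points by which side of it they fall on.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical linearly separable points: class 0 near the origin, class 1 far away
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]])
y = [0, 0, 0, 1, 1, 1]

svm = LinearSVC().fit(X, y)                 # fits the maximum-margin hyperplane
print(svm.predict([[0.5, 0.5], [4, 4]]))    # -> [0 1]
```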

  6. Experimental Setup
For performing the experiment of sentiment analysis using various pre-built machine learning techniques, I used scikit-learn. Scikit-learn is a machine learning toolkit in Python which provides simple and efficient tools for data mining and data analysis. It is built on NumPy, SciPy, and matplotlib. It has built-in libraries for classification algorithms that can be used for sentiment classification on the dataset described in Section 4.
I used Ubuntu OS and installed the packages required for the experiment: Python 2.7, NumPy, SciPy, python-sklearn (scikit-learn), and matplotlib. I tested the performance of the classification techniques based on accuracy, precision, recall, f1-score, and confusion matrix. The confusion matrix, or error matrix, provides a visual way to assess the performance of a supervised learning algorithm. The performance measures used for evaluation are explained in detail in Section 9.

  7. Data Preprocessing
The raw data needed to be pre-processed into the proper format before being fed into the algorithms for classification and evaluation. For the implementation of this project, I referred to the scikit-learn tutorial Working with Text Data. [5]

7.1 Reading the data
The first step of preprocessing was to read the filenames in the directories and store the raw data along with the appropriate label in a data structure.
'''IMDB_Data class stores all relevant information related to an IMDB movie review: the ID of the review, its sentiment, and the actual text of the review'''
class IMDB_Data:
    def __init__(self, imdbID, sentiment):
        self.imdbID = imdbID
        self.sentiment = sentiment
        self.data = ""

IMDB_Data stores the movie ID, the associated sentiment, and the data read from the text file. The file name of each text file determines the imdbID, and the directory name “pos” or “neg” determines the sentiment associated with the review with ID ‘imdbID’. This is implemented in the readFileNames function below, which is called for both the training and test datasets.
''' Reads the file names in dirName and associates the passed polarity with each file in the directory'''
import os
import re

def readFileNames(dirName, polarity, imdb_data_type):
    for file in os.listdir(dirName):
        if file.endswith(".txt"):
            fileName = re.sub('.txt', '', str(file))
            imdbObject = IMDB_Data(fileName, polarity)
            imdb_data_type.append(imdbObject)

The second function, readFiles(), reads the data from all the file names and stores the review text in the data lists for train and test (imdbData and imdbTestData) and the labels in the target lists (imdbTarget and imdbTestTarget) respectively.
'''Reads the files of imdb_data_type (train or test) from folderName, storing each review's text in the object and appending text and label to the data and target lists'''
def readFiles(folderName, imdb_data_type, data_test_train, target_test_train):
    for imdbObject in imdb_data_type:
        fileName = folderName + imdbObject.sentiment + '/' + imdbObject.imdbID + '.txt'
        with open(fileName, 'r') as f:
            imdbObject.data = str(f.readlines())
        # add each imdb review's text and label to the lists
        data_test_train.append(str(imdbObject.data))
        target_test_train.append(str(imdbObject.sentiment))
7.2 Feature Extraction
For performing machine learning tests on text documents, feature vectors need to be extracted from the text.

Bag of Words
Bag of Words is a vector-space representation of text documents (the training set) that represents each document as a vector of word frequencies, without regard to word order. [10] These vectors act as features for training the classifier. In scikit-learn, the representation is implemented as a sparse matrix to save space, since the bag of words over a document collection is a huge collection of words or features, while each document contains only a small subset of them.
Feature Vectors for the text data are obtained by Tokenization and Normalized Frequency Count.

Tokenization
Tokenization is the process of breaking a text up into individual words and phrases, known as tokens. Preprocessing of the text, tokenization, and removal of stopwords are done with the aid of the CountVectorizer class in the sklearn.feature_extraction.text package. The output of CountVectorizer is a feature vector for each document, recording the frequency of occurrence of each word in that document.

Normalized Frequency Count
A word tends to occur more often in a longer document than in a shorter one, so a raw frequency count may be a misleading measure. Term frequency (tf) addresses this: tf is obtained by dividing the number of occurrences of a word in a document by the total number of words in the document. Inverse document frequency (idf) then downscales words that occur in many documents of the corpus and are therefore less informative. This whole tf-idf weighting is implemented by the TfidfTransformer class in the sklearn.feature_extraction.text package.
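A short sketch of TfidfTransformer chained after CountVectorizer, using the sublinear_tf option that appears later in the pipelines (the two documents are made up):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["good good good movie", "bad movie"]
counts = CountVectorizer().fit_transform(docs)

# sublinear_tf=True replaces the raw count tf with 1 + log(tf),
# dampening the effect of a word repeated many times in one document
tfidf = TfidfTransformer(sublinear_tf=True).fit_transform(counts)

print(tfidf.toarray())
print(np.linalg.norm(tfidf.toarray(), axis=1))  # rows are L2-normalized
```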

  8. Implementation
Training the Classifiers
The next step is to train the classifiers using the labeled data and feature vectors. Training is necessary so that the classifier can predict the correct sentiment of the test data. The Pipeline class functions as a compound classifier that vectorizes, transforms, and then trains the classifier.
Naive Bayes Pipeline
''' Pipeline class behaves like a compound classifier as it combines vectorizer, transformer and classifier'''
multinomialNB_classifier = Pipeline([('mnbVectorizer',CountVectorizer(ngram_range=(1,2))),
                    ('mnbTfidf', TfidfTransformer(sublinear_tf=True)),
                    ('mnbClassifier', MultinomialNB()),
])
multinomialNB_classifier = multinomialNB_classifier.fit(imdbData, imdbTarget)
Logistic Regression Pipeline
maxEntropy_classifier = Pipeline([('maxEntropyVectorizer',CountVectorizer(ngram_range=(1,2))),
                    ('maxEntropyTfidf', TfidfTransformer(sublinear_tf=True)),
                    ('maxEntropyClassifier',LogisticRegression(penalty='l2') ),
])
maxEntropy_classifier= maxEntropy_classifier.fit(imdbData,imdbTarget)

Support Vector Machine Pipeline
supportVectorMachine_classifier = Pipeline([('svmVectorizer',CountVectorizer(ngram_range=(1,2))),
                    ('svmTfidf', TfidfTransformer(sublinear_tf=True)),
                    ('svmClassifier',LinearSVC() ),
])

_= supportVectorMachine_classifier.fit(imdbData,imdbTarget)

Prediction and Accuracy
The code snippet below is used to predict the sentiment of the test reviews and to check the accuracy percentage for each of the three classifiers. The accuracy and other evaluation results are described in Sections 9 and 10.
predicted_test_output=multinomialNB_classifier.predict(imdbTestData)
print("Mean or Accuracy of Multinomial NB classifier: ")
print(np.mean(predicted_test_output==imdbTestTarget))

Implementation Files
Python implementations of the three classifiers are included in the ‘code’ folder: sentiment_nb.py for Multinomial Naive Bayes, sentiment_me for Logistic Regression, and sentiment_svm for Linear Support Vector Machine. Screenshots of the classifier evaluations are included in the ‘Screenshots’ folder, named as follows:
<ClassifierName>_parameterTuned.png
For example, LinearSVC_UnigramBigram_SublinearTF.png includes result of LinearSVC with unigram and bigram features and sublinearTF parameter for TfidfTransformer.

  9. Performance Measures
The measures used for performance analysis are:
Accuracy
Accuracy measures the closeness of the predicted values to the target (labeled) values. In classification, accuracy is defined as the number of correct predictions divided by the total number of predictions (multiplied by 100 to express it as a percentage).[28]
In the sentiment classification of movie reviews, an accuracy of 80%, for example, indicates that 80 out of 100 samples are classified correctly as positive or negative, that is, for 80 movie reviews the predicted class equals the labeled class.
Precision
Precision is also known as positive predictive value. It is the fraction of retrieved instances that are relevant. In classification, precision is the number of true positives (movie reviews correctly labeled as belonging to the positive class) divided by the total number of reviews predicted as positive (the sum of true positives and false positives, the latter being reviews incorrectly labeled as positive). [27]
Thus, precision = number of true positives / (number of true positives + number of false positives).

Recall
Recall is also known as sensitivity. It is the fraction of relevant instances that are retrieved.
Thus, recall = number of true positives / total number of elements belonging to the positive class (the sum of true positives and false negatives, the latter being reviews that should have been labeled positive but were labeled negative).[27] Fig. 6 shows precision and recall pictorially.
Fig. 6 Precision and Recall [27]

There is generally an inverse relationship between precision and recall: one can be increased at the cost of the other.
F1 Score or F Measure
F1 Score is the harmonic mean of precision and recall. It brings a balance between the two measures.
F1 Score= 2*((precision*recall)/(precision+recall)) [27]
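The three formulas above can be checked with hypothetical counts for the positive class (80 true positives, 10 false positives, 20 false negatives):

```python
tp, fp, fn = 80.0, 10.0, 20.0   # hypothetical counts, for illustration only

precision = tp / (tp + fp)                            # 80/90  ~ 0.889
recall = tp / (tp + fn)                               # 80/100 = 0.8
f1 = 2 * (precision * recall) / (precision + recall)  # ~ 0.842

print(round(precision, 3), round(recall, 3), round(f1, 3))
```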
Confusion Matrix
It is also known as a contingency table or error matrix. It is a visual representation of the performance of a supervised classification technique: columns represent the predicted labels and rows represent the actual labels of the reviews.
Example of Confusion Matrix is shown in Fig. 7.
Fig. 7 Confusion Matrix

The code below is used for the analysis of the classification techniques. The function metrics.classification_report() calculates the precision, recall, and f1 score for the classifier, and metrics.confusion_matrix() calculates its confusion matrix.
print("Metrics for Evaluation:")
print(metrics.classification_report(imdbTestTarget, predicted_test_output,
target_names=movie_review_target_names))
confusion_matrix=metrics.confusion_matrix(imdbTestTarget, predicted_test_output)
print("Confusion Matrix is :")
print(confusion_matrix)

Parameter Tuning in Machine Learning Techniques
CountVectorizer, TfidfTransformer, and the classifiers all have parameters that can be tuned to further improve accuracy, precision, and the other evaluation metrics. CountVectorizer can be run with unigrams only or with a combination of unigrams and bigrams; TfidfTransformer can be altered via use_idf or sublinear_tf; and parameters can be added to the classifiers as well. I experimented with various parameter combinations and further improved the accuracy of the classifiers.
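This kind of tuning can also be automated with scikit-learn's GridSearchCV, which cross-validates every parameter combination; the step-name__parameter syntax reaches inside the Pipeline. The tiny corpus here is made up, just to keep the sketch self-contained; in the project it would be imdbData and imdbTarget.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC())])

# Candidate settings: unigrams vs. unigrams+bigrams, sublinear TF on/off
param_grid = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__sublinear_tf': [True, False]}

# Hypothetical mini-dataset standing in for the real reviews
docs = ["good great film", "wonderful movie", "bad terrible film",
        "awful movie", "great wonderful acting", "terrible awful plot"] * 2
labels = ["pos", "pos", "neg", "neg", "pos", "neg"] * 2

search = GridSearchCV(pipeline, param_grid, cv=3).fit(docs, labels)
print(search.best_params_)   # best-scoring combination under cross-validation
```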
  10. Evaluation and Results
Fig. 8 shows the accuracy comparison of the classifiers, including accuracies after parameter tuning. For example, for Multinomial Naive Bayes, accuracy with unigram features is 83.028%, while with both unigrams and bigrams as features along with sublinear TF it is 87.336%. As observed from the graph and table, LinearSVC achieves the best accuracy, 90.744%, with unigram plus bigram features and sublinear TF.
 
Fig. 8 Accuracy of Classifiers

Classifier              | Unigrams | Unigrams+Bigrams | Unigrams+Bigrams+SublinearTF
Multinomial Naïve Bayes | 83.028   | 86.852           | 87.336
Logistic Regression     | 88.308   | 88.608           | 89.2
LinearSVC               | 87.772   | 90.24            | 90.744
Table 1 Accuracy values (%) of classifiers

Fig. 9 shows the comparison of the average precision (over positive and negative reviews) of the classifiers for each feature combination. The precision of LinearSVC is the best among all classifiers, at 91%. Table 2 shows the average precision values.
Fig. 9 Average Precision of Classifiers


Classifier              | Unigrams | Unigrams+Bigrams | Unigrams+Bigrams+SublinearTF
Multinomial Naïve Bayes | 84       | 87               | 88
Logistic Regression     | 88       | 89               | 89
LinearSVC               | 88       | 90               | 91
Table 2 Average Precision values (%) for classifiers

Fig. 10 shows the comparison of the average recall (over positive and negative reviews) of the classifiers for each feature combination. The recall of LinearSVC is the best among all classifiers, at 91%. Table 3 shows the average recall values.
Fig. 10 Average Recall Comparison for Classifiers

Classifier              | Unigrams | Unigrams+Bigrams | Unigrams+Bigrams+SublinearTF
Multinomial Naïve Bayes | 83       | 87               | 87
Logistic Regression     | 88       | 89               | 89
LinearSVC               | 88       | 90               | 91
Table 3 Average Recall values (%) for classifiers

Fig. 11 shows the comparison of the average F1 score (over positive and negative reviews) of the classifiers for each feature combination. The F1 score of LinearSVC is the best among all classifiers, at 91%. Table 4 shows the average F1 score values.

Fig. 11 Average F1 Score Comparison for Classifiers

Classifier              | Unigrams | Unigrams+Bigrams | Unigrams+Bigrams+SublinearTF
Multinomial Naïve Bayes | 83       | 87               | 87
Logistic Regression     | 88       | 89               | 89
LinearSVC               | 88       | 90               | 91
Table 4 Average F1 Scores (%) for classifiers

Figs. 12, 13 and 14 show the confusion matrices for Multinomial Naive Bayes, Linear SVC, and Logistic Regression respectively.

Fig. 12 Confusion Matrix for Multinomial Naive Bayes Classifier

Fig. 13 Confusion Matrix for Linear Support Vector Machine


Fig. 14 Confusion Matrix for Logistic Regression Classifier

  11. Conclusion
To sum up, using the large movie review dataset of the Internet Movie Database (IMDB)[2][3], I classified reviews into two categories, positive and negative, using three machine learning techniques: Multinomial Naive Bayes, Support Vector Machine, and Logistic Regression (Max Entropy). I evaluated their performance in classifying the movie reviews by sentiment, and compared the techniques based on accuracy, precision, recall, f1-score, and confusion matrix. As observed from the analysis results, all three classifiers perform better than the results reported in [4]. It can also be concluded that LinearSVC with unigram and bigram feature vectors along with sublinear TF for transformation outperforms the other classifiers, with an accuracy of 90.744% and precision, recall, and F1-score of 91%. In general, the classifiers perform better with a combination of unigrams and bigrams as feature vectors along with sublinear TF transformation of the vectors.

  12. References

AGGIE CODE OF HONOR
‘An Aggie does not lie, cheat or steal or tolerate those who do’