ARTIFICIAL INTELLIGENCE PROJECT PORTFOLIO
- What is Sentiment Analysis?
Sentiment is a thought, view, or opinion a person holds about a topic or object. “Sentiment analysis which is also known as opinion mining refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information in source materials.” [1] It is an Artificial Intelligence technique in that it automates the classification of sentiment in unstructured text, reducing the human effort needed to evaluate the sentiment of a given text. Online movie reviews are one common set of documents used for determining “polarity” about specific objects. Sentiment analysis is a text classification technique that primarily determines the binary polarity (positive or negative) of a document, or the opinion of a speaker or writer with respect to a topic, as shown in Fig. 1. It is very helpful in surveying trends of public opinion on social media for marketing purposes.
Text classification based on sentiment is more challenging than text classification based on topic. Using the large movie review dataset of the Internet Movie Database (IMDB) [2] [3], I have focused on two possible categories for classification, positive and negative, using three machine learning techniques: Multinomial Naive Bayes, Support Vector Machine, and Logistic Regression (Max Entropy). I have evaluated their performance in classifying the movie reviews by sentiment, and compared the techniques based on accuracy, recall and f1-score. Through parameter tuning and variations in features, I achieved accuracy (approximately 91%) that is better than the results reported in [4].
Fig. 1 Sentiment Analysis : Positive or Negative? [17]
- Why Sentiment Analysis is Important?
Sentiment Analysis and Artificial Intelligence
Overall, Sentiment Analysis acts as an Artificial Intelligence technique that incorporates human psychological knowledge into machines in order to train them to automatically identify the type of emotions in a text [12]. Areas where this technique can be applied include mass opinion estimation, psychiatric treatment, corporate reputation measurement, political orientation categorization, stock market prediction, customer preference study, public opinion study and so on [12].
Fig. 2 Brand Monitoring in Social Media - Application of Sentiment Analysis [16]
- My Work
Given the wide range of applications of sentiment classification, I performed sentiment classification on the large Internet Movie Database (IMDB) [2] [3] movie review dataset, using three supervised machine learning techniques: Multinomial Naive Bayes, Linear Support Vector Machine, and Logistic Regression (Max Entropy). I also evaluated and compared the techniques based on precision, recall, f1-score, accuracy, and confusion matrix. I further experimented with and improved the accuracy using combinations of features such as unigrams and bigrams, and parameter tuning such as using sublinear TF for vector transformations.
- Related Work
While researching this topic, I found the technical paper “Thumbs up? Sentiment Classification using Machine Learning Techniques” a good starting point for learning about sentiment-based text classification. The authors experimented with and evaluated the effectiveness of machine learning techniques for classifying by sentiment. They also note that sentiment classification is more challenging than traditional topic-based classification, as the machine learning techniques gave lower accuracy for sentiment classification. However, the authors used a small corpus consisting of just 752 negative and 1301 positive reviews. The dataset from their work is publicly available. I also read the papers “Machine Learning approaches to Sentiment Analysis using Dutch Netlog Corpus” and “Sentiment Classification using Machine Learning Techniques”.
- Movie Review DataSet
I have used the large movie review dataset of the Internet Movie Database (IMDB) [2] [3]. This dataset was collected and labelled by the authors of the paper “Learning Word Vectors for Sentiment Analysis” (ACL 2011). It consists of movie reviews along with their binary sentiment polarity labels: 50,000 movie reviews split into 25,000 training examples (12,500 positive and 12,500 negative reviews) and 25,000 test examples (12,500 positive and 12,500 negative reviews). In the entire collection, the authors included no more than 30 reviews for any given movie, as reviews of the same movie tend to have correlated ratings. Further, the training and test sets contain disjoint sets of movies, so performance cannot be improved by memorizing movie-unique terms and their associated labels. In the labeled dataset, a review with a score less than or equal to 4 out of 10 is assigned a negative label, and a review with a score greater than or equal to 7 out of 10 is assigned a positive label. Reviews with more neutral ratings are thus not included in the train/test sets.
The top-level directory of the dataset is ‘aclImdb’. The training set is inside a folder named ‘train’ and the test set is inside a folder named ‘test’. Each folder contains sub-folders named ‘pos’ and ‘neg’ holding the positive and negative review files. The text files stored in these folders are named in the pattern <movieID>_<rating>.txt, where the rating is on a scale of 1-10.
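As a small illustration, the <movieID>_<rating>.txt naming convention can be parsed with plain string operations (the sample file name below is a made-up example following the dataset's pattern, not a real file):

```python
# Split a review file name into its movie ID and rating parts.
# The file name below is an invented example of the dataset's pattern.
fname = "7719_9.txt"

stem = fname[:-len(".txt")]           # drop the extension -> "7719_9"
movie_id, rating = stem.rsplit("_", 1)  # split on the last underscore

print(movie_id, rating)  # -> 7719 9
```

Splitting on the last underscore keeps this robust even if an ID itself ever contained an underscore.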
- Background and Definitions
5.1 What are various approaches for Sentiment Analysis?
Sentiment Analysis is the automated process of extracting subjective content from text and classifying that content into classes such as positive or negative. Subjective opinions can be split into classes such as positive, negative or neutral, or onto an n-point scale such as ‘excellent’, ‘good’, ‘bad’, ‘worst’, etc. In my work, I have focused on bipolar classification into positive and negative classes.
Approaches to sentiment analysis include lexicon-based, supervised learning and unsupervised learning methods. The lexicon-based approach relies on information about which words or phrases indicate positivity or negativity; SentiWordNet is an example of a publicly available lexical resource. Supervised and unsupervised learning approaches are usually based on corpora, or sets of documents. Sentiment analysis can be done at the word, sentence or document level. Web opinion mining is important nowadays because customers, for example, survey reviews of a product before making a purchase.
I have experimented with various supervised classification techniques for sentiment analysis. Let me introduce the basics of supervised learning and the techniques I used.
5.2 Introduction to Supervised Learning
“Supervised learning is a machine learning task of predicting a label or class for the given test input, using the learning model trained on the labeled training inputs.”[18] The training data in supervised learning usually consists of a set of training examples, where each example is a pair consisting of an input vector and its associated output label. [19] Supervised learning is illustrated in Fig. 3. The feature extractor converts each input datum to a feature set. In the first step, training, pairs of feature sets and labels are fed into the machine learning algorithm. In the second step, prediction, features extracted from the test data by the same feature extractor are input into the trained model to generate the predicted labels. [18]
Fig. 3 Supervised Learning [18]
I have used three techniques of supervised learning - Multinomial Naive Bayes, Logistic Regression and Support Vector Machine (SVM).
5.3 What is Naive Bayes Classifier?
“The Naive Bayes classifier is a simple probabilistic classifier which is based on applying Bayes’ theorem with the “naive” assumption of independence between every pair of features.”[20] Despite the design of the classifier being naive and its assumptions oversimplified, Naive Bayes is a suitable classifier for complex real-world problems. It is also very efficient, requiring little CPU and memory, and can perform well on small amounts of training data. The training time required by Naive Bayes is also small compared to other methods. [21]
Theoretical explanation of Naive Bayes is as follows: (According to [20])
For a class variable y and a feature vector x1 through xn whose components are assumed independent given the class (the “naive” assumption), Bayes’ theorem gives the following relationship:

P(y | x1, ..., xn) = P(y) * P(x1 | y) * ... * P(xn | y) / P(x1, ..., xn)

The different naive Bayes classifiers differ mainly in the assumptions made regarding the distribution of P(xi | y).
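As a tiny worked example of the rule above (all probabilities below are invented toy values, not drawn from the dataset), the posterior for each class is proportional to the class prior times the product of per-feature likelihoods:

```python
# Toy Naive Bayes computation: two classes, two binary word features.
# All probability values here are made-up for illustration.
priors = {"pos": 0.5, "neg": 0.5}
likelihoods = {          # P(feature_i | class) for two features
    "pos": [0.8, 0.1],
    "neg": [0.2, 0.7],
}

# Unnormalized posterior: P(y) * product of P(x_i | y)
scores = {}
for label in priors:
    score = priors[label]
    for p in likelihoods[label]:
        score *= p
    scores[label] = score

# Normalize so the posteriors sum to 1
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}
print(posteriors)
```

Here the "neg" class wins (0.5 * 0.2 * 0.7 = 0.07 beats 0.5 * 0.8 * 0.1 = 0.04) even though the first feature strongly favors "pos", showing how all features contribute jointly.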
5.4 Multinomial Naive Bayes
According to the theory given in [20], “Multinomial Naive Bayes is a variant of Naive Bayes for multinomially distributed data. It is one of the two classic naive Bayes variants used in text classification where the data are typically represented as word vector counts or tf-idf vectors.” The underlying algorithm of multinomial Naive Bayes can be explored further in [20].
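A minimal sketch of Multinomial Naive Bayes on word-count features using scikit-learn (the tiny two-review corpus below is invented purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented two-review training corpus with polarity labels
train_docs = ["a truly great great movie", "an awful boring movie"]
train_labels = ["pos", "neg"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)  # sparse word-count matrix

clf = MultinomialNB()
clf.fit(X_train, train_labels)

# An unseen review containing the positive-class word "great"
X_test = vectorizer.transform(["great acting"])
print(clf.predict(X_test))  # -> ['pos']
```

Note that "acting" is absent from the training vocabulary and is simply ignored by the vectorizer; the prediction rests on the counts of known words.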
5.5 Logistic Regression (Max Entropy)
The name “Logistic Regression” can be misleading: it is a classification technique, not regression. “Regression” appears in the name because a linear model is fit to the feature space. It is a statistical, probabilistic classification model, mostly used to classify a binary class (dependent) variable based on one or more features. The probabilities of the outcomes are modeled as a function of the predictor variables known as the logistic function. The logistic function takes as input any value from negative infinity to positive infinity, whereas its output always lies between zero and one. A basic logistic function is shown in Fig. 4. For models with many categories or classes, multinomial logistic regression or ordered logistic regression is more suitable. [23]
Fig. 4. Basic Logistic Curve [23]
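The logistic function described above can be computed directly; a minimal pure-Python sketch:

```python
import math

def logistic(z):
    """Map any real value into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Output approaches 0 for very negative inputs, is exactly 0.5 at zero,
# and approaches 1 for very positive inputs.
for z in (-6, 0, 6):
    print(z, round(logistic(z), 4))
```

In logistic regression, z is the linear combination of the features, and the logistic output is interpreted as the probability of the positive class.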
5.6 Support Vector Machine (Linear SVM)
Support Vector Machines are supervised learning models used for classification and regression. The linear SVM is a binary, non-probabilistic linear classifier. [24] The model takes labeled training data as input and produces an optimal hyperplane as output. The SVM algorithm aims to find the hyperplane that provides the largest minimum distance to the training examples; the margin is defined as twice this distance. Thus, the hyperplane that maximizes the margin of the training data is the optimal separating hyperplane. [26] This is illustrated in Fig. 5. SVMs generally perform well in text classification and often outperform other classifiers.
Fig. 5 Optimal Hyperplane for Linear SVM [26]
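As a small illustration of a linear separating hyperplane, a LinearSVC can be fit on four hand-made 2-D points (the coordinates below are invented; this is a sketch, not the project's actual setup):

```python
from sklearn.svm import LinearSVC

# Two linearly separable clusters of hand-made 2-D points
X = [[0, 0], [1, 0], [4, 4], [5, 4]]
y = ["neg", "neg", "pos", "pos"]

svm = LinearSVC()
svm.fit(X, y)

# The learned hyperplane satisfies w . x + b = 0
print(svm.coef_, svm.intercept_)
# New points fall on either side of the hyperplane
print(svm.predict([[0.5, 0.5], [4.5, 3.5]]))
```

In the text-classification setting, the 2-D points are replaced by high-dimensional (and sparse) tf-idf vectors, but the geometry is the same.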
- Experimental Setup
To perform the sentiment analysis experiments using pre-built machine learning techniques, I used scikit-learn. Scikit-learn is a machine learning toolkit in Python which provides simple and efficient tools for data mining and data analysis. It is built on NumPy, SciPy and matplotlib. This toolkit has built-in libraries for classification algorithms that can be used for sentiment classification on the dataset described in Section 4.
I used Ubuntu OS and installed the packages required for the experiment: Python 2.7, NumPy, SciPy, python-sklearn (scikit-learn), and matplotlib. I tested the performance of the various classification techniques based on accuracy, precision, recall, f1-score and confusion matrix. The confusion matrix, or error matrix, is a measure for testing the performance of a supervised learning algorithm in a visual manner. The specifics of the performance measures used for evaluation are explained in detail in Section 9.
- Data Preprocessing
The raw data needed to be pre-processed into the proper format before being fed into the algorithms for classification and evaluation. For the implementation of the sentiment analysis project, I referred to the scikit-learn “Working with Text Data” tutorial. [5]
7.1 Reading the data
'''IMDB_Data class stores all relevant information related to an IMDB movie review, such as the ID of the review, its sentiment, and the actual review text'''
The IMDB_Data class stores the movie ID, the associated sentiment and the data read from the text file. The file name of each text file determines the imdbID, and the directory name, “pos” or “neg”, determines the sentiment associated with the review. This is implemented using the readFileNames function shown below, which is called for both the training and test datasets.
import os
import re

def readFileNames(dirName, polarity, imdb_data_type):
    '''Reads the file names in the passed dirName and associates the passed polarity with each file in the directory'''
    for file in os.listdir(dirName):
        fileName = str(file)
        fileName = re.sub('.txt', '', fileName)
        imdbObject = IMDB_Data(fileName, polarity)
        imdb_data_type.append(imdbObject)
The second function, readFiles(), reads the data for all the file names and stores the review text in a data list (imdbData for training and imdbTestData for test) and the labels in a target list (imdbTarget and imdbTestTarget respectively).
def readFiles(folderName, imdb_data_type, data_test_train, target_test_train):
    '''Reads files from the folderName passed as argument and stores them in the imdb_data_type class objects, which can be train or test'''
    for imdbObject in imdb_data_type:
        fileName = folderName + imdbObject.sentiment + '/' + imdbObject.imdbID + '.txt'
        f = open(fileName, 'r')
        imdbObject.data = f.readlines()
        f.close()
        # add each imdb review's data and label to the respective lists
        data_test_train.append(str(imdbObject.data))
        target_test_train.append(str(imdbObject.sentiment))
7.2 Feature Extraction
For performing machine learning tests on text documents, feature vectors need to be extracted from the text.
Bag of Words
Bag of Words is a vector-space representation of text documents (the training set) that represents each document as a vector of words, disregarding word order but keeping the frequency of occurrence of each word in the document. [10] This serves as the feature representation for training the classifier. In scikit-learn it is implemented as a sparse matrix to save space, since the bag of words is a huge collection of words (features) across the set of documents, while each individual document contains only a small subset of them.
Feature Vectors for the text data are obtained by Tokenization and Normalized Frequency Count.
Tokenization
Tokenization is the process of breaking up a text into individual words and phrases, known as tokens. Preprocessing of the text, tokenization, and removal of stopwords are done with the aid of CountVectorizer from the package sklearn.feature_extraction.text. The output of CountVectorizer is a feature vector for each document, representing the frequency of occurrence of each word in that document.
Normalized Frequency Count
Raw word counts favor longer documents, so the counts are normalized into term frequencies (TF) and scaled down for words that occur in many documents of the corpus by the inverse document frequency (IDF). This TF-IDF weighting is applied with TfidfTransformer from sklearn.feature_extraction.text; setting sublinear_tf=True replaces the raw term frequency tf with 1 + log(tf), damping the effect of very frequent words within a document.
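A minimal sketch of the two-step feature extraction, tokenization followed by TF-IDF normalization, on an invented two-document corpus (the document strings are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["a great great movie", "an awful movie"]  # invented examples

# Step 1: tokenize and build sparse bag-of-words counts
counts = CountVectorizer().fit_transform(docs)

# Step 2: reweight counts into TF-IDF values, with sublinear TF scaling
tfidf = TfidfTransformer(sublinear_tf=True).fit_transform(counts)

print(counts.shape, tfidf.shape)  # both: (documents, vocabulary terms)
```

The transformer preserves the matrix shape; only the cell values change from raw counts to normalized weights.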
- Implementation
Training the Classifiers
The next step is to train the classifiers using the labeled data and feature vectors; training is necessary for the classifiers to predict the correct sentiment of the test data. The Pipeline class functions as a compound classifier that vectorizes, transforms and then trains the classifier in sequence.
Naive Bayes Pipeline
'''Pipeline class behaves like a compound classifier as it combines vectorizer, transformer and classifier'''
multinomialNB_classifier = Pipeline([('mnbVectorizer', CountVectorizer(ngram_range=(1, 2))),
                                     ('mnbTfidf', TfidfTransformer(sublinear_tf=True)),
                                     ('mnbClassifier', MultinomialNB()),
                                     ])
multinomialNB_classifier = multinomialNB_classifier.fit(imdbData, imdbTarget)
Logistic Regression Pipeline
maxEntropy_classifier = Pipeline([('maxEntropyVectorizer', CountVectorizer(ngram_range=(1, 2))),
                                  ('maxEntropyTfidf', TfidfTransformer(sublinear_tf=True)),
                                  ('maxEntropyClassifier', LogisticRegression(penalty='l2')),
                                  ])
maxEntropy_classifier = maxEntropy_classifier.fit(imdbData, imdbTarget)
Support Vector Machine Pipeline
supportVectorMachine_classifier = Pipeline([('svmVectorizer', CountVectorizer(ngram_range=(1, 2))),
                                            ('svmTfidf', TfidfTransformer(sublinear_tf=True)),
                                            ('svmClassifier', LinearSVC()),
                                            ])
_ = supportVectorMachine_classifier.fit(imdbData, imdbTarget)
Prediction and Accuracy
The code snippet below is used to predict labels for the movie review test data and check the accuracy percentage for each of the three classifiers. The accuracy percentages and other evaluation results are described in Sections 9 and 10.
predicted_test_output = multinomialNB_classifier.predict(imdbTestData)
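The accuracy percentage can then be obtained by comparing the predictions with the labeled targets; a minimal sketch with plain Python lists standing in for the real prediction and target lists (the four labels below are invented):

```python
# Toy stand-ins for the real predicted and target label lists
predicted_test_output = ["pos", "neg", "pos", "pos"]
imdbTestTarget = ["pos", "neg", "neg", "pos"]

# Fraction of positions where prediction matches the label, as a percentage
correct = sum(p == t for p, t in zip(predicted_test_output, imdbTestTarget))
accuracy = 100.0 * correct / len(imdbTestTarget)
print(accuracy)  # -> 75.0
```

The same comparison over the real 25,000-review test set yields the accuracy values reported in Section 10.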
Implementation Files
<ClassifierName>_parameterTuned.png
For example, LinearSVC_UnigramBigram_SublinearTF.png contains the results for LinearSVC with unigram and bigram features and the sublinear TF parameter for TfidfTransformer.
- Performance Measures
The various measures used for performance analysis are:
Accuracy
Accuracy measures the closeness of the predicted value to the target (labeled) value. In classification, accuracy is defined as the number of correct predictions divided by the total number of predictions made (multiplied by 100 to express it as a percentage). [28]
In the sentiment classification of movie reviews, an accuracy of 80%, for example, indicates that 80 out of 100 samples are classified correctly as positive or negative; that is, for 80 movie reviews the predicted class equals the target (labeled) class.
Precision
Precision is also known as positive predictive value. It is defined as the fraction of retrieved instances that are relevant. In classification, precision is the number of true positives (movie reviews correctly labeled as belonging to the positive class) divided by the total number of reviews predicted as belonging to the positive class (the sum of true positives and false positives, where false positives are reviews incorrectly labeled as belonging to the class). [27]
Thus, precision = number of true positives / (number of true positives + number of false positives).
Recall
Recall is also known as sensitivity. It is the fraction of relevant instances that are retrieved.
Thus, recall = number of true positives / total number of elements belonging to the positive class (the sum of true positives and false negatives, where false negatives are reviews that should have been labeled positive but were labeled negative). [27] Fig. 6 shows precision and recall pictorially.
Fig. 6 Precision and Recall [27]
There is generally an inverse relationship between precision and recall, as one can be increased at the cost of the other.
F1 Score or F Measure
F1 Score is the harmonic mean of precision and recall, bringing a balance between the two.
F1 Score = 2 * (precision * recall) / (precision + recall) [27]
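The three measures above can be computed directly from the confusion counts; a small pure-Python sketch with invented counts for the positive class:

```python
# Invented confusion counts for the positive class
tp, fp, fn = 80, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)                            # 80 / 90
recall = tp / (tp + fn)                               # 80 / 100
f1 = 2 * precision * recall / (precision + recall)    # harmonic mean

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Note how the F1 score sits between precision and recall but closer to the smaller of the two, penalizing an imbalance between them.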
Confusion Matrix
It is also known as a contingency table or error matrix. It is a visual representation of the performance of a supervised classification technique: the columns represent the predicted labels for the reviews and the rows represent the actual labels.
Example of Confusion Matrix is shown in Fig. 7.
Fig. 7 Confusion Matrix
print(metrics.classification_report(imdbTestTarget, predicted_test_output,
                                    target_names=movie_review_target_names))
confusion_matrix = metrics.confusion_matrix(imdbTestTarget, predicted_test_output)
Parameter Tuning in Machine Learning Techniques
- Evaluation and Results
Fig. 8 shows the accuracy comparison of the classifiers, including the various accuracies obtained with parameter tuning. For example, for Multinomial Naive Bayes, the accuracy with unigram features is 83.028%, while with both unigrams and bigrams as features along with sublinear TF it is 87.336%. As observed from the graph and table, Linear SVC achieves the best accuracy, 90.744%, with unigrams plus bigrams plus sublinear TF.
Fig. 8 Accuracy of Classifiers
Classifier               | Unigrams | Unigrams+Bigrams | Unigrams+Bigrams+SublinearTF
Multinomial Naïve Bayes  | 83.028   | 86.852           | 87.336
Logistic Regression      | 88.308   | 88.608           | 89.2
Linear SVC               | 87.772   | 90.24            | 90.744

Table 1 Accuracy values of classifiers (%)
Fig. 9 shows the comparison of the average precision (over both positive and negative reviews) of the classifiers, including the feature combinations for each classifier. The precision of LinearSVC is the best among all classifiers, at 91%. Table 2 shows the average precision values.
Fig. 9 Average Precision of Classifiers
Classifier               | Unigrams | Unigrams+Bigrams | Unigrams+Bigrams+SublinearTF
Multinomial Naïve Bayes  | 84       | 87               | 88
Logistic Regression      | 88       | 89               | 89
Linear SVC               | 88       | 90               | 91

Table 2 Average Precision Values for Classifiers (%)
Fig. 10 shows the comparison of the average recall (over both positive and negative reviews) of the classifiers, including the feature combinations for each classifier. The recall of LinearSVC is the best among all classifiers, at 91%. Table 3 shows the average recall values.
Fig. 10 Average Recall Comparison for Classifiers
Classifier               | Unigrams | Unigrams+Bigrams | Unigrams+Bigrams+SublinearTF
Multinomial Naïve Bayes  | 83       | 87               | 87
Logistic Regression      | 88       | 89               | 89
Linear SVC               | 88       | 90               | 91

Table 3 Average Recall Values for Classifiers (%)
Fig. 11 shows the comparison of the average F1 score (over both positive and negative reviews) of the classifiers, including the feature combinations for each classifier. The F1 score of LinearSVC is the best among all classifiers, at 91%. Table 4 shows the average F1 score values.
Fig. 11 Average F1 Score Comparison for Classifiers
Classifier               | Unigrams | Unigrams+Bigrams | Unigrams+Bigrams+SublinearTF
Multinomial Naïve Bayes  | 83       | 87               | 87
Logistic Regression      | 88       | 89               | 89
Linear SVC               | 88       | 90               | 91

Table 4 Average F1 Scores for Classifiers (%)
Figs. 12, 13 and 14 show the confusion matrices for Multinomial Naive Bayes, Linear SVC and Logistic Regression respectively.
Fig. 12 Confusion Matrix for Multinomial Naive Bayes Classifier
Fig. 13 Confusion Matrix for Linear Support Vector Machine
Fig. 14 Confusion Matrix for Logistic Regression Classifier
- Conclusion
To sum up, using the large movie review dataset of the Internet Movie Database (IMDB) [2] [3], I focused on two categories for classification, positive and negative, using three machine learning techniques: Multinomial Naive Bayes, Support Vector Machine, and Logistic Regression (Max Entropy). I evaluated their performance in classifying the movie reviews by sentiment and compared the techniques based on accuracy, precision, recall, f1-score and confusion matrix. As observed from the analysis results, all three classifiers perform better than the results reported in [4]. It can also be concluded that LinearSVC with unigram and bigram feature vectors along with sublinear TF for the transformation outperforms the other classifiers, with an accuracy of 90.744% and precision, recall and F1-score of 91%. Further, all classifiers perform better with the combination of unigrams and bigrams as feature vectors along with the sublinear TF transformation of the vectors.
- References
[6] http://scikit-learn.org/stable/modules/classes.html#text-feature-extraction-ref
[7] http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
[9] http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
AGGIE CODE OF HONOR
‘An Aggie does not lie, cheat or steal or tolerate those who do’