Sentiment analysis is a very useful (and fun) technique for analysing text data. In this piece, we'll explore three simple ways to perform sentiment analysis in Python.
I manually extracted several reviews of my favourite Korean and Thai restaurants in Singapore. I chose what I perceive as "easier" texts that are less ambiguous and have words that clearly characterise positive or negative sentiments.
dataset = [
    "Food is good and not too expensive. Serving is just right for adult. Ambient is nice too.",
    "Used to be good. Chicken soup was below average, bbq used to be good.",
    "Food was good, standouts were the spicy beef soup and seafood pancake! ",
    "Good office lunch or after work place to go to with a big group as they have a lot of private areas with large tables",
    "As a Korean person, it was very disappointing food quality and very pricey for what you get. I won't go back there again. ",
    "Not great quality food for the price. Average food at premium prices really.",
    "Fast service. Prices are reasonable and food is decent.",
    "Extremely long waiting time. Food is decent but definitely not worth the wait.",
    "Clean premises, tasty food. My family favourites are the clear Tom yum soup, stuffed chicken wings, chargrilled squid.",
    "really good and authentic Thai food! in particular, we loved their tom yup clear soup with sliced fish. it's so well balanced that we can have it just on its own. "
]
Google Cloud Platform (GCP) is the first sentiment analysis tool that I've used and is the one I'm most familiar with.
Google Cloud used to support authentication through the Google Cloud SDK, but an update now recommends authenticating with service accounts instead. Please refer to Google's authentication guide to set one up.
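Once you have a service account key, a common way to make it available to Google's client libraries is the GOOGLE_APPLICATION_CREDENTIALS environment variable. A minimal sketch (the path below is a placeholder, not my actual key):

```python
import os

# Point Google's client libraries at a service account key file.
# The path here is a placeholder; use the full path to your own key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
```

With the variable set, most Google client libraries pick up the credentials automatically; alternatively, you can pass the key path explicitly, as the function below does.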
To perform a sentiment analysis with Natural Language API, we use this piece of code:
def gc_sentiment(text):
    from google.cloud import language

    path = '/Users/Yourname/YourProjectName-123456.json'  # FULL path to your service account key
    client = language.LanguageServiceClient.from_service_account_json(path)
    document = language.types.Document(
        content=text,
        type=language.enums.Document.Type.PLAIN_TEXT)
    annotations = client.analyze_sentiment(document=document)
    score = annotations.document_sentiment.score
    magnitude = annotations.document_sentiment.magnitude
    return score, magnitude
To install the Natural Language client library for Python, run a pip install:
$ pip install google-cloud-language
We'll iterate the list above through the gc_sentiment function, which outputs the sentiment score and its magnitude.
The score runs from -1 (negative) to 1 (positive), while the magnitude measures the strength of that sentiment (think "OH MY GOODNESS I just got a new pigeon! YAY!" versus "I got a new pigeon and I think I am feeling happiness.").
Now let's run our dataset through that function and output a pandas dataframe.
from tqdm import tqdm # This is an awesome package for tracking for loops
import pandas as pd
gc_results = [gc_sentiment(row) for row in tqdm(dataset, ncols = 100)]
gc_score, gc_magnitude = zip(*gc_results) # Unpacking the result into 2 lists
gc = list(zip(dataset, gc_score, gc_magnitude))
columns = ['text', 'score', 'magnitude']
gc_df = pd.DataFrame(gc, columns = columns)
Make sure you run pip install for tqdm. It is an awesome package for tracking the progress of for loops and it's very helpful when you iterate through large lists.
I'm really impressed by this result. In my experience, GCP's Natural Language API is quite accurate in identifying sentiments and their magnitude.
The result above is exported to Google Sheets, but you can output to Excel too with the various tools for working with Excel from Python.
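As a sketch of that export step (the dataframe contents and filenames here are made up), a CSV can be imported straight into Google Sheets, while pandas can also write an Excel workbook directly:

```python
import pandas as pd

# A hypothetical results dataframe with the same columns as gc_df above
df = pd.DataFrame(
    [("Food is good and not too expensive.", 0.8, 0.8)],
    columns=["text", "score", "magnitude"],
)

# A CSV file can be imported straight into Google Sheets
df.to_csv("gc_sentiment.csv", index=False)

# With the openpyxl package installed, you can write an Excel workbook instead:
# df.to_excel("gc_sentiment.xlsx", index=False)
```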
Next, we'll work on Azure's Text Analytics API.
For Azure, I find the authentication process really simple.
Firstly, register for a free Azure account, and then follow the steps linked to obtain Text Analytics API's resource.
Once you're done, you can get the two required pieces of information for running the Text Analytics API. Note that the endpoint will differ based on the configuration you chose.
You're now ready to use the API in Python:
def azure_sentiment(text):
    import requests

    documents = {'documents': [
        {'id': '1', 'text': text}
    ]}
    azure_key = '[your key]'  # Update here
    azure_endpoint = '[your endpoint]'  # Update here
    assert azure_key
    sentiment_azure = azure_endpoint + '/sentiment'
    headers = {"Ocp-Apim-Subscription-Key": azure_key}
    response = requests.post(sentiment_azure, headers=headers, json=documents)
    sentiments = response.json()
    return sentiments
The "documents" object formats the text input into the required input format for Azure's Text Analytics API.
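To make the response handling concrete, here's an illustrative payload (the values are made up, not a live API call) in the shape the v2 sentiment endpoint returns, which is why we index into ['documents'][0]['score'] when extracting scores:

```python
# Illustrative (hypothetical) payload in the shape of a v2 sentiment response
sample_response = {
    "documents": [{"id": "1", "score": 0.93}],
    "errors": [],
}

# Pull out the 0-to-1 sentiment score for the first (and only) document
score = sample_response["documents"][0]["score"]
print(score)  # 0.93
```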
To run our dataset through Azure's Text Analytics API, we use the following piece of code:
azure_results = [azure_sentiment(text) for text in dataset]
azure_score = [row['documents'][0]['score'] for row in azure_results] # Extract score from the dict
azure = list(zip(dataset, azure_score))
columns = ['text', 'score']
azure_df = pd.DataFrame(azure, columns = columns)
While GCP's Natural Language API grades the sentiment from -1 to 1, Azure's Text Analytics API grades from 0 to 1, where 1 is positive sentiment and 0 is negative.
I find the result less ideal than GCP's. It's still okay, but I would certainly rate row 9's "Extremely long waiting time" review closer to row 7's "As a Korean person" review.
This final method uses Python's NLTK package. It's a very flexible package: you can even train and build your own sentiment analyser with its NaiveBayesClassifier class.
However, this post is about "Simple" sentiment analysis, so we'll be using the VADER's SentimentIntensityAnalyzer instead of training our own.
To use NLTK's VADER analyser, we first need to download its lexicon using nltk.download().
import nltk
nltk.download('vader_lexicon')
And the following is the function we'll iterate the dataset with:
def nltk_sentiment(sentence):
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    score = analyzer.polarity_scores(sentence)
    return score
It's a very simple function, but let's see how accurate it is. We use the following set of code to build the result into a dataframe:
nltk_results = [nltk_sentiment(row) for row in dataset]
results_df = pd.DataFrame(nltk_results)
text_df = pd.DataFrame(dataset, columns = ['text'])
nltk_df = text_df.join(results_df)
The "compound" column is the definitive rating of the sentiment, while the other three columns are a more detailed view of negativity, neutrality and positivity of the review.
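For reference, a common convention (VADER's documentation suggests a ±0.05 threshold on the compound score) is to bucket the compound value into a label. The scores dict below is hypothetical, just to illustrate the shape of a VADER result:

```python
# A hypothetical VADER result for one review
scores = {"neg": 0.0, "neu": 0.58, "pos": 0.42, "compound": 0.77}

def label(compound):
    """Bucket a compound score using the common +/-0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(label(scores["compound"]))  # positive
```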
While I like the level of detail, I find the result not exactly accurate: rows 3 and 9 are clearly negative, and row 8 should lean more positive than neutral.
Perhaps the dataset is not ideal for NLTK, or the texts could be too short or too ambiguous. But again, NLTK's strength lies in the trainable NaiveBayesClassifier.
Here's a chart to easily compare the performance of the three methods above. Azure's result needed a bit of manipulation to expand its range into negative territory.
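That manipulation is just a linear rescale from [0, 1] onto [-1, 1]. A sketch, using made-up Azure scores:

```python
# Hypothetical Azure scores in [0, 1]
azure_score = [0.11, 0.56, 0.93]

# Map [0, 1] onto [-1, 1] so all three methods share one axis
rescaled = [round(2 * s - 1, 2) for s in azure_score]
print(rescaled)  # [-0.78, 0.12, 0.86]
```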
Again, I would say that Google Cloud Platform's Natural Language API did very well here. It was the only one out of the three that managed to detect the negativity in the "Used to be good" and "Extremely long waiting time" reviews.
You might find it odd that I evaluated GCP and Azure but not Amazon Web Services' (AWS) text analytics service, Amazon Comprehend. Frankly, I found the authentication steps for Amazon Comprehend extremely complex and lacked the patience to complete them. To be honest, GCP's authentication flow can be complex too, but I had my MacBook authenticated a long time ago for work purposes, so using GCP's APIs is very easy for me.
While it seems that GCP's Natural Language API trumps the rest, note that accuracy depends on the dataset, and a model trained on your particular data is the best way forward.
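As a taste of that trainable route, here's a tiny NaiveBayesClassifier sketch. The training data and labels are made up for illustration; a real classifier needs far more labelled examples than this:

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    # Bag-of-words presence features
    return {word: True for word in text.lower().split()}

# Tiny, hand-labelled toy training set
train = [
    (features("food is good and tasty"), "pos"),
    (features("really good authentic food"), "pos"),
    (features("disappointing food and pricey"), "neg"),
    (features("below average long waiting time"), "neg"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("tasty and good")))
```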
Google owns a lot of data, which could be why its API trumps the rest. It's certainly a quick, simple way to analyse any interesting dataset you might have.
Note that this post is not a critique or review of the sentiment analyser products. Please go ahead and use whichever method that gives you the best result!
Analysing text data is a lot of fun, as there's plenty of it online for us to analyse (e.g. Reddit threads, tweets, Facebook comments). Sentiment analysis is especially useful when you want to track how people perceive a topic or product, or when you're just curious how good or nasty people are.
I hope this post has empowered and upskilled you in analysing sentiments of any text data. Go forth and start analysing!
If you're interested in analysing text data and would like to do a term frequency analysis of a webpage, this tool I've built might interest you.