Sentiment analysis is a very useful (and fun) technique for analysing text data. In this piece, we'll explore three simple ways to perform sentiment analysis in Python.
I manually extracted several reviews of my favourite Korean and Thai restaurants in Singapore. I chose what I perceive as "easier" texts that are less ambiguous and have words that clearly characterise positive or negative sentiments.
dataset = [
    "Food is good and not too expensive. Serving is just right for adult. Ambient is nice too.",
    "Used to be good. Chicken soup was below average, bbq used to be good.",
    "Food was good, standouts were the spicy beef soup and seafood pancake! ",
    "Good office lunch or after work place to go to with a big group as they have a lot of private areas with large tables",
    "As a Korean person, it was very disappointing food quality and very pricey for what you get. I won't go back there again. ",
    "Not great quality food for the price. Average food at premium prices really.",
    "Fast service. Prices are reasonable and food is decent.",
    "Extremely long waiting time. Food is decent but definitely not worth the wait.",
    "Clean premises, tasty food. My family favourites are the clear Tom yum soup, stuffed chicken wings, chargrilled squid.",
    "really good and authentic Thai food! in particular, we loved their tom yup clear soup with sliced fish. it's so well balanced that we can have it just on its own. "
]
Google Cloud Platform (GCP) is the first sentiment analysis tool that I've used and is the one I'm most familiar with.
Google Cloud used to support authentication through the Google Cloud SDK, but an update now recommends authenticating with service accounts instead. Please refer to Google's authentication guide to set one up.
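Once you have a service account key, a common way to make it available to Google's client libraries is the GOOGLE_APPLICATION_CREDENTIALS environment variable. A minimal sketch (the path below is a placeholder, not my actual key):

```python
import os

# Point Google's client libraries at a service account key file.
# The path here is a placeholder; use the full path to your own key.
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/path/to/service-account-key.json"
```

With the variable set, most Google client libraries pick up the credentials automatically; alternatively, you can pass the key path explicitly, as the function below does.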
To perform a sentiment analysis with Natural Language API, we use this piece of code:
def gc_sentiment(text):
    from google.cloud import language

    path = '/Users/Yourname/YourProjectName-123456.json'  # FULL path to your service account key
    client = language.LanguageServiceClient.from_service_account_json(path)
    document = language.types.Document(
        content=text,
        type=language.enums.Document.Type.PLAIN_TEXT)
    annotations = client.analyze_sentiment(document=document)
    score = annotations.document_sentiment.score
    magnitude = annotations.document_sentiment.magnitude
    return score, magnitude
To install the Natural Language client library for Python, run a pip install:
$ pip install google-cloud-language
We'll iterate the list above through the gc_sentiment function, which outputs the sentiment score and its magnitude.
The score runs from -1 (negative) to 1 (positive), while the magnitude measures the strength of that sentiment (think "OH MY GOODNESS I just got a new pigeon! YAY!" versus "I got a new pigeon and I think I am feeling happiness.").
Now let's run our dataset through that function and output a pandas dataframe.
from tqdm import tqdm # This is an awesome package for tracking for loops
import pandas as pd
gc_results = [gc_sentiment(row) for row in tqdm(dataset, ncols = 100)]
gc_score, gc_magnitude = zip(*gc_results) # Unpacking the result into 2 lists
gc = list(zip(dataset, gc_score, gc_magnitude))
columns = ['text', 'score', 'magnitude']
gc_df = pd.DataFrame(gc, columns = columns)
Make sure you run pip install for tqdm. It is an awesome package for tracking the progress of for loops and it's very helpful when you iterate through large lists.
I'm really impressed by this result. In my experience, GCP's Natural Language API is quite accurate in identifying sentiments and their magnitude.
The result above is exported to Google Sheets, but you can output to Excel too with the various tools for working with Excel from Python.
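As a sketch of that export step (the dataframe contents and filenames here are made up), a CSV can be imported straight into Google Sheets, while pandas can also write an Excel workbook directly:

```python
import pandas as pd

# A hypothetical results dataframe with the same columns as gc_df above
df = pd.DataFrame(
    [("Food is good and not too expensive.", 0.8, 0.8)],
    columns=["text", "score", "magnitude"],
)

# A CSV file can be imported straight into Google Sheets
df.to_csv("gc_sentiment.csv", index=False)

# With the openpyxl package installed, you can write an Excel workbook instead:
# df.to_excel("gc_sentiment.xlsx", index=False)
```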
Next, we'll work on Azure's Text Analytics API.
For Azure, I find the authentication process really simple.
Firstly, register for a free Azure account, and then follow the steps linked to obtain Text Analytics API's resource.
Once you're done, you can get the two required pieces of information for running the Text Analytics API. Note that the endpoint will differ based on the configuration you chose.
You're now ready to use the API in Python:
def azure_sentiment(text):
    import requests

    documents = {'documents': [
        {'id': '1', 'text': text}
    ]}
    azure_key = '[your key]'  # Update here
    azure_endpoint = '[your endpoint]'  # Update here
    assert azure_key
    sentiment_azure = azure_endpoint + '/sentiment'
    headers = {"Ocp-Apim-Subscription-Key": azure_key}
    response = requests.post(sentiment_azure, headers=headers, json=documents)
    sentiments = response.json()
    return sentiments
The "documents" object formats the text input into the required input format for Azure's Text Analytics API.
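To make the response handling concrete, here's an illustrative payload (the values are made up, not a live API call) in the shape the v2 sentiment endpoint returns, which is why we index into ['documents'][0]['score'] when extracting scores:

```python
# Illustrative (hypothetical) payload in the shape of a v2 sentiment response
sample_response = {
    "documents": [{"id": "1", "score": 0.93}],
    "errors": [],
}

# Pull out the 0-to-1 sentiment score for the first (and only) document
score = sample_response["documents"][0]["score"]
print(score)  # 0.93
```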
To run our dataset through Azure's Text Analytics API, we use the following piece of code:
azure_results = [azure_sentiment(text) for text in dataset]
azure_score = [row['documents'][0]['score'] for row in azure_results] # Extract score from the dict
azure = list(zip(dataset, azure_score))
columns = ['text', 'score']
azure_df = pd.DataFrame(azure, columns = columns)
While GCP's Natural Language API grades the sentiment from -1 to 1, Azure's Text Analytics API grades from 0 to 1, where 1 is positive sentiment and 0 is negative.
I find the result less ideal than GCP's. It's still okay, but I would certainly rate row 9's "Extremely long waiting time" review closer to row 7's "As a Korean person" review.
This final method uses Python's NLTK package. It's a very flexible package: you can even train and build your own sentiment analyser with its NaiveBayesClassifier class.
However, this post is about "Simple" sentiment analysis, so we'll be using the VADER's SentimentIntensityAnalyzer instead of training our own.
To use NLTK's VADER analyser, we first need to download its lexicon using nltk.download().
import nltk
nltk.download('vader_lexicon')
And the following is the function we'll iterate the dataset with:
def nltk_sentiment(sentence):
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    score = analyzer.polarity_scores(sentence)
    return score
It's a very simple function, but let's see how accurate it is. We use the following set of code to build the result into a dataframe:
nltk_results = [nltk_sentiment(row) for row in dataset]
results_df = pd.DataFrame(nltk_results)
text_df = pd.DataFrame(dataset, columns = ['text'])
nltk_df = text_df.join(results_df)
The "compound" column is the definitive rating of the sentiment, while the other three columns are a more detailed view of negativity, neutrality and positivity of the review.
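For reference, a common convention (VADER's documentation suggests a ±0.05 threshold on the compound score) is to bucket the compound value into a label. The scores dict below is hypothetical, just to illustrate the shape of a VADER result:

```python
# A hypothetical VADER result for one review
scores = {"neg": 0.0, "neu": 0.58, "pos": 0.42, "compound": 0.77}

def label(compound):
    """Bucket a compound score using the common +/-0.05 thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(label(scores["compound"]))  # positive
```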
While I like the level of detail, I find the result not exactly accurate: rows 3 and 9 are clearly negative, and row 8 should lean more positive than neutral.
Perhaps the dataset is not ideal for NLTK, or the texts could be too short or too ambiguous. But again, NLTK's strength lies in the trainable NaiveBayesClassifier.
Here's a chart to easily compare the performance of the three methods above. Azure's result needed a bit of manipulation to expand its range into negative territory.
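That manipulation is just a linear rescale from [0, 1] onto [-1, 1]. A sketch, using made-up Azure scores:

```python
# Hypothetical Azure scores in [0, 1]
azure_score = [0.11, 0.56, 0.93]

# Map [0, 1] onto [-1, 1] so all three methods share one axis
rescaled = [round(2 * s - 1, 2) for s in azure_score]
print(rescaled)  # [-0.78, 0.12, 0.86]
```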
Again, I would say that Google Cloud Platform's Natural Language API did very well here. It was the only one out of the three that managed to detect the negativity in the "Used to be good" and "Extremely long waiting time" reviews.
You might find it odd that I evaluated GCP and Azure but not Amazon Web Services' (AWS) text analytics service, Amazon Comprehend. Frankly, I found the authentication steps for Amazon Comprehend extremely complex and lacked the patience to complete them. To be honest, GCP's authentication flow can be complex too, but I had my MacBook authenticated a long time ago for work purposes, so using GCP's APIs is very easy for me.
While it seems that GCP's Natural Language API trumps the rest, note that accuracy depends on the dataset, and a model trained on your particular data is the best way forward.
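As a taste of that trainable route, here's a tiny NaiveBayesClassifier sketch. The training data and labels are made up for illustration; a real classifier needs far more labelled examples than this:

```python
from nltk.classify import NaiveBayesClassifier

def features(text):
    # Bag-of-words presence features
    return {word: True for word in text.lower().split()}

# Tiny, hand-labelled toy training set
train = [
    (features("food is good and tasty"), "pos"),
    (features("really good authentic food"), "pos"),
    (features("disappointing food and pricey"), "neg"),
    (features("below average long waiting time"), "neg"),
]

classifier = NaiveBayesClassifier.train(train)
print(classifier.classify(features("tasty and good")))
```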
Google owns a lot of data, which could be why its API trumps the rest. It's certainly a quick, simple way to analyse any interesting dataset you might have.
Note that this post is not a critique or review of the sentiment analyser products. Please go ahead and use whichever method that gives you the best result!
Analysing text data is a lot of fun, as there's plenty of it online for us to analyse (e.g. Reddit threads, tweets, Facebook comments). Sentiment analysis is especially useful when you want to track how people perceive a topic or product, or when you're just curious how good or nasty people are.
I hope this post has empowered and upskilled you in analysing sentiments of any text data. Go forth and start analysing!
If you're interested in analysing text data and would like to do a term frequency analysis of a webpage, this tool I've built might interest you.