Product reviews online are one of the best sources for understanding a product before making a purchase decision.
With the sheer number of reviews available online, we'll use Python to quickly get the gist of each review, analyse its sentiment and stance, and automate away the boring work of deciding which reviews are worth diving into.
We'll be building on the previous post about scraping Google search results. Please refer to it if you need to understand the preliminary scraping code, which won't be covered again here.
To make sure we're only collecting Google Home reviews, we'll use Google Search operators to require that all three of those words appear in the page title.
Of course, some review pages might not have those three words in their title, but such a page would be poorly optimised for search anyway, so we can safely ignore it.
We'll also exclude YouTube results from the search results page, since those reviews are in video form rather than text.
# =============================================================================
# Setting Up
# =============================================================================
import urllib.parse

search_title = '"google home review"'  # Double quotes force an exact phrase match in the title
exclude_url = ['youtube']  # You may add more exclusions

query = "intitle:" + search_title
for exclude in exclude_url:
    query = query + " -inurl:" + exclude

query = urllib.parse.quote_plus(query)
number_result = 20
More URL exclusions can be added if you find that your results contain lousy, biased, or click-bait sites.
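For example, extending the exclusion list before the query is built is all it takes (the extra entries below are hypothetical placeholders; swap in whichever domains you want to drop):

exclude_url = ['youtube', 'slideshare', 'pinterest']  # hypothetical extra exclusions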
As mentioned earlier, I won't be walking through this portion, as the code is exactly the same as in the post about scraping Google search results. Please read that post for a full understanding of this section.
For convenience, I'll paste the full code here so it's easy to follow along, but please refer to that post before asking any questions about this section.
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

ua = UserAgent()

google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
response = requests.get(google_url, headers={"User-Agent": ua.random})
soup = BeautifulSoup(response.text, "html.parser")

result_div = soup.find_all('div', attrs={'class': 'g'})

links = []
titles = []
descriptions = []
for r in result_div:
    # Some result blocks lack a link, title or description; skip those
    try:
        link = r.find('a', href=True)
        title = r.find('h3', attrs={'class': 'r'}).get_text()
        description = r.find('span', attrs={'class': 'st'}).get_text()

        if link != '' and title != '' and description != '':
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    except:
        continue
import re

to_remove = []
clean_links = []
for i, l in enumerate(links):
    clean = re.search(r'\/url\?q\=(.*)\&sa', l)

    # Anything that doesn't match the Google redirect pattern gets removed
    if clean is None:
        to_remove.append(i)
        continue
    clean_links.append(clean.group(1))

# Delete in reverse order so earlier deletions don't shift the remaining indices
for x in sorted(to_remove, reverse=True):
    del titles[x]
    del descriptions[x]
Now that we have all the links in the clean_links variable, we'll scrape the body text of each page.
For those unfamiliar with web development: we're not pulling the "content" block specifically, because every website defines its "content" block, i.e. the part where the actual written post lives, differently.
The "body text" is everything visible on the page, including the header, footer, social links and so on, so do expect some junk in the scrape results.
Note that a one-size-fits-all body text scraper is never truly possible, because websites differ so much from one another. This compromise will have consequences later on.
First, we create a helper function that filters out all the elements that aren't visible "body" text.
def tag_visible(element):
    from bs4.element import Comment
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
Next, the following function does the actual body text scraping:
def get_body_text(url, user_agent):
    """Get the full body text of the specified URL"""
    import requests
    from bs4 import BeautifulSoup
    import time
    from random import randint

    # Random pause so we don't hammer the sites we're scraping
    time.sleep(randint(0, 2))
    ua = {"User-Agent": user_agent}

    try:
        response = requests.get(url, headers=ua)
        print(response)
        if response.status_code != 200:
            print('Status code is not 200')
            return '', response.status_code
    except:
        # The request itself failed, so there's no status code to report
        print('critical error')
        return '', None

    soup = BeautifulSoup(response.content, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return_text = u"\n".join(t.strip() for t in visible_texts)

    return return_text, response.status_code
Some explanations about the function: it sleeps for a random zero to two seconds before each request so we don't hammer the sites, it returns an empty string along with the response code whenever the page can't be fetched cleanly, and the visible text nodes are joined with newline characters, which will matter later when we split the text back into "paragraphs".
To run this function over all our links, we use the following script:
texts = []
response_code = []
for link in clean_links:
    t, r = get_body_text(link, ua.random)
    texts.append(t)
    response_code.append(r)
"texts" now houses all your body texts, which can be a load of incomprehensible texts - we'll use a summariser later.
To get a good gist of how favourable the reviews are towards Google Home, we'll use Google's own Natural Language API. I've previously tested it to be the most reliable of the easily accessible sentiment analysers in Python.
Please refer to the post linked for the full analysis and comparison, but here's the code snippet we need to analyse the sentiment:
def gc_sentiment(text):
    from google.cloud import language

    path = '/Users/Yourname/YourProjectName-123456.json'  # FULL path to your service account key
    client = language.LanguageServiceClient.from_service_account_json(path)

    document = language.types.Document(
        content=text,
        type=language.enums.Document.Type.PLAIN_TEXT)

    annotations = client.analyze_sentiment(document=document)
    score = annotations.document_sentiment.score
    magnitude = annotations.document_sentiment.magnitude
    return score, magnitude
Now let's call the function and loop through all our body texts.
scores = []
magnitudes = []
for x in texts:
    # Check if the body text actually exists
    if x == '':
        scores.append(None)
        magnitudes.append(None)
    else:
        s, m = gc_sentiment(x)
        scores.append(s)
        magnitudes.append(m)
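If you want to eyeball the results before going any further, a quick optional check is to pair each page title with its score using the lists we already have:

# Optional sanity check: print each result's sentiment score next to its title
for title, score in zip(titles, scores):
    print(score, '-', title)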
As for product reviews, I expect most of the scores to be around the middle i.e. zero, unless the product is absolute shit or absolute greatness.
While there are reliable text summarisers out there for Python, I'll be using a simple TF-IDF vectoriser method instead.
TF-IDF is a term-weighting scheme that scores terms in a body of text based on two frequencies: term frequency and document frequency, the latter of which is "inverted".
This means that a term appearing in too many documents, such as common function words, will be penalised and won't rank highly in the TF-IDF output.
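To make that concrete, here's a tiny self-contained illustration on a hypothetical three-sentence corpus (not part of our scrape): a word that appears in every document, like "the", gets a dampened weight compared with words that are distinctive to one document, like "terrible".

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "the speaker sounds great",
    "the speaker looks great",
    "the microphone is terrible",
]
tfidf = TfidfVectorizer()  # No stopword removal here, so "the" stays in to show the effect
matrix = tfidf.fit_transform(toy_corpus).toarray()

# Rank the terms of the third sentence by their TF-IDF weight
weights = matrix[2]
ranked = sorted(tfidf.vocabulary_.items(), key=lambda kv: weights[kv[1]], reverse=True)
for term, idx in ranked:
    if weights[idx] > 0:  # Only terms that actually appear in the third sentence
        print(term, round(weights[idx], 2))  # "the" comes out with the lowest weight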
To run the TF-IDF frequency analysis, we'll use the following function:
def term_frequency(terms, length, stopwords='english'):
    """Get term frequency of particular ngram length"""
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer(lowercase=True, stop_words=stopwords, ngram_range=(length, length))
    td_matrix = tfidf.fit_transform(terms).toarray()

    terms = tfidf.get_feature_names()
    frequency = td_matrix.sum(axis=0).tolist()

    df = pd.DataFrame(
        {'Term': terms,
         'Frequency': frequency
         })
    df = df[df.columns[::-1]]
    df.sort_values('Frequency', axis=0, ascending=False, inplace=True)
    df.reset_index(drop=True, inplace=True)
    return df
We won't be using the default "english" stopwords list provided by TfidfVectorizer. Instead, we'll use a general English stopwords list appended with some of our own keywords relevant to this analysis.
english_stopwords = ['''please get the list from here https://gist.github.com/sebleier/554280''']
custom_stopwords = ['google home', 'google assistant', 'smart home', 'smart speaker',
                    'home', 'google', 'speaker', 'smart', 'assistant',
                    'facebook', 'twitter', 'pinterest']
Before we proceed, note that the TfidfVectorizer class takes a corpus as its input. In Python terms, a corpus is a list of strings, where each string represents a "document", i.e. one long string containing all the text in that "document".
For example, if you wanted to find the most common words J.K. Rowling uses in the Harry Potter series, the corpus would be a list with one entry per book, i.e.:
corpus = ["all the text in Harry Potter and the Sorcerer's Stone", "all the text in Harry Potter and the Chamber of Secrets", ...]
In our use case, to make things easier, we'll split each element of the "texts" list on the newline character "\n".
As mentioned earlier, when we scraped the body text we joined the visible text nodes with newline characters. Now we'll split them back apart and treat the pieces as a list of "documents"; in our case they're really closer to "paragraphs" than "documents".
freq_terms_one = []  # We'll start with the frequency of 1-word terms
for t in texts:
    corpus = [x for x in t.split("\n") if x != '']  # Split by newline, dropping empty rows
    corpus_filtered = [x for x in corpus if len(x.split()) != 1]  # Remove all single-word paragraphs

    try:
        freq_one = term_frequency(corpus_filtered, 1, english_stopwords + custom_stopwords)
    except:
        # Pages with empty or unusable text will fail here; mark them with None
        freq_terms_one.append(None)
        continue

    top_10 = freq_one['Term'].head(10).tolist()  # We take only the top 10 to analyse
    freq_terms_one.append(', '.join(top_10))
The output would look something like this:
If you find anything unsatisfactory (e.g. you want to remove "oct" and "nov" from the 4th element), just update your custom_stopwords list and re-run the loop above.
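For example (assuming those month abbreviations are indeed the junk terms showing up in your results), it's just a matter of appending them before re-running:

custom_stopwords += ['oct', 'nov']  # add whatever junk terms appear in your own results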
As mentioned earlier, a one-size-fits-all body text scraper isn't really possible because websites differ so much from one another, so there will be freak cases where irrelevant text ends up ranked as high frequency.
The 4th element, for instance, had such a weird output because a feeds section got caught up in the body text scrape.
In such cases, you might want to modify the function to fit your use case, or simply copy and paste the actual content paragraphs into the list manually.
Nevertheless, let's continue. You can run another round of the same loop to extract terms of length 2, as sketched below.
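Here's what that second pass could look like; it's a near copy of the 1-word loop, just with the ngram length set to 2 and the results stored in freq_terms_two, which the output table below expects:

freq_terms_two = []  # Frequency of 2-word terms
for t in texts:
    corpus = [x for x in t.split("\n") if x != '']
    corpus_filtered = [x for x in corpus if len(x.split()) != 1]

    try:
        freq_two = term_frequency(corpus_filtered, 2, english_stopwords + custom_stopwords)
    except:
        freq_terms_two.append(None)
        continue

    top_10 = freq_two['Term'].head(10).tolist()
    freq_terms_two.append(', '.join(top_10))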
Now we pretty much have everything we need to display in a table. You may tabulate the data we collected into a dataframe and display it on Google Sheets.
output_dict = {
    'Title': titles,
    'Description': descriptions,
    'URL': clean_links,
    'Response code': response_code,
    'Sentiment Score': scores,
    'Sentiment Magnitude': magnitudes,
    'One-Word Frequent Terms': freq_terms_one,
    'Two-Word Frequent Terms': freq_terms_two,  # If you've run the loop for terms of length 2
}
import pandas as pd
df = pd.DataFrame(output_dict, columns = output_dict.keys())
'''
Export to Google Sheets after that. Follow the guide here
https://www.pingshiuanchua.com/blog/post/overpower-your-google-sheets-with-python
'''
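If you just want a quick local copy instead of (or before) pushing to Google Sheets, a plain CSV export from the same dataframe works too (the filename below is just an example):

df.to_csv('google_home_reviews.csv', index=False)  # quick local export; filename is arbitrary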
You should get something like this:
The reviewers seem to be pretty neutral towards Google Home, and if I were Google I'm not sure whether I'd be worried, as none of the top reviews appearing on Google seems to be strongly vouching for the product.
This technique is a good way to generate a report on the reception and sentiment around your brand or product and get an instant sense of how the market perceives it.
You could also try this on more controversial topics, like Brexit or Trump's immigration policies, to gauge the sentiment of the news coverage around them.
Hope the guide will empower you in your work!