Product reviews online are one of the best sources for understanding a product before making a purchase decision.
With the sheer number of reviews available online, we'll use Python to quickly get the gist of each review, analyse its sentiment and stance, and automate away the boring work of deciding which reviews are worth diving into.
We'll be building on the previous post about scraping Google search results. Please refer to it if you need to understand the preliminary scraping code, which won't be covered again here.
To make sure we're only collecting Google Home reviews, we'll use Google Search operators to require that all three of those words appear in the page title.
Of course, some review pages might not have those three words in their title, but such a page would be poorly optimised for search anyway, so we can safely ignore it.
We'll also exclude YouTube results from the search results page, since those reviews are in video form rather than text.
# =============================================================================
# Setting Up
# =============================================================================
import urllib.parse

search_title = '"google home review"'  # Double quotes force an exact phrase match in the title
exclude_url = ['youtube']  # You may add more exclusions

query = "intitle:" + search_title
for exclude in exclude_url:
    query = query + " -inurl:" + exclude

query = urllib.parse.quote_plus(query)
number_result = 20
More URL exclusions can be added if you find that your results contain lousy, biased, or click-bait sites.
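For example, extending the exclusion list before the query is built is all it takes (the extra entries below are hypothetical placeholders; swap in whichever domains you want to drop):

exclude_url = ['youtube', 'slideshare', 'pinterest']  # hypothetical extra exclusions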
As mentioned earlier, I won't be walking through this portion, as the code is exactly the same as in the post about scraping Google search results. Please read that post for a full understanding of this section.
For convenience, I'll paste the full code here so it's easy to follow along, but please refer to that post before asking any questions about this section.
import requests
from fake_useragent import UserAgent
from bs4 import BeautifulSoup

ua = UserAgent()

google_url = "https://www.google.com/search?q=" + query + "&num=" + str(number_result)
response = requests.get(google_url, headers={"User-Agent": ua.random})
soup = BeautifulSoup(response.text, "html.parser")

result_div = soup.find_all('div', attrs={'class': 'g'})

links = []
titles = []
descriptions = []
for r in result_div:
    # Some result blocks lack a link, title or description; skip those
    try:
        link = r.find('a', href=True)
        title = r.find('h3', attrs={'class': 'r'}).get_text()
        description = r.find('span', attrs={'class': 'st'}).get_text()

        if link != '' and title != '' and description != '':
            links.append(link['href'])
            titles.append(title)
            descriptions.append(description)
    except:
        continue
import re

to_remove = []
clean_links = []
for i, l in enumerate(links):
    clean = re.search(r'\/url\?q\=(.*)\&sa', l)

    # Anything that doesn't match the Google redirect pattern gets removed
    if clean is None:
        to_remove.append(i)
        continue
    clean_links.append(clean.group(1))

# Delete in reverse order so earlier deletions don't shift the remaining indices
for x in sorted(to_remove, reverse=True):
    del titles[x]
    del descriptions[x]
Now that we have all the links in the clean_links variable, we'll scrape the body text of each page.
For those unfamiliar with web development: we're not pulling the "content" block specifically, because every website defines its "content" block, i.e. the part where the actual written post lives, differently.
The "body text" is everything visible on the page, including the header, footer, social links and so on, so do expect some junk in the scrape results.
Note that a one-size-fits-all body text scraper is never truly possible, because websites differ so much from one another. This compromise will have consequences later on.
First, we create a helper function that filters out all the elements that aren't visible "body" text.
def tag_visible(element):
    from bs4.element import Comment
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True
Next, the following function does the actual body text scraping:
def get_body_text(url, user_agent):
    """Get the full body text of the specified URL"""
    import requests
    from bs4 import BeautifulSoup
    import time
    from random import randint

    # Random pause so we don't hammer the sites we're scraping
    time.sleep(randint(0, 2))
    ua = {"User-Agent": user_agent}

    try:
        response = requests.get(url, headers=ua)
        print(response)
        if response.status_code != 200:
            print('Status code is not 200')
            return '', response.status_code
    except:
        # The request itself failed, so there's no status code to report
        print('critical error')
        return '', None

    soup = BeautifulSoup(response.content, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)
    return_text = u"\n".join(t.strip() for t in visible_texts)

    return return_text, response.status_code
Some explanations about the function: it sleeps for a random zero to two seconds before each request so we don't hammer the sites, it returns an empty string along with the response code whenever the page can't be fetched cleanly, and the visible text nodes are joined with newline characters, which will matter later when we split the text back into "paragraphs".
To run this function over all our links, we use the following script:
texts = []
response_code = []
for link in clean_links:
    t, r = get_body_text(link, ua.random)
    texts.append(t)
    response_code.append(r)
"texts" now houses all your body texts, which can be a load of incomprehensible texts - we'll use a summariser later.
To get a good gist of how favourable the reviews are towards Google Home, we'll use Google's own Natural Language API. I've previously tested it to be the most reliable of the easily accessible sentiment analysers in Python.
Please refer to the post linked for the full analysis and comparison, but here's the code snippet we need to analyse the sentiment:
def gc_sentiment(text):
    from google.cloud import language

    path = '/Users/Yourname/YourProjectName-123456.json'  # FULL path to your service account key
    client = language.LanguageServiceClient.from_service_account_json(path)

    document = language.types.Document(
        content=text,
        type=language.enums.Document.Type.PLAIN_TEXT)

    annotations = client.analyze_sentiment(document=document)
    score = annotations.document_sentiment.score
    magnitude = annotations.document_sentiment.magnitude
    return score, magnitude
Now let's call the function and loop through all our body texts.
scores = []
magnitudes = []
for x in texts:
    # Check if the body text actually exists
    if x == '':
        scores.append(None)
        magnitudes.append(None)
    else:
        s, m = gc_sentiment(x)
        scores.append(s)
        magnitudes.append(m)
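If you want to eyeball the results before going any further, a quick optional check is to pair each page title with its score using the lists we already have:

# Optional sanity check: print each result's sentiment score next to its title
for title, score in zip(titles, scores):
    print(score, '-', title)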
As for product reviews, I expect most of the scores to be around the middle i.e. zero, unless the product is absolute shit or absolute greatness.
While there are reliable text summarisers out there for Python, I'll be using a simple TF-IDF vectoriser method instead.
TF-IDF is a term-weighting scheme that scores terms in a body of text based on two frequencies: term frequency and document frequency, the latter of which is "inverted".
This means that a term appearing in too many documents, such as common function words, will be penalised and won't rank highly in the TF-IDF output.
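To make that concrete, here's a tiny self-contained illustration on a hypothetical three-sentence corpus (not part of our scrape): a word that appears in every document, like "the", gets a dampened weight compared with words that are distinctive to one document, like "terrible".

from sklearn.feature_extraction.text import TfidfVectorizer

toy_corpus = [
    "the speaker sounds great",
    "the speaker looks great",
    "the microphone is terrible",
]
tfidf = TfidfVectorizer()  # No stopword removal here, so "the" stays in to show the effect
matrix = tfidf.fit_transform(toy_corpus).toarray()

# Rank the terms of the third sentence by their TF-IDF weight
weights = matrix[2]
ranked = sorted(tfidf.vocabulary_.items(), key=lambda kv: weights[kv[1]], reverse=True)
for term, idx in ranked:
    if weights[idx] > 0:  # Only terms that actually appear in the third sentence
        print(term, round(weights[idx], 2))  # "the" comes out with the lowest weight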
To run the TF-IDF frequency analysis, we'll use the following function:
def term_frequency(terms, length, stopwords='english'):
    """Get term frequency of particular ngram length"""
    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    tfidf = TfidfVectorizer(lowercase=True, stop_words=stopwords, ngram_range=(length, length))
    td_matrix = tfidf.fit_transform(terms).toarray()

    terms = tfidf.get_feature_names()
    frequency = td_matrix.sum(axis=0).tolist()

    df = pd.DataFrame(
        {'Term': terms,
         'Frequency': frequency
         })
    df = df[df.columns[::-1]]
    df.sort_values('Frequency', axis=0, ascending=False, inplace=True)
    df.reset_index(drop=True, inplace=True)
    return df
We won't be using the default "english" stopwords list provided by TfidfVectorizer. Instead, we'll use a general English stopwords list appended with some of our own keywords relevant to this analysis.
english_stopwords = ['''please get the list from here https://gist.github.com/sebleier/554280''']
custom_stopwords = ['google home', 'google assistant', 'smart home', 'smart speaker',
                    'home', 'google', 'speaker', 'smart', 'assistant',
                    'facebook', 'twitter', 'pinterest']
Before we proceed, note that the TfidfVectorizer class takes a corpus as its input. In Python terms, a corpus is a list of strings, where each string represents a "document", i.e. one long string containing all the text in that "document".
For example, if you wanted to find the most common words J.K. Rowling uses in the Harry Potter series, the corpus would be a list with one entry per book, i.e.:
corpus = ["all the text in Harry Potter and the Sorcerer's Stone", "all the text in Harry Potter and the Chamber of Secrets", ...]
In our use case, to make things easier, we'll split each element of the "texts" list on the newline character "\n".
As mentioned earlier, when we scraped the body text we joined the visible text nodes with newline characters. Now we'll split them back apart and treat the pieces as a list of "documents"; in our case they're really closer to "paragraphs" than "documents".
freq_terms_one = []  # We'll start with the frequency of 1-word terms
for t in texts:
    corpus = [x for x in t.split("\n") if x != '']  # Split by newline, dropping empty rows
    corpus_filtered = [x for x in corpus if len(x.split()) != 1]  # Remove all single-word paragraphs

    try:
        freq_one = term_frequency(corpus_filtered, 1, english_stopwords + custom_stopwords)
    except:
        # Pages with empty or unusable text will fail here; mark them with None
        freq_terms_one.append(None)
        continue

    top_10 = freq_one['Term'].head(10).tolist()  # We take only the top 10 to analyse
    freq_terms_one.append(', '.join(top_10))
The output would look something like this:
If you find anything unsatisfactory (e.g. you want to remove "oct" and "nov" from the 4th element), just update your custom_stopwords list and re-run the loop above.
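For example (assuming those month abbreviations are indeed the junk terms showing up in your results), it's just a matter of appending them before re-running:

custom_stopwords += ['oct', 'nov']  # add whatever junk terms appear in your own results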
As mentioned earlier, a one-size-fits-all body text scraper isn't really possible because websites differ so much from one another, so there will be freak cases where irrelevant text ends up ranked as high frequency.
The 4th element, for instance, had such a weird output because a feeds section got caught up in the body text scrape.
In such cases, you might want to modify the function to fit your use case, or simply copy and paste the actual content paragraphs into the list manually.
Nevertheless, let's continue. You can run another round of the same loop to extract terms of length 2, as sketched below.
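Here's what that second pass could look like; it's a near copy of the 1-word loop, just with the ngram length set to 2 and the results stored in freq_terms_two, which the output table below expects:

freq_terms_two = []  # Frequency of 2-word terms
for t in texts:
    corpus = [x for x in t.split("\n") if x != '']
    corpus_filtered = [x for x in corpus if len(x.split()) != 1]

    try:
        freq_two = term_frequency(corpus_filtered, 2, english_stopwords + custom_stopwords)
    except:
        freq_terms_two.append(None)
        continue

    top_10 = freq_two['Term'].head(10).tolist()
    freq_terms_two.append(', '.join(top_10))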
Now we pretty much have everything we need to display in a table. You may tabulate the data we collected into a dataframe and display it on Google Sheets.
output_dict = {
    'Title': titles,
    'Description': descriptions,
    'URL': clean_links,
    'Response code': response_code,
    'Sentiment Score': scores,
    'Sentiment Magnitude': magnitudes,
    'One-Word Frequent Terms': freq_terms_one,
    'Two-Word Frequent Terms': freq_terms_two,  # If you've run the loop for terms of length 2
}
import pandas as pd
df = pd.DataFrame(output_dict, columns = output_dict.keys())
'''
Export to Google Sheets after that. Follow the guide here
https://www.pingshiuanchua.com/blog/post/overpower-your-google-sheets-with-python
'''
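If you just want a quick local copy instead of (or before) pushing to Google Sheets, a plain CSV export from the same dataframe works too (the filename below is just an example):

df.to_csv('google_home_reviews.csv', index=False)  # quick local export; filename is arbitrary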
You should get something like this:
The reviewers seem to be pretty neutral towards Google Home, and if I were Google I'm not sure whether I'd be worried, as none of the top reviews appearing on Google seems to be strongly vouching for the product.
This technique is a good way to generate a report on the reception and sentiment around your brand or product and get an instant sense of how the market perceives it.
You could also try this on more controversial topics, like Brexit or Trump's immigration policies, to gauge the sentiment of the news coverage around them.
Hope the guide will empower you in your work!