YouTube comments are often fun to read, and the platform's relative anonymity can surface candid opinions from both sides of an argument or discussion.
It also houses some of the funniest comments you'll find online.
The YouTube Data API is free to use for anyone keen to capture and analyse online conversations about a brand or topic on YouTube.
The same method can be used to analyse conversations on Reddit, though Reddit has a different structure and can potentially yield more substantive discussion than the anonymous comment threads on YouTube.
Here I'll introduce the methods for capturing user comments on the top videos of a particular topic/brand.
The first step is always to get authenticated. This can be a bit painful at times.
The good thing is Google provides a solid Python quickstart.
Head over to Google's developer console, create a project, enable the YouTube Data API and download your JSON credentials file. The quickstart provides a more in-depth guide.
Next, run the following code with the JSON file in your working directory. The code is taken directly from the quickstart.
import os
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
# The CLIENT_SECRETS_FILE variable specifies the name of a file that contains
# the OAuth 2.0 information for this application, including its client_id and
# client_secret.
CLIENT_SECRETS_FILE = "client_secret.json" #This is the name of your JSON file
# This OAuth 2.0 access scope allows for full read/write access to the
# authenticated user's account and requires requests to use an SSL connection.
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'
def get_authenticated_service():
    flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRETS_FILE, SCOPES)
    # Note: newer versions of google-auth-oauthlib replace run_console()
    # with run_local_server().
    credentials = flow.run_console()
    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)

os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
service = get_authenticated_service()
There will be a prompt to open a URL to get an authentication code. Copy the URL into a browser, log in to your Google account and grant permission to make changes to your YouTube channel (the API does more than just retrieve data).
You'll then receive a long string of characters to paste into your Python console to complete authentication.
We'll be pulling the comments from the most popular videos on a particular topic: Brexit. It's a hot topic right now, widely covered on YouTube by political and educational channels.
To find the most popular videos, we'll take the top 20 results when searching for the term "brexit" on YouTube.
We'll tap into the "service" object created earlier, using the parameters listed here.
# =============================================================================
# Search Query Initialisation
# =============================================================================
query = 'brexit'
query_results = service.search().list(
    part = 'snippet',
    q = query,
    order = 'relevance',        # You can consider using viewCount
    maxResults = 20,
    type = 'video',             # Channels might appear in search results
    relevanceLanguage = 'en',
    safeSearch = 'moderate',
).execute()
The "query_results" object has all the results of the search in a JSON/dictionary structure. We'll grab the video IDs (+ some contextual info) from the result's dictionary to be used later on.
# =============================================================================
# Get Video IDs
# =============================================================================
video_id = []
channel = []
video_title = []
video_desc = []
for item in query_results['items']:
    video_id.append(item['id']['videoId'])
    channel.append(item['snippet']['channelTitle'])
    video_title.append(item['snippet']['title'])
    video_desc.append(item['snippet']['description'])
We'll now use the video IDs gathered earlier to pull comments from each video.
Note that we'll be collecting top-level comments only, not replies. This is purely a methodological choice: I believe replies contain messier data, since reply threads tend to devolve into arguments full of straw-man statements.
You may refer to this documentation to pull replies to comments, using the comment IDs you'll get in the following segment.
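For reference, a minimal sketch of that replies call might look like this, assuming the "service" object from earlier; "parent_comment_id" is a hypothetical placeholder for one of the comment IDs collected below:

# A minimal sketch: fetch the replies to one top-level comment.
# `parent_comment_id` is a hypothetical placeholder for a comment ID
# collected in the snippet below.
parent_comment_id = 'REPLACE_WITH_A_COMMENT_ID'
reply_results = service.comments().list(
    part = 'snippet',
    parentId = parent_comment_id,
    maxResults = 100,
    textFormat = 'plainText',
).execute()

replies = [item['snippet']['textDisplay'] for item in reply_results['items']]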
We'll use the following, fairly long code snippet of nested for loops to pull the comments for each video ID.
# =============================================================================
# Get Comments of Top Videos
# =============================================================================
video_id_pop = []
channel_pop = []
video_title_pop = []
video_desc_pop = []
comments_pop = []
comment_id_pop = []
reply_count_pop = []
like_count_pop = []
from tqdm import tqdm

for i, video in enumerate(tqdm(video_id, ncols = 100)):
    response = service.commentThreads().list(
        part = 'snippet',
        videoId = video,
        maxResults = 100,     # Only take top 100 comments...
        order = 'relevance',  # ... ranked on relevance
        textFormat = 'plainText',
    ).execute()

    comments_temp = []
    comment_id_temp = []
    reply_count_temp = []
    like_count_temp = []
    for item in response['items']:
        comments_temp.append(item['snippet']['topLevelComment']['snippet']['textDisplay'])
        comment_id_temp.append(item['snippet']['topLevelComment']['id'])
        reply_count_temp.append(item['snippet']['totalReplyCount'])
        like_count_temp.append(item['snippet']['topLevelComment']['snippet']['likeCount'])

    comments_pop.extend(comments_temp)
    comment_id_pop.extend(comment_id_temp)
    reply_count_pop.extend(reply_count_temp)
    like_count_pop.extend(like_count_temp)

    # Repeat the video-level fields once per comment so all lists stay aligned
    video_id_pop.extend([video_id[i]] * len(comments_temp))
    channel_pop.extend([channel[i]] * len(comments_temp))
    video_title_pop.extend([video_title[i]] * len(comments_temp))
    video_desc_pop.extend([video_desc[i]] * len(comments_temp))
query_pop = [query] * len(video_id_pop)
The _pop lists are the ones we'll use to populate the dataframe later.
The _temp lists are temporary placeholders: their lengths tell us how many comments were pulled from each video, so we can repeat the video ID, channel name, video title and video description the right number of times to keep the columns aligned when building the dataframe.
We're done with the collection. We build the dataframe with the following code snippet and then output it, in my case to Google Sheets.
# =============================================================================
# Populate to Dataframe
# =============================================================================
import pandas as pd
output_dict = {
    'Query': query_pop,
    'Channel': channel_pop,
    'Video Title': video_title_pop,
    'Video Description': video_desc_pop,
    'Video ID': video_id_pop,
    'Comment': comments_pop,
    'Comment ID': comment_id_pop,
    'Replies': reply_count_pop,
    'Likes': like_count_pop,
}

output_df = pd.DataFrame(output_dict)
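The export step itself isn't shown above; a minimal sketch (the filename is a placeholder of my choosing) is to write a CSV, which can then be imported into Google Sheets:

# A minimal sketch of the export step; the filename is a placeholder.
# The resulting CSV can be imported into Google Sheets via File > Import.
output_df.to_csv('youtube_comments_brexit.csv', index = False, encoding = 'utf-8')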
This output is based on results I pulled on 4th Nov 2018. The results will, of course, differ depending on when they are pulled.
We seem to have obtained a pretty good set of results. I say it's good because there's a mix of videos in the top 10 when sorted by Replies.
On some occasions, you might see the top 10 dominated by a single video: perhaps it was exceptionally popular or galvanising, the YouTuber made a mistake, or it was simply funny enough to generate a lot of conversation.
I'm happy with the results, and happy that people are talking about Brexit. I sometimes wonder why these discussions didn't happen before the all-important vote, and why these videos only appeared afterwards.
Nevertheless, here are some highly liked and replied-to comments:
While we can have fun scrolling down the results, let's take a look at the keywords present in the comments.
Here's the word cloud generated with the standard English stopwords removed.
Some keywords, apart from "EU", "Brexit" and "UK", are:
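If you'd like to reproduce the cloud, here's a minimal sketch using the wordcloud package; the package choice and settings are assumptions on my part, as the original code for this step isn't shown:

# A minimal sketch of the word-cloud step, assuming the `wordcloud`
# and `matplotlib` packages; settings here are illustrative.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

text = ' '.join(comments_pop)
cloud = WordCloud(stopwords = STOPWORDS, background_color = 'white',
                  width = 800, height = 400).generate(text)

plt.imshow(cloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()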
Now let's do a keyword density analysis with a TF-IDF vectorizer. This will help us identify the most common one-word, two-word and three-word terms.
Because a TF-IDF vectorizer was used, the "Frequency" shown is not a raw count; it is only an index for comparing the relative weight of the terms.
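For those who want to reproduce this, here's a minimal sketch using scikit-learn's TfidfVectorizer; the parameter choices are my assumptions, as this step's code isn't shown in the original:

# A minimal sketch of the keyword-density step using scikit-learn's
# TfidfVectorizer; the parameters are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english', ngram_range = (1, 3))
tfidf = vectorizer.fit_transform(comments_pop)

# Sum each term's TF-IDF weight across all comments and rank the terms.
scores = tfidf.sum(axis = 0).A1
terms = vectorizer.get_feature_names_out()
top_terms = sorted(zip(terms, scores), key = lambda x: x[1], reverse = True)[:20]
for term, score in top_terms:
    print(f'{term}: {score:.2f}')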
While many who follow Brexit will say this result set is obvious (talk of a second referendum, the Chequers plan, and the EU's Single Market and Customs Union being the most contentious issues), it's a great way for those less familiar with the topic to quickly get a grasp of the recent conversation surrounding Brexit.
While I would debate how broadly useful the YouTube API is for gauging the conversation around a topic, given YouTube's large pool of nonsense comments, it is certainly useful for some topics, and Brexit is a good example.
For other topics, you might want to train a classification model to filter out video- or channel-related comments; a rough sketch follows below. Comments targeting the YouTuber, the video or the channel often get the most likes because they can be pretty funny, but they're mostly useless for topic analysis.
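As a starting point, here's a minimal sketch of such a filter with scikit-learn, assuming you've hand-labelled a sample of comments as on-topic or not; the labels, sample and model choice are all illustrative assumptions, not the method used in this post:

# A minimal sketch of a comment filter, assuming a hand-labelled sample
# where 1 = on-topic and 0 = video/channel-related chatter. The sample,
# labels and model choice are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labelled_comments = ['leave means leave', 'lol the editing at 2:31']  # hand-labelled sample
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(stop_words = 'english'), LogisticRegression())
clf.fit(labelled_comments, labels)

# Keep only the comments the model predicts to be on-topic.
predictions = clf.predict(comments_pop)
on_topic = [c for c, keep in zip(comments_pop, predictions) if keep == 1]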