YouTube comments are often fun to read, and the platform's relative anonymity can surface candid opinions from both sides of an argument or discussion.
It also houses some of the funniest comments you'll find online.
The YouTube Data API is free to use for anyone keen to capture and analyse online conversations about a brand or topic on YouTube.
The same method can be used to analyse conversations on Reddit, though Reddit has a different structure and can potentially yield more substantive discussion than the anonymous comment threads on YouTube.
Here I'll introduce the methods for capturing user comments on the top videos of a particular topic/brand.
The first step is always to get authenticated. This can be a bit painful at times.
The good thing is Google provides a solid Python quickstart.
Head over to Google's developer console, create a project, enable the YouTube Data API and download your JSON credentials file. The quickstart provides a more in-depth guide.
Next, run the following code with the JSON file in your working directory. The code is taken directly from the quickstart.
import os
from googleapiclient.discovery import build
from google_auth_oauthlib.flow import InstalledAppFlow
# The CLIENT_SECRETS_FILE variable specifies the name of a file that contains
# the OAuth 2.0 information for this application, including its client_id and
# client_secret.
CLIENT_SECRETS_FILE = "client_secret.json" #This is the name of your JSON file
# This OAuth 2.0 access scope allows for full read/write access to the
# authenticated user's account and requires requests to use an SSL connection.
SCOPES = ['https://www.googleapis.com/auth/youtube.force-ssl']
API_SERVICE_NAME = 'youtube'
API_VERSION = 'v3'
def get_authenticated_service():
    flow = InstalledAppFlow.from_client_secrets_file(CLIENT_SECRETS_FILE, SCOPES)
    # Note: newer versions of google-auth-oauthlib replace run_console()
    # with run_local_server().
    credentials = flow.run_console()
    return build(API_SERVICE_NAME, API_VERSION, credentials = credentials)

os.environ['OAUTHLIB_INSECURE_TRANSPORT'] = '1'
service = get_authenticated_service()
There will be a prompt to open a URL to get an authentication code. Copy the URL into a browser, log in to your Google account and grant permission to make changes to your YouTube channel (the API does more than just retrieve data).
You'll then receive a long string of characters to paste into your Python console to complete authentication.
We'll be pulling the comments from the most popular videos on a particular topic: Brexit. It's a hot topic right now, widely covered on YouTube by political and educational channels.
To find the most popular videos, we'll take the top 20 results when searching for the term "brexit" on YouTube.
We'll tap into the "service" object created earlier, using the parameters listed here.
# =============================================================================
# Search Query Initialisation
# =============================================================================
query = 'brexit'
query_results = service.search().list(
    part = 'snippet',
    q = query,
    order = 'relevance',        # You can consider using viewCount
    maxResults = 20,
    type = 'video',             # Channels might appear in search results
    relevanceLanguage = 'en',
    safeSearch = 'moderate',
).execute()
The "query_results" object has all the results of the search in a JSON/dictionary structure. We'll grab the video IDs (+ some contextual info) from the result's dictionary to be used later on.
# =============================================================================
# Get Video IDs
# =============================================================================
video_id = []
channel = []
video_title = []
video_desc = []
for item in query_results['items']:
    video_id.append(item['id']['videoId'])
    channel.append(item['snippet']['channelTitle'])
    video_title.append(item['snippet']['title'])
    video_desc.append(item['snippet']['description'])
We'll now use the video IDs gathered earlier to pull comments from each video.
Note that we'll be collecting top-level comments only, not replies. This is purely a methodological choice: I believe replies contain messier data, since reply threads tend to devolve into arguments full of straw-man statements.
You may refer to this documentation to pull replies to comments, using the comment IDs you'll get in the following segment.
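For reference, a minimal sketch of that replies call might look like this, assuming the "service" object from earlier; "parent_comment_id" is a hypothetical placeholder for one of the comment IDs collected below:

# A minimal sketch: fetch the replies to one top-level comment.
# `parent_comment_id` is a hypothetical placeholder for a comment ID
# collected in the snippet below.
parent_comment_id = 'REPLACE_WITH_A_COMMENT_ID'
reply_results = service.comments().list(
    part = 'snippet',
    parentId = parent_comment_id,
    maxResults = 100,
    textFormat = 'plainText',
).execute()

replies = [item['snippet']['textDisplay'] for item in reply_results['items']]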
We'll use the following, fairly long code snippet of nested for loops to pull the comments for each video ID.
# =============================================================================
# Get Comments of Top Videos
# =============================================================================
video_id_pop = []
channel_pop = []
video_title_pop = []
video_desc_pop = []
comments_pop = []
comment_id_pop = []
reply_count_pop = []
like_count_pop = []
from tqdm import tqdm

for i, video in enumerate(tqdm(video_id, ncols = 100)):
    response = service.commentThreads().list(
        part = 'snippet',
        videoId = video,
        maxResults = 100,     # Only take top 100 comments...
        order = 'relevance',  # ... ranked on relevance
        textFormat = 'plainText',
    ).execute()

    comments_temp = []
    comment_id_temp = []
    reply_count_temp = []
    like_count_temp = []
    for item in response['items']:
        comments_temp.append(item['snippet']['topLevelComment']['snippet']['textDisplay'])
        comment_id_temp.append(item['snippet']['topLevelComment']['id'])
        reply_count_temp.append(item['snippet']['totalReplyCount'])
        like_count_temp.append(item['snippet']['topLevelComment']['snippet']['likeCount'])

    comments_pop.extend(comments_temp)
    comment_id_pop.extend(comment_id_temp)
    reply_count_pop.extend(reply_count_temp)
    like_count_pop.extend(like_count_temp)

    # Repeat the video-level fields once per comment so all lists stay aligned
    video_id_pop.extend([video_id[i]] * len(comments_temp))
    channel_pop.extend([channel[i]] * len(comments_temp))
    video_title_pop.extend([video_title[i]] * len(comments_temp))
    video_desc_pop.extend([video_desc[i]] * len(comments_temp))
query_pop = [query] * len(video_id_pop)
The _pop lists are the ones we'll use to populate the dataframe later.
The _temp lists are temporary placeholders: their lengths tell us how many comments were pulled from each video, so we can repeat the video ID, channel name, video title and video description the right number of times to keep the columns aligned when building the dataframe.
We're done with the collection. We build the dataframe with the following code snippet and then output it, in my case to Google Sheets.
# =============================================================================
# Populate to Dataframe
# =============================================================================
import pandas as pd
output_dict = {
    'Query': query_pop,
    'Channel': channel_pop,
    'Video Title': video_title_pop,
    'Video Description': video_desc_pop,
    'Video ID': video_id_pop,
    'Comment': comments_pop,
    'Comment ID': comment_id_pop,
    'Replies': reply_count_pop,
    'Likes': like_count_pop,
}

output_df = pd.DataFrame(output_dict)
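The export step itself isn't shown above; a minimal sketch (the filename is a placeholder of my choosing) is to write a CSV, which can then be imported into Google Sheets:

# A minimal sketch of the export step; the filename is a placeholder.
# The resulting CSV can be imported into Google Sheets via File > Import.
output_df.to_csv('youtube_comments_brexit.csv', index = False, encoding = 'utf-8')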
This output is based on results I pulled on 4th Nov 2018. The results will, of course, differ depending on when they are pulled.
We seem to have obtained a pretty good set of results. I say it's good because there's a mix of videos in the top 10 when sorted by Replies.
On some occasions, you might see the top 10 dominated by a single video: perhaps it was exceptionally popular or galvanising, the YouTuber made a mistake, or it was simply funny enough to generate a lot of conversation.
I'm happy with the results, and happy that people are talking about Brexit. I sometimes wonder why these discussions didn't happen before the all-important vote, and why these videos only appeared afterwards.
Nevertheless, here are some highly liked and replied-to comments:
While we can have fun scrolling down the results, let's take a look at the keywords present in the comments.
Here's the word cloud generated with the standard English stopwords removed.
Some keywords, apart from "EU", "Brexit" and "UK", are:
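If you'd like to reproduce the cloud, here's a minimal sketch using the wordcloud package; the package choice and settings are assumptions on my part, as the original code for this step isn't shown:

# A minimal sketch of the word-cloud step, assuming the `wordcloud`
# and `matplotlib` packages; settings here are illustrative.
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

text = ' '.join(comments_pop)
cloud = WordCloud(stopwords = STOPWORDS, background_color = 'white',
                  width = 800, height = 400).generate(text)

plt.imshow(cloud, interpolation = 'bilinear')
plt.axis('off')
plt.show()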
Now let's do a keyword density analysis with a TF-IDF vectorizer. This will help us identify the most common one-word, two-word and three-word terms.
Because a TF-IDF vectorizer was used, the "Frequency" shown is not a raw count; it is only an index for comparing the relative weight of the terms.
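For those who want to reproduce this, here's a minimal sketch using scikit-learn's TfidfVectorizer; the parameter choices are my assumptions, as this step's code isn't shown in the original:

# A minimal sketch of the keyword-density step using scikit-learn's
# TfidfVectorizer; the parameters are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english', ngram_range = (1, 3))
tfidf = vectorizer.fit_transform(comments_pop)

# Sum each term's TF-IDF weight across all comments and rank the terms.
scores = tfidf.sum(axis = 0).A1
terms = vectorizer.get_feature_names_out()
top_terms = sorted(zip(terms, scores), key = lambda x: x[1], reverse = True)[:20]
for term, score in top_terms:
    print(f'{term}: {score:.2f}')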
While many who follow Brexit will say this result set is obvious (talk of a second referendum, the Chequers plan, and the EU's Single Market and Customs Union being the most contentious issues), it's a great way for those less familiar with the topic to quickly get a grasp of the recent conversation surrounding Brexit.
While I would debate how broadly useful the YouTube API is for gauging the conversation around a topic, given YouTube's large pool of nonsense comments, it is certainly useful for some topics, and Brexit is a good example.
For other topics, you might want to train a classification model to filter out video- or channel-related comments; a rough sketch follows below. Comments targeting the YouTuber, the video or the channel often get the most likes because they can be pretty funny, but they're mostly useless for topic analysis.
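As a starting point, here's a minimal sketch of such a filter with scikit-learn, assuming you've hand-labelled a sample of comments as on-topic or not; the labels, sample and model choice are all illustrative assumptions, not the method used in this post:

# A minimal sketch of a comment filter, assuming a hand-labelled sample
# where 1 = on-topic and 0 = video/channel-related chatter. The sample,
# labels and model choice are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labelled_comments = ['leave means leave', 'lol the editing at 2:31']  # hand-labelled sample
labels = [1, 0]

clf = make_pipeline(TfidfVectorizer(stop_words = 'english'), LogisticRegression())
clf.fit(labelled_comments, labels)

# Keep only the comments the model predicts to be on-topic.
predictions = clf.predict(comments_pop)
on_topic = [c for c, keep in zip(comments_pop, predictions) if keep == 1]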