The Python Reddit API Wrapper (PRAW) is a quick and easy way to scrape and analyse Reddit content, and it can also be used to build bot apps.
In this post, we will use a few lines of code to analyse keyword frequency in several subreddits via TF-IDF analysis and wordcloud generation.
First of all, you'll need a Reddit account, ideally one that is already several days old if you're looking to post content.
Then, create an app on Reddit's old interface.
Once done, collect the following information for use in the Python script later: the app's client ID and client secret, plus your Reddit username and password.
Run the following for initialisation. The last "print" line should reveal your username if everything is done correctly.
import praw

reddit = praw.Reddit(client_id='1234abcd',
                     client_secret='1234-abcd-1234',
                     password='myP@ssword:123456',
                     user_agent='mypythonapp',
                     username='myusername')
print(reddit.user.me())  # Should print your username
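If you'd rather not keep your password in the script itself, one option is to read the same five values from environment variables. This is only a minimal sketch: the REDDIT_ prefix and the load_reddit_credentials helper are hypothetical names chosen for illustration, not part of PRAW.

```python
import os

# The five keyword arguments used in the initialisation snippet above.
REQUIRED_KEYS = ("client_id", "client_secret", "username", "password", "user_agent")


def load_reddit_credentials(env=os.environ, prefix="REDDIT_"):
    """Build the praw.Reddit keyword arguments from environment variables.

    Assumption: variables are named e.g. REDDIT_CLIENT_ID, REDDIT_PASSWORD.
    """
    credentials = {}
    missing = []
    for key in REQUIRED_KEYS:
        variable = prefix + key.upper()
        value = env.get(variable)
        if value:
            credentials[key] = value
        else:
            missing.append(variable)
    if missing:
        # Fail early with a readable message instead of a confusing login error.
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return credentials


# Usage (needs praw installed and the variables set):
# import praw
# reddit = praw.Reddit(**load_reddit_credentials())
# print(reddit.user.me())
```

The helper only returns a dictionary, so it can be checked without touching the Reddit API.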
Here are the steps that I'll use to get the content:
With that content, I'll then run a keyword frequency analysis and generate a wordcloud.
You can choose to get ALL the comments of a post. However, in my experience, the deeper levels usually contain more junk comments (e.g. users singing songs, rick-rolling, meme-ing) than meaningful discussions.
The comments directly related to the submission rarely go deeper than the top or second level.
The following script allows us to perform the steps mentioned above.
subreddit = reddit.subreddit('financialindependence')  # Change the subreddit's name here

sub_ids = []
for submission in subreddit.hot(limit=50):  # Define the limit and the filter method here
    sub_ids.append(submission.id)

top_level_comments = []
second_level_comments = []
title = []
selftext = []

for sub_id in sub_ids:
    submission = reddit.submission(id=sub_id)
    title.append(submission.title)  # Get submission title
    selftext.append(submission.selftext)  # Get submission content
    submission.comments.replace_more(limit=None)  # Expand the "load more comments" placeholders
    for top_level_comment in submission.comments:
        top_level_comments.append(top_level_comment.body)  # Get top-level comments
        for second_level_comment in top_level_comment.replies:
            second_level_comments.append(second_level_comment.body)  # Get second-level comments
You can run through several subreddits by changing the first line of code.
Note that the script will take longer for content-heavy or discussion-heavy subreddits (e.g. /r/AmITheAsshole will take longer than /r/pics).
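If you plan to compare several subreddits, it may be handier to wrap the collection steps in a function rather than editing the first line each time. Here's a rough sketch of that idea. The names collect_texts and collect_many are hypothetical, and unlike the script above it iterates over the listing directly and returns one flat list per subreddit.

```python
def collect_texts(reddit, subreddit_name, limit=50):
    """Return titles, submission text and first/second-level comments as one list.

    `reddit` is expected to be an authenticated praw.Reddit instance.
    """
    texts = []
    for submission in reddit.subreddit(subreddit_name).hot(limit=limit):
        texts.append(submission.title)
        texts.append(submission.selftext)
        # Expand the "load more comments" placeholders before iterating.
        submission.comments.replace_more(limit=None)
        for top_level_comment in submission.comments:
            texts.append(top_level_comment.body)
            for second_level_comment in top_level_comment.replies:
                texts.append(second_level_comment.body)
    return texts


def collect_many(reddit, subreddit_names, limit=50):
    """Map each subreddit name to its collected texts."""
    return {name: collect_texts(reddit, name, limit) for name in subreddit_names}


# Usage (needs the `reddit` object created earlier):
# corpora = collect_many(reddit, ['financialindependence', 'Showerthoughts'])
```

Each subreddit is still fetched one after another, so the running time adds up in the same way as running the script repeatedly.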
Next, just concatenate the four lists (titles, submission text, top-level comments and second-level comments) into one larger list holding all the text content to analyse.
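Combining the lists could look something like the sketch below. The combine_texts helper is a hypothetical name, and the filter for "[deleted]" and "[removed]" placeholders is an optional extra that assumes removed content shows up with exactly that text; the results shown later in this post were produced without it. Pass placeholders=() to keep everything.

```python
def combine_texts(*text_lists, placeholders=("[deleted]", "[removed]")):
    """Concatenate several lists of strings, dropping blanks and placeholders."""
    combined = []
    for text_list in text_lists:
        for text in text_list:
            cleaned = text.strip()
            # Skip empty self-texts and (optionally) removed-content markers.
            if cleaned and cleaned not in placeholders:
                combined.append(cleaned)
    return combined


# Small made-up sample standing in for the lists built by the script.
sample_title = ["How much do you save each month?"]
sample_selftext = [""]
sample_top_level = ["About half of my salary.", "[deleted]"]
sample_second_level = ["That is impressive."]

all_text = combine_texts(sample_title, sample_selftext,
                         sample_top_level, sample_second_level)
print(all_text)
```

With the real data you would pass title, selftext, top_level_comments and second_level_comments instead of the sample lists.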
We'll use the same code shown in this post to analyse the keyword frequency of the subreddits.
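For readers without that code to hand, here is a rough standard-library sketch of the general idea. It is not the code from the linked post: the tokenizer, the smoothed IDF formula and the choice to sum each term's score across all texts are assumptions made for illustration, and top_tfidf_terms is a hypothetical name.

```python
import math
import re
from collections import Counter

TOKEN_PATTERN = re.compile(r"[a-z']+")


def tokenize(text, stopwords=frozenset()):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in stopwords]


def top_tfidf_terms(documents, stopwords=frozenset(), top_n=20):
    """Return the top_n (term, score) pairs, summing TF-IDF over all documents."""
    tokenized = [tokenize(doc, stopwords) for doc in documents]
    tokenized = [tokens for tokens in tokenized if tokens]  # Drop empty texts
    n_docs = len(tokenized)

    # Document frequency: in how many texts does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed IDF, so a term found in every text still gets a weight.
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1
            scores[term] += tf * idf
    return scores.most_common(top_n)


sample_docs = ["save money save", "money market", "index funds"]
for term, score in top_tfidf_terms(sample_docs, top_n=3):
    print(term, round(score, 3))
```

Each title, submission body and comment is treated as its own document here, which is one reasonable choice among several.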
For the wordcloud, we'll use the Python wordcloud package, wrapped in this function:
def generate_wordcloud(text, stopwords=None, mask=None):
    """Generate and display a word cloud from a list of strings."""
    from PIL import Image
    import numpy as np
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt

    mask_object = None
    if mask is not None:
        mask_object = np.array(Image.open(mask))  # Load the mask image as an array

    wordcloud = WordCloud(width=1200, height=600, stopwords=stopwords,
                          max_font_size=200, mask=mask_object,
                          background_color='white', colormap='viridis')
    wordcloud = wordcloud.generate(' '.join(text))

    # Display the generated image the matplotlib way
    plt.figure()
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
Here's what we find for the Financial Independence subreddit:
Top 20 terms:
Wordcloud:
As expected, most of the terms in the wordcloud relate to adult life, money and financial planning.
The subreddit is well known for being very strict about what users can post or comment, to prevent reposts, since it has a very comprehensive FAQ section. Hence, the word "deleted" is right at the top of the term frequency list.
Next, let's take a look at /r/ShowerThoughts.
Funnily enough, a lot of people think about water in the shower, and killing people?
Since the text has a higher concentration of comments than actual submission content, and this subreddit usually has short submissions (often just the title), the "jail" and "kill" terms are likely the commenters' reactions to the OPs' interesting shower thoughts.
As this subreddit's posts can vary wildly from time to time, I wouldn't dive too deep into their recent fascination with water.
Now let's look at the controversial /r/AmITheAsshole subreddit, where it's believed that many assholes go to get validated as "not the asshole" (NTA).
There are way more NTA (not the asshole) than YTA (you're the asshole) mentions, as that's the whole point of the subreddit.
It's interesting to note that most of the content relates to family. Also, we can roughly guess that most posts and comments come from women, as "husband" is mentioned in the top 10.
While there are many controversial subreddits that would be interesting to look into, I don't want to promote those subreddits or increase their visibility. You may use the code I've shared above freely for your own analysis.
Just to end it off, here's one from /r/gonewild where most comments are very kind, full of "raw emotions" and full of compliments to the OPs. Lots of simping going on here.