Following up on the previous post on generating a massive list of keywords, we'll use Graph Theory to help identify several keyword themes worth exploring.
Identifying themes is useful for planning content pieces and for grouping keywords in SEM/paid search.
If you've tried the previous code, you would have found a quick way to generate a massive trove of keywords, and you would have realised that it's challenging to comb through everything to find the right keyword themes.
Here, we'll use the NetworkX package and a little bit of Graph Theory to make it easier.
Note that the analysis will be based on the keyword search patterns and doesn't incorporate search volume.
This post builds on the previous post's code, but we'll shift our focus to eco-friendly living. Imagine you're a new green-living website looking to produce its next content piece.
The keywords below will be our seed keywords.
keyword_list = [
    'eco friendly ',
    'environmentally friendly',
    'earth friendly',
]
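The previous post covers keyword generation in detail; if you just need a stand-in so the code below runs, here's a minimal sketch that expands the seeds into a grab_keywords list. It assumes Google's public autocomplete endpoint, which isn't necessarily the exact method from the previous post, so swap in your own keyword source if you have one.
# Minimal stand-in for the previous post's keyword generation (assumption: Google's public autocomplete endpoint)
import requests
grab_keywords = []
for seed in keyword_list:
    resp = requests.get('http://suggestqueries.google.com/complete/search',
                        params={'client': 'firefox', 'q': seed})
    grab_keywords.extend(resp.json()[1]) # The second element of the response holds the suggestion strings
grab_keywords = list(set(grab_keywords)) # De-duplicate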
For a full reference, see the NetworkX documentation.
First, we'll need to create a networkx graph object. There are many ways to do this, but since each of our network nodes has to carry a keyword token as its label (e.g. "earth", "product", "laundry"), we'll build a Pandas dataframe of a sparse matrix (rather than a numpy matrix, which doesn't retain labels).
# Create sparse co-occurrence matrix
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
count_model = CountVectorizer(ngram_range=(1,1), stop_words='english') # Vectorise to 1-word tokens. Use (1,2) if you want both 1- and 2-word tokens
X = count_model.fit_transform(grab_keywords) # grab_keywords is the list of keywords generated in the previous post
Xc = (X.T * X) # Matrix multiplication to create a square co-occurrence matrix
Xc.setdiag(0) # Set the diagonal (self-looping) values to zero
# Create dataframe from sparse matrix
names = count_model.get_feature_names() # Important to retain keyword tokens as node indices later (renamed to get_feature_names_out() in newer scikit-learn versions)
df = pd.DataFrame(data=Xc.toarray(), columns=names, index=names)
Once we have the sparse matrix dataframe, we'll build a networkx graph object with it.
import networkx as nx
G = nx.from_pandas_adjacency(df) # from_pandas_adjacency accepts a square adjacency matrix as a dataframe
Note that I've skipped some steps here that will make your visualisation prettier (e.g. normalisation) as we're not going to go through how to visualise in this post.
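If you do want a quick look at the graph, here's a minimal drawing sketch using matplotlib and networkx's built-in helpers (the layout and styling values are arbitrary choices, not the ones behind the screenshots below):
# Quick-and-dirty visualisation (illustrative only)
import matplotlib.pyplot as plt
pos = nx.spring_layout(G, k=0.3, seed=42) # Force-directed layout; k and seed are arbitrary
nx.draw_networkx(G, pos, node_size=20, font_size=6, edge_color='lightgrey')
plt.axis('off')
plt.show()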
Before proceeding with the analysis, let's add each node's degree (the number of nodes it connects to) as one of its attributes. This will help in the analysis later.
# Add attribute for degrees to all nodes
from collections import defaultdict
node_dictionary = defaultdict(int)
for n in G.nodes():
    node_dictionary[n] = nx.degree(G, n) # Degree = number of neighbours of the node
nx.set_node_attributes(G, node_dictionary, 'degrees')
The next step is optional, but I'd like to remove some outliers from the graph.
Here's a view of the network graph without removing the outliers.
And here's a view of how it looks after removing the outliers.
Though we're not going through the visualisation in this post, you can see from the two graphs that our analysis would contain a lot of erratic search query tokens if we retained the outliers.
To remove the outliers, we'll remove any nodes that have degree less than 5.
# Remove outliers
remove = [node for node,degree in dict(G.degree()).items() if degree < 5]
G.remove_nodes_from(remove)
Now we're ready for some analysis. Firstly, we'll look into Degree Centrality.
Degree Centrality is one of the centrality measures in Graph Theory. For a given node, it's calculated as the number of links incident upon that node divided by the maximum possible number of links it could have (n − 1, where n is the number of nodes in the graph). It provides an idea of how connected the node is.
In another context, say a pandemic situation, degree centrality can be used to identify the group at highest risk of catching and spreading the virus.
In our search query context, we can use it to identify the keywords that are central to the theme i.e. a keyword with high degree centrality is contained in many search queries.
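As a quick sanity check of the formula, here's a toy example (separate from the keyword graph) showing that networkx's degree_centrality is simply degree divided by n − 1:
# Toy example: degree centrality = degree / (n - 1)
toy = nx.path_graph(4) # 0-1-2-3: the end nodes have degree 1, the middle nodes degree 2
manual = {v: toy.degree(v) / (len(toy) - 1) for v in toy}
print(manual) # {0: 0.33..., 1: 0.66..., 2: 0.66..., 3: 0.33...}
print(nx.degree_centrality(toy)) # Matches the manual calculation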
We start by creating a set object containing all tokens from our keyword_list, as they'll surely be the most-connected tokens and would mess up the analysis.
from nltk.tokenize import word_tokenize # May require nltk.download('punkt') the first time
tokens = [word_tokenize(i) for i in keyword_list]
tokens = set([x for token in tokens for x in token]) # Flatten into a set of unique tokens
Now we find the degree centrality after removing those tokens.
# Find degree centrality excluding the keyword list's tokens
degree_centrality = nx.degree_centrality(G)
for token in tokens:
    degree_centrality.pop(token, None) # This removes our keyword_list tokens
sort_dc = sorted(degree_centrality.items(), key=lambda x: x[1], reverse=True) # Sort descending
print(sort_dc[:20]) # Show the top 20
Here's what we find:
[('products', 0.1059190031152648),
('clothing', 0.040498442367601244),
('uk', 0.040498442367601244),
('yoga', 0.040498442367601244),
('paper', 0.03894080996884735),
('toilet', 0.03894080996884735),
('cleaner', 0.03426791277258567),
('killer', 0.03426791277258567),
('brands', 0.03271028037383177),
('companies', 0.03115264797507788),
('cleaning', 0.029595015576323987),
('mat', 0.029595015576323987),
('baby', 0.028037383177570093),
('packaging', 0.028037383177570093),
('weed', 0.028037383177570093),
('bags', 0.0264797507788162),
('laundry', 0.0264797507788162),
('detergent', 0.024922118380062305),
('shoes', 0.024922118380062305),
('soap', 0.024922118380062305)]
Here we see "products" is the most connected node by degree centrality, which makes sense since our seed keywords are "eco friendly", "environmentally friendly" and "earth friendly".
I'm interested to see which nodes are connected to "clothing". To do that, we find the neighbours of the "clothing" node and sort them by degree.
neighbors = list(G.neighbors('clothing'))
neighbors = [n for n in neighbors if n not in tokens] # Remove the seed tokens again
neighbors_degrees = [G.nodes[n]['degrees'] for n in neighbors] # Use G.nodes (G.node was removed in networkx 2.4)
df_neighbors = pd.DataFrame({'Neighbors': neighbors,
                             'Degrees': neighbors_degrees})
df_neighbors = df_neighbors.sort_values('Degrees', ascending=False).reset_index(drop=True)
From the results above, "brands", "packaging" and "gifts" are interesting ideas we can expand for eco-friendly clothing. Similarly, you can do the same for other nodes.
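To repeat this for other nodes without copy-pasting, you could wrap it in a small helper (a sketch; the function name and the example node are my own choices):
# Hypothetical helper: top neighbours of a node, sorted by degree
def top_neighbors(graph, node, exclude=(), n=10):
    nbrs = [x for x in graph.neighbors(node) if x not in exclude]
    return sorted(nbrs, key=lambda x: graph.nodes[x]['degrees'], reverse=True)[:n]
print(top_neighbors(G, 'packaging', exclude=tokens))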
Geographic-related tokens are interesting for localised content too, though they'll appear in any keyword ideas generator.
The other centrality measure we can use is Betweenness Centrality. It's based on the number of shortest paths that pass through a node.
In a telecommunications context, a node with high Betweenness Centrality has the most control over the network as it serves as the main passage of information in the network. In an influencer marketing context, an influencer with high Betweenness Centrality is the most efficient in passing info from one community to the next i.e. a good broker of info.
In our context, it differs little from Degree Centrality here, but for some search patterns it can differ a lot, especially when your network graph contains many disparate communities.
# Find betweenness centrality excluding the keyword list's tokens
betw_centrality = nx.betweenness_centrality(G)
for token in tokens:
    betw_centrality.pop(token, None)
sort_bc = sorted(betw_centrality.items(), key=lambda x: x[1], reverse=True)
print(sort_bc[:20])
[('products', 0.003190751760087056),
('killer', 0.0007072500114184381),
('weed', 0.0005978998430191811),
('yoga', 0.00034407285672102616),
('clothing', 0.00027484352283512473),
('cleaner', 0.00026887032332670725),
('toilet', 0.0002167298444289711),
('companies', 0.00021137373478196083),
('paper', 0.00020665865359466187),
('uk', 0.0001894204756992165),
('bags', 0.00014678384959925197),
('mat', 0.00013571768970022284),
('baby', 0.00013535131451063043),
('packaging', 0.0001306397533087722),
('cleaning', 0.00012076024658584039),
('gifts', 0.00010596750102778943),
('brands', 0.00010547804772657882),
('wash', 9.705363015757116e-05),
('shoes', 9.22577916031486e-05),
('mattress', 8.760975197763512e-05)]
In our case, "killer" suddenly comes second, but note that its Betweenness Centrality is far below that of "products" (about 78% smaller). In our network graph, it seems the nodes that serve as the main passage of shortest paths are our seed tokens (which were excluded) and "products".
The "killer" here relates to "weed killer" and other lawn-related searches.
Common neighbours of two nodes are the nodes that both of them are connected to.
In our context, we can use it to find product terms relating to a particular geography. Let's compare the UK's eco-friendly product searches with Canada's.
common_nodes = list(nx.common_neighbors(G, "products", "uk"))
common_nodes_degrees = []
for node in common_nodes:
    common_nodes_degrees.append(G.nodes[node]['degrees']) # Use G.nodes (G.node was removed in networkx 2.4)
df_common = pd.DataFrame(data={'Common Nodes': common_nodes,
                               'Degrees': common_nodes_degrees})
df_common = df_common.sort_values('Degrees', ascending=False).reset_index(drop=True)
Here's the result for UK:
and here's the result for Canada:
We see that the UK has searches related to yoga and baby that don't appear in Canada-related searches.
Note that I ran the code while connected to a VPN in the USA, so the search queries wouldn't be very comprehensive for those markets.
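To compare several geographies without repeating the block above, you could wrap the lookup in a small helper (a sketch; it assumes the geo token, e.g. 'canada', actually appears as a node in your keyword graph):
# Hypothetical helper: common neighbours of "products" and a geo token, sorted by degree
def common_with(graph, geo, base='products', exclude=()):
    if geo not in graph: # Skip geographies that never appear in the keyword set
        return []
    common = [n for n in nx.common_neighbors(graph, base, geo) if n not in exclude]
    return sorted(common, key=lambda n: graph.nodes[n]['degrees'], reverse=True)
print(common_with(G, 'uk', exclude=tokens))
print(common_with(G, 'canada', exclude=tokens)) # Assumes 'canada' is a token in the graph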
Finally, we'll run community detection to develop groups of keywords that are thematically similar. This is useful for SEM/paid search keyword grouping and for content writing, where you'd focus on keywords of a similar theme within a piece.
We'll use Clauset-Newman-Moore greedy modularity maximisation, which is included in the networkx package but has to be imported separately.
# Get communities by modularity
from networkx.algorithms.community import greedy_modularity_communities
c = list(greedy_modularity_communities(G))
communities = [sorted(list(a)) for a in c]
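To make combing through the communities easier, you can print them sorted by size first (a small sketch):
# Inspect communities, largest first
communities_by_size = sorted(communities, key=len, reverse=True)
for i, comm in enumerate(communities_by_size):
    print(i, len(comm), comm[:10]) # Community index, size, and its first 10 tokens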
By combing through the 39 communities discovered by the algorithm, we can find interesting groups like:
| Category | Tokens |
|---|---|
| lawn-related | ['care', 'fertilizer', 'lawn', 'mower', 'service', 'signs', 'yard'] |
| wedding-related | ['binders', 'engagement', 'invitations', 'pack', 'ring', 'rings', 'wedding'] |
| kitchen-related | ['cabinets', 'countertops', 'kitchen', 'sponge', 'sponges', 'towels', 'utensils'] |
| construction-related | ['3d', 'building', 'construction', 'epoxy', 'filament', 'materials', 'printer', 'printing', 'resin'] |
| home-related | ['basement', 'bowl', 'cleaner', 'deck', 'drain', 'floor', 'flooring', 'hardwood', 'oven', 'plank', 'purpose', 'sealer', 'siding', 'stickers', 'vacuum', 'vinyl', 'window'] |
| birthday- & gifting-related | ['18th', '1st', '21st', '30th', '40th', '50th', '60th', 'birthday', 'crackers', 'day', 'decorations', 'gifts', 'ideas', 'mother', 'party', 'presents', 'tree', 'unique', 'wrapping', 'xmas'] |
Some of the bigger communities contain less thematically-relevant tokens that could just be a collection of outliers, as Google search queries can be pretty erratic.
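If you want to feed the grouping into SEM/paid search, here's a small sketch to export a token-to-community mapping (the filename is just an example):
# Export token-to-community mapping as a CSV for keyword grouping
rows = [{'token': t, 'community': i} for i, comm in enumerate(communities) for t in comm]
pd.DataFrame(rows).to_csv('keyword_communities.csv', index=False)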
---
Using a bit of Graph Theory knowledge coupled with the networkx package, we've managed to find several interesting search patterns and themes relating to eco-friendly living that would otherwise be difficult to isolate and identify.
Do give it a try and share if you find other ways of using networkx to discover or expand on keyword themes.