One of the key techniques when analysing keyword, headline or any text-related data is to find the connection between terms.
In its simplest form, this is a co-occurrence analysis between words, which can be extremely useful for discovering the themes and relationships hidden in your text.
Here is a network graph for the data that we'll analyse in this tutorial.
This is a network graph of headlines and descriptions from Financial Times for the topic of Bitcoin from Jan 1, 2017 to Mar 9, 2018.
I'm not someone who follows Bitcoin news closely (though I do follow blockchain news), but it seems like there are news pieces that relate Bitcoin with Samsung, Facebook and Kodak. The only news I know is Facebook blocking Bitcoin-related ads, so that's very interesting.
As you have seen, this technique can be used for keeping up with the news too.
This first step is optional, as you can substitute your own data or any public dataset. For this example, I'm going to pull some news data.
The first thing to do is to get your API key from News API. Signing up for an individual account will do.
You can refer to the endpoint page for the request parameters you can use to define your query. Here are the parameters for our example, followed by the code:
api = '[api key here]'  # Replace with your own key
query = '+bitcoin'

import requests

url = ('https://newsapi.org/v2/everything?'
       'q=' + query +
       '&sources=financial-times'
       '&from=2017-01-01&to=2018-03-09'
       '&sortBy=popularity'
       '&language=en'
       '&pageSize=100'
       '&apiKey=' + api)

response = requests.get(url)
json = response.json()
Your JSON object will look something like this:
If you use print(json), it will look far less legible than this. I use the Spyder IDE, which is amazing at making complex data structures readable (highly recommended). The arrows above point to the dictionaries-within-a-dictionary structure typical of a JSON dataset.
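For reference, the response follows roughly this shape. The top-level keys (status, totalResults, articles) and the per-article fields are what the /v2/everything endpoint returns; the values below are made up purely for illustration:

```python
# Illustrative shape of the News API response (values here are invented):
json = {
    'status': 'ok',
    'totalResults': 100,
    'articles': [
        {
            'source': {'id': 'financial-times', 'name': 'Financial Times'},
            'title': 'Bitcoin slides as regulators weigh in',
            'description': 'A made-up description for illustration.',
            'url': 'https://www.ft.com/...',
            'publishedAt': '2018-01-15T09:30:00Z',
        },
        # ... more article dictionaries
    ],
}
```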
Next, we extract the articles, headlines and descriptions into their respective Python objects.
articles = json['articles']
headline = [article['title'] for article in articles]
description = [article['description'] for article in articles]
all_text = headline + description
The final all_text object is an appended list of headlines and descriptions that we'll use to analyse for co-occurrence between words.
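One caveat: the description (and occasionally the title) can be None for some articles, and CountVectorizer will raise an error on non-string entries, so it's worth filtering those out first. A minimal sketch, with made-up sample strings:

```python
# Example: drop any None entries before vectorising,
# since CountVectorizer expects a list of strings.
all_text = ['Bitcoin falls', None, 'Kodak coin announced', None]
all_text = [text for text in all_text if text is not None]
print(all_text)  # ['Bitcoin falls', 'Kodak coin announced']
```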
Gephi is a powerful network analysis tool that works best when given a co-occurrence matrix.
Here is an example of a beautiful co-occurrence matrix that visualises the co-occurrence of characters in Victor Hugo's Les Misérables.
The coloured boxes represent when both characters (top & left) appear in a scene together.
For our use case, our top and left attributes will be words, and the coloured boxes (or in our case, anything non-zero) will represent the two words occurring in the same article (in either headline or description).
scikit-learn provides text preprocessing utilities, and we'll use its CountVectorizer class as the first step towards a co-occurrence matrix.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 1), stop_words='english')  # You can define your own parameters
X = cv.fit_transform(all_text)
X is now a matrix of token counts, which looks something like this (shown here with a travel example):
This is NOT a co-occurrence matrix. On a co-occurrence matrix, the top and left attributes should be the tokens and the "ones" mark the co-occurrence of two entities (city/country in this example).
This is what we need based on the data above:
For those of you familiar with matrix manipulation, you'll know that multiplying the transpose of the token-count matrix by the original, untransposed matrix yields this co-occurrence matrix. In Python, this looks like:
Xc = (X.T * X)  # This is the matrix manipulation step
Xc.setdiag(0)   # We set the diagonal to zeroes as it's pointless for it to be 1
X.T is the transposed version of matrix X (matrix of token counts).
The second step zeroes the diagonal: as you may have noticed, the diagonal of a co-occurrence matrix refers to an entity co-occurring with itself (e.g. Bangkok trivially appears together with Bangkok in any sentence that mentions Bangkok). That's a useless piece of data, so we remove it.
Finally, we turn it into a pandas DataFrame and export it as a CSV.
import pandas as pd

names = cv.get_feature_names_out()  # These are the entity names (i.e. keywords); use get_feature_names() on older scikit-learn
df = pd.DataFrame(data=Xc.toarray(), columns=names, index=names)
df.to_csv('to_gephi.csv', sep=',')
We're now ready to use Gephi.
Note that Gephi is a unique tool and it requires a lot of learning to utilise its full capability. I will be going through some basics to make your network graph readable, but I won't dive into detail. Refer to Gephi's user guides if you want to learn more.
After downloading, installing and opening Gephi, you will see this window.
Click on the "Open Graph File" option and navigate to our file, "to_gephi.csv", in your Python working directory.
Here you'll see how your CSV file would look if you opened it in Excel or Google Sheets.
Click Next then Finish.
On the final screen, choose "Undirected" from the Graph Type option (default is "Mixed"). Use "Directed" to display arrow-heads that indicate directional data (e.g. "from Singapore to Bangkok" should have an arrow pointing from Singapore to Bangkok).
Once you click "OK", you will encounter a huge blob of black nodes and edges.
If you know how to format on Gephi, you may skip the rest of this post.
There are a lot of ways to make it more presentable, but I will explain the basic steps here:
Click on the Dark "T" to enable labels on your Nodes (the circles). They represent your keywords.
Drag the slider beside the font name (Arial-BoldMT, 32) to the far left. This option governs the size of all labels and will be useful later if your labels are too small.
While the Nodes represent your keywords, the Edges (the lines connecting the nodes) represent the co-occurrence between the keywords.
Select the top left option to format either Nodes or Edges. Let's start with Nodes.
Just to the right of that, there are 4 options/icons. From left to right, they are the options for:
Let's select Size of the text label, the rightmost icon with the small and big capital T.
Below the option for Nodes or Edges, there's an option of Unique, Partition or Ranking.
Unique applies the same formatting to all labels/nodes, while Ranking applies formatting based on the magnitude of the nodes/edges. That is, edges with more co-occurrences will be bigger/darker, and more important keywords will have bigger/darker nodes.
I don't really use Partition.
Let's select Ranking, choose Degree in the drop-down menu, and set the range from 0.5 to 50. Click Apply to apply the formatting.
You will notice that some important terms are bigger now, like "bitcoin" and "cryptocurrency".
You may continue to experiment with some other formatting to achieve a nicer visualisation.
Over here there are some options to limit the amount of data/noise being visualised. Removing noise from a visualisation is important for providing focus and delivering a compelling story.
Navigate to "Degree Range" and drag it down to section 4 (Queries). Once you've dragged it over, you'll see a slider at the bottom. Click "Filter" and start dragging the slider.
You will notice that the blob of nodes and edges in the centre view thins out and becomes more legible.
What we're doing is limiting the visualisation to nodes and edges with a higher degree of co-occurrence. This eliminates noisy data that creates more confusion than clarity.
Set the minimum (left slider) to around 60-70 for this example and click "Stop" at the bottom right.
We have reached the final step, which uses layout algorithms to arrange the graph into something more presentable.
For this example, we'll use "Force Atlas", "Expand" and "Label Adjust".
Firstly, select "Force Atlas" from the drop-down menu and click "Run". Your visualisation will look worse than before, but be patient.
Next, select "Expand" and click "Run" a few times until the graph looks more legible.
Finally, use "Label Adjust" to prevent overlaps between the labels. You should reach something like this.
We've reached something similar to the graph I've shown earlier. To create a dark background, click on the light-bulb icon just beside the Dark T icon that we'd used to enable the labels.
You can use the Brush tool (the paint bucket icon) to colour selected nodes and their neighbours, using different colours to distinguish themes in your data.
This has been a very long post so I will stop here. You can make it more beautiful to present to your boss by heading to the "Preview" section at the top left.
You can do some cleanup too by deleting some nodes as it's much easier to identify trash data here than in Excel.
With this skill, you have an extremely powerful technique for digital marketing: finding areas for keyword expansion, and keeping up with the news quickly without reading every article.
You can apply this technique to other datasets where co-occurrence is useful, such as audience-interest analysis (e.g. people who like A also like B), consumer purchase behaviour (e.g. people who buy A also buy B) and social network analysis.
Have fun and happy visualising!