One of the key techniques when analysing keyword, headline or any text-related data is to find the connection between terms.
In its simplest form, this is a co-occurrence analysis between words, which can be extremely useful for discovering the themes and relationships hidden in your text.
Here is a network graph for the data that we'll analyse in this tutorial.
This is a network graph of headlines and descriptions from Financial Times for the topic of Bitcoin from Jan 1, 2017 to Mar 9, 2018.
I'm not someone who follows Bitcoin news closely (though I do follow blockchain news), but it seems like there are news pieces that relate Bitcoin with Samsung, Facebook and Kodak. The only news I know is Facebook blocking Bitcoin-related ads, so that's very interesting.
As you have seen, this technique can be used for keeping up with the news too.
This first step is optional, as you can substitute your own data or any public dataset. For this example, I'm going to pull some news data.
The first thing to do is to get your API key from News API. Signing up for an individual account will do.
You can refer to the endpoint page for the request parameters you can use to define your query. Here are the parameters for our example, followed by the code:
api = '[api key here]'  # Replace with your own key
query = '+bitcoin'

import requests

url = ('https://newsapi.org/v2/everything?'
       'q=' + query +
       '&sources=financial-times'
       '&from=2017-01-01&to=2018-03-09'
       '&sortBy=popularity'
       '&language=en'
       '&pageSize=100'
       '&apiKey=' + api)

response = requests.get(url)
json = response.json()
Your JSON object will look something like this:
If you use print(json), it will look far less legible than this. I use the Spyder IDE, which is amazing at making complex data structures readable (highly recommended). The arrows above point to the dictionaries-within-a-dictionary structure typical of a JSON dataset.
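For reference, the response follows roughly this shape. The top-level keys (status, totalResults, articles) and the per-article fields are what the /v2/everything endpoint returns; the values below are made up purely for illustration:

```python
# Illustrative shape of the News API response (values here are invented):
json = {
    'status': 'ok',
    'totalResults': 100,
    'articles': [
        {
            'source': {'id': 'financial-times', 'name': 'Financial Times'},
            'title': 'Bitcoin slides as regulators weigh in',
            'description': 'A made-up description for illustration.',
            'url': 'https://www.ft.com/...',
            'publishedAt': '2018-01-15T09:30:00Z',
        },
        # ... more article dictionaries
    ],
}
```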
Next, we extract the articles, headlines and descriptions into their respective Python objects.
articles = json['articles']
headline = [article['title'] for article in articles]
description = [article['description'] for article in articles]
all_text = headline + description
The final all_text object is an appended list of headlines and descriptions that we'll use to analyse for co-occurrence between words.
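One caveat: the description (and occasionally the title) can be None for some articles, and CountVectorizer will raise an error on non-string entries, so it's worth filtering those out first. A minimal sketch, with made-up sample strings:

```python
# Example: drop any None entries before vectorising,
# since CountVectorizer expects a list of strings.
all_text = ['Bitcoin falls', None, 'Kodak coin announced', None]
all_text = [text for text in all_text if text is not None]
print(all_text)  # ['Bitcoin falls', 'Kodak coin announced']
```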
Gephi is a powerful network analysis tool that works best when given a co-occurrence matrix.
Here is an example of a beautiful co-occurrence matrix that visualises the co-occurrence of characters in Victor Hugo's Les Misérables.
The coloured boxes represent when both characters (top & left) appear in a scene together.
For our use case, our top and left attributes will be words, and the coloured boxes (or in our case, anything non-zero) will represent the two words occurring in the same article (in either headline or description).
scikit-learn provides text preprocessing utilities, and we'll use its CountVectorizer class as the first step towards a co-occurrence matrix.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1, 1), stop_words='english')  # You can define your own parameters
X = cv.fit_transform(all_text)
X is now a matrix of token counts, which looks something like this (shown here with a travel example):
This is NOT a co-occurrence matrix. On a co-occurrence matrix, the top and left attributes should be the tokens and the "ones" mark the co-occurrence of two entities (city/country in this example).
This is what we need based on the data above:
For those of you familiar with matrix manipulation, you'll know that multiplying the transpose of the token-count matrix by the original, untransposed matrix yields this co-occurrence matrix. In Python, this looks like:
Xc = (X.T * X)  # This is the matrix manipulation step
Xc.setdiag(0)   # We set the diagonal to zeroes as it's pointless for it to be 1
X.T is the transposed version of matrix X (matrix of token counts).
The second step zeroes the diagonal: as you may have noticed, the diagonal of a co-occurrence matrix refers to an entity co-occurring with itself (e.g. Bangkok trivially appears together with Bangkok in any sentence that mentions Bangkok). That's a useless piece of data, so we remove it.
Finally, we turn it into a pandas DataFrame and export it as a CSV.
import pandas as pd

names = cv.get_feature_names_out()  # These are the entity names (i.e. keywords); use get_feature_names() on older scikit-learn
df = pd.DataFrame(data=Xc.toarray(), columns=names, index=names)
df.to_csv('to_gephi.csv', sep=',')
We're now ready to use Gephi.
Note that Gephi is a unique tool and it requires a lot of learning to utilise its full capability. I will be going through some basics to make your network graph readable, but I won't dive into detail. Refer to Gephi's user guides if you want to learn more.
After downloading, installing and opening Gephi, you will see this window.
Click on the "Open Graph File" option and navigate to our file, "to_gephi.csv", in your Python working directory.
Here you'll see how your CSV file would look if you opened it in Excel or Google Sheets.
Click Next then Finish.
On the final screen, choose "Undirected" from the Graph Type option (default is "Mixed"). Use "Directed" to display arrow-heads that indicate directional data (e.g. "from Singapore to Bangkok" should have an arrow pointing from Singapore to Bangkok).
Once you click "OK", you will encounter a huge blob of black nodes and edges.
If you know how to format on Gephi, you may skip the rest of this post.
There are a lot of ways to make it more presentable, but I will explain the basic steps here:
Click on the Dark "T" to enable labels on your Nodes (the circles). They represent your keywords.
Drag the slider beside the font name (Arial-BoldMT, 32) to the far left. This option governs the size of all labels and will be useful later if your labels are too small.
While the Nodes represent your keywords, the Edges (the lines connecting the nodes) represent the co-occurrence between the keywords.
Select the top left option to format either Nodes or Edges. Let's start with Nodes.
Just to the right of that, there are 4 options/icons. From left to right, they are the options for:
Let's select Size of the text label, the rightmost icon with the small and big capital T.
Below the option for Nodes or Edges, there's an option of Unique, Partition or Ranking.
Unique applies the same formatting to all labels/nodes, while Ranking applies formatting based on the magnitude of the nodes/edges. That is, edges with more co-occurrences will be bigger/darker, and more important keywords will have bigger/darker nodes.
I don't really use Partition.
Let's select Ranking, choose Degree in the drop-down menu, and set the range from 0.5 to 50. Click Apply to apply the formatting.
You will notice that some important terms are bigger now, like "bitcoin" and "cryptocurrency".
You may continue to experiment with some other formatting to achieve a nicer visualisation.
Over here there are some options to limit the amount of data/noise being visualised. Removing noise from a visualisation is important for providing focus and delivering a compelling story.
Navigate to "Degree Range" and drag it down to section 4 (Queries). Once you've dragged it over, you'll see a slider at the bottom. Click "Filter" and start dragging the slider.
You will notice that the blob of nodes and edges in the centre view thins out and becomes more legible.
What we're doing is limiting the visualisation to nodes and edges with a higher degree of co-occurrence. This eliminates noisy data that creates more confusion than clarity.
Set the minimum (left slider) to around 60-70 for this example and click "Stop" at the bottom right.
We have reached the final step, which uses layout algorithms to arrange the graph into something more presentable.
For this example, we'll use "Force Atlas", "Expand" and "Label Adjust".
Firstly, select "Force Atlas" from the drop-down menu and click "Run". Your visualisation will look worse than before, but be patient.
Next, select "Expand" and click "Run" a few times until the graph looks more legible.
Finally, use "Label Adjust" to prevent overlaps between the labels. You should reach something like this.
We've reached something similar to the graph I've shown earlier. To create a dark background, click on the light-bulb icon just beside the Dark T icon that we'd used to enable the labels.
You can use the Brush tool (the paint bucket icon) to colour selected nodes and their neighbours, using different colours to distinguish themes in your data.
This has been a very long post so I will stop here. You can make it more beautiful to present to your boss by heading to the "Preview" section at the top left.
You can do some cleanup too by deleting some nodes as it's much easier to identify trash data here than in Excel.
With this skill, you have an extremely powerful technique for digital marketing: finding areas for keyword expansion, and keeping up with the news quickly without reading every article.
You can apply this technique to other datasets where co-occurrence is useful, such as audience-interest analysis (e.g. people who like A also like B), consumer purchase behaviour (e.g. people who buy A also buy B) and social network analysis.
Have fun and happy visualising!