When starting a search engine marketing (SEM) campaign on Adwords, Bing, etc., the first step is always generating the relevant keywords.
That is the easy and fun part. There are plenty of tools at your disposal, such as Adwords Keyword Planner and Ubersuggest.
However, after getting all of those keywords, perhaps tens of thousands of them, the frustrating part comes...
How do you structure these keywords into ad groups? How do you ensure maximum relevance for each ad group while making sure the groups are not so granular that they become difficult to manage?
That's where the Keyword Clustering Tool comes into play.
Machine learning, at its core, uses statistical techniques to analyze, learn and make inferences from a set of data.
Hence, the first step is to turn a collection of words/strings into vectors, i.e. to vectorise it. Statistics works with numbers, not strings of words.
The most basic way to vectorise words would be to tokenise them using an n-gram tokeniser. The snapshot below shows what it looks like on a basic dataset.
The n-gram value to use depends heavily on your dataset. The n-gram used by the Keyword Clustering tool is 1 to 2.
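To make the n-gram step concrete, here is a minimal sketch that assumes scikit-learn's CountVectorizer stands in for the tokeniser (the post does not say which library the tool uses). The sample keywords and variable names are made up for illustration; only the 1 to 2 n-gram range comes from the text above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample keywords, not taken from the tool's dataset.
keywords = [
    "cheap running shoes",
    "running shoes for men",
    "cheap flights to paris",
]

# ngram_range=(1, 2) keeps single words and pairs of adjacent words.
vectoriser = CountVectorizer(ngram_range=(1, 2))
counts = vectoriser.fit_transform(keywords)

# vocabulary_ maps each n-gram to its column index in the count matrix.
features = sorted(vectoriser.vocabulary_, key=vectoriser.vocabulary_.get)
matrix = counts.toarray()

# Show which n-grams occur in each keyword.
for keyword, row in zip(keywords, matrix):
    present = [f for f, c in zip(features, row) if c > 0]
    print(keyword, "->", present)
```

Each row of the matrix is one keyword and each column is one n-gram, so every cell is simply the number of times that n-gram occurs in that keyword.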
As you can see from the snapshot above, it is basically an occurrence counter. To bring it to the next level, we can incorporate latent semantic analysis into the tokenising step.
Latent semantic analysis aims to discover the connection between terms and documents. This is done via singular value decomposition (SVD), which trims the dataset down to fewer dimensions while preserving the connections and similarities in its structure.
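As a rough illustration of the SVD step, the sketch below assumes scikit-learn's TruncatedSVD applied to the n-gram count matrix. The keywords, the choice of two components and the variable names are assumptions made for the example, not the tool's actual settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample keywords for illustration only.
keywords = [
    "cheap running shoes",
    "running shoes for men",
    "trail running shoes",
    "cheap flights to paris",
    "flights to paris from london",
    "last minute flights to paris",
]

# Step 1: n-gram occurrence counts (one row per keyword).
counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(keywords)

# Step 2: project the sparse counts onto a few latent dimensions.
# n_components=2 is an arbitrary choice for this tiny example.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(counts)

print(counts.shape, "->", reduced.shape)
```

Keywords that share many n-grams should end up close together in the reduced space, which is what the clustering step relies on.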
Once that is done, the keyword clustering tool leverages k-means clustering to discover the best way to group your keywords based on the number of desired clusters provided.
Source: Practical Cryptography
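Putting the vectorising, SVD and k-means steps together, a small end-to-end sketch could look like the following. It assumes scikit-learn throughout; the keywords, the number of clusters and all variable names are hypothetical and chosen only to keep the example small.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample keywords for illustration only.
keywords = [
    "cheap running shoes",
    "running shoes for men",
    "trail running shoes",
    "cheap flights to paris",
    "flights to paris from london",
    "last minute flights to paris",
]
n_clusters = 2  # the "number of desired clusters" supplied by the user

# Vectorise with 1- to 2-grams, then reduce with SVD.
counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(keywords)
reduced = TruncatedSVD(n_components=2, random_state=0).fit_transform(counts)

# Group the reduced vectors into n_clusters clusters.
model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = model.fit_predict(reduced)

# Collect the keywords belonging to each cluster label.
clusters = defaultdict(list)
for keyword, label in zip(keywords, labels):
    clusters[int(label)].append(keyword)

for label in sorted(clusters):
    print(label, clusters[label])
```

Fixing random_state makes the grouping repeatable between runs; without it, k-means may assign different label numbers or slightly different groups each time.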
The first row contains some descriptors for your keyword clusters that can quickly help you analyse whether the clusters are up to your standards in terms of their homogeneity and size.
Sometimes you might find a group that is not homogeneous at all. Those are the outliers, i.e. keywords that can't be grouped with anything else and are too few to form their own group.
This dataset is an output based on 15 clusters. From here on, you can further experiment with a few more clusters to find the best one for your use case.
The keyword clustering tool on this site is the most basic implementation of the idea, but it is sufficient for most use cases.
An advanced feature that I've yet to incorporate here is the ability to recommend the best number of clusters after learning about your dataset.
That feature would use statistical techniques, such as silhouette analysis, to recommend several numbers of clusters that you can start with.
The technique uses more computational power on the server and is slower, hence I've yet to implement it.
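For readers curious what such a recommendation could look like, here is a minimal sketch of silhouette analysis. It is not part of the tool; it assumes scikit-learn's silhouette_score, and the keywords, candidate range and variable names are made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical sample keywords for illustration only.
keywords = [
    "cheap running shoes",
    "running shoes for men",
    "trail running shoes",
    "cheap flights to paris",
    "flights to paris from london",
    "last minute flights to paris",
    "hotel deals in rome",
    "cheap hotel deals",
    "rome hotel deals near station",
]

counts = CountVectorizer(ngram_range=(1, 2)).fit_transform(keywords)
reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(counts)

# Fit k-means once per candidate k and record the mean silhouette score.
# Scores lie in [-1, 1]; higher means tighter, better separated clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(reduced)
    scores[k] = silhouette_score(reduced, labels)

# Rank the candidate cluster counts from best to worst score.
ranked = sorted(scores, key=scores.get, reverse=True)
print("Suggested numbers of clusters, best first:", ranked)
```

Because k-means is refitted for every candidate value, the cost grows with both the number of keywords and the number of candidates, which is consistent with the extra server load mentioned above.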
If you need some help on that or you want a bespoke solution (e.g. custom n-gram range), please contact me via my Twitter and we can discuss from there =).
"Keyword Clustering Tool for Search Engine Marketing (SEM)"
Python, Machine Learning, Natural Language Processing