🎲 Want to jump right in? Try the Streamlit app here!
Automatic keyword generation methods have been around for a while (TF-IDF, Rake, YAKE!, just to name a few), all widely implemented in Python, all widely used in fields such as Information Retrieval, Text Mining and, of course, SEO!
Although techniques vary, they usually extract keywords and keyphrases from a document and assign each one a weight that signifies its importance in the wider document and corpus.
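To make the idea of "assigning a weight" concrete, here's a minimal TF-IDF sketch in plain Python. The toy corpus, the whitespace tokeniser and the unsmoothed log formula are my own simplifications for illustration; real libraries use more careful tokenisation and smoothing.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document in `corpus`."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            # Term frequency in this document times inverse document frequency.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical three-document corpus.
corpus = [
    "seo keyword research tools",
    "keyword extraction with bert",
    "bert embeddings for search",
]
scores = tf_idf(corpus)
# Terms of the first document, most distinctive first.
top_terms = sorted(scores[0], key=scores[0].get, reverse=True)
```

In the first document, "keyword" ends up with the lowest weight because it also appears in the second document, while "seo", "research" and "tools" are unique to it.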
What's KeyBERT?
While those methods are all valuable, the KeyBERT library goes a step further than most in terms of accuracy by leveraging BERT embeddings!
What also makes KeyBERT stand out from the crowd is that it is lightweight, powerful and versatile.
Lightweight, because unlike many other libraries, KeyBERT runs well on CPU-only setups, so it can be used in a wide range of applications.
Powerful, as KeyBERT supports embedding models from several popular NLP libraries, such as:
Flair
spaCy
Gensim
You can even select any sentence-transformers model and pass it through KeyBERT!
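As a rough sketch of what that looks like in code, the helper below wraps KeyBERT's `extract_keywords` call. The function name `extract_keywords_sketch` and the model name `all-MiniLM-L6-v2` are my own choices for illustration, not something the app necessarily uses, and you'll need `keybert` installed (`pip install keybert`) for it to run.

```python
def extract_keywords_sketch(text, model_name="all-MiniLM-L6-v2", top_n=5):
    """Return a list of (keyword, score) pairs for `text` using KeyBERT.

    `model_name` is assumed to be a sentence-transformers model name;
    the default here is only an example.
    """
    # Imported inside the function so the module loads even when
    # keybert is not installed; the call itself still requires it.
    from keybert import KeyBERT

    kw_model = KeyBERT(model=model_name)
    return kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),  # single words and two-word phrases
        stop_words="english",          # drop English stop words
        top_n=top_n,                   # number of results to return
    )
```

Calling `extract_keywords_sketch("your document text here")` downloads the model on first use, so expect the first run to take a little longer.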
KeyBERT is also versatile, with a wealth of parameters to choose from. Here's a non-exhaustive list:
That being said, like with any elaborate library, that versatility may come with a trade-off.
It can sometimes be cumbersome to choose the right model embedding and set of parameters to quickly iterate through your use cases.
This is where Streamlit comes in handy!
Introducing the BERT Keyword Extractor! 🎈
With the BERT Keyword Extractor (BERT KE), I wanted to create a simple interface that puts the relevant parameters at your fingertips, so you can iterate in seconds and export your results!
🎲 Try the app here!
The BERT Keyword Extractor is currently in early beta with the following limitations:
2 embedding models (DistilBERT and Flair)
Only the first 500 words are currently reviewed
Once the app is deemed stable, I will add more models, more parameters, and more text allowance, so keep your eyes peeled!
Let's see what settings are currently available:
Choosing your model
At present, you can choose between two embedding models: DistilBERT, which is the default engine, and Flair. More to come soon!
Top N results
You can choose the number of results to display, between 1 and 30. The default is 10.
Min/Max Ngrams
You can choose the minimum and maximum values for the ngram range.
This sets the length of the resulting keywords/keyphrases.
To extract a set of single keywords only, set the ngram range to (1, 1).
To extract keyphrases, set the minimum ngram value to 2. The maximum ngram value can be set to 2 or higher, depending on the number of words you would like to see in each keyphrase.
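If the ngram range feels abstract, this small plain-Python sketch shows how it shapes the candidate keywords/keyphrases. The `candidate_ngrams` helper and the sample text are hypothetical, written for illustration only; this is not KeyBERT's own code.

```python
def candidate_ngrams(text, ngram_range=(1, 1)):
    """List every run of consecutive words whose length is within ngram_range."""
    min_n, max_n = ngram_range
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokens = text.lower().split()
    candidates = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the text.
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates


text = "streamlit keyword extraction app"
single_words = candidate_ngrams(text, (1, 1))
# ['streamlit', 'keyword', 'extraction', 'app']
phrases = candidate_ngrams(text, (2, 3))
# ['streamlit keyword', 'keyword extraction', 'extraction app',
#  'streamlit keyword extraction', 'keyword extraction app']
```

A range of (1, 1) yields single words only, while a minimum of 2 guarantees that every candidate is a phrase.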
Check Stop Words
Tick this box to remove stop words from the document (currently English only).
Use MMR
You can use Maximal Marginal Relevance (MMR) to diversify the results. MMR uses cosine similarity to pick keywords/keyphrases that are relevant to the document yet dissimilar from the ones already selected.
Try high/low 'Diversity' settings for interesting variations.
Diversity
The higher the setting, the more diverse the keywords. Note that the *Keyword diversity* slider only works if the *MMR* checkbox is ticked.
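To show how MMR and the diversity setting interact, here's a simplified sketch in plain Python. The 2-D vectors are made-up stand-ins for real embeddings, and the `mmr` function is my own illustration of the general technique rather than KeyBERT's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(doc_vec, candidates, top_n=2, diversity=0.5):
    """Pick top_n keywords from {keyword: vector}, balancing relevance and novelty."""
    relevance = {kw: cosine(vec, doc_vec) for kw, vec in candidates.items()}
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def score(kw):
            # Redundancy: similarity to the closest keyword already chosen.
            redundancy = max(
                (cosine(candidates[kw], candidates[s]) for s in selected),
                default=0.0,
            )
            return (1 - diversity) * relevance[kw] - diversity * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up 2-D "embeddings", for illustration only.
doc_vec = (1.0, 1.0)
candidates = {
    "bert keywords": (1.0, 0.8),
    "bert keyphrases": (1.0, 0.7),
    "streamlit app": (0.2, 1.0),
}
low = mmr(doc_vec, candidates, top_n=2, diversity=0.1)
# ['bert keywords', 'bert keyphrases']
high = mmr(doc_vec, candidates, top_n=2, diversity=0.8)
# ['bert keywords', 'streamlit app']
```

With low diversity the two near-duplicate phrases both make the cut. With high diversity the second pick switches to the less similar "streamlit app". In KeyBERT itself, these settings correspond to the `use_mmr` and `diversity` arguments of `extract_keywords`.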
Credits
Credit where credit is due: KeyBERT was created by the amazing Maarten Grootendorst.
Maarten writes insightful Data Science articles on Medium, and is also the creator of two other awesome Python libraries: BERTopic and PolyFuzz!
BERTopic is a topic modelling library (unsupervised by default, with semi-supervised options) with a built-in visualiser. Check out Koray’s excellent article for some SEO use cases!
PolyFuzz is a mighty fuzzy string-matching/string-grouping library. It has been my go-to tool for fuzzy matching for over a year now, and it’s bang on for SEO tasks!
It can be used for mapping keywords to URLs, site migrations & redirect management.
It’s also got some good momentum in the SEO community. Check out what Greg Bernhardt, SearchSolved's Lee Foot, and yours truly have been doing with it!
tool is DEAD... not working
This is great. I wanted to embed this into my document intelligence portal (built using Streamlit as well). Can you kindly share your GitHub code for this?
Amazing! Where can I find the code for this project?