Finding relevant patents via a simple BM25 search engine in Python
While teaching myself information retrieval for a document search project, I first ran into TF-IDF, a common and useful technique for extracting keywords and estimating document relevance. I later came across BM25 (short for Best Match 25), a ranking function that scores a set of text documents by how relevant each one is to a given search query.
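To make the idea of a ranking function concrete, here is a minimal pure-Python sketch of BM25 scoring. It is not the code from this post: the function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy corpus are all assumptions chosen for illustration, and library implementations may differ in the details.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rarer terms weigh more; this smoothed IDF is never negative.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # k1 saturates repeated terms; b scales the penalty for long documents.
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up toy corpus, for illustration only.
toy_corpus = [
    ["robot", "arm", "ride", "vehicle"],
    ["animated", "character", "rendering"],
    ["ride", "vehicle", "track", "system"],
]
toy_scores = bm25_scores(["ride", "vehicle"], toy_corpus)
```

Documents that share no term with the query score zero, and two documents of equal length with the same matching terms tie. That is roughly the behavior you would expect from a keyword ranker.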
The Python library rank-bm25 implements several BM25 variants, which saves developers a lot of time when building a quick custom search engine.
Let’s export 900+ Disney patents from Google Patents to a .csv file.
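Once the file is exported, loading it could look roughly like the sketch below. The column names, the placeholder rows, and the single metadata line before the header (hence `skiprows=1`) are assumptions about the export layout, so check them against your own file.

```python
import io
import pandas as pd

# Stand-in for the exported file; every row here is a made-up placeholder.
# Assumption: the export starts with one metadata line before the header row.
sample_csv = io.StringIO(
    "search URL:,(link to the original query)\n"
    "id,title,assignee\n"
    "US-0000001-A,Example ride vehicle system,Example Assignee\n"
    "US-0000002-A,Example character rendering method,Example Assignee\n"
)

# With a real export, pass the file path instead of sample_csv.
patents = pd.read_csv(sample_csv, skiprows=1)

# Keep only the titles, dropping any rows where the title is missing.
titles = patents["title"].dropna().tolist()
```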
After importing the data into Python with pandas, I use several libraries (nltk, spacy, numpy, re, etc.) for preprocessing, including stopword and punctuation removal.
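As a rough sketch of that preprocessing step, the snippet below uses only the standard library. The tiny `STOPWORDS` set and the `preprocess` helper are hypothetical stand-ins; in practice a fuller stopword list from nltk or spacy would take their place.

```python
import re

# Tiny illustrative stopword list (an assumption); nltk or spacy provide fuller ones.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to", "with"}

def preprocess(text):
    """Lowercase, strip punctuation, and drop stopwords; returns a list of tokens."""
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

# Made-up example title.
tokens = preprocess("System and method for controlling a ride vehicle.")
# tokens == ['system', 'method', 'controlling', 'ride', 'vehicle']
```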
The preprocessed patent title column looks something like this:
Next, let’s build a BM25 search engine. I found the BM25 code in this notebook extremely helpful as a guide to the implementation.
And there you have it: a simple BM25 invention search engine built with the rank-bm25 library 🔍