Finding relevant patents via a simple BM25 search engine in Python

While teaching myself information retrieval in NLP for a document search project, I first came across TF-IDF, a common and useful technique for extracting keywords and determining document relevance. I later came across BM25 (short for Best Match 25), a ranking function that orders a set of text documents by their relevance to a given search query.
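To make the idea of a "ranking function" concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is not the code behind rank-bm25. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the smoothed "+1" IDF variant, and the toy titles are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Smoothed IDF; the "+ 1" inside the log keeps it non-negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term frequency saturates as tf grows, controlled by k1.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Toy, made-up titles used only for illustration.
toy_corpus = [
    ["3d", "image", "rendering"],
    ["robot", "arm", "control"],
    ["image", "processing", "pipeline"],
]
toy_scores = bm25_scores(["image", "processing"], toy_corpus)
```

The third toy title matches both query terms, so it scores highest. The second matches neither and scores zero.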

There’s a Python library, rank-bm25, that contains a collection of BM25 algorithms and saves developers a lot of time when building a quick custom search engine.

Let’s export 900+ Disney patents from Google Patents to a .csv file.

The Walt Disney Company (Source: Google Patents)

First, I import the data into Python using Pandas.

Import Disney patent data downloaded from Google Patents
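A minimal sketch of the import step might look like the following. I am assuming the Google Patents export puts a "search URL" line above the real header row, which is why `skiprows=1` is used, so check your own file first. The in-memory sample, its column subset, and the placeholder rows are made up so the snippet runs without the download.

```python
import io
import pandas as pd

# Stand-in for the downloaded .csv file; the rows are made-up placeholders.
sample_export = io.StringIO(
    "search URL:,https://patents.google.com/\n"
    "id,title,assignee,publication date\n"
    "US-0000001-A,Placeholder title one,Disney Enterprises Inc.,2020-01-01\n"
    "US-0000002-A,Placeholder title two,Disney Enterprises Inc.,2021-01-01\n"
)

# skiprows=1 drops the assumed "search URL" line above the header row.
patents = pd.read_csv(sample_export, skiprows=1)

# Keep only the non-empty titles for the search engine.
titles = patents["title"].dropna().tolist()
```

With the real file, you would pass its path to `pd.read_csv` instead of the `io.StringIO` object.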

I use several libraries (nltk, spacy, numpy, re, etc.) for data preprocessing, including stopword and punctuation removal.

Preprocessing patent titles
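The preprocessing can be sketched with the standard library alone. This is a simplified stand-in, not the original nltk/spacy pipeline. The `preprocess_title` helper and the tiny hard-coded stopword set are assumptions for illustration, and a real run would likely use a fuller stopword list from nltk or spaCy.

```python
import re
import string

# Tiny illustrative stopword list; nltk and spaCy ship much fuller ones.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}

def preprocess_title(title):
    """Lowercase a title, drop punctuation and stopwords, return tokens."""
    text = title.lower()
    # Replace every punctuation character with a space.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = preprocess_title("Systems and methods for 3D image processing")
print(tokens)  # ['systems', 'methods', '3d', 'image', 'processing']
```

Returning a list of tokens rather than a string is convenient here because BM25 implementations generally expect tokenized documents.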

The preprocessed patent title column will look something like below:

Next, let’s build a BM25 search engine. I found the BM25 code from this notebook extremely helpful as a guide to the implementation.

Searching Disney patents that contain “3d”
Looking for Disney patents related to “image processing”
Disney patents related to “virtual reality”

And there you have it: a simple BM25 invention search engine built with the rank-bm25 library 🔍

🇲🇾 https://twitter.com/foongminwong 👩🏻‍💻