Jeremy Trips

Where I share about my journey in tech and life.

This article is written manually; AI is only used to check the spelling.

On Building a Hybrid Search Engine

1. Problem Definition and Constraints

2. Data Modeling and Eligibility Logic

3. Query Pathologies

4. Search Pipeline Architecture

5. Baseline Ranking Approach (v1)

6. Evaluation Framework

7. Diagnosing the Failure

8. Ranking Improvements

8.1 BM25 as a Lexical Baseline

8.2 Multi-Field Embedding Ranking

8.3 LLM-Based Relevance Filtering

9. Final Results and Analysis

10. Lessons Learned

On writing a search engine

One year ago, I wrote a search engine for a company I am working on. After spending a long time on other parts of the software, I finally came back to test it and analyse the results... It was sad, really sad. Here is how I fixed it.

1. Subalta's search engine

At Subalta, our goal is to help companies identify public funding. We try to clear the path to public funding, which is fraught with pitfalls.

To address this, I developed a hybrid search engine combining eligibility filtering and relevance-based ranking, allowing users to discover funding opportunities based on their project, company location, size, and other constraints.
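The two stages can be sketched as follows. This is a minimal illustration of the idea, not the actual implementation: the `Funding` fields, the constraint checks, and the term-overlap "ranking" stand in for the real schema and relevance model.

```python
from dataclasses import dataclass

@dataclass
class Funding:
    name: str
    regions: set          # geographic zones where the program applies
    max_company_size: int # largest eligible headcount

def filter_eligible(fundings, region, company_size):
    # Stage 1: hard eligibility filtering, no ranking involved.
    return [f for f in fundings
            if region in f.regions and company_size <= f.max_company_size]

def rank(fundings, query):
    # Stage 2: relevance ranking. Term overlap is a placeholder for the
    # real relevance model.
    terms = set(query.lower().split())
    return sorted(fundings,
                  key=lambda f: len(terms & set(f.name.lower().split())),
                  reverse=True)

def search(fundings, query, region, company_size):
    return rank(filter_eligible(fundings, region, company_size), query)

results = search([Funding("solar panel grant", {"Wallonia"}, 250),
                  Funding("export subsidy", {"Flanders"}, 50)],
                 "solar project", "Wallonia", 100)
```

The key design point is the ordering: ranking only ever sees fundings that have already passed the hard eligibility checks.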

The core problem quickly became obvious: the ranking algorithm was fundamentally broken. It surfaced irrelevant fundings and, more critically, failed to return the perfect match for the user.

2. Data Modeling and Eligibility Logic

The funding world is highly constrained: at every level, many programs apply only to specific geographic zones (for example, Regional Aid Areas), company profiles, and so on. In this context, false positives are not tolerable.

For that reason, we have modeled a set of hard constraints based on specific information coming from each funding. These hard constraints are translated into SQL queries before the ranking step.

This first stage enforces that every funding returned is eligible with respect to the set of hard constraints.
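A minimal sketch of the constraint-to-SQL translation, shown here against an in-memory SQLite database. The table and column names (`fundings`, `region`, `max_company_size`) are illustrative, not our actual schema:

```python
import sqlite3

def eligibility_query(constraints):
    """Build a parameterised WHERE clause from hard constraints."""
    clauses, params = [], []
    if "region" in constraints:
        clauses.append("region = ?")
        params.append(constraints["region"])
    if "company_size" in constraints:
        clauses.append("max_company_size >= ?")
        params.append(constraints["company_size"])
    where = " AND ".join(clauses) or "1=1"  # no constraints: match all
    return f"SELECT id FROM fundings WHERE {where}", params

# Demo on a toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fundings (id INTEGER, region TEXT, max_company_size INTEGER)")
conn.executemany("INSERT INTO fundings VALUES (?, ?, ?)",
                 [(1, "Wallonia", 50), (2, "Flanders", 250), (3, "Wallonia", 500)])

sql, params = eligibility_query({"region": "Wallonia", "company_size": 100})
eligible = [row[0] for row in conn.execute(sql, params)]
```

Using parameterised queries keeps user-provided constraint values out of the SQL string itself.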

3. Query Pathologies

Query pathologies are an issue faced by every search engine. One thing that was hard to accept is that not all queries deserve a response: overly vague queries lead to nothing but noise. Here is an example of a query users tried in the "describe your project" field:

These queries contain no actionable signal about the user's intent. To tackle this issue, we developed a request preprocessor that uses embeddings to determine whether a query is precise enough for the search engine; if it is not, we return a bad request.
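The preprocessor can be sketched as a nearest-neighbour check against known vague queries: if the incoming query sits too close to one of them in embedding space, it is rejected. The bag-of-words "embedding" below is a stand-in for a real sentence-embedding model, and the vague examples and threshold are made up for illustration:

```python
import math
from collections import Counter

# Hypothetical set of queries previously flagged as too vague.
VAGUE_EXAMPLES = ["help", "funding", "money for my company"]

def embed(text):
    # Toy embedding: bag-of-words term counts. A real system would use a
    # sentence-embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_precise(query, threshold=0.7):
    # Reject the query if it is too similar to any known vague example.
    q = embed(query)
    return max(cosine(q, embed(v)) for v in VAGUE_EXAMPLES) < threshold
```

A query like "funding" lands right on a vague example and is rejected, while a specific project description passes through to the search engine.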

4. Search Pipeline Architecture