Jeremy Trips

Where I share about my journey in tech and life.

This article is written manually; AI is only used to check the spelling.

On Building a Hybrid Search Engine

1. Problem Definition and Constraints

2. Data Modeling and Eligibility Logic

3. Query Pathologies

4. Search Pipeline Architecture

5. Baseline Ranking Approach (v1)

6. Evaluation Framework

7. Diagnosing the Failure

8. Ranking Improvements

8.1 BM25 as a Lexical Baseline

8.2 Multi-Field Embedding Ranking

8.3 LLM-Based Relevance Filtering

9. Final Results and Analysis

10. Lessons Learned

On writing a search engine

One year ago, I wrote a search engine for a company I am working on. After spending a long time on other parts of the software, I finally came back to test it and analyse the results... It was sad, really sad. Here is how I fixed it.

1. Subalta's search engine

At Subalta, our goal is to help companies identify public funding. We try to clear the path to public funding, which is fraught with pitfalls.

To address this, I developed a hybrid search engine combining eligibility filtering and relevance-based ranking, allowing users to discover funding opportunities based on their project, company location, size, and other constraints.
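The two stages can be sketched as follows. This is a minimal illustration of the idea, not the actual implementation: the `Funding` fields, the constraint checks, and the term-overlap "ranking" stand in for the real schema and relevance model.

```python
from dataclasses import dataclass

@dataclass
class Funding:
    name: str
    regions: set          # geographic zones where the program applies
    max_company_size: int # largest eligible headcount

def filter_eligible(fundings, region, company_size):
    # Stage 1: hard eligibility filtering, no ranking involved.
    return [f for f in fundings
            if region in f.regions and company_size <= f.max_company_size]

def rank(fundings, query):
    # Stage 2: relevance ranking. Term overlap is a placeholder for the
    # real relevance model.
    terms = set(query.lower().split())
    return sorted(fundings,
                  key=lambda f: len(terms & set(f.name.lower().split())),
                  reverse=True)

def search(fundings, query, region, company_size):
    return rank(filter_eligible(fundings, region, company_size), query)

results = search([Funding("solar panel grant", {"Wallonia"}, 250),
                  Funding("export subsidy", {"Flanders"}, 50)],
                 "solar project", "Wallonia", 100)
```

The key design point is the ordering: ranking only ever sees fundings that have already passed the hard eligibility checks.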

The core problem quickly became obvious: the ranking algorithm was fundamentally broken. It surfaced irrelevant fundings and, more critically, failed to return the perfect match for the user.

2. Data Modeling and Eligibility Logic

The funding world is highly constrained: at every level, many programs apply only to specific geographic zones (for example, Regional Aid Areas), company profiles, and so on. In this context, false positives are not tolerable.

For that reason, we have modeled a set of hard constraints based on specific information coming from each funding. These hard constraints are translated into SQL queries before the ranking step.

This first stage enforces that every funding returned is eligible with respect to the set of hard constraints.
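A minimal sketch of the constraint-to-SQL translation, shown here against an in-memory SQLite database. The table and column names (`fundings`, `region`, `max_company_size`) are illustrative, not our actual schema:

```python
import sqlite3

def eligibility_query(constraints):
    """Build a parameterised WHERE clause from hard constraints."""
    clauses, params = [], []
    if "region" in constraints:
        clauses.append("region = ?")
        params.append(constraints["region"])
    if "company_size" in constraints:
        clauses.append("max_company_size >= ?")
        params.append(constraints["company_size"])
    where = " AND ".join(clauses) or "1=1"  # no constraints: match all
    return f"SELECT id FROM fundings WHERE {where}", params

# Demo on a toy database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fundings (id INTEGER, region TEXT, max_company_size INTEGER)")
conn.executemany("INSERT INTO fundings VALUES (?, ?, ?)",
                 [(1, "Wallonia", 50), (2, "Flanders", 250), (3, "Wallonia", 500)])

sql, params = eligibility_query({"region": "Wallonia", "company_size": 100})
eligible = [row[0] for row in conn.execute(sql, params)]
```

Using parameterised queries keeps user-provided constraint values out of the SQL string itself.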

3. Query Pathologies

Query pathologies are an issue faced by every search engine. One thing that was hard to accept is that not all queries deserve a response: overly vague queries lead to nothing but noise. Here is an example of a query users tried in the "describe your project" field:

These queries contain no actionable signal about the user's intent. To tackle this issue, we developed a request preprocessor that uses embeddings to determine whether a query is precise enough for the search engine; if it is not, we return a bad request.
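The preprocessor can be sketched as a nearest-neighbour check against known vague queries: if the incoming query sits too close to one of them in embedding space, it is rejected. The bag-of-words "embedding" below is a stand-in for a real sentence-embedding model, and the vague examples and threshold are made up for illustration:

```python
import math
from collections import Counter

# Hypothetical set of queries previously flagged as too vague.
VAGUE_EXAMPLES = ["help", "funding", "money for my company"]

def embed(text):
    # Toy embedding: bag-of-words term counts. A real system would use a
    # sentence-embedding model here.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def is_precise(query, threshold=0.7):
    # Reject the query if it is too similar to any known vague example.
    q = embed(query)
    return max(cosine(q, embed(v)) for v in VAGUE_EXAMPLES) < threshold
```

A query like "funding" lands right on a vague example and is rejected, while a specific project description passes through to the search engine.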

4. Search Pipeline Architecture