OPEN SOURCE · MAY 5, 2026 · 8 MIN READ

GSoC 2026 Rejection: Why Semantic Search for Kiwix Didn't Happen

Amit Yadav
AI/ML Engineer & On-Device ML Builder

When I started looking for GSoC 2026 projects, I was drawn to Kiwix immediately. I'd been using it for years — downloading Wikipedia, educational content, entire libraries for offline access in places where connectivity isn't guaranteed. The idea that someone could download a complete section of human knowledge and carry it in their pocket resonated with me.

I dug through the open issues. Most were incremental: performance tweaks, bug fixes, UI polish. Then I found kiwix/overview#93, which asked for search built on BERT embeddings and natural-language question answering. The thread had been open since 2023, untouched.

I read it three times. By the third read, I knew exactly what I wanted to propose.

The Problem That Started It All

Kiwix's current search is keyword-based. You type "Moby Dick" and it finds Moby Dick. You type "author:Herman Melville" and it finds his books. For Wikipedia, this is fine. The user usually knows what they're looking for.

But Kiwix also hosts Gutenberg. Seventy thousand books. And Gutenberg doesn't work that way.

Try this in Kiwix: search for "a story about obsession and revenge at sea." You get nothing. Not because the book doesn't exist. It does. Multiple copies. But you need to already know what you're looking for.

This is especially brutal for offline users — the people Kiwix actually exists for. They're discovering books in the dark. They're not fact-checking a Wikipedia article; they're browsing a library without a librarian. The person who downloads Gutenberg in rural India or rural Peru can't just Google the title. They're working with whatever text search gives them.

That's the asymmetry. Online services solved this problem years ago with embeddings and semantic search. Kiwix hadn't, not because the engineers weren't skilled, but because the obvious approach — bundling a 400MB language model inside the app — makes the offline experience worse, not better.

The Idea: Split the Work

What if you didn't ship the heavy part at all?

The insight I had was simple: do the expensive work once, at scrape time. Ship only the query side.

Here's how it would work:

At build time (on a server, no constraints):

  • When creating a Gutenberg ZIM file, embed every book passage using all-MiniLM-L6-v2 from sentence-transformers
  • Each passage becomes a 384-dimensional embedding (float32)
  • Quantize those to INT8 (about 0.4KB per passage instead of 1.5KB, since 384 values take 384 bytes as INT8 and 1,536 bytes as float32)
  • Build a FAISS index and store it inside the ZIM
  • Total overhead: ~10MB per 10 books
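The quantization step and its storage arithmetic can be sketched in a few lines. This is a minimal illustration in pure Python, assuming symmetric per-vector scaling; it is not FAISS's own quantizer, and the names `quantize_int8` and `dequantize` are hypothetical, as is the made-up vector standing in for real model output.

```python
from array import array

DIM = 384  # embedding width of all-MiniLM-L6-v2


def quantize_int8(vec):
    """Symmetric per-vector quantization: map floats to int8 codes in [-127, 127]."""
    peak = max(abs(x) for x in vec) or 1.0  # fall back to 1.0 for an all-zero vector
    scale = peak / 127.0
    codes = array("b", (round(x / scale) for x in vec))
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from int8 codes."""
    return [c * scale for c in codes]


# A made-up "embedding" standing in for real encoder output (assumption).
vec = [((i * 37) % 101 - 50) / 500.0 for i in range(DIM)]
codes, scale = quantize_int8(vec)

float32_bytes = DIM * array("f").itemsize      # storage as float32
int8_bytes = len(codes) * codes.itemsize       # storage as int8
max_error = max(abs(a - b) for a, b in zip(vec, dequantize(codes, scale)))

print(float32_bytes, int8_bytes, max_error)
```

The per-vector scale costs a few extra bytes per passage if it is stored. Rounding error per component is bounded by half the scale, which is the trade-off the INT8 step accepts in exchange for a 4x smaller index.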

At query time (on the device):

  • Ship one INT8-quantized CoreML encoder (~22MB) inside the app
  • User types a query
  • Embed it on-device with the CoreML model, producing a 384-dimensional vector
  • Load the FAISS index from the ZIM
  • Run cosine similarity and return top-k passages
  • Fully offline. No server. No extra download. No privacy concerns.
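The query-side ranking step above amounts to a top-k cosine similarity search. The sketch below shows that logic in plain Python with a brute-force scan; the real app would do this in Swift against the FAISS index, and the passage IDs and 3-dimensional toy vectors here are invented for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, passages, k=3):
    """Return the k (score, passage_id) pairs most similar to the query."""
    scored = ((cosine(query, vec), pid) for pid, vec in passages.items())
    return heapq.nlargest(k, scored)


# Toy 3-dimensional vectors standing in for 384-dimensional embeddings (assumption).
passages = {
    "moby-dick:ch36": [0.9, 0.1, 0.0],
    "pride-and-prejudice:ch1": [0.0, 0.2, 0.9],
    "count-of-monte-cristo:ch30": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]  # pretend encoding of "obsession and revenge at sea"

results = top_k(query, passages, k=2)
for score, pid in results:
    print(f"{score:.3f}  {pid}")
```

Cosine similarity ignores vector length, so INT8 codes produced with a single positive scale per vector can be compared directly without dequantizing first.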

The critical constraint is that both sides use the exact same model. The scraper embeds with MiniLM. The app queries with MiniLM. The vectors live in the same 384-dimensional space. Swap either side to a different model and the results become garbage.
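One way to guard against that mismatch is to record which encoder produced an index and check it before running any semantic query. The sketch below assumes a small metadata record stored alongside the index; `EncoderInfo` and `is_compatible` are hypothetical names, not part of the ZIM format or any Kiwix API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderInfo:
    """Hypothetical metadata describing the model behind a set of vectors."""
    model_name: str
    dimension: int


def is_compatible(index_info, app_info):
    """Vectors are only comparable if both sides used the same model and width."""
    return index_info == app_info


zim_index = EncoderInfo("all-MiniLM-L6-v2", 384)    # written by the scraper
app_encoder = EncoderInfo("all-MiniLM-L6-v2", 384)  # bundled with the app
other_encoder = EncoderInfo("some-other-model", 768)  # placeholder for a mismatch

print(is_compatible(zim_index, app_encoder))    # same model on both sides
print(is_compatible(zim_index, other_encoder))  # mismatch: skip semantic results
```

On a mismatch, the app could simply hide the semantic section and fall back to keyword search rather than return meaningless rankings.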

Why This Design Matters

This architecture solves three real problems at once:

1. Size: 22MB is significant but reasonable. A full-size transformer encoder would be 3-5x larger without a proportional gain in result quality. This is about the minimum viable model size for decent semantic search.

2. Simplicity: The app doesn't need to build the FAISS index itself. Swift bindings exist, but they're not trivial. The scraper handles index construction at build time, once per book. The app just deserializes and queries.

3. Maintenance: Future model improvements happen at scrape time. Existing ZIM files don't break. New books use the new model. You don't have to choose between shipping a stale model and forcing every user to download a new app.

The results appear as a "Semantic results" section below keyword search. Not replacing it. Just catching what keyword search misses.

The Proposal I Submitted

I spent weeks writing the technical details. Sketched the architecture across two repositories:

  • openzim/gutenberg (Python): Embedding pipeline, FAISS index creation, ZIM integration
  • kiwix/kiwix-apple (Swift): CoreML query encoder, on-device search

I included concrete numbers: file sizes, performance estimates, quantization trade-offs. I showed example queries that would fail with keyword search but succeed with embeddings. I even tested the CoreML conversion locally — verified that INT8 quantization didn't destroy accuracy.

I thought I had it all figured out.

Then the org reviewed it and said no.

Why Kiwix Said No

The feedback wasn't dismissive. It was clear and strategic:

"We appreciate the work, but we're focused on enhancing what we already have. We're not planning to adopt more AI-based features right now."

In those two sentences, I learned more about open source sustainability than I had in months of coding.

Kiwix isn't a company that can hire a team to maintain machine learning infrastructure. It's a volunteer-driven project. Adding semantic search would mean:

  • Committing to maintaining embedding models across multiple app platforms (iOS, Android, macOS, Windows)
  • Handling user support for edge cases (queries that return no results, performance on older phones, etc.)
  • Deciding how to handle model updates without breaking existing ZIM files
  • Triaging bugs that cross the boundary between the scraper and the app

And all of this for a feature that solves a specific, somewhat niche problem: offline book discovery.

The organization chose depth over breadth. They'd rather ship one thing incredibly well than ship ten things okay.

What I Missed

If I'm honest, I was thinking like an ML engineer, not an open source maintainer.

I optimized for elegance: "We can put the heavy lifting at build time and keep the app lightweight." Technically brilliant. Organizationally hard. The scraper team and the app team would need to coordinate. New features in the embedding pipeline would require changes to the ZIM format. Testing becomes more complex.

I didn't weigh the maintenance cost properly. A well-designed feature that the team can't afford to maintain is worse than no feature at all.

What I Gained Instead

So the proposal got rejected. But something else happened.

I got three PRs merged into Kiwix before the proposal even landed:

  • Removed hardcoded Cloudflare analytics from PhET simulations (PR #324)
  • Fixed a language variant bug in the scraper (PR #333)
  • Diagnosed a macOS MapLibre rendering issue (PR #1523)

Those were depth work. Small, focused, maintainable. The kind of work that makes a codebase slower to break and faster to ship.

I learned more from those three PRs than I would have from implementing semantic search. I learned how the team thinks. I learned what they care about. I learned that small wins matter more than ambitious features.

And I learned that "no" from an organization that knows what it's doing is worth more than "yes" from one that doesn't.

The Lesson

If you're proposing features to open source projects, here's what I'd tell my past self:

Ask yourself first: can they maintain it?

Not "is it technically feasible" — can they maintain it? Are there full-time maintainers who can review PRs? Is there a team who understands the domain? Is the feature in the critical path of what the org is trying to do, or is it a nice-to-have that depends on external volunteers?

If the answer to any of those is "no," your best contribution isn't the feature. It's recognizing that limit and working within it instead.

Kiwix said no to semantic search. But they said yes to three other fixes. Those are the contributions that move the org forward, because the team can actually sustain them.

The Broader Point

There's this cult of "move fast and break things" in tech. But open source operates differently. Breaking things is free if you have infinite engineers. Open source projects don't. So the organizations that survive are the ones that are ruthlessly honest about their constraints.

Kiwix chose depth over breadth, focus over feature creep. That decision is probably what keeps them alive.

The semantic search idea was good. The architecture was sound. But it was a "good idea for someone else to execute," not a "this is what we're prioritizing right now." There's a difference.

So I'm grateful for the rejection. It taught me that sometimes the best contribution to open source isn't the feature you propose. It's understanding why the organization says no, and working within that boundary instead.


Note: The proposal document is embedded below if you'd like to review the full technical details, timeline, and how this would have been implemented. It's worth reading not because it was accepted, but because it illustrates the kind of thinking that goes into a real GSoC proposal.
