Running Transformers.js in a Chrome Extension: What I Learned Building With Gemma 4

I recently built a browser extension that runs Gemma 4 E2B locally using Transformers.js, and I hit enough weird edge cases that I figured I’d write them down before I forget.

If you’re trying to cram local AI into a Chrome extension under Manifest V3, this is the architecture that worked for me. No fluff, just the decisions that mattered.

Who should read this

Developers who want to run Transformers.js in a Chrome extension and need to navigate MV3’s annoying constraints around service workers, model caching, and messaging. If you’ve never touched Manifest V3, go read the overview first – I’m not rehashing basics here.

The architecture in one picture

The extension has three runtimes:

  • Background service worker: the brain. Hosts models, runs inference, manages agent state.
  • Side panel: the chat UI. Sends user input, renders streaming responses.
  • Content script: the page bridge. Extracts DOM content, applies highlights.

Everything heavy lives in the background. The UI and content script are deliberately dumb – they request actions and render results, nothing more.

Why this split matters

The obvious alternative is running models in the side panel or a popup. Don’t do that. You’ll end up loading the same model multiple times, which is both wasteful and slow. By keeping inference in the background, you get one model instance shared across all tabs and sessions.

There’s also a security angle: content scripts have DOM access but limited APIs. The background has full extension APIs but no DOM access. Keeping orchestration in the background lets you bridge both worlds cleanly.

The messaging contract

Once you split runtimes, messaging becomes the backbone. Here’s what I settled on:

Side panel → Background:

  • CHECK_MODELS, INITIALIZE_MODELS
  • AGENT_INITIALIZE, AGENT_GENERATE_TEXT, AGENT_GET_MESSAGES, AGENT_CLEAR
  • EXTRACT_FEATURES

Background → Side panel:

  • DOWNLOAD_PROGRESS, MESSAGES_UPDATE

Background → Content script:

  • EXTRACT_PAGE_DATA, HIGHLIGHT_ELEMENTS, CLEAR_HIGHLIGHTS

The rule is simple: the background coordinates everything. The side panel and content script are thin clients that request work and render results.
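To make the contract concrete, here is a minimal sketch of how it could be typed as discriminated unions. The message names come from the lists above; every payload field (text, loadedBytes, totalBytes, selectors), the ChatMessage shape, and the isSidePanelRequest helper are assumptions for illustration, not the extension's actual definitions.

```typescript
// Assumed message shape; the real extension may carry more fields.
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Side panel -> Background
type SidePanelRequest =
  | { type: "CHECK_MODELS" }
  | { type: "INITIALIZE_MODELS" }
  | { type: "AGENT_INITIALIZE" }
  | { type: "AGENT_GENERATE_TEXT"; text: string } // payload is an assumption
  | { type: "AGENT_GET_MESSAGES" }
  | { type: "AGENT_CLEAR" }
  | { type: "EXTRACT_FEATURES" };

// Background -> Side panel
type BackgroundEvent =
  | { type: "DOWNLOAD_PROGRESS"; loadedBytes: number; totalBytes: number }
  | { type: "MESSAGES_UPDATE"; messages: ChatMessage[] };

// Background -> Content script
type ContentScriptCommand =
  | { type: "EXTRACT_PAGE_DATA" }
  | { type: "HIGHLIGHT_ELEMENTS"; selectors: string[] } // payload is an assumption
  | { type: "CLEAR_HIGHLIGHTS" };

const SIDE_PANEL_REQUEST_TYPES: ReadonlySet<string> = new Set([
  "CHECK_MODELS",
  "INITIALIZE_MODELS",
  "AGENT_INITIALIZE",
  "AGENT_GENERATE_TEXT",
  "AGENT_GET_MESSAGES",
  "AGENT_CLEAR",
  "EXTRACT_FEATURES",
]);

// Runtime guard for incoming messages. It only checks the type tag,
// not the payload fields, so payloads still need their own validation.
function isSidePanelRequest(msg: unknown): msg is SidePanelRequest {
  return (
    typeof msg === "object" &&
    msg !== null &&
    typeof (msg as { type?: unknown }).type === "string" &&
    SIDE_PANEL_REQUEST_TYPES.has((msg as { type: string }).type)
  );
}
```

With unions like these, a switch on msg.type in the background handler gets exhaustiveness checking from the compiler, which makes it harder to add a new event and forget a handler.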

A typical flow looks like:

  1. User types a message in the side panel
  2. Side panel sends AGENT_GENERATE_TEXT to background
  3. Background appends to chat history, runs inference, calls tools
  4. Background emits MESSAGES_UPDATE back
  5. Side panel re-renders from the updated message list

Conversation history lives in the background, not the UI. This means if the side panel closes and reopens, state is preserved. Handy.
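The flow above can be sketched roughly like this. It is a simplified illustration, not the extension's code: AgentSession, GenerateFn, and EmitFn are hypothetical names, inference and the transport are injected as plain functions, and tool calls are left out entirely.

```typescript
interface ChatMessage {
  role: "user" | "assistant";
  content: string;
}

// Hypothetical stand-ins: `generate` wraps model inference,
// `emit` pushes an event back to the side panel.
type GenerateFn = (history: readonly ChatMessage[]) => Promise<string>;
type EmitFn = (event: { type: "MESSAGES_UPDATE"; messages: ChatMessage[] }) => void;

class AgentSession {
  // History is owned by the background, never by the UI.
  private messages: ChatMessage[] = [];
  private readonly generate: GenerateFn;
  private readonly emit: EmitFn;

  constructor(generate: GenerateFn, emit: EmitFn) {
    this.generate = generate;
    this.emit = emit;
  }

  // Handles AGENT_GENERATE_TEXT: append, run inference, then broadcast.
  async generateText(text: string): Promise<void> {
    this.messages.push({ role: "user", content: text });
    const reply = await this.generate(this.messages);
    this.messages.push({ role: "assistant", content: reply });
    this.emit({ type: "MESSAGES_UPDATE", messages: this.getMessages() });
  }

  // Handles AGENT_GET_MESSAGES: returns copies so callers cannot mutate history.
  getMessages(): ChatMessage[] {
    return this.messages.map((m) => ({ ...m }));
  }
}
```

In the extension itself, I'd expect emit to wrap chrome.runtime.sendMessage and generateText to be called from a chrome.runtime.onMessage listener in the service worker. Injecting both keeps the session logic testable outside a browser.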

Model loading and caching

I split the models into two roles:

  • Text generation: onnx-community/gemma-4-E2B-it-ONNX (q4f16)
  • Embeddings: onnx-community/all-MiniLM-L6-v2-ONNX (fp32)

Gemma handles reasoning and tool decisions. MiniLM generates embeddings for semantic search. The split is intentional – you don’t need a multi-billion-parameter generative model just to produce vectors for cosine similarity.
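For reference, the semantic search step that sits on top of those embeddings is small. Here is a minimal sketch, assuming embeddings have already been converted to plain number arrays; Chunk and topMatches are hypothetical names, not part of Transformers.js or the extension.

```typescript
// Cosine similarity between two equal-length vectors.
// Returns 0 when either vector has zero length to avoid dividing by zero.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) {
    throw new Error("Embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Hypothetical unit of page content with its precomputed embedding.
interface Chunk {
  text: string;
  embedding: number[];
}

// Rank chunks by similarity to the query embedding and keep the best k.
function topMatches(query: readonly number[], chunks: readonly Chunk[], k: number): Chunk[] {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

None of this needs the generative model, which is the point of keeping the roles separate.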

All inference runs in the background. This means model artifacts get cached under the extension origin (chrome-extension://...) rather than per-website. One cache for the whole install, which is nice.

The MV3 service worker gotcha

Manifest V3 service workers are terminated when idle and restarted on the next event. This means you can’t assume your model stays loaded in memory. Treat runtime state as recoverable.

In practice, this means:

  • Check what’s cached on startup
  • Re-initialize models if they are no longer in memory or were evicted from the cache
  • Emit progress events so the UI knows what’s happening
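One way to treat a loaded model as recoverable state is a lazy, memoized loader. This is a generic sketch rather than the extension's code: createLazyLoader is a hypothetical helper, and load stands in for whatever actually builds the pipeline.

```typescript
// Wraps an expensive async loader so that:
//  - the first call starts loading,
//  - concurrent calls share the same in-flight promise,
//  - a failed load is forgotten so the next call can retry.
function createLazyLoader<T>(load: () => Promise<T>): () => Promise<T> {
  let pending: Promise<T> | null = null;
  return () => {
    if (pending === null) {
      pending = load().catch((err) => {
        pending = null; // allow a retry after failure
        throw err;
      });
    }
    return pending;
  };
}
```

After a service worker restart, module-level state such as pending is gone, so the first request after the restart goes through load again. If the model files are still in the cache, that reload does not need to hit the network.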

I used a CHECK_MODELS task that inspects the cache and estimates remaining download size, then INITIALIZE_MODELS handles the actual download with progress events. It’s not elegant, but it works.
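The size estimate itself is simple arithmetic once you know which files are already cached. Here is a minimal sketch, assuming you have a list of model files with known sizes and a set of URLs already in the cache; ModelFile, remainingDownloadBytes, and the example URLs are hypothetical.

```typescript
// Hypothetical description of one model artifact.
interface ModelFile {
  url: string;
  sizeBytes: number;
}

// Sum the sizes of every file that is not cached yet.
function remainingDownloadBytes(
  files: readonly ModelFile[],
  cachedUrls: ReadonlySet<string>,
): number {
  return files
    .filter((file) => !cachedUrls.has(file.url))
    .reduce((total, file) => total + file.sizeBytes, 0);
}

// Example with made-up URLs and sizes.
const exampleFiles: ModelFile[] = [
  { url: "https://example.invalid/model.onnx", sizeBytes: 1000 },
  { url: "https://example.invalid/tokenizer.json", sizeBytes: 200 },
];
const exampleRemaining = remainingDownloadBytes(
  exampleFiles,
  new Set(["https://example.invalid/tokenizer.json"]),
);
```

How cachedUrls gets populated is left open here. In a browser it could be built by probing the Cache API for each URL, but that is an assumption about the setup rather than a description of what the extension does.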

What I’d do differently

If I were starting over, I’d spend more time on the messaging types upfront. I ended up refactoring the enum definitions three times because I kept finding edge cases where I needed more granular events.

Also, the DynamicCache class for KV caching in text generation is new and worth using. It reuses the key/value cache across generations rather than recomputing earlier tokens, which matters when you’re doing multi-turn conversations.

One thing I wouldn’t change: keeping embeddings and text generation as separate pipelines. The temptation to merge them is real, but they have different memory profiles and quantization needs. Keep them separate.

The source if you want to dig in

The extension is on the Chrome Web Store and the code is open source. I won’t link it here because links rot, but search for “Transformers.js Gemma 4 Browser Assistant” and you’ll find it.

That’s it. No grand conclusions. Just practical architecture decisions that worked for shipping local AI in a Chrome extension under MV3.
