Training mRNA Language Models Across 25 Species for $165: What Worked and What Didn’t

Training mRNA Language Models Across 25 Species for $165: What Worked and What Didn’t

1 0 0

I’ve been watching the protein AI space for a while now, and most of what I see is either hype about AlphaFold or yet another fine-tuned ESM variant that nobody outside a handful of labs can actually run. So when OpenMed dropped their latest post about training mRNA language models across 25 species for $165, I had to dig in.

This is Part II of their series, and it’s refreshingly honest. No polished success story here—just a transparent account of what worked, what surprised them, and what they’d do differently. And the best part? Everything is open-source and runnable.

What They Built

The pipeline has three stages: protein folding (predicting 3D structure), sequence design (finding amino acids that fold into that structure), and mRNA optimization (choosing DNA codons that express well in a target organism). The first two use established tools—ESMFold from Meta and ProteinMPNN from the Baker Lab. The third is entirely their own work: new models, new training infrastructure, new evaluation metrics.

For folding, they ran ESMFold on 30 protein chains and got an average pTM of 0.79. That’s solid. Not AlphaFold-level, but good enough for most therapeutic protein work, and way faster. For sequence design, they used ProteinMPNN on scaffold 7K00 and got 42% sequence recovery. That’s in line with what I’ve seen from other groups.

But the mRNA optimization is where the real work happened. They trained multiple transformer variants on 250k coding sequences, then scaled to 381k sequences across 25 species. The best model, CodonRoBERTa-large-v2, hit a perplexity of 4.10 and a Spearman CAI correlation of 0.40. That’s significantly better than ModernBERT, which I honestly expected to perform better given its efficiency innovations.

The Architecture Showdown

The core question was: which transformer architecture works best for codon-level language modeling? This matters because codon optimization is crucial for therapeutic mRNA, vaccines, and recombinant protein production. The genetic code is degenerate—the same protein can be encoded by astronomically many different DNA sequences, but some codon arrangements express 100x better than others. The Pfizer-BioNTech COVID vaccine was codon-optimized for human expression, for example.

They tested five models:

  • CodonBERT baseline (6M params): tiny BERT, just to establish floor performance
  • ModernBERT-base (90M params): 22 layers with RoPE, representing the latest NLP efficiency innovations
  • CodonRoBERTa-base (92M params): 12-layer RoBERTa, same family as ESM-2
  • CodonRoBERTa-large (312M params): 24-layer RoBERTa
  • CodonRoBERTa-large-v2 (312M params): same architecture, better hyperparameters

The choice of RoBERTa was deliberate. Meta’s ESM-2 (which powers ESMFold) is itself a RoBERTa variant trained on protein sequences. The hypothesis was that the same architecture that learned amino acid patterns would also learn codon patterns, but codon sequences have different statistical properties—they’re triplets from a 64-token alphabet with strong positional dependencies and species-specific usage biases.

What Surprised Me

ModernBERT underperformed. I had high hopes for it because of the efficiency gains from things like RoPE and alternating attention patterns. But on codon sequences, it just didn’t work as well. The perplexity was higher, and the CAI correlation was lower. The OpenMed team notes that “the architectural innovations that make ModernBERT efficient on long NLP sequences don’t necessarily transfer to the shorter, more structured codon domain.” That makes sense in hindsight, but it’s still a useful data point.

CodonRoBERTa-large-v2 was the clear winner, and the improvements came from hyperparameter tuning rather than architecture changes. They adjusted learning rate schedules, batch sizes, and masking strategies. The v2 model used a cosine learning rate decay with warmup and a higher mask ratio (20% vs 15%). Small changes, big impact.

The Cost Story

Here’s the part that got my attention: they trained 4 production models in 55 GPU-hours on an NVIDIA A100. At standard cloud pricing, that’s about $165 total. For a multi-species codon optimization system that no other open-source project offers. That’s absurdly cheap for what you get.

They scaled to 25 species, including human, mouse, E. coli, yeast, and several plants. The species-conditioned system lets you specify the target organism and get codon-optimized sequences that match that organism’s natural usage biases. This is something that commercial tools like GeneArt and IDT offer, but they charge per sequence and don’t share their models. OpenMed’s is open-source.

The End-to-End Workflow

The pipeline runs in a single afternoon. You start with a protein concept—say, a therapeutic antibody—and the system predicts its 3D structure, designs amino acid sequences that fold into that structure, and optimizes the codons for your target species. The output is a synthesis-ready DNA sequence.

They provide runnable code for each stage, with clear instructions on dependencies and runtime. The folding step takes about 30 minutes for a typical protein on a single GPU. The sequence design step is faster—ProteinMPNN runs in minutes. The codon optimization is the bottleneck, but even that only takes a few hours for a full batch.

Where This Stands

This is not a polished commercial product. The folding component uses ESMFold, which is less accurate than AlphaFold for complex multi-chain proteins. The sequence design component uses ProteinMPNN, which is good but not state-of-the-art for all scenarios. And the codon optimization models, while better than ModernBERT, still have room for improvement—the perplexity of 4.10 means they’re still occasionally surprised by codon choices.

But as an open-source alternative to expensive commercial tools, this is impressive. The multi-species conditioning alone is something I haven’t seen in any other open-source project. And the cost—$165 for training—means any lab with a decent GPU can replicate or extend this work.

What I’d like to see next: more species (especially non-model organisms), integration with mRNA stability prediction, and maybe a web interface for biologists who don’t want to run code. But for now, this is a solid piece of engineering that fills a real gap in the open-source protein AI ecosystem.

Comments (0)

Be the first to comment!