Training mRNA Language Models Across 25 Species for $165

  • Thread starter maziyar
We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers. Complete results, architectural decisions, and runnable code below.
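To make "species-conditioned" concrete: a common way to condition a single codon-level language model on the target organism is to tokenize the coding sequence into codons and prepend a species tag token. The token format and species names below are illustrative assumptions, not the project's actual vocabulary:

```python
# Sketch (assumed token scheme): species-conditioned codon tokenization.
# A coding sequence is split into codon tokens, and a species tag token
# is prepended so one shared model can be conditioned per organism.

def codon_tokenize(cds: str, species: str) -> list[str]:
    """Tokenize a coding sequence into codon tokens with a species prefix.

    Accepts RNA (U) or DNA (T) alphabets; raises if the sequence length
    is not a multiple of 3 (i.e., not a whole number of codons).
    """
    cds = cds.upper().replace("U", "T")
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # Hypothetical species token format: "<species_name>"
    return [f"<{species}>"] + codons

# Example: a short human-conditioned input
tokens = codon_tokenize("AUGGCUAAA", "homo_sapiens")
# tokens == ['<homo_sapiens>', 'ATG', 'GCT', 'AAA']
```

Scaling to 25 species then only grows the vocabulary by 25 tag tokens rather than requiring 25 separate models.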



Comments URL: https://news.ycombinator.com/item?id=47606244

Points: 54

# Comments: 17
 