HN | Important | Summary | 2026. 04. 24. 13:49

Building mRNA Language Models Across 25 Species: Results for Just $165

Summary

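This article introduces an end-to-end protein AI pipeline spanning structure prediction, sequence design, and codon optimization. It focuses on building mRNA language models across 25 species, with CodonRoBERTa-large-v2 adopted as the best-performing transformer architecture among those compared. The key contribution is that the authors trained four production models in only 55 GPU-hours and completed a species-conditioned system that, by their account, no other open-source project offers.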

Key Points

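CodonRoBERTa-large-v2 reached a perplexity of 4.10 and a Spearman CAI correlation of 0.40 on codon-level language modeling, the best result among the architectures compared.
The authors built an end-to-end protein AI pipeline that integrates structure prediction, sequence design, and codon optimization.
Four production models covering 25 species were trained in 55 GPU-hours, and the authors implemented their own species-conditioned system.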
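
To put the headline budget in perspective, the figures quoted above can be turned into per-unit rates. This is only a back-of-the-envelope sketch: it assumes the $165 covers exactly the 55 GPU-hours and that cost and time are spread evenly over the four models, neither of which the summary states. The function name `implied_rates` is hypothetical.

```python
# Figures quoted in the summary; how they map onto each other is an assumption.
TOTAL_COST_USD = 165.0   # headline budget
GPU_HOURS = 55.0         # reported training time for the production models
NUM_MODELS = 4           # reported number of production models


def implied_rates(total_cost, gpu_hours, num_models):
    """Return (USD per GPU-hour, mean USD per model, mean GPU-hours per model).

    Assumes the whole budget was spent on these GPU-hours and that the
    models shared it evenly; the source does not confirm either point.
    """
    return (
        total_cost / gpu_hours,
        total_cost / num_models,
        gpu_hours / num_models,
    )


usd_per_gpu_hour, usd_per_model, hours_per_model = implied_rates(
    TOTAL_COST_USD, GPU_HOURS, NUM_MODELS
)
print(f"{usd_per_gpu_hour:.2f} USD per GPU-hour")
print(f"{usd_per_model:.2f} USD per model on average")
print(f"{hours_per_model:.2f} GPU-hours per model on average")
```

Under those assumptions the implied rate is 3 USD per GPU-hour, with an average of 41.25 USD and 13.75 GPU-hours per model.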

Training mRNA Language Models Across 25 Species for $165

We built an end-to-end protein AI pipeline covering structure prediction, sequence design, and codon optimization. After comparing multiple transformer architectures for codon-level language modeling, CodonRoBERTa-large-v2 emerged as the clear winner with a perplexity of 4.10 and a Spearman CAI correlation of 0.40, significantly outperforming ModernBERT. We then scaled to 25 species, trained 4 production models in 55 GPU-hours, and built a species-conditioned system that no other open-source project offers.

Complete results, architectural decisions, and runnable code below.

AI-generated content
