A New Paradigm for RNA Homology Search Empowered by Large Language Models and Contrastive Learning
Research Article
Open Access
CC BY

Zihou Jiang 1*
1 Shanghai Pinghe Bilingual School, Shanghai, China
*Corresponding author: jiangzihou2020@163.com
Published on 3 December 2025
ACE Vol.211
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
ISBN (Print): 978-1-80590-579-0
ISBN (Online): 978-1-80590-580-6

Abstract

RNA’s growing therapeutic impact demands fast, structure-aware comparative analysis. Existing search tools trade speed against sensitivity: sequence-only methods are rapid but miss distant, structure-conserved homologs, whereas covariance-model pipelines are accurate but slow. We present a two-stage framework that reframes RNA homology detection as geometric retrieval in a learned embedding space, followed by structure-aware multiple sequence alignment (MSA). A frozen RNA foundation model (RiNALMo) embeds sequences; a lightweight Transformer head trained with supervised contrastive learning (family-balanced sampling, clan-aware hard negatives) shapes the space so that homologs cluster and non-homologs separate. Approximate nearest-neighbor search (FAISS/HNSW) enables sub-linear retrieval from millions of sequences. The top-k hits are then aligned by a hybrid pipeline (MAFFT X-INS-i seeding followed by Infernal covariance-model refinement) to produce structure-consistent MSAs. On a family-level split of Rfam v14.10, our method answers a query in 0.45 s on average (∼20× faster than BLASTn; >3,500× faster than cmscan) while achieving 0.95 precision, 0.93 recall, and 0.94 F1. Using the retrieved sets, the MAFFT→Infernal workflow attains a sum-of-pairs score (SPS) of 0.91, versus 0.68 for BLASTn-based sets, enabling scalable, sensitive RNA homology discovery and downstream analysis.
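
To make the training objective named in the abstract more concrete, the following is a minimal sketch of a supervised contrastive loss in the standard formulation of Khosla et al. (2020), written in plain Python. It is an illustration under stated assumptions, not the author's implementation: the function names, the temperature value, the two-dimensional toy embeddings, and the family labels `famA`/`famB` are all hypothetical, and the family-balanced sampling and clan-aware hard-negative mining mentioned above are not modeled.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss averaged over anchors that have a positive.

    For each anchor i, every other sample with the same label is a positive,
    and the softmax denominator runs over all samples except i itself.
    The temperature of 0.1 is an assumed value, not taken from the paper.
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    per_anchor = []
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-family partner contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(sims.values())
        log_denom = peak + math.log(sum(math.exp(s - peak) for s in sims.values()))
        # Mean negative log-probability of the positives for this anchor.
        per_anchor.append(
            -sum(sims[p] - log_denom for p in positives) / len(positives)
        )
    if not per_anchor:
        raise ValueError("batch needs at least two samples sharing a label")
    return sum(per_anchor) / len(per_anchor)


# Hypothetical toy batch: two tight clusters in a 2-D embedding space.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

# Labels that agree with the geometry versus labels that cut across it.
aligned = supcon_loss(toy_embeddings, ["famA", "famA", "famB", "famB"])
crossed = supcon_loss(toy_embeddings, ["famA", "famB", "famA", "famB"])
print(f"labels match clusters: {aligned:.4f}")
print(f"labels cut across clusters: {crossed:.4f}")
```

The loss is small when same-family embeddings already sit close together and large when family labels cut across the clusters, which is the pressure that pulls homologs together and pushes non-homologs apart during training.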

Keywords:

RNA Homology Search, Deep Metric Learning, Multiple Sequence Alignment (MSA), Contrastive Learning


Cite this article

Jiang, Z. (2025). A New Paradigm for RNA Homology Search Empowered by Large Language Models and Contrastive Learning. Applied and Computational Engineering, 211, 27-41.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of CONF-SPML 2026 Symposium: The 2nd Neural Computing and Applications Workshop 2025

ISBN: 978-1-80590-579-0 (Print) / 978-1-80590-580-6 (Online)
Editors: Marwan Omar, Guozheng Rao
Conference date: 21 December 2025
Series: Applied and Computational Engineering
Volume number: Vol.211
ISSN: 2755-2721 (Print) / 2755-273X (Online)