Researchers at Los Alamos National Laboratory have developed a new artificial intelligence (AI) model, EPBDxDNABERT-2, for gene regulation research. The model has demonstrated a 9.6% improvement in predicting where transcription factors bind to DNA, a critical aspect of understanding how genes work, especially in diseases like cancer.
The study, led by researcher Anowarul Kabir, focuses on a process called “DNA breathing.” This process involves the spontaneous opening and closing of the DNA double helix. The model’s success lies in its ability to combine DNA breathing dynamics with advanced genomic analysis.
DNA acts as a blueprint for growth and function. In each human cell, it contains the equivalent of about 3 billion letters. Transcription factors are proteins that bind to specific DNA regions and regulate how genes guide cell development and function. Because this regulation plays a role in diseases such as cancer, accurately predicting where transcription factors bind could have a significant impact on drug development.
“There are many types of transcription factors, and the human genome is incomprehensibly large. We tried to solve that problem with artificial intelligence, particularly deep-learning algorithms,” Kabir emphasized.
Processing large genomic data was made possible by using the laboratory’s newest supercomputer, Venado. In training EPBDxDNABERT-2, the researchers used sequencing data from 690 experiments, covering 161 different transcription factors across 91 human cell types.
By combining DNA breathing features with genomic data, the model became better at finding where transcription factors attach to DNA. Specifically, the researchers found a 9.6% increase in accuracy across more than 660 transcription factor binding predictions. This is especially helpful for understanding different cell types and their gene regulation.
“The integration of the DNA breathing features with the DNABERT-2 foundational model greatly enhanced transcription factor-binding predictions,” explained researcher Manish Bhattarai. “We give sections of DNA code as input to the model and ask the model whether it binds to a transcription factor, or not, across many cell lines.”
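The task Bhattarai describes, a DNA segment in and a bind-or-not answer out for many experiments at once, can be illustrated with a toy sketch. This is not the EPBDxDNABERT-2 architecture and uses no DNA breathing features. The linear scorer, the random weights, the three-label setup, and the names `one_hot` and `predict_binding` are all assumptions made only to show the shape of the inputs and outputs.

```python
import math
import random

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values, one column per base (A, C, G, T)."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def predict_binding(seq, weights, biases):
    """Return one binding probability per label for a DNA segment.

    Hypothetical stand-in for a trained model: each label (for example one
    transcription factor in one cell line) gets its own linear score over the
    segment's base composition, squashed to (0, 1) with a sigmoid.
    """
    if not seq:
        raise ValueError("sequence must not be empty")
    encoded = one_hot(seq)
    # Fraction of A, C, G, T in the segment (a deliberately crude feature).
    composition = [sum(row[i] for row in encoded) / len(encoded) for i in range(4)]
    probs = []
    for w, b in zip(weights, biases):
        score = sum(wi * ci for wi, ci in zip(w, composition)) + b
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return probs

random.seed(0)
NUM_LABELS = 3  # assumption: three experiments; the study used far more
weights = [[random.uniform(-1, 1) for _ in BASES] for _ in range(NUM_LABELS)]
biases = [0.0] * NUM_LABELS

probs = predict_binding("ACGTGGCATTAC", weights, biases)
calls = [p >= 0.5 for p in probs]  # independent yes/no call per label
print(probs, calls)
```

Each label gets its own independent probability, so a single DNA segment can be predicted as bound in some experiments and unbound in others.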
This breakthrough could improve the process of developing medicines for diseases caused by genetic problems. By integrating DNA breathing into computational genomics, the model offers scientists a versatile tool for exploring the genetic mechanisms underlying various diseases.
“As demonstrated by its performance across multiple, diverse datasets, our multimodal foundational model exhibits versatility, robustness, and efficacy,” Bhattarai added.
The research represents a step forward in computational genomics and shows AI’s potential to transform disease research and treatment. Given its results so far, EPBDxDNABERT-2 could become an important component of future genetic and pharmaceutical research efforts.