Understanding how genes interact in complex biological systems has long been a cornerstone of molecular biology. One of the most powerful ways to study these interactions is through perturbation experiments, in which scientists selectively disrupt genes and observe the effects on cellular function. Techniques like Perturb-seq have transformed the field by pairing large-scale CRISPR interventions with single-cell sequencing to map how individual genes shape cellular behavior. However, the sheer volume of data these experiments generate, and the high cost of running them, remain major barriers to widespread use.
Machine learning (ML) and artificial intelligence (AI) offer a way to predict cellular responses and extract meaningful insights without exhaustive laboratory experiments. But there’s a catch: many current AI models treat biological data as bare numbers, missing the semantic richness of genetic relationships. They capture raw correlations rather than deeper biological reasoning, which limits their ability to support meaningful discoveries.
A recent study led by Menghua Wu (MIT), Russell Littman, Jacob Levine, David Richmond, Tommaso Biancalani, Jan-Christian Hütter (Genentech), and Lin Qiu (Meta AI) proposes a new approach. They introduce PERTURBQA, a benchmark designed to align AI-driven perturbation models with real biological decision-making. More importantly, they demonstrate how large language models (LLMs)—the same technology that powers AI chatbots—can be repurposed for biological research. Their method, called SUMMER (SUMMarize, retrievE, and answeR), shows that AI can interpret and reason over perturbation experiments using natural language, potentially outperforming existing models.
Why current AI approaches fall short
The biggest limitation of perturbation experiments is their cost. These experiments rely on single-cell RNA sequencing (scRNA-seq), a technique that allows scientists to measure how gene expression changes when specific genes are knocked down or overexpressed. While powerful, these experiments are expensive and time-consuming, requiring thousands of cells and intricate data analysis.
To address this, machine learning models attempt to predict how genes will behave under perturbation before actually conducting experiments. These models use knowledge graphs—databases of known biological interactions—to infer how a new gene disruption might affect a cell. However, this approach has several shortcomings:
- Loss of information: When biological relationships are reduced to numerical adjacency matrices, much of the detailed context is lost (the sketch after this list makes this concrete).
- Misaligned objectives: Most models focus on predicting changes in gene expression levels rather than answering biological questions that researchers actually care about.
- Black-box nature: Many AI models work as “black boxes,” making it difficult to interpret why they arrive at a particular prediction.
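To see the first shortcoming concretely, here is a minimal Python sketch, with invented gene names and relation labels rather than anything from the paper, of what disappears when typed knowledge-graph edges are flattened into a binary adjacency matrix:

```python
import numpy as np

# A knowledge graph stores typed, directional relationships between genes.
# (Gene names and relation labels below are illustrative placeholders.)
edges = [
    ("GENE_A", "represses", "GENE_B"),
    ("GENE_A", "binds", "GENE_C"),
    ("GENE_C", "activates", "GENE_B"),
]

genes = sorted({g for src, _, dst in edges for g in (src, dst)})
index = {g: i for i, g in enumerate(genes)}

# Flattening to a binary adjacency matrix keeps only "connected or not":
# "represses", "binds", and "activates" all collapse to the same 1.
adj = np.zeros((len(genes), len(genes)), dtype=int)
for src, relation, dst in edges:
    adj[index[src], index[dst]] = 1

print(adj)  # the relation types -- the actual biology -- are gone
```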
A language-based alternative
To overcome these limitations, the research team proposes a language-based approach. Instead of treating genes as mere data points, they argue that biological relationships should be represented through natural language—the way scientists naturally describe genetic interactions.
This is where large language models (LLMs) come in.
PERTURBQA: A new benchmark for AI in biology
To test whether language models can reason about genetic perturbations, the researchers created PERTURBQA, a benchmark that evaluates AI models on three real-world biological tasks, each posed as a plain-language question (see the sketch after this list):
- Differential expression prediction: Given a gene perturbation, predict whether another gene’s expression will significantly change.
- Direction of change: If a gene’s expression changes, determine whether it increases or decreases.
- Gene set enrichment: Identify clusters of genes that behave similarly under perturbations and describe their common function.
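To illustrate the framing, here is a hypothetical record covering the first two tasks. The actual PERTURBQA schema may differ, and the gene names below are placeholders (K562 is a cell line commonly used in Perturb-seq screens):

```python
from dataclasses import dataclass

# A hypothetical record format for the differential expression and
# direction-of-change tasks; this only illustrates the question framing.
@dataclass
class PerturbationQuestion:
    cell_line: str        # cellular context of the experiment
    perturbed_gene: str   # gene knocked down in the experiment
    target_gene: str      # gene whose expression we ask about
    question: str         # natural-language form of the task

de_example = PerturbationQuestion(
    cell_line="K562",
    perturbed_gene="GENE_A",   # placeholder name
    target_gene="GENE_B",      # placeholder name
    question=(
        "If GENE_A is knocked down in K562 cells, will the expression "
        "of GENE_B change significantly? If so, up or down?"
    ),
)
print(de_example.question)
```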
Unlike previous benchmarks, which mostly assess whether AI can recall existing biological knowledge, PERTURBQA is designed to predict and reason about new, unseen perturbations. The dataset includes five large-scale Perturb-seq experiments that cover multiple cell types.
SUMMER: An AI model that thinks like a biologist
To solve the PERTURBQA tasks, the researchers introduced SUMMER, a language-based AI framework that outperforms traditional machine learning models in reasoning over perturbation data.
SUMMER works in three key steps, sketched in code after this list:
- Summarization: The LLM reads and summarizes biological knowledge graphs, extracting key descriptions of genes and their interactions.
- Retrieval: The model retrieves relevant experimental data from previously seen perturbations, grounding its reasoning in real-world examples.
- Question-Answering: Finally, SUMMER answers biological questions about perturbations using a step-by-step reasoning process, similar to how a biologist would analyze experimental results.
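In code, the pipeline might look roughly like the sketch below. The `llm`, `knowledge_graph`, and `seen_experiments` objects are stand-ins for whatever model and data stores one plugs in; the paper’s actual prompts and retrieval scheme are more involved than this.

```python
# A minimal sketch of the summarize -> retrieve -> answer loop.
# All interfaces here (neighborhood, most_similar, a callable llm)
# are assumptions for illustration, not SUMMER's real API.

def summarize(llm, knowledge_graph, gene: str) -> str:
    """Step 1: turn the gene's graph neighborhood into a text summary."""
    facts = knowledge_graph.neighborhood(gene)  # assumed interface
    return llm(f"Summarize what is known about {gene}:\n{facts}")

def retrieve(seen_experiments, gene: str, k: int = 5) -> list[str]:
    """Step 2: pull the k most relevant previously observed perturbations."""
    return seen_experiments.most_similar(gene, k=k)  # assumed interface

def answer(llm, summary: str, examples: list[str], question: str) -> str:
    """Step 3: reason step by step over the summary and retrieved examples."""
    prompt = (
        f"Background:\n{summary}\n\n"
        "Related experiments:\n" + "\n".join(examples) + "\n\n"
        f"Question: {question}\nThink step by step, then answer."
    )
    return llm(prompt)
```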
Unlike conventional models that blindly correlate genes, SUMMER explains why a perturbation might cause a certain effect, making its predictions more interpretable.
How well does SUMMER perform?
The researchers tested SUMMER against state-of-the-art AI models, including:
- Graph-based models (GEARS, GAT): These rely on structured biological networks but often discard key semantic information.
- Single-cell foundation models (scGPT): These use deep learning to predict gene expression levels but struggle to provide clear biological explanations.
- Text-based AI models (GenePT): These encode genetic descriptions into numerical representations but lack explicit reasoning steps.
The results showed that SUMMER outperformed all baseline models on both differential expression and gene set enrichment tasks. Notably, models without structured reasoning or experimental retrieval performed no better than random guessing, highlighting the importance of SUMMER’s approach.
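For intuition on the “no better than random guessing” comparison: the differential expression task is a binary call, so it can be scored with a threshold-free metric such as AUROC, where 0.5 is chance and 1.0 is perfect. A toy computation follows (the paper’s exact evaluation setup may differ):

```python
from sklearn.metrics import roc_auc_score

# Toy scoring of binary differential-expression predictions.
# y_true: 1 if the gene's expression truly changed, else 0.
# y_score: the model's confidence that it changed.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.2, 0.7, 0.4, 0.5, 0.1]

print(roc_auc_score(y_true, y_score))  # 0.5 = random guessing, 1.0 = perfect
```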
Can AI describe biological patterns?
One of the most impressive achievements of SUMMER was in gene set enrichment. Traditionally, scientists use statistical tests to group genes into functional sets, but these methods struggle with poorly characterized genes. SUMMER, on the other hand, was able to generate accurate, interpretable descriptions of gene clusters, often matching or exceeding human annotations.
For example, when analyzing a gene cluster involved in RNA modification, traditional statistical methods failed to provide meaningful insights. SUMMER, however, generated the following description:
“M6A Methylation Complex-Associated Genes: This set includes genes regulating N6-methyladenosine (m6A) methylation of RNAs, influencing mRNA splicing and RNA processing.”
Such descriptions are not only more readable but also capture the broader biological significance of gene interactions.
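One plausible way to elicit such descriptions is to hand the LLM the per-gene summaries for a cluster and ask for the shared function. The function below is a hedged sketch under that assumption, not SUMMER’s actual template:

```python
def describe_cluster(llm, gene_summaries: dict[str, str]) -> str:
    """Ask the model to name and describe what a gene cluster has in common.

    `gene_summaries` maps gene name -> short text summary (e.g. produced
    in a summarization step); `llm` is any text-in/text-out callable.
    """
    listing = "\n".join(f"- {g}: {s}" for g, s in gene_summaries.items())
    prompt = (
        "The following genes were grouped together because they respond "
        "similarly to perturbations:\n"
        f"{listing}\n\n"
        "Give the set a short descriptive name and one sentence on the "
        "shared biological function."
    )
    return llm(prompt)
```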
Where the research goes next
While SUMMER represents a major step forward, biological reasoning with AI is far from a solved problem. The study highlights several future directions:
- Integrating multimodal AI models: Combining language models with specialized AI trained on raw genomic data could improve accuracy.
- Scaling AI-driven perturbation predictions: More comprehensive datasets could help AI models learn finer details about genetic interactions.
- Real-world applications in drug discovery: AI models like SUMMER could accelerate the identification of potential drug targets by predicting how cells respond to genetic modifications.