Advancing Protein Science with Large Language Models: From Sequence Understanding to Drug Discovery

Proteins, essential macromolecules for biological processes like metabolism and immune response, follow the sequence-structure-function paradigm, where amino acid sequences determine 3D structures and functions. Computational protein science aims to decode this relationship and design proteins with desired properties. Traditional AI models have achieved significant success in specific protein modeling tasks, such as structure prediction and design. However, these models struggle to capture the “grammar” and “semantics” of protein sequences and generalize poorly across tasks. Recently, protein language models (pLMs) that apply LLM techniques have emerged, enabling advances in protein understanding, function prediction, and design.
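To make the "sequence as language" idea concrete, here is a minimal sketch of how a protein sequence can be turned into tokens and partially masked, which is one common way language models are trained on sequences. The article does not specify a training objective, so the masked-token setup, the 15% mask fraction, the `<mask>` token name, and the example sequence are all assumptions chosen for illustration; the helper names `tokenize` and `mask_tokens` are hypothetical.

```python
import random

# The 20 standard amino acids, one letter each; every letter is one token.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_TOKEN = "<mask>"  # assumed name for the special mask token

# Vocabulary: amino acids get ids 0..19, the mask token gets id 20.
VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
VOCAB[MASK_TOKEN] = len(VOCAB)


def tokenize(sequence):
    """Map a protein sequence to a list of integer token ids."""
    return [VOCAB[aa] for aa in sequence.upper()]


def mask_tokens(token_ids, mask_fraction=0.15, seed=0):
    """Replace a random subset of tokens with the mask id.

    Returns the masked id list and a dict {position: original id},
    which is what a model would be asked to recover.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    n_mask = max(1, round(len(token_ids) * mask_fraction))
    positions = sorted(rng.sample(range(len(token_ids)), n_mask))
    masked = list(token_ids)
    labels = {}
    for pos in positions:
        labels[pos] = masked[pos]
        masked[pos] = VOCAB[MASK_TOKEN]
    return masked, labels


# Arbitrary example string of amino acid letters (illustrative only).
sequence = "MKTAYIAKQRQISFVKSHFSRQ"
ids = tokenize(sequence)
masked_ids, labels = mask_tokens(ids)
print(masked_ids)
print(labels)
```

A model trained to fill in the masked positions has to learn which residues are plausible in which contexts, which is the sense in which a pLM picks up the "grammar" of sequences.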

Researchers from institutions including The Hong Kong Polytechnic University, Michigan State University, and Mohamed bin Zayed University of Artificial Intelligence have surveyed how LLM techniques are being integrated into computational protein science to build pLMs. These models capture protein knowledge effectively and address sequence-structure-function reasoning problems. The survey systematically categorizes pLMs into sequence-based, structure- and function-enhanced, and multimodal models, and explores their applications in protein structure prediction, function prediction, and design. It highlights the impact of pLMs on antibody design, enzyme engineering, and drug discovery, and discusses challenges and future directions, providing insights for AI and biology researchers in this growing field.

Protein structure prediction is a critical challenge in computational biology due to the complexity of experimental techniques like X-ray crystallography and NMR. Recent advancements like AlphaFold2 and RoseTTAFold have significantly improved structure prediction by incorporating evolutionary and geometric constraints. However, these methods still face challenges, especially with orphan proteins lacking homologous sequences. To address these issues, single-sequence prediction methods, like ESMFold, use pLMs to predict protein structures without relying on multiple sequence alignments (MSAs). These methods offer faster and more universal predictions, particularly for proteins with no homology, though there is still room for improvement in accuracy.

pLMs have significantly impacted computational and experimental protein science, particularly in applications like antibody design, enzyme design, and drug discovery. In antibody design, pLMs can propose antibody sequences that specifically bind to target antigens, offering a more controlled and cost-effective alternative to traditional animal-based methods. Models such as PALM-H3 have successfully designed antibodies targeting various SARS-CoV-2 variants, demonstrating improved neutralization and affinity. Similarly, pLMs play a key role in enzyme design by optimizing wild-type enzymes for enhanced stability and new catalytic functions. For example, InstructPLM has been used to redesign enzymes like PETase and L-MDH, improving their efficiency relative to the wild-type enzymes.

In drug discovery, pLMs help predict interactions between drugs and target proteins, accelerating the screening of potential drug candidates. Models like TransDTI can classify drug-target interactions, aiding in identifying promising compounds for diseases. Additionally, ConPLex leverages contrastive learning to predict kinase-drug interactions, successfully confirming several high-affinity binding interactions. These advances in pLM applications streamline the drug discovery process and contribute to developing more effective therapies with better efficiency and safety profiles.
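The contrastive idea mentioned above can be illustrated with a small triplet-style loss: a target protein's embedding should sit closer to a true binder than to a non-binding decoy, by at least some margin. This is a generic sketch of contrastive learning on embeddings, not ConPLex's actual implementation; the three-dimensional vectors, the cosine distance, the margin of 0.5, and the function names are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_margin_loss(target, binder, decoy, margin=0.5):
    """Penalize the case where the decoy is not at least `margin`
    farther from the target than the true binder (cosine distance)."""
    pos_dist = 1.0 - cosine_similarity(target, binder)
    neg_dist = 1.0 - cosine_similarity(target, decoy)
    return max(0.0, pos_dist - neg_dist + margin)


# Toy embeddings (hypothetical values, not produced by any real model).
target = [1.0, 0.0, 0.0]
binder = [0.9, 0.1, 0.0]  # nearly parallel to the target
decoy = [0.0, 1.0, 0.0]   # orthogonal to the target

# Well-separated triplet: the loss is zero, nothing left to learn.
loss_good = triplet_margin_loss(target, binder, decoy)
# Swapped roles: the "binder" is far and the "decoy" is close, so the loss is large.
loss_bad = triplet_margin_loss(target, decoy, binder)
print(loss_good, loss_bad)
```

During training, gradients of a loss like this would pull binding pairs together and push decoys away in the shared embedding space, so that a simple similarity score can later rank candidate compounds for a given target.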

In conclusion, the survey provides an in-depth look at the role of LLMs in protein science, covering both foundational concepts and recent advancements. It discusses the biological basis of protein modeling, the categorization of pLMs by the kind of information they model (sequence, structure, and function), and their applications in protein structure prediction, function prediction, and design. The survey also highlights the potential of pLMs in practical fields like antibody design, enzyme engineering, and drug discovery. Lastly, it outlines promising future directions in this rapidly advancing field, emphasizing the transformative impact of AI on computational protein science.
