Learning the Language of Antibody Hypervariability Through Biological Property Prediction

Bibliographic Details
Main Author: Im, Chiho
Other Authors: Berger, Bonnie
Format: Thesis
Published: Massachusetts Institute of Technology 2023
Online Access: https://hdl.handle.net/1721.1/151427
Description
Summary: Machine learning-based protein language models (PLMs) have proven successful in a variety of structure- and function-prediction contexts. However, foundational PLMs (those trained on the corpus of all proteins) rely on evolutionary co-conservation of protein sub-sequences, a distributional hypothesis that does not hold for antibody hypervariable regions. Consequently, methods like AlphaFold 2 perform relatively poorly on antibody sequences. In this work, we propose AbMAP (Antibody Mutagenesis-Augmented Processing), a new transfer learning framework that fine-tunes foundational models specifically for antibody-sequence inputs by supervising on examples of antibody structure and binding specificity. We demonstrate how our feature representations can be applied to accurately predict an antibody’s local and global 3D structures and the effects of mutations on antigen binding specificity, and to identify its paratope. The scalability of AbMAP newly enables large-scale analysis of human antibody repertoires. We find that the AbMAP representations of individual repertoires overlap remarkably, more so than can be discerned by sequence analysis alone. Our findings provide robust evidence for the hypothesis that antibody repertoires across individuals converge towards similar structural and functional coverage. We anticipate that AbMAP will accelerate the efficient and effective design and modeling of antibodies and expedite the discovery of antibody-based therapeutics.