OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

Ahdritz, Gustaf; Bouatta, Nazim; Floristean, Christina; Kadyan, Sachin; Xia, Qinghui; Gerecke, William; O’Donnell, Timothy J.; Berenberg, Daniel; Fisk, Ian; Zanichelli, Niccolò; Zhang, Bo; Nowaczynski, Arkadiusz; Wang, Bei; Stepniewska-Dziubinska, Marta M.; Zhang, Shang; Ojewole, Adegoke; Guney, Murat Efe; Biderman, Stella; Watkins, Andrew M.; Ra, Stephen; Lorenzo, Pablo Ribalta; Nivon, Lucas; Weitzner, Brian; Ban, Yih-En Andrew; Chen, Shiyang; Zhang, Minjia; Li, Conglong; Song, Shuaiwen Leon; He, Yuxiong; Sorger, Peter K.; Mostaque, Emad; Zhang, Zhao; Bonneau, Richard; AlQuraishi, Mohammed

doi:10.1038/s41592-024-02272-z

Article
Published: 14 May 2024

Gustaf Ahdritz ORCID: orcid.org/0000-0001-8283-5324¹,² (contributed equally),
Nazim Bouatta ORCID: orcid.org/0000-0002-6524-874X³ (contributed equally),
Christina Floristean¹,
Sachin Kadyan¹,
Qinghui Xia¹,
William Gerecke³,
Timothy J. O’Donnell⁴,
Daniel Berenberg⁵,
Ian Fisk⁶,
Niccolò Zanichelli⁷,
Bo Zhang ORCID: orcid.org/0000-0002-9714-2827⁸,
Arkadiusz Nowaczynski⁹,
Bei Wang⁹,
Marta M. Stepniewska-Dziubinska⁹,
Shang Zhang⁹,
Adegoke Ojewole ORCID: orcid.org/0000-0003-2661-4388⁹,
Murat Efe Guney⁹,
Stella Biderman¹⁰,¹¹,
Andrew M. Watkins¹²,
Stephen Ra¹²,
Pablo Ribalta Lorenzo⁹,
Lucas Nivon¹³,
Brian Weitzner ORCID: orcid.org/0000-0002-1909-0961¹⁴,
Yih-En Andrew Ban¹⁵,
Shiyang Chen ORCID: orcid.org/0000-0003-2626-7865¹⁶,
Minjia Zhang¹⁷,
Conglong Li¹⁸,
Shuaiwen Leon Song¹⁸,
Yuxiong He¹⁸,
Peter K. Sorger ORCID: orcid.org/0000-0002-3364-1838³,
Emad Mostaque¹⁹,
Zhao Zhang ORCID: orcid.org/0000-0001-5921-0035¹⁶,
Richard Bonneau¹² &
…
Mohammed AlQuraishi ORCID: orcid.org/0000-0001-6817-1322¹

Nature Methods (2024)

Abstract

AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein–ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model’s capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory-efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set are deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.

**Fig. 1: OpenFold matches the accuracy of AlphaFold2.**

**Fig. 2: OpenFold generalization capacity on elided training sets.**

**Fig. 4: Secondary structure categories are learned in succession.**

**Fig. 5: Learning proceeds at multiple scales.**

Data availability

OpenProteinSet and OpenFold model parameters are hosted on the Registry of Open Data on AWS and can be accessed at https://registry.opendata.aws/openfold/. Both are available under the permissive CC BY 4.0 license. Throughout the study, we use validation sets derived from the PDB via CAMEO. We also use CASP evaluation sets. Source data are provided with this paper.

Code availability

OpenFold can be accessed at https://github.com/aqlaboratory/openfold. It is available under the permissive Apache License 2.0.

References

Anfinsen, C. B. Principles that govern the folding of protein chains. Science 181, 223–230 (1973).
Dill, K. A., Ozkan, S. B., Shell, M. S. & Weikl, T. R. The protein folding problem. Annu. Rev. Biophys. 37, 289–316 (2008).
Jones, D. T., Singh, T., Kosciolek, T. & Tetchner, S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31, 999–1006 (2015).
Golkov, V. et al. Protein contact prediction from amino acid co-evolution using convolutional networks for graph-valued images. In Advances in Neural Information Processing Systems (eds Lee, D. et al.) (Curran Associates, 2016).
Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
Liu, Y., Palmedo, P., Ye, Q., Berger, B. & Peng, J. Enhancing evolutionary couplings with deep convolutional neural networks. Cell Syst. 6, 65–74 (2018).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Xu, J., McPartlon, M. & Li, J. Improved protein structure prediction by deep learning irrespective of co-evolution information. Nat. Mach. Intell. 3, 601–609 (2021).
Šali, A. & Blundell, T. L. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815 (1993).
Roy, A., Kucukural, A. & Zhang, Y. I-TASSER: a unified platform for automated protein structure and function prediction. Nat. Protoc. 5, 725–738 (2010).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
Baek, M. Adding a big enough number for ‘residue_index’ feature is enough to model hetero-complex using AlphaFold (green&cyan: crystal structure / magenta: predicted model w/ residue_index modification). Twitter twitter.com/minkbaek/status/1417538291709071362?lang=en (2021).
Tsaban, T. et al. Harnessing protein folding neural networks for peptide–protein docking. Nat. Commun. 13, 176 (2022).
Roney, J. P. & Ovchinnikov, S. State-of-the-art estimation of protein model accuracy using AlphaFold. Phys. Rev. Lett. 129, 238101 (2022).
Baltzis, A. et al. Highly significant improvement of protein sequence alignments with AlphaFold2. Bioinformatics 38, 5007–5011 (2022).
Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein–protein interactions using AlphaFold2. Nat. Commun. 13, 1265 (2022).
Wayment-Steele, H. K., Ovchinnikov, S., Colwell, L. & Kern, D. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. Nature 625, 832–839 (2024).
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).
Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2021).
Callaway, E. ‘The entire protein universe’: AI predicts shape of nearly every known protein. Nature 608, 15–16 (2022).
Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. Preprint at bioRxiv https://doi.org/10.1101/2021.10.04.463034 (2021).
Ahdritz, G. et al. OpenProteinSet: training data for structural biology at scale. In Advances in Neural Information Processing Systems (eds Oh, A. et al.) 4597–4609 (Curran Associates, 2023).
Paszke, A. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems (eds Wallach, H. et al.) 8026–8037 (Curran Associates, 2019).
Bradbury, J. et al. JAX: composable transformations of Python+NumPy programs. GitHub github.com/google/jax (2018).
Rasley, J., Rajbhandari, S., Ruwase, O. & He, Y. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20 3505–3506 (Association for Computing Machinery, 2020).
Charlier, B., Feydy, J., Glaunès, J., Collin, F.-D. & Durif, G. Kernel operations on the GPU, with autodiff, without memory overflows. J. Mach. Learn. Res. 22, 1–6 (2021).
Falcon, W. & the PyTorch Lightning team. PyTorch Lightning (PyTorch Lightning, 2019).
Dao, T., Fu, D. Y., Ermon, S., Rudra, A. & Ré, C. FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (eds Koyejo, S. et al.) 16344–16359 (Curran Associates, 2022).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).
wwPDB Consortium. Protein Data Bank: the single global archive for 3D macromolecular structure data. Nucleic Acids Res. 47, D520–D528 (2018).
Haas, J. et al. Continuous automated model evaluation (CAMEO) complementing the critical assessment of structure prediction in CASP12. Proteins 86, 387–398 (2018).
Mariani, V., Biasini, M., Barbato, A. & Schwede, T. lDDT: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics 29, 2722–2728 (2013).
Orengo, C. A. et al. CATH—a hierarchic classification of protein domain structures. Structure 5, 1093–1108 (1997).
Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic Acids Res. 49, D266–D273 (2021).
Andreeva, A., Kulesha, E., Gough, J. & Murzin, A. G. The SCOP database in 2020: expanded classification of representative family and superfamily domains of known protein structures. Nucleic Acids Res. 48, D376–D382 (2020).
Saitoh, Y. et al. Structural basis for high selectivity of a rice silicon channel Lsi1. Nat. Commun. 12, 6236 (2021).
Mota, D. C. A. M. et al. Structural and thermodynamic analyses of human TMED1 (p24γ1) Golgi dynamics. Biochimie 192, 72–82 (2022).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems (eds Guyon, I. et al.) (Curran Associates, 2017).
Rabe, M. N. & Staats, C. Self-attention does not need O(n²) memory. Preprint at https://doi.org/10.48550/arXiv.2112.05682 (2021).
Cheng, S. et al. FastFold: Optimizing AlphaFold Training and Inference on GPU Clusters. In Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming 417–430 (Association for Computing Machinery, 2024).
Li, Z. et al. Uni-Fold: an open-source platform for developing protein folding models beyond AlphaFold. Preprint at bioRxiv https://doi.org/10.1101/2022.08.04.502811 (2022).
Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22, 2577–2637 (1983).
Zemla, A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 31, 3370–3374 (2003).
Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6, e28766 (2011).
Sułkowska, J. I., Morcos, F., Weigt, M., Hwa, T. & Onuchic, J. N. Genomics-aided structure prediction. Proc. Natl Acad. Sci. USA 109, 10340–10345 (2012).
Kaplan, J. et al. Scaling laws for neural language models. Preprint at https://doi.org/10.48550/arXiv.2001.08361 (2020).
Hoffmann, J. et al. An empirical analysis of compute-optimal large language model training. In Advances in Neural Information Processing Systems (eds Oh, A. H. et al.) 30016–30030 (NeurIPS, 2022).
Tay, Y. et al. Scaling laws vs model architectures: how does inductive bias influence scaling? In Findings of the Association for Computational Linguistics: EMNLP 2023 (eds Bouamor, H. et al.) 12342–12364 (Association for Computational Linguistics, 2023).
Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Chowdhury, R. et al. Single-sequence protein structure prediction using a language model and deep learning. Nat. Biotechnol. 40, 1617–1623 (2022).
Wu, R. et al. High-resolution de novo structure prediction from primary sequence. Preprint at bioRxiv https://doi.org/10.1101/2022.07.21.500999 (2022).
Singh, J., Paliwal, K., Litfin, T., Singh, J. & Zhou, Y. Predicting RNA distance-based contact maps by integrated deep learning on physics-inferred secondary structure and evolutionary-derived mutational coupling. Bioinformatics 38, 3900–3910 (2022).
Baek, M., McHugh, R., Anishchenko, I., Baker, D. & DiMaio, F. Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nat. Methods 21, 117–121 (2024).
Pearce, R., Omenn, G. S. & Zhang, Y. De novo RNA tertiary structure prediction at atomic resolution using geometric potentials from deep learning. Preprint at bioRxiv https://doi.org/10.1101/2022.05.15.491755 (2022).
McPartlon, M., Lai, B. & Xu, J. A deep SE(3)-equivariant model for learning inverse protein folding. Preprint at bioRxiv https://doi.org/10.1101/2022.04.15.488492 (2022).
McPartlon, M. & Xu, J. An end-to-end deep learning method for protein side-chain packing and inverse folding. Proc. Natl Acad. Sci. USA 120, e2216438120 (2023).
Knox, H. L., Sinner, E. K., Townsend, C. A., Boal, A. K. & Booker, S. J. Structure of a B₁₂-dependent radical SAM enzyme in carbapenem biosynthesis. Nature 602, 343–348 (2022).
Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).
Rajbhandari, S., Rasley, J., Ruwase, O. & He, Y. ZeRO: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (IEEE Press, 2020).
Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. In 3rd International Conference on Learning Representations (eds Bengio, Y. & LeCun, Y.) (ICLR, 2015).
Wang, G. et al. HelixFold: an efficient implementation of AlphaFold2 using PaddlePaddle. Preprint at https://doi.org/10.48550/arXiv.2207.05477 (2022).
Yuan, J. et al. OneFlow: redesign the distributed deep learning framework from scratch. Preprint at https://doi.org/10.48550/arXiv.2110.15032 (2021).
Ovchinnikov, S. Weekend project! So now that OpenFold weights are available. I was curious how different they are from AlphaFold weights and if they can be used for AfDesign evaluation. More specifically, if you design a protein with AlphaFold, can OpenFold predict it (and vice-versa)? (1/5). Twitter twitter.com/sokrypton/status/1551242121528520704?lang=en (2022).
Wei, X. et al. The α-helical cap domain of a novel esterase from gut Alistipes shahii shaping the substrate-binding pocket. J. Agric. Food Chem. 69, 6064–6072 (2021).
Carroll, B. L. et al. Caught in motion: human NTHL1 undergoes interdomain rearrangement necessary for catalysis. Nucleic Acids Res. 49, 13165–13178 (2021).

Acknowledgements

We thank the Flatiron Institute, OpenBioML, Stability AI, the Texas Advanced Computing Center and NVIDIA for providing compute for experiments in this paper. Individually, we thank M. Mirdita, M. Steinegger and S. Ovchinnikov for valuable support and expertise. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the US Department of Energy under contract no. DE-AC02-05CH11231. We acknowledge the Texas Advanced Computing Center at the University of Texas at Austin for providing HPC resources that have contributed to the research results reported within this paper. G.A. is supported by a Simons Investigator Fellowship, NSF grant DMS-2134157, DARPA grant W911NF2010021, DOE grant DE-SC0022199 and a graduate fellowship from the Kempner Institute at Harvard University. N.B. is supported by DARPA Panacea program grant HR0011-19-2-0022 and NCI grant U54-CA225088. C.F. and S.K. are supported by NIH grant R35GM150546. B.Z. and Z.Z. are supported by grants NSF OAC-2112606 and OAC-2106661. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

These authors contributed equally: Gustaf Ahdritz, Nazim Bouatta.

Authors and Affiliations

Department of Systems Biology, Columbia University, New York, NY, USA
Gustaf Ahdritz, Christina Floristean, Sachin Kadyan, Qinghui Xia & Mohammed AlQuraishi
Harvard University, Cambridge, MA, USA
Gustaf Ahdritz
Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
Nazim Bouatta, William Gerecke & Peter K. Sorger
Icahn School of Medicine at Mount Sinai, New York, NY, USA
Timothy J. O’Donnell
Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
Daniel Berenberg
Flatiron Institute, New York, NY, USA
Ian Fisk
OpenBioML, Cambridge, MA, USA
Niccolò Zanichelli
Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA
Bo Zhang
NVIDIA, Santa Clara, CA, USA
Arkadiusz Nowaczynski, Bei Wang, Marta M. Stepniewska-Dziubinska, Shang Zhang, Adegoke Ojewole, Murat Efe Guney & Pablo Ribalta Lorenzo
EleutherAI, New York, NY, USA
Stella Biderman
Booz Allen Hamilton, McLean, VA, USA
Stella Biderman
Prescient Design, Genentech, New York, NY, USA
Andrew M. Watkins, Stephen Ra & Richard Bonneau
Cyrus Bio, Seattle, WA, USA
Lucas Nivon
Outpace Bio, Seattle, WA, USA
Brian Weitzner
Arzeda, Seattle, WA, USA
Yih-En Andrew Ban
Rutgers University, New Brunswick, NJ, USA
Shiyang Chen & Zhao Zhang
University of Illinois at Urbana-Champaign, Champaign, IL, USA
Minjia Zhang
Microsoft, Redmond, WA, USA
Conglong Li, Shuaiwen Leon Song & Yuxiong He
Stability AI, Los Altos, CA, USA
Emad Mostaque

Contributions

G.A. wrote and optimized the OpenFold codebase, generated data, trained the model, performed experiments and maintained the GitHub repository. C.F. wrote and tested code for the OpenFold implementation of AlphaFold-Multimer. S.K. and W.G. wrote data preprocessing code. G.A., N.B. and M.A. conceived of and managed the project, designed experiments, analyzed results and wrote the manuscript. G.A., B.Z., Z.Z., N.Z. and A.N. ran ablations. All authors read and approved the manuscript. The Flatiron Institute (via I.F., A.M.W., S.R. and R.B.) provided compute for ablations, all data generation and our main training experiments. NVIDIA (A.N., B. Wang, M.M.S.-D., S.Z., A.O., M.E.G. and P.R.L.) performed training stability experiments, fixed critical bugs in the codebase, added new model features and provided compute for ablations. Stability AI (via N.Z., S.B. and E.M.) provided compute for ablations. The DeepSpeed team at Microsoft (S.C., M.Z., C.L., S.L.S. and Y.H.) wrote custom optimized attention kernels. Q.X. and T.J.O.’D. debugged code and provided feedback.

Corresponding authors

Correspondence to Nazim Bouatta or Mohammed AlQuraishi.

Ethics declarations

Competing interests

M.A. is a member of the scientific advisory boards of Cyrus Biotechnology, Deep Forest Sciences, Nabla Bio, Oracle Therapeutics and FL2021-002, a Foresite Labs company. P.K.S. is a cofounder and member of the BOD of Glencoe Software, member of the BOD for Applied BioMath and a member of the SAB for RareCyte, NanoString, Reverb Therapeutics and Montai Health; he holds equity in Glencoe, Applied BioMath and RareCyte. L.N. is an employee of Cyrus Biotechnology. The other authors declare no competing interests.

Peer review

Peer review information

Nature Methods thanks the anonymous reviewers for their contribution to the peer review of this work. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 OpenFold matches the accuracy of AlphaFold2 on CASP15 targets.

Scatter plot of GDT-TS values of AlphaFold and OpenFold ‘Model 1’ predictions against all currently available ‘all groups’ CASP15 domains (n = 90). OpenFold’s mean accuracy (95% confidence interval = 68.6–78.8) is on par with AlphaFold’s (95% confidence interval = 69.7–79.2) and OpenFold does at least as well as the latter on exactly 50% of targets. Confidence intervals of each mean are estimated from 10,000 bootstrap samples.
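
The bootstrap estimate mentioned in this caption can be sketched as follows. This is a minimal illustration that assumes a percentile bootstrap, since the caption does not say which bootstrap variant was used. The function name `bootstrap_mean_ci` and the toy scores are hypothetical and are not data from the paper.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`.

    Assumption: a plain percentile bootstrap; the exact procedure used in
    the paper is not specified in the caption.
    """
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n values with replacement and record the resample mean.
        sample = rng.choices(values, k=n)
        means.append(statistics.fmean(sample))
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Toy GDT-TS-like scores (hypothetical, for illustration only).
scores = [82.1, 45.3, 91.0, 67.4, 73.8, 58.9, 88.2, 79.5, 62.0, 95.1]
low, high = bootstrap_mean_ci(scores)
print(f"mean = {statistics.fmean(scores):.1f}, 95% CI = ({low:.1f}, {high:.1f})")
```

Fixing the seed makes the interval reproducible; with only 90 targets, as in the caption, the interval is expected to be fairly wide.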

Extended Data Fig. 2 OpenFold learns decoy ranking slowly.

Decoy ranking results (mean Spearman correlation between pLDDT and decoy TM-score) using intermediate checkpoints of OpenFold on 28 randomly chosen proteins from the Rosetta decoy ranking dataset of ref. ¹⁵. See Supplementary Information section B.1 for more details.
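
The decoy ranking metric in this caption, a Spearman correlation per protein averaged over proteins, can be sketched as below. The data layout (`per_protein` mapping an id to two score lists) and all numbers are hypothetical; the paper's own evaluation code may differ.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def mean_decoy_ranking_score(per_protein):
    """Average the per-protein Spearman correlations.

    `per_protein` maps a protein id to (plddt_scores, tm_scores), with one
    entry per decoy. This layout is an assumption made for the sketch.
    """
    rhos = [spearman(p, t) for p, t in per_protein.values()]
    return sum(rhos) / len(rhos)


# Toy decoy sets (made-up numbers, not from the paper).
decoys = {
    "protein_a": ([55.0, 70.0, 62.0, 90.0], [0.40, 0.65, 0.50, 0.88]),
    "protein_b": ([80.0, 60.0, 75.0, 50.0], [0.70, 0.45, 0.72, 0.30]),
}
print(round(mean_decoy_ranking_score(decoys), 3))
```

A value near 1 means the confidence score orders decoys almost exactly as their true TM-scores do.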

Extended Data Fig. 3 Fine-tuning does not materially improve prediction accuracy on long proteins.

Mean lDDT-Cα over validation proteins with at least 500 residues as a function of fine-tuning step.

Extended Data Fig. 4 The ‘Mostly alpha’ CATH class contains some beta sheets, and vice versa.

Counts for alpha helices and beta sheets in the mostly alpha and mostly beta CATH class-stratified training sets from Fig. 2, based on 1,000 random samples. Counts are binned by size, defined as the number of residues for alpha helices and number of strands for beta sheets.

Extended Data Fig. 5 Reduced dataset diversity disproportionately affects global structure.

Mean GDT-TS and lDDT-Cα of non-overlapping protein fragments from CAMEO validation set as a function of the percentage of CATH clusters in elided training sets. Data for both topology and architecture elisions are included. The fragmenting procedure is the same as that described in Fig. 5a.

Extended Data Fig. 6 Early predictions crudely approximate lower-dimensional PCA projections.

(A) Mean dRMSD, as a function of training step, between low-dimensional PCA projections of predicted structures and the final 3D prediction at step 5,000 (denoted by *). Averages are computed over the CAMEO validation set. Insets show idealized behavior corresponding to unstaggered, simultaneous growth in all dimensions and perfectly staggered growth. Empirical training behavior more closely resembles the staggered scenario. (B) Low-dimensional projections as in (A) compared to projections of the final predicted structures at step 5,000. (C) Mean displacement, as a function of training step, of Cα atoms along the directions of their final structure’s PCA eigenvectors. Results are shown for two individual proteins (PDB accession codes 7DQ9_A ref. ⁶⁶ and 7RDT_A ref. ⁶⁷). Shaded regions correspond loosely to ‘1D,’ ‘2D,’ and ‘3D’ phases of dimensionality.
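
For readers unfamiliar with dRMSD, the sketch below uses the common definition: the root mean square difference between corresponding intra-structure pairwise distances. Because only distances are compared, no superposition is needed and the two coordinate sets may have different dimensionality. The PCA step is not reproduced; dropping one coordinate is a crude stand-in, and all coordinates are invented.

```python
import math
from itertools import combinations


def drmsd(coords_a, coords_b):
    """Distance-based RMSD between two structures with the same atom count.

    Compares all intra-structure pairwise distances, so the result is
    invariant to rotation and translation of either structure.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of atoms")
    sq_diffs = [
        (math.dist(coords_a[i], coords_a[j]) - math.dist(coords_b[i], coords_b[j])) ** 2
        for i, j in combinations(range(len(coords_a)), 2)
    ]
    return math.sqrt(sum(sq_diffs) / len(sq_diffs))


# Toy 3D "structure" and a 2D stand-in for a projection (z simply dropped;
# the caption refers to PCA projections, which are not computed here).
structure_3d = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (3.8, 3.8, 3.8)]
projection_2d = [(x, y) for x, y, _ in structure_3d]

print(drmsd(structure_3d, structure_3d))   # identical structures
print(drmsd(projection_2d, structure_3d))  # flattening loses the z extent
```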

Extended Data Fig. 7 Radius of gyration as an order parameter for learning protein phase structure.

Radii of gyration for proteins in the CAMEO validation set (orange) as a function of sequence length over training time, plotted on a log-log scale against experimental structures (blue). Legends show equations of best fit curves, computed using non-linear least squares. The training steps chosen correspond loosely to four phases of dimensional growth. See Supplementary Information section B.3 for extended discussion.
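
The quantities in this caption can be illustrated with a short sketch: an unweighted radius of gyration (one point per residue, equal masses assumed) and a power-law fit R_g = a·N^ν. Note that the fit below is an ordinary least-squares line on log-log values, a simpler substitute for the non-linear least squares reported in the caption. Function names and the toy chains are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of 3D points from their centroid (unweighted)."""
    n = len(coords)
    centroid = [sum(c[k] for c in coords) / n for k in range(3)]
    msd = sum(math.dist(c, centroid) ** 2 for c in coords) / n
    return math.sqrt(msd)


def fit_power_law(lengths, radii):
    """Fit R_g = a * N**nu by ordinary least squares on log-log values.

    Assumption: a log-log linear fit stands in for the non-linear least
    squares used in the paper; the two can give slightly different values.
    """
    xs = [math.log(n) for n in lengths]
    ys = [math.log(r) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    nu = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - nu * mx)
    return a, nu


# Toy check: fully extended chains (points on a line, 3.8 units apart)
# should scale almost linearly with length, i.e. nu close to 1.
lengths = [50, 100, 200, 400]
radii = [radius_of_gyration([(3.8 * i, 0.0, 0.0) for i in range(n)]) for n in lengths]
a, nu = fit_power_law(lengths, radii)
print(f"a = {a:.2f}, nu = {nu:.2f}")
```

A smaller fitted exponent indicates more compact structures, which is why the exponent is a convenient summary of how predictions change across training steps.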

Extended Data Fig. 8 Contact prediction for beta sheets at different ranges.

Binned contact F1 scores (8 Å threshold) for beta sheets of various widths as a function of training step at different residue-residue separation ranges (SMLR ≥ 6 residues apart; LR ≥ 24 residues apart, as in ref. ⁸). Sheet widths are weighted averages of sheet thread counts within each bin, as in Fig. 5b.
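
A contact F1 score of the kind described in this caption can be sketched as follows, using the 8 Å threshold and the sequence-separation cutoffs given above. The sketch assumes one representative atom per residue (for example Cα); which atom the paper uses, and how scores are binned by sheet width, is not reproduced here. The coordinates are invented.

```python
import math


def contact_pairs(coords, threshold=8.0, min_sep=6):
    """Residue pairs (i, j) with j - i >= min_sep that are closer than `threshold` Å."""
    n = len(coords)
    return {
        (i, j)
        for i in range(n)
        for j in range(i + min_sep, n)
        if math.dist(coords[i], coords[j]) < threshold
    }


def contact_f1(pred_coords, true_coords, threshold=8.0, min_sep=6):
    """F1 score of predicted contacts against reference contacts."""
    pred = contact_pairs(pred_coords, threshold, min_sep)
    true = contact_pairs(true_coords, threshold, min_sep)
    tp = len(pred & true)
    if tp == 0:
        return 0.0  # also covers the case where either contact set is empty
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


# Toy reference: two antiparallel strands of 8 residues, 4.8 Å apart.
hairpin = [(3.3 * i, 0.0, 0.0) for i in range(8)] + [
    (3.3 * (15 - i), 4.8, 0.0) for i in range(8, 16)
]
# Toy "early" prediction: a fully extended chain with no such contacts.
extended = [(3.3 * i, 0.0, 0.0) for i in range(16)]

print(contact_f1(hairpin, hairpin))              # prediction equals reference
print(contact_f1(extended, hairpin))             # extended chain vs hairpin
print(contact_f1(hairpin, hairpin, min_sep=24))  # LR range on a 16-residue chain
```

Passing `min_sep=6` corresponds to the SMLR range and `min_sep=24` to the LR range in the caption.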

Extended Data Table 1 Secondary structure recovery by class-stratified models

Supplementary information

Supplementary Information

Supplementary Discussion

Reporting Summary

Supplementary Video 1

Folding animation for PDB protein 7B3A, chain A. Predictions are from successive early checkpoints of an OpenFold model (training step is shown at the bottom left).

Supplementary Video 2

Folding animation for PDB protein 7DMF, chain A. Predictions are from successive early checkpoints of an OpenFold model (training step is shown at the bottom left).

Supplementary Video 3

Folding animation for PDB protein 7DQ9, chain A. Predictions are from successive early checkpoints of an OpenFold model (training step is shown at the bottom left).

Supplementary Video 4

Folding animation for PDB protein 7LBU, chain A. Predictions are from successive early checkpoints of an OpenFold model (training step is shown at the bottom left).

Supplementary Video 5

Folding animation for PDB protein 7RDT, chain A. Predictions are from successive early checkpoints of an OpenFold model (training step is shown at the bottom left).

Source data

Source Data Fig. 1

Numerical source data.

Source Data Fig. 2

Numerical source data.

Source Data Fig. 3

Numerical source data.

Source Data Fig. 4

Numerical source data.

Source Data Fig. 5

Numerical source data.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ahdritz, G., Bouatta, N., Floristean, C. et al. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02272-z

Received: 14 August 2023
Accepted: 03 April 2024
Published: 14 May 2024
DOI: https://doi.org/10.1038/s41592-024-02272-z
