AlphaFold-based databases and fully-fledged, easy-to-use AlphaFold interfaces poised to revolutionize biology

Not only computational but also experimental biology. Thoughts on the future of data science niches in biology.

In a recent story I covered the release of the academic paper describing AlphaFold’s version 2 and its source code, and I showed you how scientists around the world were starting to apply the program to their favorite proteins through Google Colab notebooks, for free and without any hardware needs. These notebooks are rapidly evolving to enable more features, allowing anybody to model not only isolated proteins but also complexes of multiple proteins, and including known structures of related proteins and multiple sequence alignments to improve the program’s results. Moreover, Deepmind and the European Bioinformatics Institute started to upload AlphaFold-calculated models for “all” proteins, already having covered 20 full organisms and available for free download. Scientists trying the program and the database of models report on Twitter several success stories that anticipate how these and related technologies will disrupt the area of structural biology. Not only computational but also experimental structural biology, as the predicted models facilitate experimental determination of protein structures.

In this story published last week I covered the formal release of the details of AlphaFold 2, the CASP14-winning program for structure prediction developed by Google’s Deepmind, in a peer-reviewed article in the journal Nature, and of its code on GitHub. I also showed you how, despite the huge size of this model, its complex dependencies on other libraries and its hardware needs, scientists around the world were already running the program online thanks to Google Colab notebooks developed by some very kind scientists. But events unfold very fast these days, and a lot of exciting news has come to light since my previous article.

After a quick recap of what proteins are, why biologists are interested in knowing their structures, how they can be experimentally determined or predicted by computers, and how AlphaFold 2 works, I expand on the breaking news: enhanced Colab notebooks that expose AlphaFold’s most advanced features, a growing database of free models precomputed by AlphaFold 2 for a major fraction of all proteins known from sequenced genomes, successful applications that are already happening, and outcomes of “experiments” testing the limits of the program. Last, I discuss what all this entails for the future of biology and what niches of structural biology and bioinformatics will likely flourish as a result of these new technologies brought about by AlphaFold and all the previous academic work that led Deepmind to master it.

Background: Proteins, why biologists are interested in knowing their structures, how structures can be determined through experiments or predicted by computer programs, and how AlphaFold 2 works

In a nutshell, proteins are linear chains of multiple amino acids, each of which consists of a constant unit of 4 non-hydrogen atoms plus a sidechain of variable size, ranging from none to around 20 atoms. The amino acids are connected through the constant unit, called the backbone, to form a polypeptide that does not remain random but rather acquires one or more arrangements in space. That is, proteins fold into 3D structures. What exact structure a protein will adopt in 3D depends essentially on the identity of the amino acid sidechains, i.e. on its amino acid sequence. Very briefly, and simplifying definitions that are considerably more complex, amino acid sequences are encoded by genes; the collection of genes of an organism is its genome; and the collection of proteins encoded in a genome is the proteome.

To be more precise, and this will be important later, the polypeptide can actually fold into multiple substructures, each of which is called a domain. (In principle it is these domains, not necessarily whole proteins, that AlphaFold has mastered, because that is mainly what CASP keeps track of.) Moreover, certain proteins or regions of proteins do not actually fold into well-defined 3D structures, instead remaining “disordered”. Disordered regions can be short linkers connecting well-folded domains, or quite long stretches that may (most times) or may not have some biological relevance; besides, some proteins are totally, “intrinsically” disordered. I know all this goes a bit beyond the classical definition of proteins and protein structures for non-biologists, but it will be important for discussions later on in the article. Predicting not only the structures of proteins but also their disordered regions and how proteins move is key to modern structural bioinformatics.

Examples of proteins with different extents of disorder and ordered domains, and what one should expect current programs like AlphaFold to predict reasonably well. Image by author.

Why do biologists want to know the structures of proteins? As briefly touched upon in the introduction to my other story, knowing the structures of proteins allows advancing biotechnology and pharma. Drugs are typically small molecules that bind to specific pockets in protein structures, modulating their structures with a positive physiological consequence. For example, a small molecule can target a protein that controls how cells divide in order to attack a cancer. Another small molecule may interfere with an essential bacterial protein, thus killing the bacterium. And the list goes on. Knowing a protein’s structure also helps us to understand how it performs its function, so we can then change it (by introducing mutations in the encoding gene) to adapt it for use in, say, a biotechnological process such as fermentation or oil degumming.

We can easily sequence genes and whole genomes, but passing from amino acid sequences to actual 3D structures is not straightforward. In the best case, when biologists want to know the structure of a new protein, they can check if other proteins of similar sequence have their structure solved (the Protein Data Bank is a free database where academics deposit and find all experimentally determined structures). If there is no known structure that can be used to reliably model the new protein by homology, then two main options remain: either attempting experimental determination of the new protein’s structure, or applying a prediction method that does not rely on homology to known structures. Experimental structure determination is in most cases tedious, expensive and labor-intensive, and very often fails. There are three main techniques to solve protein structures experimentally: X-ray diffraction of protein crystals, which requires your protein to produce well-diffracting crystals; Nuclear Magnetic Resonance spectroscopy, which has severe limitations in tractable size and solution conditions; and cryo-electron microscopy, which is developing very rapidly but is still largely limited to rather large, well-defined proteins or complexes, and which for many proteins doesn’t yet produce atomic resolution but just some blobs of atom densities. On the other hand, predicting or “modeling” protein structures without any homologous proteins of known structure is (or kind of “was”) an extremely hard problem, which has now become easier thanks not only to AlphaFold but also to several technologies that preceded it.

The field of predicting protein structures without known structures for related proteins is what CASP has been tracking for over a quarter of a century. You can see in my story from last week how these predictions were rather bad for a long time, until methods for detecting contacts between pairs of amino acids were introduced that helped to guide the folding of protein models. These methods essentially exploit alignments of sequences similar to the one under study, looking for pairs of amino acids that change together and inferring when the co-variation reflects the two amino acids being in contact in the 3D structure. This happened around CASP11 and CASP12, and then for CASP13 some academic groups and also Deepmind rerouted similar alignment-based analyses through machine learning models to predict not only contacts but also distances and orientations between residues, which resulted in better constraints for folding protein models. Then in CASP14 many groups pushed this a bit further, gaining some prediction capability, but AlphaFold 2 mastered the problem through a couple of novel ideas. While the details are in their paper, to me its most interesting ingredients are (i) the novel way in which they treat the input sequence alignments and the structures known for related proteins; (ii) the fact that they represent the protein folding problem within the network, i.e. they don’t use external folding as all academics were doing by CASP14 and even AlphaFold 1 was doing by CASP13; and (iii) the fact that everything from sequence or alignment input to 3D model output flows in a single, huge, end-to-end differentiable network.

How protein structure induces correlations between amino acids during evolution, from which modeling programs can infer contacts to drive folding, or even more complex geometrical features, based on knowledge about protein structures learned from large databases like the Protein Data Bank. Image by author.
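To make the co-variation idea concrete, here is a minimal sketch that scores pairs of alignment columns by their mutual information on a tiny, made-up alignment. This is only an illustration of the principle: the function names and the toy sequences are my own assumptions, and real contact predictors use far more sophisticated statistics (and, more recently, neural networks) that correct for effects this naive score ignores.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in freq_ab.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((freq_a[a] / n) * (freq_b[b] / n)))
    return mi


def top_covarying_pairs(msa, top=3):
    """Return the `top` column pairs as (score, i, j), highest score first."""
    columns = list(zip(*msa))  # transpose rows (sequences) into columns
    scores = [
        (mutual_information(columns[i], columns[j]), i, j)
        for i, j in combinations(range(len(columns)), 2)
    ]
    scores.sort(reverse=True)
    return scores[:top]


# Hypothetical toy alignment: columns 1 and 4 swap charge together
# (K paired with E, or E paired with K), as two residues in contact might.
toy_msa = [
    "AKLGEV",
    "AELGKV",
    "AKLGEV",
    "AELGKI",
    "AKMGEV",
    "AELGKV",
]

best_score, col_i, col_j = top_covarying_pairs(toy_msa, top=1)[0]
print(f"columns {col_i} and {col_j} co-vary most (MI = {best_score:.2f} bits)")
```

In this toy case the two charge-swapping columns stand out clearly, while fully conserved columns score zero because they carry no co-variation signal at all.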

Enhanced Colab notebooks that expose AlphaFold’s features needed to get the most of it: multiple sequence alignments, known structures of related proteins, and oligomerization states.

As I discussed in my previous story, the AlphaFold 2 model is huge, and many researchers feared that they would never have at hand the hardware resources needed to run it. However, less than a week after release some cool, very kind scientists put up Google Colab notebooks where anybody with a Google account could run AlphaFold on their favorite protein sequence. The early notebooks were quite simple, allowing only the modeling of individual protein chains, which after all is what AlphaFold was designed to achieve and was mainly tested on in the most popular track of CASP. But quickly, scientists started to add more of AlphaFold’s features into the notebooks, such that users can now run essentially any protein with full control over options and inputs.

The main additions were two that are very important for AlphaFold (and any other modern structure prediction program) to perform well. One is the possibility of calculating a multiple sequence alignment from the input sequence, to be fed into the program so that it can extract structural information from it. CASP12 showed that proteins for which more sequences could be found were on average modeled better. CASP13 showed the same trend but also found that programs could work with fewer sequences. CASP14 showed that programs could work with even fewer sequences, but they (AlphaFold 2 included) still needed them for high-quality predictions. In the early Google Colab notebooks users ran the program with single input sequences and no alignments at all. This most often resulted in mid-quality to poor models, as indicated by the LDDT plots (which predict the expected quality of the model at each amino acid). Users could also provide their own alignments, but good alignments for structure prediction have some special requirements. The new Colab notebooks take this into account; moreover, they exploit some ad hoc methods that have been optimized for this, and search for protein sequences in multiple databases that add up to over 20 million sequences.

The second very important addition is the possibility of fishing out “templates”, i.e. experimental structures of proteins that probably share structural features with the protein one wants to model, and passing these templates to AlphaFold. This is of course very helpful for modeling, to the extent that until recently, modeling based on this kind of homology was the only procedure that guaranteed some success. The higher the similarity between template and target, the better. But of course for many targets, there are no good structures that can be used as templates.

The best Colab notebook I have found is this one by Sergey Ovchinnikov and Martin Steinegger: https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb

Besides the options for building sequence alignments and using templates, this notebook includes two additional interesting features: the possibility of doing a final refinement through molecular simulations, and the possibility of modeling multiple copies of the protein together. The former feature is important to optimize small problems, and makes sense only when the models look already quite reliable. The option for modeling multiple copies of the protein together makes sense when you know or suspect that your protein might actually be oligomeric, as I describe in the next section.

To complete this short description of this Colab notebook, it delivers 5 models and a plot of predicted quality estimates, all easily downloadable. Although there might be some improvements and a few further extensions to this notebook, I think the main new interesting features will come up from the different tests that scientists are doing to test AlphaFold 2’s limits.

The Google Colab notebook at https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb Image by author.

Pushing the limits: AlphaFold is so good that it can do a few things that it probably wasn’t even designed to do, but it also gets stuck at some long-standing problems

Scientists quickly began to test the limits of the program by forcing it to make predictions that in principle it was not much expected to accomplish well. Surprisingly, some of the outcomes of these tests were quite positive!

First, already known from the CASP14 assessment (which I remind you runs blindly and independently of the people doing structure predictions), AlphaFold could not only fold correctly the individual domains that make up proteins but also their relative arrangements. That is, it could get whole structures right, at least for well-structured proteins for which large alignments could be built and for which some experimental structures were available for similar proteins. The CASP14 assessment also found that AlphaFold 2 can not only model correctly the backbone of the protein, but also the amino acid side chains. Historically, the problem was so difficult that CASP used to evaluate essentially only the backbone, but by CASP13 when I was an assessor it became obvious that the top programs (including AlphaFold 1 and some academic programs) were so good at modeling the backbone that side chains should be considered too. Now for most target proteins AlphaFold 2 modeled the sidechains very well.

Next, and this was reported by several scientists running AlphaFold 2 themselves, the program seems to have made a huge advance in predicting the complexes formed by two or more proteins. In biology, such complexes are essential to transmit information between proteins, and in many cases the truly functional units are assemblies of multiple proteins that do not perform any function if isolated. Some actually aren’t even stable without their partners. Protein complexes come in two main flavors: the so-called homo-oligomers, which consist of multiple copies of the same molecule, and hetero-oligomers, which consist of two or more different proteins. CASP has a specific track dedicated to complexes, which has historically shown that this remains a very difficult problem, on which many are working, of course. And there is even another contest specific for protein complex prediction, called CAPRI, where it wouldn’t surprise me to soon see Deepmind competing too.

More specifically, what scientists have discovered is that if they input two concatenated protein sequences into AlphaFold 2 then there is a good chance that it will return quite reasonable complexes. This is by no means trivial, and has been under study for some years, for two reasons. One is that concatenating proteins means concatenating them throughout the whole input multiple sequence alignment, and this is by itself no easy task. The second is that coevolution signals between pairs of proteins do exist, but they are usually weaker and, for the case of homo-oligomers, inter-molecular signals (which contain information about the complex formed by the two proteins) are very hard to disentangle from intra-molecular signals (which contain information about how each protein folds).
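To illustrate why concatenating alignments is not trivial, here is a deliberately simplified sketch of one possible pairing heuristic: keep only the organisms present in both alignments and join their rows. The function name, the dictionary layout and the toy sequences are assumptions made for illustration; this is not necessarily what the Colab notebooks do, and real cases are harder because one organism can carry several related copies of each protein.

```python
def pair_alignments(msa_a, msa_b):
    """Concatenate rows of two alignments that share an organism label.

    Each alignment is a dict mapping organism label -> aligned sequence.
    Rows without a partner in the other alignment are dropped.
    """
    shared = [organism for organism in msa_a if organism in msa_b]
    return {organism: msa_a[organism] + msa_b[organism] for organism in shared}


# Hypothetical toy alignments for two proteins, A and B.
msa_a = {"human": "MKT-LV", "mouse": "MKSALV", "yeast": "MRT-LI"}
msa_b = {"human": "GGHW", "mouse": "GAHW", "ecoli": "GGHF"}

paired = pair_alignments(msa_a, msa_b)
for organism, row in paired.items():
    print(organism, row)
```

Note how the paired alignment is shallower than either input: every unmatched row is lost, which is one reason why inter-protein coevolution signals tend to be weaker than intra-protein ones.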

All that I just told you remains anecdotal at the moment, reported only on Twitter and not peer-reviewed, but more and more cases of successful protein complex predictions keep showing up. I am sure that right now some groups are benchmarking this further at a larger scale, and that we will soon see peer-reviewed work assessing AlphaFold’s capability to predict complexes.

Example of two different proteins forming a complex, that is, a heterodimer. Image by author.

What structural features can AlphaFold 2 not yet model? That’s of course also important. First, it can model quite well the short disordered loops that connect structured pieces within well-folded domains, or that connect separate domains. But it cannot model the longer loops at all, so it’s better to just try predictions leaving them out, especially if they are terminal. After all, long disordered regions are very easy to identify from the sequence alone. The good thing is that LDDT plots predict very well that these loops are inaccurate. Always pay attention to LDDT plots.
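Beyond looking at the plots, the per-residue estimates can also be read programmatically. The sketch below assumes that the model file stores the predicted LDDT in the B-factor column of the PDB ATOM records, as AlphaFold-produced model files do to my knowledge; verify this for your own files. The helper names, the toy scores and the cutoff of 70 are illustrative assumptions, not values taken from the notebooks.

```python
def make_ca_line(serial, res_name, res_seq, score):
    """Build a fixed-width PDB ATOM line for a CA atom (toy input only)."""
    return ("ATOM  %5d  CA  %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f           C"
            % (serial, res_name, res_seq, 0.0, 0.0, 0.0, 1.0, score))


def read_plddt(pdb_text):
    """Map residue number -> value in the B-factor column, one per CA atom."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence_segments(scores, cutoff=70.0):
    """Return (first, last) residue numbers of runs scoring below `cutoff`.

    Assumes contiguous residue numbering; the cutoff is an adjustable guess.
    """
    segments = []
    start = prev = None
    for res in sorted(scores):
        if scores[res] < cutoff:
            if start is None:
                start = res
            prev = res
        elif start is not None:
            segments.append((start, prev))
            start = None
    if start is not None:
        segments.append((start, prev))
    return segments


# Hypothetical scores: floppy termini flanking a confident core.
toy_scores = [35.2, 41.0, 88.5, 92.1, 95.0, 90.3, 48.7, 30.1]
toy_pdb = "\n".join(
    make_ca_line(i, "ALA", i, s) for i, s in enumerate(toy_scores, start=1)
)

print(low_confidence_segments(read_plddt(toy_pdb)))
```

Segments flagged this way are natural candidates for trimming before a new run, especially when they sit at the termini.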

Talking about LDDT, it is important to bear in mind that it is a quite local estimate of model quality. This is very important because two regions distant in sequence may have high LDDT scores, indicating a likely good quality around these amino acids, yet be totally off in their relative orientations and distances. In my own tests I saw this especially happening when an oligomeric protein is modeled as a monomer. If the program is told that the protein is monomeric, it will of course try to satisfy all contacts and short distance restraints by deforming the monomer. The overall shape of the model will thus be wrong, even though the LDDT estimate at each amino acid is quite high. A new run indicating how many copies of the protein must be included in the prediction will hopefully resolve that problem.
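The locality of LDDT can be shown with a toy calculation. The sketch below computes a simplified, one-point-per-residue LDDT of a model against a reference, using the usual 15 Å inclusion radius and 0.5, 1, 2 and 4 Å tolerance thresholds. Note the assumptions: this is the actual score against a known reference (AlphaFold reports a prediction of it), the two “domains” are artificial grids of points, and the RMSD is computed without re-superposing the structures, i.e. in the frame of the first domain.

```python
import math
from itertools import product


def lddt_per_residue(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified per-residue LDDT using one point (e.g. CA) per residue."""
    scores = []
    for i in range(len(ref)):
        preserved = total = 0
        for j in range(len(ref)):
            if i == j:
                continue
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= radius:
                continue  # only local neighbours in the reference count
            d_model = math.dist(model[i], model[j])
            for t in thresholds:
                total += 1
                if abs(d_ref - d_model) < t:
                    preserved += 1
        scores.append(preserved / total if total else 0.0)
    return scores


def grid_domain(origin, spacing=3.8, n=3):
    """Toy compact 'domain': an n x n x n grid of points."""
    ox, oy, oz = origin
    return [(ox + spacing * i, oy + spacing * j, oz + spacing * k)
            for i, j, k in product(range(n), repeat=3)]


domain_a = grid_domain((0.0, 0.0, 0.0))
# Reference: second domain 30 Å away along x. Model: same domain, but
# swung to 30 Å along y, i.e. a wrong relative arrangement.
ref = domain_a + grid_domain((30.0, 0.0, 0.0))
model = domain_a + grid_domain((0.0, 30.0, 0.0))

scores = lddt_per_residue(ref, model)
rmsd = math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(ref, model)) / len(ref))
print(f"lowest per-residue LDDT: {min(scores):.2f}, RMSD: {rmsd:.1f} Å")
```

Every residue gets a perfect local score because each domain is internally intact and no inter-domain pair falls within the inclusion radius, yet the overall arrangement is off by tens of ångströms.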

Another problem I have seen is with proteins that are mostly water-soluble but contain elements that traverse membranes. Membranes are laminar arrangements of lipids, which do not like water so they self-assemble into, ehm, membranes, to hide themselves from water. Some proteins exist only integrated into membranes; these are mostly modeled quite well. Others fulfill their roles in solution; these are also mostly modeled correctly. But others are soluble yet have small elements, usually helical, that insert into membranes. In the tests I did, such helices tend to give AlphaFold problems, apparently because it attempts to pack them against the folded part. Somehow this is not all wrong, because that is likely what this element would do if the protein were in solution without any membranes to bind to! Reality is actually more complicated, because such a scenario usually renders the protein intractable, so experimentalists cannot solve the full structures of these types of proteins; rather, in these cases they only solve the structures of the soluble domains.

And one more problem I have seen reported and seen myself is a quite strong bias towards 3D arrangements that are available in the Protein Data Bank. This is perfectly OK, because AlphaFold was trained on this database, and every cautious, knowledgeable user will probably be aware of this limitation. What is the problem, exactly? Well, many proteins bind smaller molecules or even elemental ions as part of their normal function, and for this they adopt different structures. Very often, the structure that has the small molecule or ion bound is more stable, so this is the one that ends up in the Protein Data Bank. The free form may be studied with other experimental methods that do not give atomic detail but often do reveal substantial differences between the small molecule-bound and free states. When you then train a model to predict structures, it will most likely predict the bound state because that is what it knows, even if you provide just a sequence and do not indicate that any ion or small molecule is bound. In fact neither AlphaFold nor any other structure prediction program provides input fields for “relevant small molecules that bind”. Likewise, they do not consider any other source of structural variability if the multiple options are not reflected in the Protein Data Bank. A typical example of all this, which I saw people discussing on Twitter for AlphaFold 2, is proteins that bind metal ions like, say, copper or zinc: the amino acids involved in binding the ions are usually only structured in the bound form, but floppy in the unbound forms. A user was surprised that AlphaFold was predicting the ion-bound form even though they had not provided any information about bound ions. That’s perfectly expected, because for this kind of protein the Protein Data Bank is vastly dominated by ion-bound forms, as ion-free forms are often difficult to characterize.

The summary for this section on limitations is clear: users still need to know their chemistry and biology, and use them to interpret what a new model is telling them, what parts can be trusted, and why it is telling what it is telling. That’s why, as with any other protein model, it is essential to look at the quality estimates and the templates and alignments used by the program, of course always keeping in mind what’s known about the protein.

Deepmind and the European Bioinformatics Institute joined forces to compute protein models for “all” proteins with AlphaFold 2. Models for 20 organisms are already available for free download.

The latest news for the community of biologists was that Deepmind and the European Bioinformatics Institute joined efforts to attempt modeling all proteins of all known genomes. This means around 20 million structural models, of which they have already released 350,000 from human and 20 other species. Having precomputed models saves time and removes the need for expertise to run the program (although the Colab notebooks already make it very easy!).

The models are accessible for free at https://alphafold.ebi.ac.uk/. For the organisms whose proteomes have been processed, there are direct download links at https://alphafold.ebi.ac.uk/download. Users can search the database through a variety of keywords, from free text to protein identifiers widely used in biology. But to me the best tool is one that allows you to search models by comparing amino acid sequences at https://www.ebi.ac.uk/Tools/sss/fasta/. Why? Because your exact protein of interest may not have been modeled, but there might be a model for a highly similar one, that you can use to easily model your protein by homology.
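When the sequence search returns a hit that is not your exact protein, a quick way to judge how “highly similar” it is consists of computing the percent identity over the aligned positions. Below is a minimal sketch; the function name and the toy aligned sequences are my own, and note that different tools use different denominators for identity, so the numbers may not match exactly what a search server reports.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped positions
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical pairwise alignment between a query and a database hit.
query = "MKT-LVGA"
hit = "MKSALVGA"
print(f"{percent_identity(query, hit):.1f}% identical over aligned positions")
```

The closer the hit, the more of its model can be transferred to your protein with confidence; regions around gaps and mismatches deserve a closer look.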

One important note about the models provided by this database is that they are all built assuming monomeric states, and I have seen this causing problems in many proteins that I know are dimeric, trimeric, etc. AlphaFold has no way to know (or guess) this. Perhaps it would be good for Deepmind and EBI to provide models assuming different oligomerization states? For the moment, you had better run calculations for oligomers to complement what EBI offers. Likewise, as shown above, the runs are biased to produce models that resemble what is already there in the Protein Data Bank; this means that if you input the sequence of a protein that is only stable for experimental characterization with a ligand bound, you will most likely get a model that corresponds to this form, even though strictly speaking you wanted to model the protein alone. On finishing this article I found that EBI does stress all these limitations, in its note at https://www.ebi.ac.uk/about/news/opinion/alphafold-potential-impacts

Left: the main AlphaFold-EBI portal at https://alphafold.ebi.ac.uk/. Right: The tool to search AlphaFold’s models through protein sequence matches at https://www.ebi.ac.uk/Tools/sss/fasta/ Image by author.

Some concrete successful applications of AlphaFold 2 and related technologies

The wave of structure prediction methods using alignments and templates processed through machine learning methods had already given a couple of practical surprises in earlier editions of CASP, but AlphaFold 2 monopolized these cases in CASP14. Namely, the models themselves helped to complete the interpretation of experimental data that was waiting to be used to finish an experimental structure determination, in a procedure called molecular replacement where draft models are used to interpret X-ray diffraction data. Once that was done, the model (and others) could finally be assessed; of course this one was quite close to the experimental structure.

Similar reports have been around outside of CASP for a few years. For example, a structure resolved in this recent work through experimental methods relied substantially on partial models built using programs that academics wrote, which in the process laid the foundations for AlphaFold 1, some of them actually not much different from it.

Another field that has already benefitted from this kind of model, and probably the one that will benefit the most from AlphaFold 2, is that of cryo-electron microscopy. In this technique, the main experimental data obtained is essentially a 3D map of electron densities. When the resolution is high enough, special programs (making a long process short, but of course it actually takes a lot of human intervention) can fit amino acids inside and in this way obtain the experimental structures. But high-resolution maps are relatively rare compared to the amount of data produced. Thus, for the larger number of proteins inspected by cryo-electron microscopy it is very hard to thread the amino acid sequence to determine the structure. But with high-quality models from programs like AlphaFold one can (talking fast and superficially) simply fit the model inside the electron density map to “validate” it and, in the best case, distort it a bit to better match the experimental data. This is nothing new; it is done quite routinely, but it was so far limited to proteins that were easy to model. Now, AlphaFold 2 will allow applying this same trick to almost all proteins. Moreover, cryo-electron microscopy usually works best for assemblies of several proteins, so AlphaFold’s purported capacity to also model them will be useful if confirmed.

And these were only applications to structural biology, i.e. the part of biology that deals with structural details at the atomic level. But having access to models will also directly impact other areas of biology such as cell biology, where even coarse models of protein structures are often enough to interpret the results of experiments and to design new ones.

The future of biology and data science niches within biology, and impacts on machine learning science itself

The direct implications of AlphaFold and similar programs are of course the improved ways to model protein structures. As I showed with some examples above, this is by no means the end of experimental structural biology, because none of these programs can get all details sufficiently right and because they are inherently biased towards what is available in the Protein Data Bank. On the contrary, in the last section I made the point that the models coming out of these programs are of immense help in experimental structure solving. This means that AlphaFold and other modeling technologies will by no means replace structural biologists, but rather make their job easier and let them focus on more complicated aspects and systems.

The early 2000s had several genome projects under the spotlight, as they promised to “crack the code of life”. Despite the many advancements that DNA sequencing and whole genomes brought, from the point of view of structural biology it was obvious that having these genomes available was not enough. The era of solving structures soon started, but this was far less efficient and more expensive than sequencing DNA. Structural genomics consortia were set up to prioritize which proteins were most important to solve structures for, trying to speed up the rate of completion of the Protein Data Bank. Now, the possibility of modeling proteins quite accurately (given that sufficient information is available, and within all the limits discussed in this article plus some others for sure!) makes the dream of structural coverage of the proteome much more feasible. Of course, this fulfilment is due in part to the work done during the structural genomics projects, which filled the Protein Data Bank with structural data for programs to learn from. And also to the millions of protein sequences obtained throughout years of DNA sequencing, which now allow for the data-rich alignments required by protein modeling programs.

The impact on structural biology is not only on predicting the structures of proteins and assisting their experimental determination, but also on designing them, even from scratch. Neural networks that predict protein structures from intermediate predictions of contacts, distances and orientations were recently repurposed into iterative methods that optimize protein sequences to obtain a given protein structure. Protein design is a field in itself; I recommend this review to learn about different methods, including those based on neural networks.

The other big impact the Deepmind work will likely have is much broader, and has to do with all the technologies developed and put together into AlphaFold 2. As I described already, this is not an improvement of AlphaFold 1 but rather a whole redesign that involves many smaller (yet big!) new tools for the machine learning and data science communities. Taken from their paper, these new developments include:

A new architecture to jointly treat sequence alignments and pairwise features, which could also have applications in other problems dealing with text.
A new output representation of the 3D models and a new loss function that together enable full end-to-end structure prediction, which could be adapted to various other problems by reshaping them into end-to-end forms too.
A new attention architecture that assists how different portions of information that are key at different stages of the prediction problem move through the network.
The use of intermediate losses to achieve iterative refinement of predictions, that help the program to gradually improve the models internally as it runs, at different levels of resolution.
Masked MSA loss that gets trained jointly with structure, which allows it to perform well even when the alignment does not have too many sequences.
Learning from protein sequences that had no experimental structures but could be modeled reliably. This I guess was risky, but clearly worked out.
Self-estimates of accuracy that are computed within the same network that predicts the structures. Estimates of accuracy should benefit applications of neural networks in all domains!

Each of these new “small” inventions and improvements to existing components of neural networks is of potential use in other problems of biology and, more broadly, of computer science. Deepmind said they are working on improving prediction of protein complexes and also on small molecule binding, both key, still open problems in biology. And they are already working on many other areas of biology and clinical practice, for example automatically detecting tumors in medical images.

Academics are more excited than ever to apply these and new ideas to other problems. Niches of chemistry and biology beyond protein structure prediction include predicting the effects of mutations on disease, automated interpretation of human-written texts for annotation into databases, automated image analyses to detect, delimit, extract and identify objects, better automation of chemical calculations, and many more. We can hence expect that the technologies used to build AlphaFold will likely impact, and inspire impacts, on many fields of science and engineering in the near future.

Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

via WordPress https://ramseyelbasheer.io/2021/07/26/alphafold-based-databases-and-fully-fledged-easy-to-use-alphafold-interfaces-poised-to/
