Unlocking the protein universe

This year’s Nobel Prize in Chemistry celebrates a breakthrough in predicting the structures of proteins, the molecules underlying the incredible functions of biology. This breakthrough immediately revolutionized research in biochemistry and medicine, including mine, but we are just scratching the surface of its potential. As is often the case, many independent scientists are following in the shadows cast by this flashy breakthrough, exploring the new frontier it unlocked.

In 2021, a team from Google DeepMind, led by new Nobel laureates John Jumper and Demis Hassabis, released AlphaFold2 [1]. Their program can predict protein structures with astonishing accuracy. The next year, their team released the AlphaFold Protein Structure Database, which contains predictions of over 200 million protein structures [2]. The massive scale of the database can be daunting to researchers, especially since these predictions often need further adjustments. The AlphaFold Database is a new frontier of biology, a sprawling Wild West of unedited computer outputs. That is, until last week, when a team of London scientists released The Encyclopedia of Domains [3], a map and guidebook for the frontier.

Nature’s creations truly come in all shapes and sizes, even down to the molecular level. Proteins in particular have incredibly diverse structures, which allow them to perform most of life’s essential functions. Above is one example of a protein structure determined by Yale scientists in 2005 [4]. While every protein is different, our pictures of them tend to have some features in common. Notably, we usually see compact, neatly folded regions called domains. These domains are major units of protein function and evolution. In fact, scientists often isolate and study individual domains in order to understand how the larger protein works.

However, many of the structures in AlphaFold’s database look like this:

Much less pretty. And harder for scientists like me to use. And, unfortunately, inevitable.

Naturally occurring protein sequences often contain regions without well-defined structures, which are invisible in the still pictures of proteins we can create in the lab. Since AlphaFold’s predictions are ultimately based on learning from those pictures, it doesn’t know what to do with unstructured regions. In the image above, blue colors indicate high confidence in the predicted structure, while oranges indicate low confidence. As you can see, AlphaFold tends to make long stretches of random spaghetti when it doesn’t know what to do. AlphaFold itself admits that the details shown in the spaghetti are meaningless. Trying to study a protein by looking at the orange parts of an AlphaFold structure would be like trying to divine the future from an actual plate of spaghetti.

In my experience, any application of the AlphaFold Database involves first removing low-confidence regions and trimming down to the domain of interest. I’ve done this manually. Dozens of times. Such is life on the frontier, hacking my way through the thicket of data.
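
For readers curious what that trimming looks like in practice, here is a minimal sketch in Python. It relies on the fact that AlphaFold's PDB files store each residue's confidence score (pLDDT, from 0 to 100) in the B-factor column. Everything else is an assumption made for illustration: the function names are hypothetical, the cutoff of 70 is a commonly used convention rather than a rule, and a run of confident residues is only a rough stand-in for a real domain. It is not the method used by any of the teams mentioned here.

```python
def residue_plddt(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor column of ATOM records."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            res_num = int(line[22:26])      # PDB columns 23-26: residue number
            scores[res_num] = float(line[60:66])  # PDB columns 61-66: B-factor (pLDDT here)
    return scores


def confident_segments(scores, min_plddt=70.0, min_length=1):
    """Return (start, end) residue ranges where every residue has pLDDT >= min_plddt."""
    segments = []
    current = []  # residue numbers in the run being built
    for res_num in sorted(scores):
        confident = scores[res_num] >= min_plddt
        contiguous = bool(current) and res_num == current[-1] + 1
        if confident and (contiguous or not current):
            current.append(res_num)
            continue
        # The run ended (low confidence or a numbering gap): keep it if long enough.
        if current and len(current) >= min_length:
            segments.append((current[0], current[-1]))
        current = [res_num] if confident else []
    if current and len(current) >= min_length:
        segments.append((current[0], current[-1]))
    return segments


def trim_pdb(pdb_text, min_plddt=70.0):
    """Drop ATOM records below the pLDDT cutoff; keep all other lines unchanged."""
    kept = [
        line
        for line in pdb_text.splitlines()
        if not line.startswith("ATOM") or float(line[60:66]) >= min_plddt
    ]
    return "\n".join(kept)
```

Calling `confident_segments(residue_plddt(text), min_length=30)` on the text of a downloaded prediction would list the confident stretches of at least 30 residues, which is roughly the starting point I pick through by hand.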

Andy Lau and Nicola Bordin, along with their colleagues and supervisors at University College London, solved this problem for everyone. They detected, annotated, analyzed, and categorized the domains in all 214 million proteins in the AlphaFold Database, resulting in The Encyclopedia of Domains. The encyclopedia contains 365 million protein domains, including 100 million that were undetected by previous studies. This includes thousands of protein folds that are completely distinct from any known before, and over 10,000 new kinds of interactions between proteins.

The scale and quality of Lau and Bordin’s work are truly remarkable. Their encyclopedia drastically increases the usefulness of an already amazing resource. We are living through a revolution in our understanding of life and our technology for studying it. This work by Lau and Bordin goes to show: big, flashy innovations may win Nobel Prizes, but the really useful stuff comes in their shadows.


Engage Science