Protein folding in the cell often relies on the help of chaperonins, naturally occurring cellular nanomachines that fold many critical cellular proteins in all human and animal cells. Knowledge of protein folding is important because proteins must assume the correct 3-D structure to function properly.
Different classes of chaperones work together in elaborate cooperative networks and also ensure that potentially damaging misfolded polypeptides are cleared from the cell. Such misfolded proteins would otherwise cause a cascade of cellular damage and ultimately lead to widespread cell death. The term “proteomics” refers to the large-scale, comprehensive study of a specific proteome, the full set of proteins produced from a genome. It covers protein abundances, variations and modifications, and interacting partners and networks, with the aim of understanding the cellular processes involved.
Hundreds of enzymes depend on fully functioning chaperones. As our population grows older, an increasing socio-economic burden stems from a class of diseases caused by protein misfolding and protein aggregation. Millions of Americans suffer from the most common of these: Alzheimer’s disease, Parkinson’s disease, and Huntington’s disease. In addition to such neurodegenerative diseases, folding defects play important roles in stroke, various types of cancer, and cataract formation. In all of these diseases, protein chains that are required for healthy cell and organ function can misfold and ultimately aggregate into toxic fibers and large complexes.
Despite recent technological advances in proteomics, comprehensively characterizing an entire proteome remains a challenge. The difficulty lies in a proteome’s greater complexity compared with its genome, and it argues for the continued development of technologies and platforms. For example:
- One gene can encode more than one protein. The human genome contains about 21,000 protein-encoding genes, but the total number of proteins in human cells is estimated to be between 250,000 and one million.
- Proteins are dynamic and spatial. Proteins are continually undergoing changes, e.g., binding to the cell membrane, partnering with other proteins to form complexes, or undergoing synthesis and degradation. The genome, on the other hand, is relatively static.
- Proteins are co- and post-translationally modified. The types of proteins measured can vary considerably from person to person under different environmental conditions, or even within the same person at different ages or health status. Additionally, certain modifications such as phosphorylation are highly dynamic while regulating the function of their respective proteins.
- Proteins exist in a wide range of concentrations in the body. In blood, for instance, the concentration of albumin is more than a billion times greater than that of interleukin-6, which makes it extremely difficult to detect low-abundance proteins in a complex biological matrix. Scientists believe that the most important proteins, such as p53 in cancer, may be those found at the lowest concentrations.
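To put rough numbers on two of the points above, the short Python sketch below works through the arithmetic. The gene and protein counts are the ones quoted in the list. The albumin and interleukin-6 concentrations (about 40 mg/mL and 5 pg/mL) are illustrative assumptions chosen only to show the scale involved; they are not values taken from this article, and real measurements vary between individuals and conditions.

```python
import math

# Figures quoted in the list above.
PROTEIN_CODING_GENES = 21_000
PROTEIN_ESTIMATE_LOW = 250_000
PROTEIN_ESTIMATE_HIGH = 1_000_000

# Illustrative blood concentrations in grams per litre.
# These two values are assumptions for the sake of the example.
ALBUMIN_G_PER_L = 40.0  # about 40 mg/mL
IL6_G_PER_L = 5e-9      # about 5 pg/mL


def proteins_per_gene(protein_count: int, gene_count: int) -> float:
    """Average number of distinct proteins per protein-encoding gene."""
    return protein_count / gene_count


def orders_of_magnitude(high: float, low: float) -> float:
    """Dynamic range between two concentrations, in powers of ten."""
    return math.log10(high / low)


low_ratio = proteins_per_gene(PROTEIN_ESTIMATE_LOW, PROTEIN_CODING_GENES)
high_ratio = proteins_per_gene(PROTEIN_ESTIMATE_HIGH, PROTEIN_CODING_GENES)
fold_difference = ALBUMIN_G_PER_L / IL6_G_PER_L
dynamic_range = orders_of_magnitude(ALBUMIN_G_PER_L, IL6_G_PER_L)

print(f"Proteins per gene: roughly {low_ratio:.0f} to {high_ratio:.0f}")
print(f"Albumin vs. interleukin-6: {fold_difference:.1e}-fold "
      f"({dynamic_range:.1f} orders of magnitude)")
```

Under these assumed concentrations the gap spans nearly ten orders of magnitude, which is why a single measurement technique rarely covers both ends of the range.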
Figuring out what shapes proteins fold into is known as the “protein folding problem.”
A project called AlphaFold entered CASP13 (Critical Assessment of Structure Prediction) in 2018 and achieved the highest accuracy among participants. Afterwards, the team published a paper on its CASP13 methods in Nature with associated code, which has gone on to inspire other work and community-developed open-source implementations. New deep learning architectures have since driven changes in the methods used for CASP14, enabling the team to achieve unparalleled levels of protein folding accuracy. These methods draw inspiration from the fields of biology, physics, and machine learning, as well as, of course, the work of many scientists in the protein folding field over the past half-century.
A folded protein can be thought of as a “spatial graph”, where residues are the nodes and edges connect the residues in close proximity. This graph is important for understanding the physical interactions within proteins, as well as their evolutionary history. For the latest version of AlphaFold, used at CASP14, the team created an attention-based neural network system, trained end-to-end, that attempts to interpret the structure of this graph, while reasoning over the implicit graph that it’s building. It uses evolutionarily related sequences, multiple sequence alignment (MSA), and a representation of amino acid residue pairs to refine this graph.
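The “spatial graph” idea can be illustrated with a few lines of Python. The sketch below is not AlphaFold's method, which learns its graph with a neural network. It only shows the underlying data structure: given one 3-D coordinate per residue, it links any two residues that lie within a distance cutoff. The function name `build_contact_graph`, the toy coordinates, and the cutoff values are all assumptions made for this example.

```python
import math
from itertools import combinations
from typing import Dict, List, Set, Tuple

Coordinate = Tuple[float, float, float]


def build_contact_graph(
    coords: List[Coordinate], cutoff: float = 8.0
) -> Dict[int, Set[int]]:
    """Return an adjacency map linking residues within `cutoff` angstroms.

    `coords` holds one representative point per residue (for example, the
    position of one backbone atom). Nodes are residue indices, and an edge
    means the two residues are close together in space.
    """
    graph: Dict[int, Set[int]] = {i: set() for i in range(len(coords))}
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Toy five-residue chain that bends back on itself (made-up coordinates).
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (7.6, 3.8, 0.0),
    (3.8, 3.8, 0.0),
]

contacts = build_contact_graph(toy_coords, cutoff=6.0)
for residue, neighbours in contacts.items():
    print(residue, sorted(neighbours))
```

In the toy chain, residues 0 and 4 are far apart in the sequence but close in space, so they share an edge. Contacts of this kind, between residues that are distant along the chain, are what make the graph informative about the fold.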
“It’s a very substantial advance,” says Mohammed AlQuraishi, a systems biologist at Columbia University who has developed his own software for predicting protein structure. “It’s something I simply didn’t expect to happen nearly this rapidly. It’s shocking, in a way.” Quote source: Technology Review.