Longevity briefs provides a short summary of novel research in biology, medicine, or biotechnology that caught the attention of our researchers in Oxford, due to its potential to improve our health, wellbeing, and longevity.
Why is this research important: Proteins are biology’s functional molecules – they are responsible for contracting our muscles, allowing our neurons to fire and much more. Each protein is made from a sequence of amino acids dictated by the genetic code, and this sequence determines the shape and function of the protein.
Scientists can engineer new proteins for medical purposes. This is usually done by first identifying an existing protein with a desired property. By fusing this with parts from other proteins, or by tinkering with the amino acid sequence, scientists can create a new protein with a desired function, such as a particularly potent antibody or an anti-inflammatory molecule. But what if we could simply design entirely new proteins from scratch to do exactly what we need them to do? This is already possible to some extent, but machine learning promises to revolutionise this process, enabling us to rapidly design many new proteins with a desired function.
What did the researchers do: In this study, researchers tested whether a deep learning language model called ProGen could design functional proteins. Just as natural language models like GPT-3 can generate grammatically and semantically correct sentences on a desired topic, ProGen is designed to generate amino acid sequences for proteins that are biologically functional for a desired task. ProGen was trained on 280 million sequences from over 19,000 protein families, each of which was ‘tagged’ with its specific protein properties. Researchers then asked ProGen to generate new proteins with the properties of five families of lysozyme, a type of antimicrobial enzyme that is found in various secretions, including saliva, and is also important for the cell’s waste disposal system.
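For readers who like to see the idea in code, here is a deliberately tiny sketch of what "generating a sequence conditioned on a tag" can look like. It is not ProGen and not the authors' method: ProGen is a large neural network, whereas this toy simply counts which amino acid tends to follow which within each tagged family and then samples new sequences one residue at a time. The family names, the example sequences, and the function names `train` and `generate` are all made up for illustration.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical markers for sequence start and end


def train(tagged_sequences):
    """Count, per tag, how often each residue follows the previous one."""
    counts = defaultdict(lambda: defaultdict(int))
    for tag, seq in tagged_sequences:
        prev = START
        for residue in seq + END:
            counts[(tag, prev)][residue] += 1
            prev = residue
    return counts


def generate(counts, tag, rng, max_len=50):
    """Sample one sequence residue by residue, conditioned on the tag."""
    prev, out = START, []
    while len(out) < max_len:
        options = counts.get((tag, prev))
        if not options:  # unknown tag or context: nothing to sample from
            break
        residues = list(options)
        weights = [options[r] for r in residues]
        nxt = rng.choices(residues, weights=weights)[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


# Made-up toy "families"; these are NOT real lysozyme sequences.
toy_data = [
    ("family_A", "MKALIVLGL"),
    ("family_A", "MKALLVAGL"),
    ("family_B", "MSTNPKPQR"),
    ("family_B", "MSTQPKPNR"),
]

model = train(toy_data)
rng = random.Random(0)  # fixed seed so repeated runs give the same samples
sample_a = generate(model, "family_A", rng)
sample_b = generate(model, "family_B", rng)
print("family_A:", sample_a)
print("family_B:", sample_b)
```

Changing the tag changes which statistics the sampler draws on, so the same code produces sequences in the style of a different family. A real protein language model replaces the simple counts with a neural network that looks at the whole sequence so far, but the tag-then-sample loop is the same basic idea.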
Key takeaway(s) from this research: ProGen generated about one million artificial sequences from across these five lysozyme families. While these sequences were diverse, ProGen captured the evolutionarily conserved patterns associated with each lysozyme family. Researchers then identified the highest quality sequences and synthesised genes that would encode the artificial proteins that ProGen had designed. Most of these genes produced working proteins, and those proteins proved to be about as effective as a natural protein when it came to attacking bacterial cell walls.
It shouldn’t really come as a surprise that the techniques behind natural language models work well for protein design. After all, the blueprint for proteins is DNA, and DNA is the language of life. Since this model was trained on existing natural proteins, it can’t be expected to invent proteins for a task that nature has not already solved. That’s probably still some way off. However, it can greatly expand our libraries of existing proteins, giving us much more material to work with when it comes to optimising what we already have.
Large language models generate functional protein sequences across diverse families: https://doi.org/10.1038/s41587-022-01618-2
Title image by ANIRUDH, Unsplash