Early in the pandemic, a group of computer scientists at IBM had an idea. What if generative AI could be used to design never-before-seen molecules to block SARS-CoV-2, the virus that causes COVID-19? It seemed impossible at the time, years before ChatGPT introduced the world to AI-generated ads, songs, and legal briefs.
Few people were more sceptical than Professor David Stuart, Professor of Structural Biology at NDM’s Division of Structural Biology and Director of Life Sciences at Diamond Light Source Ltd. Prof Stuart said: ‘The idea that you could take a protein sequence and, with AI, pluck out of thin air chemicals that would bind to a 3D site on the virus seemed very unlikely’.
Despite his doubts, Prof Stuart joined the IBM team. Over the course of three years, working with Enamine, a chemical supplier in Ukraine, and other researchers at Oxford, they would demonstrate that generative AI could, in fact, pluck viable antivirals out of thin air.
And because their generative model was also a foundation model, pre-trained on massive amounts of raw data, it was versatile enough to create new inhibitors for multiple protein targets without extra training or any knowledge of their 3D structures.
In the end, the team hit on four potential COVID-19 antivirals in a fraction of the time it would have taken had they used conventional methods. They describe their work in a new paper published in Science Advances today. IBM has also released a web-based interface for playing with the model and chemical foundation models like it in IBM Cloud.
The validated molecules have many more hurdles to clear, including clinical trials, before companies could potentially turn them into drugs. But even if the AI-generated “hits” never materialize into actual drugs, the work provides confirmation that generative AI has an important role to play in the future of drug development, especially in a time of crisis.
Dr Payel Das, Researcher at IBM Research, said: ‘It took time to develop and validate these methods, but now that we have a working pipeline in place, we can generate results much faster. When the next virus emerges, generative AI could be pivotal in the search for new treatments.’
Drug resistance and preparing for future threats
Developing new drugs is notoriously slow, often taking a decade or more. During the pandemic, an unprecedented collaboration between researchers in academia and industry worldwide brought new treatments to the market in record time. But an important factor that contributed to their success was the drugs themselves; most had already been approved for other uses and could be quickly repurposed for COVID-19.
‘In the future, new drugs may be required to tackle new viruses. Viruses mutate, and as they change shape, the drugs designed to block them become less effective. Some of the anti-COVID therapies developed early in the pandemic no longer work and it’s likely that as SARS-CoV-2 continues to mutate, it will become resistant to others’, said Prof Stuart.
Generative AI could provide an answer, with its ability to create molecules entirely new to nature. The researchers found that two of the AI-generated COVID antivirals bind to the virus’s spike protein in a distinctly new way. If developed into drugs, they could potentially complement some of today’s COVID antivirals in the same way that HIV today is treated with a cocktail of drugs targeting different receptors.
Though the researchers focused on validating antivirals for COVID, they argue that these methods can be extended to existing viruses that continue to mutate, like the flu, or viruses that have yet to surface. Prof Stuart said: ‘If you want to be prepared for the next pandemic, you want drugs that act on different sites of the protein. It becomes much harder for the virus to escape.’
How traditional drug discovery works
Typically, the drug discovery process starts by identifying a biological target, like a protein, that plays a key role in disease. Medicinal chemists then search for compounds that can bind to the target and disrupt its activity.
The hunt often begins with high-throughput screening, which involves filtering vast libraries of small, drug-like molecules for promising candidates deemed likely to bind to the target. Once hits are identified, the molecules are refined into more drug-like “leads” by making them more soluble and stable and removing any toxic ingredients.
Fewer than one in 100 compounds make it to the “hit” stage, and even fewer progress further. The odds are better with a newer technique known as fragment-based screening, which focuses on finding molecular pieces that are likely to bind to the target. When hits are found, often after extensive lab experiments, the fragments can be built into full-sized, drug-like molecules. Several anti-COVID compounds found this way are currently in pre-clinical trials, Prof Stuart said.
In the study, the researchers showed that the hit rate can be raised to as high as 50% by combining generative AI with retrosynthesis prediction, a way of automatically working out the chemical ingredients and reactions needed to manufacture a given molecule, and so of estimating its production cost.
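To put the two hit rates side by side, here is a small arithmetic sketch. It assumes that the 50% figure corresponds to the four validated antivirals out of the eight molecules that were synthesized, and it treats "fewer than one in 100" as an upper bound of 1%. The function and variable names are illustrative only.

```python
def hit_rate(hits, tested):
    """Fraction of tested compounds that turned out to be hits."""
    if tested <= 0:
        raise ValueError("tested must be a positive number of compounds")
    return hits / tested


# "Fewer than one in 100" compounds reach the hit stage, so 1% is an upper bound.
screening_upper_bound = hit_rate(1, 100)

# Assumption: four validated antivirals out of eight synthesized molecules.
study_rate = hit_rate(4, 8)

# How many times higher the study's rate is than the screening upper bound.
improvement = study_rate / screening_upper_bound

print(f"screening (upper bound): {screening_upper_bound:.0%}")
print(f"this study:              {study_rate:.0%}")
print(f"improvement factor:      at least {improvement:.0f}x")
```

Because the screening figure is an upper bound, the improvement factor computed here is a lower bound under the stated assumption.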
Controlled Generation of Molecules
The researchers built their model, Controlled Generation of Molecules (CogMol), on a generative AI architecture known as variational autoencoders, or VAEs. VAEs encode raw data into a compressed representation and then decode, or translate, it back into a statistical variation on the original sample.
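The encode, sample and decode steps of a VAE can be sketched in a few lines of plain Python. This is a toy illustration with untrained random weights, not CogMol itself: the dimensions, weight matrices and function names are all assumptions made for the example.

```python
import math
import random

INPUT_DIM = 8    # toy input size (hypothetical)
LATENT_DIM = 2   # toy size of the compressed representation (hypothetical)

rng = random.Random(0)


def random_matrix(rows, cols):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


W_MU = random_matrix(LATENT_DIM, INPUT_DIM)
W_LOGVAR = random_matrix(LATENT_DIM, INPUT_DIM)
W_DEC = random_matrix(INPUT_DIM, LATENT_DIM)


def encode(x):
    """Compress the input into the mean and log-variance of a latent Gaussian."""
    return matvec(W_MU, x), matvec(W_LOGVAR, x)


def sample_latent(mu, logvar):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def decode(z):
    """Map a latent vector back to input space; the sigmoid keeps values in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-a)) for a in matvec(W_DEC, z)]


x = [rng.random() for _ in range(INPUT_DIM)]   # stand-in for an encoded molecule
mu, logvar = encode(x)

# Two draws from the same latent distribution decode to two different outputs.
z_a = sample_latent(mu, logvar)
z_b = sample_latent(mu, logvar)
variation_a = decode(z_a)
variation_b = decode(z_b)
```

Sampling from the latent distribution, rather than decoding a single fixed point, is what makes each output a statistical variation on the input rather than a copy of it.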
They trained their model on a large dataset of molecules represented as strings of text, along with general information about proteins and their binding properties. But they deliberately left out information about SARS-CoV-2’s 3D structure or molecules known to bind to it. Their goal was to give their generative foundation model a broad base of knowledge so that it could be more easily deployed for molecular design tasks it has never seen before.
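As an illustration of what "molecules represented as strings of text" can look like, the sketch below uses SMILES, a widely used line notation for molecules. The article does not name the notation the team used, so SMILES is an assumption here, and the character-level encoding is a deliberately simple stand-in for whatever tokenization a real model would use.

```python
# Three small molecules written as SMILES strings.
molecules = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "acetic acid": "CC(=O)O",
}

# Build a character-level vocabulary from the strings.
# Simplification: multi-character atoms such as "Cl" would need a smarter tokenizer.
vocab = sorted({ch for smiles in molecules.values() for ch in smiles})
char_to_id = {ch: i for i, ch in enumerate(vocab)}


def encode_string(smiles):
    """Turn a molecule string into a list of integer token ids."""
    return [char_to_id[ch] for ch in smiles]


def decode_ids(ids):
    """Turn a list of token ids back into the molecule string."""
    return "".join(vocab[i] for i in ids)


for name, smiles in molecules.items():
    ids = encode_string(smiles)
    print(f"{name:12s} {smiles:10s} {ids}")
```

Once molecules are sequences of token ids, they can be fed to the same kinds of sequence models that are used for ordinary text.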
Just as foundation models have helped software developers and climate scientists to write code and analyze satellite images faster, the researchers hoped they could bring the same speed and versatility to drug design.
Their goal was to find drug-like molecules that would bind with two COVID protein targets: the spike, which transmits the virus to the host cell, and the main protease, which helps to spread it. Though the 3D structures of both proteins had been discovered by that time, the IBM researchers chose to use only their amino acid sequences, derived from the virus’s genetic sequence. By limiting themselves in this way, they hoped that the model could learn to generate molecules without knowing the shape of their target.
Fed just this amino acid sequence for each protein target, CogMol generated 875,000 candidate molecules in three days. To narrow the pool, the researchers ran the candidates through a retrosynthesis platform, IBM RXN for Chemistry, to understand what ingredients would be needed to synthesize the compounds. Based on the platform’s predicted recipes, they selected 100 molecules for each target. Chemists at Enamine further pared the list to four molecules for each target, selecting those deemed easiest to manufacture.
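The narrowing steps described above amount to a two-stage shortlist per target. The sketch below mimics that shape on a toy pool with made-up scores. The candidate names, the `predicted_cost` values and the `shortlist` helper are all hypothetical, and the second stage reuses the same score as a stand-in for the Enamine chemists' judgement, which the article does not describe in detail.

```python
import random


def shortlist(candidates, cost_of, keep):
    """Return the `keep` candidates with the lowest cost according to `cost_of`."""
    return sorted(candidates, key=cost_of)[:keep]


rng = random.Random(0)
targets = ["spike", "main protease"]

# Toy pool of 1,000 names per target; the study generated 875,000 candidates.
pool = {t: [f"{t}-candidate-{i}" for i in range(1000)] for t in targets}

# Hypothetical stand-in for a synthesis cost estimated from predicted recipes.
predicted_cost = {name: rng.random() for names in pool.values() for name in names}

# Stage 1: keep 100 molecules per target based on the predicted recipes.
stage_one = {t: shortlist(pool[t], predicted_cost.get, keep=100) for t in targets}

# Stage 2: keep 4 per target (done by chemists in the study; same score reused here).
stage_two = {t: shortlist(stage_one[t], predicted_cost.get, keep=4) for t in targets}

total_synthesized = sum(len(names) for names in stage_two.values())
print(total_synthesized)
```

With two targets and four molecules kept for each, the funnel ends with the eight molecules that were sent for synthesis.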
After synthesizing the eight novel molecules, Enamine shipped them to Oxford for testing. The final verdict came after bombarding the CogMol-identified molecules with X-rays at Diamond Light Source, the UK’s national particle accelerator research facility.
The novel compounds were further tested in target inhibition and live virus neutralization tests. Two of the validated antivirals target the main protease; the other two not only target the spike protein but also proved capable of neutralizing all six major COVID variants.
‘You get a map that shows exactly where things bind, and bang! you’ve got a confirmation’, said Prof Stuart, who is also Diamond’s life science director.
CogMol is one of several chemical foundation models that IBM has since developed. The largest, MoLFormer-XL, was trained on a database of more than 1.1 billion molecules and is currently being used by Moderna to design mRNA medicines.
Professor Jason Crain, Visiting Professor of Computational Biophysics at the University of Oxford and Research Leader at IBM Research, said: ‘We created valid antivirals using a generative foundation model that knew relatively little about its protein targets. I’m hopeful that these methods will allow us to create antivirals and other urgently needed compounds much faster and more inexpensively in the future.’