Researchers at the University of British Columbia have found a way to rapidly discover new strains of RNA viruses, using the Amazon Web Services (AWS) Cloud Innovation Centre to analyze raw DNA/RNA sequence data from public sequencing databases.
This could help recognize infections faster, preventing future pandemics.
Artem Babaian, the lead researcher behind Project Serratus, holds a PhD in medical genetics and is currently working on his post-doc. Before the pandemic he was conducting research into cancer genetics, and he was already working with AWS on that project.
When the pandemic hit in March 2020, Babaian and his friend Jeff Taylor, a UBC engineering student and co-author of the Serratus research paper published last week in the journal Nature, came up with a “back of the napkin sketch for a computing architecture on AWS that we wanted to implement.”
The duo soon realized that the plan could be used not just in cancer genetics research, but also to identify coronaviruses in public databases.
The database, called the Sequence Read Archive (SRA), holds global genomics research data from published papers over the last 13 to 14 years.
“So the data is just rapidly exploding…the biggest limitation to looking at the global community data is how fast can you read it,” Babaian said.
Luckily, AWS and the National Institutes of Health (NIH) had started a giant infrastructure project called STRIDES, and had copied all the data to AWS S3, a service for cloud object storage, where it can be read quickly.
Cloud Innovation Centre
Using this database, Babaian and his team decided they would work with the UBC Cloud Innovation Centre to discover unknown coronaviruses and perhaps head off another pandemic.
The Cloud Innovation Centre is a global program at AWS that works with academic institutions for collaborative projects. The UBC Cloud Innovation Centre launched the Open Virome project, a global initiative led by Babaian, that aims to avoid future pandemics by identifying hundreds of thousands of previously undiscovered viruses.
“He (Babaian) had this hypothesis that by looking at this very large data set that had been recently made available as open source, so available to anybody on AWS, could we potentially build some form of architecture that would allow them to start to analyze this data set and see if we could actually start to discover new strains of coronavirus,” Coral Kennett from the Cloud Innovation Centre said.
Expanding the project
According to Babaian, the group wanted to expand its scope even more, so rather than focusing on just coronaviruses, they decided to search all RNA viruses.
They searched for a gene found in all RNA viruses, called RNA-dependent RNA polymerase (RdRP). The public database revealed that there were 15,000 known RNA viruses.
“That’s essentially the core of what our project is. We took 15,000 RNA viruses, that is everything that was publicly described by the world community in the last 100 years and in 11 days, we re-analyzed 5.7 million data sets and discovered over 130,000 new species of RNA virus. So that’s almost a log increase in the amount of RNA viruses that we found,” he said.
“Trying to analyze this massive, massive data set using traditional computing would not have been possible,” Kennett noted. It could have taken 2,000 years with a regular computer, and over a year with standard high-performance computing. The Serratus infrastructure did all the work in 11 days and cost $24,000 in cloud computing credits, provided by AWS.
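To put the quoted figures in perspective, the short Python sketch below works through the arithmetic. It is illustrative only: the variable and function names are hypothetical, the new-species count is treated as exactly 130,000 even though the article says "over 130,000," and a year is taken to be 365.25 days.

```python
import math

# Figures quoted in the article (treated as exact for this sketch)
KNOWN_SPECIES = 15_000            # RNA viruses described before the project
NEW_SPECIES = 130_000             # "over 130,000", so this is a lower bound
DATASETS = 5_700_000              # data sets re-analyzed
DAYS = 11                         # wall-clock time on the Serratus infrastructure
COST_USD = 24_000                 # cloud computing credits
SINGLE_COMPUTER_YEARS = 2_000     # estimate for a regular computer


def summarize():
    """Derive ratios from the quoted figures (assumption: 365.25-day year)."""
    # Total species after the project, relative to what was known before.
    fold = (KNOWN_SPECIES + NEW_SPECIES) / KNOWN_SPECIES
    return {
        "fold_increase": fold,
        # A value of 1.0 would be a full order of magnitude ("a log").
        "log10_increase": math.log10(fold),
        "datasets_per_second": DATASETS / (DAYS * 24 * 3600),
        "cost_per_dataset_usd": COST_USD / DATASETS,
        "speedup_vs_single_computer": SINGLE_COMPUTER_YEARS * 365.25 / DAYS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.4f}")
```

Under those assumptions the numbers come out to roughly a 9.7-fold increase in known RNA viruses (just under one order of magnitude, consistent with Babaian's "almost a log increase"), about six data sets processed per second, less than half a cent per data set, and a speedup of roughly 66,000 times over the single-computer estimate.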
In addition, the researchers were then able to zero in on novel coronaviruses, discovering nine new varieties in aquatic species such as axolotls, seahorses and pipefish.
The Cloud Innovation Centre was able to support the research group with experts from AWS. Kennett said they were able to bring in people from their genomics team, experts from the bioinformatics team, and also people from the Spot Instance team, which helps with cost efficiencies.
The overall goal of Project Serratus is to create a free scientific resource of coronavirus sequences which could help predict secondary pandemics and improve the accuracy of current PCR tests.
“You have to be able to describe what the problem is and that’s where we’re focusing, very early on with the transmission of different viruses across different species. That’s the network that I’m working to build and understand. And I have the data… there’s probably like a trillion dollars worth of sequencing data sitting there, waiting to be analyzed and made sense of, and now we can do it with this AWS architecture,” Babaian said.