From viruses to laws: deciphering data to predict the future

Anna Bernasconi

In recent years, the increasing availability of biological data has allowed us to transform our approach to addressing challenges in scientific research and public health. In particular, bioinformatics and data science are playing a crucial role in the study of infectious diseases, genomics and virology, and consequently also the development of innovative tools for the prevention of pandemics.

By analysing huge amounts of genetic data, scientists are able to identify viral mutations, evolutionary processes and recombination phenomena, which are key to understanding the spread and adaptation of pathogens. These developments not only improve our ability to respond to global health crises, but also open new perspectives for the use of data in interdisciplinary fields.

Anna Bernasconi, researcher at the Department of Electronics, Information and Bioengineering of the Politecnico di Milano, told us about her work on the integration of genomic data and the development of computational methods for the analysis of viral sequences, with the aim of contributing to the prevention of future pandemics.

When I started my three-year bachelor's degree, I initially studied Mathematical Engineering. Alessandro Campi, my professor of Computer Science, recommended that I read “Algorithmics: The Spirit of Computing” by Harel & Feldman. The book introduced me to the fascinating world of algorithms and their computational complexity, and led me to Computer Engineering.

After a period in the United States for my double major, during which I focused on algorithms and software systems, I decided to deepen my knowledge of the other side of computer science: data. I worked for a year and a half in consulting, building databases for Business Intelligence processes, essentially to support the analytics of large companies.

I realised that I would prefer to work for other types of “clients” and purposes, putting my computer skills at the service of research in the life sciences. So I went back to the university, just as Professor Stefano Ceri’s ambitious European project “Data-driven genomic computing” was beginning.

For a couple of years, just after the outbreak of the COVID-19 pandemic, we worked on computational methods and systems for viral sequences. When the Omicron variant arrived in late 2021, leading to the highest peak of ICU patients to date, some in the scientific community surmised that it could be a recombinant: a virus made up of traits from different viruses that had managed to combine within the cells of some immunocompromised patients.

This hypothesis piqued our interest, so we began to explore (or, as we like to say, “sniff”) the data to establish whether it could make sense, at least from a computational point of view. Omicron was later found not to have originated in this way at all, but we used the knowledge we had gained to develop a very general method which, given any viral genome, can determine whether it arose from a recombination event or from a standard evolutionary process.

It then took a year to formalise and validate the method (for both SARS-CoV-2 and monkeypox) and another year to see it published in Nature Communications!

Recombination is an uncommon evolutionary process in viruses. Viruses normally accumulate only a few mutations at a time, whereas recombination produces a sudden, potentially disruptive change that can combine dangerous characteristics of multiple organisms into a single one (e.g. increased virulence and the ability to evade vaccines or antiviral drugs).

Before our method, recombinations in virology research were proposed by researchers who discussed them on technical forums and analysed individual sequences by hand. Our method, by contrast, lets us check large numbers of sequences and produce precise answers very quickly. It also offers a potential tool for early warning systems, which can be very useful for preventing and controlling new pandemics (through so-called genomic surveillance).

As part of the SENSITIVE project, we are also collaborating with the University of Milan to develop monitoring and early warning techniques for new pandemics. We try to identify the precise viral traits that should raise an alert and that can be communicated to support informed decisions in the public health sector.

In the case of influenza, we are collaborating with the Experimental Zooprophylactic Institute of Venice, a point of reference in Europe for avian influenza, to identify markers indicating that a virus may be predisposed to a “spillover”: that is, to jump species, adapting from one host species to another (e.g., from avian species to wild species, and even to mammals and humans).

Nor have we stopped working on SARS-CoV-2, for which we are studying convergent mutations, i.e. mutations that recur at different times and in different places and so reflect the evolutionary preference of the virus. These are very interesting because they can be the starting point for designing (and cyclically updating) the new annual COVID vaccines.

My field of research makes it possible, using a common set of techniques, to explore many types of applications, even ones far from the life sciences (where I started out). I am currently involved in many other projects.

In the context of the European project TETYS (Topics Evolution That You See) we are implementing a Web tool for exploring topics of interest in scientific literature and their temporal evolution. At the same time, we are studying graph databases, which represent and store data in the form of a network, in which each point (node) represents an entity (a person, object or concept) and the lines connecting these points (arcs) represent the relationships between these entities.
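As a purely illustrative sketch of the node-and-arc idea (not the database or tooling used in these projects), a tiny graph can be held in memory in Python. The class name `TinyGraph`, the entity names and the relationship labels below are all invented for the example.

```python
from collections import defaultdict


class TinyGraph:
    """Toy in-memory graph: nodes are entities, arcs are labelled relationships."""

    def __init__(self):
        # Maps each source node to a list of (label, target) pairs.
        self.arcs = defaultdict(list)

    def add_arc(self, source, label, target):
        """Record a directed, labelled arc from source to target."""
        self.arcs[source].append((label, target))

    def neighbours(self, node, label=None):
        """Return the targets reachable from node, optionally filtered by label."""
        return [t for (l, t) in self.arcs.get(node, []) if label is None or l == label]


# Hypothetical entities: two people and a concept.
graph = TinyGraph()
graph.add_arc("Alice", "knows", "Bob")
graph.add_arc("Alice", "studies", "Genomics")
graph.add_arc("Bob", "studies", "Genomics")

alice_studies = graph.neighbours("Alice", "studies")
```

A real graph database adds persistence, indexing and a query language on top of this basic structure, but the underlying model of entities connected by named relationships is the same.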

We study these databases for a series of applications. The first is the analysis of Italian legislation, supporting the monitoring of laws to understand their evolution and complexity (our prototype received an award from the Chamber of Deputies, as part of its call for expressions of interest for proposals on the use of generative artificial intelligence).

A second application is the production of association rules to build recommendation mechanisms in large knowledge graphs (which describe, for example, online shopping networks or social network connections).
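To give a rough idea of what an association rule measures, the sketch below computes the two standard quantities, support and confidence, over a handful of made-up shopping baskets. It works on plain sets rather than on a knowledge graph, and the function names and data are assumptions chosen for illustration, not the method used in the project.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate how often the consequent appears when the antecedent does."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    joint = support(transactions, set(antecedent) | set(consequent))
    return joint / base


# Hypothetical shopping baskets.
baskets = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"laptop", "bag"},
    {"mouse"},
]

# Confidence of the rule "laptop -> mouse": 2 of the 3 laptop baskets contain a mouse.
rule_conf = confidence(baskets, {"laptop"}, {"mouse"})
```

A rule with high support and high confidence is a candidate recommendation ("customers who bought a laptop often also bought a mouse").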

A third application is the design of an explorable map of Large Language Models (to help users select the most appropriate ones for their tasks) and of a system for exploring Causal Loop Diagrams (to support Systemic Designers in analysing complex systems and identifying the levers for resolving them).

The field of life sciences, in which different computational techniques can be applied, is still the one I am most interested in. I am taking part in a European project that aims to create a system for the distributed analysis of hospital data on various rare diseases. The integration and analysis of clinical and genomic data is certainly central to my studies and will continue to be so.

As for data not of human origin, we started with the study of SARS-CoV-2 but the possibilities of application are manifold throughout the world of pathogens. This is the direction that fascinates me the most, also because of the opportunities for taking part in interdisciplinary research with experts in molecular biology, virology and medicine, whose ideas are always of great interest to me.

Another area that fascinates me is that of user interfaces; all the research I have talked about has high scientific potential in this regard. I am particularly interested in how information and results can be presented to end users (stakeholders) through web applications that allow them to explore the data and insights.
