Computational infectious disease modelling is the attempt to approximate the real-world biological processes of pathogen transmission, control, and evolution using mathematical and/or simulation-based techniques. In this review, we provide an overview of three distinct branches of disease modelling and consider the methods, approaches, and challenges within them. First, we explore the difficulties of modelling the problem of antimicrobial resistance (AMR). Correctly understanding the biological mechanisms driving AMR is highly complex and involves many pathogens and demographies, which makes accurately predicting changes in the prevalence of AMR difficult to achieve using models. Nevertheless, the outputs of these AMR prevalence models are fed into further models that try to predict the long-term health and economic impacts of AMR at a global scale.

We next consider the challenges of estimating how vaccines will alter subsequent pathogen evolution, focusing on the bacterial colonisers of the upper respiratory tract, Neisseria meningitidis and Streptococcus pneumoniae. The modelling in this section predicts the dynamics of competing bacterial strains that result from vaccinating human populations. Finally, we consider models of within-host immune response, primarily for COVID-19. Traditionally, one difficulty in modelling viral or antibody kinetics was a lack of high-quality data. However, for COVID-19, and an increasing number of other pathogens, modellers have access to more and better data than ever before. New approaches to modelling antibody kinetics must be developed for modelling data of a quality and quantity that was previously thought unobtainable.

These three examples were chosen to give a sufficient variety in the spatial and temporal scope of the modelling work. We cover changes in antibody levels within one person over a matter of months, changes in the bacterial strains within a vaccinated population in the years following a vaccination campaign, right through to global estimates of the impact of worsening AMR over the next few decades. Despite their differing scopes and methodologies, we will try to identify broad trends and challenges shared across these three fields of modelling. Where possible, we will also link our discussion in each of these examples to the COVID-19 pandemic, where computational modelling was used extensively and brought infectious disease modelling to the forefront of scientific and public consciousness.

The original emphasis of modelling was on using mathematical analysis tools to understand the qualitative behaviour of a “model,” defined by a system of equations. Knowing how the behaviour of a model changes over time for certain combinations of parameter values can lead to useful qualitative insights regarding the real-world biological system, insights that might not be obvious without the use of models as an explanatory tool. In contrast, we think that contemporary modelling places greater emphasis on statistical and computational machinery, allowing the available data to guide the form of the equations within the model. The aim is to use the ability of the model to predict the data as evidence to accept or reject different model structures, each ideally corresponding to different hypotheses about the biological system being studied.

To give an example of the changes in the approach to infectious disease modelling over time, we will briefly turn to malaria modelling. An early and influential malaria transmission model was the Ross-MacDonald model that was in development from the 1950s to the 1970s. The Ross-MacDonald model began with modellers trying to inscribe their existing theories of the process of malaria transmission in the structure of mathematical equations. From a few basic assumptions of how transmission works, an equation was derived for the reproduction number (R0, the average number of secondary infections for each new infection). Despite the fact that this model was not fitted to data on malaria cases, it provided considerable insights into strategies that might control malaria transmission.

In the Ross-MacDonald model, R0 turns out to have a linear relationship to all model parameters, except for the biting rate on humans, which has a quadratic effect, and mosquito survival, which has an approximately cubic effect (Smith et al, 2012). Therefore, reducing the biting rate and mosquito survival has a larger impact on reducing R0 than changing the other parameters. This model-derived insight supported early malaria eradication efforts through indoor spraying with the insecticide DDT. It is also the rationale behind current malaria prevention tools: insecticide-treated bed nets that both kill mosquitoes and provide a physical barrier preventing them from biting people. This result has had a positive effect on malaria control to date. However, because it is derived from the mathematical structure of the model, it is only applicable if the model is a “good enough” approximation of real-world malaria transmission, and this concern motivated the move over time towards a more data-led contemporary modelling approach. A contemporary example of malaria transmission modelling considered multiple structural forms for the model equations, choosing between them based on their ability to best explain the data from experimental settings and long-term malaria incidence trends over multiple countries (White et al, 2011).
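The sensitivity argument above can be checked numerically. The sketch below uses the textbook form of the Ross-Macdonald reproduction number, R0 = m a^2 b c exp(-g n) / (r g), and additionally assumes that mosquito density is sustained by a constant emergence rate, so that m = lam / g. That assumption, the symbol names, and the parameter values are illustrative choices, not details taken from Smith et al (2012). The code estimates the elasticity of R0 with respect to each parameter, that is, the percentage change in R0 per percentage change in the parameter.

```python
import math

def r0(lam, a, b, c, g, n, r):
    """Textbook Ross-Macdonald R0 with mosquito density m = lam / g (assumed).

    lam: mosquito emergence per human per day   a: bites on humans per mosquito per day
    b, c: per-bite transmission probabilities   g: mosquito death rate per day
    n: parasite incubation period in the mosquito (days)
    r: human recovery rate per day
    """
    m = lam / g  # assumption of this sketch: equilibrium mosquitoes per human
    return m * a**2 * b * c * math.exp(-g * n) / (r * g)

def elasticity(name, params, step=1e-6):
    """Finite-difference estimate of d log(R0) / d log(parameter)."""
    bumped = dict(params)
    bumped[name] = params[name] * (1 + step)
    return (math.log(r0(**bumped)) - math.log(r0(**params))) / math.log(1 + step)

# Hypothetical values, chosen only so that g * n = 1.
params = dict(lam=1.0, a=0.3, b=0.5, c=0.5, g=0.1, n=10.0, r=0.01)
elasticities = {name: elasticity(name, params) for name in params}
for name, value in elasticities.items():
    print(f"{name}: {value:+.2f}")
```

Under these assumptions the elasticity is 1 for lam, b, and c, 2 for the biting rate a, and -(2 + g n) for the mosquito death rate g, which is -3 for the values used here. A different choice of g and n would move the last figure away from exactly cubic, which is consistent with the "approximately cubic" wording above.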

Returning to modelling more generally, contemporary modelling practice usually involves some form of data-based model selection from a set of biologically plausible candidate models. The model fit is evaluated predominantly on the model’s ability to predict the values of the same data that it is fitted to, combined with a penalty for model complexity that aims to prevent “overfitting” the model to finite data. This sort of model selection only measures predictive power relative to the other candidate models; the best model of the set may still predict poorly. Two different modelling approaches emerge here: the “scientific” approach, which is more concerned with linking model development and model fitting to answering scientific questions, and the “pragmatic” approach, which searches for the model with the best objective predictive accuracy for use in practical decision making (Navarro, 2019). The two approaches may not always select the same “best” model. Which approach is followed should depend on what the model is being developed for and what sort of questions it is being designed to answer.
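As one concrete instance of fit combined with a complexity penalty, the sketch below compares two candidate models using the Akaike information criterion (AIC). AIC is only one of several possible criteria and is chosen here for illustration. The data are synthetic, and the two candidates (a constant mean and a linear trend) are hypothetical stand-ins for biologically motivated model structures.

```python
import math
import random

def rss_constant(y):
    """Residual sum of squares for a constant-mean model."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)

def rss_linear(x, y):
    """Residual sum of squares for an ordinary least-squares straight line."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    intercept = my - slope * mx
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

def aic_gaussian(rss, n, k):
    """AIC for Gaussian errors, dropping constants shared by all candidates.

    k counts every fitted parameter, including the error variance.
    """
    return n * math.log(rss / n) + 2 * k

# Synthetic data with a genuine trend (illustrative only).
random.seed(1)
x = list(range(30))
y = [2.0 + 0.5 * t + random.gauss(0, 1.0) for t in x]

candidates = {
    "constant": aic_gaussian(rss_constant(y), len(y), k=2),
    "linear": aic_gaussian(rss_linear(x, y), len(y), k=3),
}
best = min(candidates, key=candidates.get)
print(candidates, "selected:", best)
```

The selected model is only the best of the two candidates supplied. If neither structure resembled the process that generated the data, the lower AIC would still identify a "winner", which is the limitation noted above.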

Most infectious disease models split the population into separate compartments that represent the different states of disease (susceptible, infected, recovered, etc.). The flow of people between the compartments can be deterministic, typically represented by a system of coupled differential (or difference) equations; or stochastic, represented as a set of rules describing the probability that individuals move between compartments over time. Model complexity can vary depending on the detail of available data and the structure required to model a given problem. Often models further split the population by age, spatial structure, or behaviour.
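The deterministic and stochastic formulations described above can be contrasted with a minimal susceptible-infected-recovered (SIR) sketch. The deterministic version integrates the usual differential equations with a simple Euler step. The stochastic version is a daily chain-binomial model in which each person moves between compartments with a probability derived from the same rates. The parameter values and function names are illustrative assumptions.

```python
import math
import random

def sir_deterministic(s, i, r, beta, gamma, days, dt=0.1):
    """Euler integration of dS/dt = -beta*S*I/N and dI/dt = beta*S*I/N - gamma*I."""
    n = s + i + r
    for _ in range(round(days / dt)):
        new_inf = beta * s * i / n * dt
        new_rec = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return s, i, r

def binomial(n, p, rng):
    """Number of successes in n independent trials with probability p."""
    return sum(rng.random() < p for _ in range(n))

def sir_stochastic(s, i, r, beta, gamma, days, rng):
    """Daily chain-binomial SIR: whole numbers of people move at random."""
    n = s + i + r
    for _ in range(days):
        p_inf = 1 - math.exp(-beta * i / n)  # daily infection probability
        p_rec = 1 - math.exp(-gamma)         # daily recovery probability
        new_inf = binomial(s, p_inf, rng)
        new_rec = binomial(i, p_rec, rng)
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
    return s, i, r

# Hypothetical outbreak: 1000 people, 10 initially infected.
rng = random.Random(42)
det = sir_deterministic(990, 10, 0, beta=0.3, gamma=0.1, days=100)
sto = sir_stochastic(990, 10, 0, beta=0.3, gamma=0.1, days=100, rng=rng)
print("deterministic:", [round(v, 1) for v in det])
print("stochastic:   ", sto)
```

Repeating the stochastic run with different seeds gives a distribution of outcomes, including occasional early extinction of the outbreak, whereas the deterministic version always returns the same trajectory.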

Alternatively, individual-based models simulate individuals following rules describing the probability that they transition between disease states. Here, each individual’s specific disease state is tracked, rather than the total number of individuals within each disease state as in compartmental models. This approach can provide more granular estimates than compartmental modelling, especially if individual-level data are available to parameterise the model accurately. Because of the great interest in SARS-CoV-2, and the resulting colossal data collection, a number of recent studies modelled the spread of SARS-CoV-2 using detailed individual-level and/or household-level data. For example, Ferretti et al (2023) arrived at time-dependent estimates of the probability that an individual would be infected with SARS-CoV-2 after they were exposed, a key quantity for parameterising other models of COVID-19 transmission. We now turn to our case studies, before discussing the common threads alluded to across the studies.
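To make the contrast with compartmental models concrete, the following is a minimal individual-based sketch in which every person has their own recorded state and day of infection. It is not the method of any study cited here. The random-mixing contact rule, the function name, and all parameter values are hypothetical.

```python
import random

def run_ibm(n_people, n_seed, p_transmit, p_recover, contacts_per_day, days, rng):
    """Track each person's state ("S", "I" or "R") and the day they were infected."""
    state = ["S"] * n_people
    infected_on = {}
    for idx in rng.sample(range(n_people), n_seed):
        state[idx] = "I"
        infected_on[idx] = 0
    for day in range(1, days + 1):
        # People infectious at the start of the day; new cases transmit from tomorrow.
        infectious = [idx for idx, st in enumerate(state) if st == "I"]
        for idx in infectious:
            for _ in range(contacts_per_day):
                other = rng.randrange(n_people)  # assumed: uniform random mixing
                if state[other] == "S" and rng.random() < p_transmit:
                    state[other] = "I"
                    infected_on[other] = day
            if rng.random() < p_recover:
                state[idx] = "R"
    return state, infected_on

# Illustrative run with made-up parameter values.
rng = random.Random(7)
state, infected_on = run_ibm(
    n_people=500, n_seed=5, p_transmit=0.05, p_recover=0.1,
    contacts_per_day=5, days=60, rng=rng,
)
print({label: state.count(label) for label in "SIR"})
```

Because `infected_on` records an event for each individual, outputs such as the epidemic curve or the timing of infections can be read off directly. That individual-level record is what makes this style of model a natural fit for individual-level or household-level data.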

Modelling of AMR

AMR refers to the general problem of microbial pathogens which are able to withstand treatment with antimicrobials. By convention, AMR is often used to refer to the problems of resistance to antibiotics in bacteria, but it should be noted that the term can also be used to encompass resistance in fungi (antifungal resistance) and viruses (antiviral resistance). AMR modelling is more disparate than for other pathogen threats (e.g., viral epidemics) because it is a diffuse cross-pathogen threat. For our purposes, we can treat modelling of AMR as falling into three broad areas: calculating the levels of AMR, explaining why we observe those levels, and informing us how to reduce them. Rather than aiming to be comprehensive, we first introduce the general features of AMR before giving some clear case studies for each area.

The full complexity of AMR is beyond the scope of this review, but it is a diverse set of phenomena driven by diverse biological mechanisms (Darby et al, 2023). Depending on the question, modelling of AMR pathogen threats may not need to engage with the genetically determined complexity of AMR. For example, in November 2016, an outbreak in the Sindh region of Pakistan of typhoid fever due to an extensively drug-resistant (XDR) form of Salmonella Typhi caused global concern (Klemm et al, 2018). XDR Typhi quickly became the dominant cause of typhoid fever in Pakistan: from no cases in 2017 to 50% of cases in 2019 (Nizamuddin et al, 2021). The presence of XDR Typhi could be determined from its phenotypic resistance profile and so it could be modelled as a new and separate pathogen to “regular” Typhi. The underlying biological causes of the resistance (a combination of new resistance genes, including on a plasmid, as well as mutations in a chromosomal gene) did not need to be incorporated into models of its spread. A modelling analysis assessed the global risk of further outbreaks of XDR Typhi using air travel data in combination with reported cases, finding that countries with more passengers arriving from Pakistan were far more likely to have cases (Walker et al, 2023). This analysis highlighted the probable existence of unreported cases in countries with high air traffic with Pakistan (Saudi Arabia, Turkey, and Malaysia) as well as countries at high risk of XDR Typhi outbreaks. Afghanistan was judged at high risk of XDR Typhi outbreaks given its already high incidence of typhoid cases and high connectivity to Pakistan.

This analysis had clear public health implications, but a common reason for developing computational models of AMR is to better understand the drivers of resistance to help us work out how to reduce it. The fact that AMR is inherently an ecological problem makes this more challenging than modelling a single pathogen with an “SIR”-type model of transmission. To take an important example: Escherichia coli is a diverse species, with subtypes including common gut commensal strains, but also phenotypically quite different strains that cause opportunistic extraintestinal infections. Both subtypes may be either sensitive or resistant to a given antibiotic. Resistance genes can be carried and exchanged between both commensal and pathogenic strains—or even with other bacterial species—and resistance to one antibiotic may be correlated with resistance to another. The boundaries of what needs to be included in a model to accurately capture the system are unclear. Not only that, but the underlying data quality is often poor because of a bias towards sequencing resistant isolates, varying regional surveillance capacity and sometimes a lack of standardisation between laboratories. This is part of the reason that AMR modelling is less advanced than for other pathogen threats. As we argue in what follows, modelling must tackle these data challenges now rather than waiting for better data.

Calculating the incidence of AMR

How much of a problem is AMR? Answering the question requires modelling. Well-known statistics about AMR are often the products of models. For example, the much-cited O’Neill report commissioned by the British government stated that AMR could cause 10 million deaths a year by 2050 (O’Neill, 2016). This alarming statistic was based on analysis commissioned from two consultancy firms, KPMG and Rand, whose reports do not go into much methodological detail (KPMG LLP, 2014). Criticising the “10 million deaths by 2050” figure, de Kraker et al noted it came from a hypothetical scenario where infection rates doubled and resistance rates rose by 40 percentage points then remained stable, which are strong assumptions without data supporting them (de Kraker et al, 2016). De Kraker et al argued that “modeling future scenarios using unreliable contemporary estimates is of questionable utility.” More recently, a Global Burden of Disease study tried to estimate the current burden of AMR or, strictly speaking, the burden of 23 key pathogens and 88 pathogen-drug combinations across 204 countries in 2019 (Ranjbar & Alam, 2023). By considering the counterfactual scenario where every resistant infection was instead a sensitive infection, Murray et al (2022) aimed to estimate the number of deaths that were directly attributable to AMR. After a complex modelling process involving sub-models for each pathogen, their final estimate was 1.27 million deaths, with a 95% uncertainty interval of 0.91 to 1.71 million derived by propagating uncertainty through models and taking quantile ranges from the posterior distribution of parameters. It is worth highlighting how much modelling is behind this headline figure. First, because causes of death are rarely coded using pathogen or resistance profile but rather by infectious syndromes with diverse underlying microbial causes, the authors used models to relate syndromes to pathogens.
Second, poor data availability meant that the authors used models to generate data for the next stage of modelling. Their final estimates are necessarily built up from a succession of models, with “10 estimation steps that occur within five broad modelling components” that hierarchically create inputs for the next models: from models at the level of infectious syndromes, to case-fatality ratios, pathogen distributions, the fraction of resistance, and finally the relative risk of resistant versus susceptible infections. Making such a complex set of models across pathogens is clearly a difficult task and models may miss aspects known to be important for a particular pathogen. For example, the model for S. pneumoniae did not account for serotype replacement after vaccination. Murray et al (2022) acknowledged significant limitations, including a lack of data from many low- and middle-income countries. Indeed, 19 countries had no available data at all for any aspect of the study’s modelling. This lack of data is particularly problematic given that, where data are available, they suggest that AMR is much more of a problem in low- and middle-income countries. Data scarcity, a consequence of systematic global inequalities, has been highlighted again and again in the context of AMR. However, even where we have good data, the situation is far from clear because our ability to explain resistance with simple models is poor.

Explaining observed levels of resistance

In its fundamentals, AMR is an evolutionary process: an effective antimicrobial exerts a selective pressure for resistance. Put so starkly, AMR might appear like a trivial problem to model. We know that increased usage of an antibiotic should lead to more prevalent resistance. But despite this, predicting population-level resistance is surprisingly difficult. To take a simple example, consider a pathogen with two subpopulations: a sensitive strain and a resistant strain. Assuming the resistant strain is fitter in the presence of antibiotics, this simple model would predict competitive exclusion: there will be a level of antibiotic prescribing below which the sensitive strain dominates and above which the resistant strain dominates. But empirically, we usually observe the persistent coexistence of sensitive and resistant strains over many years, such as for Streptococcus pneumoniae (Lehtinen et al, 2017; Blanquart, 2019). Many possible model structures can reproduce some form of this coexistence. One early effort used a Monte Carlo simulation of 10,000 human hosts that could exchange bacteria with the environment, in effect producing a “migration-selection balance” where an influx of sensitive strains balanced the selection of resistant strains (Levin et al, 1997). Although many other model structures can also reproduce coexistence patterns, one group of authors argued that models should have no intrinsic mechanism that promotes stable coexistence of strains that are otherwise indistinguishable. Otherwise, models can artificially increase the conditions under which coexistence occurs, rather than explaining it realistically (Lipsitch et al, 2009).
The same group of authors compared related models, where a host could be infected by both sensitive and resistant strains at the same time (represented by two equally sized compartments), concluding that within-host interactions play a more important role in coexistence than treatment and contact heterogeneity (Colijn et al, 2010). A more recent study criticised this model, arguing that the subcompartment assumptions (amounting to equal abundance within a host) inhibited coexistence by reducing the scope for within-host competition (Davies et al, 2019b). Those authors argued for a “mixed-carriage” model that explicitly tracks within-host strain frequencies, arguing that this outperformed the previous model when capturing the relationship between national antibiotic consumption and resistance prevalence in E. coli and S. pneumoniae. However, Davies et al did not use any real within-host data. Similar models accounting for maintained structure and separation in the host population, or structure within the pathogen population, have also been adapted to attempt to explain observed frequencies of multiple resistance (Lehtinen et al, 2019).

Some modelling efforts do not try to explain resistance in terms of mechanisms but instead combine antibiotic prescribing data (from electronic health records) with resistance data from longitudinal surveys and look for time-series correlations. These correlations can be modelled with elastic net regularisation and generalised boosted regression models. Such models have highlighted that use of one antibiotic can correlate with resistance to other antibiotics, either because of shared resistance mechanisms or the genetic linkage of resistance genes. One analysis of antibiotic use in primary care in England found that regional levels of resistance to trimethoprim (a dihydrofolate reductase inhibitor) were better explained by prescribing levels of amoxicillin than by prescriptions of trimethoprim itself (Pouwels et al, 2018).
Amoxicillin prescribing was also correlated with resistance to ciprofloxacin, a different class of antibiotic (Pouwels et al, 2019). AMR varies seasonally, and this can be captured using oscillatory models with a period of 1 year. A recent study of AMR in the USA showed that resistance to all antibiotic classes was most correlated with the use of penicillins and macrolides, the most highly prescribed antibiotic classes (Sun et al, 2022a). Usage typically peaks in winter, suggesting that seasonal selection is dominated by only some antibiotic classes. There are many further complications of the use-resistance relationship: for example, high levels of resistance can lead to reduced use of an antibiotic because it is less likely to be effective. Reducing antibiotic prescribing is one obvious action to take to reduce AMR, but understanding how effective our actions will be requires modelling.
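The competitive exclusion prediction of the simple two-strain model discussed earlier in this section can be reproduced in a few lines. The sketch below assumes one particular minimal structure that is not taken from any of the cited papers. Hosts carry either the sensitive or the resistant strain. Treatment at rate tau clears only sensitive carriage. Resistance reduces transmission by a fractional cost. All names and parameter values are hypothetical. In this structure the strain with the larger reproduction number wins, and the two are equal at tau = gamma * cost / (1 - cost).

```python
def resistant_fraction(tau, beta=2.0, gamma=1.0, cost=0.1, t_end=500.0, dt=0.05):
    """Long-run share of carriage that is resistant, by Euler integration.

    i_s, i_r: fraction of hosts carrying the sensitive / resistant strain.
    tau: treatment rate (clears sensitive carriage only, an assumption).
    """
    i_s, i_r = 0.01, 0.01
    for _ in range(round(t_end / dt)):
        s = 1.0 - i_s - i_r  # uncolonised hosts
        d_s = beta * s * i_s - (gamma + tau) * i_s
        d_r = beta * (1 - cost) * s * i_r - gamma * i_r
        i_s, i_r = i_s + d_s * dt, i_r + d_r * dt
    return i_r / (i_s + i_r)

def exclusion_threshold(gamma=1.0, cost=0.1):
    """Treatment rate at which both strains have equal reproduction numbers."""
    return gamma * cost / (1 - cost)

low = resistant_fraction(tau=0.05)   # below the threshold
high = resistant_fraction(tau=0.20)  # above the threshold
print(f"threshold tau = {exclusion_threshold():.3f}")
print(f"resistant share at tau=0.05: {low:.3g}, at tau=0.20: {high:.3g}")
```

Sweeping tau across the threshold switches the outcome from almost entirely sensitive to almost entirely resistant, with no stable intermediate mixture. That sharp switch is the behaviour that conflicts with the long-term coexistence seen in surveillance data, and it is what the more elaborate model structures described above attempt to resolve.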