cross-validation and out-of-sample mean squared error #206

Open
phelps-sg opened this issue May 10, 2020 · 14 comments

@phelps-sg
Contributor

Related to #194, I could not find any description of how the model output is validated against out-of-sample data in the model overview or any of the cited papers (apologies if I missed it). Do we need some code to perform e.g. k-fold cross-validation using the latest time-series data on numbers of infected etc.? It would be good to see some out-of-sample MSE calculations to check for over-fitting or model bias.

@bbolker

bbolker commented May 10, 2020

This is a fine idea, but if you're not familiar with time series modeling (sorry if you are and I'm telling you things you already know) you should note that cross-validating time series fits can be tricky, e.g. see here. Naive cross-validation doesn't in general work; instead you need to set up some way to do 'n-step-ahead' prediction (1-step-ahead is often overoptimistic) on unused data and compare the fitted and observed values.
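To make the n-step-ahead idea concrete, here is a minimal sketch (not tied to this codebase) of a rolling-origin evaluation: the model is repeatedly refit on an expanding window of observations and scored only on the points immediately following that window. The `fit_and_forecast` callable is a hypothetical stand-in for whatever calibrate-and-simulate step the model actually uses.

```python
import numpy as np

def rolling_origin_mse(series, fit_and_forecast, min_train=30, horizon=7):
    """Out-of-sample MSE of n-step-ahead forecasts from an expanding training window.

    fit_and_forecast(train, horizon) is a hypothetical callable that calibrates
    the model to `train` and returns `horizon` forecast values.
    """
    series = np.asarray(series, dtype=float)
    errors = []
    for origin in range(min_train, len(series) - horizon + 1):
        train = series[:origin]                      # everything up to the forecast origin
        future = series[origin:origin + horizon]     # held-out observations
        forecast = np.asarray(fit_and_forecast(train, horizon))
        errors.append(np.mean((forecast - future) ** 2))
    return np.array(errors)                          # one out-of-sample MSE per forecast origin
```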

@phelps-sg
Contributor Author

Yes, I have some familiarity with time-series modelling, but I am not very familiar with this particular model, particularly how it is calibrated. I think the appropriate cross-validation methodology will depend on which variables in the model we consider as predictors, which as responses, and whether the model is auto-regressive (I am not sure I understand #194, but it seems to suggest the use of an auto-regressive or recurrent approach). Also, if we include spatially-distributed data in the validation, then we should also consider spatial auto-correlation.

I think the most important starting point might be to get at least some out-of-sample MSE calculations, so just calculating the MSE for a single block of held-back validation data would be a good start. Does anybody know if this has been done already?
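For that single held-back block, something along the lines of the sketch below would do. Here `calibrate_and_simulate` is a hypothetical stand-in for the real calibration and model run (it simply carries the last observed value forward so the snippet is self-contained), and the CSV path is a placeholder for whatever daily series we validate against.

```python
import numpy as np

def calibrate_and_simulate(train, n_days):
    # Hypothetical placeholder for calibrating the simulator to `train` and
    # running it forward `n_days`; here it simply carries the last value forward.
    return np.full(n_days, train[-1])

observed_deaths = np.loadtxt("observed_deaths.csv")  # placeholder path; one value per day, oldest first
cutoff = int(0.8 * len(observed_deaths))             # hold back the final 20% of days

train, validation = observed_deaths[:cutoff], observed_deaths[cutoff:]
predicted = calibrate_and_simulate(train, n_days=len(validation))

out_of_sample_mse = np.mean((predicted - validation) ** 2)
print(f"out-of-sample MSE over {len(validation)} held-back days: {out_of_sample_mse:.2f}")
```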

@insidedctm
Contributor

The cited papers are in the top-level README. The code is for a mathematical model that simulates the evolution of a pandemic. It is not a statistical or machine learning model although some of the input parameters will have been derived from statistical modelling.

@bbolker

bbolker commented May 10, 2020

+100 to @insidedctm's comment. That said, it would be interesting and valuable to formalize the calibration procedure and quantify how reliable it is (for example, first by running the calibration procedure as described in the various papers on data simulated from the model itself, then by running it on subsets of the time-series data and quantifying forecast accuracies).
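A rough sketch of the first step suggested above, assuming hypothetical wrappers `simulate_epidemic(params, n_days, rng)` around the model and `calibrate(observed)` around the fitting procedure (neither exists in the repo under these names): simulate from known parameters, recalibrate, and summarise how well the known values are recovered.

```python
import numpy as np

def check_parameter_recovery(simulate_epidemic, calibrate, true_params,
                             n_reps=100, n_days=120, seed=0):
    """Simulate epidemics with known parameters, recalibrate, and summarise recovery.

    simulate_epidemic(params, n_days, rng) and calibrate(observed) are hypothetical
    wrappers around the actual model run and calibration procedure.
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_reps):
        observed = simulate_epidemic(true_params, n_days, rng)   # synthetic "data"
        estimates.append(calibrate(observed))                    # re-run the calibration on it
    estimates = np.asarray(estimates)
    bias = estimates.mean(axis=0) - np.asarray(true_params)      # systematic error per parameter
    spread = estimates.std(axis=0)                               # variability per parameter
    return bias, spread
```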

@phelps-sg
Contributor Author

phelps-sg commented May 10, 2020

In reply to @insidedctm, I do not understand what you mean. Any mathematical model (simulation or otherwise) that takes input parameters (predictors) and has corresponding dependent variables (responses) which represent things in the real world is, by definition, a statistical model; you provide some inputs to the model, and you get some corresponding quantitative outputs. These quantitative outputs correspond to measurable things in the real world and can be compared against actual data. In common with many other statistical models, your dependent variables are stochastic. Just because the mapping between predictors and responses is implemented in C++ instead of an equation, and just because it is an individual-based model, does not mean that it no longer meets the definition of a statistical model.

Moreover, as I understand it, the simulation is being used to make quantitative predictions about what will happen to e.g. infection rates if certain parameters are changed. If that is the case, then it is vitally important to validate the model by comparing its predictions with reality. Or are you claiming that this model should not be used to make quantitative predictions - is this what you mean by "it is not a statistical model"?

There is plenty of literature on calibration, estimation and validation of agent-based models ("agent-based model" is another term for "individual-based simulation model"). I don't see any reason why this ABM in particular can't or shouldn't be validated against empirical data.

@insidedctm
Contributor

No, you are quite right, and that was slightly sloppy language. However, I took from your use of terms like k-fold cross-validation that you believed this to be some sort of statistical regression model; there is no data to hold out, so cross-validation isn't applicable. Something along the lines of @bbolker's suggestion would be interesting.

@phelps-sg
Contributor Author

Ok, I'm confused. As I understand it, the model is initially calibrated so that "it reproduced the observed cumulative number of deaths in GB or the US seen by 14th March 2020". So we use some training data (cumulative deaths D_{t_0} at t_0 = 14/3/20), and then we can predict the number of deaths \hat{D}_{t_1} at some later date t_1 by running the model? We can then validate the model by computing (\hat{D}_{t_1} - D_{t_1})^2. I would bet that if you carry on running the model forward in time, the squared errors will start to increase a lot at later dates. Therefore it might make sense to regularly recalibrate the model at later points in time as per #194. If such a methodology is followed, it will be necessary to think carefully about how to divide the data D_t into training and validation sets.
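A short sketch of that check, with a hypothetical `run_model_from` wrapper that returns predicted cumulative deaths for the days after the calibration date; plotting the returned squared errors against horizon would show how quickly the fit degrades and how often recalibration is needed.

```python
import numpy as np

def squared_error_by_horizon(D, t0, run_model_from, max_horizon=28):
    """Squared forecast error (D_hat_t - D_t)^2 at each horizon after calibration.

    D is observed cumulative deaths by day, t0 the index of the calibration
    date (e.g. 14 March 2020), and run_model_from(D[:t0 + 1], h) a hypothetical
    wrapper returning predicted cumulative deaths for the next h days.
    """
    D = np.asarray(D, dtype=float)
    D_hat = np.asarray(run_model_from(D[:t0 + 1], max_horizon))
    horizons = np.arange(1, max_horizon + 1)
    sq_err = (D_hat - D[t0 + 1:t0 + 1 + max_horizon]) ** 2
    return horizons, sq_err
```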

@bbolker

bbolker commented May 10, 2020

Yes, exactly. That's the point of the link I posted above. The model is almost certainly not "autoregressive" in the sense that would make naive splits into training and validation subsets suitable. See here for some starting points in thinking about validating/calibrating epidemic models during the course of an epidemic.

@phelps-sg
Contributor Author

phelps-sg commented May 10, 2020

The Funk et al. paper looks like an excellent basis for a validation methodology. I also like their suggestion to compare out-of-sample errors against those obtained from a simpler null model.
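One way that comparison could be operationalised (a sketch under my own assumptions, not something from the repository or the paper): fit a deliberately simple null model, here constant exponential growth over the last week of the training window, and report the model's out-of-sample MSE as a ratio of the null model's.

```python
import numpy as np

def exponential_growth_null(train, horizon, window=7):
    """Null model: continue the average daily growth factor of the last `window` days."""
    train = np.asarray(train, dtype=float)
    rate = (train[-1] / train[-window]) ** (1.0 / (window - 1))      # mean daily growth factor
    return train[-1] * rate ** np.arange(1, horizon + 1)

def relative_mse(observed_future, model_forecast, train):
    """MSE of the model forecast divided by MSE of the null model; < 1 beats the null."""
    observed_future = np.asarray(observed_future, dtype=float)
    horizon = len(observed_future)
    null_mse = np.mean((exponential_growth_null(train, horizon) - observed_future) ** 2)
    model_mse = np.mean((np.asarray(model_forecast) - observed_future) ** 2)
    return model_mse / null_mse
```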

@insidedctm
Contributor

> See here for some starting points in thinking about validating/calibrating epidemic models during the course of an epidemic.

Really interesting paper, thank you. One further complication that would need to be dealt with is how to validate forecasts when, days after the forecast is made, large-scale social changes are enacted. What you have with, for example, Report 9 is a whole variety of forecasts for different input parameters and different NPIs. At most one of them can be the valid forecast - how do you select which one?

@bbolker

bbolker commented May 11, 2020

You have to distinguish between scenarios (expected outcome if ...) and forecasts. You raise some important issues, but these are all standard (and basically intractable) problems in epidemic forecasting. The best you can ever do is show that the machine would work if you correctly specified the inputs ...

@phelps-sg
Contributor Author

phelps-sg commented May 11, 2020

@bbolker, regarding the auto-regressive form or otherwise of the model, is it not the case that the model is being used to predict deaths D_t as a function of previous deaths D_{t-i}, and could do so on a rolling basis, e.g.
D_t = f(D_{t-i}) + \epsilon ?

I know that in this model f() is non-linear, so strictly speaking it is not an autoregressive model, but if the model has this form, would time-series cross-validation still be a valid approach?

@bbolker

bbolker commented May 11, 2020

It's a nice idea, but I doubt it's that simple. Here's another paper on testing calibration of epidemic models (albeit in a much simpler framework); most of the emphasis is on calibration of parameter estimates rather than of forecasts, but the same ideas apply. With actual data I don't think you're going to be able to do better than calibrating to progressively longer subsets and evaluating the accuracy of n-step-ahead forecasts for each subset. For more general purposes you can simulate as many epidemics from the model as you like and see how well the calibration procedure works for the simulated dynamics.
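Tying those two suggestions together, a sketch (again with hypothetical `simulate_epidemic` and `fit_and_forecast` wrappers) of scoring n-step-ahead forecasts across many epidemics simulated from the model itself, calibrating to progressively longer subsets of each simulated series:

```python
import numpy as np

def forecast_accuracy_on_simulations(simulate_epidemic, fit_and_forecast, true_params,
                                     n_epidemics=50, n_days=120,
                                     min_train=30, horizon=7, step=7, seed=1):
    """Mean and spread of n-step-ahead forecast MSE over epidemics simulated from the model.

    simulate_epidemic(params, n_days, rng) and fit_and_forecast(train, horizon)
    are hypothetical wrappers around the model and its calibration procedure.
    """
    rng = np.random.default_rng(seed)
    per_epidemic_mse = []
    for _ in range(n_epidemics):
        series = np.asarray(simulate_epidemic(true_params, n_days, rng), dtype=float)
        errors = []
        for origin in range(min_train, n_days - horizon + 1, step):   # progressively longer subsets
            forecast = np.asarray(fit_and_forecast(series[:origin], horizon))
            errors.append(np.mean((forecast - series[origin:origin + horizon]) ** 2))
        per_epidemic_mse.append(np.mean(errors))
    return np.mean(per_epidemic_mse), np.std(per_epidemic_mse)
```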

@NeilFerguson
Collaborator

An interesting discussion. We have included simple calibration (to approximately match a defined cumulative number of deaths by a particular time) within the code, but more statistically rigorous calibration is arguably best done in two ways: (a) estimating input parameters (e.g. transmission rates in schools) from epidemiological data using simpler models more suited to inference; (b) using wrapper code to run thousands of model runs across a parameter grid (e.g. spanning R0 and intervention effectiveness) and then evaluating the fit of each model run against available data using a likelihood function. We are doing both. For the latter, we are fitting to UK hospitalizations, ICU demand and deaths, which is computationally intensive. For most purposes it is better to use simpler (e.g. age-structured SEIR compartmental) models for such things. We are doing it because there is a need to match current transmission patterns quite closely when looking at the logistical implications of exit strategies (e.g. track and trace). We will add the input files for those runs to the repo in coming weeks.
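For what it's worth, a stripped-down sketch of that grid-plus-likelihood idea, with everything model-specific hidden behind a hypothetical `run_model(r0, effectiveness)` wrapper that returns expected daily deaths over the observation period; the Poisson likelihood here is purely illustrative and not necessarily the likelihood actually being used.

```python
import itertools
import numpy as np
from scipy.stats import poisson

def grid_log_likelihoods(run_model, observed_deaths, r0_grid, effectiveness_grid):
    """Poisson log-likelihood of observed daily deaths at each point on a parameter grid.

    run_model(r0, effectiveness) is a hypothetical wrapper that runs the simulator
    and returns expected daily deaths aligned with observed_deaths.
    """
    results = []
    for r0, eff in itertools.product(r0_grid, effectiveness_grid):
        expected = np.maximum(run_model(r0, eff), 1e-9)          # guard against log(0)
        log_lik = poisson.logpmf(observed_deaths, expected).sum()
        results.append((r0, eff, log_lik))
    return sorted(results, key=lambda r: r[2], reverse=True)     # best-fitting parameter sets first
```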
