API documentation
Generating a series of forecasts
Model estimation and forecasting are provided by a single function:
pypfilt.forecast(params, start, end, streams, dates, summary, filename)
Generate forecasts from various dates during a simulation.
Parameters:
- params (dict) – The simulation parameters.
- start (datetime.date) – The start of the simulation period.
- end (datetime.date) – The (exclusive) end of the simulation period.
- streams – A list of observation streams.
- dates – The dates at which forecasts should be generated.
- summary – An object that generates summaries of each simulation.
- filename – The output file to generate (can be None).
Returns: The simulation state for each forecast date.
This function returns a dictionary that contains the following keys:
- 'obs': a (flattened) list of every observation;
- 'complete': the simulation state obtained by assimilating every observation; and
- datetime.datetime instances: the simulation state obtained for each forecast, identified by the forecasting date.
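As a rough illustration of this key layout, the sketch below separates the three kinds of entries. The dictionary is a hand-built stand-in with placeholder values, not real output from pypfilt.forecast(); only the key structure follows the description above.

```python
import datetime

# Hand-built stand-in for the dictionary returned by pypfilt.forecast();
# the values are placeholders, only the key layout follows the text above.
fs = {
    'obs': ['obs-1', 'obs-2'],
    'complete': 'state after assimilating every observation',
    datetime.datetime(2015, 6, 1): 'state for the 1 June forecast',
    datetime.datetime(2015, 5, 1): 'state for the 1 May forecast',
}

observations = fs['obs']
complete_state = fs['complete']

# Every remaining key is a forecasting date; sort them chronologically.
forecast_dates = sorted(k for k in fs if isinstance(k, datetime.datetime))

for fs_date in forecast_dates:
    print(fs_date.date(), '->', fs[fs_date])
```

Filtering on the key type avoids hard-coding the two string keys, which is convenient when looping over every forecast.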
The simulation states are generated by pypfilt.run() and contain the following keys:
- 'params': the simulation parameters;
- 'summary': the dictionary of summary statistics; and
- 'hist': the matrix of particle state vectors, including individual particle weights (hist[..., -2]) and the index of each particle at the previous time-step (hist[..., -1]), since these can change due to resampling.
The matrix has dimensions \(N_{Steps} \times N_{Particles} \times (N_{SV} + 2)\) for state vectors of size \(N_{SV}\).
Note: if max_days > 0 was passed to pypfilt.default_params(), only a fraction of the entire simulation period will be available.
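The layout of the 'hist' matrix can be explored with a small stand-in array. This is a minimal sketch that assumes NumPy is available; the sizes and values are hypothetical, and the array is built by hand rather than returned by pypfilt.

```python
import numpy as np

n_steps, n_particles, n_sv = 4, 3, 2

# A stand-in history matrix with the documented shape:
# N_Steps x N_Particles x (N_SV + 2).
hist = np.zeros((n_steps, n_particles, n_sv + 2))
hist[..., -2] = 1.0 / n_particles        # uniform particle weights
hist[..., -1] = np.arange(n_particles)   # each particle is its own parent

state_vectors = hist[..., :-2]           # the N_SV state columns (a view)
weights = hist[..., -2]                  # one weight per step and particle
parents = hist[..., -1].astype(int)      # index at the previous time-step

# Weighted mean of the first state variable at the final time-step.
hist[-1, :, 0] = [1.0, 2.0, 3.0]
final_mean = np.sum(weights[-1] * hist[-1, :, 0])
```

Slicing with the ellipsis keeps the code independent of how many time-steps and particles the matrix holds.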
Particle filter parameters
Default values for the particle filter parameters are provided:
pypfilt.default_params(model, max_days=0, px_count=0)
The default particle filter parameters.
Memory usage can become very large when there are many particles, so it may be necessary to keep only a sliding window of the particle history matrix in memory, rather than the entire matrix.
Parameters:
- model – The system model.
- max_days – The number of contiguous days that must be kept in memory (e.g., the largest observation period).
- px_count – The number of particles.
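A back-of-the-envelope estimate shows why a sliding window can matter. The sketch below uses the history matrix dimensions given earlier and assumes 8-byte floating-point values and one time-step per day; all of the sizes are hypothetical.

```python
def hist_bytes(n_steps, n_particles, n_sv, bytes_per_value=8):
    """Approximate size of an N_Steps x N_Particles x (N_SV + 2) matrix."""
    return n_steps * n_particles * (n_sv + 2) * bytes_per_value

# Hypothetical sizes, chosen only for illustration.
full = hist_bytes(n_steps=365, n_particles=100000, n_sv=10)
window = hist_bytes(n_steps=14, n_particles=100000, n_sv=10)

print('full history: %.2f GB' % (full / 1e9))
print('14-step window: %.2f GB' % (window / 1e9))
```

With these sizes the full history needs about 3.5 GB, while a 14-step window needs about 0.13 GB.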
The bootstrap particle filter
The bootstrap particle filter is exposed as a single-step function, which updates particle weights and performs resampling as necessary:
pypfilt.step(params, hist, hist_ix, step_num, when, step_obs, max_back, is_fs)
Perform a single time-step for every particle.
Parameters:
- params – The simulation parameters.
- hist – The particle history matrix.
- hist_ix – The index of the current time-step in the history matrix.
- step_num – The time-step number.
- when – The current simulation time.
- step_obs – The list of observations for this time-step.
- max_back – The number of time-steps into the past when the most recent resampling occurred; must be either a positive integer or None (no limit).
- is_fs – Indicates whether this is a forecasting simulation (i.e., no observations). For deterministic models it is useful to add some random noise when estimating, to allow identical particles to differ in their behaviour, but this is not desirable when forecasting.
Returns: True if resampling was performed, otherwise False.
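To make the resampling idea concrete, here is a generic sketch of bootstrap resampling in plain Python. It is not pypfilt's implementation: the function names are hypothetical, and the effective-sample-size trigger is an assumption, since the criterion that step() uses is not described here. The returned parent indices play the role of the previous-index column in the history matrix.

```python
import random

def resample(weights, rng):
    """Draw parent indices in proportion to the weights, then reset weights."""
    n = len(weights)
    parents = rng.choices(range(n), weights=weights, k=n)
    new_weights = [1.0 / n] * n
    return parents, new_weights

def effective_sample_size(weights):
    """1 / sum(w^2) for normalised weights; small values signal degeneracy."""
    return 1.0 / sum(w * w for w in weights)

rng = random.Random(1)
weights = [0.0, 1.0, 0.0, 0.0]

# Hypothetical trigger: resample when fewer than half the particles
# are effective.
resampled = effective_sample_size(weights) < 0.5 * len(weights)
if resampled:
    parents, weights = resample(weights, rng)
```

Here all of the weight sits on one particle, so every resampled particle descends from it and the weights become uniform again.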
Running a single simulation
pypfilt.run(params, start, end, streams, summary, state=None, save_when=None, save_to=None)
Run the particle filter against any number of data streams.
Parameters:
- params (dict) – The simulation parameters.
- start (datetime.date) – The start of the simulation period.
- end (datetime.date) – The (exclusive) end of the simulation period.
- streams – A list of observation streams (see with_observations()).
- summary – An object that generates summaries of each simulation.
- state – A previous simulation state as returned by, e.g., this function.
- save_when – Dates at which to save the particle history matrix.
- save_to – The filename for saving the particle history matrix.
Returns: The resulting simulation state: a dictionary that contains the simulation parameters ('params'), the particle history matrix ('hist'), and the summary statistics ('summary').
Simulation models
All simulation models should derive from the following base class.
class pypfilt.model.Base
The base class for simulation models, which defines the minimal set of methods that are required.
static init(params, vec)
Initialise a state vector.
Parameters:
- params – Simulation parameters.
- vec – An uninitialised state vector of correct dimensions (see state_size()).
Raises: NotImplementedError – Derived classes must implement this method.
static state_size()
Return the size of the state vector.
Raises: NotImplementedError – Derived classes must implement this method.
static priors(params)
Return a dictionary of model parameter priors. Each key must identify a parameter by name. Each value must be a function that returns samples from the associated prior distribution, and should have the following form:
    lambda r, size=None: r.uniform(1.0, 2.0, size=size)
Here, the argument r is a PRNG instance and size specifies the output shape (by default, a single value).
Parameters: params – Simulation parameters.
Raises: NotImplementedError – Derived classes must implement this method.
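A dictionary of priors in this form might look like the sketch below. It assumes NumPy is available and that the PRNG is a numpy.random.RandomState instance; the parameter names 'alpha' and 'beta' are hypothetical.

```python
import numpy as np

# Hypothetical parameter names; each value follows the documented form.
priors = {
    'alpha': lambda r, size=None: r.uniform(1.0, 2.0, size=size),
    'beta': lambda r, size=None: r.normal(0.0, 1.0, size=size),
}

rng = np.random.RandomState(42)
single = priors['alpha'](rng)                # one value
batch = priors['alpha'](rng, size=(5, 2))    # an array of samples
```

Passing the PRNG as an argument, rather than using a global generator, keeps the samples reproducible for a given seed.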
classmethod update(params, step_date, dt, is_fs, prev, curr)
Perform a single time-step.
Parameters:
- params – Simulation parameters.
- step_date – The date and time of the current time-step.
- dt – The time-step size (days).
- is_fs – Indicates whether this is a forecasting simulation.
- prev – The state before the time-step.
- curr – The state after the time-step (destructively updated).
Raises: NotImplementedError – Derived classes must implement this method.
classmethod state_info()
Describe each state variable as a (name, index) tuple, where name is a descriptive name for the variable and index is the index of that variable in the state vector.
Raises: NotImplementedError – Derived classes must implement this method.
classmethod param_info()
Describe each model parameter as a (name, index) tuple, where name is a descriptive name for the parameter and index is the index of that parameter in the state vector.
Raises: NotImplementedError – Derived classes must implement this method.
classmethod stat_info()
Describe each statistic that can be calculated by this model as a (name, stat_fn) tuple, where name is a string that identifies the statistic and stat_fn is a function that calculates the value of the statistic.
Raises: NotImplementedError – Derived classes must implement this method.
static is_valid(hist)
Identify particles whose state and parameters can be inspected. By default, this function returns True for all particles. Override this function to ensure that inchoate particles are correctly ignored.
Weighted statistics
The pypfilt.stats module provides functions for calculating weighted statistics across particle populations.
pypfilt.stats.cov_wt(x, wt, cor=False)
Estimate the weighted covariance matrix, based on a NumPy pull request.
Equivalent to cov.wt(x, wt, cor, center=TRUE, method="unbiased") as provided by the stats package for R.
Parameters:
- x – A 2-D array; columns represent variables and rows represent observations.
- wt – A 1-D array of observation weights.
- cor – Whether to return a correlation matrix instead of a covariance matrix.
Returns: The covariance matrix (or correlation matrix, if cor=True).
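For reference, the "unbiased" scaling used by R's cov.wt divides the weighted sum of centred products by one minus the sum of squared normalised weights. The plain-Python sketch below follows that formula; it is an illustration with a hypothetical function name, not the code in pypfilt.stats.

```python
def cov_wt_sketch(x, wt):
    """Weighted covariance with the 'unbiased' scaling used by R's cov.wt."""
    total = float(sum(wt))
    wt = [w / total for w in wt]                 # normalise the weights
    n_vars = len(x[0])
    # Weighted mean of each column.
    center = [sum(w * row[j] for w, row in zip(wt, x)) for j in range(n_vars)]
    scale = 1.0 - sum(w * w for w in wt)
    cov = [[0.0] * n_vars for _ in range(n_vars)]
    for j in range(n_vars):
        for k in range(n_vars):
            cov[j][k] = sum(
                w * (row[j] - center[j]) * (row[k] - center[k])
                for w, row in zip(wt, x)) / scale
    return cov

# Rows are observations, columns are variables.
x = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = cov_wt_sketch(x, [1.0, 1.0, 1.0])
```

With equal weights the scaling reduces to the usual division by n - 1, so the result matches the ordinary sample covariance.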
pypfilt.stats.avg_var_wt(x, weights, biased=True)
Return the weighted average and variance (based on a Stack Overflow answer).
Parameters:
- x – The data points.
- weights – The normalised weights.
- biased – Use a biased variance estimator.
Returns: A tuple that contains the weighted average and weighted variance.
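The biased estimator can be written in a few lines. This sketch assumes the weights already sum to one and covers only the biased case; the function name is hypothetical and the code is not taken from pypfilt.stats.

```python
def avg_var_wt_sketch(x, weights):
    """Weighted average and biased weighted variance for normalised weights."""
    avg = sum(w * xi for w, xi in zip(weights, x))
    var = sum(w * (xi - avg) ** 2 for w, xi in zip(weights, x))
    return avg, var

avg, var = avg_var_wt_sketch([1.0, 2.0, 3.0], [0.25, 0.5, 0.25])
```

The variance is the weighted mean of the squared deviations from the weighted average.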
pypfilt.stats.qtl_wt(x, weights, probs)
Equivalent to wtd.quantile(x, weights, probs, normwt=TRUE) as provided by the Hmisc package for R.
Parameters:
- x – The numerical data.
- weights – The weight of each data point.
- probs – The quantile(s) to compute.
Returns: The array of weighted quantiles.
pypfilt.stats.cred_wt(x, weights, creds)
Calculate weighted credible intervals.
Parameters:
- x – The numerical data.
- weights – The weight of each data point.
- creds (List(int)) – The credible interval(s) to compute (0..100, where 0 represents the median and 100 the entire range).
Returns: A dictionary that maps credible intervals to the lower and upper interval bounds.
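One way to read the 0..100 scale is as the percentage of probability mass covered by a central interval, so that an interval of size c spans the quantiles 0.5 - c/200 and 0.5 + c/200. The sketch below makes that assumption and uses a simple inverse-CDF weighted quantile, which does not reproduce the interpolation performed by Hmisc's wtd.quantile. All function names are hypothetical.

```python
def cred_to_probs(cred):
    """Lower and upper quantile probabilities of a central interval (0..100)."""
    half = cred / 200.0
    return 0.5 - half, 0.5 + half

def weighted_quantile(x, weights, prob):
    """Smallest x whose cumulative normalised weight reaches prob."""
    pairs = sorted(zip(x, weights))
    total = float(sum(w for _, w in pairs))
    cumulative = 0.0
    for value, w in pairs:
        cumulative += w / total
        if cumulative >= prob:
            return value
    return pairs[-1][0]  # guard against floating-point shortfall at prob = 1

def cred_wt_sketch(x, weights, creds):
    """Map each credible interval to its (lower, upper) bounds."""
    return {cred: tuple(weighted_quantile(x, weights, p)
                        for p in cred_to_probs(cred))
            for cred in creds}

x = [10.0, 20.0, 30.0, 40.0, 50.0]
weights = [0.1, 0.2, 0.4, 0.2, 0.1]
intervals = cred_wt_sketch(x, weights, [0, 50, 100])
```

Under this reading, the 0% interval collapses to the weighted median and the 100% interval spans the smallest and largest data points.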
Simulation metadata
Every simulation data file should include metadata that documents the simulation parameters and working environment. The pypfilt.summary module provides a function to automatically generate such metadata:
pypfilt.summary.metadata(params, pkgs=None)
Construct a metadata dictionary that documents the simulation parameters and system environment. Note that this should be generated at the start of the simulation, and that the git metadata will only be valid if the working directory is located within a git repository.
Parameters:
- params – The simulation parameters.
- pkgs – A dictionary that maps package names to modules that define appropriate __version__ attributes, used to record the versions of additional relevant packages; see below for an example.
By default, the versions of pypfilt, h5py, numpy and scipy are recorded. The following example demonstrates how to also record the installed version of the epifx package:
    import epifx
    import pypfilt.summary
    params = ...
    metadata = pypfilt.summary.metadata(params, {'epifx': epifx})
If the above function isn’t sufficiently flexible, several other utility functions are provided to assist with generating metadata:
pypfilt.summary.metadata_priors(params)
Return a dictionary that describes the model parameter priors.
Each key identifies a parameter (by name); the corresponding value is a string representation of the prior distribution, which is typically a numpy.random.RandomState method call.
For example:
    {'alpha': "random.uniform(0.1, 1.0)"}
pypfilt.summary.encode_value(value)
Encode values in a form suitable for serialisation in HDF5 files.
- Integer values are converted to numpy.int32 values.
- Floating-point values and arrays retain their data type.
- All other (i.e., non-numerical) values are converted to UTF-8 strings.
pypfilt.summary.filter_dict(values, ignore, encode_fn)
Recursively filter items from a dictionary, used to remove parameters from the metadata dictionary that, e.g., have no meaningful representation.
Parameters:
- values – The original dictionary.
- ignore – A dictionary that specifies which values to ignore.
- encode_fn – A function that encodes the remaining values (see encode_value()).
For example, to ignore ['px_range'], ['resample']['rnd'], and 'expect_fn' and 'log_llhd_fn' for every observation system:
    ignore = {
        'px_range': None,
        'resample': {'rnd': None},
        # Note the use of ``None`` to match any key under 'obs'.
        'obs': {None: {'expect_fn': None, 'log_llhd_fn': None}}
    }
    filter_dict(params, ignore, encode_value)
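The matching rules implied by this example can be sketched in plain Python. The function below is a hypothetical re-implementation of the described behaviour, not pypfilt's code: it assumes that a None value in ignore drops the entry, that a None key matches any key, and that an exact key match takes priority over the wildcard. The parameter values, including 'threshold' and 'period', are stand-ins, and str stands in for encode_value.

```python
def filter_dict_sketch(values, ignore, encode_fn):
    """Return a filtered, encoded copy of ``values`` (not pypfilt's own code)."""
    result = {}
    for key, value in values.items():
        # An exact key match takes priority; ``None`` matches any key.
        if key in ignore:
            rule = ignore[key]
        elif None in ignore:
            rule = ignore[None]
        else:
            rule = False            # no rule applies: keep the value
        if rule is None:
            continue                # drop this entry entirely
        if isinstance(value, dict):
            sub_rule = rule if isinstance(rule, dict) else {}
            result[key] = filter_dict_sketch(value, sub_rule, encode_fn)
        else:
            result[key] = encode_fn(value)
    return result

# Stand-in parameters; the keys mirror the example above.
params = {
    'px_range': (0, 100),
    'resample': {'rnd': object(), 'threshold': 0.25},
    'obs': {'cases': {'expect_fn': len, 'log_llhd_fn': len, 'period': 7}},
}
ignore = {
    'px_range': None,
    'resample': {'rnd': None},
    'obs': {None: {'expect_fn': None, 'log_llhd_fn': None}},
}
filtered = filter_dict_sketch(params, ignore, str)
```

Only the values that survive the filter are passed to the encoding function, so objects without a meaningful representation are never encoded.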
Summary statistics
Summary statistics are stored in tables, each of which comprises a set of named columns and a specific number of rows.
Base classes
class pypfilt.summary.Table(name)
The base class for summary statistic tables.
Tables are used to record rows of summary statistics as a simulation progresses.
Parameters: name – the name of the table in the output file.
dtype(params, obs_list)
Return the column names and data types, represented as a list of (name, data type) tuples. See the NumPy documentation for details.
Parameters:
- params – The simulation parameters.
- obs_list – A list of all observations.
Raises: NotImplementedError – Derived classes must implement this method.
n_rows(start_date, end_date, n_days, n_sys, forecasting)
Return the number of rows required for a single simulation.
Parameters:
- start_date – The date at which the simulation starts.
- end_date – The date at which the simulation ends.
- n_days – The number of days for which the simulation runs.
- n_sys – The number of observation systems (i.e., data sources).
- forecasting – True if this is a forecasting simulation, otherwise False.
Raises: NotImplementedError – Derived classes must implement this method.
add_rows(hist, weights, fs_date, dates, obs_types, insert_fn)
Record rows of summary statistics for some portion of a simulation.
Parameters:
- hist – The particle history matrix.
- weights – The weight of each particle at each date in the simulation window; it has dimensions (d, p) for d days and p particles.
- fs_date – The forecasting date; if this is not a forecasting simulation, this is the date at which the simulation ends.
- dates – A list of (datetime, ix, hist_ix) tuples that identify each day in the simulation window, the index of that day in the simulation window, and the index of that day in the particle history matrix.
- obs_types – A set of (unit, period) tuples that identify each observation system from which observations have been taken.
- insert_fn – A function that inserts one or more rows into the underlying data table; see the examples below.
Raises: NotImplementedError – Derived classes must implement this method.
The row insertion function can be used as follows:
    # Insert a single row, represented as a tuple.
    insert_fn((x, y, z))
    # Insert multiple rows, represented as a list of tuples.
    insert_fn([(x0, y0, z0), (x1, y1, z1)], n=2)
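To show how these arguments fit together, the sketch below records the weighted mean of the first state variable for each day. It is a free-standing function rather than a Table subclass, nested lists stand in for the NumPy arrays, and rows.append is a stand-in for insert_fn that accepts only one row at a time. The column choice and all values are hypothetical.

```python
import datetime

def add_mean_rows(hist, weights, fs_date, dates, obs_types, insert_fn):
    """Insert one (fs_date, date, mean) row per day for state variable 0."""
    for date, ix, hist_ix in dates:
        # weights is indexed by window position, hist by history index.
        mean = sum(w * particle[0]
                   for w, particle in zip(weights[ix], hist[hist_ix]))
        insert_fn((fs_date, date, mean))

# Stand-in inputs: 2 days, 2 particles, 1 state variable plus the
# weight and previous-index columns.
hist = [
    [[1.0, 0.5, 0], [3.0, 0.5, 1]],
    [[2.0, 0.25, 0], [6.0, 0.75, 1]],
]
weights = [[0.5, 0.5], [0.25, 0.75]]
day0 = datetime.datetime(2015, 5, 1)
day1 = datetime.datetime(2015, 5, 2)
dates = [(day0, 0, 0), (day1, 1, 1)]

rows = []
add_mean_rows(hist, weights, day1, dates, set(), rows.append)
```

Keeping the two indices separate matters because, with a sliding window, a day's position in the window need not equal its position in the history matrix.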
finished(hist, weights, fs_date, dates, obs_types, insert_fn)
Record rows of summary statistics at the end of a simulation.
The parameters are as per add_rows().
Derived classes should only implement this method if rows must be recorded at the end of a simulation; the provided method does nothing.
monitors()
Return a list of monitors required by this Table.
Derived classes should implement this method if they require one or more monitors; the provided method returns an empty list.
In some cases, the Table model is not sufficiently flexible and a Monitor may be needed.
class pypfilt.summary.Monitor
The base class for simulation monitors.
Monitors are used to calculate quantities that:
- Are used by multiple Tables (i.e., avoiding repeated computation); or
- Require a complete simulation for calculation (as distinct from Tables, which incrementally record rows as a simulation progresses).
The quantities calculated by a Monitor can then be recorded by Table.add_rows() and/or Table.finished().
prepare(params, obs_list)
Perform any required preparation prior to a set of simulations.
Parameters:
- params – The simulation parameters.
- obs_list – A list of all observations.
begin_sim(start_date, end_date, n_days, n_sys, forecasting)
Perform any required preparation at the start of a simulation.
Parameters:
- start_date – The date at which the simulation starts.
- end_date – The date at which the simulation ends.
- n_days – The number of days for which the simulation runs.
- n_sys – The number of observation systems (i.e., data sources).
- forecasting – True if this is a forecasting simulation, otherwise False.
monitor(hist, weights, fs_date, dates, obs_types)
Monitor the simulation progress.
Parameters:
- hist – The particle history matrix.
- weights – The weight of each particle at each date in the simulation window; it has dimensions (d, p) for d days and p particles.
- fs_date – The forecasting date; if this is not a forecasting simulation, this is the date at which the simulation ends.
- dates – A list of (datetime, ix, hist_ix) tuples that identify each day in the simulation window, the index of that day in the simulation window, and the index of that day in the particle history matrix.
- obs_types – A set of (unit, period) tuples that identify each observation system from which observations have been taken.
Raises: NotImplementedError – Derived classes must implement this method.
Provided statistics
The following derived classes are provided to calculate basic summary statistics of any generic simulation model.
class pypfilt.summary.ModelCIs(probs=None, name='model_cints')
Calculate fixed-probability central credible intervals for all state variables and model parameters.
Parameters:
- probs – an array of probabilities that define the size of each central credible interval. The default value is numpy.uint8([0, 50, 90, 95, 99, 100]).
- name – the name of the table in the output file.
class pypfilt.summary.ParamCovar(name='param_covar')
Calculate the covariance between all pairs of model parameters during each simulation.
Parameters: name – the name of the table in the output file.
Utility functions
The following column types are provided for convenience when defining custom tables.
pypfilt.summary.dtype_date(name='date')
The dtype for columns that store dates.
pypfilt.summary.dtype_unit(obs_list, name='unit')
The dtype for columns that store observation units.
pypfilt.summary.dtype_period(name='period')
The dtype for columns that store observation periods.
Summary data files
class pypfilt.summary.HDF5(params, all_obs, meta=None, first_day=False)
Save tables of summary statistics to an HDF5 file.
Parameters:
- params – The simulation parameters.
- all_obs – A list of all observations.
- meta – The simulation metadata; by default the output of metadata() is used.
- first_day – If False (the default), statistics are calculated from the date of the first observation. If True, statistics are calculated from the very beginning of the simulation period.
add_tables(*tables)
Add summary statistic tables that will be included in the output file.
save_forecasts(fs, filename)
Save forecast summaries to disk in the HDF5 binary data format.
This function creates the following datasets that summarise the estimation and forecasting outputs:
- 'data/TABLE' for each table.
The provided metadata will be recorded under 'meta/'.
If dataset creation timestamps are enabled, two simulations that produce identical outputs will not result in identical files. Timestamps will be disabled where possible (requires h5py >= 2.2):
- 'hdf5_track_times': Presence of creation timestamps.
Parameters:
- fs – Simulation outputs, as returned by pypfilt.forecast().
- filename – The filename to which the data will be written.