Bootstrap particle filter for Python

Welcome to the pypfilt documentation. This package implements a bootstrap particle filter that can be used for recursive Bayesian estimation and forecasting.

Quick-start

Here’s a brief example of how to use pypfilt directly:

import datetime
import pypfilt

params = ...   # Simulation parameters, model parameters, etc.
obs = ...      # A list of observation streams.
summary = ...  # An object that calculates summary statistics.
start = datetime.datetime(2015, 1, 1)
until = datetime.datetime(2015, 12, 31)
outfile = 'forecasts.hdf5'

# Generate forecasts and save them to disk (in HDF5 format).
forecasts = pypfilt.forecast(params, start, until, obs, summary, outfile)

See Getting Started for a more detailed example.

User Documentation

Installation

The requirements for pypfilt are:

Installing required packages

Source installation on Linux and OS X

Warning

Installing from source on Windows is effectively impossible, due to the dependencies of h5py.

Note

If you are using Python 3, you will most likely need to substitute “python3” for “python” in all of the package names listed here.

Alternatively, these packages can be manually installed in a Virtual Environment, by using virtualenv. This requires the following development tools:

  • C and Fortran compilers (typically gcc and gfortran).
    • Debian: sudo apt-get install gcc fortran.
    • Red Hat Enterprise Linux and CentOS: sudo yum install gcc gcc-gfortran.
    • Fedora: sudo dnf install gcc gcc-gfortran.
    • OS X: Install Command Line Tools for Xcode (instructions).
  • Python header files.
    • Debian: sudo apt-get install python-dev.
    • Red Hat Enterprise Linux and CentOS: sudo yum install python-devel.
    • Fedora: sudo dnf install python-devel.
    • OS X: brew install python (see why installing a separate version of Python is a good idea).
  • Linear algebra libraries (typically ATLAS and LAPACK, or MKL, or ACML).

Then install virtualenv and the libhdf5 development files:

# For Debian and Debian-based distributions such as Ubuntu.
sudo apt-get install virtualenv libhdf5-dev

# For Red Hat Enterprise Linux and CentOS.
sudo yum install python-virtualenv hdf5-devel

# For Fedora.
sudo dnf install python-virtualenv hdf5-devel

# For OS X.
brew install python homebrew/science/hdf5
pip install virtualenv

Then create a virtual environment (called venv-pypfilt in the following example):

# Create and activate the virtual environment.
virtualenv venv-pypfilt
source venv-pypfilt/bin/activate
# Install NumPy before SciPy, and Cython before h5py.
pip install 'numpy>=1.8' 'Cython >=0.17'
pip install 'scipy>=0.11'
# Note: may need to identify the directory that contains `include/hdf5.h`.
# For example, for 64-bit Debian and Debian-based distributions:
# export HDF5_DIR=/usr/lib/x86_64-linux-gnu/hdf5/serial
pip install 'h5py>=2.2'

Note

In order to install h5py, you may need to identify the directory that contains include/hdf5.h by defining the HDF5_DIR environment variable (see the comments in the code block above).

You can search for include/hdf5.h by running find -L /usr -name hdf5.h. On Red Hat Enterprise Linux, CentOS, and Fedora, this file is located at /usr/include/hdf5.h and there is no need to define HDF5_DIR. On OS X, this file is located at /usr/local/include/hdf5.h when using Homebrew, and there should be no need to define HDF5_DIR.

Installing pypfilt

Once the required packages have been installed (see instructions, above), you can clone the pypfilt repository and install it in the venv-pypfilt virtual environment:

# Activate the virtual environment.
source venv-pypfilt/bin/activate
# Clone the pypfilt repository.
git clone https://bitbucket.org/robmoss/particle-filter-for-python.git
# Install pypfilt in the virtual environment.
cd particle-filter-for-python
python setup.py install

If you are not using a virtual environment, and you don’t have permission to install pypfilt system-wide, you can install the package locally:

# Clone the pypfilt repository.
git clone https://bitbucket.org/robmoss/particle-filter-for-python.git
# Install pypfilt in the user's "site-packages" directory.
cd particle-filter-for-python
python setup.py install --user

Building the documentation

If you want to build the documentation locally, you will need to install Sphinx 1.3 or newer, and the Read the Docs Sphinx Theme.

These can be installed through a package manager:

# For Debian and Debian-based distributions such as Ubuntu.
sudo apt-get install python-sphinx python-sphinx-rtd-theme

# For Red Hat Enterprise Linux and CentOS.
sudo yum install python-sphinx python-sphinx_rtd_theme

# For Fedora.
sudo dnf install python-sphinx python-sphinx_rtd_theme

# For OS X.
brew install sphinx
pip install sphinx_rtd_theme

Alternatively, they can be installed in the venv-pypfilt virtual environment:

# Activate the virtual environment.
source venv-pypfilt/bin/activate
pip install 'Sphinx>=1.3' sphinx_rtd_theme

You can then build the documentation from the pypfilt repository, which will be written to the doc/build/html directory:

python setup.py build_sphinx

Getting Started

This page assumes that you have already installed the pypfilt package, and shows how to generate forecasts for the following system:

\[\begin{split}\begin{align} \frac{dx}{dt} &= - \alpha \cdot x \implies x(t) = x(0) \cdot e^{- \alpha t} \end{align}\end{split}\]

Parameters

Particle filter parameters are provided by default_params().

Observation model parameters (and fixed parameters for the process model, if any) should be added to this parameter dictionary, so that all parameters pertaining to the simulation are stored together.

def get_params(regularisation=False):
    """
    The default simulation parameters.

    :param regularisation: Whether to use the post-regularisation particle
        filter (post-RPF).
    """
    # The particle filter parameters.
    params = pypfilt.default_params(Model, px_count=1000)
    # Provide an observation model.
    params['log_llhd_fn'] = log_llhd
    # System model parameters.
    if regularisation:
        # No need to add stochastic noise to the model equations.
        params['sys'] = {'noise_alpha': 0, 'noise_dx': 0}
    else:
        # Need to add some stochastic noise to the model equations.
        params['sys'] = {'noise_alpha': 5e-3, 'noise_dx': 0.025}
    # Observation model parameters.
    params['obs'] = {'sdev': 0.05}
    # Set a fixed PRNG seed.
    params['resample']['prng_seed'] = 42
    # Define whether to use the post-regularised particle filter.
    params['resample']['regularisation'] = regularisation
    return params

Observations

Observations are represented as dictionaries that have the following keys:

{'date': datetime.datetime(...),  # When the observation was made
 'value': 200,  # The numerical quantity that was measured
 'unit': 'Some measure',  # A description of the measurement units
 'period': 7,  # The observation period, in days
 'source': 'Some system',  # A description of the data source
}

An observation stream is represented as a chronologically sorted list of observations (oldest first). The particle filter accepts any number of observation streams, which must be provided as a list (i.e., a list of observation lists); see forecast() and run().

Observations can be read from external files:

def obs_from_file(filename, year):
    """
    Read observations from a file with the following format:

    year date       value
    2009 2009-01-02 0.680762021741
    2009 2009-01-03 0.62923826359
    2009 2009-01-04 0.621926239641
    2009 2009-01-05 0.618802422847
    ...
    """
    col_types = [('year', np.int32), ('date', '|O4'), ('value', np.float32)]
    col_types = pypfilt.summary.dtype_names_to_str(col_types)
    col_convs = {1: lambda s: datetime.datetime.strptime(s, '%Y-%m-%d')}
    with codecs.open(filename, encoding='ascii') as f:
        df = np.loadtxt(f, skiprows=1, dtype=col_types, converters=col_convs)
    df = df[df['year'] == year]
    nrows = df.shape[0]
    # Note that counts are assumed to be reported daily (period = 1).
    return [{'date': df['date'][i],
             'value': df['value'][i],
             'unit': 'x',
             'period': 1,
             'source': 'file: {}'.format(filename)}
            for i in range(nrows)]

They can also be generated synthetically:

def generate_obs(x0, alpha, sdev, start, days, seed=42):
    """Generate noisy observations from a known truth."""
    rng = np.random.RandomState(seed)
    time = np.array(range(1, days + 1))
    # Here, x(t) = x(0) * e^{- alpha * t} + error
    xs = x0 * np.exp(- alpha * time)
    if sdev > 0:
        xs += rng.normal(scale=sdev, size=days)
    return [{'date': start + datetime.timedelta(days=i + 1),
             'value': xs[i], 'unit': 'x', 'period': 1,
             'source': 'synthetic'}
            for i in range(days)]

System models

The model of the underlying system must inherit from pypfilt.model.Base. Here is a simulation model for the example system:

class Model(pypfilt.model.Base):
    """
    A model of the following system:

        dx/dt = - alpha . x

    The state vector is ``[x(t), alpha]``.
    """

    @staticmethod
    def state_size():
        """Return the length of a single state vector."""
        return 2

    @staticmethod
    def priors(params):
        """Return a dictionary of model parameter priors."""
        a_min, a_max = params['param_min'][0], params['param_max'][0]
        return {
            'alpha': lambda r, size=None: r.uniform(a_min, a_max, size=size),
        }

    @staticmethod
    def init(params, vec):
        """Initialise any number of state vectors."""
        rnd = params['resample']['rnd']
        rnd_size = vec[..., 0].shape
        # Assume that x(0) is somewhere between 0.5 and 1.
        vec[..., 0] = rnd.uniform(0.5, 1.0, size=rnd_size)
        # Select alpha according to the prior.
        vec[..., 1] = params['prior']['alpha'](rnd, size=rnd_size)

    @staticmethod
    def update(params, step_date, dt, is_fs, prev, curr):
        """Perform a single time-step for any number of state vectors."""
        rnd = params['resample']['rnd']
        rnd_size = curr[..., 0].shape
        # Calculate the deterministic change for x.
        dx = prev[..., 0] * prev[..., 1] * dt
        # Add stochastic noise to the rate.
        noise = params['sys']['noise_dx']
        noise *= rnd.normal(size=rnd_size) * dt
        dx += noise * np.sqrt(dx / dt)
        # Add stochastic noise to alpha.
        noise_alpha = params['sys']['noise_alpha']
        noise_alpha *= rnd.normal(size=rnd_size) * dt
        # Update the state vectors, ensuring alpha remains strictly positive.
        curr[..., 0] = prev[..., 0] - dx
        curr[..., 1] = np.clip(prev[..., 1] + noise_alpha,
                               params['param_min'][0], params['param_max'][0])

    @classmethod
    def state_info(cls):
        """Describe each state variable."""
        return [("x", 0)]

    @classmethod
    def param_info(cls):
        """Describe each model parameter."""
        return [("alpha", 1)]

    @classmethod
    def param_bounds(cls):
        """Return the (default) lower and upper parameter bounds."""
        return ([0.01], [0.10])

    @classmethod
    def stat_info(cls):
        """Describe each statistic that can be calculated by this model."""
        return []

Observation models

The observation model must be stored in params['log_llhd_fn'] and have the following form:

def log_llhd(params, obs_list, curr, prev_dict):
    """
    Calculate the log-likelihood of obtaining specific observations from each
    particle.
    """
    log_llhd = np.zeros(curr.shape[:-1])
    # The expected observation is x(t).
    exp = curr[..., 0]
    # The standard deviation of the observation error.
    sdev = params['obs']['sdev']
    # The likelihood distribution for each particle.
    obs_dist = scipy.stats.norm(loc=exp, scale=sdev)
    for o in obs_list:
        # Calculate the likelihood of this observation for each particle.
        log_llhd += obs_dist.logpdf(o['value'])
    return log_llhd

While the argument prev_dict was not used in this example, it can be used to obtain the state vectors at the beginning of an observation period:

def log_llhd(params, obs_list, curr, prev_dict):
    # Obtain the state vectors for one week prior.
    # This is only valid if an observation has a period of 7.
    one_week_ago = prev_dict[7]
    dx = curr[..., 0] - one_week_ago[..., 0]
    ...

This is useful for situations where the observation depends on the change in the state vector over the observation period.

Summary objects

Simulations typically comprise a large number of both particles and time steps, and so it is generally preferable to record statistics that summarise the particles than to store the entire state history of each simulation.

This functionality is provided by pypfilt.summary.HDF5, which allows any number of summary tables to be recorded. Once all of the estimation and forecasting simulations have been performed, save_forecasts() will save the results to disk.

Here is an example of how to record fixed-probability central credible intervals for the state variable \(x\) and model parameter \(\alpha\) of the example system, using the ModelCIs summary table:

def main(args=None):
    """Generate forecasts against noisy synthetic data."""
    # Parse the command-line arguments.
    parser = get_parser()
    if args is None:
        args = vars(parser.parse_args())
    else:
        args = vars(parser.parse_args(args))

    params = get_params(args['regularisation'])

    # Define the simulation period.
    year = 2009
    days = 42
    start = datetime.datetime(year, 1, 1)
    until = start + datetime.timedelta(days=days)

    # Generate noisy (synthetic) observations.
    true_x0 = 0.7
    true_alpha = 0.05
    sdev = 0.03
    obs_list = generate_obs(true_x0, true_alpha, sdev, start, days)
    streams = [obs_list]

    # Define the summary tables to be saved to disk.
    summary = pypfilt.summary.HDF5(params, obs_list)
    summary.add_tables(pypfilt.summary.ModelCIs(probs=[0, 50, 95]))

    # Define the forecasting dates.
    fs = [datetime.datetime(year, 1, 2), datetime.datetime(year, 1, 3),
          datetime.datetime(year, 1, 5), datetime.datetime(year, 1, 9),
          datetime.datetime(year, 1, 16), datetime.datetime(year, 1, 23)]

    # Determine the output file name.
    if args['regularisation']:
        base = 'example-rpf.hdf5'
    else:
        base = 'example.hdf5'
    data_file = os.path.join(os.path.dirname(__file__), base)

    # Run the model estimation and forecasting simulations.
    pypfilt.forecast(params, start, until, streams, fs, summary, data_file)
    return 0

Forecasting

Model estimations and subsequent forecasts are generated by pypfilt.forecast(), as illustrated in the example above.

This function takes the following arguments:

  • A parameter dictionary;
  • The start and end of the simulation period (datetime.date instances);
  • Any number of observation streams;
  • The dates at which forecasts should be generated (a datetime.date list);
  • A summary object to calculate relevant statistics; and
  • The output file, if desired, otherwise set to None.

When generating forecasts on a regular basis (e.g., daily or weekly, in response to new or updated observations) the particle states can be saved to disk to greatly improve the speed with which the forecasts are generated. This is enabled by defining a cache file:

# If the location is not an absolute path, it is defined
# relative to the output directory, params['out_dir'].
params['hist']['cache_file'] = 'cache.hdf5'

Forecast plots

The code presented here is available in the doc/example directory. To generate and plot forecasts for this system, run the following commands from the root directory of the pypfilt repository:

./doc/example/run.py
./doc/example/plot.R

This will generate forecasts (stored in ./doc/example/example.hdf5) and plot the credible intervals for \(x(t)\) and \(\alpha\). Important: the plotting script requires a working version of R and the following packages: ggplot2, rhdf5, and scales.

Credible intervals for x and alpha.

Credible intervals for \(\alpha\) (top row) and \(x\) (bottom row, plotted against the observations), shown for each forecast (identified by date at the top of the plot). Note that the right-most plots (“2009–2–12”) show the credible intervals obtained by using all of the observations.

To generate and plot forecasts that use the post-regularisation particle filter (post-RPF), run the following commands from the root directory of the pypfilt repository:

./doc/example/run.py --regularisation
./doc/example/plot.R

This will generate forecasts (./doc/example/example-rpf.hdf5) and plot the credible intervals.

Credible intervals for x and alpha when using the post-regularisation particle filter (post-RPF).

Credible intervals for \(\alpha\) and \(x\) when using the post-regularisation particle filter (post-RPF).

API documentation

Generating a series of forecasts

Model estimation and forecasting is provided as a single function:

pypfilt.forecast(params, start, end, streams, dates, summary, filename)

Generate forecasts from various dates during a simulation.

Parameters:
  • params (dict) – The simulation parameters.
  • start (datetime.date) – The start of the simulation period.
  • end (datetime.date) – The (exclusive) end of the simulation period.
  • streams – A list of observation streams.
  • dates – The dates at which forecasts should be generated.
  • summary – An object that generates summaries of each simulation.
  • filename – The output file to generate (can be None).
Returns:

The simulation state for each forecast date.

This function returns a dictionary that contains the following keys:

  • 'obs': a (flattened) list of every observation;
  • 'complete': the simulation state obtained by assimilating every observation; and
  • datetime.datetime instances: the simulation state obtained for each forecast, identified by the forecasting date.

The simulation states are generated by pypfilt.run() and contain the following keys:

  • 'params': the simulation parameters;
  • 'summary': the dictionary of summary statistics; and
  • 'hist': the matrix of particle state vectors, including individual particle weights (hist[..., -2]) and the index of each particle at the previous time-step (hist[..., -1]), since these can change due to resampling.

The matrix has dimensions \(N_{Steps} \times N_{Particles} \times (N_{SV} + 2)\) for state vectors of size \(N_{SV}\).

Note: if max_days > 0 was passed to pypfilt.default_params(), only a fraction of the entire simulation period will be available.

Particle filter parameters

Default values for the particle filter parameters are provided:

pypfilt.default_params(model, max_days=0, px_count=0)

The default particle filter parameters.

Memory usage can reach extreme levels with a large number of particles, and so it may be necessary to keep only a sliding window of the entire particle history matrix in memory.

Parameters:
  • model – The system model.
  • max_days – The number of contiguous days that must be kept in memory (e.g., the largest observation period).
  • px_count – The number of particles.

The bootstrap particle filter

The bootstrap particle filter is exposed as a single-step function, which will update particle weights and perform resampling as necessary:

pypfilt.step(params, hist, hist_ix, step_num, when, step_obs, max_back, is_fs)

Perform a single time-step for every particle.

Parameters:
  • params – The simulation parameters.
  • hist – The particle history matrix.
  • hist_ix – The index of the current time-step in the history matrix.
  • step_num – The time-step number.
  • when – The current simulation time.
  • step_obs – The list of observations for this time-step.
  • max_back – The number of time-steps into the past when the most recent resampling occurred; must be either a positive integer or None (no limit).
  • is_fs – Indicate whether this is a forecasting simulation (i.e., no observations). For deterministic models it is useful to add some random noise when estimating, to allow identical particles to differ in their behaviour, but this is not desirable when forecasting.
Returns:

True if resampling was performed, otherwise False.

Running a single simulation

pypfilt.run(params, start, end, streams, summary, state=None, save_when=None, save_to=None)

Run the particle filter against any number of data streams.

Parameters:
  • params (dict) – The simulation parameters.
  • start (datetime.date) – The start of the simulation period.
  • end (datetime.date) – The (exclusive) end of the simulation period.
  • streams – A list of observation streams (see with_observations()).
  • summary – An object that generates summaries of each simulation.
  • state – A previous simulation state as returned by, e.g., this function.
  • save_when – Dates at which to save the particle history matrix.
  • save_to – The filename for saving the particle history matrix.
Returns:

The resulting simulation state: a dictionary that contains the simulation parameters ('params'), the particle history matrix ('hist'), and the summary statistics ('summary').

Simulation models

All simulation models should derive the following base class.

class pypfilt.model.Base

The base class for simulation models, which defines the minimal set of methods that are required.

static init(params, vec)

Initialise a matrix of state vectors.

Parameters:
  • params – Simulation parameters.
  • vec – An uninitialised \(P \times S\) matrix of state vectors, for \(P\) particles and state vectors of length \(S\) (as defined by state_size()). To set, e.g., the first element of each state vector to \(1\), you can use an ellipsis slice: vec[..., 0] = 1.
Raises:

NotImplementedError – Derived classes must implement this method.

static state_size()

Return the size of the state vector.

Raises:NotImplementedError – Derived classes must implement this method.
static priors(params)

Return a dictionary of model parameter priors. Each key must identify a parameter by name. Each value must be a function that returns samples from the associated prior distribution, and should have the following form:

lambda r, size=None: r.uniform(1.0, 2.0, size=size)

Here, the argument r is a PRNG instance and size specifies the output shape (by default, a single value).

Parameters:params – Simulation parameters.
Raises:NotImplementedError – Derived classes must implement this method.
classmethod update(params, step_date, dt, is_fs, prev, curr)

Perform a single time-step.

Parameters:
  • params – Simulation parameters.
  • step_date – The date and time of the current time-step.
  • dt – The time-step size (days).
  • is_fs – Indicates whether this is a forecasting simulation.
  • prev – The state before the time-step.
  • curr – The state after the time-step (destructively updated).
Raises:

NotImplementedError – Derived classes must implement this method.

classmethod state_info()

Describe each state variable as a (name, index) tuple, where name is a descriptive name for the variable and index is the index of that variable in the state vector.

Raises:NotImplementedError – Derived classes must implement this method.
classmethod param_info()

Describe each model parameter as a (name, index) tuple, where name is a descriptive name for the parameter and index is the index of that parameter in the state vector.

Raises:NotImplementedError – Derived classes must implement this method.
classmethod param_bounds()

Return two arrays that contain the (default) lower and upper bounds, respectively, for each model parameter.

Raises:NotImplementedError – Derived classes must implement this method.
classmethod stat_info()

Describe each statistic that can be calculated by this model as a (name, stat_fn) tuple, where name is a string that identifies the statistic and stat_fn is a function that calculates the value of the statistic.

Raises:NotImplementedError – Derived classes must implement this method.
static is_valid(hist)

Identify particles whose state and parameters can be inspected. By default, this function returns True for all particles. Override this function to ensure that inchoate particles are correctly ignored.

Weighted statistics

The pypfilt.stats module provides functions for calculating weighted statistics across particle populations.

pypfilt.stats.cov_wt(x, wt, cor=False)

Estimate the weighted covariance matrix, based on a NumPy pull request.

Equivalent to cov.wt(x, wt, cor, center=TRUE, method="unbiased") as provided by the stats package for R.

Parameters:
  • x – A 2-D array; columns represent variables and rows represent observations.
  • wt – A 1-D array of observation weights.
  • cor – Whether to return a correlation matrix instead of a covariance matrix.
Returns:

The covariance matrix (or correlation matrix, if cor=True).

pypfilt.stats.avg_var_wt(x, weights, biased=True)

Return the weighted average and variance (based on a Stack Overflow answer).

Parameters:
  • x – The data points.
  • weights – The normalised weights.
  • biased – Use a biased variance estimator.
Returns:

A tuple that contains the weighted average and weighted variance.

pypfilt.stats.qtl_wt(x, weights, probs)

Equivalent to wtd.quantile(x, weights, probs, normwt=TRUE) as provided by the Hmisc package for R.

Parameters:
  • x – The numerical data.
  • weights – The weight of each data point.
  • probs – The quantile(s) to compute.
Returns:

The array of weighted quantiles.

pypfilt.stats.cred_wt(x, weights, creds)

Calculate weighted credible intervals.

Parameters:
  • x – The numerical data.
  • weights – The weight of each data point.
  • creds (List(int)) – The credible interval(s) to compute (0..100, where 0 represents the median and 100 the entire range).
Returns:

A dictionary that maps credible intervals to the lower and upper interval bounds.

Simulation metadata

Every simulation data file should include metadata that documents the simulation parameters and working environment. The pypfilt.summary provides a function to automatically generate such metadata:

pypfilt.summary.metadata(params, pkgs=None)

Construct a metadata dictionary that documents the simulation parameters and system environment. Note that this should be generated at the start of the simulation, and that the git metadata will only be valid if the working directory is located within a git repository.

Parameters:
  • params – The simulation parameters.
  • pkgs – A dictionary that maps package names to modules that define appropriate __version__ attributes, used to record the versions of additional relevant packages; see below for an example:

By default, the versions of pypfilt, h5py, numpy and scipy are recorded. The following example demonstrates how to also record the installed version of the epifx package:

import epifx
import pypfilt.summary
params = ...
metadata = pypfilt.summary.metadata(params, {'epifx': epifx})

If the above function isn’t sufficiently flexible, several other utility functions are provided to assist with generating metadata:

pypfilt.summary.metadata_priors(params)

Return a dictionary that describes the model parameter priors.

Each key identifies a parameter (by name); the corresponding value is a string representation of the prior distribution, which is typically a numpy.random.RandomState method call.

For example:

{'alpha': "random.uniform(0.1, 1.0)"}
pypfilt.summary.encode_value(value)

Encode values in a form suitable for serialisation in HDF5 files.

  • Integer values are converted to numpy.int32 values.
  • Floating-point values and arrays retain their data type.
  • All other (i.e., non-numerical) values are converted to UTF-8 strings.
pypfilt.summary.filter_dict(values, ignore, encode_fn)

Recursively filter items from a dictionary, used to remove parameters from the metadata dictionary that, e.g., have no meaningful representation.

Parameters:
  • values – The original dictionary.
  • ignore – A dictionary that specifies which values to ignore.
  • encode_fn – A function that encodes the remaining values (see encode_value()).

For example, to ignore ['px_range'], ['resample']['rnd'], and 'expect_fn' and 'log_llhd_fn' for every observation system:

ignore = {
    'px_range': None,
    'resample': {'rnd': None},
    # Note the use of ``None`` to match any key under 'obs'.
    'obs': {None: {'expect_fn': None, 'log_llhd_fn': None}}
}
filter_dict(params, ignore, encode_value)

Summary data files

The HDF5 class encapsulates the process of calculating and recording summary statistics for each simulation.

class pypfilt.summary.HDF5(params, obs_list, meta=None, first_day=False, only_fs=False)

Save tables of summary statistics to an HDF5 file.

Parameters:
  • params – The simulation parameters.
  • obs_list – A list of all observations.
  • meta – The simulation metadata; by default the output of metadata() is used.
  • first_day – If False (the default) statistics are calculated from the date of the first observation. If True, statistics are calculated from the very beginning of the simulation period.
  • only_fs – If False (the default) statistics are calculated for the initial estimation simulation and for forecasting simulations. If True, statistics are only calculated for forecasting simulations.
add_tables(*tables)

Add summary statistic tables that will be included in the output file.

save_forecasts(fs, filename)

Save forecast summaries to disk in the HDF5 binary data format.

This function creates the following datasets that summarise the estimation and forecasting outputs:

  • 'data/TABLE' for each table.

The provided metadata will be recorded under 'meta/'.

If dataset creation timestamps are enabled, two simulations that produce identical outputs will not result in identical files. Timestamps will be disabled where possible (requires h5py >= 2.2):

  • 'hdf5_track_times': Presence of creation timestamps.
Parameters:
  • fs – Simulation outputs, as returned by pypfilt.forecast().
  • filename – The filename to which the data will be written.

Summary statistic tables

Summary statistics are stored in tables, each of which comprises a set of named columns and a specific number of rows.

The Table class

To calculate a summary statistic, you need to define a subclass of the Table class and provide implementations of each method.

class pypfilt.summary.Table(name)

The base class for summary statistic tables.

Tables are used to record rows of summary statistics as a simulation progresses.

Parameters:name – the name of the table in the output file.
dtype(params, obs_list)

Return the column names and data types, represented as a list of (name, data type) tuples. See the NumPy documentation for details.

Parameters:
  • params – The simulation parameters.
  • obs_list – A list of all observations.
Raises:

NotImplementedError – Derived classes must implement this method.

n_rows(start_date, end_date, n_days, n_sys, forecasting)

Return the number of rows required for a single simulation.

Parameters:
  • start_date – The date at which the simulation starts.
  • end_date – The date at which the simulation ends.
  • n_days – The number of days for which the simulation runs.
  • n_sys – The number of observation systems (i.e., data sources).
  • forecastingTrue if this is a forecasting simulation, otherwise False.
Raises:

NotImplementedError – Derived classes must implement this method.

add_rows(hist, weights, fs_date, dates, obs_types, insert_fn)

Record rows of summary statistics for some portion of a simulation.

Parameters:
  • hist – The particle history matrix.
  • weights – The weight of each particle at each date in the simulation window; it has dimensions (d, p) for d days and p particles.
  • fs_date – The forecasting date; if this is not a forecasting simulation, this is the date at which the simulation ends.
  • dates – A list of (datetime, ix, hist_ix) tuples that identify each day in the simulation window, the index of that day in the simulation window, and the index of that day in the particle history matrix.
  • obs_types – A set of (unit, period) tuples that identify each observation system from which observations have been taken.
  • insert_fn – A function that inserts one or more rows into the underlying data table; see the examples below.
Raises:

NotImplementedError – Derived classes must implement this method.

The row insertion function can be used as follows:

# Insert a single row, represented as a tuple.
insert_fn((x, y, z))
# Insert multiple rows, represented as a list of tuples.
insert_fn([(x0, y0, z0), (x1, y1, z1)], n=2)
finished(hist, weights, fs_date, dates, obs_types, insert_fn)

Record rows of summary statistics at the end of a simulation.

The parameters are as per add_rows().

Derived classes should only implement this method if rows must be recorded by this method; the provided method does nothing.

monitors()

Return a list of monitors required by this Table.

Derived classes should implement this method if they require one or more monitors; the provided method returns an empty list.

Predefined statistics

The following derived classes are provided to calculate basic summary statistics of any generic simulation model.

class pypfilt.summary.ModelCIs(probs=None, name=u'model_cints')

Calculate fixed-probability central credible intervals for all state variables and model parameters.

Parameters:
  • probs – an array of probabilities that define the size of each central credible interval. The default value is numpy.uint8([0, 50, 90, 95, 99, 100]).
  • name – the name of the table in the output file.
class pypfilt.summary.ParamCovar(name=u'param_covar')

Calculate the covariance between all pairs of model parameters during each simulation.

Parameters:name – the name of the table in the output file.
Utility functions

The following column types are provided for convenience when defining custom Table subclasses.

pypfilt.summary.dtype_date(name=u'date')

The dtype for columns that store dates.

pypfilt.summary.dtype_unit(obs_list, name=u'unit')

The dtype for columns that store observation units.

pypfilt.summary.dtype_period(name=u'period')

The dtype for columns that store observation periods.

pypfilt.summary.dtype_names_to_str(dtypes, encoding=u'utf-8')

Ensure that dtype field names are native strings, as required by NumPy. Unicode strings are not valid field names in Python 2, and this can cause problems when using Unicode string literals.

Parameters:
  • dtypes – A list of fields where each field is described by a tuple of length 2 or 3 (see the NumPy docs for details).
  • encoding – The encoding for converting Unicode strings to native strings in Python 2.
Returns:

A list of fields, where each field name is a native string (str type).

Raises:

ValueError – If a name cannot be converted to a native string.

Retrospective statistics

In some cases, the Table model is not sufficiently flexible, since it assumes that statistics can be calculated during the course of a simulation. For some statistics, it may be necessary to observe the entire simulation before the statistics can be calculated.

In this case, you need to define a subclass of the Monitor class, which will observe (“monitor”) each simulation and, upon completion of each simulation, can calculate the necessary summary statistics.

Note that a Table subclass is also required to define the table columns, the number of rows, and to record each row at the end of the simulation.

class pypfilt.summary.Monitor

The base class for simulation monitors.

Monitors are used to calculate quantities that:

  • Are used by multiple Tables (i.e., avoiding repeated computation); or
  • Require a complete simulation for calculation (as distinct from Tables, which incrementally record rows as a simulation progresses).

The quantities calculated by a Monitor can then be recorded by Table.add_rows() and/or Table.finished().

prepare(params, obs_list)

Perform any required preparation prior to a set of simulations.

Parameters:
  • params – The simulation parameters.
  • obs_list – A list of all observations.
begin_sim(start_date, end_date, n_days, n_sys, forecasting)

Perform any required preparation at the start of a simulation.

Parameters:
  • start_date – The date at which the simulation starts.
  • end_date – The date at which the simulation ends.
  • n_days – The number of days for which the simulation runs.
  • n_sys – The number of observation systems (i.e., data sources).
  • forecastingTrue if this is a forecasting simulation, otherwise False.
monitor(hist, weights, fs_date, dates, obs_types)

Monitor the simulation progress.

Parameters:
  • hist – The particle history matrix.
  • weights – The weight of each particle at each date in the simulation window; it has dimensions (d, p) for d days and p particles.
  • fs_date – The forecasting date; if this is not a forecasting simulation, this is the date at which the simulation ends.
  • dates – A list of (datetime, ix, hist_ix) tuples that identify each day in the simulation window, the index of that day in the simulation window, and the index of that day in the particle history matrix.
  • obs_types – A set of (unit, period) tuples that identify each observation system from which observations have been taken.
Raises:

NotImplementedError – Derived classes must implement this method.

end_sim(hist, weights, fs_date, dates, obs_types)

Finalise the data as required for the relevant summary statistics.

The parameters are as per monitor().

Derived classes should only implement this method if finalisation of the monitored data is required; the provided method does nothing.

Tables and Monitors

The methods of each Table and Monitor will be called in the following sequence by the HDF5 summary class:

  1. Before any simulations are performed:

    In addition to defining the column types for each Table, this allows objects to store the simulation parameters and observations.

  2. At the start of each simulation:

    This notifies each Monitor and each Table of the simulation period, the number of observation systems (i.e., data sources), and whether it is a forecasting simulation (where no resampling will take place).

  3. During each simulation:

    This provides a portion of the simulation period for analysis by each Monitor and each Table. Because all of the Monitor.monitor() methods are called before the Table.add_rows() methods, tables can interrogate monitors to obtain any quantities of interest that are calculated by Monitor.monitor().

  4. At the end of each simulation:

    This allows each Monitor and each Table to perform any final calculations once the simulation has completed. Because all of the Monitor.end_sim() methods are called before the Table.finished() methods, tables can interrogate monitors to obtain any quantities of interest that are calculated by Monitor.end_sim().

Unicode and byte strings

The pypfilt package simultaneously supports Python 2.7 and Python 3.x, and is intended to behave identically regardless of the Python version. It is assumed that the following Python 3 features are enabled in Python 2.7:

from __future__ import absolute_import, division, print_function
from __future__ import unicode_literals

Importantly, among the differences between Python 2.7 and Python 3.x, the native str type is a byte string in Python 2 and a Unicode string in Python 3. This means that, e.g., the str() built-in function returns byte strings in Python 2 and Unicode strings in Python 3.

Guidelines for working with text

As per the Unicode HOWTO for Python 2 and Python 3:

Tip

Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end.

To that end, adhere to the following guidelines:

  • Use Unicode strings and Unicode literals everywhere. In Python 2, this means placing the following at the top of every file:

    from __future__ import unicode_literals
    
  • If you have non-ASCII characters in a Python source file (e.g., in Unicode literals such as 'α'), you need to declare the file encoding at the top of the file:

    # -*- coding: utf-8 -*-
    
  • Encode Unicode text into UTF-8 when writing to disk:

    # Note: in Python 3, the open() built-in accepts an encoding argument
    with codecs.open(filename, 'wb', encoding='utf-8') as f:
        f.write(unicode_string)
    
  • Decode UTF-8 bytes into Unicode text when reading from disk:

    # Note: in Python 3, the open() built-in accepts an encoding argument
    with codecs.open(filename, 'rb', encoding='utf-8') as f:
        unicode_lines = f.read().splitlines()
    
  • Note that NumPy functions such as loadtxt and genfromtxt cannot reliably handle non-ASCII text (e.g., see NumPy issues #3184, #4543, #4600, #4939), and should only be used with ASCII files:

    import numpy as np
    with codecs.open(filename, encoding='ascii') as f:
        return np.loadtxt(f, ...)
    
  • Use the 'S' (byte string) data type when storing text in NumPy arrays. Encode Unicode text into UTF-8 when storing text, and decode UTF-8 bytes when reading text:

    >>> from __future__ import unicode_literals
    >>> import numpy as np
    >>> xs = np.empty(3, dtype='S20')
    >>> xs[0] = 'abc'.encode('utf-8')
    >>> xs[1] = '« äëïöü »'.encode('utf-8')
    >>> xs[2] = 'ç'.encode('utf-8')
    >>> print(max(len(x) for x in xs))
    16
    >>> for x in xs:
    >>>     print(x.decode('utf-8'))
    abc
    « äëïöü »
    ç
    
  • NumPy has a Unicode data type ('U'), but it is not supported by h5py (and is platform-specific).

  • Note that h5py object names (i.e., groups and datasets) are exclusively Unicode and are stored as bytes, so byte strings will be used as-is and Unicode strings will be encoded using UTF-8.

Functions for working with text

The pypfilt.text module provides functions for converting between Unicode strings and byte strings, which behave identically in Python 2 and Python 3.

pypfilt.text.to_unicode(value, encoding=u'utf-8')

Convert a value into a Unicode string.

  • If the value is a Unicode string, no conversion is performed.
  • If the value is a byte string, it is decoded according to the provided encoding.
  • If the value is neither a Unicode string nor a byte string, it is first converted into a string (by the str() built-in function) and then decoded if necessary.
pypfilt.text.to_bytes(value, encoding=u'utf-8')

Convert a value into a byte string.

  • If the value is a Unicode string, it is encoded according to the provided encoding.
  • If the value is a byte string, no conversion is performed.
  • If the value is neither a Unicode string nor a byte string, it is first converted into a string (by the str() built-in function) and then encoded if necessary.

It also provides functions for determining whether a value is a Unicode string or a byte string, although this should generally be known in advance.

pypfilt.text.is_unicode(value)

Return True if the value is a Unicode string.

pypfilt.text.is_bytes(value)

Return True if the value is a byte string.

Change Log

0.4.3 (2016-09-16)

  • Bug fix: correct the basic resampling method. Previously, random samples were drawn from the unit interval and were erroneously assumed to be in sorted order (as is the case for the stratified and deterministic methods).

  • Enhancement: automatically convert Unicode field names to native strings when using Python 2, to prevent NumPy from throwing a TypeError, as may occur when using from __future__ import unicode_literals.

    This functionality is provided by pypfilt.summary.dtype_names_to_str.

  • Enhancement: ensure that temporary files are deleted when the simulation process is terminated by the SIGTERM signal.

    Previously, they were only deleted upon normal termination (as noted in the atexit documentation).

  • Enhancement: consistently separate Unicode strings from bytes, and provide utility functions in the pypfilt.text module.

  • Enhancement: forecast from the most recent known-good cached state, avoiding the estimation pass whenever possible.

  • Enhancement: allow the observation table to be generated externally. This means that users can include additional columns as needed.

  • Enhancement: separate the calculation of log-likelihoods from the adjustment of particle weights, resulting in the new function pypfilt.log_llhd_of.

  • Enhancement: provide particle weights to the log-likelihood function, if the log-likelihood function accepts an extra argument. This has no impact on existing log-likelihood functions.

  • Enhancement: by default, allow simulations to continue if regularisation fails. This behaviour can be changed:

    params['resample']['regularise_or_fail'] = True
    

0.4.2 (2016-06-16)

  • Breaking change: pypfilt.forecast will raise an exception if no forecasting dates are provided.
  • Add installation instructions for Red Hat Enterprise Linux, Fedora, and Mac OS X (using Homebrew).

0.4.1 (2016-04-26)

  • Enhancement: allow forecasts to resume from cached states, greatly improving the speed with which forecasts can be generated when new or updated observations become available. This is enabled by defining a cache file:

    params['hist']['cache_file'] = 'cache.hdf5'
    
  • Enhancement: add option to restrict summary statistics to forecasting simulations, ignoring the initial estimation run. This is enabled by passing only_fs=True as an argument to the pypfilt.summary.HDF5 constructor.

0.4.0 (2016-04-22)

  • Breaking change: require models to define default parameter bounds by implementing the param_bounds method.

  • Enhancement: offer the post-regularised particle filter (post-RPF) as an alternative means of avoiding particle impoverishment (as opposed to incorporating stochastic noise into the model equations). This is enabled by setting:

    params['resample']['regularisation'] = True
    

    See the example script (./doc/example/run.py) for a demonstration.

  • Improved documentation for pypfilt.model.Base and summary statistics.

  • Add documentation for installing in a virtual environment.

0.3.0 (2016-02-23)

  • This release includes a complete overhaul of simulation metadata and summary statistics. See ./doc/example/run.py for an overview of these changes.
  • Breaking change: decrease the default resampling threshold from 75% to 25%.
  • Breaking change: define base classes for summary statistics and output.
  • Breaking change: define a base class for simulation models.
  • Breaking change: collate the resampling and history matrix parameters to reduce clutter.
  • Breaking change: move pypfilt.metadata_priors to pypfilt.summary.
  • Bug fix: prevent stats.cov_wt from mutating the history matrix.
  • Bug fix: ensure that the time-step mapping behaves as documented.
  • Bug fix: ensure that state vector slices have correct dimensions.
  • Enhancement: ensure that forecasting dates lie within the simulation period.
  • Performance improvement: Vectorise the history matrix initialisation.
  • Host the documentation at Read The Docs.

0.2.0 (2015-11-16)

  • Notify models whether the current simulation is a forecast (i.e., if there are no observations). This allows deterministic models to add noise when estimating, to allow identical particles to differ in their behaviour, and to avoid doing so when forecasting.

    Note that this is a breaking change, as it alters the parameters passed to the model update function.

  • Simplify the API for running a single simulation; pypfilt.set_limits has been removed and pypfilt.Time is not included in the API documentation, on the grounds that users should not need to make use of this class.

  • Greater use of NumPy array functions, removing the dependency on six >= 1.7.

  • Minor corrections to the example script (./doc/example/run.py).

0.1.2 (2015-06-08)

  • Avoid error messages if no logging handler is configured by the application.
  • Use a relative path for the output directory. This makes simulation metadata easier to reproduce, since the absolute path of the output directory is no longer included in the output file.
  • Build a universal wheel via python setup.py bdist_wheel, which supports both Python 2 and Python 3.

0.1.1 (2015-06-01)

  • Make the output directory a simulation parameter (out_dir) so that it can be changed without affecting the working directory, and vice versa.

0.1.0 (2015-05-29)

  • Initial release.