Working with data frames

We have already shown how to run forecasts with pypfilt.forecast() and inspect the results returned by this function. For a more robust workflow, we should also save the forecast results to an output (HDF5) file:

pypfilt.forecast(context, forecast_times, filename='output_file.hdf5')

Note

HDF5 is a file format that allows you to store lots of data tables and related metadata in a single file. You can explore the contents of an HDF5 file with the h5py package.
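As a rough illustration of exploring an HDF5 file with h5py, the sketch below writes a small stand-in file and then lists every dataset it contains. The group name `tables/example`, the column names, and the `note` attribute are made up for this example; they are not the layout that pypfilt writes, so treat this only as a pattern for inspecting a file you have produced.

```python
import os
import tempfile

import h5py
import numpy as np

# Create a small stand-in file. The layout here is hypothetical and is NOT
# the layout of a real pypfilt output file.
path = os.path.join(tempfile.mkdtemp(), 'example.hdf5')
with h5py.File(path, 'w') as f:
    table = np.array(
        [(0.0, 1.5), (1.0, 2.5)],
        dtype=[('time', 'f8'), ('value', 'f8')],
    )
    # Intermediate groups ('tables') are created automatically.
    f.create_dataset('tables/example', data=table)
    f.attrs['note'] = 'illustrative metadata'

# Record the shape and column names of every dataset in the file.
dataset_info = {}


def record(name, obj):
    """Store details of each dataset encountered while walking the file."""
    if isinstance(obj, h5py.Dataset):
        dataset_info[name] = (obj.shape, obj.dtype.names)


with h5py.File(path, 'r') as f:
    f.visititems(record)
    # Metadata attached to the root of the file.
    root_attrs = dict(f.attrs)

for name, (shape, columns) in dataset_info.items():
    print(name, shape, columns)
```

To inspect a real output file, the same `visititems()` walk can be pointed at the filename that was passed to pypfilt.forecast().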

Loading data frames

The pypfilt.io module provides functions for loading results as Pandas or Polars data frames, including:

load_observations() for loading input observations; and
load_summary_tables() for loading summary statistic tables.

This offers greater flexibility in analysing and visualising the results of your simulations.

Summary table structure

Recall that in "Plotting the results" we had to combine the credible intervals from the estimation pass with those from the forecast at \(t = 20\):

# Collect credible intervals for the recent backcast and the forecast.
fit = results.estimation.tables['forecasts']
forecast = results.forecasts[forecast_time].tables['forecasts']
credible_intervals = np.concatenate(
    (fit[fit['time'] >= backcast_time], forecast)
)

When saving results to an HDF5 file, the summary tables for the estimation pass and each forecasting pass are combined into a single table. For example, the 'forecasts' tables from the estimation pass and the forecast at \(t = 20\) (as shown in the code block above) are combined into a single 'forecasts' table that contains both sets of results.

These combined tables contain an additional 'fs_time' column:

For each forecast, the 'fs_time' column will contain the forecast time. For example, if we run a forecast from \(t = 20\), the 'fs_time' column will contain the value 20 for the rows associated with this forecast.
For the estimation pass, the 'fs_time' column will contain the end of the simulation period. For example, the simulation period for each scenario in this tutorial starts at \(t = 0\) and ends at \(t = 25\), and so the 'fs_time' column will contain the value 25 for the rows associated with the estimation pass.
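The 'fs_time' convention described above means that the estimation pass and a given forecast can be separated again with simple filters. The following is a minimal sketch using a hand-built Pandas data frame; the 'ymin' and 'ymax' columns and all of the values are hypothetical stand-ins, and a real combined table would instead be loaded from the output file. It reproduces the backcast-plus-forecast selection from the earlier code block.

```python
import pandas as pd

# Hypothetical combined 'forecasts' table; the values are illustrative only.
df = pd.DataFrame({
    # 25 = end of the simulation period (estimation pass), 20 = forecast time.
    'fs_time': [25.0, 25.0, 25.0, 20.0, 20.0],
    'time': [18.0, 19.0, 20.0, 21.0, 22.0],
    'ymin': [1.0, 1.1, 1.2, 1.0, 0.9],
    'ymax': [2.0, 2.1, 2.2, 2.6, 2.9],
})

end_time = 25.0        # End of the simulation period.
forecast_time = 20.0   # The forecast we want to plot.
backcast_time = 19.0   # Earliest estimation-pass time to include.

# Rows from the estimation pass have fs_time equal to the end of the
# simulation period; rows from a forecast have fs_time equal to its
# forecast time.
is_fit = df['fs_time'] == end_time
is_forecast = df['fs_time'] == forecast_time

# Keep the recent backcast from the estimation pass, plus the forecast.
credible_intervals = df[(is_fit & (df['time'] >= backcast_time)) | is_forecast]
print(credible_intervals)
```

Because both sets of rows live in one table, a single boolean mask replaces the separate lookups and the np.concatenate() call used with the in-memory results.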

Reproducibility

Saving the results to an HDF5 file also has several advantages for reproducibility:

The file collects all of the input observations and summary tables in one place;
The file also contains the scenario settings, and the version numbers of pypfilt and related packages;
If you are working in a git repository, the file will also include the current git commit, branch name, and a list of modified files.