openghg_inversions.hbmcmc.inversion_pymc#

Functions for performing MCMC inversion with the xarray-first PyMC builder.

openghg_inversions.hbmcmc.inversion_pymc.build_inferpymc_model(inv_inputs: Dataset, *, xprior: dict | None = None, bcprior: dict | None = None, sigprior: dict | None = None, sigma_per_site: bool = True, offsetprior: dict | None = None, add_offset: bool = False, use_bc: bool = True, reparameterise_log_normal: bool = False, pollution_events_from_obs: bool = False, no_model_error: bool = False, offset_args: dict | None = None, power: dict | float = 1.99) Model#

Build the current component-based inferpymc model.

Parameters:
  • inv_inputs – Canonical inversion-input dataset, usually produced by make_inv_inputs(...). This dataset must contain the observation and model variables required by the current component-based model, including at minimum H, mf, mf_error, site_indicator, sigma_freq_index, and min_error. When use_bc is True, it must also contain H_bc.

  • xprior – Prior specification for emissions scaling factors.

  • bcprior – Prior specification for boundary-condition scaling factors.

  • sigprior – Prior specification for model-error terms.

  • sigma_per_site – Whether sigma should vary by site.

  • offsetprior – Prior specification for optional offsets.

  • add_offset – Whether to include an offset term in the model.

  • use_bc – Whether to include boundary-condition terms in the model.

  • reparameterise_log_normal – Whether to request lognormal reparameterisation for supported priors.

  • pollution_events_from_obs – Whether to derive pollution-event scaling from observations rather than modelled concentrations.

  • no_model_error – Whether to suppress the explicit model-error term.

  • offset_args – Extra keyword arguments forwarded to add_offset_component.

  • power – Exponent or prior specification used in the likelihood error scaling.

Returns:

Built PyMC model for the current inferpymc path.

openghg_inversions.hbmcmc.inversion_pymc.extend_inferencedata_predictive(trace: InferenceData, *, model: Model, sample_prior_predictive: bool | int = False, sample_posterior_predictive: bool | list[str] = False) InferenceData#

Extend an InferenceData trace with optional predictive groups.

Updates InferenceData in-place with requested groups and returns the result for convenience.

Parameters:
  • trace – Posterior trace to extend with predictive groups.

  • model – Built PyMC model used for predictive sampling.

  • sample_prior_predictive – If truthy, sample prior predictive draws and append prior and prior_predictive groups. If an integer, use that many draws; if True, reuse the posterior draw count.

  • sample_posterior_predictive – If truthy, sample posterior predictive draws and append posterior_predictive. If a list, restrict posterior predictive sampling to those variable names.

Returns:

Input trace extended with the requested predictive groups.

openghg_inversions.hbmcmc.inversion_pymc.inferpymc(inv_inputs: Dataset, xprior: dict | None = None, bcprior: dict | None = None, sigprior: dict | None = None, nuts_sampler: str = 'pymc', nit: int = 20000, burn: int = 10000, tune: int = 10000, nchain: int = 4, sigma_per_site: bool = True, offsetprior: dict | None = None, add_offset: bool = False, verbose: bool = False, use_bc: bool = True, reparameterise_log_normal: bool = False, pollution_events_from_obs: bool = False, no_model_error: bool = False, offset_args: dict | None = None, power: dict | float = 1.99, sampler_kwargs: dict | None = None) dict#

Perform Bayesian inference with PyMC for emissions, BCs, and model error.

This routine is the compatibility entrypoint for the current PyMC path. It builds the component-based model from make_inv_inputs output, runs sampling, and adapts the result into the legacy return structure used by downstream postprocessing.

Parameters:
  • inv_inputs – xarray.Dataset produced by make_inv_inputs.

  • xprior – Dictionary describing the prior PDF for emissions. The entry “pdf” is the name of the analytical PDF used; other entries are shape parameters (e.g., {‘pdf’: ‘lognormal’, ‘stdev’: 1.0}).

  • bcprior – Prior specification for boundary conditions. Only used if use_bc is True. A common choice is {‘pdf’: ‘truncatednormal’, ‘lower’: 0.0, ‘mu’: 1.0, ‘sigma’: 0.1}.

  • sigprior – Prior specification for the model-error parameter(s).

  • nuts_sampler – Name of the NUTS sampler used by pymc.sample (e.g., “pymc” or “numpyro”).

  • nit – Number of posterior draws to keep per chain. Tuning draws are controlled separately via tune.

  • burn – Number of samples to discard as burn-in.

  • tune – Number of tuning steps passed to the sampler.

  • nchain – Number of MCMC chains to run.

  • sigma_per_site – If True, estimate a separate sigma (model error) for each site.

  • offsetprior – Prior specification for offsets applied to sites or observations.

  • add_offset – If True, include an offset term in the model.

  • verbose – If True, print additional diagnostic information.

  • use_bc – If True, include boundary condition terms in the model.

  • reparameterise_log_normal – If True, reparameterise log-normal priors for numerical stability.

  • pollution_events_from_obs – If True, derive pollution event terms from observations.

  • no_model_error – If True, do not include an explicit model-error term.

  • offset_args – Additional arguments used when constructing offsets.

  • power – Exponent used in certain weighting or prior schemes; may be a dict or float.

  • sampler_kwargs – Extra keyword arguments passed to the sampler.

Returns:

Dictionary containing inference results, samples, and diagnostics in the legacy inferpymc key structure. Depending on the selected options, keys typically include:

  • "xouts": posterior samples for emissions / fluxes

  • "sigouts": posterior samples for sigma terms

  • "Ytrace": modelled concentrations

  • "bcouts" and "YBCtrace" when boundary conditions are used

  • "OFFSETtrace" when offsets are enabled

  • "trace", "model", and convergence metadata

raises ValueError:

If the model cannot be built from the supplied inv_inputs and configuration.

openghg_inversions.hbmcmc.inversion_pymc.inferpymc_postprocessouts(xouts: ndarray, sigouts: ndarray, convergence: str, Hx: ndarray, Y: ndarray, error: ndarray, Ytrace: ndarray, OFFSETtrace: ndarray, step1: str, step2: str, xprior: dict, sigprior: dict, offsetprior: dict | None, Ytime: ndarray, siteindicator: ndarray, sigma_freq_index: ndarray, domain: str, species: str, sites: list, start_date: str, end_date: str, outputname: str, outputpath: str, country_unit_prefix: str | None, burn: int, tune: int, nchain: int, sigma_per_site: bool, emissions_name: str, bcprior: dict | None = None, YBCtrace: ndarray | None = None, bcouts: ndarray | None = None, Hbc: ndarray | None = None, obs_repeatability: ndarray | None = None, obs_variability: ndarray | None = None, fp_data: dict | None = None, country_file: str | None = None, add_offset: bool = False, rerun_file: Dataset | None = None, use_bc: bool = False, min_error: float | ndarray = 0.0) Dataset#

Take the output from inferpymc function along with other input information.

Calculates statistics on them and places it all in a dataset. Also calculates statistics on posterior emissions for the countries in the inversion domain and saves all in netcdf.

Note that the uncertainties are defined by the highest posterior density (HPD) region and NOT percentiles (as the tdMCMC code). The HPD region is defined, for probability content (1-a), as:

  1. P(x ∈ R | y) = (1-a)

  2. for x1 ∈ R and x2 ∉ R, P(x1|y)>=P(x2|y)

Parameters:
  • xouts – MCMC chain for emissions scaling factors for each basis function.

  • sigouts – MCMC chain for model error.

  • convergence – Passed/Failed convergence test as to whether multiple chains have a Gelman-Rubin diagnostic value <1.05.

  • Hx – Transpose of the sensitivity matrix to map emissions to measurement. This is the same as what is given from fp_data[site].H.values, where fp_data is the output from e.g. footprint_data_merge, but where it has been stacked for all sites.

  • Y – Measurement vector containing all measurements.

  • error – Measurement error vector, containing a value for each element of Y.

  • Ytrace – Trace of modelled y values calculated from mcmc outputs and H matrices.

  • OFFSETtrace – Trace from offsets (if used).

  • step1 – Type of MCMC sampler for emissions and boundary condition updates.

  • step2 – Type of MCMC sampler for model error updates.

  • xprior – Dictionary containing information about the prior PDF for emissions. The entry “pdf” is the name of the analytical PDF used, see https://docs.pymc.io/api/distributions/continuous.html for PDFs built into pymc3, although they may have to be coded into the script. The other entries in the dictionary should correspond to the shape parameters describing that PDF as the online documentation, e.g. N(1,1**2) would be: xprior={pdf:”normal”, “mu”:1, “sigma”:1}. Note that the standard deviation should be used rather than the precision. Currently all variables are considered iid.

  • sigprior – Same as xprior but for model error.

  • offsetprior – Same as xprior but for bias offset. Only used is add_offset=True.

  • Ytime – Time stamp of measurements as used by the inversion.

  • siteindicator – Numerical indicator of which site the measurements belong to, same length at Y.

  • sigma_freq_index – Array of integer indexes that converts time into periods.

  • domain – Inversion spatial domain.

  • species – Species of interest.

  • sites – List of sites in inversion.

  • start_date – Start time of inversion “YYYY-mm-dd”.

  • end_date – End time of inversion “YYYY-mm-dd”.

  • outputname – Unique identifier for output/run name.

  • outputpath – Path to where output should be saved.

  • country_unit_prefix – A prefix for scaling the country emissions. Current options are: ‘T’ will scale to Tg, ‘G’ to Gg, ‘M’ to Mg, ‘P’ to Pg. To add additional options add to acrg_convert.prefix Default is none and no scaling will be applied (output in g).

  • burn – Number of iterations burned in MCMC

  • tune – Number of iterations used to tune step size

  • nchain – Number of independent chains run

  • sigma_per_site – Whether a model sigma value will be calculated for each site independantly (True) or all sites together (False).

  • emissions_name – List with “source” values as used when adding emissions data to the OpenGHG object store.

  • bcprior – Same as xrpior but for boundary conditions.

  • YBCtrace – Trace of modelled boundary condition values calculated from mcmc outputs and Hbc matrices

  • bcouts – MCMC chain for boundary condition scaling factors.

  • Hbc – Same as Hx but for boundary conditions

  • obs_repeatability – Instrument error

  • obs_variability – Error from resampling observations

  • fp_data – Output from footprints_data_merge + sensitivies

  • country_file – Path of country definition file

  • add_offset – Add an offset (intercept) to all sites but the first in the site list. Default False.

  • rerun_file (xarray dataset, optional) – An xarray dataset containing the ncdf output from a previous run of the MCMC code.

  • use_bc – When True, use and infer boundary conditions.

  • min_error – Minimum error to use during inversion. Only used if no_model_error is False.

Returns:

xarray dataset containing results from inversion

TO DO:
  • Look at compressability options for netcdf output

  • I’m sure the number of inputs can be cut down or found elsewhere.

  • Currently it can only work out the country total emissions if the a priori emissions are constant over the inversion period or else monthly (and inversion is for less than one calendar year).

openghg_inversions.hbmcmc.inversion_pymc.sample(model: Model, *, draws: int = 1000, tune: int = 1000, chains: int = 4, burn: int = 0, sample_prior_predictive: bool | int = False, sample_posterior_predictive: bool | list[str] = False, **kwargs: Any) InferenceData#

Sample from a built inferpymc model.

Parameters:
  • model – Built PyMC model to sample from.

  • draws – Number of posterior draws to keep per chain.

  • tune – Number of tuning draws passed to pm.sample.

  • chains – Number of MCMC chains to run.

  • burn – Number of posterior draws to discard from the returned InferenceData.

  • sample_prior_predictive – Optional prior predictive sampling request. If an integer, use that many draws; if True, reuse the posterior draw count.

  • sample_posterior_predictive – Optional posterior predictive sampling request. If a list, restrict sampling to those variable names.

  • **kwargs – Additional keyword arguments forwarded to pm.sample. return_inferencedata is always forced to True and idata_kwargs["log_likelihood"] is always enabled.

Returns:

Burn-sliced InferenceData for the requested model, optionally extended with predictive groups.