openghg_inversions.hbmcmc.inversion_pymc#

Functions for performing MCMC inversion. PyMC library used for Bayesian modelling.

openghg_inversions.hbmcmc.inversion_pymc.inferpymc(Hx: ndarray, Y: ndarray, error: ndarray, siteindicator: ndarray, sigma_freq_index: ndarray, Hbc: ndarray | None = None, xprior: dict = {'mu': 1.0, 'pdf': 'normal', 'sigma': 1.0}, bcprior: dict = {'mu': 1.0, 'pdf': 'normal', 'sigma': 1.0}, sigprior: dict = {'lower': 0.1, 'pdf': 'uniform', 'upper': 3.0}, nuts_sampler: str = 'pymc', nit: int = 20000, burn: int = 10000, tune: int = 10000, nchain: int = 4, sigma_per_site: bool = True, offsetprior: dict = {'mu': 0, 'pdf': 'normal', 'sigma': 1}, add_offset: bool = False, verbose: bool = False, min_error: ndarray | float | None = 0.0, use_bc: bool = True, reparameterise_log_normal: bool = False, pollution_events_from_obs: bool = False, no_model_error: bool = False, offset_args: dict | None = None, power: dict | float = 1.99, sampler_kwargs: dict | None = None) dict#

Uses PyMC module for Bayesian inference for emissions field, boundary conditions and (currently) a single model error value. This uses a Normal likelihood but the (hyper)prior PDFs can be selected by user.

Parameters:
  • Hx – Transpose of the sensitivity matrix to map emissions to measurement. This is the same as what is given from fp_data[site].H.values, where fp_data is the output from e.g. footprint_data_merge, but where it has been stacked for all sites.

  • Y – Measurement vector containing all measurements

  • error – Measurement error vector, containg a value for each element of Y.

  • siteindicator – Array of indexing integers that relate each measurement to a site

  • sigma_freq_index – Array of integer indexes that converts time into periods

  • Hbc – Same as Hx but for boundary conditions. Only used if use_bc=True.

  • xprior – Dictionary containing information about the prior PDF for emissions. The entry “pdf” is the name of the analytical PDF used, see https://docs.pymc.io/api/distributions/continuous.html for PDFs built into pymc3, although they may have to be coded into the script. The other entries in the dictionary should correspond to the shape parameters describing that PDF as the online documentation, e.g. N(1,1**2) would be: xprior={pdf: “normal”, “mu”: 1.0, “sigma”: 1.0}. Note that the standard deviation should be used rather than the precision. Currently all variables are considered iid.

  • bcprior – Same as xprior but for boundary conditions. Only used if use_bc=True.

  • sigprior – Same as xprior but for model error.

  • nuts_sampler – nuts_sampler use by pymc.sample. Options are “pymc” and “numpyro”?

  • nit – number of samples to generate (per chain)

  • burn – number of samples to discard (or “burn”) from the beginning of each chain

  • tune – number of tuning steps used by sampler

  • nchain – number of chains use by sampler. You should use at least 2 chains for the convergence checks to work; four chains is better. Chains run in parallel, so the number of chains doesn’t affect running time, provided the number of threads available is at least the number of chains.

  • sigma_per_site (bool) – Whether a model sigma value will be calculated for each site independantly (True) or all sites together (False). Default: True

  • offsetprior (dict) – Same as above but for bias offset. Only used is addoffset=True.

  • add_offset (bool) – Add an offset (intercept) to all sites but the first in the site list. Default False.

  • verbose – When True, prints progress bar

  • min_error – Minimum error to use during inversion. Only used if no_model_error is False.

  • save_trace – Path where to save the trace. If None, the trace is not saved. Default None.

  • use_bc – When True, use and infer boundary conditions.

  • reparameterise_log_normal – If there are many divergences when using a log normal prior, setting this to True might help. It samples from a normal prior, then puts the normal samples through a function that converts them to log normal samples; this changes the space the sampler needs to explore.

  • pollution_events_from_obs – When True, calculate the pollution events from obs; when false pollution events are set to the modeled concentration.

  • no_model_error – When True, only use observation error in likelihood function (omitting min. model error and model error from scaling pollution events.)

  • offset_args – optional arguments to pass to make_offset.

  • power – power to raise pollution events to when using pollution events from obs. Default is 1.99. Any value (strictly) between 1 and 2 will work. If a dictionary is passed, this is used to create a prior for the power, making the power a hyper-parameter.

Returns:

xouts (array):

MCMC chain for emissions scaling factors for each basis function.

sigouts (array):

MCMC chain for model error.

Ytrace (array):

MCMC chain for modelled obs..

OFFSETtrace (array):

MCMC chain for the offset.

convergence (str):

Passed/Failed convergence test as to whether mutliple chains have a Gelman-Rubin diagnostic value <1.05

step1 (str):

Type of MCMC sampler for emissions and boundary condition updates. Currently it’s hardwired to NUTS (probably wouldn’t change this unless you’re doing something obscure).

step2 (str):

Type of MCMC sampler for model error updates. Currently it’s hardwired to a slice sampler. This parameter is low dimensional and quite simple with a slice sampler, although could easily be changed.

bcouts (array):

MCMC chain for boundary condition scaling factors. Only if use_bc is True.

YBCtrace (array):

MCMC chain for modelled boundary condition Only if use_bc is True.

Return type:

Dictionary containing

TO DO:
  • Allow non-iid variables

openghg_inversions.hbmcmc.inversion_pymc.inferpymc_postprocessouts(xouts: ndarray, sigouts: ndarray, convergence: str, Hx: ndarray, Y: ndarray, error: ndarray, Ytrace: ndarray, OFFSETtrace: ndarray, step1: str, step2: str, xprior: dict, sigprior: dict, offsetprior: dict | None, Ytime: ndarray, siteindicator: ndarray, sigma_freq_index: ndarray, domain: str, species: str, sites: list, start_date: str, end_date: str, outputname: str, outputpath: str, country_unit_prefix: str | None, burn: int, tune: int, nchain: int, sigma_per_site: bool, emissions_name: str, bcprior: dict | None = None, YBCtrace: ndarray | None = None, bcouts: ndarray | None = None, Hbc: ndarray | None = None, obs_repeatability: ndarray | None = None, obs_variability: ndarray | None = None, fp_data: dict | None = None, country_file: str | None = None, add_offset: bool = False, rerun_file: Dataset | None = None, use_bc: bool = False, min_error: float | ndarray = 0.0) Dataset#

Take the output from inferpymc function along with other input information.

Calculates statistics on them and places it all in a dataset. Also calculates statistics on posterior emissions for the countries in the inversion domain and saves all in netcdf.

Note that the uncertainties are defined by the highest posterior density (HPD) region and NOT percentiles (as the tdMCMC code). The HPD region is defined, for probability content (1-a), as:

  1. P(x ∈ R | y) = (1-a)

  2. for x1 ∈ R and x2 ∉ R, P(x1|y)>=P(x2|y)

Parameters:
  • xouts – MCMC chain for emissions scaling factors for each basis function.

  • sigouts – MCMC chain for model error.

  • convergence – Passed/Failed convergence test as to whether multiple chains have a Gelman-Rubin diagnostic value <1.05.

  • Hx – Transpose of the sensitivity matrix to map emissions to measurement. This is the same as what is given from fp_data[site].H.values, where fp_data is the output from e.g. footprint_data_merge, but where it has been stacked for all sites.

  • Y – Measurement vector containing all measurements.

  • error – Measurement error vector, containing a value for each element of Y.

  • Ytrace – Trace of modelled y values calculated from mcmc outputs and H matrices.

  • OFFSETtrace – Trace from offsets (if used).

  • step1 – Type of MCMC sampler for emissions and boundary condition updates.

  • step2 – Type of MCMC sampler for model error updates.

  • xprior – Dictionary containing information about the prior PDF for emissions. The entry “pdf” is the name of the analytical PDF used, see https://docs.pymc.io/api/distributions/continuous.html for PDFs built into pymc3, although they may have to be coded into the script. The other entries in the dictionary should correspond to the shape parameters describing that PDF as the online documentation, e.g. N(1,1**2) would be: xprior={pdf:”normal”, “mu”:1, “sigma”:1}. Note that the standard deviation should be used rather than the precision. Currently all variables are considered iid.

  • sigprior – Same as xprior but for model error.

  • offsetprior – Same as xprior but for bias offset. Only used is add_offset=True.

  • Ytime – Time stamp of measurements as used by the inversion.

  • siteindicator – Numerical indicator of which site the measurements belong to, same length at Y.

  • sigma_freq_index – Array of integer indexes that converts time into periods.

  • domain – Inversion spatial domain.

  • species – Species of interest.

  • sites – List of sites in inversion.

  • start_date – Start time of inversion “YYYY-mm-dd”.

  • end_date – End time of inversion “YYYY-mm-dd”.

  • outputname – Unique identifier for output/run name.

  • outputpath – Path to where output should be saved.

  • country_unit_prefix – A prefix for scaling the country emissions. Current options are: ‘T’ will scale to Tg, ‘G’ to Gg, ‘M’ to Mg, ‘P’ to Pg. To add additional options add to acrg_convert.prefix Default is none and no scaling will be applied (output in g).

  • burn – Number of iterations burned in MCMC

  • tune – Number of iterations used to tune step size

  • nchain – Number of independent chains run

  • sigma_per_site – Whether a model sigma value will be calculated for each site independantly (True) or all sites together (False).

  • emissions_name – List with “source” values as used when adding emissions data to the OpenGHG object store.

  • bcprior – Same as xrpior but for boundary conditions.

  • YBCtrace – Trace of modelled boundary condition values calculated from mcmc outputs and Hbc matrices

  • bcouts – MCMC chain for boundary condition scaling factors.

  • Hbc – Same as Hx but for boundary conditions

  • obs_repeatability – Instrument error

  • obs_variability – Error from resampling observations

  • fp_data – Output from footprints_data_merge + sensitivies

  • country_file – Path of country definition file

  • add_offset – Add an offset (intercept) to all sites but the first in the site list. Default False.

  • rerun_file (xarray dataset, optional) – An xarray dataset containing the ncdf output from a previous run of the MCMC code.

  • use_bc – When True, use and infer boundary conditions.

  • min_error – Minimum error to use during inversion. Only used if no_model_error is False.

Returns:

xarray dataset containing results from inversion

TO DO:
  • Look at compressability options for netcdf output

  • I’m sure the number of inputs can be cut down or found elsewhere.

  • Currently it can only work out the country total emissions if the a priori emissions are constant over the inversion period or else monthly (and inversion is for less than one calendar year).

openghg_inversions.hbmcmc.inversion_pymc.lognormal_mu_sigma(mean: float, stdev: float) tuple[float, float]#

Return the pymc mu and sigma parameters that give a log normal distribution with the given mean and stdev.

Parameters:
  • mean – desired mean of log normal

  • stdev – desired standard deviation of log normal

Returns:

tuple (mu, sigma), where pymc.LogNormal(mu, sigma) has the given mean and stdev.

Formulas for log normal mean and variance:

mean = exp(mu + 0.5 * sigma ** 2) stdev ** 2 = var = exp(2*mu + sigma ** 2) * (exp(sigma ** 2) - 1)

This gives linear equations for mu and sigma ** 2:

mu + 0.5 * sigma ** 2 = log(mean) sigma ** 2 = log(1 + (stdev / mean)**2)

So

mu = log(mean) - 0.5 * log(1 + (stdev/mean)**2) sigma = sqrt(log(1 + (stdev / mean)**2))

openghg_inversions.hbmcmc.inversion_pymc.parse_prior(name: str, prior_params: dict[str, str | float], **kwargs) TensorVariable#

Parses all PyMC continuous distributions: https://docs.pymc.io/api/distributions/continuous.html.

Parameters:
  • name – name of variable in the pymc model

  • prior_params – dict of parameters for the distribution, including ‘pdf’ for the distribution to use. The value of prior_params[“pdf”] must match the name of a PyMC continuous distribution: https://docs.pymc.io/api/distributions/continuous.html

  • **kwargs – for instance, shape or dims

Returns:

continuous PyMC distribution

For example: ` params = {"pdf": "uniform", "lower": 0.0, "upper": 1.0} parse_prior("x", params, shape=(20, 20)) ` will create a 20 x 20 array of uniform random variables. Alternatively, ` params = {"pdf": "uniform", "lower": 0.0, "upper": 1.0} parse_prior("x", params, dims="nmeasure")) ` will create an array of uniform random variables with the same shape as the dimension coordinate nmeasure. This can be used if pm.Model is provided with coordinates.

Note: parse_prior must be called inside a pm.Model context (i.e. after with pm.Model()) has an important side-effect of registering the random variable with the model.