Getting started with OpenGHG Inversions#

This is an overview of what OpenGHG Inversions does, and how to use it.

Overview#

  • Countries are required to create “bottom-up” inventories of emissions. These may be totals for a country, or may be a map of estimated emissions for a given time period.

  • To check these inventories, we create “top-down” constraints by passing emissions/flux maps through a physical model and comparing the result with observations. We use a Bayesian model to update the flux maps using the given observation data.

  • The model is roughly \(\mathrm{obs} \approx \mathrm{sensitivities} \times \mathrm{flux} + \mathrm{baseline} + \mathrm{error}\)

  • The baseline is calculated by multiplying the flux at the boundaries by a sensitivity map for each boundary “curtain” (NESW).

  • Disturbances from the baseline (enhancements) are calculated by multiplying a “footprint” (a map of sensitivities to fluxes) by a flux map.

  • The sensitivities are considered deterministic, and the fluxes and boundary conditions are modelled as random quantities.

  • We place prior distributions on the fluxes and boundary conditions and use the observation data and MCMC to sample from their posterior distributions. (These are specified by the xprior and bcprior variables in the .ini file below.)

  • Roughly, an inversion attempts to solve \(\mathrm{obs} - \mathrm{baseline} \approx \mathrm{sensitivities} \times \mathrm{flux}\); the sensitivity matrix is not invertible, so a regression method such as least-squares is needed. We use a hierarchical Bayesian regression approach, which estimates uncertainties in a natural way. (A minimal numerical sketch of the forward model is given after this list.)

  • The output of an inversion contains prior and posterior: modelled observations (“ \(Y\) ” variables), fluxes, and boundary conditions.
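
As a rough numerical illustration of the forward model above, the sketch below builds the pieces by hand with NumPy. Everything in it (array names, sizes, values) is made up for illustration; in a real inversion the sensitivity matrices come from footprints and boundary-condition curtains stored in OpenGHG.

import numpy as np

# Hypothetical sizes: 100 observations, 50 flux basis regions, 4 boundary curtains (NESW)
n_obs, n_flux, n_bc = 100, 50, 4

rng = np.random.default_rng(0)
H_flux = rng.random((n_obs, n_flux))  # sensitivities of each observation to each flux region (from footprints)
H_bc = rng.random((n_obs, n_bc))      # sensitivities of each observation to each boundary curtain
x = np.ones(n_flux)                   # flux scaling factors (random quantities, prior mean 1)
bc = np.ones(n_bc)                    # boundary-condition scaling factors (also random quantities)

# obs ≈ sensitivities * flux + baseline + error
baseline = H_bc @ bc
modelled_obs = H_flux @ x + baseline

The inversion then asks: given the measured observations and the (deterministic) sensitivity matrices, which values of x and bc are most plausible, and how uncertain are they?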

What do you need to run an inversion?#

For an inversion, you need to decide the:

  • gas species (currently you can only choose one)

  • sites (how many, which sites)

  • time period (specified using start_date, end_date)

  • domain for inversion (for sites in the UK, this is usually 'EUROPE')

Once you know the rough outline of your inversion, you need to make sure the right data is stored in OpenGHG, and you will need a few extra files not stored in OpenGHG.

Data stored in OpenGHG#

  • Observation data: stored using standardise_surface; look for it using search_surface. You need observation data for each site.

  • Flux data: stored using standardise_flux; look for it using search_flux.

  • Boundary conditions data: stored using standardise_bc; look for it using search_bc.

  • Footprints: stored using standardise_footprint; look for it using search_footprints (NOTE: footprint(s) is plural for search, singular for standardise). You need separate footprints for each site.

NOTE: flux and boundary conditions data usually has a specific domain, but it is not specific to any one measurement site, so for multiple sites, you will usually use the same flux and boundary conditions.
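
A quick way to check that the required data is already in your object store is to search for it. The following is only a sketch: it assumes the search functions live in openghg.retrieve, and the keyword names (site, species, inlet, domain, source) are assumptions that must match the metadata you used when standardising.

from openghg.retrieve import search_bc, search_flux, search_footprints, search_surface

# Hypothetical example: CH4 at Tacolneston (TAC) over the EUROPE domain
obs_results = search_surface(site="TAC", species="ch4", inlet="185m")
flux_results = search_flux(species="ch4", domain="EUROPE", source="total-ukghg-edgar7")
bc_results = search_bc(species="ch4", domain="EUROPE")
fp_results = search_footprints(site="TAC", domain="EUROPE", inlet="185m")

# Each result lists the matching datasources; an empty result means that data
# still needs to be added with the corresponding standardise_* function.
print(obs_results)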

Data not stored in OpenGHG#

  • countries file

    • e.g. country_EUROPE.nc

    • default location is openghg_inversions/countries

  • basis functions

    • “basis function” usually means a netCDF file with latitude and longitude coordinates, with integer values. So all of the coordinates with value 1 are in region 1, and so on.

    • basis functions for fluxes are created on the fly by a “quadtree basis” algorithm. You can also read in pre-defined basis functions. Default location: openghg_inversions/basis_functions

    • basis functions for the boundary conditions have a similar format. Default location: openghg_inversions/bc_basis_functions

Summary

  • basis functions (for flux) are created on the fly, or you can provide them; you need to provide a country file and bc_basis_functions files.

  • all of these files can be found in /group/chemistry/acrg/LPDM.
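
If you want to inspect or hand-build a basis-function file, the sketch below shows the general idea: an integer region label for every latitude/longitude grid cell. The variable and dimension names, and the region layout, are assumptions; check an existing file (e.g. in /group/chemistry/acrg/LPDM) for the exact conventions expected by the code.

import numpy as np
import xarray as xr

# Toy 4-region basis: each grid cell is labelled with the integer ID of its region
lat = np.linspace(36.0, 66.0, 30)
lon = np.linspace(-14.0, 20.0, 34)
regions = np.ones((len(lat), len(lon)), dtype=int)
regions[15:, :] = 2   # northern half -> region 2
regions[:, 17:] += 2  # eastern half -> regions 3 and 4

basis = xr.Dataset(
    {"basis": (("lat", "lon"), regions)},
    coords={"lat": lat, "lon": lon},
)
basis.to_netcdf("basis_EUROPE_example.nc")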

How do you run an inversion?#

Method 1: python script or notebook#

Assuming you have the necessary data, you just need to call the fixedbasisMCMC function from hbmcmc.py in openghg_inversions.hbmcmc, as in the sketch below.
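
A minimal call might look like the following. The keyword names are assumed to mirror the keys in the sample .ini file at the bottom of this document; check the fixedbasisMCMC docstring for the authoritative argument list and defaults.

from openghg_inversions.hbmcmc.hbmcmc import fixedbasisMCMC

# Hypothetical one-site CH4 inversion for January 2019
fixedbasisMCMC(
    species="ch4",
    sites=["TAC"],
    averaging_period=["1H"],
    start_date="2019-01-01",
    end_date="2019-02-01",
    domain="EUROPE",
    fp_height=["185m"],
    emissions_name=["total-ukghg-edgar7"],
    outputpath="/user/work/ab12345/my_inversions",
    outputname="ch4_TAC_test",
)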

Method 2: ini file#

  • Create a my_inversion.ini file based on the template in openghg_inversions/hbmcmc/config/.

  • Activate a conda or venv environment with openghg_inversions installed, and call python <path to openghg_inversions>/openghg_inversions/hbmcmc/run_hbmcmc.py -c my_inversion.ini

  • A sample .ini script is at the bottom of this document.

Method 3: as a job on Blue Pebble#

  • Create a .sh file to run a batch job via SLURM, submitted with sbatch.

  • This file will specify:

    • the name of the job

    • resources required (number of nodes, number of tasks per node, number of cpus per task, memory required, compute time required). Typically, you need 1 node, 1 task, and as many cpus per task as chains in your inversion (each chain is an independent MCMC process; using 4 is best for checking convergence)

    • where terminal output and errors should be written

    • shell commands to run on the compute node:

      • commands to activate your virtual environment

      • commands to run an inversion (usually method 2, but you could use method 1 in a python script)

  • On the Blue Pebble login node, run sbatch my_inversion_script.sh (replace 'my_inversion_script' with whatever your script is called).

  • Use the command sacct to see what batch jobs you have running.

Note: if you are running multiple inversions with a similar ini file (for instance, several years of yearly inversions), then you should use a SLURM array job.
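
One way to set this up while keeping a single configuration is to let a small Python wrapper read the SLURM array index (SLURM sets the SLURM_ARRAY_TASK_ID environment variable for each array task, e.g. with #SBATCH --array=0-4) and derive the dates from it. The wrapper below is only a sketch with hypothetical settings; adapt the keyword names and values to however you normally launch your inversions.

import os

from openghg_inversions.hbmcmc.hbmcmc import fixedbasisMCMC

# SLURM sets this for each task in an array job: 0, 1, 2, ...
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
year = 2015 + task_id  # hypothetical mapping: task 0 -> 2015, task 1 -> 2016, ...

fixedbasisMCMC(
    species="ch4",
    sites=["TAC"],
    averaging_period=["1H"],
    start_date=f"{year}-01-01",
    end_date=f"{year + 1}-01-01",
    domain="EUROPE",
    fp_height=["185m"],
    emissions_name=["total-ukghg-edgar7"],
    outputpath="/user/work/ab12345/my_inversions",
    outputname=f"ch4_TAC_{year}",
)

In your batch script you would then call this wrapper (instead of run_hbmcmc.py) and let each array task run one year.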

Getting set up on Blue Pebble#

This assumes you can ssh into Blue Pebble, and are able to modify files and run commands there (either in the terminal or through VS Code or similar).

(This section is specific to the ACRG group at Bristol, but the parts about running jobs on SLURM could apply elsewhere.)

Virtual environment#

TODO: update conda instructions

  • Blue pebble has pre-installed software called “modules”.

    • module avail shows available modules

    • module load ... will load a module

  • Typically you will use the latest anaconda module: module load lang/python/anaconda.

  • To make your own environment for openghg_inversions, you should:

    1. make a conda env with conda create --name inv_env numpy, then activate it with conda activate inv_env (note: installing numpy from conda will install OpenBLAS, a fast linear algebra library; these libraries are in non-standard locations on Blue Pebble, so pip install numpy will not find them)

    2. clone openghg_inversions: git clone https://github.com/openghg/openghg_inversions.git

    3. install the clone into the active environment with pip install ./openghg_inversions (run from the directory where you just cloned it; the leading ./ makes pip install your local copy rather than a release from PyPI)

Example batch job script#

This script assumes that you have already created a conda env called pymc_env and you have an .ini file in a folder called my_inversions.

#!/bin/sh
# ****************************************************************************
# Wrapper script for submitting jobs on ACRC HPC
# docs: https://www.acrc.bris.ac.uk/protected/hpc-docs/index.html
# ****************************************************************************
#SBATCH --job-name=my_inv
#SBATCH --output=openghg_inversions.out
#SBATCH --error=openghg_inversions.err
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --time=04:00:00
#SBATCH --mem=30gb
#SBATCH --account=dept123456


# Set up Python environment
module --force purge
module load git/2.45.1
eval "$(conda shell.bash hook)"
conda activate pymc_env

#conda info

# run inversion script
INI_FILE=/user/home/bm13805/my_inversions/my_hbmcmc_inputs.ini
python /user/home/bm13805/openghg_inversions/openghg_inversions/hbmcmc/run_hbmcmc.py -c $INI_FILE

# check numpy config
# python -c "import numpy as np; np.show_config()"
# python -c "import pymc"

If this script is saved as my_inversion_script.sh, you would run it with sbatch my_inversion_script.sh.

If you comment out the lines under # run inversion script and uncomment the lines under # check numpy config, you can check whether numpy is using a fast linear algebra library. If it is not, you will see the following warning in the .err output file:

WARNING (pytensor): Using NumPy C-API based implementation for BLAS functions.

Your inversion will run much slower if this is the case; try using conda to install numpy before installing openghg_inversions, as mentioned above.

Work vs. Home directory#

  • Your home directory /user/home/ab12345 (replace ab12345 with your username) has a storage quota of 25GB.

  • Check your usage with the user-quota command.

  • Your work directory /user/work/ab12345 has a storage quota of 1000GB.

  • Conda environments and the inversion output files can be quite large, so if you are doing many inversions, you should put the output directory in your work directory. (And if you are making multiple conda environments, you should put them in your work directory as well.)

Sample .ini file#

The following file, my_hbmcmc_inputs.ini, can be used to run an inversion with Method 2 or Method 3 above.

; Configuration file for HBMCMC code
; Required inputs are marked as such.
; All other inputs are optional (defaults will be used)

[INPUT.MEASUREMENTS]
; Input values for extracting observations
; species (str) - species name,  e.g. "ch4", "co2", "n2o", etc.
; sites (list) - site codes as a list, e.g. ["TAC", "MHD"]
; averaging_period (list) - Time periods to average the measurements over (can be None; must match the length of sites)
; start_date (str) - Start of observations to extract (format YYYY-MM-DD)
; end_date (str) - End of observations to extract (format YYYY-MM-DD) (non-inclusive)
; inlet (list/None) - Specific inlet height for the site (list - must match number of sites)
; instrument (list/None) - Specific instrument for the site (list - must match number of sites)

species     = 'ch4'   ; (required)
sites       = ['TAC'] ; (required)
averaging_period = ['1H']  ; (required)
start_date  = '2019-01-01'      ; (required)
end_date    = '2019-02-01'      ; (required)
inlet         = ["185m"]
instrument    = ["picarro"]


[INPUT.PRIORS]
; Input values for extracting footprints, emissions and boundary conditions files (also uses values from INPUT.MEASUREMENTS)
; domain (str) - Name of inversion spatial domain
; fp_height (list) - Release height for footprints (must match number of sites).
; emissions_name (list/None) - Name for specific emissions source.

domain = 'EUROPE'  ; (required)
fp_height = ["185m"]  ; typically the same as inlet, but may differ slightly (e.g. if instrument moved to 180m, for instance)
fp_model = "NAME"  ; LPDM model, usually NAME
emissions_name = ["total-ukghg-edgar7"]  ; total = all emissions sources; agric-ukghg-edgar7 would be agricultural sources only
met_model = 'UKV'  ; or None if not specified, check the metadata for your footprint

[INPUT.BASIS_CASE]
; Input values to extract the basis cases to use within the inversion for boundary conditions and emissions
; basis_algorithm (str): Choice of basis function algorithm to use. One of "quadtree" or "weighted"
; bc_basis_case (str): Boundary conditions basis, defaults to "NESW" (looks for file format {bc_basis_case}_{domain}_*.nc)
; bc_basis_directory (str/None): Directory for bc_basis functions. If None is provided, creates a new folder in openghg_inversions and expects to find bc_basis_function files there.
; fp_basis_case (str/None): Emissions bases:
; - if specified, looks for file format {fp_basis_case}_{domain}_*.nc
; - if None, creates basis function using algorithm specified and associated parameters
; nbasis: Number of basis functions in the domain to use for the algorithm-derived basis function (rounded to a multiple of 4)
; basis_directory (str/None): Directory containing the basis functions (with domain name as subdirectories)
; country_file (str/None): Path (directory and filename) of the file containing the indices of country boundaries in the domain

basis_algorithm = "quadtree"
bc_basis_case = "NESW"
fp_basis_case = None
nbasis = 50
basis_directory = "/group/chemistry/acrg/LPDM/basis_functions/"
bc_basis_directory = None
country_file = None

[MCMC.TYPE]
; Which MCMC setup to use. This defines the function which will be called and the expected inputs.
; Options include: "fixed_basis"

mcmc_type = "fixed_basis"


[MCMC.PDF]
; Definitions of PDF shape and parameters for inputs
; - xprior (dict) - emissions
; - bcprior (dict) - boundary conditions
; - sigprior (dict) - model error

; Each of these inputs should be a dictionary with the name of a probability distribution and its shape parameters.
; See https://docs.pymc.io/api/distributions/continuous.html
; Check openghg_inversions.hbmcmc.inversion_pymc for options.

xprior   = {"pdf":"lognormal", "stdev":1}  ; lognormal with mean = 1, stdev = 1
bcprior  = {"pdf":"truncatednormal", "mu":1.0, "sigma":0.02}  ; truncated normal with mean = 1, stdev 0.02
sigprior = {"pdf":"uniform", "lower":0.5, "upper":10}
;add_offset = False
;offsetprior = {"pdf": "normal"}
;offset_args = {"drop_first": False}  ;; set to True if you want first site to have offset 0.0

[MCMC.BC_SPLIT]
; Boundary conditions setup
; - bc_freq - The period over which the baseline is estimated. e.g.
;  - None - one scaling for the whole inversion
;  - "monthly" - per calendar monthly
;  - "*D" (e.g. "30D") - per number of days (e.g. 30 days)

bc_freq    = "monthly"
sigma_freq = None
sigma_per_site = True


[MCMC.ITERATIONS]
; Iteration parameters
; nit (int) - Number of iterations for MCMC
; burn (int) - Number of iterations to burn in MCMC
; tune (int) - Number of iterations to use to tune step size

nit  = 5000
burn = 1000
tune = 2000


[MCMC.NCHAIN]
; Number of chains to run simultaneously. Must be at least 2 to allow convergence to be checked.

nchain = 4

[MCMC.OPTIONS]
; averaging_error (bool): Add variability in averaging period to the measurement error (Note: currently this
;                         doesn't work correctly)
; min_error (float): value specifying a lower bound for the model-measurement mismatch error (i.e. the error on
;                    (y - y_mod)). Ignored if calculate_min_error is set (not None).
; fix_basis_outer_regions (bool): If True, the "outer regions" of the domain use basis regions specified by a
;                                 file provided by the Met Office (from their "InTem" model), and the "inner
;                                 region", which includes the UK, is fit using our basis algorithms.
; use_bc (bool): defaults to True. If False, no boundary conditions will be used in the inversion. This
;                implicitly assumes that contributions from the boundary have been subtracted from the
;                observations.
; nuts_sampler (str): defaults to "pymc". The other option is "numpyro", which will use the JAX-accelerated sampler
;                     from Numpyro; this tends to be significantly faster than the NUTS sampler built into PyMC.
; save_trace (bool): If True, the arviz InferenceData output from sampling will be saved to the output path of
;                    the inversion, with a file name of the form f"{outputname}{start_date}_trace.nc".
;                    Alternatively, you can pass a path (including filename), and that path will be used.
; calculate_min_error: computes min_error on the fly using the "residual error method" or a method based on percentiles.
;                      values can be: "residual", "percentile", None. If value is None, then value passed to `min_error`
;                      is used.
; pollution_events_from_obs (bool): Determines whether the model error is calculated as a fraction of:
;                                   - the measured enhancement above the modelled baseline (if True)
;                                   - the prior modelled enhancement (if False)
; reparameterise_log_normal (bool): If True, rewrite log normal prior samples as a function of standard normal samples.
;                                   This may reduce divergences when sampling.
; sampler_kwargs (dict): Kwargs to pass to the sampler (e.g. sampler_kwargs = {'target_accept': 0.99})

averaging_error = True
min_error = 0.0
fix_basis_outer_regions = False
use_bc = True
nuts_sampler = "numpyro"
save_trace = True
min_error_options = {"by_site": True}  ; options to pass to function used to compute min error
pollution_events_from_obs = True
no_model_error = False
reparameterise_log_normal = False

[MCMC.OUTPUT]
; Format of output
; See docs for openghg_inversions.hbmcmc.hbmcmc.fixedbasisMCMC
; for full list of options
output_format = "hbmcmc"  ; default option, but "paris" tends to be used more

; Details of where to write the output
; outputpath (str) - directory to write output
; outputname (str) - unique identifier for output/run name.

outputpath = '/user/work/ab12345/my_inversions'  ; (required)
outputname = 'ch4_TAC_test'  ; (required)

Description of HBMCMC output file#

The output of run_hbmcmc (and fixedbasisMCMC) is an xarray Dataset with the following variables and attributes:

HBMCMC output data variables:

  • Y: Measurements used in inversion

  • Yerror: Measurement error

  • Ytime: Measurement times

  • Yapriori: Modelled measurements using a priori emissions

  • Ymod: Posterior modelled mean measurements

  • Ymod95: Posterior modelled 95% measurement uncertainty (2.5% and 97.5%)

  • Ymod68: Posterior modelled 68% measurement uncertainty (16% and 84%)

  • YmodBC: Posterior modelled mean boundary conditions

  • YaprioriBC: Modelled BCs using a priori BCs

  • xtrace: MCMC chain for basis functions

  • bctrace: MCMC chain for boundary conditions

  • sigtrace: MCMC chain for model error (currently iid)

  • siteindicator: numerical value corresponding to each site, same size as Y.

  • sitename: name of each site (in order of siteindicator)

  • site_lon: Measurement site longitude

  • site_lat: Measurement site latitude

  • aprioriflux: Mean a priori flux for the whole domain at NAME resolution. E.g., if the emissions file is monthly and you are running for April and May, it will take the mean emissions weighted by the number of days in each month.

  • meanflux: Mean posterior flux for whole domain at NAME resolution

  • meanscaling: Mean posterior scaling for whole domain at NAME resolution

  • basis_functions: Basis functions used at NAME resolution

  • countrytotals: Mean posterior total emissions for every country in domain country file

  • country68: 68% uncertainty for country totals (16% and 84%)

  • country95: 95% uncertainty for country totals (2.5% and 97.5%)

  • countrysd: standard deviation for country totals

Attributes:

  • Units for all variables that need them

  • Start date of inversion period

  • End date of inversion period

  • Type(s) of MCMC sampler used

  • Prior PDF for emissions scaling

  • Prior PDF for BC scaling

  • Prior PDF for model error

  • Creator

  • Date created

  • Convergence: Passed/Failed based on multiple chains (min 2) having Gelman-Rubin diagnostic < 1.05
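
As a quick sanity check on a finished run, you can open the output file with xarray and look at a few of these variables. The path below is hypothetical; the variable names are the ones listed above.

import xarray as xr

# Hypothetical output file name: outputname plus start date, written to outputpath
ds = xr.open_dataset("/user/work/ab12345/my_inversions/ch4_TAC_test_2019-01-01.nc")

print(ds.attrs)             # units, inversion period, priors, convergence flag, ...
print(ds["countrytotals"])  # mean posterior totals for each country in the country file

# Compare observations with the prior and posterior modelled values
obs, prior, posterior = ds["Y"], ds["Yapriori"], ds["Ymod"]
print("mean |obs - prior|:     ", float(abs(obs - prior).mean()))
print("mean |obs - posterior|: ", float(abs(obs - posterior).mean()))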