openghg_inversions.array_ops#
General methods for xarray Datasets and DataArrays.
The functions here are not specific to OpenGHG inversions: they add functionality missing from xarray. These functions should accept xarray Datasets and DataArrays, and return either a Dataset or a DataArray.
Functions#
- get_xr_dummies
Applies pandas
get_dummiesto xarray DataArrays.- sparse_xr_dot
Multiplies a Dataset or DataArray by a DataArray with sparse underlying array. The built-in xarray functionality doesn’t work correctly.
Coordinate Alignment Policy#
Spatial coordinates (e.g. lat and lon) are treated as structural
invariants of the inversion workflow. Although grids are expected to be
identical, small floating-point differences can arise from I/O, reprojection,
or sparse/dask operations.
Because xarray uses strict, label-based alignment, even negligible coordinate differences can trigger unintended reindexing, interpolation, or alignment errors.
To avoid this, we:
Validate numerical equivalence within tolerance.
Force coordinate identity via
assign_coordswhen validation passes.
This ensures deterministic arithmetic, prevents silent interpolation, and keeps grid equivalence an explicit, testable invariant.
The main function that does this is force_align; there is a legacy
function align_sparse_lat_lon that was introduced as a work-around to
an xarray issue, which now is written in terms of force_align.
- openghg_inversions.array_ops.align_sparse_lat_lon(sparse_da: DataArray, other_array: DataArray | Dataset, *, rtol: float = 1e-05, atol: float = 1e-08) DataArray#
Align lat/lon coordinates of sparse_da with those from other_array.
NOTE: Workaround for xarray issue #3445: pydata/xarray#3445
Validates numerical closeness of coordinates before forcing coordinate identity. No interpolation or reindexing is performed.
- Parameters:
sparse_da – DataArray with sparse backend.
other_array – Dataset or DataArray providing canonical lat/lon coords.
rtol – Relative tolerance for coordinate comparison.
atol – Absolute tolerance for coordinate comparison.
- Returns:
Copy of sparse_da with lat/lon coords replaced by those from other_array.
- Raises:
ValueError – If sizes differ or coordinates differ beyond tolerance.
- openghg_inversions.array_ops.align_to_multi_index_level_values(da: DataArray, *, multi_index: DataArray, multi_dim: str, level: str, other_dim: str) DataArray#
Broadcast da(other_dim, …) onto multi_dim using a MultiIndex level.
Raises a ValueError if any other MultiIndex level names are already present as coordinate variables on da, because this creates ambiguous/conflicting index ownership when the MultiIndex is assigned.
- Parameters:
da – DataArray to align, e.g. fp_x_flux with dimension other_dim=”source”.
multi_index – Coordinate for the MultiIndex dimension (e.g. basis_matrix[“state”]). Must be a MultiIndex coordinate that includes the level named by level.
multi_dim – Name of the MultiIndex dimension to align onto (e.g. “state”).
level – Name of the MultiIndex level within multi_index to use for alignment (e.g. “source”).
other_dim – Dimension on da that corresponds to level (e.g. “source”).
- Returns:
A DataArray aligned to multi_dim, with a canonical MultiIndex coordinate installed on multi_dim.
- Raises:
ValueError – If level is not a level of multi_index, or if other level names are already present as coordinates on da.
- openghg_inversions.array_ops.concat_data_arrays(da_dict: Mapping[str, DataArray], key_dim: str, **concat_kwargs) DataArray#
- openghg_inversions.array_ops.concat_gather_data_arrays(da_dict: Mapping[str, DataArray], key_dim: str, ragged_dim: str, stack_dim: str | None = None, **concat_kwargs) DataArray#
Concatenate DataArrays by gathering along ragged coordinate.
For example, if the keys are site codes and the ragged dimension is time, then the “stacked dimension” will be the usual nmeasure coordinate.
- Parameters:
da_dict – dictionary of DataArrays
key_dim – dimension name for the keys of the dictionary
ragged_dim – name of the ragged dimension
stack_dim – name for the “stacked” multi-index dimension
**concat_kwargs – arguments to pass to xr.concat
- Returns:
Combined DataArray with new stacked dimension.
- openghg_inversions.array_ops.concat_gather_datasets(ds_dict: Mapping[str, Dataset], key_dim: str, ragged_dim: str, stack_dim: str | None = None, missing_data_vars: Literal['error', 'drop'] = 'error', **concat_kwargs) Dataset#
Concatenate dictionary of xr.Datasets by gathering ragged coordinates.
- Parameters:
ds_dict – Mapping from key values to datasets to concatenate.
key_dim – Name for the key dimension level in the output MultiIndex.
ragged_dim – Ragged coordinate dimension to gather across datasets.
stack_dim – Optional name for the gathered dimension in the output.
missing_data_vars – Policy for handling data variables that are not shared by every dataset. Use
"error"to raise aValueErrorlisting the per-dataset differences, or"drop"to keep only the intersection of shared data variables and emit a warning when any are dropped.**concat_kwargs – Additional keyword arguments passed to
xr.concat.
- Returns:
Dataset containing gathered versions of the selected data variables.
- Raises:
ValueError – If
missing_data_vars="error"and the datasets do not all have identical data-variable sets, or ifmissing_data_varsis not recognised.
- openghg_inversions.array_ops.concat_gather_datatree(dt: DataTree, key_dim: str, ragged_dim: str, stack_dim: str | None = None, missing_data_vars: Literal['error', 'drop'] = 'error', **concat_kwargs) Dataset#
Concatenate xr.DataTree children by gathering ragged coordinates.
- Parameters:
dt – DataTree whose children will be converted to datasets and gathered.
key_dim – Name for the key dimension level in the output MultiIndex.
ragged_dim – Ragged coordinate dimension to gather across children.
stack_dim – Optional name for the gathered dimension in the output.
missing_data_vars – Policy for handling child data variables that are not shared by every dataset. See
concat_gather_datasets.**concat_kwargs – Additional keyword arguments passed to
xr.concat.
- Returns:
Dataset containing gathered versions of the selected data variables.
- openghg_inversions.array_ops.force_align(obj: T, reference: DataArray | Dataset, *, dims: Iterable[Hashable], rtol: float = 1e-05, atol: float = 1e-08) T#
Force coordinate identity with a reference object along given dimensions.
This function verifies that obj and reference share numerically equivalent coordinates (within tolerance) along the specified dimensions. If validation succeeds, the coordinates of obj are reassigned to be identical (object identity) to those of reference.
No interpolation or reindexing is performed. If coordinates differ beyond tolerance, a ValueError is raised.
- Parameters:
obj – The xarray DataArray or Dataset whose coordinates will be validated and reassigned.
reference – The object providing canonical coordinates. Must contain the specified dimensions.
dims – Dimensions along which to validate and enforce coordinate identity (e.g., [“lat”, “lon”]).
rtol – Relative tolerance for coordinate comparison. Defaults to NumPy’s default (1e-5).
atol – Absolute tolerance for coordinate comparison. Defaults to NumPy’s default (1e-8).
- Returns:
The same type as obj, with validated coordinates reassigned.
- Raises:
ValueError – If a dimension is missing, sizes differ, or coordinates differ beyond tolerance.
- openghg_inversions.array_ops.get_xr_dummies(da: DataArray, categories: Sequence[Any] | Index | DataArray | ndarray | None = None, cat_dim: str = 'categories', return_sparse: bool = True) DataArray#
Create 0-1 dummy matrix from DataArray with values that correspond to categories.
If the values of da are integers 0-N, then the result has N + 1 columns, and the (i, j) coordiante of the result is 1 if da[i] == j, and is 0 otherwise.
This function works like the pandas function get_dummies, but preserves the coordinates of the input data, and allowing the user to specify coordinates for the categories used to make the “dummies” (or “one-hot encoding”).
- Parameters:
da – DataArray encoding categories.
categories – optional coordinates for categories.
cat_dim – dimension for categories coordinate
sparse – if True, store values in sparse.COO matrix
- Returns:
- Dummy matrix corresponding to the input vector. Its dimensions are the same as the
input DataArray, plus an additional “categories” dimension, which has one value for each distinct value in the input DataArray.
- openghg_inversions.array_ops.sparse_xr_dot(da1: DataArray, da2: DataArray, dim: list[str] | None = None) DataArray#
- openghg_inversions.array_ops.sparse_xr_dot(da1: DataArray, da2: Dataset, dim: list[str] | None = None) Dataset
Compute the matrix “dot” of a tuple of DataArrays with sparse.COO values.
This multiplies and sums over all common dimensions of the input DataArrays, and preserves the coordinates and dimensions that are not summed over.
Common dimensions are automatically selected by name. The input arrays must have at least one dimension in common. All matching dimensions will be used for multiplication.
Compared to just using da1 @ da2, this function has two advantages: 1. if da1 is sparse but not a dask array, then da1 @ da2 will fail if da2 is a dask array 2. da2 can be a Dataset, and current DataArray @ Dataset is not allowed by xarray
- Parameters:
da1 – xr.DataArrays to multiply and sum along common dimensions.
da2 – xr.DataArrays to multiply and sum along common dimensions.
dim – optional list of dimensions to sum over; if None, then all common dimensions are summed over.
- Returns:
- containing the result of matrix/tensor multiplication.
The type that is returned will be the same as the type of da2.
- Return type:
xr.Dataset or xr.DataArray