Ideas

This note collects possible next-step ideas that build on the current artifact locator work without committing the core package to them yet.

Grouped Search Results

Current search returns one record per matching artifact. That is the right base behavior for precise file-level lookup, but it can be noisy for datasets that naturally form monthly file series.

Possible extension:

  • keep one record per file in storage

  • add a helper that groups search results by selected keys such as:

    • site

    • inlet

    • model

    • met_model

    • domain

    • species

  • return summary fields such as:

    • record_count

    • start_date_min

    • start_date_max

    • years

    • months

    • possible gap information later
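The grouping helper described above could sit on top of the existing per-file search. A minimal sketch, assuming search results are plain dicts carrying the metadata keys plus a start_date string (the real record shape may differ):

```python
from collections import defaultdict

def group_records(records, keys=("site", "inlet", "model", "met_model", "domain", "species")):
    """Group flat search results by selected metadata keys (illustrative sketch)."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec.get(k) for k in keys)].append(rec)

    summaries = []
    for group_key, members in groups.items():
        dates = sorted(m["start_date"] for m in members)
        summaries.append({
            **dict(zip(keys, group_key)),
            "record_count": len(members),
            "start_date_min": dates[0],
            "start_date_max": dates[-1],
            # Derive coverage summaries from ISO-formatted date strings.
            "years": sorted({d[:4] for d in dates}),
            "months": sorted({d[:7] for d in dates}),
        })
    return summaries
```

Gap detection could later be layered on by comparing the months list against the full month range between start_date_min and start_date_max.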

Why it makes sense:

  • preserves the current simple storage model

  • avoids losing per-file fidelity

  • improves usability for monthly-series datasets such as footprints

Related CLI follow-up:

  • current CLI search output is still record-oriented

  • future work could add grouped or collapsed search views, especially for monthly file series

  • output modes such as --paths may need explicit semantics for mixed path-backed and non-path-backed result sets

Collection Records

Another direction is to represent a logical collection as a single record, instead of one record per file.

Example:

  • one record for /group/chem/acrg/LPDM/fp_NAME/EUROPE/MHD-10magl/co2/

  • metadata could include:

    • site=MHD

    • inlet=10m

    • model=NAME

    • met_model=UKV

    • domain=EUROPE

    • species=co2

    • file_count

    • start_date

    • end_date

    • file_pattern

This would likely be a different record_type, such as:

  • external_collection

  • managed_store
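For illustration, a collection record for the example above might look like the following; the field names and record_type value are hypothetical, not a committed schema:

```python
import json

# Hypothetical single-record representation of a directory-backed collection.
collection_record = {
    "record_type": "external_collection",
    "path": "/group/chem/acrg/LPDM/fp_NAME/EUROPE/MHD-10magl/co2/",
    "metadata": {
        "site": "MHD",
        "inlet": "10m",
        "model": "NAME",
        "met_model": "UKV",
        "domain": "EUROPE",
        "species": "co2",
        "file_count": 24,
        "start_date": "2018-01-01",
        "end_date": "2019-12-31",
        "file_pattern": "*.nc",
    },
}

# The record stays JSON-compatible, which matters for document-style backends.
serialised = json.dumps(collection_record)
```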

Why it makes sense:

  • better matches “one dataset, many files”

  • leaves room for directory-backed stores and transform outputs

Tradeoff:

  • collection records are convenient summaries, but they lose direct per-file visibility unless paired with file-level records or derived indexes

Managed Collection Updates And Member Manifests

Archive-backed scientific datasets often start as many transport files that should become one logical collection. For example, each annual .zip may contain one NetCDF file, and all extracted NetCDF files should live in the same managed directory.

Current behavior is best suited to:

  • one record per extracted file; or

  • one collection record written in a single operation by a custom writer.

It is not yet a good fit for incrementally appending files to one managed directory while updating the same catalog record’s metadata. That would require new semantics beyond the current “target should be absent, writer materialises it, record is inserted” flow.

Possible future directions:

  • add an explicit update/upsert API for catalog records

  • distinguish create-only writers from append/update writers

  • keep the core update API small, with plugins defining the concrete semantics for a given artifact shape

  • let a collection writer return a structured manifest of members, including:

    • source archive path

    • archive member name

    • stored relative path

    • year or other member-level keys

    • size and checksum

  • store collection-level summaries such as file_count, years, time_coverage, and member_glob

  • decide whether metadata updates replace, merge, or recompute derived collection metadata

  • make rollback behavior explicit for append operations, since deleting a whole directory may be wrong once a collection already exists

This also points toward a future artifact descriptor model where one logical record can have many concrete artifact members. Until then, a JSON-compatible manifest in derived_metadata could be a useful prototype.
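A manifest entry of that shape could be built with a small helper; the field names here are illustrative, and the helper itself is hypothetical:

```python
import hashlib
from pathlib import Path

def member_entry(source_archive, member_name, stored_relpath, store_root, year=None):
    """Build one JSON-compatible manifest entry for a collection member.

    size and sha256 are computed from the stored file so the manifest can
    later support integrity checks and gap detection.
    """
    data = (Path(store_root) / stored_relpath).read_bytes()
    return {
        "source_archive": str(source_archive),
        "member_name": member_name,
        "stored_relpath": str(stored_relpath),
        "year": year,
        "size": len(data),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

A collection writer could return a list of such entries, and the catalog could store it under derived_metadata alongside the summary fields.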

The core package probably should not define one universal meaning for update. It can define the envelope: load an existing record, let a manager materialise changes, update metadata, and register rollback. A plugin or manager can then decide whether an update means append, replace, merge, rebuild, or reject.

A small vendored example could be a generic directory collection manager. It would treat a managed directory as a simple collection of files, roughly like a list in memory:

  • add writes one or more new files into the directory

  • remove deletes selected members

  • replace overwrites a selected member

  • manifest returns the current member list and summary metadata

That would mirror the familiar Unix directory file type without committing the whole model to directories forever. Later, a directory collection record could be replaced or complemented by a record that points to other records, but a plain managed-directory collection is likely the most useful first step.
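A minimal sketch of such a manager, assuming plain filesystem copies and none of the real catalog plumbing (the class name and method signatures are assumptions):

```python
import shutil
from pathlib import Path

class DirectoryCollection:
    """Treat a managed directory as a simple collection of files."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def add(self, *source_paths):
        """Append new members; refuse to silently overwrite existing ones."""
        for src in map(Path, source_paths):
            target = self.root / src.name
            if target.exists():
                raise FileExistsError(f"member already exists: {target.name}")
            shutil.copy2(src, target)

    def remove(self, member_name):
        (self.root / member_name).unlink()

    def replace(self, member_name, source_path):
        shutil.copy2(source_path, self.root / member_name)

    def manifest(self):
        """Return the current member list plus summary metadata."""
        members = sorted(p.name for p in self.root.iterdir() if p.is_file())
        return {"member_count": len(members), "members": members}
```

Rollback for add could then be scoped to the newly copied members rather than the whole directory, which matches the concern above about deleting an existing collection.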

Archive Member Naming

When an archive is just transport packaging, naming should often use the archive member rather than the archive file itself. A .zip named GCP-GridFEDv2023.1_2018.zip may contain GCP-GridFEDv2023.1_2018.nc; the managed artifact should usually use the .nc name.

Current options:

  • pass an explicit locator=ArtifactLocator.from_path(...) when planning storage

  • use a locator-resolution hook to inspect the archive and adjust the planned locator before the storage plan is finalised

  • write a domain-specific multi-archive writer that controls its directory layout and records extracted member metadata

Possible future improvements:

  • add an archive-member source descriptor, separate from the physical source archive path

  • expose naming context fields such as artifact_filename or source_member_filename without overloading original_filename

  • provide a builtin hook or planning helper for “single-member archive uses member filename”

  • keep original_filename reserved for the physical source filename, so metadata and naming remain easier to reason about
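The "single-member archive uses member filename" rule could reduce to a small helper; the function name and its exact place in the planning flow are assumptions here:

```python
import zipfile
from pathlib import Path

def single_member_artifact_name(archive_path):
    """Return the member filename if a .zip holds exactly one file.

    Falls back to the archive's own name for multi-member or empty
    archives, so callers always get a usable artifact name.
    """
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        members = [info for info in zf.infolist() if not info.is_dir()]
    if len(members) == 1:
        return Path(members[0].filename).name
    return archive_path.name
```

A builtin locator-resolution hook could call this during planning and keep original_filename pointing at the physical .zip, per the reservation suggested above.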

Non-Path Locators

The current model already leaves room for non-path locators such as URIs, but support is intentionally thin.

Examples:

  • s3://bucket/path/to/data.zarr

  • gs://bucket/path/to/file.nc

  • other opaque references that are not local filesystem paths

Possible next steps:

  • keep locator.kind explicit, for example path, uri, or another small set

  • preserve non-path values verbatim instead of passing them through Path(...)

  • add lightweight helpers for path-backed versus non-path-backed behavior

  • consider optional higher-level integrations with tools such as fsspec later
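A lightweight path/URI split could look like the following sketch; the two-kind classification and the function name are assumptions, not the package's current API:

```python
from pathlib import Path
from urllib.parse import urlparse

def classify_locator(value):
    """Classify a raw locator string as ("path", Path) or ("uri", str).

    URI-like values are preserved verbatim; only plain paths go through
    Path(...). Single-letter schemes are treated as Windows drive letters.
    """
    scheme = urlparse(value).scheme
    if scheme and len(scheme) > 1:
        return ("uri", value)
    return ("path", Path(value))
```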

Why it makes sense:

  • avoids locking the model to local files only

  • supports cloud or remote references without forcing a heavy abstraction layer

Reader Hints For Collections

For collection-like artifacts, it may later be useful to attach a lightweight reader hint rather than a full plugin system.

Examples:

  • artifact_type="netcdf_monthly_series"

  • reader_hint="xarray.open_mfdataset"

This should stay advisory metadata rather than executable application state. The catalog can describe what something is, while higher-level code decides how to open it.
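Advisory dispatch on such a hint can be as small as a registry keyed by the hint string; this sketch fakes the opener rather than importing xarray, and both the hint value and the registry shape are illustrative:

```python
# Higher-level code owns the mapping from hint strings to openers; the
# catalog only stores the hint. The registered opener here is a stand-in.
OPENERS = {
    "xarray.open_mfdataset": lambda paths: f"open {len(paths)} files as one dataset",
}

def open_collection(record, paths):
    """Look up the advisory reader_hint and delegate to a registered opener."""
    hint = record.get("reader_hint")
    opener = OPENERS.get(hint)
    if opener is None:
        raise ValueError(f"no opener registered for hint: {hint!r}")
    return opener(paths)
```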

Follow-Up PR: Repository-Owned Search And Record Identity

This should be treated as a focused architectural cleanup PR.

Problem statement:

  • the current repository interface is too thin

  • Catalog.search(...) currently loads repository.all() and filters in Python

  • that leaks search policy out of the backend layer and prevents backends such as TinyDB, SQLite, or MongoDB from owning query execution

  • CatalogRecord currently requires id, even though record identity is really assigned by the repository/database layer

  • allocate_record_ids() is a workaround for that mismatch and should likely be removed

Preferred direction:

  • move search behind the repository interface

  • adopt “option 1” for record identity:

    • CatalogRecord.id becomes optional before persistence

    • repository insert(...) / insert_many(...) assign ids

    • catalog-facing methods return persisted records with ids populated
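Option 1 can be sketched with a frozen dataclass and a toy repository standing in for TinyDB; the class names, field shapes, and return convention are assumptions for this note:

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class CatalogRecord:
    """Sketch of option 1: id is optional until the repository persists it."""
    metadata: dict
    id: Optional[int] = None

class InMemoryRepository:
    """Toy backend that owns id assignment, as a TinyDB backend would."""

    def __init__(self):
        self._rows = {}
        self._next_id = 1

    def insert(self, record):
        # Identity is assigned here, not by the caller; the persisted
        # record comes back with its id populated.
        persisted = replace(record, id=self._next_id)
        self._rows[self._next_id] = persisted
        self._next_id += 1
        return persisted
```

Returning the full persisted record (rather than a bare id) is one answer to the first open question below, but either convention fits this shape.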

Suggested scope:

  • add a repository search method, likely reusing the current search inputs:

    • where

    • contains

    • regex

    • ignore_case

  • provide a simple backend implementation for TinyDB

  • update Catalog.search(...) to delegate to the repository instead of calling all() and filtering in Python

  • remove allocate_record_ids() from the repository interface

  • update Catalog.add_file(...), add_artifact(...), and add_artifacts(...) so they build records without ids and persist them through repository methods that return ids or persisted records

  • keep the public Catalog API lightweight

  • avoid introducing a large ORM-like abstraction layer
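The repository search method could take the current search inputs directly. This sketch filters in Python purely as a reference implementation; a TinyDB backend would instead translate the same inputs into native queries, which is the point of moving search behind the interface:

```python
import re
from typing import Protocol

class Repository(Protocol):
    """Interface sketch: search inputs mirror the current Catalog.search."""
    def search(self, where=None, contains=None, regex=None, ignore_case=False): ...

class ListRepository:
    """Reference backend over a plain list of row dicts."""

    def __init__(self, rows):
        self._rows = rows

    def search(self, where=None, contains=None, regex=None, ignore_case=False):
        flags = re.IGNORECASE if ignore_case else 0

        def norm(s):
            return s.lower() if ignore_case else s

        def matches(row):
            for key, value in (where or {}).items():      # exact equality
                if row.get(key) != value:
                    return False
            for key, frag in (contains or {}).items():    # substring match
                if norm(frag) not in norm(str(row.get(key, ""))):
                    return False
            for key, pattern in (regex or {}).items():    # regex match
                if not re.search(pattern, str(row.get(key, "")), flags):
                    return False
            return True

        return [row for row in self._rows if matches(row)]
```

Catalog.search(...) would then delegate to this method unchanged, keeping the public API the same while letting each backend own execution.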

Questions to resolve in that PR:

  • should repository insert(...) return the assigned id, or the full persisted CatalogRecord?

  • what is the cleanest shape for batch insert return values?

  • how should managed-file naming behave if templates want {id} before the record is persisted?

  • should id-based naming remain supported, or should it become discouraged or removed from the default design?

  • how much query expressiveness should the repository interface expose before it becomes too backend-specific?

Desired outcome:

  • repository backends own both identity assignment and search execution

  • CatalogRecord no longer needs caller-supplied ids

  • the catalog layer becomes thinner and less tied to backend implementation details