# Ideas
This note collects possible next-step ideas that build on the current artifact locator work without committing the core package to them yet.
## Grouped Search Results
Current search returns one record per matching artifact. That is the right base behavior for precise file-level lookup, but it can be noisy for datasets that are naturally monthly file series.
Possible extension:

- keep one record per file in storage
- add a helper that groups search results by selected keys such as `site`, `inlet`, `model`, `met_model`, `domain`, `species` (sketched below)
- return summary fields such as `record_count`, `start_date_min`, `start_date_max`, `years`, `months`, and possibly gap information later
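A minimal sketch of such a helper, assuming dict-like records that carry the grouping keys and a `start_date` field; `group_records` and `GroupSummary` are illustrative names, not part of the current API:

```python
from collections import defaultdict
from dataclasses import dataclass

DEFAULT_KEYS = ("site", "inlet", "model", "met_model", "domain", "species")


@dataclass
class GroupSummary:
    """Illustrative per-group summary; field names mirror the list above."""
    key: tuple
    record_count: int
    start_date_min: str
    start_date_max: str


def group_records(records, keys=DEFAULT_KEYS):
    """Group per-file search results by selected metadata keys.

    Assumes each record is dict-like with the grouping keys and a
    ``start_date`` field; storage itself stays one record per file.
    """
    groups = defaultdict(list)
    for record in records:
        groups[tuple(record.get(k) for k in keys)].append(record)

    summaries = []
    for key, members in groups.items():
        dates = sorted(r["start_date"] for r in members)
        summaries.append(GroupSummary(key, len(members), dates[0], dates[-1]))
    return summaries
```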
Why it makes sense:

- preserves the current simple storage model
- avoids losing per-file fidelity
- improves usability for monthly-series datasets such as footprints
Related CLI follow-up:

- current CLI search output is still record-oriented
- future work could add grouped or collapsed search views, especially for monthly file series
- output modes such as `--paths` may need explicit semantics for mixed path-backed and non-path-backed result sets
## Collection Records
Another direction is to represent one logical collection instead of one record per file.
Example:

- one record for `/group/chem/acrg/LPDM/fp_NAME/EUROPE/MHD-10magl/co2/`
- metadata could include `site=MHD`, `inlet=10m`, `model=NAME`, `met_model=UKV`, `domain=EUROPE`, `species=co2`, plus `file_count`, `start_date`, `end_date`, and `file_pattern`

This would likely be a different `record_type`, such as:

- `external_collection`
- `managed_store`
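For the path above, a single collection record might look roughly like this; the schema is not committed and the count and date values are placeholders:

```python
# Placeholder sketch only: not a committed schema.
collection_record = {
    "record_type": "external_collection",
    "locator": "/group/chem/acrg/LPDM/fp_NAME/EUROPE/MHD-10magl/co2/",
    "site": "MHD",
    "inlet": "10m",
    "model": "NAME",
    "met_model": "UKV",
    "domain": "EUROPE",
    "species": "co2",
    "file_count": 132,           # invented value
    "start_date": "2014-01-01",  # invented value
    "end_date": "2024-12-31",    # invented value
    "file_pattern": "*.nc",      # invented value
}
```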
Why it makes sense:

- better matches “one dataset, many files”
- leaves room for directory-backed stores and transform outputs

Tradeoff:

- collection records are convenient summaries, but they lose direct one-record-per-file visibility unless paired with file-level records or derived indexes
## Managed Collection Updates And Member Manifests
Archive-backed scientific datasets often start as many transport files that
should become one logical collection. For example, each annual .zip may
contain one NetCDF file, and all extracted NetCDF files should live in the same
managed directory.
Current behavior is best suited to:

- one record per extracted file; or
- one collection record written in a single operation by a custom writer.
It is not yet a good fit for incrementally appending files to one managed directory while updating the same catalog record’s metadata. That would require new semantics beyond the current “target should be absent, writer materialises it, record is inserted” flow.
Possible future directions:

- add an explicit update/upsert API for catalog records
- distinguish create-only writers from append/update writers
- keep the core update API small, with plugins defining the concrete semantics for a given artifact shape
- let a collection writer return a structured manifest of members, including:
    - source archive path
    - archive member name
    - stored relative path
    - year or other member-level keys
    - size and checksum
- store collection-level summaries such as `file_count`, `years`, `time_coverage`, and `member_glob`
- decide whether metadata updates replace, merge, or recompute derived collection metadata
- make rollback behavior explicit for append operations, since deleting a whole directory may be wrong once a collection already exists
This also points toward a future artifact descriptor model where one logical record can have many concrete artifact members. Until then, a JSON-compatible manifest in `derived_metadata` could be a useful prototype.
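A placeholder sketch of that prototype, with file names borrowed from the archive example in the next section and all other values invented:

```python
# Placeholder manifest prototype stored under derived_metadata.
derived_metadata = {
    "members": [
        {
            "source_archive": "GCP-GridFEDv2023.1_2018.zip",
            "archive_member": "GCP-GridFEDv2023.1_2018.nc",
            "stored_path": "GCP-GridFEDv2023.1_2018.nc",
            "year": 2018,
            "size_bytes": 123456789,  # invented
            "sha256": "<checksum>",   # invented
        },
    ],
    "file_count": 1,
    "years": [2018],
    "member_glob": "GCP-GridFED*.nc",
}
```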
The core package probably should not define one universal meaning for `update`. It can define the envelope: load an existing record, let a manager materialise changes, update metadata, and register rollback. A plugin or manager can then decide whether an update means append, replace, merge, rebuild, or reject.
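One shape that envelope could take, with every interface name here an assumption rather than the current API:

```python
from typing import Callable, Protocol


class UpdateManager(Protocol):
    """Hypothetical plugin hook that owns what an update *means*."""

    def apply_update(self, record: dict, change: dict) -> tuple[dict, Callable[[], None]]:
        """Materialise the change (append, replace, merge, rebuild, or reject)
        and return (updated_metadata, rollback_callable)."""
        ...


def update_record(repository, record_id, change, manager: UpdateManager) -> dict:
    """Core envelope only: load, delegate, persist, register rollback.

    repository.get / register_rollback / update are assumed interfaces,
    not the current API.
    """
    record = repository.get(record_id)
    metadata, rollback = manager.apply_update(record, change)
    repository.register_rollback(record_id, rollback)
    repository.update(record_id, metadata)
    return metadata
```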
A small vendored example could be a generic directory collection manager. It would treat a managed directory as a simple collection of files, roughly like a list in memory:
- `add` writes one or more new files into the directory
- `remove` deletes selected members
- `replace` overwrites a selected member
- `manifest` returns the current member list and summary metadata
That would mirror the familiar Unix directory file type without committing the whole model to directories forever. Later, a directory collection record could be replaced or complemented by a record that points to other records, but a plain managed-directory collection is likely the most useful first step.
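A minimal sketch of that generic directory collection manager, using only the standard library; the class name and method signatures are illustrative:

```python
import shutil
from pathlib import Path


class DirectoryCollectionManager:
    """Treat one managed directory as a flat collection of member files."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def add(self, *sources: Path) -> None:
        """Copy one or more new files into the directory."""
        for src in sources:
            shutil.copy2(src, self.root / Path(src).name)

    def remove(self, member: str) -> None:
        """Delete a selected member."""
        (self.root / member).unlink()

    def replace(self, member: str, source: Path) -> None:
        """Overwrite a selected member with a new file."""
        shutil.copy2(source, self.root / member)

    def manifest(self) -> dict:
        """Return the current member list and summary metadata."""
        members = sorted(p.name for p in self.root.iterdir() if p.is_file())
        return {"members": members, "file_count": len(members)}
```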
## Archive Member Naming
When an archive is just transport packaging, naming should often use the archive member rather than the archive file itself. A `.zip` named `GCP-GridFEDv2023.1_2018.zip` may contain `GCP-GridFEDv2023.1_2018.nc`; the managed artifact should usually use the `.nc` name.
Current options:

- pass an explicit `locator=ArtifactLocator.from_path(...)` when planning storage
- use a locator-resolution hook to inspect the archive and adjust the planned locator before the storage plan is finalised
- write a domain-specific multi-archive writer that controls its directory layout and records extracted member metadata
Possible future improvements:

- add an archive-member source descriptor, separate from the physical source archive path
- expose naming context fields such as `artifact_filename` or `source_member_filename` without overloading `original_filename`
- provide a builtin hook or planning helper for “single-member archive uses member filename” (sketched below)
- keep `original_filename` reserved for the physical source filename, so metadata and naming remain easier to reason about
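The “single-member archive uses member filename” rule only needs the standard library. The helper name here is hypothetical and how it would plug into locator resolution is left open:

```python
import zipfile
from pathlib import Path


def single_member_name(archive_path: Path) -> str | None:
    """If a .zip is pure transport packaging around one file, return that
    member's filename so the managed artifact can be named after it."""
    with zipfile.ZipFile(archive_path) as zf:
        members = [m for m in zf.namelist() if not m.endswith("/")]
    if len(members) == 1:
        return Path(members[0]).name
    return None


# e.g. single_member_name(Path("GCP-GridFEDv2023.1_2018.zip"))
# -> "GCP-GridFEDv2023.1_2018.nc"
```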
## Non-Path Locators
The current model already leaves room for non-path locators such as URIs, but support is intentionally thin.
Examples:

- `s3://bucket/path/to/data.zarr`
- `gs://bucket/path/to/file.nc`
- other opaque references that are not local filesystem paths
Possible next steps:

- keep `locator.kind` explicit, for example `path`, `uri`, or another small set
- preserve non-path values verbatim instead of passing them through `Path(...)`
- add lightweight helpers for path-backed versus non-path-backed behavior
- consider optional higher-level integrations with tools such as `fsspec` later
Why it makes sense:

- avoids locking the model to local files only
- supports cloud or remote references without forcing a heavy abstraction layer
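A lightweight sketch of the path-backed versus non-path-backed split; `Locator` here is a stand-in for illustration, not the current `ArtifactLocator`:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Locator:
    """Hypothetical locator: explicit kind, value preserved verbatim."""
    kind: str   # e.g. "path" or "uri"
    value: str

    @classmethod
    def from_value(cls, raw: str) -> "Locator":
        # Treat anything with a URI scheme (s3://, gs://, ...) as non-path;
        # never round-trip non-path values through Path(...).
        if "://" in raw:
            return cls(kind="uri", value=raw)
        return cls(kind="path", value=str(Path(raw)))

    def as_path(self) -> Path:
        if self.kind != "path":
            raise ValueError(f"locator is not path-backed: {self.value}")
        return Path(self.value)
```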
## Reader Hints For Collections
For collection-like artifacts, it may later be useful to attach a lightweight reader hint rather than a full plugin system.
Examples:

- `artifact_type="netcdf_monthly_series"`
- `reader_hint="xarray.open_mfdataset"`
This should stay advisory rather than executable application state. The catalog can describe what something is, while higher-level code decides how to open it.
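For example, application code could keep its own registry from hints to openers; the `OPENERS` mapping and `open_collection` helper below are illustrative, not catalog behavior:

```python
import xarray as xr

# Illustrative registry: the catalog stores the advisory hint, and
# application code chooses whether and how to honour it.
OPENERS = {"xarray.open_mfdataset": xr.open_mfdataset}


def open_collection(record: dict, paths):
    """Resolve a record's reader_hint to a concrete opener (hypothetical helper)."""
    opener = OPENERS.get(record.get("reader_hint"))
    if opener is None:
        raise ValueError(f"no opener registered for hint {record.get('reader_hint')!r}")
    return opener(paths)
```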
## Follow-Up PR: Repository-Owned Search And Record Identity
This should be treated as a focused architectural cleanup PR.
Problem statement:

- the current repository interface is too thin
- `Catalog.search(...)` currently loads `repository.all()` and filters in Python
- that leaks search policy out of the backend layer and prevents backends such as TinyDB, SQLite, or MongoDB from owning query execution
- `CatalogRecord` currently requires `id`, even though record identity is really assigned by the repository/database layer
- `allocate_record_ids()` is a workaround for that mismatch and should likely be removed
Preferred direction:

- move search behind the repository interface
- adopt “option 1” for record identity:
    - `CatalogRecord.id` becomes optional before persistence
    - repository `insert(...)` / `insert_many(...)` assign ids
    - catalog-facing methods return persisted records with ids populated
Suggested scope:

- add a repository search method, likely reusing the current search inputs: `where`, `contains`, `regex`, `ignore_case`
- provide a simple backend implementation for TinyDB (see the sketch after this list)
- update `Catalog.search(...)` to delegate to the repository instead of calling `all()` and filtering in Python
- remove `allocate_record_ids()` from the repository interface
- update `Catalog.add_file(...)`, `add_artifact(...)`, and `add_artifacts(...)` so they build records without ids and persist them through repository methods that return ids or persisted records
- keep the public `Catalog` API lightweight
- avoid introducing a large ORM-like abstraction layer
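A rough sketch of the TinyDB side; the repository interface names are suggestions, while TinyDB's own `insert(...)` and `insert_multiple(...)` really do return assigned document ids, which fits the identity direction above:

```python
import re

from tinydb import Query, TinyDB


class TinyDBRepository:
    """Sketch: the backend owns identity assignment and query execution."""

    def __init__(self, path: str):
        self._db = TinyDB(path)

    def insert(self, record: dict) -> int:
        # TinyDB assigns the document id; callers never supply one.
        return self._db.insert(record)

    def insert_many(self, records) -> list[int]:
        return self._db.insert_multiple(records)

    def search(self, where=None, contains=None, regex=None, ignore_case=False):
        flags = re.IGNORECASE if ignore_case else 0
        q = Query()
        cond = None

        def combine(clause):
            nonlocal cond
            cond = clause if cond is None else (cond & clause)

        for field, value in (where or {}).items():
            combine(q[field] == value)  # exact match (case-sensitive)
        for field, value in (contains or {}).items():
            combine(q[field].search(re.escape(value), flags=flags))
        for field, pattern in (regex or {}).items():
            combine(q[field].search(pattern, flags=flags))
        return self._db.all() if cond is None else self._db.search(cond)
```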
Questions to resolve in that PR:

- should repository `insert(...)` return the assigned id, or the full persisted `CatalogRecord`?
- what is the cleanest shape for batch insert return values?
- how should managed-file naming behave if templates want `{id}` before the record is persisted?
- should id-based naming remain supported, or should it become discouraged or removed from the default design?
- how much query expressiveness should the repository interface expose before it becomes too backend-specific?
Desired outcome:

- repository backends own both identity assignment and search execution
- `CatalogRecord` no longer needs caller-supplied ids
- the catalog layer becomes thinner and less tied to backend implementation details