# Ideas

This note collects possible next-step ideas that build on the current artifact locator work without committing the core package to them yet.

## Grouped Search Results

Current search returns one record per matching artifact. That is the right base behavior for precise file-level lookup, but it can be noisy for datasets that naturally form monthly file series.

Possible extension:

- keep one record per file in storage
- add a helper that groups search results by selected keys such as:
  - `site`
  - `inlet`
  - `model`
  - `met_model`
  - `domain`
  - `species`
- return summary fields such as:
  - `record_count`
  - `start_date_min`
  - `start_date_max`
  - `years`
  - `months`
  - possible gap information later

Why it makes sense:

- preserves the current simple storage model
- avoids losing per-file fidelity
- improves usability for monthly-series datasets such as footprints

Related CLI follow-up:

- current CLI search output is still record-oriented
- future work could add grouped or collapsed search views, especially for monthly file series
- output modes such as `--paths` may need explicit semantics for mixed path-backed and non-path-backed result sets

## Collection Records

Another direction is to represent one logical collection instead of one record per file.

Example:

- one record for `/group/chem/acrg/LPDM/fp_NAME/EUROPE/MHD-10magl/co2/`
- metadata could include:
  - `site=MHD`
  - `inlet=10m`
  - `model=NAME`
  - `met_model=UKV`
  - `domain=EUROPE`
  - `species=co2`
  - `file_count`
  - `start_date`
  - `end_date`
  - `file_pattern`

This would likely be a different `record_type`, such as:

- `external_collection`
- `managed_store`

Why it makes sense:

- better matches "one dataset, many files"
- leaves room for directory-backed stores and transform outputs

Tradeoff:

- collection records are convenient summaries, but they lose direct one-record-per-file visibility unless paired with file-level records or derived indexes

## Managed Collection Updates And Member Manifests

Archive-backed scientific datasets often start as many transport files that should become one logical collection. For example, each annual ``.zip`` may contain one NetCDF file, and all extracted NetCDF files should live in the same managed directory.

Current behavior is best suited to:

- one record per extracted file; or
- one collection record written in a single operation by a custom writer.

It is not yet a good fit for incrementally appending files to one managed directory while updating the same catalog record's metadata. That would require new semantics beyond the current "target should be absent, writer materialises it, record is inserted" flow.

Possible future directions:

- add an explicit update/upsert API for catalog records
- distinguish create-only writers from append/update writers
- keep the core update API small, with plugins defining the concrete semantics for a given artifact shape
- let a collection writer return a structured manifest of members, including:
  - source archive path
  - archive member name
  - stored relative path
  - year or other member-level keys
  - size and checksum
- store collection-level summaries such as ``file_count``, ``years``, ``time_coverage``, and ``member_glob``
- decide whether metadata updates replace, merge, or recompute derived collection metadata
- make rollback behavior explicit for append operations, since deleting a whole directory may be wrong once a collection already exists

This also points toward a future artifact descriptor model where one logical record can have many concrete artifact members.
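Such a manifest could stay a plain JSON-compatible structure. A loose sketch follows, assuming member-level field names that the package does not currently define (the file names reuse the GridFED example from the archive-naming section below):

```python
# Hypothetical per-member manifest for one managed collection record.
# Every field name here is illustrative, not part of the current API.
member_manifest = [
    {
        "source_archive": "GCP-GridFEDv2023.1_2018.zip",  # physical transport archive
        "archive_member": "GCP-GridFEDv2023.1_2018.nc",   # member inside the archive
        "stored_relative_path": "GCP-GridFEDv2023.1_2018.nc",
        "year": 2018,                                      # member-level key
        "size_bytes": 52_428_800,
        "sha256": "<checksum>",
    },
    # ... one entry per extracted member
]

# Collection-level summary derived from the member list.
collection_summary = {
    "file_count": len(member_manifest),
    "years": sorted({m["year"] for m in member_manifest}),
    "member_glob": "GCP-GridFEDv2023.1_*.nc",
}
```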
Until a fuller descriptor model exists, a JSON-compatible manifest along these lines, stored in ``derived_metadata``, could be a useful prototype.

The core package probably should not define one universal meaning for ``update``. It can define the envelope: load an existing record, let a manager materialise changes, update metadata, and register rollback. A plugin or manager can then decide whether an update means append, replace, merge, rebuild, or reject.

A small vendored example could be a generic directory collection manager. It would treat a managed directory as a simple collection of files, roughly like a list in memory:

- ``add`` writes one or more new files into the directory
- ``remove`` deletes selected members
- ``replace`` overwrites a selected member
- ``manifest`` returns the current member list and summary metadata

That would mirror the familiar Unix directory file type without committing the whole model to directories forever. Later, a directory collection record could be replaced or complemented by a record that points to other records, but a plain managed-directory collection is likely the most useful first step.

## Archive Member Naming

When an archive is just transport packaging, naming should often use the archive member rather than the archive file itself. A ``.zip`` named ``GCP-GridFEDv2023.1_2018.zip`` may contain ``GCP-GridFEDv2023.1_2018.nc``; the managed artifact should usually use the ``.nc`` name.

Current options:

- pass an explicit ``locator=ArtifactLocator.from_path(...)`` when planning storage
- use a locator-resolution hook to inspect the archive and adjust the planned locator before the storage plan is finalised
- write a domain-specific multi-archive writer that controls its directory layout and records extracted member metadata

Possible future improvements:

- add an archive-member source descriptor, separate from the physical source archive path
- expose naming context fields such as ``artifact_filename`` or ``source_member_filename`` without overloading ``original_filename``
- provide a built-in hook or planning helper for "single-member archive uses member filename"
- keep ``original_filename`` reserved for the physical source filename, so metadata and naming remain easier to reason about

## Non-Path Locators

The current model already leaves room for non-path locators such as URIs, but support is intentionally thin.

Examples:

- `s3://bucket/path/to/data.zarr`
- `gs://bucket/path/to/file.nc`
- other opaque references that are not local filesystem paths

Possible next steps:

- keep `locator.kind` explicit, for example `path`, `uri`, or another small set
- preserve non-path values verbatim instead of passing them through `Path(...)`
- add lightweight helpers for path-backed versus non-path-backed behavior
- consider optional higher-level integrations with tools such as `fsspec` later

Why it makes sense:

- avoids locking the model to local files only
- supports cloud or remote references without forcing a heavy abstraction layer

## Reader Hints For Collections

For collection-like artifacts, it may later be useful to attach a lightweight reader hint rather than a full plugin system.

Examples:

- `artifact_type="netcdf_monthly_series"`
- `reader_hint="xarray.open_mfdataset"`

This should stay advisory rather than becoming executable application state. The catalog can describe what something is, while higher-level code decides how to open it.
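A minimal sketch of how calling code might act on such a hint, assuming a record object with `metadata` and `locator` attributes (those names, and the helper itself, are illustrative rather than current API):

```python
import xarray as xr

def open_collection(record):
    """Open a collection-like record based on its advisory reader hint.

    The catalog only describes what the artifact is; deciding how to open it
    stays in higher-level code such as this helper. The `metadata` and
    `locator` attribute names are assumptions for illustration.
    """
    hint = record.metadata.get("reader_hint")
    if hint == "xarray.open_mfdataset":
        # Monthly file series: open all NetCDF members as one dataset.
        return xr.open_mfdataset(f"{record.locator.value}/*.nc", combine="by_coords")
    raise ValueError(f"no opener known for reader_hint={hint!r}")
```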
## Follow-Up PR: Repository-Owned Search And Record Identity

This should be treated as a focused architectural cleanup PR.

Problem statement:

- the current repository interface is too thin
- `Catalog.search(...)` currently loads `repository.all()` and filters in Python
- that leaks search policy out of the backend layer and prevents backends such as TinyDB, SQLite, or MongoDB from owning query execution
- `CatalogRecord` currently requires `id`, even though record identity is really assigned by the repository/database layer
- `allocate_record_ids()` is a workaround for that mismatch and should likely be removed

Preferred direction (a rough interface sketch appears at the end of this note):

- move search behind the repository interface
- adopt "option 1" for record identity:
  - `CatalogRecord.id` becomes optional before persistence
  - repository `insert(...)` / `insert_many(...)` assign ids
  - catalog-facing methods return persisted records with ids populated

Suggested scope:

- add a repository search method, likely reusing the current search inputs:
  - `where`
  - `contains`
  - `regex`
  - `ignore_case`
- provide a simple backend implementation for TinyDB
- update `Catalog.search(...)` to delegate to the repository instead of calling `all()` and filtering in Python
- remove `allocate_record_ids()` from the repository interface
- update `Catalog.add_file(...)`, `add_artifact(...)`, and `add_artifacts(...)` so they build records without ids and persist them through repository methods that return ids or persisted records
- keep the public `Catalog` API lightweight
- avoid introducing a large ORM-like abstraction layer

Questions to resolve in that PR:

- should repository `insert(...)` return the assigned id, or the full persisted `CatalogRecord`?
- what is the cleanest shape for batch insert return values?
- how should managed-file naming behave if templates want `{id}` before the record is persisted?
- should id-based naming remain supported, or should it become discouraged or removed from the default design?
- how much query expressiveness should the repository interface expose before it becomes too backend-specific?

Desired outcome:

- repository backends own both identity assignment and search execution
- `CatalogRecord` no longer needs caller-supplied ids
- the catalog layer becomes thinner and less tied to backend implementation details
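A rough sketch of that direction, using illustrative names and deliberately fewer fields than the real `CatalogRecord` carries:

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Protocol, Sequence

@dataclass
class CatalogRecord:
    # Illustrative shape only: the real record carries more fields than this.
    metadata: Mapping[str, object]
    id: Optional[int] = None  # stays unset until the repository persists the record

class Repository(Protocol):
    def insert(self, record: CatalogRecord) -> CatalogRecord:
        """Persist one record, assign its id, and return the persisted record."""
        ...

    def insert_many(self, records: Sequence[CatalogRecord]) -> list[CatalogRecord]:
        """Batch insert; returns records with ids populated."""
        ...

    def search(
        self,
        where: Optional[Mapping[str, object]] = None,
        contains: Optional[Mapping[str, str]] = None,
        regex: Optional[Mapping[str, str]] = None,
        ignore_case: bool = False,
    ) -> list[CatalogRecord]:
        """Backend-owned query execution reusing the current search inputs."""
        ...
```

Whether `insert(...)` should return the full persisted record or only the assigned id is one of the open questions above; returning the record keeps call sites simpler, at the cost of an extra read for backends that naturally hand back only ids.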