# Ideas

This note collects possible next-step ideas that build on the current artifact locator work without committing the core package to them yet.

## Grouped Search Results

Current search returns one record per matching artifact. That is the right base behavior for precise file-level lookup, but it can be noisy for datasets that naturally form monthly file series.

Possible extension:

- keep one record per file in storage
- add a helper that groups search results by selected keys such as:
  - `site`
  - `inlet`
  - `model`
  - `met_model`
  - `domain`
  - `species`
- return summary fields such as:
  - `record_count`
  - `start_date_min`
  - `start_date_max`
  - `years`
  - `months`
  - possible gap information later

Why it makes sense:

- preserves the current simple storage model
- avoids losing per-file fidelity
- improves usability for monthly-series datasets such as footprints

Related CLI follow-up:

- current CLI search output is still record-oriented
- future work could add grouped or collapsed search views, especially for monthly file series
- output modes such as `--paths` may need explicit semantics for mixed path-backed and non-path-backed result sets

## Collection Records

Another direction is to represent one logical collection instead of one record per file.

Example:

- one record for `/group/chem/acrg/LPDM/fp_NAME/EUROPE/MHD-10magl/co2/`
- metadata could include:
  - `site=MHD`
  - `inlet=10m`
  - `model=NAME`
  - `met_model=UKV`
  - `domain=EUROPE`
  - `species=co2`
  - `file_count`
  - `start_date`
  - `end_date`
  - `file_pattern`

This would likely be a different `record_type`, such as:

- `external_collection`
- `managed_store`

Why it makes sense:

- better matches "one dataset, many files"
- leaves room for directory-backed stores and transform outputs

Tradeoff:

- collection records are convenient summaries, but they lose direct one-record-per-file visibility unless paired with file-level records or derived indexes

## Managed Collection Updates And Member Manifests

Archive-backed scientific datasets often start as many transport files that should become one logical collection. For example, each annual ``.zip`` may contain one NetCDF file, and all extracted NetCDF files should live in the same managed directory.

Current behavior is best suited to:

- one record per extracted file; or
- one collection record written in a single operation by a custom writer.

It is not yet a good fit for incrementally appending files to one managed directory while updating the same catalog record's metadata. That would require new semantics beyond the current "target should be absent, writer materialises it, record is inserted" flow.

Possible future directions:

- add an explicit update/upsert API for catalog records
- distinguish create-only writers from append/update writers
- keep the core update API small, with plugins defining the concrete semantics for a given artifact shape
- let a collection writer return a structured manifest of members, including:
  - source archive path
  - archive member name
  - stored relative path
  - year or other member-level keys
  - size and checksum
- store collection-level summaries such as ``file_count``, ``years``, ``time_coverage``, and ``member_glob``
- decide whether metadata updates replace, merge, or recompute derived collection metadata
- make rollback behavior explicit for append operations, since deleting a whole directory may be wrong once a collection already exists

This also points toward a future artifact descriptor model where one logical record can have many concrete artifact members.
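Such a manifest could stay a plain JSON-compatible structure. A loose sketch follows, assuming member-level field names that the package does not currently define (the file names reuse the GridFED example from the archive-naming section below):

```python
# Hypothetical per-member manifest for one managed collection record.
# Every field name here is illustrative, not part of the current API.
member_manifest = [
    {
        "source_archive": "GCP-GridFEDv2023.1_2018.zip",  # physical transport archive
        "archive_member": "GCP-GridFEDv2023.1_2018.nc",   # member inside the archive
        "stored_relative_path": "GCP-GridFEDv2023.1_2018.nc",
        "year": 2018,                                      # member-level key
        "size_bytes": 52_428_800,
        "sha256": "<checksum>",
    },
    # ... one entry per extracted member
]

# Collection-level summary derived from the member list.
collection_summary = {
    "file_count": len(member_manifest),
    "years": sorted({m["year"] for m in member_manifest}),
    "member_glob": "GCP-GridFEDv2023.1_*.nc",
}
```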
Until a fuller descriptor model exists, a JSON-compatible manifest along these lines, stored in ``derived_metadata``, could be a useful prototype.

The core package probably should not define one universal meaning for ``update``. It can define the envelope: load an existing record, let a manager materialise changes, update metadata, and register rollback. A plugin or manager can then decide whether an update means append, replace, merge, rebuild, or reject.

A small vendored example could be a generic directory collection manager. It would treat a managed directory as a simple collection of files, roughly like a list in memory:

- ``add`` writes one or more new files into the directory
- ``remove`` deletes selected members
- ``replace`` overwrites a selected member
- ``manifest`` returns the current member list and summary metadata

That would mirror the familiar Unix directory file type without committing the whole model to directories forever. Later, a directory collection record could be replaced or complemented by a record that points to other records, but a plain managed-directory collection is likely the most useful first step.

## Archive Member Naming

When an archive is just transport packaging, naming should often use the archive member rather than the archive file itself. A ``.zip`` named ``GCP-GridFEDv2023.1_2018.zip`` may contain ``GCP-GridFEDv2023.1_2018.nc``; the managed artifact should usually use the ``.nc`` name.

Current options:

- pass an explicit ``locator=ArtifactLocator.from_path(...)`` when planning storage
- use a locator-resolution hook to inspect the archive and adjust the planned locator before the storage plan is finalised
- write a domain-specific multi-archive writer that controls its directory layout and records extracted member metadata

Possible future improvements:

- add an archive-member source descriptor, separate from the physical source archive path
- expose naming context fields such as ``artifact_filename`` or ``source_member_filename`` without overloading ``original_filename``
- provide a built-in hook or planning helper for "single-member archive uses member filename"
- keep ``original_filename`` reserved for the physical source filename, so metadata and naming remain easier to reason about

## Non-Path Locators

The current model already leaves room for non-path locators such as URIs, but support is intentionally thin.

Examples:

- `s3://bucket/path/to/data.zarr`
- `gs://bucket/path/to/file.nc`
- other opaque references that are not local filesystem paths

Possible next steps:

- keep `locator.kind` explicit, for example `path`, `uri`, or another small set
- preserve non-path values verbatim instead of passing them through `Path(...)`
- add lightweight helpers for path-backed versus non-path-backed behavior
- consider optional higher-level integrations with tools such as `fsspec` later

Why it makes sense:

- avoids locking the model to local files only
- supports cloud or remote references without forcing a heavy abstraction layer

## Reader Hints For Collections

For collection-like artifacts, it may later be useful to attach a lightweight reader hint rather than a full plugin system.

Examples:

- `artifact_type="netcdf_monthly_series"`
- `reader_hint="xarray.open_mfdataset"`

This should stay advisory rather than becoming executable application state. The catalog can describe what something is, while higher-level code decides how to open it.
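A minimal sketch of how calling code might act on such a hint, assuming a record object with `metadata` and `locator` attributes (those names, and the helper itself, are illustrative rather than current API):

```python
import xarray as xr

def open_collection(record):
    """Open a collection-like record based on its advisory reader hint.

    The catalog only describes what the artifact is; deciding how to open it
    stays in higher-level code such as this helper. The `metadata` and
    `locator` attribute names are assumptions for illustration.
    """
    hint = record.metadata.get("reader_hint")
    if hint == "xarray.open_mfdataset":
        # Monthly file series: open all NetCDF members as one dataset.
        return xr.open_mfdataset(f"{record.locator.value}/*.nc", combine="by_coords")
    raise ValueError(f"no opener known for reader_hint={hint!r}")
```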
## Follow-Up PR: Repository-Owned Search And Record Identity

This should be treated as a focused architectural cleanup PR.

Problem statement:

- the current repository interface is too thin
- `Catalog.search(...)` currently loads `repository.all()` and filters in Python
- that leaks search policy out of the backend layer and prevents backends such as TinyDB, SQLite, or MongoDB from owning query execution
- `CatalogRecord` currently requires `id`, even though record identity is really assigned by the repository/database layer
- `allocate_record_ids()` is a workaround for that mismatch and should likely be removed

Preferred direction (a rough interface sketch appears at the end of this note):

- move search behind the repository interface
- adopt "option 1" for record identity:
  - `CatalogRecord.id` becomes optional before persistence
  - repository `insert(...)` / `insert_many(...)` assign ids
  - catalog-facing methods return persisted records with ids populated

Suggested scope:

- add a repository search method, likely reusing the current search inputs:
  - `where`
  - `contains`
  - `regex`
  - `ignore_case`
- provide a simple backend implementation for TinyDB
- update `Catalog.search(...)` to delegate to the repository instead of calling `all()` and filtering in Python
- remove `allocate_record_ids()` from the repository interface
- update `Catalog.add_file(...)`, `add_artifact(...)`, and `add_artifacts(...)` so they build records without ids and persist them through repository methods that return ids or persisted records
- keep the public `Catalog` API lightweight
- avoid introducing a large ORM-like abstraction layer

Questions to resolve in that PR:

- should repository `insert(...)` return the assigned id, or the full persisted `CatalogRecord`?
- what is the cleanest shape for batch insert return values?
- how should managed-file naming behave if templates want `{id}` before the record is persisted?
- should id-based naming remain supported, or should it become discouraged or removed from the default design?
- how much query expressiveness should the repository interface expose before it becomes too backend-specific?

Desired outcome:

- repository backends own both identity assignment and search execution
- `CatalogRecord` no longer needs caller-supplied ids
- the catalog layer becomes thinner and less tied to backend implementation details
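A rough sketch of that direction, using illustrative names and deliberately fewer fields than the real `CatalogRecord` carries:

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Protocol, Sequence

@dataclass
class CatalogRecord:
    # Illustrative shape only: the real record carries more fields than this.
    metadata: Mapping[str, object]
    id: Optional[int] = None  # stays unset until the repository persists the record

class Repository(Protocol):
    def insert(self, record: CatalogRecord) -> CatalogRecord:
        """Persist one record, assign its id, and return the persisted record."""
        ...

    def insert_many(self, records: Sequence[CatalogRecord]) -> list[CatalogRecord]:
        """Batch insert; returns records with ids populated."""
        ...

    def search(
        self,
        where: Optional[Mapping[str, object]] = None,
        contains: Optional[Mapping[str, str]] = None,
        regex: Optional[Mapping[str, str]] = None,
        ignore_case: bool = False,
    ) -> list[CatalogRecord]:
        """Backend-owned query execution reusing the current search inputs."""
        ...
```

Whether `insert(...)` should return the full persisted record or only the assigned id is one of the open questions above; returning the record keeps call sites simpler, at the cost of an extra read for backends that naturally hand back only ids.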