# Architecture

`ogcat` is a small, spec-driven catalog for managed artifacts, with managed local files as the current MVP. The current implementation is deliberately narrow: it creates a catalog on disk, stores records in a lightweight database, ingests files by copy or move, derives a small amount of metadata when possible, and exposes search through both Python and the CLI.

## Current Architecture

The main pieces are:

- `CatalogSpec`: the serialisable catalog definition stored in `catalog.json`
- `Catalog`: the user-facing API for creating, opening, adding, searching, and resolving record paths
- `CatalogRepository`: a protocol for record storage
- `TinyDbCatalogRepository`: the current repository implementation
- `CatalogRecord`: the stored record model
- naming helpers: template rendering for managed storage paths
- extractors: optional post-ingest metadata extraction
- CLI: a thin wrapper around the catalog API

The catalog root is self-describing:

```text
/
  catalog.json
  db.json
  files/
```

`catalog.json` defines how the catalog behaves. `db.json` stores records. `files/` is the managed storage root for ingested files.

## First Pass Artifact Generalisation

The record model now separates two ideas that were previously collapsed into a single stored path:

- `record_type`: what the catalogued thing is, such as `managed_file`
- `locator`: how that thing is located, such as a local `path` or a future URI

This is intentionally small. The goal is not to introduce a full abstraction framework, only to stop the internal model from assuming every record is a copied or moved local file forever.
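The `record_type`/`locator` split can be pictured with a minimal record sketch. This is illustrative only: the field names beyond `record_type` and `locator`, and the dict-shaped locator, are assumptions rather than the actual `CatalogRecord` definition.

```python
from dataclasses import dataclass, field


@dataclass
class RecordSketch:
    """Illustrative sketch of the record split; not ogcat's real model."""
    record_id: str      # hypothetical bookkeeping field
    record_type: str    # what the thing is, e.g. "managed_file"
    locator: dict       # how it is located, e.g. {"kind": "path", "value": "..."}
    user_metadata: dict = field(default_factory=dict)
    derived_metadata: dict = field(default_factory=dict)


# A managed-file record stores a local path locator today; a future record
# type could carry a URI locator instead without changing the model shape.
rec = RecordSketch(
    record_id="r1",
    record_type="managed_file",
    locator={"kind": "path", "value": "files/2024/sample.nc"},
)
print(rec.locator["kind"])  # path
```

The point of the sketch is that nothing outside the locator needs to know whether the record is backed by a local file or something else.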
For the current MVP:

- `add_file()` still performs managed ingest by copy or move
- managed file records store a `path` locator
- compatibility fields `stored_abspath` and `stored_relpath` remain present for today's APIs and CLI
- `Catalog.path()` resolves only path-backed records and returns `None` for records that are missing or not path-backed

This leaves room for later work on managed directory-like stores, external references, and pre-allocated transform targets without forcing those features into the first pass.

## Why Hide TinyDB Behind a Repository Abstraction

TinyDB is the current storage backend, but the rest of the package depends on a repository protocol rather than on TinyDB APIs directly. This keeps the catalog logic focused on records and search semantics instead of persistence details.

The abstraction is useful even in the current small codebase:

- it keeps `Catalog` independent from backend-specific query and update code
- it makes tests simpler because record operations are expressed in terms of the protocol
- it leaves room for backend changes later without rewriting the public catalog API

This is not a claim that multiple backends already exist. It only means the package boundary is already in place.

## Transaction Boundaries and Rollback

Catalog writes use a lightweight unit-of-work helper for multi-step operations such as managed file ingest. With the current TinyDB backend this is a best-effort rollback mechanism based on compensating actions: staged records can be deleted, and owned copied files can be removed, if a later step fails. It is not a true database transaction and should not be described as ACID.

Each unit of work exposes an `operation_id` so future audit logging or hooks can correlate staged record writes, storage activity, and cleanup. Stronger backends can map the same conceptual API to native transaction support later.
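A compensating-action unit of work can be sketched as follows. `operation_id` comes from the text above, but the class shape, method names, and context-manager usage are assumptions, not ogcat's actual helper.

```python
import uuid


class UnitOfWork:
    """Sketch of best-effort rollback via compensating actions (assumed API)."""

    def __init__(self):
        self.operation_id = uuid.uuid4().hex  # correlates writes, storage work, cleanup
        self._compensations = []              # undo callbacks, run newest-first on failure

    def register(self, undo):
        self._compensations.append(undo)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # Best-effort rollback: run compensations in reverse order.
            for undo in reversed(self._compensations):
                try:
                    undo()
                except Exception:
                    pass  # rollback itself is best-effort, not transactional
        return False  # re-raise the original error


# Usage sketch: stage a record, then fail a later step; the staged
# record is removed by its registered compensation.
db = {"r1": {"status": "staged"}}
try:
    with UnitOfWork() as uow:
        uow.register(lambda: db.pop("r1", None))  # delete staged record on failure
        raise IOError("copy failed")              # simulate a failed ingest step
except IOError:
    pass
print("r1" in db)  # False: the staged record was rolled back
```

Because each compensation only undoes a step that already succeeded, a crash between steps can still leave partial state; that is exactly why the text above avoids calling this ACID.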
## Why Templates and Metadata Live in `catalog.json`

The default schema, optional named record schemas, default operation, and field resolution order are stored in `catalog.json` so a catalog remains self-describing on disk. That choice has a few practical benefits:

- a catalog can be opened without separate application configuration
- stored records can be interpreted in the context of the catalog that produced them
- schema-level metadata field descriptions travel with the catalog instead of being hard-coded elsewhere

Schemas are intentionally lightweight. Required metadata fields are checked at ingest, but there is no deep type validation or domain-specific schema language in the catalog core.

## User Metadata and Derived Metadata

Each record separates metadata into distinct areas:

- top-level reserved fields for catalog bookkeeping
- `user_metadata` for metadata supplied at ingest time
- `derived_metadata` for metadata extracted from the stored file after ingest

This separation matters for both clarity and search behavior. Unqualified field lookup resolves in this order:

1. top-level record fields
2. `user_metadata`
3. `derived_metadata`

If a caller wants an exact nested location, dotted paths bypass the flattened lookup, for example `user_metadata.product.family.name` or `derived_metadata.netcdf.dims.time`.

The current derived metadata layer is intentionally small. For netCDF files, if `xarray` is installed, the extractor records a compact summary of dimensions, variables, coordinates, and selected attributes.

## Current Limitations

The present architecture is intentionally constrained.
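The resolution order and the dotted-path bypass can be sketched in a few lines. The function name and the dict-shaped record here are illustrative, not ogcat's implementation; only the lookup order itself comes from the text above.

```python
def resolve_field(record, name):
    """Sketch of unqualified lookup order plus dotted-path bypass.

    Mirrors the documented order: top-level fields, then user_metadata,
    then derived_metadata. Dotted names walk nested dicts from the root.
    """
    if "." in name:
        # Dotted path: an exact nested location, bypassing the fallback order.
        node = record
        for part in name.split("."):
            if not isinstance(node, dict) or part not in node:
                return None
            node = node[part]
        return node
    # Unqualified: first scope that contains the name wins.
    for scope in (record,
                  record.get("user_metadata", {}),
                  record.get("derived_metadata", {})):
        if name in scope:
            return scope[name]
    return None


record = {
    "record_type": "managed_file",
    "user_metadata": {"product": {"family": {"name": "ocean"}}},
    "derived_metadata": {"netcdf": {"dims": {"time": 12}}},
}
print(resolve_field(record, "record_type"))                        # managed_file
print(resolve_field(record, "user_metadata.product.family.name"))  # ocean
print(resolve_field(record, "derived_metadata.netcdf.dims.time"))  # 12
```

A consequence of this order is that a top-level field always shadows a same-named key in `user_metadata`, which is when the dotted form becomes necessary.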
- richer locator handling is still future work; today the generalisation is intentionally minimal
- TinyDB is the only supported backend
- metadata validation is intentionally shallow and limited to required-field presence
- search supports exact equality, contains, and regex matching only
- there are no reader hooks, manager APIs, or import or scan workflows yet
- extractor support is limited and should not be described as a general reader framework

These limitations are deliberate. The package is currently a compact catalog core, not a full data management system.