Architecture

ogcat is a small, spec-driven catalog for managed artifacts; in the current MVP those artifacts are managed local files. The implementation is deliberately narrow: it creates a catalog on disk, stores records in a lightweight database, ingests files by copy or move, derives a small amount of metadata when possible, and exposes search through both a Python API and a CLI.
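
A minimal usage sketch of that flow. Only Catalog, add_file(), search, and path() are named in this document; the import path, the create() classmethod, and the argument shapes below are assumptions for illustration:

  from ogcat import Catalog  # hypothetical import path

  cat = Catalog.create("/data/my-catalog")       # assumed constructor; writes catalog.json, db.json, files/
  rec = cat.add_file("/tmp/run42.nc",            # managed ingest by copy or move
                     user_metadata={"project": "demo"})
  hits = cat.search({"project": "demo"})         # assumed query shape: exact equality
  print(cat.path(rec))                           # managed path, or None if not path-backed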

Current Architecture

The main pieces are:

  • CatalogSpec: the serialisable catalog definition stored in catalog.json

  • Catalog: the user-facing API for creating, opening, adding, searching, and resolving record paths

  • CatalogRepository: a protocol for record storage

  • TinyDbCatalogRepository: the current repository implementation

  • CatalogRecord: the stored record model

  • naming helpers: template rendering for managed storage paths (see the sketch after this list)

  • extractors: optional post-ingest metadata extraction

  • CLI: a thin wrapper around the catalog API
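
The naming helpers deserve a quick illustration. A minimal sketch of template rendering for managed storage paths, assuming str.format-style placeholders and example field names that this document does not specify:

  from pathlib import Path

  def render_storage_path(root: Path, template: str, fields: dict) -> Path:
      # e.g. "files/{record_type}/{name}" -> <catalog-root>/files/managed_file/run42.nc
      return root / template.format(**fields)

  print(render_storage_path(Path("/data/my-catalog"),
                            "files/{record_type}/{name}",
                            {"record_type": "managed_file", "name": "run42.nc"}))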

The catalog root is self-describing:

<catalog-root>/
  catalog.json
  db.json
  files/

catalog.json defines how the catalog behaves. db.json stores records. files/ is the managed storage root for ingested files.

First Pass Artifact Generalisation

The record model now separates two ideas that were previously collapsed into a single stored path:

  • record_type: what the catalogued thing is, such as managed_file

  • locator: how that thing is located, such as a local path or a future URI

This is intentionally small. The goal is not to introduce a full abstraction framework, only to stop the internal model from permanently assuming that every record is a copied or moved local file.
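
A sketch of the resulting record shape. The field names come from this document; the dataclass form and the locator structure are assumptions:

  from dataclasses import dataclass, field

  @dataclass
  class CatalogRecord:
      record_type: str                    # what the thing is, e.g. "managed_file"
      locator: dict                       # how it is located, e.g. {"kind": "path", "path": "..."}
      stored_abspath: str | None = None   # compatibility field for today's APIs
      stored_relpath: str | None = None   # compatibility field for today's CLI
      user_metadata: dict = field(default_factory=dict)     # supplied at ingest
      derived_metadata: dict = field(default_factory=dict)  # extracted after ingest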

For the current MVP:

  • add_file() still performs managed ingest by copy or move

  • managed file records store a path locator

  • compatibility fields stored_abspath and stored_relpath remain present for today’s APIs and CLI

  • Catalog.path() resolves only path-backed records and returns None for records that are missing or not path-backed

This leaves room for later work on managed directory-like stores, external references, and pre-allocated transform targets without forcing those features into the first pass.
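
Continuing the earlier usage sketch, the path() contract looks like this; the pathlib return type is an assumption:

  p = cat.path(rec)
  if p is None:
      print("record is missing or not path-backed")  # e.g. a future URI locator
  else:
      data = p.read_bytes()  # assumes path() returns a pathlib.Path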

Why Hide TinyDB Behind a Repository Abstraction

TinyDB is the current storage backend, but the rest of the package depends on a repository protocol rather than on TinyDB APIs directly. This keeps the catalog logic focused on records and search semantics instead of persistence details.

The abstraction is useful even in the current small codebase:

  • it keeps Catalog independent from backend-specific query and update code

  • it makes tests simpler because record operations are expressed in terms of the protocol

  • it leaves room for backend changes later without rewriting the public catalog API

This is not a claim that multiple backends already exist. It only means the package boundary is already in place.
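
A sketch of what that boundary might look like. CatalogRepository is named in this document; the specific method set is an assumption:

  from typing import Iterable, Optional, Protocol

  class CatalogRepository(Protocol):
      def insert(self, record: dict) -> str: ...            # returns a record id
      def get(self, record_id: str) -> Optional[dict]: ...
      def delete(self, record_id: str) -> None: ...
      def search(self, predicate) -> Iterable[dict]: ...    # backend-neutral query

Catalog programs against this protocol, and TinyDbCatalogRepository keeps the TinyDB query and update calls behind it.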

Transaction Boundaries And Rollback

Catalog writes use a lightweight unit-of-work helper for multi-step operations such as managed file ingest. With the current TinyDB backend this is a best-effort rollback mechanism based on compensating actions: staged records can be deleted and owned copied files can be removed if a later step fails. It is not a true database transaction and should not be described as ACID.

Each unit of work exposes an operation_id so future audit logging or hooks can correlate staged record writes, storage activity, and cleanup. Stronger backends can map the same conceptual API to native transaction support later.
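
A sketch of the compensating-action idea, assuming this class shape; only operation_id and the best-effort rollback behaviour are described in this document:

  import uuid

  class UnitOfWork:
      def __init__(self):
          self.operation_id = uuid.uuid4().hex  # correlates record writes, storage activity, cleanup
          self._undo = []                       # compensating actions, applied LIFO

      def did(self, undo_action):
          self._undo.append(undo_action)

      def __enter__(self):
          return self

      def __exit__(self, exc_type, exc, tb):
          if exc_type is not None:
              for action in reversed(self._undo):  # best-effort, not ACID
                  action()
          return False  # re-raise the original error

Ingest would then register an undo step after each stage, for example deleting the staged record or removing the owned copied file, so a failure midway unwinds what was already done.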

Why Templates and Metadata Live in catalog.json

The default schema, optional named record schemas, default operation, and field resolution order are stored in catalog.json so a catalog remains self-describing on disk.
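
An illustrative catalog.json; the key names below are assumptions, only the four categories of content are stated in this document:

  {
    "default_schema": {"required": ["project"]},
    "record_schemas": {"model_run": {"required": ["project", "run_id"]}},
    "default_operation": "copy",
    "field_resolution_order": ["top_level", "user_metadata", "derived_metadata"]
  }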

That choice has a few practical benefits:

  • a catalog can be opened without separate application configuration

  • stored records can be interpreted in the context of the catalog that produced them

  • schema-level metadata field descriptions travel with the catalog instead of being hard-coded elsewhere

Schemas are intentionally lightweight. Required metadata fields are checked at ingest, but there is no deep type validation or domain-specific schema language in the catalog core.
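
A sketch of that shallow check, assuming a "required" key in the schema; the function and key names are illustrative:

  def check_required_fields(schema: dict, user_metadata: dict) -> None:
      # presence only: no type validation, no domain-specific schema language
      missing = [f for f in schema.get("required", []) if f not in user_metadata]
      if missing:
          raise ValueError(f"missing required metadata fields: {missing}")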

User Metadata and Derived Metadata

Each record separates metadata into distinct areas:

  • top-level reserved fields for catalog bookkeeping

  • user_metadata for metadata supplied at ingest time

  • derived_metadata for metadata extracted from the stored file after ingest

This separation matters for both clarity and search behaviour. Unqualified field lookup resolves in this order:

  1. top-level record fields

  2. user_metadata

  3. derived_metadata

If a caller wants an exact nested location, a dotted path bypasses the flattened lookup, for example user_metadata.product.family.name or derived_metadata.netcdf.dims.time.
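
A sketch of that lookup behaviour over a record dict; the helper is illustrative, not the package's API:

  def lookup(record: dict, key: str):
      if "." in key:  # a dotted path pins an exact nested location
          node = record
          for part in key.split("."):
              if not isinstance(node, dict) or part not in node:
                  return None
              node = node[part]
          return node
      # flattened lookup: top-level fields, then user_metadata, then derived_metadata
      for scope in (record,
                    record.get("user_metadata", {}),
                    record.get("derived_metadata", {})):
          if key in scope:
              return scope[key]
      return None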

The current derived metadata layer is intentionally small. For netCDF files, if xarray is installed, the extractor records a compact summary of dimensions, variables, coordinates, and selected attributes.
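
A sketch of that extractor, assuming these summary keys; the xarray calls are standard, the rest is illustrative:

  def extract_netcdf(path: str) -> dict | None:
      try:
          import xarray as xr
      except ImportError:
          return None  # extraction is optional; xarray is not a hard dependency
      with xr.open_dataset(path) as ds:
          return {
              "dims": dict(ds.sizes),    # e.g. {"time": 120, "lat": 180}
              "vars": list(ds.data_vars),
              "coords": list(ds.coords),
              "attrs": {k: str(v) for k, v in ds.attrs.items()},  # attribute summary
          }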

Current Limitations

The present architecture is intentionally constrained.

  • richer locator handling is still future work; today the generalisation is intentionally minimal

  • TinyDB is the only supported backend

  • metadata validation is intentionally shallow and limited to required-field presence

  • search supports exact equality, contains, and regex matching only

  • there are no reader hooks, manager APIs, or import/scan workflows yet

  • extractor support is limited and should not be described as a general reader framework

These limitations are deliberate. The package is currently a compact catalog core, not a full data management system.