Architecture

ogcat is a small, spec-driven catalog for managed artifacts; in the current MVP those artifacts are managed local files. The implementation is deliberately narrow: it creates a catalog on disk, stores records in a lightweight database, ingests files by copy or move, derives a small amount of metadata when possible, and exposes search through both a Python API and a CLI.
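
A minimal usage sketch of that flow. Only Catalog, add_file(), search, and path() are named in this document; the import path, the create() classmethod, and the argument shapes below are assumptions for illustration:

  from ogcat import Catalog  # hypothetical import path

  cat = Catalog.create("/data/my-catalog")       # assumed constructor; writes catalog.json, db.json, files/
  rec = cat.add_file("/tmp/run42.nc",            # managed ingest by copy or move
                     user_metadata={"project": "demo"})
  hits = cat.search({"project": "demo"})         # assumed query shape: exact equality
  print(cat.path(rec))                           # managed path, or None if not path-backed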

Current Architecture

The main pieces are:

  • CatalogSpec: the serialisable catalog definition stored in catalog.json

  • Catalog: the user-facing API for creating, opening, adding, searching, and resolving record paths

  • CatalogRepository: a protocol for record storage

  • TinyDbCatalogRepository: the current repository implementation

  • CatalogRecord: the stored record model

  • naming helpers: template rendering for managed storage paths (see the sketch after this list)

  • extractors: optional post-ingest metadata extraction

  • CLI: a thin wrapper around the catalog API
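
The naming helpers deserve a quick illustration. A minimal sketch of template rendering for managed storage paths, assuming str.format-style placeholders and example field names that this document does not specify:

  from pathlib import Path

  def render_storage_path(root: Path, template: str, fields: dict) -> Path:
      # e.g. "files/{record_type}/{name}" -> <catalog-root>/files/managed_file/run42.nc
      return root / template.format(**fields)

  print(render_storage_path(Path("/data/my-catalog"),
                            "files/{record_type}/{name}",
                            {"record_type": "managed_file", "name": "run42.nc"}))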

The catalog root is self-describing:

<catalog-root>/
  catalog.json
  db.json
  files/

catalog.json defines how the catalog behaves. db.json stores records. files/ is the managed storage root for ingested files.

First Pass Artifact Generalisation

The record model now separates two ideas that were previously collapsed into a single stored path:

  • record_type: what the catalogued thing is, such as managed_file

  • locator: how that thing is located, such as a local path or a future URI

This is intentionally small. The goal is not to introduce a full abstraction framework, only to stop the internal model from permanently assuming that every record is a copied or moved local file.
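
A sketch of the resulting record shape. The field names come from this document; the dataclass form and the locator structure are assumptions:

  from dataclasses import dataclass, field

  @dataclass
  class CatalogRecord:
      record_type: str                    # what the thing is, e.g. "managed_file"
      locator: dict                       # how it is located, e.g. {"kind": "path", "path": "..."}
      stored_abspath: str | None = None   # compatibility field for today's APIs
      stored_relpath: str | None = None   # compatibility field for today's CLI
      user_metadata: dict = field(default_factory=dict)     # supplied at ingest
      derived_metadata: dict = field(default_factory=dict)  # extracted after ingest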

For the current MVP:

  • add_file() still performs managed ingest by copy or move

  • managed file records store a path locator

  • compatibility fields stored_abspath and stored_relpath remain present for today’s APIs and CLI

  • Catalog.path() resolves only path-backed records and returns None for records that are missing or not path-backed

This leaves room for later work on managed directory-like stores, external references, and pre-allocated transform targets without forcing those features into the first pass.
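
Continuing the earlier usage sketch, the path() contract looks like this; the pathlib return type is an assumption:

  p = cat.path(rec)
  if p is None:
      print("record is missing or not path-backed")  # e.g. a future URI locator
  else:
      data = p.read_bytes()  # assumes path() returns a pathlib.Path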

Why Hide TinyDB Behind a Repository Abstraction

TinyDB is the current storage backend, but the rest of the package depends on a repository protocol rather than on TinyDB APIs directly. This keeps the catalog logic focused on records and search semantics instead of persistence details.

The abstraction is useful even in the current small codebase:

  • it keeps Catalog independent from backend-specific query and update code

  • it makes tests simpler because record operations are expressed in terms of the protocol

  • it leaves room for backend changes later without rewriting the public catalog API

This is not a claim that multiple backends already exist. It only means the package boundary is already in place.
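
A sketch of what that boundary might look like. CatalogRepository is named in this document; the specific method set is an assumption:

  from typing import Iterable, Optional, Protocol

  class CatalogRepository(Protocol):
      def insert(self, record: dict) -> str: ...            # returns a record id
      def get(self, record_id: str) -> Optional[dict]: ...
      def delete(self, record_id: str) -> None: ...
      def search(self, predicate) -> Iterable[dict]: ...    # backend-neutral query

Catalog programs against this protocol, and TinyDbCatalogRepository keeps the TinyDB query and update calls behind it.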

Transaction Boundaries And Rollback

Catalog writes use a lightweight unit-of-work helper for multi-step operations such as managed file ingest. With the current TinyDB backend this is a best-effort rollback mechanism based on compensating actions: staged records can be deleted and owned copied files can be removed if a later step fails. It is not a true database transaction and should not be described as ACID.

Each unit of work exposes an operation_id so future audit logging or hooks can correlate staged record writes, storage activity, and cleanup. Stronger backends can map the same conceptual API to native transaction support later.
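
A sketch of the compensating-action idea, assuming this class shape; only operation_id and the best-effort rollback behaviour are described in this document:

  import uuid

  class UnitOfWork:
      def __init__(self):
          self.operation_id = uuid.uuid4().hex  # correlates record writes, storage activity, cleanup
          self._undo = []                       # compensating actions, applied LIFO

      def did(self, undo_action):
          self._undo.append(undo_action)

      def __enter__(self):
          return self

      def __exit__(self, exc_type, exc, tb):
          if exc_type is not None:
              for action in reversed(self._undo):  # best-effort, not ACID
                  action()
          return False  # re-raise the original error

Ingest would then register an undo step after each stage, for example deleting the staged record or removing the owned copied file, so a failure midway unwinds what was already done.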

Why Templates and Metadata Live in catalog.json

The default schema, optional named record schemas, default operation, and field resolution order are stored in catalog.json so a catalog remains self-describing on disk.
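
An illustrative catalog.json; the key names below are assumptions, only the four categories of content are stated in this document:

  {
    "default_schema": {"required": ["project"]},
    "record_schemas": {"model_run": {"required": ["project", "run_id"]}},
    "default_operation": "copy",
    "field_resolution_order": ["top_level", "user_metadata", "derived_metadata"]
  }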

That choice has a few practical benefits:

  • a catalog can be opened without separate application configuration

  • stored records can be interpreted in the context of the catalog that produced them

  • schema-level metadata field descriptions travel with the catalog instead of being hard-coded elsewhere

Schemas are intentionally lightweight. Required metadata fields are checked at ingest, but there is no deep type validation or domain-specific schema language in the catalog core.
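
A sketch of that shallow check, assuming a "required" key in the schema; the function and key names are illustrative:

  def check_required_fields(schema: dict, user_metadata: dict) -> None:
      # presence only: no type validation, no domain-specific schema language
      missing = [f for f in schema.get("required", []) if f not in user_metadata]
      if missing:
          raise ValueError(f"missing required metadata fields: {missing}")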

User Metadata and Derived Metadata

Each record separates metadata into distinct areas:

  • top-level reserved fields for catalog bookkeeping

  • user_metadata for metadata supplied at ingest time

  • derived_metadata for metadata extracted from the stored file after ingest

This separation matters for both clarity and search behaviour. Unqualified field lookup resolves in this order:

  1. top-level record fields

  2. user_metadata

  3. derived_metadata

If a caller wants an exact nested location, a dotted path bypasses the flattened lookup, for example user_metadata.product.family.name or derived_metadata.netcdf.dims.time.
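
A sketch of that lookup behaviour over a record dict; the helper is illustrative, not the package's API:

  def lookup(record: dict, key: str):
      if "." in key:  # a dotted path pins an exact nested location
          node = record
          for part in key.split("."):
              if not isinstance(node, dict) or part not in node:
                  return None
              node = node[part]
          return node
      # flattened lookup: top-level fields, then user_metadata, then derived_metadata
      for scope in (record,
                    record.get("user_metadata", {}),
                    record.get("derived_metadata", {})):
          if key in scope:
              return scope[key]
      return None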

The current derived metadata layer is intentionally small. For netCDF files, if xarray is installed, the extractor records a compact summary of dimensions, variables, coordinates, and selected attributes.
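
A sketch of that extractor, assuming these summary keys; the xarray calls are standard, the rest is illustrative:

  def extract_netcdf(path: str) -> dict | None:
      try:
          import xarray as xr
      except ImportError:
          return None  # extraction is optional; xarray is not a hard dependency
      with xr.open_dataset(path) as ds:
          return {
              "dims": dict(ds.sizes),    # e.g. {"time": 120, "lat": 180}
              "vars": list(ds.data_vars),
              "coords": list(ds.coords),
              "attrs": {k: str(v) for k, v in ds.attrs.items()},  # attribute summary
          }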

Current Limitations

The present architecture is intentionally constrained.

  • richer locator handling is still future work; today the generalisation is intentionally minimal

  • TinyDB is the only supported backend

  • metadata validation is intentionally shallow and limited to required-field presence

  • search supports exact equality, contains, and regex matching only

  • there are no reader hooks, manager APIs, or import/scan workflows yet

  • extractor support is limited and should not be described as a general reader framework

These limitations are deliberate. The package is currently a compact catalog core, not a full data management system.