Architecture¶
ogcat is a small, spec-driven catalog for managed artifacts, with managed local files as the
current MVP. The implementation is deliberately narrow: it creates a catalog on disk, stores
records in a lightweight database, ingests files by copy or move, derives a small amount of
metadata when possible, and exposes search through both a Python API and a CLI.
Current Architecture¶
The main pieces are:
- `CatalogSpec`: the serialisable catalog definition stored in `catalog.json`
- `Catalog`: the user-facing API for creating, opening, adding, searching, and resolving record paths
- `CatalogRepository`: a protocol for record storage
- `TinyDbCatalogRepository`: the current repository implementation
- `CatalogRecord`: the stored record model
- naming helpers: template rendering for managed storage paths
- extractors: optional post-ingest metadata extraction
- CLI: a thin wrapper around the catalog API
The catalog root is self-describing:
```
<catalog-root>/
  catalog.json
  db.json
  files/
```
`catalog.json` defines how the catalog behaves. `db.json` stores records. `files/` is the managed storage root for ingested files.
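A minimal sketch of bootstrapping that layout; the helper name and the spec fields used here are illustrative assumptions, not the real `CatalogSpec` API:

```python
import json
import tempfile
from pathlib import Path


def create_catalog_root(root: Path, spec: dict) -> None:
    """Create the self-describing on-disk layout: spec, record store, file root."""
    root.mkdir(parents=True, exist_ok=True)
    (root / "catalog.json").write_text(json.dumps(spec, indent=2))  # catalog definition
    (root / "db.json").write_text("{}")                             # empty record store
    (root / "files").mkdir(exist_ok=True)                           # managed storage root


# Usage: build a throwaway catalog root and inspect it.
root = Path(tempfile.mkdtemp()) / "demo-catalog"
create_catalog_root(root, {"name": "demo", "default_operation": "copy"})
print(sorted(p.name for p in root.iterdir()))  # ['catalog.json', 'db.json', 'files']
```

Because the root carries its own definition, any process that can read these three entries can open the catalog without external configuration.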
First Pass Artifact Generalisation¶
The record model now separates two ideas that were previously collapsed into one stored path:
- `record_type`: what the catalogued thing is, such as `managed_file`
- `locator`: how that thing is located, such as a local `path` or a future URI
This is intentionally small. The goal is not to introduce a full abstraction framework, only to stop the internal model from assuming every record is a copied or moved local file forever.
For the current MVP:
- `add_file()` still performs managed ingest by copy or move
- managed file records store a `path` locator
- the compatibility fields `stored_abspath` and `stored_relpath` remain present for today's APIs and CLI
- `Catalog.path()` resolves only path-backed records and returns `None` for records that are missing or not path-backed
This leaves room for later work on managed directory-like stores, external references, and pre-allocated transform targets without forcing those features into the first pass.
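The record_type/locator split can be illustrated with a simplified record model; the field layout and locator shape below are assumptions for the sketch, not the real `CatalogRecord` definition:

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional


@dataclass
class CatalogRecord:
    """Simplified record: *what* it is (record_type) vs *how* it is located (locator)."""
    record_id: str
    record_type: str                               # e.g. "managed_file"
    locator: dict = field(default_factory=dict)    # e.g. {"kind": "path", "relpath": "files/a.txt"}


def resolve_path(record: CatalogRecord, catalog_root: Path) -> Optional[Path]:
    """Mirror the Catalog.path() contract: only path-backed records resolve,
    and missing or non-path-backed records yield None."""
    if record.locator.get("kind") != "path":
        return None
    candidate = catalog_root / record.locator["relpath"]
    return candidate if candidate.exists() else None


# Usage: one managed file and one URI-backed record against a throwaway root.
root = Path(tempfile.mkdtemp())
(root / "files").mkdir()
(root / "files" / "a.txt").write_text("payload")

managed = CatalogRecord("r1", "managed_file", {"kind": "path", "relpath": "files/a.txt"})
external = CatalogRecord("r2", "external_ref", {"kind": "uri", "uri": "s3://bucket/key"})
print(resolve_path(managed, root))   # .../files/a.txt
print(resolve_path(external, root))  # None
```

The point of the sketch is that a future URI-backed record is a perfectly valid record; it simply does not resolve to a local path.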
Why Hide TinyDB Behind a Repository Abstraction¶
TinyDB is the current storage backend, but the rest of the package depends on a repository protocol rather than on TinyDB APIs directly. This keeps the catalog logic focused on records and search semantics instead of persistence details.
The abstraction is useful even in the current small codebase:
- it keeps `Catalog` independent from backend-specific query and update code
- it makes tests simpler because record operations are expressed in terms of the protocol
- it leaves room for backend changes later without rewriting the public catalog API
This is not a claim that multiple backends already exist. It only means the package boundary is already in place.
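A sketch of that boundary, assuming illustrative method names rather than the real protocol signatures:

```python
from typing import Iterable, Protocol


class CatalogRepository(Protocol):
    """The boundary Catalog depends on; method names here are illustrative."""
    def insert(self, record: dict) -> None: ...
    def remove(self, record_id: str) -> None: ...
    def all(self) -> Iterable[dict]: ...


class InMemoryRepository:
    """Toy backend, used exactly the way TinyDbCatalogRepository would be."""
    def __init__(self) -> None:
        self._records: dict = {}

    def insert(self, record: dict) -> None:
        self._records[record["record_id"]] = record

    def remove(self, record_id: str) -> None:
        self._records.pop(record_id, None)

    def all(self) -> Iterable[dict]:
        return list(self._records.values())


def count_records(repo: CatalogRepository) -> int:
    """Catalog-side code sees only the protocol, never a concrete backend."""
    return sum(1 for _ in repo.all())


repo = InMemoryRepository()
repo.insert({"record_id": "r1", "record_type": "managed_file"})
print(count_records(repo))  # 1
```

Because `count_records` is typed against the protocol, swapping the in-memory backend for a TinyDB-backed one would not change any calling code.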
Transaction Boundaries And Rollback¶
Catalog writes use a lightweight unit-of-work helper for multi-step operations such as managed file ingest. With the current TinyDB backend this is a best-effort rollback mechanism based on compensating actions: staged records can be deleted and owned copied files can be removed if a later step fails. It is not a true database transaction and should not be described as ACID.
Each unit of work exposes an `operation_id` so future audit logging or hooks can correlate staged
record writes, storage activity, and cleanup. Stronger backends can map the same conceptual API to
native transaction support later.
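The compensating-action pattern can be sketched as follows; the class and method names are illustrative, not the real helper's API:

```python
import uuid
from typing import Callable, List


class UnitOfWork:
    """Best-effort rollback via compensating actions (not an ACID transaction)."""
    def __init__(self) -> None:
        self.operation_id = uuid.uuid4().hex       # correlates writes, storage, cleanup
        self._undo: List[Callable[[], None]] = []

    def did(self, undo: Callable[[], None]) -> None:
        """Register the compensating action for a step that just succeeded."""
        self._undo.append(undo)

    def rollback(self) -> None:
        for undo in reversed(self._undo):          # undo completed steps in reverse order
            undo()


# Usage: a two-step ingest where the second step fails.
store: dict = {}
uow = UnitOfWork()
try:
    store["record"] = {"status": "staged"}
    uow.did(lambda: store.pop("record", None))     # compensate: delete the staged record
    raise IOError("copy failed")                   # simulated storage failure
except IOError:
    uow.rollback()
print(store)  # {} -- the staged record was removed
```

A backend with native transactions could map `did`/`rollback` onto a real transaction while keeping the same conceptual API.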
Why Templates and Metadata Live in catalog.json¶
The default schema, optional named record schemas, default operation, and field resolution order
are stored in `catalog.json` so a catalog remains self-describing on disk.
That choice has a few practical benefits:
- a catalog can be opened without separate application configuration
- stored records can be interpreted in the context of the catalog that produced them
- schema-level metadata field descriptions travel with the catalog instead of being hard-coded elsewhere
Schemas are intentionally lightweight. Required metadata fields are checked at ingest, but there is no deep type validation or domain-specific schema language in the catalog core.
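A sketch of what that shallow check amounts to at ingest time (the function name is hypothetical):

```python
from typing import List


def check_required_fields(user_metadata: dict, required: List[str]) -> None:
    """Shallow validation: presence only, no type or domain-specific checks."""
    missing = [name for name in required if name not in user_metadata]
    if missing:
        raise ValueError(f"missing required metadata fields: {missing}")


# Usage: the first call passes silently, the second reports what is absent.
check_required_fields({"project": "demo", "owner": "me"}, ["project", "owner"])
try:
    check_required_fields({"project": "demo"}, ["project", "owner"])
except ValueError as exc:
    print(exc)  # missing required metadata fields: ['owner']
```

Anything beyond presence, such as value types or controlled vocabularies, is deliberately out of scope for the catalog core.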
User Metadata and Derived Metadata¶
Each record separates metadata into distinct areas:
- top-level reserved fields for catalog bookkeeping
- `user_metadata` for metadata supplied at ingest time
- `derived_metadata` for metadata extracted from the stored file after ingest
This separation matters for both clarity and search behavior. Unqualified field lookup resolves in this order:
1. top-level record fields
2. `user_metadata`
3. `derived_metadata`
If a caller wants an exact nested location, dotted paths can bypass flattened lookup, for example `user_metadata.product.family.name` or `derived_metadata.netcdf.dims.time`.
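The resolution order and the dotted-path bypass can be sketched over a plain-dict record; the `lookup` helper is illustrative, not the real search code:

```python
from typing import Any

_MISSING = object()  # sentinel so stored None values are still "found"


def lookup(record: dict, name: str) -> Any:
    """Unqualified names resolve top-level -> user_metadata -> derived_metadata;
    dotted paths name an exact nested location and bypass that flattening."""
    if "." in name:  # e.g. "derived_metadata.netcdf.dims.time"
        node: Any = record
        for part in name.split("."):
            if not isinstance(node, dict) or part not in node:
                return None
            node = node[part]
        return node
    for scope in (record, record.get("user_metadata", {}), record.get("derived_metadata", {})):
        value = scope.get(name, _MISSING)
        if value is not _MISSING:
            return value
    return None


record = {
    "record_id": "r1",
    "user_metadata": {"source": "user"},
    "derived_metadata": {"source": "derived", "netcdf": {"dims": {"time": 120}}},
}
print(lookup(record, "source"))                             # 'user' -- user_metadata wins
print(lookup(record, "derived_metadata.netcdf.dims.time"))  # 120
```

Note that when the same key appears in both areas, the unqualified lookup returns the `user_metadata` value, and only a dotted path reaches the derived copy.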
The current derived metadata layer is intentionally small. For netCDF files, if xarray is installed, the extractor records a compact summary of dimensions, variables, coordinates, and selected attributes.
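A hedged sketch of how such an optional extractor can degrade gracefully; the function name and the exact shape of the summary are assumptions, not the real extractor:

```python
from pathlib import Path
from typing import Optional


def extract_netcdf_summary(path: Path) -> Optional[dict]:
    """Best-effort extraction: return a compact summary, or None when xarray
    is unavailable or the file cannot be read. Never fatal to ingest."""
    try:
        import xarray as xr  # optional dependency
    except ImportError:
        return None
    try:
        with xr.open_dataset(path) as ds:
            return {
                "dims": dict(ds.sizes),
                "variables": sorted(ds.data_vars),
                "coords": sorted(ds.coords),
                "attrs": {k: str(v) for k, v in list(ds.attrs.items())[:10]},
            }
    except Exception:
        return None  # unreadable file: simply record no derived metadata


# With or without xarray installed, a missing file yields None rather than an error.
print(extract_netcdf_summary(Path("does-not-exist.nc")))  # None
```

Keeping extraction best-effort means a record is always created at ingest; derived metadata is an enrichment, not a requirement.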
Current Limitations¶
The present architecture is intentionally constrained.
- richer locator handling is still future work; today the generalisation is intentionally minimal
- TinyDB is the only supported backend
- metadata validation is intentionally shallow and limited to required-field presence
- search supports exact equality, contains, and regex matching only
- there are no reader hooks, manager APIs, or import or scan workflows yet
- extractor support is limited and should not be described as a general reader framework
These limitations are deliberate. The package is currently a compact catalog core, not a full data management system.
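The three supported match modes (exact equality, contains, regex) can be sketched over plain-dict records; this ignores the flattened field-resolution order described earlier and matches one top-level field, and the names are illustrative:

```python
import re
from typing import Iterable, List


def search(records: Iterable[dict], field: str, value: str, mode: str = "eq") -> List[dict]:
    """Exact equality, substring contains, or regex match on one field."""
    hits = []
    for record in records:
        found = record.get(field)
        if found is None:
            continue
        text = str(found)
        if mode == "eq" and text == value:
            hits.append(record)
        elif mode == "contains" and value in text:
            hits.append(record)
        elif mode == "regex" and re.search(value, text):
            hits.append(record)
    return hits


records = [{"name": "sst_monthly.nc"}, {"name": "wind_daily.nc"}]
print([r["name"] for r in search(records, "name", "monthly", "contains")])  # ['sst_monthly.nc']
print([r["name"] for r in search(records, "name", r"_daily\.", "regex")])   # ['wind_daily.nc']
```

Anything richer, such as range queries or combinators, would belong to a future search layer rather than the current core.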