Data Catalog

Warning

Updates are open for review; srvo approval is still pending. Log the approval in the compliance change log before treating this section as authoritative.

1 Purpose

The data catalog describes every structured dataset Ethical Capital uses for client service, compliance, research, marketing, and finance. Each record documents lineage, stewardship, sensitivity, and operational expectations so teams know how to request access and how often information refreshes.

2 Required Metadata Fields

| Field | Description | Owner |
| --- | --- | --- |
| dataset_name | Canonical name used in DuckDB, Snowflake, or shared drives. | Data Steward |
| business_domain | Primary business function (e.g., Compliance, Client Experience, Research, Finance). | srvo |
| source_system | Original system of record (Buttondown, LACRM, Schwab, Workbench, etc.). | IT/Automation |
| refresh_cadence | Expected update frequency (daily, weekly, T+2, on-demand). | Data Steward |
| retention_policy | How long data is stored and where archives live (S3 infrequent access, 72h temp table, etc.). | Compliance |
| sensitivity_class | Classification aligned with the Compliance Manual (PII, Confidential, Internal, Public). | Compliance |
| schema_reference | Link to the corresponding format spec or ERD within reference/data-formats/. | Data Steward |
| downstream_dependencies | Processes, dashboards, or reports that rely on the dataset. | Team using dataset |
| access_method | How to request or programmatically access (e.g., duckdb:/warehouse/client.db, API key vault entry). | IT/Automation |
| quality_checks | Automated or manual controls run during ingestion (lychee, dbt tests, manual review). | Data Steward |
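
These fields can also back a lightweight pre-merge check. Below is a minimal sketch in Python, assuming data_catalog.yml holds a top-level list of entry mappings (as in the example in section 3.4); the file path, function name, and exit behavior are illustrative rather than an existing tool.

import sys
import yaml  # PyYAML

# Required metadata fields from the table above.
REQUIRED_FIELDS = {
    "dataset_name", "business_domain", "source_system", "refresh_cadence",
    "retention_policy", "sensitivity_class", "schema_reference",
    "downstream_dependencies", "access_method", "quality_checks",
}

def check_catalog(path="data_catalog.yml"):
    """Return (dataset_name, missing_fields) pairs for incomplete entries."""
    with open(path) as fh:
        entries = yaml.safe_load(fh) or []
    problems = []
    for entry in entries:
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            problems.append((entry.get("dataset_name", "<unnamed>"), sorted(missing)))
    return problems

if __name__ == "__main__":
    issues = check_catalog()
    for name, missing in issues:
        print(f"{name}: missing {', '.join(missing)}")
    sys.exit(1 if issues else 0)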

3 Catalog Elements

3.1 Domains

We group datasets by business domain to simplify ownership and review cycles:

  • Client Experience – onboarding packets, review prep, ACAT tracking, communications.
  • Compliance – books and records extracts, advisor attestations, marketing approvals, regulatory filings.
  • Investment Research – exclusion screens, portfolio analytics, backtests, attribution models.
  • Finance & Operations – billing exports, expense ledgers, vendor management artifacts.

3.2 Storage Locations

  • DuckDB Warehouses (~/warehouse/*.db) for analytics-ready tables and joins.
  • S3 Object Storage for large historical archives or CSV templates.
  • Cloud SaaS (Buttondown, LACRM, Schwab Advisor Center) as systems of record; catalog entries link to API documentation and export procedures.
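
For the DuckDB warehouses, programmatic access usually means opening the database file with the duckdb Python client. A minimal sketch, assuming ~/warehouse/client.db exists and follows the ~/warehouse/*.db convention above; the table listing is only an illustration of read-only exploration.

from pathlib import Path
import duckdb

# Open an analytics warehouse read-only; ~/warehouse/client.db is assumed
# to follow the ~/warehouse/*.db convention described above.
db_path = Path.home() / "warehouse" / "client.db"
con = duckdb.connect(str(db_path), read_only=True)

# List the analytics-ready tables available in this warehouse.
tables = con.execute("SHOW TABLES").fetchall()
print([name for (name,) in tables])

con.close()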

3.3 Stewardship Workflow

  1. Draft new dataset entry using the metadata table above.
  2. Link to the relevant specification in reference/data-formats/ or include a short schema section if no spec exists yet.
  3. Add the record to data_catalog.yml (stored alongside the automation scripts) and reference it from this page (see the sketch after this list).
  4. Submit a pull request with validation output (lychee, schema checks) and note any downstream updates.
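
Step 3 can be scripted when the catalog file is plain YAML. A rough sketch, assuming data_catalog.yml is a top-level list of entries; register_dataset is an illustrative helper, not an existing automation script, and the PR validation in step 4 (lychee, schema checks) still runs separately.

import yaml  # PyYAML

def register_dataset(entry: dict, path: str = "data_catalog.yml") -> None:
    """Append a drafted catalog entry to data_catalog.yml, rejecting duplicates."""
    with open(path) as fh:
        catalog = yaml.safe_load(fh) or []
    if any(e.get("dataset_name") == entry["dataset_name"] for e in catalog):
        raise ValueError(f"{entry['dataset_name']} is already cataloged")
    catalog.append(entry)
    with open(path, "w") as fh:
        yaml.safe_dump(catalog, fh, sort_keys=False)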

3.4 Example Entry

- dataset_name: onboarding_client_packets
  business_domain: Client Experience
  source_system: LACRM API v2
  refresh_cadence: daily
  retention_policy: S3://ecic-data/onboarding/ (7y)
  sensitivity_class: PII
  schema_reference: reference/data-formats/client_master_template.csv
  downstream_dependencies:
    - runbooks/onboarding
    - dashboards/client_lifecycle.qmd
  access_method: duckdb:///warehouse/client.db
  quality_checks:
    - dbt.tests.unique: client_id
    - qa_scripts/onboarding_document_audit.py
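
The dbt.tests.unique control above can also be approximated ad hoc when investigating an ingestion run. A minimal sketch, assuming the packets land in a table named onboarding_client_packets inside warehouse/client.db (mirroring dataset_name and access_method); both identifiers are assumptions, not confirmed schema.

import duckdb

# Ad hoc approximation of the dbt.tests.unique check on client_id.
# The table name onboarding_client_packets mirrors dataset_name and is assumed.
con = duckdb.connect("warehouse/client.db", read_only=True)
dupes = con.execute(
    "SELECT client_id, COUNT(*) AS n "
    "FROM onboarding_client_packets "
    "GROUP BY client_id "
    "HAVING COUNT(*) > 1"
).fetchall()
con.close()

if dupes:
    raise SystemExit(f"uniqueness check failed for {len(dupes)} client_id values")
print("client_id is unique")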

4 Next Steps

  • Add new datasets to the catalog whenever a pipeline is built or modified.
  • Keep refresh cadences aligned with automation schedules; update both if either changes.
  • Review the data catalog during quarterly governance checks to confirm ownership, retention, and sensitivity remain accurate.