Data Catalog

Warning

Updates are open for review; srvo approval is still pending. Log the approval in the compliance change log before treating this section as authoritative.

1 Purpose

The data catalog describes every structured dataset Ethical Capital uses for client service, compliance, research, marketing, and finance. Each record documents lineage, stewardship, sensitivity, and operational expectations so teams know how to request access and how often information refreshes.

2 Required Metadata Fields

| Field | Description | Owner |
| --- | --- | --- |
| dataset_name | Canonical name used in DuckDB, Snowflake, or shared drives. | Data Steward |
| business_domain | Primary business function (e.g., Compliance, Client Experience, Research, Finance). | srvo |
| source_system | Original system of record (Buttondown, LACRM, Schwab, Workbench, etc.). | IT/Automation |
| refresh_cadence | Expected update frequency (daily, weekly, T+2, on-demand). | Data Steward |
| retention_policy | How long data is stored and where archives live (S3 infrequent access, 72h temp table, etc.). | Compliance |
| sensitivity_class | Classification aligned with the Compliance Manual (PII, Confidential, Internal, Public). | Compliance |
| schema_reference | Link to the corresponding format spec or ERD within reference/data-formats/. | Data Steward |
| downstream_dependencies | Processes, dashboards, or reports that rely on the dataset. | Team using dataset |
| access_method | How to request or programmatically access (e.g., duckdb:/warehouse/client.db, API key vault entry). | IT/Automation |
| quality_checks | Automated or manual controls run during ingestion (lychee, dbt tests, manual review). | Data Steward |
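
These fields can also back a lightweight pre-merge check. Below is a minimal sketch in Python, assuming data_catalog.yml holds a top-level list of entry mappings (as in the example in section 3.4); the file path, function name, and exit behavior are illustrative rather than an existing tool.

import sys
import yaml  # PyYAML

# Required metadata fields from the table above.
REQUIRED_FIELDS = {
    "dataset_name", "business_domain", "source_system", "refresh_cadence",
    "retention_policy", "sensitivity_class", "schema_reference",
    "downstream_dependencies", "access_method", "quality_checks",
}

def check_catalog(path="data_catalog.yml"):
    """Return (dataset_name, missing_fields) pairs for incomplete entries."""
    with open(path) as fh:
        entries = yaml.safe_load(fh) or []
    problems = []
    for entry in entries:
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            problems.append((entry.get("dataset_name", "<unnamed>"), sorted(missing)))
    return problems

if __name__ == "__main__":
    issues = check_catalog()
    for name, missing in issues:
        print(f"{name}: missing {', '.join(missing)}")
    sys.exit(1 if issues else 0)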

3 Catalog Elements

3.1 Domains

We group datasets by business domain to simplify ownership and review cycles:

  • Client Experience – onboarding packets, review prep, ACAT tracking, communications.
  • Compliance – books and records extracts, advisor attestations, marketing approvals, regulatory filings.
  • Investment Research – exclusion screens, portfolio analytics, backtests, attribution models.
  • Finance & Operations – billing exports, expense ledgers, vendor management artifacts.

3.2 Storage Locations

  • DuckDB Warehouses (~/warehouse/*.db) for analytics-ready tables and joins.
  • S3 Object Storage for large historical archives or CSV templates.
  • Cloud SaaS (Buttondown, LACRM, Schwab Advisor Center) as systems of record; catalog entries link to API documentation and export procedures.
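
For the DuckDB warehouses, programmatic access usually means opening the database file with the duckdb Python client. A minimal sketch, assuming ~/warehouse/client.db exists and follows the ~/warehouse/*.db convention above; the table listing is only an illustration of read-only exploration.

from pathlib import Path
import duckdb

# Open an analytics warehouse read-only; ~/warehouse/client.db is assumed
# to follow the ~/warehouse/*.db convention described above.
db_path = Path.home() / "warehouse" / "client.db"
con = duckdb.connect(str(db_path), read_only=True)

# List the analytics-ready tables available in this warehouse.
tables = con.execute("SHOW TABLES").fetchall()
print([name for (name,) in tables])

con.close()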

3.3 Stewardship Workflow

  1. Draft new dataset entry using the metadata table above.
  2. Link to the relevant specification in reference/data-formats/ or include a short schema section if no spec exists yet.
  3. Add the record to data_catalog.yml (stored alongside the automation scripts) and reference it from this page (see the sketch after this list).
  4. Submit a pull request with validation output (lychee, schema checks) and note any downstream updates.
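
Step 3 can be scripted when the catalog file is plain YAML. A rough sketch, assuming data_catalog.yml is a top-level list of entries; register_dataset is an illustrative helper, not an existing automation script, and the PR validation in step 4 (lychee, schema checks) still runs separately.

import yaml  # PyYAML

def register_dataset(entry: dict, path: str = "data_catalog.yml") -> None:
    """Append a drafted catalog entry to data_catalog.yml, rejecting duplicates."""
    with open(path) as fh:
        catalog = yaml.safe_load(fh) or []
    if any(e.get("dataset_name") == entry["dataset_name"] for e in catalog):
        raise ValueError(f"{entry['dataset_name']} is already cataloged")
    catalog.append(entry)
    with open(path, "w") as fh:
        yaml.safe_dump(catalog, fh, sort_keys=False)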

3.4 Example Entry

- dataset_name: onboarding_client_packets
  business_domain: Client Experience
  source_system: LACRM API v2
  refresh_cadence: daily
  retention_policy: S3://ecic-data/onboarding/ (7y)
  sensitivity_class: PII
  schema_reference: reference/data-formats/client_master_template.csv
  downstream_dependencies:
    - runbooks/onboarding
    - dashboards/client_lifecycle.qmd
  access_method: duckdb:///warehouse/client.db
  quality_checks:
    - dbt.tests.unique: client_id
    - qa_scripts/onboarding_document_audit.py
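
The dbt.tests.unique control above can also be approximated ad hoc when investigating an ingestion run. A minimal sketch, assuming the packets land in a table named onboarding_client_packets inside warehouse/client.db (mirroring dataset_name and access_method); both identifiers are assumptions, not confirmed schema.

import duckdb

# Ad hoc approximation of the dbt.tests.unique check on client_id.
# The table name onboarding_client_packets mirrors dataset_name and is assumed.
con = duckdb.connect("warehouse/client.db", read_only=True)
dupes = con.execute(
    "SELECT client_id, COUNT(*) AS n "
    "FROM onboarding_client_packets "
    "GROUP BY client_id "
    "HAVING COUNT(*) > 1"
).fetchall()
con.close()

if dupes:
    raise SystemExit(f"uniqueness check failed for {len(dupes)} client_id values")
print("client_id is unique")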

4 Next Steps

  • Add new datasets to the catalog whenever a pipeline is built or modified.
  • Keep refresh cadences aligned with automation schedules; update both if either changes.
  • Review the data catalog during quarterly governance checks to confirm ownership, retention, and sensitivity remain accurate.