Data Catalog
Warning
Updates are open for review; srvo approval is still pending. Log the approval in the compliance change log before treating this section as authoritative.
1 Purpose
The data catalog describes every structured dataset Ethical Capital uses for client service, compliance, research, marketing, and finance. Each record documents lineage, stewardship, sensitivity, and operational expectations so teams know how to request access and how often information refreshes.
2 Required Metadata Fields
| Field | Description | Owner |
|---|---|---|
| dataset_name | Canonical name used in DuckDB, Snowflake, or shared drives. | Data Steward |
| business_domain | Primary business function (e.g., Compliance, Client Experience, Research, Finance). | srvo |
| source_system | Original system of record (Buttondown, LACRM, Schwab, Workbench, etc.). | IT/Automation |
| refresh_cadence | Expected update frequency (daily, weekly, T+2, on-demand). | Data Steward |
| retention_policy | How long data is stored and where archives live (S3 infrequent access, 72h temp table, etc.). | Compliance |
| sensitivity_class | Classification aligned with the Compliance Manual (PII, Confidential, Internal, Public). | Compliance |
| schema_reference | Link to the corresponding format spec or ERD within reference/data-formats/. | Data Steward |
| downstream_dependencies | Processes, dashboards, or reports that rely on the dataset. | Team using dataset |
| access_method | How to request or programmatically access (e.g., duckdb:/warehouse/client.db, API key vault entry). | IT/Automation |
| quality_checks | Automated or manual controls run during ingestion (lychee, dbt tests, manual review). | Data Steward |
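For scripts that work with catalog records, the fields above translate naturally into a small data model. The sketch below is illustrative only, assuming Python tooling similar to the existing qa_scripts; CatalogEntry is not an existing module, just one way to mirror the table.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    """One catalog record mirroring the required metadata fields above (illustrative)."""
    dataset_name: str
    business_domain: str
    source_system: str
    refresh_cadence: str
    retention_policy: str
    sensitivity_class: str          # PII, Confidential, Internal, or Public
    schema_reference: str           # path under reference/data-formats/
    access_method: str
    downstream_dependencies: list[str] = field(default_factory=list)
    quality_checks: list[str] = field(default_factory=list)
```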
3 Catalog Elements
3.1 Domains
We group datasets by business domain to simplify ownership and review cycles:
- Client Experience – onboarding packets, review prep, ACAT tracking, communications.
- Compliance – books and records extracts, advisor attestations, marketing approvals, regulatory filings.
- Investment Research – exclusion screens, portfolio analytics, backtests, attribution models.
- Finance & Operations – billing exports, expense ledgers, vendor management artifacts.
3.2 Storage Locations
- DuckDB Warehouses (~/warehouse/*.db) for analytics-ready tables and joins (see the query sketch after this list).
- S3 Object Storage for large historical archives or CSV templates.
- Cloud SaaS (Buttondown, LACRM, Schwab Advisor Center) as systems of record; catalog entries link to API documentation and export procedures.
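As a usage illustration, the snippet below opens one of the DuckDB warehouses from Python using the duckdb package. The warehouse path follows the catalog's access_method convention; the client_master table queried here is assumed for the example, not a guaranteed object.

```python
import os
import duckdb

# Open the analytics warehouse read-only; the client.db warehouse and the
# client_master table below are assumptions for illustration.
con = duckdb.connect(os.path.expanduser("~/warehouse/client.db"), read_only=True)

# List available tables first, since catalog entries document the warehouse,
# not every table inside it.
print(con.execute("SHOW TABLES").fetchall())

# Example query against the assumed table.
rows = con.execute("SELECT * FROM client_master LIMIT 5").fetchall()
print(rows)
con.close()
```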
3.3 Stewardship Workflow
- Draft a new dataset entry using the metadata table above.
- Link to the relevant specification in reference/data-formats/ or include a short schema section if no spec exists yet.
- Add the record to the data_catalog.yml (stored alongside automation scripts) and reference it from this page.
- Submit a pull request with validation output (lychee, schema checks; one possible check is sketched after this list) and note any downstream updates.
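The schema check could take many forms; here is a minimal sketch, assuming data_catalog.yml is a YAML list of entries shaped like the example in 3.4. The script name, file location, and exit-code convention are assumptions, not an existing tool.

```python
import sys
from pathlib import Path

import yaml  # PyYAML

# Required metadata fields from the table in section 2.
REQUIRED = {
    "dataset_name", "business_domain", "source_system", "refresh_cadence",
    "retention_policy", "sensitivity_class", "schema_reference",
    "downstream_dependencies", "access_method", "quality_checks",
}

def validate_catalog(path: str = "data_catalog.yml") -> int:
    """Return the number of problems found across all catalog entries."""
    entries = yaml.safe_load(Path(path).read_text()) or []
    problems = 0
    for entry in entries:
        name = entry.get("dataset_name", "<unnamed>")
        missing = REQUIRED - entry.keys()
        if missing:
            print(f"{name}: missing fields {sorted(missing)}")
            problems += 1
        spec = entry.get("schema_reference")
        if spec and not Path(spec).exists():
            print(f"{name}: schema_reference not found: {spec}")
            problems += 1
    return problems

if __name__ == "__main__":
    sys.exit(1 if validate_catalog() else 0)
```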
3.4 Example Entry
```yaml
- dataset_name: onboarding_client_packets
  business_domain: Client Experience
  source_system: LACRM API v2
  refresh_cadence: daily
  retention_policy: S3://ecic-data/onboarding/ (7y)
  sensitivity_class: PII
  schema_reference: reference/data-formats/client_master_template.csv
  downstream_dependencies:
    - runbooks/onboarding
    - dashboards/client_lifecycle.qmd
  access_method: duckdb:///warehouse/client.db
  quality_checks:
    - dbt.tests.unique: client_id
    - qa_scripts/onboarding_document_audit.py
```
4 Next Steps
- Add new datasets to the catalog whenever a pipeline is built or modified.
- Keep refresh cadences aligned with automation schedules; update both if either changes (see the cadence check sketch below).
- Review the data catalog during quarterly governance checks to confirm ownership, retention, and sensitivity remain accurate.
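To make the cadence alignment item concrete, here is a minimal sketch that cross-checks catalog cadences against a hypothetical automation schedule file. The schedules.yml name and its dataset_name-to-cadence layout are assumptions for illustration, not existing artifacts.

```python
from pathlib import Path

import yaml  # PyYAML

def cadence_mismatches(catalog_path="data_catalog.yml", schedule_path="schedules.yml"):
    """Yield datasets whose catalog cadence disagrees with the automation schedule.

    Assumes schedules.yml maps dataset_name -> cadence string; both the file
    name and that layout are illustrative.
    """
    catalog = yaml.safe_load(Path(catalog_path).read_text()) or []
    schedules = yaml.safe_load(Path(schedule_path).read_text()) or {}
    for entry in catalog:
        name = entry.get("dataset_name")
        expected = schedules.get(name)
        if expected and expected != entry.get("refresh_cadence"):
            yield name, entry.get("refresh_cadence"), expected

if __name__ == "__main__":
    for name, catalog_cadence, schedule_cadence in cadence_mismatches():
        print(f"{name}: catalog says {catalog_cadence}, scheduler says {schedule_cadence}")
```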