```mermaid
graph LR
    A[Google Calendar Sync] --> G[DuckDB Catalog Build]
    B[Gmail Messages] --> C[Contact Touches]
    C --> G
    D[Google Contacts] --> G
    E[Drive Metadata] --> G
    F[Forms Responses] --> G
    G --> H[MotherDuck Sync]
    H --> I[Data Quality Validation]
```
Data Ingestion Pipeline SOP
1 Purpose
Maintain a unified data lake of client interaction data, operational metadata, and business intelligence sources to enable automated metrics, contact management, and compliance reporting.
2 Triggers
- Scheduled: Daily at 2 AM UTC via cron job
- Manual: Operator execution of `run_google_sync.sh`
- Event-driven: New client onboarding, significant data changes requiring immediate refresh
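The daily trigger can be expressed as a crontab entry; the installation path and log location below are illustrative, not taken from this SOP:

```cron
# Assumes the host clock (or CRON_TZ) is UTC; otherwise 02:00 drifts.
0 2 * * * /opt/pipeline/run_google_sync.sh >> /var/log/ingestion_cron.log 2>&1
```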
3 Inputs
- Google OAuth token with workspace access (securely stored in Doppler encrypted credential store)
- Calendar configuration mapping (`config/google-calendars.json`)
- Existing data lake structure under the `data/` directory hierarchy
- Previous ingestion state for incremental processing
4 Steps
4.1 Google Calendar Sync
- Script: `download_google_calendar.py`
- Scope: Primary calendar + configured secondary calendars
- Timeframe: 730 days historical, 365 days forward
- Output: Parquet files in `data/google/calendar/`
- Summary: Aggregated statistics in `data/model/google_calendar_summary.json`
4.2 Gmail Metadata Collection
- Script: `download_gmail_messages.py`
- Scope: Message metadata only (no content), last 180 days
- Limit: 1,500 most recent messages to respect API quotas
- Output: Message headers in `data/google/gmail/messages.parquet`
- Privacy: Content and attachments explicitly excluded
4.3 Contact Touch Analysis
- Script: `update_gmail_touch_table.py`
- Input: Gmail message metadata from step 2
- Processing: Extract sender/recipient patterns, calculate interaction frequency
- Output: Contact engagement metrics in `data/model/contact_email_touches.parquet`
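`update_gmail_touch_table.py` itself is not reproduced in this SOP, but the core idea (counting interactions per external counterparty from header metadata alone) can be sketched with the standard library. The record layout and addresses below are a hypothetical simplification of the real parquet schema:

```python
from collections import Counter

def contact_touches(messages, own_address):
    """Count interactions per external contact from header metadata.

    `messages` is an iterable of dicts with 'from' and 'to' header
    addresses (an illustrative simplification, not the real schema).
    """
    touches = Counter()
    for msg in messages:
        # Every address on the message other than our own counts as a touch,
        # regardless of direction (sent or received).
        for addr in [msg["from"], *msg["to"]]:
            if addr != own_address:
                touches[addr] += 1
    return touches

headers = [
    {"from": "client@example.com", "to": ["ops@firm.com"]},
    {"from": "ops@firm.com", "to": ["client@example.com"]},
    {"from": "other@example.com", "to": ["ops@firm.com"]},
]
touches = contact_touches(headers, "ops@firm.com")
# touches == Counter({'client@example.com': 2, 'other@example.com': 1})
```

Note that both inbound and outbound messages increment the counter, which is what makes the metric an engagement measure rather than a received-mail tally.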
4.4 Google Contacts Extraction
- Script: `download_google_contacts.py`
- Scope: All workspace contacts for email resolution
- Output: Contact records in `data/google/contacts/`
- Purpose: Foundation for cross-system contact matching
4.5 Drive Metadata Inventory
- Script: `download_google_drive.py`
- Scope: Non-trashed files, up to 5,000 most recent
- Output: File metadata in `data/google/drive/`
- Use Cases: Client document organization, storage auditing
4.6 DuckDB Catalog Build
- Script: `build_duckdb_catalog.py`
- Purpose: Create unified queryable interface over all ingested data
- Output: `data/duckdb/lake_catalog.duckdb` with views for each dataset
- Features: Schema validation, union handling for inconsistent structures
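`build_duckdb_catalog.py` is not shown here, but the view-per-dataset pattern reduces to emitting one `CREATE VIEW` statement per parquet directory. A minimal sketch of the DDL generation (dataset names and globs are illustrative, not the real catalog contents):

```python
def view_ddl(view_name, parquet_glob):
    """Build the DDL for one catalog view over a parquet directory.

    union_by_name=true asks DuckDB to reconcile files whose columns
    were added over time (the 'inconsistent structures' case above).
    """
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{parquet_glob}', union_by_name=true)"
    )

# Illustrative dataset map; real names come from the data/ hierarchy.
datasets = {
    "calendar_events": "data/google/calendar/*.parquet",
    "gmail_messages": "data/google/gmail/*.parquet",
}
ddl = [view_ddl(name, glob) for name, glob in datasets.items()]
```

Views (rather than materialized tables) keep the catalog file small and always reflect the latest parquet files on disk.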
4.7 Cloud Sync (MotherDuck)
- Trigger: Manual or post-ingestion automation
- Target: MotherDuck `hooks` database with 24 synchronized tables
- Validation: Row count verification, schema consistency checks
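The row-count verification can be sketched as a straight comparison of per-table counts taken locally and from MotherDuck after the sync (table names and counts below are illustrative):

```python
def verify_row_counts(local, cloud):
    """Return (table, local_count, cloud_count) tuples for every mismatch.

    Tables missing on either side show up with a count of None, so a
    failed table creation is caught alongside a partial row sync.
    """
    problems = []
    for table in sorted(set(local) | set(cloud)):
        l, c = local.get(table), cloud.get(table)
        if l != c:
            problems.append((table, l, c))
    return problems

mismatches = verify_row_counts(
    {"calendar_events": 1200, "gmail_messages": 1500},
    {"calendar_events": 1200, "gmail_messages": 1499},
)
# mismatches == [('gmail_messages', 1500, 1499)]
```

A non-empty result should fail the sync step so the discrepancy is investigated before downstream consumers query the cloud copy.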
5 Exceptions
5.1 OAuth Token Expiry
- Detection: HTTP 401 errors from Google APIs
- Response: Automatic token refresh using stored refresh token
- Escalation: Manual re-authentication if refresh fails
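The refresh-then-retry behaviour can be sketched generically; `AuthError` and the two callables stand in for the real API wrapper and credential store, neither of which is part of this SOP:

```python
class AuthError(Exception):
    """Stand-in for an HTTP 401 from a Google API call."""

def with_token_refresh(call, refresh_token):
    """Run call(); on a 401, refresh credentials once and retry.

    A second consecutive 401 propagates to the caller, which is the
    manual re-authentication escalation path described above.
    """
    try:
        return call()
    except AuthError:
        refresh_token()
        return call()

# Demonstrate with stand-ins: the call fails until tokens are refreshed.
state = {"fresh": False}
def fake_refresh():
    state["fresh"] = True
def fake_call():
    if not state["fresh"]:
        raise AuthError()
    return "ok"

result = with_token_refresh(fake_call, fake_refresh)
```

Refreshing at most once per call keeps a revoked refresh token from causing a retry loop.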
5.2 API Rate Limiting
- Detection: HTTP 429 errors, quota exceeded responses
- Response: Exponential backoff with jitter, resume from last successful batch
- Fallback: Delay execution to next available quota window
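A minimal sketch of exponential backoff with full jitter, where each retry waits a random interval bounded by a doubling cap (parameter values are assumptions, not the pipeline's actual settings):

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield exponential backoff delays with full jitter.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)]:
    the 'full jitter' variant, which spreads retries out when many
    workers hit the same quota at once.
    """
    for attempt in range(max_retries):
        yield random.uniform(0.0, min(cap, base * 2 ** attempt))

delays = list(backoff_delays(max_retries=4))
```

In the pipeline this would wrap each batched API call: sleep for the yielded delay after a 429, then resume from the last successful batch rather than restarting.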
5.3 Data Quality Issues
- Schema Changes: Alert data engineering team, halt processing until resolved
- Missing Data: Log gaps, continue with available data, flag for investigation
- Duplicate Records: Deduplication logic based on message IDs, timestamps
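Deduplication on message ID plus timestamp reduces to keep-latest-per-ID; field names below are a hypothetical simplification of the real record schema:

```python
def dedupe_latest(records):
    """Keep one record per message ID, preferring the newest timestamp."""
    latest = {}
    for rec in records:
        key = rec["message_id"]
        # First occurrence wins unless a newer timestamp arrives later.
        if key not in latest or rec["timestamp"] > latest[key]["timestamp"]:
            latest[key] = rec
    return list(latest.values())

rows = [
    {"message_id": "m1", "timestamp": 100, "subject": "old"},
    {"message_id": "m1", "timestamp": 200, "subject": "new"},
    {"message_id": "m2", "timestamp": 150, "subject": "only"},
]
deduped = dedupe_latest(rows)
# Two records survive: m1 with subject "new", m2 with subject "only".
```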
5.4 Storage Exhaustion
- Detection: Disk space monitoring, parquet file size validation
- Response: Archive older datasets, compress historical data
- Prevention: Automated cleanup of files older than retention policy
6 Owner Handoffs
- Data Engineering → Client Experience for contact resolution failures
- Data Engineering → Compliance for PII handling policy violations
- Data Engineering → Infrastructure for persistent authentication or storage issues
7 SLAs
- Daily Pipeline: Complete within 90 minutes of 2 AM UTC start
- Manual Execution: Complete within 60 minutes for standard datasets
- Error Recovery: Automatic retry within 30 minutes, manual escalation at 2 hours
- Data Freshness: All datasets refreshed daily, contact touches updated hourly during business hours
8 Controls
- PII Protection: Message content never accessed, only metadata headers processed
- Access Control: OAuth tokens restricted to necessary scopes, stored encrypted
- Data Validation: Schema enforcement, row count validation, foreign key checks
- Audit Logging: Complete execution logs with timing, success/failure metrics
9 Audit Artifacts
- Execution logs in `logs/ingestion_YYYYMMDD.log` with detailed timing and error information
- Summary reports in `data/model/ingestion_summary.json` with dataset statistics
- Data lineage tracked in the MotherDuck `metadata_sources_datasets` table
- Quality metrics including row counts, schema validation results, processing duration
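The exact shape of `data/model/ingestion_summary.json` is not specified above; one plausible minimal schema, pairing a run timestamp with per-dataset statistics (all field names here are assumptions), could be produced like this:

```python
import json
from datetime import datetime, timezone

def build_summary(dataset_stats):
    """Serialize per-dataset statistics with a UTC run timestamp."""
    return json.dumps(
        {
            "run_at": datetime.now(timezone.utc).isoformat(),
            "datasets": dataset_stats,
        },
        indent=2,
    )

summary = build_summary(
    {"calendar_events": {"rows": 1200, "seconds": 41.2}}
)
```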
10 Vendor Nuances
10.1 Google APIs
- Rate Limits: Vary by service (Gmail: 250 quota units/user/second, Calendar: 1000/second)
- Pagination: Varies (Gmail uses pageToken, Calendar uses nextPageToken)
- Authentication: OAuth 2.0 with offline access for automatic refresh
- Quotas: Daily limits reset at midnight Pacific Time
10.2 DuckDB
- Schema Evolution: `union_by_name=true` handles column additions in parquet files
- Memory Usage: Large datasets may require streaming processing for memory-constrained environments
- Concurrent Access: Single writer, multiple readers supported
10.3 MotherDuck
- Version Compatibility: Requires DuckDB 1.3.2 (not 1.4.0+)
- Network Dependencies: Cloud sync requires stable internet connection
- Authentication: Token-based, stored in Doppler credential management
11 Monitoring & Alerts
- Success Metrics: Datasets processed, records ingested, sync duration
- Error Conditions: API failures, authentication issues, data quality violations
- Performance: Processing times, API response latency, storage utilization
- Business Impact: Contact resolution accuracy, calendar event capture completeness
12 FAQs
Why is Gmail content not ingested? Privacy and compliance requirements limit processing to metadata headers only. This provides sender/recipient analysis without accessing message content.
How does the pipeline handle Google Workspace changes? OAuth tokens automatically refresh. Calendar and contact changes are captured incrementally. New users require manual scope granting.
Can the pipeline run during business hours? Yes, but avoid concurrent executions. The morning run processes overnight changes. Manual runs should specify date ranges to avoid quota conflicts.
What happens if MotherDuck sync fails? Local DuckDB catalog remains available. Cloud sync can be re-run independently without re-ingesting source data.