```mermaid
graph LR
    A[Google Calendar Sync] --> G[DuckDB Catalog Build]
    B[Gmail Messages] --> C[Contact Touches]
    C --> G
    D[Google Contacts] --> G
    E[Drive Metadata] --> G
    F[Forms Responses] --> G
    G --> H[MotherDuck Sync]
    H --> I[Data Quality Validation]
```
Data Ingestion Pipeline SOP
1 Purpose
Maintain a unified data lake of client interaction data, operational metadata, and business intelligence sources to enable automated metrics, contact management, and compliance reporting.
2 Triggers
- Scheduled: Daily at 2 AM UTC via cron job
- Manual: Operator execution of `run_google_sync.sh`
- Event-driven: New client onboarding, significant data changes requiring immediate refresh
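The daily trigger can be expressed as a crontab entry; the installation path and log location below are illustrative, not taken from this SOP:

```cron
# Assumes the host clock (or CRON_TZ) is UTC; otherwise 02:00 drifts.
0 2 * * * /opt/pipeline/run_google_sync.sh >> /var/log/ingestion_cron.log 2>&1
```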
3 Inputs
- Google OAuth token with workspace access (securely stored in Doppler encrypted credential store)
- Calendar configuration mapping (`config/google-calendars.json`)
- Existing data lake structure under the `data/` directory hierarchy
- Previous ingestion state for incremental processing
4 Steps
4.1 Google Calendar Sync
- Script: `download_google_calendar.py`
- Scope: Primary calendar + configured secondary calendars
- Timeframe: 730 days historical, 365 days forward
- Output: Parquet files in `data/google/calendar/`
- Summary: Aggregated statistics in `data/model/google_calendar_summary.json`
4.2 Gmail Metadata Collection
- Script: `download_gmail_messages.py`
- Scope: Message metadata only (no content), last 180 days
- Limit: 1,500 most recent messages to respect API quotas
- Output: Message headers in `data/google/gmail/messages.parquet`
- Privacy: Content and attachments explicitly excluded
4.3 Contact Touch Analysis
- Script: `update_gmail_touch_table.py`
- Input: Gmail message metadata from step 2
- Processing: Extract sender/recipient patterns, calculate interaction frequency
- Output: Contact engagement metrics in `data/model/contact_email_touches.parquet`
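`update_gmail_touch_table.py` itself is not reproduced in this SOP, but the core idea (counting interactions per external counterparty from header metadata alone) can be sketched with the standard library. The record layout and addresses below are a hypothetical simplification of the real parquet schema:

```python
from collections import Counter

def contact_touches(messages, own_address):
    """Count interactions per external contact from header metadata.

    `messages` is an iterable of dicts with 'from' and 'to' header
    addresses (an illustrative simplification, not the real schema).
    """
    touches = Counter()
    for msg in messages:
        # Every address on the message other than our own counts as a touch,
        # regardless of direction (sent or received).
        for addr in [msg["from"], *msg["to"]]:
            if addr != own_address:
                touches[addr] += 1
    return touches

headers = [
    {"from": "client@example.com", "to": ["ops@firm.com"]},
    {"from": "ops@firm.com", "to": ["client@example.com"]},
    {"from": "other@example.com", "to": ["ops@firm.com"]},
]
touches = contact_touches(headers, "ops@firm.com")
# touches == Counter({'client@example.com': 2, 'other@example.com': 1})
```

Note that both inbound and outbound messages increment the counter, which is what makes the metric an engagement measure rather than a received-mail tally.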
4.4 Google Contacts Extraction
- Script: `download_google_contacts.py`
- Scope: All workspace contacts for email resolution
- Output: Contact records in `data/google/contacts/`
- Purpose: Foundation for cross-system contact matching
4.5 Drive Metadata Inventory
- Script: `download_google_drive.py`
- Scope: Non-trashed files, up to 5,000 most recent
- Output: File metadata in `data/google/drive/`
- Use Cases: Client document organization, storage auditing
4.6 DuckDB Catalog Build
- Script: `build_duckdb_catalog.py`
- Purpose: Create unified queryable interface over all ingested data
- Output: `data/duckdb/lake_catalog.duckdb` with views for each dataset
- Features: Schema validation, union handling for inconsistent structures
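`build_duckdb_catalog.py` is not shown here, but the view-per-dataset pattern reduces to emitting one `CREATE VIEW` statement per parquet directory. A minimal sketch of the DDL generation (dataset names and globs are illustrative, not the real catalog contents):

```python
def view_ddl(view_name, parquet_glob):
    """Build the DDL for one catalog view over a parquet directory.

    union_by_name=true asks DuckDB to reconcile files whose columns
    were added over time (the 'inconsistent structures' case above).
    """
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS "
        f"SELECT * FROM read_parquet('{parquet_glob}', union_by_name=true)"
    )

# Illustrative dataset map; real names come from the data/ hierarchy.
datasets = {
    "calendar_events": "data/google/calendar/*.parquet",
    "gmail_messages": "data/google/gmail/*.parquet",
}
ddl = [view_ddl(name, glob) for name, glob in datasets.items()]
```

Views (rather than materialized tables) keep the catalog file small and always reflect the latest parquet files on disk.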
4.7 Cloud Sync (MotherDuck)
- Trigger: Manual or post-ingestion automation
- Target: MotherDuck `hooks` database with 24 synchronized tables
- Validation: Row count verification, schema consistency checks
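The row-count verification can be sketched as a straight comparison of per-table counts taken locally and from MotherDuck after the sync (table names and counts below are illustrative):

```python
def verify_row_counts(local, cloud):
    """Return (table, local_count, cloud_count) tuples for every mismatch.

    Tables missing on either side show up with a count of None, so a
    failed table creation is caught alongside a partial row sync.
    """
    problems = []
    for table in sorted(set(local) | set(cloud)):
        l, c = local.get(table), cloud.get(table)
        if l != c:
            problems.append((table, l, c))
    return problems

mismatches = verify_row_counts(
    {"calendar_events": 1200, "gmail_messages": 1500},
    {"calendar_events": 1200, "gmail_messages": 1499},
)
# mismatches == [('gmail_messages', 1500, 1499)]
```

A non-empty result should fail the sync step so the discrepancy is investigated before downstream consumers query the cloud copy.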
5 Exceptions
5.1 OAuth Token Expiry
- Detection: HTTP 401 errors from Google APIs
- Response: Automatic token refresh using stored refresh token
- Escalation: Manual re-authentication if refresh fails
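The refresh-then-retry behaviour can be sketched generically; `AuthError` and the two callables stand in for the real API wrapper and credential store, neither of which is part of this SOP:

```python
class AuthError(Exception):
    """Stand-in for an HTTP 401 from a Google API call."""

def with_token_refresh(call, refresh_token):
    """Run call(); on a 401, refresh credentials once and retry.

    A second consecutive 401 propagates to the caller, which is the
    manual re-authentication escalation path described above.
    """
    try:
        return call()
    except AuthError:
        refresh_token()
        return call()

# Demonstrate with stand-ins: the call fails until tokens are refreshed.
state = {"fresh": False}
def fake_refresh():
    state["fresh"] = True
def fake_call():
    if not state["fresh"]:
        raise AuthError()
    return "ok"

result = with_token_refresh(fake_call, fake_refresh)
```

Refreshing at most once per call keeps a revoked refresh token from causing a retry loop.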
5.2 API Rate Limiting
- Detection: HTTP 429 errors, quota exceeded responses
- Response: Exponential backoff with jitter, resume from last successful batch
- Fallback: Delay execution to next available quota window
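A minimal sketch of exponential backoff with full jitter, where each retry waits a random interval bounded by a doubling cap (parameter values are assumptions, not the pipeline's actual settings):

```python
import random

def backoff_delays(max_retries=5, base=1.0, cap=60.0):
    """Yield exponential backoff delays with full jitter.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)]:
    the 'full jitter' variant, which spreads retries out when many
    workers hit the same quota at once.
    """
    for attempt in range(max_retries):
        yield random.uniform(0.0, min(cap, base * 2 ** attempt))

delays = list(backoff_delays(max_retries=4))
```

In the pipeline this would wrap each batched API call: sleep for the yielded delay after a 429, then resume from the last successful batch rather than restarting.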
5.3 Data Quality Issues
- Schema Changes: Alert data engineering team, halt processing until resolved
- Missing Data: Log gaps, continue with available data, flag for investigation
- Duplicate Records: Deduplication logic based on message IDs, timestamps
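Deduplication on message ID plus timestamp reduces to keep-latest-per-ID; field names below are a hypothetical simplification of the real record schema:

```python
def dedupe_latest(records):
    """Keep one record per message ID, preferring the newest timestamp."""
    latest = {}
    for rec in records:
        key = rec["message_id"]
        # First occurrence wins unless a newer timestamp arrives later.
        if key not in latest or rec["timestamp"] > latest[key]["timestamp"]:
            latest[key] = rec
    return list(latest.values())

rows = [
    {"message_id": "m1", "timestamp": 100, "subject": "old"},
    {"message_id": "m1", "timestamp": 200, "subject": "new"},
    {"message_id": "m2", "timestamp": 150, "subject": "only"},
]
deduped = dedupe_latest(rows)
# Two records survive: m1 with subject "new", m2 with subject "only".
```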
5.4 Storage Exhaustion
- Detection: Disk space monitoring, parquet file size validation
- Response: Archive older datasets, compress historical data
- Prevention: Automated cleanup of files older than retention policy
6 Owner Handoffs
- Data Engineering → Client Experience for contact resolution failures
- Data Engineering → Compliance for PII handling policy violations
- Data Engineering → Infrastructure for persistent authentication or storage issues
7 SLAs
- Daily Pipeline: Complete within 90 minutes of 2 AM UTC start
- Manual Execution: Complete within 60 minutes for standard datasets
- Error Recovery: Automatic retry within 30 minutes, manual escalation at 2 hours
- Data Freshness: All datasets refreshed daily, contact touches updated hourly during business hours
8 Controls
- PII Protection: Message content never accessed, only metadata headers processed
- Access Control: OAuth tokens restricted to necessary scopes, stored encrypted
- Data Validation: Schema enforcement, row count validation, foreign key checks
- Audit Logging: Complete execution logs with timing, success/failure metrics
9 Audit Artifacts
- Execution logs in `logs/ingestion_YYYYMMDD.log` with detailed timing and error information
- Summary reports in `data/model/ingestion_summary.json` with dataset statistics
- Data lineage tracked in the MotherDuck `metadata_sources_datasets` table
- Quality metrics including row counts, schema validation results, processing duration
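The exact shape of `data/model/ingestion_summary.json` is not specified above; one plausible minimal schema, pairing a run timestamp with per-dataset statistics (all field names here are assumptions), could be produced like this:

```python
import json
from datetime import datetime, timezone

def build_summary(dataset_stats):
    """Serialize per-dataset statistics with a UTC run timestamp."""
    return json.dumps(
        {
            "run_at": datetime.now(timezone.utc).isoformat(),
            "datasets": dataset_stats,
        },
        indent=2,
    )

summary = build_summary(
    {"calendar_events": {"rows": 1200, "seconds": 41.2}}
)
```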
10 Vendor Nuances
10.1 Google APIs
- Rate Limits: Vary by service (Gmail: 250 quota units/user/second, Calendar: 1000/second)
- Pagination: Varies (Gmail uses pageToken, Calendar uses nextPageToken)
- Authentication: OAuth 2.0 with offline access for automatic refresh
- Quotas: Daily limits reset at midnight Pacific Time
10.2 DuckDB
- Schema Evolution: `union_by_name=true` handles column additions in parquet files
- Memory Usage: Large datasets may require streaming processing for memory-constrained environments
- Concurrent Access: Single writer, multiple readers supported
10.3 MotherDuck
- Version Compatibility: Requires DuckDB 1.3.2 (not 1.4.0+)
- Network Dependencies: Cloud sync requires stable internet connection
- Authentication: Token-based, stored in Doppler credential management
11 Monitoring & Alerts
- Success Metrics: Datasets processed, records ingested, sync duration
- Error Conditions: API failures, authentication issues, data quality violations
- Performance: Processing times, API response latency, storage utilization
- Business Impact: Contact resolution accuracy, calendar event capture completeness
12 FAQs
Why is Gmail content not ingested? Privacy and compliance requirements limit processing to metadata headers only. This provides sender/recipient analysis without accessing message content.
How does the pipeline handle Google Workspace changes? OAuth tokens automatically refresh. Calendar and contact changes are captured incrementally. New users require manual scope granting.
Can the pipeline run during business hours? Yes, but avoid concurrent executions. The morning run processes overnight changes. Manual runs should specify date ranges to avoid quota conflicts.
What happens if MotherDuck sync fails? Local DuckDB catalog remains available. Cloud sync can be re-run independently without re-ingesting source data.