Daily Bill Scan — Review (2026-06-22)

What it is

  • Script /opt/data/bin/bill-scanner.py runs daily via autopilot.
  • Scans IMAP inbox (INBOX, [Gmail]/All Mail) for emails matching bill keywords (bill, invoice, payment, etc.).
  • Downloads attachments, converts PDFs to images via PyMuPDF, sends images to inference server (.106:8080) for OCR extraction of vendor/amount/due-date.
  • Results logged to /opt/data/bills/scan-log.jsonl, OCR text saved under /opt/data/bills/processed/.

Current state

  • Running — autopilot triggers a scan roughly every 10 min; attachments are processed only when the --process-attachments flag is passed.
  • Last run 2026-06-10: 5 bill emails found, 0 new attachments (all duplicates). Pipeline healthy.
  • Active open items: Unitywater 125 new vendor (needs verification).
  • PyMuPDF fixed in venv (/opt/data/.venv/) — shebang corrected to #!/opt/data/.venv/bin/python3.
  • ~20 sessions logged since May 2026, mostly “no new bills” cycles.

Gaps / risks

  • No durable deduplication — the scanner re-scans all emails every tick and keeps no record of already-processed message IDs in a durable store. Duplicate detection is only implicit (it checks whether the attachments dir already has files for the msg_id).
  • No alerting/notification — extracted bills just sit in /opt/data/bills/processed/. pvs never gets notified unless autopilot logs are manually checked.
  • No due-date tracking — bills with upcoming deadlines (Unitywater 26 Jun) not fed into any task/calendar system. Just noted in index.md.
  • Test/dummy data contamination — PowerCo NZ test files and Superloop “test_421” are still on disk from the May 5 run and have not been cleaned up.
  • Single keyword list — the KEYWORDS list is short; bills with creative subject lines could slip through (e.g., “statement”, “remittance”, “utility”).
  • No structured output schema — OCR results are free-form text blobs. No consistent JSON structure for downstream automation.
  • Dependent on GPU node — .106:8080 is the single point of failure for OCR. No fallback.

Recommended approach

  • Keep the existing scanner as-is (it works). Layer improvements without rewriting.
  • Add a processed-messages store (/opt/data/bills/scanned_ids.json) to skip re-download.
  • Route structured bill data into Mercury tasks so pvs gets notified on due dates.
  • Clean up test data once and document what to keep vs delete.
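
The processed-messages store suggested above could look roughly like the sketch below. It assumes the store is a flat JSON array of message-ID strings; the function names (load_scanned_ids, save_scanned_ids, filter_new) are hypothetical and not taken from bill-scanner.py.

```python
import json
import os
import tempfile
from pathlib import Path


def load_scanned_ids(path: Path) -> set[str]:
    """Return the set of already-processed message IDs (empty if no store yet)."""
    if not path.exists():
        return set()
    with path.open("r", encoding="utf-8") as fh:
        return set(json.load(fh))


def save_scanned_ids(path: Path, ids: set[str]) -> None:
    """Write the ID set via a temp file so a crash mid-write cannot corrupt the store."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sorted(ids), fh, indent=2)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def filter_new(message_ids: list[str], seen: set[str]) -> list[str]:
    """Keep only IDs that have not been processed yet, preserving order."""
    return [m for m in message_ids if m not in seen]
```

On each tick the scanner would load the set, skip anything filter_new drops, and save the updated set only after attachments for the new messages have been handled, so a failed run is retried rather than silently skipped.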

Phased plan

Phase 1 — hygiene (immediate):

  • Add processed ID tracking to avoid redundant downloads.
  • Delete confirmed test/dummy files (PowerCo NZ, Superloop test_421).
  • Expand keyword list with common billing terms.
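
As an illustration of the keyword expansion, here is a minimal sketch of a subject-line matcher. The tuple BILL_KEYWORDS and the function looks_like_bill are hypothetical; the terms are the three named in this review plus the three suggested under "Gaps / risks", and the scanner's real KEYWORDS list may differ.

```python
import re

# Hypothetical expanded list: existing terms named in this review plus the
# suggested additions. Not copied from bill-scanner.py.
BILL_KEYWORDS = ("bill", "invoice", "payment", "statement", "remittance", "utility")

# Whole-word match with an optional plural "s", case-insensitive, so that
# "Invoices" matches but "billboard" does not.
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BILL_KEYWORDS)) + r")s?\b",
    re.IGNORECASE,
)


def looks_like_bill(subject: str) -> bool:
    """Return True when the subject contains any billing keyword as a whole word."""
    return _KEYWORD_RE.search(subject) is not None
```

Whole-word matching trades some recall (for example, "billing" is not matched) for fewer false positives; whether that trade is right depends on how noisy the inbox is.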

Phase 2 — notification (short term):

  • Parse OCR output into structured JSON (vendor, amount, due date, bill number).
  • Create Mercury task when a new bill is detected or a due date is approaching within 3 days.
  • Add a weekly summary email to pvs.
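
A possible shape for the structured record and the 3-day due-date check is sketched below. The Bill dataclass, its field names, and is_due_soon are assumptions made for illustration; the values in the usage lines are placeholders, not real bill data. Extracting the fields from free-form OCR text is out of scope here.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date, timedelta
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class Bill:
    """Hypothetical schema covering the four fields listed in Phase 2."""
    vendor: str
    amount: Decimal
    due_date: date
    bill_number: Optional[str] = None

    def to_json(self) -> str:
        record = asdict(self)
        record["amount"] = str(self.amount)            # keep exact decimal digits
        record["due_date"] = self.due_date.isoformat()  # YYYY-MM-DD
        return json.dumps(record, sort_keys=True)


def is_due_soon(bill: Bill, today: date, window_days: int = 3) -> bool:
    """True when the due date is today or within the next window_days days.

    Overdue bills return False here; they would need a separate check.
    """
    return today <= bill.due_date <= today + timedelta(days=window_days)


# Placeholder values for illustration only.
example = Bill("Example Vendor", Decimal("125.00"), date(2026, 6, 26), "INV-001")
line = example.to_json()  # one JSON object, suitable for a .jsonl log line
```

Storing the amount as a string avoids float rounding in JSON, and passing today in explicitly keeps the due-date check easy to test.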

Phase 3 — calendar integration (medium term):

  • Export due dates as iCal/VTODO or feed into Google Calendar.
  • Auto-recurring bills: detect patterns, auto-create next-period tasks.
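
For the iCal/VTODO export, a minimal sketch using only the standard library follows. The function bill_to_vtodo, the PRODID string, and the UID format are hypothetical. It emits a single all-day VTODO and does not fold long lines, which a production exporter would need for lines over 75 octets.

```python
from datetime import date, datetime, timezone


def _escape(text: str) -> str:
    """Escape characters that are special in iCalendar TEXT values."""
    return (
        text.replace("\\", "\\\\")
        .replace(";", "\\;")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )


def bill_to_vtodo(uid: str, summary: str, due: date, now: datetime) -> str:
    """Render one bill as a VCALENDAR containing a single VTODO.

    `now` must be timezone-aware; it is converted to UTC for DTSTAMP.
    """
    lines = [
        "BEGIN:VCALENDAR",
        "VERSION:2.0",
        "PRODID:-//bill-scanner//EN",  # hypothetical product identifier
        "BEGIN:VTODO",
        f"UID:{uid}",
        f"DTSTAMP:{now.astimezone(timezone.utc):%Y%m%dT%H%M%SZ}",
        f"DUE;VALUE=DATE:{due:%Y%m%d}",
        f"SUMMARY:{_escape(summary)}",
        "END:VTODO",
        "END:VCALENDAR",
    ]
    # iCalendar content lines are terminated by CRLF.
    return "\r\n".join(lines) + "\r\n"
```

A stable UID per bill (for instance derived from the message ID) would let a calendar client update an existing task instead of creating a duplicate on re-export.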