phaseflow/specs/observability.md
Petru Paler 6a8d55c0b9 Document spec gaps: auth, phase scaling, observability, testing
Address 21 previously undefined behaviors across specs:

- Authentication: Replace email/password with OIDC (Pocket-ID)
- Cycle tracking: Add fixed-luteal phase scaling formula with examples
- Calendar: Document period logging behavior (preserve predictions)
- Garmin: Clarify connection is required (no phase-only mode)
- Dashboard: Add UI states, dark mode, onboarding, accessibility
- Notifications: Document timezone batching approach
- New specs: observability.md (health, metrics, logging)
- New specs: testing.md (unit + integration strategy)
- Main spec: Add backup/recovery, known limitations, API updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-11 07:49:56 +00:00

Observability Specification

Job to Be Done

When the app is running in production, I want visibility into its health and behavior, so that I can detect and diagnose issues quickly.

Health Check

GET /api/health

Returns application health status for monitoring and load balancer checks.

Response (200 OK):

{
  "status": "ok",
  "timestamp": "2024-01-10T12:00:00Z",
  "version": "1.0.0"
}

Response (503 Service Unavailable):

{
  "status": "unhealthy",
  "timestamp": "2024-01-10T12:00:00Z",
  "error": "PocketBase connection failed"
}

Checks Performed:

  • PocketBase connectivity
  • Basic app startup complete

Usage:

  • Nomad health checks
  • Uptime monitoring (e.g., Uptime Kuma)
  • Load balancer health probes
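The two responses above can be sketched as a small pure function that maps the outcome of the PocketBase check to a status code and body. This is a minimal sketch, not the project's handler: `buildHealthResult`, `HealthBody`, and `HealthResult` are hypothetical names, and the connectivity check itself is assumed to run elsewhere and pass in either `null` or an error description.

```typescript
// Hypothetical shapes; field names follow the example responses above.
interface HealthBody {
  status: "ok" | "unhealthy";
  timestamp: string;
  version?: string;
  error?: string;
}

interface HealthResult {
  httpStatus: 200 | 503;
  body: HealthBody;
}

// pocketBaseError is null when the connectivity check passed,
// otherwise a short description of the failure.
function buildHealthResult(
  pocketBaseError: string | null,
  version: string,
  now: Date = new Date(),
): HealthResult {
  // Drop milliseconds to match the "2024-01-10T12:00:00Z" style used above.
  const timestamp = now.toISOString().replace(/\.\d{3}Z$/, "Z");
  if (pocketBaseError !== null) {
    return {
      httpStatus: 503,
      body: { status: "unhealthy", timestamp, error: pocketBaseError },
    };
  }
  return { httpStatus: 200, body: { status: "ok", timestamp, version } };
}
```

Keeping the decision separate from the HTTP layer makes both acceptance cases (200 when healthy, 503 when PocketBase is unreachable) testable without a running server.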

Prometheus Metrics

GET /metrics

Returns Prometheus-format metrics for scraping.

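Standard Node.js Metrics (prom-client defaults):

  • nodejs_heap_size_total_bytes
  • nodejs_heap_size_used_bytes
  • nodejs_eventloop_lag_seconds

HTTP Metrics (recorded by request instrumentation; not part of the prom-client defaults):

  • http_request_duration_seconds (histogram)
  • http_requests_total (counter)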

Custom Application Metrics:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| phaseflow_garmin_sync_total | counter | status (success/failure) | Garmin sync attempts |
| phaseflow_garmin_sync_duration_seconds | histogram | - | Garmin sync duration |
| phaseflow_email_sent_total | counter | type (daily/warning) | Emails sent |
| phaseflow_decision_engine_calls_total | counter | decision (REST/GENTLE/...) | Decision engine invocations |
| phaseflow_active_users | gauge | - | Users with activity in last 24h |

Implementation: Use the prom-client npm package for metrics collection.
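To make "Prometheus format" concrete, the sketch below renders one of the custom counters in the text exposition format that a scraper reads from GET /metrics. In the real application prom-client would produce this output; `renderCounter`, `escapeLabel`, and `CounterSample` are hypothetical helpers written only to illustrate the wire format, and the sample values are made up.

```typescript
// One sample of a counter: a label set and its current value.
interface CounterSample {
  labels: Record<string, string>;
  value: number;
}

// Label values must escape backslash, double quote, and newline.
function escapeLabel(value: string): string {
  return value
    .replace(/\\/g, "\\\\")
    .replace(/"/g, '\\"')
    .replace(/\n/g, "\\n");
}

// Render a counter as HELP and TYPE comment lines followed by one line per sample.
function renderCounter(
  name: string,
  help: string,
  samples: CounterSample[],
): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const pairs = Object.entries(labels).map(
      ([key, val]) => `${key}="${escapeLabel(val)}"`,
    );
    const labelPart = pairs.length > 0 ? `{${pairs.join(",")}}` : "";
    lines.push(`${name}${labelPart} ${value}`);
  }
  return lines.join("\n") + "\n";
}

// Illustrative values only.
const exposition = renderCounter(
  "phaseflow_garmin_sync_total",
  "Garmin sync attempts",
  [
    { labels: { status: "success" }, value: 3 },
    { labels: { status: "failure" }, value: 1 },
  ],
);
console.log(exposition);
```

Each distinct label value (here `status="success"` and `status="failure"`) becomes its own time series, which is why label sets should stay small and bounded.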

Structured Logging

Format

JSON-structured logs for all significant events:

{
  "timestamp": "2024-01-10T12:00:00.000Z",
  "level": "info",
  "message": "Garmin sync completed",
  "userId": "user123",
  "duration_ms": 1250,
  "metrics": {
    "bodyBattery": 95,
    "hrvStatus": "Balanced"
  }
}

Log Levels

| Level | Usage |
| --- | --- |
| error | Failures requiring attention (sync failures, email errors) |
| warn | Degraded behavior (using cached data, retries) |
| info | Normal operations (sync complete, email sent, decision made) |
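The format and levels above can be illustrated without any logging library. The following is a minimal sketch: `formatLogLine`, `SEVERITY`, and the `threshold` parameter are hypothetical names, and the severity ordering (error above warn above info) is an assumption that mirrors how common loggers treat a minimum level.

```typescript
type LogLevel = "error" | "warn" | "info";

// Lower number means more severe. The ordering is an assumption, not part of the spec.
const SEVERITY: Record<LogLevel, number> = { error: 0, warn: 1, info: 2 };

// Returns one JSON log line, or null when the level is less severe than the threshold.
// Note: keys in `fields` named timestamp, level, or message would overwrite the core keys.
function formatLogLine(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  threshold: LogLevel = "info",
  now: Date = new Date(),
): string | null {
  if (SEVERITY[level] > SEVERITY[threshold]) return null;
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...fields,
  });
}

// Example: an info event with the fields from the sample log entry.
const sampleLine = formatLogLine("info", "Garmin sync completed", {
  userId: "user123",
  duration_ms: 1250,
});
if (sampleLine !== null) console.log(sampleLine);
```

With `threshold` set to `"warn"`, info events are dropped while warn and error events still produce a line.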

Key Events to Log

| Event | Level | Fields |
| --- | --- | --- |
| Auth success | info | userId |
| Auth failure | warn | reason, ip |
| Garmin sync start | info | userId |
| Garmin sync complete | info | userId, duration_ms, metrics |
| Garmin sync failure | error | userId, error, attempt |
| Email sent | info | userId, type, recipient |
| Email failed | error | userId, type, error |
| Decision calculated | info | userId, decision, reason |
| Period logged | info | userId, date |
| Override toggled | info | userId, override, enabled |
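One way to keep these events consistent is to hold the table as data and check each payload against it before logging. The sketch below assumes that approach; `EVENTS`, `EventSpec`, and `missingFields` are hypothetical names, and the catalog simply transcribes the table above.

```typescript
type LogLevel = "error" | "warn" | "info";

interface EventSpec {
  level: LogLevel;
  fields: readonly string[];
}

// Catalog transcribed from the "Key Events to Log" table.
const EVENTS: Record<string, EventSpec> = {
  "Auth success": { level: "info", fields: ["userId"] },
  "Auth failure": { level: "warn", fields: ["reason", "ip"] },
  "Garmin sync start": { level: "info", fields: ["userId"] },
  "Garmin sync complete": { level: "info", fields: ["userId", "duration_ms", "metrics"] },
  "Garmin sync failure": { level: "error", fields: ["userId", "error", "attempt"] },
  "Email sent": { level: "info", fields: ["userId", "type", "recipient"] },
  "Email failed": { level: "error", fields: ["userId", "type", "error"] },
  "Decision calculated": { level: "info", fields: ["userId", "decision", "reason"] },
  "Period logged": { level: "info", fields: ["userId", "date"] },
  "Override toggled": { level: "info", fields: ["userId", "override", "enabled"] },
};

// Returns the required field names that are absent from the payload.
function missingFields(
  event: string,
  payload: Record<string, unknown>,
): string[] {
  const spec = EVENTS[event];
  if (spec === undefined) throw new Error(`Unknown event: ${event}`);
  return spec.fields.filter((field) => !(field in payload));
}

// Example: a sync failure logged without its attempt counter.
console.log(missingFields("Garmin sync failure", { userId: "user123", error: "timeout" }));
```

A check like this could run in unit tests rather than in production code, so that a missing field fails the build instead of silently producing an incomplete log entry.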

Implementation

Use a structured logger (e.g., pino) configured for JSON output:

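import pino from 'pino';

export const logger = pino({
  // Minimum level to emit; override with the LOG_LEVEL environment variable.
  level: process.env.LOG_LEVEL || 'info',
  // Use "message" instead of pino's default "msg" key, matching the format above.
  messageKey: 'message',
  // Emit an ISO 8601 "timestamp" field instead of pino's default epoch-millisecond "time".
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  formatters: {
    // Emit the level as a label ("info") rather than pino's numeric value (30).
    level: (label) => ({ level: label }),
  },
});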

Success Criteria

  1. Health endpoint responds in under 100ms
  2. Metrics endpoint scrapable by Prometheus
  3. All key events logged with consistent structure
  4. Logs parseable by log aggregators (Loki, ELK, etc.)

Acceptance Tests

  • GET /api/health returns 200 when healthy
  • GET /api/health returns 503 when PocketBase unreachable
  • GET /metrics returns valid Prometheus format
  • Custom metrics increment on relevant events
  • Logs output valid JSON to stdout
  • Error logs include stack traces
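The last acceptance test depends on how errors are serialized, since `JSON.stringify` on a plain `Error` yields `{}` and loses the stack. The sketch below shows one dependency-free way to carry the stack into the log line; `serializeError` and `errorLogLine` are hypothetical helpers, and placing the result under an `error` key is an assumption based on the field names in the events table. With pino, errors passed under the `err` key are serialized with their stack by the default serializer, so a helper like this may not be needed there.

```typescript
interface SerializedError {
  type: string;
  message: string;
  stack?: string;
}

// Convert a thrown value into a JSON-safe object that keeps the stack trace.
function serializeError(err: unknown): SerializedError {
  if (err instanceof Error) {
    return { type: err.name, message: err.message, stack: err.stack };
  }
  // Non-Error values (strings, numbers, ...) have no stack to preserve.
  return { type: typeof err, message: String(err) };
}

// Build an error-level JSON log line that embeds the serialized error.
function errorLogLine(
  message: string,
  err: unknown,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level: "error",
    message,
    error: serializeError(err),
  });
}

try {
  throw new Error("Garmin API timeout");
} catch (err) {
  console.log(errorLogLine("Garmin sync failed", err));
}
```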