Address 21 previously undefined behaviors across specs:
- Authentication: Replace email/password with OIDC (Pocket-ID)
- Cycle tracking: Add fixed-luteal phase scaling formula with examples
- Calendar: Document period logging behavior (preserve predictions)
- Garmin: Clarify connection is required (no phase-only mode)
- Dashboard: Add UI states, dark mode, onboarding, accessibility
- Notifications: Document timezone batching approach
- New specs: observability.md (health, metrics, logging)
- New specs: testing.md (unit + integration strategy)
- Main spec: Add backup/recovery, known limitations, API updates
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Observability Specification
Job to Be Done
When the app is running in production, I want visibility into its health and behavior, so that I can detect and diagnose issues quickly.
Health Check
GET /api/health
Returns application health status for monitoring and load balancer checks.
Response (200 OK):
{
  "status": "ok",
  "timestamp": "2024-01-10T12:00:00Z",
  "version": "1.0.0"
}
Response (503 Service Unavailable):
{
  "status": "unhealthy",
  "timestamp": "2024-01-10T12:00:00Z",
  "error": "PocketBase connection failed"
}
Checks Performed:
- PocketBase connectivity
- Basic app startup complete
Usage:
- Nomad health checks
- Uptime monitoring (e.g., Uptime Kuma)
- Load balancer health probes
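The mapping from check result to response can be sketched as a pure function. This is a minimal illustration, not the app's actual handler: `buildHealthResponse` and its parameters are hypothetical names, and the PocketBase connectivity check itself is assumed to run elsewhere and pass in a boolean.

```typescript
// The two response bodies documented above.
type HealthBody =
  | { status: "ok"; timestamp: string; version: string }
  | { status: "unhealthy"; timestamp: string; error: string };

interface HealthResult {
  statusCode: 200 | 503;
  body: HealthBody;
}

// Hypothetical helper: turns the outcome of the PocketBase connectivity
// check into the HTTP status code and JSON body for GET /api/health.
function buildHealthResponse(
  pocketBaseReachable: boolean,
  version: string,
  now: Date,
): HealthResult {
  // Drop milliseconds so the timestamp matches the documented examples.
  const timestamp = now.toISOString().replace(/\.\d{3}Z$/, "Z");
  if (!pocketBaseReachable) {
    return {
      statusCode: 503,
      body: { status: "unhealthy", timestamp, error: "PocketBase connection failed" },
    };
  }
  return { statusCode: 200, body: { status: "ok", timestamp, version } };
}

const healthyExample = buildHealthResponse(true, "1.0.0", new Date("2024-01-10T12:00:00Z"));
const unhealthyExample = buildHealthResponse(false, "1.0.0", new Date("2024-01-10T12:00:00Z"));
```

Keeping the mapping free of I/O makes the 200 and 503 cases easy to unit test without a running PocketBase instance.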
Prometheus Metrics
GET /metrics
Returns Prometheus-format metrics for scraping.
Standard Node.js Metrics:
- nodejs_heap_size_total_bytes
- nodejs_heap_size_used_bytes
- nodejs_eventloop_lag_seconds
- http_request_duration_seconds (histogram)
- http_requests_total (counter)
Custom Application Metrics:
| Metric | Type | Labels | Description |
|---|---|---|---|
| phaseflow_garmin_sync_total | counter | status (success/failure) | Garmin sync attempts |
| phaseflow_garmin_sync_duration_seconds | histogram | - | Garmin sync duration |
| phaseflow_email_sent_total | counter | type (daily/warning) | Emails sent |
| phaseflow_decision_engine_calls_total | counter | decision (REST/GENTLE/...) | Decision engine invocations |
| phaseflow_active_users | gauge | - | Users with activity in last 24h |
Implementation:
Use the prom-client npm package for metrics collection.
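To show what a scrape of a labelled counter such as phaseflow_garmin_sync_total might look like, here is a dependency-free sketch of the Prometheus text exposition format. It deliberately does not use prom-client: `createCounter`, `increment`, and `renderCounter` are hypothetical stand-ins for what the library would do, and label-value escaping is left out.

```typescript
// Dependency-free stand-in for a labelled counter (hypothetical names).
// In the app, prom-client would own this state and render /metrics.
type Labels = Record<string, string>;

interface CounterSketch {
  metricName: string;
  help: string;
  series: Record<string, number>; // rendered label set -> current value
}

function createCounter(metricName: string, help: string): CounterSketch {
  return { metricName, help, series: {} };
}

// Render labels as key="value" pairs, sorted so that the same label set
// always maps to the same series. Escaping of label values is omitted.
function renderLabels(labels: Labels): string {
  return Object.keys(labels)
    .sort()
    .map((key) => `${key}="${labels[key]}"`)
    .join(",");
}

function increment(counter: CounterSketch, labels: Labels, by: number = 1): void {
  const seriesKey = renderLabels(labels);
  counter.series[seriesKey] = (counter.series[seriesKey] || 0) + by;
}

// HELP and TYPE headers followed by one sample line per label set.
function renderCounter(counter: CounterSketch): string {
  const lines = [
    `# HELP ${counter.metricName} ${counter.help}`,
    `# TYPE ${counter.metricName} counter`,
  ];
  Object.keys(counter.series).forEach((seriesKey) => {
    lines.push(`${counter.metricName}{${seriesKey}} ${counter.series[seriesKey]}`);
  });
  return lines.join("\n") + "\n";
}

const garminSyncTotal = createCounter("phaseflow_garmin_sync_total", "Garmin sync attempts");
increment(garminSyncTotal, { status: "success" });
increment(garminSyncTotal, { status: "success" });
increment(garminSyncTotal, { status: "failure" });

const exposition = renderCounter(garminSyncTotal);
console.log(exposition);
```

After two successful syncs and one failure, the rendered text contains the sample lines `phaseflow_garmin_sync_total{status="success"} 2` and `phaseflow_garmin_sync_total{status="failure"} 1`.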
Structured Logging
Format
JSON-structured logs for all significant events:
{
  "timestamp": "2024-01-10T12:00:00.000Z",
  "level": "info",
  "message": "Garmin sync completed",
  "userId": "user123",
  "duration_ms": 1250,
  "metrics": {
    "bodyBattery": 95,
    "hrvStatus": "Balanced"
  }
}
Log Levels
| Level | Usage |
|---|---|
| error | Failures requiring attention (sync failures, email errors) |
| warn | Degraded behavior (using cached data, retries) |
| info | Normal operations (sync complete, email sent, decision made) |
Key Events to Log
| Event | Level | Fields |
|---|---|---|
| Auth success | info | userId |
| Auth failure | warn | reason, ip |
| Garmin sync start | info | userId |
| Garmin sync complete | info | userId, duration_ms, metrics |
| Garmin sync failure | error | userId, error, attempt |
| Email sent | info | userId, type, recipient |
| Email failed | error | userId, type, error |
| Decision calculated | info | userId, decision, reason |
| Period logged | info | userId, date |
| Override toggled | info | userId, override, enabled |
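As an illustration of how rows in this table map onto the documented log format, the sketch below builds two log lines by hand. `formatLogLine` and `errorFields` are hypothetical helpers, the message texts are made up, and the `stack` field name is an assumption; in the app a structured logger would produce these lines.

```typescript
type LogLevel = "error" | "warn" | "info";

// Hypothetical helper: serializes one event in the documented shape
// (timestamp, level, message, then the event-specific fields).
function formatLogLine(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown>,
  now: Date,
): string {
  return JSON.stringify({ timestamp: now.toISOString(), level, message, ...fields });
}

// JSON.stringify turns an Error into "{}", so the message and stack are
// copied explicitly. The "stack" field name is an assumption.
function errorFields(err: Error): Record<string, unknown> {
  return { error: err.message, stack: err.stack };
}

const loggedAt = new Date("2024-01-10T12:00:00.000Z");

// "Garmin sync complete": info with userId, duration_ms, metrics
const syncCompleteLine = formatLogLine(
  "info",
  "Garmin sync completed",
  { userId: "user123", duration_ms: 1250, metrics: { bodyBattery: 95, hrvStatus: "Balanced" } },
  loggedAt,
);

// "Garmin sync failure": error with userId, error, attempt
const syncFailureLine = formatLogLine(
  "error",
  "Garmin sync failed",
  { userId: "user123", ...errorFields(new Error("Garmin request timed out")), attempt: 2 },
  loggedAt,
);

console.log(syncCompleteLine);
console.log(syncFailureLine);
```

Every line carries the same three leading fields, so a log aggregator can filter on `level` or `userId` without knowing the event type.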
Implementation
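Use a structured logger (e.g., pino) configured for JSON output:
import pino from 'pino';

// JSON logger; the options below align pino's output with the format documented above.
export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  messageKey: 'message', // pino defaults to "msg"
  // pino defaults to an epoch-millisecond "time" field
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  formatters: {
    // Emit the level as its label ("info") rather than its numeric value
    level: (label) => ({ level: label }),
  },
});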
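The effect of the LOG_LEVEL setting can be sketched as a threshold comparison. The numeric values are the ones pino assigns to these three levels; `isEmitted` is a hypothetical helper written for illustration, not part of pino's API.

```typescript
// Numeric values pino assigns to the three levels this spec uses.
const LEVEL_VALUES = { info: 30, warn: 40, error: 50 } as const;
type Level = keyof typeof LEVEL_VALUES;

// A line is written when its level is at or above the configured threshold.
function isEmitted(lineLevel: Level, threshold: Level): boolean {
  return LEVEL_VALUES[lineLevel] >= LEVEL_VALUES[threshold];
}

// With LOG_LEVEL=warn, routine "info" events are dropped while
// warnings and errors still reach stdout.
const infoAtWarn = isEmitted("info", "warn");
const errorAtWarn = isEmitted("error", "warn");
console.log(infoAtWarn, errorAtWarn);
```

With the default threshold of `info`, all three levels in the table above are written.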
Success Criteria
- Health endpoint responds in under 100ms
- Metrics endpoint scrapable by Prometheus
- All key events logged with consistent structure
- Logs parseable by log aggregators (Loki, ELK, etc.)
Acceptance Tests
- GET /api/health returns 200 when healthy
- GET /api/health returns 503 when PocketBase unreachable
- GET /metrics returns valid Prometheus format
- Custom metrics increment on relevant events
- Logs output valid JSON to stdout
- Error logs include stack traces
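The two logging tests above could be checked line by line with something like the following sketch. `checkLogLine` is a hypothetical helper, and treating a string `stack` field as the stack trace is an assumption, since the spec does not name the field.

```typescript
const ALLOWED_LEVELS = ["error", "warn", "info"];

// Hypothetical check for one stdout line.
// Returns a list of problems; an empty list means the line passes.
function checkLogLine(line: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return ["not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["not a JSON object"];
  }
  const entry = parsed as Record<string, unknown>;
  const timestamp = entry["timestamp"];
  const level = entry["level"];
  const message = entry["message"];
  const stack = entry["stack"];

  const problems: string[] = [];
  if (typeof timestamp !== "string" || isNaN(Date.parse(timestamp))) {
    problems.push("timestamp missing or unparseable");
  }
  if (typeof level !== "string" || ALLOWED_LEVELS.indexOf(level) === -1) {
    problems.push("level missing or unknown");
  }
  if (typeof message !== "string") {
    problems.push("message missing");
  }
  // Assumption: stack traces live in a "stack" field on error-level lines.
  if (level === "error" && typeof stack !== "string") {
    problems.push("error line without stack");
  }
  return problems;
}

const sampleProblems = checkLogLine(
  '{"timestamp":"2024-01-10T12:00:00.000Z","level":"error","message":"Email failed"}',
);
console.log(sampleProblems);
```

A test harness could capture stdout during a sync or email run, split it on newlines, and assert that `checkLogLine` returns an empty list for every line.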