
Petru Paler 6a8d55c0b9 Document spec gaps: auth, phase scaling, observability, testing

Address 21 previously undefined behaviors across specs:

- Authentication: Replace email/password with OIDC (Pocket-ID)
- Cycle tracking: Add fixed-luteal phase scaling formula with examples
- Calendar: Document period logging behavior (preserve predictions)
- Garmin: Clarify connection is required (no phase-only mode)
- Dashboard: Add UI states, dark mode, onboarding, accessibility
- Notifications: Document timezone batching approach
- New specs: observability.md (health, metrics, logging)
- New specs: testing.md (unit + integration strategy)
- Main spec: Add backup/recovery, known limitations, API updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-11 07:49:56 +00:00

3.6 KiB


Observability Specification

Job to Be Done

When the app is running in production, I want visibility into its health and behavior, so that I can detect and diagnose issues quickly.

Health Check

GET `/api/health`

Returns application health status for monitoring and load balancer checks.

Response (200 OK):

{
  "status": "ok",
  "timestamp": "2024-01-10T12:00:00Z",
  "version": "1.0.0"
}

Response (503 Service Unavailable):

{
  "status": "unhealthy",
  "timestamp": "2024-01-10T12:00:00Z",
  "error": "PocketBase connection failed"
}

Checks Performed:

- PocketBase connectivity
- Basic app startup complete

Usage:

- Nomad health checks
- Uptime monitoring (e.g., Uptime Kuma)
- Load balancer health probes
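
A minimal sketch of how the two documented responses could be produced. The names `checkHealth`, `HealthResult`, and the injected `pingPocketBase` probe are hypothetical, and the way PocketBase connectivity is actually tested is left to the caller:

```typescript
// Hypothetical shapes mirroring the two documented response bodies.
type HealthBody =
  | { status: "ok"; timestamp: string; version: string }
  | { status: "unhealthy"; timestamp: string; error: string };

interface HealthResult {
  statusCode: 200 | 503;
  body: HealthBody;
}

// Runs the connectivity probe and maps the outcome to 200 or 503.
// `pingPocketBase` is an assumed callback that rejects when PocketBase
// cannot be reached; `now` is injectable so the result is testable.
async function checkHealth(
  pingPocketBase: () => Promise<void>,
  version: string,
  now: () => Date = () => new Date(),
): Promise<HealthResult> {
  try {
    await pingPocketBase();
    return {
      statusCode: 200,
      body: { status: "ok", timestamp: now().toISOString(), version },
    };
  } catch {
    return {
      statusCode: 503,
      body: {
        status: "unhealthy",
        timestamp: now().toISOString(),
        error: "PocketBase connection failed",
      },
    };
  }
}
```

A route handler for `/api/health` would call `checkHealth` and return `body` with `statusCode`. Note that `toISOString()` includes milliseconds, unlike the example timestamps above.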

Prometheus Metrics

GET `/metrics`

Returns Prometheus-format metrics for scraping.

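Standard Node.js and HTTP Metrics:

- `nodejs_heap_size_total_bytes`
- `nodejs_heap_size_used_bytes`
- `nodejs_eventloop_lag_seconds`
- `http_request_duration_seconds` (histogram)
- `http_requests_total` (counter)

The `nodejs_*` metrics come from prom-client's default collectors. The two `http_*` metrics are not collected by default and require request instrumentation.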

Custom Application Metrics:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| `phaseflow_garmin_sync_total` | counter | `status` (success/failure) | Garmin sync attempts |
| `phaseflow_garmin_sync_duration_seconds` | histogram | - | Garmin sync duration |
| `phaseflow_email_sent_total` | counter | `type` (daily/warning) | Emails sent |
| `phaseflow_decision_engine_calls_total` | counter | `decision` (REST/GENTLE/...) | Decision engine invocations |
| `phaseflow_active_users` | gauge | - | Users with activity in last 24h |

Implementation: Use the `prom-client` npm package for metrics collection.
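
To illustrate what a labeled counter such as `phaseflow_garmin_sync_total` looks like when scraped, here is a dependency-free sketch of the Prometheus text exposition format. The `LabeledCounter` class is hypothetical and exists only for illustration; in the app, prom-client would track the values and render this output.

```typescript
// Illustrative stand-in for a prom-client counter with a single label.
class LabeledCounter {
  private name: string;
  private help: string;
  private labelName: string;
  private values: Record<string, number> = {};

  constructor(name: string, help: string, labelName: string) {
    this.name = name;
    this.help = help;
    this.labelName = labelName;
  }

  // Increment the series for one label value, e.g. status="success".
  inc(labelValue: string, amount: number = 1): void {
    this.values[labelValue] = (this.values[labelValue] ?? 0) + amount;
  }

  // Render HELP and TYPE lines followed by one sample line per label value.
  // Assumes label values contain no characters that need escaping.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const labelValue of Object.keys(this.values)) {
      lines.push(
        `${this.name}{${this.labelName}="${labelValue}"} ${this.values[labelValue]}`,
      );
    }
    return lines.join("\n") + "\n";
  }
}

const garminSync = new LabeledCounter(
  "phaseflow_garmin_sync_total",
  "Garmin sync attempts",
  "status",
);
garminSync.inc("success");
garminSync.inc("success");
garminSync.inc("failure");
const exposition = garminSync.render();
```

After two successful syncs and one failure, `exposition` holds one sample line per `status` value, which is the shape the "valid Prometheus format" acceptance test would look for.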

Structured Logging

Format

JSON-structured logs for all significant events:

{
  "timestamp": "2024-01-10T12:00:00.000Z",
  "level": "info",
  "message": "Garmin sync completed",
  "userId": "user123",
  "duration_ms": 1250,
  "metrics": {
    "bodyBattery": 95,
    "hrvStatus": "Balanced"
  }
}

Log Levels

| Level | Usage |
| --- | --- |
| `error` | Failures requiring attention (sync failures, email errors) |
| `warn` | Degraded behavior (using cached data, retries) |
| `info` | Normal operations (sync complete, email sent, decision made) |

Key Events to Log

| Event | Level | Fields |
| --- | --- | --- |
| Auth success | info | userId |
| Auth failure | warn | reason, ip |
| Garmin sync start | info | userId |
| Garmin sync complete | info | userId, duration_ms, metrics |
| Garmin sync failure | error | userId, error, attempt |
| Email sent | info | userId, type, recipient |
| Email failed | error | userId, type, error |
| Decision calculated | info | userId, decision, reason |
| Period logged | info | userId, date |
| Override toggled | info | userId, override, enabled |
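
One way to keep these events consistent is to encode the table in types so that missing or misspelled fields fail at compile time. The sketch below covers three rows only. `EventFields`, `EVENT_LEVELS`, and `formatEvent` are hypothetical names, and using the event name as the `message` is an assumption, not something the spec states.

```typescript
type LogLevel = "error" | "warn" | "info";

// Required fields per event, copied from the Fields column (three rows shown).
interface EventFields {
  "Auth failure": { reason: string; ip: string };
  "Garmin sync complete": {
    userId: string;
    duration_ms: number;
    metrics: Record<string, unknown>;
  };
  "Garmin sync failure": { userId: string; error: string; attempt: number };
}

// Level per event, copied from the Level column.
const EVENT_LEVELS: { [E in keyof EventFields]: LogLevel } = {
  "Auth failure": "warn",
  "Garmin sync complete": "info",
  "Garmin sync failure": "error",
};

// Builds one JSON log line for an event; `now` is injectable for testing.
function formatEvent<E extends keyof EventFields>(
  event: E,
  fields: EventFields[E],
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level: EVENT_LEVELS[event],
    message: event,
    ...fields,
  });
}

const sampleLine = formatEvent(
  "Garmin sync failure",
  { userId: "user123", error: "timeout", attempt: 2 },
  new Date("2024-01-10T12:00:00.000Z"),
);
```

Calling `formatEvent("Garmin sync failure", { userId: "user123" })` would be rejected by the compiler because `error` and `attempt` are missing.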

Implementation

Use a structured logger (e.g., pino) configured for JSON output:

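import pino from 'pino';

// JSON logger whose keys match the documented format (`timestamp`, `message`).
export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // pino writes the message under `msg` unless messageKey is set.
  messageKey: 'message',
  // Emit an ISO 8601 `timestamp` instead of pino's default epoch-millisecond `time`.
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  formatters: {
    // Log the level as its label ("info") rather than its numeric value (30).
    level: (label) => ({ level: label }),
  },
});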
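
Because error logs are expected to carry stack traces, a thrown value needs to be turned into loggable fields before it is written. The helper below is a small sketch of that step; `serializeError` is a hypothetical name, and the `error` field name follows the Key Events table.

```typescript
// Converts any thrown value into fields suitable for a structured log entry.
// Error instances keep their stack trace; other values are stringified.
function serializeError(err: unknown): { error: string; stack?: string } {
  if (err instanceof Error) {
    return { error: err.message, stack: err.stack };
  }
  return { error: String(err) };
}

const errorFields = serializeError(new Error("Garmin timeout"));
const stringFields = serializeError("boom");
```

With the logger above, a call could look like `logger.error({ userId, attempt, ...serializeError(err) }, 'Garmin sync failed')`. pino also serializes an `Error` passed under the `err` key, which may be enough if that field name is acceptable.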

Success Criteria

- Health endpoint responds in under 100 ms
- Metrics endpoint scrapable by Prometheus
- All key events logged with consistent structure
- Logs parseable by log aggregators (Loki, ELK, etc.)

Acceptance Tests

- `GET /api/health` returns 200 when healthy
- `GET /api/health` returns 503 when PocketBase unreachable
- `GET /metrics` returns valid Prometheus format
- Custom metrics increment on relevant events
- Logs output valid JSON to stdout
- Error logs include stack traces
