phaseflow/specs/observability.md
Petru Paler 6a8d55c0b9 Document spec gaps: auth, phase scaling, observability, testing
Address 21 previously undefined behaviors across specs:

- Authentication: Replace email/password with OIDC (Pocket-ID)
- Cycle tracking: Add fixed-luteal phase scaling formula with examples
- Calendar: Document period logging behavior (preserve predictions)
- Garmin: Clarify connection is required (no phase-only mode)
- Dashboard: Add UI states, dark mode, onboarding, accessibility
- Notifications: Document timezone batching approach
- New specs: observability.md (health, metrics, logging)
- New specs: testing.md (unit + integration strategy)
- Main spec: Add backup/recovery, known limitations, API updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-11 07:49:56 +00:00

# Observability Specification
## Job to Be Done
When the app is running in production, I want visibility into its health and behavior, so that I can detect and diagnose issues quickly.
## Health Check
### GET `/api/health`
Returns application health status for monitoring and load balancer checks.
**Response (200 OK):**
```json
{
  "status": "ok",
  "timestamp": "2024-01-10T12:00:00Z",
  "version": "1.0.0"
}
```
**Response (503 Service Unavailable):**
```json
{
  "status": "unhealthy",
  "timestamp": "2024-01-10T12:00:00Z",
  "error": "PocketBase connection failed"
}
```
**Checks Performed:**
- PocketBase connectivity
- Basic app startup complete
**Usage:**
- Nomad health checks
- Uptime monitoring (e.g., Uptime Kuma)
- Load balancer health probes
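
The two responses above could be produced by a small, framework-independent helper. The following is a minimal sketch, not the actual handler: `checkHealth` and `pingPocketBase` are hypothetical names, and the probe is assumed to resolve when PocketBase answers and reject otherwise. Note that `toISOString()` includes milliseconds, unlike the example timestamps.

```typescript
// Response shapes copied from the two JSON examples in this section.
type HealthBody =
  | { status: "ok"; timestamp: string; version: string }
  | { status: "unhealthy"; timestamp: string; error: string };

interface HealthResult {
  httpStatus: 200 | 503;
  body: HealthBody;
}

// `pingPocketBase` is a hypothetical probe supplied by the caller:
// it resolves when PocketBase is reachable and rejects otherwise.
export async function checkHealth(
  pingPocketBase: () => Promise<unknown>,
  version: string,
): Promise<HealthResult> {
  const timestamp = new Date().toISOString();
  try {
    await pingPocketBase();
    return { httpStatus: 200, body: { status: "ok", timestamp, version } };
  } catch {
    // Any probe failure maps to 503 with the error text from the spec example.
    return {
      httpStatus: 503,
      body: { status: "unhealthy", timestamp, error: "PocketBase connection failed" },
    };
  }
}
```

A route handler would then only need to call `checkHealth`, set the HTTP status from `httpStatus`, and serialize `body` as JSON.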
## Prometheus Metrics
### GET `/metrics`
Returns Prometheus-format metrics for scraping.
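**Standard Node.js Metrics:**
- `nodejs_heap_size_total_bytes`
- `nodejs_heap_size_used_bytes`
- `nodejs_eventloop_lag_seconds`
**HTTP Metrics** (not part of the `prom-client` default metrics; these require request instrumentation):
- `http_request_duration_seconds` (histogram)
- `http_requests_total` (counter)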
**Custom Application Metrics:**
| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `phaseflow_garmin_sync_total` | counter | `status` (success/failure) | Garmin sync attempts |
| `phaseflow_garmin_sync_duration_seconds` | histogram | - | Garmin sync duration |
| `phaseflow_email_sent_total` | counter | `type` (daily/warning) | Emails sent |
| `phaseflow_decision_engine_calls_total` | counter | `decision` (REST/GENTLE/...) | Decision engine invocations |
| `phaseflow_active_users` | gauge | - | Users with activity in last 24h |
**Implementation:**
Use the `prom-client` npm package for metrics collection.
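
As an illustration of how the Garmin sync metrics in the table might be updated, here is a minimal sketch. It assumes nothing about the real sync code: `instrumentGarminSync` and `runSync` are hypothetical names, and `CounterLike` and `HistogramLike` are small stand-in interfaces modelled on the `inc` and `observe` methods of `prom-client` counters and histograms, so the example runs without the package installed.

```typescript
// Stand-ins for the subset of prom-client's Counter and Histogram used here.
interface CounterLike {
  inc(labels: Record<string, string>, value?: number): void;
}
interface HistogramLike {
  observe(value: number): void;
}

// Hypothetical wrapper: runs one Garmin sync, counts the outcome under the
// `status` label, and records the duration in seconds.
export async function instrumentGarminSync<T>(
  syncTotal: CounterLike, // phaseflow_garmin_sync_total
  syncDuration: HistogramLike, // phaseflow_garmin_sync_duration_seconds
  runSync: () => Promise<T>,
): Promise<T> {
  const startedAtMs = Date.now();
  try {
    const result = await runSync();
    syncTotal.inc({ status: "success" });
    return result;
  } catch (err) {
    syncTotal.inc({ status: "failure" });
    throw err; // the caller still sees the original error
  } finally {
    // The duration is observed for both successful and failed syncs.
    syncDuration.observe((Date.now() - startedAtMs) / 1000);
  }
}
```

Keeping the counting and timing in one wrapper means the sync logic itself stays free of metrics calls, and failures are counted without being swallowed.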
## Structured Logging
### Format
JSON-structured logs for all significant events:
```json
{
  "timestamp": "2024-01-10T12:00:00.000Z",
  "level": "info",
  "message": "Garmin sync completed",
  "userId": "user123",
  "duration_ms": 1250,
  "metrics": {
    "bodyBattery": 95,
    "hrvStatus": "Balanced"
  }
}
```
### Log Levels
| Level | Usage |
|-------|-------|
| `error` | Failures requiring attention (sync failures, email errors) |
| `warn` | Degraded behavior (using cached data, retries) |
| `info` | Normal operations (sync complete, email sent, decision made) |
### Key Events to Log
| Event | Level | Fields |
|-------|-------|--------|
| Auth success | info | userId |
| Auth failure | warn | reason, ip |
| Garmin sync start | info | userId |
| Garmin sync complete | info | userId, duration_ms, metrics |
| Garmin sync failure | error | userId, error, attempt |
| Email sent | info | userId, type, recipient |
| Email failed | error | userId, type, error |
| Decision calculated | info | userId, decision, reason |
| Period logged | info | userId, date |
| Override toggled | info | userId, override, enabled |
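
To show how a row of this table maps onto the log format, the sketch below builds one log line by hand. It is only an illustration of the target shape, not a logger implementation: `formatLogLine` is a hypothetical helper, and the field values (`"HTTP 401"`, `attempt: 2`) are made-up sample data.

```typescript
type LogLevel = "error" | "warn" | "info";

// Builds one JSON log line with the `timestamp`, `level`, and `message` keys
// from the format above, followed by the event-specific fields.
// `now` is injectable so the output is deterministic.
// Caveat: a field named `timestamp`, `level`, or `message` would overwrite
// the core value, so callers should avoid those names.
export function formatLogLine(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...fields,
  });
}

// The "Garmin sync failure" row: level error; fields userId, error, attempt.
const line = formatLogLine(
  "error",
  "Garmin sync failed",
  { userId: "user123", error: "HTTP 401", attempt: 2 },
  new Date("2024-01-10T12:00:00.000Z"),
);
console.log(line);
// {"timestamp":"2024-01-10T12:00:00.000Z","level":"error","message":"Garmin sync failed","userId":"user123","error":"HTTP 401","attempt":2}
```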
### Implementation
Use a structured logger (e.g., `pino`) configured for JSON output:
```typescript
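import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // Match the documented format: `message` and ISO `timestamp` keys
  // instead of pino's default `msg` and epoch-millisecond `time`.
  messageKey: 'message',
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  formatters: {
    // Emit the level as its label ("info") rather than a number (30).
    level: (label) => ({ level: label }),
  },
});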
```
## Success Criteria
1. Health endpoint responds in under 100ms
2. Metrics endpoint scrapable by Prometheus
3. All key events logged with consistent structure
4. Logs parseable by log aggregators (Loki, ELK, etc.)
## Acceptance Tests
- [ ] GET `/api/health` returns 200 when healthy
- [ ] GET `/api/health` returns 503 when PocketBase is unreachable
- [ ] GET `/metrics` returns valid Prometheus format
- [ ] Custom metrics increment on relevant events
- [ ] Logs output valid JSON to stdout
- [ ] Error logs include stack traces