
# Observability Specification

## Job to Be Done

When the app is running in production, I want visibility into its health and behavior, so that I can detect and diagnose issues quickly.

## Health Check

### GET `/api/health`

Returns application health status for monitoring and load balancer checks.

**Response (200 OK):**
```json
{
  "status": "ok",
  "timestamp": "2024-01-10T12:00:00Z",
  "version": "1.0.0"
}
```

**Response (503 Service Unavailable):**
```json
{
  "status": "unhealthy",
  "timestamp": "2024-01-10T12:00:00Z",
  "error": "PocketBase connection failed"
}
```

**Checks Performed:**
- PocketBase connectivity
- Basic app startup complete

**Usage:**
- Nomad health checks
- Uptime monitoring (e.g., Uptime Kuma)
- Load balancer health probes
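
The mapping from check outcome to response can be sketched as a pure function. This is a minimal sketch, not the actual handler: the names `buildHealthResult`, `HealthBody`, and `HealthResult` are hypothetical, and it assumes the caller has already run the PocketBase connectivity check and passes in either `null` (healthy) or an error message.

```typescript
// Hypothetical types for this sketch; field names follow the JSON responses above.
type HealthBody =
  | { status: 'ok'; timestamp: string; version: string }
  | { status: 'unhealthy'; timestamp: string; error: string };

interface HealthResult {
  httpStatus: 200 | 503;
  body: HealthBody;
}

// Maps the outcome of the PocketBase connectivity check to a response.
// `pocketBaseError` is null when the check passed (an assumption of this sketch).
export function buildHealthResult(
  pocketBaseError: string | null,
  version: string,
  now: Date = new Date(),
): HealthResult {
  // toISOString() includes milliseconds; strip them to match the examples above.
  const timestamp = now.toISOString().replace(/\.\d{3}Z$/, 'Z');

  if (pocketBaseError !== null) {
    return {
      httpStatus: 503,
      body: { status: 'unhealthy', timestamp, error: pocketBaseError },
    };
  }
  return { httpStatus: 200, body: { status: 'ok', timestamp, version } };
}
```

Keeping the mapping free of I/O makes both acceptance cases (200 and 503) testable without a running PocketBase instance; the route handler would only perform the check and serialize the result.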

## Prometheus Metrics

### GET `/metrics`

Returns Prometheus-format metrics for scraping.

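**Standard Node.js Metrics** (from `prom-client` default metrics collection):
- `nodejs_heap_size_total_bytes`
- `nodejs_heap_size_used_bytes`
- `nodejs_eventloop_lag_seconds`

**HTTP Metrics** (not part of the default set; recorded by application middleware):
- `http_request_duration_seconds` (histogram)
- `http_requests_total` (counter)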

**Custom Application Metrics:**

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `phaseflow_garmin_sync_total` | counter | `status` (success/failure) | Garmin sync attempts |
| `phaseflow_garmin_sync_duration_seconds` | histogram | - | Garmin sync duration |
| `phaseflow_email_sent_total` | counter | `type` (daily/warning) | Emails sent |
| `phaseflow_decision_engine_calls_total` | counter | `decision` (REST/GENTLE/...) | Decision engine invocations |
| `phaseflow_active_users` | gauge | - | Users with activity in the last 24 hours |

**Implementation:**
Use the `prom-client` npm package for metrics collection.
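
As an illustration of how the Garmin sync metrics from the table might be recorded, here is a minimal, dependency-free sketch. The helper `recordGarminSync` and the interfaces `SyncCounter` and `SyncHistogram` are hypothetical; the interfaces are intended to be structurally compatible with the `inc` and `observe` methods of `prom-client` counters and histograms, so that real metric instances or test fakes can be passed in.

```typescript
// Structural stand-ins for the parts of a counter and a histogram that this
// sketch uses (assumption: real prom-client instances expose these methods).
interface SyncCounter {
  inc(labels: { status: 'success' | 'failure' }): void;
}

interface SyncHistogram {
  observe(value: number): void;
}

// Hypothetical helper: records one Garmin sync attempt.
// `syncTotal` corresponds to phaseflow_garmin_sync_total and
// `syncDuration` to phaseflow_garmin_sync_duration_seconds.
export function recordGarminSync(
  syncTotal: SyncCounter,
  syncDuration: SyncHistogram,
  succeeded: boolean,
  durationMs: number,
): void {
  syncTotal.inc({ status: succeeded ? 'success' : 'failure' });
  // The metric is in seconds, while the structured logs use duration_ms.
  syncDuration.observe(durationMs / 1000);
}
```

Converting to seconds in one place avoids mixing units between the `duration_ms` log field and the `_seconds` histogram.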

## Structured Logging

### Format

JSON-structured logs for all significant events:

```json
{
  "timestamp": "2024-01-10T12:00:00.000Z",
  "level": "info",
  "message": "Garmin sync completed",
  "userId": "user123",
  "duration_ms": 1250,
  "metrics": {
    "bodyBattery": 95,
    "hrvStatus": "Balanced"
  }
}
```

### Log Levels

| Level | Usage |
|-------|-------|
| `error` | Failures requiring attention (sync failures, email errors) |
| `warn` | Degraded behavior (using cached data, retries) |
| `info` | Normal operations (sync complete, email sent, decision made) |

### Key Events to Log

| Event | Level | Fields |
|-------|-------|--------|
| Auth success | info | userId |
| Auth failure | warn | reason, ip |
| Garmin sync start | info | userId |
| Garmin sync complete | info | userId, duration_ms, metrics |
| Garmin sync failure | error | userId, error, attempt |
| Email sent | info | userId, type, recipient |
| Email failed | error | userId, type, error |
| Decision calculated | info | userId, decision, reason |
| Period logged | info | userId, date |
| Override toggled | info | userId, override, enabled |

### Implementation

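Use a structured logger (e.g., `pino`) configured for JSON output that matches the format above:

```typescript
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // Emit "message" instead of pino's default "msg" key.
  messageKey: 'message',
  // Emit an ISO 8601 "timestamp" field instead of pino's default epoch-millisecond "time".
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  formatters: {
    // Log the level as a label ("info") rather than pino's numeric value (30).
    level: (label) => ({ level: label }),
  },
});
```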
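
To check that key events keep a consistent shape regardless of the logger used, the expected line format can be expressed as a small pure function. This is a dependency-free sketch for illustration and testing, not a replacement for the logger; `formatLogLine` and `errorFields` are hypothetical names, and the `stack` field name is an assumption.

```typescript
type LogLevel = 'error' | 'warn' | 'info';

// Hypothetical helper: flattens an unknown thrown value into loggable fields,
// including the stack trace when one is available.
export function errorFields(err: unknown): { error: string; stack?: string } {
  if (err instanceof Error) {
    return { error: err.message, stack: err.stack };
  }
  return { error: String(err) };
}

// Hypothetical helper: builds one JSON log line in the documented format.
// The reserved keys (timestamp, level, message) override same-named keys
// in `fields`.
export function formatLogLine(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields,
    timestamp: now.toISOString(),
    level,
    message,
  });
}
```

For example, a Garmin sync failure could be formatted as `formatLogLine('error', 'Garmin sync failed', { userId, attempt, ...errorFields(err) })`, which carries the `userId`, `error`, and `attempt` fields listed in the event table plus the stack trace.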

## Success Criteria

1. Health endpoint responds in under 100 ms
2. Metrics endpoint scrapable by Prometheus
3. All key events logged with consistent structure
4. Logs parseable by log aggregators (Loki, ELK, etc.)

## Acceptance Tests

- [ ] GET `/api/health` returns 200 when healthy
- [ ] GET `/api/health` returns 503 when PocketBase unreachable
- [ ] GET `/metrics` returns valid Prometheus format
- [ ] Custom metrics increment on relevant events
- [ ] Logs output valid JSON to stdout
- [ ] Error logs include stack traces