Address 21 previously undefined behaviors across specs:

- Authentication: Replace email/password with OIDC (Pocket-ID)
- Cycle tracking: Add fixed-luteal phase scaling formula with examples
- Calendar: Document period logging behavior (preserve predictions)
- Garmin: Clarify connection is required (no phase-only mode)
- Dashboard: Add UI states, dark mode, onboarding, accessibility
- Notifications: Document timezone batching approach
- New specs: observability.md (health, metrics, logging)
- New specs: testing.md (unit + integration strategy)
- Main spec: Add backup/recovery, known limitations, API updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

# Observability Specification

## Job to Be Done

When the app is running in production, I want visibility into its health and behavior, so that I can detect and diagnose issues quickly.

## Health Check

### GET `/api/health`

Returns application health status for monitoring and load balancer checks.

**Response (200 OK):**
```json
{
  "status": "ok",
  "timestamp": "2024-01-10T12:00:00Z",
  "version": "1.0.0"
}
```

**Response (503 Service Unavailable):**
```json
{
  "status": "unhealthy",
  "timestamp": "2024-01-10T12:00:00Z",
  "error": "PocketBase connection failed"
}
```

**Checks Performed:**
- PocketBase connectivity
- Basic app startup complete

**Usage:**
- Nomad health checks
- Uptime monitoring (e.g., Uptime Kuma)
- Load balancer health probes

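As a rough illustration of how the two responses above could be produced, here is a minimal sketch. The names `getHealth` and `checkPocketBase` and the `HealthResponse` type are hypothetical and not part of the spec; only the JSON shapes and the fixed error message come from the examples above. The sketch assumes the PocketBase probe is passed in as a function that rejects when the database is unreachable.

```typescript
// Hypothetical types and names; only the JSON shapes come from the spec above.
type HealthResponse = {
  statusCode: 200 | 503;
  body: {
    status: "ok" | "unhealthy";
    timestamp: string;
    version?: string;
    error?: string;
  };
};

// `checkPocketBase` is an assumed dependency: it should resolve when
// PocketBase is reachable and reject otherwise.
async function getHealth(
  checkPocketBase: () => Promise<void>,
  version: string,
  now: () => Date = () => new Date(),
): Promise<HealthResponse> {
  // Drop milliseconds to match the timestamp format shown in the spec.
  const timestamp = now().toISOString().replace(/\.\d{3}Z$/, "Z");
  try {
    await checkPocketBase();
    return { statusCode: 200, body: { status: "ok", timestamp, version } };
  } catch {
    // Report the fixed message from the spec, not the underlying error detail.
    return {
      statusCode: 503,
      body: { status: "unhealthy", timestamp, error: "PocketBase connection failed" },
    };
  }
}
```

A route handler would then only need to call `getHealth(...)` and copy `statusCode` and `body` into the HTTP response. Injecting the probe keeps the logic testable without a running PocketBase instance.
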
## Prometheus Metrics

### GET `/metrics`

Returns Prometheus-format metrics for scraping.

**Standard Node.js and HTTP Metrics:**
- `nodejs_heap_size_total_bytes`
- `nodejs_heap_size_used_bytes`
- `nodejs_eventloop_lag_seconds`
- `http_request_duration_seconds` (histogram)
- `http_requests_total` (counter)

**Custom Application Metrics:**

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `phaseflow_garmin_sync_total` | counter | `status` (success/failure) | Garmin sync attempts |
| `phaseflow_garmin_sync_duration_seconds` | histogram | - | Garmin sync duration |
| `phaseflow_email_sent_total` | counter | `type` (daily/warning) | Emails sent |
| `phaseflow_decision_engine_calls_total` | counter | `decision` (REST/GENTLE/...) | Decision engine invocations |
| `phaseflow_active_users` | gauge | - | Users with activity in last 24h |

**Implementation:**
Use the `prom-client` npm package for metrics collection.

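To make the output of `/metrics` concrete, the sketch below renders one of the custom counters in the Prometheus text exposition format. `LabeledCounter` is a hypothetical, dependency-free stand-in written only for illustration; it is not the `prom-client` API, and in the real app `prom-client` would produce this text itself.

```typescript
// Hypothetical stand-in for a labeled counter, for illustration only.
class LabeledCounter {
  private values: Record<string, number> = {};

  constructor(
    readonly name: string,
    readonly help: string,
    readonly labelName: string,
  ) {}

  // Increment the series identified by one label value.
  inc(labelValue: string, amount = 1): void {
    this.values[labelValue] = (this.values[labelValue] ?? 0) + amount;
  }

  // Render HELP/TYPE headers followed by one sample line per label value.
  // Escaping of special characters in label values is omitted for brevity.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const labelValue of Object.keys(this.values)) {
      lines.push(
        `${this.name}{${this.labelName}="${labelValue}"} ${this.values[labelValue]}`,
      );
    }
    return lines.join("\n") + "\n";
  }
}

const garminSync = new LabeledCounter(
  "phaseflow_garmin_sync_total",
  "Garmin sync attempts",
  "status",
);
garminSync.inc("success");
garminSync.inc("success");
garminSync.inc("failure");
console.log(garminSync.render());
```

After two successful syncs and one failure, the rendered text contains the sample lines `phaseflow_garmin_sync_total{status="success"} 2` and `phaseflow_garmin_sync_total{status="failure"} 1`, which is the shape a Prometheus scraper expects for a counter with a `status` label.
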
## Structured Logging

### Format

JSON-structured logs for all significant events:

```json
{
  "timestamp": "2024-01-10T12:00:00.000Z",
  "level": "info",
  "message": "Garmin sync completed",
  "userId": "user123",
  "duration_ms": 1250,
  "metrics": {
    "bodyBattery": 95,
    "hrvStatus": "Balanced"
  }
}
```

### Log Levels

| Level | Usage |
|-------|-------|
| `error` | Failures requiring attention (sync failures, email errors) |
| `warn` | Degraded behavior (using cached data, retries) |
| `info` | Normal operations (sync complete, email sent, decision made) |

### Key Events to Log

| Event | Level | Fields |
|-------|-------|--------|
| Auth success | info | userId |
| Auth failure | warn | reason, ip |
| Garmin sync start | info | userId |
| Garmin sync complete | info | userId, duration_ms, metrics |
| Garmin sync failure | error | userId, error, attempt |
| Email sent | info | userId, type, recipient |
| Email failed | error | userId, type, error |
| Decision calculated | info | userId, decision, reason |
| Period logged | info | userId, date |
| Override toggled | info | userId, override, enabled |

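One possible way to keep these fields consistent is to encode the table as types, so that a call site cannot omit or misspell a field. The sketch below covers four of the events. The names `KeyEventFields`, `LEVELS` and `formatEvent` are hypothetical, and using the event name as the `message` value is an assumption made for the example, not something the spec prescribes.

```typescript
// Field names follow the "Key Events to Log" table; type names are hypothetical.
type KeyEventFields = {
  "Auth success": { userId: string };
  "Auth failure": { reason: string; ip: string };
  "Garmin sync complete": {
    userId: string;
    duration_ms: number;
    metrics: Record<string, unknown>;
  };
  "Garmin sync failure": { userId: string; error: string; attempt: number };
};

type Level = "error" | "warn" | "info";

// Level per event, copied from the table. The mapped type forces an entry
// for every event declared in KeyEventFields.
const LEVELS: { [E in keyof KeyEventFields]: Level } = {
  "Auth success": "info",
  "Auth failure": "warn",
  "Garmin sync complete": "info",
  "Garmin sync failure": "error",
};

// Build one JSON log line; `fields` must match the shape declared for `event`.
function formatEvent<E extends keyof KeyEventFields>(
  event: E,
  fields: KeyEventFields[E],
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level: LEVELS[event],
    message: event, // assumption: the event name doubles as the message
    ...fields,
  });
}

console.log(
  formatEvent("Garmin sync failure", {
    userId: "user123",
    error: "timeout",
    attempt: 2,
  }),
);
```

With this shape, passing `{ userId: "user123" }` for `"Garmin sync failure"` fails at compile time because `error` and `attempt` are missing, which catches inconsistent log fields before they reach production.
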
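### Implementation

Use a structured logger (e.g., `pino`) configured for JSON output:

```typescript
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // Emit `message` and an ISO-8601 `timestamp` to match the documented format
  // (pino's defaults are `msg` and an epoch-millisecond `time`).
  messageKey: 'message',
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  formatters: {
    // Log the level as its label ("info") rather than pino's numeric value.
    level: (label) => ({ level: label }),
  },
});
```
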
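One detail worth spelling out for error-level events is how an `Error` ends up in the JSON output, since `JSON.stringify` on a plain `Error` drops its message and stack. The helper below is a hypothetical sketch of explicit serialization; the function name and the `name`/`message`/`stack` field names are assumptions, and pino also ships its own standard error serializer (`pino.stdSerializers.err`) that could be used instead.

```typescript
// Hypothetical helper; the output field names are an assumption.
function serializeError(err: unknown): {
  name: string;
  message: string;
  stack?: string;
} {
  if (err instanceof Error) {
    // `message` and `stack` are non-enumerable, so copy them explicitly.
    return { name: err.name, message: err.message, stack: err.stack };
  }
  // Non-Error throwables (strings, plain objects) have no stack to report.
  return { name: "NonError", message: String(err) };
}

const failure = new Error("Garmin API timeout");

// Naive serialization loses everything useful:
console.log(JSON.stringify(failure)); // prints {}

// Explicit serialization keeps the message and the stack trace:
console.log(
  JSON.stringify({ level: "error", error: serializeError(failure) }),
);
```

Whichever serializer is chosen, the point is that the stack trace has to be copied into the log object deliberately; it will not appear on its own.
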
## Success Criteria

1. Health endpoint responds in under 100ms
2. Metrics endpoint scrapable by Prometheus
3. All key events logged with consistent structure
4. Logs parseable by log aggregators (Loki, ELK, etc.)

## Acceptance Tests

- [ ] GET `/api/health` returns 200 when healthy
- [ ] GET `/api/health` returns 503 when PocketBase unreachable
- [ ] GET `/metrics` returns valid Prometheus format
- [ ] Custom metrics increment on relevant events
- [ ] Logs output valid JSON to stdout
- [ ] Error logs include stack traces