Document spec gaps: auth, phase scaling, observability, testing

Address 21 previously undefined behaviors across specs:

- Authentication: Replace email/password with OIDC (Pocket-ID)
- Cycle tracking: Add fixed-luteal phase scaling formula with examples
- Calendar: Document period logging behavior (preserve predictions)
- Garmin: Clarify connection is required (no phase-only mode)
- Dashboard: Add UI states, dark mode, onboarding, accessibility
- Notifications: Document timezone batching approach
- New specs: observability.md (health, metrics, logging)
- New specs: testing.md (unit + integration strategy)
- Main spec: Add backup/recovery, known limitations, API updates

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
138
specs/observability.md
Normal file
@@ -0,0 +1,138 @@
# Observability Specification

## Job to Be Done

When the app is running in production, I want visibility into its health and behavior, so that I can detect and diagnose issues quickly.

## Health Check

### GET `/api/health`

Returns application health status for monitoring and load balancer checks.

**Response (200 OK):**
```json
{
  "status": "ok",
  "timestamp": "2024-01-10T12:00:00Z",
  "version": "1.0.0"
}
```

**Response (503 Service Unavailable):**
```json
{
  "status": "unhealthy",
  "timestamp": "2024-01-10T12:00:00Z",
  "error": "PocketBase connection failed"
}
```

**Checks Performed:**
- PocketBase connectivity
- Basic app startup complete

**Usage:**
- Nomad health checks
- Uptime monitoring (e.g., Uptime Kuma)
- Load balancer health probes
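
The two response shapes above can be sketched as a small pure function. This is a minimal illustration, assuming the PocketBase connectivity probe has already been reduced to a boolean; the name `buildHealthResponse` and its parameters are hypothetical and not part of the spec. Note that `toISOString()` includes milliseconds, so the timestamp is slightly more precise than the second-resolution examples.

```typescript
// Body shapes mirror the 200 and 503 examples in the spec.
type HealthBody =
  | { status: "ok"; timestamp: string; version: string }
  | { status: "unhealthy"; timestamp: string; error: string };

interface HealthResult {
  httpStatus: 200 | 503;
  body: HealthBody;
}

// Hypothetical helper: `pocketBaseReachable` is the outcome of whatever
// connectivity probe the app performs before answering the request.
function buildHealthResponse(
  pocketBaseReachable: boolean,
  version: string,
  now: Date = new Date(),
): HealthResult {
  const timestamp = now.toISOString();
  if (pocketBaseReachable) {
    return { httpStatus: 200, body: { status: "ok", timestamp, version } };
  }
  return {
    httpStatus: 503,
    body: { status: "unhealthy", timestamp, error: "PocketBase connection failed" },
  };
}
```

A route handler would run the probe, pass its result in, and send `httpStatus` together with the JSON-serialized `body`.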

## Prometheus Metrics

### GET `/metrics`

Returns Prometheus-format metrics for scraping.

**Standard Node.js and HTTP Metrics:**
- `nodejs_heap_size_total_bytes`
- `nodejs_heap_size_used_bytes`
- `nodejs_eventloop_lag_seconds`
- `http_request_duration_seconds` (histogram)
- `http_requests_total` (counter)

**Custom Application Metrics:**

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `phaseflow_garmin_sync_total` | counter | `status` (success/failure) | Garmin sync attempts |
| `phaseflow_garmin_sync_duration_seconds` | histogram | - | Garmin sync duration |
| `phaseflow_email_sent_total` | counter | `type` (daily/warning) | Emails sent |
| `phaseflow_decision_engine_calls_total` | counter | `decision` (REST/GENTLE/...) | Decision engine invocations |
| `phaseflow_active_users` | gauge | - | Users with activity in last 24h |

**Implementation:**
Use the `prom-client` npm package for metrics collection.
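
To illustrate what a scrape of `/metrics` might contain, the following hand-rolled sketch renders one custom counter in the Prometheus text exposition format. It deliberately avoids `prom-client`, which the spec names for the real implementation, so that it stays dependency-free; the `LabeledCounter` class is hypothetical, supports a single label, and assumes label values need no escaping.

```typescript
// Illustration only: a tiny in-memory counter with one label.
class LabeledCounter {
  private counts: Record<string, number> = {};
  private name: string;
  private help: string;
  private labelName: string;

  constructor(name: string, help: string, labelName: string) {
    this.name = name;
    this.help = help;
    this.labelName = labelName;
  }

  // Increase the series identified by `labelValue`.
  inc(labelValue: string, amount: number = 1): void {
    this.counts[labelValue] = (this.counts[labelValue] || 0) + amount;
  }

  // Render HELP and TYPE comments followed by one sample line per label value.
  render(): string {
    const lines = [
      `# HELP ${this.name} ${this.help}`,
      `# TYPE ${this.name} counter`,
    ];
    for (const labelValue of Object.keys(this.counts)) {
      lines.push(`${this.name}{${this.labelName}="${labelValue}"} ${this.counts[labelValue]}`);
    }
    return lines.join("\n") + "\n";
  }
}

const garminSyncTotal = new LabeledCounter(
  "phaseflow_garmin_sync_total",
  "Garmin sync attempts",
  "status",
);
garminSyncTotal.inc("success");
garminSyncTotal.inc("success");
garminSyncTotal.inc("failure");

const exposition = garminSyncTotal.render();
// # HELP phaseflow_garmin_sync_total Garmin sync attempts
// # TYPE phaseflow_garmin_sync_total counter
// phaseflow_garmin_sync_total{status="success"} 2
// phaseflow_garmin_sync_total{status="failure"} 1
```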

## Structured Logging

### Format

JSON-structured logs for all significant events:

```json
{
  "timestamp": "2024-01-10T12:00:00.000Z",
  "level": "info",
  "message": "Garmin sync completed",
  "userId": "user123",
  "duration_ms": 1250,
  "metrics": {
    "bodyBattery": 95,
    "hrvStatus": "Balanced"
  }
}
```

### Log Levels

| Level | Usage |
|-------|-------|
| `error` | Failures requiring attention (sync failures, email errors) |
| `warn` | Degraded behavior (using cached data, retries) |
| `info` | Normal operations (sync complete, email sent, decision made) |

### Key Events to Log

| Event | Level | Fields |
|-------|-------|--------|
| Auth success | info | userId |
| Auth failure | warn | reason, ip |
| Garmin sync start | info | userId |
| Garmin sync complete | info | userId, duration_ms, metrics |
| Garmin sync failure | error | userId, error, attempt |
| Email sent | info | userId, type, recipient |
| Email failed | error | userId, type, error |
| Decision calculated | info | userId, decision, reason |
| Period logged | info | userId, date |
| Override toggled | info | userId, override, enabled |
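
As a rough sketch of how one row of this table maps onto the documented log format, the helper below assembles an entry with the three common keys first and the event-specific fields after them. `buildLogEntry` and its types are hypothetical names used only for illustration; a real implementation would let the logging library produce this structure.

```typescript
type LogLevel = "error" | "warn" | "info";

interface LogEntry {
  timestamp: string;
  level: LogLevel;
  message: string;
  [field: string]: unknown;
}

// Hypothetical helper: combines the common keys with per-event fields.
function buildLogEntry(
  level: LogLevel,
  message: string,
  fields: Record<string, unknown> = {},
  now: Date = new Date(),
): LogEntry {
  const entry: LogEntry = { timestamp: now.toISOString(), level: level, message: message };
  for (const key of Object.keys(fields)) {
    // Event fields must not overwrite the common keys.
    if (key !== "timestamp" && key !== "level" && key !== "message") {
      entry[key] = fields[key];
    }
  }
  return entry;
}

// The "Garmin sync complete" row: userId, duration_ms, metrics.
const syncComplete = buildLogEntry(
  "info",
  "Garmin sync completed",
  { userId: "user123", duration_ms: 1250, metrics: { bodyBattery: 95, hrvStatus: "Balanced" } },
  new Date("2024-01-10T12:00:00.000Z"),
);
const syncCompleteLine = JSON.stringify(syncComplete);
```

Serializing `syncComplete` yields a single JSON line with the same keys and values as the example in the Format section.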
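
### Implementation

Use a structured logger (e.g., `pino`) configured for JSON output:

```typescript
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  // Use "message" instead of pino's default "msg" key, matching the Format section.
  messageKey: 'message',
  // Emit an ISO 8601 "timestamp" field instead of pino's default epoch-millisecond "time".
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
  formatters: {
    // Log the level as its label ("info") rather than pino's numeric value (30).
    level: (label) => ({ level: label }),
  },
});
```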

## Success Criteria

1. Health endpoint responds in under 100ms
2. Metrics endpoint scrapable by Prometheus
3. All key events logged with consistent structure
4. Logs parseable by log aggregators (Loki, ELK, etc.)

## Acceptance Tests

- [ ] GET `/api/health` returns 200 when healthy
- [ ] GET `/api/health` returns 503 when PocketBase unreachable
- [ ] GET `/metrics` returns valid Prometheus format
- [ ] Custom metrics increment on relevant events
- [ ] Logs output valid JSON to stdout
- [ ] Error logs include stack traces
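
The last two checks could be automated along the following lines. This is a sketch under stated assumptions, not a prescribed test: `checkLogLine` is a hypothetical helper, and the top-level `stack` field it looks for on error entries is an assumption, since the spec does not name the field that carries the stack trace.

```typescript
// Returns a list of problems found in one captured stdout line; empty means it passes.
function checkLogLine(line: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch (err) {
    return ["not valid JSON"];
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return ["not a JSON object"];
  }

  const entry = parsed as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of ["timestamp", "level", "message"]) {
    if (typeof entry[key] !== "string") {
      problems.push(`missing string field "${key}"`);
    }
  }
  // Assumption: error entries carry the trace in a top-level "stack" string.
  if (entry["level"] === "error" && typeof entry["stack"] !== "string") {
    problems.push('error entry has no "stack" field');
  }
  return problems;
}
```

A test run would capture the app's stdout, split it on newlines, and assert that `checkLogLine` returns an empty array for every non-empty line.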