SPECTRE Architecture Decision Record

Project: SPECTRE Fleet | Date: 2026-02-15 | Status: Active Development (Phase 2 → Phase 3)


🎯 Critical Decisions

ADR-001: Nix-First Kubernetes Orchestration over Helm

Status: ✅ Accepted & Implemented | Classification: Critical

Decision: Use Nix modules to generate Kubernetes manifests declaratively instead of Helm charts as primary deployment method.

Context:
  • Required reproducible deployments across dev/prod environments
  • Wanted to eliminate Docker daemon dependency
  • Needed build-time validation of configurations
  • Already using Nix flakes for development environment

Rationale:
  • Reproducibility: Nix content-addressable store guarantees bit-for-bit identical builds
  • Type Safety: Nix language catches configuration errors at build-time
  • No Docker Daemon: OCI images built with dockerTools.buildLayeredImage
  • Hermetic Builds: All dependencies pinned in flake.lock, zero version drift
  • Developer Isolation: nix develop .#kubernetes provides isolated K8s tooling

Trade-offs Accepted:
  • ✅ Accept: Steeper learning curve for Nix vs Helm
  • ✅ Accept: Smaller community ecosystem
  • ✅ Accept: JSON output instead of YAML (both valid for K8s)
  • ✅ Gain: Build-time correctness guarantees
  • ✅ Gain: Zero configuration drift

Implementation:
  • nix/lib/k8s.nix: Helper functions (mkLabels, mkContainer, mkHttpProbe)
  • nix/kubernetes/*.nix: Resource modules (Deployment, Service, ConfigMap, Ingress)
  • nix/images/spectre-proxy.nix: Container image builder
  • flake.nix: Packages (manifests) and apps (deploy scripts)
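For illustration, a minimal sketch of the daemonless image build, assuming nixpkgs' dockerTools; the spectre-proxy binary derivation, exposed port, and attribute layout are assumptions, not the actual contents of nix/images/spectre-proxy.nix:

{ pkgs, spectre-proxy }:  # spectre-proxy: the compiled Rust binary derivation (assumed)

pkgs.dockerTools.buildLayeredImage {
  name = "spectre-proxy";
  tag = "latest";
  # Only the binary and its runtime closure land in the image; no Docker daemon is involved.
  contents = [ spectre-proxy ];
  config = {
    Cmd = [ "${spectre-proxy}/bin/spectre-proxy" ];
    ExposedPorts = { "8080/tcp" = { }; };
  };
}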

References:
  • User request: "Vamos criar modulos nix para orquestrar as fleets de kube" ("Let's create Nix modules to orchestrate the kube fleets")
  • Files: nix/kubernetes/default.nix, flake.nix, KUBERNETES.md
  • Helm chart maintained as a fallback option


ADR-002: Argon2id KDF for Secret Encryption (Critical Security Fix)

Status: ✅ Accepted & Implemented | Classification: Critical

Decision: Replace weak XOR-based key derivation with Argon2id algorithm for AES-256 key generation.

Context:
  • Original implementation used XOR (password.bytes().cycle()) to derive AES keys
  • Short passwords could generate keys with zero bytes (critical vulnerability)
  • Salt parameter was present but ignored
  • A TODO comment acknowledged the weakness

Rationale:
  • Security: Argon2id is OWASP/NIST recommended for password-based key derivation
  • Resistance: Protected against GPU/ASIC brute-force attacks
  • Memory-Hard: Prevents parallel cracking attempts
  • Proper Salting: Uses cryptographically random salt via OsRng

Implementation:

// Before (VULNERABLE):
let key_bytes: Vec<u8> = password.bytes().cycle().take(32).collect();

// After (SECURE):
let mut key_bytes = [0u8; 32];
let argon2 = Argon2::default();
argon2.hash_password_into(password.as_bytes(), salt, &mut key_bytes)?;
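For completeness, a minimal sketch of the full derivation path assuming the argon2 and rand crates listed in Cargo.toml; the function name, salt length, and return shape are illustrative, not the exact crypto.rs API:

// Illustrative key derivation; names and salt size are assumptions.
use argon2::Argon2;
use rand::{rngs::OsRng, RngCore};

fn derive_key(password: &str) -> Result<([u8; 32], [u8; 16]), argon2::Error> {
    // Fresh, cryptographically random salt per secret, stored alongside the ciphertext.
    let mut salt = [0u8; 16];
    OsRng.fill_bytes(&mut salt);

    let mut key = [0u8; 32];
    Argon2::default().hash_password_into(password.as_bytes(), &salt, &mut key)?;
    Ok((key, salt))
}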

Files Modified:
  • crates/spectre-secrets/src/crypto.rs: Argon2id implementation
  • crates/spectre-secrets/Cargo.toml: Added argon2, rand deps
  • crates/spectre-secrets/src/lib.rs: Wire CryptoEngine modules

Impact: Fixed a critical security vulnerability that could have led to secret leakage.


ADR-003: Ingress + cert-manager Architecture over Service Mesh

Status: ✅ Accepted & Implemented | Classification: Major

Decision: Use nginx-ingress + cert-manager for TLS termination instead of service mesh (Istio/Linkerd).

Context:
  • Three options evaluated: Ingress, LoadBalancer, Service Mesh
  • Need TLS termination, routing, and certificate management
  • Want to minimize operational complexity for initial deployment

Alternatives Considered:

Option A: Service Mesh (Istio/Linkerd)
  • Pros: mTLS, advanced traffic management, observability
  • Cons: High complexity, resource overhead, steep learning curve
  • Rejected: Over-engineered for current scale

Option B: LoadBalancer per Service
  • Pros: Simple, direct
  • Cons: Multiple IPs, no path routing, manual cert management
  • Rejected: Not cost-effective, limited routing

Option C: Ingress + cert-manager (CHOSEN)
  • Pros: Single entry point, path-based routing, automatic TLS, battle-tested
  • Cons: Less sophisticated traffic policies than a mesh
  • Accepted: Right balance of features vs complexity

Implementation:
  • nginx-ingress controller for L7 routing
  • cert-manager for Let's Encrypt certificate automation
  • Ingress resource in nix/kubernetes/ingress.nix
  • TLS disabled in proxy (Ingress handles it)

Configuration:

ingress = {
  enabled = true;
  className = "nginx";
  host = "spectre.production.com";
  tls = {
    enabled = true;
    issuer = "letsencrypt-prod";
  };
};
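For illustration, the shape of the Ingress such a configuration would produce, as a hedged sketch; the resource name, TLS secret name, and backend service/port are assumptions, and the real module is nix/kubernetes/ingress.nix:

# Sketch only; values below are illustrative, not the module's actual output.
{
  apiVersion = "networking.k8s.io/v1";
  kind = "Ingress";
  metadata = {
    name = "spectre-proxy";
    annotations = {
      # cert-manager watches this annotation and provisions the Let's Encrypt certificate.
      "cert-manager.io/cluster-issuer" = "letsencrypt-prod";
    };
  };
  spec = {
    ingressClassName = "nginx";
    tls = [ { hosts = [ "spectre.production.com" ]; secretName = "spectre-proxy-tls"; } ];
    rules = [ {
      host = "spectre.production.com";
      http.paths = [ {
        path = "/";
        pathType = "Prefix";
        backend.service = { name = "spectre-proxy"; port.number = 8080; };
      } ];
    } ];
  };
}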

ADR-004: NATS JetStream Event-Driven Architecture

Status: ✅ Accepted & Implemented (Inherited from Phase 1) | Classification: Critical

Decision: Use NATS JetStream as the event bus for asynchronous communication between services.

Context:
  • SPECTRE is an event-driven microservices architecture
  • Need reliable message delivery with at-least-once semantics
  • Require pub/sub patterns and stream persistence

Rationale:
  • Performance: 11M+ msgs/sec throughput
  • Persistence: JetStream provides durable streams
  • Simplicity: Simpler than Kafka, lighter than RabbitMQ
  • Cloud-Native: CNCF project, K8s native operators

Implementation:
  • crates/spectre-events/src/client.rs: NATS client wrapper
  • Automatic reconnection enabled
  • Connection health checks via is_connected()
  • Event publishing for proxy requests

Fixes Applied:
  • Enabled retry_on_initial_connect() for resilience
  • Fixed is_connected() to check !client.is_closed()
  • Added connection state logging
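A minimal sketch of the connect-and-publish path, assuming the async-nats crate; the subject name, NATS URL, and error handling are illustrative rather than the wrapper's actual API:

// Illustrative only; the real wrapper lives in crates/spectre-events/src/client.rs.
use async_nats::{jetstream, ConnectOptions};

async fn publish_proxy_event() -> Result<(), async_nats::Error> {
    // Keep retrying the initial connection instead of failing fast at startup.
    let client = ConnectOptions::new()
        .retry_on_initial_connect()
        .connect("nats://nats:4222")
        .await?;

    let js = jetstream::new(client);

    // Awaiting the publish ack gives at-least-once delivery into the durable stream.
    js.publish("spectre.proxy.requests", "payload".into())
        .await?
        .await?;
    Ok(())
}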


⚙️ Major Decisions

ADR-005: Token Bucket Rate Limiting Strategy

Status: ✅ Accepted & Implemented | Classification: Major

Decision: Implement token bucket algorithm for request rate limiting with configurable RPS and burst.

Rationale:
  • Fair: Allows burst traffic while maintaining average rate
  • Configurable: RATE_LIMIT_RPS and RATE_LIMIT_BURST env vars
  • Standard: Industry-standard algorithm
  • Efficient: O(1) token check per request
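Conceptually, the bucket refills at the configured RPS up to the burst capacity and each request consumes one token. A minimal sketch of the algorithm itself (the production path uses the tower-governor middleware, not this code; field names are illustrative):

use std::time::Instant;

// Illustrative token bucket; not the tower-governor integration.
struct TokenBucket {
    capacity: f64,   // burst size (e.g. RATE_LIMIT_BURST)
    tokens: f64,     // currently available tokens
    refill_rps: f64, // tokens added per second (e.g. RATE_LIMIT_RPS)
    last_refill: Instant,
}

impl TokenBucket {
    fn try_acquire(&mut self) -> bool {
        // Refill proportionally to elapsed time, capped at capacity: O(1) per request.
        let now = Instant::now();
        let elapsed = now.duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_rps).min(self.capacity);
        self.last_refill = now;

        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false // caller responds 429 with a Retry-After header
        }
    }
}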

Implementation:
  • Tower-governor middleware integration
  • Per-IP rate limiting
  • 429 responses with Retry-After header
  • Default: 100 RPS, 200 burst (prod), 1000 RPS (dev)

Files: crates/spectre-proxy/src/main.rs, Cargo.toml (tower-governor)


ADR-006: Three-Tier RBAC Hierarchy (admin > service > readonly)

Status: ✅ Accepted & Implemented | Classification: Major

Decision: Enforce role-based access control with three levels: admin, service, readonly.

Hierarchy:

admin    → Full access (all endpoints)
service  → Write access (ingest, proxy, events)
readonly → Read-only (health, metrics, status)

Rationale:
  • Principle of Least Privilege: Services only get necessary permissions
  • Defense in Depth: JWT validation + role verification
  • Auditability: Role logged in every request

Implementation:

#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
enum Role { Readonly, Service, Admin }

fn required_role_for_path(path: &str) -> Role {
    match path {
        p if p.starts_with("/admin") => Role::Admin,
        p if p.starts_with("/ingest") || p.starts_with("/proxy") => Role::Service,
        _ => Role::Readonly,
    }
}
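The derived Ord is what makes the hierarchy enforceable with a single comparison (Readonly < Service < Admin follows the variant order). A hedged usage sketch building on the enum and helper above; the actual auth middleware in main.rs returns a 403 with a structured error instead of a bool:

// Illustrative check; higher roles satisfy lower requirements, e.g. Admin passes a Service-level path.
fn is_authorized(user_role: Role, path: &str) -> bool {
    user_role >= required_role_for_path(path)
}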

Files: crates/spectre-proxy/src/main.rs (auth middleware)


ADR-007: Prometheus + OTLP Observability Stack

Status: ✅ Accepted & Implemented | Classification: Major

Decision: Use Prometheus for metrics and OTLP for distributed tracing.

Components:
  • Prometheus: Scrapes /metrics endpoint for custom metrics
  • OTLP Exporter: Sends traces to Tempo/Jaeger
  • Custom Metrics:
    • spectre_proxy_requests_total (counter)
    • spectre_proxy_request_duration_seconds (histogram)
    • spectre_events_published_total (counter)
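A minimal sketch of how such metrics are typically registered and exposed, assuming the prometheus crate; the help strings and function names are illustrative, and the real definitions live in crates/spectre-observability/src/metrics.rs:

// Illustrative registration and text exposition; not the crate's exact module layout.
use prometheus::{register_histogram, register_int_counter, Encoder, Histogram, IntCounter, TextEncoder};

fn register_proxy_metrics() -> prometheus::Result<(IntCounter, Histogram)> {
    let requests = register_int_counter!(
        "spectre_proxy_requests_total",
        "Total requests handled by the proxy"
    )?;
    let latency = register_histogram!(
        "spectre_proxy_request_duration_seconds",
        "Request latency in seconds"
    )?;
    Ok((requests, latency))
}

// Body of the /metrics handler: encode the default registry in Prometheus text format.
fn render_metrics() -> Vec<u8> {
    let mut buf = Vec::new();
    TextEncoder::new()
        .encode(&prometheus::gather(), &mut buf)
        .expect("encoding to an in-memory buffer should not fail");
    buf
}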

Rationale:
  • Standard: OpenTelemetry is industry standard
  • Vendor Neutral: Can switch backends easily
  • Comprehensive: Metrics + Traces + Logs
  • Configurable Sampling: 10% default, 100% in dev

Implementation:
  • crates/spectre-observability/src/lib.rs: OTLP setup
  • crates/spectre-observability/src/metrics.rs: Prometheus metrics
  • prometheus.yml: Scrape configuration
  • Configurable via OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_TRACES_SAMPLER_ARG

Fixes Applied:
  • Removed .unwrap() calls that could panic
  • Separated json/pretty formatter branches for type safety
  • Added error handling for tracer initialization


ADR-008: Graceful Shutdown with SIGTERM/SIGINT Handling

Status: ✅ Accepted & Implemented | Classification: Major

Decision: Implement graceful shutdown to prevent dropped requests during rolling updates.

Context:
  • Kubernetes sends SIGTERM before killing pods
  • Need to drain in-flight requests before exit
  • Observability shutdown must happen cleanly

Implementation:

// crates/spectre-core/src/shutdown.rs
use tokio::signal::{self, unix::SignalKind};
use tracing::info;

pub async fn shutdown_signal() {
    let ctrl_c = signal::ctrl_c();
    let mut terminate = signal::unix::signal(SignalKind::terminate())
        .expect("failed to install SIGTERM handler");

    tokio::select! {
        _ = ctrl_c => info!("Received SIGINT"),
        _ = terminate.recv() => info!("Received SIGTERM"),
    }
}

// In proxy main.rs
axum::serve(listener, app)
    .with_graceful_shutdown(shutdown_signal())
    .await?;

Benefits:
  • Zero-downtime deployments
  • Proper cleanup of connections
  • Observability flush before exit

Files:
  • crates/spectre-core/src/shutdown.rs (new)
  • crates/spectre-proxy/src/main.rs (integration)


🔧 Implementation Decisions

Shared HTTP Client with Connection Pooling

Status: ✅ Implemented | Impact: Performance, Resource Efficiency

Decision: Create single reqwest::Client in AppState instead of per-request allocation.

Before:

async fn proxy_handler() -> Result<Response, StatusCode> {
    let client = reqwest::Client::new(); // ❌ New client per request!
    let upstream_url = "http://localhost:8000"; // ❌ Hardcoded!
    // ... request forwarded with the throwaway client
}

After:

struct AppState {
    http_client: reqwest::Client,  // ✅ Shared, pooled
    neutron_url: String,           // ✅ Configurable
}

// Client configured with timeouts and connection limits
reqwest::Client::builder()
    .timeout(Duration::from_secs(30))
    .connect_timeout(Duration::from_secs(5))
    .pool_max_idle_per_host(20)
    .build()?
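A hedged sketch of a handler reusing the pooled client via axum's State extractor; the field names mirror the AppState above, while the upstream path handling and error mapping are simplified assumptions (the real handler forwards method, path, headers, and body):

// Illustrative handler; not the proxy's actual forwarding logic.
use axum::{extract::State, http::StatusCode};
use std::sync::Arc;

async fn proxy_handler(State(state): State<Arc<AppState>>) -> Result<String, StatusCode> {
    state
        .http_client
        .get(state.neutron_url.as_str()) // shared, pooled client: no per-request allocation
        .send()
        .await
        .map_err(|_| StatusCode::BAD_GATEWAY)?
        .text()
        .await
        .map_err(|_| StatusCode::BAD_GATEWAY)
}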

Health Check Endpoints Pattern

Status: ✅ Implemented | Impact: Kubernetes Integration

Endpoints:
  • /health → Liveness probe (process alive)
  • /ready → Readiness probe (NATS connected, upstream reachable)
  • /metrics → Prometheus scraping

Implementation:
  • All bypass authentication middleware
  • /ready returns 503 if dependencies unavailable
  • Used in K8s deployment probes:
    • livenessProbe: /health (10s interval)
    • readinessProbe: /ready (5s interval)
    • startupProbe: /health (2s interval, 30 retries)
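A minimal sketch of the /ready behavior described above; the dependency checks are passed in as plain booleans here because the real probing helpers are not shown in this document:

// Illustrative readiness decision; real checks live in the proxy's application state.
use axum::http::StatusCode;

async fn ready(nats_connected: bool, upstream_reachable: bool) -> StatusCode {
    if nats_connected && upstream_reachable {
        StatusCode::OK
    } else {
        // Kubernetes keeps the pod out of Service endpoints until this returns 200.
        StatusCode::SERVICE_UNAVAILABLE
    }
}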


Structured Error Responses

Status: ✅ Implemented | Impact: API Consistency

Decision: Return JSON error responses instead of bare status codes.

Format:

{
  "error": "Forbidden",
  "message": "Insufficient permissions: requires service role",
  "status": 403
}

Implementation:

use axum::{http::StatusCode, response::{IntoResponse, Response}, Json};
use serde_json::json;

struct ApiError {
    status: StatusCode,
    message: String,
}

impl IntoResponse for ApiError {
    fn into_response(self) -> Response {
        (self.status, Json(json!({
            "error": self.status.canonical_reason().unwrap_or("Error"),
            "message": self.message,
            "status": self.status.as_u16()
        }))).into_response()
    }
}

📊 Summary

By Phase

  • Phase 1 (Complete): Event infrastructure, secret management foundations
  • Phase 2 (Complete): Security hardening, observability, Kubernetes deployment
  • Phase 3 (In Progress): Production operationalization, testing, optimization

By Classification

  • Critical Decisions: 4 (Nix K8s, Argon2id, Ingress, NATS)
  • Major Decisions: 4 (Rate limiting, RBAC, Observability, Graceful shutdown)
  • Implementation Decisions: 3 (HTTP pooling, Health checks, Error responses)

Status

  • Accepted & Implemented: 11 decisions
  • 🔄 In Progress: None (aside from the ADR infrastructure itself)
  • Rejected: Service mesh, LoadBalancer, XOR KDF

🔗 References

  • ADR System: /home/kernelcore/master/adr-ledger/
  • Primary ADR: ADR-0037 (Nix-First Kubernetes)
  • Source Code: /home/kernelcore/master/spectre/
  • Documentation: KUBERNETES.md, HELM_CHART_SUMMARY.md, IMPLEMENTATION_REPORT.md
  • Git Commits: 10 commits on 2026-02-15 covering all implementations

This document consolidates architectural decisions made during SPECTRE Phase 2 development. For formal ADR tracking, see the adr-ledger repository.