Observability and SLOs

What Is Observability?

Observability means being able to understand what a system is doing from the outside, without having to guess. For a security platform, three signals matter most:

  • Logs — what happened and when
  • Metrics — how fast, how often, how many
  • Alerts — when something breaks an expectation

EvilTwin uses structured JSON logs (easily ingested by Splunk or any log platform), application-level metrics, and defined SLOs (Service Level Objectives) to make operational health transparent.


Golden Signals

The four golden signals cover the health of any service:

| Signal | Metric | Tool |
|---|---|---|
| Latency | API p95 response time per endpoint | Application logs (timing middleware) |
| Traffic | Requests/s to /log, /sessions, /ai/analyze | Log aggregation |
| Errors | 4xx/5xx rate per endpoint | Log aggregation |
| Saturation | Database connection pool utilisation, queue depth in alert manager | Application state |
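
The latency row assumes timing middleware in the API layer. A minimal sketch of what such middleware could look like in FastAPI follows; the logger name and the exact log fields are illustrative assumptions, not the production implementation:

import json
import logging
import time

from fastapi import FastAPI, Request

# Illustrative logger name; the real middleware may live elsewhere
logger = logging.getLogger("backend.middleware.timing")
app = FastAPI()

@app.middleware("http")
async def log_request_timing(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    duration_ms = round((time.perf_counter() - start) * 1000)
    # Emit one structured record per request: endpoint, status, latency
    logger.info(json.dumps({
        "message": "Request completed",
        "path": request.url.path,
        "status_code": response.status_code,
        "duration_ms": duration_ms,
    }))
    return response

Aggregating duration_ms per path in the log platform then yields the p95 latency and request-rate figures referenced above.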

Service Level Objectives (SLOs)

An SLO is a target that defines "acceptable" service. Breaching it means something needs attention.

| SLO | Target | Measurement Window |
|---|---|---|
| Ingest availability | ≥ 99.5% of POST /log requests succeed | 7 days |
| Ingest latency | p95 < 500 ms for POST /log | 1 day |
| Session query latency | p95 < 1 s for GET /sessions | 1 day |
| Dashboard stats latency | p95 < 800 ms for GET /dashboard/stats | 1 day |
| Auth latency | p95 < 300 ms for POST /auth/login | 1 day |
| Alert delivery delay | p95 WebSocket delivery < 2 s from event ingestion | 1 day |
| AI analysis latency | p95 < 10 s for POST /ai/analyze (LLM dependent) | 1 day |
| AI availability | GET /ai/status returns available: true ≥ 95% of time | 7 days |
Note: The AI endpoints have a deliberately relaxed SLO (10 s, 95% availability) because they depend on an external LLM service. When the LLM is down, the platform degrades gracefully — ML-based scoring continues, but narrative triage is unavailable.
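
These targets can be checked directly from the structured logs described later on this page. A hedged sketch for the ingest availability SLO, assuming one JSON record per line with path and status_code fields, that the saved file covers the measurement window, and that only 5xx responses count as failures:

import json

TARGET = 0.995  # ≥ 99.5% of POST /log requests succeed over 7 days

def ingest_availability(log_path: str) -> float:
    total = succeeded = 0
    with open(log_path) as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("path") == "/log":
                total += 1
                if record.get("status_code", 500) < 500:
                    succeeded += 1
    # Treat an empty window as "no breach"
    return succeeded / total if total else 1.0

ratio = ingest_availability("backend_logs.jsonl")
print(f"Ingest availability: {ratio:.3%} (SLO met: {ratio >= TARGET})")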


Logging Practices

Log Format

All backend log entries are structured JSON:

{
  "timestamp": "2024-01-15T10:22:33.456Z",
  "level": "INFO",
  "logger": "backend.routers.ingest",
  "request_id": "a1b2c3d4",
  "message": "Event ingested",
  "source_ip": "203.0.113.7",
  "sensor_id": "cowrie-1",
  "event_type": "login_attempt",
  "session_id": "uuid-...",
  "threat_score": 0.42,
  "threat_level": "medium",
  "duration_ms": 87
}
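
A hedged sketch of a formatter that would produce records in this shape; the actual EvilTwin logging setup may differ, and the fields key used to carry per-event context is an assumption:

import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc)
                .isoformat(timespec="milliseconds")
                .replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Per-event context (request_id, source_ip, threat_score, ...) is
        # expected as a single dict passed via extra={"fields": {...}}
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("backend").addHandler(handler)

Carrying all per-event context in one extra dict keeps the formatter generic while preserving the flat key layout shown above.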

Key Log Events to Monitor

| Event | Logger | What It Signals |
|---|---|---|
| Event ingested | backend.routers.ingest | Normal operation — volume metric |
| Score computed | backend.services.threat_scorer | Scoring pipeline healthy |
| Alert queued | backend.services.alert_manager | High/Critical event detected |
| WebSocket broadcast | backend.routers.alerts | Alert delivered to analysts |
| Canary token triggered | backend.routers.canary | High-confidence IoC |
| Auth login success | backend.routers.auth | Analyst logged in |
| Auth login failed | backend.routers.auth | Failed login — potential brute force |
| Token refresh | backend.routers.auth | Normal session extension |
| LLM analysis complete | backend.services.llm_service | AI triage cycle |
| LLM unavailable | backend.services.llm_service | LLM down — graceful degradation |
| DB connection pool exhausted | SQLAlchemy | Saturation — scale database pool |

Log Levels

| Level | Used for |
|---|---|
| DEBUG | Verbose per-request details (disabled in production) |
| INFO | Normal operation milestones (ingest success, score computed, auth) |
| WARNING | Unusual but non-fatal conditions (LLM degraded mode, slow query) |
| ERROR | Handled failures (DB timeout, LLM API error) |
| CRITICAL | Unhandled exceptions, service startup failures |

Set LOG_LEVEL=INFO in production. Set LOG_LEVEL=DEBUG locally for troubleshooting.
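
A minimal sketch of wiring that environment variable into the logger configuration (the exact hook in the codebase is an assumption):

import logging
import os

# Fall back to INFO if LOG_LEVEL is unset or misspelled
level_name = os.getenv("LOG_LEVEL", "INFO").upper()
logging.getLogger("backend").setLevel(getattr(logging, level_name, logging.INFO))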


Alerting Model

Operational alerts (separate from threat alerts sent to analysts) notify operators when the platform itself is degrading:

| Alert | Condition | Severity |
|---|---|---|
| Ingest error surge | Error rate on POST /log > 5% for 5 minutes | High |
| Ingest latency breach | p95 > 1 s for 5 consecutive minutes | Medium |
| Database unreachable | /health returns database: false | Critical |
| LLM service down | GET /ai/status returns available: false for 15+ minutes | Low |
| Auth failure spike | > 20 failed logins in 60 seconds | Medium (possible brute force against platform) |
| Alert queue overflow | alert_manager.queue_depth > 1000 | High |
| High memory usage | Container memory > 80% of limit | Medium |
Warning: Auth failure spike alerts are operationally important. If attackers discover the API, they may try to brute-force platform credentials. The platform itself is a target.
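
As an illustration of the auth failure spike condition, a hedged sketch of a sliding-window check (threshold and window taken from the table above; the class itself is illustrative, not the platform's alert_manager implementation):

import time
from collections import deque

class AuthFailureSpikeDetector:
    """Fires when more than `threshold` failed logins land within the window."""

    def __init__(self, threshold: int = 20, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._failures: deque[float] = deque()

    def record_failure(self, now: float | None = None) -> bool:
        now = time.monotonic() if now is None else now
        self._failures.append(now)
        # Evict failures that have aged out of the 60-second window
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        return len(self._failures) > self.threshold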


Dashboard Views

Splunk Dashboards (splunk/dashboards/)

The production Splunk integration provides:

| Dashboard | What It Shows |
|---|---|
| Ingest Rate | Events/minute over time, broken down by sensor_id |
| Threat Level Distribution | Pie chart of Low/Medium/High/Critical sessions |
| Top Attackers | Top 20 source IPs by event count |
| Alert Timeline | High/Critical alerts over 24h |
| Auth Activity | Login success/failure rate, token refresh activity |
| LLM Usage | analyze_session call rate, average response latency |
| Error Rate | 4xx/5xx rate per endpoint |

SOC Dashboard (Frontend)

The built-in dashboard (/dashboard) shows:

  • 24-hour attack summary (total events, unique attackers, active sessions)
  • Threat level timeline (sparkline per hour)
  • Top attackers table
  • Live alert feed (WebSocket)

AI and Enrichment Metrics

| Metric | What It Measures |
|---|---|
| ai.analyze_session.count | Total calls to POST /ai/analyze |
| ai.analyze_session.latency_ms | End-to-end LLM round-trip time |
| ai.chat.count | Total chat turns on POST /ai/chat |
| ai.unavailable.count | Times LLMService returned available: false |
| scoring.ml.count | Times ML scorer ran (vs degraded mode) |
| scoring.degraded.count | Times rule-based fallback was used |

Track ai.unavailable.count alongside ai.analyze_session.count to understand what fraction of triage attempts succeeded.
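
For example, with hypothetical counter values over a one-day window:

# Hypothetical counter values; substitute the real readings for the window
analyze_count = 480      # ai.analyze_session.count
unavailable_count = 12   # ai.unavailable.count

success_fraction = (analyze_count - unavailable_count) / analyze_count
print(f"AI triage success rate: {success_fraction:.1%}")  # 97.5%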


Weekly Health Review Checklist

Run this review every Monday before the week begins:

# Check for any error log spikes in the last 7 days
# (adapt to your log platform — example uses jq on saved log files)
cat backend_logs.jsonl | jq -r 'select(.level=="ERROR") | .message' | sort | uniq -c | sort -rn

# Check database health
curl -s http://localhost:8000/health | jq .

# Check AI status
curl -s http://localhost:8000/ai/status \
-H "Authorization: Bearer $ADMIN_TOKEN" | jq .

# Check alert queue depth (if metrics endpoint is enabled)
curl -s http://localhost:8000/health | jq '.alert_queue_depth'
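
The ingest latency question in the review below can be answered from the same saved logs. A hedged Python sketch using the duration_ms field of "Event ingested" records (field names assumed from the log format above):

import json

def p95_ingest_latency(log_path: str = "backend_logs.jsonl") -> float:
    durations = []
    with open(log_path) as fh:
        for line in fh:
            record = json.loads(line)
            if record.get("message") == "Event ingested" and "duration_ms" in record:
                durations.append(record["duration_ms"])
    if not durations:
        return 0.0
    durations.sort()
    # Nearest-rank p95: the value at the 95th-percentile position
    return durations[max(0, int(len(durations) * 0.95) - 1)]

print(f"p95 ingest latency over the saved window: {p95_ingest_latency()} ms")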

Review the following:

  • Average ingest latency — has it trended upward?
  • Error rate — any new error types in the past week?
  • Auth failure rate — any spikes suggesting brute-force attempts against the platform?
  • LLM availability — what fraction of AI triage requests succeeded?
  • Database connection pool hits — is the pool sized appropriately for the current event volume?