Where to find it
In the dashboard, open your app and click Health in the sidebar. The page polls every 10 seconds while open and reads directly from the agent's metrics SQLite file; there is no extra round-trip to PostgreSQL.

The three signals that matter
Ingest rate
Payloads accepted per second over the last minute. On a single instance a healthy agent ingests up to ~13,400 payloads/s.
Drain rate
Rows written to PostgreSQL per second. Drain should keep pace with ingest; a persistent gap means the buffer is filling.
Buffer depth
Pending rows sitting in the SQLite WAL buffer. Grows when drain lags ingest; the agent rejects new payloads once it hits 100,000.
Reading the charts
- Ingest vs. drain — the two lines should track each other. If drain flattens while ingest keeps climbing, PostgreSQL is the bottleneck (check the postgresql-sizing guide).
- Buffer depth — a steady sawtooth is normal: ingest fills, drain empties in 5,000-row batches. A monotonically rising line is the early warning for back-pressure.
- Drain batch latency — time to COPY one batch. Sustained values above a second usually mean PostgreSQL is IO-bound or connection-starved.
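To tell the healthy sawtooth apart from the back-pressure pattern programmatically, one approach is to check whether buffer-depth samples ever drop back down across a window. A rough sketch under that assumption (the sampling window and helper are illustrative, not agent behaviour):

```python
def looks_like_backpressure(depth_samples: list[int]) -> bool:
    """True when buffer depth rises monotonically across the window.

    A healthy agent shows a sawtooth: depth falls each time a
    5,000-row drain batch completes, so strictly increasing samples
    across a whole window are an early warning sign.
    """
    if len(depth_samples) < 2:
        return False  # not enough data to judge a trend
    return all(b > a for a, b in zip(depth_samples, depth_samples[1:]))

# sawtooth (a batch drained mid-window) -> healthy
# monotonic climb -> back-pressure
```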
Status banners
Agent offline
The health file hasn't been touched in the last 30 seconds. The agent process likely crashed or was never started. Check supervisord / systemd logs for the nightowl:agent worker.
Buffer near capacity
Buffer depth is above 80,000 rows. Ingest will start rejecting payloads at 100,000. Scale drain workers (see NIGHTOWL_DRAIN_WORKERS) or add a PostgreSQL replica.
Drain stalled
Drain rate has been zero for longer than one minute while ingest is non-zero. Almost always a PostgreSQL connection issue — verify credentials, PgBouncer, and max_connections.
No recent payloads
Ingest rate has been zero for 5+ minutes. Your application may not be sending. Confirm NIGHTWATCH_TOKEN is set to the token from the NightOwl dashboard and that the Nightwatch package is installed and booted.

What to do when it's unhealthy
- Persistent drain lag → increase drain workers: NIGHTOWL_DRAIN_WORKERS=4 (one worker per PostgreSQL core is a reasonable ceiling).
- Connection churn → put PgBouncer in front of PostgreSQL. The bundled docker-compose.yml ships a ready configuration on port 6432.
- Single-instance ceiling → run multiple agent instances behind SO_REUSEPORT. See Multiple Instances.
- PostgreSQL saturation → check pg_stat_activity, tune synchronous_commit = off, or upgrade disk. See PostgreSQL sizing.
Raw metrics for external monitoring
The agent also writes a JSON metrics file at storage/nightowl/agent-buffer.sqlite.drain-metrics.json. Point Prometheus, Datadog, or a cron-scraped shell script at it if you want alerts that live outside the dashboard.
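A cron-scraped check might read that file and re-apply the banner thresholds from this page. A minimal sketch; the JSON field names (updated_at, buffer_depth, drain_rate, ingest_rate) are assumptions — inspect the file on your install for the actual schema:

```python
import json
import time

# Assumed path from this page; adjust to your install.
METRICS = "storage/nightowl/agent-buffer.sqlite.drain-metrics.json"

def check(path: str = METRICS) -> list[str]:
    """Return alert strings; empty list when everything looks healthy.

    Field names below are assumptions about the metrics file schema.
    """
    with open(path) as f:
        m = json.load(f)
    alerts = []
    if time.time() - m["updated_at"] > 30:             # agent offline
        alerts.append("agent offline: metrics file stale")
    if m["buffer_depth"] > 80_000:                     # buffer near capacity
        alerts.append(f"buffer near capacity: {m['buffer_depth']} rows")
    if m["drain_rate"] == 0 and m["ingest_rate"] > 0:  # drain stalled
        alerts.append("drain stalled: check PostgreSQL connections")
    return alerts
```

Exit non-zero (or page) when the returned list is non-empty, and the same thresholds the dashboard banners use now fire from outside it.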