
From Bash Script to AI-Assisted Monitoring

John Timmer

It started as a simple bash script. Every night, it ran through a list of hostnames and triggered rsync to create backups. It worked, until backups grew too large and too slow. Snapshots across the wire weren’t any better. What began as a nightly helper slowly grew into a central management platform that today handles backups, monitoring, reporting, and even AI-assisted analysis.

Growing Beyond Backups

The original goal was clear: make backups, avoid vendor lock-in, and keep it simple. Over the years, that “script with vague ideas” became Backup Planner, a web-based system that maps every device in our infrastructure, from VPS nodes to switches, VLANs, and networks across datacenters.

When a new VPS is added, Zabbix discovers it. From there, Backup Planner integrates the server into its own database. Administrators can decide whether the VPS participates in backups, monitoring, or both. If yes, a lightweight SysDev Agent is installed: a POSIX-compliant shell script triggered every minute by cron. The agent asks Backup Planner what checks to run, executes them, and sends the results back. Metrics flow into InfluxDB for high-speed time-series analysis, while logs are shipped via rsyslog to Graylog for indexing and correlation.
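To make that concrete, here is a minimal sketch of the agent’s one-minute loop, assuming a hypothetical REST endpoint and a pipe-delimited check list; the real agent’s protocol differs in the details:

    #!/bin/sh
    # Minimal SysDev-Agent-style loop: fetch assigned checks, run them,
    # report results. Endpoint paths, the AGENT_TOKEN variable, and the
    # pipe-delimited check format are illustrative assumptions.
    PLANNER="https://planner.example.com/api"
    TOKEN="${AGENT_TOKEN:?set AGENT_TOKEN}"
    HOST="$(hostname)"

    # Ask Backup Planner which checks this host should run right now.
    checks=$(curl -fsS -H "Authorization: Bearer $TOKEN" \
      "$PLANNER/agent/$HOST/checks") || exit 1

    # Run each check command and post its output back.
    echo "$checks" | while IFS='|' read -r check_id cmd; do
      result=$(sh -c "$cmd" 2>&1)
      curl -fsS -H "Authorization: Bearer $TOKEN" \
        --data-urlencode "check=$check_id" \
        --data-urlencode "result=$result" \
        "$PLANNER/agent/$HOST/results"
    done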

Beyond Metrics: Checks, Questions, and Thresholds

In Backup Planner, monitoring revolves around questions: “What is the load average?” “How much traffic passed through eth0?” Each check produces fields validated against thresholds. If values drift outside bounds, they are flagged and logged to Influx. From there, webhooks can trigger downstream automations.

A simple example: outbound traffic exceeding 1 Gbit/s for more than three minutes triggers a webhook. That webhook can log the event or hand it to something smarter.
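In shell, such a check stays small. A minimal sketch, run once per minute by cron, assuming a Linux interface counter under /sys and a hypothetical n8n webhook URL:

    #!/bin/sh
    # Sustained-traffic check: fire a webhook after three consecutive
    # over-limit minutes. Webhook URL and state file are hypothetical.
    IFACE="eth0"
    LIMIT=125000000      # 1 Gbit/s expressed in bytes per second
    STATE="/var/tmp/${IFACE}_over_count"
    HOOK="https://n8n.example.com/webhook/traffic-alert"

    # Sample the TX byte counter twice, one second apart, to get a rate.
    tx1=$(cat "/sys/class/net/$IFACE/statistics/tx_bytes")
    sleep 1
    tx2=$(cat "/sys/class/net/$IFACE/statistics/tx_bytes")
    rate=$((tx2 - tx1))

    if [ "$rate" -gt "$LIMIT" ]; then
      count=$(( $(cat "$STATE" 2>/dev/null || echo 0) + 1 ))
      echo "$count" > "$STATE"
      if [ "$count" -ge 3 ]; then
        curl -fsS -H 'Content-Type: application/json' \
          -d "{\"host\":\"$(hostname)\",\"iface\":\"$IFACE\",\"rate_bps\":$((rate * 8))}" \
          "$HOOK"
      fi
    else
      echo 0 > "$STATE"
    fi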

Enter AI: Context Before Action

This is where things become exciting. A threshold breach isn’t always meaningful on its own. Is the spike due to normal workload? Bot traffic? A runaway backup? Instead of alerting blindly, we route these events through n8n flows that call an AI model running on our Jetson AGX Orin, an NVIDIA AI development platform; with n8n you could just as easily route it to an external provider such as OpenAI. We choose to run everything on our own hardware and keep data within our company.

The AI digs into correlated data: Influx metrics, Graylog logs, recent backup jobs, even CrowdSec alerts. It then produces a human-readable analysis:

  • What happened,
  • Why it likely happened,
  • What should be checked next.
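Under the hood, that analysis step can be a single HTTP call from the n8n flow to the local model. A minimal sketch, assuming the Jetson exposes an OpenAI-compatible chat endpoint; the host, port, and model name are placeholders:

    #!/bin/sh
    # Send the correlated evidence to the local model and ask for a
    # three-part analysis. EVIDENCE is assumed to be a JSON-encoded
    # string (metrics, log excerpts, recent backup jobs).
    EVIDENCE=$(cat /tmp/incident-evidence.json)
    PROMPT="You are an infrastructure analyst. State what happened, why it likely happened, and what should be checked next."

    # Build the request body and post it to the chat completions API.
    printf '{"model":"local-llm","messages":[{"role":"system","content":"%s"},{"role":"user","content":%s}]}' \
      "$PROMPT" "$EVIDENCE" |
    curl -fsS -H 'Content-Type: application/json' -d @- \
      http://jetson.local:8000/v1/chat/completions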

That analysis doesn’t just arrive as a text message. It’s pushed to my phone and even spoken aloud by Home Assistant at home:

“Admin? Server webhosting 1 triggered a threshold alert. Outgoing traffic on interface 0 exceeded 86% of capacity. Suggested action: investigate bot traffic. Link: [Backup Planner Portal].”
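That last hop is one service call against Home Assistant’s REST API. A minimal sketch, assuming a long-lived access token, the Google Translate TTS integration, and a media player entity named media_player.living_room:

    #!/bin/sh
    # Have Home Assistant speak the AI's summary on a media player.
    # The host, entity name, and TTS integration are assumptions.
    HA="http://homeassistant.local:8123"
    TOKEN="${HA_TOKEN:?set HA_TOKEN to a long-lived access token}"
    MSG="Admin? Server webhosting 1 triggered a threshold alert. Suggested action: investigate bot traffic."

    curl -fsS -X POST \
      -H "Authorization: Bearer $TOKEN" \
      -H 'Content-Type: application/json' \
      -d "{\"entity_id\": \"media_player.living_room\", \"message\": \"$MSG\"}" \
      "$HA/api/services/tts/google_translate_say"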

Suddenly, alerts aren’t noise but contextual stories.

Dynamic Reporting: AI Chooses Its Own Evidence

One of the most powerful aspects of Backup Planner is its report block system. Any piece of data in the database can be exposed through a reusable block: CPU load, network traffic, pending updates, backup history, CrowdSec alerts, and more.

Normally, administrators assemble these blocks into scheduled reports. But in our AI-assisted workflows, it works the other way around:

  • When an incident is detected, the AI decides what information it needs to understand the situation.
  • It dynamically queries the relevant report blocks, or even builds new ones on the fly.
  • The result is a custom dossier, a tailored report containing exactly the data needed to explain the incident.

This means every alert can come with its own context pack, automatically generated. Instead of a static list of metrics, you get a living document that reflects how the AI reasons about the issue.
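Mechanically, that dossier can be assembled by fetching each chosen block over HTTP and concatenating the results. The endpoint, block names, and query parameters in this sketch are hypothetical:

    #!/bin/sh
    # Assemble an incident dossier from report blocks the AI selected.
    # The /report-blocks endpoint and block names are hypothetical.
    PLANNER="https://planner.example.com/api"
    TOKEN="${PLANNER_TOKEN:?set PLANNER_TOKEN}"
    HOST="webhosting1"

    # In the real flow the AI chooses this list; here it is hard-coded.
    for block in network_traffic backup_history crowdsec_alerts; do
      curl -fsS -H "Authorization: Bearer $TOKEN" \
        "$PLANNER/report-blocks/$block?host=$HOST&window=24h"
    done > /tmp/incident-dossier.txt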

Solo Builder’s Journey: Complex Systems, Simple Steps

This whole platform didn’t appear overnight. It was built the only way truly complex systems ever get built by one person: in tiny, relentless steps. A cron job here, a bash helper there, one API endpoint, one template block, one task in Influx, one n8n node, and repeat. The result isn’t just a tool; it’s an ecosystem that grew with every insight, mistake, and small win.

What made it possible

  • Start embarrassingly simple. A bash script that worked beat any grand design that never shipped.
  • Ship, then shape. Turn the rough edge into a reusable block (agent check → field → template → report).
  • Automate only what hurts. Manual once, scripted twice, automated forever.
  • Guardrails before power. Dry-runs, allowlists, TTLs, audit logs: speed, but with a seatbelt.
  • Observe everything. Metrics to Influx, logs to Graylog, actions to a central log, no ghosts in the machine.
  • AI as a teammate, not a deity. Let it gather evidence and propose a plan; humans approve the first iterations.
  • Design for change. Everything is a block: checks, reports, actions. Swap a block, not the whole system.

Working alone forces clarity. Every feature must pay rent: it must shorten feedback loops, reduce risk, or increase leverage. That constraint is a gift. It’s how a simple nightly rsync grew into a platform that can detect, explain, and (carefully) act without losing the plot.

Takeaway: Big capabilities don’t require big teams; they require small, compounding decisions and a workflow that makes the next step easy.

“If you can make the next step obvious and safe, you can build almost anything solo.”

Looking Ahead: Towards Self-Healing Infrastructure

What we have now is a system that observes, explains, and proposes. The next step is letting it act safely. Playbooks for common incidents, time-limited mitigations, and guardrails that guarantee reversibility will allow the platform to not only detect problems, but also resolve them in real time.

This isn’t about replacing administrators. It’s about raising the baseline: fewer false alarms, more context, and faster recovery. Humans remain in control, but the system does the heavy lifting.

From here, the roadmap is clear:

  • Safer automation: wrappers and policies that keep AI actions predictable.
  • More intelligence: richer correlations between logs, metrics, and history.
  • Better storytelling: alerts that explain themselves like incident reports.
  • Continuous learning: every solved issue makes the platform stronger.

It’s a long way from a bash script that copied files at night. But step by step, we’re moving towards something powerful: infrastructure that can watch itself, explain itself, and heal itself.

Closing the Loop (Safely)

The natural next step is action. The AI already proposes mitigations: rate-limiting a NIC, blocking an IP via CrowdSec, or rescheduling a backup. For now, these remain dry-runs: the system drafts an action plan, logs it, and prepares it for morning review. It’s safer this way, and it keeps us in control.

But the framework is in place: an action wrapper that enforces guardrails, a policy engine that defines what actions are safe, and audit logs that record every intent. Moving from “observe” to “act” is no longer a dream; it’s just a matter of confidence.
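To illustrate the shape of that framework, here is a minimal sketch of such a wrapper: allowlist first, audit log always, dry-run by default. The paths and allowlist format are hypothetical:

    #!/bin/sh
    # Action wrapper: every proposed action passes an allowlist, is
    # logged as an intent, and stays a dry-run unless explicitly armed.
    ACTION="$1"; TARGET="$2"
    ALLOWLIST="/etc/backup-planner/allowed-actions"
    AUDIT="/var/log/backup-planner/actions.log"
    DRY_RUN="${DRY_RUN:-1}"   # dry-run unless explicitly disabled

    # Guardrail 1: only allowlisted actions may ever run.
    if ! grep -qx "$ACTION" "$ALLOWLIST"; then
      echo "$(date -u +%FT%TZ) DENY $ACTION $TARGET" >> "$AUDIT"
      exit 1
    fi

    # Guardrail 2: every intent is recorded before anything happens.
    echo "$(date -u +%FT%TZ) INTENT $ACTION $TARGET dry_run=$DRY_RUN" >> "$AUDIT"

    # Guardrail 3: the real command only runs when dry-run is disabled.
    if [ "$DRY_RUN" = "1" ]; then
      echo "DRY-RUN: would execute '$ACTION' on '$TARGET'"
    else
      sh "/etc/backup-planner/actions/$ACTION.sh" "$TARGET"
    fi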

What Backup Planner Can Do Today

While the story often focuses on monitoring and AI, Backup Planner is a full-featured management platform. Out of the box it provides:

  • Backups & Restore
    • Tier-1 rsync backups with scheduling and smart load balancing.
    • Tier-2 ZFS snapshots with quota enforcement and retention policies.
    • A* Pathfinding engine to always use the fastest network route for backups and restores.
  • Monitoring & Metrics
    • Lightweight SysDev Agent for POSIX systems.
    • Checks and groups for CPU, load, traffic, updates, and custom scripts. And we mean as custom as you want.
    • Time-series storage in InfluxDB, dashboards in Grafana.
  • Reporting & Automation
    • Report blocks: modular SQL+template units for mail reports or AI dossiers.
    • Automated mailings (hourly, daily, weekly, monthly) with RBAC scopes.
    • Webhooks that trigger flows in n8n, AI pipelines, or external systems.
  • Configuration & Control
    • Central device management: IPs, NICs, VLANs, paths, service plugins.
    • Role-based access control with scopes per customer or VPS.
    • Traffic shaping rules per NIC with activation/expiry controls.
    • CrowdSec integration to share bans across servers.

In short, Backup Planner is one system for backup, monitoring, reporting, and automation, flexible enough for daily operations yet open enough to integrate AI-driven workflows.

Why This Matters

What began as a simple rsync script has evolved into a self-aware platform:

  • Tier-1 rsync and Tier-2 ZFS snapshots provide reliable backups.
  • Centralized monitoring and reporting give full visibility.
  • AI-assisted analysis turns raw alerts into actionable insights.
  • Home automation bridges the gap from datacenter to living room.

It’s not just about backups anymore. It’s about building infrastructure that can explain itself, learn from its own history, and eventually heal itself.

And the best part? We built it step by step, one bash script, one cron job, one agent, one webhook at a time.
