Upgrading a Local LLM Server: New Hardware, a Monitoring Stack, and the Hermes AI Harness

TL;DR: A CPU swap to the Ryzen 7 5700X and a GPU upgrade to the RX 9060 XT finally give Shockwave the hardware floor it needs to run local LLMs with manageable latency. Layering in Prometheus, Grafana, and Loki makes the whole stack observable. And the Hermes AI harness from Nous Research adds a new kind of interaction layer: a persistent, learning assistant wired directly to the network and accessible through Discord from anywhere.

There has been a lot of movement on the Shockwave project since the last update. The server has been running, automating, and generally doing its job, but the hardware was still the weakest link in the chain. Running multiple containers with any degree of AI workload on an RX 580 and a first-generation Ryzen 5 was possible, but not comfortable. This post covers the upgrades that changed that, the monitoring stack that now gives full visibility into what the server is doing, and the addition of a tool that fundamentally changes how I interact with it.

What Hardware Upgrades Does a Local LLM Server Actually Need?

The two components that constrain a local LLM server most are the CPU core count and GPU VRAM. Everything else is secondary. With multiple Docker containers running concurrently on Shockwave, the original Ryzen 5 1600 was hitting its ceiling faster than expected. The GPU situation was even more limiting; 8GB of VRAM is below the practical threshold for running most current local models without significant performance compromises.

Two upgrades addressed both constraints.

CPU: AMD Ryzen 5 1600 to AMD Ryzen 7 5700X

The jump from 6 cores to 8 cores, with a boost clock of up to 4.6GHz, looks modest on paper. In practice the move from first-generation Zen to Zen 3 makes a meaningful difference for a server running this many concurrent workloads. The Ryzen 7 5700X is also one of the highest-performing options available under $400 that still supports AM4 socket motherboards, which meant no motherboard replacement was needed. A Newegg discount made the decision easy.

The upgrade did require a BIOS update on the ASRock B450M-HDV, and the official support documentation from ASRock was not sufficient to get through it. The answer came from the community: a forum thread with a working solution for exactly this board and CPU combination.

GPU: Sapphire PULSE Radeon RX 580 8GB to Sapphire PULSE Radeon RX 9060 XT 16GB

The GPU upgrade was more of a deliberate compromise than a dream spec, but it was the right call. The RX 9060 XT hits the 16GB VRAM threshold that matters most for local AI workloads. It also meant adding AMD’s ROCm software stack and the accompanying GPU drivers to the Debian installation. AMD provides a solid guide for this setup in its Instinct documentation, which covers the package manager installation path for Debian in detail.

Why Does 16GB of VRAM Change What a Home AI Server Can Do?

Sixteen gigabytes of VRAM is the practical minimum for running small local LLMs with low, manageable latency on consumer hardware. Below that threshold, models either require aggressive quantization that degrades output quality, or they overflow into system RAM and become too slow to be useful in any interactive context.

The RX 9060 XT finally puts Shockwave above that line. Models in the 7B to 13B parameter range can now run locally with reasonable response times. This is the hardware prerequisite that the rest of the server architecture has been waiting on. Everything built in the previous posts (the n8n automations, the Claude Code agent network, the Tailscale connectivity) becomes more capable when the underlying model inference is no longer the bottleneck.
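
To put rough numbers on that threshold, here is a back-of-the-envelope sketch. It assumes VRAM use is roughly the size of the weights (parameter count times bits per weight) plus a flat 20% allowance for the KV cache and runtime buffers. The 20% figure and the function name are illustrative assumptions, not measurements from Shockwave; real usage depends on context length and the inference runtime.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_fraction: float = 0.2) -> float:
    """Rough VRAM estimate in GB: weight size plus a flat overhead allowance.

    overhead_fraction is an assumed allowance for KV cache and buffers,
    not a measured value.
    """
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    total_bytes = weight_bytes * (1 + overhead_fraction)
    return total_bytes / 1e9


if __name__ == "__main__":
    # Compare the model sizes mentioned above at common precisions.
    for params in (7, 13):
        for bits in (16, 8, 4):
            gb = estimate_vram_gb(params, bits)
            print(f"{params}B @ {bits}-bit: {gb:.1f} GB")
```

By this estimate a 13B model at 8 bits lands around 15.6 GB, just under the 16GB line, while the same model on an 8GB card would need to drop to 4-bit quantization to fit.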

One observation worth noting from this build: as I look back at the hardware choices, Gentoo Linux might have been a more effective starting point than Debian for a server built entirely around squeezing performance from specific AMD hardware. Gentoo’s compile-from-source approach produces binaries optimized for the exact hardware they run on. That said, Debian continues to be a solid and stable foundation for this kind of setup, and switching distros at this stage would create more disruption than it would solve.

How Do Prometheus, Grafana, and Loki Work Together on a Home Server?

Prometheus, Grafana, and Loki form a standard open-source observability stack, and each handles a distinct layer. Prometheus collects and stores time-series metrics. Grafana visualizes those metrics on configurable dashboards accessible over the network. Loki aggregates logs and application telemetry, feeding structured event data into Grafana alongside the Prometheus metrics.

For Shockwave specifically, Prometheus is currently tracking active containers, available storage across the server’s 3TB of disk, RAM utilization, GPU output, and agent activity. That last metric is the one that makes this stack particularly relevant for an AI server: being able to see agent usage, token consumption, memory access patterns, and tool calls in a single dashboard is genuinely useful for understanding how the system is actually being used.

The Grafana dashboard is still in progress. More to report once it has enough data and enough panels to be worth showing. Loki is already feeding telemetry from the applications running on the server into Grafana, so the data pipeline is in place even if the visualization layer is still being shaped.
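
As a rough picture of what a log entry looks like on its way into Loki, the sketch below builds the JSON body that Loki's push endpoint (/loki/api/v1/push) accepts: a list of streams, each with a label set and a list of [nanosecond timestamp, line] pairs. The label names and the log line are hypothetical, and on a real server a log shipping agent usually builds these payloads rather than custom code.

```python
import json
import time
from typing import Dict, List, Optional


def build_loki_payload(labels: Dict[str, str], lines: List[str],
                       timestamp_ns: Optional[int] = None) -> str:
    """Build a JSON body for Loki's push API from a label set and log lines.

    Loki expects timestamps as strings of Unix epoch nanoseconds.
    """
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    values = [[str(timestamp_ns), line] for line in lines]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


if __name__ == "__main__":
    # Hypothetical labels and log line, for illustration only.
    print(build_loki_payload({"app": "hermes", "host": "shockwave"},
                             ["agent started"]))
```

The labels are what Grafana queries filter on, so keeping them few and stable (one per application, for example) tends to keep the log side of the dashboard manageable.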

All three run as Docker containers, consistent with the rest of the server architecture. Adding them to the existing Docker Compose setup required minimal disruption to what was already running.

What Is the Hermes AI Harness and Why Does It Change the Interaction Model?

Hermes is an AI agent harness from Nous Research designed to run persistently, learn from its interactions over time, and be wired into whatever communication and model interfaces you choose. Where most AI tools are stateless per session, Hermes is built to be shaped through use, functioning more like a persistent assistant that can be guided and refined over time rather than a fresh context window every time.

For Shockwave, Hermes is connected to Discord as its messaging gateway. That means I can interact with the server directly through Discord from any device, without opening a terminal or triggering a workflow manually. Send a message, get a response from a Hermes agent running on the server. The interaction feels more natural than firing off n8n webhooks, and it opens up a different category of use cases: quick questions, status checks, ad hoc tasks that do not need a full workflow built around them.

The model backend is flexible. Hermes supports local models via Ollama, Hugging Face, and LM Studio, as well as API connections to hosted models. On Shockwave, the local path is the primary one, which is exactly what the GPU upgrade was building toward.
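
To show what the local path looks like at the HTTP level, here is a minimal sketch of a request to Ollama's generate endpoint on its default port, using only the standard library. This is not how Hermes is implemented internally; it only illustrates the kind of call any client makes to a local Ollama instance. The model tag is a placeholder assumption, so substitute whatever model is actually pulled on the server.

```python
import json
import urllib.request

# Ollama listens on port 11434 by default.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str,
                           model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming generate request. The model tag is a placeholder."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b") -> str:
    """Send the prompt to a running Ollama instance and return its reply text."""
    request = build_generate_request(prompt, model)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read())["response"]


if __name__ == "__main__":
    # Requires Ollama to be running locally with the chosen model pulled.
    print(generate("Summarize what a Prometheus gauge is in one sentence."))
```

Because the request never leaves localhost, response time is bounded by the GPU rather than by a network round trip to a hosted API.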

What Comes Next for Shockwave?

The hardware and observability foundations are now in a much better place. The Hermes integration is the immediate priority, and the ultimate goal is getting it to a point where it feels like a digital twin.

The voice interface project is still on the roadmap, and it makes more sense now that the harness gives Shockwave a way to get more intelligent. With Hermes now handling Discord-based interaction, the architecture for a voice layer is clearer; it becomes another interface into the same agent rather than a separate system to build and maintain. That framing makes the implementation path more coherent, and it is the next thing worth exploring in depth.

More to come. Happy hacking.
