Tima Nlemvo

Systems Engineer · Infrastructure & Security · Automation & Cloud

Seven years in IT operations taught me what keeps systems alive under pressure. Now I'm building it myself. The Alliance Fleet is a 25+ service, 3-node Proxmox cluster where I recreate enterprise infrastructure to learn by doing.

01

Professional Alignment Matrix

Identity (IAM)
  Enterprise: Active Directory, Google Workspace, user access governance across 200+ users.
  Alliance: Authentik SSO — centralized OIDC/SAML with MFA enforcement across 15+ services. Full audit trail.

Infrastructure
  Enterprise: Backup solutions, patching, enterprise imaging, macOS/Windows fleet management via JAMF Pro & Intune.
  Alliance: 3-node Proxmox VE cluster with Corosync quorum, ECC memory for data integrity, NVMe Gen4 storage.

Networking
  Enterprise: Enterprise VPN, firewall policies, DNS management, WAN optimization.
  Alliance: 4-VLAN segmentation (Mgmt/Services/IoT/DMZ) via UniFi Dream Machine with static-only trust zones and inter-VLAN firewall rules.

Security
  Enterprise: Endpoint protection, compliance audits, Tier III incident triage and escalation.
  Alliance: Wazuh SIEM — brute-force detection, FIM, log aggregation. Automated threat response via n8n orchestration.

Observability
  Enterprise: Monitoring dashboards, SLA reporting, alerting thresholds, capacity planning.
  Alliance: Telegraf → InfluxDB → Grafana pipeline. 10-second metric resolution. Used for real incident forensics (VFIO lockup RCA).

Remote Access
  Enterprise: Enterprise VPN, Zscaler, conditional access policies.
  Alliance: Tailscale zero-trust mesh with subnet routing, ACL policies, and no exposed ports.
02

Active Fleet Status

25+ Services

Node A — Falcon

AI / ML Compute

Active

Ollama · OpenWebUI · ComfyUI · AnythingLLM

View Specifications →

Node B — Corvette

Data & Operations

Shielded

PostgreSQL · Authentik · InfluxDB · Grafana · n8n · Vaultwarden

View Specifications →

Node C — Gozanti

Network & Security

Gateway

Wazuh SIEM · AdGuard DNS · Nginx Proxy Manager · UptimeKuma

View Specifications →

03

The Work

Each content type has a different purpose and voice. Projects show what I built. Writeups show how I think under pressure. The blog shows where I'm headed.

Projects

Architecture & implementation

Writeups

Incident forensics & postmortems

Blog

The Holocron Logs

Projects

Security

SIEM Automation Pipeline

Automated threat detection, alerting, and auto-blocking using Wazuh + n8n + Discord webhooks. Manual monitoring doesn't scale.

Wazuh · n8n · Discord

Identity

Zero-Trust Identity Platform

Eliminated password sprawl with Authentik OIDC/SAML. 15+ services under SSO with 100% MFA enforcement and full audit logging.

Authentik · OIDC/SAML · MFA

AI/ML

GPU AI Platform

Local LLM inference via Ollama on RTX 4000 Ada with VFIO passthrough. 50 tok/s, 500+ document RAG pipeline. Zero data egress.

Ollama · VFIO · RTX 4000

Incident Response

VFIO Lockup Forensics

Diagnosed a silent hard lockup with zero local logs. Used external telemetry to trace root cause to a PCIe bus stall from an NVIDIA GPU under passthrough.

InfluxDB · Flux · PCIe/IOMMU

04

Core Stack

Linux Administration · Proxmox VE Clustering · VLAN Segmentation · Firewall Policy · Wazuh SIEM/XDR · Authentik SSO/IAM · Telegraf → InfluxDB → Grafana · Tailscale Zero-Trust · GPU Passthrough (VFIO) · Reverse Proxy & TLS · Incident Forensics · Active Directory · JAMF Pro / Intune · PowerShell / Bash · Docker / LXC · n8n Automation
05

Briefing Room

Want to see the infrastructure running live? I'll do a screen-share walkthrough of the Alliance Fleet: architecture, monitoring dashboards, and the decisions behind them.

The Alliance Fleet: Deep-Dive

3-Node Proxmox VE Cluster · Corosync Quorum · 25+ Services

Why This Exists

"I built this to learn by doing. Every design decision mirrors production standards I saw across three enterprise environments. The goal is to understand why infrastructure works, not just how to configure it."

I

Physical Architecture

Millennium Falcon — Node A (FCM2250)

  • Intel Core Ultra 9
  • 64 GB DDR5
  • 2 TB NVMe Gen4
  • RTX 4000 SFF Ada (20 GB VRAM, VFIO passthrough)

Carries the GPU for AI/ML inference — 20 GB VRAM handles 70B parameter models via Ollama in the Tantive-III VM.

CR90 Corvette — Node B (QCM1255)

  • AMD Ryzen 7 PRO (ECC-capable)
  • 64 GB DDR5 ECC
  • 4 TB Storage

ECC memory because it runs InfluxDB, PostgreSQL, and Wazuh — silent bit-flip corruption in time-series or auth data would poison monitoring and identity.

Gozanti Cruiser — Node C (OptiPlex 7050)

  • Intel i7-7700
  • 32 GB DDR4
  • 512 GB NVMe + 1 TB SATA
  • 2.5 GbE NIC (hardware mod)

Network-edge services and Tailscale subnet router — keeping the security control plane on a dedicated node.

II

Network Segmentation

VLAN 10 — Management — 192.168.1.0/24 — DHCP disabled
  Hypervisors, switch, gateway UI. Static-only — prevents rogue device access.

VLAN 20 — Services — 192.168.20.0/24 — DHCP .100-.200
  All application workloads — AI models, SIEM, databases, identity, automation.

VLAN 30 — IoT — 192.168.30.0/24 — DHCP .100-.200
  Fully isolated — cannot initiate connections to Management or Services.

VLAN 40 — DMZ — 192.168.40.0/24 — DHCP disabled
  Public-facing reverse proxy ingress only. Static-only — every host explicitly provisioned.

Network hardware: UniFi Dream Machine (gateway/firewall/routing), UniFi US-8-150W (PoE managed switch, VLAN trunking), UniFi Beacon HD (wireless).
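The isolation rules above can be captured as a small default-deny policy check. A minimal Python sketch, assuming simplified VLAN names and an illustrative allow list (the real rules live in the UniFi controller):

```python
from ipaddress import ip_address, ip_network
from typing import Optional

# Subnets from the segmentation table above
VLANS = {
    "mgmt":     ip_network("192.168.1.0/24"),
    "services": ip_network("192.168.20.0/24"),
    "iot":      ip_network("192.168.30.0/24"),
    "dmz":      ip_network("192.168.40.0/24"),
}

# Illustrative allow list for new connections; anything not listed is denied.
# IoT cannot initiate to Management or Services; DMZ only reaches Services.
ALLOWED = {
    ("mgmt", "services"),
    ("mgmt", "iot"),
    ("mgmt", "dmz"),
    ("dmz", "services"),
}

def vlan_of(ip: str) -> Optional[str]:
    """Map an address to its VLAN name, or None if it is off-net."""
    addr = ip_address(ip)
    for name, net in VLANS.items():
        if addr in net:
            return name
    return None

def may_initiate(src_ip: str, dst_ip: str) -> bool:
    """Default-deny: intra-VLAN traffic and explicitly allowed pairs only."""
    src, dst = vlan_of(src_ip), vlan_of(dst_ip)
    if src is None or dst is None:
        return False          # unknown hosts are denied outright
    return src == dst or (src, dst) in ALLOWED
```

For example, `may_initiate("192.168.30.50", "192.168.1.10")` is denied: an IoT device cannot open a connection into Management.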

III

Traffic Flow & Remote Access

External Request Path

Internet → UDM Firewall
→ VLAN 40: Nginx Proxy Manager (TLS termination)
→ VLAN 20: Authentik (SSO challenge, MFA)
→ VLAN 20: Backend Service

Remote Access: Tailscale

Zero-trust mesh VPN with Node-C as the subnet router. No ports exposed to the public internet. ACL policies enforce least-privilege access per device and user.

IV

SIEM & Identity Core

Wazuh SIEM

Brute-force detection, file integrity monitoring, and log aggregation across all nodes. Custom detection rules being expanded. Alerts piped to Discord via n8n.

Authentik SSO/IAM

All 15+ internal services sit behind Authentik with OIDC/SAML integration and MFA enforcement. Every login logged. Zero-trust gateway enforced via Nginx Proxy Manager in the DMZ.

V

Observability & Automation

Telegraf → InfluxDB → Grafana

Observability pipeline. Telegraf agents on all nodes push CPU, memory, disk, network, and kernel metrics at 10-second intervals to InfluxDB. Grafana dashboards provide cluster-wide visibility.
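On the write side, every hop of this pipeline speaks InfluxDB line protocol. A minimal sketch of how a Telegraf-style metric serializes (measurement, tag, and field names are illustrative; numeric fields only, since string fields would need quoting):

```python
def to_line_protocol(measurement: str, tags: dict, fields: dict, ts_ns: int) -> str:
    """Serialize one metric as: measurement,tag_set field_set timestamp(ns)."""
    tag_set = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_set = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_set} {field_set} {ts_ns}"

# A 10-second CPU sample from Node-A might serialize as:
# to_line_protocol("cpu", {"host": "falcon"}, {"usage_idle": 99.8}, 1707494090000000000)
# → "cpu,host=falcon usage_idle=99.8 1707494090000000000"
```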

n8n: Tactical Orchestration

API orchestrator for fleet maintenance, threat response, and notification pipelines.

Wazuh → Block & Alert Pipeline
// n8n workflow logic
{
  "trigger": "Wazuh Webhook (POST)",
  "filter":  "Reject if srcip ∈ 192.168.*",
  "action_1": "Block IP via firewall alias",
  "action_2": "Discord webhook alert"
}

VI

Private AI Stack

Tantive-III VM — Node A

Local LLM inference running Ollama and AnythingLLM on the RTX 4000 Ada (20 GB VRAM) via VFIO passthrough. 50 tok/s on 70B models. 500+ document RAG pipeline with OpenWebUI and ComfyUI for image generation. Zero data egress, all inference on-premises.

VII

Roadmap

In Progress

Automated threat response integration for UniFi Dream Machine
Wazuh agent tuning and custom detection rule expansion
Grafana dashboard buildout for cluster-wide visibility
Inter-VLAN firewall rule hardening
Tailscale ACL policy refinement

Planned

Proxmox Backup Server — scheduled VM/CT snapshots with retention
Offsite encrypted backups — B2 or S3 replication for DR
Kubernetes (k3s) on Node-B — container orchestration beyond Compose
Terraform for VM provisioning — full IaC, GitOps workflow
Grafana alerting rules — CPU, memory, disk, availability thresholds

The Holocron Logs

Engineering documentation: incident forensics, infrastructure hardening, and systems troubleshooting from a working homelab.

Full blog at holocron-labs.tima.dev

I

Latest Posts

Latest posts are pulled live from Ghost CMS.

II

Featured Writeup

Pinned · Incident Response Postmortem · Feb 9, 2026

Diagnosing a Silent Hard Lockup on a Proxmox VFIO Passthrough Node

Node-A (Millennium Falcon) suffered a complete hard lockup with zero local crash artifacts — no kernel panic, no pstore dump, no journal entries. The culprit: log2ram held all logs in RAM and the instantaneous failure prevented disk sync, destroying 9 days of logs.

Investigation Timeline

1. Establish timeline — Corosync logs from surviving peer (Node-B) pinpointed the crash at 07:55:07 PST
2. Discover log gap — journalctl -b -1 showed logs ending Jan 31. log2ram confirmed as the blind spot
3. Pivot to external telemetry — Telegraf metrics in InfluxDB on Node-B still had 10-second resolution data for the crashed host
4. Pinpoint exact crash time — Last data point at 15:54:50 UTC. Uptime counter incrementing normally, then stopped
5. Analyze system state — CPU idle at 99.8%, memory at 7.4%, zero network errors. Not resource exhaustion
6. Check crash artifacts — Empty pstore, no MCE, no panic. Failure happened below kernel observation
7. Root cause — PCIe bus stall from NVIDIA GPU under VFIO passthrough. GPU configured 3 days prior (change correlation)
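Step 4 reduces to scanning a fixed-cadence metric stream for the first break. A minimal sketch of that gap hunt (the real query ran in Flux against InfluxDB; the 10-second interval matches the Telegraf cadence, the slack value is an assumption):

```python
from datetime import datetime, timedelta

def last_before_gap(timestamps, interval=timedelta(seconds=10),
                    slack=timedelta(seconds=5)):
    """Return the last timestamp before the first break in a fixed-interval series."""
    for prev, cur in zip(timestamps, timestamps[1:]):
        if cur - prev > interval + slack:
            return prev              # series went silent here: candidate crash time
    return timestamps[-1]            # no gap found; series ran to the end
```

Run against Node-A's uptime series, the break falls at the 15:54:50 UTC point named in step 4.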

Resolution

Kernel mitigations: pcie_aspm=off pci=noaer — disables PCIe power state transitions and AER recovery attempts that can stall the bus
Logging fix: Disabled log2ram — negligible write-reduction benefit on NVMe vs. risk of losing all crash forensics
Result: Node has been stable since applying mitigations

Key Takeaways

→ External telemetry saved this investigation. Observability pipelines are forensic tools, not dashboards.
→ log2ram is a footgun on servers. Designed for SD card wear on SBCs, not production hypervisors.
→ NVIDIA VFIO passthrough can cause undetectable host lockups — no panic, no MCE, no pstore.
→ The 17-second delta between last metric and corosync alert reflects the token timeout — useful for calibrating monitoring thresholds.
→ Always verify boot parameter changes with cat /proc/cmdline after reboot.
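That last check is easy to script. A minimal sketch that tests a /proc/cmdline string for the expected parameters (the parameter names are the mitigations above; whole-token matching avoids false positives on substrings):

```python
def cmdline_has(cmdline: str, *params: str) -> bool:
    """True if every expected boot parameter appears as a whole token."""
    tokens = set(cmdline.split())
    return all(p in tokens for p in params)

# After reboot, confirm the mitigations actually took effect:
# with open("/proc/cmdline") as f:
#     assert cmdline_has(f.read(), "pcie_aspm=off", "pci=noaer")
```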

Linux Kernel · InfluxDB/Flux · PCIe/IOMMU · Corosync · Incident Response · Postmortem

Read all posts on holocron-labs.tima.dev

The Engineer Behind the Fleet

7+ years in IT operations. Now leveling up into systems engineering.

At Team Liquid, Stagwell, and Creative Artists Agency, I was the person who got the call when things broke. Tier III escalation across identity, endpoints, and infrastructure for globally distributed teams. That work taught me what resilience actually looks like in production.

The Alliance Fleet is where I put that experience to work. I recreate enterprise operations in my homelab to design, break, diagnose, and document real infrastructure. Every VLAN, every firewall rule, every monitoring pipeline exists because I wanted to understand why it works, not just how to configure it.

The proof is in the projects, the writeups, and the blog.

What I'm Building Toward

The principles I hold myself to:

  • Design fault domains before deploying workloads
  • Treat identity and network boundaries as foundational controls
  • Favor explicit trust and default-deny over convenience
  • Make observability and documentation first-class components
  • Automate to reduce cognitive load, not just manual effort

These come from enterprise environments where outages had real consequences. I apply them daily in the Fleet.

Operational Record

2023 — Present

IT Systems Specialist

Team Liquid

Operate production systems for competitive gaming and corporate environments. Tier III escalation across identity, endpoints, and infrastructure. Manage access for 200+ users.

2021 — 2023

Senior Infrastructure Engineer

Stagwell

Kept enterprise IT systems and cloud applications reliable across a distributed agency network. Managed macOS and Windows fleets via JAMF Pro and Intune. Enforced security policy compliance.

2019 — 2021

Service Desk Lead

Creative Artists Agency (CAA)

Managed asset lifecycle and procurement for the Los Angeles office and West Coast. Maintained Active Directory infrastructure. Led service desk operations and improved identity and escalation workflows.
