Before I bring any AI idea to the vendors — SIEM copilots, ticket summarizers, phishing triagers — I prototype it on a local model first. Two reasons. First, my prototype data is real SIEM output. I am not sending EDR telemetry to someone else's API while I am sketching. Second, if the idea does not work against a decent 8B model, it is not going to magically work against GPT-5. Bad prompts do not get saved by big models.
The Hardware
One dedicated box sitting on the lab VLAN:
- Ryzen 7 7700 (8 cores, cheap)
- 64 GB DDR5
- RTX 4070 Ti Super with 16 GB VRAM — the sweet spot for models up to 13B at Q5_K_M
- 2 TB NVMe because quantized models add up fast
- Ubuntu 24.04 LTS, nothing fancy
Total around $1,800 in early 2026. A used 3090 with 24 GB is also excellent if you can find one under $800 — you get to run 34B models at usable speeds.
The Stack
Ollama is the boring right answer. Install:
```sh
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
ollama pull qwen2.5-coder:14b
ollama pull nomic-embed-text
```
Then a thin FastAPI wrapper and a Streamlit page for each experiment. Do not adopt LangChain on day one. You will spend more time debugging abstractions than writing prompts. A hundred lines of Python plus direct Ollama HTTP calls gets you further.
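Here is roughly what that looks like. A minimal sketch, not a fixed API: the /run route and request shape are my choices, and the only real contract is Ollama's /api/generate endpoint.

```python
# app.py: minimal sketch of the wrapper, direct Ollama HTTP calls, no framework.
import httpx
from fastapi import FastAPI
from pydantic import BaseModel

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

app = FastAPI()

class RunRequest(BaseModel):
    prompt: str
    model: str = "llama3.1:8b"

@app.post("/run")
async def run(req: RunRequest) -> dict:
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(
            OLLAMA_URL,
            json={"model": req.model, "prompt": req.prompt, "stream": False},
        )
    resp.raise_for_status()
    # With stream set to false, Ollama returns the full completion under "response"
    return {"output": resp.json()["response"]}
```

Run it with uvicorn app:app and point a Streamlit text box at the /run route. That is the whole harness.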
What Actually Works Locally
Prototype these without paying a cloud bill:
- Alert summarization. Feed a Splunk search result as JSON, ask for a three-bullet incident brief. Llama 3.1 8B does this at about 85% of GPT-4 quality after one round of prompt tuning.
- Phishing triage. Email headers plus body → structured verdict with reasoning. Great for building a gold-standard dataset you can later evaluate models against. A sketch follows this list.
- Rule translation. Sigma to Splunk SPL, KQL to Splunk, YARA authoring from a sample. Code models like Qwen2.5-Coder:14B shine here.
- Log normalization. Unknown vendor log → CEF. Deterministic enough that an 8B model is fine.
- Semantic search over your own runbooks. Embed with nomic-embed-text, store in SQLite with the sqlite-vec extension. No pgvector cluster required. Also sketched after the list.
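The phishing triage shape, sketched with Ollama's JSON mode so the model is constrained to valid JSON. The verdict schema is my own convention, not a standard:

```python
# phish_triage.py: sketch, email headers + body -> structured verdict.
import json
import httpx

OLLAMA_URL = "http://localhost:11434/api/generate"

# The schema in this prompt is illustrative, not a standard.
PROMPT = """You are a SOC phishing triage assistant.
Given the email below, reply with JSON only:
{{"verdict": "malicious" | "suspicious" | "benign",
  "confidence": 0.0-1.0,
  "reasoning": "<one sentence>"}}

Headers:
{headers}

Body:
{body}
"""

def triage(headers: str, body: str) -> dict:
    resp = httpx.post(
        OLLAMA_URL,
        json={
            "model": "llama3.1:8b",
            "prompt": PROMPT.format(headers=headers, body=body),
            "format": "json",  # Ollama constrains the output to valid JSON
            "stream": False,
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])
```

Every verdict you hand-label afterward becomes a row in that gold-standard dataset.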
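And the runbook search item. A sketch of the embed-and-query loop, assuming runbooks are already split into chunks; file loading is omitted and the table names are mine:

```python
# runbook_search.py: sketch, nomic-embed-text embeddings stored in sqlite-vec.
import sqlite3
import httpx
import sqlite_vec  # pip install sqlite-vec
from sqlite_vec import serialize_float32

EMBED_URL = "http://localhost:11434/api/embeddings"

def embed(text: str) -> list[float]:
    resp = httpx.post(
        EMBED_URL, json={"model": "nomic-embed-text", "prompt": text}, timeout=60.0
    )
    resp.raise_for_status()
    return resp.json()["embedding"]

db = sqlite3.connect("runbooks.db")
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

# nomic-embed-text emits 768-dimensional vectors
db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS vec_runbooks USING vec0(embedding float[768])")
db.execute("CREATE TABLE IF NOT EXISTS docs(id INTEGER PRIMARY KEY, body TEXT)")

def index(doc_id: int, body: str) -> None:
    db.execute("INSERT INTO docs(id, body) VALUES (?, ?)", (doc_id, body))
    db.execute(
        "INSERT INTO vec_runbooks(rowid, embedding) VALUES (?, ?)",
        (doc_id, serialize_float32(embed(body))),
    )
    db.commit()

def search(query: str, k: int = 5) -> list[tuple[str, float]]:
    # KNN query: MATCH against the query vector, k nearest by distance
    hits = db.execute(
        "SELECT rowid, distance FROM vec_runbooks "
        "WHERE embedding MATCH ? AND k = ? ORDER BY distance",
        (serialize_float32(embed(query)), k),
    ).fetchall()
    return [
        (db.execute("SELECT body FROM docs WHERE id = ?", (rowid,)).fetchone()[0], dist)
        for rowid, dist in hits
    ]
```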
What Does Not Work Locally
Be honest about the ceiling. Things I stopped trying to prototype at home:
- Scale. A local box at 40 tokens per second will not tell you if your idea survives 10,000 alerts per hour. That is a production question.
- Long-context reasoning. 128K windows exist locally but degrade badly past 32K on 8B quantized models. If your workflow needs a full week of logs in context, test on a hosted model.
- Tool-use reliability. Function calling on small models is flaky. Prototype the shape of the workflow; do not benchmark reliability here.
- Multi-agent orchestration. Latency compounds. Two 8B hops feel like waiting for dial-up.
A Workflow I Use Weekly
"Does this prompt idea even deserve a meeting?" Pipeline:
- Pull 50 representative inputs from prod (scrubbed).
- Write a naive prompt and run it on Llama 3.1 8B (scaffolding sketched after this list).
- Hand-label outputs as good/bad.
- If I get less than 60% on a small model, a GPT-4-class model might get me to 80%. Not enough for a production case.
- If I get above 70% locally, the idea is worth a real eval on a hosted model.
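The first three steps reduce to a short script. A sketch, assuming the scrubbed inputs sit in inputs.jsonl, one JSON object per line; the file names and the prompt are placeholders:

```python
# triage_filter.py: sketch of the weekly "does this deserve a meeting?" filter.
import json
import httpx

OLLAMA_URL = "http://localhost:11434/api/generate"
PROMPT = "Summarize this alert as a three-bullet incident brief:\n\n{record}"

def run_model(record: dict) -> str:
    resp = httpx.post(
        OLLAMA_URL,
        json={
            "model": "llama3.1:8b",
            "prompt": PROMPT.format(record=json.dumps(record)),
            "stream": False,
        },
        timeout=120.0,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Run all 50 scrubbed inputs and dump input/output pairs for hand-labeling
with open("inputs.jsonl") as f, open("outputs.jsonl", "w") as out:
    for line in f:
        record = json.loads(line)
        out.write(json.dumps({"input": record, "output": run_model(record)}) + "\n")

# After hand-labeling (add "label": "good" | "bad" to each line), score it:
# labels = [json.loads(l)["label"] for l in open("labeled.jsonl")]
# print(f"pass rate: {labels.count('good') / len(labels):.0%}")
```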
That filter has killed more bad ideas than any strategy session. Three hours on a $1,800 box has saved me from three months of vendor POCs that were never going to succeed. Prototype locally. Ship with the best tool for the actual production load.