Self-Hosting Gemma 4 for Production Automation Revealed Two Ollama Bugs

DevArt keeps this article discoverable at a fast, self-canonical URL and links clearly to the original DEV publication.

I thought Gemma 4's reasoning traces were wasting tokens. During testing, I realized they were acting as an audit layer for automation. That realization changed how I designed an n8n node for self-hosted AI workflows.

In most automation systems, the model output is the only thing the operator sees. But once AI starts triggering downstream workflows, hidden reasoning becomes operationally important. If the model is making decisions on behalf of a business, the logic path matters as much as the final response.

Here's what I built, what I found, and what it means for AI automation on owned infrastructure.

What I Built

An n8n community node that connects any n8n workflow to a self-hosted Gemma 4 26B MoE endpoint. The node calls Ollama's native /api/generate API, returns clean text, and works with a custom model called triava-prod — a Gemma 4 26B derivative with Triava Labs' brand voice baked in.

The tagline for Triava Labs is "Your model. Your voice. Your business." This node operationalizes that idea.

Repo: github.com/triavalabs/n8n-nodes-triava

The Infrastructure

Everything runs on a single Hetzner CCX33 server: Ollama serving the model, Caddy as reverse proxy, Let's Encrypt for SSL.

No GPU cluster.
No cloud API dependency.
One server, owned infrastructure, real inference.

triava-prod is a Q4_K_M quantization of Gemma 4 26B MoE — 25.8B parameters loaded, roughly 4B active per token. Built using Ollama's Modelfile system with a custom system prompt that encodes Triava's brand voice:

SYSTEM "You are a direct, professional AI assistant for independent operators.
Reply with the answer only. Never show reasoning, drafts, or thinking process.
Match the operator's voice and tone. Be concise unless asked for detail."

Why Gemma 4 26B MoE

The MoE design gives high-capability reasoning behavior at roughly 4B active-parameter inference cost per token. That means it runs at practical throughput on a single owned server — which is the whole point of sovereign infrastructure. A model that requires an A100 cluster isn't sovereign in any meaningful sense for an independent operator or small agency.

Gemma 4 also introduced native system-role support. That matters specifically for this project because the brand voice IS a system prompt. The whole pipeline depends on reliable system-role adherence and consistent on-voice output.

Then I actually tested it in production-like conditions:

Cold inference on a Hetzner CCX33: ~16-31 seconds via /api/generate for a full brand-voice response
Output quality: coherent, on-tone, holds the voice across 150+ word outputs

The model reasons before writing.

What initially looked like a bug turned out to be a feature.

What I Actually Discovered

Two upstream Ollama bugs, found through methodical testing during Phase 2 build.

Bug 1 — `/v1/chat/completions` returns empty content for all Gemma 4 models

(Ollama issue #15288)

When using Gemma 4 via the OpenAI-compatible endpoint, the content field is always empty. The reasoning trace exhausts the max_tokens budget before any final output is generated. I confirmed this affects the base gemma4:26b model too — it's not a Modelfile issue.

I diagnosed it with five comparative curl tests: three against /v1/chat/completions (all empty), two against /api/generate (both clean). The native endpoint folds reasoning and output into one response field and runs 4× faster — ~16s vs ~60s.

Decision: the node targets /api/generate. This isn't a workaround — it's the correct endpoint for Gemma 4 on Ollama right now.

Bug 2 — Long system prompts return empty responses on the 26B MoE

(Ollama issue #15428)

The Gemma 4 26B MoE returns empty output when the combined system prompt exceeds roughly 500 characters. Dense models handle the same prompt correctly. This is isolated to the MoE architecture.

Practical implication: triava-prod already has the brand voice baked into its Modelfile. Passing an additional system prompt from the node adds to the total and can hit the threshold. The fix: leave the System Prompt field blank when using triava-prod. The voice lives in the model, not in the API call.

This is what "your voice" actually means architecturally. The brand voice isn't injected per-call. It lives in the model you own.

The Demo

The workflow: Manual Trigger → Triava node → brand-voice content.

n8n Workflow
     ↓
Triava Node
     ↓
Ollama /api/generate
     ↓
triava-prod (Gemma 4 26B MoE)
     ↓
Reasoning + Output
     ↓
Clean text → downstream automation

The reasoning trace can optionally be retained for auditability before execution.

The prompt:

Write 3 X post variants (max 280 characters each) about this topic.

Topic: Building a sovereign n8n node for Gemma 4
Angle: The reasoning trace is an auditing layer, not noise
Voice: First-person, direct, technically honest. No hype.

Number them V1, V2, V3. No hashtags.

Output from triava-prod:

V1: Building a sovereign n8n node for Gemma 4. The reasoning trace isn't noise; it's an auditing layer. You need to be able to verify the logic behind the automation.

V2: I'm working on a Gemma 4 node for n8n. I'm treating the reasoning trace as an audit log rather than extra text. For sovereign workflows, seeing the steps taken is more important than a clean output.

V3: Developing an n8n node for Gemma 4. The reasoning trace is often seen as clutter, but I'm using it as an auditing layer. It provides the transparency necessary to trust automated decisions.

Model Selection — The Honest Summary

I picked the 26B MoE. I tested it. I found two real bugs. I made deliberate engineering decisions based on what the tests showed.

The 26B MoE delivers high-capability reasoning behavior at ~4B active-parameter inference cost on hardware an independent operator can actually own. It has native system-role support that makes brand-voice workflows possible. And its reasoning behavior — which initially looked like a problem — turns out to be an auditing layer that makes the model's logic inspectable before it triggers downstream automation.

If automation is going to make decisions on behalf of operators, the reasoning layer cannot remain invisible.

That last point isn't something I planned to write about. It's something I observed. Which is the only kind of model-selection story worth telling.

What's Next

The OpenAI-compatible path (/v1/chat/completions) is a real goal for Triava Labs — if the upstream Ollama issue gets resolved, the node's architecture is already designed to support it. That's a v1.5 roadmap item, not a contest deliverable.

The node is at github.com/triavalabs/n8n-nodes-triava. npm publish is in progress via GitHub Actions with provenance.

Triava Labs v1 is in active development at triavalabs.com. The node is the first production component of the broader Triava Labs infrastructure.

The deeper lesson from this build was that self-hosting a model is only part of sovereignty. The other part is being able to inspect the model's reasoning before automation turns it into action.

Update — May 16, 2026

Since publishing, an unexpected cross-article thread emerged with @alimafana, who independently hit complementary Gemma 4 26B MoE failure modes from a completely different deployment context — a production Arabic e-commerce chat router on Google AI Studio rather than self-hosted Ollama.

Their finding: MoE and Dense handle ambiguous instructions in opposite ways. Same prompt, two architectures, inverse failures.

The intersection: both findings point to the same underlying picture — each Gemma 4 variant has its own tax, paid on different inputs. Their behavioral observation from the application layer and my infrastructure-level bug documentation appear to be two angles on the same architectural reality.

The upstream bugs filed:

Ollama issue #15288 — /v1/chat/completions empty content for all Gemma 4 models
Ollama issue #15428 — long system prompts return empty responses on the 26B MoE

Related:

@alimafana's submission — "I Added Three Rules to Gemma 4. The MoE Searched. The Dense Model Refused."

Update — May 20, 2026

Ran the uncapped re-run we discussed in the comments. The MoE on sovereign Ollama handles all six scenarios correctly across temperatures 0.3 / 0.7 / 1.0 when the budget isn't capped — which doesn't reproduce @alimafana's Dense regression on the MoE side. Consistent with what we both expected.

The unexpected part: even uncapped on /api/generate, there's a measurable gap between eval_count and the characters returned in the response field, and it widens with query difficulty.

Grounded retrieval ("white shirt size L"): 389 tokens / 155 chars (~2.5 chars per token)
Under-specified retrieval ("formal but soft"): 1,096–1,379 tokens / 281–321 chars (~0.2–0.3 chars per token)

Temperature isn't the driver — the gap holds at both 0.3 and 0.7. The query type is.

What this means architecturally: the audit-layer framing in the original article holds, but the audit layer is even more elusive than the article suggested. On /v1/chat/completions the reasoning eats the budget and the response is empty (#15288). On /api/generate the reasoning happens, the budget absorbs it, and you see a clean final answer — but the work itself isn't in any field the API returns. Bugs #15288 and #15428 still stand. This is a third observation about the same architecture, not a contradiction of either.

Methodology and full 18-call CSV available on request.

Update — May 21, 2026

Status check on both upstream Ollama issues, four days after publishing:

#15288 — /v1/chat/completions empty content for Gemma 4 — was closed as completed on April 3. The maintainer (@jmorganca) and a collaborator (@rick-github) identified that the empty content field is caused by max_tokens being insufficient to accommodate Gemma 4's reasoning plus the final output. Workarounds: raise max_tokens, set reasoning_effort: "none", or use /api/generate (the path Triava took). My characterization of the endpoint as "broken" was a step too strong — the more honest framing is that /api/generate was the simpler integration path for Gemma 4's reasoning behavior, but /v1/chat/completions works with the right configuration. The architectural choice still holds; the framing gets calibrated.

#15428 — gemma4:26b MoE empty response on long system prompts — was real on Ollama 0.20.x. Multiple users independently confirmed it (@wiltongorske, @semidark, @cymise, @maxbanton). It was fixed somewhere between 0.20.x and 0.21.x, likely as part of the gemma4 renderer rework that brought tokenization counts back to expected values. I verified this morning on Ollama 0.23.1 against the same model SHA: the empty-response behavior is gone, prompt_eval_count is back to the expected ~329 tokens (vs. 1,423 on 0.20.3), and content/thinking fields populate correctly. Comment posted on the issue confirming resolved-in-0.21+.

The bake-voice-into-the-Modelfile architectural decision still holds independent of #15428 — keeping the voice in the model rather than the API call is the right design for sovereignty regardless of whether long system prompts work via the chat endpoint. What changed is the framing: it's a deliberate architectural choice aligned with the thesis, not an engineering workaround forced by a bug.

Update — May 23, 2026

The Scenario 6 substitution observation from the May 20 update — same response across all three temperatures — turned out to be the visible edge of something more general.

Jiwon SEO (Hashevolution/JAMES-RAG-Evol) ran a confirming sweep on gemma4:e4b (PR #440) and found the same pattern at smaller parameter count: substitution-mode prompts produce identical output across runs at T=0.2, regardless of cap budget. The mode genuinely bypasses the sampling layer.

I ran a 2×2 cross-model sweep on gemma4:26b MoE this morning (cap × prompt-type, n=20/cell) — full data in triavalabs/gemma4-26b-mode-split. Three findings:

Substitution is bit-for-bit deterministic on 26b — 40/40 calls, 1 unique response, eval_count 38 flat. Same canonical text every time at T=0.2.
Cap budget is irrelevant when task fits within cap — 10× cap change produces no behavioral difference for either mode.
26b synthesis is ~9× more token-efficient than e4b — equivalent decisions in 1/9th the tokens, 100% success vs 70%. Parameter count appears to buy reasoning efficiency, not just capacity.

The collaboration now has a three-axis joint paper structure: mode split (qualitative), workload gradient (JAMES quantitative), and model-scale efficiency (cross-model + answer-convergence). Headline: "Substitution is free. Synthesis costs in proportion to what it has to invent."

Joint research artifacts:

PR #440 — Jiwon's e4b V3'.e sweep
PR #453 — Direction 4 result: e4b determinism replication + cross-model convergence
Issue #448 — cross-stack analysis thread
triavalabs/gemma4-26b-mode-split — my 26b companion data + analysis

What started as an audit-layer observation in this article has become a cross-stack research thread with three named contributors and a productization path (Direction 5: auto-routing on JAMES's Provider Contract using the mode split as a design primitive). The work isn't finished — Ali Afana's deployment-context data on managed Gemini (Track 3) lands mid-June. But the architecture pattern is in hand.

Built by Robin Converse · Triava Labs · "Your model. Your voice. Your business."