This is a submission for the Gemma 4 Challenge: Write About Gemma 4

Gemma 4 ships with native function calling: the capability is trained into the model rather than bolted on through prompt engineering. But "built in" and "tuned for your specific tools" are different things.

If you have a set of internal APIs, a specific tool schema, or edge-case behaviors that the base model handles inconsistently, fine-tuning on your own function-calling data is the right move. TRL (Transformer Reinforcement Learning library) added multimodal tool response support in the same release window as Gemma 4, making this the first time you can fine-tune a multimodal model on tool use — including image outputs from tools.

This guide walks through the full pipeline: data format, fine-tuning with QLoRA, and evaluation.


What TRL's Multimodal Tool Support Actually Adds

Before this update, TRL's SFTTrainer (Supervised Fine-Tuning) could train on text tool calls and text tool responses. The new version adds:

  • Image outputs from tools — a tool that returns an image (chart, rendered diagram, screenshot) can now be part of the training trajectory
  • Multi-turn tool use sequences — full call → response → reasoning → next call chains
  • Gemma 4 chat template — native support for Gemma 4's system/user/assistant/tool format

For pure text tool calling, the older approach still works. The multimodal support matters when your agent needs to reason about visual outputs from tool calls.


Prerequisites

pip install "trl>=0.12.0" "transformers>=4.50" datasets peft accelerate bitsandbytes

For 4-bit quantization (recommended for consumer GPUs):

pip install "bitsandbytes>=0.43.0"

Data Format

Your training data needs to be structured as multi-turn conversations with tool calls and responses. TRL's SFTTrainer expects the Gemma 4 chat format:

# Each example is a list of messages
example = [
    {
        "role": "system",
        "content": "You are a data analysis assistant with access to a Python code executor and a charting tool."
    },
    {
        "role": "user",
        "content": "What's the trend in our monthly revenue for Q1 2026?"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "execute_python",
                    "arguments": '{"code": "import pandas as pd\\ndf = pd.read_csv(\'revenue.csv\')\\nprint(df[df[\'quarter\']==\'Q1\'].groupby(\'month\')[\'revenue\'].sum())"}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "name": "execute_python",
        "content": "month\nJanuary     842000\nFebruary    891000\nMarch       967000\nName: revenue, dtype: int64"
    },
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "type": "function",
                "function": {
                    "name": "create_chart",
                    "arguments": '{"data": [842000, 891000, 967000], "labels": ["Jan", "Feb", "Mar"], "type": "line", "title": "Q1 2026 Monthly Revenue"}'
                }
            }
        ]
    },
    {
        "role": "tool",
        "name": "create_chart",
        "content": [
            {"type": "image", "image": "<PIL.Image object>"},  # placeholder: pass the actual PIL.Image returned by the tool
            {"type": "text", "text": "Chart generated successfully"}
        ]
    },
    {
        "role": "assistant",
        "content": "Q1 revenue shows a clear upward trend: January $842K → February $891K → March $967K. That's 14.8% growth over the quarter, with acceleration in March (+8.5% vs February's +5.8%). The chart shows a steepening curve, which is a positive signal."
    }
]

The key difference from text-only tool calling: the tool role message can now contain {"type": "image", ...} content — the image output from the tool becomes part of the training trajectory.
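
A small helper can keep image-bearing tool messages consistent across a dataset. The sketch below is illustrative only: `make_image_tool_message` and `count_image_parts` are hypothetical names, not part of TRL, and they assume the content-part layout shown in the example above (an `image` part followed by an optional `text` part). The image object is passed through untouched, so in practice it would be a `PIL.Image` or whatever your collator expects.

```python
from typing import Any, Optional


def make_image_tool_message(tool_name: str, image: Any, note: Optional[str] = None) -> dict:
    """Build a tool-role message whose content carries an image part.

    Hypothetical helper that mirrors the message layout used in this article.
    """
    content = [{"type": "image", "image": image}]
    if note:
        content.append({"type": "text", "text": note})
    return {"role": "tool", "name": tool_name, "content": content}


def count_image_parts(messages: list) -> int:
    """Count image parts in a conversation; string or None content has none."""
    total = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, list):
            total += sum(1 for part in content if part.get("type") == "image")
    return total
```

Counting image parts per example is a cheap way to confirm that the multimodal trajectories actually made it into the dataset before a training run starts.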


Building a Training Dataset

For fine-tuning to work, you need enough examples to cover your tool schemas and the edge cases you care about. A minimum viable dataset for a specific domain is usually 100-500 examples.

from datasets import Dataset

def build_training_examples():
    """
    Build your domain-specific function calling examples.
    Each example should demonstrate:
    1. Correct tool selection
    2. Correct argument formatting
    3. Correct reasoning from tool output
    4. Edge case handling (tool errors, multiple calls, no tool needed)
    """
    examples = []

    # Example: correct tool selection
    examples.append([
        {"role": "system", "content": "You have access to: get_weather(city: str), search_web(query: str), calculator(expression: str)"},
        {"role": "user", "content": "What's 15% of $847?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{"type": "function", "function": {"name": "calculator", "arguments": '{"expression": "847 * 0.15"}'}}]
        },
        {"role": "tool", "name": "calculator", "content": "127.05"},
        {"role": "assistant", "content": "15% of $847 is **$127.05**."}
    ])

    # Example: no tool needed
    examples.append([
        {"role": "system", "content": "You have access to: get_weather(city: str), search_web(query: str), calculator(expression: str)"},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."}
        # No tool call — model should answer directly from knowledge
    ])

    # Example: tool error handling
    examples.append([
        {"role": "system", "content": "You have access to: get_weather(city: str)"},
        {"role": "user", "content": "What's the weather in Atlantis?"},
        {
            "role": "assistant",
            "content": None,
            "tool_calls": [{"type": "function", "function": {"name": "get_weather", "arguments": '{"city": "Atlantis"}'}}]
        },
        {"role": "tool", "name": "get_weather", "content": '{"error": "City not found"}'},
        {"role": "assistant", "content": "I couldn't retrieve weather for Atlantis — it wasn't found in the weather service. Could you double-check the city name?"}
    ])

    return examples

raw_examples = build_training_examples()
dataset = Dataset.from_dict({"messages": raw_examples})
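
Hand-written examples tend to pick up small structural mistakes, such as malformed argument strings or a tool response with no matching call. The sketch below is one possible sanity check to run over `raw_examples` before training. `validate_example` is a hypothetical helper, and the rules it enforces are assumptions drawn from the message layout used in this article, not requirements documented by TRL.

```python
import json

VALID_ROLES = {"system", "user", "assistant", "tool"}


def validate_example(messages: list) -> list:
    """Return a list of problems found in one conversation; empty means it passed."""
    problems = []
    pending_calls = []  # tool names the assistant has called but that are not yet answered

    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
            continue

        if role == "assistant":
            for call in msg.get("tool_calls") or []:
                fn = call.get("function", {})
                pending_calls.append(fn.get("name"))
                try:
                    args = json.loads(fn.get("arguments", ""))
                except (TypeError, json.JSONDecodeError):
                    problems.append(f"message {i}: arguments for {fn.get('name')!r} are not valid JSON")
                    continue
                if not isinstance(args, dict):
                    problems.append(f"message {i}: arguments for {fn.get('name')!r} are not a JSON object")
        elif role == "tool":
            name = msg.get("name")
            if name in pending_calls:
                pending_calls.remove(name)
            else:
                problems.append(f"message {i}: tool response {name!r} has no matching call")

    if pending_calls:
        problems.append(f"unanswered tool calls: {pending_calls}")
    if messages and messages[-1].get("role") != "assistant":
        problems.append("conversation does not end with an assistant message")
    return problems
```

Running `[ex for ex in raw_examples if validate_example(ex)]` gives the examples that need attention.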

QLoRA Fine-Tuning with SFTTrainer

import torch
from transformers import AutoTokenizer, AutoModelForImageTextToText, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer, SFTConfig

model_id = "google/gemma-4-E4B-it"

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config — target the attention and MLP projection layers
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

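# Standard QLoRA preparation: freezes the quantized base weights, upcasts the
# remaining small layers (e.g. norms) and enables gradient checkpointing
from peft import prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)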
model.print_trainable_parameters()
# trainable params: ~20M || all params: ~8B || trainable%: ~0.25%

# Training config
training_config = SFTConfig(
    output_dir="./gemma4-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,        # effective batch size: 8
    learning_rate=2e-4,
    warmup_ratio=0.05,
    lr_scheduler_type="cosine",
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
    max_length=4096,                      # called max_seq_length in older TRL releases
    dataset_text_field=None,              # we're using messages format
    remove_unused_columns=False,
)

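trainer = SFTTrainer(
    model=model,                  # already wrapped by get_peft_model above, so peft_config is not passed again
    args=training_config,
    train_dataset=dataset,
    processing_class=tokenizer,   # `tokenizer=` was renamed to `processing_class=` in TRL 0.12;
                                  # for datasets that contain images, an AutoProcessor is likely needed here instead
)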

trainer.train()
trainer.save_model("./gemma4-finetuned")
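
With datasets this small, it helps to know how many optimizer steps the config above actually produces. The arithmetic below is a rough sketch: `training_schedule` is a hypothetical helper, and the exact step count reported by the trainer may differ slightly depending on how a given library version rounds the last partial batch.

```python
import math


def training_schedule(num_examples: int, per_device_batch_size: int, grad_accum_steps: int,
                      epochs: int, warmup_ratio: float, num_devices: int = 1) -> dict:
    """Approximate the optimizer-step budget implied by a training config."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch_size": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Values taken from the SFTConfig above, with 500 training examples assumed:
# 63 steps per epoch, 189 total steps, 10 warmup steps
print(training_schedule(500, per_device_batch_size=1, grad_accum_steps=8,
                        epochs=3, warmup_ratio=0.05))
```

At the low end of the dataset range (100 examples) the same config yields only about 39 optimizer steps, so `logging_steps=10` produces just a handful of log points; lowering it may give a more readable loss curve.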

Memory requirements on E4B:

  • 4-bit quantized base model: ~4GB
  • LoRA adapters + optimizer states: ~6GB
  • Activations + gradient checkpointing: ~4GB
  • Total: ~14GB — fits on a 16GB consumer GPU
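
The weight-related figures can be sanity-checked with simple arithmetic. The sketch below uses the approximate parameter counts quoted in this article (about 8B total, about 20M trainable) and deliberately ignores quantization metadata, optimizer state, activations and framework overhead, so treat it as a lower bound on the weight footprint rather than a memory profile.

```python
def tensor_bytes(num_params: float, bits_per_param: float) -> float:
    """Raw storage for a set of parameters, ignoring any metadata overhead."""
    return num_params * bits_per_param / 8


# Approximate counts quoted in this article (assumptions, not measured values)
TOTAL_PARAMS = 8e9
TRAINABLE_LORA_PARAMS = 20e6

base_4bit_gb = tensor_bytes(TOTAL_PARAMS, 4) / 1e9            # NF4 base weights
adapter_fp32_mb = tensor_bytes(TRAINABLE_LORA_PARAMS, 32) / 1e6  # adapter stored in fp32

print(f"4-bit base weights: ~{base_4bit_gb:.1f} GB")
print(f"LoRA adapter (fp32): ~{adapter_fp32_mb:.0f} MB")
```

These come out at roughly 4 GB for the quantized base and 80 MB for the adapter, which lines up with the first bullet above and with the adapter size quoted later in this article.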

Evaluating Tool Call Accuracy

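After fine-tuning, measure three things: whether the model selects the expected tool, whether its arguments parse as JSON, and whether it invents tools that are not in the schema: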

from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="./gemma4-finetuned",
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

import json
import re

def evaluate_tool_calling(test_cases: list[dict]) -> dict:
    results = {"correct_tool": 0, "correct_args": 0, "no_hallucination": 0, "total": 0}

    for case in test_cases:
        # With chat-style input the pipeline returns the whole conversation;
        # the newly generated assistant turn is the last message.
        generated = pipe(case["messages"], max_new_tokens=256)[0]["generated_text"]
        response = generated[-1]["content"] if isinstance(generated, list) else generated
        response = response or ""

        # Check: did it call the right tool? (substring match, so treat this as a rough signal)
        expected_tool = case["expected_tool"]
        called_right_tool = expected_tool in response if expected_tool else "tool_calls" not in response

        # Check: were the arguments well-formed JSON?
        args_match = re.search(r'"arguments":\s*"({.*?})"', response)
        # When no call is expected and none was emitted, there is nothing to validate
        valid_args = args_match is None and not expected_tool
        if args_match:
            try:
                json.loads(args_match.group(1).encode().decode('unicode_escape'))
                valid_args = True
            except ValueError:  # covers json.JSONDecodeError and UnicodeDecodeError
                valid_args = False

        # Check: did it hallucinate a tool not in the schema?
        available_tools = case.get("available_tools", [])
        hallucinated = any(
            name not in available_tools
            for name in re.findall(r'"name":\s*"(\w+)"', response)
        )

        results["total"] += 1
        results["correct_tool"] += int(called_right_tool)
        results["correct_args"] += int(valid_args)
        results["no_hallucination"] += int(not hallucinated)

    total = max(results["total"], 1)  # avoid dividing by zero on an empty test set
    return {k: v / total for k, v in results.items() if k != "total"}

metrics = evaluate_tool_calling(test_cases)
print(f"Correct tool selection: {metrics['correct_tool']:.1%}")
print(f"Valid argument JSON:    {metrics['correct_args']:.1%}")
print(f"No hallucinated tools:  {metrics['no_hallucination']:.1%}")
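
The evaluation function reads three keys from each test case: `messages`, `expected_tool` and `available_tools`. The article does not define `test_cases`, so the sketch below shows one plausible shape. The `make_case` helper and the prompts are hypothetical, and the cases should be held out from the training set.

```python
TOOLS_PROMPT = (
    "You have access to: get_weather(city: str), "
    "search_web(query: str), calculator(expression: str)"
)
AVAILABLE_TOOLS = ["get_weather", "search_web", "calculator"]


def make_case(user_text: str, expected_tool):
    """Build one evaluation case; expected_tool=None means 'answer without a tool'."""
    return {
        "messages": [
            {"role": "system", "content": TOOLS_PROMPT},
            {"role": "user", "content": user_text},
        ],
        "expected_tool": expected_tool,
        "available_tools": list(AVAILABLE_TOOLS),
    }


# Hypothetical held-out prompts covering tool use and the no-tool policy
test_cases = [
    make_case("What is 12.5% of 2,400?", "calculator"),
    make_case("Is it raining in Lisbon right now?", "get_weather"),
    make_case("Who wrote Pride and Prejudice?", None),
]
```

Keeping at least a few `expected_tool=None` cases in the set matters, because the "answer directly when no tool is needed" policy is one of the behaviors fine-tuning is meant to stabilize.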

What Fine-Tuning Buys You Here

Gemma 4's base function-calling capability (86.4% agentic tool use on benchmarks) is already strong. Fine-tuning is worth doing when:

Your tool schema is unusual. If your tools have nested objects, enum parameters, or optional fields that the base model handles inconsistently, SFT on your schema stabilizes behavior.

You need edge case control. "When no tool is needed, answer directly" is a policy decision. "When the tool returns an error, do X not Y" is a policy decision. Fine-tuning encodes these policies reliably.

You have domain-specific tool semantics. A create_report function in your system means something specific to your domain. The base model doesn't know that.

You need multimodal tool outputs. If your pipeline includes tools that return images (charts, rendered documents, screenshots), the TRL multimodal support is the only path to training on those trajectories.
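
To make the "unusual schema" case concrete: the sketch below defines a hypothetical schema for the `create_report` tool mentioned above, with a nested object, enum parameters and an optional field, plus a deliberately minimal checker for generated arguments. The schema, the field names and `check_args` are all illustrative assumptions, and the checker covers only a small subset of JSON Schema; a full validator library would be the better choice in production.

```python
import json

# Hypothetical schema: nested object, enums, and an optional "filters" field
CREATE_REPORT_SCHEMA = {
    "type": "object",
    "required": ["title", "format"],
    "properties": {
        "title": {"type": "string"},
        "format": {"type": "string", "enum": ["pdf", "html"]},
        "filters": {
            "type": "object",
            "required": ["quarter"],
            "properties": {
                "quarter": {"type": "string", "enum": ["Q1", "Q2", "Q3", "Q4"]},
                "region": {"type": "string"},
            },
        },
    },
}

PY_TYPES = {"string": str, "object": dict, "array": list, "boolean": bool}


def check_args(value, schema: dict, path: str = "args") -> list:
    """Return a list of mismatches between parsed arguments and the schema."""
    expected = PY_TYPES.get(schema.get("type"))
    if expected and not isinstance(value, expected):
        return [f"{path}: expected {schema['type']}"]

    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: {value!r} not in {schema['enum']}")
    if isinstance(value, dict):
        props = schema.get("properties", {})
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: missing required field")
        for key, sub_value in value.items():
            if key not in props:
                errors.append(f"{path}.{key}: unexpected field")
            else:
                errors.extend(check_args(sub_value, props[key], f"{path}.{key}"))
    return errors


generated = '{"title": "Q1 revenue", "format": "docx", "filters": {"region": "EU"}}'
print(check_args(json.loads(generated), CREATE_REPORT_SCHEMA))
```

Running a check like this over base-model outputs before fine-tuning shows which fields the model gets wrong most often, which in turn tells you where to concentrate training examples.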


Exporting the LoRA Adapter

# Merge LoRA into base model for deployment
from peft import PeftModel

base_model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype=torch.bfloat16)
merged_model = PeftModel.from_pretrained(base_model, "./gemma4-finetuned")
merged_model = merged_model.merge_and_unload()
merged_model.save_pretrained("./gemma4-merged")
tokenizer.save_pretrained("./gemma4-merged")

# Or keep as adapter for smaller disk footprint
# Just load the base + apply the adapter at inference time

The LoRA adapter is ~80MB. The base model is ~8GB. For deployment, keeping them separate is often more practical — you can swap adapters for different tool schemas without reloading the base model.
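
A minimal sketch of the keep-them-separate option, assuming PEFT's `PeftModel.from_pretrained`, `load_adapter` and `set_adapter` methods. `load_with_adapters` is a hypothetical helper, and the second adapter path in the usage comment is made up for illustration.

```python
def load_with_adapters(base_model_id: str, adapters: dict):
    """Load the base model once and attach one LoRA adapter per tool schema.

    `adapters` maps an adapter name to the directory it was saved in.
    """
    if not adapters:
        raise ValueError("at least one adapter is required")

    # Heavy imports are kept local so the helper can be defined without a GPU stack
    import torch
    from peft import PeftModel
    from transformers import AutoModelForImageTextToText

    base = AutoModelForImageTextToText.from_pretrained(
        base_model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    names = list(adapters)
    model = PeftModel.from_pretrained(base, adapters[names[0]], adapter_name=names[0])
    for name in names[1:]:
        model.load_adapter(adapters[name], adapter_name=name)
    model.set_adapter(names[0])  # start with the first adapter active
    return model


# Usage (paths are illustrative; "./gemma4-support-tools" is hypothetical):
# model = load_with_adapters(
#     "google/gemma-4-E4B-it",
#     {"analytics": "./gemma4-finetuned", "support": "./gemma4-support-tools"},
# )
# model.set_adapter("support")  # switch tool schema without reloading the base
```

Switching with `set_adapter` changes which adapter weights are active while the base weights stay in memory, which is what makes per-schema adapters practical.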


The Combination That Makes This Interesting

Gemma 4 with native function calling + thinking mode + multimodal tool outputs + fine-tuning for your specific schema is a stack that simply didn't exist six months ago in open-weight form.

An agent that reasons about image outputs from tools, fine-tuned to your internal APIs, running locally with no data leaving your infrastructure: that's the practical combination this tutorial enables.

The pieces are all available. This is how you connect them.