GEPA Optimization
GEPA is a text optimizer for prompts, instructions, tool descriptions, code, and other textual system parameters. It works best when an evaluator returns both a score and enough diagnostic feedback for a reflection model to explain what went wrong and propose a better candidate. That diagnostic payload is GEPA's Actionable Side Information (ASI).
fast-agent is a good fit for GEPA loops because it gives you the runtime pieces that normally take the longest to build:
- batch and structured runs over repeatable datasets;
- AgentCards that package prompts, tools, request parameters, and model choices;
- local Python function_tools for cheap tool-call simulations;
- tool_result_mode: passthrough for routing evals where the tool call itself is the result, so the model does not spend another turn summarizing fake tool output;
- provider aliases, prompt caching, telemetry, and result JSONL you can score outside the model.
The core pattern is:
- Put the text GEPA should mutate in a prompt file, AgentCard, tool docstring, schema description, or another file.
- Evaluate each candidate with fast-agent.
- Score the output deterministically where possible.
- Return a scalar score plus ASI that names the failures, evidence, and useful edits.
- Let GEPA propose another candidate and repeat.
Start with a batch evaluator
For data labeling, extraction, routing, grading, and classification tasks, use
fast-agent batch run as the evaluator body.
fast-agent batch run \
--input eval/input.jsonl \
--output runs/candidate-001/results.jsonl \
--instruction runs/candidate-001/instructions.md \
--template eval/template.md \
--schema eval/output.schema.json \
--model "responses.gpt-5.4-mini?service_tier=flex" \
--parallel 8 \
--include-input \
--telemetry runs/candidate-001/telemetry.jsonl \
--summary-output runs/candidate-001/summary.json
Use the batch output as GEPA evidence. A checker can compare every row against gold labels, parse failures, or business rules, then write ASI like this:
{
  "scores": {
    "gepa_score": 0.72,
    "exact_match": 0.68,
    "valid_json": 0.96
  },
  "failures": [
    {
      "id": "row-17",
      "type": "confused_label",
      "expected": "billing",
      "actual": "account_access",
      "evidence": "The row asks about an invoice PDF, not sign-in."
    }
  ],
  "actionable_feedback": [
    "Add an explicit billing-vs-account boundary rule near the label list.",
    "Keep the JSON fields unchanged; failures are semantic, not schema-related."
  ]
}
The ASI should be compact but specific. Include row IDs, expected and actual behavior, excerpts, checker messages, and the candidate artifact path. Avoid returning only a score; GEPA needs the reason for the score to make useful edits.
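As a minimal sketch of such a checker, the functions below compare predicted labels against gold labels and emit failure records shaped like the ASI example above. The row shape (`id`, plus an `output` object with a `label` field) and the names `score_rows` and `load_jsonl` are assumptions for illustration, not the documented fast-agent result format; adapt the field access to the JSONL your runs actually produce.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any


def load_jsonl(path: Path) -> list[dict[str, Any]]:
    """Read one JSON object per non-empty line."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def score_rows(
    results: list[dict[str, Any]], gold: dict[str, str]
) -> tuple[float, list[dict[str, Any]]]:
    """Return exact-match accuracy and one failure record per wrong row.

    Assumed (hypothetical) row shape: {"id": ..., "output": {"label": ...}}.
    """
    failures: list[dict[str, Any]] = []
    correct = 0
    for row in results:
        row_id = row.get("id")
        expected = gold.get(row_id)
        output = row.get("output")
        actual = output.get("label") if isinstance(output, dict) else None
        if expected is None:
            kind = "missing_gold"  # no gold label exists for this row
        elif actual is None:
            kind = "invalid_output"  # output missing or not a labeled object
        elif actual == expected:
            correct += 1
            continue
        else:
            kind = "confused_label"
        failures.append(
            {"id": row_id, "type": kind, "expected": expected, "actual": actual}
        )
    score = correct / len(results) if results else 0.0
    return score, failures
```

Keeping the failure type separate from the expected and actual values lets the reflection model distinguish schema problems from semantic confusion.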
Run structured labeling first
GEPA loops often need labels or validation data before optimization starts.
fast-agent batch run can produce structured labels from a stronger model, then
you can audit or override them before using them as ground truth.
fast-agent batch run \
--input raw-examples.jsonl \
--output labels/strong-model-labels.jsonl \
--instruction labels/labeler.md \
--template labels/template.md \
--schema labels/label.schema.json \
--model "responses.gpt-5.5"
A practical workflow is:
- generate structured labels with a strong model;
- review a sample and write explicit overrides for ambiguous rows;
- freeze the label file before starting GEPA;
- score all candidates with the same frozen labels.
This keeps the optimizer from chasing a moving target.
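One way to make the freeze step concrete is to merge reviewed overrides into the model labels, serialize them in a stable order, and record a digest that every scoring run can verify. The sketch below assumes a simple `{"id": ..., "label": ...}` row shape and a hypothetical helper name `freeze_labels`; neither comes from fast-agent or GEPA.

```python
from __future__ import annotations

import hashlib
import json
from typing import Any


def freeze_labels(
    model_labels: list[dict[str, Any]], overrides: dict[str, str]
) -> tuple[str, str]:
    """Apply human overrides, then serialize labels in a stable order.

    Returns the frozen JSONL text and its SHA-256 hex digest.
    The row shape {"id": ..., "label": ...} is an assumption for illustration.
    """
    merged = {row["id"]: row["label"] for row in model_labels}
    merged.update(overrides)  # reviewed overrides win over model labels
    lines = [
        json.dumps({"id": row_id, "label": merged[row_id]}, sort_keys=True)
        for row_id in sorted(merged)
    ]
    text = "\n".join(lines) + "\n"
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return text, digest
```

Storing the digest next to each candidate's score makes it easy to detect a run that was scored against a different label file.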
Optimize tool descriptions
Tool-routing evals are one of the fastest ways to use GEPA with fast-agent. You can expose production-shaped tools as local Python stubs, ask the model to make one tool call, and score the call name and arguments without performing any real network work.
def hub_repo_details(repo_id: str, files: list[str] | None = None) -> dict[str, object]:
    """Get metadata and selected file contents for one Hugging Face repository.

    Args:
        repo_id: Repository id such as `org/name`.
        files: Optional file paths to read from the repository.
    """
    return {"repo_id": repo_id, "files": files or []}
---
type: agent
name: tool_router
model: "$system.default"
request_params:
  tool_result_mode: passthrough
  max_iterations: 1
  parallel_tool_calls: false
function_tools:
  - ./tools.py:hub_repo_details
---
Call the most appropriate tool for the user's request.
Then run the card over your routing dataset:
fast-agent batch run \
--agent-card eval/tool-routing-agent.md \
--agent tool_router \
--input eval/routing-input.jsonl \
--output runs/candidate-001/results.jsonl \
--model "responses.gpt-5.4-mini" \
--parallel 8
For this style of eval, GEPA can mutate the card body, the tool docstrings, the argument descriptions, or all of them. Keep the scoring outside the model: parse the batch result JSONL, compare the observed tool call to expected alternatives, and return ASI that explains wrong routes, missing arguments, over-broad descriptions, or tool-local hygiene problems.
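A routing scorer along these lines could look like the sketch below. It assumes each observed call has already been parsed out of the result JSONL into a `{"name": ..., "arguments": ...}` dictionary; that shape, the partial-credit values, and the name `score_tool_call` are illustrative assumptions rather than fast-agent's output format.

```python
from __future__ import annotations

from typing import Any


def score_tool_call(
    observed: dict[str, Any] | None, alternatives: list[dict[str, Any]]
) -> tuple[float, str]:
    """Score one observed call against a list of acceptable calls.

    Assumed (hypothetical) call shape: {"name": str, "arguments": dict}.
    Returns 1.0 for an exact match, 0.5 for the right tool with wrong
    arguments, and 0.0 otherwise, plus a feedback string for the ASI.
    """
    if observed is None:
        return 0.0, "no tool call was made"
    name = observed.get("name")
    args = observed.get("arguments") or {}
    same_tool = [alt for alt in alternatives if alt["name"] == name]
    if not same_tool:
        accepted = sorted({alt["name"] for alt in alternatives})
        return 0.0, f"wrong route: called {name!r}, expected one of {accepted}"
    for alt in same_tool:
        if alt.get("arguments", {}) == args:
            return 1.0, "ok"
    # Explain the mismatch against the first acceptable call for this tool.
    wanted = same_tool[0].get("arguments", {})
    missing = sorted(set(wanted) - set(args))
    extra = sorted(set(args) - set(wanted))
    differing = sorted(k for k in set(wanted) & set(args) if wanted[k] != args[k])
    return 0.5, (
        f"right tool {name!r}, wrong arguments: "
        f"missing={missing} extra={extra} differing={differing}"
    )
```

The feedback string names the specific route or argument problem, which gives the reflection model something to edit in the docstring or argument description.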
Call fast-agent from GEPA
The smallest integration is a GEPA evaluator function that materializes a
candidate, runs fast-agent, scores the results, and returns (score, side_info).
from __future__ import annotations

import json
import subprocess
from pathlib import Path
from typing import Any

from gepa.optimize_anything import GEPAConfig, EngineConfig, ReflectionConfig, optimize_anything

ROOT = Path(__file__).resolve().parent


def evaluate(candidate: dict[str, str]) -> tuple[float, dict[str, Any]]:
    run_dir = ROOT / "runs" / f"candidate-{len(list((ROOT / 'runs').glob('candidate-*'))) + 1:03d}"
    run_dir.mkdir(parents=True, exist_ok=True)
    instruction_path = run_dir / "instructions.md"
    instruction_path.write_text(candidate["instructions"], encoding="utf-8")
    output_path = run_dir / "results.jsonl"

    proc = subprocess.run(
        [
            "fast-agent",
            "batch",
            "run",
            "--input",
            "eval/input.jsonl",
            "--output",
            str(output_path),
            "--instruction",
            str(instruction_path),
            "--template",
            "eval/template.md",
            "--schema",
            "eval/output.schema.json",
            "--model",
            "responses.gpt-5.4-mini",
        ],
        cwd=ROOT,
        text=True,
        capture_output=True,
        check=False,
    )

    score, failures = score_jsonl(output_path)
    side_info = {
        "scores": {"gepa_score": score},
        "candidate_dir": str(run_dir),
        "stdout_tail": proc.stdout[-2000:],
        "stderr_tail": proc.stderr[-2000:],
        "failures": failures[:20],
        "actionable_feedback": summarize_failures(failures),
    }
    (run_dir / "score.json").write_text(json.dumps(side_info, indent=2), encoding="utf-8")
    return score, side_info


result = optimize_anything(
    seed_candidate={"instructions": Path("seed/instructions.md").read_text(encoding="utf-8")},
    evaluator=evaluate,
    objective="Improve the instruction so the batch worker returns accurate structured labels.",
    config=GEPAConfig(
        engine=EngineConfig(max_metric_calls=24, cache_evaluation=True),
        reflection=ReflectionConfig(reflection_lm="openai/gpt-5"),
    ),
)

Path("runs/best-instructions.md").write_text(result.best_candidate["instructions"], encoding="utf-8")
Replace score_jsonl() and summarize_failures() with your deterministic
checker. The checker is the most important part of the loop: it should encode
the product decision you actually care about and explain failures in language a
reflection model can act on.
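For the feedback half of the checker, one plausible approach is to aggregate per-row failures into a few sentences that name the most common confusions. The sketch below assumes failure records carry `id`, `expected`, and `actual` keys with string values, as in the ASI example earlier; the aggregation strategy itself is an illustration, not something GEPA prescribes.

```python
from __future__ import annotations

from collections import Counter
from typing import Any


def summarize_failures(failures: list[dict[str, Any]], top_n: int = 3) -> list[str]:
    """Turn per-row failures into a few short, aggregate feedback lines."""
    # Count how often each (expected, actual) confusion pair occurs.
    pairs = Counter((f.get("expected"), f.get("actual")) for f in failures)
    feedback: list[str] = []
    for (expected, actual), count in pairs.most_common(top_n):
        # Cite up to three row IDs so the reflection model can find evidence.
        example_ids = [
            f.get("id")
            for f in failures
            if (f.get("expected"), f.get("actual")) == (expected, actual)
        ][:3]
        feedback.append(
            f"{count} row(s) expected {expected!r} but got {actual!r} "
            f"(e.g. {', '.join(map(str, example_ids))})."
        )
    return feedback
```

Grouping by confusion pair keeps the payload compact while still pointing at concrete rows.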
Use fast-agent for reflection
If you want GEPA's reflection calls to use the same fast-agent model aliases and
configuration as the rest of your project, wrap one-shot fast-agent go as a
callable reflection LM.
from __future__ import annotations

import subprocess
from pathlib import Path
from typing import Any


class FastAgentReflectionLM:
    def __init__(self, model: str, run_dir: Path) -> None:
        self.model = model
        self.run_dir = run_dir
        self.count = 0
        self.run_dir.mkdir(parents=True, exist_ok=True)

    def __call__(self, prompt: str | list[dict[str, Any]]) -> str:
        self.count += 1
        call_dir = self.run_dir / f"call-{self.count:03d}"
        call_dir.mkdir(parents=True, exist_ok=True)
        prompt_path = call_dir / "prompt.md"
        prompt_path.write_text(prompt if isinstance(prompt, str) else repr(prompt), encoding="utf-8")

        proc = subprocess.run(
            [
                "fast-agent",
                "go",
                "--prompt-file",
                str(prompt_path),
                "--model",
                self.model,
                "--quiet",
            ],
            text=True,
            capture_output=True,
            check=False,
        )
        (call_dir / "stdout.txt").write_text(proc.stdout, encoding="utf-8")
        (call_dir / "stderr.txt").write_text(proc.stderr, encoding="utf-8")
        if proc.returncode:
            raise RuntimeError(proc.stderr[-2000:])
        return proc.stdout
Pass an instance where GEPA accepts a language-model callable, or use a model string when GEPA's built-in provider integration is enough.
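The wrapper above writes `repr(prompt)` when it receives a message list, which is serviceable but hard for a model to read. If your GEPA setup does pass chat-style messages, a small renderer such as the hypothetical `render_prompt` below may produce a cleaner prompt file. It assumes each message is a dictionary with `role` and `content` keys, which is a common convention but not something this guide confirms.

```python
from __future__ import annotations

from typing import Any


def render_prompt(prompt: str | list[dict[str, Any]]) -> str:
    """Flatten a chat-style message list into plain text for a prompt file.

    Assumes (hypothetically) that messages use "role" and "content" keys.
    """
    if isinstance(prompt, str):
        return prompt
    parts: list[str] = []
    for message in prompt:
        role = str(message.get("role", "user"))
        content = message.get("content", "")
        if not isinstance(content, str):
            # Keep non-text parts visible rather than silently dropping them.
            content = repr(content)
        parts.append(f"## {role}\n\n{content}")
    return "\n\n".join(parts)
```

Calling `render_prompt(prompt)` in place of the inline conditional would leave the rest of the wrapper unchanged.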
Candidate hygiene
GEPA will optimize exactly what you score. Add guardrails to the evaluator when you need production-safe outputs:
- reject invalid JSON schemas, invalid AgentCards, or Python files that do not import;
- penalize tool descriptions that hide global routing policy inside one unrelated tool;
- enforce token, latency, or cost budgets as secondary scores;
- keep train and validation splits separate;
- write every candidate, score, ASI payload, stdout, stderr, and result JSONL to disk so a winning candidate can be audited before adoption.
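A few of these guardrails can be sketched as a pre-scoring check. The example below is an assumption-laden illustration: it uses `ast.parse` as a cheap syntax-only stand-in for a real import check, treats a character count as a crude proxy for a token budget, and hard-rejects any candidate with problems rather than applying a weighted penalty. The names `hygiene_problems` and `guarded_score` are hypothetical.

```python
from __future__ import annotations

import ast
import json


def hygiene_problems(
    python_source: str, schema_text: str, instruction: str, max_chars: int = 4000
) -> list[str]:
    """Return human-readable problems; an empty list means the candidate passes."""
    problems: list[str] = []
    try:
        ast.parse(python_source)  # syntax check only; does not execute imports
    except (SyntaxError, ValueError) as exc:
        problems.append(f"tool module does not parse: {exc}")
    try:
        schema = json.loads(schema_text)
        if not isinstance(schema, dict):
            problems.append("schema is not a JSON object")
    except json.JSONDecodeError as exc:
        problems.append(f"schema is not valid JSON: {exc.msg}")
    if len(instruction) > max_chars:
        problems.append(f"instruction is {len(instruction)} chars; budget is {max_chars}")
    return problems


def guarded_score(raw_score: float, problems: list[str]) -> float:
    """Hard-reject broken candidates instead of letting them score partially."""
    return 0.0 if problems else raw_score
```

Returning the problem strings in the ASI alongside the zeroed score tells the reflection model why the candidate was rejected.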
Quick demo pack
Install the starter pack:
The pack installs a small AgentCard, a toy dataset, a deterministic checker, and
scripts/gepa-run.py. After installation, run:
That command smoke-tests the seed candidate without requiring a long GEPA run.
Remove --evaluate-only when you are ready to install GEPA and run an
optimization loop.