Curunir
A configurable agentic harness for building specialized digital assistants on open-source and closed-source frontier models.
A minimalist take on the modern agentic harness — same lineage as projects like hermes-agent and openclaw, stripped to the essentials and built to stay legible.
The architecture is deliberately minimal and model-agnostic: a small, principled set of primitive tools, skills as composable prompts, and structured markdown memory in place of heavyweight vector pipelines. That keeps the system legible, debuggable, and portable across providers — so the value compounds in the assistant's identity, skills, and memory rather than getting locked to a single model vendor.
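To make "structured markdown memory" concrete, here is a minimal sketch of the general idea, not Curunir's actual implementation. It assumes memory lives in a markdown document split by `## ` headings and that recall is a plain keyword match rather than a vector search. The names `parse_memory`, `recall`, and `MEMORY` are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict

# Hypothetical memory file: one "## " heading per topic (an assumed layout).
MEMORY = """\
## Preferences
- Replies should be concise.

## Projects
- Benchmark write-up due Friday.
"""


def parse_memory(markdown: str) -> Dict[str, str]:
    """Split a markdown document into {heading: body} on '## ' headings."""
    sections: Dict[str, str] = {}
    current = None
    body: list = []
    for line in markdown.splitlines():
        match = re.match(r"^##\s+(.*)$", line)
        if match:
            # Close the previous section before starting a new one.
            if current is not None:
                sections[current] = "\n".join(body).strip()
            current = match.group(1).strip()
            body = []
        elif current is not None:
            body.append(line)
    if current is not None:
        sections[current] = "\n".join(body).strip()
    return sections


def recall(sections: Dict[str, str], query: str) -> Dict[str, str]:
    """Return sections whose heading or body contains any query word."""
    words = set(query.lower().split())
    return {
        heading: text
        for heading, text in sections.items()
        if any(word in f"{heading} {text}".lower() for word in words)
    }


if __name__ == "__main__":
    print(recall(parse_memory(MEMORY), "benchmark"))
```

Because the memory is ordinary text, it can be read, diffed, and edited by hand, which is the legibility and debuggability trade-off the paragraph above describes.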
For builders, it's a fast path from "we have an LLM" to a deployed assistant that does specific work. The broader thesis: as frontier models commoditize, durable value moves up-stack into the harness, the skills, and the memory, which is exactly what Curunir is built to own.
Currently in closed beta. Reach out if you'd like access.
Can models other than the big cloud APIs actually drive an autonomous agent?
The frontier-lab APIs (Claude, GPT, Gemini) clearly work as agentic substrates — but how far down the stack does that capability extend? Open-weight cloud models? A 26B model running locally on a laptop? A 4B? We're trying to map where the agentic-loop capability cliff actually sits, what fails first when you cross it (tool selection, planning, memory recall, instruction adherence), and which models are surprisingly viable substrates for production assistants.
The methodology is deliberately simple: same Curunir harness, same 24 prompts across 8 categories (tool use, multi-step planning, memory retrieval, instruction following, skill orchestration, error recovery, output quality, efficiency), same system prompt. The only variable is the model. Claude Sonnet 4.6 is the baseline; every other model is compared prompt-by-prompt against it. It's a qualitative smoke test, not a leaderboard — directional results meant to expose capability gaps and failure modes, not produce rankings.
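The shape of that comparison can be sketched as follows. This is an illustrative outline under stated assumptions, not the actual test harness: a "model" is assumed to be any callable that takes a system prompt and a prompt and returns a transcript, the 24 prompts are assumed to be split evenly at 3 per category (the text only gives the totals), and the prompt texts are placeholders. `build_suite`, `run_suite`, and `pair_with_baseline` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# The eight categories named in the methodology.
CATEGORIES = [
    "tool use", "multi-step planning", "memory retrieval",
    "instruction following", "skill orchestration", "error recovery",
    "output quality", "efficiency",
]

# Assumption: a model is a callable (system_prompt, prompt_text) -> transcript.
ModelFn = Callable[[str, str], str]


@dataclass(frozen=True)
class Prompt:
    category: str
    text: str


def build_suite(prompts_per_category: int = 3) -> List[Prompt]:
    """Placeholder suite; 8 categories x 3 prompts gives 24 (even split assumed)."""
    return [
        Prompt(category, f"<{category} prompt {i + 1}>")
        for category in CATEGORIES
        for i in range(prompts_per_category)
    ]


def run_suite(model: ModelFn, system_prompt: str, suite: List[Prompt]) -> List[str]:
    """Run every prompt with the same system prompt; only the model varies."""
    return [model(system_prompt, prompt.text) for prompt in suite]


def pair_with_baseline(
    suite: List[Prompt], baseline: List[str], candidate: List[str]
) -> Dict[str, List[dict]]:
    """Group outputs by category for prompt-by-prompt, side-by-side review."""
    grouped: Dict[str, List[dict]] = {category: [] for category in CATEGORIES}
    for prompt, base_out, cand_out in zip(suite, baseline, candidate):
        grouped[prompt.category].append(
            {"prompt": prompt.text, "baseline": base_out, "candidate": cand_out}
        )
    return grouped
```

The sketch deliberately produces paired transcripts rather than a score, which matches the qualitative, non-leaderboard framing: a reviewer reads each candidate output next to the baseline output and notes what failed first.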