Curunir

A configurable agentic harness for building specialized digital assistants on open- and closed-source frontier models.

Closed Beta

A minimalist take on the modern agentic harness — same lineage as projects like hermes-agent and openclaw, stripped to the essentials and built to stay legible.

The architecture is deliberately minimal and model-agnostic: a small, principled set of primitive tools, skills as composable prompts, and structured markdown memory in place of heavyweight vector pipelines. That keeps the system legible, debuggable, and portable across providers — so the value compounds in the assistant's identity, skills, and memory rather than getting locked to a single model vendor.
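To make the "skills as composable prompts" and "structured markdown memory" idea concrete, here is a minimal sketch. It is not Curunir's actual API: `Skill`, `parse_memory`, and `build_system_prompt` are hypothetical names, and the memory layout (`## ` headings with `- ` bullets) is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    """Hypothetical shape: a skill is a named prompt fragment, nothing more."""
    name: str
    prompt: str


def parse_memory(markdown: str) -> dict[str, list[str]]:
    """Split a markdown memory file into {heading: [bullet, ...]}.

    Assumes '## ' headings and '- ' bullets (an illustrative layout only).
    """
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            current = stripped[3:].strip()
            sections[current] = []
        elif stripped.startswith("- ") and current is not None:
            sections[current].append(stripped[2:].strip())
    return sections


def build_system_prompt(
    identity: str, skills: list[Skill], memory: dict[str, list[str]]
) -> str:
    """Compose identity, skills, and memory into one plain-text system prompt."""
    parts = [identity]
    for skill in skills:
        parts.append(f"## Skill: {skill.name}\n{skill.prompt}")
    for heading, facts in memory.items():
        bullets = "\n".join(f"- {fact}" for fact in facts)
        parts.append(f"## Memory: {heading}\n{bullets}")
    return "\n\n".join(parts)
```

Because every piece is plain text, the assembled prompt can be printed, diffed, and sent to any provider unchanged, which is the portability argument in miniature.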

For builders, it's a fast path from "we have an LLM" to a deployed assistant that does specific work. For the broader thesis: as frontier models commoditize, durable value moves up-stack into the harness, the skills, and the memory — which is exactly what Curunir is built to own.

Currently in closed beta. Reach out if you'd like access.

Eval Series

Can models other than the big cloud APIs actually drive an autonomous agent?

The frontier-lab APIs (Claude, GPT, Gemini) clearly work as agentic substrates — but how far down the stack does that capability extend? Open-weight cloud models? A 26B model running locally on a laptop? A 4B? We're trying to map where the agentic-loop capability cliff actually sits, what fails first when you cross it (tool selection, planning, memory recall, instruction adherence), and which models are surprisingly viable substrates for production assistants.

The methodology is deliberately simple: same Curunir harness, same 24 prompts across 8 categories (tool use, multi-step planning, memory retrieval, instruction following, skill orchestration, error recovery, output quality, efficiency), same system prompt. The only variable is the model. Claude Sonnet 4.6 is the baseline; every other model is compared prompt-by-prompt against it. It's a qualitative smoke test, not a leaderboard — directional results meant to expose capability gaps and failure modes, not produce rankings.
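The prompt-by-prompt comparison can be sketched as follows. This is an illustration under stated assumptions, not the actual eval code: `compare_to_baseline` is a hypothetical helper, the two `run_*` callables stand in for the same harness driven by different models, and `passed` stands in for what the series describes as a qualitative judgment.

```python
from typing import Callable

Prompt = tuple[str, str]  # (category, prompt text)


def compare_to_baseline(
    prompts: list[Prompt],
    run_baseline: Callable[[str], str],
    run_candidate: Callable[[str], str],
    passed: Callable[[str, str], bool],
) -> dict[str, dict[str, int]]:
    """Tally, per category, which model passed each prompt.

    The harness, prompts, and judge are held fixed; only the model
    behind each run_* callable differs.
    """
    tally: dict[str, dict[str, int]] = {}
    for category, prompt in prompts:
        row = tally.setdefault(
            category,
            {"both": 0, "baseline_only": 0, "candidate_only": 0, "neither": 0},
        )
        base_ok = passed(prompt, run_baseline(prompt))
        cand_ok = passed(prompt, run_candidate(prompt))
        if base_ok and cand_ok:
            row["both"] += 1
        elif base_ok:
            row["baseline_only"] += 1
        elif cand_ok:
            row["candidate_only"] += 1
        else:
            row["neither"] += 1
    return tally
```

Keeping four buckets per category, rather than a single score, shows where a candidate fails relative to the baseline, which suits a smoke test better than a ranking.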

MiniMax M2.7: Another Cloud Model Goes Toe-to-Toe with Sonnet [closed-source cloud · vs Sonnet 4.6]
GLM-5 Turbo vs Sonnet 4.6: A Statistical Tie [open-source cloud (Zhipu AI) · vs Sonnet 4.6]
Kimi K2.5: When Path Hallucination Kills Agentic Tool Use [open-source cloud (Moonshot AI) · vs Sonnet 4.6]
Can a Local 26B Model Drive an Agentic Framework? [open-source local · Gemma 4 26B vs Sonnet 4.6]
Qwen3.6 Heretic on Apple Silicon: Tied with Sonnet, Failures in Different Places [open-source local · Qwen3.6 35B-A3B (M5 Pro) vs Sonnet 4.6]

→ all eval articles
