Software Engineer, Agent Evaluation and Harness

Engineering · Full-time · Stockholm, Sweden

About Lemonado

Lemonado is the AI-native data and intelligence platform for marketing teams. We unify data across every marketing and business tool into one workspace, where humans and AI agents work together with full context. Marketing agencies and in-house teams use Lemonado to replace the patchwork of dashboards, reporting tools, and data pipelines they've been stitching together for years.

Under the hood: a proprietary unified SQL engine that queries across every marketing and business platform simultaneously, an enterprise-grade OLAP database that slices millions of rows in milliseconds, real-time data sync, networks of background agents that run autonomously, and MCP connectors that let any AI tool talk to agency and in-house data securely. The AI layer on top is what turns this into magic. That's where you come in.

The role

This is an engineering role, not a research one. You'll own the agent and eval layer of the product: Chat, Studio, background agents, MCP connectors. All of these are hard problems in production AI: agents that behave reliably across thousands of customers, evals that catch regressions before customers do, retrieval that stays grounded in real data, latency and cost that hold up under load.

The work splits roughly into the agent harness (orchestration, tool-use, guardrails, model selection, the patterns that make multi-step work reliable) and the quality loop around it (datasets, scorers, replay, dashboards, the things that turn vague "is this any good" questions into something we can actually measure).

This is freedom under responsibility. You'll get a big token budget, the best tools, and room to use them however you want. In return we expect you to ship a lot.

What you'll work on

  • Build the agent harness: orchestration, tool-use, structured outputs, plan-then-execute patterns, fallback routing, guardrails, cost and latency tracking.

  • Tune the prompts, retrieval, and grounding behind chat, Studio, and the background agents that run autonomously across customer data.

  • Design the evaluation system. Curated datasets, offline replay, judges and scorers, regression alerts, dashboards.

  • Build feedback loops from real usage. Collect and analyze user signals, and turn them into concrete model and system changes.

  • Build debugging tools for agent behavior. Trace failures and surface patterns so we catch problems early.

  • Improve retrieval over connected customer data: chunking, embeddings, context window management, grounding answers in live SQL results.

  • Work on MCP connectors so external AI tools (Claude, ChatGPT, Cursor, n8n) can talk to customer data safely.

  • Stay close to model releases. Decide what to adopt, what to skip, where to swap providers, where to fine-tune.

Who you are

  • You've shipped production LLM applications that real users depend on, and dealt with the failure modes that come with that.

  • You've built or operated evaluation systems before. Eval datasets, scorers, regression suites, experimentation platforms, or comparable quality tooling. You're comfortable taking a vague quality question and turning it into something measurable.

  • Deep, opinionated familiarity with LLM APIs (OpenAI, Anthropic, Gemini), prompting, function calling, structured outputs, and modern agent patterns.

  • Good at managing context. You know how to give an agent the right information, prune what doesn't matter, and keep long-running work coherent.

  • Comfortable with the messy edge of production AI: prompt injection, hallucination, latency, cost. You design for failure from the start.

  • Strong software fundamentals. Fluent in Python and TypeScript, or in one of them and willing to pick up the other fast.

  • You learn fast and you're happy to take on whatever needs doing.

  • Ambitious. You want to ship a lot and grow with the company.

Details

  • Full-time, Stockholm.

  • Competitive salary and meaningful equity.

  • Work directly with the co-founders.
