Contract | 4 weeks | 15–25 hours/week | Remote (with availability to collaborate during the East Coast workday)
Start date: Immediately
About the project
We're helping a leading nonpartisan research and advocacy organization ship a public-facing AI chatbot. It helps education stakeholders explore school district-level teacher policy data through natural conversation: salary comparisons, observation requirements, compensation structures, peer district matching, and more.
The system is a multi-stage agent pipeline built on PydanticAI and FastAPI that we're refining for public launch. It uses structured handoffs between stages managed through Pydantic models, with a PostgreSQL data layer, a PHP/Tailwind frontend, Logfire observability, and containerized deployment on Azure. Evaluation uses pydantic-evals with a mix of deterministic validators and LLM-based judges.
We're in a 4-week sprint to get from "works most of the time" to "accurate enough to launch publicly." The core challenge: the pipeline currently relies on LLM judgment for decisions that should be deterministic, creating accuracy and consistency failures. The fixes are well-defined but touch multiple pipeline stages with interdependencies.
What we're looking for
Someone who can bring structure to this effort without slowing it down. You'd own sequencing and prioritization across the fix backlog — understanding which changes unblock others, which introduce regression risk, and how to organize testing around them. You're not writing the agent code, but you understand the architecture well enough to know that a change at one stage of the pipeline cascades through downstream stages — and you can manage that cascade.
Specifically, we need someone who has shipped AI or software products and understands that "turn feedback into tickets" isn't a process; it's just a backlog. You'll help us organize input from user testing, our own golden-scenario results, and the developer's implementation priorities into a coherent plan that updates as we learn. You'll also help create a sustainable cadence: what gets reviewed when, how we validate fixes without re-testing everything manually, and how we communicate progress to the client without drowning them in technical detail.
Experience with LLM-based systems is important — not because you'd be prompt-engineering, but because you need to understand non-determinism, evaluation strategies, the difference between "temperature 0" and "deterministic," and why accuracy in an AI system is a spectrum that requires its own measurement framework.
About Starling
Starling Strategy is a team of organizers, technologists, creatives, and strategists helping mission-driven organizations build solutions that drive advocacy and community growth. We're not a traditional AI firm, but we're fired up to learn as we go. We have a deep commitment to getting the work right, and a growing portfolio of clients building human-centered, AI-powered tools that need exactly the kind of structured, thoughtful delivery leadership this role provides.
Compensation: $6,000 for 4 weeks at 15–25 hours/week. We have other clients with similar AI delivery work in the pipeline, and there's a strong possibility of extending the engagement beyond this initial sprint.
To apply: Send your resume, an example of a similar project you’ve worked on, and your availability to hello@starlingstrategy.com.