Evaluating multi-agent systems when ground truth is incomplete
Three eval modes — output, trajectory, and side effect — and the practical pattern we use when there's no single right answer.
I build applied AI/ML systems. Currently co-founding VEZRAN — agentic AI for security operations. Previously: Senior Data Scientist at Starbucks and FedEx. Half my career was pre-LLM ML.
Focused on multi-agent systems and RAG pipelines that survive contact with production. I write about the unglamorous parts — eval design, infrastructure, what actually ships.
Co-founder and Head of AI/ML. Multi-agent autonomy on top of an existing security stack, with audit-ready evidence for every action.
Transformer-based topic modeling, summarization, and semantic search over partner contact-center data.
Production ML across the logistics network. Identified at-risk shipments before they failed; quantified customer-level loss exposure.
Field notes on applied AI — agent eval, RAG in production, what the demos leave out. No threads. No hot takes.