IPI Framework

Published on Jun 18, 2025

Introduction

LLMs have replaced deterministic code with probabilistic conversation. This shift has blanketed the internet with a distinctly more “human” feel, but it also inherits our human communication quirks. Sometimes our words with AI are misunderstood; we talk past each other, we get frustrated, we send the wrong document, and misaligned actions are taken because we fail to communicate our intentions effectively. However, our community still judges success almost exclusively through static evals and leaderboard benchmarks — metrics that say little about whether an agent preserves a user’s intent once it starts reasoning, calling tools, and acting in the wild. These evals presume that the agent is acting towards satisfying the correct user intent, but getting to that point in the first place is often the biggest challenge. The Intent-Prompt-Intent (IPI) framework was built to help us understand how to think about these experiences more holistically, and to give us a more concrete way to experiment with the agentic experience — both as builders and as users.

The IPI pipeline is a concept highlighting how intent is translated, often imperfectly, across the human, model, and execution (tool) layers of agentic AI stacks. Dilution or misrepresentation of the initial human intent is a core technical, product, and alignment challenge. There is early tension around which layer to optimize (prompt engineering, model fine-tuning, or evaluation), and around what specifically to optimize once broader alignment issues enter the discussion. Building robust infrastructure to evaluate and close the loop on intent fidelity is still an open frontier. We can model this pipeline mathematically to provide structure.

Formalization of the IPI Pipeline

Let:

  • $\mathcal{I}_h$: the space of internal human intents (latent, abstract goal representations)
  • $\mathcal{P} \subset \mathcal{T}^{*}$: the set of valid prompts, where $\mathcal{T}$ is the token vocabulary and $\mathcal{T}^{*}$ is the set of all finite-length token sequences
  • $\mathcal{I}_m$: the model’s internal representation of intent (latent embedding space)
  • $\mathcal{E}$: the space of executable outcomes (could be token outputs, tool calls, structured actions, etc.)

We define the following functions:

  • $f_1 : \mathcal{I}_h \rightarrow \mathcal{P}$ (human intent to prompt)

  • $f_2 : \mathcal{P} \rightarrow \mathcal{I}_m$ (prompt to model intent)

  • $f_3 : \mathcal{I}_m \rightarrow \mathcal{E}$ (model intent to execution)

The full pipeline is the composition:

  • $f = f_3 \circ f_2 \circ f_1 : \mathcal{I}_h \rightarrow \mathcal{E}$
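
To make the composition concrete, here is a minimal Python sketch of the pipeline as three swappable callables. The type aliases are illustrative stand-ins of our own choosing for the spaces above, not real representations of latent intent.

```python
from typing import Callable

# Illustrative stand-ins for the IPI spaces (names are ours, not canonical):
HumanIntent = str   # a proxy for a point in I_h (the real thing is latent)
Prompt = str        # an element of P, a finite token sequence
ModelIntent = list  # a proxy for a point in I_m (e.g. a latent embedding)
Execution = dict    # an element of E (token output, tool call, structured action)

def compose_pipeline(
    f1: Callable[[HumanIntent], Prompt],
    f2: Callable[[Prompt], ModelIntent],
    f3: Callable[[ModelIntent], Execution],
) -> Callable[[HumanIntent], Execution]:
    """Return f = f3 ∘ f2 ∘ f1, the end-to-end map from intent to execution."""
    return lambda intent: f3(f2(f1(intent)))
```

Each intervention discussed below can be read as swapping out or wrapping one of these three callables.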

In practice, we cannot tweak $\mathcal{I}_h$, as this is the latent space of internal human intent. We can, however, tweak $f_1$ via good UX and behind-the-scenes work (contextual autocomplete, retrieval-augmented prompts, etc.) to supplement a prompt with external information.

While the underlying token vocabulary is fixed, we still influence $\mathcal{P}$ via prompt design; deeper shifts in the mapping from prompt to model intent ($f_2$) come from fine-tuning.

Lastly, we can modify $\mathcal{E}$ directly by augmenting tool usage with clever engineering, which we’ve done and will document in a future post.
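
As a concrete (and hypothetical) example of tweaking $f_1$ without touching the model, the sketch below builds the prompt from the user’s raw text plus retrieved context; `retrieve` is a placeholder for whatever retrieval backend is available, not a specific API.

```python
from typing import Callable

def augment_prompt(
    user_text: str,
    retrieve: Callable[[str], list],  # hypothetical search over the user's docs/history
    k: int = 3,
) -> str:
    """A minimal f1 intervention: supplement the raw request with external
    context before it ever reaches the model."""
    snippets = retrieve(user_text)[:k]
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Relevant context (retrieved automatically):\n"
        f"{context}\n\n"
        f"User request:\n{user_text}"
    )
```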


Intent Drift / Fidelity

We introduce a fidelity metric $D : \mathcal{I}_h \times \mathcal{E} \rightarrow \mathbb{R}_{\geq 0}$ that captures “intent drift”:

  • $D(I_h, f(I_h)) = \text{distance between original intent and a particular execution}$

If we are optimizing the pipeline, we aim to:

  • $\min_{f_1, f_2, f_3} \ \mathbb{E}_{I_h \sim \mathcal{I}_h} \left[ D\left(I_h, f(I_h)\right) \right]$
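
Because $\mathcal{I}_h$ is latent, any concrete $D$ has to work through proxies. One illustrative choice (not a canonical metric of the framework) is to embed a natural-language description of the goal and a summary of the resulting execution, then take a distance in embedding space:

```python
import math

def cosine_distance(u: list, v: list) -> float:
    """1 - cosine similarity; 0 means the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0

def intent_drift(goal_embedding: list, execution_embedding: list) -> float:
    """A proxy for D(I_h, f(I_h)): how far the executed outcome drifted from a
    description of the user's goal, measured in a shared embedding space.
    The embeddings would come from any off-the-shelf text encoder."""
    return cosine_distance(goal_embedding, execution_embedding)
```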

Notes

  • $\mathcal{I}_h$ and $\mathcal{I}_m$ are latent spaces. We can’t directly observe them (yet), but we may approximate them via proxies (e.g. human feedback, internal embeddings/observability research).
  • $\mathcal{P}$ and $\mathcal{E}$ are observable and thus serve as anchor points for evaluation.
  • Evaluation functions and feedback loops might define surrogate loss functions on observable projections of $\mathcal{E}$.
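
As a sketch of what such a surrogate loss might look like in practice, assuming interactions are logged as prompt/execution pairs with optional human feedback standing in for the unobservable $I_h$ (the `judge` scorer below is hypothetical):

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Interaction:
    """One observable trip through the pipeline."""
    prompt: str        # the element of P that entered the model
    execution: str     # a serialized element of E that came out
    human_rating: Optional[float] = None  # 1.0 = intent satisfied, 0.0 = missed

def surrogate_loss(log: list, judge: Callable[[str, str], float]) -> float:
    """Average a surrogate drift score over logged interactions, preferring human
    feedback and falling back to `judge`, a hypothetical automatic scorer of
    prompt/execution mismatch."""
    scores = [
        1.0 - item.human_rating if item.human_rating is not None
        else judge(item.prompt, item.execution)
        for item in log
    ]
    return sum(scores) / len(scores) if scores else 0.0
```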

Proximity to Human Intent

When applying IPI to various agentic systems, it is important to distinguish how much intent drift each agent type is expected to tolerate.

We therefore segment agent types into four distinct buckets (a small illustrative sketch follows the list):

Agent Categories by Intent Distance

  • Fully Autonomous Agents (FAAs)
    • Furthest from human intent
    • Operate on system prompts plus self-regulated refinement of later contextual information
    • Can diverge significantly from the initial intent, so enforcing tight intent drift is less critical
      • (We assume autonomy is granted precisely when large deviations are acceptable or recoverable)
  • Service Agents (SAs)
    • Two degrees of separation from user intent
    • Called primarily by other agents
    • Can tolerate moderate intent drift, but not as much as FAAs
  • Reactive Agents (RAs)
    • Close to human intent – ChatGPT is an example in most use cases
    • Handle broad but immediate requests
    • Low tolerance for intent drift
  • Precise Agents (PAs) / Co-Pilot Agents
    • Closest to human intent
    • Handle specific, well-defined tasks
    • 0 tolerance for intent drift
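
One way to make these buckets operational is to attach a drift budget to each agent type and gate execution on it. A minimal sketch, with placeholder (uncalibrated) thresholds of our own choosing:

```python
from enum import Enum

class AgentType(Enum):
    """The four buckets above, ordered by distance from human intent."""
    FULLY_AUTONOMOUS = "FAA"
    SERVICE = "SA"
    REACTIVE = "RA"
    PRECISE = "PA"

# Placeholder drift budgets; real values would have to be calibrated
# against a concrete drift metric D.
DRIFT_BUDGET = {
    AgentType.FULLY_AUTONOMOUS: 0.6,
    AgentType.SERVICE: 0.4,
    AgentType.REACTIVE: 0.2,
    AgentType.PRECISE: 0.0,
}

def should_escalate(agent_type: AgentType, measured_drift: float) -> bool:
    """Halt or hand back to the user once measured drift exceeds the agent's budget."""
    return measured_drift > DRIFT_BUDGET[agent_type]
```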

Our Work

We will detail our research findings, based on an agentic system we’ve built over the last month, in a later post; for now, we can summarize at a high level.

The work so far points to four open fronts.

First, where is the biggest intent leak through the IPI pipeline? We believe it is still the prompt-creation step ($f_1$), because context construction is fickle long before we hit the model’s intelligence ceiling, but it’s difficult to gather hard data here.

Second, we need a repeatable metric for “intent fidelity” that can turn into a public eval suite. In essence, $D(I_h, f(I_h))$ needs some work.

Third, we’re looking for tighter feedback loops that surface potential improvements as fast as possible. The difficulty here is finding good AI applications that allow us to clearly demarcate each step of the IPI pipeline.

Finally, we’re sketching the leanest possible infrastructure to run these experiments. We need the right amount of interaction data, a strong tool layer, and a deeper ability to peek into / modulate the internals of the inference step.

Conclusion

Agentic systems are primarily “intent functions” that take a latent human intention to a concrete execution step in the digital world. The IPI framework gives us language, and soon metrics, to spot intent drift across the entire flow. We’re sharing our thinking early because we believe in open-source development beyond engineering: by open-sourcing our thinking as much as possible, we can all benefit from a tighter suite of tools to build the AI-based future we all know is coming. The sooner we align on what “good” agents look like, the faster we’ll turn probabilistic conversation into dependable software.

