IPIE Framework
Our framework for understanding and improving the agentic experience.
Introduction
Unlike traditional software bugs, which stem from implementation errors, LLM systems often fail due to interpretation errors. Software bugs arise from a mismatch between specification and implementation; LLM bugs arise from a mismatch between what the user wanted and how the model interprets it.
Users talk past models, send the wrong document, or phrase requests ambiguously. Models might latch onto and over-index on the wrong clause, hallucinate constraints, or execute confidently on a misreading of the user's intention. And even when the model's internal representation is correct (confirmed by asking it), the execution layer (tool calls, structured outputs, downstream APIs) can introduce its own distortions.
Note: Agent is an overloaded term in the space. In this essay, an agent is any loop that looks something like: string -> LLM -> (optional) Tool Call -> LLM -> output
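The loop above can be sketched in a few lines of Python. The `llm` stub and `tools` dict below are stand-ins for a real model and tool layer, not any particular API:

```python
# Minimal sketch of the agent loop: string -> LLM -> (optional) tool call -> LLM -> output.
# The "model" here is a hard-coded stub so the loop itself is visible.

def llm(prompt: str) -> str:
    # Stub: a real system would run a forward pass here.
    if "weather" in prompt and "RESULT:" not in prompt:
        return "CALL_TOOL:get_weather"
    return "It is sunny."

tools = {"get_weather": lambda: "sunny, 21C"}

def agent(prompt: str) -> str:
    response = llm(prompt)                             # first LLM pass
    if response.startswith("CALL_TOOL:"):              # (optional) tool call
        result = tools[response.split(":", 1)[1]]()
        response = llm(f"{prompt}\nRESULT: {result}")  # second LLM pass over the result
    return response

print(agent("What's the weather?"))  # -> "It is sunny."
```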
IPIE is a conceptual framework that highlights points of intent transmutation across the human, model, and execution (tool) layers of agentic AI stacks. It gives builders and users of agentic systems a concrete, holistic structure for thinking through the entire LLM-based agent process. The goal of this formalization is to have a structure into which current research and interventions can be mapped.
Each component of this tripartite model is a pre-paradigmatic technical, product, and alignment challenge. Given this, there's tension around which layer to optimize: prompt (context) engineering, model fine-tuning, agentic frameworks (tool design / software architectures) or evaluation. There is additional tension about what to specifically optimize for when addressing broader alignment issues.
Building robust infrastructure to evaluate intent fidelity is still an open frontier. We can model this pipeline in mathematical notation.
Formalization of the IPIE Pipeline
Let:
- I: the space of internal human intents (latent, internal abstract goal representations)
- P ⊆ V*: the set of valid prompts, where V is the token vocabulary and V* is the set of all finite-length token sequences (constrained in length by context windows)
- M: the model's internal representation of the intent (latent, but theoretically knowable)
- E: the space of executable outcomes (technically equivalent to V* for LLM output specifically, but we include here any state-change that arises from a tool call)
Using these variables, we define:
- f₁ : I → P (human intent to prompt)
- f₂ : P → M (prompt to internal model intent)
- f₃ : M → E (model intent to execution)
The full pipeline is the composition:
F = f₃ ∘ f₂ ∘ f₁ : I → E
Function 1 -- f₁ : I → P
As I is the latent space of internal human intent, we cannot directly understand or describe the domain of f₁ precisely.
We can, however, adjust the codomain (P) and the overall function (f₁) via UX and behind-the-scenes work (contextual autocomplete, retrieval-augmented prompts, etc.). This function is where all traditional context-engineering practices intervene.
Any modification to an LLM's inputs (prompts) before the forward pass is an intervention on f₁. This stage is best understood as the encoding step of a noisy communication channel. The goal is to encode i ∈ I into p ∈ P as efficiently as possible: packing maximum information into minimal tokens while reducing the entropy of the distribution over intents given the prompt, that is, lowering the uncertainty about the intent that the prompt actually encodes.
Some technologies / practices that can impact this function:
- Prompt Management and Optimization (DSPy, BAML)
- RAG, Memory, Context Management and Optimization (Many, many solutions here)
While context-engineering is currently a critical part of the stack, we believe its role will diminish as models become better at interpolating intent. An LLM is in many cases a general tool that is only effective insofar as it can understand the user's goal, which may be phrased in idiosyncratic language that varies across users. If an LLM is over-optimized for a specific prompt language and is sensitive to perturbations, it loses power when prompted outside of that language.
The essential takeaway is that while the underlying token vocabulary V is fixed, we can still influence f₁ via context-engineering practices, with the goal of packing maximum information into minimum tokens while representing the user's intention precisely.
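As a toy illustration of an f₁ intervention, here is a minimal sketch of retrieval-augmented prompt assembly under a token budget. The whitespace `tokens` counter and the `build_prompt` helper are hypothetical stand-ins for a real tokenizer and retrieval stack:

```python
# Sketch: pack the user's request plus pre-ranked context snippets into a
# prompt without exceeding a hard token budget (a crude model of the
# "maximum information into minimum tokens" goal).

def tokens(text: str) -> int:
    return len(text.split())  # toy proxy for a real tokenizer

def build_prompt(request: str, snippets: list[str], budget: int = 50) -> str:
    parts = [request]
    used = tokens(request)
    for snip in snippets:            # assume snippets are pre-ranked by relevance
        cost = tokens(snip)
        if used + cost > budget:     # stop before blowing the context window
            break
        parts.append(snip)
        used += cost
    return "\n".join(parts)
```

For example, `build_prompt("Schedule a meeting with Sarah", ["Sarah is traveling Tuesday."])` prepends the request and appends as much context as the budget allows.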
Function 2 -- f₂ : P → M
f₂ is the function where the model maps the external symbol string (prompt) into its internal representation of intent. Any practice that affects model outputs by changing the forward pass itself (either via weights or model architectures) is an intervention on f₂.
In our noisy-channel coding theorem analogy, f₂ is simultaneously the channel and the decoder.
- As a noisy channel: the model is a stochastic transformer of the signal; the same prompt fed twice won't give identical activations or trajectories. This is a result of sampling, floating-point instability, and architectural nuances.
- As a decoder: the model is an active interpreter of the signal; the weights are effectively a giant decoder that tries to map the symbol string into a structured latent representation. That's the "decoding step."
Both of these processes are happening simultaneously in f₂, each informing the behavior of the other.
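The "noisy channel" half of this picture can be demonstrated with a toy sampling loop. The logit values below are made up; the point is only that identical inputs need not yield identical outputs:

```python
import math
import random

# Toy illustration of f2 as a noisy channel: the same logit vector,
# sampled repeatedly, does not always produce the same token.

def softmax(logits, temperature=1.0):
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(logits, temperature=1.0):
    probs = softmax(logits, temperature)
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.9, 0.1]   # two near-tied candidate tokens
draws = {sample(logits) for _ in range(200)}
print(draws)               # typically {0, 1}: sampling is stochastic

greedy = max(range(len(logits)), key=lambda i: logits[i])
print(greedy)              # 0: greedy decoding collapses the channel to a deterministic decoder
```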
In essence, a user has an internal representation of "what they want an agent to do," encodes this intention into a finite-length prompt, and feeds it to the LLM, which must form some representation of the trajectory it predicts is most useful given its understanding of that initial intention.
Given that M is a complex, high-dimensional space, we can assume it is latent and that we only measure it by proxies: neuron activations, embedding vectors, probe-extracted features.
Most of the work here is directly related to interpretability research and feature mapping, and the impacts on alignment are most pronounced in this section of the pipeline.
One difficulty in operationalizing this function arises from the difference between goals (as in RL environments) and intention as understood by a human. In RL environments, we can create a precise goal function or use an LLM-as-a-judge suite to approximate the more complex value hierarchy a human might apply to different outputs, but optimizing over these outputs causes problems if the goal function itself is flawed or poorly aligned with the underlying intention.
The model's goal is to maximize reward under certain constraints, but mapping those constraints back to the user's initial intent is difficult. It is often possible to accidentally create a learned set of competing goals that further complicates a model's predictive power.
Interventions at this level:
- Fine-Tuning (HuggingFace)
- Reinforcement Learning (Prime Intellect, Open Pipe)
- Vector Steering (Neuronpedia, Anthropic Research)
- Self-Prompting via token injection
- Attention Strategies and KV-Cache Manipulation
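As a toy sketch of the vector-steering idea listed above: add a scaled direction vector to a hidden-state vector during the forward pass. Real implementations hook specific transformer layers and derive the direction from contrastive activations; the vectors and the "refusal" axis here are purely illustrative:

```python
# Toy activation steering: nudge a hidden state along (or against) a
# feature direction. A negative alpha suppresses the feature.

def steer(hidden, direction, alpha=2.0):
    return [h + alpha * d for h, d in zip(hidden, direction)]

hidden = [0.2, -0.1, 0.5]
refusal_direction = [0.0, 1.0, 0.0]   # hypothetical "refusal" feature axis

suppressed = steer(hidden, refusal_direction, alpha=-2.0)
print(suppressed)  # second component pushed strongly negative
```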
Anthropic and Google DeepMind have led most of the flagship lab research here, but there are many other interesting projects emerging to confront the interpretability problem directly.
Function 3 -- f₃ : M → E
This function is, in our opinion, the most underserved component of the overall pipeline, though interest in these sorts of run-time interventions is increasing. Most current AI work has focused on f₁ and f₂, with context-engineering practices for the former and RL and interpretability work for the latter.
f₃ is the section of the pipeline that encapsulates everything post-logit in autoregressive models. To date, structured output is the only industry-standard practice for f₃, but there are many more research areas to explore.
It is possible to update the overall trajectory of LLM generation through interventions on f₃. We can think about this function as a way to guide the trajectory of a response once initialized.
To complete our noisy-channel analogy: if f₁ is encoding and f₂ is the channel-plus-decoder, then f₃ functions as error correction at the output layer. Just as classical error-correcting codes can recover from bit flips after transmission, interventions on f₃ can recover from intent drift after the model has committed to a trajectory. Schema constraints prevent syntactically invalid outputs; adaptive temperature can sharpen or soften confidence at decision points; logit processors can inject domain-specific priors that the model lacks.
This section is deliberately brief because it represents the least explored region of the pipeline. Most production systems deployed across industry treat everything post-softmax as fixed: sample, decode, return. We believe this is a missed opportunity. The tools for intervention exist (structured decoding, logit biasing, speculative sampling), but they're still mostly siloed in ML infrastructure and research rather than exposed as levers for intent alignment.
Our work, detailed below, sits primarily here. We treat f₃ as a first-class site of intervention, one that's tractable precisely because E is observable where M is not.
- Schema Constraints/Structured Output (OpenAI, Guidance)
- Adaptive Token Temperature (Meta)
- Model Routing (Arch Gateway, OpenAI)
- Logit Processors (NVIDIA, vLLM)
- Confidence Scoring (DeepConf)
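As a minimal sketch of the logit-processor style of f₃ intervention (loosely mirroring the processor interfaces exposed by libraries like vLLM, not any specific API): ban a token by driving its logit to −∞ before sampling. The vocabulary and logit values are toy stand-ins:

```python
import math

# Composable logit processors: each takes a logit vector and returns a
# modified one. Banning a token sets its logit to -inf, so neither greedy
# decoding nor sampling can ever select it.

VOCAB = {0: "yes", 1: "no", 2: "—"}   # token 2 is an unwanted em-dash

def ban_tokens(banned):
    def processor(logits):
        return [-math.inf if i in banned else l for i, l in enumerate(logits)]
    return processor

def apply_processors(logits, processors):
    for p in processors:
        logits = p(logits)
    return logits

logits = [1.0, 0.5, 3.0]              # the model "prefers" the banned token
out = apply_processors(logits, [ban_tokens({2})])
best = max(range(len(out)), key=lambda i: out[i])
print(VOCAB[best])  # -> "yes": greedy decode can no longer pick the banned token
```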
Intent Drift / Fidelity
We introduce a fidelity metric φ : I × E → [0, 1] that captures "intent drift":
drift(i) = 1 − φ(i, F(i))
If we are optimizing the pipeline, we aim to:
max over f₁, f₂, f₃ of E_{i ∼ I} [ φ(i, F(i)) ]
that is, to minimize expected drift across the pipeline.
A Simple Operationalization
Consider a calendar scheduling task. A user's latent intent might be: "find 30 minutes with my cofounder next week, preferably morning, avoiding Tuesday because she's traveling." The prompt might be: "Schedule a 30-min meeting with Sarah next week."
The execution could be any of:
- A Tuesday 2pm slot (high drift)
- A Wednesday 9am slot (low drift)
- A failed API call (undefined)
We can approximate φ by collecting human judgments on a Likert scale ("how well does this outcome match what you wanted?") across a held-out set of intent-execution pairs.
This gives us a noisy but tractable proxy for the latent distance. More sophisticated approaches might decompose intent into labeled facets (time preference, duration, participants) and measure alignment per-facet, then aggregate.
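A per-facet version of that scoring might look like the following sketch. The facet names, weights, and exact-match rule are illustrative assumptions, not a fixed schema:

```python
# Hypothetical per-facet fidelity: decompose intent into labeled facets,
# score each facet against the execution, and aggregate with weights.

def facet_fidelity(intent: dict, execution: dict, weights: dict) -> float:
    score, total = 0.0, 0.0
    for facet, wanted in intent.items():
        w = weights.get(facet, 1.0)
        score += w * (1.0 if execution.get(facet) == wanted else 0.0)
        total += w
    return score / total if total else 0.0

intent  = {"duration_min": 30, "participant": "Sarah", "day": "Wed", "slot": "morning"}
wed_9am = {"duration_min": 30, "participant": "Sarah", "day": "Wed", "slot": "morning"}
tue_2pm = {"duration_min": 30, "participant": "Sarah", "day": "Tue", "slot": "afternoon"}
weights = {"participant": 3.0, "duration_min": 2.0, "day": 2.0, "slot": 1.0}

print(facet_fidelity(intent, wed_9am, weights))  # 1.0  (low drift)
print(facet_fidelity(intent, tue_2pm, weights))  # 0.625 (high drift: wrong day and slot)
```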
The key difficulty is that i is only partially observable even to the user. They may not realize they cared about morning slots until offered an afternoon one. Any evaluation framework must account for this revealed-preference problem.
Notes
- I and M are latent spaces. We can't directly observe them (yet), but we may approximate them via proxies (e.g. human feedback, internal embeddings/observability research).
- P and E are observable and thus serve as anchor points for evaluation.
- Evaluation functions and feedback loops might define surrogate loss functions on observable projections of I and M.
Our Work
Our work operationalizes interventions through Quote, an inference modification engine that exposes the generation loop as a programmable surface.
Quote lets developers write mods -- small Python functions that hook into three points of the autoregressive generation cycle:
- Prefilled: Before the first forward pass. Inspect or rewrite the prompt before generation begins.
- ForwardPass: Immediately before sampling. Read or modify the logit distribution, ban tokens, or force specific continuations.
- Added: After each token is sampled. Observe what was generated and, critically, backtrack if the trajectory has drifted.
These hooks expose a set of core actions: adjust_prefill, adjust_logits, force_tokens, backtrack, and force_output. Together, they allow developers to treat inference as a stateful white box with intervention points.
Some concrete examples from our internal work:
- Token banning: A single line suppresses em-dashes or other stylistic artifacts by setting their logit to −∞ at every forward pass.
- Phrase replacement: When the model begins generating "I can't help with that," a mod detects the phrase, backtracks the offending tokens, and reinjects "I can help you with that:" to continue generation from there.
- JSON error correction: A mod accumulates output, parses for JSON blocks, and when a syntax error is detected, backtracks to the error-producing token, bans it, and lets the model resample.
- Inference skipping: For trivial conversational patterns ("hi," "thanks"), a mod can bypass the forward pass entirely and return a canned response, which saves compute without degrading user experience.
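The backtracking pattern behind the phrase-replacement example can be simulated in a self-contained loop. This is not Quote's actual API; it re-implements the "detect, backtrack, reinject" idea over word-level "tokens":

```python
# Toy backtracking loop: watch the output stream, and when a refusal
# prefix appears, delete those tokens and force a replacement continuation.

REFUSAL = ["I", "can't", "help"]
REPLACEMENT = ["I", "can", "help", "with", "that:"]

def generate(model_tokens):
    out = []
    for tok in model_tokens:
        out.append(tok)                       # "Added"-style hook: observe each token
        if out[-len(REFUSAL):] == REFUSAL:    # refusal prefix detected
            del out[-len(REFUSAL):]           # backtrack the offending tokens
            out.extend(REPLACEMENT)           # force the replacement continuation
    return out

print(" ".join(generate(["I", "can't", "help"])))  # -> "I can help with that:"
```

In a real system the replacement would be reinjected into the model's context so generation resumes from the corrected prefix, rather than appended verbatim.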
Our thesis is that interventions are under-explored partially because the tooling hasn't existed to experiment with them at scale. Quote is our attempt to change that. By exposing logits, token streams, and backtracking as first-class primitives, we hope to enable a broader research community to probe what's possible when you treat inference as a controllable process rather than a fixed pipeline.
Conclusion
Agentic systems are primarily "intent functions" that take a latent human intention to a concrete execution step in the digital world. The IPIE framework gives us language, and hopefully metrics, to spot intent drift across the entire flow. We're sharing this framework early because we think open-source should extend to thinking, not just code. The sooner we align on what "good" agents look like, the faster we'll turn probabilistic conversation into dependable software.