Announcing Concordance: Mech Interp and Inference Mods

October 25, 2025

Token-level interventions give you control, reliability & observability

Custom Inference Controls & Applied Mechanistic Interpretability

We're excited to officially introduce Concordance. We want to share what we’ve discovered so far, as well as where we’re headed.

Concordance is building a software suite for applying mechanistic interpretability ("mech interp") techniques to AI systems, with the thesis that these techniques will improve control, reliability, and observability, and widen the design space for more complex applications. While much of the mech interp field focuses on alignment concerns, we are primarily focused on how these tools can improve the developer experience, the end-user experience, and the model's performance.


Our journey starts with granular token-level interventions, and we will soon release an SDK to build custom inference modifications (“mods”) that enable conditional forced tokens, backtracking, smarter tool-calling, per-token sampling strategies, and deep logit analytics.

The SDK is built around Events, Actions, Mods, and Flows. Events are emitted at key steps in the inference process (Prefill, ForwardPass, Sampled, and Added). Actions are responses that steer the inference process after each Event: AdjustPrefill, ForceTokens, AdjustLogits, ForceOutput, ForceToolCalls, and Backtrack. Mods are modules that ingest Events and return Actions; they can hold arbitrary state and be strung together with Flows to create complex inference-time steering.
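To make these shapes concrete, here is a minimal sketch of a per-token sampling strategy written as a Mod. It assumes only the event fields and ActionBuilder methods that appear in the examples below; treat the exact signatures as illustrative rather than final.

from collections import defaultdict

# Tokens generated so far, per request (illustrative bookkeeping).
generated_counts: dict[str, int] = defaultdict(int)

@mod
def early_sharpening(event, actions: ActionBuilder, tokenizer):
    if isinstance(event, Added):
        generated_counts[event.request_id] += len(event.tokens)
    if isinstance(event, ForwardPass) and generated_counts[event.request_id] < 50:
        # Scaling logits up is equivalent to sampling at a lower temperature,
        # so the first 50 tokens are sampled more conservatively.
        return actions.adjust_logits(event.logits * 2.0)
    return actions.noop()

The fuller examples below build on this same Event-in, Action-out pattern.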


Example 1

A simple token intervention: watch for simple math expressions and, when an "=" is generated, evaluate the expression and force the result as the next tokens.

Simple Token Intervention
import re
from collections import defaultdict
from typing import Any, Optional

# Matches simple arithmetic expressions that end in "=", e.g. "(5 - 1) * 234.22 ="
SIMPLE_MATH = re.compile(
    r'(?<!\w)'
    r'(?:[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?|\((?:[^()]*|\([^()]*\))*\))'
    r'(?:\s*[-+*/^]\s*'
    r'(?:[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?|\((?:[^()]*|\([^()]*\))*\)))+'
    r'\s*=(?!=)'
)

class Calculator:
    """
    A dummy example of finding and parsing simple math expressions in generated
    output and catching when the model is about to generate an answer.
    Looks for expressions like "(5 - 1) * 234.22 =" and calls eval on them.
    """
    def __init__(self):
        self.accumulated_text: dict[str, str] = defaultdict(str)

    def maybe_calculate(self, req_id: str) -> Optional[Any]:
        results = SIMPLE_MATH.findall(self.accumulated_text[req_id])
        if results:
            latest = results[-1]
            if latest.strip()[-1] == "=":
                # eval is fine for a dummy example; don't do this in production
                return eval(latest[:-1])
        return None

calc = Calculator()

@mod
def simple_calculator(event, actions: ActionBuilder, tokenizer):
    if isinstance(event, Added):
        generated_text = tokenizer.decode(event.tokens)
        calc.accumulated_text[event.request_id] += generated_text
        result = calc.maybe_calculate(event.request_id)
        if result is not None:
            return actions.force_tokens(tokenizer.encode(str(result)))
    return actions.noop()
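For a concrete trace: once the model has emitted "(5 - 1) * 234.22 = ", the regex matches the full expression, eval returns 936.88, and the mod forces the tokens for "936.88" instead of letting the model guess at the arithmetic.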

Example 2

Use the Concordance SDK's self-prompting techniques to route to inference-time tool calling (no client back-and-forth).

Agent Flows and Runtime Tool Calling
# user-defined tools
runtime_tools = [...]
tool_names = ", ".join(tool.name for tool in runtime_tools)

def call_runtime_tool(ctx: FlowState, actions: ActionBuilder, tokenizer):
    # pull the answer from the runtime_tool_router question
    tool_to_call = ctx.answers["runtime_tool_router"]
    # extract the matching tool from the list of tools available
    tool = next(t for t in runtime_tools if t.name == tool_to_call)
    # call the tool, which should return an action like noop(), force_tokens(), backtrack(), etc.
    # i.e. RAG injection, steering, etc.
    return tool.fn(actions, tokenizer)

runtime_tool_router = FlowQuestion(
    name="runtime_tool_router",
    prompt=f" Which of these tools should I call: {tool_names}?",
    responses=[tool.name for tool in runtime_tools],
    erase_mode="all",
).then(call_runtime_tool)

should_call_runtime_tool = FlowQuestion(
    name="runtime_tool_calls",
    # the "self-prompt" that the model generates
    prompt=f" Wait, I am augmented by tools that I can immediately call: {tool_names}. Should I call one to help the user?",
    # valid response tokens/phrases
    responses=["yes", "no"],
    # erase the self-prompt and response from the output once complete so they are not sent to the user
    erase_mode="all",
).on(
    # if the model answers "yes" to itself, advance to the passed-in question, in this case runtime_tool_router
    "yes", runtime_tool_router,
).on(
    # if the model answers "no", just go back to normal generation
    "no", None,
)

# Compiles the flows into a state machine that advances as events come in. In this example:
# should_call_runtime_tool -> runtime_tool_router -> call_runtime_tool
# Once it's done, normal generation continues.
ENGINE = FlowEngine(
    entry_question=should_call_runtime_tool,
)

@mod
def runtime_tools_mod(event, actions: ActionBuilder, tokenizer):
    if isinstance(event, (Prefill, ForwardPass, Added)):
        return ENGINE.handle_event(event, actions, tokenizer)
    return actions.noop()

Example 3

An implementation of "Reasoning with Sampling: Your Base Model is Smarter Than You Think" as a Mod.

Advanced AI Scaffolding: Reasoning With Sampling
import math
import random
from enum import Enum

# Assumes two small helpers: log_softmax(logits) returns log-probabilities over
# the vocabulary (e.g. torch.log_softmax(logits, dim=-1)), and bernoulli(p)
# returns True with probability p (e.g. random.random() < p).

class Phase(Enum):
    OLD = "old"        # initial collection of the block (base logits, old tokens)
    NEW = "new"        # propose a new suffix from pivot m with sharpened sampler
    REV = "rev"        # reverse-walk: score old suffix under NEW prefix at tau
    DECIDE = "decide"
    DONE = "done"

class ReasoningWithSamplingState:
    def __init__(self, alpha: float = 4.0, block_size: int = 192, nmcmc: int = 6):
        self.block_size: int = block_size
        self.alpha: float = alpha
        self.tau: float = 1.0 / alpha
        self.nmcmc: int = nmcmc
        self.phase: Phase = Phase.OLD

        self.base_logits_old: list = []   # list[Tensor], length B
        self.old_tokens: list[int] = []   # list[int], length B

        self.iter_idx: int = 0            # 0..nmcmc-1
        self.pivot_m: int = 0             # random pivot in [0, B-1]
        self.suf_len: int = 0             # suffix length = B - m

        self.base_logits_new: list = []   # base logits along NEW prefix (len = suf_len)
        self.new_tokens: list[int] = []   # proposed tokens for suffix (len = suf_len)
        self.logp_new_suf: float = 0.0    # Σ log p_base(new_t | new_prefix), t∈suffix
        self.logq_fwd: float = 0.0        # Σ log q(new_t | old_prefix+new_so_far), t∈suffix

        self.logq_rev: float = 0.0        # Σ log q(old_t | new_prefix+old_so_far), t∈suffix
        self.base_logits_rev_last = None  # last base logits observed in REV step
        self._rev_pos: int = 0            # 0..suf_len-1 (position inside suffix)

        self._logp_old_suf: float = 0.0

    def start_first_iteration(self):
        self.iter_idx = 0
        self._start_iteration()

    def _start_iteration(self):
        self.pivot_m = random.randint(0, self.block_size - 1)  # uniform pivot
        self.suf_len = self.block_size - self.pivot_m
        self.base_logits_new.clear()
        self.new_tokens.clear()
        self.logp_new_suf = 0.0
        self.logq_fwd = 0.0
        self.logq_rev = 0.0
        self.base_logits_rev_last = None
        self._rev_pos = 0
        self._logp_old_suf = 0.0
        self.phase = Phase.NEW

    def _finish_iteration_and_maybe_continue(self, actions):
        self.iter_idx += 1
        if self.iter_idx < self.nmcmc:
            self._start_iteration()
            return actions.backtrack(self.suf_len)
        else:
            self.phase = Phase.DONE
            return actions.noop()

    def decide_and_apply(self, actions):
        m = self.pivot_m
        self._logp_old_suf = 0.0
        for t in range(m, self.block_size):
            tok = self.old_tokens[t]
            self._logp_old_suf += log_softmax(self.base_logits_old[t])[tok]

        # Metropolis-Hastings log acceptance ratio for the p^alpha target
        logA = self.alpha * (self.logp_new_suf - self._logp_old_suf) + (self.logq_rev - self.logq_fwd)
        self.phase = Phase.DECIDE
        if bernoulli(min(1.0, math.exp(min(0.0, logA)))):
            self.old_tokens[m:] = list(self.new_tokens)
            self.base_logits_old[m:] = list(self.base_logits_new)
            return actions.backtrack(self.suf_len, self.new_tokens)
        else:
            return actions.noop()

RWS: dict[str, ReasoningWithSamplingState] = {}

@mod
def reasoning_with_sampling(event, actions: ActionBuilder, tokenizer):
    state = RWS.get(event.request_id)
    if state is None:
        state = ReasoningWithSamplingState()  # default alpha/block_size/nmcmc
        RWS[event.request_id] = state

    if isinstance(event, ForwardPass):
        logits = event.logits
        if state.phase == Phase.OLD:
            state.base_logits_old.append(logits)
        if state.phase == Phase.NEW:
            state.base_logits_new.append(logits)
            return actions.adjust_logits(logits / state.tau)
        if state.phase == Phase.REV:
            state.base_logits_rev_last = logits
        return actions.noop()

    if isinstance(event, Added):
        # ignore multi-token forced additions (e.g. tokens this mod forced itself)
        # except while deciding, where they drive the next iteration
        if len(event.tokens) > 1 and event.forced and state.phase != Phase.DECIDE:
            return actions.noop()
        if state.phase == Phase.OLD:
            tok = event.tokens[0]
            state.old_tokens.append(tok)
            if len(state.old_tokens) == state.block_size:
                state.start_first_iteration()
                return actions.backtrack(state.suf_len)
            return actions.noop()

        if state.phase == Phase.NEW:
            tok = event.tokens[0]
            state.new_tokens.append(tok)
            i = len(state.new_tokens) - 1
            state.logp_new_suf += log_softmax(state.base_logits_new[i])[tok]
            state.logq_fwd += log_softmax(state.base_logits_new[i] / state.tau)[tok]
            if len(state.new_tokens) == state.suf_len:
                state.phase = Phase.REV
                state._rev_pos = 0
                return actions.backtrack(state.suf_len)
            return actions.noop()

        if state.phase == Phase.REV:
            i = state._rev_pos
            if i < state.suf_len:
                old_tok = state.old_tokens[state.pivot_m + i]
                logits_here = state.base_logits_rev_last
                state.logq_rev += log_softmax(logits_here / state.tau)[old_tok]
                state._rev_pos += 1
                if state._rev_pos == state.suf_len:
                    state.phase = Phase.DECIDE
                    return state.decide_and_apply(actions)
                else:
                    return actions.force_tokens([old_tok])
            return actions.noop()

        if state.phase == Phase.DECIDE:
            return state._finish_iteration_and_maybe_continue(actions)
    return actions.noop()
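A note on the acceptance step: decide_and_apply implements a standard Metropolis-Hastings test. The chain targets the base distribution sharpened to p^alpha, with the temperature-tau sampler q as the proposal, which is exactly the quantity logA = alpha * (logp_new_suf - logp_old_suf) + (logq_rev - logq_fwd) accumulated above.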

Once you treat LLMs like complete programs, many more control surfaces appear. At the inference level, you can deterministically guarantee outputs under programmed conditions, alter style and constraints in real time (without a multi-step agent loop), and make LLM applications behave like engineered systems.
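As one concrete illustration, altering constraints in real time can be as small as a logit-level mod. The sketch below suppresses a blocklist of phrases on every forward pass; BLOCKED_PHRASES is our hypothetical example, and we assume logits arrive as a 1-D torch-style tensor over the vocabulary, as in Example 3.

BLOCKED_PHRASES = ["password", "secret key"]  # hypothetical constraint list

@mod
def blocklist(event, actions: ActionBuilder, tokenizer):
    if isinstance(event, ForwardPass):
        logits = event.logits.clone()  # assumes a torch-like tensor
        for phrase in BLOCKED_PHRASES:
            # crude: bans every token in the phrase's encoding individually
            for tok in tokenizer.encode(phrase):
                logits[tok] = float("-inf")  # token can never be sampled
        return actions.adjust_logits(logits)
    return actions.noop()

Because the rule runs inside the sampler, it holds for every token, with no retry loop or guard model in front of the output.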

We believe the industry is mostly overlooking the improvements available at the inference level beyond cost and speed optimizations. We're excited to bring these controls to developers and product teams building AI systems.

We’re beginning a closed alpha that will allow developers to use mods across common architectures (Llama, Gemma, DeepSeek, GPT-OSS, and more).

If you want to join us and experiment with inference mods, send us a DM.