Announcing Concordance: Mech Interp and Inference Mods

October 25, 2025

Token-level interventions give you control, reliability & observability

Custom Inference Controls & Applied Mechanistic Interpretability

We're excited to officially introduce Concordance. We want to share what we’ve discovered so far, as well as where we’re headed.

Concordance is building a software suite for applying mechanistic interpretability ("mech interp") techniques to AI systems, with the thesis that these techniques will improve control, reliability, and observability, and widen the design space for more complex applications. While much of the mech interp field focuses on alignment concerns, we are primarily focused on how these tools can improve the developer experience, the end-user experience, and the model's performance.


Our journey starts with granular token-level interventions, and we will soon release an SDK to build custom inference modifications (“mods”) that enable conditional forced tokens, backtracking, smarter tool-calling, per-token sampling strategies, and deep logit analytics.

The SDK is built around Events, Actions, Mods, and Flows. Events are emitted at key steps in the inference process (Prefill, ForwardPass, Sampled, and Added). Actions are responses that steer the inference process after each Event: AdjustPrefill, ForceTokens, AdjustLogits, ForceOutput, ForceToolCalls, and Backtrack. Mods are modules that ingest Events and return Actions; they can hold arbitrary state and be strung together with Flows to create complex inference-time steering.
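To make these shapes concrete, here is a minimal sketch of a per-token sampling strategy written as a Mod. It assumes only the event fields and ActionBuilder methods that appear in the examples below; treat the exact signatures as illustrative rather than final.

from collections import defaultdict

# Tokens generated so far, per request (illustrative bookkeeping).
generated_counts: dict[str, int] = defaultdict(int)

@mod
def early_sharpening(event, actions: ActionBuilder, tokenizer):
    if isinstance(event, Added):
        generated_counts[event.request_id] += len(event.tokens)
    if isinstance(event, ForwardPass) and generated_counts[event.request_id] < 50:
        # Scaling logits up is equivalent to sampling at a lower temperature,
        # so the first 50 tokens are sampled more conservatively.
        return actions.adjust_logits(event.logits * 2.0)
    return actions.noop()

The fuller examples below build on this same Event-in, Action-out pattern.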


Example 1

A simple token intervention: watch for simple math expressions and, when an "=" is generated, evaluate the expression and force the result as the next tokens.

Simple Token Intervention
import re
from collections import defaultdict
from typing import Any, Optional

# Matches simple arithmetic expressions that end in "=", e.g. "(5 - 1) * 234.22 ="
SIMPLE_MATH = re.compile(
    r'(?<!\w)'
    r'(?:[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?|\((?:[^()]*|\([^()]*\))*\))'
    r'(?:\s*[-+*/^]\s*'
    r'(?:[+-]?(?:\d+(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?|\((?:[^()]*|\([^()]*\))*\)))+'
    r'\s*=(?!=)'
)

class Calculator:
    """
    A dummy example of finding and parsing simple math expressions in generated
    output and catching when the model is about to generate an answer.
    Looks for expressions like "(5 - 1) * 234.22 =" and calls eval on them.
    """
    def __init__(self):
        self.accumulated_text: dict[str, str] = defaultdict(str)

    def maybe_calculate(self, req_id: str) -> Optional[Any]:
        results = SIMPLE_MATH.findall(self.accumulated_text[req_id])
        if results:
            latest = results[-1]
            if latest.strip()[-1] == "=":
                # eval is fine for a dummy example; don't do this in production
                return eval(latest[:-1])
        return None

calc = Calculator()

@mod
def simple_calculator(event, actions: ActionBuilder, tokenizer):
    if isinstance(event, Added):
        generated_text = tokenizer.decode(event.tokens)
        calc.accumulated_text[event.request_id] += generated_text
        result = calc.maybe_calculate(event.request_id)
        if result is not None:
            return actions.force_tokens(tokenizer.encode(str(result)))
    return actions.noop()
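For a concrete trace: once the model has emitted "(5 - 1) * 234.22 = ", the regex matches the full expression, eval returns 936.88, and the mod forces the tokens for "936.88" instead of letting the model guess at the arithmetic.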

Example 2

Use the Concordance SDK's self-prompting techniques to route to inference-time tool calling (no client back-and-forth).

Agent Flows and Runtime Tool Calling
# user-defined tools
runtime_tools = [...]
tool_names = ", ".join(tool.name for tool in runtime_tools)

def call_runtime_tool(ctx: FlowState, actions: ActionBuilder, tokenizer):
    # pull the answer from the runtime_tool_router question
    tool_to_call = ctx.answers["runtime_tool_router"]
    # extract the matching tool from the list of tools available
    tool = next(t for t in runtime_tools if t.name == tool_to_call)
    # call the tool, which should return an action like noop(), force_tokens(), backtrack(), etc.
    # i.e. RAG injection, steering, etc.
    return tool.fn(actions, tokenizer)

runtime_tool_router = FlowQuestion(
    name="runtime_tool_router",
    prompt=f" Which of these tools should I call: {tool_names}?",
    responses=[tool.name for tool in runtime_tools],
    erase_mode="all",
).then(call_runtime_tool)

should_call_runtime_tool = FlowQuestion(
    name="runtime_tool_calls",
    # the "self-prompt" that the model generates
    prompt=f" Wait, I am augmented by tools that I can immediately call: {tool_names}. Should I call one to help the user?",
    # valid response tokens/phrases
    responses=["yes", "no"],
    # erase the self-prompt and response from the output once complete so they are not sent to the user
    erase_mode="all",
).on(
    # if the model answers "yes" to itself, advance to the passed-in question, in this case runtime_tool_router
    "yes", runtime_tool_router,
).on(
    # if the model answers "no", just go back to normal generation
    "no", None,
)

# Compiles the flows into a state machine that advances as events come in. In this example:
# should_call_runtime_tool -> runtime_tool_router -> call_runtime_tool
# Once it's done, normal generation continues.
ENGINE = FlowEngine(
    entry_question=should_call_runtime_tool,
)

@mod
def runtime_tools_mod(event, actions: ActionBuilder, tokenizer):
    if isinstance(event, (Prefill, ForwardPass, Added)):
        return ENGINE.handle_event(event, actions, tokenizer)
    return actions.noop()

Example 3

An implementation of "Reasoning with Sampling: Your Base Model is Smarter Than You Think" as a Mod.

Advanced AI Scaffolding: Reasoning With Sampling
import math
import random
from enum import Enum

# Assumes two small helpers: log_softmax(logits) returns log-probabilities over
# the vocabulary (e.g. torch.log_softmax(logits, dim=-1)), and bernoulli(p)
# returns True with probability p (e.g. random.random() < p).

class Phase(Enum):
    OLD = "old"        # initial collection of the block (base logits, old tokens)
    NEW = "new"        # propose a new suffix from pivot m with sharpened sampler
    REV = "rev"        # reverse-walk: score old suffix under NEW prefix at tau
    DECIDE = "decide"
    DONE = "done"

class ReasoningWithSamplingState:
    def __init__(self, alpha: float = 4.0, block_size: int = 192, nmcmc: int = 6):
        self.block_size: int = block_size
        self.alpha: float = alpha
        self.tau: float = 1.0 / alpha
        self.nmcmc: int = nmcmc
        self.phase: Phase = Phase.OLD

        self.base_logits_old: list = []   # list[Tensor], length B
        self.old_tokens: list[int] = []   # list[int], length B

        self.iter_idx: int = 0            # 0..nmcmc-1
        self.pivot_m: int = 0             # random pivot in [0, B-1]
        self.suf_len: int = 0             # suffix length = B - m

        self.base_logits_new: list = []   # base logits along NEW prefix (len = suf_len)
        self.new_tokens: list[int] = []   # proposed tokens for suffix (len = suf_len)
        self.logp_new_suf: float = 0.0    # Σ log p_base(new_t | new_prefix), t∈suffix
        self.logq_fwd: float = 0.0        # Σ log q(new_t | old_prefix+new_so_far), t∈suffix

        self.logq_rev: float = 0.0        # Σ log q(old_t | new_prefix+old_so_far), t∈suffix
        self.base_logits_rev_last = None  # last base logits observed in REV step
        self._rev_pos: int = 0            # 0..suf_len-1 (position inside suffix)

        self._logp_old_suf: float = 0.0

    def start_first_iteration(self):
        self.iter_idx = 0
        self._start_iteration()

    def _start_iteration(self):
        self.pivot_m = random.randint(0, self.block_size - 1)  # uniform pivot
        self.suf_len = self.block_size - self.pivot_m
        self.base_logits_new.clear()
        self.new_tokens.clear()
        self.logp_new_suf = 0.0
        self.logq_fwd = 0.0
        self.logq_rev = 0.0
        self.base_logits_rev_last = None
        self._rev_pos = 0
        self._logp_old_suf = 0.0
        self.phase = Phase.NEW

    def _finish_iteration_and_maybe_continue(self, actions):
        self.iter_idx += 1
        if self.iter_idx < self.nmcmc:
            self._start_iteration()
            return actions.backtrack(self.suf_len)
        else:
            self.phase = Phase.DONE
            return actions.noop()

    def decide_and_apply(self, actions):
        m = self.pivot_m
        self._logp_old_suf = 0.0
        for t in range(m, self.block_size):
            tok = self.old_tokens[t]
            self._logp_old_suf += log_softmax(self.base_logits_old[t])[tok]

        # Metropolis-Hastings log acceptance ratio for the p^alpha target
        logA = self.alpha * (self.logp_new_suf - self._logp_old_suf) + (self.logq_rev - self.logq_fwd)
        self.phase = Phase.DECIDE
        if bernoulli(min(1.0, math.exp(min(0.0, logA)))):
            self.old_tokens[m:] = list(self.new_tokens)
            self.base_logits_old[m:] = list(self.base_logits_new)
            return actions.backtrack(self.suf_len, self.new_tokens)
        else:
            return actions.noop()

RWS: dict[str, ReasoningWithSamplingState] = {}

@mod
def reasoning_with_sampling(event, actions: ActionBuilder, tokenizer):
    state = RWS.get(event.request_id)
    if state is None:
        state = ReasoningWithSamplingState()  # default alpha/block_size/nmcmc
        RWS[event.request_id] = state

    if isinstance(event, ForwardPass):
        logits = event.logits
        if state.phase == Phase.OLD:
            state.base_logits_old.append(logits)
        if state.phase == Phase.NEW:
            state.base_logits_new.append(logits)
            return actions.adjust_logits(logits / state.tau)
        if state.phase == Phase.REV:
            state.base_logits_rev_last = logits
        return actions.noop()

    if isinstance(event, Added):
        # ignore multi-token forced additions (e.g. tokens this mod forced itself)
        # except while deciding, where they drive the next iteration
        if len(event.tokens) > 1 and event.forced and state.phase != Phase.DECIDE:
            return actions.noop()
        if state.phase == Phase.OLD:
            tok = event.tokens[0]
            state.old_tokens.append(tok)
            if len(state.old_tokens) == state.block_size:
                state.start_first_iteration()
                return actions.backtrack(state.suf_len)
            return actions.noop()

        if state.phase == Phase.NEW:
            tok = event.tokens[0]
            state.new_tokens.append(tok)
            i = len(state.new_tokens) - 1
            state.logp_new_suf += log_softmax(state.base_logits_new[i])[tok]
            state.logq_fwd += log_softmax(state.base_logits_new[i] / state.tau)[tok]
            if len(state.new_tokens) == state.suf_len:
                state.phase = Phase.REV
                state._rev_pos = 0
                return actions.backtrack(state.suf_len)
            return actions.noop()

        if state.phase == Phase.REV:
            i = state._rev_pos
            if i < state.suf_len:
                old_tok = state.old_tokens[state.pivot_m + i]
                logits_here = state.base_logits_rev_last
                state.logq_rev += log_softmax(logits_here / state.tau)[old_tok]
                state._rev_pos += 1
                if state._rev_pos == state.suf_len:
                    state.phase = Phase.DECIDE
                    return state.decide_and_apply(actions)
                else:
                    return actions.force_tokens([old_tok])
            return actions.noop()

        if state.phase == Phase.DECIDE:
            return state._finish_iteration_and_maybe_continue(actions)
    return actions.noop()
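A note on the acceptance step: decide_and_apply implements a standard Metropolis-Hastings test. The chain targets the base distribution sharpened to p^alpha, with the temperature-tau sampler q as the proposal, which is exactly the quantity logA = alpha * (logp_new_suf - logp_old_suf) + (logq_rev - logq_fwd) accumulated above.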

Once you treat LLMs like complete programs, many more control surfaces appear. At the inference level, you can deterministically guarantee outputs under programmed conditions, alter style and constraints in real time (without a multi-step agent loop), and make LLM applications behave like engineered systems.
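As one concrete illustration, altering constraints in real time can be as small as a logit-level mod. The sketch below suppresses a blocklist of phrases on every forward pass; BLOCKED_PHRASES is our hypothetical example, and we assume logits arrive as a 1-D torch-style tensor over the vocabulary, as in Example 3.

BLOCKED_PHRASES = ["password", "secret key"]  # hypothetical constraint list

@mod
def blocklist(event, actions: ActionBuilder, tokenizer):
    if isinstance(event, ForwardPass):
        logits = event.logits.clone()  # assumes a torch-like tensor
        for phrase in BLOCKED_PHRASES:
            # crude: bans every token in the phrase's encoding individually
            for tok in tokenizer.encode(phrase):
                logits[tok] = float("-inf")  # token can never be sampled
        return actions.adjust_logits(logits)
    return actions.noop()

Because the rule runs inside the sampler, it holds for every token, with no retry loop or guard model in front of the output.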

We believe the industry is mostly overlooking the improvements available at the inference level beyond cost and speed optimizations. We're excited to bring these controls to developers and product teams building AI systems.

We’re beginning a closed alpha that will allow developers to use mods across common architectures (Llama, Gemma, DeepSeek, GPT-OSS, and more).

If you want to join us and experiment with inference mods, send us a DM.