Token Injection as a Steering Mechanism for Large Language Models

Marshall Vyletel, Trent Elmore, Brock Elmore

You can play with the token injection playground here, and see the full inference mod primitives in our open-source code and docs.

Abstract

Token injection (inserting tokens directly into LLM generation streams at inference time) provides a plausible lightweight mechanism for steering model outputs without spending compute on fine-tuning or over-relying on prompt optimization strategies. We investigate this primitive on both reasoning (Qwen3-14B) and non-reasoning (Llama-3.1-8B) models to test injection strategies that augment the LLM’s information access, reasoning process, and output goals. We find that:

  1. Models are surprisingly resilient to out-of-distribution token injections.
  2. The semantic content of injected phrases matters significantly: small wording changes can swing benchmark accuracy by double digits.
  3. Hints injected into the generation stream can match or outperform hints placed in user prompts, though effects are model-dependent.
  4. Reasoning and non-reasoning architectures respond differently to steering attempts.

These findings establish token injection as a viable test-time steering mechanism and suggest preliminary design principles for reliable interventions.

Introduction

Inference interventions offer a promising but underexplored approach to token-level steering of LLM generation. While fine-tuning, agent frameworks, grammars, and prompting all steer model behavior across various trajectory lengths, token-level trajectory steering remains poorly understood. To address this gap, we developed a modified inference engine that exposes several new primitives across a range of model architectures. Our research methodology involves systematically evaluating each primitive using varied strategies and contexts. We begin with token injection.

We hypothesize that token injection can serve as an effective and practical tool for test-time steering, with implications for improving accuracy and reliability with minimal compute overhead. This work is exploratory in nature—we characterize token injection's properties across multiple dimensions and identify patterns, but do not yet offer a fully operationalized methodology. Existing literature on token injection has focused predominantly on safety and alignment applications, and our work extends this research toward general steering objectives.

Part 1 - Does token injection break the model?

Before exploring steering applications, we must first verify that token injection does not fundamentally compromise LLM generation. Models are not trained for this type of intervention and may be sensitive to injected perturbations that push them out of distribution. Therefore, establishing that injected tokens preserve coherent output is a necessary precondition for this research. We operationalize "breaking the model" as any injection that causes the model to lose coherence.

The trivial case is prepending generation with tokens the model had already produced, without modification. Since these tokens are within the model's output distribution by definition, they should not, and do not, cause degradation when re-injected on subsequent runs.
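As a quick sanity check, this can be expressed as a two-pass run. The sketch below uses a hypothetical `generate(prompt, prefill=...)` helper, where `prefill` stands in for pre-generation token injection; it is not our engine's actual interface.

```python
# Trivial re-injection check: feed the model's own earlier output back into its
# generation stream and confirm it continues coherently. `generate` is a
# hypothetical helper; `prefill` stands in for pre-generation token injection.
def reinjection_check(prompt: str, generate, prefix_chars: int = 200) -> str:
    first_pass = generate(prompt, prefill="")
    # Second pass: prepend a prefix of the model's own prior output.
    return generate(prompt, prefill=first_pass[:prefix_chars])
```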

To probe the boundaries of acceptable injection, we developed hypotheses about which token injections might cause failures. Given that models are trained primarily for coherence, helpfulness, and safety, we reasoned that boundaries might emerge when injecting content that is incoherent or otherwise unlikely for the model to generate. We tested four injection categories:

  1. Incoherence: Incoherent, grammatically incorrect, or low-probability strings
  2. Orthogonal Information: Irrelevant or off-topic content
  3. Misalignment with Intent: Strings that contradict user intentions
  4. Harmful Behavior: Strings that attempt to circumvent safety training

We found that, while it is possible to throw the model out of distribution, in most cases it routes around the injection and steers itself back into distribution.

Examples

Here are a few illustrative examples:

Injecting a random string of characters can "break the model".

Injecting orthogonal information can steer generation against user intention before routing back.

Sometimes orthogonal information is completely ignored, and the LLM will break grammatical consistency to route itself back.

Harmful and unlikely injected strings are effectively routed around.

An injected string that attempts to deny a banal request is effectively routed around.

But not always. In this case we can steer it to deny the request with a simple injection.

Learnings

While these examples are limited in scope, they provide sufficient initial evidence across both reasoning and non-reasoning models to support two preliminary conclusions:

  1. Coherence can be maintained during token injection
  2. Steering is viable

The next step is to characterize more precisely where token injection can steer, how to achieve reliable steering, and along which trajectories steering is most effective.

Part 2 - How to Steer?

Three questions frame our investigation:

  1. How do we assess successful steering?
  2. What are we modifying in order to steer?
  3. How are we performing these modifications?

Assessing Successful Steering

We identify three primary steering objectives: improving task accuracy, modifying output style, and increasing alignment with user intent.

For accuracy, we evaluate whether token injection strategies improve performance on established benchmarks.

For style, we rely on manual inspection and LLM-based judges.

For user-intent alignment, we similarly use manual inspection and LLM-based judges. We note that "alignment" here refers to how well generated output matches user intention, not strictly alignment in the safety sense. This dimension admits degrees of freedom and some subjectivity, as similar prompts may reflect different underlying intentions.

What Are We Modifying?

We identified three primary levers for steering via token injection:

  1. Information Access
    1. Inject helpful factual content into the generation stream
    2. Example: “Octopuses have three hearts.”
  2. Process
    1. Inject procedural or reasoning-oriented strings to influence how the model approaches a problem
    2. Example: “I’ll start by defining variables.”
  3. Goal
    1. Inject goal-oriented content to redirect the model's overall trajectory
    2. Example: “I’ll discuss bread, instead of math as the user requested.”

We deprioritized goal-oriented injection because existing literature on token injection for breaking or reinforcing safety guidelines already demonstrates that this lever can modify goal orientation. Our focus is therefore on information and process injection, which remain less explored.

How Are We Modifying?

Having identified our modification targets, we used three injection styles:

  1. Handwritten:
    1. Inject manually authored strings containing desired trajectory information.
  2. Sampling:
    1. Sample from an LLM and extract portions of the generated text for injection.
  3. Spoofing:
    1. Inject fake tool calls, special tokens, or simulated agent loops.

Each strategy can be further parameterized by its location in the generation stream:

  1. Pre-generation: Inject before any generation has started, establishing the initial trajectory.
  2. Mid-generation: Inject after some tokens have been generated, steering the model midway through output.
  3. Post-generation: Intercept the end-of-turn token, backtrack, and inject to extend or redirect.
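To make these axes concrete, the combination of lever, strategy, and location can be described as a small piece of data. A minimal sketch follows; the type and field names are hypothetical, not our engine's actual API.

```python
# A sketch of how a single injection can be described along the axes above.
# These type and field names are hypothetical, not our engine's actual API.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Lever(Enum):
    INFORMATION = "information"   # inject helpful factual content
    PROCESS = "process"           # inject procedural / reasoning-oriented strings
    GOAL = "goal"                 # redirect the model's overall trajectory

class Strategy(Enum):
    HANDWRITTEN = "handwritten"   # manually authored strings
    SAMPLED = "sampled"           # extracted from another LLM's output
    SPOOFED = "spoofed"           # fake tool calls, special tokens, agent loops

class Location(Enum):
    PRE_GENERATION = "pre"        # before any tokens are generated
    MID_GENERATION = "mid"        # after some tokens have been generated
    POST_GENERATION = "post"      # intercept end-of-turn, backtrack, extend

@dataclass
class InjectionSpec:
    text: str
    lever: Lever
    strategy: Strategy
    location: Location
    trigger: Optional[str] = None  # e.g. a substring like "####" for mid/post triggers

# Example: a handwritten information injection applied pre-generation.
octopus_fact = InjectionSpec(
    text="Octopuses have three hearts. ",
    lever=Lever.INFORMATION,
    strategy=Strategy.HANDWRITTEN,
    location=Location.PRE_GENERATION,
)
```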

We focused our work here on pre-generation and post-generation injection. While we ran various mid-generation experiments, we have not yet found a reliable, systematic approach to timing, and it remains underexplored.

Part 3 - Experiments

With our framework established around modification levers (information, process), injection strategies (spoofing, sampling, handwritten), and timing parameters (pre- and post-generation), we ran a series of experiments to identify where token injection produces reliable signal. Each experiment explores a different combination of these variables across our target models.

Experiment 1 - Double Check

While testing on the GSM8K dataset, we observed that the unmodified Llama-3.1-8B base model performed well overall but exhibited specific failure modes. One archetypal example was its response to question 4.


This failure mode is instructive. The model's reasoning is mostly sound, but it misses a key nuance in the problem statement: "James writes a 3-page letter to 2 different friends twice per week" implies he writes a 3-page letter four times per week, not twice. The model consistently missed this detail across multiple runs.

We hypothesized that a token injection could prompt the model to verify its work before returning a final answer. We implemented a simple intervention: when the model generates "####" (signaling it is about to output its final answer), we backtrack (remove these tokens from context) and inject the following string:

Injection: “Let's make sure we accounted for all the information in the problem statement. Upon revisiting, ”
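Mechanically, the intervention is a small wrapper around token-by-token decoding. Below is a minimal sketch assuming a generic token-level interface; `next_token`, `encode`, and `decode` are placeholders for whatever primitives the inference engine exposes, not a specific library's API.

```python
# Minimal sketch: detect the "####" marker mid-generation, backtrack past it,
# and inject the double-check phrase into the model's own generation stream.
from typing import Callable, List

DOUBLE_CHECK = (
    "Let's make sure we accounted for all the information in the "
    "problem statement. Upon revisiting, "
)

def generate_with_double_check(
    prompt_ids: List[int],
    next_token: Callable[[List[int]], int],   # next token id given the full context
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    eos_id: int,
    max_new_tokens: int = 512,
) -> str:
    generated: List[int] = []
    injected = False
    while len(generated) < max_new_tokens:
        tok = next_token(prompt_ids + generated)
        if tok == eos_id:
            break
        generated.append(tok)
        text = decode(generated)
        # When the model signals its final answer with "####", backtrack past the
        # marker and splice the double-check phrase into the stream (only once).
        if not injected and text.endswith("####"):
            generated = encode(text[: -len("####")] + DOUBLE_CHECK)
            injected = True
    return decode(generated)
```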

We re-ran the question with this injection.

The intervention succeeded on this example. The model caught the overlooked detail and corrected its answer. Additionally, because the injection contained no problem-specific information, we hypothesized it might generalize. We tested this intervention across the first 100 GSM8K questions:

Llama 3.1 8B Base: 95/100

Llama 3.1 8B Double-Check Mod: 81/100

The intervention clearly did not generalize. Upon reviewing the outputs, we found that the model began second-guessing not only incorrect answers but correct ones as well.


The correct answer ($35) was reached quickly, but our injection introduced doubt. The model hallucinated that the stated cost applied to one pound rather than two pounds total, which is an error not present in the original runs. Most new failures followed this pattern.

Despite the poor headline results, there are important learnings that we can use as we build toward a larger experiment.

Analysis

Despite the negative objective results, this experiment demonstrates powerful steering in action. We achieved consistent behavioral changes, even without precise control over outcomes. Critically, we deduced that our injection string carried more semantic weight than initially apparent:

  1. Problem-statement reference: The phrase "accounted for all the information in the problem statement" directs attention toward the problem as the potential source of error, which may not be productive when the model's interpretation was originally correct.
  2. Open-ended doubt: The phrase "Upon revisiting" semantically implies reconsideration. In human conversation or writing, "upon revisiting" more often precedes a revision than an affirmation. The model's behavior reflected this prior.

Refined Intervention

Based on these findings, we revised the injection to reduce semantic loading:

Injection: “Before giving our final answer, let's double check our reasoning: ”

This phrasing is less semantically loaded, and the results bear this out.

Transition        | Count  | Description
Wrong → Correct   | 3      | Successfully overwrote wrong answer
Correct → Wrong   | 4      | Incorrectly steered a correct answer
Wrong → Wrong     | 8      | No change in reasoning from injection
Correct → Correct | 85     | No change in reasoning from injection
TOTAL             | 88/100 | Net change resulting from the injection mod: -1

While we still observe slight degradation compared to baseline (−1), this represents a substantial improvement over the initial injection (−14).

Findings

This experiment yields three key insights:

  1. Token injection can consistently steer model outputs toward different outcomes.
  2. Semantically heavy, open-ended injections exert strong influence, but not always in the intended direction.
  3. References within injected text steer model attention toward specific parts of context, with corresponding downstream effects on reasoning.

Experiment 2 - Semantic Antipodes

Given our previous findings with the “upon revisiting” injection phrase in the double-check experiment, we investigated whether injecting semantically polarized phrases could steer LLM outputs along predictable dimensions. This experiment draws on Osgood's Semantic Differential framework [1], which measures subjective perception using bipolar scales (e.g., good–bad, strong–weak, active–passive). We adapted this framework to token injection by using meta-linguistic phrase pairs ("semantic antipodes") corresponding to three factors:

Factor     | Dimension                   | Antipode Pair
Evaluation | positive ↔ negative framing | "Impressively," ↔ "Disappointingly,"
Potency    | commitment strength         | "Undeniably" ↔ "Arguably"
Activity   | momentum/pacing             | "Diving in," ↔ "Pausing to consider,"

Phrase Selection via Embedding Analysis

To briefly validate our antipode pairs, we embedded each phrase and measured two properties:

  1. Cosine similarity between antipodes within each factor (lower might indicate stronger polarity)
  2. Cosine similarity between factors (lower indicates better orthogonality)

Results showed strong factor orthogonality (0.03–0.10) but moderate-to-high antipode similarity (0.53–0.73). The high within-pair similarity likely reflects that embedding models cluster by syntactic function rather than semantic content. For example, "Fortunately" and "Unfortunately" occupy similar positions in sentence structure despite opposite meanings.
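A rough sketch of this check, using a generic sentence-transformers model and pair midpoints as stand-ins for our actual embedding model and between-factor computation:

```python
# Sketch of the antipode validation step. The embedding model and the
# between-factor computation (pair midpoints) are stand-ins, not our exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer

antipodes = {
    "evaluation": ("Impressively,", "Disappointingly,"),
    "potency":    ("Undeniably",    "Arguably"),
    "activity":   ("Diving in,",    "Pausing to consider,"),
}

model = SentenceTransformer("all-MiniLM-L6-v2")

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = {f: (model.encode(p), model.encode(n)) for f, (p, n) in antipodes.items()}

# Within-pair similarity: lower might indicate stronger polarity.
for factor, (pos, neg) in emb.items():
    print(f"{factor}: within-pair cosine = {cos(pos, neg):.2f}")

# Between-factor similarity (computed on pair midpoints): lower indicates better orthogonality.
mids = {f: (pos + neg) / 2 for f, (pos, neg) in emb.items()}
factors = list(mids)
for i, f1 in enumerate(factors):
    for f2 in factors[i + 1:]:
        print(f"{f1} vs {f2}: cosine = {cos(mids[f1], mids[f2]):.2f}")
```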

Experimental Design

We tested prepend injection on Llama-3.1-8B using nine questions: three general-purpose questions sensitive to all axes, and two questions targeted at each specific factor. For each question, we compared baseline (unmodified) outputs against outputs with each antipode injected.

We developed the questions with ChatGPT, aiming for a set that permitted enough degrees of freedom to allow steering while being specific enough that, we hypothesized, each question would be sensitive to the factor it targets.
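Procedurally, each condition is just a pre-generation prepend of one antipode phrase onto the assistant's output stream. A minimal sketch, with `generate(question, prefill=...)` as a hypothetical helper for pre-generation injection:

```python
# Sketch of an antipode run: baseline output plus one output per antipode phrase.
# `generate` is a hypothetical helper; `prefill` stands in for pre-generation injection.
from typing import Callable, Dict

ANTIPODES = {
    "evaluation": ("Impressively,", "Disappointingly,"),
    "potency":    ("Undeniably",    "Arguably"),
    "activity":   ("Diving in,",    "Pausing to consider,"),
}

def run_question(question: str, generate: Callable[..., str]) -> Dict[str, str]:
    outputs = {"baseline": generate(question, prefill="")}
    for factor, (pos, neg) in ANTIPODES.items():
        outputs[f"{factor}/{pos}"] = generate(question, prefill=pos + " ")
        outputs[f"{factor}/{neg}"] = generate(question, prefill=neg + " ")
    return outputs
```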

Questions:

  1. General-purpose (sensitive to all three axes):
    1. “What do you think about remote work?”
    2. “How is AI changing education?”
    3. “What’s the state of social media today?”
  2. Evaluation-sensitive (more steerable for good vs. bad):
    1. “When someone publicly apologizes after backlash, how do you judge whether it’s sincere or just reputation management?”
    2. “A viral documentary raises millions for a cause but contains several misleading claims: was its impact ultimately positive or negative?”
  3. Potency-sensitive (more steerable for force/weight of commitment):
    1. “Is climate change accelerating?”
    2. “Does meditation actually work?”
  4. Activity-sensitive (more steerable for momentum/movement):
    1. “What should I do today?”
    2. “How should I approach learning a new skill?”

Results

Steering effects varied substantially by factor:

Activity (strongest signal):

The "What should I do today?" prompt showed clear differentiation between antipodes.

  1. With "Diving in," injected, the model produced action-oriented suggestions (exercise, learning, connecting with others).

  2. With "Pausing to consider," the model shifted toward reflective, lower-intensity activities (reading, meditation, journaling) and adopted notably softer language throughout.


Potency (moderate signal):

On "Is climate change accelerating?", both antipodes produced affirmative answers with similar evidence. However, the "Undeniably" injection produced stronger epistemic commitment (e.g. often used "The evidence is overwhelming"), while "Arguably" introduced hedging (e.g. "Arguably, yes, climate change is accelerating").

Evaluation (weak signal):

The apology-sincerity question showed minimal differentiation between "Impressively," and "Disappointingly," injections as both produced structurally similar analytical responses.

Unexpected failure mode:

On one general question, "Disappointingly," caused an outright refusal to the prompt “What do you think about remote work?”: "Disappointingly, there is not enough information provided in your question to allow me to give a comprehensive answer." The injection appeared to prime the model toward a negative evaluation of the prompt itself.

Findings

This experiment suggests that semantic antipodes can steer generation, but effectiveness is partially reliant on the question's natural degrees of freedom.

Activity-sensitive questions, which admit variation in pacing and intensity, responded most strongly to activity-axis injections.

Evaluation-sensitive questions showed weaker effects, possibly because the antipodes ("Impressively/Disappointingly") were less semantically loaded than the potency or activity pairs. The refusal case illustrates that even single-word injections can produce unexpected steering effects when the model interprets the injected sentiment as applying to the conversational context rather than the response content.


Experiment 3 - Hinting and Self-Attention

The first two experiments affirm that token injection is a viable practice for steering LLM generation. In our final experiment, we assess its effect more concretely against simple user-prompting strategies.

We hypothesize that hints injected into the generation stream ("self-prompting") produce stronger steering than equivalent hints placed in the user prompt. The intuition is that models assign higher attention weight to self-generated tokens, so by injecting hints into the generation stream, we can exploit this asymmetry to increase hint efficacy.

To test this, we designed an experiment using the ARC dataset (AI2 Reasoning Challenge), a science reasoning benchmark.

Experimental Design

For each question, we generated a one-sentence hint using GPT-5.2, designed to guide reasoning without revealing the answer. All conditions included the following instruction in the user prompt:

"Think through your answer before providing your final answer in the proper format:

#### <answer letter>"
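For scoring, the final answer is parsed back out of this format. A minimal sketch of such a parser (our actual scoring code may differ in details):

```python
# Pull the final answer out of the "#### <answer letter>" format.
# ARC answer keys are usually A-D, occasionally 1-4, so we accept both here.
import re
from typing import Optional

def extract_answer(output: str) -> Optional[str]:
    # Take the last occurrence in case the model restates the format earlier.
    matches = re.findall(r"####\s*\(?([A-D1-4])\)?", output)
    return matches[-1].upper() if matches else None
```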

We tested three conditions:

  1. Baseline: No hint provided.

  2. Hint in user prompt: The hint is appended to the user prompt (format: "Hint: ...").

  3. Hint injected: The hint is injected via pre-generation token injection (format: "Hint: ..."), with the user prompt unchanged from baseline.

This design isolates hint placement as the key variable: conditions 2 and 3 contain identical hint content and identical user-prompt instructions, differing only in whether the hint appears in the prompt or the generation stream.
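A sketch of how the three conditions can be assembled. The helper names are illustrative rather than our engine's API, and `prefill` again stands in for pre-generation token injection.

```python
# Build the three conditions for one ARC question. `prefill` is a hypothetical
# stand-in for pre-generation injection into the assistant's output stream.
INSTRUCTION = (
    "Think through your answer before providing your final answer "
    "in the proper format:\n\n#### <answer letter>"
)

def build_conditions(question: str, hint: str) -> dict:
    base_user = f"{question}\n\n{INSTRUCTION}"
    return {
        # 1. Baseline: no hint anywhere.
        "baseline": dict(messages=[{"role": "user", "content": base_user}], prefill=""),
        # 2. Hint appended to the user prompt.
        "user_hint": dict(
            messages=[{"role": "user", "content": f"{base_user}\n\nHint: {hint}"}],
            prefill="",
        ),
        # 3. Hint injected pre-generation: user prompt matches baseline, and the
        #    hint is placed at the start of the assistant's generation stream.
        "injected_hint": dict(
            messages=[{"role": "user", "content": base_user}],
            prefill=f"Hint: {hint}\n",
        ),
    }
```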

Subset Results

We evaluated on a 100-question subset, averaging results over three runs:

Condition   | Llama-3.1-8B | Qwen3-14B
Baseline    | 82%          | 92%
User prompt | 90%          | 98%
Injected    | 92%          | 96%

Key observations:

  1. Both hint conditions improve substantially over baseline

  2. For Llama 8B (non-reasoning): injection outperforms prompting (92 vs 90)

  3. For Qwen 14B (reasoning): prompting outperforms injection (98 vs 96), though this reverses in the full benchmark

Expanded Experimental Design

We expanded the experiment to a 2×2+baseline design to investigate two additional factors:

  1. Hint type: Factual ("info") hints vs. procedural ("process") hints

  2. Hint placement: User prompt vs. injected generation

For each question, we generated two types of hints using GPT-5.2:

  • Info hints: Factual guidance (format: "A key fact is...")

  • Process hints: Procedural strategies (format: "A key strategy to solve this problem is...")

This yields five conditions:

  1. Baseline (no hint)

  2. User-prompt info hint

  3. User-prompt process hint

  4. Injected info hint

  5. Injected process hint

Full Benchmark Results

We evaluated on the complete ARC-Challenge dataset (1,172 questions):

Condition        | Llama-3.1-8B      | Qwen3-14B
Baseline         | 83.9% (983/1172)  | 92.2% (1081/1172)
User info        | 91.0% (1067/1172) | 94.2% (1104/1172)
User process     | 88.0% (1031/1172) | 92.8% (1088/1172)
Injected info    | 86.3% (1012/1172) | 94.8% (1111/1172)
Injected process | 85.5% (1002/1172) | 92.7% (1086/1172)

To understand these accuracy changes more granularly, we computed the number of questions "flipped" by each condition:

Condition        | Llama: Gained | Llama: Lost | Qwen: Gained | Qwen: Lost
User info        | +104          | -19         | +52          | -29
User process     | +94           | -48         | +38          | -29
Injected info    | +97           | -68         | +74          | -19
Injected process | +91           | -70         | +62          | -40

"Gained" = questions baseline got wrong but condition got right; "Lost" = questions baseline got right but condition got wrong

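The gained/lost counts come from a straightforward per-question comparison against baseline. A sketch, assuming per-question correctness booleans keyed by question id; questions dropped for API errors are treated as neither gained nor lost.

```python
# Count questions "flipped" by a condition relative to baseline.
from typing import Dict, Tuple

def count_flips(
    baseline: Dict[str, bool],    # question id -> correct under baseline
    condition: Dict[str, bool],   # question id -> correct under the condition
) -> Tuple[int, int]:
    # Missing ids (e.g. API errors) count as neither gained nor lost.
    gained = sum(1 for q in baseline if not baseline[q] and condition.get(q, False))
    lost = sum(1 for q in baseline if baseline[q] and not condition.get(q, True))
    return gained, lost

# Example usage for one condition:
# gained, lost = count_flips(baseline_results, injected_info_results)
# print(f"+{gained} / -{lost}")
```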
Analysis

Finding 1: Hint placement effects are model-dependent

The results reveal a striking divergence between reasoning and non-reasoning models:

Hint Type | Llama: User vs Injected    | Qwen: User vs Injected
Info      | +7.2% vs +2.5% (User wins) | +2.0% vs +2.6% (Injected wins)
Process   | +4.1% vs +1.6% (User wins) | +0.6% vs +0.4% (User wins)

For Llama 8B (non-reasoning), user-prompt hints substantially outperform injected hints (+7.2% vs +2.5% accuracy gain for info hints). For Qwen 14B (reasoning), injected info hints slightly outperform user-prompt hints (+2.6% vs +2.0%).

This contradicts our initial hypothesis that injection would universally outperform user prompting due to self-attention dynamics. Instead, the data suggests the opposite pattern for non-reasoning models: they benefit more from hints presented as user input than from self-generated tokens.

One interpretation is that non-reasoning models have weaker internal self-monitoring capabilities. When a hint is injected mid-generation, the model may be less capable of integrating this "unexpected" information into its ongoing reasoning chain. In contrast, reasoning models, trained explicitly for extended chain-of-thought processing, appear more adept at incorporating injected guidance.

Finding 2: Injection increases answer volatility

Across both models, injection produces higher "churn" than user prompting—more questions flip in both directions:

Placement        | Llama: Losses | Qwen: Losses
User info        | -19           | -29
User process     | -48           | -29
Injected info    | -68           | -19
Injected process | -70           | -40

For Llama, injected hints cause 3-4× more regressions (68-70 vs 19-48 losses). This suggests injection introduces instability even when it provides useful information, possibly because the model must reconcile self-generated tokens with its prior reasoning trajectory.

Finding 3: Factual hints outperform procedural hints

Info hints consistently outperform process hints for both models:

Placement   | Llama: Info vs Process      | Qwen: Info vs Process
User prompt | +7.2% vs +4.1% (Info +3.1%) | +2.0% vs +0.6% (Info +1.4%)
Injected    | +2.5% vs +1.6% (Info +0.9%) | +2.6% vs +0.4% (Info +2.1%)

The info advantage is notably larger for user-prompt placement, especially for Llama (3.1% vs 0.9%).

On baseline failures, we observed:

Model | Info helped, process didn't | Process helped, info didn't | Both helped
Llama | 31 questions                | 25 questions                | 66 questions
Qwen  | 19 questions                | 7 questions                 | 55 questions

This suggests that for science reasoning tasks, providing factual guidance has higher steering efficacy than procedural prompts. The model's existing procedural capabilities may already be sufficient, making additional process hints redundant or even distracting.

Anecdotal Examples

Example 1: Injection helped where user prompt failed (Llama)

Question (Mercury_SC_407400): A class plans an investigation to see which brand of light bulb lasts the longest. Which of these steps should come first?

Condition     | Answer                              | Correct?
Baseline      | D (Make daily observations)         | ✗
User info     | D                                   | ✗
Injected info | C (Make a table for recording data) | ✓

The injected hint ("Organize data collection before starting the experiment") successfully steered the model to recognize that preparation precedes observation. The same hint in the user prompt failed—possibly because injection placed the guidance directly within the reasoning flow, making it more actionable.

Example 2: User prompt helped where injection failed (Llama)

Question (Mercury_7218820): On August 21, a flash flood warning was issued for the Las Vegas area. Which statement best describes this warning in terms of weather and climate?

Condition     | Answer                                                  | Correct?
Baseline      | D (rare event inconsistent with local climate)          | ✗
User info     | B (seasonal weather feature with irregular occurrences) | ✓
Injected info | D                                                       | ✗

Here, the hint ("Consider seasonal patterns of precipitation in desert climates") worked in the user prompt but not when injected. The injected version led the model to over-emphasize the "arid desert" aspect, concluding flash floods are rare rather than seasonal. This illustrates how identical content can have divergent effects depending on placement.

Example 3: Process hint succeeded where info hint failed (Llama)

Question (AKDE&ED_2012_8_6): Which change in Earth's surface is most directly related to the water cycle?

Condition        | Answer                          | Correct?
Baseline         | D (movement of tectonic plates) | ✗
Injected info    | D                               | ✗
Injected process | A (deposition of sediments)     | ✓

The info hint ("Consider processes involving water erosion and transport") was too general; the model focused on "movement" and incorrectly selected tectonic plates. The process hint ("identify which surface change is driven by water moving and settling material") directly invoked the concept of deposition, guiding the model to the correct answer.

Limitations

  1. Hint quality variance: GPT-5.2-generated hints varied in specificity and helpfulness; some may have been misleading.

  2. Single-run evaluation: Unlike the subset results (averaged over 3 runs), full benchmark results reflect single runs, introducing potential variance.

  3. API errors: A small number of questions (~5-15 for some conditions) encountered API errors and were excluded from scoring.

  4. Generalization: Results on ARC-Challenge may not generalize to other reasoning benchmarks or task types.

Part 4: Discussion and Future Directions

Our experiments with token injection reveal several patterns worth highlighting, along with open questions that merit further investigation.

Architectural Differences: Reasoning vs. Non-Reasoning Models

We observe a notable distinction in how reasoning and non-reasoning models respond to injection. Non-reasoning models exhibit what we might call arborescent (tree-like, single-branch) generation: they tend to accept existing tokens as ground truth and continue along a single branch. Reasoning models display more rhizomatic (network-like, recursive) behavior in that their reasoning traces are much more circular and interconnected. There is rarely a clear thought trajectory in reasoning tokens. Ideas are revisited and recirculated many ways before the model arrives at a conclusion and begins generating the final output.

This architectural difference has direct implications for steering. Non-reasoning models appear surprisingly resistant to chaotic or incoherent injections; we suspect this resilience may not hold for reasoning models, though this remains to be tested systematically. Anecdotal experiments suggest that reasoning models are often easier to steer precisely because their rhizomatic behavior constantly returns to earlier tokens; the "Banal Request Denial" example in Part 1 is a good illustration of this. More research on attention in token injection contexts is required to understand these observations.

Injection Design: Completeness, Specificity, and Timing

Several patterns emerged regarding injection design:

  • Complete vs. open-ended injections: Open-ended injections (e.g., "Upon revisiting," "Since", etc.) appear to steer more effectively than complete thoughts, even when the complete thought implies an imperative (e.g., "I will start by labeling the variables." is less effective than the same sentence followed by "First,").
  • Specificity: Simple, abstract ideas often outperform detailed, specific injections. This suggests that over-specification may constrain the model's ability to integrate injected content naturally. However, the impact of abstract injections is harder to predict, and categorizing their steering dimensions so they can be operationalized remains an open question.
  • Timing and syntax: Mid-generation injections appear more effective when placed at syntactically natural boundaries. Interruptions that align with normal discourse structure seem to integrate more smoothly. Even if the injections break sentence structure, they tend to work better if placed where a human might make an interjection.

Model Scale and Steerability

Counterintuitively, larger models may be more steerable in certain contexts. We hypothesize this stems from their stronger instruction-following capabilities. Smarter models possess greater dexterity in responding to varied inputs, and we think this translates to greater responsiveness to injected steering signals. More work is required here; this observation comes mostly from anecdotal work leading up to the experiments detailed in this article.

Attention Dynamics

A recurring theme across experiments is the role of attention allocation. Models appear to assign higher attention weight to self-generated tokens than to content from the user prompt. This asymmetry may explain both the resilience we observe (models recovering from injection perturbations) and the effectiveness of certain strategies (injections that prompt the model to generate steering content itself).

Future work should:

  1. Further isolate variables between prompt-based and injected content (across both the system prompt and the user prompt)
  2. Quantify attention weights precisely
  3. Explore injection timing further (where in generation to inject)

Future Directions

Several research directions emerge from this work:

  1. Semantic encoding: We began exploring an embedding model as the basis for a more generative approach to finding steering tokens, but more work here is necessary.

    1. Can we use sidecar embedding models to track a trajectory during generation?
    2. Can we find and classify “semantically heavy” phrases with reliable steering utility?
  2. Trained receptivity: Could models be fine-tuned to recognize special tokens (e.g., <external>) that signal incoming external information mid-generation? This would provide a robust interface for real-time injection and avoid “back-and-forth” dynamics in interface design. If so, it could also provide another useful mechanism for information injection that doesn’t require precise handwritten injections (tokens injected into <external> wouldn’t have to fit a precise expected syntactic structure).

  3. Attention analysis: Mechanistic interpretability work examining attention patterns during injection could clarify why some injections integrate seamlessly while others are effectively ignored. Bringing more feature maps into this work would give a more powerful signal.

  4. Reasoning model steering: Systematic comparison of steering efficacy across reasoning and non-reasoning architectures, particularly examining whether reasoning models' self-skepticism can be leveraged or must be overcome.

Conclusion

This work establishes that token injection is a viable mechanism for test-time steering of LLM generation, though its effectiveness relative to simpler prompting strategies is model-dependent. While fine-grained control remains elusive, we have demonstrated consistent influence over model outputs across multiple injection strategies and model architectures. The patterns identified here regarding timing, specificity, completeness, and attention dynamics provide a foundation for developing more reliable steering techniques. Token injection occupies a unique position in the steering landscape: more flexible than grammar-based constraints, more immediate than fine-tuning, and more granular than prompt engineering. Further research into this primitive may yield practical tools for deeper real-time generation control.

Related Work

Turner, A.M., Thiergart, L., Leech, G., Udell, D., et al. (2023). "Activation Addition: Steering Language Models Without Optimization." arXiv:2308.10248

Li, K., Patel, O., Viégas, F., Pfister, H., & Wattenberg, M. (2023). "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model." NeurIPS 2023. arXiv:2306.03341

Zou, A., Phan, L., Chen, S., et al. (2023). "Representation Engineering: A Top-Down Approach to AI Transparency." arXiv:2310.01405

Zhang, H., Song, H., Li, S., Zhou, M., & Song, D. (2023). "A Survey of Controllable Text Generation using Transformer-based Pre-trained Language Models." ACM Computing Surveys. arXiv:2201.05337

Liang, X., et al. (2024). "Controllable Text Generation for Large Language Models: A Survey." arXiv:2408.12599

Zou, A., Wang, Z., Kolter, J.Z., & Fredrikson, M. (2023). "Universal and Transferable Adversarial Attacks on Aligned Language Models." arXiv:2307.15043

Andriushchenko, M., Croce, F., & Flammarion, N. (2024). "Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks." ICLR 2025. arXiv:2404.02151

Osgood, C.E., Suci, G.J., & Tannenbaum, P.H. (1957). The Measurement of Meaning. University of Illinois Press.

Footnotes

  1. Osgood, C. E., Suci, G. J., & Tannenbaum, P. H. (1957). The Measurement of Meaning. University of Illinois Press.