
One million tokens won't save your engineering standards

Published on March 19, 2026 by Lukas Holzer | 9 min read


TL;DR

The industry is celebrating one million token context windows as the solution to AI coding problems. The research says the opposite. More context degrades rule adherence. Here's why less context wins.

Claude Code just got a one million token context window. Gemini offers two million. The industry is celebrating. LinkedIn is full of posts about holding entire codebases in a single session, eliminating token budgets, running 28 skills and 24 agents simultaneously in one massive context.

I get the excitement. Bigger context windows remove real friction. No more phasing work across sessions, no more compressing instructions to stay under the limit.

But if you’re an engineering leader hoping that a million tokens will finally make your AI coding agents follow your internal standards, the research says you’re betting on the wrong thing.

What actually breaks when rules get ignored

Before we get into the science, let’s look at what this problem looks like in practice.

Your team runs a Go microservice architecture. You have clear rules:

  • All inter-service communication goes through the internal serviceclient package.
  • The serviceclient package must handle:
    • circuit breaking
    • distributed tracing via OpenTelemetry
    • retries with exponential backoff
    • standardized error wrapping

It’s documented. It’s in your AGENTS.md or CLAUDE.md file. Every engineer on the team knows it.
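For illustration, the rule in an AGENTS.md file might read something like this (hypothetical wording, not an excerpt from a real file):

```markdown
## Inter-service communication

- All HTTP calls between services MUST go through the internal
  `serviceclient` package. Never use `net/http` directly.
- `serviceclient` provides circuit breaking, OpenTelemetry tracing,
  retries with exponential backoff, and `apperror` error wrapping.
```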

An AI coding agent picks up a task to integrate with the payments service. It writes clean, idiomatic Go. Tests pass. The PR looks fine at first glance.

// ❌ What the agent generated (direct HTTP calls)
func (s *OrderService) ProcessRefund(ctx context.Context, orderID string) error {
    order, err := s.repo.GetOrder(ctx, orderID)
    if err != nil {
        return fmt.Errorf("fetch order: %w", err)
    }

    payload, err := json.Marshal(RefundRequest{
        OrderID: order.ID,
        Amount:  order.Total,
        Reason:  "customer_request",
    })
    if err != nil {
        return fmt.Errorf("marshal refund request: %w", err)
    }

    req, err := http.NewRequestWithContext(ctx, http.MethodPost,
        "http://payments-service:8080/api/v1/refunds", bytes.NewReader(payload))
    if err != nil {
        return fmt.Errorf("create request: %w", err)
    }
    req.Header.Set("Content-Type", "application/json")

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        return fmt.Errorf("call payments service: %w", err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusCreated {
        body, _ := io.ReadAll(resp.Body)
        return fmt.Errorf("refund failed (status %d): %s", resp.StatusCode, body)
    }

    return s.repo.MarkRefunded(ctx, orderID)
}

This compiles. Tests pass with a mock HTTP server. The logic is correct. But it’s a production incident waiting to happen:

  • No circuit breaking. If the payments service goes down, every refund request will hang for 30 seconds and pile up. No fallback, no fast failure.
  • No tracing. The OpenTelemetry span is gone. When something breaks at 2 AM, your on-call engineer sees the order service call but no downstream trace. Debugging takes hours instead of minutes.
  • No retries. Transient network failures (which happen constantly in Kubernetes) cause permanent refund failures. Customers don’t get their money back.
  • No standardized errors. Your middleware can’t map the error to the right HTTP status code. Your error tracking groups it as “unknown.”

Here’s what it should look like:

// ✅ What it should have generated (using the internal service client)
func (s *OrderService) ProcessRefund(ctx context.Context, orderID string) error {
    order, err := s.repo.GetOrder(ctx, orderID)
    if err != nil {
        return fmt.Errorf("fetch order: %w", err)
    }

    refundReq := RefundRequest{
        OrderID: order.ID,
        Amount:  order.Total,
        Reason:  "customer_request",
    }

    var refundResp RefundResponse
    err = s.paymentsClient.Post(ctx, "/api/v1/refunds", refundReq, &refundResp)
    if err != nil {
        return apperror.Wrap(err, apperror.CodeUpstream,
            "process refund for order %s", orderID)
    }

    return s.repo.MarkRefunded(ctx, orderID)
}

Half the lines. But behind s.paymentsClient.Post(), you get circuit breaking, OpenTelemetry spans, retries with backoff, and structured error codes. All enforced automatically.
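To make that concrete, here is a minimal sketch of the kind of retry-with-backoff behavior such a client bundles in. Everything here is illustrative: doWithRetry, errUpstream, and the parameters are hypothetical names, not Straion's or any real package's API, and a production client would layer circuit breaking and tracing on top of this.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errUpstream stands in for a transient failure from a downstream service.
var errUpstream = errors.New("upstream unavailable")

// doWithRetry retries op with exponential backoff, a stand-in for the
// retry behavior the internal client would enforce automatically so
// that individual handlers never have to.
func doWithRetry(op func() error, maxAttempts int, baseDelay time.Duration) error {
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		// Exponential backoff: baseDelay * 2^attempt.
		time.Sleep(baseDelay << attempt)
	}
	return fmt.Errorf("all %d attempts failed: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Fails twice, then succeeds: a typical transient network blip.
	op := func() error {
		calls++
		if calls < 3 {
			return errUpstream
		}
		return nil
	}
	err := doWithRetry(op, 5, time.Millisecond)
	fmt.Println(err == nil, calls) // true 3
}
```

The point of the wrapper is exactly that this loop lives in one place instead of being reinvented, or forgotten, in every handler.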

The refactoring cost? This isn’t adding an attribute. You’re rewriting the entire function, introducing a new dependency, updating the constructor, changing the test setup from mock HTTP servers to mock service clients, and verifying that error codes propagate correctly through your middleware. That’s an hour of rework. Per function.

Now multiply this by every service call the agent writes across your codebase. Each missed rule is another hour of rework, another PR review cycle, another “please use the service client” comment. And if it slips through review, it’s a production incident.

That’s the problem. And pouring a million tokens of context into the window doesn’t fix it. Here’s why.

The assumption everyone is making

The implicit belief behind the one million token context hype is simple: if we can fit more into the context window, the agent will know more and perform better. More rules, more docs, more architecture decisions, more history. All in one session. Problem solved.

It sounds logical. It’s also wrong.

The relationship between context size and rule adherence isn’t linear. It’s closer to an inverted U: as you add rules, output quality improves, peaks, then degrades as you keep adding more. And the degradation starts much earlier than most people expect.

What the research actually shows

Every model degrades as input length increases

In July 2025, researchers at Chroma published “Context Rot”, a study testing 18 frontier models including GPT-4.1, Claude 4, Gemini 2.5, and Qwen3. Their finding was unambiguous: every single model tested gets worse as input length increases. None are immune.

The critical insight: context rot isn’t about hitting the context window limit. It occurs well before that. Models with 200K-token windows showed measurable degradation at 50K tokens. For reliable performance, the researchers found, models fall short of their maximum context window by more than 99%. A one million token window doesn’t mean one million tokens of useful capacity. Not even close.

Information in the middle gets lost

Stanford researchers Nelson Liu, Kevin Lin, John Hewitt et al. demonstrated this in “Lost in the Middle: How Language Models Use Long Contexts”, published in the Transactions of the Association for Computational Linguistics. They found an accuracy drop of 30%+ when relevant information was placed in the middle of the context versus the beginning or end.

Think about what that means for engineering standards. You dump 200 rules into the context. The rules at the top and bottom get attention. The 150 rules in the middle? The model’s attention drifts right past them. Not because the context window is too small, but because attention spreads thinner over everything as the context grows.

This isn’t a bug that will get fixed with the next model release. It’s a fundamental property of how transformers work. Attention weights are normalized to sum to one, so every token you add takes a slice of attention away from every other token. It’s a zero-sum game.

[Figure: Accuracy by position of relevant information in context. Liu et al., "Lost in the Middle" (TACL 2024); illustrative data based on reported findings.]

The U-shaped curve: Models focus on the beginning and end of the context. Rules placed in the middle, where most guidelines end up in a large AGENTS.md, get the least attention. In a one million token context, the "middle" is enormous.

More instructions = worse compliance

Researchers Daniel Jaroslawicz et al. tested this directly in “How Many Instructions Can LLMs Follow at Once?” using a benchmark called IFScale with 500 keyword-inclusion instructions. Their finding: even the best frontier models only achieve 68% accuracy at maximum instruction density.

68%. That means a third of your rules get ignored. Not because the model can’t see them, but because there are too many competing for attention. The more rules you add, the less likely any individual rule is to be followed.

[Figure: Instruction compliance drops as rule count increases. Jaroslawicz et al., IFScale (2025); projected curves anchored to IFEval single-instruction scores.]

More rules = lower compliance per rule. Opus 4.6 scores 94% and Sonnet 4.6 scores 89.5% on IFEval (low-density, 1 to 3 instructions). But as instruction count climbs to 500, even the best models drop significantly. Adding 500 engineering guidelines doesn't mean 500 rules followed. Each rule is less likely to be applied.

This is the core problem with the “put everything in context” approach to engineering standards. You’re not helping the agent by giving it all of your guidelines at once. You’re making it less likely to follow any specific one.
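A back-of-the-envelope model makes the implication vivid. Assume per-rule compliance decays linearly from the ~94% single-instruction score to the ~68% score at 500 instructions (the two endpoints quoted above). The real curve is not linear and these numbers are purely illustrative, but even this crude interpolation shows how the count of ignored rules grows with density:

```go
package main

import "fmt"

// perRuleCompliance is an illustrative linear interpolation between
// ~94% compliance at 1 rule and ~68% at 500 rules. Not from the paper;
// the real decay curve has a different shape.
func perRuleCompliance(n int) float64 {
	if n <= 1 {
		return 0.94
	}
	if n >= 500 {
		return 0.68
	}
	return 0.94 - (0.94-0.68)*float64(n-1)/499.0
}

func main() {
	for _, n := range []int{8, 50, 150, 500} {
		c := perRuleCompliance(n)
		fmt.Printf("%3d rules: ~%.0f%% per-rule compliance, ~%.0f rules ignored\n",
			n, c*100, float64(n)*(1-c))
	}
}
```

At 8 rules the expected miss count rounds to about one; at 500 it is on the order of 160. The absolute numbers are made up, but the direction is what the research reports.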

What this means for engineering standards

Let’s make this concrete. You’re working on a Go microservice codebase. You have conventions for inter-service communication, error handling, observability, testing patterns, security, API design, database access, logging, and deployment. 150 rules total.

The one million token approach means you dump all 150 rules into the context alongside the codebase, the issue description, conversation history, tool schemas, and whatever else fits. The agent now has everything it could possibly need. It also has so much noise that your serviceclient convention on line 847 of the context is competing for attention with 999,000 other tokens.

Straion solves this with the opposite approach. The agent is integrating with the payments service. With Straion, it receives only the eight rules that apply to that task, among them the serviceclient package requirement, error wrapping with apperror, OpenTelemetry tracing conventions, and retry policies. Eight rules in a focused, structured format. The signal-to-noise ratio is orders of magnitude higher.
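Straion's actual matching logic isn't public, but the shape of the idea can be sketched with simple tag intersection. All names here (Rule, matchRules, the tags) are hypothetical:

```go
package main

import "fmt"

// Rule is a single engineering guideline with tags describing when it applies.
type Rule struct {
	Name string
	Tags []string
}

// matchRules returns only the rules whose tags intersect the task's tags,
// so the agent's context carries relevant conventions and nothing else.
func matchRules(rules []Rule, taskTags map[string]bool) []Rule {
	var out []Rule
	for _, r := range rules {
		for _, t := range r.Tags {
			if taskTags[t] {
				out = append(out, r)
				break
			}
		}
	}
	return out
}

func main() {
	rules := []Rule{
		{"use serviceclient for inter-service calls", []string{"http", "services"}},
		{"wrap errors with apperror", []string{"errors", "services"}},
		{"prefer CSS grid over floats", []string{"frontend"}},
	}
	// Task: integrating with the payments service.
	for _, r := range matchRules(rules, map[string]bool{"services": true}) {
		fmt.Println(r.Name)
	}
	// The two service rules are selected; the frontend rule stays out of context.
}
```

Naive tag matching like this already illustrates the trade: the agent fixing a CSS layout never sees the Go service conventions at all.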

The research predicts which approach wins. And it’s not the one with more tokens.

Retrieval ≠ compliance

One argument I hear often: “But needle-in-a-haystack tests show near-perfect retrieval at one million tokens.” That’s true. Google’s tests show Gemini 1.5 Pro achieving 99.7% recall on retrieving a specific fact from one million tokens of context.

Retrieval and compliance are fundamentally different tasks. Finding a needle is a lookup. Can the model locate a specific piece of information? Following engineering standards is behavioral. Can the model consistently apply multiple rules while generating code, holding each rule in working attention alongside the actual coding task?

The IFScale research answers that question directly: compliance degrades with instruction density even when the model can clearly “see” the instructions. The rules are in the context, the model can retrieve them if asked, but it doesn’t consistently apply them while coding. That’s the gap we see in code generation day to day.

What we’re building at Straion

At Straion, our approach is to have a small, sharp context:

Dynamically match only the relevant rules to each specific task. The agent fixing a CSS layout gets frontend conventions. The agent patching a security vulnerability gets compliance policies. No overlap. No noise.

Keep the agent’s context sharp and narrow. Fewer rules, higher compliance. The research backs this up across every study we’ve looked at.

The result: your engineering standards actually get followed. Not because the model has a bigger context window, but because every rule it receives is relevant to what it’s doing right now.


The context window arms race misses the point

Bigger context windows are genuinely useful for many things. Long conversations, large file analysis, complex multi-step reasoning. I’m not arguing against the progress.

But for the specific problem of making AI coding agents follow your engineering standards, the solution was never “fit more rules into the window.” It was always “deliver the right rules at the right time.”

The science is clear: attention dilutes with scale, compliance drops with instruction density, and performance degrades well before you hit the context limit. More tokens won’t fix that. Better signal will.

Stay on track.

Lukas


References:

Hong, K., Troynikov, A., & Huber, J. (2025). Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Research. https://research.trychroma.com/context-rot

Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics. https://arxiv.org/abs/2307.03172

Jaroslawicz, D. et al. (2025). How Many Instructions Can LLMs Follow at Once? arXiv:2507.11538. https://arxiv.org/abs/2507.11538

Gloaguen, T., Mündler, N., Müller, M., Raychev, V., & Vechev, M. (2026). Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents? arXiv:2602.11988. https://arxiv.org/abs/2602.11988


Written by Lukas Holzer