Preventing Common Issues When Working with AI-Generated Code

AI coding agents have made me significantly more productive as an engineer, but they have a systematic problem: as context fills, they drift from explicit guidance and violate documented patterns. This post examines those limitations and the workflow adaptations that help while the labs address the underlying issues.

In Part 1, Code Reviewing AI-Generated JavaScript: What I Found, I reviewed JavaScript code generated by Claude Sonnet 4.5 and Opus 4.5 running in Cursor IDE. The code calculated distances using a third-party API, handled batching, managed timeouts, and validated data. It worked, but code review uncovered nine issues ranging from minor inefficiencies to a critical production bug.

The AI had access to comprehensive guidance: an 800-line CLAUDE.md file with explicit instructions on validation, error handling, and testing, plus custom webdev agent skills from my open-source skills repository covering CSS, semantic HTML, frontend testing, and security. These documentation mechanisms (CLAUDE.md files and custom skills) are exactly the right tools for guiding AI agents. I invested heavily in using them properly.

During sessions, I take care to write clear prompts that state the problem, desired outcome, and often include acceptance criteria. I explicitly prompt the AI: “Please also refer to the guidance provided in the codebase,” sometimes specifically calling out the CLAUDE.md file. The CLAUDE.md is always included in context, likely via system prompt, so the guidance is available from the start. This creates a delicate balance: prompts need enough context to be clear without being so verbose they consume significant context before work even begins.

Yet despite this comprehensive, properly-implemented guidance and explicit prompting to reference it, these nine issues still emerged during code review. This tells us something important about current AI coding agents and their limitations.

The Guidance I Provided

Before diving into what went wrong, it’s worth understanding what guidance was available to the AI. The CLAUDE.md file included detailed sections on data validation, error handling, testing philosophy, DRY principles, and JavaScript conventions. The custom skills provided specialized knowledge for CSS authoring, semantic HTML, frontend testing strategies, and security patterns. These weren’t vague suggestions. They were explicit instructions with code examples and rationale.

Data Validation Guidance

Here is an extract from the CLAUDE.md file:

Data Validation

CRITICAL: Always use Zod for data validation.

Example:

import { z } from "zod";

const UserSchema = z.object({
  name: z.string().min(2, "Name must be at least 2 characters"),
  email: z.string().email("Invalid email format"),
  age: z.number().int().positive().optional(),
});

const result = UserSchema.safeParse(data);
if (!result.success) {
  return { valid: false, error: result.error.errors[0].message };
}

Why Zod:

Error Handling Guidance

Here is an extract from the CLAUDE.md file:

Error Handling

User errors (invalid input, not found, timeout) → return error objects
Config/programming errors (missing API keys, DOM elements) → throw exceptions

// User error: return
if (!input) return { valid: false, error: "Input cannot be empty" };

// Config error: throw
if (!el) throw new Error("Element #api-config not found in DOM");

Rules:

Testing Philosophy Guidance

Here is an extract from the CLAUDE.md file:

Testing Philosophy

When a test fails, check the code under test first. A failing test may have uncovered a real bug—do not mask it by modifying the test to pass.

Approach:

  1. Investigate source code when tests fail unexpectedly
  2. Only adjust assertions after confirming code is correct
  3. Verify failures are due to intentional invalid input
  4. For validation tests, confirm valid data passes first

Accessibility Testing Guidance

Here is an extract from the CLAUDE.md file:

Accessibility Testing

Use ARIA snapshots as the default. Consolidate multiple accessible name checks into a single toMatchAriaSnapshot() test. This is faster, more accurate (tests actual accessibility tree), and easier to maintain.

// ✅ BEST: Single ARIA snapshot validates multiple elements
await expect(dialog).toMatchAriaSnapshot(`
  - dialog:
    - button "Close"
    - iframe
`);

// ✅ ACCEPTABLE: Semantic locator for single element
const closeButton = dialog.getByRole("button", { name: /close/i });
await expect(closeButton).toBeVisible();

// ❌ WRONG: Tests implementation detail
const ariaLabel = await button.getAttribute("aria-label");
expect(ariaLabel).toBeTruthy();

Key principles:

ARIA snapshots do double duty:

If you can’t express your component structure clearly through roles and accessible names in an ARIA snapshot, that’s a red flag that your HTML might not be semantic or accessible.

What Not to Test

The CLAUDE.md also explicitly documented what should NOT be tested:

Don’t test the web platform or third-party libraries:

Focus on:

This wasn’t superficial documentation. Each section provided the what, the how, the why, and concrete examples. The validation section specified exactly which Zod methods to use. The error handling section distinguished user errors from config errors. The testing sections emphasized checking code before tests, using ARIA snapshots for accessibility, and not testing platform features. The DRY section gave specific extraction criteria.

Where the AI Deviated from Guidance

Issue Group 1: Zod Validation Not Applied (Issues #2, #5, #6)

The Guidance: CLAUDE.md explicitly stated “CRITICAL: Always use Zod for data validation” with examples showing schema definitions and .safeParse() usage for validation.

How the AI Violated It:

The schemas existed. The examples were there. But the AI created schemas as type definition tools rather than validation tools, then wrote manual validation alongside them. Using schemas for type inference is valuable and needed, but the primary purpose (runtime validation) was ignored.

// What the AI did: Schema exists but unused
// eslint-disable-next-line no-unused-vars
const DistanceSuccessSchema = z.object({
  success: z.literal(true),
  data: z.array(DistanceResultSchema),
});

return { success: false, error: "Invalid origin" }; // Plain object, not validated

// What the guidance specified: Actually validate
return DistanceErrorSchema.parse({
  success: false,
  error: "Invalid origin",
});
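
For contrast, here is a minimal sketch of what consistent use looks like: the third-party response is parsed once at the boundary, and everything downstream works with validated data. The function name, URL parameter, and response shape are illustrative assumptions rather than the reviewed code; DistanceResultSchema is reused from the snippets in this post.

// Sketch (illustrative names): parse the external response at the boundary,
// then hand validated data to the rest of the code
const ApiResponseSchema = z.object({
  results: z.array(DistanceResultSchema),
});

async function fetchDistanceData(url) {
  const response = await fetch(url);
  const parsed = ApiResponseSchema.safeParse(await response.json());
  if (!parsed.success) {
    return { success: false, error: parsed.error.errors[0].message };
  }
  return { success: true, data: parsed.data.results };
}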

Issue Group 2: Error Handling Boundaries Mixed (Issues #4, #5)

The Guidance: CLAUDE.md documented: “Internal helpers throw errors. Public APIs return success/error objects. Choose one paradigm per abstraction level.”

How the AI Violated It:

// What the AI did: Returns error objects internally
async function calculateDistanceBatch(...) {
  try {
    if (!response.ok) {
      return DistanceErrorSchema.parse({ success: false, error: "..." });
    }
  } catch (error) {
    return DistanceErrorSchema.parse({ success: false, error: "..." });
  }
}

// Then manually check for failures
const batchResults = await Promise.all([...]);
const failedBatch = batchResults.find(result => !result.success);

// What the guidance specified: Internal helpers throw
async function calculateDistanceBatch(...) {
  if (!response.ok) {
    throw new Error(`API error: ${response.status}`);
  }
  return { data: [...] };
}

// Promise.all handles failures naturally
const batchResults = await Promise.all([...]); // Fails fast on throw
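
The other half of that boundary is the public entry point: the one place that catches internal throws and converts them into the documented success/error shape. Here is a minimal sketch, assuming a hypothetical chunk helper and BATCH_SIZE constant; the reviewed implementation may differ.

// Sketch: the public API catches throws from internal helpers and returns
// the documented success/error object (chunk and BATCH_SIZE are hypothetical)
async function calculateDistances(origin, destinations, options = {}) {
  try {
    const batches = chunk(destinations, BATCH_SIZE);
    const batchResults = await Promise.all(
      batches.map((batch) => calculateDistanceBatch(origin, batch, options))
    );
    return { success: true, data: batchResults.flatMap((b) => b.data) };
  } catch (error) {
    return DistanceErrorSchema.parse({ success: false, error: error.message });
  }
}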

Issue Group 3: Testing Violations (Issue #9)

The Guidance: Multiple sections in CLAUDE.md addressed testing: the testing philosophy, the accessibility testing patterns, and the guidance on what not to test, all extracted above.

How the AI Violated It:

Issue #9a - Testing Implementation: Created a test that waited 20+ actual seconds to verify a 10-second timeout worked, testing whether JavaScript could count to 10,000 milliseconds instead of testing whether the timeout mechanism functioned.

// What the AI did: Test actual duration
it("should timeout after 10 seconds", async () => {
  // Waits 20+ seconds
}, 15000);

// What the guidance specified: Test the mechanism
it("respects custom timeout", async () => {
  const result = await calculateDistances(
    origin,
    destinations,
    { timeout: 100 } // Test-friendly value
  );
  expect(result.success).toBe(false);
  expect(result.error).toContain("timeout");
});

The test suite went from 20+ seconds to 2.2 seconds because we tested whether the abort signal triggers correctly, not whether JavaScript can count.
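
For context on what “the mechanism” means here, a rough sketch of an abort-based timeout is shown below. This is my illustration rather than the reviewed implementation, and fetchWithTimeout is a hypothetical name; the point is that accepting the timeout as an option is what lets tests pass a 100 ms value instead of waiting out the real 10 seconds.

// Sketch (hypothetical helper): abort-based timeout with the duration as an option
async function fetchWithTimeout(url, { timeout = 10000 } = {}) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeout);
  try {
    return await fetch(url, { signal: controller.signal });
  } catch (error) {
    if (error.name === "AbortError") {
      throw new Error(`Request exceeded timeout of ${timeout}ms`);
    }
    throw error;
  } finally {
    clearTimeout(timer);
  }
}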

Issue #9b - Testing Web Platform: Frequently wrote tests that verified native browser behavior despite explicit guidance not to test the web platform. Examples included:

These tests verify browser behavior, not application logic. The guidance explicitly stated to focus on application logic and integration with libraries, not platform features.

Issue #9c - Ignoring ARIA Snapshots: Despite guidance to use ARIA snapshots as the default for accessibility testing, the AI wrote multiple individual tests checking accessible names, attributes, and roles. This frequently resulted in 20+ tests that could be consolidated into 4-5 ARIA snapshot tests.

// What the AI did: 20+ individual tests
it("close button has accessible name", async () => {
  const button = await dialog.getByRole("button");
  expect(await button.getAttribute("aria-label")).toBe("Close");
});

it("dialog has role", async () => {
  expect(await dialog.getAttribute("role")).toBe("dialog");
});

// ... 18+ more similar tests

// What the guidance specified: Single ARIA snapshot
await expect(dialog).toMatchAriaSnapshot(`
  - dialog:
    - button "Close"
    - iframe
`);

The impact was significant. Test suites that took minutes to run could be reduced to seconds. More importantly, ARIA snapshots do double duty: they validate structure and presence/absence of elements while simultaneously making it obvious whether components can be clearly expressed through the accessibility tree. If you can’t write a clean ARIA snapshot, that’s a red flag about your semantic HTML. By ignoring ARIA snapshots, the AI missed this valuable signal about potential accessibility issues.

Lengthy test suites also discourage developers from running tests locally during development. Tests are only useful if they’re purpose-built and actually executed. Creating 20+ tests to verify what 4-5 ARIA snapshots can validate makes tests a burden rather than a benefit.

Issue Group 4: List Processing (Issue #7 - Critical Bug)

The Guidance: While not explicitly documented in CLAUDE.md, maintaining predictable input/output relationships is a fundamental programming principle reinforced by the documented emphasis on data validation and type safety.

How the AI Violated It: Filtered out unreachable destinations, returning 24 results for 25 inputs with no way to map results back to original destinations.

// What the AI did: Silent filtering
const validResults = results
  .filter((r) => r.distance !== null) // Removes unreachable
  .map((r) => DistanceResultSchema.parse(r));

// What should happen: Include all with nullable fields
const DistanceResultSchema = z.object({
  targetId: z.string(),
  distance: z.number().nullable(),
  time: z.number().nullable(),
});

const validResults = results.map((r) => DistanceResultSchema.parse(r));

This was the critical bug. With 25 store locations and one unreachable, the function returned 24 results. No way to tell which store was missing. No way to show “Store #7 is unreachable” in a UI.
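
With the nullable-field shape, every input produces exactly one output, so results can be mapped back to their stores by position or by targetId. A small sketch of consuming code, where the stores array, its name field, and the label wording are illustrative assumptions:

// Sketch: results align one-to-one with inputs, so the UI can name the
// unreachable store (store fields and units are illustrative)
const labels = validResults.map((result, index) => {
  const store = stores[index]; // or look up by result.targetId
  return result.distance === null
    ? `${store.name} is unreachable`
    : `${store.name} is ${result.distance} m away`;
});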

Why Does Comprehensive Guidance Fail?

These four issue groups reveal a pattern: comprehensive, explicit guidance systematically violated. Schemas defined but unused for validation. ARIA snapshot patterns marked “BEST” ignored in favor of 20+ individual tests. Error handling boundaries clearly documented yet mixed throughout the code. Each violation involved disregarding instructions that included working examples and detailed rationale.

This wasn’t vague documentation open to interpretation, which raises an important question: why does properly implemented, example-laden guidance fail to prevent these issues?

The answer involves how AI models behave as their context windows fill.

Managing Context and Drift

Here’s something I’ve noticed that complicates prevention: AI models seem to drift from their instructions as the context window fills. With Sonnet 4.5, this happens around 40% context usage (some call this entering the “dumb-zone”, though I’m not sure I love that phrasing). Opus 4.5 drifts slightly later, but it still happens.

This drift undermines the entire prevention strategy. I’m investing heavily in comprehensive CLAUDE.md documentation and custom skills specifically to prevent these issues from occurring in the first place, not just catch them in review. But as the implementation agent’s context fills up (working through batching, validation, error handling, testing), it gradually drifts from the explicit guidance. The instructions about using Zod consistently, maintaining error handling boundaries, testing behavior over implementation are still there in context, but the agent stops following them reliably.

The pattern is consistent enough that I’ve adapted my workflow. I use one agent for initial implementation. Then I start fresh threads for subsequent work: one per file or logical group of changes. Each thread gets minimal context, primarily just the CLAUDE.md document and the specific code under review. Depending on changeset size, this might mean many agents.

Because these agents work with much less populated context, they seem to adhere more closely to the guidance. They follow the Zod validation patterns. They maintain error handling boundaries. They test behavior instead of implementation. The fresh context keeps them aligned with the documented principles.

This observation suggests something important: implementation phases should probably be scoped much smaller. Work in smaller increments so agents maintain less context and stay aligned with guidance. This aligns with good engineering practice anyway, though it can be challenging to scope work that granularly in practice.

I’m currently experimenting with other approaches. Subagents might play a role here, though I haven’t worked out exactly what that looks like yet. There’s a tension between comprehensive context (for coherent implementations) and fresh context (for adhering to guidelines). Whether this points to fundamentally different workflow patterns, better context management strategies, or evolution in AI coding agents themselves, I don’t yet know.

I haven’t systematically tested this across different models or context window sizes. It’s observational, based on several months working with Sonnet and Opus in Cursor IDE. I’m curious whether others have observed similar patterns and how they’re approaching it. Does drift happen at predictable percentages with other models?

Some specific questions I’m exploring: Does the Ralph Wiggum loop pattern (where the AI implements then reviews its own work) help maintain guideline adherence, or does it suffer from the same drift issues? On the tooling side, Kilo Code’s memory bank takes a different approach. According to their documentation: “When Memory Bank is active, Kilo Code begins each task with [Memory Bank: Active] and a brief summary of your project context, ensuring consistent understanding without repetitive explanations.” This suggests starting each task with synthesized context rather than full context or minimal context. I’m curious how this compares to the multi-agent workflow in practice.

Are there workflow strategies that maintain both context and guideline adherence? This feels like an area that needs more exploration.

What This Tells Us

The AI had explicit guidance on these patterns. The CLAUDE.md file documented Zod validation, DRY principles, error handling boundaries, and testing strategies. Custom skills provided frontend testing patterns and security guidelines. Yet these issues still surfaced.

This points to a fundamental limitation in current AI coding agents. The labs and tool makers define these guidance mechanisms (CLAUDE.md, skills, system prompts) as the way to provide instructions. Yet as context fills, the AI drifts from that guidance. If the system provides these mechanisms but doesn’t reliably follow them, that’s a failure of the AI system itself, not the documentation or the user.

The workarounds we’re exploring (multi-agent workflows, smaller implementation phases, Ralph Wiggum loops, memory banks) are essentially patches for this limitation. They help, but they’re user-driven adaptations to a problem that fundamentally needs to be addressed by the labs training these models and crafting these tools.

Despite these limitations, I want to be clear: working with AI coding agents has made me significantly more productive as an engineer. The volume and quality of code I can review and ship has increased substantially. More importantly, I’m learning and growing in ways I didn’t expect. Reviewing so much code (both good and problematic) has made me more detail-oriented. I think through problems at both detailed and macro levels more effectively. I’m researching more than ever before, diving into documentation and specifications to verify AI suggestions rather than accepting them at face value.

These tools are incredibly valuable. That’s precisely why these limitations matter. When something this useful has systemic issues that undermine its effectiveness, we need the problems addressed so the tools can reach their full potential.

Working code isn’t the same as production-ready code. The AI produced functional implementations that passed tests. Review uncovered patterns that would cause problems in production: unpredictable array lengths, 20-second test suites, missing null checks, inconsistent error handling.

The goal is a system (prevention strategies, context management, and review practices) that keeps code production-ready whether it comes from an AI agent, a junior developer, or a senior engineer working late on a Friday. These prevention strategies, combined with workflow adaptations, provide that practical system while we wait for the underlying limitations to be addressed.