AI Coding Assistant for Refactoring Legacy Code
Legacy code is the silent tax every growing codebase pays. An AI coding assistant can cut that tax by mapping dependencies, generating test harnesses, and executing mechanical refactors in minutes instead of weeks. Here is the exact workflow we used to refactor a 20k-line Python billing module — prompt patterns included.
Why legacy code is harder for AI than greenfield
An AI coding assistant that excels at writing new functions from scratch often struggles with legacy code. The reason is not capability but context: legacy code accumulates implicit assumptions, undocumented edge cases, and coupling that no single file reveals.
We tested four AI coding assistants (Cursor, GitHub Copilot, Claude Code, and Cody) on a Python billing module last updated in 2021. The module had no tests, used a custom ORM wrapper, and buried business logic inside 400-line functions. The greenfield prompt — "write a billing module" — produced clean but structurally incompatible code. The refactoring prompt — "map the dependencies of this function, then extract it" — succeeded.
The key insight: AI coding assistants are excellent at refactoring legacy code when given the right scaffolding prompts. Ask them to read first, restructure second, write third.
The four-phase legacy refactor workflow
After iterating through 12 prompt variations, we settled on a four-phase sequence that worked reliably across all four tools:
- Map phase: Ask the AI to read the file and describe every function's inputs, outputs, side effects, and callers. This surfaces implicit dependencies before you touch anything.
- Test phase: Ask the AI to generate characterization tests (aka golden master tests) that capture current behavior. Run them. They should pass on the legacy code; if they fail, you misunderstood the behavior.
- Extract phase: One function at a time, ask the AI to extract a pure version. Keep the old function as a thin wrapper that delegates. Run the characterization tests again; they must still pass.
- Verify phase: Once all extractions are done, drop the wrappers, rename, and re-run the tests. Any regression means the AI missed a side effect in the map phase.
| Phase | Prompt trigger | Tool used | Time (manual → AI-assisted) |
|---|---|---|---|
| Map | "Read this 400-line function and list every side effect" | Cursor + Claude Code | ~45 min → 3 min |
| Test | "Write a characterization test for each code path in this function" | Copilot | ~2 hours → 8 min |
| Extract | "Extract lines 120-180 into a pure function named calculate_tax" | Claude Code | ~30 min → 2 min per function |
| Verify | "Run pytest on the refactored module — fix any failures" | All tools | ~15 min → automated |
Prompt pattern: dependency mapping
The most important prompt in the entire workflow is the one you write before asking the AI to change anything. We used this exact prompt pattern in Cursor and Claude Code:
The AI output is a structured dependency map you can use as a checklist. We found that without this map, the AI would suggest refactors that broke callers because it didn't see the full propagation chain.
For a 2,000-line file, the map took about 8 seconds to generate. Manually it would have taken a senior developer 30-40 minutes of reading and tracing.
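The AI's map can also be cross-checked mechanically before it is trusted as a checklist. The sketch below is one possible approach, not the prompt or tooling used in the session. It uses Python's standard `ast` module to list, for each top-level function, its parameters, the calls it makes, and the names it reads without assigning them (module globals, imports, other functions, builtins). `LEGACY_SOURCE`, `map_functions`, and the sample functions are hypothetical stand-ins, and `ast.unparse` assumes Python 3.9 or later.

```python
import ast

# Hypothetical stand-in for a slice of a legacy module.
LEGACY_SOURCE = '''
TAX_RATE = 0.2

def apply_tax(amount):
    return amount * TAX_RATE

def bill_customer(customer, amount):
    total = apply_tax(amount)
    db.save(customer, total)
    return total
'''


def map_functions(source):
    """Return a rough dependency map for every top-level function in source."""
    tree = ast.parse(source)
    result = {}
    for node in tree.body:
        if not isinstance(node, ast.FunctionDef):
            continue
        params = [arg.arg for arg in node.args.args]
        local_names = set(params)  # parameters plus anything assigned in the body
        calls, loads = set(), set()
        for child in ast.walk(node):
            if isinstance(child, ast.Call):
                # Record the call target as written, e.g. "db.save".
                calls.add(ast.unparse(child.func))
            elif isinstance(child, ast.Name):
                if isinstance(child.ctx, ast.Store):
                    local_names.add(child.id)
                elif isinstance(child.ctx, ast.Load):
                    loads.add(child.id)
        result[node.name] = {
            "params": params,
            "calls": sorted(calls),
            # Names read but never bound locally: candidates for hidden coupling.
            "free_names": sorted(loads - local_names),
        }
    return result


if __name__ == "__main__":
    for name, info in map_functions(LEGACY_SOURCE).items():
        print(name, info)
```

On the sample source, `bill_customer` reports `apply_tax` and `db` as free names, which is the kind of hidden dependency the map phase is meant to surface. A static pass like this cannot see dynamic dispatch or monkey-patching, so it complements the AI's map rather than replacing it.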
Characterization tests: the safety net
Before any refactoring begins, you need a test suite that captures current behavior — even if that behavior is buggy. Characterization tests (also called golden master or snapshot tests) record the output of the legacy code and compare against it after each change.
We asked the AI for characterization tests with this prompt:
The characterization tests caught two surprises: a caching layer we didn't know existed (the function sometimes returned stale data) and a side-effect dependency on a global config object. Without the tests, the extract phase would have silently broken both behaviors.
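As a minimal sketch of the golden master idea (not the tests generated in the session), the code below records the current outputs of a function to a JSON file and later reports which cases have drifted. `legacy_invoice_total`, `CONFIG`, `CASES`, and both helper functions are hypothetical names chosen for illustration; the global `CONFIG` mimics the kind of hidden dependency described above.

```python
import json
from pathlib import Path

# Hypothetical global that the legacy code reads implicitly.
CONFIG = {"currency": "USD", "rounding": 2}


def legacy_invoice_total(items):
    """Stand-in for a legacy function whose output depends on CONFIG."""
    subtotal = sum(price * qty for price, qty in items)
    return f"{round(subtotal, CONFIG['rounding'])} {CONFIG['currency']}"


# Inputs chosen to exercise distinct code paths (assumed for the example).
CASES = {
    "empty": [],
    "single": [(9.99, 1)],
    "multiple": [(9.99, 2), (0.5, 3)],
}


def record_golden(func, cases, path):
    """Run func on every case and store the observed outputs as the golden master."""
    golden = {name: func(args) for name, args in cases.items()}
    Path(path).write_text(json.dumps(golden, indent=2, sort_keys=True))
    return golden


def check_against_golden(func, cases, path):
    """Return the names of cases whose current output differs from the recording."""
    golden = json.loads(Path(path).read_text())
    return [name for name, args in cases.items() if func(args) != golden[name]]
```

Record once against the untouched legacy code, then call `check_against_golden` after every extraction; an empty list means behavior is unchanged for the recorded cases. The recording captures whatever the code does today, bugs included, so a mismatch signals a behavior change rather than a correctness verdict.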
One-function-at-a-time extraction
The biggest risk in legacy refactoring is the "big rewrite" — touching everything at once, breaking everything at once. We enforced single-function extraction using this prompt template:
We extracted 14 functions from the billing module in one session. Each extraction took 30-90 seconds with Claude Code. Two required a rollback because the AI missed a global reference; committing before each extraction and reverting with git checkout kept the session safe.
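The per-function checkpoint habit can be scripted. The following is a rough sketch under assumptions, not the commands from the session: `checkpoint_commands`, `rollback_commands`, and `run_all` are hypothetical helpers that build ordinary git invocations and run them through `subprocess`. `--allow-empty` is used so the checkpoint commit succeeds even when nothing has changed since the last one.

```python
import subprocess


def checkpoint_commands(function_name):
    """Commands that snapshot the working tree before extracting one function."""
    message = f"checkpoint before extracting {function_name}"
    return [
        ["git", "add", "-A"],
        ["git", "commit", "--allow-empty", "-m", message],
    ]


def rollback_commands(path):
    """Commands that restore one file to its state at the last commit."""
    return [["git", "checkout", "HEAD", "--", path]]


def run_all(commands, cwd="."):
    """Execute each command in order; check=True raises on the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=cwd, check=True)
```

A typical loop would call `run_all(checkpoint_commands("calculate_tax"))`, let the assistant perform the extraction, run the characterization tests, and call `run_all(rollback_commands("billing.py"))` only if they fail. The file name `billing.py` is an assumption for the example.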
Real transcript: refactoring a tax calculator
Here is an excerpt from the actual session. The legacy function was 184 lines of intertwined tax calculation, discount logic, and database updates. We asked the AI to extract just the tax component:
The pure function is 8 lines. The tax logic it replaced was 48 lines of interleaved calculation and I/O inside the 184-line original. Separating the concerns made the tax logic testable in isolation and easier to review. The AI did the mechanical separation in 12 seconds.
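To illustrate the shape of such a separation, here is a minimal sketch with hypothetical names and rates; it is not the code from the session. `calculate_tax` depends only on its arguments, while `apply_tax_to_invoice` is the thin wrapper that keeps the I/O and delegates the arithmetic. `FakeInvoiceStore` stands in for the database layer, and the half-up rounding to cents is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def calculate_tax(subtotal, rate):
    """Pure function: no database access, no globals, same output for same input."""
    tax = Decimal(subtotal) * Decimal(rate)
    # Rounding rule is assumed for the example: half-up to two decimal places.
    return tax.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


class FakeInvoiceStore:
    """Hypothetical stand-in for the persistence layer used by the legacy code."""

    def __init__(self):
        self.saved = {}

    def save_tax(self, invoice_id, tax):
        self.saved[invoice_id] = tax


def apply_tax_to_invoice(store, invoice_id, subtotal, rate):
    """Thin wrapper: performs the I/O and delegates the calculation."""
    tax = calculate_tax(subtotal, rate)
    store.save_tax(invoice_id, tax)
    return tax
```

Because `calculate_tax` takes strings or `Decimal` values and touches nothing else, it can be unit-tested without a database, while existing callers continue to use the wrapper until the verify phase.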
Limits and when to skip the AI
The four-phase workflow is not a silver bullet. We found it works well when:
- The legacy code is in a language the AI's training data covers well: Python, JavaScript, TypeScript, Java, Go. Avoid it for COBOL, Fortran, or internal DSLs.
- The file is under 3,000 lines. Beyond that, the file may no longer fit in the AI's context window, and the dependency map becomes too large to verify.
- You have git. Commit before every extraction; the AI will sometimes produce code that doesn't compile or run, and rollback is instant.
Skip the AI entirely when the legacy code is a known black box that "just works" and you don't have characterization tests. Without a test safety net, AI refactoring is faster than manual but still risky.