AI Coding Assistant for Refactoring Legacy Code

Legacy code is the silent tax every growing codebase pays. An AI coding assistant can cut that tax by mapping dependencies, generating test harnesses, and executing mechanical refactors in minutes instead of weeks. Here is the exact workflow we used to refactor a 20k-line Python billing module — prompt patterns included.

Free | Last tested: 2026-06-30 | Audience: Developers, Tech Leads

Why legacy code is harder for AI than greenfield

An AI coding assistant that excels at writing new functions from scratch often struggles with legacy code. The reason is not capability but context: legacy code accumulates implicit assumptions, undocumented edge cases, and coupling that no single file reveals.

We tested four AI coding assistants (Cursor, GitHub Copilot, Claude Code, and Cody) on a Python billing module last updated in 2021. The module had no tests, used a custom ORM wrapper, and buried business logic inside 400-line functions. The greenfield prompt — "write a billing module" — produced clean but structurally incompatible code. The refactoring prompt — "map the dependencies of this function, then extract it" — succeeded.

The key insight: AI coding assistants handle legacy refactoring well when given the right scaffolding prompts. Ask them to read first, restructure second, write third.

The four-phase legacy refactor workflow

After iterating through 12 prompt variations, we settled on a four-phase sequence that worked reliably across all four tools:

1. Map phase: Ask the AI to read the file and describe every function's inputs, outputs, side effects, and callers. This surfaces implicit dependencies before you touch anything.
2. Test phase: Ask the AI to generate characterization tests (aka golden master tests) that capture current behavior. Run them. They should pass on the legacy code; if they fail, the test encodes a wrong assumption about the current behavior.
3. Extract phase: One function at a time, ask the AI to extract a pure version. Keep the old function as a thin wrapper that delegates. Run the characterization tests again; they must still pass.
4. Verify phase: Once all extractions are done, drop the wrappers, rename, and re-run the tests. Any regression means the AI missed a side effect in step 1.

Phase	Prompt trigger	Tool used	Time (manual → AI-assisted)
Map	"Read this 400-line function and list every side effect"	Cursor + Claude Code	~45 min → 3 min
Test	"Write a characterization test for each code path in this function"	Copilot	~2 hours → 8 min
Extract	"Extract lines 120-180 into a pure function named calculate_tax"	Claude Code	~30 min → 2 min per function
Verify	"Run pytest on the refactored module — fix any failures"	All tools	~15 min → automated
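The "thin wrapper" idea from the extract phase can be sketched as below. This is a minimal illustration, not code from the billing module: calculate_tax is the name used in the table, while legacy_calculate_tax, the shape of the order dict, and the audit_log list are hypothetical stand-ins.

```python
# Hypothetical sketch: names and data shapes are assumptions for illustration.

def calculate_tax(subtotal: float, rate: float) -> float:
    """Extracted pure function: no DB access, no I/O, no globals."""
    return round(subtotal * rate, 2)


def legacy_calculate_tax(order: dict, audit_log: list) -> float:
    """Old entry point, kept as a thin wrapper so existing callers keep working.

    The side effect (appending to an audit log) stays here,
    outside the pure function.
    """
    tax = calculate_tax(order["subtotal"], order["tax_rate"])
    audit_log.append(("tax_computed", order["id"], tax))
    return tax
```

Because callers still go through the wrapper, the characterization tests exercise the new pure function without any call site changing.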

Prompt pattern: dependency mapping

The most important prompt in the entire workflow is the one you write before asking the AI to change anything. We used this exact prompt pattern in Cursor and Claude Code:

Read the Python file below and produce a dependency map in this format:

Function: calculate_invoice(invoice_id)
Returns: Invoice dict
Side effects: Writes to DB table `invoices_audit`, sends email via `notify.send()`
Callers: generate_monthly_report(), refund_processor()
Mutates global: None
Reads env var: BILLING_CURRENCY

Do this for every non-private function in the file. If a function calls a function outside this file, mark it as EXTERNAL and list the import path.

The AI output is a structured dependency map you can use as a checklist. We found that without this map, the AI would suggest refactors that broke callers because it didn't see the full propagation chain.

For a 2,000-line file, the map took about 8 seconds to generate. Manually it would have taken a senior developer 30-40 minutes of reading and tracing.
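One way to spot-check an AI-generated map is to compare it against what Python's own parser sees. The sketch below is an assumption on our part rather than part of the original workflow: map_functions and the SAMPLE source are hypothetical, and the approach only catches static calls and explicit global statements (it needs Python 3.9+ for ast.unparse).

```python
import ast


def map_functions(source: str) -> dict:
    """Map each public top-level function to the calls and globals it uses."""
    tree = ast.parse(source)
    result = {}
    for node in tree.body:
        # Skip non-functions and private helpers (leading underscore).
        if not isinstance(node, ast.FunctionDef) or node.name.startswith("_"):
            continue
        calls = set()
        declared_globals = set()
        for child in ast.walk(node):
            if isinstance(child, ast.Call):
                # Render the callee as source text, e.g. "notify.send".
                calls.add(ast.unparse(child.func))
            elif isinstance(child, ast.Global):
                declared_globals.update(child.names)
        result[node.name] = {
            "calls": sorted(calls),
            "globals": sorted(declared_globals),
        }
    return result


# Hypothetical legacy snippet used only to exercise the function.
SAMPLE = '''
import notify

def calculate_invoice(invoice_id):
    global CACHE
    invoice = load(invoice_id)
    notify.send(invoice)
    return invoice

def _helper():
    pass
'''

if __name__ == "__main__":
    print(map_functions(SAMPLE))
```

Anything the AI's map lists that this check does not (or the reverse) is worth reading by hand. Dynamic dispatch and reflection will not show up in either.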

Characterization tests: the safety net

Before any refactoring begins, you need a test suite that captures current behavior — even if that behavior is buggy. Characterization tests (also called golden master or snapshot tests) record the output of the legacy code and compare against it after each change.

We asked the AI for characterization tests with this prompt:

Write a pytest characterization test for calculate_invoice. Call it with three scenarios:
1. A standard invoice with one line item
2. An invoice with a discount applied
3. An invoice that triggers the edge case at line 142 (division by zero guard)

The test should: call the real function, capture the full return value as a JSON snapshot, and assert it matches. Use pytest-check for soft assertions so all scenarios run even if one fails.

The characterization tests caught two surprises: a caching layer we didn't know existed (the function sometimes returned stale data) and a side-effect dependency on a global config object. Without the tests, the extract phase would have silently broken both behaviors.
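The snapshot mechanism itself is small enough to sketch by hand. The helper below is a hypothetical illustration using only the standard library, not the tests the AI generated for us; assert_matches_snapshot and the snapshot directory layout are assumptions.

```python
import json
from pathlib import Path


def assert_matches_snapshot(name: str, value, snapshot_dir: Path) -> None:
    """Record `value` on the first run; on later runs, fail if it differs."""
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    path = snapshot_dir / f"{name}.json"
    # sort_keys makes the comparison independent of dict ordering;
    # default=str keeps dates and Decimals serializable.
    current = json.dumps(value, sort_keys=True, indent=2, default=str)
    if not path.exists():
        path.write_text(current)  # first run: this becomes the golden master
        return
    expected = path.read_text()
    assert current == expected, f"Behavior changed for snapshot {name!r}"
```

In a pytest test you would call the real legacy function, pass its return value to this helper, and commit the generated JSON files so later runs compare against them. Note that a snapshot records buggy behavior just as faithfully as correct behavior, which is the point of a characterization test.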

One-function-at-a-time extraction

The biggest risk in legacy refactoring is the "big rewrite" — touching everything at once, breaking everything at once. We enforced single-function extraction using this prompt template:

Extract lines 120-175 from calculate_invoice into a new function called compute_line_items(items: list) -> list.

Rules:
- The new function must be pure (no DB access, no IO, no globals)
- The original function must still call it and pass the same tests
- Do NOT modify any other function in this file
- Add pytest fixtures for any test data I need to run it standalone

Show me both the extracted function and the modified original function.

We extracted 14 functions from the billing module in one session. Each extraction took 30-90 seconds with Claude Code. Two required a rollback because the AI missed a global reference; committing before each extraction and reverting with git checkout kept the session safe.
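A missed global reference is the kind of slip a quick bytecode check can flag before you run anything. The following is a rough sketch we are adding for illustration, not a tool used in the session: global_references, TAX_RATE, impure_tax, and pure_tax are hypothetical names, and the check looks only at the function's own bytecode (not nested functions or attribute mutation).

```python
import builtins
import dis


def global_references(func) -> set:
    """Return names that `func` loads from module scope, excluding builtins."""
    names = set()
    for instr in dis.get_instructions(func):
        if instr.opname == "LOAD_GLOBAL" and not hasattr(builtins, instr.argval):
            names.add(instr.argval)
    return names


TAX_RATE = 0.2  # hypothetical module-level global


def impure_tax(subtotal):
    # Reads TAX_RATE from module scope, so it is not pure.
    return round(subtotal * TAX_RATE, 2)


def pure_tax(subtotal, tax_rate):
    # Everything it needs arrives through parameters.
    return round(subtotal * tax_rate, 2)
```

Here global_references(impure_tax) reports TAX_RATE, while pure_tax yields an empty set. Calls to other module-level functions also show up in the result, which is useful: each one is a dependency to confirm against the map.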

Real transcript: refactoring a tax calculator

Here is an excerpt from the actual session. The legacy function was 184 lines of intertwined tax calculation, discount logic, and database updates. We asked the AI to extract just the tax component:

[User] Extract the tax calculation logic (lines 42-89) into a pure function compute_tax(subtotal, tax_rate, is_tax_exempt) -> float. Original function calculate_total calls the DB inside this range and writes to `invoice_taxes` table; move those side effects outside the pure function.

[AI] Here is the extracted function:

    def compute_tax(subtotal: float, tax_rate: float, is_tax_exempt: bool) -> float:
        if is_tax_exempt:
            return 0.0
        if tax_rate <= 0:
            return 0.0
        return round(subtotal * tax_rate, 2)

And the modified calculate_total now:
1. Calls compute_tax(subtotal, tax_rate, is_tax_exempt)
2. Separately writes the result to invoice_taxes (DB call kept outside)
3. The characterization test still passes: ✓

The pure function is 8 lines. The original was 48 lines of interleaved calculation and I/O. By separating the concerns, we made the tax logic testable in isolation and visible to code review. The AI did the mechanical separation in 12 seconds.
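To show what "testable in isolation" looks like, here is a hedged sketch. compute_tax is copied from the transcript; everything else is an assumption: the calculate_total signature, the write_tax_row callback standing in for the invoice_taxes write, and the test names are all hypothetical.

```python
def compute_tax(subtotal: float, tax_rate: float, is_tax_exempt: bool) -> float:
    """Pure tax calculation, as extracted in the transcript."""
    if is_tax_exempt:
        return 0.0
    if tax_rate <= 0:
        return 0.0
    return round(subtotal * tax_rate, 2)


def calculate_total(subtotal, tax_rate, is_tax_exempt, write_tax_row) -> float:
    """Hypothetical shape of the caller: the side effect stays out here."""
    tax = compute_tax(subtotal, tax_rate, is_tax_exempt)
    write_tax_row(tax)  # stand-in for the invoice_taxes DB write
    return round(subtotal + tax, 2)


# Pure-function tests need no database, fixtures, or mocks.
def test_exempt_customer_pays_no_tax():
    assert compute_tax(200.0, 0.25, True) == 0.0


def test_non_positive_rate_yields_zero():
    assert compute_tax(200.0, -0.1, False) == 0.0


def test_standard_rate():
    assert compute_tax(200.0, 0.25, False) == 50.0
```

Passing the writer in as a parameter is one design choice among several; the point is only that the calculation can now be exercised without touching the database.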

Limits and when to skip the AI

The four-phase workflow is not a silver bullet. We found it works well when:

The legacy code is in a language the AI training data covers well — Python, JavaScript, TypeScript, Java, Go. Avoid it for COBOL, Fortran, or internal DSLs.
The file is under 3,000 lines. Beyond that, the model can lose track of parts of the file within its context window, and the dependency map becomes too large to verify by hand.
You have git — commit before every extraction. The AI will sometimes produce code that doesn't compile. Rollback is instant.

Skip the AI entirely when the legacy code is a known black box that "just works" and you don't have characterization tests. Without a test safety net, AI refactoring is faster than manual but still risky.
