# Test-driven development for AI agents

Date: 2026-03-19

## Background

Test-driven development (TDD) is a software development methodology in which tests are written _before_ the code they validate. The classic TDD cycle is:

1. **Red**: Write a test for the desired behavior. Run it and confirm it fails.
2. **Green**: Write the minimum code to make the test pass.
3. **Refactor**: Clean up the code while keeping the tests green.

This approach has well-documented benefits: it forces the developer to think about the desired behavior before writing code, it provides a built-in regression suite, and it creates a clear evidentiary trail showing that the code change actually addresses the problem.

AI coding agents often struggle with TDD, however. This document describes the most common pitfalls and provides concrete guidance on how to execute TDD correctly.

## The fundamental rule: separate what you know from what you write

The single most important principle for an AI agent doing TDD is this: **your internal reasoning must not leak into the artifacts you produce.**

During TDD, you will often know things that should not appear in the code or PR description. For example:

- You know a test is expected to fail right now, but the test should not say that.
- You know a bug exists, but the test should not comment on the bug.
- You know a feature has not been implemented yet, but the test should not mention that fact.

Tests, code, and PR descriptions should be written from the perspective of their final, intended state. A test describes _correct_ behavior. A PR description describes the _completed_ work. The fact that you are writing these artifacts at a moment when things are temporarily broken is irrelevant to what the artifacts should say.

Think of it this way: six months from now, someone will read these tests. They should see a clear specification of correct behavior, not a historical diary of the development process.

Individual commit messages, however, are different. A commit message describes a particular change being made at a particular point in time, so it is inherently temporal. When a commit is one step in a multi-step process, the commit message should explain how that commit fits into the process. In TDD, this means the commit that adds failing tests should say so: it should explain that the tests are expected to fail and that a follow-up commit will fix the code. This helps human readers understand the commit history, and it also helps automated review tools (such as CI bots that comment on individual commits) avoid raising false alarms about expected failures.

## Scenario 1: fixing a bug

When you discover a bug and want to fix it using TDD, follow these steps.

### Step 1: Write the test

Write a test (or multiple tests) that exercises the specific case where the bug occurs. The test should assert the _correct_ behavior -- the behavior the code _should_ have, not the behavior it currently has. Write the test exactly as it should appear in the final codebase.

**Do not** add comments saying the code has a bug. **Do not** add comments saying you expect the test to fail. **Do not** write the assertion backward to make the test pass temporarily. The test should look identical to what it will look like after the bug is fixed.

Here is an example. Suppose you discover that `truncate("hello world", 8)` returns `"hello wo..."` (length 11) when it should return `"hello..."` (length 8), because the function appends `"..."` without subtracting its length from `maxLength`.
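
For concreteness, the buggy implementation might look something like the following. This is a hypothetical sketch, not code from any real codebase:

```typescript
// Hypothetical buggy implementation: the ellipsis is appended after slicing
// to maxLength, so the result can be up to maxLength + 3 characters long.
function truncate(s: string, maxLength: number): string {
  if (s.length <= maxLength) {
    return s;
  }
  return s.slice(0, maxLength) + "...";
}
```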

Bad -- the test comments on the bug:

```typescript
test("truncate keeps the total length within maxLength", () => {
  // BUG: truncate appends "..." without subtracting its length from maxLength,
  // so the result is longer than maxLength.
  expect(truncate("hello world", 8)).toBe("hello...");
});
```

Good -- the test simply asserts the correct behavior:

```typescript
test("truncate keeps the total length within maxLength", () => {
  expect(truncate("hello world", 8)).toBe("hello...");
  expect(truncate("hello world", 5)).toBe("he...");
  expect(truncate("hello world", 3)).toBe("...");
});
```

The bad version leaks your internal knowledge ("I know there's a bug here") into the test itself. A reader of the final codebase has no use for that comment. The good version is a clean statement of the expected behavior and nothing more.

### Step 2: Run the test and verify it fails

Run the test suite (or just the new tests) and confirm that the tests you wrote do in fact fail. This is the "red" phase. The failure confirms that the bug is real and that your test is actually exercising the buggy code path.

If the test passes, something is wrong: either the bug has already been fixed, or (more likely) your test is not actually exercising the buggy case. Investigate before proceeding.

It is not enough to confirm that the test fails; you must also confirm that it fails _for the expected reason_. Read the test output and check that the failure is the one you anticipated -- for example, that the assertion received the known-buggy value. A test can fail for all sorts of reasons unrelated to the bug: an import error, a typo in a function name, a missing test fixture, a wrong number of arguments, etc. If the test fails for one of those reasons, it is not actually reproducing the bug, and a "fix" that makes the test pass might just be fixing the unrelated problem while leaving the real bug untouched. The red phase is only meaningful if the failure demonstrates the specific incorrect behavior you intend to fix.
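
Schematically, for the `truncate` example (the exact wording of the output depends on your test framework and version):

```typescript
// Failing for the expected reason -- the assertion reports the buggy value:
//
//   expect(received).toBe(expected)
//   Expected: "hello..."
//   Received: "hello wo..."
//
// Failing for the wrong reason -- the test never reaches the assertion,
// so it demonstrates nothing about the bug:
//
//   TypeError: truncate is not a function
```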

### Step 3: Commit the tests and push

Without fixing the bug, commit the new tests, push to GitHub, and create a PR (or push to an existing branch if a PR already exists). This does two things:

- CI will run the tests and record a failure, creating an objective record that the bug exists.
- The human reviewer can examine the test and confirm that it correctly describes the expected behavior before any code changes are made.

The commit message for this commit should explain the TDD context: these are tests for the correct behavior, they are expected to fail because the code has a bug, and the bug fix will come in a follow-up commit. This helps anyone reading the commit history understand why failing tests were committed intentionally. For example:

> Add tests for truncate length handling
>
> These tests assert the correct behavior of truncate() when maxLength is
> small. They are expected to fail on this commit because truncate()
> currently appends "..." without subtracting its length from maxLength,
> producing results longer than maxLength. The next commit will fix the
> bug.

Note that the _PR description_ is different from the commit message. The PR description should be written from the perspective of the _finished_ PR, not the intermediate state. See the section on PR descriptions below.

**Wait for CI to finish before proceeding.** If you push the fix commit before CI has completed for the test-only commit, the Jenkins build for the first commit will be aborted, and you will lose the concrete record of the test failure. That record is the whole point of this step.

### Step 4: Fix the bug

Now fix the bug in the code. This step should change only the production code, **not the tests**. If you find yourself needing to change the tests, that means the tests were not written correctly in step 1. Go back and understand why.
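
Continuing the `truncate` example, a minimal fix might look like the sketch below. The handling of `maxLength` values at or below the ellipsis length is one plausible choice, not the only one:

```typescript
function truncate(s: string, maxLength: number): string {
  const ellipsis = "...";
  if (s.length <= maxLength) {
    return s;
  }
  // Degenerate case: no room for any of the original string.
  if (maxLength <= ellipsis.length) {
    return ellipsis.slice(0, maxLength);
  }
  // Reserve room for the ellipsis so the total stays within maxLength.
  return s.slice(0, maxLength - ellipsis.length) + ellipsis;
}
```

The diff for this commit touches only the production code; the tests from step 1 pass unchanged.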

### Step 5: Verify the tests pass and push

Run the tests again and confirm they all pass. Commit the fix, push to GitHub, and wait for CI to pass.

### Step 6: Document the process in the PR

In the PR description's verification steps, explain the TDD evidence:

- Commit A added tests asserting the correct behavior; CI failed on those tests, confirming the bug is reproducible.
- Commit B fixed the bug without changing the tests; CI passed, confirming the fix.

This gives the reviewer a clear, verifiable chain of evidence.

## Scenario 2: adding a new function or feature

When building something new, TDD is equally valuable. The tests serve as a specification that can be reviewed before any implementation work begins.

### Step 1: Write the tests

Write a complete set of tests describing how the new function or feature should behave. These tests will not pass yet because the code they test does not exist, but the tests themselves should be written in their final form. Do not include comments like "this will fail because the function doesn't exist yet." Just write the tests as if the function already exists.
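
For example, a test-first specification for a hypothetical `slugify` function (the name and behavior here are purely illustrative) would be written in its final form from the start:

```typescript
test("slugify lowercases words and joins them with hyphens", () => {
  expect(slugify("Hello World")).toBe("hello-world");
});

test("slugify drops characters that are not alphanumeric", () => {
  expect(slugify("Hello, World!")).toBe("hello-world");
});
```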

Depending on the language and test framework, you may need to create stub implementations so the project compiles. For example, in TypeScript you might export an empty function that throws `new Error("not implemented")`, or in Python you might define a function that just has `pass` as its body. That is fine -- the point is to get the tests to _run_ (and fail), not to get them to _pass_.
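
For the hypothetical `slugify` above, a TypeScript stub might look like this:

```typescript
// Stub so the test file compiles and the tests run (and fail).
// The actual implementation comes in a follow-up commit.
export function slugify(input: string): string {
  throw new Error("not implemented");
}
```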

### Step 2: Commit the tests and push

Commit the tests, push to GitHub, and create a PR. The commit message should explain that these are tests for a function or feature that does not exist yet, and that the implementation will follow in a subsequent commit. At this stage, ask the human reviewer to look over the tests and confirm that they capture the correct expected behavior. This is an important checkpoint: it is much cheaper to iterate on the specification (the tests) before writing the implementation than to discover a misunderstanding after the implementation is complete.

There may be some back-and-forth here. The reviewer might ask for additional test cases, different edge-case handling, or changes to the expected behavior. Incorporate this feedback into the tests.

### Step 3: Implement the function or feature

Once the test suite is agreed upon, implement the function or feature to make all the tests pass. This step should not require changes to the tests. If it does, either the tests were wrong (go back to step 2) or the implementation is not matching the agreed-upon specification (fix the implementation, not the tests).

### Step 4: Verify and push

Run the tests, confirm they pass, commit the implementation, push to GitHub, and wait for CI to pass.

## Common pitfalls

### Pitfall 1: commenting on expected failure in tests

This is the most common mistake. When you know a test is going to fail, there is a strong temptation to document that fact in the test. Resist it.

Bad:

```python
def test_parse_date_with_timezone():
    # TODO: This test will fail until we implement timezone support
    assert parse_date("2026-03-18T12:00:00-05:00") == datetime(2026, 3, 18, 17, 0, 0)
```

Also bad:

```python
def test_parse_date_with_timezone():
    # KNOWN BUG: parse_date doesn't handle timezones correctly
    assert parse_date("2026-03-18T12:00:00-05:00") == datetime(2026, 3, 18, 17, 0, 0)
```

Good:

```python
def test_parse_date_with_timezone():
    assert parse_date("2026-03-18T12:00:00-05:00") == datetime(2026, 3, 18, 17, 0, 0)
```

The test is a statement of what `parse_date` should do, period.

### Pitfall 2: writing assertions that expect failure

Sometimes an AI agent, asked to write a test that is "expected to fail," interprets this as writing a test _that asserts the failure_ -- that is, a test that encodes the current buggy behavior. This inverts the intent entirely: the test now passes against the broken code when it should fail.

Bad -- asserting that the wrong behavior occurs:

```python
def test_pluralize_zero_count():
    # The current code treats 0 as singular due to a bug
    assert pluralize("cat", 0) == "cat"
```

Bad -- using `pytest.raises` or `assertRaises` to assert that the function crashes or misbehaves:

```python
def test_pluralize_zero_count():
    with pytest.raises(SomeError):
        pluralize("cat", 0)
```

Bad -- using an `xfail` marker or skip decorator so the failure is never reported:

```python
@pytest.mark.xfail(reason="Bug: 0 treated as singular")
def test_pluralize_zero_count():
    assert pluralize("cat", 0) == "cats"
```

Good -- asserting the correct behavior, even though you know it will fail right now:

```python
def test_pluralize_zero_count():
    assert pluralize("cat", 0) == "cats"
```

The good version will fail when run against the buggy code, which is exactly what we want. The failure is the evidence.

### Pitfall 3: writing PR descriptions for the intermediate state

A PR description should describe the state of the PR when it is _ready for review_, not an intermediate state during development.

Bad -- written after committing the failing tests but before fixing the bugs:

> Two of the 38 tests are intentionally written to verify _expected_ behavior
> that differs from the _current_ behavior, exposing the following bugs: [...]
>
> The follow-up commit will fix the bugs and all tests should pass at that point.

This description talks about what will happen in the future. By the time the PR is reviewed, it no longer accurately describes the state of the code.

Good -- written from the perspective of the completed PR:

> 38 tests cover the behavior of all exported functions. Two bugs were found
> and fixed:
>
> 1. `truncate` with `maxLength` ≤ ellipsis length: [description]
> 2. `pluralize` with a count of 0: [description]
>
> ### Evidence from CI
>
> - **Commit 1** (abc1234): Added 38 tests without code changes. CI failed,
>   confirming the two bugs are reproducible.
> - **Commit 2** (def5678): Fixed the two bugs without changing any tests.
>   CI passed, confirming the fixes.

If you need to create the PR before the fix commit exists (e.g., because you want early feedback on the tests), you can write a minimal initial description and then update it to its final form once all the commits are in. The version that exists when you request review should describe the finished work.

Note that this guidance applies to PR descriptions, not to individual commit messages. Commit messages are inherently temporal -- they describe a particular change at a particular point in time -- so it is appropriate for a commit message to say "these tests are expected to fail" or "the next commit will fix the bug." See the earlier discussion in "The fundamental rule" section.

### Pitfall 4: modifying tests after fixing the bug

If you have followed this process correctly, fixing the bug (or implementing the feature) should not require any changes to the tests. The tests were written to assert the correct behavior from the start. The fix makes the code match the tests, not the other way around.

If you find yourself wanting to change a test after fixing the code, ask why:

- Did you write the test incorrectly in step 1? If so, the failure you observed in the red phase may not have demonstrated the bug you thought it did, and the evidentiary chain is broken. Revisit step 1 instead of quietly editing the test.
- Did the fix change the expected behavior? If so, the fix might be wrong, or the original specification (test) might have been wrong. Either way, this warrants discussion with the human, not a quiet edit.

In a properly executed TDD cycle, the diff of the fix commit should contain zero changes to test files.

## Why TDD is uncomfortable for AI agents (and why that's OK)

AI agents are trained to produce correct, working outputs. TDD deliberately introduces a state where things are broken. This creates tension: the agent's instinct is to make everything pass, but TDD requires temporarily leaving tests in a failing state.

It helps to reframe the situation. During the "red" phase:

- The tests are not broken. The tests are correct. The tests describe the behavior the code _should_ have.
- The _code_ is broken. The code does not yet match the specification laid out by the tests.
- A failing test is not a problem to be solved by changing the test. It is evidence to be preserved.

The discomfort with failing tests sometimes manifests in subtle ways: adding `// BUG` comments, using `xfail` markers, writing PR descriptions that apologize for the failures, or rushing to fix the code before committing the tests. All of these undermine the point of TDD. The intermediate failure state is not an embarrassment; it is the evidence that the tests are actually testing something meaningful.

## See also

- [Writing good tests](writing-good-tests.md) -- general principles for writing effective tests, covering topics like choosing the right testing target, writing clear and readable tests, and managing test data.

## Checklist

Before committing tests during TDD, review this checklist:

- [ ] Tests assert the _correct_ (desired) behavior, not the current (buggy) behavior.
- [ ] Tests contain no comments referencing bugs, expected failures, or TODOs about the code under test.
- [ ] Tests do not use `xfail`, `skip`, `expectedFailure`, or similar mechanisms to suppress failure.
- [ ] Tests do not assert that incorrect behavior occurs (e.g., asserting the buggy output).
- [ ] Tests are written in the form they should have in the final codebase, requiring no modification after the code is fixed.

Before writing a commit message for a test-only commit:

- [ ] The commit message explains the TDD context: what the tests cover, that they are expected to fail, and why (e.g., the bug that exists or the feature not yet implemented).
- [ ] The commit message states that a follow-up commit will fix the code or add the implementation.

Before writing or updating a PR description:

- [ ] The description describes the state of the PR at the time of review, not an intermediate development state.
- [ ] The description does not use future tense to describe work that has already been done (e.g., "the follow-up commit will fix...").
- [ ] If the PR uses a TDD approach, the description explains the commit structure as evidence (e.g., "commit A added tests, CI failed; commit B fixed the bug, CI passed").
