How to Use Claude Code or Codex for Real Coding Work
A practical Claude Code and Codex workflow for software engineers: discuss the idea first, lock the implementation plan, break work into epics, ship task by task, and write strong unit tests.
If you want to use Claude Code or Codex for real coding work, the biggest mistake is asking the agent to build before the architecture is clear.
Most bad experiences with AI coding agents come from using them too early and too vaguely.
People open Claude Code or Codex, paste a rough idea, and say "build this." Sometimes it works. More often it produces code that looks convincing but has weak architecture, sloppy edge-case handling, and very uneven tests.
The workflow that works for me is slower at the beginning and much faster by the middle. I do not start with implementation. I start with thinking.
The short version: Use Claude Code or Codex like a fast implementation partner, not a slot machine. Discuss the idea first, lock the architecture, break the work into small tasks, review every diff, and take testing seriously.
Do Not Start in Execution Mode
If the tool has a plan mode, I usually leave it off at the beginning.
Early-stage work should stay conversational. I want the model to help pressure-test the idea before it starts touching files. This is where the strongest model and higher reasoning settings help the most. If the task is ambiguous, architectural, or risky, I will often use the best model available with high or max reasoning.
The goal at this stage is not code. The goal is clarity.
Here is the kind of prompt that works well:
```
I want to build a subscription analytics dashboard for B2B SaaS teams.
Do not write code yet.
Ask me the most important product and architecture questions first.
Then propose a detailed implementation plan that covers:
- data model
- API boundaries
- auth and permissions
- UI structure
- likely failure points
Give me tradeoffs between two approaches and recommend one.
```
This usually surfaces the real issues early:
- Do I need multi-tenant isolation?
- Is this server-rendered or highly interactive?
- Where does caching belong?
- Which part is actually risky?
If you skip this stage, the agent starts making assumptions for you.
When I switch to the best model and high or max reasoning
I do that when the task is architectural, ambiguous, or expensive to get wrong. Greenfield projects, auth boundaries, data modeling, caching strategy, and tricky refactors are good examples.
I usually do not spend the strongest model on simple CRUD work or repetitive implementation. That is where a cheaper, faster execution loop is usually enough.
Lock the Plan Before You Write Code
Once the back-and-forth conversation has stabilized, I ask the agent to write down the plan and the reasoning behind it.
This matters more than most people think.
You want a durable record of:
- What you are building
- How you plan to build it
- Why you chose this approach over alternatives
- Where the risks are
- How it should be tested
I usually ask for something like this:
```
Write the final implementation plan as a working spec.
For each major decision, include:
- chosen approach
- rejected alternative
- why we rejected it
- files or modules likely to change
- testing strategy
Keep it concise but detailed enough that we can execute from it.
```
That document becomes the operating manual for the rest of the work. It also stops the project from drifting every time the conversation gets long.
Plan the Sprint in Epics, Not Random Tasks
After the implementation plan is fixed, I ask the agent to turn it into a sprint plan.
This is another place where people lose control. They ask the agent to build everything at once. That usually creates giant diffs, mixed concerns, and poor reviewability.
A better approach is to break the work into deliverable epics with clear tasks.
For example:
- Foundation and schema setup
- Auth and role boundaries
- Analytics queries and API routes
- Dashboard UI
- Testing, hardening, and cleanup
Then each epic gets smaller tasks that can be implemented safely in one pass.
The prompt can be simple:
```
Turn the implementation plan into a sprint plan.
Break it into deliverable epics.
For each epic, include:
- goal
- task list
- dependencies
- acceptance criteria
- key risks
Keep tasks small enough that one agent run can complete them cleanly.
```
This does two useful things:
- It gives you a clean execution order
- It gives the agent smaller, safer units of work
Why epics matter: The point of epics is not project management theater. It is to keep the AI agent from mixing schema work, UI work, auth work, and tests into one giant diff that is painful to review.
Execute One Task at a Time
Once the sprint exists, I stop discussing the whole project and start operating task by task.
This is where Claude Code and Codex become much more useful. They are very good when the scope is specific and the definition of done is clear.
A good execution prompt looks like this:
```
Take Epic 2, Task 1: add role-based access middleware.
Constraints:
- follow the existing auth pattern
- do not add dependencies
- keep the middleware isolated from page components
- add or update tests for the new behavior
When done, give me:
- files changed
- what was implemented
- what was tested
- any open questions
```
That is dramatically better than saying "add RBAC."
The more specific the task package is, the less time you spend correcting avoidable mistakes.
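For a task like the role-based access one above, the returned diff might center on something in this shape. This is only a sketch under assumptions: the `Role` union, the rank table, `requireRole`, and the error messages are all hypothetical names for illustration, not output from a real run.

```typescript
// Hypothetical role-based access guard, kept framework-agnostic so it stays
// isolated from page components (one of the task constraints).
type Role = "viewer" | "editor" | "admin";

interface SessionUser {
  id: string;
  role: Role;
}

// Higher roles inherit lower-role permissions via a simple rank ordering.
const roleRank: Record<Role, number> = { viewer: 0, editor: 1, admin: 2 };

// Returns a guard that throws unless the user meets the minimum role.
function requireRole(minimum: Role) {
  return (user: SessionUser | null): SessionUser => {
    if (!user) throw new Error("Not authenticated");
    if (roleRank[user.role] < roleRank[minimum]) {
      throw new Error(`Forbidden: requires ${minimum}`);
    }
    return user;
  };
}

// Usage inside a route handler:
const requireEditor = requireRole("editor");
requireEditor({ id: "u1", role: "admin" }); // admin satisfies editor
```

A small, self-contained unit like this is also easy for the agent to test against the acceptance criteria from the task package.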
What I expect back after each task
- A short summary of what changed
- The exact files touched
- What tests were added or updated
- Any compromises, assumptions, or open questions
Review Every Task Like a Real PR
This part is non-negotiable.
Do not treat AI-generated code as either magical or disposable. Review it the same way you would review code from a smart but fast-moving teammate.
The questions I care about are:
- Does the architecture still make sense?
- Did the agent solve the right problem or just the easiest version?
- Is the code over-engineered?
- Did it silently change behavior outside the task boundary?
- Are the names, abstractions, and file placements coherent?
AI agents are often directionally correct but locally messy. Review is where the quality comes back.
Unit Testing Is Where You Save Yourself
This is the part I would emphasize the most.
AI agents can write plausible code very quickly. They can also write plausible bugs very quickly. Unit tests are one of the few reliable defenses against that.
The mistake is letting the agent improvise the test strategy after the implementation is already written. I prefer to be very deliberate here.
The main risk: AI-written code often looks more correct than it actually is. The danger is not obviously broken code. The danger is subtle behavior that passes casual review and fails later in production.
Before or during implementation, I ask for the test plan explicitly:
```
Before finalizing the code, list the unit test matrix for this task.
Cover:
- happy path
- boundary conditions
- invalid input
- permission failures
- regression cases
Call out any behavior that should be integration-tested instead of unit-tested.
```
That one step catches a lot of weak testing.
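In code, a test matrix like the one the prompt asks for often lands as a data-driven table of named cases. Here is a minimal sketch of the idea; `validateQuantity` is a hypothetical function invented purely for illustration.

```typescript
// Hypothetical validator used only to show a data-driven test matrix.
function validateQuantity(input: unknown): number {
  if (typeof input !== "number" || Number.isNaN(input)) {
    throw new Error("Invalid input");
  }
  if (!Number.isInteger(input)) throw new Error("Must be an integer");
  if (input < 1 || input > 999) throw new Error("Out of range");
  return input;
}

// Each row names the case, so the matrix doubles as documentation of intent.
const matrix: Array<{ name: string; input: unknown; ok: boolean }> = [
  { name: "happy path", input: 5, ok: true },
  { name: "lower boundary", input: 1, ok: true },
  { name: "upper boundary", input: 999, ok: true },
  { name: "below range", input: 0, ok: false },
  { name: "above range", input: 1000, ok: false },
  { name: "invalid input: non-integer", input: 1.5, ok: false },
  { name: "invalid input: string", input: "5", ok: false },
  { name: "invalid input: NaN", input: NaN, ok: false },
];

for (const row of matrix) {
  let passed = true;
  try {
    validateQuantity(row.input);
  } catch {
    passed = false;
  }
  if (passed !== row.ok) throw new Error(`matrix case failed: ${row.name}`);
}
```

The point is not this particular validator; it is that the matrix forces boundary and invalid-input cases to be enumerated before the implementation hardens around the happy path.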
What I check in AI-written unit tests before approving them
- Whether the tests challenge behavior instead of mirroring implementation details
- Whether permission failures and invalid input are covered
- Whether mocks are lightweight enough that the test still proves something real
- Whether edge cases are named clearly enough to document intent
- Whether a few cases should really be integration tests instead
What Good AI-Written Tests Look Like
Bad AI-written tests usually have one of these problems:
- They only test the happy path
- They assert implementation details instead of behavior
- They mock so much that the test proves nothing
- They miss authorization and failure cases
- They duplicate the code's assumptions instead of challenging them
Good tests do the opposite.
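To make the contrast concrete, here is a small sketch using a hypothetical `formatPrice` function (an assumption for illustration, not from any real codebase): the first assertion mirrors the implementation, the later ones test behavior.

```typescript
// Hypothetical price formatter, used only to contrast two styles of test.
function formatPrice(cents: number): string {
  if (!Number.isInteger(cents) || cents < 0) throw new Error("Invalid amount");
  return `$${(cents / 100).toFixed(2)}`;
}

// Bad: re-derives the expected value with the same arithmetic the function
// uses, so it duplicates the code's assumptions and can never catch a bug.
if (formatPrice(1234) !== `$${(1234 / 100).toFixed(2)}`) {
  throw new Error("bad-style test failed");
}

// Good: asserts independently known behavior, including failure cases.
if (formatPrice(1234) !== "$12.34") throw new Error("expected $12.34");
if (formatPrice(0) !== "$0.00") throw new Error("expected $0.00");
let rejected = false;
try {
  formatPrice(-1);
} catch {
  rejected = true;
}
if (!rejected) throw new Error("negative amounts must be rejected");
```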
Imagine the agent is implementing a function that archives a project. A useful test suite would not stop at "project archives successfully." It would also check:
- a user cannot archive another workspace's project
- archiving an already archived project is handled correctly
- audit logs are written
- invalid project IDs fail cleanly
A simple example:
```typescript
it("rejects cross-workspace archive attempts", async () => {
  await expect(
    archiveProject({
      actorWorkspaceId: "workspace-a",
      projectWorkspaceId: "workspace-b",
      projectId: "proj_123",
    })
  ).rejects.toThrow("Forbidden");
});
```
That kind of test is valuable because it checks a real failure mode, not just a happy-path output.
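For context, here is a minimal sketch of an `archiveProject` that tests like that could exercise. The request shape, the in-memory store, and the audit log are assumptions for illustration only, not a real implementation.

```typescript
// Hypothetical request shape and in-memory state, assumed for this sketch.
interface ArchiveRequest {
  actorWorkspaceId: string;
  projectWorkspaceId: string;
  projectId: string;
}

const archived = new Set<string>();
const auditLog: string[] = [];

// Cross-workspace isolation is the failure mode the test above targets.
function assertSameWorkspace(actorWorkspaceId: string, projectWorkspaceId: string): void {
  if (actorWorkspaceId !== projectWorkspaceId) {
    throw new Error("Forbidden");
  }
}

async function archiveProject(req: ArchiveRequest): Promise<void> {
  assertSameWorkspace(req.actorWorkspaceId, req.projectWorkspaceId);
  // Archiving an already archived project is treated as a no-op here;
  // whether it should instead be an error is exactly the kind of decision
  // the test suite should pin down explicitly.
  if (archived.has(req.projectId)) return;
  archived.add(req.projectId);
  auditLog.push(`archived ${req.projectId} by ${req.actorWorkspaceId}`);
}
```

Notice how each behavior in the list above maps to one branch of the sketch, which is what makes the test suite easy to enumerate.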
The Workflow in Practice
If I simplify the whole process, it looks like this:
- Keep the early conversation open and exploratory.
- Use stronger models and higher reasoning for architecture and ambiguous decisions.
- Lock the implementation plan and record the rationale.
- Turn the plan into sprint epics and concrete tasks.
- Ask the agent to implement one task at a time.
- Review every task carefully.
- Be extremely attentive to unit tests and edge cases.
That is the difference between using AI as autocomplete and using it as an engineering partner.
A simple rule: Never let the agent earn the right to implement by writing lots of code. Let it earn the right by showing a sound plan, clear tradeoffs, and a test strategy you trust.
Where People Usually Go Wrong
Most failures come from one of these habits:
- Starting implementation before the idea is clear
- Asking for huge end-to-end builds in one shot
- Not writing down decisions and tradeoffs
- Skipping review because the output "looks right"
- Treating tests as optional or low priority
If you fix those five things, your success rate with Claude Code or Codex goes up immediately.
Final Thought
The best way to use Claude Code or Codex is not to hand over the keyboard and hope for the best.
Use them for what they are exceptionally good at: executing clearly defined work, moving fast through mechanical implementation, and helping you maintain momentum.
But keep ownership of the hard parts: the architecture, the planning, the sequencing, the review, and especially the tests.
That is where good software still gets made.