Our prompt engineering folder structure

Most teams treat prompts like config files — flat, stringly-typed, lost to history within months. We treat them like code. This is the actual folder structure we have settled on across the Moonlabs portfolio after eighteen months of iteration, with the reasoning for each decision.

Nothing in this essay is novel. The point is that almost no team has actually written this down, and most early-stage teams reinvent something worse from scratch. We are publishing the convention so other operators can borrow it.

The top-level structure

Inside the application repo:

src/
  prompts/
    _shared/
      preamble.md
      runner.ts
      output-formats/
      personas/
    extract-contract-fields/
      v3.md
      v2.md
      v1.md
      tests/
        gold-cases.jsonl
        regression-snapshot.json
      README.md
    summarise-ticket-thread/
      v2.md
      v1.md
      tests/
      README.md
    ...

Five rules generate that structure. We work through them below.

Rule one: each prompt is a folder, not a file

The first instinct of every engineer encountering this is to put prompts in prompts/extract-contract-fields.md. A single file. Easy to find.

We have settled, after pain, on each prompt being a folder. Three reasons:

- Versions. Every prompt has a history: v3.md is current, v2.md is what shipped last month, v1.md is the first cut. Diffing them is trivial, and it pays off when a regression shows up six weeks later.
- Tests. Every prompt has evals, and they live next to the prompt in tests/. A flat file leaves the eval set with nowhere obvious to sit.
- Documentation. Every prompt has a README.md covering why it exists, what it is for, who owns it, what model it was tuned against, and what the known failure modes are. A single-file prompt has nowhere to put this.

The folder is therefore the natural unit, not the file. The cost is one extra directory level. The benefit is that the prompt now has the same engineering surface as any other unit of production code.
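One consequence of versioned folders is that calling code has to decide which file is current. A minimal sketch of that lookup, assuming the highest-numbered vN.md is the live one (the layout above implies this but does not state it as a rule). latestVersionFile is a hypothetical helper, not part of the convention:

```typescript
// Matches version files such as v1.md or v12.md, and nothing else.
const VERSION_FILE = /^v(\d+)\.md$/;

// Given the entries of one prompt folder, return the highest-numbered
// version file, or undefined if the folder has no version files.
export function latestVersionFile(fileNames: string[]): string | undefined {
  let best: { version: number; name: string } | undefined;
  for (const name of fileNames) {
    const match = VERSION_FILE.exec(name);
    if (match === null) continue; // README.md, tests/, and so on
    const version = Number(match[1]);
    if (best === undefined || version > best.version) {
      best = { version, name };
    }
  }
  return best?.name;
}
```

Comparing the numbers rather than sorting the names matters once a prompt reaches v10, which sorts before v2 as a string.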

Rule two: prompts are immutable once shipped

We never edit v3.md once it is in production. If we need to change it, we create v4.md. The old version stays exactly as it was.

The reason: if v3.md is in production and we discover a regression, we want to know what changed. Diffing v3 to v4 gives a clean answer. Editing v3 in place leaves that history recoverable only from version control, and only if someone knows which commit to look for.

The discipline is annoying for about three days. After that it is the only thing that makes sense. The cost of one extra file per change is trivial. The cost of debugging a regression with no audit trail is significant.
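The rule lends itself to a mechanical check. A sketch of one, assuming it is fed the text output of git diff --name-status against the main branch, and assuming any version file already on that branch counts as shipped. Both are assumptions, and immutabilityViolations is a hypothetical name:

```typescript
// Version files directly inside a prompt folder, e.g.
// src/prompts/extract-contract-fields/v3.md
const VERSIONED_PROMPT = /^src\/prompts\/[^/]+\/v\d+\.md$/;

// Takes the output of `git diff --name-status` (one "STATUS<TAB>path" entry
// per line; renames carry two paths) and returns every entry that modifies,
// deletes, or renames an existing version file. Added files are allowed.
export function immutabilityViolations(nameStatusOutput: string): string[] {
  const violations: string[] = [];
  for (const line of nameStatusOutput.split("\n")) {
    if (line.trim() === "") continue;
    const [status = "", ...paths] = line.split("\t");
    if (status === "A") continue; // a new vN.md is exactly what the rule asks for
    if (paths.some((path) => VERSIONED_PROMPT.test(path))) {
      violations.push(`${status} ${paths.join(" -> ")}`);
    }
  }
  return violations;
}
```

A CI step could fail the build whenever the returned list is non-empty.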

Rule three: tests are inline JSONL, not in a separate test framework

We have tried both. The fancy test framework approach failed in our hands for early-stage products. The reason: every fancy framework wants you to define schemas, fixtures, runners, reports. By the time you have set it up, you have spent two days that you should have spent shipping.

The lean alternative is a gold-cases.jsonl file in the prompt's tests/ folder, where each line is a JSON object of the form {"input": ..., "expected_property": ..., "notes": ...}, plus a small script in the project that runs them. It has lower setup cost and higher iteration speed. We have not yet hit a scale where the lean approach fails.

The script that runs them is generic. It lives in src/prompts/_shared/runner.ts. It takes the prompt file, the gold cases, and the model, and emits a pass/fail report. About 80 lines.
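For illustration, here is a much-reduced sketch of what such a runner could look like. It is not the runner.ts described above: the model call is injected as a function, expected_property is assumed to be a substring the output must contain, and every function name is hypothetical.

```typescript
// One line of gold-cases.jsonl. The meaning of expected_property is an
// assumption here: a substring that must appear in the model output.
interface GoldCase {
  input: string;
  expected_property: string;
  notes?: string;
}

interface CaseResult {
  input: string;
  output: string;
  passed: boolean;
}

// Stand-in for whatever client the project uses to call the model.
type ModelCall = (prompt: string, input: string) => Promise<string>;

// Parse JSONL text: one JSON object per non-empty line.
export function parseGoldCases(jsonl: string): GoldCase[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as GoldCase);
}

// Run every case sequentially and record whether the property held.
export async function runGoldCases(
  prompt: string,
  cases: GoldCase[],
  callModel: ModelCall,
): Promise<CaseResult[]> {
  const results: CaseResult[] = [];
  for (const goldCase of cases) {
    const output = await callModel(prompt, goldCase.input);
    const passed = output.includes(goldCase.expected_property);
    results.push({ input: goldCase.input, output, passed });
  }
  return results;
}

// Render a plain-text pass/fail report with a summary line.
export function formatReport(results: CaseResult[]): string {
  const lines = results.map((r) => `${r.passed ? "PASS" : "FAIL"}  ${r.input}`);
  const passedCount = results.filter((r) => r.passed).length;
  lines.push(`${passedCount}/${results.length} passed`);
  return lines.join("\n");
}
```

Under these assumptions, a hypothetical line in gold-cases.jsonl might read {"input": "Lease between A Ltd and B Ltd", "expected_property": "A Ltd", "notes": "party names"}. Injecting the model call keeps the runner independent of any particular provider and lets it be exercised with a stub.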

Rule four: the README is short and structured

Every prompt's README.md follows a five-line template:

- Purpose: (one sentence)
- Model: (which model + temperature this was tuned for)
- Owner: (who to ping when it breaks)
- Last reviewed: (date)
- Known failure modes: (bulleted)

Five lines. Five fields. Boring on purpose.

The reason for the rigidity is that without it, prompt READMEs drift into either nothing (empty file) or too much (an essay nobody reads). The five-line template is the sweet spot where every prompt has the minimum useful documentation and no engineer is tempted to skip the README "because it would take too long."

The single most useful field is Known failure modes. When a regression shows up at 4pm on a Friday, the on-call's first move is to read this line. Almost always, the failure has been documented before. The fix accelerates dramatically.
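Because the template is rigid, it is also cheap to check. A sketch of a lint function, assuming each field sits on its own line in the "- Field:" form shown above. missingReadmeFields is a hypothetical helper, and it checks only that the labels are present, not that they are filled in well:

```typescript
// The five fields of the README template, in template order.
const REQUIRED_FIELDS = [
  "Purpose",
  "Model",
  "Owner",
  "Last reviewed",
  "Known failure modes",
] as const;

// Return the template fields that do not appear as "- Field:" lines.
export function missingReadmeFields(readme: string): string[] {
  const lines = readme.split("\n").map((line) => line.trim());
  return REQUIRED_FIELDS.filter(
    (field) => !lines.some((line) => line.startsWith(`- ${field}:`)),
  );
}
```

An empty result means all five labels are present.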

Rule five: shared components live in `_shared/`

_shared/ holds three kinds of reusable prompt content:

- preamble.md: the standard model-of-self prompt prefix we use across all prompts in the company (sets the tone, identifies the system).
- output-formats/: reusable JSON schema fragments that get included into prompts via templating.
- personas/: reusable persona definitions (e.g. "You are a senior conveyancer reviewing this draft.").

The inclusion convention is {{> _shared/output-formats/contract-fields-v2.md }}, which is Handlebars partial syntax, or the equivalent include in whichever templating language the project uses. We have used Handlebars, Liquid, and a minimal custom syntax depending on the project's primary language. The templating choice does not matter much; the fact of shared components matters a lot.

The reason: across a year of running multiple AI features, we converged on roughly four output schemas and three personas that got reused everywhere. Putting them in _shared/ means there is one canonical copy to update when they need to change, and every prompt that includes them picks up the change.
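As a rough sketch of the "minimal custom syntax" case, the following expands {{> path }} markers from an in-memory map of file contents. The function name, the depth limit of 10, and the choice to throw on unknown paths are all assumptions made for the example:

```typescript
// Expand {{> some/path.md }} markers using a map from path to file contents.
// Included files may themselves contain includes; the depth limit guards
// against accidental cycles.
export function expandIncludes(
  template: string,
  files: Map<string, string>,
  depth = 0,
): string {
  if (depth > 10) {
    throw new Error("Include depth exceeded; possible include cycle");
  }
  const includePattern = /\{\{>\s*([^\s}]+)\s*\}\}/g;
  return template.replace(includePattern, (_match: string, path: string) => {
    const body = files.get(path);
    if (body === undefined) {
      throw new Error(`Unknown include: ${path}`);
    }
    return expandIncludes(body, files, depth + 1);
  });
}
```

Throwing on an unknown path means a typo in an include fails at build or test time rather than silently shipping a prompt with a hole in it.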

What we deliberately do not do

Three things we have tried, and rejected.

A central prompt registry service. Tools such as LangSmith want you to centralise prompts in a managed service with versioning, tagging, and CI integration. For mid-sized teams this is genuinely useful. For early-stage teams it is overkill: the cost of integrating and maintaining the service exceeds the cost of the folder convention. We reach for it around year three of a company; before that, files in a repo are sufficient.

A YAML metadata header on each prompt. Some teams put structured metadata at the top of each .md file (model, temperature, version, author). We tried this; engineers ignored it. The README convention takes about the same effort and has produced better results for us.

Auto-generated prompts from code. Several teams build their prompts in code, dynamically. Sometimes this is necessary (truly user-specific prompts). For 80% of prompts in a typical AI product, the prompt is mostly static with a few interpolations. Treating those as code rather than as content makes them harder to review and change. We default to static .md files with interpolation markers, and only use code-generated prompts when the static approach genuinely cannot capture the variability.
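The interpolation-marker syntax is not specified here. Assuming {{name}} markers, a static prompt can be filled in a few lines. fillTemplate is a hypothetical helper, and failing loudly on a missing variable is a design choice for this sketch:

```typescript
// Replace {{name}} markers with values from vars. Markers that start with
// another character, such as the {{> ... }} include form, are left untouched.
export function fillTemplate(
  template: string,
  vars: Record<string, string>,
): string {
  const marker = /\{\{\s*([A-Za-z_]\w*)\s*\}\}/g;
  return template.replace(marker, (_match: string, name: string) => {
    if (!Object.prototype.hasOwnProperty.call(vars, name)) {
      throw new Error(`Missing variable: ${name}`);
    }
    return String(vars[name]);
  });
}
```

The prompt itself stays a reviewable .md file; only the values are supplied by code.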

Why this matters for the products we ship

The reason any of this matters is that AI features in production break in ways that are not obvious from inspection. The same prompt that worked perfectly on Tuesday produces subtly wrong output on Thursday because a model update shipped, or because a new edge case came in from a user, or because we changed an upstream dependency without realising. The folder structure exists to make those breaks debuggable in minutes rather than hours.

We have evidence for this. In the year before we adopted the convention, our average time-to-resolve on a prompt regression was 4-7 hours. In the year after, it dropped to 35-55 minutes. The difference is almost entirely the folder structure plus the discipline of v3/v2/v1 immutability.

For a team running a single AI feature on the side of a larger product, this is a nice-to-have. For a team running ten or twenty AI features in production, it is the difference between staying on top of the system and silently losing control of it.

What we teach in the Academy

We teach this convention in week 4 of the Coding pillar at the Academy, after students have shipped their first AI feature and have started to feel the pain of prompts living loose in their codebase. By the end of the cohort, every student's project has the convention baked in. We have tried teaching it earlier; it does not land. The students need to feel the pain of the alternative first.

If you are reading this and you are a few months into running AI features in production, the move is not to retrofit the convention to every prompt today. The move is to start the convention with the next prompt you write and migrate the existing ones over time. Compounding.

The Moonlabs Academy teaches operator-grade AI engineering conventions — including this folder structure — in the Coding pillar. Twelve weeks in Derby. Real production projects shipped weekly.


Louis O'Connell-Bristow
