6 Hard Lessons from Building an LLM Data Pipeline
In Chapter 2, we showed how we built an AI pipeline that standardises 50,000 messy product names a day. This chapter is about everything that went wrong before we got there.
When you first start working with Large Language Models, the instinct is to treat them like magic boxes. Just give them all your instructions, feed them the data, and watch the perfect output roll in.
Our goal was to generate standardised product names, categories, and sizes from raw, messy alcohol data (mostly beer and wine) for an on-demand delivery marketplace. The standardised name had to be clean enough to display directly on an e-commerce storefront.
To give you an idea, here is a simplified look at the raw input and what we needed the LLM to output:
Raw Input (Messy Manufacturer Data)
{
  "Manufacturer_Category_1": "Beer",
  "Manufacturer_Category_2": "Craft Beer",
  "Brand_Path": "Stoneface Brewing Co",
  "Manufacturer_Item_Name": "STONEFACE MOZ ACCALYPSE DDH",
  "Manufacturer_Size": "4 pk",
  "upc": 636251776120
}

Expected Output (Standardised for E-commerce)
{
  "Standardized_Product_Name": "Stoneface Brewing Co Mozaccalypse Double Dry Hopped IPA",
  "Category": "Alcohol > Beer > Craft Beer",
  "Size": "16 oz x 4 ct"
}

Because there were dozens of formatting rules, exceptions, and corrections required to turn that raw data into a clean product name, we decided to articulate every single rule in one massive prompt.
Crafting that "humongous rule" prompt turned out to be a nightmare. Through trial, error, and thousands of API calls, we had to completely rethink our approach. Here are the 6 hardest lessons we learned about moving from basic prompt writing to true LLM Systems Engineering.
Lesson 1: One Giant Prompt Creates Tunnel Vision

When you cram too many rules into a single prompt, LLMs develop a kind of tunnel vision. They might execute Rule #4 perfectly, yet ignore the Title Case requirement or forget to censor an expletive.
On top of that, not every rule applies to every product. For instance, our "Appellation location" rules were only relevant when the raw data actually included appellation details. As the rule set expanded, it became increasingly difficult to ensure the model consistently honoured every applicable constraint.
The Fix: We broke the monolithic prompt into a pipeline of smaller, well-defined tasks. Initially, this multi-step approach felt less stable than a single comprehensive prompt. But with careful prompt tuning and sequencing, it ultimately outperformed the one-shot prompt in both reliability and overall accuracy.
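A minimal sketch of that pipeline shape, assuming a generic `call_llm` client. The step prompts below are illustrative stand-ins, not our actual production rules:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; swap in your client of choice."""
    raise NotImplementedError

# Each step carries only one narrow task instead of dozens of rules at once,
# so no rule gets "tunnel-visioned" out of the model's attention.
STEPS = [
    "Extract and expand the brand name in this record:\n{record}",
    "Expand style abbreviations (e.g. DDH, IPA) in:\n{record}",
    "Rewrite the product name in Title Case:\n{record}",
    "Censor any expletives in the product name:\n{record}",
]

def run_pipeline(record: str, llm=call_llm) -> str:
    # Feed each step's output into the next, one focused prompt at a time.
    for template in STEPS:
        record = llm(template.format(record=record))
    return record
```

The trade-off is more API calls per product, but each call is simple enough to test, tune, and sequence independently.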
Lesson 2: Make the Model Cite Its Sources

We needed to extract the specific brand of each product from the raw data. Initially, we just asked, "What is the brand?"
Eventually, we changed the prompt to: "Output the brand of the product, AND output the exact source field where you obtained this information (e.g., raw data, Google search, UPC barcode lookup)."
Counterintuitively, making the model output more information made its primary output better.
Why? Because forcing the model to cite its source acts as a grounding mechanism. It forces accountability and drastically reduces hallucinations because the model cannot rely purely on its pre-trained knowledge — it has to point to the exact input field.
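A useful side effect is that the citation becomes mechanically checkable. A sketch, assuming the model returns JSON with `brand` and `source` fields (the field names and allowed source labels here are illustrative):

```python
import json

# Source labels the model is allowed to cite (illustrative).
ALLOWED_SOURCES = {"raw_data", "google_search", "upc_lookup"}

def validate_brand_answer(raw: dict, answer_json: str) -> bool:
    """Reject an answer whose cited source is unknown, or whose brand
    does not actually appear in the raw record it claims to come from."""
    answer = json.loads(answer_json)
    if answer.get("source") not in ALLOWED_SOURCES:
        return False
    if answer["source"] == "raw_data":
        # The cited brand must literally occur somewhere in the input fields.
        haystack = " ".join(str(v) for v in raw.values()).lower()
        return answer["brand"].lower() in haystack
    return True  # external sources need their own verification step
```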
Lesson 3: Ask for the Reasoning Before the Answer

When building complex pipelines, debugging a bad LLM output is frustrating. You find yourself asking, "Why on earth did it name the product that?"
To solve this, we started asking the model to explain its reasoning before outputting the final JSON. Suddenly, debugging became easy. We could read the output and see exactly which rules the model considered, which rules were unclear, or if two rules were conflicting.
Pro-Tip: Even if you don't need the reasoning in your final database, ask for it anyway. We found that models perform slightly better when they have to justify their work. It forces a "Chain of Thought" that leads to higher quality answers.
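One simple way to structure this, assuming the model is instructed to write its reasoning first and put the final JSON alone on the last line (a convention we're assuming here, not the only option):

```python
import json

def parse_reasoned_response(text: str) -> tuple[str, dict]:
    """Split free-text reasoning from the final JSON object on the last line.

    The reasoning is kept for debugging even if it never reaches the database.
    """
    reasoning, _, last_line = text.rpartition("\n")
    return reasoning.strip(), json.loads(last_line)
```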
Lesson 4: Tell the Model to Double-Check Its Own Work

This sounds almost too simple to work, but it does. At the end of our prompts, we added an instruction asking the model to review the rules again and double-check its own proposed answer against them before finalising the output.
Just like a human reviewing their work before hitting submit, this simple repetition and self-correction loop caught edge-case errors and noticeably improved our overall accuracy.
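The same idea can also run as a second, explicit review call. A sketch, using a hypothetical "OK" / "REVISED:" reply protocol (the protocol wording is our assumption, not a standard):

```python
def answer_with_self_check(prompt: str, rules: str, llm) -> str:
    """Draft an answer, then ask the model to re-check it against the rules.

    `llm` is any callable mapping a prompt string to a response string.
    """
    draft = llm(prompt)
    review = llm(
        f"Rules:\n{rules}\n\nProposed answer:\n{draft}\n\n"
        "Re-read the rules and double-check the answer against them. "
        "Reply 'OK' if it complies, or 'REVISED: <corrected answer>' if not."
    )
    if review.startswith("REVISED:"):
        return review[len("REVISED:"):].strip()
    return draft
```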
Lesson 5: Don't Ask an LLM to Do a Script's Job

To verify the category of a product, we scraped data from multiple websites. We fed these results to the LLM and asked it to compare them. If the category from each website matched, we flagged it as high confidence.
Comparing text strings to see if they match seems incredibly simple — perfect for an LLM, right?
Wrong. We found that in rare cases (<0.01%), the LLM would make a mistake. Because the model was already overloaded with other cognitive tasks, it would occasionally get even a simple string match wrong.
The Lesson: An LLM is a probability machine, not a strict logic engine. For strict, rules-based comparisons, we moved the task out of the prompt and into a simple Python script. Why settle for 99.99% accuracy from an LLM when a few lines of traditional code will give you 100%? If a step can be done deterministically in code, do it in code.
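The replacement really was just a few lines. A sketch of the deterministic check (the normalisation here is illustrative; tailor it to your own scraped data):

```python
def categories_agree(categories: list[str]) -> bool:
    """High confidence only when every scraped source reports the same
    category, compared case-insensitively and ignoring surrounding whitespace."""
    normalised = {c.strip().lower() for c in categories}
    return len(normalised) == 1
```

Unlike a model call, this check costs nothing per product and fails in exactly zero cases.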
Lesson 6: LLMs Hate Admitting They Don't Know

For our standardised product names, we used a triangulation approach: we ran the prompt through 3 different LLMs simultaneously.
However, running three separate model inferences for every attribute (such as size or category) quickly became too expensive. For those cases, we needed a single model call that could also assess its own confidence. So we made the instruction explicit: return the answer only if you are confident; otherwise, output "ambiguous."
The problem? LLMs hate admitting they don't know something. They will bend over backwards to give you an answer rather than outputting "ambiguous."
To actually infer when a model was confused, we developed two workarounds:
Consistency Checks: We asked the model the exact same question multiple times. If the answer fluctuated, we manually flagged it as ambiguous.
The Token-Time Proxy: We analysed the time and token length of the model's reasoning. Even if the model eventually arrived at a conclusive answer, the very fact that it needed a massive number of tokens to "talk itself" into the answer was a strong signal that the model was confused. Long reasoning = high probability of ambiguity.
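Both workarounds can be sketched together. Here `llm` is assumed to return the answer alongside its reasoning-token count, and the 500-token threshold is purely illustrative — tune it against your own model's behaviour:

```python
def classify_confidence(prompt: str, llm, runs: int = 3,
                        token_limit: int = 500) -> str:
    """Return the model's answer, or "ambiguous" if it seems confused.

    `llm` is any callable mapping a prompt to (answer, reasoning_token_count).
    """
    answers, token_counts = [], []
    for _ in range(runs):
        answer, n_tokens = llm(prompt)
        answers.append(answer)
        token_counts.append(n_tokens)
    # Consistency check: fluctuating answers mean the model is guessing.
    if len(set(answers)) > 1:
        return "ambiguous"
    # Token-time proxy: needing a wall of reasoning to reach a "confident"
    # answer is itself a sign of confusion.
    if max(token_counts) > token_limit:
        return "ambiguous"
    return answers[0]
```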
Prompt engineering is no longer just about finding the right "magic words." As your projects scale, it becomes about Systems Engineering. It's about building pipelines, forcing accountability, triangulating confidence, and knowing exactly when to take the decision out of the AI's hands entirely.
If you missed Chapter 2 — the full case study on how we built this pipeline — read it here.
And if you're sitting on messy data at scale, this pattern works. We've seen it firsthand.
Messy data? We don't clean it. We solve it.
— Fast Code AI