6 Hard Lessons from Building an LLM Data Pipeline
In Chapter 2, we showed how we built an AI pipeline that standardises 50,000 messy product names a day. This chapter is about everything that went wrong before we got there.
When you first start working with Large Language Models, the instinct is to treat them like magic boxes. Just give them all your instructions, feed them the data, and watch the perfect output roll in.
Our goal was to generate standardised product names, categories, and sizes from raw, messy alcohol data (mostly beer and wine) for an on-demand delivery marketplace. The standardised name had to be clean enough to display directly on an e-commerce storefront.
To give you an idea, here is a simplified look at the raw input and what we needed the LLM to output:
Raw Input (Messy Manufacturer Data)
{
  "Manufacturer_Category_1": "Beer",
  "Manufacturer_Category_2": "Craft Beer",
  "Brand_Path": "Stoneface Brewing Co",
  "Manufacturer_Item_Name": "STONEFACE MOZ ACCALYPSE DDH",
  "Manufacturer_Size": "4 pk",
  "upc": 636251776120
}

Expected Output (Standardised for E-commerce)
{
  "Standardized_Product_Name": "Stoneface Brewing Co Mozaccalypse Double Dry Hopped IPA",
  "Category": "Alcohol > Beer > Craft Beer",
  "Size": "16 oz x 4 ct"
}

Because there were dozens of formatting rules, exceptions, and corrections required to turn that raw data into a clean product name, we decided to articulate every single rule in one massive prompt.
Crafting that "humongous rule" prompt turned out to be a nightmare. Through trial, error, and thousands of API calls, we had to completely rethink our approach. Here are the 6 hardest lessons we learned about moving from basic prompt writing to true LLM Systems Engineering.
Lesson 1: One Giant Prompt Creates Tunnel Vision

When you cram too many rules into a single prompt, LLMs develop a kind of tunnel vision. They might execute Rule #4 perfectly, yet ignore the Title Case requirement or forget to censor an expletive.
On top of that, not every rule applies to every product. For instance, our "Appellation location" rules were only relevant when the raw data actually included appellation details. As the rule set expanded, it became increasingly difficult to ensure the model consistently honoured every applicable constraint.
The Fix: We broke the monolithic prompt into a pipeline of smaller, well-defined tasks. Initially, this multi-step approach felt less stable than a single comprehensive prompt. But with careful prompt tuning and sequencing, it ultimately outperformed the one-shot prompt in both reliability and overall accuracy.
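A minimal sketch of that pipeline shape, assuming a generic `call_llm` client. The step prompts below are illustrative stand-ins, not our actual production rules:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a real model call; swap in your client of choice."""
    raise NotImplementedError

# Each step carries only one narrow task instead of dozens of rules at once,
# so no rule gets "tunnel-visioned" out of the model's attention.
STEPS = [
    "Extract and expand the brand name in this record:\n{record}",
    "Expand style abbreviations (e.g. DDH, IPA) in:\n{record}",
    "Rewrite the product name in Title Case:\n{record}",
    "Censor any expletives in the product name:\n{record}",
]

def run_pipeline(record: str, llm=call_llm) -> str:
    # Feed each step's output into the next, one focused prompt at a time.
    for template in STEPS:
        record = llm(template.format(record=record))
    return record
```

The trade-off is more API calls per product, but each call is simple enough to test, tune, and sequence independently.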
Lesson 2: Make the Model Cite Its Sources

We needed to extract the specific brand of each product from the raw data. Initially, we just asked, "What is the brand?"
Eventually, we changed the prompt to: "Output the brand of the product, AND output the exact source field where you obtained this information (e.g., raw data, Google search, UPC barcode lookup)."
Counterintuitively, making the model output more information made its primary output better.
Why? Because forcing the model to cite its source acts as a grounding mechanism. It forces accountability and drastically reduces hallucinations because the model cannot rely purely on its pre-trained knowledge — it has to point to the exact input field.
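A useful side effect is that the citation becomes mechanically checkable. A sketch, assuming the model returns JSON with `brand` and `source` fields (the field names and allowed source labels here are illustrative):

```python
import json

# Source labels the model is allowed to cite (illustrative).
ALLOWED_SOURCES = {"raw_data", "google_search", "upc_lookup"}

def validate_brand_answer(raw: dict, answer_json: str) -> bool:
    """Reject an answer whose cited source is unknown, or whose brand
    does not actually appear in the raw record it claims to come from."""
    answer = json.loads(answer_json)
    if answer.get("source") not in ALLOWED_SOURCES:
        return False
    if answer["source"] == "raw_data":
        # The cited brand must literally occur somewhere in the input fields.
        haystack = " ".join(str(v) for v in raw.values()).lower()
        return answer["brand"].lower() in haystack
    return True  # external sources need their own verification step
```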
Lesson 3: Ask for the Reasoning Before the Answer

When building complex pipelines, debugging a bad LLM output is frustrating. You find yourself asking, "Why on earth did it name the product that?"
To solve this, we started asking the model to explain its reasoning before outputting the final JSON. Suddenly, debugging became easy. We could read the output and see exactly which rules the model considered, which rules were unclear, or if two rules were conflicting.
Pro-Tip: Even if you don't need the reasoning in your final database, ask for it anyway. We found that models perform slightly better when they have to justify their work. It forces a "Chain of Thought" that leads to higher quality answers.
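One simple way to structure this, assuming the model is instructed to write its reasoning first and put the final JSON alone on the last line (a convention we're assuming here, not the only option):

```python
import json

def parse_reasoned_response(text: str) -> tuple[str, dict]:
    """Split free-text reasoning from the final JSON object on the last line.

    The reasoning is kept for debugging even if it never reaches the database.
    """
    reasoning, _, last_line = text.rpartition("\n")
    return reasoning.strip(), json.loads(last_line)
```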
Lesson 4: Tell the Model to Double-Check Its Own Work

This sounds almost too simple to work, but it does. At the end of our prompts, we added an instruction asking the model to review the rules again and double-check its own proposed answer against them before finalising the output.
Just like a human reviewing their work before hitting submit, this simple repetition and self-correction loop caught edge-case errors and noticeably improved our overall accuracy.
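The same idea can also run as a second, explicit review call. A sketch, using a hypothetical "OK" / "REVISED:" reply protocol (the protocol wording is our assumption, not a standard):

```python
def answer_with_self_check(prompt: str, rules: str, llm) -> str:
    """Draft an answer, then ask the model to re-check it against the rules.

    `llm` is any callable mapping a prompt string to a response string.
    """
    draft = llm(prompt)
    review = llm(
        f"Rules:\n{rules}\n\nProposed answer:\n{draft}\n\n"
        "Re-read the rules and double-check the answer against them. "
        "Reply 'OK' if it complies, or 'REVISED: <corrected answer>' if not."
    )
    if review.startswith("REVISED:"):
        return review[len("REVISED:"):].strip()
    return draft
```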
Lesson 5: Don't Ask an LLM to Do a Script's Job

To verify the category of a product, we scraped data from multiple websites. We fed these results to the LLM and asked it to compare them. If the category from each website matched, we flagged it as high confidence.
Comparing text strings to see if they match seems incredibly simple — perfect for an LLM, right?
Wrong. We found that in rare cases (<0.01%), the LLM would make a mistake. Because the model was already overloaded with other cognitive tasks, it would occasionally get even a simple string match wrong.
The Lesson: An LLM is a probability machine, not a strict logic engine. For strict, rules-based comparisons, we moved the task out of the prompt and into a simple Python script. Why settle for 99.99% accuracy from an LLM when a few lines of traditional code will give you 100%? If a step can be done deterministically in code, do it in code.
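The replacement really was just a few lines. A sketch of the deterministic check (the normalisation here is illustrative; tailor it to your own scraped data):

```python
def categories_agree(categories: list[str]) -> bool:
    """High confidence only when every scraped source reports the same
    category, compared case-insensitively and ignoring surrounding whitespace."""
    normalised = {c.strip().lower() for c in categories}
    return len(normalised) == 1
```

Unlike a model call, this check costs nothing per product and fails in exactly zero cases.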
Lesson 6: LLMs Hate Admitting They Don't Know

For our standardised product names, we used a triangulation approach: we ran the prompt through 3 different LLMs simultaneously.
However, running three separate model inferences for every attribute (such as size or category) quickly became too expensive. For those cases, we needed a single model call that could also assess its own confidence. So we made the instruction explicit: return the answer only if you are confident; otherwise, output "ambiguous."
The problem? LLMs hate admitting they don't know something. They will bend over backwards to give you an answer rather than outputting "ambiguous."
To actually infer when a model was confused, we developed two workarounds:
Consistency Checks: We asked the model the exact same question multiple times. If the answer fluctuated, we manually flagged it as ambiguous.
The Token-Time Proxy: We analysed the time and token length of the model's reasoning. Even if the model eventually arrived at a conclusive answer, the very fact that it needed a massive number of tokens to "talk itself" into the answer was a strong signal that the model was confused. Long reasoning = high probability of ambiguity.
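Both workarounds can be sketched together. Here `llm` is assumed to return the answer alongside its reasoning-token count, and the 500-token threshold is purely illustrative — tune it against your own model's behaviour:

```python
def classify_confidence(prompt: str, llm, runs: int = 3,
                        token_limit: int = 500) -> str:
    """Return the model's answer, or "ambiguous" if it seems confused.

    `llm` is any callable mapping a prompt to (answer, reasoning_token_count).
    """
    answers, token_counts = [], []
    for _ in range(runs):
        answer, n_tokens = llm(prompt)
        answers.append(answer)
        token_counts.append(n_tokens)
    # Consistency check: fluctuating answers mean the model is guessing.
    if len(set(answers)) > 1:
        return "ambiguous"
    # Token-time proxy: needing a wall of reasoning to reach a "confident"
    # answer is itself a sign of confusion.
    if max(token_counts) > token_limit:
        return "ambiguous"
    return answers[0]
```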
Prompt engineering is no longer just about finding the right "magic words." As your projects scale, it becomes about Systems Engineering. It's about building pipelines, forcing accountability, triangulating confidence, and knowing exactly when to take the decision out of the AI's hands entirely.
If you missed Chapter 2 — the full case study on how we built this pipeline — read it here.
And if you're sitting on messy data at scale, this pattern works. We've seen it firsthand.
Messy data? We don't clean it. We solve it.
— Fast Code AI