PROMPTICA: A Scientific Prompting Framework for Caribbean Teams
Most teams treat prompting as folklore: copy a magic phrase, paste it in, hope it works. PROMPTICA treats it as engineering: structured context, role and constraint design, evaluation, and iteration, so that the quality of what comes out of Claude and other frontier models stops being guesswork.
Ad-hoc prompting fails in production for three reasons: it is inconsistent, it hallucinates, and nobody measures it. PROMPTICA is maestro's tested prompting framework that swaps folklore for method. Lead with structured context, set explicit constraints, show worked examples, force structured output, and run an evaluation loop. In maestro's indicative benchmarks, a task that succeeds about 45 percent of the time with a thrown-together prompt can reach roughly 85 percent with a designed one. Caribbean teams learn it hands-on inside the IMPACT AI Lab.
Key takeaways
- Ad-hoc prompting breaks in production because it is inconsistent, it invents facts, and nobody measures it.
- PROMPTICA is five repeatable moves: context first, explicit constraints, worked examples, structured output, and an evaluation loop.
- An evaluation set turns prompt quality into a number you can track, so a change is proven better instead of merely felt to be better.
- In maestro benchmarks, a task that succeeds about 45 percent of the time ad-hoc can reach roughly 85 percent once it is rebuilt with PROMPTICA.
- A designed prompt cuts reruns and lets a smaller, cheaper model do work a larger one used to be needed for.
- Caribbean teams learn the method on their own real tasks inside the IMPACT AI Lab.
Watch how most teams prompt a frontier model and you will see the same ritual. Someone found a phrase that worked once. They paste it into Claude, change a few words, and pray. When the output is wrong, they shrug and try again with slightly different wording. There is no record of what worked, no way to know whether the next run will hold up, and no shared standard across the team. This is folklore, not engineering. And folklore does not survive contact with production.
PROMPTICA is the framework maestro built to fix that. It treats a prompt the way a good engineer treats any input to a system: as something you design, test, and improve on purpose. The point is not clever wording. The point is reliable, repeatable output from models like Claude, on real Caribbean work, run after run. This piece walks through why the casual approach breaks, what PROMPTICA actually does, and how regional teams pick it up.
Why ad-hoc prompting fails in production
A magic phrase works in a demo because a demo is one run. Production is a thousand runs across a dozen colleagues and three months. That is where three cracks open.
The first is inconsistency. The same loose prompt, run on Monday and again on Thursday, gives you answers of different shape, length, and quality. When a model has no firm instruction about format and constraints, it fills the gap with its own guesses, and those guesses drift. A summary that was three crisp bullets one day comes back as four rambling paragraphs the next. For a one-off, fine. For a process that feeds a report or a customer reply, that variance is a defect.
The second is hallucination. A vague prompt invites the model to invent. Ask it to "tell me about the new tax rules" with no grounding, and it will confidently produce plausible figures that may be wrong. The model is not lying; it is doing exactly what an underspecified instruction asks, which is to generate something that sounds right. Without supplied context and explicit limits, you get fluent fiction.
The third, and the one almost nobody fixes, is that there are no evals. Teams ship prompts the way they would never ship code: with no test that says "this works." They have no number for how often the prompt produces an acceptable answer, so they cannot tell whether a change made things better or worse. Quality becomes a matter of opinion and mood. PROMPTICA exists to replace opinion with measurement.
[Figure: Task-success rate, ad-hoc prompts vs PROMPTICA-structured prompts (maestro internal benchmark). Indicative figures from maestro client engagements; "success" means output accepted with no human rework.]
The PROMPTICA method, step by step
PROMPTICA breaks prompting into moves you can teach, check, and reuse. Five of them carry most of the weight.
Context first. Before you ask for anything, give the model what it needs to answer well. The relevant policy, the source document, the customer's actual message, the numbers from your own system. A frontier model cannot know your loan product or your ministry's procedure unless you put it in front of it. Most "the AI got it wrong" failures are really "the human gave it nothing to work from."
Explicit constraints. Say what good looks like and what is off limits. Length, tone, audience, and the hard rules: do not invent figures, do not give legal advice, only use the supplied document. Constraints are guard rails. They are what turn a creative writing engine into a dependable tool.
Worked examples. Show one or two examples of an input and the exact output you want. Models are strong pattern matchers; a single good example often does more than a paragraph of description. This is how you lock in format and standard at the same time.
Structured output. Ask for the answer in a fixed shape, such as labelled fields or JSON, so the result drops straight into your next step without a human reformatting it. Structure is also a quiet accuracy check: a model forced to fill named fields is less likely to wander.
Evaluation loop. Collect a set of real test cases. Run the prompt against all of them. Score the outputs. Change one thing, run again, and keep the version that scores higher. This is the part that turns prompting into engineering, because now you have evidence, not a hunch.
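The first four moves can be captured in a small template so that every colleague assembles a prompt in the same order. The sketch below is one possible way to do that in Python, not a maestro tool: `PromptSpec`, `build_prompt`, and the section labels are illustrative names chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PromptSpec:
    """Hypothetical container for context, constraints, and a worked example."""
    role: str
    context: str
    task: str
    constraints: list[str] = field(default_factory=list)
    example_output: str = ""  # one worked example in the exact output shape


def build_prompt(spec: PromptSpec, input_label: str, input_text: str) -> str:
    """Assemble labelled sections in a fixed order so every run sees the same layout."""
    lines = [
        f"ROLE: {spec.role}",
        f"CONTEXT: {spec.context}",
        f"TASK: {spec.task}",
        "CONSTRAINTS:",
    ]
    lines += [f"- {rule}" for rule in spec.constraints]
    if spec.example_output:
        lines += ["EXAMPLE OUTPUT:", spec.example_output]
    # The variable part (the email, message, or document) always comes last.
    lines += [f"{input_label}:", input_text]
    return "\n".join(lines)
```

Keeping the fixed parts in a spec and the variable input separate means a change to one constraint is a one-line edit that can be rerun against the same test cases.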
Weak prompt versus strong prompt
The difference is concrete, so here it is. A weak prompt for a Caribbean SME trying to triage customer emails looks like this:
Summarise this email and tell me what to do. [email pasted here]
It will return something. It will also vary wildly run to run, sometimes miss the actual request, and occasionally invent an order number that was never there. Now the same job, structured the PROMPTICA way:
ROLE: You are a support triage assistant for a Jamaican retail SME.
CONTEXT: Our categories are Order, Refund, Delivery, Product Question, Other.
Refund policy: 14 days, receipt required. Delivery is islandwide, 3-5 days.
TASK: Read the customer email below and return triage details only.
CONSTRAINTS:
- Use ONLY information in the email. Do not invent order numbers or dates.
- If something is not stated, write "not stated".
- Reply in the JSON shape shown.
EXAMPLE OUTPUT:
{"category":"Refund","urgency":"high","customer_ask":"refund for damaged blender","missing_info":"receipt number"}
EMAIL:
[email pasted here]
The second prompt sets a role, supplies the business context, fixes the categories, bans invention, handles the unknown case, and pins the output to a shape your system can read. Same model, same email. One is a gamble; the other is a process you can trust and measure. That gap is the whole argument.
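A fixed output shape only helps if the receiving system checks it. As a minimal sketch, assuming the reply arrives as a raw string, the triage output could be validated like this before anything downstream uses it. The field names and categories come from the prompt above; `parse_triage` is a hypothetical helper name.

```python
import json

# Categories and field names taken from the triage prompt above.
ALLOWED_CATEGORIES = ("Order", "Refund", "Delivery", "Product Question", "Other")
REQUIRED_FIELDS = {"category", "urgency", "customer_ask", "missing_info"}


def parse_triage(raw: str) -> dict:
    """Parse a model reply and reject anything outside the agreed shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {data['category']!r}")
    return data
```

A reply that raises `ValueError` here can be counted as a failure in an evaluation run, which makes format drift visible instead of silent.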
[Figure: What teams report after adopting PROMPTICA, outcomes across early maestro cohorts. Indicative, self-reported figures; illustrative of typical results, not audited.]
A second example: routing a government service request
The SME case is the friendly one. Public-sector work raises the stakes, because a wrong answer to a citizen is not a lost sale, it is a person sent to the wrong office or told the wrong thing about their pension. Take a ministry that wants to route inbound requests to the correct department. The lazy version looks like this:
Which department should handle this and what should we tell the citizen? [message pasted here]
That prompt will happily route a request to a unit that does not exist, quote a fee it made up, and write a reply in a register no public officer would use. Now the PROMPTICA build:
ROLE: You are an intake clerk for a national ministry's citizen service desk.
CONTEXT: Valid departments are Registration, Benefits, Licensing, Complaints, Other.
We do not quote fees or processing times unless they appear in the supplied policy note.
POLICY NOTE: [pasted policy excerpt, or "none provided"]
TASK: Read the citizen message and return routing details only.
CONSTRAINTS:
- Route to ONE valid department from the list above. If unclear, use "Other".
- Do NOT invent fees, deadlines, office names, or document requirements.
- If the policy note does not cover the ask, set needs_human_review to true.
- Reply in the JSON shape shown, nothing else.
EXAMPLE OUTPUT:
{"department":"Benefits","needs_human_review":false,"citizen_ask":"status of pension application","next_step":"confirm reference number"}
MESSAGE:
[message pasted here]
The strong version cannot route to a phantom department, cannot fabricate a fee, and flags anything it is unsure about for a human rather than guessing. That last field, needs_human_review, is the part public-sector teams care about most. It turns the model into a triage layer that knows the limit of its own knowledge, which is exactly the property you want when the downstream cost of a confident error is high. The same discipline underpins our work on AI safety through Section 9.
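The prompt asks the model to escalate when unsure, and the calling code can back that up by failing closed. The sketch below is one way to do it, under the assumption that any malformed or out-of-list reply should go to a person. `safe_route` and the fallback values are hypothetical; the department list and field names come from the prompt above.

```python
import json

# Valid departments as listed in the routing prompt above.
VALID_DEPARTMENTS = ("Registration", "Benefits", "Licensing", "Complaints", "Other")


def safe_route(raw: str) -> dict:
    """Return routing details, escalating to a human whenever the reply cannot be trusted."""
    # Illustrative fallback values; a real desk would choose its own wording.
    fallback = {
        "department": "Other",
        "needs_human_review": True,
        "citizen_ask": "not stated",
        "next_step": "manual review",
    }
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return fallback
    if not isinstance(data, dict) or data.get("department") not in VALID_DEPARTMENTS:
        return fallback
    # Anything other than an explicit boolean false is treated as "needs review".
    if data.get("needs_human_review") is not False:
        data["needs_human_review"] = True
    return data
```

The design choice is that the code never upgrades confidence: it can turn a missing or odd flag into a review, but it cannot turn a review into an automatic reply.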
How to build an evaluation set so quality is measured, not guessed
The evaluation loop is the move teams skip, and it is the one that separates a prompt you hope works from a prompt you know works. Building the set is less work than people fear. Start with twenty to fifty real cases pulled from your own history: actual emails, actual citizen messages, actual documents, including the awkward ones that broke things before. For each case, write down the correct answer once, by hand. That hand-labelled answer is your reference, and it is the most valuable asset in the whole exercise.
Then run the prompt across every case and score each output against its reference. Keep the scoring brutally simple to begin: pass or fail, did this answer need human rework or not. Add finer scores later if you need them. Now you have a single number, say 39 out of 50 cases passed, and that number is the truth about your prompt. Change one element, the constraint wording or an example, run the whole set again, and compare. If the number went up, keep the change. If it went down, throw it away. You are no longer arguing about which prompt feels better in a meeting; you have evidence on the table.
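The loop described above fits in a few lines. This is a minimal sketch rather than a full harness: `run_prompt` stands in for whatever function sends one input to the model and returns its reply, and exact match against the reference is an assumed pass rule that a team would likely replace with its own "no rework needed" check.

```python
from typing import Callable

Case = tuple[str, str]  # (input text, hand-labelled reference answer)


def pass_rate(run_prompt: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases whose output matches the reference (assumed pass rule)."""
    if not cases:
        raise ValueError("evaluation set is empty")
    passed = sum(
        1 for text, expected in cases
        if run_prompt(text).strip() == expected.strip()
    )
    return passed / len(cases)


def keep_better(current: Callable[[str], str],
                candidate: Callable[[str], str],
                cases: list[Case]) -> tuple[Callable[[str], str], float]:
    """Keep the candidate only if it scores strictly higher on the same cases."""
    current_score = pass_rate(current, cases)
    candidate_score = pass_rate(candidate, cases)
    if candidate_score > current_score:
        return candidate, candidate_score
    return current, current_score
```

Because both versions are scored on the identical set, a higher number reflects the change to the prompt rather than an easier batch of inputs.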
The discipline pays off twice. The first time is now, when you pick the better prompt. The second time is later, when a new model version ships and you want to know whether to switch. Run your evaluation set against the new model and you get an answer in an afternoon instead of a vague sense that the new one seems smarter. This is the same posture maestro takes across its AI research and development work: measure first, opine second.
The cost angle: fewer reruns, smaller models doing more
Better prompting is also a cost story, and for a Caribbean SME watching its API bill that often matters more than quality. A vague prompt that fails four times in five takes, on average, five model calls to get one usable answer, plus the human minutes spent rereading and retrying. A designed prompt that lands first time is one call. Across a team running hundreds of tasks a day, that difference compounds into real money and real hours, which is the saving behind the indicative, self-reported figure of around 62 percent fewer reruns.
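The five-calls figure follows from simple arithmetic if each attempt is assumed to succeed independently with the same probability: the average number of calls needed for one usable answer is one divided by the success rate. The sketch below makes that assumption explicit; real retries may not be independent, so treat the result as a rough guide.

```python
def expected_calls(success_rate: float) -> float:
    """Average calls until the first usable answer, assuming independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1 / success_rate


def expected_reruns(success_rate: float) -> float:
    """Average number of extra calls beyond the first."""
    return expected_calls(success_rate) - 1


# A prompt that fails four times in five succeeds 20 percent of the time.
print(expected_calls(0.2))   # 5.0
print(expected_reruns(1.0))  # 0.0
```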
There is a second saving that is less obvious. A well-structured prompt often lets a smaller, cheaper model do work that an unstructured prompt could only get from the largest, most expensive one. When you supply the context, fix the format, and constrain the task tightly, you are doing some of the reasoning for the model, so it does not need as much raw capability to land the answer. We have watched teams move a routine triage job from a flagship model down to a mid-tier one with no drop in their evaluation score, cutting the per-task price several times over. Smaller models running on cheaper hardware also tie into the case for localized, sovereign models in the region: the less you depend on the single most powerful frontier model for every call, the more of your stack you can run on your own terms.
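The move to a smaller model can be made a mechanical decision once every candidate has been scored on the same evaluation set. The following is a sketch under assumptions: the model labels and prices are placeholders to be filled in from your own runs and your provider's price list, and `cheapest_passing` is a hypothetical helper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str             # placeholder label, not a real product name
    cost_per_call: float  # hypothetical price, in any single currency
    pass_rate: float      # score on the same evaluation set, 0.0 to 1.0


def cheapest_passing(results: list[ModelResult], baseline_rate: float) -> ModelResult | None:
    """Pick the lowest-cost model whose score is at least the current baseline."""
    eligible = [r for r in results if r.pass_rate >= baseline_rate]
    if not eligible:
        return None
    return min(eligible, key=lambda r: r.cost_per_call)
```

If no cheaper model holds the baseline score, the function returns `None` and the team stays on its current model.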
Why regional context and data matter
A frontier model trained mostly on North American and European text carries North American and European assumptions. Ask it about credit, returns policy, public holidays, or place names, and its defaults skew foreign. For a Caribbean team, that is not a small annoyance. A model that quietly assumes a US returns window or invents a parish that does not exist produces output that is wrong in ways that look right to anyone not paying close attention.
PROMPTICA handles this head-on. The context-first step is where you inject the local reality: your country's regulations, your currency, your customer base, the way people actually write to you in the Caribbean, including the shift between standard English and Patois. The constraint step is where you forbid the model from filling gaps with foreign defaults. Tuned this way, the same model becomes genuinely useful for a credit union in Kingston, a ministry in Bridgetown, or a logistics startup in Santo Domingo, because it is answering inside the world those teams operate in, not a generic one.
How it plugs into the IMPACT AI Lab
A framework only matters if people can run it without you in the room. That is why PROMPTICA is taught hands-on inside the IMPACT AI Lab, maestro's training programme for Caribbean teams. Participants do not sit through theory. They bring a real task from their own organisation, build a weak prompt, measure how often it fails, then rebuild it with the PROMPTICA moves and watch the success rate climb on their own test cases.
By the end, a team has more than a few good prompts. They have a method they can apply to the next task and the one after that, a small library of evaluated prompts tied to their actual work, and a shared vocabulary so a prompt written by one colleague is legible to the rest. That is the difference between a workshop people forget and a capability that stays in the building.
Who PROMPTICA is for
Three groups get the most out of it. Small and medium enterprises, where one or two people are trying to do the work of five and need AI that produces usable output the first time, not the fourth. Government and public-sector teams, where consistency and the ban on invented facts are not nice-to-haves but the whole job, given the stakes of a wrong answer to a citizen. And startups, where speed is everything and a reliable prompting practice is the difference between an AI feature that ships and one that embarrasses you in front of users.
What ties them together is simple. Each is trying to get dependable work out of a frontier model under real pressure, and each is currently leaving most of that potential on the table by treating prompting as a guessing game. PROMPTICA gives them the engineering discipline to stop guessing.
Frequently Asked Questions
What is PROMPTICA?
PROMPTICA is maestro's tested prompting framework for getting reliable, repeatable work out of frontier models such as Claude. It replaces ad-hoc, copy-and-hope prompting with an engineering method: context first, explicit constraints, worked examples, structured output, and an evaluation loop. It is tuned for Caribbean and Latin American teams and their data.
How is PROMPTICA different from generic prompt tips?
Generic tips give you isolated tricks. PROMPTICA gives you a repeatable process plus a way to measure whether a prompt actually works. The evaluation loop is the core difference: you score prompts against real test cases and keep the version that performs best, so quality is evidence-based rather than a matter of taste.
Do I need to be technical to use PROMPTICA?
No. The framework is built so non-engineers in marketing, operations, customer service, and policy can apply it. The structured-output step uses simple labelled formats, and the IMPACT AI Lab teaches the whole method hands-on using each team's own work, not abstract examples.
Why does regional context matter so much for prompting?
Frontier models are trained mostly on North American and European text, so their defaults skew foreign. Without local context and constraints, a model will quietly assume the wrong currency, policy, or place names. PROMPTICA's context-first and constraint steps inject Caribbean reality and forbid the model from filling gaps with foreign defaults.
How big a difference does structured prompting actually make?
In maestro's internal benchmarks, a task that succeeds around 45 percent of the time with an ad-hoc prompt can reach roughly 85 percent once it is rebuilt with PROMPTICA. These are indicative figures from client engagements, where success means the output is accepted with no human rework. Results vary by task, but the direction is consistent.
How do teams learn PROMPTICA?
Through the IMPACT AI Lab, maestro's training programme for Caribbean teams. Participants bring a real task, measure how often their current prompt fails, rebuild it with the PROMPTICA method, and leave with an evaluated prompt library, a repeatable method, and a shared vocabulary their whole team can use. You can read more in our piece on training Caribbean builders inside the IMPACT AI Lab.
How do I build an evaluation set for my prompts?
Collect twenty to fifty real cases from your own history, including the awkward ones, and hand-write the correct answer for each. Run your prompt across all of them and score pass or fail. Change one thing, run again, and keep the version that scores higher. That single number is what turns prompt quality from opinion into evidence.
Can PROMPTICA cut my AI costs?
Yes, in two ways. A designed prompt that lands first time means fewer model calls than one that fails and gets retried, which our cohorts report as around 62 percent fewer reruns. Tight context and constraints also often let a smaller, cheaper model do work that previously needed the largest one, lowering the per-task price.
Does PROMPTICA work for government and public-sector teams?
It is built for them. Public-sector work depends on consistency and a hard ban on invented facts, which are exactly the constraints PROMPTICA enforces. A needs-human-review flag lets the model triage what it is confident about and escalate the rest, so a wrong answer to a citizen becomes far less likely.
What are the most common prompting mistakes PROMPTICA prevents?
The big four are giving the model no context to work from, leaving the output format open so results vary run to run, asking for facts the model has to invent, and shipping a prompt with no test of whether it works. The five PROMPTICA moves close each of those gaps directly. You can see the contrast in the weak versus strong examples above.
Stop treating prompting as luck. Explore PROMPTICA to see the full framework, or bring your team into the IMPACT AI Lab to learn it hands-on on your own real work. maestro helps Caribbean SMEs, government teams, and startups turn frontier models into dependable tools.