Most companies don't fail at AI because the technology doesn't work. They fail because they run pilots that can't produce a real decision. The scope is too broad, the timeline has no end date, the success criteria were never written down, and 90 days in, nobody can say whether the thing worked. That's not a pilot. That's an experiment without a question, and without a question, there's no path to scale and no honest conversation about whether to continue.
Here is the structure that actually produces a decision.
What a structured AI pilot actually is
An AI pilot program is a time-boxed, scope-limited deployment of an AI tool within a defined group, with predetermined success criteria and a specific decision at the end: scale it, stop it, or redesign it. That last part is what most organizations skip, and skipping it is why so many pilots stall.
The difference between a structured pilot and an unstructured AI experiment is accountability. A structured pilot has a named sponsor who owns the outcome, a group small enough to measure, a timeline with a hard stop, and written success criteria that were agreed on before the technology was switched on.
For mid-market companies, the right framework runs 60 to 90 days. Shorter pilots don't give users enough time to change behavior. Longer pilots drift, lose focus, and produce conclusions that are hard to defend. Sixty to ninety days is enough time to answer the question the organization actually cares about: does this tool change how work gets done, and is that change worth the cost and complexity of scaling it?
Phase 1: Picking the right use case to test
The most common mistake in a pilot is starting with the wrong problem. Companies pick an AI tool first, then retrofit a use case. A structured pilot starts the other way around.
The selection phase, typically the first two to three weeks, has one job: find the highest-value, lowest-risk use case in the organization. High-value means the problem is one the business genuinely cares about solving. Low-risk means the stakes of a failed pilot are containable. Automating the CEO's inbox is high-risk. Drafting first versions of internal reports is low-risk.
Picking the right use case requires an honest read on where the business sits today. What workflows are consuming the most time from capable people? Where is there already manual, repetitive work that takes judgment but not creativity? Where would a 30 percent efficiency gain change a headcount conversation? Before the selection phase begins, a structured AI readiness review is worth completing to understand current data quality, access controls, and governance gaps that could affect the pilot scope or timeline.
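One way to make the "highest-value, lowest-risk" comparison concrete is to score each candidate and rank the list. The sketch below is illustrative only: the 1-to-5 scales, the `value - risk` score, the specific scores, and the third candidate are all assumptions, not a prescribed method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UseCase:
    name: str
    value: int  # assumed scale: 1 (low) to 5 (high), how much the business cares
    risk: int   # assumed scale: 1 (low) to 5 (high), how hard a failure is to contain


def rank_use_cases(candidates):
    """Order candidates best-first by (value - risk); ties go to the lower-risk one."""
    return sorted(candidates, key=lambda c: (-(c.value - c.risk), c.risk))


# Hypothetical scores for illustration only.
candidates = [
    UseCase("Automate the CEO's inbox", value=4, risk=5),
    UseCase("Draft first versions of internal reports", value=4, risk=1),
    UseCase("Summarize weekly ops tickets", value=3, risk=2),
]

ranked = rank_use_cases(candidates)
for use_case in ranked:
    print(use_case.name, use_case.value - use_case.risk)
```

The scores matter less than the conversation they force: each number has to be defended by someone who knows the workflow.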
Phase 2: Setting the pilot up to succeed
The setup phase, typically weeks two through four, is where most pilots silently fail. Everything looks fine on the surface, but the conditions for a real test were never created.
Setup means four things are done before the tool goes live. The pilot group has been defined and briefed. The tool has been configured for the specific use case, not installed generically. Baseline metrics have been captured so there is something to compare against at the end. And a sponsor has explicitly accepted accountability for the outcome.
The pilot group should be small enough to manage and representative enough to matter. The right size for most mid-market pilots is 10 to 25 users. Fewer than that and it's hard to separate individual behavior from real signal. More than that and tracking what's actually happening week to week becomes a job in itself.
Capturing baselines before the pilot begins is non-negotiable. If the organization doesn't know how long a task took before the tool was introduced, it can't measure whether the tool helped. Document the baseline in a format that the decision-makers at the end of the pilot will accept as evidence.
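As a minimal sketch of what a documented baseline could look like, the snippet below records task timings before and during the pilot and compares medians. The record fields, the task name, and the sample numbers are hypothetical; the median is one reasonable choice for skewed timing data, not the only one.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class TaskTiming:
    user_id: str
    task: str
    minutes: float


def median_minutes(timings, task):
    """Median time for one task; raises if the task was never measured."""
    values = [t.minutes for t in timings if t.task == task]
    if not values:
        raise ValueError(f"no timings recorded for task {task!r}")
    return median(values)


def percent_change(baseline, pilot):
    """Negative result means the task got faster during the pilot."""
    return (pilot - baseline) / baseline * 100.0


# Hypothetical measurements, captured before kickoff and during the run phase.
baseline = [
    TaskTiming("u1", "weekly report draft", 90),
    TaskTiming("u2", "weekly report draft", 120),
    TaskTiming("u3", "weekly report draft", 100),
]
pilot = [
    TaskTiming("u1", "weekly report draft", 70),
    TaskTiming("u2", "weekly report draft", 80),
    TaskTiming("u3", "weekly report draft", 60),
]

before = median_minutes(baseline, "weekly report draft")
after = median_minutes(pilot, "weekly report draft")
print(before, after, percent_change(before, after))
```

The point of the structure is that the same function produces both numbers, so nobody can argue later that the baseline was measured differently from the result.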
Phase 3: Running the pilot without losing the thread
The run phase, weeks four through ten, is when the technology is live and in use. The goal during this phase is not to make the tool succeed. The goal is to observe what actually happens.
A fifteen-minute weekly check-in with the pilot group is enough to track adoption, surface friction, and catch the moments when people quietly stop using the tool. Those abandonment points are as informative as the successes. When a user works around the tool rather than through it, that's a signal about the tool, the use case, or the onboarding, and it's worth understanding which one before the decision gate.
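If the tool exposes usage logs, quiet abandonment can be flagged before the weekly check-in rather than discovered in it. The sketch below assumes a hypothetical per-user list of weekly session counts and an assumed two-week quiet window; both are illustrative choices.

```python
def find_abandoners(weekly_sessions, quiet_weeks=2):
    """Return users who used the tool earlier but logged nothing in the last quiet_weeks weeks."""
    if quiet_weeks < 1:
        raise ValueError("quiet_weeks must be at least 1")
    abandoners = []
    for user, counts in weekly_sessions.items():
        if len(counts) <= quiet_weeks:
            continue  # not enough history to compare earlier and recent weeks
        earlier, recent = counts[:-quiet_weeks], counts[-quiet_weeks:]
        if sum(earlier) > 0 and sum(recent) == 0:
            abandoners.append(user)
    return sorted(abandoners)


# Hypothetical session counts per user, one entry per pilot week.
weekly_sessions = {
    "user_a": [5, 4, 6, 5],  # steady use
    "user_b": [3, 2, 0, 0],  # started, then stopped
    "user_c": [0, 0, 0, 0],  # never started, which is an onboarding question instead
}

print(find_abandoners(weekly_sessions))
```

A user who never started and a user who stopped are different signals, which is why the sketch reports only the second group.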
The table below shows how a structured pilot differs from the kind that drifts:
| Characteristic | Structured pilot | Unstructured pilot |
|---|---|---|
| Success criteria | Written down before kickoff | Defined after results are in |
| Pilot group | 10 to 25 users, clearly defined | "Whoever wants to try it" |
| Timeline | Hard stop at 60 to 90 days | Open-ended |
| Baseline data | Captured before kickoff | Reconstructed afterward |
| Decision gate | Scale, stop, or redesign | "Keep experimenting" |
| Sponsor | Named and accountable | Shared or unclear |
Phase 4: Measuring what actually changed
The measure phase, typically weeks ten through twelve, is the most honest moment in a pilot. It's when the gap between what the organization hoped the tool would do and what it actually did becomes visible.
Measurement should happen across three dimensions. First, output quality: is the work product better, faster, or more consistent than it was before? Second, adoption rate: what percentage of the pilot group used the tool with enough regularity to count, and what does "regular" mean for this specific workflow? Third, cost-benefit: when the time savings or quality improvement is applied to the full population that would use the tool at scale, does the math justify the license cost, the training overhead, and the security and governance work required to roll it out?
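The adoption and cost-benefit dimensions reduce to simple arithmetic once the inputs are agreed on. The sketch below is one way to lay that out; every parameter name and number is an assumption for illustration, including the threshold for "regular" use and the 48 working weeks per year.

```python
def adoption_rate(sessions_per_user, min_sessions_per_week, weeks):
    """Share of pilot users who averaged at least min_sessions_per_week over the pilot."""
    if not sessions_per_user:
        raise ValueError("pilot group is empty")
    threshold = min_sessions_per_week * weeks
    regular_users = sum(1 for s in sessions_per_user.values() if s >= threshold)
    return regular_users / len(sessions_per_user)


def first_year_net_benefit(
    hours_saved_per_user_per_week,
    loaded_hourly_cost,
    users_at_scale,
    expected_adoption,
    annual_license_per_user,
    one_time_rollout_cost,
    working_weeks=48,
):
    """Projected first-year savings minus license and rollout costs at full scale."""
    active_users = users_at_scale * expected_adoption
    savings = (
        hours_saved_per_user_per_week * working_weeks * loaded_hourly_cost * active_users
    )
    # Licenses are assumed to be paid for every seat, not only for active users.
    costs = annual_license_per_user * users_at_scale + one_time_rollout_cost
    return savings - costs


# Hypothetical pilot data: total sessions per user over a six-week run phase.
sessions = {"a": 18, "b": 6, "c": 24, "d": 0}
rate = adoption_rate(sessions, min_sessions_per_week=3, weeks=6)

net = first_year_net_benefit(
    hours_saved_per_user_per_week=2,
    loaded_hourly_cost=60,
    users_at_scale=200,
    expected_adoption=rate,
    annual_license_per_user=360,
    one_time_rollout_cost=40_000,  # training, security, and governance work
)
print(rate, net)
```

With these made-up inputs, half the pilot group counts as regular users and the projected first-year net benefit is 464,000 in whatever currency the costs are stated in. Applying the measured adoption rate to the full population, rather than assuming everyone will use the tool, is the design choice that keeps the projection honest.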
The measure phase should also surface risks that weren't visible before the pilot went live. Data handling questions. Access control gaps. Cases where users found workarounds that bypassed the tool's guardrails. These findings are as important as the performance numbers when it comes to the decision gate.
Phase 5: The scale-or-stop decision gate
The decision gate, at the end of the pilot's timeline, is where the structure pays off. If the work was done correctly in phases one through four, the final decision is not a judgment call. It's a conversation about evidence.
Scale means the pilot produced results against the criteria that were set before kickoff, the adoption rate supports confidence in the numbers, and the cost-benefit math justifies a broader rollout. Stop means at least one of those conditions wasn't met, and the honest choice is to end it before the organization commits more time and budget to something that isn't working. Redesign means the use case was off but the technology showed real potential in a different context, and the right next step is to run a second, tighter pilot with a corrected scope.
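The three outcomes can be written down as a small decision rule, which is a useful check that the criteria were stated precisely enough to evaluate. This is a sketch under the definitions above; the function and parameter names are hypothetical, and each boolean still has to be backed by the evidence collected during the pilot.

```python
from enum import Enum


class Decision(Enum):
    SCALE = "scale"
    STOP = "stop"
    REDESIGN = "redesign"


def decision_gate(
    criteria_met,
    adoption_sufficient,
    cost_benefit_positive,
    promising_in_other_context=False,
):
    """Map the pilot's evidence to scale, redesign, or stop."""
    if criteria_met and adoption_sufficient and cost_benefit_positive:
        return Decision.SCALE
    # At least one condition failed. Redesign only if the technology
    # showed real potential for a differently scoped use case.
    if promising_in_other_context:
        return Decision.REDESIGN
    return Decision.STOP


print(decision_gate(True, True, True))
print(decision_gate(True, False, True))
print(decision_gate(False, True, False, promising_in_other_context=True))
```

Note that there is no fourth branch for "keep experimenting": every combination of inputs lands on one of the three decisions.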
The decision gate is not the end of the AI adoption process. It's the beginning of either a rollout plan or a pivot. For companies that want experienced advisory support before or during a pilot, without the cost of a full engagement, Heartwood provides on-demand technology strategy guidance from senior practitioners. Ask your hardest pilot question before you commit to a direction.
Common questions about AI pilot programs
How long should a pilot run?
For most mid-market companies, 60 to 90 days is the right window. That's long enough for users to move past the novelty effect and build real habits with the tool, but short enough to maintain focus and produce a decision before the organization loses momentum. Pilots shorter than 30 days tend to measure enthusiasm rather than productivity change. Pilots that run beyond 90 days without a defined endpoint tend to drift, lose accountability, and make the final decision harder to defend.
How many people should be in the pilot group?
A group of 10 to 25 people is the right range for most mid-market AI pilots. That's enough users to separate individual behavior from real signal, and small enough to track what's actually happening week to week. The group should represent the people who will eventually use the tool at scale, not just a self-selected group of early enthusiasts. Enthusiast-only pilots consistently produce adoption rates that the broader organization can't match, which makes the scale decision harder, not easier.
What counts as a successful pilot?
A successful pilot is one that produces a defensible decision, not necessarily one where the tool performed well. Success means the organization can answer three questions: did the tool improve output quality or speed against the baseline? Did the pilot group adopt it at a rate that's meaningful for this workflow? Does the cost-benefit math work when applied to the full population? A pilot where the answer to all three is yes points clearly to scale. A pilot where the answer to even one is no still counts as successful if it tells the organization something true.
Should finance or ops pilot first?
Operations tends to be the better starting point for most mid-market companies. Ops typically has more repetitive, well-documented workflows, clearer baseline metrics, and a higher tolerance for process change than finance does. Finance can work when the use case is narrow and well-defined, but finance teams are often more cautious about new tools touching their data, which can slow the pilot down before it produces anything useful. Start where the workflow is clearest and the stakeholder is most willing. A successful first pilot builds internal confidence for harder ones later.
What do we do if the pilot fails?
A failed pilot is more valuable than a pilot that never reached a decision. If the use case was wrong, the pilot revealed something the organization needed to know before spending more resources. If the technology wasn't ready, that's a finding that prevents a costly rollout. Document what didn't work and why. Share the findings with leadership rather than quietly setting them aside. The companies that build a real AI capability over time treat a failed pilot as data, not as evidence that the whole direction is wrong.
Not sure what you need yet? Ask the panel.
Heartwood is an AI advisory panel for mid-market executives who need on-demand technology strategy guidance. Start with your toughest question.