I have sat through roughly fifty AI vendor pitches in the last eighteen months. The decks are almost interchangeable. A gradient background, a founder quote about transformation, a logo wall, a benchmark chart that always favors the presenter, and a demo that works on a cherry-picked input.

Ten questions do most of the filtering. If a vendor cannot answer these, the pitch is the product, not the technology. Here they are.

1. What exactly happens to the data we send you?

Not "we take privacy seriously." Walk me through the path. Which systems touch the prompt, which subprocessors, how long is it stored, where is it stored, who at your company can read it, and under what circumstances. A vendor that cannot produce this diagram on request has not thought about it enough for me to trust them with regulated data.

2. Show me evals on my data, not your demos.

Demos are curated. Benchmarks are public and therefore contaminated. What I want is: give me a trial with a hundred of my real examples, show me the outputs, and let me grade them. If a vendor will not do this, there is a reason. Usually the reason is that their system works on the demo and not much else.
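
A minimal version of that trial fits in an afternoon. The sketch below assumes a hypothetical call_vendor client and a CSV of your own examples; the point is that you pick the inputs and you do the grading.

```python
import csv

def call_vendor(prompt: str) -> str:
    """Placeholder for the vendor's API client; swap in the real SDK call during the trial."""
    raise NotImplementedError

# examples.csv: one column named "input", holding ~100 real cases from your own systems.
with open("examples.csv", newline="") as src, open("graded.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    writer.writerow(["input", "output", "grade"])  # "grade" stays empty; a human fills it in
    for row in csv.DictReader(src):
        writer.writerow([row["input"], call_vendor(row["input"]), ""])
```

Grading a hundred rows by hand is tedious, but it tells you more than any benchmark chart in the deck.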

3. How do you handle model versioning?

Your vendor's underlying model will change. What is their policy when the provider ships a new version? Do they pin, do they pass through, do they regression-test? I have watched a provider "transparently upgrade" a model and silently break production workflows. The right answer involves regression testing against a held-out suite before rolling customers forward, with a rollback path.
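
The shape of that answer, as a sketch: pin the model version explicitly, and move the pin only after a candidate clears the held-out suite. The function names and the pass-rate bar below are illustrative, not anyone's actual pipeline.

```python
PINNED_MODEL = "provider-model-2024-06-01"   # an explicit pin, never "latest"

def suite_pass_rate(model: str, suite: list[dict]) -> float:
    """Run the held-out regression suite against `model` and return the pass rate.
    Placeholder; wire this to whatever eval harness you trust."""
    raise NotImplementedError

def model_to_serve(candidate: str, suite: list[dict], bar: float = 0.95) -> str:
    """Roll forward only if the candidate clears the bar; otherwise keep the pinned version."""
    if suite_pass_rate(candidate, suite) >= bar:
        return candidate        # deliberate, tested roll-forward
    return PINNED_MODEL         # rollback path: keep serving the known-good version
```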

4. What is your incident history?

Every mature vendor has one. Ask to see it. A vendor that claims zero incidents in two years of production either does not run production or does not notice their incidents. Both are disqualifying. What I want to see is: they had incidents, they wrote postmortems, and the postmortems show learning.

5. What does auditability look like?

When a customer complaint comes in and the answer was wrong, can we reconstruct what the model saw, what it returned, and why? If the audit log captures only the final answer, that is not enough. We need the prompt, the retrieved context identifiers, the tool calls if any, and the timestamp. This is table stakes in regulated industries and increasingly table stakes everywhere else.
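
Concretely, a reconstructable record needs at least the fields below, plus the model version that question 3 already makes you care about. The names are my own illustration, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class AuditRecord:
    request_id: str
    timestamp: datetime               # when the request was served
    model_version: str                # which model actually answered (see question 3)
    prompt: str                       # the prompt exactly as the model saw it
    retrieved_context_ids: list[str]  # identifiers of the documents pulled into context
    tool_calls: list[dict]            # tool name, arguments, result; empty list if none
    final_answer: str                 # what the customer was shown
```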

6. Show me the contract clauses on model changes and data use.

Two clauses I read carefully: the training-use clause (is my data used to train their or anyone else's models) and the change-notification clause (will I be told before the underlying model changes, and can I opt out). If these are not in the draft contract, or they are phrased as "we may," assume the worst case and ask for tighter language.

7. What is your privacy posture for subprocessors?

The vendor's posture is only as strong as the weakest subprocessor they route to. A vendor hosting on AWS US-East and claiming EU data residency is either wrong or lying, and both outcomes are bad. Ask for the full list. Ask what happens if a subprocessor changes. Ask whether you will be notified.

8. How good is the model card?

A serious vendor has a model card: what the model does, what it does not do, known limitations, known failure modes, evaluation methodology. If the model card is marketing copy, the vendor is selling a brand. If it is technical and candid about limitations, the team behind it is serious.
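
A rough test I apply: could the card be reduced to something like the skeleton below without losing information? The section names are mine, not a formal specification.

```python
# A hypothetical skeleton of what a candid model card covers; the section names are mine.
MODEL_CARD_SECTIONS = {
    "intended_use":        "The concrete tasks the model is built for",
    "out_of_scope":        "What it should explicitly not be used for",
    "known_limitations":   "Inputs and domains where quality degrades",
    "known_failure_modes": "How it fails: hallucination, refusal, formatting drift",
    "evaluation":          "Datasets, metrics, and methodology, not just headline numbers",
}
```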

9. What customers have churned, and why?

Nobody asks this. Ask it. The logos on the website are sales. The churn is operations. A vendor comfortable discussing why customers left has a healthy engineering culture. A vendor visibly thrown by the question does not.

10. Who owns the prompt engineering and evals, you or us?

This is the question that reveals the nature of the product. If the vendor owns the prompts and evals, you are buying an outcome and you are trusting them completely. If you own them, you are buying a platform and the work is yours. Both are valid. Many vendors are evasive here because the truth is somewhere in between, and the answer changes the price.

How to use this list

Do not send all ten as an email questionnaire. Raise three or four of them in the early conversations. The ones that matter most depend on your threat model, your regulatory context, and what you are buying. The point of the list is not to fail vendors. It is to find the ones who have been in production long enough to have the answers, and who respect you enough to give them straight.

Those vendors exist. They are worth the search.