What Agentic AI Unlocks for GxP Software (and What It Demands of You)

In This Article

The shift: AI as a lifecycle participant, not just a tool
What it unlocks
Where it fails predictably
What GAMP guidance actually says
A working pattern: human-in-the-loop with named gates
Where to start

Frameworks Referenced

GxP, AI, GAMP 5, Validation, Automated Testing

The shift: AI as a lifecycle participant, not just a tool

For most of the last decade, AI in regulated software has meant either machine-learning models inside a product (a classifier, a forecasting engine) or static authoring assistants (a smarter spell-check). Both fit reasonably well into existing GxP frameworks: validate the model, qualify the assistant, move on.

Agentic AI is different. A coding agent is now a participant in the development process itself: it reads the requirements, writes the code, generates the test cases, drafts the validation evidence, and increasingly executes deployment steps. The artefacts that prove your software is fit for its intended use are partly authored by the same agent that wrote the software. That is a structural change, not an incremental one.

The right question is no longer "is the AI validated?". The right questions are: where in the lifecycle is the AI acting, what artefacts does it produce, and what evidence of human critical thinking surrounds those artefacts when an auditor asks?

What it unlocks

The capability that matters most is artefact density. A validation campaign for a moderately complex GxP application can require hundreds of documents — user requirements, functional specifications, design specifications, risk assessments, traceability matrices, IQ/OQ/PQ protocols, test scripts, and the signed evidence to back each one. Most organisations under-invest in this layer because the cost of producing it manually is high relative to its perceived informational content.

Agentic AI changes the economics. A well-prompted agent can take a user requirement, derive functional specifications, propose risk controls, generate test scripts in Gherkin or your preferred format, and emit a traceability matrix that links them — in minutes rather than weeks. Traceability that used to drift the moment a single requirement changed now stays current because regenerating it is cheap.
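One way to make "traceability that stays current" checkable is to store, with each trace link, a fingerprint of the requirement wording the test was derived from. The sketch below is a minimal illustration under that assumption; the `TraceLink` structure, the `URS-`/`TC-` identifiers, and the sample requirement texts are all hypothetical and not taken from any particular tool.

```python
import hashlib
from dataclasses import dataclass


def fingerprint(text: str) -> str:
    """Short, stable hash of a requirement's wording."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class TraceLink:
    requirement_id: str
    test_id: str
    requirement_fingerprint: str  # wording the test was derived from


def build_links(requirements, tests):
    """Build links; `tests` maps test_id -> the requirement_id it verifies."""
    return [
        TraceLink(req_id, test_id, fingerprint(requirements[req_id]))
        for test_id, req_id in sorted(tests.items())
    ]


def stale_links(links, requirements):
    """Links whose requirement was reworded or removed after the link was made."""
    return [
        link for link in links
        if link.requirement_id not in requirements
        or fingerprint(requirements[link.requirement_id]) != link.requirement_fingerprint
    ]


def untested(requirements, links):
    """Requirement IDs that no link points at."""
    covered = {link.requirement_id for link in links}
    return sorted(set(requirements) - covered)


# Hypothetical sample data.
requirements = {
    "URS-001": "The system shall record the user ID for every batch record change.",
    "URS-002": "The system shall lock a batch record after QA release.",
}
tests = {"TC-001": "URS-001", "TC-002": "URS-002"}
links = build_links(requirements, tests)

# Later, one requirement is reworded and a new one is added.
requirements["URS-002"] = "The system shall lock a batch record after QP release."
requirements["URS-003"] = "The system shall export the audit trail as PDF."

print([link.test_id for link in stale_links(links, requirements)])  # ['TC-002']
print(untested(requirements, links))  # ['URS-003']
```

A stale link is a prompt to regenerate or re-review the affected test, not proof that the test is wrong; the reviewer still decides.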

Automated testing benefits in the same way. Coverage that was previously aspirational becomes routine: every requirement gets at least one positive and one negative test, edge cases are enumerated systematically, and the suite re-runs on every commit. Combined with property-based testing and mutation testing, the floor of test quality rises substantially. For data-integrity-critical paths, this is genuinely material risk reduction.
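The "at least one positive and one negative test per requirement" rule is simple enough to enforce mechanically. The following is a minimal sketch, assuming each generated test carries a requirement ID and a `kind` label; the `CaseRecord` structure and the sample identifiers are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    test_id: str
    requirement_id: str
    kind: str  # "positive" or "negative"


def coverage_gaps(requirement_ids, cases):
    """Return {requirement_id: [missing kinds]} for requirements lacking a kind."""
    kinds_seen = defaultdict(set)
    for case in cases:
        kinds_seen[case.requirement_id].add(case.kind)
    required = {"positive", "negative"}
    gaps = {}
    for req_id in requirement_ids:
        missing = required - kinds_seen[req_id]
        if missing:
            gaps[req_id] = sorted(missing)
    return gaps


# Hypothetical suite: URS-002 lacks a negative test, URS-003 has no tests at all.
suite = [
    CaseRecord("TC-001", "URS-001", "positive"),
    CaseRecord("TC-002", "URS-001", "negative"),
    CaseRecord("TC-003", "URS-002", "positive"),
]
gaps = coverage_gaps(["URS-001", "URS-002", "URS-003"], suite)
print(gaps)  # {'URS-002': ['negative'], 'URS-003': ['negative', 'positive']}
```

A check like this only establishes the structural floor. It says nothing about whether the tests exercise meaningful failure modes, which remains a reviewer's call.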

Three secondary unlocks are worth naming. Audit-trail completeness improves because the agent can reliably emit structured logs of its own actions. Documentation freshness improves because regeneration is no longer a separate project. And periodic review becomes more honest, because the cost of re-running a full assessment against the current state of the system has collapsed.
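As an illustration of what a structured log of agent actions might look like, the sketch below appends JSON-serialisable entries and chains them by hash so that an after-the-fact edit is detectable. Hash chaining is an assumption added here for illustration, not something the article prescribes, and the field names and actor labels are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    """Hash a canonical JSON form so key order does not affect the result."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(log: list, actor: str, action: str, target: str) -> dict:
    """Append one action record, linked to the previous entry's hash."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "target": target,
        "previous_hash": log[-1]["entry_hash"] if log else GENESIS,
    }
    entry = {**body, "entry_hash": _digest(body)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """True if every entry's hash and back-link are still consistent."""
    previous = GENESIS
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "entry_hash"}
        if body["previous_hash"] != previous or _digest(body) != entry["entry_hash"]:
            return False
        previous = entry["entry_hash"]
    return True


log = []
append_entry(log, "agent:test-drafter", "generated", "TC-001")
append_entry(log, "user:reviewer.qa", "approved", "TC-001")
print(verify_chain(log))  # True

log[0]["target"] = "TC-999"  # simulate an edit after the fact
print(verify_chain(log))  # False
```

An in-memory list is used only to keep the example short; a real audit trail would need durable, access-controlled storage.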

Where it fails predictably

Hallucination is the headline risk and the easiest to over-index on. An agent will confidently produce a function name, a regulatory citation, or a test expectation that is plausible but wrong. In a non-regulated context, this is annoying. In a GxP context, an unverified citation in a risk assessment is a finding waiting to happen.

Non-determinism is the deeper risk. The same input prompt does not always produce the same output, and small upstream changes in the model, the tooling, or the surrounding context can shift behaviour without any change in your code. The validated state of yesterday is not necessarily the validated state of today. This is the same problem that ML/AI guidance has been wrestling with for years; it now applies to your build pipeline, not just your product.
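One way to make upstream drift visible is to record the agent's environment at the time of validation and compare it with the environment in use today. This is a minimal sketch under assumed field names; `AgentEnvironment`, the model identifiers, and the version strings are hypothetical placeholders.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentEnvironment:
    model_id: str
    agent_tool_version: str
    prompt_template_hash: str
    temperature: float


def drifted_fields(baseline: AgentEnvironment, current: AgentEnvironment) -> dict:
    """Return {field: (baseline_value, current_value)} for every changed field."""
    before, after = asdict(baseline), asdict(current)
    return {
        name: (before[name], after[name])
        for name in before
        if before[name] != after[name]
    }


# Hypothetical values: only the model identifier has changed since validation.
validated = AgentEnvironment("example-model-v1", "1.4.0", "9f2c", 0.0)
today = AgentEnvironment("example-model-v2", "1.4.0", "9f2c", 0.0)

changes = drifted_fields(validated, today)
print(changes)  # {'model_id': ('example-model-v1', 'example-model-v2')}
if changes:
    print("Re-evaluation trigger: environment differs from the validated baseline")
```

A comparison like this catches only declared changes. Behaviour that shifts behind an unchanged identifier would have to be caught another way, for example by re-running a fixed set of reference prompts and reviewing the differences.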

Three further risks deserve discipline. Data leakage: prompts and context windows can carry confidential or personal data into a third-party model, with implications for data integrity and privacy. Change of intended use: an agent originally scoped to draft test cases drifts into authoring requirements, and the validation rationale no longer fits the actual use. Provenance: if an agent's training data includes copyrighted code or contaminated patterns, your software inherits a debt you cannot easily measure.

Finally, there is the risk of false confidence in coverage. A test suite that looks comprehensive because the agent generated 400 cases is not necessarily a suite that exercises your real failure modes. Quantity without critical thinking is decoration.

What GAMP guidance actually says

GAMP 5 Second Edition (2022) is the relevant baseline. Two of its threads matter directly. First, the explicit framing of "critical thinking" as the discipline that makes risk-based validation work — not procedure for procedure's sake, but informed judgement about where the real risks lie and where evidence needs to be densest. Second, the lifecycle approach: validation is something you maintain, not something you finish.

ISPE's GAMP community has since published more specific guidance on AI and machine learning, including the GAMP Good Practice Guide on Artificial Intelligence and a series of concept papers. The principles are consistent and worth internalising: a clear statement of intended use; risk assessment that accounts for AI-specific failure modes; a defined lifecycle with explicit retraining or re-evaluation triggers; supplier assessment for AI components; and data-integrity discipline (ALCOA+) extended to training data, prompts, and model outputs.

Read together, the message is not "do not use AI in GxP". It is: be precise about where AI sits, what intended use it serves, what risk it creates at each step, and what evidence demonstrates that the risk is controlled. This applies as cleanly to a coding agent inside your CI pipeline as it does to a model inside your product.

A working pattern: human-in-the-loop with named gates

The pattern that aligns most cleanly with GAMP guidance has two properties. First, the agent is scoped narrowly: it produces drafts of specific artefacts (test cases, traceability rows, risk-assessment entries) but does not execute risk-bearing actions without explicit human approval. Second, critical thinking happens at named, signed gates rather than diffusely across the workflow.

Concretely: a senior engineer reviews and signs off on agent-generated requirements before they enter the requirements baseline. A QA practitioner reviews agent-generated test scripts against actual failure modes, not just syntactic correctness. A named reviewer signs the traceability matrix. The agent accelerates each of these gates; it does not replace the judgement at them.
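The gates described above can be sketched as a small state change that refuses to proceed without a named human holding the right role. This is an illustration only: the role names, the `agent:` prefix convention, and the requirement that a sign-off carries a written rationale are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping of artefact kind to the role that must sign it.
REQUIRED_ROLE = {
    "requirement": "senior_engineer",
    "test_script": "qa_practitioner",
    "traceability_matrix": "named_reviewer",
}


class GateError(Exception):
    """Raised when a sign-off does not satisfy the gate."""


@dataclass
class Artefact:
    artefact_id: str
    kind: str
    authored_by: str
    status: str = "draft"
    signed_by: Optional[str] = None
    rationale: Optional[str] = None


def sign_off(artefact, reviewer, reviewer_roles, rationale):
    """Move an artefact from draft to approved if the gate conditions hold."""
    required = REQUIRED_ROLE[artefact.kind]
    if reviewer.startswith("agent:"):
        raise GateError("an agent cannot sign a gate")
    if required not in reviewer_roles.get(reviewer, set()):
        raise GateError(f"{reviewer} does not hold role {required}")
    if not rationale.strip():
        raise GateError("a sign-off needs a recorded rationale")
    artefact.status = "approved"
    artefact.signed_by = reviewer
    artefact.rationale = rationale
    return artefact


roles = {"reviewer.qa": {"qa_practitioner"}}
draft = Artefact("TC-001", "test_script", authored_by="agent:test-drafter")

try:
    sign_off(draft, "agent:test-drafter", roles, "Looks fine.")
except GateError as err:
    print(f"Rejected: {err}")  # Rejected: an agent cannot sign a gate

sign_off(draft, "reviewer.qa", roles, "Checked against known failure modes; added a timeout case.")
print(draft.status, draft.signed_by)  # approved reviewer.qa
```

Requiring a rationale is one lightweight way to leave evidence that the reviewer exercised judgement rather than clicking through.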

The harder discipline — and the one most teams will need to grow into — is evidence pinning. The validation file should reference the model and tool versions that produced each artefact, and a periodic re-run under controlled conditions should be part of how you decide whether the validated state still holds. You do not need a perfect logging system on day one, but you do need a clear answer to "how would we reconstruct, today, what the agent produced and on what basis?".
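A provenance record for evidence pinning might look like the sketch below: it stores the model and tool versions alongside hashes of the prompt and the output, so that the artefact in the validation file can later be checked against what the agent actually produced. All field names and sample values are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_text(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ProvenanceRecord:
    artefact_id: str
    model_id: str
    tool_version: str
    prompt_sha256: str
    output_sha256: str
    generated_at: str  # ISO 8601 timestamp supplied by the caller


def pin(artefact_id, model_id, tool_version, prompt, output, generated_at):
    """Create the record that ties an artefact to how it was produced."""
    return ProvenanceRecord(
        artefact_id=artefact_id,
        model_id=model_id,
        tool_version=tool_version,
        prompt_sha256=sha256_text(prompt),
        output_sha256=sha256_text(output),
        generated_at=generated_at,
    )


def matches_pinned_output(record: ProvenanceRecord, current_text: str) -> bool:
    """True if the artefact text is byte-for-byte what was pinned."""
    return sha256_text(current_text) == record.output_sha256


# Hypothetical prompt, output, and identifiers.
prompt = "Draft a negative test for URS-002."
output = "Given a released batch record, when a user edits it, then the edit is rejected."
record = pin("TC-002", "example-model-v1", "1.4.0", prompt, output, "2024-01-15T10:00:00+00:00")

print(json.dumps(asdict(record), indent=2))
print(matches_pinned_output(record, output))  # True
print(matches_pinned_output(record, output + " (edited)"))  # False
```

Because agent output is not deterministic, a hash mismatch on a periodic re-run is better treated as a trigger for human comparison than as an automatic failure.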

Where to start

Pick one artefact class first. Test-case generation is usually the safest starting point: outputs are concrete, easy to evaluate, and the failure modes (missing edge cases, irrelevant tests) are visible to a competent reviewer. Traceability matrices are a close second.

Establish the surrounding evidence before you scale. Decide what gets logged, who reviews what, and how model versions and prompts are recorded. If you cannot answer those questions for one artefact class, you cannot answer them for ten.

Then expand deliberately. The benefit compounds: every artefact class you bring under the same logging-and-review discipline reduces the marginal cost of the next one. The unlock is real, but only if the evidence keeps pace with the velocity. That is what GAMP has been pointing at for thirty years, and agentic AI is the latest reason it matters.

Jasper Donkers

Management Consultant at The Digital Capability Company

Builds and validates regulated digital systems for life-sciences clients. Works at the intersection of GxP compliance, software engineering, and applied AI.

Also writes at HiddenCove