<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://pierredemalliard.com/feed.xml" rel="self" type="application/atom+xml" /><link href="http://pierredemalliard.com/" rel="alternate" type="text/html" /><updated>2026-05-07T21:25:09+00:00</updated><id>http://pierredemalliard.com/feed.xml</id><title type="html">Pierre de Malliard</title><subtitle></subtitle><author><name>Pierre de Malliard</name></author><entry><title type="html">Eligible Isn’t Compliant</title><link href="http://pierredemalliard.com/2026/05/07/Eligible-Is-Not-Compliant.html" rel="alternate" type="text/html" title="Eligible Isn’t Compliant" /><published>2026-05-07T00:00:00+00:00</published><updated>2026-05-07T00:00:00+00:00</updated><id>http://pierredemalliard.com/2026/05/07/Eligible-Is-Not-Compliant</id><content type="html" xml:base="http://pierredemalliard.com/2026/05/07/Eligible-Is-Not-Compliant.html"><![CDATA[<h1 id="eligible-isnt-compliant">Eligible isn’t Compliant</h1>

<p><em>Closing the gap between AWS services and validated AI in GxP environments</em></p>

<hr />

<p>Every life sciences customer I talk to arrives with the same question, phrased differently each time:</p>

<blockquote>
  <p>“You keep saying Bedrock and AgentCore are <em>eligible</em>. Show me one customer that actually made it <em>compliant</em> in their GxP environment.”</p>
</blockquote>

<p>The frustration is fair. “HIPAA-eligible” and “GxP-compatible” are AWS statements about AWS Services. They don’t say anything about whether the application a customer builds on top is validated, inspection-ready, or acceptable to a regulator. Eligibility is the starting line, not the finish line.</p>

<hr />

<h2 id="the-gap-in-one-sentence">The gap in one sentence</h2>

<p>The shared responsibility model for GxP AI follows the more general cloud shared responsibility model: AWS provides the capabilities; the customer owns how they are configured, validated, and operated.</p>

<table>
  <thead>
    <tr>
      <th>Capability domain</th>
      <th>AWS provides</th>
      <th>Customer owns</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Qualified infrastructure &amp; identity</strong></td>
      <td>ISO 27001/9001, SOC 1/2/3, FedRAMP, HIPAA eligibility; IAM, IAM Identity Center, AgentCore Identity</td>
      <td>Supplier qualification documentation, role definitions, least-privilege policies, segregation of duties</td>
    </tr>
    <tr>
      <td><strong>Observability</strong></td>
      <td>AgentCore Observability (sessions, traces, spans), CloudTrail, CloudWatch</td>
      <td>What to monitor, thresholds, alerting rules, log retention policy</td>
    </tr>
    <tr>
      <td><strong>Encryption &amp; data protection</strong></td>
      <td>KMS, TLS primitives, Bedrock Guardrails PHI/PII redaction, Comprehend Medical</td>
      <td>Sensitive data classification, key management policy, data residency decisions</td>
    </tr>
    <tr>
      <td><strong>Immutability &amp; versioning</strong></td>
      <td>Bedrock immutable model versions, CloudFormation IaC, S3 Object Lock (WORM)</td>
      <td>Pinning model versions, prompt and tool version control, change approval process</td>
    </tr>
    <tr>
      <td><strong>Control flow &amp; HITL</strong></td>
      <td>Step Functions, Prompt Flows, AgentCore Policy, MCP elicitation support</td>
      <td>HITL trigger design, confidence thresholds, escalation paths, approver workflows</td>
    </tr>
    <tr>
      <td><strong>Evaluation &amp; assurance</strong></td>
      <td>AgentCore Evaluations, Bedrock model evaluations, Automated Reasoning, Audit Manager, Config conformance packs (21 CFR Part 11)</td>
      <td>Validation strategy (CSA / IQ/OQ/PQ), acceptance criteria, intended use documentation, SOP integration</td>
    </tr>
  </tbody>
</table>

<p>AWS delivers the primitives. The customer defines the policy. The regulator inspects how those two combine in your application. You don’t become compliant by procuring eligible services. You become compliant by <em>using them inside a validated application</em>.</p>
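
<p>As a concrete illustration of the split, here is a minimal sketch (Python / boto3) of the “Immutability &amp; versioning” row: S3 Object Lock is the AWS primitive, while the bucket, record layout, and retention period are customer policy decisions; the values below are placeholders.</p>

<pre><code class="language-python">import json
from datetime import datetime, timedelta, timezone

import boto3

s3 = boto3.client("s3")

# Customer policy decisions (placeholder values for illustration):
BUCKET = "gxp-audit-evidence"   # bucket must have been created with Object Lock enabled
RETENTION_YEARS = 10            # driven by your record-retention SOP

def write_immutable_audit_record(record_id, payload):
    """Write an audit record that cannot be altered or deleted until retention expires."""
    retain_until = datetime.now(timezone.utc) + timedelta(days=365 * RETENTION_YEARS)
    s3.put_object(
        Bucket=BUCKET,
        Key=f"audit/{record_id}.json",
        Body=json.dumps(payload).encode("utf-8"),
        ObjectLockMode="COMPLIANCE",             # AWS primitive: WORM, cannot be shortened by any user
        ObjectLockRetainUntilDate=retain_until,  # customer policy: how long
    )
</code></pre>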

<hr />

<h2 id="why-uniform-validation-doesnt-work-anymore">Why uniform validation doesn’t work anymore</h2>

<p>Traditional CSV treated every system the same: full IQ/OQ/PQ regardless of risk. That approach breaks on AI for two reasons. First, it’s expensive: full CSV on every prompt change is not survivable. Second, it’s not flexible enough: a literature summarization tool used internally has nothing in common, risk-wise, with an AI agent that influences a regulatory submission.</p>

<p>Two forces are driving the change. FDA’s <a href="https://www.fda.gov/media/161521/download">Computer Software Assurance guidance</a> is the efficiency fix: match validation rigor to actual risk, not to a uniform template. The EU AI Act’s August 2026 high-risk enforcement is the deadline. Conformity assessment, technical documentation, and ongoing monitoring are now legal obligations for high-risk AI, and most life sciences use cases (clinical decision support, pharmacovigilance, quality) land in that bucket.</p>

<p>CSA gives you the framework. The EU AI Act sets the clock.</p>

<p>The classic example: an AI agent that summarizes scientific literature. The same agent lands in a different risk tier depending on where its output goes:</p>

<ul>
  <li><strong>Internal team meeting summaries</strong> → low risk → minimal controls, basic audit trail</li>
  <li><strong>Input to research direction</strong> → medium risk → structured HITL, drift monitoring, Guardrails</li>
  <li><strong>Cited in a regulatory submission</strong> → high risk → full IQ/OQ/PQ, automated reasoning, red team, CAPA</li>
</ul>

<p>Before any PoC: classify the workload.</p>

<blockquote>
  <p>Severity × probability × detectability</p>
</blockquote>

<p>What harm is done if something goes wrong? How likely is that scenario? Can failures or hallucinations be detected? These questions dictate everything downstream: validation cost, time to production, documentation burden, and whether the system will survive an inspection.</p>
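
<p>A minimal sketch of what that classification can look like in code. The 1–5 scales and the tier cut-offs are illustrative assumptions, not a standard; your quality organization defines the real ones.</p>

<pre><code class="language-python"># FMEA-style risk priority number: severity x probability x detectability.
# Scales and thresholds below are illustrative assumptions, not a standard.

def classify_workload(severity, probability, detectability):
    """Each factor scored 1 (best) to 5 (worst); a high detectability score means failures are hard to detect."""
    for factor in (severity, probability, detectability):
        assert 1 &lt;= factor &lt;= 5, "score each factor on a 1-5 scale"
    rpn = severity * probability * detectability   # risk priority number, max 125
    if rpn &lt;= 20:
        return "low"       # e.g. internal meeting summaries: basic audit trail
    if rpn &lt;= 60:
        return "medium"    # e.g. input to research direction: HITL, drift monitoring, Guardrails
    return "high"          # e.g. cited in a regulatory submission: full IQ/OQ/PQ, red team, CAPA

print(classify_workload(severity=5, probability=3, detectability=5))   # "high"
</code></pre>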

<hr />

<h2 id="what-regulators-actually-expect">What regulators actually expect</h2>

<p>Regulators don’t care about specific use cases. They care about a small set of properties the system must demonstrate, regardless of whether it’s harmonizing clinical data or screening promotional material. Five expectations cover most of what matters, and each maps to a concrete pattern that you can build with features of eligible AWS Services.</p>

<p><strong>1. Reproducibility: same input, same output.</strong>
The probabilistic nature of LLMs can be constrained: pin model versions, lower model temperatures, version your prompts in source control, and define your application as Infrastructure as Code (IaC). You get a near-deterministic system built from stochastic parts. AgentCore Evaluations can help you monitor, verify, and track agentic behavior. <em>This is relevant for: clinical data harmonization (SDTM → ADaM) and anything that feeds regulatory submissions.</em></p>
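
<p>A minimal sketch of the pinning idea using the Bedrock Converse API via boto3. The model ID is just an example of a fully versioned identifier; the point is that it, the temperature, and the prompt all live in version control and change only through change control.</p>

<pre><code class="language-python">import boto3

bedrock = boto3.client("bedrock-runtime")

# Pinned, fully versioned model identifier (example value); never an alias like "latest".
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"

# Prompt template lives in source control alongside this code and is change-controlled.
PROMPT_VERSION = "summarize-literature/v1.3.0"
PROMPT = "Summarize the following abstract in three sentences:\n\n{abstract}"

def summarize(abstract):
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": PROMPT.format(abstract=abstract)}]}],
        # Low temperature narrows output variance; it does not make an LLM truly deterministic.
        inferenceConfig={"temperature": 0.0, "maxTokens": 512},
    )
    return response["output"]["message"]["content"][0]["text"]
</code></pre>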

<p><strong>2. Traceability: every factual claim ties to a source.</strong>
Retrieval-augmented generation with citation attribution via Bedrock Knowledge Bases. Automated Reasoning checks against the grounded context. If it can’t cite it, it can’t say it. Final authorship stays human. AI accelerates drafting and consistency. <em>This is relevant for: clinical study report authoring, MLR content review.</em></p>
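
<p>A minimal sketch of citation-attributed RAG with Bedrock Knowledge Bases; the knowledge base ID and model ARN are placeholders. The <code>citations</code> block in the response is what makes the “if it can’t cite it, it can’t say it” rule enforceable.</p>

<pre><code class="language-python">import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def answer_with_citations(question, kb_id, model_arn):
    """Return the generated answer plus the source passages it was grounded on."""
    response = agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,   # placeholder: your validated document corpus
                "modelArn": model_arn,      # placeholder: pinned model version ARN
            },
        },
    )
    answer = response["output"]["text"]
    sources = [
        ref["location"]
        for citation in response.get("citations", [])
        for ref in citation.get("retrievedReferences", [])
    ]
    return answer, sources   # an answer without sources should be treated as unusable
</code></pre>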

<p><strong>3. Explainability: the reasoning chain is inspectable, not only the output.</strong>
AgentCore Observability captures sessions, traces, and spans: the full chain of tool calls, retrieved documents, and intermediate decisions that produced a given answer. For multi-step agents, this is the audit trail. In pharmacovigilance signal detection, the question “why did you flag this case?” needs a reconstructable answer, not only a probability score.</p>
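
<p>The same property can be sketched with standard OpenTelemetry-style spans for any custom component you instrument yourself (exporter configuration omitted; span names, attributes, the confidence threshold, and the two tool functions are assumptions for illustration, not the AgentCore schema):</p>

<pre><code class="language-python">from opentelemetry import trace

# Assumes an OpenTelemetry tracer provider / exporter is configured elsewhere.
tracer = trace.get_tracer("pv-signal-agent")

def triage_case(case):
    # One parent span per case; child spans for each reasoning step and tool call.
    with tracer.start_as_current_span("triage_case") as root:
        root.set_attribute("case.id", case["id"])

        with tracer.start_as_current_span("retrieve_similar_cases") as span:
            similar = retrieve_similar_cases(case)          # hypothetical tool call
            span.set_attribute("retrieved.count", len(similar))

        with tracer.start_as_current_span("score_signal") as span:
            score, rationale = score_signal(case, similar)  # hypothetical model call
            span.set_attribute("signal.score", score)
            span.set_attribute("signal.rationale", rationale)

        root.set_attribute("decision.flagged", score &gt;= 0.8)   # threshold is an assumption
        return score, rationale
</code></pre>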

<p><strong>4. Human accountability: a qualified person signs off on consequential decisions.</strong>
Step Functions HITL gates, Prompt Flow approvals, AgentCore Policy enforcing action boundaries in code, not only in procedure. Confidence thresholds route low-certainty outputs to human review. The system can recommend; the human decides. <em>Where this bites: MLR pre-screening (committee retains authority), AE case adjudication, any AI output that influences a regulatory submission.</em></p>
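
<p>A minimal sketch of a confidence-gated HITL step using Step Functions’ task-token callback pattern. The threshold, queue URL, and payload layout are assumptions; the AWS calls are the standard <code>waitForTaskToken</code> pattern.</p>

<pre><code class="language-python">import json

import boto3

sfn = boto3.client("stepfunctions")
sqs = boto3.client("sqs")

CONFIDENCE_THRESHOLD = 0.85    # assumption: set and justified by your quality organization
REVIEW_QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/mlr-review"   # placeholder

def route_output(event, context):
    """Lambda behind a .waitForTaskToken state: auto-continue or escalate to a human."""
    confidence = event["model_output"]["confidence"]
    task_token = event["taskToken"]

    if confidence &gt;= CONFIDENCE_THRESHOLD:
        # High confidence: record the decision and let the workflow continue.
        sfn.send_task_success(
            taskToken=task_token,
            output=json.dumps({"decision": "auto", "payload": event["model_output"]}),
        )
    else:
        # Low confidence: park the workflow until a qualified reviewer responds.
        sqs.send_message(
            QueueUrl=REVIEW_QUEUE_URL,
            MessageBody=json.dumps({"taskToken": task_token, "payload": event["model_output"]}),
        )

# The reviewer's approval tool later calls send_task_success (or send_task_failure to reject)
# with the same task token, plus the reviewer's identity for the audit trail.
</code></pre>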

<p><strong>5. Continuous monitoring and change control: validation is where compliance begins, not where it ends.</strong>
CloudWatch thresholds on model performance, drift detection via AgentCore Evaluations, AWS Config for configuration drift, Audit Manager for continuous evidence collection. This expectation is where most customers hit the “partner AI” problem: AI features auto-activating in validated Veeva, Box, ServiceNow, and Salesforce environments through monthly tenant updates. Unless change control detects and qualifies those activations <em>before</em> they go live, you’re running an unvalidated change in a validated GxP system. Nearly every customer who has invested in GxP SaaS platforms carries this exposure today.</p>
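
<p>A minimal sketch of the CloudWatch side: publish an AI-specific quality metric from your scheduled evaluation job and alarm when it degrades. The namespace, metric name, threshold, and SNS topic are assumptions.</p>

<pre><code class="language-python">import boto3

cloudwatch = boto3.client("cloudwatch")

NAMESPACE = "GxP/LiteratureAgent"    # assumption: your own metric namespace
ALARM_TOPIC = "arn:aws:sns:us-east-1:123456789012:quality-alerts"   # placeholder

def publish_groundedness(score, model_id):
    """Called by the scheduled evaluation job after each run."""
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[{
            "MetricName": "GroundednessScore",
            "Dimensions": [{"Name": "ModelId", "Value": model_id}],
            "Value": score,
        }],
    )

def create_drift_alarm(model_id):
    """One-time setup: alert the quality team if groundedness drops below the agreed floor."""
    cloudwatch.put_metric_alarm(
        AlarmName=f"groundedness-drift-{model_id}",
        Namespace=NAMESPACE,
        MetricName="GroundednessScore",
        Dimensions=[{"Name": "ModelId", "Value": model_id}],
        Statistic="Average",
        Period=86400,                  # evaluate daily
        EvaluationPeriods=1,
        Threshold=0.9,                 # assumption: acceptance criterion from validation
        ComparisonOperator="LessThanThreshold",
        AlarmActions=[ALARM_TOPIC],
    )
</code></pre>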

<p>These five are the lens. If your architecture can demonstrate all five with AWS-native evidence, the use case specifics become details, not obstacles.</p>

<hr />

<h2 id="validation-as-an-ongoing-journey">Validation as an ongoing-journey</h2>

<p>The unlock for GxP AI is treating the validation lifecycle as infrastructure, not paperwork.</p>

<ul>
  <li><strong>Sprint 0:</strong> risk classification, CSA-based validation strategy, IaC baseline (CloudFormation), AWS Config conformance pack enabled, CI/CD pipeline with compliance gates.</li>
  <li><strong>Dev sprints:</strong> feature work with prompts versioned, models pinned, Guardrails configured, and evidence auto-collected via Audit Manager.</li>
  <li><strong>Validation sprint (not quarter):</strong> IQ is automated CloudFormation deployment verification. OQ and PQ are automated test execution against pre-agreed acceptance criteria, including AgentCore Evaluations for AI-specific performance tests. Trace review happens in AgentCore Observability. Quality gate signs off.</li>
  <li><strong>Production:</strong> continuous validation via CloudWatch thresholds, AgentCore Evaluations, drift detection, quarterly review via Audit Manager reports, change control triggered on any model or prompt change.</li>
</ul>

<p>The regulated landing zone pattern (pre-qualified multi-account environments with change management, configuration, and security baked in) means each new AI workload starts on a validated foundation instead of building one. Customers using this approach report 30–40% reductions in qualification cycle times, and timelines from 12–16 weeks down to 4–8 weeks for adding a new service like Bedrock to an already-qualified estate.</p>

<p>Validation becomes reusable infrastructure, not paperwork redone each time.</p>
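
<p>One way to make “validation as reusable infrastructure” concrete: a sketch of an automated IQ-style check that verifies the deployed stack still matches its qualified CloudFormation template. The stack name is a placeholder, and treating an <code>IN_SYNC</code> result as the IQ pass criterion is an assumption your validation plan would have to formalize.</p>

<pre><code class="language-python">import time

import boto3

cfn = boto3.client("cloudformation")

def installation_check(stack_name):
    """Automated IQ-style check: does the live environment match the qualified template?"""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    # Drift detection is asynchronous; poll until it completes.
    while True:
        status = cfn.describe_stack_drift_detection_status(StackDriftDetectionId=detection_id)
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        time.sleep(5)

    drift = status["StackDriftStatus"]   # e.g. "IN_SYNC" or "DRIFTED"
    evidence = cfn.describe_stack_resource_drifts(StackName=stack_name)["StackResourceDrifts"]
    # Persist the per-resource drift report as IQ evidence (e.g. to a WORM-protected bucket).
    return drift == "IN_SYNC", evidence

passed, evidence = installation_check("gxp-literature-agent-prod")   # placeholder stack name
</code></pre>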

<hr />

<h2 id="the-bottom-line">The bottom line</h2>

<p>“Is this feasible in a GxP environment?” is the wrong question. Of course it’s feasible, and multiple top-10 pharma companies have done it. The real questions are:</p>

<ol>
  <li>Have I classified the workload against actual risk, not assumed uniform validation?</li>
  <li>Is my validation strategy expressed as infrastructure, so it can be re-run rather than re-invented?</li>
  <li>Do I have the AI-specific controls (prompt versioning, tool dependency locking, drift detection, automated reasoning) that standard software validation doesn’t cover?</li>
  <li>Am I governing the partner AI features that are activating in my validated platforms whether I’m ready or not?</li>
</ol>

<p>Eligibility gets you through the AWS part of the conversation. Compliance is what you build on top. The good news: the patterns are now concrete enough to be repeatable.</p>

<hr />

<p>Additional resources: <a href="https://aws.amazon.com/blogs/machine-learning/a-guide-to-building-ai-agents-in-gxp-environments/">A Guide to Building AI Agents in GxP Environments</a>.</p>]]></content><author><name>Pierre de Malliard</name></author><summary type="html"><![CDATA[Eligible isn’t Compliant]]></summary></entry><entry><title type="html">My favorite Blog posts</title><link href="http://pierredemalliard.com/2023/11/29/Favorite-blog-posts.html" rel="alternate" type="text/html" title="My favorite Blog posts" /><published>2023-11-29T00:00:00+00:00</published><updated>2023-11-29T00:00:00+00:00</updated><id>http://pierredemalliard.com/2023/11/29/Favorite-blog-posts</id><content type="html" xml:base="http://pierredemalliard.com/2023/11/29/Favorite-blog-posts.html"><![CDATA[<p>Here is a list of blogs that have influenced my thinking: I keep this list mostly for personal reference in no particular order</p>

<ul>
  <li>Cities and ambition, <em>Paul Graham</em> <a href="https://www.paulgraham.com/cities.html">link</a></li>
  <li>The Techno-optimism manifesto, <em>Vitalik Buterin</em> <a href="https://vitalik.eth.limo/general/2023/11/27/techno_optimism.html">link</a></li>
  <li>To a Caveman very few things are resources, <em>Naval Ravikant</em> <a href="https://nav.al/caveman">link</a></li>
  <li>The fear setting exercise, <em>Tim Ferriss</em> <a href="https://tim.blog/2017/05/15/fear-setting/">link</a></li>
  <li>Climbing the wrong hill, <em>Chris Dixon</em> <a href="https://cdixon.org/2009/09/19/climbing-the-wrong-hill">link</a></li>
</ul>]]></content><author><name>Pierre de Malliard</name></author><summary type="html"><![CDATA[Here is a list of blogs that have influenced my thinking: I keep this list mostly for personal reference in no particular order]]></summary></entry><entry><title type="html">Large Language Models - Strengths, Weaknesses, Opportunities &amp;amp; Threats</title><link href="http://pierredemalliard.com/2023/10/09/LargeLanguageModels.html" rel="alternate" type="text/html" title="Large Language Models - Strengths, Weaknesses, Opportunities &amp;amp; Threats" /><published>2023-10-09T00:00:00+00:00</published><updated>2023-10-09T00:00:00+00:00</updated><id>http://pierredemalliard.com/2023/10/09/LargeLanguageModels</id><content type="html" xml:base="http://pierredemalliard.com/2023/10/09/LargeLanguageModels.html"><![CDATA[<p>Understand the difference between <em>Language Tasks</em> and <em>Knowledge Tasks</em>. Language tasks involve understanding and generating language, such as writing essays or formatting output as JSON. Knowledge tasks require accessing and providing factual information or real-world knowledge. LLMs are good at language tasks (reformatting outputs, reformulating, etc.) but struggle with knowledge tasks:</p>

<ul>
  <li>They produce incorrect and contradictory statements</li>
  <li>They produce dangerous and socially unacceptable statements (including bias and other harmful output)</li>
  <li>Training, retraining, and inference are expensive</li>
  <li>The knowledge cannot easily be updated: updating just one fact is nearly impossible</li>
  <li>Lack of attribution: no easy way to determine which document in the training data is responsible for which part of the knowledge</li>
  <li>Poor performance on non-language tasks (reasoning tasks etc.)</li>
</ul>

<p>Retrieval-Augmented Generation (RAG) seems to be the holy grail, but it is not a panacea:</p>
<ul>
  <li>Implicit world knowledge (in the LLM) can interfere with knowledge from retrieved documents (hallucination)</li>
  <li>Only as good as the vector embeddings generated for each chunk of data</li>
</ul>

<p>Some academic research suggests that “hacky” workarounds can make LLM-based solutions more robust:</p>
<ul>
  <li>Improve consistency of answers by asking the same question multiple times and keeping the most consistent answer through majority voting (see the sketch after this list)</li>
  <li>Leverage feedback-based mechanisms</li>
</ul>
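
<p>A minimal sketch of the majority-voting idea (<code>ask_model</code> is a placeholder for whatever LLM call you use):</p>

<pre><code class="language-python">from collections import Counter

def self_consistent_answer(question, ask_model, n=5):
    """Ask the same question n times and keep the most frequent answer."""
    answers = [ask_model(question).strip() for _ in range(n)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n   # answer plus a rough agreement score
</code></pre>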

<p>Threats:</p>
<ul>
  <li>Multimodal models</li>
  <li>Smaller models (Alpaca, Llama, Mistral)</li>
  <li>Security &amp; Safety Issues: Jailbreak attacks, Prompt Injection attacks, Data Exfiltration attacks, Data poisoning attacks etc.</li>
  <li>Evaluation: Latency, Tokens, Human Evaluation</li>
</ul>

<p>Opportunities :</p>
<ul>
  <li>LLMOps: Including Data Drift, Model Quality Drift</li>
  <li>Design for potential model retraining / fine-tuning: capture the API input/output to allow for proprietary model training (see the sketch after this list)</li>
</ul>
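
<p>A minimal sketch of the capture idea from the list above: log each prompt/completion pair as a JSON line so it can later be filtered and reused as fine-tuning data. The path and field names are arbitrary choices.</p>

<pre><code class="language-python">import json
from datetime import datetime, timezone

CAPTURE_PATH = "llm_io_log.jsonl"   # arbitrary; JSON Lines is a common fine-tuning data format

def capture(prompt, completion, model_id):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model_id,
        "prompt": prompt,
        "completion": completion,
    }
    with open(CAPTURE_PATH, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
</code></pre>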

<p><strong>Interesting reads / Sources:</strong></p>
<ul>
  <li><a href="https://web.stanford.edu/class/cs224u/2020/">Stanford Natural Language understanding</a></li>
  <li><a href="https://www.youtube.com/watch?v=cEyHsMzbZBs">Thomas Dietterich, whats wrong with LLM</a></li>
  <li><a href="https://arxiv.org/pdf/2307.15043.pdf">Adversarial attacks on LLM</a></li>
</ul>]]></content><author><name>Pierre de Malliard</name></author><summary type="html"><![CDATA[Understand the difference between Language Tasks and Knowledge Tasks. Language tasks involve understanding, generating language such as writing essays, formatting an output to JSON etc. Knowledge Tasks require accessing and providing factual information or real-world knowledge. LLMs are good at Language Tasks (reformatting outputs, reformulating etc) but struggle with Knowledge tasks:]]></summary></entry></feed>