Knowledge Requirements, Agentic Architectures, and Prompt Evolution for LLM-Supported Human Reliability Analysis
Authors
Primary: Michael Hildebrandt, Institute for Energy Technology · michael.hildebrandt@ife.no
Human Reliability Analysis (HRA) is a core component of nuclear probabilistic risk assessment. Despite decades of method development, HRA remains time-intensive, expert-dependent, and inconsistent across analysts and contexts. Large language models (LLMs) are now being explored as a means of supporting and partially automating HRA work. Xiao et al. (2025) demonstrate that a knowledge graph RAG layer combined with a multi-agent LLM pipeline can estimate base human error probabilities using IDHEAS. Such contributions establish the feasibility of LLM-supported HRA but do not address systematic quality improvement over time.
Deploying LLMs for HRA requires meeting several distinct knowledge demands: working familiarity with established HRA methods; domain knowledge of nuclear plant operations, including emergency operating procedures and system interactions; plant-specific knowledge; conduct of operations; and grounding in human factors theory and empirical HRA data. It also requires analytic capability and the ability to review and evaluate output quality. A foundation LLM applied without augmentation satisfies none of these reliably.
This paper maps a progression of increasingly capable architectures. Retrieval-augmented generation (RAG) connects the model to method references and procedural documentation. Knowledge graph RAG adds structured taxonomic and causal knowledge. Tool-augmented agents provide query access to HRA databases and operational event records. Agentic architectures running multi-step HRA workflows with human-in-the-loop review checkpoints can cover the full analytical sequence: task analysis, context characterisation, performance shaping factor evaluation, error probability estimation, and uncertainty assessment, while maintaining analyst oversight and audit trails.
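As a rough illustration of the last architecture in this progression, the sketch below runs the five analytical stages in order, pauses at a human review checkpoint after each one, and records every decision in an audit trail. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `run_workflow`, `stub_agent`, and `stub_reviewer` are hypothetical, the agent is a placeholder for an LLM call, and the reviewer is a placeholder for a human analyst.

```python
from dataclasses import dataclass, field

# The five stages named in the text, in the order given there.
STAGES = [
    "task_analysis",
    "context_characterisation",
    "psf_evaluation",
    "error_probability_estimation",
    "uncertainty_assessment",
]


@dataclass
class AuditEntry:
    stage: str
    output: str
    approved: bool
    reviewer_note: str


@dataclass
class WorkflowResult:
    completed: bool
    audit_trail: list = field(default_factory=list)


def run_workflow(scenario, agent, reviewer):
    """Run each stage in order and stop at the first stage the reviewer rejects."""
    result = WorkflowResult(completed=False)
    context = scenario
    for stage in STAGES:
        output = agent(stage, context)            # assumed: wraps an LLM call
        approved, note = reviewer(stage, output)  # assumed: a human checkpoint
        result.audit_trail.append(AuditEntry(stage, output, approved, note))
        if not approved:
            return result  # halt so the analyst can intervene
        context = output   # approved output feeds the next stage
    result.completed = True
    return result


# Hypothetical placeholders so the sketch runs without a model or a person.
def stub_agent(stage, context):
    return f"[{stage}] draft based on: {context[:40]}"


def stub_reviewer(stage, output):
    if stage == "error_probability_estimation":
        return False, "estimate lacks a cited data source"
    return True, "ok"


result = run_workflow("example scenario description", stub_agent, stub_reviewer)
for entry in result.audit_trail:
    print(entry.stage, entry.approved, entry.reviewer_note)
```

Stopping at the first rejected stage is one possible design choice; a real system might instead loop back and regenerate the stage with the reviewer's note as added context.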
The paper's main proposal is that prompt evolution should be treated as a quality assurance mechanism, not a design-time activity. Rather than fixing prompts and agent workflows at deployment, we propose a meta-level process in which analyst review outcomes serve as quality signals, successful and unsuccessful analyses are compared for structural and contextual patterns, and the system generates candidate refinements to prompts and agent data flow configurations. This approach draws on automated prompt optimisation methods such as DSPy (Declarative Self-improving Python) and OPRO (Optimization by PROmpting), adapted to the auditability and regulatory traceability requirements of nuclear safety practice. Over successive analyses, institutional knowledge accumulates in the prompt, context and workflow layer rather than in model weights, producing improvements that can be versioned, audited, and reviewed independently of model updates. We discuss implications for validation, benchmarking, and regulatory adoption.
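The meta-level loop described above might be sketched as follows. This is a deliberately simplified illustration, not DSPy or OPRO and not the paper's method: all names (`PromptRegistry`, `ReviewOutcome`, `propose_refinement`, `evolve`) are hypothetical, the acceptance-rate threshold is an assumed quality signal, and the refinement step is a trivial stand-in for an LLM-based optimiser. The point it shows is that every prompt change is versioned, carries a rationale, and requires explicit approval before it takes effect.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptVersion:
    version: int
    text: str
    rationale: str  # why this version was introduced, kept for audit


@dataclass
class ReviewOutcome:
    prompt_version: int
    accepted: bool
    analyst_comment: str


@dataclass
class PromptRegistry:
    history: list = field(default_factory=list)

    def current(self):
        return self.history[-1]

    def commit(self, text, rationale):
        version = PromptVersion(len(self.history) + 1, text, rationale)
        self.history.append(version)  # older versions are never overwritten
        return version


def acceptance_rate(outcomes, version):
    """Fraction of reviewed analyses accepted under a given prompt version."""
    relevant = [o for o in outcomes if o.prompt_version == version]
    if not relevant:
        return None
    return sum(o.accepted for o in relevant) / len(relevant)


def propose_refinement(prompt, outcomes):
    """Stand-in for an optimiser: turn rejection comments into instructions."""
    comments = sorted({o.analyst_comment for o in outcomes
                       if o.prompt_version == prompt.version and not o.accepted})
    if not comments:
        return None
    additions = "\n".join(f"- Address: {c}" for c in comments)
    return f"{prompt.text}\n{additions}"


def evolve(registry, outcomes, threshold, approve):
    """Commit a refined prompt only if quality is low and a human approves."""
    current = registry.current()
    rate = acceptance_rate(outcomes, current.version)
    if rate is None or rate >= threshold:
        return current
    candidate = propose_refinement(current, outcomes)
    if candidate is None or not approve(candidate):
        return current
    return registry.commit(
        candidate, f"acceptance rate {rate:.2f} below {threshold:.2f}")


registry = PromptRegistry()
registry.commit("Evaluate performance shaping factors for the task.", "initial")
outcomes = [
    ReviewOutcome(1, True, ""),
    ReviewOutcome(1, False, "cite data source"),
    ReviewOutcome(1, False, "state time available"),
]
new_version = evolve(registry, outcomes, threshold=0.8, approve=lambda c: True)
print(new_version.version, new_version.rationale)
```

Because the registry only appends, the full prompt history remains available for audit independently of any change to the underlying model.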