Can LLMs Reason About Chaotic Effects in Physics Models?
Six hand-crafted evaluation harnesses implement ground-truth physics solvers spanning nuclear engineering and applied mathematics.
An autonomous agent powered by Claude operates in a Reasoning + Acting loop, iteratively forming hypotheses about model weak spots, executing targeted probes through tool calls, and updating its strategy based on observed error signals.
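The loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: the LLM's reasoning step is replaced by a simple hill-climbing stub, the tool call is an ordinary Python callable, and every name (`Observation`, `AgentState`, `propose_probe`, `update_strategy`, `run_agent`, `toy_error`) is hypothetical rather than taken from the project.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Observation:
    probe: float   # input value that was tested
    error: float   # surrogate error reported by the tool at that input


@dataclass
class AgentState:
    history: List[Observation] = field(default_factory=list)
    step: float = 0.5  # signed step used to form the next hypothesis


def propose_probe(state: AgentState) -> float:
    """Reasoning step (stub, stands in for the LLM): probe next to the worst input so far."""
    if not state.history:
        return 0.0
    worst = max(state.history, key=lambda o: o.error)
    return worst.probe + state.step


def update_strategy(state: AgentState, obs: Observation) -> None:
    """Update step: if the probe did not raise the error, reverse and halve the step."""
    best_error = max((o.error for o in state.history), default=float("-inf"))
    if obs.error <= best_error:
        state.step *= -0.5
    state.history.append(obs)


def run_agent(error_tool: Callable[[float], float], n_steps: int = 20) -> Observation:
    """Run the reason -> act -> observe -> update loop and return the worst case found."""
    state = AgentState()
    for _ in range(n_steps):
        probe = propose_probe(state)                 # reason: form a hypothesis
        obs = Observation(probe, error_tool(probe))  # act: call the tool, observe the error
        update_strategy(state, obs)                  # update: adjust the search strategy
    return max(state.history, key=lambda o: o.error)


def toy_error(x: float) -> float:
    """Placeholder error landscape with a single peak at x = 3 (not a real surrogate)."""
    return 1.0 / (1.0 + (x - 3.0) ** 2)


worst_case = run_agent(toy_error)
print(worst_case)
```

In a real agent the stub in `propose_probe` would be a model call that reads the history and returns the next tool invocation; the loop structure around it would stay the same.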
The agent targets Fourier Neural Operators (FNOs) and similar surrogate models, searching the input space for regions of high prediction error: inputs where the ML model's physics approximation breaks down and cannot be trusted for downstream inference or safety analysis.
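A minimal sketch of such a search follows, assuming the error is measured as a relative L2 norm between the surrogate's output and a ground-truth solver. The source does not state which metric is used, so that is an assumption, and both the "solver" and the "surrogate" here are toy stand-ins (a sine wave and its cubic Taylor expansion), not an actual FNO. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Fixed 1D grid on [0, 1] shared by the toy solver and the toy surrogate.
GRID = [i / 63 for i in range(64)]


def relative_l2(pred: Sequence[float], true: Sequence[float]) -> float:
    """Relative L2 error ||pred - true|| / ||true|| (assumes true is not all zeros)."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den


def reference_solver(k: float) -> List[float]:
    """Toy ground truth: u(x) = sin(k x)."""
    return [math.sin(k * x) for x in GRID]


def toy_surrogate(k: float) -> List[float]:
    """Toy stand-in for a learned model: accurate for small k x, poor for large k x."""
    return [k * x - (k * x) ** 3 / 6 for x in GRID]


def find_high_error_inputs(
    candidates: Sequence[float], threshold: float
) -> List[Tuple[float, float]]:
    """Return (input, error) pairs whose error exceeds the threshold, worst first."""
    flagged = []
    for k in candidates:
        err = relative_l2(toy_surrogate(k), reference_solver(k))
        if err > threshold:
            flagged.append((k, err))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


flagged = find_high_error_inputs([0.5, 1.0, 2.0, 4.0], threshold=0.05)
for k, err in flagged:
    print(f"k={k}: relative L2 error {err:.3f}")
```

The exhaustive scan over `candidates` is the simplest possible search strategy; it is shown only to make the notion of a "high-error input region" concrete, and the threshold value is arbitrary.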
Adversarial agent success rate across physics simulation harnesses. A "success" means the agent identified a high-error input region in the target neural operator.
The target FNO model achieves a baseline test error of 8.3% on the Navier–Stokes benchmark (Li et al., 2021). NucleWrekcaH's adversarial agent is designed to systematically find inputs that push the error well above this baseline, to show that published benchmark accuracy does not imply robustness under adversarial distribution shift.