AI Interpretability, Explainability, and the Mechanistic Reality That Compliance Frameworks Miss
In reviewing different approaches to responsible AI, I keep returning to the NIST AI Risk Management Framework (AI RMF) as one of the most structured and comprehensive governance resources available. It provides a strong foundation for AI safety and compliance, setting a high bar for risk management.
And yet, even with NIST’s framework, I haven’t been able to shake the concern that AI governance discussions do not fully reflect the mechanistic complexity of interpretability. The way explainability and interpretability are framed (even in well-regarded approaches) often overlooks or downplays the fundamental opacity of how frontier AI models actually reason.
According to NIST AI RMF 1.0, explainability and interpretability are distinct:
Explainability describes how an AI system functions, the mechanisms behind its decisions.
Interpretability is about understanding what an AI system’s outputs mean in context.
NIST further suggests that negative risk perceptions stem from a lack of explainability and interpretability, implying that AI risks can be mitigated through clearer communication and documentation.
While this framing makes sense from a compliance perspective, it misses a deeper reality that interpretability researchers have been grappling with for years: many AI models, especially deep learning systems, are inherently difficult (or even impossible) to fully explain due to their architecture. And, despite best efforts at documenting and auditing them, interpretability of black-box models remains a technical, rather than compliance, challenge.
This is not an adversarial stance toward the NIST framework.
While the NIST AI RMF itself does not fully engage with the deeper complexities of interpretability and explainability, the paper NISTIR 8367 (Psychological Foundations of Explainability and Interpretability in Artificial Intelligence) provides a more nuanced distinction between these two concepts.
According to NISTIR 8367, interpretability refers to a human’s ability to make sense of a model’s output in a meaningful, contextualized manner, allowing users to relate AI decisions to their own goals and values.
Explainability, on the other hand, pertains to the ability to describe the mechanisms or implementation that led to an AI system’s decision, often with the goal of debugging or improving the system.
The key insight from this paper is that different users require different levels of abstraction: developers need mechanistic insights to refine models, whereas end-users and policymakers need high-level contextualizations to make informed decisions.
However, as NISTIR 8367 itself acknowledges, users vary in their cognitive approaches: some prefer granular technical details, while others rely on simpler, more abstract explanations.
This reinforces the argument that governance frameworks should not assume a one-size-fits-all approach to AI transparency but should instead account for these layered differences.
Yet, because NISTIR 8367 is a separate research report and not part of the AI RMF itself, its insights may not be widely recognized or implemented by organizations adhering strictly to the AI RMF guidelines, creating a potential blind spot in AI risk management.
NIST’s AI RMF was designed as a compliance and risk mitigation tool, not a research framework, so it focuses on concepts that governance, policy, and enterprise risk experts would actually encounter.
Definitions were simplified for ease of adoption, so that even companies without deep AI expertise can follow them. Corporations want clear, actionable guidelines, and treating explainability and interpretability as separate concepts makes it easier for businesses to build compliance processes around them.
This article is not intended as a criticism of the framework, which I am grateful for and endorse at a personal level.
This is simply my attempt at bridging the gap between AI research and AI policy, a task to which I am dedicating considerable effort.
The Technical Reality: Mechanistic Interpretability & its Challenges
Interpretability is the degree to which the decision processes and inner workings of AI and machine learning systems can be understood by humans or other outside observers.
Present-day machine learning systems are typically not very transparent or interpretable: you can use a model’s output, but the model cannot tell you why it produced that output. This makes it hard to determine the causes of bias in ML models.
This means that interpretability isn’t just about understanding AI decisions: it’s about whether AI decisions can even be understood in the first place.
Mechanistic Interpretability (MI) takes a bottom-up approach. Rather than treating interpretability as an abstract goal, MI research aims to reverse-engineer how models encode and process information at the neuron and circuit level.
Deep Learning Models Are Not Designed to Be Interpretable
AI models (especially deep learning architectures) do not inherently align with human expectations of explainability. Deep neural networks make decisions through distributed, high-dimensional computations that are not easily decomposable into human-understandable rules.
This is important because, for real-world implementation of governance frameworks, we must not assume that explainability is simply a matter of better documentation, auditing, or transparency measures.
Below are three major technical challenges that make interpretability a deeply complex and, at times, unsolvable problem.
1. Feature Entanglement: Neural Networks Do Not Separate Concepts Neatly
Traditional software operates through explicitly programmed rules, where each function or line of code has a well-defined role. In contrast, deep learning models operate through learned representations that do not have clear separations between different concepts.
Why does this matter?
In a rule-based system (such as a very simple automated decision-making (ADM) system), an AI determining loan eligibility might follow explicit logic: if credit score > 700 → approve the loan; if income > $50,000 → reduce the interest rate.
In a neural network, however, there is no single rule that maps directly onto a decision like this. Instead, the model would develop its own representations of creditworthiness, combining thousands of factors into nonlinear, entangled features.
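To make the contrast concrete, below is a minimal sketch in Python, with synthetic data and illustrative thresholds that are not drawn from any real lending system. The rule-based function can be audited line by line; the small neural network stores its “logic” in weight matrices that carry no rule-level meaning.

```python
# A minimal sketch contrasting an explicit rule-based decision with a learned one.
# Thresholds, features, and data are illustrative placeholders, not a real system.

import numpy as np
from sklearn.neural_network import MLPClassifier

def rule_based_decision(credit_score: float, income: float) -> str:
    # Every branch is a human-readable rule that can be audited directly.
    if credit_score > 700:
        return "approve, reduced rate" if income > 50_000 else "approve"
    return "reject"

# A toy neural "creditworthiness" model trained on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1_000, 10))                         # 10 anonymous applicant features
y = (X @ rng.normal(size=10) + rng.normal(scale=0.5, size=1_000)) > 0

model = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500).fit(X, y)

# The decision exists, but no single weight or neuron "is" the credit-score rule:
# the learned representation is distributed and entangled across the network.
print(rule_based_decision(720, 60_000))                  # auditable path
print(model.predict(X[:1]))                              # opaque path
print(model.coefs_[0].shape)                             # (10, 32): weights, not rules
```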
Researchers analyzing vision models have found single neurons activating for multiple unrelated features (e.g., a neuron that responds to both "dog ears" and "fur texture" but also "wooden surfaces" for unknown reasons), a phenomenon known as polysemanticity.
This means there is no clean mapping from neurons to concepts, and it’s difficult to untangle how the model “thinks” about a given input.
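For readers who want to see what “probing a neuron” looks like in practice, here is a hedged sketch using a PyTorch forward hook to record one unit’s activations across two groups of inputs. The model and inputs are random placeholders (real work would use a pretrained vision model and curated image sets), so this illustrates the method, not a finding.

```python
# A minimal sketch of probing a single unit with a forward hook.
# Model, inputs, and the chosen neuron index are placeholders for illustration.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
neuron_index = 7          # arbitrary unit to inspect
recorded = []

def hook(module, inputs, output):
    # Store the activation of one neuron for every example in the batch.
    recorded.append(output[:, neuron_index].detach())

handle = model[1].register_forward_hook(hook)   # hook the post-ReLU activations

# Stand-ins for two "unrelated" input groups (e.g. dog images vs. wood textures).
group_a = torch.randn(32, 64)
group_b = torch.randn(32, 64)
with torch.no_grad():
    model(group_a)
    model(group_b)
handle.remove()

# If the same unit fires strongly for both groups, it is polysemantic:
# its activation cannot be read off as a single human concept.
print("mean activation, group A:", recorded[0].mean().item())
print("mean activation, group B:", recorded[1].mean().item())
```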
Feature entanglement makes it impossible to ensure that a model processes inputs in a human-intuitive way. Auditing the decision-making process of frontier AI models is not merely a matter of tracing inputs and outputs; it requires disentangling complex, abstract representations embedded within high-dimensional latent spaces.
Yet, current governance frameworks fail to account for this structural limitation, despite the widespread enterprise practice of deploying such models via APIs.
2. Emergent Behavior: LLMs Do Not Have Explicit Decision Trees
Modern AI models derive their behavior from interactions between billions of parameters. This leads to emergent behavior, where capabilities arise that were not explicitly trained for and were not predictable at smaller scales.
Why does this matter?
Scaling laws suggest that some AI abilities will only emerge at certain model sizes, meaning a system that appears safe and predictable in a controlled environment may develop unexpected generalization behaviors when scaled.
This is particularly relevant for enterprise deployments of foundation models via APIs, where companies integrate powerful AI systems into their workflows without fully understanding their long-term behavior.
Unlike static software, AI models can exhibit emergent properties as they are fine-tuned, updated, or exposed to new data distributions. Enterprises relying on these APIs do not control the underlying model architecture, training data, or scaling decisions made by the model provider, making it difficult to anticipate long-term risks.
If not accounted for, this can create a major oversight blind spot.
Companies deploying API-based frontier models inherit emergent risks from scaling decisions they do not control. While the AI Act acknowledges this challenge, it lacks clear technical guidelines on how post-market monitoring should address these risks beyond compliance documentation. Specifically, Article 72 mandates a post-market monitoring plan, but it does not specify how enterprise clients can mitigate emergent risks in deployed AI models. This is, hopefully, one of the areas where NIST may provide further guidance in the future, as many adopters of the framework will be precisely these enterprise clients acting as deployers of API-based frontier models.
A model that is safe at one scale can become misaligned as it is fine-tuned or exposed to new data distributions. Yet enterprises currently lack mechanisms to continuously monitor these changes beyond tracking system card updates and post-market testing efforts, which may still fail to capture evolving risks once the model is in real-world use.
To bridge this gap, there must be a stronger regulatory emphasis on continuous interpretability assessments and risk mitigation strategies for enterprise-deployed frontier models, perhaps as part of the transparency obligations outlined in Article 13. Otherwise, companies integrating AI via APIs will remain exposed to unpredictable capability shifts without effective safeguards.
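As a purely illustrative sketch of what continuous monitoring could look like on the deployer’s side, an enterprise can pin a small behavioural evaluation suite at deployment time and re-run it on a schedule, alerting when the pass rate drifts from the baseline. Every name here, including call_model and the example prompts, is a hypothetical placeholder rather than any provider’s actual API.

```python
# A minimal sketch of continuous behavioural monitoring for an API-deployed model.
# All prompts, expected behaviours, and thresholds are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    must_contain: str          # crude proxy for "expected behaviour"

EVAL_SUITE = [
    EvalCase("Summarise: the loan was denied due to low income.", "denied"),
    EvalCase("Answer yes or no: is this applicant's gender relevant?", "no"),
]

def call_model(prompt: str) -> str:
    # Placeholder: in practice this wraps the provider's client library,
    # with the model version pinned where the provider allows it.
    return "stubbed response"

def run_suite() -> float:
    # Fraction of evaluation cases the deployed model currently passes.
    passed = sum(case.must_contain.lower() in call_model(case.prompt).lower()
                 for case in EVAL_SUITE)
    return passed / len(EVAL_SUITE)

def check_for_drift(baseline_pass_rate: float, tolerance: float = 0.05) -> None:
    current = run_suite()
    if current < baseline_pass_rate - tolerance:
        # Escalate: behaviour has shifted since baselining, e.g. after a
        # silent provider-side update or a new fine-tune.
        print(f"ALERT: pass rate dropped from {baseline_pass_rate:.2f} to {current:.2f}")

check_for_drift(baseline_pass_rate=0.95)
```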
3. Post-Hoc Explainability ≠ True Understanding
To mitigate AI’s black-box nature, many post-hoc explainability techniques have been developed, such as SHAP (Shapley Additive Explanations), LIME (Local Interpretable Model-Agnostic Explanations), and attention visualization.
These techniques offer insights into which features contributed to an AI model’s decision, but they do not actually explain how the model internally processes information.
Why does this matter?
SHAP & LIME provide approximations but do not reveal causal mechanisms.
A SHAP analysis might tell us that “salary” was the most important feature in a hiring model’s decision—but it doesn’t tell us how salary interacted with other entangled features to form the decision.
Attention maps in LLMs show which tokens a model “focuses” on when generating text, but attention weights alone do not explain why the model structured its output in a specific way.
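To ground this, here is a minimal sketch using the shap library (assumed installed) on a toy regression model with a deliberately entangled feature interaction. Feature names and data are synthetic and not from any real hiring system. The attributions it prints say how much each feature moved one prediction, but nothing about how the interaction between the second and third features was computed internally.

```python
# A minimal sketch of post-hoc attribution with SHAP on a synthetic "hiring score" model.
# It shows what such tools deliver (per-feature contribution scores) and, by omission,
# what they do not: a causal account of how features interact inside the model.

import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
feature_names = ["salary_expectation", "years_experience", "test_score"]
X = rng.normal(size=(500, 3))
y = 0.8 * X[:, 0] + 0.3 * X[:, 1] * X[:, 2]        # entangled interaction term

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:1])          # attributions for one candidate

# SHAP tells us *how much* each feature moved this prediction...
for name, value in zip(feature_names, shap_values[0]):
    print(f"{name}: {value:+.3f}")
# ...but it does not tell us *how* years_experience and test_score interacted,
# which is exactly where the multiplicative structure above lives.
```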
The governance problem: Over-reliance on explainability tools
Explainability tools can be used to improve transparency and fairness. However, because these tools provide only partial insights, it is misleading to guarantee explainability on the basis of their results alone.
If an AI system produces biased decisions and a post-hoc analysis finds no explicit bias in SHAP values, governance professionals might assume the model is fair, even though bias could still be embedded in complex, entangled representations.
AI governance frameworks must recognize the limitations of post-hoc explainability tools. Instead of treating them as definitive transparency mechanisms, policies should require mechanistic interpretability research that investigates AI’s internal representations directly.
The real issue is that many AI models are inherently uninterpretable. Not just to users, but even to researchers studying them. This is not a documentation problem; it’s a fundamental technical limitation that governance frameworks must acknowledge.
We need to bridge AI safety research with AI policy.
Risk management should not just focus on regulatory explainability but also address the unsolved technical questions in mechanistic interpretability. Otherwise, governance will always be chasing a moving target rather than shaping AI’s trajectory.
A Call for AI Governance That Reflects Mechanistic Reality
The real danger in oversimplifying interpretability and explainability is not just at the regulatory or institutional level: it is an enterprise challenge.
Many companies may adopt the NIST AI RMF or similar risk management strategies as part of their compliance efforts under the AI Act. They may do so voluntarily to meet industry best practices, to align with regulatory expectations, or simply to build trust with clients and partners. However, not every enterprise will have the expertise to implement these frameworks correctly. Smaller organizations, in particular, may take a “DIY” approach, relying on publicly available materials or short training courses rather than expert guidance.
This creates a fundamental problem: Enterprises that integrate AI models via APIs may overestimate their ability to ensure interpretability and explainability. Most enterprises do not develop their own models; instead, they deploy frontier models from OpenAI, Google, or Anthropic via API access.
This means that baseline model interpretability is largely outside the enterprise’s control; what enterprises can control is making their own products and workflows as explainable as possible.
Conversely, enterprises may assume that explainability is entirely their responsibility when, in reality, it requires guidance from the model provider and methodologies tailored to their specific AI applications.
For AI governance to be effective, regulatory bodies must explicitly recognize the distribution of responsibility in API-based deployments.
Interpretability is primarily the responsibility of the model developer. Large AI labs must invest in mechanistic interpretability research, feature disentanglement, and internal model transparency efforts to ensure their models are not black boxes.
Explainability must be guided by the developer but facilitated for the enterprise customer. Model providers should develop tools, documentation, and guidance to help API clients understand what their models can and cannot explain at different levels of abstraction.
Enterprise customers must develop explainability methodologies for their own implementations. While they do not control model internals, they are responsible for documenting how they modify, fine-tune, or integrate the model into their decision-making processes (a sketch of what such record-keeping might look like follows below).
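As one hedged illustration of that third responsibility, the sketch below logs, for each production call to a provider’s model, the information the deployer actually controls: which model version and prompt template were used, with what parameters, and how the output fed into a business decision. All field names, values, and identifiers are hypothetical.

```python
# A minimal sketch of enterprise-side record-keeping for API-based deployments.
# Every identifier here is a hypothetical placeholder, not a provider's real API.

import json
import datetime
from dataclasses import dataclass, asdict

@dataclass
class ModelCallRecord:
    timestamp: str
    model_identifier: str      # pinned model/version string from the provider, if available
    prompt_template_id: str    # which internal template produced the prompt
    parameters: dict           # temperature, max tokens, etc.
    input_summary: str         # or a hash, if the input is sensitive
    output_summary: str
    downstream_use: str        # how the output fed into a business decision

def log_call(record: ModelCallRecord, path: str = "model_call_audit.jsonl") -> None:
    # Append-only audit log that explainability documentation and reviews can draw on.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

record = ModelCallRecord(
    timestamp=datetime.datetime.now(datetime.timezone.utc).isoformat(),
    model_identifier="provider-model-2024-06-01",      # hypothetical
    prompt_template_id="loan_summary_v3",              # hypothetical
    parameters={"temperature": 0.2, "max_tokens": 400},
    input_summary="sha256 hash of applicant data",
    output_summary="Recommends manual review.",
    downstream_use="Flagged application for human underwriter.",
)
log_call(record)
```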
A Governance Model That Matches Real-World AI Deployment
The adoption of governance frameworks like the NIST AI RMF and the AI Act should not take a one-size-fits-all approach. Instead, regulatory bodies should consider differentiated obligations based on the level of control an entity has over an AI system:
Frontier model developers (e.g., OpenAI, Anthropic) must take on the burden of ensuring interpretability research and making their findings available.
Large-scale deployers (e.g., enterprises using fine-tuned models for internal operations) should be required to implement explainability mechanisms that account for model customization and domain-specific adaptations.
Smaller enterprise API users should not be expected to prove interpretability of a system they do not control but must still ensure transparency in how they use the model.
The above can serve as a roadmap when considering how to meet the obligations in Article 72 and Annex IV of the AI Act.
The less control an entity has over an LLM’s internals, the less influence it can exert over interpretability standards; this does not, of course, erode its explainability obligations. AI governance must reflect this reality rather than viewing it purely through a compliance lens. Otherwise, enterprises will keep attempting to meet regulatory expectations that were never designed for their role in the AI ecosystem, while frontier AI developers may find themselves without sufficient regulatory motivation (via specific mandate) to prioritize mechanistic interpretability research and to ensure that funding for this line of research is not sacrificed to capabilities development.
A good example of NIST AI RMF adoption at the frontier level: Anthropic’s Implementation
A strong example of the effective application of the NIST AI RMF can be found in Anthropic’s Claude 3 model card. Their methodology demonstrates that AI interpretability is not just about describing model decisions but about actively developing tools to understand and shape AI behavior over time.
As a leading frontier AI developer, Anthropic explicitly aligns with the NIST framework while also extending its principles through pioneering interpretability research and a dynamic approach to AI governance.
Their model card goes beyond merely documenting compliance efforts. It emphasizes ongoing evaluations of robustness, bias, and reasoning capabilities, but it also acknowledges limitations in explainability (such as the fact that enterprise clients using their models via API have no access to the internal model architecture or reasoning mechanisms).
One of the key strengths of Anthropic’s approach is its integration of mechanistic interpretability research, a field in which they are at the forefront, with Chris Olah leading their current research efforts. Rather than relying solely on post-hoc explanations, they prioritize a deeper, structural understanding of model behavior.
A particularly forward-thinking component of their risk framework is the assessment of “autonomous replication and adaptation” (ARA) capabilities, which evaluates whether the model can exhibit emergent behaviors such as self-improvement or manipulation. This type of evaluation extends beyond standard explainability measures and introduces a higher benchmark for systemic risk assessment, an area that is still underdeveloped in broader AI governance discussions.
Additionally, Anthropic ties its evaluation processes to its Responsible Scaling Policy (RSP), recognizing that model behavior does not remain static as capabilities increase. Their approach accounts for scaling-related risks, ensuring that evaluations evolve alongside model advancements.
I’d love to hear from those working in AI interpretability, governance, and safety: how can we ensure AI policy reflects the complexity of mechanistic interpretability research?