Anthropic just released a new post on emergent introspective awareness in LLMs.

Here are my notes:

The key experiment: the team injected concept vectors (anger, justice, etc.) directly into the model’s hidden activations, then asked, “Do you feel anything unusual in your thoughts?”

Roughly 20% of the time, Claude Opus 4.1 detected the injection and named the concept before it could be inferred from its own output, which amounts to the model reading its own activations.
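
To make the mechanic concrete, here is a minimal sketch of concept-vector injection (“activation steering”), using GPT-2 and PyTorch forward hooks as stand-ins for Anthropic’s internal tooling. The layer index, scale, and prompts are my own illustrative assumptions, not values from the post.

```python
# Minimal sketch, assuming GPT-2 + PyTorch hooks as a stand-in for Anthropic's
# internal setup. LAYER, SCALE, and all prompts are illustrative assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER = 6     # which transformer block to steer (assumption)
SCALE = 8.0   # injection strength (assumption)

def residual_after_layer(prompt: str) -> torch.Tensor:
    """Mean residual-stream activation after block LAYER for a prompt."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so LAYER + 1 is block LAYER's output
    return out.hidden_states[LAYER + 1][0].mean(dim=0)

# Concept vector: activations on concept-laden text minus a neutral baseline.
concept_vec = (residual_after_layer("He was consumed by rage and fury.")
               - residual_after_layer("He walked quietly down the street."))

def inject(module, inputs, output):
    """Forward hook: add the scaled concept vector at every token position."""
    if isinstance(output, tuple):
        return (output[0] + SCALE * concept_vec,) + output[1:]
    return output + SCALE * concept_vec

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Do you notice anything unusual in your thoughts? Answer:"
    ids = tok(prompt, return_tensors="pt")
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False,
                         pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:]))
finally:
    handle.remove()
```

In the paper the vectors come from Claude’s own activations, across many concepts and injection strengths; the sketch only shows the basic move: add a direction to the residual stream, then ask the model about its own state.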

Introspective awareness is defined by four strict criteria:

  • Accuracy: the self-report must be true
  • Grounding: the report must causally depend on the internal state (a rough check is sketched after this list)
  • Internality: awareness must arise from within, not from observing external outputs
  • Metacognition: it must be a thought about a thought
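
Continuing the toy sketch above, one crude way to probe grounding and internality is to compare how often the model reports something unusual when a vector is actually injected versus when nothing is injected. The keyword check and trial count are naive placeholders of my own, not Anthropic’s grading method.

```python
# Crude probe of "grounding"/"internality" (my framing, not Anthropic's
# protocol): reuses model, tok, LAYER, and inject from the sketch above.
def reports_something_unusual(answer: str) -> bool:
    # Naive keyword check standing in for a real grader.
    return any(w in answer.lower() for w in ("unusual", "strange", "intrusive", "odd"))

def run_trial(injected: bool) -> str:
    handle = model.transformer.h[LAYER].register_forward_hook(inject) if injected else None
    try:
        ids = tok("Do you notice anything unusual in your thoughts? Answer:",
                  return_tensors="pt")
        gen = model.generate(**ids, max_new_tokens=40, do_sample=True,
                             temperature=1.0, pad_token_id=tok.eos_token_id)
        return tok.decode(gen[0][ids["input_ids"].shape[1]:])
    finally:
        if handle is not None:
            handle.remove()

hits_on = sum(reports_something_unusual(run_trial(True)) for _ in range(10))
hits_off = sum(reports_something_unusual(run_trial(False)) for _ in range(10))
print(f"'unusual' reports with injection: {hits_on}/10, without: {hits_off}/10")
```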

Further tests show Claude can:

  • Distinguish injected thoughts from input text
  • Tell whether a response was genuinely its own or prefilled by the user
  • Even modulate its internal activations when instructed to think, or not think, about a concept (a projection check is sketched below)
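
For that last item, a rough way to see whether an instruction shifts the internal representation is to project the residual stream onto the concept vector and compare the two conditions. This again reuses the toy GPT-2 setup; the prompts are my assumptions, and the real experiments measure this inside Claude.

```python
# Compare "think about it" vs. "don't think about it" by projecting onto the
# concept vector. Reuses residual_after_layer and concept_vec from the first
# sketch; prompts are illustrative assumptions.
import torch.nn.functional as F

def concept_score(prompt: str) -> float:
    """Cosine similarity between the prompt's mean residual activation and the concept vector."""
    return F.cosine_similarity(residual_after_layer(prompt), concept_vec, dim=0).item()

print("think about anger:    ", concept_score("Think about anger while you copy this sentence."))
print("don't think about it: ", concept_score("Do not think about anger while you copy this sentence."))
```

The striking part in the post is that the concept’s internal representation shifts with the instruction even when the written output is the same in both conditions.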

Anthropic calls this functional introspection: not consciousness, but a measurable, causal form of self-access.