BubbleBrain

Emergent Introspective Awareness in LLMs

1 min · Thought / Anthropic / Paper

Anthropic just released a new post on emergent introspective awareness in LLMs.

Here are my notes:

The key experiment: the team injected concept vectors (anger, justice, etc.) directly into the model’s hidden activations, then asked, “Do you feel anything unusual in your thoughts?”

Roughly 20% of the time, Claude Opus 4.1 detected the injection and named the concept before it could have been inferred from its own output, which suggests the model was reading its own activations rather than its own text.
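To make the injection step concrete for myself, here is a toy sketch in plain Python. It is not Anthropic's code: the function names (`concept_vector`, `inject`), the 4-dimensional activations, and the `strength` value are all made up for illustration, and building the vector as a difference of mean activations is my assumption about one common way to get such a direction.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concept_vector(concept_acts, baseline_acts):
    """Hypothetical helper: mean concept activation minus mean baseline activation."""
    mu_concept = mean_vector(concept_acts)
    mu_baseline = mean_vector(baseline_acts)
    return [c - b for c, b in zip(mu_concept, mu_baseline)]

def inject(hidden, vector, strength):
    """Add a scaled concept vector to a single hidden-state vector."""
    return [h + strength * v for h, v in zip(hidden, vector)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy activations (assumed values): prompts that mention the concept vs. prompts that do not.
concept_acts = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, 2.0, 0.0]]
baseline_acts = [[1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]

v = concept_vector(concept_acts, baseline_acts)

# A hidden state that starts out orthogonal to the concept direction.
hidden = [0.0, 1.0, 0.0, 0.0]
steered = inject(hidden, v, strength=2.0)

print("concept vector:", v)
print("alignment before:", cosine(hidden, v))
print("alignment after:", cosine(steered, v))
```

After injection the hidden state points much more strongly along the concept direction even though no input text changed. In a real model the addition would happen inside a chosen layer during the forward pass, and the interesting question is whether the model can report the shift.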

Introspective awareness is defined by four strict criteria:

Further tests show Claude can:

Anthropic calls this functional introspection: not consciousness, but a measurable, causal form of self-access.