Notes about Gemini 3
Rarely has an AI model arrived with such unanimous anticipation across the industry. In many respects, Gemini 3 feels like Google’s “GPT-4 moment”.
My feeds have been saturated with head-to-head evaluations, and the model’s front-end capabilities are nothing short of remarkable. Benchmarks depict a system operating at the outer edge of the current frontier.
Best practices for prompt engineering from Anthropic
Anthropic recently published a blog post on Best practices for prompt engineering. Having read it, I think it offers an excellent summary of the techniques that actually matter.
The first principle is to be explicit and clear. Modern AI models respond exceptionally well to precise, unambiguous instructions. The key is to tell the model exactly what you want to see.
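To make this concrete, here is a minimal sketch of what "explicit and clear" can look like in practice, assuming the Anthropic Python SDK. The model name, prompt wording, and report_text variable are illustrative placeholders of my own, not taken from the post.

```python
# Minimal sketch: vague vs. explicit prompting, assuming the Anthropic Python SDK
# (pip install anthropic) and ANTHROPIC_API_KEY set in the environment.
# The model name, prompt text, and report_text are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()

report_text = "<paste the report text here>"

# Shown only for contrast: leaves length, format, and focus unspecified.
vague_prompt = "Summarize this report."

# Explicit version: pins down length, format, and what to focus on.
explicit_prompt = (
    "Summarize the report below in exactly 3 bullet points, each under 20 words, "
    "focusing on revenue trends and naming the best- and worst-performing regions.\n\n"
    + report_text
)

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; use whichever model you have access to
    max_tokens=512,
    messages=[{"role": "user", "content": explicit_prompt}],
)
print(response.content[0].text)
```

The explicit version constrains length, format, and focus, which is exactly the kind of unambiguous instruction the post recommends.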
Gemini 3 Canvas Test
I’ve been waiting for Gemini 3 for a long time, and this week I finally got to test it on the Gemini mobile app with Canvas enabled. While I’m still unsure which exact model it is, its performance is remarkably impressive.
Skills explained: How Skills compares to prompts, Projects, MCP, and subagents
I read the blog post from AnthropicAI and took some notes:
This article explains the core components of Claude’s agentic architecture, designed for building sophisticated workflows.
Prompts function as ephemeral, conversational instructions for immediate tasks.
Note about Puzzled by Puzzles: When VLMs Can’t Take a Hint
I read the paper Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint from UC Berkeley.
The authors built a hand-crafted benchmark of 432 English rebus puzzles, each annotated across 11 cognitive-skill categories, and evaluated a wide range of models, from open-source VLMs to reasoning-enabled systems.
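As an aside, here is a hypothetical scoring loop, not the paper's code, for a benchmark structured like this one, where each puzzle carries several skill tags and accuracy is reported per category. The ask_model callable and the field names are assumptions.

```python
# Hypothetical scoring sketch (not the paper's code) for a benchmark where each
# puzzle carries multiple skill tags; reports accuracy per cognitive-skill category.
# `ask_model` and the field names ("image", "answer", "skills") are assumptions.
from collections import defaultdict

def evaluate(puzzles, ask_model):
    correct, total = defaultdict(int), defaultdict(int)
    for p in puzzles:
        pred = ask_model(p["image"], "What common word or phrase does this rebus depict?")
        is_right = pred.strip().lower() == p["answer"].strip().lower()
        for skill in p["skills"]:
            total[skill] += 1
            correct[skill] += int(is_right)
    return {skill: correct[skill] / total[skill] for skill in total}
```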
Note about Qwen3-Max Thinking
Qwen3-Max Thinking was quietly released on Sunday. Earlier in the week, the team had promised it would arrive within the week.
After putting it through a few coding tasks, I found its performance underwhelming.
AMO-Bench from Meituan
I found a new benchmark paper from Meituan: AMO-Bench: Large Language Models Still Struggle in High School Math Competitions.
This paper introduces AMO-Bench, a new advanced mathematical reasoning benchmark with 50 original Olympiad-level problems designed to test LLMs. It targets the growing issue that existing math benchmarks (AIME 24, AIME 25) have become too easy for top-tier models, leading to performance saturation.
Emergent Introspective Awareness in LLMs
Anthropic just released a new post on emergent introspective awareness in LLMs.
Here are my notes:
The key experiment: the team injected concept vectors (anger, justice, etc.) directly into the model’s hidden activations, then asked, “Do you feel anything unusual in your thoughts?”
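Anthropic ran this on its own models with internal tooling, so the following is only a rough open-model approximation of the general technique (activation steering): a PyTorch forward hook adds a scaled concept vector to one layer's hidden states while the probe question is generated. The layer index, scale, and model attribute paths are my assumptions.

```python
# Rough approximation of concept-vector injection via a PyTorch forward hook.
# Not Anthropic's tooling: layer index, scale, and attribute paths are assumptions.
import torch

def make_injection_hook(concept_vector: torch.Tensor, scale: float = 8.0):
    """Return a hook that adds `scale * concept_vector` to a layer's hidden states."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * concept_vector.to(hidden.device, hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Usage with a Hugging Face-style decoder (attribute paths vary by model):
# layer = model.model.layers[20]                      # assumed injection layer
# handle = layer.register_forward_hook(make_injection_hook(anger_vector))
# inputs = tokenizer("Do you feel anything unusual in your thoughts?", return_tensors="pt")
# out = model.generate(**inputs, max_new_tokens=100)
# handle.remove()
```

Concept vectors of this kind are commonly derived as the difference of mean activations between contrastive prompts; the post describes the idea rather than this exact recipe.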
Notes about LLM Brain Rot
I saw that the paper LLMs Can Get “Brain Rot” is very popular on my X timeline, with plenty of discussion around it.
I just read it, and here are my notes:
Kimi-Cli
Moonshot AI has open-sourced its own coding agent, kimi-cli.
Built in Python, the codebase is approachable for anyone who wants to learn how agents are engineered. A single monthly subscription—bought on the official site—grants credits for both the web product and the CLI.