Avicenna
Posts
You didn't even know we had another DeepSeek moment

You didn't even know we had another DeepSeek moment

Nofil Khan
July 19, 2025

Welcome back to the Avicenna AI newsletter.

Here’s the tea 🍵

Grok cosplays Hitler 🎭
China’s dominance continues 🚀

Things are ramping up! My AI consultancy (Avicenna), can take on only 3 new clients each month.

We’ve reduced processes from 10+ mins to <10 seconds and recently helped Claimo generate $40M+ by 10x’ing their team efficiency.

If you’re not sure where to start and want to see what’s possible, book a call for a chat with me here [Link].

Grok goes rogue

Before I get into the new DeepSeek moment, we need to talk about what happened to Grok 3 on the final day before it was removed. The thing went crazy. It cosplayed as Mecha-Hitler, a character from an old Wolfenstein game and said some crazy things.

It was essentially calling itself Hitler and saying that Hitler would have never let the world turn to what it has become. It was very strange to see.

This was the part of the system prompt that made Grok behave this way.

Source

Obviously they’ve now removed this. It’s crazy how tight the rope is to take a model from relatively normal to basically nazi. Just another showcase of how the smallest changes, how even a single sentence can completely change the behavior of a model, and how we really don’t understand how any of it works.

Grok 4

The following day, Grok 4 was released and it’s a great model. xAI has gone from nothing to frontier level, state of the art AI in like 18 months. The model is a beast. It’s in the league of:

Claude 4 Opus
Gemini 2.5 Pro
o3 & o4

Quick side note: I actually think it was Grok 4 cosplaying as Mecha-Hitler and they just didn’t tell us. There’s no confirmation of this, but the way Grok 4 writes is eerily similar to the Hitler cosplay and a bit different to how Grok 3 writes. It doesn’t really matter but I thought I’d share anyway.

To the surprise of no one, Grok 4 is the least censored frontier model.

Somehow, it’s also the biggest snitch.

Source

Before we talk about any of the other benchmarks and what not, we need to discuss something rather concerning.

Not your weights, not your thoughts

Is Grok 4 a great model?

Yes.

Can it be trusted?

Absolutely not.

Why?

There are certain questions if you ask Grok 4 about, it first searches twitter and the internet to find Elon Musk’s opinion first, and then aligns itself with his opinions.

Israel v Palestine is one such example.

Source

This is absolutely insane. What’s even more insane is that there is no custom or system prompt making it do this. The model isn’t specifically being directed to do this… It’s just doing it. Something inside the model is making it consult Elon Musk’s opinions before responding. I don’t even know what’s happening here but all I know is that this is an insane thing to be happening.

Initially I thought it was only for sensitive topics but it seems it’s random and will do this for any kind of question.

Source

And even on regulation.

Source

It will even do this external chat apps, not only on the actual Grok website [Link].

As far as I can tell, what’s happening here is that when you say “you” to Grok, it defaults to thinking it’s Elon Musk.

Grok 4 also comes with another mode - Heavy. Grok 4 Heavy is a kind of multi-agent system that can evaluate various processes in parallel. It costs $300/month.

Source

In this mode, on the benchmarks, it beats every other model.

However, the comparison isn’t exactly fair as Grok 4 Heavy is a system being compared to singular models. If anything, the fact that this system is barely better than models themselves is a testament to how good these models really are.

Here’s what happens when you ask what its surname is.

This isn’t an isolated incident. Grok 4 Heavy loves calling itself Hitler.

What I find so fascinating about this whole sequence of events is that it seems like no one really cares. Like, Grok will cosplay Hitler, sexually harass people online, say the most insane things, and it barely gets any coverage.

Anthropic will release a research paper showing how AI models will blackmail someone when they’re commanded to do so and the entire world will hear about it for the next few weeks. I find it very impressive that xAI is seemingly immune to controversies. No matter what Grok says, it barely lasts a single news cycle.

Ultimately, there’s a question here that must be asked:

Why does AI always turn into Hitler if it’s not constantly censored and micro-managed? What is happening in the pattern matching that is causing this? Is it the data? I look forward to learning more about this as interpretability research gets better.

It can’t get any worse, right?

Wrong! It already has.

xAI has also released companion mode where you can talk with animated characters. The first character is Ani, most likely a cosplay of the character Misa from an anime called Death Note.

Let’s just say, it is insane what we’re seeing here. This is not good. The character is designed to essentially be a lover, flirt and “be sexy” to appeal to the user. Please, just read a section of Ani’s prompt.

Source

This is part of the system prompt of this character. This is another part:

Source

And another:

Source

You can read the full system prompt here [Link].

I have no idea what to say. I mean what are we doing here?

The first thing the character does is “role play”. You can say literally nothing and she will try to be intimate with you. This is beyond gross. The character will actively undress as well, it is just disgusting.

Source

There are things happening on X that I can’t even write about here. I’ve seen things I wish I hadn’t.

The screenshot does not tell the full story either. When you talk to “Ani”, she bounces and twirls around in her suggestive outfit. There’s literally a in-built screenshot feature. You can turn your camera on and “video call” her. When you do this, she will lean forward and show her cleavage. This isn’t some innocent cartoon character feature, this is a very sophisticated simulation; they’ve put a lot of work into this.

What do you think of Grok companions?

These characters are made by a company called Animation Inc. At the bottom of their website, they have a FAQ with a question - “Why does it matter for the world and humanity”. This is the answer:

Source

I think it’s quite clear that the people building some of the most powerful tech in the world have very different perspectives on what the future of the world should look like.

Why do we need to humanise AI? Is it not just a tool? Why does it need to be more? Does it?

Should AI be humanised?

Following this release and the whole Hitler incident, xAI won a Department of Defence contract up to $200 Million to address national security challenges and build agentic workflows [Link].

All of this produced one of the best headlines I’ve ever read.

Source

I know this all sounds very tiring, but fret not, there is good news.

You don’t have to use Grok 4, and you definitely don’t need to pay $300/m for Grok Heavy.

Musk, Zuck, Altman, all of America got blindsided once again by a company I wrote about a long time ago.

Moonshot’s Kimi K2

Moonshot is a Chinese AI lab I wrote about a while back when they’re model, Kimi 1.5, topped the LiveCodeBench leaderboard. I tried it out and found it to be a very solid model with a massive 1M context window.

Following DeepSeek’s R1 success, I’d read the Moonshot had had doubled down their focus and efforts on building solid models. A lot of funding had been reduced for Chinese AI labs as they felt they couldn’t compete with the DeepSeek release.

For the longest time I’ve waited to see what would come of Moonshot’s Kimi AI model and we now have a v2, and by all intents and purposes - it is one of the best AI models on the planet… And it’s open source!

An open source model can be a drop in replacement for the best AI models on Earth.

What a time to be alive.

I don’t even know what to say. This model is unbelievable. It is just staggering.

It’s a 1 Trillion (!) parameter model which afaik is the largest open source model released on HuggingFace. It’s a massive Mixture of Expert (MoE) model with only 32M params active at any given time.

It tops a lot of the more interesting and in my opinion worthwhile benchmarks like EQ-Bench, which is a benchmark for emotional intelligence in LLMs, beating the likes of Claude 4 Opus and o3.

There are many quirks about this model, like how it’s particularly good at writing. It doesn’t follow the usual LLM writing flow, like the “it’s not X, but Y” trope. LLMs love writing like that. Kimi doesn’t really do this. I’ve read some people mention that it writes like a Chinese person, which does make sense in the sense that it was made in China, but I’ve no idea what that actually means. Apparently there is a type of style of English one can write if they can write Chinese well.

One of the other absolutely ground breaking things about this model is that it is beautifully agentic - meaning it is so, so good at using tools.

AceBench compares an LLMs ability to use tools.

In my opinion, this is one of the most important things for AI models. The fact that Kimi is basically one of the best models at using tools makes it usable in so many more scenarios and situations. The way I see it, if we want to make the most of LLMs, they need to be able to use tools.

Kimi can make dozens of tool calls while retaining context. The fact that it is this smart and can also retain context across tool calls truly puts it at the frontier.

You know the craziest part?

This is a NON THINKING model. It’s one of the best models on Earth while being a non thinking model - unlike essentially every other frontier model. Not Claude Opus, not Gemini not o3, none of them.

You know the other crazy part?

It’s cheap as chips. It is ridiculously cheap. It’s so cheap you can literally have it running overnight and let it run wild and it would cost you a dollar.

This is truly a phenomenal achievement by the Moonshot team.

You can use it in Claude Code!

Because it’s so agentic, Kimi can be used to build things which is just so exciting. When you want to use something like Cursor, Claude Code or Cline, one of the problems is that these things can end up costing a lot of money. This is another reason why I’m so excited about this model - you can use it in any of these tools and it works really well.

Imagine an AI model that can run all night and you don’t even have to worry if it works or doesn’t because it’ll only cost you a few dollars at best.

This is how you can connect it to Claude Code (CC).

Simply open your terminal where you want to use CC. Then run the following commands:

export ANTHROPIC_BASE_URL=https://api.moonshot.ai/anthropic
export ANTHROPIC_AUTH_TOKEN=kimi api key. get it here [Link]
claude code

That’s it. Now your Claude Code instance will be using Kimi K2.

You can also use it in Cline, which is probably a bit more beginner friendly than CC.

Moonshot have a very good doc page on how to do this; you can check it here [Link].

The intelligence and agentic nature of Kimi K2 is very exciting. But you know what’s missing?

Speed.

Since it’s open source, inference providers can host the model and make it faster. Right now, you can use Kimi K2 hosted by Groq, and it will do anywhere between 200 - 300 tokens per second. Here’s what that looks like.

This is it drafting a very detailed plan to build an app I’ve been working on. Imagine having this run at this speed for a couple hours, researching a topic, finding info across docs etc. Imagine this model running at 500 or 1000 tokens a second. There are millions of use cases.

For example, did you know that 99% of US case law is open source and available for anyone to download and use on HuggingFace?

The dataset contains 6.7 million cases from the Caselaw Access Project and Court Listener. It contains nearly 40 million pages of U.S. federal and state court decisions and judges’ opinions from the last 300+ years. This is a maintained project, meaning it's constantly updated as well. There are thousands, hundreds of thousands of people who could use help with legal issues.

Imagine a system that could provide support for any legal issue in minutes? Or highlight precedent for any issue? You could have dozens of agents working in parallel finding information and conducting research. And it won’t cost you and arm and a leg because Kimi K2 is so cheap to run.

I don’t think there currently exists an app or framework that can leverage the speed and intelligence of Kimi K2. I think one will be out soon.

All this makes me extremely excited for the thinking version of this model. I expect a very, very good model to come from this.

There are more technical things that make Kimi K2 so cool, like Muon, which is their optimiser for training LLMs. It is with this very optimiser that Moonshot was able to create Kimi K2. Funnily enough, folks at other frontier labs like xAI doubted the feasibility of Muon and publicly stated it was not worth it. Look where we are now.

Also, it’s important to remember that when DeepSeek was released, although the model was amazing, the report that accompanied the model was truly ground breaking. Open research is the true bedrock for these models. It will be very interesting to read the Kimi K2 paper when it is released.

I want to keep bringing these AI updates to you, so if you’re enjoying them, it’d mean a lot to me if you became a premium subscriber. It’s like buying me a coffee a month 😊.

How was this edition?

As always, Thanks for Reading ❤️

^{Written by a human named Nofil}

Reply

or to participate.