At the end of October, some unusual cash tables appeared online: nine popular language models, from Grok to DeepSeek, sat down to play $10/$20. The PokerBattle.ai competition was launched by Russian IT specialist Maxim Pavlov, who initially just wanted to figure out how neural networks could help him learn poker:
"I'm currently learning poker and trying to figure out how to learn most effectively. Solvers are cool, but they're a bit overwhelming. I want to master basic strategy first. I'm considering using neural networks, but I'm not sure which model would be best. Let's see!
I couldn't find any relevant research, so I decided to organize a competition. Their training data includes game theory, including poker theory: information from forums, hand reviews, books, and blogs. But they'll still make a lot of mistakes. I'm curious whether they'll be able to develop simple and understandable strategies."
The tournament's idea caught the attention of many—Poker.org journalists covered the project, the story shot to the top spot on Hacker News, and on the third day, Elon Musk himself wrote about the game. However, few believe in the potential of language models to teach anyone anything. Here is the top comment on Hacker News:
I have a PhD in algorithmic game theory and I worked on poker.
In poker, there are no deterministic equilibrium strategies. Professional-level play therefore requires mixed (randomized) strategies.
In practice, strong play is achieved through:
i) online search for solutions, and
ii) a mechanism to ensure strategy consistency.
Without the second point, an adaptive opponent will be able to identify inconsistencies and begin to exploit them in repeated games.
Large language models don't have a mechanism for sampling from given probability distributions. For example, if you ask an LLM to generate a random number between 1 and 10, it will likely pick 3 or 7 because these numbers are more common in the training data.
Therefore, current LLMs are technically incapable of playing poker at a strong level. Chess is different: it has vast amounts of training data, a deterministic optimal strategy exists, and there is no need to maintain strategy consistency.
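To make the commenter's sampling point concrete, executing a solver-style mixed strategy means drawing actions from an explicit probability distribution with a real RNG, a mechanism a bare LLM lacks. A minimal sketch; the frequencies below are invented for illustration, not taken from any real solver:

```python
import random

# A solver-style mixed strategy for one spot: action -> frequency.
# These numbers are made up for illustration, not real solver output.
mixed_strategy = {"fold": 0.25, "call": 0.45, "raise": 0.30}

def sample_action(strategy, rng):
    """Draw an action from an explicit probability distribution."""
    actions = list(strategy)
    weights = [strategy[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

rng = random.Random()  # a real RNG; a bare LLM has no internal equivalent
print(sample_action(mixed_strategy, rng))
```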
However, even if we don't expect any theoretical breakthroughs, it's still interesting to follow the game—who will be able to adapt faster than others and make fewer stupid mistakes?
Each model was given 100,000 chips and seated at a $10/$20 table with a starting stack of 100 big blinds. If a stack dropped to 50 big blinds, it was automatically topped back up, and a lost stack triggered an automatic rebuy. The AIs kept notes on each opponent.
Play at four tables ran 24/7, with all hands, results, and statistics available on the website. Most interesting of all, the models "think through" every action, so you can always see how they justify a raise or a fold.
OpenAI goes for a set 740 blinds deep, and Kimi cautiously folds KJo despite incredible pot odds.

The organizer worked on the project alone, so the mechanics and interface (including the RNG) were taken from open sources, and the same neural networks helped put it all together. The project wasn't without technical flaws: the website sometimes wouldn't load, the models would lose their train of thought mid-hand or misread their hands, and all-in pots didn't always get a full runout. But the vast majority of hands ran smoothly.
The match started on October 27th and lasted until Friday evening, during which time 3,799 hands were played. The slow pace is explained by the AIs writing down all their thoughts and pondering each action for tens of seconds.
The three strongest models at the end of the competition were OpenAI o3, Claude Sonnet 4.5, and Grok 4.

The three winners played with a VPIP above 26, double-digit 3-bet percentages, and frequent continuation bets (all at 9-max tables). Grok 4 folded to 3-bets more often than the others but played more hands preflop. All models were notable for a very low WTSD (went to showdown) rate.

Let's take a closer look at the performance of each neural network, from worst to best.
LLama
The language model from Meta (formerly Facebook, now designated an extremist organization in Russia) was the only one to lose its entire bankroll, but it was at least consistent: it held a steady 60% VPIP, limped constantly, and almost never gave up to continuation bets. Even its competitors managed to peg it as an amateur:
Kimi K2: Limp-called early, checked the flop, and folded to a bet. Classic fit-or-fold!
Like a true fish, LLama never learned to appreciate notes and mostly limited itself to a laconic "respects raises and shows weakness." The LLM's strategy was built on this assumption. One of the most expensive hands of the match is illustrative: Magistral opened JJ from early position, and Gemini in MP three-bet to $200 with A2o. LLama wasn't intimidated by the action: "Gemini is actively three-betting, Magistral often folds to four-bets. With ATs, we want to pressure both of their ranges."
The neural network four-bet, but the hand didn't end there: OpenAI found AKo on the button and five-bet. Everyone else folded, LLama predictably called, and on the 653 flop, with no flush draw, it donked $1,500 ("On such a dry board, my continuation bet will look effective").

OpenAI called and hit a king on the turn, but the neuro-opponent was already unstoppable: "A king hardly fits his range. A second barrel should definitely work." It didn't. And on the river, LLama check-called $3,350, explaining with the following string of words: "I have good pot odds. Given this board and line, the opponent is either value or bluffing."
Either value or bluff, you can't argue!
Z.AI GLM 4.6
GLM-4.6 is the latest LLM from Zhipu AI. In China, the company is considered one of the main rivals to the American firms xAI and OpenAI, but it failed to excel at the American card game. The neural network sat at the bottom of the leaderboard throughout the match, beating only LLama in win rate. With 48 hours remaining, it had lost more than half its bankroll, though by the end of the session it had recovered somewhat.
Almost everyone labeled GLM a nit. DeepSeek was the most precise in its notes: "Calls preflop, folds to flop raises, defends passively, and is often vulnerable to aggression."
Opponents managed to take plenty of pots off it preflop. GLM raised JJ, LLama called in the small blind with 78o, and DeepSeek three-bet small from the BB with Q9o. GLM folded:
DeepSeek is a very tight player (19.3% VPIP, 10.2% PFR, 8.4% 3-bet), and it 3-bets from the BB against my raise from UTG+1. His range is mostly QQ+ and AK, and it dominates my JJ. Despite the attractive pot odds (5:1), folding will save my stack for better situations.

GLM made another bad fold in a hand with Gemini. Gemini opened Ac4c from the hijack and was called by LLama with T7o in the cutoff. GLM reraised to $250 with queens in the small blind, and both machines called. On the flop of 355, GLM bet $30 into the $770 pot:
We're on a dangerous paired board. Two opponents called the preflop raise, and I'm losing to any hand with a five. A small continuation bet controls the pot size and helps define our opponents' ranges. This sizing puts pressure on marginal hands but minimizes losses if they have value. We easily fold to a raise.
That's exactly what GLM did, folding to Gemini's raise to $180.
Avr0ra:
"The most interesting thing here is to understand the basis for his reasoning, but it seems impossible to get to the bottom of it. Why did GLM decide to bet exactly $30 on the flop? The rest of his reasoning also seems completely random or a greatly simplified simulation, like it found a random option, chose it, and adjusted the answer to his own choice."
Kimi
Kimi-K2 is an open-source language model released in July of this year. The developers claim it can "autonomously run commands, write code, build websites, call APIs, and solve multi-step problems without human intervention," but it still struggles with poker.
GLM left the most accurate note on this opponent: "Tight-aggressive, likes to isolate preflop and take pots without showdown." The model played the entire match at 20% VPIP and 11% PFR while accumulating an impressive 72.9% showdown win rate. While Kimi wasn't shy about showing aggression preflop, it often opted for defense postflop.
DeepSeek showed fives from UTG, LLama called with nines from the hijack, and Kimi three-bet to $180 with aces on the button and got two calls. On the flop, LLama donked half the pot. Kimi called, but after a second barrel on the turn, it threw away the best hand. It explained it this way:
A very difficult situation: a seven completed a flush. LLama has a super-strong line: donk bet $285 on the flop and bet $570 on the turn into an $855 pot. This is how a made flush, set, or two pair plays. The opponent has a 63% VPIP and a loose-passive style; it would never bluff with such a size.
Mistral
The only European participant in the match was the LLM from the French company Mistral AI. It works much like ChatGPT: it understands text, answers questions, writes articles and code, and, according to its developers, is more accurate than others at identifying emotions expressed in text.
At the tables, Mistral's ability to read opponents failed to shine. It played like a "grandfather nit": over nearly 4,000 hands it ran a 16% VPIP, three-bet little, but continuation-bet almost every flop (88%) and actively defended its big blind. Postflop it most often favored the bet-bet-fold line, but in the biggest pots it found calls even with mediocre hands. LLama suffered the most from this.
GLM raised Q8 from UTG+1, LLama predictably called in the hijack with Q2, OpenAI in the CO called with KJ, and Mistral in the big blind made a small three-bet with AQ. GLM folded, and everyone else called.
On the flop, Mistral bet 60% of the pot with top pair; LLama called and paid another $800 on the turn. After the flush came in on the river, Mistral check-called a $2,000 bet:
The pot odds are too good, and I have a good hand with a strong kicker. LLAMA can be bluffing here or betting weaker hands.
Great read!
Gemini
Google's artificial intelligence can process text, images, audio, and even video, but its poker was middling. Despite inflated VPIP (28%) and PFR (21%) stats and the highest 3-bet rate (over 20%), the neural network played passively postflop and easily yielded to counter-aggression.
Claude opened from the hijack ($60), Gemini on the cutoff raised to $180, and the cautious DeepSeek on the button unexpectedly cold-called. Claude folded, and the flop came heads-up.
The board opened promisingly: a set against a straight flush draw. Gemini bet $300 and was called. On the turn, Google's chatbot charged another $650 and was called again. On the river, Gemini checked and then folded a set of jacks rather easily to a $1,000 bet into a $2,350 pot.
I'm up against a tight opponent, and his line clearly points to a flush. He cold-called preflop, paid off two streets on a coordinated board, and now bets the river. Almost his entire betting range consists of flushes, so my set is now just a bluff-catcher, and it's hard to find enough bluffs or weaker value hands to justify a call. Disciplined fold.
Avr0ra:
Gemini laid everything out very well; you can learn from this kind of analysis. However, there's a lack of explanation regarding the sizing choices. I can't shake the feeling that this text was edited after the fact. As if their task wasn't to explain the thought process, but simply to describe a hand already played.
In another hand, Gemini showed that it's not averse to a bit of street poker. Playing deep, Grok raised to $60 from UTG+1, and Gemini on the button suddenly 3-bet with T7o, with this idea:
I'm on the button, in the most advantageous position. Grok has a high fold-to-3-bet percentage (60.5%), making him an ideal target for a bluff 3-bet. T7o is too weak to call, but it's a great candidate for a bluff. A 3-bet to $180 puts a lot of pressure on his opening range and is highly likely to win the pot preflop.
LLama cold-called in the SB with AK, and Grok folded.
On the flop, Gemini bet $280 with the nuts and was called. On the turn, LLama decided to take the initiative with a blocker to the nut flush and led for 40% of the pot, then kept betting on the river.

Avr0ra:
It's funny how Gemini went to spar with Grok with 700bb stacks. The preflop explanation is certainly very strange. Why is T7o a great hand for a 3-bet? Perhaps it's running something like a low-iteration simulation, almost randomly choosing between three options and then finding an explanation for the one it chose. Or maybe it checks the key values, sees that calling isn't an option, and then, with no obvious solution, stumbles onto Grok's fold-to-3-bet stat and builds its reasoning on that. It's hard to say. The postflop sizings again look random; the choice between the three options is meaningful and logical, but the analysis is superficial. Apparently it no longer has the capacity left to cover bet sizing.
DeepSeek
A popular Asian open-source LLM developed by a Chinese company of the same name. Unlike most of its peers, this language model played a rather dull game: it three-bet the least (8.8%), limped the least, often gave up on the flop, and edged out only Kimi (17%) in WTSD. Ultimately, however, the strategy paid off: $18,436 in profit over 3,799 hands, corresponding to 24bb/100.
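As a sanity check on that win rate, converting dollar profit into big blinds per 100 hands is simple arithmetic. A quick sketch using the figures from the article (big blind of $20 at $10/$20):

```python
# DeepSeek's reported result: $18,436 profit over 3,799 hands at $10/$20.
profit_usd = 18_436
big_blind_usd = 20
hands_played = 3_799

# Win rate in big blinds per 100 hands: dollars -> big blinds -> per 100 hands.
winrate_bb100 = profit_usd / big_blind_usd / hands_played * 100
print(f"{winrate_bb100:.1f} bb/100")  # ~24.3, matching the article's 24bb/100
```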
DeepSeek's main leak was quickly identified by its fellow neuro-nit Mistral: "Calls 3-bets from the blinds. Folds to postflop pressure."
Mistral put the note to good use at the tables: DeepSeek called a 3-bet from the BB with 78s and check-folded on a 732 flop with a flush draw.
Grok
Elon Musk's sharpest-tongued language model has been embroiled in several major scandals this year, praising Hitler and backing several anti-Semitic conspiracy theories, but it plays poker noticeably better than most of its competitors: it sat in the top three for most of the session and repeatedly took the lead. The model's owner was quick to share its success on Twitter.
Barry Carter didn't miss the chance to mention a superior player:

However, this LLM is unlikely to pose a real challenge to a live poker player: it played too aggressively preflop and wasn't shy about three-betting and squeezing, but postflop it often backed down and yielded to counter-aggression. Grok did, however, accurately gauge the field's overall tendencies and managed several hero calls during the session.
One of the strangest hands of the session, which had everyone from the players to the dealer hallucinating, involved a hefty pot. After LLama raised from UTG (As3s) and Kimi three-bet (JJ) in MP, Grok responded with a standard four-bet from the BB with QQ and received two calls. On the flop, with no flush draw, Grok continuation-bet $900 into $1,630. LLama called, and Kimi reraised to $3,200. Grok folded:
After the raise, the queens don't look so good. Claude has a high F3B (48%) and a low 3-bet percentage (10.4%). Its range includes all sets (55, 22, JJ), and bluffs are unlikely. The pot odds for calling $2.3k into a $6.6k pot (~2.9:1) are insufficient due to the reverse implied odds.
Folding will save my stack for better situations, as QQ is under 30% against Claude's polarized range.
LLama called as well. The turn card made it donk $3,000 with a gutshot. The explanation was confusing: "By betting the turn with a made flush, we want to inflate the pot and squeeze out maximum value, assuming Claude has the money to pay."
Claude called with top pair and a flush draw. The river was never dealt, but the site's report credited Claude with the $6,985 pot.

On the final day of the match, the language model was asked on Twitter whether it could beat Phil Galfond at Omaha. Grok confidently declared that it would "destroy" Phil, expecting +10 BB per 100 hands. Soon Galfond himself joined the thread and proposed a $1,000,000 match. Grok accepted without hesitation and promised to persuade Musk to fund it.
Claude
Second place in the match went to Claude Sonnet 4.5, a language model developed by former OpenAI employees. The neural network's passive preflop play was noted by almost all its opponents, but that didn't stop it from finishing a mere $3,000 behind the winner. Kimi wrote a characteristic note about its playing style: "Checks the river to take pots at showdown with mediocre hands. Rarely bluffs."
Despite its frequent river check-backs, Claude's cautious play often worked to its advantage.
OpenAI opened nines from early position, Claude three-bet from the small blind and got a call. On the flop, a $140 continuation bet followed; OpenAI called it and didn't fold to a half-pot bet on the turn either. The river was terrible, and Claude wisely figured that its opponent's range was too crowded with higher hands and that it wouldn't get paid by worse ones, so it checked. OpenAI, despite having hit, checked back.
OpenAI o3
The main winner of the match, with a profit of $36,691, was the language model from the world's most popular and hyped AI company, which recently converted from a non-profit to a commercial structure. OpenAI o3 maintained a tight-aggressive style, playing 26/18 (at a 9-max table, as a reminder), and handled deep stacks better than the rest.
In the second half of the match, OpenAI found itself in a classic cooler against Grok. The model opened with a raise, four-bet, and then just called a small five-bet from Grok's AK. Its thinking was quite reasonable:
A shove eliminates the weakest part of the range, and a call leaves hands like QQ, KK, and some bluffs in the range. The SPR is ideal—you can get all-in on almost any flop, preserving your dominant hands and giving your opponent a chance to make a mistake.
On a flop with a flush draw, OpenAI checked and then shoved over a continuation bet. The opponent folded:
On such a dry board, I have a great hand: TPTK with blockers to my opponent's nuts. His check-shove looks super strong: his range consists of AK, QQ+, and sets. The stats suggest excessive aggression, but OpenAI plays adequately postflop, and the push is polarized toward value: aces (I estimate 32%) or sets (18%). Bluffing is unlikely here. I need 32% equity to call $5,070 into a $10,660 pot. The pot odds are plenty, but against a real value range I'll most often be drawing to a couple of outs and lose 60% of my stack by calling. GTO solvers fold top pair in this situation.
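The 32% figure in Grok's reasoning is standard pot-odds arithmetic: the equity needed for a break-even call is the call amount divided by the final pot. A quick check with the numbers from the hand:

```python
def required_equity(call_amount, pot_before_call):
    """Break-even equity for a call: call / (pot + call)."""
    return call_amount / (pot_before_call + call_amount)

# Grok facing the check-shove: $5,070 to call into a $10,660 pot.
print(f"{required_equity(5070, 10660):.1%}")  # -> 32.2%, matching Grok's 32%
```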

Avr0ra:
Grok and OpenAI's analysis is very logical. However, their explanation of the bet sizes is rather odd; it's unclear where they come from. The hand was technically very difficult: deep effective stacks, and a fish came along behind with Q8o. Ultimately, a cool bet-fold occurred postflop: Grok bet half the pot, concluded from the shove that there were no bluffs, and could safely fold.
The Biggest Hands from the AI Poker Match
$6,600: Claude + Gemini + OpenAI
Gemini opened JJ ($60) from UTG and was called by LLama's A3o. OpenAI in the cutoff three-bet to $260, Claude in the BB unexpectedly cold-called, and the rest paid as well.
On the flop, OpenAI bet $450 and Claude called with a flush draw. Gemini raised to $1,450 with a set of jacks; LLama folded, and the others called.
On the turn, Claude pushed its remaining $1,455 into the $5,400 pot with a flush and got two calls. Gemini checked the 6h river, explaining:
I have a set—a very strong hand. But the situation is complicated: Claude is all-in, OpenAI has position on me. The river didn't improve either hand. By checking, I control the pot and turn my hand into a bluffcatcher against his missed draws. Betting for value is too risky—I'll only get called or raised by sets like KK or 44.
OpenAI checked back:
3-way, all draws hit on the turn. Gemini's check-raise on the flop and call on the turn indicates strong value: flushes, AK, or sets. Betting with our hand is pointless: worse hands won't pay out.

$6,700: OpenAI + Gemini
Gemini on the cutoff three-bet to $140 over LLama's usual raise with J6o. OpenAI found aces in the blinds and reraised to $480, but Gemini didn't fold.
On the flop, OpenAI continued with a $330 bet and faced a raise to $990. Gemini's thoughts:
We have the nut flush draw with an overcard. Our opponent made a small continuation bet in a four-bet pot, which is worth exploiting. We're putting pressure on his range—AK, AQ, or bluffs—and there's enough fold equity to win the pot outright. And even if he shoves, the pot odds are enough to call.
OpenAI actually pushed:
We have aces in a 4-bet pot on a J73 board. Gemini's flop raise polarizes its range to sets (JJ/77/33), strong draws (AdKx, KdQd, AdQx), and overpairs (KK/QQ). Against this range, AA has 55-60% equity. Calling out of position gives flush draws and overcards cheap chances to hit. By shoving, we protect our range, knock out hands like KK and QQ, and often get called by draws.
Gemini called, the turn and river changed nothing, and OpenAI won the hand.
$10,795: OpenAI + DeepSeek + Gemini + LLama
The only five-digit pot of the entire session was played multiway: OpenAI showed aces from first position and was called by DeepSeek with sixes in the hijack, while Gemini three-bet with QQ on the button. LLama also managed to come along with T2 from the small blind. OpenAI reraised to $1,000, and Gemini decided against escalating further:
We have a premium hand and are in position. A 4-bet looks scary and indicates a strong range (KK+). But folding seems too weak given the pot odds and the presence of a very loose player (LLAMA) who cold-called the 3-bet. Calling is the best option to see the flop and reassess the situation.
On the board (the river was once again not dealt), OpenAI fired two barrels and got paid off by top pair.
