Revisiting the Decision Transformer with Tic Tac Toe
Nov 25, 2025
15 minute read

Note: code accompanying this project can be found here.

This weekend, almost on a whim (I guess due to some background threads that had been accumulating in my head for a while), I decided to finally take a quick & dirty shot at something that has fascinated me for a long time, just never quite enough to actually get around to trying it: the Decision Transformer1 (DT). I remember reading about it when it came out and being.. I dunno, simultaneously amazed and annoyed. I had recently finished another round of studying reinforcement learning, and at the time it seemed like people were just throwing transformers at everything to see what would stick. At the same time, the idea seemed to just make sense: what if we simply condition a sequence prediction task on how well we want it to perform? But it also completely subverted everything I understood about RL and the importance of Bellman value estimation. Anyway, I left it there, curious to see what would follow it up, and I didn’t hear much more, although there was certainly some related work being done.

But over time, as RL became more and more important for LLM post-training, various approaches to RL kept tugging at my thoughts here and there, and I started to wonder why I never see the DT mentioned. So just before I went to bed on Sunday night I did what is becoming a somewhat bad habit, I admit, and started asking Gemini: how does it work again, how does it formulate the problem as a sequence prediction task, and.. that little demon sneaking up on my shoulder.. how easy would it be to just try something? I ended up formulating a fairly detailed prompt for an age-old test setup that probably everyone interested in RL has tried during their graduate studies: Tic Tac Toe. I certainly remember trying to get a neural network to understand TTT before I knew much at all about RL, failing spectacularly, and never getting around to revisiting it. So I thought I’d use it this time around as a playground, knowing that I could expect the network to memorize the solution, so playing optimally should be pretty simple. I’d get that working and call it a night.

Well, it almost went the way I hoped, so I’ll jot down the details here for posterity. The vibe-coding aspect is a bit beside the point of this post, though, so I’ll leave those remarks for the end.

The Vanilla DT

The Decision Transformer’s architecture is elegant. Instead of learning a policy directly, it models the history of states, actions, and a desired outcome as a sequence and predicts the next token, just like a language model. More specifically, the input is a sequence of triplets: a Return-to-Go (RTG), which is the total reward you want to achieve from that point onward; a State (the board); and an Action (a move). The RTG comes first so that, thanks to causal masking, the state and action predictions are conditioned on the reward. This last part, keeping the “reward you want to achieve” in the input, is non-obvious. Predicting the reward is intuitive, since that’s what happens during a rollout: you make a move, you get a reward. Conditioning on it is less clear: you feed the desired reward in at inference time, rather than letting it be a passive outcome. And why would you ever train on low rewards anyway? Well, for every game there is an outcome, and what works and what doesn’t is precisely what leads to that outcome. It stands to reason that the model can learn something about which actions lead to poor performance and which lead to good performance. Then, when using it, you simply ask for good performance. Makes sense!

To play, you prompt the model with a high RTG (like +1.0 for a win), and it predicts the action most likely to lead to that outcome based on the patterns it saw during training.


Diagram 1

Diagram 1: The sequence of RTGs, states, and actions during a rollout, as seen by the DT model.

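To make the sequence layout concrete, here is a minimal sketch of how a single Tic Tac Toe rollout could be flattened into that (RTG, state, action) order. The function name and encodings are my own illustration, not the accompanying code:

```python
def build_dt_tokens(states, actions, rewards):
    """states: per-move board snapshots, actions: cell indices (0-8),
    rewards: sparse per-step rewards (zeros, then +1/-1/0 at the end)."""
    # Return-to-go at step t is the sum of rewards from t to the end of the game.
    rtgs, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtgs.append(running)
    rtgs.reverse()

    tokens = []
    for rtg, state, action in zip(rtgs, states, actions):
        tokens.append(("rtg", rtg))        # conditioning token comes first...
        tokens.append(("state", state))    # ...then the board the agent saw...
        tokens.append(("action", action))  # ...then the move it made (the training target)
    return tokens

# At inference time you prompt with the RTG you *want* (e.g. +1.0 for a win),
# append the current board, and let the model fill in the action slot.
```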

My first version was trained on games played by two random agents. I fired up the training script, and the initial results were exciting! The loss went down, and the validation win rate against a random opponent started climbing. It seemed to go straight up for 50 epochs, so I went to bed happy. In the morning it occurred to me to check whether that was a stable result, and when I ran it for 200 epochs I saw that after it peaked, it began to plummet. (An epoch here was arbitrarily set to 5,000 rollouts.)


Figure 1

Figure 1: Overfitting to the random dataset.


This looked like classic overfitting, but I wasn’t sure: I thought it was training on an infinite supply of random games, so how could it overfit? Herein lies the danger of vibe coding: I hadn’t realized that the LLM had not generated the infinite online dataset I had imagined; instead it had generated an initial dataset of just 5,000 games. Of course the model memorized the solutions to that set and overfit. You can see the inflection point in the loss curve where it goes from generalizing to memorizing.

More on vibe coding at the end, but let’s focus on DT for now.

Fighting Overfitting with an “Infinite” Dataset

The hypothesis was straightforward: if the model is memorizing a static dataset, let’s not give it one. I modified the code to generate game rollouts on the fly. For every training batch, the model would see a fresh set of games it had never encountered before. This, I thought, would force it to learn a generalizable strategy.
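The change amounted to something like the sketch below: generate complete games on demand instead of reading them from a stored dataset. This is my own self-contained illustration, with minimal board logic inline, rather than the script’s actual classes:

```python
import random

WIN_LINES = [(0,1,2), (3,4,5), (6,7,8), (0,3,6), (1,4,7), (2,5,8), (0,4,8), (2,4,6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]          # +1 or -1
    return 0                         # no winner (yet, or a draw)

def random_rollout():
    """Play one game between two random agents; return (trajectory, result for player +1)."""
    board, trajectory, player = [0] * 9, [], 1
    while winner(board) == 0 and 0 in board:
        move = random.choice([i for i, cell in enumerate(board) if cell == 0])
        trajectory.append((tuple(board), move, player))  # state before the move, move, mover
        board[move] = player
        player = -player
    return trajectory, winner(board)

def fresh_batch(batch_size):
    # No stored dataset: every training batch is a brand-new set of games.
    return [random_rollout() for _ in range(batch_size)]
```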

The result? Now it didn’t really overfit, because it simply didn’t learn to play well. I was genuinely surprised.


Figure 2

Figure 2: Learning infinite random games caused it to learn nothing.


This was a much more interesting failure. One of those things that is completely obvious in retrospect, but didn’t occur to me: since the DT is an offline algorithm, and I’m using it with sparse rewards, it can tell that a winning game must contain some good moves, but it can’t tell which ones. Even a winning game is composed of random moves, so it just learns to predict random (mostly bad) moves, even though some of them happened to lead to wins. I still thought this should lead to stagnation at worst, but apparently it actually overfits to emulating a random player and starts losing again. It tries to condition on winning, but at the end of the day it just doesn’t know which moves are good.

It wasn’t just memorizing data anymore. The model was learning to perfectly imitate a bad teacher. The “infinite” stream of data came from random agents, and a perfect imitation of a random player is still, fundamentally, a random player. So, for the vanilla DT to work, it needs expert rollout examples. Empirically this seems to contradict what the paper claims in its introduction:

Training only on random walk data – with no expert demonstrations – we can generate optimal trajectories at test time by adding a prior to generate highest possible return

However, I’ll need to study it more carefully to figure out what the caveats are here and where the discrepancy really lies.

A Better Signal

The problem with RTG is that it gives the same “win/lose” grade to every move in a game. This is what we mean by “sparse reward”: we don’t know which individual moves were actually right or wrong, just that after all those random moves there was a win or a loss. I needed a way to credit the specific moves that were good and penalize the ones that were bad. The answer from traditional reinforcement learning is Advantage2.

The advantage A(s, a) is defined as Q(s, a) - V(s), which simply asks: “How much better is the return from taking action a in state s compared to the average return you’d get from just being in state s?”
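A toy illustration, with numbers made up purely for the example:

```python
# Made-up numbers: V and Q are estimated average returns.
V_s = 0.1         # average return from this board position, over all games seen
Q_s_center = 0.6  # average return when the center square was played from this position
A_s_center = Q_s_center - V_s  # +0.5: the center move did much better than average here
```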


Diagram 2

Diagram 2: The sequence modified to provide advantages, states, and actions during a rollout.


This is a much sharper signal. I replaced RTG with Advantage in the transformer’s input sequence. The new prompt during inference wouldn’t be “get a high return,” but “take an action with a high advantage.”

The Offline Advantage Experiment

One sticking point is that to calculate advantage, I first needed to estimate the V and Q values, which requires averaging returns over the whole dataset. Hard to do when I was generating data on the fly. So just to start, I went back to the pregenerated, purely offline approach. Note that this was only possible to do “exactly” because of the small state space of Tic Tac Toe.

  1. Generate a dataset of 20,000 random games.
  2. Use Monte Carlo estimation: for each state s, V(s) is the average of all returns ever seen starting from that state. Q(s, a) is the same, but for a specific state-action pair.
  3. Train the DT by conditioning it on these pre-calculated advantages.

The code accomplishes this by tracking one dict for states and one for state-action pairs, indexed by their exact values. In a bigger game this would be a combinatorial explosion! That’s one advantage (ha) of using such a simple game for exploratory work like this.
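In code it boils down to something like the following sketch (names are mine, and the sign-flipping bookkeeping for the second player is glossed over):

```python
from collections import defaultdict

returns_v = defaultdict(list)   # state           -> all returns observed from it
returns_q = defaultdict(list)   # (state, action) -> all returns observed after it

def accumulate(trajectory, final_return):
    # With sparse rewards, every step in a game shares the game's final outcome.
    for state, action in trajectory:
        returns_v[state].append(final_return)
        returns_q[(state, action)].append(final_return)

def V(state):
    rs = returns_v[state]
    return sum(rs) / len(rs) if rs else 0.0

def Q(state, action):
    rs = returns_q[(state, action)]
    return sum(rs) / len(rs) if rs else V(state)  # fall back to V(s) for unseen pairs

def advantage(state, action):
    return Q(state, action) - V(state)
```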

The result was at first satisfying, but then, not… the model almost immediately hit an 80% overall win rate, far surpassing any previous version. Very quickly though, it collapsed even more spectacularly than before, plunging to a sub-50% win rate.


Figure 3

Figure 3: Trying with offline advantage calculations, we again overfit to the initial dataset.


This was my first real encounter with distributional shift. The model got so good, so quickly, that during validation it started encountering board states that were extremely rare in the initial random dataset. It wandered “off the map,” and its advantage estimates for these new states were either non-existent or based on noisy, unreliable data. Its policy, overfitted to its value estimates, became useless, and it started making catastrophic blunders.

Online Advantage with a Replay Buffer

The solution was to combine the best parts of both previous attempts. I needed the adaptive, ever-improving data from an online approach, but the stable value estimates of an offline one. The answer was a replay buffer3.

The reason the offline approach “worked” at first is that the advantage was calculated over a sufficiently large batch of games. If I redefined that batch as the most recent set of games, and renormalized the advantage calculation as I went, I’d get a continuously evolving picture of which moves count as ‘successful’ based on recent history, and avoid overfitting to the initial pregenerated dataset.

The final training loop looked like this:

  1. Generate & Store: In each epoch, generate a new batch of games and add them to a large replay buffer, discarding the oldest games.
  2. Re-Estimate: Use the entire replay buffer to recalculate the V and Q tables. This gives us stable, up-to-date advantage estimates.
  3. Train: Train the model for one epoch on a sample of data from the buffer, using these fresh advantage values.
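Condensed into Python-flavored pseudocode, the loop looks roughly like this; generate_games, estimate_v_q, train_one_epoch, model, and num_epochs are placeholders standing in for the corresponding pieces of the actual script:

```python
import random
from collections import deque

buffer = deque(maxlen=50_000)   # oldest games fall off the back automatically

for epoch in range(num_epochs):
    # 1. Generate & store: fresh rollouts push out the oldest ones.
    buffer.extend(generate_games(n=5_000))

    # 2. Re-estimate: rebuild the V and Q tables from the *current* buffer, so the
    #    advantage labels track recent history instead of the initial random data.
    v_table, q_table = estimate_v_q(buffer)

    # 3. Train: one epoch on a sample from the buffer, using the fresh advantages.
    batch = random.sample(list(buffer), k=min(len(buffer), 5_000))
    train_one_epoch(model, batch, v_table, q_table)
```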

Diagram 3

Diagram 3: The training loop with advantage calculation per epoch of random rollouts. (Diagram only minimally edited from SVG generated by Gemini 3, by the way!)


This worked. The catastrophic overfitting vanished. The win rate climbed quickly to about 66% overall (~80% when playing as Player 1) and then stayed there, flat and stable, for hundreds of epochs.


Figure 4

Figure 4: Finally, a successful result by updating the advantage calculations at every epoch! (Red line)


The model had finally learned everything it could from its teacher. The 66% ceiling wasn’t a bug; it appears to be about the most this training signal can extract for a policy trained purely to exploit a random opponent. It had mastered that specific “meta-game.”

The next logical step? Using this stable agent as a seed for a self-play loop to see if it can teach itself, AlphaZero-style. But that’s a project for another weekend.


Side note on “careful vibe coding”

I have to admit that I’m a bit blown away by how good Gemini is at vibe coding these kinds of little self-contained science projects. It almost one-shotted the problem, with only a single small bug fix needed in the validation function. After that, the script ran perfectly, trained a network, and I could watch the win rate improve in TensorBoard, just like that. No coding at all on my part except diagnosing that one small failure. Using LLMs and agents for coding can certainly be frustrating, but when it works this well, it’s an interesting feeling. Hard to describe.

Anyway, I continued the project the next morning for a few hours to fix a few other things and investigate the performance a bit more. This became an interesting exercise in what I like to think of as “careful vibe coding”, where I do in fact kind of just let Gemini write out the program, but then check pretty closely what it has generated and make sure it’s doing what I think it should, and doing it right. Even then, I managed to miss an important detail: in my mind I was already expecting it to use an online dataset, but I hadn’t actually specified this, and it took the assumption in a different direction. Of course, I’ve seen this happen with colleagues when working on teams too, so nothing new, but it still surprises me that I didn’t notice it until I was confused by the results I was getting. I should have caught it during review!

I will say that once the LLM starts diverging from your local version (because I don’t accept all changes), things get a lot more difficult to handle, and switching to an agent system becomes pretty much necessary; otherwise copy & pasting turns into a nightmare. And then you have to be even more careful, rejecting agent proposals when they aren’t right, which happens frequently and can be difficult to judge on the spot. (I’m still on the lookout for an agent system that works better for me with a git tree/rebasing workflow. Maybe I’ll build one.) In the end, though, I got a result that probably would have taken me a week or more to do manually, so I’m happy with the outcome. Writing this blog post and fixing up the code for reproducibility honestly took me about 5x longer than getting the actual experiment working and running it. But maybe that’s also because I don’t post much, so I’m not used to it.

One technique I’ve started using is to have a fairly involved chat about a topic in the Chat window, and then copy & paste the whole conversation into the agent’s prompt, or simply into a .md file in the project directory. This lets you formalize everything before kicking off a coding session, and helps focus the agent on the plan you’ve already developed.

I didn’t exactly take that approach here, but close. I started with a couple of warm up questions on what the DT was and how it works,

Can you summarize briefly for me how the decision transformer works?

and,

How are sequences organized and how is the return conditioning performed? Is it just a reward token prepended to the beginning of the sequence?

.. then I dropped the following into the chat. I did it this way because I expected it all to be achievable in a single file; I might not have tried this on a more complicated multi-file project. It’s not a long prompt, but it’s detailed, and it lists all the steps I had thought through in advance for building the project in a modular way, including logging and validation for analysis. I thought it would be best if the LLM considered all these requirements at once, and that panned out in practice:

Let’s try an exercise. In a single python file, write: 1. a tictactoe game board which can receive the next move, list available next moves, and evaluate the winner; 2. a random player agent; 3. an agent that calls a pytorch model to predict a distribution over the board and samples a move (piece placement); 4. a function that calculates full rollouts between two agents; 5. a pytorch data loader that returns batched sets of rollouts; 6. a basic transformer model using torch.nn.transformerdecoder; 7. a training function that trains a model on the random games using the ideas from Decision Transformer; and 8. a validation function that uses the dataloader to evaluate the performance of the trained model against a random agent. Make the program take a dataclass-based configuration that can be overridden with argparse. Add tensorboard logging for training loss and validation win rate.

The result was a program that “just worked” and did everything I asked, except for one crash in the validation function due to sequence length mismatches, that was easily fixed with one more prompt. I should add that this is with Gemini 2.5 Pro. I’ve been getting good results with it and haven’t experimented much with other services for a while, since for my level of usage it’s free. And it’s frighteningly good at writing PyTorch code.

I don’t think vibe coding is appropriate everywhere, far from it, but I do think that with some care, collaboration with an LLM can be hugely beneficial. I also think it’s an acquired skill, not something you can be an expert at from day one, so it’s worth taking notes on what works and what doesn’t; consider this section a contribution to that. I am certainly still learning. Not just how to work with it successfully, but also tons of little lessons about how insidious it can be at hiding subtle problems. Even when you think you’ve checked the code carefully, it will do unexpected things in places you don’t think to check. Be careful out there!

On the other hand I probably wouldn’t have even done this project without vibe coding, and that’s where I’m finding major benefit. Being able to just explore these little side thoughts that I have, on a whim.. “hey what happened to the Decision Transformer anyway?” and actually get something working quite quickly, rather than giving up after spending a few hours realizing that it’s going to take me another week to even know whether it works or not, is kind of huge, and is helping to deepen my knowledge on topics that I might not otherwise have time for.