On playing Quantum Tic-Tac-Toe
This post is about a project I did with three fellow students for the Machine Learning course at university. The idea for Qttt was inspired by the CodeCup competition in 2012. In the same year I gave a presentation at the NSVKI conference in Utrecht. You can download the slides presented at the university and at the conference if you like.
The game of Quantum Tic-Tac-Toe is, as the name suggests, a variant of the well-known game of tic-tac-toe. The difference is that on each move the player picks two positions on the board instead of one. These two positions form a quantum superposition: only one of them will eventually belong to the player, while the other will "collapse". This collapse, and the resulting ownership, occurs when a cycle of entanglements forms between quantum positions. The animation above (borrowed from Wikipedia) illustrates how this works.
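The collapse rule above amounts to cycle detection in a graph where cells are nodes and each quantum move links two of them. A minimal sketch of that check, using union-find (the class and method names are my own, not from the original project):

```python
# Hypothetical sketch: each quantum move links two cells; a move whose two
# cells are already connected in the entanglement graph closes a cycle,
# which is exactly the condition that triggers a collapse.

class EntanglementGraph:
    def __init__(self, n_cells=9):
        self.parent = list(range(n_cells))  # union-find forest over cells

    def find(self, c):
        while self.parent[c] != c:
            self.parent[c] = self.parent[self.parent[c]]  # path halving
            c = self.parent[c]
        return c

    def add_move(self, a, b):
        """Link cells a and b; return True if this move closes a cycle."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True       # cells already connected: cyclic entanglement
        self.parent[ra] = rb  # otherwise merge the two components
        return False

g = EntanglementGraph()
g.add_move(0, 1)         # X plays cells 0/1
g.add_move(1, 2)         # O plays cells 1/2
print(g.add_move(0, 2))  # closes the cycle 0-1-2, so this prints True
```

A full engine would also have to resolve which half of each entangled pair becomes classical after the collapse; this sketch only detects when a collapse is due.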
How did we build a program to play this game?
Most games are a reinforcement learning problem: each round offers a number of possible moves, the game lasts multiple rounds, and often there is more than one good strategy. This differs from supervised learning, where a single correct decision is usually given for each training example.
How do we determine the best move? We used TD-learning to map each board position to an estimated probability of winning. This is an exploratory method: not every move is computed in advance; instead, the "belief" in the probability of winning is updated after every game (or batch of games). To build this map the bot plays in an exploratory mode where, with a 10% chance, it ignores the best move and plays a random one instead. Without the occasional random move, the bot would likely get stuck in a sub-optimal decision pattern.
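The loop above can be sketched as tabular TD(0) with epsilon-greedy move selection. This is a minimal illustration, not the project's actual code; `ALPHA`, the 0.5 default value, and the idea of choosing among afterstates are assumptions for the sketch:

```python
import random

ALPHA, EPSILON = 0.1, 0.1  # learning rate; 10% exploration as in the post
V = {}  # board state (hashable key) -> estimated win probability

def value(state):
    return V.get(state, 0.5)  # unseen states start at a neutral estimate

def choose_move(afterstates):
    """Pick the highest-valued resulting state, or explore at random."""
    if random.random() < EPSILON:
        return random.choice(afterstates)  # exploratory move
    return max(afterstates, key=value)     # greedy move

def td_update(state, next_state):
    """Shift V(state) toward V(next_state) by the learning rate."""
    V[state] = value(state) + ALPHA * (value(next_state) - value(state))
```

After a finished game, the terminal state's value is clamped to 1 (win), 0 (loss), or a tie value, and the updates propagate that outcome back through the visited states over many games.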
Why not compute all possible game plays? For normal tic-tac-toe the number of possible games is modest; in fact, it fits on an old-school floppy disk. The problem with Qttt on the 4x4 board we used is that the number of possible games is so large that pre-computing them would need enormously more space. We therefore trained a feed-forward neural network to approximate the TD-learning value for each board state.
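The idea is that instead of storing a value per reachable board, a small network maps a board encoding to a win-probability estimate. A sketch of such a forward pass with one hidden layer (the input size, encoding, and weight scales are assumptions; only the 200-node hidden layer matches the post):

```python
import numpy as np

rng = np.random.default_rng(0)

N_INPUT, N_HIDDEN = 32, 200  # e.g. a 4x4 board with two features per cell
W1 = rng.normal(0.0, 0.1, (N_HIDDEN, N_INPUT))  # input -> hidden weights
W2 = rng.normal(0.0, 0.1, (1, N_HIDDEN))        # hidden -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def win_probability(board_vec):
    """Forward pass: board encoding -> estimated probability of winning."""
    hidden = np.tanh(W1 @ board_vec)      # hidden-layer activations
    return sigmoid(W2 @ hidden)[0]        # squash output into (0, 1)

p = win_probability(rng.normal(size=N_INPUT))  # some value in (0, 1)
```

During training the TD error plays the role of the supervised target, so the weights are nudged by backpropagation instead of updating a lookup table.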
So what were the results? To establish a baseline we let two bots that only made random moves compete for a million games. The opening player won 47% of the games, the second player won 42%, and 11% ended in a tie. We chose to improve the second player, since it started at a disadvantage. Using a neural network with a hidden layer of 200 nodes, we raised its probability of winning to around 90%.