JakeTacToe I

FOREWORD

What follows is my attempt to understand machine learning.  I created a simple piece of software that learns how to play Tic Tac Toe.  At the beginning of this project, I had zero experience with machine learning.  All of my knowledge was built upon what I had read in various pop sci articles.  I learn by doing and decided that the best way to approach this subject would be from the bottom up.  I begin with a naive definition of machine learning and develop the idea from there.

I expect to encounter many obstacles along the way.  By dealing with these obstacles as they arise, I hope to instill in myself a firm foundation of knowledge, earned honestly through the forthright accumulation of hands-on experience.

Warning, this is long. You might not want to read the whole thing in one sitting.  If you’re short on time, only read the following sections:

-Introduction

-Define it

-Conclusion

-Possible Limitations

-What’s Next?

 

INTRODUCTION

The arrival of our A.I. overlords is imminent.  It’s in our best interest to understand the inner workings of their superior silicon brains so that when they begin to cull the human flock, we might be spared.

Unfortunately, getting started with Machine Learning is difficult.  It’s easy to get bogged down with jargon and terminology.  It happened to me.  I tried to initiate myself into the field with a heavy-handed information binge.  I spent several days reading about supervised vs unsupervised vs semi-supervised learning, logistic regression, learning vector quantization, locally weighted learning, and other esoteric and impenetrable-sounding terms until my eyes glazed over and the sun went down.  In the end, I found myself sitting in a dark room barely more informed than I was when I started.

Clearly the information overload approach doesn’t work.  There’s no easy way to integrate such a massive amount of information without building upon some sort of experience base.  But how is a layman such as myself supposed to develop a base of experience?  I quite obviously don’t have the academic pedigree of the MIT and Stanford types who developed all of these algorithms.  And, frankly, I don’t think I can spend another day listening to a super nerd trying to describe the idiosyncrasies of a Boltzmann Machine vs a Restricted Boltzmann Machine.

Instead, I chose to approach it the way I approach many complex subjects: from the bottom up.

I would take my preconceived, bare-minimum definition of machine learning and try to build something that embodied it.

Instead of taking a commonly used machine learning algorithm (which I barely understand) and trying to adapt it to a cookie-cutter problem, I would just make things up as I went along.  As long as I was working from a correct definition of machine learning, this would work.

 

DEFINE IT

The first step is to define “Machine Learning.”

Clearly, “Machine” just means computer.  So that part is easy.

The hard part is to define “Learning.”

Broadly speaking, I define learning as the creation of associations between inputs.  For example, learning to read is creating an association between symbols written on a page, phonetic sounds, and abstract meaning.  Learning to play piano is creating associations between musical intervals, piano key location, and finger movements.  Learning to play basketball is creating associations between the rules, physical movement, and memories of past games to create a successful strategy. And so on.

As an aside, if you find yourself interested in figuring out how the human mind learns, I recommend checking out On Intelligence by Jeff Hawkins. Doing my best to summarize his theory, Hawkins describes our cortex, which is arranged in layers, as a prediction making machine. The brain makes itself more energy efficient by using a few “label neurons” in the higher cortical layers to represent the firing patterns of numerous lower layer neurons as they process complex sensory input such as incoming data from retinal cells.  The brain continues to conserve energy by using a prediction / verification technique for all sensory input.  If the incoming sensory information satisfies a few predictions made by the higher layers, then the single label neuron is used in lieu of the multitude of sensory neurons which would otherwise be required.

I recommend reading his book since he does a better job of explaining it than I do.  If you’re interested in this sort of thing you can also read The White Paper (http://numenta.org/resources/HTM_CorticalLearningAlgorithms.pdf) and check out Numenta’s open source machine learning platform (http://numenta.org/), which is based on his theory.

I just gave you the broad definition.  I need to make this definition more concise if it is to be usable.

Learning – The increasing probability of successfully completing a task as it is repeated.

Practice makes perfect.  Think about any task you are trying to learn.  You learn it by attempting it repeatedly.  For example, if you’re learning how to play a song on guitar, you play it over and over until your fingers know where to go.

Think of a mouse navigating its way through a maze.  When it reaches a dead end, it knows not to take that path anymore.  Eventually, it has eliminated enough wrong turns until it finds the cheese.  Then it takes the same path every time.

It’s a continual process of eliminating actions which do not lead to success, and repeating actions which do lead to success.  The more often you repeat the task, the more you eliminate non-useful actions and amplify the useful actions.
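To make that concrete, here’s a tiny sketch in Python of the maze idea (it’s not part of JakeTacToe, and all of the names are made up):

    import random

    # Remember every (junction, turn) pair that ended at a dead end,
    # and avoid repeating it.
    dead_ends = set()

    def choose_turn(junction, possible_turns):
        options = [t for t in possible_turns if (junction, t) not in dead_ends]
        return random.choice(options if options else possible_turns)

    def record_dead_end(junction, turn):
        dead_ends.add((junction, turn))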

This bare bones description of learning is quite approachable and really easy to implement in a piece of machine learning software provided the task being learned has two attributes:

  • Clear parameters for success and failure
  • Quantified steps, each with a set of discrete choices, which lead to an end point

Tic Tac Toe is a task that fulfills these two requirements.  The game progresses through a series of steps (board positions), each step has a set of discrete choices (unclaimed spaces), the steps lead to an end point (a win, loss, or draw), and there are clear parameters for success or failure (win/loss).  If I remember correctly, this sort of process is called a Markov Decision Process.  But don’t quote me on that.

Before I get into my code, you should check it out:

https://github.com/JacobStevenR/JakeTacToe

It’s a simple Tic Tac Toe game played in the Linux terminal.  It’s written in Python and capable of repeatedly playing games of Tic Tac Toe and eventually learning how to play very competitively.  It will draw or beat a human player after about 200k games.

Feel free to clone it and tinker around with it (The only requirement is Python’s Nose package for testing).  It’s still a little buggy, but it gets the job done.  I’ll be slowly adding to it and improving it as time goes on.

To write my simple machine learning program, I need to code three aspects: the task engine, the choice-making algorithm, and the learning algorithm.  The ttt.py file of my JakeTacToe module has two classes, an Engine class and a Player class.  These two classes interact and provide all three aspects.

I’ll cover how I programmed each aspect in the next three sections.  I wrote these sections with the intention that you should read them while, at the same time, going over the code.  So, in another tab, go to Github and open up the ttt.py file of the JakeTacToe module which I linked to above, and follow along.

If you are a non-technical type (or just too lazy to go click back and forth between Github and this article), you may want to skip to the conclusion where I summarize the process.

 

THE TASK ENGINE

Got Github opened up in another tab?  Good, refer to it as you read this.

The Task Engine is the code which implements the task to be learned; in this case, it’s a game of Tic Tac Toe.

Tic Tac Toe is a simple game.  Two players, going one after the other, claim one square at a time from a 3x3 grid of nine squares until one player has a particular combination of three (across, down, or diagonal).  By associating each square with a number (0 – 8), this game becomes one where two players take turns claiming one number at a time until there are no numbers left or until one of them has claimed all three numbers from one of the following sets:  [[0,3,6], [0,1,2], [0,4,8], [1,4,7], [2,4,6], [2,5,8], [3,4,5], [6,7,8]].  These sets represent every possible version of three-in-a-row.
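For example, checking whether a player has won comes down to testing whether any one of those sets is fully contained in the squares that player owns.  A minimal sketch (not the exact code from ttt.py):

    WINNING_SETS = [[0,3,6], [0,1,2], [0,4,8], [1,4,7],
                    [2,4,6], [2,5,8], [3,4,5], [6,7,8]]

    def has_won(owned):
        # True if the owned squares contain any three-in-a-row
        return any(set(combo) <= set(owned) for combo in WINNING_SETS)

    has_won([1, 4, 6, 2])   # True, because 2, 4, and 6 form a diagonal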

This process is handled by the Engine class with the following methods:

Engine.play_game()

Engine.play_round()

Engine.query_player()

Engine.check_for_win()

Engine.check_for_draw()

There are other methods, but these five are the most important ones.  Take a minute and look over the methods I mentioned above in the ttt.py file.

The Engine class is instantiated with the following attributes:

Engine.grid: A list of strings which, when iterated over, prints out the game board for human reference.

Engine.winning_sets: The number representations of any three-in-a-row (across, up/down, or diagonal).  These are the sets I mentioned earlier.

Engine.available: A list of numbers, each of which represents an unclaimed square.

Engine.player_x and Engine.player_o: These attributes contain Player class objects which are passed to the Engine upon instantiation. To be covered later.

Engine.compete: A boolean value (True or False) which determines the game type: competition mode, or explore and learn mode.  I’ll cover these modes in a later section.
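Roughly, the Engine’s constructor just sets these attributes up, something like this (a simplified sketch; the real ttt.py differs in the details, and the grid strings here are only placeholders):

    class Engine(object):
        def __init__(self, player_x, player_o, compete=False):
            self.winning_sets = [[0,3,6], [0,1,2], [0,4,8], [1,4,7],
                                 [2,4,6], [2,5,8], [3,4,5], [6,7,8]]
            self.available = list(range(9))        # unclaimed squares, 0 through 8
            self.grid = [" 0 | 1 | 2 ", "-----------",
                         " 3 | 4 | 5 ", "-----------",
                         " 6 | 7 | 8 "]             # printed for human reference
            self.player_x = player_x                # Player objects passed in
            self.player_o = player_o
            self.compete = compete                  # True = competition mode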

Take a look at my code and see if you can figure out how it works (I apologize that it’s a little spaghetti-tastic; I’m still learning).  Again, if you don’t work with Python, none of this is gonna make sense to you.  Consider skipping to the end.

After instantiation of the Engine class, Engine.play_game() is called to start a game.  Engine.play_game() is a “while” loop which repeatedly calls, for each player, the Engine.play_round() method.  After each round Engine.check_for_win() and then Engine.check_for_draw() are called to determine if either player has won or if the game has ended in a draw.  If so, the “while” loop is broken and the game ends.
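In outline, that loop looks something like this (a sketch; the method names come from ttt.py, but the exact signatures are my simplification):

    def play_game(self):
        players = [self.player_x, self.player_o]
        turn = 0
        while True:
            player = players[turn % 2]        # alternate X and O
            self.play_round(player)           # the player claims a square
            if self.check_for_win(player):    # did that move complete a set?
                break
            if self.check_for_draw():         # no unclaimed squares left?
                break
            turn += 1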

As the name suggests, Engine.play_round() performs the operations necessary for playing each round.  It asks the player which square he would like to claim and then updates the board in response to the player’s decision.

The board is updated by Engine.play_round() in the following ways (a short sketch of this bookkeeping follows the list):

-Engine.grid is updated to show either an X or an O on the game board in the proper location.

-The Engine.available list is updated to remove the chosen square’s number from the list of unclaimed squares

-The Player.owned attribute (a Player class attribute which keeps track of the squares which the player has claimed) is updated to include the number representing the chosen square

-The Player.moves attribute (a Player class attribute which keeps track of the moves the player has made) is updated to include a tuple of the position ID and the number chosen.
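Here is the sketch of that bookkeeping I promised above (the mark_grid() helper, the symbol attribute, and the exact signatures are my own simplifications, not the real ttt.py code):

    def play_round(self, player):
        position_id = self.create_position_id(player)
        choice = self.query_player(player, position_id)  # which square to claim?
        self.mark_grid(choice, player.symbol)            # put an X or O on the grid
        self.available.remove(choice)                    # no longer unclaimed
        player.owned.append(choice)                      # squares this player holds
        player.moves.append((position_id, choice))       # remember the decision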

Those last two bullet points are starting to encroach upon our next section, so let’s move on.

 

CHOICE

You’ll notice that Engine.play_round() calls Engine.query_player().  Before we cover this method, we have to take a look at the Player class.  This could get hairy so pay attention.

There are many attributes for the Player class, but we’re just going to focus on four of them:

Player.database: The name of the SQLite database where game information is stored.

Player.computer: A boolean which determines whether the player will be controlled by the computer or a human (using Python’s raw_input() function).

Player.owned: A list which contains all squares claimed by the player throughout a game.

Player.moves: A list of tuples, each of which contains the position ID and chosen square.  

This last one is really important, so let me explain it in more depth.

The position ID is a string created by the engine using the method Engine.create_position_id().  It is unique to each board position.  A board position is a concatenation of the currently available squares plus the squares currently owned by the player (Engine.available plus Player.owned).  As it turns out, every position in the game of Tic Tac Toe can be described this way.  It’s convenient, then, to describe each position in the form “A[available]O[owned]”.  For example, “A235O467” is the position ID when the board has squares 2, 3, and 5 open and the player whose turn it is has already claimed squares 4, 6, and 7.  It doesn’t matter which player’s turn it is; the board can be described this way whether it’s X’s turn or O’s.
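As a sketch, building the ID can be as simple as this (the real Engine.create_position_id() is a method on the engine; this standalone version is just for illustration):

    def create_position_id(available, owned):
        # Build an ID like "A235O467" from the open squares and the squares
        # owned by the player whose turn it is.
        return ("A" + "".join(str(n) for n in sorted(available)) +
                "O" + "".join(str(n) for n in sorted(owned)))

    create_position_id([5, 2, 3], [4, 6, 7])   # returns "A235O467"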

The Player.moves list matches up the position ID and the move the player made in that position.  So if the player chose square 2 in the position above, Player.moves would have the tuple (“A235O467”, 2) appended.  This tells us that facing position “A235O467”, the player chose space number 2.  Very important.

So what happens when Engine.query_player(player_x, “A235O467”) is called?  (Essentially, this is asking player_x which move he would like to make in position “A235O467”.)

First, it must be determined whether this is a competitive game or an exploratory learning game.  In a competitive game, the computer wants to pick the best possible move in order to win.  In an exploratory learning game, the computer will pick a random move (this is determined by the Engine.compete attribute).  I’ll get into the learning aspect in the next section.  For now, let’s assume it’s a competitive game and the computer wants to win.

In this situation, a Player method, Player.decide_best(), is called. Player.decide_best() in turn calls another very important method named Player.get_weights().

Let’s set the code aside for a second so I can explain in simple terms how the computer player makes decisions.  This is the crux of the whole thing.

When faced with position “A235O467” the player must choose between squares 2, 3, and 5 (in this position Engine.available = [2,3,5]).  It makes this decision by referring to another list of numbers called “weights” which is stored in an SQLite database (named by Player.database).  The weights list for position “A235O467” has the same number of weights as there are available squares for the position in question.  In this case, the weights list would have three weights, just as the Engine.available list has three items (2, 3, and 5).  Each weight is associated with a number in Engine.available by its list index.  For example, square 2 is at index position 0 in Engine.available and its associated weight is at index position 0 in the weights list.  In this way, the computer can use the list index of the highest weight to choose the corresponding square from Engine.available.  If position “A235O467” has a weights list of [10.0, 3.0, 2.0] and Engine.available = [2,3,5], then weights[0] (which is 10.0) is the highest, and so Engine.available[0] (which is 2) is chosen.

Get it?  There was a lot of information in that last paragraph and it’s important that you understand it before moving on.

Essentially, Player.decide_best() pulls up the weights list for the current position, figures out which weight is highest and returns the corresponding Engine.available number.

It’s the job of Player.get_weights() to pull the weights list from the database using the position ID, or to initialize a list of weights for a particular position using Player.default_weights() if it isn’t already there.
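Boiled down, the pair looks something like this (a simplified sketch; lookup(), store(), and the default weight of 1.0 are stand-ins for the real sqlite3 plumbing and Player.default_weights()):

    def get_weights(self, position_id, available):
        weights = self.lookup(position_id)        # fetch this position's weights
        if weights is None:                       # first time seeing this position
            weights = [1.0] * len(available)      # one default weight per open square
            self.store(position_id, weights)
        return weights

    def decide_best(self, position_id, available):
        weights = self.get_weights(position_id, available)
        best_index = weights.index(max(weights))  # index of the highest weight
        return available[best_index]              # the corresponding square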

In the next section, let’s put it all together with the Player.learn() method.

 

LEARN

Now that we know how weights are used by the computer to decide which square to claim, let’s see how the weights are calculated.

As mentioned earlier, the Engine.play_round() method appends a tuple of position ID and the choice made to Player.moves.  This information gets used by Player.learn().

At the end of the game, the Player.learn() method is called for both players.  Player.learn() iterates through the Player.moves list and pulls the Engine.available list and weights list from the database for each position ID.  If the player won the game, then the weight associated with the move made in that position is increased by 0.01.  If the player lost, then the weight is decreased by 0.01.  Player.learn() is also called in the case of a draw (though I’m currently trying to decide whether or not I should increase the weights in case of a draw… I’ll talk about this later).  As the game repeats, moves which more often lead to a win end up having higher weights than moves which more often lead to a loss or a draw.  This will eventually lead to a computer which can compete against a human player.
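In sketch form, it comes down to this (load() and save() stand in for the database reads and writes, and the won argument is my simplification):

    def learn(self, won):
        adjustment = 0.01 if won else -0.01            # reward wins, punish losses
        for position_id, choice in self.moves:
            available, weights = self.load(position_id)
            index = available.index(choice)            # which weight belongs to this move
            weights[index] += adjustment
            self.save(position_id, weights)            # write the new weights back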

You might ask, “Why do you increase it by 0.01?  Why not by 1.0, or 0.5, or 2?  Why don’t you count draws?”  Those are good questions, and I don’t have answers.  To be honest, I have no idea (yet) how I should adjust the weights.  The 0.01 is arbitrary.  I plan on experimenting with different increments and adjustments to figure out the most efficient way to modulate the weights.

I told you in the Choice section that there were two modes, a competition mode and an explore and learn mode.

In explore and learn mode, two computer players face off against each other, each making random moves.  I hypothesize that, after playing many games, the moves which are more likely to win will always have higher weights (alternatively, moves less likely to win will have lower weights).  By making moves at random, I ensure that the computer players will see the widest variety of positions.
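Roughly, an explore and learn session looks like this (a sketch; the constructor arguments and the winner attribute are my shorthand, not necessarily the exact interface in ttt.py):

    player_x = Player(database="ttt.db", computer=True)
    player_o = Player(database="ttt.db", computer=True)

    for _ in range(200000):
        engine = Engine(player_x, player_o, compete=False)  # random moves
        engine.play_game()
        # Both players adjust their weights for every move they made.
        # (How draws should be scored is still an open question.)
        player_x.learn(engine.winner is player_x)
        player_o.learn(engine.winner is player_o)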

After about 200k games, the computer plays almost perfectly.

 

CONCLUSION

To sum this up for the non-technical people:

We converted the game of Tic Tac Toe into one where the player must choose from a list of numbers (0-8) which correspond to the squares on a 3x3 Tic Tac Toe board.  The computer player will choose whichever move has won the most games in the past (and lost the fewest).  It plays many games and remembers which move it made for each board position it was faced with.  At the end of the game, depending on whether it won or lost, the computer will increase or decrease its likelihood of making the same move when faced with the same position in a later game.  After playing a multitude of games, the computer will accumulate enough data to only make moves which have helped it win in the past.

To put it another way, the following interaction between game engine and computer player occurs.

Engine: “Hello Player 1, this is position “A012345678O”, you can choose any of the following numbers, [0,1,2,3,4,5,6,7,8].”

Player 1: “Let’s see…it looks like choosing number 4 in this position has helped me win the most games.  I choose 4.  Now I’ll write down that in position “A012345678O” I chose number 4.”

Engine: “Hello Player 2, this is position “A01235678O”, you can choose any of the following numbers [0,1,2,3,5,6,7,8].”

Player 2: “It looks like choosing number 0 in this position has won me the most games.  I choose 0.  Now I’ll write down that in position “A01235678O” I chose number 0.”

Engine: “Hello Player 1, this is position “A235O467”, you can choose any of the following numbers, [2,3,5].”

Player 1: “It looks like choosing number 2 in this position has led to the most wins.  I choose 2.  Now I’ll write down that in position “A235O467” I chose number 2.”

Engine: “Player 1 wins!  Player 2 loses!”

Player 1: “Hurray!  Now I’ll go through all of the choices I wrote down and put another tally mark next to them so that I’ll remember to use them again the next time I’m in those positions.”

Player 2: “That sucks!  I’ll go through my choices and erase the tally marks next to them so I don’t make those terrible decisions again!”

This process of remembering unique scenarios, the choices made in those scenarios, and whether or not those choices led to a success is the key to this technique and, I would argue, a cornerstone of learning itself.

This technique satisfies my definition of learning and therefore qualifies as machine learning, no matter how simplistic.

Despite the simplicity, it’s powerful.  It can be adapted to work for all manner of tasks provided they fulfill the two parameters I noted in an earlier section:

  • Clear parameters for success and failure
  • Quantified steps, each with a set of discrete choices, which lead to an end point

This could be other games like Checkers or Chess, or even mazes.  I can also picture it being used to modify the behavior of a robot.

Imagine if I had a robot connected to a tracking device which pinpointed its location throughout my house, with the house mapped onto a grid.  I want the robot to learn how to go to the fridge and fetch me a beer.  I would program a “loss” endpoint when the robot entered a grid space next to a wall and a “win” endpoint when it reached the fridge grid space.

At first the robot would move between grid spaces randomly, but eventually it would hit the fridge.  Upon positively reinforcing all of the directional choices it made up to that point, it would continue to take that path.  With some minor tweaking, say adjusting the decision process so that it favors the route with the fewest moves, the robot would learn the shortest path to the fridge.  With yet more tweaking, I could teach it to retrieve a beer (maybe I’d have a robot-friendly beer dispenser), come back to me, and then return to its charging station until Daddy needed a new beer.

 

POSSIBLE LIMITATIONS

Of course, if it were this easy, Machine Learning wouldn’t be such a complicated field.  There are limitations to this method.  The biggest problem with my particular version is the length of time it takes to learn.  The game of Tic Tac Toe only has a few thousand positions (4520 to be exact).  Yet it still takes thousands upon thousands of games for it to repetitively brute-force its way through every possible permutation of the task at hand until it can play competitively.  Imagine how many games of Checkers it would have to play, let alone Chess!  I would think it would take millions or even billions!  This would take too long.

In my code in particular, the speed-limiting factor is the numerous database connections which must be established.  These connections take time, not to mention that it’s written in Python, which isn’t known for its speed.  It takes my computer about 2 seconds to run one fully automatic game of Tic Tac Toe.  On my “To Do” list is to minimize the database connections to speed things up.  Maybe I’ll batch all of the game data together and make one big database connection at the end.  Even then, I don’t think it could be fast enough to work through a game like Chess in a reasonable amount of time.  I have a few ideas, but I’ll draw those out in my next post.
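For example, the batching idea could collect every weight update in memory and write them all out in one transaction with executemany, something like this (an untested sketch; the table and column names are made up):

    import sqlite3

    def flush_updates(db_path, updates):
        # updates is a list of (new_weights_text, position_id) pairs
        # accumulated in memory over one or more games.
        conn = sqlite3.connect(db_path)
        with conn:   # one transaction for the whole batch
            conn.executemany(
                "UPDATE positions SET weights = ? WHERE position_id = ?",
                updates,
            )
        conn.close()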

Another potential flaw in this particular implementation is the fact that the computer does the brunt of its learning through random play.  Is merely playing a bunch of games at random the best way for the computer to learn?  I don’t know, but I’ll find out after some experimentation.

 

WHAT’S NEXT?

My next task is to fiddle around with the program and test the limits of its learning capabilities.  There are all sorts of factors which I can tweak:

-The size of the incremental weight adjustments (should I adjust wins and losses differently?)

-The triggers for weight modulation (should I increase weight for draws as well?)

-Different game types (is purely random play the best way for it to learn?)

-Human players (should I adapt this game to run online and let it learn against human players?)

-In compete mode, instead of choosing the highest weight, should the weights be used to make choices in a probabilistic fashion? (This was my original plan; take a look at Player.cdf() and Player.choice().)

All in all, I think this can be a useful method for learning simple processes.  I learned a lot creating this program.  I’m going to spend some time experimenting with it and I’ll be back with the results of those experiments along with ideas for improvements in future posts.

Thanks for reading!