
Building a Better Pac-Man Expectimax Agent Using Machine Learning

UC Berkeley's Pac-Man Projects are a great resource for learning about introductory artificial intelligence.

The AI course I'm enrolled in at Northeastern, CS4100/5100, is currently using the pac-man projects, and I think it's a great idea. As a student, I think it's often too easy to get caught up in treating your classes and assignments as adversaries, working for hours just to get something done, or just to get a grade, rather than working for your own personal and professional development.

I think the pac-man projects have the potential to make the process of working on homework a personal challenge.

For me, this means that I continued to work on the assignments even after I had passed the threshold for getting what I considered a good grade.

The second assignment included implementing minimax, expectimax, alpha-beta pruning, and designing an evaluator function to estimate the utility of states.

The implementation of minimax and expectimax search is beyond the scope of this post, but it's important to know that for those searches to work properly, the agent needs to be able to accurately measure the value of game states.

Building the function by hand

I spent much more time on tweaking my evaluation function than I did on actually implementing the search algorithms.

The evaluation function I wrote evaluates a proposed state that pacman could end up in. A pacman state contains information like the positions of all the ghosts in the maze, the list of all the remaining food (dots), etc.

I chose to compute the utility of some state as a linear combination of a series of features:

score = 1    * currentScore + \
        -1.5 * distanceToClosestFood + \
        -2   * (1./distanceToClosestActiveGhost) + \
        -2   * distanceToClosestScaredGhost + \
        -20  * numberOfCapsulesLeft + \
        -4   * numberOfFoodsLeft

You can see the full function, along with an explanation of the coefficients I used, here.
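As a rough sketch of its overall shape (the distance computations below are my reconstruction, not the exact code from the repo; they assume the standard GameState accessors and util.manhattanDistance):

from util import manhattanDistance

def betterEvaluationFunction(currentGameState):
    """Hand-tuned linear evaluation of a pacman game state (sketch)."""
    if currentGameState.isWin():
        return float("inf")
    if currentGameState.isLose():
        return -float("inf")

    pos = currentGameState.getPacmanPosition()
    food = currentGameState.getFood().asList()
    capsules = currentGameState.getCapsules()
    ghosts = currentGameState.getGhostStates()

    # Distance to the nearest remaining dot.
    distanceToClosestFood = min(manhattanDistance(pos, f) for f in food) if food else 0

    # Split ghosts into active (dangerous) and scared (edible) groups.
    active = [g.getPosition() for g in ghosts if g.scaredTimer == 0]
    scared = [g.getPosition() for g in ghosts if g.scaredTimer > 0]
    distanceToClosestActiveGhost = min(manhattanDistance(pos, g) for g in active) if active else float("inf")
    distanceToClosestScaredGhost = min(manhattanDistance(pos, g) for g in scared) if scared else 0

    return 1    * currentGameState.getScore() + \
          -1.5 * distanceToClosestFood + \
          -2   * (1. / max(distanceToClosestActiveGhost, 1)) + \
          -2   * distanceToClosestScaredGhost + \
          -20  * len(capsules) + \
          -4   * len(food)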

My manual tweaks worked really well. Pacman hovered around an 80-90% win rate, and was able to consistently average over 1100 points, including the points he got in games where he died.

For reference, my assignment instructions explained:

With depth 2 search, your evaluation function should clear the smallClassic layout with two random ghosts more than half the time and still run at a reasonable rate (to get full credit, Pacman should be averaging around 1000 points when he's winning).

While I was tweaking the coefficients, I thought back to Andrew Ng's machine learning course on Coursera. I only stuck with it for a few weeks, but that was enough to give me the intuition that this problem would lend itself to linear regression with batch gradient descent.

Generating the training samples

(If you want to follow along, see the code here)

Before I could do anything useful, I had to generate training samples, but before I could generate training samples, I had to figure out a way to send the coefficients to my evaluation function.

This took me a pretty long time, actually - it was very hard to wade through the pacman code and find how the system sent commands to the agent files (in this case, MultiAgents.py).

It turns out that the pacman system allows the user to send an arbitrary number of arguments to a pac-man agent. They're passed to pacman.py itself as comma-separated key=value pairs after the -a flag, once you've specified which agent to use.

parser.add_option('-a','--agentArgs',dest='agentArgs',
                  help='Comma separated values sent to agent. e.g. "opt1=val1,opt2,opt3=val3"')

After modifying the MultiAgentSearchAgent constructor in MultiAgents.py to take in the coefficients as parameters, I was able to store them as global variables, and then access them in my evaluation function.
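The change looks roughly like this (the keyword names w1 through w6 are placeholders I'm using for illustration; the index/evaluationFunction/depth lines follow the project's stock constructor):

from game import Agent   # already imported at the top of MultiAgents.py
import util

COEFFICIENTS = [1, -1.5, -2, -2, -20, -4]  # defaults, overwritten below

class MultiAgentSearchAgent(Agent):
    def __init__(self, evalFn='scoreEvaluationFunction', depth='2',
                 w1='1', w2='-1.5', w3='-2', w4='-2', w5='-20', w6='-4'):
        self.index = 0  # pacman is always agent index 0
        self.evaluationFunction = util.lookup(evalFn, globals())
        self.depth = int(depth)

        # agentArgs arrive as strings, so cast them before stashing them
        # somewhere the evaluation function can see them.
        global COEFFICIENTS
        COEFFICIENTS = [float(w) for w in (w1, w2, w3, w4, w5, w6)]

An invocation then looks something like python pacman.py -p ExpectimaxAgent -l smallClassic -a depth=2,w1=1.86,w2=-0.03,... with the remaining coefficients filled in (again, the w* names are just placeholders).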

I then wrote up a short script to generate the training samples, by calling the pacman game with generated coefficients.

I settled on generating random values in the range [-4, 4] for the coefficients. I moved this script and the pacman files to my cloud server, let it run for a day or so, and picked up 2,234 training samples.
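The generation loop is a condensed version of something like this (the w* argument names, the output scraping, and the file name are assumptions on my part; the real script is in the repo linked above):

import random
import re
import subprocess

GAMES_PER_SAMPLE = 10  # games to average per coefficient set

with open('training_samples.csv', 'a') as out:
    while True:
        coeffs = [random.uniform(-4, 4) for _ in range(6)]
        agent_args = 'depth=2,' + ','.join(
            'w%d=%f' % (i + 1, c) for i, c in enumerate(coeffs))

        # Run pacman with no graphics and scrape the reported average score.
        output = subprocess.check_output(
            ['python', 'pacman.py', '-p', 'ExpectimaxAgent',
             '-a', agent_args, '-l', 'smallClassic',
             '-n', str(GAMES_PER_SAMPLE), '-q'],
            universal_newlines=True)
        avg_score = float(re.search(r'Average Score:\s*(-?[\d.]+)', output).group(1))

        out.write(','.join(str(c) for c in coeffs) + ',%s\n' % avg_score)
        out.flush()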

This generated a comma-separated data file, where the first six entries are generated coefficients, and the last entry is the average score for pacman over the course of 10 games.

1.8600012677476485, -0.03552331692745003, 1.841124773175646, 2.654340114506205, -0.8045050830568572, 0.2193506754716683, 766.3
2.9350528054346565, 0.5709530250248402, 0.07857250685131767, -0.6385025282304264, 3.1487636335695903, 2.3691954268662307, -196.2
-0.11202021883532609, -3.02759883536649, -0.5656389911377335, -3.9603078866464356, -1.8367063824885976, 2.4630907239770012, -514.2
-3.016637223082224, -2.245754984319735, -0.10921157861722275, 0.6817250004998341, -2.196688389121393, -2.296173281793439, -628.6
-0.6770222252550351, 0.17597684021023863, 1.3055683751614957, 0.6289997585585114, 0.07639269362973078, 2.7845594064081647, -496.3

The last value in each line is the average score over 10 games with that set of coefficients.
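Loading the file for the regression step is a couple of lines, e.g. with numpy (the filename here matches the one assumed in the generation sketch above):

import numpy as np

data = np.loadtxt('training_samples.csv', delimiter=',')
X = data[:, :6]   # the six generated coefficients for each sample
y = data[:, 6]    # average score over 10 games with those coefficients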

Applying gradient descent

This was the script I used to apply gradient descent.

My theta was very simply the vector consisting of all the coefficients.

I chose an alpha (learning rate) of 0.01, and my algorithm had no trouble finding a minimum.
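The core of that script is small; a sketch of the batch update (with no bias term, so theta lines up exactly with the six coefficients) looks something like this:

import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, iterations=500):
    """Fit theta to minimize the squared error of X.dot(theta) against y."""
    m, n = X.shape
    theta = np.zeros(n)
    costs = []
    for _ in range(iterations):
        error = X.dot(theta) - y               # prediction error for every sample
        theta -= (alpha / m) * X.T.dot(error)  # batch update over all samples at once
        costs.append((error ** 2).sum() / (2 * m))
    return theta, costs

# Using the X and y arrays loaded from the training samples above:
# theta, costs = batch_gradient_descent(X, y, alpha=0.01, iterations=500)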

Running it for 100 iterations:

Gradient descent run for 100 iterations

makes it seem like we've found the lower limit for the cost. Running it for 500 iterations convinces me even further:

Gradient descent run for 500 iterations

Yeah, we're there.

After running for 500 iterations, my script comes up with these coefficients:

158.55581188, -80.87408792, -1.69430904, 3.274168, -10.32809479, -13.93307463

Seen conveniently next to the features they are applied to:

158.55581188 * currentScore +
-80.87408792 * distanceToClosestFood +
-1.69430904  * (1./distanceToClosestActiveGhost) +
3.274168     * distanceToClosestScaredGhost +
-10.32809479 * numberOfCapsulesLeft +
-13.93307463 * numberOfFoodsLeft

...and next to the coefficients I made:

score = 1    * currentScore + 
        -1.5 * distanceToClosestFood + 
        -2   * (1./distanceToClosestActiveGhost) + 
        -2   * distanceToClosestScaredGhost + 
        -20  * numberOfCapsulesLeft + 
        -4   * numberOfFoodsLeft

They are, unsurprisingly, wildly different.

But how do they stack up?

Comparing the hand-made function to gradient descent

Gradient descent:

Average Score: 847.48
Scores:        221, 960, -71, 984, -148, 935, 967, 1326, 889, 1115, 974, 1119, -159, 973, -137, 134, 918, 1160, 962, 1717, -78, -122, -267, 983, 1386, 969, 1212, 909, 876, 938, 1358, 987, -157, 315, -206, 1750, 938, 954, 921, -377, 689, 1357, 955, 964, 975, -308, 1309, 1317, 1160, 961, 1361, -339, 1154, 1107, 974, 984, 985, -105, -144, 1149, 1179, -156, 313, 965, 1371, 1483, 963, 972, 1329, 972, 112, 1332, 958, 319, 1154, 1370, 974, 969, 1146, 1109, 918, 1476, 1163, 1322, 972, 966, 1154, 987, 906, 953, 964, 1349, 1169, 973, 330, 1253, 1351, 1149, 959, 967
Win Rate:      78/100 (0.78)

hand-made:

Average Score: 1159.78
Scores:        1369, 1364, 237, -206, 969, 1372, 1722, 1378, -321, 1753, 1562, 1338, 1567, 1360, 1553, 1369, 978, 1748, 1761, 1363, -279, 1356, 1750, 1678, 1371, 1362, 1365, 1311, 1370, 1372, 632, 1335, 1367, 1703, 1480, 1162, 1371, 1573, 1312, 1367, -26, 1351, -174, 1171, -381, 1362, 1371, 976, 1352, 1564, 1544, 1371, 1344, -162, 1768, 1736, 1380, -133, 1758, 976, 1367, 1766, 978, -287, 1728, 1550, 1372, 1331, -398, 1515, 1364, 90, 1355, 1165, -346, 1761, 1588, 1298, 1375, 1362, 1713, 1342, 1361, -113, 1331, 1565, 1730, 507, 1357, 1760, 1358, 1371, -198, 1558, 967, 978, 1352, 1057, 1371, 1335
Win Rate:      83/100 (0.83)

So, the win rate is actually pretty close (~5% difference).

But the difference in the average score between the two functions is 312.3. That's pretty big.

I had expected gradient descent to win out over my hand-made coefficients, but it didn't. I think it's likely that the way I made my training samples is limiting gradient descent's potential; [-4, 4] is a pretty small window for selecting sample coefficient values.

I think better training data could be generated by using my hand-made coefficients as base values for the generated coefficients, or simply generating more training data with a larger range of possible coefficient values.

I'd be interested to see how an unsupervised algorithm performs on this problem!

