### Humans and computers, making choices in games

Different games come with different challenges from a decision-making perspective. For this posting we will look at the following illustrative examples: - Chess - Bridge (cardplay only, not bidding) - Poker (Texas Hold’em) - Risk

A possible way of categorizing the games would be to look at the information available when a decision needs to be made.

Chess and Risk are games that provide full information at any given time, but only in chess is the outcome of decisions given (the result of any sequence of moves is deterministic and has a single result). In Risk the momentary outcome might be influenced by rolling dice, so the result of a chain of events is still deterministic but there are many possible results depending on how the dice fall.

A further complication is the fact that there are other players and we cannot decide for them (which is true for all the games we are looking at).

In bridge card play, disregarding the bidding stage and information from that, partial information is available. The declarer (or “attacker” if you like) has 100% information on his sides’ assets, and 100% influence on what decisions his side makes. The defenders each have 50% information on their sides’ assets and 50% on the opposing sides assets. Furthermore, each defender can only decide on what card to play from their own hand, their partner makes his/her own decisions.

Interestingly, the information available grows, or rather the number of possibilities diminishes, with each card that is played. Possible card distributions are reduced rapidly and the defenders supply both each other and declarer with additional information. Decisions made by other players, as well as any agreed signaling conventions between the defenders are clues (defenders often have agreements on how to improve their partners available information, but this is also information to the declaring opponent).

Since all 52 cards are dealt, the contents of hidden hands is known, but not how the contents are distributed between the hands. A human might say that the game is mostly technical with a significant portion of psychology.

In poker there is not a lot of information available initially. Since Texas Hold’em is so popular we can use that as an example. The “raw” information (cards) consists of your own two cards. Later, first three and then one more and finally one more card is added to your known world of information. However, since there is betting at all these stages, most of the information available is what decisions your opponents make (or do not make). Since the balance between “raw” information and information gleaned from interpreting the opponents actions leans so heavily towards interpretation a human might say that the game is mainly psychological with a significant portion of odds calculation.

So how do algorithms compare to humans when it comes to making decisions in these games?

If we begin with chess we find a vast amount of literature, research and practical implemetations.

From a computing point of view, chess is typically a tree-search problem. Since the number of possible moves at each step easily exceeds 30 in the middle game, a brute force approach quickly becomes impractical. For instance, a search in 8 steps with choice of 40 different moves will require evaluation of >6500 billion positions.

In order to reduce the number of evaluations needed, pruning algorithms are used. The most popular one for this type of problem is called alpha-beta.

The reasoning behind alpha-beta is that opponent has no reason to pick a move that would give us the opportunity to pick a move that gives us a better position than we could achieve had the opponent picked another move, and vice versa. Alpha and Beta represent the max and min values used to prune the tree using this line of reasoning.

Let us consider the tree of depth 8 (depth 0 is the current position):

- We traverse the tree left-to-right and therefore begin by reaching the bottom left position in the tree at depth 8 (we have made 4 moves and so has the opponent, who has made the last move which we will call O4). When we reach this position our evaluation algorithm is run and returns an assessment of the position, value X. We then continue to evaluate all the possible moves for the opponent (all O4) that stem from the same last move we made (call it U4). Our opponents best move will be the one with the lowest X, Alpha.
- Now we move on to our next move at depth 7 and begin to evaluate the opponents moves at depth 8 in a similar fashion. The difference is that we now know that it would be silly of us to choose a move at depth 7 that would give the opponent the possibility of making a move with lower X than Alpha. As soon as we encounter any such possible move for the opponent we can prune that entire sub-tree. In effect Alpha is our lower bound.
- As you have probably figured out by now, the reasoning at depth 6 will be reversed. The opponent would not willingly give us the opportunity to score above an upper bound that is the Beta value. Our lower bound is the opponents upper bound and vice versa.
- Etc etc until the entire tree from depth 8 and up is aggregated, resulting in a choice of best move and an evaluation of that move

Theoretically the number of position evaluations can be reduced to something down to the square root of the brute force positions, assuming that excellent moves are available early in the evaluation. Developments in this area focus on good evaluation algorithms and getting good Alpha and Beta values early. For those interested you can look at the Scout improvement.

In contrast, expert human players focus on determining a limited selection of candidate moves. They evaluate a tree with much fewer possible moves at each depth.

Comparing computers to humans, computers are the winners since quite a few years back. There are still certain types of positions where the horizon (calculation depth) is too far away for a computer to make the correct decision, typically positions with interlocked pawn chains and no material gains in sight or end games where a small mistake ends in a draw due to rules limiting consecutive moves with nothing in particular happening. But computers will not choose these positions because programmers have developed opening books that avoid anti-computer type positions and some of the horizon problems are solved with end-game books containing specific solutions to some problems that are not solvable within the horizon.

From a “computers trying to solve problems better than humans” perspective chess is relatively easy. The Japanese game of Go is a similar example.

The Risk problem on the other hand appears both simple and difficult. Tree search type solutions (as for bridge) typically require c^d evaluations brute force, where c is the number of choices in each position and d is the depth, so it seems that we might accept a large c in exchange for a limited d. We would also have to consider the possible outcome of die rolls. With that line of reasoning we might examine all possible outcomes of die rolls (weighted according to probability) and perhaps achieve a decent algorithm by running a tree search for each die roll outcome combination. That seems simple, even if c becomes very large.

There has been work done using genetic algorithms and even TD(lambda) (machine learning) for Risk. This indicates that maybe it isn’t that simple after all.

However, I like to think that commercially available Risk games probably just use a simple greedy algorithm and perhaps even some manipulation of the die rolls for different settings. Greedy algorithms just grab as much as there is immediately available with no eye to the future.

Explanation:

- Consider a tree of depth 2 with two branches at each level. Each node has a numerical value
- Problem formulation: choose a path that gives the greatest sum of values in the nodes you have passed
- A greedy algorithm will always choose the node with the highest value at the nearest level. This might lead to a selection of nodes with low values at the next level, while a different first choice might have led to a jackpot node. In the long run, all things being equal, a greedy algorithm will still outperform random behavior

Evaluation of a position seems to be a challenge, but perhaps a simplistic approach (the number of reinforcements generated by the position in the next round, for instance) is workable.

We might suspect that humans use a sort of semi-greedy approach. On the one hand grabbing is good. On the other hand there are reinforcements to be gained by holding continents until the next turn, and allowing opponents to generate vast numbers of reinforcements will eventually lose. Also the map is not symmetrical and the continents are not identical. So humans tend to balance greed against many other factors.

In bridge card play most programs use so called “double dummy” evaluations. This means simulating possible card distributions in the hidden hands and then running searches for the best card to play overall. An interesting point is that these evaluations have two not so obvious flaws:

- If the number of simulations is too low, situations where the location of a certain card is actually a 50/50 proposition will instead be skewed in some direction. This can lead to selecting card plays that actively guess where the card is, while alternative plays would score better because guessing is sometimes avoided
- The double dummy approach is similar to the approach described in chess and Risk (and Go etc etc). The difference is that the players do not actually see all the cards. Each search is just using one possible distribution of the cards. Therefore, algorithms assuming that the opponents will make the best play to counter any play we make are basically flawed. Given a choice, opponents will often choose the play with the best odds, which will be a weighted result based on many possible distributions of the cards.

More advanced algorithms consider the fact that the opponents will be in a similar situation to the one we are in. Such approaches are called “single dummy” in bridge. They tend to score significantly better in defense and better in offense (remember, the defenders have fragmented information and distributed decision-making).

Human experts play exactly like this. Sometimes a double dummy approach is sufficient, for instance if a play is found that always leads to the goal. The single dummy considerations gain weight the more the outcome is influenced by the opponents decisions than the specific cards they hold.

Humans still beat machines sometimes at bridge, but it seems the machines will be clearly better quite soon.

Poker is a real challenge, both for humans and computers. Information is clearly limited and any straightforward approach will probably only be successful for so-called “heads-up” (when only two players are involved).

Algorithm development has involved Bayes theorem, Nash equilibrium, Monte Carlo simulation or neural networks, all of which are imperfect techniques. The current trend for poker AIs is to use reinforced learning or similar techniques, with huge training samples. Training is usually done AI vs AI.

Since poker is so much about interpreting the actions of other players, future development will probably be based on game theory and attempting to dynamically model the current opponents (so that game strategies can be individually adapted for each current opponent, where weaknesses can be exploited and bad risk/reward ratios avoided).

Human experts tend to (over?)exploit the risk/reward ratios (“I have an ace in hand, so I will bet enough that an opponent with no pair and no ace often will fold immediately”). They also tend to introduce a random element to make it more difficult for the opponents to build a mental model of them, while spending small or reasonable amounts of money early in a session (and observing the other players) to build their own models of the opponents. The fact that humans leak information (and observe such leaks) other than betting/not betting, size of bet etc etc, such as twitching, hesitating or talking a lot, makes the human live game more different from computer games than any other games we have looked at. It is also possible to put on some play-acting and try to fool any observers.

On-line poker is on the other hand basically the same regardless of human or AI players.

An interesting observation is that poker AIs do not always produce similar strategies to the ones human experts use. There is also absolutely no fear or personal loss involved in computer evaluations.

It is entertaining to think that very good human poker players either have the ability to turn off any fear of losing or that they simply lack the common sense to worry about losing lots of money and therefore can exploit any such weaknesses in other players. AI development has led to computer programs being able to perform well against humans. Recent developments have led to computer programs beating humans even in multi-player scenarios.

As a conclusion we can say that regardless of whether the games we look at have perfect or imperfect information algorithms are developed that enable computers to beat humans at the games. Considering the difficulty as well as the variation in the games one cannot help but consider if the same might be true for other areas in society. Balancing complex systems in society is a challenge that seems (and probably is) much more difficult than any of the games we have looked at, but perhaps algorithms could help us make better decisions or even be trusted to make the decisions for us.