Using Probabilistic Knowledge And Simulation To Play Poker (Darse Billings).pdf

(1082 KB) Pobierz
49261447 UNPDF
Using Probabilistic Knowledge and Simulation to Play Poker
Darse Billings, Lourdes Peña, Jonathan Schaeffer, Duane Szafron
Department of Computing Science, University of Alberta
Edmonton, Alberta Canada T6G 2H1
{darse, pena, jonathan,
duane}@cs.ualberta.ca
Abstract
Until recently, artificial intelligence researchers who use games
as their experimental testbed have concentrated on games of
perfect information. Many of these games have been amenable to
brute-force search techniques. In contrast, games of imperfect
information, such as bridge and poker, contain hidden information
making similar search techniques impractical. This paper
describes recent progress in developing a high-performance poker-
playing program. The advances come in two forms. First, we
introduce a new betting strategy that returns a probabilistic betting
decision, a probability triple, that gives the likelihood of a fold,
call or raise occurring in a given situation. This component unifies
all the expert knowledge used in the program, does a better job of
representing the type of decision making needed to play strong
poker, and improves the way information is propagated
throughout the program. Second, real-time simulations are used to
compute the expected values of betting decisions. The program
generates an instance of the missing data, subject to any
constraints that have been learned, and then simulates the rest of
the game to determine a numerical result. By repeating this a
sufficient number of times, a statistically meaningful sample is
used in the program’s decision–making process. Experimental
results show that these enhancements each represent major
advances in the strength of computer poker programs.
search is not successful since it is often impractical to
search the game trees that result from all possible
instances of the missing information.
Two examples of imperfect information games are
bridge and poker. Recently, at least two research groups
have made an effort to achieve high-performance bridge-
playing programs [Ginsberg, 1999; Smith et al ., 1998].
The progress has been impressive, and we may not have
to wait long for a world-championship caliber program.
Until now, the computing community has largely
ignored poker (a recent exception being [Koller and
Pfeffer, 1997]). However, poker has several attributes that
make it an interesting and challenging domain for
mainstream AI research [Billings et. al ., 1998a].
We are attempting to build a program that is capable of
beating the best human poker players. We have chosen to
study the game of Texas Hold’em, the poker variation
used to determine the world champion in the annual
World Series of Poker. Hold’em is considered the most
strategically complex poker variant that is widely played.
Our program, Loki, is a reasonably strong player (as
judged by its success playing on the Internet) [Billings et.
al ., 1998a; 1998b]. The current limitation in the
program’s play is its betting strategy - deciding when to
fold, call/check, or raise/bet. A betting strategy attempts
to determine which betting action will maximize the
expected winnings (or minimize the losses) for a hand.
The previous version of Loki used several expert-
knowledge evaluation functions to make betting
decisions. These routines were rigid in the sense that they
always returned a single value: the “best” betting
decision. Although these evaluation functions allowed
Loki to play better than average poker, it was inadequate
to play at a world-class level, since continually upgrading
this knowledge is difficult and error-prone.
This paper introduces two major advances in the
capabilities of computer-poker-playing programs. Each is
shown experimentally to result in substantial
improvements in Loki’s play.
First, this paper introduces a new betting strategy that
returns a probability triple as the knowledge
representation of the evaluation function. The routine
returns three probabilities (one each for fold, call/check,
and raise/bet). The program can then randomly select the
betting decision in accordance with the probability triples.
Representing decisions as a probability distribution better
captures the type of information needed to perform well
in a noisy environment, where randomized strategies and
misinformation are important aspects of strong play. This
1. Introduction
Past research efforts in computer game-playing have
concentrated on building high-performance chess
programs. With the Deep Blue victory over World Chess
Champion Garry Kasparov, a milestone has been
achieved but, more importantly, the artificial intelligence
community has been liberated from the chess “problem”.
The consequence is that in recent years a number of
interesting games have attracted the attention of AI
researchers, where the research results promise a wider
range of applicability than has been seen for chess.
Computer success has been achieved in deterministic
perfect information games, like chess, checkers and
Othello, largely due to so-called brute-force search. The
correlation of search speed to program performance gave
an easy recipe to program success: build a faster search
engine. The Deep Blue team took this to an extreme,
analyzing roughly 250 million chess positions per second.
In contrast, until recently imperfect information games
have attracted little attention in the literature. In these
games, no player knows the complete state and each
player has to infer the missing information to maximize
the chances of success. For these games, brute-force
1
Copyright © 1999. American Association for Artificial Intelligence
{www.aaai.org}. All rights reserved.
49261447.006.png 49261447.007.png
component also allows us to unify the expert knowledge
in a poker program, since the same component can be
used for betting decisions, opponent modeling, and
interpreting opponent actions.
Second, Loki now bases its betting strategy on a
simulation-based approach; we call it selective sampling .
It simulates the outcome of each hand, by generating
opponent hands from the sample space of all appropriate
possibilities, trying each betting alternative (call/check,
bet/raise) to find the one that produces the highest
expected winnings. A good definition of appropriate
hands is one of the key concepts in defining selective
sampling and it is one of the main topics of this paper. As
with brute-force search in chess, the simulation (search)
implicitly uncovers information that improves the quality
of a decision. With selective sampling, the knowledge
applied to a simulation quantifies the value of each
choice, improving the chance of making a good decision.
Simulation-based approaches have been used in other
games, such as backgammon [Tesauro, 1995], bridge
[Ginsberg, 1999], and Scrabble 1 [Sheppard, 1998]. The
simulation methods presented in this paper are quite
similar to those used by Ginsberg in Gib, although there
are several distinctions in the details, due to differences in
the games.
For deterministic perfect information games, there is a
well-known framework for constructing these
applications (based on the alpha-beta algorithm). For
games with imperfect information, no such framework
exists. For handling this broader scope of games we
propose that selective sampling simulation be such a
framework.
to act, one of three betting options is available: fold,
call/check, or raise/bet. There is normally a maximum of
three raises allowed per betting round. The betting option
rotates clockwise until each player has matched the
current bet or folded. If there is only one player remaining
(all others having folded) that player is the winner and is
awarded the pot without having to reveal their cards.
3. Building a Poker Program
A minimal set of requirements for a strong poker-
playing program includes assessing hand strength and
potential, betting strategy, bluffing, unpredictability and
opponent modeling. Descriptions of these as they are
implemented in our program, Loki, can be found in
[Billings et. al ., 1998a; 1998b]. There are several other
identifiable characteristics that may not be necessary to
play reasonably strong poker, but may eventually be
required for world-class play.
The architecture of the previous version of Loki, which
we now call Loki-1, is shown in Figure 1. In the diagram,
rectangles are major components, rounded rectangles are
major data structures, and ovals are actions. The data
follows the arrows between components. An annotated
arrow indicates how many times data moves between the
components for each of our betting actions.
To make a betting decision, the Bettor calls the Hand
Evaluator to obtain an assessment of the strength of the
current cards. The Bettor uses this hand strength, the
public game state data and expert-defined betting
knowledge to generate an action (bet, call or raise). To
evaluate a hand, the Hand Evaluator enumerates over all
possible opponent hands and counts how many of them
would win, lose or tie the given hand. After the flop, the
probability for each possible opponent hand is different.
For example, the probability that hole cards of Ace-Ace
are held after the flop is much higher than 7-2, since most
players will fold 7-2. Each possible hand has a weight in
the Weight Table for each opponent, and these weights
are modified after each opponent action. Updating the
probabilities for all hands is a process called re-weighting .
After each opponent action, the Opponent Modeler calls
the Hand Evaluator once for each possible hand and
increases or decreases the weight for that case to be
consistent with the new information. The Hand Evaluator
uses the Weight Table values to bias the calculation,
giving greater weight to the more likely hands. The
absolute values of the probabilities are of little
consequence, since only the relative weights affect the
later calculations.
Loki-1 uses expert knowledge in four places:
1. Pre-computed tables of expected income rates are
used to evaluate its hand before the pre-flop, and to
assign initial weight probabilities for the various
possible opponent hands.
2. The Opponent Modeler applies re-weighting rules to
modify the opponent hand weights based on the
previous weights, new cards on the board, opponent
betting actions, and other contextual information.
2. Texas Hold’em
A hand of Texas Hold’em begins with the pre-flop ,
where each player is dealt two hole cards face down,
followed by the first round of betting. Three community
cards are then dealt face up on the table, called the flop ,
and the second round of betting occurs. On the turn , a
fourth community card is dealt face up and another round
of betting ensues. Finally, on the river , a fifth community
card is dealt face up and the final round of betting occurs.
All players still in the game turn over their two hidden
cards for the showdown . The best five card poker hand
formed from the two hole cards and the five community
cards wins the pot. If a tie occurs, the pot is split.
Typically, Texas Hold’em is played with 8 to 10 players.
Limit Texas Hold’em has a structured betting system,
where the order and amount of betting is strictly
controlled in each betting round. 2 There are two
denominations of bets, called the small bet and the big bet
($10 and $20 in this paper). In the first two betting
rounds, all bets and raises are $10, while in the last two
rounds they are $20. In general, when it is a player’s turn
2
1 ™ Milton-Bradley company.
2 In No-limit Texas Hold'em, there are no restrictions on the size of
bets.
49261447.008.png
Opponent
Model
1081
Opponent Modeler
Each time it is Loki-2’s turn to bet, the Action Selector
uses a single probability triple to decide what action to
take (note that the Bettor is gone). For example, if the
triple [0.0,0.8,0.2] is given, then the Action Selector
would call 80% of the time, raise 20% of the time, and
never fold. The choice can be made by generating a
random number, allowing the program to vary its play,
even in identical situations. This is analagous to a mixed
strategy in game theory, but the probablility triple
implicitly contains contextual information resulting in
better informed decisions which, on average, can out-
perform a game theoretic approach.
The Triple Generator is responsible for generating
probability triples. As shown in Figure 2, this routine is
now at the heart of Loki-2. The Triple Generator takes a
two-card hand and calls the Hand Evaluator to evaluate
the cards in the current context. It uses the resulting hand
value, the current game state, and expert-defined betting
rules to compute the triple. Note that in addition to using
the Triple Generator to produce a triple for our known
hand, it can also be used to assess the likely behavior of
the opponent holding any possible hand.
For the Hand Evaluator to assess a hand, it compares
that hand against all possible opponent holdings. To do
this, it uses the opponent Weight Table. In Loki-2, the
Opponent Modeler now uses probability triples to update
this table after each opponent action. To accomplish this,
the Triple Generator is called for each possible two-card
hand. It then multiplies each weight in the Weight Table
by the entry in the probability triple that corresponds to
the opponent’s action. For example, suppose the previous
weight for Ace-Ace is 0.7 (meaning that if it has been
dealt, there is a 70% chance the opponent would have
played it in exactly the manner observed so far), and the
opponent now calls. If the probability triple for the current
context is [0.0, 0.2, 0.8], then the updated weight for this
case would be 0.7 x 0.2 = 0.14. The relative likelihood of
the opponent holding Ace-Ace has decreased to 14%
because there was no raise. The call value of 0.2 reflects
the possibility that this particular opponent might
deliberately try to mislead us by calling instead of raising.
Using a probability distribution allows us to account for
uncertainty in our beliefs, which was not handled by the
previous architecture. This process of updating the weight
table is repeated for each entry.
An important advantage of the probability triple
representation is that imperfect information is restricted to
the Triple Generator and does not affect the rest of the
algorithm. This is similar to the way that alpha-beta
search restricts knowledge to the evaluation function. The
probability triple framework allows the “messy” elements
of the program to be amalgamated into one component,
which can then be treated as a “black box” by the rest of
the system. Thus, aspects like game-specific information,
complex expert-defined rule systems, and knowledge of
human behavior are all isolated from the engine that uses
this input.
weight
table
70%
KK 65%
1081
1
Public
Game
State
. . .
Hand Evaluator
1081
entries
1
Hand
Betting
Rule-base
Bettor
Public
Data
Figure 1. The architecture of Loki-1.
3. The Hand Evaluator uses enumeration techniques to
compute hand strength and hand potential for any hand
based on the game state and the opponent model.
4. The Bettor uses a set of expert-defined rules and a
hand assessment provided by the Hand Evaluator for
each betting decision: fold, call/check or raise/bet.
This design has several limitations. First, expert
knowledge appears in various places in the program,
making Loki difficult to maintain and improve. Second,
the Bettor returns a single value (fold, call, raise), which
does not reflect the probabilistic nature of betting
decisions. Finally, the opponent modeler does not
distinguish between the different actions that an opponent
might take. A call/check versus a bet/raise gives valuable
information about the opponent’s cards. These issues led
to a redesign of how knowledge is used in Loki.
The new version of Loki, called Loki-2 makes two
fundamental changes to the architecture. First, it
introduces a useful, new data object called a probability
triple that is used throughout the program (Section 4).
Second, simulation with selective sampling is used to
refine the betting strategy (Section 5). Loki-2 can be used
with or without simulation, as shown in Figure 2. With
simulation, the Simulator component replaces the simpler
Action Selector.
fold, call
or raise
4. Probability Triples
A probability triple is an ordered triple of values, PT =
[ f , c , r ], such that f + c + r = 1.0, representing the
probability distribution that the next betting action in a
given context is a fold, call, or raise, respectively.
Probability triples are used in three places in Loki-2. The
Action Selector uses a probability triple to decide on a
course of action (fold, call, raise). The Opponent Modeler
uses an array of probability triples to update the opponent
weight tables. The Simulator (see Section 5) uses
probability triples to choose actions for simulated
opponent hands.
3
AA
Private
Data
49261447.009.png
Opponent
Model
1081
O pponent Modeler
w eight table
70%
K K 65%
0
.2 .8
.3 .7
1081
1
0
Public
G ame
State
. . .
Triple Generator
1081 entries
H and Evaluator
H and
1
N
A ction Selector
Simulator
Betting
Rule-base
fold, call
or raise
Figure 2. Using the Triple Generator in Loki-2.
The current architecture also suggests future
enhancements, such as better methods for opponent
modeling. For example, the cards seen at the showdown
reveal clues about how that opponent perceived each
decision during the hand. These hindsight observations can
be used to adaptively measure important characteristics
like aggressiveness, predictability, affinity for draws, and
so forth. The Opponent Modeler can maintain each of these
properties for use by the Triple Generator, which combines
the information in proper balance with all the other factors.
The knowledge is implicitly encoded in the probability
distribution, and is thereby passed on to all components of
the system.
Since the more objective aspects of the game could
eventually be well defined, the ultimate strength of the
program may depend on the success in handling imperfect
information, and the more nebulous aspects of the game,
such as opponent modeling.
permit), the same approach is not feasible for real
imperfect information games because there are too many
possibilities to consider [Koller and Pfeffer, 1997].
Consider a 10-player game of Texas Hold’em. By the
time the flop cards are seen , some players may have folded.
Let’s assume one player bets, and it is Loki’s turn to act.
The program must choose between folding, calling or
raising. Which one is the best decision? 1
After the program’s decision, every other active player
will be faced with a similar choice. In effect, there is a
branching factor of 3 possible actions for each player, and
there may be several such decisions in each betting round.
Further, there are still two betting rounds to come, each of
which may involve several players, and one of many (45 or
44) unknown cards. Computing the complete poker
decision tree in real time is in general, prohibitively
expensive. Since we cannot consider all possible
combinations of hands, future cards, and actions, we
examine only a representative sample from the
possibilities. A larger sample and more informed selection
process will increase the probability that we can draw
meaningful conclusions.
5. Simulation-Based Betting Strategy
The original Bettor component consisted of expert-
defined rules, based on hand strength, hand potential, game
conditions, and probabilities. A professional poker player
defined the system as a first approximation of the return on
investment for each betting decision. As other aspects of
Loki improved, this simplistic betting strategy became the
limiting factor to the playing strength of the program.
Unfortunately, any rule-based system is inherently rigid,
and even simple changes were difficult to implement and
verify for correctness. A more flexible, computation-based
approach was needed.
In effect, a knowledge-based betting strategy is
equivalent to a static evaluation function. Given the current
state of the game, it attempts to determine the action that
yields the best result. If we use deterministic perfect
information games as a model, the obvious extension is to
add search to the evaluation function. While this is easy to
achieve in a perfect-information game such as chess
(consider all possible moves as deeply as resources
5.1 An Expected Value Based Betting Strategy
Loki-2’s new betting strategy consists of playing out
many likely scenarios to determine how much money each
decision will win or lose. Every time it faces a decision,
Loki-2 invokes the Simulator to get an estimate of the
expected value (EV) of each betting action (see the dashed
box in Figure 2 with the Simulator replacing the Action
Selector). A simulation consists of playing out the hand a
specified number of times, from the current state of the
game through to the end. Folding is considered to have a
zero EV, because we do not make any future profit or loss.
Each trial is played out twice—once to consider the
consequences of a check/call and once to consider a
4
AA
1 “Best” is subjective. Here we do not consider other plays, such as
deliberately misrepresenting the hand to the opponents. The expected
value for a whole session is more important than the expected value for a
single hand.
49261447.010.png 49261447.011.png
Zgłoś jeśli naruszono regulamin