This post is dedicated to a new and important result in game theory – the refutation of Mertens’ conjecture by Bruno Ziliotto. Stochastic games were defined by Shapley (1953). Such a game is given by

• a set Z of states,
• a set N = {1,2,…,n} of players,
• for every state z and every player i a set A_i(z) of actions available to player i at state z. Denote by Λ = { (z,(a_i)_{i ∈ N}) : a_i ∈ A_i(z) for every i} the set of all pairs (state, action profile at that state).
• for every player i, a stage payoff function u_i : Λ → R, and
• a transition function q : Λ → Δ(Z), where Δ(Z) is the set of probability distributions over Z.

The game starts at an initial state z^1 ∈ Z and is played as follows. At every stage t, each player i chooses an action a_i^t ∈ A_i(z^t) and receives the stage payoff u_i(z^t; a_1^t,…,a_n^t), and the play moves to a new state z^{t+1}, chosen according to q(z^t; a_1^t,…,a_n^t).
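To make the dynamics concrete, here is a minimal Python sketch of a single stage of play. The encoding is hypothetical: I assume the transition function is given as a map from (state, action profile) to a dictionary of next-state probabilities.

```python
import random

def play_stage(z, actions, u, q, rng=random):
    """Play one stage of a stochastic game (a sketch).

    z       -- the current state z^t
    actions -- the action profile (a_1^t, ..., a_n^t)
    u       -- list of stage payoff functions; u[i](z, actions) is player i's payoff
    q       -- transition function: q(z, actions) is a dict {next_state: probability}
    """
    payoffs = [u_i(z, actions) for u_i in u]
    dist = q(z, actions)
    states = list(dist.keys())
    # Draw the next state according to the transition distribution q(z; actions).
    next_state = rng.choices(states, weights=[dist[s] for s in states], k=1)[0]
    return payoffs, next_state
```

A play of the game is then just the iteration of `play_stage` from the initial state z^1.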

In this post I assume that all sets are finite. The N-stage game is a finite game, and therefore by backwards induction it has an equilibrium. As I mentioned in my previous post, the discounted game has an equilibrium (even a stationary equilibrium) because of continuity-compactness arguments.

Suppose now that the game is two-player zero-sum, and suppose that the players observe neither the states nor each other's actions. Rather, there is a set of signals S, and the transition function is a function q: Λ → Δ(Z × S × S). At each stage t, after the players choose their actions, a triplet (z^{t+1},s_1^t,s_2^t) is chosen, and each player i observes the private signal s_i^t. The signal of each player may be correlated with the current action of the other player (thereby providing information about the action that the other player just chose), with the current or new state (thereby providing information about the state variable), and/or with the payoff. Plainly, if a player does not know the current state, his set of actions should be the same in all states.

This model includes the model of repeated games with incomplete information (Aumann and Maschler, 1995), stochastic games with imperfect monitoring of actions (Coulomb (2003) and Rosenberg, Solan, and Vieille (2003)), Markov decision problems with partial observation of the state variable (Rosenberg, Solan, and Vieille, 2001), and many more models.

Denote by v_N(z) the value of the N-stage game at the initial state z, and by v_λ(z) the value of the λ-discounted game at the initial state z. Three decades ago Jean-François Mertens conjectured that the two limits lim_{N → ∞} v_N(z) and lim_{λ → 0} v_λ(z) exist and are equal. This conjecture was proved for many classes of games, but nobody was able to prove it in the most general case. Bruno Ziliotto, a PhD student from Toulouse, provided an example showing that Mertens’ conjecture is false: the discounted value does not converge as the discount factor goes to 0. Here it is.

There are 7 states, and each player has two actions in each state: C and Q. In this example information is symmetric: at every stage both players have the same information. This simplifies the analysis of the game, because there is no need to study beliefs over beliefs, beliefs over beliefs over beliefs, etc. In particular, each player observes the action that the other player has just chosen. In addition, there are two signals {D,E} that the players may observe, on top of the actions. Formally, the set of signals is S = {D,E} × {C,Q} × {C,Q}.
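As a quick sanity check, the primitives of the example can be written down directly from this description (a sketch; the state names are introduced in the next paragraphs):

```python
# The 7 states of Ziliotto's example and the common action set.
STATES = ["0*", "0+", "0++", "1*", "1+", "1++", "1T"]
ACTIONS = ["C", "Q"]

# A signal is a triple: the extra signal in {D, E} plus both observed actions.
SIGNALS = [(d, a1, a2) for d in ("D", "E") for a1 in ACTIONS for a2 in ACTIONS]

assert len(STATES) == 7
assert len(SIGNALS) == 2 * 2 * 2
```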

The payoff is independent of the actions of the players and depends only on the current state. In some states the payoff is 0 whatever the players play, and in others it is 1 whatever the players play. The name of each state indicates the payoff it yields:

In the states 0*, 0+, and 0++ the payoff is 0. This means that player 1 does not want the game to stay in these states.

In states 1*, 1+, 1++, and 1T the payoff is 1. This means that player 2 does not want the game to stay in these states.

States 0* and 1* are absorbing; once the game reaches those states, it remains there forever (and the payoff is 0 in 0*, and 1 in 1*).

In states 0+ and 0++ the transition depends only on the action of player 1. In states 1+, 1++, and 1T the transition depends only on the action of player 2. To complete the description of the game I need to describe the transition functions (transition of states and signals).  The following figure provides the needed description.

The figure on the top shows the transition function at states 0++ and 0+. At state 0++, the action Q of player 1 leads to the absorbing state 0* with payoff 0 and yields the signal E (a bad outcome for player 1), while at state 0+ the action Q of player 1 leads to state 1++ and yields the signal D. Thus, the signal reveals whether the game moved to 0* or to 1++. If at state 0++ player 1 plays C, then with probability 1/4 the game moves to state 0+ and the signal is D, with probability 1/4 the game remains in state 0++ and the signal is D, and with probability 1/2 the game remains in state 0++ and the signal is E. Similarly, if at state 0+ player 1 plays C, then with probability 1/2 the game remains in state 0+ and the signal is D, while with probability 1/2 the game moves to state 0++ and the signal is E. The figure at the bottom, which describes the transitions at states 1++, 1+, and 1T, can be read analogously.
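The transitions at the states controlled by player 1, read off the top figure, can be encoded as a small table; each entry is a list of (probability, next state, signal) triples:

```python
# Transitions at the two states controlled by player 1, as described above.
# (state, action) -> list of (probability, next_state, signal)
P1_TRANSITIONS = {
    ("0++", "Q"): [(1.0,  "0*",  "E")],
    ("0++", "C"): [(0.25, "0+",  "D"), (0.25, "0++", "D"), (0.5, "0++", "E")],
    ("0+",  "Q"): [(1.0,  "1++", "D")],
    ("0+",  "C"): [(0.5,  "0+",  "D"), (0.5,  "0++", "E")],
}

# Sanity check: each row is a probability distribution.
for row in P1_TRANSITIONS.values():
    assert abs(sum(prob for prob, _, _ in row) - 1.0) < 1e-12
```

Note that after playing C, the signal D appears with total probability 1/2 at both states; this is the key to the belief dynamics discussed below.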

What can we learn from these two figures?

1. Player 1 controls states 0++ and 0+, but wants to leave them as soon as possible, because the payoff is 0 as long as the game stays there.
2. Similarly, player 2 controls states 1++, 1+, and 1T, but wants to leave them as soon as possible, because the payoff is 1 as long as the game stays there.
3. The players know when the game is absorbed and when it moves from 0+ to 1++, and from 1+ to 0++.
4. Points 1 and 3 mean that player 1 wants to maximize the discounted probability of moving to state 1++. The term discounted probability means the discounted sum Σ_{n=1}^∞ λ(1-λ)^n P(reach state 1++ at stage n). To this end player 1 needs to play the action Q at state 0+. Since he does not know when the game is in 0+, he should maximize the probability of being in this state when he plays Q. The continuation payoff at state 1++ is irrelevant for his optimal strategy; as long as this payoff is positive, he wants to maximize the discounted probability of obtaining it (because the payoff is 0 until the game reaches state 1++).
5. At states 0+ and 0++, whenever player 1 plays C and the signal E is observed, the play moves to state 0++ (and in particular the game is not in 0+). By Bayes’ rule, whenever the signal D is observed after playing C, the probability of being in state 0+ increases. Indeed, after playing C the signal D appears with total probability 1/2 regardless of the state; hence if at a given stage the probability of being in state 0+ is p, player 1 plays the action C, and the signal is D, then by Bayes’ rule the probability of being in state 0+ at the next stage is ((1/2)p + (1/4)(1-p)) / (1/2) = p + (1/2)(1-p) > p (whenever p < 1).
6. The decision problem of player 1 thus boils down to a Markov decision problem whose state space is the space [0,1] of beliefs: a state of the MDP is the probability p that the game is at 0+. At belief p, choosing the action Q moves the process with probability 1-p to an absorbing state with payoff 0 (state 0*) and with probability p to the good state 1++. Choosing C moves the process with probability 1/2 to belief 0 (the signal is E, so we know that the state is 0++) and with probability 1/2 to belief p + (1/2)(1-p) (the signal is D).
7. It is not trivial to solve this MDP, but Bruno Ziliotto did it for us, and found the value for both initial states 0++ and 0+.
8. A similar analysis can be done for player 2 and the states 1++, 1+, and 1T. The addition of state 1T makes the MDP that player 2 faces more complicated, but Bruno was brave enough to solve it as well. When plugging in all the numbers, it turns out that (v_λ) does not converge as the discount factor λ goes to 0.
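Points 5 and 6 can be explored numerically. The sketch below is not Ziliotto’s solution; it is a simplified model in which, following point 4, reaching state 1++ is worth a terminal reward of 1 discounted to the stage at which it is reached. It implements the Bayes update of point 5 and approximates the value of player 1’s belief MDP by truncated backward induction:

```python
from functools import lru_cache

def update(p):
    """Bayes update of point 5: belief of being in 0+ after playing C and seeing D."""
    return p + 0.5 * (1 - p)

def value(p0, lam, horizon=120):
    """Approximate value of player 1's belief MDP for discount factor lam (a sketch).

    Proxy objective, per point 4: the expected discounted weight (1-lam)**t
    of the stage t at which the play reaches state 1++.
    Q at belief p succeeds with probability p (and is absorbed at 0* otherwise);
    C keeps the payoff at 0 and sends the belief to update(p) on signal D
    and to 0 on signal E, each with probability 1/2.
    Truncation at `horizon` costs at most (1-lam)**horizon in accuracy.
    """
    @lru_cache(maxsize=None)
    def V(p, t):
        if t == horizon:
            return 0.0
        quit_value = p * (1 - lam)                      # play Q now
        cont_value = (1 - lam) * 0.5 * (V(update(p), t + 1) + V(0.0, t + 1))
        return max(quit_value, cont_value)

    return V(p0, 0)
```

For instance, at belief p = 1 playing Q immediately is optimal, so value(1.0, lam) equals 1 - lam, while at belief 0 (i.e., when the state is known to be 0++) the value is still positive, since playing C builds up the belief over time.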

That’s it. A short example that puts an end to the efforts of many game theorists who tried hard to prove Mertens’ conjecture.

Chapeau à Bruno!