A Markov Decision Problem (MDP) is a model for sequential decision making, in which the underlying state of nature evolves in a stationary way. An MDP is given by a set of states S, an initial state s(0) in S, a set of available actions A(s) for each state s in S, and, for each state s in S and available actions a in A(s), a payoff r(s,a) and a probability distribution q(. | s,a) on S.
The process starts at the initial state s(0). At every stage n, the current state s(n) is known, the decision maker chooses an action a(n) in A(s(n)), receives the stage payoff r(s(n),a(n)), and a new state s(n+1) is chosen according to q(. | s(n),a(n)) and is told to the decision maker. The decision maker’s goal is to maximize the discounted sum of his stage payoffs:
sum-over-n-from-0-to-infty of λ-to-the-power-n times r(s(n),a(n)).
The value of the MDP, that is, the maximum that the decision maker can obtain, depends on the discount factor λ. Denote by v(λ) the value function of the MDP. Which functions can be obtained as the value function of some MDP with finite sets of states and actions?
From now on I restrict the discussion to MDP’s with finite sets of states and actions. Blackwell (1965) proved that for every discount factor λ the decision maker has a pure stationary optimal strategy. It is easy to see that the payoff that corresponds to a pure stationary optimal strategy is the solution of a set of equations, which are linear in λ, and whose coefficients are determined by the payoff function r and the transition function q. It follows that for every pure stationary strategy σ, the corresponding payoff function g(λ ; σ,s) is a rational function of λ. Since there are finitely many pure stationary strategies, we deduce that the value function is the maximum of finitely many rational functions.
Can we obtain any maximum of rational functions as the value function of some MDP? The answer is negative. For example, since the set of states and actions are finite, the payoff function r is bounded, say, by M. In particular, the payoff function of any strategy is bounded by M/(1-λ). In particular, any rational function whose denominator has a root inside the unit ball of the complex plane, or that has a root on the unit ball of the complex plane that has multiplicity larger than 1, cannot be the value function of an MDP.
Is that the only restriction that we have? The answer is still negative. It is not difficult to see that the roots of the denominator of the payoff function of a pure stationary strategy are the inverse of the eigenvalues of the transition matrix, which by a known result in matrix theory must be unit roots, that is, for any root ω of the denominator (which is a complex number) there is an integer k such that ω-to-the-power-k is equal to 1. Thus, a rational function whose denominator has a root that lies on the unit ball of the complex plane and is not a unit root cannot be the value function of an MDP.
Is that all? Yes. Let F be the set of all rational functions f : [0,1) → R that satisfy the following property: any root of the denominator either (a) lies outside the unit ball of the complex plane, or (b) lies on the unit ball of the complex plane, has multiplicity 1, and is a unit root. Let V be the set of all functions that are the maximum of finitely many functions in F. A function v is the value function of some MDP if and only if it is in V.
In this post I outlined one direction of the proof. Anyone who is interested in reading the construction that proves the other direction is referred to this paper.