# On Markov Decision Processes

Sequential planning under uncertainty is a basic optimization problem that arises in many different settings, ranging from artificial intelligence to operations research. In a generic system, we have an agent who chooses among different actions and then receives a reward, after which the system moves on in a stochastic way. Usually the aim is to maximize the expected (discounted) reward of the system over a finite or, in certain cases, as described below, an infinite time horizon.

To obtain a tractable problem, it is often assumed that the transition law of the underlying state process is Markovian, i.e., that only the current state has an influence on future states. Such a situation leads to a Markov decision process (MDP); textbooks on MDPs include [1, 3, 5, 7]. MDPs differ from general stochastic control problems in that the actions are taken at discrete time points, rather than continuously. Stochastic shortest-path problems, consumption and investment of money, allocation of resources, production planning, and harvesting problems are a few examples of MDPs.

The formulation and development of MDPs started in the 1950s with Shapley, Bellman, and, later, Howard, Dubins, Savage, and Blackwell. The early achievements are closely related to the study of stochastic dynamic games. Subtle mathematical problems in the theory include measurability issues with arbitrary Borel state spaces, which naturally arise in, for example, partially observable Markov decision processes. However, with a problem that has enough structure, the solution algorithm is a very simple backward induction algorithm, which works as follows:

Suppose that the decision time points are numbered \(n = 0, 1, . . . , N\), the one-stage rewards are \(r_n\), the terminal reward is \(r_N\), the transition kernel of the state process is \(Q_n( \cdot | x,a)\), and the set of admissible actions at stage \(n\) given state \(x\) is \(D_n(x)\). The value functions \(V_n(x)\) that represent the maximal expected reward starting in state \(x\) from stage \(n\) up to the final stage \(N\) can then be recursively computed by the following optimality equation:

\[\begin{equation}\tag{1}

V_N (x) = r_N (x).

\end{equation}\]

\[\begin{equation}\tag{2}

V_n(x)= \sup_{a \in D_n(x)}\{r_n(x,a)+\int V_{n+1}(y)Q_n(dy|x,a)\}.

\end{equation}\] When we finally reach \(V_0\) we have computed the maximal expected reward of the system up to the time horizon \(N\); the maximizers in this recursion yield the optimal strategy. This algorithm is very straightforward; the real challenge in applying the theory comes in the “curse of dimensionality.” Going forward in time, the number of states that can be reached often increases dramatically with the number of admissible actions. For large \(n\), the number of optimization problems to be solved is vast. If, for example, every state can have just two possible successors, then at stage \(n\) there are already \(2^n\) different states for which we have to solve \((2)\). This is not feasible for most interesting applications.

But we do have hope of being able to solve such problems. If we have a stationary model and a long time horizon, one trick is to approximate the problem by one with an infinite time horizon. With an infinite time horizon, we are left with a fixed-point problem to solve; various algorithms are available for such problems, including policy iteration and linear programming.

Many problems are not stationary, however. In such cases we need to use tools from approximate dynamic programming (see, for example, [6]). Loosely speaking, the solution methods combine backward induction with a forward simulation of states. The idea is to improve a given approximation of the value functions along a simulated path of the state process.

**Figure 1.**Typical set of admissible actions.

To find a fair price for storage, we first need a stochastic model for the gas pricing process. Most gas price models are continuous. One of the first models is the Schwartz model (see [8]), in which the log-price dynamics for gas is given by

\[\begin{equation}

d~ \mathrm{log}(P_t ) = \alpha (\mu_t – \mathrm{log}(P_t)) dt + \sigma_t dW_t,

\end{equation}\]

where \(W_t\) is a Brownian motion, \(\sigma_t\) is the time-dependent volatility of the process, \(\mu_t\) is its mean, and \(\alpha\) is the mean-reversion factor.

The valuation problem can now be solved as an MDP. The state of the problem is the current storage level \(x\), together with the current gas price \(p\); the action \(a\) is the change in the amount of gas. The set of admissible actions is quite complicated (Figure 1 shows a typical set) – the capacity of the gas storage is of course restricted, and the maximal speed at which gas can be injected or withdrawn depends on the storage level.

To include transaction costs, a loss of gas at the pump, and/or a bid–ask spread in the market, we introduce for a quantity of gas the “ask price” \(k\) and the “bid price” \(e\), which we assume to be affine functions of the price. The one-stage reward function \(r_n (p,a)\) of the MDP is given by \(–k (p) \cdot a\) if \(a > 0\), by \(0\) if \(a = 0\), and by \(–e (p) \cdot a\) if \(a < 0\).

The terminal reward function depends on the contract. Having specified the data, we can easily write down the optimality equation. The first step then is to get as much information as possible about the value function and the maximizers from the optimality equation in order to simplify the numerical algorithms. Here, indeed, it is possible to figure out (by induction) the structure of the optimal injection and withdrawal strategy (see, for example, [2, 9]), which can be characterized by three regions that depend on the gas price \(p\): When the current gas storage level is below a certain bound \(\underline{b} (p)\), it is optimal to inject gas, either as much as possible or up to \(\underline{b} (p)\), whichever occurs first. If the current gas storage level is above a certain bound \(\bar{b} (p)\), it is optimal to withdraw gas, either as much as possible or down to \(\bar{b} (p)\), whichever occurs first. When the level is in between, the optimal strategy is to do nothing.

**Figure 2.**Price grid for the recombining tree (black) with simulated price paths.

As described in [2], we solved the gas storage problem using different numerical algorithms, all based on a combination of the backward induction algorithm and knowledge of the structure of the optimal strategy. One algorithm used a recombining (linearly growing) tree to approximate the gas pricing process (see Figure 2) and hence avoid the curse of dimensionality. Another algorithm combined backward induction with a forward simulation of the gas pricing process, with linear regression on a finite number of basis functions to approximate the value function (see also [4]). The resulting optimal strategy can be seen in Figure 3. These algorithms work rather well and can also be used for more complicated gas pricing process models.

**Figure 3.**Strategy bounds for a regression method:

*(light blue) and*

__b___{n}(p)*b*(dark blue).

_{n}(p)*This article was contributed by the SIAM Activity Group on Control and Systems Theory via Francois Dufour of the University of Bordeaux, the group’s SIAM News liaison. The biennial conference of SIAG/CST will be held in Paris this summer from July 8-10.*

**References**

[1] N. Bäuerle and U. Rieder, *Markov Decision Processes with Application to Finance*, Springer, Berlin, 2011.

[2] N. Bäuerle and V. Riess, *Gas storage valuation with regime switching*, (2014), arXiv: 1412.1298.

[3] D.P. Bertsekas, *Dynamic Programming and Optimal Control*, Vol. 2, Athena Scientific, Belmont, MA, 2012.

[4] A. Boogert and C. de Jong, Gas storage valuation using a multifactor price process, *J. Energy Markets, 4* (2011), 29-52.

[5] O. Hernández-Lerma and J.B. Lasserre, *Discrete-time Markov Control Processes*, Springer, Berlin, 1996.

[6] W.B. Powell, *Approximate Dynamic Programming: Solving the Curses of Dimensionality*, John Wiley & Sons, Hoboken, NJ, 2011.

[7] M.L. Puterman, *Markov Decision Processes: Discrete Stochastic Dynamic Programming*, John Wiley & Sons, Hoboken, NJ, 2009.

[8] E. Schwartz, The stochastic behavior of commodity prices: Implications for valuation and hedging, *J. Finance, 52* (1997), 923-973.

[9] N. Secomandi, Optimal commodity trading with a capacitated storage asset, *Management Science, 56* (2010), 449-467.