Optimal Policies for Quantum Markov Decision Processes

Ming-Sheng Ying, Yuan Feng, Sheng-Gang Ying

Citation: M. S. Ying, Y. Feng, S. G. Ying. Optimal policies for quantum Markov decision processes. International Journal of Automation and Computing. http://doi.org/10.1007/s11633-021-1278-z

doi: 10.1007/s11633-021-1278-z

    Author Bio:

    Ming-Sheng Ying is a Distinguished Professor and Research Director of the Center for Quantum Software and Information at the University of Technology Sydney, Australia. He is also Deputy Director for Research (adjunct position) at the Institute of Software at the Chinese Academy of Sciences, and holds the Cheung Kong Chair Professorship at Tsinghua University, China. He has published books: Model Checking Quantum Systems: Principles and Algorithms (2021) (with Yuan Feng), Foundations of Quantum Programming (2016) and Topology in Process Calculus: Approximate Correctness and Infinite Evolution of Concurrent Programs (2001). He received a China National Science Award in Natural Science (2008). He has served on the editorial board of several publications including Artificial Intelligence. He is currently Editor-in-Chief of ACM Transactions on Quantum Computing. His research interests include quantum computation, theory of programming languages, and logics in AI. Email: mingsheng.ying@uts.edu.au (Corresponding author) ORCID iD: 0000-0003-4847-702X

    Yuan Feng received the B.Sc. degree in mathematics from Department of Applied Mathematics, Tsinghua University, China in 1999, and received the Ph.D. degree in computer science from Department of Computer Science and Technology, Tsinghua University, China in 2004. He is currently a professor at Centre for Quantum Software and Information (QSI), University of Technology Sydney (UTS), Australia. His research interests include quantum programming theory, quantum information and quantum computation, and probabilistic systems. E-mail: yuan.feng@uts.edu.au ORCID iD: 0000-0002-3097-3896

    Sheng-Gang Ying received the B.Sc. degree in physics from Department of Physics, Tsinghua University, China in 2010, and received the Ph.D. degree in computer science from Department of Computer Science and Technology, Tsinghua University, China in 2015. He is currently an associate researcher at the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China. His research interests include quantum programming theory and quantum Markov systems. E-mail: yingsg@ios.ac.cn ORCID iD: 0000-0002-5052-5142

Publication history
  • Received: 2020-07-22
  • Accepted: 2021-01-13
  • Published online: 2021-03-20


    • The Markov decision process (MDP) offers a general framework for modelling sequential decision making where outcomes are random[1]. It stemmed from operations research and has been widely used in a broad range of areas, including manufacturing, economics, ecology, biology, automatic control and robotics. Since Kaelbling et al.[2] introduced MDPs, in particular partially observable Markov decision processes (POMDPs), into artificial intelligence (AI), they have been successfully applied in planning, scheduling and machine learning, to name just a few areas.

    • Recently, MDPs have been generalised into the quantum world in two slightly different ways:

      1) The notion of quantum observable Markov decision process (QOMDP) was defined by Barry et al.[3] as a quantum generalisation of Kaelbling et al.′s POMDP[2]. The following two problems were studied there:

      i) Policy existence problem for the infinite horizon: Given a QOMDP, a starting state, and a value $ V $, is there a policy that achieves reward at least $ V $?

      ii) Goal-state reachability problem for the finite horizon: Given a QOMDP, a starting state $ s $, and a goal state $s' $, is there a policy that reaches $s' $ from $ s $ with probability 1?

      The most interesting result in [3] is a computability separation between POMDPs and QOMDPs indicating that the goal-state reachability is decidable for POMDPs but undecidable for QOMDPs.

      2) Another quantum generalisation of MDPs, called qMDP, was defined in [4]. A major difference between QOMDPs and qMDPs is that a policy in a QOMDP maps a (pure or mixed) quantum state directly to an action, whereas a policy in a qMDP maps the outcome of a measurement on a quantum state to an action. It was proved that the goal-reachability problem for infinite-horizon qMDPs with probability $ 1 $ or $ p<1 $ is EXP-hard or undecidable, respectively. The authors have employed quantum Markov chains (QMCs) as the semantic model in their research on static analysis of quantum programs[5, 6]. In particular, the termination problem of quantum programs can be reduced to reachability of QMCs. This observation further led the authors to develop model checking techniques for quantum systems[7-9]. Essentially, the main results in [4] are extensions of the corresponding results of [9] to qMDPs.

    • In the last five to ten years, a new research line has been rapidly emerging at the intersection of quantum physics and AI & machine learning[10,11]. The interaction between these two areas is bidirectional:

      1) Quantum physics helps to solve AI & machine learning problems via quantum computation.

      2) AI & machine learning methodologies and techniques are employed to help solve problems in quantum physics.

      For more detailed discussions about this area, the reader is referred to several excellent surveys[12-14].

      Reinforcement learning is a basic machine learning paradigm, in which an agent learns behaviour through trial-and-error interactions with a dynamic environment[15, 16]. Several quantum reinforcement learning models have already been proposed, either to enable the application of reinforcement learning in the quantum world or to enhance reinforcement learning by exploiting quantum advantage (see for example [17-20]). A question naturally arises here: How can the quantum Markov decision processes introduced in [3, 4] be used as a mathematical framework of quantum reinforcement learning?

    • The main problem considered in [3, 4] is the reachability of quantum Markov decision processes (QOMDPs and qMDPs). However, a crucial step in classical reinforcement learning is to find an optimal behaviour of the agent that can usually be formulated as an optimal policy in MDPs. This paper solves the optimal policy problem for quantum Markov decision processes to provide a useful mathematical tool for decision making and reinforcement learning in the quantum world. In this paper, we focus on the case of finite horizon. The case of infinite horizon will be discussed in a forthcoming paper. We adopt a model that extends a qMDP defined in [4].

      The paper is organised as follows: Quantum mechanics is briefly reviewed in Section 2 in a way that the AI community can easily understand. Several basic notions, including qMDP, policy and expected reward, are defined in Section 3. A backward recursion for the expected reward with a given policy is established, and an algorithm based on it for computing the expected reward is presented in Section 4. In Section 5, the Bellman principle of optimality is generalised to qMDPs and an algorithm for finding optimal policies for qMDPs is given. A key step in the algorithms presented in Sections 4 and 5 is the computation of quantum probabilities. For readability, we separate it from the other parts of the algorithms and solve it in Section 6. An illustrative example is shown in Section 7. The paper concludes with several remarks about further studies.

    • For the convenience of the reader, we review the basics of quantum mechanics. In this paper, we only consider quantum systems whose state spaces are finite-dimensional. So, their mathematical descriptions can be presented in the language of vectors and matrices. We assume the reader is familiar with matrix algebra. All operations (e.g., addition, multiplication, scalar product) of vectors and matrices used in this paper are standard.

      Quantum states. Following the convention in quantum theory, we use the Dirac notation to write

      $$ |\psi\rangle = (a_1,\cdots,a_{n})^{\rm{T}} $$

      for a column vector, i.e., an element in the $ n $-dimensional complex vector space ${\bf{C}}^n$, where $\bf{C}$ is the field of complex numbers, and ${\rm{ T }}$ stands for transpose. If the components of the vector satisfy the normalisation condition:

      $$ \sum\limits_{i = 1}^n |a_i|^2 = 1$$

      then $ |\psi\rangle $ is called a unit vector. The dual of $ |\psi\rangle $ is the row vector $\langle\psi| = (a_1^\ast,\cdots,a_n^\ast)$ of complex conjugates. According to the basic postulates of quantum mechanics, a pure state of an $ n $-level quantum system can be represented by a unit vector $|\psi\rangle\in{\bf{C}}^n$. For example, the state of a qubit (quantum bit) is a 2-dimensional vector:

      $$ (\alpha, \beta)^{\rm T} = \left(\begin{array}{c}\alpha\\ \beta\end{array}\right) = \alpha |0\rangle + \beta |1\rangle $$

      with $ |\alpha|^2+|\beta|^2 = 1 $, where $\{|0\rangle = (1,0)^{\rm{T}}, |1\rangle = (0,1)^{\rm{T}}\}$ is a basis of the 2-dimensional space ${\bf{C}}^2$. Two example qubit states are

      $$ |\pm\rangle = \dfrac{1}{\sqrt{2}}(|0\rangle\pm |1\rangle). $$

      More generally, a mixed state of an $ n $-level system is described by an $ n\times n $ positive semi-definite matrix $ \rho = (\rho_{ij}) $ with unit trace:

      $$ {{\rm{tr}}}(\rho) = \sum\limits_i \rho_{ii} = 1 $$

      called a density matrix in ${\bf{C}}^n$. It turns out that each density operator $ \rho $ can be written in the form of

      $$ \rho = \sum_i p_i |\psi_i\rangle\langle\psi_i| $$

      where $ \{|\psi_i\rangle\} $ is a family of pure states, and $ \{p_i\} $ is a probability distribution. So, mixed state $ \rho $ can be interpreted as follows: The system is in state $ |\psi_i\rangle $ with probability $ p_i $. For example, if a qubit is in state $ |0\rangle $ with probability $ \dfrac{2}{3} $ and in state $ |-\rangle $ with probability $ \dfrac{1}{3} $, then it can be depicted by the density matrix:

      $$ \rho_0 = \dfrac{2}{3}|0\rangle\langle 0|+\dfrac{1}{3}|-\rangle\langle -| = \dfrac{1}{6}\left (\begin{array}{cc}5 & -1\\ -1&1\end{array}\right). $$
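      For concreteness, the density matrix $ \rho_0 $ above can be reproduced numerically. The following is a small sketch in Python/NumPy (the variable names are ours, not from the paper):

```python
import numpy as np

# Pure states |0>, |1> and |-> = (|0> - |1>)/sqrt(2) as column vectors
ket0 = np.array([[1], [0]], dtype=complex)
ket1 = np.array([[0], [1]], dtype=complex)
ket_minus = (ket0 - ket1) / np.sqrt(2)

# rho_0 = 2/3 |0><0| + 1/3 |-><-|
rho0 = (2/3) * (ket0 @ ket0.conj().T) + (1/3) * (ket_minus @ ket_minus.conj().T)

print(np.allclose(rho0, np.array([[5, -1], [-1, 1]]) / 6))   # True
print(np.isclose(np.trace(rho0).real, 1.0))                  # unit trace
```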

      Dynamics of quantum systems. For any matrix $ A = (a_{ij}) $, we write $ A^\dagger $ for the transpose and conjugate of $ A $, i.e., $ A^\dagger = (b_{ij}) $ with $ b_{ij} = a_{ji}^\ast $ for all $ i, j $. An $ n\times n $ complex matrix $ U $ is called a unitary matrix if

      $$ U^\dagger U = I_{n\times n} $$

      where $ I $ is the identity matrix. If the states of a closed quantum system at times $ t $ and $t'$ are $ |\psi\rangle $ and $ |\psi^\prime\rangle $, respectively, then they are related to each other by a unitary matrix $ U $ which depends only on the times $ t $ and $ t' $:

      $$ |\psi'\rangle = U|\psi\rangle. $$

      If the system is in mixed states $ \rho $ and $ \rho^\prime $ at times $ t $ and $ t^\prime $, respectively, then

      $$ \rho^\prime = U\rho U^\dagger . $$

      For example, the NOT gate and Hadamard gate on a qubit are respectively described by unitary matrices:

      $$ X = \left(\begin{array}{cc}0 & 1\\ 1&0\end{array}\right),\; H = \dfrac{1}{\sqrt{2}}\left(\begin{array}{cc}1 & 1\\ 1 &-1\end{array}\right) $$

      and the state $ \rho_0 $ is transferred by $ H $ into

      $$ H\rho_0 H^\dagger = \dfrac{1}{3}\left(\begin{array}{cc}1 & 1\\ 1&2\end{array}\right). $$
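      The unitary evolution $ \rho^\prime = U\rho U^\dagger $ can be checked directly; a minimal sketch in Python/NumPy (variable names ours):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)                 # NOT gate
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)   # Hadamard gate

rho0 = np.array([[5, -1], [-1, 1]], dtype=complex) / 6
rho1 = H @ rho0 @ H.conj().T                                  # rho' = U rho U^dagger

print(np.allclose(rho1, np.array([[1, 1], [1, 2]]) / 3))      # True
```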

      More generally, the dynamics of an open quantum system is described by a super-operator. The notion of super-operator can be introduced in several different (but equivalent) ways. Here, we choose to use the Kraus operator-sum representation, which is convenient for computation. A super-operator $ {\cal{E}} $ transforms a density matrix to another and is defined by a family of $ n\times n $ matrices $ \{E_i\} $:

      $$ {\cal{E}}(\rho) = \sum\limits_i E_i\rho E_i^\dagger $$

      for each density matrix $ \rho $, where it is required that

      $$ \sum\limits_i E_i^\dagger E_i = I_{n\times n}. $$

      It is obvious that $ {\cal{E}} $ degenerates to a unitary matrix whenever $ \{E_i\} $ is a singleton. For example, the bit flip action transfers the state of a qubit from $ |0\rangle $ to $ |1\rangle $ and vice versa, with probability $ 1-p $, $ 0\leq p\leq 1 $. It is described by the super-operator:

      $$ {\cal{E}}(\rho) = E_0\rho E_0^\dagger +E_1\rho E_1^\dagger $$

      where

      $$ E_0 = \sqrt{p}I,\; E_1 = \sqrt{1-p}X $$

      and $ I, X $ are the $ 2\times 2 $ unit matrix and the NOT gate, respectively. For example, state $ \rho_0 $ is transformed by $ {\cal{E}} $ to

      $$ {\cal{E}}(\rho_0) = \left(\begin{array}{cc}\dfrac{1}{6}+\dfrac{2p}{3} & -\dfrac{1}{6}\\ -\dfrac{1}{6} & \dfrac{5}{6}-\dfrac{2p}{3} \end{array}\right). $$
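      A super-operator given by Kraus operators is straightforward to apply numerically. The sketch below (Python/NumPy, illustrative names, with $ p $ set to an arbitrary value) reproduces the bit-flip example:

```python
import numpy as np

def apply_superoperator(kraus_ops, rho):
    """E(rho) = sum_i E_i rho E_i^dagger for a list of Kraus operators."""
    return sum(E @ rho @ E.conj().T for E in kraus_ops)

p = 0.25                                        # illustrative value of p
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
kraus = [np.sqrt(p) * I2, np.sqrt(1 - p) * X]   # bit-flip channel

rho0 = np.array([[5, -1], [-1, 1]], dtype=complex) / 6
out = apply_superoperator(kraus, rho0)
expected = np.array([[1/6 + 2*p/3, -1/6], [-1/6, 5/6 - 2*p/3]])
print(np.allclose(out, expected))               # True
```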

      Quantum measurements. To acquire information about a quantum system, a measurement must be performed on it. A quantum measurement on an $ n $-level system is described by a collection $ M = \{M_m\} $ of $ n\times n $ complex matrices satisfying the normalisation condition:

      $$ \sum\limits_m M^\dagger _m M_m = I_{n\times n} $$

      where the indices $ m $ stand for the measurement outcomes. We write $ O(M) = \{m\} $ for the set of all possible outcomes of $ M $. If the state of a quantum system is $ |\psi\rangle $ immediately before the measurement, then the probability that result $ m $ occurs is

      $$ p(m) = \langle \psi|M_m^\dagger M_m |\psi\rangle $$

      and the state of the system after the measurement is

      $$ \dfrac{M_m |\psi\rangle}{\sqrt{p(m)}}. $$

      If the state of a quantum system was $ \rho $ before measurement, then the probability that result $ m $ occurs is

      $$ \begin{array}{l} p(m) = {{\rm{tr}}}(M_m\rho M_m^\dagger ) \end{array} $$ (1)

      and the state after the measurement is

      $$ \begin{array}{l} \dfrac{M_m\rho M_m^\dagger }{p(m)}. \end{array} $$ (2)

      For example, the measurement on a qubit in the computational basis $ \{|0\rangle, |1\rangle\} $ is $ M = \{M_0,M_1\} $, where

      $$ M_0 = |0\rangle\langle 0| = \left(\begin{array}{cc}1&0\\0&0\end{array}\right),\; M_1 = |1\rangle\langle 1| = \left(\begin{array}{cc}0&0\\0&1\end{array}\right). $$

      If we perform $ M $ on a qubit in state $ \rho_0 $, then the probability that we get outcome $ 0 $ is

      $$ p(0) = {\rm{tr}}(M_0\rho_0 M_0^\dagger ) = {\rm{tr}}\left(\begin{array}{cc}\dfrac{5}{6}&0\\ 0&0\end{array}\right) = \dfrac{5}{6} $$

      and the probability of outcome $ 1 $ is $ p(1) = \dfrac{1}{6} $. In the case that the outcome is $ 0 $, the qubit will be in state $ |0\rangle $ after the measurement, and in the case that the outcome is $ 1 $, it will be in state $ |1\rangle $.
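      Equations (1) and (2) can be implemented in a few lines; the following Python/NumPy sketch (names ours) reproduces the computation above:

```python
import numpy as np

def measure(M_m, rho):
    """Probability and post-measurement state for outcome m, following (1) and (2)."""
    unnorm = M_m @ rho @ M_m.conj().T
    p = np.trace(unnorm).real
    return p, (unnorm / p if p > 0 else None)

M0 = np.array([[1, 0], [0, 0]], dtype=complex)   # |0><0|
M1 = np.array([[0, 0], [0, 1]], dtype=complex)   # |1><1|
rho0 = np.array([[5, -1], [-1, 1]], dtype=complex) / 6

p0, post0 = measure(M0, rho0)
p1, post1 = measure(M1, rho0)
print(np.isclose(p0, 5/6), np.isclose(p1, 1/6))  # True True
```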

      Composite quantum systems. In the example presented in Section 7, we will need the notion of a tensor product of vector spaces. For each $ 1\leq i\leq n $, let $ {\cal{H}}_i $ be a vector space with $ \{|\psi_{ij}\rangle\} $ as an orthonormal basis. Then the tensor product ${\cal{H}}_1\otimes \cdots\otimes{\cal{H}}_n$ is the vector space with $\{|\psi_{1j_1},\cdots,\psi_{nj_n}\rangle\}$ as an orthonormal basis. For example, the state space of two qubits is ${\bf{C}}^2\otimes{{\bf{C}}}^2$. A two-qubit system can be in a separable state

      $$ |\psi\rangle = |\psi_1\rangle\otimes |\psi_2\rangle = |\psi_1,\psi_2\rangle $$

      where $ |\psi_1\rangle,|\psi_2\rangle $ are one-qubit states, e.g., $|0,0\rangle, |0,1\rangle, $$ |1,+\rangle, |+,-\rangle $. It can also be in an entangled state that cannot be written as the product of two one-qubit states, like the EPR (Einstein-Podolsky-Rosen) pair or Bell state:

      $$ |\beta_{00}\rangle = \dfrac{1}{\sqrt{2}}(|00\rangle+|11\rangle). $$
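      Tensor products of states correspond to Kronecker products of their coordinate vectors; a short check of the Bell state (Python/NumPy sketch, names ours):

```python
import numpy as np

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)

# |beta_00> = (|00> + |11>)/sqrt(2), built with the Kronecker product
beta00 = (np.kron(ket0, ket0) + np.kron(ket1, ket1)) / np.sqrt(2)
print(np.isclose(np.linalg.norm(beta00), 1.0))   # a unit vector in C^2 (x) C^2
```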
    • Recall from [1] that an MDP consists of decision epochs, states, actions, transition probabilities and rewards. The decision epochs are the points of time where decisions are made. In this paper, we only consider the case of finite horizon: the set $ {\cal{T}} $ of decision epochs is finite. We write $ {\cal{S}} $ for the set of possible states of the system and $ {\cal{A}} $ for the set of allowable actions. At each decision epoch, the system occupies a state $ s\in {\cal{S}} $, and the decision maker takes an action $ a $ chosen from $ {\cal{A}} $. As a result of taking action $ a $ in state $ s $ at decision epoch $ t $, the decision maker receives a reward $ r_t(s,a) $, and the system evolves as follows: At the next decision epoch, the system is in state $ s^\prime $ with probability $ p_t(s^\prime|s,a) $.

      A qMDP is a quantum generalisation of MDP where the dynamics of and the observation on the system are governed by the laws of quantum mechanics. Formally, we have:

      Definition 1. A qMDP is a $ 7 $-tuple

      $$ {\cal{P}} = ({\cal{T}},{\cal{H}},\rho, {\cal{A}},\{{\cal{E}}_t(\cdot |a):t\in{\cal{T}}, a\in{\cal{A}}\}, {\cal{M}},\{r_t:t\in {\cal{T}}\}) $$

      where:

      1) ${\cal{T}} = \{1,2,\cdots,N\}$ is the set of decision epochs.

      2) ${\cal{H}} = {\bf{C}}^n$ is the state space of an $ n $-level quantum system.

      3) $ \rho $ is a density matrix in $ {\cal{H}} $, called the starting state.

      4) $ {\cal{A}} $ is a set of action names.

      5) For each $ t\in{\cal{T}} $ and $ a\in{\cal{A}} $, $ {\cal{E}}_t(\cdot |a) $ is a super-operator on $ {\cal{H}} $.

      6) $ {\cal{M}} $ is a set of quantum measurements in $ {\cal{H}} $. We write:

      $$ {\cal{O}} = \bigcup\limits_{M\in{\cal{M}}}\left[\{M\}\times O(M)\right]. $$

      7) For each $ 1\leq t\leq N-1 $, $r_t:{\cal{O}}\times{\cal{A}}\rightarrow {{\bf{R}}}$ (real numbers) is the reward function at decision epoch $ t $, and $r_N:{\cal{O}}\rightarrow {\bf{R}}$ is the reward function at the final decision epoch $ N $.

      An MDP is a decision maker together with a classical (but stochastic) system on which the decision maker can take actions. In contrast, a qMDP consists of a decision maker and a quantum system of which the state space is ${\cal{H}} = {\bf{C}}^n$. The state of this quantum system is described by a density matrix. $ {\cal{A}} $ and $ {\cal{M}} $ are the sets of actions and measurements, respectively, allowable to perform on the system. For each decision epoch $ t\in{\cal{T}} $, the decision maker acquires information about the system through performing a chosen measurement $ M\in{\cal{M}} $. It is possible that different outcomes occur with certain probabilities. Each $ (M,m)\in{\cal{O}} $ is called an observation, meaning that measurement $ M $ is performed and the outcome is $ m $. For each action $ a\in{\cal{A}} $, the super-operator $ {\cal{E}}_t(\cdot|a) $ models the evolution of the system if $ a $ is taken on it between $ t $ and the next epoch. So, if the system is in state $ \sigma $ before action $ a $, then it will be in state $ {\cal{E}}_t(\sigma|a) $ after action $ a $. Obviously, $ {\cal{E}}_t(\cdot|a) $ can be seen as the quantum counterpart of the matrix

      $$ \left\{p_t(s^\prime|s,a)\right\}_{s,s^\prime\in{\cal{S}}} $$

      of transition probabilities in an MDP. For each $ (M,m)\in{\cal{O}} $ and $ a\in{\cal{A}} $, $ r_t(M,m,a) $ is the reward that the decision maker gains by taking action $ a $ at decision epoch $ t $ when the outcome of measurement $ M $ is $ m $. Note that in an MDP, the reward depends on the state of the system. However, in a qMDP, the reward depends on the observation $ (M,m) $ about the system rather than directly on the state of the system, because usually the state of a quantum system cannot be fully known. Since at the final epoch $ t = N $, no action will be taken, the domain of the reward function $ r_N $ is $ {\cal{O}} $ rather than $ {\cal{O}}\times{\cal{A}} $.
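      For readers who prefer code, the seven components of Definition 1 could be organised as follows. This is only an illustrative sketch in Python (all names and type choices are ours); super-operators are stored via their Kraus operators and measurements as maps from outcome labels to matrices.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple
import numpy as np

@dataclass
class QMDP:
    """Illustrative container mirroring the 7-tuple of Definition 1."""
    N: int                                              # horizon: T = {1, ..., N}
    rho: np.ndarray                                     # starting density matrix in C^n
    actions: List[str]                                  # action names A
    channels: Dict[Tuple[int, str], List[np.ndarray]]   # (t, a) -> Kraus operators of E_t(.|a)
    measurements: Dict[str, Dict[str, np.ndarray]]      # M -> {outcome m -> matrix M_m}
    reward: Callable[[int, str, str, str], float]       # r_t(M, m, a); a is ignored at t = N
```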

      Now we start to examine the behaviour of a qMDP by introducing the following:

      Definition 2. Let $ 1\leq t\leq N $. Then a sequence

      $$ h_t = (M_1,m_1,a_1,\cdots,M_{t-1},m_{t-1},a_{t-1},M_t,m_t) $$

      is called a history of $ t $ epochs if $(M_1,m_1), \cdots,$ $ (M_{t-1},m_{t-1}),(M_t,m_t)\in{\cal{O}} $ and $a_1,\cdots,a_{t-1}\in{\cal{A}}$.

      History $ h_t $ records the activities of the decision maker: For each $ j\leq t $, she/he performed measurement $ M_j $ on the system, got outcome $ m_j $, and then took action $ a_j $ on it. It is assumed that measurement $ M_j $ happened before action $ a_j $. If $ a_j $ was taken before $ M_j $, then the result would be different because a measurement usually changes the state of a quantum system. We write $ {tail}(h_t) = (M_t,m_t) $. The set of histories of $ t $ epochs is denoted $ H_t $. Obviously, if $ h_t\in H_t $, $a_t,a_{t+1},\cdots,a_{t+(k-1)}\in{\cal{A}}$ and $(M_{t+1},m_{t+1}), $$ (M_{t+2},m_{t+2}),\cdots,(M_{t+k},m_{t+k})\in{\cal{O}},$ then

      $$ \begin{aligned}[b] & (h_t,a_t,M_{t+1},m_{t+1},a_{t+1},M_{t+2},m_{t+2},...,a_{t+(k-1)}, \\ & M_{t+k},m_{t+k})\in H_{t+k} \end{aligned}$$

      for $ 1\leq k\leq N-t $.

      A policy specifies the rule to be used by the decision maker to choose the measurements and actions performed at all decision epochs. For any nonempty set $ X $, we write $ {\cal{D}}(X) $ for the set of probability distributions over $ X $.

      Definition 3. A randomised history-dependent policy is a sequence $\pi = (\alpha_0,\beta_1,\alpha_1,\cdots,\beta_{N-1},\alpha_{N-1})$, where:

      1) $ \alpha_0\in{\cal{D}}({\cal{M}}) $.

      2) $ \alpha_t: H_t\times{\cal{A}}\rightarrow{\cal{D}}({\cal{M}}) $ for $t = 1,\cdots,N-1$.

      3) $ \beta_t: H_t\rightarrow{\cal{D}}({\cal{A}}) $ for $t = 1,\cdots,N-1$.

      For each $ M\in{\cal{M}} $, $ \alpha_0(M) $ is the probability that $ M $ is chosen at the beginning of the decision process. For each $ 1\leq t\leq N-1 $, $ h_t\in H_t $, $ a\in{\cal{A}} $ and $ M\in{\cal{M}} $, $ \beta_t(h_t)(a) $ is the probability that action $ a $ is chosen to be taken between decision epochs $ t $ and $ t+1 $ given history $ h_t $, and $ \alpha_t(h_t,a)(M) $ is the probability that measurement $ M $ is chosen to be performed at epoch $ t+1 $ given history $ h_t $ and that action $ a $ was taken between epochs $ t $ and $ t+1 $. In particular, $ \pi $ is a deterministic (history-dependent) policy if $ \alpha_0 $, $ \alpha_t(h_t,a) $ and $ \beta_t(h_t) $ are all single-point distributions; that is, $ \alpha_0\in {\cal{M}} $, and

      $$ \alpha_t: H_t\times{\cal{A}}\rightarrow{\cal{M}},\; \beta_t:H_t\rightarrow{\cal{A}} $$

      for $ t = 1,\cdots,N-1. $

      Now let $ \pi $ be a randomised history-dependent policy, $ 1\leq t\leq N $ and

      $$ h_t = (M_1,m_1,a_1,\cdots,M_{t-1}, m_{t-1},a_{t-1},M_t, m_t)\in H_t. $$

      Then repeated applications of (1) and (2) yield the probability of $ h_t $ under $ \pi $:

      $$ p^\pi(h_t)=\alpha_0(M_1)\prod_{j = 1}^{t-1}\left[p_j\times\beta_j(h_j)(a_j)\times\alpha_j(h_j,a_j)(M_{j+1})\right]\times p_t $$

      where $h_j = (M_1,m_1,a_1, \cdots,$ $ M_{j-1},m_{j-1},a_{j-1},M_j,m_j) $ for $ 1\leq j\leq t $, and

      $$ \begin{array}{l} \begin{cases} p_1 = {{\rm{tr}}}\left(M_{1m_1}\rho M_{1m_1}^\dagger \right)\\ p_j = {{\rm{tr}}}\left(M_{jm_j}{\cal{E}}_{j-1}(\rho_{j-1}|a_{j-1})M_{jm_j}^\dagger \right),\; 1<j\leq N\end{cases} \end{array} $$ (3)
      $$ \begin{array}{l} \begin{cases} \rho_1 = M_{1m_1}\rho M_{1m_1}^\dagger /p_1\\ \rho_j = M_{jm_j}{\cal{E}}_{j-1}(\rho_{j-1}|a_{j-1})M_{jm_j}^\dagger /p_j,\; 1<j\leq N. \end{cases} \end{array} $$ (4)

      Intuitively, the system starts in state $ \rho $. At the initial epoch, measurement $ M_1 $ is chosen by policy $ \pi $ with probability $ \alpha_0(M_1) $ to perform on the system, outcome $ m_1 $ is obtained with probability $ p_1 $, and the state of the system is changed to $ \rho_1 $. Then action $ a_1 $ is chosen by $ \pi $ with probability $ \beta_1(h_1)(a_1) $, and the system is transformed into state $ {\cal{E}}_1(\rho_1|a_1) $. In general, at epoch $ j $, measurement $ M_j $ is chosen with probability $ \alpha_{j-1}(h_{j-1},a_{j-1})(M_j) $ to perform on state $ {\cal{E}}_{j-1}(\rho_{j-1}|a_{j-1}) $, outcome $ m_j $ occurs with probability $ p_j $, and the state of the system becomes $ \rho_j $. Furthermore, action $ a_j $ is chosen with probability $ \beta_j(h_j)(a_j) $ and it transforms the system into state $ {\cal{E}}_j(\rho_j|a_j) $.
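      The recursions (3) and (4) translate directly into code. The sketch below (Python/NumPy, names ours) computes the quantum part $ p_1\times\cdots\times p_t $ of $ p^\pi(h_t) $ and the post-measurement states $ \rho_j $ for a given history; the policy factors $ \alpha $ and $ \beta $ would be multiplied in separately, as in the formula above.

```python
import numpy as np

def history_quantum_probability(rho, steps):
    """
    `steps` is a list [(M_1m1, None), (M_2m2, kraus_a1), ...], pairing each
    measurement matrix M_{j m_j} with the Kraus operators of E_{j-1}(.|a_{j-1})
    applied before it (None for j = 1).  Follows (3)-(4); assumes every outcome
    along the history has nonzero probability.
    Returns (p_1 * ... * p_t, final post-measurement state rho_t).
    """
    prob, state = 1.0, rho
    for M_m, kraus in steps:
        if kraus is not None:                                 # apply E_{j-1}(.|a_{j-1})
            state = sum(E @ state @ E.conj().T for E in kraus)
        unnorm = M_m @ state @ M_m.conj().T
        p_j = np.trace(unnorm).real                           # (3)
        prob, state = prob * p_j, unnorm / p_j                # (4)
    return prob, state
```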

      The following lemma gives a more compact representation of probability function $ p^\pi(\cdot) $.

      Lemma 1.

      $$ p^\pi(h_t)=\alpha_0(M_1)\prod_{j = 1}^{t-1}\;[\beta_j(h_j)(a_j)\times\alpha_{j}(h_{j},a_{j})(M_{j+1})]\times {{\rm{tr}}}(\sigma_t) $$

      where

      $$ \begin{array}{l} \begin{cases} \sigma_1 = M_{1m_1}\rho M_{1m_1}^\dagger \\ \sigma_j = M_{jm_j}{\cal{E}}_{j-1}(\sigma_{j-1}|a_{j-1})M_{jm_j}^\dagger,\; 1<j\leq N. \end{cases} \end{array} $$ (5)

      Proof. By a routine calculation. □

      Finally, we can define the reward received by the decision maker in a qMDP. For each randomised history-dependent policy $ \pi $, if $ \pi $ is used in the decision process, then the expected total reward over the decision making horizon is

      $$ \begin{array}{l} v_N^\pi = \sum\limits_{h_N\in H_N}p^\pi (h_N)\times r(h_N) \end{array} $$ (6)

      where $ r(h_N) $ is the reward received along history $ h_N $, i.e.,

      $$ r(h_N) = \sum\limits_{t = 1}^{N-1}r_t(M_t,m_t,a_t)+r_N(M_N, m_N) $$

      if $h_N = (M_1,m_1,a_1,\cdots,M_{N-1},m_{N-1},a_{N-1},M_N,m_N)$.

    • As in the case of MDPs, a direct computation of the reward in a qMDP based on the defining equation (6) is very inefficient. In this section, we establish a backward recursion for the reward function so that dynamic programming can be used in policy evaluation for qMDPs. To this end, we first introduce a conditional probability function. Let $ \pi $ be a randomised history-dependent policy, $ 1\leq t\leq N $ and

      $$ \begin{array}{l} h_t = (M_1,m_1,a_1,\cdots,M_{t-1},m_{t-1},a_{t-1},M_t,m_t)\in H_t\\ f_t = (a_t,M_{t+1},m_{t+1},\cdots,a_{N-1},M_N,m_N)\in ({\cal{A}}\times {\cal{O}})^{N-t}. \end{array} $$

      Clearly, the concatenation $ (h_t,f_t) $ of $ h_t $ and $ f_t $ is in $ H_N $. By repeated applications of (1) and (2), we obtain the conditional probability of $ f_t $ under $ \pi $ on $ h_t $:

      $$ \begin{array}{l} p^\pi(f_t |h_t)=\prod\limits_{j = t}^{N-1}\left[\beta_j(h_j)(a_j)\times\alpha_j(h_j,a_j)(M_{j+1})\times p_{j+1}\right] \end{array} $$ (7)

      where

      $$ h_j = (h_t,a_t,M_{t+1},m_{t+1},\cdots,a_{j-1},M_j,m_j) $$

      and $ p_j $′s are defined by (3). Similar to Lemma 1, we have:

      Lemma 2.

      $$ \begin{array}{l} p^\pi(f_t|h_t) = \prod\limits_{j = t}^{N-1}\left[\beta_j(h_j)(a_j)\times\alpha_j(h_j,a_j)(M_{j+1})\right]\times\dfrac{{{\rm{tr}}}(\sigma_N)}{{{\rm{tr}}}(\sigma_t)} \end{array} $$ (8)

      where $ \sigma_j $′s are defined by (5).

      Proof. By a routine calculation. □

      Using the conditional probability function $ p^\pi(\cdot|h_t) $, we can compute the expected reward in the tail of a decision process. More precisely, for each randomised history-dependent policy $ \pi $, function

      $$ u_t^\pi:H_t\rightarrow R $$

      is defined to be the expected total reward obtained by using policy $ \pi $ at decision epochs $t,t+1,\cdots,N$; i.e., for every $ h_t\in H_t $,

      $$ \begin{array}{l} u_t^\pi(h_t) = \displaystyle\sum\limits_{f_t\in ({\cal{A}}\times{\cal{O}})^{N-t}}p^\pi(f_t|h_t)\times r(f_t) \end{array} $$ (9)

      where

      $$ r(f_t) = \sum\limits_{j = t}^{N-1}r_j(M_j,m_j, a_j)+r_N(M_N,m_N). $$

      Theorem 1 presents a backward recursion that shows how to compute the conditional reward $ u_t^\pi $ at decision epoch $ t $ from the conditional reward $ u^\pi_{t+1} $ at the next epoch $ t+1 $.

      Theorem 1. (Backward Recursion) For each $ 1\leq t\leq N-1 $, we have:

      $$ \begin{array}{l} \begin{split}u_t^\pi(h_t) = \sum_{a_t\in{\cal{A}}}\sum_{M_{t+1}\in{\cal{M}}}\beta_t(h_t)(a_t)\times\alpha_t(h_t,a_t)(M_{t+1})\times \\ \left[ r_t(M_t,m_t,a_t)+\sum p_{t+1}\times u_{t+1}^\pi(h_t,a_t,M_{t+1},m_{t+1})\right] \end{split} \end{array} $$ (10)

      where the third $ \sum $ is over $ m_{t+1}\in O(M_{t+1}) $.

      Proof. Straightforward. □

      The aim of policy evaluation is to compute the total reward $ v_N^\pi $. The following lemma gives a representation of $ v_N^\pi $ in terms of the conditional reward $ u_1^\pi $ at the first decision epoch.

      Lemma 3.

      $$ \begin{array}{l} v_N^\pi = \displaystyle\sum\limits_{(M_1,m_1)\in{\cal{O}}}\alpha_0(M_1)\times{{\rm{tr}}}(M_{1m_1}\rho M_{1m_1}^\dagger )\times u_1^\pi(M_1,m_1) \end{array} $$ (11)

      Proof. Routine. □

      Combining Theorem 1 and Lemma 3 enables us to develop a dynamic programming algorithm for evaluating $ v^\pi_N $; see Algorithm 1. Note that there is an essential difficulty in step 2 of this policy evaluation algorithm, namely the computation of quantum probabilities $ p_{t+1} $. The same difficulty arises in the next section where the optimal policies for qMDPs are considered. So, this problem will be carefully addressed in Section 6.

      Algorithm 1. Policy evaluation

      1) Set $ t=N $ and $ u_N^\pi(h_N)=r_N(M_N,m_N) $ for $ h_N\in $$ H_N $ with $ {tail}(h_N)=(M_N,m_N) $.

      2) Substitute $ t-1 $ for $ t $. Compute $ u_t^\pi(h_t) $ for $ h_t\in H_t $ with $ {tail}(h_t)=(M_t,m_t) $ using (10).

      3) If $ t=1 $, go to Step 4). Otherwise, return to Step 2).

      4) Compute $ v_N^\pi $ by (11).

      Note that for each $ t $, the computation of $ p_t $ according to Theorem 3 takes time $ O(t n^6) $ where $ n $ is the dimension of the Hilbert space. Furthermore, as $ |H_t| = $$ |{\cal{O}}|^{t} |{\cal{A}}|^{t-1} $, and for each $ h_t\in H_t $, the computation of $ u_t^\pi(h_t) $ in (10) requires $|{\cal{O}}| \times |{\cal{A}}|$ multiplications each of which needs to compute $ p_{t+1} = p_{t+1}(h_{t+1}) $, the total complexity of Algorithm 1 is $ O(|{\cal{O}}|^{N+1} |{\cal{A}}|^N n^6) $.
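      As a structural illustration of Algorithm 1, the following Python sketch evaluates $ v_N^\pi $ by the recursion (10) and Lemma 3. It is written top-down for brevity (Algorithm 1 proceeds bottom-up over $ H_t $), and all helpers are assumptions on our part: p_next(h, a, M, m) returns the quantum probability $ p_{t+1} $ of (3) (computed as in Section 6), p_first(M, m, rho) returns $ {\rm tr}(M_{1m_1}\rho M_{1m_1}^\dagger) $, O(M) enumerates the outcomes of $ M $, and the policy is given by the callables alpha0, alpha and beta.

```python
def evaluate_policy(N, rho, A, MEAS, O, alpha0, alpha, beta, r, rN, p_next, p_first):
    """Expected total reward v_N^pi of a randomised history-dependent policy (sketch)."""

    def u(t, h):                                   # conditional reward u_t^pi(h_t), cf. (10)
        M_t, m_t = h[-2], h[-1]
        if t == N:
            return rN(M_t, m_t)
        total = 0.0
        for a in A:
            for M in MEAS:
                w = beta(t, h)(a) * alpha(t, h, a)(M)
                if w == 0.0:
                    continue
                inner = r(t, M_t, m_t, a)
                for m in O(M):
                    inner += p_next(h, a, M, m) * u(t + 1, h + (a, M, m))
                total += w * inner
        return total

    # Lemma 3: combine the first measurement with the conditional rewards u_1^pi
    return sum(alpha0(M) * p_first(M, m, rho) * u(1, (M, m))
               for M in MEAS for m in O(M))
```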

    • Now we turn to consider how to compute optimal policies. The optimal expected total reward over the decision making horizon is defined by

      $$ v_N^\ast = \sup\limits_\pi v_N^\pi. $$

      For any $ 1\leq t\leq N $ and $ h_t\in H_t $, $ u_t^\ast(h_t) $ is defined to be the optimal expected total reward from decision epoch $ t $ onward when the history up to time $ t $ is $ h_t $, i.e.,

      $$ u_t^\ast(h_t) = \sup\limits_\pi u_t^\pi(h_t) $$

      where $ \pi $ traverses over all randomised history-dependent policies. Similar to Lemma 3, Lemma 4 shows that the optimal total reward $ v_N^\ast $ can be represented in terms of the optimal reward $ u_1^\ast $ at the first decision epoch.

      Lemma 4.

      $$ \begin{aligned} & v_N^\ast =\\ & \sup_{M_1\in{\cal{M}}}\left\{\sum\limits_{m_1\in O(M_1)}{{\rm{tr}}}(M_{1m_1}\rho M_{1m_1}^\dagger )\times u_1^\ast(M_1,m_1)\right\} \end{aligned} $$ (12)

      Proof. By a routine calculation. □

      Lemma 4 provides a method for computing the optimal total reward $ v_N^\ast $ through computing the optimal reward $ u_1^\ast $ at the first decision epoch, which can be computed by backward recursion based on the following quantum generalisation of the Bellman optimality equations:

      $$ \begin{aligned} \begin{split}[b] & u_t(h_t) =\sup\limits_{a_t\in{\cal{A}}}\sup\limits_{M_{t+1}\in{\cal{M}}} \\ & \left[ r_t(M_t,m_t,a_t)+\sum p_{t+1}\times u_{t+1}(h_t,a_t,M_{t+1},m_{t+1}) \right] \end{split} \end{aligned} $$ (13)

      for $t = 1,\cdots,N-1$, where the summation is over $ m_{t+1}\in O(M_{t+1}) $, and $ p_{t+1} $ is given by (3), and

      $$ \begin{array}{l} u_N(h_N) = r_N(M_N,m_N) \end{array} $$ (14)

      if

      $$ h_N = (M_1,m_1,a_1,\cdots,M_{N-1}, m_{N-1}, a_{N-1}, M_N,m_N). $$

      Theorem 2 shows that the optimal expected reward $ u_t^\ast(h_t) $ at decision epoch $ t $ can be computed by solving the optimal equations.

      Theorem 2. (The principle of optimality) Let $u_t:H_t\rightarrow R (t = 1,\cdots,N)$ be a solution of the optimality equations (13) and (14). Then,

      $$ u_t(h_t) = u_t^\ast(h_t) $$

      for all $t = 1,\cdots,N$ and $ h_t\in H_t $.

      Proof. First, we show that $ u_t^\ast(h_t)\leq u_t(h_t) $ by backward induction on $ t $. By definition, it holds for $ t = N $. Now assume that it holds for $ t+1 $. Then using Lemma 4.3.1 in [21] and Theorem 1, we obtain:

      $$\begin{aligned} u_t^\ast(h_t) & = \sup\limits_\pi u^\pi_t(h_t)\\ & = \sup\limits_\pi\sum\limits_{a_t}\sum\limits_{M_{t+1}}\beta^\pi_t(h_t)(a_t)\times \alpha_t^\pi(h_t,a_t)(M_{t+1})\times\Big[r_t(M_t,m_t,a_t)+\sum\limits_{m_{t+1}}p_{t+1}\times u_{t+1}^\pi(h_t,a_t,M_{t+1},m_{t+1})\Big]\\ & \leq \sup\limits_\pi\sup\limits_{a_t}\sup\limits_{M_{t+1}}\Big[r_t(M_t,m_t,a_t) +\sum\limits_{m_{t+1}}p_{t+1}\times u_{t+1}^\pi(h_t,a_t,M_{t+1},m_{t+1})\Big]\\ & = \sup\limits_{a_t}\sup\limits_{M_{t+1}}\Big[r_t(M_t,m_t,a_t) +\sum\limits_{m_{t+1}}p_{t+1}\times \sup\limits_\pi u_{t+1}^\pi(h_t,a_t,M_{t+1},m_{t+1})\Big]\\ & = \sup\limits_{a_t}\sup\limits_{M_{t+1}}\Big[r_t(M_t,m_t,a_t) +\sum\limits_{m_{t+1}}p_{t+1}\times u_{t+1}^\ast(h_t,a_t,M_{t+1},m_{t+1})\Big]\\ & \leq \sup\limits_{a_t}\sup\limits_{M_{t+1}}\Big[r_t(M_t,m_t,a_t) +\sum\limits_{m_{t+1}}p_{t+1}\times u_{t+1}(h_t,a_t,M_{t+1},m_{t+1})\Big] = u_t(h_t). \end{aligned} $$

      Note that the last inequality comes from the induction hypothesis for $ t+1 $.

      Secondly, we show that $ u_t(h_t)\leq u_t^\ast(h_t) $. For given $ t $ and $ h_t $, and for any $ \epsilon >0 $, by the definition of $u_t, $$ u_{t+1},\cdots,u_N$, we have

      $$ \begin{aligned} & \exists a_t^\ast, M_{t+1}^\ast\ {\rm s.t.}\ u_t(h_t)-\epsilon\leq r_t(M_t,m_t,a_t^\ast)+ \sum_{m_{t+1}}p_{t+1}\times u_{t+1}(h_t,a_t^\ast,M_{t+1}^\ast,m_{t+1})\\ & \forall m_{t+1}, \exists a_{t+1}^\ast, M_{t+2}^\ast\ {\rm s.t.}\;\; u_{t+1}(h_{t+1})-\epsilon\leq r_{t+1}(M_{t+1},m_{t+1},a_{t+1}^\ast)+ \sum_{m_{t+2}}p_{t+2}\times u_{t+2}(h_{t+1},a_{t+1}^\ast,M_{t+2}^\ast,m_{t+2})\\ & \qquad\qquad\vdots\\ & \forall m_{N-1}, \exists a^\ast_{N-1}, M_N^\ast\ {\rm s.t.}\;\; u_{N-1}(h_{N-1})-\epsilon\leq r_{N-1}(M_{N-1},m_{N-1},a_{N-1}^\ast)+ \sum_{m_{N}}p_{N}\times u_{N}(h_{N-1},a_{N-1}^\ast,M_{N}^\ast,m_{N}). \end{aligned} $$

      Here,

      $$ h_j = (h_t,a_t^\ast,M_{t+1}^\ast,m_{t+1},\cdots,a_{j-1}^\ast,M_j^\ast,m_j) $$

      for $ t<j\leq N $. We choose a deterministic policy

      $$ \pi^\ast = (\alpha_0,\beta_1, \alpha_1,\cdots,\beta_{N-1},\alpha_{N-1}) $$

      such that $ \beta_j(M_j,m_j) = a_{j}^\ast $ and $ \alpha_j(M_j,m_j) = M_{j+1}^\ast $. Then, we obtain:

      $$ \begin{aligned}[b] & u_t(h_t)\leq r_t(M_t,m_t,a_t^\ast) +\sum\limits_{m_{t+1}}p_{t+1}\times u_{t+1}(h_t,a^\ast_t,M_{t+1}^\ast,m_{t+1})+\epsilon \leq \\ & r_t(M_t,m_t,a_t^\ast)+\sum\limits_{m_{t+1}}p_{t+1}\times [r_{t+1}(M_{t+1}^\ast,m_{t+1},a_{t+1}^\ast) + \sum\limits_{m_{t+2}} p_{t+2}\times u_{t+2}(h_{t+1},a^\ast_{t+1},M_{t+2}^\ast,m_{t+2})+\epsilon]+\epsilon =\\ & r_t(M_t,m_t,a_t^\ast)+\sum\limits_{m_{t+1}}p_{t+1}\times r_{t+1}(M_{t+1}^\ast,m_{t+1},a_{t+1}^\ast)+ \sum\limits_{m_{t+1},m_{t+2}} p_{t+1}\times p_{t+2}\times u_{t+2}(h_{t+1},a^\ast_{t+1},M_{t+2}^\ast,m_{t+2})+2\epsilon \leq \\ & \cdots\leq Q+ \sum\limits_{m_{t+1},m_{t+2},\cdots,m_N}p_{t+1}\times p_{t+2}\times \cdots \times p_N\times u_N(h_{N-1},a^\ast_{N-1},M_N^\ast,m_N)+(N-t)\epsilon =\\ & Q+\sum\limits_{m_{t+1},m_{t+2},\cdots,m_N}p_{t+1}\times p_{t+2}\times \cdots \times p_N\times r_N(M_N^\ast,m_N)+(N-t)\epsilon \stackrel{*}{ = }\\ & \sum\limits_{m_{t+1},m_{t+2},\cdots,m_{N}}p_{t+1}\times p_{t+2}\times\cdots\times p_{N} \times [r_t(M_t,m_t,a_t^\ast) +r_{t+1}(M_{t+1}^\ast,m_{t+1},a_{t+1}^\ast) +\cdots+ r_{N-1}(M_{N-1}^\ast,m_{N-1},a_{N-1}^\ast) +r_N(M_N^\ast,m_N)]+(N-t)\epsilon = \\ & u_t^{\pi^\ast}(h_t)+(N-t)\epsilon \leq u_t^\ast(h_t)+(N-t)\epsilon \end{aligned} $$

      where

      $$ \begin{aligned}[b] & Q = r_t(M_t,m_t,a_t^\ast)+\sum\limits_{m_{t+1}}p_{t+1}\times r_{t+1}(M_{t+1}^\ast,m_{t+1},a_{t+1}^\ast) +\\ & \sum\limits_{m_{t+1},m_{t+2}}p_{t+1}\times p_{t+2}\times r_{t+2}(M_{t+2}^\ast,m_{t+2},a_{t+2}^\ast)+ \cdots+ \\ &\sum\limits_{m_{t+1},m_{t+2},\cdots,m_{N-1}}p_{t+1}\times p_{t+2}\times \cdots\times p_{N-1} \times \\ & r_{N-1}(M_{N-1}^\ast,m_{N-1},a_{N-1}^\ast). \end{aligned} $$

      Note that the equality " $ \stackrel{*}{ = } $" follows from the fact that

      $$ {\begin{aligned}[b] & \sum\limits_{m_{t+1},m_{t+2},\cdots,m_N}p_{t+1}\times p_{t+2}\times\cdots\times p_N \times r_{t+j}(M_{t+j}^{\ast},m_{t+j},a_{t+j}^\ast)=\\ & \sum\limits_{m_{t+1},m_{t+2},\cdots,m_{t+j}}p_{t+1}\times p_{t+2}\times\cdots\times p_{t+j}\times r_{t+j}(M_{t+j}^{\ast},m_{t+j},a_{t+j}^\ast) \end{aligned} }$$

      for $j = 1,\cdots,N-t-1$, and similar equalities for $ r_t $ and $ r_N $. Finally, arbitrariness of $ \epsilon $ leads to $ u_t(h_t)\leq u_t^\ast(h_t) $. □

      It can be seen from the above proof that for any $ \epsilon>0 $, we can find a deterministic policy

      $$ \pi^\epsilon = (\alpha_0^\epsilon,\beta_1^\epsilon,\alpha_1^\epsilon,\cdots,\beta_{N-1}^\epsilon,\alpha_{N-1}^\epsilon) $$

      such that

      $$ \begin{aligned}[b] & r_t(M_t,m_t,\beta^\epsilon_t(M_t,m_t))+\sum_{m_{t+1}}p_{t+1}\times\\ & u^\ast_{t+1}(h_t,\beta^\epsilon_t(M_t,m_t),\alpha^\epsilon_t(M_t,m_t),m_{t+1})+\dfrac{\epsilon}{N-1}\geq u_t^\ast(h_t) \end{aligned} $$

      for $t = 1,2,\cdots,N-1$. Then $ \pi^\epsilon $ is an $ \epsilon $-optimal policy in the sense that $ v_N^{\pi^\epsilon}\geq v_N^\ast-\epsilon $. In particular, if both $ {\cal{A}} $ and $ {\cal{M}} $ are finite and $ O(M) $ is finite for every $ M\in{\cal{M}} $, then there is a deterministic policy

      $$ \pi = (\alpha_0,\beta_1,\alpha_1,\cdots,\beta_{N-1},\alpha_{N-1}) $$

      such that

      $$\begin{aligned}[b] &r_t(M_t, m_t,\beta_t(M_t,m_t))+\sum_{m_{t+1}}p_{t+1}\times \\ &u^\ast_{t+1}(h_t,\beta_t(M_t,m_t),\alpha_t(M_t,m_t),m_{t+1}) = u_t^\ast(h_t) \end{aligned} $$

      for $t = 1,2,\cdots,N-1$. Based on this observation, a dynamic programming algorithm can be developed for computing the optimal expected reward $ v^\ast_N $ and finding an optimal policy for the case that $ {\cal{A}} $, $ {\cal{M}} $ and all $ O(M) $ with $ M\in{\cal{M}} $ are finite.

        Algorithm 2. Finding optimal policy

        1) Set $ t=N $ and $ u_N^\ast(h_N)=r_N(M_N,m_N) $ for $ h_N\in H_N $ with $ {tail}(h_N)=(M_N,m_N) $.

        2) Substitute $ t-1 $ for $ t $. Compute $ u_t^\ast(h_t) $ for $ h_t\in H_t $ with $ {tail}(h_t)=(M_t,m_t) $ using (13) (with $ u_t $ and $ u_{t+1} $ replaced by $ u_t^\ast $ and $ u_{t+1}^\ast $, respectively).

      Set

      $$\begin{aligned}[b] &(\beta_t (M_t,m_t),\alpha_t(M_t,m_t))\in\arg\max_{(a,M)\in{\cal{A}}\times{\cal{M}}} \\ &\left\{r_t(M_t,m_t,a)+\sum_{m\in O(M)} p_{t+1}\times u_{t+1}(h_t,a,M,m)\right\} \end{aligned} $$

        3) If $ t=1 $, go to Step 4). Otherwise, return to Step 2).

        4) Compute $ v_N^\ast $ by (12).

      As in Algorithm 1, the problem of computing the quantum probability $ p_{t+1} $ arises in Step 2) of this algorithm. Furthermore, it is easy to see that Algorithm 2 has the same complexity as Algorithm 1.
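      The core of Step 2) in Algorithm 2 is a Bellman backup over action-measurement pairs. A sketch under the same assumptions as the policy-evaluation sketch above (the helpers p_next and O and the representation of histories are ours):

```python
def bellman_backup(t, h, u_next, r, A, MEAS, O, p_next):
    """
    Given u_next = u_{t+1}^* on extended histories, return u_t^*(h_t) and a
    maximising (action, measurement) pair, following (13).
    """
    M_t, m_t = h[-2], h[-1]
    best_val, best_choice = float("-inf"), None
    for a in A:
        for M in MEAS:
            val = r(t, M_t, m_t, a)
            for m in O(M):
                val += p_next(h, a, M, m) * u_next(h + (a, M, m))
            if val > best_val:
                best_val, best_choice = val, (a, M)
    return best_val, best_choice
```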

    • Now we present a method for computing the quantum probabilities $ p_t $ needed in both Algorithms 1 and 2. First of all, an elegant formula for $ p_t $ can be easily derived from its defining equation (3) by induction on $ t $.

      Lemma 5. For each $h_t = (M_1,m_1,a_1,\cdots, M_{t-1},m_{t-1},$ $ a_{t-1},M_t,m_t)\in H_t $, we have:

      $$ p_t = p_t(h_t) = \dfrac{{\rm{tr}}(\sigma_t)}{{\rm{tr}}(\sigma_{t-1})} $$

      where $ \sigma_j $′s are defined by (5).

      Proof. By induction on $ t $. □

      However, it is hard to compute probability $ p_t $ directly using the above lemma because $ (t-1) $-fold iterations of super-operators occur in $ \sigma_t $. The matrix representation of a super-operator is usually easier to manipulate than the super-operator itself. Suppose super-operator $ {\cal{E}} $ has the representation:

      $$ {\cal{E}}(\rho) = \sum\limits_iE_i\rho E_i^\dagger $$

      for all density matrices $ \rho $. Then the matrix representation of $ {\cal{E}} $ is the $ n^2\times n^2 $ matrix:

      $$ M = \sum\limits_i E_i\otimes E_i^* $$

      where $ A^{\ast} $ stands for the conjugate of matrix $ A $, i.e., $ A^{\ast} = (a^{\ast}_{ij}) $ with $ a^{\ast}_{ij} $ being the conjugate of complex number $ a_{ij} $, whenever $ A = (a_{ij}) $. We write:

      $$ |\Phi\rangle = \sum\limits_j|jj\rangle $$

      for the (unnormalized) maximally entangled state, where $ \{|j\rangle\} $ is an orthonormal basis of ${\cal{H}} = {\bf{C}}^n$.

      Lemma 6. ([6], Lemma 2.1) Let $ I $ be the $ n\times n $ unit matrix and $ M $ the matrix representation of super-operator $ {\cal{E}} $. Then for any density matrix $ \rho $, we have:

      1) $ ({\cal{E}}(\rho)\otimes I)|\Phi\rangle = M(\rho\otimes I )|\Phi\rangle. $

      2) ${\rm{tr}}(\rho) = \langle\Phi |\rho\otimes I|\Phi\rangle$.

      For every $ j $, we write $ N_j $ for the matrix representation of super-operator $ {\cal{E}}_j(\cdot |a_j) $, and

      $$ L_j = M_{jm_j}\otimes M_{jm_j}^\ast. $$

      Then a combination of the above two lemmas yields an elegant formula for computing the quantum probability $ p_t $ through ordinary matrix multiplications:

      Theorem 3.

      $$ p_t = \dfrac{\langle\Phi|L_tN_{t-1}L_{t-1}\cdots N_1L_1(\rho\otimes I)|\Phi\rangle}{\langle\Phi|L_{t-1}N_{t-2}L_{t-2}\cdots N_1L_1(\rho\otimes I)|\Phi\rangle}. $$

      Proof. This theorem can be easily proved by combining Lemmas 5 and 6. □
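      In code, the vectorisation trick of Lemmas 5 and 6 amounts to propagating the vector $ (\sigma_j\otimes I)|\Phi\rangle $ with ordinary matrix-vector products. A Python/NumPy sketch (names ours):

```python
import numpy as np

def kraus_to_matrix(kraus_ops):
    """Matrix representation N = sum_i E_i (x) E_i^* of a super-operator."""
    return sum(np.kron(E, E.conj()) for E in kraus_ops)

def quantum_probabilities(rho, meas_mats, kraus_list):
    """
    Return [p_1, ..., p_t] of (3) for a history, via Lemmas 5 and 6:
      meas_mats  = [M_{1 m_1}, ..., M_{t m_t}],
      kraus_list = [Kraus ops of E_1(.|a_1), ..., E_{t-1}(.|a_{t-1})].
    """
    n = rho.shape[0]
    Phi = np.eye(n, dtype=complex).reshape(n * n)        # unnormalised |Phi> = sum_j |jj>
    vec = np.kron(rho, np.eye(n, dtype=complex)) @ Phi   # (rho (x) I)|Phi>
    probs, prev_trace = [], 1.0
    for j, M_m in enumerate(meas_mats):
        if j > 0:
            vec = kraus_to_matrix(kraus_list[j - 1]) @ vec   # apply N_{j-1}
        vec = np.kron(M_m, M_m.conj()) @ vec                 # apply L_j = M_{j m_j} (x) M_{j m_j}^*
        tr = (Phi.conj() @ vec).real                         # tr(sigma_j) = <Phi|(sigma_j (x) I)|Phi>
        probs.append(tr / prev_trace)                        # Lemma 5: p_j = tr(sigma_j)/tr(sigma_{j-1})
        prev_trace = tr
    return probs
```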

    • We now give an example to illustrate the ideas introduced in the previous sections. Suppose a quantum robot $ \clubsuit $ is walking in an $ (n_h+1)\times (n_v+1) $ grid environment shown in Fig. 1 (drawn for $ n_h = 3 $ and $ n_v = 2 $). Initially, the robot is at the location $ S = (0,0) $. At each decision epoch, it can choose to move horizontally or vertically, each move implemented by a (one-dimensional) Hadamard quantum walk[21]. After each move, the robot's location information is (partially) obtained by making a measurement detecting its position. If the robot is found outside the grid, it gets a penalty (a negative reward $ -r $) and then restarts from the original point $ S $; if it reaches the target slot $ T = (n_h, n_v) $, it stays there and a reward $ R $ is received; in all other cases, no reward or penalty is incurred. We assume that $ n_h+n_v> 0 $ and $ R>r>0 $.

      Figure 1.  A quantum robot walking in a grid (with $n_h=3$ and $n_v=2$)

      Formally, let $ {\cal{H}}_h $ and $ {\cal{H}}_v $ be two $ 2 $-dimensional vector spaces with $\{|0 \rangle _h, |1 \rangle _h\}$ and $\{|0 \rangle _v, |1\rangle _v\}$, respectively, as orthonormal bases. They will serve as the state spaces of the coins for the horizontal and vertical walks, respectively. The location space $ {\cal{H}}_p $ is the vector space with $\{|i,j \rangle _p : (i,j) \in \overline{G}\}$ as an orthonormal basis, where

      $$ \overline{G} = \{-1,\cdots, n_h+1\} \times \{-1,\cdots, n_v+1\}. $$

      The shift operators for the horizontal and vertical walks are defined by

      $$ \begin{array}{l} S_{hp} = \displaystyle\sum\limits_{(i,j)\in G}\displaystyle\sum\limits_{k\in \{0,1\}} |k\rangle _h \langle k| \otimes |i-(-1)^k ,j\rangle _p\langle i,j| \end{array} $$

      and

      $$ \begin{array}{l} S_{pv} = \displaystyle\sum\limits_{(i,j)\in G}\displaystyle\sum\limits_{k\in \{0,1\}}|i,j-(-1)^k\rangle _p\langle i,j| \otimes |k\rangle _v \langle k| \end{array} $$

      respectively, where

      $$ G = \{(i,j) \in \overline{G}\mid 0\leq i\leq n_h, 0\leq j\leq n_v, (i,j)\neq (n_h,n_v)\}. $$

      Note that we use subscripts to indicate which subsystems the corresponding operators are performed on. For example, $ S_{hp} $ acts only on systems $ {\cal{H}}_h $ and $ {\cal{H}}_p $. For $ \star\in \{h, v\} $, let $ {\cal{W}}^\star $ be the quantum-walk super-operator in the grid $ \overline{G} $ along direction $ \star $, except that it stops when reaching the target slot or getting out of the grid, i.e., for any quantum state $ \rho $,

      $$ {\cal{W}}^\star(\rho) = U^\star \rho\ U^{\star \dagger} + \sum\limits_{(i,j)\in \overline{G}\backslash G}P_{i,j}\rho P_{i,j} $$

      where

      $$ \begin{array}{l} U^h = (S_{hp}\otimes I_v)(H_h\otimes I_p\otimes I_v) \\ U^v = (I_h \otimes S_{pv})(I_h\otimes I_p\otimes H_v)\\ P_{i,j} = I_h \otimes | i,j\rangle _p\langle i,j |\otimes I_v \end{array} $$

      and $ H $, $ I $ stand for the Hadamard matrix and the unit matrix, respectively. Let $ {\cal{E}}_{{reset}} $ be the super-operator which resets the robot to the initial state (i.e., the position to $ (0,0) $, and the coin states to $|0 \rangle$) when it walks outside the grid. To be specific, $ {\cal{E}}_{{reset}} $ has the Kraus operators:

      $$ \{E^{in}, E^{out}_{i,j,k,l} \mid (i,j)\in \overline{G} \backslash G^{in}, k,l\in \{0,1\}\} $$

      where $ G^{in} = G\cup \{(n_h,n_v)\} $, and

      $$ \begin{array}{l} E^{in} = I_h \otimes \displaystyle\sum_{(i,j)\in G^{in}}| i,j\rangle _p\langle i,j |\otimes I_v\\ E^{out}_{i,j,k,l} = |0\rangle _h \langle k| \otimes |0,0\rangle _p\langle i,j| \otimes |0\rangle_v \langle l|. \end{array} $$

      Now, for $ t\in {\cal{T}} $, let

      $$ {\cal{E}}_t(\cdot |\star) = {\cal{E}}_{{reset}}\circ{\cal{W}}^\star $$

      be a composed super-operator obtained by first applying $ {\cal{E}}_{{reset}} $ and then $ {\cal{W}}^\star $. Moreover, let $ {\cal{M}} = \{\Pi_p\} $ be a set consisting of a single measurement $ \Pi $ on system $ {\cal{H}}_p $, where $ \Pi = \{\Pi^!, \Pi^{\times}, \Pi^{?}\} $, $\Pi^{!} = |n_h,n_v \rangle \langle n_h,n_v|$,

      $$ \Pi^{?} = \sum\limits_{(i,j)\in G} |i,j \rangle \langle i,j| $$

      and $ \Pi^\times = I_p - \Pi^{!} - \Pi^{?} $. Finally, we define:

      $$\begin{array}{l} r(m) = \begin{cases} R, & \text{if } m = {!} \\ -r, & \text{if } m = \times \\ 0, & \text{if } m = {?}. \end{cases} \end{array} $$

      For each $ 1\leq t\leq N-1 $, $ M\in {\cal{M}} $, and $ a\in \{h, v\} $, let $ r_t(M, m, a) = r_N(M, m) = r(m) $.
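      To make the measurement $ \Pi $ and the reward concrete, the following Python/NumPy sketch builds the three projectors $ \Pi^{!}, \Pi^{\times}, \Pi^{?} $ on $ {\cal{H}}_p $ for given $ n_h, n_v $ (the enumeration of basis states and the numerical values of $ R $ and $ r $ are arbitrary choices of ours):

```python
import numpy as np

def position_measurement(n_h, n_v):
    """Projectors Pi_goal (!), Pi_out (x), Pi_in (?) on the location space H_p."""
    coords = [(i, j) for i in range(-1, n_h + 2) for j in range(-1, n_v + 2)]
    dim = len(coords)
    idx = {c: k for k, c in enumerate(coords)}

    def proj(cells):
        P = np.zeros((dim, dim))
        for c in cells:
            P[idx[c], idx[c]] = 1.0
        return P

    Pi_goal = proj([(n_h, n_v)])                       # Pi^!
    Pi_in = proj([(i, j) for (i, j) in coords          # Pi^?
                  if 0 <= i <= n_h and 0 <= j <= n_v and (i, j) != (n_h, n_v)])
    Pi_out = np.eye(dim) - Pi_goal - Pi_in             # Pi^x
    return Pi_goal, Pi_out, Pi_in

reward = {"!": 10.0, "x": -1.0, "?": 0.0}   # illustrative choices of R and r
```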

      With the notations presented above, the robot-walking system can be modelled by a qMDP:

      $$ {\cal{P}} = ({\cal{T}},{\cal{H}},\rho, {\cal{A}},\{{\cal{E}}_t(\cdot |a):t\in{\cal{T}}, a\in{\cal{A}}\}, {\cal{M}},\{r_t:t\in {\cal{T}}\}) $$

      where $ {\cal{H}} = {\cal{H}}_h \otimes {\cal{H}}_p\otimes {\cal{H}}_v $, $\rho = |0 \rangle _h \langle 0|\otimes |0,0\rangle_p\langle0,0|\otimes |0\rangle _v\langle0|$, and $ {\cal{A}} = \{h, v\} $. As $ {\cal{M}} $ contains only a single measurement, we can simply denote a history in $ H_t $ by $h_t = (m_1, a_1, m_2, \cdots , a_{t-1},m_t)$. Furthermore, it is easy to see that $ m_1 = {?} $ in any history $ h_t $.

      In the remainder of this section, we compute the optimal expected total reward $ v_N^\ast $ as well as (one of) the corresponding policies in several simple cases. Our strategy is to first calculate for any $h_t = (m_1, a_1, $$ m_2, \cdots, a_{t-1},m_t) \in H_t$, $ 1\leq t\leq N $, the probability $ p_t = p_t(m_t \mid h_{t-1}, a_{t-1}) $ defined in (3). Then (13) is recursively employed to calculate $ u_t(h_t) $. Finally, we get $ v_N^\ast = u_1(?) $ from Lemma 4 and Theorem 2.

      Case $ N = 1 $: This case is trivial, as no decision needs to be made, and $ v_1^\ast = r(?) = 0 $.

      Case $ N = 2 $: From (13), we have for any $ h_2 = (?, a_1, m_2) $, $ u_2(h_2) = r(m_2) $, and

      $$ u_{1}(?) = \max\limits_{a_1\in\{h,v\}} \{p_2(! \mid\ ?, a_1) \cdot R - p_2(\times \mid\ ?, a_1) \cdot r\} $$

      where

      $$ \begin{array}{l} p_2(! \mid\ ?, a_1) = \begin{cases} 1/2, & \text{if } n_{a_1} = 1\ \wedge\ n_{\overline{a}_1} = 0 \\ 0, & \text{otherwise}. \end{cases} \end{array} $$

      $ \overline{a}_1 $ denotes the direction other than $ a_1 $, and

      $$ \begin{array}{l} p_2(\times \mid\ ?, a_1) = \begin{cases} 1/2, & \text{if } n_{a_1} >0 \\ 1, & \text{if } n_{a_1} = 0. \end{cases} \end{array} $$

      Thus,

      $$\begin{array}{l} v_2^\ast = u_{1}(?) = \begin{cases} (R-r)/2, & \text{if } n_h + n_v = 1 \\ -r/2, & \text{otherwise } \end{cases} \end{array} $$

      and one of the optimal policies could be taking $ a_1 = h $ if $ n_h \geq 1 $, and $ a_1 = v $ otherwise.

      Case $ N = 3 $: For any $ h_2 = (?, a_1, m_2) $,

      $$ u_2(h_2)=r(m_2)+ \max_{a_2\in\{h,v\}} \{p_3(!\mid h_2, a_2) \times R-p_3(\times\mid h_2, a_2)\times r\}. $$

      Note that when $ m_2 = \times $, the robot is reset to the initial state. Thus $ p_3(m\mid h_2, a_2) = p_2(m\mid\ ?, a_2) $ where $ p_2 $ is computed in the previous case. Similarly, as the robot stays at the target once $ m_2 = {!} $, $ p_3(m\mid h_2, a_2) = 1 $ if $ m = {!} $ and 0 otherwise. Furthermore, it is not hard to show that

      $$\begin{aligned}[b] & p_3(!\mid\ ?, a_1, ?, a_2) = \\ & \begin{cases} 1/2, & \text{if } a_1 = a_2\ \wedge\ n_{a_1} = 2\ \wedge\ n_{\overline{a}_1} = 0 \\ 1/2, & \text{if } a_1 \neq a_2\ \wedge\ n_h = n_v = 1 \\ 0, & \text{otherwise} \end{cases} \end{aligned} $$

      and

      $$\begin{aligned}[b] & p_3(\times\mid\ ?, a_1, ?, a_2) =\\ & \begin{cases} 1/2, & \text{if } a_1 = a_2\ \wedge\ n_{a_1} = 1\ \wedge\ n_{\overline{a}_1} > 0 \\ 1/2, & \text{if } a_1 \neq a_2\ \wedge\ n_h >0\ \wedge\ n_v > 0 \\ 1, & \text{if } a_1 \neq a_2\ \wedge\ n_{a_1} > 1\ \wedge\ n_{a_2} = 0 \\ 0, & \text{otherwise}. \end{cases} \end{aligned} $$

      With these we know that for any $ a_1\in \{h, v\} $, $u_2 $$ (?, a_1, !) = 2R$, $ u_2(?, a_1, \times) = -r + v_2^\ast $, and

      $$ \begin{aligned}[b] & u_2(?, a_1, ?) = \max_{a_2}\{p_3(!\mid\ ?, a_1, ?, a_2) \cdot \\ & R - p_3(\times\mid\ ?, a_1, ?, a_2)\times r\} = \\ & \begin{cases} R/2, & \text{if } n_{a_1} = 2\ \wedge\ n_{\overline{a}_1} = 0 \\ (R-r)/2, & \text{if } n_h = n_v = 1 \\ -r/2, & \text{if } n_{a_1} = 1\ \wedge\ n_{\overline{a}_1} >1 \\ 0, & \text{otherwise } \end{cases} \end{aligned} $$

      and one of the optimal policies could be taking $ a_2 = \overline{a}_1 $ if $ n_h = n_v = 1 $, and $ a_2 = a_1 $ otherwise. Furthermore,

      $$ u_{1}(?) = \max\limits_{a_1} \left\{\sum_{m\in \{!, \times, ?\}} p_2(m\mid\ ?, a_1) \cdot u_2(?, a_1, m)\right\}. $$

      Let $ T(m) = p_2(m\mid\ ?, a_1) \cdot u_2(?, a_1, m) $. We compute:

      $$ \begin{array}{l} T(!) = \begin{cases} R, & \text{if } n_{a_1} = 1\ \wedge\ n_{\overline{a}_1} = 0 \\ 0, & \text{otherwise} \end{cases}\\ T(\times) = \begin{cases} (R-3r)/2, & \text{if } n_{a_1} = 0\ \wedge\ n_{\overline{a}_1} = 1 \\ -3r/2, & \text{if } n_{a_1} = 0\ \wedge\ n_{\overline{a}_1} > 1 \\ (R-3r)/4, & \text{if } n_{a_1} = 1\ \wedge\ n_{\overline{a}_1} = 0 \\ -3r/4, & \text{otherwise } \end{cases}\\ T(?) = \begin{cases} R/4, & \text{if } n_{a_1} = 2\ \wedge\ n_{\overline{a}_1} = 0 \\ (R-r)/4, & \text{if } n_h = n_v = 1 \\ -r/4, & \text{if } n_{a_1} = 1\ \wedge\ n_{\overline{a}_1} >1 \\ 0, & \text{otherwise. } \end{cases} \end{array} $$

      Thus,

      $$ \begin{array}{l} v_3^\ast = u_1(?) = \begin{cases} (5R-3r)/4, & \text{if } n_{\min} = 0 \wedge n_{\max} = 1 \\ (R-3r)/4, & \text{if } n_{\min} = 0 \wedge n_{\max} = 2 \\ (R-4r)/4, & \text{if } n_{\min} = 1 \wedge n_{\max} = 1 \\ -3r/4, & \text{otherwise } \end{cases} \end{array} $$

      where $ n_{\max} = \max\{n_h, n_v\} $, $ n_{\min} = \min\{n_h, n_v\} $, and one of the optimal policies could be taking $ a_1 $ such that $ n_{a_1} = n_{\max} $. Furthermore, $ a_2 = \overline{a}_1 $ if $ n_h = n_v = 1 $; otherwise, take $ a_2 = a_1 $.

    • In this paper, we studied the optimal policy problem for finite-horizon qMDPs. It is shown that the problem can be solved by dynamic programming together with matrix multiplications for computing quantum probabilities. We hope that the mathematical framework of qMDPs developed in [3, 4] and this paper can provide certain theoretical foundations for quantum reinforcement learning[17-20], quantum robot planning[22-24] and other decision-making tasks in the quantum world.

      For future studies, one of the most interesting problems is to settle the complexity of the optimal policy problem (as well as other problems) for qMDPs (in comparison with that for classical MDPs and POMDPs[25, 26]). Another interesting problem is to find quantum algorithms (rather than classical algorithms as considered in the present paper) for solving qMDPs, in particular, for speeding up the computation of quantum probabilities. Since the state space of a qMDP is a continuum and thus inherently infinite, it will be useful to extend analysis techniques for MDPs with infinite state spaces, e.g., bisimulation and metrics[27, 28], to the quantum case.

    • This work has been partly supported by National Key R&D Program of China (No. 2018YFA0306701), the Australian Research Council (Nos. DP160101652 and DP180100691), National Natural Science Foundation of China (No. 61832015) and the Key Research Program of Frontier Sciences, Chinese Academy of Sciences.

    • This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

      The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

      To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
