Volume 13 Number 6
December 2016
Zheng-Huan Zhang, Xiao-Fen Jiang and Hong-Sheng Xi. Optimal Content Placement and Request Dispatching for Cloud-based Video Distribution Services. International Journal of Automation and Computing, vol. 13, no. 6, pp. 529-540, 2016. doi: 10.1007/s11633-016-1025-z

Optimal Content Placement and Request Dispatching for Cloud-based Video Distribution Services

Author Biography:
  • Zheng-Huan Zhang received the B. Eng. degree in automation from University of Science and Technology of China, China in 2009. He is currently a Ph. D. candidate in the Department of Automation at University of Science and Technology of China, China.
    His research interests include modeling, optimization and performance analysis of networking.
    E-mail: zzhs@mail.ustc.edu.cn
    ORCID iD: 0000-0001-5066-4364

  • Xiao-Feng Jiang received the B. Eng. and Ph. D. degrees in information science and technology from University of Science and Technology of China (USTC), China in 2008 and 2013, respectively. He was a postdoctoral fellow in the Department of Automation at USTC from 2013, and has been a postdoctoral fellow in the Department of Electrical Engineering, Columbia University, USA, since 2015.
    His research interests include spectrum sensing, discrete event dynamic system, and wireless communications.
    E-mail: jxf@ustc.edu.cn

  • Corresponding author: Hong-Sheng Xi received the B. Sc. and M. Sc. degrees in applied mathematics from University of Science and Technology of China (USTC), China in 1980 and 1985, respectively. He is currently a professor in the School of Information Science and Technology, USTC. He also directs the Laboratory of Network Communication System and Control.
    His research interests include stochastic control systems, discrete-event dynamic systems, network performance analysis and optimization, and wireless communications.
    E-mail: xihs@ustc.edu.cn
    ORCID iD: 0000-0002-5747-9732
  • Received: 2016-01-13
    Published Online: 2016-10-27
Fund Project:

This work is supported by the State Key Program of National Natural Science Foundation of China (No. 61233003) and the National Natural Science Foundation of China (No. 61503358).



Abstract: The rapid progress of cloud technology has attracted a growing number of video providers to consider deploying their streaming services onto cloud platforms for more cost-effective, scalable and reliable performance. In this paper, we utilize a Markov decision process model to formulate the dynamic deployment of cloud-based video services over multiple geographically distributed datacenters. We focus on maximizing the average profit for the video service provider over the long run and introduce an average performance criterion which reflects the cost and user experience jointly. We develop an optimal algorithm based on sensitivity analysis and sample-based policy iteration to obtain the optimal video placement and request dispatching strategy. We demonstrate the optimality of our algorithm with a theoretical proof and discuss its practical feasibility. We conduct simulations to evaluate the performance of our algorithm, and the results show that our strategy can effectively cut down the total cost while guaranteeing users' quality of experience (QoE).

  • As one of the most popular Internet applications, watching online videos has become an indispensable part of millions of Internet users' daily life. Cisco [1] reported that video traffic accounted for 67% of total Internet traffic in 2014 and will reach more than 80% by 2019. Nowadays, large Internet video enterprises still construct their large-scale video distribution architectures mainly on content delivery networks (CDNs) to serve huge and highly dynamic user demands from widely distributed geographic regions. However, the semi-static resource allocation in traditional CDNs results in a low resource utilization rate, which incurs waste and leads to high cost. The situation is particularly severe because system resources in CDNs often have to be over-provisioned to deal with peak traffic. Therefore, a more dynamic resource provisioning paradigm is needed.

    Cloud computing technologies have emerged as an agile and cost-effective model to host various Internet applications and services. Their elastic and on-demand resource provisioning offers an attractive solution to video providers. There have already been practical cloud-based video applications: for example, NETFLIX [2], the largest U.S. video-on-demand platform, has migrated its entire service platform onto Amazon cloud services [3]. A cloud platform with multiple distributed datacenters is ideal for video distribution services for the following reasons. First, video service providers (VSPs) can dynamically rent resources from cloud providers, e.g., storage and bandwidth, according to actual demand, which can significantly improve resource utilization and avoid huge investment in and extra maintenance of fundamental IT infrastructure. Second, the prices of cloud resources across datacenters exhibit location diversity, which can be exploited to save costs; that is, VSPs can allocate relatively more resources in the cheaper datacenters. Third, since the datacenters are distributed over regions, VSPs can dispatch user requests to datacenters located in adjacent geographic regions to provide higher quality service.

    However, for VSPs, the two objectives of maximizing overall users' quality of experience (QoE) and minimizing the total cost generally cannot be achieved simultaneously. Therefore, it is challenging for a VSP to determine how to allocate resources and replicate contents at each datacenter while dispatching requests to the appropriate datacenters. The VSP needs a joint optimization strategy to balance the costs and users' QoE. Several related works have investigated the deployment of video streaming services onto the geographically distributed cloud. The works in [4, 5] focus on resource allocation and request dispatching based on a time-series forecasting method, but they only propose heuristic video placement strategies. Due to the dynamic nature of user demands, the deployment update issue must be considered, yet it is more or less neglected in the above-mentioned works, so the long-run optimality of the system cannot be guaranteed.

    In this paper, we investigate the joint optimization problem of video distribution services deployed on geographically distributed cloud datacenters. We propose an optimal algorithm to make appropriate joint decisions of dynamic video placement and requests dispatching, which can simultaneously achieve cost minimization and users' QoE guarantee over a long run of the system. The main contributions of our work are summarized as follows:

    1) We explicitly formulate the dynamic optimal deployment problem of cloud-based video distribution services utilizing Markov decision process model. An average performance model is introduced to reflect the time-average profits (the revenues minus the costs) for the VSP over a long run. We maximize the average performance via joint orchestration of the request dispatching, content replications and content update decisions.

    2) We investigate the joint optimization problem and apply the sensitivity analysis based on performance potentials and sample-path-based policy iteration approaches to obtain the optimal stationary strategy which maximizes the time-average profits for the VSP. The optimal strategy maps the system states into a series of optimal decision actions. The VSP can practically apply it to adaptively adjust the rented resources, video placement and load distribution in each datacenter.

    3) We demonstrate the optimality of our algorithm with theoretical proof and conduct several experiments to validate the effectiveness of our algorithm. The results show that our dynamic algorithm can effectively cut down system costs and guarantee users' QoE to get high and stable revenues simultaneously, and also can adaptively handle the volatility of user requests.

    We next present a review of related works. A number of studies have investigated the deployment of video services over the cloud in recent years. Zhu et al. [6] conceive cloud computing as a promising approach to deliver video services. Wang et al. [7] present a live streaming services framework called CALMS and demonstrate the practical feasibility of the migration. Chaisiri et al. [8] formulate the cloud resource supply and demand problem as a stochastic program and propose an offline strategy. Wang et al. [9] focus on the quality of cloud service in service-oriented cloud computing and propose an accurate approach to evaluate this quality. Gorbenko and Popov [10] investigate the task-resource scheduling problem in cloud computing systems based on a logical model. Li et al. [11] develop a partial migration model of video on demand (VoD) services to a content cloud and design several heuristic migration strategies to reduce the cost. Lin et al. [12] investigate admission control policies for VoD service systems under various QoS requirements. Wu et al. [13] introduce a queueing-network-based model to characterize the dynamic behaviors of cloud-based VoD services and design a dynamic heuristic resource provisioning algorithm. He et al. [14] focus on the dynamic resource provisioning problem under fluctuating demands; they introduce diverse pricing models and investigate the tradeoff between the procurement cost and the user QoE. Tang et al. [15] investigate a request redirection strategy across different virtual machines (VMs) and service scaling strategies inside a datacenter under different traffic conditions. In these models [8, 11, 13-15], all requests and resources are concentrated in one cloud, while our work considers a geographically distributed cloud architecture, where the differentiated charging models and propagation delays of different datacenters make it necessary to investigate a joint resource provisioning and request dispatching problem.

    Some researchers have studied the deployment of cloud services over multiple geographically distributed datacenters. Xu and Li [16] develop a distributed algorithm for the workload management problem. Zhang et al. [17] present a dynamic service placement algorithm based on control-theoretic and game-theoretic models. However, neither [16] nor [17] considers the content placement and migration problem. Zhao et al. [18] design a cost-effective distributed heuristic algorithm to determine the placement of channel replicas and bandwidth reservation. Niu et al. [4] propose a joint load direction and bandwidth reservation strategy to satisfy the system QoS constraints. They predict the dynamic demands based on time series analysis [19] and use the prediction as the input to solve the decisions periodically. He et al. [5] also predict the user demand using the time series method [19] and then apply the Nash bargaining solution to solve the joint bandwidth allocation and request direction optimization problem. Both works [4, 5] only propose heuristic video placement algorithms based on the solved bandwidth allocation, which cannot guarantee optimality, and their strategies are optimized using the system information of a single time slot. In contrast, our work proposes one optimal strategy that combines the three decisions of resource allocation, video placement and request dispatching, and it guarantees performance optimality over a long run of the system without the need for demand prediction. Qiu et al. [20] focus on migrating content distribution services onto hybrid clouds. Hu et al. [21, 22] investigate social video distribution over the cloud-centric CDN. Both works [20, 21] apply Lyapunov optimization theory to design a joint content replication and load distribution algorithm that achieves cost minimization over time under delay constraints. Instead of simply treating user latency as a constraint, our work links the VSP's revenues with users' QoE, expressed as a utility function of user latency. We optimize the revenue and the cost jointly, which offers VSPs different optimization preferences.

    The remainder of this paper is organized as follows. The formulation of the optimal deployment problem is described in Section 2. In Section 3, we apply sensitivity analysis to derive the optimal conditions and propose the policy iteration algorithm to solve the Markov decision process (MDP) optimization problem. In Section 4, we present the numerical evaluation results. In Section 5, we conclude the paper.

  • We consider a typical cloud-based video distribution architecture. Suppose that a video service provider provides a collection of video contents, denoted as set ${K}$, to its users spread over multiple geographical regions, denoted as set ${I}$. To implement video distribution services, the video service provider can exploit the public cloud infrastructure which consists of multiple geographically distributed datacenters, denoted as set ${J}$. The cardinalities of the sets ${I}$, ${J}$ and ${K}$ are denoted by I, J and K, respectively.

    In each datacenter, there are two types of inter-connected servers: storage servers, and computing servers provided in the unit of VMs with certain resource configurations. Each VM inside a datacenter can access the storage servers via a high-speed datacenter network. The VSP constructs the video service system by exploiting the two types of servers. Video contents can be replicated on storage servers, while requests can be dispatched to streaming services installed on VMs on the computing servers. The VSP can dynamically adjust the number of rented virtual machines and storage resources to satisfy the users' fluctuating demand. The goal of a VSP is to maximize its profit over time, i.e., to optimize the users' QoE for enhancing market competitiveness while reducing the overall operational cost as much as possible. To achieve this goal, we aim to design an optimal algorithm that strategically makes the following decisions: 1) Video placement: how many copies of each video are needed and in which datacenter should each copy be placed at each time slot? 2) Request dispatching: how should requests for a video be directed to the datacenters that store this video at each time slot? Next, we utilize a Markov decision process optimization framework to formulate the joint optimization problem. We summarize the important notations in Table 1.

    Table 1.  Notations

  • Our system runs in a time-slotted fashion. Each time slot (TS) has equal duration, long enough to accomplish a content replacement operation and a VM configuration adjustment. Fig. 1 illustrates our system model.

    Figure 1.  System model

    States. The system state at the n-th TS consists of the following three components: the new arrival requests $F(n)$, the serving queue $D(n)$ and the video placement state $Y(n)$, denoted as $X(n)=(F(n), D(n), Y(n))$.

    1) Let $f_i^k(n)$ denote the number of new arrival requests for video k from user region i in TS n. Considering that video access requests arrive independently of one another, a reasonable assumption is that the number of the arrival requests $f_i^k (n)$, for arbitrary $i \in {I}$, $k \in {K}$ in each TS, obeys a Poisson distribution. We have the state matrix $F(n)=[f_i^k (n)]_{I \times K}, i \in {I}, k \in {K}$.

    2) As illustrated in Fig. 1, all requests arising in one TS are dispatched to a certain datacenter in the same TS and will enter a corresponding serving queue in the next coming TS. Let $d_j^k (n)$ denote the number of requests for video k being served by datacenter j in TS n. We have the serving state matrix as $D(n)=[d_j^k (n)]_{J \times K}, j \in {J}, k \in {K}$.

    3) We denote $y_j^k$ as the placement indicator state variable. If video k is stored in datacenter j in TS n, $y_j^k (n)=1$; otherwise $y_j^k (n)=0$. The video placement state matrix is written as $Y(n)=[y_j^k (n)]_{J \times K}, j \in {J}, k \in {K}$.

    Actions. The VSP, as the decision maker, takes a series of actions upon the current system state at each time slot to achieve better performance. The actions at the n-th TS are denoted as $V(n)=(A(n), O(n))$, in which A and O represent the set of request dispatching actions and the set of content replacement actions, respectively. In the following, we introduce the two sets of actions in detail.

    1) For the requests for video k generated from user region i, we denote $a_{ij}^k (n)$ as the number of requests dispatched to datacenter j in TS n. The request dispatching matrices can be formed as

    ${{A}_{i}}={{\left( \begin{matrix} a_{i1}^{1} & \ldots & a_{i1}^{K} \\ \vdots & \ddots & \vdots \\ a_{iJ}^{1} & \ldots & a_{iJ}^{K} \\ \end{matrix} \right)}_{J\times K}}, i=1, 2, \cdots , I$

    and ${A}=[A_1, \cdots, A_i, \cdots, A_I]$. The value of the action entry $a_{ij}^k$ depends on the corresponding state. We can dispatch requests for video k to datacenter j only when it stores that video, i.e., if $y_j^k (n)=1$, then ${a_{ij}^k (n)} \geq 0, \forall i$; if $y_j^k (n)=0$, then $a_{ij}^k (n)=0, \forall i$. Since all requests arising in one TS are dispatched without rejection, the dispatching actions should satisfy the request conservation constraint $\sum_{j\in {J}} {a_{ij}^k (n)}=f_i^k (n)$.

    2) Let $o_j^k (n)$ denote the content replacement action for video k in datacenter j. The value space of $o_j^k (n)$ is $\{-1, 0, 1\}$, where actions $-1$, 0 and 1 indicate that the content is removed, kept or copied, respectively. For video k, at most one replica can be stored in datacenter j. Thus, if $y_j^k (n)=1$, then $o_j^k (n) \leq 0$; if $y_j^k (n)=0$, then $o_j^k (n) \geq 0$. The content replacement actions form the matrix $O(n)=[o_j^k (n)]_{J \times K}$.

    State transition. We now describe the state transition process under the decision actions. The video placement state $Y(n)$ evolves deterministically with the replacement actions $O(n)$, so they contribute no randomness to the state transition probability. Therefore, we focus on the changes of the arrival state $F(n)$ and the serving state $D(n)$.

    1) The upper bound of $f_i^k (n)$ is set as a positive integer M, thus we have $f_i^k (n)=m \in \{0, 1, \cdots, M\}$. Moreover, the entries of the state matrix, $f_i^k (n), i \in {I}, k\in {K}$, are independent of each other and follow Poisson distributions with parameters $\lambda_k, k \in {K}$, respectively. Accordingly, we can obtain the state transition probability as follows:

    $\begin{align} P\big(f_i^k(n+1)=m\,\big\vert\,f_i^k(n)=m^{\prime}\big)=P\big(f_i^k(n+1)=m\big)= \begin{cases} \dfrac{\lambda_k^m}{m!}{\rm e}^{-\lambda_k}, & \text{when}\ m<M \\ 1-\sum\limits_{t=0}^{M-1}\dfrac{\lambda_k^t}{t!}{\rm e}^{-\lambda_k}, & \text{when}\ m=M. \end{cases} \end{align}$

    (1)

    The transitions which are not considered in this list have probability 0. Due to the independent nature of each entry in the state matrix $F(n)$, we have $ P(F(n+1)\vert F(n))=\prod_{i \in {I}, k \in {K}} {P(f_i^k (n+1))}$.

    2) Considering the characteristics of video watching, user requests will stay in the serving queue for at least one TS. The number of requests for video k departing from serving queue $d_j^k$ in TS n is denoted as $t_j^k (n)$, which follows a Poisson distribution with the parameter $\mu_k$. After the dispatching action $a_{ij}^k (n)$ is taken, the datacenter j will start serving the various requests from different regions in the next TS. The update equation of the serving queue $d_j^k$ is expressed as the following queue law:

    \begin{equation} d_j^k (n+1)=\max{\left\{d_j^k (n)-t_j^k (n), 0\right\}}+\sum_{i \in {I}} {a_{ij}^k(n)} \end{equation}

    (2)

    which indicates that the number of departing requests $t_j^k (n)$ cannot exceed the number of serving requests $d_j^k (n)$ in the current TS. Thus, the transition probability of the state $d_j^k (n)$ under the actions $a_{ij}^k (n), i \in {I}$, can be deduced from the Poisson distribution of the departing variable $t_j^k (n)$. If there are T requests for video k being served by datacenter j in TS n, i.e., $d_j^k (n)=T$, then $t_j^k (n)\in\{0, 1, \cdots, T\}$. If the departing requests are fewer than the serving requests, i.e., $t_j^k (n)<T$ and $t_j^k (n)=d_j^k (n)+\sum_{i \in {I}}{a_{ij}^k} -d_j^k (n+1)$, we have

    $\begin{align} & P(d_{j}^{k}(n+1)|d_{j}^{k}(n), a_{ij}^{k}(n), i\in I)= \\ & P(t_{j}^{k}(n)=d_{j}^{k}(n)+\sum\limits_{i\in I}{a_{ij}^{k}}-d_{j}^{k}(n+1)). \\ \end{align}$

    (3)

    According to the Poisson distribution,

    \begin{equation} \label{ttrans1} P(t_j^k (n)=\tau)=\dfrac{\mu_k^\tau}{\tau!} {\rm e}^{-\mu_k}, \tau=0, 1, \cdots, T-1. \end{equation}

    (4)

    If all the serving requests depart from the queue, i.e., $T-t_j^k (n)=0, d_j^k (n+1)=\sum_{i \in {I}}{a_{ij}^k}$ , then we have

    $\begin{align} & P(d_{j}^{k}(n+1)|d_{j}^{k}(n), a_{ij}^{k}(n), i\in I)= \\ & P(t_{j}^{k}(n)=T)=1-\sum\limits_{\tau =0}^{T-1}{\frac{\mu _{k}^{\tau }}{\tau !}{{\text{e}}^{-{{\mu }_{k}}}}}. \\ \end{align}$

    (5)

    The transitions which are not considered above have probability 0. Bearing in mind that the entries are independent of one another, we can derive the formula of the state transition probability as follows:

    $P\big(X(n+1)\,\big\vert\,X(n), V(n)\big)=\prod\limits_{i\in I, k\in K}{P\big(f_{i}^{k}(n+1)\big)}\times \prod\limits_{j\in J, k\in K}{P\big(d_{j}^{k}(n+1)\,\big\vert\,d_{j}^{k}(n), a_{ij}^{k}(n), i\in I\big)}$

    (6)

    where $P\big(f_i^k (n+1)\big)$ and $P\big(d_j^k (n+1)\vert d_j^k (n), a_{ij}^k (n), i\in {I}\big)$ are expressed as the above formulas (1), (4), (5). From the specific formula of state transition probability, we can find that the transition of the system state depends only on the current state and the controller’s current action, which means that our system model fulfills the Markov property.
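    To make the transition model concrete, the following Python sketch (ours, not from the paper; the rates and bounds are assumed example values) computes the truncated Poisson arrival distribution of (1) and the serving-queue transition of (2)-(5) for a single entry; by (6), the joint transition probability is simply the product of such independent terms.

```python
# Illustrative sketch (not the authors' code): transition probabilities of Eqs. (1)-(6)
# for single entries of F(n) and D(n). Rates lam/mu and bounds M are example values.
import math

def arrival_prob(m, lam, M):
    """Eq. (1): P(f_i^k(n+1) = m) for a Poisson rate lam truncated at the bound M."""
    if m < M:
        return lam**m / math.factorial(m) * math.exp(-lam)
    # All remaining probability mass is lumped into the boundary state m = M.
    return 1.0 - sum(lam**t / math.factorial(t) * math.exp(-lam) for t in range(M))

def queue_transition(d_now, dispatched, mu):
    """Eqs. (2)-(5): distribution of d_j^k(n+1) given d_j^k(n) = d_now and the
    total number of requests dispatched to this (datacenter, video) pair."""
    def depart_prob(tau):
        if tau < d_now:
            return mu**tau / math.factorial(tau) * math.exp(-mu)
        # Eq. (5): the event "all serving requests depart" absorbs the rest of the mass.
        return 1.0 - sum(mu**t / math.factorial(t) * math.exp(-mu) for t in range(d_now))
    dist = {}
    for tau in range(d_now + 1):
        d_next = max(d_now - tau, 0) + dispatched      # queue law of Eq. (2)
        dist[d_next] = dist.get(d_next, 0.0) + depart_prob(tau)
    return dist

# Example: one entry with lambda_k = 0.5, M = 2; 3 requests in service, 1 dispatched, mu_k = 0.75.
print([arrival_prob(m, lam=0.5, M=2) for m in range(3)])
print(queue_transition(d_now=3, dispatched=1, mu=0.75))
# Eq. (6): the full transition probability multiplies such terms over all independent entries.
```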

    Reward. We represent the VSP's profits as the reward function of the MDP model, defined as the service revenue minus the operational cost. Next, we introduce the service revenue and the operational cost in detail.

    Service Revenue. The service revenue for a VSP is in direct proportion to the number of user access requests. To retain a large number of users for a long time, the VSP should always provide a high quality video watching experience to users. From the perspective of user experience, end users expect to receive video streams from cloud datacenters with low latency whenever possible. A small increase in the user-perceived latency would severely try the users’ patience and cause substantial customer dissatisfaction [23]. The user-perceived latency largely depends on the end-to-end propagation latency [24].

    We denote the latency between region i and datacenter j as $\zeta_{ij}$, so the average latency experienced by users requesting video k from region i is $ \sum_{j \in {J}}{a_{ij}^k (n)\zeta_{ij}}/f_i^k(n)$. As the average latency increases, the users from region i are more likely to leave the service, which incurs more revenue loss. Thus, we represent the revenue as a potential loss through a generic decreasing utility function $U_i$ associated with the average latency. Following [25], the utility function is set as:

    \begin{equation} \label{quadrrev} U_i(a_{ij}^k)=-\beta f_i^k(n)\left(\frac{\sum_{j \in {J}}{a_{ij}^k(n) \zeta_{ij}}}{f_i^k(n)}\right)^2 \end{equation}

    (7)

    where $\beta$ is a positive price factor that converts latency into monetary terms. The total revenue the VSP gains for serving user requests is $C=\sum_{i\in {I}}{\sum_{k\in {K}}{U_i (a_{ij}^k, \zeta_{ij})}}$.

    Operational cost. The operational cost for the VSP mainly arises from the resource rentals charged by cloud providers. We model the charges according to the pricing model of leading commercial cloud providers, such as Amazon EC2 [26] and S3 [27]. The operational costs in each time slot include the following categories:

    1) Storage cost

    The charge for storage at datacenter j is $q_j$ per byte per unit time, and video k occupies $h^k$ bytes of storage space. We assume that the storage capacity in each datacenter is sufficient for storing contents. The total storage cost at datacenter $j, \forall j \in {J}$, for caching video replicas is $\sum_{k \in {K}}{q_j h^k y_j^k (n)}$.

    2) Request service cost

    The price for renting a VM instance in datacenter j is $c_j$ per unit time, and the configured bandwidth of a VM in datacenter j is $w_j$. Let $b^k$ denote the bandwidth occupied by serving a request for video k. To handle the dispatched requests, the VSP needs to rent at least $ \left\lceil\sum_{k\in {K}}{\frac {d_j^k (n) b^k}{w_j}}\right\rceil$ VMs at datacenter j. Thus, the total service cost for streaming videos over all datacenters is $\sum_{j\in {J}}{c_j \left\lceil\sum_{k\in {K}}{\frac {d_j^k (n) b^k}{w_j}}\right\rceil }$.

    3) Content replacement cost

    Let $u_j^k$ denote the migration cost of copying video k into datacenter j. The removal of contents from a datacenter is cost-free. The total migration cost at datacenter $j, \forall j \in {J}$, is $\sum_{k \in {K}}{u_j^k [o_j^k(n)]^+}$, where $[x]^+=x$ if $ x\geq0$ and $[x]^+=0$ if $x< 0$.

    Therefore, the overall reward for the VSP in time slot n is

    $\begin{align} r\big(X(n), V(n)\big)=&\ \alpha \sum\limits_{k\in K}{\sum\limits_{i\in I}{{{U}_{i}}\big(a_{ij}^{k}(n), {{\zeta }_{ij}}\big)}}-\sum\limits_{j\in J}{\sum\limits_{k\in K}{u_{j}^{k}{{\big[o_{j}^{k}(n)\big]}^{+}}}}- \\ &\ \sum\limits_{j\in J}{\sum\limits_{k\in K}{{{q}_{j}}{{h}^{k}}y_{j}^{k}(n)}}-\sum\limits_{j\in J}{{{c}_{j}}\left\lceil \sum\limits_{k\in K}{\frac{d_{j}^{k}(n){{b}^{k}}}{{{w}_{j}}}} \right\rceil } \end{align}$

    (8)

    where $\alpha$ is a positive parameter that sets the preference between the users' QoE and the system's payment to the cloud.
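    As an illustration of how the per-slot reward (8) is assembled, the sketch below (ours; the array shapes and parameter names are assumptions, not the paper's notation in code form) evaluates the latency utility of (7) together with the migration, storage and VM-rental costs.

```python
# Illustrative sketch: the per-slot reward of Eq. (8). Shapes are assumptions:
# F (I,K), A (I,J,K), O/Y/D (J,K), zeta (I,J), q/c/w (J,), h/b (K,), u (J,K).
import math
import numpy as np

def slot_reward(F, A, O, Y, D, zeta, q, h, c, w, b, u, alpha, beta):
    I, J, K = A.shape
    revenue = 0.0
    for i in range(I):
        for k in range(K):
            if F[i, k] > 0:                                  # Eq. (7) for entries with requests
                avg_lat = (A[i, :, k] * zeta[i, :]).sum() / F[i, k]
                revenue += -beta * F[i, k] * avg_lat ** 2
    migration = (u * np.maximum(O, 0)).sum()                 # only copy actions, the [o]^+ cost
    storage = (q[:, None] * h[None, :] * Y).sum()            # q_j * h^k for every stored replica
    vm_rent = sum(c[j] * math.ceil((D[j, :] * b).sum() / w[j]) for j in range(J))
    return alpha * revenue - migration - storage - vm_rent
```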

    Optimization formulation. As described above, the video service system deployed on multiple cloud datacenters can be formulated as a discrete time MDP, which is defined via the quadruplet $\big\langle X, V, p_{v_l}(S_i, S_j), r(X, V)\big\rangle$. In the model, $X(n)$, the joint state of our system in TS n, consists of three components $X(n)=(F(n), D(n), Y(n))$. Since all components take values in finite integer sets, there exists a finite number of possible states, and the set of states is denoted by $S=\{S_1, S_2, \cdots, S_{N_S}\}$, where $N_{S}$ is the number of elements in S. Similarly, the set of actions V is finite, and we have $V(n)\in V=\{v_1, v_2, \cdots, v_{N_V}\}$. Let $p_{v_l}(S_i, S_j)=P\big(X(n+1)=S_j\vert X(n)=S_i, V(n)=v_l \big)$ denote the transition probability that the system transfers to state $S_j$ at TS $n+1$ when the decision maker takes action $v_l$ in state $S_i$ in TS n. The reward $r\big(X(n), V(n)\big)$ represents the profit when the VSP takes action $V(n)$ in state $X(n)$ in TS n.

    We define the average performance measure $\eta=\lim_{L \to \infty}\frac{1}{L}\sum_{n=0}^{L-1}{r\big(X(n), V(n)\big)}$ as the long-run time-averaged reward. A policy specifies a decision rule to be carried out at all decision epochs. In our system, a policy guides the VSP in deciding how to update the video content placement and dispatch requests. In this work, we consider a stationary and deterministic policy $\theta$, which is a mapping from the state space to the action space. Namely, with $V(n)=\theta(X(n))$, the decision maker chooses an action from the action space V only according to the current system state, independently of time and history. Thus, the average performance measure under the policy $\theta$ is as follows:

    \begin{equation} \eta^\theta=\lim_{L \to \infty}\left\{\frac{1}{L}\sum_{n=0}^{L-1}r\big(X(n), \theta(X(n))\big)\right\}. \end{equation}

    (9)

    Thus, the optimization goal for MDP is to find the optimal policy $\theta$ from the policy space $\Pi_s^d$ which maximizes the system average reward over an infinite time horizon:

    \begin{equation} \max_{\theta \in \Pi_s^d}{\eta^\theta}. \end{equation}

    (10)

    As seen from the above analysis, the optimal policy provides an overall decision guide for the VSP to maximize the total reward over a long run, with no need to predict the demand in each TS. However, it is not computationally feasible for our practical system to find the optimal policy by exhaustive search, i.e., by comparing the performance of all the policies. In the next section, we exploit the special structure of the Markov model to develop an efficient optimization approach.

  • In this section, we utilize the sensitivity-based analysis [28] to derive the performance difference formula and the optimal conditions for MDP. Then based on the performance difference, we design a sample-path-based policy iteration algorithm to find the optimal policy. We also discuss its practical implementation.

  • The performance difference formula forms the basis of the optimization, as it provides an approach to compare any two policies. Thus, we can start with any policy, learn from its behavior and find a better policy based on the performance difference formula. Since the policy space is discrete and finite, this procedure can continue to iterate until the best policy is reached, which is called policy iteration. Next, we present the derivation of the performance difference formula.

    We first explain the related notations as follows. In our model, we assume that the decision maker chooses actions at different states $X(n)\in S$ independently. Let $\Pi_s^d$ denote the set of all such admissible stationary and deterministic policies; here we only consider that our Markov system runs under a policy $\theta \in \Pi_s^d$. Let $P^{\theta}=\big[p^\theta (S_j \vert S_i )\big]_{N_S\times N_S}$, $S_i, S_j\in S$, denote the transition probability matrix, the $(i, j)$-th entry of which is the transition probability from state $S_i$ to state $S_j$ under the policy $\theta$. Since $V(n)=\theta(X(n))$, the reward at state $X(n)$ is $r^\theta (X(n))=r(X(n), \theta(X(n)))$. Because the state space is finite and the induced Markov chain is irreducible and aperiodic, $P^\theta$ is ergodic for each $\theta$ in our model. Then, $P^\theta$ has a stationary distribution row vector $\pi^\theta=\left[\pi^\theta(S_1), \pi^\theta(S_2 ), \cdots, \pi^\theta (S_{N_S} )\right]$ satisfying $\pi^\theta P^\theta=\pi^\theta$ and $\pi^\theta e=1$, where $e=[1, \cdots, 1]^{\rm T}$ is an $N_S$-dimensional column vector of ones and superscript T denotes the transpose. Due to the ergodicity, the long-run average reward is independent of the initial state, i.e., $\eta^\theta (S_l )=\eta^\theta$ for all $S_l\in S$, and is equal to $\eta^\theta=\lim_{L \to \infty}{\frac{1}{L}\sum_{n=0}^{L-1}{r\big(X(n), \theta(X(n))\big)}}=\pi^\theta r^\theta$, in which $r^\theta=\big[r^\theta(S_1 ), r^\theta(S_2 ), \cdots, r^\theta(S_{N_S})\big]^{\rm T}$.

    We next introduce the performance potential, which is the central concept of sensitivity analysis in Markov systems. The performance potential of state $S_l \in S$ under the policy $\theta$, $g^\theta(S_l)$, measures the "potential" contribution of state $S_l$ to the long-run average reward $\eta^\theta$. It is defined on a sample path of $P^\theta$ as

    \begin{equation} g^\theta(S_l)={\rm E}\left\{\sum_{n=0}^{\infty}[r^{\theta}(X(n))-{\eta}^{\theta}]\Bigg| X(0)=S_l \right\}. \end{equation}

    (11)

    From this, we can derive the following expression [29]

    $g^\theta(S_l)$ = (contribution of the current state $S_l$) + (expected long-run potential contribution of the next state), i.e.,

    ${{g}^{\theta }}({{S}_{l}})=\big(r({{S}_{l}})-{{\eta }^{\theta }}\big)+\sum\limits_{{{S}_{{{l}^{\prime }}}}\in S}{P({{S}_{{{l}^{\prime }}}}|{{S}_{l}}){{g}^{\theta }}({{S}_{{{l}^{\prime }}}})}.$

    (12)

    This can be written in a matrix form called the Poisson equation:

    $(I-{{P}^{\theta }}){{g}^{\theta }}+{{\eta }^{\theta }}e={{r}^{\theta }}$

    (13)

    where the potential vector $g^\theta=\big[g^\theta(S_1 ), g^\theta (S_2 ), \cdots, g^\theta(S_{N_S} )\big]^{\rm T}$.
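    For a chain small enough that $P^\theta$ and $r^\theta$ are explicitly available, the potentials and the average reward can be obtained directly from the Poisson equation (13). The sketch below (ours, for a toy-sized model only, since the paper's state space is far too large for this) uses the normalization $\pi^\theta g^\theta=0$.

```python
# Illustrative sketch: solving the Poisson equation (13) for a small ergodic chain.
import numpy as np

def solve_poisson(P, r):
    """Return (eta, g) with (I - P) g + eta * e = r under the normalization pi @ g = 0."""
    n = P.shape[0]
    # Stationary distribution: pi (I - P) = 0 with sum(pi) = 1, solved in least-squares form.
    A = np.vstack([(np.eye(n) - P).T, np.ones((1, n))])
    b = np.append(np.zeros(n), 1.0)
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    eta = pi @ r                                             # long-run average reward
    # Potentials via the fundamental-matrix form: (I - P + e pi) g = r - eta e.
    g = np.linalg.solve(np.eye(n) - P + np.outer(np.ones(n), pi), r - eta)
    return eta, g
```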

    Next, we develop the formula for the difference of the performance measure of any two policies and then investigate the optimal policy based on this performance difference formula. Let $P^{\theta^\prime}$ be the state transition probability matrix of the same state space S under another policy $\theta^\prime$. We denote its stationary distribution vector, reward vector and long-time average reward as $\pi^{\theta^\prime}=\big[\pi^{\theta^\prime} (S_1 ), \pi^{\theta^\prime} (S_2 ), \cdots, \pi^{\theta^\prime} (S_{N_S} )\big]_{1\times N_S}^{\rm T}$, $r^{\theta^\prime}=\big[r^{\theta^\prime} (S_1 ), r^{\theta^\prime}(S_2 ), \cdots, r^{\theta^\prime} (S_{N_S} )\big]_{1\times N_S}^{\rm T}$ and $\eta^{\theta^\prime}$, respectively. Then, we have $\eta^{\theta^\prime}=\sum_{l=1}^{N_S}{\pi^{\theta^\prime}(S_l ) r^{\theta^\prime} (S_l) }=\pi^{\theta^\prime} r^{\theta^\prime}$.

    Pre-multiplying both sides of the Poisson equation (13) by $\pi^{\theta^\prime}$ and using $\pi^{\theta^\prime} P^{\theta^\prime}=\pi^{\theta^\prime}$ and $\pi^{\theta^\prime} e=1$, we have $\eta^\theta=-\pi^{\theta^\prime} \big(P^{\theta^\prime}-P^\theta \big) g^\theta+\pi^{\theta^\prime} r^\theta$. Then, we obtain the performance difference formula under two different policies $\theta$ and ${\theta^\prime}$ as follows:

    \begin{equation} \eta^{\theta^\prime}-\eta^\theta=\pi^{\theta^\prime}\big[(P^{\theta^\prime}-P^\theta)g^\theta+(r^{\theta^\prime}-r^\theta)\big]. \end{equation}

    (14)

    If $\eta^{\theta^\prime}>\eta^\theta$, we say that the system performs better under the policy ${\theta^\prime}$ than under the policy $\theta$. Since $\pi(S_l )>0$ for all $S_l\in S$ in the ergodic case, we can compare the long-run average rewards under two policies by comparing $P^{\theta^\prime}g^\theta+r^{\theta^\prime}$ and $P^\theta g^\theta+r^\theta$ component-wise, without needing to solve for the stationary distribution $\pi^{\theta^\prime}$.

    Lemma 1. If

    \begin{equation} P^{\theta^\prime} g^\theta+r^{\theta^\prime} \succeq P^\theta g^\theta+r^\theta \end{equation}

    (15)

    where $\succeq $ denotes $P^{\theta^\prime} g^\theta+r^{\theta^\prime} \geq P^\theta g^\theta+r^\theta$ for each component and at least one $>$ holds, then we have $\eta^{\theta^\prime}>\eta^\theta$.

    According to the above comparison lemma, we can easily obtain the optimality condition.

    Theorem 1. (Optimality condition) A policy $\theta^*$ is optimal if and only if for all $\theta \in \Pi_s^d$, it satisfies the following inequality:

    \begin{equation} P^{\theta^*} g^{\theta^*}+r^{\theta^*} \geq P^\theta g^{\theta^*}+r^\theta. \end{equation}

    (16)

    Thus, we can adopt policy iteration to solve the MDP performance optimization. The basic iteration procedure is that, for any given initial policy $\theta_z$, we can always find a better policy by utilizing the performance potential $g^{\theta_z}$. In the z-th iteration, we improve the performance by choosing $\theta_{z+1}$ such that $\theta_{z+1} \in \arg{\left\{\max_{\theta \in \Pi_s^d}\big[r^\theta+ P^\theta g^{\theta_z }\big]\right\}}$ component-wise (i.e., an action is determined for each state). If in state $S_l$ the action $\theta_z (S_l)$ attains the maximum, set $\theta_{z+1} (S_l )=\theta_z (S_l)$. According to the Poisson equation (13), we have the following optimality equation: the policy $\theta^*$ is optimal if and only if it satisfies

    \begin{equation} g^{\theta^*}+\eta^{\theta^*} e=\max_{\theta \in \Pi_s^d}\left\{r^\theta+ P^\theta g^{\theta^* }\right\}. \end{equation}

    (17)

    Theorem 2. The above standard policy iteration terminates in a finite number of iterations and stops at an optimal deterministic memoryless policy.

    Proof. It is easy to obtain $P^{\theta_{z+1}} g^{\theta_z}+r^{\theta_{z+1}} \succeq P^{\theta_z } g^{\theta_z }+r^{\theta_z}$, i.e., $\eta^{\theta_{z+1}}>\eta^{\theta_z }$, so the average reward improves in every iteration. Since the set of policies is finite, policy iteration finally terminates within a finite number of steps. After z iterations, the policy iteration stops with $\theta^*=\theta_z=\theta_{z+1}$, which satisfies the optimality condition $P^{\theta^*} g^{\theta^*}+r^{\theta^*} \geq P^\theta g^{\theta^*}+r^\theta$ for all $\theta \in \Pi_s^d$.
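    The improvement step described above can be written compactly when all transition matrices and reward vectors are known. The sketch below (ours; it reuses the `solve_poisson` helper from the earlier sketch, and assumes `P` and `r` are dictionaries indexed by action) performs one policy-improvement step $\theta_{z+1}\in\arg\max_\theta\{r^\theta+P^\theta g^{\theta_z}\}$, keeping the old action on ties.

```python
# Illustrative sketch: one exact policy-improvement step for a small MDP.
# P[a] is the (n, n) transition matrix and r[a] the (n,) reward vector of action a.
import numpy as np

def policy_improvement(P, r, theta):
    n = len(theta)
    P_theta = np.array([P[theta[s]][s] for s in range(n)])   # rows induced by the current policy
    r_theta = np.array([r[theta[s]][s] for s in range(n)])
    eta, g = solve_poisson(P_theta, r_theta)                 # potentials of the current policy
    new_theta = list(theta)
    for s in range(n):
        # Maximize r(s, a) + sum_s' p_a(s, s') g(s') component-wise over the actions.
        values = {a: r[a][s] + P[a][s] @ g for a in P}
        best = max(values, key=values.get)
        if values[best] > values[theta[s]] + 1e-12:          # keep theta_z(S_l) when it attains the max
            new_theta[s] = best
    return new_theta, eta
```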

  • We next present the sample-path-based policy iteration algorithm to find the optimal video replacement and request dispatching policy. The above analysis is the intuitive sensitivity-based view of policy iteration for MDPs [30]. A key step of policy iteration is to obtain the performance potentials of all states under the current policy. The standard approach is to solve the Poisson equation (13) via matrix inversion. However, because the state space of our actual system is huge and complex, matrix inversion is often impracticable even when all entries of the transition matrix are known. Since the potential depends only on the current policy $\theta$, not on the improved policy $\theta^\prime$, we resort to a sample path method to estimate the performance potential according to its definition (11). The estimate can be used as an approximation in the policy improvement step to determine an improved policy [28, 30].

    We first introduce the estimation procedure. Define the realization factors $\gamma(S_{l^*}, S_l )=g(S_l )-g(S_{l^* })$. We choose a reference state $S_{l^*}\in S$ and assume that $X(0)=S_{l^*}$. Define

    $\begin{align} & {{\omega }_{0}}({{S}_{{{l}^{*}}}})=0 \\ & {{\omega }_{N}}({{S}_{{{l}^{*}}}})=\min \{\omega :X(\omega )={{S}_{{{l}^{*}}}}, \omega >{{\omega }_{N-1}}({{S}_{{{l}^{*}}}})\}, \\ & N\ge 1. \\ \end{align}$

    The instants $\omega_0 (S_{l^*} ), \omega_1 (S_{l^*}), \cdots, \omega_L (S_{l^*}), \cdots$ are called regenerative points of the Markov chain $X=\{X_\omega, \omega=0, 1, \cdots\}$, and the sample path between $\omega_N (S_{l^*})$ and $\omega_{N+1}(S_{l^*})$ is the N-th regenerative period. Next, for a state $S_l \in S, S_l\neq S_{l^*}$, we define $\omega_N (S_l)=\min\{\omega:\omega>\omega_N (S_{l^*}), X_\omega=S_l \}, N=0, 1, \cdots$, and $\chi_N (S_l )=1$ if $\omega_N (S_l )<\omega_{N+1} (S_{l^*})$, and $\chi_N (S_l )=0$ otherwise. $\chi_N (S_l )$ indicates whether the state $S_l \in S$ occurs in the N-th regenerative period; note that the Markov chain may not visit a given state $S_l$ in a regenerative period. Considering L regenerative periods, if $\chi_N (S_l )=1$, we define

    \begin{equation} \Delta_N(S_{l^*}, S_l)=\sum_{\omega=\omega_N (S_l )}^{\omega_{N+1} (S_{l^*})-1}\big[r(X_\omega)-\bar{\eta}_L\big] \end{equation}

    (18)

    where $\bar{\eta}_L$ is the estimated average performance based on L regenerative periods:

    ${{\bar{\eta }}_{L}}=\frac{\sum\limits_{N=0}^{L-1}{\left[ \sum\limits_{\omega ={{\omega }_{N}}({{S}_{{{l}^{*}}}})}^{{{\omega }_{N+1}}({{S}_{{{l}^{*}}}})-1}{r({{X}_{\omega }})} \right]}}{\sum\limits_{N=0}^{L-1}{[{{\omega }_{N+1}}({{S}_{{{l}^{*}}}})-{{\omega }_{N}}({{S}_{{{l}^{*}}}})]}}=\frac{1}{{{\omega }_{L}}({{S}_{{{l}^{*}}}})}\sum\limits_{\omega =0}^{{{\omega }_{L}}({{S}_{{{l}^{*}}}})-1}{r({{X}_{\omega }})}$

    (19)

    where $\Delta_N(S_{l^*}, S_l)$ is undefined if $\chi_N (S_l )=0$. The occurrence number of state Sl during L regenerative periods is

    \begin{equation} L(S_l )=\sum_{N=0}^{L-1}\chi_N (S_l). \end{equation}

    (20)

    The ergodicity of the Markov model means that the system visits state $S_l$ infinitely often, i.e., $\lim_{L \to \infty}{L(S_l)}=\infty$. Then, we set $g(S_{l^*})=0$ and give the estimated potential of state $S_l$ based on the sample path of L regenerative periods as follows. If $L(S_l )>0$,

    ${{\bar{g}}_{L}}({{S}_{l}})=\frac{1}{L({{S}_{l}})}\sum\limits_{N=0}^{L-1}{{{\chi }_{N}}({{S}_{l}}){{\Delta }_{N}}({{S}_{{{l}^{*}}}}, {{S}_{l}})}\approx g({{S}_{l}})-g({{S}_{{{l}^{*}}}})=g({{S}_{l}})$

    (21)

    $\bar{g}_L (S_l )$ is undefined if $L(S_l )=0$. By the law of large numbers, we have $\lim_{L \to \infty}{\bar{\eta}_L}=\eta$ with Probability 1.

    Theorem 3. As the number of regenerative periods $L \to \infty$, the sample-path-based potential estimate $\bar{g}_L(S_l )$ in (21) converges to its true value $g(S_l )$ with Probability 1.

    The above theorem guarantees the convergence of the estimation procedure. The detailed proof is given in [28]. As the number of regenerative periods L increases, the estimation error decreases while the computational burden grows. The number of regenerative periods L should therefore be chosen carefully to keep the estimation error within an acceptable range.
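    The estimator (18)-(21) only needs an observed trajectory under a fixed policy. The sketch below (ours; `states` and `rewards` are assumed per-slot observations) splits the trajectory at visits to the reference state and averages the accumulated excess reward from the first visit to each state in every regenerative period.

```python
# Illustrative sketch: sample-path estimation of the potentials, Eqs. (18)-(21).
import numpy as np

def estimate_potentials(states, rewards, ref_state, L):
    """Estimate eta and g(s) over L regenerative periods; by convention g(ref_state) = 0."""
    # Regenerative points: the instants at which the chain visits the reference state.
    regen = [t for t, s in enumerate(states) if s == ref_state]
    assert len(regen) > L, "trajectory must contain at least L+1 visits to the reference state"
    eta_hat = float(np.mean(rewards[regen[0]:regen[L]]))      # Eq. (19)
    sums, counts = {}, {}
    for N in range(L):
        start, end = regen[N], regen[N + 1]
        seen = set()
        for t in range(start, end):
            s = states[t]
            if s in seen:                                     # only the first visit in this period
                continue
            seen.add(s)
            delta = sum(rewards[t:end]) - eta_hat * (end - t) # Eq. (18)
            sums[s] = sums.get(s, 0.0) + delta
            counts[s] = counts.get(s, 0) + 1
    return eta_hat, {s: sums[s] / counts[s] for s in sums}    # Eq. (21)
```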

    We devise a sample-path-based policy iteration algorithm with a fixed number of regenerative periods L as follows. Because of the estimation error in $\bar{g}_L^\theta$, instead of using the exact maximum of $P^\theta g^\theta+r^\theta$ in the policy improvement step, it is more appropriate to accept policies within a small region of the maximum. We first introduce new definitions to simplify the notation. For any $N_S$-dimensional vector g and a small positive number $\delta>0$, we set $^1$

    \begin{equation} \Omega_\delta(g)=\left[\max_{\theta \in \Pi_s^d}(P^\theta g+r^\theta)-\delta e, \max_{\theta \in \Pi_s^d}(P^\theta g+r^\theta)\right] \end{equation}

    (22)

    and define

    \begin{equation} \psi(g)=\left\{\theta: P^\theta g+r^\theta \in \Omega_\delta(g) \right\} \end{equation}

    (23)

    as the set of improved policies. The algorithm is as follows.

    Algorithm 1. Sample-path-based policy iteration algorithm with a fixed L

    1) Choose an integer $L>0$, a real number $\delta>0$, and an initial policy $\theta_0$, set $z=0$.

    2) Observe the system under policy $\theta_z$ to obtain a sample path with L regenerative periods. Estimate the potentials using (21). Denote the estimate as $\bar{g}_L^{\theta_z}$. (Set $\bar{g}_L^{\theta_z}(S_l )=\bar{g}_L^{\theta_{z-1}}(S_l)$ if $L_z (S_l )=0$, where $L_z (S_l )$ is the $L(S_l )$ in (20) in the z-th iteration, with $\bar{g}_L^{\theta_{-1}}=0$.)

    3) Choose any policy $\theta_{z+1} \in \psi(\bar{g}_L^{\theta_z})$ component-wise. If at a state $S_l$ the action $\theta_z (S_l)$ is in the set $\psi(\bar{g}_L^{\theta_z})$, then set $\theta_{z+1} (S_l )=\theta_z (S_l)$; otherwise, choose randomly in $\psi(\bar{g}_L^{\theta_z})$.

    4) If $\theta_{z+1}=\theta_z$, then stop. Otherwise, set $z=z+1$ and return to 2).

    By Theorem 3, we can obtain convergent estimates of the performance potentials. When L is large enough that the estimation error satisfies certain bound constraints, the average performance is guaranteed to improve through step 3) in each iteration. Since the set of policies is finite, policy iteration finally terminates within a finite number of steps. The above discussion shows that our policy iteration algorithm converges to the optimal policy within a finite number of steps.
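    A skeleton of Algorithm 1, combining the estimator above with the $\delta$-relaxed improvement set $\psi$, might look as follows. This is our sketch only: `simulate`, `expected_value`, `STATES`, `ACTIONS` and `REF_STATE` are hypothetical helpers standing in for the system simulator, the one-step value $r(s, a)+\sum_{s^\prime}p_a(s, s^\prime)g(s^\prime)$, the state and action spaces, and the reference state.

```python
# Illustrative skeleton of Algorithm 1; simulate(), expected_value(), STATES, ACTIONS
# and REF_STATE are hypothetical placeholders, not part of the paper.
def sample_path_policy_iteration(theta0, L, delta, max_iters=100):
    theta, g_prev = dict(theta0), {}
    for _ in range(max_iters):
        # Step 2): run the system under theta and estimate potentials from the sample path.
        states, rewards = simulate(theta, num_regen_periods=L)
        _, g_hat = estimate_potentials(states, rewards, REF_STATE, L)
        g_hat = {**g_prev, **g_hat}                 # reuse old estimates for unvisited states
        # Step 3): component-wise improvement within the delta-region psi of Eq. (23).
        new_theta = {}
        for s in STATES:
            best = max(expected_value(s, a, g_hat) for a in ACTIONS[s])
            good = [a for a in ACTIONS[s] if expected_value(s, a, g_hat) >= best - delta]
            new_theta[s] = theta[s] if theta[s] in good else good[0]
        # Step 4): stop at a fixed point, otherwise iterate.
        if new_theta == theta:
            return theta
        theta, g_prev = new_theta, g_hat
    return theta
```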

  • We next discuss the practical implementation of our algorithm. The VSP constructs a central controller that periodically collects system state information and executes the dynamic algorithm to implement a series of decision actions. The controller selects an initial policy and starts to run the system. In each time slot, the VSP controller counts the new arrival requests $f_i^k$. The lengths of the serving queues $d_j^k$ are observed from the cloud log in real time. The controller maintains a video placement state table with entries $y_j^k$.

    $^1$For any two $N_S$-dimensional vectors a and b with $a<b$, we use $[a, b]$ to denote an $N_S$-dimensional array of intervals $[a, b]=([a(1), b(1)], [a(2), b(2)], \cdots, [a(N_S), b(N_S)])$. An $N_S$-dimensional vector $c \in [a, b]$ means that $c(i)\in [a(i), b(i)]$ for all $i=1, 2, \cdots, N_S$.

    Thus, we can obtain and execute the corresponding dispatching actions $a_{ij}^k$ and video replacement actions $o_j^k$ according to the current policy $\theta_z$. The controller updates all the state information with the natural transition of the system (service arrival and departure) and the execution of the corresponding actions.

    The controller can acquire a sample path by running in this way for a period of time, and then execute Algorithm 1 until the optimal policy is found. The VSP controller can learn the optimal policy $\theta^*$ from a sample path generated by a trial run in advance. Once the optimal policy is acquired, the VSP can directly execute the corresponding actions according to the current system state without predicting future demands. The offline policy can be relearned periodically to cope with new situations.

    The request behavior of online video streaming exhibits a daily pattern [31, 32]. The fluctuations in different parts of the day make it inappropriate for the VSP to adopt the same strategy all the time. We can solve for a particular strategy for each request pattern offline, and the decision maker then simply switches to the strategy corresponding to the current part of the day. For our system, only a fraction of the states are "significant". Extreme states, such as peaks or troughs of the queues, are uncommon and contribute little to the long-run average reward. Therefore, we pick out the significant states and estimate only their potentials under the current policy.

    Moreover, we impose several complementary constraints on our model based on practical considerations, which largely reduce the complexity of the state space and policy space. The constraints specify the interaction among the available values of the actions $o_j^k (n)$, $a_{ij}^k (n)$ and the corresponding placement state $y_j^k (n)$ and serving queue state $d_j^k (n)$. Since streaming video files are nowadays transcoded and split into segments, a datacenter can start to deliver video segments after receiving a small portion of the file. Although it takes some time to upload the entire video file, the dispatching action can be carried out with only a slight lag after replication of the segments to the datacenter begins. Our model conforms to this situation: if $y_j^k (n)=0$, $a_{ij}^k (n)=0$ and $o_j^k (n)=1$, then $y_j^k (n+1)=1$ and $a_{ij}^k (n+1)\geq 0$, which indicates that once the replicating action is executed, the corresponding placement state transfers to the new state immediately. In addition, considering that the removal time can be neglected, we impose the constraint that if $o_j^k (n)=-1$, then $a_{ij}^k (n)=0, \forall i$, which avoids the conflict of user requests still being dispatched to a video that has been removed from the datacenter ($o_j^k (n)=-1$ implies $y_j^k (n+1)=0$). Moreover, the removal action is permitted only when the serving queue is empty at the current time slot: if $y_j^k (n)=1$ and $d_j^k (n)=0$, then $o_j^k (n)\leq 0$; otherwise, if $y_j^k(n)=1$ and $d_j^k (n)>0$, then $o_j^k (n)=0$. A small feasibility check capturing these rules is sketched below.
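    A minimal sketch of the resulting feasibility test for one (datacenter, video) pair (ours; the function name and argument layout are illustrative, and `a_sum` denotes the total requests dispatched to that pair in the slot):

```python
# Illustrative sketch: feasibility of a joint (dispatch, replacement) choice for one
# (datacenter, video) pair; y is the placement indicator, d the serving-queue length,
# o the replacement action, a_sum the total requests dispatched to this pair.
def action_is_feasible(y, d, o, a_sum):
    if y == 0 and a_sum > 0:          # requests only go to datacenters storing the video
        return False
    if y == 0 and o == -1:            # nothing to remove
        return False
    if y == 1 and o == 1:             # at most one replica per datacenter
        return False
    if o == -1 and a_sum > 0:         # never dispatch together with a removal
        return False
    if y == 1 and d > 0 and o == -1:  # removal only when the serving queue is empty
        return False
    return True
```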

  • In this section, we conduct simulations to evaluate the performance of our optimal algorithm. The detailed design of the simulations and the results are specified as follows.

  • We simulate a cloud service provider which hosts three geographically distributed datacenters, and a video service provider which constructs its streaming service system on the cloud datacenters to distribute videos to users spread over six regions. We set the total number of videos as 1 000 and the size of each video as 100 Mb. The video playback rates follow a uniform distribution between 300 Kbps and 1 200 Kbps. The VM bandwidth occupied by streaming a video is considered roughly equivalent to the playback rate of the video. For simplicity, we adopt homogeneous VMs in all the datacenters and the configured bandwidth of a VM is uniformly set as 10 Mbps.

    Since the expected number of arrival requests is proportional to the average arrival rate, the diversity of the videos' popularity can be directly reflected in the average arrival rates. Assuming that the relative access frequency of videos follows a Zipf-like distribution with parameter $s=1.05$ [32], the average arrival rate $\lambda_k$ is set within the range $\lambda_k \in [0.0005, 0.71]$ and in proportion to the access frequency of video k. Further, the requests from different regions can be regarded as independent, and the preferences of different regions are reflected in slightly different popularity ranks. In our experiment, we generate the arrival trace for each region independently, as sketched below.
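    A sketch of this trace generation (ours; the random seed, the slot count and the full per-region rank shuffle are assumptions, the shuffle being only a crude stand-in for the "slightly different ranks" mentioned above):

```python
# Illustrative sketch: Zipf-like popularity and independent per-region Poisson arrival traces.
import numpy as np

rng = np.random.default_rng(0)
K, I, slots = 1000, 6, 360                               # videos, regions, 60 min of 10 s slots
freq = 1.0 / np.arange(1, K + 1) ** 1.05                 # Zipf-like access frequency, s = 1.05
lam = 0.0005 + (0.71 - 0.0005) * freq / freq.max()       # arrival rates scaled into [0.0005, 0.71]
# Each region gets its own (shuffled) popularity ranking and an independent Poisson trace.
arrivals = np.stack([rng.poisson(rng.permutation(lam), size=(slots, K)) for _ in range(I)])
# (In the experiments below, arrivals are additionally capped at the bound M.)
```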

    The maximum number of new arrival requests $f_i^k (n)$ is bounded by $M=2$, so the feasible values of $f_i^k (n)$ are 0, 1, 2. Due to the request conservation constraint $\sum_j{a_{ij}^k (n)}=f_i^k (n)$, the dispatching action $a_{ij}^k (n)$ takes values in the set $\{0, 1, 2\}$. The maximum number of requests for video k being served by datacenter j in TS n, denoted as T, is 5. The departure rates are all set to the same value $\mu=0.75$. The expected number of departing requests is smaller than the upper bound of the serving queue, which indicates that the queue may not be emptied for hot videos during periods of peak demand. The latency $\zeta_{ij}$ between an arbitrary pair of user region and datacenter is randomly generated between 10 ms and 100 ms. The revenue function is the quadratic function in (7) with parameter $\beta=1\times 10^{-8}$. That is, serving a request in one time slot with 10 ms latency translates to a revenue loss of $1\times 10^{-6}$ for the video provider.

    We emulate a cost mechanism as follows. The prices of renting a VM per hour are set as \$0.07, \$0.05 and \$0.10 in the three datacenters, on the basis of Amazon EC2's pricing models [26]. According to the real settings of Amazon S3 [27], we set the charges for storage at the datacenters per byte per hour in the range $[\$ 3.82 \times 10^{-11}, \$ 6.75 \times 10^{-11}]$. The cost of migrating a video file mainly comprises the bandwidth charge for uploading it from the origin server. We set $u_j^k \in [\$ 0.85 \times 10^{-4}, \$ 1.55 \times 10^{-4}]$.

    From a practical standpoint, the duration of a time slot is set to ten seconds after some trials. All state entries are initialized to 0. After configuration, we run the experiments for 60 minutes.

  • We find that it takes considerable time and effort to estimate the performance potentials and calculate the optimal actions for all the videos one by one. Many videos of similar popularity present similar request patterns. Further, the average arrival rates of many less popular videos are far less than 1, i.e., $0<\lambda_k \ll 1$, which indicates that their number of arriving requests is 0 in most time slots. These two observations inspire us to treat similar videos as a group so as to reduce the dimensionality. Fig. 2 shows the k-means clustering [33] results based on the arrival count of each video over a 60-minute period. In this way, we reduce 1 000 videos to 12 clusters, which significantly alleviates the computational complexity. We then execute the same dispatching and replication actions for all videos in the same cluster, as sketched after Fig. 2.

    Figure 2.  Clustering based on the popularity
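    A sketch of this clustering step (ours; it uses scikit-learn and the `arrivals` array from the earlier trace sketch, and clusters on total arrival counts as a simple feature, which may differ from the exact feature used in Fig. 2):

```python
# Illustrative sketch: grouping the 1 000 videos into 12 popularity clusters.
import numpy as np
from sklearn.cluster import KMeans

per_video = arrivals.sum(axis=(0, 1)).reshape(-1, 1)     # total arrivals of each video over 60 min
labels = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(per_video)
# Videos sharing a label then share the same dispatching and replication actions.
```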

  • Since our reward function, defined as the latency-based revenue loss minus the costs, always takes negative values, we add a positive constant H to the reward function to make it positive for convenience of presentation. H is set as 0.10 and the preference parameter $\alpha$ is set as 5.

    We first present the convergence of the policy iteration. We set the number of regenerative periods L to 100 after some trials, which guarantees that the estimation error of the performance potentials is below the desired bound while avoiding overly long sample paths. As Fig. 3 illustrates, the iteration procedure stops within about 55 iterations. The rapid convergence of our policy iteration algorithm makes it feasible for the VSP to relearn a new optimal policy when the current one can no longer cope with new situations.

    Figure 3.  Policy iteration procedure

    We next present the performance improvement of our proposed algorithm by comparing it with two algorithms: a simple Snatch scheduling algorithm and an existing algorithm for resource allocation and request routing based on the Nash bargaining solution [5], abbreviated as the NBS algorithm.

    The Snatch scheduling algorithm is a greedy strategy which only considers maximizing the reward (8) of the current single time slot. It dispatches all requests and decides video replication without considering future system dynamics or long-run average reward maximization. The NBS algorithm predicts the user demand using the time series method [19], takes the prediction as the input to solve the bandwidth allocation and request dispatching strategy based on the Nash bargaining solution, and places videos with a greedy placement algorithm based on the solved bandwidth allocation [5]. To compare the average performance, we run this greedy placement algorithm periodically and add the video migration cost to the total cost.

    Fig. 4 illustrates the reward comparison among our MDP algorithm, the Snatch scheduling algorithm and the NBS algorithm at each time slot. It can be observed that the reward obtained by our MDP algorithm is higher than those of the Snatch scheduling algorithm and the NBS algorithm at all times. The results indicate that our optimal algorithm achieves a long-run performance improvement, which is embodied in two aspects: 1) our algorithm adjusts the resource allocation and video placement with foresight, largely reducing the migration cost; 2) our algorithm wisely dispatches requests to nearby datacenters where the corresponding videos are placed, lowering the average latency. The reason is easy to see: the Snatch scheduling algorithm only focuses on the current state, and this myopia incurs a large content migration cost; although the NBS algorithm also aims to achieve a good trade-off between latency and cost, its greedy placement strategy cannot guarantee the optimality of the total rewards over a long run.

    Figure 4.  Performance comparison among the algorithms

  • In the reward function (8), the preset parameter $\alpha$ represents the optimization preference between system costs and revenues. The VSP's revenues are determined by the users' QoE, i.e., the latency perceived by users. We can change the value of $\alpha$ to adjust the preference of the optimization objective. Fig. 5 illustrates the impact of $\alpha$ in our algorithm. The average latency per request becomes smaller as $\alpha$ increases. This is because the system performance becomes more sensitive to latency, so the optimal strategy would rather increase the cost to lower the latency. Further, the results reveal that our optimal policy with $\alpha=10$ can keep the average latency around 30 ms, and we can improve the users' QoE by increasing the value of $\alpha$. Conversely, a smaller $\alpha$ indicates that the VSP is more concerned about the high price of resources; in this situation, the optimal strategy dispatches requests to the cheaper datacenters at the expense of increased latency.

    Figure 5.  Average latency with different $\alpha$

  • In this paper, we study the optimal deployment of video streaming services onto multiple geographically distributed datacenters. We formulate the joint optimization problem as an MDP model and obtain the optimal video placement and request dispatching strategy by applying sensitivity analysis and sample-path-based policy iteration approaches. Relying on the optimal dispatching and replacement actions, most users can be served from nearby datacenters, which guarantees users' quality of experience, and meanwhile the VSP can largely cut down the charges for renting cloud resources. We theoretically validate the optimality of our algorithm, and the experimental evaluation also verifies its effectiveness.
