Abstract

This paper proposes a DRL-based cache content update policy for cache-enabled networks to improve the cache hit ratio and reduce the average latency. In contrast to existing policies, a more practical cache scenario is considered in this work, in which the content requests vary in both time and location. Considering the constraint of limited cache capacity, the dynamic content update problem is modeled as a Markov decision process (MDP), and the deep Q-learning network (DQN) algorithm is utilised to solve it. Specifically, a neural network is trained to approximate the Q-value, with the training data drawn from an experience replay memory, and the DQN agent derives the optimal policy for the cache decision. Compared with the existing policies, the simulation results show that our proposed policy improves the cache hit ratio by 56%–64% and reduces the average latency by 56%–59%.

1. Introduction

The recent rapid evolution of mobile communication techniques and the proliferation of smart mobile devices have caused exponential growth in mobile network traffic [1, 2]. According to Cisco [3], global mobile network traffic will reach 77 exabytes per month by 2022, which will lead to data traffic congestion on the backhaul [4]. To mitigate this, the cache-enabled network has emerged as an effective technique for alleviating data traffic congestion [5]. In a cache-enabled network, a portion of the popular content is cached at the edge of the network, at base stations (BSs) or user terminals (UTs), where users can directly access and download the cached content from the edge rather than from the core network via backhaul links. Consequently, data traffic congestion on the backhaul can be reduced, and content retrieval from the edge is faster than from the remote core network [6, 7].

However, because of the limited cache capacity, it is necessary to update the cached content to ensure that cache-enabled networks always store the most popular content [8]. The two most common content update policies are the least frequently used (LFU) policy and the least recently used (LRU) policy [9]: LRU retains the content with the most recent access times, and LFU retains the content with the largest cumulative request counts. In [10], a heterogeneous cache structure was proposed, in which the most popular contents are stored at small BSs and the less popular contents are stored at macro BSs; the combination of small BSs and macro BSs can maximise the network capacity and satisfy the content transmission demand. In [11], an optimal cooperative cache policy that can increase the cache hit ratio was presented, where the cache hit ratio describes how frequently the requested content is found in the local cache. In [9], an adaptive cache policy was proposed that can reduce user access latencies. In [12], an edge cache policy was proposed to reduce the average content delivery latency. However, these conventional methods lack adaptability in dynamic cache scenarios, because they assume that the content popularity distribution is known or can be accurately predicted, which is difficult to achieve in dynamic caching scenarios. With an inaccurate distribution of content popularity, the conventional methods perform poorly, since their performance is highly dependent on an accurate distribution of content popularity.

Motivated by the success of deep reinforcement learning (DRL) in solving dynamic problems [13], DRL has been applied to cache policies to improve the cache performance in dynamic cache scenarios. In [14], a DRL approach was proposed to reduce the transmission cost by jointly considering proactive caching and content recommendation. In [15], a cache content update policy based on DRL was proposed to improve energy efficiency. In [16], a DRL model was utilised to minimise transmission latencies; specifically, reinforcement learning (RL) was applied to obtain the optimal cache policy. In [17], a DRL-based policy was proposed to minimise system power consumption. In [18], the deep Q-learning network (DQN) algorithm, one branch of DRL, was applied to make network slicing decisions and allocate the spectrum resources for content delivery. In [19], a DQN-based mobile edge computing network was proposed, in which several computation tasks are offloaded from the user terminals to the computational access points. Although DQN has attracted significant attention in cache-enabled networks, very little work has applied DQN to the cache content update phase. Moreover, most of the previously mentioned DRL-based cache policies treat the content requests as varying only in time; they do not consider more practical scenarios in which the content requests vary in both time and location, also known as spatiotemporally varying scenarios.

Inspired by the aforementioned literature, in this paper, a DQN-based content update policy at BSs is proposed to increase the cache hit ratio and reduce the average latency while considering spatiotemporally varying scenarios in which content requests vary by both time and location. The reasons for applying DQN are as follows: (1) DQN has a faster convergence speed than conventional DRL policies, e.g., advantage actor-critic (A2C) and deep deterministic policy gradient (DDPG) [14]. (2) DQN can adapt to varying scenarios, as long as the dynamic problem is correctly modeled and the DQN agent is allowed to continuously learn from the environment [18]. The main contributions are summarised as follows:
(i) The dynamic cache content update problem is formulated as a Markov decision process (MDP) problem, which is solved by a DQN algorithm. Specifically, a neural network is utilised to approximate the Q-value, and the DQN agent decides whether or not to cache the requested content.
(ii) Our proposed policy is compared with the LRU, LFU, and DRL [20] policies, and the simulation results demonstrate that our proposed policy has the best cache performance in terms of the cache hit ratio and the average latency.

The rest of this paper is organised as follows. The system model and problem formulation are introduced in Section 2. The detailed elements of the MDP framework and the principles of the DQN-based cache content update policy are discussed in Section 3. The simulation results are shown in Section 4, and the conclusion is provided in Section 5.

2. System Model and Problem Formulation

In this section, the system model is introduced, and the problems of maximising the cache hit ratio and minimising the average latency are formulated.

2.1. System Model

As shown in Figure 1, the cache-enabled system includes one core network, $N$ cache-enabled BSs, and mobile users. Each BS can cache at most $C$ contents. The total content library contains $F$ kinds of contents, and each content has the same size $s$. The core network is assumed to have enough capacity to store the entire content library. Each BS covers a circular cellular region with a fixed radius, and all of the mobile users in its cellular region connect to the serving BS (the BS to which a user is attached). Mobile users can directly retrieve their requested content from the serving BS if the content is cached locally (the requested content is already stored at the serving BS); otherwise, the requested content must be retrieved from the core network. The $n$th BS is regarded as a DQN agent and receives the spatiotemporal content requests $\{x_1^n, x_2^n, \ldots, x_t^n\}$, where $x_t^n$ is the current content request at the $n$th BS. From the received content requests, the DQN agent decides when and where (at which BS) to cache the content or not. If the content is cached, the DQN agent further decides which cached content is replaced by the currently requested content; otherwise, the cached contents remain the same. The action space of the $n$th BS is defined as $A^n = \{a_0, a_1, \ldots, a_C\}$ and uses one-hot encoding: $a_0$ means that the cached contents remain the same, and $a_i$ means that the $i$th cached content is replaced by the currently requested content, where $i \in \{1, 2, \ldots, C\}$. In summary, at each time slot $t$, each BS receives numerous content requests, including the users' preferred contents and location information, and each DQN agent executes one action from the corresponding action space to maximise the cache hit ratio and minimise the average latency.
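To make the cache state and the one-hot action semantics concrete, a minimal Python sketch is given below; the class and variable names are illustrative and not taken from the paper.

class BSCache:
    """Cache of a single base station: action a_0 keeps the cached contents,
    action a_i (i >= 1) replaces the i-th cached content with the request."""

    def __init__(self, capacity):
        self.capacity = capacity   # C: maximal number of cached contents
        self.contents = []         # currently cached content IDs

    def is_hit(self, request):
        return request in self.contents

    def is_full(self):
        return len(self.contents) >= self.capacity

    def apply_action(self, action, request):
        if action == 0:
            return                           # a_0: keep the cache unchanged
        self.contents[action - 1] = request  # a_i: replace the i-th content

# usage: a BS with capacity 2, already full, replacing its first entry
bs = BSCache(capacity=2)
bs.contents = [7, 9]
bs.apply_action(action=1, request=42)
print(bs.is_hit(42), bs.contents)            # True [42, 9]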

2.2. Problem Formulation

The problem in this study consists of two subproblems: maximising the cache hit ratio and minimising the average latency.

2.2.1. Maximising the Cache Hit Ratio

The cache hit ratio describes the probability that the requested content is found in the local cache. For $R$ content requests, the system cache hit ratio is formulated as

$H = \frac{1}{R} \sum_{r=1}^{R} \mathbb{1}(x_r),$

where $\mathbb{1}(\cdot)$ is a function that tests whether the requested content $x_r$ is cached locally. The definition of $\mathbb{1}(x_r)$ is as follows:

$\mathbb{1}(x_r) = \begin{cases} 1, & x_r \text{ is cached at the serving BS}, \\ 0, & \text{otherwise}. \end{cases}$

Maximising the cache hit ratio is expressed as

$\max \; H \quad \text{s.t.} \quad |c^n| \le C, \; \forall n,$

where $c^n$ denotes the set of contents cached at the $n$th BS and $C$ is the cache capacity of each BS.
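As a check on the definition above, a short Python sketch that computes the cache hit ratio over a trace of requests is given below (the function and variable names are illustrative).

def cache_hit_ratio(requests, cached):
    """Fraction of the R requests found in the local cache.

    requests: iterable of requested content IDs
    cached:   set of content IDs currently stored at the serving BS
    """
    hits = sum(1 for x in requests if x in cached)
    return hits / len(requests)

# usage
print(cache_hit_ratio([1, 2, 3, 2, 5], cached={2, 3}))   # 0.6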

2.2.2. Minimising the Average Latency

Latency is another indicator of the cache content update policy's performance; it is the time taken to transmit content from one location to another. The latency consists of the transmission latency $T_{\mathrm{tr}}$, propagation latency $T_{\mathrm{pr}}$, processing latency $T_{\mathrm{pc}}$, and queueing latency $T_{\mathrm{q}}$. From [20], the latency is expressed as

$T = T_{\mathrm{tr}} + T_{\mathrm{pr}} + T_{\mathrm{pc}} + T_{\mathrm{q}}.$

Normally, in the content update process, the destination of a content packet is known in advance, and the packet is assumed not to wait for transmission. Hence, the processing and queueing latencies can be neglected during the content update process [20, 21], and the latency simplifies to

$T = \frac{s}{v} + \frac{d}{d_{\max}} T_{\mathrm{pr}}^{\max},$

where $s$ is the content size, $v$ is the content transmission rate, $d_{\max}$ is the maximal coverage radius of the serving BS or of the core network, $d$ is the distance between the user and the serving BS or between the serving BS and the core network, and $T_{\mathrm{pr}}^{\max}$ is the maximal propagation latency between the user and the serving BS or between the serving BS and the core network. To meet the latency requirement of fifth-generation (5G) communication [22], the maximal propagation latency is set as

$T_{\mathrm{pr}}^{\max} = \begin{cases} T_1, & \text{between the user and the serving BS}, \\ T_2, & \text{between the serving BS and the core network}, \end{cases}$

where $T_1$ is the maximal propagation latency between the user and the serving BS and $T_2$ is the maximal propagation latency between the serving BS and the core network.

In more detail, if the requested content is cached locally, the content can be retrieved directly from the serving BS. Thus, for a hit content request, we consider the maximal propagation latency $T_1$ between the user and the serving BS, the distance $d_1$ between the user and the serving BS, and the maximal coverage radius $d_{\mathrm{BS}}$ of the serving BS. The latency of a hit content request is defined as

$T^{\mathrm{hit}} = \frac{s}{v} + \frac{d_1}{d_{\mathrm{BS}}} T_1.$

If the requested content is missed at the serving BS, the serving BS needs to first retrieve the requested content from the core network and then deliver it to the corresponding user. Hence, for a missed content request, we consider the maximal propagation latency $T_1$ between the user and the serving BS, the maximal propagation latency $T_2$ between the serving BS and the core network, the distance $d_1$ between the user and the serving BS, the distance $d_2$ between the serving BS and the core network, the maximal coverage radius $d_{\mathrm{BS}}$ of the serving BS, and the maximal coverage radius $d_{\mathrm{CN}}$ of the core network. The latency of a missed content request is defined as

$T^{\mathrm{miss}} = \frac{s}{v} + \frac{d_1}{d_{\mathrm{BS}}} T_1 + \frac{s}{v} + \frac{d_2}{d_{\mathrm{CN}}} T_2.$

The system latency is the sum of the latencies of all of the hit content requests and all of the missed content requests, and the average latency is the system latency divided by the number of content requests $R$. The system latency $T_{\mathrm{sys}}$ and the average latency $T_{\mathrm{avg}}$ are defined as

$T_{\mathrm{sys}} = \sum_{r \in \mathrm{hit}} T^{\mathrm{hit}}_r + \sum_{r \in \mathrm{miss}} T^{\mathrm{miss}}_r, \qquad T_{\mathrm{avg}} = \frac{T_{\mathrm{sys}}}{R}.$

The problem of minimising the average latency can be formulated as

$\min \; T_{\mathrm{avg}} \quad \text{s.t.} \quad |c^n| \le C, \; \forall n.$
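A minimal Python sketch of the hit/miss latency model above is given below, following the reconstructed latency expressions; the numeric values in the usage line are placeholders (the actual parameters appear in Section 4).

def propagation(distance, max_radius, max_prop_latency):
    """Propagation latency scaled by distance relative to the maximal radius."""
    return (distance / max_radius) * max_prop_latency

def request_latency(hit, s, v, d1, d_bs, t1, d2=0.0, d_cn=1.0, t2=0.0):
    """Latency of one request: local delivery only on a hit,
    core-network retrieval plus local delivery on a miss."""
    local = s / v + propagation(d1, d_bs, t1)
    if hit:
        return local
    return local + s / v + propagation(d2, d_cn, t2)

# usage: a 2,000-bit content over a 35 Mbit/s link
print(request_latency(hit=True, s=2000, v=35e6, d1=100, d_bs=100, t1=1e-3))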

3. A Deep Q-Learning Network-Based Cache Content Update Policy

The related elements of the deep Q-learning network will be introduced in Section 3.1. The principle of the DQN algorithm and the workflow of our proposed cache policy will be provided in Section 3.2.

3.1. The Description of the Related Elements of the Deep Q-Learning Network

The decision process underlying the DQN can be modeled as a Markov decision process (MDP) [23, 24]. To apply the DQN to the cache content update problem, the related notation of the MDP framework is described below.

3.1.1. State Space

In time slot $t$, the instant state consists of the currently cached contents, the currently requested content and its corresponding user, the user's next location, and the current time. The current instant state of the $n$th DQN agent is defined as

$s_t^n = \{c_t^n, x_t^n, u_t, l_{t+1}^u, t\},$

where $c_t^n$ denotes the contents cached at the $n$th DQN agent, $x_t^n$ is the currently requested content, $u_t$ is the identity of the mobile user currently requesting the content, $l_{t+1}^u$ is the next location of the $u$th user, $u \in \{1, 2, \ldots, U\}$, and $n \in \{1, 2, \ldots, N\}$.

The state space is the set of all of the instant states over a time period. It is defined as

$S = \{s_1^n, s_2^n, \ldots, s_T^n\}.$

3.1.2. Action Space

In each time slot $t$, the $n$th DQN agent decides whether or not to cache the currently requested content. If so, the DQN agent decides which cached content is replaced by the currently requested content; otherwise, the cached contents remain the same. The action space of the $n$th DQN agent is defined as

$A^n = \{a_0, a_1, \ldots, a_C\},$

where $A^n$ uses one-hot encoding, meaning that only one action can be executed in a time slot. In this study, $a_0$ means that the cached contents remain the same and $a_i$ means that the $i$th cached content is replaced by the currently requested content, where $i \in \{1, 2, \ldots, C\}$ and $C$ is the maximal cache capacity of the $n$th BS.

3.1.3. Reward and Value Functions

The reward $r_t$ reflects the instant cache hit in time slot $t$: $r_t = 1$ when the currently requested content is hit in the next state $s_{t+1}$; otherwise, $r_t = 0$. The policy $\pi(a \mid s)$ is a mapping that gives the probability of executing action $a$ in the current state $s$, with $\sum_{a \in A^n} \pi(a \mid s) = 1$. The MDP evaluates and optimises the policy based on the value function, which is defined as the expected cumulative discounted reward received over the entire process when following the policy $\pi$ [25]. There are two value functions: the state value function and the state-action value function. The state value function is the expected discounted cumulative reward from the current state when the agent follows the policy:

$V_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s\right].$

The state-action value function is the expected discounted cumulative reward from the current state and action when the policy is followed thereafter:

$Q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s, a_t = a\right],$

where $\gamma \in [0, 1)$ is a discount factor that weights the future rewards relative to the current state $s$. The target of the MDP is to find the optimal policy $\pi^*$ and the corresponding optimal value function that maximise the expected return.
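For reference, the two value functions are related through the standard Bellman expectation equations (a standard reinforcement-learning identity, stated here for completeness):

$V_\pi(s) = \sum_{a \in A^n} \pi(a \mid s)\, Q_\pi(s, a), \qquad Q_\pi(s, a) = \mathbb{E}\!\left[r_t + \gamma V_\pi(s_{t+1}) \,\middle|\, s_t = s, a_t = a\right].$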

3.2. The Cache Content Update Based on the Deep Q-Learning Network
3.2.1. Principle of the DQN Framework

DQN is an effective hybrid framework combining neural networks and Q-learning. In this framework, a neural network is applied to predict the Q-values rather than recording them in a table. However, the plain combination of Q-learning and a neural network alone is not efficient. The following two characteristics improve the DQN framework's efficiency (a minimal sketch of both mechanisms follows this list).
(i) The DQN has two neural networks with the same structure but different parameters: the evaluation network and the target network. The parameters of the evaluation and target networks are denoted $\theta$ and $\theta^-$, respectively. The evaluation network uses the latest parameter $\theta$ to predict the current state-action value $Q(s_t, a_t; \theta)$, where $\theta$ is updated in each iteration. The target network uses the parameter $\theta^-$ to predict the next state-action value $Q(s_{t+1}, a; \theta^-)$, where $\theta^-$ is updated only after a period of time. The target network breaks the correlation between the predicted value and the target value, which makes the DQN easier to converge.
(ii) The DQN has an experience replay memory with limited capacity. The current state $s_t$, action $a_t$, reward $r_t$, and next state $s_{t+1}$ are stored as experiences in the format $(s_t, a_t, r_t, s_{t+1})$. Once the memory is full, newly received experiences replace the earliest ones. During the training stage, the training data are randomly sampled from the experience replay memory. The random sampling breaks the correlation between experiences, which alleviates the neural network's overfitting issue.
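The sketch below illustrates the two mechanisms in plain Python; the class names and the synchronisation interval are illustrative assumptions rather than the paper's implementation.

import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity experience replay; the oldest experiences are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # random sampling breaks the correlation between consecutive experiences
        return random.sample(self.buffer, batch_size)

def sync_target(eval_params, target_params, step, sync_every=200):
    """Copy the evaluation-network parameters (theta) into the target
    network (theta^-) every sync_every training steps."""
    if step % sync_every == 0:
        target_params.update(eval_params)   # dict-style parameter copy
    return target_params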

The neural network is used to approximate the state-action value function, i.e., $Q(s, a; \theta) \approx Q^*(s, a)$ [26]. According to [5], the update of the Q-value is derived from Q-learning as

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right],$

where $\alpha \in (0, 1]$ is the learning rate and $\gamma \in [0, 1)$ is the discount factor.

The neural network can be trained by minimising the loss function, which is defined as

$L(\theta) = \mathbb{E}\!\left[\left(r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^-) - Q(s_t, a_t; \theta)\right)^2\right],$

where $r_t + \gamma \max_{a} Q(s_{t+1}, a; \theta^-)$ is the target network's value and $Q(s_t, a_t; \theta)$ is the evaluation network's value.

The detailed optimisation of the evaluation and target networks is shown in Figure 2. In each training step, the evaluation network receives a backpropagated loss computed on a batch of experiences randomly sampled from the experience replay memory. The parameter $\theta$ of the evaluation network is then updated by minimising the loss function via stochastic gradient descent (SGD). After several steps, the parameter $\theta^-$ of the target network is updated by assigning the latest $\theta$ to $\theta^-$. After a training period, the two neural networks are stably trained.
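As an illustration of this training step, a condensed PyTorch sketch is given below; the network objects, the batch format, and the use of a mean-squared-error loss are assumptions made for illustration rather than the paper's implementation (the default discount of 0.1 follows Section 4).

import torch
import torch.nn as nn

def dqn_loss(batch, eval_net, target_net, gamma=0.1):
    """Loss over a mini-batch sampled from the replay memory.
    batch: list of (state, action, reward, next_state) tuples, where states
    and rewards are float tensors and actions are int64 tensors."""
    states, actions, rewards, next_states = map(torch.stack, zip(*batch))
    # evaluation network: Q(s_t, a_t; theta)
    q_eval = eval_net(states).gather(1, actions.view(-1, 1)).squeeze(1)
    # target network: r_t + gamma * max_a Q(s_{t+1}, a; theta^-)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
    q_target = rewards + gamma * q_next
    return nn.functional.mse_loss(q_eval, q_target)

The loss returned by dqn_loss would be backpropagated through eval_net and minimised with an SGD optimiser, and the periodic target update described above would correspond to target_net.load_state_dict(eval_net.state_dict()).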

3.2.2. The Workflow of the Cache Content Update Policy Based on DQN

In each decision epoch, the $n$th DQN agent receives a content request. If the content is cached locally, the serving BS delivers the requested content directly to the corresponding user. If the content is missed at the serving BS, the serving BS retrieves the requested content from the core network and then delivers it to the corresponding user. Subsequently, the requested content is cached at the serving BS if the cache capacity is not full. If the cache capacity is full, the optimised evaluation network outputs the Q-values of all of the actions, and the DQN agent selects the action $a_t$ with the maximal Q-value. After the execution of $a_t$, the new instant reward $r_t$ is used to compute the target network's value, and a new loss is obtained from the loss function defined in Section 3.2.1. The parameters $\theta$ and $\theta^-$ are then updated via the minimisation of the new loss. After a training period, the best policy that maximises the cache hit ratio and minimises the average latency is derived. The DQN-based cache content update policy is shown in Algorithm 1.

The DQN-based cache content update algorithm.
Input: the features of the state $s_t$
Initialise the parameters $\theta$ and $\theta^-$ and the instant reward $r_0 = 0$
for step $y = 1, \ldots, Y$ do
 for time slot $t = 1, \ldots, T$ do
  Receive a content request
  if the requested content is cached locally then
   the BS directly delivers the requested content to the user; end epoch
  else if the cache capacity is not full then
   the BS retrieves the requested content from the core network and delivers it to the user
   the requested content is cached locally; end epoch
  else (the cache capacity is full)
   observe the current state $s_t$
   randomly generate a value $p \in [0, 1]$
   if $p < 1 - \epsilon$ then
    randomly select an action from the action space $A^n$
   else
    $a_t = \arg\max_a Q(s_t, a; \theta)$
   end if
   execute $a_t$, receive the reward $r_t$ and the next state $s_{t+1}$
   store $(s_t, a_t, r_t, s_{t+1})$ in the experience replay memory
   randomly sample a mini-batch of experiences
   update the parameter $\theta$ of the evaluation network via minimisation of the backpropagated loss
   update the parameter $\theta^-$ of the target network every several time slots
  end if
 end for
end for
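To make the per-request decision logic of Algorithm 1 concrete, a compact Python sketch is given below; the default epsilon of 0.9 follows the greedy parameter in Section 4, and the q_values argument is an illustrative stand-in for the evaluation network.

import random

def decide_action(cached, request, capacity, q_values, epsilon=0.9):
    """Return the cache action for one request, following Algorithm 1.

    cached:   list of cached content IDs at the serving BS
    request:  requested content ID
    q_values: callable returning a list of Q-values, one per action
    """
    if request in cached:
        return 0                       # hit: deliver locally, keep the cache
    if len(cached) < capacity:
        cached.append(request)         # retrieve from the core network and cache
        return 0
    # cache full: epsilon-greedy choice over {keep, replace 1st, ..., replace C-th}
    if random.random() < 1 - epsilon:
        action = random.randrange(0, capacity + 1)
    else:
        q = q_values(cached, request)
        action = max(range(len(q)), key=q.__getitem__)
    if action > 0:
        cached[action - 1] = request   # replace the chosen cached content
    return action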

4. Results and Discussion

In this study, we consider a cache-enabled network with 4 BSs and 10 mobile users and ensure that each user is covered by a BS. For simplicity, the users are distributed along the edge of the serving BS's coverage area, and each BS is at the maximal communication distance from the core network; hence, the ratio $d/d_{\max}$ is 1. Besides, there is no overlap between any two BSs, so handover between BSs is avoided. Furthermore, each content has the same size of 2,000 bits, and the content transmission rate is 35 Mbit/s. The neural network has three layers: the input layer, a hidden layer, and the output layer. The hidden layer has 512 neurons, and the numbers of neurons at the input and output layers match the dimensions of the state and of the action space, respectively. The maximal cache capacity is specified in each experiment. The learning rate is 0.9, the greedy parameter is 0.9, and the discount factor is 0.1. The content requests of each user are generated following the Zipf distribution law

$P(f) = \frac{f^{-\beta}}{\sum_{j=1}^{K} j^{-\beta}},$

where $f$ is the content rank, $\beta$ is the Zipf parameter, and $K$ is the total number of content requests. In each experiment, we assume that the total number of content requests is 7,200.
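As an illustration of the request generation described above, a short numpy sketch is given below; the content library size of 500 is an illustrative assumption (only the 7,200 requests, the 2,000-bit content size, and the 35 Mbit/s rate are stated in the text).

import numpy as np

def generate_requests(num_requests=7200, num_contents=500, beta=1.4, seed=0):
    """Draw content IDs (ranks 1..num_contents) following a Zipf law."""
    rng = np.random.default_rng(seed)
    ranks = np.arange(1, num_contents + 1)
    probs = ranks ** (-beta)
    probs /= probs.sum()              # normalise the Zipf probabilities
    return rng.choice(ranks, size=num_requests, p=probs)

requests = generate_requests()
print(requests[:10])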

Figure 3 compares the cache hit ratios of the LFU policy, LRU policy, DRL policy in [20], and our proposed policy. The Zipf parameters vary from 1.1 to 1.8, the users' locations are fixed, and the cache can store at most 288 types of contents. As the Zipf parameter increases, the cache hit ratios of all four policies increase. This occurs because, as the Zipf parameter increases, a smaller number of contents accounts for a larger share of the requests; in other words, the popular content becomes more popular, the unpopular content becomes less popular, and the variety of requested content decreases. For the same cache capacity, the cached content is therefore more popular, and the cache hit ratio increases. Our proposed policy has the highest cache hit ratio regardless of the Zipf parameter. The simulation demonstrates that the effect of the popular content on the cache hit ratio increases as the Zipf parameter increases, and that our proposed policy is superior to the other three policies.

Figure 4 investigates the effect of the cache capacity on the cache hit ratio. Here, the Zipf parameter is 1.4, and the mobile users' locations are fixed. The cache capacity is varied over 36, 72, 108, 144, 180, 216, 252, and 288 contents. As demonstrated, our proposed policy is superior to the other three policies, as it has the highest cache hit ratio. In addition, as the cache capacity increases, the cache hit ratios of the four policies continuously increase. When the capacity is 288, the cache hit ratios of the four policies are remarkably close. This occurs because the popular content dominates the cache hit ratio, and the cache capacity is then high enough to store all of the popular contents.

The cache hit ratio under spatiotemporally varying scenarios is shown in Figure 5. In this experiment, the cache can store at most 216 types of contents, the Zipf parameters are randomly generated from 1.2 to 1.6 every 20,000 time slots, and the users are initially fixed and randomly change their locations among the four BSs after the 20,000th time slot. When the users' locations and the Zipf parameters are fixed, the gaps between our proposed policy and the other three policies gradually stabilise, because all of the policies become optimally trained. After time slot 20,000, the cache hit ratios of the four policies immediately decrease, because the content popularity changes with the random movement of the users and the random generation of the Zipf parameters. Later, our proposed policy's curve slowly increases, while the other three policies' curves continuously decrease, and the gaps between our proposed policy's curve and the other curves continuously widen. Our proposed policy eventually improves by at least 56% compared with the other three policies. The improvement ratio is computed as $(H_{\mathrm{prop}} - H_{\mathrm{other}})/H_{\mathrm{other}}$, in which $H_{\mathrm{prop}}$ and $H_{\mathrm{other}}$ are the cache hit ratios of our proposed policy and of any one of the other three policies, respectively. This significant improvement occurs because our proposed policy accounts for the users' random distribution and the random generation of the Zipf parameters; therefore, it quickly adapts to spatiotemporally varying content requests. Consequently, we conclude that our proposed policy is superior for managing spatiotemporally varying problems.

Figure 6 shows the four policies' average latencies under different Zipf parameters. Here, the Zipf parameters vary from 1.1 to 1.8, the mobile users' locations are fixed, and the cache can store at most 288 types of contents. As demonstrated, our proposed policy always has the lowest average latency compared with the other three policies, because it achieves the highest cache hit ratio. The higher the cache hit ratio, the more contents can be retrieved locally, and the local latency from the BS is much smaller than the remote latency from the core network. Therefore, our proposed policy performs better than the other three policies in terms of the average latency.

As shown in Figure 7, we investigate the effect of the cache capacity on the average latency. In this simulation, the Zipf parameter is 1.4, and the mobile users' locations are fixed. The cache capacity is 36, 72, 108, 144, 180, 216, 252, and 288. The higher the cache capacity, the lower the average latency of each policy, because more contents can be cached locally as the cache capacity increases. In addition, the slope of each policy's curve gradually flattens. This occurs because all of the policies aim to cache the most popular contents within their limited cache capacity; as the cache capacity further increases, more contents are cached, but the newly cached contents are less popular than the initially cached ones. Consequently, the reduction in the average latency becomes smaller when less popular contents are cached. Furthermore, our proposed policy has the minimal average latency regardless of the cache capacity.

Figure 8 shows the average latency under spatiotemporally varying scenarios. In this experiment, the cache is assumed to store at most 216 types of contents, the Zipf parameter is randomly generated from 1.2 to 1.6 every 20,000 time slots, and the users are initially fixed and randomly change their locations among the four BSs after the 20,000th time slot. In the first 20,000 time slots, each policy reaches a stable cache performance after a training period. Once the users randomly move among the four BSs, the LRU, LFU, and DRL policies' curves immediately increase, whereas our proposed policy's curve first slightly increases and then gradually decreases. More specifically, our proposed policy achieves a 56%–59% decrease compared with the other three policies. The reduction ratio is computed as $(T_{\mathrm{other}} - T_{\mathrm{prop}})/T_{\mathrm{other}}$, in which $T_{\mathrm{prop}}$ and $T_{\mathrm{other}}$ are the average latencies of our proposed policy and of any one of the other three policies, respectively. The decrease occurs because our proposed policy considers the effect of the dynamic changes in the user distribution and the Zipf parameters on the latency, while the other three policies do not. The simulation demonstrates that our proposed policy performs stably under spatiotemporally varying scenarios.

5. Conclusions

In this study, a DRL-based cache content update policy is proposed with the objective of maximising the cache hit ratio and minimising the average latency. Compared to the existing policies, a more practical cache scenario is considered, in which the content requests vary spatiotemporally. The dynamic content update problem is formulated as an MDP problem, and DQN is applied to solve it. Specifically, the neural network is trained to approximate the Q-value, with the training data drawn from the experience replay memory, and the DQN agent derives the optimal cache decision policy from the trained network. Compared with the existing policies, i.e., the LFU, LRU, and DRL [20] policies, the simulation results show that our proposed DRL-based cache content update policy has the best cache performance in the considered spatiotemporally varying scenario, improving the cache hit ratio by 56%–64% and reducing the average latency by 56%–59%.

Data Availability

The content request data used in this study are generated as described in the simulation section (Section 4).

Conflicts of Interest

Lincan Li, Chiew Foong Kwong, Qianyu Liu, and Jing Wang declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This study was supported by Ningbo Natural Science Programme (NBNSP), project code 2018A610095.