1 Introduction

Solving tasks which require long decision sequences or complex policies is an important challenge in reinforcement learning (RL). The option framework (Sutton et al. 1999) is a promising approach to reduce the complexity of such tasks. In the option framework, an RL agent can choose between actions and macro-actions, which are carried out over multiple time steps (Parr and Russell 1998; Sutton et al. 1999).

Using these macro-actions, the agent has to make fewer decisions to solve a task. Furthermore, even if macro-actions are based on simple policies, the combination of multiple macro-actions can represent more complex solutions than the simple policies would allow for on their own. For example, if a given task requires a nonlinear policy, a combination of multiple linear sub-policies might still be able to solve this task. Such an automated decomposition of complex solutions can simplify the learning problem in many domains.

Based on the SMDP setting (Puterman 1994), the option framework (Sutton et al. 1998) incorporates such macro-actions. An option consists of a sub-policy, an initiation set and a termination probability. After an option is initiated, actions are generated by the sub-policy until the option is terminated and a new option is activated. While the option framework has received considerable attention (Parr and Russell 1998; Sutton et al. 1999; Dietterich 2000), to date most algorithms require the manual specification of either the activation policies or the sub-policies. Algorithms for autonomous option discovery typically depend on discrete state-action spaces (McGovern and Barto 2001b; Mann and Mannor 2014; Simsek et al. 2005), with exceptions such as the work of Konidaris and Barto (2009). Furthermore, many existing algorithms first explore a given MDP and learn suitable options afterwards (Menache et al. 2002; Simsek and Barto 2008). Hence, they are not aimed at leveraging the efficiency of options in the initial stages of learning but rather at transferring the options to new tasks. These approaches are powerful in scenarios where options can be transferred to similar tasks in the future. In contrast, the approach suggested in this paper aims at directly learning suitable options for the problem at hand while also being applicable in continuous state-action spaces.

In continuous state-action spaces, policy search (PS) methods which optimize parametrized policies have been shown to learn efficiently in simulated and real-world tasks (Ng et al. 1998; Kober and Peters 2010). Thus, the compatibility of the proposed option discovery framework with PS methods such as PoWER (Kober and Peters 2010) and REPS (Peters et al. 2010) is an important goal of this paper. In the discrete setting, the framework can equally be combined with a wide range of methods such as Q-Learning (Christopher 1992) and LSPI (Lagoudakis and Parr 2003). Furthermore, many complex tasks can be solved through combinations of simple behavior patterns. The proposed framework can combine multiple simple sub-policies to achieve complex behavior if a single sub-policy would be insufficient to solve the task. These simpler sub-policies are easier to learn, which can improve the overall learning speed.

To arrive at a general framework which works in discrete and continuous settings and requires minimal prior knowledge, we propose a probabilistic formulation of the option framework where all components are represented by distributions. To learn these options from data, we propose a probabilistic inference algorithm based on the expectation maximization algorithm (Baum 1972) to determine the options’ components. The resulting algorithm offers three key benefits:

  1. It infers a data-driven segmentation of the state space to learn the initiation and termination probabilities for each option.

  2. It is applicable to discrete as well as continuous state-action spaces.

  3. It outperforms monolithic algorithms on discrete tasks and can solve complex continuous tasks by combining simpler sub-policies.

Together, these contributions allow for learning of options from data with minimal prior knowledge.

1.1 Problem statement

We consider the option framework within the MDP setting with states \( {\varvec{s}} \in \mathcal {S}\), actions \( {\varvec{a}} \in \mathcal {A}\), option indices \( o \in \mathcal {O}\), and termination events \(b \in \mathcal {B}\). Every option consists of a sub-policy \(\pi ( {\varvec{a}} | {\varvec{s}} , o )\) and a termination policy \({ \pi (b| {\varvec{s}} ,{ \bar{ o } }) } \), where b encodes a binary termination event and \({ \bar{ o } }\) is the index of the previously active option. Furthermore, one global activation policy \({\pi ( o | {\varvec{s}} ) }\) governs the activation of options following a termination event. This activation policy \({\pi ( o | {\varvec{s}} ) }\) replaces the initiation set used in classical algorithms and defines a probability of initiating an option in a given state. A new option can only be activated if the previously active option \({ \bar{ o } }\) is terminated. Thus, the agent uses the same option for multiple time steps if it is not terminated. The option transition model is given as

$$\begin{aligned} p( o | {\varvec{s}} ,b, { \bar{ o } })=&{\left\{ \begin{array}{ll} {\pi ( o | {\varvec{s}} ) }&{} \quad \text {if } b=1, \\ \delta _{ o ={ \bar{ o } }} &{} \quad \text {if } b=0, \\ \end{array}\right. } \end{aligned}$$
(1)

and ensures that option \({ \bar{ o } }\) remains active until a termination event occurs. After each termination, a new option will be sampled according to \({\pi ( o | {\varvec{s}} ) }\).
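
For concreteness, the option transition model of Eq. (1) corresponds to the following minimal Python sketch; the function and argument names are illustrative and not part of the formal model.

```python
import numpy as np

def sample_option(s, b, o_prev, activation_probs, rng=np.random.default_rng(0)):
    """Option transition model of Eq. (1): resample an option only after a termination."""
    if b == 1:
        p = activation_probs(s)              # pi(o | s), the activation policy
        return int(rng.choice(len(p), p=p))  # sample a new option index
    return o_prev                            # delta_{o = o_prev}: keep the previous option
```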

In Sect. 2, we show how to learn a hierarchical policy defined by the options given a set of demonstrated trajectories, i.e., how to solve the imitation learning problem. We formulate the option framework as a graphical model where the index of the executed option is treated as a latent variable. Subsequently, we show how an EM algorithm can be used to infer the parameters of the option components. In Sect. 3, we extend the imitation learning solution to allow for reinforcement learning, i.e., for iteratively improving the hierarchical policy such that it maximizes a reward function.

2 Learning options from data

The goal of this paper is to determine sub-policies, termination policies and the activation policy from data with minimal prior knowledge. All option components are represented by parametrized distributions, governed by the parameter vector \({\varvec{\theta }}= \{ {\varvec{\theta }}_A, {\varvec{\theta }}_O, {\varvec{\theta }}_B \}\). The individual components are given as binary classifiers for the termination policies for each option \(\pi (b| {\varvec{s}} , o =i; {\varvec{\theta }}_B^i)\), one global multi-class classifier for the activation policy \({ \pi ( o | {\varvec{s}} ; {\varvec{\theta }}_O)} \) and the individual sub-policies \(\pi ( {\varvec{a}} | {\varvec{s}} , o =i;{\varvec{\theta }}_A^i)\). We use the notation \(\pi (\cdot )\) to denote option components, which we will ultimately aim to learn, and \(p(\cdot )\) for all other distributions. Our goal is to estimate the set of parameters \({\varvec{\theta }}= \{{\varvec{\theta }}_A, {\varvec{\theta }}_O, {\varvec{\theta }}_B \}\), which explain one or more demonstrated trajectories \(\tau = { \{\tau _1, \dots , \tau _T \} }\) with \(\tau _t = \{ {\varvec{s}} _t, {\varvec{a}} _t \}\). Crucially, the observations \(\tau \) contain neither the option indices \( o \) nor the termination events b, but only states \( {\varvec{s}} \) and actions \( {\varvec{a}} \). Both the option index and the termination events are latent variables. While the proposed method learns all option components from data, it requires a manual selection of the total number of desired options. Future work could replace this requirement by, for example, sampling the number of options through a Dirichlet process.

In this section, we ignore the optimality of the resulting trajectories and focus on the imitation learning problem, i.e., how to recover hierarchical policies from trajectory observations. We propose a probabilistic option framework, where all option components are represented as distributions. We can then recover distributions over the latent variables using probabilistic inference techniques.

Fig. 1

The graphical model of the proposed framework. The termination policy \(\pi (b| {\varvec{s}} ,o)\) decides whether to keep executing the previously active option or to terminate its execution. After a termination event, the activation policy \({\pi ( o | {\varvec{s}} ) }\) samples a new option and the sub-policy \({\pi ( {\varvec{a}} | {\varvec{s}} , o ) }\) samples an action. Arrows from \( {\varvec{s}} _t, {\varvec{a}} _t\) to \( {\varvec{s}} _{t+1}\) are not shown to reduce clutter as we do not aim to model the transition probability distribution. Shaded nodes indicate observed variables, while clear nodes indicate the hidden variables. The model parameters \({{\varvec{\theta }}_B, {\varvec{\theta }}_O, {\varvec{\theta }}_A}\) are shown in rounded square boxes

2.1 The graphical model for options

To apply these probabilistic inference techniques, we need to specify how the variables in our model interact with each other. Figure 1 shows the graphical model of the proposed setting, and the resulting hierarchical policy is given as

$$\begin{aligned} \pi ( {\varvec{a}} | {\varvec{s}} , { \bar{ o } }; {\varvec{\theta }}) = \sum _{ o \in \mathcal {O}} \sum _{b\in \mathcal {B}} { \pi (b| {\varvec{s}} ,{ \bar{ o } }; {\varvec{\theta }}_B) } \pi ( o | {\varvec{s}} , b, { \bar{ o } }; {\varvec{\theta }}_O) \pi ( {\varvec{a}} | {\varvec{s}} , o ;{\varvec{\theta }}_A). \end{aligned}$$
(2)

The graphical model in Fig. 1 shows that the following operations occur in every time step. First, the termination policy \({ \pi (b| {\varvec{s}} ,{ \bar{ o } }; {\varvec{\theta }}_B) } \) samples a termination event b. According to Eq. (1) the previously active option \({ \bar{ o } }\) remains active if no termination occurs. Otherwise, the activation policy \({ \pi ( o | {\varvec{s}} ; {\varvec{\theta }}_O)} \) samples a new option index o based on the current state \( {\varvec{s}} \). Finally, the sub-policy \(\pi ( {\varvec{a}} | {\varvec{s}} , o ;{\varvec{\theta }}_A)\) samples an action \( {\varvec{a}} \) based on the state \( {\varvec{s}} \).
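
The following short Python sketch illustrates this generative process for one time step; all identifiers are illustrative, and the three option components are passed in as simple callables.

```python
import numpy as np

rng = np.random.default_rng(0)

def step_hierarchical_policy(s, o_prev, term_policy, activation_policy, sub_policies):
    """Sample (b, o, a) for a single time step, following Fig. 1 and Eqs. (1)-(2)."""
    # 1) Sample a termination event b ~ pi(b | s, o_prev).
    b = rng.random() < term_policy(s, o_prev)
    # 2) Option transition model, Eq. (1): resample the option only after a termination.
    o = rng.choice(len(sub_policies), p=activation_policy(s)) if b else o_prev
    # 3) Sample an action from the active sub-policy, a ~ pi(a | s, o).
    a = sub_policies[o](s)
    return b, o, a

# Toy example with two linear-Gaussian sub-policies on a one-dimensional state.
sub_policies = [lambda s: -2.0 * s + rng.normal(scale=0.1),
                lambda s: 2.0 * s + rng.normal(scale=0.1)]
term_policy = lambda s, o: 0.3                        # constant termination probability
activation_policy = lambda s: np.array([0.5, 0.5])    # uniform activation policy
print(step_hierarchical_policy(0.4, 0, term_policy, activation_policy, sub_policies))
```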

2.2 Expectation maximization for options

The graphical model for the option framework is a special case of a hidden Markov model (HMM). The Baum–Welch algorithm (Baum 1972) is an EM algorithm for estimating the parameters of an HMM. We now state the Baum–Welch algorithm for this special case of the option model, where we consider a single trajectory for improved clarity. The extension to multiple trajectories, however, is straightforward.

In our graphical model, the latent variables \(\{ o_{1:T}, b_{1:T} \}\) are given by trajectories of the option index and the termination event. EM algorithms optimize a lower-bound of the marginal log-likelihood

$$\begin{aligned} \log p(\tau |\varvec{{\varvec{\theta }}}) \ \ge \sum _{o_{1:T}, b_{1:T}}p(o_{1:T}, b_{1:T} |\tau , {\varvec{\theta }}^\text {old} ) \log p(o_{1:T}, b_{1:T}, \tau | \varvec{{\varvec{\theta }}}) = Q({\varvec{\theta }}, {\varvec{\theta }}^\text {old} ). \end{aligned}$$
(3)

The lower-bound is maximized iteratively. In the expectation (E-) Step, we compute the posterior probabilities of the latent variables using our current estimate of the parameters \( {\varvec{\theta }}^\text {old} \). In the maximization (M-) Step, we maximize \(Q({\varvec{\theta }}, {\varvec{\theta }}^\text {old} )\) w.r.t. \({\varvec{\theta }}\) to obtain a new estimate of the parameter vector. As the lower-bound becomes tight after every E-Step, the EM algorithm is guaranteed to converge to a local maximum of the marginal log-likelihood.

As we are dealing with time series, we can leverage the structure of the trajectory distribution

$$\begin{aligned}&p(o_{1:T}, b_{2:T}, \tau | \varvec{{\varvec{\theta }}}) \ = \ \pi (o_1| {\varvec{s}} _1;{\varvec{\theta }}_O) \pi ( {\varvec{a}} _1| {\varvec{s}} _1, o_1; {\varvec{\theta }}_A) \nonumber \\&\prod _{t = 1}^{T-1} \pi (b_{t + 1}| {\varvec{s}} _{t+1}, o_t;{\varvec{\theta }}_B) \pi (o_{t+1}| {\varvec{s}} _{t+1}, b_{t+1}, o_t; {\varvec{\theta }}_O) \pi ( {\varvec{a}} _{t+1}| {\varvec{s}} _{t+1}, o_{t+1}; {\varvec{\theta }}_A) p( {\varvec{s}} _{t+1}| {\varvec{a}} _t, {\varvec{s}} _t). \end{aligned}$$
(4)

After substituting Eq. (4) into Eq. (3) and rearranging terms, the lower bound decomposes into a sum over the time steps, i.e.,

$$\begin{aligned} Q({\varvec{\theta }}, {\varvec{\theta }}^\text {old} )&= \sum _{t=1}^{T-1} \sum _{o_{t}}\sum _{b_{t+1}} p( o _{t},{ b } _{t+1}|\tau ; {\varvec{\theta }}^\text {old} ) \log { \pi (b_{t+1}| {\varvec{s}} _{t+1}, o _{t}; {\varvec{\theta }}_B) } \nonumber \\&\quad + \sum _{t=1}^{T} \sum _{o_t} p( o _t| \tau ; {\varvec{\theta }}^\text {old} ) \log { \pi ( {\varvec{a}} _t| {\varvec{s}} _t, o _t; {\varvec{\theta }}_A)} \nonumber \\&\quad +\sum _{t=1}^{T} \sum _{o_t} p( o _t, { b } _t = 1 | \tau ; {\varvec{\theta }}^\text {old} ) \log { \pi ( o _t| {\varvec{s}} _t; {\varvec{\theta }}_O)} . \end{aligned}$$
(5)

The posterior probabilities \(p( o _{t},{ b } _{t+1}|\tau ; {\varvec{\theta }}^\text {old} ), p( o _t| \tau ; {\varvec{\theta }}^\text {old} )\) and \(p( o _t, { b } _t = 1 | \tau ; {\varvec{\theta }}^\text {old} )\) can be recovered from two posterior belief distributions

$$\begin{aligned} \xi _t( o _{t:t+1},{ b } _{t:t+1}) = p(o_{t:t+1},b_{t:t+1}|\tau ; {\varvec{\theta }}^\text {old} ), \end{aligned}$$
(6)

and

$$\begin{aligned} \gamma _t( o _t, b_t)= p(o_{t}, b_t|\tau ; {\varvec{\theta }}^\text {old} ). \end{aligned}$$
(7)

Consequently, we need to infer only the posteriors \(\gamma _t\) and \(\xi _t\) in the E-Step, and not the probability \(p(o_{1:T}, b_{1:T}| \tau )\) of a whole trajectory.

2.2.1 Expectation step

During the expectation step, the responsibilities

$$\begin{aligned} \gamma _t( o _t,{ b } _t)&= z_t^{-1} \alpha _t( o _t,{ b } _t) \beta _t( o _t,{ b } _t), \end{aligned}$$
(8)

and

$$\begin{aligned} \xi _t( o _{t:t+1},{ b } _{t:t+1}) = z_t^{-1}\, \alpha _t( o _t,{ b } _t)\, p( {\varvec{a}} _{t+1}, o _{t+1}, b_{t+1}| {\varvec{s}} _{t+1}, o _t ; {\varvec{\theta }}^\text {old} )\, \beta _{t+1}( o _{t+1},{ b } _{t+1}), \end{aligned}$$
(9)

can be determined using forward messages \(\alpha _t( o _t,{ b } _t)\) and backward messages \(\beta _t( o _t,{ b } _t)\), where \(z_t\) is the normalization constant of \(\gamma _t( o _t,{ b } _t)\). Given the option model, the forward messages

$$\begin{aligned} \alpha _{t}( o _t,{ b } _t)=p( {\varvec{a}} _{1:t}, o_t,b_{t}| {\varvec{s}} _{1:t}; {\varvec{\theta }}^\text {old} ) \end{aligned}$$

can be computed recursively by

$$\begin{aligned} \alpha _{t}( o _t,{ b } _t) =&\sum _{ o _{t-1}}\sum _{b_{t-1}}\alpha _{t-1}( o _{t-1},{ b } _{t-1}) p( {\varvec{a}} _t, b_{t}, o _t| {\varvec{s}} _{t}, o _{t-1}; {\varvec{\theta }}^\text {old} ), \end{aligned}$$
(10)

with \(\alpha _{1}( o _1,{ b } _1)=p( {\varvec{a}} _1, b_{1}, o _{1}| {\varvec{s}} _{1}; {\varvec{\theta }}^\text {old} )\). Based on these forward messages, the backward messages

$$\begin{aligned} \beta _{t}( o _t, { b } _t)=p( {\varvec{a}} _{t+1:T} | {\varvec{s}} _{t+1:T}, b_{t}, o_{t}; {\varvec{\theta }}^\text {old} ) \end{aligned}$$
(11)

are computed recursively by

$$\begin{aligned} \beta _{t-1}( o _{t-1},{ b } _{t-1})=&\sum _{ o _t}\sum _{b_{t}}\beta _{t}( o _t, { b } _t) p( {\varvec{a}} _t, b_{t}, o _t| {\varvec{s}} _{t}, o _{t-1}, b_{t-1}; {\varvec{\theta }}^\text {old} ), \end{aligned}$$
(12)

with \(\beta _{T}( o _T,{ b } _T)=\varvec{1}\).
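
As an illustration, the E-step can be implemented in a few lines of NumPy; the following sketch assumes that the per-time-step likelihoods have already been evaluated from the current option components and stored in the (hypothetical) arrays `init` and `trans`.

```python
import numpy as np

def forward_backward(init, trans):
    """Forward-backward E-step for the option model (Eqs. 8-12), minimal sketch.

    init[o, b]        = p(a_0, b_0 = b, o_0 = o | s_0), evaluated from the current
                        option components for the first observed state-action pair.
    trans[t, p, o, b] = p(a_{t+1}, b_{t+1} = b, o_{t+1} = o | s_{t+1}, o_t = p),
                        for t = 0, ..., T-2 (likelihoods precomputed from the data).
    Returns gamma[t, o, b] (Eq. 8) and xi[t, o_t, b_t, o_{t+1}, b_{t+1}] (Eq. 9).
    """
    T, n_o = trans.shape[0] + 1, init.shape[0]
    alpha = np.zeros((T, n_o, 2))
    beta = np.ones((T, n_o, 2))
    alpha[0] = init / init.sum()
    for t in range(1, T):                              # forward recursion, Eq. (10)
        alpha[t] = np.einsum('pb,poc->oc', alpha[t - 1], trans[t - 1])
        alpha[t] /= alpha[t].sum()                     # rescale to avoid underflow
    for t in range(T - 2, -1, -1):                     # backward recursion, Eq. (12)
        msg = np.einsum('oc,poc->p', beta[t + 1], trans[t])
        beta[t] = (msg / msg.sum())[:, None]           # beta does not depend on b_t
    gamma = alpha * beta
    gamma /= gamma.sum(axis=(1, 2), keepdims=True)     # responsibilities, Eq. (8)
    xi = (alpha[:-1, :, :, None, None] * trans[:, :, None, :, :]
          * beta[1:, None, None, :, :])
    xi /= xi.sum(axis=(1, 2, 3, 4), keepdims=True)     # pairwise posteriors, Eq. (9)
    return gamma, xi
```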

2.2.2 Maximization step

Given the distributions over latent variables and the observed state-action samples, the parameters \({\varvec{\theta }}\) can be determined by maximizing Eq. (5). Since \(Q({\varvec{\theta }}, {\varvec{\theta }}^\text {old} )\) is decoupled, independent optimization can be performed for the sub-policies, termination policies and the activation policy.

Termination policies

For optimizing the termination models, we have to consider the first term of the lower bound which is given by

$$\begin{aligned} Q_B^i({\varvec{\theta }}_B, {\varvec{\theta }}^\text {old} ) = \sum _t \sum _{b_{t+1}} p( o _t=i,{ b } _{t+1} | \tau ; {\varvec{\theta }}^\text {old} ) \log { \pi (b_{t+1}| {\varvec{s}} _{t+1}, o _{t}; {\varvec{\theta }}_B) } \end{aligned}$$
(13)

for each time step t and a single option \(o_t = i\). This term can be rewritten as

$$\begin{aligned} Q_B^i({\varvec{\theta }}_B, {\varvec{\theta }}^\text {old} ) =&\sum _t w_{B,i}(t) \Big ( t_{B,i}(t) \log \pi (b_{t+1} = 1| {\varvec{s}} _{t+1}, o _{t} = i;{\varvec{\theta }}_B) \nonumber \\&\quad + \big (1 - t_{B,i}(t) \big ) \log \big ( 1 - \pi (b_{t+1} = 1| {\varvec{s}} _{t+1}, o _{t} = i;{\varvec{\theta }}_B) \big ) \Big ), \end{aligned}$$
(14)

where \(w_{B,i}(t)\) is a weight and \(t_{B,i}(t)\) defines the target probability of the termination event. Equation (14) resembles a weighted cost function for logistic regression (Bishop 2006) where the weights are given by

$$\begin{aligned} w_{B,i}(t) = p(o_{t}=i|\tau ; {\varvec{\theta }}^\text {old} ) = \sum _{b_{t}} \gamma _t(o_t = i, b_{t}). \end{aligned}$$

The target probability of the termination event is given by

$$\begin{aligned} t_{B,i}(t) = p(b_{t+1} = 1|o_{t} = i,\tau ; {\varvec{\theta }}^\text {old} ) = \frac{ \sum _{o_{t+1},b_t} \xi _t( o _t=i, o _{t+1},{ b } _{t}, { b } _{t+1}=1) }{p(o_{t}=i|\tau ; {\varvec{\theta }}^\text {old} )}. \end{aligned}$$

Using these target probabilities and weights, standard techniques can be used to fit, for example, a sigmoidal classifier for each termination policy (Bishop 2006).
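
A minimal sketch of such a weighted fit is given below, here as plain gradient ascent on the weighted cross-entropy of Eq. (14); the function and argument names are illustrative, and any standard weighted logistic regression implementation could be used instead.

```python
import numpy as np

def fit_termination_policy(phi, w, t_target, n_iters=500, lr=0.1):
    """Weighted logistic regression for one option's termination policy.

    phi      : (N, D) feature matrix evaluated at the next states s_{t+1}
    w        : (N,)   weights w_{B,i}(t) = p(o_t = i | tau)
    t_target : (N,)   soft targets t_{B,i}(t) = p(b_{t+1} = 1 | o_t = i, tau)
    Returns the parameter vector theta_B^i of a sigmoidal termination classifier.
    """
    theta = np.zeros(phi.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-phi @ theta))     # predicted termination probability
        grad = phi.T @ (w * (t_target - p))        # gradient of the weighted cross-entropy
        theta += lr * grad / len(w)
    return theta
```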

Activation policy The case of the activation policy \({ \pi ( o | {\varvec{s}} ; {\varvec{\theta }}_O)} \) follows the argument of the termination policies. Similarly to the termination policies, we consider the relevant term of the lower bound

$$\begin{aligned} Q_O({\varvec{\theta }}_O, {\varvec{\theta }}^\text {old} ) = \sum _{t=1}^{T} \sum _{o_t} p( o _t, { b } _t = 1 | \tau ; {\varvec{\theta }}^\text {old} ) \log { \pi ( o _t| {\varvec{s}} _t; {\varvec{\theta }}_O)} , \end{aligned}$$
(15)

to extract weights

$$\begin{aligned} w_O(t) = \ p(b_{t}=1|\tau ; {\varvec{\theta }}^\text {old} ) = \sum _{ o _t} \gamma _t( o _t,b_t=1) , \end{aligned}$$

and target probabilities

$$\begin{aligned} t_O(t) = p(o_{t}|\tau ,b_{t}=1; {\varvec{\theta }}^\text {old} ) = \frac{\gamma _t( o _t,b_t=1) }{ \sum _{ o _t} \gamma _t( o _t,b_t=1)}. \end{aligned}$$

While there exists an individual termination policy for each option, in our implementation, only a single global activation policy governs the initiation of all options. Thus, only one multi-class classifier has to be learned. Here, having one global classifier is a design decision and other alternatives are feasible. Given the weighted target probabilities of the activation policy, standard methods can be used to fit a multi-class classifier (Bishop 2006).
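
Analogously, the activation policy can be fit by weighted soft-max regression, as in the following illustrative sketch (assuming a linear soft-max classifier on state features):

```python
import numpy as np

def fit_activation_policy(phi, w, t_target, n_iters=500, lr=0.1):
    """Weighted soft-max regression for the global activation policy (Eq. 15).

    phi      : (N, D) features evaluated at the states s_t
    w        : (N,)   weights w_O(t) = p(b_t = 1 | tau)
    t_target : (N, K) target distributions t_O(t) = p(o_t | b_t = 1, tau)
    Returns a (D, K) parameter matrix theta_O with pi(o|s) = softmax(phi(s) @ theta_O).
    """
    theta = np.zeros((phi.shape[1], t_target.shape[1]))
    for _ in range(n_iters):
        logits = phi @ theta
        logits -= logits.max(axis=1, keepdims=True)        # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = phi.T @ (w[:, None] * (t_target - p))       # weighted cross-entropy gradient
        theta += lr * grad / len(w)
    return theta
```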

Sub-policies

Finally, the sub-policies have to be fit. The relevant terms of the lower bound are given by

$$\begin{aligned} Q_A({\varvec{\theta }}_A, {\varvec{\theta }}^\text {old} ) =\sum _{t=1}^{T} \sum _{o_t} p( o _t| \tau ; {\varvec{\theta }}^\text {old} ) \log { \pi ( {\varvec{a}} _t| {\varvec{s}} _t, o _t; {\varvec{\theta }}_A)} , \end{aligned}$$
(16)

such that the weights are given by

$$\begin{aligned} w_{A,i}(t) = \ p(o_{t}=i|\tau ; {\varvec{\theta }}^\text {old} ) = \sum _{b_t} \gamma _t( o _{t}=i,b_t) . \end{aligned}$$

Here, it is worth noting that the weights for the sub-policies are identical to the weights of the termination policies. However, the target values are different. For the sub-policies, the target values are given directly by the observed actions \( {\varvec{a}} _t\) in the observed states \( {\varvec{s}} _t\). Given the weighted state-action pairs, stochastic policies, such as linear Gaussian policies, can be fit to the data.
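
For instance, a weighted maximum likelihood fit of a linear Gaussian sub-policy reduces to weighted ridge regression, as sketched below (illustrative names; a small regularizer is added for numerical stability):

```python
import numpy as np

def fit_sub_policy(S, A, w, reg=1e-6):
    """Weighted ML fit of a linear Gaussian sub-policy pi(a|s, o=i) = N(a | K s, Sigma).

    S : (N, D) states (or state features), A : (N, d) observed actions,
    w : (N,)   weights w_{A,i}(t) = p(o_t = i | tau) for option i.
    """
    Sw = S * w[:, None]                                    # weight each state sample
    K = np.linalg.solve(S.T @ Sw + reg * np.eye(S.shape[1]), Sw.T @ A).T
    resid = A - S @ K.T
    Sigma = (resid.T @ (resid * w[:, None])) / w.sum()     # weighted residual covariance
    return K, Sigma
```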

Feature representations of the state The above derivations are given in their most general form, where each option component depends directly on the state \( {\varvec{s}} \). In practice, it may often be beneficial to train the individual components on a feature transformation \(\varvec{\phi } ( {\varvec{s}} )\) of the state. Such features might, for example, be a polynomial expansion of the state variables or a kernelized representation. When using feature representations, different representations can be chosen for the individual option components.
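
As a simple example, a quadratic feature expansion, similar to the one used for the activation and termination policies in our experiments, could look as follows (the exact expansion shown here is illustrative):

```python
import numpy as np

def squared_features(s):
    """Quadratic feature expansion of a state vector, plus a constant offset."""
    s = np.atleast_1d(s)
    quad = np.outer(s, s)[np.triu_indices(len(s))]     # unique second-order terms
    return np.concatenate(([1.0], s, quad))

print(squared_features(np.array([0.3, -1.2])))   # -> [1, s1, s2, s1^2, s1*s2, s2^2]
```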

3 Probabilistic reinforcement learning for option discovery

The results in the previous section allow us to recover a hierarchical policy from state-action observations, i.e., to perform imitation learning with hierarchical policies. In this section, we extend these results to allow for reinforcement learning. Using the hierarchical policy, the agent aims to maximize the expected reward

$$\begin{aligned} J(\pi ) = \mathbb {E}_\pi \sum _{t=0}^\infty \gamma ^t r( {\varvec{s}} _t, {\varvec{a}} _t), \end{aligned}$$
(17)

where \(\gamma \) is the discount factor. We will show how probabilistic inference based methods can be combined with the EM algorithm of Sect. 2 to learn a reward-maximizing hierarchical policy.

3.1 Probabilistic reinforcement learning algorithms

There exist several algorithms which use probabilistic inference techniques for computing the policy update in reinforcement learning (Dayan and Hinton 1993; Theodorou et al. 2010; Kober and Peters 2010; Peters et al. 2010). More formally, they either re-weight state-action trajectories or state-action pairs according to the estimated quality of the state-action pair and, subsequently, use a weighted maximum likelihood estimate to obtain the parameters of a new policy \(\pi ^*\).

A common approach is to use an exponential transformation of the advantage function \(A( {\varvec{s}} , {\varvec{a}} ) = Q( {\varvec{s}} , {\varvec{a}} ) - V( {\varvec{s}} )\), where \(Q( {\varvec{s}} _i, {\varvec{a}} _i)\) is the Q-function and \(V( {\varvec{s}} _i)\) is the value function, to reweight the state-action distribution (Peters et al. 2010; Daniel et al. 2012). The resulting desired state-action distribution \(p( {\varvec{s}} , {\varvec{a}} )\) is then given by

$$\begin{aligned} p( {\varvec{s}} , {\varvec{a}} ) \propto q( {\varvec{s}} , {\varvec{a}} ) \exp \left( \frac{ Q( {\varvec{s}} , {\varvec{a}} ) - V( {\varvec{s}} )}{\eta }\right) , \end{aligned}$$

where \(q( {\varvec{s}} , {\varvec{a}} )\) is the sampling distribution which is typically obtained by sampling from the old policy \(\tilde{\pi }( {\varvec{a}} | {\varvec{s}} )\). The parameter \(\eta \) is a temperature parameter that is either optimized by the algorithm (Peters et al. 2010; Daniel et al. 2012) or manually set (Theodorou et al. 2010; Kober and Peters 2010). A new parametrized policy \(\pi ^*\) can then be obtained by minimizing the expected Kullback-Leibler divergence between the re-weighted policy update \(p( {\varvec{a}} | {\varvec{s}} )\) and the new parametric policy \(\pi ^*\) (van Hoof et al. 2015), i.e.,

$$\begin{aligned} \pi ^* = \mathop {\mathrm{arg\,min}}_{\pi } \ \mathbb {E}_{q( {\varvec{s}} )} \Big [ \text {KL}\big ( p( {\varvec{a}} | {\varvec{s}} ) \,||\, \pi ( {\varvec{a}} | {\varvec{s}} ) \big ) \Big ] \ \approx \ \mathop {\mathrm{arg\,max}}_{\pi } \sum _i w( {\varvec{s}} _i, {\varvec{a}} _i) \log \pi ( {\varvec{a}} _i| {\varvec{s}} _i), \end{aligned}$$
(18)

where

$$\begin{aligned} w( {\varvec{s}} _i, {\varvec{a}} _i) = \exp \left( \frac{ Q( {\varvec{s}} _i, {\varvec{a}} _i) - V( {\varvec{s}} _i)}{\eta } \right) \end{aligned}$$

defines a weighting for each state-action pair. It can now be easily seen that Eq. (18) is equivalent to a weighted maximum likelihood estimate for the policy \(\pi ^*\). Specifically, we could use standard maximum likelihood techniques to fit, for example, a linear Gaussian policy to the observed state-action samples \(\{ {\varvec{s}} _i, {\varvec{a}} _i \}\), reweighted by the weights \(w( {\varvec{s}} _i, {\varvec{a}} _i)\), which would minimize the Kullback-Leibler divergence in Eq. (18).
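
A minimal sketch of this weighting step is given below; since rescaling all weights by a common constant does not change the weighted maximum likelihood solution, the advantages may be shifted for numerical stability.

```python
import numpy as np

def rl_weights(Q, V, eta=1.0):
    """Exponential advantage weights w(s, a) = exp((Q(s, a) - V(s)) / eta).

    Q and V are arrays of Q-values and state values evaluated at the sampled
    state-action pairs; eta is the temperature parameter (set to 1 in the
    discrete experiments).
    """
    adv = Q - V
    adv = adv - adv.max()     # common shift; the weighted ML fit is unaffected
    return np.exp(adv / eta)
```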

We will employ different RL algorithms in the continuous and discrete settings to compute these weightings. For the continuous experiments, we will use the HiREPS algorithm (Daniel et al. 2012). While other algorithms are equally feasible (Theodorou et al. 2010; Kober and Peters 2010; Peters et al. 2010), HiREPS is able to actively separate sub-policies if options are available by ensuring that different sub-policies generate distinct behaviors in similar states. Ensuring such versatility of the sub-policies is achieved by constraining the entropy of the responsibilities for each option, i.e.,

$$\begin{aligned} H\big (p( {\varvec{a}} | {\varvec{s}} , o )\big ) > \kappa , \end{aligned}$$
(19)

where \(\kappa \) is manually set. Furthermore, HiREPS limits the KL divergence of state-action distributions induced by policy updates such that

$$\begin{aligned} \text {KL}\big (p( {\varvec{s}} , {\varvec{a}} ) || q( {\varvec{s}} , {\varvec{a}} ) \big ) < \epsilon , \end{aligned}$$
(20)

where \(\epsilon \) is likewise set manually. Values for \(\kappa \) and \(\epsilon \) are usually set close to 1.

For discrete environments, we can equally employ standard reinforcement learning techniques to obtain the Q-function \(Q( {\varvec{s}} , {\varvec{a}} )\) and value function \(V( {\varvec{s}} )\). In our experiments, we employed standard Q-learning (Christopher 1992) and LSPI (Lagoudakis and Parr 2003) to obtain those quantities. In the discrete case, the temperature parameter \(\eta \) was set to 1.

3.2 Combining EM and probabilistic reinforcement learning

As we have seen, the only difference between imitation learning and probabilistic reinforcement learning algorithms is the use of a weighted maximum likelihood (ML) estimate instead of a standard ML estimate. We can now combine the expectation maximization algorithm for discovering parametrized options with probabilistic reinforcement learning algorithms by weighting each time step in the maximization step of the EM algorithm.

Since the reinforcement learning weights \(w_t = w( {\varvec{s}} _t, {\varvec{a}} _t)\) are independent of all latent variables, the derivations in Sect. 2 remain largely unaffected. The integration of the reinforcement learning weights affects only the maximization step. In the M-Step, all sample weights for the individual option components obtained by the EM algorithm are multiplied by the RL weights. Thus, we obtain the following combined weights of each sample

$$\begin{aligned} \tilde{w}_{B,j}(t)&\ = \ w_t \ p(o_{t}=j|\tau ; {\varvec{\theta }}^\text {old} ), \\ \tilde{w}_O(t)&\ = \ w_t \ p(b_{t}=1|\tau ; {\varvec{\theta }}^\text {old} ) , \\ \tilde{w}_{A,j}(t)&\ = \ w_t \ p(o_{t}=j|\tau ; {\varvec{\theta }}^\text {old} ). \end{aligned}$$

Since the RL weightings only depend on the observed variables, we fortunately do not have to devise a new RL algorithm but can rely on a wide range of existing methods to provide the RL weights \(w_t\). Here, the weights \(w_t\) could alternatively be obtained through other means, in non-RL settings. However, since we are interested in solving a reinforcement learning problem, we refer to them as RL weights. Thus, the proposed framework acts as an interface between existing RL algorithms and the policy updates. While traditional methods use the RL weights \(w_t\) to perform, for example, a maximum likelihood update of a monolithic policy \(\pi ( {\varvec{a}} | {\varvec{s}} )\), the proposed method estimates all elements of the option framework by fitting the hierarchical policy to the weighted state-action samples.
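
In code, this combination is a simple element-wise product of the RL weights with the marginals of the E-step posteriors, as in the following sketch (array shapes are illustrative):

```python
import numpy as np

def combined_weights(w_rl, gamma):
    """Multiply the per-time-step RL weights into the EM responsibilities.

    w_rl  : (T,)       RL weights w_t = w(s_t, a_t)
    gamma : (T, K, 2)  posteriors gamma_t(o_t, b_t) from the E-step
    Returns the M-step weights for the sub-policies / termination policies
    (per option) and for the activation policy.
    """
    p_o = gamma.sum(axis=2)              # p(o_t = j | tau), shape (T, K)
    p_b1 = gamma[:, :, 1].sum(axis=1)    # p(b_t = 1 | tau), shape (T,)
    w_A = w_rl[:, None] * p_o            # equals w_B: sub-policy and termination weights
    w_O = w_rl * p_b1                    # activation-policy weights
    return w_A, w_O
```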

Table 1 Learning options from experience. Termination events, options and actions are sampled from the current policies. Subsequently, the distribution over latent variables is computed and weights \(f_\text {RL}\) are proposed by the RL algorithm. The next policies are then determined according to the update equations given in Sects. 2 and 3

The information flow of the proposed algorithm is shown in Table 1.
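
For reference, the following high-level sketch summarizes this loop; all callables are placeholders for the components described above and do not refer to a specific implementation.

```python
def learn_options(env, policy, rl_algorithm, collect_rollouts, e_step, m_step,
                  n_iterations=100, n_em_iters=3):
    """High-level sketch of the information flow in Table 1 (all callables are placeholders).

    collect_rollouts(env, policy)  -> sampled trajectories (states, actions, executed b, o)
    rl_algorithm(rollouts)         -> per-time-step RL weights w_t (e.g. from HiREPS,
                                      Q-Learning or LSPI, see Sect. 3.1)
    e_step(policy, rollouts)       -> posteriors gamma, xi over the latent variables
    m_step(rollouts, gamma, xi, w) -> updated option components (weighted ML fit)
    """
    for _ in range(n_iterations):
        # 1) Sample termination events, options and actions from the current policy.
        rollouts = collect_rollouts(env, policy)
        # 2) The RL algorithm proposes weights for every observed state-action pair.
        w_rl = rl_algorithm(rollouts)
        # 3) A few weighted EM iterations update all option components.
        for _ in range(n_em_iters):
            gamma, xi = e_step(policy, rollouts)
            policy = m_step(rollouts, gamma, xi, w_rl)
    return policy
```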

4 Related work

Options as temporally extended macro-actions were introduced by Sutton et al. (1999). While previous research leveraged the power of temporal abstraction (Kaelbling 1993; Parr and Russell 1998; Sutton et al. 1999), such efforts did not improve the sub-policies themselves. Improving the sub-policies based on the observed state-action-reward sequences is known as intra-option learning. Intra-option learning is a consequence of having Markov options and allows for updating all options that are consistent with an experience. While it is a desired property of option learning methods, it is not realized by all existing methods. Sutton et al. (1998) showed that making use of intra-option learning can drastically improve the overall learning speed. Yet, the algorithms presented by Sutton et al. (1998) relied on hand-coded options and were formulated in the discrete setting.

Options are also used in many hierarchical RL approaches, where they either extend the action space or are directly extended to sub-tasks, where the overall problem is broken up into potentially simpler sub-problems. Dietterich (2000) proposed the MAXQ framework which uses several layers of such sub-tasks. However, the structure of these sub-tasks needs to be either specified by the user (Dietterich 2000) or the approach relies on the availability of a successful trajectory (Mehta et al. 2008). Barto et al. (2004) rely on artificial curiosity to define the reward signal of individual sub-tasks, where the agent aims to maximize its knowledge of the environment to solve new tasks more quickly. This approach relies on salient events which effectively define the sub-tasks.

Stolle and Precup (2002) first learn a flat solution to the task at hand and, subsequently, use state visitation statistics to build the option’s initiation and termination sets. Mann and Mannor (2014) apply the options framework to value iteration and show that it can speed up convergence.

Option discovery approaches often aim to identify so-called bottleneck states, i.e., states the agent has to pass through on its way from start to goal. McGovern and Barto proposed to formulate this intuition as a multiple-instance learning problem and to solve it using a diverse density method (McGovern and Barto 2001a). Other approaches aim to find such bottleneck states using graph theoretic algorithms. The Q-Cut (Menache et al. 2002) and L-Cut (Simsek and Barto 2008) algorithms build transition graphs of the MDP and solve a min-cut problem to find bottleneck states. Silver and Ciosek (2012) assume a known MDP model to propose an option model composition framework, which can be used for planning while discovering new options. Niekum and Barto (2011) present a method to cluster subgoals discovered by existing subgoal discovery methods to find skills that generalize across tasks. In this paper, we do not assume knowledge of the underlying MDP and, further, present a framework which is also suitable for the continuous setting.

In continuous state-action settings, several sub-task based approaches have been proposed. Ghavamzadeh and Mahadevan (2003) proposed the use of a policy gradient method to learn sub-tasks while the selection of the sub-tasks is realized through Q-Learning. Morimoto and Doya (2001) proposed to learn how to reach specific joint configurations of a robot as sub-tasks, such that these options can later be combined to learn more complicated tasks. In both approaches, the sub-tasks have to be pre-specified by the user. Wingate et al. (2011) use policy priors to encode desired effects like temporal correlation. Levy and Shimkin (2012) propose to extend the state space by the option index, which allows for the use of policy gradient methods. In our proposed method, this option index is a latent variable which is inferred from data. The use of a latent variable allows our method to update all options with all relevant data points. Konidaris and Barto (2009) use the option framework to learn chaining of skills. This approach requires that the agent can reach the goal state before constructing the series of options leading to the goal state.

A concept similar to the options framework has been widely adopted in the field of robot learning. There, temporal abstraction is achieved through the use of movement primitives (Paraschos et al. 2013; Da Silva et al. 2012). Instead of learning policies over state-torque mappings to control robots, the agent learns parameters of a trajectory generator (Kober and Peters 2010; Kajita et al. 2003). Based on a Beta-Process Autoregressive HMM proposed by Fox et al. (2009), Niekum et al. (2012) proposed a method to segment demonstrated trajectories into a sequence of primitives, addressing the imitation learning problem. Rosman and Konidaris (2015) extend the work of Fox et al. (2009) to allow skill discovery in the inverse reinforcement learning setting. There, the task is to recover reward functions which lead to a skill-based solution of a task. Compared to the proposed method, methods based on or similar to the Beta-Process Autoregressive HMM allow skills or segments to be extracted without a priori knowledge of the total number of skills. However, such methods are computationally expensive and have not been shown to work in the loop together with reinforcement learning methods.

Sequencing multiple primitives can be viewed as an approximation to the options framework for robot learning and allows more challenging tasks to be solved (Stulp and Schaal 2012; Daniel et al. 2013). Robot learning approaches often benefit from substantial task knowledge in the form of task demonstrations. Furthermore, a key benefit of these methods is the ability to offer a simple parametrization of complex behaviors, which also inspired the proposed approach. However, existing robot learning methods usually do not allow for intra-option learning from the complete trajectory data (Daniel et al. 2012) and the individual primitives are generally of fixed duration (Kober and Peters 2010).

Fig. 2

a The hand-coded policy consists of three bands, where the center band realizes a stabilizing controller and the outer bands accelerate the pendulum to enable a swing-up. b One trajectory sample generated with the hand-coded policy

5 Evaluation

The evaluation of the proposed framework is separated into two parts. We first evaluated the imitation learning capabilities and, subsequently, proceeded to evaluate different reinforcement learning tasks as well as to compare the proposed method to other option learning frameworks.

5.1 Imitation learning

We started our evaluations with an imitation learning task. The evaluation of the imitation learning capabilities allowed us to ensure that the foundation of the proposed framework performs as expected.

For the imitation learning task, we considered the underpowered pendulum swing-up task. In this task, the agent exerted torques on the rotational joint of a pendulum and had to swing the pendulum up into the upright position. However, the torques were limited such that the agent was not strong enough to perform a swing-up directly. Instead, the agent had to first perform a pre-swing to build up sufficient kinetic energy. The pendulum had a length of 0.5 m, a mass of 10 kg and a friction coefficient of 0.2 Ns/m. The pendulum was internally simulated with a frequency of 10 kHz. The agent controlled the pendulum at a frequency of 33 Hz. The agent could exert at most 30 Nm of torque and each episode had 100 time steps. This continuous task had two state variables, the angle and the velocity of the pendulum, where we used a periodic representation of the angle.
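
The exact simulator is not specified beyond these quantities; the following sketch merely illustrates one possible implementation, assuming point-mass pendulum dynamics with viscous joint friction and simple Euler integration.

```python
import numpy as np

def simulate_pendulum(policy, m=10.0, l=0.5, d=0.2, g=9.81,
                      dt_sim=1e-4, n_steps=100, control_hz=33.0, u_max=30.0):
    """Illustrative swing-up environment with the stated parameters.

    Assumes the dynamics m*l^2*th_dd = -m*g*l*sin(th) - d*th_d + u, with th = 0
    hanging down and th = +-pi the upright position; these modeling choices are
    assumptions and not taken from the paper.
    """
    th, th_d = 0.0, 0.0                        # start hanging down with zero velocity
    sub_steps = int(round(1.0 / (control_hz * dt_sim)))
    trajectory = []
    for _ in range(n_steps):
        u = np.clip(policy(np.array([th, th_d])), -u_max, u_max)   # torque limit
        for _ in range(sub_steps):             # internal 10 kHz integration
            th_dd = (-m * g * l * np.sin(th) - d * th_d + u) / (m * l ** 2)
            th_d += dt_sim * th_dd
            th += dt_sim * th_d
        trajectory.append((th, th_d, u))
    return trajectory
```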

Fig. 3

a The policy learned through imitation learning. Red circles show the observations and blue diamonds show trajectories generated using this learned policy. The plot shows that the learned policy uses only two of three available options. Option 2 realizes the stabilizing policy and Option 3 realizes the accelerating policy. b A qualitative plot showing the development of the log likelihood of the observed data under the learned policy for ten trials. The results show that convergence is usually reached after about five EM iterations

To generate observations, we provided a hand-coded policy which is shown in Fig. 2a. This policy was designed to generate noisy actions within \(\pm 300\) Nm. However, the torques on the system were capped to the torque limit of 30 Nm. The much larger range of desired torques was chosen to simplify the programming of the controller. This policy could successfully perform a pendulum swing-up if the pendulum was initially hanging down with zero velocity. An example trajectory generated by this hand-coded policy is shown in Fig. 2b.

Based on five observed trajectories, we used the proposed framework to learn a policy with three options. However, the resulting policy, shown in Fig. 3a, learned to reproduce an effective policy using only two of the three available options. Generally, we would not necessarily expect that the options recovered by the proposed algorithm exactly match the options of the demonstrator. The algorithm only aims at reproducing the observed behavior but is free to choose the internal structure of the hierarchical policy. Equally, in the RL case, we would not expect the proposed algorithm to learn options that are ‘human-like’. Specifically, we would not expect that a robotic agent would use the same solution decomposition as a human operator.

Figure 3a shows that the trajectories generated with this imitated policy closely resemble the observed trajectories. Figure 4 shows the inferred termination policies of the different options. The results show that options will only be terminated once they are outside of their region of ‘expertise’. Option one has a high termination probability in most regions; however, Fig. 3a shows that the final policy is actually not using this option. Figure 3b shows the development of the log likelihood of the observed data under the imitated policy. The results show that convergence typically has been reached after about five iterations of the EM algorithm. In the imitation learning case, the learned solution is only valid if the system is initiated in a state which is similar to those states that were previously demonstrated. Outside of this region, the imitation learning solution will not be able to successfully solve the task and a reinforcement learning solution is required. In the proposed method, the activation policy will learn to initiate sub-policies according to their responsibility for the respective region of the state space. If, for example, the sub-policies are modeled as Gaussians and have infinite support, they have non-zero responsibility for all states and could, theoretically, be activated. If, however, a different class of probability distributions is more task-appropriate and has limited support, it would not be activated outside of its support region.

In this task, the activation policy was a soft-max distribution based on a squared expansion of the state features. The termination policies were represented by sigmoidal functions based on the same features as the activation policy. The sub-policies, however, were represented as linear Gaussian policies based directly on the state features.

Fig. 4

Visualization of the inferred termination policies for options one (left) through three (right). As shown in Fig. 3a, the first option is not used in the final policy and the agent learned to almost always terminate the first option. The third option learned a symmetric termination policy which terminates once the pendulum has sufficient energy to be ‘pulled up’ to the upright position. Interestingly, the second option learned a non-symmetric termination policy, possibly due to the relatively small number (five) of noisy demonstrations

5.2 Reinforcement learning

We present results on three discrete state-action tasks as well as one continuous state-action task. In the discrete state-action settings, we compare to two existing option learning algorithms, namely the Q-Cut algorithm (Menache et al. 2002) as well as the L-Cut algorithm (Simsek et al. 2005). Both algorithms aim to find bottleneck states by solving a max-flow/min-cut problem on the transition graph. After identifying such bottleneck states, options are generated which lead to these states. The main difference between these two algorithms is that Q-Cut aims to solve a global graph partitioning problem, while L-Cut aims to solve a local problem, based on transitions observed in one episode. The main difference to the proposed algorithm is that we do not aim to identify bottleneck states, but instead aim to automatically learn a decomposition of the problem that is suitable to the class of available option components.

Fig. 5

b, d, f Setups of the discrete world task. The agent has to cross fields of varying sizes while avoiding the obstacles on its way to the goal. Outlined triangles show trajectories learned using the proposed approach in combination with LSPI. The colors differentiate between active options. Since the transition dynamics were stochastic, the transitions did not always follow the selected actions. a, c, e Learning performances of the different algorithms on the discrete world tasks. The results show that the L-Cut and Q-Cut generally had similar performance, where Q-Cut outperformed L-Cut in the first two worlds. However, in the relatively open obstacle world, L-Cut outperformed Q-Cut since Q-Cut was not able to find suitable bottleneck states. In all worlds the proposed approach outperformed both L-Cut and Q-Cut when using Q-Learning. Changing the RL algorithm to LSPI further increased the learning performance (Color figure online)

For all evaluations, we tested each setting ten times and report mean and standard deviation of the results.

5.2.1 Discrete tasks

The discrete environments were given by three different gridworlds as shown in Figs. 5b, d and f. The first world, shown in Fig. 5b, represents the two rooms scenario (McGovern and Barto 2001a), where the agent has to find a doorway to travel between two rooms. In the second world, shown in Fig. 5d, the agent has to traverse two elongated corridors before entering a big room in which the target is located. Finally, in the third world, shown in Fig. 5f, no traditional bottleneck states appear, but the agent has to navigate around two obstacles. Furthermore, in this world optimal paths can lead around either side of the first obstacle.

In all experiments, the agent started in the lower left corner of the respective grid and had to traverse the grid while avoiding two obstacles to reach the goal at the opposite end. The actions available to the agent were going up, left, right and down. Transitions were successful with a probability of 0.8. Unsuccessful transitions had a uniform probability of ending up in any of the neighboring cells. Each transition to an accessible field other than the goal field generated a reward signal of \(-1 \). After reaching the goal, the agent received a reward of \(+1\) for every remaining step of the episode, where each episode had a length of 500 time steps. If the agent tried to walk into an obstacle or to leave the field, it remained in the current position.

For the discrete tasks we used a tabular feature encoding for the flat policy. For the hierarchical policy, we used the tabular features for the activation and termination policies. The sub-policies were state-independent multinomial distributions over the four actions.
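
This representation corresponds to the following illustrative parametrization (hypothetical class and attribute names):

```python
import numpy as np

class DiscreteOptionPolicy:
    """Tabular parametrization used in the gridworld experiments (illustrative sketch)."""

    def __init__(self, n_states, n_options, n_actions=4):
        self.theta_O = np.zeros((n_states, n_options))   # activation logits per state
        self.theta_B = np.zeros((n_states, n_options))   # termination logits per state/option
        self.theta_A = np.zeros((n_options, n_actions))  # state-independent action logits

    def activation_probs(self, s):
        p = np.exp(self.theta_O[s] - self.theta_O[s].max())
        return p / p.sum()                                # pi(o | s)

    def termination_prob(self, s, o):
        return 1.0 / (1.0 + np.exp(-self.theta_B[s, o]))  # pi(b = 1 | s, o)

    def action_probs(self, o):
        p = np.exp(self.theta_A[o] - self.theta_A[o].max())
        return p / p.sum()                                # pi(a | o), independent of s
```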

In the discrete setting, we present results of two different RL methods, Q-Learning (Christopher 1992) and LSPI (Lagoudakis and Parr 2003), in combination with the proposed framework and converted the resulting Q-functions into RL weights as described in Sect. 3. However, especially when using Q-Learning, subsequent policy updates can differ strongly and may destabilize the learning process. To stabilize the learning process, we use a learning rate \(\alpha \) for our policy updates such that

$$\begin{aligned} \tilde{\pi }( {\varvec{a}} | {\varvec{s}} , { \bar{ o } }) = \alpha \pi ( {\varvec{a}} | {\varvec{s}} , { \bar{ o } }) + (1-\alpha )\hat{\pi }( {\varvec{a}} | {\varvec{s}} , { \bar{ o } }), \end{aligned}$$

where \(\pi ( {\varvec{a}} | {\varvec{s}} , { \bar{ o } })\) is the resulting policy of the EM update and \(\hat{\pi }( {\varvec{a}} | {\varvec{s}} , { \bar{ o } })\) is the previous policy. The learning rate \(\alpha \) was set to 0.1 in the experiments. Alternatively, the learning rate of Q-Learning itself could be decreased. However, the proposed scheme yielded better performance.
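
For tabular policies, this damped update is a simple convex combination of the old and new action probabilities, as in the following sketch:

```python
import numpy as np

def damped_update(pi_new, pi_old, alpha=0.1):
    """Interpolate between the EM policy update and the previous policy.

    pi_new, pi_old : arrays of conditional action probabilities with the same
    shape, e.g. tabular pi(a | s, o_bar); alpha is the learning rate.
    """
    pi = alpha * pi_new + (1.0 - alpha) * pi_old
    return pi / pi.sum(axis=-1, keepdims=True)   # guard against numerical drift
```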

Comparison to existing methods In the comparative results to related work, we follow the established practice of reporting ‘steps to goal’ as a qualitative measure. In the remaining evaluations of our algorithm, we report the average return, which may in some cases be more informative since our reward functions also punish ‘falling off’ the board. The results in Fig. 5 show that in all experiments the proposed framework learned solutions faster than both the Q-Cut and the L-Cut methods. Comparing the use of Q-Learning and LSPI in the proposed framework, the results show that LSPI leads to considerably faster convergence than Q-Learning. Since the structure of the individual action policies learned by the proposed approach was given simply as a distribution over the four possible actions, the converged sub-policies typically select only a single action. This simplicity of the sub-policies is a key factor to accelerate the overall learning speed. While we do not present results of comparisons to primitive-based methods such as, for example, using Q-Learning directly, both of the methods that we compared to have been shown to outperform Q-Learning. Thus, we compared to such primitive-based methods indirectly. In our experience, both Q-Cut and L-Cut outperform Q-Learning when not using experience replay. However, in our internal evaluations on the tasks presented in this paper, Q-Learning with experience replay resulted in performance levels similar to Q-Cut and L-Cut, but worse than the proposed method.

Influence of available options After comparing to existing methods, we further evaluated the properties of the proposed framework. All remaining evaluations were performed using LSPI in the obstacle world. In our experience, these results are representative of using Q-Learning as well as of performing the evaluations in different tasks. Figure 6a shows the influence of the number of available options. In theory, this task can be solved optimally with only two options, where one will always go right and one will always go up. However, the results show that making more options available to the algorithm improved both asymptotic performance and speed of convergence. Adding more than 20 options did not further increase the performance.

Influence of EM iterations The proposed EM style of computing policy updates can be costly and, generally, requires several iterations, as also seen in the imitation learning case in Fig. 3b. However, in the RL setting, subsequent policies are expected to be similar and, thus, a small number of EM iterations should be sufficient. The results in Fig. 6b show that this intuition holds true in our evaluations. In our experience, performing even more EM iterations did not change the result. However, when using very few rollouts per iteration, the effect of performing multiple EM iterations can be more pronounced. Even when a small number of EM iterations is sufficient, introducing these additional computations results in a noticeable computational overhead in continuous tasks. Here, especially the number of options is an important factor influencing the runtime of the EM algorithm. In our experience, the addition of the EM algorithm would lead to approximately twice the overall computational requirements. However, this factor is of course highly dependent on the RL method used. In the discrete tasks that we evaluated, the computational overhead due to the EM iterations was negligible.

Fig. 6

a Evaluation of the proposed algorithm on a gridworld task. With an increased number of options, the performance of the algorithm also increases. With eight options, optimal asymptotic performance is reached; however, adding even more options further improves the learning speed. b The number of EM iterations in the RL case does not have a large effect on the performance. More EM iterations improve the performance only marginally

Influence of termination events Finally, we evaluated the influence of the probabilistic terminations. In the proposed framework, the option components have to be initialized and, thus, a prior for the termination policies has to be set. Figure 7a shows the effect of changing this prior. The results show that the proposed framework is robust to a wide range of these initializations.

We also evaluated the effect of completely disabling the probabilistic terminations. In this case, the algorithm could still learn multiple options but no termination policies. Thus, each option could not be active for more than one time step but terminated after every step. The results in Fig. 7b show that learning without terminations slowed down the convergence speed. In our experience, this effect was strongly linked to the stochasticity of the transitions. The higher this stochasticity was, the stronger the benefit of the termination policies became.

5.2.2 Continuous task

After evaluating our algorithm on several discrete tasks, we returned to the continuous pendulum swing-up task described in the imitation learning section of the results. As before, the task is to swing up an underpowered pendulum. To solve this task, the agent has to learn to first perform a pre-swing to build up kinetic energy and then use the momentum to fulfill the task. Furthermore, in the presented task two solutions are possible, performing the swing-up clockwise or counter-clockwise. To provide a reinforcement learning signal for the continuous task, we employed the Hierarchical Relative Entropy Policy Search (HiREPS) algorithm (Daniel et al. 2012), where we used Fourier basis features as described by Konidaris et al. (2011) to represent the value function. For the value function, a feature expansion of the fifth order was used. For the activation policy as well as the termination policies, a squared feature expansion was used. The individual sub-policies worked directly on the state observations, using only an additional constant offset. Thus, the sub-policies were linear controllers. In the pendulum swing-up task, a single linear controller is insufficient and, thus, the proposed approach had to combine multiple sub-policies to achieve the non-linear behavior required for a swing-up.

Fig. 7

a The effect of the termination prior (initialization of termination policies) using eight options. Changing the initial value of the termination probabilities within a reasonable range around 0.5 has only a small effect on the learning process. Very low values will lead to decreased performance. b The effect of completely disabling terminations. Without terminations, the performance of the algorithm decreases

Fig. 8

a Evaluation of the proposed algorithm on the underpowered pendulum swingup task. A flat linear policy is insufficient to solve the task, only the combination of multiple linear policies is sufficiently expressive. Using two or three options, the task can be solved but stabilization depends on bang-bang control. With four or more options, the policy is sufficiently expressive to incorporate high torque signals for the swing-up and finely tuned sub-policies for stabilization. b Visualization of the hierarchical policy composed of four options. Options 2 and 3 learned a bang-bang control scheme and options 1 and 4 learned to stabilize the pendulum around the upright position (Best viewed in color)

Fig. 9

a Visualization of a swing-up trajectory, color-coded by the currently active option. The pendulum is initially hanging down with a small rotation. First, option 2 performs a pre-swing and, subsequently, option 3 accelerates the pendulum until options 1 and 4 start the stabilization process. Around timestep 15, option 4 is active for a short time, but the kinetic energy at that point is insufficient for a direct swing-up (Best viewed in color)

The results in Fig. 8a show that while a single linear policy was insufficient to solve this task, it could be solved using two options. Adding more options further improved the resulting hierarchical policy. The visualization of the resulting policy in Fig. 8b shows that with more options, the algorithm learned a control scheme where options two and three were used to swing up the pendulum, and options one and four incorporated a linear stabilization scheme around the upright position of the pendulum. Figure 9a shows a trajectory generated by the resulting policy. Starting from the bottom, the pendulum was first accelerated by options two and three. The plot shows that in-between options two and three, option four was active for a few time steps. However, the kinetic energy at that point was insufficient to fully swing up the pendulum. After the pendulum almost reached the upright position around time step 40, the stabilizing options took over control. Since all option components are stochastic distributions, some option switches still occur even after the pendulum is stabilized. Since the effect of switching into a different option in the stable position for a single time step could easily be balanced by activating the stabilizing option in the next time step, the agent did not have a strong incentive to learn to avoid such behavior. Letting the algorithm run for more iterations might further improve this behavior.

5.2.3 Limitations

While the experiments show that the presented method worked well in the scenarios that were evaluated, we also want to make explicit the assumptions made in this paper. Primarily, the proposed method expects that the number of required options is known a priori. In our experience, this requirement is rather benign in practice, as the algorithm can be initialized with an excessive number of options without deterioration in the quality of the solution. However, adding more options does increase the computational requirements. Thus, approaches for automatically generating a task-appropriate number of options are an important aspect of future work. Furthermore, we introduced a damping factor \(\alpha =0.1\) on the policy update for the discrete setting, which we found to be especially important when using Q-Learning as the underlying RL method. In our experience, the recommended value of \(\alpha \) depends on the RL method used as well as the task under consideration. Methods such as LSPI will generally work well with larger values of \(\alpha \).

6 Conclusion and future work

In this paper, we presented a method to estimate the components of the option framework from data. The results show that the proposed method is able to learn options in the discrete and continuous setting. In the discrete setting, the algorithm performs better than two related option-discovery algorithms which are based on exploiting bottleneck states. Instead of relying on bottleneck states, the proposed algorithm achieves its performance by combining options with simpler sub-policies.

In the continuous setting, the results show that the algorithm is able to solve a non-linear task using a combination of options with only linear sub-policies. In this setting, a single linear policy is insufficient for solving the task. Furthermore, the framework allows for parametrized policies and, thus, state-of-the-art policy search methods developed for flat policies can be used to learn hierarchical policies.

The presented approach infers the options’ structure, such as the activation policy and termination policies, from data. However, the number of options still has to be set a priori by the practitioner. While the results show that setting a relatively large number of options typically yields good performance, learning the number of options is an important aspect of future work. Finally, while the presented framework estimates the most likely termination policies, finding a way of enforcing fewer terminations might further improve learning performance.