
Path following for Autonomous Ground Vehicle Using DDPG Algorithm: A Reinforcement Learning Approach

Yu Cao, Kan Ni, Xiongwen Jiang, Taiga Kuroiwa, Haohao Zhang, Takahiro Kawaguchi, Seiji Hashimoto and Wei Jiang

1 Program of Intelligence and Control, Cluster of Electronics and Mechanical Engineering, School of Science and Technology, Gunma University, 1-5-1 Tenjin-cho, Kiryu 376-8515, Japan
2 Ryomo Systems Co., Ltd., Ota 373-0853, Japan
3 Department of Electronic Engineering, Yangzhou University, Yangzhou 225012, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(11), 6847; https://doi.org/10.3390/app13116847
Submission received: 17 April 2023 / Revised: 27 May 2023 / Accepted: 3 June 2023 / Published: 5 June 2023

Abstract

The potential of autonomous driving technology to revolutionize the transportation industry has attracted significant attention. Path following, a fundamental task in autonomous driving, involves accurately and safely guiding a vehicle along a specified path. Conventional path-following methods often rely on rule-based design or manual parameter tuning, which may not adapt well to complex and dynamic scenarios. Reinforcement learning (RL) has emerged as a promising approach that can learn effective control policies from experience without prior knowledge of system dynamics. This paper investigates the effectiveness of the Deep Deterministic Policy Gradient (DDPG) algorithm for steering control in ground vehicle path following. The algorithm converges quickly, and the trained agent achieves stable and fast path following, outperforming three baseline methods. Additionally, the agent achieves smooth control without excessive actions. These results validate the effectiveness of the proposed approach, which could contribute to the development of autonomous driving technology.

1. Introduction

Autonomous driving is attracting increasing attention due to its potential to reshape mobility and to improve both safety and efficiency [1]. In urban environments, self-driving vehicles are less prone to traffic accidents, which helps keep the traffic network fluid. In agricultural areas, reducing the required labor compensates for the declining agricultural workforce and makes it possible to carry out tasks during hours that are unsuitable for human work. Although various technologies such as localization, perception and planning are required to realize the full potential of autonomous driving, ensuring that the vehicle can stabilize at a set point or dynamically track reference signals and trajectories is a fundamental technology requirement [1,2,3].
Path following has been widely studied as an alternative to trajectory tracking for various types of vehicles because of its looser time constraints [2]. In trajectory-tracking problems, the reference trajectory defines when and where the vehicle is supposed to be in the state space, whereas the objective of path following is to keep the vehicle as close as possible to, and moving along, a predefined geometric path with no preassigned time information [2,4,5]. Pure pursuit is one of the earliest proposed strategies for following a path [6,7]. The straightforward nature of this strategy has made it a popular choice in many applications where real-time control is essential. Assuming that the reference path has no curvature and the vehicle moves at a constant speed, the pure pursuit controller fits a circle through the vehicle’s current configuration, which in the case of a car is the rear wheel position, and a point on the path ahead of the vehicle at a so-called look-ahead distance L [1,8]. However, the controller does not consider the case where the distance from the current configuration to the reference path is greater than L. Among the feedback control-based approaches, an approach was proposed in [9] in which the reference path is parameterized as a continuous function and the rear wheel position is used as the regulated variable to minimize the cross-track error between the rear wheel and the path, while also ensuring the stability of the vehicle heading. Notwithstanding the local exponential convergence ensured by the control law, knowledge of the reference path’s curvature is required, which means the parameterized function must be twice continuously differentiable. Another feedback-based control law, which minimizes the cross-track error of the front wheel position relative to a reference path discretized from the base trajectory into smoothed waypoints, was proposed in [10]. This approach, which utilizes a nonlinear feedback function of the cross-track error to ensure local exponential convergence to the reference path, provides effective control at lower speeds but requires some variations to be used for reverse driving. The vehicle robot using this controller for steering, namely Stanley, won the 2005 DARPA Grand Challenge. The aforementioned approaches [6,9,10] are suitable as baselines for reference and comparison, since they achieve satisfactory performance with a small number of adjustable parameters and moderate requirements on model accuracy and uncertainty.
In recent years, RL has achieved significant accomplishments in many fields, such as gaming and robotics, which has brought it great attention and widespread recognition [11,12]. RL is a machine learning approach that enables an agent to learn optimal decisions through interactions with the environment. Since the introduction of deep neural networks to approximate the action-value function [13], also known as the Q-function, RL has been capable of dealing with high-dimensional state spaces. The Deterministic Policy Gradient (DPG) algorithm [14] replaces the representation of the policy as a probability distribution with a deterministic function, making control in continuous action spaces more practical. The DDPG algorithm [15], which combines the two approaches, has thus found widespread application in fields such as robotics control. Over the last few years, numerous researchers have attempted to apply deep reinforcement learning (DRL) methods to the path-following problem. Rubí et al. [5] proposed three sequentially improved methods based on DDPG, which realized path tracking and adaptive velocity control of a quadrotor. Because it is difficult to fine-tune an agent trained in a simulation environment through experimentation, a correction constant is employed in their experiments to stabilize the agent’s output. Cheng et al. [16] accomplished path following and collision avoidance for a nonholonomic wheeled mobile robot in simulation, but the trained agent exerted excessive control effort, resulting in high jerks in the robot velocities. More recently, Zheng et al. [17] proposed a 3D path-following control method for powered parafoils that combines linear active disturbance rejection control and DDPG to control the parafoil’s flight trajectory and counter wind disturbances. Ma et al. [18] presented a path-following control scheme based on Soft Actor–Critic for an underwater vehicle, demonstrating successful path tracking.
Our work focuses on exploring the application of RL to the path-following problem for ground car-like vehicles that move at medium speeds, with the primary objective of minimizing cross-track error. We use the DDPG algorithm, which combines the advantages of Deep Q-Network (DQN) and DPG, to solely address the steering control of the vehicle, separating lateral and longitudinal control to simplify the problem and reduce the action dimension. We also use a simple reward function to achieve smooth steering and avoid jitters during vehicle operation. This approach has been demonstrated to be easy to train and effective, with better performance on the trained path compared to three baseline methods and comparable performance on untrained paths. The trained agent’s control strategy has been shown to respond quickly to lateral offset from the target path with an acceptable overshoot. Given the advantages of this method, it has great potential for practical application.
The remainder of this paper is structured as follows. In Section 2, the path-following problem is described using a general approach, allowing for broad applicability. The kinematics model of the car-like vehicle, which relates to the reference path, is then introduced in detail. Section 3 provides an overview of the prerequisite knowledge of reinforcement learning, which forms the foundation of the actor–critic algorithm known as DDPG. This is followed by a description of the design and algorithm of the path-following controller based on DDPG, including implementation details. In Section 4, the training process and corresponding agent performances on three different paths are presented. Each case is compared and numerically analyzed against three baseline methods, with additional discussion on corresponding steering actions. The conclusion is presented in Section 5.

2. Problem Formulation and Modeling

In this section, we present the problem of path-following for ground car-like vehicles in a planar environment. The problem is described in a generalized manner, making it applicable to a variety of vehicle models. The ground vehicles are modeled as simplified two-wheeled vehicles, where it is assumed that the left and right wheels at the front and rear of the vehicle are consolidated at the center position of the axle. Additionally, the model takes into account the inertia effects.

2.1. Path Following

For a controlled nonlinear system of the form below,
$\dot{x}(t) = f\big(x(t), u(t)\big), \qquad x(t_0) = x_0$  (1)
where $x \in \mathcal{X} \subseteq \mathbb{R}^n$ and $u \in \mathcal{U} \subseteq \mathbb{R}^m$ define state and input constraints, the objective of path following is to design a controller such that the system follows a parametrized reference path [2]. The reference path can be given by Equation (2), where only movement in the plane is considered.
$\mathcal{P} = \big\{\, p \in \mathbb{R}^2 \mid p = p_T(\omega),\ \omega \in \mathbb{R}_+ \,\big\}$  (2)
For any given ω , a local tangential path reference frame { T } centered at p T ( ω ) can be defined, which is indicated by the subscript T. The relative angle σ between the fixed ground frame { O } and the local reference frame { T } can be calculated with Equation (3).
$\sigma_T(\omega) = \operatorname{atan2}\big(y_T'(\omega),\, x_T'(\omega)\big)$  (3)
where atan2 is the four-quadrant version of arctan, returning the angle, measured positive counterclockwise, between the positive x-axis and the point $[x_T', y_T']^T$ in the Cartesian plane. It follows that the parametrized reference path must be continuously differentiable.
Given the vehicle’s posture $[x(t), y(t), \phi(t)]^T$ at time t, the path-following error is given by Equation (4), which is a cross product between two vectors. The control objective is to guarantee that the path-following error converges such that $\lim_{t \to \infty} e_p(t) = 0$.
$e_p(t) = d_y \hat{t}_x - d_x \hat{t}_y$  (4)
where $d = (d_x, d_y)$ is the tracking error vector and $\hat{t} = (\hat{t}_x, \hat{t}_y)$ is the unit tangent vector to the reference path at $\omega(t)$, as defined in Equations (5) and (6), respectively.
$d = \big(x(t), y(t)\big) - \big(x_T(\omega(t)),\, y_T(\omega(t))\big)$  (5)
$\hat{t} = \dfrac{\big(x_T'(\omega(t)),\, y_T'(\omega(t))\big)}{\big\|\big(x_T'(\omega(t)),\, y_T'(\omega(t))\big)\big\|}$  (6)
The relative orientation $\psi(t)$ between the vehicle and the path at time t is given by Equation (7). It indicates whether the vehicle is moving towards the direction of the path or away from it. In practice, we employ a normalization technique to restrict the value of $\psi(t)$ to the range $-\pi$ to $\pi$. Normalizing the angle difference is a common technique. Although the trigonometric operations remain the same, constraining the range to this specific interval helps reduce the observation space.
$\psi(t) = \phi(t) - \sigma_T(\omega(t)), \qquad \psi(t) \leftarrow \operatorname{atan2}\big(\sin(\psi(t)),\, \cos(\psi(t))\big)$  (7)
To find the point along the reference path from which to calculate the cross-track error, the point nearest to the vehicle is selected [9,19,20]. This gives rise to an optimization problem of finding the parameter ω that minimizes the distance between the vehicle position and the reference path. We use the squared Euclidean distance, since it is equivalent to the original optimization problem and computationally more convenient. The optimization problem can be expressed as follows:
$\omega(t) = \arg\min_{\omega} \big\|\big(x(t), y(t)\big) - \big(x_T(\omega),\, y_T(\omega)\big)\big\|^2$  (8)
One natural approach for updating the path variable ω is to iteratively compute the value that yields the nearest distance between the vehicle and the reference path using Newton’s method [19]. Because Newton’s method only guarantees a local optimum, warm-starting it with the previous value of the path variable prevents sudden jumps in the path variable and promotes stability in the optimization process. One can refer to [21] for details.
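For illustration, the NumPy sketch below shows one possible implementation of the warm-started Newton update of Equation (8) together with the error signals of Equations (3)-(7). The callables `p`, `dp` and `ddp` (the path point and its first and second derivatives with respect to ω) and all function names are our assumptions for the sketch, not code from the paper.

```python
import numpy as np

def update_path_variable(omega, pos, p, dp, ddp, iters=5):
    """Newton iterations on g(w) = ||pos - p(w)||^2, warm-started at the previous omega."""
    for _ in range(iters):
        d = pos - p(omega)                   # tracking error vector, Eq. (5)
        g1 = -2.0 * np.dot(d, dp(omega))     # first derivative of g
        g2 = 2.0 * (np.dot(dp(omega), dp(omega)) - np.dot(d, ddp(omega)))  # second derivative
        if abs(g2) < 1e-9:
            break
        omega -= g1 / g2
    return omega

def path_errors(omega, pos, heading, p, dp):
    """Cross-track error e_p (Eq. 4) and relative orientation psi (Eq. 7)."""
    d = pos - p(omega)                                    # Eq. (5)
    tangent = dp(omega)
    t_hat = tangent / np.linalg.norm(tangent)             # unit tangent, Eq. (6)
    e_p = d[1] * t_hat[0] - d[0] * t_hat[1]               # Eq. (4)
    sigma = np.arctan2(tangent[1], tangent[0])            # Eq. (3)
    psi = np.arctan2(np.sin(heading - sigma), np.cos(heading - sigma))  # Eq. (7)
    return e_p, psi
```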

2.2. Kinematics Model for Car-like Vehicles

Car-like vehicles are a class of vehicles that are capable of independently controlling their forward speed and steering, with the most commonly used model being the equivalent kinematics bicycle model [1,22,23]. The bicycle model as illustrated in Figure 1 consists of two wheels connected by a rigid link and is restricted to movement in a plane, where the front wheel is allowed to rotate about the axis vertical to the plane and the rear wheel fixed to the body provides forward momentum. More generally, vehicles with these constraints on maneuverability are referred to as nonholonomic vehicles. The parameters and notations of the vehicle model used in Figure 1 are presented in Table 1 and Table 2, respectively.
The motion in the lateral and yaw directions of the vehicle’s center of gravity (C.G.) is given by the following equation [22,24,25].
$m V (\dot{\beta} + \lambda) = F_f + F_r, \qquad I_z \dot{\lambda} = l_f F_f - l_r F_r$  (9)
where cornering forces generated in the front and rear tires, denoted as F f and F r , respectively, are expressed as follows.
$F_f = -K_f\left(\beta + \dfrac{l_f \lambda}{V} - \delta\right), \qquad F_r = -K_r\left(\beta - \dfrac{l_r \lambda}{V}\right)$  (10)
The motion equation for the vehicle model is derived by substituting Equation (10) into Equation (9), resulting in the following expression.
$m V \dot{\beta} + (K_f + K_r)\beta + \left\{ m V + \dfrac{l_f K_f - l_r K_r}{V} \right\}\lambda = K_f \delta, \qquad (l_f K_f - l_r K_r)\beta + I_z \dot{\lambda} + \dfrac{l_f^2 K_f + l_r^2 K_r}{V}\lambda = l_f K_f \delta$  (11)
where δ is an additional degree of freedom of the front tire, which rotates to steer the vehicle [26]. Due to vehicle mechanics, the tire angle δ is usually limited to a range $\delta \in [\delta_{min}, \delta_{max}]$.
Let $\mathbf{x} = [x, y, \phi]^T \in \mathbb{R}^3$ denote the posture of the vehicle’s C.G. in the fixed ground frame { O }. Here, the heading angle ϕ corresponds to the orientation of the vehicle, measured as the angle between the x-axis of frame { O } and the direction in which the vehicle is moving. Positive values of the heading angle indicate a counterclockwise rotation. The kinematics model that describes the differential constraints of the vehicle can be given by Equation (12).
$\dot{x} = V\cos(\phi + \beta), \qquad \dot{y} = V\sin(\phi + \beta), \qquad \dot{\phi} = \lambda$  (12)
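As a rough illustration of how Equations (11) and (12) can be simulated together, the following Python sketch performs one forward-Euler step using the parameter values of Table 1 (velocity converted to m/s). The explicit Euler scheme and the 0.05 s step are our assumptions; the paper does not specify its integrator.

```python
import numpy as np

# Vehicle parameters from Table 1 (velocity converted from 28 km/h to m/s)
m, V, Iz = 1188.0, 28.0 / 3.6, 2243.1
lf, lr = 1.1281, 1.4719
Kf, Kr = 76744.0, 119320.0

def vehicle_step(state, delta, dt=0.05):
    """One forward-Euler step of the lateral dynamics (Eq. 11) and kinematics (Eq. 12).
    state = [x, y, phi, beta, lam]; delta is the front tire angle in rad."""
    x, y, phi, beta, lam = state
    beta_dot = (Kf * delta - (Kf + Kr) * beta
                - (m * V + (lf * Kf - lr * Kr) / V) * lam) / (m * V)
    lam_dot = (lf * Kf * delta - (lf * Kf - lr * Kr) * beta
               - (lf**2 * Kf + lr**2 * Kr) / V * lam) / Iz
    x_dot = V * np.cos(phi + beta)
    y_dot = V * np.sin(phi + beta)
    return np.array([x + x_dot * dt, y + y_dot * dt, phi + lam * dt,
                     beta + beta_dot * dt, lam + lam_dot * dt])
```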
The path-following problem based on this model is depicted in Figure 2. In the figure, $e_p$ denotes the cross-track error, which represents the deviation between the tangent at the nearest point on the path and the C.G. of the vehicle. When the vehicle is on the left side of the path, $e_p$ is greater than zero ($e_p > 0$); conversely, when the vehicle is on the right side of the path, $e_p$ is less than zero ($e_p < 0$). $\psi$ signifies the relative orientation between the C.G. of the vehicle and the path. When $\psi$ ranges from $-\pi/2$ to $\pi/2$, the vehicle’s travel direction aligns with the path. When $|\psi|$ exceeds $\pi/2$, it indicates that the vehicle is traveling in the opposite direction to the path. Additionally, a value of $|\psi| = \pi/2$ represents the vehicle moving perpendicular to the path.
In this study, we address the problem of path following for a car-like vehicle, as depicted in the structure illustrated in Figure 3. This structure separates the longitudinal and lateral controls. Our focus is on the steering control, which is determined by a DDPG agent [15], with the aim of keeping the vehicle as close as possible to the reference path $p_T(\omega)$ while maintaining a certain velocity $V^*$. The three baseline methods [6,9,10] also rely on this separated control structure, where the path-following algorithm is responsible for providing the commanded steering angle value. Specifically, in this study, the lower-level controllers are assumed to be identity transformations.

3. Path-Following Control Strategy with Deep Deterministic Policy Gradient

In this section, we employ a DRL approach to investigate path following for car-like vehicles in a customized simulation environment. This approach is based on the kinematic bicycle model and reference path definition introduced in the preceding section. Prior knowledge about DDPG is first discussed, followed by its implementation in the context of path following.

3.1. Preliminaries of Reinforcement Learning

RL involves an agent interacting with the environment to learn the optimal actions that maximize cumulative reward. Here, we consider a standard RL architecture based on a Markov Decision Process (MDP): given a state $s_t \in \mathcal{S}$ at time t, the agent takes an action $a_t \in \mathcal{A}$, receives a corresponding reward $r_t$ and transitions to the next state $s_{t+1}$ with probability $p(s_{t+1} \mid s_t, a_t)$ [27,28,29]. The MDP models the decision-making process of the agent as it interacts with the environment. Actions taken by the agent are specified by a policy, which in general is stochastic and denoted by $\pi(a \mid s)$. The policy $\pi : \mathcal{S} \to \mathcal{P}(\mathcal{A})$ maps states to a probability distribution over actions, specifying the likelihood of taking each action given a state.
The return, which the agent seeks to maximize, is the total discounted reward from time step t onwards, as defined in Equation (13).
$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=t}^{\infty} \gamma^{\,i-t}\, r(s_i, a_i)$  (13)
where $\gamma \in (0, 1)$ is a discount rate that is used to prevent divergence of the return to infinity. Let $\rho^{\pi}$ denote the discounted state distribution for a policy π; the action-value function, also commonly referred to as the Q-function, is defined as the expected total discounted reward, $Q^{\pi}(s_t, a_t) = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi}[\, R_t \mid s_t, a_t \,]$. The aim of the agent is to acquire a policy that maximizes the cumulative discounted reward from the initial state, which is expressed as an expectation in Equation (14).
$J(\pi) = \int_{\mathcal{S}} \rho^{\pi}(s) \int_{\mathcal{A}} \pi(s, a)\, r(s, a)\, \mathrm{d}a\, \mathrm{d}s = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi}[\, r(s, a) \,]$  (14)
The Bellman equation is a fundamental recursive relationship that is widely employed in the field of RL [14,30]. Specifically, the Q-function under a stochastic policy is expressed as Equation (15), whereas the same function under a deterministic policy $\mu : \mathcal{S} \to \mathcal{A}$ can be rewritten as Equation (16) [14,15].
$Q^{\pi}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim \rho^{\pi}}\big[\, r_t + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}[\, Q^{\pi}(s_{t+1}, a_{t+1}) \,] \,\big]$  (15)
$Q^{\mu}(s_t, a_t) = \mathbb{E}_{s_{t+1} \sim \rho^{\pi}}\big[\, r_t + \gamma\, Q^{\mu}\big(s_{t+1}, \mu(s_{t+1})\big) \,\big]$  (16)
With a deterministic policy, the expectation depends solely on the environment, allowing $Q^{\mu}$ to be learned off-policy by utilizing transitions generated from a distinct stochastic policy α. The actor–critic architecture is a widely utilized off-policy framework [31,32], consisting of two eponymous components. Considering function approximators parameterized by $\theta \in \mathbb{R}^n$, which refers to a vector of n parameters, an actor updates the parameters $\theta^{\mu}$ of the actor function μ using policy gradients, while a critic updates the parameters $\theta^{Q}$ and estimates the unknown true action-value function $Q^{\mu}$ using a policy evaluation algorithm such as temporal-difference (TD) learning. By minimizing the mean squared loss given by Equation (17), we can optimize the parameters of the critic.
$L(\theta^{Q}) = \mathbb{E}_{s_t \sim \rho^{\alpha},\, a_t \sim \alpha}\Big[\, \big(y_t - Q(s_t, a_t \mid \theta^{Q})\big)^2 \,\Big]$  (17)
where y t is the TD target that the Q-function is updating towards.
$y_t = r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1}) \mid \theta^{Q}\big)$  (18)
By applying the chain rule to the actor parameters $\theta^{\mu}$, the actor update, which follows the gradient of the expected return J from the start distribution, is given by Equation (19).
$\nabla_{\theta^{\mu}} J \approx \mathbb{E}_{s_t \sim \rho^{\alpha}}\Big[\, \nabla_{\theta^{\mu}} Q(s, a \mid \theta^{Q})\big|_{s = s_t,\, a = \mu(s_t \mid \theta^{\mu})} \,\Big] = \mathbb{E}_{s_t \sim \rho^{\alpha}}\Big[\, \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s = s_t,\, a = \mu(s_t)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s = s_t} \,\Big]$  (19)
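To make the actor–critic updates concrete, the following PyTorch-style sketch shows how Equations (17)-(19) translate into a single optimization step. The `actor`, `critic`, their target copies, the optimizers and the minibatch tensors are assumed to exist already; this is an illustrative sketch, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99):
    """One DDPG optimization step on a sampled minibatch (s, a, r, s_next)."""
    s, a, r, s_next = batch  # torch.Tensor minibatches

    # Critic: minimize the TD loss of Eq. (17) with the target of Eq. (18)
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow the deterministic policy gradient of Eq. (19),
    # implemented by minimizing -Q(s, mu(s)) with respect to the actor parameters
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```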

3.2. Deep Deterministic Policy Gradient for Path-Following

The path-following controller based on the DDPG algorithm makes use of a pair of neural networks, one for learning the policy (actor) and the other for learning the value function (critic) and introduces target networks to reduce bias in estimation through delayed updates. Additionally, it employs an experience replay buffer to store and replay samples, which reduces sample correlation, increases sample efficiency and enhances the learning capability of the algorithm. As used in the original paper [15], an Ornstein–Uhlenbeck (OU) process [33], which generates temporally correlated noise, is introduced to explore the action space. An overall architecture is depicted in Figure 4.

3.2.1. Observation Space and Action Space

As the MDP under consideration is partially observable, we use the term “observation” instead of “state” to denote the information the agent relies on when taking actions. The agent performs some actions in the environment and observes the resulting changes in the environment’s state. This interaction between action and observation is commonly known as a time step. The observation s, which serves as the input to the path-following controller, is selected in an intuitive and low-dimensional manner:
$s = \{\, e_p,\ \psi,\ \delta \,\}$  (20)
where $e_p$ is the cross-track error in Equation (4), $\psi \in [-\pi, \pi]$ is the orientation error between the path and the vehicle, and δ is the steering angle of the front tire, as graphically presented in Figure 2. We calculate the cross-track error and heading error based on the C.G. of the vehicle, as described in Section 2.1. The steering angle δ reflects the level of effort exerted by the vehicle. More details are presented in Table 3. It is worth noting that, for the purpose of constraining the sparsity of the reward function, the absolute value of the cross-track error is limited to below 2.0 m. This constraint helps reduce the size of the observation space. Additionally, any deviation exceeding 2.0 m is considered an unacceptable error and the vehicle should be forcefully brought to a stop.
We choose the steering angle rate $\dot{\delta}$, rather than the steering angle δ itself, as the action, $a_t = \mu(s_t \mid \theta^{\mu})$, to avoid undesirably fast angle changes. Adopting an incremental control input to the vehicle makes it easier to achieve smooth vehicle motions. Furthermore, constraining the steering angle rate within a specified range, i.e., $\dot{\delta} \in [\dot{\delta}_{min}, \dot{\delta}_{max}]$, provides better stability. The steering angle command $\delta_t^*$ at each time step during training is computed using Equation (21), which incorporates the sampled noise signal $n_t$ from an OU process $\mathcal{N}$; the noise is omitted outside the training phase. The clipping operation bounds values within their ranges, typically to prevent saturation. The constraints on the action and the OU process parameters are listed in Table 4 and Table 5.
$n_t = n_{t-1} + \nu_n (\mu_n - n_{t-1}) \Delta t + \zeta_n\, \mathrm{d}W_t, \qquad \delta_t^* = \delta_t + \operatorname{clip}\big(\mu(s_t \mid \theta^{\mu}) + n_t,\ \dot{\delta}_{min},\ \dot{\delta}_{max}\big)\, \Delta t, \qquad \delta_t^* = \operatorname{clip}\big(\delta_t^*,\ \delta_{min},\ \delta_{max}\big)$  (21)
where $\nu_n$ determines the speed of mean reversion, the drift term $\mu_n$ sets the asymptotic mean, and $\mathrm{d}W_t$ denotes the increment of a standard Wiener process, scaled by the volatility $\zeta_n$.
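A minimal Python sketch of this action pipeline is given below, using the OU parameters of Table 5 and the limits of Tables 3 and 4; the Euler–Maruyama discretization of $\mathrm{d}W_t$ as $\sqrt{\Delta t}$ times a standard normal sample is our assumption.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck exploration noise of Eq. (21), with the parameters of Table 5."""
    def __init__(self, mu_n=0.0, zeta_n=0.15708, nu_n=0.15, dt=0.05):
        self.mu_n, self.zeta_n, self.nu_n, self.dt = mu_n, zeta_n, nu_n, dt
        self.n = 0.0

    def sample(self):
        self.n += (self.nu_n * (self.mu_n - self.n) * self.dt
                   + self.zeta_n * np.sqrt(self.dt) * np.random.randn())
        return self.n

def steering_command(delta, action, noise, dt=0.05,
                     rate_lim=1.5708, angle_lim=0.5236):
    """Clip the (noisy) steering-rate action and integrate it into a bounded steering command."""
    rate = np.clip(action + noise.sample(), -rate_lim, rate_lim)
    return np.clip(delta + rate * dt, -angle_lim, angle_lim)
```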

3.2.2. Rewards

Given that the objective of an agent is to maximize long-term returns, the design of the reward function is crucial for satisfactory performance. In the context of path following, a natural and intuitive approach is to reward the agent for minimizing the cross-track error with respect to the desired path. In [5], the agent is rewarded when the vehicle stays on the path and is penalized by an absolute-value function when it deviates from it. In [21], a Gaussian reward function centered at a cross-track error of 0, with a reasonable standard deviation, is employed. We believe, however, that the sharper shape of an exponential reward function incentivizes more effective minimization of the cross-track error. To ensure smooth steering control, we penalize excessive steering, without considering the consistency between the heading and the path. To mitigate the sparsity of the exponential reward and reduce insignificant experiences, the reward function, shown in Equation (22), is kept concise. Moreover, if the vehicle deviates from the path beyond a certain distance $e_{max}$ or moves in the opposite direction of the path, the current training episode is truncated, indicating a failure to complete it. In such cases, a negative reward is given to penalize the failure. On the other hand, if the vehicle successfully completes the path or reaches the maximum number of time steps allowed for a single episode, a positive reward is given. The scaling parameters associated with the rewards are provided in Table 6.
$r(t) = e^{-c_e |e_p(t)|} - c_{\delta}\, |\delta(t)| - c_{\dot{\delta}}\, |\dot{\delta}(t)| + c_H H(t) - c_F F(t)$  (22)
where $c_e > 0$ determines the sparsity of the reward and the degree of convergence in training; a high $c_e$ may result in overly sparse rewards. $c_{\delta} > 0$ and $c_{\dot{\delta}} > 0$ scale the penalty terms on the steering angle and its rate to penalize excessive steering. H and F are logical values indicating the successful completion of an episode and failure, respectively.
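As a small illustration, the reward of Equation (22) with the Table 6 parameters could be computed as follows; the function signature is ours.

```python
import numpy as np

def reward(e_p, delta, delta_dot, success=False, failure=False,
           c_e=2.0, c_delta=0.1, c_delta_dot=0.5, c_H=10.0, c_F=10.0):
    """Per-step reward of Eq. (22) with the scaling parameters of Table 6."""
    r = (np.exp(-c_e * abs(e_p))            # exponential tracking term
         - c_delta * abs(delta)             # steering angle penalty
         - c_delta_dot * abs(delta_dot))    # steering rate penalty
    if success:                             # H(t) = 1: episode completed
        r += c_H
    if failure:                             # F(t) = 1: episode truncated
        r -= c_F
    return r
```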

3.2.3. Environment

The environment in which the agent is expected to perform is an a priori known path. The vehicle kinematics must also be treated as part of the training environment, since it is not under the agent’s direct control. Moreover, it is crucial for the agent to be trained on a wide variety of challenges so that it can handle generalized situations instead of overfitting to specific paths. Therefore, we propose the algorithm outlined in Algorithm 1 for generating stochastic reference paths.
Algorithm 1 Stochastic Path Generator
Generated waypoint counter $n \leftarrow 1$, starting waypoint $p_1 \leftarrow [0, 0]^T$, number of path waypoints $N_w \in \mathbb{N}$, range of length between waypoints $[L_{min}, L_{max}] \subset \mathbb{R}_+$
while  n N w  do
    Sample L w from U ( L m i n , L m a x )
    Sample θ w from U ( 0 , 2 π )
    New waypoint $p_{n+1} \leftarrow p_n + L_w [\cos(\theta_w), \sin(\theta_w)]^T$
     $n \leftarrow n + 1$
end while
Create parameterized path p T ( ω ) = [ x T ( ω ) , y T ( ω ) ] T using Cubic Spline Interpolator
In this work, $N_w \sim U(2, 6)$, $L_{min} = 25$ and $L_{max} = 50$. Some paths randomly generated from this algorithm are shown in Figure 5.
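A possible Python implementation of Algorithm 1 is sketched below. Parameterizing the spline by the cumulative chord length between waypoints and using SciPy's CubicSpline are our choices; the paper only states that a cubic spline interpolator is used.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def random_path(l_min=25.0, l_max=50.0, rng=None):
    """Algorithm 1: random waypoints joined by a parameterized cubic spline."""
    rng = np.random.default_rng() if rng is None else rng
    n_waypoints = rng.integers(2, 7)                  # N_w ~ U(2, 6)
    pts = [np.zeros(2)]                               # p_1 = [0, 0]^T
    for _ in range(n_waypoints):
        L_w = rng.uniform(l_min, l_max)
        theta_w = rng.uniform(0.0, 2.0 * np.pi)
        pts.append(pts[-1] + L_w * np.array([np.cos(theta_w), np.sin(theta_w)]))
    pts = np.asarray(pts)
    # Parameterize by cumulative chord length and interpolate x_T(w), y_T(w)
    w = np.concatenate(([0.0], np.cumsum(np.linalg.norm(np.diff(pts, axis=0), axis=1))))
    return CubicSpline(w, pts[:, 0]), CubicSpline(w, pts[:, 1])
```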

3.2.4. Implementation Details

The actor and critic neural networks both consist of two hidden layers, as illustrated in Figure 6. Each layer includes rectified linear units (ReLU) activation, with 400 neurons in the first hidden layer and 300 neurons in the second. Notably, in the critic’s network, the state vector connects to the first hidden layer, whereas the action is concatenated before the second hidden layer, following the structure of the original algorithm. This design allows the action to bypass the first layer, which improves the stability and performance of the networks [15]. The final layer of the actor is a tanh layer used to bound the action.
We initialize the weights following the method described in [34], with the exception that we used uniform distributions $[-3 \times 10^{-3}, 3 \times 10^{-3}]$ and $[-3 \times 10^{-4}, 3 \times 10^{-4}]$ to initialize the final layers of the actor and critic networks, respectively. This is done to prevent output saturation during the early stages of training. To optimize the neural networks, we employ Adam [35] with a minibatch size of 64.
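A compact PyTorch sketch consistent with this description and Figure 6 is shown below. The scaling of the tanh output by the steering-rate limit and the use of PyTorch's default initialization for the hidden layers (rather than explicitly coding the scheme of [34]) are simplifications on our part.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Observation (e_p, psi, delta) -> steering rate, bounded by a tanh output layer."""
    def __init__(self, state_dim=3, action_dim=1, max_rate=1.5708):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400, 300)
        self.out = nn.Linear(300, action_dim)
        nn.init.uniform_(self.out.weight, -3e-3, 3e-3)  # small final-layer init
        nn.init.uniform_(self.out.bias, -3e-3, 3e-3)
        self.max_rate = max_rate

    def forward(self, s):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(h))
        return self.max_rate * torch.tanh(self.out(h))  # scale to the rate limit (our choice)

class Critic(nn.Module):
    """(Observation, action) -> Q value; the action joins at the second hidden layer."""
    def __init__(self, state_dim=3, action_dim=1):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 400)
        self.fc2 = nn.Linear(400 + action_dim, 300)
        self.out = nn.Linear(300, 1)
        nn.init.uniform_(self.out.weight, -3e-4, 3e-4)
        nn.init.uniform_(self.out.bias, -3e-4, 3e-4)

    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=-1)))
        return self.out(h)
```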
Algorithm 2 outlines the training process based on DDPG, following the path-following strategy described above. The initial posture of the vehicle is sampled from a uniform distribution, with a position range of $[-1, 1]$ meters and a heading angle range of $[-0.2618, 0.2618]$ radians. We also introduce a warm-up technique to collect completely random experiences: during the initial training period, the agent uniformly samples random actions from the action space. After warming up, the actor network generates an output, i.e., an action, based on the observed state and adds sampled noise; the environment transitions to the next state, the corresponding reward is computed, and the transition is stored in the experience replay buffer as a tuple. Once the number of stored experiences reaches the minibatch size, the networks are optimized and updated. The losses required for updating the parameters of the critic and actor networks are calculated using Equations (17) and (19). After optimizing the networks, the parameters of the target networks are updated using a soft update strategy, as denoted by Equation (23). Specifically, a fraction of the updated network parameters is blended into the target network parameters, which helps to stabilize the learning process and avoid oscillations.
$\theta_{target} \leftarrow \tau\, \theta_{updated} + (1 - \tau)\, \theta_{target}$  (23)
where the parameter τ determines how quickly the target networks track the online networks; the update is performed at every step after the online networks are trained.
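In code, the soft update of Equation (23) can be written in a few lines of PyTorch; this helper is an illustrative sketch.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.001):
    """Eq. (23): theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    for tp, p in zip(target_net.parameters(), online_net.parameters()):
        tp.mul_(1.0 - tau).add_(tau * p)
```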
During training, a random path is generated for each episode and the agent’s action is subject to exploration noise. The agent ultimately selected is determined by its performance in evaluation. The difference between training and evaluation is that, during evaluation, the actions taken by the agent are based solely on the current learned policy, without the addition of exploration noise. For each evaluation, the agent’s performance is assessed across 10 randomly generated paths. Since evaluation occurs at regular intervals, we select the agent that achieves the highest reward among the evaluations. Table 7 presents the relevant parameters and their values used in training. We also conduct the training 10 times using 10 different seeds to ensure reproducibility. For instance, the uniform sampling of the initial position of the vehicle and the random generation of paths rely on a random number generator that is controlled by a seed.
Algorithm 2 Training Process of Path-Following Control Strategy for Car-Like Vehicles
Randomly Initialize critic network Q ( s , a | θ Q ) and actor μ ( s | θ μ ) with weights θ Q and θ μ
Initialize target networks $Q'$ and $\mu'$ with weights $\theta^{Q'} \leftarrow \theta^{Q}$ and $\theta^{\mu'} \leftarrow \theta^{\mu}$
Initialize replay buffer R
for  t = 1 , T  do
    Initialize a stochastic path P and random initial posture [ x 1 , y 1 , θ 1 ] T of vehicle
    Initialize an OU process N for action exploration
    Observe initial state s 1
    while True do
        if  t < T s t a r t  then
           Select action randomly from action space
        else
           Select action based on the policy a t = clip ( μ ( s t | θ μ ) + N t , δ ˙ m i n , δ ˙ m a x )
        end if
        Calculate steering command δ t * = clip ( δ t + a t Δ t , δ m i n , δ m a x )
        Execute steering control, calculate reward r t and transitions to new state s t + 1
        Store transition ( s t , a t , r t , s t + 1 ) in R
        if  $|e_p(t)| > e_{max}$ or $|\psi(t)| \geq \pi/2$  then
           Logical value F = 1 , Update reward r t
           break
        else if  $\|(x(t), y(t)) - (x_{N_w}, y_{N_w})\| < d$  then
           Logical value H = 1 , Update reward r t
           break
        end if
        if Number of transitions ≥ Minibatch Size then
           Sample a random minibatch of N transitions ( s i , a i , r i , s i + 1 ) from R
           Set $y_i = r_i + \gamma\, Q'\big(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$
            Update critic by minimizing the loss: $L = \frac{1}{N}\sum_i \big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2$
           Update the actor policy using the sampled policy gradient:
$\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_i \nabla_a Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$
           Update the target networks with soft update strategy:
$\theta^{Q'} \leftarrow \tau\, \theta^{Q} + (1-\tau)\, \theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\, \theta^{\mu} + (1-\tau)\, \theta^{\mu'}$
        end if
    end while
end for

3.3. Tools and Libraries

Our solution was implemented using the PyTorch library [36], which provides a comprehensive deep learning framework for constructing and training neural networks. In our implementation, we utilized PyTorch’s tensor operations and high-level modules, such as the torch.nn module, to construct our actor–critic networks. We also employed PyTorch’s optimization algorithms, such as the Adam optimizer, to update the weights of the networks during training. The agent was trained in a custom path-following environment built using Python, which allowed us to simulate a range of scenarios and evaluate the agent’s performance under different conditions.

4. Results

In this section, we discuss the training process for path following and present the test results on three parameterized paths after training. The evaluation criteria focus on the effectiveness of path convergence and on whether the agent adopts steering that is as minimal and smooth as possible.

4.1. Training Process

The learning curve for the path-following problem is illustrated in Figure 7, where the solid line represents the average of the 10 trials. The shaded region indicates half a standard deviation of the average evaluation. In the initial stage, the agent learns quickly and receives high rewards. However, as the warm-up phase ends and the agent starts taking actions based on the current policy, there is a period of decline with relatively unstable results across the 10 trials. Nevertheless, after progressing halfway through the learning process, all 10 agents begin to learn a stable policy that yields high rewards consistently.

4.2. Path Convergence with Smooth Steering

4.2.1. Figure-Eight Curve

The figure-eight curve, also known as the Lemniscate of Gerono, has a smoothly varying and widely distributed curvature. This curve can be parameterized by Equation set (24).
$x = a \sin(\omega), \qquad y = a \sin(\omega)\cos(\omega)$  (24)
where a is a constant that determines the size and shape of the curve and is set to 50.
The trajectories of the four methods, including our proposed approach, in tracking the curve are shown in Figure 8. All four methods are able to maintain a small cross-track error while following the path, with the main differences observed at locations with high curvature and sharp turns. From the right subplot, it can be clearly observed that our proposed approach reduces the cross-track error faster and maintains closer proximity to the desired path throughout the task. It is noteworthy that such a path did not appear in our training environment and, based on our experience, our random path generation algorithm would have difficulty producing similar paths. This highlights the generality of the RL-based method. The overall root mean squared cross-track error for the path-following task is summarized in Table 8. The results of the DDPG-based approach show the average and standard deviation of the performance of the 10 trained agents, indicating a high level of consistency across the 10 training runs.
The corresponding steering angles δ are shown in Figure 9. As expected, our proposed approach achieves smooth steering actions while the agent reduces the cross-track error in the task above, thereby avoiding jitter. In addition, the agent trained using our approach tends to employ smaller steering angles.

4.2.2. Lane Change

The lane change, a very common vehicle maneuver, is selected to verify and compare the tracking performance of the algorithms. The path can be parameterized as a sigmoid function with Equation (25).
$x = \omega, \qquad y = \dfrac{b - a}{1 + e^{-k(\omega - c)}}$  (25)
where the parameters are defined as follows: a = 0 ,   b = 40 ,   c = 40 and k = 0.2 , representing the starting point, end point, center of the lane change and the steepness of the sigmoid function, respectively.
The trajectories for each method are shown in Figure 10 and the root mean squared cross-track error for the overall task is summarized in Table 9. In this scenario, the performances of the DDPG-based controller and the feedback-based controller stand out, with DDPG slightly outperforming the latter. However, both controllers tend to employ relatively larger steering angles than the other approaches, as shown in Figure 11. Compared to the previous scenario, the performance of the 10 agents varies more, with some agents performing well and others poorly.

4.2.3. Return to Lane

The convergence performance of the controller can be assessed by evaluating its ability to execute the return-to-lane path. The return-to-lane path refers to the vehicle’s process of returning to a straight line from an offset posture, which frequently happens during normal vehicle operation.
Compared to the other methods, our proposal demonstrates superior overall performance in terms of fast convergence and avoidance of overshoot. While achieving rapid path convergence, it also maintains smooth and minimal steering as illustrated in Figure 12. The numerical comparison is summarized in Table 10, where delay time is defined as the time required for the cross-track error to reach 50 % of its steady-state value, settling time is defined as the time required for the cross-track error to enter the ± 5 % range of its steady-state value and overshoot is defined as the percentage difference between the peak value of the cross-track error and its target value, relative to the target value.

5. Conclusions

In this paper, we explored an off-policy algorithm, namely DDPG, based on an actor–critic architecture to address the path-following problem for ground vehicles. Our approach not only minimizes the cross-track error between the vehicle and the path, but also prevents excessive steering that can cause severe oscillations. To train the agent, we used a challenging and varied environment in which each episode generates a random path. In testing, we evaluated the performance of the trained agents in terms of fast path convergence and smooth steering on three representative paths.
Conventional methods rely on rules and parameter tuning. For comparison, the three baseline methods mentioned in this paper require parameter adjustment for each path to achieve good path-following performance. In contrast, the trained agent has broader applicability and outperforms the baseline methods. Similar to the baseline methods, our agent focuses only on steering control to achieve the goal of path following, reducing the action dimension but also losing some exploration space. An agent that combines both speed and steering control may find better solutions, which is our future research direction. Furthermore, control strategies that account for both path following and tyre management in contact with various terrains [37,38], or, more importantly, for the pollution due to worn rubber particles [39,40], are a practical aspect we aim to investigate in our future work.
In conclusion, our approach effectively achieves smooth path following solely by interacting with the environment and maximizing long-term returns. It has demonstrated satisfactory performance and could contribute to the development of autonomous driving technology.

Author Contributions

Conceptualization, Y.C., T.K. (Takahiro Kawaguchi) and S.H.; methodology, Y.C., K.N., S.H. and W.J.; software, Y.C., X.J. and H.Z.; validation, K.N., X.J. and T.K. (Taiga Kuroiwa); formal analysis, Y.C., K.N., X.J. and H.Z.; investigation, Y.C., T.K. (Taiga Kuroiwa) and W.J.; resources, T.K. (Takahiro Kawaguchi) and S.H.; data curation, Y.C. and S.H.; writing—original draft preparation, Y.C.; writing—review and editing, Y.C.; visualization, Y.C.; supervision, T.K. (Takahiro Kawaguchi), S.H. and W.J.; project administration, S.H.; funding acquisition, T.K. (Takahiro Kawaguchi) and S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support the findings of this study are available from the author, Seiji Hashimoto, upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
RL | Reinforcement Learning
DDPG | Deep Deterministic Policy Gradient
DPG | Deterministic Policy Gradient
DRL | Deep Reinforcement Learning
PPO | Proximal Policy Optimization
DQN | Deep Q-Network
MDP | Markov Decision Process
TD | Temporal-Difference
OU | Ornstein–Uhlenbeck
ReLU | Rectified Linear Units

References

1. Paden, B.; Čáp, M.; Yong, S.Z.; Yershov, D.; Frazzoli, E. A Survey of Motion Planning and Control Techniques for Self-Driving Urban Vehicles. IEEE Trans. Intell. Veh. 2016, 1, 33–55.
2. Faulwasser, T.; Kern, B.; Findeisen, R. Model predictive path-following for constrained nonlinear systems. In Proceedings of the 48th IEEE Conference on Decision and Control (CDC) Held Jointly with 2009 28th Chinese Control Conference, Shanghai, China, 15–18 December 2009; pp. 8642–8647.
3. Yurtsever, E.; Lambert, J.; Carballo, A.; Takeda, K. A Survey of Autonomous Driving: Common Practices and Emerging Technologies. IEEE Access 2020, 8, 58443–58469.
4. Aguiar, A.P.; Hespanha, J.P.; Kokotović, P.V. Performance limitations in reference tracking and path following for nonlinear systems. Automatica 2008, 44, 598–610.
5. Rubí, B.; Morcego, B.; Pérez, R. Deep reinforcement learning for quadrotor path following with adaptive velocity. Auton. Robot. 2021, 45, 119–134.
6. Coulter, R.C. Implementation of the Pure Pursuit Path Tracking Algorithm; Technical Report; Carnegie Mellon University, Robotics Institute: Pittsburgh, PA, USA, 1992.
7. Amidi, O.; Thorpe, C.E. Integrated mobile robot control. In Proceedings of Mobile Robots V, SPIE, Boston, MA, USA, 1 March 1991; Volume 1388, pp. 504–523.
8. Amer, N.H.; Zamzuri, H.; Hudha, K.; Kadir, Z.A. Modelling and control strategies in path tracking control for autonomous ground vehicles: A review of state of the art and challenges. J. Intell. Robot. Syst. 2017, 86, 225–254.
9. Samson, C. Path Following and Time-Varying Feedback Stabilization of a Wheeled Mobile Robot. Second Int. Conf. Autom. Robot. Comput. Vis. 1992, 3, 1.
10. Thrun, S.; Montemerlo, M.; Dahlkamp, H.; Stavens, D.; Aron, A.; Diebel, J.; Fong, P.; Gale, J.; Halpenny, M.; Hoffmann, G.; et al. Stanley: The robot that won the DARPA Grand Challenge. J. Field Robot. 2006, 23, 661–692.
11. Zhao, W.; Queralta, J.P.; Westerlund, T. Sim-to-real transfer in deep reinforcement learning for robotics: A survey. In Proceedings of the 2020 IEEE Symposium Series on Computational Intelligence (SSCI), Canberra, Australia, 1–4 December 2020; pp. 737–744.
12. Li, Y. Deep reinforcement learning: An overview. arXiv 2017, arXiv:1701.07274.
13. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533.
14. Silver, D.; Lever, G.; Heess, N.; Degris, T.; Wierstra, D.; Riedmiller, M. Deterministic Policy Gradient algorithms. In Proceedings of the International Conference on Machine Learning, Beijing, China, 21–26 June 2014; pp. 387–395.
15. Lillicrap, T.P.; Hunt, J.J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; Wierstra, D. Continuous control with deep reinforcement learning. arXiv 2015, arXiv:1509.02971.
16. Cheng, X.; Zhang, S.; Cheng, S.; Xia, Q.; Zhang, J. Path-Following and Obstacle Avoidance Control of Nonholonomic Wheeled Mobile Robot Based on Deep Reinforcement Learning. Appl. Sci. 2022, 12, 6874.
17. Zheng, Y.; Tao, J.; Sun, Q.; Zeng, X.; Sun, H.; Sun, M.; Chen, Z. DDPG-based active disturbance rejection 3D path-following control for powered parafoil under wind disturbances. Nonlinear Dyn. 2023, 111, 11205–11221.
18. Ma, R.; Wang, Y.; Wang, S.; Cheng, L.; Wang, R.; Tan, M. Sample-Observed Soft Actor–Critic Learning for Path Following of a Biomimetic Underwater Vehicle. IEEE Trans. Autom. Sci. Eng. 2023, 1–10.
19. Martinsen, A.B.; Lekkas, A.M. Curved Path Following with Deep Reinforcement Learning: Results from Three Vessel Models. In Proceedings of the OCEANS 2018 MTS/IEEE Charleston, Charleston, SC, USA, 22–25 October 2018; pp. 1–8.
20. Rounsaville, J.D.; Dvorak, J.S.; Stombaugh, T.S. Methods for calculating relative cross-track error for ASABE/ISO Standard 12188-2 from discrete measurements. Trans. ASABE 2016, 59, 1609–1616.
21. Martinsen, A.B. End-to-End Training for Path Following and Control of Marine Vehicles. Master's Thesis, Norwegian University of Science and Technology, Trondheim, Norway, 2018.
22. Yamamoto, K.I.; Nishimura, H. Control system design of electric power steering for a full vehicle model with active stabilizer. J. Syst. Des. Dyn. 2011, 5, 789–804.
23. De Luca, A.; Oriolo, G.; Samson, C. Feedback control of a nonholonomic car-like robot. Robot. Motion Plan. Control 2005, 229, 171–253.
24. Tateyama, Y.; Yamada, H.; Noyori, J.; Mori, Y.; Yamamoto, K.; Ogi, T.; Nishimura, H.; Kitamura, N.; Yashiro, H. Observation of drivers' behavior at narrow roads using immersive car driving simulator. In Proceedings of the 9th ACM SIGGRAPH Conference on Virtual-Reality Continuum and Its Applications in Industry, Seoul, Republic of Korea, 12–13 December 2010; pp. 391–396.
25. Fujimura, Y.; Hashimoto, S.; Banjerdpongchai, D. Design of model predictive control with nonlinear disturbance observer for electric power steering system. In Proceedings of the 2019 SICE International Symposium on Control Systems (SICE ISCS), Kumamoto, Japan, 7–9 March 2019; pp. 49–56.
26. Corke, P.I.; Khatib, O. Robotics, Vision and Control: Fundamental Algorithms in MATLAB; Springer: Berlin/Heidelberg, Germany, 2011; Volume 73.
27. Woo, J.; Yu, C.; Kim, N. Deep reinforcement learning-based controller for path following of an unmanned surface vehicle. Ocean Eng. 2019, 183, 155–166.
28. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38.
29. François-Lavet, V.; Henderson, P.; Islam, R.; Bellemare, M.G.; Pineau, J. An introduction to deep reinforcement learning. Found. Trends Mach. Learn. 2018, 11, 219–354.
30. Han, D.; Mulyana, B.; Stankovic, V.; Cheng, S. A Survey on Deep Reinforcement Learning Algorithms for Robotic Manipulation. Sensors 2023, 23, 3762.
31. Bhatnagar, S.; Ghavamzadeh, M.; Lee, M.; Sutton, R.S. Incremental natural actor–critic algorithms. Adv. Neural Inf. Process. Syst. 2007, 20, 105–112.
32. Degris, T.; White, M.; Sutton, R.S. Off-policy actor–critic. arXiv 2012, arXiv:1205.4839.
33. Uhlenbeck, G.E.; Ornstein, L.S. On the theory of the Brownian motion. Phys. Rev. 1930, 36, 823.
34. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.
35. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
36. PyTorch. 2016. Available online: https://pytorch.org/ (accessed on 12 April 2023).
37. Sakhnevych, A.; Arricale, V.M.; Bruschetta, M.; Censi, A.; Mion, E.; Picotti, E.; Frazzoli, E. Investigation on the model-based control performance in vehicle safety critical scenarios with varying tyre limits. Sensors 2021, 21, 5372.
38. Santini, S.; Albarella, N.; Arricale, V.M.; Brancati, R.; Sakhnevych, A. On-board road friction estimation technique for autonomous driving vehicle-following maneuvers. Appl. Sci. 2021, 11, 2197.
39. Obereigner, G.; Shorten, R.; del Re, L. Low tyre particle control. In Proceedings of the 2020 24th International Conference on System Theory, Control and Computing (ICSTCC), Sinaia, Romania, 8–10 October 2020; pp. 757–762.
40. Tonegawa, Y.; Sasaki, S. Development of tire-wear particle emission measurements for passenger vehicles. Emiss. Control Sci. Technol. 2021, 7, 56–62.
Figure 1. Bicycle model of a ground car-like vehicle. This model is a two-wheeled vehicle model assuming that the left and right wheels of the front and rear of the vehicle are concentrated at the center position of the axle.
Figure 2. The relationship between the posture of the vehicle’s C.G. in the fixed ground frame { O } and the local path reference frame { T }.
Figure 3. Structure of separated longitudinal and lateral controls. We focus on the path-following algorithm, which is solely used to determine the steering angle command δ*.
Figure 4. DDPG-based controller for path following. The optimization process of the two network pairs is illustrated, with the parameters used to optimize the critic and actor networks distinguished by the colors blue and red, respectively. The exploration noise is only added in the training phase. The apostrophe implies that the variable pertains to onward steps.
Figure 5. Randomly generated paths, including both straight paths and complex curved paths.
Figure 6. The two networks share the same structure and input, except that in the critic network, the action is concatenated before the second hidden layer.
Figure 7. Processes of 10 training runs performed with different random seeds under the same training conditions. The solid line represents the average of the 10 runs, while the shaded area corresponds to the confidence interval represented by the standard deviation.
Figure 8. The trajectories of the positions for calculating the cross-track error, where the initial positions are the origin. The subplot on the right side shows the cross-track errors observed in the left subplot.
Figure 9. The changes in steering angles over time.
Figure 10. The trajectories of the positions for calculating the cross-track error, where the initial positions of the rear wheels are the origin. The subplot on the right zooms in on the part where the steering occurs.
Figure 11. The changes in steering angles over time.
Figure 12. The trajectories of the positions for calculating the cross-track error, where the initial postures are all [0.0, 0.5, 0.0]. The right subplot shows the changes of steering angles over time.
Table 1. Vehicle parameters for simulation.
Symbol | Description | Value
m | Vehicle mass | 1188 kg
V | Vehicle velocity | 28 km/h
I_z | Vehicle yawing inertia | 2243.1 kg·m²
l_f | Front axle-C.G. distance | 1.1281 m
l_r | Rear axle-C.G. distance | 1.4719 m
K_f | Front cornering power | 76,744 N/rad
K_r | Rear cornering power | 119,320 N/rad
Table 2. Vehicle variables for simulation.
Symbol | Description | Unit
δ | Front tire angle | rad
β | Side slip angle | rad
λ | Yaw rate | rad/s
Table 3. Observation space.
Symbol | Description | Min | Max
e_p | Cross-track error | -2.0 m | 2.0 m
ψ | Orientation error | -π rad | π rad
δ | Front tire angle | -0.5236 rad | 0.5236 rad
Table 4. Action space.
Symbol | Description | Min | Max
δ̇ | Front tire angle rate | -1.5708 rad/s | 1.5708 rad/s
Table 5. Ornstein–Uhlenbeck process parameters.
Symbol | Value
μ_n | 0.0
ζ_n | 0.15708
ν_n | 0.15
Table 6. Reward function parameters.
Symbol | Value
c_e | 2.0
c_δ | 0.1
c_δ̇ | 0.5
c_H | 10.0
c_F | 10.0
e_max | 2.0
Table 7. Parameters of the DDPG agent.
Symbol | Description | Value
Δt | Time period | 0.05 s
α_μ | Learning rate of actor network | 0.0001
α_Q | Learning rate of critic network | 0.001
τ | Target soft update rate | 0.001
γ | Discount factor | 0.99
T | Maximum time steps | 1,000,000
- | Startup time steps | 25,000
- | Evaluation per time steps | 5000
- | Experience replay buffer size | 1,000,000
- | Minibatch size | 64
Table 8. Comparison of the root mean squared cross-track error for the figure-eight path. The unit is meters.
Method | RMSE
Pure Pursuit | 0.2015
Feedback | 0.1913
Stanley | 0.2348
Proposed DDPG | 0.1151 ± 0.0075
Table 9. Comparison of the root mean squared cross-track error for the lane change path. The unit is meters.
Method | RMSE
Pure Pursuit | 0.0855
Feedback | 0.0568
Stanley | 0.1062
Proposed DDPG | 0.0469 ± 0.0131
Table 10. Comparison of convergence performance.
Method | Delay Time (s) | Settling Time (s) | Overshoot (%)
Pure Pursuit | 0.60 | 1.00 | 11.11
Feedback | 0.50 | 0.80 | 10.91
Stanley | 0.50 | 1.80 | -
Proposed DDPG | 0.45 | 0.80 | 5.51
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
