Topical Review

Micromagnetic simulations using Graphics Processing Units


Published 27 July 2012 © 2012 IOP Publishing Ltd
Citation: L Lopez-Diaz et al 2012 J. Phys. D: Appl. Phys. 45 323001. DOI: 10.1088/0022-3727/45/32/323001


Abstract

The methodology for adapting a standard micromagnetic code to run on graphics processing units (GPUs) and exploit the potential of this platform for parallel calculations is discussed. GPMagnet, a general-purpose finite-difference GPU-based micromagnetic tool, is used as an example. Speed-up factors of two orders of magnitude with respect to a serial code can be achieved with GPMagnet. This makes it possible to run, at reasonable computational cost, extensive simulations that are nearly inaccessible with a standard serial micromagnetic solver.


1. Introduction

Micromagnetic simulations have become a ubiquitous research tool in nanomagnetism due to their predictive power and the insight they provide into magnetization dynamics and the interpretation of experimental results. Until the last two decades, micromagnetic simulations were carried out only by highly specialized groups, but nowadays there are several packages available, both open source [1–4] and commercial [5–9], allowing non-experts to run their own simulations. The origin of micromagnetism can be traced back to the 1960s, when Brown established the theoretical basis of the formalism [10] and LaBonte published his work on the internal structure of domain walls in thin films [11]. From the very beginning, it was apparent that, in order to obtain physically meaningful results, many problems demanded simulating a large number of degrees of freedom, so the search for ways to speed up simulations has been incessant over the years. Most of the effort so far has been devoted to developing methods for calculating the self-magnetostatic field more efficiently [12–22], since this is, by far, the most time-consuming part of micromagnetic simulations. Some progress has also been achieved in the implementation of sophisticated time-integration algorithms that allow larger time steps while maintaining numerical stability [23–27].

On the other hand, over the last three decades there has been significant progress in miniaturization techniques, which has made it possible to fabricate structures with ever smaller lateral dimensions, down to the nanometre scale, and to measure their magnetic properties. This enormous progress in miniaturization has run in parallel with dramatic improvements in computer speed and capacity, and also with the already mentioned development of more efficient computational methods. As a result, it has become progressively more feasible over the years to carry out simulations with the dimensions of real devices on a standard personal computer (PC) at reasonable time cost. In the authors' view, this is a key point behind the popularity of micromagnetic simulations.

However, some emerging research areas in nanomagnetism often require micromagnetic simulations with over a million spins for time windows of the order of hundreds of nanoseconds, which can be extremely memory demanding and time consuming. Examples of these areas are nanocontact spin-transfer oscillators (STOs) [28–30], spin-wave logic [31, 32], collective excitations in dot [33–35] and antidot arrays [36–39], magnonics [40, 41], spin caloritronics [42], etc. Therefore, the search for acceleration methods is of great relevance nowadays. Due to the speed saturation of single-core computing systems, these methods should rely on parallelization. Several parallel implementations of micromagnetic solvers on central processing units (CPUs) already exist [21, 43, 44]. However, the performance increase achievable with this approach is limited by the small number of cores available in current shared-memory computers, whereas clusters with a large number of them are expensive, consume much power and are available only at specialized facilities.

On the other hand, there is a growing interest in running numerical calculations on graphics processing unit (GPU) architectures instead of CPUs. Although originally intended for purely graphical purposes, GPUs have turned out to be suitable for high-performance scientific computing because they offer massive parallelization at low cost [45–47]. For example, the NVIDIA Tesla 2070 GPU used for this work has 448 cores and delivers up to 1.03 × 10¹² floating-point operations per second (flops) [48], about two orders of magnitude more than a typical CPU. However, in order to take advantage of this huge numerical power, programs need to be written specifically for GPU hardware using the programming languages provided by the GPU manufacturer, and many hardware-specific technicalities need to be addressed. Additionally, algorithms need to be expressed in a highly parallel manner, which is not always a straightforward task. Over recent years some groups have already developed micromagnetic solvers for GPU hardware, both finite-difference (FD) [49] and finite-element (FE) [50–52], and very significant speed-ups compared with CPU solvers have been reported. In this work we use another micromagnetic solver developed to run on GPU systems, named GPMagnet and commercialized by GoParallel [9], which renders similar performance increases. The review is intended to serve two main purposes. On the one hand, to present the basic aspects that have to be considered when translating a serial micromagnetic code into a GPU-based parallel one, with the aim that anybody who intends to do so will get a general picture of what has to be done, the steps to be taken and the technical difficulties to be faced. On the other hand, to show how an efficient GPU-based micromagnetic solver allows us to tackle some interesting problems of different natures that could hardly be addressed with a standard CPU-based serial code.

This review paper is organized as follows. In section 2 the theoretical background of micromagnetism is presented and some technical aspects related to the spatial discretization and the numerical time integration of the dynamic equation are also briefly discussed. Section 3 is devoted to describing the GPU architecture and presenting some relevant information on how to adapt a serial code to run on a GPU architecture, including explanations of how it is carried out in GPMagnet. In section 4 the solutions to some simple problems obtained with GPMagnet are compared with solutions obtained with serial solvers in order to test the accuracy and the efficiency of GPMagnet. Section 5 is dedicated to showing how some advanced problems that require simulating a large number of spins can be solved with GPMagnet at reasonable time cost. The main conclusions of this work are highlighted in section 6.

2. Computational micromagnetics

2.1. Basic model

Micromagnetics [10, 53, 54] is a continuum theory of ferro- and ferrimagnetic materials that allows for the computation of magnetization distributions of arbitrary shape. The theory is based on the assumptions that (i) the modulus of the magnetization is constant (M = Ms m with m · m = 1, Ms being the saturation magnetization) and (ii) all vector quantities (in particular the magnetization, the exchange field and the self-magnetostatic field) vary slowly on the atomic scale. The time evolution of the magnetization distribution m(r, t) is given by the so-called Landau–Lifshitz–Gilbert (LLG) equation [10]

Equation (1)
$$\frac{\partial{\bm m}}{\partial t}=-\gamma_0\,{\bm m}\times{\bm H}_{\rm eff}+\alpha\,{\bm m}\times\frac{\partial{\bm m}}{\partial t}$$

where Heff(r, t) is the effective field, γ0 = 2.21 × 10⁵ (A/m)⁻¹ s⁻¹ is the gyromagnetic ratio and α is the (dimensionless) Gilbert damping parameter. After some algebraic manipulation, (1) can be written in the form

Equation (2)
$$\frac{\partial{\bm m}}{\partial t}=-\frac{\gamma_0}{1+\alpha^2}\left[{\bm m}\times{\bm H}_{\rm eff}+\alpha\,{\bm m}\times({\bm m}\times{\bm H}_{\rm eff})\right]$$

which is more appropriate for numerical integration, since the time derivative of the magnetization appears only on the left-hand side of the equation. According to Brown's theory [10], the effective field acting on the magnetization is the functional derivative of the energy density ε with respect to the magnetization

Equation (3)
$${\bm H}_{\rm eff}=-\frac{1}{\mu_0 M_{\rm s}}\frac{\delta\varepsilon}{\delta{\bm m}}$$

Contributions arising from exchange, anisotropy, self-magnetostatic and applied field interactions are taken into account in the energy of the system E = ∫V ε(m) d³r, which can be written as [10]

Equation (4)
$$E=\int_V\left[A(\nabla{\bm m})^2+\varepsilon_{\rm an}({\bm m})-\tfrac{1}{2}\mu_0 M_{\rm s}\,{\bm m}\cdot{\bm H}_{\rm d}-\mu_0 M_{\rm s}\,{\bm m}\cdot{\bm H}_{\rm app}\right]{\rm d}^3r$$

where A is the exchange constant, ε_an(m) is the anisotropy energy density functional, Hd is the self-magnetostatic field and Happ is the applied field.

The most common anisotropy types are uniaxial and cubic. In the first case, the energy density is given by

Equation (5)
$$\varepsilon_{\rm an}({\bm m})=K_{\rm u}\left[1-({\bm m}\cdot{\bm u}_k)^2\right]$$

where Ku is the uniaxial anisotropy constant and uk is the unit vector along the anisotropy axis. In the case of cubic anisotropy the corresponding expression is

Equation (6)
$$\varepsilon_{\rm an}({\bm m})=K_1\left(\alpha^2\beta^2+\beta^2\gamma^2+\gamma^2\alpha^2\right)+K_2\,\alpha^2\beta^2\gamma^2$$

where K1 and K2 are the cubic anisotropy constants and α, β and γ are the direction cosines of the magnetization with respect to the cubic easy axes. Surface anisotropy can also be taken into account by means of

Equation (7)
$$\varepsilon_{\rm s}({\bm m})=K_{\rm s}\left[1-({\bm m}\cdot{\bm n})^2\right]$$

where Ks is the surface anisotropy constant and n is the unit vector normal to the surface. Moreover, exchange bias at a ferromagnet–antiferromagnet (FM–AFM) interface [55, 56] and interlayer indirect exchange (RKKY) coupling between two magnetic layers separated by a metal spacer [57, 58] can also be included in the formalism by means of the following surface energy term

Equation (8)
$$\varepsilon_{\rm s}({\bm m})=-J_1\,{\bm m}\cdot{\bm m}'-J_2\,({\bm m}\cdot{\bm m}')^2$$

where J1 and J2 are the bilinear and biquadratic surface-exchange coefficients, respectively, and m' defines the magnetization orientation of the reference coupled layer.

On the other hand, the self-magnetostatic field, also referred to as demagnetizing field, Hd, is given by

Equation (9)
$${\bm H}_{\rm d}({\bm r})=\frac{1}{4\pi}\left[\int_V\lambda_V({\bm r}')\,\frac{{\bm r}-{\bm r}'}{|{\bm r}-{\bm r}'|^3}\,{\rm d}^3r'+\oint_S\sigma_{\rm s}({\bm r}')\,\frac{{\bm r}-{\bm r}'}{|{\bm r}-{\bm r}'|^3}\,{\rm d}S'\right]$$

where λV = −Ms ∇ · m and σs = Ms m · n are the magnetic volume and surface charges, respectively. Alternatively, Hd can be obtained as

Equation (10)
$${\bm H}_{\rm d}({\bm r})=-\nabla\varphi({\bm r})$$

where φ(r) is the magnetic scalar potential, given by

Equation (11)
$$\varphi({\bm r})=\frac{1}{4\pi}\left[\int_V\frac{\lambda_V({\bm r}')}{|{\bm r}-{\bm r}'|}\,{\rm d}^3r'+\oint_S\frac{\sigma_{\rm s}({\bm r}')}{|{\bm r}-{\bm r}'|}\,{\rm d}S'\right]$$

Regarding the applied field Happ, it is worth pointing out that it includes any field directly applied from external sources but also the magnetic Oersted field resulting from electric currents inside the device, which is given by Biot–Savart's law

Equation (12)
$${\bm H}_{\rm Oe}({\bm r})=\frac{1}{4\pi}\int{\bm J}({\bm r}')\times\frac{{\bm r}-{\bm r}'}{|{\bm r}-{\bm r}'|^3}\,{\rm d}^3r'$$

where J(r') is the current density.

By taking the functional derivative in (3) using (4) one obtains the following expression for the effective field:

Equation (13)
$${\bm H}_{\rm eff}=\frac{2A}{\mu_0 M_{\rm s}}\nabla^2{\bm m}-\frac{1}{\mu_0 M_{\rm s}}\frac{\partial\varepsilon_{\rm an}}{\partial{\bm m}}+{\bm H}_{\rm d}+{\bm H}_{\rm app}$$

Since the exchange energy involves the square of the magnetization gradient, its variation gives rise not only to the exchange contribution to the effective field in (13) but also to boundary conditions (BCs) at the surface. The BCs arising from the discontinuity of exchange interactions at the surface are referred to as 'free' BC and they can be written as [10, 59, 60]

Equation (14)

We note that, in the absence of surface anisotropy (7) and interlayer exchange (8), the BCs are reduced to

Equation (15)
$$\frac{\partial{\bm m}}{\partial n}=0$$

Altogether, what has been described so far can be considered as the basic zero-temperature micromagnetic model. Starting from some initial magnetization distribution m0(r), the formalism is based on integrating (2) in time, where Heff(r) is given by (13), preserving the BCs (14). If one is not interested in the dynamic evolution of the system under study but just on the magnetization distribution at equilibrium then the equilibrium equation

Equation (16)
$${\bm m}\times{\bm H}_{\rm eff}=0$$

is to be solved.
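For illustration, the sketch below shows how the explicit form (2) maps onto code for a single computational cell, once the effective field at that cell is known. It is a minimal, hedged example: the type and function names (Vec3, cross, llg_rhs) are illustrative and do not correspond to any particular solver, and in an actual GPU code this evaluation would be performed by a kernel acting on all cells in parallel.

struct Vec3 { double x, y, z; };

static Vec3 cross(const Vec3 &a, const Vec3 &b) {
    return { a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x };
}

// Right-hand side of the explicit LLG equation (2) for one cell:
// dm/dt = -gamma0/(1+alpha^2) [ m x Heff + alpha m x (m x Heff) ]
Vec3 llg_rhs(const Vec3 &m, const Vec3 &Heff, double gamma0, double alpha) {
    Vec3 mxH   = cross(m, Heff);   // precession term
    Vec3 mxmxH = cross(m, mxH);    // damping term
    double c = -gamma0 / (1.0 + alpha*alpha);
    return { c*(mxH.x + alpha*mxmxH.x),
             c*(mxH.y + alpha*mxmxH.y),
             c*(mxH.z + alpha*mxmxH.z) };
}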

2.2. Spin-transfer torque

The basic model described in section 2.1 can be extended to include the spin-transfer torque (STT) effect [61, 62] by adding a term ΓST to the dynamic equation (1). When a spin-polarized current flows perpendicular to the plane (CPP), the torque is given by [66]

Equation (17)
$${\bm\Gamma}_{\rm ST}=\frac{g\,\mu_{\rm B}\,J}{2\,e\,M_{\rm s}\,d}\,\mathcal{P}({\bm m}\cdot{\bm p})\;{\bm m}\times({\bm m}\times{\bm p})$$

where μB is the Bohr magneton, e is the electron charge, the g-factor is ≈ −2.0 for transition metals, J is the modulus of the current density, d is the layer thickness, $\mathcal{P}$ is the polarization function and p is the unit vector along the current polarization direction. The polarization function $\mathcal{P}({\bit m}\cdot{\bit p})$ contains all the information stemming from the stack geometry and material properties, and it depends on the relative orientation between the magnetizations of the free and fixed layers [61, 63–66]. When the LLG equation (1) augmented with the spin torque (17) is written in explicit form it reads

Equation (18)

On the other hand, if the current flows in-plane (CIP), both adiabatic and non-adiabatic terms are included in the torque [62, 67–69]

Equation (19)
$${\bm\Gamma}_{\rm ST}=-({\bm u}\cdot\nabla){\bm m}+\xi\,{\bm m}\times({\bm u}\cdot\nabla){\bm m},\qquad {\bm u}=\frac{g\,\mu_{\rm B}\,\mathcal{P}}{2\,e\,M_{\rm s}}\,{\bm J}$$

where now $\mathcal{P}$ is the spin polarization factor of the current and ξ is the dimensionless constant that measures the degree of non-adiabaticity between the spin of the conduction electrons and the local magnetization [67–69]. When these two terms are added to (1) and the full dynamic equation is written in explicit form, one obtains

Equation (20)
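As a hedged illustration of how the CPP torque (17) enters the time stepping, the fragment below adds a Slonczewski-type term to the cell torque. The scalar prefactor sigma is assumed to collect the constants listed after (17), the value Pmp stands for the polarization function P(m·p) evaluated at the current cell, and sign and prefactor conventions vary between implementations; the Vec3 and cross helpers are those of the earlier LLG sketch.

// Slonczewski-type CPP spin-transfer torque for one cell, to be added to dm/dt.
// sigma collects the constants appearing in (17); Pmp is the precomputed value
// of the polarization function P(m.p).  Conventions (signs, prefactors) are
// implementation dependent and should be checked against the solver in use.
Vec3 stt_cpp(const Vec3 &m, const Vec3 &p, double sigma, double Pmp) {
    Vec3 mxp   = cross(m, p);
    Vec3 mxmxp = cross(m, mxp);
    return { sigma*Pmp*mxmxp.x, sigma*Pmp*mxmxp.y, sigma*Pmp*mxmxp.z };
}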

2.3. Discretization and numerical integration

As described in sections 2.1 and 2.2, micromagnetics requires, in any of the cases considered, solving a non-linear integro-differential equation (equations (2), (16), (18) and (20)) that involves long-range (self-magnetostatic), short-range (exchange) and local (anisotropy, external field, spin torque) interactions. Therefore, analytical solutions do not exist except for very idealized cases [53, 54] and all problems of practical interest have to be solved numerically. In order to do that, the continuous magnetization is sampled at a finite number of points on a mesh, so that the field equations ((2), (16), (18) and (20)) are converted into a finite set of simultaneous coupled equations, one for each mesh point. Two different discretization approaches are used in micromagnetics. On the one hand, the FD approach [70], which uses a uniform rectangular mesh [1, 4–6, 9]. On the other hand, the FE approach [71], which is based on interpolating the magnetization using linear basis functions on a non-uniform, typically tetrahedral, mesh [2, 3, 7, 8, 72]. GPMagnet [9], the software used in this work, belongs to the first group. Consequently, only FD discretization and the technical aspects related to it will be discussed here. A comprehensive review of FE-based computational micromagnetics can be found elsewhere [73].

In general, the FD approach is easier to implement than the FE one. In particular, meshing is straightforward and the only thing one has to take care of is choosing a cell size smaller than the length scale over which the magnetization changes significantly. There are two characteristic length scales in micromagnetics, namely the so-called exchange length

Equation (21)
$$l_{\rm ex}=\sqrt{\frac{2A}{\mu_0 M_{\rm s}^2}}$$

on one hand and

Equation (22)
$$l_{\rm w}=\sqrt{\frac{A}{K}}$$

on the other, where K is the anisotropy constant of the material under consideration. Roughly speaking, lex is the relevant scale when magnetostatic dipolar interaction is dominant over anisotropy, as is the case in soft magnetic materials, whereas in materials with large anisotropy, lw is the relevant one. For the most frequently used ferromagnetic materials, the values of these parameters range between 4 and 8 nm, which fixes an upper bound for the cell size in the simulations. The main disadvantage of the FD approach is that sampling curved boundaries with a rectangular mesh results in a staircase type approximation to the geometry, which sometimes introduces spurious effects. Some ad hoc techniques have been developed in order to minimize these effects [24, 74, 75].
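In practice, the cell-size criterion can be checked with a few lines of code before setting up the mesh. The snippet below evaluates (21) and (22) for an illustrative set of material parameters (the anisotropy constant K in particular is an arbitrary example value, not a recommendation).

#include <cmath>
#include <cstdio>

int main() {
    const double pi  = 3.14159265358979323846;
    const double mu0 = 4.0e-7 * pi;   // vacuum permeability (T m / A)
    const double A   = 1.3e-11;       // exchange constant (J/m), Permalloy-like
    const double Ms  = 8.0e5;         // saturation magnetization (A/m)
    const double K   = 5.0e2;         // anisotropy constant (J/m^3), example value

    double lex = std::sqrt(2.0 * A / (mu0 * Ms * Ms));  // exchange length (21)
    double lw  = std::sqrt(A / K);                      // wall parameter (22)
    std::printf("l_ex = %.2f nm, l_w = %.2f nm\n", lex * 1e9, lw * 1e9);
    // The FD cell size should stay below the relevant length for the material.
    return 0;
}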

Once the meshing is decided, discretized expressions for the effective field (13) and the STT (17) and (19) have to be obtained. Local terms, such as the anisotropy and the applied field components of Heff (13) as well as the ST-CPP (17), are trivially discretized. For the spatial derivatives of the magnetization appearing in the exchange contribution to Heff (13), the BC (14) and the ST-CIP (19), central expressions based on a second-order Taylor expansion have been used in this work. For an arbitrary scalar function f, the expressions for the first and second derivative with respect to a spatial coordinate x are

Equation (23a)
$$\left.\frac{\partial f}{\partial x}\right|_{i}\simeq\frac{f_{i+1}-f_{i-1}}{2h}$$

Equation (23b)
$$\left.\frac{\partial^2 f}{\partial x^2}\right|_{i}\simeq\frac{f_{i+1}-2f_{i}+f_{i-1}}{h^2}$$

where h is the spacing and i is the cell counter along the x direction. Higher order expressions can also be easily implemented [59, 60, 76, 77]. Finally, the self-magnetostatic field is calculated assuming that m is uniform inside each cell. In this case, as shown in [78], the averaged self-magnetostatic field at cell i can be written as

Equation (24)
$${\bm H}_{\rm d}({\bm r}_i)=-M_{\rm s}\sum_j{\bm N}({\bm r}_i-{\bm r}_j)\cdot{\bm m}({\bm r}_j)$$

where the sum is over all cells, ri (rj) defines the position of cell i(j) and N(ri − rj) is a 3 × 3 symmetric tensor, usually called demagnetizing tensor, given by

Equation (25)

Si (Sj) being the surface of cell i (j) and Si = Si n (Sj = Sj n') the corresponding surface vector. Evaluating Hd at one cell requires a summation of contributions from all cells (24) and, therefore, evaluation of Hd for the whole system requires O(N²) operations, N being the number of cells. The other terms require only O(N) operations, since they correspond to local or, at most, short-range interactions. As a result, simulations with a large number of cells become prohibitive if Hd is calculated by direct evaluation of (24). However, since N only depends on the relative position between cells, ri − rj, (24) can be recognized as a discrete convolution of N with m, and therefore it can be computed more efficiently using fast Fourier transform (FFT) techniques [79, 80], since the convolution is converted into a simple product in Fourier space. The procedure consists of taking the FFT of N and m, multiplying them pointwise in Fourier space and taking the inverse FFT to obtain Hd. Since the FFT can be computed in O(N log N) operations, the evaluation of Hd can be carried out in O(N log N) operations, as opposed to the O(N²) resulting from direct evaluation of (24). However, a convolution carried out by such an FFT method assumes that the magnetization data in the computational region repeat periodically in an infinite space. To avoid the artefacts derived from this, the magnetization data in the physical region need to be zero-padded [70] to double their length along each Cartesian direction, leading to an eightfold increase in the number of cells.
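The following sketch shows the structure of such an FFT-based evaluation on the GPU using the cuFFT library (the library used by GPMagnet, see section 3.3). Only the set-up and the forward transform of one magnetization component are shown; the grid sizes are examples, and the scatter into the padded array, the pointwise tensor–vector product and the inverse transform are indicated by comments. It is a generic sketch, not the actual GPMagnet code.

#include <cuda_runtime.h>
#include <cufft.h>

int main() {
    const int NX = 128, NY = 128, NZ = 1;        // physical grid (example)
    const int PX = 2*NX, PY = 2*NY, PZ = 2*NZ;   // zero-padded grid

    cufftReal    *d_mx;    // padded, zero-filled real input
    cufftComplex *d_mx_k;  // Fourier-space output, PX*PY*(PZ/2+1) elements
    cudaMalloc(&d_mx,   sizeof(cufftReal)    * PX * PY * PZ);
    cudaMalloc(&d_mx_k, sizeof(cufftComplex) * PX * PY * (PZ/2 + 1));
    cudaMemset(d_mx, 0, sizeof(cufftReal) * PX * PY * PZ);
    // ... scatter m_x of the physical region into the padded array ...

    cufftHandle fwd, inv;
    cufftPlan3d(&fwd, PX, PY, PZ, CUFFT_R2C);
    cufftPlan3d(&inv, PX, PY, PZ, CUFFT_C2R);

    cufftExecR2C(fwd, d_mx, d_mx_k);
    // ... multiply pointwise by the transformed demagnetizing tensor,
    //     then cufftExecC2R(inv, ...) to recover H_d in real space ...

    cufftDestroy(fwd); cufftDestroy(inv);
    cudaFree(d_mx); cudaFree(d_mx_k);
    return 0;
}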

Once discretized expressions for the different torques have been obtained, the partial differential equation governing magnetization dynamics (equations (2), (18) and (20)) is converted into a set of coupled ordinary differential equations and the magnetization is advanced in time by integrating them simultaneously using a time-integration scheme [70]. A second-order predictor–corrector algorithm [70] with fixed time step has been used in this work, but several other algorithms have also been implemented by some of the co-authors of this work and by others [23–26, 49]. On the other hand, if the equilibrium equation (16) is to be solved, after discretization one ends up with a set of non-linear coupled algebraic equations that are solved simultaneously using iterative relaxation methods, such as under-relaxed Jacobi, steepest descent or conjugate gradient methods [70, 81].
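As a schematic example of a fixed-step second-order predictor–corrector update, the fragment below implements a Heun-type step for a single cell; Heun's method is one common member of this family and is not necessarily the exact scheme of [70] or of GPMagnet. The effective field is represented by a callable Heff so that the two field evaluations per step are explicit; Vec3 and llg_rhs are those of the sketch in section 2.1, and in a GPU code each stage would be a kernel over all cells.

#include <cmath>

Vec3 normalize(Vec3 v) {
    double n = std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    return { v.x/n, v.y/n, v.z/n };
}

// One fixed-step predictor-corrector (Heun) update; Heff(m) stands for the
// effective-field evaluation, which in practice couples all cells.
template <typename FieldFn>
Vec3 heun_step(const Vec3 &m, FieldFn Heff, double dt, double gamma0, double alpha) {
    Vec3 k1 = llg_rhs(m, Heff(m), gamma0, alpha);               // predictor slope
    Vec3 mp = normalize({ m.x + dt*k1.x, m.y + dt*k1.y, m.z + dt*k1.z });
    Vec3 k2 = llg_rhs(mp, Heff(mp), gamma0, alpha);             // corrector slope
    return normalize({ m.x + 0.5*dt*(k1.x + k2.x),
                       m.y + 0.5*dt*(k1.y + k2.y),
                       m.z + 0.5*dt*(k1.z + k2.z) });           // keep |m| = 1
}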

2.4. Thermal fluctuations

Thermal fluctuations are usually included in a system by adding random noise to the deterministic dynamic equation, which is then converted into a stochastic one, usually referred to as the Langevin equation [82]. In 1963 Brown applied this procedure to single-domain particles and showed what the statistical properties of the random field had to be in order to reproduce correctly the equilibrium thermodynamics [83]. It is not straightforward to apply this methodology to the continuous theory of micromagnetism and, to our knowledge, it has not been carried out. However, once the theory is discretized (section 2.3), the problem is formally equivalent to an ensemble of (interacting) single-domain particles, each cell corresponding to a particle. Therefore, the same procedure followed by Brown can be applied to include thermal fluctuations in micromagnetic simulations. Consequently, a random thermal field Hth is added to the effective field (13) acting on the magnetization of each cell. The Cartesian components of Hth are independent Gaussian-distributed random terms with the following statistical properties:

Equation (26a)
$$\left\langle H^{\alpha}_{{\rm th},i}(t)\right\rangle=0$$

Equation (26b)
$$\left\langle H^{\alpha}_{{\rm th},i}(t)\,H^{\beta}_{{\rm th},j}(t')\right\rangle=D\,\delta_{ij}\,\delta_{\alpha\beta}\,\delta(t-t')$$

where 〈...〉 denotes the statistical average, i, j are the cell counters and α, β = x, y, z refer to the Cartesian components of the field. The first Kronecker delta, δij, means that the fluctuating terms of different cells are independent of each other, the second one, δαβ, means that the three Cartesian components are independent of each other and, finally, δ(t − t') indicates that the noise is uncorrelated in time (white noise). The coefficient D is determined by imposing that the stationary solution of the Fokker–Planck equation [82, 85], which is constructed from the Langevin equation and governs the dynamics of the probability distribution of the magnetization, satisfies Maxwell–Boltzmann statistics when thermodynamic equilibrium is reached, leading to [86, 87]

Equation (27)
$$D=\frac{2\alpha k_{\rm B}T}{\gamma_0\,\mu_0 M_{\rm s}V}$$

where kB is the Boltzmann constant, T is the temperature and V is the volume of the cell. Therefore, the fluctuating thermal field to be added to the effective field at each cell is given by

Equation (28)
$${\bm H}_{{\rm th},i}(t)=\sqrt{\frac{2\alpha k_{\rm B}T}{\gamma_0\,\mu_0 M_{\rm s}V\,\Delta t}}\;{\bm\eta}_i(t)$$

where ηi(t) is a stochastic vector whose components are zero-mean standard normal-distributed random numbers and Δt is the time step of the simulation.

Because the noise enters the dynamic equation in a multiplicative way, the Langevin equation must be supplemented by an interpretation rule of the stochastic calculus in order to be properly defined [85]. In this work the Langevin equation is integrated numerically using the stochastic Heun scheme [86], which converges in quadratic mean to the solution of the Langevin equation when interpreted in the Stratonovich sense [86].
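A minimal sketch of how the thermal field can be generated on the device is given below, using the cuRAND host API (the library GPMagnet uses for this purpose, see section 3.3). Three independent standard-normal deviates per cell are drawn at every time step and later scaled by the amplitude appearing in (28) inside the torque kernel; the parameter values are illustrative and the prefactor follows the reconstruction of (27)–(28) above, so it should be adapted to the conventions of the code at hand.

#include <cuda_runtime.h>
#include <curand.h>
#include <cmath>
#include <cstdio>

int main() {
    const size_t Ncells = 1 << 20;     // example cell count
    const size_t n = 3 * Ncells;       // x, y, z per cell (even count, as cuRAND requires)

    float *d_eta = 0;
    cudaMalloc(&d_eta, n * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    // Fresh, independent standard-normal deviates at every time step.
    curandGenerateNormal(gen, d_eta, n, 0.0f, 1.0f);

    // Example thermal-field amplitude for a (2 nm)^3 Permalloy-like cell at 300 K.
    const double pi = 3.14159265358979323846, mu0 = 4.0e-7 * pi;
    const double kB = 1.380649e-23, T = 300.0;
    const double alpha = 0.02, Ms = 8.0e5, gamma0 = 2.21e5;
    const double V = 2e-9 * 2e-9 * 2e-9, dt = 50e-15;
    double Hth = std::sqrt(2.0 * alpha * kB * T / (gamma0 * mu0 * Ms * V * dt));
    std::printf("thermal field amplitude: %.3e A/m\n", Hth);
    // Hth multiplies d_eta inside the field/torque kernel (not shown).

    curandDestroyGenerator(gen);
    cudaFree(d_eta);
    return 0;
}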

3. GPU implementation

In this section some general aspects of NVIDIA CUDA GPU hardware and of the programming model are presented in sections 3.1 and 3.2, respectively. The information provided here is taken from the CUDA C Programming Guide [88] and the CUDA C Best Practices Guide [89]. Later on, in section 3.3, we briefly describe the main features of GPMagnet, the software used to obtain the results presented in sections 4 and 5.

3.1. GPU hardware

As mentioned in section 1, GPUs were originally designed to perform the highly parallel computations required for graphics rendering, but over recent years they have proven to be powerful computing platforms in many fields outside graphics applications, such as general signal processing, physics simulations, finance or computational biology [45–47]. The main reason behind such an evolution is that the GPU is specialized for compute-intensive, highly parallel computation and therefore, compared with a CPU, more transistors are devoted to data processing rather than data caching and flow control, as schematically represented in figure 1 [88]. In particular, GPUs typically have hundreds of floating-point execution units and a large context-switch information storage space, leaving only a small area for cache.

Figure 1. Schematic representation of a CPU and a GPU. The GPU devotes more transistors to data processing (ALU stands for Arithmetic Logic Unit) than the CPU. Colour online.

State-of-the-art CPUs, such as the Intel Core 2 Quad, can run only 16 threads concurrently (32 if the CPUs support Hyper-Threading), whereas the NVIDIA Tesla 2070 GPU used for this work has 448 cores. Moreover, execution of concurrent threads on a CPU is generally time consuming and complicated to handle. The operating system must swap threads on and off the host execution channels to provide multithreading capability. Context switches (when two threads are swapped) are therefore slow and expensive. In contrast, threads on GPUs are extremely lightweight. In a typical system, thousands of threads are queued up for work (in warps of 32 threads each). If the device must wait on one warp of threads, it simply begins executing work on another. Because separate registers are allocated to all active threads, no swapping of registers or state needs to occur between device threads. Resources stay allocated to each thread until its execution is completed [89].

3.2. CUDA programming model

CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing architecture, which manages computation on the GPU in a way that provides a simple interface for the programmer [88]. The CUDA architecture can be programmed in industry-standard C/C++ with a few extensions. When programmed through CUDA, the GPU is viewed as a compute device capable of executing a very large number of threads in parallel. It operates as a coprocessor to the main CPU, or host. Therefore, data-parallel, compute-intensive portions of applications running on the host are off-loaded onto the device. More precisely, a portion of an application that is executed many times, but independently on different data, can be isolated into a function that is executed on the device (GPU) as many different threads. To that effect, such a function is compiled to the instruction set of the device and the resulting program, called a kernel, is downloaded to the device [88].

When a kernel is called, the function is executed N times in parallel by N CUDA threads, as opposed to a regular C function, which is executed only once. A kernel is defined using the __global__ declaration specifier. Code to be executed on both the host and the device can be contained in a single source file with a '.cu' extension. The code is compiled using nvcc, the CUDA C compiler. When the resulting binary file is run, the C runtime for CUDA coordinates the execution of the host and device components.

As shown in figure 2, the threads executed in a kernel call are organized into grids of thread blocks, the dimensions of which are specified using a new ${\tt <<<...>>>}$ execution configuration syntax. Each thread is given a unique ID that is accessible within the kernel through the threadIdx variable, which is a three-component vector. Therefore, threads can be identified using a one-dimensional, two-dimensional or three-dimensional thread index, forming one-dimensional, two-dimensional or three-dimensional thread blocks, respectively. There is a limit to the number of threads per block, since all threads of a block are executed on the same processor core and must share its limited memory resources. This limit depends on the GPU architecture [88].
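The following self-contained example illustrates the kernel definition and launch syntax described above: each of the N threads, identified through blockIdx, blockDim and threadIdx, scales one element of a device array. Names and sizes are arbitrary; only the CUDA constructs themselves (__global__, the <<<...>>> configuration and the built-in index variables) are taken from the programming model.

#include <cuda_runtime.h>

// Each thread scales one element of the array; the guard handles the case in
// which the last block is only partially filled.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, 0.5f, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}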

Figure 2. Schematic representation of CUDA programming model. The host issues a succession of kernel invocations to the device. Each kernel is executed as a batch of threads organized as grids of thread blocks. In this case, both the blocks and the grids are assumed to be two-dimensional. Colour online.

Blocks are organized into a grid of thread blocks. This grid may also be one-dimensional, two-dimensional or three-dimensional. The number of thread blocks in a grid is usually indicated by the size of the data being processed or the number of processors in the system. Each block within the grid can be identified by a one-dimensional, two-dimensional or three-dimensional index accessible within the kernel through the blockIdx variable. The dimensions of the thread block are accessible within the kernel through the blockDim variable [88].

The CUDA memory hierarchy is depicted in figure 3. Unlike in the host, where random access memory (RAM) is generally equally accessible to all code (within the limitations enforced by the operating system), on the device RAM is divided virtually and physically into different types, each of which has a special purpose. These memory spaces include global, local, shared, texture and registers [88]. Of these different memory spaces, global and texture memory are the most abundant. Global, local and texture memory have the greatest access latency, followed by constant memory, registers and shared memory. Because it is on-chip, shared memory is much faster than local and global memory, so it is often used as a scratch pad to store intermediate results for computations, for buffering reads and writes to achieve optimal memory-access patterns and for providing inter-thread communication within a block. Local memory is so named because its scope is local to the thread, not because of its physical location. In fact, local memory is off-chip and access to it is as expensive as to global memory [89].

Figure 3. CUDA memory model. A thread has access to the device's DRAM and on-chip memory through a set of memory spaces of various scopes. Colour online.

Memory optimization is the most important area for increasing performance. The maximum off-chip memory bandwidth inside the device (175 Gbytes/s for a Tesla C2070 GPU [48]) is much higher than the maximum bandwidth between host memory and device memory (typically around 2 Gbytes/s). Consequently, for best overall application performance, it is crucial to minimize data transfer between the host and the device, even if that means running kernels on the device that do not demonstrate any speed-up compared with running them on the host [89]. The amount of data transfer depends on the nature of the code to parallelize or, more precisely, on the granularity of the problem. In parallel computing, granularity is a qualitative measure of the ratio of computation to communication. For a coarse (fine) granularity problem, a relatively large (small) amount of computational work is performed between communication events.
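For a coarse-granularity problem such as an FD micromagnetic solver, the pattern that follows from these considerations is sketched below: the state is copied to the device once, all time stepping runs as kernels on device memory, and device-to-host copies happen only when output is actually needed. The kernel here is an empty placeholder and the sizes are arbitrary; the sketch shows the transfer pattern only, not any particular solver.

#include <cuda_runtime.h>
#include <vector>

__global__ void time_step(float *d_m, int n) {
    // placeholder for the field-evaluation and LLG-update kernels
    (void)d_m; (void)n;
}

int main() {
    const int n = 3 * (1 << 20);                   // 3 components per cell
    std::vector<float> h_m(n, 0.0f);               // initial state on the host
    float *d_m;
    cudaMalloc(&d_m, n * sizeof(float));
    cudaMemcpy(d_m, h_m.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int nsteps = 100000, save_every = 1000;
    for (int s = 0; s < nsteps; ++s) {
        time_step<<<(n + 255) / 256, 256>>>(d_m, n);
        if (s % save_every == 0)                   // infrequent device-to-host copies
            cudaMemcpy(h_m.data(), d_m, n * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_m);
    return 0;
}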

3.3. GPMagnet

As already pointed out in section 1, GPMagnet [9] is a commercial software package for running micromagnetic simulations on GPUs. It is based on FD discretization and time integration of the Landau–Lifshitz–Gilbert equation. The theoretical basis and the technical issues related to discretization and numerical integration were presented in section 2. GPMagnet is a general-purpose tool that allows one to simulate a wide variety of systems and phenomena. Samples of arbitrary shape and multilayered systems with different material parameters can be simulated. The effective field acting on the magnetization includes exchange, anisotropy, self-magnetostatic and external field contributions, as discussed in section 2.1. External fields can vary in space and time. Both CIP and CPP spin torques can be taken into account, as discussed in section 2.2, and the current density may also be spatially and time dependent. Finally, thermal fluctuations are included following the formalism described in section 2.4.

On the other hand, a friendly and powerful Graphical User Interface (GUI) has been developed on top of the low-level micromagnetic solver using Qt [90]. This GUI includes a configuration wizard to guide users in defining the different aspects of the problem to be solved, such as geometry, material parameters and interactions to be considered. Nevertheless, these specifications can also be provided by means of input ASCII files passed to the solver via the command line, which allows for running series of batched simulations using scripts. GPMagnet also provides powerful real-time and post-processing tools to monitor and analyse the time evolution of the magnetization and other relevant variables. They have been developed using Qwt [91] and QwtPlot3D [92] and they include 2D curve plots, 2D and 3D surface plots and vector plots. The snapshots of the magnetization shown in section 5 were obtained with these tools.

The core of the micromagnetic solver, on the other hand, designed to run on NVIDIA GPUs, is written in C/C++ using the CUDA API [88]. An FD dynamic micromagnetic solver like ours presents coarse granularity. Therefore, we chose an implementation in which the host provides the input data to the device memory at the beginning of the simulation and all the computation is performed on the device by means of adequate kernels. During the simulation, data are transferred to the host only when they are saved to disk or represented with the real-time graphic tools mentioned above. All data are stored in the global memory of the device which, as discussed in section 3.2, is the most abundant and provides good bandwidth.

GPMagnet relies on single precision at the present stage. However, the coefficients of the demagnetizing tensor (25) are calculated using double-precision arithmetic at the beginning of the simulation and later on truncated to single precision. The reason for this is that the numerical behaviour of the analytical formulae used to calculate these coefficients is poor, with significant and increasing loss of accuracy as the distance between the prisms grows [93], particularly when single-precision arithmetic is used.
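The pattern is straightforward and is sketched below: each tensor coefficient is evaluated with double-precision arithmetic and only the final value is stored in the single-precision array used on the GPU. The function demag_coeff is a deliberately crude stand-in for the actual analytical expressions (for example, the formulae discussed in [93]), which are not reproduced here.

#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for the analytical double-precision expression of one
// demagnetizing-tensor component; NOT the real formula.
static double demag_coeff(double x, double y, double z) {
    double r = std::sqrt(x * x + y * y + z * z) + 1e-30;
    return 1.0 / (r * r * r);
}

// Evaluate one tensor component over the grid in double precision and
// truncate it to the single-precision array used by the GPU kernels.
std::vector<float> build_demag_component(int nx, int ny, int nz,
                                         double hx, double hy, double hz) {
    std::vector<float> N(static_cast<std::size_t>(nx) * ny * nz);
    std::size_t k = 0;
    for (int iz = 0; iz < nz; ++iz)
        for (int iy = 0; iy < ny; ++iy)
            for (int ix = 0; ix < nx; ++ix) {
                double v = demag_coeff(ix * hx, iy * hy, iz * hz); // double precision
                N[k++] = static_cast<float>(v);                    // truncated once, at set-up
            }
    return N;
}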

The first and obvious task when parallelizing the micromagnetic code is to convert all the loops over the computational cells into CUDA kernels, which was carried out in a straightforward way. On the other hand, GPMagnet uses some specialized CUDA libraries for different purposes. In particular, the FFTs involved in the calculation of the self-magnetostatic field are performed using the CUFFT library [94]. At the moment, full 3D real-to-complex direct and inverse transforms are carried out, although a more efficient approach based on one-dimensional transforms along the different directions [49] will be implemented shortly. CUDPP [95], the CUDA Data Parallel Primitives Library, is also used when averaging or sorting maximum and minimum values of data sets. Finally, the CURAND library [96] is used to generate the pseudo-random number sequences needed to model thermal fluctuations (see section 2.4).

4. Code validation and performance tests

In this section some simple problems of different natures with well-known solutions are addressed with the aim of testing the accuracy and efficiency of GPMagnet by comparison with CPU-based solvers, such as OOMMF or the serial CPU code from which GPMagnet was developed. CPU simulations were performed on an Intel Core 2 Quad Q9300 processor with a clock speed of 2.5 GHz and 6 MB of L2 cache, whereas for GPMagnet simulations an NVIDIA Tesla 2070 GPU card with 1.15 GHz clock speed and 6 GB of RAM was used.

4.1. Standard problem #2

In the µMAG [97] standard problem #2 a rectangular particle of dimensions 5d × d × 0.1d is considered. Since only magnetostatic and exchange interactions are included, the quasi-static hysteresis properties, when expressed in reduced units as m = M/Ms versus h = H/Ms, depend only on the ratio between the size of the particle d and the exchange length lex (21). The desired outputs for comparison are the coercivity and remanence values as a function of d/lex when the field is applied along the [1, 1, 1] direction.

Hysteresis loops were simulated for 20 values of d/lex using both GPMagnet and OOMMF. For both of them a conjugate gradient solver was used and the same discretization, Δx = Δy = 0.7 lex, Δz = 0.1d, was chosen. The total time for the 20 simulations was 480' 7'' for OOMMF and 57' 5'' for GPMagnet. The simulated hysteresis loops are in good agreement with the ones obtained by other groups and the different reversal mechanisms found are described elsewhere [81, 97, 98]. As an example, the descending branch of the loop for d/lex = 16.87 is presented in figure 4. Almost perfect agreement between the two solutions is obtained, except for small differences around the two irreversible jumps, which are commented on below.

Figure 4. Standard problem #2: descending branch of the hysteresis loop for d/lex = 16.87 computed using OOMMF (black solid squares) and GPMagnet (red hollow circles). Colour online.

The computed remanent magnetization values along the x and y axes are shown in figure 5 and the coercive fields are presented in figure 6. As can be observed, very good agreement is found for the remanence, whereas the agreement is not as good for the coercivity, where the discrepancies become more apparent as the size of the particle increases. These discrepancies are to be expected, since the coercive field is in the proximity of an irreversible jump, which is a critical instability point. As already pointed out [53], relaxation methods, such as conjugate gradient, are designed to find equilibrium states, and when an instability point is approached, the moment at which the algorithm decides to jump from one branch of the solution to another is extremely sensitive to many factors and, therefore, unpredictable. For an accurate determination of a critical point, a systematic study would be required involving the extrapolation to the continuum of a suitably defined local susceptibility [99–102], which is beyond the scope of this review.

Figure 5. Standard problem #2: remanent magnetization along the x (long) and y (short) axes as a function of the scaling parameter d computed using OOMMF (black solid squares) and GPMagnet (red hollow circles). Colour online.

Figure 6. Standard problem #2: coercive field as a function of the scaling parameter d computed using OOMMF (black solid squares) and GPMagnet (red hollow circles). Colour online.

4.2. Standard problem #4

Standard problem #4 focuses on dynamic aspects of micromagnetic simulations. A rectangular elongated ferromagnetic sample of length L = 500 nm, width w = 125 nm, thickness t = 3 nm and typical Permalloy material parameters (A = 1.3 × 10⁻¹¹ J m⁻¹, Ms = 8.0 × 10⁵ A m⁻¹, K = 0 J m⁻³) is considered. Starting from the s-state obtained after saturating along the [1, 1, 1] direction and then slowly removing the field, the time evolution of the system is to be simulated by integrating (2) with α = 0.02 once a uniform in-plane field is applied instantaneously. Two different cases, field 1 (25 mT directed 170° counterclockwise from the positive x-axis) and field 2 (36 mT directed 190° counterclockwise from the positive x-axis), are considered.

The reported solutions to standard problem #4, one of which comes from some co-authors of this work, yield essentially the same physical processes [97]. In particular, reversal in field 1 proceeds by propagation of the end domains towards the sample centre, whereas reversal in field 2 involves rotation of the end domains in one direction while the centre of the particle rotates in the opposite direction, resulting in collapsing 360° walls and more complex dynamics. The submitted solutions exhibit good agreement in the dynamic evolution of the averaged magnetization, except for some minor discrepancies in field 2 appearing after 0.4 ns [97].

Although the solutions reported by the different groups were all obtained using cell sizes smaller than the exchange length (lex = 5.69 nm), our purpose here was to check the robustness of the solution as the discretization is made even smaller. The temporal evolution of the average magnetization along the y-axis (short axis) is shown in figures 7 and 8 for fields 1 and 2, respectively. For each field, four different discretizations in the xy plane, Δx = Δy = 5.0, 2.5, 1.25 and 1.0 nm, were considered, and a fixed time step of Δt = 50 fs was used in all cases. As can be observed, for field 1 the solution is virtually independent of the discretization, whereas for field 2 some discrepancies appear roughly after 0.35 ns, but the solutions clearly converge as the cell size decreases. We note that simulations with mesh size Δx = Δy = 1 nm for the problem under consideration are hardly feasible with a standard CPU, whereas with a GPU solver they are easily tackled.

Figure 7. Standard problem #4: time evolution of the average magnetization along the y-axis in field 1 computed for four different discretizations. Colour online.

Figure 8. Standard problem #4: time evolution of the average magnetization along the y-axis in field 2 computed for four different discretizations. Colour online.

4.3. Performance test

A comparison in speed between GPMagnet and a serial CPU-based micromagnetic solver was performed. The serial code is the one from which GPMagnet was developed and, therefore, it has the same features as GPMagnet, highlighted in section 2. Specifications of the hardware used in both cases were given at the beginning of section 4. The comparison is made in terms of the time required to take one time step using a second-order predictor–corrector method, which involves calculating the effective field twice. A two-dimensional computational region of N × N cells is considered and a systematic study varying N is carried out. The results are presented in figure 9. Noticeably, the GPMagnet performance curve (circle symbols, labelled GPMagnet-1) does not increase monotonically with the number of cells but presents ups and downs. This feature was also found in [49], but here the jumps are even more pronounced. The explanation lies in the performance of the FFT routine used in the evaluation of the self-magnetostatic field, by far the most time-consuming part of micromagnetic simulations. The optimal dimensions for the FFT in terms of efficiency and accuracy are given by $2^k\cdot 3^l\cdot 5^m\cdot 7^n$ with k, l, m, n = 0, 1, 2, 3, .... We note that for a computational region of N × N × 1 cells the dimensions of the FFT are 2N × 2N × 2 since, as pointed out in section 2.3, the physical region needs to be zero-padded to double its length along each Cartesian direction. In the commercial version of GPMagnet the dimensions of the zero-padding are augmented, if necessary, to meet the optimal values for the FFT. The performance curve after this adjustment is shown in figure 9 (triangle symbols, labelled GPMagnet-2). As can be observed, the points for which a bad performance was found in the case of minimal zero-padding are shifted down significantly despite the fact that the size of the FFT is increased.
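The zero-padding adjustment can be reproduced with a few lines of host code: starting from the minimal padded length 2N, the size is increased until it factorizes into powers of 2, 3, 5 and 7. The routine below is a generic sketch of this rounding, not the code used in GPMagnet.

#include <cstdio>
#include <initializer_list>

// True if n has no prime factors other than 2, 3, 5 and 7.
static bool is_smooth(int n) {
    for (int p : {2, 3, 5, 7})
        while (n % p == 0) n /= p;
    return n == 1;
}

// Smallest FFT-friendly length greater than or equal to n.
static int next_fft_friendly(int n) {
    while (!is_smooth(n)) ++n;
    return n;
}

int main() {
    for (int N : {100, 122, 250, 1000}) {
        int padded = 2 * N;                       // minimal zero-padding
        std::printf("N = %4d  ->  padded %4d  ->  FFT size %4d\n",
                    N, padded, next_fft_friendly(padded));
    }
    return 0;
}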

Figure 9. Time required to perform one time step using a second-order predictor–corrector integration scheme for a 2D problem with N × N cells by three different solvers: CPU solver (black) and GPMagnet with minimum zero-padding (red) and with corrected zero-padding rounding to optimal sizes (green). Colour online.

As seen in figure 9, GPMagnet with the zero-padding adjustment is roughly two orders of magnitude faster than the CPU solver (square symbols) in simulations with a large number of cells. Similar speed-up factors have been found in other GPU implementations [49–51].

4.4. Thermal equilibrium statistics

The Langevin formalism used to include thermal fluctuations in micromagnetics, presented in section 2.4, is not fully satisfactory. From a theoretical point of view, the question of whether the correlation properties of the noise, derived by Brown for a single moment [83], are applicable to a system of strongly interacting magnetic moments remains open. In particular, it is questioned whether the noise should remain δ-correlated in space and time and, if not, how both the space and time correlations and also the strength of the noise could be determined [84, 106]. On the other hand, it has been pointed out that the model produces egregious errors at high temperature; in particular, it fails to reproduce the Curie temperature TC of a magnetic sample [103, 108]. Some attempts have been proposed to correct this shortcoming, based on either renormalization [103] or atomic-scale modelling [104, 105] of some system parameters, but these approaches lack generality. Finally, it is also well known that when thermal fluctuations are included the results depend strongly on the discretization, even for cell sizes smaller than lex (21) and lw (22) [107, 108]. These problems are not independent of each other, and a fully rigorous model that solves all of them is still far off and certainly beyond the scope of this work. Our goal here is to check whether the model described in section 2.4, despite its limitations, is consistent in the sense that it reproduces Maxwell–Boltzmann statistics in thermodynamic equilibrium, which is the condition imposed to determine the strength of the thermal field (27).

In order to investigate this, a uniformly magnetized sample under the action of an external field B is considered. In the absence of anisotropy, the energy of the system is simply given by

Equation (29)
$$E=-M_{\rm s}V\,{\bm m}\cdot{\bm B}=-M_{\rm s}V B\,m_z$$

where V is the volume of the sample and B is assumed to be along the z axis (B = B uz) for convenience. According to Maxwell–Boltzmann statistics, the probability distribution of the energy in thermodynamic equilibrium should verify $P(E)\propto \exp(\frac{-E}{k_{\rm B}\,T})$ . Since the energy is proportional to mz (29), the probability distribution of mz should also follow an exponential law. Namely, after normalizing one obtains

Equation (30)
$$P(m_z)=\frac{a\,{\rm e}^{a m_z}}{2\sinh a}$$

where a is given by

Equation (31)
$$a=\frac{M_{\rm s}V B}{k_{\rm B}T}$$

A cubic particle of 4 × 4 × 4 nm³ and material parameters A = 9 × 10⁻¹¹ J m⁻¹, Ms = 4.5 × 10⁵ A m⁻¹ and α = 0.01 is considered. We note that a large value of the exchange constant A has been chosen so that the size of the particle is well below the exchange length of this fictitious material. This way, we guarantee that deviations from uniform magnetization are small, so that a meaningful comparison between the results of the simulations and the theoretical description can be made. The statistical distribution of mz at T = 300 K with an external field B = 200 mT is obtained by integrating the stochastic LLG equation over a time window of 2000 ns. Three discretizations, Δ = 4 nm, 2 nm and 1 nm, are considered, with Δt = 231.2 fs, 5.77 fs and 0.58 fs being the time step used in each case, respectively. The results are represented in figure 10, where the computed probability distribution for each discretization is compared with the theoretical prediction for a macrospin particle (30). As can be observed, virtually perfect agreement is found for the 4 nm cell size, which is not surprising, since in this case there is only one computational cell and, therefore, it corresponds to a uniformly magnetized particle. For the other two discretizations good agreement is also found, although the probability around mz = +1 is significantly smaller in the simulations than in theory. Again, this result should not be surprising, since a finite exchange energy introduces small deviations from uniformity, leading to a slightly smaller effective magnetization and, consequently, to a leakage in the probability distribution from mz ≈ 1 towards neighbouring regions with smaller mz.

Figure 10. Equilibrium probability distribution of mz for a 4 × 4 × 4 nm³ cubic particle at T = 300 K under an applied field B = 200 mT. The distribution is computed micromagnetically using three different discretizations and each one of them is compared with the theoretical prediction (30). Colour online.

On the other hand, an analytical expression for the expected value of mz can be easily obtained

Equation (32)
$$\langle m_z\rangle=\coth a-\frac{1}{a}$$

where a is given by (31). Therefore, we can also study the dependence of 〈mz〉 on the applied field, which in (32) is hidden in the parameter a (31). Systematic simulations, similar to those from which the results in figure 10 were obtained, were carried out for different values of the applied field, and the averaged value of mz was computed in each case. The results are plotted in figure 11. Excellent agreement between theory (32) and simulations is found for all three discretizations.
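The theoretical references used in figures 10 and 11 are cheap to evaluate; the snippet below computes a from (31), the expected value (32) and the distribution (30) at mz = 1 for the parameters quoted above, under the reconstruction of those equations given in this section.

#include <cmath>
#include <cstdio>

int main() {
    const double kB = 1.380649e-23;             // J/K
    const double Ms = 4.5e5;                    // A/m
    const double V  = 4e-9 * 4e-9 * 4e-9;       // m^3 (4 nm cube)
    const double T  = 300.0;                    // K

    for (double B = 0.05; B <= 0.30001; B += 0.05) {
        double a       = Ms * V * B / (kB * T);                    // (31)
        double mz_avg  = 1.0 / std::tanh(a) - 1.0 / a;             // (32), Langevin function
        double P_at_1  = a * std::exp(a) / (2.0 * std::sinh(a));   // (30) at m_z = 1
        std::printf("B = %.2f T  a = %6.2f  <m_z> = %.4f  P(1) = %.3f\n",
                    B, a, mz_avg, P_at_1);
    }
    return 0;
}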

Figure 11. Expected value of mz for a 4 × 4 × 4 nm³ cubic particle for different values of the applied field. The results for three different discretizations are shown together with the theoretical prediction (32). Colour online.

From this study one can conclude that the formalism described in section 2.4 predicts the correct probability distribution for a system in thermodynamic equilibrium with a thermal bath. As already proposed by others [109, 110], this sort of problem could be used to formulate a new standard problem to test the implementation of thermal fluctuations in numerical micromagnetic codes.

5. Sample problems

The problems dealt with in the previous section were of an academic nature, in the sense that they were posed to validate micromagnetic codes and test their performance. In this section, however, we present some examples of more realistic problems, interesting from a physical point of view, that require simulating a large number of cells and can nevertheless be solved with GPMagnet at a reasonable time cost, whereas they would be unapproachable with a typical serial CPU-based solver.

5.1. Hysteresis loop of an antidot array

The magnetic properties of thin films with ordered antidot arrays can be tailored by controlling their geometrical parameters, such as antidot size and separation [111, 112]. Micromagnetic simulations are a useful tool for understanding equilibrium states [113] and reversal mechanisms [114] in these systems. However, modelling them in a meaningful way often involves simulating an area with dimensions much larger than the array period, so that edge effects do not dominate and collective reversal processes involving many antidots are allowed.

Our goal is to model an antidot array system fabricated by sputtering Ni80Fe20 (Py) (5 nm thickness) onto a hexagonal AAM (anodic alumina membrane) template with 35 nm pore diameter and lattice parameter D = 105 nm. More details about the fabrication process can be found elsewhere [115]. A field emission scanning electron microscopy (FE-SEM) image covering a region of 6 × 5 µm² of the antidot array is shown in figure 12(a). As can be observed, the hexagonal ordering of the template is preserved. In order to simulate this system as realistically as possible, a computational region of 6 × 5 µm² is considered and a binary mask obtained from the FE-SEM image is used to define the geometry. Typical Py values are used for the exchange constant (A = 1.3 × 10⁻¹¹ J m⁻¹) and saturation magnetization (Ms = 8.6 × 10⁵ A m⁻¹). In both cases the computational region is discretized into cubic cells of 5 nm side. A major hysteresis loop is simulated by sweeping the field between +100 and −100 mT in steps of 1 mT. For each step the LLG equation (2) is integrated with α = 1.0 and a fixed time step Δt = 200 fs until equilibrium is reached. The field is applied along the horizontal direction. The complete hysteresis loop, shown in figure 12(b) (solid line), was computed in roughly 27 h and 54 min. Snapshots of the magnetization configuration at B = −36 mT and B = −42 mT are shown in figures 12(c) and (d), respectively. The magnetization component along the applied field direction is plotted on a colour scale. As can be observed, a complex reversal mechanism is found, yielding a smooth hysteresis loop. For comparison, a second simulation with the same features was performed on a computational region five times smaller (1.2 × 1 µm²) obtained from the central part of figure 12(a). The computed hysteresis loop is also shown in figure 12(b) (dashed line). In this case, the loop comprises some apparent abrupt jumps, indicating that larger computational regions need to be considered in order to mimic the hysteresis properties of thin films with periodic antidot arrays.
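A hedged sketch of how such a binary mask can be turned into the FD geometry is given below: cells whose mask value is zero (a pore) carry no magnetic moment and are excluded from the field and torque evaluations. The helper load_mask is a placeholder for reading the thresholded FE-SEM image, and the data layout is purely illustrative, not that of GPMagnet.

#include <cstddef>
#include <vector>

// Placeholder: a real implementation would read the thresholded FE-SEM image.
std::vector<unsigned char> load_mask(int nx, int ny) {
    return std::vector<unsigned char>(static_cast<std::size_t>(nx) * ny, 1);
}

struct Cell { float mx, my, mz; unsigned char magnetic; };

// Map the binary mask onto the FD grid: 1 = magnetic material, 0 = pore.
std::vector<Cell> build_geometry(int nx, int ny) {
    std::vector<unsigned char> mask = load_mask(nx, ny);
    std::vector<Cell> cells(static_cast<std::size_t>(nx) * ny);
    for (std::size_t i = 0; i < cells.size(); ++i) {
        cells[i].magnetic = mask[i];
        cells[i].mx = mask[i] ? 1.0f : 0.0f;   // saturated initial state in magnetic cells
        cells[i].my = 0.0f;
        cells[i].mz = 0.0f;
    }
    return cells;
}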

Figure 12. (a) FE-SEM image of the antidot array (courtesy of del Real). (b) Computed hysteresis loop along the horizontal direction. The loop marked with a solid line was obtained using a 6 × 5 µm² computational region, whereas for the dashed one a 1.2 × 1 µm² computational region was used. Snapshots of the magnetization at (c) B = −36 mT and (d) B = −42 mT. The magnetization component along the applied field direction is plotted on a colour scale. Colour online.

5.2. Nanocontact STO

It has been shown both theoretically [61–63] and experimentally [116–118] that the STT exerted by a spin-polarized current on a ferromagnetic film can sustain stable microwave oscillations in the latter, which could be used to design a new type of nanoscale oscillator, the so-called STO. In this work we focus on STOs based on the nanocontact geometry where, as opposed to the nanopillar geometry, the current flows only through a confined region, whereas the lateral dimensions of the active magnetic layer are much larger, typically tens of micrometres. A schematic representation of this device is shown in figure 13. Some advantages with respect to pillar devices are fewer current-induced instabilities and less noise, no influence of surface roughness and, most noticeably, the possibility of synchronizing an array of such oscillators by coupling them via spin-wave emission [119]. Micromagnetics can be a very useful tool in understanding the nature of the self-sustained oscillations in these systems [28, 29, 120, 121]. However, modelling the real size of nanocontact devices is unfeasible in most cases, and one can only simulate at reasonable time cost an area of a few µm² around the nanocontact, which might give rise to spurious numerical effects. The most evident of these is spin-wave reflection at the computational boundaries: the reflected waves propagate back towards the nanocontact and eventually distort the oscillations. Some ad hoc techniques have been developed in order to minimize this effect. They are based on either artificially increasing the damping parameter α when moving away from the nanocontact [28, 122] or introducing surface roughness at the boundaries, leading to destructive interference of the scattered waves [30]. Although these techniques have proved to be useful to some extent, their performance is not satisfactory in some cases and the necessity of simulating computational areas as large as possible remains.
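As an illustration of the first of these techniques, the function below defines a spatial damping profile that keeps the physical value alpha0 near the nanocontact and ramps up towards a large alpha_max at the edge of the computational region. The quadratic ramp and all parameter names are illustrative choices; the actual optimized profile of [122] may differ.

#include <algorithm>
#include <cmath>

// Absorbing-boundary damping profile: physical damping within radius r0 of the
// nanocontact centre (xc, yc), smoothly increasing towards alpha_max at r_edge.
double damping_profile(double x, double y, double xc, double yc,
                       double r0, double r_edge,
                       double alpha0, double alpha_max) {
    double r = std::sqrt((x - xc) * (x - xc) + (y - yc) * (y - yc));
    if (r <= r0) return alpha0;
    double s = std::min(1.0, (r - r0) / (r_edge - r0));
    return alpha0 + (alpha_max - alpha0) * s * s;   // smooth quadratic ramp
}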

Figure 13. Schematic representation of a nanocontact STO (left). In the free layer the current flows only through the region below the top metallic nanocontact whereas the lateral dimensions are typically much larger (not plotted on scale). Current-induced oscillations lead to spin waves that propagate outwards (right). Colour online.

We consider a nanocontact STO in which the geometry and parameter values have been chosen to model the experimental device used in [117] as closely as possible. In particular, the device under investigation is a Co90Fe10(20 nm)/Cu(5 nm)/Ni80Fe20(5 nm) trilayer with a circular nanocontact of 40 nm diameter. The response of the system is simulated when a dc current I = 8.0 mA flows through the device with a field of 900 mT applied perpendicular to the plane. Parameter values Ms = 5 × 10⁵ A m⁻¹, A = 1.3 × 10⁻¹¹ J m⁻¹ and α = 0.01 were chosen for the Py free layer, whereas the fixed Co90Fe10 layer is assumed to be uniformly magnetized along the applied field direction. The time evolution of the magnetization in the free layer is computed over a time window of 200 ns for different dimensions of the (square) computational region, ranging from L = 1 µm to L = 5 µm in side. Figure 14 shows the computed response in the frequency domain for each case. The spectra were computed from the out-of-plane component of the magnetization averaged over the region immediately below the nanocontact. The spectrum obtained with L = 2 µm with optimized absorbing boundary conditions (ABC) [122] is also shown (black). As can be observed, the spectra differ from each other significantly, and these discrepancies can only be attributed to the effect of reflected spin waves, since only the region below the nanocontact is taken into account in obtaining the spectra. It was recently shown [122] that ABC based on increasing α away from the nanocontact effectively remove spin-wave reflections in a problem very similar to the one discussed here and, therefore, it can be safely assumed that in the curve obtained with ABC, peaked at 22.50 GHz, this spurious effect is negligible. From figure 14 it can be concluded that even for L = 4 µm the effect of spin-wave reflections is significant and that lateral dimensions of 5 µm have to be simulated for an accurate estimation of the response if no ABC are implemented. The 5 µm simulation was carried out with GPMagnet in 98 h using 4 × 4 × 5 nm³ cells and a fixed time step of 200 fs.

Figure 14. Frequency response of a STO with a nanocontact of 40 nm in diameter for an applied current I = 8.0 mA and a perpendicular applied field of 900 mT. The free layer is 5 nm thick, whereas the fixed one is assumed to be uniformly magnetized along the applied field direction. Spectra for different sizes of the computational region were computed from the time evolution of the out-of-plane component of the magnetization over a time window of 200 ns, leading to a 5 MHz resolution. The black curve corresponds to the spectrum obtained for L = 2 µm with optimized ABC. No ABC is implemented in the rest of the cases. Colour online.

5.3. STO with two dynamic layers

In this section we investigate current-induced oscillations in a Py(5 nm)/Cu(40 nm)/Py(60 nm) spin valve with an elliptical cross-sectional area of 160 nm × 75 nm, schematically represented in figure 15. This geometry mimics the device investigated in [123], where the first experimental confirmation that a magnetic vortex could be excited into persistent microwave oscillations by a spin-polarized dc current was provided.

Figure 15. Schematic representation of the spin valve (left). Snapshots of the magnetization configuration in the top and bottom layers when a dc current of 1.0 × 10¹² A m⁻² flows perpendicularly to the device under a 160 mT perpendicular field. Colour online.

This system was investigated by means of micromagnetic simulations in a previous publication [124] by some of the co-authors of this work. It was shown that the combined spin torque and magnetic coupling between the two magnetic layers plays a crucial role in the explanation of the hysteresis properties of the device and in obtaining persistent vortex oscillations. Good agreement with the measured frequency versus current behaviour was found. The computed linewidths could not be compared with the experimental data because simulation times of 50 ns were considered, limiting the resolution to 20 MHz, while the experimental linewidths are well below 1 MHz [123]. In this work a systematic analysis of the linewidth versus current in this system is carried out.

As in [124], the following parameter values were used for both Py layers: Ms = 6.5 × 10⁵ A m⁻¹, A = 1.3 × 10⁻¹¹ J m⁻¹, α = 0.015. The STT interaction between the two layers was treated using the model described in [124] with P = 0.38 and χ = 1.5, whereas thermal fluctuations were not taken into account. The computational region was discretized into 5 × 5 × 5 nm³ cubic cells. The time evolution of the system was simulated when a dc current passes through the device from top to bottom under the action of a perpendicular applied field of 160 mT. A fixed time step of 500 fs was used. The three-dimensional orbits of the averaged magnetization in both magnetic layers are shown in figure 16 for an applied current density of 1.0 × 10¹² A m⁻². As discussed in [124], the vortex in the thicker Py layer moves along a quasi-elliptical trajectory, whereas the magnetization of the thinner Py layer is in a highly non-uniform configuration that changes dynamically as the vortex precesses.

Figure 16. Three-dimensional orbits of the averaged magnetization in the thin and thick Py layers for an applied current density of 1.0 × 10¹² A m⁻². Colour online.

Simulations for different values of the applied current density, ranging from 1.0 × 10¹² to 2.0 × 10¹² A m⁻², were carried out for a time window of 10⁻⁵ s, leading to a 100 kHz resolution in the frequency spectra. Each one of the simulations took roughly 60 h. Figure 17(a) shows the detail of the computed spectrum around the main peak for J = 1.3 × 10¹² A m⁻². A Lorentzian fit (red curve) to the results yields a linewidth of 43 kHz. In figure 17(b) the dependence of the linewidth on the applied current is shown. As can be observed, up to J = 1.6 × 10¹² A m⁻² the linewidths are well below 1 MHz, but for larger currents a strong increase in the linewidth is obtained. A detailed analysis of these results is beyond the scope of this work.

Figure 17. (a) Detail of the spectrum around the main peak for an applied current density J = 1.3 × 10¹² A m⁻². A linewidth of 43 kHz is obtained from the Lorentzian fit (red curve). (b) Computed linewidth as a function of the applied current. Colour online.

6. Conclusions

GPU computing is becoming widely used in high-performance computing. In this review, the general features of CUDA, NVIDIA's parallel computing architecture, were briefly presented and some general aspects related to adapting a serial micromagnetic code to run on a GPU architecture were discussed.

GPMagnet, a commercial software package for general-purpose micromagnetic simulations, was presented as an example of GPU implementation of a micromagnetic solver. The accuracy of GPMagnet was tested by solving standard problems with well known solutions. In terms of efficiency, it was shown that GPMagnet is roughly two orders of magnitude faster than a CPU-based code of similar features. This opens new opportunities in micromagnetic modelling, allowing for the possibility of performing simulations with a large number of spins at an affordable time cost.

Moreover, the efficiency of GPMagnet is expected to increase substantially in the short term. On one hand, some platform-independent improvements in the code will be implemented shortly, such as a higher order integration scheme with adaptive time step or a more efficient implementation of the FFT. On the other, the possibility of performing simulations on different GPUs using multi-GPU distributed memory clusters, which will also be implemented in the near future, will increase the computational power significantly.

In conclusion, the emergence of GPU computing makes it possible to carry out systematic, large scale and high-resolution micromagnetic studies, which up until now required the computational power of supercomputers, at an affordable cost. Therefore, it is the authors' belief that future micromagnetic solvers will mostly rely on GPU architectures.

Acknowledgments

This work was partially supported by the Spanish Government under project MAT2011-28532-CO3-01.
