1 Introduction

Flavour-changing neutral-current (FCNC) processes are highly suppressed in the Standard Model (SM), and their study is relevant to indirect searches for physics beyond the SM. The branching fractions of the decays \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) are of particular interest because of the additional helicity suppression and since they are accurately predicted in the SM: \(\mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-) = (3.65 \pm 0.23) \times 10^{-9}\) and \(\mathcal{B}(B^0 \rightarrow \mu ^+\mu ^-) = (1.06 \pm 0.09) \times 10^{-10}\) [1]. Significant deviations from these values can arise in models involving non-SM heavy particles, such as those predicted in the Minimal Supersymmetric Standard Model [26], in extensions such as Minimal Flavour Violation [7, 8], Two-Higgs-Doublet Models [6], and others [9, 10]. The CMS and LHCb collaborations have reported the observation of \(B^0_s \rightarrow \mu ^+\mu ^-\)  [11, 12] and evidence of \(B^0 \rightarrow \mu ^+\mu ^-\), with combined values: \(\mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-) = \left( 2.8^{+0.7}_{-0.6}\right) \times 10^{-9}\) and \(\mathcal{B}(B^0 \rightarrow \mu ^+\mu ^-) = \left( 3.9^{+1.6}_{-1.4}\right) \times 10^{-10}\) [13].

This paper reports the result of a search for \(B^0_s \rightarrow \mu ^+\mu ^-\) and \(B^0 \rightarrow \mu ^+\mu ^-\) decays performed using pp collision data corresponding to an integrated luminosity of 25 fb\(^{-1}\), collected at 7 and 8 TeV in the full LHC Run 1 data-taking period using the ATLAS detector. This analysis supersedes the previous result [14] based on 2011 data and exploits improved analysis techniques in addition to the larger dataset.

2 Outline

The \(B^0_s \rightarrow \mu ^+\mu ^-\) and \(B^0 \rightarrow \mu ^+\mu ^-\) branching fractions are measured relative to the normalisation decay \(B^+\rightarrow J/\psi (\rightarrow \mu ^+\mu ^-)K^+\) that is abundant and has a known branching fraction \(\mathcal{B}(B^+ \rightarrow J/\psi \,K^+) \times \mathcal{B}(J/\psi \rightarrow \mu ^{+} \mu ^{-})\). In the simplest form, the \(B^0 \rightarrow \mu ^+\mu ^-\) (\(B^0_s \rightarrow \mu ^+\mu ^-\)) branching fraction can be extracted as:

$$\begin{aligned}&\mathcal{B}(B^0_{(s)}\! \rightarrow \!\mu ^+ \mu ^-) \\&\quad = \frac{N_{d(s)}}{\varepsilon _{\mu ^+\mu ^-}}\left[ \mathcal{B}(B^+ \rightarrow J/\psi \,K^+) \times \mathcal{B}(J/\psi \rightarrow \mu ^{+} \mu ^{-})\right] \\&\qquad \times \,\frac{\varepsilon _{J/\psi K^+}}{N_{J/\psi K^+}} \times \frac{f_{u}}{f_{d(s)}}, \end{aligned}$$

where \(N_d\) (\(N_s\)) is the \(B^0 \rightarrow \mu ^+\mu ^-\) (\(B^0_s \rightarrow \mu ^+\mu ^-\)) signal yield, \(N_{J/\psi K^+}\) is the \(B^+ \rightarrow J/\psi \,K^+\) normalisation yield, \(\varepsilon _{\mu ^+\mu ^-}\) and \(\varepsilon _{J/\psi K^+}\) are the corresponding values of acceptance times efficiency, and \(f_u/f_d\) (\(f_u/f_s\)) is the ratio of the hadronisation probabilities of a b-quark into \(B^+\) and \(B^0\) (\(B_s^0\)).

For this study, a modified formula is used to normalise independently samples of events collected in different data-taking periods and with different trigger selections:

$$\begin{aligned}&\mathcal{B}(B^0_{(s)}\! \rightarrow \!\mu ^+ \mu ^-) \nonumber \\&\quad = N_{d(s)}\left[ \mathcal{B}(B^+ \rightarrow J/\psi \,K^+) \times \mathcal{B}(J/\psi \rightarrow \mu ^{+} \mu ^{-})\right] \nonumber \\&\qquad \times \, \frac{f_{u}}{f_{d(s)}} \times \frac{1}{\mathcal {D}_{\mathrm {norm}}}, \end{aligned}$$
(1)

with

$$\begin{aligned} \mathcal {D}_{\mathrm {norm}} = \sum _k N^k_{J/\psi K^+} \alpha _k\left( \frac{\varepsilon _{\mu ^+\mu ^-}}{\varepsilon _{J/\psi \, K^+}} \right) _k. \end{aligned}$$
(2)

The denominator \(\mathcal {D}_{\mathrm {norm}}\) consists of a sum whose index k runs over the data-taking periods and the trigger selections. In the sum, the \(\alpha _k\) parameter takes into account the different trigger prescale factors and integrated luminosities in the signal and normalisation channels, and the ratio of the efficiencies corrects for reconstruction differences in each data sample k. Signal and reference channel events are selected with similar dimuon triggers.

The notation used throughout the paper refers to both the stated and charge-conjugated process, unless otherwise specified. The analysis is performed without tagging of the flavour \(B^0{}\!_{(s)}\) or \(\overline{B^0}\!_{(s)}\) at production. The yield measurement in the normalisation channel is obtained by summing \(J/\psi K^+\) and \(J/\psi K^-\) contributions.

The analysis is performed integrating over the decay time distribution of the event candidates. The relation between the measured branching fraction and the corresponding value at production is established assuming the decay time distribution predicted in the SM, where the decay occurs predominantly through the heavy eigenstate \(B_{s/d,\mathrm {H}}\) of the \(B^0{}\!_{(s)}\)-\(\overline{B^0}\!_{(s)}\) system. Models for new physics [15, 16] can predict modification to the decay time distribution of \(B^0_s \rightarrow \mu ^+\mu ^-\) and a comparison with the experimental result may require a correction to the ratio of the time-integrated efficiencies.

The ATLAS inner tracking system and muon spectrometer are used to reconstruct and select the event candidates. Details of the detector, trigger, data sets, and preliminary selection criteria are discussed in Sects. 3 and 4. A blind analysis was performed in which data in the dimuon invariant mass region from 5166 to 5526 MeV were removed until the procedures for event selection and the details of signal yield extraction were completely defined. Section 5 introduces the three main categories of background (continuum background due to muons from uncorrelated hadron decays, background from partially reconstructed decays, and peaking background from \(B^0{}\!_{(s)}\) two-body hadronic decays, where both particles are misidentified as muon pairs). Section 6 describes the strategy used to reduce the probability of hadron misidentification. The final sample of candidates is selected using a multivariate classifier, designed to enhance the signal relative to the continuum background, as discussed in Sect. 7. Checks on the distributions of the variables used in the multivariate classifier are summarised in Sect. 8. They are based on the comparison of data and simulation for dimuon events, for \(B^+ \rightarrow J/\psi \,K^+\) candidates and for events selected as \(B^0_s \rightarrow J/\psi \, \phi \rightarrow \mu ^+\mu ^- K^+K^-\), which provide an additional validation of the procedures used in the analysis. Section 9 details the fit procedure to extract the yield of \(B^+ \rightarrow J/\psi \,K^+\) events. As an ancillary measurement to the \(B^+ \rightarrow J/\psi \,K^+\) yield determination, a measurement of the ratio \(\mathcal {B}(B^+ \rightarrow J/\psi \,\pi ^+ )/\mathcal {B}(B^+ \rightarrow J/\psi \,K^+)\) is performed, as presented in Sect. 9.1. The ratio of efficiencies in the signal and the normalisation channels is presented in Sect. 10. Section 11 describes the extraction of the signal yield, obtained with an unbinned maximum-likelihood fit performed on the dimuon invariant mass distribution, with the events classified according to three intervals in the classifier used for the final selection. The results on the branching fractions \(\mathcal B\)(\(B^0_s \rightarrow \mu ^+\mu ^-\)) and \(\mathcal B\)(\(B^0 \rightarrow \mu ^+\mu ^-\)) are reported in Sect. 12.

3 ATLAS detector, data and simulation samples

The ATLAS detectorFootnote 1 consists of three main components: an inner detector (ID) tracking system immersed in a 2 T axial magnetic field, surrounded by electromagnetic and hadronic calorimeters and by the muon spectrometer (MS). A full description can be found in Ref. [17].

This analysis is based on the Run 1 data sample recorded in 2011 and 2012 by the ATLAS detector from pp collisions at the LHC at \(\sqrt{s}=7\) and 8 TeV, respectively. Data used in the analysis were recorded during stable LHC beam periods. Data quality requirements were imposed, notably on the performance of the MS and ID systems. The total integrated luminosity of good quality data used in this analysis is 4.9 fb\(^{-1}\) for the 2011 sample and 20 fb\(^{-1}\) for 2012. The average number of reconstructed primary vertices (PV) per event, related to multiple proton–proton interactions, is 6.2 and 11.4 in the two years respectively.

Samples of simulated Monte-Carlo (MC) events are used for training and validation of the multivariate analyses, for the determination of the efficiency ratios, and for guiding the signal extraction fits. Exclusive MC samples were produced for the signal channels \(B^0_s \rightarrow \mu ^+\mu ^- \) and \(B^0 \rightarrow \mu ^+\mu ^- \), the normalisation channel \(B^+ \rightarrow J/\psi \,K^+\) (\(J/\psi \rightarrow \mu ^+\mu ^-\)), the \(B^+ \rightarrow J/\psi \, \pi ^+\) channel, and the control channel \(B^0_s \rightarrow J/\psi \, \phi \) (\(\phi \rightarrow K^+K^- \)). In addition, background studies employ MC samples of inclusive semileptonic decays \(B\rightarrow \mu X\), samples of \(B_s^0 \rightarrow K^- \mu ^+ \nu \), \(B^0 \rightarrow \pi ^-\mu ^+ \nu \), \(\Lambda _b \rightarrow p \mu ^- \overline{\nu }\), \(B^0{}\!_{(s)} \rightarrow hh^\prime \) decays with \(h^{(\prime )}\) being a charged pion or kaon, and inclusive decays \(B \rightarrow J/\psi X\).

Most of dimuon candidates in the data sample originate from the uncorrelated decays of hadrons produced in the hadronisation of a b and a \(\bar{b}\) quarks. To describe this background, defined as continuum, a large MC sample was generated by selecting specific topologies that dominate it. The strategy is to consider both the primary decays from b quarks and the secondary decays from c quarks. Independent samples of events with forced semileptonic decays or decays including muons pairs from \(J/\psi \) were generated in all combinations. The total number of events in each sample is chosen to reproduce the composition of oppositely charged muon pairs representative of our data.

The MC samples were generated with Pythia 6 [18] for studies related to data collected in 2011, and with Pythia 8 [19] and EvtGen [20] for the 2012 sample and the development of multivariate classifiers. The ATLAS detector and its response are simulated using Geant4 [21, 22]. Additional pp interactions in the same and nearby bunch crossings (pile-up) are included in the simulation. All simulated samples are reweighted to have the same distribution of the number of PVs per bunch crossing found in data.

Using the iterative reweighting method described in Ref. [14], the simulated samples of the exclusive decays considered are adjusted with two-dimensional data-driven weights (DDW) to correct for the differences between simulation and data observed in the \(p_{\text {T}} ^B\) and and \(|\eta ^B|\) distributions. DDW obtained from \(B^+ \rightarrow J/\psi \,K^+\) decays are used to correct the simulation samples in the signal and normalisation channels. DDW obtained from the \(B^0_s \rightarrow J/\psi \, \phi \) control channel are found to agree with those from \(B^+ \rightarrow J/\psi \,K^+\) showing the consistency of the corrections.

Similarly to the exclusive decays, the large continuum background MC sample is reweighted via DDW obtained from its comparison with the data in the sidebands of the signal region.

4 Data selection

For data collected during the LHC Run 1, the ATLAS detector used a three-level trigger system, consisting of a hardware-based Level-1 trigger, software-based Level-2 and Event Filter triggers.

A dimuon trigger [23, 24] is used to select events. The 2011 data sample contains events seeded by a Level-1 dimuon trigger that required a transverse momentum \(p_{\text {T}} > 4\) GeV for both muon candidates. Due to the increased pile-up in 2012 data, this dimuon trigger was prescaled at the beginning of every fill. The effect of prescaling is mitigated by including in the analysis events selected by two additional Level-1 triggers scarcely affected by prescaling, where tighter selections were applied: \(p_{\text {T}} > 6\) GeV or \(|\eta | < 1.05\) for one of the muons. A full track reconstruction of the muon candidates was performed at the software trigger levels, where an additional loose selection was applied to the dimuon invariant mass \(m_{\mu \mu }\) and the events were assigned to the \(J/\psi \) stream (\(2.5< m_{\mu \mu } < 4.3\) GeV) or to the B stream (\(4.0< m_{\mu \mu } < 8.5\) GeV).

Events from the 2012 dataset are divided into three mutually exclusive trigger categories:

  • \(T_1\)  “Higher threshold” trigger with \(p_{\text {T}}\) \(>6\) GeV for one muon and \(>4\) GeV for the other one;

  • \(T_2\)  “Barrel” trigger with \(p_{\text {T}}\) \(>4\) GeV for both muon candidates and at least one of them with \(|\eta | < 1.05\) (and \(T_1\) requirement not satisfied);

  • \(T_3\)  Basic dimuon trigger with \(p_{\text {T}}\) \(>4\) GeV for both muon candidates (and \(T_1\), \(T_2\) requirements not satisfied).

Events belonging to a given category are all associated with the same pattern of Level-1 prescaling. The event sample in the \(T_2\) (\(T_3\)) category has an equivalent integrated luminosity equal to \(97.7\,\%\) (\(81.3\,\%\)) of the luminosity of the \(T_1\) category. The impact of the trigger Level-1 prescale on the total sample of collected events is minor, since the majority of the events belong to the \(T_1\) category.

The events in the reference channels \(B^+ \rightarrow J/\psi \,K^+\) and \(B_s^0 \rightarrow J/\psi \phi \) collected in 2012 belong to a prescaled sample of events, which was processed together with the signal events. The effective prescaling factor is equal to 7.3, and does not affect the sensitivity of this analysis, given the large number of available events in the normalisation channel. This factor is included in the \(\alpha _k\) parameters in Eq. (1).

A fourth category is defined for events from the 2011 dataset. They were collected with a trigger requirement \(p_{\text {T}}\) \(>4\) GeV for both muon candidates, and prescaling was not applied to this sample.

After off-line reconstruction, a preliminary selection is performed on candidates for \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^- \), \(B^+ \rightarrow J/\psi \,K^+ \rightarrow \mu ^{+} \mu ^{-} K^+\) and \(B_s^0\) \( \rightarrow J/\psi \phi \rightarrow \mu ^{+} \mu ^{-} K^+ K^-\). In the ID system, muons are required to have at least one hit in the pixel detector, five hits in the semiconductor tracker (two hits per each double-sided layer), and six hits in the transition-radiation tracker, if \(0.1< |\eta |< 1.9\). They are also required to be reconstructed in the MS, and to have \(|\eta | < 2.5\) and \(p_{\text {T}} >4\) GeV. Kaon candidates have to satisfy similar requirements in the ID, except that at least nine instead of six hits are required in the transition-radiation tracker and a looser requirement of \(p_{\text {T}} >1\) GeV is imposed.

B meson properties are computed based on a decay vertex fitted to two, three or four tracks, depending on the decay process to be reconstructed. The \(\chi ^2\) per degree of freedom in the vertex fit is required to be less than six for the B vertex, and less than ten for the \(J/\psi \rightarrow \mu \mu \) vertex. The conditions \(2915<m(\mu \mu )<3275\) MeV and \(1005< m(KK) < 1035\) MeV are required on ID track combinations for the \(J/\psi \rightarrow \mu \mu \) and the \(\phi \rightarrow K K\) vertices, respectively. In the \(B^+ \rightarrow J/\psi \,K^+\) and \(B^0_s \rightarrow J/\psi \, \phi \) fits the reconstructed \(J/\psi \) mass is constrained to the world average value [25].

Reconstructed B candidates are required to satisfy \(p_{\text {T}} ^B>8.0\) GeV and \(|\eta ^B|<2.5\). The dimuon invariant mass for \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^- \) candidates is calculated using the combined ID and MS information, in order to improve the mass resolution in the end-caps with respect to using ID information only [26].

The invariant mass range considered for the \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^- \) decay is 4766–5966 MeV in which the 5166–5526 MeV range is defined as the signal region while the low-mass and high-mass regions (4766–5166 and 5526–5966 MeV) are the signal mass sidebands. For the reference channels, the mass range considered is 4930–5630 (5050–5650) MeV for \(B^+ \rightarrow J/\psi \,K^+\) (\(B^0_s \rightarrow J/\psi \, \phi \)) in which the 5180–5380 (5297–5437) MeV range is the peak region and the two low and high mass ranges are the mass sidebands used for background subtraction.

The coordinates of the PVs are obtained from charged tracks not used in the decay vertices, and are transversely constrained to the luminous region of the colliding beams. The matching of a B candidate to a PV is made by propagating the candidate to the point of closest approach to the collision axis, and choosing the PV with the smallest separation along z. Simulation shows that this method achieves a correct matching probability of better than \(99\,\%\).

To reduce of the large background in the \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) channel before the final selection based on multivariate classifiers, a loose collinearity requirement is applied between the momentum of the B candidate (\(\overrightarrow{p}^B\)) and the spatial separation between the PV and the decay vertex (\(\overrightarrow{\Delta x}\)). The absolute value of the difference in azimuthal angle \(\alpha _{2\mathrm {D}}\) is required to be smaller than 1.0 rad. Using the difference in rapidity \(\Delta \eta \), the combination \(\Delta R = \sqrt{\alpha _{2\mathrm {D}}{}^{2} + \Delta \eta ^{2}}\) is required to be smaller than 1.5. These requirements reduce the background by a factor of 0.4, with a signal efficiency of \(95\,\%\).

After the preliminary selection, approximately \(2.6\times 10^6\) (\(2.3 \times 10^6\)) candidates are found in the \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^- \) (\(B^+ \rightarrow J/\psi \,K^+ \)) signal regions.

5 Background composition

The background to the \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) signal originates from three main sources:

Continuum :

Background, the dominant combinatorial component, made from muons coming from uncorrelated hadron decays and characterised by a small dependence on the dimuon invariant mass;

Partially reconstructed :

\(B\rightarrow \mu \mu X\) decays, characterised by non-reconstructed final-state particles (X) and thus accumulating in the low dimuon invariant mass sideband;

Peaking :

Background, due to \(B^0{}\!_{(s)}\rightarrow h\,h^\prime \) decays, with both hadrons misidentified as muons.

The continuum background consists mainly of muons independently produced in the fragmentation and decay trees of a b and a \(\overline{b}\) quark (opposite-side muons). It is studied in the signal mass sidebands, and it is found to be correctly described by the inclusive MC sample of semileptonic decays of b and c hadrons.

Section 8 contains data–MC comparisons for the continuum background. As discussed in Sect. 7, a multivariate classifier trained on MC samples is used to reduce this component.

The partially reconstructed decays consist of several topologies: (a) same-side (SS) combinatorial background from decay cascades (\(b \rightarrow c \mu \nu \rightarrow s(d) \mu \mu \nu \nu \)); (b) same-vertex (SV) background from B decays containing a muon pair (e.g. \(B^0 \rightarrow K^{*0}\mu \mu \), \(B \rightarrow J/\psi X \rightarrow \mu \mu \mu X^\prime \)); (c) \(B_c\) decays (e.g. \(B_c \rightarrow J/\psi \mu \nu \rightarrow \mu \mu \mu \nu \)); (d) semileptonic b-hadron decays where a final-state hadron is misidentified as a muon.

Inclusive MC samples of SS events, SV events, and \(B_c \rightarrow J/\psi \mu \nu \) decays were generated. All subsamples have a dimuon invariant mass distribution accumulating below the mass range considered in this analysis. The high-mass tail extends to the signal region and becomes a significant fraction of the background only after applying a selection against the continuum background.

Fig. 1
figure 1

a Dimuon invariant mass distribution for the partially reconstructed background, from simulation, before the final selection against continuum is applied but after all other requirements. The different components are shown as stacked histograms, normalised according to world-averaged measured branching fractions. The SM expectation for the \(B^0_s \rightarrow \mu ^+\mu ^-\) signal is also shown for comparison (non-stacked). Continuum background is not included here. b Invariant mass distribution of the peaking background components \(B^0{}\!_{(s)} \rightarrow hh^\prime \), after the complete signal selection is applied. In both plots the distributions are normalised to the expected yield for the integrated luminosity of 25 fb\(^{-1}\).

The semileptonic decays with final-state hadrons misidentified as muons consist mainly of three-body charmless decays \(B^0 \rightarrow \pi \mu \nu \), \(B^0_s \rightarrow K\mu \nu \) and \(\Lambda _b \rightarrow p\mu \nu \) in which the tail of the invariant mass distribution extends to the signal region. Due to branching fractions of the order of \(10^{-6}\), this background is not large, and is further reduced by the dedicated muon identification requirements, discussed in Sect. 6. The MC invariant mass distributions of these partially reconstructed decay topologies are shown in Fig. 1a after applying the preliminary selection criteria described in Sect. 4.

Finally, the peaking background is due to \(B^0{}\!_{(s)}\) decays containing two hadrons misidentified as muons, which populate the signal region as shown in Fig. 1b.

6 Hadron misidentification

In the preliminary selection, muon candidates are formed from the combination of tracks reconstructed independently in the ID and MS [27]. The performance of the muon reconstruction in ATLAS is presented in Ref. [26]. Additional studies were performed for this analysis to minimise and evaluate the amount of background related to hadrons erroneously identified as muons.

Detailed simulation studies were performed for the channels \(B^0{}\!_{(s)} \rightarrow hh^\prime \) and \(\Lambda _b \rightarrow ph\), with \(h^{(\prime )} = \pi ^\pm , K^\pm \). A full Geant4-based simulation [21] in all systems of the ATLAS detector is used for this purpose. The vast majority of background events from particle misidentification are due to decays in flight of kaons and pions, in which the muon receives most of the energy of the meson. Hence, despite the notation of fake muons, this background is generally related to true muons measured in the MS, but not produced promptly in the decay of a B meson. The contribution from hadronic punch-through into the MS is expected from simulation to amount only to 3 % (8 %) of the total number of fake candidates from kaons (pions).

The simulation shows that after the preliminary selection the probability for a kaon (pion) to be misidentified as a muon is 0.4 % (0.2 %). This fraction is found to be largely independent of the transverse momentum and rapidity of the track, as well as other variables related to the underlying event or pile-up. The misidentification rate for protons is found to be negligible (\({<}0.01~\%\)).

The muon candidate is further required to match the trigger requirements, resulting in a reduction in the number of retained tracks by a factor 0.58, and to pass an additional multivariate selection, implemented as a boosted decision tree (BDT) [28]. This selection, referred to as fake-BDT, is based on variables described in Table 1 and it is built and trained on the MC samples. The BDT training is done using a multivariate analysis tool (TMVA) [28]. The fake-BDT selection is tuned for a 95 % efficiency for muons in the signal sample, and achieves an average reduction of the hadron misidentification by a factor 0.37, determined with independent MC samples. The resulting final value of the misidentification probability is equal to 0.09 % for kaons and 0.04 % for pions.

Table 1 Description of the eight variables used in the discrimination between signal muons and those from hadron decays in flight and punch-throughs

The background due to \(B^0{}\!_{(s)} \rightarrow hh^\prime \), with double misidentification \(h h^\prime \rightarrow \mu \mu \), has a distribution in the reconstructed invariant mass peaking at 5250 MeV, close to the \(B_s^0\) mass and is effectively indistinguishable from the \(B^0\) signal (see Fig. 1b). Beyond the muon and fake-BDT selection, these events have the same acceptance and selection efficiency as the \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) signal. Therefore, the expected number of peaking-background events can be estimated from the number of observed \(B^+ \rightarrow J/\psi \,K^+\) events, in a way analogous to what is done for the signal, using Eq. (1). World average [25] values for the branching fractions of \(B^0\) and \(B_s^0\) into \(K\pi \), KK and \(\pi \pi \) are used, together with the hadron misidentification probabilities obtained from simulation. The resulting total expected number of peaking-background events, after the final selection (including a multivariate cut against \(\mu ^+\mu ^-\) continuum background, the continuum-BDT discussed in Sect. 7), is equal to 0.7, with a 10 % uncertainty from the normalisation procedure.

The simulation of hadron misidentification was validated and calibrated with studies performed on data. The fractions of fake muons after the preliminary selection were evaluated on samples of \(\phi \rightarrow K^+K^-\) and \(B^+ \rightarrow J/\psi \,K^+\) events, and found to be consistent with the simulation within a factor \(1.2\pm 0.2\). This factor and its square \(1.4 \pm 0.5\) are used as scale correction and systematic uncertainty in the single and double misidentification probability, respectively. Hence, the expected number of peaking background events is equal to \(1.0\pm 0.4\).

A further test of the peaking background was performed on the final sample of \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) candidates. Inverting the selection applied with the fake-BDT, the number of events containing real muons is largely reduced, while the number of peaking-background events is approximately three times larger than in the sample obtained with the nominal selection. A fit to the background-enhanced sample gives a peaking background yield of \(0.5 \pm 3.0\) events, in good agreement with the expectation.

The efficiency of the fake-BDT selection when applied to muons from \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) decays was tested on the sample of \(B^+ \rightarrow J/\psi \,K^+\) candidates selected in data. The value from simulation was found to be accurate to better than \(1~\%\).

Besides the peaking background, the selection with the fake-BDT also reduces the semileptonic contributions with a single misidentified hadron. The expected number of events from \(B^0 \rightarrow \pi \mu \nu \) and \(B_s^0 \rightarrow K\mu \nu \) in the final sample is \(107 \pm 27\). The \(\Lambda _b \rightarrow p\mu \nu \) contribution is negligible due to the smaller production cross section and the fake rejection for protons at the level of \(10^{-5}\).

7 Continuum background reduction

A multivariate analysis, implemented as a BDT, is employed to enhance the signal relative to the continuum background. This classifier, referred to as the continuum-BDT, is based on the 15 variables described in Table 2. The discriminating variables can be classified into three groups: (a) B meson variables, related to the reconstruction of the decay vertex and to the collinearity between \(\overrightarrow{p}^B\) and the separation between production and decay vertices \(\overrightarrow{\Delta{x}}\); (b) variables describing the muons forming the B meson candidate; and (c) variables related to the rest of the event. The selection of the variables aims to optimise the discrimination power of the classifier, while minimising the dependence on the invariant mass of the muon pair.

Most of the discriminating variables are part of the set used in the previous analysis based on data collected in 2011 [14], while others were modified or added, exploiting the statistical power of the large samples of MC events used for training and validating the classifier. To minimise the dependence of the classifier on the effects of the pile-up, requirements of compatibility with the same vertex matched to the dimuon candidate are placed on the additional tracks considered for the variables \(I_{0.7}\), DOCA\(_{\mathrm {xtrk}}\) and \(N^{\mathrm {close}}_{\mathrm {xtrk}}\).

The correlation between the discriminating variables was studied in the MC samples for signal and continuum background discussed in Sect. 3, and on data from the sidebands of the \(\mu ^+\mu ^-\) invariant mass distribution. Different degrees of correlation are present, with significant linear correlation among the variables \(\chi ^{2}_{{\mathrm {PV,DV}}\,xy}\), \(L_{xy}\), \(|d_{0}|^{\mathrm {max}}\)-sig., \(|d_{0}|^{\mathrm {min}}\)-sig. and \(\chi ^2_{\mu ,\mathrm {xPV}}\). Conversely, the variables IP\(_B^{3\mathrm {D}}\), DOCA\(_{\mu \mu }\) and \(I_{0.7}\) have negligible correlation with any of the others used in the classifier.

Table 2 Description of the 15 variables used in the discrimination between signal and continuum background. When the BDT classifier is applied to \(B^+ \rightarrow J/\psi \,K^+\) and \(B^0_s \rightarrow J/\psi \, \phi \) candidates, the variables related to the decay products of the B mesons refer only to the muons from the decay of the \(J/\psi \)

The MC sample for signal and the large MC sample of semileptonic decays of hadrons containing b or c quarks are used for training and testing the classifier. As discussed in Sect. 3, signal and background samples are reweighted according to the distributions of \(p_{\text {T}}\) and \(|\eta |\) of the dimuon and of the number of reconstructed PVs observed in data. To reproduce accurately the 2012 data distributions, MC events belonging to different trigger streams are reweighted according to the relative equivalent luminosity and to two different versions of the Level-2 muon reconstruction algorithm used during the data taking. The BDT training is done using TMVA [28].

Figure 2 shows the distribution of the BDT output variable for signal and background, separately for continuum background and partially reconstructed events. Also shown is the BDT distribution for dimuon candidates from data, from the sidebands of the invariant mass distribution. In both the signal and background MC samples, the absolute value of the linear correlation coefficient between the BDT output and the dimuon invariant mass is smaller than 1 %. The final selection requires a continuum-BDT output value larger than 0.24, corresponding to a signal relative efficiency of \(54~\%\) (see Sect. 11), and to a reduction of the continuum background by a factor of about \(10^{-3}\).

Fig. 2
figure 2

Continuum-BDT distribution for the signal and background events: signal \(B^0{}\!_{(s)}\), partially reconstructed B events (SS+SV), \(B_c\) decays and continuum. The solid histograms are obtained from simulation, while the points represent data collected in the sidebands. All distributions are normalised to unity. The distributions are shown after the preliminary selection, and before applying any reweighting to the variables used in the classifier

8 Data–simulation comparisons

The distributions of the discriminating variables are used to compare the MC sample of semileptonic decays with data in the dimuon sidebands. Figure 3 shows the distributions for two discriminating variables. Agreement with the sideband data is fair and the discrepancies observed do not compromise the use of this MC background sample for the purpose of training the continuum-BDT. The continuum MC simulation is not used for computation of efficiencies or normalisation purposes.

The distributions of the discriminating variables are also used for the comparison of \(B^+ \rightarrow J/\psi \,K^+\) and \(B^0_s \rightarrow J/\psi \, \phi \) events between simulation and data. To perform such comparison, for each variable the contribution of the background is subtracted from the signal. For this purpose, a maximum-likelihood fit is performed to the invariant mass distribution, separately in the four trigger and data categories. For \(B^+\), the signal is described by two overlying Gaussian distributions, an error function for the partially reconstructed decays and an exponential function for the continuum background. The fit model is simpler than the one used for the extraction of the \(B^+\) signal used for normalisation after the final selection, described in Sect. 9, but it is sufficient for the purpose discussed here. For \(B^0_s \rightarrow J/\psi \, \phi \), a Gaussian distribution is used for the signal and a third-order Chebychev polynomial for the background. For each discriminating variable, the background distribution observed in the sidebands is interpolated to the signal region, normalised according to the result of the likelihood fit, and subtracted from the distribution observed in the signal region.

Fig. 3
figure 3

Data and continuum MC distributions of \(|\alpha _{2\mathrm {D}}|\) (a) and \(\chi ^2{}_{\mu ,\mathrm {xPV}}\) (b) variables (see Table 2). The dots correspond to the 2012 sideband data, while the continuous-line histogram corresponds to the continuum MC distribution, normalised to the number of data events. The filled-area histogram shows the signal MC distribution for comparison. Discrepancies between MC events and sideband data like the one observed for \(\chi ^2_{\mu ,\mathrm {xPV}}\) do not compromise significantly the optimisation of the continuum-BDT classifier

Figure 4 shows examples of the distributions of the discriminating variables obtained from data and simulation. In general, the overall shapes of distributions are in good agreement between data and MC events. Observed differences are accounted for as systematic effects with the procedure described in Sect. 10. The discrepancy shown for the isolation variable \(I_{0.7}\) in the \(B^+ \rightarrow J/\psi \,K^+\) channel is the most significant one among all variables and both reference channels.

Fig. 4
figure 4

Data and MC distributions in \(B^+ \rightarrow J/\psi \,K^+\) events for the discriminating variables: \(|\alpha _{2\mathrm {D}}|\) (a), \(\chi ^{2}{}_{{\mathrm PV,DV}\,xy}\) (b) and \(I_{0.7}\) (c). The variable \(I_{0.7}\) is also shown for \(B^0_s \rightarrow J/\psi \, \phi \) events (d). The black dots correspond to the sideband-subtracted data, while the red histogram corresponds to the MC distribution, normalised to the number of data events. Differences in shape between MC events and data are accounted for as systematic effects. The discrepancy shown for \(I_{0.7}\) in the \(B^+ \rightarrow J/\psi \,K^+\) channel is the most significant among all variables and both reference channels

9 Yield extraction for the normalisation channel \(B^+ \rightarrow J/\psi K^+\)

The \(B^+\) yield for the normalisation channel is extracted with an unbinned extended maximum-likelihood fit to the \(J/\psi K^+\) invariant mass distribution. The functional forms used to model both the signal and the backgrounds are obtained from studies of MC samples. All the yields are extracted from the fit to data, while the shape parameters are determined from a simultaneous fit to data and MC samples. Free parameters are introduced for the mass scale and mass resolution to accommodate data–MC differences.

Fig. 5
figure 5

\(J/\psi K^+\) invariant mass distribution for all \(B^+\) candidates in the \(T_1\) trigger category in 2012 data in linear (a) and logarithmic (b) scale. The result of the fit is overlaid. The various components of the spectrum are described in the text. The insets at the bottom of the plots show the bin-by-bin pulls for the fits, where the pull is defined as the difference between the data point and the value obtained from the fit function, divided by the error from the fit

The fit includes four components: \(B^+ \rightarrow J/\psi \,K^+\) events, Cabibbo-suppressed \(B^+ \rightarrow J/\psi \, \pi ^+\) events on the right tail of the main peak, partially reconstructed B decays (PRD) where one or more of the final-state particles are missing, and the continuum background composed mostly of \(b\bar{b}\rightarrow J/\psi X\) events. The shape of the \(B^+ \rightarrow J/\psi \,K^+\) distribution is parameterised using a Johnson \(S_U\) function [29, 30] and a Gaussian function for the \(T_1\), \(T_2\) and 2011 categories, while a single Johnson \(S_U\) function is used for the \(T_3\) category. The final \(B^+ \rightarrow J/\psi \,K^+\) yield includes the contribution from radiative decays. The \(B^+ \rightarrow J/\psi \, \pi ^+\) events are modelled by the sum of a Johnson \(S_U\) and a Gaussian function, where all parameters are determined from the simulated data. The PRD are described with combinations of Fermi–Dirac and exponential functions, slightly different between the different categories in the low-mass region. Their shape parameters are determined from simulation. Finally, the continuum background is modelled with an exponential function with the shape parameter extracted from the fit. As an example, the fit for the \(T_1\) category is shown in Fig. 5. The results of the fits in all data categories are shown in Table 3.

Some of the systematic effects are included automatically in the fit: the effect of limited MC sample size, for example, is included in the uncertainties through a simultaneous fit to data and MC samples. Scaling factors determined in the fit to data account for the differences in mass scale and resolution between data and simulation. Additional systematic uncertainties are evaluated by varying the default fit model described above: they take into account the kinematic differences between data and the MC samples used in the fit, differences in efficiency between \(B^+\) and \(B^-\) decays, uncertainties in the relative fractions and shapes of PRD, and in the shape of the continuum background. In each case, the difference with respect to the default fit is recorded, symmetrised and used as an estimate of the systematic uncertainty. The main contributions to the systematic uncertainty come from the shape of the continuum background, the relative fractions of PRD and the signal charge asymmetry. The total statistical and systematic uncertainty in the \(B^+\) normalisation yield amounts to \(0.8~\%\).

9.1 \(B^+ \rightarrow J/\psi \, \pi ^+\) / \(B^+ \rightarrow J/\psi \,K^+\) branching fraction ratio measurement

For further validation of the fit to the \(B^+ \rightarrow J/\psi \,K^+\) yield, the fit described in Sect. 9 is used to extract the yields for \(B^+ \rightarrow J/\psi \,K^+\) and \(B^+ \rightarrow J/\psi \, \pi ^+\) decays and obtain the ratio \(\rho _{\pi /K}\) of the corresponding branching fractions. The measurement is performed separately in the four categories, and combined into an uncertainty-weighted mean \( {\rho _{\pi /K}}\). Table 3 shows the fitted yields.

Most systematic effects cancel in the measurement of this ratio. Residual systematic uncertainties in the ratio of the branching fractions come from the uncertainties in the \(K^{-}/K^{+}\), \(\pi ^{-}/\pi ^{+}\) and \(K^{+}/\pi ^{+}\) relative efficiencies. For each systematic effect the ratio is re-evaluated, therefore accounting for correlated effects. The largest systematic uncertainty in the measured ratio comes from the continuum background model parameterisation (\(23~\%\)), followed by the effect of the uncertainties in the PRD fraction estimates (\(15~\%\)). All other systematic sources have uncertainties at the level of \(10~\%\) or less. The final result for the ratio of branching fractions is:

$$\begin{aligned} {\rho _{\pi /K}} = \frac{\mathcal{B}(B^{+} \rightarrow J/\psi \pi ^{+} )}{\mathcal{B}(B^{+} \rightarrow J/\psi K^{+}) } = 0.035 \pm 0.003 \pm 0.012 \,, \end{aligned}$$

where the first uncertainty is statistical and the second is systematic. The result is in agreement with the most accurate available results from LHCb (\(0.0383 \pm 0.0011 \pm 0.0007\) [31]) and BABAR (\(0.0537 \pm 0.0045 \pm 0.0011\) [32]).

10 Evaluation of the \(B^+ \rightarrow J/\psi \,K^+\) to \(B^0{}\!_{(s)}\rightarrow \mu ^{+} \mu ^{-} \) efficiency ratio

The ratio of efficiencies for \(B^+ \rightarrow J/\psi \,K^+\) and \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) enters the \(\mathcal {D}_{\mathrm {norm}}\) term defined in Eq. (2). Both channels are measured in the fiducial volume of the B meson defined as \(p_{\text {T}} ^B>8.0\) GeV and \(\left| \eta _B\right| <2.5\).

Table 3 Results of the fits to the events reconstructed as \(B^+ \rightarrow J/\psi \,K^+ \) in each trigger and data category. Uncertainties are statistical and systematic, respectively

The total efficiencies within the fiducial volume include acceptance and trigger, reconstruction and selection efficiencies. The acceptance is defined by the selection placed on the particles in the final state: \(p_{\text {T}} ^{\mu }>4.0\) GeV and \(|\eta _{\mu }|<2.5\) for muons, \(p_{\text {T}} ^{K}>1.0\) GeV and \(\left| \eta _{K}\right| <2.5\) for kaons. In addition to the reweighting of the distributions of \(p_{\text {T}} ^B\), \(|\eta ^B|\) and the number of reconstructed PVs observed in data, the MC samples are reweighted according to the equivalent integrated luminosity associated with each trigger category and the Level-2 muon trigger algorithms used in 2012.

The trigger efficiencies are taken from a data-driven study based on the comparison of single-muon and dimuon triggers for events containing muon pairs from the decays of \(J/\psi \) and \(\Upsilon \) resonances [33]. Reconstruction and selection efficiencies are obtained from simulation. The signal selection requires the output of the continuum-BDT to be larger than 0.24.

Table 4 Values of the efficiency ratios \(R_{\varepsilon }^k\) for the 2012 trigger categories and the 2011 sample, and their relative contributions to \(\mathcal {D}_{\mathrm {norm}}\) (Eq. (2)). The first uncertainty is statistical and the second systematic. The systematic component includes the uncertainties from the MC reweighting and from data–MC discrepancies, as described in the text. The correction due to the \(B_s^0 \) effective lifetime value discussed in the text is not applied to the numbers shown

All efficiency terms are computed separately for the three trigger selections used in 2012 and for the 2011 sample. Table 4 provides the values of the efficiency ratios \(R_{\varepsilon }^k\), for each of the categories (\(k=1-4\)), together with the statistical and systematic uncertainties described below.

The efficiency ratios shown in Table 4 are computed using the mean lifetime of \(B_s^0 \) [25, 34] in the MC generator. The same efficiency ratios apply to the \(B^0_s \rightarrow \mu ^+\mu ^-\) and \(B^0 \rightarrow \mu ^+\mu ^-\) decays, within the MC statistical uncertainty of \({\pm }0.5~\%\).

The statistical uncertainties in the efficiency ratios come from the finite number of events available for the simulated samples. The systematic uncertainty affecting \(R_{\varepsilon }^k\) comes from four sources. A first contribution is due to the uncertainties in the DDW. This term is assessed from pseudo-MC studies, performed by varying the corrections within their statistical uncertainties. The RMS value of the distribution of \(R_{\varepsilon }^k\) obtained from pseudo-MC samples is taken as the systematic uncertainty. The uncertainties range from ±1 to ±6 % depending on the category considered.

A second contribution is related to the trigger efficiencies. The effects of the statistical uncertainties in the data-driven efficiencies is evaluated with pseudo-MC studies, obtaining values in the range of ±1.5 to ±7 % in the different categories. An additional ±1.5 % uncertainty is added in quadrature for systematic effects. This term includes uncertainties in the Level-2 muon trigger algorithm, which are evaluated through data-driven studies performed using \(J/\psi \, K^+\) and \(\mu ^+\mu ^-\) candidates, and cancel to a large extent in the ratio of normalisation and signal channels.

A third source of systematic uncertainty arises from the differences between data and simulation observed in the modelling of the discriminating variables used in the continuum-BDT classifier (Table 2). For each of the 15 variables, the MC samples for \(B^0_s \rightarrow \mu ^+\mu ^-\) and \(B^+ \rightarrow J/\psi \,K^+\) are reweighted according to the distribution of the variable observed in \(B^+ \rightarrow J/\psi \,K^+\) events from the data sample, after background subtraction. The isolation variable \(I_{0.7}\) is computed using charged-particle tracks only, and differences between \(B^+\) and \(B_s^0\) are expected and were observed in previous studies [14]. Hence for this variable the reweighting procedure for the \(B^0_s \rightarrow \mu ^+\mu ^-\) MC sample is based on \(B^0_s \rightarrow J/\psi \, \phi \) data. For all discriminating variables but \(I_{0.7}\), the value of the efficiency ratio is modified by less than 2 % by the reweighting procedure. For these variables, each variation is taken as an independent contribution to the systematic uncertainty in the efficiency ratio. For \(I_{0.7}\) the reweighing procedure changes the efficiency ratio by \(-5.3\) % for the 2012 data sample. A smaller effect is found for the 2011 sample, obtained with a different MC generator. Because of the significant mis-modelling, the 2012 MC samples obtained after reweighting on the distribution of \(I_{0.7}\) are taken as a reference, thus correcting the central value of the efficiency ratio. The uncertainty in the correction is \({\pm }3.2\) % and is added to the sum in quadrature of the uncertainties assigned to the other discriminating variables. The total uncertainty in the modelling of the discriminating variables is the dominant contribution to the systematic uncertainties shown in Table 4.

A fourth source of systematic uncertainty arises from differences between the \(B^0_s \rightarrow \mu ^+\mu ^-\) and the \(B^+ \rightarrow J/\psi \,K^+\) channels related to the reconstruction efficiency of the kaon track and of the \(B^+\) decay vertex [35]. These uncertainties are mainly related to inaccuracy in the modelling of passive material in the ID system and have been validated by studies performed on data. The corresponding systematic uncertainty is \({\pm }3.6\) %.

The efficiency ratios enter in Eq. (1) with the \(\mathcal {D}_{\mathrm {norm}}\) term defined in Eq. (2). For each category k, the efficiency ratio is multiplied by the number of observed \(B^+\) candidates and the trigger prescaling factor. The relative contributions of the \(T_1\), \(T_2\), \(T_3\) and 2011 categories are shown in Table 4. The uncertainties in \(R_{\varepsilon }^k\) are weighted accordingly and combined. For the trigger categories of the 2012 data sample, the correlations among the uncertainties due to DDW, trigger efficiency and mis-modelling of the discriminating variables are taken into account. Table 5 shows the different contributions and the total uncertainty on \(\mathcal {D}_{\mathrm {norm}}\), equal to \({\pm }5.9\) %.

Table 5 Summary of the systematic uncertainties in the \(\mathcal {D}_{\mathrm {norm}}\) term of Eq. (2)

A correction to the efficiency ratio for \(B^0_s \rightarrow \mu ^+\mu ^-\) is needed because of the width difference \(\Delta \Gamma _s\) between the \(B_s^0\) eigenstates. According to the SM, the decay \(B^0_s \rightarrow \mu ^+\mu ^-\) proceeds predominantly through the heavy state \(B_{s,\mathrm {H}}\) [1, 15], which has width \(\Gamma _{s,\mathrm {H}}= \Gamma _s - \Delta \Gamma _s/2\), i.e. \((6.2\pm 0.5)\) % smaller than the average \(\Gamma _s\) [34]. The variation in the value of the \(B^0_s \rightarrow \mu ^+\mu ^-\) mean lifetime was tested with simulation, and found to change the \(B_s^0\) efficiency by \(+4\) %, and consequently the \(B_s^0\) to \(B^+\) efficiency ratio. This correction is applied to the central value of \(\mathcal {D}_{\mathrm {norm}}\) used in Sect. 12 for the determination of \(\mathcal B\)(\(B^0_s \rightarrow \mu ^+\mu ^-\)).Footnote 2 Due to the small value of \(\Delta \Gamma _d\), no correction needs to be applied to the \(B^0 \rightarrow \mu ^+\mu ^-\) decay.

10.1 Comparison of normalisation yields with other measurements

The systematic acceptance and efficiencies uncertainties are minimised by using \(B^+ \rightarrow J/\psi \,K^+\) as the normalisation channel and evaluating only efficiency ratios. However, event counts and absolute efficiency values for the reference channels can be used to extract the production cross sections for the purposes of comparisons with other measurements.

The yield of \(B^+\) can be compared to the one obtained by ATLAS with 2.7 fb\(^{-1}\) of data collected at \(\sqrt{s}= 7\) TeV [35], and based on the same decay channel. In the comparison, the data collected at \(\sqrt{s}= 8\) TeV for the present analysis were restricted to the phase space \(p_{\text {T}} ^B>9.0\) GeV and \(\left| \eta _B\right| <2.25\) used for the previous result. Trigger and preliminary selections are very similar, but the selections against continuum background and fake muons are used only in the present analysis. The difference in the collision energy is taken into account by comparing the measured production cross section to the prediction based on the fixed-order next-to-leading-log (FONLL) approximation [36]. The theoretical uncertainty in the extrapolation from 7 to 8 TeV is expected to be small compared to experimental uncertainties. The ratio of the observed to the predicted cross section was measured in Ref. [35] as \(1.24\,\pm \,0.04\,\pm \,0.09\), where the statistical and systematic uncertainties include only the experimental ones. The corresponding value from the present analysis is \(1.17\pm 0.02\pm 0.14\), with the uncertainty dominated by the systematic uncertainty in the efficiency of the continuum-BDT selection. The result is in agreement with the previous measurement. Correlated systematic uncertainties between the two analyses amount to \(\pm 0.05\).

The measurements of \(B^0_s \rightarrow J/\psi \, \phi \) and \(B^+ \rightarrow J/\psi \,K^+\) yield, together with the corresponding acceptance and efficiency values, can be used to extract the production ratio \(B_s^0/B^+ \), for \(10 \lesssim p_{\text {T}} ^B \lesssim 20\) GeV and \(|\eta |<2.5\), in pp collisions at \(\sqrt{s}= 8\) TeV. Using world averages values [25] for the branching fractions to the final states, the resulting mean ratio of the hadronisation fractions \(f_s/f_u\) is equal to \(0.236 \pm 0.014 \pm 0.018 \pm 0.021\), where the first uncertainty is statistical, the second is the systematic uncertainty in the efficiency ratio and the third is the uncertainty in the branching fractions. The ratio is uniform across the kinematic range observed, and it varies by only \(-2~\%\) if the \(B_s^0 \) and \(B^+ \) signals are extracted without applying the continuum-BDT selection. The normalisation procedure might not be free of bias, since the value of \(\mathcal{B}(B^0_s \rightarrow J/\psi \, \phi )\) includes assumptions about \(f_s\), and updating the assumptions may change it by about \(5~\%\). The result nevertheless provides a satisfactory consistency check with the available measurements [37, 38]. The most direct comparison is with the recent value \(f_s/f_d = 0.240 \pm 0.020\) [38], obtained by ATLAS from the analysis of 2.7 fb\(^{-1}\) of data collected at \(\sqrt{s}= 7\) TeV, and performed over the same \(p_{\text {T}} ^B\) and \(\eta ^B\) ranges used in this analysis. The uncertainty in that measurement is dominated by the prediction of the ratio of branching fractions \(\mathcal{B}(B^0_s \rightarrow J/\psi \, \phi )\,/\,\mathcal{B}(B^0 \rightarrow J/\psi K^{*0})\). The ratio of the efficiency-corrected event yields observed at \(\sqrt{s}= 8\) TeV in the present analysis can be compared to the corresponding value from Ref. [38] after rescaling by the ratio of branching fractions \(\mathcal{B}(B^+ \rightarrow J/\psi \,K^+)\, /\, [\mathcal{B}(B^0 \rightarrow J/\psi K^{*0})\!\times \! \mathcal{B}(K^{*0}\rightarrow K^+ \pi ^-)]\), which is known with better accuracy than \(\mathcal{B}(B^0_s \rightarrow J/\psi \, \phi )\). In this way, some systematic uncertainties are removed, and the ratio of the two results is \(0.96\pm 0.12\). The largest contribution to the systematic uncertainty is from the efficiency of the continuum-BDT selection used in the present analysis.

In conclusion, the observed event rates for the normalisation channels are in agreement with previous measurements within uncertainties of about \(12~\%\).

11 Extraction of the signal yield

Dimuon candidates passing the preliminary selection and the multivariate selections against hadron misidentification and continuum background are classified according to three intervals in the continuum-BDT output: 0.240–0.346, 0.346–0.446 and 0.446–1. Each interval corresponds to an equal efficiency of \(18~\%\) for signal events, and they are ordered according to increasing signal-to-background ratio. In each continuum-BDT interval, events from the four trigger and data categories are merged.

An unbinned extended maximum-likelihood fit is performed on the dimuon invariant mass distribution simultaneously across the three continuum-BDT intervals. The result of the fit is the total yield of \(B^0_s \rightarrow \mu ^+\mu ^-\) and \(B^0 \rightarrow \mu ^+\mu ^-\) events in the three BDT intervals. The parameters describing the background are allowed to vary freely and are determined by the fit. The fit model for signal and background is described in Sect. 11.1. The systematic uncertainties related to the BDT intervals, to the signal and to the background model are discussed in Sects. 11.1 and 11.2, and are included in the likelihood with Gaussian multiplicative factors with width equal to the systematic uncertainty.

11.1 Signal and background model

The model for describing signal and background is based on simulations and on data collected in the mass sidebands of the search region.

The invariant mass distribution of the \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) signal is described by a superposition of two Gaussian distributions, both centred at the \(B^0\) or \(B_s^0\) mass. The parameters are extracted from simulation, and they are taken to be uncorrelated with the BDT output. Systematic uncertainties in the mass scale and resolutions are considered separately. Figure 6 shows the invariant mass distributions for \(B^0\) and \(B_s^0\), obtained from MC events and normalised to the SM expectations.

Fig. 6
figure 6

Dimuon invariant mass distribution for the \(B_s^0\) and \(B^0\) signals from simulation. The double Gaussian fits are overlaid. The two distributions are normalised to the SM prediction for the expected yield with an integrated luminosity of 25 fb\(^{-1}\)

The efficiency of the three intervals in the continuum-BDT output for \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) events is calibrated with studies performed on the reference channels. The distribution of the BDT output is compared between MC and background-subtracted data. The differences observed in the ratio of data over simulation are described with a linear dependence on the BDT output. The slopes are equal within \({\pm }12\) % between \(B^+ \rightarrow J/\psi \,K^+\) and \(B^0_s \rightarrow J/\psi \, \phi \)   and the mean value is used to reweight the BDT-output distribution in the \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) MC sample. The corresponding absolute variations in the efficiencies are equal to \(+1.8\) and \(-1.8\) % respectively in the first and third BDT intervals. The values of the lower edge of the second and third BDT intervals are corrected in simulation to obtain equal efficiencies of 18.0 % in each interval.

The systematic uncertainties in the efficiency of the BDT intervals are obtained with a procedure similar to the one used for the event selection (Sect. 10). For each discriminating variable, the MC sample is reweighted according to the difference between simulation and data observed in the reference channels. The variation in the efficiency of each BDT interval is taken as the contribution to the systematic uncertainty due to mis-modelling of that variable. In each BDT interval, the sum in quadrature of the variations of all discriminating variables is found to be similar in the \(B^+ \rightarrow J/\psi \,K^+\) and \(B^0_s \rightarrow J/\psi \, \phi \) channels, and the average of the two is taken as the total systematic uncertainty in the efficiency. Absolute values of \({\pm }2.6\), \({\pm }1.0\) and \({\pm }2.3\) % are found respectively in the first, second and third interval. Gaussian terms are included in the likelihood in order to describe these uncertainties, taking care of constraining the sum of the efficiencies of the three intervals, since that uncertainty is already included in the selection efficiency.

Figure 7 shows the distribution of the continuum-BDT output from data and simulation for the reference channels, after reweighting the MC sample. The MC distribution for \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) events is also shown, illustrating the correction based on the BDT output and the systematic uncertainty discussed above. The reweighting on the \(I_{0.7}\) variables, discussed in Sect. 10 for the evaluation of the efficiency of the final event selection (BDT output \(> 0.24\)), is not applied to the events shown in Fig. 7, and in the evaluation of the relative efficiency of the intervals used for the extraction of the \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) signal. Reweighting the BDT output is preferred over reweighting \(I_{0.7}\), because of correlations present between the discriminating variables after the final selection is applied.

Finally, for the \(B^0_s \rightarrow \mu ^+\mu ^-\) signal, the lifetime difference between \(B_{s,\mathrm {H}}\) and \(B_s^0\) requires further absolute corrections to the efficiency of the BDT intervals of \(+0.3\) and \(+1.8\) % respectively in the second and third interval.

Fig. 7
figure 7

Data and MC distributions of \(B^+ \rightarrow J/\psi \,K^+\) (a) and \(B^0_s \rightarrow J/\psi \, \phi \) (b) for the continuum-BDT output and the MC distributions for \(B^0_s \rightarrow \mu ^+\mu ^-\) (c). The MC samples are normalised to the number of data events. A linear correction has been applied to the MC distributions, equal for all channels, and a systematic uncertainty is assigned to the distribution of the \(B^0{}\!_{(s)} \rightarrow \mu ^+\mu ^-\) MC sample, as discussed in the text and illustrated by the dashed line and the envelope shown in c. The vertical dashed lines in c correspond to the boundaries of the continuum-BDT intervals used for the signal extraction

The background is composed of the types of events described in Sect. 5: (a) the continuum background; (b) the background from partially reconstructed SS and SV events, which is present mainly in the low-mass sideband; (c) the peaking background.

The dependence of the continuum background on the dimuon invariant mass is described with a first-order polynomial. In the simulation, the slope of the distribution is similar in the three continuum-BDT intervals. The correlation between continuum-BDT and dimuon invariant mass is small, and similar between simulation and sideband data within large statistical uncertainties. Hence the slope of the mass dependence is described by independent parameters in the three intervals, subject to loose Gaussian constraints of uniformity within \({\pm }40~\%\) between the first and second interval, and \({\pm }80~\%\) between the first and the third. Such variations of slope are larger than those observed in simulation, and consistent with those determined from data. Deviations from these assumptions are discussed below in Sect. 11.2. The normalisation of the continuum background is also extracted independently in each BDT interval.

The SS+SV background has a dimuon invariant mass distribution peaking below the low-mass sideband region. The mass dependence is derived from data in the low-mass sideband region, and described with an exponential function with equal shape in the three continuum-BDT intervals. The value of the shape parameter is extracted from the fit to data. The normalisation values are extracted independently in each interval.

The invariant mass distribution of the peaking background is very similar to the \(B^0\) signal, as shown in Fig. 1b. In the fit, this contribution is included with fixed mass shape and with a normalisation of \(1.0\;\pm \;0.4\) events, as discussed in Sect. 6. This contribution is equally distributed among the three intervals of continuum-BDT.

The fitting procedure is tested with pseudo-MC experiments, as discussed below. The use of three intervals in the continuum-BDT output is found to optimise the performance of the likelihood fit, with all BDT intervals contributing to the determination of the background, while the second and in particular the third interval provide sensitivity to the signal yield.

11.2 Systematic uncertainties in the fit

Studies based on pseudo-MC experiments are used to assess the sensitivity of the fit to the input assumptions. Variations in the description of signal and background components are used in the generation of the pseudo-MC samples. The corresponding deviations in the average numbers \(N_s\), \(N_d\) of \(B_s^0\) and \(B^0\) events returned by the fit, run in the nominal configuration, are taken as systematic uncertainties. The amplitude of the variations in the generation of the pseudo-MC samples is determined in some cases by known characteristics of the ATLAS detector (reconstructed momentum scale and momentum resolution), in others using MC evaluation (background due to semileptonic three-body \(B^0{}\!_{(s)}\) decays and to \(B_c \rightarrow J/ \psi \mu \)), and in others from uncertainties determined from data in the sidebands and from simulation (shapes of the background components and their variation across the continuum-BDT intervals).

Fig. 8
figure 8

Dimuon invariant mass distributions in the unblinded data, in the three intervals of continuum-BDT output. Superimposed is the result of the maximum-likelihood fit, obtained imposing the boundary of non-negative signal contributions. The total fit is shown as a black continuous line, the filled area corresponds to the observed signal component, the blue dashed line to the SS+SV background, and the green dashed line to the continuum background

The pseudo-MC experiments were generated with the normalisation of the continuum and SS+SV components obtained from the fit to the data in the sideband of the invariant mass distribution, and the peaking background from the expectation discussed in Sect. 6. The signal was generated with different configurations, corresponding to the SM prediction, to smaller values of \(\mathcal B\)(\(B^0_s \rightarrow \mu ^+\mu ^-\)) and to smaller/larger values of \(\mathcal B\)(\(B^0 \rightarrow \mu ^+\mu ^-\)).

For all variations in the assumptions and all configurations of the signal amplitudes, the distributions of the differences between results and generated values, divided by the fit errors (pull distributions), are found to be correctly described by Gaussian functions with widths approximately equal to one and values of the mean smaller than 0.2 for \(B^0_s \rightarrow \mu ^+\mu ^-\) and smaller than 0.4 for \(B^0 \rightarrow \mu ^+\mu ^-\). The distributions obtained from pseudo-MC samples generated according to the nominal fit model are used to evaluate fit biases. For \(B^0_s \rightarrow \mu ^+\mu ^-\) the fit bias is negligible. For \(B^0 \rightarrow \mu ^+\mu ^-\) the bias on the yield is smaller than 25 % of the fit error, and it is included as an additional systematic uncertainty.

The shifts in \(N_s\) or \(N_d\) are combined by considering separately the sums in quadrature of the positive and negative shifts and taking the larger as the symmetric systematic uncertainty. For \(B_s^0\), the total systematic uncertainty is found to increase with the assumed size of the signal, with a dependence \(\sigma _\mathrm {syst} (N_s) = \sqrt{2^2 + (0.06\times N_s)^2}\). The total systematic uncertainty for \(B^0\) is approximately \(\sigma _\mathrm {syst} (N_d) = 3\). Most of the shifts observed have opposite sign for \(N_s\) and \(N_d\), resulting in a combined correlation coefficient in the systematic uncertainties of \(\rho _\mathrm {syst}=-0.7\).

The fit to the yield of \(B_s^0\) and \(B^0\) events is modified by including in the likelihood two smearing parameters for \(N_s\) and \(N_d\) that are constrained by a combined Gaussian distribution parameterised by the values of \(\sigma _\mathrm {syst} (N_s)\), \(\sigma _\mathrm {syst} (N_d)\) and \(\rho _\mathrm {syst}\).

11.3 Results of the signal yield extraction

Including both the 2012 and 2011 data-taking periods, the numbers of background events contained in the signal region (5166–5526 MeV) are computed from the interpolation of the data observed in the sidebands. The values \(509\pm 28\), \(32\pm 6\) and \(4.8\pm 1.9\) events are obtained respectively in the three intervals of continuum-BDT. For comparison, the total expected number of signal events according to the SM prediction is 41 and 5 respectively for \(N_s\) and \(N_d\), equally distributed among the three intervals.Footnote 3

Once the signal region is unblinded, a total of 1951 events in the full mass range of 4766–5966 MeV are used for the likelihood fit to signal and background. Without applying any boundary on the values of the fitted parameters, the values determined by the fit are \(N_s = 16 \pm 12\) and \(N_d = - 11 \pm 9\), where the uncertainties correspond to likelihood variations of \(- 2\,\Delta \ln (L) = 1\). The likelihood includes the systematic uncertainties discussed above, but statistical uncertainties largely dominate. The primary result of this analysis is obtained by applying the natural boundary of non-negative yields, for which the fit returns the values \(N_s = 11\) and \(N_d = 0.\) The uncertainties in the result of the fit are discussed in Sect. 12, where the measured values of the branching fractions are presented.

Figure 8 shows the dimuon invariant mass distributions in the three intervals of continuum-BDT, together with the projections of the likelihood fit.

For comparison, the value \(N_d\) can be constrained according to the SM expectation for the ratio \(\mathcal{B}(B^0 \rightarrow \mu ^+\mu ^-)/\mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-)\) [1] multiplied by the ratio of the hadronisation probabilities \(f_d/f_s\) [38], rather than being extracted independently from the fit. In this case the value of \(N_s\) changes by \(-0.8\), while \(N_d = N_\mathrm {s}/8.3 \approx 1.2\).

12 Branching fraction extraction

The branching fractions for the decays \(B^0_s \rightarrow \mu ^+\mu ^-\) and \(B^0 \rightarrow \mu ^+\mu ^-\) are extracted from data using a profile-likelihood fit. The likelihood is obtained from the one used for \(N_s\) and \(N_d\) replacing the fit parameters with the corresponding branching fractions divided by normalisation terms in Eq. (1), and including Gaussian multiplicative factors for the normalisation uncertainties.

The normalisation terms include external inputs for the \(B^+ \) branching fraction and the relative hadronisation probability. The first is obtained from world averages [25] as the product of \(\mathcal{B}(B^+ \rightarrow J/\psi \,K^+)=(1.027 \pm 0.031) \times 10^{-3}\) and \(\mathcal{B}(J/\psi \rightarrow \mu ^{+} \mu ^{-})=(5.961 \pm 0.033)~\%\). The second is equal to one for \(B^0 \), while for \(B_s^0 \) it is taken from the ATLAS measurement \(f_s/f_d = 0.240 \pm 0.020\) [38], assuming \(f_u/f_d=1\) [34].

The efficiency- and luminosity-weighted number of events for the normalisation channel enters in Eq. (1) with the denominator \(\mathcal {D}_{\mathrm {norm}}\) (Eq. (2)). The values \(\mathcal {D}_{\mathrm {norm}} =(2.88 \pm 0.17) \times 10^{6}\) for \(B_s^0 \) and \((2.77 \pm 0.16) \times 10^{6}\) for \(B^0 \) are obtained using Tables 3 and 4 for each category, together with the combined uncertainty from Table 5, and including the \(+4\) % correction to the \(B^0_s \rightarrow \mu ^+\mu ^-\) efficiency due to the lifetime difference between \(B_{s,\mathrm {H}}\) and \(B_s^0 \).

The combination of \(B^+ \) branching fraction, hadronisation probabilities and \(\mathcal {D}_{\mathrm {norm}}\), i.e. the single-event sensitivity, is equal to \((8.9\pm 1.0)\times 10^{-11}\) for \(B^0_s \rightarrow \mu ^+\mu ^-\) and \((2.21\pm 0.15)\times 10^{-11}\) for \(B^0 \rightarrow \mu ^+\mu ^-\).

The values of the branching fractions that maximise the profile-likelihood within the constraint of non-negative values are \(\mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-) = 0.9 \times 10^{-9}\) and \(\mathcal{B}(B^0 \rightarrow \mu ^+\mu ^-) = 0\). That constraint is applied for all results discussed in this section if not otherwise stated.

A Neyman construction [39] is used to determine the 68.3 % confidence interval for \(\mathcal B\)(\(B^0_s \rightarrow \mu ^+\mu ^-\)) with pseudo-MC experiments, obtaining:

$$\begin{aligned} \mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-) = \left( 0.9^{+1.1}_{-0.8} \right) \times 10^{-9}. \end{aligned}$$

The uncertainties include both the statistical and systematic contributions. The two components are separated by repeating the likelihood fit after setting all systematic uncertainties to zero. The statistical uncertainty is dominant, with the systematic uncertainty equal to \(\pm \,0.3\times 10^{-9}\).

The observed significance of the \(B^0_s \rightarrow \mu ^+\mu ^-\) signal is determined from pseudo-MC experiments, with a hypothesis test based on the likelihood ratio \(-\ln [L(\mathrm {no\!-\!signal})/L(\mathrm {max})]\) [40], and is equal to 1.4 standard deviations. For this test, \(\mathcal B\)(\(B^0 \rightarrow \mu ^+\mu ^-\)) is left free to be determined in the fit. The corresponding expected significance is 3.1 standard deviations for the SM predictions \(\mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-) = (3.65 \pm 0.23) \times 10^{-9}\) and \(\mathcal{B}(B^0 \rightarrow \mu ^+\mu ^-) = (1.06 \pm 0.09) \times 10^{-10}\) [1].

Pseudo-MC experiments are also used to evaluate the compatibility of the observation with the SM prediction. A hypothesis test based on \(-\ln [L(\mathrm {SM})/L(\mathrm {max})]\) is performed for the simultaneous fit to \(\mathcal B\)(\(B^0_s \rightarrow \mu ^+\mu ^-\)) and \(\mathcal B\)(\(B^0 \rightarrow \mu ^+\mu ^-\)). The result is \(p=0.048\pm 0.002\), corresponding to 2.0 standard deviations.

Figure 9 shows the contours in the plane of \(\mathcal B\)(\(B^0_s \rightarrow \mu ^+\mu ^-\)) and \(\mathcal B\)(\(B^0 \rightarrow \mu ^+\mu ^-\)) drawn for values of \(-2\, \Delta \ln (L)\) equal to 2.3, 6.2 and 11.8, relative to the maximum of the likelihood, allowing negative values of the branching fractions. The maximum within the physical boundary is shown with error bars indicating the 68.3 % interval for the value of \(\mathcal B\)(\(B^0_s \rightarrow \mu ^+\mu ^-\)). Also shown are the corresponding contours obtained in the combination of the results of the CMS and LHCb experiments[13], and the prediction based on the SM.

Fig. 9
figure 9

Contours in the plane \(\mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-), \mathcal{B}(B^0 \rightarrow \mu ^+\mu ^-)\) for intervals of \(-2\, \Delta \ln (L)\) equal to 2.3, 6.2 and 11.8 relative to the absolute maximum of the likelihood, without imposing the constraint of non-negative branching fractions. Also shown are the corresponding contours for the combined result of the CMS and LHCb experiments, the SM prediction, and the maximum of the likelihood within the boundary of non-negative branching fractions, with the error bars covering the 68.3 % confidence range for \(\mathcal B\)(\(B^0_s \rightarrow \mu ^+\mu ^-\))

Using the CL\(_\mathrm {s}\) method [41] implemented with pseudo-MC experiments, an upper limit is placed on the \(B^0_s \rightarrow \mu ^+\mu ^-\) branching fraction at the 95 % confidence level:

$$\begin{aligned} \mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-) < 3.0\times 10^{-9}\, (95~\% \, \mathrm {CL}). \end{aligned}$$

The limit is obtained under the hypothesis of background only, with \(\mathcal B\)(\(B^0 \rightarrow \mu ^+\mu ^-\)) left free to be determined in the fit. The expected limit is \(1.8^{+0.7}_{-0.4} \times 10^{-9}\).

An upper limit based on the CL\(_\mathrm {s}\) method is also set on \(\mathcal B\)(\(B^0 \rightarrow \mu ^+\mu ^-\)). The expected limit obtained from pseudo-MC samples generated according to the observed amplitudes of backgrounds and \(B_s^0 \) signal is \(\left( 5.7^{+2.1}_{-1.5} \right) \times 10^{-10}\) at a confidence level of 95 %. The observed limit is:

$$\begin{aligned} \mathcal{B}(B^0 \rightarrow \mu ^+\mu ^-) < 4.2\times 10^{-10}\, (95~\% \, \mathrm {CL}). \end{aligned}$$

The observed upper limit is above the SM prediction and also covers the central value of the combination of the measurements by CMS and LHCb [13]. The expected significance for \(\mathcal B\)(\(B^0 \rightarrow \mu ^+\mu ^-\)) according to the SM prediction is equal to 0.2 standard deviations.

13 Conclusions

A study of the rare decays of \(B_s^0\) and \(B^0\) mesons into oppositely charged muon pairs is presented, based on 25 fb\(^{-1}\) of 7 and 8 TeV proton–proton collision data collected by the ATLAS experiment in Run 1 of LHC.

For \(B^0\) an upper limit \(\mathcal{B}(B^0 \rightarrow \mu ^+\mu ^-) < 4.2 \times 10^{-10}\) is placed at the 95 % confidence level, based on the CL\(_\mathrm {s}\) method. The limit is compatible with the predictions based on the SM and with the combined result of the CMS and LHCb experiments.

For \(B_s^0\) the result is \(\mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-) = \left( 0.9^{+1.1}_{-0.8} \right) \times 10^{-9}\), where the uncertainty includes both the statistical and systematic components. An upper limit \(\mathcal{B}(B^0_s \rightarrow \mu ^+\mu ^-) < 3.0 \times 10^{-9}\) at 95 % CL is placed, lower than the SM prediction, and in better agreement with the measurement of CMS and LHCb.

A p value of 4.8 % is found for the compatibility of the results with the SM prediction.