1 Introduction

Jets are collimated cascades of particles produced at particle accelerators. Quarks and gluons originating from hadron collisions, such as the proton-proton collisions at the CERN Large Hadron Collider (LHC), generate a cascade of other particles (mainly other quarks or gluons) that then arrange themselves into hadrons. The stable hadrons, together with the decay products of the unstable ones, are observed by large particle detectors, reconstructed by algorithms that combine the information from different detector components, and then clustered into jets, using physics-motivated sequential recombination algorithms such as those described in Refs. [1,2,3]. Jet identification, or tagging, algorithms are designed to identify the nature of the particle that initiated a given cascade, inferring it from the collective features of the particles generated in the cascade.

Fig. 1: Pictorial representations of the different jet categories considered in this paper. Left: jets originating from quarks or gluons produce one cluster of particles, approximately cone-shaped, developing along the flight direction of the quark or gluon that started the cascade. Center: when produced with large momentum, a heavy boson decaying to quarks results in a single jet, made of two particle clusters (usually referred to as prongs). Right: a high-momentum \(\mathrm {t} \rightarrow \mathrm {W} \mathrm {b} \rightarrow \mathrm {q} \overline{\mathrm {q}} ^{\prime }\mathrm {b} \) decay chain results in a jet composed of three prongs

Traditionally, jet tagging was meant to distinguish three classes of jets: light flavor quarks \(\mathrm {q} =\mathrm {u},\mathrm {d},\mathrm {s},\mathrm {c} \), gluons \(\mathrm {g} \), or bottom quarks (\(\mathrm {b} \)). At the LHC, due to the large collision energy, new jet topologies emerge. When heavy particles, e.g. \(\mathrm {W} \), \(\mathrm {Z} \), or Higgs (\(\mathrm {H} \)) bosons or the top quark, are produced with large momentum and decay to all-quark final states, the resulting jets are contained in a small solid angle. A single jet emerges from the overlap of two (for bosons) or three (for the top quark) jets, as illustrated in Fig. 1. These jets are characterized by a large invariant mass (computed from the sum of the four-momenta of their constituents) and differ from ordinary quark and gluon jets by their distinctive momentum flow around the jet axis.

Several techniques have been proposed to identify these jets by using physics-motivated quantities, collectively referred to as “jet substructure” variables. A review of the different techniques can be found in Ref. [4]. As discussed in the review, approaches based on deep learning (DL) have been extensively investigated (see also Sect. 2), processing sets of physics-motivated quantities with dense layers or raw data representations (e.g. jet images or particle feature lists) with more complex architectures (e.g. convolutional or recurrent networks).

In this work, we compare the typical performance of some of these approaches to what is achievable with a novel jet identification algorithm based on an interaction network (JEDI-net). Interaction networks [5] (INs) were designed to decompose complex systems into distinct objects and relations, and to reason about their interactions and dynamics. One of the first uses of INs was to predict the evolution of physical systems under the influence of internal and external forces, for example to emulate the effect of gravitational interactions in n-body systems. In that application, the n bodies are embedded as the vertices of a graph, and one-on-one interaction functions, expressed as trainable neural networks, are used to predict the post-interaction state of the system. We study whether this type of network generalizes to a novel context in high energy physics. In particular, we represent a jet as a set of particles, each of which is represented by its momentum and embedded as a vertex in a fully connected graph. We use neural networks to learn a representation of each one-on-one particle interaction in the jet, which we then use to define jet-related high-level features (HLFs). Based on these features, a classifier associates each jet with one of the five categories shown in Fig. 1.

For comparison, we consider other classifiers based on different architectures: a dense neural network (DNN) [6] receiving a set of jet-substructure quantities, a convolutional neural network (CNN) [7,8,9] receiving an image representation of the transverse momentum (\(p_{\mathrm {T}} \)) flow in the jet, and a recurrent neural network (RNN) with gated recurrent units [10] (GRUs), which processes a list of particle features. These models can achieve state-of-the-art performance, although they require additional ingredients: the DNN model requires processing the constituent particles to pre-compute HLFs, the GRU model assumes an ordering criterion for the input particle feature list, and the CNN model requires representing the jet as a rectangular, regular, pixelated image. Each of these aspects can be handled in a reasonable way (e.g., one can use a jet clustering metric to order the particles), sometimes at the cost of some detector performance (e.g., image pixels coarser than the angular resolution of a realistic tracking detector, as in many CNN-based models). It is therefore worth exploring alternative solutions that could reach state-of-the-art performance without making these assumptions. In particular, it is interesting to consider architectures that directly take the jet constituents as input and are invariant under their permutation. This motivated the study of jet taggers based on recursive networks [11], graph networks [12, 13], and energy flow networks [14]. In this context, we aim to investigate the potential of INs.

This paper is structured as follows: we provide a list of related works in Sect. 2. In Sect. 3, we describe the data set used. The structure of the JEDI-net model is discussed in Sect. 4, together with the alternative architectures considered for comparison. Results are shown in Sect. 5. Sections 6 and 7 discuss what JEDI-net learns when processing the graph and quantify the amount of resources needed by the tagger, respectively. We conclude with a discussion and outlook in Sect. 8. “Appendix A” describes the design and optimization of the alternative models.

2 Related work

Jet tagging is one of the most popular LHC-related tasks to which DL solutions have been applied. Several classification algorithms have been studied in the context of jet tagging at the LHC [15,16,17,18,19,20,21,22] using DNNs, CNNs, or physics-inspired architectures. Recurrent and recursive layers have been used to construct jet classifiers starting from a list of reconstructed particle momenta [11,12,13]. Recently, these different approaches, applied to the specific case of top quark jet identification, have been compared in Ref. [23]. While many of these studies focus on data analysis, work is underway to apply these algorithms in the early stages of LHC real-time event processing, i.e. the trigger system. For example, Ref. [24] focuses on converting these models into firmware for field programmable gate arrays (FPGAs) optimized for low latency (less than 1 \(\upmu \)s). If successful, such a program could allow for a more resource-efficient and effective event selection for future LHC runs.

Graph neural networks have also been considered as jet tagging algorithms [25, 26], as a way to circumvent the sparsity of image-based representations of jets. These approaches demonstrate remarkable categorization performance. Motivated by the early results of Ref. [25], graph networks have also been applied to other high energy physics tasks, such as event topology classification [27, 28], particle tracking in a collider detector [29], pileup subtraction at the LHC [30], and particle reconstruction in irregular calorimeters [31].

3 Data set description

This study is based on a data set consisting of simulated jets with transverse momentum \(p_{\mathrm {T}} \approx 1\) TeV, originating from light quarks \(\mathrm {q} \), gluons \(\mathrm {g} \), \(\mathrm {W} \) and \(\mathrm {Z} \) bosons, and top quarks produced in \(\sqrt{s} = 13\,\text {TeV} \) proton-proton collisions. The data set was created using the configuration and parametric description of an LHC detector described in Refs. [24, 32], and is available on the Zenodo platform [33,34,35,36].

Fig. 2: Distributions of the 16 high-level features used in this study, described in Ref. [24]

Fig. 3: Average \(100\times 100\) images for the five jet classes considered in this study: \(\mathrm {q} \) (top left), \(\mathrm {g} \) (top center), \(\mathrm {W} \) (top right), \(\mathrm {Z} \) (bottom left), and top jets (bottom right). The temperature map represents the amount of \(p_{\mathrm {T}} \) collected in each cell of the image, measured in GeV and computed from the scalar sum of the \(p_{\mathrm {T}} \) of the particles pointing to each cell

Fig. 4: Example \(100\times 100\) images for the five jet classes considered in this study: \(\mathrm {q} \) (top left), \(\mathrm {g} \) (top right), \(\mathrm {W} \) (center left), \(\mathrm {Z} \) (center right), and top jets (bottom). The temperature map represents the amount of \(p_{\mathrm {T}} \) collected in each cell of the image, measured in GeV and computed from the scalar sum of the \(p_{\mathrm {T}} \) of the particles pointing to each cell

Jets are clustered from individual reconstructed particles, using the anti-\(k_{\mathrm {T}} \) algorithm [3, 37] with jet-size parameter \(R=0.8\). Three different jet representations are considered:

  • A list of 16 HLFs, described in Ref. [24], given as input to a DNN. The 16 distributions are shown in Fig. 2 for the five jet classes.

  • An image representation of the jet, derived by considering a square with pseudorapidity and azimuthal distances \(\varDelta \eta =\varDelta \phi =2R\), centered along the jet axis. The image is binned into \(100\times 100\) pixels. Such a pixel size is comparable to the cell size of a typical LHC electromagnetic calorimeter, but much coarser than the typical angular resolution of a tracking device for the \(p_{\mathrm {T}} \) values relevant to this task. Each pixel is filled with the scalar sum of the \(p_{\mathrm {T}} \) of the particles in that region. These images are obtained by considering the 150 highest-\(p_{\mathrm {T}} \) constituents for each jet (see the code sketch after this list). This jet representation is used to train a CNN classifier. The average jet images for the five jet classes are shown in Fig. 3. For comparison, a randomly chosen set of images is shown in Fig. 4.

  • A constituent list for up to 150 particles, in which each particle is represented by 16 features computed from the particle four-momenta: the three Cartesian coordinates of the momentum (\(p_x\), \(p_y\), and \(p_z\)), the absolute energy E, \(p_{\mathrm {T}} \), the pseudorapidity \(\eta \), the azimuthal angle \(\phi \), the distance \(\varDelta R = \sqrt{\varDelta \eta ^2 + \varDelta \phi ^2}\) from the jet center, the relative energy \(E^{\mathrm {rel}} = E^\mathrm {particle}/E^\mathrm {jet}\) and relative transverse momentum \(p_{\mathrm {T}} ^{\mathrm {rel}} = p_{\mathrm {T}} ^\mathrm {particle}/p_{\mathrm {T}} ^\mathrm {jet}\), defined as the ratios of the particle quantities to the jet quantities, the relative coordinates \(\eta ^{\mathrm {rel}}=\eta ^\mathrm {particle} - \eta ^\mathrm {jet}\) and \(\phi ^{\mathrm {rel}}=\phi ^\mathrm {particle} - \phi ^\mathrm {jet}\) defined with respect to the jet axis, \(\cos {\theta }\) and \(\cos {\theta ^{\mathrm {rel}}}\), where \(\theta ^\mathrm {rel} = \theta ^\mathrm {particle} - \theta ^\mathrm {jet}\) is defined with respect to the jet axis, and the relative \(\eta \) and \(\phi \) coordinates of the particle after applying a proper Lorentz transformation (rotation) as described in Ref. [38]. Whenever fewer than 150 particles are reconstructed, the list is padded with zeros. The distributions of these features for the 150 highest-\(p_{\mathrm {T}} \) particles in the jet are shown in Fig. 5 for the five jet categories. This jet representation is used for an RNN with a GRU layer and for JEDI-net.
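As an illustration of the image-based representation described in the second item above, the following minimal sketch bins the 150 highest-\(p_{\mathrm {T}} \) constituents of one jet into a \(100\times 100\) \(p_{\mathrm {T}} \) image. The input format (arrays of constituent \(p_{\mathrm {T}} \), \(\eta \), \(\phi \) and the jet-axis coordinates) and all names are illustrative assumptions, not the code used to produce the data set.

```python
import numpy as np

def jet_image(pt, eta, phi, jet_eta, jet_phi, R=0.8, n_pixels=100, max_constituents=150):
    """Scalar-pT-sum image of one jet in a 2R x 2R window around the jet axis.

    pt, eta, phi: arrays of constituent kinematics (assumed input format).
    jet_eta, jet_phi: coordinates of the jet axis.
    """
    # keep at most the max_constituents highest-pT particles
    order = np.argsort(pt)[::-1][:max_constituents]
    pt, eta, phi = pt[order], eta[order], phi[order]

    # coordinates relative to the jet axis; wrap the azimuth into [-pi, pi]
    deta = eta - jet_eta
    dphi = np.mod(phi - jet_phi + np.pi, 2 * np.pi) - np.pi

    # scalar sum of pT in each of the n_pixels x n_pixels cells
    image, _, _ = np.histogram2d(
        deta, dphi,
        bins=n_pixels,
        range=[[-R, R], [-R, R]],
        weights=pt,
    )
    return image
```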

Fig. 5: Distributions of the kinematic features described in the text for the 150 highest-\(p_{\mathrm {T}} \) particles in each jet

4 JEDI-net

In this work, we apply an IN [5] architecture to learn a representation of a given input graph (the set of constituents in a jet) and use it to accomplish a classification task (tagging the jet). One can see the IN architecture as a processing algorithm that learns a new representation of the initial input. This is done by replacing the set of input features describing each individual vertex of the graph with a set of engineered features, specific to each vertex but whose values depend on the connections between the vertices in the graph.

The starting point consists of building a graph for each input jet. The \(N_O\) particles in the jet are represented by the vertices of the graph, fully interconnected through directional edges, for a total of \(N_E = N_O \times (N_O-1)\) edges. An example is shown in Fig. 6 for the case of a three-vertex graph. The vertices and edges are labeled for practical reasons, but the network architecture ensures that the labeling convention plays no role in creating the new representation.

Once the graph is built, a receiving matrix (\(R_R\)) and a sending matrix (\(R_S\)) are defined. Both matrices have dimensions \(N_O \times N_E\). The element \((R_R)_{ij}\) is set to 1 when the \(i^{\mathrm {th}}\) vertex receives the \(j^{\mathrm {th}}\) edge and is 0 otherwise. Similarly, the element \((R_S)_{ij}\) is set to 1 when the \(i^{\mathrm {th}}\) vertex sends the \(j^{\mathrm {th}}\) edge and is 0 otherwise. In the case of the graph of Fig. 6, the two matrices take the form:

$$\begin{aligned} R_S&= \begin{pmatrix} ~ & E_1 & E_2 & E_3 & E_4 & E_5 & E_6\\ O_1 & 0 & 0 & 0 & 1 & 1 & 0\\ O_2 & 1 & 0 & 0 & 0 & 0 & 1\\ O_3 & 0 & 1 & 1 & 0 & 0 & 0 \end{pmatrix}\end{aligned}$$
(1)
$$\begin{aligned} R_R&= \begin{pmatrix} ~ & E_1 & E_2 & E_3 & E_4 & E_5 & E_6\\ O_1 & 1 & 1 & 0 & 0 & 0 & 0\\ O_2 & 0 & 0 & 1 & 1 & 0 & 0\\ O_3 & 0 & 0 & 0 & 0 & 1 & 1 \end{pmatrix}. \end{aligned}$$
(2)
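For a generic number of vertices, the two matrices can be built mechanically by enumerating all ordered pairs of distinct vertices. The following minimal sketch constructs them as NumPy arrays; the edge-ordering convention is an arbitrary illustrative choice and need not match the one used in Eqs. (1) and (2).

```python
import numpy as np

def build_receiving_sending(n_objects):
    """R_R and R_S for a fully connected directed graph with n_objects vertices
    and n_objects * (n_objects - 1) edges."""
    edges = [(r, s) for r in range(n_objects) for s in range(n_objects) if r != s]
    R_R = np.zeros((n_objects, len(edges)))
    R_S = np.zeros((n_objects, len(edges)))
    for j, (receiver, sender) in enumerate(edges):
        R_R[receiver, j] = 1.0  # vertex `receiver` receives edge j
        R_S[sender, j] = 1.0    # vertex `sender` sends edge j
    return R_R, R_S

# the three-vertex graph of Fig. 6 has six edges
R_R, R_S = build_receiving_sending(3)
assert R_R.shape == R_S.shape == (3, 6)
```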

The input particle features are represented by an input matrix I. Each column of the matrix corresponds to one of the graph vertices, while the rows correspond to the P features used to represent each vertex. In our case, the vertices are the particles inside the jet, each represented by its array of features (i.e., the 16 features shown in Fig. 5). Therefore, the I matrix has dimensions \(P \times N_O\).

The I matrix is processed by the IN in a series of steps, represented in Fig. 7. The I matrix is multiplied by the \(R_R\) and \(R_S\) matrices and the two resulting matrices are then concatenated to form the B matrix, having dimension \(2P \times N_E\):

$$\begin{aligned} B&= \begin{pmatrix}I \times R_R \\ I \times R_S\end{pmatrix}. \end{aligned}$$
(3)

Each column of the B matrix represents an edge, i.e. a particle-to-particle interaction. The 2P elements of each column are the features of the sending and receiving vertices for that edge. Using this information, a \(D_E\)-dimensional hidden representation of the interaction edge is created through a trainable function \(f_R: \mathbb {R}^{2P} \mapsto \mathbb {R}^{D_E}\). This gives a matrix E with dimensions \(D_E \times N_E\). The cumulative effects of the interactions received by a given vertex are gathered by summing the \(D_E\) hidden features over the edges arriving at it. This is done by computing \(\overline{E} = E R_R^\top \), with dimensions \(D_E \times N_O\), which is then appended to the initial input matrix I:

$$\begin{aligned} C&= \begin{pmatrix}I\\ \overline{E}\end{pmatrix}. \end{aligned}$$
(4)

At this stage, each column of the C matrix represents a constituent in the jet, expressed as a \((P+D_E)\)-dimensional feature vector, containing the P input features and the \(D_E\) hidden features representing the combined effect of the interactions with all the connected particles. A trainable function \(f_O: \mathbb {R}^{P + D_E} \mapsto \mathbb {R}^{D_O}\) is used to build a post-interaction representation of each jet constituent. The function \(f_O\) is applied to each column of C to build the post-interaction matrix O with dimensions \(D_O \times N_O\).

A final classifier \(\phi _C\) takes as input the elements of the O matrix and returns the probability for the jet to belong to each of the five categories. This is done in two ways: (i) in one case, we define the quantities \(\overline{O}_i = \sum _j O_{ij}\), where j is the index of the vertex in the graph (the particle, in our case) and the \(i\in [0, D_O]\) index runs across the \(D_O\) outputs of the \(f_O\) function. The \(\overline{O}_i\) quantities are used as input to \(\phi _C: \mathbb {R}^{D_O} \mapsto \mathbb {R}^{N}\), with N the number of jet categories. This choice preserves the independence of the architecture from the labeling convention adopted to build the I, \(R_R\), and \(R_S\) matrices, at the cost of losing some discriminating information in the summation. (ii) Alternatively, the input to \(\phi _C\) is built directly from the \(D_O \times N_O\) elements of the O matrix, flattened into a one-dimensional array. The full information from O is preserved, but \(\phi _C\) then assumes an ordering of the \(N_O\) input objects. In our case, we rank the input particles in descending order of \(p_{\mathrm {T}} \).

The trainable functions \(f_O\), \(f_R\), and \(\phi _C\) are implemented as three DNNs. Each of them has two hidden layers, the first (second) having \(N_n^1\) (\(N_n^2 = \lfloor N_n^1/2 \rfloor \)) neurons. The model is implemented in PyTorch [39] and trained using an NVIDIA GTX 1080 GPU. The training (validation) data set consists of 630,000 (240,000) examples, while 10,000 events are used for testing purposes.
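As a concrete illustration of the processing chain of Eqs. (3) and (4), the sketch below implements the summed-\(\overline{O}\) variant as a PyTorch module. It follows the description above (two hidden layers per trainable function; the softmax on the classifier output is left to the loss function, as is customary with PyTorch's cross-entropy loss), but the specific layer sizes, the ReLU activations, and all names are illustrative assumptions rather than the exact configuration used for the results.

```python
import torch
import torch.nn as nn

class JediNetSketch(nn.Module):
    """Minimal interaction-network classifier following Eqs. (1)-(4).

    P: input features per particle, n_obj: particles per jet,
    de/do: sizes of the hidden edge/vertex representations,
    hidden: neurons in the first hidden layer of f_R, f_O, and phi_C.
    """
    def __init__(self, P=16, n_obj=150, de=8, do=8, hidden=50, n_classes=5):
        super().__init__()
        # receiving and sending matrices of the fully connected graph (fixed, not trained)
        edges = [(r, s) for r in range(n_obj) for s in range(n_obj) if r != s]
        R_R = torch.zeros(n_obj, len(edges))
        R_S = torch.zeros(n_obj, len(edges))
        for j, (r, s) in enumerate(edges):
            R_R[r, j] = 1.0
            R_S[s, j] = 1.0
        self.register_buffer("R_R", R_R)
        self.register_buffer("R_S", R_S)

        def mlp(n_in, n_out):
            return nn.Sequential(
                nn.Linear(n_in, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden // 2), nn.ReLU(),
                nn.Linear(hidden // 2, n_out),
            )
        self.f_R = mlp(2 * P, de)        # edge function
        self.f_O = mlp(P + de, do)       # vertex function
        self.phi_C = mlp(do, n_classes)  # classifier on the summed O-bar features

    def forward(self, I):
        # I has shape (batch, P, n_obj)
        B = torch.cat([I @ self.R_R, I @ self.R_S], dim=1)    # (batch, 2P, n_edges), Eq. (3)
        E = self.f_R(B.transpose(1, 2)).transpose(1, 2)       # (batch, de, n_edges)
        E_bar = E @ self.R_R.t()                               # (batch, de, n_obj)
        C = torch.cat([I, E_bar], dim=1)                       # (batch, P+de, n_obj), Eq. (4)
        O = self.f_O(C.transpose(1, 2)).transpose(1, 2)        # (batch, do, n_obj)
        O_bar = O.sum(dim=2)                                   # sum over the graph vertices
        return self.phi_C(O_bar)                               # class scores (softmax in the loss)
```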

The architecture of the three trainable functions is determined by minimizing the loss function through a Bayesian optimization, using the GPyOpt library [40], based on GPy [41]. We consider the following hyperparameters:

  • The number of output neurons of the \(f_R\) network, \(D_E\) (between 4 and 14).

  • The number of output neurons of the \(f_O\) network, \(D_O\) (between 4 and 14).

  • The number of neurons \(N_n^1\) in the first hidden layer of the \(f_O\), \(f_R\), and \(\phi _C\) networks (between 5 and 50).

  • The activation function for the hidden and output layers of the \(f_R\) network: ReLU [42], ELU [43], or SELU [44] functions.

  • The activation function for the hidden and output layers of the \(f_O\) network: ReLU, ELU, or SELU.

  • The activation function for the hidden layers of the \(\phi _C\) network: ReLU, ELU, or SELU.

  • The optimizer algorithm: Adam [45] or AdaDelta [46].

In addition, the output neurons of the \(\phi _C\) network are activated by a softmax function. A learning rate of \(10^{-4}\) is used. For a given network architecture, the network parameters are optimized by minimizing the categorical cross entropy. The Bayesian optimization is repeated four times. In each case, the input particles are ordered by descending \(p_{\mathrm {T}} \) value and the first 30, 50, 100, or 150 particles are considered. The parameter optimization is performed on the training data set, while the loss for the Bayesian optimization is estimated on the validation data set.
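A scan of this kind can be set up with GPyOpt along the following lines. The `train_and_validate` routine is a hypothetical placeholder for the training of one model configuration, and the integer encoding of the categorical choices and the number of iterations are illustrative assumptions.

```python
import GPyOpt

def train_and_validate(**hyperparams):
    """Placeholder (assumed, not part of GPyOpt): build the model with the given
    hyperparameters, train it on the training set, and return the validation loss."""
    raise NotImplementedError

def objective(x):
    # GPyOpt passes a 2D array with one row per point to evaluate
    de, do, hidden, act_fr, act_fo, act_phic, optimizer = x[0]
    return train_and_validate(
        de=int(de), do=int(do), hidden=int(hidden),
        act_fr=int(act_fr), act_fo=int(act_fo),
        act_phic=int(act_phic), optimizer=int(optimizer),
    )

# search domain mirroring the hyperparameter list above (integer-coded choices)
domain = [
    {"name": "de",        "type": "discrete", "domain": tuple(range(4, 15))},
    {"name": "do",        "type": "discrete", "domain": tuple(range(4, 15))},
    {"name": "hidden",    "type": "discrete", "domain": tuple(range(5, 51))},
    {"name": "act_fr",    "type": "discrete", "domain": (0, 1, 2)},  # ReLU / ELU / SELU
    {"name": "act_fo",    "type": "discrete", "domain": (0, 1, 2)},
    {"name": "act_phic",  "type": "discrete", "domain": (0, 1, 2)},
    {"name": "optimizer", "type": "discrete", "domain": (0, 1)},     # Adam / AdaDelta
]

opt = GPyOpt.methods.BayesianOptimization(f=objective, domain=domain)
opt.run_optimization(max_iter=50)
print("best hyperparameters:", opt.x_opt, "best validation loss:", opt.fx_opt)
```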

Fig. 6: An example graph with three fully connected vertices and the corresponding six edges

Fig. 7: A flowchart illustrating the interaction network scheme

Tables 1 and 2 summarize the result of the Bayesian optimization for the JEDI-net architecture with and without the sum over the columns of the O matrix, respectively. The best result of each case, highlighted in bold, is used as a reference for the rest of the paper.

Table 1 Optimal JEDI-net hyperparameter setting for different input data sets, when the summed \(\overline{O}_i\) quantities are given as input to the \(\phi _C\) network. The best result, obtained when considering up to 150 particles per jet, is highlighted in bold
Table 2 Optimal JEDI-net hyperparameter setting for different input data sets, when all the \(O_{ij}\) elements are given as input to the \(\phi _C\) network. The best result, obtained when considering up to 100 particles per jet, is highlighted in bold

For comparison, three alternative models are trained on the three different representations of the same data set described in Sect. 3: a DNN model taking as input a list of HLFs, a CNN model processing jet images, and a recurrent model applying GRUs on the same input list used for JEDI-net. The three benchmark models are optimized through a Bayesian optimization procedure, as done for the INs. Details of these optimizations and the resulting best models are discussed in “Appendix A”.

5 Results

Figure 8 shows the receiver operating characteristic (ROC) curves obtained for the optimized JEDI-net tagger in each of the five jet categories, compared to the corresponding curves for the DNN, CNN, and GRU alternative models. The curves are derived by fixing the network architectures to the optimal values based on Table 2 and “Appendix A” and performing a k-fold cross-validation training, with \(k=10\). The solid lines represent the average ROC curve, while the shaded bands quantify the \(\pm 1\) RMS dispersion. The area under the curve (AUC) values, reported in the figure, allow for a comparison of the performance of the different taggers.

The algorithm’s tagging performance is quantified by computing the true positive rate (TPR) for two reference false positive rate (FPR) values (10% and 1%). The comparison of the TPR values gives an assessment of the tagging performance in a realistic use case, typical of an LHC analysis. Table 3 shows the corresponding TPR values for the optimized JEDI-net taggers, compared to the corresponding values for the benchmark models. The largest TPR value for each class is highlighted in bold. As shown in Fig. 8 and Table 3, the two JEDI-net models outperform the other architectures in almost all cases. The only notable exception is the tight working point of the top-jet tagger, for which the DNN model gives a TPR higher by about \(2\%\), while the CNN and GRU models give much worse performance.
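For reference, the TPR at a fixed FPR working point can be extracted from the ROC curve of any one-vs-all tagger by interpolation. A minimal sketch using scikit-learn, with illustrative variable names, is:

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def tpr_at_fpr(y_true, y_score, target_fprs=(0.10, 0.01)):
    """TPR of a one-vs-all tagger at the reference FPR working points.

    y_true: binary labels for the class under study (1 = class of interest),
    y_score: classifier output for that class.
    """
    fpr, tpr, _ = roc_curve(y_true, y_score)
    working_points = {f: float(np.interp(f, fpr, tpr)) for f in target_fprs}
    return working_points, auc(fpr, tpr)
```

Repeating the extraction over the ten cross-validation trainings yields the central values and spreads summarized in Fig. 8 and Table 3.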

The TPR values for the two JEDI-net models agree within \(1\%\). The only exception is observed for the tight working points of the \(\mathrm {W} \) and \(\mathrm {Z} \) taggers, for which the model using the \(\overline{O}\) sums shows a drop in TPR of \(\sim 4\%\). In this respect, the model using the summed \(\overline{O}\) features is preferable (despite this small TPR loss), given the reduced model complexity (see Sect. 7) and its independence of the labeling convention for the particles embedded in the graph and for the edges connecting them.

Fig. 8: ROC curves for JEDI-net and the three alternative models, computed for gluons (top left), light quarks (top right), \(\mathrm {W} \) (center left) and \(\mathrm {Z} \) (center right) bosons, and top quarks (bottom). The solid lines represent the average ROC curves derived from the 10 k-fold trainings of each model. The shaded bands around the average lines represent one standard deviation, computed from the same 10 k-fold trainings

Table 3 True positive rates (TPR) for the optimized JEDI-net taggers and the three alternative models (DNN, CNN, and GRU), corresponding to a false positive rate (FPR) of 10% (top) and 1% (bottom). The largest TPR value for each case is highlighted in bold
Fig. 9: Two-dimensional distributions between (top to bottom) \(\overline{O}_1\) and the constituent multiplicity, \(\overline{O}_4\) and \(\tau _1^{(\beta =2)}\), \(\overline{O}_2\) and \(\tau _3^{(\beta =1)}\), and \(\overline{O}_9\) and \(\tau _3^{(\beta =2)}\), for jets originating from (right to left) gluons, light flavor quarks, \(\mathrm {W} \) bosons, \(\mathrm {Z} \) bosons, and top quarks. For each distribution, the linear correlation coefficient \(\rho \) is reported

6 What did JEDI-net learn?

In order to characterize the information learned by JEDI-net, we consider the \(\overline{O}\) sums across the \(N_O\) vertices of the graph (see Sect. 4) and study their correlations with physics-motivated quantities typically used when exploiting jet substructure in a search. We consider the HLF quantities used for the DNN model and the N-subjettiness variables \(\tau _N^{(\beta )}\) [47], computed with angular exponent \(\beta =1,2\).

Not all the \(\overline{O}\) sums exhibit an obvious correlation with the considered quantities, i.e., the network engineers high-level features that encode information beyond what is used, for instance, in the DNN model.

Nevertheless, some interesting correlation patterns between the physics-motivated quantities and the \(\overline{O}_i\) sums are observed. The most relevant examples are given in Fig. 9, where the 2D histograms and the corresponding linear correlation coefficients (\(\rho \)) are shown. The correlation between \(\overline{O}_1\) and the particle multiplicity in the jet is not completely unexpected. As long as the O quantities aggregated across the graph have the same order of magnitude, the corresponding sum \(\overline{O}\) is approximately proportional to the jet-constituent multiplicity.

The strong correlation between \(\overline{O}_4\) and \(\tau ^{(\beta =2)}_{1}\) (with \(\rho \) values between 0.69 and 0.97, depending on the jet class) is much less expected. The \(\tau _1^{(\beta )}\) quantities assume small values when the jet constituents can be arranged into a single sub-jet inside the jet. By aggregating information from the constituent momenta across the jet, the JEDI-net model based on the \(\overline{O}\) quantities learns to build a quantity very close to \(\tau ^{(\beta =2)}_1\). The last two rows of Fig. 9 show two intermediate cases: the correlation between \(\overline{O}_2\) and \(\tau ^{(\beta =1)}_3\), and between \(\overline{O}_9\) and \(\tau ^{(\beta =2)}_3\). The two \(\overline{O}\) sums considered are correlated with the corresponding substructure quantities, but with smaller correlation coefficients (between 0.48 and 0.77).
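For completeness, the linear correlation coefficients quoted in Fig. 9 correspond to the standard Pearson coefficient, which can be computed per jet class as in the following minimal sketch (array names are illustrative):

```python
import numpy as np

def linear_correlation(o_bar, substructure):
    """Pearson correlation coefficient between one O-bar sum and a
    jet-substructure observable, evaluated over the jets of one class."""
    return float(np.corrcoef(o_bar, substructure)[0, 1])
```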

7 Resource comparison

Table 4 shows a comparison of the computational resources needed by the different models discussed in this paper. The best-performing JEDI-net model has more than twice as many trainable parameters as the DNN and GRU models, but approximately a factor of six fewer than the CNN model. The JEDI-net model based on the summed \(\overline{O}\) features achieves comparable performance with about a factor of four fewer parameters, i.e., fewer than the DNN and GRU models. While far from expensive in terms of number of parameters, the JEDI-net models are expensive in terms of the number of floating point operations (FLOP). The simpler model based on the \(\overline{O}\) sums, using as input a sequence of 150 particles, requires 458 MFLOP. The increase is mainly due to the scaling with the number of vertices in the graph. Many of these operations are the \(0\times \) and \(1\times \) products involving the elements of the \(R_R\) and \(R_S\) matrices. The cost of these operations could be reduced with an IN implementation optimized for inference, e.g., through an efficient sparse-matrix representation.

Table 4 Resource comparison across models. The quoted number of parameters refers only to the trainable parameters of each model. The inference time is measured by applying the model to batches of 1000 events 100 times: the median is quoted as the central value, and half the distance between the 10% and 90% quantiles is quoted as the uncertainty. The GPU used is an NVIDIA GTX 1080 with 8 GB of memory, mounted on a commercial desktop with an Intel Xeon CPU operating at a frequency of 2.60 GHz. The tests were executed in Python 3.7 with no other concurrent process running on the machine

In addition, we quote in Table 4 the average inference time on a GPU. The inference time is measured by applying the model to 1000 events, as part of a Python application based on TensorFlow [48]. To this end, the JEDI-net models, implemented and trained in PyTorch, are exported to ONNX [49] and then loaded as a TensorFlow graph. The quoted time includes loading the data, which occurs at the first inference and differs between event representations; it is smaller for the JEDI-net models than for the CNN model. The GPU used is an NVIDIA GTX 1080 with 8 GB of memory, mounted on a commercial desktop with an Intel Xeon CPU operating at a frequency of 2.60 GHz. The tests were executed in Python 3.7, with no other concurrent process running on the machine. Given the larger number of operations, the GPU inference time for the two IN models is much larger than for the other models.
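A measurement loop of the following kind reproduces the timing procedure described above; the `predict` callable and the batch format are assumptions, while the actual numbers in Table 4 were obtained within the TensorFlow-based application described in the text.

```python
import time
import numpy as np

def time_inference(predict, batch, n_repeats=100):
    """Median inference time and 10%-90% half-spread over repeated calls.

    predict: callable running the model on a batch of 1000 events (assumed),
    batch: the pre-loaded input batch.
    """
    timings = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        predict(batch)
        timings.append(time.perf_counter() - start)
    timings = np.array(timings)
    central = np.percentile(timings, 50)
    spread = (np.percentile(timings, 90) - np.percentile(timings, 10)) / 2.0
    return central, spread
```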

The current IN algorithm is costly to deploy in the online selection environment of a typical LHC experiment. A dedicated R&D effort is needed to reduce the resource consumption in a realistic environment in order to benefit from the improved accuracy that INs can achieve. For example, one could trade model accuracy for reduced resource needs by applying neural network pruning [50, 51], reducing the numerical precision [52, 53], and limiting the maximum number of particles in each jet representation.

8 Conclusions

This paper presents JEDI-net, a jet tagging algorithm based on interaction networks. Applied to a data set of jets from light-flavor quarks, gluons, vector bosons, and top quarks, this algorithm achieves better performance than models based on dense, convolutional, and recurrent neural networks, trained and optimized with the same procedure on the same data set. Like other graph networks, JEDI-net offers several practical advantages that make it particularly suitable for deployment in the data-processing workflows of LHC experiments: it can directly process the list of jet constituent features (e.g. particle four-momenta), it does not assume specific properties of the underlying detector geometry, and it is insensitive to any ordering principle applied to the input jet constituents. For these reasons, the implementation of this and other graph networks is an interesting prospect for future runs of the LHC. On the other hand, the current implementation of this model demands large computational resources and a long inference time, which makes the use of these models problematic for real-time selection and calls for a dedicated program to optimize their deployment in typical level-1 (L1) trigger and high-level trigger (HLT) environments.

The quantities engineered by one of the trained IN models exhibit interesting correlation patterns with some of the jet substructure quantities proposed in the literature, showing that the model is capable of learning some of the relevant physics of the problem. On the other hand, some of the engineered quantities do not exhibit striking correlation patterns, implying that non-trivial insight might be gained by studying these quantities further.