
RAPID: Early Classification of Explosive Transients Using Deep Learning


Published 2019 September 30 © 2019. The Astronomical Society of the Pacific. All rights reserved.
Focus on Tools and Techniques for Time-domain Astronomy. Citation: Daniel Muthukrishna et al. 2019 PASP 131 118002. DOI: 10.1088/1538-3873/ab1609


Abstract

We present Real-time Automated Photometric IDentification (RAPID), a novel time series classification tool capable of automatically identifying transients from within a day of the initial alert through the full lifetime of a light curve. Using a deep recurrent neural network with gated recurrent units (GRUs), we present the first method specifically designed to provide early classifications of astronomical timeseries data, typing 12 different transient classes. Our classifier can process light curves with any phase coverage, and it does not rely on deriving computationally expensive features from the data, making RAPID well suited for processing the millions of alerts that ongoing and upcoming wide-field surveys such as the Zwicky Transient Facility (ZTF) and the Large Synoptic Survey Telescope (LSST) will produce. The classification accuracy improves over the lifetime of the transient as more photometric data become available, and across the 12 transient classes, we obtain an average area under the receiver operating characteristic curve of 0.95 and 0.98 at early and late epochs, respectively. We demonstrate RAPID's ability to effectively provide early classifications of observed transients from the ZTF data stream. We have made RAPID available as an open-source software package for machine-learning-based alert brokers to use for the autonomous and quick classification of several thousand light curves within a few seconds.


1. Introduction

Observations of the transient universe have led to some of the most significant discoveries in astronomy and cosmology. From the use of Cepheids and Type Ia supernovae (SNe Ia) as standardizable candles for estimating cosmological distances, to the recent detection of a kilonova event as the electromagnetic counterpart of GW170817, the transient sky continues to provide exciting new astronomical discoveries.

In the past, transient science has had significant successes using visual classification by experienced astronomers to rank interesting new events and prioritize spectroscopic follow-up. Nevertheless, the visual classification process inevitably introduces latency into follow-up studies, and spectra for many objects are obtained several days to weeks after the initial detection. Existing and upcoming wide-field surveys and facilities will produce several million transient alerts per night, e.g., the Large Synoptic Survey Telescope (LSST; Ivezić et al. 2019), the Dark Energy Survey (DES; Dark Energy Survey Collaboration et al. 2016), the Zwicky Transient Facility (ZTF; Bellm 2014), the Catalina Real-Time Transient Survey (CRTS; Djorgovski et al. 2011), the Panoramic Survey Telescope and Rapid Response System (PanSTARRS; Chambers & Pan-STARRS Team 2017), the Asteroid Terrestrial-impact Last Alert System (ATLAS; Tonry et al. 2018), and the Planet Search Survey Telescope (PSST; Dunham et al. 2004). This unprecedented number means that it will be possible to obtain early time observations of a large sample of transients, which in turn will enable detailed studies of their progenitor systems and a deeper understanding of their explosion physics. However, with this deluge of data comes new challenges, and individual visual classification for spectroscopic follow-up is utterly unfeasible.

Developing methods to automate the classification of photometric data is of particular importance to the transient community. In the case of SNe Ia, cosmological analyses to measure the equation of state of the dark energy w and its evolution require large samples with low contamination. The need for a high-purity sample necessitates expensive spectroscopic observations to determine the type of each candidate, as classifying SNe Ia based on sparse light curves runs the risk of contamination with other transients, particularly Type Ibc supernovae. Even with human inspection, the differing cadence, observer frame passbands, photometric properties, and contextual information of each transient light curve constitute a complex mixture of sparse information, which can confound our visual sense, leading to potentially inconsistent classifications. This failing of visual classification, coupled with the large volumes of data, necessitates a streamlined automated classification process. This is our motivation for the development of our deep neural network (DNN) for Real-time Automated Photometric IDentification (RAPID), the focus of this work.

1.1. Previous Work on Automated Photometric Classification

In 2010, the Supernova Photometric Classification Challenge (SNPhotCC; Kessler et al. 2010a, 2010b), in preparation for the Dark Energy Survey (DES), spurred the development of several innovative classification techniques. The goal of the challenge was to determine which techniques could distinguish SNe Ia from several other classes of supernovae using light curves simulated with the properties of the DES. The techniques used for classification varied widely, from fitting light curves with a variety of templates (Sako et al. 2008), to much more complex methodologies that use semi-supervised learning approaches (Richards et al. 2012) or parametric fitting of light curves (Karpenka et al. 2013). A measure of the value of the SNPhotCC is that the data set is still used as the reference standard to benchmark contemporary supernova light-curve classification schemes, such as Bloom et al. (2012); Richards et al. (2012); Ishida & de Souza (2013); Charnock & Moss (2017); Lochner et al. (2016); Revsbech et al. (2018); Narayan et al. (2018); Pasquet et al. (2019).

Nearly all approaches to automated classification developed using the SNPhotCC data set have either used empirical template-fitting methods (Sako et al. 2008, 2011) or have extracted features from supernova light curves as inputs to machine-learning classification algorithms (Newling et al. 2011; Karpenka et al. 2013; Lochner et al. 2016; Narayan et al. 2018; Möller et al. 2016; Sooknunan et al. 2018). Lochner et al. (2016) used a feature-based approach, computing features using either parametric fits to the light curve, template fitting with SALT2 (Guy et al. 2007), or model-independent wavelet decomposition of the data. These features were independently fed into a range of machine-learning architectures including Naive Bayes, k-nearest neighbors, multilayer perceptrons, support vector machines, and boosted decision trees (see Lochner et al. 2016 for a brief description of these) and were used to classify just three broad supernova types. The work concluded that the non-parametric feature extraction approaches were most effective for all classifiers, and that boosted decision trees performed most effectively. Surprisingly, they further showed that the classifiers did not improve with the addition of redshift information. These previous approaches share two characteristics:

  1. they are largely tuned to discriminate between different classes of supernovae, and
  2. they require the full phase coverage of each light curve for classification.
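To make the contrast with RAPID concrete, a feature-based pipeline of the kind described above can be sketched in toy form. The two features and the nearest-centroid classifier here are simplified stand-ins for illustration only; real pipelines such as Lochner et al. (2016) use parametric fits or wavelet decompositions and boosted decision trees.

```python
import numpy as np

rng = np.random.default_rng(42)

def extract_features(t, flux):
    """Toy feature vector: peak flux, and fractional flux 15 days after
    peak (a crude decline rate). Real pipelines use dozens of features."""
    peak = flux.max()
    t_peak = t[flux.argmax()]
    f15 = np.interp(t_peak + 15.0, t, flux)
    return np.array([peak, f15 / peak])

def toy_light_curve(kind):
    """Gaussian-shaped light curve: kind 0 declines fast, kind 1 slowly."""
    t = np.linspace(-10.0, 40.0, 25)
    width = 8.0 if kind == 0 else 18.0
    flux = np.exp(-0.5 * (t / width) ** 2) + rng.normal(0.0, 0.02, t.size)
    return t, flux

# Build a labeled feature table from simulated light curves.
X = np.array([extract_features(*toy_light_curve(k % 2)) for k in range(200)])
y = np.array([k % 2 for k in range(200)])

# Nearest-centroid classifier standing in for the boosted decision trees.
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]
centroids = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])
pred = np.argmin(np.linalg.norm(X_test[:, None] - centroids, axis=2), axis=1)
accuracy = (pred == y_test).mean()
```

Note that `extract_features` only makes sense once the full (or near-full) light curve is in hand, which is exactly the second limitation listed above.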

Both characteristics arise from the SNPhotCC data set. As it was developed to test photometric classification for an experiment using SNe Ia as cosmological probes, the training set represented only a few types of non-SNe Ia that were likely contaminants, whereas the transient sky is a menagerie. Additionally, SNPhotCC presented astronomers with full light curves, rather than the streaming data that is generated by real-time transient searches, such as ZTF. While previous methods can be extended with a larger, more diverse training set, the second characteristic they share is a more severe limitation. Requiring complete phase coverage of each light curve for classification (e.g., Lochner et al. 2016) is a fundamental design choice when developing the architecture for automated photometric classification, and methods cannot trivially be re-engineered to work with sparse data.

1.2. Early Classification

While retrospective classification after the full light curve of an event has been observed is useful, it also limits the scientific questions that can be answered about these events, many of which exhibit interesting physics at early times. Detailed observations, including high-cadence photometry, time-resolved spectroscopy, and spectropolarimetry, shortly after the explosion provide insights into the progenitor systems that power the event and hence improve our understanding of the objects' physical mechanism. Therefore, ensuring a short latency between a transient detection and its follow-up is an important scientific challenge. Thus, a key goal of our work on RAPID has been to develop a classifier capable of identifying transient types within 2 days of detection. We refer to photometric typing with observations obtained in this narrow time range as early classification.

The discovery of the electromagnetic counterpart (Abbott et al. 2017a; Coulter et al. 2017; Arcavi et al. 2017; Soares-Santos et al. 2017) of the binary neutron star merger event GW170817 has thrown the need for automated photometric classifiers capable of identifying exotic events from sparse data into sharp relief. As we enter the era of multimessenger astrophysics, it will become ever more important to decrease the latency between the detection and follow-up of transients. While the massive effort to optically follow up GW170817 was heroic, the coordination of resources was largely ad hoc. With the large volumes of interesting and unique data expected from surveys such as LSST (∼10⁷ alerts per night), the need to streamline follow-up processes is crucial. The automated early classification scheme developed in this work, alongside the new transient brokers such as ALeRCE, LASAIR (Smith et al. 2019), and ANTARES (Saha et al. 2014, 2016), is necessary to ensure organized and streamlined follow-up of the high density of exciting transients in upcoming surveys.

There have been a few notable efforts at early time photometric classification, particularly using additional contextual data. Sullivan et al. (2006) successfully discriminated between SNe Ia and core-collapse SNe in the Supernova Legacy Survey using a template-fitting technique on only two to three epochs of multiband photometry. Poznanski et al. (2007) similarly attempted to distinguish between SNe Ia and core-collapse SNe using only single-epoch photometry along with a photometric redshift estimate from the probable host galaxy. A few contemporary techniques, such as Foley & Mandel (2013) and the sherlock package, use only host galaxy and contextual information (e.g., the metallicity of the host galaxy is correlated with supernova type) to predict transient classes with limited accuracy.

The most widely used scheme for classification is PSNID (Sako et al. 2008, 2011). It has been used by the Sloan Digital Sky Survey and DES (D'Andrea et al. 2018) to classify pre-maximum and full light curves into three supernova types (SNIa, SNII, SNIbc). For each class, it has a library of template light curves generated over a grid of parameters (redshift, dust extinction, time of maximum, and light-curve shape). To classify a transient, it performs an exhaustive search over the templates of all classes. It identifies the class of the template that best matches the data (with minimum χ2) and computes the Bayesian evidence by marginalizing the likelihood over the parameter space. The latest version employs computationally expensive nested sampling to compute the evidence. Therefore, the main computational burden (which increases with the number of templates used) is incurred every time it is used to predict the class of a new transient. As new data arrive, this cost is multiplied, as the procedure must be repeated each time the classifications are updated.
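The exhaustive template χ2 search described above can be sketched in miniature. This is a hedged illustration with made-up Gaussian templates and parameter grids; it is not the actual PSNID implementation, which also marginalizes the likelihood to compute Bayesian evidence rather than simply taking the minimum χ2.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(-10.0, 40.0, 26)  # observation epochs in days

def template(stretch, peak_flux):
    """Toy single-band template: a stretched Gaussian rise and decline."""
    return peak_flux * np.exp(-0.5 * (t / (10.0 * stretch)) ** 2)

# Template library: each class has a small grid of (stretch, peak) models.
library = {
    "SNIa":  [template(s, p) for s in (0.9, 1.0, 1.1) for p in (0.8, 1.0, 1.2)],
    "SNII":  [template(s, p) for s in (1.8, 2.2, 2.6) for p in (0.4, 0.6, 0.8)],
    "SNIbc": [template(s, p) for s in (1.2, 1.4) for p in (0.5, 0.7)],
}

def classify(flux, flux_err):
    """Exhaustive chi^2 search over every template of every class;
    return the class of the minimum-chi^2 template."""
    best_class, best_chi2 = None, np.inf
    for cls, temps in library.items():
        for model in temps:
            chi2 = np.sum(((flux - model) / flux_err) ** 2)
            if chi2 < best_chi2:
                best_class, best_chi2 = cls, chi2
    return best_class

obs_err = np.full(t.size, 0.05)
obs = template(1.0, 1.0) + rng.normal(0.0, 0.05, t.size)  # noisy SNIa-like event
label = classify(obs, obs_err)
```

The cost of `classify` scales with the total number of templates, and the whole search must be rerun from scratch each night as new points arrive, which is the per-prediction burden contrasted with RAPID below.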

In contrast, RAPID covers a much broader variety of transient classes and learns a function that directly maps the observed photometric timeseries onto these transient class probabilities. The main computational cost is incurred only once, during the training of the DNN, while predictions obtained by running new photometric timeseries through the trained network are very fast. Because of this advantage, updating the class probabilities as new data arrives is trivial. Furthermore, we specifically designed our RNN architecture for temporal processes. In principle, it is able to save the information from previous nights so that the additional cost to update the classifications as new data are observed is only incremental. These aspects make RAPID particularly well suited to the large volume of transients that new surveys such as LSST will observe.

1.3. Deep Learning in Time-domain Astronomy

Improving on previous feature-based classification schemes, and developing methods that exhibit good performance even with sparse data, requires new machine-learning architectures. Advanced neural network architectures are non-feature-based approaches that have recently been shown to have several benefits, such as low computational cost and robustness against some of the biases that can afflict machine-learning techniques requiring "expert designed" features (Aguirre et al. 2019; Charnock & Moss 2017; Moss 2018; Naul et al. 2018). The use of Artificial Neural Networks (ANNs; McCulloch & Pitts 1943) and deep learning, in particular, has seen dramatic success in image classification, speech recognition, and computer vision, outperforming previous approaches in many benchmark challenges (Krizhevsky et al. 2012; Razavian et al. 2014; Szegedy et al. 2015).

In time-domain astronomy, deep learning has recently been used in a variety of classification problems including variable stars (Naul et al. 2018; Hinners et al. 2018), supernova spectra (Muthukrishna et al. 2019), photometric supernovae (Charnock & Moss 2017; Moss 2018; Möller & de Boissière 2019; Pasquet et al. 2019), and sequences of transient images (Carrasco-Davis et al. 2018). A class of ANNs known as Recurrent Neural Networks (RNNs) is particularly suited to learning sequential information (e.g., timeseries data, speech recognition, and natural language problems). While ANNs are often feed-forward (e.g., convolutional neural networks and multilayer perceptrons), where information passes through the layers once, RNNs allow for cycling of information through the layers. They are able to encode an internal representation of previous epochs in timeseries data, which, along with real-time data, can be used for classification.

A variant of RNNs known as long short-term memory networks (LSTMs; Hochreiter & Schmidhuber 1997) improve upon standard RNNs by being able to store long-term information, and have achieved state-of-the-art performance in several timeseries applications. In particular, they revolutionized speech recognition, outperforming traditional models (Fernández et al. 2007; Hannun et al. 2014; Li & Wu 2015), and have very recently been used in the trigger word detection algorithms popularized by Apple's Siri, Microsoft's Cortana, Google's voice assistant, and Amazon's Echo. Naul et al. (2018) and Hinners et al. (2018) have had excellent success in variable star classification. Charnock & Moss (2017) applied the technique to supernova classification. They used supernova data from the SNPhotCC and fed the multiband photometric full light curves into their LSTM architecture to achieve high SNIa versus non-SNIa binary classification accuracies. Moss (2018) recently followed this up on the same data with a novel approach applying a new phased-LSTM (Neil et al. 2016) architecture. These approaches have the advantage over previous supernova photometric classifiers of not requiring computationally expensive and user-defined (and hence, possibly biased) feature engineering processes.

While this manuscript was under review, Möller & de Boissière (2019) released a similar algorithm for photometric classification of a range of supernova types. It uses a recurrent neural network architecture based on BRNNs (Bayesian Recurrent Neural Networks) and does not require any feature extraction. Around the same time, Pasquet et al. (2019) released a package for full light curve photometric classification based on Convolutional Neural Networks, again able to use photometric light curves without requiring feature engineering processes. These approaches made use of data sets adapted from the SNPhotCC, using light curves based on the observing properties of the Dark Energy Survey and with a smaller variety of transient classes than the PLAsTiCC-based training set used in our work. Pasquet et al. (2019) use a framework that is very effective for full light curve classification, but is not well suited to early or partial light curve classification. Möller & de Boissière (2019), on the other hand, is one of the first approaches able to classify partial supernova light curves using a single bi-directional RNN layer, and achieves accuracies of up to 96.92 ± 0.26% for a binary SNIa versus non-SNIa classifier. While their approach is similar, the type of RNN, neural network architecture, data set, and focus on supernovae differ from the work presented in this paper. RAPID is focused on early and real-time light curve classification of a wide range of transient classes to identify interesting objects early, rather than on full light curve classification for creating a pure photometric sample for SNIa cosmology. We are currently applying our work to the real-time alert stream from the ongoing ZTF survey through the ANTARES broker, and plan to develop the technique further for use on the LSST alert stream.

1.4. Overview

In this paper, we build upon the approach used in Charnock & Moss (2017). We develop RAPID using a deep neural network (DNN) architecture that employs a very recently improved RNN variation known as gated recurrent units (GRUs; Cho et al. 2014). This novel architecture allows us to provide real-time, rather than only full light curve, classifications.

Previous RNN approaches (including Charnock & Moss 2017; Moss 2018; Hinners et al. 2018; Naul et al. 2018; Möller & de Boissière 2019) all make use of bi-directional RNNs that can access input data from both past and future frames relative to the time at which the classification is desired. While this is effective for full light curve classification, it does not suit the real-time, time-varying classification that we focus on in this work. In real-time classification, we can only access input data previous to the classification time. Therefore, to respect causality, we make use of uni-directional RNNs that only take inputs from timesteps previous to any given classification time. RAPID also enables multi-class and multi-passband classifications of transients as well as a new and independent measure of transient explosion dates. We further make use of a new light curve simulation software developed by the recent Photometric LSST Astronomical Timeseries Classification Challenge (PLAsTiCC; The PLAsTiCC team et al. 2018).
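A uni-directional GRU respects causality because the hidden state at each timestep is a function only of past inputs. The following minimal NumPy sketch implements a single GRU cell (the Cho et al. 2014 update/reset-gate equations) with random, untrained weights and illustrative dimensions; it is not RAPID's actual architecture, but it demonstrates the causality property: perturbing a future observation cannot change earlier hidden states.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with random weights (biases omitted for brevity)."""
    def __init__(self, n_in, n_hid):
        s = 1.0 / np.sqrt(n_hid)
        self.Wz = rng.uniform(-s, s, (n_hid, n_in + n_hid))  # update gate
        self.Wr = rng.uniform(-s, s, (n_hid, n_in + n_hid))  # reset gate
        self.Wh = rng.uniform(-s, s, (n_hid, n_in + n_hid))  # candidate state

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)
        r = sigmoid(self.Wr @ xh)
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))
        return (1 - z) * h + z * h_tilde

def run(cell, seq):
    """Feed a sequence through the cell; return the hidden state history."""
    h, states = np.zeros(8), []
    for x in seq:
        h = cell.step(x, h)      # each step uses only past inputs
        states.append(h.copy())
    return states

cell = GRUCell(n_in=3, n_hid=8)
seq = rng.normal(0.0, 1.0, (10, 3))   # toy (time, g-flux, r-flux) triples

states = run(cell, seq)
seq2 = seq.copy()
seq2[7] += 10.0                       # perturb a *future* observation
states2 = run(cell, seq2)
```

In a trained network, a softmax layer on each hidden state would yield class probabilities at every epoch, so classifications update incrementally as each new observation arrives instead of reprocessing the whole light curve.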

In Section 2 we discuss how we use the PLAsTiCC models with the SNANA software suite (Kessler et al. 2009) to simulate photometric light curves based on the observing characteristics of the ZTF survey, and describe the resulting data set along with our cuts, processing, and modeling methods. In Section 3, we frame the problem we aim to solve and detail the deep learning architecture used for RAPID. In Section 4, we evaluate our classifier's performance with a range of metrics, and in Section 5 we apply the classifier to observed data from the live ZTF data stream. An illustration of the different sections of this paper and their connections is shown in Figure 1. Finally, in Section 6, we compare RAPID to a feature-based classification technique we implemented using an advanced Random Forest classifier improved from Narayan et al. (2018) and based on Lochner et al. (2016). We present conclusions in Section 7.

Figure 1.

Figure 1. Schematic illustrating the preprocessing, training set preparation, and classification processes described throughout this paper.


2. Data

2.1. Simulations

One of the key challenges with developing classifiers for upcoming transient surveys is the lack of labeled samples that are appropriate for training. Moreover, even once a survey commences, it can take a significant amount of time to accumulate a well-labeled sample that is large enough to develop robust learning algorithms. To meet this difficulty for LSST, the PLAsTiCC collaboration has developed the infrastructure to simulate light curves of astrophysical sources with realistic sampling and noise properties. This effort was one component of an open-access challenge to develop algorithms that classify astronomical transients. By adapting supernova analysis tools such as SNANA (Kessler et al. 2009) to process several models of astrophysical phenomena from leading experts, a range of new transient behavior was included in the PLAsTiCC data set. The challenge has recently been released to the public on Kaggle (The PLAsTiCC team et al. 2018) along with the metric framework to evaluate submissions to the challenge (Malz et al. 2018). The PLAsTiCC models are the most comprehensive enumeration of the transient and variable sky available at present.

We use the PLAsTiCC transient class models and the simulation code developed in The PLAsTiCC team (2018) to create a simulated data set that is representative of the cadence and observing properties of the ongoing public "Mid Scale Innovations Program" (MSIP) survey at the ZTF (Bellm 2014). This allows us to validate the simulations against the live ZTF data stream and to apply our classifier to it, as illustrated in Section 5.

2.1.1. Zwicky Transient Facility

ZTF is the first of the new generation of optical synoptic survey telescopes and builds upon the infrastructure of the Palomar Transient Factory (PTF; Rau et al. 2009). It employs a 47 square degree field-of-view camera to scan more than 3750 square degrees an hour to a depth of 20.5–21 mag (Graham & Zwicky Transient Facility (ZTF) Project Team 2018). It is a precursor to the LSST and will be the first survey to produce one million alerts a night and to have a trillion row data archive. To prepare for this unprecedented data volume, we build an automated classifier trained on a large simulated ZTF-like data set that contains a labeled sample of transients.

We obtained logs of ZTF observing conditions (E. Bellm, private communication) and photometric properties (zero-points, FWHM, sky brightness, etc.), and one of us (R.B.) converted these into a library suitable for use with SNANA. SNANA simulates millions of light curves for each model, following a class-specific luminosity function prescription within the ZTF footprint. The sampling and noise properties of each observation on each light curve are set to reflect a random sequence from within the observing conditions library. The simulated light curves thus mimic the ZTF observing properties with a median cadence of 3 days in the g and r passbands. As ZTF had only been operating for four months when we constructed the observing conditions library, it is likely that our simulations are not fully representative of the survey. Nevertheless, this procedure is more realistic than simulating the observing conditions entirely, as we would have been forced to do if we had developed RAPID for LSST or WFIRST. We verified that the simulated light curves have similar properties to observed transient sources detected by ZTF that have been announced publicly. The data set consists of a labeled set of 48,029 simulated transients evenly distributed across a range of different classes briefly described below. An example of a simulated light curve from each class is shown in Figure 2.

Figure 2.

Figure 2. Light curves of one example transient from each of the 12 transient classes are plotted with an offset. We have only plotted transients with a high signal-to-noise and with a low simulated host redshift (z < 0.2) to facilitate comparison of light-curve shape between the classes. The dark-colored square markers plot the r-band light curves of each transient, while the lighter-colored circle markers are the g-band light curves of each transient.


2.1.2. Transient Classes

We briefly describe the transient class models from The PLAsTiCC team (2018) that are used throughout this paper.

  • Type Ia Supernovae. Type Ia supernovae are the thermonuclear explosions of binary star systems consisting of a carbon-oxygen white dwarf accreting matter from a companion star. In recent years, many subgroups in the SNIa class have been defined to account for their observed diversity. In this work we include three subtypes. SNIa-norm are the most commonly observed SNIa class. Type Ia-91bg supernovae (SNIa-91bg) burn at slightly lower luminosities and have lower ejecta velocities. SNIax are SN2002cx-like supernovae (defined in Silverman et al. 2012; Foley et al. 2013).
  • Core-collapse Supernovae. Type Ibc (SNIbc) and Type II (SNII) supernovae are typically found in regions of star formation, and are the result of the core-collapse of massive stars. Their light-curve shape and spectra near maximum light look very similar to SNIa, but they tend to be about 1.0–1.5 mag fainter than a typical SNIa.
  • point-Ia. These are a hypothetical supernova type expected to have light-curve shapes very similar to normal SNe Ia, but just one-tenth as bright. They are the result of the early onset of detonation in helium-transferring white dwarf binaries known as AM Canum Venaticorum systems (Shen et al. 2010). Helium that accretes onto carbon-oxygen white dwarfs undergoes unstable thermonuclear flashes when the orbital period is short: in the 2.5–3.5 minute range (Bildsten et al. 2007). This process is strong enough to result in the onset of a detonation.
  • TDE. Tidal Disruption Events occur when a star in the orbit of a massive black hole is pulled apart by the black hole's tidal forces. Some debris from the event is ejected at high speeds, while the remainder is swallowed by the black hole, resulting in a bright flare lasting up to a few years (Rees 1988).
  • Kilonovae. Kilonovae have been observed as the electromagnetic counterparts of gravitational waves. They result from the mergers of either double neutron star (NS-NS) or black hole-neutron star (BH-NS) binaries, the former of which was recently discovered by LIGO (GW170817; Abbott et al. 2017a, 2017b). The neutron-rich ejecta from the merger undergoes rapid neutron capture (r-process) nucleosynthesis to produce the Universe's rare heavy elements. The radioactive decay of these unstable nuclei powers a rapidly evolving kilonova event (Metzger 2017; Yu et al. 2018).
  • SLSN-I. Type I Super-luminous supernovae (SLSN) have ∼10 times the energy of SNe Ia and core-collapse SNe, and are thought to be caused by several different progenitor mechanisms including magnetars, the core-collapse of particularly massive stars, and interaction with circumstellar material. In analogy to supernovae, they are divided into hydrogen-poor (Type I) and hydrogen-rich (Type II). A subclass of SLSN-I appear to be powered by radioactive decay of Ni-56, and are termed SLSN-R. However, the majority of SLSN-I require some other energy source, as the nickel mass required to power their peaks is in conflict with their late-time decay. While SLSN-I and SLSN-II have different spectroscopic fingerprints, their light curves are qualitatively similar, and for classification purposes, it suffices to consider SLSN-I as a proxy for classification performance on all SLSN. See SN2005ap in Quimby et al. (2007) for an example and Gal-Yam (2018) for a comprehensive review.
  • PISN. Pair-instability supernovae are thought to be runaway thermonuclear explosions of massive stars with oxygen cores, initiated when the internal energy in the core is sufficiently high to initiate pair production. This pair production from γ rays in turn leads to a dramatic drop in pressure support, and partial collapse. The rapid contraction leads to accelerated oxygen ignition, followed by explosion. These explosive transients can only result from stars with masses of ∼100 M⊙ and above, and they naturally yield several solar masses of Ni-56. Ren et al. (2012) and Gal-Yam (2012) have suggested that observed members of the SLSN-R subclass are consistent with PISN models.
  • ILOT. Intermediate luminosity optical transients have a peak luminosity in the gap between novae and supernovae (e.g., NGC 300 OT2008-1; Berger et al. 2009). The physical mechanism of these objects is not well understood, but they have been modeled as either eruptions of red giants or interacting binary systems (see Kashi & Soker 2017 and references therein).
  • CART. Calcium-rich gap transients (e.g., PTF11kmb; Lunnan et al. 2017) are a recently discovered transient class that have strong forbidden and permitted calcium lines in their spectra. The physical mechanism of these events is not well understood, but they are known to evolve much faster than average SNe Ia, with rise times of less than 15 days (compared with ∼18 days for SNIa), velocities of approximately 6000–10,000 km s−1, and absolute magnitudes in the range −15.5 to −16.5 (a factor of 10 to 30 fainter than SNe Ia) (Sell et al. 2015; Kasliwal et al. 2012).

The above list of transients is not exhaustive, but is the largest collection of transient models assembled to date. The specific models used in the simulations, derived from Kessler et al. (2019), are SNIa-norm: Guy et al. (2010), Kessler et al. (2013), Pierel et al. (2018); SNIbc: Kessler et al. (2010b), Pierel et al. (2018), Guillochon et al. (2018), Villar et al. (2017); SNII: Kessler et al. (2010b), Pierel et al. (2018), Guillochon et al. (2018), Villar et al. (2017); SNIa-91bg: Galbany et al. (2019, in preparation); SNIa-x: Jha (2017); pointIa: Shen et al. (2010); Kilonovae: Kasen et al. (2017); SLSN: Guillochon et al. (2018), Nicholl et al. (2017), Kasen & Bildsten (2010); PISN: Guillochon et al. (2018), Villar et al. (2017), Kasen et al. (2011); ILOT: Guillochon et al. (2018), Villar et al. (2017); CART: Guillochon et al. (2018), Villar et al. (2017), Kasliwal et al. (2012); and TDE: Guillochon et al. (2018), Mockler et al. (2019), Rees (1988). Each simulated transient data set consists of a timeseries of flux and flux error measurements in the g and r ZTF bands, along with sky position, Milky Way dust reddening, a host galaxy redshift, and a photometric redshift. The models used in PLAsTiCC were extensively validated against real observations by several complementary techniques, as described in G. Narayan et al. (2019, in preparation). We split the total set of transients into two parts: 60% for the training set and 40% for the testing set. The training set is used to train the classifier to identify the correct transient class, while the testing set is used to test the performance of the classifier.
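The per-object data layout and the 60/40 split described above might be represented as follows. The field names and values here are illustrative placeholders, not the SNANA output schema, and the tiny four-class toy data set stands in for the 48,029 simulated transients.

```python
import numpy as np

rng = np.random.default_rng(2)

def make_transient(label):
    """One toy simulated transient: per-passband flux timeseries plus the
    metadata fields listed above (illustrative names and random values)."""
    n = rng.integers(10, 30)                      # number of epochs
    return {
        "label": label,
        "mjd": np.sort(rng.uniform(0.0, 90.0, n)),
        "flux": {"g": rng.normal(1.0, 0.1, n), "r": rng.normal(1.0, 0.1, n)},
        "flux_err": {"g": np.full(n, 0.1), "r": np.full(n, 0.1)},
        "ra_dec": (rng.uniform(0, 360), rng.uniform(-30, 30)),
        "mwebv": rng.uniform(0.0, 0.3),           # Milky Way dust reddening
        "host_z": rng.uniform(0.01, 0.5),         # host galaxy redshift
        "photo_z": rng.uniform(0.01, 0.5),        # photometric redshift
    }

classes = ["SNIa-norm", "SNII", "SNIbc", "TDE"]
dataset = [make_transient(c) for c in classes for _ in range(100)]

# 60/40 train/test split after shuffling.
idx = rng.permutation(len(dataset))
n_train = int(0.6 * len(dataset))
train = [dataset[i] for i in idx[:n_train]]
test = [dataset[i] for i in idx[n_train:]]
```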

2.2. Trigger for Issuing Alerts

The primary method used for detecting transient events is to subtract real-time or archival data from a new image to detect a change in observed flux. This is known as difference imaging, and has been shown to be effective, even in fields that are crowded or associated with highly non-uniform unresolved surface brightness (Tomaney & Crotts 1996; Bond et al. 2001). Most transient surveys, including ZTF, use this method, and "trigger" a transient event when there is a detection in a difference image that exceeds a 5σ signal-to-noise (S/N) threshold. Throughout this work, we use trigger to identify this time of detection. We refer to early classification as classification made within 2 days of this trigger, and full classification as classification made more than 40 days after trigger.
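Operationally, the trigger definition above amounts to scanning the difference-image photometry for the first epoch whose S/N exceeds 5. A minimal sketch, with hypothetical function and array names:

```python
import numpy as np

def trigger_time(mjd, diff_flux, diff_flux_err, snr_threshold=5.0):
    """Return the MJD of the first difference-image detection whose
    signal-to-noise exceeds the threshold, or None if it never triggers."""
    snr = diff_flux / diff_flux_err
    above = np.nonzero(snr >= snr_threshold)[0]
    return mjd[above[0]] if above.size else None

# Toy difference-image light curve: the third epoch is the first 5-sigma point.
mjd = np.array([0.0, 3.0, 6.0, 9.0, 12.0])
diff_flux = np.array([0.5, 1.0, 6.0, 12.0, 9.0])
diff_err = np.ones(5)

t_trig = trigger_time(mjd, diff_flux, diff_err)
```

Early and full classifications are then simply classifications issued at epochs with `mjd <= t_trig + 2` and `mjd > t_trig + 40`, respectively.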

2.3. Selection Criteria

To create a clean training sample, we made a number of cuts before processing the light curves. The selection criteria are as follows.

  • z < 0.5 and $z\ne 0$. First, we cut objects with host galaxy redshifts z = 0 or z > 0.5 such that all galactic objects and any higher redshift objects were removed as these candidates are too faint to be useful for the early time progenitor studies that motivated the development of this classifier in the first place. While our work relies on knowing the redshift of each transient, in this low redshift range, we should be able to obtain a redshift from the host galaxy from existing catalogs.
  • Sufficient data in the early light curve. Next, we ensured that the selected light curves each had at least three measurements before trigger, and at least two of these were in different passbands. Even if these measurements were themselves insufficient to cause a trigger, they help establish a baseline flux. This cut therefore removes objects that triggered immediately after the beginning of the observing season, as these are likely to be unacceptably windowed.
  • b > 15° or b < −15°. Any object in the Galactic plane, with latitude −15° < b < 15°, was also cut from the data set because our analysis only considers extragalactic transients.
  • Selected only transient objects. Finally, while the PLAsTiCC simulations included a range of variable objects, including AGN, RR Lyrae, M-dwarf flares, and Eclipsing Binary events, we removed these from the simulated data set. This cut on the data set was made because these long-lived variable candidates will likely be identified to a very high completeness over the redshift range under consideration, and will not be misidentified as a class of astrophysical interest for early time studies.

2.4. Preprocessing

Arguably, one of the most important aspects in an effective learning algorithm is the quality of the training set. In this section we discuss efforts to ensure that the data is processed in a uniform and systematic way before we train our DNN.

The light curves are measured in flux units, as is expected for the ZTF difference imaging pipeline. In the simulations, a significant fraction of the observations are 5σ–10σ outliers. These outliers are intended to replicate the difference-image analysis artifacts, telescope CCD deficiencies, and cosmic rays seen in observational data. We perform "sigma clipping" to reject these outliers: we reject photometric points with flux uncertainties that are more than 3σ from the mean uncertainty in each passband, and iteratively repeat this clipping five times. Next, we correct the light curves for interstellar extinction using the reddening function of Fitzpatrick (1999). We assume an extinction law with RV = 3.1, and use the central wavelength of each ZTF filter to deredden each light curve.
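The iterative sigma clipping described above can be sketched as follows. This is a hedged interpretation, not the RAPID source: points whose flux uncertainty lies more than 3σ from the mean uncertainty of the surviving points in a passband are rejected, for up to five iterations. All names are illustrative.

```python
import numpy as np

def sigma_clip_uncertainties(fluxerr, nsigma=3.0, iterations=5):
    """Return a boolean mask of photometric points that survive clipping
    on the flux uncertainties (True = keep)."""
    keep = np.ones(len(fluxerr), dtype=bool)
    for _ in range(iterations):
        err = fluxerr[keep]
        mean, std = err.mean(), err.std()
        if std == 0:
            break  # all surviving uncertainties identical; nothing to clip
        new_keep = np.abs(fluxerr - mean) <= nsigma * std
        if np.array_equal(new_keep, keep):
            break  # converged
        keep = new_keep
    return keep

# 20 well-behaved points plus one artifact-like outlier uncertainty
fluxerr = np.concatenate([np.full(20, 1.0), [50.0]])
mask = sigma_clip_uncertainties(fluxerr)
```

The surviving points (mask True) are then passed on to the extinction and distance corrections.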

Following this, we account for cosmological time dilation using the host redshifts, z, and convert the observer frame time since trigger to a rest-frame time interval,

Equation (1): $t=\dfrac{T-{T}_{\mathrm{trigger}}}{1+z}$

where capital T refers to an observer frame time in MJD and lowercase t refers to a rest-frame time interval relative to trigger. We define trigger as the epoch at which the ZTF difference imaging detects a 5σ threshold change in flux.

We then calculate the luminosity distance, dL(z), to each transient using the known host redshift and assuming a ΛCDM cosmology with ΩM = 0.3, ΩΛ = 0.7 and H0 = 70. We correct the flux for this distance by multiplying each flux by ${d}_{L}^{2}$ and scaling by some normalizing factor, norm = 1018, to keep the flux values in a good range for floating-point machine precision. A measure for the distance-corrected flux, which is proportional to luminosity, is

Equation (2): ${L}_{\mathrm{data}}(t)=\left[F(t)-F{(t)}_{\mathrm{med}}\right]\dfrac{{d}_{L}^{2}(z)}{\mathrm{norm}}$

where F(t) is the raw flux value and F(t)med is the median value of the raw flux points that were observed before the trigger. This median value is representative of the background flux. Even for objects observed by a single survey, with a common set of passbands on a common photometric system, comparing the fluxes of different sources in the same rest-frame wavelength range requires that the light-curve photometry be transformed into a common reference frame, accounting for the redshifting of the sources. However, this k-correction (Hogg et al. 2002) requires knowledge of the underlying spectral energy distribution (SED) of each source, and therefore its type—the goal of this work. Therefore, we have not k-corrected these data into the rest frame, hence Ldata cannot be considered the true rest-frame luminosity in each passband.
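The distance correction of Equation (2) can be sketched numerically. This is a hedged illustration: the luminosity distance for the flat ΛCDM cosmology quoted in the text (ΩM = 0.3, ΩΛ = 0.7, H0 = 70) is computed by direct trapezoidal integration, where in practice a library such as astropy.cosmology would be used. All names are illustrative.

```python
import numpy as np

C_KM_S = 299792.458        # speed of light [km/s]
H0, OMEGA_M, OMEGA_L = 70.0, 0.3, 0.7
MPC_TO_CM = 3.0857e24      # megaparsec in centimetres
NORM = 1e18                # normalizing factor from the text

def luminosity_distance_cm(z, n=10000):
    """d_L(z) = (1 + z) * comoving distance, flat LambdaCDM."""
    zs = np.linspace(0.0, z, n)
    invE = 1.0 / np.sqrt(OMEGA_M * (1 + zs) ** 3 + OMEGA_L)
    # trapezoidal integral of dz / E(z), times the Hubble distance c/H0
    d_c = (C_KM_S / H0) * np.sum(0.5 * (invE[1:] + invE[:-1]) * np.diff(zs))
    return (1 + z) * d_c * MPC_TO_CM

def distance_corrected_flux(flux, pre_trigger_flux, z):
    """Equation (2): subtract the pre-trigger median (background) flux
    and scale by dL^2 / norm."""
    background = np.median(pre_trigger_flux)
    return (np.asarray(flux) - background) * luminosity_distance_cm(z) ** 2 / NORM
```

Because no k-correction is applied, the result is only proportional to a rest-frame luminosity, as the text cautions.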

2.4.1. Modeling the Early Light Curve

The ability to predict the class of an object as a function of time is one of the main advantages of RAPID over previous work. Critical to achieving this is determining the epoch at which transient behavior begins (usually the date of explosion) so that we can teach our DNN what pre-explosion data look like. Basic geometry suggests that a single explosive event should evolve in flux proportionally to the square of time (Arnett 1982). While future work might fit a more general power law, we are limited by the sparse and noisy data in the early light curve. Therefore, we model the pre-maximum part of each light curve in each passband, λ, with a simple t2 fit as follows:

Equation (3): ${L}_{\mathrm{mod}}^{\lambda }(t)={a}_{\lambda }{(t-{t}_{0})}^{2}H(t-{t}_{0})+{c}_{\lambda }$

where Lmod(t) is the modeled relative luminosity, t0 is the estimate of the time of the initial explosion or collision event, H(t − t0) is the Heaviside step function, and a and c are constants for the amplitude and intercept of the early light curve. The Heaviside function is used to model the t2 relationship after the explosion time, t0, and to fit a constant flux, c, before t0. We define the pre-maximum part of the light curve as the observations occurring up to the simulated peak luminosity of the light curve, tpeak. We emphasize that this early model is only required for the training set; we are therefore able to use the simulated peak time, which will not be available for observed data and is not used for the testing set.

We assume that the light curves from each passband share the same explosion date, t0, and fit the light curves from all passbands simultaneously. This is a five-parameter model: two free parameters, slope (aλ) and intercept (cλ), for each of the two passbands, and a shared t0 parameter. We optimize the model's fit to the light curve by first defining the chi-squared for each transient as

Equation (4): ${\chi }^{2}=\displaystyle \sum _{\lambda }\displaystyle \sum _{t\leqslant {t}_{\mathrm{peak}}}{\left[\dfrac{{L}_{\mathrm{data}}^{\lambda }(t)-{L}_{\mathrm{mod}}^{\lambda }(t)}{{\sigma }^{\lambda }(t)}\right]}^{2}$

where λ is the index over passbands, σ(t) are the photometric errors in Ldata, and the sum is taken over all observations at the position of the transient until the time of the peak of the light curve, tpeak.

We sampled the posterior probability $\propto \exp \left(-\tfrac{1}{2}{\chi }^{2}\right)$ using Markov Chain Monte Carlo (MCMC) with the affine-invariant ensemble sampler implemented in the Python package emcee (Foreman-Mackey et al. 2013). We set a flat uniform prior on t0 over a reasonable range before trigger, −35 < t0 < 0, and a flat improper prior on the other parameters. We use 200 walkers and initialize each walker with a Gaussian random number having the following mean values: the median of ${L}_{\mathrm{data}}^{\lambda }$ for cλ, the mean of ${L}_{\mathrm{data}}^{\lambda }$ for aλ, and −12 for t0. We ran each walker for 700 steps, a chain length that, after analyzing the convergence of a few MCMC chains, we deemed sufficient to find approximate best fits for a, c, and t0 without being too computationally expensive. The best-fit early light curve for an example Type Ia supernova in the training set is illustrated in Figure 3.
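The fit of Equations (3) and (4) can be sketched without MCMC: for a fixed t0, the model is linear in (aλ, cλ), so one can profile over a grid of explosion times and solve a weighted linear least-squares problem per passband. This is a hedged simplification of the paper's emcee sampling (it yields point estimates only, not posteriors); all names are illustrative.

```python
import numpy as np

def fit_early_model(t, lum, sigma, t0_grid=np.linspace(-35.0, 0.0, 176)):
    """t, lum, sigma: dicts keyed by passband, each holding numpy arrays
    of pre-maximum epochs, relative luminosities, and errors.
    Returns the best shared t0 and {band: (a, c)}."""
    best = (np.inf, None, None)
    for t0 in t0_grid:
        chi2, params = 0.0, {}
        for band in t:
            # Heaviside * (t - t0)^2 rise term of Equation (3)
            rise = np.where(t[band] >= t0, (t[band] - t0) ** 2, 0.0)
            # Weighted linear least squares for L = a * rise + c
            A = np.stack([rise, np.ones_like(rise)], axis=1) / sigma[band][:, None]
            b = lum[band] / sigma[band]
            (a, c), *_ = np.linalg.lstsq(A, b, rcond=None)
            params[band] = (a, c)
            chi2 += np.sum((A @ np.array([a, c]) - b) ** 2)
        if chi2 < best[0]:
            best = (chi2, t0, params)
    return best[1], best[2]
```

On noiseless synthetic data generated from Equation (3) with a shared t0, this recovers the explosion date to within the grid spacing.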

Figure 3.

Figure 3. Example preprocessed Type Ia supernova light curve from the ZTF simulated data set (redshift = 0.174). The normalized fluxes of the r and g passbands are plotted with errors, and the solid line is the best-fit model of the pre-maximum part of the light curve (up to tpeak) using Equations (3) and (4). The horizontal axis is plotted in the rest frame (redshift corrected), while the vertical axis is the relative de-reddened and distance-corrected flux (or relative luminosity). The vertical black solid line is the date that difference imaging records a trigger alert, and the vertical gray dashed line is the model's prediction of the explosion date with respect to trigger.


We summarize the selection criteria and preprocessing stages applied to the testing and training sets, as detailed in Sections 2.3–2.4, in Table 1.

Table 1.  Summary of the Cuts and Preprocessing Steps Applied to the Training and Testing Sets

Selection criteria Applied to
0 < z ≤ 0.5 Train & Test
b < −15° or b > 15° Train & Test
At least three points pre-trigger Train & Test
Preprocessing Applied to
Sigma-clipping fluxes Train & Test
Undilate the light curves by (1 + z) Train & Test
Correct light curves for distance Train & Test
Correct Milky Way extinction Train & Test
Rescale light curves between 0 and 1 Train & Test
Model early light curve to obtain t0 Train only^a
Keep only −70 < t < 80 days from trigger Train & Test

Note. The selection criteria help match the simulations to what we expect from the observed ZTF data stream.

^a Applied to the training set for designating the pre-explosion phase; applied to the test set only to evaluate performance.

2.5. Training Set Preparation

Irregularly sampled timeseries data is a common problem in machine learning, and is particularly prevalent in astronomical surveys where the intranight cadence choices and seasonal constraints lead to naturally arising temporal gaps. Therefore, once the light curves have been processed and t0 has been computed for each transient, we linearly interpolate between the unevenly sampled timeseries data. From this interpolation, we impute data points such that each light curve is sampled at 3-day intervals between −70 < t < 80 days since trigger (or as far as the observations exist), to give a vector of length n = 50, where we set the values outside the data range to zero. We ensure that each light curve in a given passband is sampled on the same 3-day grid. The final input image for each transient s is ${{\boldsymbol{I}}}^{s}$, which is a matrix with each row composed of the imputed light-curve fluxes for each passband and two additional rows containing repeated values of the host galaxy redshift in one row and the MW dust reddening in the other row. Hence, the input image is an n × (p + 2) matrix, where p is the number of passbands. This input image, ${{\boldsymbol{I}}}^{s}$, is illustrated as the Input Matrix in Figure 1.
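The imputation step above can be sketched as follows: each passband is linearly interpolated onto the shared 3-day grid spanning −70 < t < 80 days from trigger (n = 50 timesteps), zero-filled outside the observed range, with two extra rows repeating the host redshift and MW reddening. This is a hedged sketch under those stated assumptions; function and variable names are illustrative, not from the RAPID source.

```python
import numpy as np

GRID = np.arange(-70.0, 80.0, 3.0)  # shared 3-day grid, n = 50 timesteps

def make_input_matrix(obs, redshift, mwebv):
    """obs: dict of passband -> (times, fluxes) arrays.
    Returns the (p + 2) x n input matrix described in the text."""
    rows = []
    for band, (t, f) in obs.items():
        # linear interpolation; zero outside the observed time range
        rows.append(np.interp(GRID, t, f, left=0.0, right=0.0))
    rows.append(np.full(GRID.size, redshift))  # repeated host-redshift row
    rows.append(np.full(GRID.size, mwebv))     # repeated MW-reddening row
    return np.stack(rows)
```

Each transient's matrix is what the text calls the Input Matrix of Figure 1.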

One of the key differences between this work and previous light-curve classification approaches is our ability to provide time-varying classifications. Key to this is labeling the data at each epoch rather than assigning a single label to an entire light curve. Using the value of t0 computed in Section 2.4.1, we define two phases of each transient light curve: the pre-explosion phase (where t < t0), and the transient phase (where t ≥ t0). Therefore, the label for each light curve is a vector of length n identifying the transient class at each timestep. This n-length vector is subsequently one-hot encoded, such that each class is changed to a zero-filled vector with one element set to 1 to indicate the transient class (see Equation (7)). This transforms the n-length label vector into an n × (m + 1) matrix, where m is the number of transient classes. This is illustrated as the Class Matrix in Figure 1.
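The time-varying labels can be sketched in a few lines: class index 0 ("pre-explosion") before t0, the transient's class index afterwards, one-hot encoded into an n × (m + 1) matrix. This is a hedged illustration with hypothetical names, not the RAPID source.

```python
import numpy as np

def make_label_matrix(grid_times, t0, class_index, num_classes):
    """Return the n x (m + 1) one-hot Class Matrix: class 0 is
    'pre-explosion' (t < t0); class_index (1..m) applies for t >= t0."""
    labels = np.where(grid_times < t0, 0, class_index)  # n-length vector
    onehot = np.zeros((grid_times.size, num_classes + 1))
    onehot[np.arange(grid_times.size), labels] = 1.0
    return onehot
```

Each row has exactly one nonzero entry, matching the one-hot encoding of Equation (7).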

3. Model

3.1. Framing the Problem

In this work, we train a deep neural network (DNN) to map the light-curve data of an individual transient s onto probabilities over classes {c = 1, ..., (m + 1)}. The DNN models a function that maps an input multi-passband light-curve matrix, ${{\boldsymbol{I}}}^{{st}}$, for transient s up to a discrete time t, onto an output probability vector,

Equation (5): ${{\boldsymbol{y}}}^{{st}}={{\boldsymbol{f}}}_{t}({{\boldsymbol{I}}}^{{st}};{\boldsymbol{\theta }})$

where ${\boldsymbol{\theta }}$ are the parameters (e.g., weights and biases of the neurons) of our DNN architecture. We define the input ${{\boldsymbol{I}}}^{{st}}$ as an n × (p + 2) matrix17 representing the light curve up to a timestep, t. The output ${{\boldsymbol{y}}}^{{st}}$ is a probability vector with length (m + 1), where each element ${y}_{c}^{{st}}$ is the model's predicted probability of each class c (at each timestep), such that ${y}_{c}^{{st}}\geqslant 0$ and ${\displaystyle \sum }_{c=1}^{m+1}{y}_{c}^{{st}}=1.$

First, to quantify the discrepancy between the model probabilities and the class labels we define a weighted categorical cross-entropy,

Equation (6): ${H}_{t}({{\boldsymbol{Y}}}^{{st}},{{\boldsymbol{y}}}^{{st}})=-\displaystyle \sum _{c=1}^{m+1}{w}_{c}\,{Y}_{c}^{{st}}\,\mathrm{log}({y}_{c}^{{st}})$

where wc is the weight of each class, ${{\boldsymbol{Y}}}^{{st}}$ is the label for the true transient class at each timestep and is a one-hot encoded vector of length (m + 1) such that

Equation (7): ${Y}_{c}^{{st}}=\left\{\begin{array}{ll}1 & \text{if }c\text{ is the true class of transient }s\text{ at timestep }t\\ 0 & \text{otherwise}\end{array}\right.$

where the label, ${{\boldsymbol{Y}}}^{{st}}$, has two phases, the pre-explosion phase with class c = 1 when t < t0 and the transient phase with class c > 1 when t ≥ t0.

If weights were equal for all classes, Equation (6) is proportional to the negative log-likelihood of the probabilities of a categorical distribution (or a generalized Bernoulli distribution). However, to counteract imbalances in the distribution of classes in the data set which may cause more abundant classes to dominate in the optimization, we define the weight for each class c as

Equation (8): ${w}_{c}=\dfrac{N\times n}{{N}_{c}}$

where Nc is the number of times a particular class appears in the N × n training set.

We define the global objective function as

Equation (9): $C({\boldsymbol{\theta }})=\displaystyle \sum _{s=1}^{N}\displaystyle \sum _{t=1}^{n}{H}_{t}({{\boldsymbol{Y}}}^{{st}},{{\boldsymbol{y}}}^{{st}})$

where we sum the weighted categorical cross-entropy over all n timesteps and N transients in the training set. To train the DNN and determine optimal values of its parameters $\mathop{{\boldsymbol{\theta }}}\limits^{\unicode{x002C6}}$, we minimize this objective function with the widely used Adam gradient descent optimizer (Kingma & Ba 2015). The model ${{\boldsymbol{f}}}_{t}({{\boldsymbol{I}}}^{{st}};\mathop{{\boldsymbol{\theta }}}\limits^{\unicode{x002C6}})$ is represented by the DNN architecture illustrated in Figure 4 and described in the following section.
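A minimal numpy sketch of Equations (6)–(9) follows: inverse-frequency class weights and the weighted categorical cross-entropy summed over all timesteps and transients. The exact normalization of the weights is an assumption (the text only states that Nc counts class occurrences in the N × n training set); names are illustrative.

```python
import numpy as np

def class_weights(Y):
    """Inverse-frequency weights w_c from one-hot labels Y of shape
    (N, n, m + 1). Assumed normalization: total count / per-class count."""
    counts = Y.sum(axis=(0, 1))                  # N_c for each class
    return counts.sum() / np.maximum(counts, 1)  # w_c ~ 1 / N_c

def weighted_cross_entropy(Y, y, w, eps=1e-12):
    """Equations (6) and (9): weighted categorical cross-entropy summed
    over all transients, timesteps, and classes."""
    return -np.sum(w * Y * np.log(y + eps))
```

As expected, probabilities that agree with the labels yield a smaller objective than probabilities that contradict them.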

Figure 4.

Figure 4. Schematic of the deep recurrent neural network architecture used in RAPID. Each column in the diagram is one of the n timesteps of the processed light curve, while each row represents a different neural network layer. The gray text in each block states the shape of the output matrix of each layer in that block. The input image is composed of an n × (p + 2) matrix consisting of the light-curve fluxes, host redshift, and Milky Way reddening. Two uni-directional gated recurrent unit layers of size 100 are used for encoding and decoding the input sequences, respectively. It is in these RNN layers that information about previous timesteps is encoded. Batch normalization is applied between each layer to normalize the layer inputs and hence speed up the training process. Finally, a fully-connected (dense) layer with a softmax activation function is applied to compute the probability of each class at each timestep. We wrap the final layer in Keras' time-distributed layer so that each timestep is treated independently, using only information from the current and previous timesteps. To counter overfitting during training, we apply the dropout optimization technique (Srivastava et al. 2014) to the neurons in each of the GRU and batch normalization layers, with the dropout rate set to 20%.


3.2. Recurrent Neural Network Architecture

Recurrent Neural Networks (RNNs), such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, have been shown to achieve state-of-the-art performance in many benchmark timeseries and sequential-data applications (Bahdanau et al. 2014; Sutskever et al. 2014; Che et al. 2018). Their success in these applications is due to their ability to retain an internal memory of previous data, and hence capture long-term temporal dependencies of variable-length observations in sequential data. We extend this architecture to our case with a time-varying multi-channel (multiple passbands) input and a time-varying multi-class output.

Recently, Naul et al. (2018), Charnock & Moss (2017), Moss (2018), and Hinners et al. (2018) have used two RNN layers in this framework on astronomical timeseries data. However, our work differs from these by making use of uni-directional GRUs instead of bi-directional RNNs. Bi-directional RNNs pass information both forwards and backwards through the neural network representation of the light curve, and can hence preserve information on both the past and future at any timestep. However, this is only suitable for retrospective classification, because it requires waiting for the entire light curve to complete before obtaining a classification. The real-time classification used in our work is a novel approach in time-domain astronomy, but it necessitates the use of uni-directional RNNs. Hence, our two RNN layers read the light curve chronologically.

The deep neural network (DNN) is illustrated in Figure 4. We have developed the network with the high-level Python API Keras (Chollet 2015), built on the highly efficient TensorFlow machine-learning system (Abadi et al. 2016). We describe the architecture in detail here.

  • Input. As detailed in Section 2.5, the input is an n × (p + 2) matrix. However, as we are implementing a sequence classifier, we can consider the input at each timestep as being vector of length (p + 2).
  • First GRU Layer. Gated recurrent units are an improved version of a standard RNN and are a variation of the LSTM (see Chung et al. 2014 for a detailed comparison and explanation). We have selected GRUs instead of LSTMs in this work, as they provide appreciably shorter overall training time, without any significant difference in classification performance. Both are able to capture long-term dependencies in time-varying data with parameters that control the information that should be remembered at each step along the light curve. We use the first GRU layer to read the input sequence one timestep at a time and encode it into a higher-dimensional representation. We set up this GRU layer with 100 units such that the output is a vector of shape 1 × 100.
  • Second GRU Layer. The second GRU layer is conditioned on the input sequence. It takes the output of the previous GRU and generates an output sequence. Again, we use 100 units in the GRU to maintain the n × 100 output shape. We use uni-directional GRUs that enable only information from previous timesteps to be encoded and passed onto future timesteps.
  • Batch Normalization. We then apply batch normalization (first introduced in Ioffe & Szegedy 2015) to each GRU layer. This acts to improve and speed up the optimization while adding stability to the neural network and reducing overfitting. While training the DNN, the distribution of each layer's inputs changes as the parameters of the previous layers change. To make the parameter changes during training more stable, batch normalization rescales each layer's inputs by subtracting the batch mean and dividing by the batch standard deviation.
  • Dropout. We also implement dropout regularization between each layer of the neural network to reduce overfitting during training. This is an important step that effectively ignores randomly selected neurons during training such that their contribution to the network is temporarily removed. This process causes other neurons to more robustly handle the representation required to make predictions for the missing neurons, making the network less sensitive to the specific weights of any individual neuron. We set the dropout rate to 20% of the neurons present in the previous layer each time the Dropout block appears in the DNN in Figure 4.
  • Dense Layer. A dense (or fully-connected) layer is the simplest type of neural network layer. It connects all 100 neurons at each timestep in the previous layer, to the (m + 1) neurons in the output layer simply using Equation (10). As this is a classification task, the output is a vector consisting of all m transient classes and the Pre-explosion class. However, as we are interested in time-varying classifications, we wrap this Dense layer with a Time-Distributed layer, such that the dense layer is applied independently at each timestep, hence giving an output matrix of shape n × (m + 1).
  • Neurons. The output of each neuron in a neural network layer can be expressed as the weighted sum of the connections to it from the previous layer:
    Equation (10): ${y}_{i}=f\left(\displaystyle \sum _{j=1}^{M}{W}_{ij}\,{x}_{j}+{b}_{i}\right)$
    where xj are the different inputs to each neuron from the previous layer, Wij are the weights of the corresponding inputs, bi is a bias that is added to shift the threshold of where inputs become significant, j is an integer running from 1 to the number of connected neurons in the previous layer (M), and i is an integer running from 1 to the number of neurons in the next layer. For the Dense layer, x is simply the (1 × 100) output matrix of the GRU and batch normalization layers, y is made up of the (m + 1) output classes, i runs from 1 to (m + 1), and j runs across the 100 input neurons from the GRU. The matrices of weights and biases in the Dense layer and throughout the GRU layers are some of the free parameters that are computed by TensorFlow during the training process.
  • Activation function. As with any neural network, each neuron applies an activation function f(·) to bring non-linearity to the network and hence help it to adapt to a variety of data. For feed-forward networks it is common to make use of Rectified Linear Units (ReLU, Nair & Hinton 2010) to activate neurons. However, the GRU architecture uses sigmoid activation functions as it outputs a value between 0 and 1 and can either let no flow or complete flow of information from previous timesteps.
  • Softmax regression. The final layer applies the softmax regression activation function, which generalizes the sigmoid logistic regression to the case where it can handle multiple classes. It applies this to the Dense layer output at each timestep, so that the output vector is normalized to a value between 0 and 1 where the sum of the values of all classes at each timestep sums to 1. This enables the output to be viewed as a relative probability of an input transient being a particular class at each timestep. The output probability vector,
    Equation (11): ${{\boldsymbol{y}}}^{{st}}=\left({y}_{1}^{{st}},\,\ldots ,\,{y}_{m+1}^{{st}}\right)$
    is computed with a softmax activation function that is defined as
    Equation (12): ${y}_{c}^{{st}}=\mathrm{softmax}({x}_{c})=\dfrac{{e}^{{x}_{c}}}{{\displaystyle \sum }_{k=1}^{m+1}{e}^{{x}_{k}}}$
    We use the output softmax probabilities to rank the best matching transient classes for each transient light curve at each timestep.
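The architecture enumerated above can be sketched in Keras. This is a hedged reconstruction from the description in the text and Figure 4 (two uni-directional 100-unit GRU layers, batch normalization and 20% dropout after each, and a time-distributed dense softmax head); it is not the RAPID source, and the default sizes are only the values quoted in the paper.

```python
import tensorflow as tf
from tensorflow import keras

def build_model(n_timesteps=50, n_passbands=2, n_classes=12):
    """Map an n x (p + 2) input matrix to n x (m + 1) softmax probabilities."""
    model = keras.Sequential([
        keras.Input(shape=(n_timesteps, n_passbands + 2)),
        keras.layers.GRU(100, return_sequences=True),   # encoder
        keras.layers.BatchNormalization(),
        keras.layers.Dropout(0.2),
        keras.layers.GRU(100, return_sequences=True),   # decoder
        keras.layers.BatchNormalization(),
        keras.layers.Dropout(0.2),
        # softmax over the m transient classes plus "pre-explosion"
        keras.layers.TimeDistributed(
            keras.layers.Dense(n_classes + 1, activation="softmax")),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```

Because the GRUs are uni-directional and `return_sequences=True`, the probability vector at timestep t depends only on the current and earlier timesteps, which is what permits real-time classification.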

We reiterate that the overall architecture is simply a function that maps an input n × (p + 2) light-curve matrix onto an n × (m + 1) softmax probability matrix indicating the probability of each transient class at each timestep. In order to optimize the parameters of this mapping function, we specify a weighted categorical cross-entropy loss-function that indicates how accurately a model with given parameters matches the true class for each input light curve (as defined in Equation (6)).

We minimize the objective function defined in Equation (9) using the Adam stochastic gradient descent optimizer (Kingma & Ba 2015). As the class distribution is inevitably uneven, and the pre-explosion class is naturally over-represented because it appears in every light-curve label, we prevent bias towards over-represented classes by applying the class-dependent weights defined in Equation (8) during training.

The several layers in the DNN create a model that has over one hundred thousand free parameters. As we feed in our training set in batches of 64 light curves at a time, the neural network updates and optimizes these parameters. While the size of the parameter space seems daunting, the Adam optimizer computes individual adaptive learning rates for different parameters from estimates of the mean and variance of the gradients, and it has been shown to be highly effective at optimizing high-dimensional deep learning models.

With the often quoted "black box" nature of machine learning, it is always a worry that the machine-learning algorithms are learning traits that are specific to the training set but do not reflect the physical nature of the classes more generally. Ideally, we would like to ensure that the model we build both accurately captures regularities in the training data while simultaneously generalizing well to unseen data. Simplistic models may fail to capture important patterns in the data, while models that are too complex may overfit random noise and capture spurious patterns that do not generalize outside the training set. While we implement regularization layers (dropout) to try to prevent overfitting, we also monitor the performance of the classifier on the training and testing sets during training. In particular, we ensure that we do not run the classifier over so many iterations that the difference between the values of the objective function evaluated on the training set and the testing set become significant.

4. Results

In this section, we detail the performance of RAPID trained on simulated ZTF light curves. The data set consists of 48,029 transients split between 12 different classes, where each class has approximately 4000 transients. We trained our DNN on 60% of this set and tested its performance on the remaining 40%. The data were preprocessed using the methods outlined in Sections 2.3–2.5. Processing this set and then training the DNN on it was computationally expensive, taking several hours. Once the DNN is trained, however, it is able to classify several thousand transients within a few seconds.

4.1. Hyper-parameter Optimization

One of the key criticisms of deep neural networks is that they have many hyper-parameters describing the architecture that need to be set before training. As training our DNN architecture takes several hours, optimizing the hyper-parameter space by testing the performance of a range of setup values is a very slow process that most similar works have not attempted. Despite this challenge, we performed a broad grid search over our DNN hyper-parameters: the number of neurons in each of the two GRU layers and the dropout fraction. After testing 12 different hyper-parameter configurations, we found only a 2% variation in the overall accuracy. The hyper-parameters shown in Figure 4 were the best-performing set.

4.2. Accuracy

We go beyond previous attempts at photometric classification in two important ways. First, we aim to classify a much larger variety of sparse multi-passband transients, and second, and most significantly, we provide classifications as a function of time. An example of this is illustrated in Figure 5. At each epoch along the light curve, the trained DNN outputs a softmax probability for each transient class. As more photometric data are provided along the light curve, the DNN updates the class probabilities of the transient based on the state of the network at previous timesteps plus the new timeseries data. Within just a few days of the explosion, and well before the ZTF trigger, the DNN was able to correctly determine that the transient had evolved from pre-explosion to a SNIa.

Figure 5.

Figure 5. An example normal SNIa light curve from the testing set (redshift = 0.174) is shown in the top panel, and the softmax classification probabilities from RAPID are plotted as a function of time over the light curve in the bottom panel. The plot shows the rest-frame time since trigger. The vertical gray dashed line is the predicted explosion date from our t2 model fit of the early light curve (see Section 2.4.1). Initially, the object is correctly predicted to be pre-explosion, before being more confidently predicted as a SNIa-Normal from 20 days before the trigger. Hence, the neural network predicts the explosion date only four days after the early light-curve model fit's prediction. The confidence in the predicted classification improves over the lifetime of the transient. (An animation of this figure is available.)


To assess the performance of RAPID, we make use of several metrics. The most obvious metric is simply the accuracy, that is, the ratio of correctly classified transients in each class to the total number of transients in each class. At each epoch along every light curve in the testing set, we select the highest probability class and compare this to the true class. After aligning each light curve by its trigger, we obtained the prediction accuracy of each class as a function of time since trigger. This is plotted in Figure 6.

Figure 6.

Figure 6. Classification accuracy of each transient class as a function of time since trigger in the rest frame. The values correspond to the diagonals of the confusion matrices at each epoch.


The total classification accuracy of each class in the testing set increases quickly before trigger, but then begins to flatten out, with only small increases after approximately 20 days post-trigger. For most classes, the transient behavior of the light curve is generally nearing completion at this stage, and hence we can expect that new photometric data add little to improving the classification as the brightness tends towards the background flux level. The classification performance for the core-collapse supernovae, SNIbc and SNII, is particularly poor. To better understand this, it is useful to see where misclassifications occurred.

4.3. The Confusion Matrix

The confusion matrix is often a good way to visualize this. Typically, each entry in the matrix counts the number of transients of the true class, c, that had the highest predicted probability in class $\mathop{c}\limits^{\unicode{x002C6}}$. For ease of interpretability, we make use of a specially normalized confusion matrix in this work. We normalize the confusion matrix such that the (c, $\mathop{c}\limits^{\unicode{x002C6}}$) entry is the fraction of transients of the true class c that are classified into the predicted class $\mathop{c}\limits^{\unicode{x002C6}}$. With this normalization, each row in the matrix must sum to 1. Therefore, each row is an estimate of the classifier's conditional distribution of (maximum-probability) predicted labels given each true class label.
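The row normalization described above can be sketched directly: entry (c, ĉ) is the fraction of true-class-c objects assigned to predicted class ĉ, so each row sums to 1. Names are illustrative.

```python
import numpy as np

def normalized_confusion_matrix(true_labels, pred_labels, num_classes):
    """Row-normalized confusion matrix: row c is the classifier's
    conditional distribution of predicted labels given true class c."""
    cm = np.zeros((num_classes, num_classes))
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1)  # avoid division by zero for empty rows
```

For example, two true-class-0 objects split between predictions 0 and 1 give a first row of (0.5, 0.5).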

In Figure 7, we plot the normalized confusion matrices at an early (two days post-trigger) and late (40 days post-trigger) stage of the light curve. In the online material, we provide an animation of this confusion matrix evolving with time since trigger (instead of just the two epochs shown here).

Figure 7.

Figure 7. Normalized confusion matrices of the 12 transient classes at 2 days past trigger (top), and at 40 days past trigger (bottom). The confusion matrices show the classification performance tested on 40% of the data set after the classifier was trained on 60% of the data set. The color bar and cell values indicate the fraction of each True Label that were classified as the Predicted Label. Negative color bar values are used only to indicate misclassifications. Please see https://www.ast.cam.ac.uk/~djm241/rapid/cf.gif for an animation of the evolution of the confusion matrix as a function of time since trigger (showing epochs from −25 to 70 days from trigger).


The overall classification performance is, as expected, slightly better at the late phase of the light curve. However, the performance only two days after trigger is particularly promising for identifying transients at early times and assembling a well-motivated follow-up candidate list. SNe Ia have the highest classification accuracy at early times, with most misclassifications occurring among other subtypes of Type Ia supernovae. At late times, the intermediate-luminosity transients and TDEs are best identified. The core-collapse supernovae (SNIbc, SNII) appear to be most often confused with calcium-rich transients and other supernova types. CARTs are a newly discovered class of transients, and their physical mechanism is not yet well understood. However, the confusion most likely stems from their fast rise times, similar to many core-collapse SNe. This is illustrated in Figure 8.


Figure 8. Three of the simulated light curves from our sample: a correctly classified CART (top), a SNIbc incorrectly classified as a CART (middle), and a correctly classified SNIbc (bottom). The dark-colored square markers are the r-band fluxes and the lighter-colored circle markers are the g-band fluxes. Our classifier is sensitive to light-curve shape, and the limited color information available with ZTF leads to a significant fraction of SNIbc objects being misclassified as CARTs. We expect classification performance to improve with LSST, which will provide ugrizy light curves.


4.4. Receiver Operating Characteristic Curves

The confusion matrix is a good measure of the performance of a classifier, but it does not make use of the full suite of probability vectors we obtain for every transient, and instead uses only the highest scoring class. The receiver operating characteristic (ROC) curve, on the other hand, makes use of the classification probabilities. Instead of selecting just the highest probability class, we use a probability threshold ${p}_{\mathrm{thresh}}$. For each class c, transient s at time t is considered to be classified as c if ${y}_{c}^{st}\gt {p}_{\mathrm{thresh}}$. We sweep the values of ${p}_{\mathrm{thresh}}$ between 0 and 1. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) for these different probability thresholds. In a multi-class framework, the TPR is a measure of recall or completeness; it is the ratio between the number of correctly classified objects in a particular class (TP) and the total number of objects in that class (TP + FN).

${\rm TPR}=\frac{{\rm TP}}{{\rm TP}+{\rm FN}}$ (13)

Conversely, the FPR is a measure of the false alarm rate; it is the ratio between the number of transients that have been misclassified as a particular class (FP) and the total number of objects in all other classes (FP + TN).

${\rm FPR}=\frac{{\rm FP}}{{\rm FP}+{\rm TN}}$ (14)

A good classifier is one that maximizes the area under the ROC curve (AUC): a perfect classifier has an AUC = 1, while a randomly guessing classifier has an AUC = 0.5. Typically, classifiers with values above 0.9 are considered very good. In Figure 9, we plot the ROC curves at an early and a late phase in the light curve. The classification performance is very good, with several classes having AUC values above 0.99 and the overall micro-averaged values being 0.95 at the early stage and 0.98 at the late stage. The macro-averaged ROC is simply the average of all of the ROC curves computed independently. In contrast, the micro-averaged ROC aggregates the TPR and FPR contributions of all classes, and is equivalent to the weighted average of all the ROCs, accounting for the number of transients in each class. As the class distribution in the data set is not strongly imbalanced, these two values are quite close.
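The micro- and macro-averaged AUCs can be computed with scikit-learn from per-class probability vectors; the sketch below uses mock probabilities (boosted toward the true class so that the AUCs are high), not our classifier's outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

rng = np.random.default_rng(1)
n_classes = 3
y_true = rng.integers(0, n_classes, size=300)
# Mock probabilities that favor the true class, so the resulting AUCs are high:
y_prob = rng.dirichlet(np.ones(n_classes), size=300)
y_prob[np.arange(300), y_true] += 1.0
y_prob /= y_prob.sum(axis=1, keepdims=True)

y_bin = label_binarize(y_true, classes=list(range(n_classes)))
auc_micro = roc_auc_score(y_bin, y_prob, average="micro")  # aggregated TPR/FPR
auc_macro = roc_auc_score(y_bin, y_prob, average="macro")  # mean of per-class AUCs
```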


Figure 9. Receiver operating characteristic (ROC) curves for the 12 transient classes at an early epoch at two days past trigger (left), and at a late epoch at 40 days past trigger (right). Each curve represents a different transient class with the area under the ROC curve (AUC) score in the brackets. The macro-average and micro-average curves which are an average and weighted-average representation of all classes, respectively (see Section 4) are also plotted. We compute the metric on the 40% of the data set used for testing. Please see an animation of the evolution of the ROC curve as a function of time since trigger at https://www.ast.cam.ac.uk/~djm241/rapid/roc.gif.


In the online material we plot an animation of the ROC curve evolving in time since trigger, rather than the two phases plotted here. As a still-image summary, we plot the AUC of each class as a function of time since trigger in Figure 10. Within 5 to 20 days after trigger, the AUC flattens out for most classes. We see that the ILOTs, kilonovae, SLSNe, and PISNs are predicted with high accuracy well before trigger. These transients are fainter than most other classes, and hence do not trigger an alert until their light curves approach maximum brightness. This means that by the time a trigger happens, the transient behavior of the light curve is mature, and the classifier has more information with which to make a confident prediction. On the other hand, the two core-collapse supernova classes, SNII and SNIbc, have comparatively low AUCs. We expect that additional color information will help to separate these from other supernova types. The overall performance is best illustrated by the micro-averaged AUC shown as the dotted blue curve. The AUC is initially low due to confusion with the Pre-explosion class, but within just a few days after trigger it plateaus at a very respectable 0.98.


Figure 10. Area under the ROC curve (AUC) of each class as a function of time since trigger. Figure 9 illustrates the ROC curves at two epochs with the AUC of each listed in the legend; this plot shows how the AUC evolves with time for each class, and is a still-representation of the animation of the ROC curves shown in the online material. The overall performance of the classifier is best judged with the shape of the "micro-average" curve.


4.5. Precision-recall

We next compute the precision-recall metric, which has been shown to be particularly informative for classifiers trained on imbalanced data sets (Saito & Rehmsmeier 2015). The precision (also known as purity) is a measure of the number of correct predictions in each class compared to the total number of predictions of that class, and is defined as

${\rm Precision}=\frac{{\rm TP}}{{\rm TP}+{\rm FP}}$ (15)

The recall (also known as completeness) is the same as the true positive rate. It is a measure of the number of correct predictions in each class compared to the total number of that class in the testing set, and is defined as

${\rm Recall}=\frac{{\rm TP}}{{\rm TP}+{\rm FN}}$ (16)

A good classifier will have both high precision and high recall, and hence a large area under the precision-recall curve. In making the precision-recall plot in Figure 11, instead of simply selecting the class with the highest probability for each object, we apply a probability threshold. By using a very high probability threshold (instead of just selecting the most probable class), we can obtain a much purer subset of classifications. The PISN, SNIa-norm, SNIa-91bg, kilonova, point-Ia, and ILOT classes have very good precision and recall at the late epoch, and quite respectable values at the early epoch. The core-collapse SNe and CARTs are again shown to perform poorly. Overall, this plot highlights some flaws in the classifier that the previous metrics did not capture. In particular, the CART class performs much more poorly than the previous metrics suggested: it does not have a high precision, and there are many false positives. As there are fewer CARTs in the test set than other classes, this was not as obvious in the other metrics (see Saito & Rehmsmeier (2015) for an analysis of precision-recall vs. ROC curves as classification metrics).
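A minimal one-vs-rest sketch of the probability-threshold sweep behind such a plot, using scikit-learn on mock scores (names and data are illustrative, not drawn from our classifier):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, auc

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=500)        # one-vs-rest membership labels
# Mock scores in which class members tend to score higher:
y_score = 0.5 * y_true + 0.5 * rng.random(500)

# Sweep thresholds and trace out precision and recall at each one
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)              # area under the PR curve
```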


Figure 11. Precision-recall curves for the 12 transient classes at an early epoch at two days past trigger (left), and at a late epoch at 40 days past trigger (right). We compute the metric on the 40% of the data set used for testing. Please see an animation of the evolution of the Precision-Recall as a function of time since trigger at https://www.ast.cam.ac.uk/~djm241/rapid/pr.gif.


4.6. Weighted Log Loss

In each of the previous metrics we have treated each class equally. However, it is often useful to weight the classifications of particular classes more favorably than others. Malz et al. (2018) recently explored the sensitivity of a range of different metrics of classification probabilities under various weighting schemes. They concluded that a weighted log loss provided the most meaningful interpretation, defined as follows:

$L=-\left(\displaystyle \sum _{c=1}^{m+1}\frac{{w}_{c}}{{N}_{c}}\displaystyle \sum _{s=1}^{{N}_{c}}{Y}_{c}^{st}\,\mathrm{ln}\,{y}_{c}^{st}\right)\Big/\displaystyle \sum _{c=1}^{m+1}{w}_{c}$ (17)

where c is an index over the (m + 1) classes, s is an index over the Nc members of each class, ${y}_{c}^{st}$ is the predicted probability that object s at time t is a member of class c, and ${Y}_{c}^{st}$ is the corresponding truth label (1 if object s is a member of class c, and 0 otherwise). The weight of each class wc can be different, and Nc is the number of objects in each class.

This metric is currently being used in the PLAsTiCC Kaggle challenge to assess the classification performance of each entry. We apply a weight of 2 to the classes that PLAsTiCC deemed to be rare or interesting and 1 to the remaining classes:

  • Weight 1. SNIa, SNIbc, SNII, Ia-91bg, Ia-x, Pre-explosion
  • Weight 2. Kilonova, SLSN, PISN, ILOT, CART, TDE
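A minimal re-implementation of this weighted log loss, under the assumption that the formula follows the PLAsTiCC-style definition above (our own sketch, not the official PLAsTiCC scoring code; all names are ours):

```python
import numpy as np

def weighted_log_loss(y_true, y_prob, weights, eps=1e-15):
    """Weighted log loss: per-class mean log-probability, weighted and summed."""
    p = np.clip(y_prob, eps, 1 - eps)          # avoid log(0)
    total = 0.0
    for c, w_c in enumerate(weights):
        mask = (y_true == c)
        if mask.any():                          # skip classes absent from the set
            total += w_c * np.log(p[mask, c]).mean()   # (w_c / N_c) * sum of ln y
    return -total / np.sum(weights)

# A perfect classifier scores ~0:
y_true = np.array([0, 1, 2, 1])
loss_perfect = weighted_log_loss(y_true, np.eye(3)[y_true], weights=[1, 2, 1])
```

A uniform (uninformative) three-class prediction scores ln 3 under this definition, which gives a useful baseline against which to read Figure 12.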

We plot the weighted log loss as a function of time since trigger in Figure 12. The metric at the early (two days after trigger) and late (40 days after trigger) epochs is 1.09 and 0.64, respectively, where a perfect classifier receives a score of 0. While we have applied our classifier to ZTF simulations, we find that the raw scores are competitive with top scores in the PLAsTiCC Kaggle challenge. The sharp improvement in performance at trigger is primarily due to the prior placed on the pre-explosion phase of the light curve, which forces it to be before trigger. Within approximately 20 days after trigger, the classification performance plateaus, as the transient phase of most light curves is ending.


Figure 12. Weighted log loss defined in Equation (17) and used in PLAsTiCC (Malz et al. 2018) is plotted as a function of time since trigger. The weighted log loss of the early (two days after trigger) and late (40 days after trigger) epochs are 1.09 and 0.64, respectively.


5. Application to Observational Data

One of the primary challenges with developing classifiers for astronomical surveys is obtaining a labeled sample of well-observed transients across a wide range of classes. While it may be possible to obtain a labeled sample of common supernovae during the initial stages of a survey, the observation rates of less common transients (such as kilonovae and CARTs, for example) mean that a diverse and large training set of observed data is impossible to obtain. Therefore, a classifier that is trained on simulated data but can classify observational data streams is of significant importance to the astronomical community. To this end, a key goal of RAPID is to be able to classify observed data using an architecture trained on only simulated light curves. In this section, we provide a few examples of RAPID's performance on transients from the ongoing ZTF data stream. In future work, we hope to extend this analysis to test the classification performance on a much larger set of observed light curves.

In Figure 13 we have tested RAPID on three objects recently observed by ZTF to highlight its direct use on observational data: ZTF18abxftqm, ZTF19aadnmgf, and ZTF18acmzpbf (also known as AT2018hco, SN2019bly, and SN2018itl, respectively). These have already been spectroscopically classified as a TDE (z = 0.09), SNIa (z = 0.08), and SNIa (z = 0.036), respectively (van Velzen et al. 2018; Fremling et al. 2018, 2019). In the bottom panel of Figure 13, we see that RAPID was able to correctly confirm the class of each transient well before maximum brightness, and within just a couple of epochs after trigger.


Figure 13. Classification of three light curves from the observed ZTF data stream. In each of the three example cases, RAPID correctly classifies the transient well before peak brightness, and often within just a few epochs. The ZTF names are listed in the titles of each transient plot and from left to right they are also known by the following names: AT2018hco, SN2019bly, and SN2018itl. These objects were spectroscopically classified as a TDE (z = 0.09), SNIa (z = 0.08), and SNIa (z = 0.036), respectively (van Velzen et al. 2018; Fremling et al. 2018, 2019).


The two SNIa light curves were correctly identified after just one detection epoch, and the confidence in these classifications improved over the lifetime of the transient. While the SNIa-norm probability was lower for ZTF18acmzpbf, this is a good example of where the confidence in this transient being any subtype of SNIa was actually much higher. Given that the second most probable class was a SNIa-91bg, we can sum the two class probabilities to obtain a much higher probability of the transient being a SNIa.

While we have shown RAPID's effective performance on some observational data, future revisions of the software can be used to identify differences between the simulated training set and observations. This will help to improve the transient class models that were used to generate the light-curve simulations. Future iterations to improve the simulations will in turn lead to a classifier that is even more effective at classifying observational data.

Moreover, as it stands, RAPID can classify 12 different transient classes. However, if an unforeseen transient were passed into the classifier, the class probabilities would split between the classes most similar to the input. RAPID is a supervised learning algorithm, and is not designed for anomaly detection. However, cases where RAPID is not confident in a classification may warrant closer attention for the possibility of an unusual transient.
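One simple way an alert broker might act on this is sketched below: flag any object whose maximum class probability falls below some threshold (the 0.5 cutoff and the variable names are our illustrative choices, not part of RAPID):

```python
import numpy as np

# Two mock probability vectors over three classes:
y_prob = np.array([[0.85, 0.10, 0.05],     # confident prediction
                   [0.40, 0.35, 0.25]])    # probability split between classes

# Flag objects for which no single class dominates
uncertain = y_prob.max(axis=1) < 0.5
```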

5.1. Balanced or Representative Data Sets

Machine-learning-based classifiers such as neural networks often fail to cope with imbalanced training sets, as they are sensitive to the proportions of the different classes (Kotsiantis et al. 2006). As a consequence, these algorithms tend to favor the class with the largest proportion of observations, which is particularly problematic when trying to classify rare classes. The intrinsic rate of some majority classes, such as SNe Ia, is orders of magnitude higher than that of some rare classes, such as kilonovae. A neural network classifier trained on such a representative data set, and that aims to minimize the overall unweighted objective function (Equation (9)), will be incentivized to learn how to identify the most common classes rather than the rare ones. An example of the poorer performance of classifiers trained on representative transient data sets compared with balanced data sets is illustrated well by Figure 7 of Narayan et al. (2018): the t-distributed stochastic neighbor embedding (t-SNE; van der Maaten & Hinton 2008) plot shown there clusters classes much more accurately when the data set is balanced.

Moreover, building a representative data set is a very difficult task, as spectroscopic samples are not representative of photometric ones. The detection rates of different transient classes in photometric surveys are often biased by brightness and by the ease with which some classes can be classified over others. Spectroscopic follow-up strategies have been dominated by SNe Ia for cosmology, and have hence led to biased spectroscopic sets. Recently, however, Ishida et al. (2019) proposed a framework for constructing a more representative training set. They make use of real-time active learning to improve the way labeled samples are obtained from spectroscopic surveys, ultimately optimizing the scientific performance of photometric classifiers. Employing such a framework will allow for the construction of a more representative training set.

In our work, we simulate a balanced training set in an attempt to mitigate the effects of the bias present in existing data sets and to improve our classifier's accuracy on rare classes. While machine-learning classifiers tend to perform better on balanced data sets (Kotsiantis et al. 2006; Narayan et al. 2018), future work should verify this for photometric identification by comparing the performance of classifiers trained on balanced and representative training sets. Until the mismatch between spectroscopic and photometric samples is mitigated with approaches like that of Ishida et al. (2019), however, a representative data set remains difficult to build.

6. Feature-based Early Classification

To compare the performance of RAPID against traditional light-curve classification approaches, which often classify on extracted light-curve features, we developed a Random Forest-based classifier that uses statistical features computed from each light curve as input, rather than the photometric information directly. This has been the most commonly used approach in light-curve classification tasks to date (e.g., Lochner et al. 2016; Narayan et al. 2018; Möller et al. 2016; Newling et al. 2011; Karpenka et al. 2013; Revsbech et al. 2018). We extend the approach developed in Narayan et al. (2018), itself based on Lochner et al. (2016), by using a wider variety of important features and by extending the problem to early light-curve classification. Specifically, we compute features using only data up to 2 days after trigger so that the Random Forest classifier can be directly compared with the DNN early classifications.

Extracting features from light curves provides a uniform way of comparing transients that are often unevenly sampled in time: we can train directly on a synthesized feature vector instead of the photometric light curves. For time series data, extracting moments is the most obvious way to start obtaining features. We compute several moments of the light curve, and the distilled features used in classification are listed in Table 2. While we focus on early classification in this paper, we also list some full-light-curve features, used in work not shown here, that some readers may find useful. As we have two different passbands, we compute the features for each passband and obtain twice the number of moment-based features listed in the table. We also make use of more context-specific features, such as redshift and color information.

Table 2.  Description of the Features Extracted from Each Passband of Each Light Curve in the Data Set. Some of These are Redefined from Table 2 of Narayan et al. (2018)

Early Light-curve Features Only
Early rise rate Slope of early light curve (see Equation (18)).
a Amplitude of quadratic fit to the early light curve (see Equation (3)).
c Intercept of quadratic fit to the early light curve (see Equation (3)).
Color at n days Logarithmic ratio of the flux in two passbands (see Equation (19)).
Early color slope Slope of the color curve.
Early and Full Light-curve Features
Redshift Photometric cosmological redshift.
Milky Way dust extinction Interstellar extinction.
Historic color Logarithmic ratio of the flux in two passbands before trigger (Equation (19)).
Variance Statistical variance of the flux distribution.
Amplitude Ratio of the 99th minus 1st and 50th minus 1st percentile of the flux distribution.
Standard deviation/mean A measure of the average inverse signal-to-noise ratio.
Median absolute deviation A robust estimator of the standard deviation of the distribution.
Autocorrelation integral The integral of the correlation versus time difference (Mislis et al. 2016).
Von-Neumann ratio A measure of the autocorrelation of the flux distribution.
Entropy The Shannon entropy assuming a Gaussian CDF following Mislis et al. (2016).
Rise time Time from trigger to peak flux.
Full Light-curve Features Only
Kurtosis Characteristic "peakedness" of the magnitude distribution.
Shapiro-Wilk statistic A measure of the flux distribution's normality.
Skewness Characteristic asymmetry of the flux distribution.
Interquartile range The difference between the 75th and 25th percentile of the flux distribution.
Stetson K An uncertainty weighted estimate of the kurtosis following Stetson (1996).
Stetson J An uncertainty weighted estimate of the Welch-Stetson Variability Index (Welch & Stetson 1993).
Stetson L Product of the Stetson J and Stetson K moments (Kinemuchi et al. 2006; Stetson 1996).
HL ratio The ratio of the amplitudes of points higher and lower than the mean.
Fraction of observations above trigger Fraction of light-curve observations above the trigger.
Period Top-ranked periods from the Lomb-Scargle periodogram fit of the light curves (Lomb 1976; Scargle 1982).
Period score Period weights from the Lomb-Scargle periodogram fit of the light curves (Lomb 1976; Scargle 1982).
Color amplitude Ratio of the amplitudes in two passbands.
Color mean Ratio of the mean fluxes in two passbands.

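Several of the moment-based features in Table 2 can be computed directly with NumPy; the snippet below is our own minimal sketch on a toy flux array, not RAPID's feature-extraction code:

```python
import numpy as np

flux = np.array([0.1, 0.3, 0.9, 2.0, 1.4, 0.7, 0.2])   # toy flux time series

variance = np.var(flux)                                 # statistical variance
std_over_mean = np.std(flux) / np.mean(flux)            # average inverse-S/N proxy
mad = np.median(np.abs(flux - np.median(flux)))         # median absolute deviation
p01, p25, p50, p75, p99 = np.percentile(flux, [1, 25, 50, 75, 99])
amplitude = (p99 - p01) / (p50 - p01)                   # percentile-ratio amplitude
iqr = p75 - p25                                         # interquartile range
```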

We compute the early rise rate feature for each passband as the slope of the fitted early light-curve model defined in Section 2.4.1:

$\mathrm{early}\ \mathrm{rise}\ \mathrm{rate}=\frac{\partial {L}_{\mathrm{mod}}^{\lambda }}{\partial t}$ (18)

We use the rise rate, and the early light-curve model parameter fits $\hat{{\boldsymbol{a}}}$ and $\hat{{\boldsymbol{c}}}$ from Equation (3) as features in the early classifier. We then define the color as a function of time:

$\mathrm{color}(t)=\mathrm{log}\left(\frac{{L}_{\mathrm{mod}}^{g}(t)}{{L}_{\mathrm{mod}}^{r}(t)}\right)$ (19)

where ${L}_{\mathrm{mod}}^{\lambda }(t)$ is the modeled relative luminosity (defined in Equation (3)) at a particular passband, λ.

We use the color curves computed from each transient to define several features. Using Equation (19), we compute the color of each object at a couple of well-spaced points on the early light curve (five days and nine days after t0) and use them as features in our early classifier. We also compute the slope of the color curve and use that as an additional feature for the early classifier. For the full light-curve classifier, we compute the color amplitude as the difference in the light-curve amplitudes in two different passbands, and also compute the color mean as the ratio of the mean flux value of two different passbands.
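These color features can be sketched as follows, assuming the early model of Equation (3) is a quadratic of the form a(t − t0)² + c and the color of Equation (19) is a logarithmic g/r flux ratio (the fit parameters below are hypothetical, chosen for illustration only):

```python
import numpy as np

def early_model(t, a, c, t0=0.0):
    """Assumed quadratic early light-curve model, L(t) = a (t - t0)^2 + c."""
    return a * (t - t0) ** 2 + c

a_g, c_g = 0.04, 0.5    # hypothetical g-band fit parameters
a_r, c_r = 0.03, 0.6    # hypothetical r-band fit parameters

def color(t):
    # Logarithmic flux ratio of the two passbands
    return np.log10(early_model(t, a_g, c_g) / early_model(t, a_r, c_r))

color_5d, color_9d = color(5.0), color(9.0)          # colors at 5 and 9 days
color_slope = (color_9d - color_5d) / (9.0 - 5.0)    # early color-slope feature
```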

We feed the feature set into a Random Forest classifier. Random Forests (Breiman 2001) are one of the most flexible and popular machine-learning architectures. They construct an ensemble of several fully grown and uncorrelated decision trees (Morgan & Sonquist 1963) to create a more robust classifier that limits overfitting. Each decision tree is made up of a series of hierarchical branches that check whether values in the feature vector fall in particular ranges, terminating in leaves that assign class labels. The trees are trained recursively and independently, selecting at each branch the feature and boundary that provide the highest information gain for classification. A single tree is subject to high variance and can easily overfit the training set. By combining an ensemble of decision trees, each trained on a bootstrap sample of the data (drawn randomly with replacement), a Random Forest decreases the variance by averaging the results from the individual trees.

We have designed the Random Forest with 200 estimators (or trees) and have run it twice. On the first run, we feed the classifier the entire feature set. We then rank the features by their importance in classification and select the top 30. We feed only these top 30 features into the second run of the classifier. As many of the features are highly correlated with each other, this acts to reduce feature dilution by removing features that do not provide high selective power.
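The two-pass procedure can be sketched with scikit-learn (toy data; the informative features and all names are ours for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 50))               # 400 objects, 50 candidate features
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # labels driven by features 0 and 1

# First pass: train on all features, with 200 trees as in the text
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:30]   # rank by importance

# Second pass: retrain using only the 30 most important features
rf_top = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:, top], y)
```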

We compute features using light curves up to only the first 2 days after trigger. As the Random Forest is much quicker to train and classify with than the DNN, we perform ten-fold cross-validation to obtain a more robust estimate of the classifier's performance. We then produce the confusion matrix in Figure 14 and the ROC curve in Figure 15. We can compare these metrics to the early epoch metrics at 2 days after trigger produced with the deep neural network in Figures 7 and 9. We find that the performance in the early part of the light curve is marginally worse than that of the DNN, with a micro-averaged AUC of 0.92 compared to 0.95. Moreover, the ability of the DNN to provide time-varying classifications makes it much better suited to early classification than the Random Forest.


Figure 14. Confusion matrix of the early light-curve Random Forest classifier trained on 60% and tested on 40% of the data set described in Section 2. The classifier makes use of 200 estimators (trees) in the ensemble. The color bar and values indicate the percentage of each true label that were classified as the predicted label. Negative color bar values are used only to indicate misclassifications.


Figure 15. Receiver operating characteristic of the feature-based Random Forest approach. The features are computed on photometric data up to two days past trigger, and are fed into a Random Forest classifier. Each curve represents a different transient class with the AUC score in brackets. The macro-average and micro-average curves, which are an average and weighted-average representation of all classes, respectively, are also plotted. The metric is computed on the 40% of the data set used for testing.


In Figure 16, we rank the importance of the top 30 features in the Random Forest classifier. While redshift is clearly the most important feature in the data set, we have also built classifiers without redshift as a feature and found that the performance was only marginally worse. This provides insight into the classifier's robustness when applied to surveys where redshift is not available. The next most important feature is the historic color, suggesting that the type of host galaxy is important contextual information for discerning transients. The early slope of the light curve also ranks highly, as it is able to distinguish between faster-rising core-collapse supernovae and slower-rising transients such as magnetars (SLSNe).


Figure 16. Features used in the early classifier ranked by importance. The bar height is the average relative importance of each feature in each tree of the random forest. The error bar is the standard deviation of the importance of each feature in the 200 trees of the Random Forest. Only the top 30 features have been plotted in the histogram.


7. Conclusions

Existing and future wide-field optical surveys will probe new regimes in the time-domain, and find new astrophysical classes, while enabling a deeper understanding of presently rare classes. In addition, correlating these sources with alerts from gravitational wave, high-energy particle, and neutrino observatories will enable new breakthroughs in multimessenger astrophysics. However, the alert-rate from these surveys far outstrips the follow-up capacity of the entire astronomical community combined. Realizing the promise of these wide-field surveys requires that we characterize sources from sparse early time data, in order to select the most interesting objects for more detailed analysis.

We have detailed the development of a new real-time photometric classifier, RAPID, which is well suited for the millions of alerts per night that ongoing and upcoming wide-field surveys such as ZTF and LSST will produce. The key advantages that distinguish our approach from others in the literature are:

  • 1.  
    Our deep recurrent neural network with uni-directional gated recurrent units allows us to classify transients using the available data as a function of time.
  • 2.  
    Our architecture, combined with a diverse training set, allows us to identify 12 different transient classes within days of explosion, despite low S/N data and limited color information.
  • 3.  
    We do not require user-defined feature extraction before classification, and instead use the processed light curves as direct inputs.
  • 4.  
    Our algorithm is designed from the outset with speed as a consideration, and it can classify, within a few seconds, the tens of thousands of events that will be discovered in each LSST image.

A critical component of RAPID that enables early classification is its use of measurements of the source before an alert is triggered: "precovery" photometry with insufficient significance to trigger an alert, but that nevertheless encodes information about the transient. While we designed RAPID primarily for early classification, the flexibility of our architecture means that it is also useful for photometric classification with any available phase coverage of the light curves. It is competitive with contemporary approaches such as Lochner et al. (2016), Charnock & Moss (2017), and Narayan et al. (2018) when classifying the full light curve.

There is no satisfactory single metric that can completely summarize classifier performance, and we have presented detailed confusion matrices, ROC curves and measures of precision vs recall for all the classes represented in our training set. The micro-averaged AUC, the most common single metric used to measure classifier performance, evaluated across the 12 transient classes is 0.95 and 0.98 at 2 days and 40 days after an alert trigger, respectively. We further evaluated RAPID's performance on a few transients from the real-time ZTF data stream, and, as an example, have shown its ability to effectively identify a TDE and two SNe Ia well before maximum brightness. The results at early times are particularly significant as, in many cases, they can exceed the performance of trained astronomers attempting visual classification.

We also developed a second early classification approach that trained a Random Forest classifier on features extracted from the light curve. This allowed us to directly compare the feature-based Random Forest approach to RAPID's model-independent approach. We found that the classification performances are comparable, but the RNN has the advantage of obtaining time-varying classifications, making it ideal for transient alert brokers. To this end, we have recently begun integrating the RAPID software with the ANTARES alert-broker, and plan to apply our DNN to the real-time ZTF data stream in the near future.

In future work, we plan on applying this method on LSST simulations to help to inform how changes in observing strategy affect transient classifications at early and late phases. Overall, RAPID provides a novel and effective method of classifying transients and providing prioritized follow-up candidates for the new era of large scale transient surveys.

D.M. would like to thank the Cambridge Australia Poynton Scholarship and Cambridge Trust for studentship funding, as well as Armin Rest and the Director's Office at STScI for travel funding. G.N. is supported by the Lasker Fellowship at the Space Telescope Science Institute, and thanks the Kavli Institute for Cosmology, Cambridge for travel funding through a Kavli Visitor Grant. This work used the science cluster at STScI and the Midway cluster at the University of Chicago. We are indebted to the IT staff at STScI and the University of Chicago Research Computing Center for their assistance with our work. We thank Eric Bellm for providing a database of ZTF observations, Rick Kessler for help with the SNANA simulation code used to make the ZTF simulated light curves, and Ryan Foley, David Jones, and Zack Akil for useful conversations.

Software: AstroPy (Astropy Collaboration et al. 2013), Keras (Chollet 2015), NumPy (van der Walt et al. 2011), Scikit-learn (Pedregosa et al. 2011), SciPy (Jones et al. 2001), TensorFlow (Abadi et al. 2016).
