1 Introduction

As robots become more prevalent in homes, factories, and other safety-critical settings, detecting and correcting robot errors becomes increasingly important. A fast, reliable, and intuitive framework for supervising robots could help avoid errors that would otherwise lead to costly hardware damage or safety risks. If a robot could be taught to detect nonverbal cues such as distress signals and hand gestures as reliably as a collaborating human partner, then interactions with robots would become more efficient and supervision or collaboration would become more effective.

Fig. 1

A user supervises and controls an autonomous robot, using brain signals to detect mistakes and using muscle signals to correct mistakes by selecting targets. Videos are available at http://people.csail.mit.edu/delpreto/auro2020

Using biosignals such as muscle or brain activity via electromyography (EMG) or electroencephalography (EEG), respectively, has become a promising technique for fast and natural human–robot interaction (HRI). EMG interfaces can control dynamical systems such as exoskeletons, while EEG data can reveal higher-level cognitive states using signals such as the Error-Related Potential (ErrP) indicating a perceived error. Yet reliable real-time detection is challenging since both biosignals are noisy, often difficult to interpret, and vary over time and across subjects. This often leads to the practice of per-subject training phases to tune classifiers, which precludes a “plug-and-play” system where new users can begin controlling the robot from their first interaction.

This paper explores the combination of EEG and EMG into a hybrid interface for robot control. This framework aims to leverage the benefits of each modality to enhance supervisory tasks, especially as classification accuracy continues to improve. The ErrP brain signal is generated unconsciously when a person perceives a mistake and does not need to be taught, facilitating fast passive error detection relying only on user attention. Hand gestures generate characteristic muscle signals that are easier to detect than brain signals, facilitating a vocabulary for reliably indicating desired behavior.

Combining EEG and EMG systems in this way can capitalize on the human’s cognitive ability to judge whether a robot made a mistake and physical ability to indicate correct actions if a mistake is made. As shown in Fig. 1, this paper implements a hybrid control framework in a supervisory scenario where a robot conducts a target selection task for a mock drilling operation. A human supervisor observes the autonomous robot and mentally evaluates whether it chose the correct target. If an ErrP or gesture is detected, the robot halts and requests assistance. The human then gestures left or right to scroll through possible targets. Once the correct target is selected, the robot resumes autonomous operation.

Two independent classification pipelines process EEG and EMG signals in real time. The EEG pipeline evaluates two neural networks on a buffer of EEG signals, determining whether an ErrP is detected at the moment when the robot begins moving towards a chosen target. The EMG pipeline classifies two channels of surface EMG signals from the forearm on a rolling basis, evaluating a neural network 80 times per second to detect left or right hand gestures; this allows the user to exert control over the robot at any time and to select a desired target. Both the EEG and EMG classifiers are evaluated in a plug-and-play fashion, training only on data from previous users rather than requiring additional data collection phases for each new supervisor.

Paper contributions This paper works towards using biosignals for effective and reliable supervisory hybrid control in HRI tasks. In particular, its contributions are as follows:

  • A framework for combining unconscious error detection via EEG with active error correction via EMG for supervision of autonomous robots during target selection tasks;

  • A signal-processing and classification pipeline for continuously detecting left or right hand gestures based on two surface EMG signals from the forearm, without requiring training data from the current user;

  • A classification pipeline for detecting unconscious ErrP signals in time-locked EEG signals, evaluated without retraining on each new user;

  • Experimental results from 7 untrained subjects using the system to supervise a mock drilling operation where the robot chooses from 3 possible targets;

  • Offline EMG analysis exploring inter-subject muscle signal variations and the impact on classification performance, leading to a clustering-based algorithm for identifying key subsets of past training data that could facilitate plug-and-play classifiers for new subjects.

2 Related work

This paper builds on bodies of work investigating human–robot interaction and classification of human biosignals.

2.1 EEG-based methods for human–robot interaction

Many brain-computer interfaces have made progress towards using EEG for communication with both healthy and disabled individuals (Brumberg et al. 2010; Artemiadis and Kyriakopoulos 2011; Birbaumer et al. 1999; Blumberg et al. 2007; Higger et al. 2015). Using EEG for robotics also shows great promise, with explorations including augmentation of capabilities (Schröer et al. 2015; Penaloza and Nishio 2018), shared control for robotic assistants or grasping (Akinola et al. 2017; Ying et al. 2018), and remote control (Tonin et al. 2011; LaFleur et al. 2013). Yet there are significant challenges such as low signal-to-noise ratios, user training, subject-specific variations, and tuning of detection pipelines (Wolpaw et al. 2002; Vidaurre et al. 2011; McFarland et al. 1997; Lotte and Guan 2011). These issues often lead to added cognitive burden on the user, repeated prompts, or user-specific classification algorithms.

ErrPs are a promising communication mechanism since they occur naturally in response to a perceived error without requiring training or active thought modulation by the human operator (Falkenstein et al. 1991; Schalk et al. 2000; Iturrate et al. 2009, 2010, 2015). ErrPs have been used in brain-computer interfaces and HRI tasks for binary and multi-target classification, correcting classification errors, or controlling robots (Spüler et al. 2012; Wolpaw et al. 1998; Salazar-Gomez et al. 2017; Iturrate et al. 2015; Buttfield et al. 2006; Llera et al. 2011; Iturrate et al. 2010; Perrin et al. 2010; Ramli et al. 2015; Zhang et al. 2015; Spüler and Niethammer 2015). Yet these studies also highlight challenges related to real-time classification that precipitate carefully controlled application scenarios and sophisticated detection algorithms (Rivet et al. 2009; Barachant and Bonnet 2011; Barachant et al. 2013; Behncke et al. 2018). Combining ErrP detection with other input signals in a hybrid system could help address some of these challenges while leveraging unconscious error detection.

2.2 EMG-based methods for human–robot interaction

Surface EMG can measure muscle activity via electrodes placed on the skin. Models can then be developed that facilitate controllers based on these signals (Zajac 1989; Hogan 1984; Manal and Buchanan 2003; Ramos and Meggiolaro 2014; Cavallaro et al. 2005; Qashqai et al. 2015; Menon et al. 2016). For example, upper-limb exoskeletons can leverage parameterized muscle models (Ramos and Meggiolaro 2014), neuro-fuzzy systems (Kiguchi and Hayashi 2012), or impedance controllers (Gopura et al. 2009). Such studies have shown that EMG can yield effective human–robot interfaces, but also demonstrate associated challenges including noise, variance between users, and complex muscle dynamics. Some approaches to addressing such challenges include redundant switching models (López et al. 2009) or leveraging the human within the control loop during physical interaction (Lenzi et al. 2011; DelPreto and Rus 2019; Peternel et al. 2017).

Prominent applications of EMG include assistive robots (Gopura et al. 2013) such as exoskeletons (Gopura et al. 2009; Lenzi et al. 2011; Ramos and Meggiolaro 2014; Yin et al. 2012; Kiguchi and Hayashi 2012; Ao et al. 2017; DiCicco et al. 2004; Mulas et al. 2005) and prostheses (Chu et al. 2007; Shenoy et al. 2008). In addition to direct collaboration, muscle signals can be used for remote control or supervision via continuous trajectory control (Artemiadis and Kyriakopoulos 2010; López et al. 2009; Artemiadis and Kyriakopoulos 2011; Artemiadis et al. 2010), augmented reality interfaces (Weisz et al. 2017), or gestures (Crawford et al. 2005; Kim et al. 2008). The presented framework uses a gesture-based EMG control system that allows the human to actively indicate desired robot targets.

Fig. 2

The main EEG+EMG paradigm is illustrated for when the robot incorrectly chooses the leftmost target (a), and when the robot correctly chooses the rightmost target (b). The sequence of events when it incorrectly chooses the center target is similar to (a). All LEDs blink initially to notify the user of the trial beginning, then one is blinked to indicate the desired drilling location. The robot then moves to indicate its intended target, which is randomly chosen with a bias towards being correct (block 1); the user mentally evaluates this choice while ErrP classification is performed (block 2). If an ErrP is classified, the robot stops and waits for gestures; otherwise, it continues reaching while searching for EMG intervention (block 3). If a correction is required, the user scrolls through targets via gestures while the robot is stopped (block 4). Upon target selection or no intervention, the robot completes the reach (block 5)

2.3 Hybrid control methods for human–robot interaction

Using multiple biosignal sensors can yield hybrid systems that leverage their respective strengths (Müller-Putz et al. 2011). By fusing multiple modalities as well as multiple types of brain signals, including ErrPs, researchers have demonstrated promising success in applications for healthy and impaired individuals ranging from quadrotor flight to prostheses (Müller-Putz et al. 2015; Sarasola-Sanz et al. 2017; Kim et al. 2014; Kawase et al. 2017; Ma et al. 2015).

This paper focuses on a framework for hybrid EEG and EMG supervision, which may be applicable to safety-critical tasks where robot operation must be corrected with low latency. Such applications require jointly addressing several challenges including rolling EMG classification to allow human control at arbitrary times, fast online ErrP classification even with low signal-to-noise ratios, and experimentally validated system performance in plug-and-play settings.

3 Experimental design

An experiment was designed that allows users to supervise and control an autonomous robot solely via brain and muscle activity. A supervised robot moves an unplugged power drill to one of three targets on a mock plane fuselage, emulating a factory setting where robots assist humans in construction tasks by drilling holes or inserting fasteners.

Two experimental paradigms were implemented: the primary closed-loop supervisory control task, and an open-loop session of gestures. No prior experience controlling a robot or using EEG and EMG is required, encouraging novice users to immediately interact with the robot in more intuitive ways than button sequences or programming.

3.1 Main paradigm: EEG and EMG, closed-loop control

The main task constitutes an autonomous robot selecting targets for a mock drilling operation while a human supervises and intervenes when necessary. As shown in Fig. 1, the human wears an EEG cap and EMG electrodes while sitting behind the robot and observing the task. The human mentally evaluates whether the robot chooses the correct target in each trial, and uses left or right hand-gestures to correct the robot when necessary. An EEG classifier detects ErrP signals upon initial robot motion, and an EMG classifier continuously identifies gestures.

Figure 2 and the online videos illustrate the sequence of events. Three LEDs are mounted on the mock plane fuselage under the left, center, and right targets. At the start of each trial, the system randomly chooses which target is the desired drilling location with uniform probability. It then illuminates the corresponding LED for 0.5 s to inform the user of the correct location for the upcoming trial. Approximately 1.5 s later the robot randomly chooses a target with a 70% chance of choosing the correct one, and makes an initial arm movement towards the chosen location to cue its intention. The user mentally evaluates whether this motion indicates the correct target, and the EEG classifier assesses whether their brain activity presents an ErrP signal.

If an ErrP is found, the system stops the robot, illuminates the LED representing the robot’s currently chosen target, and waits for the human to select the correct target using wrist gestures. A gesture can be a brief flexion (left motion) or extension (right motion) of the right hand. These are used to scroll left or right through the three possible targets; every time a gesture is detected by the EMG classifier, the illuminated LED is changed to indicate the new selection. Since the robot’s chosen target can be more than one target away from the correct one, several gestures may be performed. The system considers a target selection finalized 3.5 s after the last detected gesture. All LEDs then turn off, and the robot resumes autonomous operation reaching towards the newly selected target. If an incorrect target was accidentally selected, the supervisor can use gestures to interrupt the robot and initiate another target selection.
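
As a concrete illustration, the following minimal Python sketch shows how the gesture-driven scrolling and the 3.5 s finalization timeout could be implemented in the experiment controller. The function names (get_gesture, set_led) are hypothetical interfaces to the classifier and LEDs, and whether scrolling saturates at the outer targets or wraps around is not specified in the text; the sketch saturates.

```python
import time

TARGETS = ["left", "center", "right"]
FINALIZE_TIMEOUT_S = 3.5  # a selection is finalized 3.5 s after the last detected gesture

def select_target(initial_index, get_gesture, set_led):
    """Scroll through targets with detected gestures until none arrives for
    FINALIZE_TIMEOUT_S seconds, then return the selected target index."""
    index = initial_index
    set_led(index)  # illuminate the LED for the robot's currently chosen target
    last_gesture_time = time.time()
    while time.time() - last_gesture_time < FINALIZE_TIMEOUT_S:
        gesture = get_gesture()  # hypothetical non-blocking poll: 'left', 'right', or None
        if gesture == "left":
            index = max(index - 1, 0)
        elif gesture == "right":
            index = min(index + 1, len(TARGETS) - 1)
        else:
            time.sleep(0.01)
            continue
        set_led(index)  # update the illuminated LED to show the new selection
        last_gesture_time = time.time()
    return index
```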

If no ErrP signal is found, the robot will continue operating autonomously and reach towards the selected target. However, the user can interrupt the robot by gesturing at any time; this provides a safety feature in case of inaccurate EEG classification. The robot then stops and waits for target selection via EMG as described above. The initial intervention gesture also adjusts the illuminated current selection.

Once the robot reaches the selected target, either with or without intervention, it pauses briefly to indicate completion and then returns to its starting position. This concludes a single trial. On average, an experimental session consisted of 4 blocks of 40 trials each and lasted approximately 2 h.

3.2 EMG-only paradigm: open-loop gestures

Since the main paradigm produces gestures performed at arbitrary times, an EMG-only paradigm was included to generate a corpus of structured EMG training data and facilitate classifier evaluation. However, this data was not used to train a new classifier for the subject who provided it; each experiment used a classifier trained on EMG-only sessions from all previous subjects to implement a plug-and-play strategy. Data from 3 initial subjects who did not participate in online sessions was used as the training set for the first online subject, and was also included in subsequent training sets.

These trials follow a “ready-set-go” sequence to cue time-locked labeled gestures. All three fuselage LEDs blink once to gain the subject’s attention, then the left or right LED illuminates for 0.5 s to indicate whether a left or right gesture should be made. After a brief delay, all LEDs illuminate for 1 s; the subject starts and completes their gesture during this period, encouraging consistent gesture timing.

This EMG-only block was performed at the beginning of each experimental session. It consisted of 50 trials and lasted approximately 5 min. The subject is positioned in the same manner as for the main paradigm, but the EEG cap is not worn and the robot is not controlled.

3.3 EEG data paradigm

The EEG classifier was trained on data from 3 sessions of the main closed-loop paradigm performed by 2 preliminary subjects before the primary set of experiments. No EEG-only paradigm was implemented. Training segments were extracted as the window from 200 ms to 800 ms after the robot begins its initial movement. Subjects were instructed to avoid making gestures during this period to avoid motor-related EEG activity. Although the signal-to-noise ratio during the experiments was low and the initial training set was small, the classifier was kept constant throughout all online sessions to evaluate the hybrid control pipeline in a plug-and-play fashion with a fixed EEG classifier.

3.4 Subject selection

A total of 7 subjects participated in the online control experiments (71.4% male, 85.7% right-handed). In addition, 3 separate subjects provided initial training data; all 3 performed the EMG-only blocks to acquire EMG training data (66.7% male, 66.7% right-handed), and 2 of them performed the main paradigm to acquire EEG training data (50.0% male, 100.0% right-handed). No previous experience using EMG or EEG interfaces was required. Subjects were not screened based on EMG or EEG signals. All subjects provided consent for the study, which was approved by MIT’s Committee on the Use of Humans as Experimental Subjects.

4 System overview and data acquisition

An integrated end-to-end system was developed to enable real-time hybrid supervisory control during the target selection tasks. Figure 3 provides an overview of the system.

Fig. 3

The system includes EMG and EEG acquisition and classification systems, an experiment controller, and the Baxter robot. A human supervisor closes the loop

4.1 Experiment controller and robot

The experiment controller, implemented in Python, coordinates all subsystems to realize the paradigms. It chooses correct and selected targets, commands the robot and LEDs, and interprets classifier outputs in the experimental context.

For this particular implementation, the Rethink Robotics Baxter robot was used. It communicates via the Robot Operating System (ROS) with the experiment controller, which provides joint angle trajectories for Baxter’s left 7 degree-of-freedom arm. A pushbutton switch is fastened under the arm to determine exactly when the arm lifts from the table; this is used as a time-locking signal for EEG acquisition.

An Arduino Mega 2560 serves as an interface between the experiment controller and the classification subsystems. Key events such as trial timing, chosen targets, robot motion, and LED states are sent to the Arduino via USB serial. These are mapped to predefined 7-bit code words, which the Arduino uses to set a parallel port. The 8th pin of the port is wired to the pushbutton switch on Baxter's arm. The EMG and EEG data acquisition systems read this port along with their respective biosignals, allowing for synchronization between data and experimental events.
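
A sketch of this event-synchronization scheme is shown below. The specific code-word values, serial port name, and baud rate are illustrative assumptions rather than the values used in the study, and pyserial is assumed for the USB serial link.

```python
import serial  # pyserial

# Hypothetical 7-bit code words; the actual event encoding is not published.
EVENT_CODES = {
    "trial_start": 0x01,
    "cue_left": 0x02,
    "cue_center": 0x03,
    "cue_right": 0x04,
    "robot_target_chosen": 0x05,
    "errp_detected": 0x06,
    "gesture_detected": 0x07,
}

arduino = serial.Serial("/dev/ttyACM0", baudrate=115200, timeout=1.0)

def send_event(name):
    """Send one event code to the Arduino, which mirrors it onto the lower
    7 bits of the parallel port; the 8th pin is driven by the pushbutton
    switch under Baxter's arm, so both DAQ systems record synchronized events."""
    code = EVENT_CODES[name]
    assert code < 0x80, "event codes must fit in 7 bits"
    arduino.write(bytes([code]))
```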

4.2 EMG hardware and data acquisition

Two differential pairs of reusable non-adhesive surface bar electrodes are placed over the user’s right posterior (outer) forearm and right anterior (inner) forearm, positioned over the muscles using recommendations in De Luca (2002). An additional electrode is placed slightly distal to the left elbow as a ground reference. Electrode placement sites are cleaned with Nuprep skin preparation gel and alcohol prep pads to reduce electrical impedance and improve adhesion. Conductive gel is applied beneath each electrode. An elastic Velcro strap around each forearm holds the electrodes in place.

An NI USB-6216 data acquisition (DAQ) device is connected directly to the electrodes, the Arduino parallel port, and the LED control signals. The 16-bit analog input channels are configured for a \(-0.2\) V to \(+0.2\) V range. Differential mode is used to reduce common noise. Analog signals are sampled at 2000 Hz and sent via USB as buffers of 200 samples every 0.1 s. These buffers are acquired by Simulink (2017b), which performs online signal processing and gesture identification. Classifications are sent asynchronously to the experiment controller via ROS.

4.3 EEG hardware and data acquisition

A total of 48 passive electrodes, following the 10-20 scalp distribution, are used for EEG data collection. Three Guger Technologies USBamps sample all signals at 256 Hz. Ground and reference electrodes are placed at the AFz position and the right ear, respectively. The Arduino parallel port is connected directly to GPIO inputs of the USBamps.

Online signal processing and classification is performed in Simulink (2015a). The pushbutton switch on Baxter’s arm initiates an EEG buffer, as described in Salazar-Gomez et al. (2017). The buffer is then processed and passed to the ErrP classifier. Classifications are sent asynchronously to the experiment controller via the Arduino, using a USBamp GPIO output.

5 Classification of EMG and EEG signals

Two independent classification pipelines were implemented: one for continuous gesture detection from muscle signals, and one for time-locked error detection from brain signals. Each one operated online and was trained only on data from previous users.

5.1 EMG classification: continuous gesture detection

Muscle signals are acquired from the inner and outer right forearm as described in Sect. 4.2. These two signals are passed through a pipeline of signal processing, feature extraction, and classification to detect gestures on a rolling basis. This pipeline is outlined in Algorithm 1.

Fig. 4

Acquired EMG signals are passed through a signal-processing pipeline. Raw signals for randomly selected left-gesture, right-gesture, and baseline segments are shown on the left. Detected envelopes are shown in the center. The right column shows the segments after shifting down to 0, normalizing, trimming, centering, and downsampling

5.1.1 Signal processing

Each muscle signal is independently band-pass filtered, amplified, and envelope-detected. The initial filtering preserves the useful frequency content of the EMG signal (De Luca 2002) while removing DC offsets, low-frequency motion artifacts, and high-frequency noise. The envelopes indicate muscle activation levels, as shown in Fig. 4.
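
A minimal offline sketch of this per-channel processing is shown below using SciPy. The cutoff frequencies and filter orders are illustrative placeholders (the paper does not list them), zero-phase filtering is used for simplicity whereas the online Simulink pipeline would need causal filters, and the amplification stage is omitted because the later per-segment normalization removes any dependence on gain.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 2000  # Hz, EMG sampling rate from Sect. 4.2

def emg_envelope(raw, band=(5.0, 400.0), env_cutoff=5.0):
    """Band-pass filter one raw EMG channel and extract its envelope."""
    # Band-pass to keep typical surface-EMG content while rejecting DC offsets,
    # low-frequency motion artifacts, and high-frequency noise.
    b, a = butter(4, [band[0] / (FS / 2), band[1] / (FS / 2)], btype="band")
    filtered = filtfilt(b, a, raw)
    # Envelope detection: full-wave rectification followed by low-pass filtering.
    b_env, a_env = butter(4, env_cutoff / (FS / 2), btype="low")
    return filtfilt(b_env, a_env, np.abs(filtered))
```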

5.1.2 Segmentation and normalization

As described in Sect. 3.2, EMG training data was collected by cueing subjects to make left or right gestures during specified time windows using LEDs. The data was then segmented according to those LED signals. To accommodate variable reaction times and gesture durations, each extracted segment begins 0.75 s before LEDs turn on and ends 0.75 s after LEDs turn off. This yields one labeled gesture segment per trial. Two baseline segments without gestures were also extracted from each trial when LEDs were off.

For each extracted segment, each EMG channel’s envelope is shifted so its minimum value is at 0. Both envelopes are then scaled by the same factor so the peak value becomes 1. Finally, they are downsampled to 80 Hz. The right column of Fig. 4 presents sample results. This shifting and scaling helps standardize inputs across subjects and time, making classification robust to variations in EMG magnitude and offset. Normalizing each segment independently alleviates issues of calibration, fatigue, and gesture variations.
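
The per-segment normalization can be summarized by the following sketch, assuming SciPy's polyphase resampler for the downsampling step (the paper does not specify the resampling method).

```python
import numpy as np
from scipy.signal import resample_poly

def normalize_segment(env_inner, env_outer, fs_in=2000, fs_out=80):
    """Shift each channel's envelope down to 0, jointly scale so the peak is 1,
    and downsample to 80 Hz, following Sect. 5.1.2."""
    inner = env_inner - env_inner.min()
    outer = env_outer - env_outer.min()
    peak = max(inner.max(), outer.max())
    if peak > 0:
        inner, outer = inner / peak, outer / peak  # joint scaling preserves channel ratios
    # 2000 Hz -> 80 Hz (factor of 25)
    inner = resample_poly(inner, fs_out, fs_in)
    outer = resample_poly(outer, fs_out, fs_in)
    return inner, outer
```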

5.1.3 Data augmentation

To detect gestures on a rolling basis, the trained network should be robust to small time shifts while preferring gestures that are centered in the buffer. This helps make predictions smooth and reliable while reducing false detections. For example, the top row of Fig. 4 illustrates antagonistic muscle activity during a left gesture that may result in predicting two different gestures, a left gesture then a right gesture, if the network is too tolerant of time shifts.

A data augmentation approach, demonstrated in Fig. 5, was used to guide the network towards robust rolling classification. Each extracted gesture segment is first centered around the peak value. Two copies are then synthesized by shifting slightly left and right by random amounts (1-100 ms); these are assigned the original gesture label. Two copies are also synthesized by shifting farther left and right (400 ms plus a random amount up to 100 ms); these are assigned a baseline label. By creating slightly shifted positive examples and greatly shifted negative examples, this augmentation guides the network towards preferring gestures that are centered in the buffer within a specified tolerance.

For each original baseline segment, a single synthetic baseline example is extracted via a far shift.

Each example is then truncated to 1.2 s (96 samples), using a window centered around the original centered segment. If there is not enough data in the original segment to perform one of the shifts, that synthetic example is discarded.
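
The augmentation can be sketched as follows. The shift ranges follow the values above, while the random distributions and the handling of the 80 Hz discretization are illustrative assumptions of this sketch.

```python
import numpy as np

FS_DS = 80                 # Hz after downsampling
WINDOW = int(1.2 * FS_DS)  # 96 samples per channel

def augment_gesture(segment, label, rng=None):
    """Create time-shifted copies of a peak-centered gesture segment
    (array of shape 2 channels x samples), per Sect. 5.1.3."""
    if rng is None:
        rng = np.random.default_rng()

    def crop(center):
        start = center - WINDOW // 2
        if start < 0 or start + WINDOW > segment.shape[1]:
            return None  # not enough data for this shift; discard the copy
        return segment[:, start:start + WINDOW]

    center = segment.shape[1] // 2  # segment is centered around the envelope peak
    examples = [(crop(center), label)]
    for sign in (+1, -1):
        small = sign * int(rng.uniform(0.001, 0.100) * FS_DS)          # 1-100 ms
        large = sign * int((0.400 + rng.uniform(0, 0.100)) * FS_DS)    # 400-500 ms
        examples.append((crop(center + small), label))       # still counts as a gesture
        examples.append((crop(center + large), "baseline"))  # gesture pushed off-center
    return [(x, y) for x, y in examples if x is not None]
```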

Fig. 5

Training data is augmented via time-shifting, to encourage the network to prefer centered gestures within a specified tolerance (shaded yellow regions). Each gesture trial is shifted slightly to create new positive examples, and shifted farther to create new “baseline” examples

5.1.4 Neural network training

As a result of the segment extraction and data augmentation, each trial yields 9 training examples: 3 positive gesture examples, 2 baseline examples with shifted gestures, and 4 baseline examples without gestures. The training corpus is thus biased towards negative examples, which is acceptable since avoiding false gesture detections is important for smooth operation and the rolling online classifier will encounter vastly more baseline segments than gestures.

The two EMG envelopes within each example are concatenated to yield a feature vector with 192 elements. These labeled vectors are used to train a feed-forward neural network, using the Pattern Recognition functionality of Matlab’s Neural Network Toolbox (2017b). The network has a single hidden layer of size 20 using a hyperbolic tangent sigmoid activation function, and an output layer of size 3 using a softmax activation function. The 3 outputs are used to classify a segment as baseline, a left gesture, or a right gesture.

A new classifier was trained for each experiment using data from the EMG-only blocks of all previous subjects, including the 3 preliminary subjects, without using data from the current subject.
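
The following sketch mirrors this training setup using scikit-learn's MLPClassifier as a stand-in for Matlab's pattern recognition network. Each example is assumed to be a normalized 2-channel by 96-sample envelope segment paired with a 'left', 'right', or 'baseline' label.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_gesture_classifier(examples):
    """Train a single-hidden-layer (20 tanh units) classifier with a 3-way
    output over {'baseline', 'left', 'right'} on concatenated envelope pairs."""
    X = np.stack([segment.reshape(-1) for segment, _ in examples])  # 2 x 96 -> 192 features
    y = np.array([label for _, label in examples])
    clf = MLPClassifier(hidden_layer_sizes=(20,), activation="tanh", max_iter=2000)
    clf.fit(X, y)  # multi-class probabilities use a softmax-like output layer
    return clf
```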

5.1.5 Online continuous classification

As the two muscle signals are acquired by Simulink, the signal-processing pipeline is applied and the detected envelopes are downsampled to 80 Hz. These downsampled envelopes populate rolling buffers of duration 1.2 s. Each time the buffers are updated, they are independently shifted down to 0, jointly normalized, and concatenated. The resulting 192-sample vector is passed to the trained neural network. To avoid spurious predictions, network classifications are slightly filtered. A rolling buffer of 12 network classifications (150 ms) is maintained, and a gesture is declared if at least 60% of them are not baseline and at least 60% of them are the same label. Rising edges in the stream of filtered classifications are interpreted as new gesture predictions.
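
The rolling decision rule can be sketched as follows, assuming a classifier with scikit-learn's predict interface; buffer management and tie-breaking are simplified relative to the Simulink implementation.

```python
from collections import deque
import numpy as np

FILTER_LEN = 12   # 150 ms of classifications at 80 Hz
AGREEMENT = 0.6   # fraction of recent labels that must agree

class RollingGestureDetector:
    """Filter the stream of per-step network labels and report rising edges."""

    def __init__(self, clf):
        self.clf = clf
        self.recent = deque(maxlen=FILTER_LEN)
        self.active = False  # True while a filtered gesture is currently declared

    def step(self, feature_vector):
        """Call once per 80 Hz update with the current 192-sample buffer.
        Returns 'left' or 'right' on a new detection, else None."""
        label = self.clf.predict(feature_vector.reshape(1, -1))[0]
        self.recent.append(label)
        if len(self.recent) < FILTER_LEN:
            return None
        non_baseline = [l for l in self.recent if l != "baseline"]
        declared = None
        # A gesture is declared when >=60% of recent labels are non-baseline
        # and >=60% of them carry the same gesture label.
        if len(non_baseline) >= AGREEMENT * FILTER_LEN:
            values, counts = np.unique(non_baseline, return_counts=True)
            if counts.max() >= AGREEMENT * FILTER_LEN:
                declared = values[int(np.argmax(counts))]
        rising_edge = declared is not None and not self.active
        self.active = declared is not None
        return declared if rising_edge else None
```

Declaring only rising edges means a held or slowly completed gesture produces a single prediction rather than a stream of repeats, which matches the brief continuous output bursts reported in Sect. 6.2.2.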

5.2 EEG classification

Brain signals are acquired as described in Sect. 4.3, then passed through a pipeline of signal processing, feature extraction, and classification to detect ErrP signals.

5.2.1 EEG pre-processing and feature extraction

A buffer of EEG data is initiated during each closed-loop trial when the robot begins its arm motion (at stimulus onset), as detected by a pushbutton switch under the robot's arm. This buffer collects 800 ms of data from the 48 EEG channels. A decoding window from 200 ms to 800 ms post stimulus onset is then extracted. These signals are band-pass filtered between 1 and 10 Hz using a 4th-order zero-phase Butterworth filter. Based on offline analysis, only signals from 9 electrodes on the mid-line central region, corresponding to the locations FC1, FCz, FC2, C1, Cz, C2, CP1, CPz, and CP2, were selected. These 9 filtered channels are concatenated to create a 1,386-element feature vector for classification.
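
A minimal SciPy sketch of this feature extraction is shown below. It assumes the buffer is a 48-channel array of roughly 205 samples (800 ms at 256 Hz) already time-locked to motion onset, so the 154-sample decoding window over 9 channels yields the 1,386-element feature vector.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS_EEG = 256  # Hz
CHANNELS = ["FC1", "FCz", "FC2", "C1", "Cz", "C2", "CP1", "CPz", "CP2"]

def errp_features(eeg_buffer, channel_names):
    """Extract the ErrP feature vector from an 800 ms post-onset EEG buffer
    of shape (48 channels, samples)."""
    # Keep the 200-800 ms decoding window.
    start = int(0.2 * FS_EEG)
    window = eeg_buffer[:, start:]
    # 1-10 Hz band-pass with a 4th-order zero-phase Butterworth filter.
    b, a = butter(4, [1.0 / (FS_EEG / 2), 10.0 / (FS_EEG / 2)], btype="band")
    filtered = filtfilt(b, a, window, axis=1)
    # Select the 9 mid-line central electrodes and concatenate their samples.
    idx = [channel_names.index(ch) for ch in CHANNELS]
    return filtered[idx].reshape(-1)
```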

5.2.2 Network training and online ErrP classification

The EEG classification pipeline was trained on 3 preliminary sessions from 2 of the preliminary subjects. The classifier then remained constant for all 7 online sessions to evaluate the hybrid control pipeline in a plug-and-play fashion.

Two feed-forward neural networks were trained and evaluated. The first network has a single hidden layer (input-100-1). The second network has four hidden layers (input-100-50-100-10-1). To perform binary ErrP classification, a threshold was chosen for each network by minimizing the following cost function:

$$\begin{aligned} Cost = \sqrt{\left( 1-sensitivity\right) ^2 + \left( 1-specificity\right) ^2} \end{aligned}$$
(1)

The offline area under the curve (AUC) metrics for the simpler and deeper neural networks were 70% and 69%, respectively. Offline analysis averaging the regression outputs from both networks and using an averaged threshold increased performance by 3 percentage points. Thus, the final classification pipeline implemented in Simulink uses both networks: the feature vector is fed in parallel to each classifier, then the two outputs are averaged and compared to the averaged threshold. This final output is sent to the experiment controller as the ErrP detection flag.
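
The threshold selection of Eq. (1) and the final two-network averaging can be sketched as follows. Here net_shallow and net_deep stand in for the trained networks' regression outputs; they and the label conventions are assumptions of this illustration, not the published implementation.

```python
import numpy as np

def choose_threshold(scores, labels):
    """Pick the detection threshold that minimizes Eq. (1):
    Cost = sqrt((1 - sensitivity)^2 + (1 - specificity)^2)."""
    best_t, best_cost = None, np.inf
    for t in np.unique(scores):
        pred = scores >= t
        sensitivity = np.mean(pred[labels == 1])   # true-positive rate on ErrP trials
        specificity = np.mean(~pred[labels == 0])  # true-negative rate on non-ErrP trials
        cost = np.hypot(1 - sensitivity, 1 - specificity)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t

def detect_errp(features, net_shallow, net_deep, t_shallow, t_deep):
    """Average the two networks' regression outputs and compare to the
    averaged threshold, as in the final online pipeline."""
    score = 0.5 * (net_shallow(features) + net_deep(features))
    return score >= 0.5 * (t_shallow + t_deep)
```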

6 Experimental results and discussion

The system was used for 7 sessions, each with a different subject. This enabled evaluation of the interface efficacy as well as the EMG and EEG classifiers.

6.1 System performance: integrated hybrid control

Results summarizing the trials and exploring effectiveness of the overall system are shown in Fig. 6. There were 151 trials per experiment on average, after removing trials in which subjects self-reported being distracted. The robot randomly chose the correct target in 69.5% of the trials, and after EEG and EMG control chose the correct target in 97.3% of the trials. The hybrid control thus allowed the user to correct errors made by the robot and by the EEG classifier. In most interventions, the minimum number of gestures required to scroll to the correct target was detected by the system even though subjects were not instructed to minimize their gestures; this suggests efficient usage of the gesture interface and a lack of spurious false gesture classifications.

Fig. 6

Overall system performance is summarized by whether the robot placed the drill at the correct target after hybrid EEG and EMG control. In addition, the user’s interaction and some of the failure modes are described by considering when gestures were performed or required

Compared to fully autonomous trials with neither error nor gestures detected, trials in which an error was detected and the user performed gestures to select a new target averaged 8.2 s longer with standard deviation (SD) 2.8 s. Trials in which no error was detected via EEG but the user interrupted the robot via gestures averaged 5.6 s (SD 2.4 s) longer than fully autonomous trials.

6.2 EMG classification performance

The EMG classification pipeline was evaluated during the EMG-only blocks and during the main-paradigm blocks. Online results are presented below, while Sect. 7 presents collected data and offline analysis.

6.2.1 Open-loop gesture detection: EMG-only blocks

Figure 7 summarizes the rolling classification performance during open-loop gesture blocks. As described in Sect. 3.2, each trial includes a single cued gesture. Classifiers trained on previous blocks were run online as described in Sect. 5.1.5, but no feedback was presented to the user. The classifiers made a single correct gesture prediction in 92.8% of the 345 trials. There were mixed correct and incorrect gesture predictions in 3.2% of trials, no non-baseline predictions in 3.2% of trials, and multiple correct-gesture predictions in 0.9% of trials. There were no trials in which a left or right gesture was classified as the opposite gesture without additional correct predictions. These results indicate that the classifiers robustly and accurately detected gestures during EMG-only blocks.

As described in Sect. 5.1.5, the neural network classifications are slightly filtered. Without this filter, the pipeline would have made a single correct gesture prediction in 77.7% of trials and multiple correct-gesture predictions in 15.4% of trials. This indicates that the filter aided performance by decreasing repeated correct predictions, but was not needed to remove incorrect predictions.

Fig. 7

In each experiment, an EMG classifier trained on previous subjects was continuously invoked during the open-loop EMG-only trials. The results indicate successful real-time gesture detection and generalization to new subjects

6.2.2 Closed-loop gesture detection: EMG+EEG blocks

The classifiers trained on past EMG-only blocks were also used during the closed-loop trials in which users could arbitrarily make left or right gestures at any time. Ground truth gesture labels were obtained by annotating videos of the subjects’ arms in post-processing. Video was not recorded during the first EMG+EEG experiment, so closed-loop EMG performance results are not available for that experiment.

Fig. 8

Confusion matrices summarize the EMG classification performance during closed-loop EMG+EEG blocks. Users could make left or right gestures at any time

Figure 8 depicts per-subject and aggregated confusion matrices. The classifiers correctly identified 65.8% of left gestures and 85.2% of right gestures, while not falsely identifying any right gestures as left gestures and only 1.8% of left gestures as right gestures. Most instances of missing a gesture occurred when the subject made multiple rapid gestures, such that there were likely multiple gestures within the rolling buffer window.

The bottom row of Fig. 8a evaluates prediction of baseline activity. The experiments spanned over 18,000 s of predictions being generated at 80 Hz, and there were 17 gesture predictions when no gesture was performed. This spans the entire experiment, including between trials when subjects could move and reposition themselves. It is thus promising that gestures were very rarely predicted when no gesture was intended by the subject.

On average, left-gesture motions lasted for 0.85 s (SD 0.26 s) and were detected 1.15 s (SD 0.16 s) after motion initiation. Right-gesture motions lasted for 0.79 s (SD 0.23 s) and were detected 1.09 s (SD 0.11 s) after motion initiation. Thus, the classifiers generally waited to predict a gesture until it was contained by and relatively centered in the 1.2 s buffer window. Once a detection commenced, a left or right prediction was continuously outputted for an average of 0.20 s or 0.26 s (both SD 0.08 s), respectively, demonstrating smooth predictions. Together, these results imply that the data augmentation successfully encouraged the networks to prefer centered gestures within a specified tolerance – to be robust to small but not large time shifts.

Overall, these results indicate that the EMG classifiers provided a reliable plug-and-play method of communicating with the robot via gestures. The low false positive rate indicates that the robot rarely stopped unnecessarily. If a gesture was missed or misidentified, the subject could simply make another gesture to correct the target selection. The detection latency is also reasonable for real-time control and closed-loop feedback. These results demonstrate effective supervisory control in a multiple-choice context, and are promising for future extensions of the gesture vocabulary beyond the two well-defined hand motions currently explored.

6.3 EEG classification performance

Fig. 9

EEG results include online classification performance and a visual comparison of the ErrP signals collected during training, testing, and previously recorded sessions

Figure 9a summarizes the online EEG classification performance in each experiment. Although it was lower than the EMG classification performance, it was sufficient for preliminary investigation of the presented framework for combining EMG and EEG into a hybrid control interface. Using the more reliable EMG interface to correct errors by the EEG interface was acceptable in the current task, but future studies with more time-critical tasks should improve EEG performance to better leverage unconscious error detection. In particular, the current EEG system used a classifier that was trained on only 3 sessions from 2 subjects and that was not updated after each experiment; a larger corpus of ErrP data would likely yield a more robust classifier, as suggested by results of previous work (Salazar-Gomez et al. 2017).

Improving the EEG signal quality may also help improve classification accuracy. Figure 9b presents averaged traces from three different scenarios: the training set, the online experiments, and a related robot supervision task explored in previous work (Salazar-Gomez et al. 2017) (referred to as the offline session). The known negative-positive-negative structure of an ErrP that is expected in the difference traces (black traces) is seen in the offline sessions, but it is less apparent in the training sessions and especially in the online sessions. The online sessions generally presented lower signal-to-noise ratios than expected. This suggests that reducing the noise of the raw EEG signal during online sessions might yield improved classification results for future experiments. Many EEG studies carefully separate subjects from sources of interference such as robots and actuators, but the current experiments featured subjects quite close to the robot since the focus was on testing the capabilities of the hybrid control system; the impact of this placement and other signal quality concerns could be further evaluated in future investigations.

7 Offline EMG analysis for plug-and-play training

The amount of training data needed, as well as how to choose the most informative data among past subjects, are crucial questions for creating a plug-and-play gesture prediction system. After completing the online experiments, the collected EMG data for left and right gestures was used to explore these questions and to further evaluate the system's plug-and-play generalizability.

Section 7.1 outlines the training and testing procedures used during all of the presented offline analyses. Section 7.2 then investigates gesture variations between individual subjects and the viability of classifiers trained on a single subject, while Sect. 7.3 investigates training on multiple subjects and how the size of the training group affects classification performance. Building on these results, Sect. 7.4 presents a clustering-based algorithm for strategically selecting training data from past subjects that may facilitate creating plug-and-play classifiers for future subjects.

7.1 Offline training and evaluation procedure

Data from all of the EMG-only open-loop gesture blocks was used. The 7 subjects that performed online experiments will be referred to as subjects 1-7, and the 3 preliminary subjects will be referred to as subjects 8-10. Mirroring the procedure used for online experiments, the offline analyses train networks on time-locked gesture examples but evaluate using rolling classification.

Training examples were extracted, processed, and augmented as described in Sect. 5.1. For each individual classifier used in offline analysis, 10 neural networks with the same architecture as used during online experiments were trained. Each one randomly divided gesture examples from the subset of subjects being considered by the analysis into training, validation, and testing examples using a 70-15-15 split. The one with the highest accuracy on its testing set was then selected for use in the analysis. This offline training process mimics what could be done in an online experimental context when creating a neural network for use with a future unknown subject.
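
This train-several-and-keep-the-best procedure can be sketched as follows, again using scikit-learn as a stand-in for the Matlab toolbox. The validation split is only carved out here to mirror the 70-15-15 proportions; the original toolbox would additionally use it for early stopping.

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def train_best_of_ten(X, y, n_candidates=10, seed=0):
    """Train identical networks on random 70-15-15 splits of the chosen
    subjects' examples and keep the one with the highest test accuracy."""
    best_clf, best_acc = None, -1.0
    for i in range(n_candidates):
        X_train, X_rest, y_train, y_rest = train_test_split(
            X, y, test_size=0.30, random_state=seed + i)
        X_val, X_test, y_val, y_test = train_test_split(
            X_rest, y_rest, test_size=0.50, random_state=seed + i)
        clf = MLPClassifier(hidden_layer_sizes=(20,), activation="tanh",
                            max_iter=2000, random_state=seed + i)
        clf.fit(X_train, y_train)       # (validation data could drive early stopping)
        acc = clf.score(X_test, y_test)  # held-out accuracy used for model selection
        if acc > best_acc:
            best_clf, best_acc = clf, acc
    return best_clf
```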

Offline performance scores were then obtained by simulating the selected networks in a streaming fashion. The complete online pipeline described in Sect. 5.1.5 was executed with recorded raw EMG data and the new classifiers. Since the data is from open-loop sessions, this process mimics what would have been observed online. Accuracy is then reported as the fraction of trials in which these new rolling classifications indicate a single gesture and indicate the correct gesture (predicting multiple gestures in a trial is considered a failure even if they all indicate the cued gesture).

7.2 Inter-subject variability

Examining variability between subjects’ gestures and the generalizability of training on individual subjects can yield insight into creating plug-and-play classifiers. Figure 10 reveals commonalities among all subjects such as the general signal shape and the primary muscles. Yet it also reveals distinctive traits such as the prominence of antagonistic muscle activity, gesture speed, and consistency across trials.

Fig. 10

Training segments extracted from EMG-only blocks are illustrated for all subjects. Thicker lines represent mean traces, and shading spans 1 standard deviation on each side. The segments have been vertically shifted, normalized, downsampled, and centered around their peak. Synthetic augmentation examples are not included

To investigate the impact of these variations on system performance, classifiers were trained on data from a single subject and then tested on each of the remaining 9 subjects. Figure 11a aggregates results by the exclusive training subject used. It illustrates that training on certain subjects can be more generalizable than others; for example, training on subjects 1 or 2 yields higher accuracies than training on subjects 5, 7, 8, 9, or 10 (\({p\,<\,0.05}\) for each of these 10 pairwise comparisons). In addition to trends regarding training on specific subjects, Fig. 11b reveals trends regarding testing on specific subjects. For example, testing on subjects 3, 5, 6, or 10 yields higher accuracies across all training subjects than testing on subjects 1, 2, 8, or 9 (\({p\,<\,0.04}\) for each of these 16 pairwise comparisons). The performance matrix is also not symmetric. These results demonstrate that it can be difficult to predict how a subject’s data will impact the neural network and its performance on future subjects. Yet they also suggest that a small training set may be sufficient to create a generalizable classifier if chosen carefully.

7.3 Number of training subjects

Building on results from individual subjects, performance and robustness were explored when combining data from multiple subjects. A neural network was trained on every possible grouping of subjects, using groups of size 1 through 9. Each network was evaluated on each subject not in its training group. This yielded \(\sum _{N=1}^{9} \frac{10!}{N!(10-N)!} = {1022}\) neural networks and \(\sum _{N=1}^{9} \frac{10!(10-N)}{N!(10-N)!} = {5110}\)  evaluations. This whole process was then repeated 5 times, to help account for randomness in the neural network training process.
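
These counts can be verified directly:

```python
from math import comb

networks = sum(comb(10, n) for n in range(1, 10))                # 1022 trained networks
evaluations = sum(comb(10, n) * (10 - n) for n in range(1, 10))  # 5110 evaluations
```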

Fig. 11

Training a classifier on each subject individually and then separately evaluating on each remaining subject investigates the generalizability of gestures from each subject. Each box in (a) aggregates the corresponding column of (b)

Figure 12a summarizes the results. Training on all previous subjects (groups of 9 subjects) yielded the highest mean accuracy of 91.2% (SD 12.4%). Training on a single subject yielded a lower distribution of accuracies than every other group size (\({p\,<\,0.01}\) for each of the 8 pairwise comparisons). However, pairwise statistical comparisons among all of the accuracy distributions do not reveal a consistent pattern based on the number of training subjects; the current results therefore suggest that the amount of training data did not have a reliable impact on performance beyond a single subject, but increasing the sample sizes via additional experiments would be needed to further investigate such trends. The results also suggest that a small number of training subjects might be sufficient if chosen strategically.

7.4 Choosing training subjects via clustering

The previous analyses suggest that training on certain subjects may be more generalizable than others, and that using small groups of subjects may be desirable. However, it is still unclear how to choose training subjects that can leverage the similarities and variations between their gestures to improve performance on future subjects. Towards this end, a subject selection algorithm based on clustering is explored.

To first evaluate the effectiveness of clustering in the context of the collected muscle signals, clustering was performed on gestures from all 10 subjects. This was done on left and right gestures independently, using k-means clustering (Lloyd 1982) with correlation as the distance metric. The optimal number of clusters was chosen using Matlab's (2018b) silhouette evaluation metric, considering up to 10 clusters.

As illustrated in Fig. 13, two clusters were found for left gestures and three clusters were found for right gestures. The prominence of antagonistic muscle activity appears to be identified as a distinctive feature for both gestures. The speed of motion onset and return to neutral also seems to be a distinctive feature for right gestures. These traits agree with the discussion of Fig. 10, and suggest that the chosen clustering mechanism is reasonable for the current task.

A greedy algorithm was then implemented to apply this identification of distinctive gesture characteristics to training subject selection. Left and right gestures from past subjects are first clustered as described above. Training subjects are then greedily selected until each cluster has at least 30% of its gestures included in the training pool; at each iteration, the unselected subject with the most gestures from the least-represented cluster is selected and all gestures from that subject are added to the training pool. This aims to prefer subjects that exemplify certain gestural characteristics well; less consistent subjects with gestures from multiple clusters are less likely to be selected for training. Using a fixed cluster coverage ratio helps preserve the original gesture breakdown, so more common characteristics are still more prevalent in the downselected corpus.
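
A sketch of the clustering and greedy selection steps is shown below. Since scikit-learn's KMeans supports only Euclidean distances, the correlation distance used in the paper is approximated here by z-scoring each gesture before clustering; the silhouette search also starts at two clusters because the score is undefined for a single cluster. In practice the left- and right-gesture clusterings would both feed the coverage check.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_gestures(gestures, max_k=10):
    """Cluster gesture envelopes (trials x samples) and pick k by silhouette score."""
    X = gestures - gestures.mean(axis=1, keepdims=True)
    X = X / (X.std(axis=1, keepdims=True) + 1e-12)  # z-scored rows approximate correlation distance
    best_k, best_score, best_labels = 2, -1.0, None
    for k in range(2, max_k + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

def greedy_select_subjects(cluster_labels, subject_ids, coverage=0.30):
    """Greedily add whole subjects until every cluster has at least `coverage`
    of its gestures in the training pool, preferring at each step the subject
    with the most gestures from the least-represented cluster."""
    cluster_labels = np.asarray(cluster_labels)
    subject_ids = np.asarray(subject_ids)
    selected = []
    pool = np.zeros(len(cluster_labels), dtype=bool)
    clusters = np.unique(cluster_labels)
    while True:
        ratios = [pool[cluster_labels == c].mean() for c in clusters]
        if min(ratios) >= coverage:
            return selected
        needy = clusters[int(np.argmin(ratios))]
        candidates = [s for s in np.unique(subject_ids) if s not in selected]
        if not candidates:
            return selected
        best = max(candidates,
                   key=lambda s: np.sum((subject_ids == s) & (cluster_labels == needy)))
        selected.append(best)
        pool |= (subject_ids == best)
```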

This algorithm was evaluated with a leave-one-subject-out strategy to simulate plug-and-play performance; each subject was in turn treated as an unknown future subject. For each subject left out, the clustering and greedy subject selection was performed on the 9 known subjects. A neural network trained on the chosen subjects was then evaluated on the unknown subject that was not part of the clustering and selection procedure. This whole process was repeated 5 times for each subject left out, to help account for randomness in the clustering and training processes.

Results are summarized in Fig. 12b. Accuracy averaged 94.4% (SD 9.2%) across all subjects left out. The distribution of accuracies was significantly higher than the first 8 distributions of Fig. 12a (\({p\,<\,0.02}\) for each of these 8 pairwise comparisons). While increasing the sample sizes would be needed to further investigate such performance comparisons, the current results also suggest that using the clustering-based selection algorithm generally achieved comparable or improved performance on new subjects compared to what would have been achieved by training on all past subjects, despite using less training data.

The algorithm chose 4.2 training subjects on average (SD 0.4) from the 9 available, which agrees with the conclusions drawn from Fig. 12a regarding the feasibility of a small training group size if the subjects can be chosen strategically. The mean optimal number of left and right gesture clusters was 2.2 (SD 0.5) and 2.5 (SD 0.6), respectively, which is consistent with the clustering of Fig. 13.

Subjects 7, 8, and 9 were chosen most frequently, being included in 100.0%, 95.6%, and 80.0%, respectively, of the iterations in which they were not the subject left out. Subjects 1, 6, 3, and 10 were included in 51.1%, 51.1%, 46.7%, and 15.6% of such iterations, respectively, and the remaining subjects were chosen in less than 15.0% of such iterations.

Fig. 12

For every possible subset of subjects ranging in size from 1 to 9 subjects, 5 classifiers were trained and then independently evaluated on each of the remaining subjects; results are then grouped by the size of the training subset (a). The cluster-based algorithm for selecting training subjects can yield improved performance on new subjects (b)

Fig. 13

Performing k-means clustering on left and right gestures from all subjects yields two left-gesture clusters and three right-gesture clusters. Bold traces represent cluster centroids, and shading represents 1 standard deviation on each side. The clustering highlights antagonistic muscle activity and gesture onset speed as key characteristics

These results suggest that the presented clustering-based algorithm may facilitate generalizable classifiers for future subjects by identifying a key subset of past subjects on which to train. While the present analysis uses a vocabulary of two well-defined gestures and a relatively small number of subjects, the results are promising for future investigations with more subjects and for extensions to additional scenarios.

8 Conclusion and future work

Detection, correction, and prevention of robot errors are important tasks for a human–robot interface to accomplish. Moving towards these goals, the presented system combines brain and muscle signals into a hybrid framework for detecting and correcting robot mistakes during selection tasks. Error-related potential signals in the brain can provide an unconscious mechanism for quickly detecting when a user perceives a robot error, and detecting gestures via muscle signals can provide a reliable mechanism for actively indicating desired behavior. Both pipelines were evaluated with 7 untrained subjects in a plug-and-play fashion to reduce the barrier to new users controlling the robot.

While this system has demonstrated the use of a hybrid EMG+EEG interface for robot control, future work is required to investigate whether it could be deployed in safety-critical or time-critical tasks. In particular, the EEG classification accuracy should be improved beyond the 54% currently observed in the robot control experiments. Increasing the size of the EEG training corpus to include significantly more than 2 subjects, improving the EEG signal quality, and potentially using other biosignals such as EMG as a training input for the EEG system may help increase the ErrP classification performance and reliability in future studies.

Future work can also investigate this framework with a larger subject pool and with additional robot control tasks. Evaluating the EMG pipeline on more users can further investigate the observed trends in inter-subject variations and their impact on performance, and can further test the performance of the presented clustering-based subject selection algorithm for training plug-and-play classifiers. The presented selection algorithm could also be adapted to choose individual gesture examples instead of treating subjects’ datasets as atomic units. Future work could also increase the gesture vocabulary beyond the two hand gestures currently implemented to address a broader range of supervisory tasks and test the generalizability of the presented detection approach.

This work thereby moves towards improved human–robot interaction in situations where effective supervision can mean the difference between a dangerous environment and a safe one, or between a costly mistake and a swift intervention.