Introduction

Clinical trials in cancer research are increasingly incorporating biomarkers, for example, as an inclusion criterion or for stratification of patients to control for confounding factors. Practical challenges, such as interobserver variation in the assessment of biomarkers during the execution of the trial, are often overlooked. If not handled appropriately, these challenges can reduce the effectiveness of the biomarker and drug development process and the ability to complete it. According to Hall et al.1, the risks inherent to biomarker integration can be divided into risks to patients, operational risks, and direct risks to biomarker development. A National Cancer Institute (NCI), National Cancer Research Institute (NCRI), and European Organization for Research and Treatment of Cancer (EORTC) Working Group1 proposed a practical risk-management framework to manage the risks inherent to biomarker integration into clinical trials.

Stromal tumor-infiltrating lymphocytes (sTILs) have been strongly associated with prognosis in early-stage triple-negative breast cancer (TNBC) and HER2-positive breast cancer. In addition, sTILs are predictive of response to neo-adjuvant chemotherapy in early breast cancer2,3. Furthermore, sTILs correlate with outcome after immune checkpoint blockade in metastatic TNBC4,5,6. The readout of sTILs, however, can be challenging, which impedes their effective use as a biomarker and their usage in the clinic7. The International Immuno-Oncology Biomarker Working Group (hereafter called the TIL Working Group) has provided guidelines for the scoring of sTILs in breast cancer8, and the St. Gallen Breast Cancer Conference of 2019 endorsed sTILs being routinely characterized in TNBC and reported according to these guidelines8.

Risks associated with integration of biomarkers in clinical trials

In contemporary clinical research there is an increasing trend toward the use of biomarker results obtained in daily practice to select patients for inclusion in clinical trials. Although biomarker research is increasingly prominent in clinical trials, most biomarkers will not make it into the clinic9. Therefore, continuous monitoring of the predefined risks and the solutions can improve the quality of the biomarker, which can be applied in a clinical trial setting, as well as in daily practice. The recommendations of the TIL Working Group8,10 for appropriate scoring, and the risk-management framework of the NCI, NCRI, and EORTC Working Groups1 will help to effectively and efficiently improve the incorporation of biomarkers in clinical trials in the first instance.

Several risks are associated with biomarker development and integration of biomarkers in clinical trials. Roughly, risks can be divided into three categories: risks to patient safety, operational risks, and risks to biomarker development. Not all risks are applicable to all clinical trials; when designing a biomarker-incorporating clinical trial, risks should be defined and mitigation approaches formulated. It is highly recommended that during a clinical trial, risks are not only pre-identified but are also continuously monitored to prevent stagnation in biomarker development1. For example, incorporating biomarkers in a large multi-center international clinical trial involves different risks than doing so in a small single-center trial. In the former case, there might be different legislation regarding data confidentiality, and inter-laboratory variability can be an issue. When incorporating a biomarker as an inclusion criterion or stratification factor in clinical trials, rapid turnaround times are needed and the highest level of quality is necessary for correct interpretation of the results. In the next steps of biomarker development, high-quality results are needed to ensure implementation in daily clinical practice.

Use of digital pathology in clinical trials and development of a novel web application

In larger trials, usually phase II–III, central pathology review (CPR) plays an important role in the reliable assessment of biomarker scoring. However, logistical issues, such as the sending of tumor blocks or slides, can be time consuming, costly for the pathology laboratory, and error prone, with significant consequences for patient inclusion if the wrong material is sent to the central lab. Digital sharing of histology slides and patient data simplifies logistics for CPR11. Besides digital sharing and scoring of slides, digital image analysis and machine learning approaches are emerging in clinical research12,13. The use of digital pathology or digital evaluation of histology slides most prominently mitigates risks associated with operational processes. It can reduce the number of missing samples, since the sharing of material is simplified; enable rapid turnaround times; reduce manual errors; and streamline local versus central assessment of biomarkers.

For clinicians and researchers to use digital pathology, applications and websites should be user-friendly and intuitive. As an example, a web-based tool called Slide Score (www.slidescore.com) was developed as a cross-platform web application to facilitate the scoring of whole slide images and tissue microarray (TMA) cores. An application programming interface (API) was implemented that allowed programmatic administration of studies, uploading slides, fetching results, and retrieving pixel data for regions of images. This API enabled automated creation of new studies from an internal database system for managing biobanking workflows. Additionally, a plugin was developed for QuPath14, an open-source image analysis software package, which uses this API to run image analysis algorithms on slides stored on the Slide Score platform, avoiding the need to download the slides. This web-based platform was used in high-impact projects6,15, for example, for the digital scoring of biomarkers in the first stage of the TONIC trial6, and the estimation of the immune infiltrate of tumors of melanoma patients used for single-cell sequencing15. Furthermore, the web-based platform is currently used for several other types of research, such as interrater variability studies, retrospective TMA and whole slide scoring, and prospective biomarker scoring.

Design of a workflow to mitigate risks associated with biomarker development: an example

Using the risk-management framework published by Hall et al.1, we identified seven distinct risks that could interfere with the quality and integration of prospective sTILs scores in a clinical trial, and designed our workflow accordingly (Table 1). These risks are specific to this trial, but some of them also apply to other trials. They span all three categories mentioned above1 and included (1) poor-quality biopsies, (2) possible loss of data confidentiality, (3) interrater variability, (4) poor sample quality, (5) poor scoring quality, (6) delay in patient registration, and (7) manual errors (Table 1). We then defined solutions to mitigate these risks and integrated these solutions in a workflow that can be applied across clinical trials and across biomarkers (Fig. 1). The workflow can be modified according to local guidelines, research questions, and clinical trial designs. We used the following workflow to obtain timely and reliable sTILs scores (summary in Supplementary Fig. 1).

Table 1 Risks with possible high impact identified in a phase II immunotherapy trial6 based on the perspectives of Hall et al.1 with our approach to mitigation of that risk.
Fig. 1: Organization of a workflow for reliable and timely biomarker scoring in a general single-center or multi-center trial.

Personnel at individual centers scan the slides after processing by the local pathology department. Digital slides are uploaded to a central web-based repository, such as Slide Score. A study-specific identifier is assigned to each sample. The central manager is notified by the system when new slides are available and requests pathologists to review them. When a consensus score is obtained, the trial office is notified for randomization of the patient.

After obtaining informed consent of a patient, three biopsies of one metastatic lesion (lymph node, skin, liver, or other) were obtained in this trial. Previous research has shown that three 14 G core needle biopsies should be sufficient for accurate breast cancer diagnosis16. A hematoxylin and eosin (H&E)-stained slide of one biopsy was then evaluated to ensure that the biopsy contained enough tumor cells (more than 100 cells) for further analysis (risk 1). Next, a high-resolution digital scan was obtained and automatically pseudonymized with study-specific identifiers (risk 2) before uploading to Slide Score. Display of the original labels was masked to ensure confidentiality of all data within Slide Score (Supplementary Fig. 2b). Pathologists and administrators had to log in with their username and password to access the slides and could add a two-factor authentication application. Four well-trained breast pathologists, based in three different institutes and in two different countries, were notified via email to score each slide using the existing sTIL scoring guidelines of the TIL Working Group8,10 to reduce interrater variability (risk 3). sTILs are scored as the percentage of lymphocytes in the total stromal area (in close proximity to the tumor cells). Interrater variability can lead to bias in the results when assessment of a biomarker is skewed towards either the lower or higher ranges. When the pathologists disagreed (using a 5% cut-off), a consensus score was agreed upon (Supplementary Fig. 1). Low-quality or inaccurate collection or processing of samples can result in low sample availability, introduce batch effects or bias in the results (risk 4), and lead to inconsistent scores (risk 5). High quality of samples was ensured by standardization of our workflow, in which all steps were performed in the same manner for every biopsy (Supplementary Fig. 1).
Oversight of the entire workflow by one person, referred to as the central manager, is essential for timely identification of technical errors. The central manager tracked the timing of the biopsies, notified the pathologists immediately after the scan was uploaded and sent reminders if necessary, kept track of the scores and timing, and noted the score in the patient record for trial office notification. We predefined acceptable timeframes for obtaining the scores of the reviewers and tracked these during the study progress (risk 6; Supplementary Fig. 1). Pathologists were notified via email the next working day if the slide had not yet been scored, to minimize the waiting period before the start of treatment (risk 6). Finally, using Slide Score, we reduced the risk of typos and other manual errors by collecting all slides within one online study group (collection of slides) and by building a customized scoring form to standardize scores and obtain structured data (risk 7).
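The disagreement rule described above can be illustrated with a short Python sketch. It is not the implementation used in the trial: the function names, the requirement of at least three scores, and the example values are assumptions made for illustration. Only the 5% cut-off and the idea of flagging a biopsy when reviewers fall on different sides of that cut-off come from the text.

```python
from statistics import mean

STIL_CUTOFF = 5.0  # percent; the stratification cut-off stated in the text


def stil_category(score: float, cutoff: float = STIL_CUTOFF) -> str:
    """Return 'high' for scores at or above the cut-off, otherwise 'low'."""
    return "high" if score >= cutoff else "low"


def review_scores(scores: list[float], cutoff: float = STIL_CUTOFF) -> dict:
    """Summarize the reviewer scores of one biopsy.

    A biopsy is flagged for a consensus round when the reviewers do not
    all place it in the same category. Requiring three scores is an
    assumption of this sketch.
    """
    if len(scores) < 3:
        raise ValueError("at least three reviewer scores are expected")
    categories = {stil_category(s, cutoff) for s in scores}
    return {
        "mean_score": mean(scores),
        "categories": sorted(categories),
        "needs_consensus": len(categories) > 1,
    }


# Hypothetical reviewer scores (percent sTILs) for one biopsy.
example = review_scores([1, 2, 3, 10])
print(example["needs_consensus"])  # True: 10% falls in the other category
```

Flagging on the category rather than on the raw difference keeps the check tied to what matters for randomization, namely the stratum a patient is assigned to.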

Implementation of workflow in the TONIC trial

The TONIC trial (NCT02499367)6 is a phase II, non-comparative randomized multi-cohort single-center trial (full title: Adaptive phase II randomized non-comparative Trial Of Nivolumab after Induction treatment in TNBC patients), designed to assess whether induction of an anti-cancer immune response by low-dose chemotherapy or irradiation increases response to anti-PD-1 in patients with metastatic TNBC. In the first part of the trial6, patients with metastatic TNBC were randomized to (1) no induction or to a two-week low-dose induction with (2) irradiation (3 × 8 Gy), (3) cyclophosphamide, (4) cisplatin, or (5) doxorubicin, all followed by nivolumab (anti-PD-1; 3 mg/kg). Based on Simon’s two-stage design17 and prespecified pick-the-winner criteria, only the doxorubicin cohort was allowed to continue in the second part of the trial6. In the second part of the TONIC trial, patients were randomized between anti-PD-1 monotherapy (control group) and two cycles of low-dose doxorubicin (15 mg flat dose, weekly), followed by anti-PD-1 (Supplementary Fig. 2a). Randomization was stratified for sTILs by dividing patients into two categories, namely sTILhigh (equal to or exceeding 5%) and sTILlow (lower than 5%). The cut-off was determined based on data obtained in the first part of the TONIC trial, in which we observed that sTILs were predictive of response to anti-PD-1, both as a continuous variable and when a cut-off of 5% was used6. These data confirmed the predictive value of sTILs of at least 5% reported in another trial, which tested the efficacy of anti-PD-1 in patients with metastatic TNBC4. The full protocol, including four amendments, and the informed consent form were approved by the medical-ethical committee of The Netherlands Cancer Institute. All patients provided written informed consent before enrollment. The trial was registered on 17 August 2015.
The 47 patients of the second part of the trial were randomized between March 2018 and July 2019. Full eligibility criteria and trial procedures have been described previously6.

In the second part of the TONIC trial, we implemented our workflow with a focus on accurate and reproducible sTIL scores within a reasonable timeframe after a biopsy was taken (72 h). For all 47 patients included in the trial, reliable sTIL scores were obtained, with 45 biopsies scored within the 72-h timeframe (Supplementary Fig. 3). During the course of the study, the server of Slide Score was available 99.9% of the time. Five biopsies had to be re-evaluated due to a discrepancy in the categorical scores, when not all pathologists agreed on the appropriate category of the sTIL score (lower than 5% versus higher than or equal to 5%). In three of these cases the score of one pathologist was higher (5 or 10%) than the score of the other two or three pathologists (0–3%). The average sTIL score was obtained and the pathologist causing the disagreement was notified. In the fourth and fifth cases, two pathologists scored 5 and 10%, whereas the other pathologists scored 1%. All four pathologists were notified of the disagreement and a consensus score of 5% was obtained. We observed an intraclass correlation coefficient of 0.94 (95% confidence interval (CI): 0.91–0.97) for sTILs as a continuous variable. Interrater agreement for the categorical variable used in the stratification (sTILs <5% or ≥5%) was 0.86 (multirater Fleiss’ κ18; 95% CI: 0.73–1; Supplementary Fig. 2c). In the anti-PD-1 monotherapy cohort, we observed that 13 out of 23 patients (56.5%) had sTILs below 5%, as compared to 15 out of 24 patients in the doxorubicin cohort (62.5%; Fisher’s exact test p value 0.77). The distribution of the sTIL scores is depicted in Supplementary Fig. 2d. These data indicate effective stratification based on the cut-off of 5%, but a slightly uneven distribution in the higher ranges of sTIL scores (10% or higher) inherent to the use of our cut-off.
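For readers who want to recompute the categorical statistics above, the following Python sketch implements a two-sided Fisher's exact test and Fleiss' κ using only the standard library. It is an illustration, not the software used in the study, which the text does not name. The 2×2 table is taken from the counts reported above (13 of 23 and 15 of 24 patients below 5%); the ratings passed to `fleiss_kappa` are hypothetical, and the function assumes that every slide was scored by the same number of raters.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, col1, col2 = a + b, a + c, b + d
    total = comb(col1 + col2, row1)
    lowest, highest = max(0, row1 - col2), min(row1, col1)
    # Integer weights avoid floating-point ties when comparing tables.
    weight = {
        k: comb(col1, k) * comb(col2, row1 - k)
        for k in range(lowest, highest + 1)
    }
    extreme = sum(w for w in weight.values() if w <= weight[a])
    return float(Fraction(extreme, total))


def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = raters placing subject i in category j.

    Undefined (division by zero) if all ratings fall in one category.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject needs the same number of ratings")
    # Overall share of ratings per category.
    shares = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(len(counts[0]))
    ]
    # Observed agreement per subject, averaged over subjects.
    observed = sum(
        (sum(x * x for x in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_subjects
    expected = sum(p * p for p in shares)
    return (observed - expected) / (1 - expected)


# Rows: monotherapy and doxorubicin cohorts; columns: sTILs <5% and >=5%.
p_value = fisher_exact_two_sided(13, 10, 15, 9)
print(round(p_value, 2))  # 0.77, in line with the reported p value

# Hypothetical ratings: four raters, categories [low, high], four slides.
print(fleiss_kappa([[4, 0], [0, 4], [3, 1], [4, 0]]))
```

The intraclass correlation coefficient is not covered here because its value depends on the chosen ICC model, which the text does not specify.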
We observed a median time from biopsy until the scanning of the H&E slide of 30 h (range 24–98 h) and a median time from the biopsy until at least three scores were obtained of 43 h (range 27–106 h). In total, the median time from biopsy until registration in the patient records was 49 h (range 41–106 h; Supplementary Fig. 2e), with 96% of biopsies scored within 72 h. Two biopsies were not scored within the 72-h time limit: one because the sample required additional processing, and one because registration was delayed by the absence of the central manager (Supplementary Figs 1 and 3).
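Turnaround monitoring of this kind reduces to a small summary over per-biopsy durations. The sketch below is one possible way to compute it. The function name, the inclusive treatment of the 72-h limit, and the example durations are assumptions; the only figures taken from the text are the 72-h limit and the 45 of 47 biopsies scored in time.

```python
from statistics import median


def turnaround_summary(hours: list[float], limit_h: float = 72.0) -> dict:
    """Median turnaround time and share of cases within the limit.

    A case counts as on time when its duration is at most limit_h
    (treating the limit as inclusive is an assumption of this sketch).
    """
    if not hours:
        raise ValueError("no turnaround times supplied")
    within = sum(1 for h in hours if h <= limit_h)
    return {
        "median_h": median(hours),
        "within_limit": within,
        "share_within": within / len(hours),
    }


# Hypothetical biopsy-to-registration times in hours.
print(turnaround_summary([41, 49, 60, 106]))

# Reported counts: 45 of 47 biopsies within 72 h, which rounds to 96%.
print(round(100 * 45 / 47))  # 96
```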

Advantages and limitations of a web-based risk-mitigation workflow

Our proposed solutions involved standardization of our workflow, obtaining digital images, and the use of a web-based tool such as Slide Score for the managing and scoring of digital images. Anticipating the incorporation of digital images in routine diagnostics, our workflow shows that it is feasible for a pathologist to score digital images with high reliability. Moreover, a web-based tool can facilitate coordinated uploading of digital images, pseudonymization of slides, regulation of access to studies, and proper data management. Web-based platforms are therefore of high interest in biomarker research and can help with automation that can be transferred to clinical practice in the future.

In this study, we obtained sTIL scores within 72 h after a biopsy was taken, which is a reasonable timeframe for clinicians to start randomization of patients to treatment arms in a clinical trial. We observed excellent interrater agreement among our panel of four expert pathologists. In an accompanying paper7, we demonstrate, using data from three RING studies of the TIL Working Group, that the concordance achieved with a risk-management approach as detailed in this study is substantially higher than that observed without such an approach, both in the three RING studies and in other published studies19,20. However, our sample size is small and the four pathologists in the current study were trained and experienced in the scoring of sTILs in breast cancer. Also, the biopsies used in this study were checked for containing sufficient tumor cells (≥100 cells) before the slide was scored for sTILs, which could have further improved our results. In the future, it is to be expected that computational workflows will further improve the scoring of sTILs13. Although we obtained reliable and timely results in 96% of cases, the presence of a central manager is crucial. In one case there was a delay in registration time due to the absence of the central manager. Manual intervention for quality checks and for processing of the slides and data cannot be circumvented in our workflow.

Stratification in this study was performed using sTILs as a binary variable (lower than 5% versus higher than or equal to 5%). Consequently, we observed an uneven distribution in continuous sTILs scores between the cohorts (Supplementary Fig. 2d). This was mainly due to more patients with sTILs scores above 10% in the anti-PD-1 monotherapy cohort. Inherent to the use of a binary cut-off for stratification, the median of the continuous measurement might still differ between cohorts. Alternatively, multiple categories for the same variable can be used in stratification. However, this approach generates more strata, with a lower number of patients in each stratum, possibly leading to an imbalance in distribution21,22. Moreover, at the time of writing of this paper, no cut-offs for sTILs had been established and/or properly validated for predictive purposes.

During the trial, we continuously monitored whether our strategy was still feasible within the set timeframe by means of regular evaluation by the pathologists and the study coordinators. This led to rapid adjustment of the workflow if needed, ensuring the quality of the sTIL scores. For example, pathologists could easily log in remotely and score a digital H&E slide outside the hospital, ensuring that sTILs were still scored within 72 h after biopsy. Ongoing evaluation during the clinical trial is of critical importance for risk mitigation in biomarker research1.

Future applications of the workflow

Our strategy can serve as a template for risk management and mitigation of all identified risks in future clinical trials incorporating biomarkers for inclusion, enrichment, or stratification. By no means will the risks identified in this study be the same for all clinical trials. Each trial will have its own risks that need to be mitigated, although there will be similarities between the risks across clinical trials. Defining the risks that come with biomarker development will help tested biomarkers eventually make their way to the clinic. One may even argue that a similar risk-management strategy can be applied in daily practice. In the BELLINI trial (NCT03815890), two cycles of neo-adjuvant anti-PD-1 are administered to patients with early-stage TNBC or luminal B breast cancer. All patients are required to have at least 5% sTILs in the pretreatment biopsy and patients are thereafter stratified in three sTIL categories. Our workflow will be used to ensure timely and reliable sTIL scores for the right patient selection. With our workflow, scoring of sTILs is highly standardized, which also allows smaller centers with less extensive experience in sTILs scoring to participate in a clinical trial.

Conclusions

In contemporary clinical research there is an increasing trend toward the use of biomarker results obtained in daily practice to select patients for inclusion in clinical trials. Therefore, continuous monitoring of the predefined risks and the solutions can improve the quality of the biomarker, which can be applied in a clinical trial setting, as well as in daily practice. The recommendations of the TIL Working Group8,10 for appropriate scoring, the risk-management framework of the NCI, NCRI, and EORTC Working Groups1, and our proposed strategies to reduce risks will help to effectively and efficiently improve the incorporation of biomarkers in clinical trials in the first instance, as illustrated here using sTILs as a paradigm of this development.