• The predictive performance of RR was superior to the conventional criteria in both primary and second-line treatment settings.

• RR was externally validated and remained consistent despite differences in patient characteristics and second-line therapy.

Abstract

Overall response (OR) that combines complete (CR) and partial responses (PR) is the conventional end point for acute graft-versus-host disease (GVHD) trials. Because PR includes heterogeneous clinical presentations, reclassifying PR could produce a better end point. Patients in the primary treatment cohort from the Japanese Society for Transplantation and Cellular Therapy (JSTCT) were randomly divided into training and validation sets. In the training set, a classification and regression tree algorithm generated day 28 refined response (RR) criteria based on symptoms at treatment and day 28. We then evaluated RR for primary and second-line treatments, using the area under the receiver operating characteristic curve (AUC) and negative predictive value (NPV) for 6-month nonrelapse mortality as performance measures. RR considered patients with grade 0/1 at day 28 without additional treatment as responders. RR for primary treatment produced higher AUCs than OR with small improvement of NPVs in both validation sets: JSTCT (AUC, 0.73 vs 0.69 [P < .001]; NPV, 92.0% vs 89.6% [P < .001]) and the Mount Sinai Acute GVHD International Consortium (MAGIC; AUC, 0.71 vs 0.68 [P = .032]; NPV, 90.9% vs 89.8% [P = .009]). RR for second-line treatment produced similar AUCs but much higher NPVs than OR in both validation sets of JSTCT (AUC, 0.64 vs 0.63 [P = .775]; NPV, 74.5% vs 66.0% [P < .001]) and MAGIC (AUC, 0.67 vs 0.64 [P = .105]; NPV, 86.8% vs 76.1% [P = .004]). Classifying persistent but mild skin symptoms as responses and residual lower gastrointestinal GVHD as nonresponses were major drivers in improving the prognostic performance of RR. Our externally validated day 28 RR would serve as a better end point than conventional criteria in future first- and second-line treatment trials.

Acute graft-versus-host disease (GVHD) is a common complication after allogeneic hematopoietic cell transplantation (HCT).1 Despite recent advances in GVHD prophylaxis, the incidence of grade 2 to 4 acute GVHD requiring systemic treatment remains high.2-5 Similarly, long-term outcomes, particularly for steroid-refractory GVHD,6,7 remain poor with rates of nonrelapse mortality (NRM) at 1 to 2 years as high as 50%.1,8-13 These findings underscore the need for novel therapeutic agents.

The effectiveness of experimental treatments is assessed through both short-term (eg, disease response) and long-term (eg, NRM) end points. Higher response rates in GVHD, however, do not always correlate with long-term survival benefits13,14 and may be confounded by crossover designs, underscoring the need for more precise short-term end points that can serve as reliable surrogates for long-term outcomes. The current standard short-term end point in clinical trials of both primary and second-line treatment of acute GVHD is overall response (OR) at day 28, which combines complete (CR) and partial responses (PR).13-25 Previous studies in the primary treatment setting have shown that CR and PR both correlate with similar long-term outcomes.26-30 However, limited data are available on the predictive performance of OR for long-term outcomes in the current treatment landscape,31 particularly after second-line therapy.32 Furthermore, PR is considered equally indicative of long-term acute GVHD control, whether the response constitutes a near resolution of all symptoms or a modest improvement with persistent severe symptoms. For example, a patient with persistent diarrhea (stable stage 2 gastrointestinal [GI] GVHD) who experiences a minimal improvement in their skin rash (stage 2 to stage 1) is considered to have experienced a PR equivalent to a patient with complete resolution of diarrhea (GI stage 2 to stage 0) and the same improvement in skin rash (stage 2 to stage 1). This limitation prompted an expert panel to conclude that PR alone is inadequate for accurately assessing acute GVHD control.33 As a result, day 28 CR is sometimes used as a primary or secondary end point in clinical trials.13,14,18-20,23,24 We hypothesized that refining the standard response criteria, especially by reclassifying PR, would provide a more accurate reflection of disease control and improve the prediction of survival outcomes across both first- and second-line treatments.

The Japanese Society for Transplantation and Cellular Therapy (JSTCT) and the Japanese Data Center for HCT have compiled clinical information on patients who underwent HCT at >300 centers across Japan.34,35 In parallel, the Mount Sinai Acute GVHD International Consortium (MAGIC) prospectively monitored the clinical course of acute GVHD at 24 HCT centers in North America, Europe, and Thailand.36 In this collaborative study between the JSTCT and MAGIC, we first developed day 28 refined response (RR) criteria and then validated them and compared their performance against conventional treatment response measures, day 28 OR and CR, using validation cohorts from the JSTCT and MAGIC registry databases.

Data source and patient selection

This study included adolescent/adult patients (aged ≥16 years) who underwent their first allogeneic HCT between 2014 and 2021 in Japan and between 2014 and 2022 in the MAGIC centers. Patients who received systemic corticosteroids as primary treatment for grade 2 to 4 acute GVHD were included in the primary treatment cohort. Patients from the primary treatment cohort who received second-line treatment for grade 2 to 4 acute GVHD were included in the second-line treatment cohort. A CONSORT diagram is presented in supplemental Figure 1. Informed consent was previously obtained from all participants in accordance with the Declaration of Helsinki. This study was approved by the institutional review boards of the Jichi Medical University Saitama Medical Center and MAGIC and the data management committee of JSTCT/Japanese Data Center for HCT.

Definitions

Acute GVHD was graded according to the MAGIC criteria.37 Treatment response was assessed at day 28 from the initiation of primary or second-line treatment. Conventional GVHD response was assessed using the following criteria27,33: CR was defined as the resolution of all symptoms without additional systemic treatment; PR was defined as improvement in at least 1 organ with no worsening of other target organs or additional systemic treatment; and OR was defined as achieving either CR or PR. When indicated, grades 1a and 1b were defined as any improvement that approximates CR but with residual stage 1 and stage 2 skin disease, respectively.33 Grade 2a was defined as skin stage of ≤2 and GI stage of ≤1 without liver involvement.38 Grade 2b was defined as skin stage of >2, GI stage of >1, or any liver GVHD. The criteria for determining cause of death were slightly different between the JSTCT and MAGIC cohorts. For JSTCT patients, cause of death was assessed by the treating physician at each center and reported as either acute GVHD or other causes. For MAGIC patients, cause of death was centrally adjudicated and categorized as either uncontrolled acute GVHD (if active GVHD was present at the time of death) or other causes if GVHD was inactive.
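The conventional definitions above can be sketched as a small classifier. This is only an illustration under stated assumptions: the organ keys, the function names, and the use of a single "lower_gi" entry for GI staging are hypothetical and are not taken from the study's data dictionary.

```python
# Hypothetical organ keys; stages are integers from 0 to 4.
ORGANS = ("skin", "liver", "upper_gi", "lower_gi")


def conventional_response(baseline, day28, additional_tx):
    """Return 'CR', 'PR', or 'NR' for the day 28 assessment.

    baseline, day28: dicts mapping each organ in ORGANS to its stage.
    additional_tx: True if additional systemic treatment was given by day 28.
    """
    if additional_tx:
        return "NR"  # CR and PR both require no additional systemic treatment
    if all(day28[o] == 0 for o in ORGANS):
        return "CR"  # resolution of all symptoms
    improved = any(day28[o] < baseline[o] for o in ORGANS)
    worsened = any(day28[o] > baseline[o] for o in ORGANS)
    # PR: improvement in at least 1 organ with no worsening elsewhere
    return "PR" if improved and not worsened else "NR"


def overall_response(label):
    """OR is achieved by either CR or PR."""
    return label in ("CR", "PR")


# The two patients from the introduction: both count as PR.
start = {"skin": 2, "liver": 0, "upper_gi": 0, "lower_gi": 2}
persistent_gi = {"skin": 1, "liver": 0, "upper_gi": 0, "lower_gi": 2}
resolved_gi = {"skin": 1, "liver": 0, "upper_gi": 0, "lower_gi": 0}
print(conventional_response(start, persistent_gi, False))
print(conventional_response(start, resolved_gi, False))
```

Both example patients receive the same label even though one still has stage 2 GI disease, which is the heterogeneity within PR that motivates the refinement.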

Statistical analysis

The Gray method was used to estimate and compare NRM, with primary disease relapse considered a competing event. The landmark point was set at day 28 from the initiation of primary or second-line treatment, which served as the starting point for all analyses. The primary end point was 6-month NRM from the landmark point.
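To make the competing-risks setup concrete, the following is a minimal sketch of a nonparametric cumulative incidence estimate in which relapse competes with NRM. It is not the study's code (the analyses were run in R), it does not implement the Gray test for comparing groups, and the event coding (0 = censored, 1 = NRM, 2 = relapse) and function name are assumptions made for illustration.

```python
def cumulative_incidence(times, events, horizon):
    """Cumulative incidence of event type 1 by `horizon`.

    times: follow-up time from the landmark for each patient.
    events: 0 = censored, 1 = event of interest (NRM), 2 = competing event.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    event_free = 1.0  # probability of no event of either type so far
    cif = 0.0
    i = 0
    while i < len(data) and data[i][0] <= horizon:
        t = data[i][0]
        d_interest = d_competing = censored = 0
        # Collect everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            if data[i][1] == 1:
                d_interest += 1
            elif data[i][1] == 2:
                d_competing += 1
            else:
                censored += 1
            i += 1
        # Increment: event-free just before t, times hazard of the event of interest.
        cif += event_free * d_interest / at_risk
        event_free *= 1 - (d_interest + d_competing) / at_risk
        at_risk -= d_interest + d_competing + censored
    return cif


# Toy data: 4 patients, 2 with NRM, 1 relapse, 1 censored after the last event.
print(cumulative_incidence([1, 2, 3, 4], [1, 2, 1, 0], horizon=5))
```

With no censoring before the events, the estimate reduces to the simple fraction of patients with NRM (2 of 4 in the toy data), which is a quick sanity check on the estimator.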

The predictive accuracy of treatment response criteria was assessed using 2 key measures: the area under the receiver operating characteristic curve (AUC); and the negative predictive value (NPV), which quantifies the likelihood that a patient who responds to treatment (ie, short-term treatment success) will not experience NRM (ie, long-term treatment failure). NPV is particularly valuable for its clinical relevance, because it indirectly reflects GVHD control and aids in the interpretation of clinical trial outcomes. The DeLong test and permutation test were used to calculate P values for the AUC and NPV comparisons.39 The ΔAUC, ΔNPV, and their corresponding 95% confidence intervals (CIs) were calculated using 1000 bootstrap resamples.40 All outcomes were censored at 6 months from the landmark point, except for time-dependent receiver operating characteristic curve analyses.41 Because CR is expected to have the highest NPV, the noninferiority margin for change in NPV was set at –5% compared to CR.
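As a rough illustration of these measures, the sketch below computes sensitivity, specificity, predictive values, and AUC for a binary response criterion, and a paired percentile bootstrap interval for ΔNPV. It rests on simplifying assumptions: every patient has a known 6-month outcome (no censoring or competing risks), nonresponse is treated as the "positive" test, and the function names and toy data are hypothetical. The DeLong and permutation tests used in the study are not reproduced.

```python
import random


def metrics(nonresponse, nrm):
    """nonresponse[i] and nrm[i] are 0/1 flags for patient i."""
    pairs = list(zip(nonresponse, nrm))
    tp = sum(1 for n, d in pairs if n and d)          # nonresponder with NRM
    fp = sum(1 for n, d in pairs if n and not d)      # nonresponder, no NRM
    fn = sum(1 for n, d in pairs if not n and d)      # responder with NRM
    tn = sum(1 for n, d in pairs if not n and not d)  # responder, no NRM
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # A binary criterion has one ROC operating point, so the
        # trapezoidal AUC equals the mean of sensitivity and specificity.
        "auc": (sens + spec) / 2,
    }


def bootstrap_delta_npv(nonresp_a, nonresp_b, nrm, n_boot=1000, seed=0):
    """Percentile 95% CI for NPV(a) - NPV(b) from paired resamples."""
    rng = random.Random(seed)
    n = len(nrm)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        outcome = [nrm[i] for i in idx]
        try:
            npv_a = metrics([nonresp_a[i] for i in idx], outcome)["npv"]
            npv_b = metrics([nonresp_b[i] for i in idx], outcome)["npv"]
        except ZeroDivisionError:
            continue  # skip degenerate resamples with an empty cell
        deltas.append(npv_a - npv_b)
    deltas.sort()
    lower = deltas[int(0.025 * len(deltas))]
    upper = deltas[min(len(deltas) - 1, int(0.975 * len(deltas)))]
    return lower, upper


# Toy data for 10 patients (illustrative only, not study data).
nrm = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
criterion_a = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
criterion_b = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(metrics(criterion_a, nrm))
print(bootstrap_delta_npv(criterion_a, criterion_b, nrm))
```

The identity between AUC and the mean of sensitivity and specificity for a binary criterion explains why the AUC and balanced accuracy columns of Table 3 agree after rounding.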

We developed day 28 RR criteria as follows. First, the primary treatment cohort from JSTCT was randomly divided into training and validation sets in a 2:1 ratio. Second, in the training set, we categorized the patients into 10 groups based on the GVHD grades at treatment initiation (grades 2 or 3/4) and at day 28 (grade 0, 1, 2, or 3/4; or given second-line therapy before day 28). Third, a classification and regression tree (CART)42 algorithm was used to divide these 10 groups into 2 groups based on the risk of 6-month NRM: response and nonresponse (NR). The CART model was constructed using the “rpart” package in R. The splitting criterion was based on the Gini index or reduction in variance, with a complexity parameter set to 0.1 and a maximum depth of 2 levels. To prevent overfitting, we specified a minimum number of 20 observations required to attempt a split and a minimum of 20 observations per terminal node. Fourth, we compared the treatment response criteria across the 4 validation sets from JSTCT and MAGIC for primary and second-line treatments. The number of patients in the JSTCT second-line treatment cohort (n = 510) was insufficient to develop second-line–specific response criteria.
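The splitting step can be illustrated with a simplified, pure-Python sketch of a CART root split chosen by Gini impurity. This is not the rpart model used in the study. It assumes that event counts can be approximated as n × 6-month NRM from Table 2 (which ignores censoring and competing risks), it evaluates only the first split, and it does not apply the complexity parameter or minimum node sizes. All names are hypothetical.

```python
# (grade at treatment, status at day 28, n, 6-month NRM %) copied from Table 2.
# "2L" marks patients given second-line therapy before day 28.
TABLE2 = [
    ("2", "0", 1071, 5.9), ("2", "1", 263, 8.2), ("2", "2", 188, 23.4),
    ("2", "3/4", 39, 67.4), ("2", "2L", 128, 31.3),
    ("3/4", "0", 234, 8.3), ("3/4", "1", 41, 9.8), ("3/4", "2", 99, 34.5),
    ("3/4", "3/4", 99, 47.6), ("3/4", "2L", 179, 45.5),
]


def gini(n, events):
    """Gini impurity of a node with a binary outcome."""
    p = events / n
    return 2 * p * (1 - p)


def best_split(rows, feature):
    """Best two-way partition of one categorical feature (column 0 or 1).

    Returns (weighted impurity, low-risk categories, high-risk categories).
    """
    pooled = {}
    for row in rows:
        n = row[2]
        events = n * row[3] / 100  # ASSUMPTION: approximate event count
        total = pooled.setdefault(row[feature], [0, 0.0])
        total[0] += n
        total[1] += events
    # For a binary outcome, ordering categories by event rate and scanning
    # the cut points is enough to find the best Gini partition.
    cats = sorted(pooled, key=lambda c: pooled[c][1] / pooled[c][0])
    n_all = sum(n for n, _ in pooled.values())
    best = None
    for k in range(1, len(cats)):
        low, high = cats[:k], cats[k:]
        n_low = sum(pooled[c][0] for c in low)
        e_low = sum(pooled[c][1] for c in low)
        n_high = sum(pooled[c][0] for c in high)
        e_high = sum(pooled[c][1] for c in high)
        impurity = (n_low * gini(n_low, e_low)
                    + n_high * gini(n_high, e_high)) / n_all
        if best is None or impurity < best[0]:
            best = (impurity, set(low), set(high))
    return best


by_initial_grade = best_split(TABLE2, feature=0)
by_day28_status = best_split(TABLE2, feature=1)
root = min(by_initial_grade, by_day28_status, key=lambda s: s[0])
print(sorted(root[1]), sorted(root[2]))
```

Under these assumptions, the split on day 28 status is purer than the split on grade at treatment, and it separates day 28 grades 0 and 1 from all other day 28 categories, in line with the response definition reported in Table 2.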

All P values were 2-sided, with values <.05 considered statistically significant. All statistical analyses were performed with R version 4.3.2 (R Foundation) and EZR43 version 1.63.

Patient characteristics

In the primary treatment cohorts, 188 patients (5.1%) from JSTCT and 101 patients (8.1%) from MAGIC were excluded for death or relapse before the landmark point, day 28 after treatment. Similarly, 50 patients (8.9%) in the JSTCT cohort and 52 patients (21.0%) in the MAGIC cohort were excluded from the second-line treatment cohorts. A summary of patient characteristics is provided in Table 1. There were differences in pretransplant characteristics between JSTCT and MAGIC patients, such as race, donor sources, and GVHD prophylaxis.

Table 1.

Patient and GVHD characteristics

Characteristic; Primary Tx cohort: JSTCT (n = 3497), MAGIC (n = 1141); Second-line Tx cohort: JSTCT (n = 510), MAGIC (n = 196)
Median age at HCT (IQR), y 53 (42-61) 58 (45-65) 55 (44-62) 58 (42.75-64) 
Age, category, n (%)     
<50 years 1455 (41.6) 473 (41.5) 186 (36.5) 83 (42.3) 
≥50 years 2042 (58.4) 668 (58.5) 324 (63.5) 113 (57.7) 
Sex match between recipient and donor, n (%)     
Female to male 728 (20.8) 175 (15.4) 85 (16.7) 33 (16.9) 
Other 2769 (79.2) 965 (84.6) 425 (83.3) 162 (83.1) 
Race, n (%)     
Asian 3486 (99.7) 46 (4.0) 508 (99.6) 13 (6.6) 
White 2 (0.0) 950 (83.3) 0 (0.0) 153 (78.1) 
Black 0 (0.0) 52 (4.6) 0 (0.0) 9 (4.6) 
Others 9 (0.3) 9 (0.8) 2 (0.4) 2 (1.0) 
Unknown 0 (0.0) 84 (7.4) 0 (0.0) 19 (9.7) 
Primary disease, n (%)     
Acute leukemia 2137 (61.1) 604 (52.9) 281 (55.1) 93 (47.4) 
MDS/MPN 720 (20.6) 316 (27.7) 128 (25.1) 62 (31.6) 
Malignant lymphoma 523 (15.0) 88 (7.7) 85 (16.7) 18 (9.2) 
Other malignancies 44 (1.3) 94 (8.2) 11 (2.2) 17 (8.7) 
Nonmalignancies 73 (2.1) 39 (3.4) 5 (1.0) 6 (3.1) 
HCT-CI, n (%)     
<3 2912 (83.7) 726 (63.6) 429 (84.4) 122 (62.2) 
≥3 566 (16.3) 415 (36.4) 79 (15.6) 74 (37.8) 
Donor source, n (%)     
HLA matched related 590 (16.9) 247 (21.6) 70 (13.7) 43 (21.9) 
HLA mismatched related 78 (2.2) 8 (0.7) 8 (1.6) 1 (0.5) 
HLA matched unrelated 719 (20.6) 618 (54.2) 107 (21.0) 113 (57.7) 
HLA mismatched unrelated 602 (17.2) 99 (8.7) 85 (16.7) 21 (10.7) 
Umbilical cord blood 1175 (33.6) 38 (3.3) 182 (35.7) 10 (5.1) 
Haploidentical 333 (9.5) 131 (11.5) 58 (11.4) 8 (4.1) 
Conditioning intensity, n (%)     
Myeloablative 2251 (64.4) 606 (53.1) 315 (61.8) 100 (51.0) 
Reduced intensity 1246 (35.6) 535 (46.9) 195 (38.2) 96 (49.0) 
Posttransplant cyclophosphamide, n (%)     
No 3287 (94.0) 934 (81.9) 493 (96.7) 178 (90.8) 
Yes 210 (6.0) 207 (18.1) 17 (3.3) 18 (9.2) 
In vivo T-cell depletion, n (%)     
No 3055 (87.4) 717 (62.8) 427 (83.7) 120 (61.2) 
Yes 442 (12.6) 424 (37.2) 83 (16.3) 76 (38.8) 
Median year of HCT (IQR) 2018 (2016-2020) 2018 (2016-2020) 2018 (2016-2020) 2018 (2016-2020) 
GVHD grades at Tx onset, n (%)     
Grade 2 2545 (72.8) 823 (72.1) 135 (26.5) 70 (35.7) 
Grade 3 799 (22.8) 257 (22.5) 264 (51.8) 73 (37.2) 
Grade 4 153 (4.4) 61 (5.3) 111 (21.8) 53 (27.0) 
GVHD grades on day 28, n (%)     
Grade 0 1933 (55.3) 677 (59.3) 153 (30.0) 51 (26.0) 
Grade 1 472 (13.5) 122 (10.7) 45 (8.8) 12 (6.1) 
Grade 2 419 (12.0) 72 (6.3) 135 (26.5) 22 (11.2) 
Grade 3 157 (4.5) 24 (2.1) 109 (21.4) 12 (6.1) 
Grade 4 55 (1.6) 16 (1.4) 54 (10.6) 9 (4.6) 
Additional systemic Tx before day 28 461 (13.2) 230 (20.2) 14 (2.7) 90 (45.9) 
Days from HCT to treatment (IQR) 34 (26-48) 35 (23-61) 50 (37-72) 43 (35-59.5) 
Median initial corticosteroid dose (methylprednisolone [mg/kg], IQR) — 1.0 (0.75-1.84) NA NA 
Days from primary Tx to second-line Tx (IQR) NA NA 12.5 (8-26) 13 (7-23) 
6-Month NRM (95% CI), % 16.7 (15.5-18.0) 17.2 (15.0-19.5) 44.0 (39.5-48.3) 41.7 (34.6-48.6) 

HCT-CI, HCT-specific comorbidity index; IQR, interquartile range; MDS, myelodysplastic syndrome; MPN, myeloproliferative neoplasm; NA, not applicable; Tx, treatment.

As outlined in Table 1, GVHD characteristics and treatment were similar in both JSTCT and MAGIC. In the primary treatment cohort, the proportion of grade 3 to 4 GVHD (27.2% vs 27.9%), median time to treatment (34 vs 35 days), and 6-month NRM (16.7% vs 17.2%) were nearly identical between JSTCT and MAGIC patients; however, fewer patients in JSTCT received second-line treatment by day 28 (13.2% vs 20.2%; supplemental Table 1). In MAGIC centers, ruxolitinib accounted for >40% of second-line therapies, whereas, in JSTCT, mesenchymal stromal cells and antithymocyte globulin were commonly used because ruxolitinib was not approved for acute GVHD in Japan during this period.

In the second-line treatment cohort, JSTCT patients had more severe GVHD at the initiation of second-line therapy (grades 3-4, 73.5% vs 64.3%) and were much less likely to receive third-line treatments (supplemental Table 2) than MAGIC patients (2.7% vs 45.9%), reflecting different regional thresholds for off-label use of treatments for highly treatment-resistant GVHD. Six-month NRM from the initiation of second-line therapy was similar between cohorts (44.0% vs 41.7%), despite these practice differences.

Conventional treatment response

We first assessed the predictive performance of conventional criteria, CR, PR, and NR, for NRM using the JSTCT and MAGIC data sets (Figure 1). In the primary treatment cohort, the distribution was as follows: JSTCT (CR, 55.3%; PR, 19.8%; NR, 25.0%) and MAGIC (CR, 59.3%; PR, 13.2%; NR, 27.5%). In the second-line treatment cohort, the distribution was as follows: JSTCT (CR, 30.0%; PR, 36.7%; NR, 33.3%) and MAGIC (CR, 26.0%; PR, 13.8%; NR, 60.2%). The large difference in PR rates in the second-line treatment setting may be partly due to the infrequent initiation of third-line treatment before day 28 in Japan; in the MAGIC cohort, third-line treatment was initiated more often, and these patients were categorized as NR. When patients who died or relapsed before day 28 were included, the proportion of PR in the JSTCT and MAGIC cohorts was 18.8% and 12.1% for primary treatment and 33.4% and 10.9% for second-line treatment, respectively (supplemental Table 3). Although OR considers both PR and CR to be equivalent predictors of long-term outcomes, in both primary treatment cohorts, patients with PR had 1.5-fold to twofold higher 6-month NRM than patients with CR; this difference was not statistically significant for the MAGIC cohort (JSTCT, 17.3% vs 6.8% [P < .001]; MAGIC, 14.5% vs 9.4% [P = .190]; Figure 1A-B). Similar 1.5-fold to twofold differences in NRM between PR and CR were observed in the second-line treatment cohorts, although the difference was not statistically significant for the MAGIC cohort because of the small number of patients (JSTCT, 40.9% vs 25.7% [P = .009]; MAGIC, 34.1% vs 16.3% [P = .250]; Figure 1C-D).

Figure 1.

Conventional treatment response criteria at day 28. The cumulative incidence of NRM within 6 months stratified by NR (red), PR (orange), and CR (blue) in the primary treatment cohort of JSTCT (A), primary treatment cohort of MAGIC (B), second-line treatment cohort of JSTCT (C), and second-line treatment cohort of MAGIC (D). The pie chart represents the proportion of each risk group. P values for pair-wise comparisons were adjusted by the Bonferroni method.

RR

We developed day 28 RR criteria using a training set derived by randomly dividing patients from the primary treatment cohort of JSTCT (supplemental Table 4). In the training set, patients were grouped into 10 categories based on the GVHD grades at treatment initiation and day 28 (Table 2). The CART algorithm divided these 10 groups into 2 distinct categories based on the risk of 6-month NRM. Specifically, patients with grades 0/1 at day 28 who did not require additional systemic therapy were classified as refined responders, whereas all other patients, regardless of their initial GVHD severity, were classified as refined nonresponders (Table 2). To evaluate the robustness of this classification, we performed sensitivity analyses by further stratifying GVHD grades. When we separated grades 3 from 4 (supplemental Table 5), grade 2 into 2a and 2b38 (supplemental Table 6), or grade 1 into 1a and 1b33 (supplemental Table 7), the CART algorithm produced nearly identical response criteria.

Table 2.

Development of RR in the JSTCT training set

At treatment; At day 28; n (%); 6-Month NRM (95% CI), %; Day 28 RR
Grade 2 Grade 0 1071 (45.7) 5.9 (4.6-7.5) Response 
Grade 2 Grade 1 263 (11.2) 8.2 (5.2-11.9) Response 
Grade 2 Grade 2 188 (8.0) 23.4 (17.5-29.8) Nonresponse 
Grade 2 Grade 3/4 39 (1.7) 67.4 (49.5-80.1) Nonresponse 
Grade 2 Second-line Tx 128 (5.5) 31.3 (23.3-39.6) Nonresponse 
Grade 3/4 Grade 0 234 (10.0) 8.3 (5.2-12.4) Response 
Grade 3/4 Grade 1 41 (1.8) 9.8 (3.0-21.2) Response 
Grade 3/4 Grade 2 99 (4.2) 34.5 (25.1-44.0) Nonresponse 
Grade 3/4 Grade 3/4 99 (4.2) 47.6 (37.4-57.1) Nonresponse 
Grade 3/4 Second-line Tx 179 (7.7) 45.5 (38.0-52.7) Nonresponse 

The training set included 2341 patients.

Categorized by CART analyses based on 6-month NRM.
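The resulting rule is simple enough to state as a one-line function. The sketch below is an illustration only; the function name and argument encoding (integer grade 0 to 4, plus a flag for additional systemic treatment before day 28) are assumptions.

```python
def refined_response(day28_grade, additional_tx):
    """Day 28 RR: grade 0 or 1 at day 28 without additional systemic treatment.

    The grade at treatment initiation is deliberately not an input,
    because the rule in Table 2 does not depend on it.
    """
    return (not additional_tx) and day28_grade in (0, 1)


# Day 28 states from Table 2 and their RR category.
for grade, extra_tx in [(0, False), (1, False), (2, False), (3, False), (0, True)]:
    label = "Response" if refined_response(grade, extra_tx) else "Nonresponse"
    print(grade, extra_tx, label)
```

Unlike the conventional criteria, this rule looks only at the clinical state at day 28 and not at the change from baseline.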

Validation and comparative performance of day 28 RR

The day 28 RR criteria created distinct risk strata for 6-month NRM across all cohorts, including the JSTCT primary treatment training set (supplemental Figure 2A), both JSTCT and MAGIC primary treatment validation sets (Figure 2A-B), and both second-line treatment sets (Figure 2C-D). Response rates were lower by RR than OR at day 28 in all 4 cohorts, especially in the second-line treatment setting (Table 3). RR demonstrated greater sensitivity (ie, more nonsurvivors correctly classified as nonresponders) but lower specificity (ie, more survivors misclassified as nonresponders) than OR. The balance between sensitivity and specificity favored RR over OR in all 4 validation cohorts, as measured by balanced accuracy. We then quantified the comparative performance of RR to OR using AUC and NPV as primary measures. After primary treatment, RR demonstrated statistically higher AUCs and slightly higher NPVs than OR in both the JSTCT (AUC, 0.73 vs 0.69 [P < .001]; NPV, 92.0% vs 89.6% [P < .001]) and MAGIC cohorts (AUC, 0.71 vs 0.68 [P = .032]; NPV, 90.9% vs 89.8% [P = .009]; Table 3; Figure 3). A time-dependent AUC analysis showed that RR consistently outperformed OR across time points from 2 to 12 months (Figure 3). After second-line treatment, RR demonstrated substantially higher NPVs and small, nonsignificant increases in AUC in both the JSTCT (AUC, 0.64 vs 0.63 [P = .775]; NPV, 74.5% vs 66.0% [P < .001]) and the MAGIC cohorts (AUC, 0.67 vs 0.64 [P = .105]; NPV, 86.8% vs 76.1% [P = .004]; Table 3; Figure 3).

Figure 2.

NRM stratified by day 28 RR. The cumulative incidence of NRM within 6 months stratified by day 28 RR in the JSTCT validation set for primary treatment (A), MAGIC validation set for primary treatment (B), JSTCT validation set for second-line treatment (C), and MAGIC validation set for second-line treatment (D). The pie chart represents the proportion of each risk group.

Table 3.

Predictive performances for 6-month NRM

Criteria; Response rates (%); Sensitivity (%); Specificity (%); Balanced accuracy (%); PPV (%); NPV (%); P values for NPV; ΔNPV (95% CI), %; AUC; P values for AUC; ΔAUC (95% CI)
Primary Tx           
Training set (JSTCT)           
Day 28 RR 68.7 71.8 76.5 74.2 37.3 93.2 Ref Ref 0.74 Ref Ref 
Day 28 OR 75.2 58.1 81.5 69.8 37.9 90.8 <.001 2.5 (1.8-3.2) 0.70 <.001 0.044 (0.027-0.062) 
Day 28 CR 55.8 78.5 62.1 70.3 28.8 93.6 .178 –0.4 (–1.0 to 0.2) 0.70 <.001 0.037 (0.022-0.052)
Validation set (JSTCT)           
Day 28 RR 68.9 68.3 76.6 72.5 37.9 92.0 Ref Ref 0.73 Ref Ref 
Day 28 OR 74.7 55.5 81.1 68.3 38.0 89.6 <.001 2.3 (1.3-3.4) 0.69 <.001 0.041 (0.017-0.067) 
Day 28 CR 54.3 75.9 60.9 68.4 28.9 92.3 .556 –0.3 (–1.3 to 0.7) 0.69 <.001 0.040 (0.018-0.061)
MAGIC           
Day 28 RR 70.0 62.9 77.9 70.4 36.5 90.9 Ref Ref 0.71 Ref Ref 
Day 28 OR 72.5 56.6 79.8 68.2 35.9 89.8 .009 1.2 (0.4-2.1) 0.68 .032 0.021 (0.003-0.045) 
Day 28 CR 59.3 67.5 65.6 66.6 28.7 90.6 .703 0.3 (–0.5 to 1.1) 0.67 <.001 0.040 (0.020-0.058) 
Second-line Tx           
JSTCT            
Day 28 RR 38.8 77.5 51.4 64.5 55.7 74.5 Ref Ref 0.64 Ref Ref 
Day 28 OR 66.7 48.7 78.5 63.6 63.7 66.0 <.001 8.4 (4.0-13.3) 0.63 .775 0.006 (–0.036 to 0.046) 
Day 28 CR 30.0 82.5 39.6 61.1 51.8 74.3 .863 0.2 (–2.9 to 3.5) 0.61 .005 0.035 (0.010-0.058) 
MAGIC            
Day 28 RR 32.1 90.0 44.3 67.2 54.7 86.8 Ref Ref 0.67 Ref Ref 
Day 28 OR 39.8 77.4 49.2 63.3 53.1 76.1 .004 10.5 (4.0-18.3) 0.64 .105 0.036 (–0.010 to 0.084) 
Day 28 CR 26.0 90.0 33.8 61.9 50.3 83.7 .001 3.1 (1.0-6.1) 0.62 <.001 0.051 (0.025-0.082) 

Balanced accuracy is the average of sensitivity and specificity, calculated as their sum divided by 2.

Ref, reference.
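The columns of Table 3 are linked by simple arithmetic, which the following sketch illustrates. It assumes that nonresponse is the "positive" test for 6-month NRM, that the response rate equals the proportion testing negative, and that outcomes are treated as plain binary labels. Because the published values are rounded and the study accounted for censoring, only approximate agreement should be expected. The function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, response_rate):
    """Recover prevalence, NPV, and PPV from proportions in [0, 1].

    P(responder) = prev * (1 - sens) + (1 - prev) * spec, solved for prev.
    """
    prevalence = (specificity - response_rate) / (sensitivity + specificity - 1)
    npv = (1 - prevalence) * specificity / response_rate
    ppv = prevalence * sensitivity / (1 - response_rate)
    return prevalence, npv, ppv


# Day 28 RR in the JSTCT training set (Table 3):
# sensitivity 71.8%, specificity 76.5%, response rate 68.7%.
prevalence, npv, ppv = predictive_values(0.718, 0.765, 0.687)
print(round(100 * prevalence, 1), round(100 * npv, 1), round(100 * ppv, 1))
```

The recovered NPV and PPV fall within about half a percentage point of the reported 93.2% and 37.3%, a gap that is plausibly explained by rounding of the inputs.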

Figure 3.

Comparisons of AUC, NPV, and time-dependent AUC of 3 response criteria for 6-month NRM. The AUC and the NPV of day 28 RR (red), day 28 OR (blue), and day 28 CR (orange) criteria for 6-month NRM were plotted in the JSTCT validation set for primary treatment (A), MAGIC validation set for primary treatment (B), JSTCT validation set for second-line treatment (C), and MAGIC validation set for second-line treatment (D). The 95% CIs are shown as error bars. Baseline rates of NPV are shown as horizontal dashed lines. Better treatment response criteria are closer to the top right corner. Time-dependent AUCs were compared between the treatment response criteria through the 2- to 12-month interval in the JSTCT validation set for primary treatment (E), MAGIC validation set for primary treatment (F), JSTCT validation set for second-line treatment (G), and MAGIC validation set for second-line treatment (H).

Day 28 CR is another clinical response end point used for GVHD treatment trials that has the advantage of a high NPV because of the strong correlation between CR and freedom from NRM. RR produced significantly higher AUCs than CR in both primary and second-line treatment settings across all cohorts (Table 3; Figure 3). An exploratory analysis found that the NPVs of RR were noninferior to CR among all cohorts: primary treatment (JSTCT [lower limit of 95% CI for ΔNPV, –1.3] and MAGIC [lower limit of 95% CI for ΔNPV, –0.5]); and second-line treatment (JSTCT [lower limit of 95% CI for ΔNPV, –2.9]; MAGIC [lower limit of 95% CI for ΔNPV, +1.0]).
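The noninferiority check reduces to comparing the lower 95% confidence limit of ΔNPV (RR minus CR) with the prespecified margin of –5 percentage points. The sketch below applies that comparison to the intervals reported in Table 3; the dictionary layout and function name are illustrative assumptions.

```python
MARGIN = -5.0  # noninferiority margin for ΔNPV, in percentage points

# 95% CI for ΔNPV (RR minus CR) in the validation cohorts, from Table 3.
DELTA_NPV_CI = {
    ("primary", "JSTCT"): (-1.3, 0.7),
    ("primary", "MAGIC"): (-0.5, 1.1),
    ("second-line", "JSTCT"): (-2.9, 3.5),
    ("second-line", "MAGIC"): (1.0, 6.1),
}


def is_noninferior(lower_limit, margin=MARGIN):
    """Noninferior if the whole interval lies above the margin."""
    return lower_limit > margin


results = {cohort: is_noninferior(ci[0]) for cohort, ci in DELTA_NPV_CI.items()}
for cohort, ok in results.items():
    print(cohort, "noninferior" if ok else "inconclusive")
```

In the MAGIC second-line cohort the lower limit is above zero, so the interval is consistent with a higher NPV for RR than for CR and not only with noninferiority.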

Recategorization of PR

We performed additional analyses to explore the reasons for the improved predictive performance of RR compared to OR. Most patients recategorized by day 28 RR (>90%) were those with PR. In all validation sets, patients with a PR by conventional criteria but NR by refined criteria experienced high NRM (Figure 4, orange lines). More than 80% of these refined nonresponders had persistent GI GVHD at day 28 (supplemental Table 8). Recategorizing persistent GI GVHD at day 28 as refined NR decreased the proportion of deaths from uncontrolled acute GVHD among responders in the JSTCT (primary treatment, 17.9% vs 25.4%; second-line treatment, 20.0% vs 40.7%; supplemental Table 9) and MAGIC cohorts (primary treatment, 27.8% vs 32.1%; second-line treatment, 12.5% vs 41.2%; supplemental Table 10).

Figure 4.

Recategorization of PR by day 28 RR. The cumulative incidence of NRM within 6 months stratified by conventional NR, PR, and CR. PR was further divided by day 28 RR. (A) JSTCT validation set for primary treatment. (B) MAGIC validation set for primary treatment. (C) JSTCT validation set for second-line treatment. (D) MAGIC validation set for second-line treatment. The pie chart represents the proportion of each risk group.

In contrast, patients with a PR by conventional criteria and a response by refined criteria had low NRM, similar to those with CR (Figure 4, green lines). These responders achieved complete resolution of GI/liver GVHD but had stage 1/2 skin GVHD at day 28 (supplemental Table 11), suggesting that a persistent but mild skin rash was not linked to poor long-term outcomes. In addition, a small number of patients achieved complete resolution of GI and/or liver GVHD but were categorized as NR by conventional criteria because their skin staging increased. These patients were also categorized as responders by refined criteria, and their NRM was low (supplemental Figure 3, green lines).

Ruxolitinib has become the standard second-line treatment for GVHD.13,44 Day 28 RR effectively stratified the risk of 6-month NRM (4.0% vs 47.5%) in this subset (n = 78; supplemental Figure 4).

UGI GVHD

The JSTCT data set did not record upper GI (UGI) GVHD staging separately from lower GI staging; thus, a few patients were likely categorized as refined nonresponders only due to the presence of UGI GVHD symptoms at day 28 (ie, grade 2 GVHD). In the MAGIC cohort, in which UGI GVHD was staged separately, the 6-month NRM for patients categorized as nonresponders for this reason alone (23/1141 [2.0%]) was 8.7% (95% CI, 1.4-24.7). The single patient categorized as a nonresponder after second-line therapy due to persistent UGI GVHD was alive at 2 years.

The US Food and Drug Administration accepts day 28 OR in acute GVHD clinical trials as evidence of clinical benefit, given its correlation with survival. However, PRs can be misleading because some patients with persistent disease qualify as a “treatment success” even though they die from acute GVHD. Furthermore, PRs may be more common in clinical trials, as suggested by the ∼25% rate in previous trials for primary and second-line treatments13,14,24,25 compared with the 12% PR rate observed in “real-world” data sets such as MAGIC, in which investigators may have a lower clinical threshold to initiate additional lines of treatment. This study highlights the advantage of a fundamental shift in perspective: clinical status at day 28 is more informative than the change from baseline. Applying this principle, we reclassified an important but heterogeneous group of PRs into response categories that more effectively discriminated disease control and survival outcomes.

For an RR, GVHD must be staged as grade 0 or 1 on day 28 without additional therapy, regardless of symptom severity at treatment start. This result emphasizes that the clinical status at the time of response assessment better reflects long-term patient outcomes than the degree of symptom improvement. Notably, persistent GI GVHD at day 28 after treatment strongly correlated with higher NRM, whereas persistent mild skin GVHD did not adversely affect long-term outcomes. Consequently, day 28 RR demonstrated significantly superior performance compared to day 28 OR and CR across multiple metrics. This data-driven categorization aligns with clinical intuition and reflects the concept of very good PR as previously suggested by an expert panel.33 These findings were consistent across 4 independent validation cohorts from JSTCT and MAGIC, regardless of treatment line (first or second) and despite differences in race, donor sources, comorbidities, GVHD prophylaxis, and second-line therapy choices.
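The rule described above can be sketched in a few lines. This is a minimal illustration of the stated definition (grade 0 or 1 at day 28 without additional therapy, irrespective of baseline severity), not the authors' implementation. The function name `refined_response` is hypothetical, and the handling of a missing day 28 grade (treated as nonresponse) is an assumption made here, not something stated in this section.

```python
from typing import Optional


def refined_response(day28_grade: Optional[int], additional_therapy: bool) -> str:
    """Classify a patient as 'RR' (refined responder) or 'NR' (nonresponder).

    day28_grade:        overall acute GVHD grade (0-4) at day 28, or None
                        if no assessment is available.
    additional_therapy: True if any additional line of GVHD treatment
                        was started before the day 28 assessment.

    Baseline severity is deliberately not an input: the rule depends only
    on clinical status at day 28.
    """
    if day28_grade is None:
        # ASSUMPTION: patients without a day 28 assessment count as NR.
        return "NR"
    if not 0 <= day28_grade <= 4:
        raise ValueError("GVHD grade must be between 0 and 4")
    if additional_therapy:
        return "NR"
    return "RR" if day28_grade <= 1 else "NR"


# Persistent mild skin GVHD (grade 1) without further therapy -> responder.
print(refined_response(1, additional_therapy=False))
# Residual grade 2 disease at day 28 -> nonresponder.
print(refined_response(2, additional_therapy=False))
```

Because the function takes no baseline argument, a patient who starts at grade 3 and one who starts at grade 2 are classified identically when both reach grade 1 at day 28 without further treatment.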

Although AUC is a standard metric for comparing prediction models, interpreting the clinical implications of small differences in AUC values can be challenging. Moreover, AUC has limited ability to detect modest differences, and the lack of statistical significance between AUCs does not imply equivalent performance.45,46 For example, although RR showed higher AUCs than OR after second-line treatment in this study, the differences were not statistically significant. In the MAGIC cohort, this was likely due to the smaller sample size. In the JSTCT cohort, patients with a conventional PR but persistent GI GVHD (ie, refined nonresponders) had an intermediate NRM (45.3%) that fell midway between CR (26.2%) and NR (63.7%). The resulting increase in sensitivity was offset by a decrease in specificity, and thus, the AUC for RR was nearly equivalent to that for OR. However, the high NRM in these refined nonresponders across all validation sets was clear evidence that the recategorization was appropriate in this circumstance, even though the improvement in AUC was not substantial. In contrast to AUC, NPV and positive predictive value (PPV) are valuable parameters because they offer direct clinical interpretability. NPV, in particular, provides a complementary perspective when comparing response criteria, especially in treatment trials in which durable acute GVHD control is difficult to achieve and in which it is highly desirable for responses to reflect long-term disease control and survival. Compared to CR, RR showed noninferior NPVs and, more importantly, consistently demonstrated significantly higher NPVs than OR across all cohorts. The largest differences in NPVs were observed in the second-line setting, in which a substantial number of patients who died from uncontrolled GVHD were reclassified from conventional PR to refined NR. These findings highlight the clinical relevance of day 28 RR, which more accurately captures long-term GVHD control than conventional criteria and addresses key limitations. In contrast, NRM among nonresponders (ie, the PPV) was high regardless of response criteria used and exceeded 50% in the second-line treatment cohorts, suggesting that additional treatment is already justified in these cases, and therefore, improvements in PPV may carry less clinical significance. Taken together, although AUC remains an important metric, our findings support the notion that NPV is a clinically meaningful measure for evaluating response criteria in the context of acute GVHD treatment.
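The interplay between sensitivity, specificity, AUC, and NPV described above can be made concrete with a small sketch. It treats nonresponse as the "positive" prediction of 6-month NRM, consistent with the text's definition of PPV as NRM among nonresponders, and it uses the fact that the AUC of a single binary criterion equals the mean of its sensitivity and specificity. All counts below are hypothetical and chosen only to illustrate the arithmetic; they are not study data, and the sketch ignores censoring and competing risks, which a real analysis would need to handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    nonresp_nrm: int     # nonresponders who had NRM (true positives)
    nonresp_no_nrm: int  # nonresponders without NRM (false positives)
    resp_nrm: int        # responders who had NRM (false negatives)
    resp_no_nrm: int     # responders without NRM (true negatives)


def metrics(t: TwoByTwo) -> dict[str, float]:
    """Compute sensitivity, specificity, PPV, NPV, and binary-criterion AUC."""
    sens = t.nonresp_nrm / (t.nonresp_nrm + t.resp_nrm)
    spec = t.resp_no_nrm / (t.resp_no_nrm + t.nonresp_no_nrm)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "ppv": t.nonresp_nrm / (t.nonresp_nrm + t.nonresp_no_nrm),
        "npv": t.resp_no_nrm / (t.resp_no_nrm + t.resp_nrm),
        # For one binary cutoff, the ROC curve has a single interior point,
        # so its area is (sensitivity + specificity) / 2.
        "auc": (sens + spec) / 2,
    }


# HYPOTHETICAL cohort of 200 patients with 60 NRM events.
conventional = TwoByTwo(nonresp_nrm=30, nonresp_no_nrm=20, resp_nrm=30, resp_no_nrm=120)
# Same cohort after moving 40 former responders (12 with NRM) to nonresponse.
refined = TwoByTwo(nonresp_nrm=42, nonresp_no_nrm=48, resp_nrm=18, resp_no_nrm=92)

m_conv = metrics(conventional)
m_ref = metrics(refined)

for name, m in (("conventional", m_conv), ("refined", m_ref)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

In this constructed example, the gain in sensitivity exactly offsets the loss in specificity, so the AUC is unchanged while the NPV rises, which mirrors the pattern the text describes for the second-line setting.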

This study has several limitations. First, we used GVHD grades instead of target organ stages to categorize patients when training the RR criteria to avoid unreliable results from small patient groups. Second, UGI staging was not available in the JSTCT data set used to develop day 28 RR. Indeed, MAGIC patients with grade 2 GVHD at day 28 due to UGI involvement alone demonstrated favorable long-term outcomes, suggesting that they are more appropriately classified in the refined responder category. Nonetheless, treatment-resistant UGI GVHD was very rare. Third, ruxolitinib had not yet been approved for acute GVHD in Japan,47 so we could confirm the performance of RR after second-line treatment with ruxolitinib only in the MAGIC data set. Fourth, although NRM is an important end point, more direct measures of long-term GVHD control such as GVHD flare incidence were not available. Given that GVHD flares are relatively common and linked to poor outcomes,12 future studies incorporating these data would be beneficial. Finally, GVHD biomarkers that can serve as response biomarkers and predict flares12 and NRM40,48,49 were not available for this analysis. An end point integrating both clinical symptoms and biomarkers may better predict long-term outcomes,50 although real-time biomarker assays would be necessary for widespread use.

In summary, we refined clinical treatment response criteria to improve the prediction of 6-month NRM, primarily by reclassifying mild skin symptoms at day 28 as response and residual lower GI involvement at day 28 as NR. The performance of the day 28 RR criteria was consistently confirmed across both primary and second-line treatment cohorts in the JSTCT and MAGIC data sets. This refinement also has important design implications if RR were to replace OR as the primary end point in clinical trials. Although the lower specificity of RR could reduce statistical power and necessitate larger sample sizes, its higher sensitivity may reduce type I error and thereby decrease the likelihood of false-positive trial results. This trade-off could be particularly advantageous when deciding whether new agents tested in small, single-arm trials should proceed to larger, randomized studies. Given its simplicity and external validation, the day 28 RR end point may facilitate more appropriate evaluation of future investigational therapies in both first- and second-line treatment settings.

The authors greatly appreciate the contributions of many physicians and data managers throughout the Japanese Data Center for Hematopoietic Cell Transplantation, the Japan Marrow Donor Program, the Japan Cord Blood Bank Network, and the data coordinating center at the Icahn School of Medicine at Mount Sinai who made this analysis possible. Research reported in this publication also used the Biostatistics Shared Resource Facility.

This work was supported by the Japanese government subsidy of “Act on Promotion of Appropriate Provision of Hematopoietic Stem Cells for Transplantation,” National Institutes of Health/National Cancer Institute grants P01 CA039542 and P30 CA196521, the Pediatric Cancer Foundation, and German Jose Carreras Leukemia Foundation grants DJCLS 01 GVHD 2016 and DJCLS 01 GVHD 2020. Y. Akahoshi is a recipient of the Japan Society for the Promotion of Science Postdoctoral Fellowship for Research Abroad.

Contribution: Y. Akahoshi designed the study, collected clinical data, conducted statistical analysis, wrote the manuscript, and organized the project; Y.I., N.S., H.N., J.K., and R.N. advised on study design and reviewed and revised the manuscript; M.A.D. advised on statistical methods; N.A., F.A., H.K.C., N.D., T.E., A.M.E., E.O.H., N.H., W.J.H., E.H., K.K., T.K., M.T., T. Tanaka, N.U., I.V., S.Y., and Y.-B.C. provided clinical data and reviewed and revised the manuscript; F.I., T.F., and Y. Atsuta collected patient data and reviewed and revised the manuscript; Y.K. interpreted data, advised on statistical methods, and reviewed and revised the manuscript; J.L.M.F., J.E.L., and T. Teshima interpreted data, advised on methods, and reviewed and revised the manuscript.

Conflict-of-interest disclosure: Y. Akahoshi reports honoraria from Novartis and AstraZeneca. Y.I. reports honoraria from Meiji Seika Pharma, Novartis, and Janssen Pharmaceutical K.K.; and research grants from Meiji Seika Pharma, Incyte, and Amgen. H.N. reports honoraria from Merck Sharp and Dohme (MSD), Otsuka Pharmaceutical, Pfizer, Novartis, Takeda Pharmaceutical, Janssen Pharmaceutical, Chugai Pharmaceutical, Sanofi, Meiji Seika Pharma, Asahi Kasei Pharma, and Nippon Shinyaku; and research funding from JCR Pharmaceuticals, Kyowa Kirin, Taiho Pharma, Santen Pharmaceutical, and Terumo. N.A. reports speakers’ bureau fees/honoraria from AbbVie, Nippon Shinyaku, Meiji Seika Pharma, Otsuka Pharmaceutical, Daiichi Sankyo, Novartis, Kyowa Kirin, Astellas, Asahi Kasei Pharma, BeiGene, and Novartis; and research funding from Novartis. M.A.D. reports consultancy fees from Comanche Biopharma and Bloomer Tech. F.A. reports honoraria and consultancy fees from Bristol Myers Squibb, Medac, Novartis, Miltenyi Biomedicine, Janssen Pharmaceutical, Kite, AbbVie, and Mallinckrodt/Therakos; and research funding from Mallinckrodt/Therakos. H.K.C. reports consultancy fees from Ironwood Pharmaceuticals, Actinium, AbbVie, REGiMMUNE, Sanofi, Orca Bio, and Incyte. A.M.E. has consulted for Incyte (advisory board). E.O.H. reports consultancy fees from Disc Medicine. K.K. reports honoraria from AbbVie, Pfizer, Nippon Shinyaku, Meiji Seika Pharma, Otsuka Pharmaceutical, Daiichi Sankyo, Celgene, Novartis, AstraZeneca, Ono Pharmaceutical, Kyowa Kirin, Sumitomo Dainippon Pharma, SymBio Pharmaceuticals, Sanofi, Alexion Pharmaceuticals, Bristol Myers Squibb, and Janssen Pharmaceutical; research funding from Shionogi, Otsuka Pharmaceutical, JCR Pharmaceuticals, Takeda Pharmaceutical, Japan Blood Products Organization, Mochida Pharmaceutical, Asahi Kasei Pharma, Chordia Therapeutics, Chugai Pharmaceutical, Teijin Pharma, Eisai, and Kyowa Kirin; and is a current equity holder in Asahi Genomics. M.T. reports honoraria from AbbVie, Kyowa Kirin, Daiichi Sankyo, Sumitomo Pharma, Astellas Pharma, Pfizer, Otsuka Pharmaceutical, MSD, Asahi Kasei Pharma, Chugai Pharmaceutical, Amgen, Janssen Pharmaceutical, Nippon Shinyaku, Novartis, and Meiji Seika Pharma; and research funding from Chugai Pharmaceutical. N.U. has received honoraria from CSL Behring, MSD, Astellas Pharma, AstraZeneca, AbbVie, Otsuka Pharmaceutical, Kyowa Kirin, SymBio Pharmaceuticals, Daiichi Sankyo, Takeda Pharmaceutical, and Novartis; research funding from Chugai Pharmaceutical, Fuji Pharma, Nippon Boehringer Ingelheim, JCR Pharmaceuticals, and Sumitomo Pharma; and consultancy fees from Takeda Pharmaceutical. S.Y. has received honoraria from Daiichi Sankyo, Novartis, Genmab, Janssen, Pfizer, Asahi Kasei Pharma, Meiji Seika Pharma, Takeda, Gilead, MSD, Bristol Myers Squibb, Sanofi, AbbVie, Chugai, AstraZeneca, and Ono Pharmaceutical. Y.-B.C. reports consultancy fees from Ironwood Pharmaceuticals, Vor Bio, Garuda, Editas, Alexion, and Incyte. J.K. reports honoraria from Janssen Pharmaceutical, Astellas Pharma, CSL Behring, MSD, Sumitomo Dainippon Pharma, Takeda Pharmaceutical, Chugai Pharmaceutical, Amgen, Otsuka Pharmaceutical, Bristol Myers Squibb, Ono Pharmaceutical, Asahi Kasei Pharma, Sanofi, CareNet, Inc, Kyowa Kirin, Nippon Shinyaku, Nippon Kayaku, Novartis, Daiichi Sankyo, and AbbVie; consultancy fees from Janssen Pharmaceutical, Astellas Pharma, Novartis, Daiichi Sankyo, Megakaryon, SymBio Pharmaceuticals, and AbbVie; and research funding from Eisai. R.N. reports research funding from Mitarisan, Helocyte, and MaaT Pharma; and consultancy fees from Omeros, bluebird bio, Sanofi, Ono Pharmaceutical, and Pfizer. Y. Atsuta reports speakers’ bureau fees/honoraria from Otsuka Pharmaceutical, Chugai Pharmaceutical, Novartis Pharma K.K., Meiji Seika Pharma, and Janssen Pharmaceutical K.K.; and consultancy fees from JCR Pharmaceuticals. J.E.L. reports research support from Equillium, Incyte, MaaT Pharma, and Mesoblast; and consulting fees from Sanofi, bluebird bio, Inhibrx, X4 Pharmaceuticals, Editas, Equillium, Kamada, and Mesoblast. J.L.M.F. reports consulting fees from Editas, Equillium, Kamada, Mesoblast, Alexion, Realta, Medpace, Viracor, AlloVir, and Physicians’ Education Resource; and research support from Equillium, Incyte, MaaT Pharma, and Mesoblast. Y.K. reports honoraria from Asahi Kasei, MSD, Novartis, Pfizer, Sanofi, Chugai, Astellas, and Kyowa Kirin; and research funding from Chugai, Kyowa Kirin, Asahi Kasei, and Otsuka. T.T. reports honoraria from Nippon Shinyaku, Daiichi Sankyo, Novartis, Otsuka, Genmab, Janssen, Pfizer, Kyowa Kirin, Asahi Kasei Pharma, Meiji Seika Pharma, Takeda, Gilead, SymBio Pharmaceuticals, MSD, Bristol Myers Squibb, Sanofi, Nippon Kayaku, AbbVie, Chugai, AstraZeneca, and Astellas; consultancy fees from Kyowa Kirin, Meiji Seika Pharma, Takeda, Roche Diagnostics, and Nippon Shinyaku; and research funding from Daiichi Sankyo, Otsuka, Kyowa Kirin, Asahi Kasei Pharma, LUCA Science, PharmaEssentia Japan, Sumitomo Pharma, JCR Pharmaceuticals, Chugai, and Astellas. The remaining authors declare no competing financial interests.

Correspondence: Yu Akahoshi, The Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Pl, New York, NY 10029; email: akahoshiu@gmail.com.

1. Akahoshi Y, Spyrou N, Hogan WJ, et al. Incidence, clinical presentation, risk factors, outcomes, and biomarkers in de novo late acute GVHD. Blood Adv. 2023;7(16):4479-4491.
2. Penack O, Marchetti M, Aljurf M, et al. Prophylaxis and management of graft-versus-host disease after stem-cell transplantation for haematological malignancies: updated consensus recommendations of the European Society for Blood and Marrow Transplantation. Lancet Haematol. 2024;11(2):e147-e159.
3. Jamy O, Zeiser R, Chen YB. Novel developments in the prophylaxis and treatment of acute GVHD. Blood. 2023;142(12):1037-1046.
4. Bolaños-Meade J, Hamadani M, Wu J, et al. Post-transplantation cyclophosphamide-based graft-versus-host disease prophylaxis. N Engl J Med. 2023;388(25):2338-2348.
5. Watkins B, Qayed M, McCracken C, et al. Phase II trial of costimulation blockade with abatacept for prevention of acute GVHD. J Clin Oncol. 2021;39(17):1865-1877.
6. Mohty M, Holler E, Jagasia M, et al. Refractory acute graft-versus-host disease: a new working definition beyond corticosteroid refractoriness. Blood. 2020;136(17):1903-1906.
7. Martin PJ. How I treat steroid-refractory acute graft-versus-host disease. Blood. 2020;135(19):1630-1638.
8. El-Jawahri A, Li S, Antin JH, et al. Improved treatment-related mortality and overall survival of patients with grade IV acute GVHD in the modern years. Biol Blood Marrow Transpl. 2016;22(5):910-918.
9. Akahoshi Y, Igarashi A, Fukuda T, et al. Impact of graft-versus-host disease and graft-versus-leukemia effect based on minimal residual disease in Philadelphia chromosome-positive acute lymphoblastic leukemia. Br J Haematol. 2020;190(1):84-92.
10. Greinix HT, Eikema DJ, Koster L, et al. Improved outcome of patients with graft-versus-host disease after allogeneic hematopoietic cell transplantation for hematologic malignancies over time: an EBMT mega-file study. Haematologica. 2022;107(5):1054-1063.
11. Khoury HJ, Wang T, Hemmer MT, et al. Improved survival after acute graft-versus-host disease diagnosis in the modern era. Haematologica. 2017;102(5):958-966.
12. Akahoshi Y, Spyrou N, Hoepting M, et al. Flares of acute graft-versus-host disease: a Mount Sinai Acute GVHD International Consortium analysis. Blood Adv. 2024;8(8):2047-2057.
13. Zeiser R, von Bubnoff N, Butler J, et al. Ruxolitinib for glucocorticoid-refractory acute graft-versus-host disease. N Engl J Med. 2020;382(19):1800-1810.
14. Zeiser R, Socié G, Schroeder MA, et al. Efficacy and safety of itacitinib versus placebo in combination with corticosteroids for initial treatment of acute graft-versus-host disease (GRAVITAS-301): a randomised, multicentre, double-blind, phase 3 trial. Lancet Haematol. 2022;9(1):e14-e25.
15. Pidala J, Hamadani M, Dawson P, et al. Randomized multicenter trial of sirolimus vs prednisone as initial therapy for standard-risk acute GVHD: the BMT CTN 1501 trial. Blood. 2020;135(2):97-107.
16. Etra A, Capellini A, Alousi A, et al. Effective treatment of low-risk acute GVHD with itacitinib monotherapy. Blood. 2023;141(5):481-489.
17. Ponce DM, Alousi AM, Nakamura R, et al. A phase 2 study of interleukin-22 and systemic corticosteroids as initial treatment for acute GVHD of the lower GI tract. Blood. 2023;141(12):1389-1401.
18. Al Malki MM, London K, Baez J, et al. Phase 2 study of natalizumab plus standard corticosteroid treatment for high-risk acute graft-versus-host disease. Blood Adv. 2023;7(17):5189-5198.
19. Couriel DR, Saliba R, de Lima M, et al. A phase III study of infliximab and corticosteroids for the initial treatment of acute graft-versus-host disease. Biol Blood Marrow Transpl. 2009;15(12):1555-1562.
20. Kekre N, Kim HT, Hofer J, et al. Phase II trial of natalizumab with corticosteroids as initial treatment of gastrointestinal acute graft-versus-host disease. Bone Marrow Transpl. 2021;56(5):1006-1012.
21. Jagasia M, Perales MA, Schroeder MA, et al. Ruxolitinib for the treatment of steroid-refractory acute GVHD (REACH1): a multicenter, open-label phase 2 trial. Blood. 2020;135(20):1739-1749.
22. Schroeder MA, Khoury HJ, Jagasia M, et al. A phase 1 trial of itacitinib, a selective JAK1 inhibitor, in patients with acute graft-versus-host disease. Blood Adv. 2020;4(8):1656-1669.
23. Kebriaei P, Hayes J, Daly A, et al. A phase 3 randomized study of remestemcel-L versus placebo added to second-line therapy in patients with steroid-refractory acute graft-versus-host disease. Biol Blood Marrow Transpl. 2020;26(5):835-844.
24. Magenau JM, Goldstein SC, Peltier D, et al. α1-antitrypsin infusion for treatment of steroid-resistant acute graft-versus-host disease. Blood. 2018;131(12):1372-1379.
25. Zhao K, Lin R, Fan Z, et al. Mesenchymal stromal cells plus basiliximab, calcineurin inhibitor as treatment of steroid-resistant acute graft-versus-host disease: a multicenter, randomized, phase 3, open-label trial. J Hematol Oncol. 2022;15(1):22.
26. MacMillan ML, DeFor TE, Weisdorf DJ. The best endpoint for acute GVHD treatment trials. Blood. 2010;115(26):5412-5417.
27. Levine JE, Logan B, Wu J, et al. Graft-versus-host disease treatment: predictors of survival. Biol Blood Marrow Transpl. 2010;16(12):1693-1699.
28. Saliba RM, Couriel DR, Giralt S, et al. Prognostic value of response after upfront therapy for acute GVHD. Bone Marrow Transpl. 2012;47(1):125-131.
29. Inamoto Y, Martin PJ, Storer BE, Mielcarek M, Storb RF, Carpenter PA. Response endpoints and failure-free survival after initial treatment for acute graft-versus-host disease. Haematologica. 2014;99(2):385-391.
30. Martin PJ, Rizzo JD, Wingard JR, et al. First- and second-line systemic treatment of acute graft-versus-host disease: recommendations of the American Society of Blood and Marrow Transplantation. Biol Blood Marrow Transpl. 2012;18(8):1150-1163.
31. MacMillan ML, DeFor TE, Holtan SG, Rashidi A, Blazar BR, Weisdorf DJ. Validation of Minnesota acute graft-versus-host disease risk score. Haematologica. 2020;105(2):519-524.
32. DeFilipp Z, Kim HT, Spyrou N, et al. The MAGIC algorithm probability predicts treatment response and long-term outcomes to second-line therapy for acute GVHD. Blood Adv. 2024;8(13):3488-3496.
33. Martin PJ, Bachier CR, Klingemann HG, et al. Endpoints for clinical trials testing treatment of acute graft-versus-host disease: a joint statement. Biol Blood Marrow Transpl. 2009;15(7):777-784.
34. Atsuta Y, Suzuki R, Yoshimi A, et al. Unification of hematopoietic stem cell transplantation registries in Japan and establishment of the TRUMP System. Int J Hematol. 2007;86(3):269-274.
35. Atsuta Y. Introduction of Transplant Registry Unified Management Program 2 (TRUMP2): scripts for TRUMP data analyses, part I (variables other than HLA-related data). Int J Hematol. 2016;103(1):3-10.
36. Levine JE, Hogan WJ, Harris AC, et al. Improved accuracy of acute graft-versus-host disease staging among multiple centers. Best Pract Res Clin Haematol. 2014;27(3-4):283-287.
37. Harris AC, Young R, Devine S, et al. International, multicenter standardization of acute graft-versus-host disease clinical data collection: a report from the Mount Sinai Acute GVHD International Consortium. Biol Blood Marrow Transpl. 2016;22(1):4-10.
38. Mielcarek M, Storer BE, Boeckh M, et al. Initial therapy of acute graft-versus-host disease with low-dose prednisone does not compromise patient outcomes. Blood. 2009;113(13):2888-2894.
39. Park C, Park SY, Kim HJ, Shin HJ. Statistical methods for comparing predictive values in medical diagnosis. Korean J Radiol. 2024;25(7):656-661.
40. Spyrou N, Akahoshi Y, Ayuk F, et al. The utility of biomarkers in acute GVHD prognostication. Blood Adv. 2023;7(17):5152-5155.
41. Blanche P, Dartigues JF, Jacqmin-Gadda H. Estimating and comparing time-dependent areas under receiver operating characteristic curves for censored event times with competing risks. Stat Med. 2013;32(30):5381-5397.
42. Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and Regression Trees. 1st ed. Taylor & Francis; 1984.
43. Kanda Y. Investigation of the freely available easy-to-use software 'EZR' for medical statistics. Bone Marrow Transpl. 2013;48(3):452-458.
44. Zeiser R, Socié G. The development of ruxolitinib for glucocorticoid-refractory acute graft-versus-host disease. Blood Adv. 2020;4(15):3789-3794.
45. Biswas S, Arun B, Parmigiani G. Reclassification of predictions for uncovering subgroup specific improvement. Stat Med. 2014;33(11):1914-1927.
46. Spitz MR, Amos CI, D'Amelio A, Dong Q, Etzel C. Re: discriminatory accuracy from single-nucleotide polymorphisms in models to predict breast cancer risk. J Natl Cancer Inst. 2009;101(24):1731-1732.
47. Teshima T, Onishi Y, Kato K, et al. Ruxolitinib in steroid-refractory acute graft-vs-host disease: Japanese subgroup analysis of the randomized REACH2 trial. Int J Hematol. 2024;120(1):106-116.
48. Srinagesh HK, Özbek U, Kapoor U, et al. The MAGIC algorithm probability is a validated response biomarker of treatment of acute graft-versus-host disease. Blood Adv. 2019;3(23):4034-4042.
49. Akahoshi Y, Spyrou N, Weber D, et al. Novel MAGIC composite scores using both clinical symptoms and biomarkers best predict treatment outcomes of acute GVHD. Blood. 2024;144(9):1010-1021.
50. Spyrou N, Akahoshi Y, Kowalyk S, et al. A day 14 endpoint for acute GVHD clinical trials. Transpl Cell Ther. 2024;30(4):421-432.

Author notes

J.E.L. and T. Teshima contributed equally to this study.

The data of this study are not publicly available due to restrictions related to the recipient/donor’s consent for research use.

The full-text version of this article contains a data supplement.

Supplemental data