Key Points
A Manhattan model based solely on clinical symptoms identifies a low-risk group needed to test strategies that minimize treatment.
MAGIC composite scores using both clinical and biomarker parameters further enlarge the low-risk group and most accurately predict outcomes.
Visual Abstract
Acute graft-versus-host disease (GVHD) grading systems that use only clinical symptoms at treatment initiation, such as the Minnesota risk system, identify standard- and high-risk categories but lack a low-risk category suitable for testing strategies that minimize immunosuppression. We developed a new grading system that includes a low-risk stratum based on clinical symptoms alone and determined whether the incorporation of biomarkers would improve the model’s prognostic accuracy. We randomly divided 1863 patients in the Mount Sinai Acute GVHD International Consortium (MAGIC) who were treated for GVHD into training and validation cohorts. Patients in the training cohort were divided into 14 groups based on similarity of clinical symptoms and similar nonrelapse mortality (NRM); we used a classification and regression tree (CART) algorithm to create 3 Manhattan risk groups that produced a significantly higher area under the receiver operating characteristic curve (AUC) for 6-month NRM than the Minnesota risk classification (0.69 vs 0.64, P = .009) in the validation cohort. We integrated serum GVHD biomarker scores with Manhattan risk using patients with available serum samples and again used a CART algorithm to establish 3 MAGIC composite scores that significantly improved prediction of NRM compared with Manhattan risk (AUC, 0.76 vs 0.70, P = .010). Each increase in MAGIC composite score also corresponded to a significant decrease in day 28 treatment response (80% vs 63% vs 30%, P < .001). We conclude that the MAGIC composite score more accurately predicts response to therapy and long-term outcomes than systems based on clinical symptoms alone and may help guide clinical decisions and trial design.
Introduction
Acute graft-versus-host disease (GVHD) remains a substantial cause of morbidity and nonrelapse mortality (NRM) and a major obstacle to successful outcomes after allogeneic hematopoietic cell transplantation (HCT) despite advances in prophylaxis.1-7 High doses of systemic steroids are used as first-line treatment for acute GVHD,8-10 but ∼30% of patients develop steroid-refractory GVHD and experience poor outcomes.1,11-15 The long-term outcomes of patients who initially respond to steroid therapy can vary and be complicated by GVHD flares.16 Steroid treatment courses therefore tend to be long, resulting in significant morbidities including increased infection risk.17-20 A uniform approach to GVHD treatment may consequently lead to both undertreatment of some patients and overtreatment of others.
The maximum severity of acute GVHD correlates with survival outcomes,21-24 but can only be determined retrospectively, and therefore cannot be used to guide treatment in real time. The Minnesota risk system, the only validated risk stratification that was modeled on GVHD symptoms at the initiation of treatment, possesses 2 strata (standard and high),25,26 but lacks the low-risk stratum necessary for treatment minimization. Several groups, including ours, have reported that GVHD biomarkers predict outcomes independently of clinical parameters.2,16,27-35 The Mount Sinai Acute GVHD International Consortium (MAGIC) has validated the MAGIC algorithm probability (MAP), a single value incorporating weighted serum concentrations of the following 2 biomarkers: suppression of tumorigenicity 2 (ST2) and regenerating islet–derived protein 3-α (REG3α). The MAP can be considered a liquid biopsy of GVHD damage to intestinal crypts36 and accurately predicts long-term outcomes before, during, and after therapy for acute GVHD.2,16,27,28,30,31,37 No studies have validated a model integrating both clinical and laboratory parameters at treatment onset. We hypothesized that the combination of clinical and biomarker values could create 3 separate acute GVHD grades with distinct prognoses. We used the MAGIC database and biorepository to develop and validate a grading system with 3 strata solely based on clinical symptoms and then developed new MAGIC composite scores that integrate both clinical and biomarker parameters with improved prognostic accuracy.
Methods
Patient selection
We obtained clinical data and serum samples from the MAGIC database and biorepository, which encompasses 23 HCT centers in North America, Europe, and Asia. Participating centers collected clinical information that focused on acute GVHD using a prospective-specimen-collection, retrospective-blinded-evaluation (PRoBE) study design,38,39 and provided longitudinal serum samples. Patients were prospectively monitored weekly for acute GVHD symptoms according to institutional frequency. Informed consent was obtained from all participants in accordance with the Declaration of Helsinki under an institutional review board–approved protocol.
We included both pediatric and adult patients who received a first HCT between 2014 and 2021 and who received systemic treatment for acute GVHD. We excluded patients who developed primary relapse of malignancy or who received donor lymphocyte infusion or second HCT before systemic GVHD treatment. Acute GVHD was diagnosed and staged according to the published criteria.39 Minnesota risk, HCT-specific comorbidity index scores, intensity of conditioning regimens, and disease risk were classified as previously reported.2,25,40,41 A complete response (CR) was defined as complete resolution of acute GVHD manifestations without secondary treatment. A partial response (PR) was defined as a decrease in at least 1 organ stage without worsening of other organs and without the need for secondary treatment, provided that the improvement was less than a CR.11 Overall response rate (ORR) was defined by CR or PR at day 28 after systemic treatment.
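The response definitions above can be expressed as a short classifier. This is a minimal sketch, not the study's code: the organ keys, the function names, and the "NR" (no response) label are assumptions introduced for illustration, and deaths before day 28 are not handled.

```python
# Hypothetical organ keys; stages are integers (0 = no involvement).
ORGANS = ("skin", "liver", "ugi", "lgi")


def day28_response(baseline, day28, secondary_treatment):
    """Classify day 28 response from organ stages at baseline and day 28."""
    if secondary_treatment:
        # Both CR and PR require the absence of secondary treatment.
        return "NR"
    if all(day28[o] == 0 for o in ORGANS):
        return "CR"  # complete resolution of all manifestations
    improved = any(day28[o] < baseline[o] for o in ORGANS)
    worsened = any(day28[o] > baseline[o] for o in ORGANS)
    # PR: at least 1 organ improved, none worsened, and not a CR.
    return "PR" if improved and not worsened else "NR"


def overall_response_rate(responses):
    """ORR is the fraction of patients with CR or PR at day 28."""
    return sum(r in ("CR", "PR") for r in responses) / len(responses)
```

For example, a patient whose skin stage falls from 3 to 1 with all other organs unchanged and no secondary treatment would be classified as PR under these assumptions.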
Serum samples
Serial serum samples were collected prospectively, cryopreserved, and shipped to a central laboratory. Serum concentrations of ST2 and REG3α were analyzed by enzyme-linked immunosorbent assays, as previously reported.42,43 The MAP was calculated as a single value between 0.001 and 0.999 according to the formula: log[−log(1 − MAP)] = −11.263 + 1.844(log10 ST2) + 0.577(log10 REG3α).28,30 We calculated Ann Arbor (AA) scores using previously validated MAP thresholds (AA1, MAP < 0.141; AA2, 0.141 ≤ MAP < 0.291; AA3, MAP ≥ 0.291).2,16,28,44
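The MAP formula and AA thresholds can be sketched as follows. Several points are assumptions rather than statements from the text: the outer logarithm is taken as the natural log (the usual complementary log-log form), the biomarker concentrations are assumed to be in the units used when the coefficients were derived, and clamping to the range 0.001 to 0.999 is one plausible reading of "a single value between 0.001 and 0.999". The function names are hypothetical.

```python
import math


def map_value(st2, reg3a):
    """MAGIC algorithm probability from serum ST2 and REG3α concentrations."""
    linear = -11.263 + 1.844 * math.log10(st2) + 0.577 * math.log10(reg3a)
    # Invert log[-log(1 - MAP)] = linear (natural log assumed).
    p = 1.0 - math.exp(-math.exp(linear))
    # Assumption: values are bounded to the reported range.
    return min(max(p, 0.001), 0.999)


def ann_arbor_score(map_val):
    """Ann Arbor score from the MAP thresholds quoted in the text."""
    if map_val < 0.141:
        return 1
    if map_val < 0.291:
        return 2
    return 3
```

Because both coefficients are positive, the MAP rises monotonically with either biomarker, so a higher ST2 or REG3α concentration can only keep or raise the AA score.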
Statistical analysis
The beginning of systemic treatment served as the starting point in all analyses. The primary end point was 6-month NRM, and outcomes were censored at 6 months. We estimated and plotted the cumulative incidence of NRM according to the Gray method, and we considered relapse and second allogeneic HCT as competing risks. We used the Kaplan-Meier method and the log-rank test to estimate and compare overall survival (OS) probabilities. We compared categorical variables using the Fisher exact test, and continuous variables using the Mann-Whitney U test. We used the area under the curve (AUC) of receiver operating characteristic analysis and the DeLong test to compare the prognostic value of the different models. The ΔAUC and its corresponding 95% confidence intervals (CIs) were calculated using 1000 bootstrap resamples.32,45
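The bootstrap comparison of two models can be illustrated with a small, dependency-free sketch. It treats 6-month NRM as a simple binary label and therefore ignores censoring and competing risks; the percentile interval, the seed, and the function names are assumptions for illustration, and the DeLong test itself is not implemented here.

```python
import random


def auc(scores, labels):
    """AUC as the probability that an event case outscores a non-event case."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half, as for ordinal risk strata
    return wins / (len(pos) * len(neg))


def bootstrap_delta_auc(model_a, model_b, labels, n_boot=1000, seed=0):
    """Point estimate and 95% percentile CI for AUC(model_a) - AUC(model_b)."""
    rng = random.Random(seed)
    n = len(labels)
    deltas = []
    while len(deltas) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients
        y = [labels[i] for i in idx]
        if sum(y) in (0, n):
            continue  # skip resamples that contain only one class
        a = auc([model_a[i] for i in idx], y)
        b = auc([model_b[i] for i in idx], y)
        deltas.append(a - b)
    deltas.sort()
    lower = deltas[int(0.025 * n_boot)]
    upper = deltas[int(0.975 * n_boot) - 1]
    return auc(model_a, labels) - auc(model_b, labels), (lower, upper)
```

Resampling patients rather than scores keeps the two models paired within each resample, which is what makes the interval for the difference narrower than the two separate AUC intervals would suggest.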
We developed algorithms to predict 6-month NRM as follows. First, we randomly divided patients into training and validation cohorts in a 7:3 ratio, aiming to maximize both the size of the validation cohort and the representation of patients with uncommon clinical presentations in the training cohort.46 Second, we created groups with a minimum of 20 patients per group according to clinical similarities at the time of treatment and in 6-month NRM. Third, we used a classification and regression tree (CART) algorithm47 to create 3 groups according to the risk of 6-month NRM after treatment onset. The criteria to separate groups included a maximum depth of 2 levels with a complexity parameter of 0.2 and at least 30 observations in each terminal node.48 We also applied a K-means approach with the Lloyd algorithm as a sensitivity analysis for the accuracy of aggregation.49 The performance of each model was then evaluated in the validation cohort.
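The idea of collapsing clinical categories into 3 risk strata can be illustrated with a simplified stand-in for CART. The sketch below is not the tree-fitting procedure described above: it orders categories by observed event rate and exhaustively searches for the 2 cut points that minimize the total weighted Gini impurity, subject to a minimum stratum size. The data layout, the function names, and the toy numbers in the usage note are hypothetical.

```python
from itertools import combinations


def gini(events, n):
    """Gini impurity of a binary outcome with `events` events among `n`."""
    p = events / n
    return 2.0 * p * (1.0 - p)


def three_strata(categories, min_leaf=30):
    """Split categories, given as (name, n, events), into 3 ordered strata."""
    ordered = sorted(categories, key=lambda c: c[2] / c[1])
    best = None
    for i, j in combinations(range(1, len(ordered)), 2):
        groups = [ordered[:i], ordered[i:j], ordered[j:]]
        sizes = [sum(c[1] for c in g) for g in groups]
        if min(sizes) < min_leaf:
            continue  # enforce a minimum number of patients per stratum
        impurity = sum(
            size * gini(sum(c[2] for c in g), size)
            for g, size in zip(groups, sizes)
        )
        if best is None or impurity < best[0]:
            best = (impurity, groups)
    if best is None:
        raise ValueError("no split satisfies min_leaf")
    return [[c[0] for c in g] for g in best[1]]
```

With six hypothetical categories of 100 patients each and 5, 6, 15, 16, 40, and 45 events, the search returns the three adjacent pairs as low, intermediate, and high strata. A fitted tree would additionally apply a complexity penalty, which this sketch omits.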
All statistical tests were 2-sided, and a P value <.05 was considered statistically significant. Statistical analyses were performed with R version 4.2.2 (R Foundation for Statistical Computing, Vienna, Austria) or EZR version 1.61 (Jichi Medical University Saitama Medical Center, Saitama, Japan).50
Results
Patient characteristics
We randomly divided 1863 patients who fulfilled all the inclusion criteria into a training (n = 1306) and validation cohort (n = 557) (supplemental Figure 1, available on the Blood website). There were no significant differences in baseline characteristics between cohorts, except for donor source (Table 1). Severity of GVHD and organ involvement at the time of treatment were also similar between the training and validation cohorts (supplemental Table 1). Treatment practices varied in our real-world study, and notably, >20% of patients treated for grades 1 or 2 acute GVHD received low-dose systemic steroids (<0.5 mg/kg methylprednisolone) (supplemental Table 2). The median follow-up of survivors after treatment initiation was 22 months (range, 1-58) and 23 months (range, 1-37) in the training and validation cohorts, respectively.
Table 1. Patient characteristics
Characteristic | Training (n = 1306) | Validation (n = 557) | P value |
---|---|---|---|
Median age at HCT, y (range) | 56 (0, 79) | 54 (0, 79) | .206 |
Recipient age, category | |||
<18 | 146 (11.2) | 83 (14.9) | .077 |
18-54 | 474 (36.3) | 199 (35.7) | |
≥55 | 686 (52.5) | 275 (49.4) | |
Sex mismatch | |||
Female to male | 219 (16.8) | 103 (18.6) | .349 |
Other | 1087 (83.2) | 452 (81.4) | |
Race | |||
White | 1108 (84.8) | 462 (82.9) | .362 |
Black | 70 (5.4) | 25 (4.5) | |
Asian | 46 (3.5) | 25 (4.5) | |
Others | 6 (0.5) | 5 (0.9) | |
Unknown | 76 (5.8) | 40 (7.2) | |
Primary disease | |||
Acute leukemia | 677 (51.8) | 299 (53.7) | .349 |
MDS/MPN | 341 (26.1) | 147 (26.4) | |
Malignant lymphoma | 117 (9.0) | 36 (6.5) | |
Other | 171 (13.1) | 75 (13.5) | |
Disease risk | |||
Standard | 1059 (81.1) | 430 (77.2) | .058 |
High | 247 (18.9) | 127 (22.8) | |
Donor type | |||
HLA matched related | 267 (20.4) | 111 (19.9) | .058 |
HLA matched unrelated | 714 (54.7) | 277 (49.7) | |
HLA mismatched related | 8 (0.6) | 4 (0.7) | |
HLA mismatched unrelated | 122 (9.3) | 65 (11.7) | |
Haploidentical | 148 (11.3) | 65 (11.7) | |
Umbilical cord blood | 47 (3.6) | 35 (6.3) | |
GVHD prophylaxis | |||
CNI and MTX based | 691 (52.9) | 292 (52.4) | .744 |
CNI and MMF based | 305 (23.4) | 143 (25.7) | |
PTCy | 219 (16.8) | 82 (14.7) | |
Ex vivo T-cell depletion | 38 (2.9) | 17 (3.1) | |
Other | 53 (4.1) | 23 (4.1) | |
HCT-CI | |||
0-2 | 884 (67.7) | 372 (66.8) | .706 |
≥3 | 422 (32.3) | 185 (33.2) | |
In vivo T-cell depletion | |||
No | 809 (61.9) | 341 (61.2) | .795 |
Yes | 497 (38.1) | 216 (38.8) | |
Donor source | |||
Bone marrow | 252 (19.3) | 117 (21.0) | .020 |
Peripheral blood | 1007 (77.1) | 405 (72.7) | |
Umbilical cord blood | 47 (3.6) | 35 (6.3) | |
Conditioning | |||
MAC (TBI <8 Gy) | 532 (40.7) | 226 (40.6) | .923 |
MAC (TBI ≥8 Gy) | 204 (15.6) | 91 (16.3) | |
RIC | 570 (43.6) | 240 (43.1) | |
Sample available at Tx | |||
No | 256 (19.6) | 125 (22.4) | .168 |
Yes | 1050 (80.4) | 432 (77.6) | |
Median y of HCT (range) | 2018 (2014-2021) | 2017 (2014-2021) | .117 |
CNI, calcineurin inhibitor; HCT-CI, hematopoietic cell transplantation–specific comorbidity index; MAC, myeloablative conditioning; MDS/MPN, myelodysplastic syndromes/myeloproliferative neoplasms; MMF, mycophenolate mofetil; MTX, methotrexate; PTCy, posttransplant cyclophosphamide; RIC, reduced-intensity conditioning; TBI, total body irradiation; Tx, treatment.
Manhattan risk system
We first categorized all 76 distinct combinations of GVHD target organ severity that possessed at least 1 case in the training cohort (supplemental Table 3) into 24 groups based on similarities of individual organ severity at the time of treatment (Table 2). We then combined groups with both similar clinical characteristics and 6-month NRM to create 14 categories with at least 20 patients in each category (Table 2). Using the CART algorithm, we further reduced the number of categories to 3 (low, intermediate, and high risk), which we termed the Manhattan risk model. A sensitivity analysis using an unsupervised K-means clustering algorithm confirmed the accuracy of aggregation (Table 2).
Table 2. GVHD organ involvement categories in the training cohort
Organ involvement | First 24 categories, n | First 24 categories, 6-mo NRM (%) | Collapsed 14 categories, n | Collapsed 14 categories, 6-mo NRM (%) | Day 28 ORR (%) | Glucksberg | Minnesota risk | Manhattan risk (CART) | Manhattan risk (K-means) |
---|---|---|---|---|---|---|---|---|---|
Isolated stage I skin | 135 | 5.3 | 135 | 5.3 | 63.0 | Grade 1 | Standard | Low | Low |
Isolated stage II skin | 259 | 7.8 | 259 | 7.8 | 76.1 | Grade 1 | Standard | Low | Low |
Isolated UGI | 112 | 8.1 | 112 | 8.1 | 73.2 | Grade 2 | Standard | Low | Low |
Stage I skin + UGI | 35 | 5.7 | 35 | 5.7 | 77.1 | Grade 2 | Standard | Low | Low |
Stage II skin + UGI | 33 | 21.2 | 33 | 21.2 | 81.8 | Grade 2 | Standard | Intermediate | Intermediate |
Stage I LGI ± UGI | 151 | 14.0 | 151 | 14.0 | 66.2 | Grade 2 | Standard | Intermediate | Intermediate |
Stage I skin + stage I LGI ± UGI | 47 | 14.9 | 83 | 14.6 | 71.1 | Grade 2 | Standard | Intermediate | Intermediate |
Stage II skin + stage I LGI ± UGI | 20 | 14.8 | | | | Grade 2 | | | |
Stage III skin + stage I LGI ± UGI | 16 | 12.5 | | | | Grade 2 | | | |
Stage III skin ± UGI | 222 | 11.8 | 222 | 11.8 | 72.5 | Grade 2 | Standard | Intermediate | Intermediate |
Stage II LGI ± UGI | 63 | 15.9 | 63 | 15.9 | 58.7 | Grade 3 | Standard | Intermediate | Intermediate |
Stage I liver ± other organ involvement | 20 | 35.5 | 59 | 46.6 | 40.7 | Grade 2-4 | Standard/High | High | High |
Stage II liver ± other organ involvement | 23 | 37.8 | | | | Grade 3-4 | Standard/High | High | High |
Stage III liver ± other organ involvement | 12 | 73.1 | | | | Grade 3-4 | Standard/High | High | High |
Stage IV liver ± other organ involvement | 4 | 75.0 | | | | Grade 4 | Standard/High | High | High |
Stage I skin + stage II LGI ± UGI | 15 | 26.7 | 34 | 29.4 | 64.7 | Grade 3 | High | High | High |
Stage II skin + stage II LGI ± UGI | 10 | 40.0 | | | | Grade 3 | | | |
Stage III skin + stage II LGI ± UGI | 9 | 22.2 | | | | Grade 3 | | | |
Stage III LGI ± UGI | 53 | 34.3 | 53 | 34.3 | 43.4 | Grade 3 | High | High | High |
Stage I skin + stage III LGI ± UGI | 9 | 33.3 | 21 | 49.1 | 52.4 | Grade 3 | High | High | High |
Stage II skin + stage III LGI ± UGI | 7 | 65.7 | | | | Grade 3 | | | |
Stage III skin + stage III LGI ± UGI | 5 | 60.0 | | | | Grade 3 | | | |
Stage IV skin ± LGI ± UGI | 8 | 25.0 | 46 | 32.6 | 47.8 | Grade 4 | High | High | High |
Stage IV LGI ± skin ± UGI | 38 | 34.2 | | | | Grade 3 | | | |
UGI, upper gastrointestinal; LGI, lower gastrointestinal.
Manhattan risk differed from Minnesota risk in 2 important subsets. First, approximately half of Minnesota-standard-risk patients became low-risk: clinical symptoms included isolated stage I or II skin, isolated upper gastrointestinal (UGI), and stage I skin plus UGI GVHD. Second, in contrast to the Minnesota criteria, which classifies patients with liver GVHD with stage I to III skin as standard-risk, the Manhattan risk system classifies patients with any liver involvement as high-risk.
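The category assignments in Table 2 can be condensed into a small rule set. The sketch below is an illustrative reading of that table, not the authors' published calculator: the function name and argument encoding (integer stages, with 0 meaning no involvement) are assumptions, and organ combinations that did not occur in the training cohort are resolved by the nearest listed rule.

```python
def manhattan_risk(skin, liver, ugi, lgi):
    """Manhattan risk from organ stages (skin, liver, LGI 0-4; UGI 0-1)."""
    for stage in (skin, liver, lgi):
        if stage not in (0, 1, 2, 3, 4):
            raise ValueError("skin, liver, and LGI stages must be 0-4")
    if ugi not in (0, 1):
        raise ValueError("UGI stage must be 0 or 1")
    # Any liver involvement, stage IV skin, or stage III-IV LGI is high risk.
    if liver >= 1 or skin == 4 or lgi >= 3:
        return "high"
    # Stage II LGI is high risk with skin involvement, else intermediate.
    if lgi == 2:
        return "high" if skin >= 1 else "intermediate"
    # Stage I LGI, with or without stage I-III skin or UGI, is intermediate.
    if lgi == 1:
        return "intermediate"
    # No LGI and no liver involvement from here on.
    if skin == 3:
        return "intermediate"
    if skin == 2:
        return "intermediate" if ugi else "low"
    if skin == 1 or ugi:
        return "low"
    raise ValueError("no organ involvement")
```

For instance, `manhattan_risk(2, 0, 1, 0)` returns `"intermediate"`, reflecting the Table 2 row in which UGI symptoms move stage II skin GVHD out of the low-risk stratum.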
The Glucksberg classification (grades 1/2 vs 3/4),22 and a recently proposed principal component–derived grading system24 possess AUCs similar to those of the Minnesota risk system25 for the prediction of 6-month NRM (supplemental Figure 2). In the validation cohort (supplemental Table 4), the AUC of the Manhattan model for 6-month NRM was significantly higher than that of the Minnesota model (0.69 vs 0.64; P = .009; ΔAUC, 0.057 [95% CI, 0.016-0.101]) (supplemental Figure 3). The Manhattan risk model did not predict relapse, and thus differences in OS between groups were determined by differences in NRM (supplemental Figure 4). The Manhattan model defined 40% of patients as low-risk, and the 3 Manhattan strata possessed distinctly different 6-month NRM in both the training and the validation cohorts (Figure 1A-B). Comparisons of risk categories by organ involvement are summarized for the 2 models in supplemental Table 5.
Figure 1. NRM in the clinical risk models. Six-month cumulative incidence of NRM by Minnesota (left) and Manhattan (right) risk strata. (A) Training cohort. Minnesota standard risk: 10.2% (95% CI, 8.5-12.2); Minnesota high risk: 36.8% (95% CI, 30.5-43.0); Manhattan low risk: 7.1% (95% CI, 5.1-9.5); Manhattan intermediate risk: 13.9% (95% CI, 11.1-16.9); Manhattan high risk: 37.8% (95% CI, 31.2-44.4). (B) Validation cohort. Minnesota standard risk: 11.0% (95% CI, 8.3-14.1); Minnesota high risk: 34.4% (95% CI, 25.3-43.6); Manhattan low risk: 7.0% (95% CI, 4.2-10.8); Manhattan intermediate risk: 14.9% (95% CI, 10.6-19.9); Manhattan high risk: 35.8% (95% CI, 26.4-45.4). Pie charts depict the percentage of each clinical risk. ∗P values for pairwise comparisons were adjusted using the Bonferroni method.
To evaluate the robustness of the Manhattan model, we evaluated subsets limited to Glucksberg grade 2 to 4 acute GVHD or to treatment with ≥0.5-mg/kg methylprednisolone in the whole cohort. The AUCs of the Manhattan risk model remained superior to those of the Minnesota model for both groups (0.67 vs 0.65; P = .024; ΔAUC, 0.024 [95% CI, 0.003-0.044]; 0.68 vs 0.65; P = .005; ΔAUC, 0.035 [95% CI, 0.011-0.061], respectively). The risks of NRM were similar within each risk category of these subsets (supplemental Tables 6 and 7).
MAGIC composite scores
We hypothesized that the inclusion of serum biomarkers at the onset of treatment would further improve the performance of the Manhattan clinical risk model, particularly for intermediate-risk patients, for whom 6-month NRM was only 7 percentage points higher than that of low-risk patients. Serum samples at treatment onset were available in 80% (1050/1306) of the training and 78% (432/557) of the validation cohort (Table 1; supplemental Figure 1). The 6-month NRM did not differ between patients with and without samples in either cohort (16% vs 13%; P = .296 and 16% vs 12%; P = .329, respectively). As expected, we found that AA scores independently stratified the risk of NRM in each risk group of the Manhattan risk model. The risk of NRM for each AA score increased with escalating Manhattan risk, further demonstrating improved prediction of outcome by combining clinical and biomarker assessments (supplemental Figure 5). We again applied a CART analysis to the 9 combinations of the Manhattan risk and AA scores in the training cohort, which created a new composite scoring system of 3 strata that we called the MAGIC composite scores (Table 3). In analyses not presented herein, we tested the accuracy of models containing 4 to 9 categories, but none provided significantly greater AUCs than the 3-category model, which was used in all subsequent analyses. We confirmed the performance of the new model using an unsupervised K-means clustering algorithm.
Table 3. Algorithm assignment to MAGIC composite scores of 9 categories determined by the Manhattan risk and AA scores in the training cohort
Manhattan risk . | AA scores . | n (%) . | 6-mo NRM (%) . | CART . | K-means . |
---|---|---|---|---|---|
Low | AA1 | 296 (28.2) | 3.1 | MCS1 | MCS1 |
Low | AA2 | 99 (9.4) | 12.1 | MCS1 | MCS1 |
Low | AA3 | 36 (3.4) | 27.8 | MCS2 | MCS2 |
Intermediate | AA1 | 247 (23.5) | 8.7 | MCS1 | MCS1 |
Intermediate | AA2 | 125 (11.9) | 18.4 | MCS2 | MCS2 |
Intermediate | AA3 | 74 (7.0) | 29.7 | MCS2 | MCS2 |
High | AA1 | 50 (4.8) | 20.4 | MCS2 | MCS2 |
High | AA2 | 52 (5.0) | 29.1 | MCS2 | MCS2 |
High | AA3 | 71 (6.8) | 56.3 | MCS3 | MCS3 |
MCS, MAGIC composite scores.
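The assignments in Table 3 amount to a 9-entry lookup, which can be sketched as below. The dictionary and function names are hypothetical; the mapping itself is copied from the CART column of the table.

```python
# (Manhattan risk, Ann Arbor score) -> MAGIC composite score, per Table 3.
MAGIC_COMPOSITE = {
    ("low", 1): 1,
    ("low", 2): 1,
    ("low", 3): 2,
    ("intermediate", 1): 1,
    ("intermediate", 2): 2,
    ("intermediate", 3): 2,
    ("high", 1): 2,
    ("high", 2): 2,
    ("high", 3): 3,
}


def magic_composite_score(manhattan, ann_arbor):
    """Look up the composite score for a Manhattan risk and AA score."""
    try:
        return MAGIC_COMPOSITE[(manhattan.lower(), ann_arbor)]
    except KeyError:
        raise ValueError("unknown Manhattan risk or AA score") from None
```

Only the extremes map to a unique corner: composite score 3 requires both high Manhattan risk and AA3, whereas composite score 1 covers low-risk patients with AA1 or AA2 and intermediate-risk patients with AA1.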
The incidence of NRM within 6 months increased with each increase in MAGIC composite score, but the incidence of relapse did not change, resulting in large differences in OS between each group in both the training and validation cohorts (Figure 2A-C; supplemental Figure 6A-C). In the total population with available samples, 24% (356/1482) of patients were intermediate-risk in the Manhattan model but had a MAGIC composite score of 1, with a 6-month NRM rate of only 8%. Furthermore, 3% (46/1482) of patients were low-Manhattan-risk patients who increased by 1 risk stratum to a MAGIC composite score of 2, with a 6-month NRM of 28%, and 12% (147/1482) were high-Manhattan-risk patients who decreased by 1 risk stratum to a MAGIC composite score of 2, with a 6-month NRM of 26%.
Figure 2. NRM and AUC of the MAGIC composite scores. (A) Six-month cumulative incidence of NRM. MAGIC composite score 1: 5.7% (95% CI, 3.3-8.9); composite score 2: 28.8% (95% CI, 21.2-36.8); composite score 3: 51.5% (95% CI, 33.1-67.2). (B) Six-month cumulative incidence of relapse. MAGIC composite score 1: 8.3% (95% CI, 5.4-12.0); composite score 2: 10.8% (95% CI, 6.2-16.9); composite score 3: 6.7% (95% CI, 1.1-19.7). (C) Probability of OS at 6 months. MAGIC composite score 1: 90.6% (95% CI, 86.4-93.5); composite score 2: 64.3% (95% CI, 55.3-71.9); composite score 3: 42.4% (95% CI, 25.6-58.3). Pie charts depict the percentage of each composite score. ∗P values for pairwise comparisons were adjusted using the Bonferroni method. (D) Time-dependent area under the receiver operating characteristic curve for NRM from the time of systemic treatment.
Using 6-month NRM as the outcome, the AUC of the MAGIC composite score model was significantly higher than that of the Manhattan model in both the training (0.73 vs 0.69; P = .019; ΔAUC, 0.042 [95% CI, 0.007-0.076]) and the validation cohorts (0.76 vs 0.70; P = .010; ΔAUC, 0.064 [95% CI, 0.018-0.112]) (supplemental Figure 7). We next assessed the prognostic efficacy of each model at several time points during the first year from GVHD treatment. The MAGIC composite scores were consistently superior to both Manhattan and Minnesota risk models (Figure 2D). The Akaike information criterion for predicting 6-month NRM based on MAGIC composite scores was also substantially lower (763.9) than the Manhattan (1006.9) or Minnesota risk model (1046.7).
In addition to NRM, the response to primary treatment is a key metric of successful predictive tests. We therefore assessed these models for their prediction of day 28 ORR, the standard end point for treatment response in clinical trials. In both the training and validation sets, there were significant differences in ORR between each MAGIC composite score, but there was no significant difference in ORR between the low- and intermediate-Manhattan-risk groups (Figure 3). These differences in ORR by MAGIC composite score align with the improved prediction of 6-month NRM and support the use of MAGIC composite scores to guide first-line therapy. Interestingly, although AA biomarker scores alone predicted 6-month NRM as well as the composite scores did, the composite scores were better predictors of day 28 ORR (supplemental Table 8). The large 17-point difference between MAGIC composite scores 1 and 2 was highly significant (80% vs 63%; P < .001), whereas the smaller 11-point difference between AA scores 1 and 2 was not (80% vs 69%; P = .068).
Figure 3. Day 28 ORR. Day 28 ORR by the Minnesota risk (left), Manhattan risk (middle), and MAGIC composite scores (right). (A) Training cohort. Minnesota standard risk: 71.5%; Minnesota high risk: 47.0%; Manhattan low risk: 72.3%; Manhattan intermediate risk: 69.6%; Manhattan high risk: 47.9%; MAGIC composite score 1: 74.8%; MAGIC composite score 2: 63.2%; MAGIC composite score 3: 35.2%. (B) Validation cohort. Minnesota standard risk: 73.3%; Minnesota high risk: 49.5%; Manhattan low risk: 77.0%; Manhattan intermediate risk: 69.7%; Manhattan high risk: 48.5%; MAGIC composite score 1: 79.8%; MAGIC composite score 2: 62.9%; MAGIC composite score 3: 30.3%. The error bars represent standard errors. ∗P values for pairwise comparisons were adjusted using the Bonferroni method.
Using the whole cohort, we next evaluated the robustness of the MAGIC composite score model in the following 2 key subsets: patients with Glucksberg grade 2 to 4 acute GVHD and patients treated with ≥0.5-mg/kg methylprednisolone. The MAGIC composite score model produced significantly higher AUCs in both subsets compared with the Manhattan risk model (0.73 vs 0.67; P < .001; ΔAUC, 0.054 [95% CI, 0.024-0.086]; 0.75 vs 0.69; P = .009; ΔAUC, 0.054 [95% CI, 0.025-0.083]). Each risk category of the MAGIC composite score demonstrated a similar 6-month NRM within these subsets (supplemental Tables 6 and 7).
Black patients comprised 5% (95/1863) of the total population, whereas pediatric (<18 years old) patients comprised 12% (229/1863) of the total population (Table 1). As shown in supplemental Figures 8 and 9, both the Manhattan and the MAGIC composite score models performed well in these small subgroups, although numbers were not sufficiently large to indicate statistically significant differences between strata. The Manhattan risk model divided patients into approximately equal portions, with small differences between low and intermediate groups and a large difference between intermediate- and high-risk groups. The MAGIC composite scores correctly recategorized some higher-risk patients as lower-risk, resulting in a majority of patients with a composite score of 1 and a very low NRM rate, whereas the NRM increased in the smaller group of patients with a score of 2.
Finally, we evaluated the models in an additional key subset: patients who developed acute GVHD after receiving prophylaxis that contained posttransplantation cyclophosphamide (n = 301). The overall 6-month NRM in this group was low (12%), and the Manhattan risk model showed no significant difference in NRM between its risk groups (11% vs 11% vs 19%; P = .336). Incorporation of biomarkers into the MAGIC composite score model effectively stratified the risk of NRM in these patients (8% vs 16% vs 27%; P = .026) (supplemental Figure 10). When donor groups were evaluated separately, similar patterns were observed for recipients of both haploidentical and nonhaploidentical donors (supplemental Table 9).
Discussion
High initial doses of corticosteroids and gradual tapers lasting for months have been the recommended treatment for GVHD for decades.8,10 Recent advances in GVHD prophylaxis, however, have reduced the overall incidence of severe GVHD, and mild to moderate symptoms are now the dominant clinical phenotype.5,6 In this study, the observed NRM for standard-Minnesota-risk patients (∼11%) was half that of previous publications,25,26 reflecting a trend toward less NRM from GVHD that may be due to improved GVHD prophylaxis, anti-infective therapy, and supportive care.5,6,19 We first validated the Manhattan risk system using only clinical organ severity, which identified significant numbers of patients with mild GVHD in a low-risk stratum encompassing ∼40% of patients. These data confirm an important finding of a recent retrospective study by Nikiforow et al51 demonstrating that patients with isolated UGI disease experienced low NRM (Table 2). However, in the current study, although UGI symptoms did not increase the NRM of patients with stage I skin GVHD, they did increase the NRM of patients with stage II skin GVHD fourfold, elevating the risk of this latter group from low to intermediate. The size of the Manhattan low-risk stratum significantly increased to >60% of patients with the incorporation of biomarker values. The incidence of 6-month NRM for these patients (∼6%) is almost half that of the Minnesota standard risk (∼11%), which may represent a clinically important difference in outcomes. Patients with MAGIC composite score 1 thus have a very low risk of NRM, which may serve to guide individual treatment strategies that minimize steroid exposure18 (NCT05090384). We have created a public website that includes a calculator for combining GVHD stages of individual organs (with specific guidance for gastrointestinal symptoms) to generate Manhattan risk, and for integrating biomarker values to generate MAGIC composite scores.52
Unusual presentations and subtle manifestations of GVHD may present challenges to its accurate diagnosis and staging, particularly in ethnic minority populations for whom there are minimal historical data. When clinical findings alone are not definitive, physicians often consider both clinical symptoms and laboratory findings in determining the treatment of individual patients. The incorporation of biomarkers and clinical symptoms in the MAGIC composite scores by creating a third risk group leverages the prognostic accuracy of the MAGIC serum biomarkers2,27,28,30,31 and resolves the dilemma that clinicians face when the severity of clinical and laboratory parameters does not align. The MAGIC composite score model not only produced statistically significant differences in AUCs compared with the clinical risk models but also, through the integration of AA scores with Manhattan risk, produced clinically meaningful changes in risk of NRM for 2 subsets of patients. First, nearly one-quarter of all patients were classified as Manhattan-intermediate-risk but had the lowest biomarker risk (AA1); these patients experienced a very low 6-month NRM of 8% and were therefore classified as MAGIC composite score 1. Second, a small group (<5% of all patients) with Manhattan low risk and the highest biomarker risk (AA3) had a 6-month NRM of 28% and were therefore classified as MAGIC composite score 2. The high risk of NRM in this small group is important to consider in treatment decisions.
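The two reclassifications described above can be pictured as cells in a lookup table keyed by Manhattan risk and AA biomarker score. The Python sketch below is deliberately partial: it encodes only the two combinations stated in this paragraph and returns `None` for every other combination rather than guessing. The names `KNOWN_CELLS` and `composite_score` are hypothetical; the consortium's public calculator remains the authoritative source for the full mapping.

```python
from typing import Optional

# (Manhattan risk, AA biomarker score) -> MAGIC composite score.
# Only the two cells spelled out in the paragraph above are encoded;
# all other combinations are intentionally left unspecified.
KNOWN_CELLS = {
    ("intermediate", 1): 1,  # Manhattan intermediate + AA1: 6-month NRM 8%
    ("low", 3): 2,           # Manhattan low + AA3: 6-month NRM 28%
}


def composite_score(manhattan_risk: str, aa_score: int) -> Optional[int]:
    """Return the composite score for a stated cell, or None if this
    sketch does not cover the combination."""
    return KNOWN_CELLS.get((manhattan_risk, aa_score))


print(composite_score("intermediate", 1))
print(composite_score("low", 3))
print(composite_score("high", 2))
```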
Increasing numbers of patients are currently receiving posttransplantation cyclophosphamide–based GVHD prophylaxis in human leukocyte antigen (HLA)–matched donor HCT and HLA–mismatched and haploidentical HCT.5,53-57 The Manhattan risk model using clinical symptoms alone did not distinguish between low and intermediate risk in such patients. However, MAGIC composite scores did successfully stratify these patients into 3 groups for risk of NRM, further demonstrating the additive value of biomarker scores to clinical phenotypes.
When biomarker values are not readily available, the Manhattan risk model offers advantages relative to the Minnesota risk model. First, given the superior survival and response rate of patients with Manhattan-low-risk GVHD, they may be considered for clinical trials designed to minimize immunosuppressive treatment. Second, because patients with Manhattan low risk have excellent responses to standard treatment along with low NRM, it may be desirable to exclude them from trials investigating treatments intended to improve response rates. Third, the inclusion of all patients with liver GVHD in the high-risk group32 regardless of other organ involvement resolves an anomaly of the Minnesota risk system that categorized some patients with both skin and liver GVHD as standard-risk instead of high-risk.
Our study has several limitations. First, although our training cohort was large, certain groups of patients such as ethnic minorities and individuals with unusual combinations of GVHD manifestations were small. We approached the latter issue by merging a number of groups on the basis of similarities of their symptoms and NRM rates so that each group contained at least 20 patients before applying CART analysis. Thus, GVHD presenting with a rare constellation of symptoms may not be ideally classified. Second, carefully designed clinical trials are required to determine whether infectious deaths in patients whose GVHD has resolved can be prevented by less immunosuppressive (but equally effective) therapy. In this regard, the ability of the MAGIC composite scores to predict day 28 ORR is encouraging. Third, we included patients with Glucksberg grade 1 acute GVHD that was systemically treated, although systemic treatment is not uniformly recommended for this group.10 It is thus reassuring that the Manhattan and MAGIC composite score models performed well in subset analyses when patients treated with low doses of corticosteroids or those with Glucksberg grade 1 GVHD were excluded. Fourth, the initial dose of steroids varied among the participating centers, reflecting the heterogeneity of real-world practices.58 These findings must therefore be confirmed with data from prospective clinical trials applying strict exclusion and inclusion criteria and using consistent, homogeneous treatments.
In summary, a new Manhattan risk system based on clinical symptoms alone at the initiation of systemic treatment, and new MAGIC composite scores that include biomarkers, are more accurate than the current risk classification systems. Verification of both models by a second statistical approach lends confidence to the accuracy of their categorizations. However, the number of ethnic minority patients in this report is small, which may limit the models’ application to these and other rare populations. These improved models offer the potential to more accurately identify patients with both low- and high-risk disease who could benefit from personalized primary treatment strategies.
Acknowledgment
The authors greatly appreciate the patients, their families, the medical staff, and the data managers in the Mount Sinai Acute GVHD International Consortium centers.
This work was supported by the National Institutes of Health, National Cancer Institute (grants P01 CA039542 and P30 CA196521), the National Pediatric Cancer Foundation, and the German José Carreras Leukaemia Foundation (grants DJCLS 01 GVHD 2016 and DJCLS 01 GVHD 2020). Y.A. is a recipient of the Japan Society for the Promotion of Science Postdoctoral Fellowship for Research Abroad.
Authorship
Contribution: Y.A. designed the study, collected the clinical data, conducted the statistical analysis, and wrote the manuscript; N.S. collected the clinical data, advised on statistical methods, and reviewed and revised the manuscript; D.W., P.A.-H., F.A., C.C., H.K.C., M.E., A.M.E., S.A.G., E.O.H., W.J.H., C.L.K., S. Kraus, M.M.A.M., P.M., M.Q., R.R., T.S., E.U., I.V., M.W., R.Z., Y.-B.C., and R.N. collected the clinical data, and reviewed and revised the manuscript; J.B., G.E., S.G., N.K., and R.Y. collected and reviewed the clinical data; R.B., S. Kowalyk, and G.M. performed the laboratory analysis; J.E.L. and J.L.M.F. designed the study, interpreted data, advised on methods, reviewed and revised the manuscript, and organized this project; and all authors contributed to the writing of the report and approved the final version of the manuscript.
Conflict-of-interest disclosure: M.W. received consulting fees from Amgen, Germany and speaker’s fees from Novartis, Germany. J.E.L. and J.L.M.F. report research support from Equillium, Incyte, MaaT Pharma, and Mesoblast, and consulting fees from Editas, Equillium, Kamada, and Mesoblast. J.E.L. reports additional consulting fees from Sanofi, bluebird bio, Inhibrx, and X4 Pharmaceuticals. J.L.M.F. reports additional consulting fees from Alexion, Realta, Medpace, Viracor, AlloVir, and Physicians’ Education Resource. The remaining authors declare no competing financial interests.
Correspondence: John E. Levine, The Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place/Box 1410, New York, NY 10029; email: john.levine@mssm.edu; and James L. M. Ferrara, The Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, 1 Gustave L. Levy Place, New York, NY 10029; email: james.ferrara@mssm.edu.
References
Author notes
J.E.L. and J.L.M.F. contributed equally to this study.
Data are available from authors John E. Levine (john.levine@mssm.edu), James L. M. Ferrara (james.ferrara@mssm.edu), or Yu Akahoshi (akahoshiu@gmail.com) upon reasonable request.
The online version of this article contains a data supplement.
There is a Blood Commentary on this article in this issue.
The publication costs of this article were defrayed in part by page charge payment. Therefore, and solely to indicate this fact, this article is hereby marked “advertisement” in accordance with 18 USC section 1734.