Introduction

Artificial intelligence (AI) is fundamentally transforming biomedical research, and health care systems are increasingly reliant on AI-based predictive analytics for better diagnostic, prognostic, and therapeutic decisions1,2,3. Since data are the most important resource for developing high-quality AI models, data inequality among ethnic groups is becoming a global health problem in the AI era. Recent statistics showed that samples from cancer genomics research projects, including the TCGA4, TARGET5, OncoArray6, and 416 cancer-related genome-wide association studies, were collected primarily from Caucasians (91.1%), distantly followed by Asians (5.6%), African Americans (1.7%), Hispanics (0.5%), and other populations (0.5%)7. Most clinical genetics and genomics data have been collected from individuals of European ancestry, and the ethnic diversity of studied cohorts has largely remained the same or even declined in recent years8,9. As a result, non-Caucasians, who constitute about 84% of the world’s population, are at a long-term, cumulative data disadvantage. Inadequate training data may lead to suboptimal AI models with low prediction accuracy and robustness, which may have profound negative impacts on health care for the data-disadvantaged ethnic groups9,10. Thus, data inequality between ethnic groups is set to generate new health care disparities.

The current prevalent scheme of machine learning with multiethnic data is the mixture learning scheme in which data for all ethnic groups are mixed and used indistinctly in model training and testing (Fig. 1). Under this scheme, it is unclear whether the machine learning model works well for all ethnic groups involved. An alternative approach is the independent learning scheme in which data from different ethnic groups are used separately to train independent models for each ethnic group (Fig. 1). This learning scheme also tends to produce models with low prediction accuracy for data-disadvantaged minority groups due to inadequate training data.

Fig. 1: Multiethnic machine learning schemes.
figure 1

In the mixture learning scheme, a model is trained and tested on the data for all ethnic groups. In the independent learning scheme, a model is trained and tested for each ethnic group using its own data. In the transfer learning scheme, a model is trained on the majority group data, then the knowledge learned is transferred to assist the development of a model for each minority group.

Here, we show that the mixture learning scheme tends to produce models with relatively low prediction accuracy for data-disadvantaged minority groups, due to data distribution mismatches between ethnic groups. Therefore, the mixture learning scheme often leads to unintentional and even unnoticed model performance gaps between ethnic groups. We find that the transfer learning11,12,13 scheme (Fig. 1), in many cases, can provide machine learning models with improved performance for data-disadvantaged ethnic groups. Our results from machine learning experiments on synthetic data indicate that data inequality and data distribution discrepancy between ethnic groups are the key factors underlying the model performance disparities. We anticipate that this work will provide a starting point for an unbiased multiethnic machine learning paradigm: one that regularly tests machine learning models on each ethnic group to identify performance disparities, and that uses transfer learning or other techniques to reduce them. Such a paradigm is essential for reducing health care disparities arising from the long-standing biomedical data inequality among ethnic groups.

Results

Clinical omics data inequalities among ethnic groups

Interrelated multi-omics factors including genetic polymorphisms, somatic mutations, epigenetic modifications, and alterations in expression of RNAs and proteins collectively contribute to cancer pathogenesis and progression. Clinical omics data from large cancer cohorts provide an unprecedented opportunity to elucidate the complex molecular basis of cancers14,15,16 and to develop machine learning-based predictive analytics for precision oncology17,18,19,20,21,22. However, data inequality among ethnic groups continues to be conspicuous in recent large-scale genomics-focused biomedical research programs7,23,24. The TCGA cohort consists of 80.5% European Americans (EAs), 9.2% African Americans (AAs), 6.1% East Asian Americans (EAAs), 3.6% Native Americans (NAs), and 0.7% others, based on genetic ancestry analysis25,26. The TARGET5 and MMRF CoMMpass27 cohorts have similar ethnic compositions28, which are typical for current clinical omics datasets7. The data inequality among ethnic groups is ubiquitous across almost all cancer types in the TCGA and MMRF CoMMpass cohorts (see Supplementary Fig. 1); therefore, its negative impacts would be broad and not limited to the cancer types or subtypes for which ethnic disparities have already been reported.

Disparities in machine learning model performance

We assembled machine learning tasks using the cancer omics data and clinical outcome endpoints29 from the TCGA data of two ethnic groups: the AA and EA groups, assigned by genetic ancestry analysis25,26. A total of 1600 machine learning tasks were assembled using combinations of four factors: (1) 40 types of cancers and pan-cancers15; (2) two types of omics features: mRNA and protein expression; (3) four clinical outcome endpoints: overall survival (OS), disease-specific survival (DSS), progression-free interval (PFI), and disease-free interval (DFI)29; and (4) five thresholds for the event time associated with the clinical outcome endpoints (Supplementary Fig. 2). For each learning task, a patient is assigned to the positive prognosis category if the patient’s event time for the task’s clinical outcome endpoint is at least the task’s threshold, and to the negative prognosis category otherwise.
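
A minimal sketch of this labeling step, assuming a pandas DataFrame with a hypothetical event-time column for each endpoint (column names and thresholds are illustrative, not taken from the TCGA files):

```python
import pandas as pd

def assign_prognosis_category(clinical: pd.DataFrame, endpoint: str, threshold_years: float) -> pd.Series:
    """Label patients for one learning task (illustrative sketch).

    Assumes a hypothetical column '<endpoint>_time' holding the event time in
    years for the chosen clinical outcome endpoint (OS, DSS, PFI, or DFI).
    Patients whose event time is at least the threshold are assigned to the
    positive prognosis category (1); all others are negative (0).
    """
    event_time = clinical[f"{endpoint}_time"]
    return (event_time >= threshold_years).astype(int)

# Example: labels for a hypothetical 3-year overall survival task
# labels = assign_prognosis_category(clinical_df, endpoint="OS", threshold_years=3)
```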

Since AA patients constitute less than 10% of the TCGA cohort, many learning tasks had only very small numbers of AA cases. We filtered out the learning tasks having too few cases to permit reliable machine learning experiments, and performed machine learning experiments on the remaining 447 learning tasks that had at least five AA cases and five EA cases in each of the positive and negative prognosis categories. For each machine learning task, we trained a deep neural network (DNN) model for classification between the two prognosis categories using the mixture learning scheme. The mixture learning models achieved reasonably good baseline performance (AUROC > 0.65) for 224 learning tasks. A total of 21 types of cancers and pan-cancers and all four clinical outcome endpoints were represented in these learning tasks. The proportion of AA patients in these learning tasks ranged from 0.06 to 0.25, with a median of 0.12 (Supplementary Fig. 3a). For each of the 224 learning tasks (Supplementary Data 1), we performed six machine learning experiments (Table 1) to compare the performance of the three multiethnic machine learning schemes on the AA and EA groups (Fig. 2).

Table 1 The machine learning experiments.
Fig. 2: Performance index values for the multiethnic machine learning experiments.
figure 2

Each box plot shows the AUROC (area under the ROC curve) values for the 224 learning tasks in a machine learning experiment listed in Table 1. Each circle represents the mean AUROC of 20 independent runs with different random partitions of training and testing data. Gray represents performance for the whole cohort, blue represents performance for the EA group, and red represents performance for the AA group. Box-plot elements are: center line, median; box limits, 25th and 75th percentiles; whiskers, minimum and maximum values. The p values were calculated using a one-sided Wilcoxon signed-rank test.

In the machine learning experiments, we observed that the mixture learning scheme was prone to produce biased models with lower prediction performance for the data-disadvantaged AA group. The model performance differences between the EA and AA groups were statistically significant, with a p value of 6.72 × 10−11 (Fig. 2, Mixture 1 & 2). The average EA–AA model performance gap over the 224 learning tasks was 0.06 (AUROC, Table 1). Without testing model performance on each ethnic group separately, these differences would be concealed by the overall good performance for the entire multiethnic cohort (Fig. 2, Mixture 0). The independent learning scheme produced even larger EA–AA performance differences, with a p value of 1.29 × 10−26 and an average performance gap of 0.13 (Table 1, Fig. 2, Independent 1 & 2).

Transfer learning for improving machine learning model performance for data-disadvantaged ethnic groups

We compared the machine learning schemes on performance for the data-disadvantaged AA group and found that transfer learning produced models with significantly better performance for the AA group than the models from mixture learning (p = 6.79 × 10−5) and independent learning (p = 6.05 × 10−35) (Fig. 2). The machine learning experiment results for four learning tasks with different cancer types and clinical outcome endpoints are shown in Fig. 3 (more results in Supplementary Fig. 4). We used threefold cross-validation and performed 20 independent runs for each experiment using different random partitions of training and testing data to assess machine learning model performance. The median AUROC values of the six experiments are denoted as \(A_{Mixture0}\), \(A_{Mixture1}\), \(A_{Mixture2}\), \(A_{Independent1}\), \(A_{Independent2}\), and \(A_{Transfer}\). The results of these experiments showed a consistent pattern:

(1) Both the mixture learning and independent learning schemes produced models with relatively high and stable performance for the EA group but low and unstable performance for the data-disadvantaged AA group. We defined the performance disparity gap as \(G = \overline{AUROC}_{EA} - \overline{AUROC}_{AA}\), where \(\overline{AUROC}_{EA} = (A_{Mixture1} + A_{Independent1})/2\) and \(\overline{AUROC}_{AA} = (A_{Mixture2} + A_{Independent2})/2\). G is represented by the distance between the blue and red dashed lines in Fig. 3 and Supplementary Fig. 4.

(2) The transfer learning scheme produced models with improved performance for the data-disadvantaged AA group and thus reduced the model performance gap. The reduced model performance disparity gap is \(\tilde{G} = \overline{AUROC}_{EA} - A_{Transfer}\), which is represented by the distance between the blue and green dashed lines in Fig. 3 and Supplementary Fig. 4 (a numerical sketch of both gap calculations follows this list).
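
As a numerical sketch of the two gap definitions (the AUROC values below are placeholders, not results reported in this work):

```python
# Placeholder median AUROC values for the six experiments of one learning task
A = {
    "Mixture1": 0.74,      # mixture model evaluated on the EA group
    "Mixture2": 0.66,      # mixture model evaluated on the AA group
    "Independent1": 0.73,  # EA-only model evaluated on the EA group
    "Independent2": 0.60,  # AA-only model evaluated on the AA group
    "Transfer": 0.70,      # transfer learning model evaluated on the AA group
}

auroc_ea = (A["Mixture1"] + A["Independent1"]) / 2  # mean AUROC for the EA group
auroc_aa = (A["Mixture2"] + A["Independent2"]) / 2  # mean AUROC for the AA group

G = auroc_ea - auroc_aa               # performance disparity gap
G_reduced = auroc_ea - A["Transfer"]  # gap remaining after transfer learning
print(f"G = {G:.3f}, reduced gap = {G_reduced:.3f}")  # G = 0.105, reduced gap = 0.035
```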

Fig. 3: Comparison of multiethnic machine learning schemes.
figure 3

The machine learning tasks are: a GBMLGG-AA/EA-Protein-OS-3YR, b PRAD-AA/EA-mRNA-PFI-3YR, c KIPAN-AA/EA-Protein-DSS-3YR, d PanGyn-AA/EA-mRNA-DFI-5YR. In each panel, the box plots show AUROC values for the six experiments (20 independent runs for each experiment). The red, blue, and green vertical dashed lines represent \(\overline{AUROC}_{AA}\), \(\overline{AUROC}_{EA}\), and \(A_{Transfer}\), respectively. Box-plot elements are: center line, median; box limits, 25th and 75th percentiles; whiskers, 10th–90th percentiles; points, outliers. Abbreviations for cancer types are explained in Supplementary Data 1.

Among the 224 learning tasks, 142 had a performance gap G > 0.05 and 88.7% (125/142) of these performance gaps were reduced by transfer learning.

We also performed the machine learning experiments on two additional learning tasks that involved either another ethnic group or non-TCGA data: (1) Stomach Adenocarcinoma (STAD)-EAA/EA-PFI-2YR assembled using the TCGA STAD data of EAA and EA patients; and (2) MM-AA/EA-mRNA-OS-3YR assembled using the MMRF CoMMpass27 data of AA and EA patients (Supplementary Data 1). For both learning tasks, machine learning experiments showed the same pattern of performance as described above (Supplementary Fig. 4a, b).

Key factors underlying ethnic disparities in machine learning model performance

A machine learning task \({\cal{T}} = \{{\cal{X}}, {\cal{Y}}, f:{\cal{X}} \to {\cal{Y}}\}\) consists of a feature space \({\cal{X}}\), a label space \({\cal{Y}}\), and a predictive function f learned from feature–label pairs. From a probabilistic perspective, f can be written as13 P(Y|X), where \(X \in {\cal{X}}\) and \(Y \in {\cal{Y}}\). It is generally assumed that each feature–label pair is drawn from a single distribution30 P(X, Y). However, this assumption needs to be tested for multiethnic omics data. Given \(P(X, Y) = P(Y|X)P(X)\), both the marginal distribution P(X) and the conditional distribution P(Y|X) may contribute to the data distribution discrepancy among ethnic groups. We used t-tests to identify differentially expressed mRNAs or proteins between the AA and EA groups. The median percentage of differentially expressed mRNA or protein features in the 224 learning tasks was 10%, and 70% of the learning tasks had at least 5% differentially expressed mRNA or protein features (Supplementary Fig. 3b). We used logistic regression to model the conditional distribution f = P(Y|X) and calculated the Pearson correlation coefficient between the logistic regression parameters for the AA and EA groups. The Pearson correlation coefficients ranged from −0.14 to 0.26 across the learning tasks, with a median of 0.04 (Supplementary Fig. 3c). These results indicate that various degrees of marginal and conditional distribution discrepancies between the AA and EA groups exist in most of the 224 learning tasks.
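
The following sketch illustrates how the conditional distribution discrepancy could be quantified along these lines, assuming group-specific feature matrices and prognosis labels are already assembled (variable names are hypothetical; note that sklearn's LogisticRegression applies L2 regularization by default, which the original analysis may not have used):

```python
from scipy import stats
from sklearn.linear_model import LogisticRegression

def conditional_discrepancy(X_aa, y_aa, X_ea, y_ea):
    """Correlate group-specific logistic regression coefficients (a sketch).

    Fits P(Y|X) separately for the AA and EA groups and returns the Pearson
    correlation between the two coefficient vectors; a low correlation
    suggests a conditional distribution discrepancy between the groups.
    """
    beta_aa = LogisticRegression(max_iter=1000).fit(X_aa, y_aa).coef_.ravel()
    beta_ea = LogisticRegression(max_iter=1000).fit(X_ea, y_ea).coef_.ravel()
    return stats.pearsonr(beta_aa, beta_ea)  # (correlation, p value)
```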

We hypothesized that data inequality, represented by cohort ethnic composition, and data distribution discrepancy between ethnic groups are the key factors underlying the ethnic disparity in machine learning model performance, and that both factors can be addressed by transfer learning. To test this hypothesis, we performed the six machine learning experiments (Table 1) on synthetic data generated using a mathematical model whose parameters represent these hypothetical key factors (Methods). Synthetic Data 1 was generated using parameters estimated from the data for the learning task PanGyn-AA/EA-mRNA-DFI-5YR (Fig. 3d), which simulated the data inequality and distribution discrepancy between the ethnic groups in the real data (Table 2). For this synthetic dataset, the six machine learning experiments showed a performance pattern (Fig. 4a) similar to that of the real data (Fig. 3), characterized by performance gaps from the mixture and independent learning schemes and by transfer learning reduction of the performance gaps. Synthetic Data 2 has no distribution difference between the two ethnic groups (Table 2). For this dataset, there is no performance gap from the mixture learning scheme; however, the performance gap from the independent learning scheme remains (Fig. 4b). Synthetic Data 3 has equal numbers of cases from the two ethnic groups (no data inequality) but has a distribution discrepancy between the two ethnic groups. Synthetic Data 4 has equal numbers of cases from the two ethnic groups (no data inequality) and no distribution difference between the two ethnic groups. For these two datasets, there is no significant performance gap from any learning scheme (Fig. 4c, d). These results confirm that the performance gap from the mixture learning scheme is caused by both data inequality and data distribution discrepancy between ethnic groups, whereas the performance gap from the independent learning scheme is caused by inadequate data for the disadvantaged ethnic group; transfer learning may reduce both kinds of performance gaps (Fig. 1).

Table 2 Multiethnic machine learning experiments on synthetic data.
Fig. 4: Comparison of multiethnic machine learning schemes on synthetic data.
figure 4

a Synthetic Data 1, b Synthetic Data 2, c Synthetic Data 3, d Synthetic Data 4. We used threefold cross-validation and performed 20 independent runs for each experiment with different random partitions of training and testing data to assess machine learning model performance. In each panel, box plots show the AUROC values for the six experiments (20 independent runs per experiment). Box-plot elements are: center line, median; box limits, 25th and 75th percentiles; whiskers, 10th–90th percentiles; points, outliers.

Discussion

In this work, we show that the current prevalent scheme for machine learning with multiethnic data, the mixture learning scheme, and its main alternative, the independent learning scheme, tend to generate machine learning models with relatively low performance for data-disadvantaged ethnic groups due to inadequate training data and data distribution discrepancies among ethnic groups. We also find that transfer learning can provide improved machine learning models for data-disadvantaged ethnic groups by leveraging knowledge learned from other groups with more abundant data. These results indicate that transfer learning can provide an effective approach to reduce health care disparities arising from data inequality among ethnic groups. Our simulation experiments show that the machine learning performance disparity gaps would be eliminated completely if there were no data inequality, regardless of data distribution discrepancies (Table 2, Fig. 4c, d). Algorithm-based methods may mitigate health care disparities arising from long-standing data inequality among ethnic groups; however, the ultimate solution to this challenge would be to increase the number of minority participants in clinical studies.

Many factors, including ethnic composition of the cohort, omics data type, cancer type, and clinical outcome endpoint, may potentially affect the performance of multiethnic machine learning schemes. At this point, it is not clear how these factors affect the performance of transfer learning and other learning schemes. One possible direction for future research is to discover how the performance pattern of multiethnic learning schemes changes as a function of these factors.

Methods

Data source and data preprocessing

The TCGA and MMRF CoMMpass data used in this work were downloaded from the Genomic Data Commons (GDC, https://gdc.cancer.gov). The ethnic groups of TCGA patients were determined based on the genetic ancestry data downloaded from The Cancer Genetic Ancestry Atlas25 (TCGAA, http://52.25.87.215/TCGAA). The ethnic groups of MMRF CoMMpass patients were determined based on the self-reported information in the clinical data file downloaded from the GDC Data Portal (https://portal.gdc.cancer.gov).

For the TCGA data, we used all 189 protein expression features and the 17,176 mRNA features without missing values. We further removed samples with more than 20% missing values and filtered out samples missing genetic ancestry or clinical endpoint data. The data matrix was standardized such that each feature has zero mean and unit standard deviation. The ANOVA F value for each mRNA was calculated on the training samples to select 200 mRNAs as the input features for machine learning. The feature mask, ANOVA F values, and p values were calculated using the SelectKBest function (with the f_classif score function and k = 200) of the Python sklearn package31. For the MMRF CoMMpass data, we selected the 600 mRNA features with the highest mean absolute deviation as the input features for machine learning.
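
A minimal sketch of this preprocessing with sklearn, under the assumption that standardization and feature selection are fit on the training samples and then applied to the held-out samples (variable names are hypothetical):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif

def preprocess_mrna(X_train, y_train, X_test, k=200):
    """Standardize features and keep the top-k mRNAs by ANOVA F value (a sketch)."""
    # Zero mean and unit standard deviation per feature
    scaler = StandardScaler().fit(X_train)
    X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

    # ANOVA F values computed on the training samples only
    selector = SelectKBest(score_func=f_classif, k=k).fit(X_train_s, y_train)
    return selector.transform(X_train_s), selector.transform(X_test_s)
```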

Deep neural network modeling

We used the Lasagne (https://lasagne.readthedocs.io/en/latest/) and Theano (http://deeplearning.net/software/theano/) Python packages to train the DNNs. We used a pyramid architecture32 with 6 layers: an input layer with 200 nodes for mRNA features or 189 nodes for protein features; 4 hidden layers consisting of a fully connected layer with 128 nodes followed by a dropout layer33 and a fully connected layer with 64 nodes followed by a dropout layer; and a logistic regression output layer. To fit a DNN model, we used the stochastic gradient descent method with a learning rate of 0.01 (lr = 0.01) to find the weights that minimized a loss function consisting of a cross-entropy term and two regularization terms: \(l(W) = -\sum_{i=1}^{m}\left(y_i\log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right) + \lambda_1\lVert W\rVert_1 + \lambda_2\lVert W\rVert_2\), where yi is the observed label of patient i, \(\hat{y}_i\) is the predicted label for patient i, and W represents the weights in the DNN. Traditional activation functions such as the sigmoid and hyperbolic tangent functions suffer from the vanishing gradient problem when training a deep-learning model: gradients shrink rapidly, so the training error fails to propagate back to the earlier layers. Here, we use the ReLU function f(x) = max(0, x), which is widely used in deep learning to avoid the vanishing gradient problem. For each dropout layer, we set the dropout probability p = 0.5 to randomly omit half of the weights during training and reduce the collinearity between feature detectors. To speed up the computation, we split the data into multiple mini-batches during training. We used a batch size of 20 (batch_size = 20) for the two basic learning schemes (mixture learning and independent learning for the EA group), as relatively large numbers of cases were available for training. For independent learning for the AA group, we set the batch size to 4 because the number of cases available for training was limited. We set the maximum number of iterations at 100 (max_iter = 100) and applied the Nesterov momentum34 method (with momentum = 0.9 for each DNN model) to avoid premature stopping. We set the learning rate decay factor at 0.03 (lr_decay = 0.03) for the learning task BRCA-AA/EA-Protein-OS-4YR to avoid nonconvergence during training; for all other tasks, we set lr_decay = 0. The regularization coefficients λ1 and λ2 were both set to 0.001.
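
The original models were implemented in Lasagne/Theano; the sketch below re-expresses the described architecture and loss in PyTorch purely for illustration (layer sizes, dropout rate, and regularization weights follow the text, but this is not the authors' code):

```python
import torch
import torch.nn as nn

def build_dnn(n_features: int) -> nn.Sequential:
    """Pyramid DNN: input -> 128 (ReLU, dropout) -> 64 (ReLU, dropout) -> logistic output."""
    return nn.Sequential(
        nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.5),
        nn.Linear(64, 1), nn.Sigmoid(),  # logistic regression output layer
    )

def dnn_loss(model, y_hat, y, lam1=0.001, lam2=0.001):
    """Cross-entropy plus L1 and L2 penalties on the weights, matching l(W) above."""
    bce = nn.functional.binary_cross_entropy(y_hat, y)
    l1 = sum(w.abs().sum() for w in model.parameters())
    l2 = sum(w.pow(2).sum() for w in model.parameters()).sqrt()
    return bce + lam1 * l1 + lam2 * l2

# Stochastic gradient descent with Nesterov momentum, as described in the text:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
```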

Transfer learning

For transfer learning11,12,13,35,36,37, we set the EA group as the source domain and the AA or EAA group as the target domain. We applied three transfer learning methods to each learning task and selected the best AUROC as the performance index for the transfer learning scheme. The three transfer learning methods include two fine-tuning algorithms and a domain adaptation algorithm:

(1) Fine-tuning algorithm 1

Recent studies have shown that fine-tuning of a DNN often leads to better performance and generalization in transfer learning38. We first pretrained a DNN model using the source domain data: \(M \sim f(Y_{Source}|X_{Source})\), with the same architecture as described in the previous section. The training parameters were set as lr = 0.01, batch_size = 20, p = 0.5, max_iter = 100, and momentum = 0.9. After the initial training, the DNN model was fine-tuned using backpropagation on the target domain data: M′ = fine_tuning(M|YTarget, XTarget), where M′ is the final model. In the fine-tuning, the learning rate was set at 0.002 and the batch size was set at 10, as the model had been partially fitted and the target dataset was small (a code sketch of this pretrain-then-fine-tune procedure follows this list).

(2) Fine-tuning algorithm 2

In the second fine-tuning algorithm, the source domain data were used as unlabeled data to pretrain a stacked denoising autoencoder37,39,40. The stacked denoising autoencoder has 5 layers: the input layer, a coding layer with 128 nodes, a bottleneck layer with 64 nodes, a decoding layer with 128 nodes, and an output layer with the same number of nodes as the input layer to reconstruct the input data. We used the source and target domain data to train the stacked autoencoder with the parameters: learning rate = 0.01, corruption level = 0.3, batch size = 32, and maximum iterations = 500. After pretraining the autoencoder, we removed the decoder, added a dropout layer (with p = 0.5) after each hidden layer, and then added a fine-tuning (logistic regression) layer. The final DNN model had the same architecture as described in the previous section and was fine-tuned on the target domain data with training parameters lr = 0.002, batch_size = 10, and max_iter = 100.

(3) Domain adaptation

Domain adaptation is a class of transfer learning methods that improve machine learning performance on the target domain by adjusting for the distribution discrepancy across domains41,42. We adopted the Contrastive Classification Semantic Alignment (CCSA) method43 for domain adaptation. The CCSA method is particularly suitable for our transfer learning tasks because: (1) it can significantly improve target domain prediction accuracy using very few labeled target samples for training; and (2) it includes semantic alignment in training and therefore can handle the domain discrepancy in both marginal and conditional distributions. Because the CCSA method calculates the pairwise Euclidean distance between samples in the embedding space, we applied an L2 norm transformation to the features of each patient such that for patient i, \(\sum_{j=1}^{n} x_{ij}^2 = 1\), where n is the number of features. CCSA minimizes the loss function \(L_{CCSA}(f) = (1 - \gamma)L_C(h \circ g) + \gamma\left(L_{SA}(h) + L_S(g)\right)\), where \(f = h \circ g\) is the target function, g is an embedding function that maps the input X to an embedding space Z, h is a function that predicts the output labels from Z, \(L_C(f)\) denotes the classification loss (binary cross-entropy) of function f, \(L_{SA}(h)\) is the semantic alignment loss of function h, \(L_S(g)\) is the separation loss of function g, and γ is the weight used to balance the classification loss against the contrastive semantic alignment loss \(L_{SA}(h) + L_S(g)\). Here \(L_{SA}(h) = \frac{1}{n}\sum_{y_i^s = y_j^t} \frac{1}{2}\lVert g(x_i^s) - g(x_j^t)\rVert^2\) and \(L_S(g) = \frac{1}{n}\sum_{y_i^s \ne y_j^t} \frac{1}{2}\max\left(0, m - \lVert g(x_i^s) - g(x_j^t)\rVert\right)^2\), where \(\lVert\cdot\rVert\) is the Euclidean distance and m is the margin that specifies the separability of the two domain features in the embedding space43. During training, we set the parameters m = 0.3, momentum = 0.9, batch_size = 20, learning_rate = 0.01, and max_iter = 100. We used one hidden layer with 100 nodes for semantic alignment and added a dropout layer (p = 0.5) after the hidden layer for classification.
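
A minimal sketch of the pretrain-then-fine-tune procedure in fine-tuning algorithm 1, again written in PyTorch for illustration (the model and data loaders are assumed to follow the DNN sketch above; regularization terms are omitted for brevity):

```python
import torch

def pretrain_then_finetune(model, source_loader, target_loader,
                           lr_pretrain=0.01, lr_finetune=0.002,
                           epochs_pretrain=100, epochs_finetune=100):
    """Pretrain on the EA source domain, then fine-tune all weights on the AA target domain."""
    def run(loader, lr, epochs):
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, nesterov=True)
        for _ in range(epochs):
            for X, y in loader:
                opt.zero_grad()
                loss = torch.nn.functional.binary_cross_entropy(model(X), y)
                loss.backward()
                opt.step()

    run(source_loader, lr_pretrain, epochs_pretrain)  # M ~ f(Y_Source | X_Source)
    run(target_loader, lr_finetune, epochs_finetune)  # M' = fine_tuning(M | Y_Target, X_Target)
    return model
```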

Differential expression analysis

For each learning task, we performed a permutation-based t-test on the input features to select the proteins or mRNAs that were differentially expressed between the AA and EA groups. The mRNAs and proteins with a feature-wise p value < 0.05 were selected as differentially expressed features between the two ethnic groups.
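
A sketch of a feature-wise permutation t-test, assuming expression vectors for one feature in the two groups (the number of permutations is illustrative):

```python
import numpy as np

def permutation_t_test(x_aa, x_ea, n_perm=10000, seed=0):
    """Two-sample permutation t-test for one feature (a sketch).

    Returns the two-sided p value of the observed Welch t statistic under
    random relabeling of group membership.
    """
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x_aa, x_ea])
    n_aa = len(x_aa)

    def t_stat(a, b):
        return (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))

    t_obs = t_stat(x_aa, x_ea)
    t_null = np.empty(n_perm)
    for i in range(n_perm):
        perm = rng.permutation(pooled)
        t_null[i] = t_stat(perm[:n_aa], perm[n_aa:])
    return float(np.mean(np.abs(t_null) >= np.abs(t_obs)))
```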

Logistic regression

For each learning task, we fit two multivariate logistic regression models: \(Y^{AA} = 1/(1 + e^{ - {\boldsymbol\beta} ^{\boldsymbol{AA}} \cdot X^{AA}})\), \(Y^{EA} = 1/(1 + e^{ - {\boldsymbol\beta} ^{\boldsymbol{EA}} \cdot X^{EA}})\), for the AA group and the EA group, respectively, to calculate the regression parameters for each ethnic group.

Stratified cross-validation and training/testing data for machine learning experiments

For each learning task, we applied threefold stratified cross-validation44. For mixture learning, samples were stratified by clinical outcome and genetic ancestry in the threefold data splitting, so that each fold had the same distribution over clinical outcome classes (positive and negative) and ethnic groups (EA and AA). Both AA and EA samples in the training set were used to train a deep-learning model; the performance of Mixture 0 was measured on the whole testing set, the performance of Mixture 1 on the EA samples in the testing set, and the performance of Mixture 2 on the AA samples in the testing set. For independent learning, EA (Independent 1) and AA (Independent 2) samples were separated and then stratified by clinical outcome in the threefold data splitting; the cross-validation was performed for the two ethnic groups separately. For transfer learning, EA and AA samples were separated and AA samples were stratified by clinical outcome (as in Independent 2); we used all the EA (source domain) samples for initial model training, used the AA training samples for fine-tuning or domain adaptation, and evaluated performance on the AA testing samples. The ethnic compositions of the training and testing data for the six types of machine learning experiments are shown in Table 1.
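
A sketch of the mixture learning splits, stratifying jointly on clinical outcome and ancestry with sklearn (variable names are hypothetical; the independent and transfer learning splits would stratify on outcome within each group separately):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def mixture_learning_splits(y, ancestry, n_splits=3, seed=0):
    """Yield threefold stratified train/test indices for the mixture learning scheme (a sketch).

    Each fold preserves the joint distribution over outcome classes (positive/negative)
    and ethnic groups (EA/AA). 'ancestry' is an array of group labels such as 'EA'/'AA'.
    """
    joint = np.char.add(np.asarray(y, dtype=str), np.asarray(ancestry, dtype=str))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(np.zeros((len(joint), 1)), joint):
        # Mixture 0 evaluates on all test samples; Mixture 1 and Mixture 2
        # restrict test_idx to the EA and AA samples, respectively.
        yield train_idx, test_idx
```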

Machine learning performance evaluation

The main utility of a performance metric in this work is to compare the relative performance of the multiethnic machine learning schemes. We used the area under the ROC curve45 (AUROC) to evaluate the performance of the machine learning models. Another widely used machine learning performance metric is the area under the precision–recall curve46 (AUPR). It has been mathematically proven that the performance ranks of two models remain the same in ROC space and PR space47. However, linear interpolation in precision–recall space is problematic and may lead to inaccurate calculation of AUPR for datasets with small sample sizes47. AUROC is therefore a more robust metric for evaluating machine learning performance on minority ethnic groups that have fewer cases.

Synthetic data generator

We developed a mathematical model to generate synthetic data for the multiethnic machine learning experiments. The simulated cohort consists of two ethnic groups. The degree of data inequality is controlled by the parameters n1 and n2, which represent the numbers of individuals in the two ethnic groups. We used the ssizeRNA package48 to generate the feature matrix xij. The number of differentially expressed features (nde) is the parameter controlling the marginal distribution (P(X)) discrepancy between the two ethnic groups. For individual i in ethnic group k, the label \(y_i^k\) was generated using the logistic regression function: \(y_i^k = \begin{cases} 1 & \text{if } z_i^k > c^k \\ -1 & \text{otherwise} \end{cases}\), where \(z_i^k = \frac{1}{1 + e^{-\sum_{j=1}^{n}\beta_j^k x_{ij}}}\), xij is the jth feature of individual i, \(\beta_j^k \in \{-1, 1\}\) represents the effect of feature j on the label of ethnic group k, and ck is the threshold for assigning a sample to the positive or negative category. A pair \((\beta_j^1, \beta_j^2)\) has four possible combinations representing the difference or similarity of the effect of feature j on the clinical outcome for the patients in the two ethnic groups. The numbers of features associated with the four combinations are denoted as n−1,−1, n−1,1, n1,−1, and n1,1, respectively. These parameters control the conditional distribution (P(Y|X)) discrepancy between the two ethnic groups. Using this model, we can generate synthetic datasets with or without data inequality and/or distribution discrepancy between two ethnic groups by setting the parameter values. These parameters can also be estimated from a real dataset. For example, we generated Synthetic Data 1 using the parameters estimated from the data for the learning task PanGyn-AA/EA-mRNA-DFI-5YR. We set n1 and n2 equal to the numbers of EA and AA patients in the real data, respectively. We estimated the parameter nde using permutation-based t-tests (feature-wise p value < 0.05). The total number of features for the learning task PanGyn-AA/EA-mRNA-DFI-5YR was 200. We used multivariate logistic regression to calculate the regression parameters βAA and βEA. We let \(\beta_j^1 = \begin{cases} 1 & \text{if } \beta_j^{EA} > \mathrm{median}(\boldsymbol{\beta}^{EA}) \\ -1 & \text{otherwise} \end{cases}\) and \(\beta_j^2 = \begin{cases} 1 & \text{if } \beta_j^{AA} > \mathrm{median}(\boldsymbol{\beta}^{AA}) \\ -1 & \text{otherwise} \end{cases}\), and then calculated n−1,−1, n−1,1, n1,−1, and n1,1. The parameters used to generate Synthetic Data 1–4 are shown in Table 3.
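
The label-generation step of this model can be sketched as follows (the feature matrix X would come from an ssizeRNA-style simulation as described; the function below is an illustration, not the authors' generator):

```python
import numpy as np

def simulate_labels(X, beta, c):
    """Generate synthetic labels for one ethnic group (a sketch of the label model above).

    X    : (n_individuals, n_features) simulated expression matrix
    beta : group-specific effect vector with entries in {-1, 1}
    c    : threshold c^k on the logistic score for the positive class
    """
    z = 1.0 / (1.0 + np.exp(-(X @ beta)))  # z_i^k in the text
    return np.where(z > c, 1, -1)          # y_i^k = 1 if z_i^k > c^k, else -1

# Data inequality is simulated by choosing unequal group sizes n1 and n2;
# conditional distribution discrepancy by using different beta vectors for the two groups.
```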

Table 3 Parameters used to generate the synthetic data.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.