Abstract
Chronic kidney disease (CKD) is a complex disorder that causes a gradual loss of kidney function, affecting approximately 9.1% of the world's population. Here, we use a soft-clustering algorithm to deconstruct its genetic heterogeneity. First, we selected 322 CKD-associated independent genetic variants from published genome-wide association studies (GWAS) and added association results for 229 traits from the GWAS catalog. We then applied nonnegative matrix factorization (NMF) to discover overlapping clusters of related traits and variants. We computed cluster-specific polygenic scores and validated each cluster with a phenome-wide association study (PheWAS) on the BioMe biobank (n = 31,701). NMF identified nine clusters that reflect different aspects of CKD, with the top-weighted traits signifying areas such as kidney function, type 2 diabetes (T2D), and body weight. For most clusters, the top-weighted traits were confirmed in the PheWAS analysis. Results were found to be more significant in the cross-ancestry analysis, although significant ancestry-specific associations were also identified. While all alleles were associated with a decreased kidney function, associations with CKD-related diseases (e.g., T2D) were found only for a smaller subset of variants and differed across genetic ancestry groups. Our findings leverage genetics to gain insights into the underlying biology of CKD and investigate population-specific associations.
Similar content being viewed by others
Introduction
Chronic kidney disease (CKD) is a primarily asymptomatic disease characterized by a gradual loss of kidney function over a period extending from several months to years1. CKD affects approximately 9.1% of the global population, with a higher prevalence in high-income countries2. The leading risk factors for developing CKD are diabetes (40% of cases) and hypertension (29% of cases), followed by heart disease, family history of CKD, and obesity3. Other factors, such as exposure to HIV and contaminants, are additionally important in low-income countries4,5. Genetic ancestry also plays a crucial role, with increased risk rates of kidney failure in Black/African Americans and Hispanics/Latinos compared to individuals of European ancestry6. If left untreated, CKD increases the mortality risk for individuals with cardiovascular disease (CVD) and can result in the complete loss of kidney function7. Therefore, early detection is critical for improving quality of life and life expectancy. During the early stages of CKD, cost-effective treatment options are available and can be tailored to the cause of the disease8.
CKD is defined by a reduced functionality of the kidneys, which limits its filtering capability over a period of at least three months9. The main biomarkers for CKD detection include the urinary albumin/creatinine ratio (ACR) and the estimated glomerular filtration rate (eGFR)1. While ACR facilitated diagnosing albuminuria—an indicator of kidney damage characterized by an elevated excretion of urinary albumin—the eGFR estimates the filter volume of the glomerulus per unit of time using different biomarkers such as serum creatinine1. An abnormal kidney activity is indicated by high ACR values, reduced eGFR, or both.
Over the past few decades, many large-scale genomic studies, such as genome-wide association studies (GWAS), have successfully identified more than 500 independent genetic variants associated with reduced kidney function10,11,12. The association between genetic variants and various phenotypes has been studied, and the results are often shared in publicly available databases, like the GWAS Catalog13. The association of one genetic variant with multiple traits can be considered to identify secondary traits associated with a phenotype. This understanding can help elucidate potentially shared disease mechanisms, assuming that genetic variants affecting a shared pathway have a similar impact on the associated traits.
Soft-clustering methods provide a means to reduce the genetic complexity of a heterogeneous disease while also accounting for shared disease mechanisms. In contrast to hard-clustering approaches like K-means or hierarchical clustering, soft-clustering enables the factorization of high-dimensional data by identifying overlapping clusters14. Non-negative matrix factorization (NMF) is a family of algorithms within multivariate analysis that addresses the dimensionality challenge by extracting meaningful features from a given data set15,16. Using this approach, different genetically driven subtypes have been identified for Type 2 Diabetes in the past and were even associated with differences in the clinical outcomes17,18,19,20.
In this study, we aimed to deconstruct the heterogeneity of CKD by identifying its genetic subtypes. First, we collect all published variant-trait associations for variants associated with reduced kidney function and apply soft-clustering using NMF. We used the algorithm’s weights to calculate cluster-specific polygenic scores (cPGS) within the BioMe biobank. Finally, we use a phenome-wide association study (cPGS-PheWAS) to validate and interpret the clusters. By deconstructing the complexity of CKD, this methodology contributes new insights into the disease pathways of CKD and enhances our understanding of population-specific differences for CKD.
Results
NMF identified nine clusters of CKD-associated variants
We identified 508 independent genetic variants associated with decreased kidney function from the literature. Then, we retrieved all proxy SNPs in linkage disequilibrium with the identified variants and linked them to 805 associated traits using the GWAS Catalog database. After multiple filtering steps, including GWAS sample size thresholds and correction of GWAS p-values for multiple associations, we constructed a final matrix of trait-variant associations, which included 322 variants and 229 associated traits. We applied NMF to factorize the variants-traits association matrix X into a traits matrix W and variants matrix H with a shared number of clusters, k (Fig. 1). We identified nine clusters of CKD-associated variants using a hypothesis-free approach.
The most frequent CKD-associated secondary traits retrieved from the GWAS Catalog are related to kidney function (e.g., blood urea nitrogen, urea, uric acid, and cystatin C measurements), hemoglobin levels (e.g., hemoglobin measurements, hematocrit, and erythrocyte counts), T2D, body weight (e.g., body height, appendicular lean mass, BMI, BMI-adjusted waist-hip ratio), and pulse pressure (systolic and diastolic blood pressure measurements), among others (see Fig. S1). CKD-associated traits and their associated CKD variants were factorized into nine partly overlapping clusters by conducting NMF. To ensure the results were robust, we repeated the clustering with Bayesian NMF (bNMF) and got comparable results (Table S1). The top seven traits per cluster are summarised in Fig. 2. The ‘Reduced lipids’ cluster is associated with decreasing blood lipid levels (triglycerides, total cholesterol, use of lipid-lowering medications) and liver enzymes. The top traits of the cluster ‘Increased body mass’ show a positive association with body weight (appendicular lean mass, body height, and body weight). The clusters ‘Increased blood volume’ and ’Reduced blood volume’ are positively and negatively associated with volemic traits (e.g., mean corpuscular volume and mean corpuscular hemoglobin), respectively.
Similarly, clusters ‘Increased/Reduced hematocrit’ show opposite associations with hemoglobin content (e.g., hematocrit, hemoglobin measurements, red blood cell density, erythrocyte count), and clusters ‘Increased/Reduced inflammation’ convey opposite associations with markers of inflammation (C-reactive protein) and blood lipids. Lastly, cluster ‘Increased urate’ is positively associated with kidney function biomarkers like urate, blood/serum urea nitrogen, blood proteins, and Cystatin C. The complete lists of the top features and variants per cluster, defined as traits and variants in the top decile of the cluster weights of the matrices H and W, are listed in Table S2. The matrices H and W are also available as supplementary material. Figure S2 summarises how the variants are distributed in each cluster, showing their overlaps.
Comparing three different pathway analysis approaches, we could identify significantly enriched pathways (q-value < 0.05) based on Ingenuity Pathway Analysis (IPA) for four of the nine identified clusters (Fig. S3). Overall, only two of the 22 identified enriched clusters were enriched in more than one of the clusters, namely, the myelination signaling pathway (enriched for the ‘Increased hematocrit’ and ‘Reduced urate’ clusters) and the estrogen receptor signaling pathway (enriched for the ‘Reduced hematocrit’ and ‘Reduced urate’ clusters). For some of the enriched pathways, we could identify an association with the top-weighted traits of the corresponding clusters, suggesting their biological plausibility. For instance, the IL-12 Signaling and Production in Macrophages is significantly enriched (q-value = 0.02) for the top-weighted genes of the increased inflammation cluster. IL-12 is a pro-inflammatory cytokine that, in the past, has been associated with multiple immune-mediated diseases21. For the ‘Increased hematocrit’ cluster, characterized by increased hematocrit, hemoglobin, red blood cell density measures, and erythrocyte count, BMP signaling was one of the significantly enriched pathways (q-value = 0.03). Bone Morphogenic Protein (BMP) has been implicated with hematopoiesis22. The top associated genes of the ‘Increased urate cluster’ suggest alterations of the metabolomic and renal functions. These suggestions can be validated by the Xenobiotic Metabolism AHR Signaling pathway, which can be activated in CKD patients due to the accumulation of uremic toxins and can promote renal fibrosis23.
PheWAS validated biological pathways of most clusters across ancestries
Each cluster was examined by conducting a cPGS-PheWAS with 988 quantitative traits and 832 binary traits on the four BioMe cohorts (ALL, AFR, AMR, EUR). Except for the ‘Increased body mass’ (abbreviated as ‘IBM’), ‘Reduced lipids’, and the ‘Reduced inflammation’ clusters, we validated at least 2 of the clusters’ top 5 traits (Fig. 3). The level of significance was reached more frequently across ancestries (ALL) than when validating on the individual ones (AMR, AFR, EUR) (Table S3). In addition to the validated traits, significant associations with decreasing eGFR were seen in clusters ‘Increased urate’ (β = − 0.04 [− 0.06 to − 0.03], p-value = 6.7e−09) and ‘Reduced hematocrit’ (β = − 0.05 [− 0.07 to − 0.04], p-value = 4.0e−12) (Table S3). ‘Reduced hematocrit’ was also nominally associated with an increased risk for chronic renal failure (OR = 1.11 [1.05–1.16], p-value = 1.2e−04) and with the curated phenotype ‘diabetic and hypertensive CKD’ (OR = 1.27 [1.11–1.46], p-value = 6.6e−04). Besides showing negative associations with disorders of lipoid metabolism, cluster ‘Increased inflammation’ shows strong negative associations with Alzheimer’s disease (OR = 0.60 [0.52–0.7], p-value = 1.5e−11) and dementias (OR = 0.77 [0.71–0.84], p-value = 1.0e−09) (Fig. S4). Regarding the individual ancestries, EUR showed the strongest associations when validating on binary traits, with an increased risk for “visual disturbances” (OR = 1.51 [1.27–1.79], p-value = 2.1e−06) in the cluster ‘Reduced inflammation,’ while AFR showed the strongest associations when validating on quantitative traits, with the strongest association being for the LDL-HDL ratio (β = − 0.14 [− 0.17 to − 0.11], p-value = 9.3e−21) in the ‘Increased inflammation’ cluster. Figure 3 summarizes which of the top traits of each cluster have been validated, while the complete list of cPGS-PheWAS results by ancestry is stored in Table S3.
cPGSs suggest apparent differences between genetic ancestries
We extracted the cluster weights of the W matrix and used them to calculate cluster-specific polygenic scores (cPGS) for participants of the BioMe cohort. Figure 5 shows the standardized polygenic score distributions for all NMF clusters across the BioMe cohort (ALL) and the individual continental populations EUR (n = 7447), AMR (n = 5336), and AFR (n = 5660). A normal distribution was observed for the cluster ‘Increased urate’ (EUR, AMR, and AFR; Anderson–Darling test, all p-values available in Table S5). Although polygenic scores are expected to have a normal distribution24, the other eight clusters present either a skewed tail (e.g., ‘Increased hematocrit’) or several peaks in their cPGS distributions (e.g., ‘Reduced inflammation’). As illustrated in Fig. 4, the peaks are caused by a few variants with relatively high cluster weights (the complete list of cluster weights for the top variants of each cluster is available in Table S2). For example, the top variant in cluster ‘Increased inflammation’ (rs429358, mapped gene: APOE) weighs 4.6, while the second one (rs17050272) weighs 0.2. In Fig. 5, we can also observe how this variant is more frequent in participants of inferred EUR ancestry. Similarly, the top variant of ‘Reduced inflammation’ (rs1260326, mapped gene: GCKR) weighs 5.7 and seems more frequent in the AFR population, while the second one (rs4418728) weighs 0.9. This unbalance in weight creates the three peaks of the distributions: the lower peak includes the scores of individuals without the top variant (0 copies), the middle one the heterozygous (1 copy), and the higher peak includes scores of participants with two copies of the top variant. Other ancestry-specific differences are visible in the distributions of four clusters and are significant when testing with the Mann–Whitney test (all p-values available in Table S4). This suggests that some variants appear with different frequencies in people that do not share similar ancestry: ‘Increased inflammation’ (all combinations), ‘Reduced inflammation’ (all combinations), ‘Reduced lipids’ (EUR vs. AFR), and ‘Increased body mass’ (EUR vs AFR and AFR vs AMR).
Distribution of participants across clusters
As the cPGS are calculated separately per cluster, each BioMe participant might have high polygenic scores in multiple clusters. Therefore, to understand the cluster overlap in terms of relative risk, we checked how many individuals belonged to the top decile of 1 or more clusters. 58% (18,431/31,701) of the whole BioMe cohort (ALL) were at high risk in at least 1 cluster. Of these, 60.2% were in the top decile for only 1 cluster, while 37% were at risk for 2–3 clusters (Fig. S5).
Discussion
CKD is typically defined as a progressive loss of kidney function over time. Although numerous genetic variants have been identified as associated with CKD, their relationship to disease pathways remains largely unclear. The work described here is the most comprehensive assessment of how variants associated with CKD can be grouped according to different CKD-related factors. Specifically, we included variant-trait associations of 322 CKD SNPs and 229 related metabolic traits from publicly available GWAS datasets. By analyzing these associations with NMF, a factorization approach that allows for minimal overlap between groups, we identified 9 clusters of CKD variants and associated traits.
CKD is commonly recognized as a heterogeneous condition with various underlying causes and risk factors, which are unlikely to represent a single disease process. This complexity is also reflected by the associated traits retrieved from published GWAS, which are related to kidney function, hemoglobin levels, T2D, body weight, and pulse pressure, among others. Attempting to deconvolute CKD's genetic heterogeneity and differentially grouping these traits, the nine clusters we identified represented different aspects of CKD. For example, the ‘Increased urate’ cluster, whose clustering weights represent abnormal levels of urinary metabolites like urate, blood/serum urea nitrogen, blood proteins, and Cystatin C, is related to decreasing kidney function. In normal conditions, such blood metabolites are excreted by the kidneys, but in CKD, they accumulate and exert a detrimental biological activity25,26. A second cluster, which we summarised as ‘Increased inflammation,’ was strongly clustered around rising serum C-reactive protein (CRP) concentrations. CRP is a common inflammatory biomarker in chronic diseases like CKD, diabetes, and cardiovascular diseases27,28,29. In line with that, patients with CKD commonly experience chronic inflammatory states30. These states tend to worsen as the disease progresses toward end-stage renal disease and are reflected, or even modulated31, by increasing CRP levels32,33,34. For these two pathways, we identified significantly enriched pathways using the top 25% weighted genes in IPA that are in line with the described top-weighted traits. Overall, the difference in significantly enriched pathways between the clusters suggests biological differences between them.
We then studied the genotype–phenotype correlation to demonstrate the utility of the clusters. We could validate most of the top-weighted features on quantitative traits (i.e., biomarkers), while the validation on binary traits (i.e., diagnoses) was less robust and required additional clinical interpretation. For the clusters of ‘Increased urate’ and ‘Increased inflammation,’ the top traits were confirmed by the PheWAS. CKD is also associated with dyslipidemia, which is comprised of high levels of triglycerides and LDL-cholesterol, and low levels of HDL-cholesterol and apolipoprotein A135. We could observe similar associations in clusters ‘Increased inflammation,’ ‘Reduced lipids,’ and ‘Reduced inflammation.’ Notably, we found multiple significant associations for cluster ‘Increased inflammation’ with reduced risk of dementias. Glycerophospholipids play an essential role in neural membranes36,37, and their levels are directly correlated with serum triglycerides and inversely correlated with total cholesterol and eGFR38.
A limitation of this study is the need for more genetic diversity in the GWAS Catalog, which mainly consists of studies performed on the European population. This European bias is well described in the literature and has important implications for disease risk prediction across global populations39. Despite this lack of genetic diversity, which affected the initial selection of variants used in the input matrix, we could still validate our results in BioMe, a biobank enriched for populations with non-European ancestries. We were most powered when jointly analyzing across ancestries (ALL), while signals validated in different ancestral groups with some group-specific differences. This result suggests that ancestry-specific studies are essential although most CKD risk factors converge across ancestral groups. Another two limitations are the filtering rules used to select traits and variants for the algorithm’s input matrix and the possible existence of non-additive interactions between risk factors that we did not consider in this study. Lastly, one of the input CKD studies, the PAGE study40, was also conducted using BioMe data. However, this should not impact the results since we are not looking at CKD case/control scenarios but at CKD subtypes.
Understanding the biological pathways that lead to CKD is essential to improve clinical management. For example, some clusters group similar traits but with opposite effect directions (e.g., ‘Increased hematocrit’ and ‘Reduced hematocrit’), while others suggest potentially protective effects (e.g., against dyslipidemia in cluster ‘Increased inflammation’). This behavior might indicate that CKD can affect the same metabolic pathways differently, confirming the genetic complexity of the disease. Additionally, the clusters have a limited degree of overlap, and as each represents a specific set of variants, participants might be high risk (i.e., in the top decile of the polygenic score) for more than one cluster. This additive disease model, similar to the mutational signatures in cancer, suggests a possible interplay of genetic susceptibility to multiple disease-causing mechanisms41.
In summary, by clustering genetic variants associated with CKD, we identified clusters with distinct trait associations, likely representing mechanistic pathways involved in CKD. We confirmed the validity of these clusters phenotypically. Further clinical investigations could explore whether individuals with a common disrupted pathway also share similar complications, a comparable rate of disease progression, or a different treatment response. In the future, classifying patients with CKD using their genotype may improve care by offering a more personalized and genetically informed clinical plan.
Methods
Trait-variants selection
We identified and aligned the alleles of 508 independent genetic variants associated either with decreased kidney function (defined as low eGFR levels for at least three months) or with CKD (using ICD-9/10 codes) from the most recent GWAS and GWAS meta-analyses10,11,12, 40, 42, 43 (Fig. 1a). We then used the R package LDlinkR (R version 4.2.1) to retrieve all proxy SNPs in linkage disequilibrium (r2 > = 0.6) with the lead variants, across all available 1000G human populations44,45. We used the GWAS Catalog database to link the proxy SNPs to 805 associated traits (as of July 30th, 2022)13. We excluded gender-specific GWAS and GWAS performed on less than 100 individuals. Additionally, as we are interested in secondary features associated with CKD, we excluded GWAS of traits directly related to eGFR or CKD (e.g., “Mild to moderate chronic kidney disease,” “Estimated Glomerular Filtration Rate”). We kept trait-variant associations with a significance threshold of less than 1 × 10−6 using a Bonferroni correction for all 2401 associations in our data set. To reduce sparsity in the data, we excluded traits associated with less than five variants; this threshold was empirically defined by comparing the clustering results of traits associated with up to 15 CKD variants. We standardized effect sizes across all GWAS by dividing the regression coefficient beta (B) by the standard error, using the GWAS summary statistic results. Traits and variants were then arranged as a matrix with the standardized effect sizes (β) as values. Table S5 contains, for each input CKD variant, the list of CKD-associated secondary traits extracted from the GWAS Catalog and the corresponding exclusion criteria for those excluded during the filtering steps.
NMF
NMF factorizes the input matrix of trait-variant associations (X, of dimensions 229 × 322) into a matrix of traits (H, 229xK) and one of variants (W, Kx322), so that HxW ≈ X15 (Fig. 1b). The factorization rank K corresponds to the number of clusters. We implemented NMF using the R package ButchR with 10,000 iterations, 30 random initiations, and the convolution threshold set to 8046. The number of expected clusters was set between 2 and 20. ButchR suggests the optimal K based on six cluster evaluation metrics, like the mean silhouette width and the Frobenius error. If two or more K were presented, we considered results with the highest mean silhouette width and the lowest Frobenius error, as suggested by Alexandrov et al.41. As additional validation, we also performed a Bayesian version of NMF16, using the code provided by Udler et al.17. bNMF was run 1000 times with up to 200,000 iterations in each run.
Pathway analyses
To identify cluster-specific enriched pathways, we conducted three different pathway analysis approaches using the closest genes to the variants (n = 262). For gene set enrichment analysis (GSEA), using the clusterProfiler R package47, we tested for enrichment of the Hallmark, Ractome and KEGG48,49,50,51 gene pathways from MSigDB. Ranks of the genes were defined based on the weights from the variant-cluster matrix H. If a gene was annotated to multiple variants, we only considered the highest weights. Using the top 25% weighted SNPs and their corresponding annotated genes, we further conducted Ingenuity Pathway Analysis52 and Overrepresentation Analysis using the WEB-based GEne SeT AnaLysis (WebGestalt) toolkit53, focusing on the KEGG, Reactome, Panther, and Wiki pathways.
Cluster-specific polygenic scores
The results of clustering provide cluster-specific weights for each variant and trait. We used PLINK and the variant cluster weights to calculate cluster-specific polygenic scores (cPGS) of the BioMe biobank participants54. cPGS were standardized within each cluster. The normality of each cPGS distribution was tested with the Anderson–Darling method. Differences between ancestry-specific distributions were tested with the Mann–Whitney test.
Validation cohort (BioMe)
We validated our results using the genetic and linked electronic health records (EHR) data of 31,701 BioMe biobank participants55 (Fig. 1c). As a fine-scale population structure can improve the risk prediction of complex diseases within genetic groups56, we inferred the genetic ancestry of the BioMe participants. We then performed a Principal Component Analysis (PCA) using PLINK, excluding relatives above 2nd-degree (kinship method, estimated using KING57) and variants with minor allele frequency below 0.0554,58. We trained a random forest classifier to infer the genetic ancestry of BioMe participants using the 1000 Genomes labels as reference59. The labeled ancestries are Admixed American (AMR, n = 5336), African (AFR, n = 5660), European (EUR, n = 7447), South Asian (SAS, n = 613), and East Asian (EAS, n = 728). For sub-population-specific analyses, we removed participants with mixed ancestry (defined as having a random forest probability ≤ 0.5) and outliers by only including the quantiles 0.25–0.9060 (n = 11,404).
Modeling disease outcomes as a function of cluster-specific polygenic scores
For each cluster, the cPGS were associated with the phenotypes available in the BioMe data set by performing a phenome-wide association study (cPGS-PheWAS). We fitted linear regression models to analyze 988 quantitative traits (e.g., laboratory results) and logistic regression models for 832 binary traits with cPGS as independent variables, adjusting for sex, age, and the first ten genetic principal components (stats R package61). Binary traits included Phecodes mapped to ICD-9 and ICD-10 codes (a Phecode is considered if at least two relevant diagnostic codes were present in a patient’s EHR)62 and curated phenotypes63. Controls were identified as the reference category. Traits were only considered if present or measured in at least 100 biobank participants. The model parameters were standardized using the effectsize R package (refit method)64. Standardized coefficient estimates (linear regression) and odd ratios (logistic regression, defined as change of 1 SD in the PGS) were reported with the corresponding 95% confidence intervals. The Bonferroni method was used to adjust for multiple testing, and the alpha threshold was defined as 3.1e−06 (0.05/[9 * (988 + 832)]). We then compared the PheWAS results with the traits in the top decile of NMF’s trait weights.
Data availability
All publicly available data (input variants, trait-variant associations) used to support the findings of this study are included in this published article (and its Supplementary Information files) and are also available from the cited publications and GWAS Catalog. Additional data generated for the analysis steps, including source code and intermediate results, are available from the corresponding author upon reasonable request. The data used to validate the findings of this study are available from BioMe biobank (https://icahn.mssm.edu/research/ipm/programs/biome-biobank), but restrictions apply to their availability. To access the data, please reach out to biomebiobank@mssm.edu.
References
Cockwell, P. & Fisher, L.-A. The global burden of chronic kidney disease. Lancet 395, 662–664 (2020).
Bikbov, B. et al. Global, regional, and national burden of chronic kidney disease, 1990–2017: A systematic analysis for the Global Burden of Disease Study 2017. Lancet 395, 709–733 (2020).
Couser, W. G., Remuzzi, G., Mendis, S. & Tonelli, M. The contribution of chronic kidney disease to the global burden of major noncommunicable diseases. Kidney Int. 80, 1258–1270 (2011).
Jha, V. et al. Chronic kidney disease: Global dimension and perspectives. Lancet 382, 260–272 (2013).
Ekrikpo, U. E. et al. Chronic kidney disease in the global adult HIV-infected population: A systematic review and meta-analysis. PLoS One 13, e0195443 (2018).
Saran, R. et al. US Renal Data System 2015 Annual Data Report: Epidemiology of kidney disease in the United States. Am. J. Kidney Dis. 67, A7–A8 (2016).
Levey, A. S. & Coresh, J. Chronic kidney disease. Lancet 379, 165–180 (2012).
Ruggenenti, P., Cravedi, P. & Remuzzi, G. Mechanisms and Treatment of CKD. J. Am. Soc. Nephrol. 23, 1917–1928 (2012).
Levey, A. S. et al. Definition and classification of chronic kidney disease: A position statement from Kidney Disease: Improving Global Outcomes (KDIGO). Kidney Int. 67, 2089–2100 (2005).
Gorski, M. et al. Meta-analysis uncovers genome-wide significant variants for rapid kidney function decline. Kidney Int. 99, 926–939 (2021).
Teumer, A. et al. Genome-wide association meta-analyses and fine-mapping elucidate pathways influencing albuminuria. Nat. Commun. 10, 4130 (2019).
Morris, A. P. et al. Trans-ethnic kidney function association study reveals putative causal genes and effects on kidney-specific disease aetiologies. Nat. Commun. 10, 29 (2019).
Sollis, E. et al. The NHGRI-EBI GWAS Catalog: Knowledgebase and deposition resource. Nucleic Acids Res. 51, D977–D985 (2023).
Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
Paatero, P. & Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5, 111–126 (1994).
Févotte, C. & Idier, J. Algorithms for nonnegative matrix factorization with the β-divergence. Neural Comput. 23, 2421–2456 (2011).
Udler, M. S. et al. Type 2 diabetes genetic loci informed by multi-trait associations point to disease mechanisms and subtypes: A soft clustering analysis. PLoS Med. 15, e1002654 (2018).
Kim, H. et al. High-throughput genetic clustering of type 2 diabetes loci reveals heterogeneous mechanistic pathways of metabolic disease. Diabetologia 66, 495–507 (2023).
DiCorpo, D. et al. Type 2 diabetes partitioned polygenic scores associate with disease outcomes in 454,193 individuals across 13 cohorts. Diabetes Care 45, 674–683 (2022).
Chun, S. et al. Non-parametric polygenic risk prediction via partitioned GWAS summary statistics. Am. J. Hum. Genet. 107, 46–59 (2020).
Tait Wojno, E. D., Hunter, C. A. & Stumhofer, J. S. The immunobiology of the interleukin-12 family: Room for discovery. Immunity 50, 851–870 (2019).
Bhatia, M. et al. Bone morphogenetic proteins regulate the developmental program of human hematopoietic stem cells. J. Exp. Med. 189, 1139–1148 (1999).
Liu, J.-R. et al. Gut microbiota-derived tryptophan metabolism mediates renal fibrosis by aryl hydrocarbon receptor signaling activation. Cell. Mol. Life Sci. 78, 909–922 (2021).
Choi, S. W., Mak, T.S.-H. & O’Reilly, P. F. Tutorial: A guide to performing polygenic risk score analyses. Nat. Protoc. 15, 2759–2772 (2020).
Gagnebin, Y. et al. Exploring blood alterations in chronic kidney disease and haemodialysis using metabolomics. Sci. Rep. 10, 19502 (2020).
Vanholder, R. et al. Review on uremic toxins: Classification, concentration, and interindividual variability. Kidney Int. 63, 1934–1943 (2003).
Stenvinkel, P. & Lindholm, B. C-reactive protein in end-stage renal disease: Are there reasons to measure it?. Blood Purif. 23, 72–78 (2005).
Kalantar-Zadeh, K., Stenvinkel, P., Pillon, L. & Kopple, J. D. Inflammation and nutrition in renal insufficiency. Adv. Ren. Replace. Ther. 10, 155–169 (2003).
Hansson, G. K. Inflammation, atherosclerosis, and coronary artery disease. N. Engl. J. Med. 352, 1685–1695 (2005).
Stenvinkel, P. et al. Strong association between malnutrition, inflammation, and atherosclerosis in chronic renal failure. Kidney Int. 55, 1899–1911 (1999).
Li, Z. et al. C-reactive protein promotes acute renal inflammation and fibrosis in unilateral ureteral obstructive nephropathy in mice. Lab. Investig. 91, 837–851 (2011).
Stuveling, E. M. et al. C-reactive protein is associated with renal function abnormalities in a non-diabetic population. Kidney Int. 63, 654–661 (2003).
Menon, V. et al. C-reactive protein and albumin as predictors of all-cause and cardiovascular mortality in chronic kidney disease. Kidney Int. 68, 766–772 (2005).
Muntner, P. et al. The prevalence of nontraditional risk factors for coronary heart disease in patients with chronic kidney disease. Ann. Intern. Med. 140, 9 (2004).
Theofilis, P., Vordoni, A., Koukoulaki, M., Vlachopanos, G. & Kalaitzidis, R. G. Dyslipidemia in chronic kidney disease: Contemporary concepts and future therapeutic perspectives. Am. J. Nephrol. 52, 693–701 (2021).
Farooqui, A. A., Horrocks, L. A. & Farooqui, T. Glycerophospholipids in brain: Their metabolism, incorporation into membranes, functions, and involvement in neurological disorders. Chem. Phys. Lipids 106, 1–29 (2000).
Frisardi, V., Panza, F., Seripa, D., Farooqui, T. & Farooqui, A. A. Glycerophospholipids and glycerophospholipid-derived lipid mediators: A complex meshwork in Alzheimer’s disease pathology. Prog. Lipid Res. 50, 313–330 (2011).
Chen, H. et al. Combined clinical phenotype and lipidomic analysis reveals the impact of chronic kidney disease on lipid metabolism. J. Proteome Res. 16, 1566–1578 (2017).
Sirugo, G., Williams, S. M. & Tishkoff, S. A. The missing diversity in human genetic studies. Cell 177, 26–31 (2019).
Lin, B. M. et al. Genetics of chronic kidney disease stages across ancestries: The PAGE study. Front. Genet. 10, 494 (2019).
Alexandrov, L. B., Nik-Zainal, S., Wedge, D. C., Campbell, P. J. & Stratton, M. R. Deciphering signatures of mutational processes operative in human cancer. Cell Rep. 3, 246–259 (2013).
Stanzick, K. J. et al. Discovery and prioritization of variants and genes for kidney function in >1.2 million individuals. Nat. Commun. 12, 4350 (2021).
Lin, B. M. et al. Whole genome sequence analyses of eGFR in 23,732 people representing multiple ancestries in the NHLBI trans-omics for precision medicine (TOPMed) consortium. EBioMedicine 63, 103157 (2021).
Machiela, M. J. & Chanock, S. J. LDlink: A web-based application for exploring population-specific haplotype structure and linking correlated alleles of possible functional variants: Fig. 1. Bioinformatics 31, 3555–3557 (2015).
Myers, T. A., Chanock, S. J. & Machiela, M. J. LDlinkR: An R package for rapidly calculating linkage disequilibrium statistics in diverse populations. Front. Genet. 11, 157 (2020).
Quintero, A. et al. ShinyButchR: Interactive NMF-based decomposition workflow of genome-scale datasets. Biol. Methods Protoc. 5, bpaa022 (2020).
Wu, T. et al. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation 2, 100141 (2021).
Kanehisa, M. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).
Kanehisa, M., Furumichi, M., Sato, Y., Kawashima, M. & Ishiguro-Watanabe, M. KEGG for taxonomy-based analysis of pathways and genomes. Nucleic Acids Res. 51, D587–D592 (2023).
Milacic, M. et al. The reactome pathway knowledgebase 2024. Nucleic Acids Res. 52, D672–D678 (2024).
Liberzon, A. et al. The molecular signatures database hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Krämer, A., Green, J., Pollard, J. & Tugendreich, S. Causal analysis approaches in ingenuity pathway analysis. Bioinformatics 30, 523–530 (2014).
Liao, Y., Wang, J., Jaehnig, E. J., Shi, Z. & Zhang, B. WebGestalt 2019: Gene set analysis toolkit with revamped UIs and APIs. Nucleic Acids Res. 47, W199–W205 (2019).
Purcell, S. et al. PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
BioMe BioBank Program | Icahn School of Medicine. Icahn School of Medicine at Mount Sinai. https://icahn.mssm.edu/research/ipm/programs/biome-biobank
Belbin, G. M. et al. Toward a fine-scale population health monitoring system. Cell 184, 2068-2083.e11 (2021).
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Quality Control (QC) | Pan UKBB. https://pan-dev.ukbb.broadinstitute.org/docs/qc
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2013).
Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1111 (2013).
Nadkarni, G. N. et al. Development and validation of an electronic phenotyping algorithm for chronic kidney disease. AMIA Annu. Symp. Proc. AMIA Symp. 2014, 907–916 (2014).
Ben-Shachar, M., Lüdecke, D. & Makowski, D. effectsize: Estimation of effect size indices and standardized parameters. J. Open Source Softw. 5, 2815 (2020).
Acknowledgements
We would like to thank Miriam Udler, who kindly provided us with the script of the bNMF algorithm, which was used in this study to confirm the clusters identified by NMF. This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Additionally, this work was supported by the Office of Research Infrastructure of the National Institutes of Health under award number S10OD026880, which allowed us to use Mount Sinai Data Warehouse (MSDW) data. Regarding HPI.MS resources, funding was provided by the Hasso Plattner Foundation (HPF). Additionally, the research leading to these results has received funding from the Horizon 2020 Programme of the European Commission under Grant Agreement No. 826117 (Smart4Health). The Mount Sinai BioMe Biobank has been supported by The Andrea and Charles Bronfman Philanthropies and in part by Federal funds from the NHLBI and NHGRI (U01HG00638001; U01HG007417; X01HL134588). We thank all participants in the Mount Sinai BioMe Biobank. We also thank all of our recruiters who have assisted in data collection and management, and we are grateful for the computational resources and staff expertise provided by Scientific Computing at the Icahn School of Medicine at Mount Sinai.
Funding
Open Access funding enabled and organized by Projekt DEAL.
Author information
Authors and Affiliations
Contributions
Conceptualization, C.S., E.B.; Methodology, C.S., S.I.; Data analysis, A.E., S.I.; Writing of the original draft, A.E.; Tables and Figures, A.E.; Supervision, S.I., H.O.H., E.B.; Project administration, E.B.; Funding acquisition, E.B., H.O.H. All authors reviewed and edited the manuscript. All authors have read, discussed, and approved the manuscript, its analyses, and interpretations.
Corresponding author
Ethics declarations
Competing interests
Claudia Schurmann is a paid employee of Bayer AG, Pharmaceuticals. All other authors do not have any competing interest.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Eoli, A., Ibing, S., Schurmann, C. et al. A clustering approach to improve our understanding of the genetic and phenotypic complexity of chronic kidney disease. Sci Rep 14, 9642 (2024). https://doi.org/10.1038/s41598-024-59747-4
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-024-59747-4
Comments
By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.