Abstract
Advances in high-throughput techniques have increased the availability of omics data, such as mRNA expression, DNA methylation, microRNA, and proteomics. Integrating these data types offers a more comprehensive analysis than any single data type alone. Traditional methods such as sparse canonical correlation analysis (sCCA) and sparse partial least squares (sPLS) struggle with nonlinear, mixed-type data. Kernel-based methods can capture nonlinear structure but are prone to overfitting.
The random forest algorithm handles complex omics data effectively and reduces overfitting, but it is underused in data integration. We propose a novel framework that uses multivariate random forest (MRF) for variable selection and dimension reduction in multi-omics integration and cancer subtyping. We also introduce two MRF-based fusion methods that integrate multi-omics data for clustering.
In simulation studies, our framework is effective at variable extraction and subtype clustering. Applied to the TCGA-BRCA, TCGA-COAD, and TCGA-PAN cancer datasets, it shows promise in biomarker selection and provides new biological insights into cancer subtypes.