Abstract
The rapid growth of DNA and protein sequencing techniques over the last decade has increased the availability and scale of mutation data, creating a need for automated approaches to predicting driver mutations. Identifying driver mutations is essential for understanding and measuring cancer progression, and thus for enabling proper diagnosis and targeted treatment of cancer. Here, we present a scalable machine-learning-based approach to identifying driver missense mutations. The proposed approach requires no domain knowledge and uses no features derived from previously annotated data. It deploys a group of independent, parallel classifiers, each of which handles a single feature set. A model fusion module then combines the classifiers' outputs to produce the final mutation label. Each classifier is trained and validated independently on its corresponding feature set, and each feature set undergoes a feature selection process that filters out low-significance features. The current implementation of the framework uses four protein sequence-level feature sets: two amino acid index feature sets (AAIndex1 and AAIndex2), one pseudo amino acid composition (PseAAC) feature set, and one feature set generated using wavelet analysis. Because its components are designed to run in parallel, the approach can be extended with additional feature sets with minimal impact on computational complexity. Experiments were performed to assess the performance of the proposed approach and to compare it with an existing approach. In summary, the proposed framework achieved around 93% accuracy and an 86% Matthews correlation coefficient when tested using driver mutations collected from the CHASM data set and passenger mutations synthesized by the CHASM algorithm, and around 73% accuracy and a 46% Matthews correlation coefficient when tested using driver and passenger mutations collected from the CanProVar database.
We conclude that the performance of the proposed framework is comparable to that of the state-of-the-art CHASM software, and that it is therefore a promising approach for driver mutation prediction. A recommendation for further research is included.
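To make the architecture described above concrete, the following is a minimal Python sketch of one classifier per feature set, trained in parallel, followed by a fusion step and the two reported metrics. It is an illustration under stated assumptions, not the implementation evaluated here: the abstract does not specify the base classifier or the fusion rule, so a nearest-centroid classifier and a majority vote are used as stand-ins, the feature-set names are only labels, and all numbers are invented toy data. Threads stand in for whatever parallel backend is actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt


class CentroidClassifier:
    """Stand-in base classifier (assumption): assign the nearest class centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [min(sorted(self.centroids), key=lambda c: dist2(x, self.centroids[c]))
                for x in X]


def train_parallel(feature_sets, y):
    """Train one classifier per feature set, each independently of the others."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(CentroidClassifier().fit, X, y)
                   for name, X in feature_sets.items()}
        return {name: f.result() for name, f in futures.items()}


def fuse_majority(models, feature_sets):
    """Fusion rule (assumption): majority vote; ties are labeled driver (1)."""
    votes = [models[name].predict(X) for name, X in feature_sets.items()]
    return [1 if 2 * sum(sample) >= len(sample) else 0 for sample in zip(*votes)]


def accuracy_and_mcc(y_true, y_pred):
    """Accuracy and Matthews correlation coefficient for binary labels (1/0)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return (tp + tn) / len(pairs), mcc


# Invented toy data: 1 = driver, 0 = passenger; one row per mutation.
y = [1, 1, 1, 0, 0, 0]
feature_sets = {
    "aaindex1": [[0.9], [0.8], [0.7], [0.2], [0.1], [0.3]],
    "aaindex2": [[1.0, 0.1], [0.9, 0.2], [0.8, 0.1], [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]],
    "pseaac": [[0.6], [0.9], [0.2], [0.3], [0.4], [0.8]],  # deliberately noisy
    "wavelet": [[5.0], [4.0], [6.0], [1.0], [2.0], [0.0]],
}

models = train_parallel(feature_sets, y)
fused = fuse_majority(models, feature_sets)
acc, mcc = accuracy_and_mcc(y, fused)
print(f"accuracy={acc:.2f} MCC={mcc:.2f}")
```

In this sketch, adding a feature set means adding one entry to `feature_sets`; the other classifiers are not retrained, which mirrors the extensibility argument above. The toy evaluation reuses the training rows purely for brevity and says nothing about the reported results.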