Cancer subtype classification remains a critical challenge in precision oncology, because traditional histopathological methods often fail to capture molecular heterogeneity. This study evaluates machine learning approaches for accurate cancer subtype classification from high-dimensional gene expression profiles. We implemented and compared four machine learning algorithms, Support Vector Machines (SVM), Random Forest (RF), k-Nearest Neighbors (k-NN), and deep neural networks, using publicly available datasets from The Cancer Genome Atlas (TCGA) and Gene Expression Omnibus (GEO). Dimensionality reduction with Principal Component Analysis (PCA) and feature selection with the Least Absolute Shrinkage and Selection Operator (LASSO) were employed to enhance model performance and interpretability. Performance was evaluated using 10-fold cross-validation with accuracy, precision, recall, F1-score, and area under the ROC curve (AUC-ROC) as metrics. Random Forest achieved the highest classification accuracy of 94.7% (95% CI: 92.1-96.8%), followed by SVM at 93.2% (95% CI: 90.4-95.6%). LASSO feature selection identified 183 discriminative genes, and PCA reduced dimensionality by 99.75% while retaining 95% of the variance. The models identified biologically relevant gene signatures associated with cancer pathogenesis and treatment response. These findings demonstrate that machine learning algorithms achieve superior performance in cancer subtype classification compared to conventional approaches. The integration of dimensionality reduction and feature selection techniques enhances both computational efficiency and biological interpretability, supporting the clinical implementation of ML-based diagnostic tools in precision oncology.
Machine learning, cancer classification, gene expression profiling, precision oncology, bioinformatics, molecular diagnostics, support vector machines, random forest, deep learning
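The evaluation protocol summarized above (10-fold cross-validation scored by accuracy, precision, recall, and F1-score) can be illustrated with a minimal sketch. This is not the study's implementation: the function names stratified_folds and macro_metrics are hypothetical, stratification by subtype and macro-averaging across subtypes are assumptions that the abstract does not state, and the labels in the usage lines are toy values rather than TCGA or GEO data.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly equal class proportions.

    Assumption: folds are stratified by subtype label; the abstract only
    says "10-fold cross-validation".
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal indices round-robin so every fold receives a share of each class.
        for i, idx in enumerate(members):
            folds[(offset + i) % k].append(idx)
        offset += len(members)
    return folds


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1 (an assumed averaging)."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy usage: 50 samples from two hypothetical subtypes, split into 10 test folds.
toy_labels = ["A"] * 30 + ["B"] * 20
toy_folds = stratified_folds(toy_labels, k=10)
toy_scores = macro_metrics(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```

In a full evaluation, each fold would serve once as the held-out test set while a classifier is trained on the other nine, and the per-fold metrics would then be averaged. Feature selection and PCA would need to be fitted inside each training split to avoid leaking information from the held-out samples.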