A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning
Abstract
Accurate identification of gene-derived versus intergenic regions is a fundamental prerequisite for downstream genomic analyses, yet distinguishing these sequence types remains challenging when only short DNA windows are available. In this study, a scalable machine-learning framework was developed that integrates canonical k-mer representations with robust classifiers to discriminate 300 bp windows extracted from the Drosophila melanogaster genome. A balanced dataset of 1,000 gene-derived and 1,000 intergenic windows was encoded using canonical 3-mer and 4-mer frequencies combined with GC content, yielding a 169-dimensional feature matrix (32 canonical 3-mers, 136 canonical 4-mers, and one GC-content value). Logistic Regression, Random Forest, and Gradient Boosting models were evaluated using GroupKFold cross-validation to prevent gene-family leakage. All models achieved consistently high performance, with Gradient Boosting attaining the best overall results (Accuracy = 0.865, F1 = 0.868, MCC = 0.731, AUROC = 0.932, AUPRC = 0.918). SHAP-based feature attribution revealed that the GCC motif (mean |SHAP| = 0.50) and GC content (mean |SHAP| = 0.48) were the most influential predictors, indicating that both specific short motifs and broader compositional patterns provide strong discriminative signals between genic and intergenic windows. Baseline comparisons demonstrated that alignment-based BLAST performed no better than chance on this balanced task (Accuracy = 0.503), while a minimal 1D-CNN achieved performance comparable to classical machine-learning models, underscoring the efficiency and competitiveness of k-mer–based representations. Overall, the findings show that canonical k-mer features, when coupled with well-calibrated machine-learning models, offer an accurate, interpretable, and computationally efficient strategy for short-window genomic classification. This framework holds promise for improving large-scale genome annotation pipelines and may be extended to diverse taxa, metagenomic data, and real-time bioinformatics workflows.
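The 169-dimensional encoding described in the abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names (`canonical`, `canonical_vocabulary`, `encode_window`), the lexicographic choice of canonical form, the normalization of counts to frequencies, and the skipping of k-mers containing non-ACGT characters are all assumptions made for illustration. Under these assumptions there are 32 canonical 3-mers and 136 canonical 4-mers, which together with GC content give 169 features.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Return the reverse complement of an uppercase ACGT string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Assumed canonical form: the lexicographically smaller of a k-mer
    and its reverse complement (e.g. GGC and GCC both map to GCC)."""
    return min(kmer, reverse_complement(kmer))


def canonical_vocabulary(k: int) -> list[str]:
    """Sorted list of all distinct canonical k-mers of length k."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


# Feature order: canonical 3-mers, canonical 4-mers, then GC content.
VOCABS = {k: canonical_vocabulary(k) for k in (3, 4)}
FEATURE_NAMES = VOCABS[3] + VOCABS[4] + ["GC"]


def encode_window(seq: str) -> list[float]:
    """Encode one DNA window as canonical k-mer frequencies plus GC content.

    Assumption: counts are normalized by the number of valid k-mers in the
    window; k-mers containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    features: list[float] = []
    for k, vocab in VOCABS.items():
        counts = dict.fromkeys(vocab, 0)
        total = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
                total += 1
        features.extend(counts[v] / total if total else 0.0 for v in vocab)
    valid_bases = sum(seq.count(b) for b in "ACGT")
    gc = (seq.count("G") + seq.count("C")) / valid_bases if valid_bases else 0.0
    features.append(gc)
    return features


if __name__ == "__main__":
    window = "ACGT" * 75  # a hypothetical 300 bp window
    vector = encode_window(window)
    print(len(FEATURE_NAMES), len(vector))
```

Applying `encode_window` to each window and stacking the results row by row would give a matrix of the shape described in the abstract, which could then be passed to any standard classifier. Note that under this canonicalization a feature labelled GCC also counts occurrences of its reverse complement GGC.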
Details
Primary Language
English
Subjects
Agricultural Biotechnology (Other)
Journal Section
Research Article
Early Pub Date
March 6, 2026
Publication Date
March 6, 2026
Submission Date
August 16, 2025
Acceptance Date
December 30, 2025
Published in Issue
Year 2026 Number: Advanced Online Publication
