Research Article

A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning

Number: Advanced Online Publication, March 6, 2026

Abstract

Accurate identification of gene-derived versus intergenic regions is a fundamental prerequisite for downstream genomic analyses, yet distinguishing these sequence types remains challenging when only short DNA windows are available. In this study, a scalable machine-learning framework was developed that integrates canonical k-mer representations with robust classifiers to discriminate 300 bp windows extracted from the Drosophila melanogaster genome. A balanced dataset of 1,000 gene-derived and 1,000 intergenic windows was encoded using canonical 3-mer and 4-mer frequencies combined with GC-content, yielding a 169-dimensional feature matrix. Logistic Regression, Random Forest, and Gradient Boosting models were evaluated using GroupKFold cross-validation to prevent gene-family leakage. All models achieved consistently high performance, with Gradient Boosting attaining the best overall results (Accuracy = 0.865, F1 = 0.868, MCC = 0.731, AUROC = 0.932, AUPRC = 0.918). SHAP-based feature attribution revealed that the GCC motif (mean |SHAP| = 0.50) and GC-content (0.48) were the most influential predictors, indicating that both specific short motifs and broader compositional patterns provide strong discriminative signals between genic and intergenic windows. Baseline comparisons demonstrated that alignment-based BLAST performed poorly on this task (Accuracy = 0.503), while a minimal 1D-CNN achieved performance comparable to classical machine-learning models, underscoring the efficiency and competitiveness of k-mer–based representations. Overall, the findings show that canonical k-mer features, when coupled with well-calibrated machine-learning models, offer an accurate, interpretable, and computationally efficient strategy for short-window genomic classification. This framework holds promise for improving large-scale genome annotation pipelines and may be extended to diverse taxa, metagenomic data, and real-time bioinformatics workflows.
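The 169-dimensional encoding described in the abstract can be illustrated with a minimal sketch. It assumes that a canonical k-mer is the lexicographically smaller of a k-mer and its reverse complement, that frequencies are counts normalized by the number of valid k-mers in the window, and that features are ordered as 3-mers, then 4-mers, then GC-content. None of these details are stated in the abstract, and the function names are hypothetical; this is not the author's implementation.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Assumed convention: the lexicographically smaller of a k-mer
    and its reverse complement represents both strands."""
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(k):
    """Sorted list of all distinct canonical k-mers of length k."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


# 32 canonical 3-mers + 136 canonical 4-mers + GC-content = 169 features.
KMER_NAMES = {k: canonical_kmers(k) for k in (3, 4)}
FEATURE_NAMES = KMER_NAMES[3] + KMER_NAMES[4] + ["GC"]


def encode_window(seq):
    """Encode one DNA window as canonical 3-mer and 4-mer frequencies
    followed by GC-content. K-mers containing non-ACGT symbols are skipped."""
    seq = seq.upper()
    features = []
    for k in (3, 4):
        counts = dict.fromkeys(KMER_NAMES[k], 0)
        total = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
                total += 1
        # Assumed normalization: relative frequency within the window.
        features.extend(counts[name] / total if total else 0.0
                        for name in KMER_NAMES[k])
    valid = sum(seq.count(base) for base in "ACGT")
    gc = (seq.count("G") + seq.count("C")) / valid if valid else 0.0
    features.append(gc)
    return features
```

The feature count follows from strand symmetry: no 3-mer equals its own reverse complement, so the 64 possible 3-mers collapse to 32, while 16 of the 256 possible 4-mers are palindromic, giving (256 - 16) / 2 + 16 = 136. Under the convention assumed here, the GCC motif highlighted by the SHAP analysis also represents its reverse complement GGC.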

Keywords

Supporting Institution

This study was not supported by any institution or organization.

Project Number

No project number.

Ethical Statement

This study was conducted using publicly available data and therefore does not require ethical approval.

Acknowledgments

The author has no acknowledgments to declare.

Details

Primary Language

English

Subjects

Agricultural Biotechnology (Other)

Journal Section

Research Article

Early Pub Date

March 6, 2026

Publication Date

March 6, 2026

Submission Date

August 16, 2025

Acceptance Date

December 30, 2025

Published in Issue

Year 2026 Number: Advanced Online Publication

APA
Yıldız, B. İ. (2026). A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning. Kahramanmaraş Sütçü İmam Üniversitesi Tarım Ve Doğa Dergisi, Advanced Online Publication, 1127-1136. https://doi.org/10.18016/ksutarimdoga.vi.1766666



KSU Journal of Agriculture and Nature

e-ISSN: 2619-9149