A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning

Berkant İsmail Yıldız

doi:10.18016/ksutarimdoga.vi.1766666



Abstract

Accurate identification of gene-derived versus intergenic regions is a fundamental prerequisite for downstream genomic analyses, yet distinguishing these sequence types remains challenging when only short DNA windows are available. In this study, a scalable machine-learning framework was developed that integrates canonical k-mer representations with robust classifiers to discriminate 300 bp windows extracted from the Drosophila melanogaster genome. A balanced dataset of 1,000 gene-derived and 1,000 intergenic windows was encoded using canonical 3-mer and 4-mer frequencies combined with GC-content, yielding a 169-dimensional feature matrix. Logistic Regression, Random Forest, and Gradient Boosting models were evaluated using GroupKFold cross-validation to prevent gene-family leakage. All models achieved consistently high performance, with Gradient Boosting attaining the best overall results (Accuracy = 0.865, F1 = 0.868, MCC = 0.731, AUROC = 0.932, AUPRC = 0.918). SHAP-based feature attribution revealed that the GCC motif (mean |SHAP| = 0.50) and GC-content (0.48) were the most influential predictors, indicating that both specific short motifs and broader compositional patterns provide strong discriminative signals between genic and intergenic windows. Baseline comparisons demonstrated that alignment-based BLAST performed poorly on this task (Accuracy = 0.503), while a minimal 1D-CNN achieved performance comparable to classical machine-learning models, underscoring the efficiency and competitiveness of k-mer–based representations. Overall, the findings show that canonical k-mer features, when coupled with well-calibrated machine-learning models, offer an accurate, interpretable, and computationally efficient strategy for short-window genomic classification. This framework holds promise for improving large-scale genome annotation pipelines and may be extended to diverse taxa, metagenomic data, and real-time bioinformatics workflows.
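The feature encoding described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function names (`canonical`, `canonical_vocab`, `encode_window`) are hypothetical, and it assumes that a canonical k-mer is the lexicographically smaller of a k-mer and its reverse complement, that k-mers containing non-ACGT characters are skipped, and that counts are normalised to frequencies within each k. Under those assumptions there are 32 canonical 3-mers and 136 canonical 4-mers, which together with GC-content gives the 169 dimensions reported.

```python
from collections import Counter
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_vocab(k: int) -> list[str]:
    """Enumerate all distinct canonical k-mers over the ACGT alphabet, sorted."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


# Fixed feature order: canonical 3-mers, then canonical 4-mers (assumed ordering).
VOCABS = {k: canonical_vocab(k) for k in (3, 4)}


def encode_window(seq: str) -> list[float]:
    """Encode one DNA window as canonical 3-mer and 4-mer frequencies plus GC-content."""
    seq = seq.upper()
    features: list[float] = []
    for k, vocab in VOCABS.items():
        counts: Counter[str] = Counter()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers with ambiguous bases such as N (an assumption of this sketch).
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
        total = sum(counts.values()) or 1  # avoid division by zero on empty input
        features.extend(counts[m] / total for m in vocab)
    gc_content = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    features.append(gc_content)
    return features


if __name__ == "__main__":
    window = "ACGT" * 75  # a synthetic 300 bp window, for illustration only
    vector = encode_window(window)
    print(len(VOCABS[3]), len(VOCABS[4]), len(vector))
```

Merging each k-mer with its reverse complement makes the encoding independent of which strand the window was read from, which is why the feature count is roughly half that of a plain k-mer encoding.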

Keywords

Supporting Institution

This study was not supported by any institution or organization.

Project Number

No project number.

Ethical Statement

This study was conducted using publicly available data and therefore does not require ethical approval.

Thanks

The author has no acknowledgments to declare.

Kanonik k-mer’ler ve Makine Öğrenmesi Kullanılarak Genomik Bölgelerin Sınıflandırılmasına Yönelik Biyoinformatik Bir Çerçeve

Abstract

Accurately distinguishing gene-derived (genic) from intergenic regions is a fundamental requirement for many stages of genomic analysis; however, separating these regions on the basis of short DNA windows alone remains a considerable challenge. In this study, a scalable framework combining canonical k-mer representations with robust machine-learning classifiers was developed to classify 300 bp sequences obtained from the Drosophila melanogaster genome. A balanced dataset of 1,000 gene-derived and 1,000 intergenic windows was used; the sequences were converted into a 169-dimensional feature matrix based on canonical 3-mer and 4-mer frequencies and GC content. GroupKFold cross-validation was applied so that similarity among gene families would not cause train-test leakage. The models generally performed well, and the best results were obtained by the Gradient Boosting classifier (Accuracy = 0.865, F1 = 0.868, MCC = 0.731, AUROC = 0.932, AUPRC = 0.918). SHAP-based feature importance analysis showed that the most influential features were the GCC motif (mean |SHAP| = 0.50) and GC content (0.48), indicating that both specific short motifs and broader compositional patterns carry strong signals for separating genic and intergenic regions. Comparative analyses showed that the alignment-based BLAST method performed poorly on this problem (Accuracy = 0.503), whereas a minimal 1D-CNN model reached accuracy levels similar to those of classical machine-learning methods. These results support the efficiency and competitiveness of the k-mer-based representation. Overall, the findings show that canonical k-mer features, when combined with well-calibrated machine-learning models, enable short DNA windows to be classified accurately, interpretably, and with low computational cost. This framework has considerable potential for improving large-scale genome annotation workflows and can be adapted to different taxa, metagenomic datasets, and real-time bioinformatics applications.
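The leakage-prevention idea behind group-wise cross-validation can be sketched in plain Python. The study names scikit-learn's GroupKFold; the function below is only a simplified stand-in, not that library's code or the author's pipeline. The name `group_k_fold` and the example family labels are hypothetical, and the greedy size-balancing rule is an assumption of this sketch. The key property is that every window sharing a group label (for example, a gene family) lands in the same fold, so no group appears in both training and test sets.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits=5):
    """Yield (train_idx, test_idx) pairs in which no group spans both sides."""
    # Collect the sample indices that belong to each group label.
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    if len(members) < n_splits:
        raise ValueError("need at least n_splits distinct groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(n_splits), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[label])

    # Each fold serves once as the test set; the rest form the training set.
    for f in range(n_splits):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for j in range(n_splits) if j != f for i in folds[j])
        yield train_idx, test_idx


if __name__ == "__main__":
    # Hypothetical gene-family labels for eight windows.
    families = ["fam1", "fam1", "fam2", "fam3", "fam3", "fam3", "fam4", "fam5"]
    for train_idx, test_idx in group_k_fold(families, n_splits=3):
        print("train:", train_idx, "test:", test_idx)
```

With an ordinary random split, two windows from the same gene family could fall on opposite sides of the split and inflate the reported scores. Grouping removes that shortcut.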



Details

Primary Language

English

Subjects

Agricultural Biotechnology (Other)

Journal Section

Research Article

Authors

Berkant İsmail Yıldız*
0000-0001-8965-6361
Türkiye

Early Pub Date

March 6, 2026

Publication Date

March 6, 2026

Submission Date

August 16, 2025

Acceptance Date

December 30, 2025

Published in Issue

Year 2026 Number: Advanced Online Publication

DOI

https://doi.org/10.18016/ksutarimdoga.vi.1766666

IZ

https://izlik.org/JA96GK94LJ

Cite


APA

Yıldız, B. İ. (2026). A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning. Kahramanmaraş Sütçü İmam Üniversitesi Tarım Ve Doğa Dergisi, Advanced Online Publication, 1127-1136. https://doi.org/10.18016/ksutarimdoga.vi.1766666

AMA

1. Yıldız Bİ. A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning. KSU J. Agric Nat. 2026;(Advanced Online Publication):1127-1136. doi:10.18016/ksutarimdoga.vi.1766666

Chicago

Yıldız, Berkant İsmail. 2026. “A Bioinformatics Framework for Genomic Region Classification Using Canonical K-Mers and Machine Learning”. Kahramanmaraş Sütçü İmam Üniversitesi Tarım Ve Doğa Dergisi, no. Advanced Online Publication: 1127-36. https://doi.org/10.18016/ksutarimdoga.vi.1766666.

EndNote

Yıldız Bİ (March 1, 2026) A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning. Kahramanmaraş Sütçü İmam Üniversitesi Tarım ve Doğa Dergisi Advanced Online Publication 1127–1136.

IEEE

[1] B. İ. Yıldız, “A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning”, KSU J. Agric Nat., no. Advanced Online Publication, pp. 1127–1136, Mar. 2026, doi: 10.18016/ksutarimdoga.vi.1766666.

ISNAD

Yıldız, Berkant İsmail. “A Bioinformatics Framework for Genomic Region Classification Using Canonical K-Mers and Machine Learning”. Kahramanmaraş Sütçü İmam Üniversitesi Tarım ve Doğa Dergisi. Advanced Online Publication (March 1, 2026): 1127-1136. https://doi.org/10.18016/ksutarimdoga.vi.1766666.

JAMA

1. Yıldız Bİ. A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning. KSU J. Agric Nat. 2026;:1127–1136.

MLA

Yıldız, Berkant İsmail. “A Bioinformatics Framework for Genomic Region Classification Using Canonical K-Mers and Machine Learning”. Kahramanmaraş Sütçü İmam Üniversitesi Tarım Ve Doğa Dergisi, no. Advanced Online Publication, Mar. 2026, pp. 1127-36, doi:10.18016/ksutarimdoga.vi.1766666.

Vancouver

1. Berkant İsmail Yıldız. A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning. KSU J. Agric Nat. 2026 Mar. 1;(Advanced Online Publication):1127-36. doi:10.18016/ksutarimdoga.vi.1766666