Research Article

A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning

Issue: Advanced Online Publication, 6 March 2026


Abstract

Accurate identification of gene-derived versus intergenic regions is a fundamental prerequisite for downstream genomic analyses, yet distinguishing these sequence types remains challenging when only short DNA windows are available. In this study, a scalable machine-learning framework was developed that integrates canonical k-mer representations with robust classifiers to discriminate 300 bp windows extracted from the Drosophila melanogaster genome. A balanced dataset of 1,000 gene-derived and 1,000 intergenic windows was encoded using canonical 3-mer and 4-mer frequencies combined with GC-content, yielding a 169-dimensional feature matrix. Logistic Regression, Random Forest, and Gradient Boosting models were evaluated using GroupKFold cross-validation to prevent gene-family leakage. All models achieved consistently high performance, with Gradient Boosting attaining the best overall results (Accuracy = 0.865, F1 = 0.868, MCC = 0.731, AUROC = 0.932, AUPRC = 0.918). SHAP-based feature attribution revealed that the GCC motif (mean |SHAP| = 0.50) and GC-content (0.48) were the most influential predictors, indicating that both specific short motifs and broader compositional patterns provide strong discriminative signals between genic and intergenic windows. Baseline comparisons demonstrated that alignment-based BLAST performed poorly on this task (Accuracy = 0.503), while a minimal 1D-CNN achieved performance comparable to classical machine-learning models, underscoring the efficiency and competitiveness of k-mer–based representations. Overall, the findings show that canonical k-mer features, when coupled with well-calibrated machine-learning models, offer an accurate, interpretable, and computationally efficient strategy for short-window genomic classification. This framework holds promise for improving large-scale genome annotation pipelines and may be extended to diverse taxa, metagenomic data, and real-time bioinformatics workflows.
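The feature encoding described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation. It assumes that each k-mer is collapsed with its reverse complement to the lexicographically smaller form, that counts are normalized to frequencies per value of k, and that k-mers containing ambiguous bases are skipped. The function names (`canonical`, `canonical_kmers`, `encode_window`) and the feature ordering are hypothetical. Under these assumptions there are 32 canonical 3-mers and 136 canonical 4-mers, which together with GC-content give the 169 dimensions reported above.

```python
from itertools import product

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Collapse a k-mer and its reverse complement to one representative.

    Assumption: the lexicographically smaller of the two is used.
    """
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(k: int) -> list[str]:
    """Sorted vocabulary of all canonical k-mers over A, C, G, T."""
    return sorted({canonical("".join(p)) for p in product("ACGT", repeat=k)})


def encode_window(seq: str) -> list[float]:
    """Encode one window as canonical 3-mer and 4-mer frequencies plus GC-content."""
    seq = seq.upper()
    features: list[float] = []
    for k in (3, 4):
        vocab = canonical_kmers(k)
        counts = dict.fromkeys(vocab, 0)
        total = 0
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: k-mers with ambiguous bases (e.g. N) are ignored.
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
                total += 1
        features.extend(counts[v] / total if total else 0.0 for v in vocab)
    # GC-content as the fraction of G and C bases in the window.
    gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
    features.append(gc)
    return features
```

Calling `encode_window` on a 300 bp window returns a list of 169 values: 32 canonical 3-mer frequencies, then 136 canonical 4-mer frequencies, then GC-content. Note that under this convention the motif GGC and its reverse complement GCC are counted as a single feature, GCC.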

Keywords

Supporting Institution

This study was not supported by any institution or organization.

Project Number

No project number.

Ethics Statement

This study was conducted using publicly available data and therefore does not require ethical approval.

Acknowledgments

The author has no acknowledgments to declare.


Details

Primary Language

English

Subjects

Agricultural Biotechnology (Other)

Section

Research Article

Early View Date

6 March 2026

Publication Date

6 March 2026

Submission Date

16 August 2025

Acceptance Date

30 December 2025

Published in Issue

Year 2026, Issue: Advanced Online Publication

Cite

APA
Yıldız, B. İ. (2026). A Bioinformatics Framework for Genomic Region Classification Using Canonical k-mers and Machine Learning. Kahramanmaraş Sütçü İmam Üniversitesi Tarım ve Doğa Dergisi, Advanced Online Publication, 1127-1136. https://doi.org/10.18016/ksutarimdoga.vi.1766666







This website is licensed under a Creative Commons Attribution 4.0 International License.

                 


Kahramanmaraş Sütçü İmam Üniversitesi Tarım ve Doğa Dergisi
e-ISSN: 2619-9149