Identification of OOV words in Turkish texts

Enis Arslan; Umut Orhan

Research Article

Türkçe metinlerde sözlük dışı kelime tespiti (Out-of-vocabulary word detection in Turkish texts)

Year: 2019, Volume: 8, Issue: 2, Pages: 35-48, Published: 31.10.2019

Abstract

This study presents a semantic graph network model that can detect
out-of-vocabulary (OOV) words in Turkish texts. In natural language processing
(NLP), morphological analyzers can encounter unknown words (UW) during word
analysis. This mostly happens when such tools depend on a dictionary to find
candidates during analysis. Sometimes an analyzer cannot find any lemma
candidate because the candidates are not present in the dictionary, which can
degrade the quality of the analysis output. The proposed OOV detection model
can identify OOV words that may be suitable for inclusion in dictionaries. In
addition, a semantic sub-network was built in a graph database from
co-occurrence relations and used to discover new collocations to be proposed
as dictionary entries.

Keywords

Unknown words, Collocation, Co-occurrence, Out-of-vocabulary words

Abstract

In this study, we present a semantic graph network model capable of detecting
out-of-vocabulary (OOV) words in Turkish texts. In natural language processing
(NLP), morphological analyzers can encounter unknown words (UW) during word
processing. This mostly occurs when such tools depend on a dictionary to find
the probable lemmas before parsing can proceed. Sometimes an analyzer cannot
find any candidates because the lemma candidates do not exist in the
dictionary, which degrades the parsing output. The proposed OOV detection
model identifies OOV words that are suitable for inclusion in dictionaries. In
addition, co-occurrence relations of the lemmas in texts are modelled as a
semantic sub-graph, which is used to discover collocations to propose as new
lemma candidates.
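
The two ideas in the abstract, dictionary-based OOV detection and collocation discovery from co-occurrence relations, can be illustrated with a minimal Python sketch. This is not the authors' model: the function names, the toy lexicon, the adjacency-only notion of co-occurrence, and the `min_count` threshold are all assumptions made for illustration. Surface tokens stand in for the lemmas a morphological analyzer would produce, and an in-memory counter stands in for the graph database.

```python
from collections import Counter


def find_oov(tokens, lexicon):
    """Return tokens missing from the lexicon, in first-seen order, without repeats.

    Tokens are assumed to be already normalized; Turkish casing (dotted and
    dotless i) needs locale-aware handling that is outside this sketch.
    """
    seen = set()
    oov = []
    for tok in tokens:
        if tok not in lexicon and tok not in seen:
            seen.add(tok)
            oov.append(tok)
    return oov


def cooccurrence_edges(sentences):
    """Count adjacent token pairs; each pair acts as a weighted graph edge."""
    edges = Counter()
    for toks in sentences:
        edges.update(zip(toks, toks[1:]))
    return edges


def collocation_candidates(edges, lexicon, min_count=2):
    """Propose frequent adjacent pairs that the lexicon does not yet contain."""
    return sorted(
        " ".join(pair)
        for pair, count in edges.items()
        if count >= min_count and " ".join(pair) not in lexicon
    )


# Hypothetical toy data, chosen only to exercise the functions above.
lexicon = {"hafta", "son", "sonu", "çektik", "geldi"}
sentences = [
    ["hafta", "sonu", "selfie", "çektik"],
    ["hafta", "sonu", "geldi"],
]

all_tokens = [tok for sent in sentences for tok in sent]
oov_words = find_oov(all_tokens, lexicon)
candidates = collocation_candidates(cooccurrence_edges(sentences), lexicon)
```

With this toy data, `oov_words` holds the single token absent from the lexicon (`selfie`), and `candidates` holds the one adjacent pair seen at least twice (`hafta sonu`). A plain frequency threshold is the simplest possible ranking; an association measure could be substituted without changing the overall shape.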

Keywords

Unknown words, Collocation, Co-occurrence, OOV words

Details

Primary Language	English
Subjects	Engineering
Journal Section	Research Articles
Authors	Enis Arslan, Umut Orhan
Publication Date	October 31, 2019
Published in Issue	Year 2019 Volume: 8 Issue: 2

Cite

APA	Arslan, E., & Orhan, U. (2019). Identification of OOV words in Turkish texts. Gaziosmanpaşa Bilimsel Araştırma Dergisi, 8(2), 35-48.
