Unsupervised sentence selection for creating a representative corpus in Turkish: An active learning approach

Agun, Hayri Volkan

Unsupervised sentence selection for creating a representative corpus in Turkish: An active learning approach

Tarih

2025

Yazarlar

Agun, Hayri Volkan

Yayıncı

Elsevier

Erişim Hakkı

info:eu-repo/semantics/closedAccess

Özet

In this study, active learning methods adapted for sentence selection of Turkish sentences are evaluated through language learning with neural models. Turkish is an agglutinative language with a complex morphology, where the linguistic properties of words are encoded in suffixes. The active learning methods based on regression, clustering, language models, distance metrics, and neural networks are applied to unlabeled sentence selection. In this respect, a sentence corpus is selected from a larger corpus, with the same number of samples for each target word in intrinsic and extrinsic evaluation tasks. The selected sentences are used for the training of SkipGram, CBOW, and self-attention LSTM language models and extracted embeddings are evaluated by the semantic analogy, POS and sentiment analysis tasks. The evaluation scores of the models trained on the samples selected by the active learning method are compared. The results of the selected sentences based on language models indicate an improvement over random selection based on a static vocabulary. These results also show that the selection affects the quality of unsupervised word embedding extraction even if the target vocabulary is kept the same. Along with the accuracy, the time efficiency of the language models is shown to be better than other methods especially methods based on neural network models, and distance metrics.

Anahtar Kelimeler

Unsupervised active learning, Language models, Natural language processing

Kaynak

Artificial Intelligence

WoS Q Değeri

Q2

Scopus Q Değeri

Q1

Cilt

348

Bağlantı

https://doi.org/10.1016/j.artint.2025.104422
https://hdl.handle.net/20.500.12885/5621

Koleksiyon

WoS İndeksli Yayınlar Koleksiyonu
Scopus İndeksli Yayınlar Koleksiyonu

Detaylı Öğe Kaydı

Unsupervised sentence selection for creating a representative corpus in Turkish: An active learning approach

Tarih

Yazarlar

Dergi Başlığı

Dergi ISSN

Cilt Başlığı

Yayıncı

Erişim Hakkı

Özet

Açıklama

Anahtar Kelimeler

Kaynak

WoS Q Değeri

Scopus Q Değeri

Cilt

Sayı

Künye

Bağlantı

Koleksiyon