Unsupervised sentence selection for creating a representative corpus in Turkish: An active learning approach

Küçük Resim Yok

Tarih

2025

Dergi Başlığı

Dergi ISSN

Cilt Başlığı

Yayıncı

Elsevier

Erişim Hakkı

info:eu-repo/semantics/closedAccess

Özet

In this study, active learning methods adapted for sentence selection of Turkish sentences are evaluated through language learning with neural models. Turkish is an agglutinative language with a complex morphology, where the linguistic properties of words are encoded in suffixes. The active learning methods based on regression, clustering, language models, distance metrics, and neural networks are applied to unlabeled sentence selection. In this respect, a sentence corpus is selected from a larger corpus, with the same number of samples for each target word in intrinsic and extrinsic evaluation tasks. The selected sentences are used for the training of SkipGram, CBOW, and self-attention LSTM language models and extracted embeddings are evaluated by the semantic analogy, POS and sentiment analysis tasks. The evaluation scores of the models trained on the samples selected by the active learning method are compared. The results of the selected sentences based on language models indicate an improvement over random selection based on a static vocabulary. These results also show that the selection affects the quality of unsupervised word embedding extraction even if the target vocabulary is kept the same. Along with the accuracy, the time efficiency of the language models is shown to be better than other methods especially methods based on neural network models, and distance metrics.

Açıklama

Anahtar Kelimeler

Unsupervised active learning, Language models, Natural language processing

Kaynak

Artificial Intelligence

WoS Q Değeri

Q2

Scopus Q Değeri

Q1

Cilt

348

Sayı

Künye