Unsupervised sentence selection for creating a representative corpus in Turkish: An active learning approach

dc.contributor.author: Agun, Hayri Volkan
dc.date.accessioned: 2026-02-08T15:15:08Z
dc.date.available: 2026-02-08T15:15:08Z
dc.date.issued: 2025
dc.department: Bursa Teknik Üniversitesi
dc.description.abstract: In this study, active learning methods adapted for selecting Turkish sentences are evaluated through language learning with neural models. Turkish is an agglutinative language with a complex morphology, in which the linguistic properties of words are encoded in suffixes. Active learning methods based on regression, clustering, language models, distance metrics, and neural networks are applied to the selection of unlabeled sentences. To this end, a sentence corpus is selected from a larger corpus, with the same number of samples for each target word in the intrinsic and extrinsic evaluation tasks. The selected sentences are used to train SkipGram, CBOW, and self-attention LSTM language models, and the extracted embeddings are evaluated on semantic analogy, POS, and sentiment analysis tasks. The evaluation scores of the models trained on the samples selected by the active learning methods are compared. The results for sentences selected with language-model-based methods indicate an improvement over random selection based on a static vocabulary. These results also show that the selection affects the quality of unsupervised word embedding extraction even when the target vocabulary is kept the same. In addition to their accuracy, the language-model-based methods are shown to be more time-efficient than the other methods, especially those based on neural network models and distance metrics.
dc.identifier.doi: 10.1016/j.artint.2025.104422
dc.identifier.issn: 0004-3702
dc.identifier.issn: 1872-7921
dc.identifier.scopus: 2-s2.0-105016585064
dc.identifier.scopusquality: Q1
dc.identifier.uri: https://doi.org/10.1016/j.artint.2025.104422
dc.identifier.uri: https://hdl.handle.net/20.500.12885/5621
dc.identifier.volume: 348
dc.identifier.wos: WOS:001578014300002
dc.identifier.wosquality: Q2
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: Elsevier
dc.relation.ispartof: Artificial Intelligence
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/closedAccess
dc.snmz: WOS_KA_20260207
dc.subject: Unsupervised active learning
dc.subject: Language models
dc.subject: Natural language processing
dc.title: Unsupervised sentence selection for creating a representative corpus in Turkish: An active learning approach
dc.type: Article
