DSpace Repository :: Browsing by Author "Agun, Hayri Volkan"

Browse by Author "Agun, Hayri Volkan"

Showing 1 - 7 of 7

An efficient regular expression inference approach for relevant image extraction
(Elsevier, 2023) Agun, Hayri Volkan; Uzun, Erdinc
Traditional approaches for extracting relevant images automatically from web pages are error-prone and time-consuming. To improve this task, web data extraction approaches rely on operations such as preparing a larger dataset and finding new features. However, these operations are difficult and laborious. In this study, we propose a fully automated approach based on the alignment of regular expressions to automatically extract the relevant images from web pages. Automatically constructed regular expressions are applied to a classification task for the first time. In this respect, a multi-stage inference approach is developed for generating regular expressions from the attribute values of relevant and irrelevant image elements in web pages. The proposed approach reduces the complexity of aligning two regular expressions by applying a constraint on a version of the Levenshtein distance algorithm. The classification accuracy of regular expression approaches is compared with the naive Bayes, logistic regression, J48, and multilayer perceptron classifiers on a balanced relevant image retrieval dataset consisting of 360 image element samples for 10 shopping websites. According to the cross-validation results, the regular expression inference-based classification achieved a 0.98 F-measure with only 5 frequent n-grams, and it outperformed other classifiers on the same set of features. The classification efficiency of the proposed approach is measured at 0.108 ms, which is very competitive with other classifiers. © 2023 Elsevier B.V. All rights reserved.
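The abstract does not state which constraint is placed on the Levenshtein distance. As a rough illustration only, the sketch below assumes a band constraint (cells with |i - j| greater than a chosen width are never explored) and works on plain strings rather than on regular expressions. The function name `banded_levenshtein` and the `band` parameter are hypothetical and do not come from the paper.

```python
def banded_levenshtein(a, b, band):
    """Edit distance between a and b, exploring only cells with |i - j| <= band.

    Returns None when the distance exceeds the band. This is an assumed
    constraint for illustration, not the one used in the paper.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None  # the final cell lies outside the band
    inf = float("inf")
    # Row 0: turning an empty prefix of a into b[:j] costs j insertions.
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        if i <= band:
            curr[0] = i  # i deletions
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[m] if prev[m] <= band else None


print(banded_levenshtein("kitten", "sitting", 3))  # 3
print(banded_levenshtein("kitten", "sitting", 2))  # None
```

With a fixed band width the table shrinks from n x m cells to roughly n x (2 * band + 1) cells, which is one common way such a constraint lowers alignment cost.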
Automatically Discovering Relevant Images From Web Pages
(IEEE-Inst Electrical Electronics Engineers Inc, 2020) Uzun, Erdinc; Ozhan, Erkan; Agun, Hayri Volkan; Yerlikaya, Tarik; Bulus, Halil Nusret
Web pages contain irrelevant images along with relevant images. The classification of these images is an error-prone process due to the number of design variations of web pages. Using multiple web pages provides additional features that improve the performance of relevant image extraction. Traditional studies use the features extracted from a single web page. However, in this study, we enhance the performance of relevant image extraction by employing the features extracted from different web pages consisting of standard news, galleries, video pages, and link pages. The dataset obtained from these web pages contains 100 different web pages for each of 200 online news websites from 58 different countries. For discovering relevant images, the most straightforward approach extracts the largest image on the web page. This approach achieves a 0.451 F-measure score as a baseline. Then, we apply several machine learning methods using features in this dataset to find the most suitable machine learning method. The best F-measure score is 0.822 using the AdaBoost classifier. Some of these features have been utilized in previous web data extraction studies. To the best of our knowledge, 15 new features are proposed for the first time in this study for discovering the relevant images. We compare the performance of the AdaBoost classifier on different feature sets. The proposed features improve the F-measure by 35 percent. Moreover, the cache feature alone, which is the most prominent feature, accounts for 7 percent of this improvement.
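The largest-image baseline and the F-measure used to score it can be sketched as follows. This is a minimal illustration, assuming each image is described by a dictionary with hypothetical `src`, `width`, and `height` keys and that "largest" means largest pixel area; neither the data layout nor the file names come from the paper.

```python
def largest_image(images):
    """Baseline: pick the image with the largest pixel area (assumed criterion)."""
    return max(images, key=lambda img: img["width"] * img["height"])


def f_measure(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (F1)."""
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical page content, for illustration only.
page_images = [
    {"src": "logo.png", "width": 120, "height": 40},
    {"src": "story.jpg", "width": 800, "height": 450},
    {"src": "ad.gif", "width": 300, "height": 250},
]
print(largest_image(page_images)["src"])  # story.jpg
```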
Spatiotemporal Determination of Individuals' COVID-19 Risk (Bireylerin Kovid-19 Riskinin Uzay-zamansal Olarak Belirlenmesi)
(2023) Agun, Hayri Volkan
Existing approaches, such as the susceptible-infected-removed model and machine learning models, are not suited to computing the infection risk for each individual and area. To address the shortcomings of existing approaches, this study proposes a feedback processing design in which the collected data are combined into a spatial and temporal prediction model. The proposed design comprises three main processing stages: data generation, feedback analysis, and real-time spatial and temporal evaluation. In the data generation stage, the COVID-19 status of each individual is generated using a Markov probabilistic process. In this stage, the COVID-19 status and the mobility status of each patient are updated randomly using the reproduction parameters of the disease, the incidence of symptomatic and asymptomatic patients, the total population, the currently infected population, and the mobile population. Mobility data are generated for randomly chosen specific areas. In these data, the interactions of individuals within a given area are computed randomly. In the feedback analysis stage, the collected statistics and local event data are combined, and the COVID-19 risk of each individual is estimated with a linear model. In this context, a probabilistic approximation approach can be used to obtain the local statistics. In the evaluation stage, all interactions obtained from the feedback analysis are used to periodically compute each person's current COVID-19 risk. The prediction performance for that time interval is then computed using each person's COVID-19 status in the generated data. The Kappa statistic computed for each individual interaction, independent of population size, place/time, and mobility rate, showed that the effect of the proposed design is significant.
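The data generation stage can be pictured as a per-person Markov chain over health states. The sketch below is only an illustration of that idea: the three state names, the daily transition probabilities in `TRANSITIONS`, and the helper names `next_state` and `simulate` are all hypothetical and are not taken from the study, which also models symptomatic and asymptomatic cases and mobility.

```python
import random

# Hypothetical daily transition probabilities; not taken from the study.
TRANSITIONS = {
    "susceptible": {"susceptible": 0.95, "infected": 0.05},
    "infected": {"infected": 0.90, "recovered": 0.10},
    "recovered": {"recovered": 1.0},
}


def next_state(state, transitions, rng):
    """Draw the next state from the row of the transition table for `state`."""
    u = rng.random()
    cumulative = 0.0
    for candidate, probability in transitions[state].items():
        cumulative += probability
        if u < cumulative:
            return candidate
    return candidate  # guard against floating-point round-off


def simulate(initial, steps, transitions, rng):
    """Return the list of states visited, starting with `initial`."""
    states = [initial]
    for _ in range(steps):
        states.append(next_state(states[-1], transitions, rng))
    return states


# One simulated individual over 30 days, with a fixed seed for repeatability.
print(simulate("susceptible", 30, TRANSITIONS, random.Random(7)))
```

Running one such chain per simulated person yields labeled synthetic data against which later risk predictions can be scored.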
Intrinsic evaluation of word embeddings for Turkish
(Association for Computing Machinery, 2020) Agun, Hayri Volkan; Yilmazel, O.
Word embeddings are evaluated through intrinsic and extrinsic tests. Similarity and analogy tests are mainly preferred for intrinsic evaluation, and natural language processing tasks such as named entity recognition and question answering are preferred for extrinsic evaluation. Although there are various intrinsic evaluation datasets for English, the datasets for Turkish are very limited and measure the degree of similarity and relatedness between words without specifying the type of semantic relation. In this paper, we propose an intrinsic evaluation dataset for evaluating different semantic relations other than synonym, antonym, hypernym, and meronym, as well as morphological relations of individual Turkish words. Moreover, we benchmark three publicly available word-embedding models on the proposed dataset and discuss agglutinative characteristics of the Turkish language for language modeling. © 2020 ACM.
Ranking Assisted Unsupervised Morphological Disambiguation of Turkish
(IEEE-Inst Electrical Electronics Engineers Inc, 2025) Agun, Hayri Volkan; Aslan, Ozkan
In comparison to English, Turkish is an agglutinative language with fewer resources. The agglutinative properties of words result in a significant number of morphological analyses, creating uncertainty in morphological disambiguation and syntactic parsing. Traditional approaches typically rely on supervised learning models based on the correct morphological analysis of a given phrase. In this study, we propose a ranking method to limit and filter out irrelevant morphological tags from all possible combinations of morphological analyses of a given sentence without supervision. The suggested method selects less ambiguous analyses for statistical aggregation and applies inference through the PageRank algorithm on a densely connected graph. Subsequently, this graph is utilized to develop a voting schema for each test word based on the connections in the test sentence. Experimental evaluations of the proposed methods on three independently and manually annotated test datasets indicate a token accuracy of approximately 80% and an accuracy of around 61% for ambiguous tokens. In all ranking evaluations, the best scores from the PageRank variations significantly outperform those of Self-Attention LSTM and ELMO deep learning models. The training process of PageRank is notably straightforward and efficient, requiring O(n^2) parameter adjustments, which is considerably fewer than those required by the backpropagation method used in neural network training. Furthermore, to reduce ambiguity in sentences from different genres with scarce samples, the proposed method is easily adaptable.
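For readers unfamiliar with PageRank, the sketch below shows the standard power-iteration form on a small directed graph. It is not the paper's method: how the graph of morphological analyses is built and which PageRank variations are used are not described in the abstract. The damping factor of 0.85, the iteration count, and the toy node names are assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Standard PageRank by power iteration.

    `graph` maps every node to a list of the nodes it links to; each target
    must also appear as a key. Nodes without outgoing links spread their rank
    evenly over all nodes.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph[node] or nodes  # dangling node: link to everyone
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph with hypothetical node names, for illustration only.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
print(max(ranks, key=ranks.get))  # c
```

Each iteration touches every edge once, so on a densely connected graph with n nodes one pass updates on the order of n^2 edge contributions.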
Unsupervised sentence selection for creating a representative corpus in Turkish: An active learning approach
(Elsevier, 2025) Agun, Hayri Volkan
In this study, active learning methods adapted for selecting Turkish sentences are evaluated through language learning with neural models. Turkish is an agglutinative language with a complex morphology, where the linguistic properties of words are encoded in suffixes. The active learning methods based on regression, clustering, language models, distance metrics, and neural networks are applied to unlabeled sentence selection. In this respect, a sentence corpus is selected from a larger corpus, with the same number of samples for each target word in intrinsic and extrinsic evaluation tasks. The selected sentences are used for the training of SkipGram, CBOW, and self-attention LSTM language models, and the extracted embeddings are evaluated on semantic analogy, POS, and sentiment analysis tasks. The evaluation scores of the models trained on the samples selected by each active learning method are compared. The results of the selected sentences based on language models indicate an improvement over random selection based on a static vocabulary. These results also show that the selection affects the quality of unsupervised word embedding extraction even if the target vocabulary is kept the same. Along with the accuracy, the time efficiency of the language models is shown to be better than that of other methods, especially methods based on neural network models and distance metrics.
WebCollectives: A light regular expression based web content extractor in Java
(Elsevier, 2023) Agun, Hayri Volkan
Conventional web crawling methods typically involve a sequence of distinct steps for downloading and extracting web content. A noteworthy limitation of these conventional crawling approaches is their lack of a focus-based crawling strategy. The software presented in this paper, WebCollectives, offers a straightforward crawling approach by integrating content extraction into a hierarchical regular expression definition model. Furthermore, it streamlines the crawling process through a pipeline-oriented framework, emphasizing focus-based link extraction. This crawler employs either a configurable Selenium mechanism or a direct HTTP GET method to fetch web pages. The fetched pages then undergo an extraction process based on hierarchical regular expressions. Notably, Selenium allows for adaptable JavaScript functions to navigate web pages effectively. The content extraction generates XML structures from diverse types of content. Comparative analysis with the standard DOM (Document Object Model) reveals that the proposed approach yields significant improvements in extraction efficiency and requires fewer lines of code. Specifically, it outperforms non-recursive standard DOM hierarchy definitions in terms of both extraction speed and code complexity.
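The idea of hierarchical regular expressions that emit XML can be illustrated with a two-level sketch: an outer pattern selects content blocks, and inner patterns pull named fields out of each block. This is not the WebCollectives API, which is a Java library whose interface is not shown in the abstract. The `extract` function, its parameters, and the sample HTML are hypothetical, and the sketch is written in Python using only the standard `re` and `xml.etree.ElementTree` modules.

```python
import re
import xml.etree.ElementTree as ET


def extract(html, block_pattern, field_patterns, root_tag="items", item_tag="item"):
    """Two-level extraction: blocks first, then named fields inside each block.

    `block_pattern` and every pattern in `field_patterns` must contain one
    capturing group holding the text of interest.
    """
    root = ET.Element(root_tag)
    for block in re.finditer(block_pattern, html, flags=re.DOTALL):
        item = ET.SubElement(root, item_tag)
        for name, pattern in field_patterns.items():
            match = re.search(pattern, block.group(1), flags=re.DOTALL)
            if match:  # fields that are absent from a block are skipped
                ET.SubElement(item, name).text = match.group(1).strip()
    return root


# Hypothetical page fragment, for illustration only.
sample = (
    '<div class="post"><h2>First</h2><a href="/a">read</a></div>'
    '<div class="post"><h2>Second</h2></div>'
)
tree = extract(
    sample,
    block_pattern=r'<div class="post">(.*?)</div>',
    field_patterns={"title": r"<h2>(.*?)</h2>", "link": r'href="(.*?)"'},
)
print(ET.tostring(tree, encoding="unicode"))
```

Non-greedy blocks like these break on nested `div` elements, so a sketch of this kind only suits pages with a flat, regular layout.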
