WebCollectives: A light regular expression based web content extractor in Java

Agun, Hayri Volkan

dc.contributor.author	Agun, Hayri Volkan
dc.date.accessioned	2026-02-12T21:05:41Z
dc.date.available	2026-02-12T21:05:41Z
dc.date.issued	2023
dc.department	Bursa Teknik Üniversitesi
dc.description.abstract	Conventional web crawling methods typically download and extract web content in a sequence of distinct steps. A notable limitation of these approaches is the lack of a focus-based crawling strategy. The software presented in this paper, WebCollectives, offers a straightforward crawling approach that integrates content extraction into a hierarchical regular expression definition model. It also streamlines crawling through a pipeline-oriented framework that emphasizes focus-based link extraction. The crawler fetches web pages with either a configurable Selenium mechanism or a direct HTTP GET request. Each fetched page then undergoes an extraction process based on hierarchical regular expressions. With Selenium, adaptable JavaScript functions can be used to navigate web pages. The extraction step generates XML structures from diverse types of content. A comparison with standard DOM (Document Object Model) based extraction shows that the proposed approach significantly improves extraction efficiency and requires fewer lines of code. Specifically, it outperforms non-recursive standard DOM hierarchy definitions in both extraction speed and code complexity.
dc.identifier.doi	10.1016/j.softx.2023.101569
dc.identifier.issn	2352-7110
dc.identifier.scopus	2-s2.0-85174828192
dc.identifier.scopusquality	Q2
dc.identifier.uri	https://doi.org/10.1016/j.softx.2023.101569
dc.identifier.uri	https://hdl.handle.net/20.500.12885/7084
dc.identifier.volume	24
dc.identifier.wos	WOS:001102105200001
dc.identifier.wosquality	Q2
dc.indekslendigikaynak	Web of Science
dc.indekslendigikaynak	Scopus
dc.language.iso	en
dc.publisher	Elsevier
dc.relation.ispartof	SoftwareX
dc.relation.publicationcategory	Article - International Refereed Journal - Institutional Faculty Member
dc.rights	info:eu-repo/semantics/openAccess
dc.snmz	KA_WoS_20260212
dc.subject	Focused crawler
dc.subject	Web content extractor
dc.subject	HTML parsing
dc.subject	Regular expressions
dc.title	WebCollectives: A light regular expression based web content extractor in Java
dc.type	Article
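
The abstract describes extraction driven by hierarchical regular expressions that produce XML. The following is a minimal sketch of that general idea only, not the WebCollectives API: the class `HierarchicalRegexSketch`, the `Rule` type, the `toXml` method, and the sample HTML are all hypothetical names chosen for illustration. Each rule is assumed to capture its content in group 1, and child rules are applied only to the text captured by their parent.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HierarchicalRegexSketch {

    /** Hypothetical rule: a named regex whose group 1 is passed on to its child rules. */
    static final class Rule {
        final String name;
        final Pattern pattern;
        final List<Rule> children = new ArrayList<>();

        Rule(String name, String regex) {
            this.name = name;
            // DOTALL lets '.' span line breaks, which is common in HTML fragments.
            this.pattern = Pattern.compile(regex, Pattern.DOTALL);
        }

        Rule child(Rule rule) {
            children.add(rule);
            return this;
        }
    }

    /** Appends one XML element per match; leaf rules emit text, others recurse. */
    static void extract(Rule rule, String input, StringBuilder xml) {
        Matcher m = rule.pattern.matcher(input);
        while (m.find()) {
            String captured = m.group(1);
            xml.append('<').append(rule.name).append('>');
            if (rule.children.isEmpty()) {
                xml.append(escape(captured.trim()));
            } else {
                for (Rule c : rule.children) {
                    extract(c, captured, xml);
                }
            }
            xml.append("</").append(rule.name).append('>');
        }
    }

    /** Escapes the characters that would break XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toXml(Rule root, String html) {
        StringBuilder xml = new StringBuilder();
        extract(root, html, xml);
        return xml.toString();
    }

    public static void main(String[] args) {
        // Parent rule isolates each post; child rules run only inside that fragment.
        Rule article = new Rule("article", "<div class=\"post\">(.*?)</div>")
            .child(new Rule("title", "<h2>(.*?)</h2>"))
            .child(new Rule("link", "<a href=\"(.*?)\""));
        String html = "<div class=\"post\"><h2>Hello</h2><a href=\"/a\">more</a></div>"
                    + "<div class=\"post\"><h2>World</h2><a href=\"/b\">more</a></div>";
        System.out.println(toXml(article, html));
    }
}
```

Scoping each child rule to its parent's captured text is what makes the definition hierarchical: the `title` and `link` patterns never see markup outside the enclosing `article` match, so they can stay short.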

Collection

WoS Indexed Publications Collection
Scopus Indexed Publications Collection
