WebCollectives: A light regular expression based web content extractor in Java

dc.contributor.author: Agun, Hayri Volkan
dc.date.accessioned: 2026-02-12T21:05:41Z
dc.date.available: 2026-02-12T21:05:41Z
dc.date.issued: 2023
dc.department: Bursa Teknik Üniversitesi
dc.description.abstract: Conventional web crawling methods typically involve a sequence of distinct steps for downloading and extracting web content. A notable limitation of these approaches is that they lack a focus-based crawling strategy. WebCollectives, the software introduced in this paper, offers a straightforward crawling approach by integrating content extraction into a hierarchical regular expression definition model. It also streamlines crawling through a pipeline-oriented framework that emphasizes focus-based link extraction. The crawler fetches web pages through either a configurable Selenium mechanism or a direct HTTP GET request. The fetched pages then undergo an extraction process based on hierarchical regular expressions. With Selenium, adaptable JavaScript functions can be used to navigate web pages effectively. Content extraction generates XML structures from diverse types of content. A comparison with standard DOM (Document Object Model) extraction shows that the proposed approach significantly improves extraction efficiency and requires fewer lines of code. Specifically, it outperforms non-recursive standard DOM hierarchy definitions in both extraction speed and code complexity.
dc.identifier.doi: 10.1016/j.softx.2023.101569
dc.identifier.issn: 2352-7110
dc.identifier.scopus: 2-s2.0-85174828192
dc.identifier.scopusquality: Q2
dc.identifier.uri: https://doi.org/10.1016/j.softx.2023.101569
dc.identifier.uri: https://hdl.handle.net/20.500.12885/7084
dc.identifier.volume: 24
dc.identifier.wos: WOS:001102105200001
dc.identifier.wosquality: Q2
dc.indekslendigikaynak: Web of Science
dc.indekslendigikaynak: Scopus
dc.language.iso: en
dc.publisher: Elsevier
dc.relation.ispartof: SoftwareX
dc.relation.publicationcategory: Article - International Peer-Reviewed Journal - Institutional Faculty Member
dc.rights: info:eu-repo/semantics/openAccess
dc.snmz: KA_WoS_20260212
dc.subject: Focused crawler
dc.subject: Web content extractor
dc.subject: HTML parsing
dc.subject: Regular expressions
dc.title: WebCollectives: A light regular expression based web content extractor in Java
dc.type: Article
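
The abstract describes extraction through hierarchical regular expressions that produces XML output. The record does not show the WebCollectives API, so the following is only a minimal sketch of the general idea using the standard java.util.regex package: an outer pattern isolates each record block, and inner patterns are applied only within that block. The class name HierarchicalRegexSketch, the method names, the sample HTML, and the XML element names are all hypothetical and are not taken from the software.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of two-level regex extraction; not the WebCollectives API.
class HierarchicalRegexSketch {

    // Outer level: capture group 1 of `block` is the body of one record.
    // Inner level: each field pattern is searched only inside that body,
    // and its capture group 1 becomes the text of an XML element.
    static String extract(String html, String recordTag,
                          Pattern block, Map<String, Pattern> fields) {
        StringBuilder xml = new StringBuilder("<records>");
        Matcher outer = block.matcher(html);
        while (outer.find()) {
            String body = outer.group(1);
            xml.append('<').append(recordTag).append('>');
            for (Map.Entry<String, Pattern> field : fields.entrySet()) {
                Matcher inner = field.getValue().matcher(body);
                if (inner.find()) {
                    xml.append('<').append(field.getKey()).append('>')
                       .append(escape(inner.group(1).trim()))
                       .append("</").append(field.getKey()).append('>');
                }
            }
            xml.append("</").append(recordTag).append('>');
        }
        return xml.append("</records>").toString();
    }

    // Escape the characters that would break XML element content.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Made-up sample page with two flat (non-nested) record blocks.
        String html = "<div class=\"post\"><h2>First</h2><a href=\"/a\">more</a></div>"
                    + "<div class=\"post\"><h2>Second</h2><a href=\"/b\">more</a></div>";
        Pattern block = Pattern.compile("<div class=\"post\">(.*?)</div>", Pattern.DOTALL);
        Map<String, Pattern> fields = new LinkedHashMap<>();
        fields.put("title", Pattern.compile("<h2>(.*?)</h2>", Pattern.DOTALL));
        fields.put("link", Pattern.compile("<a href=\"([^\"]*)\""));
        System.out.println(extract(html, "post", block, fields));
    }
}
```

The lazy outer pattern in this sketch assumes record blocks are not nested; nested containers would need a different outer definition. Extracted link values could in principle feed the next fetch step of a focused crawl, although how WebCollectives wires this together is not stated in this record.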
