WebCollectives: A light regular expression based web content extractor in Java

Agun, Hayri Volkan

Date

2023

Authors

Agun, Hayri Volkan

Publisher

Elsevier

Access Rights

info:eu-repo/semantics/openAccess

Abstract

Conventional web crawlers typically download and extract web content in a sequence of separate steps, and they generally lack a focus-based crawling strategy. WebCollectives, the software presented in this paper, simplifies crawling by integrating content extraction into a hierarchical regular expression definition model. It organizes the crawling process as a pipeline that emphasizes focus-based link extraction. The crawler fetches web pages either through a configurable Selenium mechanism or through a direct HTTP GET request, and then applies the hierarchical regular expressions to the fetched pages. With Selenium, adaptable JavaScript functions can be used to navigate web pages. The extraction step generates XML structures from diverse types of content. A comparison with standard DOM (Document Object Model) extraction shows that the proposed approach yields significant improvements in extraction efficiency and requires fewer lines of code. Specifically, it outperforms non-recursive standard DOM hierarchy definitions in both extraction speed and code complexity.
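As an illustration of the idea described in the abstract, the following minimal Java sketch shows one way a hierarchy of regular expressions could turn an HTML fragment into an XML structure. It is not the WebCollectives API: the class `HierarchicalRegexSketch`, the `Rule` type, the methods `child`, `apply`, and `extract`, and the sample HTML are all hypothetical names chosen for this example. Each rule is assumed to capture its content in group 1 and to hand that captured text to its child rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only; this is NOT the WebCollectives API.
class HierarchicalRegexSketch {

    /** One node of the rule tree: a named pattern whose group 1 is passed to its children. */
    static final class Rule {
        final String name;
        final Pattern pattern;
        final List<Rule> children = new ArrayList<>();

        Rule(String name, String regex) {
            this.name = name;
            // DOTALL lets '.' span line breaks inside an HTML fragment.
            this.pattern = Pattern.compile(regex, Pattern.DOTALL);
        }

        /** Adds a child rule and returns this rule so calls can be chained. */
        Rule child(Rule r) {
            children.add(r);
            return this;
        }
    }

    /** Escapes the characters that would break the generated XML text nodes. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits one XML element per match; leaf rules emit text, inner rules recurse. */
    static void apply(Rule rule, String input, StringBuilder out) {
        Matcher m = rule.pattern.matcher(input);
        while (m.find()) {
            String captured = m.group(1);
            out.append('<').append(rule.name).append('>');
            if (rule.children.isEmpty()) {
                out.append(escape(captured.trim()));
            } else {
                for (Rule c : rule.children) {
                    apply(c, captured, out);
                }
            }
            out.append("</").append(rule.name).append('>');
        }
    }

    static String extract(Rule root, String html) {
        StringBuilder out = new StringBuilder();
        apply(root, html, out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<div class=\"post\"><h2>First</h2><a href=\"/a\">more</a></div>"
                + "<div class=\"post\"><h2>Second</h2><a href=\"/b\">more</a></div>";
        Rule post = new Rule("post", "<div class=\"post\">(.*?)</div>")
                .child(new Rule("title", "<h2>(.*?)</h2>"))
                .child(new Rule("link", "href=\"(.*?)\""));
        // Prints one <post> element per matched block, each with <title> and <link> children.
        System.out.println(extract(post, html));
    }
}
```

In this sketch the extracted `link` values are the kind of output a focus-based crawler could feed back into its download queue. The lazy pattern for `post` stops at the first closing `</div>`, so nested `div` elements would need a more careful rule.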

Keywords

Focused crawler, Web content extractor, HTML parsing, Regular expressions

Source

SoftwareX

WoS Quartile

Q2

Scopus Quartile

Q2

Volume

24

Link

https://doi.org/10.1016/j.softx.2023.101569
https://hdl.handle.net/20.500.12885/7084

Collection

WoS Indexed Publications Collection
Scopus Indexed Publications Collection
