Karakuş, Osman Furkan; Gülcü, Ayla; Karaca, Ali Can
Date issued: 2025
Date accessioned/available: 2026-02-08
ISSN: 2602-4217
DOI: https://doi.org/10.38088/jise.1471047
Handle: https://hdl.handle.net/20.500.12885/4225

Abstract: This study introduces a novel approach for segmenting lines of text in handwritten documents using a vision transformer model. Specifically, we adapt the DEtection TRansformer (DETR) model to detect line segments in images of handwritten documents. To adapt DETR to the line segmentation task, we apply a pre-processing step that divides each line into fixed-size image patches and adds positional encoding. We use a DETR model with a ResNet-101 backbone pretrained on the Common Objects in Context (COCO) object detection dataset, and re-train this model on our novel, complex line segmentation dataset of 1,610 handwritten forms. To evaluate performance, we also implement another line segmentation method, Bangla Document Recognition through Instance-level Segmentation of Handwritten Text Images (BN-DRISHTI), which utilizes the You Only Look Once (YOLO) object detection model. Both object detection-based methods involve a learning phase during which the model is trained or fine-tuned on the dataset. For a more diverse set of baseline methods, we also implement two learning-free algorithms: the A* search algorithm and a genetic algorithm (GA). Experimental results based on the Intersection over Union (IoU) metric demonstrate that the proposed method outperforms all other methods in terms of detection rate, recognition accuracy, and the Text Line Detection Metric (TLDM). The quantitative results also indicate that the two learning-free algorithms fail to segment highly skewed lines in the dataset. The A* algorithm achieves a recognition accuracy of 0.734, compared to 0.498 for GA and 0.689 for BN-DRISHTI.
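For reference, the Intersection over Union metric used in the evaluation can be computed for two axis-aligned line bounding boxes as in the minimal sketch below; the `(x1, y1, x2, y2)` box format and function name are illustrative assumptions, not the paper's code:

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A predicted line box is typically counted as a correct detection when its IoU with a ground-truth line box exceeds a fixed threshold (commonly 0.5).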
Our proposed approach achieves the highest recognition accuracy of 0.872, outperforming all other methods. We show that the DETR model, which requires only a single fine-tuning phase to adapt to the line segmentation task, not only simplifies the training and implementation process but also improves accuracy and efficiency in detecting and segmenting handwritten text lines. DETR's use of the transformer's global attention mechanism allows it to capture the entire context of an image rather than relying solely on local features. This is particularly beneficial for handling the diverse and complex patterns found in handwritten text, where traditional models might struggle with issues such as overlapping text lines or varied handwriting styles.

Language: en
Access: info:eu-repo/semantics/openAccess
Keywords: Image Processing; Pattern Recognition
Title: Adapting Vision Transformer-Based Object Detection Model for Handwritten Text Line Segmentation Task
Type: Article
DOI: 10.38088/jise.1471047
912838
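The pre-processing step described in the abstract (splitting each line image into fixed-size patches and adding positional encoding) can be sketched as follows. The padding strategy, patch width, and the sinusoidal encoding scheme are illustrative assumptions; the paper does not publish this exact implementation:

```python
import numpy as np

def patchify(line_img, patch_w):
    """Split a grayscale line image (H x W) into fixed-width patches,
    zero-padding the right edge so the width divides evenly."""
    h, w = line_img.shape
    pad = (-w) % patch_w
    padded = np.pad(line_img, ((0, 0), (0, pad)))
    # Result shape: (n_patches, H, patch_w), left-to-right reading order.
    return padded.reshape(h, -1, patch_w).transpose(1, 0, 2)

def sinusoidal_pe(n_pos, d_model):
    """Standard transformer sinusoidal positional encoding,
    one d_model-dimensional vector per patch position."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    # Even dimensions get sine, odd dimensions get cosine.
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))
```

Each patch would then be embedded and summed with its positional vector before being fed to the transformer encoder, so the model retains the left-to-right order of the patches despite attention being permutation-invariant.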