A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51

Seven, Engin; Demirel, Eylem Yücel

dc.contributor.author	Seven, Engin
dc.contributor.author	Demirel, Eylem Yücel
dc.date.accessioned	2026-02-08T15:04:49Z
dc.date.available	2026-02-08T15:04:49Z
dc.date.issued	2025
dc.department	Bursa Teknik Üniversitesi
dc.description.abstract	Video-based Human Action Recognition (HAR) remains challenging due to inter-class similarity, background noise, and the need to capture long-term temporal dependencies. This study proposes a hybrid deep learning model that integrates 3D Convolutional Neural Networks (3D CNNs) with Transformer-based attention mechanisms to jointly capture spatio-temporal features and long-range motion context. The architecture was optimized for parameter efficiency and trained on the UCF101 and HMDB51 benchmark datasets using standardized preprocessing and training strategies. Experimental results indicate that the proposed model reaches 97% accuracy and a 96.8% mean F1-score on UCF101, and 85% accuracy and an 83.8% F1-score on HMDB51, consistently outperforming the standalone 3D CNN and Transformer variants under identical settings. Ablation studies confirm that combining convolutional and attention layers significantly improves recognition performance while maintaining a competitive computational cost (3.78M parameters, 17.75 GFLOPs/video, ~7 ms GPU latency). These findings highlight the effectiveness of the hybrid design for accurate and efficient HAR. Future work will address class imbalance using focal loss or weighted training, explore multimodal data integration, and develop more lightweight Transformer modules for real-time deployment on resource-constrained devices.
dc.identifier.doi	10.38088/jise.1703936
dc.identifier.endpage	342
dc.identifier.issn	2602-4217
dc.identifier.issue	2
dc.identifier.startpage	327
dc.identifier.uri	https://doi.org/10.38088/jise.1703936
dc.identifier.uri	https://hdl.handle.net/20.500.12885/4224
dc.identifier.volume	9
dc.language.iso	en
dc.publisher	Bursa Teknik Üniversitesi
dc.relation.ispartof	Journal of Innovative Science and Engineering
dc.relation.publicationcategory	Article - National Refereed Journal - Institutional Faculty Member
dc.rights	info:eu-repo/semantics/openAccess
dc.snmz	KA_DergiPark_20260207
dc.subject	Image Processing
dc.title	A Hybrid 3D CNNs Transformer Architecture for Video-Based Human Action Recognition with Improved Accuracy on UCF101 and HMDB51
dc.type	Article
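
The abstract reports accuracy together with a mean F1-score but does not say how the F1-score is averaged across classes. The sketch below assumes macro averaging (an unweighted mean of per-class F1 values), which is one common reading of "mean F1-score"; the function names and the toy labels are illustrative and do not come from the article.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # class p was predicted but is wrong
            fn[t] += 1      # class t was missed
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels for illustration only, not data from the study.
    y_true = ["run", "run", "jump", "jump", "wave", "wave"]
    y_pred = ["run", "jump", "jump", "jump", "wave", "run"]
    print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
    print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because macro averaging weights every class equally, it can fall below accuracy when some classes are recognized poorly, which is consistent with the small gap between the two figures quoted in the abstract.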

Collection

Journal of Innovative Science and Engineering Collection