Title
Deep Learning Based Object Detection.
Author
Maharek, Abdelrahman Alaa Sayed Mahmoud.
Preparation Committee
Researcher / Abdelrahman Alaa Sayed Mahmoud Maharek
Supervisor / Kamal ElDahshan
Supervisor / Rasha Orban Mahmoud
Examiner / Hala Zayed
Examiner / Roshdy Mohamed Farruk
Subject
Deep Learning; Neural Networks.
Publication Date
2023.
Number of Pages
79 p.
Language
English
Degree
Master's
Specialization
Computer Science Applications
Date of Approval
20/12/2023
Place of Approval
Benha University - Faculty of Computers and Information - Computer Science

Abstract

Object detection in video poses challenges due to factors such as fast movement, out-of-focus shots, and changes in posture. This has spurred research in video object detection (VID) to improve accuracy. VID has diverse applications in healthcare, including tumor detection in medical imaging, patient monitoring in healthcare facilities, and surgical video analysis for technique improvement. Additionally, it supports telemedicine for remote patient diagnosis and monitoring.
Existing video object detection techniques rely on recurrent neural networks or optical flow to aggregate features, either across an entire sequence or across nearby frames. Convolutional Neural Networks (CNNs) are commonly used as backbone networks for generating feature maps. However, Vision Transformers have exhibited superior performance in various vision tasks, such as object detection in still images and image classification.
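For illustration, the following minimal PyTorch sketch shows the core of the flow-based aggregation these methods use: a neighbouring frame's feature map is warped to the current frame by bilinear sampling along an optical-flow field. This is a generic sketch, not code from the thesis; the flow estimate is assumed to come from a separate network, and the function name and tensor shapes are illustrative.

```python
# Hedged sketch of flow-guided feature warping (not the thesis code).
import torch
import torch.nn.functional as F


def warp_features(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a neighbour frame's feature map to the current frame.

    feat: (B, C, H, W) features of the neighbour frame
    flow: (B, 2, H, W) per-pixel displacement, current -> neighbour
    """
    b, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat.dtype),
        torch.arange(w, dtype=feat.dtype),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0) + flow  # (B, 2, H, W)
    # Normalize to [-1, 1], the coordinate range grid_sample expects.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((grid_x, grid_y), dim=-1)        # (B, H, W, 2)
    # Bilinear sampling pulls neighbour features onto the current frame.
    return F.grid_sample(feat, norm_grid, align_corners=True)
```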
This research proposes the use of Swin-Transformer, a state-of-the-art Vision Transformer, as an alternative to CNN-based backbone networks for object detection in videos. The proposed architecture enhances the accuracy of existing VID methods by leveraging the capabilities of Swin-Transformer to capture spatial and temporal information effectively. The Swin-Transformer provides a self-attention mechanism that enables efficient feature aggregation across video frames, addressing challenges such as motion blur, occlusion, and pose variations.
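A minimal sketch of this idea is given below, assuming a torchvision Swin-T backbone for per-frame features and a single multi-head self-attention layer for temporal aggregation. The class name, the mean-pooling of spatial tokens, and the four-frame window are simplifying assumptions made for illustration; they do not reproduce the thesis implementation.

```python
# Hedged sketch: Swin features per frame + temporal self-attention.
import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights


class SwinVideoFeatures(nn.Module):
    """Per-frame Swin feature maps aggregated across a frame window."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8):
        super().__init__()
        # weights=None keeps the sketch offline; use
        # Swin_T_Weights.DEFAULT for ImageNet-pretrained weights.
        backbone = swin_t(weights=None)
        self.features = backbone.features  # outputs (B, H/32, W/32, C)
        self.temporal_attn = nn.MultiheadAttention(
            embed_dim, num_heads, batch_first=True
        )

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (T, 3, H, W) -- a short window of video frames
        fmap = self.features(clip)                  # (T, h, w, C)
        t, h, w, c = fmap.shape
        tokens = fmap.reshape(t, h * w, c)
        # Pool each frame to one descriptor, then let self-attention
        # mix information across the T frames of the window.
        frame_tokens = tokens.mean(dim=1).unsqueeze(0)        # (1, T, C)
        agg, _ = self.temporal_attn(frame_tokens, frame_tokens, frame_tokens)
        return agg.squeeze(0)                       # (T, C) aggregated


frames = torch.randn(4, 3, 224, 224)               # 4 consecutive frames
print(SwinVideoFeatures()(frames).shape)           # torch.Size([4, 768])
```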
Evaluation is carried out on the ImageNet VID and EPIC-KITCHENS datasets, two widely recognized benchmarks in VID research. Our experiments show that the proposed technique attains 84.3% mean average precision (mAP) on the ImageNet VID benchmark, outperforming other recent state-of-the-art VID methods while consuming fewer computational resources.
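For reference, the mAP metric reported above can be computed with off-the-shelf tooling such as torchmetrics. The snippet below is a toy example with made-up boxes and labels, shown only to illustrate the metric's interface; it does not reproduce the reported 84.3%, and the thesis does not specify its evaluation code.

```python
# Hedged sketch of mAP computation with torchmetrics (toy data).
import torch
from torchmetrics.detection import MeanAveragePrecision

# ImageNet VID is conventionally scored at an IoU threshold of 0.5.
metric = MeanAveragePrecision(iou_thresholds=[0.5])

preds = [{
    "boxes": torch.tensor([[25.0, 30.0, 200.0, 180.0]]),  # xyxy format
    "scores": torch.tensor([0.91]),
    "labels": torch.tensor([3]),                           # toy class id
}]
targets = [{
    "boxes": torch.tensor([[20.0, 28.0, 198.0, 185.0]]),
    "labels": torch.tensor([3]),
}]

# In practice, update() is called once per frame's detections.
metric.update(preds, targets)
print(metric.compute()["map_50"])
```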
The source code for our proposed method is available in a GitHub ( ) repository, allowing other researchers and practitioners to replicate and build upon our work. This work contributes to advancing the field of object detection in videos by leveraging the capabilities of Vision Transformers and improving the accuracy and efficiency of existing VID techniques.
In conclusion, the proposed deep learning-based object detection approach, utilizing the Swin-Transformer architecture, demonstrates significant advancements in video object detection. The enhanced accuracy, efficient feature aggregation, and reduced memory requirements make our method a promising solution for various real-world applications in healthcare, surveillance, autonomous vehicles, and video analysis.