Title
Hadoop Performance Enhancement for Big Data Including Small Files /
Author
Abdalla, Tharwat EL-Sayed Ismail.
Committee
Researcher / Tharwat EL-Sayed Ismail Abdalla
Supervisor / Ayman EL-Sayed Ahmed EL-Sayed
Examiner / Ashraf Bahgat Ibrahim El-Sisi
Examiner / Gamal Mahrous Attiya
Subject
Big data; File organization (Computer science); Electronic data processing - Distributed processing.
Publication date
2019.
Number of pages
73 p.
Language
English
Degree
Master's
Specialization
Computer Science
Award date
13/7/2019
Awarding institution
Menoufia University - Faculty of Electronic Engineering - Department of Computer Science and Engineering
Contents
Only 14 pages (of 90) are available for public view.

Abstract

Hadoop is a distributed computing framework, written in Java, that is used to process big data. It is widely used because of its ease of programming, scalability, and availability. The Hadoop Distributed File System (HDFS) and Hadoop MapReduce are two important components of Hadoop: MapReduce processes the data stored in HDFS.

With the explosive growth of cloud computing, an increasing number of business and scientific applications need to take advantage of Hadoop, and the files processed in Hadoop are no longer limited to very large ones. Large numbers of small files in both business and scientific areas are processed by MapReduce, such as document files, bioinformatics files, and geographic information files. In this situation, the MapReduce performance of Hadoop is severely degraded. Although Hadoop and other frameworks provide some MapReduce strategies, they are not designed specifically for small files, and handling small files (bytes or kilobytes in size) causes performance problems in Hadoop. Several approaches are used to overcome the storage and access-efficiency problems of small files in Hadoop: Hadoop Archive files, federated NameNodes, changing the ingestion process/interval, batch file consolidation, Sequence files, HBase, and S3DistCp. Each of these approaches has advantages, but each also has limitations in solving the small-files problem in Hadoop.

In this thesis, we propose an enhancement of the Sequence files approach called the Small Files Search and Aggregation Node (SFSAN) approach. The results show that the SFSAN approach overcomes some of the limitations of the Sequence files approach while keeping its advantages.
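The core idea behind the Sequence files approach mentioned above is to pack many small files into one large container file of (filename, contents) records, so that HDFS stores and MapReduce reads a few large files instead of many tiny ones. The following is a minimal conceptual sketch of that packing idea in Python; it is an illustration only, not the actual Hadoop `org.apache.hadoop.io.SequenceFile` API or the thesis's SFSAN implementation, and the record layout shown (length-prefixed name and data) is an assumption chosen for clarity:

```python
import io
import struct

def pack(files):
    """Pack a {name: bytes} mapping into one container blob.

    Each record is laid out as:
      [4-byte name length][name bytes][4-byte data length][data bytes]
    """
    buf = io.BytesIO()
    for name, data in files.items():
        encoded = name.encode("utf-8")
        buf.write(struct.pack(">I", len(encoded)))  # big-endian name length
        buf.write(encoded)
        buf.write(struct.pack(">I", len(data)))     # big-endian data length
        buf.write(data)
    return buf.getvalue()

def unpack(blob):
    """Recover the original {name: bytes} mapping from a packed blob."""
    files, pos = {}, 0
    while pos < len(blob):
        (nlen,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        name = blob[pos:pos + nlen].decode("utf-8")
        pos += nlen
        (dlen,) = struct.unpack_from(">I", blob, pos)
        pos += 4
        files[name] = blob[pos:pos + dlen]
        pos += dlen
    return files

# Many small files become one contiguous blob, and round-trip losslessly.
small_files = {"a.txt": b"hello", "b.txt": b"world"}
container = pack(small_files)
assert unpack(container) == small_files
```

One known limitation of plain Sequence files, which SFSAN targets, is that retrieving a single small file requires scanning records sequentially; a real implementation would add an index from filename to byte offset to allow direct seeks.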