Author: Sakr, Mohamed Saber Abd El-Rahman./ Title: Parallel Outlier Detection on Big Data /

Title

Parallel Outlier Detection on Big Data /

Author

Sakr, Mohamed Saber Abd El-Rahman.

Preparation committee

Researcher / Mohamed Saber Abd El-Rahman Sakr

Supervisor / عربي السيد كشك

Supervisor / وليد سعيد عطوه

Subject

Parallel computers. Parallel computers - Programming.

Publication date

2019.

Number of pages

117 p. :

Language

English

Degree

Doctorate

Specialization

Computer Science

Date of approval

21/7/2019

Place of approval

Menoufia University - Faculty of Computers and Information - Computer Science

Abstract

Anomaly detection, or outlier detection, has become a major research problem
in the era of big data. It is an important and widely investigated research
topic, used in many applications such as fraud detection in credit card and
bank transactions, network intrusion detection systems, and noise removal. It
is considered an essential means of protecting and securing private and public
property. Although outlier detection has been studied for many years and many
algorithms have been proposed, the explosive growth in data volume and types
poses major challenges for outlier detection systems. This creates a need for
more intelligent outlier detection algorithms that can deal with the new
characteristics of big data.
This thesis starts by presenting a review of existing outlier detection
techniques, covering the categories, pros, and cons of each. Afterward, four
outlier detection algorithms are proposed for two types of big data (static
data and streamed data) and two types of processing (single-machine processing
and parallel distributed processing). The first algorithm addresses the problem
of finding outliers in big static data in a distributed environment. It is
based on a grid algorithm that partitions the data so as to minimize
communication between processing nodes, by grouping points located near each
other on the same processing node. It uses the Local Outlier Factor (LOF)
algorithm to detect outliers. The second algorithm focuses on the distribution
of the data in the first algorithm: it solves the problem of unbalanced
distribution of the data between processing nodes, which maximizes the
utilization of the processing nodes. The third algorithm detects outliers
online in streamed data processed by a single machine with bounded memory. It
is based on summarizing the old data using a genetic algorithm that minimizes
the difference between the distribution of the old data and that of the new
summarized data. The fourth algorithm detects outliers in streamed data in
parallel in a distributed environment. It uses a sliding window technique to
split the data for online processing and then distributes the data between
processing nodes so that each node can calculate the LOF in parallel.
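As a rough illustration of combining a sliding window with LOF on streamed data, the following minimal sketch keeps only the most recent points and scores each new arrival against them. It is not the thesis's algorithm: the function names `lof_scores` and `sliding_window_outliers`, the brute-force neighbor search, the `threshold` parameter, and the single-process execution are all assumptions made for illustration, and no distribution across nodes is attempted.

```python
import math
from collections import deque


def lof_scores(points, k):
    """Brute-force Local Outlier Factor for a small list of points."""
    n = len(points)
    if n <= k:
        raise ValueError("need more than k points")
    neighbors = []  # indices of the k nearest neighbors of each point
    k_dist = []     # distance from each point to its k-th nearest neighbor
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append([j for _, j in dists[:k]])
        k_dist.append(dists[k - 1][0])
    lrd = []        # local reachability density of each point
    for i, p in enumerate(points):
        reach = [max(k_dist[j], math.dist(p, points[j])) for j in neighbors[i]]
        mean_reach = sum(reach) / k
        lrd.append(1.0 / mean_reach if mean_reach > 0 else math.inf)
    scores = []
    for i in range(n):
        if math.isinf(lrd[i]):
            # Simplification (assumption): exact duplicates are treated as inliers.
            scores.append(1.0)
        else:
            scores.append(sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]))
    return scores


def sliding_window_outliers(stream, window_size, k, threshold):
    """Flag each arriving point whose LOF within the current window exceeds threshold."""
    window = deque(maxlen=window_size)  # oldest points fall out automatically
    flagged = []
    for t, point in enumerate(stream):
        window.append(point)
        if len(window) <= k:
            continue  # not enough history yet to form k neighbors
        if lof_scores(list(window), k)[-1] > threshold:
            flagged.append((t, point))
    return flagged


stream = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10), (0, 0.5)]
flagged = sliding_window_outliers(stream, window_size=6, k=2, threshold=2.0)
```

Recomputing LOF for the whole window on every arrival is quadratic in the window size; it is used here only to keep the sketch short.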
Finally, the performance of the four proposed algorithms was evaluated through
a series of simulation experiments on real data sets.
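The grid-based partitioning idea mentioned in the abstract for the first two algorithms can be sketched as follows, under stated assumptions. Points are bucketed into fixed-size grid cells, and runs of adjacent cells are handed to the same node so that nearby points stay together. The names `grid_cell` and `partition_by_grid`, the `cell_size` parameter, and the simple "fair share" balancing rule are hypothetical; the abstract does not specify how the thesis assigns cells to nodes or balances load.

```python
import math
from collections import defaultdict


def grid_cell(point, cell_size):
    """Map a point to the integer coordinates of the grid cell containing it."""
    return tuple(math.floor(x / cell_size) for x in point)


def partition_by_grid(points, cell_size, num_nodes):
    """Group points by grid cell, then give each node a contiguous run of cells."""
    cells = defaultdict(list)
    for p in points:
        cells[grid_cell(p, cell_size)].append(p)
    # Hypothetical balancing rule: walk the cells in sorted order and move to
    # the next node once the current one holds its fair share of points.
    target = math.ceil(len(points) / num_nodes)
    assignment = [[] for _ in range(num_nodes)]
    node = 0
    for cell in sorted(cells):
        if len(assignment[node]) >= target and node < num_nodes - 1:
            node += 1
        assignment[node].extend(cells[cell])
    return assignment


points = [(0.1, 0.2), (0.3, 0.4), (5.1, 5.2), (5.3, 5.4)]
parts = partition_by_grid(points, cell_size=1.0, num_nodes=2)
```

Each node could then run LOF on its own points. This sketch ignores points near cell boundaries, whose nearest neighbors may sit on another node; an exact distributed LOF would need to handle those cases.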