Abstract
Anomaly detection, or outlier detection, has become a major research problem in the era of big data and has been widely investigated. It is used in many applications, such as fraud detection in credit card and bank transactions, network intrusion detection systems, and noise removal, and it is considered essential for protecting and securing private and public property. Although outlier detection has been studied for many years and many algorithms have been proposed, the explosive growth in data volume and types poses major challenges for outlier detection systems. This creates a need for more intelligent outlier detection algorithms that can cope with the new characteristics of big data. This thesis begins with a review of existing outlier detection techniques, presenting the categories, advantages, and disadvantages of each. Four outlier detection algorithms are then proposed, covering two types of big data (static and streamed) and two types of processing (single-machine and parallel distributed). The first algorithm finds outliers in big static data in a distributed environment. It is based on a grid algorithm that partitions the data so as to minimize communication between processing nodes, by assigning points located near each other to the same processing node, and it uses the Local Outlier Factor (LOF) algorithm to detect outliers. The second algorithm addresses the data distribution of the first algorithm: it solves the problem of unbalanced distribution of data between processing nodes, which maximizes the utilization of the processing nodes. The third algorithm detects outliers online in streamed data processed on a single machine with bounded memory.
It is based on summarizing the old data using a genetic algorithm that minimizes the difference between the distribution of the old data and that of the new summarized data. The fourth algorithm detects outliers in streamed data in parallel in a distributed environment. It uses a sliding window technique to split the data for online processing and then distributes the data between processing nodes so that each node can calculate the LOF in parallel. Finally, the performance of the four proposed algorithms was evaluated through a series of simulation experiments on real data sets.
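
The abstract names two building blocks without giving implementation details: grid-based partitioning of nearby points and LOF scoring. The following is a minimal single-machine sketch of both ideas, not the thesis's algorithms. The function names `grid_partition` and `lof_scores`, the `cell_size` parameter, and the sample points are all assumptions made for illustration. The LOF routine is a brute-force version that takes exactly `k` neighbours (ties at the k-distance are ignored) and assumes the input contains no duplicate points.

```python
import math
from collections import defaultdict


def grid_partition(points, cell_size):
    """Group points by the grid cell that contains them (hypothetical helper).

    Points that fall in the same cell could be sent to the same processing
    node, so that most neighbour lookups stay local to that node.
    """
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / cell_size) for c in p)
        cells[key].append(p)
    return dict(cells)


def lof_scores(points, k):
    """Return the Local Outlier Factor of every point (brute force, O(n^2)).

    Simplifications (assumptions of this sketch): exactly k neighbours are
    used per point, and the input must not contain duplicate points.
    """
    n = len(points)
    # k nearest neighbours of each point as sorted (distance, index) pairs.
    neighbours = []
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        neighbours.append(dists[:k])
    # k-distance: distance from each point to its k-th nearest neighbour.
    k_dist = [nb[-1][0] for nb in neighbours]
    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: mean ratio of the neighbours' density to the point's own density.
    return [
        sum(lrd[j] / lrd[i] for _, j in neighbours[i]) / len(neighbours[i])
        for i in range(n)
    ]


if __name__ == "__main__":
    sample = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
    print(grid_partition(sample, cell_size=5))
    print(lof_scores(sample, k=2))
```

In this sample, the four clustered points share one grid cell and receive LOF scores close to 1, while the isolated point lands in its own cell and receives the largest score. A distributed version would additionally have to handle points near cell borders, whose neighbours may live on another node; the abstract does not say how the thesis handles this case.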