Search In this Thesis
   Search In this Thesis  
العنوان
A novel hybrid bitmap-based scalable parallel classifier/
الناشر
Ghada Hamed Hesain Drahim,
المؤلف
Drahim,Ghada Hamed Hesain.
الموضوع
Computer Science & Automatic Control.
تاريخ النشر
2008 .
عدد الصفحات
86 P.:
الفهرس
يوجد فقط 14 صفحة متاحة للعرض العام

from 102

from 102

المستخلص

Classification is an important data mining task with a broad range of applications. Classification has been studied extensively. Algorithms for classification can bc categorized as decision tree based methods and non-decision tree based methods. Many algorithms have been used in constructing decision trees [1]. Non-decision tree based methods include Neural Networks (NN), Genetic Programming (GP) and classification by association rules. Although classification is a well studied task, there are some major problems affecting the performance of most existing classification algorithms. For example, most of the current decision tree algorithms need an in-memory data structure to achieve efficiency. This limits their suitability for mining large databases. Also, they need a pruning phase to reduce large tree size and to handle data overfitting problems. The pruning phase may cause problems to the classification accuracy due to its ignorance to the effect of small data sets. On the other hand, each type of the non decision tree algorithms has its own problems. For example, algorithms that are based on GP use huge search space. and if the training data is large, an enormous amount of memory is required to process large populations of trees that contain many nodes. Also, the fitness evaluation is a very expensive operation with respect to the time dimension. Other algorithms, like algorithms that are based on NN suffer from the inability to backtrack the decision making process a decision is dependent on the initialization of the network and the training set. Another drawback is the extremely long training time even for a small database. Also, for a largc database, a more sophisticated neural network should be used in order to improve th accuracy.
In this thesis, a novel representation approach of the structured data under study is proposed. The approach scans a structured training data set once and creates an encoded list of records by using a bitmap representation of each attribute. Thus, it does not require the residence of large data structures in memory. This will reduce strongly the time complexity when mining large data.A bitmap-based scalable parallel classifier (BSPC), employing the above approach. is next developed. It overcomes most of the above mentioned classification problems. Performance analysis demonstrates that the BSPC outperforms other state-of-the-art algorithms. The superiority of the novel algorithm is demonstrated through the classification of Wisconsin breast cancer dataset where this data is clean and doesn’t need the pruning phase.
To enhance the performance of BSPC and complete the elimination of the above mentioned classification problems, this work presents another classifier which is a scalable and parallel hybrid decision tree! (GP) algorithm. This algorithm executes in two phases. The first phase uses BSPC to find rules that cover the majority of samples in the training data set. The second phase uses a GP algorithm to find a tree that covers the remaining (hard) samples. This second phase benefits from the global search performed by the GP and copes better with attributes interaction than conventional greedy decision treeinduction algorithms. It has been implemented on an Intel Core 2 Quad processors 2.4 Gll/ desk top computer. where the scalability and speedup features have been demonstrated. and shown to be consistent with the theoretical study.