Author: Amasha, Nahed Ali Mohamed Ali./ Title: An intelligent agent system for data cleaning as a phase to support knowledge acquisition /

Title

An intelligent agent system for data cleaning as a phase to support knowledge acquisition /

Author

Amasha, Nahed Ali Mohamed Ali.

Committee

Researcher / Nahed Ali Mohamed Ali Amasha

Supervisor / امانى فوزى محمد الجمل

Supervisor / نبيل عبد المحسن احمد موسى

Examiner / عطا ابراهيم امام الألفى

Subject

Electronic data processing - Data preparation. Electronic data processing - Quality control.

Publication date

2014.

Number of pages

125 p.

Language

English

Degree

Master's

Specialization

Computer Science

Date of approval

1/1/2014

Place of approval

Mansoura University - Faculty of Specific Education - Computer Teacher Preparation

Abstract

Data cleaning is a complex process that draws on several technical specializations to resolve contradictions among different data sources. It is a real challenge for most organizations that need to improve the quality of their data. Data quality in data stores needs to be improved when there are errors in the input data, abbreviations, or differences in the archives derived from several databases into one source. Data cleaning is therefore one of the most challenging stages in cleaning archives, because it deals with detecting and removing errors, filling in missing values, smoothing noisy data, identifying or removing outliers, resolving inconsistencies, and detecting and eliminating duplicated data, all to improve the quality of the data gathered from distributed sources. Duplicate elimination, which detects and cleans data that refer to the same real-world entity in a single database, is very important: it speeds up the data cleaning process, reduces its complexity, and improves the quality of data in data warehouses. It is particularly crucial for drawing correct conclusions from data in decision support systems (DSS).

This research presents an application of a general framework to improve the data cleaning process. The framework consists of six steps: selection of attributes, formation of tokens, selection of the clustering algorithm, similarity computation, selection of the elimination function, and merge. First, attribute selection picks the most suitable attributes according to the attribute selection criteria. Second, a token formation algorithm produces short tokens, which give better results in record comparisons. Next, the clustering algorithm groups the records using the Sorted Neighborhood Method (SNM), and similarity among records is computed using Edit Distance (ED).
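The token, clustering, and similarity steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation, which was written in C# on SQL Server. The token rule (first three alphanumeric characters of each selected attribute), the window size, the similarity threshold, and all function and field names are assumptions chosen only for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed one table row at a time."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalised to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def make_token(record: Record, attributes: List[str]) -> str:
    """Hypothetical token rule: first 3 alphanumeric chars per attribute."""
    parts = []
    for name in attributes:
        cleaned = "".join(c for c in record.get(name, "") if c.isalnum())
        parts.append(cleaned[:3].upper())
    return "".join(parts)


def record_text(record: Record, attributes: List[str]) -> str:
    """Selected attribute values joined into one lower-case string."""
    return " ".join(record.get(name, "") for name in attributes).lower()


def sorted_neighborhood(records: List[Record], attributes: List[str],
                        window: int = 3,
                        threshold: float = 0.8) -> List[Tuple[int, int]]:
    """Sort records by token, then compare each record only with the
    next (window - 1) records in sorted order."""
    keyed = sorted((make_token(r, attributes), i)
                   for i, r in enumerate(records))
    pairs = []
    for pos, (_, idx) in enumerate(keyed):
        for _, other in keyed[pos + 1: pos + window]:
            score = similarity(record_text(records[idx], attributes),
                               record_text(records[other], attributes))
            if score >= threshold:
                pairs.append((min(idx, other), max(idx, other)))
    return pairs


sample = [
    {"Name": "John Smith", "Address": "12 Nile St"},
    {"Name": "Mona Adel", "Address": "5 Tahrir Sq"},
    {"Name": "Jon Smith", "Address": "12 Nile St."},
]
print(sorted_neighborhood(sample, ["Name", "Address"], window=2))
```

Sorting by a short token places likely duplicates next to each other, so the edit-distance comparison is applied only inside a small sliding window rather than to every pair of records.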
Duplicate elimination is then performed with a rule-based approach that detects duplicates and eliminates the low-quality ones. Finally, the cleaned data is merged as a cluster.

The data cleaning framework needs functional elements to execute the cleaning process. These are represented as an architecture model, which is turned into knowledge processing elements called agents. Agents are intelligent software that achieves rationality through communication, co-operation, learning, autonomy, and flexibility, and that merges algorithms to execute the cleaning process. Thus, this research proposes an intelligent agent (IA) for data cleaning. The proposed intelligent agent system (IAS) for data cleaning was implemented on a PC with an Intel Core i5 processor at 2.35 GHz and 4 GB of RAM (2.92 GB usable), running the MS Windows 7 (32-bit) operating system, using SQL Server Management Studio and Microsoft Visual C#.
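The rule-based elimination and merge steps mentioned above might look roughly like the following Python sketch. The abstract does not state the actual rules, so the quality rule used here (the record with the most non-empty fields wins) and the merge rule (empty fields of the surviving record are filled from the other duplicates) are assumptions, as are all function names.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def completeness(record: Record) -> int:
    """Hypothetical quality rule: number of non-empty fields."""
    return sum(1 for value in record.values() if value.strip())


def cluster_duplicates(n: int,
                       pairs: List[Tuple[int, int]]) -> List[List[int]]:
    """Group record indices into clusters of mutual duplicates
    using a small union-find structure."""
    parent = list(range(n))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    clusters: Dict[int, List[int]] = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def eliminate_and_merge(records: List[Record],
                        pairs: List[Tuple[int, int]]) -> List[Record]:
    """Keep one record per cluster: the most complete one, with its
    empty fields filled from the other members of the cluster."""
    cleaned = []
    for members in cluster_duplicates(len(records), pairs):
        best = max(members, key=lambda i: completeness(records[i]))
        merged = dict(records[best])
        for i in members:
            for field, value in records[i].items():
                if not merged.get(field, "").strip() and value.strip():
                    merged[field] = value
        cleaned.append(merged)
    return cleaned


sample = [
    {"Name": "John Smith", "Email": ""},
    {"Name": "Mona Adel", "Email": "mona@example.com"},
    {"Name": "Jon Smith", "Email": "jon@example.com"},
]
# Records 0 and 2 are assumed to have been flagged as duplicates already.
print(eliminate_and_merge(sample, [(0, 2)]))
```

Clustering the flagged pairs first means that a chain of duplicates (A matches B, B matches C) collapses into a single surviving record instead of being resolved pair by pair.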
The experimental work uses a dataset containing 11 attributes, including ID, Name, Address, Birth Date, Phone, Company_name, Postal code, Phone 2, Email, and Web. It consists of 2500 records, which are reduced to 1214 after the system is applied, and the reported accuracy of the proposed system is 98.6%.
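As a quick check on the record counts above, the share of records removed follows directly from the two figures. This is a reduction ratio only; the abstract does not say how the accuracy figure was measured, so it cannot be derived from these counts.

```latex
\[
2500 - 1214 = 1286 \ \text{records removed}, \qquad
\frac{1286}{2500} = 0.5144 \approx 51.4\%
\]
```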