Title
Filtering Web Site Schema Generated by Web Data Extraction Systems
Author
Shehata, Marwa Hashem.
Preparation committee
Researcher / Marwa Hashem Shehata
Supervisor / احمد عبد القادر رمضان
Subject
Data mining. Web databases.
Publication date
2013.
Number of pages
64 p.
Language
English
Degree
Master's
Specialization
Information Systems
Date of approval
1/1/2013
Place of approval
Beni-Suef University - Faculty of Science - Mathematics

Abstract

Web data extraction approaches aim not only to extract data embedded in pages of a web site but also to detect the schema of this site. Detecting the schema of a web site has been a key step for value-added services on the web such as comparative shopping and information integration systems.
Many web data extraction systems have been developed to detect this schema. For a real web site, however, the site schema is complex, so the schema detected by such systems is imperfect. This work aims to filter the schema detected by one such web data extraction system, FiVaTech.
The thesis makes three main contributions. First, it discusses two common problems that may appear in the schema detected by data extraction systems: incorrect and incomplete schema types. Second, it compares different schema types by constructing a classifier for each schema type. The classifier of a schema type can also be used to decide whether a data value in a web page is an instance of that type; i.e., the classifier can serve as an extractor for web data extraction systems. Given some instances of a schema type, we exploit the HTML tag contents, the structural information of the DOM trees, and the visual information of these instances to construct the classifier. Third, we use the constructed classifiers to filter the detected schema and obtain a schema that is as close to perfect as possible.
The experiments show encouraging results on the schemas of the test web sites.
Note: The results in chapters four and five are published in the International Journal of Information Processing and Management (IJIPM).