Author: Badr, Mohamed Gamal Mohamed. / Title: Fast Similarity Search for Protein and DNA Sequences

Title

Fast Similarity Search for Protein and DNA Sequences

Author

Badr, Mohamed Gamal Mohamed.

Committee

Researcher / Mohamed Gamal Mohamed Badr

Supervisor / سهير أحمد فؤاد بسيونى

saf@alex.edu.eg

Examiner / محمد عبد الحميد اسماعيل

Examiner / صالح عبد الشكور الشهابى

Subject

Computer Science.

Publication date

2014.

Number of pages

70 p.

Language

English

Degree

Master's

Specialization

Engineering (miscellaneous)

Approval date

1/2/2013

Place of approval

Alexandria University - Faculty of Engineering - Computer and Systems Engineering

Abstract

Protein function prediction is a fundamental task in computational biology with many practical applications. It plays a critical role in drug design, where a target protein is identified by its function and that function is then moderated or blocked. Advances in genome sequencing technology have led to rapid growth in the size of protein sequence databases, and a significant portion of the sequences in these databases still have no characterized function. Manual analysis techniques usually predict protein function with high accuracy, but the huge amount of sequence data has made manual analysis tedious and cumbersome. Hence, a number of computational methods have been developed for predicting protein function. These methods draw on different sources of information, including protein homology, protein interaction network analysis, gene expression analysis, and text mining of the literature. The most prevalent methods for protein function prediction are those based on protein homology. The idea behind these methods is that, given a newly sequenced protein (a query), we search a database of well-characterized proteins (proteins whose function and other information are recorded) and retrieve the database proteins homologous to the query. Homology is usually inferred from protein sequence similarity, so homologous proteins are detected by scoring the similarity of the query against the database sequences. After homologous proteins are detected, functional information is transferred from the database sequences to the query according to the level of homology. An important challenge is detecting homology when pairwise similarity is low; this problem is called remote homology detection, and many methods have been developed to solve it.
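The homology-based workflow described above (score the query against every database sequence, then transfer the annotation of the best hit) can be illustrated with a minimal sketch. It assumes a plain Smith-Waterman local alignment with a simple match/mismatch score and a linear gap penalty; the scoring values, the `TOY_DATABASE` entries, and the `min_score` threshold are all hypothetical and are not taken from the thesis.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def transfer_annotation(query, database, min_score):
    """Return (name, function, score) of the best hit, or None if no hit reaches min_score."""
    best = None
    for name, (sequence, function) in database.items():
        score = local_alignment_score(query, sequence)
        if score >= min_score and (best is None or score > best[2]):
            best = (name, function, score)
    return best


# Hypothetical toy database: name -> (sequence, recorded function).
TOY_DATABASE = {
    "protA": ("MKTAYIAKQR", "kinase"),
    "protB": ("GGSWLLPPAA", "transporter"),
}

if __name__ == "__main__":
    print(transfer_annotation("KTAYIAK", TOY_DATABASE, min_score=10))
```

Real search tools use substitution matrices, affine gap penalties, and statistical significance estimates rather than a fixed raw-score threshold; the sketch only shows the score-then-transfer structure.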
Profile-based methods are usually used for remote homology detection. In these methods, a profile is created for the query and scored against the database sequences. Profile-profile methods extend this approach: profiles are created both for the query and for clusters of closely related sequences in the database, and these profiles are then compared. HHsearch, a remote protein homology detection method based on comparing two profile hidden Markov models (HMMs), achieves relatively higher sensitivity than other remote homology detection methods in the literature. However, HHsearch uses a dynamic programming algorithm to compare the two HMMs, which makes it computationally intensive. To address this problem, we have developed SHsearch, a faster alternative to HHsearch that significantly reduces computational time with minimal loss of sensitivity. SHsearch compares only the most important sub-models rather than the two complete models, as HHsearch does. The results show a speedup of 88X for SHsearch relative to HHsearch, with a sensitivity loss of 8.2 at an error rate of 10, which is deemed acceptable.
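To make the notion of scoring a profile against a database sequence concrete, the following sketch builds a simple position-specific scoring matrix from a few ungapped, equal-length sequences and slides it over a target sequence. This is a deliberately simplified stand-in: it is not a profile HMM, and it is not the HHsearch or SHsearch algorithm. The uniform background, the pseudocount value, and the example sequences are assumptions made for illustration only.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(aligned, pseudocount=1.0):
    """Build log-odds column scores from ungapped sequences of equal length."""
    background = 1.0 / len(ALPHABET)  # assumed uniform background frequency
    profile = []
    for col in range(len(aligned[0])):
        counts = {aa: pseudocount for aa in ALPHABET}
        for seq in aligned:
            counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append(
            {aa: math.log2((counts[aa] / total) / background) for aa in ALPHABET}
        )
    return profile


def best_window_score(profile, sequence):
    """Best ungapped score of the profile over all windows of the sequence.

    Returns -inf when the sequence is shorter than the profile.
    """
    width = len(profile)
    best = -math.inf
    for start in range(len(sequence) - width + 1):
        score = sum(profile[k][sequence[start + k]] for k in range(width))
        best = max(best, score)
    return best


if __name__ == "__main__":
    profile = build_profile(["ACDE", "ACDE", "ACDF"])  # hypothetical family
    print(best_window_score(profile, "GGACDEGG"))
    print(best_window_score(profile, "GGGGGGGG"))
```

A sequence containing the conserved motif scores well above an unrelated one, which is the property that profile methods exploit when pairwise similarity alone is too weak.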