Probabilistic context-free grammar for pattern detection in protein sequences

Dyrka, Witold (2007) Probabilistic context-free grammar for pattern detection in protein sequences. (MSc(R) thesis), Kingston University.

Full text not available from this archive.

Abstract

Analysis of protein sequences to predict their functions is a very challenging problem, for which pattern recognition techniques based on Hidden Markov Models (HMMs) have proved to be the most efficient. However, HMMs have limitations: according to formal language theory, their expressive power is similar to that of Probabilistic Regular Grammars (PRGs). Here, we propose a pattern recognition method based on a more powerful grammar. We developed a Probabilistic Context-Free Grammar (PCFG) based system to detect protein regions that are involved in binding sites. To deal with the size of the protein alphabet, we use quantitative properties of amino acids to reduce the number of rules. The grammars based on different properties are then combined to retain as much information as possible. To increase the number of symbols while keeping the rule set at a maintainable size, we imposed some structural constraints on the grammars. Moreover, to deal with motifs of variable length, we implemented a window-independent scoring scheme for parsing. The PCFGs can then be generated by an evolutionary process, which was customised to PCFG induction by implementing a diversity measure based on the weighted Hamming distance. Our PCFGs proved their ability to detect binding sites with high accuracy. They achieved very good results for protein sequence annotation and binding site localisation. We also showed that some features of protein patterns could be better represented by a PCFG than by a PRG. This confirms our initial assumption that binding site detection benefits from the expressive power provided by a context-free language. Finally, the results suggest that, unlike current state-of-the-art methods, our system would be particularly suited to deal with patterns shared by non-homologous proteins.
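The two ideas at the core of the abstract, reducing the amino-acid alphabet through a residue property and scoring a sequence with a PCFG, can be illustrated with a minimal sketch. It is not the grammar, the property classes, or the scoring scheme used in the thesis, none of which are given in this record. The two-class hydrophobic/other split, the toy grammar, and all names (`HYDROPHOBIC`, `to_classes`, `BINARY`, `LEXICAL`, `inside_probability`) are assumptions made for illustration. The grammar generates nested patterns of the form h^n p^n, a dependency that a regular grammar cannot express, and the standard inside algorithm computes the probability of a sequence under it.

```python
from collections import defaultdict

# Hypothetical two-class reduction of the amino-acid alphabet (illustration only).
HYDROPHOBIC = set("AVLIMFWC")


def to_classes(seq):
    """Map each residue to 'h' (hydrophobic) or 'p' (any other residue)."""
    return ["h" if aa in HYDROPHOBIC else "p" for aa in seq]


# Toy PCFG in Chomsky normal form (an assumption, not the thesis grammar).
# It generates h^n p^n: S -> H P | H T, T -> S P, H -> 'h', P -> 'p'.
BINARY = {
    ("S", "H", "T"): 0.6,
    ("S", "H", "P"): 0.4,
    ("T", "S", "P"): 1.0,
}
LEXICAL = {
    ("H", "h"): 1.0,
    ("P", "p"): 1.0,
}


def inside_probability(symbols, start="S"):
    """Inside algorithm: probability that `start` derives `symbols`."""
    n = len(symbols)
    if n == 0:
        return 0.0
    # chart[i][j][A] = probability that non-terminal A derives symbols[i..j]
    chart = [[defaultdict(float) for _ in range(n)] for _ in range(n)]
    for i, sym in enumerate(symbols):
        for (lhs, terminal), prob in LEXICAL.items():
            if terminal == sym:
                chart[i][i][lhs] += prob
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for (lhs, left, right), prob in BINARY.items():
                    p_left = chart[i][k].get(left, 0.0)
                    p_right = chart[k + 1][j].get(right, 0.0)
                    if p_left and p_right:
                        chart[i][j][lhs] += prob * p_left * p_right
    return chart[0][n - 1].get(start, 0.0)


if __name__ == "__main__":
    for peptide in ("AVST", "AS", "ASVT"):
        print(peptide, inside_probability(to_classes(peptide)))
```

Under this toy grammar a nested sequence such as `AVST` (classes h h p p) receives a non-zero probability, while an interleaved one such as `ASVT` (h p h p) receives zero. A real system would need far more symbols and rules, and a way to score regions of variable length inside a longer sequence.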

Item Type: Thesis (MSc(R))
Physical Location: This item is held in stock at Kingston University Library.
Research Area: Biological sciences
Faculty, School or Research Centre: Faculty of Computing, Information Systems and Mathematics (until 2011) > Digital Imaging Research Centre (DIRC)
Depositing User: Katrina Clifford
Date Deposited: 17 Apr 2012 15:33
Last Modified: 16 Sep 2013 15:43
URI: http://eprints.kingston.ac.uk/id/eprint/21544
