Probabilistic context-free grammar for pattern detection in protein sequences

Dyrka, Witold (2007) Probabilistic context-free grammar for pattern detection in protein sequences. (MSc(R) thesis), Kingston University.

Abstract

Analysis of protein sequences to predict their functions is a very challenging problem, for which pattern recognition techniques based on Hidden Markov models (HMMs) have proved to be the most efficient. However, HMMs have limitations: according to formal language theory, their expressive power is similar to that of Probabilistic Regular Grammars (PRGs). Here, we propose a pattern recognition method based on a more powerful grammar. We developed a system based on Probabilistic Context-Free Grammars (PCFGs) to detect protein regions that are involved in binding sites. To deal with the size of the protein alphabet, we use quantitative properties of amino acids to reduce the number of rules. The grammars based on different properties are then combined to retain as much information as possible. To increase the number of symbols while keeping the rule set at a manageable size, we imposed some structural constraints on the grammars. Moreover, to deal with motifs of variable length, we implemented a window-independent scoring scheme for parsing. The PCFGs are generated by an evolutionary process, which was customised to PCFG induction by implementing a diversity measure based on the weighted Hamming distance. Our PCFGs proved able to detect binding sites with high accuracy, and they achieved very good results for protein sequence annotation and binding site localisation. We also showed that some features of protein patterns can be represented better by a PCFG than by a PRG. This confirms our initial assumption that binding site detection benefits from the expressive power provided by a context-free language. Finally, the results suggest that, unlike current state-of-the-art methods, our system would be particularly suited to patterns shared by non-homologous proteins.
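As a rough illustration of the general idea, the sketch below scores a sequence with a toy PCFG after mapping amino acids to a reduced alphabet. It uses the standard inside algorithm for a grammar in Chomsky normal form. The three-level hydrophobicity grouping, the rule set, the probabilities and all names are assumptions made for this example; none of them is taken from the thesis, and the sketch does not reproduce its window-independent scoring scheme or its evolutionary induction.

```python
# Hypothetical grouping of the 20 amino acids into three property levels.
# The split below is an illustrative assumption, not the one used in the thesis.
PROPERTY_CLASS = {}
for residues, level in (("AILMFVWC", "h"), ("GTSYPH", "m"), ("RKDENQ", "l")):
    for aa in residues:
        PROPERTY_CLASS[aa] = level

# Toy grammar in Chomsky normal form (made up for this example).
# S -> H T | H H and T -> S H together give S -> H S H, a nested
# (centre-embedded) pattern that a regular grammar cannot express.
BINARY_RULES = {
    ("S", "H", "T"): 0.5,
    ("S", "H", "H"): 0.5,
    ("T", "S", "H"): 1.0,
}
LEXICAL_RULES = {
    ("H", "h"): 0.7,
    ("H", "m"): 0.2,
    ("H", "l"): 0.1,
}


def inside_probability(sequence, start="S"):
    """Return the probability that `start` derives the whole sequence."""
    symbols = [PROPERTY_CLASS[aa] for aa in sequence]
    n = len(symbols)
    # chart[(i, j, A)] is the probability that A derives symbols[i:j].
    chart = {}

    # Spans of length 1: apply the lexical rules.
    for i, sym in enumerate(symbols):
        for (lhs, terminal), p in LEXICAL_RULES.items():
            if terminal == sym:
                key = (i, i + 1, lhs)
                chart[key] = chart.get(key, 0.0) + p

    # Longer spans: sum over every split point and every binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (lhs, left, right), p in BINARY_RULES.items():
                    left_p = chart.get((i, k, left), 0.0)
                    right_p = chart.get((k, j, right), 0.0)
                    if left_p and right_p:
                        key = (i, j, lhs)
                        chart[key] = chart.get(key, 0.0) + p * left_p * right_p

    return chart.get((0, n, start), 0.0)
```

Because the rules are written over property levels rather than over the 20 amino acids, the lexical part of this toy grammar needs three rules per nonterminal instead of twenty, which is the kind of rule-set reduction the abstract describes.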
