PatternHunter

Summary

PatternHunter is commercially available homology search software that uses sequence alignment techniques. It was developed in 2002 by three scientists: Bin Ma, John Tromp and Ming Li.[1]: 440  Their aim was to solve a problem that many investigators face in genomics and proteomics: such studies rely heavily on homology searches that find short seed matches and then extend them into longer alignments. Identifying homologous genes is an essential part of most evolutionary studies and is crucial to understanding the evolution of gene families and the relationship between domains and families.[2]: 7  Homologous genes can be studied effectively only with search tools that find similar regions, or local alignments, between two protein or nucleic acid sequences.[3]: 15  Homology is quantified by scores obtained from the matches, mismatches and gaps in the aligned sequences.[4]: 164 

Development

In comparative genomics, for example, it is necessary to compare huge chromosomes such as those found in the human genome. However, the rapid growth of genomic data creates a dilemma for existing homology search methods: enlarging the seed size lowers sensitivity, while reducing the seed size slows down the computation. Several sequence alignment programs have been developed to determine homology between genes. These include FASTA, the BLAST family, QUASAR, MUMmer, SENSEI, SIM, and REPuter.[1]: 440  They mostly build on the Smith-Waterman alignment technique, which compares bases against other bases but is too slow. BLAST improves on this technique by finding short, exact seed matches that it later joins up to form longer alignments.[5]: 737  However, when dealing with lengthy sequences, these techniques are extremely slow and require considerable amounts of memory. SENSEI is more efficient than the other methods, but its strength lies in ungapped alignments and it handles other forms of alignment poorly. The output of Megablast, on the other hand, is of poor quality, and the program does not adapt well to large sequences. Techniques such as MUMmer and QUASAR employ suffix trees, which are designed to handle exact matches, so these methods apply only to the comparison of highly similar sequences. All of these problems called for a fast, reliable tool that can handle all types of sequences efficiently without consuming too many computing resources.

Approach

PatternHunter uses seeds (short search strings) whose matching positions are optimally spaced. Seed-based searches are extremely fast because they examine homology only in places where hits are found. The sensitivity of a search is strongly influenced by the size of the seed and by the spacing of its positions. Large seeds are unable to find isolated homologies, whereas small ones generate numerous random hits that slow down computation. PatternHunter balances these two effects by optimizing the spacing between the seed positions. It uses k (k = 11) non-consecutive letters as a seed, in contrast with BLAST, which uses k consecutive letters. The first stage in a PatternHunter analysis is a filtering phase in which the program looks for matches at the k spaced positions specified by the most advantageous pattern.[6]: 11  The second stage is the alignment phase, which is identical to that of BLAST. In addition, PatternHunter can use more than one seed at a time. This raises the sensitivity of the tool without reducing its speed.
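
The filtering phase described above can be illustrated with a minimal Python sketch. It is not PatternHunter's actual implementation. The pattern 111010010100110111 (11 matching positions spread over 18 letters) is assumed here to be the weight-11 seed reported in the 2002 paper, since the text above does not spell it out, and the function names seed_key and find_hits are hypothetical.

```python
from collections import defaultdict

# Assumption: weight-11, length-18 spaced seed.
# '1' = the letters must match at this position, '0' = position is ignored.
SPACED_SEED = "111010010100110111"


def seed_key(seq, start, seed):
    """Collect the letters of seq that fall on the '1' positions of seed."""
    return "".join(seq[start + i] for i, c in enumerate(seed) if c == "1")


def find_hits(query, target, seed=SPACED_SEED):
    """Return (query_pos, target_pos) pairs where every '1' position agrees."""
    span = len(seed)
    # Index every window of the target by its seed key.
    index = defaultdict(list)
    for t in range(len(target) - span + 1):
        index[seed_key(target, t, seed)].append(t)
    # Look up every window of the query in that index.
    hits = []
    for q in range(len(query) - span + 1):
        for t in index.get(seed_key(query, q, seed), []):
            hits.append((q, t))
    return hits
```

Passing a seed of eleven consecutive ones ("1" * 11) to find_hits gives BLAST-style consecutive seeding for comparison. In a full search, each hit would then be handed to the alignment phase for extension, which this sketch leaves out.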

Speed

PatternHunter analyzes all types of sequences quickly. On a modern computer, it takes a few seconds to handle prokaryotic genomes, minutes to process Arabidopsis thaliana sequences and several hours to process a human chromosome.[1]: 440  Compared with other tools, PatternHunter is approximately a hundred times faster than BLAST and Megablast,[7] and about 3000 times faster than a Smith-Waterman algorithm. In addition, the program has a user-friendly interface that allows the search parameters to be customized.

Sensitivity

In terms of sensitivity, PatternHunter can attain optimal sensitivity while retaining the speed of a conventional BLAST search.
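
One way to get a feel for the sensitivity of a seed is to estimate how often it hits a random homologous region. The sketch below is a simple Monte Carlo simulation under stated assumptions, not a measurement of PatternHunter itself: the region length of 64, the 70% identity, the trial count, the seed pattern 111010010100110111 and the function names are all illustrative choices that do not come from the text above.

```python
import random

# Assumptions: both seeds have 11 matching positions ("weight 11").
SPACED = "111010010100110111"   # assumed spaced pattern
CONTIGUOUS = "1" * 11           # consecutive seed, as in BLAST


def seed_hits(matches, seed):
    """True if the seed fits somewhere with every '1' position on a match."""
    need = [i for i, c in enumerate(seed) if c == "1"]
    for start in range(len(matches) - len(seed) + 1):
        if all(matches[start + i] for i in need):
            return True
    return False


def estimate_sensitivity(seed, length=64, identity=0.7, trials=20000, rng=None):
    """Fraction of random regions (each position matches with probability
    `identity`) in which the seed produces at least one hit."""
    rng = rng or random.Random(0)
    hits = 0
    for _ in range(trials):
        matches = [rng.random() < identity for _ in range(length)]
        if seed_hits(matches, seed):
            hits += 1
    return hits / trials
```

Calling estimate_sensitivity(SPACED) and estimate_sensitivity(CONTIGUOUS) gives two estimates that can be compared directly. Because both seeds have the same number of matching positions, they produce roughly the same number of random hits, so a difference in the estimates reflects sensitivity gained at a similar computational cost.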

Specifications

PatternHunter is written in Java. Consequently, it runs in any Java 1.4 environment.[7]

Future advances

Homology search is a time-consuming procedure. Challenges remain in DNA-DNA searches as well as translated DNA-protein searches because of the vast size of the databases and the small size of the queries. PatternHunter has been upgraded to PatternHunter II, which speeds up DNA-protein searches a hundredfold without altering sensitivity. There are also plans to improve PatternHunter so that it attains the high sensitivity of the Smith-Waterman algorithm at the speed of BLAST. A translated PatternHunter intended to speed up tBLASTx is also under development.[4]: 174 

References

  1. Ma, Bin; Tromp, John; Li, Ming (2002). "PatternHunter: Faster and More Sensitive Homology Search". Bioinformatics. 18 (3): 440–445. doi:10.1093/bioinformatics/18.3.440. PMID 11934743.
  2. Joseph, Jacob M. (2012). On the identification and investigation of homologous gene families, with particular emphasis on the accuracy of multidomain families (PDF) (PhD). Carnegie Mellon University.
  3. Pevsner, Jonathan (2009). Bioinformatics and Functional Genomics (2nd ed.). New Jersey: Wiley Blackwell. ISBN 9780470451489.
  4. Li, M.; Ma, B.; Kisman, D.; Tromp, J. (2003). "PatternHunter II: Highly sensitive and fast homology search". Genome Informatics. International Conference on Genome Informatics. 14: 164–175. PMID 15706531.
  5. Pearson, W. R. (1991). "Searching protein sequence libraries: Comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms". Genomics. 11 (3): 635–650. doi:10.1016/0888-7543(91)90071-L. PMID 1774068.
  6. Zhang, Louxin. "Sequence Database Search Techniques I: Blast and PatternHunter tools" (PDF). Retrieved 6 December 2013.
  7. "PatternHunter Brochure" (PDF). Archived from the original (PDF) on 11 December 2013. Retrieved 30 November 2013.