Clustering Protein Sequences by Parallel Mining Non-positional k-tuple Matches


Szu-Hsien Lu and Ming-Jing Hwang

Institute of Biomedical Sciences, Academia Sinica, 128 Yen-chiou Yuan Rd., Sec. 2, Taipei 11529, Taiwan


    Sequence comparison is fundamental to bioinformatics research. The need for still faster comparison methods is pressing as an increasing number of genome sequencing projects are pursued and completed. To that end, the use of k-tuples obtained from frame-shift scanning of gene sequences as the basis of comparison has been popular. k-tuple matches resulting from pairwise comparisons can enable highly efficient, though greedy, alignment strategies for building a guide tree that precedes the eventual construction of phylogenetic clusters. We propose here that if clustering protein sequences, and not their actual alignments, is of primary interest, it is possible to inspect k-tuple matches of multiple sequences in an even more parallel way by disregarding positional information. Our preliminary test showed that of the 118 members of the AAA (ATPases Associated with various cellular Activities) superfamily, only 7 missed the subfamily designation determined by a conventional pairwise comparison method.
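    The idea of matching k-tuples without positional information can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the tuple length k = 3, the min_shared threshold, and the single-linkage grouping are all assumptions chosen for illustration. Each sequence is reduced to the set of k-tuples it contains, an inverted index records which sequences contain each k-tuple, and sequences sharing enough k-tuples are grouped together.

```python
from collections import defaultdict
from itertools import combinations


def ktuples(seq, k=3):
    """Return the set of overlapping k-tuples in seq; positions are discarded."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_counts(seqs, k=3):
    """Count k-tuples shared by each pair of sequences via an inverted index.

    seqs maps a sequence name to its residue string (hypothetical input format).
    """
    index = defaultdict(set)  # k-tuple -> names of sequences containing it
    for name, seq in seqs.items():
        for t in ktuples(seq, k):
            index[t].add(name)
    counts = defaultdict(int)  # (name_a, name_b) -> number of shared k-tuples
    for members in index.values():
        for a, b in combinations(sorted(members), 2):
            counts[(a, b)] += 1
    return counts


def cluster(seqs, k=3, min_shared=4):
    """Group sequences that share at least min_shared k-tuples (single linkage).

    min_shared is an assumed, illustrative threshold, not a value from the paper.
    """
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in shared_counts(seqs, k).items():
        if n >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for name in seqs:
        groups[find(name)].append(name)
    return sorted(sorted(g) for g in groups.values())
```

    Because the inverted index is built in a single pass over all sequences, every k-tuple is examined across the whole set at once rather than one pair at a time; a real application would need a similarity score and threshold tuned to the data.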