Motif discovery in sets of biological sequences is an important and active research area in computational biology. In sequence analysis, the motif search problem encompasses several important problems, where a biologically important pattern is known as a motif. Analyzing large-scale genomic and proteomic data to discover motifs is one of the main challenges. In proteins, a motif is a conserved amino acid sequence pattern that is present in most proteins of a given protein family and is thought to be biologically significant for the structure or function of those proteins. These conserved regions often either provide structural support to the protein or serve as functionally important parts of it. Hence, discovering such motifs is essential to better understand the tertiary structure of a protein and to predict its function. Various algorithms exist for motif discovery, such as AlignACE [1] and Weeder [2] (used to discover DNA motifs), and Gibbs [3] and MEME [4] (used to discover motifs in both protein and DNA datasets).
Originally, motif discovery was a more laborious process. In this approach, X-ray structures of proteins with similar function are studied; the structure is a good indicator of the binding site, and hence the amino acid residues forming the binding site are taken as the motif responsible for the function. A list of such known patterns has been compiled into the PROSITE [5] database. PROSITE also provides a program that matches these patterns against sequences, so the primary sequence can be used directly to find occurrences of a pattern. If a new sequence contains a known pattern, this is a good indicator of its possible function. The patterns in PROSITE are not derived automatically but by inspection. However, given the rate at which new sequences are being determined, there is a need for an automatic method to extract patterns from primary sequence information.
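To illustrate how a known pattern can be matched directly against a primary sequence, the following is a minimal sketch, not the PROSITE program itself. It assumes the common subset of the PROSITE pattern syntax (`x` for any residue, `[...]` for alternatives, `{...}` for exclusions, `(n)` or `(n,m)` for repetition, `<` and `>` for terminal anchors) and translates it into a Python regular expression. The function names `prosite_to_regex` and `find_motif`, and the example sequence, are hypothetical and chosen only for illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Covers only the common subset of the syntax (an assumption of this sketch).
    """
    pattern = pattern.rstrip(".")  # PROSITE patterns may end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m) count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":  # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        # A single residue or an [ABC] group is already valid regex syntax.
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motif(pattern: str, sequence: str):
    """Return (start, matched_text) for each non-overlapping occurrence."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# N-glycosylation site pattern; the sequence is made up for illustration.
hits = find_motif("N-{P}-[ST]-{P}", "MKNGSALNPTQ")
print(hits)
```

Note that `re.finditer` reports only non-overlapping occurrences; a tool that needs every overlapping match would have to scan from each start position instead.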
The traditional approach to motif discovery is based on multiple sequence alignment. In this approach, a region of greater-than-average similarity is identified in the aligned sequences and used to construct the consensus pattern. However, multiple sequence alignment is best suited to limited sets of related proteins, because alignment methods are sensitive to the gap penalty parameters and the similarity scoring matrix.
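The alignment-based idea can be sketched as follows, under simplifying assumptions: the sequences are already aligned (gaps written as `-`), per-column similarity is measured as the fraction of sequences sharing the most common residue, and a candidate motif is any run of at least `min_len` consecutive columns scoring above the alignment-wide mean. The scoring scheme, the function names, and the toy alignment are illustrative choices, not a method taken from the cited tools.

```python
from collections import Counter


def column_conservation(alignment):
    """Return per-column conservation scores and the consensus string."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores, consensus = [], []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]  # ignore gap characters
        if not residues:
            scores.append(0.0)
            consensus.append("-")
            continue
        residue, count = Counter(residues).most_common(1)[0]
        scores.append(count / len(column))  # gaps count against conservation
        consensus.append(residue)
    return scores, "".join(consensus)


def conserved_regions(alignment, min_len=3):
    """Find runs of columns whose conservation exceeds the mean score."""
    scores, consensus = column_conservation(alignment)
    mean = sum(scores) / len(scores)
    regions, start = [], None
    # A sentinel score below any real value closes a run that reaches the end.
    for i, score in enumerate(scores + [float("-inf")]):
        if score > mean:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i, consensus[start:i]))
            start = None
    return regions


# Toy alignment, made up for illustration.
toy_alignment = [
    "AKGHCDEL",
    "TRGHCDQV",
    "S-GHCDNM",
    "PLGHCDAI",
]
print(conserved_regions(toy_alignment))
```

Each region is reported as a half-open column interval together with its consensus residues. Because the input alignment itself depends on the gap penalties and scoring matrix, the regions found this way inherit that sensitivity.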
Another approach to the problem is to use statistical techniques to discover biologically meaningful patterns and relationships.