The recipes for all the proteins an organism needs are described in its genes, which are encoded in DNA. To create a new protein, the DNA recipe must first be transcribed into RNA, and this transcript is subsequently translated into protein. Since different proteins are used in different cell types and at different times, the process of creating new proteins must be tightly regulated. The first step in this process, transcription, is mainly regulated by transcription factors, which bind to specific sequence patterns in the DNA, help recruit the transcriptional apparatus to the start of the gene, and initiate transcription. An important step in elucidating the gene regulatory networks of an organism is thus to determine which sequence pattern each transcription factor binds to (its binding motif) and the sites where it binds.
Although such motifs and binding sites are best determined experimentally, computational tools for motif discovery seem to offer a convenient, fast and cost-effective alternative to experimental methods, and hundreds of software programs have been developed for this purpose. These tools can broadly be divided into two classes. Motif scanning tools rely on predefined models of binding motifs and search sequences for matches to these motifs in order to identify potential binding sites. De novo motif discovery methods, on the other hand, aim to find new motifs and binding sites without such prior knowledge by looking for overrepresented patterns in sequences believed to be regulated by common factors.
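To make the first class concrete, the following minimal Python sketch scans a sequence with a position weight matrix, one widely used type of predefined motif model. The count matrix, the pseudocount, the uniform background and the score threshold are all hypothetical values chosen for illustration; they are not taken from any real transcription factor or from any particular tool.

```python
import math

# Hypothetical base counts for a 4-bp motif (consensus AGCT), for illustration only.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 0, 9, 1],
    "G": [1, 10, 0, 0],
    "T": [1, 0, 1, 8],
}


def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert base counts into per-position log2 odds against a uniform background."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        total = sum(counts[base][i] for base in "ACGT") + 4 * pseudocount
        column = {}
        for base in "ACGT":
            frequency = (counts[base][i] + pseudocount) / total
            column[base] = math.log2(frequency / background)
        matrix.append(column)
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, site, score) for every window scoring at or above the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    pwm = log_odds_matrix(COUNTS)
    for start, site, score in scan("TTAGCTAAGGCTT", pwm, threshold=5.0):
        print(start, site, round(score, 2))
```

A scanner of this kind scores only the forward strand and considers nothing beyond the sequence itself; scanning the reverse complement and choosing a principled threshold are left out for brevity.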
However, independent assessment studies of computational motif discovery tools have shown that the performance of these methods is limited, especially with respect to predicting functionally active binding sites in real genomic sequences. One reason for this is that most of these tools base their predictions only on information in the DNA sequence itself. Many other aspects besides the presence of a binding motif can influence whether a transcription factor will actually be able to bind and exert its regulatory function, for instance the local chromatin conformation around the binding site or the presence of cooperative factors binding nearby.
More recent approaches have demonstrated that binding site predictions can be improved by also considering additional information related to, for example, phylogenetic conservation, nucleosome occupancy, DNase hypersensitive sites, epigenetic features, gene expression and transcription factor interactions. To this end, we have developed a new software workbench that integrates additional information from a variety of sources into the motif discovery process in a coherent and flexible way.
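As an illustration of how such additional information can enter a prediction, the sketch below adjusts sequence-based motif scores with a position-specific prior, one generic way of combining the two kinds of evidence. It is not a description of how the workbench itself operates. The function name `prior_adjusted_scores`, the interpretation of each prior as the probability that a position is bound (derived, say, from conservation or DNase hypersensitivity), and the clipping constant are all assumptions made for this example.

```python
import math


def prior_adjusted_scores(motif_scores, priors, epsilon=1e-6):
    """Add the log2 prior odds of binding to each log-odds motif score.

    motif_scores: log2-odds motif scores, one per candidate site.
    priors: assumed probabilities (0..1) that each site is bound, derived
            from some external data track; hypothetical in this sketch.
    """
    if len(motif_scores) != len(priors):
        raise ValueError("motif_scores and priors must have the same length")
    adjusted = []
    for score, prior in zip(motif_scores, priors):
        # Clip the prior so that the log odds stay finite at 0 and 1.
        clipped = min(max(prior, epsilon), 1.0 - epsilon)
        adjusted.append(score + math.log2(clipped / (1.0 - clipped)))
    return adjusted


if __name__ == "__main__":
    scores = [5.9, 5.9, 3.7]   # hypothetical motif scores
    priors = [0.5, 0.1, 0.9]   # hypothetical per-site priors
    print(prior_adjusted_scores(scores, priors))
```

With a prior of 0.5 a score is left unchanged, whereas two sites with identical motif scores are ranked differently when the external evidence favours one of them.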