ExPlain 3.0 manual
CHAPTER 5. TRANSCRIPTION FACTOR SITE SEARCH
5.4. SEARCH SITES THEORETICAL BACKGROUND

center of the window is multiplied by 1, and the factor diminishes to 0 according to a cosine function as the distance to the center increases. In the figure, three evidence points within the window are indicated by green and yellow bars. The cosine function is shown by the orange curve. Finally, a sum-of-evidence-scores histogram peaking at the current window position is shown by blue bars.

Figure 5.27 TSS definition in TRANSPro

5.4.3 Cut-off values

In Match™, cut-off values are defined separately for the core similarity and the matrix similarity [5]. The matrix similarity is a score that describes the quality of a match between a matrix and an arbitrary part of the input sequences. Analogously, the core similarity denotes the quality of a match between the core sequence of a matrix (i.e. the five most conserved positions within a matrix) and a part of the input sequence. A match has to contain the "core sequence" of a matrix, i.e. the core sequence has to match with a score higher than or equal to the core similarity cut-off. In addition, only matches that score higher than or equal to the matrix similarity cut-off appear in the output.

For the minFP, minFN, and minSUM cut-offs, the core similarity score is calculated first, and the matrix similarity score (mSS) is then calculated for the selected positions according to the following equation:

$$ mSS = \frac{\mathrm{Current} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}, \qquad \mathrm{Current} = \sum_{i=1}^{L} I(i)\, f_{i,b_i}, \quad \mathrm{Min} = \sum_{i=1}^{L} I(i)\, f_i^{\min}, \quad \mathrm{Max} = \sum_{i=1}^{L} I(i)\, f_i^{\max} $$

where $f_{i,b}$ is the frequency of nucleotide b at position i of the matrix with width L ($b_i$ being the nucleotide found at position i of the candidate site), $f_i^{\min}$ is the frequency of the rarest nucleotide in position i, and $f_i^{\max}$ is the frequency of the most frequent nucleotide in position i.
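As a rough illustration of the matrix similarity score described above, the following Python sketch assumes the form published for MATCH [5]: the score is (Current - Min) / (Max - Min), where each term is a sum over matrix positions of the frequency in question weighted by the information vector I(i). It also assumes that the matrix holds relative frequencies (each column sums to 1). The function names and the four-position example matrix are hypothetical and are not taken from Match™ or from any TRANSFAC matrix.

```python
import math

# Hypothetical 4-position matrix of relative nucleotide frequencies
# (an assumption for illustration; each column sums to 1).
EXAMPLE_MATRIX = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},
    {"A": 0.00, "C": 1.00, "G": 0.00, "T": 0.00},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.10, "C": 0.10, "G": 0.60, "T": 0.20},
]


def information_vector(matrix):
    """Return I(i) = sum over B of f_iB * ln(4 * f_iB) for every position i.

    Terms with zero frequency are skipped (their limit is 0).
    """
    return [
        sum(f * math.log(4.0 * f) for f in column.values() if f > 0.0)
        for column in matrix
    ]


def matrix_similarity(matrix, site):
    """Score a candidate site as (Current - Min) / (Max - Min)."""
    if len(site) != len(matrix):
        raise ValueError("site length must equal matrix width")
    info = information_vector(matrix)
    current = sum(w * col[b] for w, col, b in zip(info, matrix, site))
    lowest = sum(w * min(col.values()) for w, col in zip(info, matrix))
    highest = sum(w * max(col.values()) for w, col in zip(info, matrix))
    if highest == lowest:
        raise ValueError("matrix has no conserved position")
    return (current - lowest) / (highest - lowest)


def passes_cutoff(matrix, site, matrix_cutoff):
    """Keep a match only if its score reaches the matrix similarity cut-off."""
    return matrix_similarity(matrix, site) >= matrix_cutoff


if __name__ == "__main__":
    for candidate in ("ACAG", "ACAT", "CAAA"):
        print(candidate, round(matrix_similarity(EXAMPLE_MATRIX, candidate), 3))
```

In this sketch the site that carries the most frequent nucleotide at every position scores 1, a site that carries the rarest nucleotide at every position scores 0, and the uniformly distributed third position has I(i) = 0 and so does not influence the score. The core similarity would be computed the same way, restricted to the core positions of the matrix.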
The information vector I(i) describes the conservation of position i of the matrix, with the sum running over the nucleotides B:

$$ I(i) = \sum_{B \in \{A, C, G, T\}} f_{i,B} \ln\left(4 f_{i,B}\right), \quad i = 1, 2, \ldots, L $$

Cut-off to minimize false negative matches (minFN): The false negative rate that Match™ obtains with a matrix was measured on known genomic binding sites for the transcription factors associated with that matrix, as far as such sites are available. If a sufficient number of genomic binding sites was not available (fewer than 10), SELEX sites or sets of generated oligonucleotides were used to estimate the cut-offs that minimize the false negative rate; the actual weight matrices were used to calculate the probability of a nucleotide occurring at a certain position of a binding site. For each matrix we applied the Match™ algorithm to these test sequence sets without using any matrix similarity cut-off. We then set the cut-off to a value at which at least 90% of the oligonucleotides are recognized; that is, we decided to tolerate an error rate of ten percent. We call this set of cut-offs the minFN cut-offs. With the minFN cut-offs the user will find most genomic binding sites, but a high rate of false positives has to be expected as well. The minFN cut-offs are useful for the detailed analysis of relatively short DNA fragments.

Cut-off to minimize false positive matches (minFP): In order to estimate this cut-off, which will reduce the number of random sites found by Match™, we applied the Match™ algorithm to promoter