ExPlain 3.0 manual

CHAPTER 5. TRANSCRIPTION FACTOR SITE SEARCH
5.4. SITES SEARCH THEORETICAL BACKGROUND
center of the window is multiplied by 1, and the factor diminishes to 0 according to a cosine function
as the distance to the center increases. In the figure, three evidence points within the window are
indicated by green and yellow bars. The cosine is exemplified by the orange curve. Finally, a
sum-of-evidence-scores histogram peaking at the current window position is shown by blue bars.
Figure 5.27 TSS definition in TRANSPro
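The cosine weighting described above can be sketched as follows. This is only an illustration: the exact cosine form and the window width are not given in this section, so a quarter-period cosine and the hypothetical names `cosine_weight`, `window_score` and `half_width` are assumptions.

```python
import math

def cosine_weight(distance, half_width):
    """Weight 1 at the window center, falling to 0 at the window edge.

    Assumption: a quarter-period cosine, cos(pi/2 * |d| / half_width).
    """
    d = abs(distance)
    if d >= half_width:
        return 0.0
    return math.cos(math.pi / 2 * d / half_width)

def window_score(center, evidence, half_width):
    """Sum of evidence scores, each scaled by its cosine weight.

    `evidence` is a list of (position, score) pairs.
    """
    return sum(score * cosine_weight(pos - center, half_width)
               for pos, score in evidence)

# Hypothetical evidence points: (sequence position, evidence score).
evidence = [(95, 2.0), (100, 1.0), (110, 3.0)]

# Sliding the window along the sequence gives a sum-of-evidence histogram.
histogram = {c: window_score(c, evidence, half_width=20)
             for c in range(80, 121)}
```

Evidence at the window center contributes its full score, and evidence at or beyond `half_width` positions away contributes nothing.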
5.4.3 Cut-off values
In MATCH, cut-off values are defined separately for the core similarity and the matrix similarity [5].
The matrix similarity is a score that describes the quality of a match between a matrix and an arbitrary
part of the input sequence. Analogously, the core similarity denotes the quality of a match between the
core sequence of a matrix (i.e. the five most conserved positions within the matrix) and a part of the
input sequence. A match has to contain the “core sequence” of a matrix, i.e. the core sequence has to
match with a score higher than or equal to the core similarity cut-off. In addition, only matches that
score higher than or equal to the matrix similarity cut-off appear in the output. For the minFP, minFN,
and minSUM cut-offs, the core similarity score is calculated first, and the matrix similarity score is
then calculated for the selected positions according to the following equation.
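$$\mathrm{mSS} = \frac{\mathrm{Current} - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}}, \quad \mathrm{Current} = \sum_{i=1}^{L} I(i)\, f_{i,b_i}, \quad \mathrm{Min} = \sum_{i=1}^{L} I(i)\, f_i^{\min}, \quad \mathrm{Max} = \sum_{i=1}^{L} I(i)\, f_i^{\max}$$
where $f_{i,b}$ is the frequency of nucleotide b at position i of the matrix with width L ($b_i$ being the
nucleotide of the candidate site at position i), $f_i^{\min}$ the frequency of the rarest nucleotide at
position i, and $f_i^{\max}$ the frequency of the most frequent nucleotide at position i. The information
vector I(i) describes the conservation of position i of the matrix, summed over the nucleotides B:
$$I(i) = \sum_{B \in \{A, C, G, T\}} f_{i,B} \ln(4 f_{i,B}), \quad i = 1, 2, ..., L$$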
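As an illustration, the matrix similarity score can be computed as in the sketch below. It assumes the published MATCH form of the score, (Current - Min) / (Max - Min), where each sum weights the nucleotide frequencies by the information vector I(i). The data layout (one dictionary of frequencies per matrix position) and the names `information_vector` and `matrix_similarity` are assumptions made for this example, not the ExPlain implementation.

```python
import math

def information_vector(matrix):
    """I(i) = sum over B of f(i,B) * ln(4 * f(i,B)).

    Terms with zero frequency contribute 0 (the limit of x*ln(x) as x -> 0).
    """
    return [sum(f * math.log(4 * f) for f in column.values() if f > 0)
            for column in matrix]

def matrix_similarity(matrix, site):
    """Score a candidate site against a frequency matrix of equal width."""
    if len(site) != len(matrix):
        raise ValueError("site and matrix must have the same width")
    info = information_vector(matrix)
    current = sum(i_val * col[b] for i_val, col, b in zip(info, matrix, site))
    lowest = sum(i_val * min(col.values()) for i_val, col in zip(info, matrix))
    highest = sum(i_val * max(col.values()) for i_val, col in zip(info, matrix))
    if highest == lowest:
        raise ValueError("score is undefined for a matrix without conservation")
    return (current - lowest) / (highest - lowest)

# Hypothetical 3-position matrix; each column holds nucleotide frequencies.
matrix = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

best = matrix_similarity(matrix, "AGA")   # consensus site, scores 1.0
other = matrix_similarity(matrix, "CGA")  # mismatch at position 1, scores lower
```

The third, uniform position has I(i) = 0 and therefore does not influence the score, which is the intended effect of the information weighting.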
Cut-off to minimize false negative matches (minFN): The false negative rate that Match™ obtains with a
matrix was measured on known genomic binding sites for the transcription factors associated with that
matrix, as far as such sites are available. Where too few genomic binding sites (fewer than 10) were
available, SELEX sites or sets of generated oligonucleotides were used instead to estimate the cut-offs
that minimize the false negative rate; the oligonucleotides were generated using the actual weight
matrices to calculate the probability of a nucleotide occurring at a certain position of a binding site.
For each matrix we applied the Match™ algorithm to these test sequence sets without using any matrix
similarity cut-off. We then set the cut-off to a value that provides recognition of at least 90% of the
oligonucleotides, i.e. we decided to tolerate an error rate of ten percent. We call this set of cut-offs
the minFN cut-offs. Applying the minFN cut-offs, the user will find most genomic binding sites, but a
high rate of false positives has to be expected as well. The minFN cut-offs are useful for the detailed
analysis of relatively short DNA fragments.
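The minFN estimation can be sketched as choosing the highest cut-off that still recognizes at least 90% of the test sites. The sketch assumes that each test site is represented by its matrix similarity score obtained without any cut-off; the function name `min_fn_cutoff` and the example scores are hypothetical.

```python
def min_fn_cutoff(site_scores, recognized_percent=90):
    """Highest cut-off at which at least `recognized_percent` % of the
    test sites score greater than or equal to the cut-off."""
    if not site_scores:
        raise ValueError("at least one test site score is required")
    ranked = sorted(site_scores, reverse=True)
    # Integer ceiling of recognized_percent/100 * number of sites.
    needed = (recognized_percent * len(ranked) + 99) // 100
    return ranked[needed - 1]

# Hypothetical matrix similarity scores of ten known binding sites.
scores = [0.95, 0.91, 0.88, 0.86, 0.84, 0.82, 0.80, 0.79, 0.75, 0.60]
cutoff = min_fn_cutoff(scores)
recognized = sum(s >= cutoff for s in scores)  # 9 of the 10 sites
```

With these ten scores the cut-off is 0.75: nine sites are recognized and the single lowest-scoring site is accepted as a false negative.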
Cut-off to minimize false positive matches (minFP): In order to estimate this cut-off, which will
reduce the number of random sites found by Match™, we applied the Match™ algorithm to promoter