ContigScape User Manual 1.0
Introduction
Highlights
Installation
Overview
Open files
  Open CRS file
  Open 454Contig.ace file
  Open AGP file
  Open mate-paired reads
    Alignment Options
      Quality threshold
      Minimum read length
      Minimum quality score to keep
      Minimum percent of bases
      Number of CPU for BWA
Visualization Options
  Show contig width by ratio
  Show contig sequences
  Hide contig link < {value}
  Hide contig < {value} bp
  Repeat contig threshold
  Single contig threshold
Visualization appearance
  Repeat contig color for >2x
  Probable contig color for >1.5x
  Single contig color for <1.5x
  Contig shape for <2kb
  Link label size
  Link color
  Link label color
  Link color for gap
  Link label for gap
  Set shape for plasmid
  Maximum link width
  Minimum link width
Show/Hide result
  Show intermediate results
  Estimate Plasmid/TIRs
Contig sequences
Algorithm principle
  Finding connections between 454 reads
  Finding connections between mate-paired reads
  Genomic features prediction
Copyright and Contact
Frequently Asked Questions (FAQs)
Introduction
Thank you for choosing ContigScape, a Cytoscape plugin for constructing and visualizing de novo genome assemblies. Its
intuitive graph view makes it well suited to discovering relationships between contigs, and therefore helpful in planning a
gap-closing strategy. The program is freely available from http://sourceforge.net/projects/contigscape/files/.
Highlights
ContigScape automatically constructs possible connections between scaffolds or contigs using read assemblies (ACE format),
mate-paired reads, scaffolding information (AGP format), or user-defined CRS files. It lets you adjust the appearance of the
graph and filter contigs by length and coverage. The contig sequences of plasmids and special repeats (IS elements,
ribosomal RNAs, terminal repeats, etc.) can be displayed as well. The following figure illustrates the network appearance
after loading an ACE file, mate-paired reads, and an AGP file, respectively.
Installation
Before using ContigScape, make sure that Java and Cytoscape 2.8.3 are installed. Then either copy ContigScape.jar into the
plugins folder of the Cytoscape installation directory (cytoscape/plugins), or browse to the plugin using “Install plugin from
file” under the Plugins menu.
Overview
ContigScape is loaded automatically along with Cytoscape and can be found in the left panel. The graphical interface is
divided into five categories: “Open files”, “Visualization Options”, “Visualization appearance”, “Show/Hide result”, and
“Contig sequences”. Each category contains several options. Click a category to show or hide its options.
I Open files lets the user choose which file types to import into ContigScape. Currently the ACE format, mate-paired
reads, the AGP format, and user-defined tab-delimited files are supported.
II Visualization Options lets the user choose which types of graphic objects to show on the graph, and shows statistical
information about the graph.
For the best visual effect, use “Layout” > “Force-Directed Layout” and “Visual style” > “sample1”.
III Visualization appearance lets the user control the appearance of the graph, such as the color and size of different types
of nodes/edges.
IV Show/Hide result lets the user export intermediate results and estimate plasmids/TIRs.
V Contig sequences lets the user inspect the sequences of selected contigs. Selected sequences can also be sent to
NCBI BLAST.
I. Open files:
1 Open CRS file:
The CRS format consists of two files, each with three columns. ‘tabbed.txt’ lists the number of connections between
contigs, and ‘tabbedCov.txt’ describes the length and coverage of every contig. The user supplies ‘tabbed.txt’ to specify how
contigs are connected to each other. This is useful for creating connections from data produced by other software, or for
adding connections to an existing network. The following example shows how to prepare a ‘tabbed.txt’ file that uses the
start and end of each contig to create connections. Do not include a connection between the start and end of the same
contig unless the contig is self-connected (for example, it may be a plasmid), as contig 140 is in the following example.
The format of ‘tabbed.txt’ is as follows:
Begin    End     Connections-ReadsNum (optional)
3S       119E    113
3E       58E     322
140S     140E    29
NOTICE: “S” represents the start position and “E” the end position of a contig.
The ‘tabbedCov.txt’ file can be imported as additional information by checking “input tab-delimited contig coverage file” and
specifying the location of the file. The file should also be tab-delimited and follow the format in the table below.
Contig ID    Length    Coverage (or copy number)
119          169       40.72 (1.94)
3            3295      24.73 (1.18)
58           31775     21.92 (1.04)
The CRS test files of 10 454Contigs.ace can be downloaded from
http://sourceforge.net/projects/contigscape/files/datasets/454Contigs_CRS_test_files.zip
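The two CRS files described above can be generated with a few lines of script. The following is a minimal sketch in Python, not part of ContigScape; the function names are hypothetical, and it assumes numeric contig IDs and files without a header row (compare against the downloadable test files before relying on this).

```python
import re

# A contig end label such as "3S" (start of contig 3) or "119E" (end of contig 119).
# Assumption: contig IDs are plain numbers, as in the examples in this manual.
END_LABEL = re.compile(r"^\d+[SE]$")


def format_tabbed(links):
    """Return tab-delimited 'tabbed.txt' content for (begin, end, reads) rows."""
    lines = []
    for begin, end, reads in links:
        for label in (begin, end):
            if not END_LABEL.match(label):
                raise ValueError(f"bad contig end label: {label!r}")
        lines.append(f"{begin}\t{end}\t{reads}")
    return "\n".join(lines) + "\n"


def format_tabbed_cov(contigs):
    """Return tab-delimited 'tabbedCov.txt' content for (id, length, coverage) rows."""
    return "".join(f"{cid}\t{length}\t{cov}\n" for cid, length, cov in contigs)


# Values taken from the two example tables in this section.
links = [("3S", "119E", 113), ("3E", "58E", 322), ("140S", "140E", 29)]
contigs = [(119, 169, 40.72), (3, 3295, 24.73), (58, 31775, 21.92)]

tabbed_text = format_tabbed(links)
cov_text = format_tabbed_cov(contigs)
```

Writing `tabbed_text` and `cov_text` to ‘tabbed.txt’ and ‘tabbedCov.txt’ gives one connection or contig per line, with the three columns separated by tabs.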
2 Open 454Contig.ace file:
ACE files are generated by various assembly programs, including Phrap, CAP3, Newbler, Arachne, and AMOS. An ACE file
contains not only the contigs and reads of an assembly but also information about the connections between contigs.
Take 454Contig.ace as an example, which is generated by the Newbler program from 454 reads. Upon pressing “Open
454Contig.ace”, a dialog prompts the user to enter the location of the sequence file. The graph below shows the
connections and copy numbers of the contigs after importing 454Contig.ace.
3 Open AGP file:
The AGP format describes the assembly of a larger sequence object from smaller objects, such as the joining of contigs into
scaffolds. Here it is useful for forming larger assemblies with the help of reference sequences. You can align the contigs to
closely related reference sequences with BLAST using the following command:
blastall -p blastn -i contig.fas -d ref.fas -m 8 -e 1e-30 -F F -o output.txt
Then remove both the duplicated and the ambiguous matches (bit score < 1000 in the following example):
perl -ne '@t=split(/\t/);next if $t[11]<1000;print unless defined $h{$t[0]};
$h{$t[0]}=1;' output.txt | sort -k9,9n > output2.txt
After that, use blast2agp.pl to convert the BLAST output into an AGP file, then press the “open AGP file” button
and browse to its location.
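For readers who prefer not to use the Perl one-liner, the same filtering step can be sketched in Python. This is an illustrative equivalent rather than part of the ContigScape package; the function name is hypothetical. It assumes the 12-column tabular output produced by “-m 8”, where column 1 is the query (contig) ID, column 9 is the subject start, and column 12 is the bit score.

```python
def filter_blast_hits(lines, min_bitscore=1000.0):
    """Keep the first hit per query with bit score >= min_bitscore,
    then order the kept hits by subject start (column 9)."""
    seen = set()
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        if float(fields[11]) < min_bitscore:
            continue  # ambiguous (low-scoring) match
        query = fields[0]
        if query in seen:
            continue  # duplicated match for a contig already kept
        seen.add(query)
        kept.append(fields)
    kept.sort(key=lambda f: int(f[8]))  # same ordering as `sort -k9,9n`
    return ["\t".join(f) for f in kept]


# Made-up hits for illustration only.
example_hits = [
    "contig2\tref\t99.0\t2000\t5\t0\t1\t2000\t5000\t7000\t0.0\t3500",
    "contig2\tref\t95.0\t800\t20\t0\t1\t800\t100\t900\t0.0\t1200",
    "contig1\tref\t99.5\t2000\t3\t0\t1\t2000\t100\t2100\t0.0\t3600",
    "contig3\tref\t90.0\t300\t30\t0\t1\t300\t9000\t9300\t1e-40\t400",
]
filtered_hits = filter_blast_hits(example_hits)
```

In this example the second contig2 hit is dropped as a duplicate and the contig3 hit is dropped for its low score, leaving contig1 and contig2 ordered by their position on the reference.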
4 Open mate-paired reads:
Mate-paired reads can be used to establish the relationships between scaffolds. ContigScape needs several external tools
for this (SAMtools, BWA, FASTX-Toolkit, BEDTools). In the Open dialog, specify the sequences and the paths to the
executables. Press the OK button and ContigScape will process the input sequences using these programs. After processing,
the generated files are saved in the same folder as the input contig sequences.
Alignment Options
The options below set the parameters used when mapping mate-paired reads onto the contigs. For the programs used,
please also refer to http://hannonlab.cshl.edu/fastx_toolkit/commandline.html#fastx_trimmer_usage and
http://bio-bwa.sourceforge.net/bwa.shtml.
Quality threshold:
Quality threshold for fastq_quality_trimmer; bases with lower quality are trimmed from the end of the sequence.
Minimum read length:
Sequences shorter than this after trimming are discarded by fastq_quality_trimmer.
Minimum quality score to keep:
Minimum quality score used by fastq_quality_filter.
Minimum percent of bases:
Minimum percent of bases that must reach the minimum quality score for a read to be kept by fastq_quality_filter.
Number of CPU for BWA:
Number of threads used by bwa during alignment; see also “bwa aln”.
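To make the relationship between these options and the underlying tools concrete, the sketch below builds the corresponding command lines in Python. It only illustrates which documented flag each option appears to map to (-t and -l for fastq_quality_trimmer, -q and -p for fastq_quality_filter, -t for bwa aln). The exact commands ContigScape runs are not documented here, and the function name, default values, and output file names are assumptions.

```python
def build_commands(reads_fq, contigs_fa, quality_threshold=20, min_length=30,
                   min_quality=20, min_percent=80, threads=4):
    """Return argument lists for trimming, filtering and aligning one read file.

    The default values and the derived file names are hypothetical.
    """
    trimmed = reads_fq + ".trimmed.fq"
    filtered = reads_fq + ".filtered.fq"
    return [
        # "Quality threshold" and "Minimum read length"
        ["fastq_quality_trimmer", "-t", str(quality_threshold),
         "-l", str(min_length), "-i", reads_fq, "-o", trimmed],
        # "Minimum quality score to keep" and "Minimum percent of bases"
        ["fastq_quality_filter", "-q", str(min_quality),
         "-p", str(min_percent), "-i", trimmed, "-o", filtered],
        # "Number of CPU for BWA"; bwa aln writes its result to standard output
        # and expects the contigs to have been indexed with `bwa index` first.
        ["bwa", "aln", "-t", str(threads), contigs_fa, filtered],
    ]


commands = build_commands("matePairedReads1.fq", "matePairedContigs.fas")
```

Each list could be passed to `subprocess.run` on a system where the tools are installed; the sketch deliberately stops at building the arguments.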
II. Visualization Options:
Show contig width by ratio:
Check this option to scale the width of each contig according to its length. Otherwise all contigs are shown with the same
width.
Show contig sequences:
Store contig sequences in memory while loading 454Contigs.ace files. The sequences are shown in the data panel
below when you click the nodes/edges representing each contig.
Hide contig link < {value}
Hide contig connections whose read count is lower than the value specified by the user. Check this option to keep only
high-confidence connections.
Repeat contig threshold
Coverage ratio above which a contig is treated as a repeat. The default value is 2.
Single contig threshold
Coverage ratio below which a contig is treated as single-copy. The default value is 1.5.
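The two thresholds can be read as cut-offs on the ratio between a contig's coverage and the overall coverage. The sketch below illustrates that reading in Python; the function name is hypothetical, how ContigScape computes the overall coverage is not described here, and the handling of values exactly equal to a threshold is an assumption.

```python
def classify_contig(coverage, overall_coverage,
                    repeat_threshold=2.0, single_threshold=1.5):
    """Classify a contig by its coverage relative to the overall coverage."""
    ratio = coverage / overall_coverage
    if ratio > repeat_threshold:
        return "repeat"
    if ratio > single_threshold:
        return "probable repeat"
    return "single"


# Coverage values from the tabbedCov.txt example; an overall coverage of
# about 21 is implied by the copy numbers given there (40.72 / 1.94).
labels = [classify_contig(cov, 21.0) for cov in (40.72, 24.73, 21.92)]
```

With the default thresholds, the contig at 1.94-fold coverage falls into the "probable repeat" class, while the other two are single-copy.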
III. Visualization appearances:
For the best visual effect, use “Layout” > “Force-Directed Layout” and “Visual style” > “sample1”.
The options below control the appearance of the graph, such as the color, shape, and size of graph objects and labels,
according to constraints set by the user.
Repeat contig color:
Set the color for contigs whose coverage is greater than 2-fold of the overall coverage.
Probable repeat contig color:
Set the color for contigs whose coverage is greater than 1.5-fold and less than 2-fold of the overall coverage.
Single contig color:
Set the color for contigs whose coverage is lower than 1.5-fold of the overall coverage.
Contig shape for <2kb:
Set the shape for contigs shorter than 2 kb.
Link label size:
Set size for the labels of the contig connections.
Link color:
Set color for contig connections.
Link label color:
Set color for the labels of contig connections.
Link color for gap:
Set color for connections of gaps.
Link label for gap:
Set color for labels of connections of gaps.
Set shape for plasmid:
Set shape for the contigs predicted as plasmid.
Maximum link width:
Set maximum width for the connections between contigs. Set to 25 by default.
Minimum link width:
Set minimum width for the connections between contigs. Set to 0.12 by default.
IV. Show/Hide result:
Show intermediate results:
Intermediate results contain information generated while loading ACE files, AGP files, and mate-paired reads. The following
example shows the contig and contig-connection information for a loaded ACE file. Choose “Export” to save the
current table, or “Export All” to save all the tables.
Estimate Plasmid/TIRs:
Plasmid results include predicted plasmids with coverage and type information.
V. Contig sequences
Shows the sequence in the data panel below when the corresponding contigs or edges are clicked.
When viewing the graph, the 1,000 base pairs at both the 5’ end and the 3’ end of a contig can be loaded, joined by 20 “N”
characters that stand in for the middle of the sequence. Clicking the edge between two contigs displays the sequence
containing the corresponding contig ends. The displayed sequence can be used to design primers in ContigScape and to
run BLAST against the NCBI database.
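The end-plus-spacer layout described above can be sketched as follows. This Python snippet is only an illustration of the idea; the function name is hypothetical, and returning short contigs unchanged is an assumption, since the manual does not say how contigs shorter than 2,000 bp are displayed.

```python
def contig_end_preview(sequence, flank=1000, spacer_len=20):
    """Return the first and last `flank` bases joined by a run of 'N'."""
    if len(sequence) <= 2 * flank:
        # Assumption: a contig too short to have a hidden middle is shown whole.
        return sequence
    return sequence[:flank] + "N" * spacer_len + sequence[-flank:]


# A made-up 3,000 bp contig: 1,500 A followed by 1,500 C.
preview = contig_end_preview("A" * 1500 + "C" * 1500)
```

The preview of the 3,000 bp example is 2,020 characters long: 1,000 bases from the 5’ end, 20 “N”, and 1,000 bases from the 3’ end.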
Algorithm principle
Finding connections between 454 reads:
To our knowledge, Roche 454 shotgun sequencing is the most suitable strategy for de novo genome sequencing. The read
length of Roche 454 has now reached 700 bp, which can resolve gaps caused by small repeats. However, reads from longer
repeats are assembled into a single contig, which produces gaps at the other copies of the repeat. The Newbler assembler
produces a ‘454Contigs.ace’ file, which contains all the assembly information and can be opened with Consed (Gordon, D. et
al. 1998). As the following shows, when a read is split between two contigs, the coordinates of the read in each
contig are given after the read name, followed by the number of the contig the read links to: ‘fmX’ indicates that the 5’ end
of the read lies in contigX, and ‘toY’ indicates that the 3’ end of the read lies in contigY. This labeling is a unique
advantage of the Newbler assembler made possible by long 454 reads. We extract all the ‘fm’ and ‘to’ information from the
‘454Contigs.ace’ file and arrange it into a relationship table, with entries such as ‘5’-end-Contig1’ linked to ‘3’-end-Contig2’.
ContigScape then displays this table.
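The extraction step can be sketched roughly as below. This is not ContigScape's implementation. It assumes that each contig starts with a “CO” line and each read is listed on an “AF” line (as in the ACE format), and that split reads carry suffixes of the form “.fm<X>” or “.to<Y>” after their coordinates; the exact read-name layout should be checked against a real 454Contigs.ace file. The sketch counts links between contigs only and does not work out which contig end (S or E) is involved.

```python
import re
from collections import Counter

# Assumed suffix marking the contig that holds the rest of a split read's 3' side.
TO_LABEL = re.compile(r"\.to(\d+)")
TRAILING_NUMBER = re.compile(r"(\d+)$")


def count_contig_links(ace_lines):
    """Count reads that continue from one contig into another via '.toY' labels."""
    links = Counter()
    current = None
    for line in ace_lines:
        parts = line.split()
        if len(parts) < 2:
            continue
        if parts[0] == "CO":
            # e.g. "CO contig00001 ..." -> contig number 1
            match = TRAILING_NUMBER.search(parts[1])
            current = int(match.group(1)) if match else None
        elif parts[0] == "AF" and current is not None:
            match = TO_LABEL.search(parts[1])
            if match:
                # Only '.to' is counted; the matching '.fm' label on the other
                # half of the read would describe the same link a second time.
                links[(current, int(match.group(1)))] += 1
    return links


# Made-up ACE fragment for illustration only.
example_ace = [
    "CO contig00001 500 3 1 U",
    "AF readA.1-200.to2 U 300",
    "AF readB.1-180.to2 C 310",
    "AF readC U 1",
    "CO contig00002 800 2 1 U",
    "AF readA.201-400.fm1 U 1",
    "AF readB.181-350.fm1 C 1",
]
link_counts = count_contig_links(example_ace)
```

In the example, two reads run from contig 1 into contig 2, so the table holds a single entry (1, 2) with a count of 2.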
Finding connections between mate-paired reads:
Materials: matePairedReads1.fq, matePairedReads2.fq, matePairedContigs.fas, scaffold.pl
The most common way to assemble contigs into scaffolds is through mate-pair information. Based on the fragment size of
the mate-pair library, a scaffolding program can estimate how far apart two contigs should be. For example, if the two reads
of a mate pair from a 3 kb library map to two different contigs, the contigs can be joined into a scaffold, and the gap size
is 3 kb minus the distances from the mapping loci to the ends of the contigs. Repeat regions shorter than 3 kb can be
spanned with this method, but if a repeat region is longer than the fragment size of the mate-pair library, the linkage
information becomes ambiguous. As with the Newbler results for 454 reads, ContigScape displays the relationships within
scaffolds as a network by counting the mate-pair reads that link large contigs (>500 bp). The detailed method
is described in scaffold.pl (http://sourceforge.net/projects/contigscape/files/datasets/).
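The gap-size arithmetic in the example above can be written out as a short calculation. The function below is an illustrative sketch with a hypothetical name; it assumes the two distances are measured from each read's mapping position to the facing end of its contig.

```python
def estimate_gap(fragment_size, dist_to_end_a, dist_to_end_b):
    """Gap between two contigs linked by one mate pair.

    fragment_size : expected distance between the two mates (e.g. 3000)
    dist_to_end_a : distance from the first mate's locus to the end of its contig
    dist_to_end_b : distance from the second mate's locus to the end of its contig
    """
    return fragment_size - dist_to_end_a - dist_to_end_b


# Made-up example: a 3 kb library, mates mapping 800 bp and 700 bp from the contig ends.
gap = estimate_gap(3000, 800, 700)
```

Here the estimated gap is 1,500 bp. A negative result would suggest that the two contigs overlap or that the pair does not fit the expected fragment size.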
Genomic features prediction:
Genomic features can be categorized into several groups: plasmids, phage elements, and terminal repeats. They can be
accessed through “show Plasmid/TIRs” in the control panel. See the following figure for details.
Figure. A. The interface of ContigScape. The left part is the control panel; the window on the right shows a sample genome.
Contigs are shown in red (repeat contigs), dark blue (unique contigs), and orange (probable repeats). B. Zoomed image of
some contigs (light blue frame in panel A). B1. A linear plasmid formed by three contigs. B2. Repeats (Contig28) at the end
of the chromosome. B3. A circular plasmid with high copy number formed by three contigs (Contig141, 142, 143). B4. Two
high-copy-number circular plasmids, each formed by a single contig. B5. A linear plasmid with high copy number formed by
one contig. B6. A circular plasmid with single copy number composed of one contig.
Whether a repeat contig comes from the chromosome or from a plasmid is judged mainly from the linkage information at
the two ends of the contig. Four different types are shown in the figure:
1) Repeat contigs connected in a circular fashion (Panel 3).
2) An individual contig connected only to itself (Panels 4 and 6).
3) A repeat contig with one end that has no linkage to any other contig, usually representing a linear chromosome
telomere or a linear plasmid end (Panels 1 and 2).
4) A linear plasmid composed of a single repeat contig with no connections to any other contig (Panel 5).
If a plasmid is linear and single-copy, however, ContigScape cannot distinguish it. In our experience, these patterns allow
an effective estimate of whether a contig is a plasmid. The researcher must still confirm this by PCR, sequencing, and
annotation.
In Figure B, 143E has connections with both 142E and 144E (Panel 3), but the number of connections between 143E
and 142E (800) is far greater than that between 143E and 144E (10). In this case, the latter may be a nonspecific
connection caused by a small overlap among the reads. Additionally, Figure 6B shows that contig78 in the linear plasmid
80E-80S-78E-78S-54E-54S also has another copy in the chromosome (Panel 1).
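As an illustration of one of the linkage patterns above (a single contig connected only to itself, Panels 4 and 6), the following Python sketch scans a list of connections in the ‘tabbed.txt’ style and reports contigs whose start and end are linked to each other and to nothing else. It is a simplified example with hypothetical function names, not the prediction procedure used by ContigScape, and it ignores coverage entirely.

```python
from collections import defaultdict


def self_circular_contigs(links):
    """Return IDs of contigs whose only links join their own start and end.

    links: iterable of (end_a, end_b, reads), with end labels such as "140S".
    """
    partners = defaultdict(set)
    for end_a, end_b, _reads in links:
        partners[end_a].add(end_b)
        partners[end_b].add(end_a)

    contig_ids = {label[:-1] for label in partners}  # strip the S/E suffix
    result = []
    for contig in sorted(contig_ids):
        start, end = contig + "S", contig + "E"
        if partners.get(start) == {end} and partners.get(end) == {start}:
            result.append(contig)
    return result


# Connections from the tabbed.txt example earlier in this manual.
candidates = self_circular_contigs(
    [("3S", "119E", 113), ("3E", "58E", 322), ("140S", "140E", 29)]
)
```

For the example connections, only contig 140 is reported. As the text notes, such a candidate still has to be confirmed experimentally.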
Copyright
ContigScape is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public
License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied
warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
License for more details.
You should have received a copy of the GNU Lesser General Public License along with this program; if not, write to the
Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
Contact
You may email any questions/suggestions to: [email protected], [email protected]
You're welcome to report bugs at ContigScape Discussion Forums.
Frequently Asked Questions (FAQs)
1. Why does the network look messed up after I open 454Contigs.ace?
A: You have to lay out the network manually after loading it. One way is to use "Force-Directed Layout" in the Cytoscape
menu. For the best visual effect, use “Layout” > “Force-Directed Layout” and VizMapper > “Visual style” > “sample1”.
2. How do I make an AGP file by comparing the contigs to the reference sequences?
A: Run a local BLAST between the contigs and the reference sequences with the “-m 8” parameter, then use the
script "blast2agp.pl" included in this software package to convert the BLAST output into an AGP file.
3. Have GRASS, SSPACE and OPERA been integrated into the ContigScape package?
A: No, they have not been integrated. You can convert the scaffolding results produced by these tools into the CRS
format described above, and ContigScape will then provide a global visualization.
4. I had some problems installing the Java class. It doesn’t work on my Mac and gives me an error. On a Linux system I got
it to run and loaded the 454 contigs as an ACE file into ContigScape, but it is extremely slow and my Java job is always at
99%.
A: It looks as though Java is not installed correctly. Please download Java 1.6 or higher from
http://java.sun.com and install it properly.
5. You mentioned that contigs can be reordered and broken and that the results can be saved. I haven’t found those
functions in ContigScape. Can the results be saved?
A: Those functions can be found in the Editor panel of Cytoscape. We chose Cytoscape because it is a software package
with excellent network visualization and high operability.
6. Would it be possible to import BAM files? They contain most of the relevant information, and BAM is a widely used format.
A: You can use bamToBed in the BEDTools package to convert a sorted BAM file into a BED file, then specify both the
contig sequence and the BED file in the "open mate-paired reads" dialog.
7. I saw the scaffold.pl script, but why can’t it be run within ContigScape?
A: scaffold.pl is a Perl script that processes the FASTQ files and the corresponding contig file to produce
scaffold.bed and tabbed.txt, which contain the relationships between contigs. It requires FASTX-Toolkit, BEDTools, BWA,
and SAMtools, which were developed for Linux platforms only. On Linux platforms, the user can in fact run scaffold.pl
through “open mate-paired reads” in ContigScape.
8. In the abstract you said that you can find repeats such as IS elements, ribosomal RNA, etc. Is this detection based purely
on read coverage? If long reads such as PacBio assemble those repeats into their individual copies, will you still detect them?
A: That is true. PacBio technology greatly reduces the number of contigs in a bacterial genome assembly, and ContigScape
will not be able to detect small repeats such as IS elements and ribosomal RNA. However, large repeat contigs still exist in
several bacterial genomes, in fungal genomes, and in all plant and animal genomes. Large repeats of more than 10 kb
usually have important biological significance.
9. The scope is supposed to be gap closing in microbial genomes, but can the tool cope with plant and animal genomes with
tens of thousands of contigs?
A: You can convert the result into the CRS format described above, and ContigScape will then provide a global
visualization. Although thousands of contigs may look complicated in the image, it shows the real relationships among
these contigs.
10. How do I get "454Contigs.ace" for my assembly?
A: Make sure that "Single ACE file" or "Single ACE file for small genomes" is checked in the Output sub-tab of the GS De
novo Assembler; see
http://454.com/my454/documentation/gs-flx-system/emanuals/Part_C/wwhelp/wwhimpl/common/html/wwhelp.htm#href=PartC.1.016.html&single=true
11. Why can’t I obtain a result after inputting scaffold.bed and the corresponding contig sequences?
A: Please check the integrity of scaffold.bed and make sure that the sequence titles in scaffold.bed are the same as
those of the corresponding contig sequences.
12. Can I install it in Cytoscape 3.0?
A: Sorry, ContigScape 1.0 does not support Cytoscape 3.0 yet. We will add support in version 1.1.
13. I have not clearly understood whether the program is specific to the Newbler assembler or not.
A: The “open 454Contigs.ace” option can only be used to load 454Contigs.ace, which had been modified in
ContigScape v1.0. The ACE format produces very large files with varied contents. We cannot process that many
variants, so we provide the CRS format described in the manuscript to accommodate other programs. The CRS format
consists of two files; users can refer to
http://sourceforge.net/projects/contigscape/files/datasets/454Contigs_CRS_test_files.zip/download.
14. How does it help with finishing after viewing a graph?
A: Contigs displayed in ContigScape are easier to work with, because repeat contigs, gaps, and even plasmids can be
highlighted, filtered, and customized. The graph itself therefore enables faster and more precise determination of the
linkages among contigs and greatly improves the efficiency of gap closing.
After viewing a graph, we can obtain the sequences of contig ends, which can be used for BLAST searches and primer
design. Clicking the edge between two contigs gives the sequence containing the corresponding contig ends, which can be
searched with BLAST at NCBI to judge whether the edge is real. In addition, if the user needs to edit the connections of
the network, the “edit panel” can be opened for this purpose. All gaps need to be filled by PCR and sequencing on an
ABI3730 sequencer. We have added this point in the part “Display functionality of ContigScape” of our revision.
In our finishing strategy, all contigs together with the ABI3730 data must finally be assembled using phred, phrap, and
consed. Our plugin cannot replace the consed program. It acts as a canvas for editing and judging the order of contigs
and for visually evaluating the complexity of the assembly.