Download Get PDF - Wiley Online Library
Transcript
Analyzing Gene Expression Data from Microarray and Next-Generation DNA Sequencing Transcriptome Profiling Assays Using GeneSifter Analysis Edition UNIT 7.14 Sandra Porter,1, 2 N. Eric Olson,2 and Todd Smith2 1 2 Digital World Biology, Seattle, Washington Geospiza, Inc., Seattle, Washington ABSTRACT Transcription profiling with microarrays has become a standard procedure for comparing the levels of gene expression between pairs of samples, or multiple samples following different experimental treatments. New technologies, collectively known as next-generation DNA sequencing methods, are also starting to be used for transcriptome analysis. These technologies, with their low background, large capacity for data collection, and dynamic range, provide a powerful and complementary tool to the assays that formerly relied on microarrays. In this chapter, we describe two protocols for working with microarray data from pairs of samples and samples treated with multiple conditions, and discuss alternative protocols for carrying out similar analyses with next-generation DNA sequencing data from two different instrument platforms (Illumina GA and Applied Biosystems C 2009 by John Wiley & Sons, SOLiD). Curr. Protoc. Bioinform. 27:7.14.1-7.14.35. Inc. Keywords: gene expression r microarray r RNA-Seq r transcriptome r GeneSifter Analysis Edition r next-generation DNA sequencing INTRODUCTION Transcriptome profiling is a widely used technique that allows researchers to view the response of an organism or cell to a new situation or treatment. Insights into the transcriptome have uncovered new genes, helped clarify mechanisms of gene regulation, and implicated new pathways in the response to different drugs or environmental conditions. Often, these kinds of analyses are carried out using microarrays. Microarray assays quantify gene expression indirectly by measuring the intensity of fluorescent signals from tagged RNA after it has been allowed to hybridize to thousands of probes on a single chip. Recently, next-generation DNA sequencing technologies (also known as NGS or Next Gen) have emerged as an alternative method for sampling the transcriptome. Unlike microarrays, which identify transcripts by hybridization and quantify transcripts by fluorescence intensity, NGS technologies identify transcripts by sequencing DNA and quantify transcription by counting the number of sequences that align to a given transcript. Although the final output from an NGS experiment is a digital measure of gene expression, with the units expressed as the numbers of aligned reads instead of intensity, the data and goals are similar enough that we can apply many of the statistical methods developed for working with microarrays to the analysis of NGS data. There are many benefits to using microarray assays, the greatest being low cost and long experience. Over the years, the laboratory methods for sample preparation and the statistical methods for analyzing data have become more standardized. As NGS becomes more commonplace, these new methods are increasingly likely to serve as a complement Current Protocols in Bioinformatics 7.14.1-7.14.35, September 2009 Published online September 2009 in Wiley Interscience (www.interscience.wiley.com). DOI: 10.1002/0471250953.bi0714s27 C 2009 John Wiley & Sons, Inc. Copyright Analyzing Expression Patterns 7.14.1 Supplement 27 or alternative to microarrays. Since these assays are based on DNA sequencing rather than hybridization, the background is low, the results are digital, the dynamic range is greater, and transcripts can be detected even in the absence of a pre-existing probe (Marioni et al., 2008; Wang et al., 2009). Furthermore, once the sequence data are available, they can be aligned to new reference data sets, making NGS data valuable for future experiments. Still, until NGS assays are better characterized and understood, it is likely that microarrays and NGS will serve as complementary technologies for some years to come. In this chapter, we describe using a common platform, GeneSifter Analysis Edition (GSAE; a registered trademark of Geospiza, Inc.), for analyzing data from both microarray and NGS experiments. GSAE is a versatile Web-based system that can already be used to analyze data from a wide variety of microarray platforms. We have added features for uploading large data sets, aligning data to reference sequences, and presenting results, which make GSAE useful for NGS as well. Both kinds of data analyses share several similar features. Data must be entered into the system and normalized. Statistical methods must be applied to identify significant differences in gene expression. Once significantly different expression patterns have been identified, there must be a way to uncover the biological meaning for those results. GSAE provides methods for working with ontologies and KEGG pathways, clustering options to help identify genes that share similar patterns of expression, and links to access information in public databases. Data-management capabilities and quality control measures are also part of the GSAE system. In both of the two basic protocols, we will present general methods for analyzing microarray data, follow those procedures with alternative procedures that can be used to analyze NGS data, and discuss the differences between the microarray protocol and the NGS alternative. Basic Protocol 1 presents a pairwise analysis of microarray data from mice that were fed different kinds of food (Kozul et al., 2008). The protocol uses data from the public Gene Expression Omnibus (GEO) database at the NCBI (Barrett et al., 2009), and demonstrates normalizing the data and the analyses. Alternate Protocol 1, for a pairwise comparison, also uses data from GEO; however, these are NGS data from the Applied Biosystems SOLiD instrument. In Alternate Protocol 1, we use a pairwise analysis to compare gene expression from single wild-type mouse oocytes with gene expression in mouse oocytes containing a knockout mutation for DICER, a gene involved in processing microRNAs (Tang et al., 2009). Basic Protocol 2 presents a general method for analyzing microarray data from samples that were obtained after multiple conditions were applied. In this study, mice were fed two kinds of food and exposed to increasing concentrations of arsenic in their water (Kozul et al., 2008). This protocol includes ANOVA and demonstrates options for Principal Component Analysis, clustering data by samples or genes, and identifying expression patterns from specific gene families. Alternate Protocol 2, a variation on Basic Protocol 2, describes an analysis of NGS data from the Illumina GA analyzer, comparing samples from three different tissues (Mortazavi et al., 2008). Cluster analysis is included in this procedure as a means of identifying genes that are expressed in a tissue-specific manner. As with Basic Protocol 1, these studies use data from public repositories, in this case, GEO and the NCBI Short Read Archive (SRA; Wheeler et al., 2008). It should be noted for both protocols that GSAE contains alternatives to the statistical tools used in these procedures and that other tools may be more appropriate, depending on the individual study. Analyzing Gene Expression Data Using GeneSifter 7.14.2 Supplement 27 Current Protocols in Bioinformatics COMPARING GENE EXPRESSION FROM PAIRED SAMPLE DATA OBTAINED FROM MICROARRAY EXPERIMENTS BASIC PROTOCOL 1 One of the most common types of transcriptome profiling experiments involves comparing gene expression from two different kinds of samples. These conditions might be an untreated and treated control, or a wild-type strain and a mutant. Since there are two conditions, we call this process a pairwise analysis. Often, the two conditions involve replicates as well. For example, we might have four mice as untreated controls and four mice that were subjected to some kind of experimental treatment. Comparing these two sets of samples requires normalizing the data so that we can compare expression within and between arrays. Next, the normalized results are compared and subjected to statistical tests to determine if any differences are likely to be significant. Procedures can also be applied at this stage to correct for multiple testing. Last, we use z scores, ontologies, and pathway information to explore the biology and determine if some pathways are significantly over-represented, and elucidate what this information is telling us about our samples. Figure 7.14.1 provides an overview of this process. In this analysis, we compare the expression profiles from the livers of five mice that were fed for 5 weeks with a purified diet, AIN-76A, with the expression profiles from the livers of five mice that were fed for the same period of time with LRD-5001, a standard laboratory mouse food. pairwise comparison mouse 1 mouse 2 mouse 3 mouse 4 mouse 5 mouse 6 treatment no treatment isolate RNA microarrays upload data, normalize identify differential expression - fold change - quality - statistics-e.g. t -test, others - multiple testing correction - Bonferroni - Benjamini and Hochberg - others explore biology-ontology, KEGG, scatter plot Figure 7.14.1 Overview of the process for a pairwise comparison. Analyzing Expression Patterns 7.14.3 Current Protocols in Bioinformatics Supplement 27 Necessary Resources Software GeneSifter Analysis Edition (GSAE): a trial account must be established in order to upload data files to GSAE; a trial account or license for GeneSifter Analysis Edition may be obtained from Geospiza, Inc. (http://www.geospiza.com) GSAE is accessed through the Web; therefore Internet access is required along with an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or Apple Safari. Files Data files from a variety of microarray platforms may be uploaded and analyzed in GSAE, including Affymetrix, Illumina, Codelink, or Agilent arrays, and custom chips. The example data used in this procedure were CEL files from an Affymetrix array and were obtained from the GEO database at the NCBI (Accession code GSE 9630). CEL files are the best file type for use in GSAE. To obtain CEL files, go to the GEO database at the NCBI (www.ncbi.nih.gov/geo/). Enter the accession number (in this case GSE 9630) in the section labeled Query and click the Go button. In this example, all the files in the data set are downloaded as a single tar file by selecting (ftp) from the Download column at the bottom of the page. After downloading to a local computer, the files are extracted, unzipped, then uploaded to GSAE as described in the instructions. Files used for the AIN-76 group: GSM243398, GSM243405, GSM243391, GSM243358, and GSM243376. Files used for the LRD-5001 group: GSM243394, GSM243397, GSM243378, GSM243382, and GSM243355. A demonstration site with the steps performed below and the same data files can be accessed from the data center at http://www.geospiza.com. Uploading data 1. Create a zip archive from your microarray data files. a. If using a computer with a Microsoft Windows operating system, a commonly used program is WinZip. b. If using Mac OS X, select your data files, click the right mouse button, and choose Compress # Items to create a zip archive. 2. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net). A username and password are provided when a trial account is established. 3. Locate the Import Data heading in the Control Panel on the left-hand side of the screen and click Upload Tools. Several types of microarray data can be uploaded and analyzed in GSAE. Since different microarray platforms produce data in a variety of formats, each type of microarray data has its own upload wizard. In this protocol, we will be working with Affymetrix CEL data from the NCBI GEO database, and so we choose the option for “Advanced upload methods.” This option also allows you to normalize data during the upload process using standard techniques for Affymetrix data such as RMA, GC-RMA, or MAS5. Instructions for using other GSAE upload wizards are straightforward and are available in the GSAE user manual. Analyzing Gene Expression Data Using GeneSifter RMA and GC-RMA are commonly used normalization procedures (Millenaar et al., 2006). Both of these processes involve three distinct operations: global background normalization, across-array normalization, and log2 transformations of the intensity values. One 7.14.4 Supplement 27 Current Protocols in Bioinformatics point to note here is that if you plan to use RMA or GC-RMA, the across-array normalization step requires that all the data be uploaded at the same time. If you wish to compare data to another experiment at a later time, you will need to upload the data again, together with those data from the new experiment. 4. Click the Run Advanced Upload Methods button. 5. Next, select the normalization method and the array type from pull-down menus. Click the Next button (at the bottom of the screen). Choose GC-RMA normalization and the 430 2.0 Mouse array for in this example. 6. In the screen which now appears, browse to locate the data file created in step 1. 7. Choose an option (radio button): Create Groups, Create New Targets, or Same as File Name. Since a pairwise analysis involves comparing two groups of samples, choose Create Groups and set 2 as the value. If the experiment were to involve comparing more than two conditions, other options would be chosen. These are described in Basic Protocol 2. 8. Click the Next button. The screen for the next step will appear after the data are uploaded. 9. On the screen displayed in “step 3 of 4,” you will be asked to enter a title for your data set, assign a condition to each group, add labels to your samples if desired, and identify which sample(s) belong to which group. In this case, decide that the AIN-76A mice should be condition 1 and the LRD-5001 mice should be condition 2. Then, use the buttons to assign all the AIN-76 samples to group 1 and the LRD-5001 samples to group 2. Comparing paired groups of samples and finding differentially expressed genes 10. Begin by selecting Pairwise from the Analysis section of the control panel (Fig. 7.14.2). 11. Find the array or gene set that corresponds to your experiment. In this case, our array is named “Mouse food and arsenic.” 12. Select the spyglass to set up the analysis. A new page will appear with a list of all the samples in the array as well as the analysis options. 13. Use the checkboxes in the group 1 column to select the samples for group 1, and the checkbox in the group 2 column to select the samples for group 2. Usually, the control, wild-type, or untreated samples are assigned to group 1. Here, assign the AIN-76A sample to group 1 and the LRD-5001 samples to group 2. 14. Choose the advanced analysis settings. Since the data were normalized during the uploading process by the GC-RMA algorithm, we can use some of the default settings for the analysis. If you choose a setting that is not valid for RMA or GC-RMA normalized data, warnings will appear to let you know that the data are already normalized or already log transformed. a. Normalization: Use None with RMA or GC-RMA normalized data. This step has already been performed, since RMA and GC-RMA both perform quantile normalization during the upload process. b. Statistics: The statistical tests available from the pull-down menu are used to determine the probability that the differences between the mean values for intensity measurements, for each gene (or probe), from a set of replicate samples, are Analyzing Expression Patterns 7.14.5 Current Protocols in Bioinformatics Supplement 27 1 2 3 1 select Pairwise 2 select the gene set 3 assign samples to the two groups 4 choose analysis settings 5 click Analyze 4 5 Figure 7.14.2 Analyzing Gene Expression Data Using GeneSifter Setting up a pairwise comparison. significant. The significance level for each gene is reported as a p value. When multiple replicates of a sample are used, GSAE users can choose between the t test, Welch’s t test, a Wilcoxon test, and no statistical tests. The t test is commonly used for this step when samples from a controlled experiment are being compared. The t test assumes a normal distribution with equal variance. Other options that may be used are the Welch’s t test, which does not assume equal variance, and the Wilcoxon test, a nonparametric rank-sum test. Since all of these tests look at the variation between replicates, you must have at least two replicates for each group to apply these tests. For the Wilcoxon test, you must have at least four replicates. Use the t test for this example. c. Quality (Calls): The quality options in this menu are N/A, A (absent), M (marginal), or P (present). However, neither RMA nor GC-RMA produce quality values, so N/A is the appropriate choice when these normalization methods are used. d. Exclude Control Probes: Selecting this check box excludes positive and negative control probes from the analysis. This step can be helpful because it cuts down on the number of tests and minimizes the penalty from the multiple testing correction. For our example, check this box. 7.14.6 Supplement 27 Current Protocols in Bioinformatics e. Show genes that are up-regulated or down-regulated: Use the checkboxes to choose both sets of genes or one set. Check both boxes for this example. f. Threshold: The Lower threshold menu allows you to filter the results by the change in expression levels. For example, picking 1.5 as the lower threshold means that genes will only appear in the list if there is at least a 1.5-fold difference in expression between the two groups of samples. Use a setting of 1.5 as the Lower limit and None as the Upper limit. g. Correction: Every time gene expression is measured, in a microarray or Next Gen experiment, there is a certain probability that the results will be identified as significantly different, even though they are not. These kinds of results can be described as false positives or as type I errors. As we increase the number of the genes tested, we also increase the probability of seeing false positives. For example, if we have a p value of 0.05, we have a 5% chance that the gene expression difference between the two groups resulted from chance. When a large data set such as one generated by a microarray experiment is analyzed, with a list of 10,000 genes (an average-sized microarray), about 500 of those genes could be incorrectly identified as significant. The correction methods in this menu are designed to compensate for this kind of result. Four different options are available in GSAE to adjust the p values for multiple testing and minimize the false-discovery rate. Since these methods are used to correct the p values obtained from statistical tests, these corrections are only be applied if a statistical test, such as a t test, has been applied. GSAE offers the following correction methods: Bonferroni, Holm, Benjamini and Hochberg, and Westfall and Young maxT corrections. The Bonferroni and Westfall and Young corrections calculate a family-wise error rate. This is a very conservative requirement, with a 5% chance that you will have at least one error. The Benjamini and Hochberg correction calculates a False Discovery Rate. With this method, when the error rate equals 0.05%, 5% of the genes considered statistically significant will be false positives. Benjamin and Hochberg is the least stringent of the four choices, allowing for a greater number of false positives, and fewer false negatives. When it comes to choosing a correction method, we choose correction methods depending on our experimental goal. If our goal is discovery, we can tolerate a greater number of false-positive results in order to minimize the number of false negatives. If we choose a method to minimize false positives, we have to realize that some of the real positives may be missed. Genes with real differences in expression may appear to show an insignificant change after multiple testing corrections are applied. One of the most important reasons for using these tests is that they allow the user to rank genes in order of the significance of change, and make choices about which changes to investigate. For this example, choose Benjamini and Hochberg. h. Data Transformation: Our data are already log2 transformed, since RMA and GCRMA both carry out this step during the upload process. Choose Data Already Log Transformed for this example. i. Click the Analyze button. A page with results appears when the processing step is complete. Investigating the biology Figure 7.14.3 shows the results from our pairwise analysis of the microarray data—the differentially expressed genes. Pull-down menus in the middle of the page contain options Analyzing Expression Patterns 7.14.7 Current Protocols in Bioinformatics Supplement 27 7 6 view changes by ontology or the z score view changes in KEGG pathways 5 see the scatter plot 8 2 change to adjusted p 1 change to adjusted p 3 save your results click search to apply changes 539 genes show a significant change 4 this gene is expressed about 7-fold more highly in mice fed with LRD-5001 Figure 7.14.3 select a name to view the gene summary Analyzing the results from a pairwise comparison. for sorting and changing the views. You may increase the number of genes in the list, sort by the ratio, p value, or adjusted p value, choose a p-value cutoff so that genes are only shown if the p values are below a certain number, and change the presentation from the raw p value to the adjusted p value. After choosing selections from the menus, click the Search button to show the results. When this page first appears, our results show a list of 764 genes that are differentially expressed. Arrows on the left side of each gene ratio point up if a gene shows an increase in expression relative to the first group or down if a gene shows decreased expression. The ratio shows the extent of up- or down-regulation. When this page first appears, the list is filtered by the raw p value. 15. Filter based on the corrections for multiple testing by selecting “adjusted p” from the raw p value menu and clicking the Search button. By choosing “adjusted p” from the left pull-down menu to correct for the false discovery rate, calculated by the Benjamini and Hochberg correction, and clicking the Search button to show the p values for the differences between each gene, the number of genes is changed to 539. Analyzing Gene Expression Data Using GeneSifter 16. Next, it can be helpful to sort the data. Initially, the data are shown sorted by ratio so that genes with a larger-fold change appear earlier in the list. It can also be helpful to sort the data by the p value or the adjusted p value to see which genes show the most 7.14.8 Supplement 27 Current Protocols in Bioinformatics significant change. Choose “Adj. p” from the Sort By menu to sort by the adjusted p value. Sorting by the adjusted p value shows that the genes with the most significant changes are cytochrome p450, family 2, subfamily a, polypeptide 4, and glutathione S-transferase, alpha 2 gene. 17. We can learn more about any gene in the list by clicking its name. Clicking the top gene in the list brings us to a page where we can view summarized information for this gene and obtain links to more information in public databases. 18. Click Scatter Plot to view the differences in gene expression another way. A new window will open with the data presented as a scatter plot (Fig. 7.14.4). 2 click the zoom button cytochrome P450, family 2, subfamily a, LRD-5001, water 0 100000 1 drag the box over the genes you wish to view in detail up-regulated genes appear on this side of the diagonal line 10000 3 click a spot to see more details 1000 100 down-regulated genes appear on this side 10 >> Gene Info Cytochrome P450, family 2, subfamily a, polypeptide 4 1 1 10 100 1000 10000 100000 AIN-76A, water 0 Open static scatter plot Group N Mean SEM/Mean Quality AIN-76A, water 0 5 10.0036 0.1127 SEM 1.13% 0.0000 LRD-5001, water 0 5 12.7881 0.1339 1.05% 0.0000 gene summary information appears in this lower corner Each spot in the graph represents the expression measurements for one gene. The expression level for group 2 is plotted on the y axis and the value for group 1 is plotted on the x axis. Intensity 15 10 5 0 AIN-76A, water 0 LRD-5001, water 0 LRD-5001, water 0 up-regulated 6.9 fold compared to AIN-76A, water 0 Figure 7.14.4 Scatter plot. Analyzing Expression Patterns 7.14.9 Current Protocols in Bioinformatics Supplement 27 a. The scatter plot. The scatter plot gives us a visual picture of gene expression in the different samples. The levels of gene expression in group 1 (mice fed with AIN-76A) are plotted on the x axis and group 2 (mice fed with LRD-5001) on the y axis. Genes that are equally expressed in both samples fall on the diagonal line. Genes that are expressed more in one group or in the other appear either above the line (group 2) or below the line (group 1) depending on the group that shows the highest level of expression. If we used a method to correct for the false-discovery rate, then the points for genes showing nonsignificant changes would be colored gray, up-regulated genes showing a significant change would be colored red, and down-regulated genes showing a significant change would be colored blue or green. b. The zoom window and gene summary. To learn more about any gene in the graph, we drag the box on top of a spot and click the “zoom” button. After a short time (up to 30 sec), the highlighted spot and surrounding spots will appear in the top right window. If spots overlap, you may separate them by dragging them with the mouse. The name of each gene will appear when the mouse is moved over a spot, and clicking a spot will produce the gene summary information in the lower right corner. In our experimental example, clicking some of the spots will find genes that were seen earlier in the list, such as genes for members of the cytochrome p450 family and glutathioneS-transferase. 19. Return to the results window and click the KEGG link. a. The KEGG report. The KEGG report, as shown in Figure 7.14.5, presents a list of biochemical and regulatory pathways that contain members from the list of differentially expressed genes on the results page. Each row shows the name of the pathway, a link to a list of gene-list members that belong to that pathway, with arrowheads to show if a member is up- or down-regulated, a link to the KEGG pathway database, the number of genes from the list that belong to that pathway, the number of genes that are up-regulated, the number down-regulated, the total number from that pathway that were present in the array (or reference data set for Next Gen data), and the z scores for up- and down-regulated genes. b. z scores. z scores are used to evaluate whether genes from a specific pathway are enriched in your list of differentially expressed genes. If genes from a specific pathway are represented in your gene list more often than they would be expected to be seen by chance, the z-scores reflect that occurrence. A z score greater than 2 indicates that a pathway is significantly enriched in the list of differentially expressed genes, while a z-score below −2 indicates that a pathway is significantly under-represented in the list. The direction and color of the arrowheads show whether those genes are up- or down-regulated in the second group relative to the first group of samples. Clicking the arrows above a z score column will allow you to sort by z scores for up-regulated or down-regulated genes. Click the arrowhead that is pointed up in the z score column to sort by up-regulated genes. We can see at least 20 pathways are up-regulated when mice are fed LRD-5001. c. Genes. Pick one of the top listed pathways and click the corresponding icon in the Genes column. A new section will appear underneath the name of the pathway. Before proceeding, look at the values in the List, totals, and Array column. We can see in our analysis that the cytochrome P450 pathway for metabolizing xenobiotics is significantly up-regulated and contains 19 members from our 539-member gene list. We also see that those members are all up-regulated and that there are 53 genes on the array that belong to this pathway. Analyzing Gene Expression Data Using GeneSifter Now, look at the list of genes in the newly opened section. Where we had 19 genes shown as the value in the list column, there are 26 genes listed below the name of the pathway. 7.14.10 Supplement 27 Current Protocols in Bioinformatics click the KEGG icon to see a diagram of the pathway click the gene icon to see a list of the genes that belong to this pathway genes from the list are in green boxes genes that are up-regulated are in boxes with red numbers and a red border 1 2 3 4 5 6 1 The number of genes from the list on the analysis page, that belong to this pathway. 2 The number of genes in this pathway that are up-regulated in group 2 relative to group 1. 3 The number of genes in this pathway that are down-regulated in group 2 relative to group 1. 4 The number of genes on the array that belong to this pathway. 5 The z score for the number of genes that belong to this pathway and are up-regulated. Clicking the red arrow, will sort the list by z scores. 6 The z score for the number of genes that belong to this pathway and are down-regulated. Clicking the green arrow, will sort the list by z scores. Figure 7.14.5 KEGG pathway results. Most of the genes have different names, but some appear to be identical. For example, there are three listings for glutathione-S-transferase, mu 1. Are they really the same gene? Clicking the gene names shows us that two entries have the same accession number. One possible explanation for their duplication in the list could be that they are represented multiple times on the array. It could also be that the probes were originally thought to belong to different genes and now, with a better map, are placed in the same gene. We also see that one of the three genes has a different accession number. This entry might represent a different isoform that is transcribed from the same gene. Many arrays do not distinguish between alternative transcripts and count them all together. Affymetrix arrays can also have multiple probe sets for a single gene; in these cases, the gene will appear multiple times since intensity measurements will be obtained from each probe. It should also be noted that some genes may belong to multiple KEGG pathways (see below). d. KEGG pathways. Click the KEGG icon to access the KEGG database and view more details for a KEGG pathway. Once we have identified KEGG pathways with significant changes, we can investigate further by selecting the links to the individual genes in that pathway or we Analyzing Expression Patterns 7.14.11 Current Protocols in Bioinformatics Supplement 27 can select the KEGG icon to view the encoded enzymes in the context of a biochemical pathway. Clicking the boxes in the KEGG database takes us to additional information about each enzyme. In our experiment, we find that 19 of the 53 genes in the array are up-regulated and belong to the cytochrome P450 pathway for metabolizing xenobiotics. The KEGG pathway shows some of the possible substrates for these enzymes. It would be interesting to look more closely at LRD-5001 and see if it contains naphthalene or benzopyrene, or one of the other compounds shown in the KEGG pathway. Other pathways that are up-regulated, when mice are fed LRD-5001 instead of AIN-76A, are pathways for biosynthesis of steroids, fatty acid metabolism, arachidonic acid metabolism, etc. Down-regulated pathways include those for pyruvate metabolism and glycolysis. 20. Return to the results window and click Ontology (options described below). a. Ontology reports. An overview of the ontology reports and their features is shown in Figure 7.14.6. Three kinds of ontology reports are available from Ontology: a set organized by biological process, another by cellular component, and a third by molecular function. Each report shows a list of ontologies that contain up or down-regulated genes from the list of 539 genes. i. Ontology. Selecting the name of an ontology, allows you to drill down and view subontologies. ii. Genes. Clicking the icon in the genes column shows the genes from the gene list that belong to that ontology. iii. GO. Clicking the GO icon opens the record for the ontology in the AmiGO database. iv. List. The list column shows the total number of genes, from the gene list, both up- and down-regulated, that have that ontology as part of their annotation. v. Totals (up or down). One column contains the values for number of up-regulated genes in the list that belong to an ontology. The other column shows the number of down-regulated genes that belong to that ontology. vi. Array. This value shows the number of probes on a microarray chip that could correspond to genes in an individual ontology. vii. z-score. As with the KEGG report, the z-score provides a way to determine whether a specific ontology is over- or under-represented in the list of differentially expressed genes. Significant z scores are above 2 or below negative 2. We cannot sort by z scores on the ontology report pages, but we can sort by z scores from the z score report. viii. Pie graph. The pie graph depicts the ontologies in the list and the numbers of members. b. z-score reports. Each ontology report page contains a link to a z-score report. Where the ontology reports show ontologies through a hierarchical organization, the z-score report shows all the ontologies with significant z-scores, without the need to drill down into the hierarchy. This is helpful both because significant z scores can be hidden inside of a hierarchy, and because this report allows you to sort by z scores. It should also be noted that some genes may belong to multiple ontologies. Analyzing Gene Expression Data Using GeneSifter When we look at the ontology information for our experiment, we can see that the most significant ontologies in biological processes are metabolism, cellular processes, and regulation; for cellular components, we see that cells and cell parts are significant, and for the molecular function ontology, catalytic activity and electron carrier activity are significant. When we look at the z score report for molecular function and sort our results by up-regulated genes, we see that many genes show oxidoreductase and glutathione-Stransferase activity, which is consistent with our findings from the KEGG report. Selecting the Genes icon shows us that those genes are cytochrome P450s. Taking all of our data together, we can conclude that genes for breaking down substances like xenobiotics are expressed more highly when mice are fed LRD-5001 than when they are fed AIN-76A. 7.14.12 Supplement 27 Current Protocols in Bioinformatics Figure 7.14.6 Gene ontology reports. COMPARE GENE EXPRESSION FROM PAIRED SAMPLES OBTAINED FROM TRANSCRIPTOME PROFILING ASSAYS BY NEXT-GENERATION DNA SEQUENCING Several experiments have been published recently where NGS or “Next Gen” technologies were used for transcriptome profiling. NGS experiments have three phases for data analysis. First, there is a data-collection phase where the instrument captures information, performs base-calling, and creates the short DNA sequences that we refer to as “reads.” Next, there is an alignment phase, where reads are aligned to a reference data set and ALTERNATE PROTOCOL 1 Analyzing Expression Patterns 7.14.13 Current Protocols in Bioinformatics Supplement 27 counted. Last, there is a comparison phase where the numbers of read counts can be used to gain insights into gene expression. Many of the steps in the last phase are similar to those used in the analysis of microarray data. In this protocol, we will describe analyzing data from two NGS data sets and their replicates. These data were obtained from an experiment to assess the transcriptome from single cells (mouse oocytes) with different genotypes (Tang et al., 2009). In one case, wild-type mouse oocytes were used. In the other case, the mouse oocytes had a knock-out mutation for DICER, a gene required for processing microRNAs. We will discuss uploading data and aligning the data, view the types of information obtained from the alignment, and compare the two samples to each other, mentioning where the NGS data analysis process differs from a pairwise comparison of samples from microarrays. Necessary Resources Software GeneSifter Analysis Edition (GSAE): a trial account must be established in order to upload data files to GSAE; a license for the GeneSifter Analysis Edition may be obtained from Geospiza, Inc. (http://www.geospiza.com) GSAE is accessed over the Web; therefore, Internet access is required along with an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or Apple Safari Files Data files may be uploaded from a variety of sequencing instruments. For the Illumina GA analyzer, the data are text files, containing FASTA-formatted sequences. Data from the ABI SOLiD instrument are uploaded as csfasta files. The example NGS data used in this procedure were generated by the ABI SOLiD instrument and obtained as csfasta files from the GEO database at the NCBI (Accession number GSE14605). The csfasta files are obtained as follows. The accession number GSE14605 is entered in the data set search box at the NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo/) and the Go button is clicked. The csfasta files are downloaded for both wild-type mouse oocytes and DICER knockout mouse oocytes by clicking the links to the file names and clicking (ftp) for the gzipped csfasta files: GSM365013.filtered.csfasta.txt.gz, GSM365014.filtered.csfasta.txt.gz, GSM365015.filtered.csfasta.txt.gz, GSM365016.filtered.csfasta.txt.gz. 1. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net). Uploading data 2. Locate the Import Data heading in the Control Panel and click Upload Tools. The uploading and processing steps described for Next Gen data sets require a license from Geospiza. However, you may access data that have already been uploaded and processed from a demonstration site. The demonstration site can be accessed from the data center at http://www.geospiza.com. 3. Click the Next Gen File Upload button to begin uploading Next Gen data. 4. Enter a name for a folder. Analyzing Gene Expression Data Using GeneSifter Folders are used to organize Next Gen data sets. 7.14.14 Supplement 27 Current Protocols in Bioinformatics 5. Click the Next button. 6. Two windows will appear for managing the upload process. Use the controls in the left window to locate your data files. Once you have found your data files, select them with your mouse and click the blue arrowhead to move those files into the Transfer Queue. 7. Once the files you wish to transfer are in the Transfer Queue, highlight those files and click the blue arrow beneath the Transfer Queue window to begin transferring data. Transferring data will take a variable amount of time depending on your network, the volume of network traffic, and the amount of data you are transferring. A 2-GB Next Gen data set will take at least 40 min to upload. Aligning Next Gen data to reference data Once the data have been uploaded to GSAE, the reads in each data set are aligned to a reference data source. During this process, the number of reads mapping to each transcript are counted and normalized to the number of reads per million reads (RPM) so that data may be compared between experiments. 8. Access uploaded Next Gen data sets by clicking Next Gen in the Inventories section of the control panel. 9. Use the checkboxes to select data sets for analysis, then click the Analyze button on the bottom right side of the table. A new page will appear. 10. Choose the Analysis Type, Reference species, and a Reference Type from the corresponding pull-down menus. a. Analysis Type: The Analysis Type is determined by the kind of data that were uploaded and the kind of experiment that was performed. For example, if you uploaded SOLiD data, analysis options specific to that data type would appear as choices in the menu. For SOLiD data, the alignment algorithm is specific for data in a csfasta format. Choose RNA-Seq (SOLiD, 3 passes). b. Reference Species: The Reference Species is determined by the source of your data. If your data came from human tissues, for example, you would select “Homo sapiens” as the reference species. Since our data came from mouse, choose “Mus musculus.” c. Reference Type: The choices for Reference Type are made available in the Reference Type menu after you have selected the analysis type and reference species. The Reference Type refers to the kind of reference data that will be used in the alignment. Since we are measuring gene expression, choose “mRNA” as the reference type. This reference data set contains the RNA sequences from the mouse RefSeq database at the NCBI. 11. Click the checkbox for “Create Experiment(s) upon completion.” This selection organizes your data as an experiment, allowing you to compare expression between samples after the analysis step is complete. In order to set up experiments, GeneSifter must already contain an appropriate Gene Set. A Gene Set is derived from the annotations that accompany the reference data source. 12. Click the Analyze button to queue the Next Gen data set for analysis. 13. The analysis step may take a few hours depending on the size of your data file and the number of samples that need to be processed. Analyzing Expression Patterns 7.14.15 Current Protocols in Bioinformatics Supplement 27 Read Mapping Statistics–RNA-Seq Analysis Gene List Total number of reads: Number of unmapped reads: 12537930 1993626 10544304 Number of mapped reads: Non-uniquely mapped reads: Uniquely mapped reads: 1759919 8784385 6163136 Uniquely mapped reads with 0 mismatches: Uniquely mapped reads with 1 mismatches: Uniquely mapped reads with 2 mismatches: Uniquely mapped to ribosomal, >> Analysis Job Details (15.90% of all reads) (84.10% of all reads) (14.04% of all reads) 2 mismatches: (70.06% of all reads) (70.16% of uniquely mapped reads) (19.88% of uniquely mapped reads) 1746203 875046 0 (9.96% of uniquely mapped reads) (0.00% of uniquely mapped reads) Uniquely Mapped Reads by Chromosome Job Info: Job ID: 51 Analysis Type: RNA-Seq (SOLID, 3 passes) Input File: GSM365013.filtered.csfasta.txt State: Complete Initiated: 2009-05-15 13:33:10 Completed: 2009-05-15 14:26:43 Comment: – Remote ID: 2032 Experiment: GSM365013 -1- 18 1617 15 14 19 X 1 2 3 13 4 12 5 11 6 10 9 8 7 RNA-Seq (SOLID, 3 passes) Info: Reference Species: Mus musculus Reference Type: mRNA Analysis Results: Read alignment statistics (HTML) Gene list (Text) Gene list (HTML) Job log file Standard error Standard output Figure 7.14.7 Gene List for genes.txt Summary Statistics Reads RPKM RPM Entrez Overview Download Gene List RefSeq ID Title Gene ID Chrom. Type 137275 12756.87 15627.16 12049 NM_013479 Bcl2-like 10 Bcl2l10 9 mRNA 115536 11806.49 13152.43 171506 NM_138311 H1 histone family, member O, oocyte-specific H1foo 6 mRNA 103361 8452.91 11766.45 21432 NM_009337 87655 2412.60 9978.50 20729 NM_011462, NM_146043 spidlin 1 80718 4851.53 9188.80 72114 NM_028106 T-cell lymphoma breakpoint Zinc finger, BED domain containing 3 Tcl 1 12 mRNA Spin1 13 mRNA Zbed3 13 mRNA Analysis results from NGS data, obtained from an ABI SOLiD instrument. Viewing the Next Gen alignment results 14. When the alignment step is complete, you will be able to view different types of information about your samples. Click the file name to get to the analysis details page for your file, then click the Job ID to get the information from the analysis. 15. The exact kinds of information will depend on the data type and the algorithms that were used to align the reads to the reference data source (Fig. 7.14.7). The types of information seen from Illumina data will be described in the next protocol. For SOLiD data, you will see information that includes: Analyzing Gene Expression Data Using GeneSifter a. Read alignment statistics: These include the total number of reads and the numbers that were mapped, unmapped, or mapped to multiple positions. Sets of reads can also be downloaded from the links on this page. b. Gene list (text): A gene list can be downloaded as a text file after the alignment is complete. 7.14.16 Supplement 27 Current Protocols in Bioinformatics c. Gene list (html): The gene list (html) shows a table with information for all the transcripts identified in this experiment. i. Reads: A read is a DNA sequence obtained, together with several other reads, from a single sample. Typical reads from Next Gen instruments such as the ABI SOLiD and the Illumina GA are between 25 and 50 bases long. The number of reads in the first column equals the number of reads from a single sample that were aligned to the reference data set, in this case, RefSeq RNAs. ii. RPKM: Reads per thousand bases, per million reads. This column shows the number of reads for a given transcript divided by the length of the transcript and normalized to 1 million reads. Dividing the number of reads by the transcript length corrects for the greater number of reads that would be expected to align to a longer molecule of RNA. iii. RPM: Reads per million reads. iv. Entrez: This column contains links to the corresponding entries in the Entrez Gene database. v. Image maps: Image maps are used to show where reads align to each transcript. The transcripts in these images are all different lengths. vi. RefSeq ID: The RefSeq accession number for a given transcript. vii. Title: The name of the gene from RefSeq. viii. Gene ID: The symbol for that gene. ix. Chrom: The chromosomal location for a gene. x. Type: The type of RNA molecule. Comparing paired samples and finding differentially expressed genes In the next step, the numbers of reads mapping to each transcript are compared in order to quantify differential gene expression between the samples. This process is similar to the process that we used in Basic Protocol 1; we will set up our analysis, apply statistics to correct for multiple testing, then view the results from the scatter plot, KEGG pathways, and ontology results to explore the biology. 16. Locate the Analysis section in the GSAE Control Panel and select Pairwise. A list of potential array/gene sets will appear. The gene sets correspond to the results from analyzing Next Gen data. Clicking the name of a gene set will allow you to view the samples that belong to that set. 17. To set up the analysis, either click the spyglass on the left of a gene set, or click the name of the gene set and choose “Analyze experiments from this array” from the middle of the window. A page will appear where you can assign samples to a group and pick the analysis settings. 18. Use the checkboxes to assign one sample (or set of samples) to group 1 (these are often the control samples) and the other sample (or set of samples) to group 2. Assign the two sets from wild-type mouse oocytes to group 1 and the two sets from the DICER knock-out oocytes to group 2. 19. Use the pull-down menus to select the advanced analysis settings. a. Normalization. This step involves normalizing data for differences in signal intensity within and between arrays. This type of normalization process does not apply to Next Gen data since Next Gen measurements are derived from the number of reads that map to a transcript instead of the intensity of light. Next Gen sequence data are normalized by GSAE but this happens during the alignment phase. During the alignment process, the number of reads from each experiment is Analyzing Expression Patterns 7.14.17 Current Protocols in Bioinformatics Supplement 27 normalized to the number of mapped reads per million reads (RPM). This allows data from different experiments to be compared. For this example, choose “None” from the menu. b. Statistics. The statistical tests available from this menu are used to determine if the differences between the mean numbers of read counts (or intensity measurements, in the case of microarrays), from a set of replicate samples, are significant. The significance levels are reported as p values, i.e., the probability of seeing a result by chance. For this example, choose “t test” for the statistics. c. Quality. For Next Gen data, the quality values correspond to the number of reads per million transcripts and range from 0.5 to 100. For this example, set the quality at “1”, meaning that we will only look at transcripts where there the RPM value is at least 1 in one of the samples being compared. d. Show genes that are up-regulated and/or down-regulated. Selecting the checkboxes allows you to choose whether to limit the view to up-regulated or down-regulated genes, or to show both types. For this example, check both boxes. e. Threshold, Lower. The threshold corresponds to the fold-change. For this example, choose 1.5 as the lower threshold. f. Threshold, Upper. This option is usually set to “none,” however, if you wish to filter out highly expressed genes, you might wish to set an upper threshold. For this example, leave the upper threshold at “none.” g. Correction. For this example, choose the Benjamini and Hochberg correction. h. Data Transformation. Use these buttons to choose whether the data will be log transformed or not. Log transformations are often used with microarray data to make the data more normally distributed. For this example, apply a log transformation to the data. 20. Click the Analyze button. When the analysis is complete, the results page will appear. Viewing the results The results page shows the two groups of samples that were compared and the conditions that were used for the comparison. All the genes that varied in expression by at least 1.5 fold are listed in a table on this page. 21. Choose “adjusted p” from the last menu and click the Search button. Adjusted p values are the p values obtained after the multi-test correction (in our case, Benjamini and Hochberg) has been applied. Analyzing Gene Expression Data Using GeneSifter In this analysis, choosing the adjusted p value decreases the number of differentially expressed genes from 1449 to 28. As noted earlier, although the multiple testing correction provides a way to sort genes by the significance, genes that truly change may be missed when these corrections are applied. To view additional genes that may be candidates for study, you can raise the cut-off limit for the adjusted p values, using the pull-down menu, or skip the multiple test correction altogether. 7.14.18 Supplement 27 Current Protocols in Bioinformatics Interpreting the results After adjusting the p value, only 28 genes in our set show significant changes. It is helpful at this point to save our results before proceeding on to further analyses. Since the reports that we would use next (the scatter plot, KEGG pathway information, and ontology reports) are the same as in Basic Protocol 1, we will leave it to the reader to refer to the earlier protocol for instruction. The one point we would like to discuss here is interpreting the gene summary and the differences between the gene summaries for microarray and Next Gen data. Each gene in the list is accompanied by a summary that can be accessed by clicking the gene name. The summary page presents information about expression levels at the top and links to external databases in the bottom half. Summaries from both microarray data and Next Gen data (Fig. 7.14.8) show the number of samples (N), along with the values for each sample and the standard error of the mean. Where the two kinds of summaries differ is in intensity and quality values. For microarray data, the columns labeled “intensity values” do show the intensity data. If the data were log transformed during the upload process or the analysis, then the log-transformed values are reported. For Next Gen data, however, the values in the “intensity values” column are not intensity values. When Next Gen data are used, these values refer to the normalized number of Figure 7.14.8 Gene summaries for microarray and NGS data. A gene summary from a microarray sample is shown in the top half of the image and a summary for a sample analyzed by NGS is shown in the bottom half. Note the difference between the intensity and quality values. Analyzing Expression Patterns 7.14.19 Current Protocols in Bioinformatics Supplement 27 reads that were mapped to a gene (RPM). If the data were log transformed during the analysis, then these values are the log-transformed values. The other difference between these data for the two systems is in the quality column. For Next Gen data, the quality column shows the RPM value for that gene. In the quality column for the Next Gen data, two of the samples show quality values of zero. This means that zero transcripts were detected. The other two samples show values around 6, indicating that approximately 6 transcripts were detected, per million reads, for the Drebrin-like gene. BASIC PROTOCOL 2 COMPARING GENE EXPRESSION FROM MICROARRAY EXPERIMENTS WITH MULTIPLE CONDITIONS GSAE has two modes for analyzing data, depending on the number of factors that are tested. If two factors are compared, such as treated and untreated, or wild-type and mutant samples, then a pairwise analysis, as described in Basic Protocol 1, is used to compare the results. If an experiment involves multiple conditions, such as a time course, different drug dosing regimes, and perhaps even different genotypes, then the analysis is considered a project. GSAE projects have additional capabilities for analyzing these projects as well as different statistical procedures for identifying significant changes in expression. Some of the tests that can be performed with GSAE are a one-way ANOVA, a two-way balanced ANOVA, and a non-parametric Kruskal-Wallis test. Corrections for multiple testing such as those from Bonferroni, Holm, and Benjamini and Hochberg can also be applied. Additional analyses are clustering, or using the Pearson coefficient to look for patterns of expression. Specific searches for genes by name, characteristic, or function can also be performed. In Basic Protocol 2, we describe a general procedure (shown in Fig. 7.14.9) for analyzing microarray data from specimens that were obtained from different treatments. An alternative procedure (Alternate Protocol 2) follows in which we will demonstrate a multiple-condition analysis with Next Gen data from the Illumina GA analyzer. The samples used in Basic Protocol 2 were obtained from the GEO database. These samples came from the same study described in Basic Protocol 1. RNA was isolated from mouse livers where two factors were examined: diet and arsenic in the drinking water. Over a 5-week period, the mice were fed two kinds of food, AIN-76A, a purified diet, or LRD-5001, a standard laboratory mouse food, and given arsenic in their water at three different concentrations (0, 10 ppb, or 100 ppb). There were four to five biological replicates (mice) for each treatment. We will demonstrate setting up the analysis, applying statistics and multiple testing corrections, and using some of the clustering tools. Some of the clustering methods, PAM and CLARA, will be discussed in Alternate Protocol 2 rather than Basic Protocol 2. Necessary Resources Software GeneSifter Analysis Edition (GSAE): a trial account must be established in order to upload data files to GSAE; a license for the GeneSifter Analysis Edition may be obtained from Geospiza, Inc. (http://www.geospiza.com) GSAE is accessed over the Web; therefore, Internet access is required along with an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or Apple Safari. Files Analyzing Gene Expression Data Using GeneSifter 7.14.20 Supplement 27 Data files from a variety of microarray platforms may be uploaded and analyzed in GSAE, including Affymetrix, Illumina, Codelink, or Agilent arrays, and custom chips Current Protocols in Bioinformatics multiple sample comparison mouse 1 mouse 2 mouse 3 mouse 4 mouse 5 treatment, 10 ppb As no treatment condition 1 mouse 7 mouse 8 condition 2 mouse 9 mouse 10 mouse 11 mouse 12 treatment, 1000 ppb As treatment, 100 ppb As condition 3 upload data, normalize identify differential expression - fold change - quality - ANOVA - multiple testing correction - Bonferroni - Benjamini and Hochberg - others Figure 7.14.9 mouse 6 condition 4 isolate RNA microarrays visualize results hierarchical clustering PCA explore biology ontology KEGG scatter plot partitioning PAM silhouettes Overview of an experiment comparing multiple conditions. The example data used in this procedure were CEL files from an Affymetrix 430 2.0 Mouse array and were obtained from the GEO database at the NCBI (Accession code GSE 9630). CEL files are the best file type for use in GSAE. To obtain CEL files, go to the GEO database at the NCBI (www.ncbi.nih.gov/geo/). Enter the accession number (in this case GSE 9630) in the section labeled “Query” and click the Go button. In this example, all the files in the data set are downloaded as a single tar file by selecting (ftp) from the Download column at the bottom of the page. After downloading to a local computer, the files are extracted, unzipped, and uploaded to GSAE as described in the instructions. Files used for the AIN-76 group with 0 ppb arsenic: GSM243398, GSM243405, GSM243391, GSM243358, and GSM243376; for the AIN-76 group with 10 ppb arsenic: GSM243359, GSM243400, GSM243403, GSM243406, GSM243410; for the AIN-76 group with 100 ppb arsenic: GSM243353, GSM243365, GSM243369, GSM243377; for the LRD-5001 group with 0 ppb arsenic: GSM243394, GSM243397, GSM243378, GSM243382, and GSM243355; for the LRD-5001 group with 10 ppb arsenic: GSM243374, GSM243380, GSM243381, GSM243385, GSM243387; and for the LRD-5001 group with 100 ppb arsenic: GSM243354, GSM243356, GSM243383, GSM243390, GSM243392. A demonstration site with the same files and analysis procedures can be viewed from the data center at http://www.geospiza.com. Analyzing Expression Patterns 7.14.21 Current Protocols in Bioinformatics Supplement 27 Uploading data 1. Create a zip archive from your microarray data files. a. If using a computer with a Microsoft Windows operating system, a commonly used program is WinZip. b. If using Mac OS X, select your data files, click the right mouse button, and choose “Compress # Items” to create a zip archive. The resulting archive file will be called Archive.zip. 2. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net). 3. Locate the Import Data heading in the Control Panel on the left-hand side of the screen and click Upload Tools. Several types of microarray data can be uploaded and analyzed in GSAE. See Basic Protocol 1 for detailed descriptions. Our data were uploaded using the option for “Advanced upload methods” and normalized with GC-RMA. 4. On the page that appears, click the Run Advanced Upload Methods button. 5. Select the normalization method and the array type from pull-down menus. For this example, use GC-RMA and the 430 2.0 Mouse array. Click the Next button. 6. In the screen which now appears, browse to locate the data file on your computer. 7. Choose the option (radio button) for Create Groups, Create New Targets, or Same as File Name. Our data came from mice that were given two different kinds of food and drinking water with three concentrations of arsenic, so six groups were created. Therefore, set 6 as the value next to Create Groups. 8. Click the Next button. The screen for the next step will appear after the data are uploaded. 9. On the screen displayed in “step 3 of 4,”, you will be asked to enter a title for your data set, assign a condition to each group, add labels to your samples if desired, and identify which sample(s) belong to which group. In our case, we have six conditions (see Table 7.14.1), with four to five biological replicates (targets) for each condition. We kept the original file names as the target or sample names. Setting up a project for analysis 10. We begin the analysis process by creating a project. Select New Project from the Create New section, add a title for the project, click the checkbox next to the array that contains the samples, and click the Continue button. Table 7.14.1 Conditions Used for the Example in Basic Protocol 2 Analyzing Gene Expression Data Using GeneSifter Condition Mouse food Arsenic in water (ppb) 1 AIN-76A 0 2 AIN-76A 10 3 AIN-76A 100 4 LRD-5001 0 5 LRD-5001 10 6 LRD-5001 100 7.14.22 Supplement 27 Current Protocols in Bioinformatics 11. Enter the name for the control group as the group name and any descriptive information in the Description field. 12. Choose a Normalization option. Leave the setting at None because our data were log transformed and normalized when we used GC-RMA during the upload process. 13. Choose a Data Transformation option. Leave the setting at “Data already log transformed” because our data were log transformed and normalized when we used GC-RMA during the upload process. 14. Select a group for a control sample and use the arrow button to move that group to the box on the right side. Choose AIN-76A, with 0 as the control sample. 15. Select the other groups that will be part of the analysis and move them to the right side by clicking the arrow button. 16. Click the Create Group button. 17. Next, select the samples for each condition. Select all the experiments and click the Create Group button. A new page will appear with a list of all the conditions and all the samples for each condition. 18. Choose the samples that will be used in the analysis. You may choose the samples one by one, or if all the samples will be used, click Select All Experiments. Click Select All Experiments. 19. Click the Create Group button. A small window will appear while data are processing. When the processing step is complete, a new page will appear stating that your project has been created. From this point, you can continue the analysis by selecting Analyze This Project or you can analyze the project at a later time. Identifying differential gene expression 20. Select Projects from the Analysis section of the Control Panel. 21. Choose the project name to review the box plots for the samples and replicates in the project. When we analyze multiple samples, GSAE creates box plots that allow us to evaluate the variation between experimental groups and the replicate samples within each group. The box plot, also known as a “box and whiskers plot,” shows the averaged data either from a group of replicate samples or from the intensity values for a single sample. The line within the box represents the median value for the data set. The ends of the whiskers show the highest and lowest values. If a box and whiskers graph is made from data with a normal distribution, the graph would look like the box plot in Figure 7.14.10. Box plots are helpful for quality control. If we find a box plot with a different median value from the other samples, it could indicate a problem with that sample or array. a. Locate the Project Info section in the Project Details page and click Boxplot. A box plot will appear, as shown in Figure 7.14.11A, with plots representing all six of the different conditions. Notice that all six of the plots have similar shapes and similar values. b. Return to the Project Details page. Locate the bottom section, entitled Group Info. Each of the conditions in this section has between four and five replicates and a box plot (Fig. 7.14.11B). The box plot link for this section opens a window for Analyzing Expression Patterns 7.14.23 Current Protocols in Bioinformatics Supplement 27 a box-and-whiskers plot showing a normal distribution of data highest value 3rd quartile range median 1st quartile lowest value Figure 7.14.10 Box plot. a box for each replicate. Click the box plot link for some of the replicates to see if the replicates are similar or if any of the replicates appear to be different from other members of the group. 22. Click Analyze This Project to begin the analysis. The Pattern Navigation page appears. From the Pattern Navigation page, you may view all the genes, or limit the genes to those that meet certain criteria for fold change, statistics, or a certain pattern of expression. Additional options from the Gene Navigation link allow genes to be located by name, chromosome, or accession number, and options from the Gene Function link allow them to be located by ontology or KEGG pathway. Statistics can also be applied to limit the results. 23. Locate the Search by Threshold section and set the threshold choices. Choose 1.5 for the Threshold, ANOVA for the statistics, and Benjamini and Hochberg to correct for the false discovery rate. Click the Exclude Control Probes checkbox, then click the Search button. Clicking Show All Genes gives 45,101 results. Returning to this page and making the choices listed here cuts the number of genes to 921. The results page appears after the threshold filtering is complete. At this point, you can either save these results and return or continue the analysis. There are a variety of paths we can follow from this point, as shown in Figure 7.14.12. We can view the ontology or KEGG reports, as discussed in Basic Protocol 1. We can also use clustering to group related genes, or we can change the p cutoff value to limit the number of genes even further. Using clustering to identify patterns of differential gene expression 24. Choose PCA from the Cluster options. The PCA option performs a type of clustering known as Principal Component Analysis (PCA). PCA allows you to evaluate the similarities between samples by identifying the directions where variation is maximal. The idea behind PCA is that much of the variation in a data set can be explained by a small number of variables. Analyzing Gene Expression Data Using GeneSifter In Figure 7.14.12, we can see that principal component analysis breaks our conditions up into three groups. One group contains all of the LRD-5001 samples, one group contains the AIN-76A samples, and another group contains the AIN-76A sample that was treated with 100 ppb arsenic. These results tell us that the greatest difference between the groups resulted from the food. 7.14.24 Supplement 27 Current Protocols in Bioinformatics A 18 16 14 12 10 8 6 4 2 b 0 10 10 er er w at w at 00 D -5 LR LR D -5 00 1, 1, 00 LR D -5 pp b pp 0 w at 1, er 6A ,w at AI N -7 10 er 6A ,w at er 10 b pp 0 er AI N -7 A, w at AI N -7 6 B 0 0 18 16 14 12 10 8 6 4 2 0 AIN-76A -1- AIN-76A -5- AIN-76A -7- AIN-76A -8- AIN-76A -11- Figure 7.14.11 Box plots from a multiple-condition experiment. (A) Box plots from the six conditions that were compared in Basic Protocol 2. Each plot represents the averaged data from the four to five replicates from each treatment. (B) Box plots from biological replicates. Replicates from the AIN-76, 0 lead samples are shown. 25. Return to the results page and choose Samples from the Cluster options. These results also show us that the groups of samples are divided by the kinds of food they received. The mice that ate the LRD-5001 show patterns of gene expression more similar to each other than to the patterns from the mice that ate AIN-76A. We also see that the AIN-76A samples that had 100 ppb of arsenic were more different from the AIN-76A samples without arsenic than the LRD-5001 samples were from each other. Analyzing Expression Patterns 7.14.25 Current Protocols in Bioinformatics Supplement 27 clustered by sample clustered by genes PCA summary ontology KEGG Figure 7.14.12 Analyzing the results from comparing multiple samples. 26. Return to the results page and select Genes from the Cluster options. Clustering by genes produces an image consistent with our earlier results. On the right half, where the mice were fed LRD-5001, the three conditions show similar patterns of expression. The samples in the left half are also similar to each other, although it appears that some genes have changed in the sample with 100 ppb arsenic. If we look more closely at the genes that appear to be up-regulated in the LRD-5001 mice, we can see that many of the genes belong to the cytochrome P450 family. Examining differential gene expression in a specific gene family The user may decide to look further at the cytochrome P450s that were induced by LRD-5001 to see if patterns of expression can be discerned. 27. Click Pattern Navigation, located on the right top corner of the page. Analyzing Gene Expression Data Using GeneSifter 28. Locate the Project Analysis section in the bottom half of the page and click Gene Navigation. 7.14.26 Supplement 27 Current Protocols in Bioinformatics enter part of a gene name cluster by gene Figure 7.14.13 a. b. c. d. Gene-specific navigation. Enter the gene symbol in the Name textbox as shown in Figure 7.14.13. Choose Match Any Word from the Option pull-down menu. Choose ANOVA from the Statistics menu. Click the Search button. A page will appear when the filtering is complete. It will indicate that 20 genes matched this query. At this point, clustering with the Gene option lets us see which of the cytochrome P450 genes are up-regulated in the presence of LRD-5001. To understand this phenomenon further, we could use the ontology reports and KEGG pathways to learn about the specific roles that these cytochrome P450s play in metabolism and why they might be up-regulated when mice are fed LRD-5001. We could also use a 2-way ANOVA. COMPARE GENE EXPRESSION FROM NEXT-GENERATION DNA SEQUENCING DATA OBTAINED FROM MULTIPLE CONDITIONS ALTERNATE PROTOCOL 2 This protocol discusses a general method for analyzing samples from Next Generation DNA sequencing experiments that represent different conditions. In this example, we will compare replicate samples (n = 3) from three different tissues: brain, liver, and muscle. We will also discuss using partitioning to cluster data by the pattern of gene expression. Necessary Resources Software GeneSifter Analysis Edition (GSAE): a trial account must be established in order to upload data files to GSAE; a license for the GeneSifter Analysis Edition may be obtained from Geospiza, Inc. (http://www.geospiza.com) Analyzing Expression Patterns 7.14.27 Current Protocols in Bioinformatics Supplement 27 GSAE is accessed over the Web, therefore, Internet access is required along with an up-to-date Web browser, such as Mozilla Firefox, MS Internet Explorer, or Apple Safari Files Data files may be uploaded from a variety of sequencing instruments. Illumina GA analyzer data are text files, containing FASTA-formatted sequences. Data from the ABI SOLiD instrument are uploaded as csfasta files. The example data used in this procedure were generated by the Illumina GA Analyzer and obtained from the SRA database at the NCBI (Accession code SRA001030). The data files are obtained as follows. The accession number SRA001030 is entered in the data set search box at the NCBI Short Read Archive (http://www.ncbi.nih.gov/sra), and the Go button is clicked. The files are downloaded for each tissue type by clicking “Download data” for this experiment link. After downloading the data files, the text files containing the fasta sequences are uploaded to GSAE and processed as described in the instructions. 1. Log in to GeneSifter Analysis Edition (GSAE; http://login.genesifter.net). Uploading data 2. Locate the Import Data heading in the Control Panel and click Upload Tools. 3. Click the Next Gen File Upload button to begin uploading Next Gen data. 4. Enter a name for a folder. Folders are used to organize Next Gen data sets. 5. Click the Next button. 6. Two windows will appear for managing the upload process. Use the controls in the left window to locate your data files. Once you have found your data files, select them with your mouse and click the blue arrowhead to move those files into the Transfer Queue. 7. Once the files you wish to transfer are in the Transfer Queue, highlight those files and click the blue arrow beneath the Transfer Queue window to begin transferring data. Transferring data will take a variable amount of time depending on your network, the volume of network traffic, and the amount of data you are transferring. Illumina GA data sets are approximately 250 MB and take at least 10 min to transfer. Aligning Next Gen data to reference data Once the data have been uploaded to GSAE, the expression levels for each gene are measured by aligning the read sequences from the data set to a reference data source and counting the number of reads that map to each transcript. 8. Access uploaded Next Gen data sets by clicking Next Gen in the Inventories section of the control panel. 9. Use the checkboxes to select the data sets then click the Analyze button on the bottom right side of the table. Analyzing Gene Expression Data Using GeneSifter 10. A new page will appear where you can choose analysis settings from pull-down menus. These settings include the File Type, Analysis Type, Reference Species, and Reference Type. Choose the appropriate Analysis Type, Reference Species, and Reference Type. 7.14.28 Supplement 27 Current Protocols in Bioinformatics a. File Type: The file type is determined by the instrument that was used to collect the data. Since our read data were generated by an Illumina Genome Analyzer, choose Genome Analyzer. b. Analysis Type: The Analysis Type is determined by the kind of data that were uploaded and the kind of experiment that was performed. This setting also allows you to choose which algorithm to use for the alignment. Choose RNA-Seq (BWA, 2 MM). This setting uses the Burroughs Wheeler algorithm (Li and Durbin, 2009) to align the reads with a tolerance setting of 2 mismatches. c. Reference Species: The Reference Species is determined by the source of your data. Since our data came from mouse, choose “Mus musculus.” d. Reference Type: The choices for Reference Type are made available in the Reference Type menu after you have selected the analysis type and reference species. The Reference Type refers to the reference data that will be used in the alignment. Since we are measuring gene expression, pick “mRNA” as the reference type. This reference data set corresponds to the current build for mouse RefSeq RNA. 11. Click the checkbox for “Create Experiment(s) upon completion.” This selection organizes your data as an experiment, allowing you to compare expression between samples after the analysis step is complete. 12. Click the Analyze button to queue the Next Gen data set for analysis. The analysis step may take a few hours depending on the size of your data file and the number of samples waiting to be processed. When the analysis has finished, the information on the right side of the table, in the Analysis State column, will change to Complete. When the analysis step is complete, you will be able to view different types of information about your samples. Viewing the Next Gen alignment results 13. Click the file name to get to the analysis details page for your file, then click the Job ID to get the information from the analysis. The kinds of analysis results obtained depend on the alignment algorithms. The results from processing data from the AB SOLiD instrument are described in Alternate Protocol 1. For Illumina data, processed with the BWA, we obtain the following kinds of results: gene lists (text and html), a base composition plot, a list of genes formatted for GSAE, a transcript coverage plot, and an analysis log (Fig. 7.14.14). a. Gene lists (text and html): The gene lists show the number of reads that map to each transcript, and the number mapping per transcript, normalized per million. The html version of the gene list includes a graph showing where the reads map, which is linked to a more detailed map with each base position. Links are provided to the NCBI RefSeq record. b. Base composition plot: This graph shows the numbers of each base at each position and can be helpful for quality control. If sequencing DNA, we would expect the ratios to be fairly similar. If sequencing single-stranded RNA, we would expect to see more differences. c. Transcript coverage plot: The transcript coverage plot shows the number of reads that map to different numbers of transcripts. For example, in each case, you can see there are a large number of transcripts that only have one mapping read. Analyzing Expression Patterns 7.14.29 Current Protocols in Bioinformatics Supplement 27 Figure 7.14.14 Illumina data. Setting up a project 14. To compare multiple samples, begin by setting up a project. Find the Create New section in the Control Panel and click Project. 15. Give the project a title and add a description. Use “Mouse tissues” for this project. 16. Use the checkboxes to select the arrays that contain your data. These names correspond to the Array/Gene Set names that you assigned to the data sets during the upload process. If you checked the correct box, you will see the sample names appear in the Common Conditions box. The conditions that appear should match your experimental treatments. 17. Click the Continue button. 18. Assign a name to this group. 19. Select a normalization method. Choose None for this example. Analyzing Gene Expression Data Using GeneSifter 20. Use the “Data transformation” menu to select a method for data transformation. Data transformation options are “no transformation,” “log transformation,” or “already transformed.” Log transformations smooth out the data and produce a more Gaussian distribution. For this example, choose “Log transform data” from the menu. 7.14.30 Supplement 27 Current Protocols in Bioinformatics 21. In this next step, you will set the condition order. The first group selected acts as a reference or control group. Changes in gene expression in the other groups are all measured relative to first group that is chosen. a. Decide which group is group 1 and enter the name of that group in the Group Name box. To do this, select that group in the Conditions box on the left side, and use the arrow key to move it to the right-hand box. b. Select the other conditions that you wish to analyze and use the arrow key to move those to the left condition box as well. c. Click the Create Group button. A new page will appear with a list of all the groups and samples. 22. Select the samples for each condition. We will use all the samples, so click Select All Experiments, then click the Create Group button. The processing window will appear while the data are being processed. 23. Once a project has been created, you may analyze the project or create a new project or new group. These steps can also be completed at a later time. Comparing samples 24. Locate the Analysis section in the Control Panel, select Projects, and find your project in the list. Once you have found your project in the list, you may wish to select the project name to view some of the project details. You may also wish to view the box plots for these data as discussed in Basic Protocol 2. Identifying differential expression 25. To begin the analysis, select Projects from the Analysis section, locate your project in the list, and either select the spyglass or click the name or your project and then click on Analyze this Project. The Project summary page appears. From this page, we can choose to view all the genes or apply filters to locate specific genes by name, chromosome, function, or other distinguishing features. 26. Choose Show All Genes. It will take a few moments for the results to appear, especially with large data sets. The Project results appear. At this point, we see there are 40,009 results. We will need to apply a threshold and some statistics to select genes that are differentially expressed. The threshold filter allows us to choose the genes that show at least a minimum change in expression. Use a threshold of 1.5 for this project. 27. GSAE offers three types of statistical tests (described below) that can be applied at this point. At least three replicates per group are recommended. A balanced ANOVA can also be carried out when only one factor is varied (such as time or dose) and there are equal numbers of replicates for each sample. a. A standard 1-way ANOVA: This method is used when there is a normal distribution, the samples show equal variance, and the samples are independent. b. A 1-way ANOVA for samples with unequal variance: Like the standard 1-way ANOVA, this method assumes a normal distribution and independent, random samples. c. The Kruskal-Wallis test (nonparametric): This method assumes independent random samples but does not make assumptions about the distribution or variance. Choose the standard 1-way ANOVA for this analysis. Analyzing Expression Patterns 7.14.31 Current Protocols in Bioinformatics Supplement 27 28. After choosing a statistical method, click the Search button. At this point, there are still over 17,000 results. The advanced analysis methods in GSAE work best with gene numbers under 5000; consequently, we will use some additional filters to reduce the number of genes in the list. a. Apply a correction to limit false discoveries. The options are the Bonferroni, Holm, and Benjamini and Hochberg. Bonferroni is the most stringent, followed by Holm, with Benjamini and Hochberg allowing more false positives in order to minimize false negatives. Multiple testing corrections are discussed in detail in Basic Protocol 1. Used Benajmini and Hochberg in this example. b. Apply a p Cutoff. This sets a threshold for the minimum p value. Set the p-value cutoff at 0.01. c. Set the quality. For NGS data, the quality corresponds to the number of reads per million reads. Set the quality level at 100 to view highly expressed genes that differ between these three tissues. A quality value of 100 for NGS data corresponds to 100 reads per million sampled. 29. Click the Search button. 30. Now, we have limited the number of genes to 3293. At this point, it is helpful to save the results so that we can easily return to this point. To do this, click Save and enter a name and description for this subset of our project. When saving your project, it is helpful to enter information about the data transformations or statistical tests that were used during the analysis. For example, if your data were log transformed, or statistical tests or corrections were applied, it helps to enter this information in the description field. 31. A page will appear asking if you wish to continue the analysis or analyze the newly created project. Select “Analyze newly created project” and select Show All Genes from the Project Summary page. Visualizing the results Now, we can begin use some of the other analysis features in GSAE. The ontology and KEGG reports were discussed earlier in Basic Protocol 1, and some of the clustering options such as PCA and clustering by samples or genes were described earlier in this protocol. We will use clustering by genes here as well, in order to gain insights into the possible numbers of genes with related expression patterns. In this case, clustering by genes suggests that there may be three to four different expression patterns. Partition clustering Two of the advanced clustering methods provided in GeneSifter are PAM (Partitioning Around Medoids) and CLARA (Clustering Around Large Applications). Both of these options are variations of K-means clustering. K-means clustering is used to break a set of objects in this case, genes, into set of k groups. The clusters are formed by locating samples at the medoids (median values) to act as the seeds and clustering the other genes around the medoids. Analyzing Gene Expression Data Using GeneSifter In order to use the advanced clustering methods such as PAM or CLARA, filters must be applied in order to limit the number of the genes to below 5000. Two ways to limit the gene number are to set a lower p value as a cutoff and to raise the threshold. These filters can be used separately or in combination. 7.14.32 Supplement 27 Current Protocols in Bioinformatics To use the advanced clustering methods 32. Choose Pattern Navigation from the analysis path. 33. Choose Cluster. 34. Choose a method for clustering and set the options (as described below). The two options for advanced cluster analysis are PAM (Partitioning Around Medoids) and CLARA (Clustering Around Large Applications). The difference between the two methods is that PAM will try to group the samples into the number of clusters that you assign, while CLARA will try to find the optimum number of groups. PAM is recommended for data sets smaller than 3500 genes, while CLARA is more suited to larger data sets. PAM is also more robust; it tries all possible combinations of genes for k and picks the best clusters. CLARA does a sampling (100) and picks the best from that sample. a. Clusters: The number chosen here determines the number of gene groups. Often people try different values to see which gives the best results. b. Row Center: The values in this set, Row Mean, None, or Control are used to determine the centers of each row. c. Distance: The Distance choices are Euclidean, which corresponds to a straight line distance, Manhattan, which is a sum of linear distances, and Correlation. As a starting point for this example, choose PAM with 4 clusters based on our Gene cluster pattern, a Euclidean distance, and the Row Center at the Row Mean. 35. Click the Search button to begin. Silhouettes When the clustering process is complete, a page appears with multiple graphs, one for each cluster group. At the top of the page and under each graph are values called “silhouettes.” Silhouette widths are scores that indicate how well the expression of the genes within a cluster matches that graph. Values between 0.26 and 0.50 indicate a weak structure, between 0.50 and 0.70 a reasonable structure, and above 0.70 a strong structure. The mean silhouette value for all the silhouettes appears at the top of the page, with the individual values appearing below each graph along with the number of genes that show that pattern (Kaufman and Rousseeuw, 1990). The graphs showing the average expression pattern within each cluster and the silhouette values for our clusters are shown in Figure 7.14.15. When a graph in GSAE is clicked, the heat map containing the genes represented by the graph will appear. The first graph shows a pattern that seems a bit different from the results we might expect. Instead of showing the brain samples with a higher level of expression and liver and muscle lower, our first 20 liver and muscle samples appear instead to up-regulated. This result is puzzling until we look more closely at the results and see that the first silhouette contains 1920 genes, and that the variations in expression levels are small. It is likely that looking at more genes would show us that they do follow the pattern of expression seen in the graph. The other three graphs, with 397, 717, and 277 genes, respectively, match the results that we see in their respective heat maps. These groups also make biological sense. If we look at the genes and read about their function in the ontology and KEGG reports, we can see that, as expected, brain genes are expressed in brain, liver genes in liver, muscle genes in muscle, and some genes in two or more of tissues examined. It should be noted that clustering is not a definitive analytical tool. Clustering is used to try and group genes by the expression patterns that we see, and we will often try multiple values for k and different ways of making the clusters. Although the silhouette scores are helpful for evaluating the strength of the group, ultimately, we want to see if the cluster makes biological sense, with genes in a common pathway showing a pattern of coordinate control. Current Protocols in Bioinformatics Analyzing Expression Patterns 7.14.33 Supplement 27 Figure 7.14.15 Partitioning and silhouette data from a Next Gen experiment. LITERATURE CITED Barrett, T., Troup, D.B., Wilhite, S.E., Ledoux, P., Rudnev, D., Evangelista, C., Kim, I.F., Soboleva, A., Tomashevsky, M., Marshall, K.A., Phillippy, K.H., Sherman, P.M., Muertter, R.N., and Edgar, R. 2009. NCBI GEO: Archive for high-throughput functional genomic data. Nucleic Acids Res. 37:D885-D890. Kaufman, L. and Rousseeuw, P. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Statistics. John Wiley & Sons, Inc., New York. Kozul, C.D., Nomikos, A.P., Hampton, T.H., Warnke, L.A., Gosse, J.A., Davey, J.C., Thorpe, J.E., Jackson, B.P., Ihnat, M.A., and Hamilton, J.W. 2008. Laboratory diet profoundly alters gene expression and confounds genomic analysis in mouse liver and lung. Chem. Biol. Interact. 173:129-140. Li, H. and Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics E-pub May 18. Marioni, J.C., Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. 2008. RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18:1509-1517. Analyzing Gene Expression Data Using GeneSifter Millenaar, F.F., Okyere, J., May, S.T., van Zanten, M., Voesenek, L.A., and Peeters, A.J. 2006. How to decide? Different methods of calculating gene expression from short oligonucleotide array data will give different results. BMC Bioinformatics 7:137. 7.14.34 Supplement 27 Current Protocols in Bioinformatics Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. 2008. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5:621-628. Tang, F., Barbacioru, C., Wang, Y., Nordman, E., Lee, C., Xu, N., Wang, X., Bodeau, J., Tuch, B.B., Siddiqui, A., Lao, K., and Surani, M.A. 2009. mRNA-Seq whole-transcriptome analysis of a single cell. Nat. Methods 5:377-382. Wang, Z., Gerstein, M., and Snyder, M. 2009. RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 10:57-63. Wheeler, D.L., Barrett, T., Benson, D.A., Bryant, S.H., Canese, K., Chetvernin, V., Church, D.M., Dicuccio, M., Edgar, R., Federhen, S., Feolo, M., Geer, L.Y., Helmberg, W., Kapustin, Y., Khovayko, O., Landsman, D., Lipman, D.J., Madden, T.L., Maglott, D.R., Miller, V., Ostell, J., Pruitt, K.D., Schuler, G.D., Shumway, M., Sequeira, E., Sherry, S.T., Sirotkin, K., Souvorov, A., Starchenko, G., Tatusov, R.L., Tatusova, T.A., Wagner, L., and Yaschenko, E. 2008. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 36:D13-D21. INTERNET RESOURCES http://www.geospiza.com/Support/datacenter.shtml The microarray data center at Geospiza, Inc. A diverse set of microarray data sets and tutorials on using GSAE are available from this page. http://www.ncbi.nlm.nih.gov/geo/ The NCBI GEO (Gene Expression Omnibus) database. GEO is a convenient place to find both microarray and Next Gen transcriptome datasets. http://www.ebi.ac.uk/microarray/ The ArrayExpress database from the European Bioinformatics Institute. Both microarray and Next Gen transcriptome data can be obtained here. http://www.ncbi.nlm.nih.gov/sra/ The NCBI SRA (Short Read Archive) database. Some Next Gen transcriptome data can be obtained here. Analyzing Expression Patterns 7.14.35 Current Protocols in Bioinformatics Supplement 27