Download CRAN GMD: User's Guide (0.3.3)
Transcript
CRAN GMD: User’s Guide (0.3.3) Generic histogram construction, generic distance measure, cluster analysis with evaluation and visualization Xiaobei Zhao∗1 and Albin Sandelin†1 1 Bioinformatics Centre, University of Copenhagen Modified: 2014-08-26 Compiled: 2014-8-27 You may find the latest version of GMD and this documentation at, http://CRAN.R-project.org/package=GMD Keywords: histogram, distance, metric, non-parametric, cluster analysis, hierarchical clustering, sum-ofsquares, heatmap.3 Abstract The purpose of this GMD vignette is to show how to get started with the R package GMD. GMD denotes Generalized Minimum Distance between distributions, which is a true metric that measures the distance between the shapes of any two discrete numerical distributions (e.g. histograms). The vignette includes a brief introduction, an example to illustrate the concepts and the implementation of GMD and case studies that were carried out using classical data sets (e.g. iris) and high-throughput sequencing data (e.g. ChIPseq) from biology experiments. The appendix on page 14 contains an overview of package functionality, and examples using primary functions in histogram construction (the ghist function), how to measure distance between distributions (the gdist function), cluster analysis with evaluation (the “elbow” method in the css function) and visualization (the heatmap.3 function). Contents 1 Introduction 2 2 Minimal Example: “Hello, GMD!” 3 ∗ Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, 450 West Dr, Chapel Hill, NC 27599, USA. [email protected] † Bioinformatics Centre, University of Copenhagen, Department of Biology and Biotech Research and Innovation Centre, Ole Maaløes Vej 5, DK-2200, Copenhagen, Denmark. [email protected] 1 CRAN GMD Zhao et al 3 An example to understand GMD 3.1 3.2 4 Histogram: construction and visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.1.1 Load library and simulate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.1.2 Construct histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3.1.3 Save histograms as multiple-histogram (‘mhist’) object . . . . . . . . . . . . . . . . . . . 4 3.1.4 Visualize an ‘mhist’ object . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 Histogram: distance measure and alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.2.1 Measure the pairwise distance between two histograms by GMD . . . . . . . . . . . . . 7 3.2.2 Show alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 4 Case study 9 4.1 CAGE: measuring the dissimilarities among TSSDs . . . . . . . . . . . . . . . . . . . . . . . . . 9 4.2 ChIP-seq: measuring the similarities among histone modification patterns . . . . . . . . . . . . 12 A Functionality 14 A.1 An overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 A.2 ghist: Generic construction and visualization of histograms . . . . . . . . . . . . . . . . . . . . 15 A.2.1 Examples using simulated data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 A.2.2 Examples using iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 A.2.3 Examples using nottem data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 A.3 gdist: Generic construction and visualization of distances . . . . . . . . . . . . . . . . . . . . . 21 A.4 css: Clustering Sum-of-Squares and the “elbow” plot . . . . . . . . . . . . . . . . . . . . . . . 23 A.5 heatmap.3: Visualization in cluster analysis, with evaluation . . . . . . . . . . . . . . . . . . . 25 A.5.1 Examples using mtcars data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 A.5.2 Examples using ruspini data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 B Data 1 30 B.1 GMD dataset overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 B.2 CAGE data: cage and cagel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 B.3 ChIP-seq data: chipseq_mES and chipseq_hCD4T . . . . . . . . . . . . . . . . . . . . . . . . . 32 Introduction Similar to the Earth Mover’s distance, Generalized Minimum distance of Distributions (GMD) (based on MDPA - Minimum Difference of Pair Assignment [3]) is a true metric of the similarity between the shapes of two histograms1 . Considering two normalized histograms A and B, GMD measures their similarity by counting the necessary “shifts” of elements between the bins that have to be performed to transform distribution A into distribution B. 1 In statistics (and many other fields), histogram refers to a graphical representation of category frequencies in the data. Here, we use this term in a more mathematical sense, defined as a function that counts categorical data or a result returned by such a function. 2 CRAN GMD Zhao et al The GMD package provides classes and methods for computing GMD in R [5]. The algorithm has been implemented in C to interface with R for efficient computation. The package also includes downstream cluster analysis in function css (A.4 on page 23) that use a pairwise distance matrix to make partitions given variant criteria, including the “elbow” rule as discussed in [7] or desired number of clusters. In addition, the function heatmap.3 (A.5 on page 25) integrates the visualization of the hierarchical clustering in dendrogram, the distance measure in heatmap and graphical representations of summary statistics of the resulting clusters or the overall partition. For more flexibility, the function heatmap.3 can also accept plug-in functions defined by end-users for custom summary statistics. The motivation to write this package was born with the project [7] on characterizing Transcription Start Site (TSS) landscapes using high-throughput sequencing data, where a non-parametric distance measure was developed to assess the similarity among distributions of high-throughput sequencing reads from biological experiments. However, it is possible to use the method for any empirical distributions of categorical data. The package is available on CRAN. The source code is available at http://CRAN.R-project.org/package=GMD under GPL license. 2 Minimal Example: “Hello, GMD!” hello-GMD.R 1 2 ##' Check GMD's sanity and start up ##' @name hello-GMD 3 4 5 ## GMD at CRAN, for source code download and installation ## http://cran.r-project.org/web/packages/GMD/index.html 6 7 8 ## load GMD library(GMD) 9 10 11 12 ## version of GMD and description packageVersion("GMD") packageDescription("GMD") 13 14 15 ## view GMD vignette vignette("GMD-vignette",package="GMD") 16 17 18 ## list the available data sets in GMD data(package="GMD") 19 20 21 ## list all the objects in the GMD ls("package:GMD") 22 23 24 ## help info on GMD help(package="GMD") 25 26 27 ## run a demo demo("GMD-demo") 28 29 30 ## cite GMD in publications citation(package="GMD") 31 hello-GMD.R (Source Code 1) is a minimal example to load and check of that your GMD installation works. It also includes code for viewing the package information and this “vignette”, checking data sets provided by GMD, starting a demo and listing the citation of GMD. 3 CRAN GMD 3 Zhao et al An example to understand GMD This example, based on simulated data, is designed to illustrate the concepts and the implementation of GMD by stepping through the computations in detail. 3.1 Histogram: construction and visualization 3.1.1 > > > > > > require("GMD") # load library ## create two normally-distributed samples ## with unequal means and unequal variances set.seed(2012) x1 <- rnorm(1000,mean=-5, sd=10) x2 <- rnorm(1000,mean=10, sd=5) 3.1.2 > > > > > > > Load library and simulate data Construct histograms ## create common bins n <- 20 # desired number of bins breaks <- gbreaks(c(x1,x2),n) # bin boundaries ## make two histograms v1 <- ghist(x1,breaks=breaks,digits=0) v2 <- ghist(x2,breaks=breaks,digits=0) 3.1.3 Save histograms as multiple-histogram (‘mhist’) object > x <- list(v1,v2) > mhist.obj <- as.mhist(x) 4 CRAN GMD 3.1.4 Zhao et al Visualize an ‘mhist’ object > ## plot histograms side-by-side > plot(mhist.obj,mar=c(1.5,1,1,0),main="Histograms of simulated normal distributions") Histograms of simulated normal distributions 1 2 250 Count 200 150 100 50 0 −21 −5 11 Bin 5 27 CRAN GMD Zhao et al > ## plot histograms as subplots, with corresponding bins aligned > plot(mhist.obj,beside=FALSE,mar=c(1.5,1,1,0), + main="Histograms of simulated normal distributions") Histograms of simulated normal distributions 1 250 200 150 100 50 Count 0 −21 −5 11 27 2 250 200 150 100 50 0 −21 −5 11 Bin 6 27 CRAN GMD 3.2 Zhao et al Histogram: distance measure and alignment Here we measure the GMD distance between shapes of two histograms with option sliding on. 3.2.1 Measure the pairwise distance between two histograms by GMD > gmdp.obj <- gmdp(v1,v2,sliding=TRUE) > print(gmdp.obj) # print a brief version by default [1] 1.334 > print(gmdp.obj,mode="detailed") # print a detailed version GM-Distance: 1.334 Sliding: TRUE Number of hits: 1 Gap: v1 Hit1 v2 5 0 Resolution: 1 > print(gmdp.obj,mode="full") # print a full version Distribution of v1: 1 2 10 19 28 46 59 101 109 119 133 108 90 77 42 29 17 6 3 1 (After normalization) 0.001 0.002 0.01 0.019 0.028 0.046 0.059 0.101 0.109 0.119 0.133 0.108 0.09 0.077 0.042 0.029 0.017 0.006 0.003 Distribution of v2: 0 0 0 0 0 0 0 0 0 1 5 30 94 206 258 199 139 58 8 2 (After normalization) 0 0 0 0 0 0 0 0 0 0.001 0.005 0.03 0.094 0.206 0.258 0.199 0.139 0.058 0.008 0.002 GM-Distance: 1.334 Sliding: TRUE Number of hits: 1 Gap: v1 Hit1 v2 5 0 Resolution: 1 7 CRAN GMD 3.2.2 Zhao et al Show alignment Now, let’s have a look at the alignment by GMD, with a distance 1.334 and a “shift” of 5 in the 1st distribution. It is important to note that the specific features (the values in this case) of the original bins in the histograms are ignored with sliding on. To keep original bin-to-bin correspondence, please set sliding to FALSE (see examples in section 4.2 on page 12). > plot(gmdp.obj,beside=FALSE) Optimal alignment between distributions (with sliding) GMD=1.334, gap=c(5,0) v1 0.25 0.20 0.15 0.10 0.05 0.00 Fraction 5 10 15 20 25 v2 0.25 0.20 0.15 0.10 0.05 0.00 5 10 15 20 Position 8 25 CRAN GMD 4 4.1 Zhao et al Case study CAGE: measuring the dissimilarities among TSSDs Studies have demonstrated that the spatial distributions of read-based sequencing data from different platforms often indicate functional properties and expression profiles (reviewed in [6] and [8]). Analyzing the distributions of DNA reads is therefore often meaningful. To do this systematically, a measure of similarity between distributions is necessary. Such measures should ideally be true metrics, have few parameters as possible, be computationally efficient and also make biological sense to end-users. Case studies were made in section 4.1 and 4.2 to demonstrate the applications of GMD using distributions of CAGE and ChIP-seq reads. In this section we demonstrate how GMD is applied to measure the dissimilarities among TSSDs, histograms of transcription start site (TSS) that are made of CAGE tags, with option sliding on. The spatial properties of TSSDs vary widely between promoters and have biological implications in both regulation and function. The raw data were produced by CAGE and downloaded from FANTOM3 ([2]) and CAGE sequence reads were preprocessed as did in [7]. The following codes case-cage.R (Source Code 2) are sufficient to perform both pairwise GMD calculation by function gmdp and to construct a GMD distance matrix by function gmdm. A handful of options are available for control and flexibility, particularly, the option sliding is enabled by default to allow partial alignment. case-cage.R 1 2 require("GMD") # load library data(cage) # load data 3 4 5 6 7 ## measure pairwise distance x <- gmdp(cage[["Pfkfb3 (T02R00AEC2D8)"]],cage[["Csf1 (T03R0672174D)"]]) print(x) # print a brief version by default print(x, mode="full") # print a full version by default 8 9 10 ## show alignment plot(x,labels=c("Pfkfb3","Csf1"),beside=FALSE) 11 12 13 14 15 ## show another alignment plot(gmdp(cage[["Hig1 (T09R0743763C)"]],cage[["Cd72 (T04R028B8BC9)"]]), labels=c("Hig1 (T09R0743763C)","Cd72 (T04R028B8BC9)"), beside=FALSE) 16 17 18 19 20 ## construct a distance matrix and visualize it short.labels <- gsub("(.+) \\(.+","\\1",names(cage)) # get short labels x <- gmdm(cage[1:6],labels=short.labels[1:6]) plot(x) 9 CRAN GMD Zhao et al Optimal alignment between distributions (with sliding) GMD=7.881, gap=c(0,100) 0.8 Pfkfb3 (T02R00AEC2D8) 0.6 0.4 0.2 0.0 Fraction 50 100 150 200 250 0.8 Csf1 (T03R0672174D) 0.6 0.4 0.2 0.0 50 100 150 200 250 Position Figure 1: Graphical output of “case-cage.R”. Optimal alignment between distributions (with sliding) GMD=3.992, gap=c(0,24) Hig1 (T09R0743763C) 0.3 0.2 0.1 0.0 Fraction 20 40 60 80 100 120 Cd72 (T04R028B8BC9) 0.3 0.2 0.1 0.0 20 40 60 80 Position Figure 2: Graphical output of “case-cage.R”. 10 100 120 CRAN GMD Zhao et al Optimal alignments among distributions (with sliding) NA 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5 5 5 0.8 GMD=1.310 Gap=(4,0) Glul 5 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 5 Cyp4a14 GMD=2.284 Gap=(0,3) 0.0 2 GMD=0.833 Gap=(0,0) 0.0 5 0.8 GMD=0.589 Gap=(0,3) 0.0 50 0.8 200 0.8 GMD=3.802 Gap=(90,0) 0.0 50 Fraction GMD=2.754 Gap=(71,0) 0.0 50 0.8 0.8 0.0 50 0.0 20 0.8 0.0 50 0.8 200 0.8 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 5 GMD=2.732 Gap=(3,0) 0.0 5 GMD=2.551 Gap=(0,0) 0.0 50 200 GMD=5.698 Gap=(93,0) 0.0 50 GMD=4.821 Gap=(74,0) 0.0 50 GMD=5.415 Gap=(59,0) 0.0 50 0.0 50 0.0 20 0.0 50 200 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 5 D230039L 06Rik GMD=0.658 Gap=(0,3) 0.0 50 200 GMD=4.339 Gap=(90,0) 0.0 50 GMD=3.354 Gap=(71,0) 0.0 50 GMD=4.432 Gap=(54,0) 0.0 50 0.0 50 0.0 20 0.0 50 200 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.0 0.0 50 Gas5 200 GMD=3.769 Gap=(93,0) 0.0 50 GMD=2.760 Gap=(74,0) 0.0 50 GMD=4.447 Gap=(59,0) 0.0 50 0.0 50 0.0 20 0.0 50 200 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.0 200 Tpt1 0.0 50 GMD=1.480 Gap=(0,19) 0.0 50 GMD=2.315 Gap=(0,34) 0.0 50 0.0 50 0.0 20 0.0 50 200 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.0 Pcna GMD=2.325 Gap=(0,15) 0.2 0.0 50 0.2 0.0 50 0.2 0.0 50 200 0.2 0.0 50 200 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 D0H4S114 0.2 0.0 20 0.2 0.0 50 0.2 0.0 50 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 GMD=5.936 Gap=(39,0) GMD=4.677 Gap=(42,0) GMD=6.155 Gap=(37,0) GMD=6.161 Gap=(42,0) GMD=8.677 Gap=(0,51) GMD=7.739 Gap=(0,32) GMD=6.507 Gap=(0,17) GMD=8.409 Gap=(0,26) 0.2 0.0 50 0.2 0.0 50 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.0 Hsd11b1 0.2 0.0 50 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.2 0.0 50 Csf1 0.0 50 200 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 0.0 50 0.0 50 200 0.8 GMD=3.747 Gap=(34,0) B2m 0.0 50 0.8 0.6 0.6 0.6 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.0 50 200 0.0 20 GMD=3.538 Gap=(97,0) GMD=5.794 Gap=(100,0) GMD=4.162 Gap=(97,0) GMD=3.802 Gap=(100,0) GMD=1.179 Gap=(7,0) GMD=1.402 Gap=(26,0) GMD=2.462 Gap=(41,0) GMD=1.968 Gap=(32,0) GMD=8.670 Gap=(58,0) GMD=7.881 Gap=(100,0) GMD=10.728 Gap=(66,0) Pfkfb3 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0.2 0.0 50 GMD=18.387 Gap=(31,0) GMD=17.706 Gap=(32,0) GMD=15.601 Gap=(35,0) GMD=18.253 Gap=(32,0) GMD=18.130 Gap=(35,0) GMD=18.366 Gap=(0,50) GMD=17.692 Gap=(0,35) GMD=16.389 Gap=(0,16) GMD=16.574 Gap=(0,25) GMD=12.228 Gap=(0,5) GMD=14.311 Gap=(35,0) GMD=11.099 Gap=(1,0) GMD=18.183 Gap=(0,57) 200 Hig1 300 100 300 100 300 100 300 100 300 100 300 100 300 0.8 0.6 0.4 0.0 GMD=17.388 Gap=(35,0) 100 0.0 50 0.8 GMD=4.024 Gap=(96,0) 300 0.8 0.6 0.4 0.0 GMD=4.607 Gap=(100,0) 100 0.0 20 0.8 300 0.0 50 0.8 0.0 GMD=2.682 Gap=(0,6) 0.0 20 100 0.0 50 0.8 0.6 0.0 300 0.2 0.0 50 0.8 0.6 50 GMD=10.539 Gap=(0,32) 0.2 0.0 50 0.8 0.0 GMD=2.453 Gap=(0,40) 0.2 0.0 50 100 0.0 50 0.8 0.6 50 300 0.2 0.0 20 0.8 0.6 0.2 GMD=8.790 Gap=(0,23) 0.2 0.0 50 0.8 0.0 GMD=8.173 Gap=(0,66) 0.2 0.0 50 100 0.0 50 0.8 0.6 0.0 300 0.2 0.0 50 0.8 0.6 GMD=6.336 Gap=(37,0) GMD=9.839 Gap=(0,39) 0.2 0.0 50 0.8 0.6 GMD=5.300 Gap=(40,0) GMD=6.236 Gap=(0,57) 0.2 0.0 50 0.8 Gsto1 GMD=10.799 Gap=(0,57) 200 0.8 0.6 0.0 100 0.0 50 0.8 0.6 0.2 300 0.2 0.0 50 0.8 0.6 GMD=2.147 Gap=(9,0) GMD=8.016 Gap=(34,0) 0.2 0.0 50 0.8 GMD=2.277 Gap=(0,6) GMD=7.086 Gap=(0,74) 0.2 0.0 50 100 0.0 10 0.8 0.6 GMD=2.037 Gap=(0,25) GMD=7.863 Gap=(31,0) 0.0 50 0.8 GMD=4.784 Gap=(67,0) GMD=8.213 Gap=(0,93) 0.0 50 300 0.0 10 0.8 0.6 GMD=4.872 Gap=(63,0) GMD=6.336 Gap=(36,0) 0.0 20 140 0.8 GMD=5.915 Gap=(68,0) GMD=4.734 Gap=(0,0) 0.0 50 100 0.0 10 0.8 0.6 0.4 GMD=4.426 Gap=(63,0) GMD=7.811 Gap=(31,0) 0.0 50 0.8 0.6 GMD=5.015 Gap=(63,0) GMD=4.247 Gap=(0,3) 0.0 50 300 0.0 10 0.8 0.6 0.4 GMD=5.015 Gap=(66,0) GMD=8.086 Gap=(30,0) 0.0 20 140 0.8 0.6 20 GMD=4.532 Gap=(1,0) 0.0 50 100 0.8 0.6 0.4 Stxbp4 300 0.0 10 0.8 0.6 0.4 0.2 GMD=7.010 Gap=(34,0) 0.0 20 0.8 0.6 0.4 0.0 GMD=4.628 Gap=(0,3) 0.0 50 0.8 0.6 0.4 20 GMD=4.344 Gap=(54,0) 0.0 50 0.8 100 0.8 0.6 0.4 0.0 GMD=4.736 Gap=(0,4) 0.0 0.6 0.4 0.2 GMD=3.674 Gap=(0,0) 0.4 0.2 0.8 0.6 50 GMD=4.587 Gap=(54,0) 0.8 0.4 0.0 GMD=4.435 Gap=(57,0) 0.6 10 0.6 0.2 GMD=2.948 Gap=(70,0) 0.8 50 0.4 50 GMD=3.875 Gap=(74,0) 0.8 200 0.6 0.0 GMD=3.931 Gap=(89,0) 0.8 50 0.4 2 GMD=4.914 Gap=(93,0) 0.8 50 0.6 0.0 GMD=0.351 Gap=(0,4) 0.8 20 0.4 5 GMD=1.251 Gap=(0,0) 0.8 50 0.6 0.0 GMD=0.562 Gap=(0,1) 0.8 50 0.4 5 GMD=0.907 Gap=(3,0) 50 0.6 0.0 GMD=2.833 Gap=(0,4) 0.8 50 0.4 0.8 GMD=1.835 Gap=(0,0) 0.8 200 0.6 5 GMD=0.821 Gap=(0,1) 50 0.4 0.0 GMD=1.128 Gap=(3,0) 2 0.8 0.0 50 200 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0.0 0.0 20 0.8 0.6 GMD=17.387 Gap=(7,0) GMD=18.386 Gap=(3,0) GMD=17.705 Gap=(4,0) GMD=15.579 Gap=(7,0) GMD=18.252 Gap=(4,0) GMD=18.130 Gap=(7,0) GMD=18.835 Gap=(0,74) GMD=18.126 Gap=(0,58) GMD=17.125 Gap=(0,42) GMD=17.059 Gap=(0,51) GMD=12.993 Gap=(0,33) GMD=15.084 Gap=(5,0) GMD=12.237 Gap=(0,29) GMD=18.652 Gap=(0,81) GMD=3.992 Gap=(0,24) Cd72 0.4 0.2 0.0 GMD=48.292 Gap=(137,0) GMD=49.290 Gap=(133,0) GMD=48.610 Gap=(134,0) GMD=46.521 Gap=(137,0) GMD=49.157 Gap=(134,0) GMD=49.034 Gap=(137,0) GMD=45.360 Gap=(44,0) GMD=46.391 Gap=(63,0) GMD=44.807 Gap=(80,0) GMD=44.467 Gap=(71,0) Position Figure 3: Graphical output of “case-cage.R”. 11 GMD=43.082 Gap=(97,0) GMD=44.996 Gap=(137,0) GMD=41.294 Gap=(103,0) GMD=45.380 Gap=(37,0) GMD=32.443 Gap=(98,0) GMD=34.026 Gap=(121,0) Egln1 CRAN GMD 4.2 Zhao et al ChIP-seq: measuring the similarities among histone modification patterns In this section we demonstrate how GMD is applied to measure the dissimilarities between histone modifications represented by ChIP-seq reads. Distinctive patterns of chromatin modifications around the TSS are associated with transcription regulation and expression variation of genes. Comparing the chromatin modification profiles (originally produced by [1] and [4], and preprocessed by [7]), the sliding option is disabled for fixed alignments at the TSSs and the flanking regions. The GMD measure indicates how well profiles are co-related to each other. In addition, the downstream cluster analysis is visualized with function heatmap.3 that use GMD distance matrix to generate clustering dendrograms and make partitions given variant criteria, including the “elbow” rule (discussed in [7]) or desired number of clusters. case-chipseq.R 1 2 3 require("GMD") # load library data(chipseq_mES) # load data data(chipseq_hCD4T) # load data 4 5 6 ## pairwise distance and alignment based on GMD metric plot(gmdm(chipseq_mES,sliding=FALSE)) 7 8 9 10 ## clustering on spatial distributions of histone modifications x <- gmdm(chipseq_hCD4T,sliding=FALSE,resolution=10) heatmap.3(x,revC=TRUE) 11 12 13 14 15 16 17 ## Determine the number of clusters by "Elbow" criterion main <- "Heatmap of ChIP-seq data (human CD4+ T cells)" dist.obj <- gmdm2dist(x) css.multi.obj <- css.hclust(dist.obj,hclust(dist.obj)) elbow.obj <- elbow.batch(css.multi.obj,ev.thres=0.90,inc.thres=0.05) heatmap.3(dist.obj, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k) 18 19 20 21 ## more strict threshold elbow.obj <- elbow.batch(css.multi.obj,ev.thres=0.75,inc.thres=0.1) heatmap.3(dist.obj, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k) 22 23 24 25 ## side plots normalizeVector <- function(v){v/sum(v)} # a function to normalize a vector dev.new(width=12,height=8) 26 27 28 29 30 31 ## summary of row clusters expr1 <- list(quote(op <- par(mar = par("mar")*2)), quote(plot(mhist.summary(as.mhist(i.x)),if.plot.new=FALSE)), quote(par(op)) ) 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 ## summary of row clustering expr2 <- list(quote(tmp.clusters <- cutree(hclust(dist.row),k=kr)), quote(tmp.css <- css(dist.row,tmp.clusters)), quote(print(tmp.css)), quote(tmp.wev <- tmp.css$wss/tmp.css$tss), quote(names(tmp.wev) <- as.character(unique(tmp.clusters))), quote(tmp.wev <- tmp.wev[order(unique(tmp.clusters))]), quote(barplot(tmp.wev,main="Cluster Explained Variance", xlab="Cluster", ylab="EV",col="white",border="black", ylim=c(0,max(tmp.wev)*1.1),cex.main=1))) expr3 <- list(quote(op <- par(mar = par("mar")*2)), quote(plot.elbow(css.multi.obj,elbow.obj,if.plot.new=FALSE)), quote(par(op)) ) 47 48 49 50 51 heatmap.3(dist.obj, main=main, cex.main=1.25, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k, keysize=1,mapsize=4.5,row.data=lapply(chipseq_hCD4T,normalizeVector), plot.row.clusters=TRUE,plot.row.clusters.list=list(expr1), plot.row.clustering=TRUE,plot.row.clustering.list=list(expr2,expr3)) 52 12 CRAN GMD Zhao et al Optimal alignments among distributions (without sliding) H3K27me3 0.007 0.007 0.007 0.007 0.007 0.006 0.006 0.006 0.006 0.006 0.005 0.005 0.005 0.005 0.005 0.004 0.004 0.004 0.004 0.004 0.003 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.002 0.001 0.001 0.001 0.001 0.000 0.000 0.000 0.000 200 GMD=118.840 Gap=(0,0) 400 600 H3K36me3 800 1000 200 600 800 1000 200 600 800 1000 0.001 0.000 200 400 600 800 1000 0.007 0.007 0.007 0.006 0.006 0.006 0.006 0.005 0.005 0.005 0.005 0.004 0.004 0.004 0.004 0.003 0.003 0.003 0.003 0.002 0.002 0.002 0.002 0.001 0.001 0.001 0.000 0.000 0.000 GMD=82.910 Gap=(0,0) 400 600 800 1000 Fraction H3K4me1 200 GMD=140.037 Gap=(0,0) 400 600 800 1000 600 800 1000 0.007 0.006 0.006 0.006 0.005 0.005 0.005 0.004 0.004 0.004 0.003 0.003 0.003 0.002 0.002 0.002 0.001 0.001 0.000 0.000 GMD=77.365 Gap=(0,0) 600 800 1000 H3K4me2 400 600 800 1000 200 400 600 800 1000 200 400 600 800 1000 200 400 600 800 1000 200 400 600 800 1000 0.000 400 0.007 400 200 0.001 200 0.007 200 GMD=56.380 Gap=(0,0) 400 0.007 200 GMD=37.253 Gap=(0,0) 400 0.001 0.000 200 400 600 800 1000 0.007 0.007 0.006 0.006 0.005 0.005 0.004 0.004 0.003 0.003 0.002 0.002 0.001 0.001 0.000 0.000 200 400 600 800 1000 0.007 0.006 0.005 GMD=110.047 Gap=(0,0) GMD=190.394 Gap=(0,0) GMD=135.755 Gap=(0,0) GMD=58.471 Gap=(0,0) 0.004 H3K4me3 0.003 0.002 0.001 0.000 GMD=46.266 Gap=(0,0) GMD=108.302 Gap=(0,0) GMD=33.494 Gap=(0,0) GMD=99.835 Gap=(0,0) GMD=156.171 Gap=(0,0) H3K9me3 Position Figure 4: Graphical output of “case-chipseq.R”. 150 Heatmap of ChIP−seq data (human CD4+ T cells) Row Clusters 0 50 Count Color Key 0 5 Row Clustering 15 Value 1 0.77 1 116 264 412 560 708 856 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 5 0.0015 10 k 15 20 0.12 Cluster Explained Variance EV 0.08 0.004 0.000 Figure 5: Graphical output of “case-chipseq.R”. 13 1 116 264 412 560 708 856 ● ● 1 116 264 412 560 708 856 Row group 1 (n=16) ● ● ● Row group 3 (n=7) 0.0000 Elbow plot k=3 Explained Variance 0.0 0.4 0.8 0.0000 0.0015 Row group 2 (n=17) H2AK9ac CTCF H4K5ac H4K8ac H2BK20ac H2BK12ac H2AZ H3K4ac H3K36ac H4K91ac H2BK120ac H3K4me3 H3K18ac H3K9ac H3K27ac H2BK5ac H4K20me1 H3K79me2 H3K79me3 H3K36me3 H3K27me1 H3K79me1 H2BK5me1 H3K4me2 H3K9me1 H3K4me1 H3K23ac H4K12ac H4K16ac H4K20me3 H3K14ac H3K36me1 H4R3me2 H3R2me1 H3R2me2 H2AK5ac H3K27me2 H3K9me2 H3K9me3 H3K27me3 1 3 2 H2AK9ac CTCF H4K5ac H4K8ac H2BK20ac H2BK12ac H2AZ H3K4ac H3K36ac H4K91ac H2BK120ac H3K4me3 H3K18ac H3K9ac H3K27ac H2BK5ac H4K20me1 H3K79me2 H3K79me3 H3K36me3 H3K27me1 H3K79me1 H2BK5me1 H3K4me2 H3K9me1 H3K4me1 H3K23ac H4K12ac H4K16ac H4K20me3 H3K14ac H3K36me1 H4R3me2 H3R2me1 H3R2me2 H2AK5ac H3K27me2 H3K9me2 H3K9me3 H3K27me3 0.04 3 0.00 2 1 2 Cluster 3 CRAN GMD A A.1 Zhao et al Functionality An overview Table 1. Functions of the GMD R package Function Description ghist gdist css Generalized Histogram Computation and Visualization Generalized Distance Matrix Computation Computing Clustering Sum-of-Squares and evaluating the clustering by the “elbow” method Enhanced Heatmap Representation with Dendrogram and Partition Computation of GMD on a pair of histograms Computation of GMD Matrix on a set of histograms heatmap.3 gmdp gmdm 14 CRAN GMD A.2 A.2.1 Zhao et al ghist: Generic construction and visualization of histograms Examples using simulated data example-ghist.R (Source Code 4) is an example on how to construct a histogram object from raw data and make a visualization based on this. example-ghist.R 1 2 ## load library require("GMD") 3 4 5 6 7 8 ## create two normally-distributed samples ## with unequal means and unequal variances set.seed(2012) v1 <- rnorm(1000,mean=-5, sd=10) v2 <- rnorm(1000,mean=10, sd=5) 9 10 11 12 13 14 15 16 ## create common bins n <- 20 # desired number of bins breaks <- gbreaks(c(v1,v2),n) # bin boundaries x <list(ghist(v1,breaks=breaks,digits=0), ghist(v2,breaks=breaks,digits=0)) mhist.obj <- as.mhist(x) 17 18 19 ## plot histograms side-by-side plot(mhist.obj,mar=c(1.5,1,1,0),main="Histograms of simulated normal distributions") 20 22 23 24 ## plot histograms as subplots, ## with corresponding bins aligned plot(mhist.obj,beside=FALSE,mar=c(1.5,1,1,0), main="Histograms of simulated normal distributions") Histograms of simulated normal distributions 1 2 250 200 Count 21 150 100 50 0 −21 −5 Bin Figure 6: Graphical output of “example-ghist.R”. 15 11 27 CRAN GMD Zhao et al Histograms of simulated normal distributions 1 250 200 150 100 50 Count 0 −21 −5 11 27 2 250 200 150 100 50 0 −21 −5 Bin Figure 7: Graphical output of “example-ghist.R”. 16 11 27 CRAN GMD A.2.2 Zhao et al Examples using iris data case-iris.R (Source Code 5) is a study on how to obtain and visualize histograms, using Fisher’s iris data set. case-iris.R 1 2 ## load library require("GMD") 3 4 5 ## load data data(iris) 6 7 8 9 ## create common bins n <- 30 # the number of bins breaks <- gbreaks(iris[,"Sepal.Length"],n) # the boundary of bins 10 11 12 13 14 15 16 ## create a list of histograms Sepal.Length <list(setosa=ghist(iris[iris$Species=="setosa","Sepal.Length"],breaks=breaks), versicolor=ghist(iris[iris$Species=="versicolor","Sepal.Length"],breaks=breaks), virginica=ghist(iris[iris$Species=="virginica","Sepal.Length"],breaks=breaks) ) 17 18 19 ## convert to a `hist' object x <- as.mhist(Sepal.Length) 20 21 22 ## get bin-wise summary statistics summary(x) 23 24 25 26 ## visualize the histograms plot(x,beside=FALSE, main="Histogram of Sepal.Length of iris",xlab="Sepal.Length (cm)") 17 CRAN GMD Zhao et al Histogram of Sepal.Length of iris setosa 8 6 4 2 0 4.9 5.5 6.1 6.7 7.3 7.9 versicolor 8 Count 6 4 2 0 4.9 5.5 6.1 6.7 7.3 7.9 virginica 8 6 4 2 0 4.9 5.5 6.1 Sepal.Length (cm) Figure 8: Graphical output of “case-iris.R”. 18 6.7 7.3 7.9 CRAN GMD A.2.3 Zhao et al Examples using nottem data case-nottem.R (Source Code 6) is a study on how to draw histograms side-by-side and to compute and visualize a bin-wise summary plot, using air temperature data at Nottingham Castle. case-nottem.R 1 2 ## load library require("GMD") 3 4 5 ## load data data(nottem) 6 7 8 9 class(nottem) # a time-series (ts) object x <- ts2df(nottem) # convert ts to data.frame mhist1 <- as.mhist(x[1:3,]) 10 11 12 13 ## plot multiple discrete distributions side-by-side plot(mhist1,xlab="Month",ylab="Degree Fahrenheit", main="Air temperatures at Nottingham Castle") 14 15 16 17 18 ## make summary statistics for each bin mhist2 <- as.mhist(x) ms <- mhist.summary(mhist2) print(ms) 19 21 22 23 ## plot bin-wise summary statistics with ## confidence intervals over the bars plot(ms, main="Mean air temperatures at Nottingham Castle (1920-1939)", xlab="Month", ylab="Degree Fahrenheit") Air temperatures at Nottingham Castle 70 1920 1921 1922 65 Degree Fahrenheit 20 60 55 50 45 40 35 Jan Feb Mar Apr May Jun Jul Month Figure 9: Graphical output of “case-nottem.R”. 19 Aug Sep Oct Nov Dec CRAN GMD Zhao et al 40 30 20 0 10 Degree Fahrenheit 50 60 Mean air temperatures at Nottingham Castle (1920−1939) Jan Feb Mar Apr May Jun Jul Month Figure 10: Graphical output of “case-nottem.R”. 20 Aug Sep Oct Nov Dec CRAN GMD A.3 Zhao et al gdist: Generic construction and visualization of distances example-gdist.R (Source Code 7) is an example on how to measure distances using a user-defined metric, such as correlation distance and GMD. example-gdist.R 1 2 3 ## load library require("GMD") require(cluster) 4 5 6 7 ## compute distance using Euclidean metric (default) data(ruspini) x <- gdist(ruspini) 8 9 10 11 12 13 ## see a dendrogram result by hierarchical clustering dev.new(width=12, height=6) plot(hclust(x), main="Cluster Dendrogram of Ruspini data", xlab="Observations") 14 15 16 ## convert to a distance matrix m <- as.matrix(x) 17 18 19 20 ## convert from a distance matrix d <- as.dist(m) stopifnot(d == x) 21 Cluster Dendrogram of Ruspini data 7 Observations hclust (*, "complete") Figure 11: Graphical output of “example-gdist.R”. 21 16 18 19 17 14 15 20 11 12 13 1 5 6 4 8 9 10 2 3 44 45 41 42 43 46 47 48 54 50 52 53 49 51 58 59 60 57 55 56 61 64 68 65 67 69 70 71 63 62 66 73 74 72 75 28 27 28 29 30 23 24 21 22 31 25 26 40 36 39 37 38 33 34 32 35 27 150 26 100 25 Height 24 ## Use correlations between variables "as distance" data(USJudgeRatings) dd <- gdist(x=USJudgeRatings,method="correlation.of.variables") dev.new(width=12, height=6) plot(hclust(dd), main="Cluster Dendrogram of USJudgeRatings data", xlab="Variables") 50 23 0 22 CRAN GMD Zhao et al CONT 0.4 Variables hclust (*, "complete") Figure 12: Graphical output of “example-gdist.R”. 22 FAMI PREP WRIT ORAL RTEN DECI CFMG DILG PHYS DMNR INTG 0.0 0.2 Height 0.6 0.8 1.0 1.2 Cluster Dendrogram of USJudgeRatings data CRAN GMD A.4 Zhao et al css: Clustering Sum-of-Squares and the “elbow” plot: determining the number of clusters in a data set A good clustering yields clusters where the total within-cluster sum-of-squares (WSSs) is small (i.e. cluster cohesion, measuring how closely related are objects in a cluster) and the total between-cluster sum-of-squares (BSSs) is high (i.e. cluster separation, measuring how distinct or well-separated one cluster is from the other). example-css.R (Source Code 8) is an example on how to make correct choice of k using “elbow criterion”. A good k is selected according a) how much of the total variance in the whole data that the clusters can explain, and b) how large gain in explained variance we obtain by using these many clusters compared to one less or one more, the so-called “elbow” criterion. The optimal choice of k will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. More important, an ideal k should also be relevant in terms of what it reveals about the data, which typically cannot be measured by a metric but by a human expert. Here we present a way to measure such performance of a clustering model, using squared Euclidean distances. The evaluation is based on pairwise distance matrix and therefore more generic in a way that doesn’t involve computating the “centers” of the clusters in the raw data, which are often not available or hard to obtain. example-css.R 1 2 ## load library require("GMD") 3 4 5 6 7 8 9 10 11 12 ## simulate data around 12 points in Euclidean space pointv <- data.frame(x=c(1,2,2,4,4,5,5,6,7,8,9,9),y=c(1,2,8,2,4,4,5,9,9,8,1,9)) set.seed(2012) mydata <- c() for (i in 1:nrow(pointv)){ mydata <- rbind(mydata,cbind(rnorm(10,pointv[i,1],0.1),rnorm(10,pointv[i,2],0.1))) } mydata <- data.frame(mydata); colnames(mydata) <- c("x","y") plot(mydata,type="p",pch=21, main="Simulated data") 13 14 15 16 17 18 19 ## determine a "good" k using elbow dist.obj <- dist(mydata[,1:2]) hclust.obj <- hclust(dist.obj) css.obj <- css.hclust(dist.obj,hclust.obj) elbow.obj <- elbow.batch(css.obj) print(elbow.obj) 20 21 22 23 ## make partition given the "good" k k <- elbow.obj$k; cutree.obj <- cutree(hclust.obj,k=k) mydata$cluster <- cutree.obj 24 25 26 27 28 29 30 ## draw a elbow plot and label the data dev.new(width=12, height=6) par(mfcol=c(1,2),mar=c(4,5,3,3),omi=c(0.75,0,0,0)) plot(mydata$x,mydata$y,pch=as.character(mydata$cluster),col=mydata$cluster,cex=0.75, main="Clusters of simulated data") plot.elbow(css.obj,elbow.obj,if.plot.new=FALSE) 31 32 33 34 ## clustering with more relaxed thresholds (, resulting a smaller "good" k) elbow.obj2 <- elbow.batch(css.obj,ev.thres=0.90,inc.thres=0.05) mydata$cluster2 <- cutree(hclust.obj,k=elbow.obj2$k) 35 36 37 38 39 40 dev.new(width=12, height=6) par(mfcol=c(1,2), mar=c(4,5,3,3),omi=c(0.75,0,0,0)) plot(mydata$x,mydata$y,pch=as.character(mydata$cluster2),col=mydata$cluster2,cex=0.75, main="Clusters of simulated data") plot.elbow(css.obj,elbow.obj2,if.plot.new=FALSE) 23 CRAN GMD Zhao et al Clusters of simulated data Elbow plot ● ● ● ● ● ● ● ● ● ● ● 0.98 0.8 0.6 Explained Variance 3 3333 33 1 11111 111 1 7 7777777 7 2 ● ● ● 0.2 4 4 4444 444 4 6 0.0 mydata$y 4 2 1 11 1 11111 ● ● 4444 4 4444 4444 4444 ● ● ● 6 8 6 6 66 6666 66 666 6 66 666 22 2 222 22 22 0.4 55 555 555 55 1.0 k=7 5 55 55555 5 ● 8 5 10 mydata$x 15 20 k A "good" k=7 (EV=0.98) is detected when the EV is no less than 0.95 and the increment of EV is no more than 0.01 for a bigger k. Figure 13: Graphical output of “example-css.R”. Clusters of simulated data Elbow plot 4 4 44 4444 44 ● ● ● ● ● ● ● ● ● ● ● 0.94 0.8 0.6 ● ● 0.2 3 3 3333 333 1 1111 11 1 11111 111 1 4 6 0.0 5 5555555 5 2 ● ● 0.4 Explained Variance 6 mydata$y 4 2 1 11 1 11111 ● ● 3333 3 3333 3333 3333 ● ● 444 4 44 444 22 2 222 22 22 8 44 444 444 44 1.0 k=5 4 44 44444 4 8 ● 5 mydata$x 10 k A "good" k=5 (EV=0.94) is detected when the EV is no less than 0.9 and the increment of EV is no more than 0.05 for a bigger k. Figure 14: Graphical output of “example-css.R”. 24 15 20 CRAN GMD A.5 A.5.1 Zhao et al heatmap.3: Visualization in cluster analysis, with evaluation Examples using mtcars data example-heatmap3a.R (Source Code 9) is an example on how to make a heatmap with summary visualization of observations. example-heatmap3a.R 1 2 3 require("GMD") # load library data(mtcars) # load data x <- as.matrix(mtcars) # data as a matrix 4 5 6 7 8 9 10 11 12 13 14 15 16 17 dev.new(width=10,height=8) heatmap.3(x) # heatmap.3(x, Rowv=FALSE, Colv=FALSE) # heatmap.3(x, dendrogram="none") # heatmap.3(x, dendrogram="row") # heatmap.3(x, dendrogram="row", Colv=FALSE) # heatmap.3(x, dendrogram="col") # heatmap.3(x, dendrogram="col", Rowv=FALSE) # heatmapOut <heatmap.3(x, scale="column") # names(heatmapOut) # heatmap.3(x, scale="column", x.center=0) # heatmap.3(x, scale="column",trace="column")# default, with reordering and dendrogram no reordering and no dendrogram reordering without dendrogram row dendrogram with row (and col) reordering row dendrogram with only row reordering col dendrogram col dendrogram with only col reordering sacled by column view the list that is returned colors centered around 0 trun "trace" on 18 19 20 21 22 23 24 ## coloring cars (row observations) by brand brands <- sapply(rownames(x), function(e) strsplit(e," ")[[1]][1]) names(brands) <- c() brands.index <- as.numeric(as.factor(brands)) RowIndividualColors <- rainbow(max(brands.index))[brands.index] heatmap.3(x, scale="column", RowIndividualColors=RowIndividualColors) 25 26 27 ## coloring attributes (column features) randomly (just for a test :) heatmap.3(x, scale="column", ColIndividualColors=rainbow(ncol(x))) 28 29 30 31 32 33 34 35 ## add a single plot for all row individuals dev.new(width=12,height=8) expr1 <- list(quote(plot(row.data[rowInd,"hp"],1:nrow(row.data),xlab="hp",ylab="", main="Gross horsepower",yaxt="n")), quote(axis(2,1:nrow(row.data),rownames(row.data)[rowInd],las=2))) heatmap.3(x, scale="column", plot.row.individuals=TRUE, row.data=x, plot.row.individuals.list=list(expr1)) 36 37 38 39 40 41 42 43 ## add a single plot for all col individuals dev.new(width=12,height=8) expr2 <- list(quote(plot(colMeans(col.data)[colInd], xlab="", ylab="Mean",xaxt="n", main="Mean features",cex=1,pch=19)), quote(axis(1,1:ncol(col.data),colnames(col.data)[colInd],las=2))) heatmap.3(x, scale="column", plot.col.individuals=TRUE, col.data=x, plot.col.individuals.list=list(expr2)) 44 45 46 47 48 49 50 51 52 53 ## add another single plot for all col individuals dev.new(width=12,height=8) expr3 <- list(quote(op <- par(mar = par("mar")*1.5)), quote(mytmp.data <- apply(col.data,2,function(e) e/sum(e))), quote(boxplot(mytmp.data[,colInd], xlab="", ylab="Value", main="Boxplot of normalized column features")), quote(par(op))) heatmap.3(x, scale="column", plot.col.individuals=TRUE, col.data=x, plot.col.individuals.list=list(expr3)) 25 CRAN GMD Zhao et al 30 Heatmap 0 Count 60 Color Key −2 0 1 2 3 4 Value Cadillac Fleetwood Lincoln Continental Chrysler Imperial Pontiac Firebird Hornet Sportabout Duster 360 Camaro Z28 Ford Pantera L Maserati Bora Merc 450SL Merc 450SE Merc 450SLC Dodge Challenger AMC Javelin Hornet 4 Drive Valiant Toyota Corona Porsche 914−2 Datsun 710 Volvo 142E Merc 230 Lotus Europa Merc 240D Merc 280 Merc 280C Mazda RX4 Wag Mazda RX4 Ferrari Dino Fiat 128 Fiat X1−9 Toyota Corolla disp hp mpg qsec gear drat wt carb cyl am vs Honda Civic Figure 15: Graphical output of “example-heatmap3a.R”. 30 Heatmap Row Individuals 0 Count 60 Color Key −2 0 2 4 Value Gross horsepower Cadillac Fleetwood Lincoln Continental Chrysler Imperial Pontiac Firebird Hornet Sportabout Duster 360 Camaro Z28 Ford Pantera L Maserati Bora Merc 450SL Merc 450SE Merc 450SLC Dodge Challenger AMC Javelin Hornet 4 Drive Valiant Toyota Corona Porsche 914−2 Datsun 710 Volvo 142E Merc 230 Lotus Europa Merc 240D Merc 280 Merc 280C Mazda RX4 Wag Mazda RX4 Ferrari Dino Fiat 128 Fiat X1−9 Toyota Corolla Cadillac Fleetwood Lincoln Continental Chrysler Imperial Pontiac Firebird Hornet Sportabout Duster 360 Camaro Z28 Ford Pantera L Maserati Bora Merc 450SL Merc 450SE Merc 450SLC Dodge Challenger AMC Javelin Hornet 4 Drive Valiant Toyota Corona Porsche 914−2 Datsun 710 Volvo 142E Merc 230 Lotus Europa Merc 240D Merc 280 Merc 280C Mazda RX4 Wag Mazda RX4 Ferrari Dino Fiat 128 Fiat X1−9 Toyota Corolla Honda Civic ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Figure 16: Graphical output of “example-heatmap3a.R”. 26 disp hp mpg qsec gear drat wt carb cyl vs am Honda Civic 50 100 150 200 hp 250 300 Zhao et al Color Key 0 40 Count CRAN GMD −2 Heatmap 0 1 2 3 4 Value Cadillac Fleetwood Lincoln Continental Chrysler Imperial Pontiac Firebird Hornet Sportabout Duster 360 Camaro Z28 Ford Pantera L Maserati Bora Merc 450SL Merc 450SE Merc 450SLC Dodge Challenger AMC Javelin Hornet 4 Drive Valiant Toyota Corona Porsche 914−2 Datsun 710 Volvo 142E Merc 230 Lotus Europa Merc 240D Merc 280 Merc 280C Mazda RX4 Wag Mazda RX4 Ferrari Dino Fiat 128 Fiat X1−9 Toyota Corolla disp hp mpg qsec gear drat wt carb cyl vs am Honda Civic Mean features Mean 150 Column Individuals ● ● ● disp carb ● hp cyl ● mpg vs ● qsec ● gear ● drat ● wt ● am 0 50 ● Color Key 0 40 Count Figure 17: Graphical output of “example-heatmap3a.R”. −2 Heatmap 0 1 2 3 4 Value Cadillac Fleetwood Lincoln Continental Chrysler Imperial Pontiac Firebird Hornet Sportabout Duster 360 Camaro Z28 Ford Pantera L Maserati Bora Merc 450SL Merc 450SE Merc 450SLC Dodge Challenger AMC Javelin Hornet 4 Drive Valiant Toyota Corona Porsche 914−2 Datsun 710 Volvo 142E Merc 230 Lotus Europa Merc 240D Merc 280 Merc 280C Mazda RX4 Wag Mazda RX4 Ferrari Dino Fiat 128 Fiat X1−9 Toyota Corolla disp hp mpg qsec gear drat wt carb cyl vs am Honda Civic Value 0.00 0.04 0.08 Column Individuals Boxplot of normalized column features ● ● ● ● am vs cyl carb wt Figure 18: Graphical output of “example-heatmap3a.R”. 27 drat gear qsec mpg hp disp CRAN GMD A.5.2 Zhao et al Examples using ruspini data example-heatmap3b.R (Source Code 10) is an example on how to make a heatmap with summary visualization of clusters. example-heatmap3b.R 1 2 3 ## load library require("GMD") require(cluster) 4 5 6 ## load data data(ruspini) 7 8 9 10 11 12 13 14 ## heatmap on a `dist' object x <- gdist(ruspini) main <- "Heatmap of Ruspini data" dev.new(width=10,height=10) heatmap.3(x, main=main, mapratio=1) # default with a title and a map in square! heatmap.3(x, main=main, revC=TRUE) # reverse column for a symmetric look heatmap.3(x, main=main, kr=2, kc=2) # show partition by predefined number of clusters 15 16 17 18 19 20 21 ## show partition by elbow css.multi.obj <- css.hclust(x,hclust(x)) elbow.obj <- elbow.batch(css.multi.obj,ev.thres=0.90,inc.thres=0.05) heatmap.3(x, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k) heatmap.3(x, main=main, sub=sub("\n"," ",attr(elbow.obj,"description")), cex.sub=1.25, revC=TRUE,kr=elbow.obj$k, kc=elbow.obj$k) # show elbow info as subtitle 22 23 24 25 26 27 28 29 ## side plot for every row clusters dev.new(width=10,height=10) expr1 <- list(quote(plot(do.call(rbind,i.x),xlab="x",ylab="y", xlim=range(ruspini$x),ylim=range(ruspini$y),))) heatmap.3(x, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k, trace="none", row.data=as.list(data.frame(t(ruspini))), plot.row.clusters=TRUE,plot.row.clusters.list=list(expr1)) 30 31 32 33 34 35 36 37 ## side plot for every col clusters dev.new(width=10,height=10) expr2 <- list(quote(plot(do.call(rbind,i.x),xlab="x",ylab="y", xlim=range(ruspini$x),ylim=range(ruspini$y),))) heatmap.3(x, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k, trace="none", col.data=as.list(data.frame(t(ruspini))), plot.col.clusters=TRUE,plot.col.clusters.list=list(expr2)) 38 28 CRAN GMD Zhao et al Heatmap of Ruspini data 400 0 Count Color Key 0 Row Clusters A "good" k=4 (EV=0.93) is detected when the EV is no less than 0.9 and the increment of EV is no more than 0.05 for a bigger k. 100 Value 4 150 Row group 2 (n=23) 0 50 y ●●●●● ●● ● ●● ●● ●● ●● ●● ●● ● ● 150 0 20 40 60 80 120 x Row group 3 (n=17) ● ● ● ● ● ● ●● ●● ●● 0 50 y ● ● ●● ● 0 50 y 150 0 20 40 60 80 120 x Row group 1 (n=20) ● ●● ●● ● ● ●● ● ●● ● ● ●● ● ●● ● 150 0 20 40 60 80 120 x Row group 4 (n=15) 0 ●●● ●●● ● ● ● ●●●●● ● 20 40 60 80 x 63 66 62 61 64 68 74 73 75 72 65 71 70 69 67 20 11 13 12 7 6 8 4 16 19 18 17 14 15 10 9 3 2 5 1 46 48 47 57 55 56 58 59 60 45 44 54 50 52 53 49 51 28 27 30 29 22 21 23 24 41 42 43 37 38 34 33 40 36 39 35 32 31 26 25 4 1 3 2 63 66 62 61 64 68 74 73 75 72 65 71 70 69 67 20 11 13 12 7 6 8 4 16 19 18 17 14 15 10 9 3 2 5 1 46 48 47 57 55 56 58 59 60 45 44 54 50 52 53 49 51 28 27 30 29 22 21 23 24 41 42 43 37 38 34 33 40 36 39 35 32 31 26 25 y 1 0 50 3 2 0 40 x 80 120 40 x 80 120 0 150 Col group 4 (n=15) made by Function: heatmap.3 Package: GMD in R 100 150 y 50 50 y 0 ● ●●●● ●● ● ● ●● ● ● ●●● ● ● ●● 0 50 y ●●● Col group 1 (n=20) ●● ●●●●● ● ● ● ●● ● ● ● 0 100 ● ● ● ● ● ● ● ● ● ●● ● ●● 100 150 Col group 3 (n=17) 0 50 y 100 ●●●●●●● ●●● ●● ●● ●●● ● ●●● ● ● 0 Col Clusters 150 Col group 2 (n=23) 40 Figure 19: Graphical output of “example-heatmap3b.R”. 29 x 80 120 0 40 x 80 120 120 CRAN GMD B Zhao et al Data B.1 GMD dataset overview > data(package="GMD") Data sets in package 'GMD': cage cagel chipseq_hCD4T chipseq_mES B.2 CAGE Data CAGE Data ChIP-seq Data ChIP-seq Data CAGE data: cage and cagel > help(cage) 30 CRAN GMD Zhao et al > require(GMD) > data(cage) > class(cage) [1] "list" > length(cage) [1] 20 > names(cage) [1] [3] [5] [7] [9] [11] [13] [15] [17] [19] "NA (T01F029805F8)" "Cyp4a14 (T04R06C91673)" "D230039L06Rik (T01F0AA465EB)" "Rab5c (T11R05FBC6C4)" "Tpt1 (T14F04079189)" "D0H4S114 (T18R020553F0)" "Hsd11b1 (T01R0B8305BD)" "B2m (T02F0743FF05)" "Pfkfb3 (T02R00AEC2D8)" "Cd72 (T04R028B8BC9)" "Glul (T01F092C2995)" "Stxbp4 (T11R05607FD4)" "Gas5 (T01F09995479)" "BC003940 (T11R072A6CB0)" "Pcna (T02R07DE319B)" "Gsto1 (T19F02D03566)" "Csf1 (T03R0672174D)" "Alox5ap (T05F08BCF2C4)" "Hig1 (T09R0743763C)" "Egln1 (T08R0769239F)" > data(cagel) > names(cagel) [1] [3] [5] [7] [9] [11] [13] [15] [17] [19] [21] [23] [25] [27] [29] [31] [33] [35] [37] [39] [41] [43] [45] [47] [49] "NA (T17F05912B83)" "B2m (T02F0743FF05)" "2600001A11Rik (T12R043A2595)" "Rpl41 (T10R07AB7138)" "Dscr1l1 (T17F02802885)" "Rab5c (T11R05FBC6C4)" "Csf1 (T03R0672174D)" "Crim1 (T17F04928998)" "Rai17 (T14F014BF473)" "Apbb2 (T05R03E329C8)" "Tmeff2 (T01F030EC757)" "Hsd11b1 (T01R0B8305BD)" "4930430J02Rik (T09F036E80C6)" "Trpv2 (T11F03B4EBD8)" "Scd1 (T19R029B5186)" "5730596K20Rik (T19F006DFC1A)" "Klhl5 (T05F03CCA673)" "NA (T02R07EF5EDA)" "Nudt7 (T08F06C3B651)" "Egln1 (T08R0769239F)" "BC003940 (T11R072A6CB0)" "Alox5ap (T05F08BCF2C4)" "Gch1 (T14R02602138)" "Vrk1 (T12F06010C9B)" "Wdtc1 (T04R07DAFEDC)" "Tpt1 (T14F04079189)" "Grn (T11F0615F289)" "Rbbp7 (T0XF091A7ACA)" "H2afy (T13R034ACF47)" "Ckap4 (T10R0504CE97)" "Pfkfb3 (T02R00AEC2D8)" "Ctsb (T14F0348EDBA)" "3930401E15Rik (T18R02CDD141)" "Hig1 (T09R0743763C)" "Ptn (T06R0230806E)" "Mrps6 (T16F0583C906)" "D0H4S114 (T18R020553F0)" "Phtf2 (T05R0125E896)" "Slco3a1 (T07R03AC06B9)" "Ctxn (T08R0040864A)" "Arbp (T05F06BBE13B)" "Gsto1 (T19F02D03566)" "Srpk1 (T17R019F4A41)" "Tnfrsf10b (T14F03AB1306)" "9630050M13Rik (T02F002EC972)" "Ppia (T11F00604AFF)" "Pcna (T02R07DE319B)" "Yap1 (T09R0079F3FF)" "Cd72 (T04R028B8BC9)" "Centg2 (T01F055392D1)" 31 CRAN GMD B.3 Zhao et al ChIP-seq data: chipseq_mES and chipseq_hCD4T > help(chipseq) > data(chipseq_mES) > class(chipseq_mES) [1] "list" > length(chipseq_mES) [1] 6 > names(chipseq_mES) [1] "H3K27me3" "H3K36me3" "H3K4me1" "H3K4me2" "H3K4me3" "H3K9me3" > data(chipseq_hCD4T) > names(chipseq_hCD4T) [1] [7] [13] [19] [25] [31] [37] "CTCF" "H2BK20ac" "H3K27ac" "H3K36me3" "H3K79me2" "H3R2me1" "H4K5ac" "H2AK5ac" "H2BK5ac" "H3K27me1" "H3K4ac" "H3K79me3" "H3R2me2" "H4K8ac" "H2AK9ac" "H2BK5me1" "H3K27me2" "H3K4me1" "H3K9ac" "H4K12ac" "H4K91ac" "H2AZ" "H3K14ac" "H3K27me3" "H3K4me2" "H3K9me1" "H4K16ac" "H4R3me2" 32 "H2BK120ac" "H3K18ac" "H3K36ac" "H3K4me3" "H3K9me2" "H4K20me1" "H2BK12ac" "H3K23ac" "H3K36me1" "H3K79me1" "H3K9me3" "H4K20me3" CRAN GMD Zhao et al References [1] Artem Barski, Suresh Cuddapah, Kairong Cui, Tae-Young Roh, Dustin E Schones, Zhibin Wang, Gang Wei, Iouri Chepelev, and Keji Zhao. High-resolution profiling of histone methylations in the human genome. Cell, 129(4):823–837, May 2007. [2] Piero Carninci, Albin Sandelin, Boris Lenhard, Shintaro Katayama, Kazuro Shimokawa, Jasmina Ponjavic, Colin A M Semple, Martin S Taylor, PÂŁr G EngstrÂŽm, Martin C Frith, Alistair R R Forrest, Wynand B Alkema, Sin Lam Tan, Charles Plessy, Rimantas Kodzius, Timothy Ravasi, Takeya Kasukawa, Shiro Fukuda, Mutsumi Kanamori-Katayama, Yayoi Kitazume, Hideya Kawaji, Chikatoshi Kai, Mari Nakamura, Hideaki Konno, Kenji Nakano, Salim Mottagui-Tabar, Peter Arner, Alessandra Chesi, Stefano Gustincich, Francesca Persichetti, Harukazu Suzuki, Sean M Grimmond, Christine A Wells, Valerio Orlando, Claes Wahlestedt, Edison T Liu, Matthias Harbers, Jun Kawai, Vladimir B Bajic, David A Hume, and Yoshihide Hayashizaki. Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet, 38(6):626–635, Jun 2006. [3] Sung-Hyuk Cha and Sargur N. Srihari. On measuring the distance between histograms. Pattern Recognition, 35(6):1355–1370, 2002. [4] Tarjei S Mikkelsen, Manching Ku, David B Jaffe, Biju Issac, Erez Lieberman, Georgia Giannoukos, Pablo Alvarez, William Brockman, Tae-Kyung Kim, Richard P Koche, William Lee, Eric Mendenhall, Aisling O’Donovan, Aviva Presser, Carsten Russ, Xiaohui Xie, Alexander Meissner, Marius Wernig, Rudolf Jaenisch, Chad Nusbaum, Eric S Lander, and Bradley E Bernstein. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature, 448(7153):553–560, Aug 2007. [5] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0. [6] Albin Sandelin, Piero Carninci, Boris Lenhard, Jasmina Ponjavic, Yoshihide Hayashizaki, and David A Hume. Mammalian rna polymerase ii core promoters: insights from genome-wide studies. Nat Rev Genet, 8(6):424–436, Jun 2007. [7] Xiaobei Zhao, Eivind Valen, Brian J Parker, and Albin Sandelin. Systematic clustering of transcription start site landscapes. PLoS ONE, 6(8):e23409, August 2011. [8] Vicky W Zhou, Alon Goren, and Bradley E Bernstein. Charting histone modifications and the functional organization of mammalian genomes. Nat Rev Genet, 12(1):7–18, Jan 2011. 33