Download CRAN GMD: User's Guide (0.3.3)

Transcript
CRAN GMD: User’s Guide (0.3.3)
Generic histogram construction, generic distance measure,
cluster analysis with evaluation and visualization
Xiaobei Zhao∗1 and Albin Sandelin†1
1
Bioinformatics Centre, University of Copenhagen
Modified: 2014-08-26
Compiled: 2014-8-27
You may find the latest version of GMD and this documentation at,
http://CRAN.R-project.org/package=GMD
Keywords: histogram, distance, metric, non-parametric, cluster analysis, hierarchical clustering, sum-ofsquares, heatmap.3
Abstract
The purpose of this GMD vignette is to show how to get started with the R package GMD. GMD denotes Generalized
Minimum Distance between distributions, which is a true metric that measures the distance between the shapes
of any two discrete numerical distributions (e.g. histograms).
The vignette includes a brief introduction, an example to illustrate the concepts and the implementation of GMD and
case studies that were carried out using classical data sets (e.g. iris) and high-throughput sequencing data (e.g. ChIPseq) from biology experiments. The appendix on page 14 contains an overview of package functionality, and examples
using primary functions in histogram construction (the ghist function), how to measure distance between distributions
(the gdist function), cluster analysis with evaluation (the “elbow” method in the css function) and visualization (the
heatmap.3 function).
Contents
1 Introduction
2
2 Minimal Example: “Hello, GMD!”
3
∗ Lineberger
Comprehensive Cancer Center, University of North Carolina at Chapel Hill, 450 West Dr, Chapel Hill, NC 27599,
USA. [email protected]
† Bioinformatics Centre, University of Copenhagen, Department of Biology and Biotech Research and Innovation Centre, Ole
Maaløes Vej 5, DK-2200, Copenhagen, Denmark. [email protected]
1
CRAN GMD
Zhao et al
3 An example to understand GMD
3.1
3.2
4
Histogram: construction and visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
3.1.1
Load library and simulate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
3.1.2
Construct histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
3.1.3
Save histograms as multiple-histogram (‘mhist’) object . . . . . . . . . . . . . . . . . . .
4
3.1.4
Visualize an ‘mhist’ object
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5
Histogram: distance measure and alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7
3.2.1
Measure the pairwise distance between two histograms by GMD . . . . . . . . . . . . .
7
3.2.2
Show alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
8
4 Case study
9
4.1
CAGE: measuring the dissimilarities among TSSDs . . . . . . . . . . . . . . . . . . . . . . . . .
9
4.2
ChIP-seq: measuring the similarities among histone modification patterns . . . . . . . . . . . .
12
A Functionality
14
A.1 An overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
14
A.2 ghist: Generic construction and visualization of histograms . . . . . . . . . . . . . . . . . . . .
15
A.2.1 Examples using simulated data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
15
A.2.2 Examples using iris data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
17
A.2.3 Examples using nottem data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
A.3 gdist: Generic construction and visualization of distances . . . . . . . . . . . . . . . . . . . . .
21
A.4 css: Clustering Sum-of-Squares and the “elbow” plot . . . . . . . . . . . . . . . . . . . . . . .
23
A.5 heatmap.3: Visualization in cluster analysis, with evaluation . . . . . . . . . . . . . . . . . . .
25
A.5.1 Examples using mtcars data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
25
A.5.2 Examples using ruspini data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
28
B Data
1
30
B.1 GMD dataset overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
B.2 CAGE data: cage and cagel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
30
B.3 ChIP-seq data: chipseq_mES and chipseq_hCD4T . . . . . . . . . . . . . . . . . . . . . . . . .
32
Introduction
Similar to the Earth Mover’s distance, Generalized Minimum distance of Distributions (GMD) (based on
MDPA - Minimum Difference of Pair Assignment [3]) is a true metric of the similarity between the shapes of
two histograms1 . Considering two normalized histograms A and B, GMD measures their similarity by counting
the necessary “shifts” of elements between the bins that have to be performed to transform distribution A into
distribution B.
1 In statistics (and many other fields), histogram refers to a graphical representation of category frequencies in the data. Here,
we use this term in a more mathematical sense, defined as a function that counts categorical data or a result returned by such a
function.
2
CRAN GMD
Zhao et al
The GMD package provides classes and methods for computing GMD in R [5]. The algorithm has been implemented in C to interface with R for efficient computation. The package also includes downstream cluster
analysis in function css (A.4 on page 23) that use a pairwise distance matrix to make partitions given variant
criteria, including the “elbow” rule as discussed in [7] or desired number of clusters. In addition, the function
heatmap.3 (A.5 on page 25) integrates the visualization of the hierarchical clustering in dendrogram, the
distance measure in heatmap and graphical representations of summary statistics of the resulting clusters or
the overall partition. For more flexibility, the function heatmap.3 can also accept plug-in functions defined by
end-users for custom summary statistics.
The motivation to write this package was born with the project [7] on characterizing Transcription Start
Site (TSS) landscapes using high-throughput sequencing data, where a non-parametric distance measure was
developed to assess the similarity among distributions of high-throughput sequencing reads from biological
experiments. However, it is possible to use the method for any empirical distributions of categorical data.
The package is available on CRAN. The source code is available at http://CRAN.R-project.org/package=GMD
under GPL license.
2
Minimal Example: “Hello, GMD!”
hello-GMD.R
1
2
##' Check GMD's sanity and start up
##' @name hello-GMD
3
4
5
## GMD at CRAN, for source code download and installation
## http://cran.r-project.org/web/packages/GMD/index.html
6
7
8
## load GMD
library(GMD)
9
10
11
12
## version of GMD and description
packageVersion("GMD")
packageDescription("GMD")
13
14
15
## view GMD vignette
vignette("GMD-vignette",package="GMD")
16
17
18
## list the available data sets in GMD
data(package="GMD")
19
20
21
## list all the objects in the GMD
ls("package:GMD")
22
23
24
## help info on GMD
help(package="GMD")
25
26
27
## run a demo
demo("GMD-demo")
28
29
30
## cite GMD in publications
citation(package="GMD")
31
hello-GMD.R (Source Code 1) is a minimal example to load and check of that your GMD installation works.
It also includes code for viewing the package information and this “vignette”, checking data sets provided by
GMD, starting a demo and listing the citation of GMD.
3
CRAN GMD
3
Zhao et al
An example to understand GMD
This example, based on simulated data, is designed to illustrate the concepts and the implementation of GMD
by stepping through the computations in detail.
3.1
Histogram: construction and visualization
3.1.1
>
>
>
>
>
>
require("GMD") # load library
## create two normally-distributed samples
## with unequal means and unequal variances
set.seed(2012)
x1 <- rnorm(1000,mean=-5, sd=10)
x2 <- rnorm(1000,mean=10, sd=5)
3.1.2
>
>
>
>
>
>
>
Load library and simulate data
Construct histograms
## create common bins
n <- 20 # desired number of bins
breaks <- gbreaks(c(x1,x2),n) # bin boundaries
## make two histograms
v1 <- ghist(x1,breaks=breaks,digits=0)
v2 <- ghist(x2,breaks=breaks,digits=0)
3.1.3
Save histograms as multiple-histogram (‘mhist’) object
> x <- list(v1,v2)
> mhist.obj <- as.mhist(x)
4
CRAN GMD
3.1.4
Zhao et al
Visualize an ‘mhist’ object
> ## plot histograms side-by-side
> plot(mhist.obj,mar=c(1.5,1,1,0),main="Histograms of simulated normal distributions")
Histograms of simulated normal distributions
1
2
250
Count
200
150
100
50
0
−21
−5
11
Bin
5
27
CRAN GMD
Zhao et al
> ## plot histograms as subplots, with corresponding bins aligned
> plot(mhist.obj,beside=FALSE,mar=c(1.5,1,1,0),
+ main="Histograms of simulated normal distributions")
Histograms of simulated normal distributions
1
250
200
150
100
50
Count
0
−21
−5
11
27
2
250
200
150
100
50
0
−21
−5
11
Bin
6
27
CRAN GMD
3.2
Zhao et al
Histogram: distance measure and alignment
Here we measure the GMD distance between shapes of two histograms with option sliding on.
3.2.1
Measure the pairwise distance between two histograms by GMD
> gmdp.obj <- gmdp(v1,v2,sliding=TRUE)
> print(gmdp.obj)
# print a brief version by default
[1] 1.334
> print(gmdp.obj,mode="detailed") # print a detailed version
GM-Distance: 1.334
Sliding: TRUE
Number of hits: 1
Gap:
v1
Hit1
v2
5
0
Resolution: 1
> print(gmdp.obj,mode="full")
# print a full version
Distribution of v1:
1 2 10 19 28 46 59 101 109 119 133 108 90 77 42 29 17 6 3 1
(After normalization)
0.001 0.002 0.01 0.019 0.028 0.046 0.059 0.101 0.109 0.119 0.133 0.108 0.09 0.077 0.042 0.029 0.017 0.006 0.003
Distribution of v2:
0 0 0 0 0 0 0 0 0 1 5 30 94 206 258 199 139 58 8 2
(After normalization)
0 0 0 0 0 0 0 0 0 0.001 0.005 0.03 0.094 0.206 0.258 0.199 0.139 0.058 0.008 0.002
GM-Distance: 1.334
Sliding: TRUE
Number of hits: 1
Gap:
v1
Hit1
v2
5
0
Resolution: 1
7
CRAN GMD
3.2.2
Zhao et al
Show alignment
Now, let’s have a look at the alignment by GMD, with a distance 1.334 and a “shift” of 5 in the 1st distribution. It
is important to note that the specific features (the values in this case) of the original bins in the histograms are
ignored with sliding on. To keep original bin-to-bin correspondence, please set sliding to FALSE (see examples
in section 4.2 on page 12).
> plot(gmdp.obj,beside=FALSE)
Optimal alignment between distributions (with sliding)
GMD=1.334, gap=c(5,0)
v1
0.25
0.20
0.15
0.10
0.05
0.00
Fraction
5
10
15
20
25
v2
0.25
0.20
0.15
0.10
0.05
0.00
5
10
15
20
Position
8
25
CRAN GMD
4
4.1
Zhao et al
Case study
CAGE: measuring the dissimilarities among TSSDs
Studies have demonstrated that the spatial distributions of read-based sequencing data from different platforms
often indicate functional properties and expression profiles (reviewed in [6] and [8]). Analyzing the distributions
of DNA reads is therefore often meaningful. To do this systematically, a measure of similarity between
distributions is necessary. Such measures should ideally be true metrics, have few parameters as possible, be
computationally efficient and also make biological sense to end-users. Case studies were made in section 4.1
and 4.2 to demonstrate the applications of GMD using distributions of CAGE and ChIP-seq reads.
In this section we demonstrate how GMD is applied to measure the dissimilarities among TSSDs, histograms
of transcription start site (TSS) that are made of CAGE tags, with option sliding on. The spatial properties
of TSSDs vary widely between promoters and have biological implications in both regulation and function.
The raw data were produced by CAGE and downloaded from FANTOM3 ([2]) and CAGE sequence reads
were preprocessed as did in [7]. The following codes case-cage.R (Source Code 2) are sufficient to perform
both pairwise GMD calculation by function gmdp and to construct a GMD distance matrix by function gmdm.
A handful of options are available for control and flexibility, particularly, the option sliding is enabled by
default to allow partial alignment.
case-cage.R
1
2
require("GMD") # load library
data(cage)
# load data
3
4
5
6
7
## measure pairwise distance
x <- gmdp(cage[["Pfkfb3 (T02R00AEC2D8)"]],cage[["Csf1 (T03R0672174D)"]])
print(x)
# print a brief version by default
print(x, mode="full") # print a full version by default
8
9
10
## show alignment
plot(x,labels=c("Pfkfb3","Csf1"),beside=FALSE)
11
12
13
14
15
## show another alignment
plot(gmdp(cage[["Hig1 (T09R0743763C)"]],cage[["Cd72 (T04R028B8BC9)"]]),
labels=c("Hig1 (T09R0743763C)","Cd72 (T04R028B8BC9)"),
beside=FALSE)
16
17
18
19
20
## construct a distance matrix and visualize it
short.labels <- gsub("(.+) \\(.+","\\1",names(cage)) # get short labels
x <- gmdm(cage[1:6],labels=short.labels[1:6])
plot(x)
9
CRAN GMD
Zhao et al
Optimal alignment between distributions (with sliding)
GMD=7.881, gap=c(0,100)
0.8
Pfkfb3 (T02R00AEC2D8)
0.6
0.4
0.2
0.0
Fraction
50
100
150
200
250
0.8
Csf1 (T03R0672174D)
0.6
0.4
0.2
0.0
50
100
150
200
250
Position
Figure 1: Graphical output of “case-cage.R”.
Optimal alignment between distributions (with sliding)
GMD=3.992, gap=c(0,24)
Hig1 (T09R0743763C)
0.3
0.2
0.1
0.0
Fraction
20
40
60
80
100
120
Cd72 (T04R028B8BC9)
0.3
0.2
0.1
0.0
20
40
60
80
Position
Figure 2: Graphical output of “case-cage.R”.
10
100
120
CRAN GMD
Zhao et al
Optimal alignments among distributions (with sliding)
NA
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
0.0
5
5
5
0.8
GMD=1.310
Gap=(4,0)
Glul
5
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.0
0.0
5
Cyp4a14
GMD=2.284
Gap=(0,3)
0.0
2
GMD=0.833
Gap=(0,0)
0.0
5
0.8
GMD=0.589
Gap=(0,3)
0.0
50
0.8
200
0.8
GMD=3.802
Gap=(90,0)
0.0
50
Fraction
GMD=2.754
Gap=(71,0)
0.0
50
0.8
0.8
0.0
50
0.0
20
0.8
0.0
50
0.8
200
0.8
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.0
0.0
5
GMD=2.732
Gap=(3,0)
0.0
5
GMD=2.551
Gap=(0,0)
0.0
50
200
GMD=5.698
Gap=(93,0)
0.0
50
GMD=4.821
Gap=(74,0)
0.0
50
GMD=5.415
Gap=(59,0)
0.0
50
0.0
50
0.0
20
0.0
50
200
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.0
0.0
5
D230039L
06Rik
GMD=0.658
Gap=(0,3)
0.0
50
200
GMD=4.339
Gap=(90,0)
0.0
50
GMD=3.354
Gap=(71,0)
0.0
50
GMD=4.432
Gap=(54,0)
0.0
50
0.0
50
0.0
20
0.0
50
200
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.0
0.0
50
Gas5
200
GMD=3.769
Gap=(93,0)
0.0
50
GMD=2.760
Gap=(74,0)
0.0
50
GMD=4.447
Gap=(59,0)
0.0
50
0.0
50
0.0
20
0.0
50
200
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.2
0.0
200
Tpt1
0.0
50
GMD=1.480
Gap=(0,19)
0.0
50
GMD=2.315
Gap=(0,34)
0.0
50
0.0
50
0.0
20
0.0
50
200
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.0
Pcna
GMD=2.325
Gap=(0,15)
0.2
0.0
50
0.2
0.0
50
0.2
0.0
50
200
0.2
0.0
50
200
0.8
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.2
D0H4S114
0.2
0.0
20
0.2
0.0
50
0.2
0.0
50
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
GMD=5.936
Gap=(39,0)
GMD=4.677
Gap=(42,0)
GMD=6.155
Gap=(37,0)
GMD=6.161
Gap=(42,0)
GMD=8.677
Gap=(0,51)
GMD=7.739
Gap=(0,32)
GMD=6.507
Gap=(0,17)
GMD=8.409
Gap=(0,26)
0.2
0.0
50
0.2
0.0
50
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.0
Hsd11b1
0.2
0.0
50
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.2
0.2
0.0
50
Csf1
0.0
50
200
0.8
0.8
0.8
0.8
0.6
0.6
0.6
0.6
0.6
0.4
0.4
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.2
0.0
50
0.0
50
200
0.8
GMD=3.747
Gap=(34,0)
B2m
0.0
50
0.8
0.6
0.6
0.6
0.4
0.4
0.4
0.2
0.2
0.2
0.2
0.0
50
200
0.0
20
GMD=3.538
Gap=(97,0)
GMD=5.794
Gap=(100,0)
GMD=4.162
Gap=(97,0)
GMD=3.802
Gap=(100,0)
GMD=1.179
Gap=(7,0)
GMD=1.402
Gap=(26,0)
GMD=2.462
Gap=(41,0)
GMD=1.968
Gap=(32,0)
GMD=8.670
Gap=(58,0)
GMD=7.881
Gap=(100,0)
GMD=10.728
Gap=(66,0)
Pfkfb3
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.2
0.0
50
GMD=18.387
Gap=(31,0)
GMD=17.706
Gap=(32,0)
GMD=15.601
Gap=(35,0)
GMD=18.253
Gap=(32,0)
GMD=18.130
Gap=(35,0)
GMD=18.366
Gap=(0,50)
GMD=17.692
Gap=(0,35)
GMD=16.389
Gap=(0,16)
GMD=16.574
Gap=(0,25)
GMD=12.228
Gap=(0,5)
GMD=14.311
Gap=(35,0)
GMD=11.099
Gap=(1,0)
GMD=18.183
Gap=(0,57)
200
Hig1
300
100
300
100
300
100
300
100
300
100
300
100
300
0.8
0.6
0.4
0.0
GMD=17.388
Gap=(35,0)
100
0.0
50
0.8
GMD=4.024
Gap=(96,0)
300
0.8
0.6
0.4
0.0
GMD=4.607
Gap=(100,0)
100
0.0
20
0.8
300
0.0
50
0.8
0.0
GMD=2.682
Gap=(0,6)
0.0
20
100
0.0
50
0.8
0.6
0.0
300
0.2
0.0
50
0.8
0.6
50
GMD=10.539
Gap=(0,32)
0.2
0.0
50
0.8
0.0
GMD=2.453
Gap=(0,40)
0.2
0.0
50
100
0.0
50
0.8
0.6
50
300
0.2
0.0
20
0.8
0.6
0.2
GMD=8.790
Gap=(0,23)
0.2
0.0
50
0.8
0.0
GMD=8.173
Gap=(0,66)
0.2
0.0
50
100
0.0
50
0.8
0.6
0.0
300
0.2
0.0
50
0.8
0.6
GMD=6.336
Gap=(37,0)
GMD=9.839
Gap=(0,39)
0.2
0.0
50
0.8
0.6
GMD=5.300
Gap=(40,0)
GMD=6.236
Gap=(0,57)
0.2
0.0
50
0.8
Gsto1
GMD=10.799
Gap=(0,57)
200
0.8
0.6
0.0
100
0.0
50
0.8
0.6
0.2
300
0.2
0.0
50
0.8
0.6
GMD=2.147
Gap=(9,0)
GMD=8.016
Gap=(34,0)
0.2
0.0
50
0.8
GMD=2.277
Gap=(0,6)
GMD=7.086
Gap=(0,74)
0.2
0.0
50
100
0.0
10
0.8
0.6
GMD=2.037
Gap=(0,25)
GMD=7.863
Gap=(31,0)
0.0
50
0.8
GMD=4.784
Gap=(67,0)
GMD=8.213
Gap=(0,93)
0.0
50
300
0.0
10
0.8
0.6
GMD=4.872
Gap=(63,0)
GMD=6.336
Gap=(36,0)
0.0
20 140
0.8
GMD=5.915
Gap=(68,0)
GMD=4.734
Gap=(0,0)
0.0
50
100
0.0
10
0.8
0.6
0.4
GMD=4.426
Gap=(63,0)
GMD=7.811
Gap=(31,0)
0.0
50
0.8
0.6
GMD=5.015
Gap=(63,0)
GMD=4.247
Gap=(0,3)
0.0
50
300
0.0
10
0.8
0.6
0.4
GMD=5.015
Gap=(66,0)
GMD=8.086
Gap=(30,0)
0.0
20 140
0.8
0.6
20
GMD=4.532
Gap=(1,0)
0.0
50
100
0.8
0.6
0.4
Stxbp4
300
0.0
10
0.8
0.6
0.4
0.2
GMD=7.010
Gap=(34,0)
0.0
20
0.8
0.6
0.4
0.0
GMD=4.628
Gap=(0,3)
0.0
50
0.8
0.6
0.4
20
GMD=4.344
Gap=(54,0)
0.0
50
0.8
100
0.8
0.6
0.4
0.0
GMD=4.736
Gap=(0,4)
0.0
0.6
0.4
0.2
GMD=3.674
Gap=(0,0)
0.4
0.2
0.8
0.6
50
GMD=4.587
Gap=(54,0)
0.8
0.4
0.0
GMD=4.435
Gap=(57,0)
0.6
10
0.6
0.2
GMD=2.948
Gap=(70,0)
0.8
50
0.4
50
GMD=3.875
Gap=(74,0)
0.8
200
0.6
0.0
GMD=3.931
Gap=(89,0)
0.8
50
0.4
2
GMD=4.914
Gap=(93,0)
0.8
50
0.6
0.0
GMD=0.351
Gap=(0,4)
0.8
20
0.4
5
GMD=1.251
Gap=(0,0)
0.8
50
0.6
0.0
GMD=0.562
Gap=(0,1)
0.8
50
0.4
5
GMD=0.907
Gap=(3,0)
50
0.6
0.0
GMD=2.833
Gap=(0,4)
0.8
50
0.4
0.8
GMD=1.835
Gap=(0,0)
0.8
200
0.6
5
GMD=0.821
Gap=(0,1)
50
0.4
0.0
GMD=1.128
Gap=(3,0)
2
0.8
0.0
50
200
0.8
0.8
0.6
0.6
0.4
0.4
0.2
0.2
0.0
0.0
20
0.8
0.6
GMD=17.387
Gap=(7,0)
GMD=18.386
Gap=(3,0)
GMD=17.705
Gap=(4,0)
GMD=15.579
Gap=(7,0)
GMD=18.252
Gap=(4,0)
GMD=18.130
Gap=(7,0)
GMD=18.835
Gap=(0,74)
GMD=18.126
Gap=(0,58)
GMD=17.125
Gap=(0,42)
GMD=17.059
Gap=(0,51)
GMD=12.993
Gap=(0,33)
GMD=15.084
Gap=(5,0)
GMD=12.237
Gap=(0,29)
GMD=18.652
Gap=(0,81)
GMD=3.992
Gap=(0,24)
Cd72
0.4
0.2
0.0
GMD=48.292
Gap=(137,0)
GMD=49.290
Gap=(133,0)
GMD=48.610
Gap=(134,0)
GMD=46.521
Gap=(137,0)
GMD=49.157
Gap=(134,0)
GMD=49.034
Gap=(137,0)
GMD=45.360
Gap=(44,0)
GMD=46.391
Gap=(63,0)
GMD=44.807
Gap=(80,0)
GMD=44.467
Gap=(71,0)
Position
Figure 3: Graphical output of “case-cage.R”.
11
GMD=43.082
Gap=(97,0)
GMD=44.996
Gap=(137,0)
GMD=41.294
Gap=(103,0)
GMD=45.380
Gap=(37,0)
GMD=32.443
Gap=(98,0)
GMD=34.026
Gap=(121,0)
Egln1
CRAN GMD
4.2
Zhao et al
ChIP-seq: measuring the similarities among histone modification patterns
In this section we demonstrate how GMD is applied to measure the dissimilarities between histone modifications
represented by ChIP-seq reads. Distinctive patterns of chromatin modifications around the TSS are associated
with transcription regulation and expression variation of genes. Comparing the chromatin modification profiles
(originally produced by [1] and [4], and preprocessed by [7]), the sliding option is disabled for fixed alignments
at the TSSs and the flanking regions. The GMD measure indicates how well profiles are co-related to each
other. In addition, the downstream cluster analysis is visualized with function heatmap.3 that use GMD
distance matrix to generate clustering dendrograms and make partitions given variant criteria, including the
“elbow” rule (discussed in [7]) or desired number of clusters.
case-chipseq.R
1
2
3
require("GMD")
# load library
data(chipseq_mES)
# load data
data(chipseq_hCD4T) # load data
4
5
6
## pairwise distance and alignment based on GMD metric
plot(gmdm(chipseq_mES,sliding=FALSE))
7
8
9
10
## clustering on spatial distributions of histone modifications
x <- gmdm(chipseq_hCD4T,sliding=FALSE,resolution=10)
heatmap.3(x,revC=TRUE)
11
12
13
14
15
16
17
## Determine the number of clusters by "Elbow" criterion
main <- "Heatmap of ChIP-seq data (human CD4+ T cells)"
dist.obj <- gmdm2dist(x)
css.multi.obj <- css.hclust(dist.obj,hclust(dist.obj))
elbow.obj <- elbow.batch(css.multi.obj,ev.thres=0.90,inc.thres=0.05)
heatmap.3(dist.obj, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k)
18
19
20
21
## more strict threshold
elbow.obj <- elbow.batch(css.multi.obj,ev.thres=0.75,inc.thres=0.1)
heatmap.3(dist.obj, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k)
22
23
24
25
## side plots
normalizeVector <- function(v){v/sum(v)} # a function to normalize a vector
dev.new(width=12,height=8)
26
27
28
29
30
31
## summary of row clusters
expr1 <- list(quote(op <- par(mar = par("mar")*2)),
quote(plot(mhist.summary(as.mhist(i.x)),if.plot.new=FALSE)),
quote(par(op))
)
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
## summary of row clustering
expr2 <- list(quote(tmp.clusters <- cutree(hclust(dist.row),k=kr)),
quote(tmp.css <- css(dist.row,tmp.clusters)),
quote(print(tmp.css)),
quote(tmp.wev <- tmp.css$wss/tmp.css$tss),
quote(names(tmp.wev) <- as.character(unique(tmp.clusters))),
quote(tmp.wev <- tmp.wev[order(unique(tmp.clusters))]),
quote(barplot(tmp.wev,main="Cluster Explained Variance", xlab="Cluster",
ylab="EV",col="white",border="black",
ylim=c(0,max(tmp.wev)*1.1),cex.main=1)))
expr3 <- list(quote(op <- par(mar = par("mar")*2)),
quote(plot.elbow(css.multi.obj,elbow.obj,if.plot.new=FALSE)),
quote(par(op))
)
47
48
49
50
51
heatmap.3(dist.obj, main=main, cex.main=1.25, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k,
keysize=1,mapsize=4.5,row.data=lapply(chipseq_hCD4T,normalizeVector),
plot.row.clusters=TRUE,plot.row.clusters.list=list(expr1),
plot.row.clustering=TRUE,plot.row.clustering.list=list(expr2,expr3))
52
12
CRAN GMD
Zhao et al
Optimal alignments among distributions (without sliding)
H3K27me3
0.007
0.007
0.007
0.007
0.007
0.006
0.006
0.006
0.006
0.006
0.005
0.005
0.005
0.005
0.005
0.004
0.004
0.004
0.004
0.004
0.003
0.003
0.003
0.003
0.003
0.002
0.002
0.002
0.002
0.002
0.001
0.001
0.001
0.001
0.000
0.000
0.000
0.000
200
GMD=118.840
Gap=(0,0)
400
600
H3K36me3
800 1000
200
600
800 1000
200
600
800 1000
0.001
0.000
200
400
600
800 1000
0.007
0.007
0.007
0.006
0.006
0.006
0.006
0.005
0.005
0.005
0.005
0.004
0.004
0.004
0.004
0.003
0.003
0.003
0.003
0.002
0.002
0.002
0.002
0.001
0.001
0.001
0.000
0.000
0.000
GMD=82.910
Gap=(0,0)
400
600
800 1000
Fraction
H3K4me1
200
GMD=140.037
Gap=(0,0)
400
600
800 1000
600
800 1000
0.007
0.006
0.006
0.006
0.005
0.005
0.005
0.004
0.004
0.004
0.003
0.003
0.003
0.002
0.002
0.002
0.001
0.001
0.000
0.000
GMD=77.365
Gap=(0,0)
600
800 1000
H3K4me2
400
600
800 1000
200
400
600
800 1000
200
400
600
800 1000
200
400
600
800 1000
200
400
600
800 1000
0.000
400
0.007
400
200
0.001
200
0.007
200
GMD=56.380
Gap=(0,0)
400
0.007
200
GMD=37.253
Gap=(0,0)
400
0.001
0.000
200
400
600
800 1000
0.007
0.007
0.006
0.006
0.005
0.005
0.004
0.004
0.003
0.003
0.002
0.002
0.001
0.001
0.000
0.000
200
400
600
800 1000
0.007
0.006
0.005
GMD=110.047
Gap=(0,0)
GMD=190.394
Gap=(0,0)
GMD=135.755
Gap=(0,0)
GMD=58.471
Gap=(0,0)
0.004
H3K4me3
0.003
0.002
0.001
0.000
GMD=46.266
Gap=(0,0)
GMD=108.302
Gap=(0,0)
GMD=33.494
Gap=(0,0)
GMD=99.835
Gap=(0,0)
GMD=156.171
Gap=(0,0)
H3K9me3
Position
Figure 4: Graphical output of “case-chipseq.R”.
150
Heatmap of ChIP−seq data (human CD4+ T cells)
Row
Clusters
0 50
Count
Color Key
0
5
Row
Clustering
15
Value
1
0.77
1 116 264 412 560 708 856
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
5
0.0015
10
k
15
20
0.12
Cluster Explained Variance
EV
0.08
0.004
0.000
Figure 5: Graphical output of “case-chipseq.R”.
13
1 116 264 412 560 708 856
●
●
1 116 264 412 560 708 856
Row group 1 (n=16)
●
●
●
Row group 3 (n=7)
0.0000
Elbow plot
k=3
Explained Variance
0.0
0.4
0.8
0.0000
0.0015
Row group 2 (n=17)
H2AK9ac
CTCF
H4K5ac
H4K8ac
H2BK20ac
H2BK12ac
H2AZ
H3K4ac
H3K36ac
H4K91ac
H2BK120ac
H3K4me3
H3K18ac
H3K9ac
H3K27ac
H2BK5ac
H4K20me1
H3K79me2
H3K79me3
H3K36me3
H3K27me1
H3K79me1
H2BK5me1
H3K4me2
H3K9me1
H3K4me1
H3K23ac
H4K12ac
H4K16ac
H4K20me3
H3K14ac
H3K36me1
H4R3me2
H3R2me1
H3R2me2
H2AK5ac
H3K27me2
H3K9me2
H3K9me3
H3K27me3
1
3
2
H2AK9ac
CTCF
H4K5ac
H4K8ac
H2BK20ac
H2BK12ac
H2AZ
H3K4ac
H3K36ac
H4K91ac
H2BK120ac
H3K4me3
H3K18ac
H3K9ac
H3K27ac
H2BK5ac
H4K20me1
H3K79me2
H3K79me3
H3K36me3
H3K27me1
H3K79me1
H2BK5me1
H3K4me2
H3K9me1
H3K4me1
H3K23ac
H4K12ac
H4K16ac
H4K20me3
H3K14ac
H3K36me1
H4R3me2
H3R2me1
H3R2me2
H2AK5ac
H3K27me2
H3K9me2
H3K9me3
H3K27me3
0.04
3
0.00
2
1
2
Cluster
3
CRAN GMD
A
A.1
Zhao et al
Functionality
An overview
Table 1. Functions of the GMD R package
Function
Description
ghist
gdist
css
Generalized Histogram Computation and Visualization
Generalized Distance Matrix Computation
Computing Clustering Sum-of-Squares and
evaluating the clustering by the “elbow” method
Enhanced Heatmap Representation with Dendrogram and Partition
Computation of GMD on a pair of histograms
Computation of GMD Matrix on a set of histograms
heatmap.3
gmdp
gmdm
14
CRAN GMD
A.2
A.2.1
Zhao et al
ghist: Generic construction and visualization of histograms
Examples using simulated data
example-ghist.R (Source Code 4) is an example on how to construct a histogram object from raw data and
make a visualization based on this.
example-ghist.R
1
2
## load library
require("GMD")
3
4
5
6
7
8
## create two normally-distributed samples
## with unequal means and unequal variances
set.seed(2012)
v1 <- rnorm(1000,mean=-5, sd=10)
v2 <- rnorm(1000,mean=10, sd=5)
9
10
11
12
13
14
15
16
## create common bins
n <- 20 # desired number of bins
breaks <- gbreaks(c(v1,v2),n) # bin boundaries
x <list(ghist(v1,breaks=breaks,digits=0),
ghist(v2,breaks=breaks,digits=0))
mhist.obj <- as.mhist(x)
17
18
19
## plot histograms side-by-side
plot(mhist.obj,mar=c(1.5,1,1,0),main="Histograms of simulated normal distributions")
20
22
23
24
## plot histograms as subplots,
## with corresponding bins aligned
plot(mhist.obj,beside=FALSE,mar=c(1.5,1,1,0),
main="Histograms of simulated normal distributions")
Histograms of simulated normal distributions
1
2
250
200
Count
21
150
100
50
0
−21
−5
Bin
Figure 6: Graphical output of “example-ghist.R”.
15
11
27
CRAN GMD
Zhao et al
Histograms of simulated normal distributions
1
250
200
150
100
50
Count
0
−21
−5
11
27
2
250
200
150
100
50
0
−21
−5
Bin
Figure 7: Graphical output of “example-ghist.R”.
16
11
27
CRAN GMD
A.2.2
Zhao et al
Examples using iris data
case-iris.R (Source Code 5) is a study on how to obtain and visualize histograms, using Fisher’s iris data
set.
case-iris.R
1
2
## load library
require("GMD")
3
4
5
## load data
data(iris)
6
7
8
9
## create common bins
n <- 30
# the number of bins
breaks <- gbreaks(iris[,"Sepal.Length"],n) # the boundary of bins
10
11
12
13
14
15
16
## create a list of histograms
Sepal.Length <list(setosa=ghist(iris[iris$Species=="setosa","Sepal.Length"],breaks=breaks),
versicolor=ghist(iris[iris$Species=="versicolor","Sepal.Length"],breaks=breaks),
virginica=ghist(iris[iris$Species=="virginica","Sepal.Length"],breaks=breaks)
)
17
18
19
## convert to a `hist' object
x <- as.mhist(Sepal.Length)
20
21
22
## get bin-wise summary statistics
summary(x)
23
24
25
26
## visualize the histograms
plot(x,beside=FALSE,
main="Histogram of Sepal.Length of iris",xlab="Sepal.Length (cm)")
17
CRAN GMD
Zhao et al
Histogram of Sepal.Length of iris
setosa
8
6
4
2
0
4.9
5.5
6.1
6.7
7.3
7.9
versicolor
8
Count
6
4
2
0
4.9
5.5
6.1
6.7
7.3
7.9
virginica
8
6
4
2
0
4.9
5.5
6.1
Sepal.Length (cm)
Figure 8: Graphical output of “case-iris.R”.
18
6.7
7.3
7.9
CRAN GMD
A.2.3
Zhao et al
Examples using nottem data
case-nottem.R (Source Code 6) is a study on how to draw histograms side-by-side and to compute and
visualize a bin-wise summary plot, using air temperature data at Nottingham Castle.
case-nottem.R
1
2
## load library
require("GMD")
3
4
5
## load data
data(nottem)
6
7
8
9
class(nottem)
# a time-series (ts) object
x <- ts2df(nottem) # convert ts to data.frame
mhist1 <- as.mhist(x[1:3,])
10
11
12
13
## plot multiple discrete distributions side-by-side
plot(mhist1,xlab="Month",ylab="Degree Fahrenheit",
main="Air temperatures at Nottingham Castle")
14
15
16
17
18
## make summary statistics for each bin
mhist2 <- as.mhist(x)
ms <- mhist.summary(mhist2)
print(ms)
19
21
22
23
## plot bin-wise summary statistics with
## confidence intervals over the bars
plot(ms, main="Mean air temperatures at Nottingham Castle (1920-1939)",
xlab="Month", ylab="Degree Fahrenheit")
Air temperatures at Nottingham Castle
70
1920
1921
1922
65
Degree Fahrenheit
20
60
55
50
45
40
35
Jan
Feb
Mar
Apr
May
Jun
Jul
Month
Figure 9: Graphical output of “case-nottem.R”.
19
Aug
Sep
Oct
Nov
Dec
CRAN GMD
Zhao et al
40
30
20
0
10
Degree Fahrenheit
50
60
Mean air temperatures at Nottingham Castle (1920−1939)
Jan
Feb
Mar
Apr
May
Jun
Jul
Month
Figure 10: Graphical output of “case-nottem.R”.
20
Aug
Sep
Oct
Nov
Dec
CRAN GMD
A.3
Zhao et al
gdist: Generic construction and visualization of distances
example-gdist.R (Source Code 7) is an example on how to measure distances using a user-defined metric,
such as correlation distance and GMD.
example-gdist.R
1
2
3
## load library
require("GMD")
require(cluster)
4
5
6
7
## compute distance using Euclidean metric (default)
data(ruspini)
x <- gdist(ruspini)
8
9
10
11
12
13
## see a dendrogram result by hierarchical clustering
dev.new(width=12, height=6)
plot(hclust(x),
main="Cluster Dendrogram of Ruspini data",
xlab="Observations")
14
15
16
## convert to a distance matrix
m <- as.matrix(x)
17
18
19
20
## convert from a distance matrix
d <- as.dist(m)
stopifnot(d == x)
21
Cluster Dendrogram of Ruspini data
7
Observations
hclust (*, "complete")
Figure 11: Graphical output of “example-gdist.R”.
21
16
18
19
17
14
15
20
11
12
13
1
5
6
4
8
9
10
2
3
44
45
41
42
43
46
47
48
54
50
52
53
49
51
58
59
60
57
55
56
61
64
68
65
67
69
70
71
63
62
66
73
74
72
75
28
27
28
29
30
23
24
21
22
31
25
26
40
36
39
37
38
33
34
32
35
27
150
26
100
25
Height
24
## Use correlations between variables "as distance"
data(USJudgeRatings)
dd <- gdist(x=USJudgeRatings,method="correlation.of.variables")
dev.new(width=12, height=6)
plot(hclust(dd),
main="Cluster Dendrogram of USJudgeRatings data",
xlab="Variables")
50
23
0
22
CRAN GMD
Zhao et al
CONT
0.4
Variables
hclust (*, "complete")
Figure 12: Graphical output of “example-gdist.R”.
22
FAMI
PREP
WRIT
ORAL
RTEN
DECI
CFMG
DILG
PHYS
DMNR
INTG
0.0
0.2
Height
0.6
0.8
1.0
1.2
Cluster Dendrogram of USJudgeRatings data
CRAN GMD
A.4
Zhao et al
css: Clustering Sum-of-Squares and the “elbow” plot: determining the number of clusters in a data set
A good clustering yields clusters where the total within-cluster sum-of-squares (WSSs) is small (i.e. cluster
cohesion, measuring how closely related are objects in a cluster) and the total between-cluster sum-of-squares
(BSSs) is high (i.e. cluster separation, measuring how distinct or well-separated one cluster is from the other).
example-css.R (Source Code 8) is an example on how to make correct choice of k using “elbow criterion”.
A good k is selected according a) how much of the total variance in the whole data that the clusters can
explain, and b) how large gain in explained variance we obtain by using these many clusters compared to one
less or one more, the so-called “elbow” criterion.
The optimal choice of k will strike a balance between maximum compression of the data using a single cluster,
and maximum accuracy by assigning each data point to its own cluster. More important, an ideal k should
also be relevant in terms of what it reveals about the data, which typically cannot be measured by a metric
but by a human expert. Here we present a way to measure such performance of a clustering model, using
squared Euclidean distances. The evaluation is based on pairwise distance matrix and therefore more generic
in a way that doesn’t involve computating the “centers” of the clusters in the raw data, which are often not
available or hard to obtain.
example-css.R
1
2
## load library
require("GMD")
3
4
5
6
7
8
9
10
11
12
## simulate data around 12 points in Euclidean space
pointv <- data.frame(x=c(1,2,2,4,4,5,5,6,7,8,9,9),y=c(1,2,8,2,4,4,5,9,9,8,1,9))
set.seed(2012)
mydata <- c()
for (i in 1:nrow(pointv)){
mydata <- rbind(mydata,cbind(rnorm(10,pointv[i,1],0.1),rnorm(10,pointv[i,2],0.1)))
}
mydata <- data.frame(mydata); colnames(mydata) <- c("x","y")
plot(mydata,type="p",pch=21, main="Simulated data")
13
14
15
16
17
18
19
## determine a "good" k using elbow
dist.obj <- dist(mydata[,1:2])
hclust.obj <- hclust(dist.obj)
css.obj <- css.hclust(dist.obj,hclust.obj)
elbow.obj <- elbow.batch(css.obj)
print(elbow.obj)
20
21
22
23
## make partition given the "good" k
k <- elbow.obj$k; cutree.obj <- cutree(hclust.obj,k=k)
mydata$cluster <- cutree.obj
24
25
26
27
28
29
30
## draw a elbow plot and label the data
dev.new(width=12, height=6)
par(mfcol=c(1,2),mar=c(4,5,3,3),omi=c(0.75,0,0,0))
plot(mydata$x,mydata$y,pch=as.character(mydata$cluster),col=mydata$cluster,cex=0.75,
main="Clusters of simulated data")
plot.elbow(css.obj,elbow.obj,if.plot.new=FALSE)
31
32
33
34
## clustering with more relaxed thresholds (, resulting a smaller "good" k)
elbow.obj2 <- elbow.batch(css.obj,ev.thres=0.90,inc.thres=0.05)
mydata$cluster2 <- cutree(hclust.obj,k=elbow.obj2$k)
35
36
37
38
39
40
dev.new(width=12, height=6)
par(mfcol=c(1,2), mar=c(4,5,3,3),omi=c(0.75,0,0,0))
plot(mydata$x,mydata$y,pch=as.character(mydata$cluster2),col=mydata$cluster2,cex=0.75,
main="Clusters of simulated data")
plot.elbow(css.obj,elbow.obj2,if.plot.new=FALSE)
23
CRAN GMD
Zhao et al
Clusters of simulated data
Elbow plot
●
●
●
●
●
●
●
●
●
●
●
0.98
0.8
0.6
Explained Variance
3
3333
33
1
11111
111
1
7
7777777
7
2
●
●
●
0.2
4
4
4444
444
4
6
0.0
mydata$y
4
2
1
11
1 11111
●
●
4444
4
4444
4444
4444
●
●
●
6
8
6
6
66
6666
66
666
6
66
666
22 2
222
22
22
0.4
55
555
555
55
1.0
k=7
5 55
55555
5
●
8
5
10
mydata$x
15
20
k
A "good" k=7 (EV=0.98) is detected when the EV is no less than 0.95
and the increment of EV is no more than 0.01 for a bigger k.
Figure 13: Graphical output of “example-css.R”.
Clusters of simulated data
Elbow plot
4
4
44
4444
44
●
●
●
●
●
●
●
●
●
●
●
0.94
0.8
0.6
●
●
0.2
3
3
3333
333
1
1111
11
1
11111
111
1
4
6
0.0
5
5555555
5
2
●
●
0.4
Explained Variance
6
mydata$y
4
2
1
11
1 11111
●
●
3333
3
3333
3333
3333
●
●
444
4
44
444
22 2
222
22
22
8
44
444
444
44
1.0
k=5
4 44
44444
4
8
●
5
mydata$x
10
k
A "good" k=5 (EV=0.94) is detected when the EV is no less than 0.9
and the increment of EV is no more than 0.05 for a bigger k.
Figure 14: Graphical output of “example-css.R”.
24
15
20
CRAN GMD
A.5
A.5.1
Zhao et al
heatmap.3: Visualization in cluster analysis, with evaluation
Examples using mtcars data
example-heatmap3a.R (Source Code 9) is an example on how to make a heatmap with summary visualization
of observations.
example-heatmap3a.R
1
2
3
require("GMD")
# load library
data(mtcars)
# load data
x <- as.matrix(mtcars) # data as a matrix
4
5
6
7
8
9
10
11
12
13
14
15
16
17
dev.new(width=10,height=8)
heatmap.3(x)
#
heatmap.3(x, Rowv=FALSE, Colv=FALSE)
#
heatmap.3(x, dendrogram="none")
#
heatmap.3(x, dendrogram="row")
#
heatmap.3(x, dendrogram="row", Colv=FALSE) #
heatmap.3(x, dendrogram="col")
#
heatmap.3(x, dendrogram="col", Rowv=FALSE) #
heatmapOut <heatmap.3(x, scale="column")
#
names(heatmapOut)
#
heatmap.3(x, scale="column", x.center=0)
#
heatmap.3(x, scale="column",trace="column")#
default, with reordering and dendrogram
no reordering and no dendrogram
reordering without dendrogram
row dendrogram with row (and col) reordering
row dendrogram with only row reordering
col dendrogram
col dendrogram with only col reordering
sacled by column
view the list that is returned
colors centered around 0
trun "trace" on
18
19
20
21
22
23
24
## coloring cars (row observations) by brand
brands <- sapply(rownames(x), function(e) strsplit(e," ")[[1]][1])
names(brands) <- c()
brands.index <- as.numeric(as.factor(brands))
RowIndividualColors <- rainbow(max(brands.index))[brands.index]
heatmap.3(x, scale="column", RowIndividualColors=RowIndividualColors)
25
26
27
## coloring attributes (column features) randomly (just for a test :)
heatmap.3(x, scale="column", ColIndividualColors=rainbow(ncol(x)))
28
29
30
31
32
33
34
35
## add a single plot for all row individuals
dev.new(width=12,height=8)
expr1 <- list(quote(plot(row.data[rowInd,"hp"],1:nrow(row.data),xlab="hp",ylab="",
main="Gross horsepower",yaxt="n")),
quote(axis(2,1:nrow(row.data),rownames(row.data)[rowInd],las=2)))
heatmap.3(x, scale="column", plot.row.individuals=TRUE, row.data=x,
plot.row.individuals.list=list(expr1))
36
37
38
39
40
41
42
43
## add a single plot for all col individuals
dev.new(width=12,height=8)
expr2 <- list(quote(plot(colMeans(col.data)[colInd], xlab="", ylab="Mean",xaxt="n",
main="Mean features",cex=1,pch=19)),
quote(axis(1,1:ncol(col.data),colnames(col.data)[colInd],las=2)))
heatmap.3(x, scale="column", plot.col.individuals=TRUE, col.data=x,
plot.col.individuals.list=list(expr2))
44
45
46
47
48
49
50
51
52
53
## add another single plot for all col individuals
dev.new(width=12,height=8)
expr3 <- list(quote(op <- par(mar = par("mar")*1.5)),
quote(mytmp.data <- apply(col.data,2,function(e) e/sum(e))),
quote(boxplot(mytmp.data[,colInd], xlab="", ylab="Value",
main="Boxplot of normalized column features")),
quote(par(op)))
heatmap.3(x, scale="column", plot.col.individuals=TRUE, col.data=x,
plot.col.individuals.list=list(expr3))
25
CRAN GMD
Zhao et al
30
Heatmap
0
Count
60
Color Key
−2
0
1
2
3
4
Value
Cadillac Fleetwood
Lincoln Continental
Chrysler Imperial
Pontiac Firebird
Hornet Sportabout
Duster 360
Camaro Z28
Ford Pantera L
Maserati Bora
Merc 450SL
Merc 450SE
Merc 450SLC
Dodge Challenger
AMC Javelin
Hornet 4 Drive
Valiant
Toyota Corona
Porsche 914−2
Datsun 710
Volvo 142E
Merc 230
Lotus Europa
Merc 240D
Merc 280
Merc 280C
Mazda RX4 Wag
Mazda RX4
Ferrari Dino
Fiat 128
Fiat X1−9
Toyota Corolla
disp
hp
mpg
qsec
gear
drat
wt
carb
cyl
am
vs
Honda Civic
Figure 15: Graphical output of “example-heatmap3a.R”.
30
Heatmap
Row
Individuals
0
Count
60
Color Key
−2
0
2
4
Value
Gross horsepower
Cadillac Fleetwood
Lincoln Continental
Chrysler Imperial
Pontiac Firebird
Hornet Sportabout
Duster 360
Camaro Z28
Ford Pantera L
Maserati Bora
Merc 450SL
Merc 450SE
Merc 450SLC
Dodge Challenger
AMC Javelin
Hornet 4 Drive
Valiant
Toyota Corona
Porsche 914−2
Datsun 710
Volvo 142E
Merc 230
Lotus Europa
Merc 240D
Merc 280
Merc 280C
Mazda RX4 Wag
Mazda RX4
Ferrari Dino
Fiat 128
Fiat X1−9
Toyota Corolla
Cadillac Fleetwood
Lincoln Continental
Chrysler Imperial
Pontiac Firebird
Hornet Sportabout
Duster 360
Camaro Z28
Ford Pantera L
Maserati Bora
Merc 450SL
Merc 450SE
Merc 450SLC
Dodge Challenger
AMC Javelin
Hornet 4 Drive
Valiant
Toyota Corona
Porsche 914−2
Datsun 710
Volvo 142E
Merc 230
Lotus Europa
Merc 240D
Merc 280
Merc 280C
Mazda RX4 Wag
Mazda RX4
Ferrari Dino
Fiat 128
Fiat X1−9
Toyota Corolla
Honda Civic
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Figure 16: Graphical output of “example-heatmap3a.R”.
26
disp
hp
mpg
qsec
gear
drat
wt
carb
cyl
vs
am
Honda Civic
50
100
150
200
hp
250
300
Zhao et al
Color Key
0 40
Count
CRAN GMD
−2
Heatmap
0
1
2
3
4
Value
Cadillac Fleetwood
Lincoln Continental
Chrysler Imperial
Pontiac Firebird
Hornet Sportabout
Duster 360
Camaro Z28
Ford Pantera L
Maserati Bora
Merc 450SL
Merc 450SE
Merc 450SLC
Dodge Challenger
AMC Javelin
Hornet 4 Drive
Valiant
Toyota Corona
Porsche 914−2
Datsun 710
Volvo 142E
Merc 230
Lotus Europa
Merc 240D
Merc 280
Merc 280C
Mazda RX4 Wag
Mazda RX4
Ferrari Dino
Fiat 128
Fiat X1−9
Toyota Corolla
disp
hp
mpg
qsec
gear
drat
wt
carb
cyl
vs
am
Honda Civic
Mean features
Mean
150
Column
Individuals
●
●
●
disp
carb
●
hp
cyl
●
mpg
vs
●
qsec
●
gear
●
drat
●
wt
●
am
0
50
●
Color Key
0 40
Count
Figure 17: Graphical output of “example-heatmap3a.R”.
−2
Heatmap
0
1
2
3
4
Value
Cadillac Fleetwood
Lincoln Continental
Chrysler Imperial
Pontiac Firebird
Hornet Sportabout
Duster 360
Camaro Z28
Ford Pantera L
Maserati Bora
Merc 450SL
Merc 450SE
Merc 450SLC
Dodge Challenger
AMC Javelin
Hornet 4 Drive
Valiant
Toyota Corona
Porsche 914−2
Datsun 710
Volvo 142E
Merc 230
Lotus Europa
Merc 240D
Merc 280
Merc 280C
Mazda RX4 Wag
Mazda RX4
Ferrari Dino
Fiat 128
Fiat X1−9
Toyota Corolla
disp
hp
mpg
qsec
gear
drat
wt
carb
cyl
vs
am
Honda Civic
Value
0.00 0.04 0.08
Column
Individuals
Boxplot of normalized column features
●
●
●
●
am
vs
cyl
carb
wt
Figure 18: Graphical output of “example-heatmap3a.R”.
27
drat
gear
qsec
mpg
hp
disp
CRAN GMD
A.5.2
Zhao et al
Examples using ruspini data
example-heatmap3b.R (Source Code 10) is an example on how to make a heatmap with summary visualization
of clusters.
example-heatmap3b.R
1
2
3
## load library
require("GMD")
require(cluster)
4
5
6
## load data
data(ruspini)
7
8
9
10
11
12
13
14
## heatmap on a `dist' object
x <- gdist(ruspini)
main <- "Heatmap of Ruspini data"
dev.new(width=10,height=10)
heatmap.3(x, main=main, mapratio=1) # default with a title and a map in square!
heatmap.3(x, main=main, revC=TRUE) # reverse column for a symmetric look
heatmap.3(x, main=main, kr=2, kc=2) # show partition by predefined number of clusters
15
16
17
18
19
20
21
## show partition by elbow
css.multi.obj <- css.hclust(x,hclust(x))
elbow.obj <- elbow.batch(css.multi.obj,ev.thres=0.90,inc.thres=0.05)
heatmap.3(x, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k)
heatmap.3(x, main=main, sub=sub("\n"," ",attr(elbow.obj,"description")), cex.sub=1.25,
revC=TRUE,kr=elbow.obj$k, kc=elbow.obj$k) # show elbow info as subtitle
22
23
24
25
26
27
28
29
## side plot for every row clusters
dev.new(width=10,height=10)
expr1 <- list(quote(plot(do.call(rbind,i.x),xlab="x",ylab="y",
xlim=range(ruspini$x),ylim=range(ruspini$y),)))
heatmap.3(x, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k, trace="none",
row.data=as.list(data.frame(t(ruspini))),
plot.row.clusters=TRUE,plot.row.clusters.list=list(expr1))
30
31
32
33
34
35
36
37
## side plot for every col clusters
dev.new(width=10,height=10)
expr2 <- list(quote(plot(do.call(rbind,i.x),xlab="x",ylab="y",
xlim=range(ruspini$x),ylim=range(ruspini$y),)))
heatmap.3(x, main=main, revC=TRUE, kr=elbow.obj$k, kc=elbow.obj$k, trace="none",
col.data=as.list(data.frame(t(ruspini))),
plot.col.clusters=TRUE,plot.col.clusters.list=list(expr2))
38
28
CRAN GMD
Zhao et al
Heatmap of Ruspini data
400
0
Count
Color Key
0
Row
Clusters
A "good" k=4 (EV=0.93) is detected when the EV is no less than 0.9 and the increment of EV is no more than 0.05 for a bigger k.
100
Value
4
150
Row group 2 (n=23)
0 50
y
●●●●●
●●
●
●●
●●
●●
●● ●●
●● ●
●
150
0
20 40 60 80
120
x
Row group 3 (n=17)
● ●
●
●
●
●
●●
●● ●●
0 50
y
●
●
●● ●
0 50
y
150
0
20 40 60 80
120
x
Row group 1 (n=20)
●
●● ●● ●
● ●● ●
●●
●
● ●● ● ●●
●
150
0
20 40 60 80
120
x
Row group 4 (n=15)
0
●●● ●●● ●
● ●
●●●●●
●
20 40 60 80
x
63
66
62
61
64
68
74
73
75
72
65
71
70
69
67
20
11
13
12
7
6
8
4
16
19
18
17
14
15
10
9
3
2
5
1
46
48
47
57
55
56
58
59
60
45
44
54
50
52
53
49
51
28
27
30
29
22
21
23
24
41
42
43
37
38
34
33
40
36
39
35
32
31
26
25
4
1
3
2
63
66
62
61
64
68
74
73
75
72
65
71
70
69
67
20
11
13
12
7
6
8
4
16
19
18
17
14
15
10
9
3
2
5
1
46
48
47
57
55
56
58
59
60
45
44
54
50
52
53
49
51
28
27
30
29
22
21
23
24
41
42
43
37
38
34
33
40
36
39
35
32
31
26
25
y
1
0 50
3
2
0
40
x
80
120
40
x
80
120
0
150
Col group 4 (n=15)
made by
Function: heatmap.3
Package: GMD
in R
100
150
y
50
50
y
0
●
●●●●
●● ● ●
●● ●
●
●●●
● ● ●●
0
50
y
●●●
Col group 1 (n=20)
●●
●●●●●
●
● ●
●● ● ●
●
0
100
● ●
● ●
●
●
●
● ●
●●
●
●●
100
150
Col group 3 (n=17)
0
50
y
100
●●●●●●●
●●●
●●
●●
●●● ●
●●●
●
●
0
Col
Clusters
150
Col group 2 (n=23)
40
Figure 19: Graphical output of “example-heatmap3b.R”.
29
x
80
120
0
40
x
80
120
120
CRAN GMD
B
Zhao et al
Data
B.1
GMD dataset overview
> data(package="GMD")
Data sets in package 'GMD':
cage
cagel
chipseq_hCD4T
chipseq_mES
B.2
CAGE Data
CAGE Data
ChIP-seq Data
ChIP-seq Data
CAGE data: cage and cagel
> help(cage)
30
CRAN GMD
Zhao et al
> require(GMD)
> data(cage)
> class(cage)
[1] "list"
> length(cage)
[1] 20
> names(cage)
[1]
[3]
[5]
[7]
[9]
[11]
[13]
[15]
[17]
[19]
"NA (T01F029805F8)"
"Cyp4a14 (T04R06C91673)"
"D230039L06Rik (T01F0AA465EB)"
"Rab5c (T11R05FBC6C4)"
"Tpt1 (T14F04079189)"
"D0H4S114 (T18R020553F0)"
"Hsd11b1 (T01R0B8305BD)"
"B2m (T02F0743FF05)"
"Pfkfb3 (T02R00AEC2D8)"
"Cd72 (T04R028B8BC9)"
"Glul (T01F092C2995)"
"Stxbp4 (T11R05607FD4)"
"Gas5 (T01F09995479)"
"BC003940 (T11R072A6CB0)"
"Pcna (T02R07DE319B)"
"Gsto1 (T19F02D03566)"
"Csf1 (T03R0672174D)"
"Alox5ap (T05F08BCF2C4)"
"Hig1 (T09R0743763C)"
"Egln1 (T08R0769239F)"
> data(cagel)
> names(cagel)
[1]
[3]
[5]
[7]
[9]
[11]
[13]
[15]
[17]
[19]
[21]
[23]
[25]
[27]
[29]
[31]
[33]
[35]
[37]
[39]
[41]
[43]
[45]
[47]
[49]
"NA (T17F05912B83)"
"B2m (T02F0743FF05)"
"2600001A11Rik (T12R043A2595)"
"Rpl41 (T10R07AB7138)"
"Dscr1l1 (T17F02802885)"
"Rab5c (T11R05FBC6C4)"
"Csf1 (T03R0672174D)"
"Crim1 (T17F04928998)"
"Rai17 (T14F014BF473)"
"Apbb2 (T05R03E329C8)"
"Tmeff2 (T01F030EC757)"
"Hsd11b1 (T01R0B8305BD)"
"4930430J02Rik (T09F036E80C6)"
"Trpv2 (T11F03B4EBD8)"
"Scd1 (T19R029B5186)"
"5730596K20Rik (T19F006DFC1A)"
"Klhl5 (T05F03CCA673)"
"NA (T02R07EF5EDA)"
"Nudt7 (T08F06C3B651)"
"Egln1 (T08R0769239F)"
"BC003940 (T11R072A6CB0)"
"Alox5ap (T05F08BCF2C4)"
"Gch1 (T14R02602138)"
"Vrk1 (T12F06010C9B)"
"Wdtc1 (T04R07DAFEDC)"
"Tpt1 (T14F04079189)"
"Grn (T11F0615F289)"
"Rbbp7 (T0XF091A7ACA)"
"H2afy (T13R034ACF47)"
"Ckap4 (T10R0504CE97)"
"Pfkfb3 (T02R00AEC2D8)"
"Ctsb (T14F0348EDBA)"
"3930401E15Rik (T18R02CDD141)"
"Hig1 (T09R0743763C)"
"Ptn (T06R0230806E)"
"Mrps6 (T16F0583C906)"
"D0H4S114 (T18R020553F0)"
"Phtf2 (T05R0125E896)"
"Slco3a1 (T07R03AC06B9)"
"Ctxn (T08R0040864A)"
"Arbp (T05F06BBE13B)"
"Gsto1 (T19F02D03566)"
"Srpk1 (T17R019F4A41)"
"Tnfrsf10b (T14F03AB1306)"
"9630050M13Rik (T02F002EC972)"
"Ppia (T11F00604AFF)"
"Pcna (T02R07DE319B)"
"Yap1 (T09R0079F3FF)"
"Cd72 (T04R028B8BC9)"
"Centg2 (T01F055392D1)"
31
CRAN GMD
B.3
Zhao et al
ChIP-seq data: chipseq_mES and chipseq_hCD4T
> help(chipseq)
> data(chipseq_mES)
> class(chipseq_mES)
[1] "list"
> length(chipseq_mES)
[1] 6
> names(chipseq_mES)
[1] "H3K27me3" "H3K36me3" "H3K4me1"
"H3K4me2"
"H3K4me3"
"H3K9me3"
> data(chipseq_hCD4T)
> names(chipseq_hCD4T)
[1]
[7]
[13]
[19]
[25]
[31]
[37]
"CTCF"
"H2BK20ac"
"H3K27ac"
"H3K36me3"
"H3K79me2"
"H3R2me1"
"H4K5ac"
"H2AK5ac"
"H2BK5ac"
"H3K27me1"
"H3K4ac"
"H3K79me3"
"H3R2me2"
"H4K8ac"
"H2AK9ac"
"H2BK5me1"
"H3K27me2"
"H3K4me1"
"H3K9ac"
"H4K12ac"
"H4K91ac"
"H2AZ"
"H3K14ac"
"H3K27me3"
"H3K4me2"
"H3K9me1"
"H4K16ac"
"H4R3me2"
32
"H2BK120ac"
"H3K18ac"
"H3K36ac"
"H3K4me3"
"H3K9me2"
"H4K20me1"
"H2BK12ac"
"H3K23ac"
"H3K36me1"
"H3K79me1"
"H3K9me3"
"H4K20me3"
CRAN GMD
Zhao et al
References
[1] Artem Barski, Suresh Cuddapah, Kairong Cui, Tae-Young Roh, Dustin E Schones, Zhibin Wang, Gang
Wei, Iouri Chepelev, and Keji Zhao. High-resolution profiling of histone methylations in the human genome.
Cell, 129(4):823–837, May 2007.
[2] Piero Carninci, Albin Sandelin, Boris Lenhard, Shintaro Katayama, Kazuro Shimokawa, Jasmina Ponjavic, Colin A M Semple, Martin S Taylor, PÂŁr G EngstrÂŽm, Martin C Frith, Alistair R R Forrest,
Wynand B Alkema, Sin Lam Tan, Charles Plessy, Rimantas Kodzius, Timothy Ravasi, Takeya Kasukawa,
Shiro Fukuda, Mutsumi Kanamori-Katayama, Yayoi Kitazume, Hideya Kawaji, Chikatoshi Kai, Mari
Nakamura, Hideaki Konno, Kenji Nakano, Salim Mottagui-Tabar, Peter Arner, Alessandra Chesi, Stefano Gustincich, Francesca Persichetti, Harukazu Suzuki, Sean M Grimmond, Christine A Wells, Valerio
Orlando, Claes Wahlestedt, Edison T Liu, Matthias Harbers, Jun Kawai, Vladimir B Bajic, David A Hume,
and Yoshihide Hayashizaki. Genome-wide analysis of mammalian promoter architecture and evolution. Nat
Genet, 38(6):626–635, Jun 2006.
[3] Sung-Hyuk Cha and Sargur N. Srihari. On measuring the distance between histograms. Pattern Recognition,
35(6):1355–1370, 2002.
[4] Tarjei S Mikkelsen, Manching Ku, David B Jaffe, Biju Issac, Erez Lieberman, Georgia Giannoukos, Pablo
Alvarez, William Brockman, Tae-Kyung Kim, Richard P Koche, William Lee, Eric Mendenhall, Aisling O’Donovan, Aviva Presser, Carsten Russ, Xiaohui Xie, Alexander Meissner, Marius Wernig, Rudolf
Jaenisch, Chad Nusbaum, Eric S Lander, and Bradley E Bernstein. Genome-wide maps of chromatin state
in pluripotent and lineage-committed cells. Nature, 448(7153):553–560, Aug 2007.
[5] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria, 2011. ISBN 3-900051-07-0.
[6] Albin Sandelin, Piero Carninci, Boris Lenhard, Jasmina Ponjavic, Yoshihide Hayashizaki, and David A
Hume. Mammalian rna polymerase ii core promoters: insights from genome-wide studies. Nat Rev Genet,
8(6):424–436, Jun 2007.
[7] Xiaobei Zhao, Eivind Valen, Brian J Parker, and Albin Sandelin. Systematic clustering of transcription
start site landscapes. PLoS ONE, 6(8):e23409, August 2011.
[8] Vicky W Zhou, Alon Goren, and Bradley E Bernstein. Charting histone modifications and the functional
organization of mammalian genomes. Nat Rev Genet, 12(1):7–18, Jan 2011.
33