Strategies for Identifying Statistically Significant Dense
Regions in Microarray Data

Supplementary Materials

Andy M. Yip, Michael K. Ng, Edmond H. Wu and Tony F. Chan


Frequency Distribution of the Size of Dense Regions

We use data matrices of size 20 × 200. Four kinds of random binary matrices are generated, with sparsity (the percentage of entries equal to 1) of 10%, 20%, 30% and 40% respectively. The DRIFT algorithm (with minimum size threshold 1 × 1) is then applied to each of the four kinds of matrices, and the sizes of the identified dense regions are recorded. This procedure is repeated 10 times and the counts are averaged. The resulting average frequency distributions of the sizes, depicted as heat maps, are shown in Figure 1 below.
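The simulation loop described above can be sketched as follows. This is only an illustrative harness, not the authors' code: the function names are hypothetical, the matrix entries are assumed to be independent draws that equal 1 with probability equal to the sparsity, and `row_runs` is a toy stand-in for DRIFT (it only reports horizontal runs of 1's) so that the sketch runs on its own.

```python
import random
from collections import Counter


def random_binary_matrix(n_rows, n_cols, sparsity, rng):
    """Binary matrix whose entries are 1 with probability `sparsity` (assumed i.i.d.)."""
    return [[1 if rng.random() < sparsity else 0 for _ in range(n_cols)]
            for _ in range(n_rows)]


def row_runs(matrix):
    """Toy stand-in, NOT DRIFT: report each maximal horizontal run of 1's as a (1, k) size."""
    sizes = []
    for row in matrix:
        run = 0
        for value in row:
            if value == 1:
                run += 1
            elif run:
                sizes.append((1, run))
                run = 0
        if run:
            sizes.append((1, run))
    return sizes


def average_size_counts(find_regions, n_rows, n_cols, sparsity, repeats=10, seed=0):
    """Average, over `repeats` random matrices, how many regions of each size are found.

    `find_regions` is a placeholder for the dense-region search; it must return
    an iterable of (height, width) pairs for a given matrix.
    """
    rng = random.Random(seed)
    totals = Counter()
    for _ in range(repeats):
        totals.update(find_regions(random_binary_matrix(n_rows, n_cols, sparsity, rng)))
    return {size: count / repeats for size, count in totals.items()}


if __name__ == "__main__":
    for sparsity in (0.10, 0.20, 0.30, 0.40):
        counts = average_size_counts(row_runs, 20, 200, sparsity)
        largest = max(counts, key=lambda size: size[0] * size[1])
        print(f"sparsity={sparsity:.0%}: {len(counts)} distinct sizes, largest {largest}")
```

Replacing `row_runs` with a real dense-region search would give average counts per (height, width) pair, which is the kind of table a heat map of sizes is drawn from.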

Figure 1. Frequency distributions of the sizes of dense regions in random matrices with size 200 × 20.
(a) Sparsity = 10%. (b) Sparsity = 20%. (c) Sparsity = 30%. (d) Sparsity = 40%.


Summary of Findings:

We found that the size of the dense regions generally increases with the sparsity. More importantly, most dense regions are very small: for instance, when the sparsity is 30%, more than 99% of the dense regions are smaller than 5 × 5.


An Additional Example: Primate Brain Samples

We employ the dataset consisting of gene expression measurements of 23 primate brain samples (7 human, 8 chimpanzee, 8 rhesus macaque) studied by Cáceres et al. (2003). Oligonucleotide microarrays were used to measure the expression levels of ~10000 genes simultaneously. The purpose of the study was to explain phenotypic differences between humans and chimpanzees at the level of gene regulation, using macaques as an outgroup, given that the two species have ~99% of their DNA sequences in common.

For illustration purposes, a subset of 376 genes is selected based on the coefficient of variation and the percentage of present calls generated by the dChip 1.3 software. Next, model-based expression indices are calculated and all replicates are pooled, resulting in a dataset with 13 samples and 376 genes. Each gene is then normalized to have mean 0 and standard deviation 1 across the samples. Finally, the values are rounded to the nearest integer.
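The last two steps (per-gene normalization followed by rounding) can be sketched as below. This is a minimal illustration rather than the authors' pipeline: the function name is hypothetical, the sample standard deviation (n - 1 denominator) is an assumption since the text does not say which estimator was used, and constant genes are mapped to zeros to avoid division by zero.

```python
import statistics


def standardize_and_round(expression):
    """Normalize each gene (row) to mean 0 and SD 1 across samples, then round to integers.

    `expression` is a list of rows, one row per gene and one column per sample.
    Assumptions: sample standard deviation; a gene with zero variance becomes all zeros.
    Python's round() sends exact .5 ties to the nearest even integer.
    """
    categorized = []
    for gene in expression:
        mean = statistics.fmean(gene)
        sd = statistics.stdev(gene)
        if sd == 0:
            categorized.append([0] * len(gene))
        else:
            categorized.append([round((x - mean) / sd) for x in gene])
    return categorized


if __name__ == "__main__":
    toy = [[1.0, 2.0, 3.0],
           [5.0, 5.0, 5.0]]
    print(standardize_and_round(toy))
```

After this step most entries fall in a small set of integer levels such as -1, 0 and 1, which is what makes a search for regions with a single dominant value meaningful.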

We apply our algorithms to find DRs in the dataset. The results are shown in Figure 2. The heat map of the dataset is shown in (a) where the genes and the samples are ordered according to the results of average linkage hierarchical clustering. Three sample DRs identified by £gDRIFT are illustrated in (b)-(d). Moreover, we apply the annotation tool DAVID 2.0 to find the functional categories of genes in each DR.

Summary of Findings:

The region in (b) suggests that most of the genes under study are down-regulated in the macaque brain samples (Mm1-Mm4). Among the 326 genes in this region, 130 (39.9%) are involved in metabolism and 96 (29.4%) in cellular physiological process.

The region in (c) consists mostly of genes that are down-regulated in two of the human samples (Hs1-Hs2) but not in the other human samples (Hs3-Hs5). Hs1 and Hs2 differ from the other three human samples in two ways: (i) they had longer postmortem intervals (~13 hrs), so degradation of the RNA may have been more pronounced; (ii) they were collected from a different region (the frontal pole), which may show a different pattern of gene expression than samples collected from other regions. If (i) is the main reason Hs1 and Hs2 deviate from Hs3-Hs5, then one may want to discard Hs1 and Hs2 before any further analysis. It is therefore helpful to look at the functional categories of the genes in this region: 15 genes (31.9%) are involved in metabolism and 14 genes (29.8%) in cellular physiological process.

The region in (d) consists of genes that are consistently up-regulated in the human samples Hs3-Hs5 but not in the other samples. This gives a list of candidate genes for analyzing the difference between humans and chimpanzees while reducing the influence of effects (i) and (ii) in Hs1 and Hs2 mentioned above. Of the 59 genes in this region, 30 (50.8%) are involved in metabolism and 17 (28.8%) in cellular physiological process.

Figure 2. (a) Heat map (ordered according to the results of average linkage hierarchical clustering) of the expression levels of 376 genes and 13 samples. (b) A 90%-dense region with value -1 (326 genes, 4 samples). (c) A 90%-dense region with value -1 (47 genes, 2 samples). (d) A 90%-dense region with value 1 (59 genes, 3 samples).
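To make the "90%-dense region with value v" wording concrete, the sketch below checks a candidate region under one plausible reading: at least 90% of the entries in the selected genes × samples submatrix equal v. This reading is an assumption, since the formal definition is given in the main paper and may impose further conditions (for example per row or per column). The function names are hypothetical.

```python
def region_density(matrix, rows, cols, value):
    """Fraction of entries equal to `value` in the submatrix selected by `rows` and `cols`."""
    if not rows or not cols:
        raise ValueError("a region needs at least one row and one column")
    hits = sum(1 for r in rows for c in cols if matrix[r][c] == value)
    return hits / (len(rows) * len(cols))


def is_dense_region(matrix, rows, cols, value, threshold=0.9):
    """True if the region's density for `value` reaches `threshold` (assumed definition)."""
    return region_density(matrix, rows, cols, value) >= threshold


if __name__ == "__main__":
    # Toy categorized matrix: rows are genes, columns are samples.
    toy = [[-1, -1, 0],
           [-1, -1, 1],
           [-1, 1, 1]]
    print(is_dense_region(toy, [0, 1], [0, 1], -1))     # all four entries equal -1
    print(is_dense_region(toy, [0, 1, 2], [0, 1], -1))  # 5 of 6 entries equal -1
```

The row and column index lists need not be contiguous, which matches regions defined by arbitrary subsets of genes and samples rather than by adjacent blocks in the heat map.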

April 12, 2005

Please send your suggestions and comments to: mng@maths.hku.hk