We use data matrices of size 20 × 200. Four kinds of random binary matrices, with sparsity (percentage of 1's) 10%, 20%, 30% and 40% respectively, are generated. The DRIFT algorithm (with minimum size threshold 1 × 1) is then applied to each of these four kinds of matrices, and the sizes of the identified dense regions are recorded. This procedure is repeated 10 times and the counts are averaged. The resulting average frequency distributions of the sizes, depicted as heat maps, are shown in Figure 1 below.
Figure 1. Frequency distributions of the sizes of dense regions in random matrices with size 200 × 20. (a) Sparsity = 10%. (b) Sparsity = 20%. (c) Sparsity = 30%. (d) Sparsity = 40%.
We found that the size of the dense regions generally increases with the sparsity. More importantly, most dense regions have very small sizes; for instance, when the sparsity is 30%, more than 99% of the dense regions are smaller than 5 × 5.
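The simulation setup can be sketched as follows. This is a minimal illustration, not the code used for the experiment: it assumes that each entry is set to 1 independently with probability equal to the sparsity, and the function names `random_binary_matrix`, `fraction_of_ones` and `prob_block_all_ones` are hypothetical. The DRIFT step itself is not reproduced here. The last function only shows that, under the independence assumption, the chance that one fixed block is entirely 1's decays geometrically with its area.

```python
import random


def random_binary_matrix(rows, cols, sparsity, rng):
    """Return a rows x cols 0/1 matrix.

    Assumption: each entry is 1 independently with probability `sparsity`.
    """
    return [[1 if rng.random() < sparsity else 0 for _ in range(cols)]
            for _ in range(rows)]


def fraction_of_ones(matrix):
    """Empirical sparsity: the fraction of entries equal to 1."""
    total = sum(len(row) for row in matrix)
    return sum(sum(row) for row in matrix) / total


def prob_block_all_ones(sparsity, height, width):
    """Probability that one fixed height x width block is entirely 1's,
    assuming independent entries."""
    return sparsity ** (height * width)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
for sparsity in (0.10, 0.20, 0.30, 0.40):
    # Ten repetitions per sparsity level, as in the experiment described above.
    observed = [fraction_of_ones(random_binary_matrix(20, 200, sparsity, rng))
                for _ in range(10)]
    average = sum(observed) / len(observed)
    print(f"target sparsity {sparsity:.2f}, average observed {average:.3f}, "
          f"P(fixed 5 x 5 block all 1's) = {prob_block_all_ones(sparsity, 5, 5):.3g}")
```

In an actual run, each generated matrix would be passed to the DRIFT implementation and the sizes of the returned regions tallied before averaging.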
We employ the dataset of gene expression measurements from 23 primate brain samples (7 human, 8 chimpanzee, 8 Rhesus macaque) studied by Cáceres et al. (2003). Oligonucleotide microarrays were used to measure the expression levels of ~10000 genes simultaneously. The purpose of the study was to explain, at the level of gene regulation, the phenotypic differences between humans and chimpanzees, using macaques as an outgroup. These differences exist despite the fact that the two species have ~99% of their DNA sequences in common.
For illustration purposes, a subset of 376 genes is selected based on the coefficient of variation and the percentage of present calls generated by the dChip 1.3 software. Next, model-based expression indices are calculated and all replicates are pooled, resulting in a dataset with 13 samples and 376 genes. Each gene is then normalized to have mean 0 and standard deviation 1 across the samples. Finally, the values are rounded off to integers.
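The per-gene normalization and rounding described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `standardize_and_round` is hypothetical, the sample standard deviation is used (the text does not say whether the sample or population form was applied), and Python's built-in `round` resolves exact ties to the nearest even integer, which may differ from the rounding rule used originally.

```python
import statistics


def standardize_and_round(expression):
    """Normalize each gene to mean 0 and standard deviation 1 across samples,
    then round to integers.

    `expression` is a list of genes; each gene is a list of values,
    one per sample.
    """
    discretized = []
    for gene in expression:
        mean = statistics.fmean(gene)
        sd = statistics.stdev(gene)  # assumption: sample standard deviation
        if sd == 0:
            # A constant gene carries no variation; map it to all zeros.
            discretized.append([0] * len(gene))
            continue
        discretized.append([round((value - mean) / sd) for value in gene])
    return discretized


# Toy example with two genes measured in five samples (made-up values).
toy = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [7.0, 7.0, 7.0, 7.0, 7.0],
]
print(standardize_and_round(toy))
```

After this step every entry is a small integer such as -1, 0 or 1, which is what allows a region to be described by a single value.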
We apply our algorithms to find DRs in the dataset. The results are shown in Figure 2. The heat map of the dataset is shown in (a), where the genes and the samples are ordered according to the results of average linkage hierarchical clustering. Three sample DRs identified by μDRIFT are illustrated in (b)-(d). Moreover, we apply the annotation tool DAVID 2.0 to find the functional categories of the genes in each DR.
The region in (b) suggests that most of the genes under study are down-regulated in the macaque brain samples (Mm1-Mm4). Of the 326 genes in this region, 130 (39.9%) are involved in metabolism and 96 (29.4%) in cellular physiological process.
The region in (c) consists mostly of genes that are down-regulated in two of the human samples (Hs1-Hs2) but not in the other human samples (Hs3-Hs5). Hs1 and Hs2 differ from the other three human samples in two ways: (i) they had longer postmortem intervals (~13 hrs), so the degradation of the RNA samples may have been more pronounced; (ii) they were collected from a different region (the frontal pole), which may show a different pattern of gene expression than samples collected from other regions. If (i) is the main cause of the deviation of Hs1 and Hs2 from Hs3-Hs5, then one may want to discard Hs1 and Hs2 before any further analysis. It is therefore helpful to look at the functional categories of the genes in this region: 15 genes (31.9%) are involved in metabolism and 14 genes (29.8%) in cellular physiological process.
The region in (d) consists of genes that are consistently up-regulated in the human samples Hs3-Hs5 but not in the other samples. This gives a list of candidate genes for analyzing the difference between humans and chimpanzees while reducing the effects (i) and (ii) of Hs1 and Hs2 mentioned above. Of the 59 genes in this region, 30 (50.8%) are involved in metabolism and 17 (28.8%) in cellular physiological process.
Figure 2. (a) Heat map (ordered according to the results of average linkage hierarchical clustering) of the expression levels of 376 genes and 13 samples. (b) A 90%-dense region with value -1 (326 genes, 4 samples). (c) A 90%-dense region with value -1 (47 genes, 2 samples). (d) A 90%-dense region with value 1 (59 genes, 3 samples).
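The caption describes each region as "90%-dense" with a given value. One plausible reading, assumed here and not confirmed by the text, is that at least 90% of the entries in the selected genes × samples submatrix equal that value. The sketch below checks this for a candidate region; the function names `region_density` and `is_dense_region` are hypothetical, and the actual definition used by the algorithm may impose further conditions (for example on individual rows and columns).

```python
def region_density(matrix, rows, cols, value):
    """Fraction of entries equal to `value` in the submatrix picked out by
    the row indices `rows` and column indices `cols` (not necessarily
    contiguous)."""
    matches = sum(1 for r in rows for c in cols if matrix[r][c] == value)
    return matches / (len(rows) * len(cols))


def is_dense_region(matrix, rows, cols, value, threshold=0.9):
    """Assumed criterion: the region is dense if at least `threshold` of its
    entries equal `value`."""
    return region_density(matrix, rows, cols, value) >= threshold


# Toy discretized expression matrix: 3 genes (rows) x 4 samples (columns).
toy = [
    [-1, -1, 0, 1],
    [-1, -1, 1, 0],
    [-1, 0, -1, 1],
]
print(is_dense_region(toy, rows=[0, 1], cols=[0, 1], value=-1))     # all 4 entries are -1
print(is_dense_region(toy, rows=[0, 1, 2], cols=[0, 1], value=-1))  # 5 of 6 entries are -1
```

A check of this kind only verifies a given region; finding such regions in the first place is the job of the search algorithm.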
April 12, 2005
Please send your suggestions and comments to: mng@maths.hku.hk