Basic statistics for analysis

In this part, we will practise basic statistical analysis for RNASeq data. Please note, these analysis approaches are general and can be applied to any type of data sets.

In most cases, the input data format is a matrix. In this exercise, we will use a TCGA glioblastoma dataset. The original data source is from https://tcga-data.nci.nih.gov/docs/publications/gbm_exp/.

We will use the following two files:

TCGA_unified_CORE_ClaNC840.txt: Expression values and subtype calls for the Core TCGA samples using the unified scaled data
unifiedScaled.txt: unified gene expression estimate for 202 samples and 11,861 genes

In this analysis, researchers predicted four subtypes based on RNASeq data. Here we assume the predicted subtypes are known subtypes.

Read the data

Before doing any analysis, importing or reading data into R is an important step. Sometimes it is quite straightforward and smooth, but othertimes you might have problems or errors. Knowing how to solve these small but annoying problems will speed up your analysis and give you a good mood for that day.

If it is the first time that you see the data, the first thing is to have a simple look at how the format is. Most datasets are in a format of plain text (or a zipped file that you need to uncomparess it in advance). You can open it in a text editor or Excel (if the size of the file is small) or run following command. head prints the first 10 lines in the file.

# If you use linux or mac
head TCGA_unified_CORE_ClaNC840.txt
head unifiedScaled.txt

We can see the two files are all table-like files. Now we can read it by the read.table() function. You might image read.table() is a magic function that it automatically converts the file properly into R. If you run following command, you will have an error.

df = read.table("TCGA_unified_CORE_ClaNC840.txt")

## Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 2 did not have 184 elements

read.table() has some defaults for some parameters. The default separator is a single space while in TCGA_unified_CORE_ClaNC840.txt, the second column has some values containing spaces (try command cut -f2 TCGA_unified_CORE_ClaNC840.txt | sort | uniq under linux or mac). In this case, we need to explictly set the separator to tab \t.

df = read.table("TCGA_unified_CORE_ClaNC840.txt", sep = "\t")

After successfully reading the data into R, the next thing is to see how it looks in R. Simply use head() function to print the first 10 rows:

head(df)

In df, the first two rows are what we need, which are sample names and subtypes. The first two columns are not needed here and we need to exclude them.

We can subset df to get the first two rows. Note -(1:2) means to exclude the first two columns.

sample_id = df[1, -(1:2)]
subtype = df[2, -(1:2)]

We need to double check the format of sample_id and subtype:

sample_id

You might wonder why sample_id is in such a strange format. The reason is df is a data frame and subsetting the data frame by rows will also give you a data frame, even when the data frame only has one row.

is.data.frame(sample_id)

## [1] TRUE

To convert a one-row data frame to a vector, you need to first convert it to a one-row matrix then to a vector.

sample_id = as.vector(as.matrix(sample_id))
subtype = as.vector(as.matrix(subtype))
# you can check the value of sample_id and subtype

Here subtype is called annotation (of samples) in the analysis. Normally, the annotations are stored as a data frame where rows correspond to samples.

anno = data.frame(subtype = subtype)
rownames(anno) = sample_id
head(anno)

##                       subtype
## TCGA-02-0003-01A-01 Proneural
## TCGA-02-0010-01A-01 Proneural
## TCGA-02-0011-01B-01 Proneural
## TCGA-02-0014-01A-01 Proneural
## TCGA-02-0024-01B-01 Proneural
## TCGA-02-0026-01B-01 Proneural

As you can see, the sample nams are assigned as the row names of the annotation data frame anno. Later we can use it to correspond to the expression matrix to make sure the columns in expression matrix and rows in annotation data frame as the same.

For unifiedScaled.txt, it is quite clean (which means there are only gene names, sample names and expression matrix in it) and simply use read.table() function:

mat = read.table("unifiedScaled.txt")
mat[1:2, 1:2]

##       TCGA.02.0001.01C.01 TCGA.02.0002.01A.01
## AACS              0.29861             0.14778
## FSTL1            -0.66392            -0.21234

read.table() always returns a data frame, which means mat is actually a data frame. Although in some analysis, it won’t matter whether the input is a matrix or a data frame, but for some other analysis, the input is enforced to be a matrix (e.g. clustering). Thus, we need to convert it to a matrix.

mat = as.matrix(mat)

OK, now we successfully read the annotation as well as the main matrix into R and format them properly. To make further analysis, one curcial thing is to check whether the expression matrix and the annotation data frame are corresponded.

intersect(colnames(mat), rownames(anno))

## character(0)

Nothing in common?! If you check the value of colnames(mat) and rownames(ano), you will find the format of the sample names are different.

colnames(mat)[1:2]

## [1] "TCGA.02.0001.01C.01" "TCGA.02.0002.01A.01"

rownames(anno)[1:2]

## [1] "TCGA-02-0003-01A-01" "TCGA-02-0010-01A-01"

This is another annoying thing caused by read.table(). By default, it will automatically convert a column name like TCGA-12-0620-01A-01 to TCGA.12.0620.01A.01 for historical reason. We always need to set check.names = FALSE in read.table() to avoid such conversion.

mat = read.table("unifiedScaled.txt", check.names = FALSE)
mat = as.matrix(mat)
mat[1:2, 1:2]

##       TCGA-02-0001-01C-01 TCGA-02-0002-01A-01
## AACS              0.29861             0.14778
## FSTL1            -0.66392            -0.21234

intersect(colnames(mat), rownames(anno))

##   [1] "TCGA-02-0003-01A-01" "TCGA-02-0004-01A-01" "TCGA-02-0006-01B-01" "TCGA-02-0007-01A-01"
##   [5] "TCGA-02-0009-01A-01" "TCGA-02-0010-01A-01" "TCGA-02-0011-01B-01" "TCGA-02-0014-01A-01"
##   [9] "TCGA-02-0016-01A-01" "TCGA-02-0021-01A-01" "TCGA-02-0023-01B-01" "TCGA-02-0024-01B-01"
##  [13] "TCGA-02-0025-01A-01" "TCGA-02-0026-01B-01" "TCGA-02-0027-01A-01" "TCGA-02-0028-01A-01"
##  [17] "TCGA-02-0033-01A-01" "TCGA-02-0034-01A-01" "TCGA-02-0038-01A-01" "TCGA-02-0039-01A-01"
##  [21] "TCGA-02-0043-01A-01" "TCGA-02-0046-01A-01" "TCGA-02-0047-01A-01" "TCGA-02-0048-01A-01"
##  [25] "TCGA-02-0051-01A-01" "TCGA-02-0054-01A-01" "TCGA-02-0055-01A-01" "TCGA-02-0057-01A-01"
##  [29] "TCGA-02-0059-01A-01" "TCGA-02-0060-01A-01" "TCGA-02-0064-01A-01" "TCGA-02-0069-01A-01"
##  [33] "TCGA-02-0070-01A-01" "TCGA-02-0074-01A-01" "TCGA-02-0075-01A-01" "TCGA-02-0079-01A-03"
##  [37] "TCGA-02-0080-01A-01" "TCGA-02-0084-01A-03" "TCGA-02-0085-01A-01" "TCGA-02-0086-01A-01"
##  [41] "TCGA-02-0087-01A-01" "TCGA-02-0089-01A-01" "TCGA-02-0099-01A-01" "TCGA-02-0102-01A-01"
##  [45] "TCGA-02-0104-01A-01" "TCGA-02-0106-01A-01" "TCGA-02-0107-01A-01" "TCGA-02-0111-01A-01"
##  [49] "TCGA-02-0113-01A-01" "TCGA-02-0114-01A-01" "TCGA-02-0115-01A-01" "TCGA-02-0260-01A-03"
##  [53] "TCGA-02-0269-01B-01" "TCGA-02-0281-01A-01" "TCGA-02-0285-01A-01" "TCGA-02-0289-01A-01"
##  [57] "TCGA-02-0290-01A-01" "TCGA-02-0317-01A-01" "TCGA-02-0321-01A-01" "TCGA-02-0325-01A-01"
##  [61] "TCGA-02-0326-01A-01" "TCGA-02-0333-01A-02" "TCGA-02-0337-01A-01" "TCGA-02-0338-01A-01"
##  [65] "TCGA-02-0339-01A-01" "TCGA-02-0422-01A-01" "TCGA-02-0430-01A-01" "TCGA-02-0432-01A-02"
##  [69] "TCGA-02-0439-01A-01" "TCGA-02-0440-01A-01" "TCGA-02-0446-01A-01" "TCGA-02-0451-01A-01"
##  [73] "TCGA-06-0122-01A-01" "TCGA-06-0124-01A-01" "TCGA-06-0125-01A-01" "TCGA-06-0126-01A-01"
##  [77] "TCGA-06-0128-01A-01" "TCGA-06-0129-01A-01" "TCGA-06-0130-01A-01" "TCGA-06-0132-01A-02"
##  [81] "TCGA-06-0133-01A-02" "TCGA-06-0137-01A-03" "TCGA-06-0138-01A-02" "TCGA-06-0139-01A-01"
##  [85] "TCGA-06-0143-01A-01" "TCGA-06-0145-01A-04" "TCGA-06-0146-01A-01" "TCGA-06-0147-01A-01"
##  [89] "TCGA-06-0148-01A-01" "TCGA-06-0149-01A-05" "TCGA-06-0152-01A-02" "TCGA-06-0154-01A-02"
##  [93] "TCGA-06-0156-01A-01" "TCGA-06-0160-01A-01" "TCGA-06-0162-01A-01" "TCGA-06-0164-01A-01"
##  [97] "TCGA-06-0166-01A-01" "TCGA-06-0167-01A-01" "TCGA-06-0171-01A-02" "TCGA-06-0173-01A-01"
## [101] "TCGA-06-0174-01A-01" "TCGA-06-0175-01A-01" "TCGA-06-0176-01A-03" "TCGA-06-0177-01A-01"
## [105] "TCGA-06-0179-01A-02" "TCGA-06-0182-01A-01" "TCGA-06-0184-01A-01" "TCGA-06-0185-01A-01"
## [109] "TCGA-06-0187-01A-01" "TCGA-06-0189-01A-05" "TCGA-06-0190-01A-01" "TCGA-06-0194-01A-01"
## [113] "TCGA-06-0195-01B-01" "TCGA-06-0197-01A-02" "TCGA-06-0208-01B-01" "TCGA-06-0210-01A-01"
## [117] "TCGA-06-0211-01B-01" "TCGA-06-0214-01A-02" "TCGA-06-0219-01A-01" "TCGA-06-0221-01A-01"
## [121] "TCGA-06-0237-01A-02" "TCGA-06-0238-01A-02" "TCGA-06-0240-01A-02" "TCGA-06-0241-01A-02"
## [125] "TCGA-06-0397-01A-01" "TCGA-06-0402-01A-01" "TCGA-06-0409-01A-02" "TCGA-06-0410-01A-01"
## [129] "TCGA-06-0412-01A-01" "TCGA-06-0413-01A-01" "TCGA-06-0414-01A-01" "TCGA-06-0644-01A-02"
## [133] "TCGA-06-0645-01A-01" "TCGA-06-0646-01A-01" "TCGA-06-0648-01A-01" "TCGA-08-0245-01A-01"
## [137] "TCGA-08-0246-01A-01" "TCGA-08-0344-01A-01" "TCGA-08-0346-01A-01" "TCGA-08-0347-01A-01"
## [141] "TCGA-08-0348-01A-01" "TCGA-08-0349-01A-01" "TCGA-08-0350-01A-01" "TCGA-08-0352-01A-01"
## [145] "TCGA-08-0353-01A-01" "TCGA-08-0354-01A-01" "TCGA-08-0355-01A-01" "TCGA-08-0357-01A-01"
## [149] "TCGA-08-0358-01A-01" "TCGA-08-0359-01A-01" "TCGA-08-0360-01A-01" "TCGA-08-0375-01A-01"
## [153] "TCGA-08-0380-01A-01" "TCGA-08-0385-01A-01" "TCGA-08-0386-01A-01" "TCGA-08-0390-01A-01"
## [157] "TCGA-08-0392-01A-01" "TCGA-08-0509-01A-01" "TCGA-08-0510-01A-01" "TCGA-08-0511-01A-01"
## [161] "TCGA-08-0512-01A-01" "TCGA-08-0514-01A-01" "TCGA-08-0517-01A-01" "TCGA-08-0518-01A-01"
## [165] "TCGA-08-0520-01A-01" "TCGA-08-0522-01A-01" "TCGA-08-0524-01A-01" "TCGA-08-0529-01A-02"
## [169] "TCGA-08-0531-01A-01" "TCGA-12-0616-01A-01" "TCGA-12-0618-01A-01" "TCGA-12-0619-01A-01"
## [173] "TCGA-12-0620-01A-01"

We check how many samples are there in mat and in anno:

ncol(mat)

## [1] 202

nrow(anno)

## [1] 173

The numbers are not the same and maybe the order of the samples are also different. We need to take the common samples in the two objects:

cn = intersect(colnames(mat), rownames(anno))
mat = mat[, cn]
anno = anno[cn, , drop = FALSE]

Note since mat has column names and anno has row names, cn can be used as character indices for reordering and subsetting objects.

You might note when subsetting anno, we additionally add drop = FALSE argument. This is another annoying default behavior of R. Since the original anno only has one column, doing anno[cn, ] will degenerate to a vector (or in other words, the dimension is droppped.)

Now mat and anno are in nice and correct format and can be used for downstream analysis.

Last but not the least, since we spend so much effort on cleaning the data, we actually can save anno and mat into files which is also good if you want to re-perform the analysis or share the data with other collaborators.

save(mat, anno, file = ...)  # use load() to read back 
saveRDS(mat, file = ...)  # use readRDS() to read back 
saveRDS(anno, file = ...)  # use readRDS() to read back

General analysis

Now we know mat contains expression data, but what exactly does the expression value look like? Or in other words, how does the distribution of the gene expression look like? The most straightforward way that a lot of people will do is to look at the distribution.

par(mfrow = c(1, 2))  # a layout of one row and two columns
hist(mat[, 1]) # the first column
plot(density(mat[, 1]))

par(mfrow = c(1, 1)) # reset the layout, you don't need this line if you just close the graphic window

Siimlarlly, you can make density plot for other samples.

According to the density plot, we see the values distributed in both positive and negative values (why the expression values have negative values?). According to the file name of the expression matrix, we can infer the gene expression are “scaled”. Scaling is an approach that converts the data and put data scale of all genes into a similar level.

We can check the mean value and standard deviation of all genes.

par(mfrow = c(1, 2))
gene_mean = apply(mat, 1, mean) # use ?apply to check what does apply() do
plot(gene_mean, col = "#00000080")
gene_sd = apply(mat, 1, sd)
plot(gene_sd, col = "#00000080")

par(mfrow = c(1, 1))

We can see the mean expression of all genes are centered at y = 0 and the sd are close to 0.5. So we can say all the genes are in a similar scale.

To get an impression of how the data distributes, sometimes we don’t need the exact the distribution, while five quantiles are enough for intepration. Recall we can use boxplot() function to visualize quantiles.

boxplot(mat[, 1])

Atually you can put mulitplt boxplots in one plot:

par(las = 3, mar = c(12, 4, 4, 1))  # type ?par to find out what do las and mar mean
boxplot(mat[, 1:10])

It seems there are too many outliers that comparess the boxes too small. The outliers can be removed by setting outline = FALSE.

boxplot(mat[, 1:10], outline = FALSE)

Recall to get the numbers of the quantiles, use quantile() function.

quantile(mat[, 1])

##       0%      25%      50%      75%     100% 
## -2.75794 -0.16462  0.08712  0.40294  5.16804

We can calculate correlation between two samples:

cor(mat[, 1], mat[, 2])

## [1] -0.2545496

If you put the matrix into cor(), it returns a correlation matrix.

cor_mat = cor(mat)
dim(cor_mat)

## [1] 173 173

Normally we visualize it as a heatmap. If you don’t have ComplexHeatmap installed, try following commands:

install.packages("https://cran.r-project.org/src/contrib/circlize_0.4.5.tar.gz", repos = NULL)
install.packages("http://bioconductor.org/packages/devel/bioc/src/contrib/ComplexHeatmap_1.99.3.tar.gz", repos = NULL)

Following means we make a heatmap for cor_mat and use anno as annotation on top and left of the heatmap. Note the colors for annotations are randomly assigned.

library(ComplexHeatmap)
Heatmap(cor_mat, top_annotation = HeatmapAnnotation(df = anno),
    left_annotation = rowAnnotation(df = anno),
    show_row_names = FALSE, show_column_names = FALSE)

Next we will do hypothesis test on samples only in two subtypes (Proneural and Mesenchymal). We first get a subset from mat and anno.

l = anno$subtype %in% c("Proneural", "Mesenchymal")
mat2 = mat[, l]
anno2 = anno[l, , drop = FALSE]
anno2$subtype = factor(as.character(anno2$subtype), levels = c("Proneural", "Mesenchymal"))
identical(colnames(mat2), rownames(anno2))

## [1] TRUE

Now we want to test how differential the gene “CREB3L1” is in comparison of Proneural vs Mesenchymal. We first get the expression of CREB3L1 and assign the subtype annotation to another variable so that we don’t need to type anno2$subtyep too many times in following code.

gene = "CREB3L1"  # ELMO2
x = mat2[gene, ]  # note `gene` is a character and can be used as character index
subtype = anno2$subtype
# you can check the value of `x` and `subtype`

Perform two-sample t-test with default settings.

t.test(x ~ subtype)

## 
##  Welch Two Sample t-test
## 
## data:  x by subtype
## t = -4.9387, df = 106.97, p-value = 2.916e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6145674 -0.2625117
## sample estimates:
##   mean in group Proneural mean in group Mesenchymal 
##                -0.2768483                 0.1616913

x ~ subtyep is called a formula in R. It basically means performing test against subtype (of course, subtype must be a categorical variable). An alternative way to use t.test() is to provide two vectors, where each for one subtype.

t.test(x[subtype == "Proneural"], x[subtype == "Mesenchymal"])

## 
##  Welch Two Sample t-test
## 
## data:  x[subtype == "Proneural"] and x[subtype == "Mesenchymal"]
## t = -4.9387, df = 106.97, p-value = 2.916e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6145674 -0.2625117
## sample estimates:
##  mean of x  mean of y 
## -0.2768483  0.1616913

Use boxplot to visualize how different the two groups are. Here we also use “formula” in boxplot().

boxplot(x ~ subtype)

# same as:
# boxplot(list(Proneural = x[subtype == "Proneural"], 
#              Mesenchymal = x[subtype == "Mesenchymal"]))

Perform Wilcox rank test is similar as t-test:

wilcox.test(x ~ subtype)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  x by subtype
## W = 674, p-value = 9.214e-07
## alternative hypothesis: true location shift is not equal to 0

# you can also do:
# wilcox.test(x[subtype == "Proneural"], x[subtype == "Mesenchymal"])

Finally, we implement the permutation-based test. Recall we need to design a statistic for the permutation test. In following, the statistic is defined as:

$s = |\frac{\mu_1}{\sigma_1} - \frac{\mu_2}{\sigma_2}|$

We implement the calculation of permutation statistic as a simple function so that it can be used multiple times:

get_score = function(x1, x2) {
    abs(mean(x1)/sd(x1) - mean(x2)/sd(x2))
}

The score for the real data (not permutated)

score = get_score(x[subtype == "Proneural"], x[subtype == "Mesenchymal"])
score

## [1] 0.9514647

Next we random permutation subtype to produce the null hypothesis that the expression of this gene is not correlated to subtype.

n = 1000
score_random = numeric(n)

for(i in 1:n) {
    subtype_random = sample(subtype, length(subtype))
    score_random[i] = get_score(x[subtype_random == "Proneural"], x[subtype_random == "Mesenchymal"])
}

score_random corresponds that is called “null distribution”. We can see how different between score and score_random by looking at the position of score in score_random.

plot(density(score_random), xlim = c(-0.3, 1.2))
abline(v = score, col = "red", lwd = 2)

The final p-value is calculated as the probability of score_random larger than score.

p = sum(score_random > score)/n
p

## [1] 0

Since we only perfrom 1000 permuations, the minimal non-zero p-value is 1/1000 = 0.001, when you see a p-value is zero from a permutaiton test, actually it means the p-value < 0.001 while not really 0.

Day 2 Exercise

Basic statistics for analysis

Read the data

General analysis