- Basic statistical analysis
- descriptive statistics
- hypothesis test
We first introduce the concepts, then practise with a real dataset.
Dec 4, 2018
Statistics that describe your data
mean: \(\frac{1}{n}\sum_{i=1}^n x_i\)
The median and quartiles are special cases of percentiles.
The k^th percentile is the value at index i = k/100*(n-1) + 1 in the sorted vector.
What if i = k/100*(n-1) + 1 is not an integer? Take floor(i) and ceiling(i) (the left and right integers) and interpolate between the values at floor(i) and ceiling(i) by weighting the distances (linear interpolation). Assume the following vector that has been sorted:
value:       1   2   3   4   5   6   7   8   9  10  11
percentile:  0  10  20  30  40  50  60  70  80  90 100
The 12^th percentile corresponds to i = 12/100*(11-1) + 1 = 2.2:
quantile(1:11, p = 0.12)
## 12% 
## 2.2
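To see what quantile() is doing, here is a minimal sketch of the same interpolation (it reproduces quantile()'s default method, type 7):
x = 1:11
k = 12                            # the 12^th percentile
i = k/100*(length(x) - 1) + 1     # i = 2.2, not an integer
lo = floor(i); hi = ceiling(i)    # the left and right integer indices
x[lo] + (i - lo)*(x[hi] - x[lo])  # 2 + 0.2*(3 - 2) = 2.2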
R functions to get descriptive statistics:
set.seed(123456)
x = rnorm(100)  # sample 100 random data points from standard normal distribution
mean(x)
## [1] 0.01681979
median(x)
## [1] 0.04790734
quantile(x)
##          0%         25%         50%         75%        100% 
## -2.74886838 -0.72599751  0.04790734  0.83391491  2.50264541
quantile(x, p = c(0.05, 0.95)) # percentiles
##        5%       95% 
## -1.564047  1.538384
The mean and median are very close for symmetric distributions.
The mean and median can be far apart for non-symmetric (skewed) distributions.
So the mean or median only tells you the "center" location of your data; it says nothing about how the data spread around that center.
Range is simply \(max - min\)
x = rnorm(100)
range(x)  # minimum and maximum
## [1] -2.558968 3.078589
The interquartile range (IQR) is Q3 - Q1 (the 75^th percentile minus the 25^th percentile).
quantile(x, 0.75) - quantile(x, 0.25)
##     75% 
## 1.25958
IQR(x)
## [1] 1.25958
Variance and standard deviation
\(var = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2\) (the sample variance; the n-1 denominator is what R's var() uses)
\(sd = \sqrt{var}\)
var(x)
## [1] 1.181237
sd(x)
## [1] 1.086847
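As a quick check, var() indeed uses the \(n-1\) denominator:
sum((x - mean(x))^2)/(length(x) - 1)  # same value as var(x) above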
Coefficient of Variation (CV)
The CV is the standard deviation divided by the mean: it measures dispersion relative to the level of the data.
sd(x)/mean(x)
## [1] 40.43593
boxplot(x)
With descriptive statistics, we can scale our data.
x1 = rnorm(10, mean = 4, sd = 3)
x2 = rnorm(10, mean = -1, sd = 1)
z-score scaling
\(x' = \frac{x - \mu}{\sigma}\)
y1 = (x1 - mean(x1))/sd(x1)
y2 = (x2 - mean(x2))/sd(x2)
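Base R's scale() performs the same z-score transformation (by default it centers by the mean and divides by the standard deviation); y1_alt and y2_alt are just illustrative names:
y1_alt = as.numeric(scale(x1))  # equivalent to (x1 - mean(x1))/sd(x1)
y2_alt = as.numeric(scale(x2))  # scale() returns a matrix, hence as.numeric()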
min-max scaling
\(x' = \frac{x - min}{max - min}\)
y1 = (x1 - min(x1))/(max(x1) - min(x1))
y2 = (x2 - min(x2))/(max(x2) - min(x2))
Besides the descriptive statistics, we always need to plot the distribution to see what the data actually looks like.
A histogram is a frequency-based approach:
x = rnorm(100)
hist(x)
Density estimation is a way to estimate the real underlying distribution from x (assuming x is a small sample from the population):
plot(density(x))
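The two views can be combined by drawing the histogram on the density scale and overlaying the density estimate:
hist(x, freq = FALSE)           # freq = FALSE plots densities instead of counts
lines(density(x), col = "red")  # overlay the kernel density estimate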
cor() measures the correlation between two variables; by default it computes the Pearson correlation:
x = runif(10)
y = x + rnorm(10, sd = 0.5)
cor(x, y)
## [1] 0.5968809
cor(x, y, method = "spearman")
## [1] 0.430303
It is always a good idea to plot the data when interpreting a correlation.
plot(x, y)
Spearman correlation works well for non-linear (monotonic) relationships:
x = runif(10, max = 5)
y = x^10
cor(x, y)
## [1] 0.7438561
cor(x, y, method = "spearman")
## [1] 1
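This is because Spearman correlation is simply the Pearson correlation of the ranks, and a perfectly monotonic relationship has perfectly correlated ranks:
cor(rank(x), rank(y))  # same as cor(x, y, method = "spearman")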
Spearman correlation is robust to outliers
set.seed(1234)
x = runif(10); x[10] = 100
y = runif(10); y[10] = 100
cor(x, y)
## [1] 0.9999322
cor(x, y, method = "spearman")
## [1] 0.3212121
cor(x[1:9], y[1:9])
## [1] 0.06834788
cor(x[1:9], y[1:9], method = "spearman")
## [1] 0.06666667
par(mfrow = c(1, 2))
plot(x, y)
plot(x[1:9], y[1:9], main = "remove the outlier")
Outliers are a basic problem in data analysis, but in most cases people ignore them.
The following statistics are significantly affected by outliers: mean, variance/standard deviation, range, Pearson correlation.
The following statistics are robust to outliers: median, IQR (and other quantile-based statistics), Spearman correlation.
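A small sketch to illustrate this: adding a single extreme value shifts the mean and sd noticeably, while the median and IQR barely move.
x_clean   = rnorm(100)
x_outlier = c(x_clean, 100)            # add one extreme value
c(mean(x_clean), mean(x_outlier))      # mean shifts noticeably
c(median(x_clean), median(x_outlier))  # median barely changes
c(sd(x_clean), sd(x_outlier))          # sd inflates
c(IQR(x_clean), IQR(x_outlier))        # IQR barely changes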
two-sample test
Why do we need to do a test?
parametric test
Main steps:
- state the null hypothesis H0 (e.g. "the two groups have the same mean")
- calculate a test statistic from the data
- convert the statistic into a p-value: the probability, under H0, of observing a statistic at least as extreme as the one you got; a small p-value means the observed data would be unlikely if H0 were true, so we reject H0
Assume x1 and x2 are samples from normal distributions, each of size n; the t-statistic is defined as
\(t = \frac{\bar{X}_1 - \bar{X}_2}{s_p\sqrt{\frac{2}{n}}}\)
\(s_p = \sqrt{\frac{s^2_{X_1}+s^2_{X_2}}{2}}\)
The t-statistic follows a t-distribution under H0, so we can calculate the two-sided p-value \(P(|T| \ge |t|)\).
Apply the two-sample t-test with the t.test() function:
x1 = rnorm(10, mean = -1)
x2 = rnorm(10, mean = 1)
t.test(x1, x2)
## 
##  Welch Two Sample t-test
## 
## data:  x1 and x2
## t = -4.349, df = 15.086, p-value = 0.0005654
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.5777800 -0.8826677
## sample estimates:
## mean of x mean of y 
## -1.1181707  0.6120532
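To connect the output with the formula above, here is a manual computation of the pooled-variance t-statistic (note that t.test() applies the Welch correction by default; t.test(x1, x2, var.equal = TRUE) matches this pooled version exactly):
n  = 10                                        # equal group sizes
sp = sqrt((var(x1) + var(x2))/2)               # pooled standard deviation
t_stat  = (mean(x1) - mean(x2))/(sp*sqrt(2/n))
p_value = 2*pt(-abs(t_stat), df = 2*n - 2)     # two-sided p-value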
A boxplot can be used to visualize the difference:
boxplot(list(x1, x2))
non-parametric test
A non-parametric alternative is the Wilcoxon rank-sum test, applied with the wilcox.test() function:
wilcox.test(x1, x2)
## 
##  Wilcoxon rank sum test
## 
## data:  x1 and x2
## W = 9, p-value = 0.00105
## alternative hypothesis: true location shift is not equal to 0
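The W statistic is rank-based: it is the sum of the ranks of x1 in the pooled sample minus the smallest possible rank sum n1*(n1+1)/2, which we can check by hand:
r = rank(c(x1, x2))            # ranks in the pooled sample
sum(r[1:10]) - 10*(10 + 1)/2   # reproduces W above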
Another option is a permutation test, which makes no distributional assumptions; it learns completely from the data.
The steps are:
- calculate the observed statistic s0 = |mean(x1) - mean(x2)|
- pool x1 and x2, randomly draw 10 data points for each of the two groups, and calculate s0_random for this random dataset
- repeat many times; the fraction of random datasets with s0_random larger than s0 is the permutation p-value

s0 = abs(mean(x1) - mean(x2))
n = 1000
s0_random = numeric(n)
for(i in 1:1000) {
    x_random = sample(c(x1, x2), 20)
    s0_random[i] = abs(mean(x_random[1:10]) - mean(x_random[11:20]))
}
sum(s0_random > s0)/n
## [1] 0.002
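It also helps to visualize where the observed statistic falls within the permutation distribution:
hist(s0_random, main = "permutation distribution")
abline(v = s0, col = "red")  # the observed statistic lies far in the tail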