---
title: Getting started with ... Stats
created: 2017-04-27
author: 徐栖
subtitle: some basic statistics concepts
status: in progress
type: rational
importance: 5
tags: test, statistics, descriptive
...
What is statistics?
===================
We measure things. In practice, there are two major difficulties: 1)
there are too many things of the same kind, and it is not possible to
measure every one of them; 2) you cannot eliminate errors when you
measure, so different individuals give you different measurements, and
even multiple measurements of the same individual yield different
results. How, then, do you describe the characteristic being measured
for the whole population of such things?
Statistics was invented to solve this problem. In short, it a)
describes the similarity and variation of measurements (scores) within a
group of individuals (a sample) that you can measure, which is called
**descriptive statistics**; and b) helps us estimate the characteristics
in question in a larger group, which we often cannot or should not
measure one by one, which is called **inferential statistics**.
Descriptive statistics
======================
When we obtain scores over a sample, the first things that we need to
know are: what are these values? What is the maximum and the minimum
value? How large are the differences between the scores? What
information do the values tell us as a whole? These questions are
answered by descriptive statistics.
Central tendency
----------------
Central tendency tells you where the typical or common value is in your
sample's scores. The most widely used central tendency measures are
(in that order): the mean, the median, and the mode.
### The Mean
Just the arithmetic mean of all scores in the sample.
$$
\bar{X} = \frac{\sum_{i=1}^{n}{X_{i}}}{n}
$$
where \\(n\\) is sample size, \\(X_{i}\\) is the score for the individual \\(i\\).
**In R**, the mean is calculated with
``` r
mean(x, trim = 0, na.rm = FALSE, ...)
```
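A minimal check with a small made-up vector (`trim` drops a fraction of
extreme values from each end before averaging; `na.rm` drops missing
values):

``` r
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)  # illustrative scores
mean(scores)                         # 5
mean(c(scores, NA))                  # NA: missing values propagate by default
mean(c(scores, NA), na.rm = TRUE)    # 5: drop the NA first
```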
### The Median
If you sort your sample scores in ascending order, and pick the one in
the middle if you have a sample size of odd value, or the mean of the
two in the middle if you have a sample size of even value, you get the
sample's median. A median is no smaller than half of the scores in the
sample and no larger than the other half.
The median is more useful than the mean for locating the typical value
if you have some extreme values on either end of the score
distribution, or if the scores are ordinal rather than interval values,
i.e. the distances between values have no meaning.
**In R**, the median is calculated with
``` r
median(x, na.rm = FALSE)
```
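A quick illustration with tiny made-up samples, including how the median
shrugs off an extreme value that would drag the mean far away:

``` r
median(c(1, 3, 5))        # 3: the middle value of an odd-sized sample
median(c(1, 3, 5, 7))     # 4: the mean of the two middle values
median(c(1, 3, 5, 1000))  # 4: barely moved by the extreme value
mean(c(1, 3, 5, 1000))    # 252.25: the mean is dragged far to the right
```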
### The Mode
The mode is simply the value that appears most frequently in the sample
scores. Note that in some samples, there may be two or more modes. These
are called bimodal and multimodal distributions. Also note that the mode
is not necessarily close to the mean.
**In R**, you can use such a function to find mode numbers for a
univariate sample:
``` r
Mode <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}
```
Alternatively, you can use the `modeest` package.
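A quick sketch of how such a function behaves on made-up samples (the
definition is repeated here so the snippet runs on its own):

``` r
Mode <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

Mode(c(1, 2, 2, 3, 3, 3))  # 3
Mode(c(1, 1, 2, 2, 3))     # 1 2: a bimodal sample has two modes
```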
Dispersion
----------
Dispersion tells you how large the differences between scores are. Some
of the measures here, namely variance and standard deviation, are the
basis of the inferential statistics later: combined with central
tendency, they give us an idea of where to expect scores in the
population.
### The Range
The range is the distance between the largest and the smallest value in
the sample. It is often reported with the max/min values.
$$
R = X_{\text{max}} - X_{\text{min}}
$$
**In R**, the range is calculated with
``` r
range(..., na.rm = FALSE)
```
Note that it reports the min and max values in a pair, not the range
itself.
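So to get the range itself, take the difference of the pair `range`
returns; a sketch with made-up scores:

``` r
x <- c(3, 1, 4, 1, 5, 9, 2, 6)  # illustrative scores
range(x)                        # 1 9: the min and max as a pair
diff(range(x))                  # 8: the range itself
```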
### The Interquartile Range
The interquartile range is less influenced by extreme values than the
range. As when finding the median, you sort the scores, then split them
at the median into a lower half and an upper half; the median of the
lower half is the first quartile and the median of the upper half is
the third quartile. The distance between these two values is the
interquartile range.
**In R**, the interquartile range is calculated with
``` r
# assuming continuous sample
IQR(x, na.rm = FALSE, type = 7)
```
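A sketch with a simple made-up sample; `quantile` shows the first and
third quartiles that `IQR` subtracts (the default `type = 7`
interpolates between scores):

``` r
x <- 1:10
quantile(x, c(.25, .75))  # 3.25 and 7.75
IQR(x)                    # 4.5 = 7.75 - 3.25
```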
### Variance
Variance indicates the average amount of dispersion in a distribution
of scores. To calculate it, you take each score's distance from the
mean, add these distances together, and divide the sum by the
population size. But if you added the raw positive and negative
distances, they would by definition cancel out to 0, so the distances
are squared first. Therefore, for the population's variance:
$$
\sigma^{2} = \frac{\sum{(X-\mu)}^{2}}{N}
$$
where \\(N\\) is the size of the population, \\(\mu\\) is the population mean.
For the sample's variance, because the sample mean is itself calculated
from all the sample scores, one degree of freedom is lost, and we
divide by 1 less than the sample size:
$$
s^{2} = \frac{\sum{(X-\bar{X})}^{2}}{n - 1}
$$
where \\(\bar{X}\\) is the sample mean, and \\(n\\) the sample size.
**In R**, variance is calculated with
``` r
var(x, y = NULL, na.rm = FALSE)
```
### Standard Deviation
The standard deviation (SD) is the typical distance between individual
scores in a distribution and the mean of that distribution.
It is simply calculated by obtaining the square root of variance.
**In R**, the sample's SD is calculated with `sd` from `stats` package.
``` r
sd(x, na.rm = FALSE)
```
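A sketch verifying, on made-up scores, that `var` and `sd` match the
formulas above:

``` r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)          # illustrative scores
var(x)                                  # sum of squared deviations / (n - 1)
sum((x - mean(x))^2) / (length(x) - 1)  # the same value, by hand
sd(x)                                   # the square root of the variance
sqrt(var(x))                            # identical
```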
Standardization and z-score
---------------------------
The z-score is a quick way to describe how far a score is from the
mean, measured in standard deviations.
$$
z = \frac{X - \bar{X}}{SD}
$$
So, if \\(X\\) is 0.5 SD larger than the mean, its z-score is 0.5. If \\(X\\) is
1 SD smaller than the mean, its z-score is -1.
**In R**, z-scores can be calculated with
``` r
scale(x, center = TRUE, scale = TRUE)
```
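Note that `scale` returns a one-column matrix (with the centering and
scaling values kept as attributes), not a plain vector; a sketch on
made-up scores:

``` r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)  # illustrative scores
z <- scale(x)
as.vector(z)                    # the z-scores as a plain vector
(x - mean(x)) / sd(x)           # the same z-scores, by hand
```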
Inferential statistics
======================
Inferential statistics is used when we know some characteristics of the
sample, and want to infer whether such characteristics exist in the
larger population.
Standard error of the mean
--------------------------
Suppose we have a population. We randomly draw a sample and calculate
its mean. Then we put the sampled individuals back into the population,
randomly draw another sample, and again calculate its mean. Repeat this
a few (hundred) more times, and the means you calculated form a
distribution, called **the sampling distribution of the mean**; the
standard error of the mean is the standard deviation of this sampling
distribution.
Because of the [central limit
theorem](https://en.wikipedia.org/wiki/Central_limit_theorem), the
sampling distribution of the mean approaches a normal distribution as
the sample size grows. This makes the standard error an indispensable
tool in inferential statistics.
Standard error of the mean is calculated from SD and sample size like
this:
$$
S_{e} = \frac{SD}{\sqrt{n}}
$$
**In R**, you can use the above equation, or use `std.error` from the
`plotrix` package.
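The idea can also be checked by brute force. This sketch draws many
samples from a simulated population and compares the SD of their means
with the formula (all numbers here are made up for illustration):

``` r
set.seed(1)
pop <- rnorm(1e5, mean = 100, sd = 15)           # a simulated population
means <- replicate(2000, mean(sample(pop, 25)))  # 2000 sample means, n = 25
sd(means)      # empirical standard error of the mean
15 / sqrt(25)  # theoretical value, 3; the empirical one lands close by
```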
t distributions and t-value
---------------------------
t distributions are a family of symmetrical distributions. They describe
the probability distributions that arise when estimating the mean of a
normally distributed population from a small sample when the population
SD is unknown. The larger the sample, the more closely the t
distribution resembles the normal distribution; at a sample size around
120, the two are almost identical.
Similar to the z-score describing a score relative to the mean, the
t-value describes how far the sample mean is from the population mean,
with the standard error as the unit, given a certain sample size
(degree of freedom). With the t-value, we can look up the probability
of obtaining such a sample mean if we know the population mean. In
reverse, we can tell how likely the population mean is to lie within a
given distance of the sample mean we obtained. Thus inferring
parameters of the population from those of samples becomes possible.
The t-value is calculated with the population mean, the sample mean and
SE like this:
$$
t = \frac{\bar{X} - \mu}{S_e}
$$
**In R**, you can use `pt` from the `stats` package to get the
probability of obtaining a t-value at least this extreme by chance in
sampling (`dt` gives the density of the t distribution, not a
probability).
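A sketch of a one-sample t computation on made-up scores, with `pt`
turning the t-value into a two-sided p-value; the built-in `t.test`
should agree:

``` r
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)  # illustrative scores
mu <- 4                          # hypothesized population mean
n  <- length(x)
t  <- (mean(x) - mu) / (sd(x) / sqrt(n))  # t-value, as in the formula above
p  <- 2 * pt(-abs(t), df = n - 1)         # two-sided p-value
c(t = t, p = p)
t.test(x, mu = mu)$statistic              # the same t, via the built-in test
```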
Statistical significance and hypothesis testing
-----------------------------------------------
Suppose a theory declares that a measurement of a population should
have a mean value of \\(\mu\\), while a study of a sample from the
population yields a mean value of \\(\bar{X}\\) that is different from
\\(\mu\\). Is the difference due to chance (random sampling error), or
does it indicate an error in the theory?
The hypothesis that the observed difference is due to chance is called
the null hypothesis, and the other is called the alternative
hypothesis. In most cases, only when statistics tells us that the
probability of observing a difference at least this large, assuming
the null hypothesis is true (the p-value), is less than 0.05, can we
claim that the observed difference is significant, i.e. favor the
alternative hypothesis. To calculate this probability, we calculate
the t-value, then look up the corresponding probability. A larger
absolute t-value means the alternative hypothesis is more likely.
Confidence interval of the mean
-------------------------------
Suppose a randomly selected sample of size \\(n\\) yields a mean of
\\(\bar{X}\\); how can we estimate the population mean \\(\mu\\)? We can
be almost certain that \\(\mu \neq \bar{X}\\). What we care about is how
large the range around \\(\bar{X}\\) should be if we are to be 95% or
99% sure that \\(\mu\\) falls in it.
This range is called the 95% (or 99%) confidence interval, and the
formula to calculate it is
$$
\begin{eqnarray}
CI_{95} & = & \bar{X} \pm t_{95}S_e \\
CI_{99} & = & \bar{X} \pm t_{99}S_e
\end{eqnarray}
$$
\\(t_{95}\\) and other t-values corresponding to the confidence required can
be found by looking up a t distributions table. **In R**, you can use

``` r
qt(.975, df)
```

to get the \\(t_{95}\\) value for a two-sided interval (note the `.975`:
for 95% confidence, 2.5% of probability sits in each tail) given the
degree of freedom `df`.
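A sketch of a 95% confidence interval on made-up scores, using
`qt(.975, df)` because a two-sided 95% interval leaves 2.5% of
probability in each tail; `t.test` reports the same interval:

``` r
x  <- c(2, 4, 4, 4, 5, 5, 7, 9)  # illustrative scores
n  <- length(x)
se <- sd(x) / sqrt(n)
mean(x) + c(-1, 1) * qt(.975, df = n - 1) * se  # lower and upper 95% bounds
t.test(x)$conf.int                              # the same interval
```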
Correlation
-----------
When we measure two or more variables, the question of correlation often
pops up. In many cases, we raise correlation questions to begin with
("is the number of people entering the mall related to the weather?").
We calculate correlation coefficients to see how strong the correlation
between variables is. The most widely used may be the Pearson
product-moment correlation coefficient. To calculate it, we first
standardize the variables \\(X\\) and \\(Y\\), converting them to z-scores.
For each case in the sample, we multiply its \\(X\\) z-score by its \\(Y\\)
z-score, add up the products for all the cases, then divide the sum by
the sample size:
$$
r = \frac{\sum{z_{x}z_{y}}}{N}
$$
(Divide by \\(N - 1\\) instead if the z-scores were computed with the
sample standard deviation, as R's `scale` does.)
The coefficient will be between -1 and 1. Higher absolute value means a
strong correlation and 0 means no correlation at all. Positive value
indicates a positive correlation.
**In R**, Pearson correlation coefficient can be calculated with
``` r
cor(x, y, method = "pearson")
```
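A sketch on made-up data showing that `cor` matches the z-score
formula; since `scale` standardizes with the sample SD, the sum of
products is divided by \\(n - 1\\) here:

``` r
x <- c(1, 2, 3, 4, 5)  # illustrative data
y <- c(2, 4, 5, 4, 6)
cor(x, y)
sum(scale(x) * scale(y)) / (length(x) - 1)  # the same r, by hand
```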
When \\(X\\) is a continuous variable and \\(Y\\) a naturally two-category
nominal variable, one can use a special case of Pearson coefficients
called
[point-biserial](https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient).
**In R**, point-biserial coefficient can be calculated with
`biserial.cor` from the `ltm` package.
If both \\(X\\) and \\(Y\\) are dichotomous variables, one can use a phi
coefficient, or use chi-square analysis. Phi coefficient is yet another
special case of Pearson coefficient.
**In R**, you can use `phi` from the `psych` package to calculate phi
coefficients.
If one of the variables is ordinal but not interval, one should use
Spearman's rho coefficient. It is, you guessed right, another
specialized form of the Pearson coefficient.
**In R**, you can use the same `cor` function, with `method` =
`"spearman"`.
### Significance of correlation
The correlation coefficient \\(r\\) can tell us whether a correlation exists
between two variables in the sample. But is the correlation significant?
Can we say the correlation exists in the population? We use the
versatile t distributions again to answer the question.
The t value for the correlation is
$$
t = \frac{r - \rho}{s_r}
$$
where \\(r\\) is the sample correlation coefficient, \\(\rho\\) is the
population correlation coefficient (0 under the null hypothesis), and
\\(s_r\\) is the standard error of the sample correlation coefficient.
\\(s_r\\) can be calculated with the following formula:
$$
s_r = \sqrt{\frac{1 - r^2}{N - 2}}
$$
where N is the sample size.
So the formula for calculating \\(t\\) can be written as
$$
t = r \sqrt{\frac{N - 2}{1 - r^2}}
$$
Then we can look up the probability of obtaining such a t-value under
the null hypothesis.
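A sketch on made-up data: compute t from r by the formula above, then
check it against the built-in `cor.test`, which also reports the
p-value directly:

``` r
x <- c(1, 2, 3, 4, 5)  # illustrative data
y <- c(2, 4, 5, 4, 6)
r <- cor(x, y)
n <- length(x)
r * sqrt((n - 2) / (1 - r^2))  # t-value from the formula above
cor.test(x, y)$statistic       # the same t, with df = n - 2
```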
### The coefficient of determination
\\(r^2\\) actually denotes how much variance is shared between the two
variables, if you look closely. So the value of \\(r^2\\) is interpreted
as how much of the variance in one variable can be explained by the
variance in the other.
### Correlation and causality
Correlation does not imply causality. In many cases, the logical
relationship between the two variables is not directly explained by
their correlation. Maybe a third, unobserved variable causes both
variables to change. Maybe the two have no relationship whatsoever and
we are observing "artifacts".
On the other hand, if you want to prove the existence of causality, you
have to first prove there is correlation between the independent
variable and the dependent variable.
Independent samples t-test and pair-samples t-test
--------------------------------------------------
It is a common task in statistics to ask whether the difference
observed between two groups of samples is the result of differences in
the populations delineated by a grouping variable, or just due to
chance. If the variable in question is a continuous interval or ratio
variable, and the grouping variable is a nominal or categorical
variable that separates the samples into independent groups, e.g. men
and women, non-smokers and smokers, 3rd-graders and 5th-graders, we can
use the independent samples t-test to see if the differences are
statistically significant.
The basic idea is the same as estimating the probability of the
population parameter falling into a certain interval given the sample
parameters. We take the parameter difference between the two groups of
samples, obtain the t-value by dividing the difference by the standard
error, then look up a probability table to see how likely the t-value
is to result from chance alone.
For example, if we need to calculate whether the difference in the mean
of two groups of samples is significant, we use the following equation:
$$
t = \frac{\bar{X}_1 - \bar{X}_2}{S_e}
$$
\\(S_e\\) here is the standard error of the difference between the means.
From the name you can tell it is a bit more complex than the sample's
standard error of the mean. \\(S_e\\) is calculated as follows:
$$
S_e = \sqrt{S_{X_1}^2 + S_{X_2}^2}
$$
where \\(S_{X_1}\\) and \\(S_{X_2}\\) are the two groups' respective standard
errors of the mean ... if the two groups of samples are similar in size.
In some cases this can be a very big IF. When the two groups differ
greatly in size or variance, or the data are not normally distributed,
you may want to use some non-parametric alternatives such as
Mann-Whitney U test.
When looking up the table, the degree of freedom is the sum of the two
sample sizes minus 2, because you have estimated two parameters, the
means of the two groups:
$$
df = n_1 + n_2 - 2
$$
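**In R**, `t.test` runs the whole procedure; a sketch with made-up
group scores. Note that by default R runs Welch's variant, which does
not assume equal variances, so `var.equal = TRUE` is needed for the
classic test described above:

``` r
g1 <- c(5, 7, 6, 8, 7)  # illustrative scores, group 1
g2 <- c(4, 5, 6, 5, 4)  # illustrative scores, group 2
t.test(g1, g2, var.equal = TRUE)  # classic test, df = n1 + n2 - 2 = 8
t.test(g1, g2)                    # Welch's variant, R's safer default
```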
The paired-samples t-test answers a similar question, but in this case
each individual in one group is paired with an individual in the other
group in some way. For example, if we want to look at the effect of
fathers' TV watching habits on their eldest children, we take
observations of two groups, fathers in one group and their children in
the other, so the sample comprises father-child pairs. Or, in a
longitudinal study, we observe some children when they are 3, then
measure the same indicators again at the age of 7; in this case the
samples are also paired, or dependent.
Again,
$$
t = \frac{\bar{X} - \bar{Y}}{S_e}
$$
\\(S_e\\), the standard error of the difference between dependent sample
means, is even more complex to calculate here. You first calculate the
standard deviation of the difference scores \\(D\\) (each pair's second
score minus its first):
$$
SD = \sqrt{\frac{\sum{D^2} - \frac{(\sum{D})^2}{N}}{N-1}}
$$
and then calculate the standard error in the good old way:
$$
S_e = \frac{SD}{\sqrt{N}}
$$
In this and previous equations, \\(N\\) stands for the number of pairs in
the sample. The degree of freedom in this case is \\(N - 1\\).
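A sketch with made-up before/after pairs, computing t by the formulas
above and checking against the built-in paired test:

``` r
before <- c(10, 12, 9, 14, 11)  # illustrative paired scores
after  <- c(12, 13, 11, 15, 11)
d <- after - before
N <- length(d)
mean(d) / (sd(d) / sqrt(N))                     # t from the formulas above
t.test(after, before, paired = TRUE)$statistic  # the same t, df = N - 1
```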
References
==========
- [Statistics in Plain English,
3rd Edition.](https://amzn.com/041587291X)
Descriptive statistics
----------------------
- [Is there a built-in function for finding the
mode?](http://stackoverflow.com/questions/2547402/is-there-a-built-in-function-for-finding-the-mode)
- [R Tutorial Series: Centering Variables and Generating Z-Scores with
the Scale()
Function](http://www.r-bloggers.com/r-tutorial-series-centering-variables-and-generating-z-scores-with-the-scale-function/)
Inferential statistics
----------------------
- [Student's
  t-distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution)