Calculating Type I Probability
Philip Mayfield
I have had many requests to explain the math behind
the statistics in the article
Roger
Clemens and a Hypothesis Test. The math is usually handled by
software packages, but in the interest of completeness I will explain
the calculation in more detail.
A t-Test provides the probability of making a Type I
error (getting it wrong). If you are familiar with Hypothesis testing,
then you can skip the next section and go straight to
t-Test hypothesis.
Hypothesis Testing
To perform a hypothesis test, we start with two mutually exclusive
hypotheses. Here’s an example: when someone is accused of a crime, we
put them on trial to determine their innocence or guilt. In this classic
case, the two possibilities are the defendant is not guilty (innocent of
the crime) or the
defendant is guilty. This is classically written as…
H0: Defendant is
Not Guilty ← Null Hypothesis
H1: Defendant is
Guilty ← Alternate Hypothesis
Unfortunately, our justice systems are not perfect. At times, we let the
guilty go free and put the innocent in jail. The conclusion drawn can be
different from the truth, and in these cases we have made an error. The
table below has all four possibilities. Note that the columns represent
the “True State of Nature” and reflect if the person is truly innocent
or guilty. The rows represent the conclusion drawn by the judge or jury.

Two of the four possible outcomes are correct. If the truth is they are
innocent and the conclusion drawn is innocent, then no error has been
made. If the truth is they are guilty and we conclude they are guilty,
again no error. However, the other two possibilities result in an error.
A Type I (read “Type one”) error is when the person is truly innocent
but the jury finds them guilty. A Type II (read “Type two”) error is
when a person is truly guilty but the jury finds him/her innocent. Many
people find the distinction between the types of errors as unnecessary
at first; perhaps we should just label them both as errors and get on
with it. However, the distinction between the two types is extremely
important. When we commit a Type I error, we put an innocent person in
jail. When we commit a Type II error, we let a guilty person go free.
Which error is worse? The generally accepted position of society is that
a Type I Error or putting an innocent person in jail is far worse than a
Type II error or letting a guilty person go free. In fact, in the United
States our burden of proof in criminal cases is established as “Beyond
reasonable doubt”.
Another way to look at Type I vs. Type II errors is
that a Type I error is the probability of overreacting and a Type II
error is the probability of under reacting.
In statistics, we want to quantify the probability of a Type I and Type
II error. The probability of a Type I Error is α (Greek letter “alpha”)
and the probability of a Type II error is β (Greek letter “beta”).
Without slipping too far into the world of theoretical statistics and
Greek letters, let’s simplify this a bit. What if I said the probability
of committing a Type I error was 20%? A more common way to express this
would be that we stand a 20% chance of putting an innocent man in jail.
Would this meet your requirement for “beyond reasonable doubt”? At 20%
we stand a 1 in 5 chance of committing an error. To me, this is not
sufficient evidence and so I would not conclude that he/she is guilty.
The formal calculation of the probability of Type I error is critical
in the field of probability and statistics. However, the term
"Probability of Type I Error" is not reader-friendly. For this
reason, for the duration of the article, I will use the term "Chances of
Getting it Wrong" instead of "Probability of Type I Error". I think that
most people would agree that putting an innocent person in jail is
"Getting it Wrong" as well as being easier for us
to relate to. To help you get a better understanding of what this means,
the table below shows some possible values for getting it wrong.
Chances of
Getting it Wrong
(Probability of Type I Error)
|
| Percentage |
Chances of sending an innocent man to jail |
| 20% Chance |
1 in 5 |
| 5% Chance |
1 in 20 |
| 1% Chance |
1 in 100 |
| .01% Chance |
1 in 10,000 |
t-Test Hypothesis
A t-Test is the hypothesis test used to compare two
different averages. There are other hypothesis tests used to compare
variance (F-Test), proportions (Test of Proportions), etc. In the case
of the Hypothesis test the hypothesis is specifically:
H
0: µ
1= µ
2 ← Null Hypothesis
H
1: µ
1<> µ
2 ←
Alternate Hypothesis
The Greek letter µ (read "mu") is used to describe the
population average of a group of data. When the null hypothesis states µ
1= µ
2,
it is a statistical way of stating that the averages of dataset 1 and dataset 2
are the same. The
alternate hypothesis, µ
1<> µ
2, is that the averages
of dataset 1 and 2 are different. When you do a formal hypothesis test,
it is extremely useful to define this in plain language. For our
application, dataset 1 is Roger Clemens' ERA before the alleged use of
performance-enhancing drugs and dataset 2 is his ERA after alleged use.
For this specific application the hypothesis can be stated:
H
0: µ
1= µ
2 "Roger
Clemens' Average ERA before and after alleged drug use is the same"
H
1: µ
1<> µ
2 "Roger Clemens'
Average ERA is different after alleged drug use"
It is helpful to look at the null hypothesis as the default conclusion
unless we have sufficient data to conclude otherwise. In the case of the
criminal trial, the defendant is assumed not guilty (H
0:Null
Hypothesis = Not Guilty) unless we have sufficient evidence to show that the
probability of Type I Error is so small that we can conclude the
alternate hypothesis with minimal risk. How much risk is acceptable?
Frankly, that all depends on the person doing the analysis and is
hopefully linked to the impact of committing a Type I error (getting it
wrong). For example, in the criminal trial if we get it wrong, then we
put an innocent person in jail. For this application, we might want the
probability of Type I error to be less than .01% or 1 in 10,000 chance.
For applications such as did Roger Clemens' ERA change, I am willing to
accept more risk. I am willing to accept the alternate
hypothesis if the probability of Type I error is less than 5%. A 5%
error is equivalent to a 1 in 20 chance of getting it wrong. I should note
one very important concept that many experimenters do incorrectly. I set
my threshold of risk at 5% prior to calculating the probability of Type
I error. If the probability comes out to something close but greater
than 5% I should reject the alternate hypothesis and conclude the null.
Calculating The Probability of a Type I Error
To calculate the probability of a Type I Error, we calculate the t
Statistic using the formula below and then look this up in a t
distribution table.
Where y with a small bar over the top (read "y bar") is the average
for each dataset, S
p is the pooled standard deviation, n
1
and n
2 are the sample sizes for each dataset, and S
12
and S
22 are the variances for each dataset.
This is a little vague, so let me flesh out the details a little for
you.
What if Mr. Clemens' ERA was exactly the same in the before alleged
drug use years as after? For example, what if his ERA before was
3.05 and his ERA after was also 3.05? If this were the case, we
would have no evidence that his average ERA changed before and after.
What if his average ERA before the alleged drug use years was 10 and his
average ERA
after the alleged drug use years was 2? In this case there would be
much more evidence that this average ERA changed in the before and
after years. The difference in the averages between the two data sets
is sometimes called the signal. The greater the difference, the more
likely there is a difference in averages. However, the signal
doesn't tell the whole story; variation plays a role in this as
well.
If the datasets that are being compared have a great deal of
variation, then the difference in averages doesn't carry as much
weight. For
example, let's look at two hypothetical pitchers' data below.
|
|
|
|
|
|
|
|
|
Mr. HotandCold |
Mr. Consistent |
|
|
|
Before |
After |
Before |
After |
|
|
|
4.78 |
3.45 |
3.34 |
2.84 |
|
|
|
1.23 |
1.22 |
3.24 |
2.82 |
|
|
|
3.45 |
4.56 |
3.32 |
2.86 |
|
|
|
1.02 |
4.22 |
3.28 |
2.78 |
|
|
|
4.55 |
1.09 |
3.22 |
2.77 |
|
|
|
3.22 |
3.44 |
3.22 |
2.83 |
|
|
|
3.44 |
1.29 |
3.34 |
2.78 |
|
|
4.55 |
3.23 |
3.24 |
2.83 |
|
|
Average |
3.28 |
2.81 |
3.28 |
2.81 |
|
| Range |
3.76 |
3.47 |
0.12 |
0.09 |
|
|
|
|
|
|
|
|
Mr. "HotandCold" has an average ERA of 3.28 in the before years
and 2.81 in the after years, which is a difference of approximately
.47. Does this imply that the pitcher's average has truly changed or
could the difference just be random variation? Looking at his data
closely, you can see that in the before years his ERA varied from
1.02 to 4.78 which is a difference (or Range) of 3.76 (4.78 - 1.02 =
3.76). In the after years his ERA varied from 1.09 to 4.56 which is
a range of 3.47.
Let's contrast this with the data for Mr. Consistent. Note that both
pitchers have the same average ERA before and after. However, look
at the ERA from year to year with Mr. Consistent. The range of ERAs
for Mr. Consistent is .12 in the before years and .09 in the after
years.
Both pitchers' average ERA changed from 3.28 to 2.81 which is a
difference of .47. However, Mr. Consistent's data changes very
little from year to year. If you find yourself thinking that it
seems more likely that Mr. Consistent has truly had a change in
mean, then you are on your way to understanding variation. In the
before years, Mr. Consistent never had an ERA below 3.22 or greater
than 3.34. In the after years, Mr. Consistent never had an ERA
higher than 2.86. There is much more evidence that Mr. Consistent
has truly had a change in the average rather than just random
variation. As for Mr. HotandCold, if he has a couple of bad years
his after ERA could easily become larger than his before.
The difference in the means is the "signal" and the amount of
variation or range is a way to measure "noise". The greater the
signal, the more likely there is a shift in the mean. The lower the
noise, the easier it is to see the shift in the mean. The t-Statistic
is a formal way to quantify this ratio of signal to noise. The
actual equation used in the t-Test is below and uses a more formal
way to define noise (instead of just the range). The theory behind
this is beyond the scope of this article but the intent is the same.
The larger the signal and lower the noise the greater the chance the
mean has truly changed and the larger t will become.
Roger Clemens' ERA data for his Before and After alleged
performance-enhancing drug use is below. At the bottom is the
calculation of t. You can also download the Excel workbook with the
data
here.
|
|
|
|
|
|
|
ERA |
|
|
|
Before |
After |
|
|
|
4.32 |
2.65 |
|
|
|
3.29 |
4.6 |
|
|
|
2.48 |
3.7 |
|
|
|
2.97 |
3.51 |
|
|
|
2.93 |
4.35 |
|
|
|
3.13 |
3.91 |
|
|
|
1.93 |
2.98 |
|
|
|
2.62 |
1.87 |
|
|
|
2.41 |
|
|
|
|
4.46 |
|
|
|
|
2.85 |
|
|
|
|
4.18 |
|
|
|
|
3.63 |
|
|
|
|
2.05 |
|
|
|
|
|
|
|
|
Average |
3.08929 |
3.44625 |
|
|
Standard Deviation |
0.80819 |
0.9077 |
|
|
Count |
14 |
8 |
|
|
|
|
|
|
|
Signal |
0.35696 |
|
|
| Sp |
0.84435 |
|
|
|
Noise |
0.37422 |
|
|
| t |
0.95389 |
|
|
|
p-value |
0.35153 |
|
|
|
|
|
|
|
The t statistic for the average ERA before and after is
approximately .95. The last step in the process is to calculate the
probability of a Type I error (chances of getting it wrong). Most
statistical software and industry in general refers to this a "p-value".
The probability of committing a Type I error (chances of getting it
wrong) is commonly referred to as p-value by statistical software.
A famous statistician named
William
Gosset was the first to determine a way to calculate the probability
of Type I error (p-value) from a t statistic. His work is commonly
referred to as the t-Distribution and is so commonly used that it is
built into Microsoft Excel as a worksheet function. The Excel function "TDist"
returns a p-value for the t-distribution. The syntax for the Excel
function is "=TDist(x,degrees of freedom, Number of tails)" where...
x = the calculated value for t
degrees of freedom = n
1 + n
2 -2
number of tails = 2 (two sided test)
Thus, if we enter =TDist(.95389,14+8-2,2) into Excel, the resulting
p-value will be .35153.
Conclusion
The
calculated p-value of .35153 is the probability of committing a Type I
Error (chance of getting it wrong). A p-value of .35 is a high probability of
making a mistake, so we can not conclude that the averages are different
and would fall back to the null hypothesis that Mr. Clemens' average ERAs
before and after are the same. As an exercise, try calculating the
p-values for Mr. HotandCold and Mr. Consistent; you should get .524 and
.000000000004973 respectively.

Additional Notes
The t-Test makes the assumption that the data is normally distributed.
If the data is not normally distributed, than another test should be
used.
This example was based on a two sided test. In a two sided test, the
alternate hypothesis is that the means are not equal. You can also
perform a single sided test in which the alternate hypothesis is that
the average after is greater than the average before. In this case, you
would use 1 tail when using TDist to calculate the p-value.
Copyright © 2008 SigmaZone.com. All Rights Reserved.
Read More Articles
Copyright (C) 2007 Digital Computations, Inc. All Rights Reserved.