Philip Mayfield
Athletes and their purported use of performance-enhancing drugs have dominated the sporting news in the last few months. The controversy seems to be particularly heated in the sport of baseball, with the Mitchell Report naming many famous players. Of particular interest is the accusation by Brian McNamee that Roger Clemens used performance-enhancing drugs to increase his performance. As statistics are readily available in the sport of baseball, I decided to perform a statistical analysis of Mr. Clemens’ performance before and after his alleged use of performance-enhancing drugs.
The field of probability and statistics has formal tests which can be
used to determine if an average has changed. These tests are called
“Hypothesis Tests” and can be used to help understand whether Roger
Clemens’ performance changed before and after the period
of alleged drug use. Before I explain the test, let me explain why we
need formal Hypothesis tests.
Many people point to Mr. Clemens’ statistics and highlight that his best
ERA (earned run average) was 1.87 in 2005. This statement is offered as
evidence that when Mr. Clemens should have been aging and getting worse,
he was the best he had ever been. The problem with this statement is
that it doesn’t take into account the natural variation in a process.
Almost all data has some form of variation. If you don’t believe me, go
outside and throw a baseball, football, or whatever kind of ball you
prefer as far as you can 10 times. If you measure the distance of each
throw, you will find that each of them goes a different distance.
Additionally – and here is a key point – one of the 10 throws will be
the longest. The problem is that we as humans tend to see this single
point and draw conclusions that are not necessarily valid. For example,
if the longest throw was one of the later ones, we might say that “we
were just getting warmed up”. If the longest was one of the
earlier throws, we might say that “our arm got tired at the end”.
However, it is possible that the longest throw was simply random.
Perhaps there isn’t anything “different” about the throw; it was just
one throw out of 10 that happened to go the farthest. We don’t tend to
think this way. We want to be able to find assignable causes in data so
that changes in performance can be explained, and are therefore not
random.
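The throwing experiment can be imitated with a short simulation. This is only a sketch under stated assumptions: the average distance of 100 and the spread of 8 are made-up numbers, and every throw is drawn from the same distribution, so nothing about the thrower ever changes. Even so, one throw is always the longest, and its position in the sequence turns out to be spread roughly evenly across all 10 throws.

```python
import random
from collections import Counter

def longest_throw_positions(trials=10_000, throws=10,
                            mean_distance=100.0, spread=8.0, seed=1):
    """Count how often each throw (1..throws) is the longest of its session.

    mean_distance and spread are hypothetical values chosen for
    illustration; every throw comes from the same normal distribution.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(trials):
        distances = [rng.gauss(mean_distance, spread) for _ in range(throws)]
        longest_index = distances.index(max(distances))
        counts[longest_index + 1] += 1  # report positions as 1-based
    return counts

counts = longest_throw_positions()
for position in range(1, 11):
    share = counts[position] / sum(counts.values())
    print(f"throw {position:2d} was the longest in {share:.1%} of sessions")
```

Each position ends up being the longest about one time in ten, which is what we would expect if neither "warming up" nor "getting tired" had any effect.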
What do I mean by an assignable cause? Go outside and throw 10 more
balls, but this time throw with your non-dominant hand (your left hand,
if you are right-handed). Unless you are different from the vast
majority of the people on the planet, there will be a large difference
between the distances you threw with each hand. In this case, the
assignable cause is that you changed hands. Switching away from your
dominant hand changed the distance that the ball went.
In the case of Roger Clemens, the assignable cause would be the
purported use of performance-enhancing drugs. Put more simply, when Mr.
Clemens was allegedly taking performance-enhancing drugs, did this make
him pitch better? In order to test this theory, we can perform a formal
hypothesis test.
To perform a hypothesis test, we start with two mutually exclusive hypotheses. Here’s an example: when someone is accused of a crime, we put them on trial to determine their innocence or guilt. In this classic case, the two possibilities are the defendant is not guilty (innocent of the crime) or the defendant is guilty. This is classically written as…
H0: Defendant is Innocent ← Null Hypothesis
H1: Defendant is Guilty ← Alternate Hypothesis
Unfortunately, our justice systems are not perfect. At times, we let the guilty go free and put the innocent in jail. The conclusion drawn can be different from the truth, and in these cases we have made an error. The table below has all four possibilities. Note that the columns represent the “True State of Nature” and reflect if the person is truly innocent or guilty. The rows represent the conclusion drawn by the judge or jury.

Two of the four possible outcomes are correct. If the truth is they are
innocent and the conclusion drawn is innocent, then no error has been
made. If the truth is they are guilty and we conclude they are guilty,
again no error. However, the other two possibilities result in an error.
A Type I (read “Type one”) error is when the person is truly innocent
but the jury finds them guilty. A Type II (read “Type two”) error is
when a person is truly guilty but the jury finds them innocent. Many
people find the distinction between the two types of error unnecessary
at first; perhaps we should just label them both as errors and get on
with it. However, the distinction between the two types is extremely
important. When we commit a Type I error, we put an innocent person in
jail. When we commit a Type II error, we let a guilty person go free.
Which error is worse? The generally accepted position of society is that
a Type I error, putting an innocent person in jail, is far worse than a
Type II error, letting a guilty person go free. In fact, in the United
States our burden of proof in criminal cases is established as “beyond
a reasonable doubt”.
Another way to look at Type I vs. Type II errors is that a Type I error is overreacting (acting on a change that isn’t real) and a Type II error is underreacting (missing a change that is real).
In statistics, we want to quantify the probability of a Type I and Type II error. The probability of a Type I error is α (Greek letter “alpha”) and the probability of a Type II error is β (Greek letter “beta”). Without slipping too far into the world of theoretical statistics and Greek letters, let’s simplify this a bit. What if I said the probability of committing a Type I error was 20%? A more common way to express this would be that we stand a 20% chance of putting an innocent man in jail. Would this meet your requirement for “beyond a reasonable doubt”? At 20% we stand a 1 in 5 chance of committing an error. To me, this is not sufficient evidence, and so I would not conclude that the defendant is guilty.
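To make α concrete, here is a minimal simulation sketch. It assumes two groups of 10 measurements drawn from the same distribution (so the "defendant" is always innocent) and a pooled two-sample t statistic compared against 2.101, the standard two-sided 5% cutoff for 18 degrees of freedom. The group size, the distribution, and the function names are illustrative choices, not part of the original analysis.

```python
import math
import random

# Two-sided 5% cutoff of Student's t with 18 degrees of freedom.
# It is only valid for two groups of 10; change n and it must change too.
CRITICAL_T = 2.101

def pooled_t(a, b):
    """Two-sample t statistic using a pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def false_alarm_rate(trials=20_000, n=10, seed=7):
    """Fraction of trials that declare a difference when none exists."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(trials):
        # Both groups come from the same process: the null hypothesis is true.
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(pooled_t(a, b)) > CRITICAL_T:
            alarms += 1
    return alarms / trials

print(f"share of innocent cases declared guilty: {false_alarm_rate():.3f}")
```

The printed share should land close to 0.05, which is the chosen α: with this cutoff, about 1 truly unchanged comparison in 20 is wrongly declared a change.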
The formal calculation of the probability of Type I error is critical in the field of probability and statistics. However, the term "Probability of Type I Error" is not reader-friendly. For this reason, for the duration of the article, I will use the phrase "Chances of Getting it Wrong" instead of "Probability of Type I Error". I think most people would agree that putting an innocent person in jail is "Getting it Wrong", and the phrase is easier to relate to. To help you get a better understanding of what this means, the table below shows some possible values for getting it wrong.
Chances of Getting it Wrong
| Percentage | Chances of sending an innocent man to jail |
| 20% Chance | 1 in 5 |
| 5% Chance | 1 in 20 |
| 1% Chance | 1 in 100 |
| 0.01% Chance | 1 in 10,000 |
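The conversion behind this table is simple arithmetic: a chance of p percent is the same as 1 in 100/p. A small sketch follows, where the helper name `one_in_n` is just an illustrative choice and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

def one_in_n(percent: str) -> int:
    """Convert a percentage written as text (e.g. "5") to N in "1 in N"."""
    probability = Fraction(percent) / 100
    return round(1 / probability)

for percent in ["20", "5", "1", "0.01"]:
    print(f"{percent}% chance = 1 in {one_in_n(percent):,}")
```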

Roger Clemens Alleged Drug Use Periods
| Before Alleged Drug Use | 1984 to 1997 |
| After Alleged Drug Use | 1998 to 2005 |
We need a better way to define “pitched better”; luckily, baseball keeps
a wealth of statistics on pitchers and batters to give us a quantitative
assessment. For pitchers, the most commonly used statistic seems to be
ERA (earned run average). The lower the
ERA, the better the pitcher. There are other statistics, such as ERA+,
WHIP, and win percentage which we will get to in a moment. Mr. Clemens’
ERA before alleged drug use is 3.09 and his ERA after alleged drug use
is 3.45. Remembering that a lower ERA is better, his performance after
the alleged use is worse than before. The question still remains: did
Mr. Clemens’ performance change (for better or worse) after the alleged
drug use? Is the difference in ERA from 3.09 to 3.45 due to some
assignable cause or is it simply random variation? For this data, the
hypothesis test is defined as…
H0: Mr. Clemens’ average ERA was the same before and after
H1: Mr. Clemens’ average ERA was different after alleged drug use

The hypothesis test for this type of data is called a “t-Test”. A t-Test
is commonly used to determine if two different data sets have a
different average. In our example, we would like to know if the average
ERA is different before and after the alleged drug use. Using Mr.
Clemens’ ERA data before and after alleged drug use, the chances of
getting it wrong are 35%. (If you are interested in the data behind this
article or how to calculate the probability of Type I error, see Raw
Data and Probability Calculations.) If we conclude that Mr. Clemens’ ERA
changed before and after 1998, we would have a 35% chance of being wrong,
or roughly a 1 in 3 chance of being incorrect. Most scientists require a
level of proof such that the chances of getting it wrong are less than
5% before they will conclude that there is a difference in average. A
35% chance of getting it wrong is too large, so I would
conclude that there was no difference in performance. A simple graph
called a dot plot can help us compare Mr. Clemens’ performance before
and after 1998.
In the graph below, the blue dots represent Mr. Clemens’ ERA in the
years before 1998, while the green triangles represent his ERA from
1998 onward. Visually, it does not appear that there is a
difference in the average ERA, and the t-Test confirms this.

Based upon this analysis, I would conclude that Mr. Clemens’ average ERA did not change before and after 1998 and that any differences were due to random variation.
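For readers who want to see the mechanics, below is a minimal sketch of a two-sample t-Test using only the Python standard library. Several things here are assumptions rather than the method behind the numbers above: it uses the pooled (equal-variance) form of the test, it computes the two-sided probability by numerically integrating the t distribution, and the two short lists at the bottom are invented placeholder values, not Mr. Clemens' season ERAs.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=2000):
    """P(|T| >= |t|), found by Simpson's rule over [-|t|, |t|]."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = 2 * t / steps  # steps must be even for Simpson's rule
    total = t_pdf(-t, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(-t + i * h, df)
    return max(0.0, 1.0 - total * h / 3)

def pooled_t_test(before, after):
    """Return (t statistic, degrees of freedom, two-sided probability)."""
    n1, n2 = len(before), len(after)
    m1, m2 = sum(before) / n1, sum(after) / n2
    ss1 = sum((x - m1) ** 2 for x in before)
    ss2 = sum((x - m2) ** 2 for x in after)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df, two_sided_p(t, df)

# Invented placeholder values for illustration only (NOT real season ERAs).
example_before = [3.1, 2.9, 3.4, 2.6, 3.8, 3.0]
example_after = [3.5, 3.2, 3.9, 2.8, 3.6]
t, df, p = pooled_t_test(example_before, example_after)
print(f"t = {t:.2f}, df = {df}, chances of getting it wrong = {p:.0%}")
```

The last value returned is the figure this article calls the chances of getting it wrong. With real data, each list would hold one ERA per season for the corresponding period.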
While Mr. Clemens’ ERA doesn’t appear to have changed, we can get a
clearer picture if we look at statistics other than ERA. Pitchers are
also evaluated using the statistic Adjusted ERA+ which adjusts the ERA
for ballparks. Since some ballparks favor batters and others pitchers,
the ERA+ statistic was created to adjust for this potential bias and
normalize pitchers in a more equitable manner. An ERA+ of 100 means that
a pitcher performed equal to the average pitcher, with any value over
100 being better than average and any value under 100 being worse than
average. Note that for the raw statistic ERA lower is better, and for
ERA+ higher is better. We can also use Walks Plus Hits Per Inning
Pitched (WHIP), which is yet another baseball statistic. The lower the
WHIP, the better the pitcher. The table below has the before and after
analysis for Mr. Clemens and the associated chances of getting it wrong
(Type I error). While Mr. Clemens’ performance was slightly worse in the
after years, the difference is very small and likely the result of
random variation.
Roger Clemens Pitching Statistics Before and After Alleged Drug Use
| Statistic | Before (1984-1997) | After (1998-2005) | Chances of Getting it Wrong (Type I Error) | Conclusion |
| ERA (lower better) | 3.09 | 3.45 | 35% | No change in performance |
| Adjusted ERA+ (higher better) | 152 | 140 | 49% | No change in performance |
| WHIP (lower better) | 1.168 | 1.227 | 35% | No change in performance |
Based upon the analysis of Roger Clemens' ERA, Adjusted ERA+, and WHIP statistics, there is insufficient statistical evidence to suggest that his average performance changed in the years before and after the alleged use of performance-enhancing drugs.
This analysis is limited in scope to Mr. Clemens’ performance in the
years prior to and after alleged drug use. I am sure many will argue
that his performance should have dropped in his later years due to the
natural effects of aging. In fact, Mr. Clemens’ performance did drop;
however, the drop was not statistically significant and it appears that
his performance before and after alleged drug use was approximately the
same.
Is it possible that Mr. Clemens took performance-enhancing drugs? Yes.
Assuming for the moment that he did take performance-enhancing drugs,
did it improve his performance over previous years? No. There is
little, if any, evidence that Roger Clemens' performance improved in the
years after the alleged use. Put another way, if Mr. Clemens did take
performance-enhancing drugs, he should get his money back.
Many people will likely disagree with the years that I chose to analyze
in Mr. Clemens’ records. In this section, I will explain my
rationale for picking the dates of before and after alleged drug use.
The more important concept is that I picked the dates and then afterward
performed the statistical analysis. This is distinctly different from
looking through the players’ statistics and then picking which years to
include.
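Why the order matters can be shown with a simulation sketch. It assumes 22 seasons of purely random "ERA" values (the mean of 3.0 and spread of 0.6 are made up), so there is never a real change. It then compares two strategies: testing one split chosen in advance (14 seasons before, 8 after) versus trying every split and keeping the most extreme result. Both use a pooled t statistic against 2.086, the standard two-sided 5% cutoff for 20 degrees of freedom.

```python
import math
import random

# Two-sided 5% cutoff of Student's t with 20 degrees of freedom
# (22 seasons split into two groups, pooled test).
CRITICAL_T = 2.086

def pooled_t(a, b):
    """Two-sample t statistic using a pooled (equal-variance) estimate."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (na + nb - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def compare_strategies(trials=5_000, seasons=22, fixed_split=14,
                       min_side=3, seed=11):
    """Return false-alarm rates for (split picked first, split picked after)."""
    rng = random.Random(seed)
    fixed_hits = scanned_hits = 0
    for _ in range(trials):
        # Hypothetical seasons: the same random process every year.
        era = [rng.gauss(3.0, 0.6) for _ in range(seasons)]
        if abs(pooled_t(era[:fixed_split], era[fixed_split:])) > CRITICAL_T:
            fixed_hits += 1
        # "Look first, then pick": keep the most extreme split point.
        best = max(abs(pooled_t(era[:k], era[k:]))
                   for k in range(min_side, seasons - min_side + 1))
        if best > CRITICAL_T:
            scanned_hits += 1
    return fixed_hits / trials, scanned_hits / trials

fixed_rate, scanned_rate = compare_strategies()
print(f"split chosen before looking: {fixed_rate:.1%} false alarms")
print(f"split chosen after looking:  {scanned_rate:.1%} false alarms")
```

Choosing the split in advance keeps the chances of getting it wrong near the intended 5%, while searching for the most favorable split raises them well above that, even though nothing in the simulated data ever changed.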
According to the Mitchell Report, Brian McNamee claims to have given Mr.
Clemens steroids in 1998 and human growth hormone (HGH) in 2000 and
2001. I made the assumption that the benefits of these drugs would not
be instant on/instant off. In other words, if Mr. Clemens did take HGH
in 2000 and 2001 he would continue to see performance gains from this
into 2002 and on. Perhaps the benefits of HGH would subside quickly or
perhaps they would continue for years. Mr. Clemens didn’t play a full
season in the year 2006, and therefore this made a convenient break
point. Undoubtedly some will want to analyze the data using a smaller
period, perhaps stopping in the year 2002 or 2003. I should note that
Mr. Clemens’ career best ERA was in 2005. The inclusion of the 2005 data
improves his “after” statistics and yet he still didn’t have a
performance increase.
If you are statistically inclined you may have some additional
questions. The following section will likely be useful.
My motivation for writing this article is to provide an interesting
example of Hypothesis testing. I am not a baseball fan and frankly wish
this would all come to an end so we can get more football news coverage.
I have been to one Major League baseball game which was in Chicago in
2006 and happened to coincide with a family vacation. We left after 4
innings. I do not know Roger Clemens nor am I associated with him in
any way.
Copyright © 2008 SigmaZone.com. All Rights Reserved.
Raw Data and Probability Calculations
Copyright (C) 2007 Digital Computations, Inc. All Rights Reserved.