
The Sabermetric Manifesto
By David Grabiner
I. What is sabermetrics?
Bill James defined sabermetrics as "the search for objective knowledge
about baseball." Thus, sabermetrics attempts to answer objective
questions about baseball, such as "which player on the Red Sox
contributed the most to the team's offense?" or "How many home runs will
Ken Griffey hit next year?" It cannot deal with the
subjective judgments which are also important to the game, such as "Who
is your favorite player?" or "That was a great game."
Since statistics are the best objective record of the game available,
sabermetricians often use them. Of course, a statistic is only useful
if it is properly understood. Thus, a large part of sabermetrics
involves understanding how to use statistics properly, which statistics
are useful for what purposes, and similar things. This does not mean
that you need to know a lot about mathematics to understand
sabermetrics, only that you need to have some idea of how statistics can
be used and misused.
The statistics which are available in baseball are a collected record of
observations. An individual fan, sportswriter, or even a player or
manager will see most teams thirteen or fewer times during the year. His
observations may be of some interest, but they are a small (and often
biased) sample. In thirteen games, the difference between a great hitter
and a poor hitter is just five hits; thus, if the observer happens to
see a mediocre player's two best games of the season, he would get an
incorrect impression of the player's ability.
In contrast, a player's statistics are a record obtained from all of his
games, as observed by the official scorers in the league. This is a
much larger collection of observations, and it is converted to a form
which can be easily understood; few fans could get a good idea of a
player's batting average by watching his 600 plate appearances.
And since sabermetrics is an objective study of the game, it is
necessary to use logical reasoning in sabermetric arguments. Thus, a
hypothesis can be developed from the information you have, either from
statistics or observation; a claim which cannot be directly tested can
be evaluated by studying the conclusions which would follow.
A good example is the statement "Pitching is X% of baseball," which has
been said with X between 15 and 80. Suppose you want to test the claim
"Pitching is 75% of baseball." If this were true, you would conclude
that the teams with the best pitching would be much more likely to win
the pennant than the teams with the best hitting. However, this isn't
the case. The league leaders in fewest runs allowed (which is both
pitching and fielding) win the pennant about half the time; the league
leaders in runs scored (which includes all of hitting) win just as
often. (Note the definition of offense here: if you measure hitting by
an incomplete measure such as batting average, you would conclude that
pitching is much more important.) Other unreasonable conclusions also
follow; for example, a team with 75% of its value in pitching would
never trade a regular pitcher for a regular hitter. Thus the claim must
be rejected. But if 75% is replaced by a number close to 40%, the
conclusions become reasonable. This is how a sabermetric argument
works.
II. General principles
The goal of a baseball team is to win more games than any other team.
Since one team has very little control over the number of games other
teams win, the goal is essentially to win as many games as possible.
Therefore, it is of interest to measure the player's contribution to the
team's wins.
There is a clear relationship between a team's runs scored and allowed
and its wins and losses. This relationship isn't perfect, but it is
very strong. A good formula, determined empirically from the data by
Bill James, is that a team's ratio of wins to losses will be equal to
the square of the ratio between its runs scored and allowed. Thus a
team which scores and allows the same number of runs will win and lose
the same number of games, finishing at .500; a team which scores 800
runs and allows 700 will win 64 games for every 49 it loses, which
projects to a 92-70 record over a season. This formula comes very close
to the actual records of most teams.
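
The squared-ratio rule is easy to check for yourself. Here is a minimal
Python sketch; the function name and the 162-game season length are
assumptions made for illustration, not part of the formula itself.

```python
def pythagorean_record(runs_scored, runs_allowed, games=162):
    """Project a won-lost record from runs, using W/L = (RS/RA)^2."""
    rs2 = runs_scored ** 2
    ra2 = runs_allowed ** 2
    # W/L = rs2/ra2 is the same as a winning percentage of rs2/(rs2+ra2).
    win_pct = rs2 / (rs2 + ra2)
    wins = round(win_pct * games)
    return wins, games - wins


print(pythagorean_record(800, 700))  # the 800-700 example from the text
print(pythagorean_record(750, 750))  # equal runs scored and allowed
```

The first call reproduces the 92-70 projection, and the second gives an
even 81-81 record.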
The basic goal of sabermetrics is to evaluate a measure for a given
purpose. The most common uses of statistics are to evaluate past
performance (such as to determine who should win the MVP award) and to
predict future performance (such as to evaluate a trade that was just
made). In both cases, we are interested in measuring contribution to
games won and lost.
The reasons that such analysis is possible are the same reasons that
make statistics more interesting in baseball than in other sports.
Baseball statistics can measure individual performance, independent of
what other players do. And while the importance of an individual event
depends on the situation, the situations tend to even out over a large
sample such as a season, so they have little effect on the statistic.
When a batter hits a single, this describes what he did; when a
quarterback throws a ten-yard pass, the guard who took out a linebacker
gets no statistical credit. And the batter who hit a single is
properly credited for a success; the ten-yard pass may have been a
failure if it was third down with 13 yards to go. Thus it is reasonable
for the goal of a baseball statistic to be to measure a player's
individual contribution to runs or wins.
Given the goal, it is possible to evaluate a statistic. Baseball
statistics can be evaluated in the same way as non-baseball statistics;
they can have the same types of flaws, or be misused or misinterpreted
in the same ways.
The first natural question to ask about a statistic is, "Does the
statistic measure an important contribution to that goal?" For example,
ERA measures the number of runs a pitcher allows, which is almost all a
pitcher contributes to winning games. Batting average does fairly well
because it counts hits, but it ignores power and walks, which are also
important parts of offense. Few statistics fail badly here; those which
do fail measure things which happen only rarely (such as HBP), have little
to do with winning games (such as the fraction of a batter's outs which are
strikeouts), or both. As a non-baseball example, the number of crimes
in a city last year is important if you want to know something about the
safety of the city; the number of crimes on a single street says very
little about the safety of the whole city.
The second, and usually most important, question to ask is, "How well
does the statistic measure the player's own contribution?" There are
many ways that a statistic, baseball-related or not, can fail here.
Virtually every statistic fails in some way to some extent, so the best
statistics are those with only minor failings, and relatively few of
them.
For example, a player should be evaluated for what he does, not for what
his teammates or manager do. This is a major problem with such
statistics as runs scored. Unless the batter hits a home run or steals
home, he needs his teammates' contribution to actually score a run, and
he cannot do much to cause them to get hits once he is on base. Thus,
if you bat in front of the best home-run hitters in the league, you will
score a lot of runs, whether or not you have a good ability to score
runs. If your manager decides to bat you eighth on an NL team, you
won't score many runs when you do get on base.
Likewise, a good statistic should not measure outside effects over which
the player has no control, such as the park. A good non-baseball
example of this problem is the high death rate in Miami. The population
of Miami is older than the population of most other cities; thus,
regardless of the quality of medical care in Miami, you would expect a
high death rate.
Likewise, it is easier to score runs in Fenway Park than in Oakland.
Therefore, a pitcher with a 3.60 ERA in Oakland could pitch just as well
in Fenway, helping his team win games just as much, but have a 4.00
ERA. You will sometimes see a discussion of park-adjusted numbers,
designed to eliminate this effect; for example, the pitcher above might
have a 3.80 park-adjusted ERA in either park. Note that this is
adjusting for the value of the pitcher's performance, not the actual
performance; the 4.00 ERA for a Red Sox pitcher is just as valuable to
his team regardless of how it is split between home and road games.
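
One simple way to picture a park adjustment is sketched below in Python.
The multiplicative model and the two park factors are assumptions, chosen
only so that they reproduce the 3.60/4.00/3.80 example above; real park
factors have to be estimated from run-scoring data and are not given here.

```python
# Hypothetical park factors: run scoring in the park relative to a neutral
# park. These are back-solved from the example in the text, not measured.
PARK_FACTOR = {
    "Oakland": 3.60 / 3.80,  # below 1.0: harder to score
    "Fenway": 4.00 / 3.80,   # above 1.0: easier to score
}


def park_adjusted_era(era, park):
    """Scale a raw ERA to a neutral park under a simple multiplicative model."""
    return era / PARK_FACTOR[park]


for park, era in [("Oakland", 3.60), ("Fenway", 4.00)]:
    print(park, round(park_adjusted_era(era, park), 2))
```

Under this assumed model, both raw ERAs map to the same neutral-park
figure, which is the point of the adjustment.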
If a player's statistics change considerably when he changes teams,
parks, or lineup positions, this suggests that the outside effect has a
major effect on the statistics. If the statistic remains consistent
when outside conditions change, this means that it is measuring the
player's own contribution. Pitchers with good ERA's tend to keep
them when they change teams, so the park effect is not a serious
problem. Hitters who score a lot of runs in the leadoff spot will score
many fewer runs if they are dropped to sixth in the lineup, which means
that the runs scored were mostly created by the lineup position rather
than the batter.
In addition to these problems with outside effects, there can be
problems with measurement. For example, no statistic can be useful
without proper context, a measure of opportunities. There were more
crimes committed in New York than in Boston last year, but this doesn't
say much about the relative safety of the cities; to make such a
comparison, you would need to compare crime rates.
If a batter has 150 hits, what does that mean? Well, if he has 500
at-bats, he is good at getting hits; if he has 650 at-bats, he is poor.
This is a problem with most counting statistics. Batting average places
hits in a reasonable context, and this is recognized because the batting
title goes to the player with the highest average, not the player with
the most hits.
Similarly, a statistic may not be useful if it tries to measure
something with a very small sample size or number of occurrences. The
best pitchers at throwing shutouts often don't lead the league in
shutouts, because the league leader normally has about five, and it's
quite common for a pitcher who usually throws three shutouts a year to
get seven in one year. In contrast, the best strikeout pitchers do lead
the league in strikeouts (or strikeouts per nine innings), because their
totals are in the hundreds, and a pitcher who is capable of getting 250
strikeouts in 240 innings might get 230, but not 150.
Again, the same problem comes up with non-baseball statistics. If 2/3
of the people polled in your city plan to vote Democratic, that means
nothing if it was four of six, and not much if it was forty of sixty,
but quite a lot if it was 400 of 600.
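
The effect of sample size in the polling example can be made concrete.
The Python sketch below asks how often an evenly split city would produce
a sample that is at least two-thirds Democratic; the evenly split city is
an assumption introduced here for illustration, and the function name is
hypothetical.

```python
from math import comb


def chance_of_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# How often would a 50-50 city give a 2/3 Democratic sample by luck alone?
for k, n in [(4, 6), (40, 60), (400, 600)]:
    print(f"{k} of {n}: {chance_of_at_least(k, n):.2g}")
```

Four of six happens about a third of the time even in a 50-50 city
(22 chances in 64), forty of sixty happens less than 1% of the time, and
400 of 600 essentially never does.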
This is the major flaw with many of the statistics that are often used
on TV; a statistic such as, "Wade Boggs is hitting .154 against
Baltimore pitchers with runners in scoring position" means nothing
because the sample is probably two hits in thirteen at-bats.
Sabermetricians agree with most fans that such stats are ridiculous;
they are there only to hold the interest of the (mostly statistically
illiterate) television audience.
Now, once you have some idea of how well the statistic measures the
player's own contribution to the goal, the final question to ask is, "Is
there a better way to measure the same thing?" A statistic which has
problems with the other questions but has no reasonable alternative
measurement may still be useful. In contrast, a statistic such as runs
scored, which can be replaced by other statistics, is of very little
value. A player's own contribution to his total of runs scored can be
measured by his ability to get on base (measured very well by on-base
percentage) and, to a lesser extent, to advance himself once he gets on
base (measured by extra-base hits, and by stolen bases and caught
stealing).
Now, given these criteria, you can evaluate a statistical conclusion.
If you dispute the conclusion, your argument may be valid if it is based
on these criteria. That is, you need to find something which is not
measured by the statistic, or is measured but shouldn't be. For
example, you can argue that Mike Schmidt is a good hitter, even though
his career average is .267, because he hit 548 home runs and drew 1507
walks. These are valid arguments, because batting average gives the
same value to homers and singles, and does not count walks at all.
Likewise, Ozzie Smith is not a great offensive player, but he is still
an excellent player, because of his defense; no offensive statistic
measures his overall value.
But you cannot dispute a statistical conclusion with a claim which is
based on something which is already included in the statistic, or
something which is improperly measured by your claim. It isn't
reasonable to say that Brooks Robinson was great at getting hits because
of his 2848 hits; the correct measure of how well he got hits is his
.267 batting average, which led to such a high hit total because his
other skills allowed him to have a very long career. Turning one of the
above examples around, you can't claim that Schmidt could not possibly
be a great hitter, despite his .527 SLG, by looking at his batting
average; the batting average is already counted in the slugging average.
III. Sabermetric stats
A good, complete measure of individual offense would satisfy the
criteria above for a valuable statistic better than any of the
traditional offensive measures. Therefore, sabermetricians often use or
develop such statistics. (For measuring pitching, there is less need
for such a statistic, because ERA and runs allowed already count the
number of runs allowed by a pitcher.)
At the team level, a good measure of offense should have a strong
correlation with runs scored. This means that it should be possible to
predict runs scored reasonably well from the measure; the best teams by
this measure should score a lot of runs, while the worst teams should
score very few. Measures such as batting average do not do this; it is
common for the team with the best batting average to be below average in
runs scored. Runs scored itself obviously measures team offense very
well, but it creates a problem when you try to measure individual
contributions; it isn't easy to measure directly how much a batter
helped or hurt his team score runs.
There are several ways to develop a statistic which measures team
offense. Probably the most natural way is to say that a team scores
runs by getting runners on base, and then advancing them. Thus, a
team's runs scored should be proportional to the number of runners it
gets on base, and to the frequency with which it advances the runners.
On-base percentage measures the number of runners on base, while
slugging average is one way to measure advancement. (Note that an out
reduces slugging average, because it makes it less likely that any
runners on base will be advanced.) Thus team runs should be correlated
with OBP*SLG.
The test of a statistic of this type is how well it agrees with
reality. If you compare teams' OBP*SLG to their runs scored, you find a
very good correlation; the standard error is just 24 runs. For
comparison, the standard deviation of runs scored in one season is 70
runs (this is the error you would get if you predicted that all teams
would be average in runs scored), while batting average alone has a
standard error of 54 runs. The 24-run standard error covers everything
which OBP*SLG does not measure or measures improperly; this includes
such factors as baserunning and imperfections in the formula, but much
of the difference is chance.
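
For readers who want to run this kind of test themselves, here is a
minimal Python sketch of one common way to compute such an error: fit a
straight line by least squares and take the root-mean-square of the
misses. The exact definition behind the figures quoted above is not
stated, so treat this as an assumption; the four-team data set is made up
purely for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def fit_standard_error(measure, runs):
    """RMS error when runs are predicted from a least-squares line on measure."""
    mx, my = mean(measure), mean(runs)
    sxx = sum((x - mx) ** 2 for x in measure)
    sxy = sum((x - mx) * (y - my) for x, y in zip(measure, runs))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [y - (intercept + slope * x) for x, y in zip(measure, runs)]
    return sqrt(mean([r * r for r in residuals]))


# Made-up numbers for four imaginary teams, for illustration only.
obp_times_slg = [0.120, 0.130, 0.140, 0.150]
runs_scored = [640, 700, 745, 810]

# Error from predicting that every team is average in runs scored.
print(round(pstdev(runs_scored), 1))
# Error from predicting runs with the fitted line.
print(round(fit_standard_error(obp_times_slg, runs_scored), 1))
```

A useful measure is one whose fitted error is much smaller than the
spread you would get by predicting the average for every team.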
Now, we need to make an individual statistic by measuring a player's
contribution; OBP*SLG is not the correct measure for a player because he
usually doesn't drive himself in. Instead, you want to multiply his OBP
by the team's SLG, and his SLG by the team's OBP. Since the league SLG
(and individual teams' SLG) are usually about 1.2 times the OBP, each
point of a player's OBP has 1.2 times the effect on OBP*SLG that a point
of his SLG has. Thus our measure is (1.2*OBP)+SLG. For simplicity, we
often ignore the factor of 1.2 and refer to OPS, On-base Plus Slugging.
When using this statistic, remember that OBP is slightly undervalued,
and that stolen bases have not been counted.
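
The step from the cross-multiplied form to (1.2*OBP)+SLG can be checked
numerically. The Python sketch below uses a hypothetical team and two
hypothetical players; the function names and all of the numbers are
assumptions for illustration only.

```python
def cross_term_value(obp, slg, team_obp, team_slg):
    """Player OBP times team SLG, plus player SLG times team OBP."""
    return obp * team_slg + slg * team_obp


def weighted_ops(obp, slg, obp_weight=1.2):
    """The (1.2*OBP)+SLG measure; obp_weight=1.0 gives plain OPS."""
    return obp_weight * obp + slg


# Hypothetical team and players, for illustration only.
team_obp = 0.330
team_slg = 1.2 * team_obp  # the "SLG is about 1.2 times OBP" assumption
players = {"on-base type": (0.400, 0.400), "slugger type": (0.320, 0.490)}

for name, (obp, slg) in players.items():
    cross = cross_term_value(obp, slg, team_obp, team_slg)
    # When team SLG is 1.2 times team OBP, cross / team_obp equals the
    # weighted measure, so the two columns printed first should agree.
    print(
        name,
        round(cross / team_obp, 3),
        round(weighted_ops(obp, slg), 3),
        round(weighted_ops(obp, slg, obp_weight=1.0), 3),
    )
```

In this made-up pair, plain OPS favors the slugger (.810 to .800) while
the weighted version favors the on-base hitter (.880 to .874), which is
what it means to say that OPS slightly undervalues OBP.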
Using the same process for other models of offense gives other measures,
which give slightly different values for different elements of offense.
The choice of which measure to use depends on which ones you have handy,
the purpose for which you want to use it, and some personal preferences.
But if you use any well-designed measure of offense, you won't be wrong.
You may find that a player who has two more Runs Created than another is
.003 worse in OPS, but such differences aren't important; either way,
you will reach the reasonable conclusion that they are very close.
The complete measures of offense give a good estimate of the value of
the individual categories, such as walks, home runs, and outs, which
make them up. The value of a player's home runs is the effect that they
have on OPS or any similar statistic, and the importance of home runs
thus depends on this value and their frequency.
IV. Evaluating official statistics
We can now apply the criteria to the official statistics. While it
isn't reasonable to go through the arguments for every statistic, it is
useful to look at the statistics which cause the most frequent
arguments.
RBI's are commonly used as a measure of a player's offense, because they
are the only easily available statistic which looks like a
complete measure. (As a result, the MVP winner is more likely to be the
league leader in RBI than in any other category.) Of course, they
aren't a complete measure; the ability to drive in runs is an important
part of offense, but not the whole thing. This does not make RBI's
meaningless, only incomplete.
But the real problem with RBI's is the second question; they measure a
lot of things which are not the player's own contribution. You cannot
drive in runners who are not on base (except with home runs), but your
own batting doesn't put them there; if you bat behind good players, you
will get a lot of chances. In fact, the league leaders in RBI are much
more likely to be the players who batted with the most teammates on base
or in scoring position (not the batter's contribution) than those who
hit the best with runners on base or in scoring position. Thus RBI are
a better measure of who had the most chances to drive in runners than of
who was the best at driving in runners.
And now, we try the third test; there is a better measure of the ability
to drive in runners. Hits drive runners in from scoring position;
therefore, a player who gets a lot of hits is good at this part of
driving runners in. Likewise, extra-base hits drive runners in from
first base, and home runs drive them in from home plate. Slugging
average measures a player's ability to get hits, extra-base hits, and
home runs, so it measures his ability to drive in runs, with park
effects the only significant bias. Thus RBI's are not useful as a
measure of offense, or even as a measure of the ability to drive in
runs.
The other statistic which is subject to many of the same problems is a
pitcher's won-lost record; we will compare it to ERA. Both measure
something which is clearly important, since a pitcher's goal is to win
games, and the way he does this is by preventing the opponents from
scoring. But both have some problems measuring the pitcher's own
contribution; a comparison of their value depends on these problems.
The first problem is that runs are allowed by the whole defense, not
just by the pitcher. This is slightly more of a problem with W-L; ERA
eliminates runs due to errors, but not due to fielders who are out of
position, run slowly, or make weak throws. At the major-league level,
it isn't a serious problem; good pitchers can still have good ERA's (and
runs allowed) even with teams of poor fielders.
Won-lost record is one of the few categories which is immune to park
effects; there is one win in every game in every park. ERA has a slight
problem with park effects, which makes it more useful with a park
adjustment.
But the most important factor is the effect of the team offense.
Offense has almost no effect on ERA, but it has a considerable effect on
W-L. A game is not won just by the pitcher (despite the name of the
statistic), but by the team which scores more runs than it allows. In a
single season, the pitcher with the best W-L record in the league is
just as likely to be the pitcher with the best run support as the
pitcher with the fewest runs allowed. And the run support is not the
pitcher's contribution (except for batting in the NL). If there were
pitchers who could cause their teammates to score more runs for them, it
would make sense to give the pitchers some of the credit. But this
doesn't happen; there is no tendency for pitchers who had support better
than their team's average in one season to have it again in the
following season. Nor does a pitcher have any control over whether he
gets to pitch on a good offensive team.
Because of the effect of run support, single-season W-L records are not
a good measure of a pitcher's own value. ERA is available, and it is a
better measure of what you actually want to know. However, a career W-L
reduces the luck in run support by using a much larger sample size. In
addition, pitchers rarely spend their full careers with poor or good
teammates. Thus a career W-L record for a long career (several hundred
decisions) is a decent measure of a pitcher's own performance; it's
about as useful as a career ERA without park adjustments.
Since we have now dealt with the most common measures of batting and
pitching, it makes sense to deal with the most common measure of
fielding. Fielding average has its problem with the first test; while
defense is important, an incomplete measure of defense is not. The
league leader in errors at third usually makes about 30; the leader in
fielding average makes about 10. There aren't enough plays to make a
difference of very many runs. The more important part of fielding is
the ability to prevent hits; if the third baseman can't reach a ball in
the hole, or knocks it down but has no play, he won't be charged with an
error, but the batter will get a hit which has the same effect.
Errors are about as useful as a measure of defense as strikeouts are as
a measure of batting average. They measure one way to fail to make a
play; while it is the most obvious failure, all failures count the same
on the scoreboard. A fielder with poor range will be a poor fielder
whether he makes few or many errors, just as a hitter who hits too many
routine grounders or popups can be a poor hitter even though he puts the
ball in play.
While fielding average also has problems with park effects and scorer's
biases, the incompleteness is the most serious problem. Still, since it
does measure something useful, and fielders who are good at other things
tend not to make errors (fielding percentage has a good correlation with
games won), it would be a useful measure in the absence of anything
else. It still has some value, particularly in concluding that players
with very low fielding averages can't handle their positions, but it
should be used in conjunction with putouts, assists, and an attempt to
understand any biases in the numbers.
But for recent players, we have a better measure of overall defense,
Defensive Average (abbreviated DA), which makes fielding average
unnecessary. The basis for DA is a division of the playing field into
zones of responsibility for the fielders. When a ball is hit into a
fielder's zone, it is charged as an opportunity for that fielder; if the
fielder turns it into an out, he receives credit for a play made. Thus,
all ground balls near third base are charged as chances for the third
baseman; a good third baseman will make plays on most of them. If he
fails to make a play, the effect is the same whether his throw is wild
(scored an error) or late (scored a single), so fielding average does
not tell you anything more.
Defensive average should be put to the same tests as any other
statistic. It does reasonably well in the first test. It measures a
player's ability to turn balls in play into outs, which covers most of
his defensive play but not all of it; such skills as turning the double
play and throwing out runners trying to stretch hits are not counted.
It also does well in the second test, although it still has some
problems, mostly with park effects. Pitchers cannot introduce bias
simply by being left-handed (and thus allowing a lot of ground balls to
third base and fly balls to left), but good pitchers may help their
fielders' DA slightly by allowing fewer hard-hit balls. Fielders do not
have a great effect on each other's DA, although there will be a small
effect for plays such as the low throws that a good first baseman can
handle. (All of these effects will cause problems with almost any
measure of fielding.) And for the third test, DA is the best measure of
the ability to make the play in the field that we have; it isn't
perfect, but it is complete enough and accurate enough to be useful.
Thus the established statistics, used for reasons of tradition, may
be good measures (such as ERA) or poor measures (such as RBI's). Their
value does not depend on their tradition or their names; it depends on
how well they meet the basic tests of any statistic.
V. Other sabermetric arguments
Similar analysis must also be used in evaluating a hypothesis which
depends on a statistical argument. If the hypothesis leads to
conclusions which don't correspond with the real game of baseball, then
it needs to be revised.
For example, a natural question in predicting a player's future
performance in the major leagues is how useful his minor-league numbers
will be in a prediction. There are problems with using minor-league
numbers because there are extreme park effects and differences between
leagues. However, once you adjust a player's minor-league numbers for
these effects, and then make a specific adjustment for the difference
between AA or AAA ball and the majors, you may have something
meaningful. There is a method for making these corrections; the result
is called the MLE, Minor-League Equivalency. This will be useful if it
works, tested against the real world. In fact, it works almost as well
as past major-league performance in predicting future major-league
performance. Most players with MLE's which say they will hit .300 will
hit close to .300 as rookies, just as most players who hit .300 last
year will. (Of course, neither prediction is perfect.)
Another issue which sabermetricians have studied (and often discussed)
is the existence of clutch hitters. Clutch hits themselves certainly
exist; when Bobby Thomson, Carlton Fisk, Bucky Dent, Kirk Gibson, and
Joe Carter hit their famous home runs, or when Ken Griffey singled in
the tying run in the eighth inning in May, they got hits when it was
important. But many players have reputations as players who will hit
their best with the game on the line, and this is a hypothesis which can
be tested; are there any players with such an ability?
Again, it is necessary to look at what actually happens, and what would
happen if there were no clutch ability at all or if clutch hitting were a
significant ability. Even if a .250 hitter were just a pair of coins
which got a hit when they were both heads, some .250 hitters would hit
.350 during one season in the late innings of close games (a 3% chance
in 80 AB), so the existence of such numbers doesn't prove anything. But
if there is an ability, players who hit well in the clutch in the past
will continue to do so. This can be tested, and has been; there is only
very weak evidence of an ability, and it is clear that whatever ability
there is does not mean much in baseball terms. There may be .267
hitters who are actually as valuable as .268 hitters because of their
good clutch numbers, but if you replace .268 with .275, you have a
conclusion which is inconsistent with what actually happens.
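
The coin-flip argument above can be checked directly. This Python sketch
treats every at-bat as an independent trial with a fixed .250 chance of a
hit, which is the simplifying assumption of the coin model; the function
name is hypothetical.

```python
from math import comb


def chance_of_hot_streak(hits, at_bats, true_avg=0.250):
    """Probability a hitter with true_avg gets at least `hits` in `at_bats`."""
    return sum(
        comb(at_bats, k) * true_avg**k * (1 - true_avg) ** (at_bats - k)
        for k in range(hits, at_bats + 1)
    )


# Chance that a true .250 hitter reaches various averages in 80 at-bats.
for hits in (24, 28, 32):
    print(f"{hits}/80 = {hits / 80:.3f}: {chance_of_hot_streak(hits, 80):.3%}")
```

By this calculation, a .350 average in 80 clutch at-bats (28 hits) turns
up roughly 3% of the time by luck alone, so in a league with hundreds of
hitters a few such seasons are expected every year.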
VI. Conclusion
Baseball statistics are useful only if they enhance your understanding
of the game. Therefore, they should be judged by how well they measure
what actually happens in the game. Meaningless statistics should be
ignored or replaced; deficient statistics should be improved. And
well-designed statistics should be used as an important part of
discussion about the game and its players.
Bibliography
Bill James, _The Baseball Abstract_, published annually from 1982 to
1988 by Ballantine Books.
John Thorn and Pete Palmer, _The Hidden Game of Baseball_, New York:
Doubleday, 1985.
John Thorn and Pete Palmer, eds., _Total Baseball_, New York:
HarperCollins, 1993.