From: rmg3@access2.digex.net (Robert Grumbine)
Date: 1997/10/26
Message-ID: <630330$6p4@access2.digex.net>
Usenet review of Health and Amenity Effects of Warming, by Thomas Gale
Moore
Review By Robert W. Grumbine
Some notes:
I wrote this over a year ago, so far back in time that
Mr. Moore was still a regular participant in sci.environment.
My original intention was to place this and supporting figures
on my web page. I eventually became so frustrated with the
level of Mr. Moore's paper that I ceased working on both the
review and on the web illustrations. If you see links below
to figures, do not assume that they exist. They probably don't.
The review is of a by now ancient paper, and an old edition of
that paper. My attempts to get a current copy of the paper
have failed as the web page I've had an address for continues
to either deny access or be down. From comments made elsewhere,
by other people, it appears that the substantive claims and
their support remain unchanged. As such, this review is
likely still (unfortunately) relevant.
Robert Grumbine
Begin 1996 draft of review:
This is a review for Usenet, rather than a professional
journal. That means that I'll be writing in a much chattier fashion
and explaining far more than I ordinarily do in a review.
Further, Mr. Moore has supplied me
with the data sets that he developed in pursuing the work. I will
therefore also be redoing the statistics done in the paper and
adding several other tests. My thanks to Mr. Moore.
To Usenet: Mr. Moore had reason to believe that my review
of his paper was not likely to be favorable. Further, some
effort on his part was required to get the data set to me.
It is to his credit that he has upheld the professional
standard of science in giving me his paper and data. I will
attempt to continue the professional standard and give a
fair review of the paper. Fair is not to be confused with
favorable.
Overview:
The author proposes that the outcome of a moderate climate warming
may be favorable to the United States. Two lines are advanced
to support this conclusion. First, that death rates may drop.
Second, that people have an economic preference for warmer weather
(the amenity effect of the title). We examine both lines of
evidence and find that the statistics were not performed correctly,
and that even if they had been, there is considerable doubt about
the conclusions which could be drawn from the relations.
Section: County death rate statistics
There are several tables (xxxxx) relating to this. The premise is to
test the hypothesis that the death rate is a function of
meteorological variables. Of course there are several other things
which can, or might be supposed to, affect the death rate. The basic
data are described in Moore [1996]. Moore displays tables of multiple
regression of sets of variables attempting to predict the death rate. t
and F statistics are displayed for various of these families of
regressions. My copy of the data is available on the web as table A.
The meaning of the F statistic is that if the F statistic is
significant (large enough) then at least _one_ of the predictor
variables is statistically correlated to the predicted variable [DeVore,
1981]. The families of predictor variables always include percent of
the county population which is over 65. There is little question that a
high percent of population being over 65 is statistically (and causally)
related to there being a high death rate in the county. Since (from the
t values in the tables) this is overwhelmingly the most important single
variable, the F statistic as displayed is not necessarily telling us
anything beyond the fact that people over 65 have a higher death rate
than the population at large. The other predictors may or may not be
significant, more on this later.
There is a further problem with the application of both the t and F
statistic throughout this paper, particularly in this section. Namely,
it is important to know how many degrees of freedom there are in the
data set. At no point in the paper does Moore discuss this parameter.
In the county data set there are 89 counties. These counties are not
'free', that is, groups of these counties _must_ vary together. In the
meteorological data, Cook, Kane, Will, and DuPage counties in Illinois
(metro Chicago) are all represented by the same weather (O'Hare airport
it appears). Lake County, IL (metro Chicago, north of the city) for
some peculiar reason is represented by Midway airport, which is on the
south side and within the city of Chicago. Similar peculiarities are
present for other metropolitan areas with which I am familiar. Moore
appears to treat the data set as having 89 independent samples. Given
the significant meteorological and cultural overlap between the counties
in this data set, the true number of degrees of freedom is far less than
89, easily only a fifth of that. It is highly disconcerting that no
discussion was made of this point. We'll return to this.
With one overwhelmingly important parameter included in a regression,
one needs to be exceptionally careful in continuing past that point. The
problem is that when you have several variables which are correlated to
each other the results of your regressions (and hence statistical tests)
can be drastically affected by the numerical stability of your
computation [Strang, XXXX]. The variables in the county data set are
indeed mutually correlated, as Moore notes. To avoid the numerical
stability problems (which Moore does not mention), I will advance
stepwise through the analysis. One can proceed so as to enhance the
numerical stability [Strang, 19XX], which I will do in the following.
First off, we'll start by checking to see whether any of the variables
are correlated with the county death rate. We expect to see that
percent of the population over 65 is the most significant variable. In
carrying this out, I'll include in the tables all the variables for
which Moore gave data, even those which he noted were not significantly
correlated to death rate.
Table I:
Variable          R^2
% > 65            0.837
Median Income     0.513
>16 years edu.    0.164
Hospital Beds     0.146
Doctors           0.080
Cooling (W)       0.076
% Black           0.073
Ann Avg (W)       0.056
Latitude          0.053
Heating (W)       0.038
High Temp (W)     0.033
Low Temp (W)      0.031
Altitude          0.009
Sky Cover (W)     0.002
Some explanations:
%>65 is the percent of county population greater than 65, Median Income is the
median per capita income, >16 years edu is the percent of the county population
with more than 16 years of education, Hospital Beds is the number of hospital
beds per 100,000 county population, Doctors is the number of doctors per
100,000 population, Cooling is the number of cooling degree days at the
station of record used by Moore, % Black is the percent of the county
population which was black, Ann Avg is the annual average temperature
in degrees Celsius, Latitude is the latitude of the meteorological reporting
station in degrees, Heating is the number of heating degree days, High
Temp is the highest temperature recorded that year, low temp is the lowest
temperature recorded that year, altitude is the elevation of the
meteorological reporting station in feet, and sky cover is the percent
cloudiness for the year.
(W) denotes variables which can be affected by climate change (the
weather variables), and R^2 is the fraction of the variation in death
rate explained by this variable. t(89) is the t value of the
correlation if one assumes there to be 89 independent counties (higher
means more significant or more likely to be significant). Right off,
we see something that might be strange -- if we add up the fraction of
the variance explained by each of the variables independently, we find
that we've explained well over 100% of the variance. The reason is that
the variables are correlated with each other, so the same bit of
variation in the death rate is 'explained' by several different
variables.
Casual inspection reveals that two variables have vastly higher
correlations with death rate than all the rest, %>65 and Median Income.
Taken together, they 'explain' 135% of the variation in the death rates.
Obviously these variables are also correlated with each other.
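
To see how single-variable R^2 values can add up to more than 100%, here
is a minimal Python sketch. The numbers are made up for illustration (they
are not Moore's county data) and the helper name r_squared is hypothetical.
Two predictors that track each other closely each 'explain' nearly all of
the same variation, so the sum of their R^2 values exceeds 1.

```python
from statistics import mean

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Made-up illustration values, NOT the county data set.
pct_over_65 = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
income = [30.0, 29.0, 27.5, 27.0, 25.0, 24.5, 23.0, 22.0]  # falls as %>65 rises
death_rate = [7.1, 7.9, 9.2, 9.8, 11.1, 11.9, 13.2, 13.8]

r2_age = r_squared(pct_over_65, death_rate)
r2_income = r_squared(income, death_rate)
total = r2_age + r2_income  # exceeds 1 because the predictors overlap

print(round(r2_age, 3), round(r2_income, 3), round(total, 3))
```

Neither R^2 is wrong on its own. The overlap only shows up when the two
are added together.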
Ok, I hope I've made it repetitively and redundantly clear that it is
very, _very_, important to handle the mutual correlation of the
variables properly. Let us now look at the scatter plot of the death
rate versus the %>65 (figure 1).
We see that there are three points with wildly higher %>65 (and death
rate) than all other points. The %>65 for the highest values are 30.7,
27.8, 22.0, 15.7, 15.4, 14.9, 14.3. The first three are clearly far
higher than all the rest. With those points included in the regression,
the slope is .496, and R^2 is 0.84. Without them, the slope is 0.646
(and R^2 without them is 0.83, scarcely any different). If one includes
these 3 (out of the 89) counties, one gets a significantly different
regression line. If a variable is correlated with %>65, then removal of
that correlation from an improper regression line is going to give false
significance to the later variables. I will therefore perform the
analysis with the three outlier counties (Tampa - Pasco County, Pinellas
County, and Ft. Lauderdale) deleted.
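
The pull that a few extreme points exert on a fitted slope can be
illustrated with a small Python sketch. The data here are invented (a
cluster lying exactly on a line of slope 0.65 plus three high-%>65 points
lying below that line), and ols_slope is a hypothetical helper name. This
is not a reconstruction of the county regression.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x (simple regression with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Invented main cluster: lies exactly on y = 0.65 * x.
x_main = [8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
y_main = [0.65 * v for v in x_main]

# Invented outliers: far higher x, but below the cluster's line.
x_out = [22.0, 28.0, 31.0]
y_out = [12.5, 15.0, 16.5]

slope_trimmed = ols_slope(x_main, y_main)
slope_all = ols_slope(x_main + x_out, y_main + y_out)

print(round(slope_trimmed, 3), round(slope_all, 3))
```

Three points out of eleven are enough to flatten the fitted line, which is
why the choice to keep or drop such counties matters for everything that
follows.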
Ok, so there is an apparent linear correlation between death rate and
%>65, and some degree of relation between %>65 and the other variables.
_How_, exactly, do we remove the dependence of the variables on the
%>65? The answer is Gram-Schmidt Orthogonalization [Strang, 19XX, p.
XXX]. Roughly speaking, we take the prediction of death rate given
%>65 and subtract that from the actual death rate. We repeat this for
every variable. At the end of the process, we have a new family of
variables, which can be interpreted as being X to the extent that X is
independent of %>65. This is what we want in order to continue the
analysis.
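
One orthogonalization step as described above can be sketched in Python
as follows. This is a minimal illustration under my own assumptions: the
values are invented, the function names are hypothetical, and each
variable is centred (mean removed) before the projection is subtracted,
which is equivalent to taking the residual of a simple regression with an
intercept.

```python
def centre(values):
    """Subtract the mean so the constant term drops out."""
    m = sum(values) / len(values)
    return [v - m for v in values]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def remove_dependence(v, x):
    """Return the part of v that is linearly independent of x."""
    vc, xc = centre(v), centre(x)
    coeff = dot(vc, xc) / dot(xc, xc)  # regression slope of v on x
    return [a - coeff * b for a, b in zip(vc, xc)]

# Invented illustration values, NOT the county data set.
pct_over_65 = [8.0, 9.5, 10.0, 11.5, 12.0, 13.5, 14.0, 15.5]
death_rate = [7.4, 8.1, 9.0, 9.6, 11.3, 11.6, 13.0, 13.9]
pct_black = [12.0, 3.0, 20.0, 8.0, 25.0, 5.0, 15.0, 10.0]

death_resid = remove_dependence(death_rate, pct_over_65)
black_resid = remove_dependence(pct_black, pct_over_65)

# The residuals carry no remaining linear relation to %>65.
check = dot(death_resid, centre(pct_over_65))
print(abs(check) < 1e-9)
```

The residual variables are then correlated with the residual death rate,
which is what the next table reports.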
After orthogonalization, our table is,
Table 2:
Res. Variable     R       R^2      t(86)
Black             0.70    0.49     8.87
Income           -0.65            -7.77
Beds              0.45             4.65
%>16             -0.31            -3.02
Drs               0.22    0.048    2.07
Latitude         -0.20    0.04    -1.91
Altitude          0.04    0.002    0.34
Sky              -0.01    0.0001  -0.07
------
Cooling           0.19    0.036    1.75
Tbar              0.14    0.02     1.31
T^2               0.11    0.012    0.99
Heating          -0.10    0.01    -0.97
Tmax             -0.08    0.006   -0.73
Tmin              0.02    0.0004   0.17
Where each variable has had its dependence on %>65 removed, and that
variable explained 84% of the variance in death rates. All variable names
(and variances explained) now refer to these residual variables (and the
16% of total variance which remains to be explained). So the %Black (to
the extent that this is unrelated to %over 65) explains 49% of the
_remaining_ variation in death rate, and the t statistic for this is 8.87.
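
The t values in the table appear consistent with the standard relation
between a correlation coefficient and its t statistic, t = R * sqrt(df /
(1 - R^2)) with df = n - 2. That this is the formula actually used for
the table is an assumption on my part, and small differences from the
tabulated values are to be expected since R is quoted to only two
decimals. The Python sketch below (function name hypothetical) also shows
how quickly t shrinks when fewer counties are taken to be independent.

```python
from math import sqrt

def t_from_r(r, n_independent):
    """t statistic for correlation r, assuming n_independent free samples."""
    df = n_independent - 2  # assumption: simple correlation, df = n - 2
    return r * sqrt(df / (1.0 - r * r))

# R for the doctors residual as quoted in the table above.
t_drs_86 = t_from_r(0.22, 86)  # all 86 counties treated as independent
t_drs_18 = t_from_r(0.22, 18)  # only 18 effectively independent counties

print(round(t_drs_86, 2), round(t_drs_18, 2))
```

The same R gives a much smaller t once the number of independent samples
drops, which is the point taken up further on.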
Some points arise now. First, it is clear that we can easily keep
repeating the process, each time taking the variable which explains the most
variance in the remaining portion of the death rate as the variable to use
for orthogonalization. Second, it is clear that the apparent importance
of the variables depends on whether they are correlated to other things
which are good predictors. For instance, due to the correlation between
%>65 and Income, income was apparently more important than %Black in
predicting death rates. When we remove that relation (%black is essentially
unrelated to %>65), %black is shown to be more important than income.
It is important that we do remove the variables in order of statistical
importance -- except for one family. That is our third point: we do not
want to falsely conclude that the meteorological effects are important
until after all other possibly significant effects are accounted for. If
the reason for a correlation between death rate and high average temperature
is really because, say, old black poor people live in warmer areas
preferentially, we don't want to mislead ourselves into thinking that
the relation could be affected by temperature.
So, we will continue the process of orthogonalization, but only with the
non-meteorological variables. In doing this, we find in order of importance,
that the variables affecting death rate are:
%>65, %black, income, number of hospital beds per 100,000 population, %>16
years education, and number of doctors per 100,000 population. Other variables
do not have any significant effect on predicting the residual variation.
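
The procedure just described (repeatedly orthogonalize against the
strongest remaining predictor, but hold the meteorological family back
until the end) can be sketched in Python as below. This is an
illustration under stated assumptions, not the code used for this review:
the data are invented, the names are hypothetical, and the stopping rule
is a simple threshold on |R| chosen for the example. It also assumes no
variable is an exact linear combination of the others, since that would
give a zero-length residual.

```python
from math import sqrt

def centre(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def corr(u, v):
    """Pearson correlation of two already-centred sequences."""
    return dot(u, v) / sqrt(dot(u, u) * dot(v, v))

def strip(v, x):
    """Remove from centred v its linear dependence on centred x."""
    k = dot(v, x) / dot(x, x)
    return [a - k * b for a, b in zip(v, x)]

def stepwise(target, predictors, deferred, min_abs_r):
    """Orthogonalize against the best non-deferred predictor until none
    reaches min_abs_r. Deferred variables are residualized but never chosen."""
    y = centre(target)
    xs = {name: centre(v) for name, v in predictors.items()}
    order = []
    while True:
        candidates = [n for n in xs if n not in deferred and n not in order]
        if not candidates:
            break
        best = max(candidates, key=lambda n: abs(corr(xs[n], y)))
        if abs(corr(xs[best], y)) < min_abs_r:
            break
        pivot = xs[best]
        y = strip(y, pivot)
        xs = {n: (v if n == best else strip(v, pivot)) for n, v in xs.items()}
        order.append(best)
    return order, y, xs

# Invented illustration values, NOT the county data set.
age = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
income = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0]
tmax = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0]
noise = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
death = [2.0 * a + 0.5 * i + e for a, i, e in zip(age, income, noise)]

order, resid_y, resid_x = stepwise(
    death, {"age": age, "income": income, "tmax": tmax},
    deferred={"tmax"}, min_abs_r=0.3)
r_tmax = corr(resid_x["tmax"], resid_y)  # weather variable, examined last

print(order, round(r_tmax, 2))
```

Only after the loop has finished is the deferred (weather) variable
compared with what is left of the death rate.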
Having removed all apparently significant variables, let us now consider
the meteorological variables as well.
Variable      R       R^2      t(86)
Tmax         -0.36    0.13    -3.57
T^2          -0.21    0.044   -1.95
Tmin         -0.20    0.040   -1.90
Tbar         -0.18    0.032   -1.66
Heating       0.17    0.029    1.59
Cooling      -0.14    0.020   -1.29
Sky           0.07    0.005    0.62
Altitude     -0.06    0.004   -0.55
Latitude      0.06    0.004    0.58
At a very long last, we can finally get into the meat of determining what
significance, if any, meteorological variables have for death rates.
Recall that the t statistic is a measure of how well the variable can
predict the residual variance. If the t statistic is greater than 1.67,
then the correlation is significant at P < 0.05 for 86 independent
counties. For only 18 independent counties, it needs to be 2.1. For P <
0.001, the critical values are 3.2 and 3.61, respectively. This
suggests fairly strongly that if any of the meteorological variables are
important at this point, then the maximum temperature is the one.
A more important question is how much variance is left. The
unexplained variance at the moment is 14.7, out of the initial 229.5.
That is, the most significant non-meteorological variables explained
93.6% of the variation in the death rate. The first three explained
92.5%. This leads us to another point. If we can predict 93.6% of the
variance, already, how much of the residual variance do we need to be
able to predict in order to justify adding the new variable? This is a
standard question, discussed in chapter 13 of DeVore [1982]. What we
need to apply is an F test, where the statistic depends on how many
degrees of freedom remain (whatever we started with, we remove a degree
of freedom for each variable we have orthogonalized with respect to) and
what fraction of the remaining variance the new variable explains.
Let n be the initial number of degrees of freedom. The F statistic
for Tmax is then 0.136*(n-7). If n were 86, this is 10.7. If n is 18,
F is 1.5. For significance at P < 0.01, n=86, F must be greater than
about 7, which it is. For significance at P < 0.01, n = 18, F would have
to be greater than 9.65, while it is really only 1.5 (which would not
pass for significant even at P < 0.05, where the critical value is
4.84). The number of independent counties is obviously important, so
I'll tabulate the computed and critical F statistic for varying degrees
of independence between counties in the table. The P value corresponding
to Fcritical is given in parentheses.
N     Fcomputed   Fcritical(0.01)   Fcritical(0.05)
86    10.7        7.0               ----
43    4.90        7.41              4.10
37    4.08        7.56              4.17
29    2.99        7.95              4.30
18    1.5         9.65              4.84
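
The Fcomputed column can be reproduced with a few lines of Python. The
sketch uses the expression given above, F = 0.136*(n-7), and takes the
critical values straight from the table rather than computing them. The
function and variable names are hypothetical.

```python
def f_for_added_variable(r2_of_residual, n_independent, n_already_used):
    """F as computed in this review: R^2 of the residual times (n - k)."""
    return r2_of_residual * (n_independent - n_already_used)

# n: (Fcritical at P < 0.01, Fcritical at P < 0.05), copied from the table;
# None where the table gives no value.
critical = {
    86: (7.0, None),
    43: (7.41, 4.10),
    37: (7.56, 4.17),
    29: (7.95, 4.30),
    18: (9.65, 4.84),
}

rows = []
for n, (crit_01, crit_05) in critical.items():
    f = f_for_added_variable(0.136, n, 7)
    passes_01 = f > crit_01
    passes_05 = None if crit_05 is None else f > crit_05
    rows.append((n, round(f, 2), passes_01, passes_05))

for row in rows:
    print(row)
```

Laid out this way, Tmax clears the P < 0.01 bar only when all 86 counties
are counted as independent, and it falls below even the P < 0.05 bar by
n = 37.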
So where are we? Only if the majority of the data are independent can
one consider that there might be a statistically meaningful relation
between death rates and Tmax (aside: The F statistic for the other
variables would be no more than 1/3 that for the Tmax, which would make
F insignificant even for all the data being independent). We have
excellent reason to believe that the data are _not_ largely independent
in terms of the fact that much of it comes from clusters of mutually
(socially, economically) related counties. Socially and economically it
is unreasonable to assume that DuPage and Lake counties (near Chicago)
are independent of each other and of Chicago. Far worse,
meteorological variables are correlated over scales of hundreds of
kilometers. This means that in terms of the meteorology, not only are
DuPage and Lake counties strongly related to each other, but they're
both related to Indianapolis (Marion County), St. Louis (Madison and St.
Clair counties), and quite a few others through the midwest. On the
grounds of mutual relation between counties economically and socially,
(which variables, recall, explain over 90% of the total variation in
death rates) we expect that the number of degrees of freedom is
significantly less than expected from the 86 counties. On the basis of
meteorology, we will find even fewer degrees of freedom.
Either of these considerations results in rejecting Tmax as a useful
predictor of death rates.
Even if one ignored the lack of statistical significance, little is
gained for Mr. Moore's thesis. The reason is that the only respectable
candidate is maximum temperature. The maximum temperature is a variable
which is expected to be relatively unchanged in CO2 induced climate
changes. The reason is based on observations by Karl et al. which show
that the last century's warming has occurred primarily by increasing the
minimum temperatures while leaving the maximum temperatures unchanged.
Such results are now also found in the model predictions. Even if death
rate did depend on Tmax, then, the likely climate change wouldn't
produce the hoped for reduction in death rates.
Reflection on this section of the paper:
My conclusions here are quite contrary to those reached by Mr. Moore in
his paper. It is therefore important to examine exactly how we came to such
different results while working from the same data set. Three points seem
key.
1) Moore never considered the number of degrees of freedom.
2) Moore did not defer consideration of the meteorological variables until
after building the best possible model without them.
3) Moore included outlier counties in the data set (which aggravated the
above two).
On number 3, which we reached first in this discussion, one can return
and redo the analysis procedure I did, but for all 89 counties. The regression
coefficients change, but the %>65, income, and %black again explain over
90% of the variation in death rates (93.3% as opposed to 92.5 without the
outliers). This time 2 additional variables look like they might be
important (large t statistic): latitude and hospital beds (as opposed to 3
previously: hospital beds, %>16 years education, and doctors). After going
through them, we arrive at a near toss-up between Tmax and T^2 as the
candidate for significant meteorological variable (t is 3.06 and 3.07,
respectively). Taking T^2 as the meteorologically important variable, we
arrive at F = 0.1075*(n-6), where n is again the number of independent
counties we believe we have.
N     Fcomputed   Fcritical(0.01)   Fcritical(0.05)
89    8.9         ~7.0              ----
42    3.87        7.41              4.10
36    3.22        7.56              4.17
28    2.37        7.95              4.30
17    1.18        9.65              4.84
So with the original data set, the meteorological variables are even _less_
significant than if the outliers are excluded from the analysis (reminder: they
were rejected for being outliers in death rate and %>65, not for any
meteorological parameters). Outliers or no, the meteorological parameters
cannot be accepted as significant.
In point 2, the problem of multicollinearity is aggravated to the point
of being able to produce the effect Moore finds. In working with the
outlier points, latitude becomes statistically meaningful. This variable
is correlated to the temperatures. Moore fails to remove the effect of
latitude on death rates and on the temperatures, so finds significance
from the temperatures. But, this is mostly due to the relation of temperature
to latitude, not temperature to death rates. Note that if we don't include
the outlier counties, the importance of latitude disappears. The fact
that Tmax is in both cases the most significant (or nearly so) of the
meteorological variables, while T^2 goes from almost significant to
meaningless is another sign that Tmax is the more robust variable, even
if still not statistically significant.
Point 1, I hope it is clear from the above, is drastically important.
The statistical tests used in these problems rely heavily on the degrees
of freedom. That Moore never discussed this point is a severe failing.
The consideration here suggests very strongly that for any reasonable estimate
of the number of degrees of freedom, none of the meteorological variables
are even statistically significant.
To recap: the relation between meteorological variables and death rates
is not statistically significant, and even if it were, the dependence is
on a variable which is expected not to change under climate change. The
conclusion that there would be a decrease in the death rate due to climate
change is therefore both statistically and physically unsupported by this
data set.
[1997 comment -- from here on, the review is very fragmentary ]
In finishing the health effects section of the paper, Mr. Moore considered
the effects meteorological variables may have on number of hospital beds
and on number of doctors. He concluded that warmer temperatures would lead
to a decrease in health care costs.
I analyzed the data in the same way as for death rate, with the same
three outlier counties removed as before. Since cloud cover turned out
to be potentially important for hospital beds, I only used the counties
for which there was cloud data. Seven did not have it: Jefferson Co, AL
(Birmingham); Orange Co, CA (Anaheim); San Francisco Co, CA; Norfolk Co, MA
(Boston); Bronx Co, NY (NYC); New York Co, NY (NYC); and Hamilton Co, OH
(Cincinnati).
I found for hospital beds, that the important non-meteorological variables
were: doctors, income, and sky cover. Together, these explained 49% of
the variation in number of hospital beds per 100,000. The most significant
meteorological variable was the maximum temperature. The F statistic,
considered as before, was 0.073*(n-4). This is clearly even less significant
than the previous attempts (where the coefficients were .1075, and .136,
the larger giving more nearly significant results). Note that I've been
taking sky cover as a non-climate variable, on the grounds that we have
little reason to predict whether cloudiness will increase or decrease in
changed climates. This point affects none of the preceding. If here we
considered it a climate variable, the F statistic for including it
would be 0.064*(n-3), which would, as before, be insufficient
to accept it as a predictor.
For the number of doctors, the result is that the important
non-meteorological variables are beds, %>16 years, death rate, income,
and altitude (in descending order of importance). The meteorological
variables are overwhelmingly not significant. By this I mean that the
t statistic, even for 86 independent counties and even considering only
the regression on the residual variance, is not even significant at the
0.05 level. Again, sky cover considered as a climate variable
Mr. Moore, in considering hospital beds, did not include the number of
doctors in the regression, yet this is the single best predictor of number
of hospital beds. To the extent that there is a relation (in both cases
the one is the best predictor of the other) this is a requirement of his
hypothesis that these figures reflect the health care needs of the area.
It is therefore puzzling that he neglected this. He also did not consider
the relation between %>16 years education and either the hospital beds or
doctors. I found this to be the second most important variable for predicting
the number of doctors in an area. The F statistic for including %>16
years as a predictor of the number of doctors is 0.47*(n-2),
and is significant even for the probably low number of degrees of freedom.
One simply cannot draw conclusions regarding the importance of some variable
in predicting another, if one has neglected to consider the role of major
variables. If the major dependencies are considered, none of the
meteorological variables are important to either number of hospital beds
or the number of doctors. The conclusion that health care costs would
be reduced is therefore unsupported by the data.