From: rmg3@access2.digex.net (Robert Grumbine)
Date: 1997/10/26
Message-ID: <630330$6p4@access2.digex.net>

Usenet review of Health and Amenity Effects of Warming, by Thomas Gale Moore
Review By Robert W. Grumbine

Some notes:

I wrote this over a year ago, so far back in time that Mr. Moore was still a regular participant in sci.environment. My original intention was to place this and supporting figures on my web page. I eventually became so frustrated with the level of Mr. Moore's paper that I ceased working on both the review and on the web illustrations. If you see links below to figures, do not assume that they exist. They probably don't.

The review is of a by now ancient paper, and an old edition of that paper. My attempts to get a current copy of the paper have failed, as the web page I've had an address for continues to either deny access or be down. From comments made elsewhere, by other people, it appears that the substantive claims and their support remain unchanged. As such, this review is likely still (unfortunately) relevant.

Robert Grumbine

Begin 1996 draft of review:

This is a review for Usenet, rather than a professional journal. That means that I'll be writing in a much chattier fashion and explaining far more than I ordinarily do in a review. Further, Mr. Moore has supplied me with the data sets that he developed in pursuing the work. I will therefore also be redoing the statistics done in the paper and adding several other tests. My thanks to Mr. Moore.

To Usenet: Mr. Moore had reason to believe that my review of his paper was not likely to be favorable. Further, some effort on his part was required to get the data set to me. It is to his credit that he has upheld the professional standard of science in giving me his paper and data. I will attempt to continue the professional standard and give a fair review of the paper. Fair is not to be confused with favorable.
Overview:

The author proposes that the outcome of a moderate climate warming may be favorable to the United States. Two lines of evidence are advanced to support this conclusion. First, that death rates may drop. Second, that people have an economic preference for warmer weather (the amenity effect of the title). We examine both lines of evidence and find that the statistics were not performed correctly, and that even if they had been, there is considerable doubt about the conclusions which could be drawn from the relations.

Section: County death rate statistics

There are several tables (xxxxx) relating to this. The premise is to test whether the death rate is a function of meteorological variables. Of course there are several other things which can, or might be supposed to, affect the death rate. The basic data are described in Moore [1996]. Moore displays tables of multiple regression of sets of variables attempting to predict the death rate. t and F statistics are displayed for various of these families of regressions. My copy of the data is available on the web as table A.

The meaning of the F statistic is that if the F statistic is significant (large enough) then at least _one_ of the predictor variables is statistically correlated to the predicted variable [DeVore, 1981]. The families of predictor variables always include percent of the county population which is over 65. There is little question that a high percent of population being over 65 is statistically (and causally) related to there being a high death rate in the county. Since (from the t values in the tables) this is overwhelmingly the most important single variable, the F statistic as displayed is not necessarily telling us anything beyond the fact that people over 65 have a higher death rate than the population at large. The other predictors may or may not be significant; more on this later.
There is a further problem with the application of both the t and F statistic throughout this paper, particularly in this section. Namely, it is important to know how many degrees of freedom there are in the data set. At no point in the paper does Moore discuss this parameter. In the county data set there are 89 counties. These counties are not 'free', that is, groups of these counties _must_ vary together. In the meteorological data, Cook, Kane, Will, and DuPage counties in Illinois (metro Chicago) are all represented by the same weather (O'Hare airport it appears). Lake County, IL (metro Chicago, north of the city) for some peculiar reason is represented by Midway airport, which is on the south side and within the city of Chicago. Similar peculiarities are present for other metropolitan areas with which I am familiar. Moore appears to treat the data set as having 89 independent samples. Given the significant meteorological and cultural overlap between the counties in this data set, the true number of degrees of freedom is far less than 89, easily only a fifth of that. It is highly disconcerting that no discussion was made of this point. We'll return to this.

With one overwhelmingly important parameter included in a regression, one needs to be exceptionally careful in continuing past that point. The problem is that when you have several variables which are correlated to each other, the results of your regressions (and hence statistical tests) can be drastically affected by the numerical stability of your computation [Strang, XXXX]. The variables in the county data set are indeed mutually correlated, as Moore notes. To avoid the numerical stability problems (which Moore does not mention), I will advance stepwise through the analysis. One can proceed so as to enhance the numerical stability [Strang, 19XX], which I will do in the following.

First off, we'll start by checking to see whether any of the variables are correlated with the county death rate.
We expect to see that percent of the population over 65 is the most significant variable. In carrying this out, I'll include in the tables all the variables for which Moore gave data, even those which he noted were not significantly correlated to death rate.

Table I:

  Variable          R^2
  % > 65            0.837
  Median Income     0.513
  >16 years edu.    0.164
  Hospital Beds     0.146
  Doctors           0.080
  Cooling (W)       0.076
  % Black           0.073
  Ann Avg (W)       0.056
  Latitude          0.053
  Heating (W)       0.038
  High Temp (W)     0.033
  Low Temp (W)      0.031
  Altitude          0.009
  Sky Cover (W)     0.002

Some explanations: %>65 is the percent of county population greater than 65, Median Income is the median per capita income, >16 years edu. is the percent of the county population with more than 16 years of education, Hospital Beds is the number of hospital beds per 100,000 county population, Doctors is the number of doctors per 100,000 population, Cooling is the number of cooling degree days at the station of record used by Moore, % Black is the percent of the county population which was black, Ann Avg is the annual average temperature in degrees Celsius, Latitude is the latitude of the meteorological reporting station in degrees, Heating is the number of heating degree days, High Temp is the highest temperature recorded that year, Low Temp is the lowest temperature recorded that year, Altitude is the elevation of the meteorological reporting station in feet, and Sky Cover is the percent cloudiness for the year. (W) denotes variables which can be affected by climate change (the weather variables), and R^2 is the fraction of the variation in death rate explained by this variable. t(89) is the t value of the correlation if one assumes there to be 89 independent counties (higher means more significant or more likely to be significant).

Right off, we see something that might be strange -- if we add up the fraction of the variance explained by each of the variables independently, we find that we've explained well over 100% of the variance.
The reason is that the variables are correlated with each other, so the same bit of variation in the death rate is 'explained' by several different variables. Casual inspection reveals that two variables have vastly higher correlations with death rate than all the rest, %>65 and Median Income. Taken together, they 'explain' 135% of the variation in the death rates. Obviously these variables are also correlated with each other. Ok, I hope I've made it repetitively and redundantly clear that it is very, _very_, important to handle the mutual correlation of the variables properly.

Let us now look at the scatter plot of the death rate versus the %>65 (figure 1). We see that there are three points with wildly higher %>65 (and death rate) than all other points. The %>65 for the highest values are 30.7, 27.8, 22.0, 15.7, 15.4, 14.9, 14.3. The first three are clearly far higher than all the rest. With those points included in the regression, the slope is 0.496, and R^2 is 0.84. Without them, the slope is 0.646 (and R^2 without them is 0.83, scarcely any different). If one includes these 3 (out of the 89) counties, one gets a significantly different regression line. If a variable is correlated with %>65, then removal of that correlation from an improper regression line is going to give false significance to the later variables. I will therefore perform the analysis with the three outlier counties (Tampa - Pasco County, Pinellas County, and Ft. Lauderdale) deleted.

Ok, so there is an apparent linear correlation between death rate and %>65, and some degree of relation between %>65 and the other variables. _How_, exactly, do we remove the dependence of the variables on the %>65? The answer is Gram-Schmidt Orthogonalization [Strang, 19XX, p. XXX]. Roughly speaking, we take the prediction of death rate given %>65 and subtract that from the actual death rate. We repeat this for every other variable, subtracting from each its own prediction given %>65.
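The subtraction step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names fit_line, residuals, and r_squared are mine, the five data points are invented, and nothing here is taken from Mr. Moore's data set. The sketch fits a straight line of one variable on %>65 and keeps what is left over; by construction the leftover has no linear relation to %>65.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(x, y):
    """Part of y that is not linearly predictable from x."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def r_squared(x, y):
    """Squared correlation between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

# Invented numbers, for illustration only (hypothetical counties).
pct_over_65 = [9.0, 11.0, 12.0, 14.0, 15.0]
death_rate = [6.1, 7.0, 7.9, 8.6, 9.4]
income = [21.0, 19.5, 20.0, 17.5, 18.0]

# Remove the dependence on %>65 from both the predicted and the predictor variable.
death_resid = residuals(pct_over_65, death_rate)
income_resid = residuals(pct_over_65, income)

print("raw R^2, income vs death rate:    ", round(r_squared(income, death_rate), 3))
print("residual R^2, after removing %>65:", round(r_squared(income_resid, death_resid), 3))
```

Applying residuals() again, this time with respect to the next variable chosen, gives a stepwise procedure in the spirit of the one used in the rest of this section.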
At the end of the process, we have a new family of variables, each of which can be interpreted as being X to the extent that X is independent of %>65. This is what we want in order to continue the analysis. After orthogonalization, our table is:

Table 2:

  Res. Variable      R       R^2       t(86)
  Black             0.70    0.49       8.87
  Income           -0.65              -7.77
  Beds              0.45               4.65
  %>16             -0.31              -3.02
  Drs               0.22    0.048      2.07
  Latitude         -0.20    0.04      -1.91
  Altitude          0.04    0.002      0.34
  Sky              -0.01    0.0001    -0.07
  ------
  Cooling           0.19    0.036      1.75
  Tbar              0.14    0.02       1.31
  T^2               0.11    0.012      0.99
  Heating          -0.10    0.01      -0.97
  Tmax             -0.08    0.006     -0.73
  Tmin              0.02    0.0004     0.17

Here each variable has had its dependence on %>65 removed, and that variable explained 84% of the variance in death rates. All variable names (and variances explained) now refer to these residual variables (and the 18% of total variance which remains to be explained). So the %Black (to the extent that this is unrelated to %over 65) explains 49% of the _remaining_ variation in death rate, and the t statistic for this is 8.87.

Some points arise now. First, it is clear that we can easily keep repeating the process, each time taking the variable which explains the most variance in the remaining portion of the death rate as the variable to use for orthogonalization. Second, it is clear that the apparent importance of the variables depends on whether they are correlated to other things which are good predictors. For instance, due to the correlation between %>65 and Income, income was apparently more important than %Black in predicting death rates. When we remove that relation (%black is essentially unrelated to %>65), %black is shown to be more important than income. It is important that we do remove the variables in order of statistical importance -- except for one family. That is our third point: we do not want to falsely conclude that the meteorological effects are important until after all other possibly significant effects are accounted for.
If the reason for a correlation between death rate and high average temperature is really because, say, old black poor people live in warmer areas preferentially, we don't want to mislead ourselves into thinking that the relation could be affected by temperature. So, we will continue the process of orthogonalization, but only with the non-meteorological variables. In doing this, we find, in order of importance, that the variables affecting death rate are: %>65, %black, income, number of hospital beds per 100,000 population, %>16 years education, and number of doctors per 100,000 population. Other variables do not have any significant effect on predicting the residual variation.

Having removed all apparently significant variables, let us now consider the meteorological variables as well.

  Variable      R       R^2      t(86)
  Tmax        -0.36    0.13     -3.57
  T^2         -0.21    0.044    -1.95
  Tmin        -0.20    0.040    -1.90
  Tbar        -0.18    0.032    -1.66
  Heating      0.17    0.029     1.59
  Cooling     -0.14    0.020    -1.29
  Sky          0.07    0.005     0.62
  Altitude    -0.06    0.004    -0.55
  Latitude     0.06    0.004     0.58

At a very long last, we can finally get into the meat of determining what significance, if any, meteorological variables have for death rates. Recall that the t statistic is a measure of how well the variable can predict the residual variance. If the t statistic is greater than 1.67, then the correlation is significant at P < 0.05 for 86 independent counties. For only 18 independent counties, it needs to be 2.1. For P < 0.001, the critical values are 3.2 and 3.61, respectively. This suggests fairly strongly that if any of the meteorological variables are important at this point, then the maximum temperature is the one.

A more important question is how much variance is left. The unexplained variance at the moment is 14.7, out of the initial 229.5. That is, the most significant non-meteorological variables explained 93.6% of the variation in the death rate. The first three explained 92.5%. This leads us to another point.
If we can predict 93.6% of the variance already, how much of the residual variance do we need to be able to predict in order to justify adding the new variable? This is a standard question, discussed in chapter 13 of DeVore [1982]. What we need to apply is an F test, where the statistic depends on how many degrees of freedom remain (whatever we started with, we remove a degree of freedom for each variable we have orthogonalized with respect to) and what fraction of the remaining variance the new variable explains. Let n be the initial number of degrees of freedom. The F statistic for Tmax is then 0.136*(n-7). If n were 86, this is 10.7. If n is 18, F is 1.5. For significance at P < 0.01, n = 86, F must be greater than about 7, which it is. For significance at P < 0.01, n = 18, F would have to be greater than 9.65, while it is really only 1.5 (which would not pass for significant even at P < 0.05, where the critical value is 4.84).

The number of independent counties is obviously important, so I'll tabulate the computed and critical F statistic for varying degrees of independence between counties in the table. The P value corresponding to Fcritical is given in the parentheses.

  N     Fcomputed   Fcritical(0.01)   Fcritical(0.05)
  86    10.7        7.0               ----
  43    4.90        7.41              4.10
  37    4.08        7.56              4.17
  29    2.99        7.95              4.30
  18    1.5         9.65              4.84

So where are we? Only if the majority of the data are independent can one consider that there might be a statistically meaningful relation between death rates and Tmax (aside: the F statistic for the other variables would be no more than 1/3 that for the Tmax, which would make F insignificant even for all the data being independent). We have excellent reason to believe that the data are _not_ largely independent, in that much of it comes from clusters of mutually (socially, economically) related counties. Socially and economically it is unreasonable to assume that DuPage and Lake counties (near Chicago) are independent of each other and of Chicago.
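The computed column of the table is simple arithmetic, and a short Python sketch makes it easy to try other guesses for n. The function names f_computed and verdict and the dictionary F_CRITICAL are my own labels; the coefficient 0.136, the seven removed degrees of freedom, and the critical values are copied from the text and table above, not recomputed from an F distribution.

```python
# Critical values copied from the table: n -> (Fcritical at 0.01, Fcritical at 0.05).
# None marks the entry the table leaves blank.
F_CRITICAL = {
    86: (7.0, None),
    43: (7.41, 4.10),
    37: (7.56, 4.17),
    29: (7.95, 4.30),
    18: (9.65, 4.84),
}

def f_computed(n, coefficient=0.136, removed=7):
    """F statistic for Tmax as given in the text: coefficient * (n - removed)."""
    return coefficient * (n - removed)

def verdict(n):
    """Compare the computed F with the tabulated critical values for this n."""
    f = f_computed(n)
    crit_01, crit_05 = F_CRITICAL[n]
    if f > crit_01:
        return "significant at 0.01"
    if crit_05 is not None and f > crit_05:
        return "significant at 0.05"
    return "not significant"

for n in F_CRITICAL:
    print(f"n = {n:2d}  F = {f_computed(n):5.2f}  {verdict(n)}")
```

The point of parameterizing on n is that the verdict depends on how many counties one is willing to call independent, which is exactly the quantity in dispute.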
Far worse, meteorological variables are correlated over scales of hundreds of kilometers. This means that in terms of the meteorology, not only are DuPage and Lake counties strongly related to each other, but they're both related to Indianapolis (Marion County), St. Louis (Madison and St. Clair counties), and quite a few others through the midwest.

On the grounds of mutual relation between counties economically and socially (which variables, recall, explain over 90% of the total variation in death rates), we expect that the number of degrees of freedom is significantly less than expected from the 86 counties. On the basis of meteorology, we will find even fewer degrees of freedom. Either of these considerations results in rejecting Tmax as a useful predictor of death rates.

Even if one ignored the lack of statistical significance, little is gained for Mr. Moore's thesis. The reason is that the only respectable candidate is maximum temperature. The maximum temperature is a variable which is expected to be relatively unchanged in CO2 induced climate changes. The reason is based on observations by Karl et al. which show that the last century's warming has occurred primarily by increasing the minimum temperatures while leaving the maximum temperatures unchanged. Such results are now also found in the model predictions. Even if death rate did depend on Tmax, then, the likely climate change wouldn't produce the hoped for reduction in death rates.

Reflection on this section of the paper:

My conclusions here are quite contrary to those reached by Mr. Moore in his paper. It is therefore important to examine exactly how we came to such different results while working from the same data set. Four points seem key.

1) Moore never considered the number of degrees of freedom.
2) Moore did not defer consideration of the meteorological variables until after building the best possible model without them.
3) Moore included outlier counties in the data set (which aggravated the points above).

On number 3, which we reached first in this discussion, one can return and redo the analysis procedure I did, but for all 89 counties. The regression coefficients change, but the %>65, income, and %black again explain over 90% of the variation in death rates (93.3%, as opposed to 92.5% without the outliers). After going through the additional 2 variables which look like they might be important (large t statistic), namely latitude and hospital beds (as opposed to 3 previously: hospital beds, %>16 years education, and doctors), we arrive at a near toss-up between Tmax and T^2 as the candidate for significant meteorological variable (t is 3.06 and 3.07, respectively). Taking T^2 as the meteorologically important variable, we arrive at F = 0.1075*(n-6), where n is again the number of independent counties we believe we have.

  N     Fcomputed   Fcritical(0.01)   Fcritical(0.05)
  89    8.9         ~7.0              ----
  42    3.98        7.41              4.10
  36    3.22        7.56              4.17
  28    2.37        7.95              4.30
  17    1.18        9.65              4.84

So with the original data set, the meteorological variables are even _less_ significant than if the outliers are excluded from the analysis (reminder: they were rejected for being outliers in death rate and %>65, not for any meteorological parameters). Outliers or no, the meteorological parameters cannot be accepted as significant.

In point 2, the problem of multicollinearity is aggravated to the point of being able to produce the effect Moore finds. In working with the outlier points, latitude becomes statistically meaningful. This variable is correlated to the temperatures. Moore fails to remove the effect of latitude on death rates and on the temperatures, and so finds significance from the temperatures. But this is mostly due to the relation of temperature to latitude, not temperature to death rates. Note that if we don't include the outlier counties, the importance of latitude disappears.
The fact that Tmax is in both cases the most significant (or nearly so) of the meteorological variables, while T^2 goes from almost significant to meaningless, is another sign that Tmax is the more robust variable, even if still not statistically significant.

Point 1, I hope it is clear from the above, is drastically important. The statistical tests used in these problems rely heavily on the degrees of freedom. That Moore never discussed this point is a severe failing. The consideration here suggests very strongly that for any reasonable estimate of the number of degrees of freedom, none of the meteorological variables are even statistically significant.

To recap: the relation between meteorological variables and death rates is not statistically significant, and even if it were, the dependence is on a variable which is expected not to change under climate change. The conclusion that there would be a decrease in the death rate due to climate change is therefore both statistically and physically unsupported by this data set.

[1997 comment -- from here on, the review is very fragmentary ]

In finishing the health effects section of the paper, Mr. Moore considered the effects meteorological variables may have on number of hospital beds and on number of doctors. He concluded that warmer temperatures would lead to a decrease in health care costs.

I analyzed the data in the same way as for death rate, with the same three outlier counties removed as before. Since cloud cover turned out to be potentially important for hospital beds, I only used the counties for which there was cloud data. Seven had none: Jefferson Co, AL (Birmingham); Orange Co, CA (Anaheim); San Francisco Co, CA; Norfolk Co, MA (Boston); Bronx Co, NY (NYC); New York Co, NY (NYC); Hamilton Co, OH (Cincinnati).

I found for hospital beds that the important non-meteorological variables were: doctors, income, and sky cover. Together, these explained 49% of the variation in number of hospital beds per 100,000.
The most significant meteorological variable was the maximum temperature. The F statistic, considered as before, was 0.073*(n-4). This is clearly even less significant than the previous attempts (where the coefficients were 0.1075 and 0.136, the larger giving more nearly significant results). Note that I've been taking sky cover as a non-climate variable, on the grounds that we have little reason to predict whether cloudiness will increase or decrease in changed climates. This point affects none of the preceding. If here we considered it a climate variable, the F statistic for including it would be 0.064*(n-3), which would, as before, be insufficient to accept it as a predictor.

For the number of doctors, the result is that the important non-meteorological variables are beds, %>16 years, death rate, income, and altitude (in descending order of importance). The meteorological variables are overwhelmingly not significant. By this I mean that the t statistic, even for 86 independent counties and even considering only the regression on the residual variance, is not significant even at the 0.05 level. Again, sky cover considered as a climate variable

Mr. Moore, in considering the hospital beds, did not include number of doctors in the regression, yet this is the single best predictor of number of hospital beds. To the extent that there is a relation (in both cases the one is the best predictor of the other), this is a requirement of his hypothesis that these figures reflect the health care needs of the area. It is therefore puzzling that he neglected this. He also did not consider the relation between %>16 years education and either the hospital beds or doctors. I found this to be the second most important variable for predicting the number of doctors in an area. The F statistic for including the %>16 years as a predictor of number of doctors is 0.47*(n-2), and is significant even for the probably low number of degrees of freedom.
One simply cannot draw conclusions regarding the importance of some variable in predicting another if one has neglected to consider the role of major variables. If the major dependencies are considered, none of the meteorological variables are important to either the number of hospital beds or the number of doctors. The conclusion that health care costs would be reduced is therefore unsupported by the data.