Hoover & Siegler: McCloskey is Oh So Wrong About Statistical Significance (UPDATED)
By Kevin
In a November 2005 working paper, Sound and Fury: McCloskey and Significance Testing in Economics, Kevin Hoover and Mark Siegler inform us that because Deirdre McCloskey still hasn't done her homework right, she continues to misrepresent the median economist as a statistical dummy. Part of the abstract:
That statistical significance is not economic significance is a jejune and uncontroversial claim, and there is no convincing evidence that economists systematically mistake the two. Other elements of McCloskey’s analysis of statistical significance are shown to be ill-founded, and her criticisms of practices of economists are found to be based in inaccurate readings and tendentious interpretations of their work. Properly used, significance tests are a valuable tool for assessing signal strength, for assisting in model specification, and for determining causal structure.
Here's a more extensive earlier draft, and a list of the full-length AER papers McCloskey and Ziliak failed to include in previous analyses.
That's all in section 5.1, starting at page 31 (37) of the working paper. I was with H&S much of the way in that section -- especially about the subjectivity required to construct the evaluations, and the inconsistency across the two reviews -- until they conflate M&Z's refusal to reproduce a representative sample of the now-lost paper-to-dataset mappings with a refusal to "share" them. This serves to imply that the mappings are hidden in Ziliak's sock drawer, or thereabouts... and the authors lose my respect with what I fear is not just poor word choice. Still, the paper is as interesting as it is fierce.
From page 34 (40):
Unfortunately, Ziliak has informed us that such records ["that indicate precisely which passages in the text warrant particular judgments with respect to each question."] do not exist. And McCloskey and Ziliak declined our requests to reconstruct these mappings retrospectively for a random selection of the articles (e-mail McCloskey to Hoover 11 February 2005). Absent such information, including any description of procedures for calibrating and maintaining consistency of scoring between the two surveys, we cannot assess the quality of the scoring or the comparability between the surveys. [Emphasis added]

McCloskey’s reason for not sharing the mappings appears to be, first, that they are utterly transparent and, second, that the relevant information is contained in the scores themselves:
Think of astronomers disagreeing. We have supplied you with the photographic plates with which we arrived at our conclusions [i.e., the question-by-question scores for the 1990 survey]. . . The stars [i.e., the articles in the American Economic Review] are still there, too, for you to observe independently. [e-mail McCloskey to Hoover 19 February 2005]
UPDATE 1/25: After reading Dr. Ziliak's comment, and carefully rereading the sections of their paper dealing with M&Z's AER work, I must say that I'm disappointed in H&S. I don't think H&S have much new to say beyond claiming that the problem is not as bad as M&Z make it out to be. But that is an empirical question, one that, to my mind, H&S fail to address thoroughly -- indeed, not even in a cursory fashion.
H&S caught my attention by insisting their data were better than the original. So I figured they would try to reproduce results -- which, granted, is hard and thankless work! What I really wanted from their paper were the results of a sensitivity analysis that should have been performed: given (a) the expanded and more comprehensive dataset (allegedly 20% larger than the original), and (b) a revised protocol (they didn't seem to like the multi-faceted M&Z questionnaire), how often do the M&Z results still hold? How frequently do published papers focus on measuring and sizing up economic impact? H&S didn't answer these questions. Hence, from the standpoint of my interests, I found Hoover and Siegler's paper pointless. And their selective detailed review of several papers in section 5 demonstrates nothing to me.
In Size Matters, M&Z found that the percent of full-length AER papers that didn't distinguish economic from statistical significance grew from 70% in the 1980's to 82% in the 1990's.
But H&S claim to have found new data: 15 papers in the 80's and 56 papers in the 90's that M&Z failed to include in their previous analyses. Since H&S don't perform one, let me create my own sensitivity analysis, measuring the potential impact of the new data (though not the impact of a revised questionnaire); a short script reproducing the arithmetic appears after the interval summary below.
First, the original M&Z data, percent of papers not distinguishing economic from statistical significance:
1980's: 127/182=70%
1990's: 112/137=82%
Second, a lower bound: assume M&Z's previous identifications are correct, but that every single paper H&S have discovered does measure oomph:
1980's: 127/197=64%
1990's: 112/193=58%
In other words, even under the extremely unlikely scenario that every single paper H&S have identified distinguishes economic from statistical significance, a majority of top AER papers STILL don't! And that 6 percentage point drop over the period, by itself, is not important to the profession.
For a more likely (though not the most likely) mid-range estimate, assume that half of the newly discovered papers measure oomph:
1980's: 135/197=69%
1990's: 140/193=73%
And finally, an upper bound, where none of the new papers measure oomph:
1980's: 142/197=72%
1990's: 168/193=87%
In interval form, the new estimates for the share of papers not distinguishing economic from statistical significance range from 64% to 72% for the 1980's and from 58% to 87% for the 1990's.
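For readers who want to check the arithmetic, here's a minimal Python sketch reproducing the three scenarios from the counts above. The 15 and 56 newly found papers are H&S's figures and the original tallies are M&Z's; everything else (the function name, the rounding of the "half" scenario up to a whole paper) is my own assumption.

```python
# Sensitivity sketch: share of AER papers not distinguishing economic from
# statistical significance, under different assumptions about the newly found papers.

DATA = {
    # decade: (original "failing" papers, original total, newly found papers per H&S)
    "1980's": (127, 182, 15),
    "1990's": (112, 137, 56),
}

def failure_share(orig_fail, orig_total, new_total, new_fail):
    """Share of papers failing to distinguish economic from statistical significance,
    assuming `new_fail` of the newly found papers also fail to do so."""
    return (orig_fail + new_fail) / (orig_total + new_total)

for decade, (fail, total, new) in DATA.items():
    lower = failure_share(fail, total, new, 0)               # every new paper measures oomph
    mid   = failure_share(fail, total, new, (new + 1) // 2)  # about half measure oomph
    upper = failure_share(fail, total, new, new)             # none measure oomph
    print(f"{decade}: {lower:.0%} (lower) / {mid:.0%} (mid) / {upper:.0%} (upper)")

# Prints roughly: 1980's: 64% / 69% / 72%  and  1990's: 58% / 73% / 87%
```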
In sum: A majority of papers in the AER in the 1980's and 1990's did not distinguish economic and statistical significance, although trends in the share are not yet determinable.
(Of course, what is really called for is for another observer to categorize the raw data using a different protocol, but that will have to wait for somebody without a blog.)