In a November 2005 working paper, Sound and Fury: McCloskey and Significance Testing in Economics, Kevin Hoover and Mark Siegler inform us that because Deirdre McCloskey still hasn't done her homework right, she continues to misrepresent the median economist as a statistical dummy. Part of the abstract:

That statistical significance is not economic significance is a jejune and uncontroversial claim, and there is no convincing evidence that economists systematically mistake the two. Other elements of McCloskey’s analysis of statistical significance are shown to be ill-founded, and her criticisms of practices of economists are found to be based in inaccurate readings and tendentious interpretations of their work. Properly used, significance tests are a valuable tool for assessing signal strength, for assisting in model specification, and for determining causal structure.

Here's a more extensive earlier draft, and a list of the full-length AER papers McCloskey and Ziliak failed to include in previous analyses.

That's all in section 5.1, starting at page 31 (37) of the working paper. I was with H&S much of the way in that section -- especially about the subjectivity required to construct the evaluations, and the inconsistency across the two reviews -- until they conflate the refusal of M&Z to *re-produce* a representative sample of the now lost paper-to-dataset mappings with a refusal to "share" them. This serves to imply that the mappings are hidden in Ziliak's sock drawer, or thereabouts.... and the authors lose my respect with what I fear is not just poor word choice. Still, the paper is as interesting as it is fierce.

From page 34 (40):

[Emphasis added] Unfortunately, Ziliak has informed us that such records ["that indicate precisely which passages in the text warrant particular judgments with respect to each question."] do not exist. And McCloskey and Ziliak declined our requests to reconstruct these mappings retrospectively for a random selection of the articles (e-mail McCloskey to Hoover 11 February 2005). Absent such information, including any description of procedures for calibrating and maintaining consistency of scoring between the two surveys, we cannot assess the quality of the scoring or the comparability between the surveys.

McCloskey’s reason for not sharing the mappings appears to be, first, that they are utterly transparent and, second, that the relevant information is contained in the scores themselves: "Think of astronomers disagreeing. We have supplied you with the photographic plates with which we arrived at our conclusions [i.e., the question-by-question scores for the 1990 survey]. . . The stars [i.e., the articles in the American Economic Review] are still there, too, for you to observe independently." [e-mail McCloskey to Hoover 19 February 2005]

**UPDATE 1/25:** After reading Dr. Ziliak's comment, and carefully reading the sections of their paper dealing with M&Z's AER work, I must say that I'm disappointed in H&S. I don't think H&S have much new to say other than that the problem is not as bad as M&Z claim. However, this is an empirical question that, in my mind, H&S fail to address thoroughly -- in fact, not even in a cursory fashion.

H&S caught my attention by insisting their data were better than the original. So I figured they would try to reproduce results -- which, granted, is pretty hard and thankless work! What I really wanted to know from their paper were the results of a sensitivity analysis that should have been performed: given a) the expanded and more comprehensive dataset (allegedly 20% larger than the original), and b) a revised protocol (they didn't seem to like the multi-faceted M&Z questionnaire), how often do the M&Z results still hold? How frequently do published papers focus on measuring and sizing up economic impact? H&S didn't answer these questions. Hence, I found Hoover and Siegler's paper pointless from the standpoint of my interests. And their selective detailed review of several papers in section 5 demonstrates nothing to me.

In *Size Matters*, M&Z found that the percent of full-length AER papers that didn't distinguish economic from statistical significance grew from 70% in the 1980's to 82% in the 1990's.

But H&S claim to have found new data: 15 papers in the 80's and 56 papers in the 90's that M&Z failed to include in their previous analyses. Since H&S don't perform one, let me create my own sensitivity analysis, measuring the potential impact of new data (though not the impact of a revised questionnaire).

First, the original M&Z data, percent of papers not distinguishing economic from statistical significance:

1980's: 127/182=70%

1990's: 112/137=82%

Second, I'm looking to make a *lower bound*: assume previous identifications of M&Z are correct, but that every single paper H&S have discovered **does** measure oomph:

1980's: 127/197=64%

1990's: 112/193=58%

In other words, under the extremely unlikely scenario that every single paper H&S have identified distinguishes economic from statistical significance, a majority of top AER papers STILL don't! And that 6-percentage-point drop over the period, by itself, is not important to the profession.

For a more likely (though not most likely), mid-range estimate, assume half of all newly discovered papers measure oomph:

1980's: 135/197=69%

1990's: 140/193=73%

And finally, what if none of the new papers measure oomph:

1980's: 142/197=72%

1990's: 168/193=87%

In interval form, the new estimates for the share of papers not distinguishing economic from statistical significance range from 64% to 72% for the 1980's and from 58% to 87% for the 1990's.
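The bounds above are simple to recompute. Here's a short sketch of the calculation, using only the counts quoted in this post (127/182 and 112/137 from M&Z's original data, plus the 15 and 56 papers H&S say were omitted); the function name and structure are my own, not from either paper.

```python
# Sensitivity bounds on the share of AER papers not distinguishing
# economic from statistical significance, after adding the papers
# H&S claim M&Z missed.

def share_not_distinguishing(bad, total, new_papers, new_bad):
    """Percent of papers failing the M&Z test after adding
    new_papers articles, of which new_bad also fail."""
    return round(100 * (bad + new_bad) / (total + new_papers))

# (failing, total, newly discovered) per decade, from the post
decades = {"1980s": (127, 182, 15), "1990s": (112, 137, 56)}

results = {}
for decade, (bad, total, new) in decades.items():
    results[decade] = (
        share_not_distinguishing(bad, total, new, 0),              # lower bound: all new papers measure oomph
        share_not_distinguishing(bad, total, new, round(new / 2)), # mid-range: half do
        share_not_distinguishing(bad, total, new, new),            # upper bound: none do
    )

for decade, (lo, mid, hi) in results.items():
    print(f"{decade}: {lo}%-{hi}% (mid {mid}%)")
```

Running this reproduces the intervals given above: 64%-72% for the 1980's and 58%-87% for the 1990's.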

**In sum**: A majority of papers in the AER in the 1980's and 1990's did not distinguish economic and statistical significance, although trends in the share are not yet determinable.

(Of course, what is really called for is another observer to categorize the raw data using a different protocol, but that will have to wait for somebody without a blog).

How long has McCloskey been doing this? So boring.

My banker always wants to sell me overdraft insurance, and keeps telling me how many people don't understand the risk of inadvertently overdrafting their accounts, as if everyone is a fool like himself.

Kevin and R-squared:

McCloskey and I have written a polite but detailed response to Hoover and Siegler. We find no reason in the facts or the logic to change our minds on two points:

1) A supermajority of economists believe statistical significance is a necessary and sufficient condition for proving economic significance, and,

2) Most economists do not in their research papers explore the meaning of the magnitudes of their results.

We are not fools who have not done our homework: our logic and findings have been verified by numerous top-flight econometricians--editors, for example, of Econometrica and the Journal of Econometrics--and by four Nobel laureates and counting. A 2004 issue of the Journal of Socio-Economics (33) is devoted to the matter, and prints the comments.

But if none of that persuades, look, as I have, into the rhetorical history of significance. You will find that William Sealy Gosset (aka "Student") was the inventor of significance testing for small samples (his 1908 "z" became today's "t"). Gosset was a lifelong proponent of economic significance and a lifelong enemy of that "boring" point: t, r-squared, p are neither necessary nor sufficient for proving economic significance. He told everyone he knew, including Karl Pearson, The Man of applied statistics in England. Pearson wouldn't listen.

He told Ronald Fisher. Gosset tutored Fisher through his teethings. But Fisher wouldn't listen. Fisher invented the foolish and damaging Rule of Two. We document all this in a book by Ziliak and McCloskey, SIZE MATTERS: HOW SOME SCIENCES LOST INTEREST IN MAGNITUDE, AND WHAT TO DO ABOUT IT (Chps. 19-22).

Hoover and Siegler--if you look carefully at their paper--listen sometimes, sometimes not. Their equivocation will influence a few students--students eager to find a path of least resistance.

Beyond lack of courage and simple-minded careerism ("hey, if you can get away with it, why not?"), an explanation for the lure is trained incapacity and the bureaucratization of knowledge after high modernism.

It's helpful to look at the facts. As for example my friends Dan K. and Peter B. do.

Prof Z

McCloskey and Ziliak are correct, but they have overlooked a much more serious problem for econometricians and econometrics. Benoit Mandelbrot has spent over 50 years demonstrating that the normal distribution is not an accurate or reliable representation, in general, for most time series economic data. It is interesting that neither Frisch, Tinbergen, Koopmans, Haavelmo, Marschak, et al, ever did ANY goodness of fit test on their time series data to see if the normal distribution was a sound representation of the data. Of course, Keynes asked Tinbergen very politely to demonstrate that his data sets were "...HOMOGENEOUS, UNIFORM, AND STABLE.." over time back in 1939. No econometrician has ever shown that their time series data pass any goodness of fit test. The test Keynes suggested be used in A Treatise on Probability on pp. 420-421 was the Lexis Q test. Let's hope that future econometricians don't provide the "answer" given by Paul Cootner to Mandelbrot in 1964, which was that they were going to continue to assume normality in spite of the fact that the actual data fit the Cauchy distribution the best, because it would be too hard to apply the Cauchy.

I don't understand the relevance of the table with the additional papers. Isn't the main issue whether authors pay attention to the size of the coefficients? And that is not in the table. What am I missing?

Thomas,

You've identified the major problem here. Hoover and Siegler did not analyze whether the authors of the additional papers paid attention to the size of the coefficients, which is what I think is called for.

They might be saving this analysis -- along with a reassessment of all the papers McCloskey and Ziliak did initially include -- for a later paper.

I did not email them to ask.