On the Way to Fraud [UPDATED]
If you will bear with me for a moment, I would like to begin with a story. When I was a grad student, lo these many years ago, I was mildly (read: pathologically) obsessed with a particular theory of embodied cognition. At a general level, the theory is pretty straightforward: bodily actions activate particular concepts, which in turn influence our perception and understanding of the world. The people advocating this theory have specific ideas about some types of actions and the concepts they activate, and I generally found (find) their ideas absurd. So from the first weeks of my first semester of graduate school, I set out to show just how absurd they were.
What this entailed, in short, was taking the action-concept connections seriously and having people perform the actions to see if the concepts were activated. I came up with a lot of absolutely silly experiments involving things like people rolling across the floor in chairs, or going up and down escalators, and then using some measure of conceptual activation. Long story short: none of them worked, because the particular theory of embodiment is as silly as the experiments were.
After a couple years of this I gave up, because it was a waste of time and effort and I wasn’t going to get publications out of it. Let’s face it, after a while that’s what most science graduate students care about. The research program lay dormant for a while, until another grad student heard about it and somehow convinced me to revive it (he will make an excellent used car salesman if his psychology gig doesn’t work out). So we took an idea that I’d ditched when I abandoned the whole project, modified it a bit, and ran with it. Then we got the data, and it was working! Not just working: the effect was huge. I mean, really huge! We talked to our adviser, we talked to some other people, and everyone agreed: this finding was a really, really big deal. The refrain around the lab became, “We’re going to publish this in Science!”
Here’s the thing, though: the results were too good to be true, so even as we were very excited, we knew something was probably not right. So we went over every step of the method, including the MATLAB code that ran the experiment, and to our great disappointment, we discovered that there was an error in the way we were running the study, and that error, not the hypothesized connection between a particular action and a particular concept, was responsible for the huge effect. When we fixed the code and re-ran the study, the effect disappeared. Zip, zilch, nada.
We were devastated, not because we thought we had discovered evidence for a revolutionary view of the human mind (remember, we thought that view was bullshit, because it is), but because we thought we were going to get published in a major journal like Science, which would (combined with our other, lesser publications) have made getting a tenure-track position at a major research school much, much easier. In other words, it would have made our careers.
This story should make clear, then, just how much such a publication would mean to graduate students, and what they (and untenured faculty) would give to get their research published in Science or Nature: a kidney, maybe two, their first-born child, and if the Devil were doing the peer reviewing, their immortal soul. It is a really, really big deal.
I tell you all of this to shed a bit of light on this week’s revelation of fraud in political science. If you haven’t heard the story, here are the basics (check out FiveThirtyEight and Retraction Watch for more in-depth discussions): In December of last year, Michael LaCour, a graduate student in political science at UCLA, and Donald Green, a political science professor at Columbia, published a paper titled “When contact changes minds: An experiment on transmission of support for gay equality” in Science. Here is the paper’s abstract:
Can a single conversation change minds on divisive social issues, such as same-sex marriage? A randomized placebo-controlled trial assessed whether gay (n = 22) or straight (n = 19) messengers were effective at encouraging voters (n = 972) to support same-sex marriage and whether attitude change persisted and spread to others in voters’ social networks. The results, measured by an unrelated panel survey, show that both gay and straight canvassers produced large effects initially, but only gay canvassers’ effects persisted in 3-week, 6-week, and 9-month follow-ups. We also find strong evidence of within-household transmission of opinion change, but only in the wake of conversations with gay canvassers. Contact with gay canvassers further caused substantial change in the ratings of gay men and lesbians more generally. These large, persistent, and contagious effects were confirmed by a follow-up experiment. Contact with minorities coupled with discussion of issues pertinent to them is capable of producing a cascade of opinion change.
The study was pretty simple: two types of survey canvassers, gay and straight, read one of two scripts: a script sharing a story about why gay marriage was important to the canvasser personally, or a script sharing a story about why recycling was important to the canvasser personally. They found that when gay canvassers read the gay marriage story, their impact on the opinions of those surveyed was large and long-lasting, showing up even after 9 months.
The effect was huge, and based on our existing knowledge of influence, virtually inexplicable. Political scientist Andrew Gelman of The Monkey Cage had this to say about the size of the effect soon after it was published:
What stunned me about these results was not just the effect itself—although I agree that it’s interesting in any case—but the size of the observed differences. They’re huge: an immediate effect of 0.4 on a five-point scale and, after nine months, an effect of 0.8.
A difference of 0.8 on a five-point scale . . . wow! You rarely see this sort of thing. Just do the math. On a 1-5 scale, the maximum theoretically possible change would be 4. But, considering that lots of people are already at “4” or “5” on the scale, it’s hard to imagine an average change of more than 2. And that would be massive. So we’re talking about a causal effect that’s a full 40% of what is pretty much the maximum change imaginable. Wow, indeed. And, judging by the small standard errors (again, see the graphs above), these effects are real, not obtained by capitalizing on chance or the statistical significance filter or anything like that.
Gelman does his best to come up with a plausible explanation for the finding in that post, but it’s clear he’s reaching and that he recognizes he’s doing so. At no point does he question the validity of the findings, however. In fact, no one did (I read the paper, as I read any social or behavioral science research in Science or Nature, and I certainly didn’t). The FiveThirtyEight post linked above describes the coverage, and influence, of the paper, all of which was unquestioning:
By describing personal contact as a powerful political tool, the paper influenced many campaigns and activists to shift their approach to emphasize the power of the personal story. The study was featured by Bloomberg, on “This American Life” and in activists’ playbooks, including those used by backers of an Irish constitutional referendum up for a vote Friday that would legalize same-sex marriage.
“How to convince anyone to change their mind on a divisive issue in just 22 minutes — with science,” was one catchy headline on a Business Insider story about the study.
It wasn’t until this month, when a professor and a graduate student set about trying to replicate the study, that anyone realized something was up. It turns out it was really easy to see, if anyone had looked. Here is the summary of their findings (which are detailed here):
We report a number of irregularities in the replication dataset posted for LaCour and Green (Science, “When contact changes minds: An experiment on transmission of support for gay equality,” 2014) that jointly suggest the dataset (LaCour 2014) was not collected as described. These irregularities include baseline outcome data that is statistically indistinguishable from a national survey and over-time changes that are unusually small and indistinguishable from perfectly normally distributed noise. Other elements of the dataset are inconsistent with patterns typical in randomized experiments and survey responses and/or inconsistent with the claimed design of the study. A straightforward procedure may generate these anomalies nearly exactly: for both studies reported in the paper, a random sample of the 2012 Cooperative Campaign Analysis Project (CCAP) form the baseline data and normally distributed noise are added to simulate follow-up waves.
They even contacted the company that LaCour claimed he had used to do the surveys, and they had never heard of the project. What does this mean? It means that the experiment was never actually run. The data was produced using an existing data set with some statistically generated extra data to make it look like more than one set (for the first survey and follow-up surveys). In short, the entire thing was fabricated.
The discoverers of the fraud, Broockman and Kalla, contacted Green with their case, and he was quickly convinced by their overwhelming evidence. Green, to his credit, then wrote to Science retracting the paper. In the end, the system worked.
Even so, the social science world, as well as the journalists and lay people who follow that world, is left trying to figure out what went wrong at every stage, from LaCour’s decision to fabricate the data to Green’s failure to notice, from Science’s peer-review process to the credulous acceptance of the press and other political scientists. I am afraid that the culture and incentives at each of these four stages are such that this will happen again, perhaps often, and in many cases with less publicity it will go unnoticed much longer. I will address each of the four stages in order.
The Graduate Student
The first and perhaps most pressing question is: what motivated LaCour to commit fraud in the first place? No one can know with any certainty what, precisely, LaCour was thinking, but I think I have a pretty good idea. The experiment, as Green himself notes, was incredibly ambitious for a graduate student, not in its design, which is quite simple, but in its scope, its time commitment, and its cost. It was so ambitious that Green says he initially rejected the idea, but in the end LaCour convinced him he could pull it off. At some point, LaCour must have realized that he couldn’t. At that point he had two choices. First, he could admit a failure that had likely cost him a great deal of time and effort, would no doubt have resulted in significant delays in his research career, and therefore in his graduate education, and, barring major research successes in the future, would have been a significant setback for his academic career as well. Or he could make up the data and hope that he did a good enough job of it that no one would ever notice. He chose the latter.
While I was not privy to the actual process, I suspect that it went something like this: LaCour produced the data and the analyses and sent them to Green, who immediately recognized the importance of the finding. It was likely Green who suggested that they submit the findings to Science first, and I imagine this caused LaCour a great deal of anxiety, because a Science publication means a great deal of attention, and therefore scrutiny, but he had long since passed the point of no return: if he had told Green at this point that he had fabricated the data, Green would have contacted LaCour’s graduate adviser, who would have initiated an investigation, which would ultimately have resulted in LaCour being dismissed from the graduate program. From his perspective, he had no choice but to run with the fraud and hope that somehow, some way, no one ever noticed.
And between this time and someone finally, and inevitably, noticing, LaCour benefited greatly from the study. His name was all over the media, his career took off in the form of a position at Princeton upon finishing his PhD, and he earned the respect of many other, more senior researchers who were eager to work with him to replicate and extend his findings. I imagine it was easy to get caught up in this, and perhaps at times he even convinced himself that everything was going to work out OK. At other times, I am sure he experienced a great deal of anxiety. At no point, however, could he have done anything about it without ending his career as a political scientist. He was a prisoner of his own ill-gotten success.
The Professor
Green’s part is more difficult for me to understand. In interviews (e.g.), he makes it clear that he never saw the raw data, and was never involved in the actual running of the study. As he tells it, he was only involved in some of the more advanced statistical analyses and the write-up. His excuse for this? His university’s Institutional Review Board (IRB) had never approved the study.
Why would this matter? IRB review is necessary for any research using human subjects to ensure that the research is consistent with ethical guidelines. Without such a review, researchers cannot conduct a study using university resources, and are certainly limited in their involvement with studies conducted elsewhere. How limited? Can they not even look at raw data, as Green seems to be claiming? I cannot imagine this is true. Researchers request data from other researchers all the time, for the purpose of running their own analyses or replication, without having IRB approval for the study that produced that data. Surely a researcher planning on publishing that data can do so as well? And if not, why would an experienced researcher, upon seeing such a dramatic effect, not go ahead and request approval so that he or she could see the primary data? Due diligence, especially for a finding of such importance, would seem to demand this.
These are questions Green will undoubtedly have to answer when his own department and university investigate this incident. From here, it looks like Green simply placed too much trust in LaCour, trust that was almost certainly augmented by the fact that he immediately recognized the importance of the finding, and the attention it would bring him. Green’s career may not have needed it as much as LaCour’s, but it would definitely be a huge feather in his cap.
The Journal
First, it is important to understand how Science works. It is one of the two most widely read journals in science generally, along with Nature, both of which share a basic format: short papers (sometimes with more extensive online supplemental materials) on research with potentially broad interest and influence. Unlike most discipline-specific journals, these big multidisciplinary journals also have quick turnarounds: while it is not uncommon for discipline-specific journals to take months, even a year, to review a paper, Science and Nature might go from submission to acceptance or rejection in a matter of weeks. This means that they usually involve fewer reviewers per paper, but also that, given how many submissions they get, they reject almost everything. The few papers that make it through are then published as soon as there is space available.
It should be clear, then, that Science relies on the integrity of the researchers who submit papers to them more than most journals do, because their peer review process cannot be as involved given their swift turnaround time and the number of submissions they receive. They simply do not have the time or the resources to require that reviewers closely check the data, as Broockman and Kalla did when they decided to replicate the study, and as reviewers for other journals may have done.
The implications of this are clear to me: if the name of an experienced researcher like Green had not been on that paper, Science would not have published it, because Science depended on his reputation to supplement its less-stringent review process. In a sense, Green failed Science more than Science itself failed.
Still, the editors and reviewer(s) should have recognized that the results were highly improbable, and made a conscious effort to do more than they usually do when reviewing papers for that publication. Or if they were really responsible, they shouldn’t have published it in the first place, recognizing that it was a study that needed more vetting, and perhaps more follow-up research, before it could be considered believable.
Everyone Else
If we, that is other political scientists, journalists, practitioners, and lay people interested in social science, are being honest with ourselves, our failure is perhaps as great as that at any other stage in this process. These results were, by any standard, unbelievable. That is, everything we knew about opinion and influence suggested that the effects we were reading about were, if not impossible, then at least highly improbable. Many people recognized this, but our trust in the scientific process, along with our excitement at the results, blinded us to the implications, and made us forget our responsibility, as scientists or lovers of science, to be skeptical of any results, but particularly of improbable ones. That is supposed to be how science works, at every level: the more improbable or counterintuitive the finding, the more scrutiny everyone gives it. In this case, we all failed to scrutinize the finding sufficiently.
I do not mean to suggest that we should have assumed fraud; basic charity demands that we exhaust all other possible explanations before we get to that point. We should all have seen that there were likely other explanations, though: errors in the methods, in the data collection, or in the analyses, perhaps, or maybe just a statistical anomaly. We should have waited for replications before we got excited about the implications of the findings, and we certainly shouldn’t have used the results to guide policy or practice. We didn’t do any of this, and we have only ourselves to blame.
Will anything change as a result of this case? Likely not. Graduate students (and untenured professors) will still be under a great deal of pressure to produce publications, and this pressure will induce a few to commit fraud. The established researchers working with those few will have little incentive to take the time and effort to sufficiently check the work, and strong incentives to publish wave-making findings. Journals like Science will, and should, continue to have a quick turnaround, resulting in a less rigorous review process, even for improbable results, as long as those results are likely to make a big splash. And everyone else will be wowed by sexy research findings, as we so often are, regardless of how preliminary and implausible they may be. More fraud will occur and be published, and will be lauded until it is discovered. The only positive, at this point, is that in the vast majority of cases, and in all of the most visible ones, someone will catch it and the perpetrators’ careers will be over.
I do think it is possible to prevent most cases of fraud, but this would require working against the incentives already in place. In order to do so, I suggest we follow Chris’ Three Laws of Data:
First Law: If the data is too good to be true, it is not true. No need for “probably.”
Second Law: If your research partner brings you data that is too good to be true, check that shit.
Third Law: If the data is so good that it defies reasonable empirical and/or statistical explanation, see the First Law.
Following these worked for my colleagues and me in the story at the top of the post, and they would have worked for Green, the editors at Science, and the rest of us had we followed them in this case. We didn’t, and that’s on all of us.
UPDATE: Here is LaCour’s response to the allegations, at his website:
Statement: I will supply a definitive response on or before May 29, 2015. I appreciate your patience, as I gather evidence and relevant information, since I only became aware of the allegations about my work on the evening of May 19, 2015, when the not peer-reviewed comments in “Irregularities in LaCour (2014),” were posted publicly online.
I must note, however, that despite what many have printed, Science has not published a retraction of my article with Professor Green. I sent a statement to Science Editor McNutt this evening, providing information as to why I stand by the findings in LaCour & Green (2014). I’ve requested that if Science editor McNutt publishes Professor’s Green’s retraction request, she publish my statement with it.
This is the sort of non-denial denial that I can only assume was drafted by a lawyer.
Interesting story. When the research first came out I read the first few lines and then dismissed it as wildly improbable; there was clearly some over-promoting or a mistake somewhere. So I can’t say I’m surprised that it actually was bs. The journals have been too trusting for too long of papers they see. Sad but true. Even with the long times it can take to get a paper published and all that.Report
I remember thinking it was implausible as well, and hadn’t thought about it since. But I just figured it was a research issue, not fraud. I don’t think fraud would ever have crossed my mind when reading it. And in a sense, it shouldn’t, but at the very least readers and other researchers and journalists should have, collectively, applied the brakes with respect to the findings.Report
First of all, awesome write-up. Seriously.
Secondly, what is interesting to me here is, for lack of a better word, “BS detection”, and the ways in which it can be defeated (more specifically: why human BS Detectors are circumvented in some scenarios for some people, but not for others; and the next scenario might flip that around, with a completely-different set of people uncritically accepting some BS, and others calling it out.)
I suspect that “biases” and “incentives” are pretty much ALWAYS the answer; in which case, since we know biases/incentives cannot be eliminated, then exposing any narrative or issue to multiple biases/incentives is always best.
To tie it into Dan’s Fox piece and my comments there, if Fox were actually good at its declared job, then Fox would be an unambiguously-good thing, regardless if you agree with its underlying biases or not.Report
Thank you. And you’re right about the bias and incentives. I think it was Will who said (on Twitter) that motivated reasoning is really, really powerful. It is, overwhelmingly so. And we’re all subject to its power.Report
Appeal to the prejudices of supporters + Appeal to the prejudices of opponents == Too good to check.
The three laws are perfect and, ideally, would prevent stuff like this from happening… but… well, it seems like the only real solution is to run LaCour out of town on a rail and make his name be used in the same breath as Stephen Glass and Jayson Blair and point to LaCour on the first day of any given senior-level or grad school research class until we get another example that we can use.Report
Oh, he will undoubtedly be run out of town like every caught fraudster before him (I am thinking, specifically, of one whom I respected a great deal prior to his fraud being discovered: Marc Hauser). His name will be widely known. His tale will be a cautionary one. Fraud will still happen, because research takes forever, and when you realize it’s not working, you will have wasted that time, and effort, and you will be behind. For grad students and untenured faculty, for whom publications are everything (in cog psy, 5-7 in grad school are necessary for tenure-track jobs, and 3-5 a year as junior faculty are necessary for tenure), fraud’s going to happen because people panic.
The only real way to reduce the amount (which, I like to think, is pretty rare as it is) is to change the incentives: make publication numbers less of an issue, and quality research more of one. Then no one cares that you have a Science pub, or that you have 5 pubs in top journals, but only that you’ve demonstrated an ability to do good research.Report
Every “no significant results to report” paper looks the same, though…Report
Which is why you always have something to report, even if it’s just bolstering already known data!!
(Seriously, smart researchers know how to twist one study into 10 findings, and only report on the most interesting of the ten — so if five fail in an entirely probable fashion, you report on the seventh.).Report
Seriously, smart researchers know how to twist one study into 10 findings, and only report on the most interesting of the ten — so if five fail in an entirely probable fashion, you report on the seventh.
This does happen, and it is an absolutely horrible practice. With every comparison you make, the probability of finding a statistically significant result goes up. This is not the way to do good science.Report
Depends… If you’re looking at an entirely predictable result (that 30 years of research will back up), you’re probably not seeing something spurious.
(Besides, you know as well as I do that I’m not really talking cherry-picking Anything Significant).Report
You touched on an important point: publishing this in a high-profile publication is what brought them down.
My brother was an academic chemist. He once found, in some fifth-tier journal, a paper that was relevant to his research. So he attempted to replicate it, and was unable to. The next step is to contact the paper’s author to discuss the techniques used in more detail. The initial contact was easy. He simply got the guy on the phone. Once they got past the preliminaries, and my brother had explained why he was calling, the guy became evasive and nervous. That conversation ended inconclusively, and my brother was never able to reach him again by any medium. Eventually it dawned on my brother that the paper’s results had been falsified. He didn’t pursue it further. Actually proving this would have been at best a lot of work, and really, why bother?
The thing about fifth-tier journals is (1) there are a lot of them, and (2) they largely don’t matter. They exist to give an outlet for academics to meet their publish-or-perish quotas, not to publish interesting and significant results leading to further research. Those papers are published higher up in the food chain. This isn’t to say that everything in the fifth-tier journals is fraudulent, but some substantial portion of them are. In my brother’s case it was just dumb luck that a paper in one of these journals caught his attention.
I don’t think anyone starts down this road intending to fake results, but the sunk costs bring them around. The smart ones at least know not to make their fake results all that interesting.Report
“The smart ones at least know not to make their fake results all that interesting.”
Yeah, I said this yesterday. If you don’t want to get caught, keep things boring. It’s always ambition that brings you down!Report
I suspect he didn’t fully understand how big the findings would be. I mean, I figure he knew they’d be seen as important in his field, but as I say in the post, I’d bet a lot of money that the one who chose Science as the first journal to submit it to (and it must have been the first) was Green, who plausibly claims to have had no knowledge of the fraud. I suspect that, when Green made that suggestion, LaCour felt the ground falling out from beneath his feet, but at that point, what’s he going to do? His choices are then between an absolutely certain career-ending revelation to Green, his adviser, his department, and his university, or an almost completely certain career-ending revelation to the whole world. In the moment, that “almost” must look like the only possible route to take.Report
I will add, though, that if there’s an inexplicable aspect to the nature of his fabrication, it’s in the size of the effect. Seriously, why the hell would you make it that big? And at 9 months, no less! He’d have been published if the effect were still showing up at 3 months and had been half (or a quarter) as big. He definitely got greedy, but again, I bet he had no idea it’d be Science material. And if he did, he’s either an idiot or a sociopath.Report
I’d offer as an explanation that he might not have realized that effect sizes matter. He might have generated the data using some process and seen that the results were statistically significant and thought that was enough. A lot of academics don’t pay that much attention to effect sizes. I’d be unsurprised if a grad student missed their importance.Report
Well, that’s an indictment of his methods and statistics instructors, though you’re probably right.Report
Yawn. There will always be people trying to fake data. Other than sullying (briefly) the quality of academic research, they’re mostly irrelevant.
Where better procedures are needed is in preventing “unethical research” from being used as seed corn in order to make flamboyant (perfectly true!) articles [in fact, at least one of them has been cited on this site].
It’s always easier to prove something if you already know what you’re looking for, after all.Report
Great writeup. Thank you.
We should have waited for replications before we got excited about the implications of the findings, and we certainly shouldn’t have used the results to guide policy or practice.
This, I’m not so sure about. You’d need to ask what we were using previously. If the study results were used to replace hunches, then I think that was probably a sound decision at the time even without replication.Report
Thank you.
I suppose that’s true, in that just about any empirically based idea is better than a gut-based hunch, but I doubt it’s ever that clear-cut. There will usually be at least some research, and unless we’re talking about something completely new, there will be experience-based practices.
But I realize I’m extremely conservative when it comes to science and its application: I want to see replications of the replications of the replications before I take it very seriously. I can get excited about early results, but I ain’t talking to policy-makers about it, and I ain’t investing money in programs and training based on it.Report
I think some of this reflects our different disciplines. When you get something wrong, you might kill someone. If I get something wrong…well, I might never know. And it’s relatively frequent that companies succeed by executing well on the wrong thing rather than waiting for the right thing. Further, acting on mistaken information and observing the results might be the best way to discover what the correct information is.
But I could understand why you might not want to take that sort of approach.Report
Ah, I think you’ve just awakened me to a distinction I was missing, which is likely important. Applied research (of which I’ve never done any) has to deal with these issues much more directly, with more complications. My experience is, in a sense, not as applicable as I’d like to think.Report
Great write up!
Here’s my question: would it have been so bad to simply turn in the paper that found that contact has no statistically significant effect on changing minds? That paper doesn’t get you in Science and it doesn’t get you a tenure track position at Princeton before you graduate, but it gets you a PhD and a shot at a decent career.
Was it that his CV was so otherwise unimpressive that he felt that he had no choice but to go big?Report
In today’s social scientific environment, that paper doesn’t get published at all with a null result. Null results are incredibly difficult to get published in most sciences, but particularly so in social and behavioral sciences. At the very least, he’d have had to run a bunch of replications of the study in multiple contexts, showing null results every time, to get it published in a 3rd-tier, highly domain-specific journal, which would have been unreasonable for a grad student.Report
@chris
Which is, in itself, a problem.Report
I’ve also had trouble selling my screenplay, And Absolutely Nothing Happened That Day.Report
get it published in a 3rd-tier, highly domain-specific journal
With, career-wise, pretty much a null result.Report
Also, thank you.Report
Great post Chris, though I do have a largely unrelated methodological question:
I may be reading that wrong, but it sounds to me like a linear specification is being fitted to the response on the 5-point scale. Is that normal? Because I wouldn’t want to fit a linear model to something that is bounded at both ends.Report
Thank you.
I’m not sure I understand your question. He’s speaking about a fairly simple case: the maximum absolute difference possible between two instances (or two means) on a 5-point scale. In this case, it’s the difference between the mean ratings in two experimental conditions. So the maximum logically possible difference is that between a 1 and 5, or 4, and because a lot of people will select 4 or 5, the maximum reasonable difference is going to be smaller (about 2) on average. That is, the 4’s and 5’s are going to pull the means up sufficiently that it’s unlikely the differences could be anything close to 4 in practice.Report
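To make that ceiling concrete, here is a tiny sketch with entirely made-up numbers (nothing from the actual paper), showing how a baseline skewed toward the top of a bounded 1–5 scale caps the largest average change you could ever observe:

```python
# Minimal sketch, hypothetical numbers only: on a bounded 1-5 scale, the
# largest possible average shift is limited by where people already sit.
import numpy as np

# A made-up baseline distribution skewed toward 4s and 5s
baseline = np.repeat([1, 2, 3, 4, 5], [5, 10, 20, 35, 30])

max_shift = 5 - baseline.mean()  # even if every respondent moved to 5
print(f"baseline mean: {baseline.mean():.2f}")                     # 3.75
print(f"largest shift arithmetically possible: {max_shift:.2f}")   # 1.25
```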
@chris
Ah, so it’s just a T-test or Chi-Square test on the difference between the subgroup means. I forget regressions are much less popular outside of Economics.Report
Likely ANOVA, as there were multiple conditions (and t-tests for post hoc). And of course, ANOVA is a special case of regression.
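As a quick illustration of that “special case of regression” point, here is a minimal sketch with simulated ratings (nothing from the actual paper): a one-way ANOVA and an OLS regression with a dummy-coded condition factor give the same F test.

```python
# Minimal sketch with simulated data (not the paper's): a one-way ANOVA
# is the same test as an OLS regression with dummy-coded group indicators.
import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "condition": np.repeat(["gay_script", "straight_script", "control"], 30),
    "rating": np.concatenate([
        rng.normal(3.2, 1.0, 30),  # hypothetical mean ratings per condition
        rng.normal(3.0, 1.0, 30),
        rng.normal(2.9, 1.0, 30),
    ]),
})

# One-way ANOVA across the three conditions
f_anova, p_anova = stats.f_oneway(
    *[g["rating"].to_numpy() for _, g in df.groupby("condition")]
)

# The same comparison as a regression with the condition factor dummy-coded
ols_fit = smf.ols("rating ~ C(condition)", data=df).fit()

print(f"ANOVA:      F = {f_anova:.3f}, p = {p_anova:.4f}")
print(f"Regression: F = {ols_fit.fvalue:.3f}, p = {ols_fit.f_pvalue:.4f}")
# The two F statistics and p-values match.
```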
Actually, political scientists use regression a great deal, but this was a proper experiment, and therefore regression would be logically unnecessary.Report
@chris
Ah, ANOVA of course.
I would have considered a regression specification anyway; after all, there’s no way the subgroups can be perfectly controlled. Also, there’s a technique for handling categorical dependent variables (though I’ve never tried it with more than 2 categories) that would make it possible to relax the assumption that (for example) the difference in attitude between points 2 and 3 on the scale was the same as between points 4 and 5. It would also deal with the symmetry of changing opinions at the ends of the scale. It’s bad form statistically to assume normal disturbances in a variable with strictly bounded edges.Report
ANOVA’s ability to be useful even with fairly large deviations from the normality assumption is famous, of course, but I get your point.
Looking through the paper again, it looks like they just used t-tests for the main comparisons (the results section and supplemental material suck, but they usually do in Science).Report
I could not disagree more with your First Law. Do not EVER deny the validity of data because the results are counterintuitive. Research the data like crazy, question every step you made along the way, but don’t dismiss it. You may have found something extraordinary, or you may have stumbled upon a one-shot statistical outlier. Both things happen.
I’m reminded of a story, that I’m probably going to tell wrong, of a team that was investigating Mount Saint Helens in early 1980. There had been unusual seismic activity in the area, so they brought in cutting-edge laser sensors that could detect ground shifts in the range of fractions of millimeters. The next day they checked the sensors which reported that they’d moved two feet. They assumed that the sensors must not be working right.
Extreme example, I know. But when you start disregarding evidence that seems wrong, or signing off on data because it seems right, you’re throwing away the value of empirical research.
There’s no shame in reporting results that can’t be duplicated, either. 5% of all reports with a 95% confidence interval are misleading. You put them out there anyway. We’re all aware of the pressures to publish. You shouldn’t get a job at Princeton because you hit upon a 1-in-20 unusuality, any more than you should lose one for only rolling a 19 on the 20-sided dice. That’s an institutional problem. Institutional problems shouldn’t affect how you analyze data.Report
Counterintuitive is the best data, from a scientific perspective, because it shows us something new. Too good to be true is different. In this case, it was huge effects, much larger than such research usually produces, with a finding that defies reasonable explanation.Report
Also, the point is only that if it’s too good to be true, check your methods, replicate, try other methods, etc.Report
Why the “and should”? I’m no scientist, but from what you’re saying, Science is one of the two premier journals. Why not have a more rigorous review process?Report
Because they provide a pretty valuable service using that model: solid, early results of broad interest that should spark a lot of follow up, perhaps in multiple disciplines or sub-disciplines.Report
Got it. It just seems strange coming, as I do, from a discipline (history) where peer review takes so long that articles are vetted and re-vetted so much as to kill their spirit. Or so I’ve heard….I’ve never actually tried to publish a peer reviewed article.
Maybe it’s in a weird way a function of what Vikram says above. A historian who gets his or her argument wrong is probably not thereby going to kill people, while in the harder sciences the stakes are higher?Report
Yeah, that’s what the more specialized journals do. I took 4 years trying to get a series of studies published, once.Report
By the way, I echo others’ points….this is a well-written, informative post.Report
Excellent post @chris
Do you think this could change the perception of Science as a good journal? Or is this just one of those things that, at this point, just happen?
While I have a couple scientists in the family, they only published in field specific journals.Report
First, thank you. You are all too kind.
Second, I hope not. While I can’t speak for the harder sciences, everyone in the social and behavioral sciences recognizes Science and Nature for what they are. That is, they’re not where you will find comprehensive, detailed, multi-study papers that are perhaps years in the making and meant to cover as many objections as possible. Those get published in more specialized journals. Science and Nature are for “sexy” but well-conducted studies with cross-disciplinary appeal. Like this one, if it had panned out.Report
I can only speak for myself, but my answer is not in the slightest bit. They still have just about the most interesting pieces of any publication.Report
There’s been a lot of problems with research publishing over the years. One of our investment companies found this out, expensively. It had hired a doctor to oversee clinical trials some months before the NYT did an expose on papers written by employees of drug companies but published under names of independent researchers; this doctor was one of those ‘independent authors.’
We had done seed funding, and needed follow-up funding to keep the company growing; and now, the person we’d hired to oversee clinical trials was in the middle of a huge controversy that brought into question the integrity of the trials he was conducting. His job also included talking to potential investors about those trials; he was very much a public face of the company.
This is one of the very few times the decision makers in the family took my advice: fire him, halt the trials currently under way, and start over, because the company’s ability to grow and thrive depended on the perception of the integrity of those trials. This proved wise: the economy had just collapsed and investors were jittery, and starting over created enough confidence in the research for follow-up investors to invest.
The integrity of peer-reviewed research has a lot of implications not just in academia, but in the business world and for rubes like me who might hire people who write research papers and conduct clinical trials and do experiments.
This company is still thriving, and it has recently presented the results of the new trials we conducted in the wake of that debacle, which have been well received. It may, someday, make a profit. But I don’t think it would have survived if we hadn’t fired that doctor.
I’m not convinced he did anything wrong; but he participated in something wrong, and that created the impression of dishonest research — an impression of fraud — and that, I think, would have destroyed the business if we hadn’t fired him.
Integrity matters.
Stellar post, @chrisReport
I could be wrong, but I think TAL just did a segment on this study, that must have been produced prior to the fraud coming out.Report
They did.Report
Wow. I was just telling my wife last night at dinner about the story I heard. Guess I need to go tell her this part now.Report
Excellent post, Chris! Thank you for writing it.Report
I know this is late, but I just saw this this AM and it seemed relevant.
How easy it is to perpetrate, then widely disseminate, fraudulent (or probably more often, just plain bad) study results:
http://io9.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800Report
Yeah, this is something I was hinting at in the stats conversation with James earlier in the thread, which is actually an even bigger issue than they’re getting at.
For those who haven’t read Glyph’s link (which is fascinatingly disturbing), they ran a “real” study (though a really, really small one) and then ran a ton of statistical comparisons between the groups, some of which yielded statistically significant results. It turns out that even with fairly rigorous criteria for statistical significance, with multiple comparisons you’re likely to find at least some that are statistically significant just by chance.
The math for this is pretty simple. They describe it at the link, but I’ll lay it out here as well. Assuming your comparisons are independent (that is, the result of one is not dependent on some aspect of another), the probability of getting at least one significant result for a particular alpha-level (your pre-determined criterion for statistical significance) with k comparisons is 1-(1-alpha)^k. So, for example, if your alpha-level was p < .05 (as is common in social and behavioral science), and you ran 20 comparisons, your probability of at least one significant result would be 1-(.95^20), or .64. That is, with 20 comparisons, and an alpha-level of .05, you have a 64% chance of getting at least one statistically significant result, which would lead you to reject the null hypothesis, perhaps incorrectly.
In the story Glyph links, they ran 18 comparisons. Assuming they were independent (since they were health measurements on the same people, they weren’t, so the true probability differs somewhat from this figure), the probability that they’d find at least one statistically significant result at the .05 level is 1-(.95^18), or .603. In other words, they were more likely to find a statistically significant result than not to find one.
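For anyone who wants to check that arithmetic, here is a minimal sketch (the function name is mine, not anything from the linked story) that reproduces those figures:

```python
# Minimal sketch: probability of at least one false positive (the
# "family-wise error rate") across k independent comparisons at a given
# alpha-level, assuming all null hypotheses are true.
def familywise_error_rate(alpha: float, k: int) -> float:
    return 1 - (1 - alpha) ** k

for k in (18, 20):
    print(f"k = {k}: P(at least one significant result) = "
          f"{familywise_error_rate(0.05, k):.3f}")
# k = 18: 0.603
# k = 20: 0.642
```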
Now, researchers know about this problem, and if they’re honest, they will account for it with various correction techniques, the most common of which in psychology is the Bonferroni correction. It’s really, really simple: just take your alpha-level and divide it by the number of comparisons. To maintain a family-wise alpha-level of .05 with their 18 comparisons, they’d need to get p-values below .05/18 ≈ .0028. Peer reviewers will, when multiple comparisons are made, look for such a correction, and studies that don’t use one generally won’t get published.
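And a similarly minimal sketch of that correction, with made-up p-values standing in for the 18 real comparisons:

```python
# Minimal sketch of the Bonferroni correction described above: divide the
# desired family-wise alpha by the number of comparisons, and count a result
# as significant only if its p-value falls below the stricter threshold.
alpha, k = 0.05, 18
threshold = alpha / k
print(f"per-comparison threshold: {threshold:.4f}")  # 0.0028

p_values = [0.001, 0.012, 0.0027, 0.04, 0.20]  # hypothetical p-values
survivors = [p for p in p_values if p < threshold]
print(f"survive the correction: {survivors}")  # [0.001, 0.0027]
```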
However, there’s a more serious problem related to the multiple comparisons problem: while researchers almost always use methods like the Bonferroni correction to adjust their significance criteria within a single study, they rarely, if ever, use them between studies. Since it’s not uncommon for researchers in the social or behavioral sciences to run the same or similar studies multiple times, the multiple comparisons problem comes into play: if they run the same study 10 times, their probability of finding at least one statistically significant result at the .05 level is about 40%. Since it’s much more likely than the usual 5% that such a result would be obtained by chance, it’s very likely that a bunch of such findings, which would of course be published, are incorrect. That is, a non-trivial proportion of published social and behavioral scientific findings, particularly if they rely primarily on one published study, are likely bogus.
Of course, science has a pretty straightforward corrective mechanism for this: replication. However, if one fails to replicate a published finding once or twice, it’s unlikely anyone would publish the results, so you will have to fail to replicate it multiple times. Which, as you now know, raises the multiple comparisons problem.
This isn’t intractable, and again, there are built-in mechanisms for dealing with this. Really, the basic concept of hypothesis testing is a corrective. However, it means that false results will slip through the cracks, and it means that if one really wants a statistically significant result, one can probably find one, and one can probably get it published; and given the way science journalism works today, if one’s fake result is sexy enough, one can get a lot of coverage for it, meaning that the result will be part of the public consciousness for years even if it is shown to be false relatively quickly.Report
Ugh, for some reason it didn’t let me use the Greek symbol for alpha.Report
We only do beta around here.Report
This is the greater problem to be dealt with & is similar to news corrections or retractions in that the corrected information is often buried. Online publications are somewhat better in that corrections can be added directly to the article in question, but that’s only valuable at the source. The outfit that reports on a report, or reblogs it, may ignore such corrections. Also, if HuffPo reports on some study & 6 months later it’s shown to be flawed, HuffPo might update the original, or pull it, but they certainly won’t run a new article for the front page unless there was a juicy scandal associated with it.
Toss in our national pastime of loving a good conspiracy theory & even if corrections are made & widely publicized, there will be a disturbing number of people who will refuse to discard the original report & assume Big X quashed it for nefarious reasons.Report
Green Jelly Beans Cause Acne!
https://xkcd.com/882/Report