BETA Podcast: Replication crisis

02 September 2019

Have you heard about the ‘replication crisis’ facing science?

Scientists in many academic and applied fields around the world are grappling with a crisis of confidence.

Results of many prominent, widely-accepted studies have been thrown into question, as attempts to recreate the experiments have failed.

Social sciences, economics and other disciplines have been on a difficult journey over the past decade, confronting the implications of challenges to long-held findings.

This week’s episode of the BETA podcast dissects the issue, with Professor Ben Newell, Deputy Head of the School of Psychology at the University of New South Wales, and Harry Greenwell, BETA’s own head of trial design and evaluation.

Ben goes through some famous replication failures, while Harry talks about what BETA and others are doing to mitigate these risks and give confidence to findings.

This is the first in our new series of podcasts, where we explore interesting topics with experts in the field.

You can find all past episodes of BETA’s podcast at the website.

Disclaimer: BETA's podcasts discuss behavioural economics and insights with a range of academics, experts and practitioners. The views expressed are those of the individuals interviewed and do not necessarily reflect those of the Australian Government.

Transcript

[music]

Dave:

Hello and welcome to another episode of BETA’s podcast. I’m your new host Dave.

Today we’ve got Professor Ben Newell, a member of our independent Academic Advisory Panel, who joins us from the School of Psychology at the University of New South Wales.

Ben chats with Harry Greenwell, who heads up BETA’s trial design and evaluation unit, about the replication crisis facing science. Hope you enjoy.

Dave:

Welcome, Professor Ben Newell and Harry Greenwell. It's good to have you here. It'd be great if we can just start with a quick 30 seconds on who you are, where you've come from, what kind of work you do. So we'll start with you, Ben, if that's okay.

Ben:

Sure. My name is Ben Newell. I'm a professor of cognitive psychology at UNSW and I'm also on the Academic Advisory Panel for BETA. My work is broadly in the area of judgement and decision-making, and I'm interested in cognitive models and underlying cognitive processes that are involved when people make judgments, decisions and choices. I've also had a long-standing interest in implicit influences on cognition, so when people are influenced by factors in the environment that they're apparently unaware of and how that changes their behaviour. And through that work I've become interested in the issues around replication and replicability of some prominent findings.

Dave:

Perfect person to be speaking to today then. And you, Harry?

Harry:

Yeah, hi. I'm Harry Greenwell. I work at the Behavioural Economics Team of the Australian Government, or BETA. And our purpose is to take the findings from the behavioural sciences and especially from psychology, the sort of work that Ben does, and think about how we can apply it to public policy design and delivery. My particular role within BETA is to head up the evaluation unit so that we can look at the interventions that we've developed and find out whether they work the way that we want them to work.

Dave:

Excellent. It's probably good that we get onto today's topic as a nice segue. So we're talking about what's called the replication crisis which has hit science and academia in the last few years. I guess it would be good to start with a plain English definition. I might ask you to start, Ben, because this is an area of very deep interest to you?

Ben:

Okay, so a simple version: replication is important in science in general. Before we accept a finding as, quote unquote, true, we need to show that the result can be obtained in multiple different experiments, and hopefully in multiple different contexts and locations and so on. And a replication crisis is when a result that you thought was a standard, accepted finding turns out to be very hard to reproduce in your lab or in someone else's lab.

And what's happened in the last, I guess, seven or eight years in psychology and in several other disciplines as well, but perhaps most publicly in psychology, has been that some high profile findings that were kind of ground truths to some extent in some areas of the literature, particularly in social psychology, have started to unravel in that they haven't been able to be replicated.

Dave:

It's been really interesting to follow it. It's almost like a mass epiphany that the way we're doing science is a bit flawed in many cases. But there are a few interesting studies that helped crystallise this for people. This Daryl Bem study, are you able to give us a quick sense of what that was about?

Ben:

Yeah, so the Daryl Bem paper. Daryl Bem is a very well known, well established figure in social psychology, and in this paper, which was called Feeling the Future, he claimed to have found evidence for a kind of precognitive process: people being able to guess what was hidden behind a screen and make inferences about that information even though they had no means of perceiving it.

Dave:

So kind of seeing the future, or they call it Extra Sensory Perception, don't they?

Ben:

Yeah. So ESP's long been searched for and never really been reliably established. But this paper, he claimed to have found across several different experiments and several hundred people evidence that people could indeed see the future.

Dave:

And it was sort of seen at the time as pretty powerful evidence, wasn't it? There was a pretty decent sample size. It was peer reviewed, it was published in a pretty well reputed journal, right?

Ben:

Yeah. I mean, that was one of the surprising things. So it was published in the Journal of Personality and Social Psychology, which is probably the flagship journal for social psychology. And I think the reaction to it was that this is an extraordinary claim, right? This goes against most of what we understand about how people process information. And yet it was being made in this very high profile journal.

And so as the saying goes, extraordinary claims require extraordinary evidence. And it attracted the interest of several people, but perhaps most prominently E.J. Wagenmakers and his group from Amsterdam who started to look at the evidence for the claims in a little more detail and found, I guess, some inconsistencies in the way that the data had been analysed.

Dave:

And was it that group who eventually did the paper showing how you could actually fudge the numbers to prove the impossible? The study looking at the Beatles' song When I'm Sixty-Four, where through statistics they 'proved' that listening to the song actually makes you 18 months younger.

Ben:

That wasn't E.J.'s group. That was Uri Simonsohn and some of his colleagues, Joe Simmons and Leif Nelson. So E.J.'s group published a reanalysis of the Bem data showing that other ways you could analyse the data would lead you to very different conclusions. And then the Simonsohn work showed ... The title of the paper was False-Positive Psychology, and the general point of that paper was to highlight that various things one could do when analysing data or designing studies could lead, sometimes inadvertently, to erroneous, spurious findings. And the example they illustrate is that if you tweak the stats in sufficient ways, you can find things that are just demonstrably false.
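A small illustrative simulation of the point Ben is making here, a sketch with made-up numbers rather than anything from the False-Positive Psychology paper itself: even when there is no effect at all, trying a handful of analysis choices (two outcomes, a couple of subgroups) and keeping whichever looks best turns up "significant" results far more often than one time in 20.

```python
# Illustrative sketch only (not the analyses from the original paper): with no
# real effect at all, trying several analysis choices and reporting the best
# p-value produces "significant" findings far more often than the nominal 5%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_simulated_studies = 2_000
n_per_group = 30
hits = 0

for _ in range(n_simulated_studies):
    group = np.repeat([0, 1], n_per_group)          # 0 = control, 1 = "treatment"
    gender = rng.integers(0, 2, 2 * n_per_group)    # an arbitrary subgroup variable
    outcome_a = rng.normal(size=2 * n_per_group)    # no true effect on either outcome
    outcome_b = rng.normal(size=2 * n_per_group)

    p_values = []
    for outcome in (outcome_a, outcome_b):
        # Whole sample, then each subgroup: six analysis choices in total.
        for mask in (np.ones_like(group, dtype=bool), gender == 0, gender == 1):
            _, p = stats.ttest_ind(outcome[mask & (group == 1)],
                                   outcome[mask & (group == 0)])
            p_values.append(p)
    hits += min(p_values) < 0.05                    # keep the "best-looking" analysis

print(f"Share of null studies with a 'significant' result: {hits / n_simulated_studies:.2f}")
# Typically around 0.2 in this setup, rather than the nominal 0.05.
```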

Dave:

Yeah. I know Harry, you've kind of looked at this a little bit before as well. What are the sorts of things that people can do to actually change these results? And what are the incentives? I know things like selection bias and the actual system of journals and grants come into it, but how do people actually go about doing this?

Harry:

Yeah, sure. Look, I'll just add a coda to what Ben was saying, especially on the Daryl Bem study. So I think in addition to there being a reanalysis of the work that Bem had done, others then went out and attempted to replicate his experiments. And they consistently failed to replicate his results. So just in case any of our listeners are uncertain, the evidence is pretty unequivocal. We can't see the future, unfortunately.

What's causing this, and what can we do about it? That's really where you're going. You mentioned selection bias. So part of that just comes down to what's interesting for journals to publish? What's interesting for academics to publish? If you've found something new, that's interesting, that gets published. And so we talk about publication bias. And the idea here is just that we're only seeing a subset of the available evidence, and it's a skewed subset: the times when we turn up something that appears to be positive. We'll always get some false positives. We run tests, and we hope that the false positive rate is reasonably low, but we'll always have some.

If you only publish the positives, then you're losing all of the information you get from the times when the tests reveal nothing. That's important. So publication bias and how we address it is the first part of the question. The second part, which Ben was also getting at, especially when he was talking about the paper on false positive psychology, is about what are now called questionable research practices: practices that I think a number of researchers thought were benign, or just hadn't really reflected on. What that paper, False-Positive Psychology, revealed is that they can substantially increase our false positive rate well above what we thought it was. So we thought our false positive rate might be one in 20; combine a few of these questionable research practices and you might have a one in two chance of hitting a false positive. So the answer is to try to tackle those possible causes.
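A rough sketch of the publication bias Harry describes, with assumed (not real) proportions: if only a minority of tested ideas are true and only positive, statistically significant results get written up, the published record ends up containing a sizeable share of false positives and exaggerated effect estimates.

```python
# Illustrative sketch of publication bias: simulate many studies, "publish"
# only the positive, statistically significant ones, and look at what the
# published record then shows. All proportions and effect sizes are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies = 5_000
n_per_group = 50
true_effect_share = 0.2        # assume only 1 in 5 tested ideas is real
true_effect_size = 0.3         # a modest standardised effect when it is real

published_true, published_false, published_estimates = 0, 0, []

for _ in range(n_studies):
    is_real = rng.random() < true_effect_share
    effect = true_effect_size if is_real else 0.0
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(effect, 1.0, n_per_group)
    estimate = treated.mean() - control.mean()
    _, p = stats.ttest_ind(treated, control)
    if p < 0.05 and estimate > 0:      # only positive, significant results get "published"
        published_true += is_real
        published_false += not is_real
        published_estimates.append(estimate)

share_false = published_false / (published_true + published_false)
print(f"Share of published findings that are false positives: {share_false:.2f}")
print(f"Average published effect estimate: {np.mean(published_estimates):.2f} "
      f"versus a true effect of {true_effect_size} when the effect is real")
# Under these assumptions, roughly a quarter of published findings are false
# positives, and the published effect sizes noticeably overstate the truth.
```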

Dave:

They're pretty staggering numbers, aren't they? I know there have been a couple of really interesting papers since. There was, I think, a group of about 270 psychologists who tried to replicate 100 experiments published in top journals, and only around 40% of the studies held up. So roughly 60% failed to replicate. And that's pretty staggering when you think about the rigour that is supposed to be behind the journals that publish these top papers.

Harry:

Yeah. So I think that this study is both concerning and ultimately hopeful. So the concerning part is that you don't want to hear that around four in 10 of your studies hold up when somebody attempts to replicate them. That seems too low a hit rate. On the other hand, think about the incredible collaboration that went on and the much richer evidence that we got out of that project. And Ben's been part of similar collaborations where the depth of evidence that we're now getting around a number of these questions in psychology is far, far, far richer than we had a decade ago.

Ben:

Yeah, I think that's right. I think these registered replication reports and the large scale open science type collaborations are definitely moving us towards realising the implications of these questionable research practices, and also putting in place the frameworks for getting a much more solid base of evidence. I do think, though, one comment on this, the four in 10 only replicating: there were a few things that came out of that work. One was a kind of acknowledgement that the effect sizes in some of the original studies were just surprisingly large or spurious.

And the second thing is the need for stronger theorising in the first place about how you think these effects are manifesting themselves. What is the theoretical account that you have? And there's been some backlash already, I guess, against these very, very large projects where you may test 7,000 participants in a study where, a priori, from a theoretical perspective, it seems like an unlikely effect. It seems like something you find it hard to believe, so do we really need to invest 40 labs and 7,000 participants to go and test each and every one of these things? Or can we have a more stringent, more nuanced, more mature theoretical understanding of what's going on? But it's difficult, because the data has to drive the theory, so you have to ...

Dave:

And does part of that come down to preregistration and pre-analysis plans? I know there's a lot of talk about actually doing the theory upfront and stating upfront what it is that you're trying to test, an actual hypothesis. Is that part of the solution?

Harry:

Right, so it's worth just going back a step to the sorts of methods that are used in psychology, and the sort of methods used in my team at BETA, which is to run experiments. So this is just like testing a new pharmaceutical drug before it comes onto the market. You'll take a group of people in a trial. You will randomly assign half of them to receive the drug, and randomly assign the other half to receive a placebo. They don't know which one they're getting.

You then track their health outcomes, and because the two groups are close to identical due to the randomisation, you know that any difference in outcomes is attributable to the drug. We can do the same thing with psychology experiments or even, in some cases, with policy experiments. I should just point out, in case anybody's concerned about government experimenting on them, or even just researchers experimenting on them, that there are standard ethics review procedures for these sorts of things. We're not going to be administering electric shocks or anything like that. But the reason why I went through all of that is that if you're designing an experiment, you can step back beforehand and write down all the details about what data you want to collect, what outcomes you want to measure, how many people you need in the trial, how you're going to do the analysis.

You get the opportunity to do all of that upfront, and many researchers were already doing this. I think we need to be careful not to impugn a whole discipline on the back of problems in part of the discipline. But the idea behind preregistration is, first of all, to say, "I'm conducting this study. I'm going to be testing these hypotheses, and regardless of whether those hypotheses are confirmed or not, I'm going to publish the results somewhere." So that's addressing the publication bias concern. The pre-analysis plan goes a step further and sets out, in as much detail as you can manage, exactly how you're going to conduct the analysis. So you're tying your hands as a researcher before you commence the experiment to say, "Here's how I'm going to assess whether my test has confirmed my hypothesis or not." And that at least reduces the scope, if not eliminates the scope, for the questionable research practices we were talking about earlier.
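To make the trial and preregistration logic Harry describes a little more concrete, here is a minimal sketch with entirely hypothetical details (the plan fields, sample size and effect size are assumptions, not BETA's): participants are randomly assigned, and the analysis follows a plan that was fixed before any data existed.

```python
# A minimal, hypothetical sketch of the two ideas Harry describes: random
# assignment in a trial, plus a pre-analysis plan that fixes the analysis
# before any data are collected. Every name and number here is illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 1. The pre-analysis plan: written down (and ideally registered publicly)
#    before the trial starts, so the analysis can't be tweaked afterwards.
PRE_ANALYSIS_PLAN = {
    "hypothesis": "the intervention improves the primary outcome",
    "primary_outcome": "outcome_score",     # the single outcome that counts
    "sample_size": 1_000,                   # fixed in advance: no early stopping
    "test": "two-sided t-test on the primary outcome",
    "alpha": 0.05,                          # significance threshold, fixed upfront
    "publish_regardless_of_result": True,   # the commitment that tackles publication bias
}

# 2. Random assignment: half the participants to treatment, half to control,
#    so the two groups are statistically identical apart from the intervention.
n = PRE_ANALYSIS_PLAN["sample_size"]
assignment = rng.permutation(np.repeat(["treatment", "control"], n // 2))

# 3. Simulated outcomes, with a small assumed benefit (+0.2) from treatment.
outcome_score = rng.normal(0.0, 1.0, n) + 0.2 * (assignment == "treatment")

# 4. The confirmatory analysis: only the test named in the plan. Any further
#    digging in the data would be exploratory and should be labelled as such.
treated = outcome_score[assignment == "treatment"]
control = outcome_score[assignment == "control"]
estimate = treated.mean() - control.mean()
_, p_value = stats.ttest_ind(treated, control)

print(f"Estimated effect: {estimate:.2f}, p-value: {p_value:.3f}, "
      f"significant at the planned alpha: {p_value < PRE_ANALYSIS_PLAN['alpha']}")
```

With an assumed effect of 0.2 and a thousand participants the test will usually, though not always, come out significant; the point of the sketch is simply that every analytical choice was fixed before the data existed.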

Dave:

And the publishing by default is important too, isn't it? Because even if something doesn't turn out the way you think it will, that's still quite useful information for people to have for future experiments, that sort of thing.

Ben:

Absolutely. I mean, I think knowing why something doesn't work is as valuable as knowing that it does work, and you're not going to be able to understand the whys unless you publish. Journals are increasingly recognising that, and there are more and more journals now that accept these registered reports, where initially what you submit to the journal is the introduction: you know, this is what we're going to do, here's the method, and here's our analysis plan. And the journal will accept that for publication irrespective of what the results are going to look like. Which I think is really important.

The one thing that I think is worth discussing here, though, and I completely agree with what Harry's saying about the need to state your hypotheses and to publish irrespective of whether you find evidence in favour or not, is that some of the discussion around preregistration has been about its potential to stifle creativity and stifle exploratory research. And some people turn that around as a way of weaselling out: I don't have to tell you what I think I'm going to find because I'm this creative genius.

I think what's important is that you need to say yes, exploratory research is fundamental. And quite a lot of what we do can be very exploratory. I mean, these are complex questions, complex issues. But once you have a pattern of findings and you think, yes, I've got evidence for that and I'm going to seek to confirm it, that's when you should be doing your preregistration and saying, "Yes, this is now the pattern of effects that I think I've got. Here's my preregistration plan, here's the analysis I'm going to do." And you go forward and do it that way.

Because I think that if you completely push the preregistration thing away and say no, no, I don't need that because I'm being creative and exploratory, then we're back to this problem of people second guessing and post hoc rationalisation, HARKing and all of that. So HARKing is Hypothesising After the Results are Known.

Dave:

Right.

Ben:

Or generating hypotheses after the results are known.

Harry:

Yeah, I think part of what this gets to is the messy process of science. Science is about trying to come up with new knowledge. It's trying to be creative and so there is a lot of exploratory work there. And some would argue that the exploratory work is where a lot of the action is. But then once you've generated a hypothesis which you think fits with a broader theory and which seems plausible based on some patterns you've seen in the data, you then want to confirm it.

And to the greatest extent possible, if we can have a clear demarcation of do some exploratory work first, write down the hypothesis, then do the confirmatory work, we get the best of both worlds. I think in practice it's sometimes messier than that, and you're doing the exploratory work and then seeking to confirm it within the one research programme. And I think that's where some of the neat delineations we're describing become a bit more complex when practitioners actually sit down to try to do this stuff.

Dave:

Yeah, I think it's important to point out as well that most people seem to think the vast majority of scientists and researchers aren't using questionable research practices on purpose; it's just what the norm has been. So this is quite a big shift. And for me as a lay person on this, one of the questions I normally have is: is replication actually part of the solution? So if we have a problem replicating, then do we need to do more of this once papers are published? And I know, Ben, you've done some pretty interesting replication studies as well.

Ben:

Well, yeah. I mean, as I said at the start, replication is fundamental to science. The question is whether we need 7,000 participants in 40 labs to replicate a finding which probably, I don't know, sometimes maybe people didn't really believe in the first place, or which perhaps isn't super important. So there has been a process by which, if you want to conduct one of these registered replications, there's a kind of submission process whereby you say why this effect is important. It might have been very highly cited in the literature or it might be the cornerstone of something.

And so in those cases, I think, yes, replication is the solution. But I think it does have to be combined with this stronger theorising and stronger understanding of the body of knowledge that one's trying to develop and contribute to. And I would pick up on this idea of questionable research practices as the norm: I don't think that's necessarily true. I think there has been a wake-up call, probably for some people in some areas in particular, to be a bit more sensitive to the kinds of things they thought were fine to do. But, like Harry was saying before, it's not the entire discipline.

Dave:

Yeah

Harry:

Can I jump in on that and say it's not just one discipline.

Ben:

No, exactly. I was going to say it's economics as well.

Harry:

So we've been focusing on psychology, but similar concerns have been raised in economics, in cancer biology research, in political science research and in other fields. And so there's a more general question in the social and biomedical sciences about the research practices that have been used. And I think once you take that broader view, you realise that what we're talking about is a series of tools to improve the way the social sciences are being conducted, and you want to pick the tools that are appropriate to the circumstances.

Large chunks of economics can't run experiments the way psychology can. Cancer biology research can run experiments, but they take five years and tens of millions of dollars to run. Running large scale replication projects is often not going to be feasible. Then you look to other things like trying to be more stringent upfront in the process, or trying to be more transparent about the data and methods you used. So there is a challenge there for social scientists and those who work in biomedical fields as well. And I think it's heightening the awareness around the risks of the error rate, the false positive rate, and getting people to think more about what are the possible sources of inflated error rates, of getting these false results, and how can they combat those?

Dave:

Yeah, it's clearly really far-reaching. Not just psychology; it reaches across a lot of what we would've had as assumed knowledge. I'll ask one final question then: given this has been around for a little while now, what does the future look like? How has the response actually been, and what are the prospects for the future?

Ben:

I think it's been a painful process but a positive one, in that there are now much more readily available tools, through things like the Open Science Framework and through journals providing the mechanism for publishing preregistered studies. There's a much stronger move to making all of your materials and data publicly available. And I think the process of self-reflection across the discipline has been one where, yeah, there's been, I think, a step change, a cultural shift actually, in what used to be okay and what is now.

And even just at the level of how people write up their results, there's now a clear demarcation between the analyses that were preregistered and any further exploration of the data that was decided on after the data were in. That demarcation is made in the paper and appropriate conclusions are then drawn, so you can say: this is our confirmatory part, this is our exploratory part, and the exploratory part should be taken with a different degree of confidence than the confirmatory part. You would never have seen that a few years ago. It was just: here's the data, here's our analysis, here's our conclusion. So I think that's a good recognition.

Harry:

I'm going to make two concluding observations. The first one is that, to me, the tone of the discussion has shifted. I think that understandably many were quite defensive when these issues first emerged. Understandably because their work and their livelihoods were being challenged within some academic fields. I think that's shifted. I think the people who are promoting open science, transparency and replication are being much more sensitive to that.

But I think that part of this paradigm shift, as Ted Miguel has called it, I think rightly, has been a greater openness among social scientists to the idea that their work should be transparent, and that means that others will approach them to replicate it. The other comment I was going to make is maybe to pivot back to my role in policy, in public policy. We're having this discussion about what's happening within the scientific community and within academia, but a lot of the same questions apply to policy evaluations. We are not so much trying to find new knowledge; we just want to find policies that work and are going to improve people's livelihoods.

But the same questions apply to us as well: we think this is going to work, here's the theory that underpins it, here's how we're going to gather evidence, we might get it wrong, but we're going to try to get the best evidence we can. In BETA we've been doing our best to adopt these practices around preregistration, pre-analysis plans and publishing our results. But I think it's something that's still relatively new in government.

There are particular challenges around replication within government, because a lot of the data is sensitive or tightly held. And so, as I said before, we then need to look at other practices we can adopt to assure people that the evaluations we're conducting are rigorous and that they can have confidence in the evidence that supports those policy interventions.

Dave:

So a good start but still a bit of work to do. Harry, Ben, thanks for joining me.

Ben:

A pleasure.

Harry:

Thank you.

Ben:

Thanks.

[music]

Dave:

That's it for this episode. Don't forget to follow us on Twitter and read our reports on the website. Until next time, thanks for listening.

[music]