Gerd Gigerenzer: Mindless Statistics

Transcript

0:05

Today I will talk about a strikingly  persistent phenomenon in the social  

0:12

and biomedical sciences: mindless statistics.

0:20

Let me begin with a story. Herbert Simon  is the only person who has received both  

0:27

the Nobel prize in economics, and the  Turing award in computer science. The  

0:32

two highest distinctions in both disciplines.  Shortly before he died Herb sent me a letter  

0:42

in which he mentioned what has frustrated  him almost more than anything else during  

0:52

his scientific career. Significance testing.  Now he wrote "the frustration does not lie in  

1:05

the statistical tests themselves, but in the stubbornness with which social scientists

1:15

hold a misapplication that is consistently denounced by professional statisticians."

1:25

Herbert Simon was not alone. The mathematician  R. Duncan Luce spoke of mindless hypothesis  

1:36

testing in lieu of doing good science. The  experimental psychologist Edwin Boring spoke  

1:48

of a meaningless ordeal of pedantic calculations.  And Paul Meehl, the clinical psychologist and  

2:00

former president of the American Psychological Association,

2:04

called significance testing "one of the worst  things that have ever happened to psychology."

2:14

What is going on? Why these emotions?  What could be wrong with what most  

2:24

psychologists, social scientists,  and biomedical scientists are doing?

2:33

In this talk I will explain what is  going wrong. The institutionalization  

2:40

of a statistical ritual instead of good statistics. I will explain what the ritual

2:52

is. I will explain how it fuels the replication  crisis, how it brings blind spots in the mind  

3:03

of the researchers. And also how it creates  a conflict for researchers, young and old,  

3:13

between doing good science, and doing  everything to get a significant result.

3:22

Let me give an example. A few years ago  I gave a lecture on scientific method  

3:29

and also on the importance of trust and  honesty in science. After I finished,  

3:39

in the discussion section, a student from an Ivy League university stood up and told me "You

3:49

can afford to follow the rules of science.  I can't. I have to publish and get a job.  

4:01

My own advisor tells me to do  anything to get a significant result."

4:12

That's known as 'slicing and  dicing data' or also 'P-hacking'.

4:22

The student is not to blame.  He was honest. But he has to  

4:29

go through a ritual that is  not in the service of science.

4:37

So let me start with the replication crisis.  So every couple of weeks the media proclaims  

4:48

the discovery of a new tumor marker that promises personalized diagnosis or even

4:56

treatment of cancer. And medical research, tumor  research, is even more productive. Every day four  

5:05

to five studies report at least one significant  new marker. Nevertheless, despite this mass of  

5:21

results, few have been replicated and even  fewer have been put into clinical practice.

5:30

When a team of 100 scientists at the biotech company Amgen

5:36

tried to replicate the findings of 53 landmark  studies, they succeeded only with six. When the  

5:49

pharmaceutical company, Bayer, examined  67 projects on oncology, women's health,  

5:59

and cardiovascular medicine, they  were able to replicate only 14.

6:06

So what do you do when your doctor  prescribes you a drug based on randomized  

6:13

trials that showed that it's effective, but then the effect seems to fade away? Now,

6:25

medical research seems to be preoccupied  by producing non-reproducible results.

6:33

Iain Chalmers, one of the founders of the Cochrane Collaboration, and Paul Glasziou, chair of

6:41

the International Society for Evidence-Based Health Care, estimated that 85% of medical

6:53

research is avoidably wasted. And they estimated  a loss of $170 billion every year worldwide.

7:08

The discovery that too many scientific results  

7:15

appear to be false alarms has been  baptized the 'Replication Crisis'.

7:28

In recent years a number of  researchers, often young researchers,  

7:35

have tried to systematically find out how large the problem is. And typically

7:43

the results show that between 1/3 and 2/3  of published findings cannot be replicated.  

7:52

And among those that can be replicated, the effect size is, on average, half of the original.

8:02

So in medical research for instance the  efficacy of anti-depressants plummeted  

8:11

drastically from study to study. And  second generation anti-psychotics that  

8:20

earned Eli Lilly a fortune seem to lose their efficacy when retested.

8:32

It's interesting how the scientific  community reacted. So what would you  

8:38

do if your result that made you famous,  disappears? Some researchers like the  

8:48

psychologist Jonathan Schooler faced the  problem and tried to think about what's  

8:57

the reason. And Jonathan came up with the  idea of 'cosmic habituation'. In his words  

9:10

it was as if nature gave me this great  result and then tried to take it back.

9:21

The New Yorker called this 'The Truth  Wears Off' phenomenon. Others reacted,  

9:30

so other researchers reacted differently, and were not happy with those who tried

9:40

to replicate their studies and failed, and waged personal attacks on them, speaking of, I quote,

9:50

'replication police,' 'shameless little bullies,' 'witch hunts,' or compared them to the Stasi.

10:03

So here we are. At the beginning of the  21st century one of the most cited claims  

10:11

in the social and biomedical sciences was John Ioannidis's: 'Most scientific results are false.'

10:23

In 2017, just to give a hint about the possible political consequences, the

10:31

news website breitbart.com headlined a claim by Wharton School professor Scott Armstrong that,

10:43

I quote, "fewer than 1% of papers in scientific journals follow scientific method." End of quote.

10:57

Now we have seen in this country and in other countries, politicians trying to cut

11:05

down funding of research. And if they would  read more about this there would be more  

11:11

going in this direction. And those who point  out that so many results are not replicable,  

11:23

they face a double problem. They want to save  science, at the same time they run the danger  

11:31

that maybe Donald Trump, someone else,  will use this to cut funding totally down.

11:41

So how did we get there?

11:44

The replication crisis has been blamed on false economic incentives, like

11:52

'publish or perish,' and I want to make a point  today that we need to go beyond the important role  

12:03

of external incentives and focus on an internal  problem that fuels the replication crisis. And  

12:17

this is the fact that good scientific practice has been replaced by a statistical ritual.

12:25

My point is that researchers follow this ritual not because, or not only because, of external

12:35

pressure. No, they have internalized the ritual  and many genuinely believe in it and that can be  

12:48

seen most clearly by the delusions they have  about the P-value, the product of the ritual.

12:58

So statistical methods are not just applied to a science, they can change the entire science. So

13:08

think about parapsychology, which once was the study of messages from the dear departed, and it

13:22

turned into the study of repetitive card guessing. Because that's what the statistical method demanded.

13:35

In a similar way the social sciences  have been changed by the introduction  

13:44

of statistical inference. And typically in  social science, scientists first encountered  

13:53

Sir Ronald Fisher's theories, in particular his 1935 book. He wrote three books. The first was

14:03

too much about agriculture and manure,  and technically too difficult for most  

14:10

social scientists. But the second one was  just right. And it didn't smell anymore.

14:21

And so they started writing textbooks. And then they became aware of a competing theory

14:28

by the Polish statistician, Jerzy Neyman,  and a British statistician, Egon Pearson.

14:38

Fisher had a theory, or at least his null hypothesis testing,

14:44

where he had just one hypothesis. Neyman insisted that you need two. Fisher had the

14:51

P-value computed after the experiment. Neyman and Pearson insisted on specifying everything in advance.

14:59

I'll just give you an idea about the fundamental  differences, and I give you an idea about the  

15:06

flavor of the controversy. Fisher branded Neyman's  theory as 'childish' and 'horrifying for the  

15:15

freedom of the West,' and linked Neyman-Pearson theory to Stalin's five-year plans. Also to

15:26

Americans who cannot distinguish or don't want  to distinguish between making money and doing  

15:33

science. Incidentally Neyman was born in  Russia and moved to Berkeley, in the U.S.

15:42

So Neyman, for his part, responded to  some of Fisher's tests and said "these  

15:51

are in a mathematically specifiable sense, worse than useless." What he meant

15:59

is that the power was smaller than alpha, such as in the famous lady tasting tea test.

16:10

So what do textbook writers do when there are  two different ideas about statistical inference?

16:23

One solution would have been, you present both.  And maybe also Bayes or Tukey, and others,  

16:34

and teach researchers to use their judgment  to develop a sense in what situation it's not  

16:44

working and where it's better to do this. No, that  was not what textbook writers were going for. They  

16:53

created a hybrid theory of statistical inference  that didn't exist and doesn't exist in statistics  

17:01

proper. Taking some parts from Fisher some  parts from Neyman, and adding their own parts,  

17:10

mostly about the idea that scientific  inference must be without any judgment.

17:18

That's what I mean mindless... automatic.

17:24

And the essence of this hybrid  theory is the null ritual.

17:31

The null ritual has three steps. First, set up a null hypothesis of no mean differences,

17:39

or zero correlation. And most important,  

17:44

do not specify your own hypothesis  or theory, nor its predictions.

17:53

Second step. Use 5% as a convention for rejecting the null hypothesis. If

18:01

the test is significant, claim victory for your hypothesis,

18:06

that you have never specified. And report the test result as P smaller than 5%,

18:18

or 1%, or 0.1%, whichever level is met by your results.

18:28

And the third step is a unique step. It  says always perform this procedure. Period.

18:41

Now neither Fisher, nor to be sure, Neyman-Pearson  would have approved of this procedure,  

18:50

and Fisher for instance said 'no scientific  researcher will ever have the same level of  

18:56

significance from experiment to experiment.' Rather, he will give each case his thought. Neyman and

19:03

Pearson also emphasized the role of judgment. And if the two fighting camps agreed on

19:12

one thing, it was that scientific inference cannot be mechanical. You need to use your brain.

19:21

And that was exactly the message the  null ritual did not convey. Namely  

19:29

it wanted a mechanical procedure by which we can measure the quality of an article.

19:38

Now what did the poor readers of these textbooks do with a mishmash of two

19:45

theories, where it was not mentioned that it is a mishmash, nor were the names of Neyman

19:54

and Pearson attached to the theories? So the result was that the external conflict between

20:04

the two groups of statisticians went into an  internal conflict in the average researcher.

20:13

I use a Freudian analogy to make that clear. So  the super ego was Neyman-Pearson theory. So the  

20:20

average researcher somehow believed that he or she had to have two hypotheses and actually

20:29

give thought to alpha and the power before the experiment and calculate the number of subjects

20:34

you need. But the ego, the Fisherian part, got things done and published. But it was left with a

20:45

feeling of guilt for having violated the rules. And at the bottom was the Bayesian Id,

20:54

longing for probabilities of hypotheses, which neither of these two theories could deliver.

21:05

How did all this come about? So how could this happen?

21:12

I'll give you another story. I once visited  a distinguished statistical textbook writer  

21:20

whose book went through many editions and whose  name doesn't matter. He was one of the only ones,  

21:30

actually his book was one of the best ones in  the social sciences, and he was the only one  

21:36

who had in an early edition, a chapter on  Bayes, and also, albeit only one sentence,  

21:45

mentioning that there is a theory of Fisher  and a different one of Pearson. Neyman-Pearson.

21:56

So to mention the existence of alternative  theories was unheard of, and even names  

22:05

attached to that. So I asked him why he took  out the chapter on Bayes and this one sentence,  

22:17

from all further editions. When I met him he was just busy with, I think it

22:24

was, the fifth edition of his bestselling book.

22:29

And why, I asked him, he had created an inconsistent hybrid that every decent statistician would have

22:42

rejected. To his credit, I should say that  he also did not attempt to deny that he had  

22:52

produced an illusion. But he let me know whom  to blame for it, and there were three culprits.

23:01

First, his fellow researchers. Then,  

23:05

the university administration.  And third, his publisher.

23:11

His fellow researchers, he said, are not  interested in doing good statistics. They  

23:17

want their papers published. The university  administration promoted people by the number  

23:24

of publications, which reinforced the researchers' attitude. And his publisher

23:35

did not want to hear about different theories. He  wanted a single recipe cookbook and forced him,  

23:47

so the author told me, to take out the Bayesian chapter, and even this one

23:54

single sentence about Fisher's and Neyman and Pearson's theories.

24:03

At the end of our conversation I asked him in  what statistical theory he himself believes.  

24:16

And he said, "Deep in my heart I'm a Bayesian." Now if the author was telling me the truth,

24:30

he had sold his heart for multiple editions of a  famous book whose message he did not believe in.

24:42

Tens of thousands of students have read this text believing that it reveals the method of science,

24:50

dozens of less informed textbook writers  copied his text, churning out a flood  

24:59

of offspring textbooks inconsistent,  and not noticing the mess.

25:08

I have used the term 'ritual' for this  procedure for the essence of the hybrid  

25:15

logic because it resembles social rites. Social rites typically have the following

25:25

elements. There are sacred numbers or colors.  Then, there's a repetition of the same action,  

25:33

again and again. And then there's fear. Fear  about being punished if you don't repeat these  

25:41

actions. And finally delusions. You have to  have delusions in order to conduct the ritual.

25:51

The null ritual contains all of these features.

25:55

There's a fixation on 5%. And in functional MRI  it's colors. Second, there's repetitive behavior  

26:10

resembling compulsive handwashing. And third,  there's fear of sanctions by editors or advisers.  

26:21

And finally, there's a delusion about what a P-value means, and I'll describe that in a moment.

26:30

Let me just give you a few examples  about the mindless performance of the  

26:36

ritual. They may be funny, but deep down it's really disconcerting.

26:46

So in an internet study on  implicit theories of moral courage,  

26:53

Philip Zimbardo, who is famous for his Stanford prison experiment, and two colleagues,

27:02

asked their participants "do you feel that there  is a difference between altruism and heroism?"

27:15

Most felt so. 2,347 respondents said  'yes' and only 58 said 'no.' Now the  

27:26

authors computed a Chi-Squared test  to answer whether these two numbers  

27:32

are the same or different. And they found that they are indeed different.
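
For readers who want to see how little this test adds, here is a minimal sketch of the calculation. It assumes the authors ran a goodness-of-fit test against an equal split between 'yes' and 'no' (the talk does not say which variant was used), and the function names are hypothetical.

```python
import math

def chi_square_equal_split(counts):
    """Goodness-of-fit statistic against equal expected frequencies (an assumption)."""
    total = sum(counts)
    expected = total / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def p_value_df1(chi2):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom."""
    # For df = 1, P(X > c) = P(|Z| > sqrt(c)) = erfc(sqrt(c / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

yes, no = 2347, 58  # counts quoted in the talk
chi2 = chi_square_equal_split([yes, no])
p = p_value_df1(chi2)
print(f"chi-squared = {chi2:.1f}, p = {p:.3g}")
```

The statistic comes out in the thousands and the p-value is too small to represent as a floating-point number, which only restates what the raw counts already show.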

27:43

A graduate student of mine, a smart one,  had the opposite situation. His name is  

27:52

Pogo. Pogo ran an experiment with two groups and found that the two means were exactly the

28:00

same, but Pogo could not just write that or say that. He felt he had to do a statistical test,

28:10

a t-test, to find out whether the two exactly identical numbers differ significantly,

28:16

and he found out they don't. And  the P-value was impressively high.

28:26

Here's the third illustration. I  recently reviewed an article in which  

28:35

the number of subjects was 57. The authors  calculated a 95% confidence interval for  

28:50

the number of subjects and concluded the  confidence interval is between 47 and 67.

29:01

Don't ask why they did it.  It's mindless statistics.

29:04

Almost every number in this paper was  scrutinized in the same way. The only  

29:12

numbers in the paper that had no  confidence were the page numbers.

29:21

This is an extreme case but unfortunately it's not  

29:26

the exception. Consider all behavioral, neuropsychological, and medical studies

29:32

in Nature in the year 2011. 89% of  the studies report only P-values and  

29:44

nothing else that's of importance. Such as no  effect sizes. No power. No model estimates.

29:53

Or an analysis of the Academy  of Management Journal,  

29:59

a year later, reported that the average number of P-values in an article is...

30:07

guess how many P-values. If you have two  hypotheses you would need two. No. 99.

30:19

Yeah it's a mechanical testing of any number.  So the idol of automatic universal inference,  

30:30

however is not unique to P-values, or confidence  intervals. Dennis Lindley, a leading advocate of  

30:40

Bayesian statistics, once declared that the only  good statistic is Bayesian statistics and that  

30:50

Bayesian methods are even more automatic, in his opinion, than Fisher's own methods.

31:00

So the danger is here. You don't have  much progress if you use Bayes factors  

31:06

just as mindlessly. So let me go on. The  examples about mindless use sound funny,  

31:15

but there are deep costs of the  ritual. I'll give you a few examples.

31:20

Maybe the first and quite  interesting one is that you  

31:23

actually fare better if you don't specify your own hypothesis. Why?

31:32

Paul Meehl once pointed out a methodological  paradox in physics. Improvement of experimental  

31:43

measurement and amount of data, make it  harder for a theory to pass. Because you  

31:49

can more easily distinguish between  the prediction of the theory and the  

31:54

actual data. In fields that rely on  the null ritual it's the opposite.

32:04

So, because it tests a null in which  you don't believe. So improvements  

32:14

of measurement make it easier to detect the difference between the data and the null,

32:22

and that means it is easier to reject the null and you can claim victory for your hypothesis. And

32:31

you can just imagine that it's another factor that leads to the irreplicability of results.
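
A minimal sketch of this point, under stated assumptions: two groups of equal size, a normal approximation, and a hypothetical standardized difference of 0.02 that is practically negligible. The function name and the sample sizes are illustrative, not taken from the talk.

```python
import math

def p_at_expected_z(d, n_per_group):
    """Two-sided p-value at the expected z-score for a standardized difference d."""
    # Expected z of a two-sample comparison with unit variance in both groups.
    z = d * math.sqrt(n_per_group / 2.0)
    return math.erfc(abs(z) / math.sqrt(2.0))

tiny_effect = 0.02  # hypothetical, practically negligible difference
p_by_n = {n: p_at_expected_z(tiny_effect, n) for n in (100, 10_000, 1_000_000)}
for n, p in p_by_n.items():
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

With enough data even this negligible difference rejects the null, whereas a theory that makes a precise prediction faces a harder test as measurement improves.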

32:40

Second point, and I now get to the delusions about this sacred object,

32:50

the P-value. Now a P-value is the probability  of a result, or a more extreme one, if the null  

33:01

hypothesis is correct. And more technically  correct, it is the probability of a test  

33:07

statistic given an entire model. But the point  is it's a probability of data given hypothesis,  

33:20

not the Bayesian probability. And that should be  easy to understand for any academic researcher.

33:31

And also it's not the probability that  you will be able to replicate it. And  

33:41

the replication delusion, that's the first  one, is that when you have a P-value of 1%,  

33:52

then logically it follows that the probability  you can replicate your result is 99%. Clear?

34:00

So this illusion has been already told in the  book that Greg pointed out by Nunnally. It's  

34:10

great reading. If you want to really have fun, read the old textbooks about statistics.

34:17

They're all not written by statisticians  otherwise they wouldn't have been used.

34:24

So for instance Nunnally writes, quote, "what does a P-value of 5% mean?" His answer:

34:34

"The investigator can be confident with odds of  95 out of 100 that the observed difference will  

34:42

hold up in further investigations."  That's the replicability delusion.
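
To see why a small P-value does not translate into a replication probability, here is a rough sketch. It makes a generous assumption that is not in the talk: the true effect is exactly as large as the one observed, and the replication uses the same design and sample size. The function name is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def replication_probability(p_original, alpha):
    """Chance that an identical study is significant in the same direction,
    assuming the true effect equals the originally observed one."""
    z_obs = STD_NORMAL.inv_cdf(1.0 - p_original / 2.0)  # z behind the original two-sided p
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)      # two-sided cutoff for the replication
    return 1.0 - STD_NORMAL.cdf(z_crit - z_obs)

prob_at_5_percent = replication_probability(0.01, alpha=0.05)
prob_at_1_percent = replication_probability(0.01, alpha=0.01)
print(f"P(replication reaches p < 0.05) = {prob_at_5_percent:.2f}")
print(f"P(replication reaches p < 0.01) = {prob_at_1_percent:.2f}")
```

Even under this favorable assumption the chance of a significant replication stays well below 99%, and the chance of matching the original 1% level is only about one half.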

34:51

So I was curious, what is the state today?

35:00

Do academic researchers understand what a P-value means? So the object they are looking for.

35:12

So I surveyed all studies available in six  different countries with a total of over  

35:20

800 academic psychologists, and about a thousand students, and they were all asked "what does the

35:27

P-value of 1% mean?" How many of these professors  and students cherish the replicability illusion.

35:42

So in this example there were 115 persons who  taught statistics. But you should know that  

35:54

in the social sciences, mostly, and certainly in psychology, those who teach statistics

36:00

are not statisticians. For the same reason,  because they would notice what's going on.

36:07

So what do you think? What proportion of  115 statistics teachers, that's across  

36:15

six countries, fall prey to the illusion of replicability? It should be zero. It is 20%.

36:26

Then we have looked at the professors,  and that's over 700 in this study. In  

36:32

the professors it's 39% who believe in the  replicability illusion. So almost double.

36:42

And among the poor students it is  66%. They have inherited the delusion.

36:52

And note that this is another reason why the replication crisis was not

37:00

noticed until recently: because of this illusion. I have a result,

37:06

it has a P-value of 1%, I can be almost sure it can be replicated.

37:13

Now it's not the only delusion that is shared by  the academic researchers. So the next delusion is  

37:25

that the P-value, which is the probability of data given a hypothesis, tells you about the

37:33

probability of the hypothesis given the data.  And the majority of academic psychologists in  

37:44

six countries, and in every study, shares at least one, and typically several, of these illusions.

37:55

Including that the P-value of 1% tells you that the probability that the null hypothesis

38:02

is true is also 1%, or that the alternative  hypothesis is true is 99%. And so it goes on.

38:13

This is a remarkable state of the art  of doing science. Remarkable because  

38:23

every one of these academic researchers understands the probability of A, given B,

38:29

is not the probability of B, given A. But  within the ritual thinking is blocked.
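
The gap between the two conditional probabilities can be made concrete with Bayes' rule. This is a sketch with hypothetical inputs: the prior probabilities of the null and the power value are illustrative assumptions, and it conditions on "significant at level alpha" rather than on an exact P-value.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(null is true | test is significant), by Bayes' rule."""
    false_alarms = alpha * prior_null        # null true, test significant anyway
    true_hits = power * (1.0 - prior_null)   # effect real and detected
    return false_alarms / (false_alarms + true_hits)

# Hypothetical scenarios: the same alpha and power, different prior plausibility.
even_odds = prob_null_given_significant(prior_null=0.5, alpha=0.05, power=0.5)
long_shot = prob_null_given_significant(prior_null=0.9, alpha=0.05, power=0.5)
print(f"prior P(null) = 0.5 -> P(null | significant) = {even_odds:.3f}")
print(f"prior P(null) = 0.9 -> P(null | significant) = {long_shot:.3f}")
```

The same significance level yields very different probabilities for the null depending on the prior and the power, which is why the P-value alone cannot deliver them.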

38:38

So another blind spot and cost is obviously effect size. There is no effect size in the

38:45

ritual. There's an effect size in Neyman and Pearson, yeah. But that's not being told.

38:52

McCloskey and Ziliak asked the question "do  economists distinguish between significance,  

39:04

and so statistical significance,  and economic significance?"

39:11

And they looked at the papers in one of the top journals, The American Economic Review,

39:18

and of 182 papers, 70% did not make a  distinction. And what Ziliak and McCloskey did,  

39:32

they published the names of those who got most  confused, including a number of Nobel laureates.  

39:42

10 years later they repeated that, assuming that  everyone must have read that and people are now  

39:49

more reasonable, but the 70% who confused it,  the number didn't go down, it went up to 82.

40:00

Similarly, there's a blind spot for statistical power. There is

40:04

no power in the null ritual. And  power means the probability that  

40:10

you find an effect if there is one. And  that should be 80%, 90%. Better higher.
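
As a rough illustration of what such a power figure depends on, the sketch below approximates the power of a two-sided, two-group comparison with a normal approximation. The effect size d = 0.5 is Cohen's conventional "medium" value; the sample sizes and the function name are hypothetical choices, not figures from the talk.

```python
import math
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-sided, two-group test for effect size d."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    z_effect = d * math.sqrt(n_per_group / 2.0)  # expected z under the alternative
    # Probability of landing beyond either critical value.
    return (1.0 - std_normal.cdf(z_crit - z_effect)) + std_normal.cdf(-z_crit - z_effect)

power_small_sample = approx_power(0.5, n_per_group=20)
power_large_sample = approx_power(0.5, n_per_group=64)
print(f"n = 20 per group: power = {power_small_sample:.2f}")
print(f"n = 64 per group: power = {power_large_sample:.2f}")
```

Under these assumptions roughly 64 participants per group are needed to reach about 80% power for a medium effect, while 20 per group leave it near one third.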

40:19

The psychologist Jacob Cohen was the first  one who systematically studied the power of  

40:25

a major clinical journal, and he found that the  average power for a medium effect was 46%. That  

40:38

wasn't much. Now, Peter Sedlmeier and I, 25 years  later, which should be a time that things change,  

40:51

looked and analyzed the power in the same  journal: Before it was 46, it went down  

41:00

to 37. Why? Because many researchers now did  Alpha adjustment which decreases the power.

41:12

And notice what that means. If you set up an experiment that has only a power of, say, 30%

41:31

to detect an effect, if there's one, you could do better and much

41:39

cheaper by throwing a coin, and you would have a power of 50%. Clear?

41:52

And you could spare all this effort, and  what I've told you now are even the better  

42:02

results. Studies in Neuroscience, so for  instance, studies about Alzheimer's disease,  

42:10

genetics, cancer biomarkers, the median  power of more than 700 studies is 21%.

42:20

In functional MRI studies only 8%. And a recent study that has looked at 368 research areas in

42:34

economics and analyzed the 31 leading journals,  found a median statistical power, again calculated  

42:46

for an alpha of 5% and a median effect size in  the area, what would you guess? Economists, 7%.

42:58

Then they looked on the top five,  so economics has a hierarchy,  

43:03

the top five. What's there? What do you think?  7% in the hoi polloi. In the top five only 5%.

43:16

Low statistical power is another  reason for failure for replication,  

43:21

and the interesting thing  is it's not being noticed.

43:25

There's a recent study by Paul Smaldino and Richard McElreath, who looked at 60 years of

43:33

power research in the behavioral sciences. They  took every study that has referenced our study,  

43:44

there with Peter Sedlmeier.  So in order to get studies,  

43:52

a large number of studies, and  they found consistently low power  

43:59

and it's not progressing. And one of the  reasons is blindness in the null ritual.

44:09

Let me get to a final point about the costs.  It is the moral problem. So science is based  

44:18

on trust and the honesty of researchers.  Otherwise we can't do this. And this  

44:27

statistical ritual creates a conflict between following the scientific morals, or trying to

44:39

do everything to get a significant result, even if it's a false one.

44:47

And that's called 'p-hacking'. That's called  'borderline cheating.' Borderline cheating  

44:54

because you don't really invent your data but you  slice and you calculate maybe in this slicing,  

45:04

or in that slicing, as long as you find something. And borderline cheating includes: You do not

45:11

report all the studies you have run but only the one where it's significant. You do not report all

45:17

the dependent measures you have looked at but only the significant ones. You do not report all the independent

45:23

measures, and maybe if your P-value ends up at 5.4%, you round it down to slightly under five.

45:36

And if you analyze the distribution of P-values, you see exactly that:

45:41

those slightly above are missing, and there are too many that are a bit lower.

45:48

So a study by John, Loewenstein, and Prelec,  

45:51

with over 2,000 academic psychologists found that the vast majority admitted

45:59

that they have done at least one of these questionable research practices that amount

46:06

to cheating. And when they were asked whether their peers do it, the numbers were even higher.

46:18

So let me come to my end, and ask what can we do about all this?

46:27

And the simplest answer would be we  need to start, or we need to foster  

46:36

statistical thinking, not rituals. In response to the crisis, there have been proposals made. The

46:45

American Statistical Association has made a number  of statements which I think were not very helpful.

46:57

A group of 50 researchers, all luminaries,  

47:00

has recommended a solution, namely to  change P equal or smaller than 5% to P  

47:10

smaller than 0.5%. That makes it harder  but it doesn't even address the problem.

47:22

What will happen? There will be more intensive  

47:27

P-hacking because you have to  work harder to get this. Right?

47:33

I think we have to make more fundamental  changes and I can here only sketch that.  

47:40

First we need to finally realize that we  should test our own hypothesis, not a null.

47:50

Second, we need to realize that the business is about minimizing the real error in our

47:58

measurement. Not taking the error and  dividing it by the square root of N.
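
The quantity alluded to here is presumably the standard error of the mean. The symbols below are assumptions for illustration: s is the sample standard deviation and N the number of observations.

```latex
\[
  \mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{N}}
\]
```

The formula shows two routes to a small standard error: reduce s through more careful measurement, or simply increase N. The ritual rewards the second route, while the point being made is about the first.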

48:08

So this is a key disease and here's another story. I once was a visiting scholar at Harvard and it

48:18

happened that I had my room, my office next  to B.F. Skinner's office. So B.F. Skinner was  

48:26

once the most well-known and controversial psychologist. And at that time he was quite old,

48:35

and his star was going down because of criticism. And he felt a little bit lonely,

48:42

that was my impression, so  we had lots of time to talk.

48:46

And I asked him about his attitude to statistical  testing. It turned out that he had obviously no  

48:55

recognizable training in statistics but he had a good intuition. He admitted

49:03

that he once tried to run 24 rats at the same time. He said "it doesn't work because

49:11

you can't keep them at the same level of deprivation and you increase the error."

49:18

And he had the right intuition.  It's the same intuition as Gosset,  

49:25

the man who, under the pseudonym Student, developed the t-test and said "a significant

49:31

result by itself is useless. You  need to minimize the real error."

49:37

Skinner told me this story that when he gave an  address to the American Psychological Association,  

49:47

and after having reported about  one rat, he said "according to  

49:54

the new rules of the Society I will now report on the second rat."

50:02

So he understood that part.

50:07

And a third move, besides really  taking care of what you measure,  

50:14

and in many physics experiments weeks and months  are spent on trying to get clear measurements. In  

50:22

the social sciences it's often Amazon Mechanical Turk workers, who somehow answer questions for

50:32

little money, in short time, and they do  best if they don't really pay attention.

50:39

So the third point would be remember that  statistics is a toolbox. There is no single  

50:48

statistical inference method that is the best  in every situation. One often needs to tell  

50:56

this to my dear Bayesian friends. Bayes is a great  system, but it also doesn't help you everywhere.

51:06

And universities should start  teaching the toolbox and not a ritual.

51:13

And editors are very important in  this business. They should make  

51:19

a distinction between research that finds a  hypothesis and one that tests the hypothesis.  

51:27

So that young scientists don't have  to cheat anymore and pretend they  

51:32

would have the hypothesis that they  got after the data, already before.

51:37

Second, editors should require when inferences  are made to state the population to which the  

51:46

inference refers. People or situations. In many applications of statistics there is no population.

51:54

There is no random sample. Why do we do the  inference, and to what population? Unclear.

52:01

Third, editors should require competitive  testing and not null hypothesis testing.

52:10

And finally, in my opinion one important signal  would be that editors should no longer accept a  

52:22

manuscript that reports results as significant  or not significant. There's no point to make  

52:31

this division, and it's exactly the problem that then people try to cheat, or fail to have this.

52:39

If you want to report P-values, fine, but report them as exact P-values. That  

52:44

was what Fisher in his third book, in the  1950s, always said. Fisher rejected the  

52:51

idea of having a criterion because that is what he meant, five-year plans.
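
The advice to report exact P-values rather than a significant/not-significant verdict can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `exact_binomial_p`, the choice of a two-sided exact binomial test, and the example counts (15 successes in 20 trials) are hypothetical and do not come from the talk.

```python
from math import comb


def exact_binomial_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Two-sided exact binomial P-value.

    Sums the null probabilities of every outcome that is no more
    likely than the observed one (hypothetical helper for illustration).
    """
    pmf = [
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(trials + 1)
    ]
    observed = pmf[successes]
    # The small relative tolerance guards against floating-point ties.
    total = sum(p for p in pmf if p <= observed * (1 + 1e-9))
    return min(1.0, total)


# Illustrative data only: 15 successes in 20 trials, null value 0.5.
p_value = exact_binomial_p(successes=15, trials=20)

# Report the number itself, not a verdict relative to a threshold.
print(f"exact two-sided P-value: {p_value:.4f}")
```

The report ends with the number; whether it is small enough to matter is left to the reader's judgment and to the context of the study.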

53:01

And at the end, I want to put this in  a larger context. The null ritual is  

53:09

part of a larger structural problem  that we have in the sciences. And  

53:17

the problem is that quality is more  and more being replaced by quantity.

53:24

As the Nobel laureate in physics, Peter Higgs, said:

53:33

"Today," he said, "I wouldn't get  an academic job. It's as simple as  

53:39

that. I don't think I would be  regarded as productive enough."

53:46

And we have come into an understanding of  science, that science means producing as many  

53:55

papers as possible. That means you have less time to think. And thinking is hard. Writing is easy.

54:07

And one driver of this change from  quality to quantity, are the university  

54:18

administrators who count rather than read, when deciding on promotion or tenure.

54:30

A second driver is the scientific publication  industry. Sorry, the scientific publishing  

54:39

industry, that misuses the infinite capacity of online publication to make researchers  

54:50

publish more and more, in more and more special issues, and in more and more journals.

54:58

This development towards quantity  instead of quality is further fueled  

55:07

by the so-called predatory journals that  emerged in the last 20 years. Predatory  

55:14

journals are journals that obviously exist only for collecting a few thousand bucks from you,  

55:21

for publishing your paper with no noticeable review system. We know cases where reviewers,  

55:30

these are often serious scientists who somehow  do not notice what's going on, and then reject  

55:37

the paper, and they were told clearly that it's  not in the interest of the publishing company.

55:46

And most recently we face a new problem. Namely the  

55:56

industry-like systematic production of fake articles with the use of AI by so-called paper mills.  

56:08

A paper mill, so assume you are, and it's mostly in the biomedical sciences, in genetics.  

56:18

Assume you work in a hospital, you're a doctor,  and you need an article in a good journal to be  

56:28

promoted. And somehow it doesn't work. So a paper mill offers you, it's called  

56:37

assistant services. A paper mill offers you, for $10,000 or $20,000, to write an article that's  

56:46

actually faked from the beginning to the end, including maybe Western blots that are faked,  

56:54

and they can guarantee significance,  and they can guarantee publication.

57:04

And why? Because more and more in the last years  they bribe editors from journals to publish the  

57:14

papers they send them, and pay them. A colleague of mine, who is an editor of a medical  

57:21

journal, got an offer from a paper mill in China: for every article you publish we pay  

57:30

you $11,000 multiplied by your impact factor. And  we will help you to increase your impact factor.

57:41

So, talking about broken science. That's the  future. So let me finish. I think the larger  

57:51

goals are that scientific organizations like the Royal Society of London should take  

57:59

back control of publishing. Out of the hands of commercial for-profit publishers. That can  

58:09

be done. For instance it happened last year.  The top journal in neuroscience, NeuroImage,  

58:19

42 editors of NeuroImage stepped down. Resigned because of what they called  

58:26

'greed of Elsevier,' and founded a new journal  called Imaging Neuroscience. And made a call  

58:34

to the entire scientific community, submit  your papers to the nonprofit journals and  

58:41

no longer support the exploitation of  reviewers, of writers, of editors, by  

58:49

these big publishing companies who in some years make more profit than the pharmaceutical industry.

58:58

So that's one way. And a second important  conclusion is universities need to be restored  

59:07

as intellectual institutions, rather than run as if they were corporations.

59:15

We need to publish fewer  articles and better science.  

59:21

We need statistical thinking. Not rituals.

59:27

Thank you for your attention.

59:34

[Applause]

59:41

Thank you so much.

59:49

Questions?

59:49

[Haavi Morreim] [I've got one quick.  

59:51

You've outlined this, the myriad of factors contributing to the replication crisis. Could I  

59:58

pile on by adding one more? Grant giving  entities, whether government or private,  

60:05

aren't interested in doing this. They're not paying you to do what somebody else already did.  

60:12

What's already been shown. They want you to plow  new ground. Show us something new. Whatever.

60:17

There's no money in replicating. Or unless there  is some specific reason to question what's,  

60:24

quote unquote, "already been shown." And I live inside an academic Medical  

60:30

Center and you can publish all you  want but if you didn't bring cash,  

60:35

okay, grants and overhead and all that  kind of thing. I love your ideas. Who's  

60:41

going to pay for it. The financing system  needs to be a part of the resolution.]

60:47

Yeah you're totally right. And there are more  factors. And what we can do is from below,  

60:55

from the ground, and stand up as scientists and  use our own values to change the system. And the  

61:07

system had worked for some time before. And we are in danger of letting it go into the hands of commerce.

61:20

[Jay Couey] [Yeah,  

61:21

I thought what you said about the  two different kinds of science,  

61:28

like whether you're testing a hypothesis or  defining it, I think Neuroscience is very  

61:34

much trapped in this idea that they're testing a hypothesis, yet they're still trying to formulate one.  

61:41

And their null hypothesis, not they don't  understand. That's the problem. I think  

61:47

something you can add, the null hypothesis of your  assumption, or it might be this but it's not.]

61:57

Yeah, you're totally right. And the response  to what you're saying is often "yeah,  

62:03

it's too difficult. We cannot  make precise predictions."

62:08

My response to that response is, it's not too  difficult, but the reliance of neuroscience is  

62:16

mostly on the null ritual. That invites you not to think about theory. Real theory. And about precise  

62:27

hypotheses. And then you just continue this state of ignorance. But you get your papers published.

62:40

[Malcolm Kendrick] [Can I just make one point,  

62:42

which is in the medical world if you do  a study on say statin versus a placebo,  

62:48

and you prove statistical whatever, rhubarb, you're never allowed to do that study again, ever,  

62:55

because you proved that the drug is better than placebo. You can never have another placebo arm,  

63:00

ever. And I've been thinking about this. I don't  know what the answer is but it means that once  

63:04

you've proven your drug works you can never do any replication or reproducing of that study ever again.]

63:14

Are we getting together the pieces for Broken Science? Here's another one.

63:21

[Peter Coles] [Yeah, I just I mean I agree  

63:23

with everything you said about the confusion in  academics about whether they're talking about the  

63:30

probability of the data given the model, or the  probability of the model given the data. So given  

63:35

the level of confusion that exists within the  research community it's not surprising that when  

63:41

it comes to the public understanding of science  it's a complete nightmare, because non-specialist  

63:47

journalists garble the meaning of the results even more than the academics do, probably.

63:54

So we end up with very misleading  press coverage about what results  

63:59

actually mean— what has been discovered,  what has not been discovered—and that's  

64:03

really damaging for public trust in science, and  also obviously, has implications for you know,  

64:10

kind of political influence on science  as well. If they see the public trust in  

64:15

science disappearing, and it's all part  of this problem that scientists and the  

64:25

media do not communicate their ideas  clearly enough for people to understand.

64:34

What I would say just to conclude is that very, very often in science the only reasonable answer  

64:40

to a journalist's question is "I don't know,"  because in many, many situations we really don't  

64:47

know the answer. But you don't get on in your  career if you keep answering questions like that,  

64:51

even though it's true that you don't really know  the answer. The journalists will push you into  

64:56

saying "oh yes I prove this is true because  my P-values." So there's a much wider...]
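
The distinction raised in the question above, between the probability of the data given the model and the probability of the model given the data, can be made concrete with Bayes' rule. This is a minimal sketch: the function name and the numbers (a 10% prior that the effect is real, 50% power, a 5% false-positive rate) are assumptions chosen for illustration, not figures from the talk.

```python
def prob_effect_given_significant(prior: float, power: float, alpha: float) -> float:
    """P(effect is real | significant result), by Bayes' rule.

    prior: assumed share of tested hypotheses that are actually true
    power: P(significant result | effect is real)
    alpha: P(significant result | no effect)
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)


# Illustrative assumptions only.
alpha = 0.05
posterior = prob_effect_given_significant(prior=0.10, power=0.50, alpha=alpha)

print(f"P(significant | no effect) = {alpha:.2f}")
print(f"P(no effect | significant) = {1 - posterior:.2f}")
```

Under these assumed inputs the two printed probabilities differ by roughly a factor of ten, which is why reading a P-value as the probability that the hypothesis is false is a mistake.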

65:05

One solution to this would be systematic  programs to train journalists. A few of  

65:11

them exist. I have been participating in them, but it really needs to be on a broader basis,  

65:17

and it's becoming more and more  difficult because journalism is  

65:21

more... less investigative. But  more from one day to another one.

65:27

[Peter Coles] [Well the main media  

65:28

outlets are sacking journalists essentially.]

65:32

Competition through social media  is tough. And at the end we may at  

65:40

some point face a situation where it's  no longer clear what is truth and what  

65:44

is fake. And we need to be prepared, and  we need to prepare the public for that.

65:53

[Yeah okay. I do have a question. So you  talked about the idea of we need to have  

65:59

fewer papers. Fewer better papers. And  just looking at the idea of there's so  

66:07

many papers that are not reproducible, to me suggests that there's actually no  

66:12

science there. Okay. So is there enough new  science to write enough papers to go around?]

66:24

Yes. The moment you stop: Predatory journals -  they need papers. And also you stop publishers  

66:34

like Frontiers, which between 2019 and 2021 doubled the number of papers they put out.

66:44

So your question "are there enough papers for the  

66:51

for-profit publishers that we have now?" For most of them there are never enough papers.

66:56

[Yes. So I'm actually saying  a slightly different thing.]

66:58

Yeah. [If I'm  

66:59

a researcher are there actually enough ideas to  write enough papers to like do anything useful?]

67:09

Yeah that's a good question. Fair  question. One answer would be if  

67:13

you have no ideas you shouldn't be a  researcher. Do something different.

67:19

And also we need to see... so university departments need to be careful in  

67:25

hiring but let people have time to really  develop some ideas, and take the risk. That  

67:34

there are few. But we don't profit from the  mass production of average things. Yeah. Yeah.

67:41

[Following up on his question. I had a teacher in graduate school who had a  

67:47

prophylactic solution or contribution to this  debate. He thought that instead of encouraging  

67:55

teachers to publish, we should discourage them,  

67:59

or at least we should give them an incentive to  think very carefully before they published. He  

68:06

proposed docking them $1,000 for every article  they published and $5,000 for every book.]

68:13

[Of course that was quite a while ago  so you would have to increase those  

68:16

numbers to account for inflation. I think  there's something to be said for that.]

68:21

So students would have to pay?

68:24

[So the teacher would have to pay. In other words it would encourage him to  

68:28

really think he had something to say. Because it would have to be worth $1,000 to publish this  

68:33

article or $5,000 to publish the book. He the  teacher... his salary let's say would go down.]

68:42

[Graciano Rubio] [Would you say that  

68:46

there's value in using the P-values as a way to  allocate resources for studies that deserve to  

68:53

be repeated and away from using those resources on new research and publishing more papers? So in  

69:01

regard to the charity and foundation for the philanthropic events, there's always limited  

69:07

resources. You only have so much time and money.  So if we accept that the P-value is not sufficient  

69:14

for validation could you use the P-value as a  way to say these are the studies which deserve  

69:20

to be repeated. So that we can find results that are predictable and away from  

69:25

going out and trying to find something new  and just focusing on the volume of papers.]

69:30

Yeah. So as I understand your question, there's a conflict between replication and finding  

69:39

something new. Yeah. Certainly yes. But I think  there are enough researchers who might focus on  

69:49

replication. For instance —it could be an  answer to your question— those who think  

69:53

that they, at least at the moment, have no new ideas... do the replication. Police. And others to find  

70:02

new ideas. But new ideas are hard to come by. We need to be patient with them. And most of us,  

70:11

if you have one great idea in your  life that's already above average.

70:21

[Peter Coles] [Am I allowed a second one?  

70:24

It occurred to me a long time ago that part of the  pathology of the academic system is actually the  

70:32

paper itself. The idea of a paper. Science has  become synonymous with writing papers. Whereas  

70:41

if you look at the tech, the world we have now,  digital publication. We don't have to write these  

70:47

tiny little quanta of papers and get publications. You don't communicate science effectively that way  

70:53

anymore. It's a kind of 18th century idea that  you communicate by these written papers. You  

70:59

could have living documents for example, which are  gradually updated as you repeat. But of course the  

71:06

current system does not allow a graduate student  to get promoted, to get advanced in that way.  

71:13

But I think by focusing entirely on papers we're  really corrupting the system as well. It's not so  

71:20

much an issue about who publishes them, although  it's a serious one as you said. It's the fact that  

71:27

we're fixed in this mindset that we have these  little articles that we have to communicate only  

71:33

by these articles. And that's forcing science  into boxes which is not really helpful.]

71:38

We can talk about this over lunch, but  I'm still with papers. I think papers,  

71:46

and also books, or patents, depending on  the realm, are still good. But the number  

71:52

of the papers is the problem. And also this  edifice that you have, that all that counts is  

72:02

significance instead of effect size. Good theory.  A good theory can predict a very small change.
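
The point about significance versus effect size can be illustrated with a small calculation. This is a sketch under simplifying assumptions that are not from the talk: two equal-sized groups, a known standard deviation of 1 (so a z-test applies), and an observed mean difference exactly equal to the assumed effect size. The function name is hypothetical.

```python
from math import erfc, sqrt


def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided P-value of a two-sample z-test with known unit variance.

    effect_size is the observed mean difference in standard-deviation units.
    """
    z = effect_size * sqrt(n_per_group / 2)
    # For a standard normal, 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


# The same tiny effect (0.05 standard deviations) at increasing sample sizes.
for n in (100, 1_000, 10_000, 100_000):
    print(f"n per group = {n:>7}: effect = 0.05, P-value = {two_sided_p(0.05, n):.3g}")
```

The effect size is identical in every row; only the sample size changes the P-value. A label of "significant" therefore says little about how large or how important an effect is.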

72:13

[Anton Garret] [Can I add to that that Peter Medawar, who is  

72:16

a Nobel prize winning immunologist, wrote an essay  called "Is The Scientific Paper A Fraud?" I think  

72:25

several decades ago. And he didn't mean that the  results were fraudulent. What he was saying was  

72:32

that a scientific paper as the end product does  not reflect the process by which it's created.]

72:39

Yeah. Yeah. And that would be very helpful  for students. To see the agony that goes into  

72:47

writing a paper, the ever, ever changing of the thing. And that would help students very much,  

72:55

because they think "Oh, I never will  achieve this." The final product.
