Gerd Gigerenzer: Mindless Statistics

Transcript

0:05

Today I will talk about a strikingly  persistent phenomenon in the social  

0:12

and biomedical sciences: mindless statistics.

0:20

Let me begin with a story. Herbert Simon  is the only person who has received both  

0:27

the Nobel prize in economics, and the  Turing award in computer science. The  

0:32

two highest distinctions in both disciplines.  Shortly before he died Herb sent me a letter  

0:42

in which he mentioned what has frustrated  him almost more than anything else during  

0:52

his scientific career: significance testing. He wrote, "The frustration does not lie in

1:05

the statistical tests themselves, but in the stubbornness with which social scientists

1:15

cling to a misapplication that is consistently denounced by professional statisticians."

1:25

Herbert Simon was not alone. The mathematician  R. Duncan Luce spoke of mindless hypothesis  

1:36

testing in lieu of doing good science. The  experimental psychologist Edwin Boring spoke  

1:48

of a meaningless ordeal of pedantic calculations.  And Paul Meehl, the clinical psychologist and  

2:00

former president of the  American Psychological Society,  

2:04

called significance testing "one of the worst  things that have ever happened to psychology."

2:14

What is going on? Why these emotions?  What could be wrong with what most  

2:24

psychologists, social scientists,  and biomedical scientists are doing?

2:33

In this talk I will explain what is  going wrong. The institutionalization  

2:40

of a statistical ritual instead of good statistics. I will explain what the ritual

2:52

is. I will explain how it fuels the replication crisis, how it creates blind spots in the minds

3:03

of the researchers. And also how it creates  a conflict for researchers, young and old,  

3:13

between doing good science, and doing  everything to get a significant result.

3:22

Let me give an example. A few years ago  I gave a lecture on scientific method  

3:29

and also on the importance of trust and  honesty in science. After I finished,  

3:39

in the discussion session, a student from an Ivy League university stood up and told me, "You

3:49

can afford to follow the rules of science.  I can't. I have to publish and get a job.  

4:01

My own advisor tells me to do  anything to get a significant result."

4:12

That's known as 'slicing and  dicing data' or also 'P-hacking'.

4:22

The student is not to blame.  He was honest. But he has to  

4:29

go through a ritual that is  not in the service of science.

4:37

So let me start with the replication crisis.  So every couple of weeks the media proclaims  

4:48

the discovery of a new tumor marker that promises personalized diagnostics or even

4:56

treatment of cancer. And medical research, tumor  research, is even more productive. Every day four  

5:05

to five studies report at least one significant  new marker. Nevertheless, despite this mass of  

5:21

results, few have been replicated and even  fewer have been put into clinical practice.

5:30

When a team of 100 scientists at the biotech company Amgen

5:36

tried to replicate the findings of 53 landmark studies, they succeeded with only six. When the

5:49

pharmaceutical company, Bayer, examined  67 projects on oncology, women's health,  

5:59

and cardiovascular medicine, they  were able to replicate only 14.

6:06

So what do you do when your doctor  prescribes you a drug based on randomized  

6:13

trials that showed it is effective, but then the effect seems to fade away? Now,

6:25

medical research seems to be preoccupied with producing non-reproducible results.

6:33

Iain Chalmers, one of the founders of the Cochrane Collaboration, and Paul Glasziou, chair of

6:41

the International Society for Evidence-Based Health Care, estimated that 85% of medical

6:53

research is avoidably wasted. And they estimated  a loss of $170 billion every year worldwide.

7:08

The discovery that too many scientific results  

7:15

appear to be false alarms has been  baptized the 'Replication Crisis'.

7:28

In recent years a number of  researchers, often young researchers,  

7:35

have tried to systematically find out how big the problem is. And typically

7:43

the results show that between 1/3 and 2/3  of published findings cannot be replicated.  

7:52

And among those that can be replicated, the effect size is, on average, halved.

8:02

So in medical research for instance the  efficacy of anti-depressants plummeted  

8:11

drastically from study to study. And  second generation anti-psychotics that  

8:20

earned Eli Lilly a fortune, seemed to lose their efficacy when retested.

8:32

It's interesting how the scientific  community reacted. So what would you  

8:38

do if the result that made you famous disappears? Some researchers, like the

8:48

psychologist Jonathan Schooler faced the  problem and tried to think about what's  

8:57

the reason. And Jonathan came up with the  idea of 'cosmic habituation'. In his words  

9:10

it was "as if nature gave me this great result and then tried to take it back."

9:21

The New Yorker called this 'The Truth Wears Off' phenomenon. Other researchers,

9:30

though, reacted differently. They were not happy with those who tried

9:40

to replicate their studies and failed, and waged personal attacks on them, speaking of, I quote,

9:50

'replication police,' 'shameless little bullies,' 'witch hunts,' or comparing them to the Stasi.

10:03

So here we are. At the beginning of the  21st century one of the most cited claims  

10:11

in the social and biomedical sciences was John Ioannidis's 'Why Most Published Research Findings Are False.'

10:23

In 2017, just to give a hint of the possible political consequences, the

10:31

news website breitbart.com headlined a claim by the Wharton School professor Scott Armstrong that,

10:43

I quote, "fewer than 1% of papers in scientific journals follow the scientific method."

10:57

Now, we have seen in this country and in other countries politicians trying to cut

11:05

down funding for research. And if they read more about this, there would be more

11:11

going in this direction. And those who point  out that so many results are not replicable,  

11:23

face a double problem. They want to save science; at the same time they run the danger

11:31

that Donald Trump, or someone else, will use this to cut funding down entirely.

11:41

So how did we get there?

11:44

The replication crisis has been blamed on false economic incentives, like

11:52

'publish or perish,' and I want to make a point  today that we need to go beyond the important role  

12:03

of external incentives and focus on an internal  problem that fuels the replication crisis. And  

12:17

this factor is that good scientific practice has been replaced by a statistical ritual.

12:25

My point is that researchers follow this ritual not only because of external

12:35

pressure. No, they have internalized the ritual, and many genuinely believe in it. That can be

12:48

seen most clearly by the delusions they have  about the P-value, the product of the ritual.

12:58

So statistical methods are not just applied to a science; they can change the entire science. So

13:08

think about parapsychology, which once was the study of messages from the dear departed, and it

13:22

turned into the study of repetitive card guessing, because that's what the statistical method demanded.

13:35

In a similar way the social sciences  have been changed by the introduction  

13:44

of statistical inference. And typically in  social science, scientists first encountered  

13:53

Sir Ronald Fisher's theories, in particular his 1935 book. He wrote three books. The first was

14:03

too much about agriculture and manure,  and technically too difficult for most  

14:10

social scientists. But the second one was  just right. And it didn't smell anymore.

14:21

And so they started writing textbooks. And then they became aware of a competing theory

14:28

by the Polish statistician Jerzy Neyman and the British statistician Egon Pearson.

14:38

Fisher had a theory, null hypothesis testing,

14:44

with just one hypothesis; Neyman insisted that you need two. Fisher had the

14:51

P-value computed after the experiment; Neyman and Pearson insisted on specifying everything in advance.

14:59

I'll just give you an idea of the fundamental differences, and of the

15:06

flavor of the controversy. Fisher branded Neyman's  theory as 'childish' and 'horrifying for the  

15:15

freedom of the West,' and linked Neyman-Pearson theory to Stalin's five-year plans. Also to

15:26

Americans who cannot distinguish or don't want  to distinguish between making money and doing  

15:33

science. Incidentally Neyman was born in  Russia and moved to Berkeley, in the U.S.

15:42

So Neyman, for his part, responded to some of Fisher's tests and said that these

15:51

are, in a mathematically specifiable sense, "worse than useless." What he meant

15:59

was that the power was smaller than alpha, as in the famous lady tasting tea test.

16:10

So what do textbook writers do when there are  two different ideas about statistical inference?

16:23

One solution would have been to present both, and maybe also Bayes or Tukey and others,

16:34

and to teach researchers to use their judgment, to develop a sense of which method works in which

16:44

situation. No, that was not what the textbook writers were going for. They

16:53

created a hybrid theory of statistical inference  that didn't exist and doesn't exist in statistics  

17:01

proper, taking some parts from Fisher, some parts from Neyman and Pearson, and adding their own parts,

17:10

mostly about the idea that scientific  inference must be without any judgment.

17:18

That's what I mean by mindless: automatic.

17:24

And the essence of this hybrid  theory is the null ritual.

17:31

The null ritual has three steps. First, set up a null hypothesis of no mean difference,

17:39

or zero correlation. And most important,  

17:44

do not specify your own hypothesis  or theory, nor its predictions.

17:53

Second step: use 5% as a convention for rejecting the null hypothesis. If

18:01

the test is significant, claim victory for your hypothesis,

18:06

which you have never specified, and report the result as P smaller than 5%,

18:18

or 1%, or 0.1%, whichever level is met by your results.

18:28

And the third step is a unique step. It  says always perform this procedure. Period.
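
To make the three steps concrete, here is a minimal sketch in Python of the ritual as a mechanical procedure. The data and numbers are invented for illustration, and the sketch shows exactly the procedure being criticized, not a recommended analysis.

```python
# A sketch of the null ritual as a mechanical procedure (invented data).
# This illustrates the ritual being criticized, not a recommended analysis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Step 1: a null of "no mean difference"; our own hypothesis stays unspecified.
group_a = rng.normal(0.0, 1.0, 30)
group_b = rng.normal(0.4, 1.0, 30)

p = stats.ttest_ind(group_a, group_b).pvalue

# Step 2: report only whichever conventional level the p-value clears.
for level in (0.001, 0.01, 0.05):
    if p < level:
        print(f"significant, p < {level}")
        break
else:
    print("not significant")

# Step 3: always perform this procedure, whatever the question. Period.
```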

18:41

Now neither Fisher nor, to be sure, Neyman and Pearson would have approved of this procedure.

18:50

Fisher, for instance, said that no scientific researcher will ever have the same level of

18:56

significance from experiment to experiment; he will give his mind to each case. Neyman and

19:03

Pearson also emphasized the role of judgment. And if the two fighting camps agreed on

19:12

one thing, it was that scientific inference cannot be mechanical. You need to use your brain.

19:21

And that was exactly the message the  null ritual did not convey. Namely  

19:29

it offered a mechanical procedure by which the quality of an article could be measured.

19:38

Now what did the poor readers of these  textbooks do with a mishmash of two  

19:45

theories, which was never identified as a mishmash, with the names of Fisher, Neyman,

19:54

and Pearson never attached to the ideas? The result was that the external conflict between

20:04

the two groups of statisticians turned into an internal conflict in the average researcher.

20:13

I use a Freudian analogy to make that clear. The superego was Neyman-Pearson theory. The

20:20

average researcher somehow believed that he or she had to have two hypotheses and actually

20:29

give thought to alpha and power before the experiment and calculate the number of subjects

20:34

needed. But the ego, the Fisherian part, got things done and published, yet was left with a

20:45

feeling of guilt at having violated the rules. And at the bottom was the Bayesian id,

20:54

longing for probabilities of hypotheses, which neither of these two theories could deliver.

21:05

How did all this come about? How could this happen?

21:12

I'll give you another story. I once visited  a distinguished statistical textbook writer  

21:20

whose book went through many editions and whose name doesn't matter. His book was

21:30

actually one of the best in the social sciences, and he was the only one

21:36

who had, in an early edition, a chapter on Bayes, and also, albeit in only one sentence,

21:45

a mention that there is a theory by Fisher and a different one by Neyman and Pearson.

21:56

To mention the existence of alternative theories was unheard of, let alone with names

22:05

attached to them. So I asked him why he took out the chapter on Bayes and this one sentence

22:17

from all further editions. When I met him he was just busy preparing, I think,

22:24

the fifth edition of his bestselling book.

22:29

And I asked him why he had created an inconsistent hybrid that every decent statistician would have

22:42

rejected. To his credit, I should say that he did not attempt to deny that he had

22:52

produced an illusion. But he let me know whom  to blame for it, and there were three culprits.

23:01

First, his fellow researchers. Then,  

23:05

the university administration.  And third, his publisher.

23:11

His fellow researchers, he said, are not  interested in doing good statistics. They  

23:17

want their papers published. The university  administration promoted people by the number  

23:24

of publications, which reinforced the researchers' attitude. And his publisher

23:35

did not want to hear about different theories. He wanted a single-recipe cookbook and forced him,

23:47

so the author told me, to take out the  Bayesian chapter, and even this single  

23:54

sentence about the theories of Fisher and of Neyman and Pearson.

24:03

At the end of our conversation I asked him in  what statistical theory he himself believes.  

24:16

And he said, "Deep in my heart I am a Bayesian." Now, if he was telling me the truth,

24:30

he had sold his heart for multiple editions of a  famous book whose message he did not believe in.

24:42

Tens of thousands of students have read this text believing that it reveals the method of science, and

24:50

dozens of less informed textbook writers  copied his text, churning out a flood  

24:59

of inconsistent offspring textbooks, without noticing the mess.

25:08

I have used the term 'ritual' for this  procedure for the essence of the hybrid  

25:15

logic, because it resembles social rites. Social rites typically have the following

25:25

elements. There are sacred numbers or colors.  Then, there's a repetition of the same action,  

25:33

again and again. And then there's fear: fear of being punished if you don't repeat these

25:41

actions. And finally delusions. You have to  have delusions in order to conduct the ritual.

25:51

The null ritual contains all of these features.

25:55

There's a fixation on 5%. And in functional MRI  it's colors. Second, there's repetitive behavior  

26:10

resembling compulsive handwashing. And third,  there's fear of sanctions by editors or advisers.  

26:21

And finally, there are delusions about what a P-value means; I will describe those in a moment.

26:30

Let me just give you a few examples  about the mindless performance of the  

26:36

ritual. They may be funny, but deep down it's really disconcerting.

26:46

So in an internet study on  implicit theories of moral courage,  

26:53

Philip Zimbardo, who is famous for his Stanford prison experiment, and two colleagues,

27:02

asked their participants "do you feel that there  is a difference between altruism and heroism?"

27:15

Most felt so. 2,347 respondents said  'yes' and only 58 said 'no.' Now the  

27:26

authors computed a Chi-Squared test  to answer whether these two numbers  

27:32

are the same or different. And they found that they are indeed different.
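
For illustration, the pointless computation can be reproduced in a few lines; the counts are the ones just quoted, and no test is needed to see that 2,347 differs from 58. A sketch:

```python
# Reproducing the pointless chi-squared test on the counts quoted above.
from scipy import stats

yes, no = 2347, 58
result = stats.chisquare([yes, no])   # null: 'yes' and 'no' equally likely
print(result.pvalue)                  # astronomically small, and uninformative
```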

27:43

A graduate student of mine, a smart one,  had the opposite situation. His name is  

27:52

Pogo. Pogo ran an experiment with two groups and found that the two means were exactly the

28:00

same. But Pogo could not just write or say that. He felt he had to do a statistical test,

28:10

a t-test, to find out whether the two exactly equal numbers differed significantly,

28:16

and he found out they don't. And  the P-value was impressively high.

28:26

Here's the third illustration. I  recently reviewed an article in which  

28:35

the number of subjects was 57. The authors  calculated a 95% confidence interval for  

28:50

the number of subjects and concluded the  confidence interval is between 47 and 67.

29:01

Don't ask why they did it.  It's mindless statistics.

29:04

Almost every number in this paper was  scrutinized in the same way. The only  

29:12

numbers in the paper that had no confidence intervals were the page numbers.

29:21

This is an extreme case but unfortunately it's not  

29:26

the exception. Consider all behavioral, neuropsychological, and medical studies

29:32

in Nature in the year 2011. 89% of  the studies report only P-values and  

29:44

nothing else of importance: no effect sizes, no power, no model estimates.

29:53

Or consider an analysis of the Academy of Management Journal

29:59

a year later, which reported that the average number of P-values in an article is...

30:07

guess how many P-values. If you have two  hypotheses you would need two. No. 99.

30:19

Yeah, it's mechanical testing of any number. But the idol of automatic universal inference,

30:30

however, is not unique to P-values or confidence intervals. Dennis Lindley, a leading advocate of

30:40

Bayesian statistics, once declared that the only good statistics is Bayesian statistics, and that

30:50

Bayesian methods are even more automatic, in his opinion, than Fisher's own methods.

31:00

So the danger is here. You don't have  much progress if you use Bayes factors  

31:06

just as mindlessly. So let me go on. The  examples about mindless use sound funny,  

31:15

but there are deep costs of the  ritual. I'll give you a few examples.

31:20

Maybe the first and quite  interesting one is that you  

31:23

actually fare better if you don't specify your own hypothesis. Why?

31:32

Paul Meehl once pointed out a methodological paradox in physics. Improvements in experimental

31:43

measurement and in the amount of data make it harder for a theory to pass, because you

31:49

can more easily distinguish between  the prediction of the theory and the  

31:54

actual data. In fields that rely on  the null ritual it's the opposite.

32:04

Why? Because the ritual tests a null in which you don't believe. So improvements

32:14

in measurement make it easier to detect the difference between the data and the null,

32:22

and that means easier to reject the null and claim victory for your hypothesis. And

32:31

you can easily imagine that this is another factor that leads to the irreplicability of results.
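
A small simulation can make this concrete. Under an assumed trivial true difference, growing samples alone drive the p-value down, so the null is rejected ever more easily; all numbers here are invented:

```python
# Simulation of the reversed paradox: a trivial true difference (assumed
# here to be 0.05 standard deviations) becomes "significant" once the
# sample is large enough.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tiny_effect = 0.05

for n in (100, 1_000, 10_000, 100_000):
    a = rng.normal(0, 1, n)
    b = rng.normal(tiny_effect, 1, n)
    print(n, stats.ttest_ind(a, b).pvalue)
# For large n the p-value collapses toward 0 for the same trivial effect.
```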

32:40

Second point: I now get to the delusions about this sacred object,

32:50

the P-value. Now a P-value is the probability  of a result, or a more extreme one, if the null  

33:01

hypothesis is correct. And more technically  correct, it is the probability of a test  

33:07

statistic given an entire model. But the point is, it's a probability of the data given a hypothesis,

33:20

not the Bayesian probability. And that should be  easy to understand for any academic researcher.
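
To spell the definition out, here is a sketch that computes a p-value directly from it by simulating the null model; the data are made up for illustration:

```python
# A p-value computed from its definition: the probability, under the null
# model, of a test statistic at least as extreme as the observed one.
import numpy as np

rng = np.random.default_rng(7)
observed = np.array([0.8, 1.1, -0.2, 0.9, 0.5, 1.3, 0.4, 0.7])  # made-up data
t_obs = observed.mean()

# Null model assumed here: independent draws with mean 0 and sd 1.
null_means = rng.normal(0, 1, size=(100_000, observed.size)).mean(axis=1)
p = np.mean(np.abs(null_means) >= abs(t_obs))   # two-sided "or more extreme"

print(p)   # P(data at least this extreme | null model), and nothing more
```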

33:31

It is also not the probability that you will be able to replicate the result. And

33:41

the replication delusion, the first one, is the belief that when you have a P-value of 1%,

33:52

it logically follows that the probability you can replicate your result is 99%. Clear?

34:00

This illusion was already taught in the book by Nunnally that Greg pointed out. It's

34:10

great reading. If you really want to have fun, read the old statistics textbooks.

34:17

None of them were written by statisticians; otherwise they wouldn't have been used.

34:24

For instance, Nunnally writes, quote, "What does a P-value of 5% mean?" His answer:

34:34

"The investigator can be confident with odds of  95 out of 100 that the observed difference will  

34:42

hold up in further investigations."  That's the replicability delusion.
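
A simulation shows why this is a delusion. Under assumed but typical numbers (a true effect of half a standard deviation, 20 subjects per group), the chance that an exact rerun of a study with p < 1% comes out significant again is nowhere near 99%; it is simply the power of the design:

```python
# Why "p = 1% means 99% replication" is a delusion (all numbers assumed:
# true effect d = 0.5, n = 20 per group). The replication probability is
# the power of the design, not 1 - p.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, trials = 0.5, 20, 20_000

replicated = total = 0
for _ in range(trials):
    a, b = rng.normal(0, 1, n), rng.normal(d, 1, n)
    if stats.ttest_ind(a, b).pvalue < 0.01:                  # original hits p < 1%
        total += 1
        a2, b2 = rng.normal(0, 1, n), rng.normal(d, 1, n)    # exact rerun
        replicated += stats.ttest_ind(a2, b2).pvalue < 0.05

print(replicated / total)   # roughly 0.3-0.4, nowhere near 0.99
```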

34:51

So I was curious, what is the state today?

35:00

Do academic researchers understand what a P-value means, the very object they are looking for?

35:12

So I surveyed all studies available in six  different countries with a total of over  

35:20

800 academic psychologists and about a thousand students, who were all asked "What does a

35:27

P-value of 1% mean?" How many of these professors and students cherish the replicability delusion?

35:42

So in this example there were 115 persons who  taught statistics. But you should know that  

35:54

in the social sciences, and certainly in psychology, those who teach statistics

36:00

are mostly not statisticians, for the same reason: statisticians would notice what's going on.

36:07

So what do you think? What proportion of  115 statistics teachers, that's across  

36:15

six countries, fall prey to the illusion of replicability? It should be zero. It is 20%.

36:26

Then we looked at the professors, over 700 in this study. Among

36:32

the professors, it's 39% who believe in the replicability illusion. Almost double.

36:42

And among the poor students it is  66%. They have inherited the delusion.

36:52

And note that this is another reason why the replicability crisis was not

37:00

noticed until recently: because of this illusion. If I have a result

37:06

with a P-value of 1%, I can be almost sure it can be replicated.

37:13

Now, this is not the only delusion shared by academic researchers. The next delusion is

37:25

to think that the P-value, the probability of the data given a hypothesis, tells you the

37:33

probability of the hypothesis given the data.  And the majority of academic psychologists in  

37:44

six countries, and in every study, shares at least one, and typically several, of these illusions,

37:55

including that a P-value of 1% tells you that the probability that the null hypothesis

38:02

is true is also 1%, or that the probability that the alternative hypothesis is true is 99%. And so it goes on.
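
The gap between the two probabilities is easy to compute with Bayes' rule. In this back-of-envelope sketch all inputs are assumed numbers for illustration, yet even a significant result leaves a substantial probability that the null is true:

```python
# P(H0 | significant) is not alpha. All inputs below are assumed numbers.
prior_h1 = 0.10   # assumed: 10% of tested hypotheses are true
power    = 0.80   # assumed: P(significant | H1)
alpha    = 0.05   # P(significant | H0), the significance level

p_sig = power * prior_h1 + alpha * (1 - prior_h1)   # total P(significant)
p_h0_given_sig = alpha * (1 - prior_h1) / p_sig     # Bayes' rule

print(p_h0_given_sig)   # 0.36: significant, yet H0 is true 36% of the time
```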

38:13

This is a remarkable state of the art  of doing science. Remarkable because  

38:23

every one of these academic researchers understands that the probability of A given B

38:29

is not the probability of B given A. But within the ritual, thinking is blocked.

38:38

Another obvious blind spot, and cost, is effect size. There is no effect size in the null

38:45

ritual. There is an effect size in Neyman and Pearson's theory, yes. But that's not being taught.

38:52

McCloskey and Ziliak asked the question: do economists distinguish between

39:04

statistical significance and economic significance?

39:11

And they looked at the papers in one of the top journals, The American Economic Review,

39:18

and of 182 papers, 70% did not make the distinction. And what Ziliak and McCloskey did,

39:32

they published the names of those who got most  confused, including a number of Nobel laureates.  

39:42

Ten years later they repeated the analysis, assuming that everyone must have read it and people would now be

39:49

more reasonable. But the 70% who confused the two didn't go down; it went up to 82%.

40:00

Similarly, there's a blind spot for statistical power. There is

40:04

no power in the null ritual. And  power means the probability that  

40:10

you find an effect if there is one. That should be 80% or 90%; better higher.

40:19

The psychologist Jacob Cohen was the first  one who systematically studied the power of  

40:25

a major clinical journal, and he found that the  average power for a medium effect was 46%. That  

40:38

wasn't much. Now, Peter Sedlmeier and I, 25 years later, which should be enough time for things to change,

40:51

analyzed the power in the same journal: before, it was 46%; it had gone down

41:00

to 37%. Why? Because many researchers now did alpha adjustment, which decreases power.

41:12

And notice what that means. If you set up an experiment that has only a power of, say, 30%

41:31

to detect an effect if there is one, you could do better and much

41:39

more cheaply by tossing a coin, and you would have a power of 50%. Clear?
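
Power is easy to estimate by simulation, and the null ritual never asks for it. A sketch with assumed numbers, for a two-sample t-test with a medium effect:

```python
# Estimating statistical power by simulation (numbers assumed).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(d, n, alpha=0.05, trials=10_000):
    """Fraction of experiments with true effect d that reach p < alpha."""
    hits = 0
    for _ in range(trials):
        a, b = rng.normal(0, 1, n), rng.normal(d, 1, n)
        hits += stats.ttest_ind(a, b).pvalue < alpha
    return hits / trials

print(power(d=0.5, n=20))   # roughly 0.33: worse than tossing a coin
```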

41:52

And you could spare yourself all this effort. And what I've told you now are even the better

42:02

results. In neuroscience studies, for instance studies of Alzheimer's disease,

42:10

genetics, and cancer biomarkers, the median power across more than 700 studies is 21%.

42:20

In functional MRI studies, only 8%. And a recent study that looked at 368 research areas in

42:34

economics, analyzing 31 leading journals, found a median statistical power, again calculated

42:46

for an alpha of 5% and the median effect size in the area, of... what would you guess? 7%.

42:58

Then they looked at the top five journals; economics has a hierarchy,

43:03

the top five. What's there? What do you think? 7% among the hoi polloi; in the top five, only 5%.

43:16

Low statistical power is another reason for failures of replication,

43:21

and the interesting thing  is it's not being noticed.

43:25

A recent study by Paul Smaldino and Richard McElreath looked at 60 years of

43:33

power research in the behavioral sciences. They took every study that referenced our study

43:44

with Peter Sedlmeier, in order to get

43:52

a large number of studies, and they found consistently low power,

43:59

and it is not improving. One of the reasons is the blindness built into the null ritual.

44:09

Let me get to a final point about the costs.  It is the moral problem. So science is based  

44:18

on trust and the honesty of researchers.  Otherwise we can't do this. And this  

44:27

statistical ritual creates a conflict between following scientific morals and trying to

44:39

do everything to get a significant result, even if it's a false one.

44:47

And that's called 'p-hacking'. That's called  'borderline cheating.' Borderline cheating  

44:54

because you don't really invent your data, but you slice the data and calculate, maybe with this slicing

45:04

or with that slicing, until you find something. Borderline cheating includes: you do not

45:11

report all the studies you have run, but only the ones that were significant. You do not report all

45:17

the dependent measures you have looked at, but only the significant ones. You do not report all the independent

45:23

measures. And maybe, if your P-value ends up at 5.4%, you round it down to slightly under 5%.
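
One of these moves alone is enough to inflate false alarms well past the nominal 5%. A sketch, with assumed numbers, of testing several dependent measures under a true null and reporting whichever comes out significant:

```python
# P-hacking by multiple dependent measures (all numbers assumed).
# The null is true for every measure, yet "significant" results abound.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, n_measures, trials = 30, 5, 10_000

false_alarms = 0
for _ in range(trials):
    for _ in range(n_measures):
        a, b = rng.normal(0, 1, n), rng.normal(0, 1, n)   # no real effect
        if stats.ttest_ind(a, b).pvalue < 0.05:
            false_alarms += 1    # "found" an effect; report only this one
            break

print(false_alarms / trials)   # roughly 0.23, not the nominal 0.05
```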

45:36

And if you analyze the distribution of published P-values, you see it exactly:

45:41

the values slightly above 5% are missing, and there are too many just below.

45:48

So a study by John, Loewenstein, and Prelec,  

45:51

with over 2,000 academic psychologists, found that the vast majority admitted

45:59

to having engaged in at least one of these questionable research practices that amount

46:06

to cheating. And when they were asked whether their peers do it, the numbers were even higher.

46:18

So let me come to the end and ask: what can we do about all this?

46:27

The simplest answer would be that we need to foster

46:36

statistical thinking, not rituals. In response to the crisis, proposals have been made. The

46:45

American Statistical Association has made a number  of statements which I think were not very helpful.

46:57

A group of 50 researchers, all luminaries,  

47:00

has recommended a solution, namely to change the threshold from P smaller than 5% to P

47:10

smaller than 0.5%. That makes it harder, but it doesn't even address the problem.

47:22

What will happen? There will be more intensive  

47:27

P-hacking because you have to  work harder to get this. Right?

47:33

I think we have to make more fundamental changes, and I can only sketch them here.

47:40

First, we need to finally realize that we should test our own hypotheses, not a null.

47:50

Second, we need to realize that the business is about minimizing the real error in our

47:58

measurement, not taking the error and dividing it by the square root of N.
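
The difference can be seen in a toy example. The standard error of the mean, s divided by the square root of N, shrinks as N grows, but an assumed systematic error in the measurement itself does not move at all:

```python
# Shrinking the standard error is not minimizing the real error.
# Toy numbers assumed: the instrument reads 0.5 units too high.
import numpy as np

rng = np.random.default_rng(3)
true_value, bias = 10.0, 0.5

for n in (10, 1_000, 100_000):
    sample = true_value + bias + rng.normal(0, 1, n)
    sem = sample.std(ddof=1) / np.sqrt(n)          # shrinks like 1/sqrt(N)
    print(n, round(sem, 4), round(sample.mean() - true_value, 3))
# The SEM goes to 0; the real error of the mean stays near +0.5.
```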

48:08

This is a key disease, and here's another story. I was once a visiting scholar at Harvard, and it

48:18

happened that I had my office next to B.F. Skinner's office. B.F. Skinner was

48:26

once the most well-known and controversial psychologist. At that time he was quite old,

48:35

and his star was falling because of criticism. And he felt a little bit lonely,

48:42

that was my impression, so  we had lots of time to talk.

48:46

And I asked him about his attitude to statistical  testing. It turned out that he had obviously no  

48:55

recognizable training in statistics, but he had good intuition. He admitted

49:03

that he once tried to run 24 rats at the same time. He said, "It doesn't work, because

49:11

you can't keep them at the same level of deprivation, and you increase the error."

49:18

And he had the right intuition. It's the same intuition as Gosset's,

49:25

the man who, under the pseudonym 'Student,' developed the t-test and said, "A significant

49:31

result by itself is useless. You  need to minimize the real error."

49:37

Skinner told me a story: when he gave an address to the American Psychological Association,

49:47

and after having reported about  one rat, he said "according to  

49:54

the new rules of the Society, I will now report on the second rat."

50:02

So he understood that part.

50:07

And there is a third move, besides really taking care of what you measure.

50:14

In many physics experiments, weeks and months are spent trying to get clean measurements. In

50:22

the social sciences it's often Amazon Mechanical Turk workers, who answer questions for

50:32

little money in a short time, and who do best if they don't really pay attention.

50:39

So the third point would be: remember that statistics is a toolbox. There is no single

50:48

statistical inference method that is the best in every situation. I often need to tell

50:56

this to my dear Bayesian friends. Bayes is a great  system, but it also doesn't help you everywhere.

51:06

And universities should start  teaching the toolbox and not a ritual.

51:13

And editors are very important in  this business. They should make  

51:19

a distinction between research that generates a hypothesis and research that tests a hypothesis,

51:27

so that young scientists don't have to cheat anymore and pretend that they

51:32

had a hypothesis all along that they actually arrived at after seeing the data.

51:37

Second, editors should require, when inferences are made, a statement of the population to which the

51:46

inference refers: people or situations. In many applications of statistics there is no population.

51:54

There is no random sample. Why do we make the inference, and to what population? Unclear.

52:01

Third, editors should require competitive  testing and not null hypothesis testing.

52:10

And finally, in my opinion one important signal  would be that editors should no longer accept a  

52:22

manuscript that reports results as significant or not significant. There's no point in making

52:31

this division, and it's exactly this division that then makes people try to cheat, or fail.

52:39

If you want to report P-values, fine, but report them as exact P-values. That

52:44

is what Fisher, in his third book in the 1950s, always said. Fisher rejected the

52:51

idea of having a fixed criterion, because that is what he meant by five-year plans.

53:01

And at the end, I want to put this in  a larger context. The null ritual is  

53:09

part of a larger structural problem  that we have in the sciences. And  

53:17

the problem is that quality is more  and more being replaced by quantity.

53:24

As the Nobel laureate in physics Peter Higgs said:

53:33

"Today," he said, "I wouldn't get  an academic job. It's as simple as  

53:39

that. I don't think I would be  regarded as productive enough."

53:46

And we have come to an understanding of science in which science means producing as many

53:55

papers as possible. That means you have less time to think. And thinking is hard; writing is easy.

54:07

And one driver of this change from quality to quantity is the university

54:18

administrators who count rather than read when deciding on promotion or tenure.

54:30

A second driver is the scientific publishing

54:39

industry, which misuses the infinite capacity of online publication to make researchers

54:50

publish more and more, in more and more special issues and in more and more journals.

54:58

This development towards quantity  instead of quality is further fueled  

55:07

by the so-called predatory journals that  emerged in the last 20 years. Predatory  

55:14

journals exist obviously only to collect a few thousand bucks from you

55:21

for publishing your paper, with no noticeable review system. We know of cases where reviewers,

55:30

often serious scientists who somehow did not notice what was going on, rejected

55:37

a paper and were told clearly that this was not in the interest of the publishing company.

55:46

And most recently we face a new problem: the industry-

55:56

like, systematic production of fake articles with the use of AI by so-called paper mills.

56:08

It's mostly in the biomedical sciences, in genetics.

56:18

Assume you work in a hospital, you're a doctor,  and you need an article in a good journal to be  

56:28

promoted. And somehow it doesn't work. So a paper mill offers you what are called

56:37

'assistant services': for $10,000 or $20,000, it will write an article that is

56:46

actually faked from beginning to end, including, maybe, faked western blots.

56:54

And they can guarantee significance, and they can guarantee publication.

57:04

And why? Because more and more in recent years they bribe journal editors to publish the

57:14

papers they send them, and pay them. A colleague of mine, who is the editor of a medical

57:21

journal, got an offer from a paper mill in China: for every article you publish, we pay

57:30

you $11,000 multiplied by your impact factor. And  we will help you to increase your impact factor.

57:41

So, talking about broken science: that's the future. Let me finish. I think the larger

57:51

goal is that scientific organizations like the Royal Society of London should take

57:59

back control of publishing, out of the hands of commercial for-profit publishers. That can

58:09

be done. For instance, it happened last year at the top journal in neuroscience, NeuroImage:

58:19

42 editors of NeuroImage stepped down, resigning because of what they called

58:26

the 'greed' of Elsevier, and founded a new journal called Imaging Neuroscience. And they made a call

58:34

to the entire scientific community: submit your papers to nonprofit journals and

58:41

no longer support the exploitation of  reviewers, of writers, of editors, by  

58:49

these big publishing companies, which in some years make more profit than the pharmaceutical industry.

58:58

So that's one way. And a second important  conclusion is universities need to be restored  

59:07

as intellectual institutions, rather than run as if they were corporations.

59:15

We need to publish fewer  articles and better science.  

59:21

We need statistical thinking. Not rituals.

59:27

Thank you for your attention.

59:34

[Applause]

59:41

Thank you so much.

59:49

Questions?

59:49

[Haavi Morreim] [I've got one quick one.

59:51

You've outlined the myriad of factors contributing to the replication crisis. Could I

59:58

pile on by adding one more? Grant-giving entities, whether government or private,

60:05

aren't interested in funding this. They aren't paying you to do what somebody else already did,

60:12

what's already been shown. They want you to plow new ground. Show us something new. Whatever.

60:17

There's no money in replicating, unless there is some specific reason to question what has,

60:24

quote unquote, 'already been shown.' And I live inside an academic medical

60:30

center, and you can publish all you want, but it counts for little if you don't bring cash,

60:35

okay, grants and overhead and all that  kind of thing. I love your ideas. Who's  

60:41

going to pay for it? The financing system needs to be a part of the resolution.]

60:47

Yeah, you're totally right. And there are more factors. What we can do is work from below,

60:55

from the ground up: stand up as scientists and use our own values to change the system. The

61:07

system worked at some time before, and we are in danger of letting it fall into the hands of commerce.

61:20

[Jay Couey] [Yeah,  

61:21

I thought what you said about the  two different kinds of science,  

61:28

like whether you're testing a hypothesis or defining it. I think neuroscience is very

61:34

much trapped in this idea that they're testing a hypothesis when they're still trying to formulate one.

61:41

And they don't understand their null hypothesis. That's the problem. I think

61:47

that's something you can add: the null hypothesis is your assumption, or it might be this, but it's not.]

61:57

Yeah, you're totally right. And the response  to what you're saying is often "yeah,  

62:03

it's too difficult. We cannot  make precise predictions."

62:08

My response to that response is: it's not too difficult, but neuroscience relies

62:16

mostly on the null ritual. That invites you not to think about theory, real theory, and about precise

62:27

hypotheses. And then you just continue in this state of ignorance. But you get your papers published.

62:40

[Malcolm Kendrick] [Can I just make one point,  

62:42

which is in the medical world if you do  a study on say statin versus a placebo,  

62:48

and you prove statistical whatever, rhubarb, you're never allowed to do that study again, ever,

62:55

because you proved that the drug is better than placebo. You can never have another placebo arm,

63:00

ever. And I've been thinking about this. I don't  know what the answer is but it means that once  

63:04

you've proven your drug works, you can never do any replication or reproduction of that study ever again.]

63:14

So we are gathering the pieces of broken science. Here's another one.

63:21

[Peter Coles] [Yeah, I just I mean I agree  

63:23

with everything you said about the confusion in  academics about whether they're talking about the  

63:30

probability of the data given the model, or the  probability of the model given the data. So given  

63:35

the level of confusion that exists within the  research community it's not surprising that when  

63:41

it comes to the public understanding of science  it's a complete nightmare, because non-specialist  

63:47

journalists garble the meaning of the results even more than the academics do, probably.

63:54

So we end up with very misleading  press coverage about what results  

63:59

actually mean, what has been discovered and what has not, and that's

64:03

really damaging for public trust in science, and also, obviously, has implications for, you know,

64:10

kind of political influence on science  as well. If they see the public trust in  

64:15

science disappearing, and it's all part  of this problem that scientists and the  

64:25

media do not communicate their ideas  clearly enough for people to understand.

64:34

What I would say just to conclude is that very, very often in science the only reasonable answer

64:40

to a journalist's question is "I don't know,"  because in many, many situations we really don't  

64:47

know the answer. But you don't get on in your  career if you keep answering questions like that,  

64:51

even though it's true that you don't really know  the answer. The journalists will push you into  

64:56

saying "oh yes, I proved this is true, because of my P-values." So there's a much wider...]

65:05

One solution to this would be systematic  programs to train journalists. A few of  

65:11

them exist; I have been participating in one, but it really needs to be on a broader basis,

65:17

and it's becoming more and more  difficult because journalism is  

65:21

becoming less investigative and more day-to-day.

65:27

[Peter Coles] [Well the main media  

65:28

outlets are sacking journalists, essentially.]

65:32

Competition from social media is tough. And in the end we may at

65:40

some point face a situation where it's  no longer clear what is truth and what  

65:44

is fake. And we need to be prepared, and  we need to prepare the public for that.

65:53

[Yeah, okay. I do have a question. So you talked about the idea that we need to have

65:59

fewer papers, fewer better papers. And just looking at the fact that there are so

66:07

many papers that are not reproducible, that suggests to me that there's actually no

66:12

science there. Okay. So is there enough new  science to write enough papers to go around?]

66:24

Yes, the moment you stop the predatory journals, which need papers, and also stop publishers

66:34

like Frontiers, which between 2019 and 2021 doubled the number of papers they put out.

66:44

So your question "are there enough papers for the  

66:51

for-profit publishers that we have now?" For most of them there are never enough papers.

66:56

[Yes. So I'm actually saying  a slightly different thing.]

66:58

Yeah. [If I'm  

66:59

a researcher, are there actually enough ideas to write enough papers to, like, do anything useful?]

67:09

Yeah that's a good question. Fair  question. One answer would be if  

67:13

you have no ideas you shouldn't be a  researcher. Do something different.

67:19

And also, university departments need to be careful in

67:25

hiring, but then let people have time to really develop some ideas, and take the risk that

67:34

there will be few. We don't profit from the mass production of average things. Yeah.

67:41

[Following up on his question. I had a teacher in graduate school who had a

67:47

prophylactic solution or contribution to this  debate. He thought that instead of encouraging  

67:55

teachers to publish, we should discourage them,  

67:59

or at least we should give them an incentive to  think very carefully before they published. He  

68:06

proposed docking them $1,000 for every article  they published and $5,000 for every book.]

68:13

[Of course that was quite a while ago  so you would have to increase those  

68:16

numbers to account for inflation. I think  there's something to be said for that.]

68:21

So students would have to pay?

68:24

[So the teacher would have to pay. In other words, it would encourage him to

68:28

really think about whether he had something to say, because it would have to be worth $1,000 to publish this

68:33

article or $5,000 to publish the book. The teacher's salary, let's say, would go down.]

68:42

[Graciano Rubio] [Would you say that  

68:46

there's value in using P-values as a way to allocate resources for studies that deserve to

68:53

be repeated, rather than using those resources on new research and publishing more papers? So

69:01

in regards to charities and foundations and philanthropic funding, there are always limited

69:07

resources. You only have so much time and money.  So if we accept that the P-value is not sufficient  

69:14

for validation, could you use the P-value as a way to say these are the studies which deserve

69:20

to be repeated? So that we can find results that are predictable, and move away from

69:25

going out and trying to find something new  and just focusing on the volume of papers.]

69:30

Yeah. So as I understand your question, there's a conflict between replication and finding

69:39

something new. Yeah. Certainly yes. But I think  there are enough researchers who might focus on  

69:49

replication. For instance, and it could be an answer to your question, those who

69:53

at least at the moment have no new ideas could do the replication police work, and others find

70:02

new ideas. But new ideas are hard to come by. We need to be patient with them. And for most of us,

70:11

if you have one great idea in your life, that's already above average.

70:21

[Peter Coles] [Am I allowed a second one?

70:24

It occurred to me a long time ago that part of the  pathology of the academic system is actually the  

70:32

paper itself. The idea of a paper. Science has  become synonymous with writing papers. Whereas  

70:41

if you look at the technology, the world we have now, digital publication, we don't have to write these

70:47

tiny little quanta of papers to get publications. You don't communicate science effectively that way

70:53

anymore. It's a kind of 18th century idea that  you communicate by these written papers. You  

70:59

could have living documents, for example, which are gradually updated as you go. But of course the

71:06

current system does not allow a graduate student  to get promoted, to get advanced in that way.  

71:13

But I think by focusing entirely on papers we're  really corrupting the system as well. It's not so  

71:20

much an issue about who publishes them, although  it's a serious one as you said. It's the fact that  

71:27

we're fixed in this mindset that we have these  little articles that we have to communicate only  

71:33

by these articles. And that's forcing science  into boxes which is not really helpful.]

71:38

We can talk about this over lunch, but  I'm still with papers. I think papers,  

71:46

and also books, or patents, depending on  the realm, are still good. But the number  

71:52

of papers is the problem. And also this edifice that all that counts is

72:02

significance, instead of effect size and good theory. A good theory can predict a very small change.

72:13

[Anton Garret] [Can I add to that that Peter Medawar, who is  

72:16

a Nobel prize winning immunologist, wrote an essay  called "Is The Scientific Paper A Fraud?" I think  

72:25

several decades ago. And he didn't mean that the  results were fraudulent. What he was saying was  

72:32

that a scientific paper as the end product does  not reflect the process by which it's created.]

72:39

Yeah. And that would be very helpful for students: to see the agony that goes into

72:47

writing a paper, the constant changing of the thing. That would help students very much,

72:55

because, seeing only the final product, they think, "Oh, I will never achieve this."
