Gerd Gigerenzer: Mindless Statistics

Transcript

0:05

Today I will talk about a strikingly persistent phenomenon in the social

0:12

and biomedical sciences: mindless statistics.

0:20

Let me begin with a story. Herbert Simon is the only person who has received both

0:27

the Nobel prize in economics, and the Turing award in computer science. The

0:32

two highest distinctions in both disciplines. Shortly before he died Herb sent me a letter

0:42

in which he mentioned what has frustrated him almost more than anything else during

0:52

his scientific career. Significance testing. Now he wrote "the frustration does not lie in

1:05

the statistical tests themselves, but in the stubbornness with which social scientists

1:15

hold to a misapplication that is consistently denounced by professional statisticians."

1:25

Herbert Simon was not alone. The mathematician R. Duncan Luce spoke of mindless hypothesis

1:36

testing in lieu of doing good science. The experimental psychologist Edwin Boring spoke

1:48

of a meaningless ordeal of pedantic calculations. And Paul Meehl, the clinical psychologist and

2:00

former president of the American Psychological Association,

2:04

called significance testing "one of the worst things that have ever happened to psychology."

2:14

What is going on? Why these emotions? What could be wrong with what most

2:24

psychologists, social scientists, and biomedical scientists are doing?

2:33

In this talk I will explain what is going wrong. The institutionalization

2:40

of a statistical ritual instead of good statistics. I will explain what the ritual

2:52

is. I will explain how it fuels the replication crisis, how it creates blind spots in the minds

3:03

of the researchers. And also how it creates a conflict for researchers, young and old,

3:13

between doing good science, and doing everything to get a significant result.

3:22

Let me give an example. A few years ago I gave a lecture on scientific method

3:29

and also on the importance of trust and honesty in science. After I finished,

3:39

in the discussion section, a student from an Ivy League university stood up and told me "You

3:49

can afford to follow the rules of science. I can't. I have to publish and get a job.

4:01

My own advisor tells me to do anything to get a significant result."

4:12

That's known as 'slicing and dicing data' or also 'P-hacking'.

4:22

The student is not to blame. He was honest. But he has to

4:29

go through a ritual that is not in the service of science.

4:37

So let me start with the replication crisis. So every couple of weeks the media proclaims

4:48

the discovery of a new tumor marker that promises personalized diagnosis or even

4:56

treatment of cancer. And medical research, tumor research, is even more productive. Every day four

5:05

to five studies report at least one significant new marker. Nevertheless, despite this mass of

5:21

results, few have been replicated and even fewer have been put into clinical practice.

5:30

When a team of 100 scientists at the biotech company Amgen

5:36

tried to replicate the findings of 53 landmark studies, they succeeded only with six. When the

5:49

pharmaceutical company, Bayer, examined 67 projects on oncology, women's health,

5:59

and cardiovascular medicine, they were able to replicate only 14.

6:06

So what do you do when your doctor prescribes you a drug based on randomized

6:13

trials that showed it to be effective, but then the effect seems to fade away? Now,

6:25

medical research seems to be preoccupied by producing non-reproducible results.

6:33

Iain Chalmers, one of the founders of the Cochrane Collaboration, and Paul Glasziou, chair of

6:41

the International Society for Evidence-Based Health Care, estimated that 85% of medical

6:53

research is avoidably wasted. And they estimated a loss of $170 billion every year worldwide.

7:08

The discovery that too many scientific results

7:15

appear to be false alarms has been baptized the 'Replication Crisis'.

7:28

In recent years a number of researchers, often young researchers,

7:35

have tried to systematically find out how large the problem is. And typically

7:43

the results show that between 1/3 and 2/3 of published findings cannot be replicated.

7:52

And among those that can be replicated, the effect size is, on average, only half as large.

8:02

So in medical research for instance the efficacy of anti-depressants plummeted

8:11

drastically from study to study. And second generation anti-psychotics that

8:20

earned Eli Lilly a fortune, seem to lose their efficacy when retested.

8:32

It's interesting how the scientific community reacted. So what would you

8:38

do if your result that made you famous, disappears? Some researchers like the

8:48

psychologist Jonathan Schooler faced the problem and tried to think about what's

8:57

the reason. And Jonathan came up with the idea of 'cosmic habituation'. In his words

9:10

"it was as if nature gave me this great result and then tried to take it back."

9:21

The New Yorker called this 'The Truth Wears Off' phenomenon. Others reacted,

9:30

so other researchers reacted differently, and were not happy with those who tried

9:40

to replicate their studies and failed, and waged personal attacks on them, speaking of, I quote,

9:50

'replication police,' 'shameless little bullies,' 'witch hunts,' or comparing them to the Stasi.

10:03

So here we are. At the beginning of the 21st century one of the most cited claims

10:11

in the social and biomedical sciences was John Ioannidis's 'Most Scientific Results Are False.'

10:23

In 2017, just to give a hint about the possible political consequences the

10:31

news website breitbart.com headlined a claim by Wharton School professor Scott Armstrong that,

10:43

I quote, "fewer than 1% of papers in scientific journals follow scientific method." End of quote.

10:57

Now we have seen in this country and in other countries, politicians trying to cut

11:05

down funding of research. And if they read more about this, there would be more

11:11

going in this direction. And those who point out that so many results are not replicable,

11:23

they face a double problem. They want to save science, at the same time they run the danger

11:31

that maybe Donald Trump, or someone else, will use this to cut funding down totally.

11:41

So how did we get there?

11:44

The replication crisis has been blamed on false economic incentives, like

11:52

'publish or perish,' and I want to make a point today that we need to go beyond the important role

12:03

of external incentives and focus on an internal problem that fuels the replication crisis. And

12:17

this is the fact that good scientific practice has been replaced by a statistical ritual.

12:25

My point is that researchers follow this ritual not because, or not only because, of external

12:35

pressure. No, they have internalized the ritual and many genuinely believe in it and that can be

12:48

seen most clearly by the delusions they have about the P-value, the product of the ritual.

12:58

So statistical methods are not just applied to a science, they can change the entire science. So

13:08

think about parapsychology, which once was the study of messages from the dear departed, and it

13:22

turned into the study of repetitive card guessing. Because that's what the statistical method demanded.

13:35

In a similar way the social sciences have been changed by the introduction

13:44

of statistical inference. And typically, social scientists first encountered

13:53

Sir Ronald Fisher's theories, in particular his 1935 book. He wrote three books. The first was

14:03

too much about agriculture and manure, and technically too difficult for most

14:10

social scientists. But the second one was just right. And it didn't smell anymore.

14:21

And so they started writing textbooks. And then they became aware of a competing theory

14:28

by the Polish statistician, Jerzy Neyman, and a British statistician, Egon Pearson.

14:38

Fisher had a theory, at least in his null hypothesis testing,

14:44

where he had just one hypothesis; Neyman insisted that you need two. Fisher had the

14:51

P-value computed after the experiment; Neyman and Pearson insisted on specifying everything in advance.

14:59

I'll just give you an idea about the fundamental differences, and I give you an idea about the

15:06

flavor of the controversy. Fisher branded Neyman's theory as 'childish' and 'horrifying for the

15:15

freedom of the West,' and linked Neyman-Pearson theory to Stalin's five-year programs. Also to

15:26

Americans who cannot distinguish or don't want to distinguish between making money and doing

15:33

science. Incidentally Neyman was born in Russia and moved to Berkeley, in the U.S.

15:42

So Neyman, for his part, responded to some of Fisher's tests and said "these

15:51

are in a mathematically specifiable sense, worse than useless." What he meant

15:59

is that the power was smaller than alpha, such as in the famous lady tasting tea test.

16:10

So what do textbook writers do when there are two different ideas about statistical inference?

16:23

One solution would have been, you present both. And maybe also Bayes or Tukey, and others,

16:34

and teach researchers to use their judgment to develop a sense in what situation it's not

16:44

working and where it's better to do this. No, that was not what textbook writers were going for. They

16:53

created a hybrid theory of statistical inference that didn't exist and doesn't exist in statistics

17:01

proper. Taking some parts from Fisher, some parts from Neyman, and adding their own parts,

17:10

mostly about the idea that scientific inference must be without any judgment.

17:18

That's what I mean by mindless... automatic.

17:24

And the essence of this hybrid theory is the null ritual.

17:31

The null ritual has three steps. First, set up a null hypothesis of no mean differences,

17:39

or zero correlation. And most important,

17:44

do not specify your own hypothesis or theory, nor its predictions.

17:53

Second step. Use 5% as a convention for rejecting the null hypothesis. If

18:01

the test is significant, claim victory for your hypothesis,

18:06

that you have never specified. And report the test result as P smaller than 5%,

18:18

or 1%, or 0.1%, whichever level is met by your results.

18:28

And the third step is a unique step. It says always perform this procedure. Period.

18:41

Now neither Fisher, nor to be sure, Neyman-Pearson would have approved of this procedure,

18:50

and Fisher for instance said 'no scientific researcher will ever have the same level of

18:56

significance from experiment to experiment.' He will give it his thought. Neyman and

19:03

Pearson also emphasized the role of judgment. And if the two fighting camps agreed on

19:12

one thing, it was that scientific inference cannot be mechanical. You need to use your brain.

19:21

And that was exactly the message the null ritual did not convey. Namely

19:29

it wanted a mechanical procedure by which we can measure the quality of an article.

19:38

Now what did the poor readers of these textbooks do with a mishmash of two

19:45

theories, where it was not mentioned that it is a mishmash, nor were the names of Neyman

19:54

and Pearson attached to the theories? The result was that the external conflict between

20:04

the two groups of statisticians turned into an internal conflict in the average researcher.

20:13

I use a Freudian analogy to make that clear. So the superego was Neyman-Pearson theory. So the

20:20

average researcher somehow believed that he or she had to have two hypotheses and actually

20:29

give thought to alpha and the power before the experiment and calculate the number of subjects

20:34

needed. But the ego, the Fisherian part, got things done and published, but was left with a

20:45

feeling of guilt of having violated the rules. And at the bottom was the Bayesian id,

20:54

longing for probabilities of hypotheses, which neither of these two theories could deliver.

21:05

How did all this come about? How could this happen?

21:12

I'll give you another story. I once visited a distinguished statistical textbook writer

21:20

whose book went through many editions and whose name doesn't matter. He was one of the only ones,

21:30

actually his book was one of the best ones in the social sciences, and he was the only one

21:36

who had in an early edition, a chapter on Bayes, and also, albeit only one sentence,

21:45

mentioning that there is a theory of Fisher and a different one of Pearson. Neyman-Pearson.

21:56

So to mention the existence of alternative theories was unheard of, and even names

22:05

attached to that. So I asked him why he took out the chapter on Bayes and this one sentence,

22:17

from all further editions. When I met him he was just busy with what, I think,

22:24

was the fifth edition of his bestselling book.

22:29

And I asked him why he had created an inconsistent hybrid that every decent statistician would have

22:42

rejected. To his credit, I should say that he also did not attempt to deny that he had

22:52

produced an illusion. But he let me know whom to blame for it, and there were three culprits.

23:01

First, his fellow researchers. Then,

23:05

the university administration. And third, his publisher.

23:11

His fellow researchers, he said, are not interested in doing good statistics. They

23:17

want their papers published. The university administration promoted people by the number

23:24

of publications, which reinforced the researchers' attitude. And his publisher

23:35

did not want to hear about different theories. He wanted a single recipe cookbook and forced him,

23:47

so the author told me, to take out the Bayesian chapter, and even this single

23:54

sentence about Fisher's and Neyman and Pearson's theories.

24:03

At the end of our conversation I asked him in what statistical theory he himself believed.

24:16

And he said, "Deep in my heart I'm a Bayesian." Now if the author was telling me the truth,

24:30

he had sold his heart for multiple editions of a famous book whose message he did not believe in.

24:42

10,000 students have read this text believing that it reveals the method of science,

24:50

dozens of less informed textbook writers copied his text, churning out a flood

24:59

of inconsistent offspring textbooks, not noticing the mess.

25:08

I have used the term 'ritual' for this procedure, for the essence of the hybrid

25:15

logic, because it resembles social rites. Social rites typically have the following

25:25

elements. There are sacred numbers or colors. Then, there's a repetition of the same action,

25:33

again and again. And then there's fear. Fear about being punished if you don't repeat these

25:41

actions. And finally delusions. You have to have delusions in order to conduct the ritual.

25:51

The null ritual contains all of these features.

25:55

There's a fixation on 5%. And in functional MRI it's colors. Second, there's repetitive behavior

26:10

resembling compulsive handwashing. And third, there's fear of sanctions by editors or advisers.

26:21

And finally, there's a delusion about what a P-value means, and I will describe that in a moment.

26:30

Let me just give you a few examples about the mindless performance of the

26:36

ritual. They may be funny, but deep down it's really disconcerting.

26:46

So in an internet study on implicit theories of moral courage,

26:53

Philip Zimbardo, who is famous for his Stanford prison experiment, and two colleagues,

27:02

asked their participants "do you feel that there is a difference between altruism and heroism?"

27:15

Most felt so. 2,347 respondents said 'yes' and only 58 said 'no.' Now the

27:26

authors computed a Chi-Squared test to answer whether these two numbers

27:32

are the same or different. And they found that they are indeed different.
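
For a sense of scale, the statistic in this example can be worked out directly. The sketch below assumes the authors ran a goodness-of-fit test with one degree of freedom against an even 50/50 split, which the talk does not spell out; the function name is made up for illustration.

```python
import math

def chi_squared_even_split(yes: int, no: int) -> tuple[float, float]:
    """Goodness-of-fit test of two counts against a 50/50 split (df = 1).

    Assumption: this is the kind of test meant in the talk; the original
    paper's exact procedure is not given in the transcript.
    """
    expected = (yes + no) / 2
    chi2 = (yes - expected) ** 2 / expected + (no - expected) ** 2 / expected
    # For df = 1, the chi-squared survival function equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = chi_squared_even_split(2347, 58)
print(f"chi-squared = {chi2:.1f}, p = {p_value:.3g}")
```

With counts this lopsided the p-value is too small to represent in floating point, which is the point of the anecdote: the test answers a question nobody needed to ask.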

27:43

A graduate student of mine, a smart one, had the opposite situation. His name is

27:52

Pogo. Pogo ran an experiment with two groups and found that the two means were exactly the

28:00

same, but Pogo could not just write that or say that. He felt he had to do a statistical test,

28:10

a t-test, to find out whether the two identical numbers differ significantly,

28:16

and he found out they don't. And the P-value was impressively high.

28:26

Here's the third illustration. I recently reviewed an article in which

28:35

the number of subjects was 57. The authors calculated a 95% confidence interval for

28:50

the number of subjects and concluded the confidence interval is between 47 and 67.

29:01

Don't ask why they did it. It's mindless statistics.

29:04

Almost every number in this paper was scrutinized in the same way. The only

29:12

numbers in the paper that had no confidence were the page numbers.

29:21

This is an extreme case but unfortunately it's not

29:26

the exception. Consider all behavioral, neuropsychological, and medical studies

29:32

in Nature in the year 2011. 89% of the studies report only P-values and

29:44

nothing else that's of importance. Such as no effect sizes. No power. No model estimates.

29:53

Or an analysis of the Academy of Management Journal,

29:59

a year later, reported that the average number of P-values in an article is...

30:07

guess how many P-values. If you have two hypotheses you would need two. No. 99.

30:19

Yeah it's a mechanical testing of any number. So the idol of automatic universal inference,

30:30

however is not unique to P-values, or confidence intervals. Dennis Lindley, a leading advocate of

30:40

Bayesian statistics, once declared that the only good statistic is Bayesian statistics and that

30:50

Bayesian methods are even more automatic, in his opinion, than Fisher's own methods.

31:00

So the danger is here. You don't have much progress if you use Bayes factors

31:06

just as mindlessly. So let me go on. The examples about mindless use sound funny,

31:15

but there are deep costs of the ritual. I'll give you a few examples.

31:20

Maybe the first and quite interesting one is that you

31:23

actually fare better if you don't specify your own hypothesis. Why?

31:32

Paul Meehl once pointed out a methodological paradox in physics. Improvement of experimental

31:43

measurement and amount of data, make it harder for a theory to pass. Because you

31:49

can more easily distinguish between the prediction of the theory and the

31:54

actual data. In fields that rely on the null ritual it's the opposite.

32:04

So, because it tests a null in which you don't believe. So improvements

32:14

of measurement make it easier to detect the difference between the data and the null,

32:22

and that means it is easier to reject the null, and you can claim victory for your hypothesis. And

32:31

you can just imagine that this is another factor that leads to the irreplicability of results.
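
The effect of sample size on a nil null hypothesis can be illustrated with a rough sketch. It assumes a two-sample z-test with unit variance and an observed standardized difference that stays fixed at a tiny value while the sample grows; the effect size 0.02 and the function name are hypothetical choices, not figures from the talk.

```python
from statistics import NormalDist

def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test (unit variance assumed)
    when the observed standardized mean difference equals effect_size."""
    z = effect_size * (n_per_group / 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same tiny difference, ever larger samples: the null eventually falls.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(0.02, n))
```

Under these assumptions, more data makes it easier to reject the null even when the difference is practically negligible, whereas a theory that predicts a specific value would face a harder test.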

32:40

Second point, and I now get to the delusions about this sacred object,

32:50

the P-value. Now a P-value is the probability of a result, or a more extreme one, if the null

33:01

hypothesis is correct. And more technically correct, it is the probability of a test

33:07

statistic given an entire model. But the point is it's a probability of data given hypothesis,

33:20

not the Bayesian probability. And that should be easy to understand for any academic researcher.
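
In symbols, the distinction reads as follows. The notation is an editorial assumption, not from the talk: T is the test statistic, t_obs its observed value, and H_0 the null hypothesis (for a one-sided test).

```latex
p = P\bigl(T \ge t_{\mathrm{obs}} \mid H_0\bigr)
\qquad\text{which is not}\qquad
P\bigl(H_0 \mid \text{data}\bigr)
```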

33:31

And also it's not the probability that you will be able to replicate it. And

33:41

the replication delusion, that's the first one, is that when you have a P-value of 1%,

33:52

then logically it follows that the probability you can replicate your result is 99%. Clear?

34:00

So this illusion was already taught in the book by Nunnally that Greg pointed out. It's

34:10

great reading. If you want to really have fun, read the old textbooks about statistics.

34:17

None of them were written by statisticians; otherwise they wouldn't have been used.

34:24

So for instance Nunnally writes, quote, "what does a P-value of 5% mean?" His answer:

34:34

"The investigator can be confident with odds of 95 out of 100 that the observed difference will

34:42

hold up in further investigations." That's the replicability delusion.
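
A back-of-the-envelope sketch shows why this cannot be right. It makes a deliberately optimistic assumption that the talk does not state: the true effect is exactly as large as the one originally observed, and the replication is a z-test with the same sample size. Even then, the chance of a significant replication in the same direction is the power of the second study, not one minus the P-value. The function name is hypothetical.

```python
from statistics import NormalDist

def replication_probability(p_original: float, alpha: float = 0.05) -> float:
    """Chance that an identical replication is significant (two-sided alpha,
    same direction), assuming the true effect equals the observed one."""
    nd = NormalDist()
    z_observed = nd.inv_cdf(1 - p_original / 2)  # z-score behind the original p
    z_critical = nd.inv_cdf(1 - alpha / 2)       # threshold for the replication
    return 1 - nd.cdf(z_critical - z_observed)

print(replication_probability(0.05))  # original result just significant
print(replication_probability(0.01))  # original P-value of 1%
```

Under these assumptions a result with P = 5% replicates about half the time, and one with P = 1% roughly three times out of four, far from the 95% or 99% that the delusion promises.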

34:51

So I was curious, what is the state today?

35:00

Do academic researchers understand what a P-value means? That is, the object they are looking for.

35:12

So I surveyed all studies available in six different countries with a total of over

35:20

800 academic psychologists, and about a thousand students, and they were all asked "what does the

35:27

P-value of 1% mean?" How many of these professors and students cherish the replicability illusion?

35:42

So in this example there were 115 persons who taught statistics. But you should know that

35:54

in the social sciences, and certainly in psychology, those who teach statistics mostly

36:00

are not statisticians. For the same reason, because they would notice what's going on.

36:07

So what do you think? What proportion of 115 statistics teachers, that's across

36:15

six countries, fall prey to the illusion of replicability? It should be zero. It is 20%.

36:26

Then we have looked at the professors, and that's over 700 in this study. In

36:32

the professors it's 39% who believe in the replicability illusion. So almost double.

36:42

And among the poor students it is 66%. They have inherited the delusion.

36:52

And note that this is another reason why the replicability crisis was not

37:00

noticed until recently: because of this illusion. I have a result,

37:06

it has a P-value of 1%, so I can be almost sure it can be replicated.

37:13

Now it's not the only delusion that is shared by the academic researchers. So the next delusion is

37:25

that you think that the P-value, which is the probability of data given a hypothesis, tells you about the

37:33

probability of the hypothesis given the data. And the majority of academic psychologists in

37:44

six countries, and in every study, share at least one, and typically several, of these illusions.

37:55

Including that the P-value of 1% tells you that the probability that the null hypothesis

38:02

is true is also 1%, or that the alternative hypothesis is true is 99%. And so it goes on.

38:13

This is a remarkable state of the art of doing science. Remarkable because

38:23

every one of these academic researchers understands that the probability of A, given B,

38:29

is not the probability of B, given A. But within the ritual thinking is blocked.
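
A small Bayes' rule sketch makes the gap between the two conditional probabilities concrete. All numbers are hypothetical: the prior probability that the null is true and the power of the test are assumptions chosen for illustration, and the calculation conditions on "significant at level alpha" rather than on an exact P-value.

```python
def prob_null_given_significant(prior_null: float, alpha: float, power: float) -> float:
    """P(H0 | significant result) via Bayes' rule.

    prior_null: assumed prior probability that the null is true
    alpha:      P(significant | H0)
    power:      P(significant | H0 is false)
    """
    false_alarms = alpha * prior_null
    true_hits = power * (1 - prior_null)
    return false_alarms / (false_alarms + true_hits)

# Same alpha of 1%, different assumed priors and power.
print(prob_null_given_significant(prior_null=0.5, alpha=0.01, power=0.5))
print(prob_null_given_significant(prior_null=0.9, alpha=0.01, power=0.3))
```

The probability of the null given a significant result moves with the prior and the power, so it cannot simply be read off the 1% level.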

38:38

So another blind spot and cost is, obviously, effect size. There is no effect size in the

38:45

ritual. There is an effect size in Neyman and Pearson, yes. But that's not being told.

38:52

McCloskey and Ziliak asked the question "do economists distinguish between significance,

39:04

that is, statistical significance, and economic significance?"

39:11

And they looked at the papers in one of the top journals, The American Economic Review,

39:18

and of 182 papers, 70% did not make a distinction. And what Ziliak and McCloskey did,

39:32

they published the names of those who got most confused, including a number of Nobel laureates.

39:42

10 years later they repeated that, assuming that everyone must have read that and people are now

39:49

more reasonable. But the 70% who confused the two did not go down; it went up to 82%.

40:00

Similarly, there's a blind spot for statistical power. There is

40:04

no power in the null ritual. And power means the probability that

40:10

you find an effect if there is one. And that should be 80%, 90%. Better higher.

40:19

The psychologist Jacob Cohen was the first one who systematically studied the power of

40:25

a major clinical journal, and he found that the average power for a medium effect was 46%. That

40:38

wasn't much. Now, Peter Sedlmeier and I, 25 years later, which should be a time that things change,

40:51

looked and analyzed the power in the same journal: Before it was 46, it went down

41:00

to 37. Why? Because many researchers now did Alpha adjustment which decreases the power.
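
How alpha adjustment costs power can be sketched with a normal approximation. The sketch assumes a two-sample z-test with unit variance, a medium effect of 0.5 standard deviations, 30 subjects per group, and a Bonferroni-style adjustment for ten tests; these numbers are illustrative assumptions, not the figures from the journal analyses mentioned in the talk.

```python
from statistics import NormalDist

def power_two_sample(effect_size: float, n_per_group: int, alpha: float) -> float:
    """Approximate power of a two-sided, two-sample z-test (unit variance)."""
    nd = NormalDist()
    z_critical = nd.inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5  # expected z under the effect
    # Probability of landing beyond either critical value.
    return (1 - nd.cdf(z_critical - shift)) + nd.cdf(-z_critical - shift)

unadjusted = power_two_sample(0.5, 30, alpha=0.05)
adjusted = power_two_sample(0.5, 30, alpha=0.05 / 10)  # adjusted for ten tests
print(f"power at alpha = 0.05:  {unadjusted:.2f}")
print(f"power at alpha = 0.005: {adjusted:.2f}")
```

With the sample size held fixed, a stricter alpha can only lower the power, which is the mechanism named above.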

41:12

And notice what that means. If you set up an experiment that has only a power of say 30%,

41:31

to detect an effect, if there's one, you could do better and much

41:39

more cheaply by tossing a coin, and you would have a power of 50%. Clear?

41:52

And you could spare all this effort, and what I've told you now are even the better

42:02

results. Studies in Neuroscience, so for instance, studies about Alzheimer's disease,

42:10

genetics, cancer biomarkers, the median power of more than 700 studies is 21%.

42:20

In functional MRI studies only 8%. And a recent study that has looked at 368 research areas in

42:34

economics and analyzed the 31 leading journals, found a median statistical power, again calculated

42:46

for an alpha of 5% and a median effect size in the area, what would you guess? Economists, 7%.

42:58

Then they looked on the top five, so economics has a hierarchy,

43:03

the top five. What's there? What do you think? 7% in the hoi polloi. In the top five only 5%.

43:16

Low statistical power is another reason for failure for replication,

43:21

and the interesting thing is it's not being noticed.

43:25

A recent study by Paul Smaldino and Richard McElreath looked at 60 years of

43:33

power research in the behavioral sciences. They took every study that has referenced our study,

43:44

the one with Peter Sedlmeier, in order to get

43:52

a large number of studies, and they found consistently low power

43:59

and it's not progressing. And one of the reasons is blindness in the null ritual.

44:09

Let me get to a final point about the costs. It is the moral problem. So science is based

44:18

on trust and the honesty of researchers. Otherwise we can't do this. And this

44:27

statistical ritual creates a conflict between following the scientific morals, or trying to

44:39

do everything to get a significant result, even if it's a false one.

44:47

And that's called 'p-hacking'. That's called 'borderline cheating.' Borderline cheating

44:54

because you don't really invent your data but you slice and you calculate maybe in this slicing,

45:04

or in that slicing, until you find something. And borderline cheating includes: You do not

45:11

report all the studies you have run but only the one where it's significant. You do not report all

45:17

the dependent measures you have looked at but only the significant ones. You do not report all the independent

45:23

measures, and maybe, if your P-value ends up at 5.4%, you round it down to slightly under five.
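
Why selective reporting produces false positives can be seen from a one-line calculation. The sketch assumes that all tested nulls are true and that the tests are independent, which real dependent measures rarely are; the function name is hypothetical.

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent tests is significant
    at level alpha when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

# One honest test versus trying 5 or 20 measures and reporting the winner.
for n in (1, 5, 20):
    print(n, round(chance_of_false_positive(n), 3))
```

Under these assumptions, looking at twenty measures and reporting only the significant one gives a better than even chance of a "finding" where nothing is there.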

45:36

And if you analyze the distribution of P-values, you see exactly that:

45:41

values slightly above 5% are missing, and there are too many slightly below.

45:48

So a study by John, Loewenstein, and Prelec,

45:51

with over 2,000 academic psychologists found that the vast majority admitted

45:59

that they have done at least one of these questionable research practices that amount

46:06

to cheating. And when they were asked whether their peers do it, the numbers were even higher.

46:18

So let me come to the end, and ask what can we do about all this?

46:27

And the simplest answer would be we need to start, or we need to foster

46:36

statistical thinking, not rituals. In response to the crisis, proposals have been made. The

46:45

American Statistical Association has made a number of statements which I think were not very helpful.

46:57

A group of 50 researchers, all luminaries,

47:00

has recommended a solution, namely to change P equal or smaller than 5% to P

47:10

smaller than 0.5%. That makes it harder but it doesn't even address the problem.

47:22

What will happen? There will be more intensive

47:27

P-hacking because you have to work harder to get this. Right?

47:33

I think we have to make more fundamental changes and I can here only sketch that.

47:40

First we need to finally realize that we should test our own hypothesis, not a null.

47:50

Second, we need to realize that the business is about minimizing the real error in our

47:58

measurement. Not taking the error and dividing it by the square root of N.
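
The quantity alluded to here is presumably the standard error of the mean, written below with s for the sample standard deviation and N for the number of observations (notation assumed, not from the talk). It can be shrunk either by reducing the real error s through better measurement or simply by increasing N, and only the first improves the measurement itself.

```latex
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{N}}
```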

48:08

So this is a key disease and here's another story. I once was a visiting scholar at Harvard and it

48:18

happened that I had my room, my office, next to B.F. Skinner's office. So B.F. Skinner was

48:26

once the most well-known and controversial psychologist. And at that time he was quite old,

48:35

and his star was going down because of criticism. And he felt a little bit lonely,

48:42

that was my impression, so we had lots of time to talk.

48:46

And I asked him about his attitude to statistical testing. It turned out that he had obviously no

48:55

recognizable training in statistics but he had a good intuition. He admitted

49:03

that he once tried to run 24 rats at the same time. He said "it doesn't work because

49:11

you can't keep them at the same level of deprivation and you increase the error."

49:18

And he had the right intuition. It's the same intuition as Gosset,

49:25

the man who, under the pseudonym Student, developed the t-test and said "a significant

49:31

result by itself is useless. You need to minimize the real error."

49:37

Skinner told me the story of when he gave an address to the American Psychological Association:

49:47

after having reported about one rat, he said "according to

49:54

the new rules of the Society I will now report on the second rat."

50:02

So he understood that part.

50:07

And a third move, besides really taking care of what you measure,

50:14

and in many physics experiments weeks and months are spent on trying to get clear measurements. In

50:22

the social sciences it's often Amazon Mechanical Turk workers, who somehow answer questions for

50:32

little money, in short time, and they do best if they don't really pay attention.

50:39

So the third point would be remember that statistics is a toolbox. There is no single

50:48

statistical inference method that is the best in every situation. One often needs to tell

50:56

this to my dear Bayesian friends. Bayes is a great system, but it also doesn't help you everywhere.

51:06

And universities should start teaching the toolbox and not a ritual.

51:13

And editors are very important in this business. They should make

51:19

a distinction between research that finds a hypothesis and one that tests the hypothesis.

51:27

So that young scientists don't have to cheat anymore and pretend they

51:32

would have the hypothesis that they got after the data, already before.

51:37

Second, editors should require when inferences are made to state the population to which the

51:46

inference refers. People or situations. And in many applications of statistics there is no population.

51:54

There is no random sample. Why do we do the inference, and to what population? Unclear.

52:01

Third, editors should require competitive testing and not null hypothesis testing.

52:10

And finally, in my opinion one important signal would be that editors should no longer accept a

52:22

manuscript that reports results as significant or not significant. There's no point to make

52:31

this division, and it's exactly the problem that then people try to cheat, or fail to have this.

52:39

If you want to report P-values, fine, but report them as exact P-values. That

52:44

was what Fisher in his third book, in the 1950s, always said. Fisher rejected the

52:51

idea of having a fixed criterion, because that is what he meant by five-year plans.

53:01

And at the end, I want to put this in a larger context. The null ritual is

53:09

part of a larger structural problem that we have in the sciences. And

53:17

the problem is that quality is more and more being replaced by quantity.

53:24

As the Nobel laureate in physics Peter Higgs said:

53:33

"Today," he said, "I wouldn't get an academic job. It's as simple as

53:39

that. I don't think I would be regarded as productive enough."

53:46

And we have come into an understanding of science, that science means producing as many

53:55

papers as possible. That means you have less time to think. And thinking is hard. Writing is easy.

54:07

And one driver of this change from quality to quantity, are the university

54:18

administrators who count rather than read when deciding on promotion or tenure.

54:30

A second driver is the scientific publication industry. Sorry, the scientific publishing

54:39

industry, that misuses the infinite capacity of online publication to get researchers

54:50

to publish more and more, in more and more special issues, and in more and more journals.

54:58

This development towards quantity instead of quality is further fueled

55:07

by the so-called predatory journals that emerged in the last 20 years. Predatory

55:14

journals are journals that exist obviously only for collecting a few thousand bucks from you,

55:21

for publishing your paper with no noticeable review system. We know cases where reviewers,

55:30

these are often serious scientists who somehow do not notice what's going on, and then reject

55:37

the paper, and they were told clearly that it's not in the interest of the publishing company.

55:46

And most recently we face a new problem. Namely the industry-like

55:56

systematic production of fake articles with the use of AI by so-called paper mills.

56:08

A paper mill, so assume you are, and it's mostly in the biomedical sciences, in genetics.

56:18

Assume you work in a hospital, you're a doctor, and you need an article in a good journal to be

56:28

promoted. And somehow it doesn't work. So a paper mill offers you, it's called

56:37

assistant services. A paper mill offers, for $10,000 or $20,000, to write an article that's

56:46

actually faked from the beginning to the end, including maybe western blots that are faked,

56:54

and they can guarantee significance, and they can guarantee publication.

57:04

And why? Because more and more in the last years they bribe editors of journals to publish the

57:14

papers they send them, and pay them. Like a colleague of mine who is an editor of a medical

57:21

journal: he got an offer from a paper mill in China that for every article you publish we pay

57:30

you $11,000 multiplied by your impact factor. And we will help you to increase your impact factor.

57:41

So, talking about broken science. That's the future. So let me finish. I think the larger

57:51

goals are that scientific organizations like the Royal Society of London should take

57:59

back control of publishing. Out of the hands of commercial for-profit publishers. That can

58:09

be done. For instance it happened last year. The top journal in neuroscience, NeuroImage,

58:19

42 editors of NeuroImage stepped down. Resigned because of what they called

58:26

'greed of Elsevier,' and founded a new journal called Imaging Neuroscience. And made a call

58:34

to the entire scientific community, submit your papers to the nonprofit journals and

58:41

no longer support the exploitation of reviewers, of writers, of editors, by

58:49

these big publishing companies who in some years make more profit than the pharmaceutical industry.

58:58

So that's one way. And a second important conclusion is universities need to be restored

59:07

as intellectual institutions, rather than run as if they were corporations.

59:15

We need to publish fewer articles and better science.

59:21

We need statistical thinking. Not rituals.

59:27

Thank you for your attention.

59:34

[Applause]

59:41

Thank you so much.

59:49

Questions?

59:49

[Haavi Morreim] [I've got one quick.

59:51

You've outlined this, the myriad of factors contributing to the replication crisis. Could I

59:58

pile on by adding one more? Grant giving entities, whether government or private,

60:05

aren't interested in doing this. They're not paying you to do what somebody else already did.

60:12

What's already been shown. They want you to plow new ground. Show us something new. Whatever.

60:17

There's no money in replicating. Or unless there is some specific reason to question what's,

60:24

quote unquote, "already been shown." And I live inside an academic medical

60:30

Center and you can publish all you want but if you didn't bring cash,

60:35

okay, grants and overhead and all that kind of thing. I love your ideas. Who's

60:41

going to pay for it? The financing system needs to be a part of the resolution.]

60:47

Yeah you're totally right. And there are more factors. And what we can do is from below,

60:55

from the ground, and stand up as scientists and use our own values to change the system. And the

61:07

system had worked some time before. And we are in danger of letting it go into the hands of commerce.

61:20

[Jay Couey] [Yeah,

61:21

I thought what you said about the two different kinds of science,

61:28

like whether you're testing a hypothesis or defining it, I think neuroscience is very

61:34

much trapped in this idea that they're testing a hypothesis yet they're still trying to formulate one.

61:41

And their null hypothesis, not they don't understand. That's the problem. I think

61:47

something you can add, the null hypothesis of your assumption, or it might be this but it's not.]

61:57

Yeah, you're totally right. And the response to what you're saying is often "yeah,

62:03

it's too difficult. We cannot make precise predictions."

62:08

My response to that response is, it's not too difficult, but the reliance of neuroscience is

62:16

mostly on the null ritual. That invites you not to think about theory. Real theory. And about precise

62:27

hypotheses. And then you just continue this state of ignorance. But you get your papers published.

62:40

[Malcolm Kendrick] [Can I just make one point,

62:42

which is in the medical world if you do a study on say statin versus a placebo,

62:48

and you prove statistical whatever, rhubarb, you're never allowed to do that study again, ever,

62:55

because you proved that the drug is better than placebo. You can never have another placebo arm,

63:00

ever. And I've been thinking about this. I don't know what the answer is but it means that once

63:04

you've proven your drug works you can never do any replication or reproducing of that study ever again.]

63:14

Are we getting together the pieces for Broken Science? Here's another one.

63:21

[Peter Coles] [Yeah, I just I mean I agree

63:23

with everything you said about the confusion in academics about whether they're talking about the

63:30

probability of the data given the model, or the probability of the model given the data. So given

63:35

the level of confusion that exists within the research community it's not surprising that when

63:41

it comes to the public understanding of science it's a complete nightmare, because non-specialist

63:47

journalists garble the meaning of the results even more than the academics do, probably.

63:54

So we end up with very misleading press coverage about what results

63:59

actually mean— what has been discovered, what has not been discovered—and that's

64:03

really damaging for public trust in science, and also obviously, has implications for you know,

64:10

kind of political influence on science as well. If they see the public trust in

64:15

science disappearing, and it's all part of this problem that scientists and the

64:25

media do not communicate their ideas clearly enough for people to understand.

64:34

What I would say just to conclude is that very, very often in science the only reasonable answer

64:40

to a journalist's question is "I don't know," because in many, many situations we really don't

64:47

know the answer. But you don't get on in your career if you keep answering questions like that,

64:51

even though it's true that you don't really know the answer. The journalists will push you into

64:56

saying "oh yes, I proved this is true because of my P-values." So there's a much wider...]
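
The distinction raised in this question, between the probability of the data given the model and the probability of the model given the data, can be made concrete with Bayes' rule. The following is a minimal sketch and not something shown in the talk; the function name, the prior, and the two likelihoods are made-up assumptions chosen only for illustration.

```python
def posterior_h0(prior_h0: float, p_data_given_h0: float, p_data_given_h1: float) -> float:
    """P(H0 | data) by Bayes' rule for two competing hypotheses H0 and H1."""
    prior_h1 = 1.0 - prior_h0
    evidence = prior_h0 * p_data_given_h0 + prior_h1 * p_data_given_h1
    return prior_h0 * p_data_given_h0 / evidence

# Hypothetical numbers: the data are unlikely under H0 (0.05),
# but only somewhat more likely under the alternative H1 (0.10).
post = posterior_h0(prior_h0=0.5, p_data_given_h0=0.05, p_data_given_h1=0.10)
print(f"P(data | H0) = 0.05, P(H0 | data) = {post:.3f}")
```

With these assumed inputs the two quantities differ widely, which is why a small probability of the data under the null cannot simply be read as the probability that the null is true.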

65:05

One solution to this would be systematic programs to train journalists. A few of

65:11

them exist. I have been participating in them, but it really needs to be on a broader basis,

65:17

and it's becoming more and more difficult because journalism is

65:21

more... less investigative. But more from one day to another one.

65:27

[Peter Coles] [Well the main media

65:28

outlets are sacking journalists essentially.]

65:32

Competition through social media is tough. And at the end we may at

65:40

some point face a situation where it's no longer clear what is truth and what

65:44

is fake. And we need to be prepared, and we need to prepare the public for that.

65:53

[Yeah okay. I do have a question. So you talked about the idea of we need to have

65:59

fewer papers. Fewer better papers. And just looking at the idea of there's so

66:07

many papers that are not reproducible, to me suggests that there's actually no

66:12

science there. Okay. So is there enough new science to write enough papers to go around?]

66:24

Yes. The moment you stop: Predatory journals - they need papers. And also you stop publishers

66:34

like Frontiers, who between 2019 and 2021 doubled the number of papers they put out.

66:44

So your question "are there enough papers for the

66:51

for-profit publishers that we have now?" For most of them there are never enough papers.

66:56

[Yes. So I'm actually saying a slightly different thing.]

66:58

Yeah. [If I'm

66:59

a researcher are there actually enough ideas to write enough papers to like do anything useful?]

67:09

Yeah that's a good question. Fair question. One answer would be if

67:13

you have no ideas you shouldn't be a researcher. Do something different.

67:19

And also we need to see... so university departments need to be careful in

67:25

hiring but let people have time to really develop some ideas, and take the risk. That

67:34

there are few. But we don't profit from the mass production of average things. Yeah. Yeah.

67:41

[Following up on his question. I had a teacher in graduate school who had a

67:47

prophylactic solution or contribution to this debate. He thought that instead of encouraging

67:55

teachers to publish, we should discourage them,

67:59

or at least we should give them an incentive to think very carefully before they published. He

68:06

proposed docking them $1,000 for every article they published and $5,000 for every book.]

68:13

[Of course that was quite a while ago so you would have to increase those

68:16

numbers to account for inflation. I think there's something to be said for that.]

68:21

So students would have to pay?

68:24

[So the teacher would have to pay. In other words it would encourage him to

68:28

really think he had something to say. So that it would be worth $1,000 to publish this

68:33

article or $5,000 to publish the book. He the teacher... his salary let's say would go down.]

68:42

[Graciano Rubio] [Would you say that

68:46

there's value in using the P-values as a way to allocate resources for studies that deserve to

68:53

be repeated and away from using those resources on new research and publishing more papers? So in

69:01

regards to the charity and foundation for the philanthropic events, there's always limited

69:07

resources. You only have so much time and money. So if we accept that the P-value is not sufficient

69:14

for validation could you use the P-value as a way to say these are the studies which deserve

69:20

to be repeated. So that we can find results that are predictable and away from

69:25

going out and trying to find something new and just focusing on the volume of papers.]

69:30

Yeah. So I understand your question: there's a conflict between replication and finding

69:39

something new. Yeah. Certainly yes. But I think there are enough researchers who might focus on

69:49

replication. For instance —it could be an answer to your question— those who think

69:53

that at least at the moment have no new ideas... do the replication. Police. And others to find

70:02

new ideas. But new ideas are hard to come by. We need to be patient with them. And most of us,

70:11

if you have one great idea in your life that's already above average.

70:21

[Peter Coles] [Am I allowed a second one?

70:24

It occurred to me a long time ago that part of the pathology of the academic system is actually the

70:32

paper itself. The idea of a paper. Science has become synonymous with writing papers. Whereas

70:41

if you look at the tech, the world we have now, digital publication. We don't have to write these

70:47

tiny little quanta of papers and get publications. You don't communicate science effectively that way

70:53

anymore. It's a kind of 18th century idea that you communicate by these written papers. You

70:59

could have living documents for example, which are gradually updated as you repeat. But of course the

71:06

current system does not allow a graduate student to get promoted, to get advanced in that way.

71:13

But I think by focusing entirely on papers we're really corrupting the system as well. It's not so

71:20

much an issue about who publishes them, although it's a serious one as you said. It's the fact that

71:27

we're fixed in this mindset that we have these little articles that we have to communicate only

71:33

by these articles. And that's forcing science into boxes which is not really helpful.]

71:38

We can talk about this over lunch, but I'm still with papers. I think papers,

71:46

and also books, or patents, depending on the realm, are still good. But the number

71:52

of the papers is the problem. And also this edifice that you have, that all that counts is

72:02

significance instead of effect size. Good theory. A good theory can predict a very small change.

72:13

[Anton Garret] [Can I add to that that Peter Medawar, who is

72:16

a Nobel prize winning immunologist, wrote an essay called "Is The Scientific Paper A Fraud?" I think

72:25

several decades ago. And he didn't mean that the results were fraudulent. What he was saying was

72:32

that a scientific paper as the end product does not reflect the process by which it's created.]

72:39

Yeah. Yeah. And that would be very helpful for students. To see the agony that goes into

72:47

writing a paper, the ever, ever changing of the thing. And that would help students very much,

72:55

because they think "Oh, I never will achieve this." The final product.

Interactive Summary

The speaker addresses the persistent issue of "mindless statistics" and significance testing in social and biomedical sciences, which has led to a widespread "replication crisis." Many scientific findings, often 1/3 to 2/3, cannot be reproduced, resulting in an estimated annual waste of $170 billion in medical research. This crisis is fueled by external incentives like "publish or perish" and an internal problem: the institutionalization of a statistical ritual. This ritual, a hybrid theory created by textbook writers, promotes mechanical inference over scientific judgment. It leads to common delusions about P-values and blind spots regarding effect size and statistical power, contributing to practices like "P-hacking" and undermining research integrity. The speaker argues for fundamental changes, including fostering statistical thinking, testing specific hypotheses, minimizing real error, and using a statistical toolbox. Furthermore, larger structural problems like universities prioritizing publication quantity over quality, the greed of the scientific publishing industry, and the rise of predatory journals and paper mills are contributing to "broken science." The solution involves scientific organizations reclaiming control of publishing and universities being restored as intellectual institutions focused on quality over quantity.