New developments in scientific thought and hypothesis testing are changing the ways we think about truth and certainty.
Not only scientific decisions but also social life and everyday decisions rely heavily on the tools and procedures of quantitative analysis. I want to start with a story from the judicial system about the misuse of an important quantitative tool of the scientific community, Hypothesis Testing. It concerns the overturned Amanda Knox case in Italy. In 2007, she was accused of killing her roommate and sent to jail. A forensic data analyst, who examined data on incidents of crime, reported the odds of a DNA match in the case of an accidental death as incredibly small and concluded that, statistically, the death was not an accident. The judges then based their decision on this rareness, expressed as a single value called the p-value. Later, the decision was overturned (New York Times, March 27, 2013) because the judges and lawyers had misinterpreted this small probability. What led the judicial system to make its decision based on a single value alone? Honestly, this is the story of the century and a story of Hypothesis Testing. This article introduces the logic behind the decision, draws attention to the misuse of data, and mentions alternatives to the traditional approach (in March 2013, Italy's Supreme Court ordered a retrial, which found the two suspects guilty on January 30, 2014).
We live in a data-driven world. Statistical inference is the process of drawing conclusions or making decisions from data. Hypothesis Testing in the Neyman-Pearson paradigm is one procedure of this process, and it has been widely used in the scientific community for the better part of a century. In this mindset, two complementary hypotheses are first defined: one is the conventional thesis (the null hypothesis), accepted as true by default; the other is the alternative thesis (the alternative hypothesis), which needs evidence from data to falsify the null hypothesis. The test results in a single value, called the p-value, which is calculated from the data under the assumptions of a theoretical model. A p-value is a measure - in the probability sense, ranging from 0% to 100% - of how much evidence you have against the null hypothesis: it is the probability of obtaining data at least as extreme as those observed if the null hypothesis were true. The smaller the p-value, the more evidence you have against the null hypothesis. One may compare the p-value with a significance level to make a decision on a given hypothesis. This is also called significance testing. It has an accept-reject mindset (a dichotomous decision, or binary logic): if the p-value is less than some threshold (usually .05), the null hypothesis is rejected. Basically, it is a black-or-white decision that ignores the shades in between. This interpretation was widely accepted in twentieth-century scientific research, and many scientific journals routinely publish papers that interpret the results of hypothesis tests this way, even though there is a current tendency to move away from it. Let's take a look at the history of this black-white logic.
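Before the history, the accept-reject rule just described can be made concrete with a small sketch. The coin-flip data below are hypothetical and are not taken from any study mentioned in this article; the function names are also my own. The null hypothesis is that the coin is fair, and the p-value is computed as the chance of seeing at least as many heads as observed if that were true (a one-sided exact binomial test).

```python
from math import comb


def binomial_p_value(successes, trials, null_prob=0.5):
    """One-sided exact p-value: P(X >= successes) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * null_prob**k * (1 - null_prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def decide(p_value, alpha=0.05):
    """Binary (black-or-white) decision at significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Hypothetical data: two experiments of 20 flips each.
p_strong = binomial_p_value(16, 20)  # 16 heads, p-value roughly 0.006
p_weak = binomial_p_value(13, 20)    # 13 heads, p-value roughly 0.13

for heads, p in ((16, p_strong), (13, p_weak)):
    print(f"{heads} heads out of 20: p = {p:.4f} -> {decide(p)}")
```

Note that the second outcome is "fail to reject," not "accept": the rule never declares the null hypothesis proven. Note also how the two p-values, which differ only in degree, are forced into two opposite verdicts by the threshold.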
The history of hypothesis testing starts with Fisher (1890-1962) in the early twentieth century. Later, the contributions of Pearson (1895-1980) and Neyman (1894-1981), who were early fathers of statistics, to the interpretation of hypothesis tests were integrated into present-day applications. Fisher's approach focuses on inductive inference about a single hypothesis, whereas the Neyman-Pearson approach informs future behavior based on a test using two complementary hypotheses (Newman 2007).
These approaches are strongly influenced by Popper's logic of falsification. Falsification can be defined as the act of disproving a proposition, hypothesis, or theory. This logic asserts that sufficiently improbable events can be considered impossible. A p-value suggests whether a null hypothesis is sufficiently improbable to be considered practically falsified in the sense of a logical refutation (Newman 2007). Coming back to the Amanda Knox case, the decision the judges made and justified follows from this falsification logic in hypothesis testing. Another criticism of this mindset is: how fair is it to rely on a single value that yields a rare result? Are we going to generalize from one lucky draw of a raffle ticket to the belief that all tickets are winners?
We commit logical flaws and fall into biases in our daily life. Most readers are not interested in finding the logical flaws in scientific papers; instead, the results are taken as fact. One of the most common biases is seeking or interpreting evidence in ways that are partial to existing beliefs or to a hypothesis in hand (Mercier and Sperber, 2011). We tend to prove or show the accessibility or superiority of arguments by bringing evidence. Failure to prove that a treatment - say, a low-fat diet - is effective is not the same as proving it is ineffective.
Let me give an example to clarify this. A New York Times editorial reported on February 9, 2006, "Millions of Americans have tried to reduce the fat in their diets, and the food industry has obligingly served up low-fat products. Yet now comes strong evidence that the war against all fats was mostly in vain." In the hypothesis testing mindset, the evidence collected from research should either reject the null hypothesis, which is "the diet is ineffective," or fail to reject it. Testing is not about finding evidence to support the null statement. It is also not about proving the alternative hypothesis, "the diet is effective." Rather, it looks for evidence to falsify the null statement. In the report, the "evidence" the reporter refers to should be evidence against the null hypothesis, not against the alternative hypothesis. Data are collected to falsify the null hypothesis, not to prove its truth, because the null is already accepted as true unless convincing evidence is collected against it.
Here is another example of this paradigm. As Newfoundland (2013) described, a person is innocent until proven guilty by convincing evidence. The jury cannot say "he is innocent"; instead, the jury rejects his innocence only when the evidence is convincing beyond reasonable doubt.
The accept-reject paradigm in inference represents conventional wisdom. There are other paradigms in inference, too, some of them still developing. One of the dominant ones is the Bayesian-likelihood approach. Having gained much wider acceptance in the social and natural sciences, the Bayesian method offers a different view of hypothesis testing. The Bayesian approach is a method of data analysis in which subjectivity, conditionality, or past information is used to update belief in a hypothesis as additional evidence is acquired. In hypothesis testing, this approach offers a dynamic structure for probability calculation, so hypotheses and decisions are not seen as static truths; instead, they are updated with current data, and the decision is stated in terms of likelihood.
In our courtroom example, Bayesian inference would be applied to all the evidence presented, with past information combined with the current evidence. The benefit of a Bayesian approach is that it takes all historical information into account, so the decision stays unbiased as long as the past data (prior knowledge) are used correctly. This paradigm changes the way statistics are calculated and how the results and inferences are interpreted. Currently it is widely appreciated in quantitative data analysis, especially now that convenient software exists for its implementation. However, the criticism many make of this approach concerns its subjectivity. Objectivity in handling prior knowledge is a concern here, because different people, holding different opinions, may arrive at different results.
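Both the updating and the subjectivity concern can be illustrated with a short sketch of Bayes' rule for a single yes-or-no hypothesis. Everything numeric here is hypothetical: the likelihoods of each piece of evidence, the two priors, and the function names are my own choices for illustration, and the sketch assumes the pieces of evidence are conditionally independent so they can be applied one after another.

```python
def update(prior, likelihood_if_true, likelihood_if_false):
    """Bayes' rule: posterior probability of the hypothesis after one piece of evidence."""
    numerator = prior * likelihood_if_true
    evidence = numerator + (1.0 - prior) * likelihood_if_false
    return numerator / evidence


def update_sequentially(prior, evidence_items):
    """Apply Bayes' rule item by item; each posterior becomes the next prior."""
    belief = prior
    for likelihood_if_true, likelihood_if_false in evidence_items:
        belief = update(belief, likelihood_if_true, likelihood_if_false)
    return belief


# Hypothetical evidence: (P(evidence | hypothesis true), P(evidence | hypothesis false)).
evidence_items = [(0.8, 0.4), (0.7, 0.5), (0.9, 0.3)]

# Two observers see the same evidence but start from different prior opinions.
skeptic = update_sequentially(0.1, evidence_items)
neutral = update_sequentially(0.5, evidence_items)

print(f"skeptical prior 0.1 -> posterior {skeptic:.3f}")
print(f"neutral prior   0.5 -> posterior {neutral:.3f}")
```

Both observers move toward the hypothesis as evidence accumulates, yet they end at different posteriors (roughly 0.48 and 0.89), which is exactly the subjectivity critics point to.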
Fuzzy logic is another method in quantitative decision making. In contrast to binary logic (yes-no, or accept-reject), fuzzy logic can be thought of as gray logic, which provides a way to express in-between values. It emerged from the development of the theory of fuzzy sets by Lotfi Zadeh (1965). Fuzzy logic is mostly seen as a branch of artificial intelligence that deals with reasoning algorithms used to emulate human thinking and decision making. It handles the concept of partial truth using linguistic variables, where the truth value may range between completely true and completely false. The concept of partial truth is subjective and depends on the observer. For example, for the temperature of the weather, we want to determine when to say cold, warm, and hot. The meaning of each of these concepts can be represented by a membership function (called a fuzzy set), and each meaning is subjective. One might consider all values up to 40 °F to be cold. In Figure 1, the meanings of the expressions cold, warm, and hot are represented by functions mapping a temperature scale to "truth values" ranging from 0 to 1. A point on that scale has three truth degrees, one for each of the three functions (expressions). The vertical line in the figure represents a particular temperature at which the truth values (see the three arrows) are measured. Since the red arrow points to zero, this temperature may be interpreted as "not hot." The orange arrow (pointing at 0.2) may describe it as "slightly warm," and the blue arrow (pointing at 0.8) as "fairly cold" (Fuzzy Logic, 2013, para. 6). A rule could then be adopted: for example, if "slightly warm," then stop the fan.
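The three membership functions can be sketched in a few lines. The breakpoints below (40, 60, 70, and 90 °F) and the function names are my own assumptions; the only value taken from the text is that everything up to 40 °F counts as fully cold. With these assumed breakpoints, a temperature of 44 °F reproduces the truth degrees read off the arrows in Figure 1.

```python
def falling(x, start, end):
    """Truth degree that is 1 at or below `start` and falls linearly to 0 at `end`."""
    if x <= start:
        return 1.0
    if x >= end:
        return 0.0
    return (end - x) / (end - start)


def rising(x, start, end):
    """Truth degree that is 0 at or below `start` and rises linearly to 1 at `end`."""
    if x <= start:
        return 0.0
    if x >= end:
        return 1.0
    return (x - start) / (end - start)


# Assumed breakpoints in degrees Fahrenheit (hypothetical, chosen for illustration).
def cold(temp_f):
    return falling(temp_f, 40.0, 60.0)


def warm(temp_f):
    # Warm rises as cold fades and falls as hot takes over.
    return min(rising(temp_f, 40.0, 60.0), falling(temp_f, 70.0, 90.0))


def hot(temp_f):
    return rising(temp_f, 70.0, 90.0)


def truth_degrees(temp_f):
    """All three truth degrees for one temperature."""
    return {"cold": cold(temp_f), "warm": warm(temp_f), "hot": hot(temp_f)}


degrees = truth_degrees(44.0)
print(degrees)  # cold about 0.8, warm about 0.2, hot 0.0
```

Unlike the accept-reject rule, nothing here forces a single verdict: the same temperature is fairly cold and slightly warm at once, and what to do with those degrees is left to the observer or to a rule built on top of them.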
Fuzzy logic uses truth degrees as a mathematical model of vagueness; it summarizes a data analysis as facts with truth degrees and leaves the decision to the observer. Its advantage is its ability to deal with vague situations through linguistic variables. While significance testing declares a hypothesis either accepted or rejected, fuzzy logic allows for degrees of acceptance or rejection. In fuzzy logic the notion of truth does not fall by the wayside; it is expressed in degrees and offers possibilities for different situations. Regarding inference, fuzzy logic uses the mindset 'everything is a matter of degree and open to interpretation.' This mindset has also been adopted in machine learning, where it is considered closer to human reasoning than binary logic. The constraints of fuzzy logic lie in its tools and interpretations when complex inputs are considered.
The development of paradigms in statistical inference has followed a course similar to developments in mathematics, geometry, and physics. According to the Euclidean parallel postulate, in space, there exist no two parallel lines that intersect each other. However, after Non-Euclidean geometry was developed in the nineteenth century, a wider geometrical and mathematical reasoning stemmed from it; accordingly, the geometry of the physical universe and of particles came to be understood better. The development of Riemannian geometry, which allows distance properties to vary, resulted in the synthesis of diverse results concerning the geometry of higher-dimensional surfaces and the behavior of geodesics on them. It also made Einstein's theory of general relativity justifiable. Likewise, as time passes, we witness wider paradigms or logics that abandon or correct former approaches to the tools of scientific decision making. This is a good reason why teachers and professors should update their teaching to be consistent with convincing trends, and train students to be ready for wider paradigms in the future.
In today's scientific community, the ways decisions are made and fallacies are exposed are changing as new paradigms emerge. In order to make better decisions or to validate claims, many aspects and methodologies should be considered. One way to avoid mistakes as much as we can is to expect fallacies in human mental processing during decision making, and to keep improving and seeking better alternatives to well-established wisdom.
(Thanks to Ugur Sahin for reviewing the preliminary copy of this article.)