A random sample of 100 observations from a population with a standard deviation of 44 yielded a sample mean of 108. Test the null hypothesis that μ = 100 against the alternative that μ > 100 at an alpha of 0.05.
Because the alternative hypothesis contains the > operator, this is a right-tail test. The p-value is 0.035, which is less than alpha, so the decision is to reject the null and conclude that, at the 5% significance level, there is enough evidence to support the claim that the mean is greater than 100.
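If you want to check this result outside StatCrunch, here is a minimal Python sketch using only the standard library (the function name is mine):

```python
from math import erf, sqrt

def right_tail_z_test(xbar, mu0, sigma, n):
    """Return the z statistic and right-tail p-value for a one-mean z test."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))  # 1 - Phi(z), the right tail
    return z, p_value

z, p = right_tail_z_test(xbar=108, mu0=100, sigma=44, n=100)
# z is about 1.82 and p about 0.035, less than alpha = 0.05, so reject the null
```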
Here is the StatCrunch solution:
A Type I error is a false positive, in which a true null hypothesis (there is nothing going on) is rejected. A Type II error is a false negative, in which a false null hypothesis is not rejected; something is going on, but we decide to ignore it.
In this case, the software designers were trying to optimize the ability of the car’s autonomous systems to recognize humans and other obstacles so that the car did not slow or stop too often due to things like lane pylons, trash in the gutter, or street signs on the side of the road. If that happened, the car would have a jerky, uncomfortable ride and slower average speed. Recognizing a lane pylon or sign pole as an obstacle to the car would be a false positive, a Type I error. (O’Kane, 2018)
Understand that this is fuzzy, complex science, and the system designers had to integrate information from multiple sensors. One sensor might “see” a data point as an object better in the dark than another. One system might “see” an object and project its path as stationary, while another might calculate a motion into the car’s path. Most of the systems use artificial intelligence that had to be trained using thousands of “images” that may or may not be similar enough to this victim walking a bicycle. Logically, some overarching system has to evaluate all the data, decide whether there is an object in a threatening path or position, and initiate evasive or braking action.
Again, if the “master” system was too conservative in its decisions and registered “positive” for objects that were not threats, it would produce an unnecessarily jerky ride. My assumption is that the system designers set the detection program’s parameters to make false positives less likely, to ignore an object until the system was very sure it was a real obstacle to the car. At the same time, the system needed to not have false negatives, failing to recognize objects that were a threat. Finding the optimal setpoint is akin to picking a significance level.
Trying to minimize false positives is analogous to decreasing alpha, the level of significance, so that it is more difficult to reject the null that “nothing is going on here.” Because Type I and Type II errors are connected, as you make a false positive (Type I error) less likely, you also make a false negative (Type II error) more likely.
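A quick numerical sketch of this trade-off (my illustrative numbers, not anything from the Uber system): for a right-tail z test, shrinking alpha pushes the critical value up, which raises beta, the Type II error rate, for the same true effect.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

effect = 2.0  # assumed true effect, measured in standard-error units
for alpha, z_crit in [(0.05, 1.645), (0.01, 2.326)]:
    beta = phi(z_crit - effect)  # P(fail to reject | null is false)
    print(f"alpha = {alpha}: beta = {beta:.3f}")
```

With these assumed numbers, cutting alpha from 0.05 to 0.01 roughly doubles beta from about 0.36 to about 0.63, which is exactly the kind of trade the system designers faced.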
After much review, Uber now says the autonomous system did “see” the pedestrian and first classified her as an unknown object at 6 seconds to impact. It then classified her and her bike as a car, and finally recognized her as a person about 1.3 seconds before impact. By then it was too late for the normal control system to react, and Uber had disconnected the Volvo emergency braking system. For testing, the company had backup human observers who were supposed to recognize mistakes and take appropriate action. That backup failed at a crucial time, and the autonomous system was on its own. (NTSB, 2018)
One way of thinking about this is that the sensors in the car actually detected the woman who was struck but initially “decided” the data was not strong enough to register as a real obstacle and stop the car. The p-value it calculated, if you will, was greater than the alpha the system designers chose, and the system did not reject the null that “there is nothing of concern happening here.”
But the system continued to collect data, increasing the sample size n, so to speak, until the test statistic crossed into the rejection region, though by then it was too late to save the pedestrian.
The system initially made a Type II, false negative, decision and failed to reject the false null that the “object is not real.” (Marshall, 2018)
The Key Takeaway
My point is that when you decide on your level of significance in the real world, you must consider the costs of a mistake either way. You must evaluate the consequences of making a Type I, false positive decision or making a Type II, false negative decision and set your significance level appropriately.
Efrati, A. (2018, May 7). Uber Finds Deadly Accident Likely Caused By Software Set to Ignore Objects On Road. Retrieved from The Information: http://bit.ly/2KquXbH
Lee, T. (2018, May 7). Report: Software bug led to death in Uber’s self-driving crash. Retrieved from Ars Technica: https://arstechnica.com/techpolicy/2018/05/reportsoftwarebugledtodeathinubersselfdrivingcrash/
Marshall, A. (2018, May 29). False Positives: Self-Driving Cars and the Agony of Knowing What Matters. Retrieved from Wired: https://www.wired.com/story/selfdrivingcarsubercrashfalsepositivenegative/?mbid=social_twitter
NTSB. (2018). Preliminary Report Highway HWY18MH010. Washington D.C.: National Transportation Safety Board.
O’Kane, S. (2018, May 7). Uber reportedly thinks its self-driving car killed someone because it ‘decided’ not to swerve. Retrieved from The Verge: https://www.theverge.com/2018/5/7/17327682/uberselfdrivingcardecisionkillswerve
Recall that the equation for the standard error of the mean is

SE = σ / √n

where σ is the population standard deviation and n is the sample size.
One can find any number of precise “academic explanations” of why this is true, and I give my students links to those references. But I often get follow-up questions from students who look at those references and then ask for a simpler, clearer explanation that does not involve a lot of algebra.
So, I am going to attempt one here and in the companion video.
Let’s also assume that each roll of the dice is a sample of size n = 1, since we just have one die. We know from basic probability that we need to look at the long term when working with empirical probabilities, so let us make 10,000 rolls of our single die. We then have 10,000 samples of sample size n = 1.
I am going to simulate doing 10,000 rolls of a single die using basic Excel. Here in Figure 1 is a screenshot of my worksheet with most of the rows hidden.
Although calculating the mean, x-bar, of a sample of 1 is a bit trivial, I do that in column D using the Excel AVERAGE function. I copy that formula down the range D3:D10002, as you can see in column E, where I show the formulas from column D. For the first sample, in cell D3, the average of the value on the single die, 3, is just 3.000. I am doing this to establish the format I will use for all the remaining sample sizes, from 2 dice to 30 dice.
In cell F3, using the Excel AVERAGE function over the cell range D3:D10002, I calculated the average of the sample means of the 10,000 dice rolls, and that is 3.483, which is very close to the actual population mean of 3.500 [(1+2+3+4+5+6)/6 = 3.5]. I also calculated the standard deviation of the means of those 10,000 samples in cell H3 using the Excel function STDEV.P over the cell range D3:D10002. That value is 1.70971, again very close to the actual population standard deviation, σ, of 1.7078. [The standard deviation of 1, 2, 3, 4, 5, 6 is 1.7078.]
I could have simulated many more throws and would likely have gotten my “averages” closer to the theoretical values, but these 10,000 samples should be enough to give you the idea of how this works.
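The same n = 1 experiment can be run outside Excel. Here is a minimal Python version, seeded so the numbers repeat, though they will differ slightly from my worksheet:

```python
import random
import statistics

random.seed(1)  # make the simulation repeatable

# 10,000 samples of size n = 1: one die per sample
sample_means = [random.randint(1, 6) for _ in range(10_000)]

mean_of_means = statistics.mean(sample_means)  # should land near mu = 3.5
sd_of_means = statistics.pstdev(sample_means)  # should land near sigma = 1.7078
print(mean_of_means, sd_of_means)
```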
Next, I will repeat the simulation using 2 dice, which is a sample of n = 2. Figure 3 is an image of the Excel worksheet constructed for a sample of 2 dice, as was done for the first example of one die, a sample of n = 1. You can consider this to be the sampling distribution for two dice rolled 10,000 times. In column D, I have calculated the mean, x-bar, of the two dice, which is 2.500 (cell D3) for the first sample of n = 2. In column E, I show the formulas in the adjacent cells in column D. In cell F3, I again calculated the average of the sample means, x-bar, for those 10,000 samples of n = 2, and that value is 3.496, again close to the theoretical 3.500 population mean. This is what we expect if the Central Limit Theorem is true.
Be sure to note the difference between the number of samples and the sample size, n. Here the number of samples is 10,000, but the sample size is just n = 2.
This next graph (Figure 4) is the relative frequency histogram of the average value of the two dice for the 10,000 samples. Note the distribution is beginning to resemble a “bell shape,” which is what should be expected if the Central Limit Theorem holds true. Also note, I increased the number of bins in the histogram to better display the distribution.
I repeated the Excel simulation for rolling 3 dice 10,000 times, and again for 4, 5, 6, … all the way to 30 dice being thrown 10,000 times, but I will not show all that information here in detail. I will use it for the very last part of this discussion.
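Those repeated runs can be condensed into a short loop. This sketch checks that the simulated standard error tracks σ/√n, the Central Limit Theorem prediction, as the number of dice grows:

```python
import random
import statistics
from math import sqrt

random.seed(2)
SIGMA = statistics.pstdev([1, 2, 3, 4, 5, 6])  # population sigma, about 1.7078

for n in (2, 5, 10, 30):
    means = [statistics.mean(random.randint(1, 6) for _ in range(n))
             for _ in range(10_000)]
    se_sim = statistics.pstdev(means)  # simulated standard error of the mean
    se_theory = SIGMA / sqrt(n)        # sigma / sqrt(n)
    print(f"n = {n:2d}: simulated {se_sim:.4f}, theory {se_theory:.4f}")
```

The two columns agree to a couple of decimal places at every sample size, which is the whole point of the exercise.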
And the standard error of these 10,000 samples of n = 5 is 0.764, calculated in cell I3. Recall that is the standard deviation of the sample means in column G, which is our sampling distribution of sample means.
Repeating the simulation using 10, 20, and 30 dice, we get these three relative frequency distributions of the 10,000 sample means for sample sizes n = 10 (Figure 7), n = 20 (Figure 8), and n = 30 (Figure 9):
A power function is an equation where the value of y is a function of x raised to a power. In this case, y is equal to 1.710 multiplied by x raised to the -0.5 power:

y = 1.710 * x^(-0.5)

Recall from algebra that a negative exponent just means we can put x in the denominator with a positive exponent, and raising something to the 0.5 power is the same as taking the square root. This best-fit line’s equation is then

y = 1.710 / √x
This 1.710 is very, very close to the actual value of 1.7078 for the population standard deviation, σ, found earlier. If I had run even more than 10,000 samples, it is likely the numerator would have converged on sigma. Since x is our sample size, n, we have essentially shown that the standard error of the sample mean is SE = σ / √n.
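That power fit can itself be reproduced in a few lines. The sketch below simulates standard errors for n = 1 to 30 and fits y = C·x^k by least squares on the logs, which is essentially what Excel’s power trendline does; it should land near C = 1.71 and k = -0.5:

```python
import random
import statistics
from math import exp, log

random.seed(3)

# Simulated standard error of the mean for sample sizes 1..30
sizes, errors = [], []
for n in range(1, 31):
    means = [statistics.mean(random.randint(1, 6) for _ in range(n))
             for _ in range(2_000)]
    sizes.append(n)
    errors.append(statistics.pstdev(means))

# Power fit y = C * x**k via least squares on log(y) = log(C) + k*log(x)
lx = [log(x) for x in sizes]
ly = [log(y) for y in errors]
mx, my = statistics.mean(lx), statistics.mean(ly)
k = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
C = exp(my - k * mx)
print(f"y = {C:.3f} * x^{k:.3f}")
```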
Again, in this exercise I used simulations, a lot of them granted, to generate the standard errors that Excel then used to calculate the power function equation. Use this practical explanation to supplement the more precise algebraic proofs. I hope this gives you a better feeling when your instructor or textbook tells you the “proof” of why we divide sigma by the square root of the sample size n is beyond the scope of your course.
The process of statistics starts when we identify what group we want to study or learn something about. We call this group the population.
Note that the word “population” here (and in the entire course) is not just used to refer to people; it is used in the broader statistical sense, where population can refer not only to people, but also to animals, things etc. For example, we might be interested in:
The population, then, is the entire group that is the target of our interest.
In most cases, the population is so large that as much as we might want to, there is absolutely no way that we can study all of it (imagine trying to get the opinions of all U.S. adults about the death penalty…).
A more practical approach would be to examine and collect data only from a subgroup of the population, which we call a sample. We call this first component, which involves choosing a sample and collecting data from it, Producing Data or Sampling.
A sample is a subset of the population from which we collect data.
It should be noted that since, for practical reasons, we need to compromise and examine only a subgroup of the population rather than the whole population, we should try to choose a sample in such a way that it will represent the population well. This is best done using a form of random sampling.
For example, if we choose a sample from the population of U.S. adults, and ask their opinions about a federal health care program, we do not want our sample to consist of only Republicans or only Democrats.
Once the data have been collected, what we have is a long list of answers to questions, or numbers, and in order to explore and make sense of the data, we need to summarize that list in a meaningful way.
This second component, which consists of summarizing the collected data, is called Exploratory Data Analysis or Descriptive Statistics.
Now we’ve obtained the sample results and summarized them, but we are not done. Remember that our goal is to study the population, so what we want is to be able to draw conclusions about the population based on the sample results.
Before we can do so, we need to look at how the sample we’re using may differ from the population, so that we can factor that into our analysis. To examine this difference, we use Probability, which is the third component in the big picture.
Probability is the “machinery” that allows us to draw conclusions about the population based on the data collected in the sample.
Finally, we can use what we’ve discovered about our sample to draw conclusions about our population.
We call this final component in the process Inference.
This is the Big Picture of Statistics.
Example: Polling Customer Opinion
In December 2018, Fast Technologies conducted a poll of their customers to determine how they rated the quality of the Fast 4-terabyte solid-state external hard drive.
3 and 4. Probability and Inference: Based on the sample result (of 65% giving 4 stars) and our knowledge of probability, it was concluded (with 95% confidence) that the percentage of those rating the hard drive 4 stars in the population of Fast Technologies’ customers is within 3% of what was obtained in the sample (i.e., between 62% and 68%). The following figure summarizes the example:
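The 3% margin quoted above is consistent with the usual formula for the margin of error of a proportion. Here is a sketch; the poll’s sample size is not given in the example, so n = 1,000 is purely my assumption:

```python
from math import sqrt

p_hat = 0.65   # sample proportion giving 4 stars
n = 1_000      # assumed sample size; not stated in the example
z = 1.96       # critical value for 95% confidence

margin = z * sqrt(p_hat * (1 - p_hat) / n)
interval = (p_hat - margin, p_hat + margin)
print(f"margin of error = {margin:.3f}")  # about 0.03, i.e. 3 percentage points
```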
This brings us back to where we started, the population.
This material was adapted from the Carnegie Mellon University open learning statistics course available at http://oli.cmu.edu and is licensed under a Creative Commons License.
Back in the dark ages when access to computers was not all that common, I was faced with developing a project schedule for, to me, a complex construction project. I was not that long out of school, so I sought out my boss with the hope he would give me some guidance on how to approach the problem.
He told me to use three-point estimation and to talk to some of the older engineers in the firm to get their ideas on the likely outcomes. So I did, and learned that the three points he was talking about were the worst case, the best case, and the most likely case for what would happen during the project. (Wikipedia, n.d.)
He also directed me to consider using PERT. I did, and learned that this form of project management scheduling considers the optimistic time estimate (o), the most likely or normal time estimate (m), and the pessimistic time estimate (p). In PERT, instead of using probabilities for each time estimate, the task time is calculated as (o + 4m + p) ÷ 6. (Taylor Jr., 2011)
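The PERT weighted average is simple to compute. Here is a short sketch with hypothetical task times of my own choosing:

```python
def pert_estimate(o, m, p):
    """PERT three-point time estimate: (o + 4m + p) / 6."""
    return (o + 4 * m + p) / 6

# Hypothetical task: 4 days optimistic, 6 most likely, 14 pessimistic
expected = pert_estimate(4, 6, 14)
print(expected)  # 7.0 days
```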
Modeling a three-point estimate with a probability distribution calls for a triangular distribution. Today, three-point estimates are commonly used in business and engineering, so it is somewhat surprising that Excel does not have a built-in function to help. I was recently faced with this dilemma in my quantitative methods course, which I am trying to migrate away from expensive software solutions.
We were working on modeling business problems using the Monte Carlo method. In the example “Make vs Outsource” problem I wanted to use, the demand for the new SSD (solid state drive) in our case study was forecast to have a worst case, best case, and most likely value.
This is the basic model:
How could we model this demand growth as a random variable using a triangular distribution for use in a Monte Carlo simulation?
As you begin to look at a triangular distribution, you will find that nothing more than basic geometry and algebra is required.
In the image above, a is the minimum value, b is the maximum value, and c is the most likely value, the mode. The probability distribution represented by the area in the larger triangle is continuous and, of course, equal to 1.
Recall that the area of a triangle is ½ * base * height. Since the area = 1, we have 1 = ½ * (b - a) * h. Rearranging, we get

h = 2/(b - a).
Looking at the smaller yellow triangle, A1, its area represents the cumulative probability that a < x ≤ c, where c is the most likely value.

P(a < x ≤ c), then, is equal to ½ * (c - a) * h = ½ * (c - a) * [2/(b - a)], or, with a bit of rearranging,

P(a < x ≤ c) = (c - a)/(b - a).
For a < x < b, we can then say that the cumulative probability that x ≤ c, the most likely value, is

P(x ≤ c) = (c - a)/(b - a).
Let’s look at an example of threepoint forecast increases in demand for a product:
Worst case increase = 2%, best case increase = 10%, and most likely case increase = 7%.
Thus, a = 2%, b = 10%, and c = 7%. The probability of x being less than or equal to the most likely value of 7% is

P(x ≤ c) = (7 - 2)/(10 - 2) = 5/8 = 0.625.
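In code form, the same calculation with the demand numbers from the example:

```python
# Triangular distribution: cumulative probability at the mode c.
# Values from the demand example: worst 2%, best 10%, most likely 7%.
a, b, c = 2, 10, 7
p_at_mode = (c - a) / (b - a)
print(p_at_mode)  # 0.625
```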
Let’s look at a more general case.
For the case of x_{1}, with a < x_{1} ≤ c, the probability density is

f(x_{1}) = 2(x_{1} - a) / [(b - a)(c - a)]

For x_{2}, with c < x_{2} ≤ b,

f(x_{2}) = 2(b - x_{2}) / [(b - a)(b - c)]

(Weisstein, n.d.)
And again, by some algebraic manipulation, we can find the cumulative distribution functions:

P(x ≤ x_{1}) = (x_{1} - a)² / [(b - a)(c - a)], for a < x_{1} ≤ c

P(x ≤ x_{2}) = 1 - (b - x_{2})² / [(b - a)(b - c)], for c < x_{2} ≤ b

(Weisstein, n.d.)
So, these two equations will give the probability for an x for a triangular distribution, but how do we use them in a Monte Carlo simulation where we want to randomize the x values for a triangular distribution?
For variables that follow a normal distribution, we can use the Excel RAND function to generate probabilities and then the NORM.INV function to generate random values of x (see Image 1 for an example). So, to generate random values of x that follow a triangular distribution, we need to develop an inverse of the two CDF formulas above.
To do that, we can generate random probabilities (P1 and P2) using the RAND() function and then set them equal to the CDF for each of the two equations. Then use algebra to solve for x for each of the two cases.
For the first part, set P_{1} = (x_{1} - a)² / [(b - a)(c - a)] and solve for x_{1}:

x_{1} = a + √[P_{1} * (b - a)(c - a)]

For the second part, set P_{2} = 1 - (b - x_{2})² / [(b - a)(b - c)] and solve for x_{2}:

x_{2} = b - √[(1 - P_{2}) * (b - a)(b - c)]
Now, put them into Excel format and use an IF statement to pick which one to use. To do that, we will use the cumulative probability at the mode, x = c, as the decision point. That is

P(x ≤ c) = (c - a)/(b - a)

If P ≤ P(x ≤ c), use the equation for x_{1}; otherwise, use the equation for x_{2}.
Here is the implementation in Excel. I then link cell B6 into the Make vs Buy model for the demand and conduct the Monte Carlo simulation. The red and blue colors refer back to the two equations developed above for x_{1} and x_{2}.
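The same IF logic ports directly to Python. This sketch (function name mine) implements the inverse CDF, which is the same inversion the standard library’s random.triangular performs:

```python
import random
from math import sqrt

def triangular_inverse(p, a, c, b):
    """Inverse CDF of a triangular distribution with min a, mode c, max b."""
    p_mode = (c - a) / (b - a)          # cumulative probability at the mode
    if p <= p_mode:                      # left piece: the x1 equation
        return a + sqrt(p * (b - a) * (c - a))
    return b - sqrt((1 - p) * (b - a) * (b - c))  # right piece: the x2 equation

# One random demand-growth draw, as cell B6 does with RAND()
x = triangular_inverse(random.random(), 0.02, 0.07, 0.10)
```

Feeding in the example values a = 2, c = 7, b = 10 with p = 0.625 returns exactly the mode, 7, which matches the hand calculation above.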
To see how this works, I ran a 5,000-trial simulation and plotted a histogram of the x-values generated.
I think that is pretty good. And all done using basic Excel, with no expensive add-ins.
You can download a copy of the calculator here. Triangular_Distribution_Xvalue_Calculator_Dawn_Wright_PhD
Petty, N., & Dye, S. (2013, June 11). Triangular Distributions. Retrieved from Statistics Learning Center: https://learnandteachstatistics.files.wordpress.com/2013/07/notesontriangledistributions.pdf
Taylor Jr., J. (2011, July 6). The History of Project Management: Late 20th Century Process and Improvements. Retrieved from Bright Hub Project Management: http://www.brighthubpm.com/methodsstrategies/11663late20thcenturyprocessandimprovementsinpmhistory/
Weisstein, E. W. (n.d.). Triangular Distribution. Retrieved from Wolfram MathWorld, A Wolfram Web Resource: http://mathworld.wolfram.com/TriangularDistribution.html
Wikipedia. (n.d.). Three-point estimation. Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Threepoint_estimation
Here is another MSL problem where you really have a lot to key in, because there are 30 sample values, each with 5 digits to enter.
I worked it in Excel and would like to point out two other things:
Here, because we have a right-tail test, we must subtract the values we get from the tables or basic Excel functions from 1. Failing to recognize this results in the many, many mistakes I see in student quizzes where the P-value comes from the wrong tail.
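A quick sketch of that wrong-tail mistake (the z value here is just an example I picked, not the one from this problem):

```python
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * (1 + erf(z / sqrt(2)))

z = 1.82                       # example right-tail test statistic
p_right = 1 - norm_cdf(z)      # correct: subtract from 1 for a right-tail test
p_wrong = norm_cdf(z)          # the wrong tail a table lookup gives directly
print(round(p_right, 4), round(p_wrong, 4))
```

The correct right-tail P-value is about 0.034, while the wrong-tail value is about 0.966, so the mistake flips the reject decision entirely.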
And if you know which hypothesis the claim is, writing the conclusion becomes a lot easier.
I would also note that in the image, I show how to get Excel to use the basic logic of hypothesis tests to “make” the Reject or Fail to Reject decision automatically using Excel functions AND and IF. On the calculators on my website, I take it one more step and get Excel to write the conclusion too. This is not difficult to do if you want to create your own Excel calculators which you can then use on quizzes and exams.
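Here is a rough Python analogue of that Excel AND/IF logic; the function name and exact wording are mine, not the calculator’s:

```python
def conclude(p_value, alpha, claim_is_alternative, claim_text):
    """Mimic the Excel AND/IF logic: make the decision, then word the conclusion."""
    reject = p_value <= alpha
    decision = "Reject the Null" if reject else "Fail to Reject the Null"
    evidence = "sufficient" if reject else "insufficient"
    # Support the claim when it is the alternative; reject it when it is the null
    action = "support" if claim_is_alternative else "reject"
    conclusion = (f"At the {alpha:.0%} significance level, there is {evidence} "
                  f"evidence to {action} the claim that {claim_text}.")
    return decision, conclusion

decision, text = conclude(0.035, 0.05, True, "the mean is greater than 100")
print(decision)  # Reject the Null
```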
If the claim was the null, then your conclusion is about whether there was sufficient evidence to reject the claim. Remember, we can never prove the null to be true, but failing to reject it is the next best thing. So, it is not correct to say, “Accept the Null.”
If the claim is the alternative hypothesis, your conclusion can be whether there was sufficient evidence to support (prove) the alternative is true.
Use the following table to help you make a good conclusion.
The best way to state the conclusion is to include the significance level of the test and a bit about the claim itself.
For example, if the claim was the alternative that the mean score on a test was greater than 85, and your decision was to Reject the Null, then you could conclude:
“At the 5% significance level, there is sufficient evidence to support the claim that the mean score on the test was greater than 85.”
The reason you should include the significance level is that the decision, and thus the conclusion, could be different if the significance level was not 5%.
If you are curious why we say “Fail to Reject the Null” instead of “Accept the Null,” this short video might be of interest: Here