I know there is pressure, either self-inflicted or from external sources, to try to rush through your degree as fast as possible. For many, that means always taking 8-week term courses. In my experience teaching introductory statistics, I have seen students do well in the 8-week terms, but I have seen too many students struggle in them. Perhaps, as I believe, statistics is one of those courses where time is required for the concepts and ideas to jell and firm up.
I stumbled across an interesting article while researching cognitive load and found this: “When you have nothing to think about, you can do your best thinking. You don’t even have to be in the shower.” (Baer, 2016)
In a related article, I found Stanford researcher Emma Seppälä saying:
We need to find ways to give our brains a break…. At work, we’re intensely analyzing problems, organizing data, writing—all activities that require focus. During downtime, we immerse ourselves in our phones while standing in line at the store or lose ourselves in Netflix after hours. (Seppälä, 2017)
Taking courses in the 8-week term format, especially if you take more than one at a time, can easily be a form of information overload. Moreover, the 8-week terms do not give you much freeboard if one of life’s frequent surprises shows up.
My “two cents” is that you should build in time for your brain to recharge after work and studies. Time to be with your family and time to be alone. Taking the 15-week version of a course now and then may help give you that time to recharge. That is not a sign of weakness or selfishness.
That is being smart.
Baer, D. (2016, June 20). ‘Unloaded’ Minds Are the Most Creative. Retrieved from Science of Us: http://nymag.com/scienceofus/2016/06/unloaded-minds-are-the-most-creative.html
Seppälä, E. (2017, May 8). Happiness research shows the biggest obstacle to creativity is being too busy. Retrieved from Quartz: https://qz.com/978018/happiness-research-shows-the-biggest-obstacle-to-creativity-is-being-too-busy/
Back in the dark ages when access to computers was not all that common, I was faced with developing a project schedule for, to me, a complex construction project. I was not that long out of school, so I sought out my boss with the hope he would give me some guidance on how to approach the problem.
He told me to use three-point estimation and to talk to some of the older engineers in the firm to get their ideas on the likely outcomes. So I did, and learned that the three points he was talking about were the worst case, the best case, and the most likely case for what would happen during the project. (Wikipedia, n.d.)
He also directed me to consider using PERT. I did, and learned that this form of project management scheduling includes consideration of an optimistic time estimate (o), a most likely or normal time estimate (m), and a pessimistic time estimate (p). In PERT, instead of using probabilities for each time estimate, the task time is calculated as (o + 4m + p) ÷ 6. (Taylor Jr., 2011)
To model a three-point estimate with a probability distribution, you need to use a triangular distribution. Today, three-point estimates are commonly used in business and engineering, so it is somewhat surprising that Excel does not have a built-in function to help. I was recently faced with this dilemma in my quantitative methods course, which I am trying to migrate away from expensive software solutions.
We were working on modeling business problems using the Monte Carlo method. In the example “Make vs Outsource” problem I wanted to use, the demand for the new SSD (solid state drive) in our case study was forecast to have a worst case, best case, and most likely value.
This is the basic model:
How could we model this demand growth as a random variable using a triangular distribution for use in a Monte Carlo simulation?
As you begin to look at a triangular distribution, you will find that nothing more than basic geometry and algebra is required.
In the image above, a is the minimum value, b is the maximum value, and c is the most likely value, the mode. The probability distribution represented by the area in the larger triangle is continuous and, of course, equal to 1.
Recall the area of a triangle is ½ × base × height. Since the area = 1, we have 1 = ½ × (b − a) × h. Rearranging, we get
h = 2/(b − a).
Looking at the smaller yellow triangle, A1, that area represents the cumulative probability for a < x ≤ c, where c is the most likely value.
P(a < x ≤ c), then, is equal to ½ × (c − a) × h = ½ × (c − a) × [2/(b − a)], or, with a bit of rearranging,
P(a < x ≤ c) = (c − a)/(b − a).
Since a < x < b, we can then say that the cumulative probability that x ≤ c, the most likely value, is
P(x ≤ c) = (c − a)/(b − a).
Let’s look at an example of a three-point forecast of increases in demand for a product:
Worst case increase = 2%; best case increase = 10%; and most likely case increase = 7%.
Thus, a = 2%, b = 10%, and c = 7%. The probability of x being less than or equal to the most likely value, a demand increase of 7%, is
P(x ≤ c) = (7 − 2)/(10 − 2) = 5/8 = 0.625.
Let’s look at a more general case.
For the case of x_{1}, with a < x_{1} ≤ c, the probability density is
f(x_{1}) = 2(x_{1} − a) / [(b − a)(c − a)]
For x_{2}, with c < x_{2} ≤ b,
f(x_{2}) = 2(b − x_{2}) / [(b − a)(b − c)]
(Weisstein, n.d.)
And again, by some algebraic manipulation, we can find the cumulative probability distributions:
P(x ≤ x_{1}) = (x_{1} − a)² / [(b − a)(c − a)]
P(x ≤ x_{2}) = 1 − (b − x_{2})² / [(b − a)(b − c)]
(Weisstein, n.d.)
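As a quick check of those two pieces, the CDF can be written as a small Python function (a sketch; the function name is mine):

```python
def triangular_cdf(x, a, b, c):
    """Cumulative probability P(X <= x) for a triangular
    distribution with minimum a, maximum b, and mode c."""
    if x <= a:
        return 0.0
    if x <= c:
        # Left piece: area of the small triangle from a to x
        return (x - a) ** 2 / ((b - a) * (c - a))
    if x < b:
        # Right piece: 1 minus the area of the triangle from x to b
        return 1 - (b - x) ** 2 / ((b - a) * (b - c))
    return 1.0

# At the mode, the CDF reduces to (c - a)/(b - a)
print(triangular_cdf(7, a=2, b=10, c=7))  # 0.625
```

Evaluating the function at the mode reproduces the (c − a)/(b − a) result derived above.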
So, these two equations will give the probability for an x value from a triangular distribution, but how do we use them in a Monte Carlo simulation where we want to randomize the x values for a triangular distribution?
For variables that follow a normal distribution, we can use the Excel RAND function to generate probabilities and then the NORM.INV function to generate random values of x (see image 1 for an example). So, to generate random values of x that follow a triangular distribution, we need to develop an inverse of the two CDF formulas above.
To do that, we can generate random probabilities (P1 and P2) using the RAND() function and then set them equal to the CDF for each of the two equations. Then use algebra to solve for x for each of the two cases.
For the first part, setting P1 = (x_{1} − a)² / [(b − a)(c − a)] and solving for x_{1} gives
x_{1} = a + √[P1 × (b − a) × (c − a)]
and for the second, solving P2 = 1 − (b − x_{2})² / [(b − a)(b − c)] for x_{2} gives
x_{2} = b − √[(1 − P2) × (b − a) × (b − c)]
Now, put them into Excel format and use an IF statement to pick the one to use. To do that we will use the cumulative probability that x ≤ c as the decision point.
That is, P(c) = (c − a)/(b − a).
If P ≤ P(c), use the equation for x_{1}; else, use the equation for x_{2}.
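The same pick-one-of-two-equations logic can be sketched in Python (the function names are mine; note that the standard library's random.triangular performs this same inverse-transform sampling for you):

```python
import random

def triangular_inverse(p, a, b, c):
    """Inverse CDF: map a probability p in [0, 1] to an x value
    from the triangular distribution (a = min, b = max, c = mode)."""
    threshold = (c - a) / (b - a)  # cumulative probability at the mode
    if p <= threshold:
        # Left piece: x1 = a + sqrt(P * (b - a) * (c - a))
        return a + (p * (b - a) * (c - a)) ** 0.5
    # Right piece: x2 = b - sqrt((1 - P) * (b - a) * (b - c))
    return b - ((1 - p) * (b - a) * (b - c)) ** 0.5

def draw(a, b, c):
    """One Monte Carlo draw, like RAND() feeding the Excel IF formula."""
    return triangular_inverse(random.random(), a, b, c)

# Demand-growth example: worst 2%, best 10%, most likely 7%
print(triangular_inverse(0.625, 2, 10, 7))  # 7.0 -- exactly the mode
```

Feeding in the cumulative probability at the mode, 0.625, returns the mode itself, which is a handy sanity check on the two branches.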
Here is the implementation in Excel. I then link cell B6 into the Make vs Buy model for the demand and conduct the Monte Carlo simulation. The red and blue colors refer back to the two equations developed above for x_{1} and x_{2}.
To see how this works, I ran a 5,000-trial simulation and plotted a histogram of the x-values generated.
I think that is pretty good. And all done using basic Excel – no expensive add-ins.
You can download a copy of the calculator here. Triangular_Distribution_Xvalue_Calculator_Dawn_Wright_PhD
Petty, N., & Dye, S. (2013, June 11). Triangular Distributions. Retrieved from Statistics Learning Center: https://learnandteachstatistics.files.wordpress.com/2013/07/notes-on-triangle-distributions.pdf
Taylor Jr., J. (2011, July 6). The History of Project Management: Late 20th Century Process and Improvements. Retrieved from Bright Hub Project Management: http://www.brighthubpm.com/methods-strategies/11663-late-20th-century-process-and-improvements-in-pm-history/
Weisstein, E. W. (n.d.). Triangular Distribution. Retrieved from Wolfram MathWorld – A Wolfram Web Resource: http://mathworld.wolfram.com/TriangularDistribution.html
Wikipedia. (n.d.). Three-point estimation. Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Three-point_estimation
One kind is “natural” pairings, such as spouses, siblings, and especially twins. This type of pairing is often used in medical observational research when it is difficult to construct a true experiment. (PennState, 2017)
But even more common are other types of pairing. A more accurate label for this two-sample test is a test for dependent samples. Samples are dependent when there is a relationship of some kind in play that causes the samples to not be independent.
I like this definition from the Minitab blog:
If the values in one sample affect the values in the other sample, then the samples are dependent.
If the values in one sample reveal no information about those of the other sample, then the samples are independent. (Minitab, n.d.)
Another author states the requirement for a two-sample test for independent samples is:
“The two samples are randomly selected in an independent manner from the two target populations.” (McClave, Benson, & Sincich, 2014)
Another way of thinking about dependent vs independent samples: If there is no random process in selecting the second sample, the samples are dependent.
One example of a paired/dependent sample situation is comparing daily sales for two specific restaurants. We randomly pick 12 days from 2016 and get the sales for the two restaurants on those 12 days. Are the two samples independent? [data from (McClave, Benson, & Sincich, 2014)]
The answer is they are not. Although we randomly picked the 12 days, once we get the sales data for restaurant 1 we must get the same 12 days for restaurant 2. The second sample is not random – it is linked to the first sample.
If we mistakenly run the independent-samples t-test, we get the following:
The large p-value tells us the sales for the two restaurants are not different.
But if we correctly run the paired-samples t-test, we find a small p-value:
The sales for the two restaurants are different!
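To see the difference between the two tests in code, here is a sketch in Python using scipy.stats; the sales numbers are made up for illustration, not the textbook's data:

```python
from scipy import stats

# Hypothetical daily sales for two restaurants on the same 6 days.
# The day-to-day variation is large, while the restaurant-to-restaurant
# difference on each day is small but consistent.
rest1 = [10, 20, 30, 40, 50, 60]
rest2 = [12, 22, 31, 43, 52, 61]

# Wrong: treating the samples as independent
t_ind, p_ind = stats.ttest_ind(rest1, rest2)

# Right: the samples are paired by day
t_rel, p_rel = stats.ttest_rel(rest1, rest2)

print(f"independent: p = {p_ind:.3f}")  # large p: 'no difference'
print(f"paired:      p = {p_rel:.4f}")  # small p: difference detected
```

The pairing removes the day-to-day variation, so the small but consistent difference between restaurants becomes detectable.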
Remember, there are more types of “paired” samples than just before-and-after measurements.
P.S. The image below shows the Excel PHStat version of the two tests:
This sounds like a test of Goodness of Fit between the historical pattern and the observed pattern.
The claim is that the actual pattern and the historical pattern are different. That means we need the inequality math operator, which, in turn, means the claim is the alternative hypothesis.
Stating our two hypotheses:
H_{0}: the distribution of people ordering food for delivery is 7% Sunday, 4% Monday, 5% Tuesday, …
H_{a}: the distribution of people ordering food for delivery differs from the expected distribution.
Putting this in math equation form:
H_{0}: P_{Sunday} = 0.07; P_{Monday} = 0.04; P_{Tuesday} = 0.05; P_{Wednesday} = 0.12; P_{Thursday} = 0.11; P_{Friday} = 0.37; P_{Saturday} = 0.24
H_{a}: the distribution of proportions differs from that specified in H_{0}
Although the ≠ math operator normally indicates a two-tailed test, Chi-square Goodness of Fit tests are always right-tailed tests.
First, let’s use StatCrunch; then we will use Excel.
Remember, if you are in MyStatLab, look for the small blue rectangles near the upper right of a table. Click on them to automatically load the data into StatCrunch (and into Excel).
This is how StatCrunch looks with the data entered. I labeled the history % “Expected.”
Hint: you do not have to convert the expected % to counts; StatCrunch will do that automatically. And it is smart enough to know if the expected data is already counts (frequency) – not sure how it does this. [Note: I think I figured it out. If the total of the “expected” values = 100, StatCrunch assumes the values are percentages. If they do not equal 100, StatCrunch assumes they are counts.]
Use the command sequence Stat > Goodness of Fit > Chi-square test. In the Observed: box, select the “Frequency_f” column and in the Expected: box, select “Expected.” In the Display: box, select “Expected” so the actual counts will be shown. Click Compute!.
The results box appears. The χ² test statistic is 21.107 and the p-value is 0.0018, which is less than our alpha of 0.05.
Remember to check to make sure each of the expected frequencies is greater than 5, which they are in this problem. If any of the expected frequencies is less than 5, the Chisquare test is not valid.
If your problem requires you to find the critical value and rejection region, use the StatCrunch Chi-square calculator: Stat > Calculators > Chi-square. Enter the degrees of freedom, DF, which equals the number of levels of the variable minus 1, i.e., 7 − 1 = 6 for this problem. Always select the ≥ option in the P(x) box. Enter alpha, 0.05 for this problem, and click Compute.
The critical value, χ²_{0}, is 12.59, and the rejection region is any value of χ² greater than 12.59. The χ² test statistic of 21.107 is greater than 12.59 and thus falls within the rejection region.
Using either the pvalue approach or the critical value approach, we reject the null hypothesis.
Because our claim was the alternative, we conclude there is sufficient evidence to support the claim that the observed distribution of people ordering food for delivery is different from the expected pattern.
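The same goodness-of-fit test can be sketched in Python with scipy.stats.chisquare. The observed counts below are hypothetical stand-ins (the post's actual counts come from the MyStatLab table), but the mechanics match the StatCrunch steps:

```python
from scipy import stats

# Historical percentages, Sunday through Saturday
expected_pct = [0.07, 0.04, 0.05, 0.12, 0.11, 0.37, 0.24]

# Hypothetical observed counts for n = 500 orders (made-up numbers)
observed = [50, 30, 35, 55, 50, 170, 110]
n = sum(observed)

# Expected counts under H0: percentage times total count
expected = [p * n for p in expected_pct]

stat, pvalue = stats.chisquare(observed, f_exp=expected)
print(f"chi-square = {stat:.3f}, p = {pvalue:.4f}")

# Critical-value check: reject H0 if the statistic exceeds the
# chi-square critical value with df = 7 - 1 = 6 at alpha = 0.05
crit = stats.chi2.ppf(0.95, df=6)
print(f"critical value = {crit:.2f}")  # about 12.59
```

Note that scipy.stats.chisquare expects counts, not percentages, so the expected percentages must be converted first; StatCrunch does that conversion for you.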
Now, let’s do the Excel solution. This takes a bit more time, but if you save your worksheet, you can reuse it on similar problems by editing the data ranges in the formulas. Note: you can click on an image to see it full size.
We get the same results as we did when using StatCrunch.
Hope this helps!
Consider the following problem statement:
A bank auditor claims that credit card balances are normally distributed, with a mean of $2870 and a standard deviation of $900.
I usually see students get one of the questions correct, but not both; they seem to get #1 or #2 correct in about equal proportions. When I inspect their solutions, I find that they get confused over which “standard deviation” to use in the equation for z.
Most students seem to get #1 correct. They use the formula for z:
z = (x − µ)/σ
correctly interpreting the problem’s “standard deviation of $900” as the population sigma.
Here is the Excel solution for part 1. Note that I give two formulas for finding the probability. In both, “TRUE” gives the cumulative probability from negative infinity up to x; the left tail, in other words.
Here is the StatCrunch solution using the Stat > Calculators > Normal command sequence. In the dialog box, make sure the Standard option is active; enter the mean, sigma, and x; select < to get the left tail; and then click Compute. I like to use StatCrunch for these types of problems since it gives a sketch as well as the probability.
In both solutions to part 1, we find that the probability of an individual card holder having less than a $2500 balance is 34%, which is not unusual.
Let’s look at part 2 again:
The mistake I see many students make is to use the population sigma in their calculations. That means they probably did not recognize that the question is about a mean for 25 randomly selected card holders. In other words, a sample.
To find z for a sample mean, you must use the standard deviation of the sampling distribution of sample means, the standard error, σ_{x̅} = σ/√n.
This is the formula for finding z for a sample mean, x̅:
z = (x̅ − µ)/σ_{x̅} = (x̅ − µ)/(σ/√n)
Recall that the mean of a sampling distribution of sample means, µ_{x̅}, is the population mean, µ.
Here is the Excel solution:
Here is the StatCrunch solution, again using the Normal calculator.
I just learned a neat “trick” about the calculator: you can use Excel-like formulas in the data entry windows. Here, to get the standard error, I entered 900/SQRT(25) in the Std. Dev. window before I clicked Compute. Of course, you can use a regular calculator to find the standard error and enter that value.
In the StatCrunch graph, we can see that the $2500 sample mean balance is very far to the left of the population mean of $2870. Thus, the approximately 2% chance of getting less than $2500 for a sample of 25 is reasonable.
But getting a sample mean of $2500 for this population would be unusual if our standard of labeling an event unusual is a 5% chance.
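Both parts of the credit-card problem can be checked in Python with scipy.stats.norm; the figures come from the problem statement above:

```python
from math import sqrt
from scipy.stats import norm

mu, sigma = 2870, 900

# Part 1: one randomly selected card holder with a balance under $2500
p_individual = norm.cdf(2500, loc=mu, scale=sigma)
print(f"P(x < 2500) = {p_individual:.4f}")    # about 0.34 -- not unusual

# Part 2: the MEAN balance of n = 25 card holders under $2500.
# Use the standard error, sigma / sqrt(n), not sigma itself.
n = 25
se = sigma / sqrt(n)                          # 900 / 5 = 180
p_mean = norm.cdf(2500, loc=mu, scale=se)
print(f"P(x-bar < 2500) = {p_mean:.4f}")      # about 0.02 -- unusual
```

The only difference between the two calculations is the scale argument: sigma for an individual value, the standard error for a sample mean.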
Here is another common mistake, on a similar problem but with a key difference in the wording.
Use the normal distribution of fish lengths for which the mean is 11 inches and the standard deviation is 4 inches. Assume the variable x is normally distributed.
Part 1 is straightforward. We are asked about individual fish, not a sample.
This time I will use StatCrunch first so we can see the sketch.
We can see that 14 inches is to the right side of the mean of 11 inches. The area under the normal curve to the right of 14 is 0.2266 which means that about 22.7% of the fish will be longer than 14 inches.
This is the Excel solution:
Because we need the right tail, we must subtract the value returned by the NORM.S.DIST function from one. Recall that the TRUE parameter gives us the cumulative area under the curve from negative infinity to our z value. Unfortunately, “FALSE” does not give the right tail!
Students who do not draw the sketch often forget this important step and give an incorrect answer of 77.3%.
Part 2: If 200 fish are randomly selected, about how many would you expect to be shorter than 9 inches?
I think what throws some students on part 2 is the statement “If 200 fish are randomly selected,” which sounds an awful lot like it is a sample. And it is a sample of n = 200.
But the key is that they do not ask for the mean or any other sample statistic.
They want to know how many of the 200 fish will be shorter than 9 inches. That means we do not need to use the standard error, σ_{x̅}. We should again use sigma as the standard deviation in the StatCrunch normal calculator.
We get a probability of 30.9%, which means about 62 [200*30.9%] of the fish will be shorter than 9 inches.
Here is the Excel solution:
Because we are again interested in the left tail (see the StatCrunch sketch), we go back to our original formulas by deleting the “1 −” in front of the NORM.S.DIST function. And, as with StatCrunch, we see that about 62 of the 200 fish will be shorter than 9 inches in length.
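Here is the fish-length problem worked the same way in Python with scipy.stats.norm:

```python
from scipy.stats import norm

mu, sigma = 11, 4   # fish lengths, inches

# Part 1: proportion of fish longer than 14 inches (right tail)
p_long = norm.sf(14, loc=mu, scale=sigma)       # sf = 1 - cdf
print(f"P(x > 14) = {p_long:.4f}")              # about 0.2266

# Part 2: of 200 fish, how many are shorter than 9 inches?
# The question is about individual fish, so use sigma,
# not the standard error.
p_short = norm.cdf(9, loc=mu, scale=sigma)
print(f"P(x < 9) = {p_short:.4f}")              # about 0.3085
print(f"expected count = {200 * p_short:.1f}")  # about 62 fish
```

Using sf for the right tail avoids the forgot-to-subtract-from-one mistake the Excel formula invites.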
Consider the following problem statement:
In an investigation of the personality characteristics of drug dealers of a certain region, convicted drug dealers were scored on a scale that provides a quantitative measure of a person’s level of need for approval and sensitivity to social situations. (Higher scores indicate a greater need for approval.) Based on the study results, it can be assumed that the scale scores for the population of convicted drug dealers of the region have a mean of 44 and a standard deviation of 7. Suppose that in a sample of 96 people from the region, the mean scale score is x̅ = 46. Is this sample likely to have been selected from the population of convicted drug dealers of the region? Explain. Consider an event with a probability less than 0.05 unlikely. (McClave, Benson, & Sincich, 2014)
Solution:
First, state the question: How unusual would it be to get a sample mean of 46 if the population mean is 44 and the population standard deviation is 7?
In math form, find P(x̅ ≥ 46) for µ = 44, σ = 7. Note: use the ≥ math operator because we need to find the area under the normal curve to the right of the sample mean. Because the normal distribution is continuous, the probability of getting exactly 46 is zero.
Second, identify the data provided:
Third, draw a sketch. You can see that x̅ = 46 falls far to the right of the mean, µ = 44.
Excel solution:
Because p < α =0.05, it would be unusual to get a randomly collected sample mean of 46 or greater if the population mean is 44 and the population standard deviation is 7.
StatCrunch solution: Use the command sequence Stats > Z Stats > One Sample > With Summary. Enter the sample mean, the standard deviation, and n. (Note: enter the population standard deviation, σ; StatCrunch calculates the sample standard error.)
Select the radio button next to Hypothesis test for µ, enter the null value of 44, and select > for the alternative H_{a}. Click Compute!
We get the same results as we did with Excel.
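For readers outside Excel and StatCrunch, the same z test can be sketched in Python, with scipy.stats.norm supplying the right-tail probability:

```python
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar = 44, 7, 96, 46

# Test statistic for a one-sample z test of H0: mu = 44 vs Ha: mu > 44
se = sigma / sqrt(n)      # standard error of the sample mean
z = (xbar - mu0) / se
p = norm.sf(z)            # right-tail p-value, P(Z >= z)

print(f"z = {z:.2f}, p = {p:.4f}")
```

Because p is well under 0.05, a sample mean of 46 would be unusual if the sample truly came from this population.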