Boys will be boys: Data error prompts U-turn on study of sex differences in school (Retraction Watch, 2017)
The article is about a peer-reviewed study of self-regulation of study habits that was published earlier this year. In the retraction, the authors noted they had discovered a “coding error” that flipped the outcome of their research. Although the standard procedure for coding dummy variables is “1” = Yes and “0” = No, that procedure trips up a bit when it comes to gender. Generally, male is coded “1”, but here the researcher doing the basic coding used “1” to mean female. No one noticed, and they drew their conclusions as if the data had been coded the opposite way, with “1” meaning male.
I noticed several students made a similar “error” when setting up the multiple regression in M3A2 on the value of a fireplace: a “coding error” in using a dummy variable to include the categorical variable Fireplace in the regression.
As you probably know, a regression requires quantitative variables, but in our database the variable Fireplace contains either a True or a False text value. When we use a dummy variable, we replace a categorical text value with a number. In the Fireplace problem, a logical way to do that, logical for me at least, would be to use “1” to indicate the presence of a fireplace and “0” to indicate no fireplace. Some students chose to do the reverse: “0” for Fireplace = True and “1” for Fireplace = False.
The error I am speaking of is not that choice, deciding to use “0” for Fireplace = True. The error comes in not understanding what either choice means for how you interpret the outcome of the regression.
If you chose “0” for Fireplace = True and did the multiple regression correctly, you came out with a beta2 coefficient of about −$5567 for the dummy variable.
If you did the opposite and used “1” for Fireplace = True, you found a beta2 coefficient of +$5567.
The error happens when you try to interpret these outcomes.
It is straightforward to interpret the outcome if you let “1” indicate the presence of a fireplace (Image 1): having a fireplace adds about $5567 in value to the sale price of a typical house. You see this when you use the CI/PI worksheet to forecast home values by putting either a “0” or a “1” into the calculator.
Image 1
But some students who used “0” to indicate the presence of a fireplace came up with an incorrect conclusion: that the presence of a fireplace reduced the price of a home because beta2 was negative.
In reality, for the students who used “0” to indicate the presence of a fireplace, the negative beta2 tells you that not having a fireplace reduces the value of a home, just the opposite.
Image 2
Final thought: you may notice that the y-intercepts of the two coding methods are also different. But if you check, you will see that the y-intercept for Fireplace = True = “0” is $5567 greater than the y-intercept for Fireplace = True = “1”. That makes sense because, under that coding, the baseline house is assumed to have a fireplace, so the starting point should be greater.
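If you want to see the coding-flip effect outside of Excel, here is a minimal Python sketch using made-up sale prices (not the course data set). With a single dummy predictor, the OLS slope is just the difference in group means, so reversing the coding flips its sign and shifts the intercept by the same amount:

```python
from statistics import mean

# Hypothetical sale prices (made-up, not the course data set)
with_fp = [154000, 162000, 158000]     # homes with a fireplace
without_fp = [150000, 155000, 152500]  # homes without

# With a single dummy predictor, OLS gives:
#   slope     = mean(price | dummy = 1) - mean(price | dummy = 0)
#   intercept = mean(price | dummy = 0)

# Coding A: "1" = has fireplace
slope_a = mean(with_fp) - mean(without_fp)
intercept_a = mean(without_fp)

# Coding B: "1" = no fireplace (the reversed coding)
slope_b = mean(without_fp) - mean(with_fp)
intercept_b = mean(with_fp)

print(slope_a, slope_b)  # same magnitude, opposite signs
assert slope_a == -slope_b
assert intercept_b - intercept_a == slope_a  # intercepts shift by the slope
```

Either coding describes the same data; only the interpretation of the “0” baseline changes.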
Retraction Watch. (2017, October 17). Boys will be boys: Data error prompts U-turn on study of sex differences in school. Retrieved from Retraction Watch: http://retractionwatch.com/2017/10/17/boyswillboysdataerrorpromptsuturnstudysexdifferencesschool/
You argue that statistical literacy gives citizens a kind of power. What do you mean?
What I mean is that if we don’t have the ability to process quantitative information, we can often make decisions that are more based on our beliefs and our fears than based on reality. On an individual level, if we have the ability to think quantitatively, we can make better decisions about our own health, about our own choices with regard to risk, about our own lifestyles. It’s very empowering to not be scared or bullied into doing things one way or another.
On a collective level, the impact of being educated in general is huge. Think about what democracy would be if most of us couldn’t read. We aspire to a literate society because it allows for public engagement, and I think this is also true for quantitative literacy. The more we can get people to understand how to view the world in a quantitative way, the more successful we can be at getting past biases and beliefs and prejudices. (Bleicher, 2017)
I know there is pressure, either self-inflicted or from external sources, to try to rush through your degree as fast as possible. For many, that means always taking 8-week term courses. In my experience teaching introductory statistics, I have seen students do well in the 8-week terms, but I have seen too many students struggle in them. Perhaps, as I believe, statistics is one of those courses where time is required for the concepts and ideas to jell and firm up.
I stumbled across an interesting article while researching cognitive load and found this: “When you have nothing to think about, you can do your best thinking. You don’t even have to be in the shower.” (Baer, 2016)
In a related article, I found Stanford researcher Emma Seppälä saying:
We need to find ways to give our brains a break…. At work, we’re intensely analyzing problems, organizing data, writing—all activities that require focus. During downtime, we immerse ourselves in our phones while standing in line at the store or lose ourselves in Netflix after hours. (Seppälä, 2017)
Taking courses in the 8-week term format, especially if you take more than one at a time, can easily be a form of information overload. Moreover, the 8-week terms do not give you much freeboard if one of life’s frequent surprises shows up.
My “two cents” is that you should build in time for your brain to recharge after work and studies. Time to be with your family and time to be alone. Taking the 15-week version of a course now and then may help give you that time to recharge. That is not a sign of weakness or selfishness.
That is being smart.
Baer, D. (2016, June 20). ‘Unloaded’ Minds Are the Most Creative. Retrieved from Science of Us: http://nymag.com/scienceofus/2016/06/unloadedmindsarethemostcreative.html
Seppälä, E. (2017, May 8). Happiness research shows the biggest obstacle to creativity is being too busy. Retrieved from Quartz: https://qz.com/978018/happinessresearchshowsthebiggestobstacletocreativityisbeingtoobusy/?utm_source=qzfb
Back in the dark ages, when access to computers was not all that common, I was faced with developing a project schedule for what was, to me, a complex construction project. I was not that long out of school, so I sought out my boss in the hope he would give me some guidance on how to approach the problem.
He told me to use three-point estimation and to talk to some of the older engineers in the firm to get their ideas on the likely outcomes. So, I did and learned that the three points he was talking about were the worst case, the best case, and the most likely case for what would happen during the project. (Wikipedia, n.d.)
He also directed me to consider using PERT. I did, and learned that this form of project management scheduling uses three time estimates: the optimistic time estimate (o), the most likely or normal time estimate (m), and the pessimistic time estimate (p). In PERT, instead of assigning probabilities to each time estimate, the expected task time is calculated as (o + 4m + p) ÷ 6. (Taylor Jr., 2011)
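The PERT weighted average is simple enough to sketch in a few lines of Python; the task times below are illustrative numbers, not from any real schedule:

```python
def pert_estimate(o, m, p):
    """PERT expected task time: (o + 4m + p) / 6."""
    return (o + 4 * m + p) / 6

# Illustrative task: optimistic 10 days, most likely 14, pessimistic 24
print(pert_estimate(10, 14, 24))  # -> 15.0
```

Notice the estimate lands above the most likely time because the pessimistic tail pulls the weighted average upward.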
To model a three-point estimate with a probability distribution, you need to use a triangular distribution. Today, three-point estimates are commonly used in business and engineering, so it is somewhat surprising that Excel does not have a built-in function to help. I was recently faced with this dilemma in my quantitative methods course, which I am trying to migrate away from expensive software solutions.
We were working on modeling business problems using the Monte Carlo method. In the example “Make vs Outsource” problem I wanted to use, the demand for the new SSD (solid-state drive) in our case study was forecast to have a worst-case, best-case, and most likely value.
This is the basic model:
How could we model this demand growth as a random variable using a triangular distribution for use in a Monte Carlo simulation?
As you begin to look at a triangular distribution, nothing more than basic geometry and algebra is required.
In the image above, a is the minimum value, b is the maximum value, and c is the most likely value, the mode. The probability distribution represented by the area in the larger triangle is continuous and, of course, equal to 1.
Recall the area of a triangle is ½ × base × height. Since the area = 1, we have 1 = ½ × (b − a) × h. Rearranging, we get
h = 2/(b − a).
Looking at the smaller yellow triangle, A1, that area represents the cumulative probability that x lies between a and c, the most likely value.
P(a < x <= c), then, is equal to ½ × (c − a) × h = ½ × (c − a) × [2/(b − a)], or, with a bit of rearranging,
P(a < x <= c) = (c − a)/(b − a).
In other words, for any x in (a, b), the cumulative probability up to the most likely value c is
P(x <= c) = (c − a)/(b − a).
Let’s look at an example of a three-point forecast of increases in demand for a product:
Worst-case increase = 2%, best-case increase = 10%, and most-likely-case increase = 7%.
Thus, a = 2%, b = 10%, and c = 7%. The probability of x being less than or equal to the most likely value for a demand increase of 7% is
P(x <= c) = (7 − 2)/(10 − 2) = 5/8 = 0.625.
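As a quick check, that cumulative probability at the mode takes only a couple of lines of Python (a sketch, using the same a, b, and c as above):

```python
def triangular_cdf_at_mode(a, b, c):
    """Cumulative probability up to the mode c: P(x <= c) = (c - a)/(b - a)."""
    return (c - a) / (b - a)

# Demand-increase example: a = 2, b = 10, c = 7 (all in percent)
print(triangular_cdf_at_mode(2, 10, 7))  # -> 0.625
```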
Let’s look at a more general case. For a < x1 <= c, the probability density of the triangular distribution is
f(x1) = 2(x1 − a) / [(b − a)(c − a)]
and for x2, with c < x2 <= b,
f(x2) = 2(b − x2) / [(b − a)(b − c)]. (Weisstein, n.d.)
And again, by some algebraic manipulation (integrating each density), we can find the cumulative probability distributions:
P(x <= x1) = (x1 − a)² / [(b − a)(c − a)], for a < x1 <= c
P(x <= x2) = 1 − (b − x2)² / [(b − a)(b − c)], for c < x2 <= b. (Weisstein, n.d.)
So, these two equations give the cumulative probability for any x in a triangular distribution, but how do we use them in a Monte Carlo simulation, where we want to generate random x values from a triangular distribution?
For variables that follow a normal distribution, we can use the Excel RAND function to generate probabilities and then use NORM.INV to generate random values of x (see Image 1 for an example). So, to generate random values of x that follow a triangular distribution, we need to develop an inverse of the two CDF formulas above.
To do that, we can generate random probabilities (P1 and P2) using the RAND() function, set each equal to the corresponding CDF, and then use algebra to solve for x in each of the two cases.
For the first part, set P1 equal to the first CDF and solve for x1:
x1 = a + √[P1 (b − a)(c − a)]
and for the second part, set P2 equal to the second CDF and solve for x2:
x2 = b − √[(1 − P2)(b − a)(b − c)]
Now, put them into Excel format and use an IF statement to pick the one to use. To do that, we will use the cumulative probability that x = c as the decision point.
That is, P(c) = (c − a)/(b − a).
If P(x) <= P(c), use the equation for x1; else, use the equation for x2.
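The same inverse-CDF logic can be sketched in Python for readers not working in Excel. (Python’s standard library also has random.triangular(low, high, mode) built in, which is a handy cross-check.)

```python
import random

def triangular_inverse(p, a, b, c):
    """Inverse CDF of a triangular distribution with min a, max b, mode c."""
    p_c = (c - a) / (b - a)  # cumulative probability at the mode
    if p <= p_c:
        # left side: p = (x - a)^2 / [(b - a)(c - a)], solved for x
        return a + (p * (b - a) * (c - a)) ** 0.5
    # right side: p = 1 - (b - x)^2 / [(b - a)(b - c)], solved for x
    return b - ((1 - p) * (b - a) * (b - c)) ** 0.5

# Monte Carlo draws for the demand-increase example (a = 2%, b = 10%, c = 7%)
draws = [triangular_inverse(random.random(), 0.02, 0.10, 0.07)
         for _ in range(5000)]
print(min(draws) >= 0.02 and max(draws) <= 0.10)  # -> True
```

Feeding the draws into the demand cell of the simulation model is then just a matter of wiring, exactly as the Excel version links cell B6.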
Here is the implementation in Excel. I then link cell B6 into the Make vs Outsource model for the demand and conduct the Monte Carlo simulation. The red and blue colors refer back to the two equations developed above for x1 and x2.
To see how this works, I ran a 5,000-trial simulation and plotted a histogram of the x-values generated.
I think that is pretty good. And all done using basic Excel, no expensive add-ins.
You can download a copy of the calculator here. Triangular_Distribution_Xvalue_Calculator_Dawn_Wright_PhD
Petty, N., & Dye, S. (2013, June 11). Triangular Distributions. Retrieved from Statistics Learning Center: https://learnandteachstatistics.files.wordpress.com/2013/07/notesontriangledistributions.pdf
Taylor Jr., J. (2011, July 6). The History of Project Management: Late 20th Century Process and Improvements. Retrieved from Bright Hub Project Management: http://www.brighthubpm.com/methodsstrategies/11663late20thcenturyprocessandimprovementsinpmhistory/
Weisstein, E. W. (n.d.). Triangular Distribution. Retrieved from Wolfram MathWorld – A Wolfram Web Resource: http://mathworld.wolfram.com/TriangularDistribution.html
Wikipedia. (n.d.). Three-point estimation. Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Threepoint_estimation
There are more kinds of paired samples than you might expect. One kind is “natural” pairings, such as spouses, siblings, and especially twins. This type of pairing is often used in medical observational research when it is difficult to construct a true experiment. (PennState, 2017)
But even more common are other types of pairing. A more accurate label for this two-sample test is a test for dependent samples. Samples are dependent when some relationship links the observations in one sample to those in the other, so the samples are not independent.
I like this definition from the Minitab blog:
If the values in one sample affect the values in the other sample, then the samples are dependent.
If the values in one sample reveal no information about those of the other sample, then the samples are independent. (Minitab, n.d.)
Another author states the requirement for a two-sample test for independent samples:
“The two samples are randomly selected in an independent manner from the two target populations.” (McClave, Benson, & Sincich, 2014)
Another way of thinking about dependent vs. independent samples: if there is no random process in selecting the second sample, the samples are dependent.
One example of a paired/dependent sample situation is comparing daily sales for two specific restaurants. We randomly pick 12 days from 2016 and get the sales for the two restaurants on those 12 days. Are the two samples independent? [data from (McClave, Benson, & Sincich, 2014)]
The answer is they are not. Although we randomly picked the 12 days, once we get the sales data for restaurant 1 we must get the same 12 days for restaurant 2. The second sample is not random – it is linked to the first sample.
If we mistakenly run the independent-samples t-test, we get the following:
The large p-value tells us we cannot conclude the sales for the two restaurants are different.
But, if we correctly run the paired-samples t-test, we find a small p-value:
The sales for the two restaurants are different!
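The same contrast can be reproduced outside Excel. Below is a minimal Python sketch using made-up daily sales figures (not the McClave data set). Because the big day-to-day swings are shared by both restaurants, the paired t statistic comes out far larger than the independent-samples one:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical daily sales (in $1000s) for two restaurants on the SAME
# 12 randomly chosen days -- made-up numbers, not the textbook data.
r1 = [42, 55, 38, 61, 47, 50, 33, 58, 44, 52, 40, 49]
r2 = [40, 52, 36, 58, 45, 47, 31, 55, 42, 49, 38, 46]

n = len(r1)

# Independent-samples t statistic (equal-variance form, equal n)
sp = sqrt((stdev(r1) ** 2 + stdev(r2) ** 2) / 2)  # pooled SD
t_ind = (mean(r1) - mean(r2)) / (sp * sqrt(2 / n))

# Paired-samples t statistic on the day-by-day differences
d = [x - y for x, y in zip(r1, r2)]
t_paired = mean(d) / (stdev(d) / sqrt(n))

# Pairing removes the shared day-to-day variation, so |t_paired| >> |t_ind|
print(round(t_ind, 2), round(t_paired, 2))
```

The independent test buries the small, consistent difference inside the large between-day variance; the paired test isolates it.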
Remember there are more types of “paired” samples than just before and after.
P.S. The image below shows the Excel PHStat version of the two tests: