Boys will be boys: Data error prompts U-turn on study of sex differences in school (Retraction Watch, 2017)
The Retraction Watch article is about a peer-reviewed paper on self-regulation of study habits that was published earlier this year. In the retraction, the authors noted they had discovered a “coding error” that flipped the outcome of their research. Although the standard procedure for coding dummy variables is “1” = Yes and “0” = No, that procedure trips up a bit when it comes to gender. Generally, male is coded “1”, but here the researcher doing the basic coding used “1” to mean female. No one noticed, and the authors drew their conclusions as if the data had been coded the opposite way, with “1” meaning male.
I noticed several students made a similar “error” when setting up the multiple regression in M3A2 on the value of a fireplace. It was also a “coding error”, made when using a dummy variable to include the categorical variable “Fireplace” in the regression.
As you probably know, a regression requires quantitative variables, but in our database the variable Fireplace contains either a True or a False text value. When we use a dummy variable, we replace a categorical text value with a number. In the Fireplace problem, a logical way to do that, logical for me, would be to use a “1” to indicate the presence of a fireplace and a “0” to indicate no fireplace. Some students chose to do the reverse, “0” for Fireplace = True and “1” for Fireplace = False.
The error I am speaking of is not that – deciding to use “0” for Fireplace = True. The error comes in not understanding what either choice means for how you interpret the outcome of the regression.
If you used “0” for Fireplace = True, you found a beta2 coefficient of -$5567. If you did the opposite and used “1” for Fireplace = True, you found a beta2 coefficient of +$5567.
The error happens when you try to interpret these outcomes.
It is straightforward to interpret the outcome if you let “1” indicate the presence of a fireplace (Image 1): having a fireplace adds about $5567 to the sales price of a typical house. You can see this when you use the CI/PI worksheet to forecast home values by entering either a “0” or a “1” in the calculator.
But some students who used “0” to indicate the presence of a fireplace came up with an incorrect conclusion: that the presence of a fireplace reduced the price of a home because beta2 was negative.
In reality, for the students who used “0” to indicate the presence of a fireplace, the negative beta2 tells you that not having a fireplace reduces the value of a home, just the opposite.
Final thought: you may notice that the y-intercepts for the two coding methods are also different. If you check, you will see that the y-intercept for Fireplace=True=0 is $5567 greater than the y-intercept for Fireplace=True=1. That makes sense because, when “0” means a fireplace is present, the intercept is the starting point for a house that has a fireplace, so it should be greater.
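The effect of flipping the dummy coding can be checked with a short Python sketch. The data below are made up for illustration (they are not the M3A2 data set), and the names `sqft`, `has_fireplace`, and `fit` are assumptions introduced here. The prices are built so that a fireplace is worth roughly $5,567, so the fitted coefficient will be close to that figure but not exactly equal to it.

```python
import numpy as np

# Hypothetical data, invented for this illustration only.
sqft = np.array([1200, 1500, 1700, 1400, 2000, 1800, 1600, 2200], dtype=float)
has_fireplace = np.array([False, True, False, True, True, False, True, False])
noise = np.array([1200, -800, 500, -1500, 900, -300, 700, -700], dtype=float)
price = 50_000 + 80 * sqft + 5_567 * has_fireplace + noise


def fit(dummy):
    """Least-squares fit of price = b0 + b1*sqft + b2*dummy."""
    X = np.column_stack([np.ones_like(sqft), sqft, dummy])
    coef, *_ = np.linalg.lstsq(X, price, rcond=None)
    return coef


# Coding A: 1 = fireplace, 0 = no fireplace.
b0_std, b1_std, b2_std = fit(has_fireplace.astype(float))
# Coding B: 0 = fireplace, 1 = no fireplace.
b0_rev, b1_rev, b2_rev = fit((~has_fireplace).astype(float))

print(f"1 = fireplace: intercept {b0_std:,.0f}, beta2 {b2_std:+,.0f}")
print(f"0 = fireplace: intercept {b0_rev:,.0f}, beta2 {b2_rev:+,.0f}")

# Forecast for an 1,800 sq ft house WITH a fireplace under each coding.
pred_std = b0_std + b1_std * 1800 + b2_std * 1  # dummy = 1 means fireplace
pred_rev = b0_rev + b1_rev * 1800 + b2_rev * 0  # dummy = 0 means fireplace
print(f"forecasts: {pred_std:,.0f} vs {pred_rev:,.0f}")
```

Under either coding the forecast for the same house is the same. Only the sign of beta2 and the intercept change, which is why the interpretation has to follow the coding.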
Retraction Watch. (2017, October 17). Boys will be boys: Data error prompts U-turn on study of sex differences in school. Retrieved from Retraction Watch: http://retractionwatch.com/2017/10/17/boys-will-boys-data-error-prompts-u-turn-study-sex-differences-school/
Back in the dark ages when access to computers was not all that common, I was faced with developing a project schedule for what was, to me, a complex construction project. I was not that long out of school, so I sought out my boss with the hope he would give me some guidance on how to approach the problem.
He told me to use three-point estimation and to talk to some of the older engineers in the firm to get their ideas on the likely outcomes. So, I did and learned that the three points he was talking about were the worst case, the best case, and the most likely case for what would happen during the project. (Wikipedia, n.d.)
He also directed me to consider using PERT. I did and learned that this form of project management scheduling includes consideration of the optimistic time estimate (o), the most likely or normal time estimate (m), and the pessimistic time estimate (p). In PERT, instead of using probabilities for each estimate of the time required, the task time is calculated as (o + 4m + p) ÷ 6. (Taylor Jr., 2011)
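As a quick illustration of the PERT formula, here is a minimal Python sketch. The function name `pert_estimate` and the sample task times (4, 6, and 14 days) are assumptions made up for this example.

```python
def pert_estimate(o, m, p):
    """Expected task time from the optimistic (o), most likely (m),
    and pessimistic (p) estimates, using (o + 4m + p) / 6."""
    if not (o <= m <= p):
        raise ValueError("expected o <= m <= p")
    return (o + 4 * m + p) / 6


# Hypothetical task: best case 4 days, most likely 6, worst case 14.
print(pert_estimate(4, 6, 14))  # (4 + 24 + 14) / 6 = 7.0
```

The most likely time gets four times the weight of either extreme, so the estimate is pulled only modestly toward the long pessimistic tail.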
To model a three-point estimate with a probability distribution you need to use a triangular distribution. Today, three-point estimates are commonly used in business and engineering, so it is somewhat surprising that Excel does not have a built-in function to help. I was recently faced with this dilemma in my quantitative methods course, which I am trying to migrate away from expensive software solutions.
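The excerpt stops before showing the Excel approach, so the following is only a sketch of the general idea in Python, not the spreadsheet formula from the post. It uses the standard closed-form inverse of the triangular cumulative distribution: feed it a uniform random number between 0 and 1 and it returns a draw from the triangular distribution. The function name `triangular_inv` and the values 4, 6, and 14 are assumptions for illustration.

```python
import math
import random


def triangular_inv(u, low, mode, high):
    """Inverse CDF of a triangular distribution for a probability u in [0, 1]."""
    if not (low <= mode <= high and low < high):
        raise ValueError("expected low <= mode <= high and low < high")
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must be between 0 and 1")
    cut = (mode - low) / (high - low)  # value of the CDF at the mode
    if u < cut:
        # Rising side of the triangle, between low and mode.
        return low + math.sqrt(u * (high - low) * (mode - low))
    # Falling side of the triangle, between mode and high.
    return high - math.sqrt((1.0 - u) * (high - low) * (high - mode))


# Monte Carlo: 100,000 simulated task times for a hypothetical
# best case of 4, most likely 6, and worst case 14.
rng = random.Random(42)  # fixed seed so the run is repeatable
draws = [triangular_inv(rng.random(), 4, 6, 14) for _ in range(100_000)]
mean = sum(draws) / len(draws)
print(f"simulated mean: {mean:.2f}")
```

The theoretical mean of a triangular distribution is (low + mode + high) ÷ 3, which is 8 for these numbers, so the simulated mean should land close to that. Note that this differs from the PERT weighted average of (o + 4m + p) ÷ 6 for the same three points.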
One kind is “natural” pairings, such as spouses, siblings, and especially twins. This type of pairing is often used in medical observational research when it is difficult to construct a true experiment. (PennState, 2017)
But even more common are other types of pairing. A more accurate label for this two-sample test is a test for dependent samples. Samples are dependent when there is a relationship of some kind in play which causes the samples to not be independent.
I like this definition from the Minitab blog:
If the values in one sample affect the values in the other sample, then the samples are dependent.
Perhaps one of the simplest but toughest questions for my intro (and graduate) stats students is the one asking them to classify a variable as discrete or continuous.
My quick rule of thumb (heuristic) is to think about whether the variable is countable or whether it must be measured. I tried to come up with a mnemonic like “population : parameter; sample : statistic” but the best I could do is “finger : digit : discrete” since you have to count your fingers.
Dogs, cats, people, houses, and touchdowns are countable, so they are discrete variables. And we do not often think of dividing a dog or house into parts, e.g. 1.6 dogs, which is another sign that they are discrete.
Things we measure are Continuous
A person’s weight, gallons of water, the length of a football field, the speed of a car, the temperature of the ocean, and the price of gas all must be measured, so they are continuous variables. Another clue is that continuous variables are often stated as fractions or decimals, as in 2.5 gallons of gas.
The most recent assignment in my BUS 430 class was on simple linear regression. In some of the data sets, there are data points that seem to be inconsistent with the bulk of the data. One student called this to my attention and asked if he should just ignore those data points because they were “obviously a mistake.” His comment reminded me that in an earlier assignment, we had discussed briefly using a box plot software tool to identify outliers, but we had not discussed what to do about them.
When dealing with just two variables, it is quick and easy to make a scatter plot and inspect it for data points that do not follow the trend of the rest of the data. In the scatter plot below, we can see just such a data point in the lower right.
But using a box plot does not identify this point as an outlier on either the x or y axis.
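To make the point concrete, here is a minimal Python sketch with made-up data (not the class data set): ten points that follow a trend plus one point in the lower right. It applies the usual box plot rule of 1.5 × IQR beyond the quartiles to x and to y separately, and then to the residuals from a fitted straight line. Checking residuals is just one possible way to catch such a point. The excerpt does not say which method the post goes on to recommend, and the names `iqr_flags` and `residuals` are assumptions introduced here.

```python
import numpy as np

# Hypothetical data: ten points near y = 2x, plus one point at (9.5, 5.0)
# that sits in the lower right of the scatter plot.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 9.5])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 15.9, 18.2, 20.0, 5.0])


def iqr_flags(values):
    """Boolean mask of values outside the 1.5 * IQR box plot fences."""
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    return (values < q1 - 1.5 * spread) | (values > q3 + 1.5 * spread)


# Fit a least-squares line and measure how far each point is from it.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

print("flagged on x:        ", np.flatnonzero(iqr_flags(x)))
print("flagged on y:        ", np.flatnonzero(iqr_flags(y)))
print("flagged on residuals:", np.flatnonzero(iqr_flags(residuals)))
```

Neither single-variable check flags anything, because 9.5 is an ordinary x value and 5.0 is an ordinary y value. It is only the combination that is unusual, and that shows up in the residuals, where the last point (index 10) is the only one flagged.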