The **Empirical Rule** applies to a normal, bell-shaped curve that is symmetric about the mean. It states that within one standard deviation of the mean (on both the left side and the right side) there is __about__ 68% of the data; within two standard deviations of the mean, __about__ 95% of the data; and within three standard deviations of the mean, __about__ 99.7% of the data.

In the image below, we use the Greek letter sigma, σ, which is the symbol for the population standard deviation. The mean is the middle of the distribution and is indicated by the 0. Remember the Empirical Rule says “about” 68% of the data is between +1 and -1 sigma, σ. This is closer to 68.3%, but even that is still approximate.

And between +2 and -2 standard deviations, σ, we have approximately 95.4%.

Finally, between +3 and -3 σ, we have approximately 99.7%.

Because there must be 100% of the data under the curve, that tells us there must be 100 – 99.7 = 0.3% split between the two tails beyond +3 and -3. Thus, there is approximately 0.15% to the right of +3 and 0.15% to the left of -3 σ.
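These percentages can be checked against the exact normal curve. Here is a quick sketch using only Python's standard library (the error function, erf, gives the normal CDF):

```python
from math import erf, sqrt

def pct_within(k):
    """Exact percentage of a normal distribution within k standard deviations of the mean."""
    return 100 * erf(k / sqrt(2))

print(round(pct_within(1), 1))   # 68.3
print(round(pct_within(2), 1))   # 95.4
print(round(pct_within(3), 1))   # 99.7

# One tail beyond 3 sigma; with the rounded 99.7% this becomes the 0.15% above
one_tail = (100 - pct_within(3)) / 2   # about 0.135
```

The exact values, 68.27%, 95.45%, and 99.73%, are where the rounded 68–95–99.7 figures come from.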

We can use the Empirical Rule to find percentages and probabilities.

Solution: 130 minus the mean of 100 = 30, which is 2 times 15, the standard deviation. Thus, 130 is 2 standard deviations to the right of the mean. Likewise, 100 – 70 = 30, which is 2 times 15, so 70 is 2 standard deviations to the left of the mean. Since the range from 70 to 130 is within 2 standard deviations of the mean, we know that __about__ 95.4% of the IQ scores fall between 70 and 130.

In this example, we calculated the **z-scores** of the two IQs of interest. Here is the formula for finding a z-score:
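Written in standard notation:

```latex
z = \frac{x - \mu}{\sigma}
```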

The individual value we are interested in is “x.” The mean of the population is the Greek letter mu, μ, and the standard deviation of the population is sigma, σ.

Here is the problem solved by first finding the z-scores of the two IQs we are interested in:

__Note: you can have + and – z-scores.__

Converting an x value of interest into a z-score allows you to more easily use the Empirical Rule because the __z-score is the number of standard deviations from the mean__.

Solution: Because the area under the bell curve represents 100% of the possible data points, it also represents probability. Recall that the probability of an event is the number of ways the event can happen divided by the total number of outcomes. Thus, an area under the bell curve equal to 10% is also a 10% probability of “x” falling in that area.

First, find the z-scores of the two IQs of interest:

So, we need the area under the curve between z = +1 and z = -1.

Thus, there is __about__ 68.3% probability a randomly selected individual will have an IQ between 85 and 115.
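The calculation above can be sketched in a couple of lines of Python (the mean of 100 and standard deviation of 15 come from the IQ example):

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

mu, sigma = 100, 15                 # IQ mean and standard deviation
z_low = z_score(85, mu, sigma)      # -1.0
z_high = z_score(115, mu, sigma)    # +1.0
# Both IQs are exactly 1 standard deviation from the mean, so the
# Empirical Rule gives about 68.3% of the data between them.
```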

Looking at the Empirical Rule graph above, we want the probability of an IQ score to the right of (greater than) 115, because that represents all the IQ scores above 115. That means we are interested in IQs with z-scores greater than +1. We know that about 68.3% of the scores are between -1 and +1 z. Because the bell curve is symmetrical, that leaves 100 – 68.3 = 31.7% split evenly between the two gold areas, Area 1 and Area 2. Thus, the probability of an individual having an IQ greater than 115 is Area 2, which is 31.7 divided by 2 = 15.85, or about 16%.

And by extension, the probability of a randomly selected individual having an IQ score __less than 85__ is also about 16%, which is Area 1.

Here is another image of the **Empirical Rule** that may help you solve z-score probability problems.

Thus, an IQ greater than 130 would have a z-score greater than +2. To find the probability, just add the areas to the __right__ of +2:

2.1 + 0.15 = 2.25% or about 2.3%

Add the areas to the __right__ of +1 z: 13.6 + 2.1 + 0.15 = 15.85 or about 16%.

Add the areas to the __right__ of – 1 z: 34.1 + 34.1 + 13.6 + 2.1 + 0.15 = 84.05 or about 84%.

Add the areas to the __left__ of + 2 z: 13.6 + 34.1 + 34.1 + 13.6 + 2.1 + 0.15 = 97.65 or about 98%.

Or because the bell curve is symmetrical, add the areas to the __right__ of +2 and subtract from 100%.

2.1 + 0.15 = 2.25. 100 – 2.25 = 97.75 or about 98%.
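The four sums above can be reproduced from the rounded segment areas in the Empirical Rule graph (34.1%, 13.6%, 2.1%, and 0.15% per segment on each side of the mean):

```python
# Rounded areas (percent) of the bell curve, one side of the mean:
#   0-1 sigma: 34.1   1-2 sigma: 13.6   2-3 sigma: 2.1   beyond 3 sigma: 0.15
seg = {1: 34.1, 2: 13.6, 3: 2.1}
tail = 0.15

right_of_pos2 = seg[3] + tail                                      # 2.25, about 2.3%
right_of_pos1 = seg[2] + seg[3] + tail                             # 15.85, about 16%
right_of_neg1 = seg[1] + seg[1] + right_of_pos1                    # 84.05, about 84%
left_of_pos2 = tail + seg[3] + seg[2] + seg[1] + seg[1] + seg[2]   # 97.65, about 98%
```

Note the direct sum left of +2 gives 97.65 while the symmetry shortcut gives 97.75; both round to about 98% because the segment areas themselves are rounded.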

The Empirical Rule gives __approximate values__, and the values in the graph are rounded off. So, you would look for the __closest__ answer on a multiple-choice quiz.

Excel solution

StatCrunch solution

The guidance is that we need to use the finite population correction (FPC) when the ratio of the sample size n to the population size N is greater than 5%. For example, if the population size is 300 and the sample size is 30, we have a ratio of 10% and thus need to use the FPC.

The most common formula for calculating the FPC is
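In standard notation, with N the population size and n the sample size, that formula is:

```latex
\text{FPC} = \sqrt{\frac{N - n}{N - 1}}
```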

As the population N gets large compared to the sample n, the FPC tends toward a value of 1.

If it is needed, we use the FPC to adjust the Standard Error of the analysis at hand. We do that by simply multiplying the Standard Error by the FPC value.

If we are working with a sample mean, we adjust the Standard Error of the Mean by multiplying it by the FPC. The adjusted standard error can then be used when finding critical values, such as for hypothesis tests and confidence intervals.

If we are working with a sample proportion, we multiply the Standard Error of the Proportion by the FPC.
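Here is a minimal sketch of both adjustments. The population size of 300 and sample size of 30 come from the example above; the standard deviation of 12 and the proportion of 0.40 are made up purely for illustration:

```python
from math import sqrt

def fpc(N, n):
    """Finite population correction factor: sqrt((N - n) / (N - 1))."""
    return sqrt((N - n) / (N - 1))

N, n = 300, 30          # n/N = 10% > 5%, so the FPC is needed
sigma = 12.0            # hypothetical population standard deviation
p_hat = 0.40            # hypothetical sample proportion

se_mean = sigma / sqrt(n)                   # standard error of the mean
se_prop = sqrt(p_hat * (1 - p_hat) / n)     # standard error of the proportion

adj_se_mean = se_mean * fpc(N, n)           # adjusted by the FPC
adj_se_prop = se_prop * fpc(N, n)

print(round(fpc(N, n), 4))   # 0.9503
```

Because the FPC here is just under 1, each adjusted standard error is only slightly smaller than the unadjusted one.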

This is my Excel solution to a typical problem. Note I calculate the FPC factor first in cell B9 to simplify the process and then multiply it by the standard error in B10.

On the bottom, I show the same problem worked without using the FPC factor, and you can see the change in the answers is slight because the FPC is close to 1.

Understanding what all that means can be a bit daunting. Here is my attempt at a simpler explanation. For convenience, I am just using the output from the *t-test: Two-Sample Assuming Unequal Variances*, but the concepts apply to all three t-test tools.

A company surveyed a random sample of their employees on how satisfied they were with their job. The manager does not care if one group has a higher or lower rating, and only wants to know if there is a difference in how men and women rate their job satisfaction.

State the Null and Alternative hypotheses:

Null Hypothesis

Ho: Mean Rating Men = Mean Rating Women

Alternative Hypothesis

Ha: Mean Rating Men ≠ Mean Rating Women

**Important: Put the data ranges for the two groups in the tool dialog box in the same relationship as stated in the Null.** The Men group (green highlight) is on the left side of the Null equation and should be placed in the Variable 1 Range field. The Women group (red highlight) is on the right side of the Null equation and must be in the Variable 2 Range field.

Here is the output from the tool using a significance level of alpha, α, = 0.05. Note that the Men group is on the left and the Women group is on the right in the output.

We can see that the mean for the men is smaller than that of the women. But is the apparent difference “real”? Both groups have a lot of variance relative to the size of the means, 3.22 for the men and 2.83 for the women. So, the apparent difference might just be due to the “noise” of the variances.

To make our decision on rejecting (or not rejecting) the null, we can look at the three output values I have highlighted in yellow: the t statistic, the two-tail p-value, and the two-tail critical value of t.

Why two-tail? Consider this graphic:

The first rule for deciding whether to reject the null tells us to compare our test statistic to the critical value.

When we have a two-tail test, we must put half of our significance level of 5% in each tail to account for the possibility of our test statistic being either positive or negative, i.e. one sample mean being larger or smaller than the other. Putting 2.5% in each tail we can calculate a critical value of -2.042 on the left side and +2.042 on the right side.

If our test statistic, the t Stat, falls in either rejection area, less than –2.042 or larger than +2.042, we must reject the Null. But here our t Stat of –1.886 does not fall in either rejection area, so we must decide to **not** reject the Null.

Another, and perhaps easier way to decide, is to compare the two-tail p-value against our significance level. Thankfully, for a two-tail test, we can just use the p-value the Excel tool gives us. It is 0.069 which is larger than our alpha of 0.05. Thus, this rule also tells us to **not** reject the Null that there is no difference in the ratings.
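As a sketch, here are the two decision rules applied with the numbers from this output (t Stat –1.886, two-tail critical value 2.042, two-tail p-value 0.069):

```python
t_stat = -1.886       # from the Excel output
t_crit = 2.042        # two-tail critical value
p_value = 0.069       # two-tail p-value
alpha = 0.05

# Rule 1: reject if the t Stat falls in either rejection region
reject_by_critical = t_stat < -t_crit or t_stat > t_crit   # False

# Rule 2: reject if the p-value is less than alpha
reject_by_p = p_value < alpha                              # False

# The two rules always agree
assert reject_by_critical == reject_by_p
```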

Note that __the two rules always agree__, unless your technology tool is faulty, which is very rare.

For this two-tail test, we **do not reject** the Null and we conclude that **there is no statistically significant difference in the job satisfaction rating for men and women**.

**Left-tail Test**

If the manager believes the men have a __lower__ mean rating than the women, we should run a left-tail test. Why not just run the two-tail? As you will see, a one-tail test gives us more “power” to detect a real effect in the direction we believe it to be. The downside of a one-tail test is that if you guess wrong and the effect is in the other direction, the test has no power to detect it.

Null Hypothesis Ho: Mean Rating Men >= Mean Rating Women

Alternative Hypothesis Ha: Mean Rating Men < Mean Rating Women

Here is our output again with the one-tail values we need highlighted in yellow.

The tail of the test is always determined by the math operator in the **Alternative hypothesis**, which in this example is the *less than* symbol. Remember the less than symbol **<** points to the left, so this is a left-tail test.

Here is our left-tail graphic:

Excel reports the absolute value of the critical values. For a left-tail test, we need the negative t critical value, which is -1.697. You should note that the one-tail critical value is “smaller” than the two-tail value of -2.042 because we put all of the alpha in that one tail, which “pushes” the critical value toward the mean.

Now, the t Stat does fall in the rejection area, so the rule says we must reject the Null hypothesis.

To use the second rule, we can use the one-tail p-value directly from the output. The one-tail p-value Excel always gives us is the left-tail p-value, the area under the curve from the left end to our t Stat. It is 0.034 which is less than our alpha of 0.05.

You should notice that for the **same sample data**, the left-tail test had the power to reject the Null while the two-tail test did not.

So, this rule also tells us to **reject the Null** and we conclude that the **mean rating of men is significantly lower than that of women**.

**Right-tail test**

Now our manager believes the men have a __higher__ rating than the women.

Null Hypothesis Ho: Mean Rating Men <= Mean Rating Women

Alternative Hypothesis Ha: Mean Rating Men > Mean Rating Women

The alternative math operator is greater than, >, which points to the right, so this is a right-tail test.

We can use the same highlighted one-tail values.

Here is our graphic:

For a right-tail test, we are interested in what is happening on the right side of the curve. We use the positive one-tail critical value of +1.697 and find our t Stat of –1.886 is very far away from the right-tail rejection area. So the first rule tells us to __not__ reject the null.

To find the right-side p-value, we must recall that the area under the curve is equal to 1. Excel always gives us the left-tail p-value for one-tail tests, so we must subtract that value from 1 to get the right-tail p-value. 1 – 0.0344 = 0.966, which is much larger than 0.05, so this rule tells us to **not** reject the Null of no difference in the ratings of the two groups.
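The conversion is a one-liner; using the one-tail p-value of 0.0344 from the output:

```python
p_left = 0.0344            # one-tail (left) p-value Excel reports
p_right = 1 - p_left       # right-tail p-value, about 0.966
alpha = 0.05

reject_null = p_right < alpha   # False: do not reject
```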

Our conclusion is that the mean job satisfaction of the men is **not greater** than that of the women.

**Summary**

It is important to note that while the left-tail test gave us the power to detect the significant “less than” difference between the ratings, the right-tail test did not. That is why you need to be careful if you decide to use a one-tail test and be pretty sure of the direction of the difference. A two-tail test is a bit more conservative: it will pick up a larger difference in either direction but misses the smaller significant “less than” difference on the left side.

**Using the proper tail of the test makes all the difference.**


This is the Excel solution using slightly different values:



A random sample of 100 observations from a population with a standard deviation of 44 yielded a sample mean of 108. Test the null hypothesis that μ = 100 against the alternative that μ > 100 at an alpha of 0.05.

Here, because the alternative contains the > math operator, this is a right-tail test, and the p-value is 0.035, which is less than alpha. The decision is to reject the null and conclude that at the 5% significance level there is enough evidence to support a claim that the mean is greater than 100.
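The same test can be sketched in Python, using the error function from the standard library to get the standard normal CDF:

```python
from math import erf, sqrt

n, sigma, xbar = 100, 44, 108
mu0, alpha = 100, 0.05

se = sigma / sqrt(n)       # standard error: 4.4
z = (xbar - mu0) / se      # test statistic, about 1.82

# Right-tail p-value from the standard normal CDF
p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))   # about 0.035

reject = p_value < alpha   # True: reject the null
```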

Here is the StatCrunch solution:


A Type I error is a *false positive*, where a __true__ null hypothesis that there is nothing going on is rejected. A Type II error is a *false negative*, where a __false__ null hypothesis is __not rejected__ – something is going on, but we decide to ignore it.

In this case, the software designers were trying to optimize the ability of the car’s autonomous systems to recognize humans and other obstacles so that the car did not slow or stop too often due to things like lane pylons, trash in the gutter, or street signs on the side of the road. If that happened, the car would have a jerky, uncomfortable ride and slower average speed. Recognizing a lane pylon or sign pole as an obstacle to the car would be a false positive, a Type I error. (O’Kane, 2018)

Understand this is fuzzy, complex science, and the system designers had to integrate the information from multiple sensors. One sensor might “see” a data point as an object better in the dark than another. One system might “see” an object and project the object’s path to be stationary, while another might calculate a motion into the car’s path. Most of the systems use artificial intelligence that had to be trained using thousands of “images” that may or may not be similar enough to this victim walking a bicycle. Logically, some overarching system has to evaluate all the data, decide that there is an object that is an obstacle in a threatening path or position, and initiate evasive or braking actions.

Again, if the “master” system was too conservative in its decisions and registered “positive” for objects that were not threats, it would produce an unnecessarily jerky ride. My assumption is that the system designers set the detection program’s parameters to make false positives less likely, to ignore an object until the system was very sure it was a real obstacle to the car. At the same time, the system needed to not have false negatives, failing to recognize objects that were a threat. Finding the optimal setpoint is akin to picking a significance level.

Trying to minimize false positives is analogous to decreasing the alpha, the level of significance, so it is more difficult to reject the null that “nothing is going on here.” Because Type I and Type II errors are “connected,” as you make it more difficult to have a false positive, a Type I error, you also make it more likely you will have a false negative, a Type II error.

After much review, Uber now says the autonomous system did “see” the pedestrian and first classified her as an unknown object at 6 seconds to impact. It then thought she and her bike were a car, and finally recognized her as a person about 1.3 seconds before impact. By then it was too late for the normal control system to react, and Uber had disconnected the Volvo emergency braking system. For testing, the company had backup human observers who were supposed to recognize mistakes and take appropriate action. That backup failed at a crucial time and the autonomous system was on its own. (NTSB, 2018)

One way of thinking about this is that the sensors in the car actually detected the woman who was struck but “decided” initially the data was not sufficiently strong to register as a real obstacle and stop the car. The p-value it calculated, if you will, was greater than the alpha the system designers chose, and the system did not reject the null that “there is nothing of concern happening here.”

But the system continued to collect data, increase the sample size n so to speak, until the test statistic crossed into the rejection area, though that was too late to save the pedestrian.

The system initially made a Type II, false negative, decision and failed to reject the false null that the “object is not real.” (Marshall, 2018)

**The Key Takeaway**

My point is that when you decide on your level of significance in the real world, you must consider the costs of a mistake either way. You must evaluate the consequences of making a Type I, false positive decision or making a Type II, false negative decision and set your significance level appropriately.

Efrati, A. (2018, May 7). *Uber Finds Deadly Accident Likely Caused By Software Set to Ignore Objects On Road.* Retrieved from The Information: http://bit.ly/2KquXbH

Lee, T. (2018, May 7). *Report: Software bug led to death in Uber’s self-driving crash.* Retrieved from ARS Technica: https://arstechnica.com/tech-policy/2018/05/report-software-bug-led-to-death-in-ubers-self-driving-crash/

Marshall, A. (2018, May 29). *False Positives: Self-Driving Cars and the Agony of Knowing What Matters*. Retrieved from Wired: https://www.wired.com/story/self-driving-cars-uber-crash-false-positive-negative/?mbid=social_twitter

NTSB. (2018). *Preliminary Report Highway HWY18MH010.* Washington D.C.: National Transportation Safety Board.

O’Kane, S. (2018, May 7). *Uber reportedly thinks its self-driving car killed someone because it ‘decided’ not to swerve*. Retrieved from The Verge: https://www.theverge.com/2018/5/7/17327682/uber-self-driving-car-decision-kill-swerve

Recall that the equation for the standard error is
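In symbols:

```latex
SE = \frac{\sigma}{\sqrt{n}}
```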

where σ is the population standard deviation and n is the sample size.

One can find any number of precise “academic explanations” of why this is true, and I give my students links to those references. But I often get follow-up questions from students who look at those references and then ask for a simpler, clearer explanation that does not involve a lot of algebra.

So, I am going to attempt one here and in the companion video.

Let’s also assume that each roll of the die is a sample of size n = 1, since we just have one die. We know from basic probability that we need to look at the long term when we are looking at empirical probabilities, so let us make 10,000 rolls of our single die. We then have **10,000 samples of sample size n = 1**.

I am going to simulate doing 10,000 rolls of a single die using basic Excel. Here in Figure 1 is a screenshot of my worksheet with most of the rows hidden.

Although calculating the mean, x-bar, of a sample of 1 is a bit trivial, I do that in column D using the Excel AVERAGE function. I copy that formula down the range D3:D10002, as you can see in column E where I show the formulas in column D. For the first sample, in cell D3 the average of the value on the single die, 3, is just 3.000. I am doing this to establish the format I will use for all the remaining sample sizes from 2 dice to 30 dice.

In cell F3, using the Excel AVERAGE function over the cell range D3:D10002, I calculated the average of the sample means of the 10,000 dice rolls, and that is 3.483, which is very close to the actual population mean of 3.500 [(1+2+3+4+5+6)/6]. I also calculated the standard deviation of the means of those 10,000 samples in cell H3 using the Excel function STDEV.P over the cell range D3:D10002. That value is 1.70971, again very close to the actual population standard deviation, σ, of 1.7078. [The standard deviation of 1, 2, 3, 4, 5, 6 is 1.7078.]

I could have simulated many more throws and would likely have gotten my “averages” closer to the theoretical values, but these 10,000 samples should be enough to give you the idea of how this works.
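The same simulation can be sketched in Python instead of Excel; a different random seed will give slightly different values than the worksheet shown:

```python
import random

random.seed(1)   # any seed; each run of the simulation differs slightly

# 10,000 samples of size n = 1 (one die per sample)
rolls = [random.randint(1, 6) for _ in range(10_000)]

mean_of_means = sum(rolls) / len(rolls)   # close to the population mean 3.5
variance = sum((x - mean_of_means) ** 2 for x in rolls) / len(rolls)
std_of_means = variance ** 0.5            # close to sigma = 1.7078
```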

Next, I will repeat the simulation using 2 dice, which is a sample of n = 2. Figure 3 is an image of the Excel worksheet constructed for a sample of 2 dice as was done for the first example of one die, a sample of n = 1. You can consider this to be the sampling distribution for two dice rolled 10,000 times. In column D, I have calculated the mean, x-bar, of the two dice, which is 2.500 (cell D3) for the first sample of n = 2. In column E, I show the formulas in the adjacent cells in column D. In Cell F3, I again calculated the average of the sample means, x-bar, for those 10,000 samples of n = 2 and that value is 3.496, again close to the theoretical 3.500 population mean. This is what we expect if we believe the Central Limit Theorem is true.

Be sure to note the difference between the number of samples and the sample size, n. **Here the number of samples is 10,000, but the sample size is just n = 2.**

This next graph (Figure 4) is the relative frequency histogram of the average value of the two dice for the 10,000 samples. Note the distribution is beginning to resemble a “bell shape,” which is what should be expected if the Central Limit Theorem holds true. Also note, I increased the number of *bins* in the histogram to better display the distribution.

I repeated the Excel simulation for rolling 3 dice 10,000 times, and again for 4, 5, 6, … all the way to 30 dice being thrown 10,000 times, but I will not show all that information here in detail. I will use it for the very last part of this discussion.

And the **standard error of these 10,000 samples of n = 5 is 0.764**, calculated in cell I3. Recall that is the standard deviation of the sample means in column G, which is our sampling distribution of sample means.

Repeating the simulation using 10 dice, 20 and 30 dice, we get these three relative frequency distributions for the 10,000 sample means for samples sizes n = 10 (Figure 7), n = 20 (Figure 8), and n = 30 (Figure 9):
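All of these simulations can be condensed into one sketch: for each sample size n, roll n dice 10,000 times, take the standard deviation of the 10,000 sample means (the simulated standard error), and compare it to σ/√n:

```python
import random
from math import sqrt

random.seed(2)
SIGMA = 1.7078          # population standard deviation of one fair die
NUM_SAMPLES = 10_000

def simulated_se(n):
    """Standard deviation of 10,000 sample means, each from n dice."""
    means = [sum(random.randint(1, 6) for _ in range(n)) / n
             for _ in range(NUM_SAMPLES)]
    mbar = sum(means) / NUM_SAMPLES
    return sqrt(sum((m - mbar) ** 2 for m in means) / NUM_SAMPLES)

for n in (1, 2, 5, 10, 20, 30):
    print(n, round(simulated_se(n), 3), round(SIGMA / sqrt(n), 3))
```

Each simulated standard error lands very close to σ/√n, which is the pattern the table and scatter plot below display.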

In the table in Figure 10, I have captured the relevant data from all the simulations I ran using this method in Excel. I also show a scatter plot of the data with the calculated standard error on the y-axis and the number of dice – the sample size n – on the x-axis. The orange dots are the individual data points for the 30 simulated standard errors.

Obviously, there is not a straight-line relationship here, if you recall plotting best-fit lines from geometry or algebra. Fortunately, Excel has a great tool that will let you select from lines that could fit your data. These are known as trend lines, or regression lines, and you will learn how to calculate them later in your intro stats course. For now, trust that Excel can find a “best fit” line for these data and calculate the equation that describes it.

Figure 11 shows a “Straight-line” or “Linear” best-fit trend line (small dots), which does not fit our data (large dots) all that well, as you would expect. The equation of this line Excel calculated is shown as y = –0.0278*x + 0.9766. You can use that equation to predict values of y, the standard error, for different values of n. Of more interest now is the R² value, a measure of how well the trend line fits the data.

Of the options available in this Excel tool, the best “best fit” line to my eyes is called a “Power” function, and Figure 13 shows the graph with that trend line. The Power function Excel found to best fit our data does look pretty good, as our 30 data points fall “precisely” on the calculated trend line, and Excel gives this power-line equation an R² value very close to 1.

A Power function is an equation where the value of y is a function of x raised to a power. In this case, y is equal to 1.710 multiplied by x to the -0.5 power.

Recall from algebra that a negative exponent just means we can put x in the denominator with a positive exponent,

Raising something to the 0.5 power is the same as taking the square root. Then, this best-fit line’s equation is y = 1.710 divided by the square root of x.

This 1.710 is very, very close to the actual value of 1.7078 for the population standard deviation, σ, found earlier. If I had run even more than 10,000 samples, it is likely the numerator would have converged on sigma. Since x is our sample size, n, we have essentially shown that the standard error is sigma divided by the square root of n.
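Putting those algebra steps together, with x playing the role of n:

```latex
y = 1.710\,x^{-0.5} = \frac{1.710}{x^{0.5}} = \frac{1.710}{\sqrt{x}} \approx \frac{\sigma}{\sqrt{n}}
```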

Again, in this exercise, I used simulations, a lot of them granted, to generate the standard errors that Excel then used to calculate the power function equation. Use this practical explanation to supplement the more precise algebraic proofs. I hope this gives you a better feeling when your instructor or textbook tells you the “proof” of why we divide sigma by the square root of the sample size n is beyond the scope of your course.