Excel solution

StatCrunch solution

The guidance is that we need to use the finite population correction (FPC) when the ratio of the sample size n to the population size N is greater than 5%. For example, if the population size is 300 and the sample size is 30, the ratio is 10% and we thus need to use the FPC.

The most common formula for calculating the FPC is

FPC = √((N − n) / (N − 1))

As the population N gets large compared to the sample n, the FPC tends toward a value of 1.

If it is needed, we use the FPC to adjust the Standard Error of the analysis at hand. We do that by simply multiplying the Standard Error by the FPC value.

If we are working with a sample mean, we adjust the Standard Error of the Mean by multiplying it by the FPC. The adjusted standard error can then be used when finding critical values for hypothesis tests and confidence intervals.

If we are working with a sample proportion, we multiply the Standard Error of the Proportion by the FPC.
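To make the arithmetic concrete, here is a minimal Python sketch of the FPC calculation and the adjusted standard error of the mean. The function names and the explicit 5% cutoff check are my own; the example uses the N = 300, n = 30 figures from the text.

```python
import math

def fpc(N, n):
    """Finite population correction factor: sqrt((N - n) / (N - 1))."""
    return math.sqrt((N - n) / (N - 1))

def adjusted_se_mean(s, n, N):
    """Standard error of the mean, multiplied by the FPC when n/N > 5%."""
    se = s / math.sqrt(n)
    return se * fpc(N, n) if n / N > 0.05 else se

# The example from the text: N = 300, n = 30 gives a 10% ratio.
print(round(fpc(300, 30), 4))   # 0.9503, close to but less than 1
```

Note how the factor is just under 1 here, which is why applying it changes the answers only slightly.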

This is my Excel solution to a typical problem. Note I calculate the FPC factor first in cell B9 to simplify the process and then multiply it by the standard error in B10.

At the bottom, I show the same problem worked without the FPC factor, and you can see the change in the answers is slight because the FPC is close to 1.

Understanding what all that means can be a bit daunting. Here is my attempt at a simpler explanation. For convenience, I am just using the output from the *t-Test: Two-Sample Assuming Unequal Variances* tool, but the concepts apply to all three t-test tools.

A company surveyed a random sample of their employees on how satisfied they were with their job. The manager does not care if one group has a higher or lower rating, and only wants to know if there is a difference in how men and women rate their job satisfaction.

State the Null and Alternative hypotheses:

Null Hypothesis

Ho: Mean Rating Men = Mean Rating Women

Alternative Hypothesis

Ha: Mean Rating Men ≠ Mean Rating Women

**Important: Put the data ranges for the two groups in the tool dialog box in the same relationship as stated in the Null.** The Men group (green highlight) is on the left side of the Null equation and should be placed in the Variable 1 Range field. The Women group (red highlight) is on the right side of the Null equation and must be in the Variable 2 Range field.

Here is the output from the tool using a significance level of alpha, α, = 0.05. Note that the Men group is on the left and the Women group is on the right in the output.

We can see that the mean for the men is smaller than that of the women. But is the apparent difference “real”? Both groups have a lot of variance relative to the size of the means, 3.22 for the men and 2.83 for the women. So, the apparent difference might just be due to the “noise” of the variances.

To make our decision on rejecting (or not rejecting) the null, we can look at the three output values I have highlighted in yellow: the t statistic, the two-tail p-value, and the two-tail critical value of t.

Why two-tail? Consider this graphic:

The first rule for deciding whether to reject the null tells us to compare our test statistic to the critical value.

When we have a two-tail test, we must put half of our significance level of 5% in each tail to account for the possibility of our test statistic being either positive or negative, i.e., one sample mean being larger or smaller than the other. Putting 2.5% in each tail, we calculate a critical value of −2.042 on the left side and +2.042 on the right side.

If our test statistic, the t Stat, falls in either rejection area (less than −2.042 or greater than +2.042), we must reject the Null. But here our t Stat of −1.886 does not fall in either rejection area, so we must decide to **not** reject the Null.

Another, and perhaps easier, way to decide is to compare the two-tail p-value against our significance level. Thankfully, for a two-tail test, we can just use the p-value the Excel tool gives us. It is 0.069, which is larger than our alpha of 0.05. Thus, this rule also tells us to **not** reject the Null that there is no difference in the ratings.

Note that __the two rules always agree__, unless your technology tool is faulty, which is very rare.

For this two-tail test, we **do not reject** the Null and we conclude that **there is no statistically significant difference in the job satisfaction rating for men and women**.
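The highlighted output values can be reproduced outside Excel. Here is a hedged Python sketch using `scipy.stats.t` (assumed available); the degrees of freedom of 30 are inferred from the ±2.042 critical value, since the df row of the output is not shown in this excerpt.

```python
from scipy.stats import t

alpha, df = 0.05, 30        # df = 30 inferred from the +/-2.042 critical value
t_stat = -1.886             # t Stat from the Excel output

# Two-tail critical value: 2.5% of alpha in each tail
t_crit = t.ppf(1 - alpha / 2, df)
print(round(t_crit, 3))                      # 2.042

# Two-tail p-value: twice the area beyond |t Stat|
p_two = 2 * t.sf(abs(t_stat), df)
print(round(p_two, 3))                       # 0.069

# Decision rule: reject only if the t Stat falls in a rejection area
print(abs(t_stat) > t_crit)                  # False -> do not reject
```

Both rules agree here, just as the text says they always should.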

**Left-tail Test**

If the manager believes the men have a __lower__ mean rating than the women, we should run a left-tail test. Why not just run the two-tail test? As you will see, a one-tail test gives us more “power” to detect a real effect in the direction we believe it to be. The downside of a one-tail test is that if you guess wrong and the effect is in the other direction, the test has no power to detect it.

Null Hypothesis Ho: Mean Rating Men ≥ Mean Rating Women

Alternative Hypothesis Ha: Mean Rating Men < Mean Rating Women

Here is our output again with the one-tail values we need highlighted in yellow.

The tail of the test is always determined by the math operator in the **Alternative hypothesis**, which in this example is the *less than* symbol. Remember the less than symbol **<** points to the left, so this is a left-tail test.

Here is our left-tail graphic:

Excel reports the absolute value of the critical values. For a left-tail test, we need the negative t critical which is -1.697. You should note that the one-tail critical value is “smaller” than the two-tail value of -2.042 because we put all of the alpha in that one tail which “pushes” the critical value toward the mean.

Now, the t Stat does fall in the rejection area, so the rule says we must reject the Null hypothesis.

To use the second rule, we can use the one-tail p-value directly from the output. The one-tail p-value Excel always gives us is the left-tail p-value, the area under the curve from the left end up to our t Stat. It is 0.034, which is less than our alpha of 0.05.

You should notice that for the **same sample data**, the left-tail test had the power to reject the Null while the two-tail test did not.

So, this rule also tells us to **reject the Null** and we conclude that the **mean rating of men is significantly lower than that of women**.

**Right-tail test**

Now our manager believes the men have a __higher__ rating than the women.

Null Hypothesis Ho: Mean Rating Men ≤ Mean Rating Women

Alternative Hypothesis Ha: Mean Rating Men > Mean Rating Women

The alternative math operator is greater than > which points to the right, so this is a right-tail test.

We can use the same highlighted one-tail values.

Here is our graphic:

For a right-tail test, we are interested in what is happening on the right side of the curve. We use the positive one-tail critical value of +1.697 and find that our t Stat of −1.886 is very far away from the right-tail rejection area. So the first rule tells us to __not__ reject the Null.

To find the right-side p-value, we must recall that the total area under the curve is equal to 1. Excel always gives us the left-tail p-value for one-tail tests, so we must subtract that value from 1 to get the right-tail p-value: 1 − 0.0344 = 0.966, which is much larger than 0.05, so this rule also tells us to **not** reject the Null of no difference in the ratings of the two groups.

Our conclusion is that the mean job satisfaction of the men is **not greater** than that of the women.
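The right-tail arithmetic above can be written out in a few lines of Python; nothing here is Excel-specific, it is just the subtraction the text describes.

```python
p_left = 0.0344          # one-tail (left-tail) p-value reported by Excel

# The right-tail p-value is the remaining area under the curve
p_right = 1 - p_left
print(round(p_right, 3))   # 0.966

alpha = 0.05
print(p_right < alpha)     # False -> do not reject the Null
```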

**Summary**

It is important to note that while the left-tail test gave us the power to detect the significant “less than” difference between the ratings, the right-tail test did not. That is why you need to be careful if you decide to use a one-tail test: be quite sure of the direction of the difference. A two-tail test is a bit more conservative in that it will pick up a large enough difference in either direction, but it misses the smaller significant “less than” difference on the left side.

**Using the proper tail of the test makes all the difference.**

Support.Office. (n.d.). *Use the Analysis ToolPak to perform complex data analysis*. Retrieved from Support.Office: http://bit.ly/2XXgg6T

This is the Excel solution using slightly different values:



A random sample of 100 observations from a population with a standard deviation of 44 yielded a sample mean of 108. Test the null hypothesis that μ = 100 against the alternative that μ > 100 at an alpha of 0.05.

Here, because the alternative contains the > math operator, this is a right-tail test. The p-value is 0.035, which is less than alpha. The decision is to reject the null and conclude that at the 5% significance level there is enough evidence to support the claim that the mean is greater than 100.
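Since the population standard deviation is given, this is a z-test, and the numbers can be checked with a short Python sketch using only the standard library.

```python
import math
from statistics import NormalDist

n, sigma, xbar, mu0 = 100, 44, 108, 100

# Test statistic: how many standard errors the sample mean is above mu0
z = (xbar - mu0) / (sigma / math.sqrt(n))
print(round(z, 3))          # 1.818

# Right-tail p-value: area above z under the standard normal curve
p = 1 - NormalDist().cdf(z)
print(round(p, 3))          # 0.035, matching the solution above
```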

Here is the StatCrunch solution:


A Type I error is a *false positive*, where a __true__ null hypothesis that there is nothing going on is rejected. A Type II error is a *false negative*, where a __false__ null hypothesis is __not rejected__; something is going on, but we decide to ignore it.

In this case, the software designers were trying to optimize the ability of the car’s autonomous systems to recognize humans and other obstacles so that the car did not slow or stop too often due to things like lane pylons, trash in the gutter, or street signs on the side of the road. If that happened, the car would have a jerky, uncomfortable ride and slower average speed. Recognizing a lane pylon or sign pole as an obstacle to the car would be a false positive, a Type I error. (O’Kane, 2018)

Understand this is fuzzy, complex science and the system designers had to integrate the information from multiple sensors. One sensor might “see” a data point as an object better in the dark than another. Still another system might “see” an object and project the object’s path to be stationary while still another system might calculate a motion into the car’s path. Most of the systems use artificial intelligence that had to be trained using thousands of “images” that may or may not be similar enough to this victim walking a bicycle. Logically, some overarching system has to evaluate all the data and make a decision that there is an object that is an obstacle in a threatening path or position and initiate evasive or braking actions.

Again, if the “master” system was too conservative in its decisions and registered “positive” for objects that were not threats, it would produce an unnecessarily jerky ride. My assumption is that the system designers set the detection program’s parameters to make false positives less likely, to ignore an object until the system was very sure it was a real obstacle to the car. At the same time, the system needed to not have false negatives, failing to recognize objects that were a threat. Finding the optimal setpoint is akin to picking a significance level.

Trying to minimize false positives is analogous to decreasing alpha, the level of significance, so that it is more difficult to reject the null that “nothing is going on here.” Because Type I and Type II errors are “connected,” as you make a false positive (Type I error) more difficult, you also make a false negative (Type II error) more likely.
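A toy simulation can illustrate that connection. This Python sketch is my own illustration, not anything from Uber's system: it repeatedly tests H0: μ = 0 when the true mean is actually 0.3 (a made-up effect size) and counts how often the real effect goes undetected at two different alpha levels.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(42)
mu_true, sigma, n, trials = 0.3, 1.0, 30, 5000   # made-up effect and sizes

def type_ii_rate(alpha):
    """Fraction of trials where a real effect goes undetected (false negatives)."""
    z_crit = NormalDist().inv_cdf(1 - alpha)      # right-tail critical value
    misses = 0
    for _ in range(trials):
        xbar = mean(random.gauss(mu_true, sigma) for _ in range(n))
        z = xbar * sqrt(n) / sigma                # test statistic under H0: mu = 0
        if z <= z_crit:                           # fail to reject a false null
            misses += 1
    return misses / trials

# A stricter alpha (fewer false positives) produces more false negatives
print(type_ii_rate(0.05) < type_ii_rate(0.01))    # True
```

Lowering alpha from 0.05 to 0.01 noticeably raises the miss rate, which is exactly the tradeoff the designers faced.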

After much review, Uber now says the autonomous system did “see” the pedestrian and first classified her as an unknown object at 6 seconds to impact. It then thought she and her bike were a car, and finally recognized her as a person about 1.3 seconds before impact. By then it was too late for the normal control system to react, and Uber had disconnected the Volvo emergency braking system. For testing, the company had backup human observers who were supposed to recognize mistakes and take appropriate action. That backup failed at a crucial time and the autonomous system was on its own. (NTSB, 2018)

One way of thinking about this is that the sensors in the car actually detected the woman who was struck but “decided” initially the data was not sufficiently strong to register as a real obstacle and stop the car. The p-value it calculated, if you will, was greater than the alpha the system designers chose, and the system did not reject the null that “there is nothing of concern happening here.”

But the system continued to collect data, increase the sample size n so to speak, until the test statistic crossed into the rejection area, though that was too late to save the pedestrian.

The system initially made a Type II, false negative, decision and failed to reject the false null that the “object is not real.” (Marshall, 2018)

**The Key Takeaway**

My point is that when you decide on your level of significance in the real world, you must consider the costs of a mistake either way. You must evaluate the consequences of making a Type I, false positive decision or making a Type II, false negative decision and set your significance level appropriately.

Efrati, A. (2018, May 7). *Uber Finds Deadly Accident Likely Caused By Software Set to Ignore Objects On Road.* Retrieved from The Information: http://bit.ly/2KquXbH

Lee, T. (2018, May 7). *Report: Software bug led to death in Uber’s self-driving crash.* Retrieved from ARS Technica: https://arstechnica.com/tech-policy/2018/05/report-software-bug-led-to-death-in-ubers-self-driving-crash/

Marshall, A. (2018, May 29). *False Positives: Self-Driving Cars and the Agony of Knowing What Matters*. Retrieved from Wired: https://www.wired.com/story/self-driving-cars-uber-crash-false-positive-negative/?mbid=social_twitter

NTSB. (2018). *Preliminary Report Highway HWY18MH010.* Washington D.C.: National Transportation Safety Board.

O’Kane, S. (2018, May 7). *Uber reportedly thinks its self-driving car killed someone because it ‘decided’ not to swerve*. Retrieved from The Verge: https://www.theverge.com/2018/5/7/17327682/uber-self-driving-car-decision-kill-swerve

Recall that the equation for the standard error is

SE = σ / √n

where σ is the population standard deviation and n is the sample size.

One can find any number of precise “academic explanations” of why this is true, and I give my students links to those references. But I often get follow-up questions from students who look at those references and then ask for a simpler, clearer explanation that does not involve a lot of algebra.

So, I am going to attempt one here and in the companion video.

Let’s also assume that each roll of the die is a sample of size n = 1, since we just have one die. We know from basic probability that we need to look at the long run when working with empirical probabilities, so let us make 10,000 rolls of our single die. We then have **10,000 samples of sample size n = 1**.

I am going to simulate doing 10,000 rolls of a single die using basic Excel. Here in Figure 1 is a screenshot of my worksheet with most of the rows hidden.

Although calculating the mean, x-bar, of a sample of 1 is a bit trivial, I do that in column D using the Excel AVERAGE function. I copy that formula down the range D3:D10002, as you can see in column E where I show the formulas in column D. For the first sample, in cell D3 the average of the value on the single die, 3, is just 3.000. I am doing this to establish the format I will use for all the remaining sample sizes from 2 dice to 30 dice.

In cell F3, using the Excel AVERAGE function over the cell range D3:D10002, I calculated the average of the sample means of the 10,000 dice rolls, and that is 3.483, which is very close to the actual population mean of 3.500 [(1+2+3+4+5+6)/6]. I also calculated the standard deviation of the means of those 10,000 samples in cell H3 using the Excel function STDEV.P over the cell range D3:D10002. That value is 1.70971, again very close to the actual population standard deviation, σ, of 1.7078. [The standard deviation of 1, 2, 3, 4, 5, 6 is 1.7078.]

I could have simulated many more throws and would likely have gotten my “averages” closer to the theoretical values, but these 10,000 samples should be enough to give you the idea of how this works.

Next, I will repeat the simulation using 2 dice, which is a sample of n = 2. Figure 3 is an image of the Excel worksheet constructed for a sample of 2 dice as was done for the first example of one die, a sample of n = 1. You can consider this to be the sampling distribution for two dice rolled 10,000 times. In column D, I have calculated the mean, x-bar, of the two dice, which is 2.500 (cell D3) for the first sample of n = 2. In column E, I show the formulas in the adjacent cells in column D. In Cell F3, I again calculated the average of the sample means, x-bar, for those 10,000 samples of n = 2 and that value is 3.496, again close to the theoretical 3.500 population mean. This is what we expect if we believe the Central Limit Theorem is true.

Be sure to note the difference between the number of samples and the sample size, n. **Here the number of samples is 10,000, but the sample size is just n = 2.**
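The Excel simulation for two dice can be mirrored in a few lines of Python. The sample counts match the text; the random rolls themselves will of course differ from my worksheet's.

```python
import random
from statistics import mean, pstdev

random.seed(1)
num_samples, n = 10_000, 2        # 10,000 samples of sample size n = 2

# Roll n dice per sample and record each sample's mean (x-bar)
xbars = [mean(random.randint(1, 6) for _ in range(n))
         for _ in range(num_samples)]

print(round(mean(xbars), 2))      # close to the population mean of 3.5
print(round(pstdev(xbars), 2))    # close to 1.7078 / sqrt(2) = 1.21
```

The standard deviation of the 10,000 sample means is already noticeably smaller than the single-die value of 1.7078, which previews where this discussion is headed.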

This next graph (Figure 4) is the relative frequency histogram of the average value of the two dice for the 10,000 samples. Note the distribution is beginning to resemble a “bell shape,” which is what should be expected if the Central Limit Theorem holds true. Also note, I increased the number of *bins* in the histogram to better display the distribution.

I repeated the Excel simulation for rolling 3 dice 10,000 times, and again for 4, 5, 6, … all the way to 30 dice being thrown 10,000 times, but I will not show all that information here in detail. I will use it for the very last part of this discussion.

And the **standard error of these 10,000 samples of n = 5 is 0.764**, calculated in cell I3. Recall that is the standard deviation of the sample means in column G, which is our sampling distribution of sample means.

Repeating the simulation using 10 dice, 20 and 30 dice, we get these three relative frequency distributions for the 10,000 sample means for samples sizes n = 10 (Figure 7), n = 20 (Figure 8), and n = 30 (Figure 9):

In the table in Figure 10, I have captured the relevant data from all the simulations I ran using this method in Excel. I also show a scatter plot of the data with the calculated standard error on the y-axis and the number of dice – the sample size n – on the x-axis. The orange dots are the individual data points for the 30 simulated standard errors.

Obviously, there is not a straight-line relationship here, if you recall plotting best-fit lines from geometry or algebra. Fortunately, Excel has a great tool that will let you select from lines that could fit your data. These are known as trend lines, or regression lines, and you will learn how to calculate them later in your intro stats course. For now, trust that Excel can find a “best fit” line for these data and calculate the equation that describes it.

Figure 11 shows a “straight-line” or “linear” best-fit trend line (small dots), which does not fit our data (large dots) all that well, as you would expect. The equation of this line as calculated by Excel is y = −0.0278x + 0.9766. You can use that equation to predict values of y, the standard error, for different values of n. Of more interest now is the R² value Excel reports, which measures how well the trend line fits the data.

Of the options available in this Excel tool, the best “best fit” line to my eye is called a “Power” function, and the graph showing that trend line appears in Figure 13. The Power function Excel found to best fit our data does look pretty good, as our 30 data points fall “precisely” on the calculated trend line, and Excel gives this power-line equation an R² value very close to 1.

A Power function is an equation where the value of y is a function of x raised to a power. In this case, y is equal to 1.710 multiplied by x to the −0.5 power:

y = 1.710 · x^(−0.5)

Recall from algebra that a negative exponent just means we can move x to the denominator and use a positive exponent,

y = 1.710 / x^0.5

Raising something to the 0.5 power is the same as taking the square root. So this best-fit line’s equation is

y = 1.710 / √x

This 1.710 is very, very close to the actual value of 1.7078 for the population standard deviation, σ, found earlier. If I had run even more than 10,000 samples, the numerator would likely have converged on sigma. Since x is our sample size, n, we have essentially shown that the standard error is

SE = σ / √n
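The whole exercise, simulating standard errors for n = 1 through 30 and fitting a power law, can also be sketched in Python. Instead of Excel's trend-line tool, this fits the power function by least squares on the logarithms (a standard equivalent approach), and I use 2,000 samples per sample size rather than 10,000 to keep it quick.

```python
import math
import random
from statistics import mean, pstdev

random.seed(7)
SIGMA = 1.7078                      # population SD of a fair die
samples = 2000                      # fewer than the text's 10,000, for speed

def sim_se(n):
    """Simulated standard error: SD of the sample means for sample size n."""
    xbars = [mean(random.randint(1, 6) for _ in range(n))
             for _ in range(samples)]
    return pstdev(xbars)

ns = range(1, 31)
log_n = [math.log(n) for n in ns]
log_se = [math.log(sim_se(n)) for n in ns]

# Least-squares fit of log(SE) = b*log(n) + a, i.e. SE = e^a * n^b
nbar, sbar = mean(log_n), mean(log_se)
b = sum((x - nbar) * (y - sbar) for x, y in zip(log_n, log_se)) / \
    sum((x - nbar) ** 2 for x in log_n)
a = sbar - b * nbar

print(round(b, 2))                  # close to -0.5, the square-root exponent
print(round(math.exp(a), 2))        # close to sigma = 1.71
```

The fitted exponent lands essentially on −0.5 and the coefficient on σ, the same conclusion the Excel trend line gave.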

Again, in this exercise I used simulations, granted a lot of them, to approximate the standard errors Excel used to calculate the power function equation. Use this practical explanation to supplement the more precise algebraic proofs. I hope it gives you a better feeling when your instructor or textbook tells you the “proof” of why we divide sigma by the square root of the sample size n is beyond the scope of your course.

The process of statistics starts when we identify what group we want to study or learn something about. We call this group the population.

Note that the word “population” here (and throughout the course) is not used to refer only to people; it is used in the broader statistical sense, where a population can refer not only to people but also to animals, things, etc. For example, we might be interested in:

- the opinions of the population of U.S. adults about the death penalty; or
- how the population of mice react to a certain chemical; or
- the average price of the population of all one-bedroom apartments in a certain city.

**The population, then, is the entire group that is the target of our interest.**

In most cases, the population is so large that as much as we might want to, there is absolutely no way that we can study all of it (imagine trying to get the opinions of all U.S. adults about the death penalty…).

A more practical approach would be to examine and collect data only from a sub-group of the population, which we call a sample. We call this first component, which involves choosing a sample and collecting data from it, **Producing Data or Sampling**.

**A sample is a subset of the population from which we collect data.**

It should be noted that since, for practical reasons, we need to compromise and examine only a sub-group of the population rather than the whole population, we should try to choose a sample in such a way that it will represent the population well. This is best done using a form of random sampling.

For example, if we choose a sample from the population of U.S. adults, and ask their opinions about a federal health care program, we do not want our sample to consist of only Republicans or only Democrats.
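As a minimal sketch of simple random sampling, here is a Python example with a hypothetical population of labeled “adults” (the labels and sizes are mine, for illustration only):

```python
import random

random.seed(0)
# A toy population of 10,000 hypothetical "adults"
population = [f"adult_{i}" for i in range(10_000)]

# Simple random sampling gives every member an equal chance of selection,
# which is what makes the sample likely to represent the population well.
sample = random.sample(population, k=100)
print(len(sample))                 # 100
print(len(set(sample)))            # 100 -> sampled without replacement
```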

Once the data have been collected, what we have is a long list of answers to questions, or numbers, and in order to explore and make sense of the data, we need to summarize that list in a meaningful way.

This second component, which consists of summarizing the collected data, is called **Exploratory Data Analysis** or **Descriptive** **Statistics**.

Now we’ve obtained the sample results and summarized them, but we are not done. Remember that our goal is to study the population, so what we want is to be able to draw conclusions about the population based on the sample results.

Before we can do so, we need to look at how the sample we’re using may differ from the population, so that we can factor that into our analysis. To examine this difference, we use **Probability** which is the third component in the big picture.

**Probability **is the “machinery” that allows us to draw conclusions about the population based on the data collected in the sample.

Finally, we can use what we’ve discovered about our sample to draw conclusions about our population.

We call this final component in the process **Inference**.

This is the **Big Picture of Statistics**.

**Example: Polling Customer Opinion**

In December 2018, Fast Technologies conducted a poll of their customers to determine how they rated the quality of the Fast 4 Terra Byte Solid State external hard drive.

1. Producing Data or Sampling: A (representative) sample of 1,082 customers was chosen, and each customer was asked how many stars (up to 5) they would give the product.
2. Descriptive Statistics: The collected data were summarized, and it was found that 65% of the sampled customers gave the product 4 stars.

3 and 4. Probability and Inference: Based on the sample result (of 65% giving 4 stars) and our knowledge of probability, it was concluded (with 95% confidence) that the percentage of those rating the hard drive 4 stars in the population of Fast Technologies’ customers is within 3% of what was obtained in the sample (i.e., between 62% and 68%). The following figure summarizes the example:

This brings us back to where we started, **the population.**
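The “within 3%” margin in the polling example can be checked with the usual normal-approximation formula for a proportion's margin of error; this Python sketch assumes that standard formula, which the example itself does not spell out.

```python
import math

p_hat, n = 0.65, 1082              # 65% of the 1,082 sampled customers
z = 1.96                           # z value for 95% confidence

# Margin of error for a proportion under the normal approximation
se = math.sqrt(p_hat * (1 - p_hat) / n)
moe = z * se
print(round(moe * 100, 1))         # 2.8 percentage points, roughly the "3%"
print(round(p_hat - moe, 2), round(p_hat + moe, 2))   # 0.62 0.68
```

The interval of 62% to 68% matches the inference stated in the example.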

This material was adapted from the Carnegie Mellon University open learning statistics course available at http://oli.cmu.edu and is licensed under a Creative Commons License.
