In the last lesson, you learned about the Two-way Chi-square Test for Independence. Using it, you can determine if two categorical variables are independent. If two variables are not independent, they are related. Knowing something about one variable can tell you something about the other variable.

In this lesson, you will explore this idea of variables being related, of not being independent, a bit further. Categorical variables that are related do not have a __linear__ relationship, for a number of reasons. One primary reason is that categorical variables are discrete, not continuous. This means that values of a categorical variable “jump” from one level to another rather than changing smoothly.

For example, if a survey has captured a person’s Age in buckets with 10-year intervals [e.g. 10 to 20, 21 to 30, 31 to 40,…], Age becomes a discrete categorical variable instead of a continuous variable. If our Two-way Chi-square Test for Independence between Age and Gender in the survey data is statistically significant, we can only conclude that the variables Age and Gender are related. We cannot __as easily__ predict a survey participant’s Age given their Gender.

If you are considering two continuous variables, you can use their __linear__ relationship, their __correlation__, to develop an equation you can use to predict values of one variable given a value of the other. You can use __linear regression__ to do that. In this lesson, you will learn about linear regression using both one and multiple __predictor__ variables to develop an equation you can use to estimate the value of a continuous response variable.

And although its predictions are a bit more complicated, you can use __logistic regression__ to predict values of categorical response variables.

**Correlation**

Consider the following data table, which shows a sample of 40 teenagers’ height in inches and American shoe size:

Is there a relationship between Height and Shoe Size? Common sense would suggest Yes, but how can you test whether there is a relationship that can be used for prediction?

You should always start data analysis by graphing the data. Here is a scatter chart (also known as an x-y chart) of these data, created in Excel:

Using Excel to plot the data, you can easily add a trend line, which is a “best fit” line. There is an obvious pattern indicating that Shoe Size increases as Height increases.

This relationship can be quantified by calculating the __correlation__ between the two variables. The sample __correlation coefficient__, r, can be calculated using the Excel function CORREL as shown in this image:

Note several rows of data (rows 5 through 37) have been hidden.

The __sample__ correlation coefficient, r, is 0.899. [The Greek letter rho, **ρ**, is used to indicate the __population__ correlation coefficient.]
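If you want to check a CORREL result outside Excel, r can be computed directly from its definition. Here is a minimal Python sketch; the height/shoe-size pairs are made up for illustration and are not the lesson’s actual 40-teenager sample:

```python
import math

# Illustrative height (inches) / shoe-size pairs -- hypothetical data,
# not the 40-teenager sample shown in the lesson's table.
heights = [60, 62, 63, 65, 66, 68, 69, 70, 71, 72]
shoe_sizes = [7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.0, 10.5, 11.0, 11.5]

def pearson_r(xs, ys):
    """Sample correlation coefficient r (the value Excel's CORREL returns)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means...
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # ...divided by the square root of the product of the sums of squares.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(heights, shoe_sizes)  # close to +1: a strong positive correlation
```

Note that r is symmetric: correlating Height with Shoe Size gives the same value as correlating Shoe Size with Height.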

The correlation coefficient can be negative, indicating that the y-variable decreases as the x-variable increases, or positive, indicating that the y-variable increases as the x-variable increases. The most negative value r can take is -1; the most positive is +1. An r of 0 indicates no correlation.

The following graphs, which illustrate the range of values r can take on, were created using random numbers to generate x-y pairs with the indicated correlation coefficient r.

Alt-text: six scatter plots showing different correlation coefficients, ranging in absolute value from 0.06 to 0.98. Generally, the closer the data points are to the trend line, the stronger the correlation and the larger r becomes.

Original image by D. Wright

Although the sign of r tells us whether the slope of the trend line is positive (up) or negative (down), it is important to realize that the correlation coefficient r is not the mathematical slope of the trend line. The correlation coefficient r tells us the strength of the relationship between the two variables. In general, the closer the data points are to the trend line, the stronger the correlation.

**Strength of Correlation**

The strength of the correlation is equal to the __absolute value of the correlation coefficient r__. The following table gives you a way to think of the strength of the correlation between two variables based on the absolute value of r.
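As a sketch of how a strength table like this is applied, here is a small Python helper. The cut-points below are illustrative values of the kind such tables use; the lesson’s own table may draw its boundaries slightly differently:

```python
def correlation_strength(r):
    """Verbal strength label for a correlation coefficient.
    Cut-points are illustrative, not the lesson's exact table."""
    a = abs(r)  # strength depends on |r|, not on the sign
    if a >= 0.9:
        return "very strong"
    elif a >= 0.7:
        return "strong"
    elif a >= 0.5:
        return "moderate"
    elif a >= 0.3:
        return "weak"
    else:
        return "negligible"

label = correlation_strength(0.899)  # the Height/Shoe Size r from above
```

The sign is discarded first because a correlation of -0.9 is just as strong as one of +0.9; the sign only tells you the direction of the relationship.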

**Correlation does not prove Causation**

Regardless of the strength of the correlation, __correlation alone does not prove causation__. Remember, there may be other factors at play that are causing the two variables to move together. This image is from a fun website called Spurious Correlations [http://www.tylervigen.com/spurious-correlations] created by Tyler Vigen. Here is one plot from that site:

Alt-text: x-y chart showing plots of data for the divorce rate in Maine each year from 2000 to 2009 and also the per capita margarine consumption in pounds per person.

Image adapted from tylervigen.com, Spurious Correlations. [Copyleft notice: You are free to copy and adapt everything on this site. tylervigen.com] (Vigen, n.d.)

Even though r is large, 0.993, it should be obvious that there is no causal relationship between the Divorce Rate in Maine and the US per capita consumption of margarine. Showing causation requires more than just a correlation. See Hill’s Criteria for Causation for more information (Fedak, Bernal, Capshaw, & Gross, 2015) [https://dx.doi.org/10.1186%2Fs12982-015-0037-4] (about a 20-minute read).

**Getting the equation of the trend line**

Rather than calculating the correlation coefficient directly using the method shown above, you can get it by running a __linear regression__, which will calculate and test the linear correlation between two variables and also determine the equation of the trend line, which can be used to predict values of y given a value of x. The regression calculates the slope and y-intercept of the trend line, which are needed for the equation of the line.
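As a sketch of what the regression computes, here are the least-squares slope and y-intercept in Python, equivalent to Excel’s SLOPE and INTERCEPT functions. The data points are hypothetical:

```python
def fit_line(xs, ys):
    """Least-squares trend line y = slope * x + intercept
    (the values Excel's SLOPE and INTERCEPT return)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return slope, intercept

# Hypothetical height/shoe-size pairs, not the lesson's sample:
slope, intercept = fit_line([60, 64, 68, 72], [7.0, 8.5, 9.5, 11.0])
predicted = slope * 66 + intercept  # predicted shoe size at 66 inches
```

Once you have the slope and intercept, predicting y for any x is one multiplication and one addition, which is exactly what the trend-line equation is for.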

For a great visual explanation of correlation, check out this StatQuest video (https://youtu.be/xZ_z8KWkhXE, 19 min.) to reinforce your understanding.

Fedak, K., Bernal, A., Capshaw, Z., & Gross, S. (2015). Applying the Bradford Hill criteria in the 21st century: how data integration has changed causal inference in molecular epidemiology. *Emerging Themes in Epidemiology*, 12:14. doi: https://dx.doi.org/10.1186%2Fs12982-015-0037-4

Vigen, T. (n.d.). *Spurious Correlations*. Retrieved from tylervigen.com: http://www.tylervigen.com/spurious-correlations


The word “correlate” has several meanings, e.g., “If two or more facts, numbers, etc. correlate or are correlated, there is a relationship between them” (Cambridge Dictionary, n.d.); “to show that a close connection exists between (two or more things)” (Merriam-Webster, 2019); and “to show that two things are connected” (Macmillan, n.d.).

In statistics, we often think of __correlation__ as the __linear__ relationship between two variables (Mukaka, 2012). In that case, we can calculate a numerical value for the linear correlation, which we call the *correlation coefficient* R. But more generally, __correlation is dependence__ or association between two variables, in __linear or non-linear__ relationships (Brownlee, 2019; Minitab, 2019; Wikipedia, 2019). We commonly use Pearson’s correlation coefficient, which is only sensitive to linear relationships (Wikipedia, 2019). Another is Spearman’s rank correlation, which is more sensitive to non-linear relationships. Spearman’s “measures the extent to which, as one variable increases, the other increases [or decreases] without requiring that increase to be represented by a linear relationship” (Wikipedia, 2019).

Although some balk at saying a significant Chi-square test for independence shows two variables are correlated, it does. That is not to say the Chi-square test directly provides a mathematical value for a __correlation statistic__ equivalent to either Pearson’s or Spearman’s. It does not.

Brownlee, J. (2019, Aug 14). *How to Calculate Correlation Between Variables in Python*. Retrieved from Machine Learning Mastery: https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/

Cambridge Dictionary. (n.d.). *correlated*. Retrieved from Cambridge Dictionary: https://dictionary.cambridge.org/us/dictionary/english/correlated

Macmillan. (n.d.). *correlate*. Retrieved from Macmillan Dictionary: https://www.macmillandictionary.com/us/dictionary/american/correlate_1

Merriam-Webster. (2019, Dec 1). *correlation*. Retrieved from Merriam-Webster.com: https://www.merriam-webster.com/dictionary/correlation#other-words

Minitab. (2019). *Select the method for Correlation*. Retrieved from Minitab 18 Support: https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/basic-statistics/how-to/correlation/perform-the-analysis/select-the-method/

Mukaka, M. (2012, Sept). *A guide to appropriate use of Correlation coefficient in medical research*. Retrieved from Malawi Medical Journal: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3576830/

Wikipedia. (2019, Nov 22). *Correlation and dependence*. Retrieved from Wikipedia: https://en.wikipedia.org/wiki/Correlation_and_dependence


The **Empirical Rule** applies to a normal, bell-shaped curve that is symmetrical about the mean. It states that within one standard deviation of the mean (both left-side and right-side) there is __about__ 68% of the data; within two standard deviations of the mean there is __about__ 95% of the data; and within three standard deviations of the mean there is __about__ 99.7% of the data.

In the image below, we use the Greek letter sigma, σ, which is the symbol for the population standard deviation. The mean is the middle of the distribution and is indicated by the 0. Remember the Empirical Rule says “about” 68% of the data is between +1 and -1 sigma, σ. This is closer to 68.3%, but even that is still approximate.

And between +2 and -2 standard deviations, σ, we have approximately 95.4%.

Finally, between +3 and -3 σ, we have approximately 99.7%.

Because there must be 100% of the data under the curve, that tells us there must be 100 – 99.7 = 0.3% split between the two tails beyond +3 and -3. Thus, there is approximately 0.15% to the right of +3 and 0.15% to the left of -3 σ.

We can use the Empirical Rule to find percentages and probabilities.

Solution: 130 minus the mean of 100 = 30, which is 2 times 15, the standard deviation. Thus, 130 is 2 standard deviations to the right of the mean. 100 – 70 = 30 which is 2 times 15. Thus, 70 is 2 standard deviations to the left of the mean. Since 70 to 130 is within 2 standard deviations of the mean, we know that __about__ 95.4% of the IQ scores would be between 70 and 130.

In this example, we calculated the **z-scores** of the two IQs of interest. Here is the formula for finding a z-score:

The individual value we are interested in is “x.” The mean of the population is the Greek letter Mu, μ. And the standard deviation of the population is Sigma, σ.
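The steps above can be sketched directly in code, using the lesson’s IQ setting of mean 100 and standard deviation 15:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations x lies from the mean mu."""
    return (x - mu) / sigma

# IQ example from the lesson: mean 100, standard deviation 15.
z_130 = z_score(130, 100, 15)  # +2.0: two standard deviations above the mean
z_70 = z_score(70, 100, 15)    # -2.0: two standard deviations below the mean
```

By the Empirical Rule, about 95% of scores fall between these two z-scores.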

Here is the problem solved by first finding the z-scores of the two IQs we are interested in:

__Note: you can have positive and negative z-scores.__

Converting an x value of interest into a z-score allows you to more easily use the Empirical Rule because the __z-score is the number of standard deviations from the mean__.

Solution: Because the area under the bell curve represents 100% of the possible data points, it also represents probability. Recall that the probability of an event is the number of ways the event can happen divided by the total number of outcomes. Thus, an area under the bell curve equal to 10% is also a 10% probability of “x” falling in that area.

First, find the z-scores of the two IQs of interest:

So, we need the area under the curve between z = -1 and z = +1.

Thus, there is __about__ 68.3% probability a randomly selected individual will have an IQ between 85 and 115.

Looking at the Empirical Rule graph above, we want the probability of an IQ score to the right of, greater than, 115, because that represents all the IQ scores above 115. That means we are interested in IQs with z scores greater than +1. We know that about 68.3% of the scores are between -1 and +1 z. Because the bell curve is symmetrical, that gives us 100 – 68.3 = 31.7% split evenly between the two gold areas, Area 1 and Area 2. Thus, the probability of an individual having an IQ greater than 115 is Area 2, which is 31.7 divided by 2 = 15.85 or about 16%.

And by extension, the probability of a randomly selected individual having an IQ score __less than 85__ is also about 16%, which is Area 1.

Here is another image of the **Empirical Rule** that may help you solve z-score probability problems.

Thus, an IQ greater than 130 would have a z-score greater than +2. To find the probability, just add the areas to the __right __of +2:

2.1 + 0.15 = 2.25% or about 2.3%

Add the areas to the __right__ of +1 z: 13.6 + 2.1 + 0.15 = 15.85 or about 16%.

Add the areas to the __right__ of – 1 z: 34.1 + 34.1 + 13.6 + 2.1 + 0.15 = 84.05 or about 84%.

Add the areas to the __left__ of + 2 z: 13.6 + 34.1 + 34.1 + 13.6 + 2.1 + 0.15 = 97.65 or about 98%.

Or because the bell curve is symmetrical, add the areas to the __right__ of +2 and subtract from 100%.

2.1 + 0.15 = 2.25. 100 – 2.25 = 97.75, or about 98%. (The slight difference from 97.65 comes from rounding in the graph’s areas.)

The Empirical Rule gives __approximate values__, and the values in the graph are rounded off. So, you would look for the __closest__ answer on a multiple-choice quiz.

Problem

Excel solution

StatCrunch solution

The Finite Population Correction Factor, sometimes just called the FPC factor, is used when the sample size is large relative to the population size. For most situations, the population is so large that typical sample sizes are far too small to require the FPC.

The guidance is that we need to use the FPC when the ratio of the sample size n to the population size N is greater than 5%. For example, if the population size is 300 and the sample size is 30, we have a ratio of 10% and thus need to use the FPC.

The most common formula for calculating the FPC is
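In symbols, with N the population size and n the sample size, that standard formula is:

```latex
\mathrm{FPC} = \sqrt{\frac{N - n}{N - 1}}
```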

As the population N gets large compared to the sample n, the FPC tends toward a value of 1.

If it is needed, we use the FPC to adjust the Standard Error of the analysis at hand. We do that by simply multiplying the Standard Error by the FPC value.

If we are working with a sample mean, we adjust the Standard Error of the Mean by multiplying it by the FPC. It can be used to adjust the standard error when we are finding critical values such as for tests of hypothesis and confidence intervals.

If we are working with a sample proportion, we multiply the Standard Error of the Proportion by the FPC.
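Here is a minimal Python sketch of the adjustment for a sample mean, reusing the N = 300, n = 30 example above; the sample standard deviation is hypothetical:

```python
import math

def fpc(N, n):
    """Finite Population Correction factor."""
    return math.sqrt((N - n) / (N - 1))

N, n = 300, 30      # n/N = 10%, which exceeds the 5% guideline
s = 12.0            # hypothetical sample standard deviation
se = s / math.sqrt(n)          # Standard Error of the Mean
se_adjusted = se * fpc(N, n)   # FPC-corrected standard error
```

Because the FPC is always less than 1 when n > 1, the corrected standard error is always slightly smaller than the uncorrected one.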

This is my Excel solution to a typical problem. Note I calculate the FPC factor first in cell B9 to simplify the process and then multiply it by the standard error in B10.

On the bottom, I show the same problem worked without the FPC factor, and you can see the change in the answers is slight because the FPC is close to 1.

Excel’s Data Analysis ToolPak has three tools for running tests of hypotheses using the t-distribution – t-tests. The output from the tools can be a bit confusing because, unlike other statistical software, these do not allow you to specify the “tail of the test” before you run the analysis. Here is how Microsoft explains how to interpret the output:

“Under the assumption of equal underlying population means, if t < 0, “P(T <= t) one-tail” gives the probability that a value of the t-Statistic would be observed that is more negative than t. If t >=0, “P(T <= t) one-tail” gives the probability that a value of the t-Statistic would be observed that is more positive than t. “t Critical one-tail” gives the cutoff value, so that the probability of observing a value of the t-Statistic greater than or equal to “t Critical one-tail” is Alpha.

“P(T <= t) two-tail” gives the probability that a value of the t-Statistic would be observed that is larger in absolute value than t. “t Critical two-tail” gives the cutoff value, so that the probability of an observed t-Statistic larger in absolute value than “t Critical two-tail” is Alpha.”

Understanding what all that means can be a bit daunting. Here is my attempt at a simpler explanation. For convenience, I am just using the output from the *t-test: Two-Sample Assuming Unequal Variances*, but the concepts apply to all three t-test tools.

We will start with the most common test, the two-tail test.

A company surveyed a random sample of its employees on how satisfied they were with their job. The manager does not care if one group has a higher or lower rating, and only wants to know if there is a difference in how men and women rate their job satisfaction.

State the Null and Alternative hypotheses:

Null Hypothesis Ho: Mean Rating Men = Mean Rating Women

Alternative Hypothesis Ha: Mean Rating Men ≠ Mean Rating Women

**Note: The tail of the test is indicated by the math operator in the Alternative. Here, not equal (****≠) does not “point” to either side, so this is a two-tail test.**

**Important: Put the data ranges for the two groups in the tool dialog box in the same relationship as stated in the Null**. The Men group (green highlight) is on the left side of the Null equation and should be placed in the Variable 1 Range field. The Women group (red highlight) is on the right side of the Null equation and must be in the Variable 2 Range field. [Your data can be anywhere in your worksheet, but it may be better to arrange it in the same relationship there as well.]

Here is the output from the tool using a significance level of alpha, α, = 0.05. Note that the Men group is on the left and the Women group is on the right in the output.

We can see that the mean for the men is smaller than that of the women. But is the apparent difference “real”? Both groups have a lot of variance relative to the size of their means: 3.22 for the men and 2.83 for the women. So, the apparent difference might just be due to the “noise” of the variances.

To make our decision on rejecting (or not rejecting) the null, we can look at the three output values I have highlighted in yellow: the t statistic, the two-tail p-value, and the two-tail critical value of t.

Why two-tail? Consider this graphic:

The first rule for deciding whether to reject the null tells us to compare our test statistic to the critical value.

When we have a two-tail test, we must put half of our 5% significance level in each tail to account for the possibility of the test statistic being either positive or negative, i.e., one sample mean being larger or smaller than the other. Putting 2.5% in each tail, we calculate a critical value of -2.042 on the left side and +2.042 on the right side.

If our test statistic, the t Stat, falls in either rejection area (less than -2.042 or greater than +2.042), we must reject the Null. But here our t Stat of -1.886 does not fall in either rejection area, so we decide **not** to reject the Null.

Another, perhaps easier, way to decide is to compare the two-tail p-value against our significance level. Thankfully, for a two-tail test we can use the p-value the Excel tool gives us directly. It is 0.069, which is larger than our alpha of 0.05. Thus, this rule also tells us **not** to reject the Null that there is no difference in the ratings.

Note that __the two rules always agree__, unless your technology tool is faulty, which is very rare.

For this two-tail test, we **do not reject** the Null and we conclude that **there is no statistically significant difference in the job satisfaction rating for men and women**.
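For readers working outside Excel, the “t Stat” this tool reports is Welch’s t statistic for two samples with unequal variances. Here is a minimal Python sketch, using hypothetical ratings rather than the survey’s actual data:

```python
import math
import statistics

def welch_t(sample1, sample2):
    """Two-sample t statistic assuming unequal variances (Welch's t),
    the value Excel's tool reports as 't Stat'."""
    n1, n2 = len(sample1), len(sample2)
    v1 = statistics.variance(sample1)  # sample variances (n - 1 denominator)
    v2 = statistics.variance(sample2)
    se = math.sqrt(v1 / n1 + v2 / n2)  # standard error of the difference
    return (statistics.mean(sample1) - statistics.mean(sample2)) / se

# Hypothetical 1-10 satisfaction ratings, not the survey's data:
men = [6, 7, 5, 8, 6, 7, 5, 6]
women = [7, 8, 7, 9, 8, 7, 8, 9]
t_stat = welch_t(men, women)  # negative, since the men's mean is lower
```

Keeping the Variable 1 / Variable 2 order matching the Null matters here too: swapping the two samples flips the sign of t.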

If the manager believes the men have a __lower__ mean rating than the women, we should run a left-tail test. Why not just run the two-tail test? As you will see, a one-tail test gives us more “power” to detect a real effect in the direction we believe it to have. The downside of a one-tail test is that if you guess wrong and the effect is in the other direction, the test has no power to detect it.

Null Hypothesis Ho: Mean Rating Men >= Mean Rating Women

Alternative Hypothesis Ha: Mean Rating Men < Mean Rating Women

Here is our output again with the one-tail values we need, highlighted in yellow.

The tail of the test is always determined by the math operator in the **Alternative hypothesis, **which in this example is the *less than* symbol. Remember the less than symbol **<** points to the left, so this is a left-tail test.

Here is our left-tail graphic:

**Excel reports the absolute value of the critical values.** You can also think of it as the right-tail critical value because it is positive.

For a left-tail test, we need the negative critical value, -1.697. You should note that the one-tail critical value is “smaller” in magnitude than the two-tail value of -2.042 because putting all of the alpha in one tail “pushes” the critical value toward the mean.

Now, the t Stat does fall in the rejection area, so the rule says we must reject the Null hypothesis.

To use the second rule, we need to determine the p-value.

Here, Excel’s output can be confusing. If the __t Stat is positive__, the reported __p-value is the right tail__ – the probability of getting a value for t-stat that is more positive. __If the t Stat is negative, the p-value is for the left tail__ – the probability of getting a value for t-stat that is more negative.

Here, the t-stat is negative, so the p-value **is** for the left tail test. It is 0.034 which is less than our alpha of 0.05.

You should notice that for the **same sample data**, the left-tail test had the power to reject the Null while the two-tail test did not.

So, this rule also tells us to **reject the Null** and we conclude that the **mean rating of men is significantly lower than that of women**.

We will use an example comparing the start time of a hospital procedure with the time the last required technician arrives in the surgical suite. In this example, the doctors are claiming that they are waiting for the technicians to arrive.

The null and alternative are:

Null Hypothesis Ho: Mean Procedure Start Time >= Mean Technician Ready Time

Alternative Hypothesis Ha: Mean Procedure Start Time < Mean Technician Ready Time

In plainer language, the Null says the technicians are ready before the doctors need them; the Alternative says the technicians are not ready when the doctors need them. In the data, time is reported as hours past midnight. For example, the mean Procedure Start time of 7.196 hours is approximately 7:12 AM.

Because the t Stat is positive, the reported one-tail p-value is for the right-tail test. To use it for the left-tail test here, we need its complement: 1 – 0.0431 = 0.9569. That is much larger than 0.05, so this method tells us not to reject the Null.
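This sign rule can be captured in a pair of small helpers; the numbers in the example echo this problem’s output:

```python
def left_tail_p(t_stat, excel_one_tail_p):
    """Left-tail p-value from Excel's 'P(T<=t) one-tail' output.
    Excel reports the tail on the same side as the sign of t Stat."""
    return excel_one_tail_p if t_stat < 0 else 1 - excel_one_tail_p

def right_tail_p(t_stat, excel_one_tail_p):
    """Right-tail p-value from the same output."""
    return excel_one_tail_p if t_stat >= 0 else 1 - excel_one_tail_p

# t Stat = +1.7722 and one-tail p = 0.0431, as in this example:
p_left = left_tail_p(1.7722, 0.0431)   # 0.9569 -- far above alpha of 0.05
```

Taking the complement only when the sign of the t Stat points away from the tail you are testing is exactly the adjustment described in the text.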

Remember the critical value for the left tail is -1.6956. We change the sign because Excel always reports the absolute value of the critical value, which is equivalent to the right-tail critical value. Our t Stat of +1.7722 does not fall in the rejection area to the left of -1.6956, so this method also, as you should expect, tells us not to reject the Null.

We conclude that the mean procedure start time is not less than (i.e., not earlier than) the mean technician ready time. The doctors’ claim is not supported by the data. The doctors do not have to wait because of a late-arriving technician.

**Example 1**: Now our manager believes the men have a __higher__ rating than the women.

Null Hypothesis Ho: Mean Rating Men <= Mean Rating Women

Alternative Hypothesis Ha: Mean Rating Men > Mean Rating Women

Remember the tail of the test is indicated by the math operator in the Alternative. Here, the Alternative math operator is greater than > which points to the right, so this is a right-tail test.

We can use the same highlighted one-tail values.

Here is our graphic:

For a right-tail test, we are interested in what is happening on the right side of the curve. We use the positive one-tail critical value of +1.697 and find that our t Stat of -1.866 is very far from the right-tail rejection area. So, the first rule tells us __not__ to reject the Null.

To find the right-side p-value, we must recall that the total area under the curve is equal to 1. Because the t Stat is negative, Excel’s one-tail p-value is the left-tail probability, so we must subtract it from 1 to get the right-tail p-value: 1 – 0.0344 = 0.966, which is much larger than 0.05. So this rule also tells us **not** to reject the Null of no difference in the ratings of the two groups.

Our conclusion is that the mean job satisfaction of the men is ** not greater** than that of the women.

**Let’s look at Example 2, where we have a positive t Stat in a right-tail test.**

Recall this example is comparing the start of a hospital procedure time with the arrival time of the last required technician. The doctors are claiming they have to wait for the technicians to arrive.

The null and alternative are:

Null Hypothesis Ho: Mean Procedure Start Time <= Mean Technician Ready Time

Alternative Hypothesis Ha: Mean Procedure Start Time > Mean Technician Ready Time

In this example, the doctors’ claim is the Null: that their mean procedure start time is not greater than the mean technician ready time.

Using the p-value method, we see the t Stat is positive. That means the Excel p-value is for the right-tail test, and we can use it directly to decide to reject the Null: the p-value of 0.0431 < 0.05.

And our t-stat of 1.772 is greater than the right tail critical value of 1.696, so that too tells us to reject the Null and conclude that, on average, the technicians are in place and ready before the mean procedure start time.

It is important to note that while the left-tail test in the ratings example had the power to detect the significant “less than” difference, the right-tail test did not. That is why you need to be careful if you decide to use a one-tail test: be quite sure of the direction of the difference. A two-tail test is a bit more conservative in that it will pick up a large enough difference in either direction, but it missed the smaller, significant “less than” difference on the left side here.

**Using the proper tail of the test makes all the difference.**

Support.Office. (n.d.). *Use the Analysis ToolPak to perform complex data analysis*. Retrieved from Support.Office: http://bit.ly/2XXgg6T