One kind is “natural” pairings, such as spouses, siblings, and especially twins. This type of pairing is often used in medical observational research when it is difficult to construct a true experiment. (PennState, 2017)

But even more common are other types of pairing. A more accurate label for this two-sample test is a test for *dependent* samples. Samples are dependent when some relationship between them keeps them from being independent.

I like this definition from the Minitab blog:

If the values in one sample affect the values in the other sample, then the samples are dependent.

If the values in one sample reveal no information about those of the other sample, then the samples are independent. (Minitab, n.d.)

Another author states the requirement for a two-sample test for independent samples is:

“The two samples are randomly selected in an independent manner from the two target populations.” (McClave, Benson, & Sincich, 2014)

Another way of thinking about dependent vs independent samples: If there is no random process in selecting the second sample, the samples are dependent.

One example of a paired/dependent sample situation is comparing daily sales for two specific restaurants. We randomly pick 12 days from 2016 and get the sales for the two restaurants on those 12 days. Are the two samples independent? [data from (McClave, Benson, & Sincich, 2014)]

The answer is they are not. Although we randomly picked the 12 days, once we get the sales data for restaurant 1 we must get the same 12 days for restaurant 2. The second sample is not random – it is linked to the first sample.

If we mistakenly run the independent samples t-test, we get the following:

The large p-value tells us the sales for the two restaurants are not different.

But, if we correctly run the paired samples t-test, we find a small p-value:

The sales for the two restaurants are different!

Remember there are more types of “paired” samples than just before and after.
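If you happen to have Python handy, the two tests are easy to contrast with SciPy. The sales figures below are made up for illustration (they are not the textbook’s data): restaurant 2 consistently outsells restaurant 1 by a small amount, but sales vary a lot from day to day.

```python
from scipy import stats

# Hypothetical daily sales for two restaurants on the same 12 randomly
# chosen days (illustrative numbers only, not the McClave et al. data).
restaurant_1 = [1200, 800, 1500, 950, 1100, 1700, 600, 1300, 900, 1400, 1000, 1600]
restaurant_2 = [1255, 845, 1560, 1000, 1150, 1745, 660, 1350, 955, 1455, 1050, 1660]

# Wrong: treats the samples as independent -- the large day-to-day
# variation swamps the small, consistent difference between restaurants.
t_ind, p_ind = stats.ttest_ind(restaurant_1, restaurant_2)

# Right: pairs the observations by day, so only per-day differences matter.
t_rel, p_rel = stats.ttest_rel(restaurant_1, restaurant_2)

print(f"independent: p = {p_ind:.3f}")   # large p -> 'no difference'
print(f"paired:      p = {p_rel:.2e}")   # tiny p  -> difference detected
```

With this data the independent-samples test misses the difference entirely, while the paired test flags it immediately, mirroring the restaurant example above.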

P.S. The image below shows the Excel PHStat version of the two tests:

A research firm claims that the distribution of the days of the week that people are most likely to order food for delivery is different from the distribution seen in the past. You randomly select 494 people and record which day of the week each is most likely to order food for delivery. The table below shows the results of your count. At α = 0.05, test the research firm’s claim.

This sounds like a test of Goodness of Fit between the historical pattern and the observed pattern.

The claim is that the actual pattern and the historical pattern are different. That means we need the inequality math operator, which, in turn, means the **claim is the alternative hypothesis**.

Stating our two hypotheses:

H_{0}: the distribution of people ordering food for delivery is 7% Sunday, 4% Monday, 5% Tuesday, …

H_{a}: the distribution of people ordering food for delivery differs from the expected distribution.

Putting this in math equation form:

H_{0}: P_{Sunday} = 0.07; P_{Monday} = 0.04; P_{Tuesday} = 0.05; P_{Wednesday} = 0.12; P_{Thursday} = 0.11; P_{Friday} = 0.37; P_{Saturday} = 0.24

H_{a}: At least one proportion differs from those specified in H_{0}

Although the ≠ math operator normally indicates a two-tailed test, **Chi-square Goodness of Fit tests are always right-tailed tests.**

First, let’s use StatCrunch; then we will use Excel.

Remember, if you are in MyStatLab, look for the small blue rectangles near the upper right of a table. Click on them to automatically load the data into StatCrunch (and into Excel).

This is how StatCrunch looks with the data entered. I labeled the history % “Expected.”

Hint: you do not have to convert the expected % to counts; StatCrunch will do that automatically. And it is smart enough to know if the expected data is already counts (frequency) – not sure how it does this. [Note: I think I figured it out. If the total of the “expected” values = 100, StatCrunch assumes the values are percentages. If they do not equal 100, StatCrunch assumes they are counts.]

Use the command sequence **Stat > Goodness of Fit > Chi-square test**. In the **Observed:** box, select the “Frequency_f” column and in the **Expected:** box, select “Expected.” In the **Display:** box, select “**Expected**” so the expected counts will be shown. Click **Compute!**.

The results box appears. The χ² test statistic is 21.107 and the p-value is 0.0018, which is less than our alpha of 0.05.

Remember to check to make sure each of the expected frequencies is greater than 5, which they are in this problem. If any of the expected frequencies is less than 5, the Chi-square test is not valid.

If your problem requires you to find the critical value and rejection region, use the StatCrunch Chi-square calculator: **Stat > Calculators > Chi-square**. Enter the degrees of freedom, **DF**, which is the number of levels of the variable minus 1, i.e. 7 − 1 = 6 for this problem. Always select the ≥ option in the **P(x)** box. Enter alpha, 0.05 for this problem, and click **Compute**.

The critical value, χ²_{0}, is 12.59 and the rejection region is any value of χ² greater than 12.59. The χ² test statistic of 21.107 is greater than 12.59 and thus falls within the rejection region.

Using either the p-value approach or the critical value approach, we reject the null hypothesis.

Because our claim was the alternative, we conclude there is sufficient evidence to support the claim that the observed distribution of people ordering food for delivery is different from the expected pattern.
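For readers who prefer code to menus, the StatCrunch numbers above are easy to verify with Python and SciPy, using only the reported test statistic and the degrees of freedom:

```python
from scipy.stats import chi2

df = 7 - 1        # number of levels of the variable minus 1
chi_sq = 21.107   # test statistic reported by StatCrunch

p_value = chi2.sf(chi_sq, df)     # right-tail area (GOF tests are right-tailed)
critical = chi2.ppf(1 - 0.05, df)  # critical value at alpha = 0.05

print(f"p-value  = {p_value:.4f}")   # ~0.0018, less than alpha = 0.05
print(f"critical = {critical:.2f}")  # ~12.59; 21.107 is in the rejection region
```

Both numbers match the StatCrunch output, so either the p-value or the critical-value route leads to rejecting the null.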

Now, let’s do the Excel solution. This takes a bit more time, but if you save your worksheet, you can reuse it on similar problems by editing the data ranges in the formulas. Note: you can click on an image to see it full size.

We get the same results as we did when using StatCrunch.

Hope this helps!

Consider the following problem statement:

A bank auditor claims that credit card balances are normally distributed, with a mean of $2870 and a standard deviation of $900.

- What is the probability a randomly selected credit card holder has a card balance less than $2500?
- You randomly select 25 credit card holders. What is the probability that their mean card balance is less than $2500?
- Interpret the two probabilities in terms of the auditor’s claim.

I usually see students get one of the first two questions correct, but not both, and they get #1 or #2 correct in about equal proportions. When I inspect their solutions, I find they get confused over which “standard deviation” to use in the equation for z.

Most students seem to get #1 correct. They use the formula for z:

correctly interpreting the problem’s “standard deviation of $900” as the population sigma.

Here is the Excel solution for part 1. Note, I give two formulas for finding the probability. In both, “True” gives the cumulative probability from negative infinity; the left tail, in other words.

Here is the StatCrunch solution using the **Stat > Calculators > Normal **command sequence. In the dialog box, make sure the **Standard** option is active, enter the mean, sigma, x, and select the **<** to get the left tail, and then click **Compute**. I like to use StatCrunch for these types of problem since it gives a sketch as well as the probability.

In both solutions to part 1, we find that the probability of an **individual** card holder having less than a $2500 balance is 34%, which is not unusual.
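The same left-tail probability can be checked in Python with SciPy’s normal distribution:

```python
from scipy.stats import norm

mu, sigma = 2870, 900  # auditor's claimed population mean and sigma

# P(X < 2500) for one randomly selected card holder
p_individual = norm.cdf(2500, loc=mu, scale=sigma)
print(f"{p_individual:.4f}")  # ~0.3405, i.e. about 34%
```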

Let’s look at part 2 again:

- You randomly select 25 credit card holders. What is the probability that their mean card balance is less than $2500?

The mistake I see many students make is to use the population sigma in their calculations. That means they probably did not recognize that the question is about a mean for 25 randomly selected card holders. In other words, a sample.

To find z for a sample, you must use the *standard deviation of the sampling distribution of sample means*, the standard error, σ_{x̅}.

This is the formula for finding z for a sample mean, x̅:

Recall that the mean of a sampling distribution of sample means, µ_{x̅}, is the population mean, µ.

Here is the Excel solution:

Here is the StatCrunch solution, again using the Normal calculator.

I just learned a neat “trick” about the calculator: you can use Excel-like formulas in the data entry windows. Here, to get the standard error, I entered 900/SQRT(25) in the **Std. Dev.** window before I clicked **Compute**. Of course, you can use a regular calculator to find the standard error and enter that value.

In the StatCrunch graph, we can see that the $2500 sample mean balance is very far to the left of the population mean of $2870. Thus, the approximately 2% chance of getting less than $2500 for a **sample** of 25 is reasonable.

But getting a sample mean of $2500 for this population would be unusual if our standard of labeling an event unusual is a 5% chance.
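Here is the part 2 calculation in Python/SciPy; the only change from part 1 is that the standard error, 900/√25 = 180, replaces sigma:

```python
import math
from scipy.stats import norm

mu, sigma, n = 2870, 900, 25
se = sigma / math.sqrt(n)  # standard error = 900/5 = 180

# P(x-bar < 2500) for the mean of a sample of 25 card holders
p_mean = norm.cdf(2500, loc=mu, scale=se)
print(f"{p_mean:.4f}")  # ~0.02 -> unusual by the 5% standard
```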

**Another common mistake** shows up on a similar problem with a key difference in the wording.

Use the normal distribution of fish lengths for which the mean is 11 inches and the standard deviation is 4 inches. Assume the variable x is normally distributed.

- What percent of the fish are longer than 14 inches?
- If 200 fish are randomly selected, about how many would you expect to be shorter than 9 inches?

Part 1 is straightforward. We are asked about individual fish, not a sample.

This time I will use StatCrunch first so we can see the sketch.

We can see that 14 inches is to the right side of the mean of 11 inches. The area under the normal curve to the right of 14 is 0.2266, which means that about 22.7% of the fish will be longer than 14 inches.

This is the Excel solution:

Because we need the right tail, we must subtract the value returned by the NORM.S.DIST function from one. Recall the TRUE parameter gives us the cumulative area under the curve from negative infinity to our z value. Unfortunately, “False” does not give the right tail!

Students *who do not draw the sketch* often forget this important step and give an incorrect answer of 77.3%.
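If you want to double-check the right-tail arithmetic outside Excel, a short Python/SciPy sketch does it; the survival function `sf()` plays the role of the “1 −” subtraction:

```python
from scipy.stats import norm

mu, sigma = 11, 4  # fish lengths in inches

# P(X > 14): the survival function sf = 1 - cdf gives the right tail directly
p_longer = norm.sf(14, loc=mu, scale=sigma)
print(f"{p_longer:.4f}")  # 0.2266 -> about 22.7% of the fish
```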

Part 2: If 200 fish are randomly selected, about how many would you expect to be shorter than 9 inches?

I think what throws some students on part 2 is the statement “If 200 fish are randomly selected” which sounds an awful lot like it is a sample. And it is a sample of n = 200.

But the key is that **they do not ask for the mean or any other sample statistic**.

They want to know how many of the 200 fish will be shorter than 9 inches. That means we do not need to use the standard error, σ_{x̅}. We should again use sigma as the standard deviation in the StatCrunch normal calculator.

We get a probability of 30.9%, which means about 62 [200*30.9%] of the fish will be shorter than 9 inches.

Here is the Excel solution:

Because we are again interested in the left tail (see the StatCrunch sketch), we go back to our original formulas by deleting the “1-” in front of the NORM.S.DIST function. And, as with StatCrunch, we see that about 62 fish of the 200 will be shorter than 9 inches in length.
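The same count can be checked in Python/SciPy; note that sigma, not the standard error, goes into the calculation, because the question is about individual fish:

```python
from scipy.stats import norm

mu, sigma, n_fish = 11, 4, 200

# P(X < 9) for an individual fish -- no standard error here,
# because the question asks for a count, not a sample mean
p_shorter = norm.cdf(9, loc=mu, scale=sigma)
expected_count = n_fish * p_shorter
print(f"p = {p_shorter:.4f}, expect about {expected_count:.0f} fish")
```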


Consider the following problem statement:

In an investigation of the personality characteristics of drug dealers of a certain region, convicted drug dealers were scored on a scale that provides a quantitative measure of a person’s level of need for approval and sensitivity to social situations. (Higher scores indicate a greater need for approval.) Based on the study results, it can be assumed that the scale scores for the population of convicted drug dealers of the region have a mean of 44 and a standard deviation of 7. Suppose that in a sample of 96 people from the region, the mean scale score is x̅ = 46. Is this sample likely to have been selected from the population of convicted drug dealers of the region? Explain. Consider an event with a probability less than 0.05 unlikely. (McClave, Benson, & Sincich, 2014)

Solution:

First, state the question: How unusual would it be to get a sample mean of 46 if the population mean is 44 and the population standard deviation is 7?

In math form, find P(x̅ ≥ 46) for µ = 44, σ = 7. Note: use the ≥ math operator because we need to find the area under the normal curve to the right of the sample mean. Because the normal distribution is continuous, the probability of getting exactly 46 is zero.

Second, identify the data provided:

Third, draw a sketch. You can see that x̅ = 46 falls far to the right of the mean, µ = 44.

Excel solution:

Because p < α = 0.05, it would be unusual to get a randomly collected sample mean of 46 or greater if the population mean is 44 and the population standard deviation is 7.

**StatCrunch solution**: Use the command sequence **Stat > Z Stats > One Sample > With Summary**. Enter the sample mean, the standard deviation, and n. (Note: enter the population standard deviation, σ; StatCrunch calculates the sample standard error.)

Select the radio button next to **Hypothesis test for µ**, enter the null value of 44, and select **> **for the alternative H_{a}. Click **Compute!**

We get the same results as we did with Excel.
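You can verify both tools with a few lines of Python/SciPy:

```python
import math
from scipy.stats import norm

mu, sigma, n, xbar = 44, 7, 96, 46
se = sigma / math.sqrt(n)  # standard error of the mean

z = (xbar - mu) / se       # how many standard errors above the mean
p = norm.sf(z)             # right tail: P(x-bar >= 46)
print(f"z = {z:.2f}, p = {p:.4f}")  # z ~ 2.80, p ~ 0.0026 < 0.05 -> unusual
```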

Consider the following problem:

A scientist employed simple linear regression to model the monthly price of recycled newspaper as a function of the monthly price of pulpwood. The results shown below were obtained for monthly data collected over a recent 10-year period (n = 120 months).

Use this information to conduct a simple linear regression analysis:

ŷ = 35.20 + 5.28x; for testing H_{0}: β_{1} = 0, t = 2.45; r = 0.22; r^{2} = 0.05.

Solution: What is meant by “conduct a simple linear regression analysis”? We are given the regression equation, r and r^{2}, so that part is already done. The clue is we are given a t-value for testing **H_{0}: β_{1} = 0**.

The alternative hypothesis for these tests is **H_{a}: β_{1} ≠ 0**.

First, using Excel:

The StatCrunch approach: Use the command sequence **Stat > Calculators > T**. Enter the degrees of freedom, 118, the t of 2.45, select the right tail, and click **Compute**. Because this is a two-tail test, multiply the indicated p-value of 0.00788 by 2 to get p = 0.0158. Again, we decide to reject the null and conclude there is sufficient evidence of a linear relationship between x and y.

The Excel workbook can be downloaded here: Simple_regression_part3

My video version of this problem type is here https://youtu.be/cB79cOxUOt4

See Excel Spreadsheet 11.2.10 Regression line basics

The grade point averages for 12 randomly selected students are shown in the table below. Find the 99% confidence interval for the population mean, µ.

Solution:

Because this is a small sample, n < 30, use the t-distribution.

The Excel solution is:

Rounding to two decimal places, the interval is (1.22, 3.41).

For the StatCrunch solution, use the **Stat > T Stats > One Sample > With Data** sequence to open the dialog box.

Select the column containing the data, click on the radio button next to **Confidence Interval for µ**, enter the confidence level, c, and click **Compute!**

Again, we get the interval (1.22, 3.41).
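If you would like to see the mechanics behind the software, here is a Python sketch of the same t-interval. The GPA values below are hypothetical stand-ins, since the problem’s table is not reproduced here, so the resulting interval will not match (1.22, 3.41):

```python
import math
from statistics import stdev
from scipy.stats import t

# Hypothetical GPAs for 12 students -- illustrative stand-ins,
# not the problem's actual table data.
gpas = [2.1, 3.4, 2.8, 1.9, 3.0, 2.5, 3.7, 2.2, 2.9, 3.1, 2.6, 3.3]

n = len(gpas)
xbar = sum(gpas) / n
s = stdev(gpas)                      # sample standard deviation (n - 1)
t_crit = t.ppf(1 - 0.01 / 2, n - 1)  # 99% confidence -> alpha = 0.01, df = 11
margin = t_crit * s / math.sqrt(n)

print(f"99% CI: ({xbar - margin:.2f}, {xbar + margin:.2f})")
```

The critical value t.ppf returns, about 3.11 for df = 11, is the same number you would read from a t-table for a 99% interval.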


Construct the indicated confidence interval for the population mean, µ.

c = 0.90, x̅ = 16.2, s = 5.0, n = 75.

To solve it, you need to carefully inspect the data you are given. You should notice two things: you are not given the population standard deviation, σ, and n > 30.

Depending upon the author of your stats book (and your instructor), you will choose either the normal distribution or the t-distribution to solve it. Some authors say if you do not know sigma, use the t-distribution. Many authors say if n > 30, it is OK to use the normal distribution with s being approximately equal to σ. For these latter authors, the sample size, n, is used as a discriminator between small samples (n < 30) and large samples (n > 30).

We will solve it both ways and compare, but you should find out which way your textbook author leans because it might be important on a quiz.

Here is the Excel solution for the normal (z) distribution:

Here is the Excel solution for the t-distribution:

You can see the two intervals are close but different enough to cause you to miss the question even if they ask you to round to just one decimal, e.g. 14.9 vs 14.8 on the lower limits.

So, know the preference of your instructor/author. If you are in doubt, I would fall back on the rule of thumb that if n > 30, use the normal distribution. Both Larson & Farber, 4^{th}, and McClave et al, 12^{th}, use the “small vs large” concept.
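Here is a Python/SciPy sketch that computes the interval both ways from the summary statistics, so you can compare the z and t margins directly:

```python
import math
from scipy.stats import norm, t

c, xbar, s, n = 0.90, 16.2, 5.0, 75
alpha = 1 - c
se = s / math.sqrt(n)

z_margin = norm.ppf(1 - alpha / 2) * se      # normal (z) interval
t_margin = t.ppf(1 - alpha / 2, n - 1) * se  # t interval, df = 74

print(f"z interval: ({xbar - z_margin:.2f}, {xbar + z_margin:.2f})")
print(f"t interval: ({xbar - t_margin:.2f}, {xbar + t_margin:.2f})")
```

The t-margin is always a bit wider than the z-margin because the t-distribution has heavier tails, which is exactly the small difference the comparison above is about.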

Here are the StatCrunch solutions. Use the command sequence **Stat > T Stats [or Z Stats] > One Sample > With Summary**.

An article in an online magazine states that 40% of home buyers found their real estate agent through referrals by a friend. However, a professor in a local college sampled 1000 home buyers and found that 426 chose an agent recommended by a friend.

Does the data refute the claim made by the magazine? Use a significance level of 0.02.

Solution:

- First, you should recognize that this is a test about a single proportion, not a mean or other statistic.
- The claim is that the proportion of home buyers who select their real estate agent based on the recommendation of a friend is 0.40. Therefore, the claim is p = 0.40.
- Since the claim contains an equality, =, it must be the null: **Ho: p = 0.40**.
- The alternative must be the complement: **Ha: p ≠ 0.40**.
- Remember the rule of thumb is that all hypothesis tests for proportions are z-tests. But you should confirm that you can use the normal distribution by checking that both n*p and n*q are greater than 5. Here n*p = 1000*0.40 = 400 and n*q = 1000*(1-0.40) = 600. Both are > 5, therefore we can use the normal distribution.
- I recommend always sketching the situation described in the problem. Here we see that the sample count of 426 falls on the right side of the hypothesized mean of 400 for the population. Recall, the mean for a proportion is just n*p, or 0.4 * 1000. The standard deviation for a proportion is σ_{p̂} = √(p*q/n).

- Recall, the math operator in the alternative always indicates the tail of the test. In this case, the ≠ math operator indicates this is a two-tailed test.
- With a two-tailed test, you must put half of alpha, α, in each tail. Thus, α/2 = 0.02/2 = 0.010.
- For StatCrunch, we use the **Stat > Calculators > Normal** command sequence. I like to use the “Between” tool. Here, I entered 1 − α = 0.98 in the box (highlighted in green in my image though not in actuality). This is a two-tail test, therefore put α/2 in each tail.

The critical values of z, z_{α/2} (sometimes shown as z_{0}), are −2.33 and +2.33. The rejection regions are in the white areas to the left of (less than) −2.33 and to the right of (greater than) +2.33.

- We can find the z test statistic using the formula z = (p̂ − p)/√(p*q/n).

- p̂ is our estimate of p found in the sample. We find it by dividing the number of ‘successes’, 426, by n to get 0.426.
- You can solve this with your calculator to find **z = 1.678**. Because the test statistic does not fall in a rejection region, < −2.33 or > +2.33, we fail to reject the null hypothesis.

Because the null is our claim, we conclude: There is insufficient evidence to reject the claim that the proportion of home buyers who find their agents from referrals from friends is 40%.
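The whole test can be reproduced in a few lines of Python/SciPy; note that the null proportion, not p̂, goes into the standard error:

```python
import math
from scipy.stats import norm

n, successes, p0, alpha = 1000, 426, 0.40, 0.02

p_hat = successes / n              # 0.426
se = math.sqrt(p0 * (1 - p0) / n)  # use the null p in the standard error
z = (p_hat - p0) / se

z_crit = norm.ppf(1 - alpha / 2)   # two-tailed critical value
p_value = 2 * norm.sf(abs(z))      # two-tailed p-value

print(f"z = {z:.3f}, critical = +/-{z_crit:.3f}, p = {p_value:.3f}")
# z = 1.678 is inside (-2.326, +2.326) and p = 0.093 > 0.02 -> fail to reject
```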

- I like to use Excel to set up the problem so that I can save and reuse the worksheet on similar problems.

- We get the same values for the critical value of z, +/- 2.326, and the test statistic, z = 1.678.
- Remember that the p-value for a two-tail test is twice the value for the correct one-tail test. I set up for all three versions so that I can just pick the one that applies. The p-value for the two-tail test is 0.093, which is greater than alpha = 0.02. Therefore, we again decide to fail to reject the null.

- Here is the StatCrunch solution, using the **Stat > Proportion Stats > One Sample > With Summary** command sequence. Enter the number of successes, 426, and the n of 1000. Set the null, Ho, to be 0.40, make sure the math operator in the alternative, Ha, is ≠, and click **Compute!**

The results are essentially the same as Excel’s.

Remember, if you are using StatCrunch, you can quickly get the confidence interval around the sample proportion by clicking on the **Options** button in the upper left of the output box, clicking **Edit**, then selecting the radio button next to **Confidence interval for p**, and clicking **Compute!**