The Big Picture of Statistics

The process of statistics starts when we identify what group we want to study or learn something about. We call this group the population.

Alt-text: Pictorial representation of a population

Note that the word “population” here (and throughout the course) does not refer only to people; it is used in the broader statistical sense, where a population can also consist of animals, things, etc. For example, we might be interested in:

  • the opinions of the population of U.S. adults about the death penalty; or
  • how the population of mice reacts to a certain chemical; or
  • the average price of the population of all one-bedroom apartments in a certain city.

The population, then, is the entire group that is the target of our interest.

In most cases, the population is so large that, however much we might want to, we cannot study all of it (imagine trying to get the opinions of all U.S. adults about the death penalty…).

A more practical approach would be to examine and collect data only from a sub-group of the population, which we call a sample. We call this first component, which involves choosing a sample and collecting data from it, Producing Data or Sampling.

Alt-text: Producing data is visualized as taking a subset of the population in order to define the current sample to be used.

A sample is a subset of the population from which we collect data.

Since, for practical reasons, we must compromise and examine only a sub-group of the population rather than the whole population, we should choose the sample so that it represents the population well. This is best done using a form of random sampling.

For example, if we choose a sample from the population of U.S. adults, and ask their opinions about a federal health care program, we do not want our sample to consist of only Republicans or only Democrats.
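As a rough illustration of random sampling, the sketch below draws a simple random sample with Python's standard `random` module. The population of 10,000 numbered customers and the fixed seed are assumptions made up for this illustration; they do not come from the course.

```python
import random

# Hypothetical population: 10,000 customers identified by number
# (an illustrative assumption, not data from the course).
population = list(range(10_000))

# A fixed seed makes the sketch reproducible from run to run.
rng = random.Random(42)

# Simple random sample without replacement: every customer is equally
# likely to be chosen, and no customer is chosen twice.
sample = rng.sample(population, k=1_082)

print(len(sample))       # number of customers in the sample
print(len(set(sample)))  # number of distinct customers in the sample
```

Because every member of the population has the same chance of being selected, no group (such as supporters of one party) is systematically favored, which is what makes the sample likely to represent the population well.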

Once the data have been collected, what we have is a long list of numbers or answers to questions. To explore and make sense of the data, we need to summarize that list in a meaningful way.

This second component, which consists of summarizing the collected data, is called Exploratory Data Analysis or Descriptive Statistics.
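To make "summarizing the list" concrete, here is a minimal sketch that turns a list of raw answers into counts and proportions. The ten star ratings are invented for illustration only.

```python
from collections import Counter

# Hypothetical raw data: star ratings given by ten sampled customers
# (made-up values, used only to illustrate the idea).
ratings = [4, 5, 4, 3, 4, 4, 2, 5, 4, 4]

n = len(ratings)
counts = Counter(ratings)  # how many customers gave each rating

# Proportion of the sample that gave each rating, lowest rating first.
proportions = {stars: counts[stars] / n for stars in sorted(counts)}

# Share of the sample that gave exactly 4 stars.
share_four = counts[4] / n

for stars, share in proportions.items():
    print(f"{stars} stars: {share:.0%}")
```

A long list of individual answers is reduced to a handful of numbers that describe the sample, which is the essence of this second component.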

Alt-text: Exploratory Data Analysis is performed on the data which is a subset of the population, our sample.

Now we’ve obtained the sample results and summarized them, but we are not done. Remember that our goal is to study the population, so what we want is to be able to draw conclusions about the population based on the sample results.

Before we can do so, we need to look at how the sample we’re using may differ from the population, so that we can factor that into our analysis. To examine this difference, we use Probability, which is the third component in the big picture.

Alt-text: The data and summarization of the data created from data analysis are examined using probability, which is the first step in allowing us to draw conclusions about the population based on the data.

Probability is the “machinery” that allows us to draw conclusions about the population based on the data collected in the sample.
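One way to see how a sample may differ from its population is to simulate many samples from a population whose true proportion is known. The sketch below assumes a hypothetical population in which 65% hold some opinion; the sample size, the number of repetitions, and the seed are likewise illustrative choices.

```python
import random

rng = random.Random(0)  # fixed seed so the simulation is reproducible

TRUE_SHARE = 0.65    # hypothetical population proportion (an assumption)
SAMPLE_SIZE = 1082   # size of each simulated sample
NUM_SAMPLES = 500    # how many samples to draw


def sample_proportion(rng, p, n):
    """Draw n individuals, each holding the opinion with probability p,
    and return the proportion of the sample that holds it."""
    hits = sum(rng.random() < p for _ in range(n))
    return hits / n


proportions = [
    sample_proportion(rng, TRUE_SHARE, SAMPLE_SIZE) for _ in range(NUM_SAMPLES)
]

# Average of the sample proportions across all simulated samples.
mean_proportion = sum(proportions) / NUM_SAMPLES

# Fraction of samples whose proportion lands within 0.03 of the true value.
share_within_3_points = (
    sum(abs(p - TRUE_SHARE) <= 0.03 for p in proportions) / NUM_SAMPLES
)

print(f"mean of sample proportions: {mean_proportion:.3f}")
print(f"share of samples within 0.03 of the truth: {share_within_3_points:.1%}")
```

Individual samples rarely hit the population value exactly, but they cluster around it in a predictable way. Probability describes that clustering, and this is what later lets us reason from a single sample back to the population.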

Finally, we can use what we’ve discovered about our sample to draw conclusions about our population.

We call this final component in the process Inference.

Alt-text: The data and summarization of the data created from data analysis are examined using probability, which is the first step in allowing us to draw conclusions about the population based on the data. Inference is the last step.

 

This is the Big Picture of Statistics.

EXAMPLE: Polling Customer Opinion

In December 2018, Fast Technologies conducted a poll of its customers to determine how they rated the quality of the Fast 4 Terabyte Solid State external hard drive.

  1. Producing Data or Sampling: A (representative) sample of 1,082 customers was chosen, and each customer was asked how many stars (up to 5) they would give the product.
  2. Descriptive Statistics: The collected data were summarized, and it was found that 65% of the sampled customers gave the product 4 stars.

  3 and 4. Probability and Inference: Based on the sample result (of 65% giving 4 stars) and our knowledge of probability, it was concluded (with 95% confidence) that the percentage of those rating the hard drive 4 stars in the population of Fast Technologies’ customers is within 3% of what was obtained in the sample (i.e., between 62% and 68%).

The following figure summarizes the example:

Alt-text: A visual representation of the poll conducted about the opinions of Fast Technologies customers about the quality of the hard drive. The large population represents all Fast Technologies customers, and data were produced from 1,082 of these customers by asking them how many stars they would give the product. The data set contains 1,082 responses, and exploratory data analysis tells us that 65% gave the product 4 stars. Using both probability and inference, we can conclude that we are 95% sure that the population percentage is within 3% of 65% (i.e., between 62% and 68%).
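The example does not say how the "within 3%" figure was obtained. One common way to arrive at a number like it is the normal-approximation margin of error for a sample proportion, sketched below. The formula and the multiplier 1.96 (the usual value for 95% confidence) are assumptions about the method, not something stated in the example.

```python
import math

n = 1082      # sample size from the example
p_hat = 0.65  # sample proportion giving 4 stars, from the example
z = 1.96      # approximate multiplier for 95% confidence (assumed method)

# Estimated standard error of a sample proportion.
standard_error = math.sqrt(p_hat * (1 - p_hat) / n)

# Margin of error and the resulting interval around the sample proportion.
margin = z * standard_error
lower, upper = p_hat - margin, p_hat + margin

print(f"margin of error: {margin:.3f}")
print(f"interval: {lower:.3f} to {upper:.3f}")
```

Under these assumptions the margin comes out a little under 0.03, which is consistent with the rounded "within 3%" (62% to 68%) reported in the example.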

This brings us back to where we started, the population.

 

This material was adapted from the Carnegie Mellon University open learning statistics course available at http://oli.cmu.edu and is licensed under a Creative Commons License.