5. Basic Descriptive Statistics

🎯 Learning Goals

  • Understand core statistical concepts when exploring a new dataset
  • Describe and apply various measures of central tendency and spread
 

📗 Technical Vocabulary

  • Population
  • Sample
  • Descriptive Statistics
  • Inferential Statistics
  • Mean, median, mode
  • Outliers
  • Measures of Spread

🌎 What is Statistics?

Statistics is the study of how to collect, analyze, and draw conclusions from data. It is a valuable tool for estimating things you cannot measure directly and for predicting what is likely to happen. Example questions that statistics can help answer:
  • What is the likelihood of someone purchasing your product?
  • How many calls will your support team receive?
  • How many hat sizes should you manufacture to fit 95% of the population?

Population and Samples

In statistics, the population is the set of all elements or items that you are interested in studying. Populations are often vast, so it is usually impossible to collect and analyze data from every element. That is why statisticians usually draw conclusions about a population by choosing and examining a representative subset of it.
This subset of a population is called a sample. Ideally, the sample should preserve the essential statistical features of the population to a satisfactory extent. That way, you will be able to use the sample to draw conclusions about the population (source).
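To make the idea concrete, here is a minimal sketch in Python. The population is made up for illustration (10,000 randomly generated heights), and the sample size of 200 is an arbitrary choice. It only shows that a random sample's mean tends to land close to the population's mean.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights (cm) of 10,000 people, generated for illustration only.
population = [random.gauss(170, 10) for _ in range(10_000)]

# A simple random sample of 200 elements, drawn without replacement.
sample = random.sample(population, k=200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.1f}")
print(f"sample mean:     {sample_mean:.1f}")
```

Random selection is one common way to aim for a representative sample. A sample chosen by convenience (for example, only your friends) may not reflect the population as well.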
 

🧮 Descriptive & Inferential Statistics

Descriptive Statistics

Descriptive statistics describe a sample. Given a group that you are interested in, record data about the group members, and then use summary statistics and graphs to present the group’s properties (source).

Inferential Statistics

Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population (source).
 

📏 Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. The 3 most common measures of central tendency are mean, median, and mode.

Mean

This is the term for what we generally call an "average". We add all of the numbers in a field of our dataset, and then divide by how many numbers there are. Sometimes a mean can be misleading and may not show a typical value in our dataset. This is because a mean can be pulled up or down by outliers. Outliers are numbers that are either extremely high or extremely low compared to the rest of the numbers in a dataset.
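The short Python sketch below shows how a single outlier can pull the mean away from a typical value. The salary figures are invented for illustration.

```python
import statistics

# Hypothetical salaries in thousands of dollars.
salaries = [40, 42, 45, 47, 50]

# The same data with one extremely high value (an outlier) added.
salaries_with_outlier = salaries + [500]

# Mean by hand: add everything up, then divide by how many values there are.
mean_by_hand = sum(salaries) / len(salaries)

mean_without = statistics.mean(salaries)
mean_with = statistics.mean(salaries_with_outlier)

print(mean_by_hand)          # 44.8
print(mean_without)          # 44.8
print(round(mean_with, 2))   # 120.67
```

With the outlier included, the mean is larger than five of the six salaries, so it no longer describes a typical value.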

Median

For a given set of numbers, the median is the number that separates the top half of the data from the bottom half. To find it, arrange the data in ascending order and take the middle value. If the size of the dataset is even (so there is no single data point in the middle), the median is the mean of the two middle values. Note: The median is much less influenced by outliers than the mean, because it depends on the order of the values rather than on how large or small the extreme values are.
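Here is a small Python sketch of the odd and even cases, plus a made-up salary list showing how little the median moves when an extreme value is added.

```python
import statistics

odd_values = [7, 1, 5]       # sorted: 1, 5, 7 -> the middle value is 5
even_values = [7, 1, 5, 3]   # sorted: 1, 3, 5, 7 -> the mean of 3 and 5 is 4

median_odd = statistics.median(odd_values)
median_even = statistics.median(even_values)

# Hypothetical salaries (thousands of dollars), without and with an outlier.
salaries = [40, 42, 45, 47, 50]
median_without = statistics.median(salaries)
median_with = statistics.median(salaries + [500])

print(median_odd)       # 5
print(median_even)      # 4.0
print(median_without)   # 45
print(median_with)      # 46.0
```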

Mode

The mode is the value that occurs most often. A list can have more than one mode if several values are tied for most frequent. If no number in the list is repeated, then there is no mode for the list.
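One way to sketch this in Python is shown below. The helper name `modes` is hypothetical, and it follows this lesson's convention that a list with no repeated value has no mode (some libraries instead treat every value as a mode in that case).

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most often, or [] when nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values(), default=0)
    if highest <= 1:
        return []  # no value repeats, so there is no mode
    return [value for value, count in counts.items() if count == highest]


print(modes([1, 2, 2, 3]))      # [2]
print(modes([1, 2, 2, 3, 3]))   # [2, 3]
print(modes([1, 2, 3]))         # []
```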

📐 Measures of Spread

A measure of spread, also called a measure of dispersion, is used to describe the variability in a sample or population. A measure of spread is usually used in conjunction with a measure of central tendency, such as the mean or median, to provide an overall description of a set of data (source).
While it isn't necessary to understand measures of spread for this data science camp, they are important concepts in data science more broadly. You might come across these terms when looking for data sets to use for your projects. For more information on them, head to Google or ask an IL for help!
We will be talking about four different measures of spread: Variance, Standard Deviation, Percentiles, and Ranges.

☯️ Variance

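Variance is the average of the squared differences from the mean.
Population Variance is the variance of an entire population: the sum of the squared differences divided by the number of values.
Sample Variance is the variance estimated from a sample taken from the population. It divides by one less than the number of values (n - 1 instead of n), which corrects for the tendency of a sample to understate the population's variability.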
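The sketch below works through the calculation step by step in Python on a small made-up dataset, then compares the result with the standard library's `statistics.pvariance` (population) and `statistics.variance` (sample). The sample version divides by n - 1 rather than n.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only

mean = sum(data) / len(data)                            # 40 / 8 = 5.0
squared_differences = [(x - mean) ** 2 for x in data]   # 9, 1, 1, 1, 0, 0, 4, 16
total = sum(squared_differences)                        # 32.0

population_variance = total / len(data)        # divide by n
sample_variance = total / (len(data) - 1)      # divide by n - 1

print(population_variance)                 # 4.0
print(round(sample_variance, 3))           # 4.571

# The standard library gives the same answers.
print(statistics.pvariance(data))
print(round(statistics.variance(data), 3))
```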

📊 Standard Deviation

One common way to measure the variation in our dataset is to calculate the standard deviation (SD). The SD is the square root of the variance, so it is expressed in the same units as the data, and it tells us how far values typically fall from their mean. A low SD shows that the values are close to the mean, and a high SD shows that the values are spread far from the mean.
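As a rough illustration, the Python sketch below compares two made-up datasets that share the same mean but have very different SDs. It uses the population form, `statistics.pstdev`; for a sample, `statistics.stdev` would be the usual choice.

```python
import math
import statistics

tight = [9, 10, 10, 11]   # values close to the mean
wide = [1, 5, 15, 19]     # values far from the mean

# Both datasets have the same mean of 10.
print(statistics.mean(tight), statistics.mean(wide))

sd_tight = statistics.pstdev(tight)
sd_wide = statistics.pstdev(wide)

print(round(sd_tight, 2))   # 0.71
print(round(sd_wide, 2))    # 7.28

# The SD is the square root of the variance.
sd_from_variance = math.sqrt(statistics.pvariance(wide))
```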
 
 
Learn more about Variance and Standard Deviation here!

💯 Percentiles

A percentile is a measure used in statistics indicating the value below which a given percentage of observations in a group of observations falls. For example, the 20th percentile is the value below which 20% of the observations may be found.
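The definition can be turned around into a quick check: for a given value, what percentage of observations fall below it? The Python sketch below does that with a hypothetical helper named `percent_below` and invented test scores. Statistical libraries compute percentiles with interpolation rules that can give slightly different answers on small datasets.

```python
def percent_below(values, threshold):
    """Return the percentage of observations strictly below `threshold`."""
    if not values:
        raise ValueError("values must not be empty")
    below = sum(1 for v in values if v < threshold)
    return 100 * below / len(values)


scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]  # hypothetical test scores

# Two of the ten scores (55 and 60) are below 65, so 65 sits at
# roughly the 20th percentile under this simple definition.
print(percent_below(scores, 65))   # 20.0
```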

Quantiles

Quantiles are the cut points that divide an ordered dataset into groups of equal size. Percentiles, for example, are the quantiles that divide the data into 100 groups.

Quartiles

Quartiles are the three dividing points (or quantiles) that split data into four equally sized groups.
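In Python 3.8 and later, the standard library's `statistics.quantiles` can compute these cut points. The sketch below uses its default settings on a small made-up list. Other tools may use different interpolation rules and return slightly different quartiles.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7]  # illustrative values only

# n=4 asks for the cut points that split the data into four groups,
# so three values come back: Q1, Q2 (the median), and Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

print(q1, q2, q3)   # 2.0 4.0 6.0
```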

↔️ Ranges

Interquartile Range

The IQR measures the spread of the middle 50% of values when they are ordered from lowest to highest. To find the interquartile range (IQR), first split the ordered data into a lower half and an upper half, then find the median (middle value) of each half. These values are quartile 1 (Q1) and quartile 3 (Q3). The IQR is the difference between Q3 and Q1 (source).
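The steps above can be sketched in Python as follows. The function name `interquartile_range` is hypothetical, and the sketch assumes one common convention: when the number of values is odd, the middle value is left out of both halves. Other conventions exist and can give slightly different quartiles.

```python
import statistics


def interquartile_range(values):
    """Return (Q1, Q3, IQR) using the median-of-halves method."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower_half = ordered[:half]    # values below the middle position
    upper_half = ordered[-half:]   # values above the middle position
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    return q1, q3, q3 - q1


# Lower half 1, 3, 5, 7 -> Q1 = 4.0; upper half 9, 11, 13, 15 -> Q3 = 12.0
print(interquartile_range([1, 3, 5, 7, 9, 11, 13, 15]))   # (4.0, 12.0, 8.0)
```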
 
 
 
 
📝 Practice: Basic Descriptive Statistics
In this practice, you'll become both the data collector and the data analyst! You'll design a mini-study, collect real data, and apply the statistical concepts you've learned.

Part 1: Design Your Study

  1. Choose a question you're curious about that can be answered with numerical data. Some ideas:
      • How many hours of sleep do your fellow scholars get per night?
      • How many minutes do people spend on their favorite app each day?
      • How many siblings do people in your camp have?
  2. Create a simple data collection form with:
      • Your research question
      • Space for at least 15-20 responses
      • Any necessary clarifications (e.g., "Round to the nearest minute")

Part 2: Collect Your Data

  1. Ask your fellow scholars, instructors, or friends to provide responses
    1. You can either collect responses with a form tool of your choice (like Google Forms) or simply ask them to answer in a Slack message thread
  2. Record all data points in a neat table or list
  3. Aim to collect at least 15 responses

Part 3: Analyze Your Data

For your dataset, calculate and record:
  • Mean
  • Median
  • Mode (if applicable)
  • Range (the largest value minus the smallest value)
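If you would like to check your hand calculations, a sketch like the one below may help. The function name `summarize` and the sleep data are made up for illustration, the range is taken to be the largest value minus the smallest, and `statistics.multimode` requires Python 3.8 or later.

```python
import statistics


def summarize(responses):
    """Return the mean, median, mode(s), and range of a list of numbers."""
    return {
        "mean": statistics.mean(responses),
        "median": statistics.median(responses),
        # multimode returns every most-frequent value; if nothing repeats,
        # it returns all values, which this lesson treats as "no mode".
        "modes": statistics.multimode(responses),
        "range": max(responses) - min(responses),
    }


# Hypothetical answers to "How many hours of sleep do you get per night?"
sleep_hours = [7, 6, 8, 7, 5, 9, 7, 6, 8, 7, 6, 7, 8, 10, 7]

summary = summarize(sleep_hours)
print(summary)
```

Swap in your own responses for `sleep_hours` and compare the output with the values you calculated by hand.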

Part 4: Interpret Your Findings

Write a brief summary (1-2 paragraphs) of your findings. Some ideas of what to include in your summary are:
  • Why you decided to ask the question that you did
  • What you found
    • Explain which measure of central tendency best represents your data and why
    • Discuss what the spread of your data tells you
  • Identify any limitations of your study
  • Suggest one follow-up question for future research
 
For a summary of this lesson, check out the 5. Basic Descriptive Statistics One-Pager!