2. Crafting Research Questions

🎯 Learning Goals

  • Understand the data science process, from asking questions to data visualization
  • Write clear and focused research questions
  • Find and explore public datasets relevant to specific research questions

📗 Technical Vocabulary

  • Data science process
  • Research question
  • Dataset
  • Target
  • Record
  • Field
🌥️
Warm-Up: 21 Questions
The goal of the game is for players to guess the secret thing by asking up to 21 yes-or-no questions. If you’re playing independently, use this prompt to play with ChatGPT:
“Let's play 21 questions! Select a person, place, or thing for me to guess, but don't tell me what it is! I can ask a maximum of 21 yes-or-no questions to guess the secret thing. At the end of each response, please tell me how many questions I have left. Ready to play?”
You may have noticed that asking the right questions led you to the solution faster. This activity connects directly to our lesson today: the data science process begins with curiosity and asking good questions!

Introduction to the Data Science Process

Data science is an iterative process that begins with asking a question and ends with sharing insights through data visualization and communication. The steps in this process often cycle back and forth, refining models and answers as you gather more information and test your hypotheses. As discussed in our previous lesson, the data science process typically involves the following steps:
  1. Define a problem. Data scientists are always curious and ask a lot of questions, so the first step of any data science project is figuring out what problem you want to solve and crafting a clear, focused question.
  2. Collect data. Once we know our problem, we can go out into the world and collect the data.
  3. Prepare the data. Data does not normally arrive in the shape or form we want it to, so we have to spend time cleaning the data to ensure it works for us.
  4. Analyze the data. At this step, we can explore relationships in the data. We can understand what the data tells us about the problem and what features are important in our data.
  5. Visualize the data. We will use Tableau to visualize our data.
  6. Communicate insights. We can now take our visualizations and communicate our insights to help stakeholders understand the problem and data.
  Derived from the process outlined in the original Harvard Data Science Course (https://cs109.org), developed by Joe Blitzstein and Hanspeter Pfister.
For the remainder of this lesson, we’ll focus on the first two steps:
  1. Defining the problem with a well-defined research question
  2. Collecting data to help you answer that question

Craft a Well-Defined Question

Defining the problem begins with crafting a well-defined question. The more precise your question is, the more likely it is that you’ll find an answer. Brandon Rohrer, a data scientist and machine learning engineer, explains the importance of developing a “sharp” question like this:
When choosing your question, imagine that you are approaching an oracle that can tell you anything in the universe, as long as the answer is a number or a name. It’s a mischievous oracle, and its answer will be as vague and confusing as it can get away with. You want to pin it down with a question so airtight that the oracle can’t help but tell you what you want to know.
To craft a “sharp” question, many people use the SMART framework:
  • Specific
  • Measurable
  • Actionable
  • Relevant
  • Time-bound
Let’s imagine that we’re interested in life expectancy in different countries around the world. A non-example of a well-defined question would be: “What does life expectancy look like around the world?” It’s not specific—are we looking for data per country or region? It’s also not actionable—once we know which countries have longer life expectancy, what would we do with that information?
A well-defined or “sharp” research question might be: “How is access to wealth related to life expectancy?” The question is specific, focusing on the relationship between wealth and life expectancy. It’s measurable—both life expectancy at birth and wealth are measurable. The question is also actionable and relevant, because if we can establish a relationship between these two factors, this could inform public policies that improve life expectancy as a measure of well-being. Once we identify some datasets and see what data are available, we can improve the question even more by narrowing it to a specific timeframe.
💡
Avoid Leading Questions
We want to approach data exploration with an open mind! When crafting your research question, it’s important to avoid framing it in a way that encourages you to subconsciously seek evidence supporting a particular point of view. A biased or leading question might sound like, “How much more do doctors ignore women patients than men patients?” Instead, a better approach would be to ask, “Do doctors listen to patients equally regardless of gender?” This way, you can explore the data without preconceived notions, leading to a more objective, data-driven exploration.

Find and Evaluate Public Datasets

Once you’ve defined your research question, it’s time to find data that can help you answer it. When you examine a dataset, start by looking at the fields available and identifying what each record in the dataset represents. In a dataset, all information is organized into records and fields.
  • A record is a single item or instance within the dataset. In a table, records are usually represented as rows.
  • A field is a specific feature, attribute, or characteristic that applies to all of the records. In a table, the fields are typically the columns.
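To make these two terms concrete, here is a minimal Python sketch that uses only the standard library. The tiny table, its column names, and its values are invented for illustration and do not come from any real dataset.

```python
import csv
import io

# Made-up sample data for illustration only; not real statistics.
SAMPLE_CSV = """country,life_expectancy,gni_per_capita
Country A,70.5,12000
Country B,81.2,45000
Country C,64.8,3500
"""

# Each row after the header is one record; each column name is one field.
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
records = list(reader)
fields = reader.fieldnames

print("Fields:", fields)
print("Number of records:", len(records))
print("First record:", records[0])
```

In this sketch, each record describes one country, and the three fields are the attributes recorded for every country.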
A well-defined question will guide you in identifying what kind of data is needed and what key variables to focus on. For example, if our question is “How is access to wealth related to life expectancy?”, then we need to make sure the dataset includes life expectancy at birth by country and some measure of wealth. These target variables are the key metrics you're trying to understand or predict.
Once you find a dataset that includes some of the variables you’re looking for, you can begin evaluating the dataset to see if it meets your needs. Check out this dataset from the UNDP (United Nations Development Programme). You can download the latest HDI dataset by clicking the Download latest HDI dataset link below the visualization.
✏️
Try-It | Evaluating Datasets
  • What does each record represent?
  • What fields are available?
  • How relevant is this information to your research question?
  • What other questions can you ask based on the information available?
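Some of these checks can also be done in code once a dataset is downloaded. The sketch below is one possible approach, not a required step: it reports which of the fields a research question needs are present, and how many records are missing a value for each. The sample table, the field names in `NEEDED_FIELDS`, and the helper `evaluate_dataset` are all hypothetical and are not taken from the UNDP file.

```python
import csv
import io

# Made-up sample data for illustration only; not real statistics.
SAMPLE_CSV = """country,life_expectancy,gni_per_capita
Country A,70.5,12000
Country B,81.2,
Country C,64.8,3500
"""

# Hypothetical field names that our research question would need.
NEEDED_FIELDS = ["life_expectancy", "gni_per_capita", "household_income"]


def evaluate_dataset(csv_text, needed_fields):
    """Report which needed fields exist and how many values each is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = list(reader)
    available = reader.fieldnames or []
    report = {}
    for field in needed_fields:
        if field not in available:
            report[field] = "not in dataset"
        else:
            # Count records whose value for this field is empty or blank.
            missing = sum(1 for r in records if not (r[field] or "").strip())
            report[field] = f"{missing} of {len(records)} records missing"
    return available, report


available_fields, field_report = evaluate_dataset(SAMPLE_CSV, NEEDED_FIELDS)
print("Fields available:", available_fields)
for name, status in field_report.items():
    print(f"{name}: {status}")
```

A field that is absent or mostly empty is a signal to look for another dataset or to refine the question.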
After evaluating this dataset from UNDP, we can see that it is somewhat aligned with our question, but we may need to refine the question to better match the data available. This dataset includes life expectancy at birth, which is one of our target variables! However, it doesn’t include a direct measure of wealth, such as household income. Instead, this dataset includes GNI (gross national income) per capita, the average amount of money each person in a country would earn if the total GNI (income) were divided equally among everyone. For our purposes, GNI per capita can serve as a reasonable approximation of wealth. The dataset also only includes data from 2022. With this information, we can refine our question to make it even more specific:
How is GNI per capita related to life expectancy at birth across different countries in 2022?
Now our question is more narrowly defined and is time-bound based on the data that were available. After exploring the available datasets, you might decide to revisit your initial question and refine it based on the data that’s available. That’s perfectly normal and okay!

Finding Datasets

There are many places you can start looking for publicly available data. Here are a few to help you get started!
Which of these resources did you find most helpful? Depending on what you’re interested in, some dataset sources will be more useful than others!
💡
Correlation vs. Causation
Remember that observational data can show relationships between two variables, but it can’t definitively prove causation. It might be tempting to ask questions like “What causes people to live longer?” But proving causality is significantly more complex than demonstrating correlation.
A better question might be, “What factors are correlated with longer life expectancy?” This question allows you to explore various factors, such as diet, access to healthcare, or exercise habits, that might be associated with longer life, without jumping to conclusions about what causes it. Correlation helps us spot patterns, but proving causality takes carefully controlled experiments.
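One common way to measure how strongly two numeric fields move together is the Pearson correlation coefficient. The sketch below computes it with the Python standard library. The function name `pearson_correlation` and the sample values are assumptions made up for illustration; they are not real statistics.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of each pair's deviations from the means.
    sum_products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a variable never changes")
    return sum_products / (spread_x * spread_y)


# Made-up values for illustration only; not real statistics.
gni_per_capita = [3500, 12000, 20000, 45000]
life_expectancy = [64.8, 70.5, 74.0, 81.2]

r = pearson_correlation(gni_per_capita, life_expectancy)
print(f"Pearson r = {r:.2f}")
```

A value near 1 or -1 suggests a strong linear relationship, and a value near 0 suggests a weak one. Either way, the number describes a pattern in the data and says nothing about what causes it.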
Well-defined questions are the backbone of the data science process. They guide data collection, cleaning, analysis, and storytelling, ensuring that insights are actionable and aligned with the original problem statement. Refining questions based on data exploration is a natural part of the process, leading to better analysis and stronger conclusions. In the next lesson, we’ll learn how to use SQL to explore a dataset even further!
📝
Practice | Crafting a Research Question
  1. Choose an interesting dataset from the list below.
  2. Define the problem by crafting a research question using the SMART criteria.
For a summary of this lesson, check out the 2. Crafting Research Questions One-Pager!