Assignment: Identifying Missing Data Mechanisms

Missing Data

Author

Francisco Cardozo, PhD

Published

March 30, 2026

Overview

This assignment has two goals. First, you will practice reasoning about why data are missing by classifying missingness mechanisms from written scenarios. Second, you will handle missing data in practice by fitting a confirmatory factor analysis (CFA) model with full information maximum likelihood (FIML).

Learning Objectives

  • Distinguish between missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) using narrative descriptions and formal notation.
  • Fit a CFA model using FIML (lavaan) or ML (Mplus) on data with missing values and interpret the results.
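
For reference, one common way to write the three mechanisms formally is sketched here. The symbols are assumptions, not notation fixed by this assignment: $Y = (Y_{\text{obs}}, Y_{\text{mis}})$ splits the data into observed and missing parts, $R$ is the matrix of missingness indicators, and $\phi$ collects the parameters of the missingness process. Your course materials may use different symbols.

```latex
\begin{align*}
\text{MCAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \phi) = P(R \mid \phi) \\
\text{MAR:}\quad  & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \phi) = P(R \mid Y_{\text{obs}}, \phi) \\
\text{MNAR:}\quad & P(R \mid Y_{\text{obs}}, Y_{\text{mis}}, \phi)
                    \text{ still depends on } Y_{\text{mis}}
                    \text{ after conditioning on } Y_{\text{obs}}
\end{align*}
```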

Instructions

Part 1: Scenario Classification (40 points)

Read each scenario carefully. For every scenario:

  1. Classify the missingness mechanism as MCAR, MAR, or MNAR.
  2. Justify your answer in 1–2 sentences, referencing which variables (observed or unobserved) determine the probability of being missing.
  3. State the consequence: would complete-case analysis produce biased estimates and, if so, in which direction?
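
To build intuition for the third task, the sketch below simulates a generic outcome and deletes values under each mechanism. It is a minimal illustration in base R. The variables `x` and `y`, the sample size, and the coefficients are all hypothetical and are not tied to any scenario in this assignment.

```r
# Minimal sketch: complete-case means under three missingness mechanisms.
# All names and coefficients are hypothetical illustrations.
set.seed(2026)
n <- 10000
x <- rnorm(n)                              # fully observed covariate
y <- 0.5 * x + rnorm(n, sd = sqrt(0.75))   # outcome with variance close to 1

# TRUE means that y is observed for that case
obs_mcar <- rbinom(n, 1, 0.3) == 0         # constant probability of missingness
obs_mar  <- rbinom(n, 1, plogis(x)) == 0   # missingness depends on observed x
obs_mnar <- rbinom(n, 1, plogis(y)) == 0   # missingness depends on y itself

cc_means <- c(
  full = mean(y),
  mcar = mean(y[obs_mcar]),
  mar  = mean(y[obs_mar]),
  mnar = mean(y[obs_mnar])
)
round(cc_means, 3)
```

In this setup the complete-case mean under MCAR should stay close to the full-data mean, while the MAR and MNAR versions drift downward because cases with high `x` or high `y` are removed more often. Whether and in which direction bias appears in a real study depends on the estimand and on which variables drive the missingness.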

Scenario A – Lab Equipment Failure

A research team measures cortisol levels in 500 participants. Midway through data collection, one of the freezers storing blood samples malfunctions overnight, destroying 60 samples. The destroyed samples correspond to whichever tubes happened to be in the broken unit; the assignment of tubes to freezers was arbitrary and unrelated to any participant characteristic.


Scenario B – Income on a Health Survey

A public-health survey asks respondents to report their annual household income. Researchers notice that 25% of income values are missing. Closer inspection reveals that respondents who reported higher education levels and who live in wealthier zip codes are more likely to provide their income. Missingness does not depend on the actual income value once education and zip code are accounted for.


Scenario C – Depression Symptom Diary

Participants in a longitudinal study are asked to complete a daily mood diary. Researchers find that entries are most likely to be missing on days when participants’ depression scores (the outcome of interest) would have been highest. In other words, the sicker someone feels on a given day, the less likely they are to fill out the diary that day.


Scenario D – Online Experiment Dropout

Participants in an online experiment are randomly assigned to a 10-minute or 30-minute condition. The 30-minute condition has a 40% dropout rate, compared with 10% in the 10-minute condition. Dropout is predicted by condition assignment (fully observed) and does not depend on participants’ scores on the outcome measure once condition is accounted for.


Part 2: CFA with Missing Data Using FIML (60 points)

In this part you will work with a simulated dataset that already contains missing values. Your task is to fit a confirmatory factor analysis (CFA), compare FIML to listwise deletion, and interpret the results.
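
As background on how FIML differs from listwise deletion, the casewise log-likelihood under multivariate normality can be sketched as follows. The notation is an assumption made for illustration: $y_i$ holds the $p_i$ items that case $i$ actually answered, and $\mu_i(\theta)$ and $\Sigma_i(\theta)$ are the rows and columns of the model-implied mean vector and covariance matrix that correspond to those items.

```latex
\begin{align*}
\ell_i(\theta) &= -\tfrac{1}{2}\Big[\, p_i \log(2\pi)
                 + \log\lvert \Sigma_i(\theta) \rvert
                 + \big(y_i - \mu_i(\theta)\big)^{\top} \Sigma_i(\theta)^{-1}
                   \big(y_i - \mu_i(\theta)\big) \Big] \\
\ell(\theta)   &= \sum_{i=1}^{N} \ell_i(\theta)
\end{align*}
```

Every case with at least one observed item contributes a term to the sum, whereas listwise deletion keeps only the cases with $p_i$ equal to the full number of items.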

The Study

A research team developed a 6-item self-esteem scale. They surveyed 500 college students. The items are rated on a continuous scale (standardized). Unfortunately, some students skipped later items, especially those who scored lower on the first two items (a survey-fatigue pattern). The missingness mechanism is MAR: it depends on the observed early items, not on the missing items themselves.

Step 1: Generate the Data

Run the code below to create the dataset you will analyze. Use your student ID as the seed.

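library(tidyverse)
library(lavaan)

# Replace YOUR_STUDENT_ID with your numeric student ID
set.seed(YOUR_STUDENT_ID)

# Population model: one factor with the true loadings listed in Step 4
pop_model <- "
  esteem =~ 0.85*item1 + 0.75*item2 + 0.70*item3 +
             0.65*item4 + 0.60*item5 + 0.55*item6
  esteem ~~ 1*esteem
"

# standardized = TRUE sets the residual variances so that each item has
# unit variance, which puts the loadings above on the standardized metric
complete_data <- simulateData(pop_model, sample.nobs = 500, standardized = TRUE)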

# Impose MAR missingness on items 4-6: the probability of skipping an item
# rises as the observed scores on item1 (and item2) fall
esteem_data <- complete_data |>
    as_tibble() |>
    mutate(
        item4 = ifelse(rbinom(n(), 1, plogis(-2.5 - 0.6 * item1)) == 1, NA, item4),
        item5 = ifelse(rbinom(n(), 1, plogis(-1.8 - 0.6 * item1 - 0.4 * item2)) == 1, NA, item5),
        item6 = ifelse(rbinom(n(), 1, plogis(-1.2 - 0.7 * item1 - 0.3 * item2)) == 1, NA, item6)
    )

Step 2: Inspect the Missing Data (10 points)

Report the number and percentage of missing values for each item. You can use:

esteem_data |>
    summarise(across(everything(), ~ sum(is.na(.x)))) |>
    pivot_longer(everything(), names_to = "Item", values_to = "N_Missing") |>
    mutate(Pct = round(N_Missing / nrow(esteem_data) * 100, 1))
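
Beyond counting missing values, it can help to check whether missingness is related to an observed variable. The sketch below regresses a missingness indicator on an observed item with base R's `glm()`. The data frame `toy` and the items `item_a` and `item_b` are hypothetical stand-ins generated for this illustration, not the assignment dataset.

```r
# Hypothetical toy data: item_b is skipped more often when item_a is low
set.seed(123)
n <- 2000
toy <- data.frame(item_a = rnorm(n))
toy$item_b <- 0.6 * toy$item_a + rnorm(n, sd = 0.8)
toy$item_b[rbinom(n, 1, plogis(-1.5 - 0.8 * toy$item_a)) == 1] <- NA

# Logistic regression of the missingness indicator on the observed item
miss_fit <- glm(is.na(item_b) ~ item_a, data = toy, family = binomial)
coef(summary(miss_fit))
```

A slope that is clearly different from zero is evidence against MCAR. It cannot separate MAR from MNAR, because that distinction involves the values that were never observed.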

Step 3: Fit the CFA Model (30 points)

Fit a one-factor CFA with all 6 items. Run two versions: one with listwise deletion and one with FIML.

You may use either lavaan or Mplus (or both for extra credit).

Option A: lavaan

cfa_model <- "
  esteem =~ item1 + item2 + item3 + item4 + item5 + item6
"

# Same model, two treatments of the missing values
fit_listwise <- cfa(cfa_model, data = esteem_data, missing = "listwise")  # complete cases only
fit_fiml <- cfa(cfa_model, data = esteem_data, missing = "fiml")          # all available data

summary(fit_listwise, standardized = TRUE, fit.measures = TRUE)
summary(fit_fiml, standardized = TRUE, fit.measures = TRUE)

Option B: Mplus

TITLE: Self-Esteem CFA - FIML;

DATA: FILE IS esteem_data.dat;

VARIABLE:
    NAMES ARE item1 item2 item3 item4 item5 item6;
    USEVARIABLES ARE item1-item6;
    MISSING ARE ALL (-999);

ANALYSIS:
    ESTIMATOR = ML;

MODEL:
    ! Free the item1 loading and fix the factor variance to 1
    esteem BY item1* item2 item3 item4 item5 item6;
    esteem@1;

OUTPUT: STANDARDIZED;

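If you use Mplus, run the model twice to compare: once as shown above (FIML is the default with ESTIMATOR = ML) and once with LISTWISE = ON; added to the DATA command. The data file must have no header row, and missing values must be coded as -999 to match the MISSING statement.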
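
One possible way to produce a data file in that layout from R is sketched below with base R's `write.table()`. The small data frame `toy` is a hypothetical stand-in for `esteem_data`, and the output path under `tempdir()` is an assumption; point it to the folder that holds your Mplus input file.

```r
# Hypothetical stand-in for esteem_data (two items shown for brevity)
toy <- data.frame(
  item1 = c(0.12, -0.80, 1.45),
  item2 = c(NA, 0.33, -0.27)
)

out_file <- file.path(tempdir(), "esteem_data.dat")  # swap in your own path
write.table(
  toy,
  file = out_file,
  na = "-999",          # matches MISSING ARE ALL (-999)
  row.names = FALSE,
  col.names = FALSE,    # no header row in the file
  quote = FALSE
)
readLines(out_file)
```

The columns are written in the order they appear in the data frame, so that order needs to match the NAMES ARE statement.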

Deliverables:

  1. Report your model syntax
  2. Report fit indices for both models (chi-square, df, p-value, CFI, TLI, RMSEA)
  3. Report standardized factor loadings for both models
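
If you work in lavaan, the fit indices and standardized loadings can be pulled into compact tables rather than copied from the `summary()` output. The sketch below uses `fitMeasures()` and `standardizedSolution()` on a hypothetical four-item toy model with illustrative MCAR holes. The object names (`toy`, `toy_model`, `fits`) are assumptions and should be replaced by your own fitted objects.

```r
library(lavaan)

# Hypothetical toy data and model, not the assignment dataset
set.seed(1)
toy <- simulateData("f =~ 0.8*y1 + 0.7*y2 + 0.6*y3 + 0.5*y4", sample.nobs = 300)
toy$y4[sample(nrow(toy), 60)] <- NA   # MCAR holes, for illustration only

toy_model <- "f =~ y1 + y2 + y3 + y4"
fits <- list(
  listwise = cfa(toy_model, data = toy, missing = "listwise"),
  fiml     = cfa(toy_model, data = toy, missing = "fiml")
)

# Fit indices: one column per model, rows in the order requested
wanted <- c("chisq", "df", "pvalue", "cfi", "tli", "rmsea")
fit_table <- sapply(fits, function(f) as.numeric(fitMeasures(f, wanted)[wanted]))
rownames(fit_table) <- wanted
round(fit_table, 3)

# Standardized loadings: keep only the "=~" rows of the standardized solution
loading_table <- sapply(fits, function(f) {
  ss <- standardizedSolution(f)
  ss$est.std[ss$op == "=~"]
})
rownames(loading_table) <- paste0("y", 1:4)
round(loading_table, 2)
```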

Step 4: Compare and Reflect (20 points)

Create a comparison table and answer the questions below.

Comparison table:

Item     True Loading    Listwise Loading    FIML Loading
item1    0.85            ?                   ?
item2    0.75            ?                   ?
item3    0.70            ?                   ?
item4    0.65            ?                   ?
item5    0.60            ?                   ?
item6    0.55            ?                   ?

Reflection questions:

  1. Which method (listwise or FIML) produced loadings closer to the true values? For which items is the difference most noticeable?
  2. How many cases did listwise deletion discard? How many did FIML use?
  3. The missingness in this dataset depends on item1 and item2. Why does this make FIML particularly effective here?
  4. If the missingness had depended on the missing items themselves (e.g., students with low self-esteem skipped the items because they would have scored low), would FIML still produce unbiased estimates? Why or why not?
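
For the second question, the case counts can be cross-checked directly from the data. The sketch below uses base R on a hypothetical five-row data frame. It assumes that listwise deletion keeps only fully complete rows and that FIML keeps every row with at least one observed value; the lavaan summary output also reports the number of observations used.

```r
# Hypothetical toy data with skipped later items
toy <- data.frame(
  item1 = c(0.4, -1.2, 0.9, 0.1, -0.3),
  item2 = c(0.2, -0.7, NA, 0.5, -0.1),
  item3 = c(NA, -0.9, NA, 0.3, 0.0)
)

n_listwise <- sum(complete.cases(toy))        # rows with no missing values
n_fiml     <- sum(rowSums(!is.na(toy)) > 0)   # rows with at least one observed value

c(listwise = n_listwise, fiml = n_fiml, discarded = nrow(toy) - n_listwise)
```

In this toy example, rows 1 and 3 contain missing values, so listwise deletion keeps 3 of the 5 rows while all 5 remain available to FIML.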

Submission

Submit a Word document (.docx) containing:

  1. Part 1: Classification, justification, and consequence for all four scenarios
  2. Part 2: Missing data summary, model output, comparison table, and reflection answers

Grading Rubric

Component                                                      Points
Part 1: Scenario classification (4 scenarios, 10 pts each)     40
Part 2: CFA with FIML, missing data summary                    10
Part 2: CFA with FIML, model fitting (listwise + FIML)         30
Part 2: CFA with FIML, comparison table and reflection         20
Total                                                          100