Assignment: CFA with Missing Data
Software Integration
Overview
In this assignment, you will simulate psychological scale data, introduce missing data for one group, and then fit a confirmatory factor analysis (CFA) model. You will compare your estimated parameters to the true values from the simulation to evaluate how well your model recovers the population parameters.
Learning Objectives
- Practice simulating psychological data with known parameters
- Understand the impact of missing data on parameter estimation
- Gain experience running CFA models in Mplus or lavaan
- Critically evaluate model results against known true values
Instructions
Part 1: Data Simulation (20 points)
Create a simulated dataset with the following characteristics:
- Sample size: N = 400
- Groups: Two groups (e.g., “Treatment” and “Control”), with 200 participants each
- Latent construct: One latent factor (e.g., “Wellbeing”)
- Items: 5 Likert-scale items (1-5 scale)
- True factor loadings: Choose your own values between 0.5 and 0.9 for each item
Starter code:
library(tidyverse)
set.seed(YOUR_STUDENT_ID) # Use your student ID for reproducibility
n <- 400
# Create your simulation here
data_sim <- tibble(
  id = 1:n,
  group = rep(c("Treatment", "Control"), each = n/2),
  # Add latent trait
  theta = rnorm(n, mean = 0, sd = 1)
) %>%
  mutate(
    # Generate 5 Likert items with YOUR chosen factor loadings
    # item1 = round(pmin(5, pmax(1, 3 + λ1 * theta + rnorm(n, 0, error1)))),
    # ... continue for all 5 items
  )
Deliverable: Report your chosen factor loadings (λ₁ through λ₅) in a table.
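The commented-out item template in the starter code can be filled in along the following lines. This is a minimal base-R sketch, not a model answer: the seed, the loadings in `lambda`, the choice of `error_sd`, and the helper name `make_item` are all hypothetical placeholders that you should replace with your own.

```r
# Illustrative sketch: the seed, loadings, and error SDs below are
# hypothetical placeholders, not recommended values.
set.seed(123)
n <- 400
theta <- rnorm(n, mean = 0, sd = 1)        # latent trait with variance 1

lambda   <- c(0.80, 0.70, 0.60, 0.75, 0.65)  # hypothetical true loadings
error_sd <- sqrt(1 - lambda^2)               # assumption: unit-variance response before rounding

# One Likert item: linear response centred at 3, clipped to 1-5, then rounded
make_item <- function(theta, loading, err_sd) {
  raw <- 3 + loading * theta + rnorm(length(theta), mean = 0, sd = err_sd)
  round(pmin(5, pmax(1, raw)))
}

# mapply() calls make_item() once per loading and returns a 400 x 5 matrix
items <- mapply(make_item, loading = lambda, err_sd = error_sd,
                MoreArgs = list(theta = theta))
colnames(items) <- paste0("item", 1:5)

data_sim <- data.frame(
  id    = 1:n,
  group = rep(c("Treatment", "Control"), each = n / 2),
  theta = theta,
  items
)
```

Setting the error SD to `sqrt(1 - lambda^2)` is only one possible choice; any positive error SD is consistent with the assignment as long as you report it alongside your loadings.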
Part 2: Introduce Missing Data (20 points)
After creating your complete dataset, introduce missing data only for the Treatment group.
Requirements:
- Make approximately 20% of the item responses missing for the Treatment group
- The Control group should have no missing data
- Missing data should be Missing Completely at Random (MCAR)
Starter code:
# Function to introduce MCAR missing data
introduce_missing <- function(x, prop_missing = 0.20) {
  n <- length(x)
  missing_idx <- sample(1:n, size = round(n * prop_missing))
  x[missing_idx] <- NA
  return(x)
}
# Apply only to Treatment group.
# introduce_missing() draws 20% of all rows; if_else() then keeps the NAs
# only for Treatment rows, so the Treatment rate is about 20% on average
# rather than exactly 20%.
data_missing <- data_sim %>%
  mutate(
    across(
      starts_with("item"),
      ~ if_else(group == "Treatment", introduce_missing(.x), .x)
    )
  )
# Check missing data pattern
data_missing %>%
  group_by(group) %>%
  summarise(across(starts_with("item"), ~ mean(is.na(.x))))
Deliverable: Report the percentage of missing data for each item by group.
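The starter code samples rows from the whole dataset and then discards the NAs that fall in the Control group, so the observed Treatment rate fluctuates around 20%. If you prefer exactly 20% per item, one alternative is to sample only among the Treatment rows. The sketch below uses base R and toy data; the helper name `introduce_missing_in` and the `toy` data frame are hypothetical stand-ins for your own objects.

```r
# Hypothetical helper: set exactly round(prop_missing * k) values to NA among
# the k eligible rows; all other rows are left untouched.
introduce_missing_in <- function(x, eligible, prop_missing = 0.20) {
  idx <- which(eligible)
  n_missing <- round(length(idx) * prop_missing)
  drop <- idx[sample.int(length(idx), size = n_missing)]
  x[drop] <- NA
  x
}

# Toy data standing in for data_sim (values are placeholders)
set.seed(123)
toy <- data.frame(
  group = rep(c("Treatment", "Control"), each = 200),
  item1 = sample(1:5, 400, replace = TRUE),
  item2 = sample(1:5, 400, replace = TRUE)
)

item_cols <- grep("^item", names(toy), value = TRUE)
toy[item_cols] <- lapply(toy[item_cols], introduce_missing_in,
                         eligible = toy$group == "Treatment")

# Proportion missing per item, by group
miss_by_group <- aggregate(is.na(toy[item_cols]),
                           by = list(group = toy$group), FUN = mean)
print(miss_by_group)
```

Because each item gets its own independent draw, the missingness is still MCAR; only the per-item count in the Treatment group is fixed.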
Part 3: CFA Model (30 points)
Fit a one-factor CFA model to your data with missing values. You may use either Mplus or lavaan (or both for extra credit).
Option A: lavaan
library(lavaan)
# Select only item variables
data_cfa <- data_missing %>%
  select(starts_with("item"))
# Define model
cfa_model <- '
  Wellbeing =~ item1 + item2 + item3 + item4 + item5
'
# Fit model with FIML for missing data
fit <- cfa(cfa_model,
           data = data_cfa,
           std.lv = TRUE,
           missing = "fiml") # Full Information Maximum Likelihood
summary(fit, standardized = TRUE, fit.measures = TRUE)
Option B: Mplus
TITLE: CFA with Missing Data
DATA: FILE IS cfa_data.dat;
VARIABLE:
  NAMES ARE item1 item2 item3 item4 item5;
  USEVARIABLES ARE item1-item5;
  MISSING ARE ALL (-999); ! Or your missing value code
ANALYSIS:
  ESTIMATOR = ML;
MODEL:
  Wellbeing BY item1* item2 item3 item4 item5;
  Wellbeing@1;
OUTPUT: STANDARDIZED;
Deliverable:
- Report your model syntax
- Report fit indices (χ², df, p-value, CFI, TLI, RMSEA)
- Report unstandardized and standardized factor loadings
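If you choose Option B, the Mplus input reads `cfa_data.dat` and treats -999 as the missing value code, so the R data must be exported in that form first. The following is a minimal sketch of one way to do this with base R. The small `data_cfa` here is a toy stand-in with placeholder values, and writing to `tempdir()` is only for illustration; in practice you would write the file next to your Mplus input file.

```r
# Toy stand-in for data_cfa (placeholder values)
data_cfa <- data.frame(
  item1 = c(3, NA, 4, 2),
  item2 = c(2, 5, NA, 3),
  item3 = c(4, 4, 3, NA),
  item4 = c(1, 3, 5, 4),
  item5 = c(NA, 2, 3, 5)
)

# Illustration only: choose the folder that holds your .inp file instead
out_file <- file.path(tempdir(), "cfa_data.dat")

# Mplus expects a numeric data file without a header row;
# NA is written as -999 to match MISSING ARE ALL (-999)
write.table(data_cfa, file = out_file, sep = " ", na = "-999",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Read the file back to confirm the layout
check <- read.table(out_file, header = FALSE)
print(check)
```

The column order in the file must match the `NAMES ARE` statement, which is why only the five item columns are exported.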
Part 4: Compare to True Values (30 points)
Create a comparison table and answer the reflection questions.
Comparison Table:
| Item | True λ | Estimated λ | Difference | % Bias |
|---|---|---|---|---|
| item1 | ? | ? | ? | ? |
| item2 | ? | ? | ? | ? |
| item3 | ? | ? | ? | ? |
| item4 | ? | ? | ? | ? |
| item5 | ? | ? | ? | ? |
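Note: % Bias = ((Estimated - True) / True) × 100. Because the latent trait is simulated with variance 1 and the model fixes the factor variance to 1 (std.lv = TRUE in lavaan, Wellbeing@1 in Mplus), the unstandardized loading is the estimate on the same scale as your chosen λ. Rounding and truncating the items to the 1-5 range can still attenuate it slightly.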
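The comparison table can be assembled directly from two vectors of loadings. The sketch below uses base R; both vectors contain made-up placeholder numbers that only demonstrate the arithmetic and are not results from any model.

```r
# Placeholder inputs for illustration only: replace both vectors with your
# own true loadings and the estimates from your fitted model.
true_lambda <- c(item1 = 0.80, item2 = 0.70, item3 = 0.60, item4 = 0.75, item5 = 0.65)
est_lambda  <- c(item1 = 0.76, item2 = 0.72, item3 = 0.57, item4 = 0.75, item5 = 0.61)

comparison <- data.frame(
  item       = names(true_lambda),
  true       = true_lambda,
  estimated  = est_lambda,
  difference = est_lambda - true_lambda,
  pct_bias   = (est_lambda - true_lambda) / true_lambda * 100,
  row.names  = NULL
)
print(comparison, digits = 3)
```

With lavaan, the estimated loadings can be taken from the `est` column of `parameterEstimates(fit)` for the rows whose operator is `=~`.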
Reflection Questions:
- How close were your estimated factor loadings to the true values?
- Did the model under- or over-estimate the loadings? Is there a pattern?
- How might the missing data in the Treatment group have affected your estimates?
- What would you expect to happen if you increased the missing data rate to 40%?
- Why is simulation useful for understanding statistical methods?
Extra Credit (10 points)
Choose one of the following:
- Run the CFA in both Mplus and lavaan and compare the results
- Run the analysis on the complete data (before introducing missing values) and compare to your results with missing data
- Try a different missing data mechanism (e.g., MAR, missing at random, where missingness depends on an observed variable such as a simulated age covariate) and discuss how it affects estimates
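For the MAR option, one possible approach is to let the probability of a missing response depend on an observed covariate through a logistic function. The sketch below uses base R and toy data. The `age` covariate, its distribution, the slope of 0.8, and the column name `item1_mar` are all assumptions made for illustration, since the starter simulation does not include an age variable.

```r
set.seed(123)
n <- 400
toy <- data.frame(
  group = rep(c("Treatment", "Control"), each = n / 2),
  age   = round(rnorm(n, mean = 40, sd = 10)),  # hypothetical covariate
  item1 = sample(1:5, n, replace = TRUE)         # placeholder item
)

# MAR: the probability of missingness rises with observed age, not with the
# item value itself. The intercept qlogis(0.20) gives a 20% chance at the
# mean age; the slope 0.8 is arbitrary.
age_z  <- (toy$age - mean(toy$age)) / sd(toy$age)
p_miss <- plogis(qlogis(0.20) + 0.8 * age_z)

# Only Treatment rows are eligible to become missing
is_treat <- toy$group == "Treatment"
make_na  <- is_treat & (runif(n) < p_miss)
toy$item1_mar <- ifelse(make_na, NA, toy$item1)

# Missing rate among Treatment participants, split at the median age
older     <- toy$age > median(toy$age)
miss_rate <- tapply(is.na(toy$item1_mar)[is_treat], older[is_treat], mean)
print(miss_rate)
```

If you use this mechanism with FIML, consider whether the variable that drives missingness needs to be part of the analysis for the MAR assumption to hold in your model.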
Submission
Submit the following:
- Your R script or Quarto document with all code
- Your Mplus input and output files (if using Mplus)
- A brief report (1-2 pages) with:
  - Your true parameter values
  - Missing data summary
  - Model results
  - Comparison table
  - Answers to reflection questions
Grading Rubric
| Component | Points |
|---|---|
| Data simulation with correct structure | 20 |
| Missing data correctly introduced | 20 |
| CFA model correctly specified and run | 30 |
| Comparison table and reflection | 30 |
| Total | 100 |
| Extra credit | +10 |