Software Integration

Becoming Multilingual in Data Science

Francisco Cardozo, PhD

Agenda

  1. The Multilingual Analyst: Why data interoperability matters.
  2. The Rosetta Stone: Moving data between R, SPSS, SAS, and Mplus.
  3. Comparing Mplus and R Outputs: Understanding different outputs from the same model.
  4. AI-Assisted Analysis: Using AI tools responsibly and effectively.

Why be Multilingual?

  • Collaborators use different tools (SPSS, SAS, Stata, Mplus).
  • Some methods are implemented better (or only) in specific software.
  • Flexibility: Don’t get stuck because of a file format.
  • Reproducibility: Scripting data transformation reduces manual errors.
  • Cost of software licenses.

1. Simulating Data in R

Instead of relying on external datasets, we will create our own. This allows us to know the true parameters.

library(tidyverse)

set.seed(2026)
n <- 500

# Simulate responses to a 5-item psychological scale (e.g., Anxiety)
# Underlying latent trait (theta) drives all item responses

data_sim <- tibble(
    id = 1:n,
    age = sample(18:65, n, replace = TRUE),
    group = sample(c("Male", "Female"), n, replace = TRUE),
    # Latent trait: true anxiety level (unobserved in real life)
    theta = rnorm(n, mean = 0, sd = 1)
) %>%
    mutate(
        # Generate 5 Likert items (1-5 scale) based on latent trait
        # Each item has a loading (λ) and unique error
        item1 = round(pmin(5, pmax(1, 3 + 0.8 * theta + rnorm(n, 0, 0.5)))),
        item2 = round(pmin(5, pmax(1, 3 + 0.7 * theta + rnorm(n, 0, 0.6)))),
        item3 = round(pmin(5, pmax(1, 3 + 0.9 * theta + rnorm(n, 0, 0.4)))),
        item4 = round(pmin(5, pmax(1, 3 + 0.6 * theta + rnorm(n, 0, 0.7)))),
        item5 = round(pmin(5, pmax(1, 3 + 0.75 * theta + rnorm(n, 0, 0.55))))
    )
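Because we simulated the data, we can sanity-check it before exporting — a quick sketch, assuming `data_sim` from the chunk above is in memory:

```r
# Each item should correlate positively with the latent trait
# (loadings of 0.6-0.9 imply moderate-to-strong correlations)
data_sim %>%
    summarise(across(item1:item5, ~ cor(.x, theta)))

# Inter-item correlations should all be positive
cor(select(data_sim, item1:item5))
```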

2. Exporting to Excel/CSV

The universal exchange format.

# CSV (Comma Separated Values) - Good for interoperability
write_csv(data_sim, "data_sim.csv")

# Excel - Best for human viewing
library(writexl)
write_xlsx(data_sim, "data_sim.xlsx")
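One caveat worth demonstrating: CSV stores only raw values, so factors and metadata do not survive a round trip. A quick check, assuming the file written above:

```r
# Read the CSV back in: 'group' returns as plain character,
# and any factor levels or labels are lost
data_back <- read_csv("data_sim.csv")
class(data_back$group)   # "character"
```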

3. R ↔ SPSS

Using the haven package to handle .sav files with metadata (labels).

library(haven)

# Add variable labels (SPSS loves labels)
data_spss <- data_sim %>%
    mutate(
        group = as.factor(group) # haven writes factors as labelled integers for SPSS
    )

attr(data_spss$group, "label") <- "Gender"

# Write to SPSS
write_sav(data_spss, "data_sim.sav")

# Read back
data_from_spss <- read_sav("data_sim.sav")
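After reading a .sav file, haven stores the SPSS metadata in column attributes. A short sketch of how to inspect and convert it, assuming `data_from_spss` from above:

```r
# The variable label survives the round trip
attr(data_from_spss$group, "label")

# haven represents SPSS value labels as <labelled> columns;
# convert to an ordinary R factor when you need one
data_from_spss <- data_from_spss %>%
    mutate(group = haven::as_factor(group))
```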

4. R ↔ SAS

SAS .sas7bdat files are common in clinical trials and public health.

library(haven)

# Write to SAS (note: haven's write_sas() is experimental; for reliable
# interchange with SAS, write_xpt() transport files are recommended)
write_sas(data_sim, "data_sim.sas7bdat")

# Read from SAS
data_from_sas <- read_sas("data_sim.sas7bdat")

5. R ↔ Parquet

Parquet is a modern, efficient columnar format—great for big data and cross-language workflows (R, Python, Spark).

library(arrow)

# Write to Parquet - fast and compressed
write_parquet(data_sim, "data_sim.parquet")

# Read from Parquet
data_from_parquet <- read_parquet("data_sim.parquet")

# Parquet preserves data types and is much faster than CSV for large files
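You can verify the size and speed claims yourself — a minimal sketch, assuming the CSV and Parquet files written above are in the working directory. (For a tiny 500-row file, Parquet's format overhead can dominate; the savings show up at scale.)

```r
# Compare on-disk sizes of the same dataset
file.size("data_sim.csv")
file.size("data_sim.parquet")

# Time the reads (larger files make the gap dramatic)
system.time(read_csv("data_sim.csv"))
system.time(read_parquet("data_sim.parquet"))
```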

Why Parquet?

  • Speed: often 10-100x faster than CSV for large datasets
  • Size: efficient compression (files often 50-90% smaller)
  • Cross-platform: Works with Python, Spark, and cloud tools

6. R ↔ Mplus

MplusAutomation is the bridge. It handles the formatting Mplus requires (no header row, numeric missing-value codes).

library(MplusAutomation)

# Prepare data for Mplus
# This converts factors to numbers, handles missing values,
# and creates the .dat and .inp template
prepareMplusData(
    data_sim,
    filename = "data_sim.dat",
    inpfile = TRUE
)

Challenge

  1. Import a dataset you have on your computer into R.
  2. Export it to another software format.

Part II

Comparing Mplus and R Outputs

Goal: Fit the same CFA model in both Mplus and lavaan, then compare results.

True parameters (from simulation):

  • Factor loadings: λ₁=0.8, λ₂=0.7, λ₃=0.9, λ₄=0.6, λ₅=0.75

Step 1: Export Data for Mplus

library(MplusAutomation)

# Select only the items for CFA
data_cfa <- data_sim %>%
    select(item1, item2, item3, item4, item5)

# Export to Mplus format
prepareMplusData(
    data_cfa,
    filename = "cfa_data.dat",
    inpfile = FALSE
)

Step 2: Mplus Input File

Your input file for Mplus should look like this:

TITLE: CFA with 5 items - Anxiety Scale

DATA: FILE IS cfa_data.dat;

VARIABLE: 
    NAMES ARE item1 item2 item3 item4 item5;
    USEVARIABLES ARE item1-item5;

ANALYSIS:
    ESTIMATOR = ML;

MODEL:
    Anxiety BY item1* item2 item3 item4 item5;
    Anxiety@1;  ! Fix factor variance to 1

OUTPUT: STANDARDIZED MODINDICES;

Step 3: Run Mplus from R

library(MplusAutomation)

# Run the Mplus model
runModels("cfa_model.inp")

# Read the results back into R
mplus_results <- readModels("cfa_model.out")

# View parameter estimates
mplus_results$parameters$unstandardized

Step 4: CFA in lavaan

library(lavaan)

# Define the model syntax (similar to Mplus)
cfa_model <- "
    Anxiety =~ item1 + item2 + item3 + item4 + item5
"

# Fit the model (std.lv = TRUE fixes the factor variance to 1,
# matching the Mplus setup)
fit_lavaan <- cfa(cfa_model,
    data = data_cfa,
    std.lv = TRUE
)

# View results
summary(fit_lavaan, standardized = TRUE, fit.measures = TRUE)

Step 5: Extract lavaan Parameters

# Get parameter estimates
parameterEstimates(fit_lavaan)

# Get standardized estimates
standardizedSolution(fit_lavaan)

# Get fit indices
fitMeasures(fit_lavaan, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea"))
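To fill in the comparison table on the next slide, you can line the two sets of loadings up programmatically — a sketch assuming `mplus_results` (Step 3) and `fit_lavaan` (Step 4) exist; the `paramHeader`/`param`/`est` columns follow MplusAutomation's output format:

```r
# Mplus loadings: rows where paramHeader is "ANXIETY.BY"
mplus_loadings <- mplus_results$parameters$unstandardized %>%
    filter(paramHeader == "ANXIETY.BY") %>%
    select(item = param, mplus = est)

# lavaan loadings: rows where the operator is "=~"
lavaan_loadings <- parameterEstimates(fit_lavaan) %>%
    filter(op == "=~") %>%
    transmute(item = toupper(rhs), lavaan = est)

# One row per item, estimates side by side
left_join(mplus_loadings, lavaan_loadings, by = "item")
```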

Comparing Results

Parameter     True Value   Mplus   lavaan
λ₁ (item1)    0.80         ?       ?
λ₂ (item2)    0.70         ?       ?
λ₃ (item3)    0.90         ?       ?
λ₄ (item4)    0.60         ?       ?
λ₅ (item5)    0.75         ?       ?

Question: Do both programs recover the true parameters?

Differences to Note

Aspect              Mplus                lavaan
Default estimator   MLR (robust)         ML
First loading       Fixed to 1           Fixed to 1
Output              Separate .out file   R console
Syntax              BY                   =~

Challenge

  1. Do the same CFA but use the items as categorical variables.

Part III

AI-Assisted Data Analysis

AI tools (ChatGPT, Claude, GitHub Copilot, Cursor) are changing how we write code and conduct analyses.

How do we use AI effectively?

What AI Can Help With

  • Code generation: Writing R, Mplus, or Python syntax
  • Debugging: Finding errors in your code
  • Translation: Converting code between languages (R → Python, SPSS → R)
  • Documentation: Explaining what code does
  • Learning: Understanding new statistical concepts

What AI Struggles With

  • Domain expertise: Understanding your specific research context
  • Data interpretation: Making substantive conclusions
  • Cutting-edge methods: Very recent techniques not in training data
  • Verification: AI can confidently produce incorrect code
  • Reproducibility: Same prompt may yield different results

Best Practices for AI in Analysis

  1. Always verify output - Run the code, check the results
  2. Understand before using - Don’t use code you can’t explain
  3. Start specific - Detailed prompts yield better results
  4. Iterate - Refine your prompts based on output

Example: Good vs. Bad Prompts

Bad prompt:

“Write CFA code”

Good prompt:

“Write lavaan code for a one-factor CFA with 5 items (item1-item5), fix factor variance to 1, and request standardized estimates and fit indices”

Example: Using AI to Translate Code

Prompt:

“Convert this Mplus syntax to lavaan:

MODEL: Anxiety BY item1* item2 item3 item4 item5; Anxiety@1;”

AI Output:

model <- '
    Anxiety =~ item1 + item2 + item3 + item4 + item5
'
fit <- cfa(model, data = mydata, std.lv = TRUE)

Example: Using AI for Debugging

Prompt:

“I’m getting this error in lavaan: Error: lavaan ERROR: sample covariance matrix is not positive-definite.

My data has 5 Likert items. What could be causing this?”

AI can suggest: checking for missing data, constant variables, multicollinearity, or small sample size.

AI Ethics in Research

  • Verification: You are responsible for accuracy, not the AI
  • Data privacy: Don’t share sensitive data with AI tools

AI Tools for Data Analysis

Tool             Best For                        Cost
ChatGPT          General coding, explanations    Free / $20 mo
Claude           Longer code, nuanced tasks      Free / $20 mo
GitHub Copilot   IDE integration, autocomplete   $10 mo (free for students)
Cursor           Full IDE with AI                Free / $20 mo

Hands-On: AI-Assisted Coding

Try this exercise:

  1. Ask an AI to write code that simulates data for a 2-factor CFA
  2. Run the code in R
  3. Does it work? What needed fixing?
  4. Ask the AI to add missing data to the simulation

Reflect: What was helpful? What required your expertise?

Summary

  • AI is a tool, not a replacement for statistical knowledge
  • Verify everything - AI makes confident mistakes
  • Use AI to accelerate your work, not to bypass learning
  • Your expertise is needed for interpretation and context

Key Takeaways

  1. Be multilingual: R is your hub for connecting software
  2. Simulation: Know the truth to evaluate your methods
  3. Compare outputs: Different software, same results (mostly)
  4. Use AI wisely: Powerful tool with important limitations

Questions?