BreakNBuild
# Install the package from github
# install.packages("devtools")
# devtools::install_github("https://github.com/focardozom/BreakNBuild")
library(BreakNBuild)
Machine Learning Model Evaluation using BreakNBuild
The BreakNBuild package provides tools for model evaluation through progressive data splitting and performance analysis. The progressive_splits() function enables users to assess model behavior across varying training set sizes, generating learning curves that reveal critical insights into model characteristics such as:
- Bias-variance trade-offs
- Learning rate patterns
- Optimal training set sizes
- Potential underfitting or overfitting issues
This vignette demonstrates the package’s capabilities using a synthetic dataset with embedded non-linear relationships. We compare two modeling approaches:
- Linear Regression (LM): A parametric model that assumes linear relationships, used to illustrate how model bias can affect generalization performance
- Support Vector Machine (SVM): A flexible model capable of capturing non-linear patterns, providing a contrast to the linear baseline
Through these examples, we’ll show how BreakNBuild’s diagnostic tools can inform model selection and debugging processes by visualizing performance metrics across different training set sizes.
Simulating the Dataset
We’ll create a synthetic dataset with hidden non-linear relationships. We’ll generate 500 observations with 5 predictor variables, and then construct a response variable that depends on both linear and quadratic terms of these predictors. This non-linear relationship will help us showcase how different models handle non-linear patterns in the data. The final dataset will retain only the original predictors (without the quadratic terms) to simulate a realistic scenario where the true underlying non-linear relationship is “unknown” to the modeler.
set.seed(2023) # Ensures reproducibility
n <- 500 # Number of observations
X <- matrix(rnorm(n * 5), ncol = 5) # Generating 5 predictors
# Generating non-linear terms
X_nonlinear <- cbind(X, X[, 1]^2, X[, 3]^2)
beta_nonlinear <- c(.3, -.2, .4, -.1, .2, .6, .8)
y_nonlinear <- X_nonlinear %*% beta_nonlinear + rnorm(n)
# Constructing the dataset and removing non-linear terms for model evaluation
df_nonlinear <- cbind(y_nonlinear, X_nonlinear) |>
tibble::as_tibble() |>
dplyr::select(-c(V7, V8))
#> Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
#> `.name_repair` is omitted as of tibble 2.0.0.
#> ℹ Using compatibility `.name_repair`.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
# Assigning column names
colnames(df_nonlinear) <- c("Y", paste0("X", 1:5))
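As a quick sanity check (not shown in the original output), dplyr::glimpse() should report 500 rows with columns Y and X1 through X5:
# Inspect the final dataset: 500 rows, columns Y and X1-X5
dplyr::glimpse(df_nonlinear)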
Modeling Approach
Now, we’ll specify the two modeling approaches we’ll use to evaluate the dataset.
Linear specification
Specify a linear regression model using the parsnip package:
library(parsnip)
lm_model <- linear_reg() |>
set_mode("regression")
Recipe Definition
We’ll create a recipe using the recipes package to define our model’s preprocessing steps. The recipe specifies the relationship between our response variable Y and all available predictors. While this preprocessing step isn’t strictly necessary for these simple models, it demonstrates BreakNBuild’s compatibility with the tidymodels workflow framework, allowing for more complex preprocessing scenarios in real-world applications.
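The recipe object itself is not shown in the original text; a minimal sketch consistent with how it is used below (Y as the outcome, all remaining columns as predictors) is:
library(recipes)
# Minimal recipe: model Y on all available predictors.
# No additional preprocessing steps are needed for this simple example.
recipe <- recipe(Y ~ ., data = df_nonlinear)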
Data Splitting
The BreakNBuild package’s progressive_splits() function creates multiple training/testing splits of your data, where:
- The testing (assessment) set remains constant at 20% of your total data.
- The training set starts small and progressively grows larger.
- The initial training set size is set by the start_size argument (here, start_size = 7).
progressive_sets <- progressive_splits(df_nonlinear, assessment_size = 0.2, start_size = 7)
For example, with our dataset of 500 observations:
- The assessment (testing) set is always 100 observations (20% of 500).
- The training set starts with 7 observations.
- Each subsequent split increases the training set size (7, 8, 9, …).
- The final split uses all remaining 400 observations for training.
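As a rough check, assuming progressive_splits() returns an rset-style tibble with one row per split, we can count the splits: every training size from 7 through 400 gives 400 - 7 + 1 = 394 splits.
# Count the splits (assumes an rset-style tibble, one row per split)
nrow(progressive_sets) # expected: 394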
The figure below illustrates how the function organizes these progressive data splits:
Workflow Assembly
Let’s create a workflow object as used in the tidymodels framework.
Workflow Assembly - Linear Model
We’ll create a workflow object that combines our linear regression model with the preprocessing recipe. In the tidymodels framework, workflows provide a convenient way to bundle model specifications with preprocessing steps:
library(workflows)
lm_wf <- workflow() |>
add_model(lm_model) |>
add_recipe(recipe)
Model Estimation
The LM model cannot capture the non-linear relationship in the data, and therefore it will likely perform poorly.
To get the training error we use get_training_error(). This function uses fit_resamples() from the tune package to fit the model on the training set and then computes the error using the yardstick package.
To get the validation error we use get_validation_error(). This function uses fit_resamples() and then uses the collect_metrics() function to get the error from the validation sets.
Compute the training error
training_error <- get_training_error(lm_wf, progressive_sets, "rsq")
Compute the validation error
validation_error <- get_validation_error(lm_wf, progressive_sets, "rsq")
Visualization of Findings
Once you have these two objects, you can use the get_learning_curve() function to plot the training and validation errors:
get_learning_curve(validation_error, training_error)
The linear model’s performance confirms our expectations: it struggles to capture the non-linear relationship in the data. The validation error remains consistently poor, showing no improvement even with more training data. Furthermore, once the training set reaches about 50 observations, the model’s learning plateaus and even begins to deteriorate, indicating that additional data does not help the model learn the underlying non-linear patterns.
Workflow Assembly - Non-linear Model
For our SVM model, we’ll create a similar workflow that combines the SVM specification with the same preprocessing recipe. This ensures consistent data preparation across both models while changing only the modeling approach:
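The SVM specification is not shown in the original text; a minimal sketch, assuming a radial basis function kernel fitted via the kernlab engine, is:
library(parsnip)
# Hypothetical SVM specification: RBF-kernel SVM for regression via kernlab
svm_model <- svm_rbf() |>
  set_mode("regression") |>
  set_engine("kernlab")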
svm_wf <- workflow() |>
add_model(svm_model) |>
add_recipe(recipe)
Model Estimation
As we did before, we can compute the training and validation errors for the SVM model.
Compute the training error
training_error <- get_training_error(svm_wf, progressive_sets, "rsq")
Compute the validation error
validation_error <- get_validation_error(svm_wf, progressive_sets, "rsq")
Visualization of Findings
get_learning_curve(validation_error, training_error)
Much better! The SVM model is able to capture the non-linear relationship in the data, and therefore it performs better than the linear model. The validation error is lower, and the model continues to learn as the training set grows. It looks like more data would be beneficial for the SVM model.
Conclusion
The progressive_splits() function in BreakNBuild is an effective tool for evaluating model performance across increasing training set sizes. It can help diagnose issues such as suboptimal model performance due to biased model selection. The get_learning_curve() function provides an intuitive visualization of the relationship between training set size and model performance, helping users identify optimal training set sizes for their specific use cases.