Scenario and data sets: Compare Python and Minitab models

Using a Jupyter notebook, we compare a Python model with a Minitab model to demonstrate the power of Minitab Model Ops®.

With Minitab Model Ops, we can easily compare a robust, operationalized Minitab model, such as a Random Forests® Regression model, with a non-Minitab model.

In this example, we use the Ames housing data set to compare a Multi-layer Perceptron Regressor (MLPRegressor) created in scikit-learn to an operationalized Random Forests Regression model created in Minitab Statistical Software.

A multi-layer perceptron (MLP) model is a neural network with at least three layers of nodes: an input layer, an output layer, and one or more hidden layers. MLP regression models are deep learning models that work well with very large data sets, but they often require extensive computational resources to train well.
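To make the layer structure concrete, here is a minimal sketch of a forward pass through an MLP with one hidden layer, written in plain Python. The weights are arbitrary illustrative values, not trained and not derived from the Ames data, and the helper names (`dense`, `relu`, `mlp_predict`) are hypothetical. In the notebook, scikit-learn's MLPRegressor handles both training and prediction; this sketch only shows how an input passes through the layers.

```python
def relu(values):
    """Rectified linear unit, applied element-wise."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: weights[j][i] links input i to node j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def mlp_predict(inputs, hidden_w, hidden_b, output_w, output_b):
    """Forward pass: input layer -> one hidden layer (ReLU) -> one output node."""
    hidden = relu(dense(inputs, hidden_w, hidden_b))
    return dense(hidden, output_w, output_b)[0]

# Arbitrary illustrative weights (assumption: not trained, not from the Ames data).
hidden_w = [[1.0, -1.0], [1.5, 2.0]]   # 2 hidden nodes, 2 inputs each
hidden_b = [0.0, -1.0]
output_w = [[2.0, 1.0]]                # 1 output node fed by 2 hidden nodes
output_b = [0.5]

prediction = mlp_predict([2.0, 1.0], hidden_w, hidden_b, output_w, output_b)
print(prediction)  # 6.5
```

Training consists of adjusting the weights and biases so that predictions like this one move closer to the observed responses, which is where most of the computational cost arises.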

Random Forests, meanwhile, is a decision tree-based algorithm that combines the predictions of many individual decision trees into a single output. Random forest models are machine learning models that are quite powerful with large data sets, and they usually require fewer computing resources than MLP regression models.
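The idea of combining trees into a single output can be sketched as follows. This is a generic illustration of ensemble averaging for regression, not Minitab's Random Forests® implementation. The trees are hypothetical one-split "stumps" with made-up thresholds and leaf values (prices in thousands of dollars), and the feature names are assumptions. In a real random forest, each tree is grown on a bootstrap sample of the training data and is typically much deeper.

```python
def make_stump(feature, threshold, low_value, high_value):
    """Return a one-split regression tree (a "stump") as a function."""
    def predict(row):
        return low_value if row[feature] <= threshold else high_value
    return predict

def forest_predict(trees, row):
    """Combine the trees by averaging their individual predictions."""
    predictions = [tree(row) for tree in trees]
    return sum(predictions) / len(predictions)

# Hypothetical trees with made-up splits and leaf values (prices in $1000s).
trees = [
    make_stump("living_area", 1500, 140.0, 220.0),
    make_stump("lot_area", 9000, 150.0, 210.0),
    make_stump("year_built", 1980, 130.0, 230.0),
]

house = {"living_area": 1800, "lot_area": 8000, "year_built": 1995}
forest_price = forest_predict(trees, house)
print(forest_price)  # 200.0
```

Because each tree can be evaluated independently and the combination step is a simple average, prediction is inexpensive compared with the layered computations of a neural network.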

Method

To compare the models, we can use performance statistics, such as R², to assess which model is better. For models with continuous responses, Minitab Model Ops calculates R² and the Mean Absolute Deviation (MAD) on a rolling basis to assess whether model drift or data drift occurs. When drift is present, data scientists can decide whether to retrain or replace the model.
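As a rough sketch of what a rolling calculation looks like, the following plain-Python code computes R² and MAD over a sliding window of consecutive rows. Several points are assumptions: MAD is taken here to be the mean absolute difference between actual and predicted values, the window is a fixed number of rows, and the numbers are made up for illustration. The source does not describe how Minitab Model Ops defines its windows, so this is not a reproduction of that calculation.

```python
def r_squared(actual, predicted):
    """R² = 1 - SS_residual / SS_total (assumes actual values are not all equal)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def mean_absolute_deviation(actual, predicted):
    """Mean of |actual - predicted| (assumed definition of MAD)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rolling_metrics(actual, predicted, window):
    """Return (R², MAD) for every run of `window` consecutive rows."""
    results = []
    for end in range(window, len(actual) + 1):
        a = actual[end - window:end]
        p = predicted[end - window:end]
        results.append((r_squared(a, p), mean_absolute_deviation(a, p)))
    return results

# Made-up values; the later predictions are deliberately worse.
actual = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
predicted = [110.0, 190.0, 310.0, 390.0, 560.0, 540.0]

results = rolling_metrics(actual, predicted, window=3)
for r2, mad in results:
    print(round(r2, 3), round(mad, 2))
```

In this made-up series, R² falls and MAD rises from the first window to the last, which is the kind of pattern that would prompt a closer look for drift.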

For the MLP model, this notebook calculates a single R² value and a single MAD value at one point in time. The goal is to demonstrate the use and operationalization of proprietary Minitab algorithms in conjunction with freely available Python models.

Data sets

A team of researchers collects data from the sale of individual residential properties in Ames, Iowa. The researchers want to identify the variables that affect the sale price. Variables include the lot size and various features of the residential property.

Note

These data were adapted from a public data set on Ames housing. Original data from DeCock, Truman State University.

This tutorial illustrates the deployment and comparison of a Random Forests® Regression model that was created in Minitab® Statistical Software. Use the following links to open the Minitab project file and the CSV data sets for baseline data, training and test data, and prediction data.

Baseline data: This data set includes more than 2900 rows, with the response variable, Sale Price, and 77 predictor columns.
AmesHousing.csv
Training and test data: This data set includes the baseline data and adds a column that contains an indicator that identifies whether the row is training data or test data.
AmesHousingTrainingTest.csv
Prediction data: This data set includes the baseline data, the training and test indicators, and adds a column that contains the prediction results.
AmesHousingPredictions.csv
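One way to work with the indicator and prediction columns in a notebook is sketched below, using only Python's standard library. The column names "Sale Price", "Sample", and "Prediction", the indicator values "Training" and "Test", and all the rows are assumptions made up for illustration; check the actual CSV headers before adapting this.

```python
import csv
import io

# Made-up rows standing in for AmesHousingPredictions.csv.
# Column names and indicator values are assumptions, not the real headers.
SAMPLE_CSV = """Sale Price,Lot Area,Sample,Prediction
200000,10000,Training,195000
105000,8000,Test,118000
170000,12000,Training,168000
240000,11000,Test,236500
"""

def split_rows(csv_file, indicator="Sample"):
    """Group rows by the value in the training/test indicator column."""
    groups = {}
    for row in csv.DictReader(csv_file):
        groups.setdefault(row[indicator], []).append(row)
    return groups

groups = split_rows(io.StringIO(SAMPLE_CSV))

# Pair each actual response with its prediction for the test rows only.
test_pairs = [
    (float(r["Sale Price"]), float(r["Prediction"])) for r in groups["Test"]
]
print(len(groups["Training"]), len(groups["Test"]))
print(test_pairs)
```

To read a downloaded file instead of the in-memory sample, pass an open file object, for example `open("AmesHousingPredictions.csv", newline="")`, to `split_rows`. The resulting pairs of actual and predicted values are the inputs needed for statistics such as R² and MAD.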

Note

To learn more about creating a Random Forests® Regression model in Minitab, go to Example of Random Forests® Regression.