Data considerations for Discover Best Model (Binary Response)

Note

This command is available with the Predictive Analytics Module.

To ensure that your results are valid, consider the following guidelines when you collect data, perform the analysis, and interpret your results.

The response variable should be binary

A binary response has two outcomes, such as pass or fail. If your response variable is continuous, use Discover Best Model (Continuous Response).

The predictors can be continuous or categorical

A continuous variable can be measured and ordered, and has an infinite number of values between any two values. For example, the diameter of a tire is a continuous variable.

Categorical variables contain a finite, countable number of categories or distinct groups. Categorical data might not have a logical order. For example, categorical predictors include gender, material type, and payment method.

If you have a discrete variable, you can decide whether to treat it as a continuous or categorical predictor. A discrete variable can be measured and ordered, but it has a countable number of values. For example, the number of people who live in a household is a discrete variable. The decision to treat a discrete variable as continuous or categorical depends on the number of levels and on the purpose of the analysis. For more information, go to What are categorical, discrete, and continuous variables?

A test set is the default when the number of cases is greater than 2000

Minitab uses cross-validation to compare the models when the number of cases is ≤ 2000. When the number of cases is larger than 2000, Minitab uses a test set. When the data set is large, validation with a test set reduces the time to analyze the data. To learn more about the settings for validation techniques in Discover Best Model (Binary Response), go to Specify the validation method for Discover Best Model (Binary Response).
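
As a rough illustration of the default rule described above, the following Python sketch selects a validation method from the number of cases. The function name default_validation_method and its threshold parameter are hypothetical and are not part of Minitab. Only the 2000-case cutoff comes from the text.

```python
def default_validation_method(n_cases: int, threshold: int = 2000) -> str:
    """Return the default validation method for a data set with n_cases rows.

    Hypothetical helper that mirrors the rule stated in the text.
    It is not Minitab's implementation.
    """
    if n_cases < 0:
        raise ValueError("n_cases must be non-negative")
    # Cross-validation when n_cases <= threshold, test set for larger data sets.
    return "cross-validation" if n_cases <= threshold else "test set"


if __name__ == "__main__":
    for n in (500, 2000, 2001):
        print(n, default_validation_method(n))
```

The boundary case matters here: exactly 2000 cases still falls under cross-validation, and the switch to a test set happens at 2001.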

The model should provide a good fit to the data

If the model does not fit the data, the results can be misleading. All of the model types include model summary statistics that describe the performance of the model. Use the results from the cross-validation or from the test set to determine if the model predicts the response well.
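
To make the idea of judging predictive performance on held-out data concrete, here is a minimal Python sketch that computes two common summary statistics for a binary response: the misclassification rate and the area under the ROC curve (AUC). The choice of these two statistics, the function names, and the 0.5 cutoff are assumptions for illustration. The text does not specify which statistics to use, and this is not Minitab's code. The inputs are assumed to be the actual 0/1 outcomes and the predicted event probabilities for cases that were not used to fit the model.

```python
from typing import Sequence


def misclassification_rate(y_true: Sequence[int], y_prob: Sequence[float],
                           cutoff: float = 0.5) -> float:
    """Fraction of cases whose predicted class differs from the actual class.

    A case is predicted as the event (1) when its probability is >= cutoff.
    """
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = sum(1 for y, p in zip(y_true, y_prob) if (p >= cutoff) != (y == 1))
    return errors / len(y_true)


def auc(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the share of (event, non-event) pairs in which the event case
    receives the higher predicted probability; ties count as one half.
    """
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes must be present")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    actual = [0, 0, 1, 1]                 # held-out outcomes
    predicted = [0.1, 0.4, 0.35, 0.8]     # predicted event probabilities
    print(misclassification_rate(actual, predicted))
    print(auc(actual, predicted))
```

An AUC near 0.5 suggests the model separates the two outcomes little better than chance, while a value near 1 suggests strong separation. The pairwise computation is quadratic in the number of cases, so it is suitable only for small illustrations.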