Example of prepping data

Use data prep tools to organize and prepare your data for more robust analyses.

Data prep steps

In this example, a compliance team is concerned about the accuracy of fraud detection in the automotive industry; however, the data need to be prepared before analysis can begin. Follow these steps to prepare insurance_fraud_data.csv for further analysis. To make a modification, select the column and open Data Prep Options to access the column cleanup options.
  1. Open Insurance Fraud Data in the Minitab Data Center.
  2. For claim_number, change the data type from numeric to text.
  3. For claim_number, prepend # to the column values.
  4. For age_of_driver, filter to include only drivers who are 100 years old or younger.
  5. In gender, change M to male and F to female.
  6. For annual_income, filter to include only drivers whose annual income is greater than 1.
  7. For address_change, change the data type from numeric to text.
  8. In address_change, change 1 to yes and 0 to no.
  9. For ZIP code, change the data type from numeric to text.
  10. Use Advanced Sort to sort by fraud, injury claim, and ZIP code.
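
For readers who want to check the logic of these steps outside the Data Center, the same sequence can be sketched in plain Python. This is only an illustration, not how Minitab applies the steps: the sample rows are made up, the column names zip_code, fraud, and injury_claim are assumed spellings, and the sort is assumed to be ascending on every key.

```python
# Made-up rows standing in for insurance_fraud_data.csv (hypothetical values).
# The names zip_code, fraud, and injury_claim are assumed spellings.
rows = [
    {"claim_number": 101, "age_of_driver": 46, "gender": "M", "annual_income": 54000,
     "address_change": 1, "zip_code": 50048, "fraud": 1, "injury_claim": 1200},
    {"claim_number": 102, "age_of_driver": 131, "gender": "F", "annual_income": 61000,
     "address_change": 0, "zip_code": 15025, "fraud": 0, "injury_claim": 0},
    {"claim_number": 103, "age_of_driver": 35, "gender": "F", "annual_income": -1,
     "address_change": 0, "zip_code": 20158, "fraud": 0, "injury_claim": 300},
    {"claim_number": 104, "age_of_driver": 29, "gender": "F", "annual_income": 38000,
     "address_change": 0, "zip_code": 80006, "fraud": 0, "injury_claim": 500},
]


def prep_rows(raw_rows):
    """Apply steps 2 through 10 to a list of row dictionaries."""
    prepared_rows = []
    for raw in raw_rows:
        # Steps 4 and 6: keep drivers aged 100 or younger with income above 1.
        if raw["age_of_driver"] > 100 or raw["annual_income"] <= 1:
            continue
        row = dict(raw)  # copy so the input rows are left unchanged
        # Steps 2 and 3: numeric to text, then prepend "#".
        row["claim_number"] = "#" + str(row["claim_number"])
        # Step 5: recode M/F; any other value passes through unchanged.
        row["gender"] = {"M": "male", "F": "female"}.get(row["gender"], row["gender"])
        # Steps 7 and 8: numeric to text, then recode 1/0 to yes/no.
        flag = str(row["address_change"])
        row["address_change"] = {"1": "yes", "0": "no"}.get(flag, flag)
        # Step 9: numeric to text.
        row["zip_code"] = str(row["zip_code"])
        prepared_rows.append(row)
    # Step 10: multi-key sort (ascending order is an assumption).
    prepared_rows.sort(key=lambda r: (r["fraud"], r["injury_claim"], r["zip_code"]))
    return prepared_rows


prepared = prep_rows(rows)
for row in prepared:
    print(row)
```

In this sketch the two filters run before the recodes, which gives the same rows as the order listed above because the filtered columns are not modified by any other step.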

Export data prep steps

After you apply all the prep steps, save them so that you can reuse them on future data sets that have the same columns. To save the steps, export them as a .mdcs file.
  1. In the Steps pane, select Export Steps from the dropdown menu.
  2. Find the exported file in your downloads folder or other save location. The file uses the same name as your data file, so rename it as needed.

Import data prep steps

To apply the steps to a new data file, import them as a .mdcs file. Select Import Steps from the dropdown menu in the Steps pane.
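
The idea behind exporting and importing steps, a saved recipe that can be replayed on any data set with the same columns, can be sketched in Python. The structure of a .mdcs file is not described here, so the JSON step list below is purely a hypothetical stand-in, and the operation names (to_text, prepend, recode) are invented for this illustration.

```python
import json

# Hypothetical step list. This is NOT the .mdcs format; it only
# illustrates the idea of a saved, replayable recipe.
steps = [
    {"op": "to_text", "column": "claim_number"},
    {"op": "prepend", "column": "claim_number", "value": "#"},
    {"op": "recode", "column": "gender", "mapping": {"M": "male", "F": "female"}},
]


def apply_steps(rows, step_list):
    """Replay each step, in order, on a copy of the rows."""
    result = [dict(row) for row in rows]
    for step in step_list:
        column = step["column"]
        for row in result:
            if step["op"] == "to_text":
                row[column] = str(row[column])
            elif step["op"] == "prepend":
                row[column] = step["value"] + row[column]
            elif step["op"] == "recode":
                row[column] = step["mapping"].get(row[column], row[column])
    return result


exported = json.dumps(steps)     # stands in for Export Steps
imported = json.loads(exported)  # stands in for Import Steps

# A new data set with the same columns (made-up values).
new_rows = [{"claim_number": 205, "gender": "F"}]
prepared_new = apply_steps(new_rows, imported)
print(prepared_new)
```

Because the recipe refers to columns by name, it only works on data that has those columns, which is why the exported steps are meant for data sets with the same columns.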

Explore data summaries

Each column has a summary that shows the shape of the data, the range of the data, and an icon that represents the data type.

A quick look at the column graphical summaries shows that channel has 3 levels and that days open has a bimodal distribution.

Open the Data Summary to see more detailed summary statistics for these columns.

The data summary for channel shows the frequency for each of the 3 levels.
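
As a rough illustration of what these summaries convey, the sketch below counts the frequency of each level in a categorical column and bins a numeric column to reveal two clusters. The values and the level names (Broker, Online, Phone) are made up for the example and are not taken from the insurance data; the 5-day bin width is also an arbitrary choice.

```python
from collections import Counter

# Made-up values; the level names and numbers are hypothetical.
channel = ["Broker", "Online", "Phone", "Broker", "Broker", "Online"]
days_open = [2, 3, 3, 4, 5, 21, 22, 23, 23, 25]

# Frequency of each level, as a data summary for a text column would show.
channel_counts = Counter(channel)
for level, count in channel_counts.most_common():
    print(f"{level}: {count}")

# Range of the numeric column.
print(f"days open range: {min(days_open)} to {max(days_open)}")

# Crude text histogram: group values into 5-day bins and count each bin.
bin_width = 5
bins = Counter((d // bin_width) * bin_width for d in days_open)
for start in range(0, max(days_open) + bin_width, bin_width):
    print(f"{start:>2}-{start + bin_width - 1:<2} {'*' * bins.get(start, 0)}")
```

Two separate groups of filled bins with empty bins between them are the kind of pattern that a bimodal shape in the column summary points to.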

What's next

Because the data for days open suggest two distributions, the insurance company wants to investigate further. Go to Example of analyzing data.