Objectives

Before analyzing fraud detection trends, the dataset must be cleaned and standardized. In this section, you will:

  • Correct data types
  • Remove invalid records
  • Standardize categorical values
  • Organize the dataset for analysis

All data preparation is completed in the Minitab Data Center.

Open your data source

  1. From the Minitab Solution Center Home page, select Data Prep.
  2. Select Add Data.
  3. Sign in to your repository or browse to a local file.
  4. Open the Insurance Fraud Data.

    Insurance Fraud Data

Understand the Data Center views

The Data Center has two primary views:
Cleanup view
You can begin cleaning your data when you are in the Cleanup view.
Use the Cleanup view to:
  • Modify column values
  • Change data types
  • Filter rows
  • Standardize categories
  • Sort data
Data Source view
If you need to change the dataset schema or any settings that affect the entire dataset, select the data source file icon to open the Options panel.

For more information, go to Manage the dataset schema or Set data source options.

Use the Data Source view to:
  • Adjust dataset-wide settings
  • Modify schema
  • Configure data source options

Prepare the dataset

The compliance team wants to improve fraud detection accuracy. Before analysis begins, the dataset must be validated and standardized. Follow these steps to prepare insurance_fraud_data.csv for further analysis.
  1. Open Insurance Fraud Data in the Minitab Data Center.
  2. Make sure you are in the Cleanup view.
  3. Select a column, then open the Data Prep Options dropdown menu to access the cleanup options for that column.

Standardize identifiers

Ensure claim identifiers are treated as text and clearly formatted.
  • Change the claim_number data type from numeric to text.
  • Prepend the # symbol to all claim numbers.

This prevents numeric interpretation and preserves formatting.
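
If you want to reproduce this step outside the Data Center, the following is a minimal Python sketch of the same idea. It is not how Minitab implements the step. The sample values and the helper name format_claim_number are hypothetical; only the claim_number column name comes from this tutorial.

```python
# Hypothetical sample rows; only the column name comes from the tutorial.
rows = [
    {"claim_number": 1001},
    {"claim_number": 20457},
]


def format_claim_number(value):
    """Convert a claim number to text and prepend '#' (hypothetical helper)."""
    text = str(value)
    # Do not add a second '#' if the step is applied twice.
    return text if text.startswith("#") else "#" + text


for row in rows:
    row["claim_number"] = format_claim_number(row["claim_number"])

print(rows)
```

After this step, each claim number is a string, so it can no longer be summed or averaged by accident.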

Remove invalid or unrealistic values

Clean outliers and placeholder values that could affect analysis.
  • Filter age_of_driver to include only values ≤ 100.
  • Filter annual_income to include only values greater than 1.

This removes unrealistic ages and invalid income entries.
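
The two filters above can be sketched in plain Python as a single row-level check. This is an illustration under assumptions, not the Data Center's implementation: the sample rows and the helper name is_valid are hypothetical, and the thresholds are the ones stated in the steps.

```python
# Hypothetical sample rows using the two columns named in the steps.
rows = [
    {"age_of_driver": 42, "annual_income": 56000},
    {"age_of_driver": 278, "annual_income": 61000},  # unrealistic age
    {"age_of_driver": 35, "annual_income": -1},      # placeholder income
    {"age_of_driver": 100, "annual_income": 38000},  # boundary age is kept
]


def is_valid(row):
    """Keep rows with age <= 100 and income greater than 1."""
    return row["age_of_driver"] <= 100 and row["annual_income"] > 1


cleaned = [row for row in rows if is_valid(row)]

print(len(rows) - len(cleaned), "rows removed")
```

A row must pass both conditions to be kept, which matches applying the two filters one after the other.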

Standardize categorical values

Ensure consistent, readable labels.
  • In gender, replace:
    • M → male
    • F → female
  • Change the address_change data type from numeric to text.
  • In address_change, replace:
    • 1 → yes
    • 0 → no

Standardized categories improve readability and reporting.
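
As a rough sketch of the replacements above, the following Python uses lookup tables to map codes to labels. The sample rows and the names GENDER_LABELS, ADDRESS_CHANGE_LABELS, and standardize are hypothetical. Leaving unmapped values unchanged is an assumption; the tutorial does not say how other values are handled.

```python
# Lookup tables taken from the replacements listed in the steps.
GENDER_LABELS = {"M": "male", "F": "female"}
ADDRESS_CHANGE_LABELS = {"1": "yes", "0": "no"}

# Hypothetical sample rows.
rows = [
    {"gender": "M", "address_change": 1},
    {"gender": "F", "address_change": 0},
]


def standardize(row):
    """Return a copy of the row with readable category labels."""
    out = dict(row)
    # Assumption: values without a mapping are left unchanged.
    out["gender"] = GENDER_LABELS.get(row["gender"], row["gender"])
    # Convert the numeric code to text first, then replace it.
    code = str(row["address_change"])
    out["address_change"] = ADDRESS_CHANGE_LABELS.get(code, code)
    return out


standardized = [standardize(row) for row in rows]
print(standardized)
```

Converting address_change to text before the replacement mirrors the order of the steps: the column must hold text before it can hold yes and no.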

Correct data types

Some numeric fields represent identifiers rather than quantities.
  • Change the zip_code data type from numeric to text.

This preserves leading zeros and prevents unintended numeric operations.
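
To see why the text type matters for ZIP codes, here is a small Python illustration. The sample ZIP codes are hypothetical, and the sketch only demonstrates the general behavior of numeric conversion, not anything specific to the Data Center.

```python
import csv
import io

# Hypothetical CSV content with one ZIP code that starts with zero.
raw = "zip_code\n02134\n50010\n"

# csv.DictReader returns every field as text, so leading zeros survive.
as_text = [row["zip_code"] for row in csv.DictReader(io.StringIO(raw))]

# Converting to a number drops the leading zero.
as_number = [int(z) for z in as_text]

print(as_text)
print(as_number)
```

The text version keeps all five characters of each ZIP code, while the numeric version turns the first value into a four-digit number.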

Organize the dataset

Use Advanced Sort to prepare for analysis. Sort by:
  • fraud reported
  • injury_claim
  • zip_code

Sorting helps organize fraud-related records for review.
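
A multi-column sort like the one above can be sketched in Python with a tuple sort key. Several things here are assumptions: the sample rows are hypothetical, the first column is written as fraud_reported for illustration, and the ascending order is a guess because the tutorial does not state a sort direction.

```python
# Hypothetical sample rows; fraud_reported is an assumed column name.
rows = [
    {"fraud_reported": 1, "injury_claim": 500, "zip_code": "50010"},
    {"fraud_reported": 0, "injury_claim": 900, "zip_code": "02134"},
    {"fraud_reported": 1, "injury_claim": 500, "zip_code": "02134"},
    {"fraud_reported": 0, "injury_claim": 300, "zip_code": "80011"},
]

# Sort by the first key, then break ties with the second and third keys.
# Assumption: every key is sorted in ascending order.
sorted_rows = sorted(
    rows,
    key=lambda r: (r["fraud_reported"], r["injury_claim"], r["zip_code"]),
)

for row in sorted_rows:
    print(row)
```

Later keys only matter when earlier keys are tied: the two rows with fraud_reported 1 and injury_claim 500 are ordered by zip_code.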

Use Minitab AI to clean your data

While you are in the Cleanup view, the Minitab Data Center provides a conversational interface that guides your data preparation. For the example above, you can enter the following text into the Minitab AI prompt to get the same results as the individual steps.

Make claim numbers to text. Add the number symbol to claim numbers. Remove drivers that are older than one hundred. Change m to male and f to female. Remove drivers that don’t have a valid income. Change address_change to text. Make 1 to yes and 0 to no for address changes. Sort by fraud, injury claim, and zip code.

For more information on using Minitab AI in the Data Center, go to Using Minitab AI to clean your data.

Merging or reshaping datasets

In addition to cleaning and standardizing data, you may need to combine or reorganize datasets before analysis.

The following operations help prepare data for reporting, statistical analysis, or dashboard creation.
Join
Combines related datasets by matching rows using one or more key fields. This adds columns and makes the dataset wider.

For more information, go to Join datasets.
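
As a conceptual sketch of a join, the following Python matches rows from two small datasets on a key field. The second dataset, its adjuster column, and all sample values are hypothetical. The sketch keeps only rows whose key appears in both datasets (an inner join), which is an assumption; the Data Center may offer other join types.

```python
# Hypothetical datasets that share the claim_number key field.
claims = [
    {"claim_number": "#1001", "injury_claim": 500},
    {"claim_number": "#1002", "injury_claim": 900},
    {"claim_number": "#1003", "injury_claim": 250},
]
adjusters = [
    {"claim_number": "#1001", "adjuster": "Lee"},
    {"claim_number": "#1002", "adjuster": "Patel"},
]

# Index the second dataset by key for fast matching.
by_key = {row["claim_number"]: row for row in adjusters}

# Keep only claims with a match, and merge the columns of both rows.
joined = [
    {**claim, **by_key[claim["claim_number"]]}
    for claim in claims
    if claim["claim_number"] in by_key
]

print(joined)
```

Each joined row has more columns than either input row, which is why a join makes the dataset wider.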

Union
Stacks datasets with the same structure into one dataset. This adds rows and makes the dataset longer.

For more information, go to Union datasets.
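
A union can be sketched in Python as stacking two lists of rows after checking that they share the same columns. The monthly datasets, their values, and the helper name union are hypothetical. Raising an error on mismatched columns is a design choice for this sketch, not documented Data Center behavior.

```python
def union(first, second):
    """Stack two lists of row dicts that share the same columns."""
    all_rows = first + second
    if not all_rows:
        return []
    columns = set(all_rows[0])
    for row in all_rows:
        # Assumption for this sketch: mismatched columns are an error.
        if set(row) != columns:
            raise ValueError("datasets must have the same columns")
    return all_rows


# Hypothetical monthly datasets with identical structure.
january = [{"claim_number": "#1001", "injury_claim": 500}]
february = [
    {"claim_number": "#2001", "injury_claim": 750},
    {"claim_number": "#2002", "injury_claim": 120},
]

combined = union(january, february)
print(len(combined), "rows")
```

The column count stays the same and the row count is the sum of the inputs, which is why a union makes the dataset longer.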

Transpose
Switches rows and columns. This is useful when data is arranged in a format that is not ideal for analysis.

For more information, go to Transpose datasets.
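
As an illustration of what a transpose does, the following Python swaps the rows and columns of a small table. The table contents are hypothetical and are not part of the insurance fraud dataset.

```python
# Hypothetical table: each inner list is one row.
table = [
    ["metric", "Q1", "Q2"],
    ["claims", 120, 135],
    ["fraud_cases", 8, 11],
]

# zip(*table) groups the first items of every row, then the second, and so on.
transposed = [list(column) for column in zip(*table)]

for row in transposed:
    print(row)
```

The original 3 x 3 table had one row per metric; the transposed table has one row per quarter.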

Export data prep steps

After you apply all the prep steps, you can save them to reuse on future datasets that have the same columns. To save the steps, export them as a .mdcs file.
  1. In the Steps pane on the left, select Export Steps from the dropdown menu.
  2. Find the exported file in your downloads folder or other save location. The file uses the same name as your data file, so rename it if you want a more descriptive name.

Import data prep steps

To apply the steps to a new data file, import them as a .mdcs file. Select Import Steps from the dropdown menu in the Steps pane.

Explore data summaries

Each column has a graphical summary that shows the shape of the data, the range of the data, and an icon that represents the data type.

For example, channel has 3 levels and days open shows a bimodal distribution.

Open the Data Summary to see more detailed summary statistics for these columns.

The data summary for channel shows the frequency for each of the 3 levels.

Use the right-click menu to edit the grouping label, exclude the group from the dataset, or show only the rows that contain this value.
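
The frequency summary and the two right-click filters described above can be approximated in Python as follows. This is only a conceptual sketch: the level names phone, online, and broker are hypothetical, since the tutorial says only that channel has 3 levels.

```python
from collections import Counter

# Hypothetical values for the channel column (3 levels).
channel = ["phone", "online", "broker", "online", "phone", "online"]

# Frequency of each level, similar to what the data summary reports.
frequency = Counter(channel)

# Exclude one group from the data.
without_broker = [value for value in channel if value != "broker"]

# Show only the rows that contain one value.
only_online = [value for value in channel if value == "online"]

print(frequency.most_common())
```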

What's next

Because the bimodal shape of days open suggests that the data come from two distributions, the insurance company wants to investigate further. Go to Analyze your data.