Creating data pipelines

In the Minitab Data Center, you can create a data pipeline to clean and transform data from one or more sources into a ready-to-use dataset.

What is a data pipeline?

A data pipeline is a sequence of steps that collect, transform, and prepare data so that the data are ready for analysis or reporting. Data pipelines help ensure that:
  • Data remain consistent and reliable
  • Updates occur on demand
  • Teams use the same trusted dataset
  • Errors are identified before the data are used

The data pipeline appears as an interactive visual diagram that lets you add, remove, and modify nodes while receiving real-time processing status and error messages.

For example, you can create a pipeline that pulls data from a CSV file and a Minitab worksheet, cleans and combines the data, then outputs a single dataset to use in your dashboard.
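
The flow in this example can be sketched in plain Python. This is only a conceptual illustration of source, cleanup, and merge steps, not how the Data Center works internally. The sample data, the column names, and the functions read_csv_source, clean, union, and run_pipeline are all hypothetical.

```python
import csv
import io

# Hypothetical stand-ins for the two sources: CSV text and worksheet rows.
CSV_TEXT = "batch,weight\nA1, 10.5\nA2,\nA3,11.2\n"
WORKSHEET_ROWS = [
    {"batch": "B1", "weight": "9.8"},
    {"batch": "B2", "weight": "n/a"},
]


def read_csv_source(text):
    """Data source step: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleanup step: trim whitespace and drop rows whose weight is not numeric."""
    cleaned = []
    for row in rows:
        value = row["weight"].strip()
        try:
            weight = float(value)
        except ValueError:
            continue  # skip missing or non-numeric weights
        cleaned.append({"batch": row["batch"].strip(), "weight": weight})
    return cleaned


def union(*datasets):
    """Merge step: stack datasets that share the same columns."""
    return [row for dataset in datasets for row in dataset]


def run_pipeline():
    """Run source -> cleanup -> merge and return the output dataset."""
    return union(clean(read_csv_source(CSV_TEXT)), clean(WORKSHEET_ROWS))


output = run_pipeline()
```

Each function plays the role of one node, and the final list plays the role of the single output dataset.
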
Note

Each pipeline supports up to 60 processing nodes, plus one output node (61 nodes total). You can have up to ten data source nodes.
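
As a rough illustration of these limits, the following Python sketch checks a list of node types against them. It assumes that every node other than the output node, including data source nodes, counts toward the 60 processing nodes. The note does not state this explicitly, and the node type names and the check_limits function are hypothetical.

```python
from collections import Counter

# Limits taken from the note above.
MAX_PROCESSING_NODES = 60
MAX_OUTPUT_NODES = 1
MAX_DATA_SOURCES = 10


def check_limits(node_types):
    """Return a list of limit violations for a list of node type names."""
    counts = Counter(node_types)
    outputs = counts["output"]
    # Assumption: every node that is not an output node counts as processing.
    processing = len(node_types) - outputs
    problems = []
    if processing > MAX_PROCESSING_NODES:
        problems.append(f"{processing} processing nodes (limit {MAX_PROCESSING_NODES})")
    if outputs > MAX_OUTPUT_NODES:
        problems.append(f"{outputs} output nodes (limit {MAX_OUTPUT_NODES})")
    if counts["source"] > MAX_DATA_SOURCES:
        problems.append(f"{counts['source']} data source nodes (limit {MAX_DATA_SOURCES})")
    return problems


small = ["source", "source", "cleanup", "join", "output"]
too_many_sources = ["source"] * 11 + ["union", "output"]
```

Here check_limits(small) returns an empty list, while check_limits(too_many_sources) reports the data source limit.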

Pipeline views

Every Data Center project contains an interactive pipeline diagram that represents the data processing steps.
Cleanup view
Use the Cleanup view to clean and prepare your data.
Data Source view
Use the Data Source view to modify the dataset schema or any settings that affect the entire dataset.

For more information, go to Manage the dataset schema or Set data source options.

Adjust the pipeline display

Use the Zoom In, Zoom Out, or Fit View buttons on the pipeline canvas to adjust your view.
You can also select Auto Layout from the toolbar to optimize the pipeline view.
Note

You can drag and reposition nodes for optimal visual organization.

Available nodes

Most pipelines include the following types of nodes:
  • Data Source
  • Cleanup
  • Merge
  • Reshape
  • Output

Data source nodes

A data source node connects your pipeline to a dataset. Each pipeline supports a maximum of ten data source nodes.
To add a data source node, select Add Data from the toolbar. You can also select Add Data Source from the canvas context menu.

For more information on data source nodes, go to Source node basics.

Cleanup nodes

A cleanup node fixes formatting issues, removes errors, and performs other data preparation operations.
The Data Center supports multiple cleanup nodes in flexible hierarchies to support all your data cleaning processes.
The first cleanup node is added in series; subsequent cleanup nodes are added in parallel. You can rename nodes and move them to any position at any time.

To add an unparented cleanup node, select Add Cleanup from the canvas context menu.

For more information on data cleanup nodes, go to Cleanup step basics.

Data merge nodes

Use Join or Union nodes to combine multiple datasets.
You can add join and union nodes from an existing node or the connector line.
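
The difference between the two merge types can be sketched in plain Python. This is a conceptual illustration, not the Data Center implementation: a union stacks rows from datasets with the same columns, while a join matches rows on a key column. An inner join is used here only as an example, and the sample data and the functions union and inner_join are hypothetical.

```python
def union(top, bottom):
    """Stack the rows of two datasets, one after the other."""
    return top + bottom


def inner_join(left, right, key):
    """Match rows on a shared key column and combine their columns."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [
        {**left_row, **right_row}
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]


january = [{"part": "P1", "defects": 3}, {"part": "P2", "defects": 1}]
february = [{"part": "P1", "defects": 2}]
suppliers = [{"part": "P1", "supplier": "Acme"}]

stacked = union(january, february)               # three rows, same columns
joined = inner_join(january, suppliers, "part")  # one row, extra column
```

In this sketch a union adds rows and a join adds columns.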

To add an unparented data merge node, select either Add Join or Add Union from the canvas context menu.

For more information on data merge nodes, go to Merging datasets.

Reshaping nodes

Reshape datasets using Transpose operations.
You can add transpose nodes from an existing node or the connector line.
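
Conceptually, a transpose swaps rows and columns. The following Python sketch shows the idea on a small table stored as a list of rows. How the Transpose node treats column headers and data types is not covered here, so the table layout and the transpose function are assumptions made for illustration.

```python
def transpose(rows):
    """Swap rows and columns of a rectangular table stored as a list of rows."""
    return [list(column) for column in zip(*rows)]


wide = [
    ["month", "Jan", "Feb"],
    ["defects", 3, 2],
]

tall = transpose(wide)
# tall is [["month", "defects"], ["Jan", 3], ["Feb", 2]]
```

Transposing the result again returns the original table.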

To add an unparented reshaping node, select Add Transpose from the canvas context menu.

For more information on reshaping nodes, go to Transpose datasets.

Output nodes

An output node is the terminal node of a data pipeline. It delivers data to a final destination, such as an analysis tool or dashboard.

To set an output node, right-click a parent node and choose Set Output. From here, you can send a copy of the cleaned data to a Minitab project or a Minitab Dashboard.

You can also select Open In from the toolbar to send a copy of the cleaned data to a Minitab project or Minitab Dashboard.

For more information on exporting the data or the entire Data Center pipeline, go to Export data and projects.

Refresh the pipeline

Use Refresh to reprocess the data transformations within the data pipeline. Only Data Source nodes can be refreshed independently.

To refresh the entire pipeline, select Refresh from the toolbar.

To refresh an individual data source, select Refresh from the source node context menu. If a data source is not accessible, you will be prompted to reconnect or browse for the file.
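
The refresh behavior described above can be modeled with a short Python sketch. It is a simplified illustration for file-based sources only and does not reflect the Data Center implementation. The functions refresh_source and refresh_pipeline and the status values are hypothetical.

```python
from pathlib import Path


def refresh_source(path):
    """Re-read one file-based source and report when it needs reconnecting."""
    file = Path(path)
    if not file.is_file():
        # Stands in for the prompt to reconnect or browse for the file.
        return {"status": "needs_reconnect", "rows": []}
    lines = file.read_text(encoding="utf-8").splitlines()
    return {"status": "ok", "rows": lines}


def refresh_pipeline(paths, transform):
    """Refresh every source, then rerun the downstream transformation."""
    results = {path: refresh_source(path) for path in paths}
    missing = [path for path, result in results.items() if result["status"] != "ok"]
    if missing:
        return {"status": "needs_reconnect", "missing": missing, "output": None}
    rows = [line for result in results.values() for line in result["rows"]]
    return {"status": "ok", "missing": [], "output": transform(rows)}
```

Calling refresh_source models refreshing a single data source, while refresh_pipeline models the toolbar Refresh, which re-reads all sources and reprocesses the transformations.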