We’ve all read articles extolling the limitless potential of Artificial Intelligence (AI) and its cousin, machine learning. As an IT leader, you may be nervous about an AI project’s sluggish progress after it has slid past multiple milestones your team was highly confident it would meet. Most likely, your problem is the tedium, high effort and unpredictability of data wrangling, which causes:
- Stalled AI projects.
- Increased AI project costs.
- Doubt around AI project insights and recommendations.
- Disappointing benefits from AI projects.
Data wrangling refers to all the effort your data scientists and software developers invest in preparing data before it can reveal the actionable insights you hope to gain through data exploration and data analytics.
Improving data wrangling
You can speed up and improve data wrangling by using more software and clearer steps to make data preparation:
- Less tedious for expensive, experienced data experts.
- Less demanding of staff effort overall.
- Quicker and therefore cheaper.
- More conducive to greatly improving data quality.
- More operationalized to run on a predefined schedule, in a governed and scalable manner.
The steps below are performed by every AI project before moving on to the data analytics step that produces the business value. If you can’t relate the planned work of your project team to these steps, then it’s highly likely that your AI project is inadequately organized and is at risk of not reaching the project goal.
Data discovery is the first step. Its goal is to identify the data sources required to achieve the goal of your AI project.
The benefit of data discovery is that it provides an initial indication of the:
- Extent to which your organization’s data can actually support the goal of the AI project.
- Likely data preparation effort.
If your AI project team hasn’t quickly identified the required data sources, then your project is off to a poor start. You may be dealing with:
- Fundamental disagreements about the project goal and how to reach it.
- An overly ambitious AI project goal.
- Significant gaps in your application portfolio that must be addressed before the AI project can succeed.
Data profiling is the process for detecting the magnitude of various data problems in every selected data source.
Data profiling results are used to plan the work to improve the quality and completeness of the data.
If your AI project team appears to be stuck in endless data profiling work, then the team needs to focus on how to better automate this step.
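Automating this step can be as simple as a script that reports, per column, how many values are missing and how many are distinct. The sketch below is a minimal illustration in plain Python; the sample rows and column names are hypothetical, and a real project would run this against every selected data source.

```python
# Hypothetical sample rows, as loaded from any data source.
rows = [
    {"customer_id": "C1", "country": "US", "revenue": "1200"},
    {"customer_id": "C2", "country": "",   "revenue": "950"},
    {"customer_id": "C3", "country": "US", "revenue": None},
]

def profile(rows):
    """Report missing-value and distinct-value counts per column."""
    report = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        missing = sum(1 for v in values if v in (None, ""))
        distinct = len({v for v in values if v not in (None, "")})
        report[col] = {"missing": missing, "distinct": distinct}
    return report

print(profile(rows))
# e.g. "country" shows 1 missing value, flagging a cleanup task
```

A report like this turns open-ended profiling into a bounded checklist the team can plan against.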
Structuring the data includes standardizing values of dates, numbers and units of measure across your selected datastores.
The benefit of structuring the data includes minimizing the risk that:
- Data scientists will sum incompatible numeric values that first need to be converted to a common unit.
- Data scientists will perform date arithmetic on incompatible date values.
- Confusion about column definitions and contents will produce misleading results.
If your AI project team appears to be struggling to complete data structuring, often the problem is that the team is immersed in endless debates about ideal data standards. Remind them that the project schedule is important and that compromises are acceptable to you.
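In practice, structuring often amounts to small conversion routines agreed once and applied everywhere. The sketch below shows one way to standardize dates and units of measure; the list of source date formats and the choice of kilograms as the standard unit are assumptions for illustration.

```python
from datetime import datetime

# Assumed source date formats; a real project lists the formats
# actually found during data profiling.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"]

def standardize_date(value):
    """Parse a date in any known source format into ISO 8601."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

def to_kilograms(value, unit):
    """Convert a weight to one standard unit before any arithmetic."""
    factors = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}
    return value * factors[unit]

print(standardize_date("31/12/2024"))  # → 2024-12-31
print(to_kilograms(500, "g"))          # → 0.5
```

Agreeing on “good enough” conversion tables like these is exactly the kind of compromise that keeps the schedule intact.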
Cleaning up the data corrects the problems that the data profiling step identified. This step is traditionally the most time-consuming part of the data preparation process.
The benefit of cleaning up the data is that it ensures high confidence, reproducible insights from data analysis.
If your AI project team appears to be daunted by the amount of data cleanup work, then help the team to prioritize the work and back off self-imposed perfection.
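Prioritized cleanup usually means a short list of mechanical fixes applied record by record, rather than perfection. The sketch below applies three such fixes; the column names, the default value and the rules themselves are hypothetical stand-ins for whatever the profiling step actually flagged.

```python
# Hypothetical cleanup rules driven by earlier profiling findings.
rows = [
    {"name": "  Acme Corp ", "country": "us"},
    {"name": "Globex",       "country": None},
]

def clean(row, default_country="UNKNOWN"):
    """Apply simple, prioritized fixes: trim, normalize case, fill gaps."""
    return {
        "name": row["name"].strip(),
        "country": (row["country"] or default_country).upper(),
    }

cleaned = [clean(r) for r in rows]
print(cleaned)
```

Encoding the fixes as code also makes the cleanup reproducible when the source data is refreshed.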
Once data has been cleaned up, it must be validated by testing for errors introduced or missed by the data preparation process up to this point.
The benefit of validating the data is that previously unrecognized errors in the data and in the design of the data integration will become apparent and can then be corrected.
If your AI project team insists that all the errors identified during validation must be addressed before the AI project can move forward, then help the team to prioritize the errors and back off self-imposed over-validation.
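Validation scales best when it is expressed as explicit rules that return a list of violations rather than a pass/fail verdict, so the team can rank the violations. The rules below are invented examples; real rules come from the project’s business requirements.

```python
# Hypothetical validation rules; real rules come from business requirements.
def validate(row):
    """Return a list of rule violations for one prepared record."""
    errors = []
    if not row.get("customer_id"):
        errors.append("missing customer_id")
    if not isinstance(row.get("revenue"), (int, float)):
        errors.append("revenue is not numeric")
    elif row["revenue"] < 0:
        errors.append("revenue is negative")
    return errors

good = {"customer_id": "C1", "revenue": 1200.0}
bad  = {"customer_id": "",   "revenue": -5}
print(validate(good))  # → []
print(validate(bad))
```

Because each record yields a concrete list of violations, prioritizing becomes a matter of counting which rules fire most often.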
Enriching data adds value to the data in your datastores by:
- Merging external data with internal data.
- Pre-calculating and persisting results for many common calculations.
The benefit of this enrichment step is that it greatly speeds up subsequent data analytics processes. The cost to create the enriched data is typically a small fraction of the benefit it delivers to the many data analytics processes that run against it repeatedly.
If your AI project team has added too much scope to enriching data, then help the team to prioritize the ideas and leave some of the ideas for another day.
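Both forms of enrichment can be sketched in a few lines: joining internal records to an external reference table, and persisting a commonly needed calculation alongside each record. The orders, the region lookup and the `total` field below are all hypothetical examples.

```python
# Hypothetical internal orders enriched with external reference data,
# plus a pre-calculated total persisted alongside each record.
orders = [
    {"order_id": 1, "country": "US", "qty": 3, "unit_price": 10.0},
    {"order_id": 2, "country": "DE", "qty": 1, "unit_price": 25.0},
]
regions = {"US": "Americas", "DE": "EMEA"}  # external reference data

def enrich(order):
    enriched = dict(order)
    enriched["region"] = regions.get(order["country"], "unknown")
    enriched["total"] = order["qty"] * order["unit_price"]  # pre-calculated once
    return enriched

enriched_orders = [enrich(o) for o in orders]
print(enriched_orders[0]["region"], enriched_orders[0]["total"])
```

Paying the computation cost once here is what lets every downstream analytics run skip it.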
Now your AI project is ready for the data analytics step that produces the business value.
What strategies would you recommend that can reduce tedium, effort and cost of data wrangling? Let us know in the comments below.