CRISP-DM: Data Preparation
I recently came across the CRoss Industry Standard Process for Data Mining (CRISP-DM) process model while searching for material on data wrangling. It is an industry methodology that organizes the planning of a data mining project into six phases.
This framework has been a useful guide for some of my previous data projects. In this post, we’ll talk specifically about the data preparation phase.
Data preparation is an essential step that produces complete and relevant data before the modeling phase. The data may come from in-house or client databases, public/open-source repositories, websites, etc. While preparing the data, it is also crucial to understand it, because the insights you gain are likely to help with further analysis.
This phase involves data processing tasks such as:
1. Handling missing values
There are several ways to deal with missing data. One is to replace the missing values with substitutes, a process called imputation in statistics. You can replace the missing values in a column with the mean, median, or mode of its non-missing values, or with a constant. The mean and median apply only to numeric columns, while the mode or a constant also works for categorical ones. Another way is to inspect the rows with missing values and check whether they are significant; if not, you can drop them.
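The two approaches above can be sketched in plain Python. This is only an illustration under assumptions: missing values are represented as `None`, and the helper names `impute` and `drop_incomplete`, as well as the sample data, are hypothetical rather than part of any library.

```python
from statistics import mean, median, mode

def impute(values, strategy="mean", fill_value=None):
    """Return a copy of `values` with None entries replaced.

    strategy: "mean" or "median" (numeric columns only),
              "mode" (numeric or categorical columns),
              "constant" (uses fill_value).
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        replacement = mean(observed)
    elif strategy == "median":
        replacement = median(observed)
    elif strategy == "mode":
        replacement = mode(observed)
    elif strategy == "constant":
        replacement = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [replacement if v is None else v for v in values]

def drop_incomplete(rows):
    """Keep only the rows (dicts) in which no field is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]

# Hypothetical sample columns
ages = [25, None, 31, 40, None]
cities = ["Oslo", None, "Oslo", "Lima"]

ages_filled = impute(ages, "median")      # numeric column
cities_filled = impute(cities, "mode")    # categorical column

rows = [{"age": 25, "city": "Oslo"}, {"age": None, "city": "Lima"}]
complete_rows = drop_incomplete(rows)     # second row is dropped
```

Whether to impute or drop depends on how many rows are affected and how much they matter to the analysis, so the choice of strategy is left as a parameter here.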
2. Deduplication
Deduplication means eliminating redundant records from your dataset. Removing duplicated records can reduce storage requirements, and it keeps the same record from being counted more than once in later analysis.
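As a minimal sketch of deduplication, assume each record is a dict with hashable values. The function name `deduplicate`, the optional `key_fields` parameter, and the sample records are hypothetical and chosen only for illustration.

```python
def deduplicate(rows, key_fields=None):
    """Drop repeated rows, keeping the first occurrence and the original order.

    If key_fields is given, two rows count as duplicates when they agree
    on those fields; otherwise all fields are compared.
    """
    seen = set()
    unique = []
    for row in rows:
        fields = key_fields if key_fields is not None else sorted(row)
        key = tuple((f, row.get(f)) for f in fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

# Hypothetical sample records
records = [
    {"id": 1, "email": "a@x.com"},
    {"id": 2, "email": "b@x.com"},
    {"id": 1, "email": "a@x.com"},   # exact duplicate of the first row
    {"id": 3, "email": "a@x.com"},   # same email, different id
]

exact_unique = deduplicate(records)              # compares every field
email_unique = deduplicate(records, ["email"])   # compares email only
```

Comparing on a subset of fields is useful when the same entity appears with different surrogate ids, but it also risks dropping records that are genuinely distinct, so the key should be chosen with the data in mind.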
3. Label Encoding & One-hot Encoding
Most machine learning models require numeric input and output variables. If your dataset contains categorical variables, you must encode them into a numeric, machine-readable format before model training. Label encoding maps each category to an integer, while one-hot encoding creates one binary (0/1) column per category.
Example: One-hot encoding

╔══════╗        ╔═══╦═══╦═══╦═══╗
║ Type ║        ║ A ║ B ║ C ║ D ║
╠══════╣        ╠═══╬═══╬═══╬═══╣
║ A    ║        ║ 1 ║ 0 ║ 0 ║ 0 ║
║ B    ║  ===>  ║ 0 ║ 1 ║ 0 ║ 0 ║
║ C    ║        ║ 0 ║ 0 ║ 1 ║ 0 ║
║ D    ║        ║ 0 ║ 0 ║ 0 ║ 1 ║
╚══════╝        ╚═══╩═══╩═══╩═══╝
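The Type column from the example can be encoded both ways with a few lines of plain Python. This is a sketch, not a library API: the function names `label_encode` and `one_hot_encode` are hypothetical, and assigning codes in sorted category order is an assumption made here for reproducibility.

```python
def label_encode(values):
    """Map each distinct category to an integer, in sorted category order."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    return [index[v] for v in values], categories

def one_hot_encode(values):
    """Return one 0/1 column per category, plus the column names."""
    categories = sorted(set(values))
    matrix = [[1 if v == cat else 0 for cat in categories] for v in values]
    return matrix, categories

types = ["A", "B", "C", "D"]

codes, label_categories = label_encode(types)
matrix, columns = one_hot_encode(types)

print(codes)       # [0, 1, 2, 3]
for row in matrix:
    print(row)     # [1, 0, 0, 0], then [0, 1, 0, 0], and so on
```

Label encoding is compact but implies an order between the categories (0 < 1 < 2 < 3), which a model may treat as meaningful. One-hot encoding avoids that implied order at the cost of one extra column per category.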
4. Data Cleansing & Reformatting
Aside from missing values (NaN or nulls), your dataset might contain outliers that could skew analysis results, stray whitespace or special characters that obscure values, values recorded in different measurement units, inconsistent time formats, etc. Cleansing and reformatting can take a substantial amount of time, and it should be completed before you combine or merge the datasets for analysis.
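A few of the cleansing steps above (whitespace, mixed units, mixed date formats) might look like this in plain Python. Everything specific here is an assumption for illustration: the helper names, the list of accepted date formats (with `05/03/2023` read as day first), and kilograms as the target unit.

```python
from datetime import datetime

# Assumed input formats for this sketch; extend the tuple for your own data.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

# Conversion factors to kilograms (the target unit is an arbitrary choice).
UNIT_TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}

def clean_text(text):
    """Trim both ends and collapse inner runs of whitespace to one space."""
    return " ".join(text.split())

def to_iso_date(text):
    """Try each known format and return the date as YYYY-MM-DD."""
    cleaned = clean_text(text)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date format: {text!r}")

def to_kilograms(text):
    """Convert a '<number> <unit>' string such as '1500 g' to kilograms."""
    value, unit = clean_text(text).lower().split(" ")
    return float(value) * UNIT_TO_KG[unit]

print(clean_text("  New   York \n"))
print(to_iso_date(" 05/03/2023 "))
print(to_kilograms(" 1500  g"))
```

Raising an error on an unknown date format or unit, instead of silently passing the value through, makes it easier to spot inputs the pipeline does not yet handle.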
There are more data preparation tasks to consider, such as dropping redundant columns, feature scaling (normalization or standardization), and identifying outliers.
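For the scaling and outlier tasks just mentioned, here is a small sketch using only the standard library. The function names are hypothetical, the z-score rule is just one of several common ways to flag outliers, and the threshold of 2.5 used in the example is an arbitrary assumption.

```python
from statistics import mean, pstdev

def min_max_scale(values):
    """Normalization: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant column, nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: subtract the mean, divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)             # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def flag_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    z_scores = standardize(values)
    return [v for v, z in zip(values, z_scores) if abs(z) > threshold]

heights = [10, 20, 30]
normalized = min_max_scale(heights)    # [0.0, 0.5, 1.0]
standardized = standardize(heights)    # centered on 0

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
suspects = flag_outliers(readings, threshold=2.5)
```

Note that a single extreme value also inflates the standard deviation, so in small samples a z-score can never get very large; that is why the example uses a lower threshold than the default.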
To construct an efficient data preparation pipeline, we must know the data and the customers’ requirements well, so that all important aspects are covered in this data preparation phase.