An example of using Scikit-Learn’s pipeline in Python to build machine learning models

One big challenge facing casualty insurance companies is that they tend to be overwhelmed by the number of claims they receive. To streamline the workload, it would be desirable if the claims could be prioritized in some way. In the case of automobile accidents, a claim adjuster would probably like to know the potential loss based on the make and model of the vehicles involved and prioritize cases on that metric.

In this blog, we are going to use a dataset from UCI. The main goal is to illustrate a powerful feature of Python’s Scikit-Learn library called a pipeline, which lets you transform data on the fly while training your machine learning model. You may have noticed that the term pipeline is used extensively in machine learning, and it can mean very different things elsewhere. In this context, we are referring specifically to Pipeline objects in Scikit-Learn.

The data set and pertinent information can be found at the link below and will not be repeated here. Please read the description of the data set and refer to it when necessary.

https://archive.ics.uci.edu/ml/datasets/automobile

The original data file imports-85.data is comma delimited, so column headings were added and the extension was changed to .csv. Rows with missing values in the “normalized-losses” column were removed, and the resulting file was saved as AutoInsuranceClaimNoMissingLoss.csv for training the model.
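
For reference, the sketch below shows one way such a file could be prepared with pandas. The column names follow the UCI attribute order (renamed here with dots to match the headings used later), and the raw file has no header row and marks missing values with '?'. The exact cleaning behind the blog's CSV may have differed, so treat this as an illustration rather than the script that produced it.

# A minimal preparation sketch (column names per the UCI documentation; file names are illustrative)
import pandas as pd

columns = ['symboling', 'normalized.losses', 'make', 'fuel.type', 'aspiration',
           'num.of.doors', 'body.style', 'drive.wheels', 'engine.location',
           'wheel.base', 'length', 'width', 'height', 'curb.weight',
           'engine.type', 'num.of.cylinders', 'engine.size', 'fuel.system',
           'bore', 'stroke', 'compression.ratio', 'horsepower', 'peak.rpm',
           'city.mpg', 'highway.mpg', 'price']

# The raw file has no header row and uses '?' to mark missing values
raw = pd.read_csv('imports-85.data', names=columns, na_values='?')

# Keep only the rows where the label we want to predict is present
raw = raw.dropna(subset=['normalized.losses'])
raw.to_csv('AutoInsuranceClaimNoMissingLoss.csv', index=False)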

#Import modules we'll need for this exercise
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

#Load the data set and preview the first few rows
claim_data = pd.read_csv("c:/doc/AutoInsuranceClaimNoMissingLoss - 3_31_22.csv")
claim_data.head()

Examine the data types in the data frame.

claim_data.dtypes
symboling              int64
normalized.losses      int64
make                  object
fuel.type             object
aspiration            object
num.of.doors          object
body.style            object
drive.wheels          object
engine.location       object
wheel.base           float64
length               float64
width                float64
height               float64
curb.weight            int64
engine.type           object
num.of.cylinders      object
engine.size            int64
fuel.system           object
bore                 float64
stroke               float64
compression.ratio    float64
horsepower             int64
peak.rpm               int64
city.mpg               int64
highway.mpg            int64
price                  int64
dtype: object

Identify the numerical features and categorical features and look at the statistics of the numerical features.

numeric_features = ['symboling','wheel.base','length','width','height','curb.weight','engine.size','bore','stroke','compression.ratio','horsepower','peak.rpm','city.mpg','highway.mpg','price']
categorical_features = ['make','fuel.type','aspiration','num.of.doors','body.style','drive.wheels','engine.location','engine.type','num.of.cylinders','fuel.system']
claim_data[numeric_features + ['normalized.losses']].describe()

Notice that the columns bore and stroke contain 0’s. According to the data set description, bore should have values from 2.54 to 3.94 and stroke from 2.07 to 4.17, so the 0’s represent missing values. We will replace each 0 with the mean of its feature.

# Replace 0's in bore and stroke with the mean of the non-zero values in each column
bore_mean = claim_data[claim_data["bore"] != 0]["bore"].mean()
claim_data['bore'] = np.where(claim_data['bore'].eq(0), bore_mean, claim_data['bore'])
stroke_mean = claim_data[claim_data["stroke"] != 0]["stroke"].mean()
claim_data['stroke'] = np.where(claim_data['stroke'].eq(0), stroke_mean, claim_data['stroke'])
claim_data[numeric_features + ['normalized.losses']].describe()
# Separate features and labels
# After this step we have numpy arrays: X containing the features and y containing the labels

X, y = claim_data[numeric_features + categorical_features ].values, claim_data['normalized.losses'].values

# Split data 70%-30% into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))
Training Set: 114 rows
Test Set: 50 rows

So far, we have loaded the data virtually straight from the source file, with only a little preprocessing.

In practice, it’s common to perform much more preprocessing to make the data easier for the algorithm to fit. There’s a huge range of preprocessing transformations you can perform to get your data ready for modeling; in fact, according to surveys, data scientists spend about 80% of their time organizing data, engineering features and transforming data. We’ll limit ourselves to a few common techniques for this short demo of how pipelines work.

Scaling numeric features

Normalizing numeric features so they’re on the same scale is important. It prevents features with large values from producing coefficients that disproportionately affect the predictions. When all features are on the same scale, it is also easier for many algorithms to learn the relative relationships between them.

There are multiple ways you can scale numeric data, such as calculating the minimum and maximum values for each column and assigning a proportional value between 0 and 1, or by using the mean and standard deviation of a normally distributed variable to maintain the same spread of values on a different scale.

More info can be found in the link below.

https://www.kdnuggets.com/2020/09/feature-engineering-numerical-data.html
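
As a quick illustration of the two approaches mentioned above, the sketch below applies Scikit-Learn's MinMaxScaler and StandardScaler to one of the numeric columns (the column choice is arbitrary, just for demonstration):

# Compare min-max scaling and standardization on a single numeric column
from sklearn.preprocessing import MinMaxScaler, StandardScaler

horsepower = claim_data[['horsepower']]

# Min-max scaling maps each value to a proportional value between 0 and 1
minmax_scaled = MinMaxScaler().fit_transform(horsepower)

# Standardization centers on the mean and divides by the standard deviation
standard_scaled = StandardScaler().fit_transform(horsepower)

print('Min-max scaled range:', minmax_scaled.min(), 'to', minmax_scaled.max())
print('Standardized mean (approximately 0):', standard_scaled.mean().round(3))

The pipeline we build later uses StandardScaler, the second of these approaches.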

Encoding categorical variables

Many machine learning models do not work with text values. Therefore, you generally need to convert categorical features into numeric representations. There are many ways to encode text values as numbers, such as ordinal encoding, which substitutes a unique integer for each category, and one-hot encoding, which creates an individual binary (0 or 1) feature for each possible category value.

You can learn more about it in the following link.

https://www.kdnuggets.com/2019/07/categorical-features-machine-learning.html
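
To make the two encodings concrete, here is a small sketch using Scikit-Learn's OrdinalEncoder and OneHotEncoder on the fuel.type column (chosen arbitrarily for illustration):

# Compare ordinal encoding and one-hot encoding on a single categorical column
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

fuel = claim_data[['fuel.type']]

# Ordinal encoding: each category becomes a unique integer
ordinal_encoded = OrdinalEncoder().fit_transform(fuel)

# One-hot encoding: one binary column per category
onehot_encoded = OneHotEncoder().fit_transform(fuel).toarray()

print('Ordinal:', ordinal_encoded[:5].ravel())
print('One-hot:\n', onehot_encoded[:5])

The pipeline below uses one-hot encoding, since the categorical columns here have no natural order.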

To apply these preprocessing transformations to the insurance claims data, we’ll make use of a Scikit-Learn feature called pipelines. These enable us to define a set of preprocessing steps that end with an algorithm. You can then fit the entire pipeline to the data, so that the model encapsulates all of the preprocessing steps as well as the regression algorithm. This is useful, because when we want to use the model to predict values from new data, we need to apply the same transformations (based on the same statistical distributions and category encodings used with the training data).

# Import the classes needed to build the preprocessing and training pipeline
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
import numpy as np

# Define preprocessing for numeric columns (scale them)
# The indices are column positions in X: the 15 numeric features come first
numeric_features = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Define preprocessing for categorical features (encode them)
# The 10 categorical features follow the numeric ones in X
categorical_features = [15,16,17,18,19,20,21,22,23,24]
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', GradientBoostingRegressor())])


# Fit the pipeline to train a gradient boosting regression model on the training set
model = pipeline.fit(X_train, y_train)
print (model)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                   10, 11, 12, 13, 14]),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  [15, 16, 17, 18, 19, 20, 21,
                                                   22, 23, 24])])),
                ('regressor', GradientBoostingRegressor())])

The model is trained with GradientBoostingRegressor, with the preprocessing steps included in the pipeline. The following code shows how it performs on the test data.

# Get predictions
predictions = model.predict(X_test)

# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Insurance Claim Predictions')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
MSE: 266.7675806643773
RMSE: 16.3330211738177
R2: 0.7351873720799795
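
Because the fitted pipeline encapsulates the scaling and encoding steps, it can score new claims without any preprocessing code being re-implemented. The sketch below is illustrative only: the feature values for the new record are made up, and the record must follow the same column order used to build X (the 15 numeric features followed by the 10 categorical ones).

# Score a new, hypothetical claim record with the fitted pipeline
# (values are made up for illustration; column order must match X)
new_claim = np.array([[2, 94.5, 165.3, 63.8, 54.5, 2017, 92, 2.91, 3.41, 9.6,
                       76, 6000, 30, 34, 6849,
                       'honda', 'gas', 'std', 'four', 'sedan',
                       'fwd', 'front', 'ohc', 'four', '1bbl']], dtype=object)

print('Predicted normalized losses:', model.predict(new_claim)[0])

The same fitted pipeline object is also what you would persist (for example with joblib) to score claims in production, since the scaler and encoder travel with the model.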

The final pipeline is composed of the two transformation pipelines (combined in the preprocessor) and the algorithm used to train the model. To try an alternative algorithm, you can simply swap in a different kind of estimator as the final step. The code example below builds the same pipeline with RandomForestRegressor.

# Use a different estimator in the pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', RandomForestRegressor())])


# Fit the pipeline to train a random forest regression model on the training set
model = pipeline.fit(X_train, y_train)
print (model, "\n")

# Get predictions
predictions = model.predict(X_test)

# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Insurance Claim Predictions - Random Forest')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                   10, 11, 12, 13, 14]),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  [15, 16, 17, 18, 19, 20, 21,
                                                   22, 23, 24])])),
                ('regressor', RandomForestRegressor())]) 

MSE: 228.05054600000003
RMSE: 15.101342523100389
R2: 0.7736206767162103

Now we have seen how to use a pipeline to transform data and train models. The question is: can we also include hyperparameter tuning in the pipeline? The following code shows one way to do just that.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, r2_score

# Use a Gradient Boosting algorithm

alg = GradientBoostingRegressor()

# Try these hyperparameter values
params = {
    'learning_rate': [0.1, 0.3, 0.5, 0.8, 1.0],
    'n_estimators': [50, 75, 100, 125, 150]
}

# Find the best hyperparameter combination to optimize the R2 metric
score = make_scorer(r2_score)
gridsearch = GridSearchCV(alg, params, scoring=score, cv=3, return_train_score=True)
# No need to fit the grid search here; it will be fitted as part of the pipeline below
#gridsearch.fit(X_train, y_train)

# Define preprocessing for numeric columns (scale them)
numeric_features = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

# Define preprocessing for categorical features (encode them)
categorical_features = [15,16,17,18,19,20,21,22,23,24]
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Create preprocessing and training pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('regressor', gridsearch)])


# Fit the pipeline; this runs the grid search over the gradient boosting hyperparameters on the training set
model = pipeline.fit(X_train, y_train)
print (model)
Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                   10, 11, 12, 13, 14]),
                                                 ('cat',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore'))]),
                                                  [15, 16, 17, 18, 19, 20, 21,
                                                   22, 23, 24])])),
                ('regressor',
                 GridSearchCV(cv=3, estimator=GradientBoostingRegressor(),
                              param_grid={'learning_rate': [0.1, 0.3, 0.5, 0.8,
                                                            1.0],
                                          'n_estimators': [50, 75, 100, 125,
                                                           150]},
                              return_train_score=True,
                              scoring=make_scorer(r2_score)))])

# Get predictions
predictions = model.predict(X_test)

# Display metrics
mse = mean_squared_error(y_test, predictions)
print("MSE:", mse)
rmse = np.sqrt(mse)
print("RMSE:", rmse)
r2 = r2_score(y_test, predictions)
print("R2:", r2)

# Plot predicted vs actual
plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')
plt.ylabel('Predicted Labels')
plt.title('Insurance Claim Predictions')
z = np.polyfit(y_test, predictions, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test), color='magenta')
plt.show()
MSE: 292.23372823226
RMSE: 17.094845077749607
R2: 0.709907848070147
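
Note that the grid search above is nested inside the pipeline as the regressor step. A common alternative, sketched below using the same objects already defined above, is to wrap the entire pipeline in GridSearchCV and address the hyperparameters with the step-name prefix, so the preprocessing is refit within each cross-validation fold:

# Alternative: wrap the whole pipeline in GridSearchCV instead of nesting
# the grid search inside the pipeline; hyperparameters are addressed with
# the step-name prefix <step>__<parameter>
full_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                ('regressor', GradientBoostingRegressor())])

param_grid = {
    'regressor__learning_rate': [0.1, 0.3, 0.5, 0.8, 1.0],
    'regressor__n_estimators': [50, 75, 100, 125, 150]
}

search = GridSearchCV(full_pipeline, param_grid, scoring=make_scorer(r2_score),
                      cv=3, return_train_score=True)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Test R2:', r2_score(y_test, search.predict(X_test)))

Either arrangement keeps the preprocessing and the tuned model together in a single object that can be used for prediction.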

Summary

That concludes the introduction to pipelines in Scikit-Learn. We have shown code examples of how to build pipelines to transform data, train models and tune hyperparameters.

You can download the notebook and the data set from the links below.

https://www.quadbase.com/upload/Predict_Insurance_Claim.ipynb

https://www.quadbase.com/upload/AutoInsuranceClaimNoMissingLoss_3_31_22.csv