Scikit Learn: Scaling of features

The Why

When developing ML models based on multiple features, more often than not, we deal with features having completely different orders of magnitude. For example, in a housing price prediction problem, you may have the land area (in sq. ft.) as one feature, with typical values ranging from 1000 to 10,000, and the distance from the nearest school (in km.) as another feature, with typical values ranging from 1 to 10. A model doesn’t understand units, and it doesn’t understand what the features represent. It only sees the numbers and performs calculations.

Without scaling, features with larger magnitudes often end up with more influence on the model. This matters a great deal for models that compute distances (like KNN), but much less for models that are largely insensitive to the relative magnitudes of the features (like decision trees).

In the example discussed above, without scaling, the land area would influence a KNN-like model much more than the distance from the nearest school. Consider a house with a land area of 5,000 sq. ft. that is 5 km from the nearest school: a 1% change in the land area shifts the value by 50, whereas a 100% change in the distance shifts it by only 5. After scaling, when both features are in the same range (typically 0 to 1, or -1 to 1), there is a level playing field and the model can pick up the trends that actually matter.
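To make this concrete, here is a minimal sketch that measures the Euclidean distance (the default metric in many KNN setups) between two houses before and after scaling. The four houses and their values are made up purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical houses: [land area in sq. ft., distance to nearest school in km]
houses = np.array([
    [1000.0, 1.0],
    [5000.0, 5.0],    # house B
    [5050.0, 10.0],   # house C: 1% more land than B, but twice as far from a school
    [10000.0, 10.0],
])

def area_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the land area."""
    squared_diff = (a - b) ** 2
    return float(squared_diff[0] / squared_diff.sum())

# Unscaled: the 50 sq. ft. difference swamps the 5 km difference
raw_distance = float(np.linalg.norm(houses[1] - houses[2]))
raw_area_share = area_share(houses[1], houses[2])

# Scaled to the 0-1 range: the school distance now drives the comparison
scaled = MinMaxScaler().fit_transform(houses)
scaled_area_share = area_share(scaled[1], scaled[2])

print(f"Raw distance between B and C: {raw_distance:.2f}")
print(f"Share of squared distance due to land area (raw): {raw_area_share:.1%}")
print(f"Share of squared distance due to land area (scaled): {scaled_area_share:.3%}")
```

On the raw data, almost all of the distance between houses B and C comes from the land area, even though the two plots are nearly the same size. After scaling, the doubling of the school distance dominates instead.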

The How

Scikit-learn comes with several scalers in the preprocessing module. Of these, two are very popular: MinMaxScaler() and StandardScaler(). The difference between the two is covered in a later section. The procedure to scale data, using either of the scalers, is as follows:

  1. Import the scaler
  2. Fit the scaler to the data
  3. Transform the data

Let’s see examples with both StandardScaler() and MinMaxScaler().

StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

MinMaxScaler

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

What is data in both the above implementations? Well, data represents the features that you want to scale. Scikit-learn documentation describes this argument as array-like, of shape (n_samples, n_features). Thus, it can be a 2D array, or even a pandas dataframe, or any other array-like construct.
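One practical consequence of the (n_samples, n_features) shape: a single feature stored as a 1D array has to be reshaped into a column before it is passed to a scaler. The sketch below assumes a made-up land_area array.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# A single feature stored as a 1D array of shape (n_samples,)
land_area = np.array([1000.0, 2500.0, 5500.0, 10000.0])

# reshape(-1, 1) turns it into a 2D column of shape (n_samples, 1),
# which is the layout the scalers expect
land_area_2d = land_area.reshape(-1, 1)

scaled = MinMaxScaler().fit_transform(land_area_2d)

print(land_area_2d.shape)
print(scaled.ravel())  # flatten back to 1D for display
```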

As you can see, in both cases we first fit the scaler on the data and then transform the data using the fitted scaler. Both steps can be combined into a single call using the fit_transform method.

An example using MinMaxScaler is shown below:

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(data)

To give you an example, consider data =

array([[1, 1, 1],
       [2, 2, 2],
       [3, 3, 3]])

In each column, the first row has the lowest value, the last row has the highest value, and the values are evenly spaced.

The output of StandardScaler:

array([[-1.22474487, -1.22474487, -1.22474487],
       [ 0.        ,  0.        ,  0.        ],
       [ 1.22474487,  1.22474487,  1.22474487]])

The output of MinMaxScaler:

array([[0. , 0. , 0. ],
       [0.5, 0.5, 0.5],
       [1. , 1. , 1. ]])
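The outputs above can be reproduced with the short script below. It also prints what each scaler learned during fitting, using the fitted attributes mean_, scale_, data_min_ and data_max_.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1, 1, 1],
                 [2, 2, 2],
                 [3, 3, 3]])

standard_scaler = StandardScaler()
standard_scaled = standard_scaler.fit_transform(data)

minmax_scaler = MinMaxScaler()
minmax_scaled = minmax_scaler.fit_transform(data)

print(standard_scaled)
print(minmax_scaled)

# Per-feature statistics learned in the fit step
print(standard_scaler.mean_)    # column means
print(standard_scaler.scale_)   # column standard deviations
print(minmax_scaler.data_min_)  # column minimums
print(minmax_scaler.data_max_)  # column maximums
```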

Pandas DataFrame

The scalers in scikit-learn (StandardScaler, MinMaxScaler, etc.) can be applied directly to a pandas dataframe, provided the columns are numerical. Think of the columns of the pandas dataframe as features. Just like you apply the scaler (fit_transform, or transform) to a feature matrix, you can also apply it to the dataframe.

The syntax is as straightforward as it can be. An example is shown below with MinMaxScaler.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_scaled = scaler.fit_transform(df)

Please note that the output (df_scaled) is not a dataframe, but a NumPy array. You can convert it back to a dataframe, keeping the original column names and index, as follows:

import pandas as pd
df_scaled = pd.DataFrame(df_scaled, columns=df.columns, index=df.index)

If you only wish to scale certain columns of your dataframe, you can construct a shorter dataframe containing only the columns of interest, and pass it through the scaler:

df_short = df[['col1','col2',..,'coln']]
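As a fuller sketch of the same idea, the snippet below scales two columns and writes the result back into a copy of the dataframe, leaving the remaining column untouched. The dataframe and its column names are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dataframe; the column names are made up for illustration
df = pd.DataFrame({
    "land_area": [1000.0, 5500.0, 10000.0],
    "school_distance": [1.0, 5.5, 10.0],
    "city": ["Pune", "Mumbai", "Delhi"],
})

cols_to_scale = ["land_area", "school_distance"]

# Work on a copy so the original dataframe stays unchanged
df_scaled = df.copy()
df_scaled[cols_to_scale] = MinMaxScaler().fit_transform(df[cols_to_scale])

print(df_scaled)
```

The city column passes through as-is, while the two numeric columns end up in the 0 to 1 range.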

If your dataframe contains any column which is not integer or float (like str, datetime, etc.), you will get an error in the scaling. Some example error statements are given below:

  1. Using string column: could not convert string to float: ‘Apple’
  2. Using datetime column: invalid type promotion
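One way to sidestep these errors is to pick out the numeric columns automatically with the dataframe's select_dtypes method and scale only those. The dataframe below is a made-up example, and the exact error text may differ between library versions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical dataframe with one string column and two numeric columns
df = pd.DataFrame({
    "fruit": ["Apple", "Banana", "Cherry"],
    "weight_g": [150.0, 120.0, 8.0],
    "price": [30.0, 10.0, 2.0],
})

# Scaling the whole dataframe fails because of the string column
try:
    StandardScaler().fit_transform(df)
    failed = False
except (ValueError, TypeError) as exc:
    failed = True
    print(f"Scaling the full dataframe failed: {exc}")

# Keep only the numeric columns and scale those
numeric_cols = df.select_dtypes(include="number").columns

df_scaled = df.copy()
df_scaled[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

print(df_scaled)
```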

StandardScaler vs. MinMaxScaler

StandardScaler ‘standardizes’ the features. In other words, it transforms each feature such that the scaled equivalent has mean = 0, and variance = 1. Thus, the formula used to scale data, using StandardScaler, is:

x_scaled = (x - x_mean)/x_std

x_mean is the mean of all values for that feature, and x_std is the standard deviation of all values for that feature. Dividing by the standard deviation (rather than the variance) is what gives the scaled feature a variance of 1. Please note that the mean and standard deviation are different for each feature.

MinMaxScaler proportionately scales down the features to lie in the 0 to 1 range (unless another range is provided). Thus, the formula to scale data, using the MinMaxScaler is:

x_scaled = (x - x_min)/(x_max - x_min)

Here x_min and x_max are the min and max values for that feature.

With this context, the output of both the above examples will make more sense.
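As a sanity check, the sketch below applies both formulas by hand with NumPy and compares the results with what the scalers return. StandardScaler divides by the population standard deviation, which is also NumPy's default for std. The sample data is made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up data: two features with different ranges
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 60.0]])

# StandardScaler by hand: subtract the column mean, divide by the column std
manual_standard = (data - data.mean(axis=0)) / data.std(axis=0)

# MinMaxScaler by hand: shift by the column min, divide by the column range
col_min = data.min(axis=0)
col_max = data.max(axis=0)
manual_minmax = (data - col_min) / (col_max - col_min)

# Both should match the scikit-learn outputs
matches_standard = np.allclose(manual_standard, StandardScaler().fit_transform(data))
matches_minmax = np.allclose(manual_minmax, MinMaxScaler().fit_transform(data))

print(matches_standard, matches_minmax)
```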

As you can see from the formulas, outliers affect both scalers: an extreme value stretches x_max - x_min (and inflates the standard deviation), so the remaining values get squeezed into a narrow part of the scaled range. Therefore, it is a good idea to remove outliers before performing the scaling. Alternatively, you can try RobustScaler, which reduces the effect of outliers. You can read more about it in the scikit-learn documentation.
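The effect is easy to see on a small, made-up feature with one extreme value. The sketch compares how much of the scaled range the five ordinary values occupy under MinMaxScaler and under RobustScaler, which by default centers on the median and divides by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Five ordinary values and one extreme outlier (illustrative data)
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

minmax = MinMaxScaler().fit_transform(values)
robust = RobustScaler().fit_transform(values)

# Spread of the five ordinary values after scaling
minmax_inlier_spread = float(minmax[:5].max() - minmax[:5].min())
robust_inlier_spread = float(robust[:5].max() - robust[:5].min())

print(f"MinMaxScaler spread of ordinary values: {minmax_inlier_spread:.4f}")
print(f"RobustScaler spread of ordinary values: {robust_inlier_spread:.4f}")
```

With MinMaxScaler, the ordinary values are crammed into a tiny sliver near 0 because the outlier claims the value 1. RobustScaler keeps them well separated, at the cost of not bounding the output to a fixed range.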

Both StandardScaler and MinMaxScaler have other arguments that can be defined. You can read more about them here:

StandardScaler

MinMaxScaler
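As a quick illustration of two such arguments, the sketch below uses feature_range on MinMaxScaler to scale to the -1 to 1 range, and with_mean=False on StandardScaler to divide by the standard deviation without centering. The single-column data is made up.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

data = np.array([[1.0], [2.0], [3.0]])

# Scale to the range -1 to 1 instead of the default 0 to 1
minmax_symmetric = MinMaxScaler(feature_range=(-1, 1)).fit_transform(data)

# Skip the mean subtraction; only divide by the standard deviation
scale_only = StandardScaler(with_mean=False).fit_transform(data)

print(minmax_symmetric.ravel())
print(scale_only.ravel())
```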

Why scaler fitting is not done on test data

When you scale the data, you are essentially transforming the data based on its distribution. The formula is, of course, different, depending on whether you use StandardScaler, MinMaxScaler, or any other scaler. But what is important is that the scaling formula depends on the distribution of your dataset.

If you fit the scaler on training and test data, the scaler formula will be influenced by the distribution of the test data. Note that the test data is supposed to be completely unknown to the model. When you include the test data in your scaling operation, you indirectly leak information pertaining to the test data’s distribution in your model, and the test data no longer remains completely unknown.

Therefore, just like model training is done strictly on the training dataset, scaler fitting is also done on the training dataset only. The fitted scaler is then used to transform the test dataset. Thus, for the training dataset, you may see .fit() followed by .transform(), or .fit_transform(), whereas for the test dataset, you will only see .transform().
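A minimal sketch of this workflow is shown below. The synthetic features and labels, the 75/25 split, and the random seeds are assumptions made for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data: two features with very different magnitudes
rng = np.random.default_rng(seed=0)
X = rng.normal(loc=[5000.0, 5.0], scale=[2000.0, 2.0], size=(200, 2))
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scaler = StandardScaler()

# Fit on the training data only, then transform it
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training statistics for the test data: transform only, no fit
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_test_scaled.mean(axis=0))   # close to 0, but generally not exactly 0
```

The test set is scaled with the mean and standard deviation of the training set, so its own mean and variance will usually be slightly off from 0 and 1. That is expected. Wrapping the scaler and the model in a scikit-learn Pipeline is a common way to enforce this pattern automatically, especially during cross-validation.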


