Scikit learn: f1-weighted vs. f1-micro vs. f1-macro

If you look at the f1_score function in sklearn.metrics, you will see an ‘average’ argument, which defaults to ‘binary’. However, ‘binary’ only works for binary classification; for multi-class problems you must choose one of ‘micro’, ‘macro’, or ‘weighted’ (you can also pass None, in which case you get an array of f1_scores, one per label, instead of a single value).

When you set average = ‘macro’, you calculate the f1_score of each label and compute a simple average of these f1_scores to arrive at the final number.

By setting average = ‘weighted’, you calculate the f1_score for each label, and then compute a weighted average (weights being proportional to the number of items belonging to that label in the actual data).

When you set average = ‘micro’, the f1_score is computed globally: true positives, false negatives, and false positives are counted over all labels together, and precision and recall are derived from these global counts.


This can be understood with an example. Consider:

y_true = [0,0,0,1,1,1,2,2,2,2]
y_pred = [1,0,0,1,1,0,2,2,1,2]

Now, let’s first compute the f1_scores for the individual labels:

from sklearn.metrics import f1_score
f1_score(y_true, y_pred, average = None)
>> array([0.66666667, 0.57142857, 0.85714286])
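To see where these per-label numbers come from, here is a quick manual check for label 0 (the other labels work the same way):

```python
# Manually compute the F1 score for label 0 by counting TPs, FPs, and FNs
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 0, 0, 1, 1, 0, 2, 2, 1, 2]

label = 0
tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))  # 2
fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))  # 1
fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))  # 1

precision = tp / (tp + fp)  # 2/3
recall = tp / (tp + fn)     # 2/3
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.6666..., matching the first entry of the array above
```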

Now, the macro score, a simple average of the above numbers, should be 0.698.

f1_score(y_true, y_pred, average = 'macro')
>> 0.6984126984126985
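You can confirm that the macro score is just the unweighted mean of the per-label scores:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 0, 0, 1, 1, 0, 2, 2, 1, 2]

# Mean of the per-label F1 scores reproduces average='macro'
per_label = f1_score(y_true, y_pred, average=None)
print(np.mean(per_label))  # 0.6984126984126985
```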

The weighted average uses weights proportional to the number of items of each label in the actual data. So, it should equal (0.6667*3 + 0.5714*3 + 0.8571*4)/10 ≈ 0.714.

f1_score(y_true, y_pred, average = 'weighted')
>> 0.7142857142857142
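Equivalently, you can take a weighted mean of the per-label scores yourself; np.bincount gives the per-label counts [3, 3, 4] in y_true:

```python
import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 0, 0, 1, 1, 0, 2, 2, 1, 2]

per_label = f1_score(y_true, y_pred, average=None)
support = np.bincount(y_true)  # [3, 3, 4] items per label in y_true
# Weighted mean with per-label supports reproduces average='weighted'
print(np.average(per_label, weights=support))  # 0.7142857142857142
```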

For the micro average, let’s first calculate the global recall. Of the 10 labels in y_true, 7 are correctly predicted in y_pred, bringing the recall to 0.7. Next, the global precision: of the 10 predictions in y_pred, 7 are correct, bringing the precision to 0.7. Thus, the micro f1_score is 2*0.7*0.7/(0.7+0.7) = 0.7.

f1_score(y_true, y_pred, average = 'micro')
>> 0.7
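This can be reproduced from the global counts by hand:

```python
# Manual micro-average check from global TP/FP/FN counts
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 0, 0, 1, 1, 0, 2, 2, 1, 2]

tp = sum(t == p for t, p in zip(y_true, y_pred))  # 7 correct predictions
fp = fn = len(y_true) - tp                        # each of the 3 errors is both a FP and a FN
precision = tp / (tp + fp)                        # 0.7
recall = tp / (tp + fn)                           # 0.7
print(2 * precision * recall / (precision + recall))  # 0.7
```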

You can try this with any other y_true and y_pred arrays: the global precision and global recall always come out the same. This is because, in single-label multi-class classification, every misclassified item counts as exactly one false positive (for the predicted label) and one false negative (for the true label), so the global false positive and false negative totals are equal. Calculating the micro f1_score is therefore equivalent to calculating the global precision or the global recall.
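As a consequence, in this single-label multi-class setting the micro f1_score also coincides with plain accuracy, which is easy to verify:

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [1, 0, 0, 1, 1, 0, 2, 2, 1, 2]

# Micro-averaged F1 equals accuracy when each item has exactly one label
print(f1_score(y_true, y_pred, average='micro'))  # 0.7
print(accuracy_score(y_true, y_pred))             # 0.7
```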

If you are interested in data science, visualization, and machine learning using Python, you may find this course by Jose Portilla on Udemy to be very helpful. It has been the foundation course in Python for me and several of my colleagues.
