Oct 31, 2020
Technically, tree-based models don't require categorical features to be encoded. That said, sklearn's estimators expect numeric input, so we are forced to encode. I typically try 4-6 different encoders to see whether any of them improves performance. I also almost always group categories that appear less than 1% or 2% of the time into a single 'rare' category. This simplifies the encoding and removes some 'noise' from the data, which helps reduce overfitting.
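
As a rough illustration of the 'rare' grouping, here is a minimal sketch using pandas. The helper names (`find_frequent`, `group_rare`), the `min_freq` parameter, and the toy data are my own assumptions for the example, not part of any library. The frequent categories are learned from the training data only and then applied to any other split, so unseen categories also end up as 'rare'.

```python
import pandas as pd


def find_frequent(train, min_freq=0.01):
    """Return the set of categories whose share of `train` is at least min_freq."""
    freqs = train.value_counts(normalize=True)
    return set(freqs[freqs >= min_freq].index)


def group_rare(series, frequent, rare_label="rare"):
    """Replace every category not in `frequent` with rare_label (missing values are kept)."""
    keep = series.isin(frequent) | series.isna()
    return series.where(keep, rare_label)


# Toy training column: 'teal' and 'mauve' each make up 1% of the rows.
train_colors = pd.Series(["red"] * 60 + ["blue"] * 38 + ["teal"] + ["mauve"])

frequent = find_frequent(train_colors, min_freq=0.02)
train_grouped = group_rare(train_colors, frequent)

# A category never seen in training ('green') is also mapped to 'rare'.
test_grouped = group_rare(pd.Series(["red", "teal", "green"]), frequent)

print(train_grouped.value_counts())
print(test_grouped.tolist())
```

The grouped column can then be passed to whichever encoder is being tried. The threshold is something to tune; 1% or 2% is just where I usually start.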