Great share, Avi bro. I think we can do the same using statistical methods like the Kolmogorov-Smirnov test, which helps us detect whether two distributions are the same or not.
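In case it helps, a minimal sketch of that check with scipy's two-sample KS test; the array names and the 0.05 significance level are just illustrative assumptions, not something from the original post:

```python
# Sketch: compare a reference (training-time) sample of one feature against a
# recent production sample using the two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # feature values at training time
production = rng.normal(loc=0.3, scale=1.0, size=5_000)  # recent feature values (shifted)

statistic, p_value = ks_2samp(reference, production)
if p_value < 0.05:  # illustrative significance level
    print(f"Possible drift: KS statistic={statistic:.3f}, p={p_value:.4f}")
else:
    print(f"No significant drift detected (p={p_value:.4f})")
```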
Interesting technique. I have two questions:
- how do you quantitatively decide which features are high-importance and which are low-importance after training the random forest? Do you use a threshold, or do you cluster the values?
- given a real-time ML system, at what frequency do you use proxy-labeling techniques? Curious to know if you have thought about an architecture for this.
- Yes, mostly threshold-based, but with some element of relative importance too. In a random forest, the importance values sum to 1, so if all features have nearly the same importance, it typically means there is no significant drift yet (see the sketch after these answers for the threshold idea). I don't remember using any cluster-based analysis, to be honest.
- There are periodic review cycles one can adhere to. In my case, we wrote one script which was then handed over to the operations team managing the model. They could run it whenever they wanted; it would fetch the data and run multiple drift checks (including the one I shared in this post), and the results would be made available to the team. The point is that it boils down to how you define your model review cycles. If you have other automated feedback collection mechanisms (like user ratings on recommendations, which you would see on YouTube, for instance), these can also help. Eventually, you would train a new model (or adapt the existing one), and all you are doing is trying to identify when to do that.
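To make the threshold-on-relative-importance idea above concrete, here is a rough sketch. The drift-classifier setup (label reference rows 0 and recent rows 1, train a random forest, flag features whose importance is well above the uniform share of 1/n_features) and the 2x factor are my assumptions for illustration, not the exact script from the post:

```python
# Sketch: flag potentially drifted features via random forest importances.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def drifted_features(reference: pd.DataFrame, recent: pd.DataFrame, factor: float = 2.0) -> pd.Series:
    """Train a random forest to separate reference rows from recent rows and
    return the features whose importance exceeds `factor` times the uniform
    share (importances sum to 1, so the uniform share is 1 / n_features)."""
    X = pd.concat([reference, recent], ignore_index=True)
    y = np.r_[np.zeros(len(reference)), np.ones(len(recent))]

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, y)

    importances = pd.Series(clf.feature_importances_, index=X.columns)
    uniform_share = 1.0 / X.shape[1]
    return importances[importances > factor * uniform_share].sort_values(ascending=False)

# Hypothetical usage, assuming `train_df` and `prod_df` hold the same numeric columns:
# print(drifted_features(train_df, prod_df))
```

A periodic review script like the one described above could run this check alongside per-feature KS tests on each data pull and write the flagged features into the report handed to the operations team.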
P.S. Seen you on Medium, cool stuff :)