4 Comments

Under which conditions would you choose one sampling method over another?


That's a great question.

And the answer is admittedly quite subjective, but here's how I look at it.

1. Simple random sampling: Easy to implement and computationally cheap, so it can be used in situations where cluster-based sampling is difficult. In fact, this is the default in many of the sampling methods you typically see in sklearn/numpy. It is also useful when you have loads of data in a database and bringing it all to local disk just isn't feasible: just query some random rows. If the population is sufficiently large, a random sample is, in most cases, expected to be quite representative.

2. Stratified sampling: Often better than simple random sampling because it preserves the distribution of the variables you defined your strata on. As far as I can tell while writing this, it can also be implemented easily when your data lives in a database.

3. Cluster sampling (both variants): These depend on the performance of the clustering method you begin with. That aside, cluster sampling (roughly) guarantees less variability within each group and can help surface patterns specific to that group, which can be useful for the overall model. But since we select whole clusters, there's a chance the selected clusters are not representative of the population.
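To make the three options above concrete, here's a minimal sketch in Python using pandas and sklearn. The toy dataset, column names, and parameters (sample sizes, number of clusters) are all hypothetical choices for illustration, not anything from the original discussion:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)

# Hypothetical population: one numeric feature and an imbalanced class label.
population = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "label": rng.choice(["a", "b", "c"], size=1000, p=[0.6, 0.3, 0.1]),
})

# 1. Simple random sampling: every row has an equal chance of selection.
simple = population.sample(n=100, random_state=42)

# 2. Stratified sampling: sample the same fraction from each stratum
#    ("label"), so the sample's label distribution matches the population's.
stratified = (
    population.groupby("label", group_keys=False)
    .apply(lambda g: g.sample(frac=0.1, random_state=42))
)

# 3. Cluster sampling: partition rows into clusters, then keep ALL rows
#    from a few randomly chosen clusters (here, clusters come from k-means).
kmeans = KMeans(n_clusters=10, n_init=10, random_state=42)
population["cluster"] = kmeans.fit_predict(population[["feature"]])
chosen = rng.choice(10, size=3, replace=False)
cluster_sample = population[population["cluster"].isin(chosen)]
```

Note how the cluster sample's representativeness hinges entirely on which clusters get picked, which is exactly the risk described above.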

But again, it is pretty subjective. Personally, in most cases, I have resorted to stratified sampling. Would love to hear your thoughts on this.


Good Post


Thanks, Terry :)
