Give Your AI the Keys to the Web
If you’re training AI on pre-built datasets, you’re already behind.
This is because the web is evolving every second, and your AI needs to evolve with it.
Bright Data lets you access, search, collect, and browse the web at an unlimited scale for AI.
With Bright Data, you can:
Deploy AI agents that search, browse, and extract data at scale.
Feed your models real-time, structured web data.
Integrate live data streams into LLM training.
And many more use cases.
AI built on stale data is guessing. AI built on live data is leading.
Check out every AI use case Bright Data solves here →
Thanks to Bright Data for partnering today!
Run-time complexity of ML algos
This visual depicts the run-time complexity of the 10 most popular ML algorithms.
Why care?
Everyone is a big fan of sklearn implementations.
It takes just two (max three) lines of code to run any ML algorithm with sklearn.
However, this simplicity often leads people to overlook how an algorithm actually works and the data-specific conditions under which it can be used.
For instance, you cannot use SVM or t-SNE on a big dataset:
SVM’s run-time grows cubically with the total number of samples.
t-SNE’s run-time grows quadratically with the total number of samples.
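To get a feel for why quadratic growth rules out big datasets, here is a minimal sketch (the sample sizes are illustrative) counting the pairwise computations a method like t-SNE performs over all sample pairs:

```python
# Sketch: quadratic growth in pairwise methods like t-SNE.
# With n samples, computing all pairwise similarities touches n*(n-1)/2 pairs.

def pairwise_ops(n: int) -> int:
    """Number of unique sample pairs among n samples."""
    return n * (n - 1) // 2

for n in (1_000, 10_000, 100_000):
    print(f"n = {n:>7,}: {pairwise_ops(n):>15,} pairwise computations")
```

Notice that every 10x increase in samples means roughly 100x more work, which is exactly why these methods become impractical at scale.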
Another benefit of knowing the run-time is that it forces us to understand how an algorithm works end-to-end.
That said, we made a few assumptions in the above table:
In a random forest, all decision trees may have different depths. We have assumed them to be equal.
During inference in kNN, we first compute the distance to all data points. This gives a list of n distances (n = total samples). Then, we find the k smallest distances in this list.
The run-time of this selection step depends on the implementation:
Sorting the list and selecting the first k values takes O(n log n).
Using a bounded priority queue (heap) of size k brings this down to O(n log k).
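The two selection strategies can be sketched with Python's standard library (the distances and k are made up for illustration; `heapq.nsmallest` maintains a bounded heap internally):

```python
import heapq
import random

random.seed(42)
n, k = 100_000, 5
distances = [random.random() for _ in range(n)]  # distances to all n samples

# Option 1 - O(n log n): sort the full list, then take the first k values.
k_smallest_sorted = sorted(distances)[:k]

# Option 2 - O(n log k): scan the list once while keeping only
# the k smallest values seen so far in a bounded heap.
k_smallest_heap = heapq.nsmallest(k, distances)

# Both strategies select the same k values.
assert k_smallest_sorted == k_smallest_heap
print(k_smallest_heap)
```

For small k and large n, the heap-based approach avoids paying the full sorting cost on values we will never use.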
In t-SNE, there’s a learning step. However, the major run-time comes from computing the pairwise similarities in the high-dimensional space. You can learn how t-SNE works here: tSNE article.
Today, as an exercise, I would encourage you to derive these run-time complexities yourself.
This activity will give you confidence in algorithmic understanding.
👉 Over to you: Can you tell the inference run-time of KMeans Clustering?
Thanks for reading!
P.S. For those wanting to develop “Industry ML” expertise:
At the end of the day, all businesses care about impact. That’s it!
Can you reduce costs?
Drive revenue?
Can you scale ML models?
Predict trends before they happen?
We have covered several other topics (with implementations) that align with these business goals.
Here are some of them:
Learn sophisticated graph architectures and how to train them on graph data: A Crash Course on Graph Neural Networks – Part 1.
So many real-world NLP systems rely on pairwise context scoring. Learn scalable approaches here: Bi-encoders and Cross-encoders for Sentence Pair Similarity Scoring – Part 1.
Learn techniques to run large models on small devices: Quantization: Optimize ML Models to Run Them on Tiny Hardware.
Learn how to generate prediction intervals or sets with strong statistical guarantees for increasing trust: Conformal Predictions: Build Confidence in Your ML Model’s Predictions.
Learn how to identify causal relationships and answer business questions: A Crash Course on Causality – Part 1.
Learn how to scale ML model training: A Practical Guide to Scaling ML Model Training.
Learn techniques to reliably roll out new models in production: 5 Must-Know Ways to Test ML Models in Production (Implementation Included).
Learn how to build privacy-first ML systems: Federated Learning: A Critical Step Towards Privacy-Preserving Machine Learning.
Learn how to compress ML models and reduce costs: Model Compression: A Critical Step Towards Efficient Machine Learning.
All these resources will help you cultivate key skills that businesses and companies care about the most.