75 Key Terms That Data Scientists Remember by Heart

Must-know concepts/terms in data science.

Jan 01, 2024

Data science has a pretty diverse glossary.

Once I prepared the following glossary sheet, which lists the 75 most common and important terms data scientists use frequently in their day-to-day work. Thus, being aware of them is extremely crucial.

Wherever possible, I have linked my reference resources that explain these terms in detail.

How many terms do you know?

Let’s discuss them in brief one by one:

A:
- Accuracy: Measure of the correct predictions divided by the total predictions.
- Area Under Curve: Metric representing the area under the Receiver Operating Characteristic (ROC) curve, used to evaluate classification models.
- ARIMA: Autoregressive Integrated Moving Average, a time series forecasting method.
B:
- Bias: The difference between the true value and the predicted value in a statistical model.
- Bayes Theorem: Probability formula that calculates the likelihood of an event based on prior knowledge.
- Binomial Distribution: Probability distribution that models the number of successes in a fixed number of independent Bernoulli trials.
C:
- Clustering: Grouping data points based on similarities. Read the newsletter issue on 6 types of clustering algorithms here: Beyond KMeans: 6 Must-Know Types of Clustering Algorithms in Machine Learning.
- Confusion Matrix: Table used to evaluate the performance of a classification model.
- Cross-validation: Technique to assess model performance by dividing data into subsets for training and testing. Read about 5 must-know cross-validation techniques here.
D:
- Decision Trees: A tree-like model used for classification and regression tasks.
- Dimensionality Reduction: Process of reducing the number of features in a dataset while preserving important information.
- Discriminative Models: Models that learn the boundary between different classes. Read about discriminative models in this issue and how they differ from generative models in this issue.
E:
- Ensemble Learning: Technique that combines multiple models to improve predictive performance.
- EDA (Exploratory Data Analysis): Process of analyzing and visualizing data to understand its patterns and properties. Learn about 8 automated EDA tools in this issue.
- Entropy: Measure of uncertainty or randomness in information.
F:
- Feature Engineering: Process of creating new features from existing data to improve model performance.
- F-score: Metric that balances precision and recall for binary classification.
- Feature Extraction: Process of automatically extracting meaningful features from data.
G:
- Gradient Descent: Optimization algorithm used to minimize a function by adjusting parameters iteratively.
- Gaussian Distribution: Normal distribution with a bell-shaped probability density function.
- Gradient Boosting: Ensemble learning method that builds multiple weak learners sequentially. Adaboost is a popular gradient-boosting algorithm. Learn how it works in this newsletter issue.
H:
- Hypothesis: Testable statement or assumption in statistical inference.
- Hierarchical Clustering: Clustering method that organizes data into a tree-like structure.
- Heteroscedasticity: Unequal variance of errors in a regression model.
I:
- Information Gain: Measure used in decision trees to determine the importance of a feature.
- Independent Variable: Variable that is manipulated in an experiment to observe its effect on the dependent variable.
- Imbalance: Situation where the distribution of classes in a dataset is not equal.
J:
- Jupyter: Interactive computing environment used for data analysis and machine learning.
- Joint Probability: Probability of two or more events occurring together. Here’s a visual guide to Joint, Marginal, and Conditional Probabilities.
- Jaccard Index: Measure of similarity between two sets.
K:
- Kernel Density Estimation: Non-parametric method to estimate the probability density function of a continuous random variable.
- KS Test (Kolmogorov-Smirnov Test): Non-parametric test to compare two probability distributions. Read about it in this newsletter issue.
- KMeans Clustering: Partitioning data into K clusters based on similarity.
L:
- Likelihood: Chance of observing the data given a specific model. Understand the difference between likelihood and probability in this newsletter issue.
- Linear Regression: Statistical method for modeling the relationship between dependent and independent variables.
- L1/L2 Regularization: Techniques to prevent overfitting by adding penalty terms to the model’s loss function. Ever wondered where did the regularization term originated from? Read this article to understand everything about it: The Probabilistic Origin of Regularization.
M:
- Maximum Likelihood Estimation: Method to estimate the parameters of a statistical model. Here’s an intuitive explanation to maximum likelihood estimation (MLE) in machine learning.
- Multicollinearity: A situation where two or more independent variables are highly correlated in a regression model.
- Mutual Information: Measure of the amount of information shared between two variables.
N:
- Naive Bayes: Probabilistic classifier based on Bayes Theorem with the assumption of feature independence.
- Normalization: Scaling data to have a mean of 0 and std-dev of 1. But there’s a common misconception about scaling in general; read about it here: Are You Misinterpreting the Purpose of Feature Scaling and Standardization?
- Null Hypothesis: Hypothesis of no significant difference or effect in statistical testing.
O:
- Overfitting: When a model performs well on training data but poorly on new, unseen data.
- Outliers: Data points that significantly differ from other data points in a dataset.
- One-hot encoding: Process of converting categorical variables into binary vectors. Here’s an issue on the most overlooked problem with one-hot encoding.
P:
- PCA (Principal Component Analysis): Dimensionality reduction technique to transform data into orthogonal components. I have a full deep dive on PCA, which you can read here: Formulating the Principal Component Analysis (PCA) Algorithm From Scratch.
- Precision: Proportion of true positive predictions among all positive predictions in a classification model. If you struggle to interpret precision and recall, this issue will solve it for you forever.
- p-value: Probability of observing a result at least as extreme as the one obtained if the null hypothesis is true.
Q:
- QQ-plot (Quantile-Quantile Plot): Graphical tool to compare the distribution of two datasets. Here’s a visual and intuitive guide to the QQ plot that will teach you how this plot is created and how to interpret it.
- QR decomposition: Factorization of a matrix into an orthogonal and an upper triangular matrix.
R:
- Random Forest: Ensemble learning method that uses multiple decision trees to make predictions with the help of bagging. Understand why bagging is so ridiculously effective at variance reduction here.
- Recall: Proportion of true positive predictions among all actual positive instances in a classification model.
- ROC Curve (Receiver Operating Characteristic Curve): Graph showing the performance of a binary classifier at different thresholds.
S:
- SVM (Support Vector Machine): Supervised machine learning algorithm used for classification and regression.
- Standardisation: Scaling data to have a mean of 0 and a standard deviation of 1.
- Sampling: Process of selecting a subset of data points from a larger dataset.
T:
- t-SNE (t-Distributed Stochastic Neighbor Embedding): Dimensionality reduction technique for visualizing high-dimensional data in lower dimensions. I have a full deep dive on t-SNE, which you can read here: Formulating and Implementing the t-SNE Algorithm From Scratch.
- t-distribution: Probability distribution used in hypothesis testing when the sample size is small. The above t-SNE guide will also clear what t-distribution is.
- Type I/II Error: Type I error is a false positive, and Type II error is a false negative in hypothesis testing.
U:
- Underfitting: When a model is too simple to capture the underlying patterns in the data.
- UMAP (Uniform Manifold Approximation and Projection): Dimensionality reduction technique for visualizing high-dimensional data.
- Uniform Distribution: Probability distribution where all outcomes are equally likely.
V:
- Variance: Measure of the spread of data points around the mean.
- Validation Curve: Graph showing how model performance changes with different hyperparameter values.
- Vanishing Gradient: Issue in deep neural networks when gradients become very small during training.
W:
- Word embedding: Representation of words as dense vectors in natural language processing. If you want to learn about the history of embeddings, you should not miss this insightful issue: A Pivotal Moment in NLP Research Which Made Static Embeddings (Almost) Obsolete.
- Word cloud: Visualization of text data where word frequency is represented through the size of the word.
- Weights: Parameters that are learned by a machine learning model during training.
X:
- XGBoost: Extreme Gradient Boosting, a popular gradient boosting library.
- XLNet: Generalized Autoregressive Pretraining of Transformers, a language model.
Y:
- YOLO (You Only Look Once): Real-time object detection system.
- Yellowbrick: Python library for machine learning visualization and diagnostic tools.
Z:
- Z-score: Standardized value representing how many standard deviations a data point is from the mean.
- Z-test: Statistical test used to compare a sample mean to a known population mean.
- Zero-shot learning: Machine learning method where a model can recognize new classes without seeing explicit examples during training.

👉 Over to you: Of course, a lot has been left out here. As an exercise, can you add more terms to this?

👉 If you liked this post, don’t forget to leave a like ❤️. It helps more people discover this newsletter on Substack and tells me that you appreciate reading these daily insights.