75 Key Terms That All Data Scientists Remember By Heart
Must-know concepts/terms in data science.
Data science has a vast glossary. This post lists 75 of the most common and important terms that data scientists use almost every day.
Thus, being familiar with them is crucial.
A:
Accuracy: The number of correct predictions divided by the total number of predictions.
Area Under Curve: Metric representing the area under the Receiver Operating Characteristic (ROC) curve, used to evaluate classification models.
ARIMA: Autoregressive Integrated Moving Average, a time series forecasting method.
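To make the first two concrete, here is a minimal sketch of computing accuracy and AUC with scikit-learn (the labels and scores below are made up for illustration):

from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # ground-truth labels (illustrative)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]                    # hard predictions
y_scores = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2, 0.7, 0.3]  # predicted probabilities

print(accuracy_score(y_true, y_pred))   # correct / total predictions
print(roc_auc_score(y_true, y_scores))  # area under the ROC curve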
B:
Bias: The systematic difference between a model's average prediction and the true value.
Bayes Theorem: Probability formula that calculates the likelihood of an event based on prior knowledge.
Binomial Distribution: Probability distribution that models the number of successes in a fixed number of independent Bernoulli trials.
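Bayes Theorem is easiest to appreciate numerically. A small sketch with hypothetical numbers for a diagnostic test:

p_disease = 0.01            # prior: 1% of people have the disease (assumed)
p_pos_given_disease = 0.95  # test sensitivity (assumed)
p_pos_given_healthy = 0.05  # false-positive rate (assumed)

# P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
print(p_pos_given_disease * p_disease / p_pos)  # ~0.16, despite a "95% accurate" test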
C:
Clustering: Grouping data points based on similarities.
Confusion Matrix: Table used to evaluate the performance of a classification model.
Cross-validation: Technique to assess model performance by dividing data into subsets for training and testing.
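A minimal sketch of a confusion matrix and 5-fold cross-validation with scikit-learn, using the built-in iris dataset for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import confusion_matrix

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(confusion_matrix(y_te, model.predict(X_te)))  # rows: true class, columns: predicted class
print(cross_val_score(model, X, y, cv=5))           # accuracy on each of 5 folds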
D:
Decision Trees: Tree-like model used for classification and regression tasks.
Dimensionality Reduction: Process of reducing the number of features in a dataset while preserving important information.
Discriminative Models: Models that learn the boundary between different classes.
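Decision Trees, for instance, take only a few lines with scikit-learn (a sketch on the iris dataset):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(tree.score(X, y))  # accuracy on the training data (optimistic; see Overfitting)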
E:
Ensemble Learning: Technique that combines multiple models to improve predictive performance.
EDA (Exploratory Data Analysis): Process of analyzing and visualizing data to understand its patterns and properties.
Entropy: Measure of uncertainty or randomness in information.
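Entropy, for example, is simple to compute for a discrete distribution. A sketch with example probabilities, done by hand and via SciPy:

import numpy as np
from scipy.stats import entropy

p = np.array([0.5, 0.25, 0.25])  # example probability distribution
print(-(p * np.log2(p)).sum())   # by hand: 1.5 bits
print(entropy(p, base=2))        # same result via SciPy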
F:
Feature Engineering: Process of creating new features from existing data to improve model performance.
F-score: Metric that balances precision and recall (their harmonic mean) in binary classification.
Feature Extraction: Process of automatically extracting meaningful features from data.
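A sketch of the F-score with illustrative labels, computed by hand and via scikit-learn:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # ground-truth labels (illustrative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predictions

p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
print(2 * p * r / (p + r))       # harmonic mean, computed by hand: 0.75
print(f1_score(y_true, y_pred))  # same value from scikit-learn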
G:
Gradient Descent: Optimization algorithm used to minimize a function by adjusting parameters iteratively.
Gaussian Distribution: Normal distribution with a bell-shaped probability density function.
Gradient Boosting: Ensemble learning method that builds multiple weak learners sequentially.
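Gradient descent fits in a few lines. A sketch minimizing f(x) = (x - 3)^2, whose minimum is at x = 3:

def grad(x):
    return 2 * (x - 3)  # derivative of f(x) = (x - 3)^2

x, lr = 0.0, 0.1        # starting point and learning rate
for _ in range(100):
    x -= lr * grad(x)   # step in the direction opposite the gradient
print(x)                # ~3.0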
H:
Hypothesis: Testable statement or assumption in statistical inference.
Hierarchical Clustering: Clustering method that organizes data into a tree-like structure.
Heteroscedasticity: Unequal variance of errors in a regression model.
I:
Information Gain: Measure used in decision trees to determine the importance of a feature.
Independent Variable: Variable that is manipulated in an experiment to observe its effect on the dependent variable.
Imbalance: Situation where the distribution of classes in a dataset is not equal.
J:
Jupyter: Interactive computing environment used for data analysis and machine learning.
Joint Probability: Probability of two or more events occurring together.
Jaccard Index: Measure of similarity between two sets.
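The Jaccard Index is just the size of the intersection over the size of the union. A one-line sketch with toy sets:

a = {"red", "green", "blue"}
b = {"green", "blue", "yellow"}
print(len(a & b) / len(a | b))  # 2 shared / 4 total = 0.5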
K:
Kernel Density Estimation: Non-parametric method to estimate the probability density function of a continuous random variable.
KS Test (Kolmogorov-Smirnov Test): Non-parametric test to compare two probability distributions.
KMeans Clustering: Partitioning data into K clusters based on similarity.
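A minimal KMeans sketch on toy 2-D points, using scikit-learn:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1.5, 2], [1, 0], [8, 8], [9, 9], [8.5, 9.5]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment for each point
print(km.cluster_centers_)  # the two learned centroids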
L:
Likelihood: The probability of the observed data, viewed as a function of the model's parameters.
Linear Regression: Statistical method for modeling the relationship between dependent and independent variables.
L1/L2 Regularization: Techniques to prevent overfitting by adding penalty terms to the model's loss function.
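In scikit-learn, L2 and L1 regularization of linear regression correspond to Ridge and Lasso. A sketch on synthetic data showing how the penalties shrink the coefficients:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = make_regression(n_samples=100, n_features=5, noise=10, random_state=0)
for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=1.0)):
    print(type(model).__name__, model.fit(X, y).coef_.round(2))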
M:
Maximum Likelihood Estimation: Method to estimate the parameters of a statistical model.
Multicollinearity: A situation where two or more independent variables are highly correlated in a regression model.
Mutual Information: Measure of the amount of information shared between two variables.
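Maximum Likelihood Estimation is easiest to see numerically. For a normal distribution with known variance, the MLE of the mean is simply the sample mean; a sketch verifying this on simulated data:

import numpy as np
from scipy.stats import norm

data = np.random.default_rng(0).normal(loc=5, scale=2, size=1000)  # simulated sample
mus = np.linspace(4, 6, 201)                                       # candidate means
log_lik = [norm.logpdf(data, loc=m, scale=2).sum() for m in mus]
print(mus[np.argmax(log_lik)], data.mean())                        # both ~5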
N:
Naive Bayes: Probabilistic classifier based on Bayes Theorem with the assumption of feature independence.
Normalization: Scaling data to a fixed range, typically [0, 1] (min-max scaling).
Null Hypothesis: Hypothesis of no significant difference or effect in statistical testing.
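Note that normalization (min-max scaling) is often conflated with standardization (see S below). A sketch of min-max scaling with scikit-learn:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
print(MinMaxScaler().fit_transform(X).ravel())  # 0, 1/3, 2/3, 1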
O:
Overfitting: When a model performs well on training data but poorly on new, unseen data.
Outliers: Data points that significantly differ from other data points in a dataset.
One-hot encoding: Process of converting categorical variables into binary vectors.
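One-hot encoding takes a single line with pandas. A sketch on a toy categorical column:

import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))  # one binary column per category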
P:
PCA (Principal Component Analysis): Dimensionality reduction technique to transform data into orthogonal components.
Precision: Proportion of true positive predictions among all positive predictions in a classification model.
p-value: Probability of observing a result at least as extreme as the one obtained if the null hypothesis is true.
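A PCA sketch with scikit-learn, projecting the 4-D iris features onto 2 orthogonal components:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (150, 2): reduced from 4 features to 2 components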
Q:
QQ-plot (Quantile-Quantile Plot): Graphical tool to compare two distributions, often a sample against a theoretical distribution.
QR decomposition: Factorization of a matrix into an orthogonal and an upper triangular matrix.
R:
Random Forest: Ensemble learning method using multiple decision trees to make predictions.
Recall: Proportion of true positive predictions among all actual positive instances in a classification model.
ROC Curve (Receiver Operating Characteristic Curve): Graph showing the performance of a binary classifier at different thresholds.
S:
SVM (Support Vector Machine): Supervised machine learning algorithm used for classification and regression.
Standardization: Scaling data to have a mean of 0 and a standard deviation of 1.
Sampling: Process of selecting a subset of data points from a larger dataset.
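Standardization in one sketch with scikit-learn's StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(), X_std.std())  # ~0.0 and 1.0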
T:
t-SNE (t-Distributed Stochastic Neighbor Embedding): Dimensionality reduction technique for visualizing high-dimensional data in lower dimensions.
t-distribution: Probability distribution used in hypothesis testing when the sample size is small.
Type I/II Error: Type I error is a false positive, and Type II error is a false negative in hypothesis testing.
U:
Underfitting: When a model is too simple to capture the underlying patterns in the data.
UMAP (Uniform Manifold Approximation and Projection): Dimensionality reduction technique for visualizing high-dimensional data.
Uniform Distribution: Probability distribution where all outcomes are equally likely.
V:
Variance: Measure of the spread of data points around the mean.
Validation Curve: Graph showing how model performance changes with different hyperparameter values.
Vanishing Gradient: Issue in deep neural networks when gradients become very small during training.
W:
Word embedding: Representation of words as dense vectors in natural language processing.
Word cloud: Visualization of text data where word frequency is represented through the size of the word.
Weights: Parameters that are learned by a machine learning model during training.
X:
XGBoost: Extreme Gradient Boosting, a popular gradient boosting library.
XLNet: A Transformer-based language model built on generalized autoregressive pretraining.
Y:
YOLO (You Only Look Once): Real-time object detection system.
Yellowbrick: Python library for machine learning visualization and diagnostic tools.
Z:
Z-score: Standardized value representing how many standard deviations a data point is from the mean.
Z-test: Statistical test used to compare a sample mean to a known population mean.
Zero-shot learning: Machine learning method where a model can recognize new classes without seeing explicit examples during training.
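Z-scores in a quick sketch with NumPy, on illustrative data:

import numpy as np

data = np.array([10, 12, 14, 16, 18])
z = (data - data.mean()) / data.std()
print(z)  # the first point is ~1.41 standard deviations below the mean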
👉 Over to you: Of course, a lot has been left out here. As an exercise, can you add more terms to this?