Summary
- Science and Data
- Key Terms and Areas
- Data Storage
- Statistics and Data Analysis
- Causal Inference
- Feature Engineering
- Machine Learning
- Time Series
- Generative AI
- Miscellaneous
Science and Data
What is the difference between Art and Science?
Science is knowledge which we understand so well that we can teach it to a computer.
Computer Programming as an Art; Donald Knuth (1974)
What is the difference between Data, Information and Knowledge?
Data are different symbols and characters whose meaning only becomes clear when they connect with context. Collecting and measuring observations generates data. Usually machines send, receive and process data. The confusion between data and information often arises because information is made out of data. Data reaches a more complex level and becomes information by integrating it into a context. Information provides expertise about facts or persons. Knowledge thus describes the collected information that is available about a particular fact or a person. The knowledge of this situation makes it possible to make informed decisions and solve problems. Thus, knowledge influences the thinking and actions of people. Machines can also make decisions based on new knowledge generated by information. In order to gain knowledge, it is necessary to process information.
What is the difference between data, information and knowledge?; Sebastian Pierper (2017)
What is Data Science?
Multi-disciplinary field that brings together concepts from computer science, statistics/machine learning, and data analysis to understand and extract insights from the ever-increasing amounts of data. Two paradigms of data research: 1. Hypothesis-Driven (given a problem, what kind of data do we need to help solve it?); 2. Data-Driven (given some data, what interesting problems can be solved with it?). The heart of data science is to always ask questions. Always be curious about the world: 1. What can we learn from this data?; 2. What actions can we take once we find whatever it is we are looking for?
Data Science Cheatsheet; Maverick Lin (2018)
Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid.
What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.
What Is Data Science?; Mike Loukides (2010)
Data Science is a mix of traditional data analysis techniques with advanced algorithms for handling considerable amounts of data. It has likewise opened the way to finding new sorts of data.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
What does a Data Scientist do?
A data scientist is somebody who decodes large amounts of data and extracts meaning from it to support an association or organization in improving its activities. They utilize various tools, methodologies, statistics, systems, algorithms, etc. to examine data further. […] The job of the data scientist fundamentally is to search and read the data, process and represent it, and bring out the sense of the data for practical use. [Besides that,] in order to check the present status of an organization, or where it stands, a Business [Intelligence] Analyst utilizes data, searches for patterns, business trends and relationships, and comes up with a visualization and report. […] A Machine Learning Engineer works with various algorithms related to machine learning, like clustering, decision trees, classification, random forests, etc.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
How is Data Science applied to Business?
Data science gives financial firms an immense chance to reinvent the business. In finance, the uses of data science include automating risk management, predictive analytics, managing client data, fraud detection, real-time analytics, algorithmic trading and consumer analytics. […] Data Science helps in understanding different patterns and furthermore helps in making choices concerning promotion and advertising, so the products can reach the clients and, in the long run, increase the income of the organization.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
Key Terms and Areas
What is Artificial Intelligence?
The study of making computers do things that humans need intelligence to do. […] Classes of problems requiring intelligence include inference based on knowledge, reasoning with uncertain or incomplete information, various forms of perception and learning, and applications to problems such as control, prediction, classification and optimization.
Fundamentals of the New Artificial Intelligence; Toshinori Munakata (2008)
Artificial Intelligence involves using methods based on the intelligent behavior of humans and other animals to solve complex problems.
Artificial Intelligence: Illuminated; Ben Coppin (2004)
What is Machine Learning?
Machine learning can be understood as an application of AI. (It) was born from pattern recognition and the theory that computers can learn without being programmed to perform specific tasks. This includes techniques such as Bayesian methods; neural networks; inductive logic programming; explanation-based learning; natural language processing; decision trees; and reinforcement learning. […] Systems that have hard-coded knowledge bases will typically experience difficulties in new environments. Certain difficulties can be overcome by a system that can acquire its own knowledge. This capability is known as machine learning.
Machine Learning and AI for Healthcare; Arjun Panesar (2019)
What is Data Analysis?
Data analysis is an art [apart from Data Science]. It is not something yet that we can teach to a computer. Data analysts have many tools at their disposal, from linear regression to classification trees and even deep learning, and these tools have all been carefully taught to computers. But ultimately, a data analyst must find a way to assemble all of the tools and apply them to data to answer a relevant question—a question of interest to people. […] While a study includes developing and executing a plan for collecting data, a data analysis presumes the data have already been collected. More specifically, a study includes the development of a hypothesis or question, the designing of the data collection process (or study protocol), the collection of the data, and the analysis and interpretation of the data. Because a data analysis presumes that the data have already been collected, it includes development and refinement of a question and the process of analyzing and interpreting the data.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is Inference?
Inference is one of many possible goals in data analysis. […] In general, the goal of inference is to be able to make a statement about something that is not observed, and ideally to be able to characterize any uncertainty you have about that statement. Inference is difficult because of the difference between what you are able to observe and what you ultimately want to know. […] The language of inference can change depending on the application, but most commonly, we refer to the things we cannot observe (but want to know about) as the population or as features of the population and the data that we observe as the sample. The goal is to use the sample to somehow make a statement about the population.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is Knowledge Discovery from Data (KDD)?
It is an iterative sequence of the following steps: 1. Data cleaning (to remove noise and inconsistent data); 2. Data integration (where multiple data sources may be combined); 3. Data selection (where data relevant to the analysis task are retrieved from the database); 4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations); 5. Data mining (an essential process where intelligent methods are applied to extract data patterns); 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures); 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users).
Data Mining; Jiawei Han, Micheline Kamber and Jian Pei (2012)
What is Feature Engineering?
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to strengthen the performance of machine learning models. Feature engineering can be considered as applied machine learning itself.
The “Generic” Data Science Life-Cycle; Sivakar Siva (2020)
Outlier detection, one-hot encoding and handling missing data are a few basic examples of Feature Engineering.
What is feature engineering; CodeBasics (2020)
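A minimal sketch of those three basic techniques with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with a missing value and an extreme income
df = pd.DataFrame({
    "income": [42_000, 55_000, None, 61_000, 1_000_000],
    "city": ["Lisbon", "Porto", "Lisbon", None, "Faro"],
})

# Handling missing data: impute the median for numeric, a constant for categorical
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna("unknown")

# Outlier detection: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_outlier"] = ~df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# One-hot encoding: turn the categorical column into binary indicator columns
df = pd.get_dummies(df, columns=["city"], prefix="city")
print(df)
```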
What is a Statistical Model?
In a very general sense, a model is something we construct to help us understand the real world. […] The process of building a model involves imposing a specific structure on the data and creating a summary of the data. […] A statistical model serves two key purposes in a data analysis, which are to provide a quantitative summary of your data and to impose a specific structure on the population from which the data were sampled. […] At its core, a statistical model provides a description of how the world works and how the data were generated. The model is essentially an expectation of the relationships between various factors in the real world and in your dataset. What makes a model a statistical model is that it allows for some randomness in generating the data.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
Statistical model, which is a formal representation of the relationships between variables, that we can use to provide the desired explanations or predictions.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is a Statistic?
The first key element of a statistical model is data reduction. The basic idea is you want to take the original set of numbers consisting of your dataset and transform them into a smaller set of numbers. […] The process of data reduction typically ends up with a statistic. Generally speaking, a statistic is any summary of the data. The sample mean, or average, is a statistic. So is the median, the standard deviation, the maximum, the minimum, and the range. Some statistics are more or less useful than others but they are all summaries of the data.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is Big Data?
Big data is when the size of the data itself becomes part of the problem. We’re discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.
What Is Data Science?; Mike Loukides (2010)
That is a considerable amount of data, so much data that it became challenging to deal with using conventional technologies. For this reason, we called it Big Data.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
What is Data Warehouse?
A “data warehouse” may be a basic operational reporting environment built from a single transactional system or it may be a cutting-edge solution uniting transactional, machine and social data to support deep and complex analysis in real time. It may provide information for daily (or monthly, or quarterly) reports or it may feed complex analysis into live business processes several times a second. […] Vincent Rainardi outlines five basic requirements that a data warehouse typically must meet: 1. An integrated view of the organization’s data for strategic analysis; 2. A consistent view of the organization’s data resources with data that has been cleared of anomalies which can lead to a false impression of the business’ function; 3. A consolidation of the organization’s data history beyond what is retained by current operations for deep analysis of the business’ functions over time; 4. A tested and verified environment for doing data analysis to access data so that each new draw of data doesn’t become a “science experiment” in and of itself; 5. A high-performance environment for doing data analysis that does not interfere with day-to-day activity of the business.
The Modern Data Warehouse: A New Approach for a New Era; Tom Traubitz (2018)
What is ETL?
The step where the data is pulled, processed and loaded into a data warehouse is generally done through an ETL pipeline. ETL stands for Extract, Transform and Load. ETL is a three-step process: (1) Extracting data from single or multiple data sources. (2) Transforming data as per business logic. Transformation is in itself a two-step process: data cleansing and data manipulation. (3) Loading the previously transformed data into the target data source or data warehouse.
The “Generic” Data Science Life-Cycle; Sivakar Siva (2020)
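A toy sketch of the three steps with pandas and SQLite; the file, column and table names are invented for illustration:

```python
import sqlite3
import pandas as pd

# Extract: read from one or more data sources (here, a hypothetical CSV export)
raw = pd.read_csv("sales_export.csv")

# Transform: cleanse (drop duplicates, fix types) and manipulate (aggregate per business logic)
raw = raw.drop_duplicates()
raw["order_date"] = pd.to_datetime(raw["order_date"])
daily = raw.groupby(raw["order_date"].dt.date)["amount"].sum().reset_index(name="revenue")

# Load: write the transformed data into the target table of the warehouse
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_revenue", conn, if_exists="replace", index=False)
```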
What is Bioinformatics?
Medical Science: In Bioinformatics, Data Science alongside genome data is helping scientists and specialists to examine genetic structures and see how specific medications act on diseases.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
Bioinformaticians are concerned with deriving biological understanding from large amounts of data with specialized skills and tools. Early in biology’s history, the datasets were small and manageable. […] However, this is all rapidly changing. Large sequencing datasets are widespread, and will only become more common in the future. Analyzing this data takes different tools, new skills, and many computers with large amounts of memory, processing power, and disk space.
Bioinformatics Data Skills; Vince Buffalo (2015)
Data Storage
What is Data Serialization?
The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later. There are many, many data serialization formats. When considering a format to work with, you might want to consider different characteristics such as human readability, access patterns, and whether it’s based on text or binary, which influences the size of its files. Some examples are JSON, CSV and Parquet.
Designing Machine Learning Systems; Chip Huyen (2021)
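For instance, the same small pandas DataFrame can be written to the three formats mentioned; JSON and CSV are text-based and human-readable, while Parquet is binary and column-oriented, so it is typically smaller and faster to scan (writing it assumes an engine such as pyarrow is installed):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.91, 0.34, 0.57]})

df.to_csv("scores.csv", index=False)         # text, row-oriented, human-readable
df.to_json("scores.json", orient="records")  # text, supports nested structures
df.to_parquet("scores.parquet")              # binary, column-oriented, compressed

same = pd.read_parquet("scores.parquet")     # reconstruct the object later
```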
How do we Store Data for Analytics?
Relational databases are designed for consistency, to support complex transactions that can easily be rolled back if any one of a complex set of operations fails. While rock-solid consistency is crucial to many applications, it’s not really necessary for the kind of analysis we’re discussing here. […] Precision has an allure, but in most data-driven applications outside of finance, that allure is deceptive. Most data analysis is comparative […]. To store huge datasets effectively, we’ve seen a new breed of databases appear. These are frequently called NoSQL databases, or Non-Relational databases […]. They group together fundamentally dissimilar products by telling you what they aren’t. Many of these databases are […] designed to be distributed across many nodes, to provide “eventual consistency” but not absolute consistency, and to have very flexible schema.
What Is Data Science?; Mike Loukides (2010)
What are the Challenges of the modern Data Warehouse?
Today your business faces an unprecedented set of challenges. Bigger data volumes. New data types. A deluge of machine data from the Internet of Things. Digital business models that require real-time performance all the time drive the need for zero-latency reporting. Data-driven businesses need more complex, more extensive, and yet paradoxically faster and more easily accessed analytics. To succeed, you need a deeper understanding of the Why behind what your customers, your competitors, and the market as a whole are up to. These are the challenges of the modern Data Warehouse. And to meet these challenges, you need something more than just a database.
The Modern Data Warehouse: A New Approach for a New Era; Tom Traubitz (2018)
Statistics and Data Analysis
What are the different types of data?
Nominal, Ordinal, Interval and Ratio.
There are two basic types of structured data: numeric and categorical. Numeric data comes in two forms: continuous, such as wind speed or time duration, and discrete, such as the count of the occurrence of an event. Categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.). Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false. Another useful type of categorical data is ordinal data in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is a Population?
A population can be thought of as a physical group of individuals, but also as the provider of the probability distribution for a random observation drawn from that population. Populations can be summarized through parameters that mirror the statistical synthesis of the sample data.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What are the types of Population?
There are three types of populations from which a sample can be drawn: (i) Literal population: This is an identifiable group, such as when we randomly select a person to conduct a survey. Or it can be a group of individuals that can be measured, and, although we do not randomly pick one of them, we have volunteer data; (ii) Virtual population: We usually take measurements using some instrument, for example by measuring someone’s blood pressure or air pollution. We know it is always possible to take new measurements and obtain slightly different responses. The closeness of multiple readings depends on the accuracy of the instrument and the stability of the circumstances — we could think of this as extracting observations from a virtual population of all the measurements that could be taken if we had enough time; (iii) Metaphoric population: When there is no larger population. This is an unusual concept. Here we act as if the data were randomly drawn from some population, when it is clear that this is not the case—for example, the children who undergo heart surgery: we did no sampling, we have all the data, and there is nothing more to collect. Think of the number of murders that occur each year, the exam results for a specific class, or the data on all the countries in the world—none of these can be considered a sample from a real population.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is Random Sampling?
Random sampling is a process in which each available member of the population being sampled has an equal chance of being chosen for the sample at each draw. The sample that results is called a simple random sample. Sampling can be done with replacement, in which observations are put back in the population after each draw for possible future reselection. Or it can be done without replacement, in which case observations, once selected, are unavailable for future draws. Data quality often matters more than data quantity when making an estimate or a model based on a sample. Data quality in data science involves completeness, consistency of format, cleanliness, and accuracy of individual data points. Statistics adds the notion of representativeness.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
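A small sketch of simple random sampling with and without replacement using numpy:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(1, 101)   # a toy population of 100 members

# Without replacement: each member can be drawn at most once
sample_no_repl = rng.choice(population, size=10, replace=False)

# With replacement: members are put back and may be selected again
sample_repl = rng.choice(population, size=10, replace=True)

print(sample_no_repl)
print(sample_repl)
```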
What are Percentiles and Quantiles?
To avoid the sensitivity to outliers, we can look at the range of the data after dropping values from each end. Formally, these types of estimates are based on differences between percentiles. In a data set, the Pth percentile is a value such that at least P percent of the values take on this value or less and at least (100 – P) percent of the values take on this value or more. For example, to find the 80th percentile, sort the data. Then, starting with the smallest value, proceed 80 percent of the way to the largest value. Note that the median is the same thing as the 50th percentile. The percentile is essentially the same as a quantile, with quantiles indexed by fractions (so the .8 quantile is the same as the 80th percentile).
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Interquartile Range (IQR)?
A common measurement of variability is the difference between the 25th percentile and the 75th percentile, called the interquartile range (or IQR). Here is a simple example: {3,1,5,3,6,7,2,9}. We sort these to get {1,2,3,3,5,6,7,9}. The 25th percentile is at 2.5, and the 75th percentile is at 6.5, so the interquartile range is 6.5 – 2.5 = 4.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
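The worked example can be reproduced with numpy; percentile definitions vary by interpolation rule, and `method="midpoint"` matches the hand calculation above (older numpy versions call this argument `interpolation`):

```python
import numpy as np

x = np.array([3, 1, 5, 3, 6, 7, 2, 9])

q25 = np.percentile(x, 25, method="midpoint")   # 2.5
q75 = np.percentile(x, 75, method="midpoint")   # 6.5
print(q75 - q25)                                # interquartile range: 4.0
```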
What are the Estimates of Location?
Mean, Weighted Mean, Median, Percentile, Weighted Median and Trimmed Mean (the average of all values after dropping a fixed number of extreme values).
The basic metric for location is the mean, but it can be sensitive to extreme values (outlier). Other metrics (median, trimmed mean) are less sensitive to outliers and unusual distributions and hence are more robust.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
Mean: The sum of the numbers divided by the number of occurrences; Median: The middle value when the numbers are arranged in order; Mode: The number that appears most often.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
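These estimates are one-liners in Python; in the sketch below the trimming fraction and the weights are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 3, 5, 6, 7, 95])    # 95 is an outlier

print(np.mean(x))                # 15.25, pulled up by the outlier
print(np.median(x))              # 4.0, robust to the outlier
print(stats.trim_mean(x, 0.2))   # trimmed mean: here drops one value from each end
print(np.average(x, weights=[1, 1, 1, 1, 1, 1, 1, 0.1]))   # weighted mean, downweighting 95
```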
What are the Estimates of Variability?
Deviations (the difference between the observed values and the estimate of location), Variance, Standard Deviation (the square root of the variance), Mean Absolute Deviation, Median Absolute Deviation (the median of the absolute values of deviations from the median), Range, Percentile and Interquartile Range.
Variance and standard deviation are the most widespread and routinely reported statistics of variability. Both are sensitive to outliers. More robust metrics include mean absolute deviation, median absolute deviation from the median, and percentiles (quantiles).
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
Range, Interquartile Range and Standard Deviation.
The standard deviation is a widely used measure of dispersion. It is the most complex from a technical point of view and appropriate only for well-behaved symmetric data, since it is also unduly influenced by very discrepant values. The Gini index is a measure of dispersion used for highly skewed data, such as income, and is widely used to measure inequality, but it has a complex and not very intuitive form. The square of the standard deviation is known as variance: difficult to interpret directly, but nevertheless of great mathematical utility.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
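A sketch of the main variability estimates on the same toy data as above; the median absolute deviation comes from scipy:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 3, 5, 6, 7, 95])

print(np.var(x, ddof=1))                  # sample variance (n - 1 in the denominator)
print(np.std(x, ddof=1))                  # standard deviation, inflated by the outlier
print(np.mean(np.abs(x - np.mean(x))))    # mean absolute deviation
print(stats.median_abs_deviation(x))      # median absolute deviation, robust
print(np.ptp(x))                          # range (max - min)
```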
What are the Key Terms for Exploring the Distribution?
Boxplot, Frequency Table, Histogram and Density Plot (a smoothed version of the histogram, often based on a kernel density estimate).
A frequency histogram plots frequency counts on the y-axis and variable values on the x-axis; it gives a sense of the distribution of the data at a glance. A frequency table is a tabular version of the frequency counts found in a histogram. A boxplot - with the top and bottom of the box at the 75th and 25th percentiles, respectively - also gives a quick sense of the distribution of the data; it is often used in side-by-side displays to compare distributions. A density plot is a smoothed version of a histogram; it requires a function to estimate a plot based on the data (multiple estimates are possible, of course).
Both frequency tables and percentiles summarize the data by creating bins. In general, quartiles and deciles will have the same count in each bin (equal-count bins), but the bin sizes will be different. The frequency table, by contrast, will have different counts in the bins (equal-size bins), and the bin sizes will be the same.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
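The equal-size versus equal-count distinction is easy to see with pandas: `pd.cut` builds equal-width bins with unequal counts (a frequency table), while `pd.qcut` builds quantile bins with roughly equal counts but unequal widths. A sketch on simulated skewed data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10, size=1_000))

print(pd.cut(x, bins=5).value_counts().sort_index())   # equal-width bins, unequal counts
print(pd.qcut(x, q=5).value_counts().sort_index())     # quintile bins, roughly equal counts
```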
What are the Normal and Gamma Distributions?
[Normal model] says that the randomness in a set of data can be explained by the Normal distribution, or a bell-shaped curve. The Normal distribution is fully specified by two parameters — the mean and the standard deviation. [The Gamma distribution] has the feature that it only allows positive values, so it eliminates the problem we had with negative values with the Normal distribution.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
The Normal distribution is also referred to as Gaussian distribution.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is the t-Distribution?
The t-distribution is actually a family of distributions resembling the normal distribution but with thicker tails. The t-distribution is widely used as a reference basis for the distribution of sample means, differences between two sample means, regression parameters, and more.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is z-Score?
The result of standardizing an individual data point. Standardize means subtracting the mean and dividing by the standard deviation.
A standard normal distribution is one in which the units on the x-axis are expressed in terms of standard deviations away from the mean. To compare data to a standard normal distribution, you subtract the mean and then divide by the standard deviation; this is also called normalization or standardization. The transformed value is termed a z-score, and the normal distribution is sometimes called the z-distribution.
Converting data to z-scores (i.e., standardizing or normalizing the data) does not make the data normally distributed. It just puts the data on the same scale as the standard normal distribution, often for comparison purposes.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
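Standardizing is just subtracting the mean and dividing by the standard deviation; a minimal numpy sketch (scikit-learn's StandardScaler does the same thing column-wise):

```python
import numpy as np

x = np.array([12.0, 15.0, 9.0, 21.0, 18.0])
z = (x - x.mean()) / x.std(ddof=1)    # z-scores: distance from the mean in standard deviations

print(z)
print(z.mean(), z.std(ddof=1))        # approximately 0 and exactly 1 by construction
```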
What is a QQ-Plot?
A QQ-Plot is used to visually determine how close a sample is to a specified distribution - in this case, the normal distribution. The QQ-Plot orders the z-scores from low to high and plots each value’s z-score on the y-axis; the x-axis is the corresponding quantile of a normal distribution for the value’s rank. Since the data is normalized, the units correspond to the number of standard deviations away from the mean. If the points roughly fall on the diagonal line, then the sample distribution can be considered close to normal.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
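A minimal QQ-plot can be drawn with scipy's `probplot`, which plots sample quantiles against the corresponding quantiles of a normal distribution:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0, scale=1, size=200)

stats.probplot(sample, dist="norm", plot=plt)   # points near the diagonal suggest normality
plt.show()
```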
What is the Pearson Correlation Coefficient?
It is convenient to use a single number to summarize a consistent relationship of increase or decrease between pairs of numbers shown in a scatter plot. The number usually chosen for this is the Pearson correlation coefficient. A Pearson correlation lies in the interval between –1 and 1 and expresses how close the points are to a straight line. A correlation of 1 occurs if all points lie on an ascending straight line, while a correlation of –1 is observed when all points lie on a descending straight line. A correlation close to 0 may have to do with a random scatter of points, or any other pattern in which there is no systematic tendency upward or downward.
An alternative measure is Spearman’s rank correlation, which depends only on the ordering of the data, and not on their specific values. Thus, the coefficient can be close to 1 or –1 if the points are near a line that rises or falls consistently, even if it is not a straight line.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
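Both coefficients are available in scipy; Spearman's coefficient is the Pearson correlation of the ranks, so it stays high for monotonic but nonlinear relationships (the data below are simulated to show exactly that):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=200)
y = np.exp(x) + rng.normal(scale=5, size=200)   # increasing but strongly nonlinear

pearson_r, _ = stats.pearsonr(x, y)     # penalized by the curvature
spearman_r, _ = stats.spearmanr(x, y)   # close to 1: the relationship is monotonic
print(pearson_r, spearman_r)
```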
What are the Key Terms for Correlation?
Correlation Coefficient (measures the extent to which numeric variables are associated with one another), Correlation Matrix and Scatterplot.
Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y. If high values of X go with low values of Y, and vice versa, the variables are negatively correlated.
Like the mean and standard deviation, the correlation coefficient is sensitive to outliers in the data. Software packages offer robust alternatives to the classical correlation coefficient. For example, the R package robust uses the function covRob to compute a robust estimate of correlation. The methods in the scikit-learn module sklearn.covariance implement a variety of approaches.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What are the Key Terms for Exploring Two or More Variables?
Scatterplot, Contingency Table, Hexagonal Binning, Contour Plot and Violin Plot.
For plotting numeric vs numeric data, scatterplots are fine when there is a relatively small number of data values. For data sets with hundreds of thousands or millions of records, a scatterplot will be too dense, so we need a different way to visualize the relationship. Heat maps, hexagonal binning, and contour plots all give a visual representation of a two-dimensional density. In this way, they are natural analogs to histograms and density plots.
A useful way to summarize two categorical variables is a contingency table - a table of counts by category.
Boxplots are a simple way to visually compare the distributions of a numeric variable grouped according to a categorical variable. A violin plot, introduced by [Hintze-Nelson-1998], is an enhancement to the boxplot and plots the density estimate with the density on the y-axis. The density is mirrored and flipped over, and the resulting shape is filled in, creating an image resembling a violin. The advantage of a violin plot is that it can show nuances in the distribution that aren’t perceptible in a boxplot. On the other hand, the boxplot more clearly shows the outliers in the data.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
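For two categorical variables, a contingency table is one line with pandas (the columns here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "browser": ["Chrome", "Firefox", "Chrome", "Safari", "Chrome", "Firefox"],
    "converted": ["yes", "no", "no", "yes", "yes", "no"],
})

print(pd.crosstab(df["browser"], df["converted"]))   # counts for each category pair
```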
What are the Key Terms for Exploring Binary and Categorical Data?
Mode, Expected Value (when the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence), Bar Charts and Pie Charts.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is the difference between the Expected Value and the Mean?
Expected value ($\mathbb{E}[X]$) is a theoretical quantity of a random variable $X$:
$\mathbb{E}[X]=\sum_x x\,P(X=x)\quad\text{or}\quad \int x\,f_X(x)\,dx$
It’s the distribution’s long-run average — i.e., the population mean — when the first moment exists.
Mean can mean two different things: (1) Population mean — the same as $\mathbb{E}[X]$ (if finite); or (2) Sample mean $\bar X=\tfrac{1}{n}\sum_{i=1}^n X_i$ — a statistic computed from data that estimates $\mathbb{E}[X]$. It’s unbiased ($\mathbb{E}[\bar X]=\mathbb{E}[X]$) and, by the Law of Large Numbers, $\bar X \to \mathbb{E}[X]$ as $n$ grows.
When they differ / pitfalls:
- Some distributions (e.g., Cauchy) have no finite expected value; talking about a population “mean” is not meaningful, though you can still compute a sample average (it won’t stabilize).
- People sometimes say “mean” when they mean median or other averages (geometric, harmonic); those are not $\mathbb{E}[X]$.
- In general, $\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$ for nonlinear $g$.
What is the Law of Large Numbers?
The Law of Large Numbers states that if you sample a random variable independently a large number of times, the measured average should converge to the random variable’s true expectation (mean). This is important in studying the longer-term behavior of random variables over time. As an example, a coin might land on heads 5 times in a row, but over a much larger n we would expect the proportion of heads to be approximately half of the total flips. Similarly, a casino might experience a loss on any individual game, but over the long run should see a predictable profit over time.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
The variability of the observed proportion decreases as the sample size increases — this is the law of large numbers.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
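The coin-flip example can be simulated directly; the running proportion of heads drifts toward 0.5 as the number of flips grows (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
flips = rng.integers(0, 2, size=100_000)            # 0 = tails, 1 = heads
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)

print(running_mean[[9, 99, 999, 9_999, 99_999]])    # proportion after 10, 100, ... flips
```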
What is the Central Limit Theorem?
The Central Limit Theorem (CLT) states that if you repeatedly sample a random variable a large number of times, the distribution of the sample mean will approach a normal distribution regardless of the initial distribution of the random variable. The CLT provides the basis for much of hypothesis testing. At a very basic level, you can consider the implications of this theorem on coin flipping: the probability of getting some number of heads flipped over a large n should be approximately that of a normal distribution. Whenever you’re asked to reason about any particular distribution over a large sample size, you should remember to think of the CLT, regardless of whether it is Binomial, Poisson, or any other distribution.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
[…] But it is not only the binomial distribution that tends toward a normal curve as the sample size increases — it is a remarkable fact that, whatever the shape of the population distribution from which each of the original measurements is drawn, for large sample sizes their mean can be regarded as having been drawn from a normal curve. This will have a mean equal to the mean of the original distribution, and a standard deviation in a simple relation to the standard deviation of the original population distribution.
The central limit theorem implies that sample means and other statistical summaries will have an approximately normal distribution, for large samples.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
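A quick simulation: even though individual draws come from a strongly skewed exponential distribution, the means of repeated samples are approximately normal (sample size and number of repetitions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
# 10,000 samples of size 50 from a skewed (exponential) population with mean 1
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

print(sample_means.mean())   # close to the population mean, 1.0
print(sample_means.std())    # close to 1.0 / sqrt(50), about 0.14
```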
What is Standard Error?
The standard error is a single metric that sums up the variability in the sampling distribution for a statistic. The standard error can be estimated using a statistic based on the standard deviation s of the sample values, and the sample size n. As the sample size increases, the standard error decreases. The relationship between standard error and sample size is sometimes referred to as the square root of n rule: to reduce the standard error by a factor of 2, the sample size must be increased by a factor of 4.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
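A sketch of the usual estimate $s/\sqrt{n}$ and the square root of n rule, using a simulated population:

```python
import numpy as np

rng = np.random.default_rng(5)
population = rng.normal(loc=100, scale=15, size=1_000_000)

for n in (100, 400):    # quadrupling n should roughly halve the standard error
    sample = rng.choice(population, size=n, replace=False)
    se = sample.std(ddof=1) / np.sqrt(n)
    print(n, round(se, 2))
```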
What is Alpha (significance level)?
Statisticians frown on the practice of leaving it to the researcher’s discretion to determine whether a result is “too unusual” to happen by chance. Rather, a threshold is specified in advance, as in “more extreme than 5% of the chance (null hypothesis) results”; this threshold is known as alpha. Typical alpha levels are 5% and 1%. Any chosen level is an arbitrary decision—there is nothing about the process that will guarantee correct decisions x% of the time. This is because the probability question being answered is not “What is the probability that this happened by chance?” but rather “Given a chance model, what is the probability of a result this extreme?” We then deduce backward about the appropriateness of the chance model, but that judgment does not carry a probability. This point has been the subject of much confusion.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What are Confidence Intervals?
Confidence intervals are a way to place uncertainty around our estimates. The smaller the sample size, the larger the standard error, and the wider the confidence interval.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
Confidence intervals are generally used in statistics to give a range of values within which we are confident that a parameter lies, and they help us understand the behavior of a certain dataset. A confidence interval is a range of values that a parameter is expected to fall within. […] It is typically expressed as the number of standard errors that the statistic lies above and below the estimate; the standard error is, in effect, the standard deviation of the statistic itself (for the sample mean, the sample standard deviation divided by the square root of the sample size). Conventions vary by which data scientist you talk to, but generally, a 95% confidence interval means that you are 95% sure that the parameter falls in that range.
Confidence intervals are a part of Data Science; Rijul Singh Malik (2022)
Confidence intervals always come with a coverage level, expressed as a (high) percentage, say 90% or 95%. One way to think of a 90% confidence interval is as follows: it is the interval that encloses the central 90% of the bootstrap sampling distribution of a sample statistic. More generally, an x% confidence interval around a sample estimate should, on average, contain similar sample estimates x% of the time (when a similar sampling procedure is followed). […] The percentage associated with the confidence interval is termed the level of confidence. The higher the level of confidence, the wider the interval. Also, the smaller the sample, the wider the interval (i.e., the greater the uncertainty). Both make sense: the more confident you want to be, and the less data you have, the wider you must make the confidence interval to be sufficiently assured of capturing the true value.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
A confidence interval is the range of population parameters for which our observed statistic is a plausible consequence. A simple practical rule is that if you are estimating the percentage of people who prefer, say, coffee instead of tea, from a random sample of the population, then your margin of error is roughly plus or minus 100 divided by the square root of the sample size. Thus, for a survey with 1,000 people (the industry standard), the margin of error is generally mentioned as ±3%. So if 400 said they prefer coffee and 600 said they prefer tea, then it is possible to estimate roughly that the percentage preferring coffee in the population is about 40% ± 3%, or between 37% and 43%. A 95% confidence interval is the result of a procedure that, if anchored in correct assumptions, contains the true value of the parameter 95% of the time. One cannot say that a specific interval has a 95% probability of containing the true value, but only that the procedure yields such intervals with that frequency.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
This is a range of values constructed so that, over repeated sampling, it would contain the parameter value of interest $100(1 - \alpha)\%$ of the time. For instance, a 95% confidence interval would contain the true value 95% of the time. If 0 is included in the confidence interval (for a difference or effect estimate), then we cannot reject the null hypothesis (and vice versa).
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
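A sketch of a textbook 95% confidence interval for a mean using the t-distribution (the data are simulated; scipy's `t.interval` supplies the critical value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sample = rng.normal(loc=50, scale=10, size=40)

mean = sample.mean()
se = stats.sem(sample)    # standard error of the mean: s / sqrt(n)
ci = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=se)
print(mean, ci)           # roughly the mean plus or minus 2 standard errors
```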
What is the relationship between Standard Error and Variables’ Variance?
The standard error is inversely proportional to the variance of the variable. This means that, if the variable doesn’t change much, it will be hard to estimate its effect on the outcome. This also makes intuitive sense. Take it to the extreme and pretend you want to estimate the effect of a drug, so you conduct a test with 10000 individuals but only 1 of them gets the treatment. This will make finding the ATE very hard; we will have to rely on comparing a single individual with everyone else. Another way to say this is that we need lots of variability in the treatment to make it easier to find its impact.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the Null Hypothesis?
Hypothesis tests use the following logic: “Given the human tendency to react to unusual but random behavior and interpret it as something meaningful and real, in our experiments we will require proof that the difference between groups is more extreme than what chance might reasonably produce.” This involves a baseline assumption that the treatments are equivalent, and any difference between the groups is due to chance. This baseline assumption is termed the null hypothesis. Our hope, then, is that we can in fact prove the null hypothesis wrong and show that the outcomes for groups A and B are more different than what chance might produce.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
The null hypothesis is what we are willing to assume happens until proven otherwise. It is relentlessly negative, denying all progress and change. The null hypothesis is never proved or established, but it can be refuted in the course of experimentation. One can say that every experiment exists only to give the facts a chance to refute the null hypothesis. A defendant can be considered guilty, but no one is ever considered innocent, there is simply no proof of guilt. In the same way, we may reject the null hypothesis, but if we do not have sufficient evidence to do so, this does not mean we can accept it as true. It is only a working premise until something better appears.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
The process of testing whether or not a sample of data supports a particular hypothesis is called hypothesis testing. Generally, hypotheses concern particular properties of interest for a given population (such as its parameters), like, for example, the mean conversion rate among a set of users. The steps in testing a hypothesis are as follows:
- State a null hypothesis and an alternative hypothesis: Either the null hypothesis will be rejected (in favor of the alternative hypothesis), or it will fail to be rejected (although failing to reject the null hypothesis does not necessarily mean it is true, but rather that there is not sufficient evidence to reject it).
- Use a particular test statistic of the null hypothesis to calculate the corresponding p-value.
- Compare the p-value to a certain significance level ($\alpha$).
Since the null hypothesis typically represents a baseline (e.g., “the marketing campaign did not increase conversion rates,” etc.), the goal is to reject the null hypothesis with statistical significance and show that there’s a significant outcome.
Hypothesis tests are either one-tailed or two-tailed tests. A one-tailed test has the following types of null and alternative hypotheses:
- $H_0 : \mu = \mu_0$ versus $H_1 : \mu < \mu_0$, or
- $H_0 : \mu = \mu_0$ versus $H_1 : \mu > \mu_0$
whereas a two-tailed test has these types:
- $H_0 : \mu = \mu_0$ versus $H_1 : \mu \neq \mu_0$
where $\mu$ is the parameter of interest and $\mu_0$ is its hypothesized value under the null hypothesis $H_0$ (with $H_1$ denoting the alternative hypothesis).
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
What is p-Value?
Like with confidence intervals (and most frequentist statistics, as a matter of fact), the true definition of p-values can be very confusing. So, to not take any risks, I’ll copy the definition from Wikipedia: “the p-value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct”. To put it more succinctly, the p-value is the probability of seeing such data, given that the null hypothesis is true. It measures how unlikely it is that you are seeing a measurement if the null hypothesis is true. Naturally, this often gets confused with the probability of the null hypothesis being true.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed results.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
A p-value is the probability of obtaining a result at least as extreme as the one we obtained, if the null hypothesis (and all other modeling assumptions) were really true.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
Put simply, a p-value is the probability of observing the value of the calculated test statistic under the null hypothesis assumptions. Usually, the p-value is assessed relative to some predetermined level of significance (0.05 is often chosen).
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
What is Statistical Significance?
The idea of statistical significance is straightforward: if a p-value is sufficiently small, then we say the results are statistically significant.
To perform a statistical significance test, follow these steps: (1) Define a question in terms of a null hypothesis we want to test; (2) Generate a sampling distribution of this test statistic, where the null hypothesis is true; (3) Verify whether the observed statistic lies in one of the tails of this distribution and summarize this observation through a p-value: the probability, if the null hypothesis is true, of observing such an extreme statistic; (4) It is necessary to carefully define ‘extreme’—if, for instance, very large values, both positive and negative, of the test statistic are considered incompatible with the null hypothesis, then the p-value should account for that; (5) Declare the result with statistical significance if the p-value lies below some critical threshold.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is a Test Statistic?
A test statistic is a numerical summary designed for the purpose of determining whether the null hypothesis or the alternative hypothesis should be accepted as correct. More specifically, it assumes that the parameter of interest follows a particular sampling distribution under the null hypothesis. For example, the number of heads in a series of coin flips may be distributed as a binomial distribution, but with a large enough sample size, the sampling distribution should be approximately normally distributed. Hence, the sampling distribution for the total number of heads in a large series of coin flips would be considered normally distributed.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
What are Degrees of Freedom?
In the documentation and settings for many statistical tests and probability distributions, you will see a reference to “degrees of freedom.” The concept is applied to statistics calculated from sample data, and refers to the number of values free to vary. For example, if you know the mean for a sample of 10 values, there are 9 degrees of freedom (once you know 9 of the sample values, the 10th can be calculated and is not free to vary). The degrees of freedom parameter, as applied to many probability distributions, affects the shape of the distribution. The number of degrees of freedom is an input to many statistical tests. For example, degrees of freedom is the name given to the n – 1 denominator seen in the calculations for variance and standard deviation. Why does it matter? When you use a sample to estimate the variance for a population, you will end up with an estimate that is slightly biased downward if you use n in the denominator. If you use n – 1 in the denominator, the estimate will be free of that bias.
The number of degrees of freedom (d.f.) forms part of the calculation to standardize test statistics so they can be compared to reference distributions (t-distribution, F-distribution, etc.).
Is it important for data science? Not really, at least in the context of significance testing. For one thing, formal statistical tests are used only sparingly in data science. For another, the data size is usually large enough that it rarely makes a real difference for a data scientist whether, for example, the denominator has n or n – 1. (As n gets large, the bias that would come from using n in the denominator disappears.) There is one context, though, in which it is relevant: the use of factored variables in regression (including logistic regression). Some regression algorithms choke if exactly redundant predictor variables are present. This most commonly occurs when factoring categorical variables into binary indicators (dummies). Consider the variable “day of week.” Although there are seven days of the week, there are only six degrees of freedom in specifying day of week. For example, once you know that day of week is not Monday through Saturday, you know it must be Sunday. Inclusion of the Mon–Sat indicators thus means that also including Sunday would cause the regression to fail, due to a multicollinearity error.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
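The day-of-week example corresponds to dropping one dummy column when encoding a categorical variable; with pandas this is the `drop_first` flag:

```python
import pandas as pd

days = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], name="day")

full = pd.get_dummies(days)                      # 7 indicator columns: redundant for regression
reduced = pd.get_dummies(days, drop_first=True)  # 6 columns, matching the 6 degrees of freedom
print(full.shape[1], reduced.shape[1])           # 7 6
```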
What is the Z-test?
Assumes the test statistic follows a normal distribution under the null hypothesis.
Generally, the Z-test is used when the sample size is large (to invoke the Central Limit Theorem) or when the population variance is known. A t-test is used when the sample size is small and when the population variance is unknown. The Z-test for a population mean is formulated as:
$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \sim N(0,1)$
in the case where the population variance $\sigma^2$ is known.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
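A direct implementation of the formula, with made-up numbers and the population standard deviation assumed known:

```python
import numpy as np
from scipy import stats

x_bar, mu_0, sigma, n = 103.2, 100.0, 15.0, 100   # sample mean, null mean, known sd, sample size

z = (x_bar - mu_0) / (sigma / np.sqrt(n))
p_two_sided = 2 * stats.norm.sf(abs(z))           # two-tailed p-value
print(round(z, 2), round(p_two_sided, 3))         # about 2.13 and 0.033
```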
What is the Student’s t-test?
Uses a Student’s t-distribution rather than a normal distribution as the reference distribution for the test statistic.
The t-test is structured similarly to the Z-test but uses the sample variance $s^2$ in place of population variance. The t-test is parameterized by the degrees of freedom, which refer to the number of independent observations in a dataset, denoted below by $n - 1$:
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \sim t_{n-1}$
where $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
The t-value, also known as the t-test, is an important focus of attention, since it is the link that tells us whether the association between an explanatory variable and the response has statistical significance. The t-value is simply the estimate divided by the standard error, and so it can be interpreted as the distance of the estimate from 0, measured in standard errors. Given a t-value and the sample size, the software can provide an exact p-value; for large samples, t-values greater than 2 or less than –2 correspond to p < 0.05, although these thresholds are higher for smaller sample sizes.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
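With raw data, scipy's one-sample t-test returns both the statistic and its p-value (the measurements below are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=5.4, scale=2.0, size=25)   # hypothetical measurements

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)   # H0: mu = 5.0
print(round(t_stat, 2), round(p_value, 3))
```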
What is the Chi-Square Test?
Used to assess goodness of fit and to check whether two categorical variables are independent.
The Chi-squared test statistic is used to assess goodness of fit and is calculated as follows:
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
where $O_i$ is the observed value of interest and $E_i$ is its expected value. A Chi-squared test statistic takes on a particular number of degrees of freedom, which is based on the number of categories in the distribution.
To use the Chi-squared test to check whether two categorical variables are independent: (1) Create a table of counts (called a contingency table) with the values of one variable forming its rows and the values of the other variable forming its columns; (2) Compute the expected count for each cell under independence (row total multiplied by column total, divided by the grand total) and calculate the Chi-squared statistic as above; (3) Compare the statistic to a Chi-squared distribution with (r - 1)(c - 1) degrees of freedom.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
The Chi-square test is a hypothesis test that is used when you want to determine whether there is a relationship between two categorical variables.
Web testing often goes beyond A/B testing and tests multiple treatments at once. The chi-square test is used with count data to test how well it fits some expected distribution. The most common use of the chi-square statistic in statistical practice is with r × c contingency tables, to assess whether the null hypothesis of independence among variables is reasonable.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
The chi-square statistic is a general measure of the dissimilarity between the observed and expected counts.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
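A sketch of the independence test on a 2 x 2 contingency table with scipy (the counts are invented):

```python
import numpy as np
from scipy import stats

# Rows: treatment A / B; columns: converted / not converted (hypothetical counts)
table = np.array([[ 90, 910],
                  [120, 880]])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(round(chi2, 2), round(p_value, 4), dof)   # dof = (2 - 1) * (2 - 1) = 1
```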
What is the difference between a chi-square test and a correlation?
Both correlations and chi-square tests can test for relationships between two variables. However, a correlation is used when you have two quantitative variables and a chi-square test of independence is used when you have two categorical variables.
Scribbr - Frequently asked questions (2022)
What are the Type I and Type II Errors?
There are two errors that are frequently assessed:
- A Type I error, which is also known as a false positive error, occurs when the null hypothesis is rejected when it is actually correct.
- A Type II error, which is also known as a false negative error, occurs when the null hypothesis is not rejected when it is incorrect.
Usually, $1 - \alpha$ is referred to as the confidence level, and $1 - \beta$ is referred to as the power of the test.
- $\alpha$ (alpha) represents the probability of a Type I error.
- $\beta$ (beta) represents the probability of a Type II error.
- Power = 1 − $\beta$, which reflects the probability of correctly rejecting a false null hypothesis.
Generally, tests are set up in such a way as to have both $1 - \alpha$ and $1 - \beta$ relatively high (for example, 0.95 and 0.8, respectively). If you plot sample size versus power, generally you should see that a larger sample size corresponds to higher power. It can be useful to look at power curves in order to gauge the sample size needed for detecting a significant effect.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
What is the Bonferroni Correction?
If you run many experiments—even if a particular outcome for one is unlikely—you may see a statistically significant result at least once by pure chance. For example, if you set $\alpha = 0.05$ and have 100 hypothesis tests, you would expect 5 out of 100 to be statistically significant by chance alone. To control for this, a more desirable outcome is achieved by adjusting $\alpha$. This can be done by setting a new $\alpha$ value as:
$\alpha' = \frac{\alpha}{n}$
where $n$ is the number of hypothesis tests.
This adjustment is known as the Bonferroni correction and helps ensure that the overall rate of false positives is controlled within a multiple-testing framework.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
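statsmodels implements this correction (alongside less conservative alternatives) through `multipletests`; the p-values below are made up:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.039, 0.041, 0.20]   # hypothetical raw p-values from 5 tests

reject, p_adjusted, _, alpha_bonf = multipletests(p_values, alpha=0.05, method="bonferroni")
print(alpha_bonf)   # 0.05 / 5 = 0.01, the corrected per-test threshold
print(reject)       # only the tests that survive the corrected threshold
```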
What is Regression to Mean?
Regression to the mean occurs when more extreme responses tend to revert and move closer to the mean in the long run, since some contribution to their initially extreme character happened merely by chance.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is Bias?
In statistics, bias is the difference between the expected value of an estimator and its estimand. […] (it) refers to results that are systematically off the mark. Think archery where your bow is sighted incorrectly. High bias doesn’t mean you’re shooting all over the place (that’s high variance), but may cause a perfect archer to hit below the bullseye all the time.
What is AI bias?; Cassie Kozyrkov (2019)
In causality, bias is what makes association different from causation. […] The bias is given by how the treated and control group differ before the treatment, in case neither of them has received the treatment. […] you can think of bias arising because many things we can’t control are changing together with the treatment.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
Statistical bias refers to measurement or sampling errors that are systematic and produced by the measurement or sampling process. An important distinction should be made between errors due to random chance and errors due to bias. Consider the physical process of a gun shooting at a target. It will not hit the absolute center of the target every time, or even much at all. An unbiased process will produce error, but it is random and does not tend strongly in any direction. Bias comes in different forms, and may be observable or invisible. When a result does suggest bias (e.g., by reference to a benchmark or actual values), it is often an indicator that a statistical or machine learning model has been misspecified, or an important variable left out.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Sample Bias?
[When] the sample is different in some meaningful and nonrandom way from the larger population it was meant to represent. The term nonrandom is important - hardly any sample, including random samples, will be exactly representative of the population. Sample bias occurs when the difference is meaningful, and it can be expected to continue for other samples drawn in the same way as the first.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is the Bias-Variance trade-off?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data. Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data. A model with high variance pays a lot of attention to training data and does not generalize on data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data. If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it’s going to have high variance and low bias. So we need to find the right/good balance without overfitting and underfitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.
Understanding the Bias-Variance Tradeoff; Seema Singh (2018)
The tension between oversmoothing and overfitting is an instance of the bias-variance trade-off, a ubiquitous problem in statistical model fitting. Variance refers to the modeling error that occurs because of the choice of training data; that is, if you were to choose a different set of training data, the resulting model would be different. Bias refers to the modeling error that occurs because you have not properly identified the underlying real-world scenario; this error would not disappear if you simply added more training data. When a flexible model is overfit, the variance increases. You can reduce this by using a simpler model, but the bias may increase due to the loss of flexibility in modeling the real underlying situation. A general approach to handling this trade-off is through cross-validation.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
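An illustrative sketch of the trade-off in Python, fitting polynomials of increasing degree to synthetic noisy data (the data and degrees are assumptions made for the example):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy sine data: degree 1 underfits (high bias), degree 15 overfits (high variance).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, mean_squared_error(y_test, model.predict(X_test)))
```

The intermediate degree typically gives the lowest test error, which is the balance the passages above describe.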
What is ANOVA?
ANOVA is a statistical procedure for analyzing the results of an experiment with multiple groups. It is the extension of similar procedures for the A/B test, used to assess whether the overall variation among groups is within the range of chance variation. A useful outcome of ANOVA is the identification of variance components associated with group treatments, interaction effects, and errors.
What is Bootstrap?
One easy and effective way to estimate the sampling distribution of a statistic, or of model parameters, is to draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample. This procedure is called the bootstrap, and it does not necessarily involve any assumptions about the data or the sample statistic being normally distributed.
The bootstrap (sampling with replacement from a data set) is a powerful tool for assessing the variability of a sample statistic.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
[…] have an idea of how much our estimate varies. This process is known as bootstrapping the data — the magical idea of bootstrapping is reflected in this ability to learn about variability in an estimate without having to make any assumptions about the shape of the population distribution. If we repeat this resampling, say, a thousand times, we will obtain a thousand possible estimates of the mean. These are known as sampling distributions of estimates, since they reflect the variability in the estimates that arise from repeated samplings of the data. The distributions of estimates based on resampled data are almost symmetric around the mean of the original data, almost independently of the shape of the original data distribution. The second important characteristic is that the bootstrap distributions narrow as the sample size increases, reflected in the 95% uncertainty intervals becoming ever narrower. Bootstrapping provides an intuitive way, with heavy use of the computer, to assess the uncertainty in our estimates, without needing to make strong assumptions or use probability theory.
Bootstrapping a sample consists of creating new datasets of the same size by resampling the original data, with replacement. Sample statistics calculated from bootstrap resamplings tend toward a normal distribution for large datasets, regardless of the shape of the original data distribution. Uncertainty intervals based on bootstrapping take advantage of modern computing power and do not require assumptions about the mathematical form of the population nor complex probability theory.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
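A minimal Python sketch of bootstrapping the mean (the data here are synthetic and deliberately skewed):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=200)      # skewed toy data

# Resample with replacement many times and recompute the statistic each time.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])
print(np.percentile(boot_means, [2.5, 97.5]))      # 95% bootstrap interval for the mean
```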
What is the difference between Bootstrap and Permutation?
There are two main types of resampling procedures: the bootstrap and permutation tests. The bootstrap is used to assess the reliability of an estimate. Permutation tests are used to test hypotheses, typically involving two or more groups.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Bagging?
With classification and regression trees (also called decision trees), running multiple trees on bootstrap samples and then averaging their predictions (or, with classification, taking a majority vote) generally performs better than using a single tree. This process is called bagging (short for “bootstrap aggregating”).
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
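A sketch with scikit-learn, comparing a single decision tree against bagged trees by cross-validation (the built-in breast cancer dataset is used purely as a convenient example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Bagging averages many trees fit on bootstrap samples; it usually scores higher.
print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```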
What are the Epicycles of Data Analysis?
Develop Expectations -> Collect Data -> Match Expectations with Data
Stating the Question -> Exploratory Data Analysis -> Model Building -> Interpret -> Communicate
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What are the differences between Descriptive Analysis and Inferential Statistics?
Descriptive statistics describe a sample. That’s pretty straightforward. You simply take a group that you’re interested in, record data about the group members, and then use summary statistics and graphs to present the group properties. With descriptive statistics, there is no uncertainty because you are describing only the people or items that you actually measure. You’re not trying to infer properties about a larger population.[…] Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population. This requirement affects our process.
Difference between Descriptive and Inferential Statistics; Jim Frost (2020)
What are the 4 different Categories of Data Analysis?
Descriptive Analytics (tells you what happened in the past); Diagnostic Analytics (helps you understand why something happened in the past); Predictive Analytics (predicts what is most likely to happen in the future); Prescriptive Analytics (recommends actions you can take to affect those outcomes).
Comparing Descriptive, Predictive, Prescriptive, and Diagnostic Analytics; Brian Brinkmann (2019)
What are Associational Analyses?
Associational analyses are ones where we are looking at an association between two or more features in the presence of other potentially confounding factors. There are three classes of variables that are important to think about in an associational analysis: Outcome (the feature of your dataset that is thought to change along with your key predictor); Key predictor (often for associational analyses there is one key predictor of interest); Potential confounders (this is a large class of predictors that are both related to the key predictor and the outcome).
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What are Prediction Analyses?
In the previous section we described associational analyses, where the goal is to see if a key predictor x and an outcome y are associated. But sometimes the goal is to use all of the information available to you to predict y. Furthermore, it doesn’t matter if the variables would be considered unrelated in a causal way to the outcome you want to predict because the objective is prediction, not developing an understanding about the relationships between features. With prediction models, we have outcome variables–features about which we would like to make predictions–but we typically do not make a distinction between “key predictors” and other predictors. In most cases, any predictor that might be of use in predicting the outcome would be considered in an analysis and might, a priori, be given equal weight in terms of its importance in predicting the outcome. Prediction analyses will often leave it to the prediction algorithm to determine the importance of each predictor and to determine the functional form of the model.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is the Classical Statistical Inference Pipeline?
Formulate hypothesis -> Design experiment -> Collect Data -> Inference/conclusions
The term inference reflects the intention to apply the experiment results, which involve a limited set of data, to a larger process or population.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Data Conditioning?
[It is] the first step of any data analysis project [and means] getting data into a state where it’s usable. Data conditioning can involve cleaning up messy HTML with tools […], natural language processing to parse plain text in English and other languages, or even getting humans to do the dirty work.
What Is Data Science?; Mike Loukides (2010)
What is QMV?
It is an iterative process of questioning, modeling, and validation applied to data analysis and model building.
Model Building and Validation by AT&T, Online Course - Advanced Techniques for Analyzing Data
Causal Inference
What is Causality?
Causality, in the statistical sense, means that when we make interventions, the chances of obtaining different results are systematically modified. It is difficult to establish causality statistically; for this, well-designed randomized studies are the best tool we have. Observational data may include background factors that influence the apparent relationships observed between an exposure and an outcome; these may be either observed confounders or hidden factors.
Our “statistical” idea of causality is not strictly deterministic. When we say that X causes Y, we are not trying to say that every time X occurs, Y will also occur. Or that Y only occurs if X occurs. We only mean that if we intervene and force X to occur, then Y will tend to occur more frequently. Thus, we can never say that X caused Y in a specific case, only that X increases the proportion of times that Y occurs.
First, to infer causality with real confidence, the ideal is to intervene and conduct experiments. Second, since this is a statistical or stochastic world, we need to intervene more than once to accumulate evidence.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What are Counterfactuals?
Counterfactual reasoning means thinking about alternative possibilities for past or future events: what might happen/ have happened if…? In other words, you imagine the consequences of something that is contrary to what actually happened or will have happened (“counter to the facts”).
Conceptually: Counterfactuals (2022)
[…] we will talk a lot in terms of potential outcomes. They are potential because they didn’t actually happen. Instead they denote what would have happened in the case some treatment was taken. We sometimes call the potential outcome that happened, factual, and the one that didn’t happen, counterfactual.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the fundamental problem of Causal Inference?
The fundamental problem of causal inference is that we can never observe the same unit with and without treatment. It is as if we have two diverging roads and we can only know what lies ahead of the one we take.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the difference between Causation and Association?
Inferences about causation are concerned with “what if” questions in counterfactual worlds, such as “what would be the risk if everybody had been treated?” and “what would be the risk if everybody had been untreated?”, whereas inferences about association are concerned with questions in the actual world, such as “what is the risk in the treated?” and “what is the risk in the untreated?”.
Association is defined by a different risk in two disjoint subsets of the population determined by the individuals’ actual treatment value (A = 1 or A = 0), whereas causation is defined by a different risk in the same population under two different treatment values (a = 1 or a = 0).
In ideal randomized experiments, association is causation.
Causal Inference: What if; Miguel A. Hernán and James M. Robins (2022)
What is the purpose of a Clinical Experiment?
The purpose of a clinical experiment is to conduct an “honest test” that properly determines causality and estimates the average effect of a new medical treatment, without introducing biases that may give us a mistaken idea of its effectiveness. An adequate medical experiment should ideally follow the following principles: (1) Allocation of treatment: it is important to compare like with like, so the treatment and comparison groups need to be as similar as possible. The best way to ensure this is to randomly assign participants to be treated or not, and then observe what happens to them — this is known as a Randomized Controlled Trial (RCT).
(2) All individuals in the groups to which they were allocated must be counted: the individuals allocated to the “statin” group of the EPC were included in the final analysis even if they did not take their statins.
(3) If possible, people should not know which group they are in: in studies with statins, both the real medication pills and the placebo pills had the same appearance, so that the participants did not know the treatment they were receiving — a blind test. If possible, those evaluating the final results should not know which group of subjects they are examining.
(4) Evaluate all individuals: every effort should be made to follow all individuals, since people who drop out of the study may, for example, have done so due to the drug’s side effects.
(5) Do not rely on a single study: a single experiment with statins can only tell us that the drug worked in a particular group in a particular place; robust conclusions require multiple studies.
(6) Systematically review the evidence: when examining multiple experiments, it is important to include any study that has been conducted, and thus create what is known as a systematic review. The results can then be formally combined in a meta-analysis.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is Average Treatment Effect (ATE)?
The average treatment effect (ATE) is a measure used to compare treatments (or interventions) in randomized experiments, evaluation of policy interventions, and medical trials. The ATE measures the difference in mean (average) outcomes between units assigned to the treatment and units assigned to the control.
Average treatment effect - Wikipedia (2022)
What is Confounding Bias?
The first significant cause of bias is confounding. It happens when the treatment and the outcome share a common cause. For example, let’s say that the treatment is education, and the outcome is income. It is hard to know the causal effect of education on wages because both share a common cause: intelligence. So we could argue that more educated people earn more money simply because they are more intelligent, not because they have more education. We need to close all backdoor paths between the treatment and the outcome to identify the causal effect. If we do so, the only effect that will be left is the direct effect T->Y. In our example, if we control for intelligence, that is, we compare people with the same level of intellect but different levels of education, the difference in the outcome will be only due to the difference in schooling since intelligence will be the same for everyone. To fix confounding bias, we need to control all common causes of the treatment and the outcome.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
Any correlation between ice cream sales and drownings must be due to both being influenced by the weather. When an apparent association between results can be explained by a common factor influencing both, this common cause is known as a confounder, or confounding variable. The simplest technique to deal with confounding variables is to examine the apparent relationship within each level of the confounder. This is known as adjustment, or stratification. Thus, for example, we could explore the relationship between drownings and ice cream sales on days with more or less the same temperature. An extreme case is Simpson’s paradox, which occurs when the apparent direction of an association is reversed by a confounder, requiring a complete change in the apparent information from the data. Statisticians delight in finding real-life examples of this, each reinforcing the caution required in interpreting observational data. However, it shows the insight gained by dividing data according to factors that may help explain observed associations.
In a randomized study, there should be no need for adjustment for confounding variables, since random allocation in theory ensures that all other factors not being studied are balanced between the groups.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is Selection Bias?
Often, selection bias arises when we control for more variables than we should. It might be the case that the treatment and the potential outcome are marginally independent but become dependent once we condition on a collider.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
While confounding is the bias from failing to control for a common cause, selection bias is when we control for a common effect or a variable in between the path from cause to effect. As a rule of thumb, always include confounders and variables that are good predictors of the outcome in your model. Always exclude variables that are good predictors of the treatment only, mediators between the treatment and outcome, or common effects of the treatment and outcome.
Selection bias is so pervasive that not even randomization can fix it. Better yet, it is often introduced by the ill-advised, even in random data! Spotting and avoiding selection bias requires more practice than skill. Often, it appears underneath some supposedly clever idea, making it even harder to uncover.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the difference between Correlated Variables and Confounding Variables?
With correlated variables, the problem is one of commission: including different variables that have a similar predictive relationship with the response. With confounding variables, the problem is one of omission: an important variable is not included in the regression equation. Naive interpretation of the equation coefficients can lead to invalid conclusions.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
Which features/controls/predictors should we add to a Causal Inference model?
We should add controls that are both correlated with the treatment and the outcome (confounder). We should also add controls that are good predictors of the outcome, even if they are not confounders, because they lower the variance of our estimates. However, we should NOT add controls that are just good predictors of the treatment, because they will increase the variance of our estimates.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the Vast Search Effect?
Bias or nonreproducibility resulting from repeated data modeling, or modeling data with large numbers of predictor variables. […] Typical forms of selection bias in statistics, in addition to the vast search effect, include nonrandom sampling, cherry-picking data, selection of time intervals that accentuate a particular statistical effect, and stopping an experiment when the results look “interesting.”
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
Feature Engineering
How to deal with Missing Values?
Frequently data contains missing or null values, which lower the potential of the model, so we try to impute them. 1. For continuous values, we fill in the null values using the mean, mode, or median, depending on the need. 2. For categorical values, we use the most frequently occurring category.
The “Generic” Data Science Life-Cycle; Sivakar Siva (2020)
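A minimal pandas sketch of both imputation strategies (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, None, 40], "city": ["SP", None, "SP", "RJ"]})
df["age"] = df["age"].fillna(df["age"].median())       # continuous: median (or mean)
df["city"] = df["city"].fillna(df["city"].mode()[0])   # categorical: most frequent value
print(df)
```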
What is One Hot Encoding?
One hot encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction. […] It is used to perform “binarization” of the category and include it as a feature to train the model.
What is One Hot Encoding? Why and When Do You Have to Use it? (2017)
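A minimal sketch with pandas (the "color" column is hypothetical); each category becomes its own 0/1 indicator column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))   # color_blue, color_green, color_red
```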
Machine Learning
What is the basic difference between Inferential Statistics and Machine Learning?
Inferential statistics is a way to learn from data, and one of the tools of Machine Learning. Both use a set of observations to discover underlying processes or patterns, and then are able to predict. If you have all the houses’ characteristics and prices in a given area, you can find out what is determining the price, then predict the price for a new house. Simple statistical analysis. Now if you want to build an app to predict house prices, it’s another story. You need a lot more work on data pre-processing, multiple algorithms, other models of regression, etc. That’s machine learning territory. Inferential statistics is only one of the tools. Machine Learning also wants to learn from “big data”: high dimensional, unstructured, streaming data; find connections in a social network; group press releases by similar topics; recognize images; compress pictures; etc. No nice excel-like data set for this. It requires a different set of tools (whose goal is basically to turn everything the messy world is throwing at us into a nice excel-like data set with matrices that compute fast). The techniques that deal with high dimensional and streaming data have all the attention today, but a lot of the implementations of Machine Learning are still classic regression. You hear a lot that a business can be “moneyballed”, referring to baseball statistics. The idea is that you can take something that is “obviously” not data driven (“I have been doing this business for 30 years and let me tell you it’s all about connecting with people”) and prove that it can be run more effectively with data. Most of that is indeed inferential statistics, plus additional techniques. It’s all “learning from data”.
Quora Answer; Philippe Hocquet (2017)
What is a Linear Model?
Mathematically, the function $f(x) = wx + b$ is an affine transformation, not a linear one, since true linear transformations require $b = 0$. However, in machine learning, we often call such models “linear” whenever the parameters appear linearly in the equation — meaning $w$ and $b$ are only multiplied by inputs or constants and added, without multiplying each other, being raised to powers, or appearing inside functions like $e^w$.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is Linear Regression?
Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things: (1) does a set of predictor variables do a good job in predicting an outcome (dependent) variable? (2) Which variables in particular are significant predictors of the outcome variable, and in what way do they–indicated by the magnitude and sign of the beta estimates–impact the outcome variable? These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.
What is Linear Regression?; Statistics Solutions (2013)
In basic regression analysis, the dependent variable is the quantity we want to predict or explain, usually forming the vertical y-axis of a graph, and is also known as the response variable. The independent variable is the quantity we use to make the prediction or explanation, generally forming the horizontal x-axis of a graph, and is also known as the explanatory variable. The gradient — slope — is also known as the regression coefficient.
The meaning of these gradients depends entirely on our assumptions about the relationship between the variables being studied. For correlation data, the gradient indicates how much the dependent variable would be expected to change, on average, if we observe a difference of one unit in the independent variable. If, however, we assumed a causal relationship, then the interpretation of the gradient would be very different — it would be the expected change in the dependent variable if we intervened and changed the independent variable by one unit.
Statistical models have two main components. First, a mathematical formula that expresses a deterministic, predictable component, such as the straight line fit that allows us to predict a child’s height from the parent’s height. But the deterministic part of a model will never be a perfect representation of the observed world. The difference between what the model predicts and what actually happens is the second component of a model, known as the residual error — although it may sound misleading, it is simply the inevitable inability of a model to represent exactly what we observe. Thus, in sum, we assume that: observation = deterministic model + residual error.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
OLS is a common technique used in analyzing linear regression. In brief, it compares the difference between individual points in your data set and the predicted best fit line to measure the amount of error produced.
Interpreting Linear Regression Through statsmodels .summary(); Tim McAleer (2020)
How is the model fit to the data? When there is a clear relationship, you could imagine fitting the line by hand. In practice, the regression line is the estimate that minimizes the sum of squared residual values, also called the residual sum of squares or RSS. The method of minimizing the sum of the squared residuals is termed least squares regression, or ordinary least squares (OLS) regression.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
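A minimal OLS sketch with statsmodels on synthetic data (the true intercept and slope are assumed to be 2 and 3 so the recovered coefficients can be checked):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

X = sm.add_constant(x)        # adds the intercept term
model = sm.OLS(y, X).fit()    # least squares: minimizes the residual sum of squares
print(model.params)           # estimates should be close to [2.0, 3.0]
```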
How do Neural Networks differ from Linear Models?
A neural network differs from a linear model in two fundamental ways:
(1) It applies fixed non-linear functions to the outputs of trainable linear functions.
(2) Its structure is deeper, combining multiple functions hierarchically through layers.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What are Logits?
Logits are the raw outputs of a neural network, prior to applying an activation function.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is Logistic Regression?
Logistic regression is commonly used for binary classification tasks. Unlike linear regression, which produces outputs ranging from −∞ to ∞, logistic regression always outputs values between 0 and 1. It does this by applying the sigmoid function to a linear combination of inputs. Logistic regression can serve either as a standalone model or as the output layer in a larger neural network.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is the difference between Linear Regression and Logistic Regression?
Linear Regression and Logistic Regression are two well-known Machine Learning algorithms that come under the supervised learning technique. Since both algorithms are supervised in nature, they use labeled datasets to make predictions. But the main difference between them is how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
Linear Regression vs Logistic Regression - Java T Point (2022)
Linear regression and logistic regression share many commonalities. Both assume a parametric linear form relating the predictors with the response. Exploring and finding the best model are done in very similar ways. Extensions to the linear model, like the use of a spline transform of a predictor, are equally applicable in the logistic regression setting. Logistic regression differs in two fundamental ways: (1) The way the model is fit (least squares is not applicable); (2) The nature and analysis of the residuals from the model.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
As noted above, unlike linear regression (which produces outputs ranging from minus infinity to infinity), logistic regression outputs values between 0 and 1, and it can serve either as a standalone model or as the output layer in a larger neural network.
A common choice for the loss function in this case is binary cross-entropy, also called logistic loss.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What Assumptions do we make for Regression models?
When doing a simple regression model, we make the (often reasonable!) assumptions that: a) The errors are normally distributed and, on average, zero; b) The errors all have the same variance (they are homoscedastic), and c) The errors are unrelated to each other (they are independent across observations).
Practical Time Series - The State University of New York (2024)
What is the difference between Simple Linear Regression and Correlation?
Both are ways of measuring how two variables are related. The difference is that while correlation measures the strength of an association between two variables, regression quantifies the nature of the relationship.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Multicollinearity?
It will be recalled that one of the factors that affects the standard error of a partial regression coefficient is the degree to which that independent variable is correlated with the other independent variables in the regression equation. Other things being equal, an independent variable that is very highly correlated with one or more other independent variables will have a relatively large standard error. This implies that the partial regression coefficient is unstable and will vary greatly from one sample to the next. This is the situation known as multicollinearity. Multicollinearity exists whenever an independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation. Multicollinearity is a problem because it undermines the statistical significance of an independent variable. Other things being equal, the larger the standard error of a regression coefficient, the less likely it is that this coefficient will be statistically significant.
The problem of multicollinearity - Understanding Regression Analysis; Michael Patrick Allen (1997)
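A sketch of diagnosing multicollinearity in Python with variance inflation factors (the synthetic predictors below are deliberately near-collinear; the VIF > 5-10 rule of thumb is a common convention, not from the quoted text):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=300)   # nearly collinear with x1
x3 = rng.normal(size=300)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# High VIFs for x1 and x2 signal unstable, hard-to-interpret coefficients.
for i, col in enumerate(X.columns[1:], start=1):
    print(col, variance_inflation_factor(X.values, i))
```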
What is Heteroskedasticity?
Heteroskedasticity is the lack of constant residual variance across the range of the predicted values. In other words, errors are greater for some portions of the range than for others. Visualizing the data is a convenient way to analyze residuals.
Heteroskedasticity indicates that prediction errors differ for different ranges of the predicted value, and may suggest an incomplete model.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
This phenomenon of having a region of low variance and another of high variance is called heteroskedasticity. Put simply, heteroskedasticity is when the variance is not constant across all values of the features.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is Loss in Machine Learning?
Smaller errors mean the model fits the data better. The loss, which aggregates these errors, measures how well the model aligns with the dataset.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
How do we find the Optimum of a Function?
To find the optimum (minimum or maximum) of a function, we calculate its first derivative. At the optimum, the first derivative equals 0. For functions of two or more variables, like the loss function $J(w, b)$, we compute partial derivatives with respect to each variable.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is the Cross-Entropy Loss?
Cross-entropy, also known as logarithmic loss or log loss, is a popular loss function used in machine learning to measure the performance of a classification model. It measures the average number of bits required to identify an event from one probability distribution, p, using the optimal code for another probability distribution, q. In other words, cross-entropy measures the difference between the discovered probability distribution of a classification model and the predicted values. The cross-entropy loss function is used to find the optimal solution by adjusting the weights of a machine learning model during training. The objective is to minimize the error between the actual and predicted outcomes. A lower cross-entropy value indicates better performance.
Binary cross-entropy (averaged over $N$ samples): $L = -\frac{1}{N}\sum_{j=1}^{N}\left(t_j\log(p_j)+(1-t_j)\log(1-p_j)\right)$, where $t_j \in \{0,1\}$ is the true label and $p_j$ is the predicted probability for sample $j$. For binary classification, $p_j$ usually comes from a sigmoid, while for more than two classes it comes from the softmax.
Cross-Entropy Loss Function in Machine Learning: Enhancing Model Accuracy; Kurtis Pykes (2025)
When used with softmax in the output layer, cross-entropy guides the network to assign high probabilities to correct classes while reducing probabilities for incorrect ones.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
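A minimal NumPy sketch of the binary cross-entropy formula above (the labels and probabilities are toy values):

```python
import numpy as np

def binary_cross_entropy(t, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

t = np.array([1, 0, 1, 1])             # true labels
p = np.array([0.9, 0.2, 0.7, 0.4])     # predicted probabilities
print(binary_cross_entropy(t, p))      # lower is better
```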
What is the Gradient in Gradient Descent?
The gradient of the loss function is a vector containing all partial derivatives with respect to the model’s parameters. It indicates the direction of steepest ascent in the loss function. To minimize loss, parameters are updated in the opposite direction of the gradient.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is Gradient Descent and How Does it Work?
Gradient descent is an iterative optimization algorithm that updates model parameters to minimize the loss function. Steps include:
- Initialize parameters randomly.
- Compute predictions.
- Compute gradients of the loss with respect to parameters.
- Update parameters using the gradients, adjusting them in the direction that decreases the loss function. This adjustment involves taking a small step in the opposite direction of the gradient.
- Calculate the loss by substituting the updated values.
- Repeat until the loss converges to a minimum.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
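A minimal NumPy sketch of the steps above for a one-feature linear model with mean squared error loss (the learning rate, iteration count, and data are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 4.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    error = w * x + b - y              # predictions minus targets
    grad_w = 2 * np.mean(error * x)    # dL/dw for MSE loss
    grad_b = 2 * np.mean(error)        # dL/db
    w -= lr * grad_w                   # step in the opposite direction of the gradient
    b -= lr * grad_b
print(w, b)                            # should approach 4.0 and 1.0
```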
What is the Learning Rate in Gradient Descent?
The learning rate is a hyperparameter that controls the step size during updates. If the learning rate is too small, training is very slow; if it is too large, the algorithm may overshoot the minimum or even diverge. Choosing an appropriate learning rate is critical for convergence.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is Convergence in Gradient Descent?
Convergence occurs when subsequent iterations yield minimal decreases in loss. A properly tuned learning rate ensures steady progress toward the minimum of the loss function.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is Automatic Differentiation?
Automatic differentiation (autograd) is a feature in modern ML frameworks (e.g., PyTorch, TensorFlow) that computes derivatives directly from Python code. It eliminates the need for manual derivations, even for complex models, and enables gradient-based optimization efficiently.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is Backpropagation?
Backpropagation is the algorithm used to compute gradients in neural networks. It applies differentiation rules (chain rule) over a computational graph of the model. The process involves two passes:
- Forward pass: data flows from input to output to compute predictions.
- Backward pass: gradients flow from output to input to update parameters.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
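A sketch of the forward and backward passes with PyTorch autograd (the tiny tensors are illustrative):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

loss = ((w * x + b - y) ** 2).mean()   # forward pass: predictions and loss
loss.backward()                        # backward pass: gradients via the chain rule
print(w.grad, b.grad)                  # dL/dw and dL/db, ready for a parameter update
```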
What are Activation Functions?
For a one-dimensional input, the model becomes: $y = \phi(wx + b)$ where $\phi$ is a fixed non-linear function (activation). Common activations include:
- ReLU (Rectified Linear Unit): $ \mathrm{ReLU}(z) = \max(0, z) $, widely used in neural networks.
- Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$, outputs values in $[0,1]$, suitable for binary classification.
- Tanh: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} $, outputs values in $[-1,1]$.
- Softmax: Transforms a vector $\mathbf{z}$ into a probability distribution: $\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$, ensuring that the sum is 1. Tasks involving three or more classes generally employ the softmax activation function paired with cross-entropy loss.
Neural network softmax outputs are better characterized as “probability scores” rather than true statistical probabilities, despite summing to one and resembling class likelihoods. Unlike logistic regression or Naive Bayes models, neural networks don’t generate genuine class probabilities.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
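Minimal NumPy sketches of the activations listed above:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(relu(z), sigmoid(z), np.tanh(z), softmax(z).sum())   # softmax components sum to 1
```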
What are Feedforward Neural Networks (FNNs) and Multilayer Perceptrons (MLPs)?
A feedforward neural network (FNN) is one where information flows in one direction — left to right — without loops. […] When each layer connects to all units in the next, it is called a multilayer perceptron (MLP), and the layers are known as fully connected (dense) layers.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What are Convolutional Neural Networks (CNNs)?
Convolutional neural networks (CNNs) are feedforward neural networks with convolutional layers that are not fully connected. Initially designed for image processing, they are also effective for tasks like text classification.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
Why was ReLU important in Deep Learning?
The ReLU activation function, despite its simplicity, was a breakthrough in machine learning. Neural networks before 2012 relied more on smooth activations like tanh and sigmoid, which made training deep models difficult.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is the Brier Score?
Although the ROC curve assesses how well the algorithm separates the groups, and the calibration plot checks whether the probabilities actually correspond to what they claim, the ideal would be to find a simple composite measure that combines these two aspects into a single number that we could then use to compare algorithms. Fortunately, meteorologists in the 1950s discovered exactly how to do this. If we were predicting a numerical quantity, such as tomorrow’s temperature at noon in a given place, accuracy would generally be summarized by the error — the difference between the observed temperature and the predicted one. The usual way to summarize error over a series of days is the mean squared error (MSE) — the average of the squared errors—analogous to the least-squares criterion we saw used in regression analysis. The trick for probabilities is to use the same mean-squared-error criterion we use when predicting a quantity, assigning the value 1 to a future observation of “rain” and the value 0 to “no rain.” The average of the squared errors is known as the Brier score, in honor of meteorologist Glenn Brier, who described the method in 1950.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
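A minimal sketch of the Brier score in Python (the rain forecasts and outcomes are toy values):

```python
import numpy as np

forecast = np.array([0.7, 0.2, 0.9, 0.4])   # predicted probability of rain
outcome = np.array([1, 0, 1, 1])            # 1 = it rained, 0 = it did not
print(np.mean((forecast - outcome) ** 2))   # Brier score: 0 is perfect, lower is better
```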
What is the High Accuracy Paradox?
Accuracy is not useful when trying to predict things that are not common. Accuracy is simply the proportion of correctly classified instances. It is usually the first metric you look at when evaluating a model. However, when the data is imbalanced (most of the instances belong to one of the classes), or you are more interested in the performance on one of the classes, accuracy doesn’t really capture the effectiveness of a classifier. In classification problems we are typically more concerned about the errors that we make, because the target class is usually the area of interest we are trying to focus on. This is called the accuracy paradox.
Machine Learning - Accuracy Paradox; Randy Lao (2017)
Accuracy is not a reliable metric for determining a model’s performance. It is called a paradox because, intuitively, you would expect the model with the higher accuracy to be the best model, but the accuracy paradox tells us that this sometimes isn’t the case.
Accuracy Paradox in Classification Models; Amit Ranjan (2020)
What is Underfitting?
In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. Also, these kinds of models, such as linear and logistic regression, are too simple to capture complex patterns in the data.
Understanding the Bias-Variance Tradeoff; Seema Singh (2018)
What is Overfitting?
In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model extensively on a noisy dataset. These models have low bias and high variance. Very complex models, such as decision trees, are prone to overfitting.
Understanding the Bias-Variance Tradeoff; Seema Singh (2018)
Why is Model Tuning relevant?
Model tuning. A hallmark of prediction algorithms is their many tuning parameters. Sometimes these parameters can have large effects on prediction quality if they are changed and so it is important to be informed of the impact of tuning parameters for whatever algorithm you use. There is no prediction algorithm for which a single set of tuning parameters works well for all problems. Most likely, for the initial model fit, you will use “default” parameters, but these defaults may not be sufficient for your purposes. Fiddling with the tuning parameters may greatly change the quality of your predictions. It’s very important that you document the values of these tuning parameters so that the analysis can be reproduced in the future.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is PAC Learning?
PAC (probably approximately correct) learning theory helps to analyze whether and under what conditions a learning algorithm will probably output an approximately correct classifier.
The Hundred-Page Machine Learning Book; Andriy Burkov (2019)
A Concept Class (C) is PAC-learnable by a Learner (L) using a Hypothesis Space (H), if L will, with probability 1 - delta (with ‘delta’ being the certainty goal), output a hypothesis h (belonging to H) such that the error of h is less than epsilon (with ‘epsilon’ being the error goal) in time and samples polynomial in 1/epsilon, 1/delta.
PAC Learning - Georgia Tech - Machine Learning (2015)
Time Series
What is Stationarity in a Time Series context?
Strict stationarity imposes a stronger condition of identical probability distributions across different time points, while weak stationarity allows for changes in the distribution but requires the mean, variance, and autocorrelation structure to remain constant over time. In practice, weak stationarity is often more applicable and easier to verify, making it a commonly used assumption in time series analysis.
Practical Time Series - The State University of New York (2024)
Why is it important to know Stationarity and Invertibility for ARIMA models?
Stationarity and invertibility are crucial concepts in the context of ARIMA (AutoRegressive Integrated Moving Average) models, and understanding these properties is essential for ensuring the validity and reliability of the model. Here’s why these properties are important:
Stationarity:
- Statistical Assumption: ARIMA models assume that the time series data is stationary. Stationarity means that the statistical properties of the time series, such as mean and variance, do not change over time. This assumption is necessary for the model to capture meaningful patterns and relationships.
- Differencing Requirement: If the original time series is not stationary, differencing is applied to make it stationary. Differencing involves taking the difference between consecutive observations. Stationarity is important because it simplifies the modeling process and allows for more reliable parameter estimation.
Invertibility:
- Interpretability: Invertibility is a property that ensures the model is interpretable. An invertible model implies that the current value of the time series only depends on past values and white noise. This property is crucial for understanding the impact of past observations on the present without causing feedback loops.
- Meaningful Forecasts: Invertibility is important for making meaningful forecasts. If a model is not invertible, the forecasted values may not have clear interpretability, and it might be challenging to attribute changes in the forecast to specific changes in the input data.
- Numerical Stability: Invertibility is related to the numerical stability of the model. Invertible models are more likely to produce stable and reliable parameter estimates, making them more suitable for forecasting.
In summary, stationarity ensures that the statistical properties of the time series remain consistent over time, making it suitable for modeling. Invertibility ensures that the model is interpretable and capable of providing meaningful forecasts. Both properties contribute to the reliability and accuracy of ARIMA models in capturing and forecasting time series patterns.
Can I apply ARIMA on a Non Stationary and Non Invertible Time Series?
The ARIMA (AutoRegressive Integrated Moving Average) model assumes that the time series data is stationary. If your time series is non-stationary, you typically need to apply differencing to make it stationary before applying ARIMA. Similarly, invertibility is a desirable property of ARIMA models to ensure that the model is interpretable and suitable for forecasting. An invertible model implies that the current value of the time series only depends on past values and white noise. If a model is not invertible, it may lead to challenges in interpretation and potentially less reliable forecasts. Here are the general steps when dealing with a non-stationary time series: (1) Differencing - If your time series is non-stationary, you may need to apply differencing to make it stationary. Differencing involves taking the difference between consecutive observations; (2) ARIMA Model - Once the data is stationary, you can apply the ARIMA model. The ARIMA model is typically denoted as ARIMA(p, d, q), where p is the order of the autoregressive (AR) component, d is the degree of differencing, q is the order of the moving average (MA) component; (3) Invertibility Check: After fitting the ARIMA model, it’s important to check if the model is invertible. If the model is not invertible, you might need to reconsider the model specification or apply transformations to achieve invertibility.
How to model a Time Series for ARIMA?
Modeling: Trend suggests differencing; Variation in variance suggests transformation (common transformation: log, then differencing); ACF (auto-correlation function) suggests order of moving average process (q); PACF (partial ACF) suggests order of autoregressive process (p).
Practical Time Series - The State University of New York (2024)
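A sketch of this workflow with statsmodels (the series is a synthetic random walk, and the (1, 1, 1) order is an assumption for the example; in practice p and q would be guided by the ACF and PACF as described above):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=300))   # non-stationary random walk

# d=1 differences the series once to make it stationary before fitting AR and MA terms.
result = ARIMA(series, order=(1, 1, 1)).fit()
print(result.summary())
print(result.forecast(steps=5))            # five-step-ahead forecast
```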
Generative AI
What is a Corpus?
A collection of text documents used in machine learning.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is a Token?
Splitting a document into small indivisible parts is called tokenization, and each part is a token. There are different ways to tokenize. Sometimes, it’s useful to break words into smaller units, called subwords, to keep the vocabulary size manageable.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
How does the Bag of Words work?
- Create a vocabulary: List all unique words in the corpus to create the vocabulary.
- Vectorize documents: Convert each document into a feature vector, where each dimension represents a word from the vocabulary. The value indicates the word’s presence, absence, or frequency in the document.
While the bag-of-words approach offers simplicity and practicality, it has notable limitations. Most significantly, it fails to capture token order or context. Consider how “the cat chased the dog” and “the dog chased the cat” yield identical representations, despite conveying opposite meanings. N-grams provide one solution to this challenge.
Another limitation of bag-of-words is how it handles out-of-vocabulary words. When a word appears during inference that wasn’t present during training - and thus isn’t in the vocabulary - it can’t be represented in the feature vector. Similarly, the approach struggles with synonyms and near-synonyms. Words like “movie” and “film” are processed as completely distinct terms, forcing the model to learn separate parameters for each. Since labeled data is often costly to obtain, resulting in rather small labeled datasets, it would be more efficient if the model could recognize and collectively process words with similar meanings. Word embeddings address this by mapping semantically similar words to similar vectors.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
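A sketch with scikit-learn showing the word-order limitation mentioned above: the two sentences get identical bag-of-words vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat chased the dog", "the dog chased the cat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # the vocabulary
print(X.toarray())                          # both rows are identical
```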
How do N-grams work?
An n-gram consists of $n$ consecutive tokens from text. By preserving sequences of tokens, n-grams retain contextual information that individual tokens cannot capture. However, using n-grams comes at a cost. The vocabulary expands considerably, increasing the computational cost of model training. Additionally, the model requires larger datasets to effectively learn weights for the expanded set of possible n-grams.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
How do Word Embeddings work?
Word embeddings overcome the limitations of the bag-of-words model by representing words as dense vectors rather than sparse one-hot vectors. These lower-dimensional representations contain mostly non-zero values, with similar words having embeddings that exhibit high cosine similarity. The embeddings are learned from vast unlabeled datasets spanning millions to hundreds of millions of documents.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What are Skip-grams?
Skip-grams are word sequences where one word is omitted. Training a model to predict these skipped words from their surrounding context helps it learn semantic relationships between words. The process can also work in reverse: the skipped word can be used to predict its context words. […] The skip-gram model uses cross-entropy as its loss function. […] Once training is complete, the output layer is discarded. The embedding layer then serves as the new output layer.
Word2vec is just one method for learning word embeddings from large, unlabeled text corpora. Other methods, such as GloVe and FastText, offer alternative approaches, focusing on capturing global co-occurrence statistics or subword information to create more robust embeddings.
Using word embeddings to represent text offers clear advantages over bag of words. One advantage is dimensionality reduction, which compresses the word representation from the size of the vocabulary (as in one-hot encoding) to a small vector, typically between 100 and 1000 dimensions. Semantic similarity is another advantage of word embeddings. Words with similar meanings are mapped to vectors that are close to each other in the embedding space.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
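A sketch of training skip-gram word2vec with gensim (the toy corpus is far too small to produce meaningful embeddings; it only shows the API shape):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects the skip-gram architecture; vector_size sets the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
print(model.wv["cat"][:5])            # first few components of the learned vector
print(model.wv.most_similar("cat"))   # nearest words by cosine similarity
```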
What is Cosine Similarity?
Cosine similarity is a widely used similarity metric that determines how similar two data points are based on the direction they point rather than their length or size. It is especially effective in high-dimensional spaces where traditional distance-based metrics can struggle.
Computing cosine similarity requires measuring the cosine of the angle (theta) between two non-zero vectors in an inner product space. This measurement produces a cosine similarity score. Cosine similarity values range from -1 to 1:
- A cosine similarity score of 1 indicates that the vectors are pointing in the exact same direction.
- A cosine similarity score of 0 indicates that the vectors are orthogonal, meaning they have no directional similarity.
- A cosine similarity score of -1 indicates that the vectors point in exactly opposite directions.
Think of it like comparing arrows: if they’re pointing in the same direction, they are highly similar. Those at right angles are unrelated, and arrows pointing in opposite directions are dissimilar.
$\mathrm{cs} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$, where $\|A\|$ is the magnitude (length) of vector $A$ and $\|B\|$ is the magnitude of vector $B$. In a 2D space, if vector $A=(x,y)$, its magnitude is $\|A\|=\sqrt{x^{2}+y^{2}}$.
What is cosine similarity?; IBM (2025)
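A minimal NumPy sketch of the formula above:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])      # same direction, twice the length
print(cosine_similarity(a, b))     # 1.0
print(cosine_similarity(a, -b))    # -1.0
```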
Miscellaneous
How is Moore’s Law applied to data?
Since the early ’80s, processor speed has increased from 10 MHz to 3.6 GHz — an increase of 360x (not counting increases in word length and number of cores). But we’ve seen much bigger increases in storage capacity, on every level. RAM has moved from \$1,000/MB to roughly \$25/GB — a price reduction of about 40,000x, to say nothing of the reduction in size and increase in speed. Hitachi made the first gigabyte disk drives in 1982, weighing in at roughly 250 pounds; now terabyte drives are consumer equipment, and a 32 GB microSD card weighs about half a gram. Whether you look at bits per gram, bits per dollar, or raw capacity, storage has more than kept pace with the increase of CPU speed. The importance of Moore’s law as applied to data isn’t just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put into it. […] Increased storage capacity demands increased sophistication in the analysis and use of that data. That’s the foundation of data science.
What Is Data Science?; Mike Loukides (2010)
Gordon Moore (a cofounder of Intel) observed that the number of transistors in computer chips doubles roughly every two years. More transistors per chip translates to faster speeds in computer processors and more random access memory in computers, which leads to more powerful computers. This extraordinary rate of technological improvement - output doubling every two years - is likely the fastest growth in technology humanity has ever seen. Yet, since 2011, the amount of sequencing data stored in the Short Read Archive has outpaced even this incredible growth, having doubled every year.
Bioinformatics Data Skills; Vince Buffalo (2015)
Should I use R programming language for Data Science?
R is considered one of the best programming languages for data science. It is a programming language and environment for graphics and statistical computing. It is domain-specific and of excellent quality. R comprises open source packages for statistical and quantitative applications, including advanced plotting, nonlinear regression, neural networks, phylogenetics, and more. Data scientists and data miners use R extensively for analyzing data.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
What is the MapReduce approach?
Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the “map” stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single reduce task. In hindsight, MapReduce seems like an obvious solution to Google’s biggest problem, creating large searches. It’s easy to distribute a search across thousands of processors, and then combine the results into a single set of answers. What’s less obvious is that MapReduce has proven to be widely applicable to many large data problems, ranging from search to machine learning. The most popular open source implementation of MapReduce is the Hadoop project.
What Is Data Science?; Mike Loukides (2010)
What is Hadoop?
Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it’s the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it. Hadoop has been instrumental in enabling “agile” data analysis. […] Hadoop (and particularly Elastic MapReduce) make it easy to build clusters that can perform computations on large datasets quickly. Hadoop is essentially a batch system, but Hadoop Online Prototype (HOP) is an experimental project that enables stream processing. Hadoop processes data as it arrives, and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter.
What Is Data Science?; Mike Loukides (2010)
The Hadoop platform was designed to solve problems where you have a lot of data - perhaps a mixture of complex and structured data - and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting.
The Modern Data Warehouse: A New Approach for a New Era; Tom Traubitz (2018)
What are the 5 V’s of Big Data?
Volume, velocity, variety, value and veracity.