Summary
- Science and Data
- Key Terms and Areas
- Data Storage
- Statistics and Data Analysis
- Causal Inference
- Feature Engineering
- Machine Learning
- Time Series
- Generative AI
- Miscellaneous
Science and Data
What is the difference between Art and Science?
Science is knowledge which we understand so well that we can teach it to a computer.
Computer Programming as an Art; Donald Knuth (1974)
What is the difference between Data, Information and Knowledge?
Data are different symbols and characters whose meaning only becomes clear when they connect with context. Collecting and measuring observations generates data. Usually machines send, receive and process data. The confusion between data and information often arises because information is made out of data. Data reaches a more complex level and becomes information by integrating it into a context. Information provides expertise about facts or persons. Knowledge thus describes the collected information that is available about a particular fact or a person. The knowledge of this situation makes it possible to make informed decisions and solve problems. Thus, knowledge influences the thinking and actions of people. Machines can also make decisions based on new knowledge generated by information. In order to gain knowledge, it is necessary to process information.
What is the difference between data, information and knowledge?; Sebastian Pierper (2017)
What is Data Science?
Multi-disciplinary field that brings together concepts from computer science, statistics/machine learning, and data analysis to understand and extract insights from the ever-increasing amounts of data. Two paradigms of data research: 1. Hypothesis-Driven (given a problem, what kind of data do we need to help solve it?); 2. Data-Driven (given some data, what interesting problems can be solved with it?). The heart of data science is to always ask questions. Always be curious about the world: 1. What can we learn from this data?; 2. What actions can we take once we find whatever it is we are looking for?
Data Science Cheatsheet; Maverick Lin (2018)
Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid.
What differentiates data science from statistics is that data science is a holistic approach. We’re increasingly finding data in the wild, and data scientists are involved with gathering data, massaging it into a tractable form, making it tell its story, and presenting that story to others.
What Is Data Science?; Mike Loukides (2010)
Data Science is a mix of traditional data analysis techniques with advanced algorithms for handling considerable amounts of data. It has likewise opened the way to finding new sorts of data.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
What does a Data Scientist do?
A data scientist is somebody who decodes large amounts of data and extracts meaning from it to support an association or organization in improving its activities. They utilize various tools, methodologies, statistics, systems, algorithms, etc. to examine data further. […] The job of the data scientist fundamentally is to search and read the data, process and represent it, and bring out the sense of the data for practical use. [Besides that,] in order to check the present status of an organization, or where it stands, a Business [Intelligence] Analyst utilizes data, searches for patterns, business trends and relationships, and comes up with a visualization and report. […] A Machine Learning Engineer works with various algorithms related to machine learning, like clustering, decision trees, classification, random forests, etc.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
How is Data Science applied to Business?
Data science gives financial firms an immense chance to reinvent the business. In finance, the uses of data science include automating risk management, predictive analytics, managing client data, fraud detection, real-time analytics, algorithmic trading and consumer analytics. […] Data Science helps in understanding different patterns and furthermore helps in making choices concerning promotion and advertising, so the products can reach the clients and, in the long run, increase the income of the organization.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
Key Terms and Areas
What is Artificial Intelligence?
The study of making computers do things that humans need intelligence to do. […] Classes of problems requiring intelligence include inference based on knowledge, reasoning with uncertain or incomplete information, various forms of perception and learning, and applications to problems such as control, prediction, classification and optimization.
Fundamentals of the New Artificial Intelligence; Toshinori Munakata (2008)
Artificial Intelligence involves using methods based on the intelligent behavior of humans and other animals to solve complex problems.
Artificial Intelligence: Illuminated; Ben Coppin (2004)
What is Machine Learning?
Machine learning can be understood as an application of AI. (It) was born from pattern recognition and the theory that computers can learn without being programmed to perform specific tasks. This includes techniques such as Bayesian methods; neural networks; inductive logic programming; explanation-based learning; natural language processing; decision trees; and reinforcement learning. […] Systems that have hard-coded knowledge bases will typically experience difficulties in new environments. Certain difficulties can be overcome by a system that can acquire its own knowledge. This capability is known as machine learning.
Machine Learning and AI for Healthcare; Arjun Panesar (2019)
What is Data Analysis?
Data analysis is an art [apart from Data Science]. It is not something yet that we can teach to a computer. Data analysts have many tools at their disposal, from linear regression to classification trees and even deep learning, and these tools have all been carefully taught to computers. But ultimately, a data analyst must find a way to assemble all of the tools and apply them to data to answer a relevant question—a question of interest to people. […] While a study includes developing and executing a plan for collecting data, a data analysis presumes the data have already been collected. More specifically, a study includes the development of a hypothesis or question, the designing of the data collection process (or study protocol), the collection of the data, and the analysis and interpretation of the data. Because a data analysis presumes that the data have already been collected, it includes development and refinement of a question and the process of analyzing and interpreting the data.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is Inference?
Inference is one of many possible goals in data analysis. […] In general, the goal of inference is to be able to make a statement about something that is not observed, and ideally to be able to characterize any uncertainty you have about that statement. Inference is difficult because of the difference between what you are able to observe and what you ultimately want to know. […] The language of inference can change depending on the application, but most commonly, we refer to the things we cannot observe (but want to know about) as the population or as features of the population and the data that we observe as the sample. The goal is to use the sample to somehow make a statement about the population.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is Knowledge Discovery from Data (KDD)?
It is an iterative sequence of the following steps: 1. Data cleaning (to remove noise and inconsistent data); 2. Data integration (where multiple data sources may be combined); 3. Data selection (where data relevant to the analysis task are retrieved from the database); 4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations); 5. Data mining (an essential process where intelligent methods are applied to extract data patterns); 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge based on interestingness measures); 7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users).
Data Mining; Jiawei Han, Micheline Kamber and Jian Pei (2012)
What is Feature Engineering?
Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to strengthen the performance of machine learning models. Feature engineering can be considered as applied machine learning itself.
The “Generic” Data Science Life-Cycle; Sivakar Siva (2020)
Outlier detection, one-hot encoding and handling missing data are a few basic examples of Feature Engineering.
What is feature engineering; CodeBasics (2020)
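A minimal sketch of those three basic techniques with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with a missing value and an extreme income
df = pd.DataFrame({
    "income": [42_000, 55_000, None, 61_000, 1_000_000],
    "city": ["Lisbon", "Porto", "Lisbon", None, "Faro"],
})

# Handling missing data: impute the median for numeric, a constant for categorical
df["income"] = df["income"].fillna(df["income"].median())
df["city"] = df["city"].fillna("unknown")

# Outlier detection: flag values more than 1.5 * IQR beyond the quartiles
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
df["income_outlier"] = ~df["income"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# One-hot encoding: turn the categorical column into binary indicator columns
df = pd.get_dummies(df, columns=["city"], prefix="city")
print(df)
```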
What is a Statistical Model?
In a very general sense, a model is something we construct to help us understand the real world. […] The process of building a model involves imposing a specific structure on the data and creating a summary of the data. […] A statistical model serves two key purposes in a data analysis, which are to provide a quantitative summary of your data and to impose a specific structure on the population from which the data were sampled. […] At its core, a statistical model provides a description of how the world works and how the data were generated. The model is essentially an expectation of the relationships between various factors in the real world and in your dataset. What makes a model a statistical model is that it allows for some randomness in generating the data.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
Statistical model, which is a formal representation of the relationships between variables, that we can use to provide the desired explanations or predictions.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is a Statistic?
The first key element of a statistical model is data reduction. The basic idea is you want to take the original set of numbers consisting of your dataset and transform them into a smaller set of numbers. […] The process of data reduction typically ends up with a statistic. Generally speaking, a statistic is any summary of the data. The sample mean, or average, is a statistic. So is the median, the standard deviation, the maximum, the minimum, and the range. Some statistics are more or less useful than others but they are all summaries of the data.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is Big Data?
Big data is when the size of the data itself becomes part of the problem. We’re discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.
What Is Data Science?; Mike Loukides (2010)
That is a considerable amount of data, so much data that it became challenging to deal with using conventional technologies. For this reason, we called it Big Data.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
What is Data Warehouse?
A “data warehouse” may be a basic operational reporting environment built from a single transactional system or it may be a cutting-edge solution uniting transactional, machine and social data to support deep and complex analysis in real time. It may provide information for daily (or monthly, or quarterly) reports or it may feed complex analysis into live business processes several times a second. […] Vincent Rainardi outlines five basic requirements that a data warehouse typically must meet: 1. An integrated view of the organization’s data for strategic analysis; 2. A consistent view of the organization’s data resources with data that has been cleared of anomalies which can lead to a false impression of the business’ function; 3. A consolidation of the organization’s data history beyond what is retained by current operations for deep analysis of the business’ functions over time; 4. A tested and verified environment for doing data analysis to access data so that each new draw of data doesn’t become a “science experiment” in and of itself; 5. A high-performance environment for doing data analysis that does not interfere with day-to-day activity of the business.
The Modern Data Warehouse: A New Approach for a New Era; Tom Traubitz (2018)
What is ETL?
The step where the data is pulled, processed and loaded into a data warehouse is generally done through an ETL pipeline. ETL stands for Extract, Transform and Load. ETL is a three-step process: (1) Extracting data from single or multiple data sources. (2) Transforming data as per business logic. Transformation is in itself a two-step process: data cleansing and data manipulation. (3) Loading the previously transformed data into the target data source or data warehouse.
The “Generic” Data Science Life-Cycle; Sivakar Siva (2020)
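A toy sketch of the three steps with pandas and SQLite; the file, column and table names are invented for illustration:

```python
import sqlite3
import pandas as pd

# Extract: read from one or more data sources (here, a hypothetical CSV export)
raw = pd.read_csv("sales_export.csv")

# Transform: cleanse (drop duplicates, fix types) and manipulate (aggregate per business logic)
raw = raw.drop_duplicates()
raw["order_date"] = pd.to_datetime(raw["order_date"])
daily = raw.groupby(raw["order_date"].dt.date)["amount"].sum().reset_index(name="revenue")

# Load: write the transformed data into the target table of the warehouse
with sqlite3.connect("warehouse.db") as conn:
    daily.to_sql("daily_revenue", conn, if_exists="replace", index=False)
```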
What is Bioinformatics?
Medical Science: In Bioinformatics, Data Science alongside genome data is helping scientists and specialists to examine genetic structures and see how specific medications act on diseases.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
Bioinformaticians are concerned with deriving biological understanding from large amounts of data with specialized skills and tools. Early in biology’s history, the datasets were small and manageable. […] However, this is all rapidly changing. Large sequencing datasets are widespread, and will only become more common in the future. Analyzing this data takes different tools, new skills, and many computers with large amounts of memory, processing power, and disk space.
Bioinformatics Data Skills; Vince Buffalo (2015)
Data Storage
What is Data Serialization?
The process of converting a data structure or object state into a format that can be stored or transmitted and reconstructed later. There are many, many data serialization formats. When considering a format to work with, you might want to consider different characteristics such as human readability, access patterns, and whether it’s based on text or binary, which influences the size of its files. Some examples are JSON, CSV and Parquet.
Designing Machine Learning Systems; Chip Huyen (2021)
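For instance, the same small pandas DataFrame can be written to the three formats mentioned; JSON and CSV are text-based and human-readable, while Parquet is binary and column-oriented, so it is typically smaller and faster to scan (writing it assumes an engine such as pyarrow is installed):

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "score": [0.91, 0.34, 0.57]})

df.to_csv("scores.csv", index=False)         # text, row-oriented, human-readable
df.to_json("scores.json", orient="records")  # text, supports nested structures
df.to_parquet("scores.parquet")              # binary, column-oriented, compressed

same = pd.read_parquet("scores.parquet")     # reconstruct the object later
```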
How do we Store Data for Analytics?
Relational databases are designed for consistency, to support complex transactions that can easily be rolled back if any one of a complex set of operations fails. While rock-solid consistency is crucial to many applications, it’s not really necessary for the kind of analysis we’re discussing here. […] Precision has an allure, but in most data-driven applications outside of finance, that allure is deceptive. Most data analysis is comparative […]. To store huge datasets effectively, we’ve seen a new breed of databases appear. These are frequently called NoSQL databases, or Non-Relational databases […]. They group together fundamentally dissimilar products by telling you what they aren’t. Many of these databases are […] designed to be distributed across many nodes, to provide “eventual consistency” but not absolute consistency, and to have very flexible schema.
What Is Data Science?; Mike Loukides (2010)
What are the Challenges of the modern Data Warehouse?
Today your business faces an unprecedented set of challenges. Bigger data volumes. New data types. A deluge of machine data from the Internet of Things. Digital business models that require real-time performance all the time drive the need for zero-latency reporting. Data-driven businesses need more complex, more extensive, and yet paradoxically faster and more easily accessed analytics. To succeed, you need a deeper understanding of the Why behind what your customers, your competitors, and the market as a whole are up to. These are the challenges of the modern Data Warehouse. And to meet these challenges, you need something more than just a database.
The Modern Data Warehouse: A New Approach for a New Era; Tom Traubitz (2018)
Statistics and Data Analysis
What are the different types of data?
Nominal, Ordinal, Interval and Ratio.
There are two basic types of structured data: numeric and categorical. Numeric data comes in two forms: continuous, such as wind speed or time duration, and discrete, such as the count of the occurrence of an event. Categorical data takes only a fixed set of values, such as a type of TV screen (plasma, LCD, LED, etc.) or a state name (Alabama, Alaska, etc.). Binary data is an important special case of categorical data that takes on only one of two values, such as 0/1, yes/no, or true/false. Another useful type of categorical data is ordinal data in which the categories are ordered; an example of this is a numerical rating (1, 2, 3, 4, or 5).
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is a Population?
A population can be thought of as a physical group of individuals, but also as the provider of the probability distribution for a random observation drawn from that population. Populations can be summarized through parameters that mirror the statistical synthesis of the sample data.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What are the types of Population?
There are three types of populations from which a sample can be drawn: (i) Literal population: This is an identifiable group, such as when we randomly select a person to conduct a survey. Or it can be a group of individuals that can be measured, and, although we do not randomly pick one of them, we have volunteer data; (ii) Virtual population: We usually take measurements using some instrument, for example by measuring someone’s blood pressure or air pollution. We know it is always possible to take new measurements and obtain slightly different responses. The closeness of multiple readings depends on the accuracy of the instrument and the stability of the circumstances — we could think of this as extracting observations from a virtual population of all the measurements that could be taken if we had enough time; (iii) Metaphoric population: When there is no larger population. This is an unusual concept. Here we act as if the data were randomly drawn from some population, when it is clear that this is not the case—for example, the children who undergo heart surgery: we did no sampling, we have all the data, and there is nothing more to collect. Think of the number of murders that occur each year, the exam results for a specific class, or the data on all the countries in the world—none of these can be considered a sample from a real population.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is Random Sampling?
Random sampling is a process in which each available member of the population being sampled has an equal chance of being chosen for the sample at each draw. The sample that results is called a simple random sample. Sampling can be done with replacement, in which observations are put back in the population after each draw for possible future reselection. Or it can be done without replacement, in which case observations, once selected, are unavailable for future draws. Data quality often matters more than data quantity when making an estimate or a model based on a sample. Data quality in data science involves completeness, consistency of format, cleanliness, and accuracy of individual data points. Statistics adds the notion of representativeness.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
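A small sketch of simple random sampling with and without replacement using numpy:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
population = np.arange(1, 101)   # a toy population of 100 members

# Without replacement: each member can be drawn at most once
sample_no_repl = rng.choice(population, size=10, replace=False)

# With replacement: members are put back and may be selected again
sample_repl = rng.choice(population, size=10, replace=True)

print(sample_no_repl)
print(sample_repl)
```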
What are Percentiles and Quantiles?
To avoid the sensitivity to outliers, we can look at the range of the data after dropping values from each end. Formally, these types of estimates are based on differences between percentiles. In a data set, the Pth percentile is a value such that at least P percent of the values take on this value or less and at least (100 – P) percent of the values take on this value or more. For example, to find the 80th percentile, sort the data. Then, starting with the smallest value, proceed 80 percent of the way to the largest value. Note that the median is the same thing as the 50th percentile. The percentile is essentially the same as a quantile, with quantiles indexed by fractions (so the .8 quantile is the same as the 80th percentile).
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Interquartile Range (IQR)?
A common measurement of variability is the difference between the 25th percentile and the 75th percentile, called the interquartile range (or IQR). Here is a simple example: {3,1,5,3,6,7,2,9}. We sort these to get {1,2,3,3,5,6,7,9}. The 25th percentile is at 2.5, and the 75th percentile is at 6.5, so the interquartile range is 6.5 – 2.5 = 4.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
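The worked example can be reproduced with numpy; percentile definitions vary by interpolation rule, and `method="midpoint"` matches the hand calculation above (older numpy versions call this argument `interpolation`):

```python
import numpy as np

x = np.array([3, 1, 5, 3, 6, 7, 2, 9])

q25 = np.percentile(x, 25, method="midpoint")   # 2.5
q75 = np.percentile(x, 75, method="midpoint")   # 6.5
print(q75 - q25)                                # interquartile range: 4.0
```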
What are the Estimates of Location?
Mean, Weighted Mean, Median, Percentile, Weighted Median and Trimmed Mean (the average of all values after dropping a fixed number of extreme values).
The basic metric for location is the mean, but it can be sensitive to extreme values (outlier). Other metrics (median, trimmed mean) are less sensitive to outliers and unusual distributions and hence are more robust.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
Mean: The sum of the numbers divided by the number of occurrences; Median: The middle value when the numbers are arranged in order; Mode: The number that appears most often.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
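These estimates are one-liners in Python; in the sketch below the trimming fraction and the weights are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 3, 5, 6, 7, 95])    # 95 is an outlier

print(np.mean(x))                # 15.25, pulled up by the outlier
print(np.median(x))              # 4.0, robust to the outlier
print(stats.trim_mean(x, 0.2))   # trimmed mean: here drops one value from each end
print(np.average(x, weights=[1, 1, 1, 1, 1, 1, 1, 0.1]))   # weighted mean, downweighting 95
```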
What are the Estimates of Variability?
Deviations (the difference between the observed values and the estimate of location), Variance, Standard Deviation (the square root of the variance), Mean Absolute Deviation, Median Absolute Deviation (the median of the absolute values of deviations from the median), Range, Percentile and Interquartile Range.
Variance and standard deviation are the most widespread and routinely reported statistics of variability. Both are sensitive to outliers. More robust metrics include mean absolute deviation, median absolute deviation from the median, and percentiles (quantiles).
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
Range, Interquartile Range and Standard Deviation.
The standard deviation is a widely used measure of dispersion. It is the most complex from a technical point of view and appropriate only for well-behaved symmetric data, since it is also unduly influenced by very discrepant values. The Gini index is a measure of dispersion used for highly skewed data, such as income, and is widely used to measure inequality, but it has a complex and not very intuitive form. The square of the standard deviation is known as variance: difficult to interpret directly, but nevertheless of great mathematical utility.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
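A sketch of the main variability estimates on the same toy data as above; the median absolute deviation comes from scipy:

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 3, 5, 6, 7, 95])

print(np.var(x, ddof=1))                  # sample variance (n - 1 in the denominator)
print(np.std(x, ddof=1))                  # standard deviation, inflated by the outlier
print(np.mean(np.abs(x - np.mean(x))))    # mean absolute deviation
print(stats.median_abs_deviation(x))      # median absolute deviation, robust
print(np.ptp(x))                          # range (max - min)
```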
What are the Key Terms for Exploring the Distribution?
Boxplot, Frequency Table, Histogram and Density Plot (a smoothed version of the histogram, often based on a kernel density estimate).
A frequency histogram plots frequency counts on the y-axis and variable values on the x-axis; it gives a sense of the distribution of the data at a glance. A frequency table is a tabular version of the frequency counts found in a histogram. A boxplot - with the top and bottom of the box at the 75th and 25th percentiles, respectively - also gives a quick sense of the distribution of the data; it is often used in side-by-side displays to compare distributions. A density plot is a smoothed version of a histogram; it requires a function to estimate a plot based on the data (multiple estimates are possible, of course).
Both frequency tables and percentiles summarize the data by creating bins. In general, quartiles and deciles will have the same count in each bin (equal-count bins), but the bin sizes will be different. The frequency table, by contrast, will have different counts in the bins (equal-size bins), and the bin sizes will be the same.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
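The equal-size versus equal-count distinction is easy to see with pandas: `pd.cut` builds equal-width bins with unequal counts (a frequency table), while `pd.qcut` builds quantile bins with roughly equal counts but unequal widths. A sketch on simulated skewed data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10, size=1_000))

print(pd.cut(x, bins=5).value_counts().sort_index())   # equal-width bins, unequal counts
print(pd.qcut(x, q=5).value_counts().sort_index())     # quintile bins, roughly equal counts
```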
What are the Normal and Gamma Distributions?
[Normal model] says that the randomness in a set of data can be explained by the Normal distribution, or a bell-shaped curve. The Normal distribution is fully specified by two parameters — the mean and the standard deviation. [The Gamma distribution] has the feature that it only allows positive values, so it eliminates the problem we had with negative values with the Normal distribution.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
The Normal distribution is also referred to as Gaussian distribution.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is the t-Distribution?
The t-distribution is actually a family of distributions resembling the normal distribution but with thicker tails. The t-distribution is widely used as a reference basis for the distribution of sample means, differences between two sample means, regression parameters, and more.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is z-Score?
The result of standardizing an individual data point. Standardize means subtracting the mean and dividing by the standard deviation.
A standard normal distribution is one in which the units on the x-axis are expressed in terms of standard deviations away from the mean. To compare data to a standard normal distribution, you subtract the mean and then divide by the standard deviation; this is also called normalization or standardization. The transformed value is termed a z-score, and the normal distribution is sometimes called the z-distribution.
Converting data to z-scores (i.e., standardizing or normalizing the data) does not make the data normally distributed. It just puts the data on the same scale as the standard normal distribution, often for comparison purposes.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
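Standardizing is just subtracting the mean and dividing by the standard deviation; a minimal numpy sketch (scikit-learn's StandardScaler does the same thing column-wise):

```python
import numpy as np

x = np.array([12.0, 15.0, 9.0, 21.0, 18.0])
z = (x - x.mean()) / x.std(ddof=1)    # z-scores: distance from the mean in standard deviations

print(z)
print(z.mean(), z.std(ddof=1))        # approximately 0 and exactly 1 by construction
```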
What is a QQ-Plot?
A QQ-Plot is used to visually determine how close a sample is to a specified distribution - in this case, the normal distribution. The QQ-Plot orders the z-scores from low to high and plots each value’s z-score on the y-axis; the x-axis is the corresponding quantile of a normal distribution for the value’s rank. Since the data is normalized, the units correspond to the number of standard deviations away from the mean. If the points roughly fall on the diagonal line, then the sample distribution can be considered close to normal.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
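A minimal QQ-plot can be drawn with scipy's `probplot`, which plots sample quantiles against the corresponding quantiles of a normal distribution:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0, scale=1, size=200)

stats.probplot(sample, dist="norm", plot=plt)   # points near the diagonal suggest normality
plt.show()
```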
What is the Pearson Correlation Coefficient?
It is convenient to use a single number to summarize a consistent relationship of increase or decrease between pairs of numbers shown in a scatter plot. The number usually chosen for this is the Pearson correlation coefficient. A Pearson correlation lies in the interval between –1 and 1 and expresses how close the points are to a straight line. A correlation of 1 occurs if all points lie on an ascending straight line, while a correlation of –1 is observed when all points lie on a descending straight line. A correlation close to 0 may have to do with a random scatter of points, or any other pattern in which there is no systematic tendency upward or downward.
An alternative measure is Spearman’s rank correlation, which depends only on the ordering of the data, and not on their specific values. Thus, the coefficient can be close to 1 or –1 if the points are near a line that rises or falls consistently, even if it is not a straight line.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
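Both coefficients are available in scipy; Spearman's coefficient is the Pearson correlation of the ranks, so it stays high for monotonic but nonlinear relationships (the data below are simulated to show exactly that):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=200)
y = np.exp(x) + rng.normal(scale=5, size=200)   # increasing but strongly nonlinear

pearson_r, _ = stats.pearsonr(x, y)     # penalized by the curvature
spearman_r, _ = stats.spearmanr(x, y)   # close to 1: the relationship is monotonic
print(pearson_r, spearman_r)
```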
What are the Key Terms for Correlation?
Correlation Coefficient (measures the extent to which numeric variables are associated with one another), Correlation Matrix and Scatterplot.
Variables X and Y (each with measured data) are said to be positively correlated if high values of X go with high values of Y, and low values of X go with low values of Y. If high values of X go with low values of Y, and vice versa, the variables are negatively correlated.
Like the mean and standard deviation, the correlation coefficient is sensitive to outliers in the data. Software packages offer robust alternatives to the classical correlation coefficient. For example, the R package robust uses the function covRob to compute a robust estimate of correlation. The methods in the scikit-learn module sklearn.covariance implement a variety of approaches.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What are the Key Terms for Exploring Two or More Variables?
Scatterplot, Contingency Table, Hexagonal Binning, Contour Plot and Violin Plot.
For plotting numeric vs numeric data, scatterplots are fine when there is a relatively small number of data values. For data sets with hundreds of thousands or millions of records, a scatterplot will be too dense, so we need a different way to visualize the relationship. Heat maps, hexagonal binning, and contour plots all give a visual representation of a two-dimensional density. In this way, they are natural analogs to histograms and density plots.
A useful way to summarize two categorical variables is a contingency table - a table of counts by category.
Boxplots are a simple way to visually compare the distributions of a numeric variable grouped according to a categorical variable. A violin plot, introduced by [Hintze-Nelson-1998], is an enhancement to the boxplot and plots the density estimate with the density on the y-axis. The density is mirrored and flipped over, and the resulting shape is filled in, creating an image resembling a violin. The advantage of a violin plot is that it can show nuances in the distribution that aren’t perceptible in a boxplot. On the other hand, the boxplot more clearly shows the outliers in the data.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
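For two categorical variables, a contingency table is one line with pandas (the columns here are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "browser": ["Chrome", "Firefox", "Chrome", "Safari", "Chrome", "Firefox"],
    "converted": ["yes", "no", "no", "yes", "yes", "no"],
})

print(pd.crosstab(df["browser"], df["converted"]))   # counts for each category pair
```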
What are the Key Terms for Exploring Binary and Categorical Data?
Mode, Expected Value (when the categories can be associated with a numeric value, this gives an average value based on a category’s probability of occurrence), Bar Charts and Pie Charts.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is the difference between the Expected Value and the Mean?
Expected value ($\mathbb{E}[X]$) is a theoretical quantity of a random variable $X$:
$\mathbb{E}[X]=\sum_x x\,P(X=x)\quad\text{or}\quad \int x\,f_X(x)\,dx$
It’s the distribution’s long-run average — i.e., the population mean — when the first moment exists.
Mean can mean two different things: (1) Population mean — the same as $\mathbb{E}[X]$ (if finite); or (2) Sample mean $\bar X=\tfrac{1}{n}\sum_{i=1}^n X_i$ — a statistic computed from data that estimates $\mathbb{E}[X]$. It’s unbiased ($\mathbb{E}[\bar X]=\mathbb{E}[X]$) and, by the Law of Large Numbers, $\bar X \to \mathbb{E}[X]$ as $n$ grows.
When they differ / pitfalls:
- Some distributions (e.g., Cauchy) have no finite expected value; talking about a population “mean” is not meaningful, though you can still compute a sample average (it won’t stabilize).
- People sometimes say “mean” when they mean median or other averages (geometric, harmonic); those are not $\mathbb{E}[X]$.
- In general, $\mathbb{E}[g(X)] \neq g(\mathbb{E}[X])$ for nonlinear $g$.
What is the Law of Large Numbers?
The Law of Large Numbers states that if you sample a random variable independently a large number of times, the measured average should converge to the random variable’s true expectation (mean). This is important in studying the longer-term behavior of random variables over time. As an example, a coin might land on heads 5 times in a row, but over a much larger n we would expect the proportion of heads to be approximately half of the total flips. Similarly, a casino might experience a loss on any individual game, but over the long run should see a predictable profit over time.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
The variability of the observed proportion decreases as the sample size increases — this is the law of large numbers.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
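The coin-flip example can be simulated directly; the running proportion of heads drifts toward 0.5 as the number of flips grows (a sketch, not a proof):

```python
import numpy as np

rng = np.random.default_rng(3)
flips = rng.integers(0, 2, size=100_000)            # 0 = tails, 1 = heads
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)

print(running_mean[[9, 99, 999, 9_999, 99_999]])    # proportion after 10, 100, ... flips
```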
What is the Central Limit Theorem?
The Central Limit Theorem (CLT) states that if you repeatedly sample a random variable a large number of times, the distribution of the sample mean will approach a normal distribution regardless of the initial distribution of the random variable. The CLT provides the basis for much of hypothesis testing. At a very basic level, you can consider the implications of this theorem on coin flipping: the probability of getting some number of heads flipped over a large n should be approximately that of a normal distribution. Whenever you’re asked to reason about any particular distribution over a large sample size, you should remember to think of the CLT, regardless of whether it is Binomial, Poisson, or any other distribution.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
[…] But it is not only the binomial distribution that tends toward a normal curve as the sample size increases — it is a remarkable fact that, whatever the shape of the population distribution from which each of the original measurements is drawn, for large sample sizes their mean can be regarded as having been drawn from a normal curve. This will have a mean equal to the mean of the original distribution, and a standard deviation in a simple relation to the standard deviation of the original population distribution.
The central limit theorem implies that sample means and other statistical summaries will have an approximately normal distribution, for large samples.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
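A quick simulation: even though individual draws come from a strongly skewed exponential distribution, the means of repeated samples are approximately normal (sample size and number of repetitions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
# 10,000 samples of size 50 from a skewed (exponential) population with mean 1
sample_means = rng.exponential(scale=1.0, size=(10_000, 50)).mean(axis=1)

print(sample_means.mean())   # close to the population mean, 1.0
print(sample_means.std())    # close to 1.0 / sqrt(50), about 0.14
```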
What is Standard Error?
The standard error is a single metric that sums up the variability in the sampling distribution for a statistic. The standard error can be estimated using a statistic based on the standard deviation s of the sample values, and the sample size n. As the sample size increases, the standard error decreases. The relationship between standard error and sample size is sometimes referred to as the square root of n rule: to reduce the standard error by a factor of 2, the sample size must be increased by a factor of 4.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
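A sketch of the usual estimate $s/\sqrt{n}$ and the square root of n rule, using a simulated population:

```python
import numpy as np

rng = np.random.default_rng(5)
population = rng.normal(loc=100, scale=15, size=1_000_000)

for n in (100, 400):    # quadrupling n should roughly halve the standard error
    sample = rng.choice(population, size=n, replace=False)
    se = sample.std(ddof=1) / np.sqrt(n)
    print(n, round(se, 2))
```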
What is Alpha (significance level)?
Statisticians frown on the practice of leaving it to the researcher’s discretion to determine whether a result is “too unusual” to happen by chance. Rather, a threshold is specified in advance, as in “more extreme than 5% of the chance (null hypothesis) results”; this threshold is known as alpha. Typical alpha levels are 5% and 1%. Any chosen level is an arbitrary decision—there is nothing about the process that will guarantee correct decisions x% of the time. This is because the probability question being answered is not “What is the probability that this happened by chance?” but rather “Given a chance model, what is the probability of a result this extreme?” We then deduce backward about the appropriateness of the chance model, but that judgment does not carry a probability. This point has been the subject of much confusion.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What are Confidence Intervals?
Confidence intervals are a way to place uncertainty around our estimates. The smaller the sample size, the larger the standard error, and the wider the confidence interval.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
Confidence intervals are generally used in statistics to give a range of values within which we are confident that a parameter lies, and they help us understand the behavior of a certain dataset. A confidence interval is a range of values that a parameter is expected to fall within. […] It is typically expressed as the number of standard errors that the statistic lies above and below the estimate; the standard error is, in effect, the standard deviation of the statistic itself (for the sample mean, the sample standard deviation divided by the square root of the sample size). Conventions vary by which data scientist you talk to, but generally, a 95% confidence interval means that you are 95% sure that the parameter falls in that range.
Confidence intervals are a part of Data Science; Rijul Singh Malik (2022)
Confidence intervals always come with a coverage level, expressed as a (high) percentage, say 90% or 95%. One way to think of a 90% confidence interval is as follows: it is the interval that encloses the central 90% of the bootstrap sampling distribution of a sample statistic. More generally, an x% confidence interval around a sample estimate should, on average, contain similar sample estimates x% of the time (when a similar sampling procedure is followed). […] The percentage associated with the confidence interval is termed the level of confidence. The higher the level of confidence, the wider the interval. Also, the smaller the sample, the wider the interval (i.e., the greater the uncertainty). Both make sense: the more confident you want to be, and the less data you have, the wider you must make the confidence interval to be sufficiently assured of capturing the true value.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
A confidence interval is the range of population parameters for which our observed statistic is a plausible consequence. A simple practical rule is that if you are estimating the percentage of people who prefer, say, coffee instead of tea, from a random sample of the population, then your margin of error is roughly plus or minus 100 divided by the square root of the sample size. Thus, for a survey with 1,000 people (the industry standard), the margin of error is generally mentioned as ±3%. So if 400 said they prefer coffee and 600 said they prefer tea, then it is possible to estimate roughly that the percentage preferring coffee in the population is about 40% ± 3%, or between 37% and 43%. A 95% confidence interval is the result of a procedure that, if anchored in correct assumptions, contains the true value of the parameter 95% of the time. One cannot say that a specific interval has a 95% probability of containing the true value, but only that the procedure yields such intervals with that frequency.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
This is a range of values constructed so that, over repeated sampling, it would contain the parameter value of interest $100(1 - \alpha)\%$ of the time. For instance, a 95% confidence interval would contain the true value 95% of the time. If 0 is included in the confidence interval (for a difference or effect estimate), then we cannot reject the null hypothesis (and vice versa).
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
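A sketch of a textbook 95% confidence interval for a mean using the t-distribution (the data are simulated; scipy's `t.interval` supplies the critical value):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
sample = rng.normal(loc=50, scale=10, size=40)

mean = sample.mean()
se = stats.sem(sample)    # standard error of the mean: s / sqrt(n)
ci = stats.t.interval(0.95, len(sample) - 1, loc=mean, scale=se)
print(mean, ci)           # roughly the mean plus or minus 2 standard errors
```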
What is the relationship between Standard Error and Variables’ Variance?
The standard error is inversely proportional to the variance of the variable. This means that, if the variable doesn’t change much, it will be hard to estimate its effect on the outcome. This also makes intuitive sense. Take it to the extreme and pretend you want to estimate the effect of a drug, so you conduct a test with 10000 individuals but only 1 of them gets the treatment. This will make finding the ATE very hard; we will have to rely on comparing a single individual with everyone else. Another way to say this is that we need lots of variability in the treatment to make it easier to find its impact.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the Null Hypothesis?
Hypothesis tests use the following logic: “Given the human tendency to react to unusual but random behavior and interpret it as something meaningful and real, in our experiments we will require proof that the difference between groups is more extreme than what chance might reasonably produce.” This involves a baseline assumption that the treatments are equivalent, and any difference between the groups is due to chance. This baseline assumption is termed the null hypothesis. Our hope, then, is that we can in fact prove the null hypothesis wrong and show that the outcomes for groups A and B are more different than what chance might produce.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
The null hypothesis is what we are willing to assume happens until proven otherwise. It is relentlessly negative, denying all progress and change. The null hypothesis is never proved or established, but it can be refuted in the course of experimentation. One can say that every experiment exists only to give the facts a chance to refute the null hypothesis. A defendant can be considered guilty, but no one is ever considered innocent, there is simply no proof of guilt. In the same way, we may reject the null hypothesis, but if we do not have sufficient evidence to do so, this does not mean we can accept it as true. It is only a working premise until something better appears.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
The process of testing whether or not a sample of data supports a particular hypothesis is called hypothesis testing. Generally, hypotheses concern particular properties of interest for a given population (such as its parameters), like, for example, the mean conversion rate among a set of users. The steps in testing a hypothesis are as follows:
- State a null hypothesis and an alternative hypothesis: Either the null hypothesis will be rejected (in favor of the alternative hypothesis), or it will fail to be rejected (although failing to reject the null hypothesis does not necessarily mean it is true, but rather that there is not sufficient evidence to reject it).
- Use a particular test statistic of the null hypothesis to calculate the corresponding p-value.
- Compare the p-value to a certain significance level ($\alpha$).
Since the null hypothesis typically represents a baseline (e.g., “the marketing campaign did not increase conversion rates,” etc.), the goal is to reject the null hypothesis with statistical significance and show that there’s a significant outcome.
Hypothesis tests are either one-tailed or two-tailed tests. A one-tailed test has the following types of null and alternative hypotheses:
- $H_0 : \mu = \mu_0$ versus $H_1 : \mu < \mu_0$, or
- $H_0 : \mu = \mu_0$ versus $H_1 : \mu > \mu_0$
whereas a two-tailed test has these types:
- $H_0 : \mu = \mu_0$ versus $H_1 : \mu \neq \mu_0$
where $\mu$ is the parameter of interest and $\mu_0$ is its hypothesized value under the null hypothesis $H_0$ (with $H_1$ denoting the alternative hypothesis).
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
What is p-Value?
Like with confidence intervals (and most frequentist statistics, as a matter of fact), the true definition of p-values can be very confusing. So, to not take any risks, I’ll copy the definition from Wikipedia: “the p-value is the probability of obtaining test results at least as extreme as the results actually observed during the test, assuming that the null hypothesis is correct”. To put it more succinctly, the p-value is the probability of seeing such data, given that the null hypothesis is true. It measures how unlikely it is that you are seeing a measurement if the null hypothesis is true. Naturally, this often gets confused with the probability of the null hypothesis being true.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
Given a chance model that embodies the null hypothesis, the p-value is the probability of obtaining results as unusual or extreme as the observed results.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
A p-value is the probability of obtaining a result at least as extreme as the one we obtained, if the null hypothesis (and all other modeling assumptions) were really true.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
Put simply, a p-value is the probability of observing the value of the calculated test statistic under the null hypothesis assumptions. Usually, the p-value is assessed relative to some predetermined level of significance (0.05 is often chosen).
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
What is Statistical Significance?
The idea of statistical significance is straightforward: if a p-value is sufficiently small, then we say the results are statistically significant.
To perform a statistical significance test, follow these steps: (1) Define a question in terms of a null hypothesis we want to test; (2) Generate a sampling distribution of this test statistic, where the null hypothesis is true; (3) Verify whether the observed statistic lies in one of the tails of this distribution and summarize this observation through a p-value: the probability, if the null hypothesis is true, of observing such an extreme statistic; (4) It is necessary to carefully define ‘extreme’—if, for instance, very large values, both positive and negative, of the test statistic are considered incompatible with the null hypothesis, then the p-value should account for that; (5) Declare the result with statistical significance if the p-value lies below some critical threshold.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is a Test Statistic?
A test statistic is a numerical summary designed for the purpose of determining whether the null hypothesis or the alternative hypothesis should be accepted as correct. More specifically, it assumes that the parameter of interest follows a particular sampling distribution under the null hypothesis. For example, the number of heads in a series of coin flips may be distributed as a binomial distribution, but with a large enough sample size, the sampling distribution should be approximately normally distributed. Hence, the sampling distribution for the total number of heads in a large series of coin flips would be considered normally distributed.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
What are Degrees of Freedom?
In the documentation and settings for many statistical tests and probability distributions, you will see a reference to “degrees of freedom.” The concept is applied to statistics calculated from sample data, and refers to the number of values free to vary. For example, if you know the mean for a sample of 10 values, there are 9 degrees of freedom (once you know 9 of the sample values, the 10th can be calculated and is not free to vary). The degrees of freedom parameter, as applied to many probability distributions, affects the shape of the distribution. The number of degrees of freedom is an input to many statistical tests. For example, degrees of freedom is the name given to the n – 1 denominator seen in the calculations for variance and standard deviation. Why does it matter? When you use a sample to estimate the variance for a population, you will end up with an estimate that is slightly biased downward if you use n in the denominator. If you use n – 1 in the denominator, the estimate will be free of that bias.
The number of degrees of freedom (d.f.) forms part of the calculation to standardize test statistics so they can be compared to reference distributions (t-distribution, F-distribution, etc.).
Is it important for data science? Not really, at least in the context of significance testing. For one thing, formal statistical tests are used only sparingly in data science. For another, the data size is usually large enough that it rarely makes a real difference for a data scientist whether, for example, the denominator has n or n – 1. (As n gets large, the bias that would come from using n in the denominator disappears.) There is one context, though, in which it is relevant: the use of factored variables in regression (including logistic regression). Some regression algorithms choke if exactly redundant predictor variables are present. This most commonly occurs when factoring categorical variables into binary indicators (dummies). Consider the variable “day of week.” Although there are seven days of the week, there are only six degrees of freedom in specifying day of week. For example, once you know that day of week is not Monday through Saturday, you know it must be Sunday. Inclusion of the Mon–Sat indicators thus means that also including Sunday would cause the regression to fail, due to a multicollinearity error.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
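The day-of-week example corresponds to dropping one dummy column when encoding a categorical variable; with pandas this is the `drop_first` flag:

```python
import pandas as pd

days = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"], name="day")

full = pd.get_dummies(days)                      # 7 indicator columns: redundant for regression
reduced = pd.get_dummies(days, drop_first=True)  # 6 columns, matching the 6 degrees of freedom
print(full.shape[1], reduced.shape[1])           # 7 6
```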
What is the Z-test?
Assumes the test statistic follows a normal distribution under the null hypothesis.
Generally, the Z-test is used when the sample size is large (to invoke the Central Limit Theorem) or when the population variance is known. A t-test is used when the sample size is small and when the population variance is unknown. The Z-test for a population mean is formulated as:
$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \sim N(0,1)$
in the case where the population variance $\sigma^2$ is known.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
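A direct implementation of the formula, with made-up numbers and the population standard deviation assumed known:

```python
import numpy as np
from scipy import stats

x_bar, mu_0, sigma, n = 103.2, 100.0, 15.0, 100   # sample mean, null mean, known sd, sample size

z = (x_bar - mu_0) / (sigma / np.sqrt(n))
p_two_sided = 2 * stats.norm.sf(abs(z))           # two-tailed p-value
print(round(z, 2), round(p_two_sided, 3))         # about 2.13 and 0.033
```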
What is the Student’s t-test?
Uses a Student’s t-distribution rather than a normal distribution as the reference distribution for the test statistic.
The t-test is structured similarly to the Z-test but uses the sample variance $s^2$ in place of population variance. The t-test is parameterized by the degrees of freedom, which refer to the number of independent observations in a dataset, denoted below by $n - 1$:
$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \sim t_{n-1}$
where $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
The t-value, also known as the t-test, is an important focus of attention, since it is the link that tells us whether the association between an explanatory variable and the response has statistical significance. The t-value is simply the estimate divided by the standard error, and so it can be interpreted as the distance of the estimate from 0, measured in standard errors. Given a t-value and the sample size, the software can provide an exact p-value; for large samples, t-values greater than 2 or less than –2 correspond to p < 0.05, although these thresholds are higher for smaller sample sizes.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
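With raw data, scipy's one-sample t-test returns both the statistic and its p-value (the measurements below are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
sample = rng.normal(loc=5.4, scale=2.0, size=25)   # hypothetical measurements

t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)   # H0: mu = 5.0
print(round(t_stat, 2), round(p_value, 3))
```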
What is the Chi-Square Test?
Used to assess goodness of fit and to check whether two categorical variables are independent.
The Chi-squared test statistic is used to assess goodness of fit and is calculated as follows:
$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$
where $O_i$ is the observed value of interest and $E_i$ is its expected value. A Chi-squared test statistic takes on a particular number of degrees of freedom, which is based on the number of categories in the distribution.
To use the Chi-squared test to check whether two categorical variables are independent: (1) Create a table of counts (called a contingency table) with the values of one variable forming its rows and the values of the other variable forming its columns; (2) Compute the expected count for each cell under independence (row total multiplied by column total, divided by the grand total) and calculate the Chi-squared statistic as above; (3) Compare the statistic to a Chi-squared distribution with (r - 1)(c - 1) degrees of freedom.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
The Chi-square test is a hypothesis test that is used when you want to determine whether there is a relationship between two categorical variables.
Web testing often goes beyond A/B testing and tests multiple treatments at once. The chi-square test is used with count data to test how well it fits some expected distribution. The most common use of the chi-square statistic in statistical practice is with r × c contingency tables, to assess whether the null hypothesis of independence among variables is reasonable.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
The chi-square statistic is a general measure of the dissimilarity between the observed and expected counts.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
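A sketch of the independence test on a 2 x 2 contingency table with scipy (the counts are invented):

```python
import numpy as np
from scipy import stats

# Rows: treatment A / B; columns: converted / not converted (hypothetical counts)
table = np.array([[ 90, 910],
                  [120, 880]])

chi2, p_value, dof, expected = stats.chi2_contingency(table)
print(round(chi2, 2), round(p_value, 4), dof)   # dof = (2 - 1) * (2 - 1) = 1
```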
What is the difference between a chi-square test and a correlation?
Both correlations and chi-square tests can test for relationships between two variables. However, a correlation is used when you have two quantitative variables and a chi-square test of independence is used when you have two categorical variables.
Scribbr - Frequently asked questions (2022)
What are the Type I and Type II Errors?
There are two errors that are frequently assessed:
- A Type I error, which is also known as a false positive error, occurs when the null hypothesis is rejected when it is actually correct.
- A Type II error, which is also known as a false negative error, occurs when the null hypothesis is not rejected when it is incorrect.
Usually, $1 - \alpha$ is referred to as the confidence level, and $1 - \beta$ is referred to as the power of the test.
- $\alpha$ (alpha) represents the probability of a Type I error.
- $\beta$ (beta) represents the probability of a Type II error.
- Power = 1 − $\beta$, which reflects the probability of correctly rejecting a false null hypothesis.
Generally, tests are set up in such a way as to have both $1 - \alpha$ and $1 - \beta$ relatively high (for example, 0.95 and 0.8, respectively). If you plot sample size versus power, generally you should see that a larger sample size corresponds to higher power. It can be useful to look at power curves in order to gauge the sample size needed for detecting a significant effect.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
What is the Bonferroni Correction?
If you run many experiments—even if a particular outcome for one is unlikely—you may see a statistically significant result at least once by pure chance. For example, if you set $\alpha = 0.05$ and have 100 hypothesis tests, you would expect 5 out of 100 to be statistically significant by chance alone. To control for this, a more desirable outcome is achieved by adjusting $\alpha$. This can be done by setting a new $\alpha$ value as:
$\alpha' = \frac{\alpha}{n}$
where $n$ is the number of hypothesis tests.
This adjustment is known as the Bonferroni correction and helps ensure that the overall rate of false positives is controlled within a multiple-testing framework.
Ace the Data Science Interview; Kevin Huo, Nick Singh (2022)
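statsmodels implements this correction (alongside less conservative alternatives) through `multipletests`; the p-values below are made up:

```python
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.039, 0.041, 0.20]   # hypothetical raw p-values from 5 tests

reject, p_adjusted, _, alpha_bonf = multipletests(p_values, alpha=0.05, method="bonferroni")
print(alpha_bonf)   # 0.05 / 5 = 0.01, the corrected per-test threshold
print(reject)       # only the tests that survive the corrected threshold
```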
What is Regression to Mean?
Regression to the mean occurs when more extreme responses tend to revert and move closer to the mean in the long run, since some contribution to their initially extreme character happened merely by chance.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is Bias?
In statistics, bias is the difference between the expected value of an estimator and its estimand. […] (it) refers to results that are systematically off the mark. Think archery where your bow is sighted incorrectly. High bias doesn’t mean you’re shooting all over the place (that’s high variance), but may cause a perfect archer to hit below the bullseye all the time.
What is AI bias?; Cassie Kozyrkov (2019)
In causality, bias is what makes association different from causation. […] The bias is given by how the treated and control group differ before the treatment, in case neither of them has received the treatment. […] you can think of bias arising because many things we can’t control are changing together with the treatment.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
Statistical bias refers to measurement or sampling errors that are systematic and produced by the measurement or sampling process. An important distinction should be made between errors due to random chance and errors due to bias. Consider the physical process of a gun shooting at a target. It will not hit the absolute center of the target every time, or even much at all. An unbiased process will produce error, but it is random and does not tend strongly in any direction. Bias comes in different forms, and may be observable or invisible. When a result does suggest bias (e.g., by reference to a benchmark or actual values), it is often an indicator that a statistical or machine learning model has been misspecified, or an important variable left out.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Sample Bias?
[When] the sample is different in some meaningful and nonrandom way from the larger population it was meant to represent. The term nonrandom is important - hardly any sample, including random samples, will be exactly representative of the population. Sample bias occurs when the difference is meaningful, and it can be expected to continue for other samples drawn in the same way as the first.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is the Bias-Variance trade-off?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data. Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data. A model with high variance pays a lot of attention to training data and does not generalize on data it hasn’t seen before. As a result, such models perform very well on training data but have high error rates on test data. If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it’s going to have high variance and low bias. So we need to find the right/good balance without overfitting and underfitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance. An algorithm can’t be more complex and less complex at the same time.
Understanding the Bias-Variance Tradeoff; Seema Singh (2018)
The tension between oversmoothing and overfitting is an instance of the bias-variance trade-off, a ubiquitous problem in statistical model fitting. Variance refers to the modeling error that occurs because of the choice of training data; that is, if you were to choose a different set of training data, the resulting model would be different. Bias refers to the modeling error that occurs because you have not properly identified the underlying real-world scenario; this error would not disappear if you simply added more training data. When a flexible model is overfit, the variance increases. You can reduce this by using a simpler model, but the bias may increase due to the loss of flexibility in modeling the real underlying situation. A general approach to handling this trade-off is through cross-validation.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
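An illustrative sketch of the trade-off in Python, fitting polynomials of increasing degree to synthetic noisy data (the data and degrees are assumptions made for the example):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy sine data: degree 1 underfits (high bias), degree 15 overfits (high variance).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 100).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, mean_squared_error(y_test, model.predict(X_test)))
```

The intermediate degree typically gives the lowest test error, which is the balance the passages above describe.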
What is ANOVA?
ANOVA is a statistical procedure for analyzing the results of an experiment with multiple groups. It is the extension of similar procedures for the A/B test, used to assess whether the overall variation among groups is within the range of chance variation. A useful outcome of ANOVA is the identification of variance components associated with group treatments, interaction effects, and errors.
What is Bootstrap?
One easy and effective way to estimate the sampling distribution of a statistic, or of model parameters, is to draw additional samples, with replacement, from the sample itself and recalculate the statistic or model for each resample. This procedure is called the bootstrap, and it does not necessarily involve any assumptions about the data or the sample statistic being normally distributed.
The bootstrap (sampling with replacement from a data set) is a powerful tool for assessing the variability of a sample statistic.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
[…] have an idea of how much our estimate varies. This process is known as bootstrapping the data — the magical idea of bootstrapping is reflected in this ability to learn about variability in an estimate without having to make any assumptions about the shape of the population distribution. If we repeat this resampling, say, a thousand times, we will obtain a thousand possible estimates of the mean. These are known as sampling distributions of estimates, since they reflect the variability in the estimates that arise from repeated samplings of the data. The distributions of estimates based on resampled data are almost symmetric around the mean of the original data, almost independently of the shape of the original data distribution. The second important characteristic is that the bootstrap distributions narrow as the sample size increases, reflected in the 95% uncertainty intervals becoming ever narrower. Bootstrapping provides an intuitive way, with heavy use of the computer, to assess the uncertainty in our estimates, without needing to make strong assumptions or use probability theory.
Bootstrapping a sample consists of creating new datasets of the same size by resampling the original data, with replacement. Sample statistics calculated from bootstrap resamplings tend toward a normal distribution for large datasets, regardless of the shape of the original data distribution. Uncertainty intervals based on bootstrapping take advantage of modern computing power and do not require assumptions about the mathematical form of the population nor complex probability theory.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
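A minimal Python sketch of bootstrapping the mean (the data here are synthetic and deliberately skewed):

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=200)      # skewed toy data

# Resample with replacement many times and recompute the statistic each time.
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(1000)
])
print(np.percentile(boot_means, [2.5, 97.5]))      # 95% bootstrap interval for the mean
```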
What is the difference between Bootstrap and Permutation?
There are two main types of resampling procedures: the bootstrap and permutation tests. The bootstrap is used to assess the reliability of an estimate. Permutation tests are used to test hypotheses, typically involving two or more groups.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Bagging?
With classification and regression trees (also called decision trees), running multiple trees on bootstrap samples and then averaging their predictions (or, with classification, taking a majority vote) generally performs better than using a single tree. This process is called bagging (short for “bootstrap aggregating”).
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
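A sketch with scikit-learn, comparing a single decision tree against bagged trees by cross-validation (the built-in breast cancer dataset is used purely as a convenient example):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

# Bagging averages many trees fit on bootstrap samples; it usually scores higher.
print(cross_val_score(single_tree, X, y, cv=5).mean())
print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```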
What are the Epicycles of Data Analysis?
Develop Expectations -> Collect Data -> Match Expectations with Data
Stating the Question -> Exploratory Data Analysis -> Model Building -> Interpret -> Communicate
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What are the differences between Descriptive Analysis and Inferential Statistics?
Descriptive statistics describe a sample. That’s pretty straightforward. You simply take a group that you’re interested in, record data about the group members, and then use summary statistics and graphs to present the group properties. With descriptive statistics, there is no uncertainty because you are describing only the people or items that you actually measure. You’re not trying to infer properties about a larger population.[…] Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. Because the goal of inferential statistics is to draw conclusions from a sample and generalize them to a population, we need to have confidence that our sample accurately reflects the population. This requirement affects our process.
Difference between Descriptive and Inferential Statistics; Jim Frost (2020)
What are the 4 different Categories of Data Analysis?
Descriptive Analytics (tells you what happened in the past); Diagnostic Analytics (helps you understand why something happened in the past); Predictive Analytics (predicts what is most likely to happen in the future); Prescriptive Analytics (recommends actions you can take to affect those outcomes).
Comparing Descriptive, Predictive, Prescriptive, and Diagnostic Analytics; Brian Brinkmann (2019)
What are Associational Analyses?
Associational analyses are ones where we are looking at an association between two or more features in the presence of other potentially confounding factors. There are three classes of variables that are important to think about in an associational analysis: Outcome (the feature of your dataset that is thought to change along with your key predictor); Key predictor (often for associational analyses there is one key predictor of interest); Potential confounders (this is a large class of predictors that are both related to the key predictor and the outcome).
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What are Prediction Analyses?
In the previous section we described associational analyses, where the goal is to see if a key predictor x and an outcome y are associated. But sometimes the goal is to use all of the information available to you to predict y. Furthermore, it doesn’t matter if the variables would be considered unrelated in a causal way to the outcome you want to predict because the objective is prediction, not developing an understanding about the relationships between features. With prediction models, we have outcome variables–features about which we would like to make predictions–but we typically do not make a distinction between “key predictors” and other predictors. In most cases, any predictor that might be of use in predicting the outcome would be considered in an analysis and might, a priori, be given equal weight in terms of its importance in predicting the outcome. Prediction analyses will often leave it to the prediction algorithm to determine the importance of each predictor and to determine the functional form of the model.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is the Classical Statistical Inference Pipeline?
Formulate hypothesis -> Design experiment -> Collect Data -> Inference/conclusions
The term inference reflects the intention to apply the experiment results, which involve a limited set of data, to a larger process or population.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Data Conditioning?
[It is] the first step of any data analysis project [and means] getting data into a state where it’s usable. Data conditioning can involve cleaning up messy HTML with tools […], natural language processing to parse plain text in English and other languages, or even getting humans to do the dirty work.
What Is Data Science?; Mike Loukides (2010)
What is QMV?
It is an iterative process of questioning, modeling, and validation applied to data analysis and model building.
Model Building and Validation by AT&T, Online Course - Advanced Techniques for Analyzing Data
Causal Inference
What is Causality?
Causality, in the statistical sense, means that when we make interventions, the chances of obtaining different results are systematically modified. It is difficult to establish causality statistically; for this, well-designed randomized studies are the best tool we have. Observational data may include background factors that influence the apparent relationships observed between an exposure and an outcome; these may be either observed confounders or hidden factors.
Our “statistical” idea of causality is not strictly deterministic. When we say that X causes Y, we are not trying to say that every time X occurs, Y will also occur. Or that Y only occurs if X occurs. We only mean that if we intervene and force X to occur, then Y will tend to occur more frequently. Thus, we can never say that X caused Y in a specific case, only that X increases the proportion of times that Y occurs.
First, to infer causality with real confidence, the ideal is to intervene and conduct experiments. Second, since this is a statistical or stochastic world, we need to intervene more than once to accumulate evidence.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What are Counterfactuals?
Counterfactual reasoning means thinking about alternative possibilities for past or future events: what might happen/ have happened if…? In other words, you imagine the consequences of something that is contrary to what actually happened or will have happened (“counter to the facts”).
Conceptually: Counterfactuals (2022)
[…] we will talk a lot in terms of potential outcomes. They are potential because they didn’t actually happen. Instead they denote what would have happened in the case some treatment was taken. We sometimes call the potential outcome that happened, factual, and the one that didn’t happen, counterfactual.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the fundamental problem of Causal Inference?
The fundamental problem of causal inference is that we can never observe the same unit with and without treatment. It is as if we have two diverging roads and we can only know what lies ahead of the one we take.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the difference between Causation and Association?
Inferences about causation are concerned with “what if” questions in counterfactual worlds, such as “what would be the risk if everybody had been treated?” and “what would be the risk if everybody had been untreated?”, whereas inferences about association are concerned with questions in the actual world, such as “what is the risk in the treated?” and “what is the risk in the untreated?”.
Association is defined by a different risk in two disjoint subsets of the population determined by the individuals’ actual treatment value (A = 1 or A = 0), whereas causation is defined by a different risk in the same population under two different treatment values (a = 1 or a = 0).
In ideal randomized experiments, association is causation.
Causal Inference: What if; Miguel A. Hernán and James M. Robins (2022)
What is the purpose of a Clinical Experiment?
The purpose of a clinical experiment is to conduct an “honest test” that properly determines causality and estimates the average effect of a new medical treatment, without introducing biases that may give us a mistaken idea of its effectiveness. An adequate medical experiment should ideally follow the following principles: (1) Allocation of treatment: it is important to compare like with like, so the treatment and comparison groups need to be as similar as possible. The best way to ensure this is to randomly assign participants to be treated or not, and then observe what happens to them — this is known as a Randomized Controlled Trial (RCT).
(2) All individuals in the groups to which they were allocated must be counted: the individuals allocated to the “statin” group of the EPC were included in the final analysis even if they did not take their statins.
(3) If possible, people should not know which group they are in: in studies with statins, both the real medication pills and the placebo pills had the same appearance, so that the participants did not know the treatment they were receiving — a blind test. If possible, those evaluating the final results should not know which group of subjects they are examining.
(4) Evaluate all individuals: every effort should be made to follow all individuals, since people who drop out of the study may, for example, have done so due to the drug’s side effects.
(5) Do not rely on a single study: a single experiment with statins can only tell us that the drug worked in a particular group in a particular place; robust conclusions require multiple studies.
(6) Systematically review the evidence: when examining multiple experiments, it is important to include any study that has been conducted, and thus create what is known as a systematic review. The results can then be formally combined in a meta-analysis.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is Average Treatment Effect (ATE)?
The average treatment effect (ATE) is a measure used to compare treatments (or interventions) in randomized experiments, evaluation of policy interventions, and medical trials. The ATE measures the difference in mean (average) outcomes between units assigned to the treatment and units assigned to the control.
Average treatment effect - Wikipedia (2022)
What is Confounding Bias?
The first significant cause of bias is confounding. It happens when the treatment and the outcome share a common cause. For example, let’s say that the treatment is education, and the outcome is income. It is hard to know the causal effect of education on wages because both share a common cause: intelligence. So we could argue that more educated people earn more money simply because they are more intelligent, not because they have more education. We need to close all backdoor paths between the treatment and the outcome to identify the causal effect. If we do so, the only effect that will be left is the direct effect T->Y. In our example, if we control for intelligence, that is, we compare people with the same level of intellect but different levels of education, the difference in the outcome will be only due to the difference in schooling since intelligence will be the same for everyone. To fix confounding bias, we need to control all common causes of the treatment and the outcome.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
Any correlation between ice cream sales and drownings must be due to both being influenced by the weather. When an apparent association between results can be explained by a common factor influencing both, this common cause is known as a confounder, or confounding variable. The simplest technique to deal with confounding variables is to examine the apparent relationship within each level of the confounder. This is known as adjustment, or stratification. Thus, for example, we could explore the relationship between drownings and ice cream sales on days with more or less the same temperature. An extreme case is Simpson’s paradox, which occurs when the apparent direction of an association is reversed by a confounder, requiring a complete change in the apparent information from the data. Statisticians delight in finding real-life examples of this, each reinforcing the caution required in interpreting observational data. However, it shows the insight gained by dividing data according to factors that may help explain observed associations.
In a randomized study, there should be no need for adjustment for confounding variables, since random allocation in theory ensures that all other factors not being studied are balanced between the groups.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
What is Selection Bias?
Often, selection bias arises when we control for more variables than we should. It might be the case that the treatment and the potential outcome are marginally independent but become dependent once we condition on a collider.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
While confounding is the bias from failing to control for a common cause, selection bias is when we control for a common effect or a variable in between the path from cause to effect. As a rule of thumb, always include confounders and variables that are good predictors of the outcome in your model. Always exclude variables that are good predictors of the treatment only, mediators between the treatment and outcome, or common effects of the treatment and outcome.
Selection bias is so pervasive that not even randomization can fix it. Better yet, it is often introduced by the ill-advised, even in random data! Spotting and avoiding selection bias requires more practice than skill. Often, it appears underneath some supposedly clever idea, making it even harder to uncover.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the difference between Correlated Variables and Confounding Variables?
With correlated variables, the problem is one of commission: including different variables that have a similar predictive relationship with the response. With confounding variables, the problem is one of omission: an important variable is not included in the regression equation. Naive interpretation of the equation coefficients can lead to invalid conclusions.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
Which features/controls/predictors should we add to a Causal Inference model?
We should add controls that are both correlated with the treatment and the outcome (confounder). We should also add controls that are good predictors of the outcome, even if they are not confounders, because they lower the variance of our estimates. However, we should NOT add controls that are just good predictors of the treatment, because they will increase the variance of our estimates.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is the Vast Search Effect?
Bias or nonreproducibility resulting from repeated data modeling, or modeling data with large numbers of predictor variables. […] Typical forms of selection bias in statistics, in addition to the vast search effect, include nonrandom sampling, cherry-picking data, selection of time intervals that accentuate a particular statistical effect, and stopping an experiment when the results look “interesting.”
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
Feature Engineering
How to deal with Missing Values?
Frequently data contains missing or null values, which lower the potential of the model, so we try to impute them. 1. For continuous values, we fill in the null values using the mean, mode, or median, depending on the need. 2. For categorical values, we use the most frequently occurring category.
The “Generic” Data Science Life-Cycle; Sivakar Siva (2020)
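A minimal pandas sketch of both imputation strategies (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 30, None, 40], "city": ["SP", None, "SP", "RJ"]})
df["age"] = df["age"].fillna(df["age"].median())       # continuous: median (or mean)
df["city"] = df["city"].fillna(df["city"].mode()[0])   # categorical: most frequent value
print(df)
```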
What is One Hot Encoding?
One hot encoding is a process by which categorical variables are converted into a form that could be provided to ML algorithms to do a better job in prediction. […] It is used to perform “binarization” of the category and include it as a feature to train the model.
What is One Hot Encoding? Why and When Do You Have to Use it? (2017)
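A minimal sketch with pandas (the "color" column is hypothetical); each category becomes its own 0/1 indicator column:

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
print(pd.get_dummies(df, columns=["color"]))   # color_blue, color_green, color_red
```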
Machine Learning
What is the basic difference between Inferential Statistics and Machine Learning?
Inferential statistics is a way to learn from data, and one of the tools of Machine Learning. Both use a set of observations to discover underlying processes or patterns, and then are able to predict. If you have all the houses’ characteristics and prices in a given area, you can find out what is determining the price, then predict the price for a new house. Simple statistical analysis. Now if you want to build an app to predict house prices, it’s another story. You need a lot more work on data pre-processing, multiple algorithms, other models of regression, etc. That’s machine learning territory. Inferential statistics is only one of the tools. Machine Learning also wants to learn from “big data”: high dimensional, unstructured, streaming data; find connections in a social network; group press releases by similar topics; recognize images; compress pictures; etc. No nice excel-like data set for this. It requires a different set of tools (whose goal is basically to turn everything the messy world is throwing at us into a nice excel-like data set with matrices that compute fast). The techniques that deal with high dimensional and streaming data have all the attention today, but a lot of the implementations of Machine Learning are still classic regression. You hear a lot that a business can be “moneyballed”, referring to baseball statistics. The idea is that you can take something that is “obviously” not data driven (“I have been doing this business for 30 years and let me tell you it’s all about connecting with people”) and prove that it can be run more effectively with data. Most of that is indeed inferential statistics, plus additional techniques. It’s all “learning from data”.
Quora Answer; Philippe Hocquet (2017)
What is a Linear Model?
Mathematically, the function $f(x) = wx + b$ is an affine transformation, not a linear one, since true linear transformations require $b = 0$. However, in machine learning, we often call such models “linear” whenever the parameters appear linearly in the equation — meaning $w$ and $b$ are only multiplied by inputs or constants and added, without multiplying each other, being raised to powers, or appearing inside functions like $e^w$.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is Linear Regression?
Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things: (1) does a set of predictor variables do a good job in predicting an outcome (dependent) variable? (2) Which variables in particular are significant predictors of the outcome variable, and in what way do they–indicated by the magnitude and sign of the beta estimates–impact the outcome variable? These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.
What is Linear Regression?; Statistics Solutions (2013)
In basic regression analysis, the dependent variable is the quantity we want to predict or explain, usually forming the vertical y-axis of a graph, and is also known as the response variable. The independent variable is the quantity we use to make the prediction or explanation, generally forming the horizontal x-axis of a graph, and is also known as the explanatory variable. The gradient — slope — is also known as the regression coefficient.
The meaning of these gradients depends entirely on our assumptions about the relationship between the variables being studied. For correlation data, the gradient indicates how much the dependent variable would be expected to change, on average, if we observe a difference of one unit in the independent variable. If, however, we assumed a causal relationship, then the interpretation of the gradient would be very different — it would be the expected change in the dependent variable if we intervened and changed the independent variable by one unit.
Statistical models have two main components. First, a mathematical formula that expresses a deterministic, predictable component, such as the straight line fit that allows us to predict a child’s height from the parent’s height. But the deterministic part of a model will never be a perfect representation of the observed world. The difference between what the model predicts and what actually happens is the second component of a model, known as the residual error — although it may sound misleading, it is simply the inevitable inability of a model to represent exactly what we observe. Thus, in sum, we assume that: observation = deterministic model + residual error.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
OLS is a common technique used in analyzing linear regression. In brief, it compares the difference between individual points in your data set and the predicted best fit line to measure the amount of error produced.
Interpreting Linear Regression Through statsmodels .summary(); Tim McAleer (2020)
How is the model fit to the data? When there is a clear relationship, you could imagine fitting the line by hand. In practice, the regression line is the estimate that minimizes the sum of squared residual values, also called the residual sum of squares or RSS. The method of minimizing the sum of the squared residuals is termed least squares regression, or ordinary least squares (OLS) regression.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
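A minimal OLS sketch with statsmodels on synthetic data (the true intercept and slope are assumed to be 2 and 3 so the recovered coefficients can be checked):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

X = sm.add_constant(x)        # adds the intercept term
model = sm.OLS(y, X).fit()    # least squares: minimizes the residual sum of squares
print(model.params)           # estimates should be close to [2.0, 3.0]
```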
How do Neural Networks differ from Linear Models?
A neural network differs from a linear model in two fundamental ways:
(1) It applies fixed non-linear functions to the outputs of trainable linear functions.
(2) Its structure is deeper, combining multiple functions hierarchically through layers.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What are Logits?
Logits are the raw outputs of a neural network, prior to applying an activation function.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is Logistic Regression?
Logistic regression is commonly used for binary classification tasks. Unlike linear regression, which produces outputs ranging from −∞ to ∞, logistic regression always outputs values between 0 and 1. It does this by applying the sigmoid function to a linear combination of inputs. Logistic regression can serve either as a standalone model or as the output layer in a larger neural network.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is the difference between Linear Regression and Logistic Regression?
Linear Regression and Logistic Regression are two well-known Machine Learning algorithms that come under the supervised learning technique. Since both algorithms are supervised in nature, they use labeled datasets to make predictions. But the main difference between them is how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.
Linear Regression vs Logistic Regression - Java T Point (2022)
Linear regression and logistic regression share many commonalities. Both assume a parametric linear form relating the predictors with the response. Exploring and finding the best model are done in very similar ways. Extensions to the linear model, like the use of a spline transform of a predictor, are equally applicable in the logistic regression setting. Logistic regression differs in two fundamental ways: (1) The way the model is fit (least squares is not applicable); (2) The nature and analysis of the residuals from the model.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
As noted above, unlike linear regression (which produces outputs ranging from minus infinity to infinity), logistic regression outputs values between 0 and 1, and it can serve either as a standalone model or as the output layer in a larger neural network.
A common choice for the loss function in this case is binary cross-entropy, also called logistic loss.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What Assumptions do we make for Regression models?
When doing a simple regression model, we make the (often reasonable!) assumptions that: a) The errors are normally distributed and, on average, zero; b) The errors all have the same variance (they are homoscedastic), and c) The errors are unrelated to each other (they are independent across observations).
Practical Time Series - The State University of New York (2024)
What is the difference between Simple Linear Regression and Correlation?
Both are ways of measuring how two variables are related. The difference is that while correlation measures the strength of an association between two variables, regression quantifies the nature of the relationship.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
What is Multicollinearity?
It will be recalled that one of the factors that affects the standard error of a partial regression coefficient is the degree to which that independent variable is correlated with the other independent variables in the regression equation. Other things being equal, an independent variable that is very highly correlated with one or more other independent variables will have a relatively large standard error. This implies that the partial regression coefficient is unstable and will vary greatly from one sample to the next. This is the situation known as multicollinearity. Multicollinearity exists whenever an independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation. Multicollinearity is a problem because it undermines the statistical significance of an independent variable. Other things being equal, the larger the standard error of a regression coefficient, the less likely it is that this coefficient will be statistically significant.
The problem of multicollinearity - Understanding Regression Analysis; Michael Patrick Allen (1997)
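A sketch of diagnosing multicollinearity in Python with variance inflation factors (the synthetic predictors below are deliberately near-collinear; the VIF > 5-10 rule of thumb is a common convention, not from the quoted text):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=300)   # nearly collinear with x1
x3 = rng.normal(size=300)
X = sm.add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

# High VIFs for x1 and x2 signal unstable, hard-to-interpret coefficients.
for i, col in enumerate(X.columns[1:], start=1):
    print(col, variance_inflation_factor(X.values, i))
```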
What is Heteroskedasticity?
Heteroskedasticity is the lack of constant residual variance across the range of the predicted values. In other words, errors are greater for some portions of the range than for others. Visualizing the data is a convenient way to analyze residuals.
Heteroskedasticity indicates that prediction errors differ for different ranges of the predicted value, and may suggest an incomplete model.
Practical Statistics for Data Scientists; Peter Bruce, Andrew Bruce & Peter Gedeck (2020)
This phenomenon of having a region of low variance and another of high variance is called heteroskedasticity. Put simply, heteroskedasticity is when the variance is not constant across all values of the features.
Causal Inference for the Brave and True; Matheus Facure Alves (2022)
What is Loss in Machine Learning?
Smaller errors mean the model fits the data better. The loss, which aggregates these errors, measures how well the model aligns with the dataset.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
How do we find the Optimum of a Function?
To find the optimum (minimum or maximum) of a function, we calculate its first derivative. At the optimum, the first derivative equals 0. For functions of two or more variables, like the loss function $J(w, b)$, we compute partial derivatives with respect to each variable.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is the Cross-Entropy Loss?
Cross-entropy, also known as logarithmic loss or log loss, is a popular loss function used in machine learning to measure the performance of a classification model. It measures the average number of bits required to identify an event from one probability distribution, p, using the optimal code for another probability distribution, q. In other words, cross-entropy measures the difference between the discovered probability distribution of a classification model and the predicted values. The cross-entropy loss function is used to find the optimal solution by adjusting the weights of a machine learning model during training. The objective is to minimize the error between the actual and predicted outcomes. A lower cross-entropy value indicates better performance.
Binary cross-entropy (averaged over $N$ samples): $L = -\frac{1}{N}\sum_{j=1}^{N}\left(t_j\log(p_j)+(1-t_j)\log(1-p_j)\right)$, where $t_j \in \{0,1\}$ is the true label and $p_j$ is the predicted probability for sample $j$. For binary classification, $p_j$ usually comes from a sigmoid, while for more than two classes it comes from the softmax.
Cross-Entropy Loss Function in Machine Learning: Enhancing Model Accuracy; Kurtis Pykes (2025)
When used with softmax in the output layer, cross-entropy guides the network to assign high probabilities to correct classes while reducing probabilities for incorrect ones.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
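A minimal NumPy sketch of the binary cross-entropy formula above (the labels and probabilities are toy values):

```python
import numpy as np

def binary_cross_entropy(t, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)   # avoid log(0)
    return -np.mean(t * np.log(p) + (1 - t) * np.log(1 - p))

t = np.array([1, 0, 1, 1])             # true labels
p = np.array([0.9, 0.2, 0.7, 0.4])     # predicted probabilities
print(binary_cross_entropy(t, p))      # lower is better
```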
What is the Gradient in Gradient Descent?
The gradient of the loss function is a vector containing all partial derivatives with respect to the model’s parameters. It indicates the direction of steepest ascent in the loss function. To minimize loss, parameters are updated in the opposite direction of the gradient.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is Gradient Descent and How Does it Work?
Gradient descent is an iterative optimization algorithm that updates model parameters to minimize the loss function. Steps include:
- Initialize parameters randomly.
- Compute predictions.
- Compute gradients of the loss with respect to parameters.
- Update parameters using the gradients, adjusting them in the direction that decreases the loss function. This adjustment involves taking a small step in the opposite direction of the gradient.
- Calculate the loss by substituting the updated values.
- Repeat until the loss converges to a minimum.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
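A minimal NumPy sketch of the steps above for a one-feature linear model with mean squared error loss (the learning rate, iteration count, and data are assumptions made for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 4.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(200):
    error = w * x + b - y              # predictions minus targets
    grad_w = 2 * np.mean(error * x)    # dL/dw for MSE loss
    grad_b = 2 * np.mean(error)        # dL/db
    w -= lr * grad_w                   # step in the opposite direction of the gradient
    b -= lr * grad_b
print(w, b)                            # should approach 4.0 and 1.0
```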
What is the Learning Rate in Gradient Descent?
The learning rate is a hyperparameter that controls the step size during updates. If the learning rate is too small, training is very slow; if it is too large, the algorithm may overshoot the minimum or even diverge. Choosing an appropriate learning rate is critical for convergence.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is Convergence in Gradient Descent?
Convergence occurs when subsequent iterations yield minimal decreases in loss. A properly tuned learning rate ensures steady progress toward the minimum of the loss function.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is Automatic Differentiation?
Automatic differentiation (autograd) is a feature in modern ML frameworks (e.g., PyTorch, TensorFlow) that computes derivatives directly from Python code. It eliminates the need for manual derivations, even for complex models, and enables gradient-based optimization efficiently.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
What is Backpropagation?
Backpropagation is the algorithm used to compute gradients in neural networks. It applies differentiation rules (chain rule) over a computational graph of the model. The process involves two passes:
- Forward pass: data flows from input to output to compute predictions.
- Backward pass: gradients flow from output to input to update parameters.
[The Hundred-Page Language Models Book; Andriy Burkov (2025)]
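A sketch of the forward and backward passes with PyTorch autograd (the tiny tensors are illustrative):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([2.0, 4.0, 6.0])
w = torch.tensor(0.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

loss = ((w * x + b - y) ** 2).mean()   # forward pass: predictions and loss
loss.backward()                        # backward pass: gradients via the chain rule
print(w.grad, b.grad)                  # dL/dw and dL/db, ready for a parameter update
```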
What are Activation Functions?
For a one-dimensional input, the model becomes: $y = \phi(wx + b)$ where $\phi$ is a fixed non-linear function (activation). Common activations include:
- ReLU (Rectified Linear Unit): $ \mathrm{ReLU}(z) = \max(0, z) $, widely used in neural networks.
- Sigmoid: $\sigma(z) = \frac{1}{1 + e^{-z}}$, outputs values in $[0,1]$, suitable for binary classification.
- Tanh: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} $, outputs values in $[-1,1]$.
- Softmax: Transforms a vector $\mathbf{z}$ into a probability distribution: $\mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$, ensuring that the sum is 1. Tasks involving three or more classes generally employ the softmax activation function paired with cross-entropy loss.
Neural network softmax outputs are better characterized as “probability scores” rather than true statistical probabilities, despite summing to one and resembling class likelihoods. Unlike logistic regression or Naive Bayes models, neural networks don’t generate genuine class probabilities.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
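Minimal NumPy sketches of the activations listed above:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5])
print(relu(z), sigmoid(z), np.tanh(z), softmax(z).sum())   # softmax components sum to 1
```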
What are Feedforward Neural Networks (FNNs) and Multilayer Perceptrons (MLPs)?
A feedforward neural network (FNN) is one where information flows in one direction — left to right — without loops. […] When each layer connects to all units in the next, it is called a multilayer perceptron (MLP), and the layers are known as fully connected (dense) layers.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What are Convolutional Neural Networks (CNNs)?
Convolutional neural networks (CNNs) are feedforward neural networks with convolutional layers that are not fully connected. Initially designed for image processing, they are also effective for tasks like text classification.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
Why was ReLU important in Deep Learning?
The ReLU activation function, despite its simplicity, was a breakthrough in machine learning. Neural networks before 2012 relied more on smooth activations like tanh and sigmoid, which made training deep models difficult.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is the Brier Score?
Although the ROC curve assesses how well the algorithm separates the groups, and the calibration plot checks whether the probabilities actually correspond to what they claim, the ideal would be to find a simple composite measure that combines these two aspects into a single number that we could then use to compare algorithms. Fortunately, meteorologists in the 1950s discovered exactly how to do this. If we were predicting a numerical quantity, such as tomorrow’s temperature at noon in a given place, accuracy would generally be summarized by the error — the difference between the observed temperature and the predicted one. The usual way to summarize error over a series of days is the mean squared error (MSE) — the average of the squared errors—analogous to the least-squares criterion we saw used in regression analysis. The trick for probabilities is to use the same mean-squared-error criterion we use when predicting a quantity, assigning the value 1 to a future observation of “rain” and the value 0 to “no rain.” The average of the squared errors is known as the Brier score, in honor of meteorologist Glenn Brier, who described the method in 1950.
The Art of Statistics: Learning from Data; David Spiegelhalter (2019)
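A minimal sketch of the Brier score in Python (the rain forecasts and outcomes are toy values):

```python
import numpy as np

forecast = np.array([0.7, 0.2, 0.9, 0.4])   # predicted probability of rain
outcome = np.array([1, 0, 1, 1])            # 1 = it rained, 0 = it did not
print(np.mean((forecast - outcome) ** 2))   # Brier score: 0 is perfect, lower is better
```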
What is the High Accuracy Paradox?
Accuracy is not useful when trying to predict things that are not common. Accuracy is simply the proportion of correctly classified instances. It is usually the first metric you look at when evaluating a model. However, when the data is imbalanced (most of the instances belong to one of the classes), or you are more interested in the performance on one of the classes, accuracy doesn’t really capture the effectiveness of a classifier. In classification problems we are typically more concerned about the errors that we make, because the target class is usually the area of interest we are trying to focus on. This is called the accuracy paradox.
Machine Learning - Accuracy Paradox; Randy Lao (2017)
Accuracy is not a reliable metric for determining a model’s performance. It is called a paradox because, intuitively, you would expect the model with the higher accuracy to be the best model, but the accuracy paradox tells us that this sometimes isn’t the case.
Accuracy Paradox in Classification Models; Amit Ranjan (2020)
What is Underfitting?
In supervised learning, underfitting happens when a model is unable to capture the underlying pattern of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to fit a linear model to nonlinear data. Also, these kinds of models, such as linear and logistic regression, are too simple to capture complex patterns in the data.
Understanding the Bias-Variance Tradeoff; Seema Singh (2018)
What is Overfitting?
In supervised learning, overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model extensively on a noisy dataset. These models have low bias and high variance. Very complex models, such as decision trees, are prone to overfitting.
Understanding the Bias-Variance Tradeoff; Seema Singh (2018)
Why is Model Tuning relevant?
Model tuning. A hallmark of prediction algorithms is their many tuning parameters. Sometimes these parameters can have large effects on prediction quality if they are changed and so it is important to be informed of the impact of tuning parameters for whatever algorithm you use. There is no prediction algorithm for which a single set of tuning parameters works well for all problems. Most likely, for the initial model fit, you will use “default” parameters, but these defaults may not be sufficient for your purposes. Fiddling with the tuning parameters may greatly change the quality of your predictions. It’s very important that you document the values of these tuning parameters so that the analysis can be reproduced in the future.
The Art of Data Science; Roger D. Peng and Elizabeth Matsui (2017)
What is PAC Learning?
PAC (probably approximately correct) learning theory helps to analyze whether and under what conditions a learning algorithm will probably output an approximately correct classifier.
The Hundred-Page Machine Learning Book; Andriy Burkov (2019)
A Concept Class (C) is PAC-learnable by a Learner (L) using a Hypothesis Space (H), if L will, with probability 1 - delta (with ‘delta’ being the certainty goal), output a hypothesis h (belonging to H) such that the error of h is less than epsilon (with ‘epsilon’ being the error goal) in time and samples polynomial in 1/epsilon, 1/delta.
PAC Learning - Georgia Tech - Machine Learning (2015)
Time Series
What is Stationarity in a Time Series context?
Strict stationarity imposes a stronger condition of identical probability distributions across different time points, while weak stationarity allows for changes in the distribution but requires the mean, variance, and autocorrelation structure to remain constant over time. In practice, weak stationarity is often more applicable and easier to verify, making it a commonly used assumption in time series analysis.
Practical Time Series - The State University of New York (2024)
Why is it important to know Stationarity and Invertibility for ARIMA models?
Stationarity and invertibility are crucial concepts in the context of ARIMA (AutoRegressive Integrated Moving Average) models, and understanding these properties is essential for ensuring the validity and reliability of the model. Here’s why these properties are important:
Stationarity:
- Statistical Assumption: ARIMA models assume that the time series data is stationary. Stationarity means that the statistical properties of the time series, such as mean and variance, do not change over time. This assumption is necessary for the model to capture meaningful patterns and relationships.
- Differencing Requirement: If the original time series is not stationary, differencing is applied to make it stationary. Differencing involves taking the difference between consecutive observations. Stationarity is important because it simplifies the modeling process and allows for more reliable parameter estimation.
Invertibility:
- Interpretability: Invertibility is a property that ensures the model is interpretable. An invertible model implies that the current value of the time series only depends on past values and white noise. This property is crucial for understanding the impact of past observations on the present without causing feedback loops.
- Meaningful Forecasts: Invertibility is important for making meaningful forecasts. If a model is not invertible, the forecasted values may not have clear interpretability, and it might be challenging to attribute changes in the forecast to specific changes in the input data.
- Numerical Stability: Invertibility is related to the numerical stability of the model. Invertible models are more likely to produce stable and reliable parameter estimates, making them more suitable for forecasting.
In summary, stationarity ensures that the statistical properties of the time series remain consistent over time, making it suitable for modeling. Invertibility ensures that the model is interpretable and capable of providing meaningful forecasts. Both properties contribute to the reliability and accuracy of ARIMA models in capturing and forecasting time series patterns.
Can I apply ARIMA on a Non Stationary and Non Invertible Time Series?
The ARIMA (AutoRegressive Integrated Moving Average) model assumes that the time series data is stationary. If your time series is non-stationary, you typically need to apply differencing to make it stationary before applying ARIMA. Similarly, invertibility is a desirable property of ARIMA models to ensure that the model is interpretable and suitable for forecasting. An invertible model implies that the current value of the time series only depends on past values and white noise. If a model is not invertible, it may lead to challenges in interpretation and potentially less reliable forecasts. Here are the general steps when dealing with a non-stationary time series: (1) Differencing - If your time series is non-stationary, you may need to apply differencing to make it stationary. Differencing involves taking the difference between consecutive observations; (2) ARIMA Model - Once the data is stationary, you can apply the ARIMA model. The ARIMA model is typically denoted as ARIMA(p, d, q), where p is the order of the autoregressive (AR) component, d is the degree of differencing, q is the order of the moving average (MA) component; (3) Invertibility Check: After fitting the ARIMA model, it’s important to check if the model is invertible. If the model is not invertible, you might need to reconsider the model specification or apply transformations to achieve invertibility.
How to model a Time Series for ARIMA?
Modeling: Trend suggests differencing; Variation in variance suggests transformation (common transformation: log, then differencing); ACF (auto-correlation function) suggests order of moving average process (q); PACF (partial ACF) suggests order of autoregressive process (p).
Practical Time Series - The State University of New York (2024)
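A sketch of this workflow with statsmodels (the series is a synthetic random walk, and the (1, 1, 1) order is an assumption for the example; in practice p and q would be guided by the ACF and PACF as described above):

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=300))   # non-stationary random walk

# d=1 differences the series once to make it stationary before fitting AR and MA terms.
result = ARIMA(series, order=(1, 1, 1)).fit()
print(result.summary())
print(result.forecast(steps=5))            # five-step-ahead forecast
```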
Generative AI
What is a Corpus?
A collection of text documents used in machine learning.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What is a Token?
Splitting a document into small indivisible parts is called tokenization, and each part is a token. There are different ways to tokenize. Sometimes, it’s useful to break words into smaller units, called subwords, to keep the vocabulary size manageable.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
How does the Bag of Words work?
- Create a vocabulary: List all unique words in the corpus to create the vocabulary.
- Vectorize documents: Convert each document into a feature vector, where each dimension represents a word from the vocabulary. The value indicates the word’s presence, absence, or frequency in the document.
While the bag-of-words approach offers simplicity and practicality, it has notable limitations. Most significantly, it fails to capture token order or context. Consider how “the cat chased the dog” and “the dog chased the cat” yield identical representations, despite conveying opposite meanings. N-grams provide one solution to this challenge.
Another limitation of bag-of-words is how it handles out-of-vocabulary words. When a word appears during inference that wasn’t present during training - and thus isn’t in the vocabulary - it can’t be represented in the feature vector. Similarly, the approach struggles with synonyms and near-synonyms. Words like “movie” and “film” are processed as completely distinct terms, forcing the model to learn separate parameters for each. Since labeled data is often costly to obtain, resulting in rather small labeled datasets, it would be more efficient if the model could recognize and collectively process words with similar meanings. Word embeddings address this by mapping semantically similar words to similar vectors.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
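A sketch with scikit-learn showing the word-order limitation mentioned above: the two sentences get identical bag-of-words vectors.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat chased the dog", "the dog chased the cat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())   # the vocabulary
print(X.toarray())                          # both rows are identical
```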
How do N-grams work?
An n-gram consists of $n$ consecutive tokens from text. By preserving sequences of tokens, n-grams retain contextual information that individual tokens cannot capture. However, using n-grams comes at a cost. The vocabulary expands considerably, increasing the computational cost of model training. Additionally, the model requires larger datasets to effectively learn weights for the expanded set of possible n-grams.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
How do Word Embeddings work?
Word embeddings overcome the limitations of the bag-of-words model by representing words as dense vectors rather than sparse one-hot vectors. These lower-dimensional representations contain mostly non-zero values, with similar words having embeddings that exhibit high cosine similarity. The embeddings are learned from vast unlabeled datasets spanning millions to hundreds of millions of documents.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
What are Skip-grams?
Skip-grams are word sequences where one word is omitted. Training a model to predict these skipped words from their surrounding context helps it learn semantic relationships between words. The process can also work in reverse: the skipped word can be used to predict its context words. […] The skip-gram model uses cross-entropy as its loss function. […] Once training is complete, the output layer is discarded. The embedding layer then serves as the new output layer.
Word2vec is just one method for learning word embeddings from large, unlabeled text corpora. Other methods, such as GloVe and FastText, offer alternative approaches, focusing on capturing global co-occurrence statistics or subword information to create more robust embeddings.
Using word embeddings to represent text offers clear advantages over bag of words. One advantage is dimensionality reduction, which compresses the word representation from the size of the vocabulary (as in one-hot encoding) to a small vector, typically between 100 and 1000 dimensions. Semantic similarity is another advantage of word embeddings. Words with similar meanings are mapped to vectors that are close to each other in the embedding space.
The Hundred-Page Language Models Book; Andriy Burkov (2025)
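A sketch of training skip-gram word2vec with gensim (the toy corpus is far too small to produce meaningful embeddings; it only shows the API shape):

```python
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects the skip-gram architecture; vector_size sets the embedding dimension.
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, min_count=1)
print(model.wv["cat"][:5])            # first few components of the learned vector
print(model.wv.most_similar("cat"))   # nearest words by cosine similarity
```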
What is Cosine Similarity?
Cosine similarity is a widely used similarity metric that determines how similar two data points are based on the direction they point rather than their length or size. It is especially effective in high-dimensional spaces where traditional distance-based metrics can struggle.
Computing cosine similarity requires measuring the cosine of the angle (theta) between two non-zero vectors in an inner product space. This measurement produces a cosine similarity score. Cosine similarity values range from -1 to 1:
- A cosine similarity score of 1 indicates that the vectors are pointing in the exact same direction.
- A cosine similarity score of 0 indicates that the vectors are orthogonal, meaning they have no directional similarity.
- A cosine similarity score of -1 indicates that the vectors point in exactly opposite directions.
Think of it like comparing arrows: if they’re pointing in the same direction, they are highly similar. Those at right angles are unrelated, and arrows pointing in opposite directions are dissimilar.
$\mathrm{cs} = \cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|}$, where $\|A\|$ is the magnitude (length) of vector $A$ and $\|B\|$ is the magnitude of vector $B$. In a 2D space, if vector $A=(x,y)$, its magnitude is $\|A\|=\sqrt{x^{2}+y^{2}}$.
What is cosine similarity?; IBM (2025)
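A minimal NumPy sketch of the formula above:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])      # same direction, twice the length
print(cosine_similarity(a, b))     # 1.0
print(cosine_similarity(a, -b))    # -1.0
```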
Miscellaneous
How is Moore’s Law applied to data?
Since the early ’80s, processor speed has increased from 10 MHz to 3.6 GHz — an increase of 360x (not counting increases in word length and number of cores). But we’ve seen much bigger increases in storage capacity, on every level. RAM has moved from \$1,000/MB to roughly \$25/GB — a price reduction of about 40,000x, to say nothing of the reduction in size and increase in speed. Hitachi made the first gigabyte disk drives in 1982, weighing in at roughly 250 pounds; now terabyte drives are consumer equipment, and a 32 GB microSD card weighs about half a gram. Whether you look at bits per gram, bits per dollar, or raw capacity, storage has more than kept pace with the increase of CPU speed. The importance of Moore’s law as applied to data isn’t just geek pyrotechnics. Data expands to fill the space you have to store it. The more storage is available, the more data you will find to put into it. […] Increased storage capacity demands increased sophistication in the analysis and use of that data. That’s the foundation of data science.
What Is Data Science?; Mike Loukides (2010)
Gordon Moore (a cofounder of Intel) observed that the number of transistors in computer chips doubles roughly every two years. More transistors per chip translates to faster speeds in computer processors and more random access memory in computers, which leads to more powerful computers. This extraordinary rate of technological improvement - output doubling every two years - is likely the fastest growth in technology humanity has ever seen. Yet, since 2011, the amount of sequencing data stored in the Short Read Archive has outpaced even this incredible growth, having doubled every year.
Bioinformatics Data Skills; Vince Buffalo (2015)
Should I use R programming language for Data Science?
R is considered one of the best programming languages for data science. It is a programming language and environment for graphics and statistical computing. It is domain-specific and of excellent quality. R comprises open source packages for statistical and quantitative applications, including advanced plotting, nonlinear regression, neural networks, phylogenetics, and more. Data scientists and data miners use R extensively for analyzing data.
Data Science From Scratch: How to Become a Data Scientist; David Park (2019)
What is the MapReduce approach?
Data is only useful if you can do something with it, and enormous datasets present computational problems. Google popularized the MapReduce approach, which is basically a divide-and-conquer strategy for distributing an extremely large problem across an extremely large computing cluster. In the “map” stage, a programming task is divided into a number of identical subtasks, which are then distributed across many processors; the intermediate results are then combined by a single reduce task. In hindsight, MapReduce seems like an obvious solution to Google’s biggest problem, creating large searches. It’s easy to distribute a search across thousands of processors, and then combine the results into a single set of answers. What’s less obvious is that MapReduce has proven to be widely applicable to many large data problems, ranging from search to machine learning. The most popular open source implementation of MapReduce is the Hadoop project.
What Is Data Science?; Mike Loukides (2010)
What is Hadoop?
Hadoop goes far beyond a simple MapReduce implementation (of which there are several); it’s the key component of a data platform. It incorporates HDFS, a distributed filesystem designed for the performance and reliability requirements of huge datasets; the HBase database; Hive, which lets developers explore Hadoop datasets using SQL-like queries; a high-level dataflow language called Pig; and other components. If anything can be called a one-stop information platform, Hadoop is it. Hadoop has been instrumental in enabling “agile” data analysis. […] Hadoop (and particularly Elastic MapReduce) make it easy to build clusters that can perform computations on large datasets quickly. Hadoop is essentially a batch system, but Hadoop Online Prototype (HOP) is an experimental project that enables stream processing. Hadoop processes data as it arrives, and delivers intermediate results in (near) real-time. Near real-time data analysis enables features like trending topics on sites like Twitter.
What Is Data Science?; Mike Loukides (2010)
The Hadoop platform was designed to solve problems where you have a lot of data - perhaps a mixture of complex and structured data - and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting.
The Modern Data Warehouse: A New Approach for a New Era; Tom Traubitz (2018)
What are the 5 V’s of Big Data?
Volume, velocity, variety, value and veracity.