STATISTICS FOR DATA SCIENCE

1. VARIABLE : A variable is a placeholder that stores values.

2. RANDOM VARIABLE : A random variable is a variable whose value depends on the outcome of a random process.

A. Numerical variable : A numerical variable is one whose values are numbers that measure or count something (e.g., height, weight, temperature, blood glucose).

Numerical variables are further divided into two types :

A.1. Continuous (floating-point number) : A continuous variable can take any value within an interval, including decimal values. For example : 5.6, 7.8, 0.001, 846.245

A.2. Discrete (whole number) : A discrete variable takes countable values, typically whole numbers. For example : 0, 1, 2, 3, 4, 5, 6

B. Categorical Variable : A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values (e.g. race, sex, age group)

Categorical variables are further divided into two types :

B.1. Nominal : A nominal variable is a categorical variable whose possible values have no inherent order (e.g. sex, blood group).

B.2. Ordinal : An ordinal variable is a categorical variable for which the possible values are ordered (e.g. education level ("high school", "BS", "MS", "PhD")).

RANDOM VARIABLE CONCLUSION :

A random variable is either numerical (continuous or discrete) or categorical (nominal or ordinal).

3. MEASURES OF CENTRAL TENDENCY :

A. MEAN : The mean is the sum of a collection of numbers divided by the count of numbers in the collection.

mean = sum of the numbers in the collection / count of numbers in the collection

B. MEDIAN : The median is the "middle" of a sorted list of numbers (when there are two middle numbers, we average them).

C. MODE : The mode of a set of data values is the value that appears most often.

NOTE : The mean, median and mode are commonly used to fill in (impute) missing values.
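
The three measures, and the idea of filling in missing values with one of them, can be sketched with Python's standard `statistics` module. The numbers below are made up for illustration, and filling with the median is only one possible choice.

```python
import statistics

values = [4, 6, 9, 3, 7, 6]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle of the sorted list (two middles are averaged)
mode = statistics.mode(values)      # most frequent value

# Filling missing entries (None) with the median of the observed values.
raw = [4, None, 9, 3, None, 6]
observed = [v for v in raw if v is not None]
fill = statistics.median(observed)
imputed = [fill if v is None else v for v in raw]

print(mean, median, mode)
print(imputed)
```

The median is often preferred over the mean for this purpose when the data contains extreme values, because it is less affected by them.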

4. RANGE : The Range is the difference between the lowest and highest values. Example: In {4, 6, 9, 3, 7} the lowest value is 3, and the highest is 9. So the range is 9 − 3 = 6.

5. POPULATION, SAMPLE, POPULATION MEAN, SAMPLE MEAN :

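POPULATION : A population is the complete set of similar items or events that we want to study.

SAMPLE : A sample is a smaller collection of items drawn from the population.

Every dataset that we use to train an ML model is a sample of data.

Population vs sample use case : an exit poll in an election, where the voters who are surveyed form the sample and all voters form the population.

POPULATION MEAN : The population mean is the average of a characteristic over the whole population.

SAMPLE MEAN : The sample mean is the average of the same characteristic over the sample data.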
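
As a rough illustration of the difference, the sketch below builds a hypothetical population of 10,000 heights (the numbers are simulated, not real data), draws a random sample of 200, and compares the two means.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: heights in cm, simulated for illustration only.
population = [random.gauss(170, 10) for _ in range(10_000)]

# A sample is a smaller collection drawn from the population.
sample = random.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(round(population_mean, 2), round(sample_mean, 2))
```

The sample mean is usually close to the population mean but not identical to it, which is why it is treated as an estimate.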

Figure: Population vs sample

Figure: Population mean vs sample mean

6. VARIANCE :

Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from their average value.

7. STANDARD DEVIATION AND MEASURE OF DISPERSION :

Standard deviation (SD) is the most commonly used measure of dispersion. It measures the spread of data about the mean. SD is the square root of the sum of squared deviations from the mean divided by the number of observations. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
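
The definitions above can be followed step by step in code. This is a minimal sketch on made-up numbers; it also shows the sample variance, which divides by n - 1 instead of n and is what `statistics.variance` computes.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Population variance: average squared deviation from the mean.
population_variance = sum(squared_deviations) / len(data)

# Standard deviation: square root of the variance.
population_sd = math.sqrt(population_variance)

# Sample variance: divide by n - 1 when the data is a sample.
sample_variance = sum(squared_deviations) / (len(data) - 1)

print(population_variance, population_sd, sample_variance)
```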

8. GAUSSIAN/NORMAL DISTRIBUTION :
The normal distribution, also called the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. In graph form, the normal distribution appears as a bell curve. Any Gaussian distribution can be converted to the standard normal distribution (mean = 0 and standard deviation = 1) with z = (x - mean) / standard deviation, where z is the z-score.

9. STANDARD NORMAL DISTRIBUTION :

The standard normal distribution is a normal distribution with a mean of zero and standard deviation of 1.

Empirical rule :
About 68.3% of the values lie within 1 standard deviation of the mean
About 95.4% of the values lie within 2 standard deviations of the mean
About 99.7% of the values lie within 3 standard deviations of the mean
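
These percentages can be checked with `statistics.NormalDist` from the Python standard library. The helper name `fraction_within` is only an illustrative choice.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Probability that a standard normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.1%}")
```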

10. Z-SCORE :

The value of the z-score tells you how many standard deviations you are away from the mean. If a z-score is equal to 0, the value is on the mean. A positive z-score indicates the raw score is higher than the mean, and a negative z-score indicates it is lower. For example, if a z-score is equal to +1, it is 1 standard deviation above the mean.
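
A minimal sketch of the calculation, z = (x - mean) / standard deviation, using hypothetical exam scores with a mean of 70 and a standard deviation of 10:

```python
def z_score(x, mean, standard_deviation):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / standard_deviation

on_the_mean = z_score(70, mean=70, standard_deviation=10)
one_sd_above = z_score(80, mean=70, standard_deviation=10)
below_the_mean = z_score(55, mean=70, standard_deviation=10)

print(on_the_mean, one_sd_above, below_the_mean)
```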

10. PROBABILITY DENSITY FUNCTION :

A probability density function (PDF), or density of a continuous random variable, is a function whose value at any given point in the sample space can be interpreted as the relative likelihood that the value of the random variable would equal that point.

11. CUMULATIVE DISTRIBUTION FUNCTION :

The cumulative distribution function (CDF) of a real-valued random variable X, evaluated at x, is the probability that X will take a value less than or equal to x, that is, F(x) = P(X ≤ x).
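
For a continuous variable, the CDF at x is the area under the density curve up to x. The sketch below checks this numerically for the standard normal distribution. The function `cdf_by_integration` and its lower cut-off of -8 are illustrative assumptions, not a standard API.

```python
from statistics import NormalDist

dist = NormalDist(mu=0, sigma=1)

def cdf_by_integration(x, lower=-8.0, steps=20_000):
    """Approximate P(X <= x) by integrating the density with the trapezoid rule."""
    width = (x - lower) / steps
    total = 0.5 * (dist.pdf(lower) + dist.pdf(x))
    for i in range(1, steps):
        total += dist.pdf(lower + i * width)
    return total * width

approx = cdf_by_integration(1.0)
exact = dist.cdf(1.0)

print(round(approx, 4), round(exact, 4))
```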

12. HYPOTHESIS TESTING :

Hypothesis testing in statistics is a way for you to test the results of a survey or experiment to see whether you have meaningful results. You are essentially testing whether your results are valid by working out the odds that they happened by chance. If your results may have happened by chance, the experiment will not be repeatable and so has little use.
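
As one small, concrete case, suppose a coin lands heads 60 times in 100 flips and we ask how likely a result at least that far from 50 would be if the coin were fair. The sketch below computes that probability (the p-value) exactly. The 0.05 cut-off is the conventional significance level and is an assumption here, not something fixed by the method.

```python
from math import comb

def two_sided_p_value(heads, flips):
    """Exact two-sided p-value under the null hypothesis that the coin is fair."""
    expected = flips / 2
    observed_gap = abs(heads - expected)
    # Count all outcomes at least as far from the expected count as the observed one.
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - expected) >= observed_gap
    )
    return extreme / 2 ** flips

p_value = two_sided_p_value(60, 100)
significant = p_value < 0.05  # assumed significance level

print(round(p_value, 4), significant)
```

A p-value above the chosen level means the result could plausibly have happened by chance, so the hypothesis of a fair coin is not rejected.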

13. KERNEL DENSITY ESTIMATION (KDE) :

Kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable.

Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel.
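
A minimal sketch of the idea with a Gaussian kernel: every data point contributes a small bell curve, and the estimate is their average. The bandwidth of 0.5 and the sample values are arbitrary choices for illustration.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density at a point x."""
    n = len(samples)
    normaliser = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        return sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        ) / normaliser

    return density

samples = [1.0, 1.2, 1.9, 2.1, 2.4, 4.8, 5.1]
density = gaussian_kde(samples, bandwidth=0.5)

# A valid density integrates to 1; check with a simple Riemann sum over [-5, 10).
step = 0.01
area = sum(density(-5 + i * step) for i in range(1500)) * step

print(round(density(2.0), 3), round(density(3.5), 3), round(area, 3))
```

A larger bandwidth gives a smoother curve, and a smaller one follows the individual points more closely.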

14. CENTRAL LIMIT THEOREM :
The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normally distributed. This holds regardless of the shape of the population distribution: the sampling distribution of the mean approaches normality as the sample size (N) increases.
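
A small simulation can make this visible. The sketch below assumes an exponential population (mean 1, standard deviation 1), which is strongly skewed, and looks at the means of 2,000 samples of size 30. The seed, sample size and number of samples are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

def sample_mean(size):
    """Mean of one random sample drawn from an exponential population."""
    return statistics.mean(random.expovariate(1.0) for _ in range(size))

sample_size = 30
means = [sample_mean(sample_size) for _ in range(2000)]

center = statistics.mean(means)   # expected to be close to the population mean, 1
spread = statistics.stdev(means)  # expected to be close to 1 / sqrt(30)

print(round(center, 3), round(spread, 3))
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the population itself is far from normal.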

15. SKEWNESS :
Skewness refers to distortion or asymmetry in a symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified as a measure of how much a given distribution differs from a normal distribution.
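
One common way to put a number on this is the moment coefficient of skewness: the average cubed deviation from the mean divided by the cube of the standard deviation. The sketch below is one such measure among several, applied to made-up data.

```python
import statistics

def skewness(data):
    """Moment coefficient of skewness (population version)."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    n = len(data)
    return sum((x - mean) ** 3 for x in data) / (n * sd ** 3)

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]        # long tail to the right
left_skewed = [-10, -3, -2, -2, -1, -1]   # long tail to the left

print(skewness(symmetric), skewness(right_skewed), skewness(left_skewed))
```

A value near zero suggests a symmetric distribution, a positive value a long right tail, and a negative value a long left tail.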

16. COVARIANCE :
Covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, the covariance is positive. If greater values of one variable mainly correspond with lesser values of the other, the covariance is negative.
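
A minimal sketch of the sample covariance (dividing by n - 1), applied to made-up data where one variable rises with study hours and another falls:

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 68, 74]   # tends to rise with hours
errors = [9, 7, 6, 4, 2]       # tends to fall with hours

positive = covariance(hours, score)
negative = covariance(hours, errors)

print(positive, negative)
```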

Figure: Positive, negative and zero covariance

17. PEARSON CORRELATION COEFFICIENT :
Pearson's correlation coefficient (r) is a measure of the strength of the linear relationship between two variables. The Pearson correlation coefficient helps in feature selection.

The Pearson correlation coefficient lies between -1 and 1.

The Pearson correlation coefficient tells both the magnitude and the direction of the relationship.
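
The coefficient is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch on made-up data, with `pearson_r` as an illustrative helper name:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

x = [1, 2, 3, 4, 5]
r_up = pearson_r(x, [2, 4, 6, 8, 10])     # perfectly linear, increasing
r_down = pearson_r(x, [10, 8, 6, 4, 2])   # perfectly linear, decreasing

print(r_up, r_down)
```

The sign gives the direction of the relationship and the absolute value gives its strength.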

18. SPEARMAN RANK CORRELATION :
It assesses how well the relationship between two variables can be described using a monotonic function (a function between ordered sets that preserves or reverses the given order).

Spearman's rank correlation coefficient tells magnitude and direction even for non-linear data and data with outliers.
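
Spearman's coefficient can be computed as the Pearson correlation of the ranks of the values. The sketch below assumes there are no tied values (ties need averaged ranks, which are left out here) and uses made-up data with a non-linear relationship and with one outlier.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def ranks(values):
    """Rank from 1 (smallest) to n; this sketch assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson_r(ranks(xs), ranks(ys))

x = [1, 2, 3, 4, 5, 6]
y_cubed = [v ** 3 for v in x]          # monotonic but non-linear
y_outlier = [2, 4, 6, 8, 10, 500]      # linear except for one extreme point

print(round(pearson_r(x, y_cubed), 3), round(spearman_rho(x, y_cubed), 3))
print(round(pearson_r(x, y_outlier), 3), round(spearman_rho(x, y_outlier), 3))
```

In both cases the ranks agree perfectly, so Spearman's coefficient is 1, while Pearson's coefficient is pulled below 1 by the curvature or by the outlier.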

Figure: Formula of Spearman rank correlation

Figure: Same result when there is no outlier

Figure: Spearman gives a better result when there is an outlier

Figure: Positive Spearman correlation

Figure: Negative Spearman correlation

19. Q-Q PLOT :
A Q-Q (quantile-quantile) plot is a probability plot, which is a graphical method for comparing two probability distributions by plotting their quantiles against each other. A Q-Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale and skewness are similar or different in the two distributions.

20. CHEBYSHEV’S INEQUALITY :
Chebyshev's inequality guarantees that, for a wide class of probability distributions, no more than a certain fraction of values can be more than a certain distance from the mean.

Specifically, no more than 1/k² of the distribution's values can be more than k standard deviations away from the mean (or equivalently, at least 1 − 1/k² of the distribution's values are within k standard deviations of the mean).
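
The bound can be checked against any dataset. The sketch below uses deliberately skewed, made-up numbers and compares the observed fraction of values within k standard deviations with the guaranteed minimum of 1 − 1/k².

```python
import statistics

def fraction_within(data, k):
    """Observed fraction of values within k standard deviations of the mean."""
    mean = statistics.mean(data)
    sd = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mean) <= k * sd]
    return len(inside) / len(data)

def chebyshev_lower_bound(k):
    """Minimum fraction guaranteed by Chebyshev's inequality."""
    return 1 - 1 / k ** 2

data = [1, 1, 1, 2, 2, 3, 4, 7, 15, 40]  # skewed, clearly not normal

for k in (1.5, 2, 3):
    print(k, fraction_within(data, k), round(chebyshev_lower_bound(k), 3))
```

The observed fraction is often much higher than the bound, since the inequality has to hold for every distribution with a finite variance.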

21. BINOMIAL DISTRIBUTION :
A binomial distribution can be thought of as simply the probability of a SUCCESS or FAILURE outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has two possible outcomes (the prefix "bi" means two, or twice). For example, a coin toss has only two possible outcomes: heads or tails, and taking a test could have two possible outcomes: pass or fail.

Binomial distributions must also meet the following three criteria:

A. The number of observations or trials is fixed.

B. Each observation or trial is independent.

C. The probability of success is exactly the same from one trial to another.

Real Life Examples :

If a new drug is introduced to cure a disease, it either cures the disease (it's a success) or it doesn't cure the disease (it's a failure). If you buy a lottery ticket, you're either going to win money, or you aren't. Basically, anything you can think of that can only be a success or a failure can be represented by a binomial distribution.

BINOMIAL DISTRIBUTION FORMULA :

$$P(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x}$$

Here n stands for the number of times the experiment runs, p represents the probability of one specific outcome (a success), and x is the number of successes.
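
The standard binomial probability, C(n, x) · p^x · (1 − p)^(n − x), can be computed directly with `math.comb`. The sketch below uses n and p as described above and x for the number of successes. The coin-toss numbers are only an example.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Probability of exactly 5 heads in 10 tosses of a fair coin.
p_five_heads = binomial_pmf(5, 10, 0.5)

# The probabilities of all possible outcomes (0 to 10 heads) add up to 1.
total = sum(binomial_pmf(x, 10, 0.5) for x in range(11))

print(p_five_heads, total)
```
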
22. BERNOULLI DISTRIBUTION :
A Bernoulli distribution is a discrete probability distribution for a Bernoulli trial, a random experiment that has only two outcomes (usually called a "Success" or a "Failure"). For example, the probability of getting heads (a "success") while flipping a coin is 0.5. The probability of "failure" is 1 - p (1 minus the probability of success, which also equals 0.5 for a coin toss). It is a special case of the binomial distribution for n = 1. In other words, it is a binomial distribution with a single trial (e.g. a single coin toss).

23. LOG-NORMAL DISTRIBUTION :

A log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable X is log-normally distributed, then Y = ln(X) has a normal distribution.
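
The relationship between X and Y = ln(X) can be checked by simulation. The sketch below draws log-normal values with `random.lognormvariate`, using the arbitrary parameters mu = 0 and sigma = 0.5, and confirms that their logarithms have roughly that mean and standard deviation.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# X is log-normal: the underlying normal has mean 0 and standard deviation 0.5.
x = [random.lognormvariate(0.0, 0.5) for _ in range(5000)]

# Y = ln(X) should look normal with the same mean and standard deviation.
y = [math.log(v) for v in x]

print(round(statistics.mean(y), 3), round(statistics.stdev(y), 3))
print(round(statistics.mean(x), 3), round(statistics.median(x), 3))
```

The mean of X lies above its median, which reflects the long right tail of the log-normal distribution.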

The log-normal distribution is extremely useful when analyzing stock prices, as long as the growth factor used is assumed to be normally distributed. The log-normal distribution curve can therefore be used to help better identify the compound return that the stock can expect to achieve over a period of time. Note that log-normal distributions are positively skewed with long right tails due to low mean values and high variances in the random variables.

24. POWER LAW :

The power law (also called the scaling law) states that a relative change in one quantity results in a proportional relative change in another. The simplest example of the law in action is a square: if you double the length of a side (say, from 2 to 4 inches), then the area will quadruple (from 4 to 16 square inches).
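
Written as y = a · x^k, the square example has a = 1 and k = 2. The short sketch below, with `power_law` as an illustrative helper, shows that doubling x multiplies y by the same factor (2^k) wherever you start.

```python
def power_law(x, a=1.0, k=2.0):
    """y = a * x ** k; with a = 1 and k = 2 this is the area of a square of side x."""
    return a * x ** k

area_small = power_law(2)   # side of 2 inches
area_large = power_law(4)   # side of 4 inches

# Doubling the side multiplies the area by 2 ** 2 = 4, at any starting size.
ratio_small = area_large / area_small
ratio_big = power_law(20) / power_law(10)

print(area_small, area_large, ratio_small, ratio_big)
```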

25. BOX-COX TRANSFORM :

A Box-Cox transformation is a transformation of non-normal dependent variables into a normal shape. Normality is an important assumption for many statistical techniques; if your data isn't normal, applying a Box-Cox transformation means that you are able to run a broader number of tests.
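
The one-parameter Box-Cox transform is (x^λ − 1) / λ for λ ≠ 0 and ln(x) for λ = 0, and it is defined for positive values only. The sketch below applies it for two hand-picked values of λ. In practice λ is usually chosen by a fitting procedure, which is not shown here.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam

skewed = [1, 2, 4, 8, 16, 32]   # made-up, right-skewed values

log_like = [box_cox(v, 0) for v in skewed]      # lambda = 0 is the log transform
sqrt_like = [box_cox(v, 0.5) for v in skewed]   # lambda = 0.5 behaves like a square root

print([round(v, 3) for v in log_like])
print([round(v, 3) for v in sqrt_like])
```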

26. POISSON DISTRIBUTION :

The Poisson distribution is the discrete probability distribution of the number of events occurring in a given time period, given the average number of times the event occurs over that time period.

Example : A certain fast-food restaurant gets an average of 3 visitors to the drive-through per minute. This is only an average, however. The actual number can vary.
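
For the drive-through example, the probability of seeing exactly k visitors in a minute is λ^k · e^(−λ) / k! with λ = 3. A minimal sketch, with `poisson_pmf` as an illustrative helper name:

```python
import math

def poisson_pmf(k, rate):
    """Probability of exactly k events when the average number of events is `rate`."""
    return rate ** k * math.exp(-rate) / math.factorial(k)

rate = 3  # average visitors per minute

p_exactly_3 = poisson_pmf(3, rate)
p_none = poisson_pmf(0, rate)
p_more_than_5 = 1 - sum(poisson_pmf(k, rate) for k in range(6))

print(round(p_exactly_3, 4), round(p_none, 4), round(p_more_than_5, 4))
```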

27. NON-GAUSSIAN DISTRIBUTION :
Although the normal distribution takes center stage in statistics, many processes follow a non-normal distribution. This can be because the data naturally follows a specific type of non-normal distribution (for example, bacteria growth naturally follows an exponential distribution). In other cases, your data collection methods or other methodologies may be at fault.

Dealing with Non-Normal Distributions

You have several options for handling non-normal data. Many tests, including the one-sample Z test, T test and ANOVA, assume normality. You may still be able to run these tests if your sample size is large enough (usually over 20 items). You can also choose to transform the data with a function, forcing it to fit a normal model. However, if you have a very small sample, a sample that is skewed or one that naturally fits another distribution type, you may want to run a non-parametric test. A non-parametric test is one that doesn't assume the data fits a specific distribution type. Non-parametric tests include the Wilcoxon signed rank test, the Mann-Whitney U test and the Kruskal-Wallis test.