Statistics is a form of mathematical analysis that draws reasonable conclusions from data. It is useful when evaluating claims, drawing key insights, or making predictions and decisions. Let us look at some common terms used in statistics before we jump into the different types of statistical analysis.
Common Terms used in Statistics
Population and Sample
In statistics, the population includes all members of a defined group. For example, the population could comprise all customers using the mobile network of a company ABC across geographies. It is often not possible to survey the entire population given its size. Hence, a sample, a smaller group drawn from the population, is selected to represent it.
By studying the sample, the researcher tries to draw valid conclusions about the population.
Types of Data
Data comes in various forms, such as age, income, sales, or profit, and race, gender, name, or address. The data type determines the type of statistics that can be used. Data types can broadly be defined as 'quantitative' if the data is numerical and 'qualitative' if not. Qualitative data can also be unstructured, like photographs, videos, sound recordings, and so on. Let us look at the different types of data.
Statistical Distributions
A statistical distribution is a mathematical function that gives the probabilities of all possible outcomes of a random variable. Distributions are discrete or continuous depending on the variable they model. Example: when you roll a fair die, you can get 1, 2, 3, 4, 5, or 6, and all six outcomes have an equal chance.
Let us understand some common examples.
Bernoulli distribution (Discrete Distribution): A Bernoulli distribution has only two possible outcomes, namely 1 (success) and 0 (failure), and a single trial.
Normal distribution (Continuous Distribution): This is the most common distribution. Here, the mean, median, and mode of the distribution are equal, and the curve of the distribution is bell-shaped, unimodal, and symmetrical about the mean. The spread of the curve is determined by its standard deviation (σ) showing that more data is near the mean (μ). The total area under the curve is 1, as it represents the probability of all outcomes.
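As a quick illustration (a hypothetical sketch using Python's standard library), we can draw samples from a normal distribution and check these properties empirically, including the fact that roughly 68% of values fall within one standard deviation of the mean:

```python
import random
import statistics

# Hypothetical example: sample from a normal distribution with
# mean (mu) = 50 and standard deviation (sigma) = 10.
random.seed(42)
samples = [random.gauss(50, 10) for _ in range(100_000)]

mean = statistics.mean(samples)
sd = statistics.pstdev(samples)
print(round(mean, 1), round(sd, 1))  # close to 50 and 10

# About 68% of the data lies between mu - sigma and mu + sigma
within_1sd = sum(1 for x in samples if mean - sd <= x <= mean + sd) / len(samples)
print(round(within_1sd, 2))  # close to 0.68
```

The sample mean and standard deviation approach the true parameters as the sample size grows.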
Statistical analysis is the process of collecting, observing, manipulating, summarizing, and interpreting qualitative or quantitative data to identify trends and relationships in the data.
But how can the population or sample be studied to get an understanding of the data?
A statistic, which is a measure of a characteristic of a sample (e.g., the mean is a statistic that measures the average of a sample), is used. It gives an estimate of the corresponding value for the population from which the sample was selected.
Statistical analysis can be divided into descriptive statistics, which help us understand the data by providing summaries such as percentages, means, variances, and correlations, and inferential statistics, which help infer properties of the population using t-tests, chi-square tests, regression, and analysis of variance (ANOVA). We will cover inferential statistics in another article and study descriptive statistics in detail here.
Descriptive statistics provide simple, quantitative summaries of datasets, usually combined with descriptive graphics. After data collection, the first step is to get basic information about the data using descriptive statistics. This provides easy-to-understand information that helps answer basic questions about the average, spread, deviation of values, and so on. They give analysts a rough idea of what the data indicates so that they can later perform more formal and targeted analyses. They do not rely on probability theory and are frequently non-parametric statistics.
The sections below describe the different descriptive statistics.
Measures of Central Tendency
These statistics are one-number summaries that describe the center of the data, giving a typical or middle value.
Mean (μ or X̄): The mean is the ratio of the sum of all the values in the data to the total number of values: X̄ = ΣXi / n. It applies to numerical variables only.
Suppose we have a sample of student grades: 25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50. Using the formula, the mean is 650/11 = 59.09.
It is influenced by every value in the data and hence can be misleading when the data is skewed by outliers (extremely small or large values). If we add an outlier of 200, the mean jumps drastically to 70.83 and no longer gives a representative central point of the data.
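Using the grades above, a short Python snippet shows how a single outlier shifts the mean:

```python
import statistics

grades = [25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50]
print(round(statistics.mean(grades), 2))  # 59.09

# A single extreme value pulls the mean toward it
grades_with_outlier = grades + [200]
print(round(statistics.mean(grades_with_outlier), 2))  # 70.83
```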
Median (M): If the mean becomes misleading because of skewed data or outliers, we can use the median as another way of representing the typical value. Median is the value that divides the data into two equal parts when data is arranged in either ascending or descending order. It applies to numerical variables only.
- It is the middle term when the number of terms is odd. The median of 25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50, after arranging in ascending order (25, 40, 50, 54, 57, 60, 65, 69, 75, 75, 80), is 60.
- It is the average of the two middle terms when the number of terms is even. The median of 200, 25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50, after arranging in ascending order (25, 40, 50, 54, 57, 60, 65, 69, 75, 75, 80, 200), is 62.5 (the average of 60 and 65).
Not every entry influences the median. In the example above, the outlier of 200 barely moved the median (from 60 to 62.5). Hence, the median is more robust than the mean.
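The same data illustrates the median's robustness:

```python
import statistics

grades = [25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50]
print(statistics.median(grades))  # 60 (odd count: the middle term)

# Adding the outlier barely moves the median
print(statistics.median(grades + [200]))  # 62.5 (even count: mean of 60 and 65)
```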
Mode (Mo): The mode of a data set is the most frequently occurring value. Unlike the mean and median, the mode is always a value that appears in the data set. It applies to numerical as well as categorical variables.
Mode of 25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50 is 75.
A data set with no repeating values has no mode; a data set with two modes is bimodal, and one with more than two modes is multimodal.
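Python's `statistics` module computes the mode directly, and `multimode` handles bimodal or multimodal data:

```python
import statistics

grades = [25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50]
print(statistics.mode(grades))  # 75 (appears twice)

# multimode returns every most-frequent value, covering bimodal data
print(statistics.multimode([1, 1, 2, 2, 3]))  # [1, 2]
```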
Measures of Variability/Dispersion
Even if two data sets have the same mean, they may still differ. We can distinguish them by looking at how the values spread out from the mean. Measures of dispersion describe the spread of the data around the central value.
Range: The range is a very easy way of measuring how spread out the values are. It is the difference between the largest and the smallest data values.
Range of 25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50 is 80 – 25 = 55
It describes only the width of the data, not how the values are distributed within that range. It is sensitive to outliers and can give misleading results when they are present.
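A quick sketch of the range and its sensitivity to outliers, using the grades from earlier:

```python
grades = [25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50]

# Range: largest value minus smallest value
print(max(grades) - min(grades))  # 80 - 25 = 55

# A single outlier inflates the range dramatically
print(max(grades + [200]) - min(grades + [200]))  # 200 - 25 = 175
```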
Percentiles: Percentiles indicate the value below which a given percentage of the data in a data set falls. They help locate a value within a distribution, divide the data set into portions, identify the central tendency, and measure dispersion. The median is the 50th percentile. To calculate a percentile, the values in the data set must be in ascending order. The ordinal rank of the value marking the Pth percentile is:
n = (P / 100) × N
where N = number of values in the data set, P = percentile, and n = ordinal rank of a given value in the ascending-ordered data.
The ordinal rank of the score marking the 20th percentile of 25, 40, 75, 80, 65, 69, 60, 80, 84, 64, 76, 57, 75, 54, 50 is (20/100) × 15 = 3. Arranged in ascending order, the data is: 25, 40, 50, 54, 57, 60, 64, 65, 69, 75, 75, 76, 80, 80, 84. The 3rd value is 50, which marks the 20th percentile: 20% of students earned a score of 50 or lower.
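The ordinal-rank method above can be sketched in Python. Note that this is only one convention; libraries such as NumPy use interpolation-based methods by default, which can give slightly different answers.

```python
scores = [25, 40, 75, 80, 65, 69, 60, 80, 84, 64, 76, 57, 75, 54, 50]

def percentile_value(data, p):
    """Return the value at the p-th percentile using the
    ordinal-rank method: n = (p / 100) * N, counting from 1."""
    ordered = sorted(data)
    n = round(p / 100 * len(ordered))
    return ordered[max(n - 1, 0)]  # convert 1-based rank to 0-based index

print(percentile_value(scores, 20))  # 50: 20% of students scored 50 or lower
```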
Quartiles: Quartiles are values that divide the data into quarters, and they are based on percentiles. Q2 is the median. The interquartile range (IQR) is the difference between the upper (75th percentile, Q3) and lower (25th percentile, Q1) quartiles. It is much less sensitive to outliers than the range because it uses only the central 50% of the data. A larger IQR indicates that the data are more spread out. Box-and-whisker plots are very helpful for visualizing quartiles.
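Python's `statistics.quantiles` computes the quartile cut points; keep in mind that different quartile conventions can yield slightly different values.

```python
import statistics

scores = [25, 40, 75, 80, 65, 69, 60, 80, 84, 64, 76, 57, 75, 54, 50]

# n=4 splits the data into four equal groups, returning Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1
print(q1, q2, q3)  # Q2 equals the median
print(iqr)         # spread of the central 50% of the data
```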
Variance (σ²): The range and interquartile range indicate the difference between high and low values, but they say little about the variability of individual data points. Variance measures spread as the average of the squared distances of the values from the mean:
σ² = Σ(Xi − X̄)² / N
where N is the total number of data points, Xi are the data values, and X̄ is the mean.
The sum of the signed distances of the values from the mean is always 0, hence we square them. A high variance indicates that data points are spread widely across the range.
Standard Deviation (σ): The variance gives the spread in terms of squared distance from the mean, so its unit of measurement is not the same as the original data. For example, if the data is in meters, the variance is in square meters, which is not very intuitive. Taking the square root of the variance gives the standard deviation (σ).
The smaller the standard deviation, the closer values are to the mean. The smallest value the standard deviation can take is 0.
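The population variance and standard deviation of the grades from earlier can be computed directly from the formula, or with the standard library's `pvariance` and `pstdev`:

```python
import statistics

grades = [25, 40, 75, 80, 65, 69, 60, 57, 75, 54, 50]

# Population variance: average squared distance from the mean
mean = statistics.mean(grades)
variance = sum((x - mean) ** 2 for x in grades) / len(grades)
print(round(variance, 2))

# pvariance/pstdev compute the same population measures
print(round(statistics.pstdev(grades), 2))  # square root of the variance
```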
Standard Scores / Z-scores: Standard scores give you a way of comparing values across data sets whose means and standard deviations differ. For example, if you want to compare sales across two locations with different means and standard deviations, standard scores help.
Such comparisons are possible by ‘standardizing’ the distribution. Standard normal distribution is a special normal distribution with a mean of 0 and a standard deviation (SD) of 1.
The standard score is the number of standard deviations from the mean. If it is 0, it is equal to the mean. If a value is within 1 standard deviation of the mean, it is in the central part between μ − σ and μ + σ. It can be positive or negative (above or below the mean).
The z-score is measured in terms of standard deviations from the mean and shows how far a value is from the mean:
z = (x − μ) / σ
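The sales comparison mentioned above can be sketched as follows (the location figures are hypothetical, purely for illustration):

```python
def z_score(value, mean, sd):
    """Number of standard deviations a value lies from the mean."""
    return (value - mean) / sd

# Hypothetical figures: Location A has mean sales 100 with SD 20;
# Location B has mean sales 150 with SD 10.
print(z_score(120, 100, 20))  # 1.0 -> one SD above A's mean
print(z_score(160, 150, 10))  # 1.0 -> one SD above B's mean
```

Although 120 and 160 are very different raw numbers, both are exactly one standard deviation above their location's mean, so they represent equally strong performance.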
Skewness: Skewness measures the asymmetry of a probability distribution. The relative positions of the mean, median, and mode shift with the direction of the skew: in a positively (right) skewed distribution the mean lies above the median, and in a negatively (left) skewed distribution it lies below. A histogram is effective for showing skewness.
Pearson's second coefficient of skewness (median skewness) is a common way of calculating skew:
Sk = 3(X̄ − M) / σ
The sign gives the direction of skewness; the larger the absolute value, the more the distribution differs from a normal distribution.
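Pearson's second coefficient of skewness can be sketched directly from the formula:

```python
import statistics

def pearson_median_skewness(data):
    """Pearson's second coefficient: 3 * (mean - median) / sd."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    sd = statistics.pstdev(data)
    return 3 * (mean - median) / sd

# Right-skewed data: one large value pulls the mean above the median
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]
print(round(pearson_median_skewness(right_skewed), 2))  # positive -> right skew
```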
Kurtosis: Kurtosis describes whether the data is light-tailed (indicating a lack of outliers) or heavy-tailed (indicating the presence of outliers) compared to a normal distribution. A histogram works well to show kurtosis. There are three types:
- Mesokurtic distributions: Excess kurtosis is zero; the tails are similar to those of the normal distribution.
- Leptokurtic distributions: The tails are heavy, indicating the presence of outliers, and kurtosis is higher than that of the normal distribution.
- Platykurtic distributions: The tails are thin, indicating a lack of outliers, and kurtosis is lower than that of the normal distribution.
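Excess kurtosis (kurtosis minus the normal distribution's value of 3) can be sketched from its definition; positive values indicate heavy tails:

```python
import statistics

def excess_kurtosis(data):
    """Population excess kurtosis: E[(x - mean)^4] / variance^2 - 3.
    Zero for a normal distribution (mesokurtic)."""
    mean = statistics.mean(data)
    var = statistics.pvariance(data)
    n = len(data)
    return sum((x - mean) ** 4 for x in data) / (n * var ** 2) - 3

# Heavy-tailed (leptokurtic) sample: the two extreme values dominate
print(excess_kurtosis([0, 0, 0, 0, 0, 0, 0, 0, -10, 10]))  # positive
```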
Measures of Association
Measures of association quantify the relationship between variables. Covariance and correlation are closely related concepts; both describe how two variables vary together.
Covariance: Covariance evaluates the extent to which one variable changes in relation to another. It indicates only the direction of the relationship, not its strength.
A positive covariance denotes a direct relationship, whereas a negative covariance denotes an inverse relationship. Covariance can take any value from −∞ to +∞. Its magnitude is not standardized and depends on the magnitudes of the variables. The formula for the population covariance is:
cov(X, Y) = Σ(Xi − X̄)(Yi − Ȳ) / N
It helps find essential variables on which other variables depend and predict one variable from another.
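A minimal sketch of population covariance (the advertising-spend and sales figures are hypothetical, purely for illustration):

```python
def covariance(xs, ys):
    """Population covariance: average product of paired deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

# Hypothetical data: advertising spend vs. resulting sales
spend = [10, 20, 30, 40, 50]
sales = [12, 24, 33, 42, 55]
print(covariance(spend, sales))  # positive -> direct relationship
```

Note that the result's magnitude depends on the units of the inputs, which is exactly why correlation is preferred when comparing strengths of relationships.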
Correlation (ρ): Correlation is an important statistical technique for multivariate analysis that shows how strongly variables are related. It measures both the direction (positive, negative, or none) and the strength of the linear relationship between two variables. It is a function of the covariance, obtained by dividing the covariance by the product of the standard deviations of the two variables:
ρ = cov(X, Y) / (σX σY)
The difference between covariance and correlation lies in their values: correlation is standardized. The correlation coefficient (r) is dimensionless and ranges from −1 to +1. A value of r closer to −1 or +1 indicates a strong negative or positive correlation, respectively.
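Dividing the covariance by the product of the standard deviations yields the standardized coefficient, as this sketch shows for perfectly correlated data:

```python
import statistics

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the product of SDs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    return cov / (statistics.pstdev(xs) * statistics.pstdev(ys))

xs = [1, 2, 3, 4, 5]
print(round(correlation(xs, [2, 4, 6, 8, 10]), 2))  # 1.0  (perfect positive)
print(round(correlation(xs, [10, 8, 6, 4, 2]), 2))  # -1.0 (perfect negative)
```

On Python 3.10+, `statistics.correlation` provides the same coefficient (computed with sample statistics) out of the box.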
Techcanvass is an IT training and consulting organization. We are an IIBA Canada Endorsed education provider (EEP) and offer business analysis certification courses for professionals.