To understand skewness, one first needs to understand the importance of statistics in machine learning.
Skewed data is very common in data science; it is simply a distortion in the distribution of the data. Left unattended, this distortion can lead to a poorly fitted machine learning model and therefore to wrong predictions, which may cause serious business losses. For example, take the monthly salaries of the individuals shown in Table (i) below.
If a data set contains a value that is much higher or much lower than all the other values in it, the data is referred to as skewed data.
Using mean to calculate mid point of the data set
| Individuals | Salary/month |
|---|---|
| A | 30k |
| B | 35k |
| C | 38k |
| D | 40k |
| E | 39k |
| F | 41k |
| G | 34k |
| H | 42k |
| I | 500k |
Table (i)
From the table, the average salary (the mean) is about 88.8k/month (799k divided by 9 individuals). Yet a close look at the table shows that the majority of salaries range roughly from 35k to 42k per month, so a representative average should fall between 35k and 42k. This is clearly not the case. The average is shifted towards the right side of the graph just because one individual has a salary of 500k/month. This type of imbalance is called skewed data. If a graph is drawn of the individuals and their respective salaries, the average can be seen to shift to the right side of the x-axis.
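The arithmetic above can be checked with a short Python sketch. The variable names are illustrative and not part of the original example; the values are those of Table (i), in thousands per month.

```python
from statistics import mean

# Monthly salaries from Table (i), in thousands (individuals A through I).
salaries = [30, 35, 38, 40, 39, 41, 34, 42, 500]

# Mean over all nine individuals, including the 500k value of "I".
mean_all = mean(salaries)

# Mean over the first eight individuals only (individual "I" left out).
mean_without_outlier = mean(salaries[:-1])

print(f"mean with I:    {mean_all:.1f}k")
print(f"mean without I: {mean_without_outlier:.1f}k")
```

With "I" included the mean is about 88.8k; without "I" it is about 37.4k, which sits inside the range where most of the salaries lie.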
Now consider the same example with a slight change. This time individual "I" has taken a loan of 200k and has to pay 100k/month. In this scenario the table looks something like this.
If the data is skewed, the mean shifts to the right or to the left along the axis, depending on whether the extreme value is unusually high or unusually low.
| Individuals | Salary/month |
|---|---|
| A | 30k |
| B | 35k |
| C | 38k |
| D | 40k |
| E | 39k |
| F | 41k |
| G | 34k |
| H | 42k |
| I | -500k |

Table (ii)
Here individual "I" is in debt by 500k, which shifts the average towards the left side of the x-axis: the mean falls to about -22.3k/month. From the above examples it is clear that, when dealing with skewed data, the mean is not a good measure of the midpoint, because it is sensitive to extreme values. To build better models we therefore use another statistical tool, the median.
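One common way to put a number on the direction of the shift is the moment coefficient of skewness: the third central moment divided by the second central moment raised to the power 1.5. This measure is not used in the text above; the sketch below is only a minimal illustration with hypothetical names, applied to the values of Table (i) and Table (ii).

```python
from statistics import mean

def skewness(values):
    """Population moment coefficient of skewness: m3 / m2**1.5."""
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mu) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Salaries in thousands per month, individuals A through I.
table_i = [30, 35, 38, 40, 39, 41, 34, 42, 500]    # "I" at 500k
table_ii = [30, 35, 38, 40, 39, 41, 34, 42, -500]  # "I" at -500k

for name, data in (("Table (i)", table_i), ("Table (ii)", table_ii)):
    print(f"{name}: mean = {mean(data):.1f}k, skewness = {skewness(data):+.2f}")
```

A positive result indicates a long tail on the right, as in Table (i); a negative result indicates a long tail on the left, as in Table (ii).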
Using median to calculate mid point of the data set
Now take the same example from Table (i), but this time calculate the median instead of the mean and see whether it is affected by the salary of individual "I". To calculate the median, first sort the data in ascending or descending order, then count the number of instances. If the count is odd, the median is the middle value; if the count is even, the median is the average of the two middle values.
| Individuals | Salary/month |
|---|---|
| A | 30k |
| G | 34k |
| B | 35k |
| C | 38k |
| E | 39k |
| D | 40k |
| F | 41k |
| H | 42k |
| I | 500k |
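The sort-and-pick-the-middle procedure can be sketched in Python as follows. The function name `median_of` is hypothetical, and the standard library's `statistics.median` appears only as a cross-check.

```python
from statistics import mean, median

def median_of(values):
    """Median by sorting and taking the middle value(s)."""
    if not values:
        raise ValueError("median_of() needs at least one value")
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

# Salaries from Table (i), in thousands per month (A through I).
salaries = [30, 35, 38, 40, 39, 41, 34, 42, 500]

print("median with I:   ", median_of(salaries))
print("median without I:", median_of(salaries[:-1]))
print("library median:  ", median(salaries))
print("mean with I:     ", round(mean(salaries), 1))
```

With nine values the median is the fifth value in sorted order, 39k. Dropping individual "I" moves it only to 38.5k, whereas the mean moves from about 88.8k to about 37.4k.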