What is Skewed Data?

In order to understand skewness, one first needs to understand the importance of statistics in machine learning.

Skewed data is very common in data science and is nothing but a distortion in the data. If left unattended, this distortion can lead to a poorly fitted machine learning model and therefore to wrong predictions, which may cause serious business losses. For example, take the salaries of the individuals shown in the table below.

If a data set contains a value that is much higher or lower than all the other values in that data set, the data is referred to as skewed data.

Using the mean to calculate the midpoint of the data set

Individuals	Salary/month
A	30k
B	35k
C	38k
D	40k
E	39k
F	41k
G	34k
H	42k
I	500k

Table (i)

From the table, the average salary, or mean, is about 88.8k/month (799k divided by 9 individuals). However, a closer look at the table shows that most individuals earn roughly 35k to 42k per month. One would therefore expect the average to fall between 35k and 42k, but this is clearly not the case. The average is pulled upwards solely because one individual has a salary of 500k/month. This type of data imbalance is called skewed data. If the salaries are plotted along the x-axis of a graph, the mean is clearly shifted to the right of where most of the values lie.
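The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the dictionary name `salaries` and the choice of the standard-library `statistics` module are assumptions made for illustration, and the values are the ones from Table (i), in thousands.

```python
from statistics import mean

# Monthly salaries from Table (i), in thousands (k)
salaries = {"A": 30, "B": 35, "C": 38, "D": 40, "E": 39,
            "F": 41, "G": 34, "H": 42, "I": 500}

# Mean over all nine individuals
mean_all = mean(salaries.values())

# Mean after leaving out individual "I", the extreme value
mean_without_i = mean(v for k, v in salaries.items() if k != "I")

print(f"Mean with I:    {mean_all:.1f}k")        # 88.8k
print(f"Mean without I: {mean_without_i:.1f}k")  # 37.4k
```

Leaving out the single 500k value brings the mean back into the 35k to 42k range where most of the salaries lie.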

Now let us take the same example with a slight change. This time, individual "I" has taken a loan and is 500k in debt. In this scenario the table becomes the following.

If the data is skewed, the mean shifts to the right or to the left, depending on whether the extreme value is very high or very low.

Individuals	Salary/month
A	30k
B	35k
C	38k
D	40k
E	39k
F	41k
G	34k
H	42k
I	-500k

Table (ii)

Here individual "I" is 500k in debt, which shifts the average to the left: the mean drops to about -22.3k. From the above examples it is clear that, when dealing with skewed data, the mean is not a good measure of the midpoint, because it is strongly affected by extreme values. To build better models, we therefore use another statistical tool called the median.
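The leftward shift can be reproduced in the same way. The sketch below assumes the values of Table (ii), in thousands; the variable name `salaries_with_debt` is made up for this example.

```python
from statistics import mean

# Monthly figures from Table (ii), in thousands (k); "I" is now -500k
salaries_with_debt = [30, 35, 38, 40, 39, 41, 34, 42, -500]

mean_with_debt = mean(salaries_with_debt)

print(f"Mean of Table (ii): {mean_with_debt:.1f}k")  # -22.3k
```

A single extreme negative value is enough to make the mean negative, even though eight of the nine individuals have a positive salary.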

Using the median to calculate the midpoint of the data set

Now let us take the same example from Table (i), but this time calculate the median instead of the mean, and see whether the median is affected by the salary of individual "I". To calculate the median, we first sort the data in ascending or descending order. Next, we count the number of instances, n, which can be either odd or even.

Individuals	Salary/month
A	30k
G	34k
B	35k
C	38k
E	39k
D	40k
F	41k
H	42k
I	500k

Table (iii)

If the number of instances n is odd, the median is the value at the middle position of the sorted data:

median = (n+1)/2 th value

If the number of instances n is even, the median is the average of the two middle values of the sorted data:

median = [ (n/2) th value + (n/2 + 1) th value ] / 2

In our example, the number of instances is 9, which is odd. Using the formula for an odd count, the median is at position (9+1)/2 = 5, and the 5th value in the sorted table is 39k. Hence the median is 39k. Whereas the mean gave an average of about 88.8k, the median gives a far more representative value of 39k.
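The odd and even rules can be written as one small function. This is a sketch under stated assumptions, not a reference implementation: the function name `median_by_position` is hypothetical, and the even-count example simply drops individual "I" from the table to get eight values.

```python
def median_by_position(values):
    """Return the median using the (n+1)/2 rule for odd n and the
    average of the two middle values for even n."""
    ordered = sorted(values)           # step 1: sort the data
    n = len(ordered)                   # step 2: count the instances
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # Odd count: the (n+1)/2 th value (1-based), so index (n+1)/2 - 1
        return ordered[(n + 1) // 2 - 1]
    # Even count: average of the n/2 th and (n/2 + 1) th values (1-based)
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


# Salaries from Table (i), in thousands (k): 9 values, odd count
table_i = [30, 35, 38, 40, 39, 41, 34, 42, 500]
median_odd = median_by_position(table_i)

# Same data without individual "I": 8 values, even count
median_even = median_by_position(table_i[:-1])

print(median_odd)   # 39
print(median_even)  # 38.5
```

With or without the 500k salary, the median stays close to 39k.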

If the data is skewed, the median is generally preferred over the mean, since the median is barely affected by extremely high or low values in the data.
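To see this side by side, the sketch below computes the mean and the median for Table (i) and Table (ii) using Python's standard `statistics` module. The variable names are assumptions made for this example.

```python
from statistics import mean, median

# Salaries of individuals A to H, in thousands (k)
others = [30, 35, 38, 40, 39, 41, 34, 42]

table_i = others + [500]    # "I" earns 500k
table_ii = others + [-500]  # "I" is 500k in debt

# Map each table name to its (mean, median) pair
summary = {
    name: (mean(data), median(data))
    for name, data in {"Table (i)": table_i, "Table (ii)": table_ii}.items()
}

for name, (avg, mid) in summary.items():
    print(f"{name}: mean = {avg:.1f}k, median = {mid}k")
# Table (i): mean = 88.8k, median = 39k
# Table (ii): mean = -22.3k, median = 38k
```

The mean swings from about 88.8k to about -22.3k, while the median only moves from 39k to 38k.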
