Friday, January 24, 2020

What is skewed Data?

To understand skewness, one first needs to understand the importance of statistics in machine learning.


Skewed data is very common in data science. Skew is a distortion in the distribution of the data: a few extreme values pull it away from where most of the values lie. If left unattended, this distortion can lead to a poorly fitted machine learning model and therefore to wrong predictions, which may cause serious business losses. For example, take the salaries of individuals shown in the table below.


If a data set contains a value that is much higher or lower than all the other values in it, the data is referred to as skewed data.

Using mean to calculate mid point of the data set

 
Individuals    Salary/month
A              30k
B              35k
C              38k
D              40k
E              39k
F              41k
G              34k
H              42k
I              500k

         Table (i)

From the table, the average (mean) salary is about 88.8k/month (799k divided by 9 individuals). However, a closer look at the table shows that the salaries of most individuals range roughly from 35k to 42k per month, so a representative mid point should lie between 35k and 42k, which is clearly not the case. Here the mean is shifted to the right just because one individual has a salary of 500k/month. This type of data imbalance is called skewed data. If the salaries are plotted along the x-axis of a graph, the mean can be seen to lie far to the right of where most of the values fall.
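As a quick check of the arithmetic, the following minimal sketch computes the mean of table (i) with and without individual "I". The variable names are illustrative and the salaries are expressed in thousands.

```python
from statistics import mean

# Monthly salaries from table (i), in thousands.
salaries = {"A": 30, "B": 35, "C": 38, "D": 40, "E": 39,
            "F": 41, "G": 34, "H": 42, "I": 500}

# Mean over all nine individuals, including the 500k outlier.
mean_all = mean(list(salaries.values()))

# Mean over A to H only, leaving out individual "I".
mean_without_i = mean([v for k, v in salaries.items() if k != "I"])

print(f"mean with I:    {mean_all:.1f}k")
print(f"mean without I: {mean_without_i:.1f}k")
```

A single value moves the mean from inside the 35k to 42k range to well outside it.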

 Now let us take the same example with a slight change. This time, suppose individual "I" is in debt, so that the monthly figure for "I" is -500k instead of 500k. In this scenario the table becomes something like this.

If the data is skewed, the mean shifts towards the extreme value: to the right side of the axis if the data has an unusually large maximum, and to the left side if it has an unusually small minimum.

Individuals    Salary/month
A              30k
B              35k
C              38k
D              40k
E              39k
F              41k
G              34k
H              42k
I              -500k

 table(ii)

Here it can be seen that individual "I" is in debt by 500k, which shifts the average to the left side of the x-axis (to about -22.3k). From the above examples it is clear that, when dealing with skewed data, taking the mean is not an optimal solution, because the mean is affected by every value in the data, including the extreme ones. Hence, in order to develop a better algorithm, we use another statistical tool called the median.
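The left and right shifts can be compared side by side. This is only an illustrative sketch: `base` holds the salaries of individuals A to H (in thousands), and the two outlier values are the ones used in table (i) and table (ii).

```python
from statistics import mean

# Salaries of individuals A to H, in thousands.
base = [30, 35, 38, 40, 39, 41, 34, 42]

mean_high = mean(base + [500])   # table (i): a very large value
mean_low = mean(base + [-500])   # table (ii): a very small value

print(f"mean with +500k outlier: {mean_high:.1f}k")
print(f"mean with -500k outlier: {mean_low:.1f}k")
```

In both cases the mean ends up outside the range in which the other eight salaries lie.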


Using median to calculate mid point of the data set

Now let us take the same example given by table(i), but this time, instead of calculating the mean, let us calculate the median and see whether it is affected by the salary of individual "I". To calculate the median, we first need to sort the data in ascending or descending order. Then we count the number of instances, n, which can be either even or odd.

Individuals    Salary/month
A              30k
G              34k
B              35k
C              38k
E              39k
D              40k
F              41k
H              42k
I              500k

        table(iii)

If the number of instances n is odd, the median is the value at position

                              median = value at the (n+1)/2 th position                and,

If the number of instances n is even, the median is the average of the two middle values,

                              median = [ value at n/2 th position + value at (n/2+1) th position ] / 2

In our example, the number of instances is 9, which is odd. Using the formula for an odd number of instances, the median is at position (9+1)/2 = 5, and the 5th instance in the sorted table is 39k. Hence the median is 39k. Note that when we calculated the mean we got an average of about 88.8k, but when we switched to the median we got 39k, which is much closer to what most individuals actually earn.

If the data is skewed, the median is generally preferred over the mean, since the median is hardly affected by the maximum or minimum value in the data.
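The positional rule for the median can be sketched in a few lines of Python. The function name `median_by_position` is made up for this illustration; the result is cross-checked against `statistics.median` from the standard library.

```python
from statistics import median

def median_by_position(values):
    """Median using the odd/even positional rule (positions are 1-based)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # Odd n: take the value at position (n + 1) / 2.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the values at positions n/2 and n/2 + 1.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

# Salaries from table (i), in thousands (unsorted on purpose).
salaries = [30, 35, 38, 40, 39, 41, 34, 42, 500]

odd_case = median_by_position(salaries)        # 9 values
even_case = median_by_position(salaries[:-1])  # 8 values, outlier dropped

print(odd_case, even_case)
```

With or without the 500k value the median stays between 38k and 39k, which is the point made above.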
 

 

  

What is Binning?

 In real cases, the data is huge and contains a large amount of noise that does not help in any way to develop a meaningful machine learning algorithm. In such cases we need to smooth the data in order to get a meaningful algorithm. One of the processes for smoothing data is called binning. There are basically two types of data: categorical and continuous. Binning is the process of converting continuous data into categorical or discrete data.

Binning or discretization is the process of transforming numerical variables into categorical counterparts.

Binning method for data smoothing – 

Here, we need the binning method for data smoothing. In this method the data is first sorted, and the sorted values are then distributed into a number of buckets or bins. As binning methods consult the neighborhood of values, they perform local smoothing.

How to perform smoothing on the data?

There are three approaches to perform smoothing –
  1. Smoothing by bin means : In smoothing by bin means, each value in a bin is replaced by the mean value of the bin.
  2. Smoothing by bin median : In this method each bin value is replaced by its bin median value.
  3. Smoothing by bin boundary : In smoothing by bin boundaries, the minimum and maximum values in a given bin are identified as the bin boundaries. Each bin value is then replaced by the closest boundary value.
Sorted data for price (in dollars) : 2, 6, 7, 9, 13, 20, 21, 24, 30
Partition using equal frequency approach:
Bin 1 : 2, 6, 7
Bin 2 : 9, 13, 20
Bin 3 : 21, 24, 30

Smoothing by bin mean :
Bin 1 : 5, 5, 5
Bin 2 : 14, 14, 14
Bin 3 : 25, 25, 25

Smoothing by bin median :
Bin 1 : 6, 6, 6
Bin 2 : 13, 13, 13
Bin 3 : 24, 24, 24

Smoothing by bin boundary :
Bin 1 : 2, 7, 7
Bin 2 : 9, 9, 20
Bin 3 : 21, 21, 30
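
The three smoothing approaches can be reproduced with a short sketch. It uses the nine prices as they appear in the bins above; the helper names are made up for this illustration, and sending ties to the lower boundary is an assumption, since the text does not say how ties are resolved.

```python
from statistics import mean, median

def equal_frequency_bins(sorted_values, n_bins):
    """Split already-sorted values into n_bins bins of equal size."""
    if len(sorted_values) % n_bins != 0:
        raise ValueError("number of values must be divisible by n_bins")
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_mean(bins):
    # Every value in a bin is replaced by the mean of that bin.
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_median(bins):
    # Every value in a bin is replaced by the median of that bin.
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundary(bins):
    # Every value is replaced by the nearer of the bin's min and max.
    # Assumption: a value exactly in the middle goes to the lower boundary.
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
            for b in bins]

prices = [2, 6, 7, 9, 13, 20, 21, 24, 30]
bins = equal_frequency_bins(prices, 3)

by_mean = smooth_by_mean(bins)
by_median = smooth_by_median(bins)
by_boundary = smooth_by_boundary(bins)

for name, result in [("mean", by_mean), ("median", by_median),
                     ("boundary", by_boundary)]:
    print(name, result)
```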

Binning can also be used as a discretization technique. Here discretization refers to the process of converting or partitioning continuous attributes, features or variables into discretized or nominal attributes, features, variables or intervals.
For example, attribute values can be discretized by applying equal-width or equal-frequency binning and then replacing each value in a bin by the bin mean or median, as in smoothing by bin means or smoothing by bin medians, respectively. Each continuous value is thus converted to a nominal or discretized value, which is the same as the value of its corresponding bin.
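
As a rough sketch of equal-width discretization, the code below splits the price range into three intervals of equal width and maps each price to a bin. The function name and the category names "low", "medium" and "high" are hypothetical and chosen only for this example.

```python
def equal_width_labels(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        idx = int((v - lo) / width) if width > 0 else 0
        # The maximum value sits on the upper edge, so keep it in the last bin.
        labels.append(min(idx, n_bins - 1))
    return labels

prices = [2, 6, 7, 9, 13, 20, 21, 24, 30]
labels = equal_width_labels(prices, 3)

# Hypothetical nominal names for the three bins.
names = ["low", "medium", "high"]
categories = [names[i] for i in labels]

for price, category in zip(prices, categories):
    print(price, category)
```

Unlike equal-frequency binning, equal-width bins can hold different numbers of values: here the first interval receives four prices and the second only two.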

Difference between Machine Learning and AI

Artificial Intelligence

Artificial intelligence is a field of computer science concerned with building computer systems that can mimic human intelligence. The term combines two words, "Artificial" and "intelligence", and means "a human-made thinking power." Hence we can define it as,
Artificial intelligence is a technology using which we can create intelligent systems that can simulate human intelligence.

An artificial intelligence system does not need to be pre-programmed for every situation; instead, it uses algorithms that can work with their own intelligence. It involves machine learning algorithms such as reinforcement learning algorithms and deep learning neural networks. AI is used in many places, such as Siri, Google's AlphaGo, chess-playing programs, etc.

Based on capabilities, AI can be classified into three types:

  • Weak AI
  • General AI
  • Strong AI
Currently, work centers on weak AI and general AI. Strong AI is seen as the future of the field, and it is said that it will be more intelligent than humans.

Machine learning

Machine learning is about extracting knowledge from the data. It can be defined as,
Machine learning is a subfield of artificial intelligence, which enables machines to learn from past data or experiences without being explicitly programmed.

Machine learning enables a computer system to make predictions or take decisions using historical data without being explicitly programmed. Machine learning uses a massive amount of structured and semi-structured data so that a machine learning model can generate accurate results or give predictions based on that data.

Machine learning works on algorithms which learn on their own using historical data. It works only for specific domains: for example, if we create a machine learning model to detect pictures of dogs, it will only give meaningful results for dog images, and if we provide new data such as a cat image, its output will not be reliable. Machine learning is used in various places, such as online recommender systems, Google search algorithms, email spam filters, Facebook auto friend tagging suggestions, etc.

It can be divided into three types:

  • Supervised learning
  • Reinforcement learning
  • Unsupervised learning

Key differences between Artificial Intelligence (AI) and Machine learning (ML):

Artificial Intelligence | Machine learning
Artificial intelligence is a technology which enables a machine to simulate human behavior. | Machine learning is a subset of AI which allows a machine to learn automatically from past data without being explicitly programmed.
The goal of AI is to make a smart computer system that solves complex problems like a human. | The goal of ML is to allow machines to learn from data so that they can give accurate output.
In AI, we make intelligent systems to perform any task like a human. | In ML, we teach machines with data to perform a particular task and give an accurate result.
Machine learning and deep learning are the two main subsets of AI. | Deep learning is a main subset of machine learning.
AI has a very wide scope. | Machine learning has a limited scope.
AI is working to create an intelligent system which can perform various complex tasks. | Machine learning is working to create machines that can perform only those specific tasks for which they are trained.
An AI system is concerned with maximizing the chances of success. | Machine learning is mainly concerned with accuracy and patterns.
The main applications of AI are Siri, customer support using chatbots, expert systems, online game playing, intelligent humanoid robots, etc. | The main applications of machine learning are online recommender systems, Google search algorithms, Facebook auto friend tagging suggestions, etc.
On the basis of capabilities, AI can be divided into three types: Weak AI, General AI, and Strong AI. | Machine learning can also be divided into three main types: Supervised learning, Unsupervised learning, and Reinforcement learning.
It includes learning, reasoning, and self-correction. | It includes learning and self-correction when introduced to new data.
AI deals with structured, semi-structured, and unstructured data. | Machine learning deals with structured and semi-structured data.

What is DATA?

Data is a collection of facts such as numbers, measurements, words or observations. We generate billions and billions of bytes of data every day. In fact, while I was writing this post, I was generating data on Google's servers, and while you are reading this post, you are generating some amount of data too. We are generating data at an immeasurable rate. Some of the sources of data generation include social media accounts like Facebook, Twitter, Instagram, Pinterest and more. Other sources include your mobile phone, sensors in a car, Forex trading, banking servers and countless others.

 

Data is a collection of facts such as numbers, measurements, words or observations.

In machine learning there are basically three types of data, namely training data, validation data and test data.


What is training Data ?

Training data is the set of data we use to train the machine so that in future it can predict the correct outcome when data of a similar nature is fed to it. The training data is generally about 80% of the overall data. This number can vary based on the amount of data and the objective of the machine learning algorithm.

Data used to train the algorithm are called training data.

What is Validation Data ?

  Validation data is the data we use to check the output of a machine learning algorithm after it has been trained, to see whether the results it provides are correct. If the output is as desired, we move to the next step; if it is flawed, we tune the data, feed it into the machine learning algorithm again, retrain it and recheck the output. This process is also called tuning of data.

Data used to validate the accuracy of the algorithm is called validation data.

What is Test Data ?

Once the algorithm is trained and validated, the testing phase begins: we feed it the remaining data to check that the algorithm works correctly. This data set is, so to speak, the final evaluation that a model needs to go through after the training and validation stages of model development. It basically determines the working accuracy of a given model.

Data used to test the algorithm are called test data.
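
A minimal sketch of such a split is shown below. The 80% training share follows the figure mentioned above; the 10% validation and 10% test shares, the function name `split_data` and the fixed random seed are assumptions made only for this illustration.

```python
import random

def split_data(records, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle records and split them into training, validation and test lists."""
    items = list(records)
    # Shuffle with a fixed seed so that the split is reproducible.
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train_frac)
    n_val = int(len(items) * val_frac)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    # Whatever remains after training and validation becomes the test set.
    test = items[n_train + n_val:]
    return train, val, test

# 100 dummy records standing in for a real data set.
train, val, test = split_data(range(100))
print(len(train), len(val), len(test))
```

Each record lands in exactly one of the three sets, so the test data is never seen during training or validation.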

  The image below shows a simple overview of the data flow.


 

What is Artificial Intelligence?



What Is Artificial Intelligence (AI)?

Artificial intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. The term may also be applied to any machine that exhibits traits associated with a human mind such as learning and problem-solving.

Applications of Artificial Intelligence

The applications for artificial intelligence are endless. The technology can be applied to many different sectors and industries. AI is being tested and used in the healthcare industry for dosing drugs and selecting different treatments for patients, and for surgical procedures in the operating room.
Other examples of machines with artificial intelligence include computers that play chess and self-driving cars. Each of these machines must weigh the consequences of any action they take, as each action will impact the end result. In chess, the end result is winning the game. For self-driving cars, the computer system must account for all external data and compute it to act in a way that prevents a collision.
Artificial intelligence also has applications in the financial industry, where it is used to detect and flag activity in banking and finance such as unusual debit card usage and large account deposits—all of which help a bank's fraud department. Applications for AI are also being used to help streamline and make trading easier. This is done by making supply, demand, and pricing of securities easier to estimate.

KEY TAKEAWAYS

  • Artificial intelligence refers to the simulation of human intelligence in machines.
  • The goals of artificial intelligence include learning, reasoning, and perception.
  • AI is being used across different industries including finance and healthcare.
  • Weak AI tends to be simple and single-task oriented, while strong AI carries on tasks that are more complex and human-like.

Categorization of Artificial Intelligence

Artificial intelligence can be divided into two different categories: weak and strong. Weak artificial intelligence embodies a system designed to carry out one particular job. Weak AI systems include video games such as the chess example from above and personal assistants such as Amazon's Alexa and Apple's Siri. You ask the assistant a question, it answers it for you.
Strong artificial intelligence systems are systems that carry on the tasks considered to be human-like. These tend to be more complex and complicated systems. They are programmed to handle situations in which they may be required to problem solve without having a person intervene. These kinds of systems can be found in applications like self-driving cars or in hospital operating rooms.


Special Considerations

Since its beginning, artificial intelligence has come under scrutiny from scientists and the public alike. One common theme is the idea that machines will become so highly developed that humans will not be able to keep up and they will take off on their own, redesigning themselves at an exponential rate.
Another is that machines can hack into people's privacy and even be weaponized. Other arguments debate the ethics of artificial intelligence and whether intelligent systems such as robots should be treated with the same rights as humans.
Self-driving cars have been fairly controversial as their machines tend to be designed for the lowest possible risk and the least casualties. If presented with a scenario of colliding with one person or another at the same time, these cars would calculate the option that would cause the least amount of damage.
Another contentious issue many people have with artificial intelligence is how it may affect human employment. With many industries looking to automate certain jobs through the use of intelligent machinery, there is a concern that people would be pushed out of the workforce. Self-driving cars may remove the need for taxis and car-share programs, while manufacturers may easily replace human labor with machines, making people's skills more obsolete.