There has been a lot of debate on whether machines will surpass human intelligence soon. To judge how much merit this claim has, we should understand how humans learn and how machines learn today.
Humans learn from a variety of sources at different stages of life.
How does a baby learn to cry when she is born, to roll over after a few months, and then to sit, crawl, walk and run? These activities are all biologically and genetically driven. No one teaches the baby, nor does the baby make a conscious effort to learn.
Once the baby starts crawling, her mother starts teaching her to speak a few words. This is the precursor to formal education led by a teacher at play school, primary school, high school, college, university and so on.
Humans learn a lot from the environment they live in: from friends, relatives, news channels, books and entertainment, and at times simply from their own experiences.
Depending on the culture, community and region we belong to, some beliefs are inherited. We simply follow these customs, practices and beliefs religiously.
What is Machine Learning?
The following are two formal definitions of machine learning:
Arthur Samuel (1959), Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed. Arthur Samuel coined the term “Machine Learning” in 1959.
Tom Mitchell (1997), Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
How do Machine Learning Algorithms work?
All of these algorithms typically form an internal representation of the data that maps input data to the desired output. To give an idea of what this representation looks like, let us examine the Linear Regression and Logistic Regression algorithms in detail below. These are among the earliest such algorithms, developed long before machine learning emerged as a field. They are like the grandparents of today’s machine learning.
Regression Analysis is a method for finding the relationships or associations between variables. Linear regression deals with the linear relationship between variables. If the dependent variable is Y, and the independent variables are X1, X2, X3 etc., then the relationship among them takes the form given below:
Y = a0 + a1*X1 + a2*X2 + a3*X3 + …, where a1, a2, a3 are the coefficients of the independent variables X1, X2, X3, and a0 is the constant (intercept) that accounts for any unknown or unmeasurable influences other than the known variables X1, X2, and X3.
With the training data set, these algorithms find values for all the coefficients a1, a2, a3 and the constant a0. Once we have the full equation, we can predict Y for any future values of X1, X2, X3. The process of finding these coefficients is called training or learning. It is an iterative process: predict Y, measure the error between the predicted Y and the actual Y (from historic training data), use that error to update the coefficients, and repeat until the error cannot be reduced any further.
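The predict, measure and update loop described above can be sketched in a few lines of Python. This is only a minimal illustration for the single-variable case Y = a0 + a1*X1, assuming batch gradient descent on the mean squared error. The function name `train_linear`, the toy data, the learning rate and the iteration count are all illustrative assumptions, not something prescribed by the method.

```python
# Minimal sketch of iterative training for Y = a0 + a1*X1.
# Assumption: batch gradient descent on the mean squared error.

def train_linear(xs, ys, learning_rate=0.05, iterations=5000):
    a0, a1 = 0.0, 0.0  # start from arbitrary coefficients
    n = len(xs)
    for _ in range(iterations):
        # 1. Predict Y with the current coefficients.
        predictions = [a0 + a1 * x for x in xs]
        # 2. Measure the error between predicted Y and actual Y.
        errors = [p - y for p, y in zip(predictions, ys)]
        # 3. Update the coefficients in the direction that reduces the error.
        grad_a0 = 2.0 / n * sum(errors)
        grad_a1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1

# Toy training data generated from Y = 1 + 2*X1 (an assumption for the demo).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a0, a1 = train_linear(xs, ys)
print(a0, a1)  # the estimates approach 1 and 2
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes training slow, so in practice this value is tuned for the data at hand.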
In statistical analysis, the same problem is solved in one shot, without iterations, using an analytical solution that requires computing a matrix inverse. There, the number of training examples is small (as we take only sample data) and the number of variables is also small, so it is computationally feasible to compute the matrix inverse and obtain an analytical solution that is guaranteed to be the best solution.
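As a rough sketch of this one-shot approach, the code below builds the so-called normal equations, (X^T X) a = X^T Y, and solves them directly. Solving the system by elimination is equivalent to applying the matrix inverse and is used here only to keep the example free of external libraries. The helper names `solve_linear_system` and `fit_closed_form` and the toy data are assumptions made for illustration.

```python
# Sketch of the analytical (closed-form) least-squares solution.
# Assumption: we solve (X^T X) a = X^T Y by Gauss-Jordan elimination,
# which gives the same result as multiplying by the inverse of X^T X.

def solve_linear_system(matrix, rhs):
    """Solve matrix * a = rhs using Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def fit_closed_form(rows, ys):
    # Prepend a column of 1s so the constant a0 is estimated as well.
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Toy data generated from Y = 1 + 2*X1 + 3*X2 (an assumption for the demo).
rows = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
ys = [1.0 + 2.0 * x1 + 3.0 * x2 for x1, x2 in rows]
coefficients = fit_closed_form(rows, ys)
print(coefficients)  # approximately [a0, a1, a2] = [1, 2, 3]
```

No iterations over the data are needed here, but the cost of solving the system grows quickly with the number of variables.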
In Machine Learning, the data volume is large (as we take population data), and the number of variables is also very large. More importantly, we remove the assumption that the independent variables are uncorrelated with each other. When there is a strong correlation among these variables, computing the matrix inverse may not be possible. In addition, the volume of data and the large number of variables make the computation highly time consuming, if not impossible. Hence, we go for a numerical solution using an optimization algorithm. In general, numerical methods risk getting stuck in a local minimum instead of the global minimum (the best possible solution) and may therefore return a sub-optimal solution, although for linear regression with a squared-error cost the error surface is convex and has no such local minima.
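To see why strong correlation is a problem for the matrix inverse, consider the extreme case where one variable is an exact multiple of another. The small sketch below (with made-up numbers, and with the constant term left out for brevity) computes the matrix X^T X and its determinant. A determinant of zero means the matrix has no inverse.

```python
# Sketch: perfectly correlated variables make X^T X non-invertible.
# The data values and helper names are illustrative assumptions.

def gram_matrix(columns):
    """Return X^T X for a data matrix given as a list of columns."""
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in columns]
            for ci in columns]

def det2(m):
    """Determinant of a 2x2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

x1 = [1, 2, 3, 4]
x2_correlated = [2 * v for v in x1]  # X2 is exactly 2 * X1
x2_independent = [2, 1, 4, 3]        # X2 is not a multiple of X1

det_correlated = det2(gram_matrix([x1, x2_correlated]))
det_independent = det2(gram_matrix([x1, x2_independent]))
print(det_correlated)   # 0   -> no inverse exists
print(det_independent)  # 116 -> inverse exists
```

With real data the correlation is rarely exact, but a determinant close to zero still makes the inverse numerically unstable.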
Linear regression comes in a few common variants:
- Simple Linear Regression: This involves only one independent variable, so the equation takes the form Y = a0 + a1*X1.
- Multiple Linear Regression: Here we have more than one independent variable.
- Time Series Regression with seasonality and trend factors: Here the independent variable is time, e.g. predicting stock prices, inflation or GDP growth over a period of time.
Logistic regression is similar to linear regression, but the dependent variable Y is categorical and binary (it can take only the values 0 or 1). It is a classification algorithm, used for example to assess whether a person will default on a loan, whether a customer will leave a service (churn), or whether a particular claim is fraudulent.
The relationship between the dependent variable and the independent variables is determined by estimating the probability p that Y = 1, using the log-odds (logit) function:
ln(p/(1-p)) = a0 + a1*X1 + a2*X2 + a3*X3 + … + an*Xn
Using the historic data, the coefficients a0 to an are estimated. The model is then ready to classify any given set of independent variable values.
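Once the coefficients are known, classification amounts to solving the equation above for p, which gives p = 1 / (1 + e^-(a0 + a1*X1 + … + an*Xn)), and comparing p with a cut-off. The sketch below assumes a cut-off of 0.5 and uses made-up coefficient values purely for illustration. The function names are likewise hypothetical.

```python
import math

def predict_probability(coefficients, features):
    """Invert ln(p/(1-p)) = a0 + a1*X1 + ... + an*Xn to recover p."""
    a0, weights = coefficients[0], coefficients[1:]
    log_odds = a0 + sum(a * x for a, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(coefficients, features, threshold=0.5):
    """Return 1 if the estimated probability reaches the threshold, else 0."""
    return 1 if predict_probability(coefficients, features) >= threshold else 0

# Hypothetical coefficients [a0, a1, a2] for a two-variable model.
coefficients = [-3.0, 0.5, 1.0]

p = predict_probability(coefficients, [2.0, 2.0])  # log-odds = -3 + 1 + 2 = 0
print(p)                                   # 0.5
print(classify(coefficients, [0.0, 0.0]))  # 0 (log-odds = -3)
print(classify(coefficients, [4.0, 4.0]))  # 1 (log-odds = +3)
```

The 0.5 cut-off is a convention rather than a requirement. For problems such as fraud detection, a different threshold may be chosen to trade false alarms against missed cases.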
Logistic regression can also be extended to multi-class problems, e.g. classifying a hand-written digit as one of the 10 classes 0 to 9.
This algorithm also uses numerical optimization methods to compute the coefficients.
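A minimal sketch of such a numerical method is shown below, assuming plain gradient descent on the average negative log-likelihood for a model with a single independent variable. The function name `train_logistic`, the toy data and the hyperparameters are illustrative assumptions. Practical implementations typically use more sophisticated optimizers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, labels, learning_rate=0.1, iterations=20000):
    """Estimate a0 and a1 in ln(p/(1-p)) = a0 + a1*X1 by gradient descent."""
    a0, a1 = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Predicted probability of class 1 for every training example.
        probs = [sigmoid(a0 + a1 * x) for x in xs]
        # Gradient of the average negative log-likelihood.
        grad_a0 = sum(p - y for p, y in zip(probs, labels)) / n
        grad_a1 = sum((p - y) * x for p, y, x in zip(probs, labels, xs)) / n
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1

# Toy data: larger X1 values are more likely to belong to class 1.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = [0, 0, 1, 0, 1, 1]

a0, a1 = train_logistic(xs, labels)
boundary = -a0 / a1  # the X1 value at which p = 0.5
print(a0, a1, boundary)
```

The toy labels deliberately overlap (X1 = 3 is class 1 while X1 = 4 is class 0), so the coefficients settle at finite values instead of growing without bound.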
As can be seen from the way these algorithms work, they essentially map input to output through an equation. This equation can also include non-linear polynomial terms.
Probabilistic algorithms use probability distributions such as the Gaussian or Bernoulli to map input to output, instead of an equation.
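As one possible illustration of the probabilistic approach, the sketch below fits a Gaussian distribution to each class for a single input variable and assigns a new observation to the class with the highest prior times likelihood. The class names, the height values and the helper names are made up for the example.

```python
import math

def fit_gaussian(values):
    """Estimate the mean and variance of one class from its training values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance

def gaussian_pdf(x, mean, variance):
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * variance)
    return coefficient * math.exp(-((x - mean) ** 2) / (2.0 * variance))

def classify(x, class_models, priors):
    # Choose the class that maximizes prior * likelihood (Bayes' rule,
    # ignoring the normalizing constant shared by all classes).
    return max(class_models,
               key=lambda c: priors[c] * gaussian_pdf(x, *class_models[c]))

# Hypothetical training data: heights in cm for two classes.
samples = {"short": [150.0, 155.0, 160.0], "tall": [180.0, 185.0, 190.0]}
class_models = {label: fit_gaussian(values) for label, values in samples.items()}
total = sum(len(values) for values in samples.values())
priors = {label: len(values) / total for label, values in samples.items()}

print(classify(158.0, class_models, priors))  # short
print(classify(182.0, class_models, priors))  # tall
```

Here the learned representation is the mean and variance of each class rather than a set of equation coefficients.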
Kernel methods use a higher-dimensional representation of the original training data to map input to output.
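The idea can be sketched with a toy example. The one-dimensional points below cannot be split into their two classes by a single cut-off on x, but after mapping each x to the pair (x, x*x) a simple threshold on the new coordinate separates them. Real kernel methods usually perform this mapping implicitly through a kernel function; the explicit `feature_map`, the data and the threshold here are assumptions chosen only to make the effect visible.

```python
def feature_map(x):
    """Map a 1-D input into a 2-D representation: (x, x^2)."""
    return (x, x * x)

# Class 1 lies on both sides of class 0, so no single cut-off on x works.
points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, 0, 0, 0, 1, 1]

def classify_in_feature_space(x, threshold=1.5):
    # A linear rule in the mapped space: threshold the second coordinate.
    _, squared = feature_map(x)
    return 1 if squared > threshold else 0

predictions = [classify_in_feature_space(x) for x in points]
print(predictions == labels)  # True
```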
Neural networks use their parameters (the weights and biases in each of the layers) to map input to output.
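The sketch below shows how weights and biases turn an input into an output in a tiny two-layer network. The weights are hand-picked (not learned) so that the network computes the XOR of two binary inputs, and the ReLU activation and helper names are assumptions made for this illustration. In a real network, training would find such values automatically.

```python
def relu(v):
    return max(0.0, v)

def dense(inputs, weights, biases, activation=None):
    """One layer: each neuron computes bias + weighted sum of its inputs."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = bias + sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(activation(total) if activation else total)
    return outputs

# Hand-picked parameters (hypothetical, chosen so the network computes XOR).
hidden_weights = [[1.0, 1.0], [1.0, 1.0]]
hidden_biases = [0.0, -1.0]
output_weights = [[1.0, -2.0]]
output_biases = [0.0]

def forward(inputs):
    hidden = dense(inputs, hidden_weights, hidden_biases, activation=relu)
    return dense(hidden, output_weights, output_biases)[0]

xor_outputs = [forward([a, b]) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
print(xor_outputs)  # [0.0, 1.0, 1.0, 0.0]
```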
Some non-parametric models, such as nearest-neighbor methods, have no explicit learning process and hence no fixed set of parameters to learn. They simply compute the geometric distance between observations, and map an unknown observation to the class of the nearest known observations in the training data.
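A minimal sketch of this distance-based approach is given below, assuming Euclidean distance and a single nearest neighbor. The training points, class labels and function names are made up for the example.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor_class(query, training_points, training_labels):
    """Return the label of the training point closest to the query."""
    distances = [euclidean_distance(query, point) for point in training_points]
    return training_labels[distances.index(min(distances))]

# Hypothetical training data: two clusters labelled "A" and "B".
training_points = [(1.0, 1.0), (1.5, 2.0), (5.0, 5.0), (6.0, 5.5)]
training_labels = ["A", "A", "B", "B"]

print(nearest_neighbor_class((1.2, 1.4), training_points, training_labels))  # A
print(nearest_neighbor_class((5.5, 5.0), training_points, training_labels))  # B
```

Note that there is no training step at all: the training data itself is kept and consulted at prediction time.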
Essentially, they all use different approaches to map input to output, by forming an intermediate representation of original training data that clearly separates various classes of observations.
It is clear that the way humans learn and the way machines are taught are quite different. As it stands today, machines can emulate only a very small fraction of human learning: the part where a teacher’s involvement is required, as in formal education. Hence, machines can’t learn anything by themselves; someone has to tell them what to learn and how to learn.