This video on "Machine Learning Algorithms" will provide you with a comprehensive and detailed knowledge of Artificial Intelligence concepts with hands-on examples.
What are Machine Learning Algorithms | Machine Learning Algorithms Tutorial for Beginners
Transcript
Welcome to this course on Machine Learning Algorithms.
My name is Denis Batalov,
I've been with Amazon 13 years and currently work as
a principal solutions architect
specializing in Machine Learning,
and I even have a PhD in this field.
You are now ready to start diving into
machine learning algorithms, exciting times.
Many customers are struggling with
understanding how to translate
their business problems into
an IT solution that
somehow incorporates machine learning.
So in this course, we're going to review
the different types of machine learning
algorithms and the problems they solve.
Perhaps you've already heard about supervised learning,
unsupervised learning,
reinforcement learning, and deep learning.
By the end of this course,
you should be able to speak
confidently about these categories of
ML algorithms with your customers and
help them determine the category that fits their problem.
So let's get going.
Before we can intelligently speak of Machine Learning,
let's recall what artificial intelligence
or machine intelligence is all about.
A system exhibiting intelligent behavior
normally needs to possess two fundamental faculties.
First is the ability to
acquire and systematize knowledge.
This relies on so-called inductive reasoning,
or coming up with rules that would
explain the individual observations.
Of course simple facts or
truths need to be acquired as well.
But if a rule can be learned so
that many truths can be derived from it,
it would be easier to remember
a single rule, wouldn't it?
For example, as you hear me speak,
you don't need to constantly
remind yourself that there is
no live person in front of you
because you know how video recordings work.
Now, the second faculty is inference, or the ability to
use the acquired knowledge to derive
truths when needed, like making predictions,
choosing actions, or making complex plans.
This ability relies on deductive reasoning,
which was popularized so much by Conan Doyle.
You heard me use the terms learning and
predictions when describing these faculties,
and this is of course no accident.
All machine learning algorithms
must possess them in some form.
Early algorithms in the AI space were
relying primarily on the second faculty of
inference by having humans
acquire and feed
all the necessary knowledge into the machine.
This unfortunately proved to be
impossible for most practical problems,
and that's why ML algorithms rule
the day, machines learn automatically.
We're now ready to discuss the different categories of ML
based on how the machine learns and what it can infer.
Currently, when people think of machine learning,
they typically think of supervised learning because of
its wide applicability and many successful applications.
It's called supervised because
there needs to be a supervisor,
a teacher or trainer
showing the right answers during the learning.
No wonder we also call it
training a machine learning model.
A model, because the algorithm is
effectively able to simulate or model the teacher.
Oftentimes, the teacher is simply not there
and all we're left with is just observations or data.
Can something useful be learned
from the data in such a case?
You guessed it. This is
the domain of the so-called unsupervised learning.
One typical example from
this category is a clustering algorithm,
which divides the observations into
what appear to be different clusters.
We will see others later.
I should point out that there exists
so-called mixed or semi-supervised algorithms,
but let's not overcomplicate things for now.
Another kind of learning that has been gaining in
popularity recently is
the so-called reinforcement learning.
In some sense, this type of
an algorithm is attempting to solve
the complete AI problem of building
an agent capable of
exhibiting entire intelligent behaviors,
not just making isolated decisions.
This is why it's an exciting area of research,
but that's what makes it also
difficult in applied practical settings.
In reinforcement learning,
the agent controlled by the algorithm is interacting
with the possibly completely unknown environment
and is learning optimal actions via trial and error.
Here, there's no explicit teacher telling
the agent what is the right action at any given time.
Instead, the agent is getting
an often delayed reward or penalty called reinforcement,
and is designed to maximize long-term rewards.
Think of playing a computer game
with possibly unknown rules,
but your goal is to get maximum points.
Not surprisingly, this approach has
been rather popular in game-play
from early successes with
simple games like tic-tac-toe in the 1960s,
to backgammon in the 1990s,
down to the very recent and highly publicized triumph
of ML over the game of Go.
Now, let's look closer at supervised learning.
Suppose we want a machine learning algorithm
to distinguish between circles and squares.
A supervised learning algorithm
would require many examples of
both figures and a teacher
who would tell it which is which.
After the training is finished,
a successful learning algorithm would be able
to decide on its own whether
any given figure is a circle or
square with sufficient accuracy,
hopefully substantially better than random guessing.
It would do so even for circles or
squares that it has never seen during training,
and this is ultimately the power of supervised learning.
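To make this concrete, here is a minimal sketch (an addition, not part of the course itself) of training a supervised classifier on hypothetical circle and square features; the feature names, tiny dataset, and choice of scikit-learn are all assumptions for illustration.

```python
# A minimal sketch, not from the course: supervised learning on hypothetical
# circle/square features. We assume each figure is described by two made-up
# numeric features (corner count and a roundness score), and the labels
# come from a "teacher".
from sklearn.linear_model import LogisticRegression

X_train = [
    [0, 0.95],  # circle: no corners, very round
    [4, 0.10],  # square: four corners, not round
    [0, 0.90],
    [4, 0.15],
]
y_train = ["circle", "square", "circle", "square"]  # teacher-provided labels

model = LogisticRegression().fit(X_train, y_train)

# The trained model can now label a figure it never saw during training.
print(model.predict([[0, 0.88]]))  # should print ['circle']
```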
Does the correct answer always
need to come from a human teacher?
The notion of a teacher may be generalized to
any complex system or
phenomenon that consists of machines,
humans, or natural processes.
Here, you see a Rube Goldberg machine
to represent this complex system.
You can view this system as a function that accepts
input parameters and produces an outcome of sorts.
The known outcomes form the so-called ground truths,
and this set of historic observations
is our training dataset.
We then train the machine learning model
by feeding it this training dataset.
The resulting model is then used to predict
the same outcome based on
previously unseen input parameters.
Hopefully, the model prediction is the
same or close to what
the original system would have produced.
The reason we're interested in building
such models is that the original system is either
impossible or expensive to procure and scale
or takes too long to produce
the outcome which we want to obtain sooner.
If the predicted value is of
binary nature as was the case with circles and squares,
we say that the model is
performing a binary classification,
in other words, labeling
the observation in two possible ways.
This is just a special case of
multi-class prediction where the data point
can belong to one of
many different and mutually exclusive classes
such as circles, squares, or triangles.
Mutual exclusivity is tricky to
ensure though, as this picture makes clear.
Sometimes it's entirely a matter of perspective.
Now, if the variable being predicted is numeric,
then the model is said to be solving a regression problem.
In other words, determining
the unknown value of
the dependent variable based on input parameters.
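As an illustration of the distinction (an addition, not from the course), here is a minimal sketch contrasting a classifier, whose target is a discrete class, with a regressor, whose target is a numeric value; the library choice and toy data are assumptions.

```python
# A minimal sketch, not from the course, contrasting classification and
# regression targets on made-up one-dimensional data.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[1.0], [2.0], [3.0], [4.0]]

# Multi-class classification: the target is one of several mutually
# exclusive classes.
clf = DecisionTreeClassifier().fit(X, ["circle", "square", "triangle", "square"])

# Regression: the target is a numeric value of a dependent variable.
reg = DecisionTreeRegressor().fit(X, [1.5, 2.7, 3.1, 4.8])

print(clf.predict([[2.5]]))
print(reg.predict([[2.5]]))
```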
Let us now look at some examples.
For instance, we might have historical records of
volcano eruptions and various observations
and measurements leading up to them.
We can imagine training
a machine learning model
capable of predicting future eruptions.
The teacher, or the source of truth in
this case, is simply nature itself.
Another example is when we want the model to predict
impending equipment failure so
that we can make proactive repairs;
this is known as predictive maintenance.
Similarly, we may want to
build a model that predicts which of our clients
are about to stop purchasing
our services, possibly leaving for a competitor;
this is known as customer churn prediction.
In these examples, all we
need to do to obtain a good training dataset with
properly labeled observations is to
systematically record historical observations,
together with observed value of
variables we ultimately want to predict.
You will eventually know when a volcano
erupts or when your equipment breaks down.
That is only possible if
the real system we're trying to model
is already functioning on
a regular basis and is easy to observe.
If say we want to train a model to label people
in photos as either smiling or frowning,
then we would first need to have someone go through
a large number of photos and label them manually.
If such a human labeling process were not already in place,
obtaining a training dataset
could be difficult and time-consuming.
Fortunately, tools to crowdsource
human decisions are available such as
Amazon Mechanical Turk or similar offerings
from AWS partners such as Figure Eight and others.
So how many different
supervised learning algorithms are there?
There are literally hundreds of them in existence,
so there's no point in mentioning them all.
Instead, we can focus on a few families of
algorithms that are most
popular and have proven to be successful.
One of the earliest and simplest algorithms
is based on learning parameters of a linear function,
likely in a multi-dimensional space.
You've already seen an example of regression where
we find the linear function that best fits the data.
When it comes to predicting a category,
or a class as in circles or squares,
we typically want to find a hyperplane also known as
a decision boundary that best separates
the data samples belonging to
the classes as shown in this picture.
If there exists a linear surface
that separates the two classes,
we say that they are linearly separable.
This is rarely the case in practice,
however, so some errors must be expected.
To arrive at the binary classifier,
we can apply a logistic function
to the output of a linear combination
of input parameters in order to
restrict the values to the range from zero to one.
This forms the basis of
the so-called logistic regression algorithm.
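To show the idea (a minimal sketch, not from the course), here is the logistic function applied to a linear combination of inputs; the weights, bias, and input values are made up for illustration.

```python
# A minimal sketch of the idea behind logistic regression: a linear
# combination of the inputs is squashed into the range (0, 1) by the
# logistic function. The weights, bias, and input are made up.
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([0.8, -1.2])   # hypothetical learned weights
bias = 0.3
x = np.array([2.0, 1.5])          # one input sample

score = logistic(np.dot(weights, x) + bias)  # value between 0 and 1
prediction = 1 if score >= 0.5 else 0        # threshold to get a binary class
print(score, prediction)
```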
In fact, Amazon SageMaker has
a built-in algorithm called linear learner,
which is effectively a combination
of linear and logistic regression.
But other linear algorithms exist as well.
You may have heard about Support Vector Machines or SVMs
which strive to find a hyperplane
with maximum margin between classes.
More modern variants of the algorithm also
introduce non-linearity with kernel functions.
So, strictly speaking, they would no longer
belong to purely linear methods.
Perceptron is
another rather simple linear classifier that forms
the foundational unit of
the so-called artificial neural networks,
which we'll look at later in this course.
As I pointed out earlier,
in most practical settings,
we're not dealing with linearly separable classes
as demonstrated here.
A circular decision boundary would work here,
but so would a square-shaped one aligned
with the coordinate axes shown here.
This is exactly the decision boundary used by
algorithms that end up
constructing the so-called decision trees.
In order to make a classification,
we start with the root of the tree and descend
through the decision nodes
until we arrive at a classification.
In this example, points with
X coordinate outside of the range
from X1 to X2 are immediately classified as red.
But for those in the range,
we need to additionally consult the y-coordinate and
check against Y1 and Y2 values.
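That descent can be written out directly; here is a minimal sketch mirroring the decision tree just described, where the threshold values and the class names ("red", "blue") are hypothetical.

```python
# A minimal sketch of descending the decision tree just described. The
# threshold values and the class names ("red", "blue") are hypothetical.
x1, x2 = 2.0, 6.0
y1, y2 = 1.0, 5.0

def classify(x, y):
    # Root node: points with an x-coordinate outside [x1, x2] are red
    if x < x1 or x > x2:
        return "red"
    # Otherwise, consult the y-coordinate against y1 and y2
    if y < y1 or y > y2:
        return "red"
    return "blue"

print(classify(7.0, 3.0))  # outside the x-range -> 'red'
print(classify(3.0, 2.0))  # inside both ranges -> 'blue'
```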
Instead of constructing a single tree,
the algorithms from the tree family often construct
many trees and combine their predictions in some ways.
Algorithms such as Random Forest
and XGBoost are based on these approaches.
In fact, Amazon SageMaker includes the XGBoost algorithm.
It is based on the idea of building
a strong classifier out of
many weak classifiers in the form of decision trees.
Such an approach is called boosting.
XGBoost is a general-purpose supervised algorithm.
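As a hedged illustration, here is a minimal sketch using the open-source XGBoost library (not the SageMaker built-in); the tiny dataset and parameter values are assumptions.

```python
# A minimal sketch using the open-source XGBoost library (not the SageMaker
# built-in). The tiny dataset and parameters are made up for illustration.
import numpy as np
import xgboost as xgb

X = np.array([[1.0, 0.2], [2.0, 0.8], [3.0, 0.1], [4.0, 0.9]])
y = np.array([0, 1, 0, 1])  # binary labels

# Boosting: many shallow trees (weak classifiers) combined into a strong one
model = xgb.XGBClassifier(n_estimators=50, max_depth=2)
model.fit(X, y)
print(model.predict(np.array([[2.5, 0.7]])))
```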
The Factorization Machines algorithm, on the other hand, works
best when we deal with large amounts of sparse data,
as is the case with the problem of click prediction for
online advertising or recommendations in general.
Factorization Machines is also built into SageMaker.
As usual, many other approaches
and algorithms exist as well.
For example, we could define the decision boundary to be
polynomial as in circular or parabolic boundaries.
Of course, we'll come back to
neural networks later in this course.
Let us now examine the unsupervised learning algorithms.
Clustering is an especially popular type
of an unsupervised algorithm.
Given a collection of data points,
we're trying to divide them into groups or clusters with
the assumption that points belonging to
the same cluster are somehow similar,
whereas those belonging to
different clusters are somehow dissimilar.
We're still required to give
some guidance to the algorithms
such as specifying the number
of clusters we're looking for.
One problem with clustering algorithms is that
we usually don't know how many clusters to pick.
Here's an example of the result
if we request just two clusters.
But depending on various
parameters and distance measures,
a differently tuned algorithm might
provide a different answer for
the same two clusters requested.
If for this dataset we
request four different clusters instead,
we might get something that looks like this.
This points to another problem with such algorithms:
it is ultimately up to us how to interpret
the results and assign
meaning to the discovered clusters.
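Here is a minimal sketch of clustering with k-means, using scikit-learn rather than the SageMaker built-in; the points and the choice of two clusters are made up, and interpreting the resulting clusters is still up to us.

```python
# A minimal sketch of k-means clustering with scikit-learn (not the SageMaker
# built-in). The points are made up, and we must supply the number of
# clusters ourselves.
import numpy as np
from sklearn.cluster import KMeans

points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one apparent group
                   [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])  # another apparent group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # centers of the discovered clusters
```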
Another entirely different family of
unsupervised algorithms attempts to detect
anomalies or generally find outliers in the data.
In this picture, we see the red, green, and blue lines
representing different sensor readouts
of the electrocardiogram,
while the top magenta line corresponds to
the anomaly scores produced by
the algorithm after observing the data.
The higher the score,
the more pronounced the anomaly is.
There's no explicit teacher
labeling the historic data as anomalous;
instead, the algorithm learns on its
own what normal looks like by simply observing the data.
One anomaly detection algorithm was
developed by scientists working at Amazon.
So it's worth taking a closer look.
It's called Random Cut Forest.
The algorithm works by constructing
a forest of the so-called random cut trees.
Each tree is constructed by
a recursive procedure which
first surrounds the data points
with a bounding box and then cuts or splits it
along the coordinate axes by picking cut points randomly.
The procedure is repeated until
every point is sorted into a particular leaf of the tree.
For full details, you can read the paper presented at
the International Conference on Machine Learning
in 2016.
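To give a feel for that procedure, here is a highly simplified sketch of building a single random cut tree; it illustrates the recursive bounding-box-and-cut idea only and is not the actual Random Cut Forest implementation.

```python
# A highly simplified sketch of building one random cut tree: surround the
# points with a bounding box, pick a random cut, and recurse until points
# are separated. This only illustrates the idea; it is not the actual
# Random Cut Forest implementation.
import random

def build_tree(points):
    # Leaf: a single distinct point (or duplicates we cannot separate)
    if len(set(points)) <= 1:
        return {"leaf": points}
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]  # bounding box
    hi = [max(p[d] for p in points) for d in range(dims)]
    # Pick a dimension along which the box has some extent, then a random cut
    d = random.choice([d for d in range(dims) if lo[d] < hi[d]])
    cut = random.uniform(lo[d], hi[d])
    left = [p for p in points if p[d] <= cut]
    right = [p for p in points if p[d] > cut]
    if not left or not right:        # degenerate cut; try again
        return build_tree(points)
    return {"dim": d, "cut": cut,
            "left": build_tree(left), "right": build_tree(right)}

tree = build_tree([(1.0, 2.0), (1.1, 1.9), (9.0, 8.5)])
print(tree)
```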
Yet another example of unsupervised algorithm is
the so-called topic modeling for
documents with text content.
The algorithm is the basis of
the eponymous feature in the Amazon Comprehend service.
Given a collection of documents,
news articles for example and
the number of topics we would like to discover,
the algorithm produces the top words
that appear to define the topic,
together with the weight that each
of these words has in relation to the topic.
In this case, you see top words
that likely pertain to sports.
As with clustering in general,
the approach is sensitive to the number of
topics requested, and it still
requires us to assign meaning to
the discovered topics, such as health, sports, or politics.
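As a hedged illustration, here is a minimal sketch of topic modeling with LDA in scikit-learn (not the Comprehend or SageMaker implementations); the documents and the choice of two topics are made up.

```python
# A minimal sketch of topic modeling with LDA in scikit-learn (not the
# Comprehend or SageMaker implementations). The documents and the number
# of topics are made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the team won the game with a late goal",
    "players scored twice in the second half",
    "the election results surprised the ruling party",
    "voters went to the polls across the country",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Top words per discovered topic; assigning meaning (sports, politics, ...)
# is still up to us.
words = vectorizer.get_feature_names_out()
for topic in lda.components_:
    top = topic.argsort()[-4:][::-1]
    print([words[i] for i in top])
```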
To summarize, Amazon SageMaker includes
a popular clustering algorithm called k-means.
It's an Amazon improvement over
the well-known and scalable algorithm
called Web-scale k-means.
Another member of the unsupervised family is called
Principal Component Analysis or PCA for short.
Likewise available in SageMaker.
It's especially useful in reducing
the dimensionality of the dataset and is often
used as a feature engineering step
before passing the data to a supervised algorithm.
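Here is a minimal sketch of PCA used for dimensionality reduction as a feature engineering step; the data is random, and the use of scikit-learn rather than the SageMaker built-in is an assumption for illustration.

```python
# A minimal sketch of dimensionality reduction with PCA in scikit-learn
# (not the SageMaker built-in), as a feature engineering step before a
# supervised algorithm. The data is random and purely illustrative.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 20)           # 100 samples, 20 original features

pca = PCA(n_components=5)             # keep the 5 strongest components
X_reduced = pca.fit_transform(X)      # shape (100, 5)

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # variance captured by each component
```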
Latent Dirichlet Allocation, or LDA,
is the name of a particular topic modeling algorithm.
A variant is used by the topic modeling feature of Amazon
Comprehend, and the algorithm
is also available in SageMaker.
The Random Cut Forest algorithm
for anomaly detection is available in
SageMaker as well as in Amazon Kinesis Data Analytics,
for easy application to streaming data.
Kinesis Data Analytics also features hotspot detection,
another example of
an unsupervised learning algorithm which
you can use to identify
relatively dense regions in your data.
Okay, time for a quick quiz.
Suppose we have a problem of predicting
future values of a time series data.
For example, suppose that we
want to predict the future sales of some item.
We have observed historical daily sales figures up to
today and now want to
predict what the sales figures would be in the future.
Is this a supervised or unsupervised learning task?
At first we might be tempted to
answer that it is unsupervised,
similar to Anomaly Detection,
because there does not appear to be a teacher anywhere.
But this is a bit of a trick question.
You see, all observations of historical sales leading up to
a particular day D in the past can be viewed
as a training sample or
observation, with the correct label being
the actual recorded sale for day D, eight in this case.
In other words, just like with
the historical volcano eruptions,
the teacher here is the external environment.
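Here is a minimal sketch of that framing, with made-up numbers: turning a daily sales history into labeled training examples, where the label for each example is the actual sale observed on the following day.

```python
# A minimal sketch, with made-up numbers, of turning a daily sales history
# into labeled training examples: the days leading up to day D form the
# input, and the actual recorded sale on day D is the label.
sales = [5, 7, 6, 9, 8, 10, 12]    # hypothetical daily sales figures

window = 3
X, y = [], []
for d in range(window, len(sales)):
    X.append(sales[d - window:d])  # the three days leading up to day d
    y.append(sales[d])             # the "teacher": what actually happened

print(X[0], "->", y[0])            # [5, 7, 6] -> 9
```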
Finally, it's time for us to look at
deep learning which is really
a resurgence of neural networks.
To understand them, let's first look at
an element of neural networks called a neuron.
For the neuron you see in this diagram,
a data sample is seen as
a vector of numeric input values,
which are then linearly combined using the neuron's weights.
In other words, the neuron is
computing a weighted sum and then applies
a so-called activation function to
produce output in the range from zero to one.
With proper thresholding this
can work as a binary classifier.
Remember the perceptron that I
mentioned earlier in this course?
This is effectively what a perceptron is.
Except, a single neuron would not be
sufficient for practical classification needs.
Instead, we could combine
them into fully-connected layers to produce
the so-called artificial neural networks
also known as multilayer perceptrons.
In a feedforward pass,
the network turns the input values into an output,
which forms the prediction of the algorithm.
A special technique called
backpropagation is then used to reduce
the error between the desired or true output
and the actual one produced by the network.
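As a hedged illustration, here is a minimal sketch of a small multilayer perceptron trained with backpropagation, using scikit-learn's MLPClassifier; the tiny XOR-style dataset and the network size are assumptions.

```python
# A minimal sketch of a small fully-connected network (multilayer perceptron)
# trained with backpropagation, using scikit-learn's MLPClassifier. The tiny
# XOR-style dataset and the network size are made up.
from sklearn.neural_network import MLPClassifier

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]   # XOR: not linearly separable by a single perceptron

# One hidden layer of 8 neurons; training adjusts the weights to reduce the
# error between the network's output and the true labels.
net = MLPClassifier(hidden_layer_sizes=(8,), solver="lbfgs",
                    max_iter=5000, random_state=1)
net.fit(X, y)
print(net.predict(X))
```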
Originally, these neural networks were inspired
by some aspects of a biological nervous system, but
at this point they are really
a computational apparatus for
complex dependency modeling and function approximation.
What I showed so far is an example of
a traditional neural network
prior to the advent of deep learning.
In recent years, we have seen
a resurgence of neural networks rebranded as
deep learning due to several important advances:
advances that relate to the algorithms themselves,
the accumulation of large amounts of data for training, and
the emergence of powerful specialized hardware such as GPUs,
which are able to crunch
this massive amount of data by passing
it through networks that are very deep in
terms of the sheer number of layers.
Some of the results proved rather spectacular,
enabling many exciting applications
such as image and speech recognition,
natural language processing, and so on.
So how deep is deep?
Here's an example that is rather
puny by modern standards.
In fact, networks with over
1,000 layers have been experimented with.
Such networks have billions of
parameters and many millions
of images could be used in training.
The sheer computational power
required to train such networks is not cheap to
procure, and this is where AWS
comes in handy with GPU-based EC2 instances
housing powerful chipsets such
as NVIDIA Volta in the P3 family.
More importantly, you can distribute the training
across multiple GPUs in order to speed it
up, and AWS makes it rather
economical to set up the hardware cluster
just for the time of training, not having to
worry about expensive hardware sitting idle afterwards.
One important breakthrough in
deep learning was the invention of
the so-called Convolutional Neural Networks
or CNNs for short,
which are especially useful for image processing.
The main idea behind CNNs is that they are able to relate
nearby pixels in the image instead of
treating them as completely independent inputs,
which was the case prior to CNNs.
A special operation called
convolution is applied to entire subsections of
the image, and more importantly, the parameters of
these convolutions are also being learned in the process.
If several convolutional layers
are stacked one after another,
each convolutional layer learns to recognize patterns of
increasing complexity as we move through the layers.
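As a hedged illustration, here is a minimal sketch of a small convolutional network for image classification, written with Keras purely for illustration; the input size, layer sizes, and number of classes are assumptions, not from the course.

```python
# A minimal sketch of a small convolutional network for image classification,
# written with Keras purely for illustration. Input size, layer sizes, and
# the number of classes are assumptions, not from the course.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),                      # e.g. 28x28 grayscale images
    layers.Conv2D(16, kernel_size=3, activation="relu"),  # learned convolutions
    layers.MaxPooling2D(),
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # higher-level patterns
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),               # e.g. 10 object classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```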
We don't have time in this course to dive
into the details of how CNNs function.
Our goal is to understand some of the common use cases.
Of course, recognizing objects in images and
generally classifying images is a very common use case.
But CNNs have enabled
many other exciting applications related to images.
For example, they can be
used for semantic segmentation or
classification of individual pixels as
belonging or not belonging to detected objects,
a motorcyclist in this case.
Furthermore, they have been used for
other novel applications such as
artistic style transfer, where one image,
here a photo of a cat, is modified by applying
an artistic style to it, which was
previously extracted from
another image, typically a painting.
In the bottom right corner,
they have even been used to generate photos of cats.
These cats look photorealistic and
yet none of them actually existed in real life.
If we take the output of a neuron and feed it as
input to itself or neurons from previous layers,
we are creating the so-called recurrent neural networks.
It's as if the neuron remembers its output from
the previous iteration thus creating a kind of memory.
On the right-hand side you see just one unit
of a more complex network called LSTM,
which stands for Long Short-Term Memory.
It is commonly used for
speech recognition and translation.
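Here is a minimal sketch of a recurrent model built around an LSTM layer, written with Keras for illustration; the sequence length, feature count, and output are assumptions.

```python
# A minimal sketch of a recurrent model built around an LSTM layer, written
# with Keras for illustration. The sequence length, feature count, and output
# are assumptions.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(20, 8)),            # sequences of 20 steps, 8 features each
    layers.LSTM(32),                        # the LSTM keeps an internal memory across steps
    layers.Dense(1, activation="sigmoid"),  # e.g. one decision per sequence
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```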
In fact, LSTMs are used as a building block to
the so-called Sequence to Sequence
modeling which is used in neural machine translation.
The high-level architecture in the diagram shows how
an input phrase in one language is
being translated into "the green house."
Amazon released an entire library
called Sockeye for state-of-the-art
sequence-to-sequence modeling tasks
that customers can use in their projects.
We just looked at convolutional
and recurrent neural networks
which are the two most
common families of neural networks.
But as you can see here,
many network topologies exist and are being studied.
Amazon SageMaker
conveniently provides a built-in algorithm
for image classification based on ResNet, a kind of CNN,
but it also provides a sequence-to-sequence algorithm,
a neural topic modeling algorithm
to complement Latent Dirichlet Allocation,
and also the DeepAR forecasting
algorithm for time series
prediction, which we already looked at.
Remember the quiz? So, are
deep learning algorithms supervised or unsupervised?
Well, they can be either.
The algorithms shown on the slide are all supervised
except for neural topic
modeling, as the icons on the left indicate.
Deep learning algorithms have even been
employed as a key component
of a reinforcement learning algorithm.
Well, this concludes our review of
various Machine Learning algorithms.
Hopefully, you've come to
understand the different categories of
Machine Learning algorithms and how they
relate to the business problems they help solve.
Thanks for listening. You can follow me on Twitter
and please tune into other courses in this series.