The goal was to find weight values to minimize MSE on some training data: fitting a hyperplane.
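One standard way to write that objective (assuming n training examples x_i with targets y_i and a weight vector w, so the prediction is w · x_i; the exact scaling on the slide may differ):

\mathrm{MSE}(w) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - w \cdot x_i \right)^2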
This is the logistic or sigmoid function:
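In the usual notation (writing σ for the sigmoid; the slides may use a different symbol):

\sigma(x) = \frac{1}{1 + e^{-x}}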
It looks like this: an S-shaped curve that squashes any real-valued input into the interval (0, 1).
After applying a sigmoid non-linearity we can interpret the output as a probability.
The logistic function has a simple derivative:
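In the same notation:

\sigma'(x) = \sigma(x)\,\bigl(1 - \sigma(x)\bigr)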
The goal is to learn the parameters of the model from the training data.
One option is to attempt to find the single most probable set of parameters given our data and our prior:
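This is the maximum a posteriori (MAP) estimate. Writing θ for the parameters and D for the data (a notational assumption), and using Bayes' rule with the constant denominator dropped:

\theta_{\mathrm{MAP}} = \arg\max_{\theta}\, p(\theta \mid \mathcal{D}) = \arg\max_{\theta}\, p(\mathcal{D} \mid \theta)\, p(\theta)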
Laplace smoothing for naive Bayes can be seen as an example of this!
An even simpler alternative is to just pick the parameters that make the data most probable:
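This is the maximum likelihood estimate; in the same notation:

\theta_{\mathrm{ML}} = \arg\max_{\theta}\, p(\mathcal{D} \mid \theta)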
Simple example… Let’s say we observe some data generated by a distribution with one unknown parameter. What value of that parameter makes the observed data most probable?
Numerically, we are usually better off working with the log likelihood:
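Assuming the examples are drawn i.i.d., the log turns the product over examples into a sum:

\log p(\mathcal{D} \mid \theta) = \log \prod_{i=1}^{n} p(x_i \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)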
We want a likelihood function for our binary classifier. It could look like this:
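One natural choice (writing ŷ = σ(w · x) for the model's output, a notational assumption) is the Bernoulli likelihood, written case by case:

p(y \mid x) = \begin{cases} \hat{y} & \text{if } y = 1 \\ 1 - \hat{y} & \text{if } y = 0 \end{cases}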
This is much nicer to work with:
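The same likelihood collapsed into a single expression (valid because y is either 0 or 1):

p(y \mid x) = \hat{y}^{\,y}\,(1 - \hat{y})^{\,1 - y}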
This makes the log likelihood:
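Summed over the n training examples:

\sum_{i=1}^{n} \Bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\Bigr]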
The goal is to maximize the log likelihood, which is the same as minimizing the negative log likelihood:
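That is, we minimize:

L(w) = -\sum_{i=1}^{n} \Bigl[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\Bigr]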
This loss function is usually called cross-entropy.
Quiz: What is the cross-entropy loss for
Remember…
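As a worked example with assumed values (not necessarily the ones in the quiz): for a single example with y = 1 the loss reduces to -log ŷ, so ŷ = 0.8 gives -log 0.8 ≈ 0.22, while ŷ = 0.1 gives -log 0.1 ≈ 2.30. Confident wrong predictions are punished heavily.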
Zero and one are numbers, so there is no reason we couldn’t use the loss function we used for linear regression:
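Applied to the classifier's output ŷ_i = σ(w · x_i), that squared-error loss would be:

\sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2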
Here is how the two loss functions compare:
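A minimal sketch of that comparison (assuming the comparison is for a positive example, y = 1, as the prediction ŷ varies; the variable names here are illustrative):

import numpy as np

# Predictions for a positive example (y = 1), kept away from exactly 0
# so that log(0) never appears.
y_hat = np.linspace(0.01, 0.99, 9)

cross_entropy = -np.log(y_hat)        # -[y log yh + (1 - y) log(1 - yh)] with y = 1
squared_error = (1.0 - y_hat) ** 2    # (y - yh)^2 with y = 1

for p, ce, se in zip(y_hat, cross_entropy, squared_error):
    print(f"y_hat = {p:.2f}   cross-entropy = {ce:.3f}   squared error = {se:.3f}")

Near ŷ = 0 the cross-entropy blows up while the squared error stays bounded by 1, which is the usual argument for preferring cross-entropy here.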