In Bayesian learning, we use data to update a probability distribution over our model parameters.
The goal here is not to find the single best fit (or point estimate) for our parameters. Instead, we maintain a probability distribution over our parameters.
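Concretely, Bayes' rule tells us how the posterior over the parameters θ is obtained from the prior and the likelihood once we have seen the data D:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}$$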
This is related to the idea of ensemble learning.
Here is a recent survey if you are interested: Hands-on Bayesian Neural Networks - a Tutorial for Deep Learning Users
An alternative is to attempt to find the single most probable set of parameters given our data and our prior:
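In the same notation, this is the maximum a posteriori (MAP) estimate:

$$\theta_{\text{MAP}} = \arg\max_{\theta}\, p(\theta \mid D) = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$$

(The evidence p(D) does not depend on θ, so it drops out of the maximization.)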
Laplace smoothing for naive Bayes can be seen as an example of this!
An even simpler alternative is to just pick the parameters that make the data most probable:
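This is the maximum likelihood estimate, which ignores the prior entirely:

$$\theta_{\text{MLE}} = \arg\max_{\theta}\, p(D \mid \theta)$$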
Numerically, we are usually better off working with the log likelihood:
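Assuming the examples x_1, …, x_n are independent given θ, the likelihood is a product, and the log turns it into a sum:

$$\log p(D \mid \theta) = \log \prod_{i=1}^{n} p(x_i \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$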
This is the logistic or sigmoid function:
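$$\sigma(x) = \frac{1}{1 + e^{-x}}$$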
It is an S-shaped curve: σ maps every real input into the interval (0, 1), approaching 0 for large negative inputs, approaching 1 for large positive inputs, and passing through 1/2 at x = 0.
If we let y = σ(x), the logistic function has a simple derivative:
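$$\frac{dy}{dx} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr) = y\,(1 - y)$$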
We want a likelihood function for our binary classifier. It could look like this:
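One natural choice, writing y ∈ {0, 1} for the label and ŷ for the model's predicted probability of the positive class (for instance ŷ = σ(w·x) in logistic regression):

$$p(y \mid x) = \begin{cases} \hat{y} & \text{if } y = 1 \\ 1 - \hat{y} & \text{if } y = 0 \end{cases}$$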
This is much nicer to work with:
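The case split can be folded into a single expression:

$$p(y \mid x) = \hat{y}^{\,y}\,(1 - \hat{y})^{\,1 - y}$$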
This makes the log likelihood:
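$$\log p(y \mid x) = y \log \hat{y} + (1 - y)\log(1 - \hat{y})$$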
The goal is to maximize the log likelihood, which is the same as minimizing the negative log likelihood:
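$$\mathcal{L}(y, \hat{y}) = -\,y \log \hat{y} - (1 - y)\log(1 - \hat{y})$$

(summed or averaged over the training examples)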
This loss function is usually called cross-entropy.
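As a minimal sketch in NumPy (the function names here are our own, and the clipping constant is just a guard against log(0)):

```python
import numpy as np

def sigmoid(x):
    # Logistic function: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def binary_cross_entropy(y, y_hat, eps=1e-12):
    # Negative log likelihood of labels y in {0, 1} under predicted
    # probabilities y_hat, averaged over the examples.
    y_hat = np.clip(y_hat, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Example: three labels and the corresponding predicted probabilities.
y = np.array([1.0, 0.0, 1.0])
y_hat = sigmoid(np.array([2.0, -1.0, 0.5]))
print(binary_cross_entropy(y, y_hat))
```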
Quiz: What is the cross-entropy loss for
Remember…
Zero and one are numbers, so there is no reason we couldn’t use the loss function we used for linear regression:
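That would be the squared error applied to the predicted probability:

$$\mathcal{L}(y, \hat{y}) = (y - \hat{y})^2$$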
Here is how the two loss functions compare:
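As a sketch of that comparison (matplotlib assumed), plotting both losses against the predicted probability ŷ when the true label is y = 1:

```python
import numpy as np
import matplotlib.pyplot as plt

y_hat = np.linspace(0.001, 0.999, 500)   # predicted probability of the positive class

cross_entropy = -np.log(y_hat)           # cross-entropy with y = 1
squared_error = (1.0 - y_hat) ** 2       # squared error with y = 1

plt.plot(y_hat, cross_entropy, label="cross-entropy")
plt.plot(y_hat, squared_error, label="squared error")
plt.xlabel("predicted probability for the true class")
plt.ylabel("loss")
plt.legend()
plt.show()
```

The key difference: as the predicted probability of the true class goes to zero, the cross-entropy loss grows without bound, while the squared error never exceeds 1.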