In Bayesian learning, we use data to update a probability distribution over our model parameters $\theta$ given our training data $D = (x_1, x_2, \ldots, x_n)$:
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$
The goal here is not to find the single best fit for our parameters. Instead, we maintain a full probability distribution over them.
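As a concrete illustration, here is a minimal sketch of a Bayesian update, assuming a made-up coin-flip model in which $\theta$ is the probability of heads and $D$ is a list of observed flips; the grid discretization is just a convenience so that $P(\theta)$ can be stored as an array:

```python
import numpy as np

# Hypothetical data: a sequence of coin flips (1 = heads, 0 = tails).
D = [1, 0, 1, 1, 0, 1, 1, 1]

# Discretize theta so the prior and posterior can be held as arrays.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)        # uniform prior P(theta)

# Likelihood P(D | theta) = product over flips of theta^x * (1 - theta)^(1 - x)
heads, tails = sum(D), len(D) - sum(D)
likelihood = theta**heads * (1 - theta)**tails

# Bayes' rule: posterior is proportional to likelihood * prior; P(D) is the normalizer.
posterior = likelihood * prior
posterior /= posterior.sum()

print("posterior mean of theta:", (theta * posterior).sum())
```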
An alternative is to attempt to find the single most probable set of parameters given our data and our prior:
$$\theta_{MAP} = \operatorname*{argmax}_\theta \; P(D \mid \theta)\,P(\theta)$$
An even simpler alternative is to just pick the parameters that make the data most probable: $\theta_{ML} = \operatorname*{argmax}_\theta \; P(D \mid \theta)$
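To make the ML vs. MAP distinction concrete, here is a small sketch that continues the hypothetical coin-flip setup; the Beta(2, 2)-shaped prior is an arbitrary choice for illustration:

```python
import numpy as np

D = [1, 1, 1, 1, 1, 0]                             # hypothetical flips: 5 heads, 1 tail
theta = np.linspace(0.01, 0.99, 99)

heads, tails = sum(D), len(D) - sum(D)
likelihood = theta**heads * (1 - theta)**tails     # P(D | theta)
prior = theta * (1 - theta)                        # Beta(2, 2) prior, up to a constant

theta_ml = theta[np.argmax(likelihood)]            # argmax_theta P(D | theta)
theta_map = theta[np.argmax(likelihood * prior)]   # argmax_theta P(D | theta) P(theta)

print("ML :", theta_ml)    # close to 5/6
print("MAP:", theta_map)   # pulled toward 0.5 by the prior
```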
$P(D \mid \theta)$ is commonly referred to as the likelihood: $L(\theta) = P(D \mid \theta) = \prod_{x_i \in D} P(x_i \mid \theta)$ (The likelihood isn't really a probability distribution... We know what the data is, so the probability of the data is 1. It is a measure of how probable (likely) that data is given our model parameters.)
Numerically, we are usually better off working with the log likelihood: $LL(\theta) = \sum_{x_i \in D} \log P(x_i \mid \theta)$
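The numerical reason is easy to see: a product of many small probabilities underflows to zero in floating point, while the sum of their logs stays representable. A minimal sketch with made-up per-point probabilities:

```python
import numpy as np

# Hypothetical per-data-point probabilities P(x_i | theta), all fairly small.
p = np.full(1000, 1e-4)

print(np.prod(p))         # 0.0 -- the product underflows double precision
print(np.sum(np.log(p)))  # about -9210.3 -- the log likelihood is perfectly representable
```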
This is the logistic or sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
It looks like this: an S-shaped curve that squashes any real input into the interval $(0, 1)$, with $\sigma(0) = 0.5$.
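A quick sketch for plotting it (assuming numpy and matplotlib are available):

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-8, 8, 200)
plt.plot(z, sigmoid(z))
plt.xlabel("z")
plt.ylabel("sigma(z)")
plt.title("The logistic (sigmoid) function")
plt.grid(True)
plt.show()
```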
If we let $z = w^T x + b$, we end up with:
$$P(y=1 \mid x, w) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}$$
$$P(y=0 \mid x, w) = 1 - \sigma(w^T x + b)$$
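For a single example this is straightforward to compute; the weights, bias, and input below are made-up values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2, 2.0])   # hypothetical weights
b = -0.3                         # hypothetical bias
x = np.array([1.0, 0.5, 0.25])   # hypothetical input

p1 = sigmoid(w @ x + b)          # P(y = 1 | x, w)
p0 = 1.0 - p1                    # P(y = 0 | x, w)
print(p1, p0, p1 + p0)           # the two probabilities sum to 1
```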
The logistic function has a simple derivative:
$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$
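A quick numerical sanity check of this identity, comparing the analytic form against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)
eps = 1e-6

numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)   # finite-difference derivative
analytic = sigmoid(z) * (1 - sigmoid(z))                      # sigma'(z) = sigma(z)(1 - sigma(z))

print(np.max(np.abs(numeric - analytic)))  # tiny: the two agree to many decimal places
```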
We want a likelihood function for our binary classifier. It could look like this:
$$L(w) = \prod_{i=1}^{n} \begin{cases} P(y=1 \mid x_i, w) & \text{if } y_i = 1 \\ P(y=0 \mid x_i, w) & \text{if } y_i = 0 \end{cases}$$
This is much nicer to work with:
$$L(w) = \prod_{i=1}^{n} P(y=1 \mid x_i, w)^{y_i} \times P(y=0 \mid x_i, w)^{1 - y_i}$$
(remember... $x^1 = x$, $x^0 = 1$)
This makes the log likelihood:
$$LL(w) = \sum_{i=1}^{n} y_i \log P(y=1 \mid x_i, w) + (1 - y_i) \log P(y=0 \mid x_i, w)$$
(remember... $\log_b(x^y) = y \log_b(x)$)
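Here is a minimal sketch of this log likelihood on a small made-up dataset (X, y, w, and b are all arbitrary values chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical dataset: 4 examples with 2 features each, and binary labels.
X = np.array([[0.5, 1.0],
              [1.5, -0.5],
              [-1.0, 2.0],
              [2.0, 0.5]])
y = np.array([1, 0, 1, 0])

w = np.array([0.2, 0.7])   # hypothetical weights
b = 0.1                    # hypothetical bias

p1 = sigmoid(X @ w + b)    # P(y = 1 | x_i, w) for every example
ll = np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))
print("log likelihood:", ll)
```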
The goal is to maximize the log likelihood, which is the same as minimizing the negative log likelihood:
$$-LL(w) = -\sum_{i=1}^{n} \left[ y_i \log P(y=1 \mid x_i, w) + (1 - y_i) \log P(y=0 \mid x_i, w) \right]$$
This loss function is usually called cross-entropy:
$$-LL(w) = -\sum_{i=1}^{n} \left[ y_i \log\big(\sigma(w^T x_i + b)\big) + (1 - y_i) \log\big(1 - \sigma(w^T x_i + b)\big) \right]$$
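A sketch of the same loss as a function, with the predicted probabilities clipped away from exactly 0 and 1 so the logs stay finite; the clipping threshold is an arbitrary choice:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, b, X, y, eps=1e-12):
    """Negative log likelihood of a logistic-regression model on (X, y)."""
    p = sigmoid(X @ w + b)
    p = np.clip(p, eps, 1 - eps)   # avoid log(0) when the model is (over)confident
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Reusing the hypothetical dataset from the previous sketch:
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0], [2.0, 0.5]])
y = np.array([1, 0, 1, 0])
print(cross_entropy_loss(np.array([0.2, 0.7]), 0.1, X, y))
```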
Quiz: What is the cross-entropy loss for $\sigma(w^T x_i + b) = 1$ and $y_i = 0$? For $y_i = 1$?
Remember... $\log(1) = 0$, $\log(0) = -\infty$
Zero and one are numbers, so there is no reason we couldn't use the loss function we used for linear regression:
$$E(w) = \sum_{i=1}^{n} \big(y_i - \sigma(w^T x_i + b)\big)^2$$
Here is how the two loss functions compare: cross-entropy grows without bound as the model becomes confidently wrong, while the squared error for a single example never exceeds 1.
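To see the comparison for yourself, here is a sketch that plots both losses for a single example with true label $y = 1$, as a function of the predicted probability $\sigma(w^T x + b)$:

```python
import numpy as np
import matplotlib.pyplot as plt

# Predicted probability p = sigma(w^T x + b) for a single example with true label y = 1.
p = np.linspace(0.001, 0.999, 500)

cross_entropy = -np.log(p)        # -[1 * log(p) + 0 * log(1 - p)]
squared_error = (1 - p) ** 2      # (y - p)^2 with y = 1

plt.plot(p, cross_entropy, label="cross-entropy")
plt.plot(p, squared_error, label="squared error")
plt.xlabel("predicted probability P(y = 1 | x, w)")
plt.ylabel("loss for a single example with y = 1")
plt.legend()
plt.grid(True)
plt.show()
```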