In Bayesian learning, we use our training data \(\mathcal{D} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)\) to update a probability distribution over our model parameters \(\theta\):
\(P(\theta \mid \mathcal{D}) = \frac{ P( \mathcal{D} \mid \theta) P(\theta)} {P(\mathcal{D})}\)
The goal here is not to find the single best fit (or point estimate) for our parameters. Instead, we maintain a probability distribution over our parameters.
This is related to the idea of ensemble learning.
Here is a recent survey if you are interested: Hands-on Bayesian Neural Networks - a Tutorial for Deep Learning Users
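To make this concrete, here is a minimal sketch of a Bayesian update, assuming a Bernoulli (coin-flip) model with a conjugate Beta prior; the flips and the Beta(2, 2) prior are made-up illustrative values, not anything fixed by the discussion above.

```python
import numpy as np
from scipy.stats import beta

# Minimal sketch: Bayesian update for a Bernoulli parameter theta
# (the probability of heads) with a conjugate Beta prior.
# The flips and the Beta(2, 2) prior are made-up illustrative values.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])
heads = flips.sum()
tails = len(flips) - heads

# Prior Beta(a, b) + Bernoulli data  ->  posterior Beta(a + heads, b + tails).
a_prior, b_prior = 2, 2
a_post, b_post = a_prior + heads, b_prior + tails

# The result is a whole distribution over theta, not a single point estimate.
posterior = beta(a_post, b_post)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```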
An alternative is to attempt to find the single most probable set of parameters given our data and our prior:
\[\theta_{MAP} = \text{argmax}_\theta P(\mathcal{D} \mid \theta) P(\theta)\]
Laplace smoothing for naive Bayes can be seen as an example of this!
An even simpler alternative is to just pick the parameters that make the data most probable: \[\theta_{ML} = \text{argmax}_\theta P(\mathcal{D} \mid \theta)\]
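As a small worked example contrasting the two, assume (purely for illustration) a Bernoulli model for coin flips, with \(k\) heads observed in \(n\) flips and, for the MAP case, a \(\text{Beta}(\alpha, \beta)\) prior on \(\theta\): \[\theta_{ML} = \frac{k}{n}, \qquad \theta_{MAP} = \frac{k + \alpha - 1}{n + \alpha + \beta - 2}\] With \(\alpha = \beta = 2\), the MAP estimate becomes \((k + 1)/(n + 2)\), which is exactly the add-one (Laplace) smoothing mentioned above.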
\(P(\mathcal{D} \mid \theta)\) is commonly referred to as the likelihood: \[\mathcal{L}(\theta) = P(\mathcal{D} \mid \theta) = \prod_{\mathbf{x}_i \in \mathcal{D}} P(\mathbf{x}_i \mid \theta)\] where the product assumes the data points are independent given \(\theta\). (The likelihood isn’t really a probability distribution… The data is fixed, and we treat \(\mathcal{L}\) as a function of \(\theta\), which need not sum or integrate to 1. It is a measure of how probable (likely) that data is given our model parameters.)
Numerically, we are usually better off working with the log likelihood: \[\mathcal{LL}(\theta) = \sum_{\mathbf{x}_i \in \mathcal{D}} \log P(\mathbf{x}_i \mid \theta)\]
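To see the numerical issue, here is a quick sketch assuming a Bernoulli model with a made-up parameter value and simulated data: with a few thousand data points the raw product of per-point probabilities underflows to zero, while the sum of logs stays finite.

```python
import numpy as np

# Illustrative sketch: Bernoulli likelihood vs. log likelihood.
rng = np.random.default_rng(0)
theta = 0.7                                # assumed model parameter
x = rng.binomial(1, theta, size=5000)      # simulated coin flips

per_point = theta ** x * (1 - theta) ** (1 - x)   # P(x_i | theta)
print(np.prod(per_point))          # likelihood: underflows to 0.0
print(np.sum(np.log(per_point)))   # log likelihood: large, negative, finite
```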
This is the logistic or sigmoid function: \[\sigma(z) = \frac{1}{1 + e^{-z}}\]
It is an S-shaped curve that squashes any real input into \((0, 1)\): \(\sigma(0) = 0.5\), and it approaches 0 for large negative inputs and 1 for large positive inputs.
If we let \(z = \mathbf{w}^T\mathbf{x} + b\), we end up with: \[ \begin{split} P(y = 1 \mid \mathbf{x}, \mathbf{w}) &= \sigma( \mathbf{w}^T\mathbf{x} + b ) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}\\ P(y = 0 \mid \mathbf{x}, \mathbf{w}) &= 1 - \sigma( \mathbf{w}^T\mathbf{x} + b ) \end{split} \]
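As a minimal sketch of how these two probabilities might be computed (the weights, bias, and feature values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and input, purely for illustration.
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.4, 2.0])

p1 = sigmoid(w @ x + b)   # P(y = 1 | x, w)
p0 = 1.0 - p1             # P(y = 0 | x, w)
print(p1, p0)
```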
The logistic function has a simple derivative:
\[\sigma'(x) = \sigma(x)(1 - \sigma(x))\]
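A quick numerical check of that identity against a central-difference approximation (the test points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(-5, 5, 11)   # arbitrary test points
eps = 1e-6

analytic = sigmoid(xs) * (1 - sigmoid(xs))
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))   # tiny: the two agree closely
```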
We want a likelihood function for our binary classifier. It could look like this: \[\mathcal{L}(\mathbf{w}) = \prod_{i=1}^n \begin{cases} P(y = 1 \mid \mathbf{x_i}, \mathbf{w}), \text{if } y_i = 1 \\ P(y = 0 \mid \mathbf{x_i}, \mathbf{w}), \text{if } y_i = 0 \\ \end{cases} \]
This is much nicer to work with: \[\mathcal{L}(\mathbf{w}) = \prod_{i=1}^n P(y = 1 \mid \mathbf{x_i}, \mathbf{w})^{y_i} \times P(y = 0 \mid \mathbf{x_i}, \mathbf{w})^{1-y_i} \] (remember… \(a^1 = a, a^0 = 1\))
This makes the log likelihood: \[\mathcal{LL}(\mathbf{w}) = \sum_{i=1}^n y_i \log P(y = 1 \mid \mathbf{x_i}, \mathbf{w}) + (1 - y_i) \log P(y = 0 \mid \mathbf{x_i}, \mathbf{w}) \] (remember… \(\log_b(x^y) = y\log_b(x)\))
The goal is to maximize the log likelihood, which is the same as minimizing the negative log likelihood: \[-\mathcal{LL}(\mathbf{w}) = -\sum_{i=1}^n y_i \log P(y = 1 \mid \mathbf{x_i}, \mathbf{w}) + (1 - y_i) \log P(y = 0 \mid \mathbf{x_i}, \mathbf{w}) \]
This loss function is usually called cross-entropy. \[-\mathcal{LL}(\mathbf{w}) = -\sum_{i=1}^n y_i \log(\sigma( \mathbf{w}^T\mathbf{x_i} + b )) + (1 - y_i) \log (1 - \sigma( \mathbf{w}^T\mathbf{x_i} + b )) \]
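Here is a minimal numpy sketch of this loss; the data and parameter values are placeholders, purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, X, y):
    # Negative log likelihood of a logistic model on (X, y).
    p = sigmoid(X @ w + b)   # P(y = 1 | x_i, w) for every row of X
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Placeholder data and parameters.
X = np.array([[0.5, 1.0], [2.0, -1.0], [-1.5, 0.3]])
y = np.array([1, 0, 0])
w = np.array([0.8, -0.4])
b = 0.0
print(cross_entropy(w, b, X, y))
```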
Quiz: What is the cross-entropy loss for \(\sigma( \mathbf{w}^T\mathbf{x_i} + b ) = 1\) and \(y_i = 0\)? \(y_i = 1\)?
Remember… \(\log(1) = 0, ~~~ \log(0) = -\infty\)
Zero and one are numbers, so there is no reason we couldn’t use the loss function we used for linear regression: \[E(\mathbf{w}) = \sum_{i=1}^n (y_i - \sigma(\mathbf{w}^T\mathbf{x_i} + b))^2\]
Here is how the two loss functions compare: cross-entropy penalizes confident mistakes without bound (the loss goes to infinity as the predicted probability of the true class goes to 0), while the squared error for a single example never exceeds 1.
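A minimal numerical sketch of that comparison, for a single example with true label \(y = 1\) over an arbitrary grid of predicted probabilities:

```python
import numpy as np

# Both losses for one example with true label y = 1, as a function of
# the predicted probability p = sigma(w^T x + b).
p = np.linspace(0.01, 0.99, 9)    # avoid exactly 0, where log blows up
cross_entropy = -np.log(p)        # the y = 1 term of the cross-entropy
squared_error = (1 - p) ** 2      # squared-error loss

for pi, ce, se in zip(p, cross_entropy, squared_error):
    print(f"p={pi:.2f}  cross-entropy={ce:.2f}  squared error={se:.2f}")
```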