In Bayesian learning, we use our training data \(\mathcal{D} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n)\) to update a probability distribution over our model parameters \(\theta\):
\(P(\theta \mid \mathcal{D}) = \frac{ P( \mathcal{D} \mid \theta) P(\theta)} {P(\mathcal{D})}\)
The goal here is not to find the single best fit (or point estimate) for our parameters. Instead, we maintain a probability distribution over our parameters.
This is related to the idea of ensemble learning.
Here is a recent survey if you are interested: Hands-on Bayesian Neural Networks - a Tutorial for Deep Learning Users
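To make this concrete, here is a minimal sketch of a Bayesian update, assuming a Bernoulli (coin-flip) model with a conjugate Beta prior; the flips and the Beta(2, 2) prior are made-up illustrative values, not anything fixed by the discussion above.

```python
import numpy as np
from scipy.stats import beta

# Minimal sketch: Bayesian update for a Bernoulli parameter theta
# (the probability of heads) with a conjugate Beta prior.
# The flips and the Beta(2, 2) prior are made-up illustrative values.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])
heads = flips.sum()
tails = len(flips) - heads

# Prior Beta(a, b) + Bernoulli data  ->  posterior Beta(a + heads, b + tails).
a_prior, b_prior = 2, 2
a_post, b_post = a_prior + heads, b_prior + tails

# The result is a whole distribution over theta, not a single point estimate.
posterior = beta(a_post, b_post)
print("posterior mean:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
```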
An alternative is to attempt to find the single most probable set of parameters given our data and our prior:
\[\theta_{MAP} = \text{argmax}_\theta P(\mathcal{D} \mid \theta) P(\theta)\]
Laplace smoothing for naive Bayes can be seen as an example of this!
An even simpler alternative is to just pick the parameters that make the data most probable: \[\theta_{ML} = \text{argmax}_\theta P(\mathcal{D} \mid \theta)\]
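As a small worked example contrasting the two, assume (purely for illustration) a Bernoulli model for coin flips, with \(k\) heads observed in \(n\) flips and, for the MAP case, a \(\text{Beta}(\alpha, \beta)\) prior on \(\theta\): \[\theta_{ML} = \frac{k}{n}, \qquad \theta_{MAP} = \frac{k + \alpha - 1}{n + \alpha + \beta - 2}\] With \(\alpha = \beta = 2\), the MAP estimate becomes \((k + 1)/(n + 2)\), which is exactly the add-one (Laplace) smoothing mentioned above.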
\(P(\mathcal{D} \mid \theta)\) is commonly referred to as the likelihood: \[\mathcal{L}(\theta) = P(\mathcal{D} \mid \theta) = \prod_{\mathbf{x}_i \in \mathcal{D}} P(\mathbf{x}_i \mid \theta)\] where the product assumes the data points are independent given \(\theta\). (The likelihood isn’t really a probability distribution… The data is fixed, and we treat \(\mathcal{L}\) as a function of \(\theta\), which need not sum or integrate to 1. It is a measure of how probable (likely) that data is given our model parameters.)
Numerically, we are usually better off working with the log likelihood: \[\mathcal{LL}(\theta) = \sum_{\mathbf{x}_i \in \mathcal{D}} \log P(\mathbf{x}_i \mid \theta)\]
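To see the numerical issue, here is a quick sketch assuming a Bernoulli model with a made-up parameter value and simulated data: with a few thousand data points the raw product of per-point probabilities underflows to zero, while the sum of logs stays finite.

```python
import numpy as np

# Illustrative sketch: Bernoulli likelihood vs. log likelihood.
rng = np.random.default_rng(0)
theta = 0.7                                # assumed model parameter
x = rng.binomial(1, theta, size=5000)      # simulated coin flips

per_point = theta ** x * (1 - theta) ** (1 - x)   # P(x_i | theta)
print(np.prod(per_point))          # likelihood: underflows to 0.0
print(np.sum(np.log(per_point)))   # log likelihood: large, negative, finite
```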
This is the logistic or sigmoid function: \[\sigma(z) = \frac{1}{1 + e^{-z}}\]
It is an S-shaped curve that squashes any real input into \((0, 1)\): \(\sigma(0) = 0.5\), and it approaches 0 for large negative inputs and 1 for large positive inputs.
If we let \(z = \mathbf{w}^T\mathbf{x} + b\), we end up with: \[ \begin{split} P(y = 1 \mid \mathbf{x}, \mathbf{w}) &= \sigma( \mathbf{w}^T\mathbf{x} + b ) = \frac{1}{1 + e^{-(\mathbf{w}^T\mathbf{x} + b)}}\\ P(y = 0 \mid \mathbf{x}, \mathbf{w}) &= 1 - \sigma( \mathbf{w}^T\mathbf{x} + b ) \end{split} \]
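As a minimal sketch of how these two probabilities might be computed (the weights, bias, and feature values below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: sigma(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters and input, purely for illustration.
w = np.array([0.5, -1.2, 0.3])
b = 0.1
x = np.array([1.0, 0.4, 2.0])

p1 = sigmoid(w @ x + b)   # P(y = 1 | x, w)
p0 = 1.0 - p1             # P(y = 0 | x, w)
print(p1, p0)
```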
The logistic function has a simple derivative:
\[\sigma'(x) = \sigma(x)(1 - \sigma(x))\]
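A quick numerical check of that identity against a central-difference approximation (the test points are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

xs = np.linspace(-5, 5, 11)   # arbitrary test points
eps = 1e-6

analytic = sigmoid(xs) * (1 - sigmoid(xs))
numeric = (sigmoid(xs + eps) - sigmoid(xs - eps)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))   # tiny: the two agree closely
```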
We want a likelihood function for our binary classifier. It could look like this: \[\mathcal{L}(\mathbf{w}) = \prod_{i=1}^n \begin{cases} P(y = 1 \mid \mathbf{x_i}, \mathbf{w}), \text{if } y_i = 1 \\ P(y = 0 \mid \mathbf{x_i}, \mathbf{w}), \text{if } y_i = 0 \\ \end{cases} \]
This is much nicer to work with: \[\mathcal{L}(\mathbf{w}) = \prod_{i=1}^n P(y = 1 \mid \mathbf{x_i}, \mathbf{w})^{y_i} \times P(y = 0 \mid \mathbf{x_i}, \mathbf{w})^{1-y_i} \] (remember… \(a^1 = a, a^0 = 1\))
This makes the log likelihood: \[\mathcal{LL}(\mathbf{w}) = \sum_{i=1}^n y_i \log P(y = 1 \mid \mathbf{x_i}, \mathbf{w}) + (1 - y_i) \log P(y = 0 \mid \mathbf{x_i}, \mathbf{w}) \] (remember… \(\log_b(x^y) = y\log_b(x)\))
The goal is to maximize the log likelihood, which is the same as minimizing the negative log likelihood: \[-\mathcal{LL}(\mathbf{w}) = -\sum_{i=1}^n y_i \log P(y = 1 \mid \mathbf{x_i}, \mathbf{w}) + (1 - y_i) \log P(y = 0 \mid \mathbf{x_i}, \mathbf{w}) \]
This loss function is usually called cross-entropy. \[-\mathcal{LL}(\mathbf{w}) = -\sum_{i=1}^n y_i \log(\sigma( \mathbf{w}^T\mathbf{x_i} + b )) + (1 - y_i) \log (1 - \sigma( \mathbf{w}^T\mathbf{x_i} + b )) \]
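Here is a minimal numpy sketch of this loss; the data and parameter values are placeholders, purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(w, b, X, y):
    # Negative log likelihood of a logistic model on (X, y).
    p = sigmoid(X @ w + b)   # P(y = 1 | x_i, w) for every row of X
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Placeholder data and parameters.
X = np.array([[0.5, 1.0], [2.0, -1.0], [-1.5, 0.3]])
y = np.array([1, 0, 0])
w = np.array([0.8, -0.4])
b = 0.0
print(cross_entropy(w, b, X, y))
```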
Quiz: What is the cross-entropy loss for \(\sigma( \mathbf{w}^T\mathbf{x_i} + b ) = 1\) and \(y_i = 0\)? \(y_i = 1\)?
Remember… \(\log(1) = 0, ~~~ \log(0) = -\infty\)
Zero and one are numbers, so there is no reason we couldn’t use the loss function we used for linear regression: \[E(\mathbf{w}) = \sum_{i=1}^n (y_i - \sigma(\mathbf{w}^T\mathbf{x_i} + b))^2\]
Here is how the two loss functions compare: cross-entropy penalizes confident mistakes without bound (the loss goes to infinity as the predicted probability of the true class goes to 0), while the squared error for a single example never exceeds 1.
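A minimal numerical sketch of that comparison, for a single example with true label \(y = 1\) over an arbitrary grid of predicted probabilities:

```python
import numpy as np

# Both losses for one example with true label y = 1, as a function of
# the predicted probability p = sigma(w^T x + b).
p = np.linspace(0.01, 0.99, 9)    # avoid exactly 0, where log blows up
cross_entropy = -np.log(p)        # the y = 1 term of the cross-entropy
squared_error = (1 - p) ** 2      # squared-error loss

for pi, ce, se in zip(p, cross_entropy, squared_error):
    print(f"p={pi:.2f}  cross-entropy={ce:.2f}  squared error={se:.2f}")
```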