The goal was to find weight values to minimize MSE on some training data: fitting a hyperplane.
This is the logistic or sigmoid function: \[\sigma(z) = \frac{e^z}{1 + e^{z}} = \frac{1}{1 + e^{-z}}\]
It is an S-shaped curve that squashes any real-valued input into the range \((0, 1)\):
After applying a sigmoid non-linearity we can interpret the output as a probability.
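A quick way to see this numerically: a minimal Python sketch (NumPy and the helper name `sigmoid` are choices made here for illustration, not anything defined above) that evaluates the logistic function at a few points:

```python
import numpy as np

def sigmoid(z):
    # Logistic function: 1 / (1 + exp(-z)), maps any real z into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(sigmoid(z))  # ~[0.0067, 0.2689, 0.5, 0.7311, 0.9933] -- all strictly between 0 and 1
```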
The logistic function has a simple derivative:
\[\sigma'(z) = \sigma(z)(1 - \sigma(z))\]
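The derivative is quick to verify directly from the definition:
\[\sigma'(z) = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\left(1 - \sigma(z)\right)\]
using the fact that \(\frac{e^{-z}}{1 + e^{-z}} = 1 - \sigma(z)\).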
The goal is to learn the parameters \(\theta\) of a probabilistic model.
One option is to attempt to find the single most probable set of parameters given our data and our prior:
\[\theta_{MAP} = \text{argmax}_\theta P(\mathcal{D} \mid \theta) P(\theta)\]
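This is just the argmax of the posterior: by Bayes' rule \(P(\theta \mid \mathcal{D}) = P(\mathcal{D} \mid \theta) P(\theta) / P(\mathcal{D})\), and the denominator does not depend on \(\theta\), so it can be dropped:
\[\theta_{MAP} = \text{argmax}_\theta P(\theta \mid \mathcal{D}) = \text{argmax}_\theta \frac{P(\mathcal{D} \mid \theta) P(\theta)}{P(\mathcal{D})} = \text{argmax}_\theta P(\mathcal{D} \mid \theta) P(\theta)\]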
Laplace smoothing for naive Bayes can be seen as an example of this!
An even simpler alternative is to just pick the parameters that make the data most probable: \[\theta_{ML} = \text{argmax}_\theta P(\mathcal{D} \mid \theta)\]
\(P(\mathcal{D} \mid \theta)\) is commonly referred to as likelihood: \[\mathcal{L}(\theta) = P(\mathcal{D} \mid \theta) = \prod_{\mathbf{x}_i \in \mathcal{D}} P(\mathbf{x}_i \mid \theta)\] (The likelihood isn’t really a probability distribution… We know what the data is, so the probability of the data is 1. It is a measure of how probable (likely) that data is given our model parameters.)
Simple example… Let’s say \(\theta\) represents the probability that a rigged coin will come up heads: \(\theta = 0.9\). What is the likelihood of observing the following outcomes? \[[Heads,\; Tails,\; Heads]\]
What value of \(\theta\) would maximize the likelihood for this data set?
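One way to build intuition for both questions is to evaluate the likelihood \(\theta \cdot (1 - \theta) \cdot \theta\) for a few candidate values of \(\theta\). A minimal Python sketch (the candidate values are an arbitrary choice for illustration):

```python
# Outcomes: [Heads, Tails, Heads]; theta is the probability of heads.
def likelihood(theta):
    return theta * (1 - theta) * theta  # product of per-flip probabilities

for theta in [0.5, 2/3, 0.9]:
    print(f"theta={theta:.3f}  L(theta)={likelihood(theta):.4f}")
# theta=0.500  L(theta)=0.1250
# theta=0.667  L(theta)=0.1481
# theta=0.900  L(theta)=0.0810
```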
Numerically, we are usually better off working with the log likelihood: \[\mathcal{LL}(\theta) = \sum_{\mathbf{x}_i \in \mathcal{D}} \log P(\mathbf{x}_i \mid \theta)\]
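To see why: multiplying many small probabilities underflows to zero in double precision, while summing their logs stays perfectly representable. A small illustrative sketch:

```python
import math

probs = [1e-4] * 100          # 100 observations, each with probability 1e-4

product = 1.0
for p in probs:
    product *= p              # (1e-4)**100 = 1e-400, below float64's range
print(product)                # 0.0 -- the likelihood has underflowed

log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)         # about -921.0 -- no numerical trouble
```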
We want a likelihood function for our binary classifier. It could look like this: \[\mathcal{L}(\mathbf{w}) = \prod_{i=1}^n \begin{cases} P(y = 1 \mid \mathbf{x_i}, \mathbf{w}), \text{if } y_i = 1 \\ P(y = 0 \mid \mathbf{x_i}, \mathbf{w}), \text{if } y_i = 0 \\ \end{cases} \]
This is much nicer to work with: \[\mathcal{L}(\mathbf{w}) = \prod_{i=1}^n P(y = 1 \mid \mathbf{x_i}, \mathbf{w})^{y_i} \times P(y = 0 \mid \mathbf{x_i}, \mathbf{w})^{1-y_i} \] (remember… \(a^1 = a, a^0 = 1\))
This makes the log likelihood: \[\mathcal{LL}(\mathbf{w}) = \sum_{i=1}^n \left[ y_i \log P(y = 1 \mid \mathbf{x_i}, \mathbf{w}) + (1 - y_i) \log P(y = 0 \mid \mathbf{x_i}, \mathbf{w}) \right] \] (remember… \(\log_b(x^y) = y\log_b(x)\))
The goal is to maximize the log likelihood, which is the same as minimizing the negative log likelihood: \[-\mathcal{LL}(\mathbf{w}) = -\sum_{i=1}^n \left[ y_i \log P(y = 1 \mid \mathbf{x_i}, \mathbf{w}) + (1 - y_i) \log P(y = 0 \mid \mathbf{x_i}, \mathbf{w}) \right] \]
This loss function is usually called cross-entropy. \[-\mathcal{LL}(\mathbf{w}) = -\sum_{i=1}^n \left[ y_i \log(\sigma( \mathbf{w}^T\mathbf{x_i} + b )) + (1 - y_i) \log (1 - \sigma( \mathbf{w}^T\mathbf{x_i} + b )) \right] \]
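A minimal NumPy sketch of this loss, assuming the rows of `X` are the \(\mathbf{x_i}\) and `y` holds the 0/1 labels; the small `eps` clip is only there to avoid \(\log(0)\) in floating point and is not part of the formula above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy_loss(w, b, X, y, eps=1e-12):
    # Negative log likelihood of the labels under the logistic model.
    p = sigmoid(X @ w + b)            # predicted P(y = 1 | x_i, w) for each row
    p = np.clip(p, eps, 1 - eps)      # keep log() finite
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Tiny made-up example: 3 points, 2 features.
X = np.array([[0.5, 1.0], [1.5, -0.5], [-1.0, 2.0]])
y = np.array([1, 0, 1])
w = np.array([0.2, -0.1])
print(cross_entropy_loss(w, 0.0, X, y))
```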
Quiz: What is the cross-entropy loss for \(\sigma( \mathbf{w}^T\mathbf{x_i} + b ) = 1\) and \(y_i = 0\)? \(y_i = 1\)?
Remember… \(\log(1) = 0, ~~~ \log(0) = -\infty\)
Zero and one are numbers, so there is no reason we couldn’t use the loss function we used for linear regression: \[E(\mathbf{w}) = \sum_{i=1}^n (y_i - \sigma(\mathbf{w}^T\mathbf{x_i} + b))^2\]
Here is how the two loss functions compare:
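A small numeric comparison along the same lines: evaluate both losses for a single example with \(y_i = 1\) as the predicted probability \(\sigma(\mathbf{w}^T\mathbf{x_i} + b)\) sweeps from near 0 to near 1 (the grid of values is just for illustration). Cross-entropy blows up as the prediction approaches 0, while the squared error never exceeds 1:

```python
import numpy as np

p = np.linspace(0.01, 0.99, 5)   # predicted probabilities sigma(w^T x_i + b)
y = 1                            # true label

squared_error = (y - p) ** 2
cross_entropy = -(y * np.log(p) + (1 - y) * np.log(1 - p))

for pi, se, ce in zip(p, squared_error, cross_entropy):
    print(f"p={pi:.2f}  squared error={se:.3f}  cross-entropy={ce:.3f}")
```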