Logistic Regression: When the Output Must Be a Probability

Many prediction tasks have a binary outcome: spam or not spam, malignant or benign, fraud or legitimate. Applying linear regression directly fails here, because it can produce outputs below 0 or above 1, which are meaningless as probabilities.
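To make the failure concrete, here is a minimal sketch that fits an ordinary least-squares line to binary labels. The toy data and the names `x_toy`, `y_toy`, and `linear_preds` are invented for illustration.

```python
import numpy as np

# Toy 1-D data, made up for illustration: small x -> class 0, large x -> class 1
x_toy = np.arange(8, dtype=float)
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares line fit; polyfit returns [slope, intercept] for degree 1
slope, intercept = np.polyfit(x_toy, y_toy, 1)
linear_preds = slope * x_toy + intercept

print(linear_preds.round(3))
print("min:", linear_preds.min(), "max:", linear_preds.max())
```

On this data the fitted line dips below 0 at the left end and rises above 1 at the right end, so its outputs cannot be read as probabilities.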

The Sigmoid Function

We need to squash the output of our linear model into the interval (0, 1). The standard function for this is the sigmoid:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

As z approaches +∞, σ(z) approaches 1.

As z approaches −∞, σ(z) approaches 0.

At z = 0, σ(0) = 0.5.
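These three properties are easy to check numerically. The sketch below uses a plain, unguarded sigmoid (named `sigmoid_demo` here to keep it separate from the article's own function) on a handful of arbitrary inputs.

```python
import numpy as np

def sigmoid_demo(z):
    # Plain sigmoid; adequate for the moderate inputs used here
    return 1.0 / (1.0 + np.exp(-z))

z_values = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
sig_values = sigmoid_demo(z_values)

for z, s in zip(z_values, sig_values):
    print(f"z = {z:6.1f}  ->  sigma(z) = {s:.5f}")

# Symmetry: sigma(z) + sigma(-z) = 1
print(np.allclose(sig_values + sigmoid_demo(-z_values), 1.0))
```

The outputs increase with z, stay strictly between 0 and 1, and equal 0.5 at z = 0.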

Our model becomes:

$$\hat{p} = \sigma(w^T x + b) = P(y = 1 \mid x)$$

We interpret p̂ as the probability that the input x belongs to the positive class.

Why is logistic regression still called a linear classifier? Because the decision boundary, the set of points where p̂ = 0.5, occurs exactly where wᵀx + b = 0. This set is a hyperplane, so logistic regression can only separate classes with a straight line (in two dimensions) or a hyperplane (in higher dimensions).
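The equivalence between thresholding the probability at 0.5 and checking the sign of the linear score can be sketched as follows. The weights `w_demo`, bias `b_demo`, and random points are hypothetical values chosen only for the demonstration.

```python
import numpy as np

# Hypothetical weights and bias for a 2-D problem (illustrative values only)
w_demo = np.array([2.0, -1.0])
b_demo = 0.5

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 2))

scores = points @ w_demo + b_demo        # z = w^T x + b
probs = 1.0 / (1.0 + np.exp(-scores))    # p_hat = sigmoid(z)

by_probability = probs >= 0.5            # classify by thresholding the probability
by_sign = scores >= 0.0                  # classify by the side of the hyperplane

print(np.array_equal(by_probability, by_sign))
```

Because the sigmoid is monotonic, the two rules label the points identically; the sigmoid changes how confident the model is, not where the boundary lies.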

Deriving the Loss: Maximum Likelihood Estimation

We do not use MSE for logistic regression. Instead, we derive the loss from first principles using Maximum Likelihood Estimation (MLE).

We assume each label y_i follows a Bernoulli distribution with parameter p̂_i. The probability of observing a single label y_i given prediction p̂_i is:

$$P(y_i \mid x_i) = \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1 - y_i}$$

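To find weights that maximize the probability of the entire dataset, we assume the samples are independent, so the likelihood is the product of the per-sample probabilities. Taking the logarithm turns that product into a sum, and we maximize the log-likelihood: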

$$\ell(w) = \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$

Negating the log-likelihood and averaging over the n samples gives the Binary Cross-Entropy (BCE) loss, which we minimize:

$$L(w) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$
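A minimal sketch of this loss in NumPy is shown here. The function name `bce_loss`, the clipping constant `eps`, and the example predictions are assumptions made for illustration; clipping is a common guard against taking the log of 0.

```python
import numpy as np

def bce_loss(y_true, p_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 so log() never sees 0
    # (eps is an illustrative choice, not a value from the article)
    p = np.clip(p_pred, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_demo = np.array([1, 0, 1, 0])
confident_right = np.array([0.9, 0.1, 0.8, 0.2])
confident_wrong = np.array([0.1, 0.9, 0.2, 0.8])

print(bce_loss(y_demo, confident_right))   # small loss
print(bce_loss(y_demo, confident_wrong))   # much larger loss
print(bce_loss(y_demo, np.full(4, 0.5)))   # log(2) when every prediction is 0.5
```

The loss penalizes confident mistakes far more heavily than it rewards confident correct answers, which is the behavior the likelihood derivation implies.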

Unlike least squares, this loss has no closed-form minimizer, so we need an iterative optimizer such as gradient descent.

The Gradient

The gradient of the BCE loss with respect to w has a beautifully simple form:

$$\nabla_w L = \frac{1}{n} X^T (\hat{p} - y)$$

where p̂ is the vector of predictions. This has the same form as the gradient of linear regression, just with a different prediction step. This is not a coincidence; it arises because the sigmoid's derivative, σ(z)(1 − σ(z)), cancels the denominator produced by differentiating the cross-entropy.
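For readers who want the intermediate steps, here is a sketch of the chain-rule derivation for a single sample. It assumes the bias has been absorbed into w through a constant feature of 1 (as the code in this article does), and writes z_i = wᵀx_i and ℓ_i for the per-sample loss; these symbols are introduced here for the derivation.

$$\begin{aligned}
\ell_i &= -\left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right], \qquad \hat{p}_i = \sigma(z_i) \\
\frac{\partial \ell_i}{\partial \hat{p}_i} &= -\frac{y_i}{\hat{p}_i} + \frac{1 - y_i}{1 - \hat{p}_i} = \frac{\hat{p}_i - y_i}{\hat{p}_i (1 - \hat{p}_i)} \\
\frac{\partial \hat{p}_i}{\partial z_i} &= \sigma(z_i)\left(1 - \sigma(z_i)\right) = \hat{p}_i (1 - \hat{p}_i) \\
\frac{\partial \ell_i}{\partial z_i} &= \frac{\partial \ell_i}{\partial \hat{p}_i} \cdot \frac{\partial \hat{p}_i}{\partial z_i} = \hat{p}_i - y_i \\
\nabla_w L &= \frac{1}{n} \sum_{i=1}^{n} (\hat{p}_i - y_i)\, x_i = \frac{1}{n} X^T (\hat{p} - y)
\end{aligned}$$

The factor p̂_i(1 − p̂_i) appears once in a denominator and once in a numerator, which is why only the residual p̂_i − y_i survives.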

import numpy as np

def sigmoid(z):
    # Numerically stable sigmoid: exp() is only ever called on
    # non-positive values, so it cannot overflow
    e = np.exp(-np.abs(z))
    return np.where(z >= 0, 1 / (1 + e), e / (1 + e))

def logistic_regression_train(X, y, lr=0.1, n_iters=1000):
    n, d = X.shape
    # Add bias column of ones
    X = np.hstack([np.ones((n, 1)), X])
    w = np.zeros(d + 1)  # include bias weight
    
    for _ in range(n_iters):
        z = np.dot(X, w)              # linear scores z = Xw (bias included)
        p_hat = sigmoid(z)            # predicted probabilities
        gradient = (1/n) * np.dot(X.T, (p_hat - y))  # BCE gradient
        w -= lr * gradient            # gradient descent step
    
    return w

def predict(X, w, threshold=0.5):
    # Add bias column of ones
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(np.dot(X, w)) >= threshold).astype(int)
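One way to gain confidence in the gradient formula used in the training loop is to compare it against a finite-difference approximation. The sketch below is self-contained and uses its own helper names (`_sigmoid`, `bce`, `analytic_grad`, `numeric_grad`) and random test data, all of which are assumptions for illustration.

```python
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(w, X, y):
    # Mean binary cross-entropy for weights w
    p = _sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_grad(w, X, y):
    # Closed-form gradient: (1/n) X^T (p_hat - y)
    return X.T @ (_sigmoid(X @ w) - y) / len(y)

def numeric_grad(w, X, y, h=1e-6):
    # Central finite differences, one coordinate at a time
    g = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        g[j] = (bce(w + step, X, y) - bce(w - step, X, y)) / (2 * h)
    return g

rng = np.random.default_rng(42)
X_chk = rng.normal(size=(50, 3))
y_chk = (rng.random(50) < 0.5).astype(float)
w_chk = rng.normal(size=3)

max_diff = np.max(np.abs(analytic_grad(w_chk, X_chk, y_chk)
                         - numeric_grad(w_chk, X_chk, y_chk)))
print(max_diff)
```

If the two gradients agree to within a small tolerance, the closed-form expression is very likely implemented correctly.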

Regularization in Logistic Regression

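Just like linear regression, logistic regression can overfit. We add L2 regularization with strength λ (scikit-learn exposes the inverse strength, C = 1/λ, so a smaller C means stronger regularization):

$$L_{\text{reg}}(w) = L_{\text{BCE}}(w) + \frac{\lambda}{2} \lVert w \rVert_2^2$$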
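In gradient terms, the penalty simply adds λw to the BCE gradient. The sketch below illustrates this on toy data; the function names, the toy dataset, and the choice to leave the bias unpenalized (a common convention, not something stated in this article) are all assumptions.

```python
import numpy as np

def l2_regularized_grad(X, y, w, lam):
    # X is assumed to already contain a leading column of ones (bias at index 0)
    p_hat = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (p_hat - y) / len(y)   # BCE gradient
    penalty = lam * w                   # gradient of (lam / 2) * ||w||^2
    penalty[0] = 0.0                    # assumed convention: do not penalize the bias
    return grad + penalty

def fit(X, y, lam, lr=0.1, n_iters=2000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w -= lr * l2_regularized_grad(X, y, w, lam)
    return w

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
labels = (features[:, 0] + features[:, 1] > 0).astype(float)  # separable toy labels
X_reg = np.hstack([np.ones((200, 1)), features])

w_plain = fit(X_reg, labels, lam=0.0)
w_ridge = fit(X_reg, labels, lam=1.0)
print(np.linalg.norm(w_plain[1:]), np.linalg.norm(w_ridge[1:]))
```

On linearly separable data the unregularized weights keep growing as training continues, while the penalized weights settle at a much smaller norm.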

Multiclass Extension: Softmax

For K>2 classes, we replace sigmoid with softmax:

$$\hat{p}_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

The softmax outputs a probability distribution over all K classes. The loss becomes categorical cross-entropy.
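A minimal sketch of softmax and categorical cross-entropy is given below. Subtracting the row-wise maximum before exponentiating is a standard stability trick that this article does not cover; the function names, `eps`, and example logits are assumptions for illustration.

```python
import numpy as np

def softmax(Z):
    # Subtract the row-wise max before exponentiating; the result is
    # mathematically unchanged but large logits no longer overflow
    shifted = Z - Z.max(axis=1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=1, keepdims=True)

def categorical_cross_entropy(P, labels, eps=1e-12):
    # labels holds integer class indices in [0, K)
    n = len(labels)
    picked = P[np.arange(n), labels]    # probability assigned to the true class
    return -np.mean(np.log(np.clip(picked, eps, 1.0)))

logits = np.array([[2.0, 1.0, 0.1],
                   [1000.0, 1001.0, 1002.0]])  # would overflow a naive softmax
probs_sm = softmax(logits)
print(probs_sm.sum(axis=1))
print(categorical_cross_entropy(probs_sm, np.array([0, 2])))

# With K = 2 and logits [z, 0], softmax reduces to the sigmoid of z
z = 0.7
two_class = softmax(np.array([[z, 0.0]]))[0, 0]
print(np.isclose(two_class, 1 / (1 + np.exp(-z))))
```

The last check shows that binary logistic regression is the two-class special case of the softmax model.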

Summary for the Road

Logistic regression models the probability of a binary outcome via the sigmoid function. Training minimizes cross-entropy (derived from MLE), not MSE. Despite its name, it is a classification model. Its decision boundary is linear, which is its core limitation. For nonlinear boundaries, we need more powerful models—which we build on top of these exact same principles.