## A Gentle Introduction to Logistic Regression

### Introduction

This article aims to be an easy-to-understand, fully explained, gentle introduction to logistic regression. As much as possible, it will explain each detail in a way anyone can understand.

The only background required is single-variable differentiation from a first course in calculus. Partial derivatives will be used to optimize the logistic regression model, but they will be fully explained with examples.

### What is Logistic Regression?

Logistic regression is a binary (two-class), supervised (the training data is labeled) classification algorithm. Despite its name, it is not a regression algorithm (a predictor of a continuous numeric variable); it's a classifier. For example, it could be used to classify a borrower as *likely to default* or *not likely to default* on a loan based on relevant credit data, or a credit card transaction as *fraudulent* or *non-fraudulent*. It's a very powerful, common algorithm in data science.

Although logistic regression is a binary classifier out of the box, it can be extended to a multi-class classifier using either one-versus-rest (OvR) or one-versus-one (OvO) techniques. That is a topic for another blog post. Here we will be discussing a simple binary logistic regression classifier.

Logistic regression is used to discriminate between two classes usually labeled *class* 1 (the positive class) and *class* 0 (the negative class). In this article, we will be using this convention.

### The Concept of Probability Odds

First, let’s start with the simple concept of **odds**. The odds of an event is the probability \(p\) of the event occurring divided by the probability \(1-p\) of it not occurring:

$$ \frac{p}{1-p}$$

For example, if \(p=0.75\) then \(1-p=0.25\) and the odds are \(\frac{0.75}{0.25}=3\): success is three times as likely as failure. Another example is a heads/tails coin flip, which has odds of one.
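The arithmetic is easy to check numerically. A quick sketch (the `odds` helper is my own name, not part of any library):

```python
def odds(p):
    """Return the odds p / (1 - p) for a probability p in (0, 1)."""
    return p / (1 - p)

print(odds(0.75))  # 3.0 -- success is three times as likely as failure
print(odds(0.5))   # 1.0 -- a fair coin: even odds
```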

### The Logit Function or Log Odds

Next, we take the natural logarithm of the odds to obtain the **logit function** (the logit is also known as the **log odds**).

$$ logit(p)=log(\frac{p}{1-p})$$

Why did we take the natural logarithm of the odds? Well, it has to do with how we will model our probabilities as a function of our feature inputs. Keep reading and it will become clearer.

### Modeling Probabilities in the Logistic Regression Model

Let \(P(Y=1|x)\) be the probability that \(Y=1\), where \(Y\) is a random variable representing our class that takes on the values 0 or 1, given an input vector \(x\) representing a specific observation. As \(x\) varies, the probability that \(Y=1\) also varies. Note: \(P(Y=0|x) = 1 - P(Y=1|x)\) since there are only two classes, 0 and 1.

Logistic regression is a linear model. We are going to relate the probabilities to the linear expression \(\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}\) through the \(logit\) function. However, we will not model the probabilities directly, i.e., **we aren’t going to simply say**…

INCORRECT: \(P(Y=1|x)=\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}\).

Why? Because the right-hand side is unconstrained: it can take any value between \(-\infty\) and \(+\infty\), while probabilities must lie between 0 and 1. That would just be simple multivariate linear regression, which doesn’t work with probabilities. Further, the probabilities themselves are never observed (we only observe class labels 0 and 1), making such a linear regression model impossible to train.

Note, we are assuming \(p\) input features (here \(p\) counts features; it is not a probability). For example, in our loan example, \(p\) could be 2: feature #1 could be credit score and feature #2 could be yearly income.

### The Sigmoid Function

Here’s where the \(logit\) function comes in. If we set the \(logit\) function equal to \(\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}\) and solve for \(P(Y=1|x)\), we get an interesting functional form called the **sigmoid function**.

Let \(p\) denote \(P(Y=1|x)\) for simplicity (we are holding \(x\) constant, but it represents any value \(x\) could take on). Let’s do some algebra and solve for \(p\) while setting \(logit(p)\) equal to the dot product of our \(\beta\) and \(x\).

$$logit(p)=log(\frac{p}{1-p})$$

$$log(\frac{p}{1-p}) = \beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}$$

Let…

$$z=\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}$$

Let’s introduce the concept of the dot product. The dot product of two vectors is the sum of the products of their corresponding elements. \(z\) can be written as \(z = \beta^T \cdot x\), where we have added a 1 as the zeroth entry of \(x\) (\(x_0=1\)) for every \(x\). See below…

\begin{eqnarray}

z &=& \beta_{0}x_0+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}\\

&=& \beta_{0}\cdot 1+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}\\

&=& \beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}

\end{eqnarray}

Continuing on and substituting \(z\) for \(\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}\) we get…

$$log(\frac{p}{1-p})=z$$

$$\frac{p}{1-p}=e^z$$

$$p=e^z(1-p)$$

$$p(1+e^z)=e^z$$

$$p=\frac{e^z}{1+e^z}$$

$$p=\frac{1}{1+e^{-z}}$$

$$P(Y=1|x; \beta)=\frac{1}{1+e^{-\beta^{T} \cdot x}}$$

This last equation may not seem exciting or even intuitive but it has the exact properties we would want in a function taking in values between \(-\infty\) and \(\infty\) (our \(\beta^{T} \cdot x\) dot product) and producing probabilities.

Note: The dot product of \(\beta\) and \(x\) was included in the last equation to make explicit that \(P(Y=1|x; \beta)\) is dependent on both \(x\) and \(\beta\).
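As a sanity check on the algebra, the two equivalent forms of \(p\) derived above agree numerically. A small sketch (the `sigmoid` name is my own):

```python
import math

def sigmoid(z):
    """Sigmoid in the form 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

# The equivalent form e^z / (1 + e^z) from the derivation agrees for any z.
for z in (-2.0, 0.0, 3.5):
    assert abs(sigmoid(z) - math.exp(z) / (1 + math.exp(z))) < 1e-12

print(sigmoid(0.0))  # 0.5
```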

### Exploring the Sigmoid Function

The functional form…

$$\sigma(z)=\frac{1}{1+e^{-z}}$$

has a lot of awesome properties that make it ideal for modeling probabilities in logistic regression. It is known as the sigmoid function (sigmoid means S-shaped, which is where the function gets its name). Here is a plot of the function…

The logistic regression model doesn’t actually produce classifications such as class 0 or class 1 directly; rather, it produces probabilities via the sigmoid function. The classification happens when we choose a threshold. Most commonly the threshold is 0.5, or 50%: any probability above 50% is classified as class 1 and any probability below 50% as class 0.

Examining the plot, we see a couple interesting properties:

- As \(z\) approaches \(+\infty\), \(\sigma(z)\) approaches 1. Thus as our dot product \(z=\beta^{T}\cdot x\) gets larger, our classifier favors class 1 more and more.
- As \(z\) approaches \(-\infty\), \(\sigma(z)\) approaches 0. Thus as our dot product \(z=\beta^{T}\cdot x\) gets more and more negative, our classifier classifies the example as class 0 and feels more strongly about the classification.
- Based on these two facts, the classification probability is constrained between 0 and 1 just like real probabilities.
- If \(z=0\), \(\sigma(z)=0.5\) which is the typical threshold between class 0 and class 1. Thus, if \(z=\beta^T \cdot x\) is greater than 0, we classify as class 1 and if it is less than 0, we classify as class 0. Zero is a nice symmetric decision boundary.

These properties make the sigmoid function an excellent choice to model probabilities. Next, we will discuss what learning means.
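These decision-boundary facts translate directly into code. A minimal sketch (the names `predict_proba` and `predict` are my own, loosely echoing scikit-learn conventions):

```python
import math

def predict_proba(beta, x):
    """P(Y=1|x), where beta and x both include the intercept slot (x[0] == 1)."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict(beta, x, threshold=0.5):
    """Class 1 if the modeled probability clears the threshold, else class 0."""
    return 1 if predict_proba(beta, x) >= threshold else 0

beta = [-1.0, 2.0]                 # intercept -1, one feature weight 2
print(predict(beta, [1.0, 1.0]))   # z = 1 > 0   -> class 1
print(predict(beta, [1.0, 0.25]))  # z = -0.5 < 0 -> class 0
```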

### What is Learning in the Context of Machine Learning and AI?

Learning is the process of applying your machine learning or AI algorithm to determine the optimal parameters that will make your model most successful at the task at hand. Recall:

$$P(Y=1|x; \beta)=\frac{1}{1+e^{-\beta^T \cdot x}}$$

In this equation, the only unknowns are our \(\beta_{i}\) parameters. \(x\) is fixed (it comes from our training dataset), and \(P(Y=1|x; \beta)\) is then determined by the \(\beta_{i}\) parameters.

We must learn these \(\beta_{i}\) by applying the logistic regression training algorithm. Let’s discuss the first concept that will drive the learning process.

### The Likelihood Function

For each \(\beta\) we choose, we can compute a likelihood of the training data given \(\beta\) and our logistic regression model. Different \(\beta\)’s will produce different likelihoods based on our model.

For example, suppose you created 1,000 fictitious borrowers and assigned each a credit score, yearly income, on-time bill payment history, etc., and you also chose whether each one defaulted or paid off their debt.

Take one of those fictitious borrowers. This borrower has both an \(x\) feature vector with his or her credit score, yearly income, bill payment history, etc., and a \(y\) which takes the value 0 if the person did not default or 1 if the person defaulted. Suppose \(y=1\), that is, the person defaulted on the loan. Then \(P(y=1|x; \beta) = \sigma(\beta^T \cdot x)\). Thus, the likelihood of seeing a particular person with this specific credit situation (\(x\)) defaulting, as measured by our logistic regression model, is solely determined by \(\beta\). Changing \(\beta\) changes the likelihood.

Below, we will look at the **likelihood function**, which takes in an input \(\beta\) and a training dataset and returns the probability of seeing that dataset under the assumption of the \(\beta_i\) parameters.

Let \(y^{(i)}\) and \(x^{(i)}\) represent the \(i^{th}\) training example where \(y^{(i)}\) is the class label taking on a value of 0 or 1 and \(x^{(i)}\) is the feature vector.

There is a nice way to represent \(P(y^{(i)}|x^{(i)}; \beta)\)…

$$P(y^{(i)}|x^{(i)}; \beta) = P(y^{(i)}=1|x^{(i)}; \beta)^{y^{(i)}}P(y^{(i)}=0|x^{(i)}; \beta)^{1-y^{(i)}}$$

To understand this first assume \(y^{(i)} = 1\). This knocks out the second term (it produces a power of zero which reduces the term to one) and leaves \(P(y^{(i)}=1|x^{(i)}; \beta)\) which is exactly the probability we want. You can see the above equation also works when \(y^{(i)} = 0\) by knocking out the first term.

Assuming each training sample is independent (roughly speaking, no training example influenced the values of any other), we can multiply the probabilities of all the training samples to obtain the likelihood of seeing our training set given our current \(\beta\). The optimal thing to do is to choose the \(\beta\) that maximizes this likelihood.

This multiplication of all the probabilities in the training set (suppose it contains \(n\) observations) is called the **Likelihood function**. The equation is:

$$L(\beta) = \prod_{i=1}^{n} P(y^{(i)}=1|x^{(i)}; \beta)^{y^{(i)}}P(y^{(i)}=0|x^{(i)}; \beta)^{1-y^{(i)}}$$

The \(\prod\) symbol just means multiply the right-hand side over every training example. We want to maximize this function by choosing the optimal \(\beta\).

The likelihood function, \(L(\beta)\), can also be written in terms of the sigmoid function…

$$L(\beta) = \prod_{i=1}^{n} \sigma(z^{(i)})^{y^{(i)}}(1-\sigma(z^{(i)}))^{1 - y^{(i)}}$$

where \(z^{(i)} = \beta^T\cdot x^{(i)}\).

Remember, \(\sigma(z^{(i)})\) is our predicted probability that the random variable \(Y\) given \(x^{(i)}\) and \(\beta\) is class 1, i.e. \(P(Y=1|x^{(i)}; \beta) = \sigma(z^{(i)})\).
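The product above can be sketched directly in a few lines of pure Python (the function names and toy data are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(beta, X, y):
    """Product over the training set of sigma(z)^y * (1 - sigma(z))^(1 - y).
    Each row of X is assumed to carry a leading 1 for the intercept."""
    L = 1.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(b * xj for b, xj in zip(beta, x_i)))
        L *= p ** y_i * (1 - p) ** (1 - y_i)
    return L

X = [[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]]
y = [1, 0, 1]
print(likelihood([0.0, 1.0], X, y))  # likelihood under one choice of beta
print(likelihood([0.0, 0.0], X, y))  # beta = 0 gives (1/2)^3 = 0.125
```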

### The Log Likelihood Function

Rather than working with the likelihood function, we will take its natural logarithm (giving the **log likelihood function**) because a) it prevents underflow (many small probabilities multiplied together produce a number so small the computer can’t represent it, causing an error), and b) the calculus will be much, much cleaner.

We will denote the log likelihood function as \(l(\beta)\). Here is the algebra to simplify the function:

\begin{eqnarray}

l(\beta) &=& log(L(\beta))\\

&=& log(\prod_{i=1}^{n} P(y^{(i)}=1|x^{(i)}; \beta)^{y^{(i)}}P(y^{(i)}=0|x^{(i)}; \beta)^{1-y^{(i)}})\\

&=& log(\prod_{i=1}^{n} \sigma(z^{(i)})^{y^{(i)}}(1-\sigma(z^{(i)}))^{1 - y^{(i)}}) \\

&=& \sum_{i=1}^{n} y^{(i)}log(\sigma(z^{(i)})) + (1- y^{(i)})log(1-\sigma(z^{(i)}))

\end{eqnarray}
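The sum in the last line of the derivation is what one would actually compute. A sketch (names are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, X, y):
    """Sum over i of y_i*log(sigma(z_i)) + (1 - y_i)*log(1 - sigma(z_i))."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(b * xj for b, xj in zip(beta, x_i)))
        total += y_i * math.log(p) + (1 - y_i) * math.log(1 - p)
    return total

X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
# The log of a product is the sum of the logs, so this equals log(L(beta)).
print(log_likelihood([0.0, 0.0], X, y))  # 2 * log(0.5), about -1.386
```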

### Converting the Log Likelihood to a Cost Function

To optimize the logistic regression parameters, we maximize the likelihood function or, equivalently, the log likelihood function. By simply multiplying by -1 we obtain a cost function: instead of maximizing the log likelihood, we will minimize the cost. Let \(J(\beta)\) be our cost function. Thus,

\begin{eqnarray}

J(\beta) &=& -l(\beta)\\

&=& -\sum_{i=1}^{n}y^{(i)}log(\sigma(z^{(i)})) + (1- y^{(i)})log(1-\sigma(z^{(i)}))

\end{eqnarray}

Note, this works because the \(x\) that maximizes \(f(x)\) is the same \(x\) that minimizes \(-f(x)\).

### What’s a Partial Derivative?

In this article, I assume you know single-variable differentiation from a first course in calculus. If you don’t know partial derivatives, that’s fine; I’m going to teach you. It’s super easy. Basically, you have a function of two or more variables, say \(f(x, y)\), and you want to know how a change in \(x\), holding \(y\) constant, affects the output of the function. To compute \(\frac{\partial f}{\partial x}\), the partial derivative of \(f\) with respect to \(x\), you simply treat \(y\) as a constant and differentiate \(f\) with respect to \(x\).

Let’s do a few simple examples…

$$f(x,y)=x^2+y^2$$

We will now take the partial derivative of \(f\) with respect to \(x\). Remember \(y\) is simply a constant. You can think of it as a number…

$$\frac{\partial f}{\partial x} = \frac{\partial}{\partial x}[x^2 + y^2] = 2x$$

Here, \(y\) was simply a constant and the derivative of a constant with respect to \(x\) is 0.

Now, let’s take the partial derivative of \(f(x,y)\) with respect to \(y\)…

$$\frac{\partial f}{\partial y} = \frac{\partial}{\partial y}[x^2 + y^2] = 2y$$

We got essentially the same answer; this time we held \(x\) constant and took the partial derivative with respect to \(y\).

Now let’s do a three variable example…

Let \(f(x,y,z) = (1 + 2x + 3y + 4z)^2\)…

Then \(\frac{\partial f}{\partial x} = 2(1 + 2x + 3y + 4z)(\frac{\partial}{\partial x}[1 + 2x + 3y+4z]) = 4(1 + 2x + 3y + 4z)\)

Notice, I applied the chain rule (the derivative of the outer function evaluated at the inner function, times the derivative of the inner function). That was the only complicated part. I won’t go over the partials with respect to \(y\) and \(z\) since they’re very similar.
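If partial derivatives are new to you, a numeric finite-difference check is reassuring: perturb \(x\) slightly while holding \(y\) and \(z\) fixed, and compare against the analytic answer. A small sketch of the three-variable example:

```python
def f(x, y, z):
    return (1 + 2 * x + 3 * y + 4 * z) ** 2

def df_dx(x, y, z):
    # The analytic partial worked out above: 4 * (1 + 2x + 3y + 4z).
    return 4 * (1 + 2 * x + 3 * y + 4 * z)

# Central finite difference: nudge x only, holding y and z constant.
h = 1e-6
x0, y0, z0 = 0.5, -1.0, 2.0
numeric = (f(x0 + h, y0, z0) - f(x0 - h, y0, z0)) / (2 * h)
print(abs(numeric - df_dx(x0, y0, z0)) < 1e-4)  # True
```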

### Introducing the Gradient Descent Algorithm

Now, our task is to choose the \(\beta\) that minimizes our cost function \(J(\beta)\). Rather than calculating an exact solution, we’ll be using an iterative algorithm known as **gradient descent**.

What is gradient descent? Put simply we use calculus to calculate how we should alter our \(\beta_i\) by tiny amounts each step of the process to decrease the cost and repeat until we reach our optimal solution.

If you think of the cost function as a valley and put yourself at the top of it, calculus will tell you the steepest direction down. Your job is to take a tiny step in that direction, stop, recalculate the steepest direction, take another step, stop, recalculate… until, after enough small steps, you reach the bottom of the valley.

### What is a Gradient?

A gradient is simply a vector made up of partial derivatives. What is a vector? Picture two examples: a vector \(x\) that points 3 units to the right and 1 unit up, and a vector \(y\) that points 2 units to the left and 6 units up. You’ll note vectors have direction and magnitude (the length of the vector).

The gradient of our cost function \(J(\beta)\) is…

$$\nabla (J(\beta)) = \begin{bmatrix} \frac{\partial J}{\partial \beta_0} \\ \frac{\partial J}{\partial \beta_1} \\ \frac{\partial J}{\partial \beta_2} \\ \vdots \\ \frac{\partial J}{\partial \beta_p} \end{bmatrix}$$

Note: for a specific value of \(\beta\), say \(\beta^{0}\), the gradient \(\nabla (J(\beta^{0}))\) is a vector of numeric values, just like the vectors above.

### Gradient Descent Learning Rule

Let \(\beta^{t}\) be the \(\beta\) vector after \(t\) steps of the gradient descent algorithm. We assign small random values to \(\beta^{0}\) using a random number generator to give us an initial starting position. Then…

$$\beta^{t+1} = \beta^t - \eta \nabla (J(\beta^{t}))$$

Notice the \(\eta\) hyperparameter. This is called the **learning rate**. It usually falls between 0 and 1 and it adjusts the step size during gradient descent. So the negative of the gradient is giving you the steepest direction of descent and the learning rate is shrinking the magnitude so the minimum doesn’t get overshot by too big of a step. The smaller the learning rate, the smaller the steps.
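The update rule is short in code. Here is a sketch on a simple bowl-shaped cost \(J(\beta)=\beta_0^2+\beta_1^2\) (my own toy example, not the logistic cost yet), whose minimum sits at the origin:

```python
def grad(beta):
    # Gradient of J(beta) = beta_0^2 + beta_1^2; the minimum is at (0, 0).
    return [2 * b for b in beta]

eta = 0.1                       # learning rate
beta = [4.0, -3.0]              # arbitrary starting position
for _ in range(100):            # 100 tiny steps down the valley
    g = grad(beta)
    beta = [b - eta * gj for b, gj in zip(beta, g)]

print(beta)  # both components have shrunk toward 0
```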

### Derivative of the Sigmoid Function

In order to calculate the gradient of our cost function, we need to learn a little simple fact about the derivative of the sigmoid function.

Recall our sigmoid function \(\sigma(z) = \frac{1}{1 + e^{-z}}\). Well, this function has a special property. Spoiler alert! \(\frac{\partial \sigma}{\partial z} = \sigma(z)(1-\sigma(z))\). Let’s see why…

$$\sigma(z) = \frac{1}{1+e^{-z}}$$

\begin{eqnarray}

\frac{\partial \sigma}{\partial z} &=& \frac{\partial}{\partial z}(1+e^{-z})^{-1}\\

&=& -1(1+e^{-z})^{-2}(\frac{\partial}{\partial z}[1+e^{-z}]) \\

&=& -1(1 + e^{-z})^{-2}(-e^{-z}) \\

&=& \frac{e^{-z}}{(1+e^{-z})^2} \\

\end{eqnarray}

Now I’m going to add and subtract 1 in the numerator which is equivalent to adding zero to the numerator so it changes nothing.

\begin{eqnarray}

\frac{e^{-z}}{(1+e^{-z})^2} &=& \frac{(1+e^{-z}) - 1}{(1+e^{-z})^2} \\

&=&\frac{1}{1+e^{-z}}\frac{(1+e^{-z}) - 1}{1+e^{-z}} \\

&=&\frac{1}{1+e^{-z}}(1-\frac{1}{(1+e^{-z})}) \\

&=&\sigma(z)(1-\sigma(z)) \\

\end{eqnarray}

And thus,

$$\frac{\partial \sigma}{\partial z} = \sigma(z)(1-\sigma(z))$$
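The identity is easy to verify numerically with a finite-difference check (a sketch; names are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

h = 1e-6
for z in (-3.0, 0.0, 1.7):
    # Central finite-difference approximation of the derivative at z.
    numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)
    # The closed form derived above.
    analytic = sigmoid(z) * (1 - sigmoid(z))
    assert abs(numeric - analytic) < 1e-8

print("derivative identity checks out")
```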

### Taking the Partial Derivative of our Cost Function \(J(\beta)\) with Respect to \(\beta_j\)

Recall…

$$J(\beta) = -\sum_{i=1}^{n} y^{(i)}log(\sigma(z^{(i)})) + (1- y^{(i)})log(1-\sigma(z^{(i)}))$$

and

$$\nabla (J(\beta)) = \begin{bmatrix} \\ \frac{\partial J}{\partial \beta_0} \\ \frac{\partial J}{\partial \beta_1} \\ \frac{\partial J}{\partial \beta_2} \\ \vdots \\ \ \frac{\partial J}{\partial \beta_p} \end{bmatrix}$$

furthermore

$$z=\beta_{0}+\beta_{1}x_{1}+\cdots+\beta_{p}x_{p}$$

Let’s calculate the partial derivative (holding all other variables constant) of our cost function with respect to \(\beta_j\). This will give us \(\frac{\partial J(\beta)}{\partial \beta_j}\), which is an arbitrary element of \(\nabla (J(\beta))\).

\begin{eqnarray}

\frac{\partial J(\beta)}{\partial \beta_j} &=& \frac{\partial}{\partial \beta_j} -\sum_{i=1}^{n} y^{(i)}log(\sigma(z^{(i)})) + (1- y^{(i)})log(1-\sigma(z^{(i)}))\\

&=& -\sum_{i=1}^{n} \frac{\partial}{\partial \beta_j} [y^{(i)}log(\sigma(z^{(i)})) + (1- y^{(i)})log(1-\sigma(z^{(i)}))]\\

&=& -\sum_{i=1}^{n}[y^{(i)}\frac{1}{\sigma(z^{(i)})} - (1- y^{(i)})\frac{1}{1-\sigma(z^{(i)})}]\frac{\partial}{\partial \beta_j}\sigma(z^{(i)})\\

&=& -\sum_{i=1}^{n}[y^{(i)}\frac{1}{\sigma(z^{(i)})} - (1- y^{(i)})\frac{1}{1-\sigma(z^{(i)})}]\sigma(z^{(i)})(1-\sigma(z^{(i)}))\frac{\partial}{\partial \beta_j}z^{(i)}\\

&=& -\sum_{i=1}^{n}[y^{(i)}\frac{1}{\sigma(z^{(i)})} - (1- y^{(i)})\frac{1}{1-\sigma(z^{(i)})}]\sigma(z^{(i)})(1-\sigma(z^{(i)}))x^{(i)}_j\\

&=& -\sum_{i=1}^{n}[y^{(i)}(1-\sigma(z^{(i)})) - (1- y^{(i)})\sigma(z^{(i)})]x^{(i)}_j\\

&=& -\sum_{i=1}^{n}(y^{(i)} - \sigma(z^{(i)}))x_j^{(i)}\\

\end{eqnarray}

Notes on the steps:

- We can distribute the partial derivative \(\frac{\partial}{\partial \beta_j}\) into the summation because the derivative of a sum is the sum of the derivatives.
- \(\frac{\partial}{\partial x}log(x) = \frac{1}{x}\)
- \(\frac{\partial \sigma}{\partial z} = \sigma(z)(1-\sigma(z))\)

There you have it. The partial derivative of the cost function with respect to \(\beta_j\) is…

$$\frac{\partial J(\beta)}{\partial \beta_j} = -\sum_{i=1}^{n}(y^{(i)} – \sigma(z^{(i)}))x_j^{(i)}$$
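In code, the whole gradient vector falls out of this formula in a few lines (a sketch; the names and toy data are my own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient(beta, X, y):
    """dJ/dbeta_j = -sum_i (y_i - sigma(z_i)) * x_ij, for every j at once.
    Rows of X carry a leading 1, so j = 0 is the intercept."""
    p = len(beta)
    g = [0.0] * p
    for x_i, y_i in zip(X, y):
        err = y_i - sigmoid(sum(b * xj for b, xj in zip(beta, x_i)))
        for j in range(p):
            g[j] -= err * x_i[j]
    return g

X = [[1.0, 2.0], [1.0, -1.0]]
y = [1, 0]
# By hand: with beta = 0, both sigmas are 0.5, so the errors are 0.5 and -0.5.
print(gradient([0.0, 0.0], X, y))  # [0.0, -1.5]
```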

### Training a Logistic Regression Model

Let \(\Delta_j = \eta\sum_{i=1}^{n}(y^{(i)} - \sigma(z^{(i)}))x_j^{(i)}\). Note we took the negative of the partial derivative of the cost function with respect to \(\beta_j\) to reverse the direction of the gradient, which naturally points “uphill”.

1. Choose the learning rate \(\eta\). Example values include 0.01, 0.05, 0.1, etc.
2. Initialize \(\beta^{0}\) to small random values, such as draws from a normal distribution with \(\mu=0\) and \(\sigma=0.1\).
3. Set \(\beta_j^{t+1} = \beta_j^t + \Delta_j\), where \(\Delta_j\) is as defined above and \(\beta_j^t\) is the \(j^{th}\) component of \(\beta\) in the \(t^{th}\) epoch (a pass through the training set). Do this for \(j=0 \cdots p\). Note, for \(j=0\), \(x_0=1\) by default for all \(n\) training examples.
4. Repeat step #3 for \(m\) epochs. \(m\) depends on your learning rate and data; some experimentation may be necessary.
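Putting the steps above together, a minimal from-scratch training loop might look like the following sketch (all names and the toy dataset are my own; this is an illustration, not the article's exact implementation):

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, eta=0.1, epochs=500, seed=0):
    """Batch gradient descent on the logistic regression cost.
    Rows of X must carry a leading 1 for the intercept (x_0 = 1)."""
    rng = random.Random(seed)
    p = len(X[0])
    beta = [rng.gauss(0.0, 0.1) for _ in range(p)]  # small random start
    for _ in range(epochs):                         # one pass = one epoch
        # beta_j += eta * sum_i (y_i - sigma(z_i)) * x_ij
        delta = [0.0] * p
        for x_i, y_i in zip(X, y):
            err = y_i - sigmoid(sum(b * xj for b, xj in zip(beta, x_i)))
            for j in range(p):
                delta[j] += eta * err * x_i[j]
        beta = [b + d for b, d in zip(beta, delta)]
    return beta

# Tiny linearly separable toy set: class 1 when the feature is positive.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
beta = train(X, y)
probs = [sigmoid(sum(b * xj for b, xj in zip(beta, x_i))) for x_i in X]
print([round(p, 2) for p in probs])  # low for class 0 rows, high for class 1
```

Note the design choice: this is batch gradient descent (the whole training set per update), matching the summation in the learning rule; stochastic or mini-batch variants update after fewer examples.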

### Conclusion

I hope by now you fully understand the logistic regression algorithm. You can see my implementation of the logistic regression algorithm from scratch in Python. In a future post, I’m going to cover the code I wrote so you can write your own logistic regression algorithm from scratch. You can also check out a list of the other machine learning and AI algorithms I wrote from scratch. Feel free to contact me with any questions or errors you may spot. I hope you enjoyed!

Barrett Duna is a professional data scientist versed in all the major machine learning and AI algorithms. He graduated from UCLA with a B.S. in Mathematics/Economics and is currently admitted to George Mason University for the M.S. Data Analytics Engineering program.