Loss Function




In this blog post, we’ll discuss the loss function, the parameter θ, and different types of loss functions. I’ve learnt a lot while researching this topic and hope you’ll feel the same. Without further ado, let’s start off with the loss function.

In simple terms, the objective of a loss function is to measure the difference, or deviation, between the actual ground-truth value and an estimated approximation of that same value.

Simplest Form of Loss Function

The above equation is the simplest form of a loss function.
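In its simplest form, this just measures the gap between truth and estimate, something like Loss = y − ŷ, where y is the actual value and ŷ is the estimated value.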

In more technical, Wikipedia terms: in mathematical optimization and decision theory, a loss function or cost function is a function that maps an event or values of one or more variables onto a real number intuitively representing some “cost” associated with the event. An optimization problem seeks to minimize a loss function.

Why We Are Learning About the Loss Function

The loss function is a key component of any learning mechanism in AI, be it machine learning, deep learning, or reinforcement learning. The loss function acts as feedback to the system we are building; without feedback, the system will never know where and what should be improved.

Model Function

θ is the parameter set of the trained model M. The loss function helps the model answer the what and where questions. The answer to the what 🔍 question is θ, which should be adjusted to reduce the difference between the actual and the estimate. The where 🔍 question asks which θ, i.e. which specific parameters need to change and by how much.

With repeated iterations of the model over different samples of the dataset, we identify the answers to the what and where questions.

Role of θ

Consider the iris data, with sepal width, sepal length, petal width, and petal length as features, and the iris variants setosa, versicolor, and virginica as targets. It is a multi-class problem with more than two classes.

We’ll experiment with a simple Logistic Regression fit on the iris data, using different numbers of iterations to converge the estimate towards the actual.

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np


iris = load_iris()  # load the iris dataset as a dict-like Bunch object
x_train, x_test, y_train, y_test = train_test_split(iris['data'], iris['target'], test_size=0.2)  # train and test split (no random_state, so your numbers will differ slightly from those below)
LR = LogisticRegression(max_iter=1)  # change max_iter from 1 to 100 to see the effect on the learned coefficients θ
LR.fit(x_train, y_train)

Check out the coefficients of the Logistic Regression at max_iter=1. LR.coef_ contains three rows, one per class, each holding that class's feature coefficient values θ, while intercept_ holds the bias of the Logistic Regression for each class. With the help of θ and the intercept, the model creates its decision function.

"""
LR.coef_
array([[-0.06626003, 0.00156274, -0.11994002, -0.04727278],
[ 0.03090311, 0.00097671, 0.03840425, 0.0103922 ],
[ 0.03535691, -0.00253945, 0.08153577, 0.03688058]])

LR.intercept_
array([ 4.83768518e-18, -1.00170102e-03, 1.00170102e-03])

LR.decision_function(np.array([x_train[3]]))
array([[-3.49185211, 1.63301032, 1.85884179]])
"""

To arrive at the decision function values for sample no. 3, i.e. x_train[3], do the following calculation with the coefficients and intercepts.

y_estimate = np.sum(np.multiply(LR.coef_, np.array([x_train[3]])), axis=1) + LR.intercept_
"""
y_estimate
array([-3.49185211, 1.63301032, 1.85884179])
"""

Each value in y_estimate is the confidence score of the sample for the corresponding class. When we run the Logistic Regression with max_iter=100, the y_estimate values start to converge and the parameter values are updated as follows.

"""
LR.coef_
array([[-0.45753939, 0.80624315, -2.38307659, -0.9789476 ],
[ 0.32666965, -0.29473046, -0.16163767, -0.73956488],
[ 0.13086974, -0.5115127 , 2.54471426, 1.71851248]])

LR.intercept_
array([ 10.00706074, 2.72893805, -12.73599879])

y_estimate
array([ 7.64747156, 3.0268779 , -10.67434946])
"""

For sample x_train[3], y_train[3] is 0, so the confidence score of class 0 increases from -3.49 to 7.64 when max_iter=100. With each iteration, the loss is reduced by updating the parameters θ.

Difference Between Loss Function and Cost Function: The loss function computes the error for a single training example, while the cost function is the average of the loss functions of the entire training set.
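As a minimal sketch of that distinction (the function names here are just illustrative, using squared error as the per-example loss):

import numpy as np

def squared_loss(yHat_i, y_i):
    # loss: error for a single training example
    return (yHat_i - y_i) ** 2

def cost(yHat, y):
    # cost: average of the per-example losses over the whole training set
    return np.mean([squared_loss(yh, yt) for yh, yt in zip(yHat, y)])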

Common Loss Functions

  • Squared Loss (Mean Square Error)
  • Absolute Loss (Mean Absolute Error)
  • Hinge Loss
  • Log Loss or Cross Entropy Loss

Mean Squared Error or L2 Loss

The MSE of an estimator measures the average of the squares of the errors. It is the average squared difference between the estimated values and the actual values.

MSE Equation
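In symbols, the usual form is MSE = (1/m) Σᵢ (ŷᵢ − yᵢ)², where m is the number of samples, yᵢ the actual value, and ŷᵢ the estimated value.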

MSE values are almost always positive rather than zero, because of randomness in the estimator and because the estimator may lose information about the actual ground truth during estimation.

Loss Function — Mean Squared Error

Mean Squared Error Code

def MSE(yHat, y):
    return np.sum((yHat - y)**2) / y.size
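A quick usage example with made-up numbers, using the MSE function just defined:

y = np.array([3.0, 5.0, 2.5])
yHat = np.array([2.5, 5.0, 4.0])
print(MSE(yHat, y))  # (0.25 + 0.0 + 2.25) / 3 ≈ 0.833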

In regression problems, MSE measures the distance between the data points and the predicted regression line, and it helps determine to what extent the model fits the data. Because the errors are squared, large errors are penalized disproportionately, which makes MSE sensitive to outliers.

Also note that increasing the sample size m typically decreases the MSE, because a larger sample reduces the variance of the estimate, making it easier to close the distance between the estimator and the actual value.

Mean Absolute Error or L1 Loss

MAE measures the error between a pair of variables, such as predicted vs actual. It is the average absolute difference between X and Y, and it is widely used for forecast error in time series analysis.

Loss Function — Mean Absolute Error
MAE Equation
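In symbols, the usual form is MAE = (1/m) Σᵢ |ŷᵢ − yᵢ|, the average of the absolute errors over m samples.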

Mean Absolute Loss Code

def L1(yHat, y):
    return np.sum(np.absolute(yHat - y)) / y.size

MAE is less sensitive to outliers. A useful rule of thumb: the constant prediction that minimizes MAE is the median of the targets, while the one that minimizes MSE is their mean.
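A small sketch of that sensitivity, reusing the MSE and L1 functions above on made-up values that contain one outlier:

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # the last target is an outlier
yHat = np.array([1.1, 2.1, 2.9, 4.2, 5.0])  # the model misses the outlier badly

print(MSE(yHat, y))  # ≈ 1805, dominated by the single large error
print(L1(yHat, y))   # ≈ 19.1, grows only linearly with the outlier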

Hinge Loss

Hinge loss is used for maximum margin classifiers. Maximum margin classification brings us to the SVM (support vector machine), where the distance between the data points and the decision boundary is kept as large as possible. The loss function's penalty depends on how badly a data point is misclassified, i.e. how far the point sits on the wrong side of the decision boundary.

Loss Function — Maximum Margin Classifier
Hinge Loss Equation
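The standard form, for labels y ∈ {−1, 1} and raw score ŷ, is Hinge(ŷ, y) = max(0, 1 − y·ŷ).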

Hinge Loss Code

def Hinge(yHat, y):
    # y is the true class (-1 or 1); yHat is the raw SVM output
    return np.maximum(0, 1 - y * yHat)

Here y_hat is the output of the SVM and y is the true class (-1 or 1). Note that the loss is nonzero for misclassified points, as well as for correctly classified points that fall within the margin. Hinge loss is a loss function used for classification problems. Check out this awesome resource on how to minimize hinge loss; hinge loss has extensive documentation because of its many variants.
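To make the margin behaviour concrete, here is a quick sketch with made-up scores, using the Hinge function above and y = 1:

print(Hinge(2.0, 1))   # 0.0 -> correct side, outside the margin: no loss
print(Hinge(0.5, 1))   # 0.5 -> correct side but inside the margin: small loss
print(Hinge(-1.5, 1))  # 2.5 -> misclassified: large loss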

Log Loss or Cross Entropy Loss

Logistic loss is also known as log loss. It is used to calculate the loss in Logistic Regression. When the number of classes is 2, the cross entropy is calculated as

Log Loss Equation — Binary Class
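In symbols, the binary form is usually written as CrossEntropy = −(y·log(p(y′)) + (1 − y)·log(1 − p(y′))), where y is the actual label (0 or 1) and p(y′) is the predicted probability of the positive class.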

y′ is the raw predicted value, which can be greater than 1 or less than 0, so we apply a sigmoid function on top of y′ to convert the raw value into a probability score. By default, the output of the logistic regression model is the probability of the sample being positive, so the probability score should ideally be close to 1 for the positive class and close to 0 for the negative class.

Loss Function — Log Loss
1. When the actual class y is 1: the second term in the log loss is 0, and we are left with the first term, −log(p(y′)).

2. When the actual class y is 0: the first term is 0, and we are left with the second term, −log(1 − p(y′)).

By plugging in the actual value of y and its estimated probability score, we find that if the predicted probability leans towards the actual class, the loss value is reduced; otherwise the loss is increased. I would encourage you to try y=1 with p(y′)=0.1 and then with p(y′)=0.9.
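Working that suggestion through with natural logs: for y=1 and p(y′)=0.1 the loss is −log(0.1) ≈ 2.30, while for p(y′)=0.9 it is −log(0.9) ≈ 0.11, so the confident correct prediction is penalized far less than the confident wrong one.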

When the number of classes is greater than 2, as in multiclass classification, we calculate a separate loss for each class label per observation and sum the results.
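In symbols, this is usually written as −Σc yc·log(pc), summed over the classes c, where yc is 1 if c is the correct class for the observation (and 0 otherwise) and pc is the predicted probability of class c.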

Cross Entropy Loss Code

def CrossEntropy(yHat, y):
    # binary cross entropy for a single sample; yHat is the predicted probability of class 1
    if y == 1:
        return -np.log(yHat)
    else:
        return -np.log(1 - yHat)

For more information on log loss, find this amazing blog on Log Loss.

A small tidbit from Wikipedia on the selection of loss functions

W. Edwards Deming and Nassim Nicholas Taleb argue that empirical reality, not nice mathematical properties, should be the sole basis for selecting loss functions, and real losses often aren’t mathematically nice and aren’t differentiable, continuous, symmetric, etc. For example, a person who arrives before a plane gate closure can still make the plane, but a person who arrives after can not, a discontinuity and asymmetry which makes arriving slightly late much more costly than arriving slightly early. In drug dosing, the cost of too little drug may be lack of efficacy, while the cost of too much may be tolerable toxicity, another example of asymmetry. Traffic, pipes, beams, ecologies, climates, etc. may tolerate increased load or stress with little noticeable change up to a point, then become backed up or break catastrophically. These situations, Deming and Taleb argue, are common in real-life problems, perhaps more common than classical smooth, continuous, symmetric, differentiable cases.

Long story short, the loss function you build should be based on the problem at hand and on how small changes in some factors can have a significant impact on the system.
