Logistic Regression
Logistic Regression is a Machine Learning algorithm used for binary classification problems. Its outcome is discrete (not continuous), and the results can be interpreted as probabilities of success.
Despite its name, logistic regression is a statistical classification model: it uses the logistic function to model a binary dependent variable. In simpler words, it deals with situations where you need to predict an outcome that can take only two possible values.
The goal of logistic regression is to find a best-fit model describing the relationship between the predictors and the response variable.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Assume we have a 'df' DataFrame with 'target' as the prediction label.
x = df.drop('target', axis=1)
y = df.target
# Splitting our data into training and test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state=123)
# Standardizing our features (zero mean, unit variance)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
# Initialize our classifier
logistic_classifier = LogisticRegression()
# Fitting the model to the training set
logistic_classifier.fit(x_train, y_train)
# Predicting the test set result
y_pred = logistic_classifier.predict(x_test)
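To make the snippet above runnable end to end, here is a self-contained version that builds a stand-in for the 'df' DataFrame from synthetic data (the column names and dataset are made up for illustration) and then evaluates the predictions with an accuracy score and a confusion matrix:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 'df' DataFrame used in the snippet above
X_arr, y_arr = make_classification(n_samples=500, n_features=5, random_state=123)
df = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(5)])
df["target"] = y_arr

x = df.drop("target", axis=1)
y = df["target"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=123)

# Standardize features, fitting the scaler on the training set only
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

clf = LogisticRegression()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Accuracy and the confusion matrix summarize test-set performance
acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
print(acc)
print(cm)
```

On clean synthetic data like this, accuracy is typically high; on real data you would also inspect precision, recall, or the ROC curve.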
The logistic function has a characteristic 'S' shape; it can seem complex at first, but it is rather simple once you understand it.
The logistic function is an 'S'-shaped curve that takes any real-valued number and maps it to a value between 0 and 1. As the input goes to positive infinity, the output approaches 1; as the input goes to negative infinity, the output approaches 0.
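This behaviour can be checked numerically with a few lines of plain Python (the function name sigmoid is just an illustrative choice):

```python
import math

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))    # midpoint of the 'S' curve: 0.5
print(sigmoid(10))   # large positive input -> close to 1
print(sigmoid(-10))  # large negative input -> close to 0
```

Note also the symmetry sigmoid(-x) = 1 - sigmoid(x), which is why the curve is centered at 0.5.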
If the output of logistic regression is higher than 0.5, the model predicts class 1; if the output is less than 0.5, the model predicts class 0. The 0.5 threshold is the default and can be adjusted when false positives and false negatives have different costs.
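The thresholding rule can be sketched explicitly on a vector of predicted probabilities (the probability values below are made up for illustration; libraries may break ties at exactly 0.5 differently):

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 (e.g., from predict_proba)
probs = np.array([0.1, 0.4, 0.5, 0.8, 0.95])

# Apply the default 0.5 decision threshold
preds = (probs >= 0.5).astype(int)
print(preds)  # [0 0 1 1 1]
```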
Though it is one of the simpler and older algorithms, Logistic Regression remains one of the most widely used because it is simple, quick to implement, and provides efficient predictions on binary classification problems.
Advantages of Logistic Regression:
- Efficiency: computationally inexpensive compared to more complex methods.
- Easily understandable output: the probabilities it provides can be interpreted as a risk and read directly, unlike the outputs of classifiers such as SVMs or Random Forests, which are difficult to interpret in probabilistic terms.
- Less vulnerable to overfitting: penalized logistic regression models can avoid overfitting through the built-in feature selection properties of the regularization technique itself.
- Works well with smaller datasets: it performs well even if you have fewer training samples available for building your model.
- Robustness:
  - No assumption about the distribution of the predictors.
  - Robust against statistically irrelevant features.
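The "built-in feature selection" point about penalized models can be illustrated with an L1-penalized logistic regression in scikit-learn: with a sufficiently strong penalty, coefficients of uninformative features are driven exactly to zero (the data and the penalty strength C=0.05 here are illustrative choices):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic problem: 2 informative features plus 18 noise features
X, y = make_classification(n_samples=400, n_features=20, n_informative=2,
                           n_redundant=0, random_state=0)

# L1 (lasso-style) penalty; a smaller C means stronger regularization
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
clf.fit(X, y)

n_zero = int(np.sum(clf.coef_ == 0))
print(n_zero, "of 20 coefficients were driven to zero")
```

Features with zeroed coefficients are effectively excluded from the model, which is the feature-selection effect the bullet above refers to.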
Disadvantages of Logistic Regression:
- Suitable for binary targets only:
  - Cannot predict continuous outcomes.
  - Cannot handle multi-class outcomes in its standard form.
- Requires a large sample size: to achieve stable results it needs roughly 50 observations per predictor variable, because maximum likelihood estimates are quite unstable when insufficient data is present.
- Controversy around interpretation when used with non-linear effects and interactions.
- Nonlinearity needs transformation, which makes it difficult to maintain its original ease-of-use appeal.
- Retains all outliers and influential values without adjustment.
- Vulnerable to 'perfect separation': when a predictor splits the classes perfectly, the maximum likelihood estimates fail to converge.
- Provides less insight into individual predictors than decision trees.
Assumptions:
- The logit should be linear in the predictor variables.
- Observations should be independent, and the predictors should not be strongly correlated with one another (no severe multicollinearity).
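A quick way to screen for the multicollinearity issue is to inspect pairwise correlations between predictors before fitting; a sketch with made-up data, where one column is deliberately constructed as a near-copy of another:

```python
import numpy as np

rng = np.random.default_rng(42)
a = rng.normal(size=200)
b = a * 0.98 + rng.normal(scale=0.05, size=200)  # nearly a copy of 'a'
c = rng.normal(size=200)                         # independent predictor

X = np.column_stack([a, b, c])
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))

# An |r| close to 1 off the diagonal flags a collinear pair
print(abs(corr[0, 1]) > 0.9)  # 'a' and 'b' are nearly collinear
```

In practice you would drop or combine one of the correlated predictors, or use a variance inflation factor (VIF) check for a more formal diagnosis.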
Appropriate Usage Scenarios:
- When your target variable has binary outputs, i.e., "yes" or "no", "true" or "false", etc., you may use the Logistic Regression classifier as your predictive analytics technique.
- Use-case scenarios include email spam/not-spam classification, churn prediction ("Will the customer buy this product?"), and health diagnoses such as diabetes prediction ("Will the patient get diabetes?").
- In credit risk modelling, the result is either the default or non-default class, making it a classic scenario for applying a logistic regression classifier.
- Logistic regression is best suited for situations where the data is cleanly linearly separable, or nearly so.
- Its main advantage lies in its simplicity: it creates straightforward, linear decision boundaries in the input space (hence the term "linear classifier").
- It can handle several categorical variables: if you want to include categorical variables (for example, gender or race), you can encode each category numerically (e.g., with one-hot encoding), and the model will handle multiple categories and still deliver robust insights.
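For the categorical-variable point, one common preparation step is one-hot encoding, e.g., with pandas.get_dummies; the toy columns below are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["male", "female", "female", "male"],
    "age": [34, 29, 41, 25],
})

# One-hot encode the categorical column so the model sees only numbers;
# drop_first avoids a redundant (perfectly collinear) dummy column
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(encoded.columns.tolist())  # ['age', 'gender_male']
```

The encoded DataFrame can then be fed to LogisticRegression exactly like the numeric features in the snippet near the top of this article.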