Random Forest Algorithm
Random forest is a supervised machine learning algorithm that often produces good results even without hyperparameter tuning. It can be used for both classification and regression tasks. It is based on the concept of ensemble learning, which combines multiple models to solve a complex problem and improve overall performance.
A random forest consists of a number of decision trees, each trained on a different subset of the given dataset. For classification, it takes the prediction from each tree and outputs the class with the majority of votes; for regression, it averages the trees’ predictions.
Random forest builds multiple decision trees and merges their outputs to get a more accurate and stable prediction than a single tree would give. This also reduces the risk of overfitting.
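The aggregation step described above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the per-tree outputs are made up for the example, and no trees are actually trained here.

```python
from collections import Counter
from statistics import mean


def majority_vote(tree_predictions):
    """Return the most common class label among the trees' predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


def average_prediction(tree_predictions):
    """Return the mean of the trees' numeric predictions."""
    return mean(tree_predictions)


# Hypothetical outputs of five trees for a single sample.
class_votes = ["play", "stay", "play", "play", "stay"]
value_votes = [2.0, 3.0, 2.5, 3.5, 4.0]

final_class = majority_vote(class_votes)        # classification: majority vote
final_value = average_prediction(value_votes)   # regression: average

print(final_class, final_value)
```

Note that scikit-learn’s classifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch shows the idea, not that library’s exact procedure.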
The diagram below illustrates how a random forest works.
Let’s quickly go over decision trees as they are the building blocks of the random forest models.
Decision Tree
A decision tree is a supervised machine learning algorithm which is used for both classification and regression problems. A decision tree is simply a series of sequential decisions made to reach a specific result.
Suppose you want to play badminton on a particular day, say Saturday. How will you decide whether to play or not? Let’s say you go out and check whether it’s hot or cold, how windy and humid it is, and whether the weather is sunny, cloudy, or rainy. You take all these factors into account to decide if you want to play or not.
A decision tree would be a great way to represent data like this because it takes into account all the possible paths that can lead to the final decision by following a tree-like structure.
In such a tree, each internal node represents an attribute or feature, and each branch from a node represents one outcome of the test on that feature. Finally, it’s at the leaves of the tree that the final decision is made. If features are continuous, internal nodes can test the value of a feature against a threshold.
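To make the badminton example concrete, here is a minimal sketch using scikit-learn’s DecisionTreeClassifier. The observations and labels are invented for illustration (windy days are labelled "no play"), and the feature names are assumptions, not data from any real source.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical observations: [temperature in degrees Celsius, wind speed in km/h]
X = [[30, 5], [32, 10], [18, 8], [20, 30], [25, 35], [15, 12], [28, 7], [22, 40]]
# 1 = play badminton, 0 = stay home (made-up labels)
y = [1, 1, 1, 0, 0, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned rules; each internal node compares one feature with a threshold.
rules = export_text(tree, feature_names=["temperature", "wind_speed"])
print(rules)

# Ask the tree about a mild day with little wind.
prediction = tree.predict([[24, 6]])[0]
print("play" if prediction == 1 else "stay home")
```

The printed rules show the threshold tests at the internal nodes and the class chosen at each leaf, which is the structure described above.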
Feature Importance
It is very easy to measure the relative importance of each feature on the prediction using the random forest algorithm. Scikit-learn provides a great tool for this, which measures a feature’s importance by looking at how much the tree nodes that use that feature reduce impurity across all trees in the forest.
It computes this score for each feature after training and scales the results automatically so that the importances sum to one. By looking at feature importance, we can decide which features to possibly drop because they don’t contribute enough to the prediction process. This matters because, as a general rule in machine learning, the more features you have, the more likely your model is to suffer from overfitting.
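As a small illustration, the sketch below trains a forest on the iris dataset bundled with scikit-learn and reads the scores from the fitted model’s feature_importances_ attribute. The dataset and the settings are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(iris.data, iris.target)

# One impurity-based score per feature, available after training.
importances = forest.feature_importances_

# List the features from most to least important.
ranked = sorted(zip(iris.feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")

# The scores are normalised, so they add up to one.
total = importances.sum()
print(f"sum of importances: {total:.3f}")
```

Features near the bottom of the ranking are the candidates to consider dropping.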
Important Hyperparameters
The hyperparameters in random forest are used either to increase the predictive power of the model or to make the model faster. Here are the hyperparameters of scikit-learn’s built-in random forest estimators.
1. Increasing the predictive power
The first is the n_estimators hyperparameter, which is the number of trees the algorithm builds before taking the majority vote or the average of the predictions. In general, a higher number of trees increases performance and makes the predictions more stable. However, it also slows down the computation.
Another important hyperparameter is max_features, which is the maximum number of features random forest considers when looking for the best split at a node.
The last important hyperparameter is min_samples_leaf. This determines the minimum number of samples required to be at a leaf node.
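Here is a minimal sketch of how these three hyperparameters are passed to scikit-learn’s RandomForestClassifier. The synthetic dataset and the particular values are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, generated only for this example.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_features="sqrt",    # features considered at each split: sqrt(n_features)
    min_samples_leaf=5,     # every leaf must hold at least 5 training samples
    random_state=0,
)
forest.fit(X_train, y_train)

# Mean accuracy on the held-out data.
accuracy = forest.score(X_test, y_test)
print(f"trees: {len(forest.estimators_)}, test accuracy: {accuracy:.3f}")
```

In practice these values are usually tuned, for example with a grid or randomized search, rather than fixed by hand.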
2. Increasing the model’s speed
The n_jobs hyperparameter tells the engine how many processors it is allowed to use. If n_jobs = 1, it can only use one processor. A value of “-1” means that it uses all available processors.
The random_state hyperparameter makes the model’s output replicable. The model will produce the same results when random_state has a fixed value and the model is given the same hyperparameters and the same training data.
Lastly, there is oob_score (out-of-bag scoring), a random forest validation method that serves as an alternative to cross-validation. Because each tree is trained on a bootstrap sample, about one third of the data is not used to train that tree and can be used to evaluate its performance.
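The three settings above can be tried together in a short sketch. The synthetic dataset, the helper name fit_forest and the chosen values are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, generated only for this example.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)


def fit_forest():
    """Train a forest with a fixed seed, all processors, and OOB scoring."""
    model = RandomForestClassifier(
        n_estimators=100,
        n_jobs=-1,         # use all available processors
        random_state=42,   # fixed seed makes the result repeatable
        oob_score=True,    # evaluate each sample with trees that never saw it
    )
    return model.fit(X, y)


first = fit_forest()
second = fit_forest()

# Same seed, same hyperparameters, same data: the two models agree.
same_predictions = (first.predict(X) == second.predict(X)).all()
oob_accuracy = first.oob_score_
print(f"identical predictions: {same_predictions}, OOB accuracy: {oob_accuracy:.3f}")

# Chance that a given sample is missing from one bootstrap sample of size n.
n = len(X)
left_out_fraction = (1 - 1 / n) ** n
print(f"expected out-of-bag share per tree: {left_out_fraction:.3f}")
```

The last value is where the "about one third" figure comes from: for large n it approaches 1/e.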
Why use Random Forest?
There are several reasons why we should use this algorithm:
- It often takes less training time than many other algorithms.
- It predicts output with high accuracy and runs efficiently even on large datasets.
- It can maintain accuracy even when a large proportion of the data is missing.
Applications of Random Forest
The random forest algorithm is used in various fields such as banking, the stock market, medicine and e-commerce.
Banking: In the banking sector, it is mostly used to determine whether customers are likely to repay their debt on time. In this domain, it is also used to detect fraudsters out to scam the bank.
Stock market: The algorithm can be used to estimate a stock’s future behavior.
Medicine: In the healthcare domain, it is used to identify the correct combination of components in medicine and to analyze a patient’s medical history to identify diseases.
E-Commerce: In e-commerce, it is used to determine whether a customer will actually like a product or not. Marketing trends can also be identified using this algorithm.
Advantages of Random Forest
- One of the biggest advantages of random forest is its versatility. It can be used for both regression and classification tasks.
- It is capable of handling large datasets with high dimensionality.
- It enhances the accuracy of the model and reduces the risk of overfitting.
Disadvantages of Random Forest
- The random forest can be used for both classification and regression tasks. However, it is generally less well suited to regression tasks.
- The main limitation of random forest is that a large number of trees can make the algorithm too slow and ineffective for real-time predictions. In general, these algorithms are fast to train, but quite slow to create predictions once they are trained. A more accurate prediction requires more trees, which results in a slower model. In most real-world applications, the random forest algorithm is fast enough, but there can certainly be situations where run-time performance is important and other approaches would be preferred.
- Random forest is a predictive modeling tool and not a descriptive tool, meaning if we’re looking for a description of the relationships in our data, other approaches would be better.
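One commonly cited reason behind the regression caveat in the list above, offered here as an illustration rather than something the list itself states, is that a forest’s prediction is an average of training targets, so it cannot go beyond the range seen during training. The sketch below uses a made-up linear relationship (y = 2x) to show this.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data: x from 0 to 99, target y = 2x.
X_train = np.arange(100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(X_train, y_train)

# A point inside the training range and one far outside it.
inside = regressor.predict([[50.0]])[0]
outside = regressor.predict([[200.0]])[0]

print(f"x=50  -> {inside:.1f} (true value 100)")
print(f"x=200 -> {outside:.1f} (true value 400)")
print(f"largest training target: {y_train.max():.1f}")
```

The prediction for x=200 stays at or below the largest training target instead of following the trend, which is why models that can extrapolate may be preferred for some regression problems.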
Summary
Random forest is a great algorithm to train early in the model development process, to see how it performs. The algorithm is also a great choice for anyone who needs to develop a model quickly. On top of that, it provides a pretty good indicator of the importance it assigns to your features.
Random forests are very hard to beat performance-wise. You can always find a model that performs better, such as a neural network, but these usually take more time to develop, even though they can handle many different feature types such as binary, categorical and numerical.
Overall, random forest is a fast, simple and flexible tool, but not without some limitations.