Tree Type Prediction with XGBoost Classifier
In this story, we are going to talk about how to create a machine learning model and which steps to follow.
If you have already read some machine learning books, you have probably noticed that there are different ways to move data through a machine learning workflow.
We will follow these steps to build our tree type prediction model:
- Exploratory data analysis
- Data Collection
- Data Preprocessing
- Data Cleaning
- Visualization
- Model Design, Training, and Offline Evaluation
Of course, the same solution cannot be applied to every problem, so the best approach is to create a general framework and adapt it to each new problem.
We will not cover every step in detail in this story; we will focus on exploratory data analysis, data cleaning, preprocessing, and model design, training, and evaluation.
1. Exploratory data analysis
We used pandas' describe method to make sense of our columns and examined each feature one by one.
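A minimal sketch of this step, assuming the data has been loaded into a pandas DataFrame called df (the file name below is illustrative):

```python
import pandas as pd

# Load the tree type (forest cover type) dataset; the file name is an assumption.
df = pd.read_csv("covtype.csv")

# Summary statistics (count, mean, std, min, quartiles, max) for every numeric column.
print(df.describe().T)
```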
With a short loop (sketched below), we found out how many unique values each feature has, in order to separate categorical and continuous columns.
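A sketch of such a loop:

```python
# Count the unique values of each feature to separate
# categorical-looking columns from continuous ones.
for col in df.columns:
    print(f"{col}: {df[col].nunique()} unique values")
```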
2. Data Cleaning
2.1 Missing Value Detection
We calculated the missing value rate with a short snippet and saw that there are no missing values in the tree type dataset.
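One way to compute the missing value rate (a sketch, assuming df is the full DataFrame):

```python
# Percentage of missing values per column; all zeros for this dataset.
missing_rate = df.isnull().sum() / len(df) * 100
print(missing_rate.sort_values(ascending=False))
```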
2.2 Outlier Detection
With a single line (sketched below), we filtered the numeric columns. We treated columns with more than 7 unique values as numeric.
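A sketch of that filter (the 7-unique-value threshold comes from the text above):

```python
# Treat columns with more than 7 unique values as numeric/continuous.
numeric = [col for col in df.columns if df[col].nunique() > 7]
print(numeric)
```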
With the “outlier_inspect” function we defined earlier, you can find the optimal IQR multiplier for filtering outliers and see how the data is distributed. The rightmost graph identifies the optimal Z-score; here, “Z-score” refers to the multiplier applied to the IQR to define the upper and lower limits used to remove outliers. For example, the IQR multiplier for the Elevation column was found to be 3. You can find the “outlier_inspect” code in my GitHub repository; you don't need to understand its internals, it is enough to know what it does.
2.3 Dealing with Outliers
We took the Z-score as 3 because the average Z-score across the columns was close to 3. So, instead of finding a separate multiplier for each column, we used the average Z-score for all of them.
The detect_outlier function detects outliers based on 3 times the IQR and returns the lower limit, the upper limit, and the number of outliers, respectively. The upper limit is 3 * IQR above the third quartile (and the lower limit 3 * IQR below the first quartile); any value greater than the upper limit or less than the lower limit is an outlier.
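The author's exact implementation is in the linked repository; a minimal sketch of such a function, using the 3 * IQR limits described above, could look like this:

```python
def detect_outlier(df, col, multiplier=3):
    # Interquartile range of the column.
    q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
    iqr = q3 - q1

    # Lower and upper limits: multiplier * IQR away from the quartiles.
    lower_limit = q1 - multiplier * iqr
    upper_limit = q3 + multiplier * iqr

    # Count the values falling outside the limits.
    outlier_count = 0
    for value in df[col]:
        if value > upper_limit or value < lower_limit:
            outlier_count += 1

    return lower_limit, upper_limit, outlier_count
```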
With the for loop inside this function we count the outliers: if a value in the selected column is greater than the upper limit or less than the lower limit, we add 1 to outlier_count, and in this way we obtain the number of outliers.
We then reported the outlier counts with another loop (sketched below): the numeric list holds the numeric column names, and we pass each one into the detect_outlier function defined above. As you know, this function returns the lower limit, the upper limit, and the outlier count for the passed column; since we only need the count, we select index 2.
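A sketch of that reporting loop:

```python
# Print the outlier count for every numeric column.
for col in numeric:
    print(f"{col}: {detect_outlier(df, col)[2]} outliers")
```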
We removed the outliers by filtering with pandas: we kept only the values between the lower and upper limits. As you remember, index 0 of the returned tuple is the lower limit and index 1 is the upper limit.
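A sketch of the filtering step:

```python
# Keep only the rows inside the lower/upper limits for each numeric column.
for col in numeric:
    lower, upper, _ = detect_outlier(df, col)
    df = df[(df[col] >= lower) & (df[col] <= upper)]
```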
When we examine the data, we can see that the target classes are very unevenly distributed, so the dataset is imbalanced. Because of this, we applied undersampling to our dataset.
To balance the dataset, we took 1500 samples from each class. The rows that were not sampled became NaN values, so we used the dropna method to remove them.
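One common way to achieve the same undersampling with pandas is sketched below. It samples each class directly and does not produce the intermediate NaN rows mentioned above; the target column name Cover_Type is an assumption.

```python
# Randomly keep 1500 samples from each target class (undersampling).
balanced_df = (df.groupby("Cover_Type", group_keys=False)
                 .apply(lambda g: g.sample(n=1500, random_state=42)))

# Check the new class distribution.
print(balanced_df["Cover_Type"].value_counts())
```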
As you can see, we obtained balanced data: each class now has 1500 samples. Our dataset can now give more reliable results.
3. Data Preprocessing
We have cleaned our data; now we can start the preprocessing part before training our model.
Now, to build our training and test sets, we create 4 sets: X_train (the training part of the matrix of features), X_test (the test part of the matrix of features), y_train (the training part of the dependent variable, aligned with the X_train indices), and y_test (the test part of the dependent variable, aligned with the X_test indices). We obtain them from train_test_split, which takes the arrays (X and y) and a test_size parameter: a value of 0.5 would split the dataset in half, and since allocating 20% of the dataset to the test set is a common choice, it is usually set to 0.2 (0.25 would mean 25%).
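A sketch of the split (the target column name Cover_Type is an assumption):

```python
from sklearn.model_selection import train_test_split

# Features and target.
X = balanced_df.drop("Cover_Type", axis=1)
y = balanced_df["Cover_Type"]

# 80% training set, 20% test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```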
First, we used the XGBoost classifier to train our model and fitted it on the training set. After training, we predicted the test data with the trained model and assigned the predictions to the y_pred variable.
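A minimal sketch of the training step:

```python
from xgboost import XGBClassifier

# Note: recent XGBoost versions expect class labels starting at 0;
# if the target uses labels 1-7, encode it first (e.g. y - 1).
xgb_model = XGBClassifier()
xgb_model.fit(X_train, y_train)

# Predict the test set.
y_pred = xgb_model.predict(X_test)
```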
4. Model Evaluation
Before diving into the evaluation part, let's talk about what evaluation metrics are and how they work. If you want to learn more about evaluation metrics, you can visit this page.
Precision:
Precision is defined as the number of true positives divided by the number of true positives plus the number of false positives. It tells us what proportion of the predictions labeled positive are actually positive.
Recall:
Recall is defined as the number of true positives divided by the number of true positives plus the number of false negatives. It tells us what proportion of the actual positives in the dataset are correctly predicted as positive.
High recall, low precision: Indicates that most of the positive examples are correctly recognized (low FN) but there are a lot of false positives.
Low recall, high precision: Indicates that we miss a lot of positive examples (high FN) but those we predict as positive are indeed positive (low FP).
F1 Score:
The F1 score combines precision and recall for a specific positive class. It conveys the balance between precision and recall and is especially useful when there is an uneven class distribution. The F1 score reaches its best value at 1 and its worst at 0.
Accuracy:
We use accuracy in classification problems, and it is the most common evaluation metric. Accuracy is defined as the ratio of the number of correct predictions made by the model to the total number of predictions made.
Confusion Matrix:
A confusion matrix, also known as an error matrix, is used to assess the correctness and accuracy of a model, and it works well even for imbalanced datasets. It is a table with two dimensions (“Actual” and “Predicted”) and the set of classes along both dimensions. Most performance measures are computed from the confusion matrix.
Confusion matrix for a binary classification problem:
- True Positives: cases in which we predicted YES and the actual output was also YES.
- True Negatives: cases in which we predicted NO and the actual output was NO.
- False Positives: cases in which we predicted YES and the actual output was NO.
- False Negatives: cases in which we predicted NO and the actual output was YES.
Now that we have an idea about the evaluation metrics, we can examine our evaluation results.
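A sketch of how these metrics can be computed with scikit-learn:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class
print(confusion_matrix(y_test, y_pred))
```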
We obtained 83% accuracy for this model, but looking only at the accuracy ratio can be misleading. If we examine the other metrics, we can see that the model gives highly accurate results for some classes and much less accurate results for others. We had few examples to train the model on, so it did not learn some classes well; if our data were more evenly distributed we would not have to work with so few samples.
Yellowbrick is a suite of visualization and diagnostic tools that enable quicker model selection. It is a Python package that combines scikit-learn and matplotlib. Some of its more popular tools cover model selection, feature visualization, and classification and regression visualization.
We plotted the confusion matrix with the Yellowbrick library. In the plot, the squares on the diagonal show the correctly predicted results (the red squares). As you can see, the model struggles with the first and second classes: it predicts the first class as the second class and the second class as the first. So our model is poor at predicting classes 1 and 2, and we need to tune it.
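A sketch of how such a plot can be produced with Yellowbrick's classifier visualizers:

```python
from yellowbrick.classifier import ConfusionMatrix

# Wrap the trained model in a Yellowbrick visualizer.
cm = ConfusionMatrix(xgb_model)
cm.fit(X_train, y_train)   # fit (or re-use) the model
cm.score(X_test, y_test)   # populate the matrix with test predictions
cm.show()                  # render the plot
```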
5. Model Tuning
Tuning is the process of maximizing a model’s performance without overfitting or creating too high a variance. In machine learning, this is accomplished by selecting appropriate “hyperparameters.” Hyperparameters can be thought of as the “dials” or “knobs” of a machine learning model.
We can tune our hyperparameters to increase the accuracy score by using grid search. We define an “xgb_params” dictionary (see the sketch below) that contains candidate values for each hyperparameter; we can widen these ranges if we need to.
We pass this dictionary into GridSearchCV to find which parameter values give a higher accuracy score. The first argument is our model, and the cv parameter determines the cross-validation splitting strategy.
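A sketch of the grid search (the exact value ranges in xgb_params are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Candidate values for each hyperparameter (ranges are illustrative).
xgb_params = {
    "n_estimators": [100, 500, 1000],
    "max_depth": [3, 6, 10],
    "subsample": [0.6, 0.8, 1.0],
    "learning_rate": [0.01, 0.1, 0.3],
}

# 5-fold cross-validated grid search over the parameter grid.
grid = GridSearchCV(XGBClassifier(), xgb_params, cv=5, n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)

# Best hyperparameter combination found.
print(grid.best_params_)
```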
n_estimators [default=100]
The number of boosting stages to perform
max_depth [default=6]
- The maximum depth of a tree, same as GBM.
- Used to control over-fitting as higher depth will allow the model to learn relations very specific to a particular sample.
subsample [default=1]
- Same as the subsample parameter of GBM. Denotes the fraction of observations to be randomly sampled for each tree.
- Lower values make the algorithm more conservative and prevent overfitting, but values that are too small might lead to under-fitting.
learning_rate [default=0.3]
The learning_rate parameter controls the weighting of new trees added to the model. We can use the grid search capability in scikit-learn to evaluate the effect of different learning rate values on the logarithmic loss of a gradient boosting model.
After running GridSearchCV, we can see the most suitable hyperparameters for our model, among those defined in the dictionary, by using the best_params_ attribute.
We passed the new parameters into the XGBoost model in order to retrain it. Afterwards, we fitted it on the training set and predicted the test values with the predict method.
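A sketch of the final training step with the tuned hyperparameters:

```python
# Retrain the model with the best hyperparameters found by the grid search.
xgb_tuned = XGBClassifier(**grid.best_params_)
xgb_tuned.fit(X_train, y_train)

# Predict the test set with the tuned model.
y_pred_tuned = xgb_tuned.predict(X_test)
```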
If we look at the new confusion matrix and classification report, we can see that our predictions improved compared to the previous model.
Conclusion:
With this project, we have built a model that can predict the type of tree with 84% accuracy, given a set of features. This information can be of enormous value to both companies and individuals trying to estimate the type of a tree and, more importantly, to understand the key factors that determine its class.
I tried to briefly explain the process; if you want to see more detail, you can check the notebook from this link.