When building a predictive model, a data scientist wants to identify the features in a dataset with the strongest predictive power. We examine model outputs and apply techniques to remove insignificant variables and produce a succinct model.
This being said, we all know this is not as simple as it sounds. So let’s learn a new technique to add to our toolbelt: examining Shapley values.
In the Introduction to iml: Interpretable Machine Learning in R, two questions are asked:
- How does a feature influence the prediction?
- How did the feature values of a single data point affect its prediction?
The varImpPlot function from the randomForest package gives us an easy way to visualize the feature importance from a Random Forest output; however, feature importance alone is hard to interpret with ‘black-box’ models such as a Random Forest.
We see the model's predictions, but not how it arrives at them.
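For reference, here is a minimal sketch of a variable importance plot. It uses the built-in mtcars data as a stand-in, because the building data set is not reproduced in this article; the formula, ntree value and seed are assumptions chosen for illustration, not the settings of the model discussed later.

```r
library(randomForest)

set.seed(42)  # make the forest reproducible

# Stand-in model: predict fuel economy (mpg) from the other mtcars columns.
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 200, importance = TRUE)

# Numeric importance scores, one row per feature.
imp <- importance(rf)
print(imp)

# Dot chart of the features ranked by importance.
varImpPlot(rf)
```

The ranking tells us which features matter across the whole data set, but nothing about why one particular row received the prediction it did.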
What we can do is investigate some of the interesting data points using a game theory method called the Shapley value, abstractly explained here:
“Assume that for one data point, the feature values play a game together, in which they get the prediction as a payout. The Shapley value tells us how to fairly distribute the payout among the feature values” – Christoph Molnar
In this article, I want to familiarize you with the Shapley value from the iml package so you can use it in your own Feature Importance stage.
What is a Shapley Value?
The idea behind Lloyd Shapley’s metric is:
“members should receive payment or shares proportional to their marginal contributions” – Coursera
As someone without a statistics background, I find that formulas, symbols and theories make me sleepy. Professor Giacomo Bonanno’s article COOPERATIVE GAMES: the SHAPLEY VALUE has a really good example. I changed it a little, and I hope it is helpful:
Allan is a programmer. He can design the program and sell it for $100.
Bob is a marketing manager. He can sell the product for $125.
Cindy is a sales manager. She can sell the product for $50.
If Allan works with Bob, they can make a sale of $270.
If Allan works with Cindy, they can make a sale of $375.
If Bob and Cindy work together, they can make a sale of $350.
If all three of them work together, they can make a sale of $500.
They decide to work together. However, how much money should each person get?
Allan thinks they should be paid equally, around $166.67 each.
Bob feels marketing is the main way to improve profit, so he thinks he should get more.
Cindy feels it’s unfair too, because she is the person who sells the products and talks with customers.
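Mike, a statistics employee working in another department, overhears and says “Hey, let me help you”. He lists the six possible orders in which the three could join the team, records how much each person adds to the sale at the moment they join, and sums those marginal contributions per person: 970 for Allan, 970 for Bob and 1060 for Cindy.
He finds Allan and Bob should each get 970 / (970+970+1060) = 32.33% of $500, or about $161.67.
Cindy should get 1060 / (970+970+1060) = 35.33% of $500, or about $176.67, for her contribution.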
As we see in this example, the idea behind Shapley value is to reveal how to distribute the payout among important features. Cindy had the lowest individual value, but she was the most important cause of profit when added to the other workers.
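Mike's calculation can be reproduced in a few lines of base R. This is a sketch under the assumption that an empty team earns $0; the names payoff, coalition_value and orders are made up for this illustration. It walks through every order in which the three could join the team and adds up what each person contributes at the moment they join.

```r
players <- c("Allan", "Bob", "Cindy")

# Sales for every possible team, taken from the example above.
payoff <- c(
  "Allan" = 100, "Bob" = 125, "Cindy" = 50,
  "Allan+Bob" = 270, "Allan+Cindy" = 375, "Bob+Cindy" = 350,
  "Allan+Bob+Cindy" = 500
)

# Value of a team; members are sorted so the lookup does not depend on order.
# Assumption: an empty team earns nothing.
coalition_value <- function(members) {
  if (length(members) == 0) return(0)
  payoff[[paste(sort(members), collapse = "+")]]
}

# All 3! = 6 orders in which the three people could join the team.
orders <- list(
  c("Allan", "Bob", "Cindy"), c("Allan", "Cindy", "Bob"),
  c("Bob", "Allan", "Cindy"), c("Bob", "Cindy", "Allan"),
  c("Cindy", "Allan", "Bob"), c("Cindy", "Bob", "Allan")
)

# Sum each person's marginal contribution over every joining order.
marginal_sum <- setNames(numeric(length(players)), players)
for (ord in orders) {
  for (i in seq_along(ord)) {
    before <- ord[seq_len(i - 1)]   # team before this person joins
    after  <- ord[seq_len(i)]       # team including this person
    marginal_sum[ord[i]] <- marginal_sum[ord[i]] +
      coalition_value(after) - coalition_value(before)
  }
}

marginal_sum                          # Allan 970, Bob 970, Cindy 1060
shapley_value <- marginal_sum / length(orders)
round(shapley_value, 2)               # Allan 161.67, Bob 161.67, Cindy 176.67
```

Dividing the sums by the number of orders gives the dollar amounts directly, and the three shares add up to the full $500.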
varImpPlot Function and the Shapley Value
Below is a visualization of a Random Forest model that predicts the Energy Star Score of large buildings in New York City:
According to the visual, SourceEUIkBtu_ft2 is the most important factor for the Energy Star Score.
By using the Shapley value function, we can check the feature importance on an individual data point from this model and see what feature is responsible for the largest contribution to the predicted value.
In this case, we can investigate the feature importance for the model's worst prediction, that is, the data point whose residual (the difference between the predicted and actual value) is largest. We find this value in row 1582.
Predicted Energy Star Score: 74.85
Actual Value: 9
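Locating such a row could look like the following sketch. The data frame is a hypothetical toy: its second row reuses the predicted and actual values quoted above, and the other rows are invented so the example runs on its own. With the real data, the predictions would come from the fitted forest and the actual values from the validation set.

```r
# Hypothetical toy data: actual scores and model predictions for five buildings.
scores <- data.frame(
  actual    = c(80, 9, 55, 67, 30),
  predicted = c(76.2, 74.85, 50.1, 70.4, 41.3)
)

# Residual = predicted minus actual; the sign shows over- or under-prediction.
scores$residual <- scores$predicted - scores$actual

# Row index of the prediction that is furthest from the truth.
worst_row <- which.max(abs(scores$residual))
scores[worst_row, ]
```

Taking the absolute value before which.max means both large over-predictions and large under-predictions are candidates.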
We can look at this data point more closely and see why our model predicted a high Energy Star Score for it. In addition, we get more detail about the importance of each feature and the value it takes for this data point.
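library(iml)
# Keep only the predictor columns (drop the target).
test <- rf_validation[which(names(rf_validation) != "ENERGYSTARScore")]
# Wrap the fitted Random Forest so iml can query it.
prediction <- Predictor$new(output_forest, data = test, y = rf_validation$ENERGYSTARScore)
# Shapley values for the worst-predicted row, then plot them.
shapley <- Shapley$new(prediction, x.interest = test[1582, ])
plot(shapley)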
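The plot is drawn from a results table stored on the Shapley object, and that table can also be inspected directly. The sketch below assumes the iml and randomForest packages are installed and uses the built-in mtcars data as a stand-in for the building data; the model, the chosen row and the seed are illustrative only.

```r
library(iml)
library(randomForest)

set.seed(42)  # iml estimates Shapley values by sampling

# Stand-in model and data (not the Energy Star model from this article).
rf <- randomForest(mpg ~ ., data = mtcars, ntree = 200)
X <- mtcars[which(names(mtcars) != "mpg")]

predictor <- Predictor$new(rf, data = X, y = mtcars$mpg)
shapley <- Shapley$new(predictor, x.interest = X[1, ])

# One row per feature: phi is the estimated contribution to this prediction.
res <- shapley$results
res[order(-abs(res$phi)), c("feature", "phi", "feature.value")]

# Prediction for this row and the average prediction it is compared against.
shapley$y.hat.interest
shapley$y.hat.average
```

Sorting by the absolute value of phi puts the features that pushed this prediction the furthest, in either direction, at the top.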
Here’s what we know about these variables:
A lower SourceEUIkBtu_ft2 score has a positive effect on Energy Star Score.
- In this individual data point, SourceEUIkBtu_ft2 = 105.6 is lower than the first quartile of the data distribution (SourceEUIkBtu_ft2 = 110), so the model naturally predicts a high Energy Star Score.
We also know PrimaryPropertyTypeMultifamily_Housing = 1 is a top 5 predictor and has a negative effect on the predicted value.
- This is reasonable because most multifamily buildings have lower energy efficiency in our raw data.
Investigating extreme data points like this can tell a story about the types of instances our model gets wrong and whether adjustments are needed during the model building stage.
In this case, the model had an issue handling this multifamily building’s data. If this is a continuing trend, we could look into grouping features to improve predictions for multifamily buildings.
Here’s another example that explains why some features have no influence on the prediction:
We see that PrimaryPropertyTypeOther = 0 makes no marginal contribution to the prediction. This makes sense because, further down, we find PrimaryPropertyTypeHotel = 1. The property-type indicators are mutually exclusive, so once the building is known to be a hotel, PrimaryPropertyTypeOther = 0 adds no further information about the Energy Star Score for this individual instance.
Above are simple examples of using and analyzing Shapley values in R.
To see how features can be fairly compared, read Gabriel Tseng’s Interpreting complex models with SHAP values article on Medium.
Try looking at Shapley values on your Random Forest outputs and see what you find!
Thank you for reading.