IBM SPSS Modeler V15.0 enables you to build predictive models to solve business issues quickly and intuitively, without the need for programming. In this demonstration we are going to show how you can use the “Auto Classifier” node. The Auto Classifier node can be used for nominal or binary targets. It tests and compares various models in a single run.
You can select which algorithms (decision trees, neural networks, KNN, …) to include and even tweak some of the properties of each one, so you can run several variations of a single algorithm. The node makes it really easy to evaluate all the algorithms at once, and it saves the best models for scoring or further analysis. In the end you can choose which model to use for scoring, or use them all in an ensemble!
First, a brief description of the data. The data come from the 1994 US Census database and are available from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/Adult. The goal is to determine whether a person makes over 50K a year. The data set has 14 variables, both categorical and numeric. The first step is to import the data. The data are in CSV format, so we can use the “Var. File” node to import them. All you have to do is define the source path and we are ready to import the data.
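For readers who want to look at the same file outside Modeler, here is a minimal Python sketch of the import step. It is not what the “Var. File” node does internally. The column names are our own labels based on the UCI description, the two records are invented for illustration, and the assumptions that the file has no header row and codes missing values as "?" should be checked against the file you download.

```python
import csv
import io

# Assumed column labels (the raw file is assumed to have no header row).
COLUMNS = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income",
]
NUMERIC = {"age", "fnlwgt", "education_num", "capital_gain",
           "capital_loss", "hours_per_week"}


def read_adult(handle):
    """Parse comma-separated records into a list of dicts."""
    rows = []
    for raw in csv.reader(handle, skipinitialspace=True):
        if len(raw) != len(COLUMNS):
            continue  # skip blank or malformed lines
        row = {}
        for name, value in zip(COLUMNS, raw):
            value = value.strip()
            if value == "?":
                row[name] = None          # assumed missing-value code
            elif name in NUMERIC:
                row[name] = int(value)
            else:
                row[name] = value
        rows.append(row)
    return rows


# Two invented records in the assumed layout, for illustration only.
SAMPLE = io.StringIO(
    "41, Private, 150000, Bachelors, 13, Married-civ-spouse, Sales, "
    "Husband, White, Male, 0, 0, 40, United-States, >50K\n"
    "29, ?, 98000, HS-grad, 9, Never-married, ?, "
    "Own-child, Black, Female, 0, 0, 35, Mexico, <=50K\n"
)
records = read_adult(SAMPLE)
print(records[0]["age"], records[1]["workclass"])
```

To read a real file, pass an open file handle (for example `open(path, newline="")`) instead of the in-memory sample.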
Then we can use the “Data Audit” node to inspect the data. This is one of the most useful nodes in SPSS Modeler. It displays a graph and summary statistics for every variable and flags missing values and outliers in the data. I am going to write more about this in another blog post.
After inspecting the data we can see that we do not have any serious problems with missing values or outliers. We will still do a couple of transformations to improve the performance and the interpretability of the model. First we will reclassify the countries variable. About 90% of its records are US, and the rest are spread over various other countries, each with a frequency of 1% or less. So we will use a “Reclassify” node to reduce the variable to two categories: US (90%) and Non-US (10%).
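The reclassification itself is a simple mapping. As a rough sketch in Python (outside Modeler), assuming the US records carry the value "United-States" and using a made-up list of ten country values:

```python
from collections import Counter


def reclassify_country(value):
    """Collapse the country field into two categories.

    Assumption: US records carry the value "United-States".
    Anything else, including a missing value, becomes "Non-US".
    """
    return "US" if value == "United-States" else "Non-US"


# Made-up values for illustration: nine US records and one other.
countries = ["United-States"] * 9 + ["Mexico"]
recoded = [reclassify_country(c) for c in countries]

# Share of each new category after recoding.
shares = {label: count / len(recoded) for label, count in Counter(recoded).items()}
print(shares)
```

Whether missing countries should fall into Non-US or be kept as their own category is a design choice; the sketch simply takes the first option.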
Another transformation that can improve the model is binning. We can use the “Binning” node, which has a very good feature called optimal binning. This method tries to find the optimal bins according to a supervisor field, which is usually the target, so that the new binned variable helps to predict the target better.
And here are the results after binning the age variable. The “Binning” node created 8 bins that help to categorise the age variable better with respect to the target variable.
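To give a feel for what binning “with respect to the target” means, here is a small Python sketch that finds the single cut point with the highest information gain. This is only an illustration of the supervised idea, not the algorithm the “Binning” node uses, and the ages and labels are invented.

```python
import math


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / n
        result -= p * math.log2(p)
    return result


def best_cut(values, labels):
    """Return (cut, gain) for the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    base = entropy([lab for _, lab in pairs])
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold fits between identical values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best


# Invented data: 1 means the person makes over 50K.
ages = [19, 23, 27, 31, 38, 44, 51, 58]
over_50k = [0, 0, 0, 0, 1, 1, 1, 1]
cut, gain = best_cut(ages, over_50k)
print(cut, gain)
```

Applying the same search again inside each resulting interval would produce several bins; a real method also needs a stopping rule so that it does not keep splitting forever.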
The next step is to partition our data so that we can test the performance of our models on a fresh set of data. For this we can use the “Partition” node. We divided the data into two random samples, with the training sample containing 70% of the records and the testing sample containing the remaining 30%.
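The same 70/30 idea can be sketched in a few lines of Python. This is an illustration only; the function name, the seed and the exact-size split are our own choices and not necessarily how the “Partition” node assigns records.

```python
import random


def partition(records, train_fraction=0.7, seed=42):
    """Shuffle the records and split them into training and testing samples."""
    shuffled = list(records)               # copy, so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in records: the numbers 0..99 instead of real census rows.
train, test = partition(range(100))
print(len(train), len(test))
```

Fixing the seed matters when you compare models: every model should be trained and tested on exactly the same two samples.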
Then we need to instantiate our data, decide which variables will be included, and define which variable will be our target, if we have not already done so in the import node. For this we use the “Type” node. In this case I do not want the model to include the old age variable, but rather the new one with the bins, and I also do not want it to include the old country variable. So I set the role of the old variables to None, set the role of the new variables that we created to Input, and define the class variable as our target.
Then it is time for the modelling part. We drag and drop the “Auto Classifier” node onto the canvas and edit it to customise it. The first tab is the “Fields” tab, where you can set the target variable, the input variables and the partition field, or just use the default, which reads the settings from the Type and Partition nodes that we set up earlier.
Let’s move to the “Model” tab. Here we can adjust settings like:
- how we want Modeler to rank the models that will be auto generated,
- which partition to use to evaluate the models,
- how many models to keep,
- whether Modeler should calculate predictor importance,
- the criteria for the lift chart and
- the costs for wrong predictions and the revenues for correct predictions, so that Modeler can estimate the profit we will get by applying each model.
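To make the last setting more concrete, here is a small Python sketch of a profit estimate as described above: a revenue for every correct prediction minus a cost for every wrong one. The revenue and cost figures and the predictions are invented, and the exact formula Modeler applies may differ from this simplification.

```python
def estimated_profit(actual, predicted, revenue_per_correct=10.0, cost_per_wrong=5.0):
    """Revenue for each correct prediction minus cost for each wrong one.

    The default revenue and cost are illustrative values, not Modeler defaults.
    """
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    wrong = len(actual) - correct
    return correct * revenue_per_correct - wrong * cost_per_wrong


# Invented outcomes and predictions from two hypothetical models.
actual = [1, 0, 1, 1, 0]
model_a = [1, 0, 0, 1, 0]  # 4 correct, 1 wrong
model_b = [1, 1, 0, 1, 1]  # 2 correct, 3 wrong

profit_a = estimated_profit(actual, model_a)
profit_b = estimated_profit(actual, model_b)
print(profit_a, profit_b)
```

Ranking by profit instead of accuracy becomes interesting when the cost of a wrong prediction is much larger than the revenue of a correct one.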
In the “Discard” tab we set criteria so that Modeler keeps only the models that meet them and discards the rest.
The final tab is the “Settings” tab, where you define properties for the ensemble, in case you decide to combine all the generated models in an ensemble.
Once the models have been generated, we can compare them on statistics such as:
- Lift – if we score our database and select the top 30% of records by score, the lift shows how many times better the results are than if we had selected records at random
- Overall Accuracy – the percentage of records for which the model predicts the correct value
- Area Under Curve – the area under the ROC curve; the higher, the better
But these statistics will not always agree. In our example we can see that, according to overall accuracy, the C5 decision tree is the best, but it is not the best according to Area Under Curve or Lift. It is up to us to decide which model to use. Of course, with SPSS Modeler you can create evaluation charts and other statistics to help you, but we are going to present that in another blog post.
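For readers who like to see the arithmetic, the three statistics can be sketched in plain Python. These are textbook definitions written for illustration, with invented scores and outcomes; Modeler's own calculations may differ in detail, for example in how ties are handled.

```python
def overall_accuracy(actual, predicted):
    """Fraction of records where the prediction equals the actual value."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def lift_at(actual, scores, top_fraction=0.3):
    """Hit rate among the top-scored records divided by the overall hit rate."""
    n_top = max(1, round(len(actual) * top_fraction))
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(a for _, a in ranked[:n_top]) / n_top
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate


def auc(actual, scores):
    """Chance that a random positive record is scored above a random negative one."""
    pos = [s for s, a in zip(scores, actual) if a == 1]
    neg = [s for s, a in zip(scores, actual) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented data: 1 means the person makes over 50K.
actual = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
predicted = [1 if s >= 0.5 else 0 for s in scores]

acc = overall_accuracy(actual, predicted)
lift = lift_at(actual, scores)
area = auc(actual, scores)
print(acc, lift, area)
```

Accuracy depends on the 0.5 threshold used to turn scores into predictions, while lift and AUC only look at how the records are ranked. That is one reason the three statistics can point to different models.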
Next is the Graph tab. The graph on the left is based on the ensemble of the models and shows the performance of the ensemble. On the right we can see the predictor importance, really useful information for analysing our results.
Then in the summary tab we can see information about each model, such as the fields used or the settings that were defined for it.
Finally in the settings tab we can define how the ensemble will use all the models to make a prediction. Options are:
- Confidence – Weighted Voting
- Raw Propensity – Weighted Voting
- Highest Confidence Wins
- Average Raw Propensity
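To show how these options can lead to different answers, here is a rough Python sketch of three of them. The definitions are our own simplified reading of the option names, not Modeler's documented formulas: each model returns a predicted class and a confidence, and we assume the raw propensity equals the confidence when the model predicts the positive class and one minus the confidence otherwise. The model outputs are invented.

```python
from collections import defaultdict


def confidence_weighted_vote(outputs):
    """Each model votes for its predicted class with a weight equal to its confidence."""
    totals = defaultdict(float)
    for label, confidence in outputs:
        totals[label] += confidence
    return max(totals, key=totals.get)


def highest_confidence_wins(outputs):
    """Take the prediction of the single most confident model."""
    label, _ = max(outputs, key=lambda output: output[1])
    return label


def average_raw_propensity(outputs, positive=">50K", negative="<=50K"):
    """Average the assumed propensities for the positive class, then cut at 0.5."""
    propensities = [c if label == positive else 1.0 - c for label, c in outputs]
    mean = sum(propensities) / len(propensities)
    return positive if mean > 0.5 else negative


# Invented outputs of three models for one record: (predicted class, confidence).
outputs = [(">50K", 0.95), ("<=50K", 0.60), ("<=50K", 0.55)]

by_vote = confidence_weighted_vote(outputs)
by_best = highest_confidence_wins(outputs)
by_mean = average_raw_propensity(outputs)
print(by_vote, by_best, by_mean)
```

In this example two lukewarm models outvote one very confident model under weighted voting, while the other two methods side with the confident one.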
That’s it! We have created and compared various models from various algorithms, and we are ready to make predictions!