Exercise #4: Using J4.8 to model sample data
You will need the Labor data set for this assignment
(you may find it in the data folder of the WEKA installation).
The task of this assignment is to model the Labor data by using the J4.8 decision tree algorithm (a Java implementation of the famous C4.5 algorithm developed by J. Ross Quinlan in 1993).
In this assignment you will learn how pruning decision trees affects their accuracy and size.
Step #1:
Open the labor.arff file in WEKA (no preprocessing is needed) and proceed to the "Classify" tab.
Select the J48 classifier (Choose → trees → J48).
Leave the "Test options" on the default value (10-fold cross-validation).
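To make the default test option more concrete, here is a minimal sketch of how 10-fold cross-validation divides a data set into test folds. It assumes the Labor data set has 57 instances and deliberately leaves out the shuffling and stratification that WEKA performs; the class and method names are made up for illustration.

```java
// Sketch of k-fold cross-validation fold boundaries.
// Simplified: no shuffling and no stratification, only the fold sizes.
class FoldSketch {

    // Returns the size of each of the k test folds for n instances.
    // The first (n % k) folds receive one extra instance.
    static int[] foldSizes(int n, int k) {
        int[] sizes = new int[k];
        for (int fold = 0; fold < k; fold++) {
            sizes[fold] = n / k + (fold < n % k ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        int n = 57; // assumed number of instances in labor.arff
        int k = 10; // number of folds (the default test option)
        int[] sizes = foldSizes(n, k);
        int start = 0;
        for (int fold = 0; fold < k; fold++) {
            int end = start + sizes[fold]; // exclusive upper bound
            System.out.println("fold " + (fold + 1)
                    + ": test on instances " + start + ".." + (end - 1)
                    + ", train on the remaining " + (n - sizes[fold]));
            start = end;
        }
    }
}
```

Each instance is used for testing exactly once, and the reported accuracy is computed over all test predictions together.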
Step #2:
Change J48's parameters to "turn off" pruning:
- collapseTree = False,
- minNumObj = 1,
- unpruned = True;
Run the algorithm.
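For those who prefer the command line, the three parameters above correspond, as far as I know, to the J48 flags -U (unpruned), -O (do not collapse the tree) and -M (minNumObj). The sketch below only builds the flag string; the helper class is hypothetical, and all other J48 options are assumed to stay at their defaults.

```java
// Sketch: translate the three GUI parameters of this step into J48 flags.
// Only these three parameters are handled; everything else stays at its default.
class J48OptionsSketch {

    static String options(boolean unpruned, boolean collapseTree, int minNumObj) {
        StringBuilder sb = new StringBuilder();
        if (unpruned) {
            sb.append("-U ");      // switch off post-pruning
        }
        if (!collapseTree) {
            sb.append("-O ");      // do not collapse the tree
        }
        sb.append("-M ").append(minNumObj); // minimum number of instances per leaf
        return sb.toString();
    }

    public static void main(String[] args) {
        // Settings of this step: unpruned = True, collapseTree = False, minNumObj = 1
        String flags = options(true, false, 1);
        // The jar and data paths below are assumptions about the local installation.
        System.out.println("java -cp weka.jar weka.classifiers.trees.J48 -t labor.arff " + flags);
    }
}
```

When only a training file is given with -t, WEKA should evaluate with 10-fold cross-validation, which matches the default test option in the GUI.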
Note the classification accuracy ("Correctly Classified Instances") and visualize the decision tree.
Take a screen-shot of this (visualized) tree and save the file "as a picture" (jpg, png, gif, ...).
Note the size of this tree (number of nodes and leaves).
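The two numbers to note can be read as follows: the size of the tree counts every node (inner nodes and leaves together), while the number of leaves counts only the end nodes. The sketch below counts both on a made-up toy tree; it is not the tree J48 produces for the Labor data, and the class and labels are hypothetical.

```java
// Sketch: counting the size (all nodes) and the number of leaves of a tree.
class TreeSizeSketch {

    static class Node {
        final String label;
        final Node[] children;

        Node(String label, Node... children) {
            this.label = label;
            this.children = children;
        }

        boolean isLeaf() {
            return children.length == 0;
        }
    }

    // Size of the tree: this node plus all nodes below it.
    static int countNodes(Node node) {
        int count = 1;
        for (Node child : node.children) {
            count += countNodes(child);
        }
        return count;
    }

    // Number of leaves: nodes without children.
    static int countLeaves(Node node) {
        if (node.isLeaf()) {
            return 1;
        }
        int count = 0;
        for (Node child : node.children) {
            count += countLeaves(child);
        }
        return count;
    }

    // A made-up toy tree: two tests on hypothetical attributes, three leaves.
    static Node toyTree() {
        return new Node("attrA",
                new Node("bad"),
                new Node("attrB",
                        new Node("bad"),
                        new Node("good")));
    }

    public static void main(String[] args) {
        Node root = toyTree();
        System.out.println("Size of the tree: " + countNodes(root));  // prints 5
        System.out.println("Number of leaves: " + countLeaves(root)); // prints 3
    }
}
```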
Step #3:
Set J48's parameters back to default values with the exception of minNumObj -- set this to 5.
Now pruning is "turned on", both post-pruning and pre-pruning (through the "minNumObj = 5" parameter).
Re-run the algorithm.
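The pre-pruning role of minNumObj can be sketched as a simple check: a candidate split is considered only if at least two of its branches would receive at least minNumObj instances. This is a simplified reading of the C4.5 rule, not WEKA's actual code; the class, method, and branch sizes below are hypothetical.

```java
// Sketch: how a minimum-instances threshold stops the tree from growing
// (pre-pruning). Simplified: instance counts are plain integers.
class MinNumObjSketch {

    // True if at least two branches would hold at least minNumObj instances.
    static boolean splitAllowed(int[] branchSizes, int minNumObj) {
        int bigEnough = 0;
        for (int size : branchSizes) {
            if (size >= minNumObj) {
                bigEnough++;
            }
        }
        return bigEnough >= 2;
    }

    public static void main(String[] args) {
        int[] candidate = {4, 3, 50}; // made-up branch sizes of one candidate split
        System.out.println(splitAllowed(candidate, 1)); // prints true
        System.out.println(splitAllowed(candidate, 5)); // prints false
    }
}
```

With minNumObj = 1 almost any split passes, so the tree keeps growing; with minNumObj = 5 splits that only peel off a few instances are rejected, which keeps the tree smaller.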
Note again the classification accuracy ("Correctly Classified Instances") and visualize this new decision tree.
Take a screen-shot of the decision tree and save the file "as a picture" (jpg, png, gif, ...).
Note again the size of this new tree (number of nodes and leaves).
Conclusion:
You should have noticed that pruning decreases the size of the decision tree while at the same time increasing its classification accuracy.
This happens because the unpruned tree also models the noise in the data -- we say that the unpruned tree overfits the data.
We shall talk about the phenomenon of overfitting in the lecture about Evaluation.
Pack both screen-shots ("picture" files) in a single ZIP file, name it "J48-<SurnameName>.zip"
(example: J48-KavsekBranko.zip) and submit it here!