Exercise #2: EDA and preparation on UCI ML Labor data set
Let's do some data exploration and preparation!
For this practical assignment you will need the "Labor negotiation data" that are provided with WEKA in the form of the labor.arff file.
Open the labor.arff file in WEKA. Carry out data exploration and preparation together, following the steps described by the assistant in the "Data preparation" slides.
Step #1:
Before making any changes to the data, add an ID attribute to it using the WEKA filter "weka.filters.unsupervised.attribute.AddID".
Just select it and click "Apply" -- no need to change any parameters. A new attribute named "ID" should appear as the first attribute of the data set.
You will need it later to restore the original order of the examples.
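To make the effect of this step concrete, here is a minimal Python sketch of what adding an ID attribute amounts to. It is not WEKA's implementation; the function name `add_id`, the toy rows, and the assumption that numbering starts at 1 are all illustrative.

```python
def add_id(rows):
    """Return copies of the rows with a running "ID" placed first.

    Assumption: IDs start at 1 and follow the current row order.
    """
    return [{"ID": i, **row} for i, row in enumerate(rows, start=1)]


# Hypothetical toy rows, not taken from labor.arff.
toy_rows = [
    {"x": 4.5, "class": "good"},
    {"x": 2.0, "class": "bad"},
]
with_id = add_id(toy_rows)
```

Because the ID records the row position at the time it is added, it only helps to restore the order if it is added before any sorting or deleting takes place.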
Step #2:
Finding and "treating" missing values.
Every attribute (except the newly added "ID" and "class") contains some missing values.
See how many missing values each attribute contains by navigating the "Preprocess" panel.
Now, click on the "Edit" button to open the editable spreadsheet view of your data.
Using the "click-to-sort-column" function of the editor and its "delete-row(s)" capability, delete the examples that have a missing value in any attribute with less than 10% missing values.
(Hint: there are 3 such attributes and you should end up deleting 5 examples)
Next, remove the 2 further examples that still have a missing value in one of the attributes, and then stop deleting examples. This should leave you with a data set containing 50 instances.
Search for the attributes that contain more than 45% missing data and remove them altogether.
(Hint: you should have removed 6 attributes, leaving you with 12 attributes in your new data set)
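The percentages used so far refer to the share of missing values per attribute. The following Python sketch shows how that share can be computed and how the two kinds of deletion (rows versus whole attributes) differ. The helper names and the toy data are hypothetical, and missing values are represented as `None` by assumption.

```python
def missing_share(rows, attribute):
    """Return the fraction of rows in which `attribute` is missing (None)."""
    if not rows:
        return 0.0
    return sum(row.get(attribute) is None for row in rows) / len(rows)


def drop_rows_missing(rows, attribute):
    """Keep only the rows that have a value for `attribute`."""
    return [row for row in rows if row.get(attribute) is not None]


def drop_attributes(rows, attributes):
    """Remove the named attributes from every row."""
    return [{k: v for k, v in row.items() if k not in attributes} for row in rows]


# Hypothetical toy data: "a" is missing once, "b" is missing three times.
toy = [
    {"a": 1.0, "b": None},
    {"a": 2.0, "b": None},
    {"a": 3.0, "b": 5.0},
    {"a": None, "b": None},
]
shares = {name: missing_share(toy, name) for name in ("a", "b")}
# Delete the row where "a" is missing, then drop the mostly empty "b".
cleaned = drop_attributes(drop_rows_missing(toy, "a"), {"b"})
```

Deleting rows is cheap when only a few values are missing, while an attribute that is mostly empty carries too little information to be worth keeping.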
On this new data set, apply the WEKA filter "weka.filters.unsupervised.attribute.ReplaceMissingValues", which replaces all remaining missing values with the mean for numeric attributes and the mode for nominal ones.
Your data should now contain no missing values.
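As an illustration of mean and mode replacement, here is a short Python sketch. It is a simplified stand-in for the filter, not its actual code; the function name `replace_missing`, the `numeric` flag, and the sample values are assumptions made for this example.

```python
from collections import Counter
from statistics import mean


def replace_missing(values, numeric):
    """Fill None entries with the mean (numeric) or the mode (nominal)."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)  # nothing to estimate the fill value from
    if numeric:
        fill = mean(present)
    else:
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical sample columns.
numeric_filled = replace_missing([1.0, None, 3.0], numeric=True)
nominal_filled = replace_missing(["a", None, "b", "a"], numeric=False)
```

Note that the fill value is computed only from the values that are present, so the rows deleted earlier no longer influence it.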
Step #3:
Transforming "ordered" nominal attributes to numeric ones to retain the natural ordering of the values.
Identify which nominal attributes exhibit order in their values (hint: there are 3 such attributes).
Use the WEKA filter "weka.filters.unsupervised.attribute.OrdinalToNumeric" to accomplish this task (note: adjust the parameters accordingly).
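The idea behind this transformation can be sketched in a few lines of Python. The sketch assumes that each label is mapped to its position in an ordered label list; the labels "low", "medium" and "high" are hypothetical and do not come from labor.arff.

```python
def ordinal_to_numeric(values, ordered_labels):
    """Map each label to its position in `ordered_labels` (0-based here)."""
    position = {label: i for i, label in enumerate(ordered_labels)}
    return [position[v] for v in values]


# Hypothetical labels, listed from lowest to highest.
labels = ["low", "medium", "high"]
encoded = ordinal_to_numeric(["medium", "low", "high", "low"], labels)
```

The result is only meaningful if the label list really is in the natural order, so it is worth checking the order in which the labels of each attribute are declared before applying the filter.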
Step #4:
Outlier detection and removal.
Take a closer look at the histogram of each attribute again.
Observe how the histogram of the attribute "shift-differential" has a "long tail" on the right containing 3 single instances.
We should check with a box plot and the "1.5 x IQR" rule that those 3 examples are indeed outliers.
Use the "Edit" button and a procedure similar to the one in "Step #2" to remove these 3 outliers (hint: those are the highest 3 values for that attribute).
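The "1.5 x IQR" rule mentioned in this step can be sketched in Python as follows. Quartiles can be computed under several conventions; this sketch assumes the "inclusive" method of `statistics.quantiles`, so a box plot drawn by another tool may place the fences slightly differently. The sample values are made up and are not the "shift-differential" values.

```python
from statistics import quantiles


def iqr_fences(values):
    """Return the (lower, upper) fences of the 1.5 x IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values):
    """Return the values that fall outside the fences."""
    lower, upper = iqr_fences(values)
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample with one value far out on the right.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 30]
fences = iqr_fences(sample)
outliers = find_outliers(sample)
```

A value counts as an outlier under this rule only if it lies beyond a fence, which is why a long histogram tail should be confirmed this way before rows are deleted.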
Step #5:
It's time for discretization!
Actually, the data does not need any discretization.
But, for the sake of demonstration, let's do some.
Discretize the attribute "wage-increase-first-year" using equal width discretization and 4 bins.
Discretize the attribute "working-hours" using equal frequency discretization and 3 bins.
Discretize the attribute "shift-differential" using class dependent discretization (how many bins did you get?).
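The two unsupervised schemes used in this step differ in what they hold constant: equal width fixes the size of each interval, equal frequency fixes the number of instances per bin. The Python sketch below illustrates both under simplifying assumptions (bins are numbered from 0, and ties are not treated specially in the equal frequency version), so the cut points may differ from what WEKA produces. The sample values are hypothetical.

```python
def equal_width_bins(values, bins):
    """Assign each value to one of `bins` intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)  # no spread: everything lands in one bin
    width = (hi - lo) / bins
    # The maximum would fall just past the last interval, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]


def equal_frequency_bins(values, bins):
    """Assign values to `bins` bins holding about the same number of items."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order):
        result[i] = rank * bins // len(values)
    return result


# Hypothetical sample values.
by_width = equal_width_bins([2.0, 3.0, 4.0, 5.0, 6.0], bins=4)
by_frequency = equal_frequency_bins([40, 35, 38, 37, 33, 36], bins=3)
```

With equal width, a skewed attribute can leave some bins nearly empty; with equal frequency, the bins are balanced but their intervals can have very different widths.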
Step #6:
Check everything that remains and finish.
There are no date type attributes in our data.
There is now one attribute with no variability -- the discretized "shift-differential" attribute. Remove it.
Restore the data in the original order using the "Edit" button and the "ID" attribute.
After that, remove the "ID" attribute, because it has 100% variability.
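The final clean-up can be summarised in a short Python sketch: find attributes with a single distinct value, sort the rows back by ID, and then drop the ID. The helper names and the toy rows are hypothetical, and rows are assumed to be dictionaries with an "ID" key.

```python
def constant_attributes(rows):
    """Return the names of attributes that take only one distinct value."""
    if not rows:
        return []
    return [name for name in rows[0] if len({row[name] for row in rows}) == 1]


def restore_order_and_drop_id(rows):
    """Sort the rows by "ID" and return them without the "ID" attribute."""
    ordered = sorted(rows, key=lambda row: row["ID"])
    return [{k: v for k, v in row.items() if k != "ID"} for row in ordered]


# Hypothetical rows left in a shuffled order after earlier sorting.
shuffled = [
    {"ID": 3, "x": "b", "flat": "all"},
    {"ID": 1, "x": "a", "flat": "all"},
    {"ID": 2, "x": "c", "flat": "all"},
]
flat_names = constant_attributes(shuffled)
final_rows = restore_order_and_drop_id(shuffled)
```

An attribute with one distinct value and an attribute whose values are all unique are two extremes of the same problem: neither helps to separate the classes.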
The "final" data set is somewhat unbalanced, but not drastically.
So, this is it! Save the file according to instructions and proceed to the submission.
Beware: do not overwrite the original labor.arff file!!!
Name the final ARFF file as "Labor-<SurnameName>.arff"
(example: Labor-KavsekBranko.arff) and submit it here!