Task 2 Exploratory Data Analysis and Decision Tree Analysis (Worth 25 Marks)
Task 2.1) Conduct an exploratory data analysis of the patient-health.csv data set using the RapidMiner Studio data mining tool. Summarise your findings in a table named Table 2.1 Results of Exploratory Data Analysis for the patient-health.csv Data Set. The table should describe the key characteristics of each variable in the data set, such as maximum and minimum values, average, standard deviation, most frequent values (mode), missing values and invalid values, and relationships with other variables where relevant.
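As a cross-check on the figures RapidMiner reports, the statistics listed for Table 2.1 can be sketched in plain Python. This is a minimal illustration only and does not replace the required RapidMiner output. The function name `summarise_numeric`, the use of `None` to represent a missing value, and the sample ages are all assumptions made for the example; they are not taken from patient-health.csv.

```python
import statistics
from collections import Counter


def summarise_numeric(values):
    """Return min, max, mean, standard deviation, mode and missing count.

    Missing entries are assumed to be represented by None; the real file
    may encode them differently (for example as empty strings).
    """
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "stdev": statistics.stdev(present),  # sample standard deviation
        "mode": Counter(present).most_common(1)[0][0],
        "missing": len(values) - len(present),
    }


# Made-up ages for illustration, not real patient data
ages = [34, 51, None, 34, 67, 45]
summary = summarise_numeric(ages)
print(summary)
```

Invalid values (for example a negative age) would need a separate range check against the data dictionary, which this sketch does not attempt.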
Hint: The Statistics Tab and the Chart Tab in RapidMiner provide descriptive statistical information and useful charts such as bar charts and scatterplots. You might also run some correlations and chi-square tests to indicate which variables you consider to be the top five key variables, that is, those which contribute most to determining whether a patient is healthy. Note that in completing Task 2.1 you will find it useful to refer to the data dictionary for the patient-health.csv data set provided in this document, which defines each variable in terms of its data type and range of values.
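To make the chi-square test mentioned in the hint concrete, the sketch below computes the Pearson chi-square statistic for a small contingency table by hand. It is an illustration under stated assumptions: the function name `chi_square_statistic` is hypothetical, the counts are invented, and grouping patients into "healthy" and "not healthy" is an assumed simplification of the five genhealth ratings.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Invented counts: rows are exerany = 1 and exerany = 0,
# columns are "healthy" and "not healthy" (an assumed grouping of genhealth)
counts = [[30, 10],
          [20, 40]]
stat = chi_square_statistic(counts)
print(round(stat, 3))
```

A larger statistic, relative to the degrees of freedom of the table, suggests a stronger association between the variable and the health outcome, which is one way to support a ranking of candidate variables.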
Briefly discuss the key results of your exploratory data analysis presented in Table 2.1 and your rationale for selecting your top five variables for predicting Patient Health. (About 250 words)
Task 2.2) Build a Decision Tree model for predicting Patient Health using RapidMiner, an appropriate set of data mining operators, and a reduced patient-health.csv data set determined by your exploratory data analysis in Task 2.1. Provide these outputs from RapidMiner for Task 2.2: (1) Final Decision Tree Model process, (2) Final Decision Tree diagram, and (3) Decision Tree rules.
Briefly describe your final Decision Tree Model Process and discuss the results of the Final Decision Tree Model, drawing on the key outputs (Decision Tree Diagram, Decision Tree Rules) for predicting Patient Health and on relevant supporting literature on the interpretation of decision trees. (About 250 words)
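When interpreting the tree, it can help to see how a splitting criterion ranks attributes. The sketch below computes information gain, one common criterion for decision trees, on a handful of invented records. It is not a description of RapidMiner's internal implementation; the function names `entropy` and `information_gain` and the sample records are assumptions made for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting rows on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Invented records for illustration only, not real patient data
patients = [
    {"exerany": 1, "smoke100": 0, "genhealth": "Good"},
    {"exerany": 1, "smoke100": 1, "genhealth": "Good"},
    {"exerany": 0, "smoke100": 0, "genhealth": "Poor"},
    {"exerany": 0, "smoke100": 1, "genhealth": "Poor"},
]
gain_exercise = information_gain(patients, "exerany", "genhealth")
gain_smoke = information_gain(patients, "smoke100", "genhealth")
print(gain_exercise, gain_smoke)
```

In this toy data exerany separates the two health ratings perfectly while smoke100 does not separate them at all, so a tree grown on it would split on exerany first. The attribute at the root of your own tree can be read in the same way.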
Include all appropriate RapidMiner outputs, such as RapidMiner Processes, Graphs and Tables, that support the key aspects of your exploratory data analysis and decision tree model analysis of the data set in your Assignment 2 report. Note that you need to export these outputs from RapidMiner using the File/Print/Export Image option and, where relevant, include them in Task 2 and/or in Appendix A of the Assignment 2 report.
Table 1 Patient Health Data Set Data Dictionary
No. | Variable Name | Type and description of variable | Range of values
1 | Patient_id | Integer, Patient Id | Range 1 to 20,000
2 | genhealth | Polynominal, Health Rating of each patient | Poor, Fair, Good, Very Good, Excellent
3 | exerany | Integer, does the patient exercise? | 1 or 0
4 | hlthplan | Integer, Health insurance plan? | 1 or 0
5 | smoke100 | Integer, Smoker? | 1 or 0
6 | height | Integer, height in inches of patient | Height range in inches
7 | weight | Integer, weight in pounds of each patient | Weight range in pounds
8 | wtdesire | Integer, desired weight of each patient; can be used to calculate if a patient is overweight etc | Desired weight of each patient in pounds
9 | age | Integer | Age of each patient
10 | gender | Polynominal, Gender of each patient | M = Male; F = Female
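The data dictionary notes that wtdesire can be used to calculate whether a patient is overweight. One simple way to do this, sketched below, is to subtract desired weight from actual weight. The function name `weight_difference` and the sample records are hypothetical, and treating any positive difference as "above desired weight" is an assumption rather than a clinical definition.

```python
def weight_difference(weight, wtdesire):
    """Pounds above (positive) or below (negative) the desired weight."""
    return weight - wtdesire


# Invented records for illustration only, not real patient data
records = [
    {"Patient_id": 1, "weight": 180, "wtdesire": 160},
    {"Patient_id": 2, "weight": 130, "wtdesire": 135},
]
differences = [weight_difference(r["weight"], r["wtdesire"]) for r in records]
above_desired = [d > 0 for d in differences]
print(differences, above_desired)
```

A derived attribute of this kind could be added to the reduced data set before building the decision tree, if your exploratory analysis suggests it is informative.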