DEPARTMENT OF MANAGEMENT SCIENCE
MSCI 526 DATA MINING COURSEWORK
The Data Mining coursework consists of a single individual task (100%) based upon the lectures and workshops. The task will require you to apply your technical skills to a real Data Mining project in a sensible manner given the time you have available.
MSc in Quantitative Finance: Credit Scoring Dataset
MSc in Marketing Analytics: Direct Marketing dataset from KDD CUP
MSc in Logistics & SCM: Credit Scoring Dataset (similar to vendor bankruptcy prediction)
MSc in Operational Research: choose one from above (Credit Scoring is recommended)
MSc in Data Science: choose one from above
Download the correct dataset, unzip it, and look at the files. The archive includes a “Read me first.txt”, a description of the task, and a Data Dictionary file which explains the dataset, the target variable, the independent variables etc. You will need to import the data into SAS in order to access it from Enterprise Miner – see the end of this document for help with this.
- You must run a data mining analysis and prepare a report that documents and critically argues your analysis. Both the adequacy of what you have modelled and the clarity of your arguments will be assessed.
- The report should be no longer than 20 pages in 12-point typeface with minimum 1.5 cm margins all around, including graphs but excluding appendices. The cover page should include your library card number.
- One electronic copy of your report should be submitted via the MSCI526 coursework section on MOODLE in PDF format. Please do not submit more than one copy electronically. The electronic copy must not contain the coursework declaration form. All coursework will be processed by specialist software to check for collusion and plagiarism.
- One hard copy of your report should also be submitted, simply stapled (not in a folder or file), in an unsealed A4 envelope with your name and library card number written clearly on the front, together with the coursework declaration form. Please do not submit more than one hard copy. The hard copy must be identical to the electronic copy.
- You have to use SAS Enterprise Miner (version 4.3)
- In Enterprise Miner create a new Enterprise Miner project named “DMC_yourlastname”. The corresponding file will be named “DMC_yourlastname.dmp” and saved on your H: drive. Submit the final file with your coursework by uploading it when making the electronic submission. Also include a screenshot of your final workflow in the appendix of the report.
- In using Enterprise Miner you should make your analysis unique by using your own seed at the Data Partition stage. The seed you use should be the last five digits of your library card number.
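The effect of a seeded partition can be sketched outside Enterprise Miner. The following Python fragment is purely illustrative (the coursework itself must be done in Enterprise Miner) and assumes a 60/20/20 train/validation/test split, which is a common but not mandated choice:

```python
import numpy as np

def partition(n_rows, seed, fractions=(0.6, 0.2, 0.2)):
    """Split row indices into train/validation/test sets using a fixed
    seed, mirroring the role of the Data Partition node's random seed."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_rows)          # seeded shuffle of all rows
    n_train = int(fractions[0] * n_rows)
    n_valid = int(fractions[1] * n_rows)
    return (idx[:n_train],
            idx[n_train:n_train + n_valid],
            idx[n_train + n_valid:])

# The seed would be the last five digits of your library card number;
# 12345 here is a placeholder.
train, valid, test = partition(1000, seed=12345)
```

Because the seed is fixed, re-running the partition reproduces exactly the same three index sets, which is what makes each student's analysis unique yet repeatable.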
Case Study Description.
You are working as a data miner in a company that has provided a dataset and a corresponding data dictionary describing it. Your task is to conduct a thorough investigation of the dataset and to recommend a model that predicts class membership of instances as accurately as possible. Since a “perfect model” does not exist, your employer needs you to carefully justify the model you recommend, including all data preprocessing and how you measure accuracy. To do so, you need to write a clear and concise technical report.
Develop the most suitable predictions for the dataset (“Direct Marketing” or “Credit Scoring”) and prepare a technical report documenting your modelling process and results. Use your knowledge of all aspects of model building across all phases of the data mining process. Discuss and justify your choices of sampling, modification, models and assessment (some general instructions on writing technical reports appear at the end). Please consider the following instructions and tips for each phase:
1. Sample
The datasets are both large in size, so building models with the complete dataset may be time-consuming. Consider metadata samples for data exploration and initial model building, and the complete sample for final model building, and justify your choices. If the dataset represents an imbalanced classification problem, evaluate two different sampling strategies, one of which should be the standard sampling approach with adequate settings of priors for the relevant target levels.
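As an illustration of one such strategy, the following Python sketch (not part of the SAS workflow; the function names and the 10% event rate are hypothetical) undersamples the majority class to a balanced training sample and then corrects model scores back to the true population prior, which is the role of the priors setting in a target profile:

```python
import numpy as np

def undersample(y, rng):
    """Indices keeping all minority (event) cases plus an equal-sized
    random subset of the majority class: a balanced 50/50 sample."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    keep = rng.choice(majority, size=len(minority), replace=False)
    return np.sort(np.concatenate([minority, keep]))

def adjust_posterior(p_sample, prior_true, prior_sample):
    """Correct a score estimated on the balanced sample back to the
    population prior (Bayes adjustment)."""
    num = p_sample * prior_true / prior_sample
    den = num + (1 - p_sample) * (1 - prior_true) / (1 - prior_sample)
    return num / den

rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)   # hypothetical 10% event rate
idx = undersample(y, rng)           # balanced 50/50 subsample
```

For example, a score of 0.5 on the balanced sample corresponds to the 10% population event rate after adjustment, which is why priors must be set when scoring models built on over- or undersampled data.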
2. Exploration & Modification
Conduct an in-depth data analysis of the assigned dataset. Explore and describe the target variable of the dataset, commenting on its distribution, symmetry and potential problems. Explore the relationship of the target (dependent variable) with the relevant features (independent variables) and their attributes using graphical tools and statistical analysis.
Use the findings from the data exploration to guide you in choosing different data modifications. You should evaluate at least two different candidate sets of relevant variables (and argue why these could be relevant), at least two different transformations, and two different replacement schemes for these variables (so 2 variable sets × 2 transformations × 2 replacements = 8 preprocessing variants). One of these should always be the base case, i.e. doing no variable selection, no transformation and no replacement. Please also address these relevant questions:
- Which variable(s) are the most important, i.e. of particular predictive relevance? Which variables should be included (distinguish between original and transformed variables)?
- Which variables can you exclude completely, or which levels of a variable?
- Which variables contain outliers, or extreme values? And, if they do, which variables can be improved through transformations?
- Which variables contain missing values? And, if they do, which treatment if any did you apply to missing values, and why?
- Are there important interrelationships between independent variables? Do these require any modifications through transformations and data preprocessing?
- Are the relevant classes balanced? Consider whether this problem represents a balanced or imbalanced classification problem, and how this can be remedied.
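The 2×2×2 grid of preprocessing variants described above can be enumerated mechanically, which helps ensure no combination is forgotten. This Python sketch is illustrative only, and the variant labels are hypothetical placeholders for whatever choices your exploration actually suggests:

```python
from itertools import product

# Hypothetical labels; the actual choices come from your exploration.
variable_sets   = ["all_variables", "selected_subset"]
transformations = ["none", "log_skewed"]
replacements    = ["none", "median_impute"]

# Cartesian product: 2 * 2 * 2 = 8 preprocessing variants.
variants = [
    {"variables": v, "transform": t, "replace": r}
    for v, t, r in product(variable_sets, transformations, replacements)
]
# The first variant is the base case: no selection, no transformation,
# no replacement.
```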
Remember that usually up to 70% of all effort goes into data understanding, pre-processing and transformation into indicators, in an iterative process of re-evaluating models (and re-exploring them in plots)! Unfortunately, you do not have time to do this extensively in the coursework. Therefore, follow a high-impact approach and limit your workload. The implications of each analysis will suggest particular modification ideas to try next.
4. Model
You need to build and evaluate at least two different candidate models each of decision trees, logistic regression and neural networks (so 6 algorithm candidates in total) and justify your choice of algorithm parameters. One of these should always be the basic, unchanged standard version of each model. Please also address these relevant questions:
- What is the best model to predict this dataset? Are particular methods more or less suitable for solving this task? What could serve as a simple baseline solution?
- What is the impact of data preprocessing choices on the performance of different models? Which ones work? Which preprocessing steps do not have a significant impact on the final accuracy? Do these interact with the choice of model?
- What is the sensitivity of the models to the options used in setting them up? Which setups worked for some models and which did not?
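One way to keep track of the six algorithm candidates is as an explicit configuration table. The sketch below is illustrative Python only (the real candidates are Enterprise Miner nodes), and all parameter names and values are hypothetical examples of a default versus a tuned variant:

```python
# Hypothetical parameter grid: two candidates per algorithm, one the
# unchanged default and one with a deliberately varied setting
# (mirroring the Tree, Regression and Neural Network nodes).
candidates = {
    "tree_default":   {"algorithm": "decision_tree",       "max_depth": None},
    "tree_shallow":   {"algorithm": "decision_tree",       "max_depth": 4},
    "logit_default":  {"algorithm": "logistic_regression", "selection": "none"},
    "logit_stepwise": {"algorithm": "logistic_regression", "selection": "stepwise"},
    "nnet_default":   {"algorithm": "neural_network",      "hidden_units": 3},
    "nnet_larger":    {"algorithm": "neural_network",      "hidden_units": 10},
}

for name, config in candidates.items():
    # In the report, each entry corresponds to one fitted model whose
    # validation performance you document and compare.
    print(name, config["algorithm"])
```

Laying the candidates out this way makes it easy to verify the requirement of two candidates per algorithm and to report which setup worked for which model.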
5. Assess
Interpret how well your models are performing. Use at least two metrics and justify your choice. Carefully choose, describe and justify on which data partitions you build, validate and evaluate your models. Provide evidence of errors on all three data partitions: training, validation and test.
If the costs of misclassifying individual instances are asymmetric, additionally consider this by setting up your experiments with two different target profiles, one without costs and one reflecting costs. Build corresponding models that utilise the costs. Please also address these relevant questions:
- What is an appropriate performance measure, i.e. misclassification rate, lift, ROC, costs etc. for the properties of this dataset?
- What is a suitable benchmark for accuracy?
- How does using costs change / improve your results with suitable models?
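To make the accuracy-versus-cost distinction concrete, the following illustrative Python sketch computes the misclassification rate, the expected misclassification cost, and the cost-minimising decision threshold. The 1:5 cost ratio is a hypothetical assumption for illustration, not a value given with the datasets:

```python
import numpy as np

def misclassification_rate(y_true, y_pred):
    """Fraction of instances classified incorrectly."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def expected_cost(y_true, y_pred, cost_fp=1.0, cost_fn=5.0):
    """Average cost per instance with asymmetric penalties, e.g. a
    missed event (false negative) costing 5x a false alarm.
    The 1:5 ratio is illustrative only."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return float(cost_fp * fp + cost_fn * fn) / len(y_true)

# With asymmetric costs, the Bayes-optimal rule classifies an instance
# as the event when its score p exceeds cost_fp / (cost_fp + cost_fn).
threshold = 1.0 / (1.0 + 5.0)   # lower than the accuracy-optimal 0.5
```

Because missing an event is assumed five times as costly as a false alarm, the optimal cut-off drops well below 0.5: a cost-sensitive model will flag more instances as events, trading a worse misclassification rate for a lower expected cost.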
Note: A report that uses just a single variable set and a single standard pre-processing for a single classification method – e.g. the default tree created by Enterprise Miner or another default model – would not convince me that you are conducting a sound analysis and would not achieve a pass.
Structure of the Technical Report
You are required to write a technical report of your analysis, experiments and final recommendations. The report should be tailored to a managerial decision maker with sufficient statistical and OR knowledge (e.g. use technical terms and be precise; don’t explain what a neural network or sampling is, as you can assume this is known). The technical report should sufficiently document the data mining process you conducted in interaction with your data, and analyse and critically interpret the data, the different experiments, the model setups used and your final results. You don’t need to include references to the literature. Demonstrate awareness of the potential project time vs. accuracy trade-off!
The report itself should follow a suitable structure and include an introduction, a summary and a conclusion where you make a critical recommendation. Make use of graphics (i.e. graphs captured from SAS) to support your analysis and argument. Include a hardcopy of your workflow from Enterprise Miner. Sections of technical Enterprise Miner output are acceptable, but only if they can be explained and interpreted in a meaningful way and support your arguments. More technical information and documentation which you feel is needed to confirm you did things correctly, but which is not relevant to supporting your arguments, should be placed in an appendix. The appendix should be referenced at every point it is used in the text; otherwise the evidence cannot count towards your argument. In addition, develop an executive summary for senior management indicating the most relevant findings.