Introduction:
This assignment is a summative assessment of your understanding of Big Data Systems and related technologies. It consists of three mini-tasks with the following aims:
- Introduce Big Data in the context of a given organisation (See Task 1)
- Understand the problems of working with Big Data and describe technologies that specialise in handling it (See Task 2)
- Use a software package that is designed for Big Data Systems to perform a simple analytical task (See Task 3)
Task 1 – Introduce Big Data in the Context of Amazon
Amazon is an online book retailer that has expanded its retail offering far beyond books over the last decade (www.amazon.com).
- Define Big Data in terms of the four V’s. Describe how each V could apply to Amazon. (E.g. ‘Volume’ is one of the V’s. What data would Amazon likely be capturing to qualify?)
- Give an example from Amazon to illustrate your points for each of the four V’s discussed above. (12 Marks)
Task 2 – Big Data Technologies
Hadoop is a technological framework that enables processing of large datasets at the scale of Big Data. Your task is to research and understand Hadoop. Your description should include:
- What is Hadoop?
- What are the technological challenges of working with Big Data?
- How does the Hadoop framework overcome the above-mentioned challenges? (10 Marks)
Task 3 – Big Data Analytics with Orange Software Package
The dataset that we will be using is contained in the file Titanic.tab, which is available on CloudDeakin under Resources->Assignment 3->Titanic.tab.
This is by no means a Big Data set. To simplify the analytical task (as promised in lectures), we will settle for a smaller and simpler dataset. Your task is to:
- Analyse the full dataset using Orange and try to gain insight from it.
- Take a random sample of 200 records and perform the same analysis. State your findings. Are your conclusions similar to those you reached with the full dataset? Explain why or why not.
- Under what circumstances would it be permissible to use a random sample from a full dataset for analysis? Under what circumstances would it raise red flags? (11.3 Marks)
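The idea behind comparing a full dataset with a random sample of 200 records can be sketched outside Orange in a few lines of Python. This is only an illustration and not a substitute for the Orange analysis: the records below are synthetic stand-ins rather than the contents of Titanic.tab, and the field name `survived`, the dataset size of 2000, the 32% rate and the helper `survival_rate` are all assumptions made for the example.

```python
import random


def survival_rate(records):
    """Return the fraction of records whose 'survived' field is 'yes'."""
    return sum(r["survived"] == "yes" for r in records) / len(records)


# Hypothetical stand-in data (NOT Titanic.tab): 2000 synthetic records,
# each marked 'yes' with probability 0.32. The seed makes the run repeatable.
rng = random.Random(42)
full = [
    {"survived": "yes" if rng.random() < 0.32 else "no"}
    for _ in range(2000)
]

# Simple random sample of 200 records, drawn without replacement.
sample = rng.sample(full, 200)

full_rate = survival_rate(full)
sample_rate = survival_rate(sample)

print(f"Full dataset rate:  {full_rate:.3f}")
print(f"Sample of 200 rate: {sample_rate:.3f}")
print(f"Absolute difference: {abs(full_rate - sample_rate):.3f}")
```

Changing the seed, or drawing several samples in a loop, shows how much a summary figure can move from one sample of 200 to the next, which is a useful starting point for the questions above.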