5011CEM Big Data Programming Project

阿叶2024-05-15 17:47:05

The report is graded out of 150 and contributes 10 credits towards the module. Resit marks are capped at 40%.
For detailed guidance on mark allocation, see the grading scheme below.
This is also available as a separate Excel document on Aula.

Resit Information

Your original submission has been graded and feedback provided. Using the written feedback and the marks for each part, you are required to improve your work before resubmitting it for the resit assessment. For convenience, the details are repeated below.
Please note that work which has not been improved may attract lower marks at the second submission.

Assessment Overview

Over the course of this module, you have been introduced to a range of techniques that may be used for programming a big data project. This assessment allows you to pull these techniques together in a realistic scenario to complete a big data analysis project. Below is a realistic project scenario. Using the techniques presented in class, you are to carry out the project and write a final project report for your client.
As in real-world projects where the client has rejected your work and requested improvements, work that is not improved in line with the feedback may be marked lower.

Project Scenario

You have been approached by a client who analyses atmospheric science and climate model data. They have developed a new analysis technique, but it currently takes too long to run to be of practical use. They have asked you to investigate the use of big data techniques to reduce the processing time.

They have a large volume of data to process, and the analysis needs to be repeated frequently. They have the following basic requirements:
1. The current analysis takes approximately 2.5 hours to analyse the climate model output data for a 1-hour time period.
2. The data for a single day of model output is approximately 250 MB. However, they have over 100 years’ worth of data to analyse, making a total of over 9 TB.
3. Each day, they need to analyse the new data set for that day, so they wish to complete the analysis of the data for a 24-hour period (25 data sets) in under 2 hours.
4. It is not possible to hold all of this in memory at one time, so the new process should load only 1 hour of data for processing at a time. If parallel processing is used, then 1 hour of data per worker can be loaded as needed.
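
The figures in requirements 1 to 3 imply a target speedup, which can be worked out directly. The sketch below is only a back-of-the-envelope estimate: it assumes ideal linear scaling (doubling the workers halves the run time) and ignores leap days in the data volume check. The function names are illustrative and do not come from the brief.

```python
import math

# Figures taken from the client's requirements.
HOURS_PER_DATASET = 2.5   # requirement 1: time to analyse 1 hour of data
DATASETS_PER_DAY = 25     # requirement 3: data sets in a 24-hour period
TARGET_HOURS = 2.0        # requirement 3: the run must finish in under this


def sequential_hours(hours_per_dataset, datasets):
    """Total run time if the data sets are processed one after another."""
    return hours_per_dataset * datasets


def min_workers_ideal(total_hours, target_hours):
    """Smallest worker count finishing strictly under the target.

    Assumes ideal linear scaling, which real systems rarely achieve.
    """
    return math.floor(total_hours / target_hours) + 1


total = sequential_hours(HOURS_PER_DATASET, DATASETS_PER_DAY)
workers = min_workers_ideal(total, TARGET_HOURS)

print(f"Sequential estimate: {total} hours")             # 62.5 hours
print(f"Required speedup: {total / TARGET_HOURS}x")      # 31.25x
print(f"Minimum workers (ideal scaling): {workers}")     # 32

# Data volume check for requirement 2: 250 MB per day for 100 years.
total_tb = 250 * 365 * 100 / 1_000_000
print(f"Approximate archive size: {total_tb} TB")        # 9.125 TB
```

Under these assumptions, at least 32 workers are needed. If a whole data set were the smallest unit of work, no worker count could beat the 2.5 hours a single data set takes, so the estimate also assumes the analysis of one hour of data can itself be divided among workers.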

You have been tasked with investigating the use of parallel processing to achieve the analysis speed required, with the following expectations:
1. Test and compare the processing speed of sequential and parallel processing.
2. Extrapolate your findings to indicate the number of processors required to achieve the target processing time.
3. Test how your code responds to common errors, e.g. data that is text instead of numeric, use of NaN in the data as an error code.
4. Run automated tests that allow your client to set the tests running and return later to see the results, without user intervention.
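
One possible shape for the comparison in expectations 1 and 2 is sketched below. It is not the client's analysis: `analyse_hour` is a hypothetical stand-in that just burns CPU time, and the worker counts are arbitrary. The sketch uses Python's standard `concurrent.futures.ProcessPoolExecutor`; the module may expect a different language or toolkit.

```python
import time
from concurrent.futures import ProcessPoolExecutor


def analyse_hour(hour_index):
    """Hypothetical stand-in for the analysis of one hour of data."""
    total = 0.0
    for i in range(200_000):
        total += (i % 7) * 0.5
    return hour_index, total


def time_sequential(hours):
    """Process each hour in turn; return (elapsed seconds, results)."""
    start = time.perf_counter()
    results = [analyse_hour(hour) for hour in hours]
    return time.perf_counter() - start, results


def time_parallel(hours, workers):
    """Process hours on a pool of worker processes.

    Returns (elapsed seconds, results). The timing includes pool start-up.
    """
    start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(analyse_hour, hours))  # map keeps input order
    return time.perf_counter() - start, results


if __name__ == "__main__":
    hours = list(range(25))  # 25 data sets, as in requirement 3
    seq_time, seq_results = time_sequential(hours)
    print(f"sequential: {seq_time:.3f} s")
    for workers in (2, 4):
        par_time, par_results = time_parallel(hours, workers)
        # The parallel run must give the same answers as the sequential one.
        assert par_results == seq_results
        speedup = seq_time / par_time
        print(f"{workers} workers: {par_time:.3f} s, speedup {speedup:.2f}x")
```

Speedups measured at several worker counts can then be extrapolated towards the target time. Measured scaling is usually less than linear, so the extrapolated processor count is likely to exceed the ideal estimate.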
The data has been provided by the European Centre for Medium-Range Weather Forecasts (ECMWF).
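
The error checks in expectation 3 and the unattended run in expectation 4 could be approached along the following lines. This is a minimal sketch, assuming each data set has already been loaded as a flat list of values; the data set names and the functions `find_problems` and `run_checks` are hypothetical, and the real file format is not specified here.

```python
import math


def find_problems(values):
    """Return (index, description) pairs; an empty list means no errors."""
    problems = []
    for index, value in enumerate(values):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append((index, f"non-numeric value {value!r}"))
        elif math.isnan(value):
            problems.append((index, "NaN found"))
    return problems


def run_checks(datasets):
    """Check every data set without prompting the user; return report lines."""
    report = []
    for name, values in datasets.items():
        problems = find_problems(values)
        if not problems:
            report.append(f"{name}: OK")
        for index, description in problems:
            report.append(f"{name}: FAIL at index {index}: {description}")
    return report


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    sample = {
        "hour_00": [1.0, 2.5, 3.0],
        "hour_01": [1.0, "abc", float("nan")],
    }
    for line in run_checks(sample):
        print(line)
```

Because `run_checks` never stops to ask for input, its report can be redirected to a file and read after the run has finished.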
