Design Insights for MapReduce from Diverse Production Workloads
CALIFORNIA UNIV BERKELEY DEPT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
In this paper, we analyze seven MapReduce workload traces from production clusters at Facebook and at Cloudera customers in e-commerce, telecommunications, media, and retail. Cumulatively, these traces comprise over a year's worth of data logged from over 5000 machines, and contain over two million jobs that perform 1.6 exabytes of I/O. Key observations include: input data forms up to 77% of all bytes; 90% of jobs access KB to GB sized files that make up less than 16% of stored bytes; up to 60% of jobs re-access data that has been touched within the past 6 hours; peak-to-median job submission rates are 9:1 or greater; an average of 68% of all compute time is spent in map; task-seconds-per-byte is a key metric for balancing compute and data bandwidth; task durations range from seconds to hours; and five out of seven workloads contain map-only jobs. We have also deployed a public workload repository with workload replay tools so that researchers can systematically assess design priorities and compare performance across diverse MapReduce workloads.
- Computer Programming and Software
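
The ratio metrics named in the abstract (task-seconds-per-byte, peak-to-median job submission rate, and the share of compute time spent in map) can be illustrated with a minimal Python sketch. The record fields, the sample values, and the choice to divide total task time by total bytes moved are assumptions made for illustration; they are not taken from the report's traces or tools.

```python
from statistics import median


def task_seconds_per_byte(task_seconds, bytes_moved):
    """Task time (seconds) divided by bytes of I/O moved by the job.

    Assumption: bytes_moved is whatever byte total the analyst chooses
    (for example input + shuffle + output); the abstract does not define it.
    """
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return task_seconds / bytes_moved


def peak_to_median(jobs_per_interval):
    """Ratio of the busiest interval's job count to the median interval's count."""
    counts = list(jobs_per_interval)
    mid = median(counts)
    if mid == 0:
        raise ValueError("median submission rate is zero; ratio is undefined")
    return max(counts) / mid


def map_time_fraction(jobs):
    """Fraction of all task time spent in map tasks across a list of jobs."""
    map_total = sum(j["map_seconds"] for j in jobs)
    all_total = sum(j["map_seconds"] + j["reduce_seconds"] for j in jobs)
    return map_total / all_total


# Hypothetical job records; the second one is a map-only job.
jobs = [
    {"map_seconds": 120.0, "reduce_seconds": 30.0, "bytes": 3_000_000_000},
    {"map_seconds": 60.0, "reduce_seconds": 0.0, "bytes": 500_000},
    {"map_seconds": 20.0, "reduce_seconds": 20.0, "bytes": 40_000_000},
]

# Hypothetical jobs submitted per hour over five hours.
hourly_submissions = [2, 4, 4, 36, 5]

if __name__ == "__main__":
    for job in jobs:
        total = job["map_seconds"] + job["reduce_seconds"]
        print(task_seconds_per_byte(total, job["bytes"]))
    print(peak_to_median(hourly_submissions))
    print(map_time_fraction(jobs))
```

A high task-seconds-per-byte value suggests a compute-heavy job, while a low value suggests a job dominated by data movement, which is why the metric is useful for balancing compute against data bandwidth.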