Caching in the multiverse

Abdi, Mania; Hajkazemi, A.; Turk, Ata; Krieger, Orran; Desnoyers, Peter

Date

2019-07-17

DOI

10.5555/3357062.3357087

Authors

Abdi, Mania

Hajkazemi, A.

Turk, Ata

Krieger, Orran

Desnoyers, Peter

Version

Accepted manuscript

URI

https://hdl.handle.net/2144/40963

Citation

Mania Abdi, A. Hajkazemi, Ata Turk, Orran Krieger, Peter Desnoyers. 2019. "Caching in the Multiverse." 11th USENIX Conference on Hot Topics in Storage and File Systems.

Abstract

To get good performance for data stored in object storage services like S3, data analysis clusters need to cache data locally. Recently these caches have started taking into account higher-level information from analysis frameworks, allowing prefetching based on predictions of future data accesses. There is, however, a broader opportunity: rather than using this information to predict one future, we can use it to select a future that is best for caching. This paper provides preliminary evidence that we can exploit the directed acyclic graph (DAG) of inter-task dependencies used by data-parallel frameworks such as Spark, PIG, and Hive to improve application performance by optimizing caching for the critical path through the DAG for the application. We present experimental results for PIG running TPC-H queries, showing completion time improvements of up to 23% vs. our implementation of MRD, a state-of-the-art DAG-based prefetching system, and improvements of up to 2.5x vs. LRU caching. We then discuss the broader opportunity for building a system based on this approach.
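
The critical-path idea in the abstract can be illustrated with a small sketch. This is not the authors' system or algorithm; it only shows, under stated assumptions, why caching the input of a task on the critical path can shorten completion time while caching an off-path input may not. The task names, durations, and the `critical_path` helper are all hypothetical, and the model assumes unlimited parallelism so that completion time equals the longest path through the DAG.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps):
    """Return (length, path) of the longest path through a task DAG.

    durations: task name -> estimated run time (hypothetical units)
    deps:      task name -> set of tasks it depends on
    """
    finish = {}  # earliest finish time of each task
    prev = {}    # predecessor on the longest path into each task
    for task in TopologicalSorter(deps).static_order():
        # The dependency that finishes last determines when this task starts.
        best = max(deps.get(task, ()), key=lambda d: finish[d], default=None)
        prev[task] = best
        finish[task] = durations[task] + (finish[best] if best is not None else 0)

    # Walk back from the task that finishes last to recover the path.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return length, path[::-1]


# Hypothetical query plan: two scans feed a join, which feeds an aggregation.
deps = {"scan_a": set(), "scan_b": set(), "join": {"scan_a", "scan_b"}, "agg": {"join"}}
uncached = {"scan_a": 10, "scan_b": 4, "join": 3, "agg": 2}

# Assumed effect of caching: the scan of the cached input gets faster.
cache_b = {**uncached, "scan_b": 1}  # cache the off-path input
cache_a = {**uncached, "scan_a": 3}  # cache the critical-path input

baseline = critical_path(uncached, deps)
with_b = critical_path(cache_b, deps)
with_a = critical_path(cache_a, deps)
print(baseline, with_b, with_a)
```

In this toy plan, caching the input of `scan_b` leaves the completion time unchanged because `scan_a` still dominates, whereas caching the input of `scan_a` shortens the run and shifts the critical path to `scan_b`. A cache manager built on this observation would need to re-evaluate the path after each caching decision.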

Collections

BU Open Access Articles
ENG: Electrical and Computer Engineering: Scholarly Papers