An experimental study of memory management in Rust programming for big data processing
Well-planned memory management is critical for Big Data analysis tools to achieve fast runtimes and efficient use of computational resources. Modern Big Data analysis tools use application languages that abstract away memory management so that developers do not have to pay close attention to memory management strategies. Many modern cloud-based data processing systems, such as Hadoop, Spark, and Flink, run on the Java Virtual Machine (JVM) and rely on its automated memory management, including Garbage Collection (GC), which may introduce significant overhead. Dataflow-based systems like Spark allow programmers to define complex objects in a host language like Java to manipulate and transfer tremendous amounts of data. Systems languages like C++ or Rust seem to be a better choice for developing Big Data processing systems because they do not rely on the JVM. By using a systems language, a developer has full control over memory management. We found the Rust programming language to be a good candidate because its concepts of memory ownership and borrowing make it possible to write memory-safe, "fearlessly concurrent" code. Rust offers several strategies for optimizing memory management in Big Data processing, including the choice among different variable types, the use of Reference Counting, and multithreading with Atomic Reference Counting. In this thesis, we conducted an experimental study to assess how much these memory management strategies differ in overall runtime performance. Our experiments focus on complex object manipulation and common Big Data processing patterns under various memory management strategies. Our experimental results indicate a significant difference among these memory management strategies in data processing performance.
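The three strategies named above (plain borrowing, Reference Counting, and Atomic Reference Counting across threads) can be illustrated with a minimal sketch. It is not the benchmark code used in the thesis: the `Record` type, the `sample_records` helper, and the three `total_tags_*` functions are hypothetical names introduced only for illustration, and the workload (counting tags) is an assumed stand-in for complex object manipulation.

```rust
use std::rc::Rc;
use std::sync::Arc;
use std::thread;

/// Hypothetical "complex object": an id plus a heap-allocated list of strings.
#[derive(Debug, Clone)]
struct Record {
    id: u64,
    tags: Vec<String>,
}

/// Hypothetical helper: builds `n` records, each with 1 to 3 tags.
fn sample_records(n: u64) -> Vec<Record> {
    (0..n)
        .map(|id| Record {
            id,
            tags: vec![format!("tag-{}", id); (id % 3) as usize + 1],
        })
        .collect()
}

/// Borrowing: reads the records through a shared reference, no copy, no counter.
fn total_tags_borrowed(records: &[Record]) -> usize {
    records.iter().map(|r| r.tags.len()).sum()
}

/// Rc: several owners share each record on a single thread;
/// cloning an `Rc` only increments a non-atomic counter.
fn total_tags_rc(records: &[Rc<Record>]) -> usize {
    records.iter().map(|r| r.tags.len()).sum()
}

/// Arc: worker threads share one vector; each `Arc::clone` is an atomic increment.
fn total_tags_arc(records: Arc<Vec<Record>>, workers: usize) -> usize {
    let workers = workers.max(1);
    let len = records.len();
    // Ceiling division so every record falls into exactly one chunk.
    let chunk = ((len + workers - 1) / workers).max(1);
    let handles: Vec<thread::JoinHandle<usize>> = (0..workers)
        .map(|w| {
            let shared = Arc::clone(&records);
            thread::spawn(move || {
                let start = (w * chunk).min(len);
                let end = ((w + 1) * chunk).min(len);
                shared[start..end].iter().map(|r| r.tags.len()).sum::<usize>()
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .sum()
}

fn main() {
    let records = sample_records(6);
    let borrowed = total_tags_borrowed(&records);

    // Deep-copy each record once into an Rc, then share the Rcs cheaply.
    let rc_records: Vec<Rc<Record>> = records.iter().cloned().map(Rc::new).collect();
    let second_owner = rc_records.clone(); // increments counts, copies no records
    let via_rc = total_tags_rc(&second_owner);

    let first_id = records[0].id;
    let via_arc = total_tags_arc(Arc::new(records), 3);
    println!(
        "first id = {}, borrowed = {}, rc = {}, arc = {}",
        first_id, borrowed, via_rc, via_arc
    );
}
```

All three functions compute the same total; what differs is the bookkeeping each strategy performs (none for borrowing, a plain counter for `Rc`, an atomic counter for `Arc`), which is the kind of cost an experimental comparison of this sort would measure.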