Alluxio - Data Orchestration for Analytics and AI in the Cloud

SC19 Proceedings

Workshop: Alluxio - Data Orchestration for Analytics and AI in the Cloud

Abstract: The data ecosystem has evolved substantially over the past two decades. There has been an explosion of data-driven frameworks, including Presto, Hive, Spark, and MapReduce for running data analytics and ETL queries, as well as TensorFlow and PyTorch for training and serving models. On the data side, the approach to managing and storing data has evolved from HDFS to cheaper, more scalable, and separated services typified by cloud object stores like AWS S3. As a result, data engineering has become increasingly complex, inefficient, and hard, particularly in hybrid and cloud environments.

Alluxio is open source software designed to address these challenges. Alluxio, born from UC Berkeley AMPLab, is a data orchestration system that provides a unified data access and caching layer for single-cloud, hybrid, and multi-cloud deployments. Alluxio enables distributed compute engines like Presto, Hive, or TensorFlow to transparently access data from various storage systems (including S3, HDFS, and Azure) while actively leveraging an in-memory cache to accelerate data access. The Alluxio community has 1000+ open source contributors, and the software is used by 100+ companies worldwide, with the largest production deployment exceeding 1000 nodes.

In this talk, we will present:
- New trends and challenges in the data ecosystem in the cloud era
- Key innovations of the Alluxio project
- Production use cases of popular stacks like {Presto, Spark, Flink, TensorFlow}/Alluxio/{S3, HDFS}

Back to International Parallel Data Systems Workshop (PDSW) Archive Listing