Today, our immediate goal for DeltaCAT is to ensure that the compactor, and similar Ray-based procedures, can run on Apache Iceberg. So, if you're an Iceberg user relying on procedures like Spark's "rewrite_data_files" and/or "rewrite_position_delete_files" to compact your datasets today, then DeltaCAT will let you easily run similar compaction procedures on Ray to realize comparable efficiency and scale improvements (even if it winds up delegating some of the work to other projects like PyIceberg, Daft, etc. along the way).
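For reference, here's what that looks like from Spark today, followed by a sketch of the Ray-based equivalent. The `deltacat.compact()` call and its parameters are hypothetical illustrations of the intended experience, not a confirmed DeltaCAT API:

```python
from pyspark.sql import SparkSession

# Today: compacting an Iceberg table from Spark (assumes the Iceberg
# Spark runtime and SQL extensions are configured on the session).
spark = SparkSession.builder.getOrCreate()
spark.sql("CALL my_catalog.system.rewrite_data_files(table => 'db.events')")
spark.sql("CALL my_catalog.system.rewrite_position_delete_files(table => 'db.events')")
```

```python
import ray
import deltacat

# Tomorrow (sketch): the same compaction driven by Ray through DeltaCAT.
# `compact()` and its arguments are hypothetical, shown only to convey
# the intended shape of the workflow.
ray.init()
deltacat.compact(table="db.events", catalog="my_iceberg_catalog")
```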
Going forward, we'd also like DeltaCAT to provide better general-purpose abstractions (e.g., for reading, writing, and altering large datasets) that simplify writing Python Ray apps that work across (1) different catalog formats like Iceberg, Hudi, and Delta, and (2) different distributed data processing frameworks like Ray Data, Daft, Modin, etc.
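A rough sketch of what those abstractions might feel like to use; every function and method name below is a hypothetical illustration, not an existing DeltaCAT API:

```python
import deltacat as dc  # all calls below are hypothetical illustrations

# Read the same table regardless of whether it's cataloged as Iceberg,
# Hudi, or Delta...
dataset = dc.read_table("db.events", catalog="my_catalog")

# ...then hand it to whichever distributed framework you prefer.
ray_ds = dataset.to_ray_data()  # hypothetical Ray Data conversion
daft_df = dataset.to_daft()     # hypothetical Daft conversion

# Write results back through the same catalog-agnostic interface.
dc.write_table(daft_df, "db.events_cleaned", catalog="my_catalog")
```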
From the perspective of an internal DeltaCAT developer, another goal is simply to reduce the maintenance burden and dev hours required to write something like a compactor that works across multiple catalogs (i.e., by ensuring that every interface such a procedure depends on can be readily implemented for multiple catalog formats like Iceberg, Hudi, Delta, etc.).
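Concretely, that means a procedure like the compactor programs against a small set of interfaces, and each catalog format implements those interfaces once. A minimal sketch of the idea, with hypothetical class and method names:

```python
from abc import ABC, abstractmethod
from typing import List


class CatalogTable(ABC):
    """Hypothetical minimal interface a compactor could target; Iceberg,
    Hudi, Delta, etc. would each provide one implementation."""

    @abstractmethod
    def list_data_files(self, partition: str) -> List[str]:
        """Return the data file paths belonging to one partition."""

    @abstractmethod
    def commit_rewrite(self, partition: str, new_files: List[str]) -> None:
        """Atomically swap a partition's files for the compacted output."""


def merge_small_files(files: List[str]) -> List[str]:
    """Placeholder for the actual Ray-based rewrite of small files."""
    return files


def compact_partition(table: CatalogTable, partition: str) -> None:
    # The compactor only sees the interface, so the same code runs
    # unchanged against any catalog format that implements CatalogTable.
    new_files = merge_small_files(table.list_data_files(partition))
    table.commit_rewrite(partition, new_files)
```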