I work on Daft and we’ve been collaborating with the team at Amazon to make this...

thedood · on Aug 1, 2024

Good to see you here! It's been great working with Daft to further improve data processing on Ray, and the early results of incorporating Daft into the compactor have been very impressive. Also agree with the overall sentiment here that Ray clusters should be able to run best-in-class ETL without requiring a separate cluster maintained by another framework (Spark or otherwise). This also creates an opportunity to avoid many inefficient, high-latency cross-cluster data exchange ops often run out of necessity today (e.g., through an intermediate cloud storage layer like S3).