Overview for Uber

How to performance-tune Spark applications in large clusters

Omkar Joshi, a senior software engineer at Uber, discusses Marmaray, a new Spark-based ingestion system designed to ingest billions of Kafka messages every 30 minutes.


AthenaX - Unified Stream & Batch Processing using SQL at Uber

Learn how AthenaX, Uber's streaming analytics platform, enables users to run production-quality, large-scale streaming analytics using SQL. This discussion highlights the design and architecture of AthenaX, as well as Uber's production experience.


Petastorm: A Light-Weight Approach to Building ML Pipelines @Uber

Data produced and managed by Big Data systems like Apache Spark and Hive cannot be directly consumed by Deep Learning systems like TensorFlow and PyTorch. Petastorm bridges this gap by enabling direct consumption of data in Apache Parquet format by TensorFlow and PyTorch. In this talk, Yevgeni Litvin, a senior software engineer with the Perception team at Uber Advanced Technology Group (ATG), describes how Petastorm facilitates tighter integration between the Big Data and Deep Learning worlds, simplifies data management and data pipelines, and speeds up model experimentation.


Hudi: Large-Scale, Near Real-Time Pipelines at Uber

In this talk, Nishith Agarwal and Vinoth Chandar, data engineers at Uber, discuss how they leveraged Spark as a general-purpose distributed execution engine to build Hudi, detailing tradeoffs and operational experience. They also highlight how to ingest data into Hudi using Spark Datasource/Streaming APIs and how to build Notebooks/Dashboards on top using Spark SQL.