eBay has used an Analytical DBMS (ADBMS) data warehouse solution for over a decade. Millions of batch queries run every day against 6000+ key DW tables, which contain over 22PB of compressed data and keep growing every year. The data services and products built on this warehouse enable eBay business decisions and site features, so it has to be always available and accurate.
Lipeng Zhu and Edward Zhang from eBay discuss how eBay has been migrating its ADBMS batch workload to Spark.
OLX produces about 50 million messages daily, delivered to 300+ million users across the globe via email, SMS or push. The majority of these notifications rely on processing the billions of events generated by OLX's web and mobile platforms to understand users' behaviour and to craft relevant messages designed to influence the customer journey positively.
In this presentation Emanuele Bardelli discusses the approach, challenges and learnings of migrating OLX's notification platform from a monolithic, batch system based on AWS Redshift, SQL and ETL pipelines to a micro-service, real-time system developed with Apache Spark and Python.
Learn how Jet’s Kafka clusters squeeze every drop of disk performance out of Azure, all completely transparent to Kafka.
-Striping cheap disks to maximize instance IOPS
-Block compression to reduce disk usage by ~80% (JSON data)
-Instance SSD as the secondary read cache (storing compressed data), eliminating >99% of disk reads and safe across host redeployments
-Upcoming features: Compressed blocks in memory, potentially quadrupling your page cache (RAM) for free
These are the areas covered:
-Adapting ZFS for cloud instances (gotchas)
-Performance tuning for Kafka
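To put the compression figures above in perspective, here is a minimal arithmetic sketch. It is not taken from the talk: the function names are hypothetical, and it assumes a uniform size reduction across all data, which real workloads will not match exactly.

```python
def effective_capacity(raw_capacity: float, reduction: float) -> float:
    """Logical data that fits in `raw_capacity` when compression
    removes a fraction `reduction` of the original size (assumed uniform)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return raw_capacity / (1.0 - reduction)


def reduction_for_multiplier(multiplier: float) -> float:
    """Size reduction needed for a cache to hold `multiplier` times
    as much logical data as its raw size."""
    if multiplier < 1.0:
        raise ValueError("multiplier must be >= 1")
    return 1.0 - 1.0 / multiplier


# An ~80% reduction means each raw terabyte holds roughly 5 TB of JSON.
print(round(effective_capacity(1.0, 0.80), 2))

# Quadrupling the page cache corresponds to a 75% size reduction in memory.
print(reduction_for_multiplier(4))
```

Under these assumptions, the "quadrupling" claim for in-memory compressed blocks needs a slightly lower compression level than the ~80% reported on disk.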
In this talk, Thanh Tran, a Director of Data Science at Upwork, presents their modernization efforts in moving towards
1) a holistic data processing infrastructure for batch and stream data processing using S3, Kinesis, Spark and Spark Structured Streaming,
2) model development using Spark MLlib and other ML libraries for Spark,
3) model serving using Databricks Model Scoring, Scoring over Structured Streams and microservices, and
4) orchestration and streamlining of all these processes using Apache Airflow and a CI/CD workflow customized to their Data Science product engineering needs.
Learn how eBay moved its ETL computation from a conventional RDBMS environment over to Spark. This journey led to a 1000+ node Spark cluster running 10,000+ ETL jobs daily, all built in less than 6 months by a team with limited Spark experience.
Apache Spark is quickly being adopted at Facebook and now powers an important portion of Facebook’s batch ETL workload. While Spark is typically more efficient than Hive, Facebook continues to search for opportunities to further reduce hardware costs. Recently, they started an effort to apply custom resource oversubscription for every unique Spark job.