
Analytical DBMS to Apache Spark Auto Migration Framework

eBay has been using an Analytical DBMS (ADBMS) data warehouse solution for over a decade. Millions of batch queries run every day against 6,000+ key DW tables containing over 22 PB of compressed data, a volume that keeps growing every year. The data services and products built on this platform enable eBay's business decisions and site features, so it has to be always available and accurate.

Lipeng Zhu and Edward Zhang from eBay discuss how eBay has been working on migrating this ADBMS batch workload to Spark.

Links


From “All-at-Once, Once-a-Day” to “A-Little-Each-Time, All-the-Time”

OLX produces about 50 million messages daily to be delivered to 300+ million users across the globe via email, SMS, or push. The majority of these notifications rely on processing the billions of events generated by their web and mobile platforms to understand users' behaviour and craft relevant messages designed to influence the customer journey positively.

In this presentation, Emanuele Bardelli discusses the approach, challenges, and learnings of migrating OLX's notification platform from a monolithic batch system based on AWS Redshift, SQL, and ETL pipelines to a micro-service, real-time system developed with Apache Spark and Python.

Link


Kafka on ZFS: Better Living Through Filesystems

Learn how Jet's Kafka clusters squeeze every drop of disk performance out of Azure, all completely transparently to Kafka:
-Striping cheap disks to maximize instance IOPS 
-Block compression to reduce disk usage by ~80% (JSON data) 
-Instance SSD as the secondary read cache (storing compressed data), eliminating >99% of disk reads and safe across host redeployments 
-Upcoming features: Compressed blocks in memory, potentially quadrupling your page cache (RAM) for free 

The talk covers these areas:
-Basic Principles 
-Adapting ZFS for cloud instances (gotchas) 
-Performance tuning for Kafka 
-Benchmarks
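
The striping, compression, and SSD-cache ideas above map onto standard ZFS commands. Here is a minimal sketch; the device names, pool/dataset names, and property values are illustrative assumptions, not details taken from the talk:

```shell
# Stripe four cheap data disks into a single pool to aggregate per-disk IOPS
# (device names are hypothetical)
zpool create kafkapool /dev/sdc /dev/sdd /dev/sde /dev/sdf

# Attach the ephemeral instance SSD as a secondary read cache (L2ARC).
# The L2ARC holds compressed blocks and is safe to lose on host redeployment,
# since it is only a cache of data already on the pool.
zpool add kafkapool cache /dev/nvme0n1

# Create a dataset for Kafka log segments with block compression enabled;
# lz4 is cheap on CPU and compresses JSON-heavy topics well
zfs create -o compression=lz4 -o atime=off kafkapool/kafka
```

Because compression happens at the filesystem block layer, Kafka itself needs no configuration changes beyond pointing `log.dirs` at the new dataset's mountpoint.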

Links


How to Rebuild an End-to-End ML Pipeline with Databricks and Upwork

In this talk, Thanh Tran, Director of Data Science at Upwork, presents their modernization efforts in moving towards:

1) holistic data processing infrastructure for batch and stream data processing using S3, Kinesis, Spark and Spark Structured Streaming

2) model development using Spark MLlib and other ML libraries for Spark

3) model serving using Databricks Model Scoring, Scoring over Structured Streams and microservices and

4) orchestrating and streamlining all these processes using Apache Airflow and a CI/CD workflow customized to their data science product engineering needs.

Links


Moving eBay’s Data Warehouse Over to Apache Spark – Spark as Core ETL Platform at eBay

Learn how eBay moved their ETL computation from a conventional RDBMS environment over to Spark. This journey led to a 1,000+ node Spark cluster running 10,000+ ETL jobs daily, all done in a span of less than six months by a team with limited Spark experience.

Links


Oversubscribing Apache Spark Resource Usage for Fun and $$$

Apache Spark is quickly being adopted at Facebook and now powers an important portion of Facebook's batch ETL workload. While Spark is typically more efficient than Hive, Facebook continues to search for opportunities to further reduce hardware costs. Recently, they started an effort to apply custom resource oversubscription to every unique Spark job.

Links