




Overview for Facebook

Oversubscribing Apache Spark Resource Usage for Fun and $$$

Apache Spark is quickly being adopted at Facebook and now powers an important portion of Facebook’s batch ETL workload. While Spark is typically more efficient than Hive, Facebook continues to search for opportunities to further reduce hardware costs. Recently, they started an effort to apply custom resource oversubscription to every unique Spark job.


Taking Advantage of a Disaggregated Storage and Compute Architecture

In this talk, Brian Cho and Ergin Seyfe, software engineers at Facebook, dive into changes made to Spark to take advantage of Facebook’s disaggregated architecture and lessons learned from scaling out and managing these large production clusters.


Migrating Apache Hive Workload to Apache Spark: Bridge the Gap

Facebook shares an update on their overall migration effort and examples of migration wins. For example, they were able to migrate one of the most complicated workloads at Facebook from Hive to Spark with a performance gain of more than 2.5x.


Natural Language Understanding @ Facebook Scale

At Facebook, text understanding is the key to surfacing content that’s relevant and personalized, plus enabling new experiences like social recommendations and Marketplace suggestions. In this talk, Rushin Shah, an engineering leader at Facebook, introduces DeepText, Facebook’s platform for text understanding, and discusses the various models it supports.


Experiences Migrating Hive Workload to SparkSQL

At Facebook, millions of Hive queries are executed on a daily basis, and the workload contributes to important analytics that drive product decisions and insights. Spark SQL in Apache Spark provides much of the same functionality as the Hive query language (HQL), but more efficiently, and Facebook is building a framework to migrate existing production Hive workloads to Spark SQL with minimal user intervention.


Hive Bucketing in Apache Spark

In this session, Tejas Patil of Facebook explains what bucketing is and how it's implemented in both Hive and Spark. In particular, Patil describes the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook’s performance tests have shown Spark running 3-5x faster when the bucketing optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark has resulted in a 2-3x savings when compared to Hive. You’ll also hear about real-world applications of bucketing, like loading cumulative tables with a daily delta, and the characteristics that help identify candidate jobs that can benefit from bucketing.
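
For readers unfamiliar with the idea, the following is a minimal pure-Python sketch of bucketing in general, not Facebook's or Spark's implementation. Rows are assigned to a fixed number of buckets by hashing the key, so two tables bucketed the same way can be joined one bucket pair at a time without redistributing rows. The bucket count, the CRC32 stand-in hash, the function names, and the sample "cumulative" and "daily delta" rows are all assumptions made for illustration; Hive and Spark each use their own hash functions.

```python
from collections import defaultdict
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for this example


def bucket_id(key, num_buckets=NUM_BUCKETS):
    # Stand-in hash (CRC32 of the key's string form). This is an assumption:
    # real engines use their own hash functions, not CRC32.
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets


def write_bucketed(rows, num_buckets=NUM_BUCKETS):
    """Group (key, value) rows into buckets and sort each bucket by key."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return {b: sorted(bucket_rows) for b, bucket_rows in buckets.items()}


def bucketed_join(left, right):
    """Join two identically bucketed tables, one bucket pair at a time."""
    joined = []
    # Only buckets present on both sides can contain matching keys.
    for b in left.keys() & right.keys():
        right_by_key = defaultdict(list)
        for key, value in right[b]:
            right_by_key[key].append(value)
        for key, left_value in left[b]:
            for right_value in right_by_key.get(key, []):
                joined.append((key, left_value, right_value))
    return sorted(joined)


# Illustrative data: a cumulative table and a daily delta, bucketed the same way.
cumulative = write_bucketed([(1, "a"), (2, "b"), (3, "c")])
daily_delta = write_bucketed([(2, "x"), (3, "y"), (4, "z")])
result = bucketed_join(cumulative, daily_delta)
print(result)
```

Because equal keys always hash to the same bucket, each bucket pair can be joined independently, which is the property that lets an engine skip the shuffle step for tables that are already bucketed on the join key.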