Radek Maciaszek presents lessons learned from migrating machine learning and big data processing pipelines to Apache Airflow.
He highlights how his team uses Airflow to power their company's big data infrastructure, where they analyze hundreds of terabytes of data. Examples cover building an ETL pipeline and using Airflow to manage a machine learning Spark pipeline workflow.
The talk covers the basic Airflow concepts and shows real-life examples of how to define your own workflows in Python code. It finishes with more advanced Apache Airflow topics, such as adding custom task operators, sensors and plugins, as well as best practices and the pros and cons of the tool.
This talk is about the features of FoundationDB and the Record Layer that help build CloudKit, Apple’s cloud storage system for structured data. Scott Gray, a software engineer at Apple, focuses on three key areas where FoundationDB and the Record Layer have unlocked large benefits for CloudKit. First, FoundationDB’s arbitrary multi-key ACID transactions have allowed the team to implement advanced secondary indexing at scale, including their recently developed transactional full-text search system.
eBay has built a GraphDB cloud service called NuGraph, which is based on JanusGraph. FoundationDB was chosen as JanusGraph’s storage plugin because of its high performance and distributed transaction support. Jun Li and Hieu Nguyen from eBay present the GraphDB architecture and focus on how they deploy and manage FoundationDB in Kubernetes, how they improve JanusGraph query performance in a cross-data center environment, how they bulk load the graph into FoundationDB with its transactional support, and how they secure the 3-tier cloud service given the limited security support in FoundationDB.
In this talk, Anoop Koloth and Hanzhang Wang from eBay present how they built a monitoring system and leveraged data generated from Envoy clusters:
(1) Processing billions of hits served from different platforms worldwide in real time.
(2) Key Performance Indicators from the Envoy ecosystem.
(3) An effective ML solution for proactive monitoring of diverse eBay systems.
(4) Graph-based modeling and algorithms to deal with system complexity.
(5) Symbiosis with and enhancement of the existing SRE solution.
In this talk, Shir Bromberg, a Big Data team leader at Yotpo, discusses their open-source Docker images for running Spark on Nomad servers. She highlights the following:
* The issues they had running Spark on managed clusters and the solutions they developed.
* How to build a Spark Docker image.
* What can be achieved by running Spark on Nomad.
Nielsen Marketing Cloud needs to ingest billions of events per day into its big data stores for real-time analytics. Etti Gur, Senior Big Data Developer, and Itai Yaffe, Tech Lead of the Big Data group, discuss how they significantly optimized their Spark-based in-flight analytics daily pipeline, reducing its total execution time from over 20 hours to 2 hours and achieving a large cost reduction.