Radek Maciaszek presents his learnings from the migration of machine learning and big data processing pipelines to Apache Airflow.
He highlights how they use Airflow to power their company big data infrastructure where they analyze hundreds of terabytes of data. Examples will cover the building of the ETL pipeline and use of Airflow to manage the machine learning Spark pipeline workflow.
The talk covers the basic Airflow concepts and show real-life examples of how to define your own workflows in the Python code. It finishes with more advanced topics related to Apache Airflow, such as adding custom task operators, sensors and plugins as well as best practices and both the pros and cons of this tool.
In this talk Shir Bromberg a Big Data team leader at Yotpo,discusses their open-source dockers for running Spark on Nomad servers. She highlights the following;
* The issues they had running spark on managed clusters and the solutions developed.
* How to build a spark docker.
* What to achieve by using Spark on Nomad.
Nielsen Marketing Cloud needs to ingest billions of events per day into their big data stores for their real time analytics. Etti Gur the Senior Big Data developer and Itai Yaffe Tech Lead, Big Data group discuss how they significantly optimized Spark-based in-flight analytics daily pipeline, reducing its total execution time from over 20 hours down to 2 hours, resulting in a huge cost reduction.
See how Lyft and Datatonic are using Apache Flink, Apache beam and python in stream processing, machine learning and analytics.
See how Hortonworks and Cloudera is using the latest in Apache Hive, Spark, Druid and Impala in data warehousing, analytics and recommendations.