Overview for analytics

Lessons Learned from the Migration to Apache Airflow - Radek Maciaszek, Skimlinks*

Radek Maciaszek presents lessons learned from migrating machine learning and big data processing pipelines to Apache Airflow.

He highlights how they use Airflow to power their company's big data infrastructure, where they analyze hundreds of terabytes of data. Examples cover building the ETL pipeline and using Airflow to manage the machine learning Spark pipeline workflow.

The talk covers basic Airflow concepts and shows real-life examples of how to define your own workflows in Python code. It finishes with more advanced topics, such as adding custom task operators, sensors and plugins, as well as best practices and the pros and cons of the tool.


The Benefits of Running Spark on your own Docker

In this talk, Shir Bromberg, a Big Data team leader at Yotpo, discusses their open-source Docker images for running Spark on Nomad servers. She highlights the following:
* The issues they had running Spark on managed clusters and the solutions they developed.
* How to build a Spark Docker image.
* What can be achieved by running Spark on Nomad.


Optimizing Spark-based data pipelines - are you up for it?

Nielsen Marketing Cloud needs to ingest billions of events per day into its big data stores for real-time analytics. Etti Gur, Senior Big Data Developer, and Itai Yaffe, Tech Lead of the Big Data group, discuss how they significantly optimized a Spark-based in-flight analytics daily pipeline, reducing its total execution time from over 20 hours to 2 hours and achieving a large cost reduction.


Apache Beam meetup 7 at Datatonic: Beam at Lyft + datalake using Beam + schemas

See how Lyft and Datatonic are using Apache Flink, Apache Beam and Python in stream processing, machine learning and analytics.



The Latest in Apache Hive, Spark, Druid and Impala

See how Hortonworks and Cloudera are using the latest in Apache Hive, Spark, Druid and Impala in data warehousing, analytics and recommendations.