Overview for visualization

Entity Linking @ Scale Using Elasticsearch

This talk is about the use of Elasticsearch as a scalable entity linking/deduplication tool at Messagepoint Inc. Atif Khan, the Vice President, AI & Data Science at Messagepoint, presents the high-level architecture and design of such a system and reviews its application in two major use cases: data deduplication and attribute-based link discovery.


Kafka and Kafka Streams in the Global Schibsted Data Platform

In this discussion you will learn how Schibsted, an international media group, has set up a new global streaming data platform using Kafka and Kafka Streams, replacing a homegrown solution based on Kinesis and micro-batches in Amazon S3.


Zipline: Airbnb’s Machine Learning Data Management Platform

Zipline is Airbnb’s data management platform, designed specifically for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time collecting data and writing transformations for machine learning tasks. Zipline reduces this work from months to days. Users define features in an easy-to-use configuration language, and Zipline then provides the following: resource-efficient, point-in-time-correct training set backfills and scheduled updates; feature visualizations and automatic data quality monitoring; feature availability in online scoring environments, both batch and streaming with batch correction (lambda architecture); collaboration and sharing of features; and data ownership and management.

Spark powers many of Zipline’s features, especially offline tasks for efficient training set backfills and feature computation. This discussion covers Zipline’s architecture and the main problems that Zipline solves. Although these problems are widespread, there is no open source software that addresses them.


VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger Menezes

In this discussion you will learn about techniques for visualizing large-scale machine learning systems in Spark. Netflix employs these techniques to understand and refine the machine learning models behind its famous recommender systems, which personalize the Netflix experience for its 99 million members around the world.


Big data insights equal big money: Stories from the trenches at GoDaddy

In this presentation you will learn how GoDaddy collects and uses data from business units such as hosting and domains, as well as from network and hardware events across its fleet of servers and network devices, using a wide range of technologies, including Kafka, Spark, Hadoop, and Elasticsearch.


The Netflix data platform: Now and in the future

The Netflix data platform is constantly evolving, but at its core, it's an all-cloud platform at a massive scale (60+ PB and over 700 billion new events per day), focused on enabling developers. In this talk, Kurt Brown dives into the current (data) technology landscape at Netflix, as well as what's in the works. He covers key technologies, such as Spark, Presto, Docker, and Jupyter, along with many broader data ecosystem facets (metadata, insights into jobs run, visualizing big data, etc.). Beyond just tech, he also dives a bit into their data platform philosophy. You'll learn how things work at Netflix, along with some ideas for re-envisioning your data platform.