Overview for hadoop

Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Love to Scale

Luca Canali, an engineer and team lead in the Hadoop, Spark and database services group at CERN, shares the experience and lessons learned from setting up and running the Apache Spark service inside the database group at CERN. He covers the many aspects of this change, with examples taken from use cases and projects across the CERN Hadoop, Spark, streaming and database services.


Pinterest’s Story of Streaming Hundreds of Terabytes of Pins from MySQL to S3/Hadoop Continuously

This talk discusses how Pinterest designed and built a continuous database (DB) ingestion system that moves MySQL data into near-real-time computation pipelines with only 15 minutes of latency, supporting their dynamic personalized recommendations and search indices. As Pinterest moves towards real-time computation, it faces stringent service-level agreement requirements, such as making MySQL data available on S3/Hadoop within 15 minutes and serving the DB data incrementally in stream processing. The data team designed WaterMill, a continuous DB ingestion system that listens for MySQL binlog changes, publishes the MySQL changelogs as an Apache Kafka® change stream, and ingests and compacts the stream into Parquet columnar tables in S3/Hadoop within 15 minutes.
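The core of a changelog-to-table pipeline like the one described is the compaction step: folding a stream of row-level change events into the latest snapshot per primary key before writing it out as columnar files. A minimal sketch of that idea, with illustrative event names and a hypothetical `pins` table (not Pinterest's actual WaterMill API):

```python
# Sketch of last-write-wins changelog compaction, the kind of step a
# binlog -> Kafka -> Parquet pipeline performs before writing a table.
# Event shape and field names here are assumptions for illustration.

def compact(events):
    """Fold insert/update/delete change events into a snapshot keyed by pk."""
    snapshot = {}
    for ev in events:
        if ev["op"] in ("insert", "update"):
            snapshot[ev["pk"]] = ev["row"]      # last write wins
        elif ev["op"] == "delete":
            snapshot.pop(ev["pk"], None)        # tombstone removes the row
    return snapshot

# Example change stream for a hypothetical `pins` table:
events = [
    {"op": "insert", "pk": 1, "row": {"pin_id": 1, "board": "travel"}},
    {"op": "insert", "pk": 2, "row": {"pin_id": 2, "board": "food"}},
    {"op": "update", "pk": 1, "row": {"pin_id": 1, "board": "hiking"}},
    {"op": "delete", "pk": 2, "row": None},
]
print(compact(events))  # {1: {'pin_id': 1, 'board': 'hiking'}}
```

In a real deployment the events would come from a Kafka change stream and the compacted snapshot would be serialized to Parquet on S3, but the fold itself is this simple.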


Real-Time Detection of Anomalies in the Database Infrastructure using Apache Spark

Learn how CERN, the biggest physics laboratory in the world, stores and processes the large volumes of data generated every hour using scalable systems such as Hadoop, Spark and HBase.