Overview for spark

The Benefits of Running Spark on your own Docker

In this talk, Shir Bromberg, a Big Data team leader at Yotpo, discusses their open-source Docker images for running Spark on Nomad servers. She highlights the following:
* The issues they had running Spark on managed clusters and the solutions they developed.
* How to build a Spark Docker image.
* What can be achieved by running Spark on Nomad.


How to performance-tune Spark applications in large clusters

Omkar Joshi, a senior software engineer at Uber, discusses a new Spark ingestion system known as Marmaray. The system is designed to ingest billions of Kafka messages at 30-minute intervals.


Analyzing Movie Reviews using DataStax

In this talk, Amanda Moran, Technical Evangelist at DataStax, uses sentiment analysis on Twitter data about the latest movie titles to answer that age-old question: “Is that movie any good?” She explains how they built the solution using Apache Cassandra, Apache Spark and DataStax Enterprise Analytics.


Migrating from RDBMS Data Warehouses to Apache Spark

Learn how DBS Bank implemented a Spark-based application that helps with the migration from a traditional RDBMS to Big Data. The application embeds the Spark engine and offers a web UI that lets users create, run, test and deploy jobs interactively. Jobs are primarily written in native SparkSQL, or other flavours of SQL (i.e. TDSQL).


Social Media Influencers Detection, Analysis and Recommendation

Learn how Socialbakers used Databricks for innovative research and large-scale data engineering, including ML, and the challenges they faced while deploying Apache Spark from scratch and onboarding teams to their new platform.


Apache Spark Based Reliable Data Ingestion in Datalake

In this talk, Gagan Agrawal from Paytm explains how they used Spark's DataFrame abstraction to create a generic ingestion platform capable of ingesting data from varied sources, with reliability, consistency, automatic schema evolution and support for transformations. He highlights how they developed Spark-based data sanity checking as one of the core components of the platform to ensure 100% correctness of ingested data, with auto-recovery when inconsistencies are found. The talk also covers how Hive table creation and schema modification were made part of the platform, providing read-time consistency without locking while Spark ingestion jobs were writing to the same Hive tables. Finally, it describes how Paytm maintained different versions of ingested data, both to roll back if required and to let users go back in time and read a snapshot of the data as it was at a given moment.