Overview for hive

MetaConfig driven FeatureStore with Feature compute & Serving Platform powering Machine Learning @MakeMyTrip

MakeMyTrip, India’s #1 online travel platform with more than 70% of its traffic coming from mobile apps, embarked on a journey to revolutionize its customer experience by building a scalable, personalized, machine-learning-based platform. The platform powers onboarding, in-funnel and post-funnel engagement flows such as ranking, dynamic pricing, persuasions, cross-sell and propensity models.


Apache Spark Based Reliable Data Ingestion in Datalake

In this talk, Gagan Agrawal from Paytm explains how they leveraged Spark's DataFrame abstraction to create a generic ingestion platform capable of ingesting data from varied sources with reliability, consistency, automatic schema evolution and support for transformations. He highlights how they developed Spark-based data sanity checks as one of the core components of this platform, to ensure 100% correctness of ingested data and automatic recovery when inconsistencies are found. The talk also covers how Hive table creation and schema modification were built into the platform, and how it provided read-time consistency without locking while Spark ingestion jobs were writing to the same Hive tables. Finally, it describes how Paytm maintained different versions of ingested data, both to roll back if required and to let users of the data go back in time and read a snapshot as it existed at a given moment.


Large Scale Feature Aggregation Using Apache Spark

In this presentation, Pulkit Bhanot and Amit Nene from Uber discuss how they used Spark and data stored in Hive to develop a highly scalable solution for incremental feature aggregation. By dividing the aggregation responsibility between the real-time access layer and the batch computation components, they ensured that only entities whose values actually changed are dispersed to the real-time access store (Cassandra). They share how they modeled data in Cassandra using its native capabilities, such as counters, and how they worked around some of Cassandra's limitations.


Productionizing Behavioural Features for Machine Learning with Apache Spark Streaming

Learn how Spark Streaming is used to build online machine learning (ML) features for real-time prediction of users' behaviour and preferences and of demand for hotels, and to improve customer support processes.


Lyft's analytics pipeline: From Redshift to Apache Hive and Presto

Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world’s largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency.