Overview for science

Experience of Running Spark on Kubernetes on OpenStack for High Energy Physics Workloads

Physicists at CERN use Spark to process large physics datasets in a distributed fashion, with the aim of reducing time-to-physics through increased interactivity. In this talk, Prasanth Kothuri and Piotr Mrowczynski, Big Data Engineers at CERN, focus on the design choices made and the challenges faced while developing Spark-as-a-service over Kubernetes on OpenStack to simplify provisioning, automate management, and minimize the operating burden of managing Spark clusters.


Stateful Structure Streaming and Markov Chains Join Forces to Monitor the Biggest Storage of Physics Data

Learn how CERN, the biggest physics laboratory in the world, processes and stores the large volumes of data generated every hour. The storage group, which holds more than 200 petabytes, is an essential player in helping the organisation overcome this great challenge. ExDeMon, an open-source metrics monitor whose stateful processing is implemented with Spark Structured Streaming, plays a key role by applying machine learning techniques to collected logs and metrics. One of the machine learning techniques CERN aims to apply is Markov chains, a statistical model introduced by Andrey Markov in the early 20th century.


CERN’s Next Generation Data Analysis Platform with Apache Spark

Learn how CERN uses Spark to process large physics datasets in a distributed fashion. The most widely used tool for high-energy physics analysis, ROOT, implements a layer on top of Spark in order to distribute computations across a cluster of machines. This makes it possible for physics analysis written in either C++ or Python to be parallelised on Spark clusters, while reading the input data from CERN’s mass storage system.


A Cloud-based Architecture for Processing 3D Mars Terrain

Learn how NASA uses OnSight, built on a cloud-based architecture for processing 3D Mars terrain, to explore that terrain through the power of virtual reality.


Talk Data to Me: Sparking Insights at Elsevier

Emlyn, a principal developer at Elsevier, explains how Elsevier’s big data team uses Spark and Databricks to facilitate massive dataset workflows.


Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs

A discussion of Spark SQL performance investigations and performance troubleshooting from CERN database services.