Goodbye Hadoop: Building a streaming data processing pipeline on Google Cloud
In a Google Cloud Blog post, Qubit Product Manager Alex Olivier and Platform Engineer Saira Hussain share how having a fully managed data processing platform on Google Cloud has been integral to Qubit’s success.
Ingesting data at scale in real-time
At Qubit, we firmly believe that the most effective way to increase customer loyalty and lifetime value is through personalization. Our customers use our data activation technology to personalize customer experiences and to help them get closer to their customers by using real-time data that we collect from browsers and mobile devices.
Collecting e-commerce clickstream data to the tune of hundreds of thousands of events per second results in absolutely massive datasets.
Traditional systems can take days to deliver insight from data at this scale. Relatively modern tools like Hadoop can help reduce this time tremendously— from days to hours. But to really get the most out of this data, and generate for example real-time recommendations, we needed to be able to get live insights as the data comes in. That means scaling the underlying compute, storage, and processing infrastructure quickly and transparently. We’ll walk you through how building a data collection pipeline using serverless architecture on Google Cloud let us do just that.
Our data collection and processing infrastructure is built entirely on Google Cloud Platform (GCP) managed services (Cloud Dataflow, PubSub, and BigQuery). It streams, processes, and stores more than 120,000 events per second (during peak traffic) in BigQuery, with a very low end-to-end latency (sub-second). We then make that data, often hundreds of terabytes per day, available and actionable through an app that plugs into our ingestion infrastructure—all of this without ever provisioning a single VM.
In our early days, we built most of our data processing infrastructure ourselves, from the ground up. It was split across two data centers with 300 machines deployed in each. We collected and stored the data in storage buckets, dumped it into Hadoop clusters and then waited for massive MapReduce jobs on the data to finish. This meant spinning up hundreds of machines overnight, which was a problem because many of our customers expected near real-time access in order to power experiences to their in-session visitors. And then there’s the sheer scale of our operation: at the end of two years, we had stored 4PB of data.
Auto-scaling was pretty uncommon back then. And provisioning and scaling machines and capacity is tricky, to say the least, eating up valuable man-hours and causing a lot of pain. Scaling up to handle an influx of traffic, for instance during a big game or Black Friday, was an ordeal. Requesting additional machines for our pool had to be done at least two months in advance and it took another two weeks for our infrastructure team to provision them. Once the peak traffic period was over, the scaled up machines sat idle, incurring costs. A team of four full-time engineers needed to be on-hand to keep the system operational.
In a second phase, we switched to stream processing on Apache Storm to overcome the latency inherent in batch processing. We started running Hive jobs on our datasets so that they could be accessible to our analytics layer and integrated business intelligence tools like Tableau and Looker.
With the new streaming pipeline, it now took hours instead of days for the data to become available for analytics. But while this was an incredible improvement, it still wasn’t good enough to drive real-time personalization solutions like recommendations and social proof. This launched our venture into GCP.
Read the full article on the Google Cloud blog here.