A Story About Managing Tracking Log Files from MAAS with AWS Kinesis Firehose, AWS S3, Apache Spark and AWS EMR.

Ridwan Fajar
Serverless Indonesia
6 min read · Mar 31, 2018


What is AWS EMR?

Somehow, when we gather massive amounts of tracking data from mobile devices, we always assume the log files will stay queryable if we store them in a relational database. For example, we believe the easiest solution for storing tracking data is PostgreSQL or MariaDB. But when my prior team faced that problem, we found that the tracking data collected over 3 months had reached roughly 1 billion rows. Of course that is quite expensive in both resource usage and money, since we had to rent a clustered database service on Amazon Web Services (AWS).

On the other hand, given the enormous size of the tracking log files, we wanted a dataset that could be queried quickly and at a cheap price.

Apache Spark is an Apache Foundation product that lets you process large-scale datasets, for example log files or transaction history. It helps you process the dataset in a distributed fashion, provided you write the Spark code properly. Apache Spark is built on the MapReduce concept, so a task is not executed on just one node: it is distributed across all the nodes controlled by the master node.

Apache Spark has several key features, listed below:

  • Resilient Distributed Dataset (RDD)
  • DataFrame (DF)
  • Spark Streaming
  • GraphX (Graph Processing)
  • Spark ML (Machine Learning)

And Spark supports several programming languages, such as Java, Scala, R, and Python.
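
As a rough illustration of that MapReduce idea, here is a minimal PySpark sketch that counts events per device. The file path and the column layout are assumptions made up for this example, not our real data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("device-event-count").getOrCreate()

# Each line of the (hypothetical) log looks like: "device_id,event_name,timestamp"
lines = spark.sparkContext.textFile("hdfs:///logs/tracking/*.csv")  # assumed path

counts = (
    lines
    .map(lambda line: (line.split(",")[0], 1))  # map: emit a (device_id, 1) pair per line
    .reduceByKey(lambda a, b: a + b)            # reduce: sum the pairs per device, across nodes
)

for device_id, total in counts.take(10):
    print(device_id, total)

spark.stop()

The map and reduce steps run on whichever worker nodes hold each partition of the file, which is exactly the distribution the master node coordinates.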

Eventually you reach AWS Elastic MapReduce (EMR), which is dedicated to making that infrastructure easier: you turn the cluster on only when you actually use it. You can reduce the cost further by building the cluster on EC2 spot instances, bidding at half the on-demand price (or even lower), which will save your budget.

AWS EMR does not only provide Apache Spark. You can also use Oozie, Zeppelin, Hive, HDFS, and many other Apache big data products that come with AWS EMR.

You can create the cluster on a schedule and destroy it when you finish using it. At the very least, you don't have to demolish a physical cluster the way you would if you had built it on physical machines.
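
As a hedged sketch of that on-demand pattern, a boto3 call along these lines could create a Spark cluster on spot instances that terminates itself once its step finishes. The instance types, bid price, bucket, script name, and region here are illustrative assumptions, not our actual configuration:

import boto3

emr = boto3.client("emr", region_name="ap-southeast-1")  # assumed region

response = emr.run_job_flow(
    Name="hourly-tracking-log-job",
    ReleaseLabel="emr-5.12.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "master",
                "InstanceRole": "MASTER",
                "InstanceType": "m4.large",
                "InstanceCount": 1,
            },
            {
                "Name": "core-spot",
                "InstanceRole": "CORE",
                "InstanceType": "m4.large",
                "InstanceCount": 2,
                "Market": "SPOT",
                "BidPrice": "0.05",  # bid below the on-demand price (assumed value)
            },
        ],
        # Without this flag the cluster stays alive; False makes it
        # terminate as soon as the steps are done.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[
        {
            "Name": "run spark app",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/process_tracking_logs.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])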

For more information, visit this link to get a clear picture of AWS EMR: https://aws.amazon.com/emr/.

Generic Data Pipeline Architecture

We will start with the diagram below to explain how to set up a basic but complete data engineering infrastructure with Apache Spark and the other tools that support it.

Generic data pipeline architecture

First of all, an application, perhaps a web or mobile application, sends its data (a tracking log, for example) into this infrastructure via a message stream that can handle a huge volume of messages. The message stream then packs the data into HDFS, turning it into a raw dataset that will be processed by an Apache Spark application.

The raw dataset is processed by the Apache Spark application. The application is just a script, which you might write in Python, Java, or Scala. The script is executed by the Apache Spark cluster in your infrastructure, and the scheduler is responsible for running the Spark application at a certain period of time.

The Spark application will produce the output dataset, and it will be stored on HDFS as a useful dataset that may be ingested by another system such as a BI tool, the ELK stack, GraphX, Spark ML, or maybe another Apache Spark application.
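
A minimal sketch of such a "raw dataset in, useful dataset out" job might look like this in PySpark. The paths, schema, and the particular aggregation are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hourly-tracking-aggregation").getOrCreate()

# Read the raw dataset that the message stream landed on HDFS (assumed path and columns)
raw = spark.read.csv("hdfs:///raw/tracking/2018/03/31/", header=True, inferSchema=True)

# Derive a smaller, query-friendly dataset: events per device per hour
useful = (
    raw.withColumn("hour", F.hour(F.to_timestamp("timestamp")))
       .groupBy("device_id", "hour")
       .count()
)

# Store the output where downstream consumers (BI tools, other Spark jobs) can read it
useful.write.mode("overwrite").parquet("hdfs:///datasets/tracking_hourly/")

spark.stop()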

The Architecture

OK, let's move on to a real-world example of building a data management infrastructure using Apache Spark.

Tracking Log Pipeline Architecture

First of all, we utilized several services on Amazon Web Services, for example:

  • Route 53, used for creating the domain name
  • Elastic Beanstalk, which runs the producer that emits the user-activity logs and geolocation data from the users' mobile phones
  • Kinesis Firehose, a bridge that stores the dataset safely and partitions it automatically when it lands in AWS S3
  • S3, an enormous data store that can hold terabytes up to petabytes; we only store CSV files in S3
  • EC2, where we run our scheduler on a t2.micro instance
  • Elastic MapReduce, which we use to preprocess the raw dataset into the datasets required by management and the data scientists

The story began when we stored the tracking data from the SDK directly into PostgreSQL as part of our Mobile Application Analytic Service (MAAS). Even though PostgreSQL was indexed on some fields and configured for replication, the data kept growing until it reached 2 billion rows. It was no longer queryable when we wanted to retrieve data for a 1-day or 1-month range; the only range we could pass to the SQL was a 15-second time window.

The SDK itself stored tracking log files that record useful data such as longitude, latitude, IDFA, GAID, OS version, manufacturer information, device version, and some private settings from the users' devices.

Then my prior team decided to move the tracking-data storage from PostgreSQL directly into AWS S3. We were handling partners such as a customer loyalty app popular in Malaysia and Indonesia, a sharing app, a Muslim app, a TV provider in Malaysia (an Android app), and so on. So the processing of the tracking logs would no longer run on PostgreSQL; instead we processed it in Apache Spark on a fixed schedule. At that time we ran the scheduler for my Spark application every hour.

The SDK sends the tracking data from mobile devices to the backend application running on AWS Elastic Beanstalk. The web service captures the tracking data and streams it into AWS Kinesis Firehose, which stores the tracking log in S3 directly. The tracking data is formatted automatically by Kinesis Firehose and stored in a directory structured as yyyy/mm/dd/hh, so the data can be read quickly when processed by Apache Spark.
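
For illustration, the backend handler could forward one tracking event to Firehose with a boto3 call along these lines. The stream name, region, and the CSV fields are assumptions for this sketch:

import boto3

firehose = boto3.client("firehose", region_name="ap-southeast-1")  # assumed region

# One tracking event serialized as a CSV line (fake values for illustration)
record = "38400000-8cf0-11bd-b23e-10b96e40000d,-6.914744,107.609810,8.0.0\n"

# Firehose buffers the records and delivers them to S3 under yyyy/mm/dd/hh prefixes
firehose.put_record(
    DeliveryStreamName="tracking-log-stream",  # assumed stream name
    Record={"Data": record.encode("utf-8")},
)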

We built two Apache Spark applications. The first one aggregates and unifies the device data into a device data lake that stores one unique device record per user. The second Spark application then uses that unique device data to match against the tracking log data, filtering the tracking log to find which users were tracked at a given time. After the execution finishes, the Apache Spark cluster on AWS EMR is terminated to reduce the cost.
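
A hedged PySpark sketch of the second application's matching step might look like this; the column names and S3 paths are assumptions, not our real schema:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("match-devices-to-tracking").getOrCreate()

# Output of the first application: one row per unique device
devices = spark.read.parquet("s3://my-bucket/device-data-lake/")

# Raw tracking logs for one hourly partition, as delivered by Firehose
tracking = spark.read.csv("s3://my-bucket/tracking/2018/03/31/10/", header=True)

# Keep only the tracking rows whose device exists in the device data lake
tracked_users = tracking.join(devices, on="device_id", how="inner")

tracked_users.write.mode("overwrite").parquet("s3://my-bucket/tracked-users/2018/03/31/10/")

spark.stop()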

Every execution, whether it succeeds or fails, sends a notification to our Slack, so we can decide what action to take when an execution fails. Usually we then reprocess the affected period of tracking log data manually: we create the instance manually and store the result manually.

Every successful and failed process is notified directly to our phones via a Slack webhook
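
A notification like that takes only a few lines of Python against a Slack incoming webhook; the webhook URL and message below are placeholders:

import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

def notify_slack(message: str) -> None:
    """Post a plain-text message to a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

notify_slack("EMR step finished: tracking-log job for 2018/03/31/10 succeeded")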

Conclusion

It's really challenging to use AWS EMR together with other AWS services. You have to write some Python code to process the dataset, and debugging an Apache Spark application inside AWS EMR is not easy at all. At that time, we were able to ingest 872.1 MiB of tracking data and 24.5 MiB of user device data every hour.

Later, we may explain more of our architectures that might be useful for you or your organization.

Many Thanks for The Team

Actually, this architecture was built by the whole team, so we have to say thank you to everyone involved in building it. Here are the friends who built this awesome architecture: Tajhul Faijin Aliyudin, fajri abdillah, Mirza, Nugraha, Doni, Zaky, Ridwan, Wildan, Obaidul, Firman, Mr. Hubert.
