
How to create an open source Big Data Stack

Big Data Stack
Sub-second interactive queries, machine learning, real-time processing and data visualization
Nowadays there is a lot of technology that enables Big Data processing. However, choosing the right tools for each scenario and having the know-how to use them properly are very common problems in Big Data project management.
For this reason, we have proposed the Big Data Stack, a selection of tools for Big Data processing based on our experience gathering requirements for Big Data analytics projects. Our stack includes tools for each possible task in a Big Data project, such as ETL (Pentaho Data Integration or Spark), machine learning (Spark, R or Python libraries), Big Data OLAP (Kylin or Vertica) and data visualization, using our Lince BI - ST tools (based on Pentaho BA Server) or other popular BI tools.

Figure 1. The Big Data Stack for Big Data Analytics (from Stratebi.com)

Sub-second interactive queries over tables with billions of rows
While early Big Data technology allowed for very efficient data processing (e.g. Apache Hive or Cloudera Impala), analytical query times ranged from minutes to, in the best case, several seconds.
This made it very hard to use Big Data technology to implement Data Warehouses, as we knew them previously, to support analytics applications that require interactive response times, such as dashboards, reporting or OLAP viewers.
Luckily, at the end of 2014 Apache Kylin was introduced. This open source tool is a distributed engine for analytical processing that provides an SQL interface and supports multidimensional analysis applications (OLAP) on a Hadoop/Spark cluster over Big Data sources.
Data from sources such as Hive, common RDBMSs (e.g. SQL Server) or even Kafka topics is pre-aggregated and stored in HBase (the Kylin cube) by fast, incremental processes running on the MapReduce or Spark engines.
These processes are generated automatically from the cube definition that Kylin users provide through its web UI. Once the cube is built, users can run analytical SQL queries over billions of rows with sub-second response times.
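
As an illustration, analytical queries can also be sent to Kylin programmatically through its REST API (covered in more detail below). The following is a minimal Python sketch; the Kylin host, the project name and the queried table are hypothetical placeholders:

    # Minimal sketch: running an analytical SQL query against Kylin over
    # its REST API. Host, project and table names are hypothetical.
    import requests

    KYLIN_API = "http://kylin-host:7070/kylin/api"  # assumed host, default port

    payload = {
        "sql": ("SELECT campaign, COUNT(*) AS impressions "
                "FROM impressions GROUP BY campaign"),
        "project": "marketing",  # hypothetical Kylin project
        "limit": 100,
    }

    # Kylin's REST API accepts HTTP Basic authentication; ADMIN/KYLIN are
    # the well-known demo defaults and must be changed in production.
    resp = requests.post(f"{KYLIN_API}/query", json=payload,
                         auth=("ADMIN", "KYLIN"))
    resp.raise_for_status()

    for row in resp.json()["results"]:  # each row is a list of column values
        print(row)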



Figure 2. Kylin web UI. A sample query over a cube of 888 million rows was resolved in 0.57 seconds.

Moreover, thanks to its support for JDBC/ODBC connectors and a complete REST API, Kylin can be integrated with any current BI tool. In our case, we have seamlessly integrated Kylin with our Lince BI - ST tools (based on Pentaho BA Server): STPivot (OLAP viewer), STReport (ad-hoc reporting) and STDashboard (self-service dashboarding).
A real digital marketing analytics case
As with the other technologies in our stack, we have been able to successfully integrate Kylin into a real Big Data project. The main goal of this project was to analyze data from digital marketing campaigns (e.g. impression metrics), the customer base and payments for a worldwide company dedicated to developing mobile apps.
In the baseline scenario we had to load and transform more than 100 high-volume data sources, although most of them contained structured data. Some of the source tables held more than a billion rows of historical data, and several million new rows were generated per hour.
Until then, the company processed this data using PHP processes and stored it in a Data Warehouse infrastructure that distributed the load between MySQL and Redshift (for the most complex queries). With this system, loading, refresh and query latencies were too slow for their business needs.
Therefore, improving data pre-processing (ETL) and query latency were the main goals of this project.
With these goals in mind, we proposed and implemented an architecture that uses several tools from our stack: Sqoop (to load the data), Hive (to pre-process the data and serve as the source for Kylin), Kylin (to query the resulting Big Data Warehouse with sub-second latency) and the Lince BI - ST tools on Pentaho BA Server (to analyze and visualize the aggregated data).
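
To make the pipeline more concrete, below is a minimal PySpark sketch of the Hive pre-processing step, the stage between the Sqoop load and the Kylin cube build. The database, table and column names (staging.raw_payments, dw.payments_daily, etc.) are hypothetical placeholders, not the customer's actual schema:

    # Minimal sketch of the Hive pre-processing step, submitted with
    # spark-submit. Database, table and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("hive-preprocessing")
             .enableHiveSupport()  # gives spark.sql access to the Hive metastore
             .getOrCreate())

    # Aggregate the raw rows loaded by Sqoop into a daily fact table that
    # Kylin will later use as the source for its cube build.
    spark.sql("""
        CREATE TABLE dw.payments_daily
        STORED AS PARQUET AS
        SELECT to_date(payment_ts) AS payment_date,
               country,
               app_id,
               COUNT(*)    AS payments,
               SUM(amount) AS revenue
        FROM staging.raw_payments
        GROUP BY to_date(payment_ts), country, app_id
    """)

    spark.stop()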
Thanks to these tools, data loading and refresh times were reduced from 30 minutes to about 10 minutes. But the biggest improvement was in query latency, due to the use of Apache Kylin: most queries are resolved in less than 1 second, between 10x and 100x faster than in the initial scenario.
Big Data Analytics Event and Benchmark
After successfully testing the power of Kylin, we decided to adopt this technology as a core part of our Big Data solutions. For this reason, we organized a workshop to present our Big Data Stack and Apache Kylin.
The first edition took place in Barcelona, with more than 30 attendees from big companies, most of them BI professionals. After the success of this first edition, we organized a second edition in Madrid, with the participation of Luke Han, creator of Kylin and CEO of Kyligence (Kylin Enterprise). We also had talks from companies where we have successfully implemented Kylin.


Figure 3. Big Data Analytics workshop, 2nd edition, with talks by Roberto Tardío (top) and Luke Han (bottom)

Moreover, we presented a benchmark whitepaper comparing the Big Data OLAP tools Kylin and Vertica with each other and against PostgreSQL (a traditional RDBMS). The results show that Kylin achieves the best query latency, but Vertica (also part of our stack) also proved to be a very fast OLAP engine.
This last event was a complete success, with more than 40 attendees from large companies based in Spain that use Big Data.
Other applications and use cases of the Stack
In addition to the Big Data OLAP applications discussed, our stack provides tools for other applications such as data quality processes, real-time processing and machine learning.
We are currently carrying out a project where we use Spark and its machine learning libraries to implement a data quality process that improves direct and promotional marketing. Using Spark, we are able to deduplicate the data using advanced statistics, and to cross raw customer data with address dictionaries and geocoding APIs in order to normalize and clean it.
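
As a rough illustration of the deduplication step, the following PySpark sketch finds candidate duplicate customers by comparing normalized addresses with MinHash locality-sensitive hashing, one of several approximate matching techniques Spark MLlib offers. The table name, the id and address columns and the 0.3 distance threshold are all hypothetical:

    # Minimal sketch of approximate customer deduplication with Spark MLlib.
    # Assumes a hypothetical staging.customers table with "id" and "address".
    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml.feature import RegexTokenizer, HashingTF, MinHashLSH

    spark = (SparkSession.builder.appName("dedup")
             .enableHiveSupport().getOrCreate())

    customers = (spark.table("staging.customers")
                 .withColumn("address_norm", F.lower(F.trim(F.col("address")))))

    # Split addresses into tokens; empty token sets would break MinHash.
    tokens = (RegexTokenizer(inputCol="address_norm", outputCol="tokens",
                             pattern="\\W+").transform(customers)
              .filter(F.size("tokens") > 0))

    # Hash the token sets into sparse vectors and index them with MinHash,
    # so near-duplicates can be found without a full O(n^2) comparison.
    features = HashingTF(inputCol="tokens", outputCol="features",
                         numFeatures=1 << 18).transform(tokens)
    model = MinHashLSH(inputCol="features", outputCol="hashes",
                       numHashTables=5).fit(features)

    # Candidate duplicate pairs: token sets with Jaccard distance < 0.3
    # (an illustrative threshold; tune it against labelled samples).
    pairs = (model.approxSimilarityJoin(features, features, 0.3, distCol="dist")
             .filter(F.col("datasetA.id") < F.col("datasetB.id"))
             .select(F.col("datasetA.id").alias("id_a"),
                     F.col("datasetB.id").alias("id_b"),
                     "dist"))
    pairs.show()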
Moreover, we can use Kafka to gather data in real time, directly from our apps or from other APIs, process it with Spark Streaming, and even load it into Kylin directly from Kafka to achieve near-real-time OLAP cubes.
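
Here is a minimal sketch of that real-time path, using Spark Structured Streaming to read JSON events from Kafka and aggregate them per minute. The broker address, topic name and event schema are hypothetical, and the console sink stands in for a real target such as Hive, HBase or a Kylin streaming cube:

    # Minimal sketch of the real-time path: Spark Structured Streaming
    # reading JSON events from Kafka (requires the spark-sql-kafka package
    # on the classpath). Broker, topic and event schema are hypothetical.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StringType, LongType

    spark = SparkSession.builder.appName("events-stream").getOrCreate()

    event_schema = (StructType()
                    .add("campaign", StringType())
                    .add("event_ts", LongType()))  # epoch seconds

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "app-events")
              .load()
              # Kafka delivers raw bytes; parse the JSON payload in "value".
              .select(F.from_json(F.col("value").cast("string"),
                                  event_schema).alias("e"))
              .select("e.*"))

    # Impressions per campaign per minute; the console sink is a stand-in
    # for a real target such as Hive/HBase or a Kylin streaming cube.
    counts = events.groupBy(
        F.window(F.to_timestamp(F.from_unixtime("event_ts")), "1 minute"),
        "campaign").count()

    (counts.writeStream
     .outputMode("update")
     .format("console")
     .start()
     .awaitTermination())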
Therefore, we can conclude that our Big Data Stack enables the successful implementation of most current Big Data analytics scenarios. However, we will continue researching and testing new Big Data tools in order to enrich it.