TodoBI - Business Intelligence, Big Data, ML y AI

ETL Validations in Pentaho

We present a framework for validation, in our case focused on validating that an ETL load has worked correctly.
The intended purpose of STValidations (free download from the previous link) is to automate routine validation, something we sometimes fail to do precisely because it is routine.
It is quite simple to use and is still at version 0.1, so it can evolve a lot, and we have few use cases so far with which to guarantee its robustness or to cover every need. We hope this code from Stratebi helps you.
It is an ETL process that reads, from a table, a list of queries to be executed together with their expected results. It then runs each query and compares the actual result with the expected one: correct results go to a log of successful executions, and unexpected results go to a log of failed executions. At the end it sends an email with both logs to the user, so the load can be reviewed.
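The real implementation is a Pentaho Data Integration (Kettle) job built with standard steps, but to make the logic concrete, a minimal Python sketch of that loop might look like the following. The validation_queries table and its column names are hypothetical, chosen only for illustration, not the actual names used by STValidations:

```python
import sqlite3  # stand-in for whatever database holds the warehouse and the validation table

def run_validations(conn):
    """Run every validation query and split the results into two logs."""
    ok_log, failed_log = [], []
    rows = conn.execute(
        "SELECT id, description, query_sql, expected_value FROM validation_queries"
    ).fetchall()
    for vid, description, query_sql, expected_value in rows:
        row = conn.execute(query_sql).fetchone()
        actual = row[0] if row else None
        entry = (vid, description, expected_value, actual)
        # Correct results go to one log, unexpected results to the other
        if str(actual) == str(expected_value):
            ok_log.append(entry)
        else:
            failed_log.append(entry)
    # In the real job, a final step emails both logs to the user
    return ok_log, failed_log
```

In PDI itself this is done with chained job entries and transformation steps rather than code; the sketch only shows the comparison logic.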
Here you can see what the job looks like in general:


And the detail of the validation:


Simple... powerful and versatile. It covers everything from simple queries like "select count(*) from table", to check that a table has records, to complex queries that compare values across different tables. In fact, as long as the data you want to check ends up in a table, most validations can be performed with this method.
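For example, validation rows could range from a simple row count to a cross-table comparison. The entries below are purely illustrative; the table names, column names and expected values are invented:

```python
sample_validations = [
    # Check that the fact table was loaded at all
    {"description": "fact_sales has rows",
     "query_sql": "SELECT COUNT(*) FROM fact_sales",
     "expected_value": "1250000"},
    # Check that totals match between staging and the final table
    {"description": "staging vs. fact totals",
     "query_sql": ("SELECT CASE WHEN (SELECT SUM(amount) FROM stg_sales) = "
                   "(SELECT SUM(amount) FROM fact_sales) THEN 'OK' ELSE 'KO' END"),
     "expected_value": "OK"},
]
```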
You can follow the instructions in this video tutorial:
 
You can find more videos on our Youtube Channel
 

In its settings step, the process reads the configuration data from configuration.properties and checks that the validation table exists; if it does not, it creates it and inserts a sample validation. Then the process reads all the queries in the table, runs them, and compares each result with the expected one.
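A rough sketch of that settings step, again in Python rather than PDI steps, and with assumed property keys and table names (the real job reads configuration.properties with its own steps):

```python
import configparser

def load_settings(path="configuration.properties"):
    """Read a Java-style .properties file (key=value pairs, no section headers)."""
    parser = configparser.ConfigParser()
    with open(path) as f:
        # .properties files have no [section] header, so add a dummy one
        parser.read_string("[settings]\n" + f.read())
    return dict(parser["settings"])

def ensure_validation_table(conn, table="validation_queries"):
    """Create the validation table if it is missing and seed it with a sample row."""
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "id INTEGER PRIMARY KEY, description TEXT, "
        "query_sql TEXT, expected_value TEXT)"
    )
    if conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0] == 0:
        conn.execute(
            f"INSERT INTO {table} (description, query_sql, expected_value) "
            "VALUES ('sample check', 'SELECT 1', '1')"
        )
    conn.commit()
```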
The lookup table looks like this and is filled in manually (we are at version 0.1, remember...):


We built this because we had data inconsistency problems in some projects, which made it clear that loads must be validated, and that doing so by hand is tedious, repetitive work.
  • We must always validate loads. We have detected errors caused by inconsistent or unexpected input data formats, and that cannot be controlled unless the loads are validated.
  • Loads must be validated especially after making a change in the ETL, and it is pretty boring to run the same validation queries over and over.
  • We must run a series of routine queries to ensure that the data matches the source.
  • If it can be automated... we would rather spend the time on other things.

Strengths and weaknesses:
  • It is a version 0.1 that we have tested with a couple of clients, and so far so good. But we know that, for example, it still cannot compare data from different sources.
  • Queries can become heavy, so this is potentially VERY expensive in terms of resource consumption. That is why it is a process that can run either attached to the ETL process or independently, at another time with little load. That is how we do it: the process does not run right after the ETL but at 6:00 a.m., as a separate job.
  • It does not depend on the BI server. Simple and useful.

Well, I think it can be very useful in projects and we hope you find it useful too!!
By the way... this is the error log that gets sent: