Big Data Materials

We have released informative slide decks on Big Data technologies: Hadoop, HBase, Hive, Zookeeper...

Sign up for the PowerBI course. Fully hands-on: learn the key tricks from the best specialists.

Essential for today's job market, with a certificate of completion!

Pentaho Analytics: a great leap forward

Pentaho 8 has been released, and with big surprises. Discover with us the improvements in the best open BI suite.

Learn OLAP analytics on Pentaho for free

The open source solution for business intelligence and Big Data on Pentaho. Don't miss it!

May 31, 2017

Create Dashboards in minutes with Open Source



Just a sneak preview of the new functionality we are adding to Pentaho so that end users can create their own powerful dashboards in minutes. We call it STDashboard, built by our colleagues at Stratebi.

The new features include new templates, panel resizing, drag and drop, creating and removing panels, a Pentaho 7 upgrade...

As with the other Pentaho plugins we've created (STPivot, STCard and STReport), it is free and included in all of our projects. Check out the Pentaho Online Demo, where all the new components are updated frequently.

It is also part of LinceBI, the predefined open source BI solution.

You can also use it directly in your own projects, with configuration, training and support from us.


Video of it in action (dashboards in minutes):

May 30, 2017

Open Source Business Intelligence tips in May 2017




You can now check the latest open source Business Intelligence tips for May. Some of these tips are implemented in the Online Demo.

This month brings great stuff:

A great study on Digital Transformation



We recommend this great, very recent study by Statista on the impact of technology and digital transformation.




May 28, 2017

Download eBook: Data Scientist, a Step-by-Step Guide


This compact, informative guide to the world of Data Science will bring you up to speed in no time.

What’s in the eBook?


  • Data Scientists – what do they do?
  • Prerequisites for becoming a Data Scientist
  • Must-have skill sets
  • Study plan
  • What the future holds

Download your free copy

What's new in Pentaho 7.1

As we have been reporting on our Twitter account, Pentaho 7.1 was unveiled this past week:

- Pentaho 7.1 on GitHub
- Pentaho 7.1 on SourceForge

Here are the highlights, along with the most interesting links:

- What's new in Pentaho 7.1

- Description of the improvements by Pedro Alves

- Description of the improvements by Diethard Steiner

- Description of the improvements by Hemal Govind

   Create Once, Execute on Any Engine, Starting with Spark

With adaptive execution on Spark in a visual environment, Pentaho 7.1 makes big data developers more productive and Spark more accessible to non-developers. Users can now create data integration logic one time, and then choose the most appropriate big data processing engine for each workload at run-time. This release starts with Spark, but can easily support other engines in the future.

  • Complete Spark Support: Pentaho is the only vendor to support Spark with all data integration steps in a visual drag-and-drop environment. Unlike other vendors who require users to build Spark-specific data integration logic – and often require Java development skills – with Pentaho you only need to design your logic once, regardless of execution engine.
  • Adaptive Execution on Big Data: Transitioning from one engine for big data processing to another often means users need to re-write and debug their data integration logic for each engine, which takes time. Pentaho’s adaptive execution allows users to match workloads with the most appropriate processing engine, without having to re-write any data integration logic.

   More Cloud Options with Microsoft Azure HDInsight

Building on current cloud support for Amazon EMR, Pentaho 7.1 supports Microsoft Azure HDInsight, Azure SQL, and SQL Server in Azure VM, offering more options to store – and more importantly, process – big data in hybrid, on-premises, and public cloud environments.
  • Support for HDInsight: Organizations using Microsoft Azure HDInsight can now use Pentaho to acquire, blend, cleanse and analyze diverse data at scale.
  • Process Data in the Cloud or On-Premises: Most vendors only allow you to access data from cloud sources. With Pentaho 7.1, you can also choose to process data on-premises, in the cloud or using a hybrid approach.

   Improved Data Visualizations Across the Pipeline

Pentaho 7.1 speeds up time to insight by allowing users to access visualizations at every step of the data prep process. In addition, simplified integration of third party visualizations drives improved analytics along the entire data pipeline. 
  • Prepare Better Data, Faster: More visualizations throughout the data prep process allow users to spot-check data for quality issues and prototype analytic data, without switching in and out of tools or waiting until the very end to discover data quality problems. Now, users can interact with heat grids, geo maps, and sunbursts, as well as drill down into data sets for further exploration.
  • Integrate 3rd Party Visualizations: Leverage an easy to use and flexible API with full documentation to integrate visualizations from third party libraries such as D3 or FusionCharts.



   Expanded Enterprise-Level Security for Hortonworks

Concerns over the lack of comprehensive security and authentication for big data environments are top of mind for IT organizations. Pentaho 7.1 gives customers more options by expanding on existing enterprise-level Hadoop security for Cloudera with a similar level of security for Hortonworks.
  • Kerberos Impersonation Support: Address authentication vulnerabilities with Hortonworks deployments. Protect clusters from intrusion and reduce risk with enterprise-level security.
  • Apache Ranger Support: Control role-based access to specific data sets and applications for Hortonworks deployments. Manage governance and risk with authorization.

A list of technologies for Machine Learning

Here is a fairly up-to-date list of tools and technologies for working with Machine Learning, grouped by topic.

Thanks to http://www.shivonzilis.com/

May 26, 2017

Free Business Intelligence course

Working in Business Intelligence is one of the most in-demand and exciting jobs. If you have BI experience, or want to learn and develop your career in this area, this may interest you.

You can sign up for the free Business Intelligence course on June 9 and 10 in Barcelona. Don't miss this opportunity!

At Stratebi (creators of the TodoBI portal), you will enjoy the wealth of opportunities in today's fastest-growing technology areas: Business Intelligence, Big Data and Machine Learning, based on open source solutions.

Our solutions, such as LinceBI, together with the leading tools on the market, enable our clients to be smarter, faster and more flexible than their closest competitors. That is the true power of an organization.



These solutions are the cornerstone of our clients' businesses: marketing campaigns, reporting and analysis, financial scorecards, CRM, dashboards, etc. Building them takes the most valuable and brilliant people in Business Intelligence: that is who we look for, and who our consultants must become.

We want to build a team with strong entrepreneurial motivation, where every member feels satisfied with the quality of the work and their relationships with their colleagues.

Send your CV to rrhh@stratebi.com

Open positions:

- Engineers interested in learning and working in Business Intelligence
- Experienced Business Intelligence consultants

As we expand our operations in Madrid and Barcelona, we are looking for people who are truly passionate about Business Intelligence, interested in open source solutions and the development of open technologies, and above all eager to learn new technologies such as Big Data, Social Intelligence, etc.

If you are reading these lines, you surely like Business Intelligence. At Stratebi and TodoBI we are looking for people with a strong interest in this area, with solid technical training and/or experience implementing Business Intelligence projects at major companies (Oracle, MySQL, PowerCenter, Business Objects, Cognos, Pentaho, MicroStrategy...) or in ad hoc web development. Even better if that experience is with open source BI such as Pentaho or Talend, plus knowledge of Big Data and social media technology with a focus on visualization and front-end work.

All of this will be very useful for implementing BI/DW solutions with Pentaho, the open source BI platform that is revolutionizing BI and the one we work with most, alongside the development of Big Data, Social Intelligence and Smart Cities solutions.

If you already know or have worked with Pentaho or other open source BI solutions, that is a point in your favor. In any case, our training plan will let you learn these solutions and stay up to date with them.

Want to know a bit more about us and the kind of people and profiles we are looking for to 'come aboard'?

What do we offer?


  • Working in some of the areas with the brightest future and strongest growth in computing: Business Intelligence, Big Data and open source.
  • Helping improve open source BI solutions, some of which are being developed by major technology companies.
  • A dynamic work environment, continuous learning and a variety of challenges.
  • Objective-driven work.
  • R&D and innovation treated as a core part of our development work.
  • Competitive compensation.
  • Being part of a team that values people and talent above all else.

May 25, 2017

New open source OLAP viewer available: STPivot4




STPivot4 builds on the old Pivot4J project, with functionality added, improved and extended. The technical features are listed below.

Update: STPivot4 now works with Pentaho 7. Hurry and download it!



STPivot4 on GitHub
For additional information, visit the STPivot4 project page at http://bit.ly/2gdy09H

Main Features:
  • STPivot4 is a Pentaho plugin for visualizing OLAP cubes.
  • Deploys as a Pentaho plugin.
  • Supports Mondrian 4!
  • Improves the Pentaho user experience.
  • Intuitive UI with drag and drop for measures, dimensions and filters.
  • Adds key features to the Pentaho OLAP viewer, replacing JPivot.
  • Easy multi-level member selection.
  • Advanced, function-based member selection (Limit, Ranking, Filter, Order).
  • Lets users create formulas and calculations on the fly.
  • Non-MDX grand totals (min, max, avg and sum) per member, hierarchy or axis (see the sketch after this list).
  • New user-friendly selector area.
  • and more…
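
To illustrate that last kind of calculation, here is a minimal, hypothetical sketch of computing grand totals client-side, outside MDX, using pandas; the member names, years and figures are invented for the example:

    import pandas as pd

    # Hypothetical pivot result: rows are product members, columns are years
    grid = pd.DataFrame(
        {"2015": [120, 80, 45], "2016": [150, 95, 60]},
        index=["Bikes", "Helmets", "Locks"],
    )

    # One non-MDX grand total per axis, computed outside the OLAP engine
    grid.loc["Total (sum)"] = grid.sum()           # per year (column axis)
    grid["Avg (per member)"] = grid.mean(axis=1)   # per member (row axis)

    print(grid)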


May 9, 2017

BI meets Big Data, a happy story


Traditional BI OLAP analysis on Big Data environments is getting ever closer, thanks to Kylin. A few weeks ago we covered it in this post, where we also showed real examples of OLAP views and dashboards in action.



Now we bring you up to date with recent information from the Kylin developers (in English):

What is Apache Kylin?

Kylin is an OLAP engine on Hadoop. It sits on top of Hadoop and exposes relational data to upper-layer applications via a standard SQL interface.

Kylin can handle big data sets and is fast in terms of query latency, which differentiates it from other SQL-on-Hadoop engines. For example, the biggest Kylin instance in production that we're aware of is at toutiao.com, a news feed app in China. This app has a table of three trillion rows, and the average query response time is less than one second. We'll discuss what makes Kylin so fast in the next section.

Another feature of the Kylin engine is that it can support complex data models. For example, there is a 60-dimension model running at CPIC, an insurance group in China. Kylin provides standard JDBC/ODBC/REST API interfaces, enabling a connection from any SQL application.
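
As a quick illustration of the REST interface, here is a minimal sketch that submits a SQL query to Kylin's query endpoint using Python's requests library; the host, credentials, project and table are those of Kylin's bundled sample cube and stand in for your own deployment (details may vary by Kylin version):

    import requests

    # Placeholder host and the sample-cube credentials; adjust for your cluster
    resp = requests.post(
        "http://localhost:7070/kylin/api/query",
        auth=("ADMIN", "KYLIN"),
        json={
            "sql": "SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",
            "project": "learn_kylin",  # sample project shipped with Kylin
            "offset": 0,
            "limit": 100,
        },
    )
    resp.raise_for_status()
    for row in resp.json()["results"]:
        print(row)

The same query could equally be issued through the JDBC or ODBC drivers from any SQL-speaking BI tool.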


Kyligence has also developed an online demo showcasing the BI experience on 100 million airline records. Check it out to learn, for example, which airline has been the most delayed into San Francisco International Airport over the past 20 years. (Log in with username "analyst" and password "analyst", select the "airline_cube", and drag and drop dimensions and measures to play with the data set.)

May 6, 2017

New features in PostgreSQL 10


The new features announced for PostgreSQL 10 are very interesting:

Headline Features

Declarative Partitioning.  In previous versions of PostgreSQL, PostgreSQL supported only table inheritance, which could be used to simulate table partitioning, but it was complicated to set up and the performance characteristics were not that great.  In PostgreSQL 10, it's possible to do list or range partitioning using dedicated syntax, and INSERT performance has been greatly improved.  There is still a lot more work to do in future releases to improve performance and add missing features, but even what we have in v10 is already a major step forward (IMHO, anyway).
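
For the curious, here is a minimal sketch of the new dedicated syntax, driven from Python with psycopg2; the table, columns and connection string are invented for the example:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder connection string
    cur = conn.cursor()

    # Range-partitioned parent table, using the new dedicated syntax
    cur.execute("""
        CREATE TABLE measurement (
            logdate date NOT NULL,
            reading int
        ) PARTITION BY RANGE (logdate)
    """)

    # One partition per quarter; INSERTs are routed to it automatically
    cur.execute("""
        CREATE TABLE measurement_2017q2 PARTITION OF measurement
            FOR VALUES FROM ('2017-04-01') TO ('2017-07-01')
    """)
    conn.commit()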

Logical Replication.  PostgreSQL has had physical replication -- often called streaming replication -- since version 9.0, but this requires replicating the entire database, cannot tolerate writes in any form on the standby server, and is useless for replicating across versions or database systems.  PostgreSQL has had logical decoding -- basically change capture -- since version 9.4, which has been embraced with enthusiasm, but it could not be used for replication without an add-on of some sort.  PostgreSQL 10 adds logical replication which is very easy to configure and which works at table granularity, clearly a huge step forward.  It will copy the initial data for you and then keep it up to date after that.
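
A minimal sketch of the new table-granularity replication, again via psycopg2; the host names and table are placeholders, and the publisher must run with wal_level = logical:

    import psycopg2

    # On the publisher
    pub = psycopg2.connect("host=primary dbname=test")  # placeholder
    pub.autocommit = True
    pub.cursor().execute("CREATE PUBLICATION mypub FOR TABLE measurement")

    # On the subscriber: initial data is copied, then kept up to date.
    # CREATE SUBSCRIPTION cannot run inside a transaction, hence autocommit.
    sub = psycopg2.connect("host=replica dbname=test")  # placeholder
    sub.autocommit = True
    sub.cursor().execute(
        "CREATE SUBSCRIPTION mysub "
        "CONNECTION 'host=primary dbname=test' PUBLICATION mypub"
    )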

Improved Parallel Query.  While PostgreSQL 9.6 offers parallel query, this feature has been significantly improved in PostgreSQL 10, with new features like Parallel Bitmap Heap Scan, Parallel Index Scan, and others.  Speedups of 2-4x are common with parallel query, and these enhancements should allow those speedups to happen for a wider variety of queries.
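
A quick way to see this in action is to compare plans while varying the worker budget; the table is the placeholder one from the partitioning sketch above:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder
    cur = conn.cursor()
    cur.execute("SET max_parallel_workers_per_gather = 4")

    # With workers available, v10 can now choose Parallel Bitmap Heap Scan,
    # Parallel Index Scan and friends where 9.6 could not
    cur.execute("EXPLAIN SELECT count(*) FROM measurement WHERE reading > 100")
    for (line,) in cur.fetchall():
        print(line)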

SCRAM Authentication.  PostgreSQL offers a remarkable variety of different authentication methods, including methods such as Kerberos, SSPI, and SSL certificate authentication, which are intended to be highly secure.  However, sometimes users just want to use passwords managed by the PostgreSQL server itself.  In existing releases, this can be done either using the password authentication type, which just sends the user-supplied password over the wire, or via the md5 authentication type, which sends a hashed and salted version of the password over the wire.  In the latter approach, stealing the hashed password from the database or sniffing it on the wire is equivalent to stealing the password itself, even if you can't compute a preimage.  PostgreSQL 10 introduces scram authentication, specifically SCRAM-SHA-256, which is much more secure.  Neither the information which the server stores on disk nor the contents of an authentication exchange suffice for the server to impersonate the client.  Of course, the substitution of SHA-256 for MD5 is also a substantial improvement.  See also Michael Paquier's blog on this topic. One point to note is that, unless you are using libpq, you will not be able to use this feature unless your particular client driver has been updated with SCRAM support, so it may be a while before this feature is universally available.
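
Switching an installation over involves the password_encryption setting plus a matching pg_hba.conf rule; a sketch, with a placeholder role name:

    import psycopg2

    conn = psycopg2.connect("dbname=test user=postgres")  # placeholder
    conn.autocommit = True
    cur = conn.cursor()

    # Store new passwords as SCRAM-SHA-256 verifiers rather than md5 hashes
    cur.execute("SET password_encryption = 'scram-sha-256'")
    cur.execute("ALTER ROLE app_user PASSWORD 's3cret'")  # placeholder role

    # pg_hba.conf must then use the matching method, e.g.:
    #   host  all  app_user  0.0.0.0/0  scram-sha-256
    # ...followed by a configuration reload.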

Executor Speedups.  Substantial parts of PostgreSQL's executor have been rewritten to make expression and targetlist projection faster; just-in-time compilation will be added in a future release.  Hash aggregation has been rewritten to use a more efficient hash table and store narrower tuples in it, and work has also been done to speed up queries that compute multiple aggregates and joins where one side can be proven unique.  Grouping sets now support hash aggregation.  While all PostgreSQL releases typically contain at least some performance improvements, the rewrite of expression and targetlist projection is a particularly large and significant improvement which will benefit many users.

Durable Hash Indexes.  Hash indexes in PostgreSQL have suffered from years of long neglect; the situation will be noticeably improved in v10.  The most notable change is that changes to a hash index now write WAL, which means that they are crash-safe and that they are properly replicated to standbys.  However, a good deal of other work has been done, including the necessary prerequisite step of revamping the bucket split algorithm to improve performance and concurrency, caching the metapage for better performance, adding page-at-a-time vacuuming, and expanding them more gradually.  Amit Kapila even writes about a case where they outperformed btree indexes.  While there's certainly more work to be done here, I'm excited about these improvements.
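
Creating one is unchanged; what is new is that it no longer carries the old crash-safety caveat. A sketch on the placeholder table from above:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder
    cur = conn.cursor()

    # In v10 this index writes WAL: crash-safe and replicated to standbys.
    # Hash indexes support equality lookups on the indexed column.
    cur.execute(
        "CREATE INDEX measurement_reading_hash "
        "ON measurement USING hash (reading)"
    )
    conn.commit()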

ICU Collation Support.  In current releases, PostgreSQL relies exclusively on the collations supplied by the operating system, but this sometimes causes problems: collation behavior often differs between operating systems, especially between Linux and Windows, and it isn't always easy to find a collation for one operating system whose behavior matches that of some collation available on another system.  Furthermore, at least on Red Hat, glibc regularly whacks around the behavior of OS-native collations in minor releases, which effectively corrupts PostgreSQL's indexes, since the index order might no longer match the (revised) collation order.  To me, changing the behavior of a widely-used system call in a maintenance release seems about as friendly as locking a family of angry raccoons in someone's car, but the glibc maintainers evidently don't agree.  (In fact, there's one discussion where it's suggested that you not use some of those interfaces at all.)  libicu, on the other hand, says they care about this.
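
On an ICU-enabled build, the ICU collations are imported alongside the libc ones and can be picked per column or per expression; a sketch, assuming the stock "de-x-icu" collation exists in your catalog and using a hypothetical table:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder
    cur = conn.cursor()

    # Sort with an ICU collation instead of an OS-native one
    cur.execute("""
        SELECT site_name FROM measurement_sites   -- hypothetical table
        ORDER BY site_name COLLATE "de-x-icu"
    """)
    for (name,) in cur.fetchall():
        print(name)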

But Wait, There's More!

In my estimation, the features listed above are the most exciting things that users can expect in PostgreSQL 10, which is expected to be released in September.  However, there are quite a few other significant features as well which could easily have qualified as headline features in a release less jam-packed than this one.  Here are some of them:

Extended Statistics (ndistinct, functional dependencies).  If the query planner makes a bad row count estimate resulting in a terrible plan, how do you fix it?  With extended statistics, you can tell the system to gather additional statistics according to parameters that you specify, which may help it get the plan right.
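
The new object is created with CREATE STATISTICS; a minimal sketch with an invented table whose columns are strongly correlated (city functionally determines zip):

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder
    cur = conn.cursor()

    # Tell the planner these columns are correlated, then gather the stats
    cur.execute("""
        CREATE STATISTICS addr_stats (ndistinct, dependencies)
            ON city, zip FROM addresses  -- hypothetical table
    """)
    cur.execute("ANALYZE addresses")
    conn.commit()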

FDW Aggregate Pushdown.  In previous releases, SELECT COUNT(*) FROM foreign_table operated by fetching every row from the foreign table and counting them locally.  That was terrible, so now it doesn't.

Transition Tables.  It is now possible to write a PL/pgsql AFTER STATEMENT trigger which can access all rows modified by the statement.  This can be both faster and more convenient than writing an AFTER ROW trigger that is called once per row.
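
A minimal sketch of such a trigger, logging into a hypothetical audit table; the REFERENCING clause is what names the transition table:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder
    cur = conn.cursor()

    cur.execute("""
        CREATE FUNCTION audit_update() RETURNS trigger AS $$
        BEGIN
            -- new_rows holds every row modified by the whole statement
            INSERT INTO measurement_audit  -- hypothetical audit table
                SELECT now(), * FROM new_rows;
            RETURN NULL;
        END
        $$ LANGUAGE plpgsql
    """)

    cur.execute("""
        CREATE TRIGGER measurement_audit_trg
            AFTER UPDATE ON measurement
            REFERENCING NEW TABLE AS new_rows
            FOR EACH STATEMENT EXECUTE PROCEDURE audit_update()
    """)
    conn.commit()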

Improved Wait Events.  PostgreSQL 9.6 introduced wait event monitoring in pg_stat_activity, but only for a limited range of events.  In PostgreSQL 10, you'll be able to see latch waits and I/O waits, even for auxiliary processes and unconnected background workers.
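
Spot-checking what the server is waiting on becomes a single catalog query; a sketch:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder
    cur = conn.cursor()

    # backend_type is also new in v10, so auxiliary processes show up too
    cur.execute("""
        SELECT pid, backend_type, wait_event_type, wait_event
        FROM pg_stat_activity
        WHERE wait_event IS NOT NULL
    """)
    for row in cur.fetchall():
        print(row)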

New Integrity Checking Tools.  You can now validate the integrity of your btree indexes using the new amcheck module.  If you're a developer adding write-ahead logging to a new storage form, or a user who thinks the developers may have introduced a bug, you'll be pleased to be able to test with wal_consistency_checking. pg_dump now has better test coverage.
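
A minimal sketch of checking a btree index with amcheck; the index name is a placeholder:

    import psycopg2

    conn = psycopg2.connect("dbname=test")  # placeholder
    conn.autocommit = True
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS amcheck")
    # Raises an error if btree invariants are violated; silent otherwise
    cur.execute("SELECT bt_index_check('measurement_pkey'::regclass)")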

Smarter Connection Handling.  Connections through libpq can now specify multiple hosts, and you can even tell it to find you the server that is currently accepting write connections.
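
Since psycopg2 hands its connection string straight to libpq, a client linked against libpq 10 can already use this; a sketch with placeholder hosts:

    import psycopg2

    # libpq tries each host in turn; target_session_attrs=read-write skips
    # standbys that reject writes (requires libpq from PostgreSQL 10)
    conn = psycopg2.connect(
        "host=db1.example.com,db2.example.com port=5432 "
        "dbname=test target_session_attrs=read-write"
    )
    print(conn.get_parameter_status("server_version"))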

Quorum-Based Synchronous Replication.  You can now specify that a commit must be acknowledged by any K of N standby synchronous servers, improving flexibility and performance.
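
The quorum rule is expressed in synchronous_standby_names; a sketch using ALTER SYSTEM, with placeholder standby names:

    import psycopg2

    conn = psycopg2.connect("dbname=test user=postgres")  # placeholder
    conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction
    cur = conn.cursor()

    # Any 2 of the 3 listed standbys must confirm each commit
    cur.execute(
        "ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (s1, s2, s3)'"
    )
    cur.execute("SELECT pg_reload_conf()")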

Other Cool Stuff

Many other things have also been significantly improved in this release.  XMLTABLE makes querying XML data faster and easier.  You can now interrogate the commit status of a transaction directly, and we've got better tracking of replication lag.  psql now supports \if ... \elseif ... \else ... \endif to make scripting easier, and there are new functions and new roles to allow monitoring tools to run without superuser privileges.  Encoding conversions are now faster, and so is sorting. You can compress the transaction log while streaming it.  And there's more, but this blog post is too long already.  If you're interested in reading even more about new features that will be coming with PostgreSQL 10, depesz blogs frequently on this topic, and so does Michael Paquier.  Both have additional details on some of the features mentioned here, as well as others that may be of interest.

This final note: we have had chronic problems with users erroneously believing that the pg_xlog or pg_clog directory is non-critical data, possibly because the directory names include the word "log".  Those directories have been renamed to pg_wal and pg_xact, which we hope will be clearer.  All SQL functions and utility names that formerly included the string "xlog", meaning the transaction log or write-ahead log, have been renamed to use "wal" instead.  Conversely, the default log directory is now called log rather than pg_log so that it looks less like an internal name.  These changes will probably cause a bit of upgrade pain for some users, but we hope that they will also help users to avoid catastrophic mistakes.


Seen on Robert Haas's blog