Material Big Data

Lanzados ppts informativos de tecnologías BigData: Hadoop, Hbase, Hive, Zookeeper...

Apuntate al Curso de PowerBI. Totalmente práctico, aprende los principales trucos con los mejores especialistas

Imprescindible para el mercado laboral actual. Con Certificado de realización!!

Pentaho Analytics. Un gran salto

Ya se ha lanzado Pentaho 8 y con grandes sorpresas. Descubre con nosotros las mejoras de la mejor suite Open BI

LinceBI, la mejor solución Big Data Analytics basada en Open Source

LinceBI incluye Reports, OLAP, Dashboards, Scorecards, Machine Learning y Big Data. Pruébala!!

23 may. 2019

List of Open Source solutions for Smart Cities - Internet of Things projects

Increasingly projects are carried on so-called 'Smart Cities', supported by Big Data, Internet of Things... and the good news is that most of them are made with Open Source technologies. We can share, from our insights about these technologies

Making a city “smart” involves a set of areas we will outline below: Without IOT (Internet Of Things), there will be no Smart City. 

Since automatic collected data is the most efficient way to get huge amounts of information, devices connected to the internet are an essential part of a Smart City.
The way we store and process data from city is generally using Big Data and Real Time Streaming technologies. 

The final goal where more innovative and custom analysis can be achieved using Artificial Intelligence and Machine Learning. Finally I would include Apps, as usually this kind of solutions is consumed in mobile devices. 

Here we outline the common process of building a Smart City solution: 

-Choose data 
-Connecting devices 
-Design Data Storage Infrastructure 
-Real Time Events and Notifications 
-Analytics -Visualization (Dashboards) 

 1) Choosing Data 

In a city there are three basic sources of data: citizens, systems, sensors. Use the available information of users, on social networks, informations systems, public statistical information offered by the administration. 

A typical example is user with geolocalization enabled in twitter. Information about the systems and services in a city are sometimes available in open data sources. An example could be the water or electricity consumption. 

Last but not least, sensors. A city hoping to become “Smart” has to intend to provide automatic information of its environment, and that could be achieved using sensors. Sensors can be anywhere

2) Connecting Devices

Devices (sensors) connects with the real time data streaming and the storage infrastructure using efficient communications protocols, that using light weight packaging and asynchronous communications.

Examples of some communications protocols used:

MQTT (Message Queuing Telemetry Transport) Websocket (bi-directional web communication and connection management)

STOMP (The Simple Text Oriented Messaging Protocol)

XMPP (Extensible Messaging and Presence Protocol)

3) Design Data Storage Infraestructure 

The Data Storage Infrastructure for a Smart City solutions has special characteristics, due to the diversity and dynamism of its sources. 

Time series DB are frequently used, because of the time evolution of data catched by sensors Some examples of this kind of DB are InfluxDB and Druid

Another DB commonly used in Smart Cities project are MongoDB (json format advantages), Cassandra (fast insertion advantages), Hadoop (big data frameworks advantages)

Some samples

4) Real Time events and notifications

Usually Smart Cities solutions have needs for real time notifications on events. To accomplish such requirements the system must have a Stream Analytic engine, that can react to events in real time and send notification. This characteristics bring us some technologies related to this; Storm, Spark Streaming, Flink, WebSocket, Socket.IO

IoT Frameworks:


Node-RED is a tool for wiring together hardware devices, APIs and online services in new and interesting ways.

The light-weight runtime is built on Node.js, taking full advantage of its event-driven, non-blocking model. This makes it ideal to run at the edge of the network on low-cost hardware such as the Raspberry Pi as well as in the cloud.

The flows created in Node-RED are stored using JSON which can be easily imported and exported for sharing with others.
An online flow library allows you to share your best flows with the world


     PubNub is a Data Stream Network, that offers infrastructure as a service. With PubNub,  we can use the infrastructure provided and connect our devices, designing our architecture and simply get advantages of all this.

PubNub has 5 main tools:

-Publish Subscribe (Allows Real Time Notifications of Events to users)
-Stream Controller (Allows managing channels and groups of channels)
-Presence (Allows notifications when users login or leave the system, or similar behaviour, device availability for example)
-Access Manager (Allows administrators, to grant or deny permitson users of the systems)
-Storage & Playback (Provide storage for messages,and allows messages retrieval at later time)


AWS IoT is a platform that enables you to connect devices to AWS Services and other devices, secure data and interactions, process and act upon device data, and enable applications to interact with devices even when they are offline

5) Analytics and Visualization

You can show real time dashboards, reports, OLAP Analysis using tools like Pentaho. See samples of Analytics  

16 may. 2019

Analisis de los Panama Papers con Neo4J - Big Data

En este ejemplo se usa Neo4j como Base de Datos basada en grafo para modelar las relaciones entre las entidades que forman parte de los Papeles de Panamá (PP). A partir de ficheros de texto con los datos y relaciones entre clientes, oficinas y empresas que forman parte de los PP, hemos creado este grafo que facilia la comprensión de las interacciones entre sujetos distintos en esta red.
La demostración comienza seleccionando una entidad de cualquier tipo (Address, Company, Client, Officer), según el tipo que seleccione se muestran los atributos de ese nodo, luego seleccione el atributos que desea e introduzca el filtro, agregando varios paneles para filtrar por más de uno si es necesario. El parámetro "Deep" significa el número de conexiones al elemento seleccionado que se quiere mostrar.
En el servidor se hace una búsqueda BFS a partir del nodo seleccionado realizando consultas a Neo4j para cada tipo de relación donde una de sus partes sea el nodo actual, hasta llegar al nivel de profundidad solicitado. Se van guardando los nodos y los arcos para devolverlos como resultado.

Para la visualización del grafo se ha usado Linkurious, uno de los componentes más efectivos para este propósito en el mercado. Se puede interactuar con el grafo haciendo zoom, seleccionando elementos, moviendo elementos o usando el lasso tool para seleccionar varios nodos. Haciendo doble click sobre un nodo se cargan las conexiones a él que no estén visualizadas.
Neo4j y las Bases de Datos basadas en grafos en general tienen aplicaciones muy particulares, como Detección de Fraudes (descubriendo patrones de relaciones entre nodos), Recomendaciones en Tiempo Real (es relativamente sencillo, usando el peso de las relaciones de cada nodo, su tendencia, etc), Analítica de Redes Sociales (por la facilidad de implementar algoritmos de grafos en este tipo de Base de Datos)
Enjoy it!!

14 may. 2019

Diferencias entre Data Analyst, desarrollador Business Intelligence, Data Scientist y Data Engineer

Conforme se extiende el uso de analytics en las organizaciones cuesta más diferenciar los roles de cada una de las personas que intervienen. A continuación, os incluimos una descripción bastante ajustada

Data Analyst

Data Analysts are experienced data professionals in their organization who can query and process data, provide reports, summarize and visualize data. They have a strong understanding of how to leverage existing tools and methods to solve a problem, and help people from across the company understand specific queries with ad-hoc reports and charts.
However, they are not expected to deal with analyzing big data, nor are they typically expected to have the mathematical or research background to develop new algorithms for specific problems.

Skills and Tools: Data Analysts need to have a baseline understanding of some core skills: statistics, data munging, data visualization, exploratory data analysis, Microsoft Excel, SPSS, SPSS Modeler, SAS, SAS Miner, SQL, Microsoft Access, Tableau, SSAS.

Business Intelligence Developers

Business Intelligence Developers are data experts that interact more closely with internal stakeholders to understand the reporting needs, and then to collect requirements, design, and build BI and reporting solutions for the company. They have to design, develop and support new and existing data warehouses, ETL packages, cubes, dashboards and analytical reports.
Additionally, they work with databases, both relational and multidimensional, and should have great SQL development skills to integrate data from different resources. They use all of these skills to meet the enterprise-wide self-service needs. BI Developers are typically not expected to perform data analyses.

Skills and tools: ETL, developing reports, OLAP, cubes, web intelligence, business objects design, Tableau, dashboard tools, SQL, SSAS, SSIS.

Data Engineer

Data Engineers are the data professionals who prepare the “big data” infrastructure to be analyzed by Data Scientists. They are software engineers who design, build, integrate data from various resources, and manage big data. Then, they write complex queries on that, make sure it is easily accessible, works smoothly, and their goal is optimizing the performance of their company’s big data ecosystem.
They might also run some ETL (Extract, Transform and Load) on top of big datasets and create big data warehouses that can be used for reporting or analysis by data scientists. Beyond that, because Data Engineers focus more on the design and architecture, they are typically not expected to know any machine learning or analytics for big data.

Skills and tools: Hadoop, MapReduce, Hive, Pig, MySQL, MongoDB, Cassandra, Data streaming, NoSQL, SQL, programming.

Data Scientist

A data scientist is the alchemist of the 21st century: someone who can turn raw data into purified insights. Data scientists apply statistics, machine learning and analytic approaches to solve critical business problems. Their primary function is to help organizations turn their volumes of big data into valuable and actionable insights.
Indeed, data science is not necessarily a new field per se, but it can be considered as an advanced level of data analysis that is driven and automated by machine learning and computer science. In another word, in comparison with ‘data analysts’, in addition to data analytical skills, Data Scientists are expected to have strong programming skills, an ability to design new algorithms, handle big data, with some expertise in the domain knowledge.

Moreover, Data Scientists are also expected to interpret and eloquently deliver the results of their findings, by visualization techniques, building data science apps, or narrating interesting stories about the solutions to their data (business) problems.

The problem-solving skills of a data scientist requires an understanding of traditional and new data analysis methods to build statistical models or discover patterns in data. For example, creating a recommendation engine, predicting the stock market, diagnosing patients based on their similarity, or finding the patterns of fraudulent transactions.
Data Scientists may sometimes be presented with big data without a particular business problem in mind. In this case, the curious Data Scientist is expected to explore the data, come up with the right questions, and provide interesting findings! This is tricky because, in order to analyze the data, a strong Data Scientists should have a very broad knowledge of different techniques in machine learning, data mining, statistics and big data infrastructures.

They should have experience working with different datasets of different sizes and shapes, and be able to run his algorithms on large size data effectively and efficiently, which typically means staying up-to-date with all the latest cutting-edge technologies. This is why it is essential to know computer science fundamentals and programming, including experience with languages and database (big/small) technologies.

Skills and tools: Python, R, Scala, Apache Spark, Hadoop, data mining tools and algorithms, machine learning, statistics.

Visto en BigDataUniversity

9 may. 2019

Big Data: Real Time Dashboards with Spark Streaming

Al abrirse la página de esta demostración, se solicita una conexión con el end point que provee los datos de la wikipedia, mediante un WebSocket.

Enel servidor se crea una conexión con el cliente y mientras esté abierta y no ocurran errores en el envio, el sistema busca los datos de los componentes de "Broadcast Queue". Estos componentes, a su vez, están recibiendo datos del API REST, que les llega a través del Cliente Http implementado y usado por Spark para enviar los resultados.
La implementación de la "Broadcast Queue", permite que todas las conexiones al servidor puedan buscar los datos en la misma cola obteniendo un tiempo óptimo de O(1), (Complejidad Computacionalde obtener datos de una Cola de Mensajes) para cada conexión en recibir el mensaje.

A su vez, en su papel de Cola de Mensajes permite que la comunicación entre Spark y el Server Socket sea óptima, en O(1) igualmente sin contar los retrazos por red.

Esta implementación permite que un número muy alto de clientes puedan conectarse a visualizar en tiempo real los datos recibidos de la wikipedia.

Puedes ver también un video en funcionamiento:

7 may. 2019

Real Time Analytics, concepts and tools

We could consider three types of Real Time when we manage data and depends on each stage:

1. Real Time Processing: Is the possibility of ingest data at the time the event is produced in real live. This includes only processing step, i.e copying data from source to destiny and guarantees data to be ready for analytics

You can try some online demos here


2. Stream Analytics: it performs analytics of data on the fly, as a stream is usually analyzed in a window time frame, the analytics we can do here is limited because only attack a very limited data set


3. Real Time Analytics: refers to two basic conditions: the most recent data will be included in any report, graphic, etc, that analytics will take near to 0 time in execute


In Memory Mapreduce
-Apache Spark (Spark SQL)
-Apache Flink (FQL)

Column Storage Engines
-Kafka + (Spark | Flink) + 
-InfluxDB (Time series analytics)

-Marketing (Product recommendations based on latest updates)
-Fraud Detection (Tracking suspect activities on events that appear to be fraudulent)
-Health Care Monitoring (Social network trending topics can help to this)