Big Data Materials

We have released informative slide decks on Big Data technologies: Hadoop, HBase, Hive, Zookeeper...

Sign up for the Power BI course. Fully hands-on: learn the main tricks from the best specialists.

A must for today's job market. Certificate of completion included!

Pentaho Analytics: a great leap forward

Pentaho 8 has now been released, and with big surprises. Discover with us the improvements in the best open source BI suite.

Learn OLAP analytics on Pentaho for free

The open source solution for business intelligence and Big Data on Pentaho. Don't miss it!

Apr 29, 2018

RStudio cheat sheets: ready to download

With the following cheat sheets (essentially summaries or crib notes), learning to use the best R libraries and packages becomes much easier. Click on the images to download the PDFs (TensorFlow, Shiny, sparklyr, ggplot2...).






Introducing STMonitoring for Pentaho


One of the most useful things when you run a Pentaho production environment, with many users accessing the BI server through reports, dashboards, OLAP analysis and so on, is monitoring overall usage and performance.

That's why we've created STMonitoring, included for free in all the projects we help develop and in some solutions, such as LinceBI. It includes a set of predefined dashboards, reports and OLAP analyses based on several monitoring models (a query sketch follows the lists below), including:


User session events model:

- Analysis by year, month, day, hour and minute
- Session event (login, logout)
- User
- Session status (abandoned, ended, started)
  • Session Duration
  • Session Avg Duration
  • Session Max Duration
  • Session Min Duration
  • Session Count
  • Acc. Login Count by Time
  • Avg. Login Session Count by Time
  • Max. Login Session Count by Time
  • Acc. Logout Count by Time
  • Avg. Logout Session Count by Time
  • Max. Logout Session Count by Time
  • Concurrent Sessions Count by Time





Server Content access model:

- Analysis by year, month, day, hour and minute
- User
- Content type (CDE, Pentaho Analyzer, Saiku Analytics, STPivot, STReport, STDashboard, Pentaho Reporting, STAgile...)
- Content extension (prpt, wcdf, xaction, xjpivot...)
- Content (complete path)

  • Duration
  • Avg duration
  • Access count
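As an illustration of the kind of question these models answer, here is a minimal sketch that queries a hypothetical relational fact table behind the session events model; the table and column names are invented for the example, not the real STMonitoring schema:

```java
// Hypothetical fact table: one row per user session, with its hour of day
// and duration. Connection URL and credentials are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SessionStats {
    public static void main(String[] args) throws Exception {
        try (Connection c = DriverManager.getConnection(
                 "jdbc:mysql://localhost/stmonitoring", "user", "pass");
             Statement st = c.createStatement();
             // Session Count and Session Avg Duration by hour, to spot peaks
             ResultSet rs = st.executeQuery(
                 "SELECT session_hour, COUNT(*) AS session_count, "
                 + "AVG(duration_secs) AS avg_duration "
                 + "FROM fact_user_session GROUP BY session_hour")) {
            while (rs.next()) {
                System.out.printf("%02d:00  sessions=%d  avg=%.1fs%n",
                    rs.getInt("session_hour"), rs.getLong("session_count"),
                    rs.getDouble("avg_duration"));
            }
        }
    }
}
```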



Want to know more? Send us an email.

Apr 21, 2018

Why are many data scientists quitting their jobs?


What this Towards Data Science article tells us is very revealing, and it matches many real situations and cases we know of that are happening right now.

The frustration data scientists feel with their day-to-day work, compared with their expectations, is significant (as many of you know, it has been called 'the sexiest job of the 21st century'). The reality is that many are leaving their jobs at large companies, just when those jobs seemed the most desirable.


These are the reasons:

1. Expectations don't match reality



When they are hired, data scientists believe they will be solving very complex problems that are crucial to the company, using novel and sophisticated algorithms.
The reality is that they find the company cares more about which chart should appear in the reports or dashboards for the next steering committee than about optimizing the best algorithm.

They believe they will be very important to the company, but unless the company is specifically dedicated to machine learning (very few are), they will be just one more employee, no matter how large or multinational the company is.


2. Relationships within the company matter more



However much data scientists expect to be valued for knowing even the most complex algorithm (and expect this to give them more relevance in the company), the reality is that helping business people with simpler, repetitive requests, such as loading data files, cleaning them and creating a few reports, will matter more, and will be the way to progress in the company.


3. You will be seen, in general, as 'the data person'



No matter how much you explain the differences, or the level of knowledge you have as a data scientist of Spark, Hadoop, Hive, Pig, SQL, Neo4j, MySQL, Python, R, Scala, TensorFlow, A/B testing, NLP, anything machine learning... you are the data expert, so most of the time the managers of these large companies will ask you for reports, for why the figures don't add up, for a nice dashboard, for loading tables or CSVs, and so on.


4. Working in specialized, isolated teams doesn't always work




Data scientists may be very good, with Kaggle prizes to their name, knowledge of many algorithms and the ability to work well in small teams.
But for large organizations, the results of a data scientist or their team are just one piece of the big puzzle that is the company's business goals. It is therefore important to stay aligned with the other areas and departments, which calls for tact and people skills, something many data scientists find frustrating.


Apr 18, 2018

STDashboard, a license-free way to create dashboards



The improvements in this version of STDashboard focus on the panel and dashboard user interface, along with some performance enhancements and fixes for some old bugs. It works with Pentaho and can be embedded in web applications.

You can see it in action in this Pentaho online demo and as part of the LinceBI suite.

STDashboard doesn't require an annual license, you can manage unlimited users, and it's open source based.

STDashboard comes with professional services (training, support and maintenance, documentation and bug resolution), so a high enterprise level is guaranteed.

Interested? Contact Stratebi or LinceBI.




See a Video Demo:


About UI improvements:

- New set of predefined dashboard templates. We have designed a new way to manage dashboard panels that allows you to shape the dashboard with almost any combination of panel size, proportion and count. For this reason we have created a set of different layouts for the most common cases.




- Embedding in any web application. This sample shows STDashboard inside LinceBI.




- Self-managed panels. You can now easily add or remove panels in STDashboard using the button inside each panel header.



- New layout management. An STDashboard layout is now composed of a list of panel containers, stacked vertically on the page. There are two types of container, horizontal and vertical; each one stores a list of real panels (the ones where the charts are drawn) in a horizontal or vertical flow. By combining these containers you can achieve almost any layout you can imagine, as the sketch below illustrates.
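A minimal sketch of that composition; the class names are ours for illustration, not STDashboard's actual GWT classes:

```java
// A layout is a vertical stack of containers; each container lays out its
// panels (where the charts are drawn) in a horizontal or vertical flow.
import java.util.ArrayList;
import java.util.List;

class Layout {
    final List<Container> containers = new ArrayList<>(); // stacked vertically
}

class Container {
    enum Flow { HORIZONTAL, VERTICAL }
    final Flow flow;
    final List<Panel> panels = new ArrayList<>();
    Container(Flow flow) { this.flow = flow; }
}

class Panel {
    double widthPct;  // share of the container's width
    double heightPct; // share of the container's height
}
```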





- Resizable panels. Panels can now be resized horizontally or vertically, keeping the proportions of the chart inside in step with the horizontally adjacent panels and without producing a horizontal scroll bar on the page. That means that if you shrink a panel horizontally and there are other panels in the same row, those panels shrink proportionally as well, so that all the panels in the row still fit the horizontal size of the window (see the sketch below).

It is worth noting that we implemented this functionality using the pure GWT API, to avoid external dependencies and ensure portability between browsers.
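The rule can be sketched as follows, assuming panel widths are kept as percentages of the row; the helper is hypothetical, not STDashboard's code:

```java
// Hypothetical proportional resize: the non-resized panels in a row share
// whatever width remains, so the row always sums to 100% of the window.
final class RowResizer {
    static void resize(double[] widths, int resized, double newWidth) {
        double oldRest = 0;
        for (int i = 0; i < widths.length; i++)
            if (i != resized) oldRest += widths[i];
        double newRest = 100.0 - newWidth;
        for (int i = 0; i < widths.length; i++)
            widths[i] = (i == resized) ? newWidth
                                       : widths[i] * newRest / oldRest;
    }
}
```

For example, shrinking the first panel of a {40, 30, 30} row to 20% redistributes the rest to {20, 40, 40}, so the row still fills the window.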

- Draggable panels. Every panel in the dashboard can be dragged to any parent container: the header of each panel has a handle that allows dragging it to any panel container in the dashboard.



- Responsive dashboards. Panels and charts now resize dynamically when the window's dimensions change or when the user zooms the page; on most phones, too, the dashboard is displayed proportionally, keeping the original layout.

- Persistent layout state. When you save a dashboard to a file, its visual state is saved along with it. When you open the dashboard again, all the details of the visual interface are kept and the dashboard looks exactly as it did before being saved: panel sizes and locations are restored faithfully.



About performance:

- At some points in the application, a specific query was causing performance problems. To know whether a member of a multilevel hierarchy has children, the previous code issued a query listing all the children of that member and checked whether the result was larger than 0; our solution for this type of query was simply to check the level of the current member and answer that boolean question from metadata, as sketched below.
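A sketch of the idea using the olap4j metadata API; the helper itself is ours, and it assumes a non-ragged hierarchy, where only members of the deepest level are childless:

```java
import org.olap4j.metadata.Member;

final class MemberUtils {
    // A member has children iff it sits above the deepest level of its
    // hierarchy, so no child-listing query is needed.
    static boolean hasChildren(Member m) {
        int deepestDepth = m.getHierarchy().getLevels().size() - 1;
        return m.getLevel().getDepth() < deepestDepth;
    }
}
```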

- Connection to cubes using the new MondrianOlap4jDriver Java class. This improves connection performance and stability, because the driver is designed for Mondrian connections; the previous code used a standard JDBC connection.
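A minimal connection sketch; the JDBC URL and catalog path are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import org.olap4j.OlapConnection;

public class CubeConnection {
    public static OlapConnection open() throws Exception {
        // Register Mondrian's olap4j driver instead of a plain JDBC one
        Class.forName("mondrian.olap4j.MondrianOlap4jDriver");
        Connection c = DriverManager.getConnection(
            "jdbc:mondrian:Jdbc=jdbc:mysql://localhost/dw;"
            + "Catalog=/path/to/schema.xml;");
        return c.unwrap(OlapConnection.class); // olap4j-aware connection
    }
}
```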


About new enhancements:

- Date configuration for filters. Date dimensions are special: almost every cube has at least one defined, and they are heavily used for range queries over the fact table. To allow dynamic filters in panels, we enabled a .properties file that lets users define their date dimensions and configure how they want to use them in queries.
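A hypothetical example of what such a configuration could look like; the property keys and file name below are invented to illustrate the idea, not the actual STDashboard format:

```java
import java.io.FileReader;
import java.util.Properties;

public class DateFilterConfig {
    public static void main(String[] args) throws Exception {
        Properties p = new Properties();
        // date_dimension.properties might contain, e.g.:
        //   date.dimension=[Time]
        //   date.hierarchy=[Time].[Year].[Month].[Day]
        //   date.format=yyyy-MM-dd
        p.load(new FileReader("date_dimension.properties"));
        String dimension = p.getProperty("date.dimension");
        String format = p.getProperty("date.format", "yyyy-MM-dd");
        System.out.println("Range filters on " + dimension + " use " + format);
    }
}
```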



- Added the Pentaho File Explorer, which lets users navigate the files stored in Pentaho (reports, documents, etc.); it is embedded inside a panel in the dashboard.



Apr 14, 2018

Free book: Front-End Developer Handbook 2018


You still haven't downloaded it? An essential book: Front-End Developer Handbook 2018.

Download the PDF

Contents:

Introduction
What Is a Front-End Developer?
Recap of Front-end Dev in 2017
In 2018 expect...

Part I: The Front-End Practice
Front-End Jobs Titles
Common Web Tech Employed
Front-End Dev Skills
Front-End Devs Develop For...
Front-End on a Team
Generalist/Full-Stack Myth
Front-End interview questions
Front-End Job Boards
Front-End Salaries
How FDs Are Made

Part II: Learning Front-End Dev
Self Directed Learning
Learn Internet/Web
Learn Web Browsers
Learn DNS
Learn HTTP/Networks
Learn Web Hosting
Learn General Front-End Dev
Learn UI/Interaction Design
Learn HTML & CSS
Learn SEO
Learn JavaScript
Learn Web Animation
Learn DOM, BOM & jQuery
Learn Web Fonts, Icons, & Images
Learn Accessibility
Learn Web/Browser APIs
Learn JSON
Learn JS Templates
Learn Static Site Generators
Learn Computer Science via JS
Learn Front-End App Architecture
Learn Data API (i.e. JSON/REST) Design
Learn React
Learn State Management
Learn Progressive Web App
Learn JS API Design
Learn Web Dev Tools
Learn Command Line
Learn Node.js
Learn JS Modules
Learn JS Module loaders/bundlers
Learn Package Managers
Learn Version Control
Learn Build & Task Automation
Learn Site Performance Optimization
Learn Testing
Learn Headless Browsers
Learn Offline Dev
Learn Web/Browser/App Security
Learn Multi-Device Dev (e.g., RWD)
Directed Learning
Front-End Schools, Courses, & Bootcamps
Front-End Devs to Learn From
Newsletters, News, & Podcasts

Part III: Front-End Dev Tools
Doc/API Browsing Tools
SEO Tools
Prototyping & Wireframing Tools
Diagramming Tools
HTTP/Network Tools
Code Editing Tools
Browser Tools
HTML Tools
CSS Tools
DOM Tools
JavaScript Tools
Static Site Generators Tools
Accessibility Dev Tools
App Frameworks (Desktop, Mobile etc.) Tools
State Management Tools
Progressive Web App Tools
GUI Development/Build Tools
Templating/Data Binding Tools
UI Widget & Component Toolkits
Data Visualization (e.g., Charts) Tools
Graphics (e.g., SVG, canvas, webgl) Tools
Animation Tools
JSON Tools
Placeholder Images/Text Tools
Testing Tools
Front-end Data Storage Tools
Module/Package Loading Tools
Module/Package Repo. Tools
Hosting Tools
Project Management & Code Hosting
Collaboration & Communication Tools
CMS Hosted/API Tools
BAAS (for Front-End Devs) Tools
Offline Tools
Security Tools
Tasking (aka Build) Tools
Deployment Tools
Site/App Monitoring Tools
JS Error Monitoring Tools
Performance Tools
Tools for Finding Tools

Apr 13, 2018

From Big Data to Fast Data



A very good article by Raul Estrada. The main points:

1. Data acquisition: pipeline for performance

In this step, data enters the system from diverse sources. The key focus of this stage is performance, as this step determines how much data the whole system can receive at any given point in time.


  • Technologies
    For this stage you should consider streaming APIs and messaging solutions like:
    • Apache Kafka - open-source stream processing platform
    • Akka Streams - open-source stream processing based on Akka
    • Amazon Kinesis - Amazon data stream processing solution
    • ActiveMQ - open-source message broker with a JMS client in Java
    • RabbitMQ - open-source message broker, written in Erlang
    • JBoss AMQ - lightweight MOM developed by JBoss
    • Oracle Tuxedo - middleware message platform by Oracle
    • Sonic MQ - messaging system platform by Sonic
For handling these key principles of data acquisition, the winner is Apache Kafka: it is open source, focused on high throughput and low latency, and handles real-time data feeds.
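A minimal producer sketch with Kafka's Java client; the broker address and topic name are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Asynchronous send; batching and compression sustain throughput
            producer.send(new ProducerRecord<>("events", "sensor-1", "42.0"));
        }
    }
}
```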


2. Data storage: flexible experimentation leads to solutions

There are many points of view for designing this layer, but all should consider two perspectives: logical (i.e. the model) and physical data storage. The key focus for this stage is "experimentation" and flexibility.


  • Technologies
    For this stage consider distributed database storage solutions like:
    • Apache Cassandra - distributed NoSQL DBMS
    • Couchbase - NoSQL document-oriented database
    • Amazon DynamoDB - fully managed proprietary NoSQL database
    • Apache Hive - data warehouse built on Apache Hadoop
    • Redis - distributed in-memory key-value store
    • Riak - distributed NoSQL key-value data store
    • Neo4J - graph database management system
    • MariaDB - with Galera, forms a replication cluster based on MySQL
    • MongoDB - cross-platform document-oriented database
    • MemSQL - distributed in-memory SQL RDBMS
For handling many of the key principles of data storage just explained, the most balanced option is Apache Cassandra: it is open source, distributed, NoSQL, and designed to handle large volumes of data across many commodity servers with no single point of failure.
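A minimal sketch with the DataStax Java driver (3.x API); the contact point, keyspace and table are placeholders:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class EventStore {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                 .addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("fastdata")) {
            // Write one event, then read the latest few back
            session.execute("INSERT INTO events (id, ts, value) "
                + "VALUES (uuid(), toTimestamp(now()), 42.0)");
            for (Row row : session.execute("SELECT ts, value FROM events LIMIT 10"))
                System.out.println(row.getTimestamp("ts") + "  " + row.getDouble("value"));
        }
    }
}
```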


3. Data processing: combining tools and approaches

Years ago, there was discussion about whether big data systems should be (modern) stream processing or (traditional) batch processing. Today we know the correct answer for fast data is that most systems must be hybrid — both batch and stream at the same time. The type of processing is now defined by the process itself, not by the tool. The key focus of this stage is "combination."


  • Technologies
    For this stage, you should consider data processing solutions like:
    • Apache Spark - engine for large-scale data processing
    • Apache Flink - open-source stream processing framework
    • Apache Storm - open-source distributed realtime computation system
    • Apache Beam - open-source, unified model for batch and streaming data
    • Tensorflow - open-source library for machine intelligence
For managing many of the key principles of data processing just explained, the winner is a tie between Spark (micro-batching) and Flink (streaming).
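A minimal batch sketch with Spark's Java API; input and output paths are placeholders, and the same engine also covers streaming via micro-batches:

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class WordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("wordcount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.textFile("events.txt")
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
              .mapToPair(word -> new Tuple2<>(word, 1))
              .reduceByKey(Integer::sum)   // batch aggregation over the RDD
              .saveAsTextFile("counts");
        }
    }
}
```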


4. Data visualization

Visualization communicates data or information by encoding it as visual objects in graphs, to clearly and efficiently get information to users. This stage is not easy; it’s both an art and a science.

Technologies

For this layer you should consider visualization solutions in these three categories:

Comparison of open source OLAP systems for Big Data

We have already talked a lot on this blog about our favourite open source OLAP solution for Big Data, Apache Kylin:





- x50 faster 'near real time' Big Data OLAP Analytics Architecture
- Use Case "Dashboard with Kylin (OLAP Hadoop) & Power BI"
- Dashboards with Tableau and Apache Kylin (OLAP with Big Data)
- BI meet Big Data, a Happy Story
- 7 practical Big Data examples and applications
- Big Data OLAP analysis on Hadoop with Apache Kylin
- Real Time Analytics, concepts and tools
- Hadoop, Hive and Pentaho: Business Intelligence with Big Data (a practical case)


Today we are going to tell you about other alternatives, thanks to Roman Leventov:

I want to compare ClickHouse, Druid and Pinot, three open source data stores that run analytical queries over big volumes of data with interactive latencies.
ClickHouse, Druid and Pinot have a fundamentally similar architecture, and occupy their own niche between general-purpose Big Data processing frameworks such as Impala, Presto and Spark, and columnar databases with proper support for unique primary keys, point updates and deletes, such as InfluxDB.
Due to their architectural similarity, ClickHouse, Druid and Pinot have approximately the same "optimization limit". But as of now, all three systems are immature and very far from that limit. Substantial efficiency improvements to any of these systems (when applied to a specific use case) are possible in a matter of a few engineer-months of work. I don't recommend comparing the performance of these systems at all; choose the one whose source code you are able to understand and modify, or the one in which you want to invest.
Among these three systems, ClickHouse stands a little apart from Druid and Pinot, while the latter two are almost identical: they are pretty much two independently developed implementations of exactly the same system.
ClickHouse more closely resembles "traditional" databases like PostgreSQL, and a single-node installation of ClickHouse is possible. At small scale (less than 1 TB of memory, fewer than 100 CPU cores), ClickHouse is much more interesting than Druid or Pinot, if you still want to compare it with them, because ClickHouse is simpler and has fewer moving parts and services. I would say that at this scale it competes with InfluxDB or Prometheus, rather than with Druid or Pinot.
Druid and Pinot more closely resemble other Big Data systems in the Hadoop ecosystem. They retain "self-driving" properties even at very large scale (more than 500 nodes), while ClickHouse requires a lot of attention from professional SREs. Druid and Pinot are also in a better position to optimize for the infrastructure costs of large clusters, and are better suited to cloud environments, than ClickHouse.
The only lasting difference between Druid and Pinot is that Pinot depends on the Helix framework and will continue to depend on ZooKeeper, while Druid could move away from its ZooKeeper dependency. On the other hand, Druid installations will continue to depend on the presence of some SQL database.
Currently Pinot is better optimized than Druid (but please read again above: "I don't recommend comparing the performance of these systems at all", and the corresponding sections in the post).