Big Data Material

Informative slide decks on Big Data technologies have been released: Hadoop, HBase, Hive, ZooKeeper...

Sign up for the Power BI course. Fully hands-on: learn the key tricks with the best specialists.

Essential for today's job market. Includes a certificate of completion!

Pentaho Analytics. A great leap forward

Pentaho 8 has been released, and with big surprises. Discover with us the improvements in the best open BI suite.

Learn OLAP analytics on Pentaho for free

The open source solution for business intelligence and Big Data on Pentaho. Don't miss it!

26 Feb 2018

Agile Methodologies for Analytics (Business Intelligence, Big Data)





In this post we explain how to run agile projects in Analytics (Business Intelligence / Big Data Analytics). In practice, these are tips we use more and more, shared with us by Emilio Arias of Stratebi.

Traditionally, the agile approach has been applied mostly to projects where the 'development' component carries a lot of weight, and it has been hard to apply to BI/DW, where the requirements, the handling of business data, and the involvement of very diverse stakeholder profiles make it difficult.


A. The traditional planning approach in BI/DW

- Waterfall project planning (with Gantt charts), which everyone knows and which has been in use for more than 70 years, has proven inadequate for delivering a successful BI project. What does 'a successful BI project' mean? It means one that 'gets used' by most of the organization, because it gives people 'what they need'

- The different theoretical construction approaches (Kimball, Inmon, Data Vault) have proven very useful for laying out data models and data stores, but day-to-day execution has shown us that agile approaches are needed to put them into practice

- Problems arise because 'the plan was already made', 'with many months ahead': when an architecture issue, a volume issue, a change in requirements or a software upgrade comes up, fitting it in and responding quickly becomes impossible

- Because the scope of these projects is already closed and hard to change ('black-box projects'), users and stakeholders do not feel ownership of them. This creates reluctance to use them, since they do not feel involved and their proposals and suggestions 'tend to arrive too late'




B. The 20 key points for an Agile BI/DW project

1. Build prototypes (before, during and after). Never stop building them; they are the best tool for making sure you are on the right track

2. Have an environment ready for rapid prototyping (a cloud environment, predefined components, automated processes...)

3. Use agile methodologies. There are many (Scrum...); the most important thing is the change of mindset and actually starting to use them

4. The golden rule: better to redo 30% now than 100% six months from now. Don't be afraid of changes being requested on your prototypes. It will always beat flying blind

5. The whole team feels involved from the very start, and they feel their opinions count

6. The traditional battle between users, IT and consultants, driven by their different priorities, is minimized when everyone collaborates from very early on, with the reassurance that 'there is time to fix mistakes'

7. In this kind of project, finding a 'product owner' is hard, but you have to do it. They must come from the business side

8. Resolve the 'top-down' and 'bottom-up' friction points as early as possible, from the importance of data quality, ETL processes and metadata through to real-time business analysis, KPIs, etc. (at the meeting point in the middle, all participants will have to align)

9. Write the test plans not at the end, but on the day after you start

10. You need a project manager (the person who is on top of everything, knows everyone, calls and summarizes the meetings, etc.). You need a clear, visible lead whom everyone 'identifies with the project'

11. Measure and report progress; build satisfaction around what has been achieved

12. Short meetings at the start of each day and longer ones every week

13. Never schedule the presentation of a milestone, progress update, etc. for a Monday morning (counting on the weekend as a buffer is poor management) and it creates anxiety

14. Use the BI itself (cubes, dashboards...) in an agile way to quickly validate data quality, execution times, etc. BI doing double duty

15. Let users get close to the BI. From the earliest phases, lose the fear of letting them access, touch, break, get frustrated, be surprised, and complain about what they are seeing...

16. Don't leave design and usability until the end. Even if you think they are secondary and come later, they should run in parallel with development. If you don't, user engagement will drop sharply

17. With Agile BI you will still have to document (in a different way, with online tools such as Trello, Podio, etc.), but you will do it

18. Agile BI requires more discipline, not less. This is very important: it gets associated with a certain chaos, and it is exactly the opposite. It is about working like the mechanics who change tyres in Formula 1

19. You have to keep people motivated on the project (everything above helps you achieve that), but if you do all of the above and they are still not motivated, 'the problem is you'

20. A BI/DW project never, never, never ends. If you declare it finished, it will also be a failure

Addendum: if you use open source BI (for its flexibility, cost savings and ease of integration), you have 'many' more points in your favour for reaching your goal

You may also be interested in:

- Big Data for Dummies
- A comparison of Business Intelligence tools
- Free download of the book by a good friend and great specialist, Roberto Canales: 'Transformación Digital y Metodologías Ágiles'
- How data is turned into knowledge
- How to learn Big Data in two hours



23 Feb 2018

Main Data Visualization Trends for 2018



Thanks to our friends at CARTO, here is an interesting roundup of the main data visualization trends for 2018.

1. Data visualization is not just for data scientists anymore.
IBM projects a 39% increase in demand for data scientists and data engineers over the next three years. But employers are coming to expect a familiarity and comfort with data across their organizations, not just from their scientists and engineers.
Because of this trend, we can expect the continued growth of tools and resources geared towards making the data visualization field and its benefits more accessible to everyone.
Ferdio Data Viz
For example, someone new to the field may turn to Ferdio’s DataVizProject.com, a compendium of over 100 visualization models. The infographic agency put this resource together to “inform and inspire” those looking to build their own data visualizations. Other services like Google’s Data Studio allow users to easily create visualizations and dashboards without coding skills.

2. The increase of both open and private data helps enrich data visualizations.

In order to gain greater insight into the actions and patterns of their customers and constituents, organizations need to turn to sources outside of their own proprietary data.
Luckily for data scientists, more and more data becomes available every day, and we can expect the trend of increased availability to continue into 2018.
Data.gov, the United States Federal Government’s open data site, boasts data sets from 43 US states, 47 cities and counties and 53 other nations. In June, Forbes identified 85 US cities that have their own data portals.
The example above visualizes open data from the WHO about cholera outbreaks, using custom iconography and a custom color palette.
In addition to open data sources, new marketplaces, data exchanges such as the new Salesforce Data Studio (announced in September 2017) as well as resources such as CARTO’s Data Observatory, will provide data scientists and visualizers even more opportunities to enhance their data and draw new and actionable insights.

3. Artificial Intelligence and Machine Learning allow data professionals to work smarter, not harder.

Artificial Intelligence and Machine Learning are the buzzwords du jour in the tech world, and that includes their use in the field of data science and visualization.
Salesforce has certainly highlighted their use, advertising their Einstein AI, which will aid users in discovering patterns in their data.
Einstein AI
Microsoft has recently announced similar enhancements to Excel, expected in 2018. Their “Insights” upgrade includes the creation of new data types in the program. For example, the Company Name data type will automatically pull in information such as location and population data using their Bing API. They are also introducing Machine Learning models that will assist with data manipulation. These updates will empower Excel users, already familiar with the program's data visualization tools, with data sets that are automatically enhanced.

4. The “interactive map” is becoming a standard medium for data visualizations.

Data visualization, as a term, can refer to any visual representation of data. However, with the growing amount and prevalence of location data, more and more data visualizations require an interactive map to fully tell a story with data.
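As a hedged illustration of what "interactive map as a medium" can mean in practice, here is a minimal sketch using the folium Python library (one of many options, not a tool named in this post); the coordinates and values are invented for the example.

```python
# Minimal sketch, assuming the folium library is installed: a small interactive
# map with one invented data point, saved as an HTML page you can pan and zoom.
import folium

m = folium.Map(location=[40.4168, -3.7038], zoom_start=12)  # Madrid, as an example
folium.CircleMarker(
    location=[40.4168, -3.7038],
    radius=10,
    popup="Example data point: 1,234 events",
    fill=True,
).add_to(m)
m.save("example_map.html")  # open the file in a browser to explore the map
```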

5. There is a new focus on “data stories.”

Creating a single data visualization can have great impact, but more companies are beginning to create custom website experiences that tell a more complete story using many types of data and visualization methods.
Enigma Labs released the world’s first Sanctions Tracker earlier this year, a data story that contextualizes and communicates over twenty years of U.S. sanctions data as meaningful information. Look in 2018 for more custom experiences that use maps and other mediums of data visualization to communicate complex issues.
Sanctions Tracker Map

6. New color schemes and palettes for visually impaired.

The color of 2017, according to Pantone, is “Greenery,” a lighter shade of green conveying a sense of rejuvenation, restoration, and renewal. The long-term color forecast, however, is a return to primary colors like red, green, and blue, colors often appearing in country flags, because “[i]n complex times we look to restricted, uncompromising palettes.”
Regardless of trends, it’s important to understand the fundamentals of choosing color palettes for your data visualization. Once you understand the fundamentals, you can start exploring other palette options and incorporating design trends. Check out InVision’s post on finding the right color palettes for data visualizations.
CARTO also offers an open-source set of colors specifically designed for data visualizations using maps, called CARTO Colors.
It’s important to consider that about 4.5% of the world’s population is color-blind. Data visualization designers especially need to consider building visualizations with color-blind-safe palettes, like those provided by ColorBrewer.
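To make this concrete, here is a minimal sketch in Python with matplotlib, whose bundled "Set2" palette comes from ColorBrewer and is generally considered reasonably color-blind friendly for a handful of categories; the category names and values are invented for the example.

```python
# Minimal sketch: drawing a simple chart with a ColorBrewer-derived qualitative
# palette from matplotlib. The data below is invented for illustration only.
import matplotlib.pyplot as plt

categories = ["North", "South", "East", "West"]
values = [42, 35, 27, 51]

# "Set2" is a ColorBrewer palette bundled with matplotlib; take one color per bar.
colors = plt.get_cmap("Set2").colors[: len(categories)]

fig, ax = plt.subplots()
ax.bar(categories, values, color=colors)
ax.set_ylabel("Sales (units)")
ax.set_title("Example chart with a color-blind friendly palette")
plt.show()
```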

7. Data visualizations around current events are dominating the social conversation.

Data visualizations for social sharing will also take a “less is more” approach for the remainder of the year.
Interactive data visualizations, and maps specifically, offer a new format that is great for social sharing. Marketers can quickly build maps using available location data from social platforms or open data portals.
Below, one marketer created a data visualization using Twitter traffic from the Game of Thrones season seven premiere and generated thousands of views:
By focusing on the three main contenders for the Iron Throne, this data visualization quickly and efficiently tells the viewer that Cersei Lannister drove the most Twitter activity in the twenty-four hours following the premiere. A marketer could do a similar analysis using branded keyword data.

8. Journalists are striking back with data visualizations.

The Oxford English Dictionary selected “post-truth” as the word of 2016. Indeed, following the U.S. presidential election, data analysts and journalists have been on the defensive against opponents labeling their reporting “fake news.”
But 2017 is the year data analysts and journalists strike back with the help of data visualizations.
The editors of the Columbia Journalism Review (CJR) released an editorial, titled “America’s Growing News Deserts,” in spring 2017. The article featured the interactive data visualization below that maps the dearth of local newspapers across the country.

22 Feb 2018

Updated features of an open source web reporting tool

Here are some new features of one of 'our favourite tools' in analytics, which you can use for ad hoc web reporting for end users, license-free and with professional support available.

You can use it 'standalone', together with BI solutions such as Pentaho (check the online demo), SuiteCRM, Odoo..., or as part of predefined solutions such as LinceBI.

You can see STReport's main new functionality in this video, including:

- Graph support
- Identify cardinality of elements
- Parameter filters for end-user access
- Cancel execution of long-running queries
- Upgraded to new Pentaho versions
- Many other minor enhancements and bug fixes

Contact info



Main features:












19 Feb 2018

If you are in Peru: a Big Data, Machine Learning & Business Intelligence program



This interesting course is one of the first engagements in Peru for the analytics specialist Stratebi; there is great interest in these technologies there, and some interesting projects are already under way.

Objective

By the end of the program, students will be able to:


  • Evaluate the fundamentals and concepts underpinning Data Science, Big Data, Machine Learning & Business Intelligence technologies.
  • Develop Business Intelligence solutions using Big Data applications built on Pentaho.
  • Develop Business Intelligence solutions using Machine Learning applications built with Python, Apache Mahout, Spark and MLlib.
  • Develop dashboards and Data Visualization / Data Discovery solutions.
  • Assess the quality of IT & Data Science projects focused on Business Intelligence.
  • Manage Data Science, Big Data, Machine Learning & Business Intelligence projects.
  • Apply the most advanced IT & Data Science tools to build structured BI solutions for science, engineering and business.


Intended audience:

IT professionals, IT managers, business analysts, systems analysts, Java architects, systems developers, database administrators, and other developers and professionals working in technology, marketing, business and finance.



18 Feb 2018

Dear Data: art in data visualization


We recommend this great initiative, Dear Data, by Giorgia Lupi and Stefanie Posavec.

It is a collaborative book built from an exchange of postcards that turns images of data into art and elegance. Highly recommended!



17 Feb 2018

New free open source analysis tool in the Pentaho Marketplace



Hi, Pentaho Community fans: the STPivot4 OLAP viewer is now available for download in the Pentaho Marketplace, compiled and ready to use.



STPivot4 is based on the old Pivot4J project, with functionality added, improved and extended. The main technical features are listed below.







GitHub: STPivot4
For additional information, you can visit the STPivot4 project page at http://bit.ly/2gdy09H

Main Features:
  • STPivot4 is a Pentaho plugin for visualizing OLAP cubes.
  • Deploys as a Pentaho plugin.
  • Supports Mondrian 4!
  • Improves the Pentaho user experience.
  • Intuitive UI with drag and drop for measures, dimensions and filters.
  • Adds key features to the Pentaho OLAP viewer, replacing JPivot.
  • Easy multi-level member selection.
  • Advanced, function-based member selection (Limit, Ranking, Filter, Order).
  • Lets users create formulas and calculations on the fly.
  • Non-MDX grand totals (min, max, avg and sum) per member, hierarchy or axis.
  • New user-friendly selector area.
  • and more…


13 Feb 2018

Benchmarking 20 Machine Learning Models: Accuracy and Speed


As machine learning tools become mainstream and an ever-growing choice of them becomes available to data scientists and analysts, assessing which are best suited becomes challenging. In this study, 20 machine learning models were benchmarked for accuracy and speed on multi-core hardware when applied to two multinomial datasets differing broadly in size and complexity.

See Study

It was observed that BAG-CART, RF and BOOST-C50 top the list at more than 99% accuracy, while NNET, PART, GBM, SVM and C45 exceeded 95% accuracy on the small Car Evaluation dataset.
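The study itself appears to have been carried out in R; purely as a hedged illustration of how this kind of accuracy-and-speed benchmark can be set up, here is a minimal Python sketch with scikit-learn, using the small Iris dataset as a stand-in for the Car Evaluation data and only a few of the model families mentioned above.

```python
# Minimal sketch of an accuracy/speed benchmark with scikit-learn.
# Illustrative only: not the study's actual code, models, or datasets.
import time
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # small multinomial dataset used as a stand-in

models = {
    "RF": RandomForestClassifier(n_estimators=200, random_state=0),
    "GBM": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
}

for name, model in models.items():
    start = time.perf_counter()
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    elapsed = time.perf_counter() - start
    print(f"{name}: mean accuracy={scores.mean():.3f}, time={elapsed:.2f}s")
```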


Seen on RPubs

8 Feb 2018

Comparison of Kettle (Pentaho Data Integration) and Talend

A few days ago we wrote about how crucial ETL is, and today we present a comparison of the two best open source ETL tools (Pentaho's Kettle and Talend). It is no longer a stretch to say they are becoming among the best overall, especially when you weigh cost and the possibilities for integration and customization against Informatica PowerCenter, Oracle, Microsoft or IBM.

Both Kettle and Talend are great, highly visual tools that let you integrate all kinds of sources, including Big Data, to carry out any kind of transformation or integration project, or to prepare powerful analytical environments, also with open source solutions, as you can see in this online demo, where Kettle and Talend were used in the backend.




Download the comparison by Excella

5 Feb 2018

A glossary of the 7 main Machine Learning terms


Machine learning


Machine learning is the process through which a computer learns with experience rather than additional programming.
Let’s say you use a program to determine which customers receive which discount offers. If it’s a machine-learning program, it will make better recommendations as it gets more data about how customers respond. The system gets better at its task by seeing more data.

Algorithm


An algorithm is a set of specific mathematical or operational steps used to solve a problem or accomplish a task.
In the context of machine learning, an algorithm transforms or analyzes data. That could mean:
• performing regression analysis—“based on previous experiments, every $10 we spend on advertising should yield $14 in revenue”
• classifying customers—“this site visitor’s clickstream suggests that he’s a stay-at-home dad”
• finding relationships between SKUs—“people who bought these two books are very likely to buy this third title”
Each of these analytical tasks would require a different algorithm.
When you put a big data set through an algorithm, the output is typically a model.
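As a hedged illustration of "different tasks, different algorithms", here is a minimal Python sketch with scikit-learn covering the first two bullets above; every number and label in it is invented for the example.

```python
# Minimal sketch, assuming scikit-learn: two tasks, two different algorithms.
# All data below is invented purely for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Regression: ad spend (dollars) vs revenue (dollars).
ad_spend = np.array([[10.0], [20.0], [30.0], [40.0]])
revenue = np.array([14.0, 29.0, 41.0, 57.0])
reg = LinearRegression().fit(ad_spend, revenue)
print("Estimated revenue per ad dollar:", round(reg.coef_[0], 2))

# Classification: label site visitors from two made-up clickstream features.
features = np.array([[5, 0.2], [40, 0.9], [7, 0.1], [35, 0.8]])  # pages viewed, share of kids' content
labels = ["other", "stay-at-home parent", "other", "stay-at-home parent"]
clf = DecisionTreeClassifier().fit(features, labels)
print("Predicted segment:", clf.predict([[38, 0.85]])[0])
```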

Model


The simplest definition of a model is a mathematical representation of relationships in a data set.
A slightly expanded definition: “a simplified, mathematically formalized way to approximate reality (i.e. what generates your data) and optionally to make predictions from this approximation.”
Here’s a visualization of a really simple model, based on only two variables.
The blue dots are the inputs (i.e. the data), and the red line represents the model.

I can use this model to make predictions. If I put any “ad dollars spent” amount into the model, it will yield a predicted “revenue generated” amount.
Two key things to understand about models:
1. Models get complicated. The model illustrated here is simple because the data is simple. If your data is more complex, the predictive model will be more complex; it likely wouldn’t be portrayed on a two-axis graph.
When you speak to your smartphone, for example, it turns your speech into data and runs that data through a model in order to recognize it. That’s right, Siri uses a speech recognition model to determine meaning.
Complex models underscore why machine-learning algorithms are necessary: You can use them to identify relationships you would never be able to catch by “eyeballing” the data.
2. Models aren’t magic. They can be inaccurate or plain old wrong for many reasons. Maybe I chose the wrong algorithm to generate the model above. See the line bending down, as you pass our last actual data point (blue dot)? It indicates that this model predicts that past that point, additional ad spending will generate less overall revenue. This might be true, but it certainly seems counterintuitive. That should draw some attention from the marketing and data science teams.
A different algorithm might yield a model that predicts diminishing incremental returns, which is quite different from lower revenue.
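To ground this, here is a minimal sketch in Python with NumPy that fits two models to the same invented ad-spend data using two different algorithms (a straight line and a quadratic) and compares their predictions beyond the observed range.

```python
# Minimal sketch of "same data, different algorithm, different model".
# The ad-spend / revenue numbers are invented for illustration.
import numpy as np

ad_dollars = np.array([10, 20, 30, 40, 50, 60], dtype=float)
revenue = np.array([15, 28, 40, 48, 53, 55], dtype=float)  # growth that flattens out

linear_model = np.poly1d(np.polyfit(ad_dollars, revenue, deg=1))
quadratic_model = np.poly1d(np.polyfit(ad_dollars, revenue, deg=2))

# Predict beyond the last observed data point (extrapolation).
new_spend = 80.0
print("Linear model prediction:   ", round(linear_model(new_spend), 1))
print("Quadratic model prediction:", round(quadratic_model(new_spend), 1))
# The quadratic fit can bend downward past the observed range, while the linear
# fit keeps climbing: the choice of algorithm shapes what the model predicts.
```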

Features


Wikipedia’s definition of a feature is good: “an individual measurable property of a phenomenon being observed. Choosing informative, discriminating, and independent features is a crucial step for effective algorithms.”
So features are elements or dimensions of your data set.
Let’s say you are analyzing data about customer behavior. Which features have predictive value for the others? Features in this type of data set might include demographics such as age, location, job status, or title, and behaviors such as previous purchases, email newsletter subscriptions, or various dimensions of website engagement.
You can probably make intelligent guesses about the features that matter to help a data scientist narrow her work. On the other hand, she might analyze the data and find “informative, discriminating, and independent features” that surprise you.
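As a small, hedged example (assuming pandas, with invented column names and values), this sketch shows how raw customer records become a numeric feature matrix, where each column is one measurable property an algorithm can work with.

```python
# Minimal sketch: turning raw customer records into a feature matrix.
# Column names and values are invented for illustration.
import pandas as pd

customers = pd.DataFrame({
    "age": [34, 52, 29, 41],
    "location": ["Madrid", "Lima", "Madrid", "Lisbon"],
    "previous_purchases": [3, 11, 0, 6],
    "newsletter_subscriber": [True, True, False, True],
})

# One-hot encode the categorical column so every feature is numeric;
# each resulting column is one candidate feature for a model.
feature_matrix = pd.get_dummies(customers, columns=["location"])
print(feature_matrix.columns.tolist())
```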

Supervised vs. unsupervised learning


Machine learning can take two fundamental approaches.
Supervised learning is a way of teaching an algorithm how to do its job when you already have a set of data for which you know “the answer.”
Classic example: To create a model that can recognize cat pictures via a supervised learning process, you would show the system millions of pictures already labeled “cat” or “not cat.”
Marketing example: You could use a supervised learning algorithm to classify customers according to six personas, training the system with existing customer data that is already labeled by persona.
Unsupervised learning is how an algorithm or system analyzes data that isn’t labeled with an answer, then identifies patterns or correlations.
An unsupervised-learning algorithm might analyze a big customer data set and produce results indicating that you have 7 major groups or 12 small groups. Then you and your data scientist might need to analyze those results to figure out what defines each group and what it means for your business.
In practice, most model building uses a combination of supervised and unsupervised learning, says Doyle.

“Frequently, I start by sketching my expected model structure before reviewing the unsupervised machine-learning result,” he says. “Comparing the gaps between these models often leads to valuable insights.”
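To make the two approaches concrete, here is a minimal, hedged sketch with scikit-learn that runs the same invented customer features through both: a supervised classifier trained on known persona labels, and an unsupervised clusterer that proposes groups without any labels.

```python
# Minimal sketch, assuming scikit-learn: supervised vs. unsupervised learning
# on the same tiny, invented feature matrix (age, purchases per year).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

X = np.array([
    [22, 1], [25, 2], [24, 1],      # younger, few purchases
    [45, 12], [50, 15], [48, 10],   # older, frequent buyers
])

# Supervised: we already know "the answer" (a persona label) for each row.
personas = ["student", "student", "student", "loyal", "loyal", "loyal"]
clf = KNeighborsClassifier(n_neighbors=3).fit(X, personas)
print("Predicted persona:", clf.predict([[23, 1]])[0])

# Unsupervised: no labels; the algorithm proposes groups we then interpret.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster assignments:", clusters)
```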


Deep learning


Deep learning is a type of machine learning. Deep-learning systems use multiple layers of calculation, and the later layers abstract higher-level features. In the cat-recognition example, the first layer might simply look for a set of lines that could outline a figure. Subsequent layers might look for elements that look like fur, or eyes, or a full face.

Compared to a classical computer program, this is somewhat more like the way the human brain works, and you will often see deep learning associated with neural networks, a combination of hardware and software that can perform this kind of brain-style calculation.


It’s most logical to use deep learning on very large, complex problems. Recommendation engines (think Netflix or Amazon) commonly use deep learning.
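As a rough, hedged sketch of "multiple layers of calculation" (plain NumPy, untrained random weights, invented layer sizes), the code below stacks a couple of layers so that later layers operate on the outputs of earlier ones.

```python
# Minimal sketch: a tiny multi-layer ("deep") forward pass in plain NumPy, just to
# show the idea of stacked layers. The weights are random, so this network is
# untrained and purely illustrative; a real system would learn these weights.
import numpy as np

rng = np.random.default_rng(0)

def layer(x, n_out):
    """One fully connected layer followed by a ReLU non-linearity."""
    w = rng.normal(size=(x.shape[-1], n_out))
    return np.maximum(0.0, x @ w)

x = rng.normal(size=(1, 64))            # stand-in for raw input (e.g. image pixels)
h1 = layer(x, 32)                       # early layer: low-level patterns (lines, edges)
h2 = layer(h1, 16)                      # later layer: higher-level features (fur, eyes)
scores = h2 @ rng.normal(size=(16, 2))  # final layer: "cat" vs "not cat" scores
print("Class scores:", scores.round(2))
```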

Seen on the Huffington Post

1 Feb 2018

30 Years of the Data Warehouse


It is now exactly 30 years since Barry Devlin published the first article describing the architecture of a data warehouse.

Download the historic article

Original publication: “An architecture for a business and information system”, B. A. Devlin, P. T. Murphy, IBM Systems Journal, Volume 27, Number 1, Page 60 (February, 1988)