17 Mar 2017

The Dashboard That Tracks Your Whole Life

Anand Sharma records the ups and downs of his life as a way of leaving his health-related data to posterity. On the website of his project, Aprilzero, you can explore every minute detail, and very soon you will be able to publish your own as well.

He is working on a tool that lets anyone monitor themselves. It is a new project, still under development, that is essentially a second version of Aprilzero opened up to the community and that integrates a large amount of data.

If you take a look at his website, you will see that the range of aspects analyzed is incredible, although it lacks some tools, such as reports and ad hoc dashboards, for exploiting all that information.

Seen in el diario

11 Mar 2017

More than 20 Big Data Analysis Techniques and Types

Below we describe the main techniques and types of analysis used in Big Data. They are often lumped together under names such as algorithms or machine learning, but they are not always explained properly.

We have created some online examples using several of these techniques.

If you want to learn more, you can also check out these related posts:

The 53 Keys to Understanding Machine Learning
69 Keys to Understanding Big Data
How to Start Learning Big Data in 2 Hours
Types of Roles in Analytics (Business Intelligence, Big Data)
Free Book: Big Data, the Power of Turning Data into Decisions

So, let's see what these techniques are:

1. A/B testing: A technique in which a control group is compared with a variety of test groups in order to determine what treatments (i.e., changes) will improve a given objective variable, e.g., marketing response rate. This technique is also known as split testing or bucket testing. An example application is determining what copy text, layouts, images, or colors will improve conversion rates on an e-commerce Web site. Big data enables huge numbers of tests to be executed and analyzed, ensuring that groups are of sufficient size to detect meaningful (i.e., statistically significant) differences between the control and treatment groups (see statistics). When more than one variable is simultaneously manipulated in the treatment, the multivariate generalization of this technique, which applies statistical modeling, is often called “A/B/N” testing.
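As a rough illustration of how a difference between a control group and a test group can be checked for statistical significance, here is a minimal sketch of a two-proportion z-test. The function name and the conversion counts are hypothetical, and a real experiment would also need a sample-size plan and a correction for multiple comparisons.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF, via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: control converts 200 of 10,000 visitors, variant 260 of 10,000.
z, p = two_proportion_z_test(200, 10_000, 260, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests that the gap between the two groups is unlikely to be due to chance alone.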

2. Association rule learning: A set of techniques for discovering interesting relationships, i.e., “association rules,” among variables in large databases. These techniques consist of a variety of algorithms to generate and test possible rules. One application is market basket analysis, in which a retailer can determine which products are frequently bought together and use this information for marketing (a commonly cited example is the discovery that many supermarket shoppers who buy diapers also tend to buy beer). Used for data mining.
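To make the idea of testing a candidate rule more concrete, the sketch below scores a single rule with three commonly used measures: support, confidence, and lift. The baskets are invented for illustration, and real algorithms such as Apriori also handle generating the candidate rules, which this sketch does not attempt.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent in b and consequent in b)
    ante = sum(1 for b in baskets if antecedent in b)
    cons = sum(1 for b in baskets if consequent in b)
    support = both / n               # share of baskets containing both items
    confidence = both / ante         # share of antecedent baskets that also hold the consequent
    lift = confidence / (cons / n)   # above 1 means the items co-occur more than chance predicts
    return support, confidence, lift

# Hypothetical shopping baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
    {"beer", "chips"},
]
support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```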

3. Classification: A set of techniques to identify the categories in which new data points belong, based on a training set containing data points that have already been categorized. One application is the prediction of segment-specific customer behavior (e.g., buying decisions, churn rate, consumption rate) where there is a clear hypothesis or objective outcome. These techniques are often described as supervised learning because of the existence of a training set; they stand in contrast to cluster analysis, a type of unsupervised learning. Used for data mining.
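One of the simplest classifiers is k-nearest neighbors, which labels a new point by majority vote among the closest points in the training set. The sketch below is only one possible instance of classification; the features (monthly logins, support tickets) and the labels are hypothetical.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training set: (monthly logins, support tickets) -> known outcome.
training = [
    ((20, 0), "stays"),
    ((18, 1), "stays"),
    ((15, 1), "stays"),
    ((2, 4), "churns"),
    ((3, 5), "churns"),
    ((1, 3), "churns"),
]
print(knn_predict(training, (4, 4)))
print(knn_predict(training, (17, 0)))
```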

4. Cluster analysis: A statistical method for classifying objects that splits a diverse group into smaller groups of similar objects, whose characteristics of similarity are not known in advance. An example of cluster analysis is segmenting consumers into self-similar groups for targeted marketing. This is a type of unsupervised learning because training data are not used. This technique is in contrast to classification, a type of supervised learning. Used for data mining.
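A common way to do this is k-means clustering. The sketch below is a bare-bones version of Lloyd's algorithm that assumes the number of clusters and the starting centroids are supplied by the caller; the customer data (visits per month, average spend) is made up.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (visits per month, average spend).
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
print(clusters)
```

No labels are supplied: the two groups emerge only from the distances between points.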

5. Crowdsourcing: A technique for collecting data submitted by a large group of people or community (i.e., the “crowd”) through an open call, usually through networked media such as the Web. This is a type of mass collaboration and an instance of using Web 2.0.

6. Data fusion and data integration: A set of techniques that integrate and analyze data from multiple sources in order to develop insights in ways that are more efficient and potentially more accurate than if they were developed by analyzing a single source of data. Signal processing techniques can be used to implement some types of data fusion. One example of an application is sensor data from the Internet of Things being combined to develop an integrated perspective on the performance of a complex distributed system such as an oil refinery. Data from social media, analyzed by natural language processing, can be combined with real-time sales data, in order to determine what effect a marketing campaign is having on customer sentiment and purchasing behavior.

7. Data mining: A set of techniques to extract patterns from large datasets by combining methods from statistics and machine learning with database management. These techniques include association rule learning, cluster analysis, classification, and regression. Applications include mining customer data to determine segments most likely to respond to an offer, mining human resources data to identify characteristics of most successful employees, or market basket analysis to model the purchase behavior of customers.

8. Ensemble learning: Using multiple predictive models (each developed using statistics and/or machine learning) to obtain better predictive performance than could be obtained from any of the constituent models. This is a type of supervised learning.

9. Genetic algorithms: A technique used for optimization that is inspired by the process of natural evolution or “survival of the fittest.” In this technique, potential solutions are encoded as “chromosomes” that can combine and mutate. These individual chromosomes are selected for survival within a modeled “environment” that determines the fitness or performance of each individual in the population. Often described as a type of “evolutionary algorithm,” these algorithms are well-suited for solving nonlinear problems. Examples of applications include improving job scheduling in manufacturing and optimizing the performance of an investment portfolio.

10. Machine learning: A subspecialty of computer science (within a field historically called “artificial intelligence”) concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data. A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data. Natural language processing is an example of machine learning.

11. Natural language processing (NLP): A set of techniques from a subspecialty of computer science (within a field historically called “artificial intelligence”) and linguistics that uses computer algorithms to analyze human (natural) language. Many NLP techniques are types of machine learning. One application of NLP is using sentiment analysis on social media to determine how prospective customers are reacting to a branding campaign.

12. Neural networks: Computational models, inspired by the structure and workings of biological neural networks (i.e., the cells and connections within a brain), that find patterns in data. Neural networks are well-suited for finding nonlinear patterns. They can be used for pattern recognition and optimization. Some neural network applications involve supervised learning and others involve unsupervised learning. Examples of applications include identifying high-value customers that are at risk of leaving a particular company and identifying fraudulent insurance claims.

13. Network analysis: A set of techniques used to characterize relationships among discrete nodes in a graph or a network. In social network analysis, connections between individuals in a community or organization are analyzed, e.g., how information travels, or who has the most influence over whom. Examples of applications include identifying key opinion leaders to target for marketing, and identifying bottlenecks in enterprise information flows.
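One simple way to approximate "who has the most influence" is degree centrality: the share of other nodes that each node is directly connected to. The sketch below assumes an undirected network given as a list of edges; the names are invented, and degree is only one of several centrality measures.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Map each node to the fraction of the other nodes it is directly linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}

# Hypothetical "who talks to whom" network.
edges = [("Ana", "Ben"), ("Ana", "Cleo"), ("Ana", "Dan"), ("Ben", "Cleo"), ("Dan", "Eva")]
centrality = degree_centrality(edges)
most_connected = max(centrality, key=centrality.get)
print(most_connected, centrality[most_connected])
```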

14. Optimization: A portfolio of numerical techniques used to redesign complex systems and processes to improve their performance according to one or more objective measures (e.g., cost, speed, or reliability). Examples of applications include improving operational processes such as scheduling, routing, and floor layout, and making strategic decisions such as product range strategy, linked investment analysis, and R&D portfolio strategy. Genetic algorithms are an example of an optimization technique.

15. Pattern recognition: A set of machine learning techniques that assign some sort of output value (or label) to a given input value (or instance) according to a specific algorithm. Classification techniques are an example.

16. Predictive modeling: A set of techniques in which a mathematical model is created or chosen to best predict the probability of an outcome. An example of an application in customer relationship management is the use of predictive models to estimate the likelihood that a customer will “churn” (i.e., change providers) or the likelihood that a customer can be cross-sold another product. Regression is one example of the many predictive modeling techniques.

17. Regression: A set of statistical techniques to determine how the value of the dependent variable changes when one or more independent variables is modified. Often used for forecasting or prediction. Examples of applications include forecasting sales volumes based on various market and economic variables or determining what measurable manufacturing parameters most influence customer satisfaction. Used for data mining.
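The simplest case is a straight line fitted by ordinary least squares to one independent variable. In the sketch below the figures for advertising spend and sales are hypothetical, and the function name is ours.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising spend versus sales, in the same arbitrary units.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]
slope, intercept = fit_line(ad_spend, sales)
forecast = slope * 6.0 + intercept  # predicted sales for a spend of 6.0
print(f"slope={slope:.2f} intercept={intercept:.2f} forecast={forecast:.2f}")
```

The slope estimates how much the dependent variable changes for each unit change in the independent variable.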

18. Sentiment analysis: Application of natural language processing and other analytic techniques to identify and extract subjective information from source text material. Key aspects of these analyses include identifying the feature, aspect, or product about which a sentiment is being expressed, and determining the type, “polarity” (i.e., positive, negative, or neutral) and the degree and strength of the sentiment. Examples of applications include companies applying sentiment analysis to analyze social media (e.g., blogs, microblogs, and social networks) to determine how different customer segments and stakeholders are reacting to their products and actions.

19. Signal processing: A set of techniques from electrical engineering and applied mathematics originally developed to analyze discrete and continuous signals, i.e., representations of analog physical quantities (even if represented digitally) such as radio signals, sounds, and images. This category includes techniques from signal detection theory, which quantifies the ability to discern between signal and noise. Sample applications include modeling for time series analysis or implementing data fusion to determine a more precise reading by combining data from a set of less precise data sources (i.e., extracting the signal from the noise).

20. Spatial analysis: A set of techniques, some applied from statistics, which analyze the topological, geometric, or geographic properties encoded in a data set. Often the data for spatial analysis come from geographic information systems (GIS) that capture data including location information, e.g., addresses or latitude/longitude coordinates. Examples of applications include the incorporation of spatial data into spatial regressions (e.g., how is consumer willingness to purchase a product correlated with location?) or simulations (e.g., how would a manufacturing supply chain network perform with sites in different locations?).

21. Statistics: The science of the collection, organization, and interpretation of data, including the design of surveys and experiments. Statistical techniques are often used to make judgments about what relationships between variables could have occurred by chance (the “null hypothesis”), and what relationships between variables likely result from some kind of underlying causal relationship (i.e., that are “statistically significant”). Statistical techniques are also used to reduce the likelihood of Type I errors (“false positives”) and Type II errors (“false negatives”). An example of an application is A/B testing to determine what types of marketing material will most increase revenue.

22. Supervised learning: The set of machine learning techniques that infer a function or relationship from a set of training data. Examples include classification and support vector machines. This is different from unsupervised learning.

23. Simulation: Modeling the behavior of complex systems, often used for forecasting, predicting and scenario planning. Monte Carlo simulations, for example, are a class of algorithms that rely on repeated random sampling, i.e., running thousands of simulations, each based on different assumptions. The result is a histogram that gives a probability distribution of outcomes. One application is assessing the likelihood of meeting financial targets given uncertainties about the success of various initiatives.
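The financial-target example can be sketched as a small Monte Carlo simulation. Everything here is an assumption made for illustration: each initiative is modeled as an independent success or failure with a fixed payoff, and the sketch reports a single probability rather than the full histogram of outcomes.

```python
import random

def probability_of_meeting_target(initiatives, target, runs=10_000, seed=42):
    """Estimate P(total payoff >= target) by repeated random sampling."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    hits = 0
    for _ in range(runs):
        # Each initiative succeeds independently with its own probability.
        total = sum(
            payoff
            for success_prob, payoff in initiatives
            if rng.random() < success_prob
        )
        if total >= target:
            hits += 1
    return hits / runs

# Hypothetical initiatives: (probability of success, payoff if it succeeds).
initiatives = [(0.8, 50), (0.5, 100), (0.3, 200)]
estimate = probability_of_meeting_target(initiatives, target=150)
print(f"Estimated probability of meeting the target: {estimate:.3f}")
```

For this toy case the exact answer can be worked out by hand (0.3 + 0.7 × 0.8 × 0.5 = 0.58), which gives a useful check on the estimate.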

24. Time series analysis: A set of techniques from both statistics and signal processing for analyzing sequences of data points, representing values at successive times, to extract meaningful characteristics from the data. Examples of time series include the hourly value of a stock market index or the number of patients diagnosed with a given condition every day. Time series forecasting is the use of a model to predict future values of a time series based on known past values of the same or other series. Some of these techniques, e.g., structural modeling, decompose a series into trend, seasonal, and residual components, which can be useful for identifying cyclical patterns in the data. Examples of applications include forecasting sales figures, or predicting the number of people who will be diagnosed with an infectious disease.
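As a very small example, a moving average smooths a series to expose its trend, and the mean of the most recent observations can serve as a naive forecast. This is a deliberately simple sketch with made-up sales figures, not the structural modeling mentioned above.

```python
def moving_average(series, window):
    """Trailing moving average; returns len(series) - window + 1 values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def naive_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    return sum(series[-window:]) / window

# Hypothetical weekly sales figures.
sales = [10, 12, 14, 13, 15, 17, 16, 18]
trend = moving_average(sales, window=3)
next_value = naive_forecast(sales, window=3)
print(trend)
print(next_value)
```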

25. Unsupervised learning: A set of machine learning techniques that finds hidden structure in unlabeled data. Cluster analysis is an example of unsupervised learning (in contrast to supervised learning).

26. Visualization: Techniques used for creating images, diagrams, or animations to communicate, understand, and improve the results of big data analyses.

Seen in Big Data Made Simple

5 Mar 2017

Why Can't I Make Decisions Even Though I Have a Dashboard?

This reflection by Tristán Elosegui is a couple of years old now, but it is very interesting and remains fully relevant. Below we summarize his main points.

At TodoBI we talk a lot about dashboards (see posts). Among those posts, we highlight:

12 free applications for creating dashboards
Tutorial on building open source dashboards
Dashboard examples
Balanced Scorecard

According to Tristán, companies have a huge amount of data within reach, but they are unable to bring order to the chaos and, as a result, lack a clear view of their situation.

The noise is louder than the 'signal'

The volume of data and the speed at which it is generated produce more noise than signal.
This leads companies to make decisions without the data they need, or to fall into post-analysis paralysis, instead of making it easier to act (that is, to make decisions).
Data arrives from different sources, in different formats, from different tools, and so on, and it all ends up in reports that companies try to integrate into a dashboard to help them make decisions.

Why can't companies make strategic decisions when they have so much data?

Having a lot of data does not always mean having a better view of the situation. Surely more than one of you reading this post can relate.
Companies make data-driven decisions every day (and decisions without data, too). The problem is that these decisions are tactical, because they are made in silos (area by area).
To make decisions that optimize the company's overall strategy, we need to:
  • have the data needed to make them, no more and no less (the most complete picture of the context possible), and
  • be able to understand the data, in order to turn it into information and then into knowledge.
There is nothing worse than going all the way to a strategic dashboard only for the person who has to make the decisions not to make them. Why does this happen?

Lack of context

The main reason decisions are not made is that the data shown in the dashboard is not relevant or not actionable.
This happens when the dashboard has not been defined correctly (the right steps are laid out in the digital analytics maturity model). The most common mistakes are:
  • Poorly defined objectives and KPIs: if the starting point is poorly defined, everything that follows will lead us astray. And, of course, the context will be entirely wrong.
  • Irrelevant or non-actionable data: whether because of poorly defined objectives and KPIs or simply because we selected the wrong data, we end up with a dashboard full of numbers and charts that does not let us make decisions. Either it does not show data that matches the decision-maker's area of responsibility, or the data is simply not actionable. In both cases the result is the same.
  • Incomplete data: this is the opposite extreme of the previous case. We are missing the data needed to make decisions.

Data visualization

The second big problem is that the person who has to make the decisions does not understand the data.
Just as we have to show each stakeholder the data relevant to their work (the previous case), we have to adapt the language and the visualization so that the decision-maker understands what they are looking at.
So, for a strategic dashboard to work, you must start by defining the objectives and KPIs well, work on data quality, make sure the data is telling you what you care about, and integrate data from the different sources you handle.

Do not skip any phase of the digital analytics maturity model; otherwise you may run into the problems we have seen in this post.

See the full article