Quality Data - "Sine qua non" for Analytics and Machine Learning


In this blog, I would like to discuss:
1) The importance of data
2) The importance of quality data and the role of data in analytics
3) The importance of data in machine learning

The importance of Data - Sine qua non:

Anything and everything we do on a computer produces data. By analyzing that data, we can measure how efficiently our programs run and identify areas for improvement. Without data, if we rely only on experience and "gut feeling" to find improvements, we may spend resources in entirely the wrong areas. This does not mean experience is worthless, but making business decisions purely on feeling or likelihood, without the backing of data, is risky and expensive.

One of the most recent massive projects we witnessed in the Bay Area is the rebuilding of the Bay Bridge. The project spanned nearly a decade (2002 - 2013) and cost nearly $7 billion, but the actual research work started back in the 1990s. The amount of data gathered from various sources for analysis was tremendous: weather data, Caltrans records, geographical surveys, population figures, and more. For the project to succeed, engineers and analysts created several models to verify that each design solved the problem and would remain beneficial in the future. Here is a link to the research work: https://www.sciencedirect.com/topics/engineering/bay-bridge


(Image: Intelligence Analysis in Market Context. Source: Analysis.org)

The bottom line: without data analysis, no major project can even be started, let alone finished. We often call data a "source of truth" because data cannot lie about the occurrence (or non-occurrence) of an event.

The importance of Quality Data and the Role of Data in Analytics:

For analytics to be useful to a business, data needs the following three characteristics:

1) Availability: The data needs to be available when the analytics program runs (either on demand or as scheduled). Real-time analytics is becoming popular with the advent of event processing and of machine learning capabilities such as anomaly detection.

Real-time analytics has gained further importance because of streaming data: IoT devices continuously pump out performance and event data, and the ability to capture that data and save it for analysis is an absolute necessity. Streaming tools such as Apache Kafka and Spark Streaming help capture data in real time. The bottom line is that data loss means loss of actionable information.
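As a quick illustration, here is a minimal capture sketch using the kafka-python client. The broker address and the "device-metrics" topic are illustrative assumptions, not details from any particular deployment:

```python
# Minimal streaming-capture sketch using kafka-python.
# ASSUMPTIONS: a broker at localhost:9092 and a "device-metrics" topic
# carrying JSON events; both are hypothetical, for illustration only.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "device-metrics",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # replay unread events after a restart
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

for message in consumer:
    event = message.value
    # Persist every event before analyzing it: a lost event is lost insight.
    print(event.get("device_id"), event.get("metric"), event.get("value"))
```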

2) Completeness: Analytical and machine learning models usually depend on several sources and types of data before suggesting a proper analysis or recommendation. For example, an anomaly detection capability may depend on data from sources such as database access, transaction activity, user profile data, user access data, and IP activity.

Imagine being the CIO of a technology company who has to make a decision on IP security. Would you be comfortable making that call when the analysts say, "Boss, we only have 3 of 5 sources of data"? Completeness is therefore critical whenever the algorithm draws on multiple data sources, and it is worth enforcing up front, as in the sketch below.
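A simple pre-flight guard, sketched here in Python with hypothetical source names mirroring the anomaly-detection example above, can enforce that rule before any model runs:

```python
# Pre-flight completeness check. The five source names are hypothetical,
# mirroring the anomaly-detection example above.
REQUIRED_SOURCES = {
    "database_access", "transaction_activity",
    "user_profiles", "user_access", "ip_activity",
}

def check_completeness(available_sources: set) -> None:
    missing = REQUIRED_SOURCES - available_sources
    if missing:
        # Fail fast rather than analyze with 3 of 5 sources.
        raise ValueError(f"Refusing to run: missing sources {sorted(missing)}")

# Example: only three of the five sources arrived, so this raises.
check_completeness({"database_access", "transaction_activity", "user_profiles"})
```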

3) Accuracy: Accuracy could also be called correctness. Data analytics and machine learning algorithms don't usually check the accuracy of incoming data, because doing so would create tremendous overhead. The onus is therefore on the data gatherer, at the origin of the raw data.

Three factors affect the accuracy of data:
1) Data conversion or normalization at the source
2) Data movement causing data to become stale
3) Data that does not correlate well across components. For example, CPU data may be captured every 1 second, I/O to a storage device every 5 seconds, and cache metrics every 3 seconds. Mismatched sampling intervals make analytics complex and can yield misleading values, so the metrics need to be aligned onto a common grid first, as in the sketch below.
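To make factor #3 concrete, here is a small pandas sketch (with synthetic data, since the original metrics aren't available; only the sampling intervals come from the text) that resamples the 1-second, 3-second, and 5-second series onto one 5-second grid before correlating them:

```python
# Aligning metrics captured at different intervals onto one 5-second grid.
# The data here is synthetic; only the sampling intervals come from the text.
import numpy as np
import pandas as pd

def series(freq: str, name: str) -> pd.Series:
    idx = pd.date_range("2024-01-01", periods=60, freq=freq)
    return pd.Series(np.random.rand(60), index=idx, name=name)

cpu, cache, io = series("1s", "cpu"), series("3s", "cache"), series("5s", "io")

# Resample everything to 5-second means so the rows share timestamps.
aligned = pd.concat(
    [cpu.resample("5s").mean(), cache.resample("5s").mean(), io.resample("5s").mean()],
    axis=1,
).dropna()
print(aligned.corr())  # now the correlation compares like with like
```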

As mentioned in factor #2, data accuracy is also affected by data transportation. In today's global business, data is generated across oceans, so as analysts we need to ask whether it is worth transporting raw data over long distances. In my experience, transporting 1 GB of data over 100 miles takes about 1-2 seconds, but what if 100 GB of data is generated every second? What if there is packet loss? This is where Hadoop's concept of moving the computation near the data is beneficial.
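A quick back-of-envelope calculation shows why. Assuming, purely for illustration, a 1 Gbps WAN link (the post doesn't specify one):

```python
# Back-of-envelope transfer time. The 1 Gbps link speed is an assumed
# figure for illustration; the 100 GB/s generation rate is from the text.
link_gbps = 1.0
data_gb_per_second = 100.0

transfer_seconds = (data_gb_per_second * 8) / link_gbps  # GB -> gigabits
print(f"Shipping {data_gb_per_second} GB needs ~{transfer_seconds:.0f} s "
      f"on a {link_gbps} Gbps link")
# ~800 s to move each second's worth of data: the link can never catch up,
# so moving the computation to the data wins.
```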


The importance of data in Machine Learning: 
The very first step in product development is to understand the underlying business problem we are trying to solve. Not many companies have the luxury of starting a project just because it's a "cool" concept if that concept cannot be backed with proper reasoning. The reasoning comes from identifying the underlying information about the problem and gathering the data that transforms into that information. Hence, we are back to the importance of data.

Machine learning is all about problem solving, or, may I put it, solving the problem before it occurs. Hence, identifying business problems is key, and the power of spotting a problem before it occurs is invaluable for a business. For example, a hardware manufacturing company would like to see how its hardware functions over time and record any anomalies from the norm, as in the sketch below.
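As a toy version of that idea, here is a sketch using scikit-learn's IsolationForest on synthetic hardware readings; the temperature and error-rate features are illustrative assumptions, not details from the original example:

```python
# Toy hardware anomaly detection with scikit-learn's IsolationForest.
# The readings are synthetic; the feature names are illustrative assumptions.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=[50.0, 0.02], scale=[5.0, 0.01], size=(500, 2))  # temp C, error rate
faulty = np.array([[85.0, 0.30], [90.0, 0.35]])  # overheating, error-prone units
readings = np.vstack([normal, faulty])

model = IsolationForest(contamination=0.01, random_state=0).fit(readings)
flags = model.predict(readings)  # -1 marks an anomaly, +1 marks normal
print(readings[flags == -1])     # the faulty readings should surface here
```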

Here are the process steps for a machine learning implementation, covering the stages the data flows through and how the ML model finally gets deployed:

Steps covered under Data Analytics:

1) Understand the "business" problem and empathize with the customer. 
2) Discuss the problem at a deeper level to understand the scenarios (also include user personas)
3) Understand the customer environment. (understanding limitations sooner will be beneficial)
4) Identify the data that can confirm the problem and also its symptoms
5) Perform real-time and interval data analysis; this is needed to understand the process and symptoms.
6) Perform root cause analysis. 
7) Identify the cause and effect, what is normal and what is not, and the relevant categories. Identify the peaks and valleys of the business.
8) Clarify everything about the business, data, process, and users.
It is very important to know where the data originates, how it gets transformed, and how it is transported.
It is also important to ask what the customer would like to do about a "potential" issue (alert, alarm, identify the problem sooner, fix the problem before it occurs).

Steps covered under Machine learning: 

1) Categorize the problems and identify the data needed.
2) Prioritize your problems. We don't solve problems just because we can; we solve first the problems that have the biggest impact and importance (put on your Product Manager hat).
Know your KPIs to make sure the project stays on track and is accomplished successfully.
3) Identify which machine learning concept will help the customer.
4) Identify the limitations (for example, the customer cannot install third-party software, has limited resources, or has constraints around production timing and interference).
5) Work with data scientists, explaining the facts and coming up with a feasible machine learning model to implement. The importance of a data scientist should be emphasized here.
6) Communicate to the customer what the solution is and how it will help them.
7) Implement it in phases: gather data for training, validation, and test purposes, and build the solution as loosely coupled microservices (see the split sketch after this list).
8) Gather metrics and improve until the model is close to perfection.
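For the data-gathering part of the phased rollout, a minimal sketch of the training/validation/test split with scikit-learn follows; the 60/20/20 proportions are an illustrative choice, not a rule from the post:

```python
# Splitting gathered data into training, validation, and test sets.
# The 60/20/20 proportions are an assumed, illustrative choice.
import numpy as np
from sklearn.model_selection import train_test_split

dataset = np.arange(1000).reshape(-1, 1)  # stand-in for real feature rows

train_val, test = train_test_split(dataset, test_size=0.20, random_state=0)
train, val = train_test_split(train_val, test_size=0.25, random_state=0)  # 0.25 of 80% = 20%
print(len(train), len(val), len(test))  # 600 200 200
```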

At this stage, the quality of the data is paramount: accuracy, availability, and completeness all matter. As they say, garbage in, garbage out.

Myths: 

1) Machine learning replaces data analytics - Wrong. Machine learning is the logical next step after data analytics.

2) One machine learning model fits all - Wrong. There is no silver-bullet model. A machine learning model is built around a specific problem, the data backing it, and the expected outcome.

3) Once deployed, we are done - Wrong. Usage patterns change frequently. How people used Walmart.com five years ago is completely different from how they use it now: shopping happens 24x7 across various mobile devices. Earlier, business peaks came during a standard 2-3 months of the year; now a sudden spike can consume resources at any time. Identifying trends, anomalies, and patterns helps the model re-learn.


This is a very interesting topic, but the core idea is this: recognizing that you have a problem, then understanding and analyzing the data, is the key to machine learning.
