Data Pipelines
Data Pipelines for Data Analytics
For a successful implementation of Data Analytics or Machine Learning algorithms, data is the key. When the analysis is done in real time, how fast the data can be streamed (through a "pipe") becomes very important. When Machine Learning models are "fitted" with training data and verified with test data, the data is assumed to be available locally in a usable format. But when the model is deployed in production, data is not always available locally, or in a form that the model can readily use.
This non-locality and these format inconsistencies make the case for a dependable "pipeline" that a business can rely on, so that data flows from several origins (data points) and in the process gets converted, cleaned, and normalized (very important when the metrics are not on the same scale).
For example, let's take the case of an autonomous car that you are sitting in, expecting a smooth ride. When the car was manufactured, many algorithms were used to create "models" that are expected to function in real time using data from several sources:
1) Internal car hardware
2) Car sensors monitoring external conditions
3) Weather data from satellites
4) Local data from the county
5) Cell phone data to keep you connected
6) Data from other cars
and so on...
All of these are common-sense data points, but there could be several more needed to make the car run without manual intervention. The point, again, is that all of this data needs to be streamed in real time to the programs that run the AI, and meeting the Service Level Agreement is very important.
Types:
- Real Time
- Batch
Data Stages:
- Data ingestion
- Data transportation
- Data storage for analysis
- Data analysis
- Data storage for analyzed data
- Data communication to requesting parties (HTTP REST calls, WebSockets, HTTP long poll)
- Data security
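To make these stages concrete, here is a framework-free sketch that wires a few of them together as plain Python functions. Every name, the endpoint, and the record layout are hypothetical placeholders, not a specific framework API.

```python
# A minimal, framework-free sketch of a few of the stages above.
import json
from urllib.request import urlopen

def ingest(url):
    """Data ingestion: pull raw records from a (hypothetical) JSON endpoint."""
    with urlopen(url) as resp:
        return json.load(resp)

def transform(records):
    """Cleaning/normalization: drop incomplete rows."""
    return [r for r in records if r.get("value") is not None]

def store(records, path):
    """Storage for analysis: persist the cleaned records."""
    with open(path, "w") as f:
        json.dump(records, f)

def analyze(records):
    """Analysis: compute a simple aggregate."""
    values = [r["value"] for r in records]
    return sum(values) / len(values) if values else 0.0

def run_pipeline(url, path="clean.json"):
    records = transform(ingest(url))
    store(records, path)
    return analyze(records)
```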
Data Pipeline Examples:
#1 AWS Data Pipeline can be used to schedule tasks that retrieve data from DynamoDB, relational databases, and other data sources, process it, and even produce reports on it.
#2 ETL - data warehousing tools typically used for RDBMS data systems.
#3 Kafka, Spark: a good solution for streaming data (see the consumer sketch after this list).
#4 Luigi: a Python package typically used for job scheduling and batch processing.
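As an illustration of the streaming case in #3, here is a minimal consumer sketch using the kafka-python package; in practice Spark (e.g. Structured Streaming) would typically sit downstream for heavier transformations. The topic name and broker address are assumptions for a local setup.

```python
# A minimal streaming sketch for #3 using kafka-python (pip install kafka-python).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "sensor-readings",                    # hypothetical topic
    bootstrap_servers="localhost:9092",   # assumed local broker
    auto_offset_reset="earliest",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # hand each record to the analysis step in near real time
    print(record)
```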
Home-Grown Pipeline Example:
I created a data pipeline that works fine for large data sets and is also scalable at the production level:
Step 1: Create a Python program that reads, cleans, and processes the data.
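For instance, a minimal sketch of such a program using pandas; the file name, column names, and the normalization step are assumptions, not part of any specific pipeline.

```python
# A minimal sketch of Step 1: read, clean, and process a CSV file with pandas.
import pandas as pd

def process_file(path="raw_data.csv"):
    df = pd.read_csv(path)                  # read
    df = df.dropna()                        # clean: drop incomplete rows
    df["value"] = (df["value"] - df["value"].mean()) / df["value"].std()  # normalize
    return df.groupby("category")["value"].mean()   # process: a simple aggregate

if __name__ == "__main__":
    print(process_file())
```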
Step 2: Install and run Celery in your environment to convert the Python module into a task/process.
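A minimal sketch of the Celery wrapper, assuming a local Redis broker and that the Step 1 code lives in a hypothetical module named step1:

```python
# Save as tasks.py. Broker URL and module name are assumptions.
from celery import Celery
from step1 import process_file   # hypothetical module holding the Step 1 code

app = Celery("pipeline", broker="redis://localhost:6379/0")

@app.task
def process_file_task(path):
    # Run the Step 1 read/clean/process routine as a background task and
    # return a JSON-serializable result.
    return process_file(path).to_dict()
```

You would then start a worker with `celery -A tasks worker --loglevel=info` and enqueue work with `process_file_task.delay("raw_data.csv")`.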
Step 3: Use Ansible for software deployment onto the server. This is an automation tool that enables continuous deployment with minimum downtime.
Step 4: Now we need to fine-tune our pipeline to work in "real time" in a distributed environment. Each queue holds tasks of a certain priority. For production we need at least three priorities: High, Medium, and Low. Create three Celery queues for these.
When submitting a task, we specify which queue it goes to, so that it ends up getting the appropriate resources.
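A sketch of how the three queues could be declared on the same hypothetical Celery app from Step 2; queue names, task names, and the broker URL are placeholders.

```python
# A sketch of Step 4: three priority queues and simple routing rules.
from celery import Celery
from kombu import Queue

app = Celery("pipeline", broker="redis://localhost:6379/0")

# Declare the three priority queues and route tasks to them by name.
app.conf.task_queues = (Queue("high"), Queue("medium"), Queue("low"))
app.conf.task_default_queue = "medium"
app.conf.task_routes = {
    "pipeline.alert": {"queue": "high"},
    "pipeline.cleanup": {"queue": "low"},
}

@app.task(name="pipeline.alert")
def alert(payload):
    ...   # time-critical work lands on the "high" queue

# The queue can also be chosen per call:
# alert.apply_async(args=[payload], queue="high")
```

Dedicated workers can then be pinned to a queue, e.g. `celery -A tasks worker -Q high` for the time-critical tasks.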
Step 5: Scalability: how do we deal with large datasets, say 100 TB?
There are a couple of options: Spark or Dask.
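To give a flavour of the Dask route, here is a minimal sketch of the same read/clean/aggregate flow from Step 1 applied to many files at once; the glob pattern and column names are hypothetical, and the same computation can be pointed at a Dask cluster to scale out.

```python
# A minimal Dask sketch: the Step 1 flow over data too large for one machine's memory.
import dask.dataframe as dd

df = dd.read_csv("data/part-*.csv")   # lazily partitions the files
df = df.dropna()
result = df.groupby("category")["value"].mean().compute()   # triggers execution
print(result)
```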
Other scalable software for data pipelines based on Directed Acyclic Graphs (DAGs):
> Luigi (a Spotify product)
> Airflow (an Airbnb product and now an Apache incubator project)
Some commonalities of these pipeline management tools:
1) Both are Python-based solutions for ETL or data pipeline management
2) Both integrate well with Apache Hadoop, Amazon S3, and various databases and file systems (e.g., HDFS)
3) Both manage their own database to keep a log of task performance, issues, and status
Differences:
1) Airflow supports scheduling via an executor (e.g., the Celery Executor). Luigi depends on cron to schedule tasks.
2) Luigi has proven to be stable in large enterprises dealing with complex processes. Airflow is not yet proven to be stable in large environments; the complexity is mainly in dealing with DAGs.
3) Airflow has a nice GUI; Luigi's is not very useful.
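To make the DAG idea concrete, here is a minimal Luigi sketch: each task declares its upstream dependency in requires() and its artifact in output(), and Luigi assembles the DAG from those declarations. The task names and file contents are hypothetical.

```python
# A minimal Luigi DAG: Aggregate depends on Extract.
import luigi

class Extract(luigi.Task):
    def output(self):
        return luigi.LocalTarget("raw.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("category,value\na,1\nb,2\n")

class Aggregate(luigi.Task):
    def requires(self):
        return Extract()          # edge in the DAG

    def output(self):
        return luigi.LocalTarget("summary.txt")

    def run(self):
        with self.input().open() as f, self.output().open("w") as out:
            rows = f.read().splitlines()[1:]
            out.write(f"rows={len(rows)}\n")

if __name__ == "__main__":
    luigi.build([Aggregate()], local_scheduler=True)
```

Running this file executes Extract before Aggregate using Luigi's local scheduler; Airflow expresses the same kind of dependency by chaining operators with `>>` inside a DAG object.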