Speaker
黃泰瑋 (Tai-Wei Huang)
Material
Note
design large-scale pipeline
- separate business logic from DAG
- reduce DAG processing time
- don't import heavy modules at the top (global) level of the DAG file
- import them inside the task function instead
- use .airflowignore to skip files that are not DAGs
- customize operator
- don't do heavy computations in __init__ (it runs every time the DAG file is parsed)
- do the work in the run-time methods instead: pre_execute(), execute(), post_execute()
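- prefer Jinja templates over Variable.get(), which queries the metadata database on every parse when called at the top level of a DAG file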
- decouple logic from airflow → make unit tests easier to write
- DAG generator
- extract similar parts among data pipelines
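A minimal sketch of the "decouple logic from Airflow" point above, assuming a hypothetical function `summarize_orders` and made-up order rows (neither comes from the talk). The business logic imports nothing from Airflow, so a plain unit test can exercise it without a scheduler or a DAG:

```python
# Business logic lives in its own module and knows nothing about Airflow.
def summarize_orders(rows):
    """Return the total amount per customer (hypothetical example logic)."""
    totals = {}
    for row in rows:
        customer = row["customer"]
        totals[customer] = totals.get(customer, 0) + row["amount"]
    return totals


# A plain unit test: no DAG, no operator, no Airflow installation required.
def test_summarize_orders():
    rows = [
        {"customer": "alice", "amount": 10},
        {"customer": "bob", "amount": 7},
        {"customer": "alice", "amount": 5},
    ]
    assert summarize_orders(rows) == {"alice": 15, "bob": 7}


test_summarize_orders()

# In the DAG file, a thin task would only load the rows and call
# summarize_orders(rows); the DAG itself then stays small and fast to parse.
```

The DAG file then only wires tasks together, which also fits the DAG generator idea: the shared function can be reused by every generated pipeline.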
run large-scale pipeline
- Runner
- Celery Executor
- scale up: worker_concurrency, worker_autoscale
- scale out: run more workers
- Kubernetes Executor
- Airflow-level parameters: max_active_tasks_per_dag, max_active_runs_per_dag
- DAG-level parameters: max_active_runs, max_active_tasks
- task-level parameters: max_active_tis_per_dag, pool
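The Airflow-level settings above live in airflow.cfg, and Airflow also reads them from environment variables named `AIRFLOW__{SECTION}__{KEY}`. The sketch below builds those variable names; the numeric values are illustrative assumptions, not recommendations from the talk, and the helper `to_env` is hypothetical:

```python
# Illustrative values only (assumptions); tune them for your own cluster.
# worker_autoscale ("max,min") is the alternative to worker_concurrency in the
# [celery] section; set one or the other.
AIRFLOW_SETTINGS = {
    ("core", "max_active_tasks_per_dag"): "32",
    ("core", "max_active_runs_per_dag"): "4",
    ("celery", "worker_concurrency"): "16",
}


def to_env(settings):
    """Map (section, key) pairs to AIRFLOW__SECTION__KEY environment names."""
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": value
        for (section, key), value in settings.items()
    }


env = to_env(AIRFLOW_SETTINGS)
for name, value in sorted(env.items()):
    print(f"export {name}={value}")
```

The DAG-level and task-level parameters are not configuration keys; they are passed as arguments when the DAG or the operator is created.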
manage large-scale pipeline
- set up access_control (DAG-level permissions per role)
- separate runtime environment → executor_config
- cluster policy: task_policy, dag_policy, task_instance_mutation_hook
- failure management: retries, sla, sla_miss_callback, on_failure_callback
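A sketch of how cluster policy and failure management can work together, assuming the code is placed in `airflow_local_settings.py`, where Airflow looks for a `task_policy(task)` function. The threshold `MIN_RETRIES` and the callback `notify_failure` are hypothetical names chosen for illustration; a real callback would likely send an alert rather than only log:

```python
import logging
from types import SimpleNamespace

log = logging.getLogger(__name__)

MIN_RETRIES = 2  # hypothetical cluster-wide minimum, not from the talk


def notify_failure(context):
    """Failure callback: Airflow passes the task context as a dict."""
    ti = context["task_instance"]
    message = (
        f"Task {ti.dag_id}.{ti.task_id} failed on try {ti.try_number}: "
        f"{context.get('exception')}"
    )
    log.error(message)
    return message


def task_policy(task):
    """Cluster policy hook: mutate every task as it is loaded."""
    if (task.retries or 0) < MIN_RETRIES:
        task.retries = MIN_RETRIES
    if not task.on_failure_callback:
        task.on_failure_callback = notify_failure


# Stand-in object for demonstration only; Airflow passes a real operator.
fake_task = SimpleNamespace(retries=0, on_failure_callback=None)
task_policy(fake_task)
print(fake_task.retries, fake_task.on_failure_callback.__name__)
```

Enforcing defaults in one policy keeps individual DAG files free of repeated retry and alerting boilerplate.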