Speaker
黃泰瑋 (Tai-Wei Huang)
Material
Note
design large-scale pipeline
- separate business logic from DAG
- reduce DAG processing time
  - don't import modules at the global (module) level
    - import them inside the functions that use them
  - use `.airflowignore`
- customize operator
  - don't do heavy computations in `__init__`
    - do the work in `pre_execute()`, `execute()`, `post_execute()` instead
- use Jinja templates instead of `Variable`
- decouple logic from Airflow → makes unit tests easier to write
- DAG generator
  - extract the parts that similar data pipelines have in common
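
A minimal sketch of how several of the points above might fit together, assuming Airflow 2.4+ (for the `schedule` argument). The names `PIPELINES`, `summarize_orders`, and `build_dag` are hypothetical and not from the talk, and `statistics` only stands in for a heavy dependency. Airflow itself is imported inside `build_dag` here so that the business logic stays importable in tests without Airflow installed.

```python
from datetime import datetime

# Hypothetical per-pipeline settings: only the parts that differ live here.
PIPELINES = {
    "orders_daily": {"schedule": "@daily", "min_amount": 10},
    "orders_hourly": {"schedule": "@hourly", "min_amount": 0},
}


def summarize_orders(amounts, min_amount):
    """Business logic: plain Python with no Airflow dependency, so it is easy to unit test."""
    # Imported inside the function so that parsing the DAG file stays cheap.
    # `statistics` stands in for a heavy library such as pandas.
    import statistics

    kept = [a for a in amounts if a >= min_amount]
    return {"count": len(kept), "mean": statistics.mean(kept) if kept else 0}


def build_dag(dag_id, schedule, min_amount):
    """DAG generator: the structure shared by similar pipelines, written once."""
    # Assumption: Airflow 2.4+ is installed wherever this function is called.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id=dag_id,
        schedule=schedule,
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="summarize",
            python_callable=summarize_orders,
            # Illustrative input only; a real task would read its own data.
            op_kwargs={"amounts": [5, 20, 40], "min_amount": min_amount},
        )
    return dag
```

In a DAG file, the generator would typically be called once per entry, for example `for dag_id, cfg in PIPELINES.items(): globals()[dag_id] = build_dag(dag_id, **cfg)`, so that each generated DAG is visible at module level.
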
run large-scale pipeline
- Runner
  - Celery Executor
    - scale up: `worker_concurrency`, `worker_autoscale`
    - scale out: run more workers
  - Kubernetes Executor
- Airflow-level parameters
  - `max_active_tasks_per_dag`, `max_active_runs_per_dag`
- DAG-level parameters
  - `max_active_runs`, `max_active_tasks`
- task-level parameters
  - `max_active_tis_per_dag`, `pool`
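
To show where each of these parameters lives, here is a small sketch, assuming Airflow 2.2+ option names. All values are illustrative, and `airflow_env_name`, `INSTALLATION_SETTINGS`, `DAG_KWARGS`, `TASK_KWARGS`, and the pool `db_pool` are hypothetical names. The installation-wide options are expressed through Airflow's `AIRFLOW__SECTION__KEY` environment variable convention; the DAG-level and task-level limits are ordinary keyword arguments.

```python
def airflow_env_name(section, key):
    """Airflow reads the option `[section] key` from the env var AIRFLOW__SECTION__KEY."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Installation-wide defaults and Celery worker sizing (illustrative values).
INSTALLATION_SETTINGS = {
    # Scale up: task processes per Celery worker.
    ("celery", "worker_concurrency"): "16",
    # Scale up elastically: "max,min" processes per worker.
    # When this is set, worker_concurrency is ignored.
    ("celery", "worker_autoscale"): "16,4",
    # Defaults applied to every DAG unless the DAG overrides them.
    ("core", "max_active_tasks_per_dag"): "32",
    ("core", "max_active_runs_per_dag"): "4",
}

ENV = {airflow_env_name(s, k): v for (s, k), v in INSTALLATION_SETTINGS.items()}

# DAG-level overrides, passed as keyword arguments to DAG(...).
DAG_KWARGS = {"max_active_runs": 1, "max_active_tasks": 8}

# Task-level limits, passed as keyword arguments to an operator.
# `db_pool` is a hypothetical pool that would have to be created beforehand.
TASK_KWARGS = {"max_active_tis_per_dag": 2, "pool": "db_pool"}
```

Scaling out needs no parameter of its own: it means starting more worker processes or machines against the same queue.
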
manage large-scale pipeline
- set up `access_control`
- separate runtime environment → `executor_config`
- cluster policy
  - `task_policy`, `dag_policy`, `task_instance_mutation_hook`
- failure management
  - `retries`, `sla`, `sla_miss_callback`, `on_failure_callback`
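
A cluster policy can enforce failure-handling defaults across all DAGs. The sketch below assumes it is placed in an `airflow_local_settings.py` module on the scheduler's Python path, where Airflow looks for a `task_policy` function; `notify_failure` and the chosen retry values are hypothetical, not from the talk.

```python
from datetime import timedelta


def notify_failure(context):
    """Hypothetical on_failure_callback; Airflow passes the task context dict."""
    ti = context["task_instance"]
    # A real callback might page someone or post to a chat channel instead.
    print(f"Task failed: {ti.dag_id}.{ti.task_id}")


def task_policy(task):
    """Cluster policy hook: Airflow calls this for each task when a DAG is loaded."""
    # Give every task at least some retries (illustrative values).
    if task.retries < 1:
        task.retries = 2
        task.retry_delay = timedelta(minutes=5)
    # Install a default failure callback only where the DAG author set none.
    if task.on_failure_callback is None:
        task.on_failure_callback = notify_failure
```

Because the hook only mutates attributes on the task it receives, it can be unit tested with a simple stand-in object rather than a real operator.
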