How to Let Tasks Running on Airflow Use GPU Resources
Step 1. Add a GPU node pool to the existing Airflow cluster
Use the --accelerator flag to specify the GPU type and count.
gcloud beta container node-pools create "gpu-pool" \
--cluster "mlworkflow-24aeef46-gke" \
--zone "us-east1-b" \
--machine-type "n1-standard-4" \
--accelerator "type=nvidia-tesla-p100,count=1" \
--image-type "COS_CONTAINERD" \
--num-nodes "3"
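If the node pool is created from a script rather than typed by hand, the same command can be assembled as an argument list. This is only a sketch: the helper name gpu_node_pool_args and its default values are hypothetical, and the flags are exactly the ones used in the command above.

```python
import shlex


def gpu_node_pool_args(pool_name, cluster, zone,
                       machine_type="n1-standard-4",
                       gpu_type="nvidia-tesla-p100", gpu_count=1,
                       image_type="COS_CONTAINERD", num_nodes=3):
    """Build the argv list for the gcloud command shown above.

    Hypothetical helper: it only formats arguments and does not call gcloud.
    """
    return [
        "gcloud", "beta", "container", "node-pools", "create", pool_name,
        "--cluster", cluster,
        "--zone", zone,
        "--machine-type", machine_type,
        # --accelerator takes the GPU type and count as one comma-separated value
        "--accelerator", f"type={gpu_type},count={gpu_count}",
        "--image-type", image_type,
        "--num-nodes", str(num_nodes),
    ]


# Print a shell-quoted version of the command for review before running it
print(shlex.join(gpu_node_pool_args("gpu-pool", "mlworkflow-24aeef46-gke", "us-east1-b")))
```

The list can be passed to subprocess.run as is, which avoids shell quoting problems in the accelerator value.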
Step 2. Verify that the GPU works
Deploy a Pod that runs the nvidia-smi command.
Note: when GCP creates a GPU node pool, it automatically adds the following taint to the nodes:
spec:
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: present
The Pod manifest therefore has to include a matching toleration:
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: mig-none-example
  name: mig-none-example
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - image: nvidia/cuda:11.0-base
    name: mig-none-example
    # Keep the container running so that kubectl exec can attach to it
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: "1"
  restartPolicy: Always
status: {}
EOF
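To see why this toleration lets the Pod land on the tainted GPU nodes, here is a simplified Python model of the matching rule Kubernetes applies. The Taint and Toleration classes and the tolerates function are illustrative names made up for this sketch, not part of any Kubernetes client library, and the model ignores details such as tolerationSeconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    value: str
    effect: str


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"   # Kubernetes defaults to Equal when omitted
    value: str = ""
    effect: str = ""          # empty effect matches every taint effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Simplified model of whether a toleration matches a taint."""
    # A non-empty effect must be identical to the taint's effect
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # Exists ignores the value; an empty key matches any key
        return toleration.key == "" or toleration.key == taint.key
    if toleration.operator == "Equal":
        # Equal requires both key and value to be identical
        return toleration.key == taint.key and toleration.value == taint.value
    return False


# The taint that is added to GPU node pools, as listed above
gpu_taint = Taint(key="nvidia.com/gpu", value="present", effect="NoSchedule")

# The toleration used in the Pod manifest above
pod_toleration = Toleration(key="nvidia.com/gpu", operator="Exists", effect="NoSchedule")
print(tolerates(pod_toleration, gpu_taint))
```

Under this model, an Equal toleration with value "present" matches the same taint, while an Equal toleration with any other value does not.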
Run nvidia-smi inside the Pod with kubectl:
kubectl exec -it mig-none-example -- nvidia-smi -L
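nvidia-smi -L prints one line per GPU in the form "GPU 0: <name> (UUID: GPU-...)". If the check should run in a script instead of being read by eye, the output can be parsed as in the sketch below. The function name parse_gpu_list is hypothetical, and the sample line uses a made-up placeholder UUID rather than real output from the cluster.

```python
import re

# One line of `nvidia-smi -L` output: "GPU <index>: <name> (UUID: <uuid>)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)$")


def parse_gpu_list(output: str) -> list:
    """Return one dict per GPU found in `nvidia-smi -L` output."""
    gpus = []
    for line in output.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append({
                "index": int(match.group(1)),
                "name": match.group(2),
                "uuid": match.group(3),
            })
    return gpus


# Placeholder sample, not captured from a real node
sample = "GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
gpus = parse_gpu_list(sample)
print(len(gpus), gpus[0]["name"])
```

The text to parse could come from subprocess.run with the kubectl exec command above (without -it) and capture_output=True; an empty result would mean the Pod sees no GPU.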
Step 3. Edit the Airflow DAG file
Import the Kubernetes client models:
from kubernetes.client import models as k8s
Use KubernetesPodOperator to run the GPU task.
Notes:
1. Define a node affinity so that the GPU Pod is scheduled onto the designated node pool
affinity={
    'nodeAffinity': {
        'requiredDuringSchedulingIgnoredDuringExecution': {
            'nodeSelectorTerms': [{
                'matchExpressions': [{
                    'key': 'cloud.google.com/gke-nodepool',
                    'operator': 'In',
                    'values': [
                        "gpu-pool"
                    ]
                }]
            }]
        }
    }
},
2. Define the Pod toleration
tolerations=[
    k8s.V1Toleration(key="nvidia.com/gpu", operator="Equal", value="present")
],
3. Resource Limit
resources=k8s.V1ResourceRequirements(
    limits={'nvidia.com/gpu': '1'}
),
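The node affinity from note 1 is deeply nested, and a typo in the pool name only shows up as a Pod that stays Pending. One option is to generate the dictionary with a small helper, sketched below. The function name node_pool_affinity is hypothetical; the label key cloud.google.com/gke-nodepool and the overall structure are the ones used in this post.

```python
def node_pool_affinity(*pool_names: str) -> dict:
    """Build a required node affinity that pins a Pod to the given GKE node pools.

    Hypothetical helper; it returns the same dictionary layout as the
    affinity argument shown in note 1.
    """
    if not pool_names:
        raise ValueError("at least one node pool name is required")
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        # GKE labels every node with the name of its node pool
                        "key": "cloud.google.com/gke-nodepool",
                        "operator": "In",
                        "values": list(pool_names),
                    }]
                }]
            }
        }
    }


# Pin to the pool created in Step 1
gpu_affinity = node_pool_affinity("gpu-pool")
print(gpu_affinity)
```

The result can then be passed as affinity=gpu_affinity, so the pool name is written in exactly one place.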
The complete KubernetesPodOperator task looks like this:
from datetime import timedelta
# Assumed import path (Airflow 2.x cncf.kubernetes provider); adjust it to your installation
from airflow.providers.cncf.kubernetes.operators import kubernetes_pod as kubernetes_pod_operator

run_notebook = kubernetes_pod_operator.KubernetesPodOperator(
    task_id="run_notebook",
    name="run_notebook",
    is_delete_operator_pod=True,
    image_pull_policy="IfNotPresent",
    startup_timeout_seconds=86400,
    execution_timeout=timedelta(seconds=86400),
    # Request one GPU for the task container
    resources=k8s.V1ResourceRequirements(
        limits={'nvidia.com/gpu': '1'}
    ),
    cmds=['/bin/bash'],
    arguments=["-c",
        """
        # do something ...
        """
    ],
    # Tolerate the taint that is added to GPU nodes
    tolerations=[
        k8s.V1Toleration(key="nvidia.com/gpu", operator="Equal", value="present")
    ],
    affinity={
        'nodeAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': {
                'nodeSelectorTerms': [{
                    'matchExpressions': [{
                        'key': 'cloud.google.com/gke-nodepool',
                        'operator': 'In',
                        # Must match the name of the GPU node pool created in Step 1
                        'values': [
                            "gpu-pool"
                        ]
                    }]
                }]
            }
        }
    },
    image='gcr.io/deeplearning-platform-release/base-cu110:m87'
)