GPU-based workloads as a part of Airflow DAGs

How to let tasks running on Airflow use GPU resources

Step 1. Add a GPU node pool to the existing Airflow cluster

Use the --accelerator flag to specify the GPU type and count

gcloud beta container node-pools create "gpu-pool" \
  --cluster "mlworkflow-24aeef46-gke" \
  --zone "us-east1-b" \
  --machine-type "n1-standard-4" \
  --accelerator "type=nvidia-tesla-p100,count=1" \
  --image-type "COS_CONTAINERD" \
  --num-nodes "3"
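If the node pool needs to be created from a script rather than typed by hand, the same command can be assembled in Python. This is only a sketch: the function name build_gpu_node_pool_command and its defaults are made up for illustration, and the flags are exactly the ones shown above.

```python
import shlex


def build_gpu_node_pool_command(cluster, zone, pool_name="gpu-pool",
                                machine_type="n1-standard-4",
                                gpu_type="nvidia-tesla-p100", gpu_count=1,
                                num_nodes=3):
    """Return the gcloud argument list for creating a GPU node pool.

    Hypothetical helper: it only mirrors the flags used in the command above.
    """
    return [
        "gcloud", "beta", "container", "node-pools", "create", pool_name,
        "--cluster", cluster,
        "--zone", zone,
        "--machine-type", machine_type,
        # The accelerator spec is a single "type=...,count=..." string
        "--accelerator", f"type={gpu_type},count={gpu_count}",
        "--image-type", "COS_CONTAINERD",
        "--num-nodes", str(num_nodes),
    ]


cmd = build_gpu_node_pool_command("mlworkflow-24aeef46-gke", "us-east1-b")
# Print the command for review before running it
print(shlex.join(cmd))
# To actually create the pool: subprocess.run(cmd, check=True)
```

Passing the arguments as a list keeps each flag value intact without shell quoting, which matters for the comma-separated accelerator spec.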

Step 2. Verify that the GPU works

Deploy a Pod that runs the nvidia-smi command

Note: when GCP creates a GPU node pool, it automatically adds the following taint to the nodes

spec:
  taints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: present

The Pod manifest must therefore include a matching toleration

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  labels:
    run: mig-none-example
  name: mig-none-example
spec:
  tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
  containers:
  - image: nvidia/cuda:11.0-base
    name: mig-none-example
    # Added so the container stays running for kubectl exec
    # (assumes the image provides the sleep command)
    command: ["sleep", "infinity"]
    resources:
      limits:
        nvidia.com/gpu: "1"
  restartPolicy: Always
status: {}
EOF
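To see why this toleration matches the taint that GKE adds, and why the Equal form with value "present" works as well, the matching rule can be sketched in a few lines of Python. The function name tolerates and the plain-dict representation are illustrative only; this is a simplified model of the Kubernetes matching rules, not the scheduler's code.

```python
def tolerates(toleration, taint):
    """Return True if the toleration matches the taint (simplified model)."""
    operator = toleration.get("operator", "Equal")  # Equal is the default
    key = toleration.get("key")
    effect = toleration.get("effect")

    # An empty effect on the toleration matches every taint effect
    if effect and effect != taint["effect"]:
        return False
    # An empty key is only valid with Exists and matches every taint key
    if not key:
        return operator == "Exists"
    if key != taint["key"]:
        return False
    # Exists ignores the value; Equal requires the values to be identical
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


gke_gpu_taint = {"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}

exists_toleration = {"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}
equal_toleration = {"key": "nvidia.com/gpu", "operator": "Equal", "value": "present"}
wrong_value = {"key": "nvidia.com/gpu", "operator": "Equal", "value": "absent"}

print(tolerates(exists_toleration, gke_gpu_taint))  # True
print(tolerates(equal_toleration, gke_gpu_taint))   # True
print(tolerates(wrong_value, gke_gpu_taint))        # False
```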

Run nvidia-smi in the Pod with kubectl

kubectl exec -it mig-none-example -- nvidia-smi -L
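If the check should run automatically, the output of nvidia-smi -L can be parsed to confirm that the expected number of GPUs is visible. A minimal sketch follows; the helper name parse_nvidia_smi_list is hypothetical, and the sample line (including its all-zero UUID) is a made-up placeholder that follows the "GPU <index>: <name> (UUID: <uuid>)" shape the command prints.

```python
import re

# One line per GPU, e.g. "GPU 0: <name> (UUID: GPU-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ([^)]+)\)$")


def parse_nvidia_smi_list(output):
    """Parse the text printed by `nvidia-smi -L` into a list of dicts."""
    gpus = []
    for line in output.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            index, name, uuid = match.groups()
            gpus.append({"index": int(index), "name": name, "uuid": uuid})
    return gpus


# Placeholder output; in practice it could be captured with
# subprocess.run(["kubectl", "exec", "mig-none-example", "--", "nvidia-smi", "-L"],
#                capture_output=True, text=True).stdout
sample = "GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"

gpus = parse_nvidia_smi_list(sample)
print(len(gpus), gpus[0]["name"])
```

An empty list means that no GPU was visible inside the container, which usually points at the resource limit or the node pool rather than at the image.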

Step 3. Edit the Airflow DAG file

Import the k8s client package

from kubernetes.client import models as k8s

Use KubernetesPodOperator to run the GPU task

Notes:

1. Define a node affinity so that the Pod running the GPU task is scheduled onto the designated node pool

affinity={
    'nodeAffinity': {
        'requiredDuringSchedulingIgnoredDuringExecution': {
            'nodeSelectorTerms': [{
                'matchExpressions': [{
                    'key': 'cloud.google.com/gke-nodepool',
                    'operator': 'In',
                    'values': [
                        "gpu-pool"
                    ]
                }]
            }]
        }
    }
},

2. Define the Pod toleration

tolerations=[
    k8s.V1Toleration(key="nvidia.com/gpu", operator="Equal", value="present")
],

3. Set the GPU resource limit

resources=k8s.V1ResourceRequirements(
    limits={'nvidia.com/gpu': '1'}
),
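The node pool name appears both in the gcloud command from Step 1 and in the affinity dict, and a Pod with a required affinity for a pool name that does not exist cannot be scheduled. One way to keep the two in sync is to build the dict from a single constant. This is a small sketch; node_pool_affinity and GPU_POOL are names introduced here for illustration.

```python
def node_pool_affinity(pool_name):
    """Build a required node affinity that pins a Pod to one GKE node pool."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        # Label that GKE sets on every node of a node pool
                        "key": "cloud.google.com/gke-nodepool",
                        "operator": "In",
                        "values": [pool_name],
                    }]
                }]
            }
        }
    }


# Must be the same name that was passed to `node-pools create` in Step 1
GPU_POOL = "gpu-pool"
affinity = node_pool_affinity(GPU_POOL)
print(affinity)
```

The resulting dict can then be passed as the affinity argument of the operator.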

The complete KubernetesPodOperator task looks like this

from datetime import timedelta
# kubernetes_pod_operator is assumed to be imported elsewhere in the DAG file;
# its import path depends on the Airflow version in use.

run_notebook = kubernetes_pod_operator.KubernetesPodOperator(
    task_id="run_notebook",
    name="run_notebook",
    is_delete_operator_pod=True,
    image_pull_policy="IfNotPresent",
    startup_timeout_seconds=86400,
    execution_timeout=timedelta(seconds=86400),
    # Request one GPU for the Pod
    resources=k8s.V1ResourceRequirements(
        limits={'nvidia.com/gpu': '1'}
    ),
    cmds=['/bin/bash'],
    arguments=["-c",
               """
               # do something ...
               """],
    # Tolerate the taint that GKE adds to GPU nodes
    tolerations=[
        k8s.V1Toleration(key="nvidia.com/gpu", operator="Equal", value="present")
    ],
    # Schedule the Pod only onto the GPU node pool
    affinity={
        'nodeAffinity': {
            'requiredDuringSchedulingIgnoredDuringExecution': {
                'nodeSelectorTerms': [{
                    'matchExpressions': [{
                        'key': 'cloud.google.com/gke-nodepool',
                        'operator': 'In',
                        # Must match the node pool name created in Step 1
                        'values': [
                            "gpu-pool"
                        ]
                    }]
                }]
            }
        }
    },
    image='gcr.io/deeplearning-platform-release/base-cu110:m87'
)

[REF]

https://airflow.apache.org/docs/apache-airflow/2.0.2/_modules/airflow/example_dags/example_kubernetes_executor_config.html

https://thenewstack.io/getting-started-with-gpus-in-google-kubernetes-engine/
