spark driver pod stuck in Waiting: PodInitializing state in Kubernetes


spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

purna pradeep
I'm running a Spark 2.3 job on a Kubernetes cluster.

kubectl version

    Client Version: version.Info{Major:"1", Minor:"9", GitVersion:"v1.9.3", GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", BuildDate:"2018-02-09T21:51:06Z", GoVersion:"go1.9.4", Compiler:"gc", Platform:"darwin/amd64"}
    
    Server Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.3", GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}



When I run spark-submit against the k8s master, the driver pod gets stuck in the Waiting: PodInitializing state.
In that case I have to kill the driver pod manually and submit a new job, and then it works.
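
For reference, the manual workaround is just deleting the stuck driver pod and resubmitting (a sketch; the pod name is the one from the describe output further down, and the namespace is spark):

    kubectl delete pod accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver -n spark
    # ...then run spark-submit again for the same job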


This happens when I submit jobs almost in parallel, i.e., 5 jobs one right after the other, as sketched below.
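
The submissions are essentially back-to-back spark-submit calls along these lines (a sketch, not the exact script; the app name, class, jars, and image are placeholders reconstructed from the SPARK_JAVA_OPT_* values in the pod description further down):

    for i in 1 2 3 4 5; do
      bin/spark-submit \
        --master k8s://https://kubernetes.default \
        --deploy-mode cluster \
        --name accelerate-testing-$i \
        --class com.myclass \
        --conf spark.kubernetes.namespace=spark \
        --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
        --conf spark.kubernetes.container.image=<image>:v2.3.0 \
        --conf spark.executor.instances=10 \
        --conf spark.executor.memory=10g \
        --conf spark.executor.cores=2 \
        --conf spark.driver.memory=2g \
        s3a://my/my.jar &
    done
    wait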

I'm running Spark jobs on 20 nodes, each with the configuration below.

I ran kubectl describe node on the node where the driver pod is running, and this is what I got. I do see that resources on the node are overcommitted, but I expected the Kubernetes scheduler not to schedule a pod onto a node whose resources are overcommitted or that is in the Not Ready state. In this case the node is Ready, but I observe the same behaviour when a node is Not Ready.



    Name:               **********
    Roles:              worker
    Labels:             beta.kubernetes.io/arch=amd64
                        beta.kubernetes.io/os=linux
                        kubernetes.io/hostname=****
                        node-role.kubernetes.io/worker=true
    Annotations:        node.alpha.kubernetes.io/ttl=0
    Taints:             <none>
    CreationTimestamp:  Tue, 31 Jul 2018 09:59:24 -0400
    Conditions:
      Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
      ----             ------  -----------------                 ------------------                ------                       -------
      OutOfDisk        False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientDisk     kubelet has sufficient disk space available
      MemoryPressure   False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasSufficientMemory   kubelet has sufficient memory available
      DiskPressure     False   Tue, 14 Aug 2018 09:31:20 -0400   Tue, 31 Jul 2018 09:59:24 -0400   KubeletHasNoDiskPressure     kubelet has no disk pressure
      Ready            True    Tue, 14 Aug 2018 09:31:20 -0400   Sat, 11 Aug 2018 00:41:27 -0400   KubeletReady                 kubelet is posting ready status. AppArmor enabled
    Addresses:
      InternalIP:  *****
      Hostname:    ******
    Capacity:
     cpu:     16
     memory:  125827288Ki
     pods:    110
    Allocatable:
     cpu:     16
     memory:  125724888Ki
     pods:    110
    System Info:
     Machine ID:                 *************
     System UUID:                **************
     Boot ID:                    1493028d-0a80-4f2f-b0f1-48d9b8910e9f
     Kernel Version:             4.4.0-1062-aws
     OS Image:                   Ubuntu 16.04.4 LTS
     Operating System:           linux
     Architecture:               amd64
     Container Runtime Version:  docker://Unknown
     Kubelet Version:            v1.8.3
     Kube-Proxy Version:         v1.8.3
    PodCIDR:                     ******
    ExternalID:                  **************
    Non-terminated Pods:         (11 in total)
      Namespace                  Name                                                            CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ---------                  ----                                                            ------------  ----------  ---------------  -------------
      kube-system                calico-node-gj5mb                                               250m (1%)     0 (0%)      0 (0%)           0 (0%)
      kube-system                kube-proxy-****************************************             100m (0%)     0 (0%)      0 (0%)           0 (0%)
      kube-system                prometheus-prometheus-node-exporter-9cntq                       100m (0%)     200m (1%)   30Mi (0%)        50Mi (0%)
      logging                    elasticsearch-elasticsearch-data-69df997486-gqcwg               400m (2%)     1 (6%)      8Gi (6%)         16Gi (13%)
      logging                    fluentd-fluentd-elasticsearch-tj7nd                             200m (1%)     0 (0%)      612Mi (0%)       0 (0%)
      rook                       rook-agent-6jtzm                                                0 (0%)        0 (0%)      0 (0%)           0 (0%)
      rook                       rook-ceph-osd-10-6-42-250.accel.aws-cardda.cb4good.com-gwb8j    0 (0%)        0 (0%)      0 (0%)           0 (0%)
      spark                      accelerate-test-5-a3bfb8a597e83d459193a183e17f13b5-exec-1       2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
      spark                      accelerate-testing-1-8ed0482f3bfb3c0a83da30bb7d433dff-exec-5    2 (12%)       0 (0%)      10Gi (8%)        12Gi (10%)
      spark                      accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver    1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
      spark                      accelerate-testing-2-e8bd0607cc693bc8ae25cc6dc300b2c7-driver    1 (6%)        0 (0%)      2Gi (1%)         2432Mi (1%)
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      CPU Requests  CPU Limits  Memory Requests  Memory Limits
      ------------  ----------  ---------------  -------------
      7050m (44%)   1200m (7%)  33410Mi (27%)    45874Mi (37%)
    Events:         <none>
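
Something like the following can be used to spot pods that are scheduled but stuck before initialization (a sketch; the pod name is a placeholder):

    # pods that are scheduled but not yet running
    kubectl get pods -n spark --field-selector=status.phase=Pending
    # init-container state of a specific driver pod
    kubectl get pod <driver-pod-name> -n spark -o jsonpath='{.status.initContainerStatuses[*].state}'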


kubectl describe pod on the stuck driver gives the output below:

    Name:         accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
    Namespace:    spark
    Node:         ****
    Start Time:   Mon, 13 Aug 2018 16:18:34 -0400
    Labels:       launch-id=k8s-submit-service-cddf45ff-0d88-4681-af85-d8ed0359ce73
                  spark-app-selector=spark-63f536fd87f8457796802767922ef7d9
                  spark-role=driver
    Annotations:  spark-app-name=accelerate-testing-2
    Status:       Pending
    IP:           
    Init Containers:
      spark-init:
        Container ID:  
        Image:         ****:v2.3.0
        Image ID:      
        Port:          <none>
        Args:
          init
          /etc/spark-init/spark-init.properties
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Environment:    <none>
        Mounts:
          /etc/spark-init from spark-init-properties (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
          /var/spark-data/spark-files from download-files-volume (rw)
          /var/spark-data/spark-jars from download-jars-volume (rw)
    Containers:
      spark-kubernetes-driver:
        Container ID:  
        Image:         ******:v2.3.0
        Image ID:      
        Port:          <none>
        Args:
          driver
        State:          Waiting
          Reason:       PodInitializing
        Ready:          False
        Restart Count:  0
        Limits:
          memory:  2432Mi
        Requests:
          cpu:     1
          memory:  2Gi
        Environment:
          SPARK_DRIVER_MEMORY:        2g
          SPARK_DRIVER_CLASS:         com.myclass
          SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
          SPARK_MOUNTED_CLASSPATH:    /var/spark-data/spark-jars/quantum-workflow-2.2.24.0-SNAPSHOT-assembly.jar:/var/spark-data/spark-jars/my.jar
          SPARK_MOUNTED_FILES_DIR:    /var/spark-data/spark-files
          SPARK_JAVA_OPT_0:           -Dspark.kubernetes.container.image=***
          SPARK_JAVA_OPT_1:           -Dspark.jars=s3a://my/my.jar,s3a://my/my1.jar
          SPARK_JAVA_OPT_2:           -Dspark.submit.deployMode=cluster
          SPARK_JAVA_OPT_3:           -Dspark.driver.blockManager.port=7079
          SPARK_JAVA_OPT_4:           -Dspark.executor.memory=10g
          SPARK_JAVA_OPT_5:           -Dspark.app.id=spark-63f536fd87f8457796802767922ef7d9
          SPARK_JAVA_OPT_6:           -Dspark.kubernetes.authenticate.driver.serviceAccountName=spark
          SPARK_JAVA_OPT_7:           -Dspark.master=k8s://https://kubernetes.default
          SPARK_JAVA_OPT_8:           -Dspark.driver.host=spark-1534191513364-driver-svc.spark.svc
          SPARK_JAVA_OPT_9:           -Dspark.executor.cores=2
          SPARK_JAVA_OPT_10:          -Dspark.kubernetes.executor.podNamePrefix=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba
          SPARK_JAVA_OPT_11:          -Dspark.driver.port=7078
          SPARK_JAVA_OPT_12:          -Dspark.kubernetes.namespace=spark
          SPARK_JAVA_OPT_13:          -Dspark.executor.memoryOverhead=2g
          SPARK_JAVA_OPT_14:          -Dspark.kubernetes.initContainer.configMapName=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
          SPARK_JAVA_OPT_15:          -Dspark.kubernetes.initContainer.configMapKey=spark-init.properties
          SPARK_JAVA_OPT_16:          -Dspark.executor.instances=10
          SPARK_JAVA_OPT_17:          -Dspark.memory.fraction=0.6
          SPARK_JAVA_OPT_18:          -Dspark.driver.memory=2g
          SPARK_JAVA_OPT_19:          -Dspark.kubernetes.driver.pod.name=accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver
          SPARK_JAVA_OPT_20:          -Dspark.app.name=accelerate-testing-2
          SPARK_JAVA_OPT_21:          -Dspark.kubernetes.driver.label.launch-id=********
        Mounts:
          /var/run/secrets/kubernetes.io/serviceaccount from spark-token-mj86g (ro)
          /var/spark-data/spark-files from download-files-volume (rw)
          /var/spark-data/spark-jars from download-jars-volume (rw)
    Conditions:
      Type           Status
      Initialized    False 
      Ready          False 
      PodScheduled   True 
    Volumes:
      spark-init-properties:
        Type:      ConfigMap (a volume populated by a ConfigMap)
        Name:      accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-init-config
        Optional:  false
      download-jars-volume:
        Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:  
      download-files-volume:
        Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
        Medium:  
      spark-token-mj86g:
        Type:        Secret (a volume populated by a Secret)
        SecretName:  spark-token-mj86g
        Optional:    false
    QoS Class:       Burstable
    Node-Selectors:  <none>
    Tolerations:     <none>
    Events:
      Type     Reason          Age                  From                                               Message
      ----     ------          ----                 ----                                               -------
      Normal   SandboxChanged  44m (x518 over 18h)  kubelet, ****************************  Pod sandbox changed, it will be killed and re-created.
      Warning  FailedSync      19s (x540 over 18h)  kubelet, ****************************  Error syncing pod
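
The SandboxChanged/FailedSync events suggest the kubelet keeps tearing down and re-creating the pod sandbox before the init container ever starts. The init container's logs (it is named spark-init, per the output above) and the kubelet logs on that node can be checked with something like:

    # init-container logs of the stuck driver (pod name from above)
    kubectl logs accelerate-testing-2-8cecc18bb42f31a386c6304bd63e9eba-driver -n spark -c spark-init
    # kubelet logs on the node (assuming a systemd-based node)
    journalctl -u kubelet --since "1 hour ago"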


Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

purna pradeep
Hello,

As described above: I'm running a Spark 2.3 job on a Kubernetes cluster (client v1.9.3, server v1.8.3), and when I submit about 5 jobs in quick succession the driver pod gets stuck in the Waiting: PodInitializing state. I have to kill the driver pod manually and submit a new job, and then it works. How can this be handled in production? (The full kubectl describe node and kubectl describe pod outputs are in the first message.)


Re: spark driver pod stuck in Waiting: PodInitializing state in Kubernetes

purna pradeep
Resurfacing the question to get more attention.

Hello,

As above: when I run spark-submit the driver pod gets stuck in the Waiting: PodInitializing state, and I have to kill it manually and submit a new job, and then it works. How can this be handled in production? This happens with executor pods as well, and again only when I submit about 5 jobs almost in parallel, one right after the other (see the kubectl describe node and pod outputs in the first message).
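
The stuck executor pods can be listed via the spark-role label that Spark puts on the pods it creates (a sketch; the driver output above shows spark-role=driver, and executors are labelled analogously):

    kubectl get pods -n spark -l spark-role=executor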


