Executor lost for unknown reasons error Spark 2.3 on Kubernetes


Executor lost for unknown reasons error Spark 2.3 on Kubernetes

Mamillapalli, Purna Pradeep

Hello,

 

I'm getting the error below in the Spark driver pod logs. Executor pods are being killed midway through while the job is running, and the driver pod also terminates intermittently with the same error. This happens when I run multiple jobs in parallel.

 

I'm not able to see the executor logs because the executor pods have already been killed.

 

org.apache.spark.SparkException: Job aborted due to stage failure: Task 23 in stage 36.0 failed 4 times, most recent failure: Lost task 23.3 in stage 36.0 (TID 1006, 10.10.125.119, executor 1): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Executor lost for unknown reasons.

Driver stacktrace:

    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)

    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)

    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)

    at scala.Option.foreach(Option.scala:257)

    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)

    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)

    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)

    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)

    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)

    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)

    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)

    ... 42 more

 

 

Thanks,

Purna
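
A frequent cause of "Executor lost for unknown reasons" on Kubernetes is the kubelet OOM-killing or evicting executor pods when they exceed their memory limits or when nodes come under memory pressure, which becomes much more likely when several jobs land on the same nodes at once; because the pod is deleted, the driver only sees the connection drop. The sketch below shows the kind of sizing changes that usually help. It is a minimal example rather than the failing job's actual code: it assumes a Scala job that builds its own SparkSession, the resource values are placeholders to tune for the cluster, and spark.kubernetes.executor.memoryOverhead is the overhead key used by the Spark 2.3 Kubernetes backend (later releases use spark.executor.memoryOverhead).

import org.apache.spark.sql.SparkSession

// Sketch: give each executor explicit off-heap headroom so the pod's memory
// limit is not breached when Netty buffers and other off-heap usage grow.
val spark = SparkSession.builder()
  .appName("parallel-job")                                    // placeholder app name
  .config("spark.executor.instances", "4")                    // fewer, larger executors per job
  .config("spark.executor.memory", "4g")                      // JVM heap per executor
  .config("spark.kubernetes.executor.memoryOverhead", "1024") // extra MiB added to the pod's memory request
  .getOrCreate()

The same settings can equally be passed as --conf options to spark-submit. If executors keep disappearing even with more headroom, the Kubernetes events in the job's namespace (kubectl get events) often still record why a pod was terminated, for example eviction under node memory pressure, for a while after the pod object itself is gone.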






Re: Executor lost for unknown reasons error Spark 2.3 on Kubernetes

purna pradeep
More details from the Spark driver pod logs about an executor pod that died abruptly:


2018-07-30 19:58:41 ERROR TaskSchedulerImpl:70 - Lost executor 3 on 10.*.*.*.*: Executor lost for unknown reasons.

2018-07-30 19:58:41 WARN  TaskSetManager:66 - Lost task 32.0 in stage 9.0 (TID 133, 10.10.147.6, executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Executor lost for unknown reasons.

2018-07-30 19:58:41 WARN  KubernetesClusterSchedulerBackend:66 - Received delete event of executor pod accelerate-snowflake-test-5b6ba9d5495b3ae9a1358ae9c3f9a8c3-exec-3. Reason: null

2018-07-30 19:58:41 WARN  KubernetesClusterSchedulerBackend:347 - Executor with id 3 was not marked as disconnected, but the watch received an event of type DELETED for this executor. The executor may have failed to start in the first place and never registered with the driver.

2018-07-30 19:58:41 INFO  TaskSetManager:54 - Starting task 32.1 in stage 9.0 (TID 134, 10.*.*.*.*, executor 7, partition 32, ANY, 9262 bytes)

2018-07-30 19:58:42 INFO  ContextCleaner:54 - Cleaned accumulator 246

2018-07-30 19:58:42 INFO  ContextCleaner:54 - Cleaned accumulator 252

2018-07-30 19:58:42 INFO  ContextCleaner:54 - Cleaned accumulator 254

2018-07-30 19:58:42 INFO  BlockManagerInfo:54 - Removed broadcast_11_piece0 on spark-1532979165550-driver-svc.spark.svc:7079 in memory (size: 6.9 KB, free: 997.6 MB)

2018-07-30 19:58:42 INFO  BlockManagerInfo:54 - Removed broadcast_11_piece0 on 10.*.*.*.*:43815 on disk (size: 6.9 KB)

2018-07-30 19:58:42 WARN  TransportChannelHandler:78 - Exception in connection from /10.*.*.*.*:37578

java.io.IOException: Connection reset by peer

                at sun.nio.ch.FileDispatcherImpl.read0(Native Method)

                at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)

                at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)

                at sun.nio.ch.IOUtil.read(IOUtil.java:192)

                at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)

                at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)

                at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)

                at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)

                at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)

                at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)

                at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)

                at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)

                at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)

                at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)

                at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)

                at java.lang.Thread.run(Thread.java:748)

2018-07-30 19:58:42 ERROR TransportResponseHandler:154 - Still have 1 requests outstanding when connection from /10.*.*.*.*:37578 is closed

2018-07-30 19:58:42 INFO  KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint:54 - Disabling executor 7.

2018-07-30 19:58:42 INFO  DAGScheduler:54 - Executor lost: 7 (epoch 1)

2018-07-30 19:58:42 WARN  BlockManagerMaster:87 - Failed to remove broadcast 11 with removeFromMaster = true - Connection reset by peer

java.io.IOException: Connection reset by peer

                at sun.nio.ch.FileDispatcherImpl.read0(Native Method)

                at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)

                at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)

                at sun.nio.ch.IOUtil.read(IOUtil.java:192)

                at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)

                at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)

                at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)

                at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)

                at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)

                at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)

                at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)

                at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)

                at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)

                at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)

                at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)

                at java.lang.Thread.run(Thread.java:748)

2018-07-30 19:58:42 INFO  BlockManagerMasterEndpoint:54 - Trying to remove executor 7 from BlockManagerMaster.

2018-07-30 19:58:42 WARN  BlockManagerMasterEndpoint:66 - No more replicas available for rdd_9_37 !

2018-07-30 19:58:42 INFO  BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(7, 10.*.*.*.*, 43888, None)

2018-07-30 19:58:42 INFO  BlockManagerMaster:54 - Removed 7 successfully in removeExecutor

2018-07-30 19:58:42 INFO  DAGScheduler:54 - Shuffle files lost for executor: 7 (epoch 1)

2018-07-30 19:58:42 ERROR ContextCleaner:91 - Error cleaning broadcast 11

org.apache.spark.SparkException: Exception thrown in awaitResult:

                at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)

                at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)

                at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:155)

                at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:321)

                at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)

                at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:66)

                at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:238)

                at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:194)

                at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$1.apply(ContextCleaner.scala:185)

                at scala.Option.foreach(Option.scala:257)

                at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:185)

                at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1319)

                at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:178)

                at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:73)

Caused by: java.io.IOException: Connection reset by peer

                at sun.nio.ch.FileDispatcherImpl.read0(Native Method)

                at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)

                at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)

                at sun.nio.ch.IOUtil.read(IOUtil.java:192)

                at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)

                at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288)

                at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1106)

                at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:343)

                at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:123)

                at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)

                at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)

                at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)

                at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)

                at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)

                at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)

                at java.lang.Thread.run(Thread.java:748)
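
The "No more replicas available for rdd_9_37 !" warning above means the only cached copy of that partition was on the executor that was just lost, so the stage has to recompute it; if more executors disappear during the recompute, a task can exhaust its four attempts, which is exactly the ExecutorLostFailure reported in the original post. One hedged mitigation, assuming a Scala job and a placeholder input path, is to cache the expensive intermediate data with a replicated storage level so that losing a single executor does not take the only copy with it. This only treats the symptom; the pod terminations themselves still need to be diagnosed.

import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Sketch: keep two replicas of each cached block so a single executor loss
// does not force a mid-stage recompute. The session, DataFrame name and input
// path are placeholders, not the failing job's code.
val spark = SparkSession.builder().getOrCreate()
val expensiveDf = spark.read.parquet("s3a://example-bucket/example-input")

expensiveDf.persist(StorageLevel.MEMORY_AND_DISK_2) // _2 keeps two replicas of every block
expensiveDf.count()                                  // materialize the cache before the heavy stage

Replication doubles the memory and disk footprint of the cache, so it is only worth it for data that is genuinely expensive to recompute.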

