spark errors: Executor X disconnected, so removing it


I am experiencing the following problem with Spark.

My application runs correctly on very small datasets (6 MB), but fails on
datasets larger than 12 MB.

With the larger datasets, the main log shows the following messages for
all of my executors. The application (launched from sbt) hangs until I
terminate it with Ctrl-C, while the Web UI reports it as FAILED and lists only
4 executors (it started with 5), all in state KILLED. These messages and
failures appear almost immediately after the application is launched.

  INFO cluster.SparkDeploySchedulerBackend: Executor 1 disconnected, so removing it
  ERROR cluster.ClusterScheduler: Lost executor 1 on OFW4: remote Akka client shutdown
  WARN cluster.ClusterTaskSetManager: Lost TID 0 (task 2.0:0)
  INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 0)

The same warnings and errors get logged a few times, then all processing stops.
The executor stderr only shows the following error, which does not explain
why my executors keep disconnecting with larger datasets.

  INFO server.AbstractConnector: Started SocketConnector@
  ERROR executor.CoarseGrainedExecutorBackend: Driver terminated or disconnected! Shutting down.

After closer inspection, I traced the problem to a partitionBy transformation.

My current application looks like this:

  input dataset (HDFS) -> sample
    -> map
      -> union (other RDD coming from same file with similar lineage)
        -> partitionBy
          -> first (for debugging)
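
A minimal sketch of a pipeline with this shape, for anyone who wants to try to
reproduce it. This is not my actual application code: the master URL, the input
path, the sampling fractions and seeds, the key function, and the partition
count are all placeholders chosen for illustration.

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits, needed on older Spark releases

object PartitionByRepro {
  def main(args: Array[String]): Unit = {
    // Hypothetical master URL and HDFS path.
    val sc = new SparkContext("spark://master:7077", "partitionBy-repro")
    val lines = sc.textFile("hdfs:///path/to/input.txt")

    // Two branches with similar lineage from the same file: sample -> map.
    // The map must produce key/value pairs, since partitionBy is only
    // defined on pair RDDs. The key used here is an arbitrary placeholder.
    val left  = lines.sample(false, 0.5, 42).map(line => (line.hashCode, line))
    val right = lines.sample(false, 0.5, 43).map(line => (line.hashCode, line))

    // union -> partitionBy (this step introduces a shuffle).
    val partitioned = left.union(right).partitionBy(new HashPartitioner(8))

    // first() is the action that triggers the whole job.
    println(partitioned.first())

    sc.stop()
  }
}
```

Nothing runs until first() is called, so a failure that shows up "at
partitionBy" may in fact come from any earlier stage of the lineage.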

Note that, for debugging purposes, I am also loading the entire contents of my
input file into application memory using Scala's Source.fromFile API. Could
this have anything to do with the above failures?

If not, any idea what could be causing the executor disconnections?
How can I get more detailed debugging information to help me investigate
this issue further?

Any help would be greatly appreciated.