Multiple applications being spawned


Multiple applications being spawned

Sachit Murarka
Hi Users,

When an action gets executed in my Spark job (I am using count and write), it launches many more application instances (around 25 more apps).

In my Spark code, I am running the transformations through DataFrames, then converting the DataFrame to an RDD, applying zipWithIndex, converting it back to a DataFrame, and then applying two actions (count and write).
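
(For context, a minimal PySpark sketch of the pipeline being described; the input/output paths, the filter transformation, and the column names are assumptions for illustration, not the actual job:)

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("zipwithindex-pipeline").getOrCreate()

    # Transformations through DataFrames (hypothetical input and filter).
    df = spark.read.parquet("/path/to/input")
    df = df.filter(df["status"] == "active")

    # DataFrame -> RDD -> zipWithIndex -> back to DataFrame.
    indexed_df = (
        df.rdd.zipWithIndex()
          .map(lambda pair: tuple(pair[0]) + (pair[1],))
          .toDF(df.columns + ["row_id"])
    )

    # Two actions: count and write.
    print(indexed_df.count())
    indexed_df.write.mode("overwrite").parquet("/path/to/output")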

Please note: this was working fine until the previous week; it has started giving this issue since yesterday.

Could you please tell me what could be causing this behavior?

Kind Regards,
Sachit Murarka

Re: Multiple applications being spawned

Sachit Murarka
Adding Logs.

When it launches the multiple applications, the following logs are generated on the terminal. It also always retries the task:

20/10/13 12:04:30 WARN TaskSetManager: Lost task XX in stage XX (TID XX, executor 5): java.net.SocketException: Broken pipe (Write failed)
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
        at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
        at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:212)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
        at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
        at scala.collection.Iterator$class.foreach(Iterator.scala:891)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
        at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
        at org.apache.spark.api.python.PythonRunner$$anon$2.writeIteratorToStream(PythonRunner.scala:561)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:346)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
        at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:195)

Kind Regards,
Sachit Murarka



Re: Multiple applications being spawned

Lalwani, Jayesh

Where are you running your Spark cluster? Can you post the command line that you are using to run your application?

 

Spark is designed to process a lot of data by distributing work to a cluster of machines. When you submit a job, it starts executor processes on the cluster. So what you are seeing is somewhat expected (although 25 processes on a single node seems too high).
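
(For reference, a minimal PySpark sketch, with assumed values, of pinning the executor settings explicitly; the spark-submit command line being asked for would carry the same information as flags:)

    from pyspark.sql import SparkSession

    # Hypothetical settings: a fixed number of executors so the processes
    # started on the cluster are predictable (all values are assumptions).
    spark = (
        SparkSession.builder
        .appName("zipwithindex-pipeline")
        .config("spark.executor.instances", "4")
        .config("spark.executor.cores", "2")
        .config("spark.executor.memory", "4g")
        .config("spark.dynamicAllocation.enabled", "false")  # stop executors scaling up on their own
        .getOrCreate()
    )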

 


Re: Multiple applications being spawned

Sachit Murarka
Hi Jayesh,

It's not the executor processes. The application (the job itself) is getting launched multiple times, like a recursion. The problem seems to be mainly in zipWithIndex.
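
(As a side note, RDD.zipWithIndex itself triggers one extra Spark job to compute per-partition offsets when the RDD has more than one partition, so extra jobs at that step are expected even though the application stays the same. A quick way to check, sketched against an assumed existing SparkSession named spark and DataFrame named df:)

    # The application ID should stay constant across the zipWithIndex step;
    # only additional jobs (not new applications) are expected here.
    app_id = spark.sparkContext.applicationId
    print("application id:", app_id)

    indexed = df.rdd.zipWithIndex()   # internally runs one job to size the partitions
    indexed.count()                   # the explicit action

    print("same application:", spark.sparkContext.applicationId == app_id)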

Thanks
Sachit
