Possible to limit number of IPC retries on spark-submit?

Possible to limit number of IPC retries on spark-submit?

Jeff Evans
Greetings,

Is it possible to limit the number of times the IPC client retries during a spark-submit invocation?  For context, see this StackOverflow post.  In essence, I am calling spark-submit on a Kerberized cluster without valid Kerberos tickets available.  This is deliberate; I'm not actually facing a Kerberos issue.  Rather, this is simply the easiest reproducible case of a "long IPC retry" that I have been able to trigger.

In this particular case, the following error is printed (presumably by the launcher):

20/01/22 15:49:32 INFO retry.RetryInvocationHandler: java.io.IOException: Failed on local exception: java.io.IOException: org.apache.hadoop.security.AccessControlException: Client cannot authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: "node-1.cluster/172.18.0.2"; destination host is: "node-1.cluster":8032; , while invoking ApplicationClientProtocolPBClientImpl.getClusterMetrics over null after 1 failover attempts. Trying to failover after sleeping for 35160ms.
This repeats 30 times before the launcher finally gives up.

As indicated in the answer on that StackOverflow post, the relevant Hadoop properties should be ipc.client.connect.max.retries and/or ipc.client.connect.max.retries.on.sasl.  However, testing on Spark 2.4.0 (on CDH 6.1), I am not able to get either of these to take effect; it still retries 30 times regardless.  I am running the SparkPi example and passing the properties as --conf spark.hadoop.ipc.client.connect.max.retries and/or --conf spark.hadoop.ipc.client.connect.max.retries.on.sasl.
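
For reference, the invocation looks roughly like this (the jar path, master, and deploy mode are placeholders approximating my environment, and the retry values shown are just examples):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.hadoop.ipc.client.connect.max.retries=3 \
      --conf spark.hadoop.ipc.client.connect.max.retries.on.sasl=3 \
      --class org.apache.spark.examples.SparkPi \
      /path/to/spark-examples.jar 100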

Any ideas on what I could be doing wrong, or why I can't get these properties to take effect?

Re: Possible to limit number of IPC retries on spark-submit?

Jeff Evans
Figured out the answer eventually.  The magic property in this case is yarn.client.failover-max-attempts (prefixed with spark.hadoop. when passed through Spark, of course).  For a full explanation, see the StackOverflow answer I just added.
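
For anyone hitting the same thing, a sketch of the invocation that worked for me (again, the jar path, master, deploy mode, and the attempt count of 2 are illustrative placeholders, not exact values):

    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --conf spark.hadoop.yarn.client.failover-max-attempts=2 \
      --class org.apache.spark.examples.SparkPi \
      /path/to/spark-examples.jar 100

With that property set, the launcher stops retrying the ResourceManager connection after the configured number of failover attempts instead of the default 30.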
