[Spark Streaming] Why is ZooKeeper LeaderElection Agent not being called by Spark Master?

Saloni Mehta
Hello,

Could you please help me with the queries below?

I have 2 Spark masters and 3 ZooKeeper nodes deployed on separate virtual machines. The services come online in the following sequence:

  1. zookeeper-1
  2. sparkmaster-1
  3. sparkmaster-2
  4. zookeeper-2
  5. zookeeper-3

This startup sequence leaves both Spark masters running in STANDBY mode.
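For reference, both masters are configured for ZooKeeper recovery mode. The properties involved would be roughly the following, shown here as system properties purely for illustration (in practice they are passed as -D flags via SPARK_DAEMON_JAVA_OPTS; the hosts/ports and the znode directory below are placeholders, not my actual values):

    // Standalone-HA properties assumed to be set on each master; values are placeholders.
    sys.props("spark.deploy.recoveryMode") = "ZOOKEEPER"
    sys.props("spark.deploy.zookeeper.url") = "zookeeper-1:xxxx,zookeeper-2:xxxx,zookeeper-3:xxxx"
    sys.props("spark.deploy.zookeeper.dir") = "/spark"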

From the logs, I can see that the Spark master is able to create a ZooKeeper session only after zookeeper-2 comes up (i.e. two ZooKeeper services are up); until then it keeps retrying session creation. However, even after both ZooKeeper services are up and the Persistence Engine successfully connects and creates a session, the ZooKeeper LeaderElection Agent is never called.
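For context, the Spark master talks to ZooKeeper through Apache Curator. A minimal, self-contained sketch of an equivalent client, using the session timeout (60000 ms) and connection timeout (15000 ms) that appear in the logs below, would look roughly like this (the connect string is a parameter because the real hosts/ports are elided in the logs; the retry policy is my own assumption, and none of this is Spark's actual code):

    import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
    import org.apache.curator.retry.ExponentialBackoffRetry

    object ZkClientSketch {
      def buildClient(connectString: String): CuratorFramework = {
        val client = CuratorFrameworkFactory.newClient(
          connectString,
          60000, // session timeout (ms), as in "sessionTimeout=60000" in the logs
          15000, // connection timeout (ms), as in "timeout (15000)" in the logs
          new ExponentialBackoffRetry(1000, 3)) // base sleep 1 s, up to 3 retries (assumed)
        client.start()
        // Curator keeps retrying in the background; this call blocks until the
        // CONNECTED state change seen at the end of the logs is reached.
        client.blockUntilConnected()
        client
      }
    }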

Logs:

    10:03:47.241 INFO  org.apache.spark.internal.Logging:57 - Persisting recovery state to ZooKeeper
    Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState

    ##### Only zookeeper-2 is online #####

    10:03:47.630 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:03:50.635 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
    10:03:50.738 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
    2020-12-18 10:03:50.739 INFO  org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
    10:03:50.742 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
    10:03:51.842 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:03:51.843 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-3:xxxx: Connection refused

    10:04:02.685 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (15274)
    org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)

    10:04:22.691 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (35297)
    org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)

    10:04:42.696 ERROR org.apache.curator.ConnectionState:200 - Connection timed out for connection string (zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx) and timeout (15000) / elapsed (55301)
    org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
        at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:197)
        at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:87)

    10:05:32.699 WARN  org.apache.curator.ConnectionState:191 - Connection attempt unsuccessful after 105305 (greater than max timeout of 60000). Resetting connection and trying again with a new connection.
    10:05:32.864 INFO  org.apache.zookeeper.ZooKeeper:693 - Session: 0x0 closed
    10:05:32.865 INFO  org.apache.zookeeper.ZooKeeper:442 - Initiating client connection, connectString=zookeeper-2:xxxx,zookeeper-3:xxxx,zookeeper-1:xxxx sessionTimeout=60000 watcher=org.apache.curator.ConnectionState@
    10:05:32.864 INFO  org.apache.zookeeper.ClientCnxn$EventThread:522 - EventThread shut down for session: 0x0

    10:05:32.969 ERROR org.apache.spark.internal.Logging:94 - Ignoring error
    org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /x/y
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:102)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)

    ##### zookeeper-2, zookeeper-3 are online #####

    10:05:47.357 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-2:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:05:47.358 INFO  org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-2:xxxx, initiating session
    10:05:47.359 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
    10:05:47.528 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-1:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:05:50.529 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1162 - Socket error occurred: zookeeper-1:xxxx: No route to host
    10:05:51.454 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:05:51.455 INFO  org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
    10:05:51.457 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1158 - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect

    10:05:57.564 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1025 - Opening socket connection to server zookeeper-3:xxxx. Will not attempt to authenticate using SASL (unknown error)
    10:05:57.566 INFO  org.apache.zookeeper.ClientCnxn$SendThread:879 - Socket connection established to zookeeper-3:xxxx, initiating session
    10:05:57.574 INFO  org.apache.zookeeper.ClientCnxn$SendThread:1299 - Session establishment complete on server zookeeper-3:xxxx, sessionid = xxxx, negotiated timeout = 40000
    10:05:57.580 INFO  org.apache.curator.framework.state.ConnectionStateManager:228 - State change: CONNECTED

Questions:

  1. The last line of the logs above indicates that a ZooKeeper session was successfully established. Why, then, is the ZooKeeper LeaderElection Agent not being called? (See the sketch after this list for what I would expect to happen.)
  2. Is there any configuration in Spark that would increase the number of retries or the timeouts used when connecting to ZooKeeper?
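To clarify question 1: my understanding is that once the Curator client reports CONNECTED, a leader-election agent built on Curator's LeaderLatch recipe should start and eventually invoke a leadership callback on one of the masters. A rough sketch of that flow (an illustration of the Curator recipe, not Spark's actual ZooKeeperLeaderElectionAgent; the election path is a placeholder):

    import org.apache.curator.framework.CuratorFramework
    import org.apache.curator.framework.recipes.leader.{LeaderLatch, LeaderLatchListener}

    // Illustrative only: starts a LeaderLatch on an already-connected client and
    // prints the callbacks a leader-election agent would normally receive.
    class ElectionSketch(client: CuratorFramework, electionPath: String) {
      private val latch = new LeaderLatch(client, electionPath)

      def start(): Unit = {
        latch.addListener(new LeaderLatchListener {
          override def isLeader(): Unit = println("elected leader -> master would switch to ALIVE")
          override def notLeader(): Unit = println("lost leadership -> master would stay in STANDBY")
        })
        latch.start() // registers an ephemeral sequential znode under electionPath
      }
    }

In my case, nothing equivalent to these callbacks ever appears in the master logs, even after the CONNECTED state change.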
Any insight on this is appreciated.

Thanks & Regards,
Saloni R. Mehta