Caching small RDDs takes a really long time and Spark seems frozen

Guillermo Ortiz
I use Spark with caching via the persist method. I cache several RDDs, and some of them are pretty small (about 300 KB). Most of the time it works well and the whole job lasts about 1 s, but sometimes it takes about 40 s just to store those 300 KB in the cache.
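
For reference, a minimal sketch of the kind of caching in question (the RDD name, source path and pipeline here are made up for illustration):

    import org.apache.spark.storage.StorageLevel

    // Hypothetical small RDD, a few hundred KB once materialized.
    val smallRdd = sc.textFile("hdfs:///data/lookup.csv")  // placeholder path
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_ONLY)

    // The first action computes the lineage and stores the blocks in the cache;
    // this is the step that occasionally takes ~40 s instead of ~1 s.
    smallRdd.count()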

If I go to the Spark UI -> Cache page, I can see the cached percentage climb to 83% (about 250 KB) and then stall for a while. If I check the event timeline in the Spark UI, I can see that when this happens there is one node where the tasks take a very long time. It can be any node in the cluster; it is not always the same one.

In the Spark executor logs I can see that it takes about 40 s to store 3.7 KB when the problem occurs:

    INFO  2018-08-23 12:46:58 Logging.scala:54 - org.apache.spark.storage.BlockManager: Found block rdd_1705_23 locally
    INFO  2018-08-23 12:47:38 Logging.scala:54 - org.apache.spark.storage.memory.MemoryStore: Block rdd_1692_7 stored as bytes in memory (estimated size 3.7 KB, free 1048.0 MB)
    INFO  2018-08-23 12:47:38 Logging.scala:54 - org.apache.spark.storage.BlockManager: Found block rdd_1692_7 locally

I have tried MEMORY_ONLY, MEMORY_AND_SER and so on with the same results. I have checked disk I/O (although with MEMORY_ONLY I guess that shouldn't matter) and I can't see any problem. It happens randomly, but in roughly 25% of the jobs.
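
For completeness, the storage levels on the JVM side are spelled MEMORY_ONLY, MEMORY_ONLY_SER, MEMORY_AND_DISK and MEMORY_AND_DISK_SER (MEMORY_AND_SER above presumably refers to one of the *_SER variants). A sketch of switching between them:

    import org.apache.spark.storage.StorageLevel

    // Only one level can be assigned to a given RDD.
    // (Sketch only -- "someRdd" stands in for whichever RDD is being cached.)
    someRdd.persist(StorageLevel.MEMORY_ONLY)            // deserialized objects in memory
    // someRdd.persist(StorageLevel.MEMORY_ONLY_SER)     // serialized bytes in memory
    // someRdd.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized, spills to disk if needed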

Any idea about what could be happening?
Re: Caching small RDDs takes a really long time and Spark seems frozen

Sonal Goyal
How are these small RDDs created? Could the blockage be in their computation rather than in their caching?
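
One hedged way to separate the two costs is to time the same action with and without persist (the helper and RDD name below are hypothetical): the first run pays only the compute cost, the second pays compute plus block storage, and the third should only read cached blocks.

    // Sketch only: "rddUnderTest" stands in for one of the small RDDs above.
    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - start) / 1e6} ms")
      result
    }

    timed("compute, no cache")(rddUnderTest.count())   // pure compute cost
    rddUnderTest.persist()
    timed("compute + cache")(rddUnderTest.count())     // compute plus block storage
    timed("read from cache")(rddUnderTest.count())     // should only hit cached blocks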

Thanks,
Sonal
Nube Technologies 
Re: Caching small RDDs takes a really long time and Spark seems frozen

Guillermo Ortiz
There is a complex DAG before the point where I cache the RDD: some joins, filters and maps before caching the data, but most of the time it takes almost no time at all. I could understand it if processing or caching the data always took the same amount of time. Besides, it seems random, and there is no weird data in the input.

Another test I tried was disabling caching, and then all the micro-batches took the same time, so it does seem to be related to caching these RDDs.
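
If it helps narrow it down, one small thing is to give the cached RDD an explicit name so the slow block is easy to pick out in the Cache/Storage page and in the executor logs. A sketch, assuming the streaming job uses foreachRDD (the stream and the functions below are placeholders):

    // Sketch only: "stream", "parse" and "enrich" are made-up names.
    stream.foreachRDD { rdd =>
      val enriched = rdd.map(parse)
        .filter(_ != null)
        .map(enrich)
        .setName("enriched-microbatch")   // name shown for the cached blocks in the UI
        .persist()

      enriched.count()                    // materializes and caches the blocks
      // ... rest of the micro-batch logic ...
      enriched.unpersist()
    }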

Re: Caching small RDDs takes a really long time and Spark seems frozen

Guillermo Ortiz
Another test I just did was running with local[X], and the problem doesn't happen there. Could it be a communication problem?
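
For context, a sketch of the only intended difference between the two runs (the app name, core count and master URL are placeholders):

    import org.apache.spark.SparkConf

    // Local run, where the problem does not appear: everything in one JVM,
    // no network transfer between executors and the driver's block manager.
    val conf = new SparkConf().setAppName("caching-test").setMaster("local[4]")

    // Cluster run, where roughly 25% of the micro-batches stall while caching:
    // val conf = new SparkConf().setAppName("caching-test")
    //   .setMaster("spark://master-host:7077")   // placeholder master URL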

Re: Caching small RDDs takes a really long time and Spark seems frozen

Sonal Goyal
Without knowing more about your application, it is hard to say. Maybe it works faster in local mode because there is no shuffling, etc.? The Spark UI would be your best bet to find out which stage is slowing things down.
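
If it helps with that, a small sketch (the labels and RDD name are arbitrary) of tagging the caching step so it stands out in the Jobs and Stages pages of the UI:

    // Tag the work around the persist so the slow stage is easy to spot.
    sc.setJobGroup("cache-small-rdd", "materialize and cache the small RDD")
    smallRdd.persist().count()   // "smallRdd" is a hypothetical name
    sc.clearJobGroup()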
