Cluster taking a long time with not much activity (or so I think)


Cluster taking a long time with not much activity (or so I think)

Vipul Pandey
Hi 

My use case is pretty simple:
- Get all the data (2TB uncompressed)
- Calculate some aggregates for, say, time slices (A) - this could be every minute of every day for the past month
- Calculate some aggregates for a filtered subset of the data for the same slices (B)
- Join them and calculate the percentage of B with respect to A
- Save the result to a file (160MB)

# Nodes = 20 (150GB each)
Spark version = 0.9.0
Input data size = 2TB
Output data size = 160MB


Everything else works fine, but the saveAsTextFile call takes about an hour. In that hour the CPU utilization, load average, and network traffic are all pretty low - in fact they taper off after 30 minutes. (Check out the values in the plots below, starting at 15:08.)

Here's my code:

      val joined = countsForCategory.join(countsAllCategories)
      val counts = joined.map(x => (x._1, 100 * calculateRatio(x._2)))
      counts.map(x => x._1 + "," + x._2).coalesce(10).saveAsTextFile("hdfs://x.y.z/path/to/output/dir")

Can someone explain what's going on? Is that expected? As mentioned, my output data is pretty small.
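For illustration, the join-and-percentage step can be sketched on plain Scala collections. `calculateRatio` isn't shown in the thread, so the division below is an assumption, and `percentLines` is a hypothetical helper standing in for the RDD pipeline:

```scala
// Plain-collections sketch of the RDD pipeline above (hypothetical
// helper names; calculateRatio's body is assumed, not from the thread).
def calculateRatio(counts: (Long, Long)): Double = {
  val (category, all) = counts
  if (all == 0L) 0.0 else category.toDouble / all
}

// Inner-joins per-slice counts for one category (B) with counts for all
// categories (A), then emits "slice,percent" lines like the job's output.
def percentLines(
    countsForCategory: Map[String, Long],
    countsAllCategories: Map[String, Long]): Seq[String] = {
  val joined = countsForCategory.flatMap { case (slice, b) =>
    countsAllCategories.get(slice).map(a => slice -> (b, a))
  }
  joined.toSeq.sortBy(_._1).map { case (slice, counts) =>
    s"$slice,${100 * calculateRatio(counts)}"
  }
}
```

One thing worth checking in the real job: `coalesce(10)` without the shuffle flag merges the output stage down to 10 tasks, so the post-join work runs on only 10 partitions; `coalesce(10, shuffle = true)` would keep the upstream stage at full parallelism.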







[Plots: CPU Utilization, Network, Load Average]

Re: Cluster taking a long time with not much activity (or so I think)

Mayur Rustagi
You can check the Storage tab of your application. If you see RDDs spilling to disk, that could be an issue.
Another possibility is that disk commits are taking time, so disk utilization could be relevant.
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257


On Mon, Mar 24, 2014 at 8:13 PM, Vipul Pandey <[hidden email]> wrote:

Re: Cluster taking a long time with not much activity (or so I think)

Mayur Rustagi
Another issue could be not enough memory. Can you try it with 1TB, or possibly 500GB, of data and scale up gradually?
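For the scale-down test, a simple Bernoulli sample keeps each record with probability `fraction`. This is a plain-Scala sketch with a hypothetical helper name; in a Spark job the equivalent would be `RDD.sample` applied before the aggregations:

```scala
import scala.util.Random

// Keeps each record independently with probability `fraction`
// (Bernoulli sampling without replacement), seeded for repeatability.
def sampleFraction[A](records: Seq[A], fraction: Double, seed: Long): Seq[A] = {
  val rng = new Random(seed)
  records.filter(_ => rng.nextDouble() < fraction)
}
```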
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257


On Wed, Mar 26, 2014 at 12:52 PM, Mayur Rustagi <[hidden email]> wrote:

Re: Cluster taking a long time with not much activity (or so I think)

Vipul Pandey
> You can check out the storage tab of your application. If you see RDD spilling off to disk that could be an issue.
Storage was just fine. The entire dataset fits into less than a TB of memory, and I have more.

> Another possibility is disk commits are taking time so disk utilization could be relevant.
Where do you think this would matter? While writing the final output to disk? But that's only 160MB.

Here's the plot for the Memory as well.
[Plot: Memory]
On Mar 26, 2014, at 9:54 AM, Mayur Rustagi <[hidden email]> wrote:



Re: Cluster taking a long time with not much activity (or so I think)

Mayur Rustagi
Intermediate data could be huge before it's reduced to 160MB.
You can look at the shuffle writes of your tasks. Is this the writes graph? So intermediate data is 3TB?

Mayur Rustagi
Ph: +1 (760) 203 3257


On Thu, Mar 27, 2014 at 1:45 AM, Vipul Pandey <[hidden email]> wrote: