How to estimate the RDD size before the RDD result is written to disk


How to estimate the RDD size before the RDD result is written to disk

zhangliyun
Hi all:
 I want to ask a question about how to estimate the size of an RDD (in bytes) before it is saved to disk, because the job takes a long time when the output is very large and the number of output partitions is small.


The following are the steps I came up with for this problem:

 1. Sample 1% (fraction 0.01) of the original data.
 2. Compute the sample data count.
 3. If the sample data count > 0, cache the sample data and compute the sample data size in bytes.
 4. Compute the total count of the original RDD.
 5. Estimate the RDD size as ${total count} * ${sample data size} / ${sample count}.


The code is here.  
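For illustration, a minimal sketch of what those five steps might look like in Scala; this is not the code linked above, and the helper name, the use of getRDDStorageInfo, and the 0.01 default fraction are assumptions:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Rough sketch of the five sampling steps above. The cached size reported by
// Spark is an in-memory figure, so the final number only approximates the
// size the RDD would occupy when written out.
def estimateRddBytes[T](sc: SparkContext, rdd: RDD[T], fraction: Double = 0.01): Option[Long] = {
  // 1. Sample roughly 1% of the original data without replacement.
  val sample = rdd.sample(withReplacement = false, fraction)

  // 2. Cache the sample and count it (the count also materialises the cache).
  sample.cache()
  val sampleCount = sample.count()

  if (sampleCount == 0) {
    None
  } else {
    // 3. Read the cached size of the sample in bytes from the storage info.
    val sampleBytes = sc.getRDDStorageInfo
      .find(_.id == sample.id)
      .map(info => info.memSize + info.diskSize)
      .getOrElse(0L)

    // 4. Count the full RDD.
    val totalCount = rdd.count()
    sample.unpersist()

    // 5. Scale the sampled size up to the whole data set.
    Some((sampleBytes.toDouble / sampleCount * totalCount).toLong)
  }
}

Whether step 3 uses getRDDStorageInfo (a developer API) or something like org.apache.spark.util.SizeEstimator on the sampled records is a design choice; either way the estimate reflects the in-memory representation rather than the serialized output size.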

My questions:
1. Can I use the above approach to solve the problem? If not, where is it wrong?
2. Is there an existing solution (an existing API in Spark) for this?



Best Regards
Kelly Zhang


 


Re: How to estimate the RDD size before the RDD result is written to disk

sriramb12
Hello Experts,
I am trying to maximise resource utilisation on my 3-node Spark cluster (2 data nodes and 1 driver) so that the job finishes as quickly as possible, and to build a benchmark so I can recommend an optimal POD for the job: 128 GB x 16 cores.
I am running standalone Spark 2.4.0. htop shows that only half of the memory is in use, while CPU is always at 100% for the allocated resources. What alternatives can I try? Could I reduce per-executor memory to 32 GB and increase the number of executors?
I have the following properties:

spark.driver.maxResultSize                    64g
spark.driver.memory                           100g
spark.driver.port                             33631
spark.dynamicAllocation.enabled               true
spark.dynamicAllocation.executorIdleTimeout   60s
spark.executor.cores                          8
spark.executor.id                             driver
spark.executor.instances                      4
spark.executor.memory                         64g
spark.files                                   file://dist/xxxx-0.0.1-py3.7.egg
spark.locality.wait                           10s

100
spark.shuffle.service.enabled                 true
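For illustration, a minimal sketch in Scala of the smaller-executor alternative mentioned above when building the session, assuming standalone mode; the app name and the figure of 8 executor instances are assumptions, not values from the thread:

import org.apache.spark.sql.SparkSession

// Illustrative only: replace 4 x 64g executors with 8 x 32g executors,
// keeping total executor memory roughly constant while spreading it over more JVMs.
val spark = SparkSession.builder()
  .appName("benchmark")                              // hypothetical application name
  .config("spark.executor.memory", "32g")
  .config("spark.executor.cores", "8")
  .config("spark.executor.instances", "8")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()

With dynamic allocation enabled, spark.executor.instances effectively sets the initial executor count; the number of concurrent tasks is still bounded by the available cores, so halving executor memory mainly changes how the available memory is divided among executor JVMs.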




--
-Sriram

optimising cluster performance

sriramb12
Hi All,
Sorry, earlier I forgot to set the subject line correctly. The content is the same as my previous reply.


--
-Sriram