Spark Small file issue

7 messages

Spark Small file issue

Hichki
Hello Team,

 

I am new to the Spark environment. I have converted a Hive query to Spark Scala,
and now I am loading data and doing performance testing. Below are the details
on loading 3 weeks of data. The cluster-level small-file average size is set to 128 MB.



1. The new temp table I am loading data into is ORC formatted, since the
current Hive table is stored as ORC.

2. Each partition folder of the Hive table is 200 MB.

3. I am using repartition(1) in the Spark code so that it creates one 200 MB
part file in each partition folder (to avoid the small file issue). With this,
the job completes in 23 to 26 minutes.

4. If I don't use repartition(), the job completes in 12 to 13 minutes, but
this approach creates 800 part files (each < 128 MB) in each partition folder.

 

I am not quite sure how to reduce the processing time without creating small
files at the same time. Could anyone please help me with this situation?





--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: Spark Small file issue

German SM
Hi,

When reducing the number of partitions, it is better to use coalesce because it doesn't need to shuffle the data.

dataframe.coalesce(1)



Re: Spark Small file issue

Bobby Evans-2
First, you need to be careful with coalesce. It will impact upstream processing, so if you are doing a lot of computation in the last stage before the repartition then coalesce will make the problem worse because all of that computation will happen in a single thread instead of being spread out.

My guess is that it has something to do with writing your output files. Writing ORC and/or Parquet is not cheap: it does a lot of compression and statistics calculation. I am also not sure why, but from what I have seen, they do not scale linearly as more data is put into a single file. You might also be doing the repartition too early. There should be statistics on the SQL page of the UI where you can see which stages took a long time; that should point you in the right direction.
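One middle ground between repartition(1) and the default parallelism is to derive the file count from the data size and repartition only as the very last step before the write. A minimal sketch (the 200 MB folder size and 128 MB threshold come from this thread; the table name and the commented-out write call are assumptions):

```scala
object FileSizing {
  // How many output files of roughly `targetFileBytes` cover a folder of `folderBytes`.
  def filesPerFolder(folderBytes: Long, targetFileBytes: Long): Int =
    math.max(1, math.ceil(folderBytes.toDouble / targetFileBytes).toInt)

  def main(args: Array[String]): Unit = {
    val folderBytes = 200L * 1024 * 1024 // each partition folder is ~200 MB
    val targetBytes = 128L * 1024 * 1024 // cluster-level small-file threshold
    val n = filesPerFolder(folderBytes, targetBytes)
    println(n) // 2 -> two ~100 MB files per folder instead of one 200 MB file
    // In the Spark job this would drive the last step before the write, e.g.:
    // df.repartition(n).write.format("orc").insertInto("temp_table")
  }
}
```

This keeps the write spread over more than one task while still producing a small, bounded number of part files per folder.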



Re: Spark Small file issue

Koert Kuipers
I second that. We have gotten bitten too many times by coalesce impacting upstream stages in unintended ways, so I avoid coalesce on write altogether.

I prefer to use repartition (and take the shuffle hit) before writing (especially if you are writing partitioned output), or, if possible, to use adaptive query execution to avoid too many files to begin with.
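The adaptive query execution route can be sketched as a handful of settings (Spark 3.x property names; the 128 MB advisory size mirrors the cluster threshold mentioned earlier and is an assumption, not something from this thread):

```properties
spark.sql.adaptive.enabled=true
spark.sql.adaptive.coalescePartitions.enabled=true
spark.sql.adaptive.advisoryPartitionSizeInBytes=128m
```

With these set, AQE coalesces small shuffle partitions toward the advisory size at runtime, which tends to keep output files near the target without hand-tuning a repartition count.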



Re: Spark Small file issue

Hichki
In reply to this post by Bobby Evans-2
Hi,

I am doing the repartition at the end, i.e. just before insert-overwriting the
table, and I can see that this last step (the repartition) is what takes more time.





Re: Spark Small file issue

Bobby Evans
In reply to this post by Hichki
So, I should have done some back-of-the-napkin math before all of this. You are writing out 800 files, each < 128 MB. If they were all 128 MB, that would be 100 GB of data being written. I'm not sure how much hardware you have, but the fact that you can shuffle roughly 100 GB to a single thread and write it out in 13 extra minutes actually feels really good for Spark. You are writing out roughly 130 MB/sec of compressed Parquet data. It has been a little while since I benchmarked it, but that feels like the right order of magnitude. I would suggest that you try repartitioning to 10 or 100 partitions instead.
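The napkin math above, spelled out (the 800 files and the 128 MB bound come from this thread; the 13 extra minutes is the difference between the two reported runtimes):

```scala
object NapkinMath {
  def main(args: Array[String]): Unit = {
    val files = 800
    val maxFileMB = 128L
    val totalMB = files * maxFileMB // 102400 MB, i.e. ~100 GB upper bound
    // Difference between the two reported runtimes: ~26 min vs ~13 min.
    val extraSeconds = 13 * 60
    val mbPerSec = totalMB.toDouble / extraSeconds
    println(totalMB / 1024)       // 100 (GB upper bound written)
    println(math.round(mbPerSec)) // 131 (MB/s through the single writer task)
  }
}
```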



Re: Spark Small file issue

Hichki
All 800 files (in a partition folder) have sizes in the bytes range; together
they sum up to 200 MB, which is each partition folder's input size. And I am
using the ORC format; I have never used Parquet.
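With these corrected numbers, the average file size works out to be tiny (a quick check using only the figures from this thread):

```scala
object AvgFileSize {
  def main(args: Array[String]): Unit = {
    val folderBytes = 200L * 1024 * 1024   // one partition folder
    val files = 800
    val avgKB = folderBytes / files / 1024 // average size per part file
    println(avgKB) // 256 KB each -- roughly 500x below the 128 MB target
  }
}
```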


