Spark Streaming Small files in Hive

Spark Streaming Small files in Hive

khajaasmath786
Hi,

I am using Spark Streaming to write data back into Hive with the code snippet below:


eventHubsWindowedStream.map(x => EventContent(new String(x)))
  .foreachRDD(rdd => {
    // reuse (or create) a Hive-enabled SparkSession for each micro-batch
    val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
    import sparkSession.implicits._

    // append each micro-batch into the partitioned Hive table
    rdd.toDS.write.mode(org.apache.spark.sql.SaveMode.Append).insertInto(hiveTableName)
  })


The Hive table is partitioned by year, month, and day, so some days end up with very little data, which in turn produces small files in Hive. Because the data is written as many small files, Impala/Hive takes a significant performance hit when reading it. Is there a way to merge files while inserting data into Hive?

It would also be really helpful if anyone can suggest a better way to design this. We cannot use HBase/Kudu in this scenario due to space constraints on our clusters.

Thanks,

Asmath

Re: Spark Streaming Small files in Hive

Siva Gudavalli
Hello Asmath,

We had a similar challenge recently.

When you write back to Hive, you are creating files on HDFS, and how many you create depends on your batch window.
If you increase your batch window, say from 1 minute to 5 minutes, you will end up creating roughly 5x fewer files.
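
For reference, here is a rough sketch of where that batch interval is set; the app name and the 5-minute value are only illustrative, and the EventHubs stream setup is assumed to live elsewhere in your application:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("EventHubsToHive") // app name is illustrative

    // Each micro-batch writes at least one file per RDD partition per Hive partition,
    // so a 5-minute interval produces roughly 5x fewer files than a 1-minute one
    // for the same input rate.
    val ssc = new StreamingContext(conf, Minutes(5))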

The other factor is your partitioning. For instance, if your Spark application is working on 5 partitions, you can repartition (or coalesce) to 1 before the write; this again reduces the number of files per batch by 5x, as in the sketch below.
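
As an illustration, applied to your original snippet (coalesce(1) is just an example; the right target depends on how much data each batch carries):

    eventHubsWindowedStream.map(x => EventContent(new String(x)))
      .foreachRDD(rdd => {
        val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate
        import sparkSession.implicits._

        // coalesce(1) gives one output file per batch per Hive partition
        // instead of one file per Spark partition
        rdd.toDS.coalesce(1)
          .write.mode(org.apache.spark.sql.SaveMode.Append)
          .insertInto(hiveTableName)
      })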

You can also create a staging table to hold the small files, and once a decent amount of data has accumulated, compact it into larger files and load them into your final Hive table.
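
A rough sketch of that compaction step, run as a separate periodic batch job; the table names here are hypothetical and assume the staging and final tables share the same schema and year/month/day partitioning:

    import org.apache.spark.sql.{SaveMode, SparkSession}

    val spark = SparkSession.builder.enableHiveSupport.getOrCreate

    // read everything accumulated in staging, rewrite it as a small number
    // of large files, and append it to the final table
    spark.table("db.events_staging")   // hypothetical staging table
      .coalesce(8)                     // target a handful of large files
      .write.mode(SaveMode.Append)
      .insertInto("db.events_final")   // hypothetical final table

    // once the load succeeds, the staging data can be truncated or its
    // partitions dropped so the next compaction run starts clean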

Hope this helps.

Regards
Shiv

