spark streaming part files in hive partition

spark streaming part files in hive partition

khajaasmath786
Hi,

I am writing Spark Streaming output into a Hive partitioned table. I always see the data being written as shown below. Is there a way to change the part file names instead of having them end with "copy"? I want to add a timestamp to them.

Inline image 1
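
Not from the thread, but for reference: one common workaround is to let each micro-batch write normally and then rename the resulting part files with the Hadoop FileSystem API. The sketch below assumes a SparkSession and a hypothetical partition directory path; it is illustrative only, not a verified fix.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

// Minimal sketch: after a batch has been written, append a timestamp to each
// part file in the (hypothetical) partition directory.
def stampPartFiles(spark: SparkSession, partitionDir: String): Unit = {
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val ts = System.currentTimeMillis()
  fs.listStatus(new Path(partitionDir))
    .map(_.getPath)
    .filter(_.getName.startsWith("part-"))
    .foreach { p =>
      // e.g. part-00000 -> part-00000-1511136000000
      fs.rename(p, new Path(p.getParent, s"${p.getName}-$ts"))
    }
}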

My Spark Streaming job always starts getting task failures after running for about a day. I suspect the issue is with these part files. Is there a way to resolve it?

Inline image 2

Thanks,
Asmath
Re: spark streaming part files in hive partition

khajaasmath786
Here is the error in detail. Any suggestions on how to resolve it?


Job aborted due to stage failure: Task 0 in stage 381.0 failed 4 times, most recent failure: Lost task 0.3 in stage 381.0 (TID 129383, brksvl255.brk.navistar.com, executor 1): org.apache.spark.SparkException: Task failed while writing rows.
	at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:328)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
	at org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$3.apply(InsertIntoHiveTable.scala:210)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
	at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:152)
	at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:111)
	at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:112)
	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.close(ParquetRecordWriterWrapper.java:102)
	at org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.close(ParquetRecordWriterWrapper.java:119)
	at org.apache.spark.sql.hive.SparkHiveDynamicPartitionWriterContainer.writeToFile(hiveWriterContainers.scala:320)
	... 8 more
