Hi All,
We have implemented Delta Lake in Spark (Scala). It works fine for small data sets.
We need to update approximately 20M records per day in the Delta Lake table using Spark jobs on EMR (a 5-node cluster of m5.2xlarge machines). The job fails with a memory error:
Error: Terminated with errors
All slaves in the job flow were terminated

Source code:
import io.delta.tables.DeltaTable

// spark (SparkSession) and df_reader_update (DataFrame holding the daily updates)
// are created earlier in the job.
val source_location_destination_data_lake = "s3://test/deltalake"

DeltaTable.forPath(spark, source_location_destination_data_lake)
  .as("events")
  .merge(
    df_reader_update.as("updates"),
    "events.id = updates.id")
  .whenMatched
  .updateExpr(Map(
    "action_name" -> "updates.action_name",
    "action_date" -> "updates.action_date"))
  .whenNotMatched
  .insertExpr(Map(
    "id"           -> "updates.id",
    "package_name" -> "updates.package_name",
    "action_name"  -> "updates.action_name",
    "action_date"  -> "updates.action_date"))
  .execute()
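One thing we are wondering about: the merge condition is only on id, so as far as we understand every run has to scan the whole target table to find matches. Would narrowing the merge condition with a partition predicate help? Below is a minimal sketch of what we mean. It assumes the table were partitioned by action_date and that a day's updates only touch roughly the last 7 days; neither is true for us today, so please treat it as an illustration of the question rather than working code.

// Sketch only: assumes partitioning by action_date and updates limited to
// recent dates (both assumptions, not our current layout).
DeltaTable.forPath(spark, source_location_destination_data_lake)
  .as("events")
  .merge(
    df_reader_update.as("updates"),
    // extra predicate on the partition column so merge can skip files
    // belonging to partitions that cannot match
    "events.id = updates.id AND events.action_date >= date_sub(current_date(), 7)")
  .whenMatched
  .updateExpr(Map(
    "action_name" -> "updates.action_name",
    "action_date" -> "updates.action_date"))
  .whenNotMatched
  .insertExpr(Map(
    "id"           -> "updates.id",
    "package_name" -> "updates.package_name",
    "action_name"  -> "updates.action_name",
    "action_date"  -> "updates.action_date"))
  .execute()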
Please suggest how we can better optimize this Delta Lake merge.
Thanks,
Bimal Naresh