s3a staging committer (directory committer) not writing data to s3 bucket (final output directory) in spark3

Rao, Abhishek (Nokia - IN/Bangalore)

Hi,

 

I'm running Spark 3 on Kubernetes and using the S3A staging committer (directory committer) to write data to an S3 bucket. The same setup works fine with Spark 2.4.5, but with Spark 3 the final data (written in Parquet format) is not visible in the S3 bucket, and any read of that Parquet path fails because the path is empty.

Since the S3A staging committer requires a shared file system (such as NFS or HDFS) for staging data, I have set up a shared PVC for the driver and all executors (i.e., spark.hadoop.fs.s3a.committer.staging.tmp.path points to a shared PVC with ReadWriteMany access).
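
For readers without access to the attachment, here is a minimal sketch of the settings a directory-committer setup typically involves, written as a small Python helper that prints spark-submit flags. It is not the attached spark-default.conf: the property names are the ones documented for the S3A committers and Spark's cloud integration module, while the helper name to_submit_args and the PVC mount path are assumptions made up for illustration.

```python
# Hypothetical helper: turns a dict of Spark settings into spark-submit flags.
# Property names follow the S3A committer / Spark cloud integration docs;
# the values (especially the PVC mount path) are assumed, not the poster's.
STAGING_COMMITTER_CONF = {
    # Select the directory staging committer.
    "spark.hadoop.fs.s3a.committer.name": "directory",
    # Staging path; must be visible to the driver and every executor.
    "spark.hadoop.fs.s3a.committer.staging.tmp.path": "/mnt/shared-pvc/staging",  # assumed mount point
    # Route s3a:// output through the S3A committer factory.
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    # Bind Spark SQL and Parquet writes to the Hadoop path output committers.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def to_submit_args(conf):
    """Return a flat list of '--conf key=value' arguments, sorted by key."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(STAGING_COMMITTER_CONF)))
```

The two org.apache.spark.internal.io.cloud classes ship in Spark's optional hadoop-cloud module, so they need to be on the classpath of the Spark 3 image for these bindings to load.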

 

In the S3 bucket I can see only the _SUCCESS file, without any data.

 

bash-4.2# s3cmd ls  --no-ssl --host=${AWS_ENDPOINT} --host-bucket= s3://rookbucket/shiva/ --recursive | grep people.parquet

2021-02-22 11:55      4074   s3://rookbucket/shiva/people.parquet/_SUCCESS

bash-4.2#

 

The _SUCCESS file is in JSON format, with the content below:

 

==============================

{

  "name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",

  "timestamp" : 1613994948681,

  "date" : "Mon Feb 22 11:55:48 UTC 2021",

  "hostname" : "spark-thrift-hdfs",

  "committer" : "directory",

  "description" : "Task committer attempt_20210222115547_0000_m_000000_0",

  "metrics" : {

    "stream_write_block_uploads" : 0,

    "files_created" : 5,

    "S3guard_metadatastore_put_path_latencyNumOps" : 0,

    "stream_write_block_uploads_aborted" : 0,

    "committer_commits_reverted" : 0,

    "op_open" : 2,

    "stream_closed" : 12,

    "committer_magic_files_created" : 0,

    "object_copy_requests" : 0,

    "s3guard_metadatastore_initialization" : 0,

    "S3guard_metadatastore_put_path_latency90thPercentileLatency" : 0,

    "stream_write_block_uploads_committed" : 0,

    "S3guard_metadatastore_throttle_rate75thPercentileFrequency (Hz)" : 0,

    "S3guard_metadatastore_throttle_rate90thPercentileFrequency (Hz)" : 0,

    "committer_bytes_committed" : 0,

    "op_create" : 5,

    "stream_read_fully_operations" : 0,

    "committer_commits_completed" : 0,

    "object_put_requests_active" : 0,

    "s3guard_metadatastore_retry" : 0,

    "stream_write_block_uploads_active" : 0,

    "stream_opened" : 12,

    "S3guard_metadatastore_throttle_rate95thPercentileFrequency (Hz)" : 0,

    "op_create_non_recursive" : 0,

    "object_continue_list_requests" : 0,

    "committer_jobs_completed" : 5,

    "S3guard_metadatastore_put_path_latency50thPercentileLatency" : 0,

    "stream_close_operations" : 12,

    "stream_read_operations" : 378,

    "object_delete_requests" : 4,

    "fake_directories_deleted" : 8,

    "stream_aborted" : 0,

    "op_rename" : 0,

    "object_multipart_aborted" : 0,

    "committer_commits_created" : 0,

    "op_get_file_status" : 26,

    "s3guard_metadatastore_put_path_request" : 9,

    "committer_commits_failed" : 0,

    "stream_bytes_read_in_close" : 0,

    "op_glob_status" : 1,

    "stream_read_exceptions" : 0,

    "op_exists" : 5,

    "stream_read_version_mismatches" : 0,

    "S3guard_metadatastore_throttle_rate50thPercentileFrequency (Hz)" : 0,

    "S3guard_metadatastore_put_path_latency95thPercentileLatency" : 0,

    "stream_write_block_uploads_pending" : 4,

    "directories_created" : 0,

    "S3guard_metadatastore_throttle_rateNumEvents" : 0,

    "S3guard_metadatastore_put_path_latency99thPercentileLatency" : 0,

    "stream_bytes_backwards_on_seek" : 0,

    "stream_bytes_read" : 2997558,

    "stream_write_total_data" : 16282,

    "committer_jobs_failed" : 0,

    "stream_read_operations_incomplete" : 29,

    "files_copied_bytes" : 0,

    "op_delete" : 8,

    "object_put_bytes_pending" : 0,

    "stream_write_block_uploads_data_pending" : 0,

    "op_list_located_status" : 0,

    "object_list_requests" : 19,

    "stream_forward_seek_operations" : 0,

    "committer_tasks_completed" : 0,

    "committer_commits_aborted" : 0,

    "object_metadata_requests" : 45,

    "object_put_requests_completed" : 4,

    "stream_seek_operations" : 0,

    "op_list_status" : 0,

    "store_io_throttled" : 0,

    "stream_write_failures" : 0,

    "op_get_file_checksum" : 0,

    "files_copied" : 0,

    "ignored_errors" : 8,

    "committer_bytes_uploaded" : 0,

    "committer_tasks_failed" : 0,

    "stream_bytes_skipped_on_seek" : 0,

    "op_list_files" : 0,

    "files_deleted" : 0,

    "stream_bytes_discarded_in_abort" : 0,

    "op_mkdirs" : 1,

    "op_copy_from_local_file" : 0,

    "op_is_directory" : 1,

    "s3guard_metadatastore_throttled" : 0,

    "S3guard_metadatastore_put_path_latency75thPercentileLatency" : 0,

    "stream_write_total_time" : 0,

    "stream_backward_seek_operations" : 0,

    "object_put_requests" : 4,

    "object_put_bytes" : 16282,

    "directories_deleted" : 0,

    "op_is_file" : 2,

    "S3guard_metadatastore_throttle_rate99thPercentileFrequency (Hz)" : 0

  },

  "diagnostics" : {

    "fs.s3a.metadatastore.impl" : "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore",

    "fs.s3a.committer.magic.enabled" : "false",

    "fs.s3a.metadatastore.authoritative" : "false"

  },

  "filenames" : [ ]

}

 

===============================
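
The handful of fields in this file that matter for the symptom can be pulled out with a short script. The following is only a sketch: the function names summarize_success and looks_empty are hypothetical, and the choice of which counters to inspect is an assumption based on the JSON above, not an official diagnostic.

```python
import json


def summarize_success(text):
    """Extract the fields of a _SUCCESS JSON document that show whether
    any files were actually committed (field selection is an assumption)."""
    data = json.loads(text)
    metrics = data.get("metrics", {})
    return {
        "committer": data.get("committer"),
        "filenames": len(data.get("filenames", [])),
        "commits_completed": metrics.get("committer_commits_completed", 0),
        "bytes_committed": metrics.get("committer_bytes_committed", 0),
        "uploads_pending": metrics.get("stream_write_block_uploads_pending", 0),
    }


def looks_empty(summary):
    """True when the job reported success but committed no files."""
    return summary["filenames"] == 0 and summary["commits_completed"] == 0


if __name__ == "__main__":
    # Trimmed copy of the values shown in the _SUCCESS file above.
    sample = """
    {
      "committer": "directory",
      "metrics": {
        "committer_commits_completed": 0,
        "committer_bytes_committed": 0,
        "stream_write_block_uploads_pending": 4
      },
      "filenames": []
    }
    """
    summary = summarize_success(sample)
    print(summary)
    print("empty commit:", looks_empty(summary))
```

Running the same function against the _SUCCESS file from a working Spark 2.4.5 job gives a direct side-by-side comparison of the two runs.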

With the same S3 bucket, if I run the job with Spark 2.4.5, it writes data to s3://rookbucket/shiva/people.parquet/ and the _SUCCESS file looks similar to the one above, except that the "filenames" key contains the list of part files (the Parquet data files). With Spark 3 it is an empty list, as shown above.

There is no exception or error during the write operation, but the read fails to get the schema because the Parquet output path is empty.

 

I'm not sure what is causing the issue. I have attached the Spark configuration used to submit the job (spark-default.conf).

 

I'm using Ceph as the underlying storage for the S3 buckets. If I use the rados command to check the data, I can see the Parquet data in objects whose names contain "multipart", under a path like the one matched below (but not under the final S3 output path):

 

bash-4.2# rados ls  -p rook-ceph-store.rgw.buckets.data | grep "part-00000-43466165-16d1-4b36-ab90-acb6c3c309a5"

 

Thanks and Regards,

Abhishek

 




spark-default.conf (5K) Download Attachment