S3A staging committer (directory committer) not writing data to S3 bucket (final output directory) in Spark 3

shiva
Hi,
I'm running Spark 3 on Kubernetes and using the S3A staging committer
(directory committer) to write data to an S3 bucket. The same setup works fine
with Spark 2, but with Spark 3 the final data (written in Parquet format) is
not visible in the S3 bucket, and a subsequent read of that Parquet data fails
because the path is empty.
Since the S3A staging committer requires a shared file system (such as NFS or
HDFS) for staging data, I have set up a shared PVC for the driver and all
executors (i.e., spark.hadoop.fs.s3a.committer.staging.tmp.path points to a
shared PVC with ReadWriteMany access).
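
For readers comparing configurations, here is a minimal sketch of the settings that the Spark cloud-integration and Hadoop S3A committer documentation list for binding the directory staging committer. It is not the poster's attached configuration, which is not reproduced in this thread. The staging path is a hypothetical mount point for the shared PVC, and the `org.apache.spark.internal.io.cloud` classes are assumed to be on the classpath through the spark-hadoop-cloud module.

```python
# Sketch only: property names follow the Spark and Hadoop S3A committer docs.
# The tmp path is a HYPOTHETICAL mount point for the shared PVC.
STAGING_COMMITTER_CONF = {
    "spark.hadoop.fs.s3a.committer.name": "directory",
    "spark.hadoop.fs.s3a.committer.staging.tmp.path": "/mnt/shared-pvc/staging",
    "spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a":
        "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory",
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}


def render_spark_defaults(conf):
    """Render a dict of properties as spark-defaults.conf lines."""
    width = max(len(key) for key in conf) + 2
    return "\n".join(f"{key.ljust(width)}{value}" for key, value in sorted(conf.items()))


def missing_keys(actual, expected=STAGING_COMMITTER_CONF):
    """Return the expected property names that are absent from `actual`."""
    return sorted(key for key in expected if key not in actual)


if __name__ == "__main__":
    print(render_spark_defaults(STAGING_COMMITTER_CONF))
```

Passing the properties from an existing spark-defaults.conf to `missing_keys` as a dict gives a quick way to see whether any of the committer bindings are absent.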

In the S3 bucket I can see only the _SUCCESS file, without any data.

bash-4.2# s3cmd ls --no-ssl --host=${AWS_ENDPOINT} --host-bucket= s3://rookbucket/shiva/ --recursive | grep people.parquet
2021-02-22 11:55      4074   s3://rookbucket/shiva/people.parquet/_SUCCESS
bash-4.2#

The _SUCCESS file is in JSON format, with the following content:

==============================
{
  "name" : "org.apache.hadoop.fs.s3a.commit.files.SuccessData/1",
  "timestamp" : 1613994948681,
  "date" : "Mon Feb 22 11:55:48 UTC 2021",
  "hostname" : "spark-thrift-hdfs",
  "committer" : "directory",
  "description" : "Task committer attempt_20210222115547_0000_m_000000_0",
  "metrics" : {
    "stream_write_block_uploads" : 0,
    "files_created" : 5,
    "S3guard_metadatastore_put_path_latencyNumOps" : 0,
    "stream_write_block_uploads_aborted" : 0,
    "committer_commits_reverted" : 0,
    "op_open" : 2,
    "stream_closed" : 12,
    "committer_magic_files_created" : 0,
    "object_copy_requests" : 0,
    "s3guard_metadatastore_initialization" : 0,
    "S3guard_metadatastore_put_path_latency90thPercentileLatency" : 0,
    "stream_write_block_uploads_committed" : 0,
    "S3guard_metadatastore_throttle_rate75thPercentileFrequency (Hz)" : 0,
    "S3guard_metadatastore_throttle_rate90thPercentileFrequency (Hz)" : 0,
    "committer_bytes_committed" : 0,
    "op_create" : 5,
    "stream_read_fully_operations" : 0,
    "committer_commits_completed" : 0,
    "object_put_requests_active" : 0,
    "s3guard_metadatastore_retry" : 0,
    "stream_write_block_uploads_active" : 0,
    "stream_opened" : 12,
    "S3guard_metadatastore_throttle_rate95thPercentileFrequency (Hz)" : 0,
    "op_create_non_recursive" : 0,
    "object_continue_list_requests" : 0,
    "committer_jobs_completed" : 5,
    "S3guard_metadatastore_put_path_latency50thPercentileLatency" : 0,
    "stream_close_operations" : 12,
    "stream_read_operations" : 378,
    "object_delete_requests" : 4,
    "fake_directories_deleted" : 8,
    "stream_aborted" : 0,
    "op_rename" : 0,
    "object_multipart_aborted" : 0,
    "committer_commits_created" : 0,
    "op_get_file_status" : 26,
    "s3guard_metadatastore_put_path_request" : 9,
    "committer_commits_failed" : 0,
    "stream_bytes_read_in_close" : 0,
    "op_glob_status" : 1,
    "stream_read_exceptions" : 0,
    "op_exists" : 5,
    "stream_read_version_mismatches" : 0,
    "S3guard_metadatastore_throttle_rate50thPercentileFrequency (Hz)" : 0,
    "S3guard_metadatastore_put_path_latency95thPercentileLatency" : 0,
    "stream_write_block_uploads_pending" : 4,
    "directories_created" : 0,
    "S3guard_metadatastore_throttle_rateNumEvents" : 0,
    "S3guard_metadatastore_put_path_latency99thPercentileLatency" : 0,
    "stream_bytes_backwards_on_seek" : 0,
    "stream_bytes_read" : 2997558,
    "stream_write_total_data" : 16282,
    "committer_jobs_failed" : 0,
    "stream_read_operations_incomplete" : 29,
    "files_copied_bytes" : 0,
    "op_delete" : 8,
    "object_put_bytes_pending" : 0,
    "stream_write_block_uploads_data_pending" : 0,
    "op_list_located_status" : 0,
    "object_list_requests" : 19,
    "stream_forward_seek_operations" : 0,
    "committer_tasks_completed" : 0,
    "committer_commits_aborted" : 0,
    "object_metadata_requests" : 45,
    "object_put_requests_completed" : 4,
    "stream_seek_operations" : 0,
    "op_list_status" : 0,
    "store_io_throttled" : 0,
    "stream_write_failures" : 0,
    "op_get_file_checksum" : 0,
    "files_copied" : 0,
    "ignored_errors" : 8,
    "committer_bytes_uploaded" : 0,
    "committer_tasks_failed" : 0,
    "stream_bytes_skipped_on_seek" : 0,
    "op_list_files" : 0,
    "files_deleted" : 0,
    "stream_bytes_discarded_in_abort" : 0,
    "op_mkdirs" : 1,
    "op_copy_from_local_file" : 0,
    "op_is_directory" : 1,
    "s3guard_metadatastore_throttled" : 0,
    "S3guard_metadatastore_put_path_latency75thPercentileLatency" : 0,
    "stream_write_total_time" : 0,
    "stream_backward_seek_operations" : 0,
    "object_put_requests" : 4,
    "object_put_bytes" : 16282,
    "directories_deleted" : 0,
    "op_is_file" : 2,
    "S3guard_metadatastore_throttle_rate99thPercentileFrequency (Hz)" : 0
  },
  "diagnostics" : {
    "fs.s3a.metadatastore.impl" : "org.apache.hadoop.fs.s3a.s3guard.NullMetadataStore",
    "fs.s3a.committer.magic.enabled" : "false",
    "fs.s3a.metadatastore.authoritative" : "false"
  },
  "filenames" : [ ]
}

===============================
With the same S3 bucket, a Spark 2 job writes the data to
s3://rookbucket/shiva/people.parquet/ and the _SUCCESS file looks similar
to the one above, except that its "filenames" key lists the part files
(the Parquet data files). With Spark 3 the list is empty, as shown above.
There is no exception or error during the write, but the read fails to get
the schema because the Parquet path is empty.
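
The comparison above (an empty "filenames" list and zero committed bytes) can be checked mechanically. Below is a minimal sketch that summarizes a _SUCCESS manifest; the function names are illustrative, and the sample is trimmed from the JSON in this post. An empty file list is treated only as a warning sign, not as proof of a specific cause.

```python
import json


def summarize_success(manifest):
    """Pull the fields that matter for an 'output is empty' check."""
    metrics = manifest.get("metrics", {})
    filenames = manifest.get("filenames", [])
    return {
        "committer": manifest.get("committer"),
        "file_count": len(filenames),
        "commits_completed": metrics.get("committer_commits_completed", 0),
        "bytes_committed": metrics.get("committer_bytes_committed", 0),
        "uploads_pending": metrics.get("stream_write_block_uploads_pending", 0),
        "looks_empty": len(filenames) == 0,
    }


def load_success(path):
    """Load a _SUCCESS file that was downloaded to local disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Trimmed copy of the manifest shown above.
SAMPLE = json.loads("""
{
  "committer": "directory",
  "metrics": {
    "committer_commits_completed": 0,
    "committer_bytes_committed": 0,
    "stream_write_block_uploads_pending": 4
  },
  "filenames": []
}
""")

if __name__ == "__main__":
    for key, value in summarize_success(SAMPLE).items():
        print(f"{key}: {value}")
```

Running the same summary on the manifests from the Spark 2 and Spark 3 jobs side by side makes the difference in "filenames" and the commit counters easy to spot.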

I'm not sure what is causing the issue. The Spark configuration used to
submit the job is attached (spark-default.conf
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t11249/spark-default.conf>
).

I'm using Ceph as the underlying storage for the S3 buckets. If I use the
rados command to inspect the data, I can see Parquet data in objects whose
names contain "multipart", under a path like the one below (but not in the
final S3 output path):

bash-4.2# rados ls -p rook-ceph-store.rgw.buckets.data | grep "part-00000-43466165-16d1-4b36-ab90-acb6c3c309a5"
4bd26ab1-6211-4aa5-92d9-9595ad0ee383.454449.1__multipart_shiva/people.parquet/part-00000-43466165-16d1-4b36-ab90-acb6c3c309a5-c000-spark-66e7529285e54226b94d61c2263be83b.snappy.parquet.2~d3to_jPrAO_BLxTu74GXr_g_sz4pvQF.1
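
The staging committers upload task output as multipart uploads and complete them only at job commit, so `__multipart_` objects with no visible file would be consistent with uploads that were started but never completed. A minimal sketch for listing incomplete uploads through the S3 API follows. It assumes boto3 is installed and that the endpoint supports ListMultipartUploads; the bucket and prefix in the usage note are taken from this post, and the function names are illustrative.

```python
def pending_uploads(client, bucket, prefix):
    """Return (key, upload_id) for each incomplete multipart upload under prefix."""
    found = []
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_multipart_uploads(**kwargs)
        for upload in page.get("Uploads", []):
            found.append((upload["Key"], upload["UploadId"]))
        if not page.get("IsTruncated"):
            return found
        # Continue from where the previous page stopped.
        kwargs["KeyMarker"] = page["NextKeyMarker"]
        kwargs["UploadIdMarker"] = page["NextUploadIdMarker"]


def make_client(endpoint_url):
    """Build an S3 client for a custom endpoint (credentials come from the environment)."""
    import boto3  # third-party dependency, imported lazily

    return boto3.client("s3", endpoint_url=endpoint_url)
```

For example, `pending_uploads(make_client(endpoint), "rookbucket", "shiva/people.parquet/")` would list any uploads still waiting to be completed or aborted, where `endpoint` is the URL of the object store.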

Could someone help me debug this, or point me to any known issue in this area?

Regards,
Shiva






--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: s3a staging committer(directory committer )not writing data to s3 bucket (final output directory) in spark3

shiva
Any suggestions or help is greatly appreciated!




Re: s3a staging committer(directory committer )not writing data to s3 bucket (final output directory) in spark3

Mich Talebzadeh
In reply to this post by shiva

Hi,

Which exact version of Spark is this?

HTH


LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 



On Mon, 22 Feb 2021 at 14:41, shiva <[hidden email]> wrote:
Hi,
I'm running spark3 on Kubernetes and using S3A staging committer (directory
committer) to write data to s3 bucket. [...]

Re: s3a staging committer(directory committer )not writing data to s3 bucket (final output directory) in spark3

shiva
Hi Mich Talebzadeh,
Thanks for your reply. The issue is seen with Spark 3.0.0; with Spark 2.4.5
it works without any problem.

Regards,
Shiva




Re: s3a staging committer(directory committer )not writing data to s3 bucket (final output directory) in spark3

Mich Talebzadeh


Hi,


We also have an issue with data not being displayed on Google Cloud Dataproc 2, which uses Spark 3.1.1.


It works with Spark 3.0.1 on-prem but not with 3.1.1 on Google Dataproc (offered as a service). It may be related to the Spark version.


It is concerning.


HTH



On Mon, 1 Mar 2021 at 18:55, shiva <[hidden email]> wrote:
Hi Mich Talebzadeh,
Thanks for your reply, the issue is seen in spark 3.0.0 and with spark 2.4.5
it works without any problem.


Re: s3a staging committer(directory committer )not writing data to s3 bucket (final output directory) in spark3

shiva
Hi Mich Talebzadeh,
Could you please share the Spark configuration used to run the job? You
mentioned that it works on 3.0.1; I will check whether I am using the same
configuration.

Regards,
Shiva




Re: s3a staging committer(directory committer )not writing data to s3 bucket (final output directory) in spark3

Mich Talebzadeh
Hi Shiva,

This works on 3.0.1 on-prem but not on Google Dataproc with Spark 3.1.1-RC2.

These are the jar files used for Structured Streaming, all added under
$SPARK_HOME/jars on all nodes:

spark-sql-kafka-0-10_2.12-3.0.1.jar
kafka-clients-2.7.0.jar
spark-token-provider-kafka-0-10_2.12-3.0.1.jar
commons-pool2-2.9.0.jar

Also add these to spark-defaults.conf under $SPARK_HOME/conf on all nodes:

spark.driver.extraClassPath        $SPARK_HOME/jars/*.jar
spark.executor.extraClassPath      $SPARK_HOME/jars/*.jar


Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201)

HTH


On Tue, 2 Mar 2021 at 11:47, shiva <[hidden email]> wrote:
Hi Mich Talebzadeh,
Could you please share the spark configuration used to run the job? you
mentioned it works on 3.0.1 I will check if I am also using the same
configuration or not.
