Spark job stuck at s3a-file-system metrics system started

Spark job stuck at s3a-file-system metrics system started

Aniruddha P Tekade
Hello,

I am trying to run a Spark job that writes data to a bucket behind a custom S3 endpoint, but the job is stuck at this line of output and is not moving forward at all -
20/04/29 16:03:59 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/').
20/04/29 16:03:59 INFO SharedState: Warehouse path is 'file:/Users/abc/IdeaProjects/qct-air-detection/spark-warehouse/'.
20/04/29 16:04:01 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
20/04/29 16:04:02 INFO MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
20/04/29 16:04:02 INFO MetricsSystemImpl: s3a-file-system metrics system started
After a long wait it shows this -
org.apache.hadoop.fs.s3a.AWSClientIOException: doesBucketExist on test-bucket: com.amazonaws.SdkClientException: Unable to execute HTTP request: Connect to s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72] failed: Connection refused (Connection refused): Unable to execute HTTP request: Connect to s3-region0.mycloud.com:443 [s3-region0.mycloud.com/10.10.3.72] failed: Connection refused (Connection refused)

However, I am able to access this bucket from the aws cli on the same machine. I don't understand why it says it is unable to execute the HTTP request.

I am using -

spark               3.0.0-preview2
hadoop-aws          3.2.0
aws-java-sdk-bundle 1.11.375
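
For reference, a build.sbt sketch pinning these versions (assuming an sbt build; hadoop-aws 3.2.0 was built against aws-java-sdk-bundle 1.11.375, so the pair should move together):

// build.sbt sketch (assumption: sbt project; version strings from the post)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"           % "3.0.0-preview2" % "provided",
  "org.apache.hadoop" % "hadoop-aws"          % "3.2.0",
  "com.amazonaws"     % "aws-java-sdk-bundle" % "1.11.375"
)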

My Spark code sets the following properties on the Hadoop configuration -

val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoopConf.set("fs.s3a.endpoint", ENDPOINT)
hadoopConf.set("fs.s3a.access.key", ACCESS_KEY)
hadoopConf.set("fs.s3a.secret.key", SECRET_KEY)
hadoopConf.set("fs.s3a.path.style.access", "true")
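
Equivalently, these settings can be passed once when building the session via the spark.hadoop. prefix, which Spark copies into the Hadoop configuration on the driver and executors. A minimal sketch, assuming the same ENDPOINT/ACCESS_KEY/SECRET_KEY values (the app name is illustrative):

import org.apache.spark.sql.SparkSession

// Sketch: the same s3a settings supplied through the spark.hadoop. prefix.
val spark = SparkSession.builder()
  .appName("qct-air-detection") // illustrative name
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
  .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
  .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()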

Can someone help me understand what is wrong here? Is there anything else I need to configure? The custom S3 endpoint and its keys are valid and work from an aws cli profile. What is wrong with the Scala code here?

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{current_date, dayofmonth, month, year}
import org.apache.spark.sql.streaming.{DataStreamWriter, StreamingQuery, Trigger}

val dataStreamWriter: DataStreamWriter[Row] = PM25quality
  .select(
    dayofmonth(current_date()) as "day",
    month(current_date()) as "month",
    year(current_date()) as "year")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/Users/abc/Desktop/qct-checkpoint/")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime("15 seconds"))
  .partitionBy("year", "month", "day")
  .option("path", "s3a://test-bucket")

val streamingQuery: StreamingQuery = dataStreamWriter.start()
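
One note on the snippet above: start() only launches the query; the driver has to block on it or the application can exit before any batch runs, e.g.:

// Block the driver until the streaming query terminates (or fails).
streamingQuery.awaitTermination()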
Aniruddha
-----------

Re: Spark job stuck at s3a-file-system metrics system started

Abhisheks
Hi there,

I read your question and believe you are on the right path. But one thing worth
checking is whether you are able to connect to the S3 bucket from your worker
nodes.

I did read that you can reach it from your machine, but since the write
happens at the worker end, it is worth checking the connection from there;
a quick probe like the sketch below can confirm it.
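
A sketch, assuming a SparkSession named spark; host and port are taken from the error message in your post:

import java.net.{InetSocketAddress, Socket}

// Sketch: try a TCP connect to the S3 endpoint from each executor slot,
// so the check runs on the workers rather than the driver.
val endpointHost = "s3-region0.mycloud.com" // from the error message
val endpointPort = 443

val results = spark.sparkContext
  .parallelize(1 to 8, 8) // spread probes across executor slots
  .map { i =>
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress(endpointHost, endpointPort), 5000)
      s"probe $i: connected"
    } catch {
      case e: Exception => s"probe $i: failed - ${e.getMessage}"
    } finally socket.close()
  }
  .collect()

results.foreach(println)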

Best,
Shobhit G





Re: Spark job stuck at s3a-file-system metrics system started

Gourav Sengupta
In reply to this post by Aniruddha P Tekade
Hi,

I think that we should stop using S3a, and use S3. 

Please read about EMRFS and the fantastic advantages it provides :)
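
For context, on an EMR cluster the s3:// scheme is backed by EMRFS, so the write needs no fs.s3a.* settings or explicit keys. A minimal sketch, assuming the job runs on EMR with an instance role that can reach the bucket (paths are illustrative):

// Sketch: writing through EMRFS on EMR; credentials come from the
// instance profile rather than access/secret keys. Paths are hypothetical.
val query = PM25quality.writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://test-bucket/checkpoints/")
  .option("path", "s3://test-bucket/output/")
  .start()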


Regards,
Gourav Sengupta 
