Path style access fs.s3a.path.style.access property is not working in spark code

Path style access fs.s3a.path.style.access property is not working in spark code

Aniruddha P Tekade
Hello Users,

I am using an on-premise object store and can perform operations on different buckets using the aws-cli.
However, when I try to use the same path from my Spark code, it fails. Here are the details -

Added dependencies in build.sbt -
  • hadoop-aws-2.7.4.jar
  • aws-java-sdk-1.7.4.jar
Spark Hadoop configuration is set up as -
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
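
(For reference, the same settings can equivalently be supplied when the SparkSession is built, using the "spark.hadoop." prefix, which forwards each property into the Hadoop Configuration. A minimal sketch, assuming the same ENDPOINT/ACCESS_KEY/SECRET_KEY placeholder values as above; the app name is made up:)

```scala
import org.apache.spark.sql.SparkSession

// Sketch: equivalent S3A settings passed at session build time.
// Each "spark.hadoop.*" property is copied into hadoopConfiguration,
// mirroring the hadoopConfiguration.set(...) calls above.
val spark = SparkSession.builder()
  .appName("s3a-path-style-demo")
  .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
  .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
  .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
  .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .getOrCreate()
```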
Now I try to write data to my custom S3 endpoint as follows -
val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
    dayofmonth(current_date()) as "day",
    month(current_date()) as "month",
    year(current_date()) as "year",
    column("time"),
    column("quality"),
    column("PM25"))
  .writeStream
  .partitionBy("year", "month", "day")
  .format("csv")
  .outputMode("append")
  .option("path", "s3a://test-bucket/")

val streamingQuery: StreamingQuery = dataStreamWriter.start()

However, I am getting an error that AmazonHttpClient is unable to execute the HTTP request, and
the client is prepending the bucket name to the endpoint hostname. It seems the Hadoop configuration is not being picked up here -

20/05/01 16:51:37 INFO AmazonHttpClient: Unable to execute HTTP request: test-bucket.s3-region0.cloudian.com
java.net.UnknownHostException: test-bucket.s3-region0.cloudian.com
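
For context on what that hostname indicates: with virtual-hosted-style addressing the client folds the bucket name into the hostname, while path-style keeps the bucket in the URL path. A minimal sketch of the two URL forms (the helper names toVirtualHosted/toPathStyle are hypothetical, for illustration only; the real URL construction happens inside the AWS SDK, and the hostnames come from the log above):

```scala
// Illustrative sketch of the two S3 request addressing styles.
def toVirtualHosted(bucket: String, endpoint: String): String =
  s"http://$bucket.$endpoint/" // bucket becomes part of the hostname

def toPathStyle(bucket: String, endpoint: String): String =
  s"http://$endpoint/$bucket/" // bucket stays in the path

// toVirtualHosted("test-bucket", "s3-region0.cloudian.com") yields exactly
// the failing host in the log: an on-premise store usually has no DNS
// entry for the bucket-prefixed name, hence the UnknownHostException.
```

With path-style access in effect, the request would instead go to the bare endpoint host, which does resolve.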

Is there anything I am missing in the configuration? Even after setting path style access to true,
it does not seem to take effect.

--
Aniruddha
-----------

Re: Path style access fs.s3a.path.style.access property is not working in spark code

Aniruddha P Tekade
Hello Users,

I found the solution to this. If you are writing to a custom S3 endpoint, use hadoop-aws-2.8.0.jar (or later): the separate fs.s3a.path.style.access flag for enabling path-style access was only introduced in that release, so the 2.7.4 jar ignores it.
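
As a sketch, the upgraded build.sbt dependencies might look like the following (the aws-java-sdk-s3 version here is an assumption; check the hadoop-aws 2.8.0 POM for the exact SDK version it was built against):

```scala
// build.sbt (sketch): hadoop-aws 2.8.0+ honors fs.s3a.path.style.access.
// Hadoop 2.8 depends on the split aws-java-sdk-s3 artifact rather than
// the monolithic aws-java-sdk-1.7.4 jar used with hadoop-aws 2.7.4.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-aws" % "2.8.0",
  "com.amazonaws" % "aws-java-sdk-s3" % "1.10.6" // assumed; verify in the POM
)
```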

Best,
Aniruddha
-----------


On Fri, May 1, 2020 at 5:08 PM Aniruddha P Tekade <[hidden email]> wrote:

Re: Path style access fs.s3a.path.style.access property is not working in spark code

Samik Raychaudhuri
I recommend using hadoop-aws v2.9.x; it contains a lot of optimizations that make life much easier when accessing S3 from Spark.
Thanks.
-Samik
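
Following that recommendation, the bump is a one-line change in build.sbt (2.9.2 is one example 2.9.x release; pick the patch release matching your cluster, and again match the AWS SDK version to the hadoop-aws POM):

```scala
// build.sbt (sketch): one 2.9.x release of hadoop-aws.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.9.2"
```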

On 05-05-2020 01:55 am, Aniruddha P Tekade wrote:

--
Samik Raychaudhuri, Ph.D.
http://in.linkedin.com/in/samikr/