spark writeStream not working with custom S3 endpoint


Aniruddha P Tekade

Hello,

While working with Spark Structured Streaming (v2.4.3), I am trying to write my streaming DataFrame to a custom S3 endpoint. I have verified that I can log in and upload data to the S3 buckets manually through the UI, and I have also set up the ACCESS_KEY and SECRET_KEY for it.

import org.apache.spark.sql.streaming.Trigger

val sc = spark.sparkContext
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3-region1.myObjectStore.com:443")
sc.hadoopConfiguration.set("fs.s3a.access.key", "00cce9eb2c589b1b1b5b")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "flmheKX9Gb1tTlImO6xR++9kvnUByfRKZfI7LJT8")
sc.hadoopConfiguration.set("fs.s3a.path.style.access", "true") // bucket name appended as url/bucket and not bucket.url
val writeToS3Query = stream.writeStream
      .format("csv")
      .option("sep", ",")
      .option("header", true)
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .option("path", "s3a://bucket0/")
      .option("checkpointLocation", "/Users/home/checkpoints/s3-checkpointing")
      .start()

However, I am getting this error:

Unable to execute HTTP request: bucket0.s3-region1.myObjectStore.com: nodename nor servname provided, or not known

I have a mapping of the URL and IP in my /etc/hosts file, and the bucket is accessible from other sources. Is there any other way to do this successfully? I am really not sure why the bucket name is being prepended to the endpoint URL when Spark executes the query: the hostname in the error (bucket0.s3-region1.myObjectStore.com) is the virtual-hosted style, not the path style (s3-region1.myObjectStore.com/bucket0) that fs.s3a.path.style.access=true is supposed to produce.
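
To isolate whether this is a Spark issue, I am thinking of testing the same settings directly against the Hadoop FileSystem API, something like this (untested sketch; bucket0 and the endpoint are just my values from above):

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
conf.set("fs.s3a.endpoint", "s3-region1.myObjectStore.com:443")
conf.set("fs.s3a.access.key", "00cce9eb2c589b1b1b5b")
conf.set("fs.s3a.secret.key", "flmheKX9Gb1tTlImO6xR++9kvnUByfRKZfI7LJT8")
conf.set("fs.s3a.path.style.access", "true")

// List the bucket through the s3a connector alone; if this also tries
// bucket0.s3-region1.myObjectStore.com, the problem is in the s3a/endpoint
// configuration rather than in Structured Streaming.
val fs = FileSystem.get(new URI("s3a://bucket0/"), conf)
fs.listStatus(new Path("/")).foreach(status => println(status.getPath))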

Could this be because I am setting the Hadoop configurations on the Spark context after the session is created, so that they are not taking effect? But then how is it able to resolve the endpoint URL at all, when the path I am providing is only s3a://bucket0?
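
If the timing is the problem, I assume the fix would be to pass the same properties with the spark.hadoop. prefix when the session is built, so they are in the Hadoop configuration from the start. Untested sketch (the app name is just a placeholder):

import org.apache.spark.sql.SparkSession

// spark.hadoop.* properties are copied into the underlying Hadoop
// configuration when the session is created.
val spark = SparkSession.builder()
      .appName("s3-streaming")
      .config("spark.hadoop.fs.s3a.endpoint", "s3-region1.myObjectStore.com:443")
      .config("spark.hadoop.fs.s3a.access.key", "00cce9eb2c589b1b1b5b")
      .config("spark.hadoop.fs.s3a.secret.key", "flmheKX9Gb1tTlImO6xR++9kvnUByfRKZfI7LJT8")
      .config("spark.hadoop.fs.s3a.path.style.access", "true")
      .getOrCreate()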


Best,
Aniruddha
-----------