Can't get Spark to interface with S3A Filesystem with correct credentials

Can't get Spark to interface with S3A Filesystem with correct credentials

Devin Boyer
Hello,

I'm attempting to run Spark within a Docker container with the hope of eventually running Spark on Kubernetes. Nearly all the data we currently process with Spark is stored in S3, so I need to be able to interface with it using the S3A filesystem.

I feel like I've gotten close to getting this working, but for some reason I can't yet get my local Spark installation to talk to S3 correctly.

A basic example of what I've tried:
  • Build the Kubernetes Docker image by downloading the spark-2.4.5-bin-hadoop2.7.tgz archive and building the kubernetes/dockerfiles/spark/Dockerfile image.
  • Run an interactive Docker container using the image built above.
  • Within that container, run spark-shell, passing valid AWS credentials by setting spark.hadoop.fs.s3a.access.key and spark.hadoop.fs.s3a.secret.key via --conf flags, and pulling in the hadoop-aws package via the --packages org.apache.hadoop:hadoop-aws:2.7.3 flag (a sketch of the invocation is shown after this list).
  • Try to access the sample public file from the "Integration with Cloud Infrastructures" documentation by running: sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
  • Observe that this fails with a 403 Forbidden exception thrown by S3.
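
Concretely, the invocation looks roughly like this (the key values below are placeholders, not my real credentials):

```
spark-shell \
  --packages org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.access.key=<MY_ACCESS_KEY> \
  --conf spark.hadoop.fs.s3a.secret.key=<MY_SECRET_KEY>

scala> sc.textFile("s3a://landsat-pds/scene_list.gz").take(5)
```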

I've tried a variety of other ways of setting credentials (like exporting the standard AWS_ACCESS_KEY_ID environment variable before launching spark-shell) and other ways of building a Spark image that includes the appropriate libraries (see this GitHub repo: https://github.com/drboyer/spark-s3a-demo), all with the same results. I've also tried accessing objects within our own AWS account, rather than the object from the public landsat-pds bucket, and get the same 403 error.

Can anyone help explain why I can't seem to connect to S3 successfully from Spark, or point me to where I could look for additional clues about what's misconfigured? I've tried turning up the logging verbosity and didn't see much that was particularly useful, but I'm happy to share additional log output too.

Thanks for any help you can provide!

Best,
Devin Boyer

Re: Can't get Spark to interface with S3A Filesystem with correct credentials

stevenstetzler
To successfully read from S3 using s3a, I've also had to set
```
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
```
in addition to `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key`. I've also needed to ensure Spark has access to the AWS SDK jar: I downloaded `aws-java-sdk-1.7.4.jar` (from Maven) and placed it alongside `hadoop-aws-2.7.3.jar` in `$SPARK_HOME/jars`.

These additional configurations don't seem related to credentials or security (and may not even be needed in my case), but perhaps they will help you.
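
One quick sanity check, from inside spark-shell (where `sc` is predefined), is to resolve the filesystem for the s3a scheme directly. This is just a sketch, but if hadoop-aws or the AWS SDK jar is missing from the classpath it tends to fail right here with a missing-class or "No FileSystem for scheme" error:

```
// Resolve the FileSystem implementation registered for the s3a:// scheme.
// If the jars and the fs.s3a.impl setting are being picked up, this returns
// an org.apache.hadoop.fs.s3a.S3AFileSystem instance.
val fs = org.apache.hadoop.fs.FileSystem.get(
  new java.net.URI("s3a://landsat-pds/"),
  sc.hadoopConfiguration)
println(fs.getClass.getName)
```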

Thanks,
Steven

Re: Can't get Spark to interface with S3A Filesystem with correct credentials

Hariharan
In reply to this post by Devin Boyer
If you're using Hadoop 2.7 or below, you may also need to set the following Hadoop properties:

```
fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3a.S3A
fs.AbstractFileSystem.s3a.impl=org.apache.hadoop.fs.s3a.S3A
```

Hadoop 2.8 and above would have these set by default.
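
From Spark you can pass these by prefixing each key with spark.hadoop. on the --conf flags (or in spark-defaults.conf), or set them on the Hadoop configuration from the shell. A rough sketch of the latter, run inside spark-shell where `sc` is predefined:

```
// Equivalent to passing --conf spark.hadoop.<key>=<value> at launch:
// register the S3A implementation for both the s3a:// and s3:// schemes.
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.AbstractFileSystem.s3.impl", "org.apache.hadoop.fs.s3a.S3A")
sc.hadoopConfiguration.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
```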

Thanks,
Hariharan

Re: Can't get Spark to interface with S3A Filesystem with correct credentials

Devin Boyer
Thanks for the input, Steven and Hariharan. I think this ended up being a combination of a misconfigured credential provider and using the wrong set of credentials for the test data I was trying to access.

I was able to get this working with both Hadoop 2.8 and 3.1 by pulling down the correct hadoop-aws and aws-java-sdk[-bundle] jars and fixing the credential provider I was using for testing. It's probably the same for the Spark distribution compiled for Hadoop 2.7, but since I already have a build working with a more modern Hadoop version, I may just stick with that.
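
For anyone who lands here later, the working setup boiled down to something roughly like this (a sketch; SimpleAWSCredentialsProvider is the hadoop-aws 2.8+/3.x provider that reads fs.s3a.access.key/fs.s3a.secret.key, and the key values are placeholders):

```
spark-shell \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider \
  --conf spark.hadoop.fs.s3a.access.key=<ACCESS_KEY> \
  --conf spark.hadoop.fs.s3a.secret.key=<SECRET_KEY>
```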

Best,
Devin
