Cross Region Apache Spark Setup

Cross Region Apache Spark Setup

Stone Zhong-2
Hi,

I am trying to set up a cross-region Apache Spark cluster. All my data is stored in Amazon S3 and is well partitioned by region.

For example, I have Parquet files at:
    S3://mybucket/sales_fact.parquet/us-west
    S3://mybucket/sales_fact.parquet/us-east
    S3://mybucket/sales_fact.parquet/uk

And my cluster has nodes in the us-west, us-east and uk regions -- basically, I have nodes in every region that I support.

When I have code like:

df = spark.read.parquet("S3://mybucket/sales_fact.parquet/*")
print(df.count()) #1
print(df.select("product_id").distinct().count()) #2

For #1, I expect only the us-west nodes to read the data partition in us-west (and likewise for the other regions), and Spark to add up the 3 regional counts and return the total count to me. I do not expect large cross-region data transfer in this case.
For #2, I expect only the us-west nodes to read the data partition in us-west (and likewise for the other regions). Each region would do the distinct() locally first, then the 3 "product_id" lists would be merged and deduplicated with another distinct(). I am OK with the cross-region data transfer needed to merge the distinct product_ids.

Can anyone please share the best practice? Is it possible to configure Apache Spark to work in such a way?

Any ideas and help are appreciated!

Thanks,
Stone

Re: Cross Region Apache Spark Setup

ZHANG Wei
There might be 3 options:

1. Just as you expect: only ONE application and ONE RDD, with region-aware containers and executors allocated and distributed automatically. The ResourceProfile work (https://issues.apache.org/jira/browse/SPARK-27495) may meet the requirement, treating the region as a type of resource just like a GPU. But you would have to wait for the full feature to be finished, and I can imagine the troubleshooting challenges.
2. Label the YARN nodes with a region tag, group them into queues, and submit the jobs for different regions into dedicated queues (with the --queue argument when submitting); a minimal sketch of such a per-region job follows this list.
3. Build separate Spark clusters with independent YARN ResourceManagers per region, such as a UK cluster, a US-east cluster and a US-west cluster. It looks dirty, but it is easy to deploy and manage, and you can schedule jobs around each region's busy and idle hours to get more performance at lower cost.
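
A minimal sketch of what one per-region job for option 2 could look like. This is only an illustration: the region label, queue name and bucket path are examples, it assumes the YARN nodes have already been labelled by region with matching queues, and normally you would pass --queue and these confs on the spark-submit command line instead of in the builder.

from pyspark.sql import SparkSession

REGION = "us-west"  # example region; one such job is submitted per region

spark = (
    SparkSession.builder
    .appName(f"sales_fact_{REGION}")
    .config("spark.yarn.queue", REGION)                         # dedicated per-region queue
    .config("spark.yarn.am.nodeLabelExpression", REGION)        # keep the AM in-region
    .config("spark.yarn.executor.nodeLabelExpression", REGION)  # keep executors in-region
    .getOrCreate()
)

# Each job reads only its own region's partition of the Parquet data.
df = spark.read.parquet(f"s3://mybucket/sales_fact.parquet/{REGION}")
print(df.count())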

Just my 2 cents

---
Cheers,
-z

Re: Cross Region Apache Spark Setup

Stone Zhong-2
Thank you Wei.

I will look into option 1. With option 2, it seems the complexity is pushed to the application -- the application needs to write multiple queries and merge the final results.
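
For example, a rough sketch of what that application-side merge might look like (same bucket and region layout as in my original mail; in the real setup the per-region queries would run as separate jobs and only their results would be combined):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("merge_regions").getOrCreate()

regions = ["us-west", "us-east", "uk"]
dfs = [spark.read.parquet(f"s3://mybucket/sales_fact.parquet/{r}") for r in regions]

# 1: total count = sum of the per-region counts
print(sum(df.count() for df in dfs))

# 2: per-region distinct product_ids, then union and deduplicate again
ids = [df.select("product_id").distinct() for df in dfs]
print(reduce(lambda a, b: a.union(b), ids).distinct().count())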

Regards,
Stone
