Triggering SQL on AWS S3 via Apache Spark


Triggering SQL on AWS S3 via Apache Spark

Omer.Ozsakarya

Hi guys,

 

We are using Apache Spark on a local machine.

 

I need to implement the scenario below.

 

In the initial load:

  1. The CRM application will send a file to a folder. This file contains the customer information of all customers and sits in a folder on the local server. The file name is customer.tsv.
    1. customer.tsv contains customerid, country, birth_month, activation_date, etc.
  2. I need to read the contents of customer.tsv.
  3. I will add a current-timestamp column to the data.
  4. I will transfer customer.tsv to the S3 bucket customer.history.data (see the sketch after this list).
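A minimal sketch of this initial load, assuming Spark 2.x with the hadoop-aws/S3A connector on the classpath, a TSV file with a header row, and illustrative path and column names such as load_ts (none of these are stated in the thread):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.current_timestamp

  val spark = SparkSession.builder().appName("InitialCustomerLoad").getOrCreate()

  // Read the full customer extract from the local folder (path is illustrative).
  val customers = spark.read
    .option("sep", "\t")
    .option("header", "true")
    .csv("file:///data/crm/customer.tsv")

  // Add a load timestamp so later merges can pick the latest version per customer.
  val customersWithTs = customers.withColumn("load_ts", current_timestamp())

  // Write to the history bucket through the S3A connector.
  customersWithTs.write
    .mode("append")
    .option("sep", "\t")
    .option("header", "true")
    .csv("s3a://customer.history.data/customer/")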

 

In the daily loads:

  1. The CRM application will send a new file which contains the updated/deleted/inserted customer information.

  The file name is daily_customer.tsv.

    1. daily_customer.tsv contains customerid, cdc_field, country, birth_month, activation_date, etc.

The cdc_field can be New-Customer, Customer-is-Updated, or Customer-is-Deleted.

  2. I need to read the contents of daily_customer.tsv.
  3. I will add a current-timestamp column to the data.
  4. I will transfer daily_customer.tsv to the S3 bucket: customer.daily.data
  5. I need to merge the two buckets customer.history.data and customer.daily.data (see the sketch after this list).
    1. Both buckets have timestamp fields, so I need to keep only the records with the latest timestamp per customer.
    2. I can use row_number() over (partition by customerid order by timestamp_field desc) as version_number.
    3. Then I can write the records whose version_number is 1 to the final bucket: customer.dimension.data.
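A minimal sketch of the merge step in Spark, assuming both buckets hold TSV files with the schema above plus a load_ts timestamp column (the column name, the paths, and the choice to drop Customer-is-Deleted rows are assumptions for illustration):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.{col, row_number}

  val spark = SparkSession.builder().appName("CustomerMerge").getOrCreate()

  def readTsv(path: String) = spark.read
    .option("sep", "\t").option("header", "true").csv(path)

  // History (initial load) and daily CDC data; schemas are assumed to match
  // once the cdc_field column is removed from the daily data.
  val history = readTsv("s3a://customer.history.data/customer/")
  val daily = readTsv("s3a://customer.daily.data/customer/")
    .filter(col("cdc_field") =!= "Customer-is-Deleted") // assumed handling of deletions
    .drop("cdc_field")

  // Keep only the latest record per customer (version_number = 1).
  val w = Window.partitionBy("customerid").orderBy(col("load_ts").desc)
  val latest = history.unionByName(daily)
    .withColumn("version_number", row_number().over(w))
    .filter(col("version_number") === 1)
    .drop("version_number")

  latest.write.mode("overwrite")
    .option("sep", "\t").option("header", "true")
    .csv("s3a://customer.dimension.data/customer/")

Note that with plain TSV files Spark still has to read both buckets in full to compute the window; a columnar, partitioned format such as Parquet would reduce how much data is pulled from S3.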

 

I am running Spark on premises.

  • Can I query AWS S3 buckets using Spark SQL / DataFrames or RDDs on a local Spark cluster?
  • Is this approach efficient? Will the queries transfer all of the historical data from AWS S3 to the local cluster?
  • How can I implement this scenario in a more effective way? For example, transferring only the daily data to AWS S3 and then running the queries on AWS.
    • For instance, Athena can run queries on AWS, but it is just a query engine. As far as I know, I cannot call it via an SDK and I cannot write the results to a bucket/folder.

 

Thanks in advance,

Ömer

Re: Triggering SQL on AWS S3 via Apache Spark

Divya Gehlot
Hi Omer,
Here are a couple of solutions you can implement for your use case:
Option 1: mount the S3 bucket as a local file system.
Option 2: use AWS Glue for your use case.
Option 3: store the file in the local file system and push it to the S3 bucket later.

Thanks,
Divya


Re: Triggering SQL on AWS S3 via Apache Spark

Jörn Franke
Why not access the S3 files directly from Spark?


You need to configure the IAM roles so that the machine running the Spark code is allowed to access the bucket.
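A minimal sketch of wiring a local (on-premises) Spark cluster to S3 through the S3A connector, assuming the hadoop-aws package matching your Hadoop version is on the classpath (e.g. via spark-submit --packages). On EC2 the instance's IAM role is picked up automatically; on premises you typically supply credentials, shown here via environment variables (variable names and paths are assumptions):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder()
    .appName("S3AccessFromLocalCluster")
    // Static credentials for an on-premises cluster; on EC2, prefer the instance profile.
    .config("spark.hadoop.fs.s3a.access.key", sys.env("AWS_ACCESS_KEY_ID"))
    .config("spark.hadoop.fs.s3a.secret.key", sys.env("AWS_SECRET_ACCESS_KEY"))
    .getOrCreate()

  // Query S3 data directly with Spark SQL from the local cluster. With TSV/CSV,
  // the files still have to be downloaded to the cluster to be parsed.
  val daily = spark.read
    .option("sep", "\t").option("header", "true")
    .csv("s3a://customer.daily.data/customer/")
  daily.createOrReplaceTempView("daily_customer")
  spark.sql("SELECT count(*) FROM daily_customer WHERE cdc_field = 'New-Customer'").show()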


Re: Triggering SQL on AWS S3 via Apache Spark

Gourav Sengupta
In reply to this post by Omer.Ozsakarya
It is interesting that you asked the questions and then (almost) answered them yourself as well.

Regards,
Gourav


Re: Triggering SQL on AWS S3 via Apache Spark

Omer.Ozsakarya

Thank you very much 😊

 


Re: Triggering SQL on AWS S3 via Apache Spark

Gourav Sengupta
Also try reading about SCDs (slowly changing dimensions); Hive may be a very good alternative as well for running updates on data.

Regards,
Gourav


Re: Triggering SQL on AWS S3 via Apache Spark

Omer.Ozsakarya

Thank you Gourav,

Today I saw this session: https://databricks.com/session/apache-spark-in-cloud-and-hybrid-why-security-and-governance-become-more-important

It also seems interesting. I was in a meeting; I will watch it as well.

 


Re: Triggering SQL on AWS S3 via Apache Spark

Gourav Sengupta
I do not think security and governance have only now become important; they always were. Hortonworks and Cloudera have fantastic security implementations, which is also why I mentioned updates via Hive.

Regards,
Gourav
