Modularising Spark/Scala program


Modularising Spark/Scala program

Mich Talebzadeh

Hi,

I have a Spark/Scala program, built and compiled with Maven, which works fine. It basically does the following:

  1. Reads an XML file from an HDFS location
  2. Creates a DF on top of what it reads
  3. Creates a new DF with some columns renamed, etc.
  4. Creates a new DF for the rejected rows (those with an incorrect value in a column)
  5. Puts the rejected data into a Hive exception table
  6. Puts the valid rows into the main Hive table
  7. Nullifies the invalid rows by setting the invalid column to NULL and puts those rows into the main Hive table
These are currently performed in one method. Ideally I want to break this down as follows:

  1. A method that reads the XML file, creates a DF, and creates the new DF on top of the previous one
  2. A method that creates a DF on top of the rejected rows, using a tmp table
  3. A method that puts the invalid rows into the exception table, using a tmp table
  4. A method that puts the correct rows into the main table, again using a tmp table
I was wondering if this is the correct approach? A rough sketch of what I have in mind is below.
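For illustration only, a minimal sketch of that split (the spark-xml options, table names, column names, and the single-bad-column validation are placeholders, not my actual code):

import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object XmlPipeline {

  // 1. Read the XML file from HDFS and build the renamed DF on top of the raw DF
  def readAndRename(spark: SparkSession, path: String): DataFrame = {
    val raw = spark.read
      .format("com.databricks.spark.xml") // assumes the spark-xml package is on the classpath
      .option("rowTag", "row")            // placeholder row tag
      .load(path)
    raw.withColumnRenamed("old_name", "new_name") // placeholder renames
  }

  // 2. Build the rejected-rows DF from a validity predicate
  def rejectedRows(df: DataFrame, isValid: Column): DataFrame =
    df.filter(!isValid)

  // 3. Put the invalid rows into the Hive exception table via a tmp table (temp view)
  def writeExceptions(rejected: DataFrame, exceptionTable: String): Unit = {
    rejected.createOrReplaceTempView("tmp_rejected")
    rejected.sparkSession.sql(s"INSERT INTO $exceptionTable SELECT * FROM tmp_rejected")
  }

  // 4. Put the valid rows, plus the invalid rows with the bad column set to NULL,
  //    into the main Hive table, again via a temp view
  def writeMain(df: DataFrame, isValid: Column, badCol: String, mainTable: String): Unit = {
    val cleaned = df.withColumn(badCol, when(isValid, col(badCol))) // NULL when invalid
    cleaned.createOrReplaceTempView("tmp_main")
    cleaned.sparkSession.sql(s"INSERT INTO $mainTable SELECT * FROM tmp_main")
  }
}

Each method would then do one thing, and the driver would just wire them together.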

Thanks,


Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.


Re: Modularising Spark/Scala program

Stephen Boesch
Hi Mich!
   I think you can combine the good/rejected handling into one method (sketched below) that internally:
  • Creates the good and rejected DFs, given an input DF and the rules/predicates to apply to it
  • Creates a third DF containing the good rows plus the rejected rows with the bad columns nulled out
  • Appends/inserts the two DFs into their respective Hive good/exception tables
  • Returns a tuple of (goodDf, exceptionsDf, combinedDf), or maybe just (combinedDf, exceptionsDf)
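A minimal sketch of such a method, assuming a single bad column, a Column predicate for the rules, and insertInto as the write path (all names here are illustrative, not a definitive implementation):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

// Split the input on a validity predicate, persist both sides, return the DFs
def splitAndPersist(input: DataFrame,
                    isValid: Column,
                    badCol: String,
                    goodTable: String,
                    exceptionTable: String): (DataFrame, DataFrame) = {
  val exceptionsDf = input.filter(!isValid)

  // good rows plus the rejected rows with the bad column nulled out
  val combinedDf = input.withColumn(badCol, when(isValid, col(badCol)))

  combinedDf.write.mode("append").insertInto(goodTable)
  exceptionsDf.write.mode("append").insertInto(exceptionTable)

  (combinedDf, exceptionsDf)
}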

On Sat, 2 May 2020 at 06:00, Mich Talebzadeh <[hidden email]> wrote:


 


Re: Modularising Spark/Scala program

Stephen Boesch
I neglected to include the rationale: the assumption is that this will be a repeatedly needed process, so a reusable method would be helpful. The predicates/input rules that are supported will need to be flexible enough to cover the range of input data domains and use cases. For my workflows the predicates are typically SQL expressions.
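For illustration, a SQL-string rule could be turned into the Column predicate the sketched method takes via Spark's expr (the rule text and table names are made-up examples):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.expr

// Turn a SQL rule string into a Column predicate
def predicateFromSql(rule: String): Column = expr(rule)

// e.g. splitAndPersist(inputDf, predicateFromSql("amount >= 0 AND trade_date IS NOT NULL"),
//      "amount", "main_table", "exception_table")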

On Sat, 2 May 2020 at 06:13, Stephen Boesch <[hidden email]> wrote: