Newbie question on how to extract column value

James Starks
I am very new to Spark. I have just successfully set up Spark SQL to connect to a PostgreSQL database, and I can display a table with this code:

    sparkSession.sql("SELECT id, url from table_a where col_b <> '' ").show()

Now I want to perform filter and map operations on the url value. In plain Scala it would be something like:

    Seq((1, "http://a.com/a"), (2, "http://b.com/b"), (3, "unknown")).filter { case (_, url) => isValid(url) }.map { case (id, url)  => (id, pathOf(url)) }

where filter removes invalid URLs, and map turns each (id, url) into (id, path of url).

However, when I apply this concept to Spark SQL with the code snippet

    sparkSession.sql("...").filter(isValid($"url"))

the compiler complains about a type mismatch, because $"url" has type ColumnName rather than String. How can I extract the column value (i.e. the http://... string) of the url column in order to apply my filter function?
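To make the mismatch concrete, a minimal sketch (isValid here is an assumed stand-in for my real predicate, and the id column type is a guess):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Assumed predicate over a plain String.
def isValid(url: String): Boolean = url.startsWith("http")

val df = spark.sql("SELECT id, url FROM table_a WHERE col_b <> ''")

// Does not compile: isValid expects a String, but $"url" is a Column.
// df.filter(isValid($"url"))

// Compiles: on a typed Dataset each row is a plain Scala value.
df.as[(Long, String)].filter { case (_, url) => isValid(url) }.show()
```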

Thanks

Java 1.8.0
Scala 2.11.8
Spark 2.1.0






Re: Newbie question on how to extract column value

Gourav Sengupta
Hi James,

It is always advisable to use the latest Spark version. That said, could you please give DataFrames and a UDF a try, if possible? I think that would be a much more scalable way to address the issue.

Also, where possible, it is always advisable to apply the filter before fetching the data into Spark.
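A minimal sketch of that suggestion, wrapping the plain Scala predicate in a UDF so it can be applied to a Column (isValid and the table/column names are carried over from the question, not verified here):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().appName("udf-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// Assumed predicate, standing in for the real validation logic.
def isValid(url: String): Boolean =
  url.startsWith("http://") || url.startsWith("https://")

// udf(...) lifts the String => Boolean function so it accepts a Column.
val isValidUdf = udf(isValid _)

spark.sql("SELECT id, url FROM table_a WHERE col_b <> ''")
  .filter(isValidUdf($"url"))
  .show()
```

With the UDF, $"url" stays a Column all the way through, so the type mismatch from the original post does not arise.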


Thanks and Regards,
Gourav

On Tue, Aug 7, 2018 at 4:09 PM, James Starks <[hidden email]> wrote:


Re: Newbie question on how to extract column value

James Starks
Because of some legacy issues I can't immediately upgrade the Spark version. But I tried filtering the data before loading it into Spark, as suggested:

    val df = sparkSession.read.format("jdbc").option(...).option("dbtable", "(select .. from ... where url <> '') table_name")....load()
    df.createOrReplaceTempView("new_table")

Then performing the custom operation does the trick:

    sparkSession.sql("select id, url from new_table").as[(String, String)].map { case (id, url) =>
       val derived_data = ... // operation on url
       (id, derived_data)
    }.show()
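The elided operation on url could, for instance, extract the path with java.net.URI; pathOf below is a hypothetical helper matching the name used earlier in the thread:

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper: the path component of a URL, or "" if parsing fails.
def pathOf(url: String): String =
  Try(new URI(url)).toOption.flatMap(u => Option(u.getPath)).getOrElse("")

sparkSession.sql("select id, url from new_table").as[(String, String)].map { case (id, url) =>
  (id, pathOf(url))
}.show()
```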

Thanks for the advice, it's really helpful!

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On August 7, 2018 5:33 PM, Gourav Sengupta <[hidden email]> wrote:
