Fwd: array_contains in package org.apache.spark.sql.functions

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Fwd: array_contains in package org.apache.spark.sql.functions

刘崇光

---------- Forwarded message ----------
From: 刘崇光 <[hidden email]>
Date: Thu, Jun 14, 2018 at 11:08 AM
Subject: array_contains in package org.apache.spark.sql.functions
To: [hidden email]


Hello all,

I ran into a use case in project with spark sql and want to share with you some thoughts about the function array_contains.

Say I have a Dataframe containing 2 columns. Column A of type "Array of String" and Column B of type "String". I want to determine if the value of column B is contained in the value of column A, without using a udf of course.
The function array_contains came into my mind naturally:

def array_contains(column: Column, value: Any): Column = withExpr {
ArrayContains(column.expr, Literal(value))
}
However the function takes the column B and does a "Literal" of column B, which yields a runtime exception: RuntimeException("Unsupported literal type " + v.getClass + " " + v).

Then after discussion with my friends, we fund a solution without using udf:
new Column(ArrayContains(df("ColumnA").expr, df("ColumnB").expr)
 
With this solution, I think of empowering a little bit more the function, by doing like this:
def array_contains(column: Column, value: Any): Column = withExpr {
value match {
case c: Column => ArrayContains(column.expr, c.expr)
case _ => ArrayContains(column.expr, Literal(value))
}
}

It does a pattern matching to detect if value is of type Column. If yes, it will use the .expr of the column, otherwise it will work as it used to.

Any suggestion or opinion on the proposition?


Kind regards,
Chongguang LIU

Reply | Threaded
Open this post in threaded view
|

Re: array_contains in package org.apache.spark.sql.functions

Takuya UESHIN
Hi Chongguang,

Thanks for the report!

That makes sense and the proposition should work, or we can add something like `def array_contains(column: Column, value: Column)`.
Maybe other functions, such as `array_position`, `element_at`, are the same situation.

Could you file a JIRA, and submit a PR if possible?
We can have a discussion more about the issue there.

Btw, I guess you can use `expr("array_contains(columnA, columnB)")` as a workaround.

Thanks.


On Thu, Jun 14, 2018 at 2:15 AM, 刘崇光 <[hidden email]> wrote:

---------- Forwarded message ----------
From: 刘崇光 <[hidden email]>
Date: Thu, Jun 14, 2018 at 11:08 AM
Subject: array_contains in package org.apache.spark.sql.functions
To: [hidden email]


Hello all,

I ran into a use case in project with spark sql and want to share with you some thoughts about the function array_contains.

Say I have a Dataframe containing 2 columns. Column A of type "Array of String" and Column B of type "String". I want to determine if the value of column B is contained in the value of column A, without using a udf of course.
The function array_contains came into my mind naturally:

def array_contains(column: Column, value: Any): Column = withExpr {
ArrayContains(column.expr, Literal(value))
}
However the function takes the column B and does a "Literal" of column B, which yields a runtime exception: RuntimeException("Unsupported literal type " + v.getClass + " " + v).

Then after discussion with my friends, we fund a solution without using udf:
new Column(ArrayContains(df("ColumnA").expr, df("ColumnB").expr)
 
With this solution, I think of empowering a little bit more the function, by doing like this:
def array_contains(column: Column, value: Any): Column = withExpr {
value match {
case c: Column => ArrayContains(column.expr, c.expr)
case _ => ArrayContains(column.expr, Literal(value))
}
}

It does a pattern matching to detect if value is of type Column. If yes, it will use the .expr of the column, otherwise it will work as it used to.

Any suggestion or opinion on the proposition?


Kind regards,
Chongguang LIU




--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin
Reply | Threaded
Open this post in threaded view
|

Re: array_contains in package org.apache.spark.sql.functions

刘崇光
Hello Takuya,

Thanks for your message. I will do the JIRA and PR.

Best regards,
Chongguang

On Thu, Jun 14, 2018 at 11:25 PM, Takuya UESHIN <[hidden email]> wrote:
Hi Chongguang,

Thanks for the report!

That makes sense and the proposition should work, or we can add something like `def array_contains(column: Column, value: Column)`.
Maybe other functions, such as `array_position`, `element_at`, are the same situation.

Could you file a JIRA, and submit a PR if possible?
We can have a discussion more about the issue there.

Btw, I guess you can use `expr("array_contains(columnA, columnB)")` as a workaround.

Thanks.


On Thu, Jun 14, 2018 at 2:15 AM, 刘崇光 <[hidden email]> wrote:

---------- Forwarded message ----------
From: 刘崇光 <[hidden email]>
Date: Thu, Jun 14, 2018 at 11:08 AM
Subject: array_contains in package org.apache.spark.sql.functions
To: [hidden email]


Hello all,

I ran into a use case in project with spark sql and want to share with you some thoughts about the function array_contains.

Say I have a Dataframe containing 2 columns. Column A of type "Array of String" and Column B of type "String". I want to determine if the value of column B is contained in the value of column A, without using a udf of course.
The function array_contains came into my mind naturally:

def array_contains(column: Column, value: Any): Column = withExpr {
ArrayContains(column.expr, Literal(value))
}
However the function takes the column B and does a "Literal" of column B, which yields a runtime exception: RuntimeException("Unsupported literal type " + v.getClass + " " + v).

Then after discussion with my friends, we fund a solution without using udf:
new Column(ArrayContains(df("ColumnA").expr, df("ColumnB").expr)
 
With this solution, I think of empowering a little bit more the function, by doing like this:
def array_contains(column: Column, value: Any): Column = withExpr {
value match {
case c: Column => ArrayContains(column.expr, c.expr)
case _ => ArrayContains(column.expr, Literal(value))
}
}

It does a pattern matching to detect if value is of type Column. If yes, it will use the .expr of the column, otherwise it will work as it used to.

Any suggestion or opinion on the proposition?


Kind regards,
Chongguang LIU




--
Takuya UESHIN
Tokyo, Japan

http://twitter.com/ueshin