Regexp_extract not giving correct output


Regexp_extract not giving correct output

Sachit Murarka
Hi All,

I am using PySpark to extract a value from a column based on a regex.

Following is the regex which I am using:
(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)

from pyspark.sql import functions as F

df = spark.createDataFrame([("[1234] [3333] [4444] [66]",), ("abcd",)], ["stringValue"])

result = df.withColumn('extracted value', F.regexp_extract(F.col('stringValue'), '(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)', 1))

I have tried spark.sql as well; it also gives empty output.

I have tested this regex and it works fine on an online regex tester, but it does not work in Spark. I know Spark needs Java-style regexes, so I also tried escaping the backslashes, which gave this exception:
: java.util.regex.PatternSyntaxException: Unknown inline modifier near index 21
(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)


Can you please help here?

Kind Regards,
Sachit Murarka
Re: Regexp_extract not giving correct output

srowen
As in Java/Scala, in Python you'll need to escape the backslashes with \\. "\[" means just "[" in a string. I think you could also prefix the string literal with 'r' to disable Python's handling of escapes.
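[Editor's note: the escaping behavior Sean describes can be checked in plain Python, without Spark. A minimal sketch; note that Python happens to keep unrecognized escapes like "\[" and "\s" as-is, which is why an r-prefix alone would not change what reaches the regex engine here:]

```python
# How backslashes in Python string literals reach a regex engine.
# "\[" and "\s" are not recognized Python escapes, so they survive
# unchanged (with a warning in recent Python versions). A raw string
# or doubled backslashes express the same pattern explicitly and
# are always safe.
plain = "\\[OrderID:\\s"      # explicit: each backslash escaped
raw = r"\[OrderID:\s"         # raw string: backslashes taken literally
assert plain == raw           # identical pattern text either way
```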

Re: Regexp_extract not giving correct output

Sachit Murarka
Hi Sean,

Thanks for quick response!

I have tried the 'r' string-literal prefix as well; that also gave an empty result:
spark.sql(r"select regexp_extract('[11] [22] [33]','(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)',1) as anyid").show()

And as I mentioned, when I use two backslashes it throws the following exception:
  : java.util.regex.PatternSyntaxException: Unknown inline modifier near index 21
(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)

Kind Regards,
Sachit Murarka
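[Editor's note: the reason online testers accept this pattern is that PCRE and Python's re module both support conditional groups like (?(1)...), while java.util.regex, which Spark uses, does not. A quick sketch with Python's own re, using the original pattern verbatim:]

```python
import re

# The original pattern, containing a conditional group (?(1)...).
# Python's re (like PCRE) supports this construct, so it compiles
# and matches here -- java.util.regex rejects the same pattern with
# "Unknown inline modifier".
pat = r"(^\[OrderID:\s)?(?(1).*\]\s\[UniqueID:\s([a-z0-9A-Z]*)\].*|\[.*\]\s\[([a-z0-9A-Z]*)\].*)"

m = re.search(pat, "[1234] [3333] [4444] [66]")
assert m is not None            # compiles and matches in Python
assert m.group(3) == "66"       # the "no OrderID prefix" branch fired
```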


Re: Regexp_extract not giving correct output

srowen
This means there is something wrong with your regex vs. what Java supports. Did you mean "(?:" rather than "(?" around where the error points? Java's regex engine has no conditional groups like "(?(1)...)", which PCRE and Python's re do support; that is what "Unknown inline modifier" is complaining about. This is not related to Spark.
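[Editor's note: one possible Java-compatible rewrite, a sketch assuming the intent is "capture the UniqueID value when the line starts with [OrderID: ..., otherwise capture the second bracketed token". Both cases funnel into a single capture group behind a non-capturing alternation (?:...|...), so no conditional group is needed. Shown with Python's re, but the pattern uses only constructs java.util.regex also accepts:]

```python
import re

# Hypothetical rewrite of the conditional pattern: a non-capturing
# alternation chooses the prefix, and one shared group captures the
# value in either case.
pattern = r"^(?:\[OrderID:\s.*?\]\s\[UniqueID:\s|\[.*?\]\s\[)([a-zA-Z0-9]*)\]"

# "OrderID" form: capture the UniqueID value.
m1 = re.search(pattern, "[OrderID: 1234] [UniqueID: ab12] [x]")
assert m1 and m1.group(1) == "ab12"

# Plain form: capture the second bracketed token.
m2 = re.search(pattern, "[1234] [3333] [4444] [66]")
assert m2 and m2.group(1) == "3333"

# No brackets at all: no match, so regexp_extract would return "".
assert re.search(pattern, "abcd") is None
```

Because the same group index works for both branches, this could be used directly as F.regexp_extract(F.col('stringValue'), r'...', 1) with an r-prefixed string in PySpark, returning the captured value or an empty string.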
