regexp_extract regex for extracting the columns from string


anbutech
Hi All,

I have the following information in the data column:

<1000> date=2020-08-01 time=20:50:04 name=processing id=123 session=new
packt=20 orgin=null address=null dest=fgjglgl

I want to create a separate column for each of the key-value pairs that
follow the integer <1000>. The pairs are separated by spaces.
Is there any way to achieve this with the built-in regexp_extract function,
or with other built-in functions? I don't want to use a UDF.


Thanks



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: regexp_extract regex for extracting the columns from string

Patrick McCarthy-2
Can you simply do a string split on space, and then another on '='?
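A minimal sketch of that split-based idea, using only built-in Spark SQL functions (no UDF). It assumes Spark 2.4 or later for `filter`, `transform` and `map_from_entries`; the local SparkSession, the column name `data` and the names `withMap`, `keys` and `result` are illustrative assumptions, not something from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("kv-split-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  "<1000> date=2020-08-01 time=20:50:04 name=processing id=123 session=new " +
  "packt=20 orgin=null address=null dest=fgjglgl"
).toDF("data")

// 1. split the string on spaces
// 2. keep only tokens containing '=' (this drops the leading "<1000>")
// 3. split each token on '=' and collect the pairs into a map
val withMap = df.withColumn(
  "kv",
  expr("""
    map_from_entries(
      transform(
        filter(split(data, ' '), t -> instr(t, '=') > 0),
        t -> named_struct('key', split(t, '=')[0], 'value', split(t, '=')[1])
      )
    )
  """)
)

// Key names as spelled in the sample data (note "orgin").
val keys = Seq("date", "time", "name", "id", "session", "packt", "orgin", "address", "dest")

// One output column per key, looked up in the map.
val result = withMap.select(keys.map(k => col("kv")(k).as(k)): _*)
result.show(false)
```

Two caveats with this sketch: a value that itself contains '=' would be cut at the second '=', and the values `null` in the sample come back as the string "null", not as SQL NULL.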


--
Patrick McCarthy
Senior Data Scientist, Machine Learning Engineering
Dstillery
470 Park Ave South, 17th Floor, NYC 10016


Re: regexp_extract regex for extracting the columns from string

Enrico Minack
In reply to this post by anbutech
You can remove the <1000> first and then turn the string into a map
(interpret the string as key-values). From that map you can access each
key and turn it into a separate column:

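// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import org.apache.spark.sql.functions.{expr, regexp_replace}
import spark.implicits._

Seq("<1000> date=2020-08-01 time=20:50:04 name=processing id=123 " +
    "session=new packt=20 orgin=null address=null dest=fgjglgl")
   .toDF("string")
   // strip the leading "<1000> " token
   .withColumn("key-values", regexp_replace($"string", "^[^ ]+ ", ""))
   // pairs are separated by ' ', keys and values by '='
   .withColumn("map", expr("str_to_map(`key-values`, ' ', '=')"))
   .select(
     $"map"("date").as("date"),
     $"map"("time").as("time"),
     $"map"("name").as("name"),
     $"map"("id").as("id"),
     $"map"("session").as("session"),
     $"map"("packt").as("packt"),
     // the sample data spells this key "orgin"
     $"map"("orgin").as("origin"),
     $"map"("address").as("address"),
     $"map"("dest").as("dest")
   )
   .show(false)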
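Since the question asked specifically about regexp_extract, here is a hedged sketch of that variant: one regexp_extract call per key, each capturing the run of non-space characters after `key=`. The local SparkSession, the column name `data` and the names `keys` and `extracted` are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("kv-regexp-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  "<1000> date=2020-08-01 time=20:50:04 name=processing id=123 session=new " +
  "packt=20 orgin=null address=null dest=fgjglgl"
).toDF("data")

// Key names as spelled in the sample data (note "orgin").
val keys = Seq("date", "time", "name", "id", "session", "packt", "orgin", "address", "dest")

// "(?:^| )" anchors the key at the start of a token, so "id=" does not
// match inside a longer key such as a hypothetical "uuid=".
// Group 1 is everything after '=' up to the next space.
val extracted = df.select(
  keys.map(k => regexp_extract(col("data"), s"(?:^| )$k=([^ ]*)", 1).as(k)): _*
)
extracted.show(false)
```

Unlike the map lookup, regexp_extract returns an empty string (not NULL) when a key is missing from the row, and the key list has to be known up front.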

Enrico


On 09.08.20 at 18:00, anbutech wrote:

> Hi All,
>
> I have the following information in the data column:
>
> <1000> date=2020-08-01 time=20:50:04 name=processing id=123 session=new
> packt=20 orgin=null address=null dest=fgjglgl
>
> I want to create a separate column for each of the key-value pairs that
> follow the integer <1000>. The pairs are separated by spaces.
> Is there any way to achieve this with the built-in regexp_extract function,
> or with other built-in functions? I don't want to use a UDF.