Split column with dynamic data

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Split column with dynamic data

Aakash Basu-2
Hi all,

I've a requirement to split a column and fetch only the description where I have numbers appended before that for some rows whereas other rows have only the description -

Eg - (Description is the column header)

Description
Inventory Tree
Products
1. AT&T Services
2. Accessories
4. Miscellaneous
5. Service Center Items
3. 3rd Party Services
Integrated Service
6. Demo Devices
IT Department
Merchandising
2A. Impulse
2. Handsets
3. Strategic Products
1. Opportunities
5. Features
6. Rate Plans
7. Other AT&T Incentives
8. Wired 
4. Rate Plan Tier Change
Integration SKUs (March 2016)
9. Financing SKUs
1. Smartphone


From the above, I only want the words of description and remove the numbers. But since they're abrupt, it is difficult to use a single logic to have it done.

Trying with the following options -

1) Using split by fullstop, together -

split_col = split(CategoryInp_DF['description'], '.')

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))
CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))


Result -> But both the output columns come as nulls.

2) Using split by fullstop and space, together -

split_col = split(CategoryInp_DF['description'], '. ')

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))
CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))

Result -> This is perfectly working for the numbered rows, but is unnecessarily splitting the non-numbered rows too.

Any help would be greatly appreciated.


Thanks,
Aakash.
Reply | Threaded
Open this post in threaded view
|

RE: Split column with dynamic data

Hondros, Constantine (ELS-AMS)

You should just use regexp_replace to remove all the leading number information (assuming it ends with a full-stop, and catering for the possibility of a capital letter).

 

This is untested, but it shoud do the trick based on your examples so far:

 

df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+A-Z?\.”, “”))

 

 

From: Aakash Basu [mailto:[hidden email]]
Sent: 30 October 2017 18:53
To: user
Subject: Split column with dynamic data

 

*** External email: use caution ***

 

Hi all,

 

I've a requirement to split a column and fetch only the description where I have numbers appended before that for some rows whereas other rows have only the description -

 

Eg - (Description is the column header)

 

Description

Inventory Tree

Products

1. AT&T Services

2. Accessories

4. Miscellaneous

5. Service Center Items

3. 3rd Party Services

Integrated Service

6. Demo Devices

IT Department

Merchandising

2A. Impulse

2. Handsets

3. Strategic Products

1. Opportunities

5. Features

6. Rate Plans

7. Other AT&T Incentives

8. Wired 

4. Rate Plan Tier Change

Integration SKUs (March 2016)

9. Financing SKUs

1. Smartphone

 


From the above, I only want the words of description and remove the numbers. But since they're abrupt, it is difficult to use a single logic to have it done.

 

Trying with the following options -

 

1) Using split by fullstop, together -

 

split_col = split(CategoryInp_DF['description'], '.')

 

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))

CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))

 

 

Result -> But both the output columns come as nulls.

 

2) Using split by fullstop and space, together -

split_col = split(CategoryInp_DF['description'], '. ')

 

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))

CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))

 

Result -> This is perfectly working for the numbered rows, but is unnecessarily splitting the non-numbered rows too.

 

Any help would be greatly appreciated.

 

 

Thanks,

Aakash.



Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.

Reply | Threaded
Open this post in threaded view
|

RE: Split column with dynamic data

Aakash Basu-2
Hey buddy,


Thanks a TON! Issue resolved.

Thanks again,
Aakash.

On 30-Oct-2017 11:44 PM, "Hondros, Constantine (ELS-AMS)" <[hidden email]> wrote:

You should just use regexp_replace to remove all the leading number information (assuming it ends with a full-stop, and catering for the possibility of a capital letter).

 

This is untested, but it shoud do the trick based on your examples so far:

 

df.withColumn(“new_column”, regexp_replace($”Description”, “^\d+A-Z?\.”, “”))

 

 

From: Aakash Basu [mailto:[hidden email]]
Sent: 30 October 2017 18:53
To: user
Subject: Split column with dynamic data

 

*** External email: use caution ***

 

Hi all,

 

I've a requirement to split a column and fetch only the description where I have numbers appended before that for some rows whereas other rows have only the description -

 

Eg - (Description is the column header)

 

Description

Inventory Tree

Products

1. AT&T Services

2. Accessories

4. Miscellaneous

5. Service Center Items

3. 3rd Party Services

Integrated Service

6. Demo Devices

IT Department

Merchandising

2A. Impulse

2. Handsets

3. Strategic Products

1. Opportunities

5. Features

6. Rate Plans

7. Other AT&T Incentives

8. Wired 

4. Rate Plan Tier Change

Integration SKUs (March 2016)

9. Financing SKUs

1. Smartphone

 


From the above, I only want the words of description and remove the numbers. But since they're abrupt, it is difficult to use a single logic to have it done.

 

Trying with the following options -

 

1) Using split by fullstop, together -

 

split_col = split(CategoryInp_DF['description'], '.')

 

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))

CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))

 

 

Result -> But both the output columns come as nulls.

 

2) Using split by fullstop and space, together -

split_col = split(CategoryInp_DF['description'], '. ')

 

CategoryInp_DF= CategoryInp_DF.withColumn('unneeded', split_col.getItem(0))

CategoryInp_DF= CategoryInp_DF.withColumn('description', split_col.getItem(1))

 

Result -> This is perfectly working for the numbered rows, but is unnecessarily splitting the non-numbered rows too.

 

Any help would be greatly appreciated.

 

 

Thanks,

Aakash.



Elsevier B.V. Registered Office: Radarweg 29, 1043 NX Amsterdam, The Netherlands, Registration No. 33156677, Registered in The Netherlands.