DropNa in Spark for Columns

3 messages
DropNa in Spark for Columns

Chetan Khatri
Hi Users, 

What is equivalent of df.dropna(axis='columns') of Pandas in the Spark/Scala?

Thanks

Re: DropNa in Spark for Columns

Vitali Lupusor
Hello Chetan,

I don’t know about Scala, but in PySpark there is no elegant way of dropping NAs on the column axis.

Here is a possible solution to your problem:

>>> data = [(None, 1, 2), (0, None, 2), (0, 1, 2)]
>>> columns = ('A', 'B', 'C')
>>> df = spark.createDataFrame(data, columns)
>>> df.show()
+----+----+---+
|   A|   B|  C|
+----+----+---+
|null|   1|  2|
|   0|null|  2|
|   0|   1|  2|
+----+----+---+
>>> for column in df.columns:
...     if df.select(column).where(df[column].isNull()).first():
...         df = df.drop(column)
...
>>> df.show()
+---+
|  C|
+---+
|  2|
|  2|
|  2|
+---+
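The drop-any-null rule the loop implements can be prototyped in plain Python on the same sample rows (a local sketch, not Spark code) to check which columns should survive:

```python
# Local sketch of the "drop every column containing a null" rule,
# using the same sample data as above. This is plain Python, not Spark.
data = [(None, 1, 2), (0, None, 2), (0, 1, 2)]
columns = ('A', 'B', 'C')

# A column is kept only if none of its values is None.
keep = [c for i, c in enumerate(columns)
        if all(row[i] is not None for row in data)]
print(keep)  # ['C']
```

Note that the Spark loop above launches one job per column; on wide dataframes it is usually cheaper to compute all the null counts in a single aggregation and drop the offending columns in one pass.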

If your dataframe fits in memory, I suggest you bring it into Pandas.

>>> df_pd = df.toPandas()
>>> df_pd
     A    B  C
0  NaN  1.0  2
1  0.0  NaN  2
2  0.0  1.0  2
>>> df_pd = df_pd.dropna(axis='columns')
>>> df_pd
   C
0  2
1  2
2  2

Which you then can bring back into Spark:

>>> df = spark.createDataFrame(df_pd)
>>> df.show()
+---+
|  C|
+---+
|  2|
|  2|
|  2|
+---+

Hope that helps.

Regards,
V

On 27 Feb 2021, at 05:25, Chetan Khatri <[hidden email]> wrote:


Re: DropNa in Spark for Columns

Peyman Mohajerian
I don't have personal experience with Koalas, but it does seem to have the same API:
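For reference, this is the pandas behaviour that Koalas mirrors (a minimal pandas sketch of `dropna(axis='columns')`, assuming pandas is installed; the Koalas version is the same call on a Koalas DataFrame):

```python
import pandas as pd

# Same sample data as in the thread; None becomes NaN in pandas.
df = pd.DataFrame({'A': [None, 0, 0],
                   'B': [1, None, 1],
                   'C': [2, 2, 2]})

# Drop every column that contains at least one missing value.
df = df.dropna(axis='columns')
print(list(df.columns))  # ['C']
```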

On Fri, Feb 26, 2021 at 11:46 PM Vitali Lupusor <[hidden email]> wrote: