Using UDF based on Numpy functions in Spark SQL


Using UDF based on Numpy functions in Spark SQL

Mich Talebzadeh

Hi,


This is a shot in the dark so to speak.


I would like to use the standard deviation function std offered by numpy in PySpark. I am using SQL for now.


The code is as below:


  sqltext = f"""
  SELECT
          rs.Customer_ID
        , rs.Number_of_orders
        , rs.Total_customer_amount
        , rs.Average_order
        , rs.Standard_deviation
  FROM
  (
        SELECT cust_id AS Customer_ID,
        COUNT(amount_sold) AS Number_of_orders,
        SUM(amount_sold) AS Total_customer_amount,
        AVG(amount_sold) AS Average_order,
        STDDEV(amount_sold) AS Standard_deviation
        FROM {DB}.{table}
        GROUP BY cust_id
        HAVING SUM(amount_sold) > 94000
        AND AVG(amount_sold) < STDDEV(amount_sold)
  ) rs
  ORDER BY
          3 DESC
  """
  spark.sql(sqltext)


Now if I wanted to use a UDF based on the numpy std function, I can do:

import numpy as np
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
udf = UserDefinedFunction(np.std, DoubleType())

How can I use that UDF with Spark SQL? I gather this is only possible through functional programming?
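
As a rough illustrative sketch (not from the original post), registering such a function under a SQL-visible name could look like the following; the name np_std is made up here, and a scalar UDF of this kind is applied to one value per row, not to a whole group, so it would not replace STDDEV in the GROUP BY query above:

    import numpy as np
    from pyspark.sql.types import DoubleType

    # Hypothetical name "np_std": register a *scalar* UDF with the existing
    # SparkSession so it can be referenced by name inside spark.sql(...).
    spark.udf.register("np_std", lambda x: float(np.std(x)), DoubleType())

    # e.g. spark.sql("SELECT np_std(amount_sold) FROM some_table")
    # This runs per row, not as an aggregate over a group.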

Thanks,

Mich




LinkedIn  https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

 



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


Re: Using UDF based on Numpy functions in Spark SQL

Peyman Mohajerian

On Wed, Dec 23, 2020 at 12:52 PM Mich Talebzadeh <[hidden email]> wrote:



Re: Using UDF based on Numpy functions in Spark SQL

Mich Talebzadeh
OK, thanks for the tip.

I found this link useful for Python from Databricks









Re: Using UDF based on Numpy functions in Spark SQL

srowen
In reply to this post by Mich Talebzadeh
Why do you want to use this function instead of the built-in stddev function?



Re: Using UDF based on Numpy functions in Spark SQL

Mich Talebzadeh

Well, the truth is that we had this discussion back in 2016 :(. What Hive calls the Standard Deviation function, STDDEV, is a pointer to STDDEV_POP. This is incorrect and has not been rectified yet!


Spark SQL, Oracle and Sybase point STDDEV to STDDEV_SAMP and not STDDEV_POP. Run a test on Hive:


SELECT
        SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
        STDDEV(amount_sold) AS STDDEV,
        STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
        STDDEV_POP(amount_sold) AS STDDEV_POP
from    sales;



It comes back with

+---------------------+--------------------+--------------------+--------------------+
|      mystddev       |       stddev       |    stddev_samp     |     stddev_pop     |
+---------------------+--------------------+--------------------+--------------------+
| 458.59339942758265  | 458.5931211084447  | 458.5933504051772  | 458.5931211084447  |
+---------------------+--------------------+--------------------+--------------------+

spark-sql comes back with


spark-sql> SELECT
         >         SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS MYSTDDEV,
         >         STDDEV(amount_sold) AS STDDEV,
         >         STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
         >         STDDEV_POP(amount_sold) AS STDDEV_POP
         > from    sales;

+---------------------+--------------------+--------------------+---------------------+
|      mystddev       |       stddev       |    stddev_samp     |     stddev_pop      |
+---------------------+--------------------+--------------------+---------------------+
| 458.59339942758265  | 458.5933504051778  | 458.5933504051778  | 458.59312110844525  |
+---------------------+--------------------+--------------------+---------------------+


I just wanted to see what numpy would come back with.
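
A rough way to make that comparison (a sketch only, assuming the sales table is small enough to collect back to the driver; spark is the existing SparkSession):

    import numpy as np

    # Pull the column to the driver and compare numpy's two variants
    # against the SQL results above.
    amounts = np.array(
        [row.amount_sold for row in spark.sql("SELECT amount_sold FROM sales").collect()],
        dtype=float,
    )
    print(np.std(amounts, ddof=0))  # population stddev, should line up with STDDEV_POP
    print(np.std(amounts, ddof=1))  # sample stddev, should line up with STDDEV_SAMP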

Thanks







Re: Using UDF based on Numpy functions in Spark SQL

srowen
I don't know which one is 'correct' (it's not standard SQL?) or whether it's the sample stdev for a good reason or just historical now. But you can always call STDDEV_SAMP (in any DB) if needed. It's equivalent to numpy.std with ddof=1, the Bessel-corrected standard deviation.
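
To make the ddof point concrete, a tiny numpy-only illustration (made-up numbers):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0])
    print(np.std(x, ddof=0))  # divides by n      -> population stddev, like STDDEV_POP
    print(np.std(x, ddof=1))  # divides by n - 1  -> sample (Bessel-corrected) stddev, like STDDEV_SAMP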



Re: Using UDF based on Numpy functions in Spark SQL

Mich Talebzadeh
Thanks for the feedback.

I have a question here. I want to use the numpy std as well, but just using SQL in PySpark, like below:

  sqltext = f"""

  SELECT
          rs.Customer_ID
        , rs.Number_of_orders
        , rs.Total_customer_amount
        , rs.Average_order
        , rs.Standard_deviation
        , rs.mystddev
  FROM
  (
           SELECT cust_id AS Customer_ID
        ,  COUNT(amount_sold) AS Number_of_orders
        ,  SUM(amount_sold) AS Total_customer_amount
        ,  AVG(amount_sold) AS Average_order
        ,  STDDEV(amount_sold) AS Standard_deviation
        ,  SQRT((SUM(POWER(AMOUNT_SOLD,2))-(COUNT(1)*POWER(AVG(AMOUNT_SOLD),2)))/(COUNT(1)-1)) AS mystddev
           FROM {DB}.{table}
           GROUP BY cust_id
           HAVING SUM(amount_sold) > 94000
           AND AVG(amount_sold) < STDDEV(amount_sold)
  ) rs
  ORDER BY
          3 DESC
  """
  spark.sql(sqltext)

The mystddev column is the way I work out the standard deviation myself. I want to add the numpy std to this SQL as well, but I don't think it is possible here. Can a UDF help? Or is it only possible with FP?
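
For what it's worth, a sketch of the FP/DataFrame route using a grouped-aggregate pandas UDF that wraps np.std (this assumes Spark 2.4+ with PyArrow installed; DB, table and the column names are the ones from the SQL above, and np_std is just an illustrative name):

    import numpy as np
    from pyspark.sql import functions as F
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # Grouped-aggregate pandas UDF: v arrives as a pandas Series holding all
    # amount_sold values of one cust_id group; one double is returned per group.
    @pandas_udf("double", PandasUDFType.GROUPED_AGG)
    def np_std(v):
        return float(np.std(v, ddof=1))  # ddof=1 to match STDDEV_SAMP

    df = spark.table(f"{DB}.{table}")
    result = (
        df.groupBy("cust_id")
          .agg(
              F.count("amount_sold").alias("Number_of_orders"),
              F.sum("amount_sold").alias("Total_customer_amount"),
              F.avg("amount_sold").alias("Average_order"),
              F.stddev("amount_sold").alias("Standard_deviation"),
              np_std(df["amount_sold"]).alias("numpy_stddev"),
          )
          .where("Total_customer_amount > 94000 AND Average_order < Standard_deviation")
          .orderBy(F.desc("Total_customer_amount"))
    )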

thanks




Re: Using UDF based on Numpy functions in Spark SQL

srowen
Why not just use STDDEV_SAMP? It's probably more accurate than the difference-of-squares calculation.
You can write an aggregate UDF that calls numpy and register it for SQL, but it is already a built-in.
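
A sketch of that route, assuming a Spark version (roughly 2.4+, with PyArrow) that allows grouped-aggregate pandas UDFs to be registered for SQL; the name np_std and the sales table below just mirror the earlier examples:

    import numpy as np
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("double", PandasUDFType.GROUPED_AGG)
    def np_std(v):
        # v is a pandas Series with all values of the group
        return float(np.std(v, ddof=1))

    # Register under a SQL-visible name, then use it in a GROUP BY query.
    spark.udf.register("np_std", np_std)
    spark.sql("SELECT cust_id, np_std(amount_sold) AS numpy_stddev FROM sales GROUP BY cust_id")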



Re: Using UDF based on Numpy functions in Spark SQL

Mich Talebzadeh
Well, I gave up on using anything except the standard one offered by PySpark itself. The problem is that anything home-made (a UDF) is never going to be as performant as the functions offered by Spark itself. What I don't understand is why a numpy-provided STDDEV should be more performant than the native one. Are they claiming the numpy packages are more optimised?




