Hi,
This is a shot in the dark, so to speak.
I would like to use the standard deviation function (std) offered by numpy in PySpark. I am using Spark SQL for now.
The code is as below:
sqltext = f""" SELECT rs.Customer_ID , rs.Number_of_orders , rs.Total_customer_amount , rs.Average_order , rs.Standard_deviation FROM ( SELECT cust_id AS Customer_ID, COUNT(amount_sold) AS Number_of_orders, SUM(amount_sold) AS Total_customer_amount, AVG(amount_sold) AS Average_order, STDDEV(amount_sold) AS Standard_deviation FROM {DB}.{table} GROUP BY cust_id HAVING SUM(amount_sold) > 94000 AND AVG(amount_sold) < STDDEV(amount_sold) ) rs ORDER BY 3 DESC """ spark.sql(sqltext)
Now if I wanted to use a UDF based on the numpy std function, I can do:
import numpy as np
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType

udf = UserDefinedFunction(np.std, DoubleType())
How can I use that UDF with Spark SQL? I gather this is only possible through functional programming?
Thanks,
Mich
LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.
OK, thanks for the tip.
I found this link from Databricks useful for Python.

On Wed, 23 Dec 2020 at 21:31, Peyman Mohajerian < [hidden email]> wrote:
Why do you want to use this function instead of the built-in stddev function?
Well, the truth is that we had this discussion back in 2016 :(. What Hive calls the standard deviation function, STDDEV, is a pointer to STDDEV_POP. This is incorrect and has not been rectified yet!
Spark SQL, Oracle and Sybase point STDDEV to STDDEV_SAMP, not STDDEV_POP. Run a test on Hive:
SELECT
    SQRT((SUM(POWER(AMOUNT_SOLD,2)) - (COUNT(1)*POWER(AVG(AMOUNT_SOLD),2))) / (COUNT(1)-1)) AS MYSTDDEV,
    STDDEV(amount_sold)      AS STDDEV,
    STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
    STDDEV_POP(amount_sold)  AS STDDEV_POP
FROM sales;
It comes back with
+--------------------+--------------------+--------------------+--------------------+
|      mystddev      |       stddev       |    stddev_samp     |     stddev_pop     |
+--------------------+--------------------+--------------------+--------------------+
| 458.59339942758265 | 458.5931211084447  | 458.5933504051772  | 458.5931211084447  |
+--------------------+--------------------+--------------------+--------------------+
spark-sql comes back with
spark-sql> SELECT
         >     SQRT((SUM(POWER(AMOUNT_SOLD,2)) - (COUNT(1)*POWER(AVG(AMOUNT_SOLD),2))) / (COUNT(1)-1)) AS MYSTDDEV,
         >     STDDEV(amount_sold)      AS STDDEV,
         >     STDDEV_SAMP(amount_sold) AS STDDEV_SAMP,
         >     STDDEV_POP(amount_sold)  AS STDDEV_POP
         > FROM sales;
+--------------------+--------------------+--------------------+--------------------+
|      mystddev      |       stddev       |    stddev_samp     |     stddev_pop     |
+--------------------+--------------------+--------------------+--------------------+
| 458.59339942758265 | 458.5933504051778  | 458.5933504051778  | 458.59312110844525 |
+--------------------+--------------------+--------------------+--------------------+
I just wanted to see what numpy would come back with.
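For what it's worth, numpy exposes both conventions through its ddof argument. A minimal sketch with made-up values (not the sales data):

import numpy as np

amounts = np.array([120.5, 98.0, 250.75, 310.2, 87.6])   # illustrative values only

print(np.std(amounts))           # population std dev (ddof=0), what Hive's STDDEV/STDDEV_POP returns
print(np.std(amounts, ddof=1))   # sample std dev (ddof=1), what Spark's STDDEV/STDDEV_SAMP returns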
Thanks
I don't know which one is 'correct' (it's not standard SQL?) or whether it's the sample stdev for a good reason or just historical now. But you can always call STDDEV_SAMP (in any DB) if needed. It's equivalent to numpy.std with ddof=1, the Bessel-corrected standard deviation.
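As a quick sanity check, the equivalence is easy to see in plain numpy; a small sketch with arbitrary numbers, not the sales data:

import numpy as np

x = np.array([120.5, 98.0, 250.75, 310.2, 87.6])   # arbitrary sample
n = len(x)

# the differences-of-squares formula used as MYSTDDEV in the SQL above
mystddev = np.sqrt((np.sum(x**2) - n * np.mean(x)**2) / (n - 1))

# numpy's Bessel-corrected sample standard deviation
sample_std = np.std(x, ddof=1)

print(mystddev, sample_std)   # agree up to floating-point rounding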
Thanks for the feedback.
I have a question here. I want to use the numpy std as well, but just using SQL in PySpark, like below:
SELECT rs.Customer_ID
     , rs.Number_of_orders
     , rs.Total_customer_amount
     , rs.Average_order
     , rs.Standard_deviation
     , rs.mystddev
FROM
( SELECT cust_id             AS Customer_ID
       , COUNT(amount_sold)  AS Number_of_orders
       , SUM(amount_sold)    AS Total_customer_amount
       , AVG(amount_sold)    AS Average_order
       , STDDEV(amount_sold) AS Standard_deviation
       , SQRT((SUM(POWER(AMOUNT_SOLD,2)) - (COUNT(1)*POWER(AVG(AMOUNT_SOLD),2))) / (COUNT(1)-1)) AS mystddev
  FROM {DB}.{table}
  GROUP BY cust_id
  HAVING SUM(amount_sold) > 94000
     AND AVG(amount_sold) < STDDEV(amount_sold)
) rs
ORDER BY 3 DESC
The last column (mystddev) is the way I work out the standard deviation myself. I want to add the numpy std to this SQL as well, but I don't think it is possible here. Can a UDF help? Or is it only possible with FP?
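For illustration, this is roughly what I have in mind, assuming a numpy-backed UDF can be registered for SQL (an untested sketch; numpy_std is just a placeholder name):

import numpy as np
from pyspark.sql.types import DoubleType

# register a numpy-backed UDF so it can be called from SQL
spark.udf.register("numpy_std",
                   lambda xs: float(np.std(np.array(xs, dtype="float64"), ddof=1)),  # ddof=1 to match STDDEV_SAMP
                   DoubleType())

sqltext = f"""
SELECT cust_id                              AS Customer_ID
     , COUNT(amount_sold)                   AS Number_of_orders
     , SUM(amount_sold)                     AS Total_customer_amount
     , AVG(amount_sold)                     AS Average_order
     , STDDEV(amount_sold)                  AS Standard_deviation
     , numpy_std(collect_list(amount_sold)) AS numpy_stddev
FROM {DB}.{table}
GROUP BY cust_id
HAVING SUM(amount_sold) > 94000
   AND AVG(amount_sold) < STDDEV(amount_sold)
ORDER BY 3 DESC
"""
spark.sql(sqltext)

Note that collect_list materialises every amount for a customer in a single row, so this only makes sense when the groups are modest in size.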
thanks
Why not just use STDDEV_SAMP? It's probably more accurate than the differences-of-squares calculation. You can write an aggregate UDF that calls numpy and register it for SQL, but it is already a built-in.
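A rough, untested sketch of what such an aggregate UDF might look like (assuming Spark 2.4+ with PyArrow available; np_stddev is just an illustrative name):

import numpy as np
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("double", PandasUDFType.GROUPED_AGG)
def np_stddev(v):
    # v is a pandas Series holding all amount_sold values for one group
    return float(np.std(v, ddof=1))

# make it callable from SQL
spark.udf.register("np_stddev", np_stddev)

spark.sql(f"""
    SELECT cust_id, np_stddev(amount_sold) AS numpy_stddev
    FROM {DB}.{table}
    GROUP BY cust_id
""").show()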
Well, I gave up on using anything except the standard one offered by PySpark itself. The problem is that anything homemade (a UDF) is never going to be as performant as the functions offered by Spark itself. What I don't understand is why a numpy-provided STDDEV should be more performant than the native one. Are they claiming the numpy packages are more optimised?