

Hi,
I am not sure the Spark forum is the correct avenue for this question.
I am using PySpark with matplotlib to get the best fit for data using the Lorentzian model. This curve uses 2010-2020 data points (11 on the x-axis). I need to predict the prices for years 2021-2025 based on this fit. Not sure if someone can advise me? If OK, I can post the details.
Thanks
LinkedIn https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction
of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from such
loss, damage or destruction.


If your data set is 11 points, surely this is not a distributed problem? Or are you asking how to build tens of thousands of those projections in parallel?
On Tue, Jan 5, 2021 at 6:04 AM Mich Talebzadeh < [hidden email]> wrote:


Thanks, Sean.
This is the gist of the case:
I have data points for the x-axis from 2010 till 2020 and values for the y-axis. I am using PySpark, pandas and matplotlib. Data is read into PySpark from the underlying database and a pandas DataFrame is built on it. Data is aggregated over each year. The underlying prices are provided on a monthly basis in a CSV file which has been loaded into a Hive table:

```python
summary_df = spark.sql(f"""SELECT cast(Year as int) as year,
                                  AVGFlatPricePerYear,
                                  AVGTerracedPricePerYear,
                                  AVGSemiDetachedPricePerYear,
                                  AVGDetachedPricePerYear
                           FROM {v.DSDB}.yearlyhouseprices""")
df_10 = summary_df.filter(col("year").between(f'{start_date}', f'{end_date}'))
p_dfm = df_10.toPandas()  # convert the Spark DataFrame to a pandas DataFrame

for i in range(n):
    if p_dfm.columns[i] != 'year':  # year is the x-axis (integer)
        vcolumn = p_dfm.columns[i]
        print(vcolumn)
        params = model.guess(p_dfm[vcolumn], x=p_dfm['year'])
        result = model.fit(p_dfm[vcolumn], params, x=p_dfm['year'])
        result.plot_fit()
        if vcolumn == "AVGFlatPricePerYear":
            plt.xlabel("Year", fontdict=v.font)
            plt.ylabel("Flat house prices in millions/GBP", fontdict=v.font)
            plt.title(f"Flat price fluctuations in {regionname} for the past 10 years",
                      fontdict=v.font)
            plt.text(0.35, 0.45, "Best fit based on non-linear Lorentzian model",
                     transform=plt.gca().transAxes, color="grey", fontsize=10)
            print(result.fit_report())
            plt.xlim(left=2009)
            plt.xlim(right=2022)
            plt.show()
            plt.close()
```

So far so good. I get a best-fit plot as shown using the Lorentzian model.
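For readers without the Hive table, the same per-year aggregation can be sketched in plain pandas. The monthly figures below are made up for illustration (they are not the real house prices); only the groupby pattern matters:

```python
import numpy as np
import pandas as pd

# Hypothetical monthly flat prices, standing in for the CSV-backed Hive table
months = pd.date_range("2010-01-01", "2020-12-01", freq="MS")  # 132 month starts
monthly = pd.DataFrame({
    "date": months,
    "flat_price": np.linspace(250_000, 400_000, len(months)),
})

# Aggregate monthly prices into one average per year, as the Spark SQL above does
yearly = (monthly.assign(year=monthly["date"].dt.year)
                 .groupby("year", as_index=False)["flat_price"]
                 .mean())
print(yearly)  # 11 rows, years 2010..2020
```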
I also have the model fit report:
```
[[Model]]
    Model(lorentzian)
[[Fit Statistics]]
    # fitting method   = leastsq
    # function evals   = 25
    # data points      = 11
    # variables        = 3
    chi-square         = 8.4155e+09
    reduced chi-square = 1.0519e+09
    Akaike info crit   = 231.009958
    Bayesian info crit = 232.203644
[[Variables]]
    amplitude: 31107480.0 +/- 1471033.33 (4.73%) (init = 6106104)
    center:    2016.75722 +/- 0.18632315 (0.01%) (init = 2016.5)
    sigma:     8.37428353 +/- 0.45979189 (5.49%) (init = 3.5)
    fwhm:      16.7485671 +/- 0.91958379 (5.49%) == '2.0000000*sigma'
    height:    1182407.88 +/- 15681.8211 (1.33%) == '0.3183099*amplitude/max(2.220446049250313e-16, sigma)'
[[Correlations]] (unreported correlations are < 0.100)
    C(amplitude, sigma)  = 0.977
    C(amplitude, center) = 0.644
    C(center, sigma)     = 0.603
```
Now I need to predict the prices for years 2021-2022 based on this fit. Is there any way I can use some plt functions to provide extrapolated values for 2021 and beyond?
Thanks
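One route that needs no plotting at all: the fit report above fully determines the curve, so you can evaluate it directly at future years. A minimal numpy sketch, assuming lmfit's standard Lorentzian line shape and plugging in the best-fit values from the report (the extrapolation is only as trustworthy as the fit itself):

```python
import numpy as np

def lorentzian(x, amplitude, center, sigma):
    # lmfit's LorentzianModel line shape: (A/pi) * sigma / ((x - center)^2 + sigma^2)
    return (amplitude / np.pi) * sigma / ((x - center) ** 2 + sigma ** 2)

# Best-fit values taken from the fit report above
amplitude, center, sigma = 31107480.0, 2016.75722, 8.37428353

# Sanity check: the value at the center matches the report's derived height (~1182407.88)
peak = lorentzian(center, amplitude, center, sigma)

# Extrapolated average prices for the years beyond the data
future = np.arange(2021, 2026)
predictions = lorentzian(future, amplitude, center, sigma)
print(peak)
print(dict(zip(future.tolist(), predictions.round(0).tolist())))
```

Because a Lorentzian is a peak centred near 2016.8, these extrapolations fall off as you move past the data; whether that is a sensible model of house prices is a separate question.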


You will need to use matplotlib on the driver to plot in any event. If this is a single extrapolation over 11 data points, you can just use Spark to do the aggregation, call .toPandas, and do whatever you want in the Python ecosystem to fit and plot that result.
On Tue, Jan 5, 2021 at 9:18 AM Mich Talebzadeh < [hidden email]> wrote:


Thanks again.
Just to clarify, I want to see the average price for years 2021, 2022, etc. based on the best fit. So, naively, if someone asked what the average price will be in 2022, I should be able to make some prediction.
I can of course crudely extrapolate with pen and paper as shown in the attached figure, but I was wondering if this is possible with anything that matplotlib offers?


You need to fit a curve to those points using your chosen model. It sounds like you want scipy's curve_fit, maybe? matplotlib is for plotting, not curve fitting. But that and the plotting have nothing to do with Spark here. Spark gives you the data as pandas, so you can use all these tools as you like.
On Tue, Jan 5, 2021 at 9:38 AM Mich Talebzadeh < [hidden email]> wrote:
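A minimal sketch of this suggestion: fit the same Lorentzian line shape with scipy.optimize.curve_fit, then "predict" by evaluating the fitted function beyond the data. The yearly figures and starting guesses below are illustrative, not the real house prices:

```python
import numpy as np
from scipy.optimize import curve_fit

def lorentzian(x, amplitude, center, sigma):
    # Same line shape lmfit's LorentzianModel fits
    return (amplitude / np.pi) * sigma / ((x - center) ** 2 + sigma ** 2)

years = np.arange(2010.0, 2021.0)  # 11 yearly x values, 2010..2020
# Illustrative yearly averages: a Lorentzian plus a little noise
rng = np.random.default_rng(0)
prices = lorentzian(years, 3.1e7, 2016.8, 8.4) + rng.normal(0, 500, years.size)

# Fit; p0 mirrors the init values in the thread's fit report
popt, _ = curve_fit(lorentzian, years, prices, p0=(6e6, 2016.5, 3.5))

# Extrapolation is just the fitted function evaluated at future years
future = np.arange(2021.0, 2026.0)
predicted = lorentzian(future, *popt)
print(popt)
print(predicted.round(0))
```

The fitted popt plays the role of lmfit's (amplitude, center, sigma); lmfit is a convenience layer over this same least-squares machinery.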



