Apache ML Algorithm Solution

Apache ML Algorithm Solution

SRITHALAM, ANUPAMA (Risk Value Stream)

Classification: Limited


Hi Team,

We are trying to use a gradient boosting classification algorithm. In Python we used the Sklearn library, and in PySpark we are using the ML library.

We have a training dataset of around 45k. Training takes around 3 to 4 hours in Python, but more than 18 hours in PySpark with the same hyperparameters.

We tried repartitioning the DataFrame in PySpark and saw a small improvement, but we are still nowhere near the Python timings.

Our live run needs to generate predictions for more than 40 million records, and the data resides in Hadoop. It is therefore difficult to move that much data to a different system, convert it to a Pandas DataFrame, and score it in Python.

So we are trying to train the same model in PySpark, so that we can run the predictions against the trained model there. Our concern is that the training time is very high, and we would like to know what the general approach is in this kind of scenario.

Thanks,

Anupama.

Lloyds Banking Group plc. Registered Office: The Mound, Edinburgh EH1 1YZ. Registered in Scotland no. SC95000. Telephone: 0131 225 4555.

Lloyds Bank plc. Registered Office: 25 Gresham Street, London EC2V 7HN. Registered in England and Wales no. 2065. Telephone 0207626 1500.

Bank of Scotland plc. Registered Office: The Mound, Edinburgh EH1 1YZ. Registered in Scotland no. SC327000. Telephone: 03457 801 801.

Lloyds Bank Corporate Markets plc. Registered office: 25 Gresham Street, London EC2V 7HN. Registered in England and Wales no. 10399850.

Scottish Widows Schroder Personal Wealth Limited. Registered Office: 25 Gresham Street, London EC2V 7HN. Registered in England and Wales no. 11722983.

Lloyds Bank plc, Bank of Scotland plc and Lloyds Bank Corporate Markets plc are authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and Prudential Regulation Authority.

Scottish Widows Schroder Personal Wealth Limited is authorised and regulated by the Financial Conduct Authority.

Lloyds Bank Corporate Markets Wertpapierhandelsbank GmbH is a wholly-owned subsidiary of Lloyds Bank Corporate Markets plc. Lloyds Bank Corporate Markets Wertpapierhandelsbank GmbH has its registered office at Thurn-und-Taxis Platz 6, 60313 Frankfurt, Germany. The company is registered with the Amtsgericht Frankfurt am Main, HRB 111650. Lloyds Bank Corporate Markets Wertpapierhandelsbank GmbH is supervised by the Bundesanstalt für Finanzdienstleistungsaufsicht.

Halifax is a division of Bank of Scotland plc.

HBOS plc. Registered Office: The Mound, Edinburgh EH1 1YZ. Registered in Scotland no. SC218813.


This e-mail (including any attachments) is private and confidential and may contain privileged material. If you have received this e-mail in error, please notify the sender and delete it (including any attachments) immediately. You must not copy, distribute, disclose or use any of the information in it or any attachments. Telephone calls may be monitored or recorded.

Re: Apache ML Algorithm Solution

Adi Polak
Hi Anupama,

A couple of questions:
-  Where are you running your PySpark application? How many executors do you have available, and how many of them does the job use?
-  What is the data format, and what is the actual size in MB/GB/PB?
-  Did you see any failures in the Spark History Server?


As a distributed computing engine, Apache Spark has the advantage when the computation needs to be spread over more than one machine.
The Sklearn library, on the other hand, has no distributed support and runs on a single machine.

You can also run PySpark on one machine, and it performs better when it is configured to work in parallel.
To configure the SparkSession:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
The '[*]' tells Spark to use all the cores available on the machine as local threads. Plain 'local' uses one thread, 'local[2]' uses two threads, and so on.
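
To make the master-string convention concrete, here is a minimal sketch in plain Python. It is not Spark's own parser: `local_thread_count` is a hypothetical helper written only for illustration, and it handles just the three forms mentioned above ('local', 'local[N]' and 'local[*]').

```python
import os
import re


def local_thread_count(master: str) -> int:
    """Return the number of worker threads implied by a local master string.

    Hypothetical helper for illustration only; it mirrors the documented
    meaning of 'local', 'local[N]' and 'local[*]' and nothing else.
    """
    if master == "local":
        # Plain 'local' means a single worker thread.
        return 1
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match is None:
        raise ValueError(f"not a local master string: {master!r}")
    value = match.group(1)
    if value == "*":
        # '*' means one thread per logical core on this machine.
        return os.cpu_count() or 1
    return int(value)


for master in ("local", "local[2]", "local[*]"):
    print(master, "->", local_thread_count(master))
```

The value printed for 'local[*]' depends on the machine the sketch runs on.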


BTW, Sklearn can also be configured to run in parallel on one machine.
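
As one illustration of single-machine parallelism in Sklearn, the sketch below runs a small hyperparameter search with `n_jobs=-1`. The synthetic dataset and the parameter grid are assumptions chosen only to keep the example short; they are not the data or settings from the original question. Note that the parallelism here comes from `GridSearchCV`, since `GradientBoostingClassifier` itself does not take an `n_jobs` argument.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset standing in for the real training data (assumption).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Boosting stages are built one after another, so a single fit is sequential.
# The search around it can still fit its candidate models in parallel.
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 3]}
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=3,
    n_jobs=-1,  # use all available cores for the candidate fits
)
search.fit(X, y)

print(search.best_params_)
```

How much this helps depends on the number of candidate fits; with a single parameter combination and no cross-validation there is nothing to run in parallel.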

Adi Polak

Re: Apache ML Algorithm Solution

srowen
In reply to this post by SRITHALAM, ANUPAMA (Risk Value Stream)

Re: Apache ML Algorithm Solution

Mich Talebzadeh
LOL, yes indeed. The previous one came from Lloyds Banking Group and this one from TCS, which provides services to the same company.

Just slightly different wording, but the same content.



 

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.

 


