FW: Email to Spark Org please

FW: Email to Spark Org please

Williams, David (Risk Value Stream)

Classification: Public

 

Hi Team,

 

We are trying to use the Spark ML Gradient-Boosted Tree classification algorithm (GBTClassifier) and have found its training performance to be very poor.

 

We would like to see whether we can improve the training time, since training currently takes 2 days on a fairly small dataset.

 

Our dataset size is 40000. Number of features used for training is 564.

 

With the same dataset, training in Python with sklearn completes in 3 hours, but with Spark ML Gradient Boosting it takes 2 days.

 

We tried increasing the number of executors, executor cores, driver memory, etc., but saw no improvement.

 

The following are the parameters used for training.

 

from pyspark.ml.classification import GBTClassifier

gbt = GBTClassifier(featuresCol='features', labelCol='bad_flag', predictionCol='prediction', maxDepth=11, maxIter=10000, stepSize=0.01, subsamplingRate=0.5, minInstancesPerNode=110)

 

Any suggestions for tuning this would be really helpful.

 

Many thanks,

Dave Williams

 

Lloyds Banking Group plc. Registered Office: The Mound, Edinburgh EH1 1YZ. Registered in Scotland no. SC95000. Telephone: 0131 225 4555.

Lloyds Bank plc. Registered Office: 25 Gresham Street, London EC2V 7HN. Registered in England and Wales no. 2065. Telephone 0207626 1500.

Bank of Scotland plc. Registered Office: The Mound, Edinburgh EH1 1YZ. Registered in Scotland no. SC327000. Telephone: 03457 801 801.

Lloyds Bank Corporate Markets plc. Registered office: 25 Gresham Street, London EC2V 7HN. Registered in England and Wales no. 10399850.

Scottish Widows Schroder Personal Wealth Limited. Registered Office: 25 Gresham Street, London EC2V 7HN. Registered in England and Wales no. 11722983.

Lloyds Bank plc, Bank of Scotland plc and Lloyds Bank Corporate Markets plc are authorised by the Prudential Regulation Authority and regulated by the Financial Conduct Authority and Prudential Regulation Authority.

Scottish Widows Schroder Personal Wealth Limited is authorised and regulated by the Financial Conduct Authority.

Lloyds Bank Corporate Markets Wertpapierhandelsbank GmbH is a wholly-owned subsidiary of Lloyds Bank Corporate Markets plc. Lloyds Bank Corporate Markets Wertpapierhandelsbank GmbH has its registered office at Thurn-und-Taxis Platz 6, 60313 Frankfurt, Germany. The company is registered with the Amtsgericht Frankfurt am Main, HRB 111650. Lloyds Bank Corporate Markets Wertpapierhandelsbank GmbH is supervised by the Bundesanstalt für Finanzdienstleistungsaufsicht.

Halifax is a division of Bank of Scotland plc.

HBOS plc. Registered Office: The Mound, Edinburgh EH1 1YZ. Registered in Scotland no. SC218813.


This e-mail (including any attachments) is private and confidential and may contain privileged material. If you have received this e-mail in error, please notify the sender and delete it (including any attachments) immediately. You must not copy, distribute, disclose or use any of the information in it or any attachments. Telephone calls may be monitored or recorded.


Re: FW: Email to Spark Org please

srowen
Spark is overkill for this problem; use sklearn.
But I suspect that you are using just 1 partition for such a small data set, and so are getting no parallelism from Spark.
Repartition your input into many more partitions, but it's unlikely to get much faster than in-core sklearn for this task.
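
As a rough sketch of the repartitioning advice (not something posted in the thread), the helper below checks how many partitions a DataFrame has and repartitions only when there are too few. The name ensure_partitions and the target count are hypothetical; the only PySpark calls it relies on are DataFrame.rdd.getNumPartitions() and DataFrame.repartition().

```python
def ensure_partitions(df, target_partitions):
    """Return df spread over at least target_partitions partitions.

    df is expected to be a pyspark.sql.DataFrame; only its
    rdd.getNumPartitions() and repartition() methods are used.
    """
    current = df.rdd.getNumPartitions()
    if current >= target_partitions:
        return df  # already parallel enough; avoid a needless shuffle
    # repartition() triggers a full shuffle and returns a new DataFrame.
    return df.repartition(target_partitions)
```

It would be called just before fitting, for example train_df = ensure_partitions(train_df, 32) followed by gbt.fit(train_df), where 32 is only a placeholder for something on the order of the total number of executor cores.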


RE: FW: Email to Spark Org please

Williams, David (Risk Value Stream)

Classification: Limited

 

Many thanks for your response, Sean.

 

Question: why is Spark overkill for this, and why is sklearn faster, please? It's the same algorithm, right?

 

Thanks again,

Dave Williams

 


Re: FW: Email to Spark Org please

srowen
Simply because the data set is so small. Anything that's operating entirely in memory is faster than something splitting the same data across multiple machines, running multiple processes, and incurring all the overhead of sending the data and results, combining them, etc.

That said, I suspect that you are not using any parallelism in Spark either. You probably have 1 partition, which means at most 1 core is used no matter how many are there. Repartition the data set.
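
To put numbers on the overhead argument, here is a back-of-the-envelope calculation using only figures from the original post (2 days, 3 hours, maxIter=10000). It assumes the sklearn run used the same number of boosting rounds, which the thread does not state, and treats the reported durations as exact. Boosting rounds run one after another, so any fixed per-round cost of coordinating a distributed job is paid 10,000 times.

```python
SECONDS_PER_HOUR = 60 * 60
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

boosting_rounds = 10_000                 # maxIter from the original post
spark_total_s = 2 * SECONDS_PER_DAY      # "2 days" on Spark ML
sklearn_total_s = 3 * SECONDS_PER_HOUR   # "3 hours" in sklearn (rounds assumed equal)

# Average wall-clock time spent on each sequential boosting round.
spark_per_round = spark_total_s / boosting_rounds
sklearn_per_round = sklearn_total_s / boosting_rounds
slowdown = spark_total_s / sklearn_total_s

print(f"Spark:   {spark_per_round:.2f} s per boosting round")
print(f"sklearn: {sklearn_per_round:.2f} s per boosting round")
print(f"Ratio:   {slowdown:.0f}x")
```

Under those assumptions Spark averages about 17 seconds per round against about 1 second for sklearn, a 16x gap on a data set that fits comfortably in one machine's memory.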


RE: FW: Email to Spark Org please

Williams, David (Risk Value Stream)

Classification: Public

 

Thanks again Sean.

 

We did try increasing the number of partitions, but to no avail. Maybe, as you say, it's because the dataset is so small that the overhead is the bottleneck.

 

If we use sklearn in Spark, we have to make some changes to utilize the distributed cluster. If we get that working in distributed mode, will we get benefits similar to Spark ML?

 

Best Regards,

Dave Williams

 


Re: FW: Email to Spark Org please

srowen
Right, could also be the case that the overhead of distributing it is just dominating.
You wouldn't use sklearn with Spark, just use sklearn at this scale.
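
As a sketch of what "just use sklearn" could look like with the settings from the original post, the snippet below translates the GBTClassifier parameters to what I believe are their closest counterparts in sklearn's GradientBoostingClassifier. The thread does not say which sklearn estimator was used, so that choice is an assumption, and the mapping is approximate: the two implementations differ internally (for instance, Spark bins continuous features, controlled by maxBins). The names SPARK_TO_SKLEARN and to_sklearn_params are hypothetical.

```python
# Spark ML GBTClassifier parameter -> closest sklearn GradientBoostingClassifier
# parameter (an approximate correspondence, not an exact equivalence).
SPARK_TO_SKLEARN = {
    "maxDepth": "max_depth",
    "maxIter": "n_estimators",
    "stepSize": "learning_rate",
    "subsamplingRate": "subsample",
    "minInstancesPerNode": "min_samples_leaf",
}


def to_sklearn_params(spark_params):
    """Translate the Spark parameters that have an sklearn counterpart.

    Column-name parameters (featuresCol, labelCol, predictionCol) are
    dropped because sklearn takes arrays rather than named columns.
    """
    return {
        SPARK_TO_SKLEARN[name]: value
        for name, value in spark_params.items()
        if name in SPARK_TO_SKLEARN
    }


# Settings quoted from the original post.
spark_params = {
    "featuresCol": "features", "labelCol": "bad_flag", "predictionCol": "prediction",
    "maxDepth": 11, "maxIter": 10000, "stepSize": 0.01,
    "subsamplingRate": 0.5, "minInstancesPerNode": 110,
}
sklearn_params = to_sklearn_params(spark_params)
print(sklearn_params)
```

The resulting dictionary can be passed straight to the estimator, for example GradientBoostingClassifier(**sklearn_params).fit(X, y), where X and y are the feature matrix and the bad_flag labels as in-memory arrays.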

What you _can_ use Spark for easily in this case is to distribute parameter tuning with something like hyperopt. If you're building hundreds of models, each one can be built with sklearn on a single machine, with Spark driving those builds in parallel as part of a process to tune the hyperparameters.
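
A minimal sketch of that pattern, under stated assumptions: each trial trains an ordinary sklearn model on one machine, and hyperopt's SparkTrials hands the trials out to Spark workers. The synthetic data, the search ranges, max_evals=32 and parallelism=4 are all placeholders chosen for illustration, and tune() needs hyperopt, pyspark and a reachable Spark cluster, so treat it as a starting point rather than a tested recipe.

```python
import math

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Stand-in data (assumption): replace with the real feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)


def objective(params):
    """Train one sklearn model on a single machine; return a loss to minimize."""
    model = GradientBoostingClassifier(
        n_estimators=int(params["n_estimators"]),  # quniform yields floats
        max_depth=int(params["max_depth"]),
        learning_rate=params["learning_rate"],
        subsample=params["subsample"],
        random_state=0,
    )
    auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    return -auc  # hyperopt minimizes, so negate the score


def tune(max_evals=32, parallelism=4):
    """Run the search with trials distributed as Spark tasks."""
    # Imported here so objective() can be tried without hyperopt or Spark.
    from hyperopt import SparkTrials, fmin, hp, tpe

    space = {
        "n_estimators": hp.quniform("n_estimators", 50, 500, 50),
        "max_depth": hp.quniform("max_depth", 2, 8, 1),
        "learning_rate": hp.loguniform(
            "learning_rate", math.log(0.01), math.log(0.3)
        ),
        "subsample": hp.uniform("subsample", 0.5, 1.0),
    }
    trials = SparkTrials(parallelism=parallelism)
    return fmin(fn=objective, space=space, algo=tpe.suggest,
                max_evals=max_evals, trials=trials)
```

Calling tune() returns the best parameter values hyperopt found. The point of the design is that Spark parallelizes across models, not within one model, so the small data set never has to be split across machines.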


RE: FW: Email to Spark Org please

Williams, David (Risk Value Stream)

Classification: Public

 

Many thanks for the info. So you wouldn't use sklearn with Spark for large datasets, but would use it with smaller datasets, with hyperopt building models in parallel for hyperparameter tuning on Spark?

 


Re: FW: Email to Spark Org please

srowen
Yes, that's a great option when the modeling process itself doesn't really need Spark. You can use any old modeling tool you want and get the parallelism in tuning via hyperopt's Spark integration.
