Data growth vs Cluster Size planning

Aakash Basu-2
Hi,

I ran a dataset of 200 columns and 0.2M records on a cluster of 1 master (18 GB) and 2 slaves (32 GB and 16 cores each). A very large ML tuning job (training) took around 772 minutes.

Now I need to run the same operation on 3M records. Any idea how we should proceed? Should we scale vertically or horizontally? How should this problem be approached in a stepwise, systematic manner?

Thanks in advance.

Regards,
Aakash.

Re: Data growth vs Cluster Size planning

Phillip Henry
There is too little information here to give an answer, if an a priori answer is possible at all.

However, I would do the following on your test instances:

- Run jstat -gc on all your nodes. It might be that the GC is taking a lot of time.

- Poll with jstack semi-frequently. It can give you a fairly good idea of where in the code the time is being spent, in a non-invasive manner.
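
As a rough illustration of the two checks above, here is a minimal Python sketch. It assumes `jstat` and `jstack` from the JDK are on the PATH of each node and that you already know the PID of the executor JVM. The function names, the sample count, and the polling interval are hypothetical choices for illustration, not part of any tool. The first function reads one sample of `jstat -gc -t <pid>` and divides the cumulative GC time (the `GCT` column) by the JVM uptime (the `Timestamp` column). The second one tallies the top stack frame of every RUNNABLE thread across repeated `jstack` dumps, which is a crude form of sampling profiling.

```python
import subprocess
import time
from collections import Counter
from typing import List


def gc_time_fraction(jstat_output: str) -> float:
    """Fraction of JVM uptime spent in GC, from one `jstat -gc -t <pid>` sample."""
    header, values = jstat_output.strip().splitlines()[:2]
    row = dict(zip(header.split(), values.split()))
    # Timestamp: seconds since JVM start; GCT: cumulative GC time in seconds.
    return float(row["GCT"]) / float(row["Timestamp"])


def top_runnable_frames(jstack_output: str) -> List[str]:
    """Return the topmost stack frame of each RUNNABLE thread in one dump."""
    frames = []
    lines = jstack_output.splitlines()
    for i, line in enumerate(lines[:-1]):
        if "java.lang.Thread.State: RUNNABLE" in line:
            nxt = lines[i + 1].strip()
            if nxt.startswith("at "):
                frames.append(nxt[3:])
    return frames


def sample_gc(pid: int) -> float:
    """Run jstat once against the given JVM and return its GC time fraction."""
    out = subprocess.run(["jstat", "-gc", "-t", str(pid)],
                         capture_output=True, text=True, check=True).stdout
    return gc_time_fraction(out)


def poll_jstack(pid: int, samples: int = 20, interval_s: float = 5.0) -> Counter:
    """Take several thread dumps and count how often each top frame appears."""
    counts = Counter()
    for _ in range(samples):
        dump = subprocess.run(["jstack", str(pid)],
                              capture_output=True, text=True, check=True).stdout
        counts.update(top_runnable_frames(dump))
        time.sleep(interval_s)
    return counts
```

Calling `sample_gc(pid)` on each node gives a quick indication of whether GC dominates the run time, and `poll_jstack(pid).most_common(10)` lists the frames seen most often. Frames that show up in most samples are likely where the job spends its time, although idle threads blocked in native I/O are also reported as RUNNABLE and should be read with care.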

Phillip



On Mon, Feb 11, 2019 at 9:48 AM Aakash Basu <[hidden email]> wrote:
Hi,

I ran a dataset of 200 columns and 0.2M records on a cluster of 1 master (18 GB) and 2 slaves (32 GB and 16 cores each). A very large ML tuning job (training) took around 772 minutes.

Now I need to run the same operation on 3M records. Any idea how we should proceed? Should we scale vertically or horizontally? How should this problem be approached in a stepwise, systematic manner?

Thanks in advance.

Regards,
Aakash.