Fwd: Train multiple machine learning models in parallel
I have a 1 TB dataset containing records for 50 users. Each user has 20 GB of data on average.
I want to use Spark to train a machine learning model (e.g., an XGBoost tree model) for each user. Ideally, the result would be 50 models. However, submitting 50 separate Spark jobs through 'spark-submit' would be impractical.
The model parameters and feature engineering steps would be exactly the same for every user's data, so I am wondering whether there is a way to train these 50 models in parallel.
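
To make the question concrete, here is a minimal Python sketch of the pattern I have in mind: one driver program that starts the per-user trainings from a thread pool, relying on the fact that Spark can run jobs submitted from different threads of the same application concurrently. Everything named here is an assumption for illustration: the user ids, the `user_id` column mentioned in the comments, the parameter values, and `train_for_user`, which is only a stand-in that returns a dictionary instead of fitting a real model.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical user identifiers; the real ones would come from the dataset.
USER_IDS = [f"user_{i:02d}" for i in range(50)]

# Shared hyperparameters, identical for every user (placeholder values).
PARAMS = {"max_depth": 6, "eta": 0.1}


def train_for_user(user_id, params):
    """Stand-in for the real per-user training step.

    In an actual Spark job this would filter the shared DataFrame to one
    user (assuming a `user_id` column), apply the common feature
    engineering, and fit a model on the result. Here it only records
    which user and parameters it was called with.
    """
    return {"user_id": user_id, "params": dict(params)}


def train_all(user_ids, params, max_concurrent=4):
    # Each worker thread would trigger its own Spark job from the same
    # driver; max_concurrent caps how many trainings run at once.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        futures = {uid: pool.submit(train_for_user, uid, params)
                   for uid in user_ids}
        # Collect one result per user, keyed by user id.
        return {uid: fut.result() for uid, fut in futures.items()}


models = train_all(USER_IDS, PARAMS)
print(len(models))
```

I am not sure whether this is a reasonable approach at 20 GB per user, or whether there is a more idiomatic way to express it in Spark.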