[spark-ml] How to write a Spark Application correctly?


Pola Yao
Hello Spark Community,

I have a dataset of size 20 GB with 20 columns. Every column is categorical, so I applied a StringIndexer followed by a OneHotEncoder to each column. Afterwards, I applied a VectorAssembler to all the newly derived columns to form a feature vector for each record, and then fed the feature vectors to an ML algorithm.
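
For reference, here is roughly what my pipeline looks like (the column names and the DataFrame df below are placeholders for the real ones):

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Placeholder names; the real dataset has 20 categorical columns.
val catCols = Array("col1", "col2", "col3")

// One StringIndexer + OneHotEncoder pair per categorical column.
val indexAndEncode: Array[PipelineStage] = catCols.flatMap { c =>
  Array[PipelineStage](
    new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"),
    new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec")
  )
}

// Assemble all encoded columns into a single feature vector.
val assembler = new VectorAssembler()
  .setInputCols(catCols.map(c => s"${c}_vec"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(indexAndEncode :+ assembler)
val featurized = pipeline.fit(df).transform(df) // df is the 20 GB input DataFrame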

However, during the feature-engineering steps I observed in the Spark UI that the input size (on the Executors tab) grew dramatically to more than 600 GB. The cluster I am using does not have that much memory. Are there any ways to optimize the memory usage of these intermediate results?

Thanks.