[spark-ml] How to write a Spark Application correctly?
Hello Spark Community,
I have a dataset of size 20G with 20 columns. Every column is categorical, so I applied a StringIndexer and a OneHotEncoder to each column. I then applied a VectorAssembler to all the newly derived columns to build a feature vector for each record, and fed the feature vectors into an ML algorithm.
However, during the feature-engineering steps I observed in the Spark UI (the Executors tab) that the input size grew dramatically, to 600G+. The cluster I use does not have that much memory. Are there any ways to optimize the memory usage of the intermediate results?