Use more memory to get better performance, or Spark will keep spilling
data to disk, which is much slower.
You could also give the Python workers more memory by setting
spark.python.worker.memory=1g or 2g.
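For example, a minimal sketch (the 2g value is illustrative):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.python.worker.memory", "2g")
    sc = SparkContext(conf=conf)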
> sc = SparkContext(conf = conf)
> file = sc.textFile("file:///home/xzhang/data/soc-LiveJournal1.txt")
> records = file.flatMap(lambda line: Undirect(line)).reduceByKey(lambda a, b: a + "\t" + b)
a + "\t" + b will be very slow, if the number of values is large,
groupByKey() will be better than it.
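For example, an untested sketch reusing your Undirect():

    records = file.flatMap(lambda line: Undirect(line)) \
                  .groupByKey() \
                  .mapValues(lambda vals: "\t".join(vals))

The mapValues() step only rebuilds the tab-separated string; if you can
work with the grouped values directly, you can drop it.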
Re: spark is running extremely slow with larger data set, like 2G
Thank you very much.
Changing to groupByKey works; it runs much faster.
By the way, could you give me some explanation of the following
configurations? After reading the official explanation I'm still confused:
what's the relationship between them? Is there any memory overlap between
them?
> Thank you very much.
> Changing to groupByKey works; it runs much faster.
> By the way, could you give me some explanation of the following
> configurations? After reading the official explanation I'm still confused:
> what's the relationship between them? Is there any memory overlap between
> them?
spark.driver.memory is the memory for the JVM that runs together with your
local Python script (called the driver).
spark.executor.memory is the memory for each JVM in the Spark cluster
(the executors on the slave nodes).
In local mode, the driver and executor share the same JVM, so only
spark.driver.memory is used.
spark.python.worker.memory is the memory for each Python worker inside an
executor.
Because of the GIL, PySpark uses multiple Python processes in each
executor, one for each task.
spark.python.worker.memory tells a Python worker when to spill data to
disk. It's not a hard limit, so the memory used by a Python worker may
end up a little higher than that.
If you have enough memory in the executors, increasing
spark.python.worker.memory will let the Python workers use more memory
during shuffles (like groupBy()), which will increase performance.
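Putting it together, a sketch with illustrative values (note that
spark.driver.memory only takes effect if set before the driver JVM starts,
e.g. via spark-submit --driver-memory, so set the other two in the script):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.executor.memory", "4g")
            .set("spark.python.worker.memory", "2g"))
    sc = SparkContext(conf=conf)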