I am new to spark. I am running a hdfs file system on a remote cluster
whereas my spark workers are on another cluster. When my textFile RDD gets
executed, does spark worker read from the file according to hdfs partitions
task by task, or do they read it once when the blockmanager sets after the
start of first task and distributes it among the memory of spark cluster?
I have this question because I have a situation where, when I have only one
worker executing a job it shows less run time per task (shown in history
server) then when I have two workers executing the same job in parallel.
Even though the total duration is almost the same.
I am running a simple grep application and no shuffles within the cluster.
Text file is on a remote hdfs cluster and is of 813MB distributed into 7
chunks of 128MB, last chunk is left over size.
You said your hdfs cluster and spark cluster is running on different
cluster.This is not a good idea,because you should consider data
locality.Your spark node need config hdfs client configuration.
Spark Job is composed of stages，each stage have one or more
partitions。Parallelism of job decided by these partitions.
Shuffle process is decided by your operator，like
reduceByKey、repartition、sortBy and so on.
I am running a model where the workers should not have the data stored in
them. They are only for execution purpose. The other cluster (its just a
single node) which I am receiving data from is just acting as a file server,
for which I could have used any other way like nfs or ftp. So I went with
hdfs so that it would not have to worry about partitioning of data and also
it does not effect my experiment. So I just had this question that does
spark worker read all the data before computation once its first task start,
and then distribute it among the workers memory or do they read it chunk by
chunk, by each worker and then store the end result in memory to send the
First, Spark worker not have the ability to compute.In fact,executor is
responsible for computation.
Executor running tasks is distributed by driver.
Each Task just read some section of data in normal, but the stage have only
IF your operators not contains the operator that will pull middle result
from each task, like collect or show，driver will not store any data.
Each Executor not store the end result in memory by default, unless your
operator contains the operator that cache data to memory, like cache or