I'm running a computation on top of a big dynamic model that constantly
receives changes / online updates. I therefore figured that batch mode
(stateless), which would require shipping the heavy model to Spark on every
run, would be less appropriate than streaming mode.
So I moved the computation to streaming mode, with the model living inside
Spark and receiving the live updates it needs.
However, the main motivation for using Spark was complexity reduction /
compute speedup via a distributed algorithm. State management is evidently a
challenging task in terms of memory resources, and I don't want it to
overwhelm or disrupt the computation.
My main question is: as effective as Spark may be computation-wise (with a
suitable distributed algorithm), how much control does the user have over its
memory footprint?
E.g.: is splitting that memory between the workers feasible, or does all the
memory eventually end up centralized on the Spark master (or the driver,
depending on deploy mode: client or cluster)?
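For context, the memory knobs I'm aware of are the per-process settings passed
at submit time (the sizes below are just placeholders for illustration). What
I can't tell from the docs is whether these let state genuinely live spread
across executors, or whether it still collects on the driver:

```shell
# Per-process memory sizing at submit time (example values, not recommendations):
# driver memory is one JVM; executor memory is per worker JVM, so total
# executor-side capacity scales out with the number of executors.
spark-submit \
  --driver-memory 4g \
  --executor-memory 8g \
  --conf spark.executor.instances=10 \
  my_streaming_job.py
```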
I'm basically looking for a way to scale out memory-wise, not just
compute-wise. Does transforming a centralized data structure into RDDs (which
I'm already using for compute) relieve/distribute the memory footprint as
well?
For example: can I split my main in-memory data structure into a set of
partitions/shards, one assigned to each worker?
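To make the idea concrete, here is a plain-Python sketch (no Spark, all names
hypothetical) of the key-based sharding I'm hoping RDD partitioning gives me:
the model's entries get hash-partitioned so each worker would hold only its
own shard, and no single node holds the whole structure.

```python
# Sketch: hash-partition one big in-memory dict into per-worker shards,
# mimicking what I imagine a key-based RDD partitioner would do.
def shard(data, num_workers):
    """Split a dict into num_workers smaller dicts by hashing each key."""
    shards = [dict() for _ in range(num_workers)]
    for key, value in data.items():
        # Same key always lands on the same shard, so updates route consistently.
        shards[hash(key) % num_workers][key] = value
    return shards

# Hypothetical model: 100 parameters as a flat key -> value mapping.
model = {f"param_{i}": i * 0.5 for i in range(100)}
shards = shard(model, 4)
```

The question is whether Spark supports this kind of layout natively, so that
an update stream keyed the same way only ever touches the worker owning the
relevant shard.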
thanks a lot,