AccumulatorV2 vs AccumulableParam (V1)

AccumulatorV2 vs AccumulableParam (V1)

Sergey Zhemzhitsky
Hello guys,

I've started migrating my Spark jobs from Accumulators V1 to
AccumulatorV2 and have run into the following issues:

1. LegacyAccumulatorWrapper now requires the result type of the
AccumulableParam to implement equals. Otherwise the AccumulableParam,
automatically wrapped into a LegacyAccumulatorWrapper, fails with an
AssertionError (SPARK-23697 [1]).
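For example, a param along the following lines hits the assertion
(MyCounters is a made-up result type, shown only to illustrate the
missing equals):

import org.apache.spark.AccumulableParam

// an illustrative result type that does NOT override equals, so two
// structurally identical instances only compare equal by reference
class MyCounters(var errors: Long, var warnings: Long) extends Serializable

object MyCountersParam extends AccumulableParam[MyCounters, String] {
  def addAccumulator(acc: MyCounters, level: String): MyCounters = {
    if (level == "error") acc.errors += 1 else acc.warnings += 1
    acc
  }
  def addInPlace(a: MyCounters, b: MyCounters): MyCounters = {
    a.errors += b.errors
    a.warnings += b.warnings
    a
  }
  def zero(initial: MyCounters): MyCounters = new MyCounters(0, 0)
}

// sc.accumulable(new MyCounters(0, 0))(MyCountersParam) wraps the param in
// a LegacyAccumulatorWrapper whose isZero compares the current value with a
// fresh zero(); without equals that check degenerates to reference equality,
// so the "copyAndReset must return a zero value copy" assertion fails.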

2. Existing AccumulatorV2 classes are hard to extend easily and
correctly (SPARK-24154 [2]) because of their "copy" method, which is
called during serialization and usually loses the type information of
descendant classes that don't override "copy" (it's easier to
implement an accumulator from scratch than to override one correctly).
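Here is a sketch of the trap (DedupingAccumulator is a made-up example):

import scala.collection.JavaConverters._
import org.apache.spark.util.CollectionAccumulator

class DedupingAccumulator[T] extends CollectionAccumulator[T] {
  // collect only distinct values
  override def add(v: T): Unit =
    if (!value.contains(v)) super.add(v)

  // Without this override the inherited copy() returns a plain
  // CollectionAccumulator, so the copy shipped to the executors loses
  // the subclass and the deduplicating add() above silently disappears.
  override def copy(): DedupingAccumulator[T] = {
    val acc = new DedupingAccumulator[T]
    value.asScala.foreach(acc.add)
    acc
  }
}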

3. The same instance of AccumulatorV2 cannot be used with the same
SparkContext multiple times (unlike AccumulableParam), failing with
"IllegalStateException: Cannot register an Accumulator twice" even
after the "reset" method has been called. So it's impossible to
unregister an already registered accumulator from user code.
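A condensed example (assuming an active SparkContext named sc):

import org.apache.spark.util.LongAccumulator

val acc = new LongAccumulator
sc.register(acc, "first")
acc.reset()
// IllegalStateException: Cannot register an Accumulator twice,
// because the metadata assigned by the first register() survives reset()
sc.register(acc, "second")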

4. AccumulableParam (V1) implementations are usually more or less
stateless, while AccumulatorV2 implementations are almost always
stateful, leading to (unnecessary?) type checks. For example, a
typical "merge" method of an AccumulatorV2 has to check whether the
incoming accumulator is of an appropriate type, as here [3].
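To illustrate, here is a made-up MaxAccumulator following that pattern:

import org.apache.spark.util.AccumulatorV2

class MaxAccumulator extends AccumulatorV2[Long, Long] {
  private var _max = Long.MinValue

  def isZero: Boolean = _max == Long.MinValue
  def copy(): MaxAccumulator = { val acc = new MaxAccumulator; acc._max = _max; acc }
  def reset(): Unit = _max = Long.MinValue
  def add(v: Long): Unit = _max = math.max(_max, v)
  // the state lives in the concrete class, so a type check is unavoidable
  def merge(other: AccumulatorV2[Long, Long]): Unit = other match {
    case o: MaxAccumulator => _max = math.max(_max, o._max)
    case _ => throw new UnsupportedOperationException(
      s"Cannot merge ${getClass.getName} with ${other.getClass.getName}")
  }
  def value: Long = _max
}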

5. AccumulatorV2 is more difficult to implement correctly than
AccumulableParam. For example, in the case of AccumulableParam I have
to implement just 3 methods (addAccumulator, addInPlace, zero), in the
case of AccumulatorParam just 2 methods (addInPlace, zero), but in the
case of AccumulatorV2 6 methods (isZero, copy, reset, add, merge, value).
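For comparison, the V1 counterpart of the MaxAccumulator sketch above
needs only:

import org.apache.spark.AccumulatorParam

object MaxParam extends AccumulatorParam[Long] {
  def addInPlace(a: Long, b: Long): Long = math.max(a, b)
  def zero(initial: Long): Long = Long.MinValue
}

// registered via the (deprecated) V1 API:
// val acc = sc.accumulator(Long.MinValue)(MaxParam)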

6. AccumulatorV2 implementations can hardly be anonymous classes,
because their "copy" and "merge" methods typically require a concrete
class for the type check.
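Compare with V1, where an anonymous param works fine because it is
stateless (again assuming an active SparkContext named sc):

import org.apache.spark.AccumulatorParam

val sum = sc.accumulator(0L)(new AccumulatorParam[Long] {
  def addInPlace(a: Long, b: Long): Long = a + b
  def zero(initial: Long): Long = 0L
})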

I understand the motivation for AccumulatorV2 (SPARK-14654 [4]), but
I'm wondering whether there is a way to simplify the AccumulatorV2 API
to address the points described above and make it less error-prone.


[1] https://issues.apache.org/jira/browse/SPARK-23697
[2] https://issues.apache.org/jira/browse/SPARK-24154
[3] https://github.com/apache/spark/blob/4f5bad615b47d743b8932aea1071652293981604/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L348
[4] https://issues.apache.org/jira/browse/SPARK-14654

Re: AccumulatorV2 vs AccumulableParam (V1)

Wenchen Fan
Hi Sergey,

Thanks for your valuable feedback!

For 1: yea this is definitely a bug and I have sent a PR to fix it.
For 2: I have left my comments on the JIRA ticket.
For 3: I don't quite understand it, can you give some concrete examples?
For 4: yea this is a problem, but I think it's not a big deal, and we couldn't find a better solution at that time.
For 5: I think this is a real problem. It looks to me like we could merge `isZero`, `copyAndReset`, `copy`, `reset` into one API: `zero`, which is basically just `copyAndReset` (see the sketch after this list). If there is a way to fix this without breaking the existing API, I'd be really happy to do it.
For 6: same as 4. It's a problem but not a big deal.
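Something along these lines, perhaps (just a sketch of the idea, not an
actual Spark interface):

abstract class SimplifiedAccumulator[IN, OUT] extends Serializable {
  // returns a fresh zero-valued copy, i.e. today's copyAndReset;
  // isZero, copy and reset could then be derived by the framework
  def zero: SimplifiedAccumulator[IN, OUT]
  def add(v: IN): Unit
  def merge(other: SimplifiedAccumulator[IN, OUT]): Unit
  def value: OUT
}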

In general, I think accumulator v2 sacrifices some flexibility to simplify the framework and improve performance. Users can still use accumulator v1 if flexibility is more important to them, and we can keep improving accumulator v2 without breaking backward compatibility.

Thanks,
Wenchen

Re: AccumulatorV2 vs AccumulableParam (V1)

Sergey Zhemzhitsky
Hi Wenchen,

Thanks a lot for clarification and help.

Here is what I mean regarding the remaining points

For 2: Should we update the documentation [1] on custom accumulators
to make it clearer and to highlight that
  a) custom accumulators should always override the "copy" method to
avoid unexpectedly losing type information
  b) custom accumulators cannot be direct anonymous subclasses of
AccumulatorV2 because of a)
  c) extending already existing accumulators almost always requires
overriding "copy" because of a)

For 3: Here is [2] a sample showing that the same AccumulableParam can
be registered twice with different names, and here is [3] a sample
that fails with IllegalStateException on this line [4], because the
accumulator's metadata is not null and it's hardly possible to reset
it to null (there is no public API for such a thing).
I understand that Spark creates different Accumulators for the same
AccumulableParam internally, and because AccumulatorV2 is stateful,
using the same stateful accumulator instance in multiple places for
different things is very dangerous, so maybe we should highlight this
point in the documentation too?
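In code, the contrast from the gists boils down to the following
(assuming an active SparkContext named sc):

import org.apache.spark.AccumulatorParam

object SumParam extends AccumulatorParam[Long] {
  def addInPlace(a: Long, b: Long): Long = a + b
  def zero(initial: Long): Long = 0L
}

// V1: the stateless param can back any number of named accumulators
val a1 = sc.accumulator(0L, "first")(SumParam)
val a2 = sc.accumulator(0L, "second")(SumParam)

// V2: a single instance carries its registration metadata, so a second
// sc.register(acc, "second") on the same instance fails at [4]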

For 5: Should we raise a JIRA for that?


[1] https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators
[2] https://gist.github.com/szhem/52a26ada4bbeb1a3e762710adc3f94ef#file-accumulatorsspec-scala-L36
[3] https://gist.github.com/szhem/52a26ada4bbeb1a3e762710adc3f94ef#file-accumulatorsspec-scala-L59
[4] https://github.com/apache/spark/blob/4d5de4d303a773b1c18c350072344bd7efca9fc4/core/src/main/scala/org/apache/spark/util/AccumulatorV2.scala#L51


Kind Regards,
Sergey
