RDD-like API for entirely local workflows?

6 messages

RDD-like API for entirely local workflows?

Antonin Delpeuch (lists)
Hi,

I am working on revamping the architecture of OpenRefine, an ETL tool,
to execute workflows on datasets which do not fit in RAM.

Spark's RDD API is a great fit for the tool's operations, and provides
everything we need: partitioning and lazy evaluation.

However, OpenRefine is a lightweight tool that runs locally, on the
user's machine, and we want to preserve this use case. Running Spark in
standalone mode works, but I have read in a couple of places that the
standalone mode is only intended for development and testing. This is
confirmed by my experience with it so far:
- the overhead added by task serialization and scheduling is significant
even in standalone mode. This makes sense for testing, since you want to
test serialization as well, but to run Spark in production locally, we
would need to bypass serialization, which is not possible as far as I know;
- some bugs that manifest themselves only in local mode are not getting
a lot of attention (https://issues.apache.org/jira/browse/SPARK-5300) so
it seems dangerous to base a production system on standalone Spark.

So, we cannot use Spark as the default runner in the tool. Do you know of
any alternative designed for local use? A library which would
provide something similar to the RDD API, but for parallelization with
threads in the same JVM, not machines in a cluster?

If there is no such thing, it should not be too hard to write our
own implementation, which would basically be Java streams with
partitioning. I have looked at Apache Beam's direct runner, but it is
also designed for testing, so it does not fit the bill for the same reasons.
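
To make "Java streams with partitioning" concrete, here is a minimal sketch of the idea, using only the JDK. It is an illustration under assumptions, not OpenRefine's actual design: the class name LocalPartitioned and its methods are hypothetical, the data is an in-memory list rather than something that exceeds RAM, and parallelism comes from the common fork-join pool behind parallelStream().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical name: a tiny lazy, partitioned collection evaluated by threads in one JVM. */
final class LocalPartitioned<T> {
    // Each partition is a recipe for a stream, not materialized data, so map and
    // filter only compose recipes; nothing is computed until collect() is called.
    private final List<Supplier<Stream<T>>> partitions;

    private LocalPartitioned(List<Supplier<Stream<T>>> partitions) {
        this.partitions = partitions;
    }

    /** Splits the list into numPartitions contiguous slices of near-equal size. */
    static <T> LocalPartitioned<T> fromList(List<T> data, int numPartitions) {
        List<Supplier<Stream<T>>> parts = new ArrayList<>();
        int size = data.size();
        for (int i = 0; i < numPartitions; i++) {
            int from = (int) ((long) size * i / numPartitions);
            int to = (int) ((long) size * (i + 1) / numPartitions);
            List<T> slice = data.subList(from, to);
            parts.add(slice::stream);
        }
        return new LocalPartitioned<>(parts);
    }

    /** Lazy: wraps every partition recipe, runs nothing. */
    <R> LocalPartitioned<R> map(Function<T, R> f) {
        List<Supplier<Stream<R>>> parts = new ArrayList<>();
        for (Supplier<Stream<T>> p : partitions) {
            parts.add(() -> p.get().map(f));
        }
        return new LocalPartitioned<>(parts);
    }

    /** Lazy: wraps every partition recipe, runs nothing. */
    LocalPartitioned<T> filter(Predicate<T> pred) {
        List<Supplier<Stream<T>>> parts = new ArrayList<>();
        for (Supplier<Stream<T>> p : partitions) {
            parts.add(() -> p.get().filter(pred));
        }
        return new LocalPartitioned<>(parts);
    }

    /** Evaluates partitions in parallel threads; results keep partition order. */
    List<T> collect() {
        return partitions.parallelStream()
                .map(p -> p.get().collect(Collectors.toList()))
                .flatMap(List::stream)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> evenSquares = LocalPartitioned
                .fromList(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3)
                .map(x -> x * x)
                .filter(x -> x % 2 == 0)
                .collect();
        System.out.println(evenSquares); // prints [4, 16, 36, 64, 100]
    }
}
```

The functions are handed to worker threads as ordinary object references, so no task serialization takes place. What this sketch leaves out is everything that makes the real problem hard: spilling to disk, shuffles, and caching.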

We plan to offer a Spark-based runner in any case - but I do not think
it can be used as the default runner.

Cheers,
Antonin





---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]


Re: RDD-like API for entirely local workflows?

Stephen Boesch
Spark in local mode (which is different from standalone) is a solution for many use cases. I use it in conjunction with (and sometimes instead of) pandas/pandasql because of its much wider ETL-related capabilities. On the JVM side it is an even more obvious choice, given that there is no equivalent to pandas and it has even better performance.

It is also a strong candidate due to the expressiveness of its SQL dialect, including support for analytical/windowing functions. There is a latency hit, on the order of a couple of seconds to start the SparkContext, but pandas is not a high-performance tool in any case.

I see that OpenRefine is implemented in Java, so Spark local should be a very good complement to it.


On Sat, 4 Jul 2020 at 08:17, Antonin Delpeuch (lists) <[hidden email]> wrote:
> [...]


Re: RDD-like API for entirely local workflows?

Juan Martín Guillén
In reply to this post by Antonin Delpeuch (lists)
Hi Antonin.

It seems you are confusing Standalone mode with Local mode. They are two different modes.

From the book Spark in Action: "In local mode, there is only one executor in the same client JVM as the driver, but
this executor can spawn several threads to run tasks.
In local mode, Spark uses your client process as the single executor in the cluster,
and the number of threads specified determines how many tasks can be executed in parallel."

I am pretty sure this is the mode best suited to your use case.

What you are referring to, I think, is running a Standalone cluster locally, which does not make much sense resource-wise and is what may be considered suitable only for testing purposes.

Running Spark in Local mode is totally fine and supported for non-cluster (local) environments.

Here are the master URLs your Spark application can connect to: https://spark.apache.org/docs/latest/submitting-applications.html#master-urls

Regards,
Juan Martín.




On Saturday, 4 July 2020 at 12:17:01 ART, Antonin Delpeuch (lists) <[hidden email]> wrote:
> [...]


Re: RDD-like API for entirely local workflows?

Antonin Delpeuch (lists)
Hi Stephen and Juan,

Thanks both for your replies - you are right, I used the wrong
terminology! The local mode is what fits our needs best (and what I have
been benchmarking so far).

That being said, the problems I mentioned still apply in this
context. There is still a serialization overhead (which can be observed
in the web UI), and it is really noticeable to the user.

For instance, to display the paginated grid in the tool's UI, I need to
run a simple job (filterByRange), and Spark's own overheads account for
about half of the overall execution time.

Intuitively, when running in local mode there should not be any need for
serializing tasks to pass them between threads, so that is what I am
trying to eliminate.

Regards,
Antonin

On 04/07/2020 17:49, Juan Martín Guillén wrote:
> [...]


Re: RDD-like API for entirely local workflows?

Juan Martín Guillén
Would you be able to send the code you are running?
It would be great if you could include some sample data.
Is that possible?


On Saturday, 4 July 2020 at 13:09:23 ART, Antonin Delpeuch (lists) <[hidden email]> wrote:
> [...]


Re: RDD-like API for entirely local workflows?

Antonin Delpeuch (lists)
Hi Juan,

Of course! My prototype is here:
https://github.com/OpenRefine/OpenRefine/tree/spark-prototype

I suspect it will be quite hard for you to jump into the code at this stage
of the project, but here are some concise pointers:

The or-spark module contains the Spark-based implementation of our
data model. The tasks themselves are generated by the application code
(in the "main" module).

You can try the prototype as a user (clone the repo, check out the branch
and run ./refine). If you import a small CSV file via the Clipboard
pane, you can then run a few operations on it and observe the tasks in
Spark's web UI.

I would be happy to give you any additional pointers (perhaps off-list?)
if you want to take a closer look.

One general question I have for the list is: do you have a good way to
inspect and optimize the serialization of tasks?
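
One JDK-only way to get a first impression, sketched here on the assumption that task closures go through plain Java serialization (which, as far as I know, is what Spark uses for closures): serialize the function object by hand and look at the byte count. The names ClosureSize, SerFunction, small and capturing are hypothetical and exist only for this illustration; they are not part of Spark or OpenRefine.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.function.Function;

/** Hypothetical helper: measures how many bytes Java serialization produces for an object. */
final class ClosureSize {
    /** A function type that is also Serializable, so lambdas of this type can be written out. */
    interface SerFunction<T, R> extends Function<T, R>, Serializable {}

    static int serializedSize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            // Typically a NotSerializableException naming the offending class.
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    /** Captures nothing: only the lambda's own metadata is written. */
    static SerFunction<Integer, Integer> small() {
        return x -> x + 1;
    }

    /** Captures the array: every serialized copy of the closure carries the whole array. */
    static SerFunction<Integer, Integer> capturing(int[] lookup) {
        return x -> lookup[x % lookup.length];
    }

    public static void main(String[] args) {
        System.out.println("no captures:    " + serializedSize(small()) + " bytes");
        System.out.println("captures array: " + serializedSize(capturing(new int[100_000])) + " bytes");
    }
}
```

A closure that drags along a large captured object (here 100,000 ints, so at least 400,000 bytes) shows up immediately in such a measurement, which points at what to trim or mark transient before a task is shipped.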

Thank you so much for all your help so far!
Antonin


On 04/07/2020 19:19, Juan Martín Guillén wrote:
> [...]