NLTK with Spark Streaming


NLTK with Spark Streaming

ashish rawat
Hi,

Has anyone tried running NLTK (Python) with Spark Streaming (Scala)? Is this a good idea, and what are the right Spark operators to do it? The reason we want to try this combination is that we don't want to run our transformations in Python (PySpark), but after the transformations we need to run some natural language processing operations, and we don't want to restrict the functions data scientists can use to a Spark natural language library. So Spark Streaming with NLTK looks like the right option, from the perspective of both fast data processing and data science flexibility.
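A minimal sketch of the NLP stage described above, assuming the RDD-based DStream API: `mapPartitions` lets any expensive NLTK setup run once per partition rather than once per record. The socket source, host, and port are hypothetical, and a plain `str.split` stands in for `nltk.word_tokenize` so the sketch runs without NLTK's data files:

```python
# Sketch: per-partition NLP inside a PySpark streaming job.
# str.split stands in for nltk.word_tokenize; a real job would
# `from nltk import word_tokenize` (after downloading the punkt data).

def word_tokenize(text):
    # Stand-in tokenizer so the sketch runs without NLTK installed.
    return text.split()

def tokenize_partition(lines):
    # Runs once per partition: any expensive NLTK setup (loading
    # models, taggers, etc.) would go here, before the loop.
    for line in lines:
        yield (line, word_tokenize(line))

def wire_up_stream():
    # Hypothetical wiring; requires pyspark and a text source on port 9999.
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="nltk-sketch")
    ssc = StreamingContext(sc, batchDuration=5)
    lines = ssc.socketTextStream("localhost", 9999)
    tokens = lines.mapPartitions(tokenize_partition)
    tokens.pprint()
    return ssc  # caller runs ssc.start(); ssc.awaitTermination()
```

The Scala transformations would run upstream of this; only the final NLP stage needs to cross into Python.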

Regards,
Ashish

Re: NLTK with Spark Streaming

Holden Karau
So it’s certainly doable (it’s not super easy, mind you), but until the Arrow UDF release goes out it will be rather slow.
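The Arrow-backed UDFs referred to here shipped as `pandas_udf` in Spark 2.3; they move data between the JVM and the Python worker in Arrow batches instead of pickling row by row. A hedged sketch, with the scoring logic kept as a plain function (the token-count metric is illustrative, standing in for an NLTK call):

```python
# Sketch: wrapping a plain Python NLP function in an Arrow-backed pandas UDF.
# Only register_token_count_udf touches Spark; token_count is pure Python.

def token_count(text):
    # Stand-in for an NLTK call, e.g. len(nltk.word_tokenize(text)).
    return len(text.split())

def register_token_count_udf():
    # Hypothetical wiring; requires pyspark >= 2.3 and pyarrow.
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import IntegerType

    @pandas_udf(IntegerType())
    def token_count_udf(texts: pd.Series) -> pd.Series:
        # Called with a whole Arrow batch at a time, not row by row.
        return texts.map(token_count)

    return token_count_udf
```

`df.withColumn("n_tokens", token_count_udf(df["text"]))` would then run the Python function over each micro-batch without per-row serialization overhead.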

On Sun, Nov 26, 2017 at 8:01 AM ashish rawat <[hidden email]> wrote:

Re: NLTK with Spark Streaming

Chetan Khatri
But you can still use the Stanford NLP library and distribute the work through Spark, right?

On Sun, Nov 26, 2017 at 3:31 PM, Holden Karau <[hidden email]> wrote:

Re: NLTK with Spark Streaming

ashish rawat
Thanks Holden and Chetan.

Holden - Have you tried it out? Do you know the right way to do it?
Chetan - Yes, if we use a Java NLP library there should not be any issue integrating it with Spark Streaming, but as I pointed out earlier, we want to give data scientists the flexibility to use the language and library of their choice, instead of restricting them to a library of our choice.

On Sun, Nov 26, 2017 at 9:42 PM, Chetan Khatri <[hidden email]> wrote:

Re: NLTK with Spark Streaming

Nicholas Hakobian
Depending on your needs, it's fairly easy to write a lightweight Python wrapper around the Databricks spark-corenlp library: https://github.com/databricks/spark-corenlp
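For instance, a hedged sketch of such a wrapper: the function names below (`tokenize`, `ssplit`, `pos`, `lemma`, `ner`, `sentiment`) are those the spark-corenlp README lists, and the py4j wiring assumes the spark-corenlp jar and its CoreNLP model dependency are on the driver and executor classpath:

```python
# Sketch: a thin Python wrapper over the Scala spark-corenlp column
# functions, reached through py4j. The classpath setup is assumed.

CORENLP_FUNCTIONS = ("tokenize", "ssplit", "pos", "lemma", "ner", "sentiment")

def corenlp_column(spark, func_name, col):
    # Returns a pyspark Column applying the named spark-corenlp
    # function to `col`. Validates the name before touching the JVM.
    if func_name not in CORENLP_FUNCTIONS:
        raise ValueError("unknown spark-corenlp function: %s" % func_name)
    from pyspark.sql.column import Column, _to_java_column
    jfunc = getattr(spark._jvm.com.databricks.spark.corenlp.functions, func_name)
    return Column(jfunc(_to_java_column(col)))
```

`df.select(corenlp_column(spark, "ner", df["text"]))` would then mirror the Scala-side `functions.ner` call.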


Nicholas Szandor Hakobian, Ph.D.
Staff Data Scientist
Rally Health


On Sun, Nov 26, 2017 at 8:19 AM, ashish rawat <[hidden email]> wrote:

Re: NLTK with Spark Streaming

ashish rawat
Thanks Nicholas, but the problem for us is that we want to use the NLTK Python library, since our data scientists are training with it. Rewriting the inference logic using some other library would be time-consuming, and in some cases it may not even work, because of the unavailability of some functions.

On Nov 29, 2017 3:16 AM, "Nicholas Hakobian" <[hidden email]> wrote: