Spark Streaming for time-consuming job

Spark Streaming for time-consuming job

Eko Susilo
Hi All,

I have a problem with Spark Streaming that I would like to ask about.

I have a Spark Streaming application that parses a file which keeps growing over time. The file contains several columns of numbers, and the parsing is divided into windows of one minute each. Each column represents a different entity, while each row within a column holds a value of that entity (for example, the first column represents temperature, the second humidity, and so on, and each row holds one reading of that attribute). I use a PairDStream for each column.

Afterwards, I need to run a time-consuming algorithm (outlier detection; for now I use a box-plot algorithm) on each RDD of each PairDStream.

To run the outlier detection, I am currently thinking of calling collect on each PairDStream from foreachRDD, getting the list of items, and then passing each list to a thread. Each thread would run the outlier detection algorithm and process the result.

I run the outlier detection in separate threads so as not to put too much of a burden on the Spark Streaming task. So, I would like to ask: does this model carry any risk? Or does the framework provide an alternative so that I don't have to run a separate thread for this?
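Roughly, what I have in mind looks like the sketch below (columnStream, the pool size, and runOutlierDetection are just placeholders I made up):

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.streaming.Time;

import scala.Tuple2;

final ExecutorService pool = Executors.newFixedThreadPool(4);

columnStream.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Long> pair, Time time) throws Exception {
        // collect() pulls this window's data back to the driver
        final List<Tuple2<String, Long>> items = pair.collect();
        pool.submit(new Runnable() {
            @Override
            public void run() {
                // the time-consuming detection runs off the streaming thread
                runOutlierDetection(items); // placeholder for the detection routine
            }
        });
        return null;
    }
});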

Thank you for your attention. 



--
Best Regards,
Eko Susilo

Re: Spark Streaming for time-consuming job

Mayur Rustagi
Calling collect on anything is almost always a bad idea. The only exception is if you are passing that data on to some other system and never looking at it again :).
I would say you should implement the outlier detection on the RDD and process it in Spark itself rather than calling collect on it.
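
A rough sketch of what I mean (the partition count is arbitrary, and pair stands for one window's JavaPairRDD<String, Long> inside foreachRDD):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

// box-plot (IQR) outliers computed inside Spark, no collect()
JavaRDD<Double> values = pair.map(new Function<Tuple2<String, Long>, Double>() {
    public Double call(Tuple2<String, Long> t) { return t._2().doubleValue(); }
}).cache();

final long n = values.count();
if (n > 0) {
    // sort the values and index them so the quartiles can be found by rank
    JavaPairRDD<Long, Double> byRank = values
        .sortBy(new Function<Double, Double>() {
            public Double call(Double d) { return d; }
        }, true, 4) // 4 partitions is arbitrary, tune for your data
        .zipWithIndex()
        .mapToPair(new PairFunction<Tuple2<Double, Long>, Long, Double>() {
            public Tuple2<Long, Double> call(Tuple2<Double, Long> t) {
                return new Tuple2<Long, Double>(t._2(), t._1());
            }
        });

    double q1 = byRank.lookup(n / 4).get(0);
    double q3 = byRank.lookup(3 * n / 4).get(0);
    final double low = q1 - 1.5 * (q3 - q1);
    final double high = q3 + 1.5 * (q3 - q1);

    // count the outliers on the workers
    long outliers = values.filter(new Function<Double, Boolean>() {
        public Boolean call(Double d) { return d < low || d > high; }
    }).count();

    double percentage = 100.0 * outliers / n;
}

Only the two quartile lookups and the final counts touch the driver; the sort and the filtering stay distributed.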

Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257



Re: Spark Streaming for time-consuming job

Eko Susilo
Hi Mayur, 

Thanks for your suggestion.

In fact, that's what I'm thinking about: to pass the data along and return only the percentage of outliers in a particular window.

I also have some doubts about implementing the outlier detection on the RDD as you suggested.

From what I understand, those RDDs are distributed among the Spark workers, so I imagine I would do something like the following (code_905 is a PairDStream):

code_905.foreachRDD(new Function2<JavaPairRDD<String, Long>, Time, Void>() {
    @Override
    public Void call(JavaPairRDD<String, Long> pair, Time time) throws Exception {
        if (pair.count() > 0) {
            final List<Double> data = new LinkedList<Double>();
            pair.foreach(new VoidFunction<Tuple2<String, Long>>() {
                @Override
                public void call(Tuple2<String, Long> t) throws Exception {
                    double doubleValue = t._2().doubleValue();
                    // register data from this window to be checked
                    data.add(doubleValue);
                    // register the data with the outlier detector
                    outlierDetector.addData(doubleValue);
                }
            });
            // get the percentage of outliers for this window
            double percentage = outlierDetector.getOutlierPercentageFromThisData(data);
        }
        return null;
    }
});

The variable outlierDetector is declared as a static variable on the class. The call outlierDetector.addData is needed because I would like to run the outlier detection on data obtained from previous window(s) as well.
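
Alternatively, I suppose I could let Spark keep the previous windows itself instead of the static detector, with something like this (the durations here are just placeholders):

import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// keep the last 5 minutes of data inside Spark, sliding every minute,
// instead of accumulating history in the static outlierDetector
JavaPairDStream<String, Long> withHistory =
        code_905.window(Durations.minutes(5), Durations.minutes(1));

Each foreachRDD call would then see the union of the recent windows.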

My concern with writing the outlier detection in Spark is that it would slow down the streaming job, since the detection involves sorting data and calculating some statistics. In particular, I would need to run many instances of the outlier detection (each instance handling a different set of data). So, what do you think about this model?
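
For the many-instances part, I imagine the work could be spread per entity across the workers, roughly like this (assuming the String key identifies the entity; detectOutlierPercentage is a made-up serializable helper):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

// one detection per entity, run in parallel on the workers
JavaPairRDD<String, Double> percentages = pair.groupByKey()
        .mapValues(new Function<Iterable<Long>, Double>() {
            public Double call(Iterable<Long> values) throws Exception {
                // sorting and statistics happen on the worker that owns this key
                return detectOutlierPercentage(values);
            }
        });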






--
Best Regards,
Eko Susilo