real world streaming code

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

real world streaming code

dachuan
Hello, community,

I have three questions about spark streaming.

1, 
I noticed that one streaming example (StatefulNetworkWordCount) has one interesting phenomenon: 
since this workload only prints the first 10 rows of the final RDD, this means if the data influx rate is fast enough (much faster than hand typing in keyboard), then the final RDD would have more than one partition, assume it's 2 partitions, but the second partition won't be computed at all because the first partition suffice to serve the first 10 rows. However, these two workloads must make checkpoint to that RDD. This would lead to a very time consuming checkpoint process because the checkpoint to the second partition can only start before it is computed. So, is this workload only designed for demonstration purpose, for example, only designed for one partition RDD?

(I have attached a figure to illustrate what I've said, please tell me if mailing list doesn't welcome attachment.
A short description about the experiment
Hardware specs: 4 cores
Software specs: spark local cluster, 5 executors (workers), each one has one core, each executor has 1G memory
Data influx speed: 3MB/s 
Data source: one ServerSocket in local file
Streaming App's name: StatefulNetworkWordCount
Job generation frequency: one job per second
Checkpoint time: once per 10s
JobManager.numThreads = 2)

(And another workload might have the same problem: PageViewStream's slidingPageCounts)

2,
Does anybody have a Top-K wordcount streaming source code?

3,
Can anybody share your real world streaming example? for example, including source code, and cluster configuration details?

thanks,
dachuan.

--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210

2-exp-12-15-2013-fifo-2--3.pdf (12K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: real world streaming code

Ryan Weald
Hi dachuan, 

Getting top-k up and running using spark streaming is actually very easy using Twitter's Algebird project. I gave a presentation recently at a spark user meetup that wen through an example of using algebird in a spark streaming job. You can find the video and slides here - http://isurfsoftware.com/blog/2014/01/20/spark-meetup-monoids/

Once you get the general idea of using monoids for aggregation it will be easy to drop in the TopKMonoid from Algebird to solve your problem. 

As far as cluster configuration goes, at my old company Sharethrough, we set Spark to course grained mode on apache mesos with spark config to limit the number of CPUs per job. We also made some minor tweaks to JVM settings for bigger heap size and reduced RDD cache time.

Cheers

Ryan Weald


On Fri, Jan 24, 2014 at 7:28 PM, dachuan <[hidden email]> wrote:
Hello, community,

I have three questions about spark streaming.

1, 
I noticed that one streaming example (StatefulNetworkWordCount) has one interesting phenomenon: 
since this workload only prints the first 10 rows of the final RDD, this means if the data influx rate is fast enough (much faster than hand typing in keyboard), then the final RDD would have more than one partition, assume it's 2 partitions, but the second partition won't be computed at all because the first partition suffice to serve the first 10 rows. However, these two workloads must make checkpoint to that RDD. This would lead to a very time consuming checkpoint process because the checkpoint to the second partition can only start before it is computed. So, is this workload only designed for demonstration purpose, for example, only designed for one partition RDD?

(I have attached a figure to illustrate what I've said, please tell me if mailing list doesn't welcome attachment.
A short description about the experiment
Hardware specs: 4 cores
Software specs: spark local cluster, 5 executors (workers), each one has one core, each executor has 1G memory
Data influx speed: 3MB/s 
Data source: one ServerSocket in local file
Streaming App's name: StatefulNetworkWordCount
Job generation frequency: one job per second
Checkpoint time: once per 10s
JobManager.numThreads = 2)

(And another workload might have the same problem: PageViewStream's slidingPageCounts)

2,
Does anybody have a Top-K wordcount streaming source code?

3,
Can anybody share your real world streaming example? for example, including source code, and cluster configuration details?

thanks,
dachuan.

--
Dachuan Huang
Cellphone: <a href="tel:614-390-7234" value="+16143907234" target="_blank">614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210

Reply | Threaded
Open this post in threaded view
|

Re: real world streaming code

dachuan
thanks, Ryan. I will study Algebird first and try to adapt TopKMonoid to spark streaming program.


On Mon, Jan 27, 2014 at 2:54 PM, Ryan Weald <[hidden email]> wrote:
Hi dachuan, 

Getting top-k up and running using spark streaming is actually very easy using Twitter's Algebird project. I gave a presentation recently at a spark user meetup that wen through an example of using algebird in a spark streaming job. You can find the video and slides here - http://isurfsoftware.com/blog/2014/01/20/spark-meetup-monoids/

Once you get the general idea of using monoids for aggregation it will be easy to drop in the TopKMonoid from Algebird to solve your problem. 

As far as cluster configuration goes, at my old company Sharethrough, we set Spark to course grained mode on apache mesos with spark config to limit the number of CPUs per job. We also made some minor tweaks to JVM settings for bigger heap size and reduced RDD cache time.

Cheers

Ryan Weald


On Fri, Jan 24, 2014 at 7:28 PM, dachuan <[hidden email]> wrote:
Hello, community,

I have three questions about spark streaming.

1, 
I noticed that one streaming example (StatefulNetworkWordCount) has one interesting phenomenon: 
since this workload only prints the first 10 rows of the final RDD, this means if the data influx rate is fast enough (much faster than hand typing in keyboard), then the final RDD would have more than one partition, assume it's 2 partitions, but the second partition won't be computed at all because the first partition suffice to serve the first 10 rows. However, these two workloads must make checkpoint to that RDD. This would lead to a very time consuming checkpoint process because the checkpoint to the second partition can only start before it is computed. So, is this workload only designed for demonstration purpose, for example, only designed for one partition RDD?

(I have attached a figure to illustrate what I've said, please tell me if mailing list doesn't welcome attachment.
A short description about the experiment
Hardware specs: 4 cores
Software specs: spark local cluster, 5 executors (workers), each one has one core, each executor has 1G memory
Data influx speed: 3MB/s 
Data source: one ServerSocket in local file
Streaming App's name: StatefulNetworkWordCount
Job generation frequency: one job per second
Checkpoint time: once per 10s
JobManager.numThreads = 2)

(And another workload might have the same problem: PageViewStream's slidingPageCounts)

2,
Does anybody have a Top-K wordcount streaming source code?

3,
Can anybody share your real world streaming example? for example, including source code, and cluster configuration details?

thanks,
dachuan.

--
Dachuan Huang
Cellphone: <a href="tel:614-390-7234" value="+16143907234" target="_blank">614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210




--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210