I'm forwarding this email along which contains a question from a Spark user Adrien (CC'd) who can't successfully get any emails through to the Apache mailing lists.
Please reply-all when responding to include Adrien. See below for his question.
---------- Forwarded message ----------
From: "Adrien Legrand" <[hidden email]>
Date: May 22, 2014 1:06 AM
Subject: Re: Post validation
To: "Andy Konwinski" <[hidden email]>
Thanks, that would be nice! Here is the question:
Spark Streaming: Flume stream not found
I am currently trying to process a Flume (Avro) stream with Spark Streaming on a YARN cluster. Everything is fine when I launch my code locally. To do so, I use the following args:
master = "local"
host, port = the machine I'm sending the stream to with Flume (I triple-checked that the host/port match between Flume and Spark).
val ssc = new StreamingContext(master, "FlumeEventCount", batchInterval)
val stream = FlumeUtils.createStream(ssc, host, port.toInt)
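For reference, a minimal self-contained sketch of this kind of setup (Spark Streaming 0.9/1.0-era API; the 2-second batch interval, object name, and the count-and-print processing are assumptions, not taken from the original job):

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object FlumeEventCount {
  def main(args: Array[String]): Unit = {
    val Array(master, host, port) = args

    // One StreamingContext per JVM; the batch interval here is a placeholder.
    val ssc = new StreamingContext(master, "FlumeEventCount", Seconds(2))

    // Push-based receiver: the Flume avro sink must point at host:port, and
    // that host must be a machine where a receiver can actually bind -- on a
    // cluster, the receiver may be scheduled on any worker node.
    val stream = FlumeUtils.createStream(ssc, host, port.toInt,
      StorageLevel.MEMORY_ONLY_SER)

    stream.count().map(cnt => s"Received $cnt flume events.").print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

One thing worth noting with push-based Flume streams: the receiver binds on whichever worker the task lands on, so if Flume is configured to send to a fixed hostname, output appearing only some of the time is consistent with the receiver only sometimes being scheduled on that machine.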
But when I launch the same job in parallel (replacing "local" with "yarn-standalone"), the jar starts (I can see some prints I use to debug the code), but it shows the expected output (from the data processing) only one time out of 5 or 10.
Here is the complete command line:
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client --jar /home/www/loganalysis-1.0-SNAPSHOT-jar-with-dependencies.jar --class com.loganalysis.Computation --args yarn-standalone --args receiver.priv.fr --args 9999 --num-workers 6 --master-memory 4g --worker-memory 2g --worker-cores 1
For no apparent reason, sometimes the processing is done. My first guess was that, since I use a big jar with all dependencies in it, the other machines don't have those dependencies and thus can't do the processing. That's why I tried to add the executed jar with --addJars, but the result was the same.
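For what it's worth, the YARN Client's flag is `--addJars` (double dash, comma-separated paths); a sketch of the launch command with it added, assuming the same jar and arguments as above, would look like:

```shell
$SPARK_HOME/bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar /home/www/loganalysis-1.0-SNAPSHOT-jar-with-dependencies.jar \
  --class com.loganalysis.Computation \
  --args yarn-standalone --args receiver.priv.fr --args 9999 \
  --addJars /home/www/loganalysis-1.0-SNAPSHOT-jar-with-dependencies.jar \
  --num-workers 6 --master-memory 4g --worker-memory 2g --worker-cores 1
```

Note that `--addJars` distributes extra jars to the executors; since the assembly jar passed to `--jar` is already shipped to the cluster, adding the same jar again would not be expected to change behavior, which matches the observed result.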
2014-05-22 7:18 GMT+02:00 Andy Konwinski <[hidden email]>: