count() action is being so slow

count() action is being so slow

Mina
Hi, I am using PySpark to process millions of string items.
Spark seems very fast compared to Hadoop MapReduce, but I don't know why the RDD count() action is taking so long to complete; it almost runs forever.

Thank you
Joe

Re: count() action is being so slow

ryan_seq
Hi,
Can you post a snippet of your code?
Also, make sure the count action is performed by an RDD.
It may happen that you have saved the result of an RDD into a plain variable (not an RDD), which can only perform the count on a single node, thus causing the delay.
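For example, a minimal sketch of the difference (the file name and variable names are hypothetical):

rdd = sc.textFile("data.txt")   # rdd is an RDD; count() runs in parallel
fast_count = rdd.count()        # distributed count across all partitions

rows = rdd.collect()            # rows is a local Python list, not an RDD
slow_count = len(rows)          # counted on the driver alone, single node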

sparkNoob wrote:
Hi, I am using PySpark to process millions of string items.
Spark seems very fast compared to Hadoop MapReduce, but I don't know why the RDD count() action is taking so long to complete; it almost runs forever.

Thank you
Joe

Re: count() action is being so slow

Mina
Thank you for your reply. I looked at the running information. It seems that when it counts, it partitions the RDDs, so maybe that's why it is taking so long? Here is the code:

import re
import sys

from pyspark import SparkConf, SparkContext

def parseToTriple(line):
    # Split an N-Triples line into its terms: <uri>, _:blankNode, or "literal"
    return re.findall(r'<[^\s]*>|_:[^\s]*|".*"', line)

if __name__ == "__main__":

    if len(sys.argv) != 3:
        print >> sys.stderr, "Usage: reasoner <schemafile> <instancefile>"
        sys.exit(-1)

    # Initialize the Spark context
    conf = SparkConf()
    conf.setMaster("local")
    conf.setAppName("RDFS Reasoner")
    conf.set("spark.executor.memory", "1g")

    sc = SparkContext(conf=conf)

    # Load the input files and create RDDs of parsed triples
    schema_triples   = sc.textFile(sys.argv[1]).map(lambda x: parseToTriple(x))
    instance_triples = sc.textFile(sys.argv[2]).map(lambda x: parseToTriple(x))
    count = instance_triples.count()

Re: count() action is being so slow

Mina
This is what is happening when I perform the count() action:

[image attachment from the original post not preserved]

Re: count() action is being so slow

ryan_seq
In reply to this post by Mina
Joe L wrote:
I looked at the running information. It seems that when it counts, it partitions the RDDs, so maybe that's why it is taking so long?

Yes. Since you are running it on a single node, the overhead of splitting the files and storing intermediate results makes the task slower.

So if you are able to scale it out to N nodes/cores, the time taken will be roughly (time to process one partition) * (number of partitions / number of nodes).
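As a back-of-the-envelope illustration (a sketch; the numbers are made up, and getNumPartitions() assumes a reasonably recent PySpark):

# How many splits the count() has to scan
num_partitions = instance_triples.getNumPartitions()

# With 40 partitions, 10 nodes, and ~5 s per partition,
# expect roughly 5 * (40 / 10) = 20 seconds.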

Re: count() action is being so slow

Mina
I tested it on Hadoop HDFS with 10 nodes, but the same thing happened. Are you sure about that? Maybe I am doing something wrong. How do I run it on a cluster? Thank you

Re: count() action is being so slow

ryan_seq
I use Scala for writing Spark apps.

Once you have set up the Spark cluster, you can either write a standalone app or run it using the Spark shell; the Spark shell is simpler to begin with.

$ MASTER=<master url> ADD_JARS=<comma separated jars> SPARK_CLASSPATH=<same jar files, separated by colons> ./spark-shell

For Spark apps, here is a sample configuration:

val sc = new SparkContext(
  "spark://<MASTER_IP>:<PORT>",
  "APP NAME",
  System.getenv("SPARK_HOME"),
  Seq("path/one.jar",
      "path/two.jar",
      "root/spark_files/GetCount/target/scala-2.10/cassandra-count_2.10-1.0.jar"))

Here, root/spark_files/GetCount/target/scala-2.10/cassandra-count_2.10-1.0.jar is the jar file generated by sbt package.

All these jar files are first shipped to the slave nodes.
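Since the original poster is using PySpark, here is a rough equivalent sketch (the master URL and the helper-module path are placeholders):

from pyspark import SparkConf, SparkContext

# Point the context at the cluster master instead of "local".
conf = (SparkConf()
        .setMaster("spark://<MASTER_IP>:7077")
        .setAppName("RDFS Reasoner"))

# pyFiles ships the listed Python files to every worker node,
# much like the Scala jars above are shipped to the slaves.
sc = SparkContext(conf=conf, pyFiles=["path/helpers.py"])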

Re: count() action is being so slow

Mina
In reply to this post by ryan_seq
Hi, I need your help, please. I have a problem with PySpark, as mentioned above. I have been stuck on this for weeks, and I need someone to help me. Could I please contact you through your email?

Regards,
Joe