filter operation in pyspark

filter operation in pyspark

Mohit Singh
Hi,
   I have a CSV file (say, "n" columns).

I am trying to do a filter operation like:

query = rdd.filter(lambda x: x[1] == "1234")
query.take(20)
Basically, this should return the rows where the second column equals that value, right?
This operation is taking quite some time to execute (if the comparison is fair, it seems slower than the equivalent Hadoop job).
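
For reference, here is a minimal, self-contained version of what I'm running (the sc.textFile call, app name, and HDFS path are assumed from the log below; note that each element of a plain textFile RDD is a raw line of text, so x[1] would be the second character rather than the second column, which is why this sketch splits on commas first):

from pyspark import SparkContext

sc = SparkContext(appName="csv-filter")  # app name is made up for the example

# Each element of a textFile RDD is one raw line of text; split it into
# columns before indexing, otherwise x[1] is the second *character*.
lines = sc.textFile("hdfs://master:9000/user/hadoop/data/input.csv")
rows = lines.map(lambda line: line.split(","))

query = rows.filter(lambda cols: cols[1] == "1234")
print(query.take(20))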

I am seeing this on my console:
14/03/03 16:13:03 INFO PythonRDD: Times: total = 5245, boot = 3, init = 8, finish = 5234
14/03/03 16:13:03 INFO SparkContext: Job finished: take at <stdin>:1, took 5.249082169 s
14/03/03 16:13:03 INFO SparkContext: Starting job: take at <stdin>:1
14/03/03 16:13:03 INFO DAGScheduler: Got job 715 (take at <stdin>:1) with 1 output partitions (allowLocal=true)
14/03/03 16:13:03 INFO DAGScheduler: Final stage: Stage 720 (take at <stdin>:1)
14/03/03 16:13:03 INFO DAGScheduler: Parents of final stage: List()
14/03/03 16:13:03 INFO DAGScheduler: Missing parents: List()
14/03/03 16:13:03 INFO DAGScheduler: Computing the requested partition locally
14/03/03 16:13:03 INFO HadoopRDD: Input split: hdfs://master:9000/user/hadoop/data/input.csv:5100273664+134217728

Am I not doing this correctly?


--
Mohit

"When you want success as badly as you want the air, then you will get it. There is no other secret of success."
-Socrates
Re: filter operation in pyspark

Mayur Rustagi
Could be a number of issues. Maybe your CSV is not allowing the input to be split into multiple map tasks, or the file blocks are not local to the worker nodes. How many tasks are you seeing in the Spark web UI for mapping and storing the data? Are all the nodes being used when you look at the task level? Is the time taken by each task roughly equal, or heavily skewed? A quick sketch of the kind of check I mean is below.
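
For example (assuming `rows` is the split-up RDD from your snippet; getNumPartitions and repartition are the names in current PySpark, older releases may differ):

# How many partitions did the input produce? A handful of partitions on a
# multi-GB file suggests the splits are not being parallelized.
print(rows.getNumPartitions())

# Your log shows job 715, so take() is being called repeatedly; caching
# avoids re-reading the file from HDFS on every query.
rows.cache()

# Force a wider spread across the cluster; 64 here is an arbitrary example.
query = rows.repartition(64).filter(lambda cols: cols[1] == "1234")
print(query.take(20))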
Regards
Mayur

Mayur Rustagi
Ph: +1 (760) 203 3257

