Is there any window operation for RDDs in Pyspark? like for DStreams

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view

Is there any window operation for RDDs in Pyspark? like for DStreams


I have two RDDs and my goal is to calculate the Pearson's correlation
between them using sliding window. I want to have 200 samples in each window
from rdd1 and rdd2 and calculate the correlation between them and then slide
the window with 120 samples and calculate the correlation between next 200
samples of windows.I know sliding window works for DStream but I have to use
RDD instead of DStream. When I use window function for RDD i get an error
saying RDD doesn't have window attribute. The reason that I need to use
window operation here is that 1) rdd1 and rdd2 are infinite streams and I
need to partition it to the smaller chunks like windows 2) This built-in
Pearson's correlation function in Pyspark only works for the partitions with
equal size so in my case I chose 200 samples per window and 120 samples for
sliding interval.
I'd appreciate it if you have any idea how to solve it.

My code is here:
if __name__ == "__main__":
sc = SparkContext(appName="CorrelationsExample")
input_path1 = sys.argv[1]
input_path2 = sys.argv[2]
num_of_partitions = 1
rdd1 = sc.textFile(input_path1, num_of_partitions).flatMap(lambda line1:
line1.strip().split("\n")).map(lambda strelem1: float(strelem1))
rdd2 = sc.textFile(input_path2, num_of_partitions).flatMap(lambda line2:
line2.strip().split("\n")).map(lambda strelem2: float(strelem2))
1 = rdd1.collect()
l2 = rdd2.collect()
seriesX = sc.parallelize(l1)
seriesY = sc.parallelize(l2)
print("Correlation is: " + str(Statistics.corr(seriesX, seriesY,

Sent from:

To unsubscribe e-mail: [hidden email]