Structured Streaming, Reading and Updating a variable


Structured Streaming, Reading and Updating a variable

Martin Engen

Hello,

 

I'm working with Structured Streaming, and I need a way of keeping a running average based on the last 24 hours of data.

To help with this, I can use Exponential Smoothing, which means I really only need to carry one value from the previous calculation into the next, and update this variable as calculations carry on.
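(For reference, the smoothing recurrence itself needs only the previous smoothed value and the newest observation. A minimal pure-Scala sketch, with an assumed smoothing factor of 0.5 and made-up readings:)

```scala
object EmaExample {
  // Exponential smoothing: s' = alpha * x + (1 - alpha) * s.
  // Only one value (the previous smoothed result) has to be stored
  // between calculations.
  def smooth(prev: Double, x: Double, alpha: Double): Double =
    alpha * x + (1 - alpha) * prev

  def main(args: Array[String]): Unit = {
    val readings = Seq(10.0, 12.0, 11.0, 13.0)
    // Seed with the first reading, then fold the rest through the recurrence.
    val ema = readings.tail.foldLeft(readings.head)(smooth(_, _, 0.5))
    println(ema)  // prints 12.0
  }
}
```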

 

Implementing this is a much bigger challenge than I ever imagined.

 

 

I've tried using Accumulators, and querying/storing data to Cassandra after every calculation. Both methods worked somewhat locally, but I don't seem to be able to use them on the Spark worker nodes, as I get the error

"java.lang.NoClassDefFoundError: Could not initialize class" for both the accumulator and the Cassandra connector library.

 

How can you read/update a variable while doing calculations using Structured Streaming?


Thank you




Re: Structured Streaming, Reading and Updating a variable

Koert Kuipers
You use a windowed aggregation for this.
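(A sketch of what that looks like, assuming the streaming DataFrame `areaStateDf` from the original post has an event-time column named `timestamp` in addition to `plantKey` and `sensor` — the column names are assumptions based on the snippets in this thread:)

```scala
import org.apache.spark.sql.functions.window
import spark.implicits._

// Sliding 24-hour window, advancing every hour, keyed by plant.
// The watermark bounds how much window state Spark has to keep.
val windowedAvg = areaStateDf
  .withWatermark("timestamp", "24 hours")
  .groupBy(window($"timestamp", "24 hours", "1 hour"), $"plantKey")
  .avg("sensor")
```

This runs entirely on the executors, so it avoids the driver-side variable problem altogether.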

On Tue, May 15, 2018, 09:23 Martin Engen <[hidden email]> wrote:



Re: Structured Streaming, Reading and Updating a variable

JayeshLalwani
In reply to this post by Martin Engen

Do you have a code sample, and detailed error message/exception to show?

 

From: Martin Engen <[hidden email]>
Date: Tuesday, May 15, 2018 at 9:24 AM
To: "[hidden email]" <[hidden email]>
Subject: Structured Streaming, Reading and Updating a variable

 



The information contained in this e-mail is confidential and/or proprietary to Capital One and/or its affiliates and may only be used solely in performance of work or services for Capital One. The information transmitted herewith is intended only for use by the individual or entity to which it is addressed. If the reader of this message is not the intended recipient, you are hereby notified that any review, retransmission, dissemination, distribution, copying or other use of, or taking of any action in reliance upon this information is strictly prohibited. If you have received this communication in error, please contact the sender and delete the material from your computer.


Re: Structured Streaming, Reading and Updating a variable

Martin Engen
I have been testing aggregations, but I seem to hit a wall on two issues. Example:

val avg = areaStateDf.groupBy($"plantKey").avg("sensor")

1) How can I use the result of an aggregation within the same stream, to do further calculations?
2) It seems to be very slow with a moving window of 24 hours and a moving average over calculations within it, even when testing locally.

The Accumulator issue — a simple counter accumulator:

object Test {
  private val spark = SparkHelper.getSparkSession()
  import spark.implicits._
  import com.datastax.spark.connector._

  val counter = spark.sparkContext.longAccumulator("counter")

  val fetchData = () => {
    counter.add(2)
    counter.value
  }

  val fetchdataUDF = spark.sqlContext.udf.register("testUDF", fetchData)

  def calculate(areaStateDf: DataFrame): StreamingQuery = {
    import spark.implicits._
    val ds = areaStateDf.select($"areaKey").withColumn("fetchedData", fetchdataUDF())
    KafkaSinks.debugStream(ds, "volumTest")
  }
}

I would create a custom accumulator to include a smoothing algorithm, but I can't even get a plain counter working.
This works locally, but on the server running Docker (one master and one worker) it throws this error:

18/05/16 08:35:22 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 3, 172.17.0.5, executor 0): java.lang.ExceptionInInitializerError
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:24)
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:15)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:376)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:910)
at com.cambi.assurance.spark.SparkHelper$.getSparkSession(SparkHelper.scala:28)
at com.client.spark.calculations.Test$.<init>(ThpLoad1.scala:10)
at com.client.spark.calculations.Test$.<clinit>(ThpLoad1.scala)
... 18 more

18/05/16 08:35:22 WARN scheduler.TaskSetManager: Lost task 0.1 in stage 3.0 (TID 4, 172.17.0.5, executor 0): java.lang.NoClassDefFoundError: Could not initialize class com.client.spark.calculations.Test$
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:24)
at com.client.spark.calculations.Test$$anonfun$1.apply(ThpLoad1.scala:15)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:235)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)

        
Any ideas about how to handle this error?

Thanks,
Martin Engen
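(The usual way to carry one value, such as the smoothed average, from batch to batch without a driver-side accumulator is Structured Streaming's arbitrary stateful API, `mapGroupsWithState` (Spark 2.2+). A hedged sketch — the column names, `Reading` case class, and smoothing factor are assumptions based on the snippets in this thread:)

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import spark.implicits._

case class Reading(plantKey: String, sensor: Double)

// One piece of Double state per plantKey, persisted by Spark across
// micro-batches in the checkpoint, so no shared variable is needed.
val smoothed = areaStateDf.as[Reading]
  .groupByKey(_.plantKey)
  .mapGroupsWithState(GroupStateTimeout.NoTimeout) {
    (key: String, readings: Iterator[Reading], state: GroupState[Double]) =>
      val alpha = 0.5
      // Seed from stored state if present, otherwise from the first reading.
      var s = state.getOption.getOrElse(readings.next().sensor)
      readings.foreach(r => s = alpha * r.sensor + (1 - alpha) * s)
      state.update(s)
      (key, s)
  }
```

Note that a query over `mapGroupsWithState` has to be started with `outputMode("update")`. This also sidesteps the `NoClassDefFoundError` above, which is caused by the `Test` object trying to build a SparkSession when it is first referenced on an executor.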

From: Lalwani, Jayesh <[hidden email]>
Sent: Tuesday, May 15, 2018 9:59 PM
To: Martin Engen; [hidden email]
Subject: Re: Structured Streaming, Reading and Updating a variable
 

