Spark seems to think that a particular broadcast variable is large in size

Spark seems to think that a particular broadcast variable is large in size

V0lleyBallJunki3
I am trying to do a broadcast join on two tables. The size of the
smaller table varies with the parameters, but the size of the larger
table is close to 2 TB. What I have noticed is that unless I set
spark.sql.autoBroadcastJoinThreshold to 10 GB, some of these
operations do a SortMergeJoin instead of a broadcast join. But the
smaller table shouldn't be anywhere near that big: I wrote it out to
an S3 folder and it took only 12.6 MB of space. I also did some
operations on the smaller table so that the shuffle size appears on
the Spark History Server, and the size in memory seemed to be about
150 MB, nowhere near 10 GB. On the other hand, if I force a broadcast
join on the smaller table it takes a long time to broadcast, which
makes me think the table might not really be just 150 MB. What would
be a good way to figure out the actual size that Spark sees when it
decides whether a table crosses spark.sql.autoBroadcastJoinThreshold?
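
For reference, one way to see the number Spark actually compares
against the threshold is to ask the optimized plan for its size
estimate. A minimal spark-shell sketch follows; the paths and the join
column are made up, and the exact stats accessor differs slightly
between Spark versions:

import org.apache.spark.sql.functions.broadcast

// Hypothetical paths and join column, for illustration only.
val small = spark.read.parquet("s3://bucket/small-table")
val large = spark.read.parquet("s3://bucket/large-table")

// Spark's own size estimate for the small side: the number the planner
// compares against spark.sql.autoBroadcastJoinThreshold. This is the
// Spark 2.3+ accessor; on 2.2.x the stats method takes the SQL conf instead.
val estimatedBytes = small.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Planner size estimate: $estimatedBytes bytes")

// Force the broadcast regardless of the estimate and inspect the plan.
large.join(broadcast(small), Seq("id")).explain()  // look for BroadcastHashJoin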

Re: Spark seems to think that a particular broadcast variable is large in size

Dillon Dukek
In your program, persist the smaller table and use count to force it to materialize. Then, in the Spark UI, go to the Storage tab. The size of your table as Spark sees it should be displayed there. Out of curiosity, what version and language of Spark are you using?
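
In code, that suggestion amounts to something like the sketch below
(the path is hypothetical):

// Hypothetical path; run this in spark-shell or in the application itself.
val small = spark.read.parquet("s3://bucket/small-table")
small.persist()  // mark the DataFrame for caching
small.count()    // an action, which forces the cache to materialize
// Then open the Spark UI, Storage tab: the "Size in Memory" column shows
// how large the cached table is from Spark's point of view.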

Re: Spark seems to think that a particular broadcast variable is large in size

V0lleyBallJunki3
I did try that approach before, but the data never shows up in the
Storage tab; it is always blank. I have tried it in Zeppelin as well
as in spark-shell.

scala> val classCount = spark.read.parquet("s3:// ..../classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab of either Zeppelin or spark-shell.
However, I have several running applications in production that do
show the data in cache. I am using Scala and Spark 2.2.1 on EMR. Are
there any workarounds to see the data in cache?
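
One workaround that might help while the Storage tab stays empty is to
ask the SparkContext for storage information directly instead of going
through the UI. getRDDStorageInfo is a developer API, so treat this as
a best-effort sketch:

classCount.persist()
classCount.count()  // force materialization

// Developer API: lists RDDs known to the context with their cached sizes.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: cachedPartitions=${info.numCachedPartitions}, " +
    s"memSize=${info.memSize} B, diskSize=${info.diskSize} B")
}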

Re: Spark seems to think that a particular broadcast variable is large in size

V0lleyBallJunki3
The same problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server

Re: Spark seems to think that a particular broadcast variable is large in size

Dillon Dukek
You keep mentioning that you are viewing this after the fact in the Spark History Server. Also, spark-shell isn't a UI, so I'm not sure what you mean when you say the Storage tab is blank in spark-shell. Just so I'm clear about what you're doing: are you looking at this information while your application is running, in the Spark UI reached through the ResourceManager link in the EMR console? That would be the route I would go. I'm not sure Spark retains storage information to be viewed after the fact, since once the program is complete the DataFrame is freed and you lose that context.

If you would like to do a pared-down test in spark-shell, you can do that as well. Once Spark is started via the spark-shell command, it launches a Spark UI for you to view the job's progress. This table even sounds small enough that, if you are allowed to, you should be able to launch this from your local machine and see the UI at localhost:4040. I've confirmed this works locally for some data that I have.
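
If it helps, a self-contained local check along those lines might look
like the sketch below (the local path and app name are made up); while
it runs, the UI should be reachable at http://localhost:4040:

import org.apache.spark.sql.SparkSession

object StorageTabCheck extends App {
  val spark = SparkSession.builder()
    .appName("storage-tab-check")
    .master("local[*]")
    .getOrCreate()

  // Hypothetical local copy of the small table.
  val small = spark.read.parquet("/tmp/classCount")
  small.persist()
  println(s"rows = ${small.count()}")  // action forces the cache to fill

  // Keep the application alive so the Storage tab can be inspected.
  scala.io.StdIn.readLine("Check the Storage tab at localhost:4040, then press Enter...")
  spark.stop()
}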
