Spark 3.0 almost 1000 times slower to read json than Spark 2.4

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Sanjeev Mishra

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

MetricMin25th percentileMedian75th percentileMax
Duration0.0 ms0.0 ms0.0 ms1.0 ms62.0 ms
GC Time0.0 ms0.0 ms0.0 ms0.0 ms11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor IDLogsAddressTask TimeTotal TasksFailed TasksKilled TasksSucceeded TasksBlacklisted
driver10.0.0.8:4915936 s2203002203false


Spark 3.0

Summary Metrics for 8 Completed Tasks

MetricMin25th percentileMedian75th percentileMax
Duration3.8 min4.0 min4.1 min4.4 min5.0 min
GC Time3 s3 s3 s4 s4 s
Input Size / Records15.6 MiB / 5102816.2 MiB / 5330316.8 MiB / 5525917.8 MiB / 5814820.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor IDLogsAddressTask TimeTotal TasksFailed TasksKilled TasksSucceeded TasksBlacklistedInput Size / Records
driver10.0.0.8:5022433 min8008false136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot 2020-06-27 16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27 16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Gourav Sengupta
Hi,

can you please share the SPARK code? 



Regards,
Gourav 

On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra <[hidden email]> wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

MetricMin25th percentileMedian75th percentileMax
Duration0.0 ms0.0 ms0.0 ms1.0 ms62.0 ms
GC Time0.0 ms0.0 ms0.0 ms0.0 ms11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor IDLogsAddressTask TimeTotal TasksFailed TasksKilled TasksSucceeded TasksBlacklisted
driver10.0.0.8:4915936 s2203002203false


Spark 3.0

Summary Metrics for 8 Completed Tasks

MetricMin25th percentileMedian75th percentileMax
Duration3.8 min4.0 min4.1 min4.4 min5.0 min
GC Time3 s3 s3 s4 s4 s
Input Size / Records15.6 MiB / 5102816.2 MiB / 5330316.8 MiB / 5525917.8 MiB / 5814820.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor IDLogsAddressTask TimeTotal TasksFailed TasksKilled TasksSucceeded TasksBlacklistedInput Size / Records
driver10.0.0.8:5022433 min8008false136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot 2020-06-27 16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27 16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Sanjeev Mishra
There is not much code, I am using spark-shell provided by Spark 2.4 and Spark 3. 

val dp = spark.read.json("/Users/<userhome>/data/dailyparams/20200528")



On Mon, Jun 29, 2020 at 2:25 AM Gourav Sengupta <[hidden email]> wrote:
Hi,

can you please share the SPARK code? 



Regards,
Gourav 

On Sun, Jun 28, 2020 at 12:58 AM Sanjeev Mishra <[hidden email]> wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

MetricMin25th percentileMedian75th percentileMax
Duration0.0 ms0.0 ms0.0 ms1.0 ms62.0 ms
GC Time0.0 ms0.0 ms0.0 ms0.0 ms11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor IDLogsAddressTask TimeTotal TasksFailed TasksKilled TasksSucceeded TasksBlacklisted
driver10.0.0.8:4915936 s2203002203false


Spark 3.0

Summary Metrics for 8 Completed Tasks

MetricMin25th percentileMedian75th percentileMax
Duration3.8 min4.0 min4.1 min4.4 min5.0 min
GC Time3 s3 s3 s4 s4 s
Input Size / Records15.6 MiB / 5102816.2 MiB / 5330316.8 MiB / 5525917.8 MiB / 5814820.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor IDLogsAddressTask TimeTotal TasksFailed TasksKilled TasksSucceeded TasksBlacklistedInput Size / Records
driver10.0.0.8:5022433 min8008false136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot 2020-06-27 16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27 16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

ArtemisDev
In reply to this post by Sanjeev Mishra

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot 2020-06-27 16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27 16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Sanjeev Mishra
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot 2020-06-27 16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27 16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Maxim Gekk
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk

Software Engineer

Databricks, Inc.



On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot 2020-06-27 16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27 16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Sanjeev Mishra

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk

Software Engineer

Databricks, Inc.



On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot 2020-06-27 16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27 16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

ArtemisDev

Could you please share your input file instead of output files on that ticket?   Not sure if you were following the specific file format requirement for JSON in Spark.  The following is a snippet from the Spark online doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk

Software Engineer

Databricks, Inc.



On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot 2020-06-27
                                        16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27
                                      16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Sanjeev Mishra
The tar file that I have attached has bunch of json.zip files and this is the file that is being processed. Each line is self contained JSON as shown below

zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}

On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <[hidden email]> wrote:

Could you please share your input file instead of output files on that ticket?   Not sure if you were following the specific file format requirement for JSON in Spark.  The following is a snippet from the Spark online doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk

Software Engineer

Databricks, Inc.



On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot 2020-06-27
                                        16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27
                                      16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

ArtemisDev

According to the spec, in addition to the line breaks, you should also put the nested object values in arrays instead of dictionaries.  You may want to give a try and see if this would give you a better performance.

Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  Hope the Databricks engineers will find an answer or bug fix soon.

-- ND

On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
The tar file that I have attached has bunch of json.zip files and this is the file that is being processed. Each line is self contained JSON as shown below

zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}

On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <[hidden email]> wrote:

Could you please share your input file instead of output files on that ticket?   Not sure if you were following the specific file format requirement for JSON in Spark.  The following is a snippet from the Spark online doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk

Software Engineer

Databricks, Inc.



On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot
                                                2020-06-27 16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27
                                              16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Gourav Sengupta
Hi Sanjeev,

can you share the exact code that you are using to read the JSON files? Currently I am getting only 11 records from the tar file that you have attached with JIRA.

Regards,
Gourav Sengupta

On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <[hidden email]> wrote:

According to the spec, in addition to the line breaks, you should also put the nested object values in arrays instead of dictionaries.  You may want to give a try and see if this would give you a better performance.

Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  Hope the Databricks engineers will find an answer or bug fix soon.

-- ND

On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
The tar file that I have attached has bunch of json.zip files and this is the file that is being processed. Each line is self contained JSON as shown below

zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}

On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <[hidden email]> wrote:

Could you please share your input file instead of output files on that ticket?   Not sure if you were following the specific file format requirement for JSON in Spark.  The following is a snippet from the Spark online doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk

Software Engineer

Databricks, Inc.



On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

Screenshot
                                                2020-06-27 16.30.26.png

Spark 3.0 DAG

Screenshot 2020-06-27
                                              16.32.32.png


Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Sanjeev Mishra
There is not much code, I am just using spark-shell and reading the data like so

spark.time(spark.read.json("/data/small-anon/"))

On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <[hidden email]> wrote:

Hi Sanjeev,

can you share the exact code that you are using to read the JSON files? Currently I am getting only 11 records from the tar file that you have attached with JIRA.

Regards,
Gourav Sengupta

On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <[hidden email]> wrote:

According to the spec, in addition to the line breaks, you should also put the nested object values in arrays instead of dictionaries.  You may want to give a try and see if this would give you a better performance.

Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  Hope the Databricks engineers will find an answer or bug fix soon.

-- ND

On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
The tar file that I have attached has bunch of json.zip files and this is the file that is being processed. Each line is self contained JSON as shown below

zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}

On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <[hidden email]> wrote:

Could you please share your input file instead of output files on that ticket?   Not sure if you were following the specific file format requirement for JSON in Spark.  The following is a snippet from the Spark online doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk
Software Engineer
Databricks, Inc.


On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

<Screenshot 2020-06-27 16.30.26.png>

Spark 3.0 DAG

<Screenshot 2020-06-27 16.32.32.png>



Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Gourav Sengupta
Hi Sanjeev,
that just gives 11 records from the sample that you have loaded to the JIRA tickets is it correct?


Regards,
Gourav Sengupta 

On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra <[hidden email]> wrote:
There is not much code, I am just using spark-shell and reading the data like so

spark.time(spark.read.json("/data/small-anon/"))

On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <[hidden email]> wrote:

Hi Sanjeev,

can you share the exact code that you are using to read the JSON files? Currently I am getting only 11 records from the tar file that you have attached with JIRA.

Regards,
Gourav Sengupta

On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <[hidden email]> wrote:

According to the spec, in addition to the line breaks, you should also put the nested object values in arrays instead of dictionaries.  You may want to give a try and see if this would give you a better performance.

Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  Hope the Databricks engineers will find an answer or bug fix soon.

-- ND

On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
The tar file that I have attached has bunch of json.zip files and this is the file that is being processed. Each line is self contained JSON as shown below

zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}

On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <[hidden email]> wrote:

Could you please share your input file instead of output files on that ticket?   Not sure if you were following the specific file format requirement for JSON in Spark.  The following is a snippet from the Spark online doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk
Software Engineer
Databricks, Inc.


On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

<Screenshot 2020-06-27 16.30.26.png>

Spark 3.0 DAG

<Screenshot 2020-06-27 16.32.32.png>



Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Sanjeev Mishra
There are total 11 files as part of tar. You will have to untar it to get to actual files (.json.gz)

No, I am getting

Count: 33447

spark.time(spark.read.json(“/data/small-anon/"))
Time taken: 431 ms
res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 more fields]

scala> res73.count()
res74: Long = 33447

ls -ltr
total 7592
-rw-r--r--  1 sanjeevmishra  staff  132413 Jun 29 08:40 part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272767 Jun 29 08:40 part-00009-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272314 Jun 29 08:40 part-00008-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  277158 Jun 29 08:40 part-00007-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  321451 Jun 29 08:40 part-00006-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  331419 Jun 29 08:40 part-00005-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  337195 Jun 29 08:40 part-00004-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  366346 Jun 29 08:40 part-00003-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  423154 Jun 29 08:40 part-00002-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  458187 Jun 29 08:40 part-00000-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  673836 Jun 29 08:40 part-00001-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff       0 Jun 29 08:40 _SUCCESS

On Jun 30, 2020, at 5:37 AM, Gourav Sengupta <[hidden email]> wrote:

Hi Sanjeev,
that just gives 11 records from the sample that you have loaded to the JIRA tickets is it correct?


Regards,
Gourav Sengupta 

On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra <[hidden email]> wrote:
There is not much code, I am just using spark-shell and reading the data like so

spark.time(spark.read.json("/data/small-anon/"))

On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <[hidden email]> wrote:

Hi Sanjeev,

can you share the exact code that you are using to read the JSON files? Currently I am getting only 11 records from the tar file that you have attached with JIRA.

Regards,
Gourav Sengupta

On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <[hidden email]> wrote:

According to the spec, in addition to the line breaks, you should also put the nested object values in arrays instead of dictionaries.  You may want to give a try and see if this would give you a better performance.

Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  Hope the Databricks engineers will find an answer or bug fix soon.

-- ND

On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
The tar file that I have attached has bunch of json.zip files and this is the file that is being processed. Each line is self contained JSON as shown below

zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}

On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <[hidden email]> wrote:

Could you please share your input file instead of output files on that ticket?   Not sure if you were following the specific file format requirement for JSON in Spark.  The following is a snippet from the Spark online doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk
Software Engineer
Databricks, Inc.


On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

<Screenshot 2020-06-27 16.30.26.png>

Spark 3.0 DAG

<Screenshot 2020-06-27 16.32.32.png>




Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Gourav Sengupta
Hi, Sanjeev,

I think that I did precisely that, can you please download my ipython notebook and have a look, and let me know where I am going wrong. its attached with the JIRA ticket.


Regards,
Gourav Sengupta 

On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra <[hidden email]> wrote:
There are total 11 files as part of tar. You will have to untar it to get to actual files (.json.gz)

No, I am getting

Count: 33447

spark.time(spark.read.json(“/data/small-anon/"))
Time taken: 431 ms
res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 more fields]

scala> res73.count()
res74: Long = 33447

ls -ltr
total 7592
-rw-r--r--  1 sanjeevmishra  staff  132413 Jun 29 08:40 part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272767 Jun 29 08:40 part-00009-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272314 Jun 29 08:40 part-00008-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  277158 Jun 29 08:40 part-00007-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  321451 Jun 29 08:40 part-00006-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  331419 Jun 29 08:40 part-00005-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  337195 Jun 29 08:40 part-00004-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  366346 Jun 29 08:40 part-00003-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  423154 Jun 29 08:40 part-00002-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  458187 Jun 29 08:40 part-00000-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  673836 Jun 29 08:40 part-00001-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff       0 Jun 29 08:40 _SUCCESS

On Jun 30, 2020, at 5:37 AM, Gourav Sengupta <[hidden email]> wrote:

Hi Sanjeev,
that just gives 11 records from the sample that you have loaded to the JIRA tickets is it correct?


Regards,
Gourav Sengupta 

On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra <[hidden email]> wrote:
There is not much code, I am just using spark-shell and reading the data like so

spark.time(spark.read.json("/data/small-anon/"))

On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <[hidden email]> wrote:

Hi Sanjeev,

can you share the exact code that you are using to read the JSON files? Currently I am getting only 11 records from the tar file that you have attached with JIRA.

Regards,
Gourav Sengupta

On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <[hidden email]> wrote:

According to the spec, in addition to the line breaks, you should also put the nested object values in arrays instead of dictionaries.  You may want to give a try and see if this would give you a better performance.

Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  Hope the Databricks engineers will find an answer or bug fix soon.

-- ND

On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
The tar file that I have attached has bunch of json.zip files and this is the file that is being processed. Each line is self contained JSON as shown below

zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}

On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <[hidden email]> wrote:

Could you please share your input file instead of output files on that ticket?   Not sure if you were following the specific file format requirement for JSON in Spark.  The following is a snippet from the Spark online doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk
Software Engineer
Databricks, Inc.


On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms
GC Time 0.0 ms 0.0 ms 0.0 ms 0.0 ms 11.0 ms
Showing 1 to 2 of 2 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted
driver
10.0.0.8:49159 36 s 2203 0 0 2203 false


Spark 3.0

Summary Metrics for 8 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 3.8 min 4.0 min 4.1 min 4.4 min 5.0 min
GC Time 3 s 3 s 3 s 4 s 4 s
Input Size / Records 15.6 MiB / 51028 16.2 MiB / 53303 16.8 MiB / 55259 17.8 MiB / 58148 20.2 MiB / 71624
Showing 1 to 3 of 3 entries
 

Aggregated Metrics by Executor

 

Executor ID Logs Address Task Time Total Tasks Failed Tasks Killed Tasks Succeeded Tasks Blacklisted Input Size / Records
driver
10.0.0.8:50224 33 min 8 0 0 8 false 136.1 MiB / 451999


The DAG is also different
Spark 2.0 DAG

<Screenshot 2020-06-27 16.30.26.png>

Spark 3.0 DAG

<Screenshot 2020-06-27 16.32.32.png>




Reply | Threaded
Open this post in threaded view
|

Re: Spark 3.0 almost 1000 times slower to read json than Spark 2.4

Sanjeev Mishra
Hi Gourav,

Please check the comments of the ticket, looks like the performance degradation is attributed to inferTimestamp option that is true by default (I have no idea why) in Spark 3.0. This forces Spark to scan entire text and so the poor performance.

Regards
Sanjeev

On Jun 30, 2020, at 8:12 AM, Gourav Sengupta <[hidden email]> wrote:

Hi, Sanjeev,

I think that I did precisely that, can you please download my ipython notebook and have a look, and let me know where I am going wrong. its attached with the JIRA ticket.


Regards,
Gourav Sengupta 

On Tue, Jun 30, 2020 at 1:42 PM Sanjeev Mishra <[hidden email]> wrote:
There are total 11 files as part of tar. You will have to untar it to get to actual files (.json.gz)

No, I am getting

Count: 33447

spark.time(spark.read.json(“/data/small-anon/"))
Time taken: 431 ms
res73: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 2 more fields]

scala> res73.count()
res74: Long = 33447

ls -ltr
total 7592
-rw-r--r--  1 sanjeevmishra  staff  132413 Jun 29 08:40 part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272767 Jun 29 08:40 part-00009-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  272314 Jun 29 08:40 part-00008-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  277158 Jun 29 08:40 part-00007-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  321451 Jun 29 08:40 part-00006-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  331419 Jun 29 08:40 part-00005-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  337195 Jun 29 08:40 part-00004-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  366346 Jun 29 08:40 part-00003-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  423154 Jun 29 08:40 part-00002-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  458187 Jun 29 08:40 part-00000-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff  673836 Jun 29 08:40 part-00001-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz
-rw-r--r--  1 sanjeevmishra  staff       0 Jun 29 08:40 _SUCCESS

On Jun 30, 2020, at 5:37 AM, Gourav Sengupta <[hidden email]> wrote:

Hi Sanjeev,
that just gives 11 records from the sample that you have loaded to the JIRA tickets is it correct?


Regards,
Gourav Sengupta 

On Tue, Jun 30, 2020 at 1:25 PM Sanjeev Mishra <[hidden email]> wrote:
There is not much code, I am just using spark-shell and reading the data like so

spark.time(spark.read.json("/data/small-anon/"))

On Jun 30, 2020, at 3:53 AM, Gourav Sengupta <[hidden email]> wrote:

Hi Sanjeev,

can you share the exact code that you are using to read the JSON files? Currently I am getting only 11 records from the tar file that you have attached with JIRA.

Regards,
Gourav Sengupta

On Mon, Jun 29, 2020 at 5:46 PM ArtemisDev <[hidden email]> wrote:

According to the spec, in addition to the line breaks, you should also put the nested object values in arrays instead of dictionaries.  You may want to give a try and see if this would give you a better performance.

Nevertheless, this still doesn't explain why Spark 2.4.6 outperforms 3.0.  Hope the Databricks engineers will find an answer or bug fix soon.

-- ND

On 6/29/20 12:27 PM, Sanjeev Mishra wrote:
The tar file that I have attached has bunch of json.zip files and this is the file that is being processed. Each line is self contained JSON as shown below

zcat < part-00010-2558524a-1a1f-4f14-b027-4276a8143194-c000.json.gz | head -3
{"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624062260}
{"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","Arris.NVG4xx.Missing.CA","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624060253}
{"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Arris.NVG4xx.Missing.CA","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":[{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}]},"ts":1590624061415}

On Mon, Jun 29, 2020 at 9:20 AM ArtemisDev <[hidden email]> wrote:

Could you please share your input file instead of output files on that ticket?   Not sure if you were following the specific file format requirement for JSON in Spark.  The following is a snippet from the Spark online doc:

Note that the file that is offered as a json file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the multiLine option to true.

-- ND

On 6/29/20 11:55 AM, Sanjeev Mishra wrote:

On Mon, Jun 29, 2020 at 8:21 AM Maxim Gekk <[hidden email]> wrote:
Hello Sanjeev,

It is hard to troubleshoot the issue without input files. Could you open an JIRA ticket at https://issues.apache.org/jira/projects/SPARK and attach the JSON files there (or samples or code which generates JSON files)? 

Maxim Gekk
Software Engineer
Databricks, Inc.


On Mon, Jun 29, 2020 at 6:12 PM Sanjeev Mishra <[hidden email]> wrote:
It has read everything. As you notice the timing of count is still smaller in Spark 2.4

Spark 2.4

scala> spark.time(spark.read.json("/data/20200528"))
Time taken: 19691 ms
res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res61.count())
Time taken: 7113 ms
res64: Long = 2605349

Spark 3.0
scala> spark.time(spark.read.json("/data/20200528"))
20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
Time taken: 849652 ms
res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]

scala> spark.time(res0.count())
Time taken: 8201 ms
res2: Long = 2605349




On Mon, Jun 29, 2020 at 7:45 AM ArtemisDev <[hidden email]> wrote:

Could you share your code?  Are you sure you Spark 2.4 cluster had indeed read anything?  Looks like the Input size field is empty under 2.4.

-- ND

On 6/27/20 7:58 PM, Sanjeev Mishra wrote:

I have large amount of json files that Spark can read in 36 seconds but Spark 3.0 takes almost 33 minutes to read the same. On closer analysis, looks like Spark 3.0 is choosing different DAG than Spark 2.0. Does anyone have any idea what is going on? Is there any configuration problem with Spark 3.0.

Here are the details:

Spark 2.4

Summary Metrics for 2203 Completed Tasks

Metric Min 25th percentile Median 75th percentile Max
Duration 0.0 ms 0.0 ms 0.0 ms 1.0 ms 62.0 ms