Reading TB of JSON file



Chetan Khatri
Hi Spark Users,

I have a 50 GB JSON file that I would like to read and persist to HDFS so it can be used in the next transformation. I am trying to read it with spark.read.json(path), but this gives an out-of-memory error on the driver. Obviously, I can't afford 50 GB of driver memory. In general, what is the best practice for reading a large JSON file like this?
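For reference, a minimal sketch of the flow just described (the paths are hypothetical); this is the read that currently fails:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the large JSON file and persist it to HDFS for the next transformation.
# This is the step that runs out of memory on the driver.
df = spark.read.json("/data/big.json")                      # hypothetical input path
df.write.mode("overwrite").json("hdfs:///data/persisted/")  # hypothetical HDFS target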

Thanks

Re: Reading TB of JSON file

Patrick McCarthy-2
Assuming that the file can be easily split, I would divide it into a number of pieces and move those pieces to HDFS before using Spark at all, using `hdfs dfs` or similar. At that point you can use your executors to perform the reading instead of the driver.
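A minimal PySpark sketch of that approach, assuming the file is line-delimited so it can be split on line boundaries; the split command, paths, and directory names are hypothetical:

from pyspark.sql import SparkSession

# Assumes the pieces were produced and uploaded beforehand, e.g. with something like
#   split -l 1000000 big.json part_            (split on line boundaries)
#   hdfs dfs -put part_* /data/json_parts/     (hypothetical HDFS directory)
spark = SparkSession.builder.getOrCreate()

# Reading the whole directory lets each executor pick up its own pieces,
# so no single process has to hold all 50 GB.
df = spark.read.json("hdfs:///data/json_parts/")
df.write.mode("overwrite").json("hdfs:///data/json_persisted/")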



--

Patrick McCarthy 

Senior Data Scientist, Machine Learning Engineering

Dstillery

470 Park Ave South, 17th Floor, NYC 10016


Re: Reading TB of JSON file

Jörn Franke
In reply to this post by Chetan Khatri
It depends on the data types you use.

Is it in JSON Lines format? Then the amount of memory matters much less.

Otherwise, if it is one large object or array, I would not recommend it.
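As a rough illustration (hypothetical paths): the default spark.read.json expects JSON Lines and is splittable across executors, while a single multi-line object or array needs the multiLine option, in which case each file is read as a whole:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# JSON Lines (one object per line): the default, splittable across executors.
df_lines = spark.read.json("hdfs:///data/events.jsonl")

# One large multi-line object or array: needs multiLine=true, and each file is
# then read in one piece, which is what hurts with a 50 GB file.
df_multi = spark.read.option("multiLine", "true").json("hdfs:///data/big.json")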



Re: Reading TB of JSON file

nihed mbarek
In reply to this post by Chetan Khatri
Hi, 

What is the size of one JSON document?

There is also the scan of your JSON to infer the schema; the overhead can be huge.
Two solutions:
define a schema and use it directly during the load, or ask Spark to analyse only a small part of the JSON file (I don't remember how to do it).
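A sketch of both options (field names and paths are hypothetical); the sampling variant uses the samplingRatio option of the JSON reader:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Option 1: supply the schema up front so Spark skips the inference scan.
schema = StructType([
    StructField("id", StringType()),        # hypothetical fields
    StructField("amount", DoubleType()),
])
df = spark.read.schema(schema).json("hdfs:///data/events.jsonl")

# Option 2: infer the schema from only a fraction of the records.
df_sampled = spark.read.option("samplingRatio", 0.01).json("hdfs:///data/events.jsonl")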

Regards, 




--

M'BAREK Med Nihed,
Fedora Ambassador, TUNISIA, Northern Africa
http://www.nihed.com




Re: Reading TB of JSON file

Chetan Khatri
In reply to this post by Patrick McCarthy-2
The file is available in an S3 bucket.




Re: Reading TB of JSON file

Chetan Khatri
In reply to this post by Jörn Franke
It is dynamically generated and written to an S3 bucket, not historical data, so I guess it isn't in JSON Lines format.


Re: Reading TB of JSON file

Gourav Sengupta
Hi,
So you have a single JSON record spanning multiple lines?
And is all 50 GB in one file?

Regards,
Gourav


Re: Reading TB of JSON file

Stephan Wehner
In reply to this post by Chetan Khatri
It's an interesting problem. What is the structure of the file? One big array? One hash with many key-value pairs?

Stephan



--
Stephan Wehner, Ph.D.
The Buckmaster Institute, Inc.
2150 Adanac Street
Vancouver BC V5L 2E7
Canada
Cell (604) 767-7415
Fax (888) 808-4655

Sign up for our free email course
http://buckmaster.ca/small_business_website_mistakes.html

http://www.buckmaster.ca
http://answer4img.com
http://loggingit.com
http://clocklist.com
http://stephansmap.org
http://benchology.com
http://www.trafficlife.com
http://stephan.sugarmotor.org (Personal Blog)
@stephanwehner (Personal Account)
VA7WSK (Personal call sign)

Re: Reading TB of JSON file

Chetan Khatri
In reply to this post by Gourav Sengupta
Yes


Re: Reading TB of JSON file

Chetan Khatri
In reply to this post by Stephan Wehner
All transactions are in JSON; it is not a single array.


Re: Reading TB of JSON file

Jörn Franke
Make every JSON object a line and then read it as JSON Lines, not as multiline.
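One possible way to do that conversion, sketched outside Spark and assuming the file is one large top-level array; the ijson streaming parser and the paths are assumptions, not something from this thread:

import json
import ijson  # streaming JSON parser (an assumption; any streaming parser would do)

# Turn one big JSON array into JSON Lines without loading all 50 GB at once:
# stream the top-level elements and write one object per line.
with open("big.json", "rb") as src, open("big.jsonl", "w") as dst:
    for obj in ijson.items(src, "item"):                  # "item" = each array element
        dst.write(json.dumps(obj, default=float) + "\n")  # default=float coerces any Decimal values

After that, spark.read.json on the converted file (or a directory of converted pieces) is splittable, and the executors do the reading.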


Re: Reading TB of JSON file

Chetan Khatri
Thanks. Did you mean in a for loop? Could you please put pseudocode in Spark?
