Generating StructType from dataframe.printSchema

Jeroen Miller
Hello Spark users,

Does anyone know if there is a way to generate the Scala code for a complex structure just from the output of dataframe.printSchema?

I have to analyse a significant volume of data and want to explicitly set the schema(s) to avoid having to read my (compressed) JSON files multiple times. What I am doing so far is to read a few files, print the schema, and manually write the code to define the corresponding StructType: tedious and error-prone.

I'm sure there is a much better way, but can't find anything about it.

Pointers anyone?

Jeroen


---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]

Re: Generating StructType from dataframe.printSchema

Silvio Fiorito
If you’re confident the schema of all files is consistent, then just infer the schema from a single file and reuse it when loading the whole data set:

val schema = spark.read.json("/path/to/single/file.json").schema

val wholeDataSet = spark.read.schema(schema).json("/path/to/whole/datasets")
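
If even the one-off inference on a sample file should not be repeated on every run, one possible extension (not discussed in this thread) is to save the inferred schema as JSON once and rebuild the StructType from that file later. The following is only a minimal sketch, assuming Spark SQL is on the classpath and the schema file lives on the local file system; the object name SchemaStore and its save/load helpers are hypothetical, while StructType.prettyJson and DataType.fromJson are the Spark calls it relies on.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper (not part of Spark): keeps a schema as JSON on the local file system.
object SchemaStore {

  // Serialise the schema to pretty-printed JSON and write it to `path`.
  def save(schema: StructType, path: String): Unit = {
    Files.write(Paths.get(path), schema.prettyJson.getBytes(UTF_8))
    ()
  }

  // Read the JSON back and rebuild the StructType; no data files are scanned.
  def load(path: String): StructType = {
    val json = new String(Files.readAllBytes(Paths.get(path)), UTF_8)
    DataType.fromJson(json) match {
      case struct: StructType => struct
      case other =>
        throw new IllegalArgumentException(s"Expected a struct schema, got ${other.typeName}")
    }
  }
}
```

With the schema value from above, this would be SchemaStore.save(schema, "/path/to/schema.json") once, and spark.read.schema(SchemaStore.load("/path/to/schema.json")).json("/path/to/whole/datasets") on later runs. The schema path is a placeholder. The saved JSON can also be reviewed or edited by hand, which may be less error-prone than writing the StructType code manually.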


Thanks,
Silvio

On 10/16/17, 10:20 AM, "Jeroen Miller" <[hidden email]> wrote:

    [...]

Re: Generating StructType from dataframe.printSchema

Jeroen Miller
On 16 Oct 2017, at 16:22, Silvio Fiorito <[hidden email]> wrote:
> [...] then just infer the schema from a single file and reuse it when loading the whole data set:

Well, that is a possibility indeed.

Thanks,

Jeroen

