Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available


Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

Tang Jinxin
Maybe the datanode stopped the data transfer due to a timeout. Could you please provide the exception stack?

xiaoxingstack
Email: xiaoxingstack@...


On 2020-04-22 19:53, [hidden email] wrote:

    Today I met the same problem using rdd.collect(); the RDD's element type is Tuple2[Int, Int]. The problem appears when the amount of data reaches about 100 GB.

    I guess there may be something wrong with deserialization. Has anyone else encountered this problem?

 

Best regards,

maqy

 

From: [hidden email]
Sent: 2020-04-20 10:33
To: [hidden email]
Subject: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

 

Hi all,

I get a Dataset[Row] through the following code:

 
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
 

After that I want to collect it to the driver:

 
val df_rows: Array[Row] = df.collect()
 

The Spark web UI shows that all tasks have run successfully, but the application does not stop. After more than ten minutes, an error appears in the shell:

 

java.io.EOFException: Premature EOF: no length prefix available

 

Environment:
    Spark 2.4.3
    Hadoop 2.7.7
    Data: about 800,000,000 rows, 12 GB in total
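As far as I understand, a collect of this size also has to fit under spark.driver.maxResultSize (1g by default) and in the driver heap, so a limit along these lines would presumably need to be raised; a minimal sketch with illustrative values, not my verified configuration:

import org.apache.spark.sql.SparkSession

// Sketch only: illustrative value for a ~12 GB collect.
// The driver heap itself must also be large enough to hold the
// deserialized Array[Row]; in client mode that is set before the JVM
// starts, e.g. spark-submit --driver-memory 32g.
val spark = SparkSession.builder()
  .appName("collect-test")
  .config("spark.driver.maxResultSize", "16g") // default is 1g
  .getOrCreate()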

   

    More detailed information can be seen here:

https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e

    Does anyone know the reason?

 

Best regards,

maqy

 

 


Re: Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

maqy1995@outlook.com

    Hi Jinxin,

The Spark web UI shows that all tasks completed successfully; this error appears in the shell:

java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStream$ResponseProcessor.run(DFSOutputStream.java:733)

More information can be seen here:

https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e

 

I suspect a problem with deserialization: after the web UI shows that the collect() tasks have completed, the memory used by the “spark-submit” process keeps increasing. After a few minutes the memory usage stops growing, and a few minutes after that the shell reports this error.
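If driver-side memory pressure during deserialization is indeed the problem, one thing worth trying (a sketch only, not something I have verified on this job) is to pull the rows one partition at a time instead of materializing the whole Array[Row] with collect():

import org.apache.spark.sql.Row

// Sketch: toLocalIterator() fetches one partition at a time to the driver,
// so only a single partition's rows are deserialized in memory at once.
val it: java.util.Iterator[Row] = df.toLocalIterator()
while (it.hasNext) {
  val row: Row = it.next()
  // process row here
}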

 

Best regards,

maqy

 

From: [hidden email]
Sent: 2020-04-22 23:16
To: [hidden email]
Cc: [hidden email]
Subject: Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

 

Maybe the datanode stopped the data transfer due to a timeout. Could you please provide the exception stack?

xiaoxingstack


On 2020-04-22 19:53, [hidden email] wrote:

    Today I met the same problem using rdd.collect(); the RDD's element type is Tuple2[Int, Int]. The problem appears when the amount of data reaches about 100 GB.

    I guess there may be something wrong with deserialization. Has anyone else encountered this problem?

 

Best regards,

maqy

 

From: [hidden email]
Sent: 2020-04-20 10:33
To: [hidden email]
Subject: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

 

Hi all,

I get a Dataset[Row] through the following code:

 
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
 

After that I want to collect it to the driver:

 
val df_rows: Array[Row] = df.collect()
 

The Spark web UI shows that all tasks have run successfully, but the application does not stop. After more than ten minutes, an error appears in the shell:

 

java.io.EOFException: Premature EOF: no length prefix available

 

Environment:
    Spark 2.4.3
    Hadoop 2.7.7
    Data: about 800,000,000 rows, 12 GB in total

   

    More detailed information can be seen here:

https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e

    Does anyone know the reason?

 

Best regards,

maqy

 

 

 


Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

Tang Jinxin
Hi maqy,
   The exception is caused by the connection being closed, and one possible reason is a timeout on the datanode side, given that we have not found any problem in Spark before the exception. So we could try to find more clues in the datanode log.
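   If the datanode log does point to a timeout, these are the client-side knobs that could be raised (a sketch only; the property names exist in Hadoop 2.7, but the values here are illustrative):

   // Sketch: raise the HDFS socket timeouts (milliseconds).
   // Hadoop 2.7 defaults: 60000 for dfs.client.socket-timeout and
   // 480000 for dfs.datanode.socket.write.timeout.
   val hadoopConf = spark.sparkContext.hadoopConfiguration
   hadoopConf.set("dfs.client.socket-timeout", "300000")
   hadoopConf.set("dfs.datanode.socket.write.timeout", "960000")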
   
   Best wishes,
   Jinxin

xiaoxingstack
Email: xiaoxingstack@...


On 2020-04-22 23:40, [hidden email] wrote:

    Hi Jinxin,

The Spark web UI shows that all tasks completed successfully; this error appears in the shell:

java.io.EOFException: Premature EOF: no length prefix available
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:244)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:244)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStream$ResponseProcessor.run(DFSOutputStream.java:733)

More information can be seen here:

https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e

 

I suspect a problem with deserialization: after the web UI shows that the collect() tasks have completed, the memory used by the “spark-submit” process keeps increasing. After a few minutes the memory usage stops growing, and a few minutes after that the shell reports this error.

 

Best regards,

maqy

 

From: [hidden email]
Sent: 2020-04-22 23:16
To: [hidden email]
Cc: [hidden email]
Subject: Re: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

 

Maybe the datanode stopped the data transfer due to a timeout. Could you please provide the exception stack?

xiaoxingstack


On 2020-04-22 19:53, [hidden email] wrote:

    Today I met the same problem using rdd.collect(); the RDD's element type is Tuple2[Int, Int]. The problem appears when the amount of data reaches about 100 GB.

    I guess there may be something wrong with deserialization. Has anyone else encountered this problem?

 

Best regards,

maqy

 

From: [hidden email]
Sent: 2020-04-20 10:33
To: [hidden email]
Subject: [Spark SQL] [Beginner] Dataset[Row] collect to driver throw java.io.EOFException: Premature EOF: no length prefix available

 

Hi all,

I get a Dataset[Row] through the following code:

 
val df: Dataset[Row] = spark.read.format("csv").schema(schema).load("hdfs://master:9000/mydata")
 

After that I want to collect it to the driver:

 
val df_rows: Array[Row] = df.collect()
 

The Spark web UI shows that all tasks have run successfully, but the application does not stop. After more than ten minutes, an error appears in the shell:

 

java.io.EOFException: Premature EOF: no length prefix available

 

Environment:
    Spark 2.4.3
    Hadoop 2.7.7
    Data: about 800,000,000 rows, 12 GB in total

   

    More detailed information can be seen here:

https://stackoverflow.com/questions/61202566/spark-sql-datasetrow-collect-to-driver-throw-java-io-eofexception-premature-e

    Does anyone know the reason?

 

Best regards,

maqy