streaming pdf

Nicolas Paris-2
Hi

I have PDFs to load into Spark, at least as <filename, byte_array>
pairs. I have considered some options:

- Spark Streaming does not provide a native file stream for binary
  files of variable size (binaryRecordsStream requires a constant
  record size), so I would have to write my own receiver.

- Structured Streaming can process Avro/Parquet/ORC files containing
  PDFs, but this makes things more complicated than monitoring a
  simple folder containing PDFs.

- Kafka is not designed to handle messages > 100KB, so it is not a
  good option for the stream pipeline.

Does anybody have a suggestion?

Thanks,

--
nicolas

---------------------------------------------------------------------
To unsubscribe e-mail: [hidden email]
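
The fixed record size mentioned in the first option of the message above can be illustrated outside Spark. The following is a minimal plain-Python sketch, not Spark code: `fixed_length_records` is a hypothetical helper, the byte strings are stand-ins for real PDFs, and dropping the trailing partial record is an assumption of the sketch rather than documented Spark behaviour. It only shows why a constant record length does not map onto files of different sizes.

```python
from typing import Iterator


def fixed_length_records(data: bytes, record_length: int) -> Iterator[bytes]:
    """Yield consecutive chunks of exactly record_length bytes.

    Trailing bytes that do not fill a whole record are dropped; this is an
    assumption of the sketch, not a statement about Spark.
    """
    if record_length <= 0:
        raise ValueError("record_length must be positive")
    for start in range(0, len(data) - record_length + 1, record_length):
        yield data[start:start + record_length]


# Two stand-in "PDFs" of different sizes (hypothetical content).
small_pdf = b"%PDF-1.4 tiny"
large_pdf = b"%PDF-1.4 " + b"x" * 40

# With one constant record length, neither file comes back as a single
# whole-file record: each is cut into 8-byte pieces.
small_records = list(fixed_length_records(small_pdf, 8))
large_records = list(fixed_length_records(large_pdf, 8))
```

Whatever record length is chosen, it can match at most one file size, which is why the <filename, byte_array> shape needs a reader that treats each file as one record.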

Re: streaming pdf

Jörn Franke
Why does it have to be a stream?

> On Nov 18, 2018, at 23:29, Nicolas Paris <[hidden email]> wrote:
> [...]

Re: streaming pdf

Nicolas Paris-2
On Mon, Nov 19, 2018 at 07:23:10AM +0100, Jörn Franke wrote:
> Why does it have to be a stream?
>

Right now I manage the pipelines as Spark batch jobs. Moving to
streaming would bring some improvements, such as:
- simplification of the pipeline
- more frequent data ingestion
- better resource management (?)


--
nicolas

Re: streaming pdf

Jörn Franke
Well, I am not so sure about the use cases, but what about using StreamingContext.fileStream?


> On Nov 19, 2018, at 09:22, Nicolas Paris <[hidden email]> wrote:
> [...]

Re: streaming pdf

Jörn Franke
You would also have to write your own input format, but that is not very complicated (and is probably recommended anyway for the PDF case).
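
For readers wondering what such an input format would need to deliver: one record per whole file, keyed by filename, with the file never split. The plain-Python sketch below only illustrates that record shape and the idea of picking up files that are new since the last poll. It is not Spark or Hadoop code, and `poll_new_pdfs`, the `seen` set, and the file names are hypothetical.

```python
import os
import tempfile
from typing import List, Set, Tuple


def poll_new_pdfs(folder: str, seen: Set[str]) -> List[Tuple[str, bytes]]:
    """Return (filename, content) for each PDF not returned by an earlier poll."""
    batch = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        is_pdf = name.lower().endswith(".pdf") and os.path.isfile(path)
        if is_pdf and name not in seen:
            with open(path, "rb") as handle:
                # The whole file becomes one record, whatever its size.
                batch.append((name, handle.read()))
            seen.add(name)
    return batch


# Usage with a temporary folder and stand-in file contents.
with tempfile.TemporaryDirectory() as folder:
    seen: Set[str] = set()
    with open(os.path.join(folder, "a.pdf"), "wb") as f:
        f.write(b"%PDF-1.4 first")
    first_batch = poll_new_pdfs(folder, seen)

    with open(os.path.join(folder, "b.pdf"), "wb") as f:
        f.write(b"%PDF-1.4 second, longer")
    second_batch = poll_new_pdfs(folder, seen)
```

Each poll returns only the files that arrived since the previous one, so a file is emitted once and always as a single <filename, byte_array> pair. A real pipeline would also have to deal with files that are still being written when the poll runs, which this sketch ignores.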

> On Nov 20, 2018, at 08:06, Jörn Franke <[hidden email]> wrote:
>
> Well, I am not so sure about the use cases, but what about using StreamingContext.fileStream?