Merge multiple different s3 logs using pyspark 2.4.3

Merge multiple different s3 logs using pyspark 2.4.3

anbutech
Hello,

version = spark 2.4.3

I have JSON logs from 3 different sources, all with the same schema (same
column order) in the raw data. I want to add one new column, "src_category",
to each of the 3 sources to distinguish the source category, then merge all
3 sources into a single DataFrame and read the JSON data for processing.
What is the best way to handle this case?

df = spark.read.json(merged_3sourcesraw_data)

Input:

s3a://my-bucket/ingestion/source1/y=2019/m=12/d=12/logs1.json
s3a://my-bucket/ingestion/source2/y=2019/m=12/d=12/logs1.json
s3a://my-bucket/ingestion/source3/y=2019/m=12/d=12/logs1.json

output:
s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=other
s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows-new
s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/src_category=windows
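
For reference, one straightforward way to do this in PySpark is to read each source separately, tag it, and union the results. This is a minimal sketch; the source-to-category mapping below is an assumption, since the question does not say which source maps to which category:

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("merge-source-logs").getOrCreate()

base = "s3a://my-bucket/ingestion"
day = "y=2019/m=12/d=12"

# Hypothetical source-to-category mapping; adjust to the real sources.
categories = {"source1": "other", "source2": "windows-new", "source3": "windows"}

# Read each source, tag it with its category, and union the results.
# unionByName is safe here because all three sources share one schema.
dfs = [
    spark.read.json("{}/{}/{}/".format(base, src, day))
         .withColumn("src_category", lit(cat))
    for src, cat in categories.items()
]
merged = dfs[0]
for df in dfs[1:]:
    merged = merged.unionByName(df)

# Partitioning the write by src_category yields the .../src_category=... layout.
merged.write.partitionBy("src_category").json("{}/processed/{}/".format(base, day))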


Thanks





Re: Merge multiple different s3 logs using pyspark 2.4.3

Gourav Sengupta
why s3a?


Re: Merge multiple different s3 logs using pyspark 2.4.3

Shraddha Shah
Unless I am reading this wrong, this can be achieved with aws s3 sync:

aws s3 sync s3://my-bucket/ingestion/source1/y=2019/m=12/d=12 s3://my-bucket/ingestion/processed/src_category=other/y=2019/m=12/d=12
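
(The same command would be repeated for each source prefix with its matching src_category value. Note that this only encodes the category in the key path; it copies the raw objects as-is and does not add a src_category column inside the JSON records themselves.)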

Thanks,
-Shraddha




Re: Merge multiple different s3 logs using pyspark 2.4.3

Gourav Sengupta
Hi Shraddha,

What is interesting to me is that people do not even have the courtesy to write their name when they ask user groups for help :)

Your solution is spot on; there is another option available in Spark SQL for this, though.
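
One such option (a sketch of what may be meant here, not a confirmed recipe; the source-to-category mapping is again an assumption) is to read all three prefixes in one pass and derive the category from each row's file path with input_file_name():

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name, regexp_extract, when

spark = SparkSession.builder.getOrCreate()

# One read over all three source prefixes; the glob matches source1..source3.
df = spark.read.json("s3a://my-bucket/ingestion/source[1-3]/y=2019/m=12/d=12/")

# Pull "sourceN" out of the path each row was loaded from.
src = regexp_extract(input_file_name(), r"ingestion/(source\d+)/", 1)

# Hypothetical mapping, matching the output layout in the question.
df = df.withColumn(
    "src_category",
    when(src == "source1", "other")
    .when(src == "source2", "windows-new")
    .otherwise("windows"),
)

df.write.partitionBy("src_category").json(
    "s3a://my-bucket/ingestion/processed/y=2019/m=12/d=12/")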


Regards,
Gourav Sengupta
