
[Spark] Small files in Hadoop too slow

승가비 2020. 6. 26. 01:08

http://cloudsqale.com/2019/12/30/spark-slow-load-into-partitioned-hive-table-on-s3-direct-writes-output-committer-algorithms/

 

Spark – Slow Load Into Partitioned Hive Table on S3 – Direct Writes, Output Committer Algorithms – Large-Scale Data Engine

I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads them into a daily partition of a Hive table. This is a typical job in a data lake; it is quite simple, but in my case it was very slow. Initially it to…

cloudsqale.com
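
The committer detail in that post lends itself to a short illustration. Below is a minimal PySpark sketch of the standard Spark/Hadoop settings the article discusses: file output committer algorithm version 2 moves task output to its final location at task commit rather than renaming everything again at job commit, which avoids a second slow copy pass on S3. The paths, column names, and partition value are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("parquet-daily-load")
    # Algorithm v2: task output is committed directly to the destination,
    # skipping the extra job-commit rename that is expensive on S3.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # Overwrite only the partitions actually written, not the whole table.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

# Hypothetical input: compressed tab-separated text for one day.
raw = spark.read.text("s3://bucket/incoming/2020-06-25/")
parsed = raw.select(
    F.split("value", "\t").getItem(0).alias("event_id"),
    F.split("value", "\t").getItem(1).alias("payload"),
    F.lit("2020-06-25").alias("dt"),  # daily partition column
)

parsed.write.mode("overwrite").partitionBy("dt").parquet(
    "s3://bucket/warehouse/events/"
)
```

Note that v2 trades atomicity for speed: if a job fails mid-commit, partial task output may already be visible in the destination.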

https://medium.com/arabamlabs/small-files-in-hadoop-88708e2f6a46

 

Small files in Hadoop

Problem

medium.com
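
To put numbers on the "problem": every file, directory, and block in HDFS is held as an object in NameNode heap memory, each commonly estimated at around 150 bytes. A back-of-the-envelope sketch (the 150-byte figure is a rule of thumb, not an exact size):

```python
BYTES_PER_NAMENODE_OBJECT = 150  # rough rule of thumb per inode/block object

def namenode_heap_estimate(num_files: int, blocks_per_file: int = 1) -> int:
    """Rough NameNode heap usage in bytes: one inode object per file
    plus one object per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_NAMENODE_OBJECT

# 100 million small files (one block each) vs. the same data compacted
# into 1 million files of 100 blocks each.
small = namenode_heap_estimate(100_000_000)
compacted = namenode_heap_estimate(1_000_000, blocks_per_file=100)
print(f"small: {small / 2**30:.1f} GiB, compacted: {compacted / 2**30:.1f} GiB")
```

The metadata footprint drops roughly in proportion to the file count, which is why compaction matters even though the total data volume is unchanged.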

https://vanducng.dev/2020/12/05/Compact-multiple-small-files-on-HDFS/

 

Compact multiple small files on HDFS

Hadoop can handle very large files, but it runs into performance issues with too many small files. The reason is explained in detail here. In short, every single on a data n…

vanducng.dev
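
A minimal compaction sketch in PySpark, in the spirit of that post: read the small files and rewrite them as a handful of larger ones. The paths and target file count here are hypothetical; in practice the count is derived from total input size divided by the desired file size (e.g. the 128 MB HDFS block size).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

src = "hdfs:///data/events/dt=2020-06-25"            # many small files
dst = "hdfs:///data/events_compacted/dt=2020-06-25"  # few large files

df = spark.read.parquet(src)

# coalesce() merges partitions without a shuffle; prefer repartition()
# when the inputs are skewed and an even spread is worth the shuffle cost.
df.coalesce(8).write.mode("overwrite").parquet(dst)
```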

 
