
[Spark] Small files in Hadoop too slow

승가비 2020. 6. 26. 01:08

http://cloudsqale.com/2019/12/30/spark-slow-load-into-partitioned-hive-table-on-s3-direct-writes-output-committer-algorithms/

 

Spark – Slow Load Into Partitioned Hive Table on S3 – Direct Writes, Output Committer Algorithms – Large-Scale Data Engine

I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads them into a daily partition of a Hive table. This is a typical job in a data lake; it is quite simple, but in my case it was very slow. Initially it to…

cloudsqale.com
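
The committer detail in that post lends itself to a short illustration. Below is a minimal PySpark sketch of the standard Spark/Hadoop settings the article discusses: file output committer algorithm version 2 moves task output to its final location at task commit rather than renaming everything again at job commit, which avoids a second slow copy pass on S3. The paths, column names, and partition value are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("parquet-daily-load")
    # Algorithm v2: task output is committed directly to the destination,
    # skipping the extra job-commit rename that is expensive on S3.
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
    # Overwrite only the partitions actually written, not the whole table.
    .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
    .getOrCreate()
)

# Hypothetical input: compressed tab-separated text for one day.
raw = spark.read.text("s3://bucket/incoming/2020-06-25/")
parsed = raw.select(
    F.split("value", "\t").getItem(0).alias("event_id"),
    F.split("value", "\t").getItem(1).alias("payload"),
    F.lit("2020-06-25").alias("dt"),  # daily partition column
)

parsed.write.mode("overwrite").partitionBy("dt").parquet(
    "s3://bucket/warehouse/events/"
)
```

Note that v2 trades atomicity for speed: if a job fails mid-commit, partial task output may already be visible in the destination.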

https://medium.com/arabamlabs/small-files-in-hadoop-88708e2f6a46

 

Small files in Hadoop

Problem

medium.com
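
To put numbers on the "problem": every file, directory, and block in HDFS is held as an object in NameNode heap memory, each commonly estimated at around 150 bytes. A back-of-the-envelope sketch (the 150-byte figure is a rule of thumb, not an exact size):

```python
BYTES_PER_NAMENODE_OBJECT = 150  # rough rule of thumb per inode/block object

def namenode_heap_estimate(num_files: int, blocks_per_file: int = 1) -> int:
    """Rough NameNode heap usage in bytes: one inode object per file
    plus one object per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_NAMENODE_OBJECT

# 100 million small files (one block each) vs. the same data compacted
# into 1 million files of 100 blocks each.
small = namenode_heap_estimate(100_000_000)
compacted = namenode_heap_estimate(1_000_000, blocks_per_file=100)
print(f"small: {small / 2**30:.1f} GiB, compacted: {compacted / 2**30:.1f} GiB")
```

The metadata footprint drops roughly in proportion to the file count, which is why compaction matters even though the total data volume is unchanged.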

https://vanducng.dev/2020/12/05/Compact-multiple-small-files-on-HDFS/

 

Compact multiple small files on HDFS

Hadoop can handle very large files, but it runs into performance issues with too many small files. The reason is explained in detail here. In short, every single on a data n…

vanducng.dev
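
A minimal compaction sketch in PySpark, in the spirit of that post: read the small files and rewrite them as a handful of larger ones. The paths and target file count here are hypothetical; in practice the count is derived from total input size divided by the desired file size (e.g. the 128 MB HDFS block size).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

src = "hdfs:///data/events/dt=2020-06-25"            # many small files
dst = "hdfs:///data/events_compacted/dt=2020-06-25"  # few large files

df = spark.read.parquet(src)

# coalesce() merges partitions without a shuffle; prefer repartition()
# when the inputs are skewed and an even spread is worth the shuffle cost.
df.coalesce(8).write.mode("overwrite").parquet(dst)
```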

 
