[spark] sql repartition (by query hint) `SELECT /*+ COALESCE(1) */` is better than `SELECT /*+ REPARTITION(1) */`

승가비 2022. 9. 22. 22:08

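fun insertOverwrite(target: String, db: DB, table: Table, partition: String, columns: List<String>) {
    // COALESCE(1) merges the result into a single partition without a full shuffle;
    // REPARTITION(1) also yields one partition but forces a full shuffle first.
    execute(
        """
        INSERT OVERWRITE TABLE ${db.name}.${table.name}
        PARTITION ($partition)
        SELECT /*+ COALESCE(1) */ ${columns.concat(ConstUtil.COMMA)}
        FROM $target
        """.trimIndent()
    )
}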
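
To compare the two hints side by side, the statement can be built as a plain string with the hint passed in as a parameter. This is only a minimal sketch: `PartitionHint`, `buildInsertOverwrite`, and the sample table and column names are hypothetical, `db` and `table` are simplified to strings, and `joinToString` stands in for the project's own `concat` helper. It does not run anything against Spark; it only prints the SQL text.

```kotlin
// Hypothetical helper for illustration only; none of these names come from the original project.
enum class PartitionHint { COALESCE, REPARTITION }

fun buildInsertOverwrite(
    target: String,
    db: String,
    table: String,
    partition: String,
    columns: List<String>,
    hint: PartitionHint = PartitionHint.COALESCE,
    numPartitions: Int = 1,
): String {
    require(numPartitions > 0) { "numPartitions must be positive" }
    require(columns.isNotEmpty()) { "columns must not be empty" }
    // The hint goes directly after SELECT, inside a /*+ ... */ comment.
    return """
        INSERT OVERWRITE TABLE $db.$table
        PARTITION ($partition)
        SELECT /*+ ${hint.name}($numPartitions) */ ${columns.joinToString(", ")}
        FROM $target
    """.trimIndent()
}

fun main() {
    val columns = listOf("id", "name", "dt")
    // Default: COALESCE(1), which narrows to one partition without a full shuffle.
    println(buildInsertOverwrite("staging_view", "analytics", "events", "dt", columns))
    println()
    // Same statement with REPARTITION(1), which shuffles before writing.
    println(buildInsertOverwrite("staging_view", "analytics", "events", "dt", columns, PartitionHint.REPARTITION))
}
```

Keeping the hint as a parameter makes it easy to fall back to `REPARTITION` for queries where collapsing to one partition without a shuffle would also squeeze the upstream stage into a single task.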

https://stackoverflow.com/questions/46932373/how-to-consolidate-results-of-a-spark-sql-query-to-avoid-lots-of-small-files-a

How to consolidate results of a spark SQL query to avoid lots of small files / avoid empty files

Context: In our data pipeline, we use spark SQL to run lots of queries that are supplied from our end users as text files that we then parameterise. Situation: Our queries look like: INSERT OVE...

https://kontext.tech/article/1155/use-spark-sql-partitioning-hints

Use Spark SQL Partitioning Hints

In Spark or PySpark, we can use coalesce and repartition functions to change the partitions of a DataFrame. In article Spark repartition vs. coalesce , I summarized the key differences between these two. If we are using Spark SQL directly, how do we repa

https://github.com/dhkdn9192/data_engineer_should_know/blob/master/interview/hadoop/difference_between_repartition_and_coalesce_in_spark.md

GitHub - dhkdn9192/data_engineer_should_know: Things a data engineer should know

Things a data engineer should know. Contribute to dhkdn9192/data_engineer_should_know development by creating an account on GitHub.

License: Attribution-NonCommercial