[spark] SQL repartitioning by query hint: `SELECT /*+ COALESCE(1) */` is better than `SELECT /*+ REPARTITION(1) */`
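// Overwrites the given partition of db.table with the selected columns from `target`.
// The REPARTITION(1) hint forces the query result into a single partition before the write,
// which keeps the job from producing many small files.
fun insertOverwrite(target: String, db: DB, table: Table, partition: String, columns: List<String>) {
    execute(
        """
        INSERT OVERWRITE TABLE ${db.name}.${table.name}
        PARTITION ($partition)
        SELECT /*+ REPARTITION(1) */ ${columns.concat(ConstUtil.COMMA)}
        FROM $target
        """.trimIndent()
    )
}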
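The function above still uses `REPARTITION(1)`. As a minimal sketch of the `COALESCE(1)` variant the title recommends, the statement can be built with the hint as a parameter. The name `buildInsertOverwrite`, the `hint` parameter, and the sample table names are hypothetical. The standard `joinToString` stands in for the project-specific `concat(ConstUtil.COMMA)`, and the sketch only builds the SQL string, so it runs without Spark. The usual argument for `COALESCE` is that it merges existing partitions without a full shuffle, whereas `REPARTITION` always shuffles. One caveat: coalescing to a single partition can also pull the upstream stage down to one task, so it is worth measuring on a real query.

```kotlin
// Hypothetical helper (not from the original code): builds the INSERT OVERWRITE
// statement with the partitioning hint passed in, defaulting to COALESCE(1).
fun buildInsertOverwrite(
    source: String,          // table or view to read from
    qualifiedTable: String,  // target table as "db.table"
    partition: String,       // partition spec, e.g. "dt = '2024-01-01'"
    columns: List<String>,   // columns to select
    hint: String = "COALESCE(1)"
): String =
    """
    INSERT OVERWRITE TABLE $qualifiedTable
    PARTITION ($partition)
    SELECT /*+ $hint */ ${columns.joinToString(", ")}
    FROM $source
    """.trimIndent()

fun main() {
    // Default hint: COALESCE(1)
    println(
        buildInsertOverwrite(
            source = "staging.events",
            qualifiedTable = "warehouse.events",
            partition = "dt = '2024-01-01'",
            columns = listOf("id", "name")
        )
    )
    println()
    // Same statement with the shuffle-based hint, for comparison
    println(
        buildInsertOverwrite(
            source = "staging.events",
            qualifiedTable = "warehouse.events",
            partition = "dt = '2024-01-01'",
            columns = listOf("id", "name"),
            hint = "REPARTITION(1)"
        )
    )
}
```

Passing the resulting string to the existing `execute(...)` call would keep the rest of `insertOverwrite` unchanged; only the hint text differs between the two variants.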
How to consolidate results of a spark SQL query to avoid lots of small files / avoid empty files
Context: In our data pipeline, we use spark SQL to run lots of queries that are supplied from our end users as text files that we then parameterise. Situation: Our queries look like: INSERT OVE...
stackoverflow.com
https://kontext.tech/article/1155/use-spark-sql-partitioning-hints
Use Spark SQL Partitioning Hints
In Spark or PySpark, we can use coalesce and repartition functions to change the partitions of a DataFrame. In article Spark repartition vs. coalesce, I summarized the key differences between these two. If we are using Spark SQL directly, how do we repa
kontext.tech
GitHub - dhkdn9192/data_engineer_should_know: Things a data engineer should know
Things a data engineer should know. Contribute to dhkdn9192/data_engineer_should_know development by creating an account on GitHub.
github.com