I need to load 5 million records from a CSV file into a PostgreSQL table as quickly as possible using PySpark; loading speed is the critical requirement. I run the code either on my PC or on a server, and I suspect the optimal repartition count differs between the two environments.
Here are my configurations:
PC Configuration:
- RAM: 16 GB
- Storage: 256 GB
- CPU: 8 cores (6 cores available)
Server Configuration:
- RAM: 64 GB
- Storage: 2 TB
- CPU: 32 cores
Tech Stack:
- Backend: Django 5.0.7
- Python: 3.11.9
- Database: PostgreSQL 15
- PySpark: 3.5.1
# Current approach: hard-coded partition count, writing the data out as CSV
(
    df.repartition(400)
      .write
      .mode("overwrite")
      .format("csv")
      .save(filepath, header="false")
)
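For reference, writing straight into PostgreSQL over JDBC (instead of going through intermediate CSV files) would look roughly like the sketch below; the URL, table name, credentials, and batchsize are placeholder values, and the repartition count is exactly the number I want to compute dynamically.

```python
# Rough JDBC sketch with placeholder connection details. Each partition is written
# over its own connection, so the repartition count also sets the write parallelism
# against PostgreSQL; "batchsize" controls how many rows go per insert round trip.
(
    df.repartition(24)  # placeholder; this is the value I want to derive dynamically
      .write
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder URL
      .option("dbtable", "public.my_table")                    # placeholder table
      .option("user", "postgres")                              # placeholder credentials
      .option("password", "secret")
      .option("driver", "org.postgresql.Driver")
      .option("batchsize", 10000)
      .mode("append")
      .save()
)
```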
What I Need:
I want to dynamically determine the number of partitions based on the size of the data, the number of rows in the file, and the available CPU resources, so that the load runs as fast as possible. Can someone suggest a dynamic function or strategy to calculate an appropriate repartition count for different environments?
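To make the question concrete, here is the kind of heuristic I have in mind. It is a minimal sketch: the helper name `dynamic_partition_count`, the 128 MB target partition size, the 50,000-row minimum per partition, and the clamp to 1-3x the core count are my own guesses rather than anything Spark prescribes.

```python
import os

def dynamic_partition_count(file_size_bytes: int,
                            row_count: int,
                            cores: int | None = None,
                            target_mb: int = 128,
                            min_rows_per_partition: int = 50_000) -> int:
    """Heuristic repartition count from data size, row count, and CPU cores.

    The thresholds (128 MB per partition, 50k rows minimum per partition,
    clamp between 1x and 3x the core count) are tunable guesses, not Spark defaults.
    """
    cores = cores or os.cpu_count() or 4

    # Size-based estimate: aim for roughly `target_mb` of input per partition.
    by_size = max(1, file_size_bytes // (target_mb * 1024 * 1024))

    # Row-based cap: avoid partitions so small that per-task overhead dominates.
    by_rows = max(1, row_count // min_rows_per_partition)

    # Take the smaller of the two estimates, then clamp it between the core count
    # and three times the core count so every core stays busy without creating
    # thousands of tiny tasks.
    partitions = min(by_size, by_rows)
    return int(max(cores, min(partitions, cores * 3)))


# Example with a ~1.2 GB file containing 5 million rows:
print(dynamic_partition_count(1_200 * 1024 * 1024, 5_000_000, cores=32))  # server -> 32
print(dynamic_partition_count(1_200 * 1024 * 1024, 5_000_000, cores=6))   # PC     -> 9
```

Does clamping against the core count like this make sense, or is there a better-established rule for sizing partitions, especially for the 6-core PC versus the 32-core server?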