How to Dynamically Determine the Repartition Count for Loading Large CSV Files into PostgreSQL Using PySpark?

I need to load 5 million records from a CSV file into a PostgreSQL table as quickly as possible using PySpark. The speed of this operation is critical for me. I run my code sometimes from my PC and sometimes from a server, and I suspect the optimal repartition count differs between the two environments.

Here are my configurations:

PC Configuration:

  • RAM: 16 GB
  • Storage: 256 GB
  • CPU: 8 cores (6 cores available)

Server Configuration:

  • RAM: 64 GB
  • Storage: 2 TB
  • CPU: 32 cores

Tech Stack:

  • Backend: Django 5.0.7
  • Python: 3.11.9
  • Database: PostgreSQL 15
  • PySpark: 3.5.1

My Current Code:

    (df.repartition(400)
        .write
        .mode("overwrite")
        .format("csv")
        .option("header", "false")
        .save(filepath))

What I Need:

I want to dynamically determine the number of partitions based on the size of the data, the number of rows in the file, and the available CPU resources. The goal is to optimize the loading speed. Can someone suggest a dynamic function or strategy to calculate an appropriate repartition count for different environments?
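
To make my question more concrete, here is a rough sketch of the kind of helper I am imagining. The function name, the 128 MB and 500k-row targets, the "3 tasks per core" rule of thumb, and the file path are all placeholders I made up rather than values I have settled on:

    import math
    import os

    def estimate_partitions(file_size_bytes, row_count, available_cores,
                            target_bytes_per_partition=128 * 1024 * 1024,
                            target_rows_per_partition=500_000):
        # Enough partitions so each holds roughly 128 MB of data
        by_size = math.ceil(file_size_bytes / target_bytes_per_partition)
        # Enough partitions so each holds roughly 500k rows
        by_rows = math.ceil(row_count / target_rows_per_partition)
        # Rule of thumb: around 2-3 tasks per available core
        by_cores = available_cores * 3
        return max(1, by_size, by_rows, by_cores)

    # Placeholder usage: the path and core count differ per environment
    file_size = os.path.getsize("/path/to/records.csv")
    num_partitions = estimate_partitions(
        file_size_bytes=file_size,
        row_count=5_000_000,
        available_cores=os.cpu_count() or 6,
    )

    # Then, using df and filepath from the snippet above:
    (df.repartition(num_partitions)
        .write
        .mode("overwrite")
        .format("csv")
        .option("header", "false")
        .save(filepath))

What I don't know is whether these thresholds hold up for 5 million rows on both machines, or whether there is a more established way to derive the count.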

This seems kinda off-topic. Where exactly is the relevance to the Unity game engine? We don't usually discuss or handle server-side programming here, and your question is specific to technologies and languages that have no connection to game development in Unity.

I have never used Spark or PySpark, but the code snippet you shared seems to export data to CSV, while at the beginning you said you want to import CSV data into your database. Do you intend to export the data so it can be edited outside the DB and then re-imported? Questions about performance are also somewhat pointless here, as there is much more to it than the number of lines. Python is also not really the fastest approach to anything -.-