Limitations of Pandas¶

We can use Pandas for data processing. It provides rich APIs to read data from different sources, process the data and then write it to different targets.

Pandas works well for light weight data processing.
Pandas is typically single threaded, which means only one process take care of processing the data.
As data volume grows, the processing time might grow exponentially and also run into resource contention.
It is not trivial to use distributed processing using Pandas APIs. We will end up struggling with multi threading rather than business logic.
There are Distributed Computing Frameworks such as Hadoop Map Reduce, Spark etc to take care of data processing at scale on multi node Hadoop or Spark Clusters.
Both Hadoop Map Reduce and Spark comes with Distributed Computing Frameworks as well as APIs.

Pandas is typically used for light weight Data Processing and Spark is used for Data Processing at Scale.

Mastering Pyspark

Limitations of Pandas¶