Whenever the threads vs. processes discussion comes up, everywhere you read, people will agree that threads are lighter and come with fewer overheads, and it is often much more preferable to use threads to parallelize your Python code rather than processes. However, almost everywhere, people will blame the GIL to be the spoilsport when it comes to Python, forcing them to consider processes over threads, especially for the computationally heavy processes.
For those who are unaware, Python’s GIL allows only one thread of a process to access Python’s interpreter. You’ll find countless debates on the internet, discussing the need for GIL. While evolution may give us a GIL-free Python in the future, till that time, we have to deal with the GIL.
The Supposed Anomaly
Now, my curiosity was ignited when I read that dask array and dask dataframes use a thread-based scheduler by default. Whereas dask-bags use a process-based scheduler.
If you aren’t aware of dask, it is a library for performing scalable analytics in Python. Very crudely speaking, it can be used to parallelize your python code. It has a low-level scheduler and high-level collections that mimic pure python collections and replicates the API of these collections to some extent. Thus, there is dask array to mimic NumPy array, dask dataframe to mimic Pandas dataframe. Dask bag deals with unordered collections of objects (a hybrid between a list and a set).
Dask array and dataframe using a threads-based scheduler by default seemed counter-intuitive. After all, when I’m performing operations on an array or a dataframe, I’m essentially performing some computation. I’m not just waiting idly for a network I/O event. Naturally, I’d expect Python’s GIL to be occupied. Why doesn’t dask dask use a process-based scheduler here then?
On the page explaining single machine scheduler in dask’s documentation, they say the following:
This option (threads) is good for numeric code that releases the GIL (like NumPy, Pandas, Scikit-Learn, Numba, …) because data is free to share. This is the default scheduler for
So this is where my curiosity began. What does NumPy releasing the GIL mean? Does it not need the python interpreter to perform its computation?
I went through several articles and forums on this topic. Finally, the answer was clear to me. Basically, a lot of NumPy computation happens using C-libraries. Therefore, such computations don’t require to use the Python interpreter. This is also a reason why several of the NumPy functions are so fast. In fact, if you see the NumPy source code, you will see an extensive C-codebase. An example is given here.
It should be noted that not all NumPy routines release the GIL. The ones that do are delimited by the NPY_BEGIN_THREADS and NPY_END_THREADS macros. You can search for these macros in the NumPy source code to see which all functions release the GIL.
NumPy keeps modifying functions such that more functions release the GIL with each release. So now it makes sense why dask uses a threads-based scheduler by default for dask arrays and dask dataframes. Dask bags generally deal with pure python objects like str or dict, and keep blocking the GIL. Therefore, it makes sense that they use a process-based scheduler by default.
- Scipy Cookbook – Parallel programming with numpy and scipy
- Stackoverflow – Why numpy calculations are not affected by GIL
- NumPy Source Code
- Stackoverflow: Numpy and Global Interpreter Lock