This post covers a modified Breadth First Search (BFS) approach for performing fast tree traversal on a remote fileshare. It combines multi-threading and multi-processing to achieve a 90x speedup over a naive Depth First Search solution.
This post covers a modified Breadth First Search (BFS) approach for performing fast tree traversal on a remote fileshare. It combines multi-threading and multi-processing to achieve a 90x speedup over a naive Depth First Search solution. We created it to generate faster diffs for Kubric's sync system.
One of Kubric's products is a smart asset search & storage system. Using it, you can sync with other storage systems like Dropbox, Amazon S3, Azure Files, Google Drive etc.
A problem any sync solution faces is this: how do you incrementally sync between the source and destination when either of them doesn't provide change notifications? The most obvious solution is to periodically compute a diff between the source and destination and copy over only the diff. The ability to list all the files in the source and the destination is needed for computing a diff. This is problematic because many services (such as Azure Files) do not provide any way to get a complete listing of files in one go.
So, our task is this - given a directory tree with hundreds of thousands of files and folders, we want to get a list of absolute paths of all the files in the tree as fast as possible. As an added edge of complexity, the directory tree is an Azure fileshare and the only way to interact with it is by either mounting the drive or by using the Azure Files HTTP API.
Our first instinct was to mount the fileshare as a disk. As it turns out Azure Fileshares can be mounted with CIFS on a Linux box. This turned out to be quite terrible and slow. A single
cd command would take multiple seconds to run.
So we decided to try out the HTTP API with a naive Depth First Search. Start from the root and keep going down until you hit a leaf. Rinse and repeat. This was definitely faster than the mounted disk but it was still too slow.
So how did we do it?
We went with an approach which performs level order traversal. Let's say we have a method to get a list of files and directories inside a directory. Let's call it
get_files_and_dirs. At each level,
- we append child files of the current level to our overall list of files
- child directories of the current level are appended to a list (let's call it
- get_files_and_dirs is called with
next_level_dirsas the input via a
ThreadPoolExecutor(threads are great for this problem because it's I/O intensive)
We keep doing this while
next_level_dirs is not empty. Since we have multiple starting directories, it makes sense to have a process pool. In the pool, each process performs the steps described above for one starting directory.
This is the implementation -
We found that a thread-pool of size 10 worked the best after some tests on a GCP Compute Engine
n1-standard-2 instance (2 vCPUs / 7.50 GB RAM). We had around 500,000 files and 12,000 directories in the file share . These were the results of the tests -
|Processes||Threads||Final Time (seconds)|
|1||1||7258 (this is the naive approach)|
So we managed to go from 7258 seconds to 80 seconds ➡️ a 90x speedup 🙂. As I mentioned earlier, we implemented this for getting a list of files from Azure Files. However, you can use this approach with any storage product as long as you have an API to list files in a directory.
We are hiring! If you are interested in building smart systems around Scalable Search, Storage, Orchestration & Rendering drop us a mail at [email protected]