Open Source

pyfs-watcher

A high-performance Python package with a Rust core that provides fast, parallel filesystem operations. Features include parallel directory walking via jwalk, BLAKE3/SHA-256 hashing with memory-mapped I/O, a 3-stage file deduplication pipeline, bulk copy/move with progress tracking, and cross-platform real-time file watching. Published on PyPI with cross-platform wheels built via GitHub Actions.

Rust, Python, PyO3, Rayon, BLAKE3, GitHub Actions

The Problem

Python's standard library filesystem tools (os.walk, shutil, hashlib) are single-threaded and slow for large-scale operations. Deduplicating a 500GB photo library or watching thousands of files for changes requires performance that pure Python can't deliver.

The Approach

Built a Rust native extension using PyO3 that exposes high-performance filesystem operations to Python. Used jwalk for parallel directory traversal, rayon for data parallelism, BLAKE3 for fast hashing, and notify for cross-platform file watching. Implemented a 3-stage deduplication pipeline (size grouping → partial hash → full hash) that avoids unnecessary I/O: only files that still match after a cheaper stage are read by the next, more expensive one. Used memory-mapped I/O for hashing files over 128MB.
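To make the three stages concrete, here is a minimal pure-Python sketch of the same idea. It is not the package's Rust implementation: the names `find_duplicates`, `_regroup`, `_hash_file`, and the 4096-byte prefix size are assumptions chosen for illustration, and SHA-256 from the standard library stands in for BLAKE3.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

PARTIAL_BYTES = 4096  # assumed prefix size for the partial-hash stage


def _hash_file(path, limit=None):
    """SHA-256 of the whole file, or of only its first `limit` bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        if limit is not None:
            digest.update(fh.read(limit))
        else:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                digest.update(chunk)
    return digest.hexdigest()


def _regroup(groups, key_fn):
    """Split each candidate group by key_fn; drop groups with one member."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key_fn(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def find_duplicates(root):
    """Return sorted groups of paths under `root` with identical content."""
    files = [p for p in Path(root).rglob("*") if p.is_file()]
    # Stage 1: size grouping (metadata only, no file reads).
    groups = _regroup([files], lambda p: p.stat().st_size)
    # Stage 2: partial hash of the first PARTIAL_BYTES bytes.
    groups = _regroup(groups, lambda p: _hash_file(p, PARTIAL_BYTES))
    # Stage 3: full hash, only for files that survived stages 1 and 2.
    groups = _regroup(groups, _hash_file)
    return sorted(sorted(g) for g in groups)
```

Calling `find_duplicates("some/dir")` returns a list of groups, each a list of `Path` objects with matching content. A file with a unique size is never opened, and a file whose prefix differs from every same-sized peer is never read in full, which is where the I/O saving comes from.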

Results

  • Published on PyPI with cross-platform wheels for Linux, macOS, and Windows
  • 3-stage dedup pipeline processes 50K files (120GB) in 8 seconds, versus 47 seconds for a naive approach
  • Clean Python API with typed exceptions bridging Rust errors
  • Full CI/CD with GitHub Actions: linting, type checking, testing, and automated publishing

Lessons Learned

  • PyO3's GIL management is crucial — long Rust operations must release the GIL with py.allow_threads()
  • Cross-platform filesystem behavior is wild — Windows UNC paths, macOS case-insensitivity, and Linux inotify limits all need handling
  • Memory-mapped I/O is not free: below 128MB its overhead outweighs the benefit, and sequential reads are faster
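
The size cutoff from the last lesson can be sketched in Python as a hash function that switches strategy by file size. This is an illustration only, not the package's code: `hash_file`, `MMAP_THRESHOLD`, and the 1MB chunk size are hypothetical names and values, and SHA-256 is used because BLAKE3 is not in the standard library.

```python
import hashlib
import mmap
import os

MMAP_THRESHOLD = 128 * 1024 * 1024  # the 128MB cutoff mentioned above


def hash_file(path, threshold=MMAP_THRESHOLD, chunk_size=1024 * 1024):
    """Hash a file, memory-mapping it only when it is larger than threshold."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        size = os.fstat(fh.fileno()).st_size
        if size > threshold and size > 0:
            # Large file: map it read-only and hash the mapping directly.
            # (An empty file cannot be mapped, hence the size > 0 guard.)
            with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as view:
                digest.update(view)
        else:
            # Small file: plain sequential reads in fixed-size chunks.
            for chunk in iter(lambda: fh.read(chunk_size), b""):
                digest.update(chunk)
    return digest.hexdigest()
```

Both branches produce the same digest for the same bytes, so the threshold is purely a performance knob. The right value likely depends on the platform and storage, so it is worth benchmarking rather than treating 128MB as universal.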
pyfs-watcher — Case Study | Pratyush Sharma Portfolio