Get process's progress when reading a file

Hi! Lately in my day job I've been using a Blender plugin to read PC2 animations, files of about 8 GB in size. In its current state the plugin blocks the whole Blender GUI and can take up to ~10 minutes to load a file. Sometimes I could work on something else in parallel, but one time, stuck waiting for a load to finish, I thought: "how difficult can it be to find out how far into the file the process is?"

Turns out, thanks to /proc, it is not very difficult to find out from outside a process how far into a given file that process has read.

A simple test

But first, let us prepare a minimal sample: a script that starts reading a large file and takes some time to do it. This script will create a sparse file named TEST_FILE and read it slowly (a sparse file is one where big gaps of zeroes are not actually stored, so it can take very little disk space even if its apparent size is much bigger):

#!/usr/bin/env python3

import os
import time
os.system("truncate -s 200k TEST_FILE") # Create a middle-sized, empty file

# Reading file, slowly
with open("TEST_FILE", "rb") as f:
    while True:
        content = f.read(1024)
        if len(content) == 0:
            break
        print(".", end="", flush=True) # Give some feedback on the progress
        time.sleep(1) # We are not in a hurry

os.unlink("TEST_FILE") # Cleanup after ourselves

Let's call it read-slowly.py. If we run it, it will take ~200 seconds to read the file (200 KB, read in 1 KB chunks at one per second).
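As a quick aside, we can check that such a file really is sparse by comparing its apparent size with the blocks it actually occupies on disk. A small standalone sketch (not part of read-slowly.py; whether holes take zero blocks depends on the filesystem):

```python
import os

# Create an empty file and extend it to 200 KB without writing any data,
# mirroring what `truncate -s 200k TEST_FILE` does
path = "TEST_FILE_SPARSE_DEMO"
with open(path, "wb"):
    pass
os.truncate(path, 200 * 1024)

st = os.stat(path)
print("Apparent size:", st.st_size)          # 204800 bytes
print("Bytes on disk:", st.st_blocks * 512)  # usually far less: the hole is not stored

os.unlink(path)  # Cleanup after ourselves
```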

File reading progress

So, what's the trick to know how much of the file the process has read?

Turns out, as you can see with man 5 proc, the /proc filesystem exposes the files opened by a process under /proc/[pid]/fd/, and the position the process is at within each of those files under /proc/[pid]/fdinfo/.
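To see what those entries look like, a process can peek at its own: here is a small sketch (assuming Linux's /proc) that opens a file, seeks into it, and prints the matching /proc/self/fdinfo entry:

```python
import os
import tempfile

# Open a file and seek into it, then inspect our own fdinfo entry
with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 100)
    tmp.flush()
    fd = tmp.fileno()
    os.lseek(fd, 42, os.SEEK_SET)  # move 42 bytes into the file

    # The entry reports the current offset ("pos"), plus flags, mnt_id, ...
    with open(f"/proc/self/fdinfo/{fd}") as f:
        print(f.read())
```

The first line of the output is `pos:	42`, exactly the offset we seeked to.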

The trick, then, is easy: find the file descriptor for that file under /proc/[pid]/fd and look up its position in /proc/[pid]/fdinfo. This script (let's call it read-fd-progress.sh) does both:

#!/usr/bin/env bash

# Check that we have at least a PID
if [ -z "$1" ];then
   echo "$0 <PID> [<FD>]"
   exit 1
fi

set -eu  # Bail on any error
# See: http://redsymbol.net/articles/unofficial-bash-strict-mode/

PID=$1 # The PID is taken as the first parameter
FD=${2:-} # Take FD if we have second parameter, else consider it empty

if [ -z "$FD" ];then
   # Show the user the available file descriptors
   echo "Select a file descriptor:"
   for i in "/proc/$PID/fd"/*;do
       printf "  %s: " "$(basename "$i")"
       readlink "$i"
   done

   read -p "FD: " FD
fi

FSIZE=$(stat -L "/proc/$PID/fd/$FD" --printf=%s)  # Read full file size
while true;do

      # Stop if the process has finished reading the file
      if [ ! -f "/proc/$PID/fdinfo/$FD" ];then
         break
      fi

      # Read position on file
      x=$(grep '^pos:' "/proc/$PID/fdinfo/$FD" | cut -d: -f2 | tr -d '\t ')

      # Convert that position into a % of the file size.
      #
      # This is not the interesting part, as it's just some hack to
      # have fixed-point like numbers in the shell. But the trick
      # is to have a per-10-thousands, and when printing split it
      # in the integer per-100 and the decimal per-100.
      # ... does that make sense?
      PER10000=$(( $x * 10000 / $FSIZE ))
      if [ $PER10000 -le 100 ];then
         # Less than 1%
         printf "  0.%02i%%\n" $(( $PER10000 ));
      else
         # More than 1%
         printf "%3i.%02i%%\n" $(( $PER10000 / 100 )) $(( $PER10000 % 100 ));
      fi

      # Wait for the next loop
      sleep 1
done

Now, if we run python3 read-slowly.py in one terminal and bash read-fd-progress.sh in another, we can find out the progress of the reading process:

~ - > bash read-fd-progress.sh `pgrep -f read-slowly`
Select a file descriptor:
  0: /dev/pts/4
  1: /dev/pts/4
  2: /dev/pts/4
  3: /home/kenkeiras/TEST_FILE
  325: /dev/urandom
FD: 3
  2.00%
  4.00%
  4.00%
  4.00%
  4.00%
  6.00%
...

As a final note: you might notice that while the Python script reads 0.5% of the file each second, the shell script only sees "batched" jumps of about 2% every 4 seconds. I suspect this is due to buffering inside Python's file objects: each read(1024) is served from an internal buffer that gets refilled in bigger chunks, so the underlying file descriptor's position moves in larger steps. It is noticeable here because of the small file size of this test.
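That buffering suspicion is easy to check: Python's buffered file objects track a logical position (f.tell()) that can lag behind the underlying file descriptor's position, which is what fdinfo reports. A small sketch (the exact readahead size is an implementation detail, so treat the numbers as indicative):

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"\0" * 200 * 1024)
    tmp.flush()

    with open(tmp.name, "rb") as f:  # buffered, like read-slowly.py
        f.read(1024)
        logical = f.tell()                           # what Python has consumed
        real = os.lseek(f.fileno(), 0, os.SEEK_CUR)  # what the kernel sees

    print("logical:", logical)  # 1024
    print("real:", real)        # a whole buffer's worth, e.g. 8192
```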

Hope you learned something!