Deduplicating Data With XFS And Reflinks
Copy-on-Write filesystems have the nice property that files can be "cloned" instantly: the new file refers to the blocks of the old one, and a block is copied only when it is changed. This saves both time and space, and can be very beneficial in many situations (for example when working with big files). In Linux this type of copy is called a "reflink". We will see how to use it on the XFS filesystem.
1 Create a virtual block device
Linux supports a special block device called the loop device, which maps a normal file onto a virtual block device. This allows for the file to be used as a "virtual file system". Let's create such a file:
$ dd if=/dev/zero of=disk.img bs=100M count=10
10+0 records in
10+0 records out
1048576000 bytes (1,0 GB, 1000 MiB) copied, 10,8156 s, 97,0 MB/s
The size of this file is about 1 GB (1000 MiB):
$ du -hs disk.img
1001M	disk.img
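The byte count reported by dd is simply bs times count. As a side note, here is a minimal sketch of an alternative that is not part of the steps above: assuming GNU coreutils, truncate can create a sparse file of the same apparent size without writing any zeros. The scratch directory and the variable names are only for illustration.

```shell
# Size written by dd above: bs (100 MiB) times count (10).
size_bytes=$((100 * 1024 * 1024 * 10))

# Alternative to dd (an assumption, not one of the original steps):
# create a sparse file of the same apparent size in a scratch directory.
workdir=$(mktemp -d)
truncate -s "$size_bytes" "$workdir/disk.img"

# Apparent size in bytes, then the space actually allocated on disk.
apparent=$(stat -c %s "$workdir/disk.img")
echo "apparent size: $apparent bytes"
du -h "$workdir/disk.img"

# Remove the scratch directory again.
rm -rf "$workdir"
```

A sparse image only takes up real disk space as the filesystem inside it fills up, which may or may not be what you want for a test setup.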
Now let's create a loop device with this file:
$ sudo losetup -f disk.img
$ losetup -a
/dev/loop0: : (/home/user/disk.img)
The option -f finds an unused loop device, and losetup -a shows the name of the loop device that was created.
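If the device name is needed in a script, it can be picked out of the losetup -a listing. The helper below is a hypothetical sketch (the name loop_device_for is not from any tool). It assumes the listing format shown above, with the backing file in parentheses as the last field, and it will not work for paths that contain spaces.

```shell
# Print the loop device whose backing file is "$1".
# Reads `losetup -a` style lines on stdin, for example:
#   /dev/loop0: : (/home/user/disk.img)
loop_device_for() {
    awk -v f="($1)" '$NF == f { sub(/:$/, "", $1); print $1 }'
}

# Typical use (needs the loop device created above):
#   losetup -a | loop_device_for "$PWD/disk.img"
```

When setting up the device yourself, losetup --find --show disk.img prints the chosen device directly, which avoids the parsing altogether.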
2 Create an XFS filesystem with the 'reflink' flag
Let's create an XFS filesystem on it, with the flag reflink=1 and the label test:
$ mkfs.xfs -m reflink=1 -L test disk.img
meta-data=disk.img               isize=512    agcount=4, agsize=64000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=1
data     =                       bsize=4096   blocks=256000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=1850, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Now we can mount it on a directory:
$ mkdir mnt
$ sudo mount /dev/loop0 mnt
$ df -h mnt/
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      993M   40M  954M   4% /home/user/mnt
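Before relying on reflinks, it can be useful to confirm that the mounted filesystem really has the feature enabled. The sketch below assumes that xfs_info prints the same reflink=1 field as the mkfs.xfs output above. The helper name has_reflink is made up for this example.

```shell
# Succeeds if the text on stdin reports reflink=1,
# as the mkfs.xfs output above does.
has_reflink() {
    grep -q 'reflink=1'
}

# Typical use on the filesystem mounted in this section:
#   xfs_info mnt | has_reflink && echo "reflink enabled"
```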
3 Copy files with '--reflink'
For testing, let's make the mount point writable by our user and create a 100 MB file of random data:
$ sudo chown user: mnt
$ cd mnt/
$ dd if=/dev/urandom of=test bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0,445498 s, 235 MB/s
df -h . shows us 140M used:
$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      993M  140M  854M  15% /home/user/mnt
Let’s copy the file with reflinks enabled:
$ cp -v --reflink=always test test1
'test' -> 'test1'
$ ls -hsl
total 200M
100M -rw-rw-r-- 1 user user 100M Aug 30 10:44 test
100M -rw-rw-r-- 1 user user 100M Aug 30 10:49 test1
$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      993M  140M  854M  15% /home/user/mnt
So each copy of the file is 100M, but together they still occupy the same amount of disk space as before (140M). This shows the space-saving feature of reflinks. With a bigger file, we would also have noticed that the reflink copy takes no time at all: it is done instantly.
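With --reflink=always, cp fails on filesystems that cannot clone. GNU cp also accepts --reflink=auto, which clones where possible and otherwise falls back to a normal copy. The sketch below uses that mode in a scratch directory, so it runs on any filesystem, and checks the copy-on-write behaviour from the reader's point of view: changing the clone leaves the original untouched. The function name clone_or_copy and the scratch files are only for illustration.

```shell
# Clone when the filesystem supports reflinks,
# otherwise fall back to a normal copy.
clone_or_copy() {
    cp --reflink=auto -- "$1" "$2"
}

demo=$(mktemp -d)
printf 'original data\n' > "$demo/test"
clone_or_copy "$demo/test" "$demo/test1"

# Appending to the clone must not affect the original. On a reflink
# filesystem only the changed blocks get their own copy.
printf 'extra line\n' >> "$demo/test1"
original=$(cat "$demo/test")
clone_lines=$(wc -l < "$demo/test1")
echo "original: $original"
echo "clone has $clone_lines lines"

rm -rf "$demo"
```

Whether blocks were actually shared cannot be seen from this output. On the XFS mount above, the unchanged df figure is what shows it.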
4 Deduplicate existing files
If there were already normal (non-reflink) copies of the file, we could deduplicate them with a tool like duperemove. First, let's create two normal copies:
$ cp test test2
$ cp test test3
$ ls -hsl
total 400M
100M -rw-rw-r-- 1 user user 100M Aug 30 10:44 test
100M -rw-rw-r-- 1 user user 100M Aug 30 10:49 test1
100M -rw-rw-r-- 1 user user 100M Aug 30 11:03 test2
100M -rw-rw-r-- 1 user user 100M Aug 30 11:03 test3
$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      993M  340M  654M  35% /home/user/mnt
Now let's install and run duperemove:
$ sudo apt install duperemove
$ duperemove -hdr --hashfile=/tmp/test.hash .
$ df -h .
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      993M  198M  796M  20% /home/user/mnt
So, it has reduced the amount of disk space used from 340M to 198M.
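To get a feel for which files are candidates for deduplication, whole-file checksums are a rough first check. This is only an illustration of the idea and is not meant to describe how duperemove works internally. The sketch assumes GNU coreutils (sha256sum, and uniq with -w and -D), and the function name find_duplicates is made up.

```shell
# Print "checksum  path" for every regular file under "$1" whose
# SHA-256 checksum is shared with at least one other file.
find_duplicates() {
    find "$1" -type f -exec sha256sum {} + | sort | uniq -w64 -D
}

# Small demonstration in a scratch directory.
scratch=$(mktemp -d)
printf 'same\n' > "$scratch/a"
printf 'same\n' > "$scratch/b"
printf 'different\n' > "$scratch/c"

dupes=$(find_duplicates "$scratch")
echo "$dupes"

rm -rf "$scratch"
```

In the demonstration only a and b are listed, since c has different content. Files that share only part of their content are not detected by this check.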
5 Clean up
Unmount and delete the test directory:
$ cd ..
$ sudo umount mnt/
$ rmdir mnt/
Delete the loop device:
$ losetup -a
$ sudo losetup -d /dev/loop0
Remove the file that was used to create the loop device:
$ rm disk.img