Deduplicating Btrfs

1. Intro

Btrfs supports deduplication. According to the Btrfs docs:

Deduplication is the process of looking up identical data blocks tracked separately and creating a shared logical link while removing one of the copies of the data blocks. This leads to data space savings while it increases metadata consumption.

BTRFS provides the basic building blocks for deduplication allowing other tools to choose the strategy and scope of the deduplication.

So, to take advantage of deduplication in Btrfs, we have to use one of these external deduplication tools.

We will use bees (Best-Effort Extent-Same), a block-oriented userspace deduplication agent designed for large btrfs filesystems.

2. Installation

On a Debian 12 server we have to build it from source:

cd ~
git clone https://github.com/Zygo/bees
cd bees/
apt install -y build-essential btrfs-progs markdown
make
make install
which beesd
apt install -y uuid-runtime    # provides 'uuidparse', used by the beesd script

3. Configuration

For more details see the bees documentation in the source tree.

First we need to find out the UUIDs of the filesystems we want to run Bees on:

btrfs filesystem show
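If we want to script the next steps, the UUIDs can be pulled out of that output. A minimal sketch, assuming the usual 'uuid: ...' line format of btrfs filesystem show (the sample output here is made up for illustration):

```shell
# Hypothetical helper: print just the filesystem UUIDs from
# 'btrfs filesystem show' output.
extract_uuids() {
    sed -n 's/.*uuid: \([0-9a-f-]*\).*/\1/p'
}

# Made-up sample output; on a real system, pipe the actual command instead:
#   btrfs filesystem show | extract_uuids
sample="Label: none  uuid: 91f2d0de-6678-4e89-9b0d-9ab8bdc724f2
	Total devices 1 FS bytes used 400.00GiB"

echo "$sample" | extract_uuids
# prints: 91f2d0de-6678-4e89-9b0d-9ab8bdc724f2
```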

Then we should create a config file for each filesystem, like this:

cat <<EOF > /etc/bees/disk1.conf
UUID=91f2d0de-6678-4e89-9b0d-9ab8bdc724f2
OPTIONS="-P -v 6"
DB_SIZE=$((256*1024*1024))
EOF
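Creating these files can be scripted with a small helper. A hypothetical sketch; the directory and file names are illustrative, and as far as I can tell beesd matches a config by its UUID= line rather than by the file name:

```shell
# Hypothetical helper: write a bees config for one filesystem.
# Usage: write_bees_conf <dir> <name> <uuid>
write_bees_conf() {
    cat <<EOF > "$1/$2.conf"
UUID=$3
OPTIONS="-P -v 6"
DB_SIZE=$((256*1024*1024))
EOF
}

# Example with a made-up UUID, writing to /tmp instead of /etc/bees:
write_bees_conf /tmp disk1 91f2d0de-6678-4e89-9b0d-9ab8bdc724f2
cat /tmp/disk1.conf
```

Note that with an unquoted heredoc delimiter the $((256*1024*1024)) is expanded by the shell at write time, so the file ends up containing DB_SIZE=268435456.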

The sample config file /etc/bees/beesd.conf.sample also contains comments explaining the available options.

4. Running

We want to run bees as a systemd service, with one instance per filesystem; the instance name after the '@' is the UUID of the filesystem:

cp ~/bees/scripts/beesd@.service /lib/systemd/system/

systemctl enable --now beesd@fe0a1142-51ab-4181-b635-adbf9f4ea6e6.service
systemctl enable --now beesd@91f2d0de-6678-4e89-9b0d-9ab8bdc724f2.service

systemctl status 'bees*'
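With one config file per filesystem, enabling the services can also be scripted. A sketch, assuming the configs live in /etc/bees; the systemctl loop is shown commented out because it changes system state:

```shell
# Hypothetical helper: read the UUID= line out of a bees config file.
uuid_from_conf() {
    sed -n 's/^UUID=//p' "$1"
}

# On a live system (not run here), enable one instance per config:
#   for conf in /etc/bees/*.conf; do
#       systemctl enable --now "beesd@$(uuid_from_conf "$conf").service"
#   done
```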

After it has been running for a while (maybe a few hours or more), we will notice that the amount of used disk space has decreased:

btrfs filesystem show

The top command will also show that bees is working intensively.

Because I use LXD containers, mostly with the same operating system, and also use Docker (on the host and in the containers), there is a lot of opportunity for deduplication. I have noticed a great decrease in used disk space, down to about 20-25% of the original usage. For example, where 400GB of the disk was used, after deduplication it went down to about 100GB.
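As a quick sanity check of those figures:

```shell
# 400GB used before, ~100GB after: usage dropped to 25% of the
# original, i.e. roughly a 75% saving.
before=400 after=100
echo "remaining: $(( after * 100 / before ))%"            # prints: remaining: 25%
echo "saved: $(( (before - after) * 100 / before ))%"     # prints: saved: 75%
```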