Rebalancing Files
Note: This is a new feature (as of 2.40). It may change or contain bugs. We always do the best we can to ensure MogileFS is stable, but be careful.
In 2.40+ MogileFS has a rewritten drain/rebalance system. You define a set of rules and it will select devices to pull files from, and devices to put files onto.
So if you're looking to retire some devices, or have added new hosts and wish to move some files onto them, this is how you do that.
When to Run Rebalance
Rebalancing isn't a required operation of mogile. However, it is good practice to pay attention to your device storage and make decisions on where your files should be. If you add three new empty hosts, it's often better to shuffle existing files onto them, freeing up space across all of your hosts to distribute new files (which may be accessed more frequently than older ones).
MogileFS replication also enjoys having lots of options to replicate files toward. So keep it happy :)
Rebalance Policies
The new rebalance/drain system works via policies. You define a string of options; the system evaluates them and decides which devices to use. As of this writing rebalance is under construction, so the options here may not be all of the options available.
To show the rebalance settings:
$ mogadm rebalance settings
rebal_policy = from_percent_used=95 to_percent_free=50 limit_type=device limit_by=size limit=5g fid_age=old
And to set them:
$ mogadm rebalance policy --options="from_hosts=3 to_percent_free=50"
When you first start a rebalance, mogile will discover and save the list of source devices. However, every few seconds it will re-evaluate the list of destination devices. This is both to avoid a ping-pong state and to always find the best possible candidates for a destination.
Policy Options
(and their defaults)
# source
from_hosts => [], # host ids (not names).
from_devices => [], # device ids.
from_percent_used => undef, # 0.nn * 100
from_percent_free => undef,
from_space_used => undef,
from_space_free => undef,
fid_age => 'old', # old|new
limit_type => 'device', # global|device
limit_by => 'none', # size|count|percent|none
limit => undef, # 100g|10%|5000
# target
to_hosts => [],
to_devices => [],
to_percent_used => undef,
to_percent_free => undef,
to_space_used => undef,
to_space_free => undef,
not_to_hosts => [],
not_to_devices => [],
use_dest_devs => 'all', # all|N (list up to N devices to rep pol)
leave_in_drain_mode => 0,
from_(percent|space)_*
: Pull fids from devices matching "at least this much": at least this much space used, at least this much percent free, and so on. The space options are expressed in megabytes.
to_(percent|space)_*
: Same as above, except used in limiting possible
destination devices.
(from|to)_hosts
: Given a comma-separated list of host ids, choose all devices on those hosts.
(from|to)_devices
: Directly specify which devices you want to pull from or drop files onto. Other options may further reduce this list (_percent_used, etc). Note: this is a comma-separated list of device ids, e.g. from_devices=199,201,233
not_to_*
: Filter out specific devices or hosts from the destination list.
fid_age
: Defines whether rebalance will choose “old” (ascending numbered)
fids or “new” (descending numbered) fids from the device first. Since MogileFS
uses an incrementor for the fid, files are naturally ordered by age. In some
setups “old” files may be accessed less often than new ones, which can
influence your rebalancing decision.
limit_type
: (global|device) Whether the specified limit is applied globally (drain 5000g in total from any of these devices) or per device (pick 12 devices and drain 10g from each).
limit_by
: (size|count|percent|none) Defines what the limit is. Note: as of this writing "percent" is unimplemented. You can limit by 'count', a number of files to move; by 'size', a specific number of bytes to copy; or 'none', which means to drain all files from the devices.
limit
: The limit itself, interpreted according to limit_by above. The 'size' limit is expressed in bytes by default, but takes a human modifier (500m, 10g, 13t, etc)
use_dest_devs
: (all|N) After applying all filters, you may have any number of destination devices: a handful, dozens, hundreds. This limits how many of those devices replication will later consider. It's mostly an optimization; if you have many devices you might want to set this to some reasonable number.
leave_in_drain_mode
: (0|1) In previous versions, setting a device to 'drain' meant that mogile would harass the device automatically and remove files. Now it simply means "don't put new files here". While rebalance is working on a device, it is switched from alive into drain mode. If you wish to emulate the old drain behavior, set this value to 1. Setting no limit and enabling this will remove all files from a device and not allow new ones to be added again. A combined example of these options follows this list.
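Putting several of these together: the thresholds and the host id below are made up for illustration, but the option names and commands are the ones documented above. This policy would pull at most 10g from each device that is at least 90% full, never place copies on host 5, and hand at most 10 destination devices to the replication policy:
$ mogadm rebalance policy --options="from_percent_used=90 not_to_hosts=5 limit_type=device limit_by=size limit=10g use_dest_devs=10"
$ mogadm rebalance settings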
Running a Rebalance
As noted above, create your policy. You can review it via mogadm rebalance settings
Testing
$ mogadm rebalance test
Tested rebalance policy...
Policy: etc
Source devices:
- 100
- 102
- 103
- 104
Destination devices:
- 156
- 157
- 158
- 159
Before starting a rebalance, you should review what devices the policy would match.
Hopefully future versions will display more information about the devices, but
for now you may match the lists up against the output of mogadm check
mentally :)
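If eyeballing the device ids gets tedious, a rough shortcut (this assumes your mogadm check output lists each device as devNNN alongside its use%; dev156 is just one of the ids from the sample output above):
$ mogadm rebalance test
$ mogadm check | grep -w dev156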
Starting
$ mogadm rebalance start
$ mogadm rebalance stop
$ mogadm rebalance reset
Rebalance will pause when stopped, but existing entries in REBAL_QUEUE will continue to run. You can view the queue with mogstats --stats="general-queues". It may be advisable to wait for this to finish before starting again.
To restart your rebalance from scratch, either change the policy or run a mogadm rebalance reset while stopped.
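A cautious stop-and-restart might look like this sketch (only the mogadm and mogstats commands are from this document; how long to wait between them is up to you):
$ mogadm rebalance stop
$ mogstats --stats="general-queues"   # repeat until the rebalance queue has drained
$ mogadm rebalance reset
$ mogadm rebalance start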
Watching
$ mogadm rebalance status
Rebalance is running
Rebalance status:
bytes_queued = 126008251219
completed_devs = ,102,125,151,148
fids_queued = 519021
sdev_current = 119
sdev_lastfid = 54646960511250969
sdev_limit = 2840763873
source_devs = 108,115,103,113,152,142,107,141,100
The status output for rebalance is simply a dump of its internal state, along with some counters it keeps. The status is updated every few seconds after the job master runs.
sdev_current
is the device it’s working on.
sdev_limit
is the remaining number of bytes (or files) rebalance is
attempting to move.
fids_queued
is the global count of how many fids have been queued since
starting.
bytes_queued
is the number of bytes moved since starting.
The state will stick around after rebalance has finished running, so you may look at the final state.
It’s generally a good idea to watch mogile’s syslog output during a rebalance. If it runs into fids it cannot rebalance for some reason, the information is sent to syslog (or !watch if you telnet to a tracker).
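Two simple ways to keep an eye on it (the 30 second interval is arbitrary; !watch is the tracker command mentioned above):
$ watch -n 30 'mogadm rebalance status'
$ telnet trackername trackerport
!watch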
Optimizing
Most of the optimizations for an fsck apply to a rebalance as well; see https://github.com/mogilefs/mogilefs-docs/blob/master/FSCK.md#tuning-fsck for tips on speeding up the process.
Basically increase the number of replicate jobs across your trackers.
telnet trackername trackerport
!want 5 replicate
There is a configurable setting, queue_rate_for_rebal, which defaults to 60 (in MogileFS 2.45). Most people can just leave it as is. If adding more replicate processes stops helping and load across trackers/database is still low, you may want to look into tweaking this.
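If you do decide to tweak it, a sketch (this assumes your mogadm has the settings subcommand for server settings; the value 120 is only an example, not a recommendation):
$ mogadm settings list
$ mogadm settings set queue_rate_for_rebal 120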
Rebalance Examples
Make everything roughly even
As of this writing there are some missing shortcuts for evening out your file distribution. There are two approaches to doing this:
One is to calculate how much disk space to move off each of the "fuller" devices, and where to put it, e.g.:
from_percent_used=90 to_percent_used=10 limit_type=device limit_by=size limit=5g
If you have two hosts with four full disks each, and two new hosts with empty disks, the above would move 40g of data onto the new hosts (8 source devices at 5g each).
The other approach is to run rebalance over and over until mogadm rebalance test doesn't pick any new devices to work on.
from_percent_used=51 to_percent_free=51 limit_type=device limit_by=size limit=1g
The above will take files off anything > 51% full, and move 1g from each device toward anything at least 51% empty. The theory is if you run this enough times it should trend towards even. Be mindful; if your entire cluster is more than 50% full, you could end up re-running rebalance forever.
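To run the second approach, set it as the policy, test, and start, repeating the cycle until the test stops selecting source devices (these are the same commands documented above):
$ mogadm rebalance policy --options="from_percent_used=51 to_percent_free=51 limit_type=device limit_by=size limit=1g"
$ mogadm rebalance test
$ mogadm rebalance start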
How Rebalance Works
Filter Devices
Simply put, the above document tells you how to build a string which filters the list of devices and selects which ones should or shouldn’t be drained from, or should or shouldn’t be replicated to.
Once the devices are selected, rebalance will run them through one at a time and queue up fids for the replication workers to actually rebalance.
Rebalance Internals
Below is a high-level example of the flow a replicate worker has for rebalance. This is useful to know in case you see "would_worsen" errors on small clusters.
- fid 5 has 3 copies, one on each of dev31, dev32, and dev33.
- Rebalance decides to pull the fid off of dev33.
- Rebalance tells the replication code: "replicate fid 5, pretend you don't have the copy on dev33, and you can't put the fid on dev33." This is a constraint so that rebalance can never accidentally leave a fid worse off than it was before; if anything goes wrong it should be "too_happy" and not sad. It also deals with being able to move fids which have only one valid copy; otherwise the drain code would nuke the fid.
- The replication worker makes a new copy on one of the destination devices, so the fid now has 4 copies.
- Now that it has 4 copies, the rebalance code deletes the one on dev33.
- It should then have 3 copies again, and is now balanced away from dev33.