MogileFS Filesystem Check

MogileFS, being a large distributed HTTP-backed data store, does not have a traditional ‘fsck’ component. It does have a mechanism for walking and verifying the contents of the entire filesystem in a parallel, online, asynchronous fashion.

The MogileFS fsck will, by default, walk across every FID stored and verify that:

The FID satisfies its replication policy, with neither too few nor too many copies (e.g. it has 3 copies across multiple hosts).
Each copy of the FID is stored where it’s supposed to be, exists, and is the correct length.
If a file has gone missing (no paths work), fsck will attempt to find the FID on any device.
The devcount cache is correct, along with a few other minor details.
Note that it doesn’t yet confirm file checksums.

The FSCK is strictly brute force. You cannot focus it to a particular domain, class or prefix of files. However, each release of MogileFS is adding small improvements to the FSCK code, so check the latest code to see what’s been added and updated.

When to Run a FSCK

You should not need to run a FSCK constantly, but it’s healthy to run one occasionally or after major events: major software errors, notable upgrades, power outages, and so on. Also, if you edit a class replication policy (add or remove replicas), the changes will not take effect until a FSCK has run.

FSCK can repair damage left behind by bugs in older versions. It can also help recover from periods when you didn’t have enough unique storage hosts to satisfy replication: FIDs will have ended up on multiple devices, just not on enough hosts to be happy.

Running a MogileFS FSCK

Kicking Off a FSCK

fsck is controlled via mogadm.

$ mogadm fsck 
Help for 'fsck' command:
 (enter any command prefix, leaving off options, for further help)

  mogadm fsck clearlog                               Clear the fsck log
  mogadm fsck printlog                               Display the fsck log
  mogadm fsck reset [opts]                           Reset fsck position back to the beginning
  mogadm fsck start                                  Start (or resume) background fsck
  mogadm fsck status                                 Show fsck status
  mogadm fsck stop                                   Stop (pause) background fsck
  mogadm fsck taillog                                Tail the fsck log

When starting a fsck for the first time, simply run ‘mogadm fsck start’. After a few moments, it should start running. Run mogadm fsck status to watch its progress.

FSCK options

You can start a fsck with a few options. To run it against only newer files, tell it which FID number to start at. To check only replication policy without checking the state of files on disk, use the policy-only option. The example below does both.

mogadm fsck stop
mogadm fsck reset --startpos=5000 --policy-only
mogadm fsck start
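
As a minimal sketch, the stop / reset / start sequence above could be assembled by a small script. The helper name fsck_restart_commands is hypothetical; only the --startpos and --policy-only options shown above are used, and nothing is executed unless you pass each command to subprocess yourself.

```python
import shlex


def fsck_restart_commands(startpos=None, policy_only=False):
    """Build the mogadm invocations for restarting fsck with options.

    Hypothetical helper: it only assembles argument lists and does not
    run anything against a tracker.
    """
    reset = ["mogadm", "fsck", "reset"]
    if startpos is not None:
        # Start the walk at this FID number instead of the beginning.
        reset.append(f"--startpos={int(startpos)}")
    if policy_only:
        # Check replication policy only, skipping on-disk file checks.
        reset.append("--policy-only")
    return [
        ["mogadm", "fsck", "stop"],
        reset,
        ["mogadm", "fsck", "start"],
    ]


if __name__ == "__main__":
    for cmd in fsck_restart_commands(startpos=5000, policy_only=True):
        # To execute for real: subprocess.run(cmd, check=True)
        print(shlex.join(cmd))
```

Building the commands as lists keeps them safe to hand to subprocess.run without shell quoting concerns.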

Monitoring a FSCK Run

If you like watching grass grow, fsck monitoring is perfect for you!

    Running: Yes
     Status: 55252778 / 75053798 (73.61%)
       Time: 791m (1164 fids/s; 19801020m remain)
 Check Type: Normal (check policy + files)

 [num_GONE]: 1
 [num_NOPA]: 1
 [num_POVI]: 365
 [num_REPL]: 365
 [num_SRCH]: 1

You may periodically run mogadm fsck status to monitor the progress of a fsck. Note that the status output became slightly odd in version 2.30; versions 2.33 and higher have much improved status output. Keep in mind that once a fsck has switched to “Running: No”, it can still be processing FIDs for a few minutes afterwards while internal queues drain.

You can examine the results of the fsck with mogadm fsck printlog, or watch output while it’s running with mogadm fsck taillog.
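
If you want to track progress from a script, the status output can be parsed. The sketch below assumes the output looks like the sample shown earlier in this section; other versions may format it differently, and the function name fsck_progress is hypothetical. It recomputes the time remaining from the FID counts and the reported rate. In the sample, the “remain” figure equals total minus checked FIDs, so it may be a FID count rather than minutes.

```python
import re

# Patterns assume the status layout shown in the sample above.
STATUS_RE = re.compile(r"Status:\s*(\d+)\s*/\s*(\d+)")
RATE_RE = re.compile(r"\((\d+(?:\.\d+)?)\s*fids/s")

SAMPLE = """\
    Running: Yes
     Status: 55252778 / 75053798 (73.61%)
       Time: 791m (1164 fids/s; 19801020m remain)
"""


def fsck_progress(status_text):
    """Return percent done, FIDs remaining, and a recomputed ETA in minutes."""
    done, total = map(int, STATUS_RE.search(status_text).groups())
    rate_match = RATE_RE.search(status_text)
    rate = float(rate_match.group(1)) if rate_match else None
    remaining = total - done
    # remaining FIDs / (FIDs per second) gives seconds; convert to minutes.
    eta_minutes = remaining / rate / 60 if rate else None
    return {
        "percent": 100.0 * done / total,
        "remaining_fids": remaining,
        "eta_minutes": eta_minutes,
    }


if __name__ == "__main__":
    progress = fsck_progress(SAMPLE)
    print(f"{progress['percent']:.2f}% done, "
          f"{progress['remaining_fids']} FIDs left, "
          f"about {progress['eta_minutes']:.0f} minutes at the current rate")
```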

Tuning FSCK

In versions 2.30 and later, fsck has become much more tunable. Earlier versions ran fsck from exactly one worker process on a single tracker; with hundreds of millions of files, a run could take over a month. Now that it’s distributed, it can run as fast as your available resources allow.

It’s a good idea to watch tracker logs (via syslog or running !watch while telnet’ed to a tracker), and look for timeout errors. If you start getting a lot of those, your cluster is running too hot.

Number of FSCK Workers

The easiest tunable is simply the number of parallel fsck workers you have.

telnet trackerhost 7001
Trying 172.16.151.10...
Connected to 172.16.151.10
Escape character is '^]'.
!jobs
[...]
fsck count 1
fsck desired 1
fsck pids 32664
.
!want 5 fsck
Now desiring 5 children doing 'fsck'.
.

Slowly increase the number of fsck workers while monitoring load across your trackers, database, and storage cluster. The trackers gradually increase how much work they fetch from the fsck queue at once, so it could take a few minutes for them to ramp up. However, if you’re adding more fsck workers and the fsck isn’t running any faster, you probably need to tune a few other settings.
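
When ramping workers up gradually, it can help to read the current worker counts programmatically. The sketch below parses !jobs output, assuming it has the “job field value” shape shown in the telnet transcript above, and formats the matching !want command. The names parse_jobs and want_command are hypothetical, and sending the command to the tracker's management port is left to you.

```python
SAMPLE_JOBS = """\
fsck count 1
fsck desired 1
fsck pids 32664
.
"""


def parse_jobs(text):
    """Parse '!jobs' output into {job: {"count": n, "desired": n, "pids": [...]}}."""
    jobs = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the terminating '.' and anything not shaped like 'job field value'.
        if len(parts) < 3 or parts[1] not in ("count", "desired", "pids"):
            continue
        name, field, values = parts[0], parts[1], parts[2:]
        entry = jobs.setdefault(name, {})
        if field == "pids":
            entry["pids"] = [int(v) for v in values]
        else:
            entry[field] = int(values[0])
    return jobs


def want_command(count, job="fsck"):
    """Format the tracker command that requests `count` workers for `job`."""
    return f"!want {int(count)} {job}"


if __name__ == "__main__":
    current = parse_jobs(SAMPLE_JOBS)["fsck"]["desired"]
    # Step up by one worker at a time and watch cluster load between steps.
    print(want_command(current + 1))
```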

FSCK Speed Settings

FSCK has a number of queue management options, which default relatively low to avoid adding too much load to small setups. If you have 25+ fsck workers (and your cluster isn’t overloading!) you might need to tune these in order to get it to run faster.

$ mogadm settings list
     internal_queue_limit = 500
      queue_rate_for_fsck = 1000
      queue_size_for_fsck = 20000
[etc]

If you have not adjusted these values from the defaults, they will not be displayed here. Until this is fixed, it’s a little hard to discover what the defaults are (you can look in JobMaster.pm), but the values above are the defaults for 2.33 through 2.45.

internal_queue_limit is the number of FIDs a tracker will fetch from the fsck database queue at once. Fsck workers will fetch chunks of FIDs internally from this queue. If you look at a tracker with the ‘!stats’ command, you will see various variables like work_queue_for_fsck 0. If you are actively running a fsck with many workers, and this stat is often very low or 0, increasing the internal_queue_limit will allow it to go faster. Be wary of increasing it too high; a few thousand is probably all you need. There is a slow internal ramp up, so be patient. You might also need to tune the values below to keep this fed.

queue_size_for_fsck is the number of FIDs that fsck will leave queued in the database for trackers to pick up and send to their workers. The higher this is the more disk space you waste, but if the queue is constantly zero, doubling or tripling the limit can help the queue stay ahead of demand.

queue_rate_for_fsck is the number of FIDs that fsck will inject into the queue per cycle (every other second-ish), if the queue is below the limit. Setting the variable too high can cause too much DB load, but too low and the trackers can get ahead of the queue. The total number of FIDs that can be queued at once could be up to queue_size_for_fsck + queue_rate_for_fsck.
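
The queue arithmetic above can be worked through with the default values listed earlier. This is only a rough sketch: the function names are hypothetical, and the 2-second cycle length is an assumption read from “every other second-ish”, not a documented constant, so treat the injection figure as an order-of-magnitude estimate.

```python
def max_queued_fids(queue_size_for_fsck=20000, queue_rate_for_fsck=1000):
    """Upper bound on FIDs sitting in the database queue at once.

    The queue is topped up whenever it is below queue_size_for_fsck, so a
    top-up that starts just under the limit can overshoot by one batch.
    """
    return queue_size_for_fsck + queue_rate_for_fsck


def injection_ceiling(queue_rate_for_fsck=1000, cycle_seconds=2.0):
    """Approximate FIDs per second that can be injected into the queue.

    cycle_seconds is an assumed value, not taken from the MogileFS source.
    """
    return queue_rate_for_fsck / cycle_seconds


if __name__ == "__main__":
    print("max queued FIDs:", max_queued_fids())
    print("approx injection ceiling (fids/s):", injection_ceiling())
```

If the workers can consume FIDs faster than this estimated ceiling, the queue will tend to run dry, which is the situation where raising queue_rate_for_fsck is worth trying.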

Interpreting Results

Fsck will log a bunch of obscure error codes as it finds and handles problems.

NOPA: FID has no usable paths, and is likely dead.
POVI: FID violates its class replication policy. This can mean too few, or too many copies.
MISS: FID is missing at least one copy that is supposed to exist.
BLEN: Copy of FID does not have the correct file length.
GONE: Attempted to find any working, correct, copies of FID, but was not able to. File is dead. If you get these errors, you should take a deeper look at the FID and your app to find out why it might have gone missing.
SRCH: FSCK has started an exhaustive search for a missing file.
FOND: FSCK has recovered a FID that was thought to be lost.
REPL: FID has been scheduled for replication to fix a policy violation.
BCNT: FID’s ‘devcount’ cache is incorrect.
BSUM: Database checksum didn’t match on-disk contents.
NSUM: Database checksum was missing from a class which requires checksums.
MSUM: (Only for fsck_checksum=$HASH users) multiple checksums were calculated and MogileFS did not know which is correct.
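
To make the counters in the status output easier to read, the codes above can be kept in a lookup table. The sketch below assumes the “[num_CODE]: n” counter lines shown in the sample status output earlier; the descriptions are shortened from the list above, and the names parse_counters and describe_counters are hypothetical.

```python
import re

# Short descriptions condensed from the code list above.
FSCK_CODES = {
    "NOPA": "FID has no usable paths",
    "POVI": "FID violates its class replication policy",
    "MISS": "FID is missing at least one expected copy",
    "BLEN": "a copy has the wrong file length",
    "GONE": "no working copy could be found; file is dead",
    "SRCH": "exhaustive search started for a missing file",
    "FOND": "a FID thought to be lost was recovered",
    "REPL": "FID scheduled for replication to fix a policy violation",
    "BCNT": "devcount cache is incorrect",
    "BSUM": "database checksum did not match on-disk contents",
    "NSUM": "checksum missing for a class that requires one",
    "MSUM": "multiple checksums calculated; correct one unknown",
}

# Matches counter lines such as "[num_POVI]: 365".
COUNTER_RE = re.compile(r"\[num_([A-Z]{4})\]:\s*(\d+)")

SAMPLE = """\
 [num_GONE]: 1
 [num_NOPA]: 1
 [num_POVI]: 365
 [num_REPL]: 365
 [num_SRCH]: 1
"""


def parse_counters(status_text):
    """Extract {code: count} from the counter lines of the status output."""
    return {code: int(count) for code, count in COUNTER_RE.findall(status_text)}


def describe_counters(counters):
    """Return one human-readable line per code, sorted by code name."""
    return [
        f"{code} x{count}: {FSCK_CODES.get(code, 'unknown code')}"
        for code, count in sorted(counters.items())
    ]


if __name__ == "__main__":
    for line in describe_counters(parse_counters(SAMPLE)):
        print(line)
```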

If you have a new enough version of MogileFS::Utils installed, you may take fids noted in the FSCK logs and run them through mogfiledebug. This utility will trawl everything it can find out about a particular fid, which will help give you ideas on what went wrong. If nothing else, it gives great output you can use to get help on the mailing list or IRC.

Cleaning up Between Runs

fsck status builds its violation summaries by counting entries in the log, which means that if you don’t clear the log between runs you will see more failures than there actually are. Also, the log can get huge and take up a lot of space in your database. So once you have looked through the log entries (or copied them into a file somewhere), it’s a good idea to clear it.

mogadm fsck status
(confirm that it isn't running, and hasn't been for a few minutes at least)
mogadm fsck clearlog
mogadm fsck reset
