~~META:
title = C2020D20 Root Cause Analysis Storage Disruption 20201118
~~
{{htmlmetatags>
metatag-keywords=(rcs root cause analysis storage disruption)
metatag-og:title=(Root Cause Analysis Storage Disruption 20201118)
metatag-og:description=(
	A planned storage failover test in November 2020 led to a
	severe storage disruption on the NPO Hosting platform.
	This document aims to give a root cause analysis for the
	incident.
	)
}}
====== Root Cause Analysis Storage Disruption 20201118 ======
A [[c2020d20-storage-onderhoud-202011|planned storage failover]]
test in November 2020 led to a
[[c2020d20-verstoring-agv-storage-onderhoud-202011|severe storage disruption]]
on the NPO Hosting platform.
During this test all MySQL, MariaDB, PostgreSQL, Elasticsearch, Postfix
and some other instances crashed: about 350 instances in total, leading
to a major outage on the NPO Hosting platform.

In general we consider the storage solution described here to be very
stable and performant. We would recommend this solution to anyone
who's willing to pay for it. It is a high-end enterprise-class
storage solution. We also hold the supplier's support organisation
in high regard: it is most capable and knowledgeable. So it is
very unfortunate that a test of the high-availability aspects of
the system led to this outage. Moreover, from a systems point of view,
this test actually succeeded. The failover on the server side worked
as designed; unfortunately the clients reacted in an unexpected
manner. In other words: "The operation was successful, but the patient died".

This document aims to give a root cause analysis for the incident.

====== Storage setup ======
For a better understanding of exactly what the test entailed and
why this could lead to a storage disruption, some knowledge is needed
about the storage platform, the storage protocol used (NFS) and how the NPO
Hosting platform uses NFS to store database files.

===== Server Side =====
Storage is delivered to the clients by two fileservers in
active/passive failover mode.
The fileservers don't have local storage, but consume storage from a block
storage system connected via Fibre Channel.
This is called a SAN((Storage Area Network)).
The fileservers consume block storage from the block storage system
through the SAN and provide file storage (NFS and CIFS) to, among others,
a farm of Linux clients which is the backbone of the NPO Hosting platform.

Both the block storage system and the fileservers are geographically
spread over two locations. The idea is that if something bad happens
in one datacenter, the other datacenter can do a full takeover.

The fileservers offer two levels of redundancy:
  - Storage redundancy, where all data is replicated between two datacenters
  - Service redundancy, where one fileserver can take over all fileserving tasks of the sibling server in the other datacenter.

Storage redundancy is obtained by a process named
"Sync-DR"((Synchronous Data Replication)). This functionality is
delivered natively by the block storage system. Block storage at a
physical location can be in one of two states: Primary ("P-VOL") or
Secondary ("S-VOL"). Data is always written to the primary location and
replicated from the primary to the secondary location.
The process is synchronous. This means that a write operation does not
complete until the write to both the primary and the secondary
location has finished. The upshot of this is that both locations
always contain an up-to-date copy of all data.

The idea is that should the primary location fail, a failover to the
secondary location is possible by promoting the secondary to primary.
This is the mechanism that the NPO wanted to test in the failover test
that led to the incident.

Service redundancy is obtained by a process named "EVS migration".
An EVS((Enterprise Virtual Server)) can be seen as a kind of virtual
machine. Both fileservers act as virtualisation hosts for the EVSs.
On either host a number of EVSs can be running. The actual fileserving
is done through the EVS. The EVS holds the server IP address that is used
by the clients to connect to the network storage.
An EVS migration can be seen as a live migration of an EVS from one host
to the other.
This can be very useful when e.g. network or power maintenance in one
datacenter is needed. By performing an EVS migration to the host in the
other datacenter the host in the first datacenter is relieved of its
tasks and maintenance in the first datacenter can take place without
impact to the file services.
EVS migrations are done on a regular basis in the NPO environment.
Also, should a datacenter fail, an EVS migration to the remaining
datacenter is performed automatically by the remaining fileserver.

Of course there's much more to be said about the server-side storage
architecture (the implementation document alone is over 100 pages)
but since many of the implementation details are not relevant to this
particular incident, for the sake of brevity we will not include them here.

===== Client Side =====
The (NFS) client side consists of a number of clusters of Linux servers,
both bare-metal and VMs. In total somewhere between 100 and 150 Linux
systems are involved. The CIFS shares are exported to ~500 PCs, but
since the issue was with NFS, not with CIFS, these systems are not in
scope.
On the server most filesystems are exported with the flags below:
<code>
[ip-subnet](rw,sec=sys)
</code>
Meaning:
  * ''[ip-subnet]'': The subnet applicable to the specific share. Typically the different Linux clusters live in different networks, and only the shares for a specific cluster are exported to the matching subnet.
  * ''rw'': Most shares are exported read/write, although a number of shares (e.g. the ones containing PHP code) are exported read-only (ro) to the production clusters.
  * ''sec=sys'': Use classic AUTH_SYS to authenticate NFS operations, i.e. don't authenticate against Kerberos or the like.

The fileservers announce themselves as follows using rpcinfo:
<code>
$ rpcinfo -T tcp6 -s evs-web-01-328
   program version(s) netid(s)                         service     owner
    100003  2,3       TCP,TCP6                         nfs         
    100005  1,3       UDP,UDP6,TCP,TCP6                mountd      
    100021  3,1,4     UDP,UDP6,TCP,TCP6                nlockmgr    
    100024  1         UDP,UDP6,TCP,TCP6                status      
    100011  1,2       UDP,UDP6,TCP,TCP6                rquotad     
    100000  2,3       UDP,UDP6,TCP,TCP6                portmapper  
    334741  3         UDP,UDP6,TCP,TCP6                -           
</code>

On the linux systems the NFS filesystems are mounted with the following
mount flags:
<code>
rw,nosuid,nodev,noexec,nolock,nocto,noatime,vers=3,acdirmin=10,acdirmax=10,acregmin=16,acregmax=16,intr,proto=tcp6
</code>
We'll list them below:
  * ''rw'': Most shares are mounted read/write, except for the ones containing code. Those are mounted read-only.
  * ''nosuid,nodev,noexec'': security measures. Don't honour set-uid bits (nosuid) or device nodes (nodev). Also don't allow execution of binaries from nfs shares (noexec).
  * ''nolock'': Don't use NFS cluster-wide locking. File locks only provide exclusion between applications running on the same client.
  * ''nocto,noatime,acdirmin=10,acdirmax=10,acregmin=16,acregmax=16'': Performance measures. Our workload is very metadata-intensive, so we want to cache metadata as much as possible, including the results of repeated ''stat()'' and ''access()'' operations.
  * ''vers=3,proto=tcp6'': We use NFSv3 because that is hardware-accelerated on the server (as opposed to NFSv4). We use TCP as opposed to UDP, since TCP has better performance in a non-lossy environment. We mount over IPv6 because we're living in 2020.
  * ''intr'': This is a legacy setting in our environment. In the past it used to be necessary in certain circumstances, but nowadays it doesn't do anything anymore. From the manpage: "The  intr / nointr mount option is deprecated after kernel 2.6.25.  Only SIGKILL can interrupt  a  pending  NFS operation on these kernels, and if specified, this mount option is ignored  to  provide  backwards  compatibility with older kernels."

Apart from the explicitly listed options above there are a number of default NFS mount options. We keep them at their defaults. For completeness we'll list them below:
  * ''rsize=65536,wsize=65536'': This is the default for TCP mounts, which is fine for our environment.
  * ''namlen=255'': There aren't any pathname components > 255 characters.
  * ''hard'': Retry NFS requests indefinitely. (**Important** because we don't want to pass NFS errors on to the applications. Instead we simply wait until NFS springs back to life in case of an outage.)
  * ''timeo=600,retrans=2'': Default timeout values.
  * ''mountvers=3,mountport=4004,mountproto=tcp6'': Specifics for the rpc.mountd protocol.
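The retry behaviour described above can be checked from an option string such as the fourth field of ''/proc/mounts''. The following is a minimal sketch; the function names ''nfs_retry_mode'' and ''nfs_timeo_seconds'' are made up for this illustration. It assumes that ''hard'' applies when neither ''hard'' nor ''soft'' is listed and that ''timeo'' is expressed in tenths of a second with a TCP default of 600, as described in the nfs(5) manpage.

```shell
# Report whether a comma-separated NFS option string requests hard or soft
# retry behaviour. "hard" is assumed when neither option is listed.
nfs_retry_mode() {
    case ",$1," in
        *,soft,*) echo soft ;;
        *)        echo hard ;;
    esac
}

# Convert the timeo option (tenths of a second) to whole seconds.
# Falls back to 600, the default for TCP mounts, when timeo is absent.
nfs_timeo_seconds() {
    timeo=$(printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^timeo=//p')
    echo $(( ${timeo:-600} / 10 ))
}

opts="rw,nosuid,nodev,noexec,vers=3,hard,proto=tcp6,timeo=600,retrans=2"
nfs_retry_mode "$opts"       # prints "hard"
nfs_timeo_seconds "$opts"    # prints "60"
```

With a hard mount the timeout only determines when a request is retried, not when the application gets an error: the application simply stays blocked until the server answers.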

==== Loopback mounts ====
However, this is not all there is to it! Databases and some other
applications (Elasticsearch, to name one) don't like it much when their
datafiles reside directly on NFS storage. This is because, while NFS may
look, feel and smell very much like a regular local filesystem (i.e. a
filesystem backed by a local hard disk or perhaps by an iSCSI
device), it differs in some aspects, mainly in how deleting an open
file is treated. Consider the following sequence of events
(this example assumes a Linux client
with a ''/proc'' filesystem in order to show the currently open files):
<code>
### create an open file foo on a regular filesystem (/tmp)
$ cd /tmp
### open two file descriptors, one for writing, one for reading
$ exec {writefd}>foo {readfd}<foo
### see the open files
$ ls -l /proc/self/fd/{$writefd,$readfd}
l-wx------ 1 dick dick 64 Nov 19 14:43 /proc/self/fd/10 -> /tmp/foo
lr-x------ 1 dick dick 64 Nov 19 14:43 /proc/self/fd/11 -> /tmp/foo
### now remove the directory entry
$ rm foo
### the file descriptors still exist!
$ ls -l /proc/self/fd/{$writefd,$readfd}
l-wx------ 1 dick dick 64 Nov 19 14:43 /proc/self/fd/10 -> /tmp/foo (deleted)
lr-x------ 1 dick dick 64 Nov 19 14:43 /proc/self/fd/11 -> /tmp/foo (deleted)
### and can be written to:
$ echo hello >&$writefd
### Where does "hello" live now?
### It has to be somewhere, because we can still read the filecontents
### through the read filedescriptor:
$ read -u $readfd filecontents
$ echo $filecontents
hello
### only after all open files are closed, the contents become inaccessible
$ exec {readfd}<&- {writefd}<&-
$ ls -l /proc/self/fd/{$writefd,$readfd}
ls: cannot access /proc/self/fd/10: No such file or directory
ls: cannot access /proc/self/fd/11: No such file or directory
$ read -u $readfd filecontents
-bash: read: 11: invalid file descriptor: Bad file descriptor
</code>
On a regular local filesystem, the kernel knows about the state of the
open files and doesn't delete the inode (where the on-disk location of
the string "hello" is stored) until the open file is closed by the
application. However, on an NFS filesystem this is not possible! That is
because the NFS protocol is stateless (more on that below) and the NFS
server doesn't know anything about files being open or closed. So at the
time the user issues ''rm foo'' the NFS fileserver has a problem:
if it deleted the file right away, it wouldn't have any place to store
"hello" later on.
This problem is solved in NFS in a somewhat peculiar way, namely by the
creation of ''.nfsXXXXXXXX'' files by the client.
Remember that the server has no concept of open files, so a ''close()''
call on a client has no meaning to an NFS server. Hence the //client//
has to do something on a ''close()'', not the server.
So when the user issues ''rm foo'', the client does something along
the lines of ''mv foo .nfsXXXXXXXX''. And only after the user closes the file
does the client remove the .nfsXXXXXXXX file. This can be demonstrated
easily. Suppose we run the same sequence of commands on an NFS-mounted
filesystem and see what happens
(in this specific example ''/d/test3/rw/00/tmp'' happens to be an NFS-mounted filesystem):
<code>
$ cd /d/test3/rw/00/tmp
$ exec {writefd}>foo {readfd}<foo
$ ls -l /proc/self/fd/{$writefd,$readfd}
l-wx------ 1 dick dick 64 Nov 19 14:51 /proc/self/fd/10 -> /d/test3/rw/00/tmp/foo
lr-x------ 1 dick dick 64 Nov 19 14:51 /proc/self/fd/11 -> /d/test3/rw/00/tmp/foo
$ rm foo
### Now see how this differs from a locally mounted filesystem!
$ ls -l /proc/self/fd/{$writefd,$readfd}
l-wx------ 1 dick dick 64 Nov 19 14:52 /proc/self/fd/10 -> /d/test3/rw/00/tmp/.nfs0000000087a83ac70000001a
lr-x------ 1 dick dick 64 Nov 19 14:52 /proc/self/fd/11 -> /d/test3/rw/00/tmp/.nfs0000000087a83ac70000001a
### A mysterious .nfsXXX file appeared!
$ ls -l .nfs*
-rw-rw-r-- 1 dick dick 0 Nov 19 14:51 .nfs0000000087a83ac70000001a
$ echo hello >&$writefd
### See how the mysterious .nfsXXX file grew by 6 bytes ("hello" + newline)
$ ls -l .nfs*
-rw-rw-r-- 1 dick dick 6 Nov 19 14:53 .nfs0000000087a83ac70000001a
$ read -u $readfd filecontents
$ echo $filecontents
hello
### And see the .nfsXXX file disappear when we close our filedescriptors
$ exec {readfd}<&- {writefd}<&-
$ ls -l .nfs*
ls: cannot access .nfs*: No such file or directory
</code>

All very interesting of course, but what this teaches us is that
there's a difference in semantics between regular and NFS mounted
filesystems. And some applications pick up on that. They might get upset
when suddenly an .nfsXXX file appears at a place where they don't expect
it.

Apart from a difference in semantics there might also be noticeable
differences in performance. Suppose an application creates a great many
small files. Now, when the application wants to list those files ("ls"
in unix) and the files reside on an NFS filesystem, for //each// file a
round-trip to the NFS server is needed to get the metadata (filesize,
permissions and the like). Each roundtrip might take about a
millisecond, but these milliseconds add up. So when there are a thousand
files, listing them all takes one second. And ten-thousand files would
take ten seconds. Whereas on a local filesystem getting the metadata of
a file is in the order of microseconds.
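The arithmetic above can be written down as a back-of-the-envelope estimate. This is only a sketch: the function name ''estimate_listing_ms'' is made up for this illustration, and the default of 1 ms per round trip is the rough figure from the text, not a measured value.

```shell
# Estimate the time (in milliseconds) needed to list a number of files over
# NFS when every file costs one metadata round trip to the server.
# $1 = number of files, $2 = round-trip time in ms (assumed 1 ms by default)
estimate_listing_ms() {
    files=$1
    rtt_ms=${2:-1}
    echo $(( files * rtt_ms ))
}

estimate_listing_ms 1000     # prints 1000, i.e. one second
estimate_listing_ms 10000    # prints 10000, i.e. ten seconds
```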

So for these reasons we cannot run all of our applications directly on
NFS filesystems. However, we //want// them on some sort of networked
filesystem, to achieve some form of high availability.
A common answer to this conundrum is "use something like iSCSI". This
gives you a network block device, on which a "local-like" filesystem can
be created with all the semantics of a true local filesystem.

However, a) our current license does not include the use of iSCSI on the
fileserver and b) iSCSI has its own slew of problems when it comes to
timeouts caused by network hiccups or problems on the fileserver.
NFS is (in our eyes) a much more reliable protocol when it comes to
network-related problems.

So how to combine the advantages of a local filesystem with the
advantages of NFS? The answer is "use loopback filesystems".

Linux has support for something called a "loopback device".
A loopback device is a block device that is backed by a regular file on
a filesystem. These can be created using the "losetup" command.
Since it creates a new block device, this can contain anything a regular
block device can. Specifically a filesystem image, that can be mounted
by the server.  Here's an example:
<code>
### First, create an empty file on a regular filesystem (/tmp)
$ cd /tmp
### create an empty, sparse 1Gbyte file, named "fsimage"
$ truncate --size=1g fsimage
### now put a filesystem in that image (xfs in this case)
$ mkfs -t xfs fsimage
meta-data=fsimage                isize=256    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
### Next, create a block device out of this image:
$ sudo losetup -f --show /tmp/fsimage
/dev/loop1
### Okay, so now we've got a /dev/loop1, let's inspect it
$ sudo losetup /dev/loop1
/dev/loop1: [0700]:133 (/tmp/fsimage)
$ sudo blockdev --getsize64 /dev/loop1
1073741824
### And, since it contains a filesystem image, we can mount it
$ sudo mount /dev/loop1 /mnt
$ df /mnt
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop1     1014M   33M  982M   4% /mnt
### and cleanup (-d cleans up the loop back device)
$ sudo umount -d /mnt
</code>

The nice thing about loopback devices is that losetup doesn't care
where its backing storage is located. The backing store can just as
easily be a file on an NFS mounted filesystem.

So that is what we use for our databases and the like. Their storage
resides on a loopback filesystem, backed by an NFS-mounted file.
For all intents and purposes this "looks like" a local filesystem to the
applications. And on the performance front it also "feels like" a local
filesystem. Getting the metadata of a thousand files in this case does
not require 1000 roundtrips to the fileserver. Instead it translates to
NFS as a block operation (on the NFS-backed storage file), very likely
requiring only one or two roundtrips to the fileserver.

====== Background on NFS operation ======
In order to understand what went wrong during this incident, some
background on how an NFS server processes file operations is needed.

===== File handles =====
The idea behind the NFS protocol is that it is stateless. This means
that the NFS server does not have to maintain any state to handle a
request. It does not have to "remember" whether files are open or closed
or the like. The way this works is that the NFS server deals out opaque
"file handles" to the client. And then the client can use this handle for a
subsequent request. A typical transaction might look like this((which is
of course a gross oversimplification of the actual protocol. E.g. to get
to file /foo/bar/baz a number of individual lookups of /, /foo, /foo/bar
and /foo/bar/baz are needed))
<code>
client: Hey, server, I want to read file /foo/bar/baz
server: Fine. The file handle you can use is XYZZY
client: Hey, server, please give me byte 0--9 belonging to handle XYZZY
server: (hands over 10 bytes) There you go!
client: Hey, server, please give me byte 10--19 belonging to handle XYZZY
server: (hands over 10 bytes) There you go!
</code>

The thing is that the server doesn't have to know whether some client
had an open file or not. As long as it can translate arbitrary handles to
files it does not have to know this and can simply hand over the
requested data. Furthermore, by choosing the handles cleverly the
fileserver doesn't even have to know to what file on the filesystem a
specific handle belongs. Usually NFS servers base the handle on something
called an "inode". The inode of a file is a lower-level filesystem
structure that points directly to the data. On classical filesystems
there used to be an inode array, and inode N used to map directly to the
N'th item in this array. The fields of the inode then contain the
metadata of the file (owner, permissions etc.) and pointers to the disk
blocks where the actual data is stored.
On modern filesystems this is somewhat more complex, but the idea still
holds that when the fileserver chooses its filehandles cleverly it
doesn't have to do much translation between a filehandle and the data
that is being requested.

===== Mount handles =====
Next, you might ask, what happens when a fileserver doesn't export one
filesystem, but it exports two or more? How can the fileserver tell
which incoming filehandle belongs to which filesystem? The answer is: 
usually this information is encoded in the file handle.
This is where "mount handles" (also known as "root handles") come in.
This is a three-step process:
  - When a client mounts an NFS filesystem, what happens under the hood is really not much more than the NFS server issuing a so-called "mount handle" to the client. The only thing the client has to do is to remember this mount handle.
  - Next, when a client wants access to a file, it requests a file handle and hands over the mount handle to the server, so the server knows relative to which filesystem it should look.
  - And finally, when the client wants to read from or write to this file, it hands over the file handle.

The mount handle is used by the fileserver to distinguish between
different exports on the fileserver. Often it is a combination of the
major/minor device number of the exported device and the root inode of
the export. In practice the mount handle is the file handle for the root
of the exported filesystem.

===== Stale file handles =====
Anyone who has done anything with NFS has seen the much dreaded "Stale
file handle" error at some point in time. But what does it mean?

Well, remember that NFS is a network filesystem and one server might
serve an arbitrary number of clients. These clients don't know anything
about each other. They only talk to the server.
Furthermore, to reduce network or server load, the clients often
cache filehandles.
Now suppose there are two clients, client A and client B.
Also suppose the following chain of events happens:
<code>
client A: Hey, server, I want to read file /foo/bar/baz
server: Fine. The file handle you can use is XYZZY
client B: Hey, server, I want to read file /foo/bar/baz
server: Fine. The file handle you can use is XYZZY
client A: Hey server, please remove file /foo/bar/baz
server: Okidoki. Poof! It's gone now
client B: Hey, server, please give me byte 0--9 belonging to handle XYZZY
server: XYZZY?!? That doesn't exist -> Stale file handle error
</code>

When a filehandle cannot be traced back to actual file data by the NFS
server, the fileserver has to assume that the handle is no longer valid
and issues a Stale file handle error.
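The exchange above can be imitated locally with a toy model. This is a sketch only, not how a real NFS server is implemented: a temporary directory plays the role of the exported filesystem, the inode number plays the role of the opaque file handle, and the function names ''nfs_lookup'' and ''nfs_read'' are made up for this illustration. Real servers typically put more into a handle than just the inode number.

```shell
# Toy model: a directory stands in for the exported filesystem and the
# inode number stands in for the opaque file handle.
export_dir=$(mktemp -d)
printf 'hello world\n' > "$export_dir/baz"

# "LOOKUP": translate a name into a handle (here: the inode number).
nfs_lookup() {
    ls -i "$export_dir/$1" | awk '{print $1}'
}

# "READ": serve data given only a handle. No per-client state is consulted;
# the handle either resolves to a file or it is stale.
nfs_read() {
    path=$(find "$export_dir" -inum "$1" -type f | head -n 1)
    if [ -z "$path" ]; then
        echo "Stale file handle" >&2
        return 1
    fi
    cat "$path"
}

handle=$(nfs_lookup baz)                   # client B obtains and caches a handle
nfs_read "$handle"                         # prints "hello world"
rm "$export_dir/baz"                       # client A removes the file
nfs_read "$handle" || echo "read failed"   # the cached handle no longer resolves
```

Note that the "server" never tracked which client held the handle. It can only discover that a handle is stale at the moment a client presents it.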

===== Stale mount handles =====
(We are finally getting somewhere!)
A stale mount handle is essentially a stale file handle for the
mountpoint.
Suppose that a fileserver decides to unmount an existing, exported
filesystem. What happens now?
Actually on many unices it is not possible to unmount an exported
filesystem without unexporting it first. (That is because the kernel,
rpc.mountd or the nfsd daemon itself is very likely to have at least one
open file (the root inode) on the exported filesystem.)
Now suppose the following sequence of events happens:
<code>
client: Hey server, can I get the mount handle for export /foo ?
server: Sure! XZXZXZ
admin-on-server: unexport /foo
admin-on-server: unmount /foo
client: Hey server, I want the filehandle for /foo/bar/baz, given mount handle XZXZXZ
server: XZXZXZ? Never heard of it! Stale filehandle!
client: WTF?!
</code>
So now, all applications on the client get a "Stale file handle"
error on any file they access.

If after some time the server decides to mount and export the filesystem
again, the situation improves:
<code>
admin-on-server: mount /foo
admin-on-server: export /foo
client: Hey server, I want the filehandle for /foo/bar/baz, given mount handle XZXZXZ
server: Sure! Here it is: XYZZY
</code>

Whether this helps depends on the applications accessing the storage.
Typically a webserver will recover. However, a database server might
remain down/broken/crashed once it has received a single "Stale file
handle" error.

===== Stale handles on loopback filesystems =====
(Almost there!)
Now what happens if we trigger a stale file handle (mount handle) on the
backing store of a loopback filesystem?

In this case the Linux block layer will see the stale file handle errors
and pass them on to the filesystem code, where the handling depends on
the type of filesystem.
We use XFS in almost all cases. XFS will simply shut down
the filesystem:
<code>
blk_update_request: I/O error, dev loop7, sector 1048610
XFS (loop7): metadata I/O error: block 0x100022 ("xlog_iodone") error 5 numblks 64
XFS (loop7): xfs_do_force_shutdown(0x2) called from line 1233 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8161541c
XFS (loop7): Log I/O Error Detected.  Shutting down filesystem
XFS (loop7): Please umount the filesystem and rectify the problem(s)
</code>

What the clients see is a broken filesystem:
<code>
$ ls
ls: cannot open directory .: Input/output error
</code>

How applications handle this is entirely up to them, but many
applications will simply exit, because this is an error that cannot
generally be fixed by the application.

The way XFS works is that even after the backing store comes back online
again, the harm is already done and the filesystem remains shut down
and unusable. The only way to recover is to unmount the filesystem and
mount it afresh.
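Finding out which loopback filesystems need this treatment can be done by scanning the kernel log. The sketch below assumes that the shutdown message has the same wording as in the log excerpt above, which may differ between kernel versions; the function name ''xfs_shutdown_devices'' is made up for this illustration.

```shell
# Print the loop devices whose XFS filesystem was shut down, given kernel
# log lines on standard input. The pattern is taken from the log excerpt
# above and is an assumption about the message wording.
xfs_shutdown_devices() {
    sed -n 's/.*XFS (\(loop[0-9]*\)): .*Shutting down filesystem.*/\1/p' | sort -u
}

# Example with two lines from the excerpt; prints "loop7"
xfs_shutdown_devices <<'EOF'
XFS (loop7): Log I/O Error Detected.  Shutting down filesystem
XFS (loop7): Please umount the filesystem and rectify the problem(s)
EOF
```

On a live system the input would come from something like ''dmesg''. Each device that is reported then has to be unmounted and mounted again, after which the applications on it can be restarted.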

====== Root Cause ======
The test consisted of a Sync-DR failover, triggered as follows:
<code>
sudo /opt/hds/HNASSDR/hnassdr_switch.py -c /opt/hds/HNASSDR/conf/hnassdr.conf --span WEB --to-site MGW
</code>

The root cause of this incident is that there is a distinct difference
in how the clients "see" a Sync-DR failover as opposed to how they see
an EVS migration.
For the Sync-DR failover the server has to unmount (and probably
unexport) its filesystems. The moment that happens all clients get stale
file handle errors. As a result all loopback-mounted XFS filesystems are
shut down and consequently all applications running on these filesystems
crash.

Now you might ask yourself: If for an EVS migration the EVS has to
switch between two physically different servers, does it not also need a
filesystem unmount on the first server and a mount on the second server?
And would that also not lead to the same problems?

The answer is "No". This is due to the way that the storage is
implemented on this class of fileservers. The filesystem is not mounted
like a classical unix filesystem would be. Instead the filesystem
implementation lives in silicon (FPGA) and the server talks to the
filesystem through a custom board. Apparently both servers can access
these filesystems without needing to mount or unmount them.
Because the filesystem does not need to be unmounted during an EVS
migration, the clients never get stale file handles. The only
more-or-less visible action for the clients is that the IP address goes
away for a bit and comes back to life soon thereafter. What the clients
cannot see is that the IP address now lives on another host. So just the
ordinary NFS retry and timeout rules kick in. And when the IP address
is back alive the clients can continue as if nothing happened.
(Yes, in the case of NFS mounts over TCP, the clients very likely need to
open a new TCP connection to the server, but this is handled transparently
by the NFS-over-TCP protocol.)

====== Next Steps ======
So the problem is that there is an interval in which the fileserver
actively tells the clients that they are presenting stale file handles.
If the fileserver simply stopped answering during this interval there
wouldn't be any problem. The clients would simply retry until the server
started answering their requests again. In the meantime the NFS clients
would block all filesystem access from the applications. The
applications would temporarily freeze (probably thinking "meh, this disk
is real slow", or not thinking at all because they would be blocked
in a waiting-for-I/O state) but would not crash.

So are there ways in which we could temporarily block access to the
fileservers? The answer is yes:
  * The clients could temporarily activate some firewall (iptables) rules to block all traffic to and from the fileservers.
  * Or maybe the EVS could temporarily shut down its IP address.
  * Or maybe there exists some command on the fileserver to do exactly this.
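The first option could look roughly like the sketch below. It only prints the commands instead of executing them, so they can be reviewed first. The function name ''nfs_fence_rules'' is made up for this illustration and the address ''2001:db8::10'' is a placeholder from the documentation prefix, not a real EVS address. ''ip6tables'' is used because the shares are mounted over IPv6.

```shell
# Print (not execute) the ip6tables commands that silently drop all traffic
# between this client and one fileserver address.
# $1 = "block" or "unblock", $2 = IPv6 address of the EVS (placeholder here)
nfs_fence_rules() {
    case "$1" in
        block)   flag=-I ;;    # insert the rules at the top of the chain
        unblock) flag=-D ;;    # delete the matching rules again
        *) echo "usage: nfs_fence_rules block|unblock ADDRESS" >&2; return 2 ;;
    esac
    echo "ip6tables $flag OUTPUT -d $2 -j DROP"
    echo "ip6tables $flag INPUT -s $2 -j DROP"
}

nfs_fence_rules block 2001:db8::10     # before the Sync-DR failover
nfs_fence_rules unblock 2001:db8::10   # after the failover has finished
```

The target is deliberately ''DROP'' and not ''REJECT'': packets have to disappear silently, so that to the NFS client the situation is indistinguishable from a server that has temporarily stopped answering.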

To test this we could perhaps create a small, replicated test volume and
export it over some test IP address. Next, a failover test could be done
where the (test) clients block access to this IP address before the
Sync-DR failover and unblock it when the failover is done.