From ebabcff3386077ae4ad9ecc678ca69d9ef2f0ea9 Mon Sep 17 00:00:00 2001 From: David Teigland Date: Apr 10 2017 20:03:15 +0000 Subject: update web page --- diff --git a/README.mk b/README.mk new file mode 100644 index 0000000..2ed9908 --- /dev/null +++ b/README.mk @@ -0,0 +1,916 @@ +See https://pagure.io/sanlock + +See sanlock(8) at sanlock.git/src/sanlock.8 + +See wdmd(8) at sanlock.git/wdmd/wdmd.8 + +``` +SANLOCK(8) System Manager's Manual SANLOCK(8) + +NAME + sanlock - shared storage lock manager + +SYNOPSIS + sanlock [COMMAND] [ACTION] ... + +DESCRIPTION + sanlock is a lock manager built on shared storage. Hosts with access + to the storage can perform locking. An application running on the + hosts is given a small amount of space on the shared block device or + file, and uses sanlock for its own application-specific synchroniza‐ + tion. Internally, the sanlock daemon manages locks using two disk- + based lease algorithms: delta leases and paxos leases. + + · delta leases are slow to acquire and demand regular i/o to shared + storage. sanlock only uses them internally to hold a lease on its + "host_id" (an integer host identifier from 1-2000). They prevent two + hosts from using the same host identifier. The delta lease renewals + also indicate if a host is alive. ("Light-Weight Leases for Storage- + Centric Coordination", Chockler and Malkhi.) + + · paxos leases are fast to acquire and sanlock makes them available to + applications as general purpose resource leases. The disk paxos + algorithm uses host_id's internally to represent different hosts, and + the owner of a paxos lease. delta leases provide unique host_id's + for implementing paxos leases, and delta lease renewals serve as a + proxy for paxos lease renewal. ("Disk Paxos", Eli Gafni and Leslie + Lamport.) + + Externally, the sanlock daemon exposes a locking interface through lib‐ + sanlock in terms of "lockspaces" and "resources". A lockspace is a + locking context that an application creates for itself on shared stor‐ + age. When the application on each host is started, it "joins" the + lockspace. It can then create "resources" on the shared storage. Each + resource represents an application-specific entity. The application + can acquire and release leases on resources. + + To use sanlock from an application: + + · Allocate shared storage for an application, e.g. a shared LUN or LV + from a SAN, or files from NFS. + + · Provide the storage to the application. + + · The application uses this storage with libsanlock to create a + lockspace and resources for itself. + + · The application joins the lockspace when it starts. + + · The application acquires and releases leases on resources. + + How lockspaces and resources translate to delta leases and paxos leases + within sanlock: + + Lockspaces + + · A lockspace is based on delta leases held by each host using the + lockspace. + + · A lockspace is a series of 2000 delta leases on disk, and requires + 1MB of storage. + + · A lockspace can support up to 2000 concurrent hosts using it, each + using a different delta lease. + + · Applications can i) create, ii) join and iii) leave a lockspace, + which corresponds to i) initializing the set of delta leases on disk, + ii) acquiring one of the delta leases and iii) releasing the delta + lease. + + · When a lockspace is created, a unique lockspace name and disk loca‐ + tion is provided by the application. + + · When a lockspace is created/initialized, sanlock formats the sequence + of 2000 on-disk delta lease structures on the file or disk, e.g. + /mnt/leasefile (NFS) or /dev/vg/lv (SAN). + + · The 2000 individual delta leases in a lockspace are identified by + number: 1,2,3,...,2000. + + · Each delta lease is a 512 byte sector in the 1MB lockspace, offset by + its number, e.g. delta lease 1 is offset 0, delta lease 2 is offset + 512, delta lease 2000 is offset 1023488. + + · When an application joins a lockspace, it must specify the lockspace + name, the lockspace location on shared disk/file, and the local + host's host_id. sanlock then acquires the delta lease corresponding + to the host_id, e.g. joining the lockspace with host_id 1 acquires + delta lease 1. + + · The terms delta lease, lockspace lease, and host_id lease are used + interchangably. + + · sanlock acquires a delta lease by writing the host's unique name to + the delta lease disk sector, reading it back after a delay, and veri‐ + fying it is the same. + + · If a unique host name is not specified, sanlock generates a uuid to + use as the host's name. The delta lease algorithm depends on hosts + using unique names. + + · The application on each host should be configured with a unique + host_id, where the host_id is an integer 1-2000. + + · If hosts are misconfigured and have the same host_id, the delta lease + algorithm is designed to detect this conflict, and only one host will + be able to acquire the delta lease for that host_id. + + · A delta lease ensures that a lockspace host_id is being used by a + single host with the unique name specified in the delta lease. + + · Resolving delta lease conflicts is slow, because the algorithm is + based on waiting and watching for some time for other hosts to write + to the same delta lease sector. If multiple hosts try to use the + same delta lease, the delay is increased substantially. So, it is + best to configure applications to use unique host_id's that will not + conflict. + + · After sanlock acquires a delta lease, the lease must be renewed until + the application leaves the lockspace (which corresponds to releasing + the delta lease on the host_id.) + + · sanlock renews delta leases every 20 seconds (by default) by writing + a new timestamp into the delta lease sector. + + · When a host acquires a delta lease in a lockspace, it can be referred + to as "joining" the lockspace. Once it has joined the lockspace, it + can use resources associated with the lockspace. + + Resources + + · A lockspace is a context for resources that can be locked and + unlocked by an application. + + · sanlock uses paxos leases to implement leases on resources. The + terms paxos lease and resource lease are used interchangably. + + · A paxos lease exists on shared storage and requires 1MB of space. It + contains a unique resource name and the name of the lockspace. + + · An application assigns its own meaning to a sanlock resource and the + leases on it. A sanlock resource could represent some shared object + like a file, or some unique role among the hosts. + + · Resource leases are associated with a specific lockspace and can only + be used by hosts that have joined that lockspace (they are holding a + delta lease on a host_id in that lockspace.) + + · An application must keep track of the disk locations of its + lockspaces and resources. sanlock does not maintain any persistent + index or directory of lockspaces or resources that have been created + by applications, so applications need to remember where they have + placed their own leases (which files or disks and offsets). + + · sanlock does not renew paxos leases directly (although it could). + Instead, the renewal of a host's delta lease represents the renewal + of all that host's paxos leases in the associated lockspace. In + effect, many paxos lease renewals are factored out into one delta + lease renewal. This reduces i/o when many paxos leases are used. + + · The disk paxos algorithm allows multiple hosts to all attempt to + acquire the same paxos lease at once, and will produce a single win‐ + ner/owner of the resource lease. (Shared resource leases are also + possible in addition to the default exclusive leases.) + + · The disk paxos algorithm involves a specific sequence of reading and + writing the sectors of the paxos lease disk area. Each host has a + dedicated 512 byte sector in the paxos lease disk area where it + writes its own "ballot", and each host reads the entire disk area to + see the ballots of other hosts. The first sector of the disk area is + the "leader record" that holds the result of the last paxos ballot. + The winner of the paxos ballot writes the result of the ballot to the + leader record (the winner of the ballot may have selected another + contending host as the owner of the paxos lease.) + + · After a paxos lease is acquired, no further i/o is done in the paxos + lease disk area. + + · Releasing the paxos lease involves writing a single sector to clear + the current owner in the leader record. + + · If a host holding a paxos lease fails, the disk area of the paxos + lease still indicates that the paxos lease is owned by the failed + host. If another host attempts to acquire the paxos lease, and finds + the lease is held by another host_id, it will check the delta lease + of that host_id. If the delta lease of the host_id is being renewed, + then the paxos lease is owned and cannot be acquired. If the delta + lease of the owner's host_id has expired, then the paxos lease is + expired and can be taken (by going through the paxos lease algo‐ + rithm.) + + · The "interaction" or "awareness" between hosts of each other is lim‐ + ited to the case where they attempt to acquire the same paxos lease, + and need to check if the referenced delta lease has expired or not. + + · When hosts do not attempt to lock the same resources concurrently, + there is no host interaction or awareness. The state or actions of + one host have no effect on others. + + · To speed up checking delta lease expiration (in the case of a paxos + lease conflict), sanlock keeps track of past renewals of other delta + leases in the lockspace. + + Expiration + + · If a host fails to renew its delta lease, e.g. it looses access to + the storage, its delta lease will eventually expire and another host + will be able to take over any resource leases held by the host. san‐ + lock must ensure that the application on two different hosts is not + holding and using the same lease concurrently. + + · When sanlock has failed to renew a delta lease for a period of time, + it will begin taking measures to stop local processes (applications) + from using any resource leases associated with the expiring lockspace + delta lease. sanlock enters this "recovery mode" well ahead of the + time when another host could take over the locally owned leases. + sanlock must have sufficient time to stop all local processes that + are using the expiring leases. + + · sanlock uses three methods to stop local processes that are using + expiring leases: + + 1. Graceful shutdown. sanlock will execute a "graceful shutdown" + program that the application previously specified for this case. The + shutdown program tells the application to shut down because its + leases are expiring. The application must respond by stopping its + activities and releasing its leases (or exit). If an application + does not specify a graceful shutdown program, sanlock sends SIGTERM + to the process instead. The process must release its leases or exit + in a prescribed amount of time (see -g), or sanlock proceeds to the + next method of stopping. + + 2. Forced shutdown. sanlock will send SIGKILL to processes using the + expiring leases. The processes have a fixed amount of time to exit + after receiving SIGKILL. If any do not exit in this time, sanlock + will proceed to the next method. + + 3. Host reset. sanlock will trigger the host's watchdog device to + forcibly reset it. sanlock carefully manages the timing of the + watchdog device so that it fires shortly before any other host could + take over the resource leases held by local processes. + + Failures + + If a process holding resource leases fails or exits without releasing + its leases, sanlock will release the leases for it automatically + (unless persistent resource leases were used.) + + If the sanlock daemon cannot renew a lockspace delta lease for a spe‐ + cific period of time (see Expiration), sanlock will enter "recovery + mode" where it attempts to stop and/or kill any processes holding + resource leases in the expiring lockspace. If the processes do not + exit in time, sanlock will force the host to be reset using the local + watchdog device. + + If the sanlock daemon crashes or hangs, it will not renew the expiry + time of the per-lockspace connections it had to the wdmd daemon. This + will lead to the expiration of the local watchdog device, and the host + will be reset. + + Watchdog + + sanlock uses the wdmd(8) daemon to access /dev/watchdog. wdmd multi‐ + plexes multiple timeouts onto the single watchdog timer. This is + required because delta leases for each lockspace are renewed and expire + independently. + + sanlock maintains a wdmd connection for each lockspace delta lease + being renewed. Each connection has an expiry time for some seconds in + the future. After each successful delta lease renewal, the expiry time + is renewed for the associated wdmd connection. If wdmd finds any con‐ + nection expired, it will not renew the /dev/watchdog timer. Given + enough successive failed renewals, the watchdog device will fire and + reset the host. (Given the multiplexing nature of wdmd, shorter over‐ + lapping renewal failures from multiple lockspaces could cause spurious + watchdog firing.) + + The direct link between delta lease renewals and watchdog renewals pro‐ + vides a predictable watchdog firing time based on delta lease renewal + timestamps that are visible from other hosts. sanlock knows the time + the watchdog on another host has fired based on the delta lease time. + Furthermore, if the watchdog device on another host fails to fire when + it should, the continuation of delta lease renewals from the other host + will make this evident and prevent leases from being taken from the + failed host. + + If sanlock is able to stop/kill all processing using an expiring + lockspace, the associated wdmd connection for that lockspace is + removed. The expired wdmd connection will no longer block /dev/watch‐ + dog renewals, and the host should avoid being reset. + + Storage + + On devices with 512 byte sectors, lockspaces and resources are 1MB in + size. On devices with 4096 byte sectors, lockspaces and resources are + 8MB in size. sanlock uses 512 byte sectors when shared files are used + in place of shared block devices. Offsets of leases or resources must + be multiples of 1MB/8MB according to the sector size. + + Using sanlock on shared block devices that do host based mirroring or + replication is not likely to work correctly. When using sanlock on + shared files, all sanlock io should go to one file server. + + Example + + This is an example of creating and using lockspaces and resources from + the command line. (Most applications would use sanlock through libsan‐ + lock rather than through the command line.) + + 1. Allocate shared storage for sanlock leases. + + This example assumes 512 byte sectors on the device, in which case + the lockspace needs 1MB and each resource needs 1MB. + + # vgcreate vg /dev/sdb + # lvcreate -n leases -L 1GB vg + + 2. Start sanlock on all hosts. + + The -w 0 disables use of the watchdog for testing. + + # sanlock daemon -w 0 + + 3. Start a dummy application on all hosts. + + This sanlock command registers with sanlock, then execs the sleep + command which inherits the registered fd. The sleep process acts + as the dummy application. Because the sleep process is registered + with sanlock, leases can be acquired for it. + + # sanlock client command -c /bin/sleep 600 & + + 4. Create a lockspace for the application (from one host). + + The lockspace is named "test". + + # sanlock client init -s test:0:/dev/test/leases:0 + + 5. Join the lockspace for the application. + + Use a unique host_id on each host. + + host1: + # sanlock client add_lockspace -s test:1/dev/vg/leases:0 + host2: + # sanlock client add_lockspace -s test:2/dev/vg/leases:0 + + 6. Create two resources for the application (from one host). + + The resources are named "RA" and "RB". Offsets are used on the + same device as the lockspace. Different LVs or files could also be + used. + + # sanlock client init -r test:RA:/dev/vg/leases:1048576 + # sanlock client init -r test:RB:/dev/vg/leases:2097152 + + 7. Acquire resource leases for the application on host1. + + Acquire an exclusive lease (the default) on the first resource, and + a shared lease (SH) on the second resource. + + # export P=`pidof sleep` + # sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P + # sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P + + 8. Acquire resource leases for the application on host2. + + Acquiring the exclusive lease on the first resource will fail + because it is held by host1. Acquiring the shared lease on the + second resource will succeed. + + # export P=`pidof sleep` + # sanlock client acquire -r test:RA:/dev/vg/leases:1048576 -p $P + # sanlock client acquire -r test:RB:/dev/vg/leases:2097152:SH -p $P + + 9. Release resource leases for the application on both hosts. + + The sleep pid could also be killed, which will result in the san‐ + lock daemon releasing its leases when it exits. + + # sanlock client release -r test:RA:/dev/vg/leases:1048576 -p $P + # sanlock client release -r test:RB:/dev/vg/leases:2097152 -p $P + + 10. Leave the lockspace for the application. + + host1: + # sanlock client rem_lockspace -s test:1/dev/vg/leases:0 + host2: + # sanlock client rem_lockspace -s test:2/dev/vg/leases:0 + + 11. Stop sanlock on all hosts. + + # sanlock shutdown + +OPTIONS + COMMAND can be one of three primary top level choices + + sanlock daemon start daemon + sanlock client send request to daemon (default command if none given) + sanlock direct access storage directly (no coordination with daemon) + + Daemon Command + sanlock daemon [options] + + -D no fork and print all logging to stderr + + -Q 0|1 quiet error messages for common lock contention + + -R 0|1 renewal debugging, log debug info for each renewal + + -L pri write logging at priority level and up to logfile (-1 none) + + -S pri write logging at priority level and up to syslog (-1 none) + + -U uid user id + + -G gid group id + + -t num max worker threads + + -g sec seconds for graceful recovery + + -w 0|1 use watchdog through wdmd + + -h 0|1 use high priority (RR) scheduling + + -l num use mlockall (0 none, 1 current, 2 current and future) + + -b sec seconds a host id bit will remain set in delta lease bitmap + + -e str local host name used in delta leases + + Client Command + sanlock client action [options] + + sanlock client status + + Print processes, lockspaces, and resources being managed by the sanlock + daemon. Add -D to show extra internal daemon status for debugging. + Add -o p to show resources by pid, or -o s to show resources by + lockspace. + + sanlock client host_status + + Print state of host_id delta leases read during the last renewal. + State of all lockspaces is shown (use -s to select one). Add -D to + show extra internal daemon status for debugging. + + sanlock client gets + + Print lockspaces being managed by the sanlock daemon. The LOCKSPACE + string will be followed by ADD or REM if the lockspace is currently + being added or removed. Add -h 1 to also show hosts in each lockspace. + + sanlock client renewal -s LOCKSPACE + + Print a history of renewals with timing details. See the Renewal his‐ + tory section below. + + sanlock client log_dump + + Print the sanlock daemon internal debug log. + + sanlock client shutdown + + Ask the sanlock daemon to exit. Without the force option (-f 0), the + command will be ignored if any lockspaces exist. With the force option + (-f 1), any registered processes will be killed, their resource leases + released, and lockspaces removed. With the wait option (-w 1), the + command will wait for a result from the daemon indicating that it has + shut down and is exiting, or cannot shut down because lockspaces exist + (command fails). + + sanlock client init -s LOCKSPACE + + Tell the sanlock daemon to initialize a lockspace on disk. The -o + option can be used to specify the io timeout to be written in the + host_id leases. (Also see sanlock direct init.) + + sanlock client init -r RESOURCE + + Tell the sanlock daemon to initialize a resource lease on disk. (Also + see sanlock direct init.) + + sanlock client read -s LOCKSPACE + + Tell the sanlock daemon to read a lockspace from disk. Only the + LOCKSPACE path and offset are required. If host_id is zero, the first + record at offset (host_id 1) is used. The complete LOCKSPACE and io + timeout are printed. + + sanlock client read -r RESOURCE + + Tell the sanlock daemon to read a resource lease from disk. Only the + RESOURCE path and offset are required. The complete RESOURCE is + printed. (Also see sanlock direct read_leader.) + + sanlock client align -s LOCKSPACE + + Tell the sanlock daemon to report the required lease alignment for a + storage path. Only path is used from the LOCKSPACE argument. + + sanlock client add_lockspace -s LOCKSPACE + + Tell the sanlock daemon to acquire the specified host_id in the + lockspace. This will allow resources to be acquired in the lockspace. + The -o option can be used to specify the io timeout of the acquiring + host, and will be written in the host_id lease. + + sanlock client inq_lockspace -s LOCKSPACE + + Inquire about the state of the lockspace in the sanlock daemon, whether + it is being added or removed, or is joined. + + sanlock client rem_lockspace -s LOCKSPACE + + Tell the sanlock daemon to release the specified host_id in the + lockspace. Any processes holding resource leases in this lockspace + will be killed, and the resource leases not released. + + sanlock client command -r RESOURCE -c path args + + Register with the sanlock daemon, acquire the specified resource lease, + and exec the command at path with args. When the command exits, the + sanlock daemon will release the lease. -c must be the final option. + + sanlock client acquire -r RESOURCE -p pid + sanlock client release -r RESOURCE -p pid + + Tell the sanlock daemon to acquire or release the specified resource + lease for the given pid. The pid must be registered with the sanlock + daemon. acquire can optionally take a versioned RESOURCE string + RESOURCE:lver, where lver is the version of the lease that must be + acquired, or fail. + + sanlock client convert -r RESOURCE -p pid + + Tell the sanlock daemon to convert the mode of the specified resource + lease for the given pid. If the existing mode is exclusive (default), + the mode of the lease can be converted to shared with RESOURCE:SH. If + the existing mode is shared, the mode of the lease can be converted to + exclusive with RESOURCE (no :SH suffix). + + sanlock client inquire -p pid + + Print the resource leases held the given pid. The format is a ver‐ + sioned RESOURCE string "RESOURCE:lver" where lver is the version of the + lease held. + + sanlock client request -r RESOURCE -f force_mode + + Request the owner of a resource do something specified by force_mode. + A versioned RESOURCE:lver string must be used with a greater version + than is presently held. Zero lver and force_mode clears the request. + + sanlock client examine -r RESOURCE + + Examine the request record for the currently held resource lease and + carry out the action specified by the requested force_mode. + + sanlock client examine -s LOCKSPACE + + Examine requests for all resource leases currently held in the named + lockspace. Only lockspace_name is used from the LOCKSPACE argument. + + sanlock client set_event -s LOCKSPACE -i host_id -g gen -e num -d num + + Set an event for another host. When the sanlock daemon next renews its + delta lease for the lockspace it will: set the bit for the host_id in + its bitmap, and set the generation, event and data values in its own + delta lease. An application that has registered for events from this + lockspace on the destination host will get the event that has been set + when the destination sees the event during its next delta lease + renewal. + + sanlock client set_config -s LOCKSPACE + + Set a configuration value for a lockspace. Only lockspace_name is used + from the LOCKSPACE argument. The USED flag has the same effect on a + lockspace as a process holding a resource lease that will not exit. + The USED_BY_ORPHANS flag means that an orphan resource lease will have + the same effect as the USED. + -u 0|1 Set (1) or clear (0) the USED flag. + -O 0|1 Set (1) or clear (0) the USED_BY_ORPHANS flag. + + Direct Command + sanlock direct action [options] + + -o sec io timeout in seconds + + sanlock direct init -s LOCKSPACE + sanlock direct init -r RESOURCE + + Initialize storage for 2000 host_id (delta) leases for the given + lockspace, or initialize storage for one resource (paxos) lease. Both + options require 1MB of space. The host_id in the LOCKSPACE string is + not relevant to initialization, so the value is ignored. (The default + of 2000 host_ids can be changed for special cases using the -n + num_hosts and -m max_hosts options.) With -s, the -o option specifies + the io timeout to be written in the host_id leases. With -r, the -z 1 + option invalidates the resource lease on disk so it cannot be used + until reinitialized normally. + + sanlock direct read_leader -s LOCKSPACE + sanlock direct read_leader -r RESOURCE + + Read a leader record from disk and print the fields. The leader record + is the single sector of a delta lease, or the first sector of a paxos + lease. + + sanlock direct dump path[:offset[:size]] + + Read disk sectors and print leader records for delta or paxos leases. + Add -f 1 to print the request record values for paxos leases, and + host_ids set in delta lease bitmaps. + + LOCKSPACE option string + -s lockspace_name:host_id:path:offset + + lockspace_name name of lockspace + host_id local host identifier in lockspace + path path to storage reserved for leases + offset offset on path (bytes) + + RESOURCE option string + -r lockspace_name:resource_name:path:offset + + lockspace_name name of lockspace + resource_name name of resource + path path to storage reserved for leases + offset offset on path (bytes) + + RESOURCE option string with suffix + -r lockspace_name:resource_name:path:offset:lver + + lver leader version + + -r lockspace_name:resource_name:path:offset:SH + + SH indicates shared mode + + Defaults + sanlock help shows the default values for the options above. + + sanlock version shows the build version. + +OTHER + Request/Examine + The first part of making a request for a resource is writing the + request record of the resource (the sector following the leader + record). To make a successful request: + + · RESOURCE:lver must be greater than the lver presently held by the + other host. This implies the leader record must be read to discover + the lver, prior to making a request. + + · RESOURCE:lver must be greater than or equal to the lver presently + written to the request record. Two hosts may write a new request at + the same time for the same lver, in which case both would succeed, + but the force_mode from the last would win. + + · The force_mode must be greater than zero. + + · To unconditionally clear the request record (set both lver and + force_mode to 0), make request with RESOURCE:0 and force_mode 0. + + The owner of the requested resource will not know of the request unless + it is explicitly told to examine its resources via the "examine" + api/command, or otherwise notfied. + + The second part of making a request is notifying the resource lease + owner that it should examine the request records of its resource + leases. The notification will cause the lease owner to automatically + run the equivalent of "sanlock client examine -s LOCKSPACE" for the + lockspace of the requested resource. + + The notification is made using a bitmap in each host_id delta lease. + Each bit represents each of the possible host_ids (1-2000). If host A + wants to notify host B to examine its resources, A sets the bit in its + own bitmap that corresponds to the host_id of B. When B next renews + its delta lease, it reads the delta leases for all hosts and checks + each bitmap to see if its own host_id has been set. It finds the bit + for its own host_id set in A's bitmap, and examines its resource + request records. (The bit remains set in A's bitmap for set_bit‐ + map_seconds.) + + force_mode determines the action the resource lease owner should take: + + · FORCE (1): kill the process holding the resource lease. When the + process has exited, the resource lease will be released, and can then + be acquired by anyone. The kill signal is SIGKILL (or SIGTERM if + SIGKILL is restricted.) + + · GRACEFUL (2): run the program configured by sanlock_killpath against + the process holding the resource lease. If no killpath is defined, + then FORCE is used. + + Persistent and orphan resource leases + A resource lease can be acquired with the PERSISTENT flag (-P 1). If + the process holding the lease exits, the lease will not be released, + but kept on an orphan list. Another local process can acquire an + orphan lease using the ORPHAN flag (-O 1), or release the orphan lease + using the ORPHAN flag (-O 1). All orphan leases can be released by + setting the lockspace name (-s lockspace_name) with no resource name. + + Renewal history + sanlock saves a limited history of lease renewal information in each + lockspace. See sanlock.conf renewal_history_size to set the amount of + history or to disable (set to 0). + + IO times are measured in delta lease renewal (each delta lease renewal + includes one read and one write). + + For each successful renewal, a record is saved that includes: + + · the timestamp written in the delta lease by the renewal + + · the time in milliseconds taken by the delta lease read + + · the time in milliseconds taken by the delta lease write + + Also counted and recorded are the number io timeouts and other io + errors that occur between successful renewals. + + Two consecutive successful renewals would be recorded as: + timestamp=5332 read_ms=482 write_ms=5525 next_timeouts=0 next_errors=0 + timestamp=5353 read_ms=99 write_ms=3161 next_timeouts=0 next_errors=0 + + Those fields are: + + · timestamp is the value written into the delta lease during that + renewal. + + · read_ms/write_ms are the milliseconds taken for the renewal + read/write ios. + + · next_timeouts are the number of io timeouts that occured after the + renewal recorded on that line, and before the next successful renewal + on the following line. + + · next_errors are the number of io errors (not timeouts) that occured + after renewal recorded on that line, and before the next successful + renewal on the following line. + + The command 'sanlock client renewal -s lockspace_name' reports the full + history of renewals saved by sanlock, which by default is 180 records, + about 1 hour of history when using a 20 second renewal interval for a + 10 second io timeout. + +INTERNALS + Disk Format + · This example uses 512 byte sectors. + + · Each lockspace is 1MB. It holds 2000 delta_leases, one per sector, + supporting up to 2000 hosts. + + · Each paxos_lease is 1MB. It is used as a lease for one resource. + + · The leader_record structure is used differently by each lease type. + + · To display all leader_record fields, see sanlock direct read_leader. + + · A lockspace is often followed on disk by the paxos_leases used within + that lockspace, but this layout is not required. + + · The request_record and host_id bitmap are used for requests/events. + + · The mode_block contains the SHARED flag indicating a lease is held in + the shared mode. + + · In a lockspace, the host using host_id N writes to a single + delta_lease in sector N-1. No other hosts write to this sector. All + hosts read all lockspace sectors when renewing their own delta_lease, + and are able to monitor renewals of all delta_leases. + + · In a paxos_lease, each host has a dedicated sector it writes to, con‐ + taining its own paxos_dblock and mode_block structures. Its sector + is based on its host_id; host_id 1 writes to the dblock/mode_block in + sector 2 of the paxos_lease. + + · The paxos_dblock structures are used by the paxos_lease algorithm, + and the result is written to the leader_record. + + 0x000000 lockspace foo:0:/path:0 + + (There is no representation on disk of the lockspace in general, only + the sequence of specific delta_leases which collectively represent the + lockspace.) + + delta_lease foo:1:/path:0 + 0x000 0 leader_record (sector 0, for host_id 1) + magic: 0x12212010 + space_name: foo + resource_name: host uuid/name + ... + host_id bitmap (leader_record + 256) + + delta_lease foo:2:/path:0 + 0x200 512 leader_record (sector 1, for host_id 2) + magic: 0x12212010 + space_name: foo + resource_name: host uuid/name + ... + host_id bitmap (leader_record + 256) + + delta_lease foo:3:/path:0 + 0x400 1024 leader_record (sector 2, for host_id 3) + magic: 0x12212010 + space_name: foo + resource_name: host uuid/name + ... + host_id bitmap (leader_record + 256) + + delta_lease foo:2000:/path:0 + 0xF9E00 leader_record (sector 1999, for host_id 2000) + magic: 0x12212010 + space_name: foo + resource_name: host uuid/name + ... + host_id bitmap (leader_record + 256) + + 0x100000 paxos_lease foo:example1:/path:1048576 + 0x000 0 leader_record (sector 0) + magic: 0x06152010 + space_name: foo + resource_name: example1 + + 0x200 512 request_record (sector 1) + magic: 0x08292011 + + 0x400 1024 paxos_dblock (sector 2, for host_id 1) + 0x480 1152 mode_block (paxos_dblock + 128) + + 0x600 1536 paxos_dblock (sector 3, for host_id 2) + 0x680 1664 mode_block (paxos_dblock + 128) + + 0x800 2048 paxos_dblock (sector 4, for host_id 3) + 0x880 2176 mode_block (paxos_dblock + 128) + + 0xFA200 paxos_dblock (sector 2001, for host_id 2000) + 0xFA280 mode_block (paxos_dblock + 128) + + 0x200000 paxos_lease foo:example2:/path:2097152 + 0x000 0 leader_record (sector 0) + magic: 0x06152010 + space_name: foo + resource_name: example2 + + 0x200 512 request_record (sector 1) + magic: 0x08292011 + + 0x400 1024 paxos_dblock (sector 2, for host_id 1) + 0x480 1152 mode_block (paxos_dblock + 128) + + 0x600 1536 paxos_dblock (sector 3, for host_id 2) + 0x680 1664 mode_block (paxos_dblock + 128) + + 0x800 2048 paxos_dblock (sector 4, for host_id 3) + 0x880 2176 mode_block (paxos_dblock + 128) + + 0xFA200 paxos_dblock (sector 2001, for host_id 2000) + 0xFA280 mode_block (paxos_dblock + 128) + + Lease ownership + Not shown in the leader_record structures above are the owner_id, + owner_generation and timestamp fields. These are the fields that + define the lease owner. + + The delta_lease at sector N for host_id N+1 has leader_record.owner_id + N+1. The leader_record.owner_generation is incremented each time the + delta_lease is acquired. When a delta_lease is acquired, the + leader_record.timestamp field is set to the time of the host and the + leader_record.resource_name is set to the unique name of the host. + When the host renews the delta_lease, it writes a new + leader_record.timestamp. When a host releases a delta_lease, it writes + zero to leader_record.timestamp. + + When a host acquires a paxos_lease, it uses the host_id/generation + value from the delta_lease it holds in the lockspace. It uses this + host_id/generation to identify itself in the paxos_dblock when running + the paxos algorithm. The result of the algorithm is the winning + host_id/generation - the new owner of the paxos_lease. The winning + host_id/generation are written to the paxos_lease + leader_record.owner_id and leader_record.owner_generation fields and + leader_record.timestamp is set. When a host releases a paxos_lease, it + sets leader_record.timestamp to 0. + + When a paxos_lease is free (leader_record.timestamp is 0), multiple + hosts may attempt to acquire it. The paxos algorithm, using the + paxos_dblock structures, will select only one of the hosts as the new + owner, and that owner is written in the leader_record. The paxos_lease + will no longer be free (non-zero timestamp). Other hosts will see this + and will not attempt to acquire the paxos_lease until it is free again. + + If a paxos_lease is owned (non-zero timestamp), but the owner has not + renewed its delta_lease for a specific length of time, then the owner + value in the paxos_lease becomes expired, and other hosts will use the + paxos algorithm to acquire the paxos_lease, and set a new owner. + +FILES + /etc/sanlock/sanlock.conf + +SEE ALSO + wdmd(8) + + 2015-01-23 SANLOCK(8) +``` diff --git a/README.rst b/README.rst deleted file mode 100644 index c197d46..0000000 --- a/README.rst +++ /dev/null @@ -1,3 +0,0 @@ -| ``See https://pagure.io/sanlock -| ``See sanlock(8) at sanlock.git/src/sanlock.8`` -| ``See wdmd(8) at sanlock.git/wdmd/wdmd.8``