samba4 cluster for AD : DRBD ocfs2 CTDB

With proper DC fail-over just around the corner, we turn our attention to that most neglected of forest workhorses: the file server. We've always wanted another file server for our domain, mainly for peace of mind. DFS was no good in a domain with Windows and Linux workstations. It just had to be HA.

In true open source style, getting started in HA is strictly for rocket scientists only. As for documentation, the plain English DRBD guide is excellent; elsewhere, what isn't out of date has bits missing and assumes essential detail. So beware. The learning curve is steep and beginner support almost non-existent. Anyway, let's have a go. If you can build a Samba4 DC from source, you can do this, it says here.

The aim is a 2 node cluster. Both nodes up. One fails, the other takes over. Simple.

All cluster articles have to have an unintelligible diagram with lines, arrows and IP addresses all over the place. So here's the only one which made any sense to us; from the DRBD link.

easy to understand 2 node cluster

As we understand it, we have smbd running on node 1. If that fails, both the ip and the smbd process will be transferred to node 2. We have a mirrored disk between the two.

dramatis personae
- an AD domain
- 2 spare computers
- 4 network interface cards
- a crossover cable
- 2 straight cables
- a table with everything close by
- 2 spare disks

Connect the first network card on each node to the switch using the straight cables. Connect the crossover cable between the second pair of cards. Some cards detect the type of cable you have attached, others don't. For this post, and to obviate the need for me to get the screwdriver out again, I've done it with vms.

Each node has 2 physical ethernet cards. One on each node is LAN side. These carry both the physical IP for the node on that subnet and the public IPs, allocated by CTDB, which are used as the IPs for the cluster itself. The other pair carry the synchronisation traffic on a private subnet and are connected with the crossover cable. Both nodes run smbd and winbind and are configured with the same netbios name. A third, bonding interface must be configured to load balance the actual IP of the node and the domain IPs which the workstations use when requesting data. The slave interface for the bond is the card out to the LAN. Unlike the card on the internal crossover network, this card is not configured with an IP at operating system level. I think the term is, 'the physical interface is enslaved in the bonded interface'.

We shall join the cluster to the following domain:
DC: hh16,

cluster hostnames and addresses
node 1: smb1
node 2: smb2


The 1.x subnet goes out to the LAN, the 0.x is the private crossover.

The physical interfaces are enp0s3 and enp0s8, with enp0s3 enslaved in bond0. bond0 carries the fail-over IP, allocated by CTDB, out to the domain.

This is one place where openSUSE's Yast really saves you time and energy, otherwise it's back to editing files and wondering about the syntax.
On vBox, add another network adapter, connect the cable and add a 2 GB disk.

Kill the firewall and apparmor. Work out the ports and files later.

We're on openSUSE 13.1. Both nodes have un-provisioned Samba 4.1.9. You need DRBD 8.4. The documentation for 8.3 suggests that this is sufficiently different to matter. On openSUSE 13.1, this meant adding the ha and network:samba repositories. We had problems with CTDB 2.3, but 2.5.3 from the same repo went in fine. For the file system, forget ext4; it just crashes spectacularly. ocfs2 works fine and screams between the nodes with anything you can throw at it. You also need ocfs2-tools.

identical ON BOTH nodes unless specified otherwise:

set the only dns to contain:
search hh3.site

[libdefaults]
        default_realm = HH3.SITE
        dns_lookup_realm = false
        dns_lookup_kdc = true
        default_ccache_name = /tmp/krb5cc_%{uid}

/etc/hosts localhost.localdomain localhost smb1 smb2

global {
  usage-count yes;
}
common {
  net {
    protocol C;
  }
}

resource r0 {
  handlers {
    split-brain "/usr/lib/drbd/notify-split-brain.sh steve";
  }
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  startup {
    become-primary-on both;
  }
  on smb1 {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    meta-disk internal;
  }
  on smb2 {
    device    /dev/drbd1;
    disk      /dev/sdb1;
    meta-disk internal;
  }
}

(make the folder yourself!)
node:
        ip_port = 7777
        ip_address = 
        number = 1
        name = smb1
        cluster = ocfs2
node:
        ip_port = 7777
        ip_address = 
        number = 2
        name = smb2
        cluster = ocfs2
cluster:
        node_count = 2
        name = ocfs2


 /etc/ctdb/public_addresses bond0 bond0
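
Each line of public_addresses pairs an address and netmask with the interface that should carry it. As a rough sketch, the helper below prints lines in that shape. The helper name, the /24 mask and the two example addresses are our assumptions for illustration only (the addresses simply match the PTR entries 80 and 81 added to 1.168.192.in-addr.arpa further down), so substitute your own.

```shell
#!/bin/sh
# Hypothetical helper: print CTDB public_addresses lines of the form
#   <address>/<prefix> <interface>
# Usage: public_address_lines <interface> <prefix> <address>...
public_address_lines() {
    iface=$1
    prefix=$2
    shift 2
    for addr in "$@"; do
        printf '%s/%s %s\n' "$addr" "$prefix" "$iface"
    done
}

# Example with assumed values. Review the output before putting it in
# /etc/ctdb/public_addresses on both nodes:
#   public_address_lines bond0 24 192.168.1.80 192.168.1.81
```

Printing to the terminal first, rather than writing the file directly, makes it easy to eyeball the result on each node.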

chmod +x the CTDB event scripts we need:
-rwxr--r-- 1 root root  7713 Jul  4 13:06 10.interface
-rwxr--r-- 1 root root  1102 Jul  4 13:06 11.routing
-rwxr--r-- 1 root root  1070 Jul  4 13:06 49.winbind
-rwxr--r-- 1 root root  3491 Jul  4 13:06 50.samba
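
The listing above can be reached with a small loop. This is only a sketch: the function name is ours, and the directory is passed in as an argument because the location of the event scripts depends on your package (/etc/ctdb/events.d is our assumption for CTDB 2.x).

```shell
#!/bin/sh
# Hypothetical helper: make the four event scripts from the listing
# executable for their owner, and warn about any that are missing.
# $1: directory holding the CTDB event scripts
enable_ctdb_events() {
    dir=$1
    for script in 10.interface 11.routing 49.winbind 50.samba; do
        if [ -f "$dir/$script" ]; then
            chmod u+x "$dir/$script"
        else
            echo "missing: $dir/$script" >&2
        fi
    done
}

# Example (assumed path), run as root on both nodes:
#   enable_ctdb_events /etc/ctdb/events.d
```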


[global]
workgroup = HH3
netbios name = SMBCLUSTER
realm = HH3.SITE
security = ADS
kerberos method = secrets only
winbind enum users = Yes
winbind enum groups = Yes
winbind use default domain = Yes
winbind nss info = rfc2307
idmap config * : backend = tdb
idmap config * : range = 19900-19999
idmap config HH3 : backend  = ad
idmap config HH3 : range = 20000-4000000
idmap config HH3 : schema_mode = rfc2307
clustering = Yes
ctdbd socket = /var/lib/ctdb/ctdb.socket

[users]
path = /cluster/users
read only = No

[profiles]
path = /cluster/profiles
read only = No

passwd: files winbind
group:  files winbind
hosts:  files dns

partition the spare disk
use fdisk or yast to end up with a partition, /dev/sdb1 where:
fdisk -l
Disk /dev/sda: 12.9 GB, 12884901888 bytes, 25165824 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000f1bbc

   Device Boot      Start         End      Blocks   Id  System

/dev/sda1            2048     1525759      761856   82  Linux swap / Solaris
/dev/sda2   *     1525760    25165823    11820032   83  Linux

Disk /dev/sdb: 2147 MB, 2147483648 bytes, 4194304 sectors

Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk label type: dos
Disk identifier: 0x000ab2e5

   Device Boot      Start         End      Blocks   Id  System

/dev/sdb1            2048     4194303     2096128   83  Linux

If you are on a vm, clone the machine and disk at this stage.

create the drbd metadata
drbdadm create-md r0
writing meta data...
initializing activity log
NOT initialized bitmap
New drbd meta data block successfully created.

start drbd
drbdadm up r0

start the sync
Bring up drbd on node 2 as well. We chose to synchronise from node 1. The direction only matters if you have data, so be careful: if either of your nodes has data and you have not repartitioned, choose that node as the source, because the other node will be overwritten.
node 1:
drbdadm primary --force r0

Wait until:
cat /proc/drbd
responds with:
version: 8.4.4 (api:1/proto:86-101)
GIT-hash: 3c1f46cb19993f98b22fdf7e18958c21ad75176d build by SuSE Build Service

 1: cs:Connected ro:Secondary/Secondary ds:UpToDate/UpToDate C r-----
monitoring the initial synchronisation
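
Rather than re-running cat by hand, the wait can be scripted. A minimal sketch, assuming the status line keeps the cs:/ds: layout shown above; the function names and the 5 second poll interval are ours.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when the status text shows the
# resource connected with both disks up to date.
# $1: file to read, defaults to /proc/drbd
drbd_synced() {
    grep -q 'cs:Connected .*ds:UpToDate/UpToDate' "${1:-/proc/drbd}"
}

# Poll until the initial synchronisation has finished.
wait_for_drbd() {
    until drbd_synced "$@"; do
        sleep 5
    done
}

# Example, on either node:
#   wait_for_drbd && echo "r0 is in sync"
```

The pattern is deliberately strict: anything other than Connected with UpToDate on both sides keeps the loop waiting.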

set up ocfs2-tools
/etc/init.d/o2cb configure
Configuring the O2CB driver.

This will configure the on-boot properties of the O2CB driver.
The following questions will determine whether the driver is loaded on
boot.  The current values will be shown in brackets ('[]').  Hitting
<ENTER> without typing an answer will keep that current value.  Ctrl-C
will abort.

Load O2CB driver on boot (y/n) [n]: y
Cluster stack backing O2CB [o2cb]: 
Cluster to start on boot (Enter "none" to clear) []: ocfs2
Specify heartbeat dead threshold (>=7) [31]: 
Specify network idle timeout in ms (>=5000) [30000]: 
Specify network keepalive delay in ms (>=1000) [2000]: 
Specify network reconnect delay in ms (>=2000) [2000]: 
Writing O2CB configuration: OK
Loading filesystem "configfs": OK
Loading stack plugin "o2cb": OK
Loading filesystem "ocfs2_dlmfs": OK
Creating directory '/dlm': OK
Mounting ocfs2_dlmfs filesystem at /dlm: OK
Setting cluster stack "o2cb": OK
Registering O2CB cluster "ocfs2": OK
Setting O2CB cluster timeouts : OK

create the file system on node 1
mkfs -t ocfs2 -N 2 -L stevescluster /dev/drbd1
mkfs.ocfs2 1.8.2
Cluster stack: classic o2cb
Label: ocfs2_drbd01
Features: sparse extended-slotmap backup-super unwritten inline-data strict-journal-super xattr indexed-dirs refcount discontig-bg
Block size: 4096 (12 bits)
Cluster size: 4096 (12 bits)
Volume size: 2146332672 (524007 clusters) (524007 blocks)
Cluster groups: 17 (tail covers 7911 clusters, rest cover 32256 clusters)
Extent allocator size: 4194304 (1 groups)
Journal size: 67108864
Node slots: 2
Creating bitmaps: done
Initializing superblock: done
Writing system files: done
Writing superblock: done
Writing backup superblock: 1 block(s)
Formatting Journals: done
Growing extent allocator: done
Formatting slot map: done
Formatting quota files: done
Writing lost+found: done
mkfs.ocfs2 successful

make a mount point for the cluster on both nodes:
mkdir /cluster

become primary on node 1
drbdadm primary r0

mount the cluster on node 1
mount /dev/drbd1 /cluster

make the samba shares:
mkdir /cluster/users
mkdir /cluster/profiles

join the domain from node 1 only:
net ads join -UAdministrator
Enter Administrator's password:
Using short domain name -- HH3
Joined 'SMBCLUSTER' to dns domain 'hh3.site'
Not doing automatic DNS update in a clustered setup.

on the dc, add the round robin DNS entries
samba-tool dns add hh16 hh3.site smbcluster A
samba-tool dns add hh16 hh3.site smbcluster A
samba-tool dns add hh16 1.168.192.in-addr.arpa 80 PTR smbcluster
samba-tool dns add hh16 1.168.192.in-addr.arpa 81 PTR smbcluster
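
To confirm the round robin took, count the A records that come back for the cluster name. A sketch only: it assumes the host utility is installed and relies on its usual "has address" output lines; the function name is ours.

```shell
#!/bin/sh
# Hypothetical helper: count A records in `host` output read from stdin.
# The trailing `|| true` keeps the exit status clean when the count is 0.
count_a_records() {
    grep -c ' has address ' || true
}

# Example, from any domain member:
#   host smbcluster.hh3.site | count_a_records
# With two public addresses registered we would expect a count of 2.
```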

on node 1, start CTDB (hold on tight!)
systemctl start ctdb && ctdb enable

on the DC, create a domain user, stevec
samba-tool user add stevec
then edit him:
ldbedit -e joe --url=/usr/local/samba/private/sam.ldb cn=stevec
to contain these attributes:
uidNumber: 3000092
gidNumber: 20513
unixHomeDirectory: /home/users/stevec
loginShell: /bin/bash
homeDrive: Z:
homeDirectory: \\smbcluster\users\stevec
profilePath: \\smbcluster\profiles\stevec

set the permissions on the profiles share
chmod 1777 /cluster/profiles

with both nodes primary, mount the disk /dev/drbd1. Here is node 1:
drbdadm up r0
drbdadm primary r0
mount /dev/drbd1 /cluster
mount | grep drbd
/dev/drbd1 on /cluster type ocfs2 (rw,relatime,_netdev,heartbeat=local,nointr,data=ordered,errors=remount-ro,atime_quantum=60,coherency=full,user_xattr,acl)

test the synchronisation by creating a file on node 1 and checking that it appears on node 2, and vice versa
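
One way to run that check without guessing at file names is a marker file per node. A small sketch, with function and file names that are ours: write the marker on one node, then look for it from the other.

```shell
#!/bin/sh
# Hypothetical helper: drop a marker file named after this node.
# $1: mount point of the shared file system, e.g. /cluster
write_marker() {
    mnt=$1
    echo "written by $(uname -n)" > "$mnt/sync-test.$(uname -n)"
}

# Hypothetical helper: succeed if the peer's marker is visible and non-empty.
# $1: mount point, $2: host name of the node that wrote the marker
check_marker() {
    mnt=$1
    peer=$2
    [ -s "$mnt/sync-test.$peer" ]
}

# Example:
#   on smb1:  write_marker /cluster
#   on smb2:  check_marker /cluster smb1 && echo "smb1's file is here"
# then swap the roles, and remove the sync-test.* files afterwards.
```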

start ctdb
systemctl start ctdb

tail the logs in a second root terminal:
tail -f /var/log/messages

on node 2:
ctdb enable
on node 1:
ctdb enable

node 1 tail:
2014/07/13 08:52:52.755117 [recoverd: 2779]: Takeover run starting
2014/07/13 08:52:53.411916 [ 2614]: Takeover of IP on interface bond0
2014-07-13T08:52:54.300701+02:00 smb1 avahi-daemon[384]: Registering new address record for on bond0.IPv4.
2014/07/13 08:52:55.241254 [recoverd: 2779]: Takeover run completed successfully

node 2 responds:
2014/07/13 08:52:30.341070 [recoverd: 2775]: Reenabling takeover runs
2014/07/13 08:52:50.977141 [recoverd: 2775]: Node 0 has changed flags - now 0x0  was 0x4
2014/07/13 08:52:52.750525 [recoverd: 2775]: Disabling takeover runs for 60 seconds
2014/07/13 08:52:52.772956 [ 2611]: Release of IP on interface bond0  node:0
2014/07/13 08:52:52.934313 [ 2611]: 10.interface: flock: failed to execute /sbin/iptables: No such file or directory
2014-07-13T08:52:53.272856+02:00 smb2 avahi-daemon[382]: Withdrawing address record for on bond0.
2014/07/13 08:52:55.234398 [recoverd: 2775]: Reenabling takeover runs
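
If you would rather not stare at the tail, the line to look for is the "Takeover run completed successfully" message in the node 1 log above. A sketch, with a function name of our own, that greps for it and then asks CTDB directly:

```shell
#!/bin/sh
# Hypothetical helper: succeed once the log shows a finished takeover run.
# $1: log file to search, defaults to /var/log/messages as tailed above
takeover_done() {
    grep -q 'Takeover run completed successfully' "${1:-/var/log/messages}"
}

# Example, as root on either node:
#   takeover_done && ctdb ip    # shows which node holds each public address
#   ctdb status                 # shows the state of each node
```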

check that smbd and winbind are started:
ps aux|grep smbd
root     10360  0.9  1.4  48416  7328 ?        Ss   21:18   0:00 /usr/sbin/smbd 
root     10383  0.1  0.6  48416  3360 ?        S    21:18   0:00 /usr/sbin/smbd 
ps aux|grep win
root     10322  0.0  0.7  26396  3676 ?        Ss   21:18   0:00 /usr/sbin/winbindd
root     10335  0.0  1.0  26496  5276 ?        S    21:18   0:00 /usr/sbin/winbindd
root     10361  0.0  0.9  29388  4604 ?        S    21:18   0:00 /usr/sbin/winbindd
root     10362  0.0  1.1  32532  5736 ?        S    21:18   0:00 /usr/sbin/winbindd

root     10364  0.0  0.8  26436  4092 ?        S    21:18   0:00 /usr/sbin/winbindd

winbind: this looks familiar
id stevec
uid=3000092(stevec) gid=20513(domain users) groups=20513(domain users),19901(BUILTIN\users)
getent group Domain\ Users
domain users:x:20513:
getent passwd stevec


Windows domain clients
stevec on xp served from each of node 1 and node 2
Linux domain clients
To mount the same shares automatically, see our cifs autofs post. To test this manually, don't forget to specify a key for the mount. If you have specified
kerberos method = system keytab
on your client, you will have a suitable key from when you joined the client to the domain, BUT NOT on the clustered file servers:
mount.cifs //smbcluster/users /home/users -osec=krb5,username=CATRAL$,multiuser
Where catral is the hostname of a Linux client. In this screenshot, the client is running sssd and began life on node 1.
stevec on a Linux client transparently served from node 2
That should be enough to get you started. It sure beats DFS. We now need to re-add the firewall and apparmor, automate start-up and then see if we can break it. I wonder how it will do under load? Then we can decide if and when to go public. Maybe we should go the whole hog and add Pacemaker to the mix? Fence wobbly nodes? STONITH it?