This file describes the implementation of (and a little bit of the rationale for) our FreeBSD-based "virtual node" support. The summary is that we use a highly customized FreeBSD kernel supporting beefed-up jails, virtual disks, virtual ethernet interfaces, and multiple routing tables to implement reasonably isolated virtual nodes for network-centric activity. We do not yet provide complete resource isolation; in particular, we do not limit CPU or memory consumption.

A. Rationale

Why jails?

One way to multiplex activity on physical nodes is simply to start up multiple copies of an experimenter's desired applications. For example, if they want 10 "nodes" running a background traffic generator, we just start up 10 copies on a single physical node. There are a number of problems with this approach. One is that it will likely require customization of the application, possibly even changing the code itself, to allow it to co-exist with other instances of itself. For example, the application might have a hardwired path in the filesystem for configuration or logging information, or require a specific port number.

The first step in virtualizing an application's environment is thus to give it its own name space. BSD jails (jail(2)) restrict a process and all its descendants to a unique slice of the filesystem namespace using chroot. This not only gives each jail a custom, virtual root filesystem (/, /var, etc.) but also insulates it from the filesystem activities of others (and vice versa).

Jails also provide a mechanism for virtualizing and restricting access to the network. When a jail is created, it is given a virtual hostname and a set of IP addresses that it can bind to. These IP addresses are associated with network interfaces outside of the jail context and cannot be changed from within a jail. Hence, jails are implicitly limited to a set of interfaces they may use.
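For concreteness, this is what launching a stock FreeBSD jail looks like from the command line; the path, hostname, and address are illustrative only (our augmented jail accepts additional parameters, described later):

```shell
# Start a jail rooted at /jails/v1, with a virtual hostname and a single
# IP address it may bind to, running the usual rc startup inside it.
# Requires root on the (physical) host; stock jail(8) syntax is:
#   jail path hostname ip-number command ...
jail /jails/v1 v1.example.net 10.0.0.1 /bin/sh /etc/rc
```

Everything the command runs, and all of its descendants, are confined to the /jails/v1 subtree and to the single given address.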
Further, jails allow processes within them to run as root, albeit a watered-down variant. With root inside a jail, applications can add/change/remove whatever files they want (except for device special files), bind to privileged ports, and kill any other jailed processes.

Why virtual disks?

One potential problem with the filesystem virtualization provided by jails is constraining disk usage. Even though each jail has its own subset of the filesystem name space, that space is likely to be part of a larger filesystem. Jails themselves do nothing to limit how much disk space can be used within the hosting filesystem. Disk quotas aren't useful since, within the jail's name space, files are not restricted to a single uid or even a subset of uids; they can be owned by anyone.

The BSD vnode disk driver (vn(4)) allows us to create a regular file with a fixed size and expose it via a disk interface. These fixed-size virtual disks are used to contain a root filesystem for each jail, which is mounted at the root of each jail's name space. Since the virtual disks are contained in regular files, they are also easy and efficient to move or clone.

Why virtual ethernet interfaces?

An important part of BSD jails is the ability to restrict them to specific IP addresses. But the jail mechanism alone does not:

- allow attaching of ipfw/dummynet rules to a jail's traffic
- ensure a jail sees only the traffic it should, without overly limiting it to specific target IPs
- provide encapsulation

The BSD virtual ethernet driver, which we wrote, is a goofy hybrid of a virtual device, an encapsulating device and a bridging device. It allows us to create lots and lots of ethernet interfaces (virtualization), multiplex them on physical interfaces or tie them together in a loopback fashion (bridging) and have them communicate transparently through our switch fabric (encapsulation).

Why virtual routing tables?
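One plausible command sequence for building such a per-jail root disk with vn(4) on FreeBSD 4.x follows; the size, paths, and vn unit number are illustrative, and the actual setup scripts may invoke these tools differently:

```shell
# Create a fixed-size regular file to back the virtual disk.
dd if=/dev/zero of=root.vnode bs=1m count=256

# Attach the file to a vnode disk device (flags per vnconfig(8);
# -c configures the device).
vnconfig -c vn0 root.vnode

# Label it, put a filesystem on the 'c' partition, and mount it
# where the jail's root will live.
disklabel -r -w vn0 auto
newfs /dev/vn0c
mount /dev/vn0c /var/emulab/jails/v1/root
```

Because root.vnode is just a regular file, cloning a vnode's disk is a plain file copy once the device is detached.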
- cannot have multiple veths in the same subnet without this
- needed to route packets correctly through a topology when endpoints are on the same physical host (otherwise traffic gets short-circuited)
- needed to support multiple vnodes with different routes to the same destination

B. Implementation

All virtual nodes on a physical node will belong to the same experiment. This eases the immediate burden of providing isolation somewhat and also allows us to evade tricky issues of who has what access to the hosting physical node.

So a physical node, mapped to an experiment, boots up and eventually runs the bootvnodes script. That script contacts Testbed Central and discovers that it has vnodes to set up. It performs a couple of global "one time" actions: it creates a filesystem on the 4th DOS partition for jail disk space and ensures that sufficient virtual disk (vn) devices exist for all vnodes. It then runs vnodesetup for every vnode.

The vnodesetup script is used to start ("boot"), stop ("halt") and restart ("reboot") vnodes. It is also used to set up vnodes on widearea nodes, but we consider only local cluster nodes here. Vnodesetup forks and runs another script, mkjail.pl, in the child. The parent hangs around and cleans up if the jail dies. It also serves as a focal point for killing the jail, catching signals and forcibly terminating the jail. It is the parent vnodesetup process that handles informing stated of state transitions in the jail.

In mkjail.pl we finally get down to it. This script builds up the filesystem hierarchy used by the jail, sets up its interfaces (including the virtual control net interface, routes and dummynet delay pipes), and then starts the jail. Note that the filesystem and interfaces are set up outside the jail and passed as parameters into the jail. The filesystem consists of a per-jail vnode-disk and loopback (null) mounts of various physical node filesystems. The whole shebang is located in /var/emulab/jails/.
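The boot-time flow above can be sketched roughly as follows. This is not the actual bootvnodes script: the vnode-list variable is a placeholder for the query to Testbed Central, and the vnodesetup flag is an assumption:

```shell
#!/bin/sh
# Sketch of the bootvnodes flow (illustrative, not the real script).

# One-time global actions: a filesystem on the 4th DOS partition for
# jail disk space (slice 4 of the boot disk, here assumed to be ad0).
newfs /dev/ad0s4
mount /dev/ad0s4 /var/emulab/jails

# Boot each vnode this physical node hosts. $VNODELIST stands in for
# the list obtained from Testbed Central; "-b" ("boot") is an assumed
# option name. Each vnodesetup stays resident to manage its jail.
for vnode in $VNODELIST; do
    vnodesetup -b "$vnode" &
done
wait
```

Each vnodesetup instance then forks mkjail.pl and remains as the per-vnode control point for halt/reboot.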
root.vnode is the regular file which serves as the root disk. It is attached to a vn device that is then mounted at /var/emulab/jails//root. The mkjail.pl script populates the disk by copying in some directories from the physical node and customizing the content. This loading of the disk happens only on the first boot of the virtual node; subsequent boots simply mount the vn disk.

To save space in the per-jail disk, the binary directories /sbin and /bin are remounted read-only with a loopback mount, as is /usr. The shared /share, /proj, /users and /local directories are also loopback mounted read-write from the physical node. From the perspective of the physical node, there will be at least 8 mounts for every jail: /dev/vn?c (the root), /bin, /sbin, /usr, /proc, /share, /proj/, and /local/, plus one in /users for every user in the project (and maybe a /group/ too!).

The network setup starts with configuration of the virtual control net interface. This consists of a 172.16/12 address alias assigned to the real control net interface. There will be one such alias per virtual node. These aliases allow us to further isolate jails from each other, as each jail now has a unique address on which to run services. We no longer have to assign port ranges within the primary control net address and run daemons on weird port numbers. We can also use DNS to map symbolic names to the virtual nodes. Note that these addresses are "unroutable" and thus are not exposed outside emulab.net. Accessing services on the nodes from outside requires a proxy on ops.emulab.net, such as an ssh tunnel.

For experimental interfaces, mkjail.pl creates an rc file to set up the virtual interfaces and routes, and runs it. A virtual interface's parameters are assigned as follows. We assign veth virtual MACs based on the interface's IP address, in the form 00:00:IP#0:IP#1:IP#2:IP#3, ensuring uniqueness. We assign veth tags using the subnet part of the interface's IP address.
Since we use the 192.168 space, we need only the third octet of the IP to identify the subnet, but for future expansion, we use the second and third octets. A veth's physical interface is determined by assign. We may use multiple physical interfaces between nodes or we may use no physical interfaces at all. The routing table ID is unique per virtual node on a physical node, so we simply use a per-physical-node counter to assign these when the vnodes are booted. All veths for a virtual node get the same counter value.

There is nothing magical in route setup, just an extra argument to the route command to ensure the routes get added to the correct table. Likewise for delay setup: ipfw rules are simply applied to veths rather than physical interfaces. Setting up routes and dummynet outside the jail is largely historical.

Finally, the jail startup is done. Our augmented jail implementation takes some new parameters in addition to a "primary" IP address, the root directory of the jail, and the program to run. The important additional parameter is a list of IP addresses. These addresses, along with the primary, implicitly define which interfaces are accessible to the jail: those to which the IP addresses are assigned. This is analogous to the root directory specification, which determines which mounted filesystems are accessible: those at or below the level of the root directory.

The program run by the jail is yet another perl script, /etc/injail.pl, that is effectively the /sbin/init of the virtual node. Its primary jobs are to fire off /etc/rc to bring up the virtual node and then sit around and wait for a signal to shut down the jail. The startup scripts run by /etc/rc in the jail are scaled-back versions of what would run on a real node. This scaling back reflects the fact that the node has already been partially initialized and also that it usually will not run as many services as a real node.
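The VMAC and tag derivations described above can be sketched as small shell helpers. The function names are our own (the real assignments happen in the rc file mkjail.pl generates), and rendering the IP octets in hex is an assumption; the scheme itself, 00:00 followed by the four octets for the VMAC and the second and third octets combined for the tag, is as described:

```shell
#!/bin/sh
# Hypothetical helpers illustrating veth parameter derivation;
# not taken from the actual Emulab scripts.

# VMAC: 00:00:IP#0:IP#1:IP#2:IP#3 (octets rendered in hex here --
# an assumption about the on-the-wire form).
ip2vmac() {
    echo "$1" | awk -F. '{ printf "00:00:%02x:%02x:%02x:%02x\n", $1, $2, $3, $4 }'
}

# Broadcast-domain tag: second and third octets of the IP address
# combined into a single 16-bit value.
ip2tag() {
    echo "$1" | awk -F. '{ print $2 * 256 + $3 }'
}

ip2vmac 192.168.5.2    # -> 00:00:c0:a8:05:02
ip2tag  192.168.5.2    # -> 43013
```

Since the VMAC embeds the full IP address, any two veths with distinct addresses in a broadcast domain automatically get distinct VMACs.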
A typical jail runs syslogd, cron and sshd, as well as the Emulab watchdog and optional agents like trafgen or delay_agent. From the perspective of the physical node, each jail has at least 8 processes running: vnodesetup, mkjail.pl and proxy-tmcc outside the jail, as well as injail.pl, syslogd, cron, sshd, and the Emulab watchdog inside the jail. Empirically, it appears that each jail requires 12-16MB of physical memory for its base processes.

C. Details

FreeBSD jails. Jails provide filesystem and network namespace isolation and some degree of superuser privilege restriction. To the basic jail mechanism we have added:

1. Raw socket access.
2. Read-only BPF access.
3. Multiple-IP and INADDR_ANY support.
4. Association of a jail with a routing table.
5. Bug fixes: e.g., a jail cannot unmount a filesystem that was not mounted from within that jail.

[ move this detail somewhere else, a man page maybe? ]

First a few words on veth devices. They are configured with a few parameters: a virtual MAC (VMAC) address, a broadcast domain tag, an associated physical (parent) ethernet interface, and a routing table ID.

The VMAC identifies the interface and needs to be unique within a broadcast domain, *not* unique per physical node. Thus, you could have the same VMAC on multiple veths on the same physical node (though we don't do that), and you may need distinct VMACs across a set of physical nodes; it depends on the topology.

The broadcast domain tag is basically a VLAN tag; it allows us to use the same physical wire for multiple LANs. Unlike VLANs, where you can have at most 4096 tags, we currently use a 16-bit value allowing up to 64K LANs. Again, a tag's uniqueness has nothing to do with physical node boundaries: veths on different physical nodes may need to have the same tag, while veths on the same node may have different values.

The parent interface parameter determines which interface the veths send and receive encapsulated packets to and from.
All veths with the same parent interface and broadcast domain tag can talk to each other. If such veths are on the same physical node, they talk via loopback, with no encapsulation and without packets going out on the physical interface. Specifying a null parent can be used for a strictly loopback connection.

The routing table ID is used with incoming packets to determine which table to use for lookups when forwarding. The route table ID effectively identifies a virtual node: all interfaces associated with a virtual node have the same ID, and every virtual node has its own unique ID. The route table ID is a local-node-only value; different physical nodes can use the same ID for different purposes.

More about the startup pieces:

vnodesetup hangs around so that you can signal it and easily reboot the vnode. I guess the idea is that it is also jail/vserver independent, as opposed to...

mkjail.pl is jail specific and hangs around so that it can clean up jail-specific things when the jail exits.

injail.pl is the jail's init process. This is the single point of contact for killing the jail.

/bin/sleep is just an artifact.

D. Examples

Consider a topology of two nodes connected via a link. Assume that both nodes have been mapped to virtual nodes on two different physical nodes. Each virtual node.