Writing a TCP/IP Stack from Scratch in Nim: ARP

ARP

Welcome to the second post in the series. Last time we built the link layer – capturing raw Ethernet frames over BPF on macOS. This time we’re going one notch up the stack: ARP, the Address Resolution Protocol. It’s how a host figures out which MAC address corresponds to a given IPv4 address on the local network, so Ethernet frames know where to actually go.

But before we get to ARP itself, we need to revisit the link layer. The version we ended Part 1 with was fine for reading and printing frames, but it has a handful of problems that show up as soon as we try to do anything more serious – like sending frames back, processing more than one packet per syscall, or letting upper layers (ARP, IP, TCP) compare ethertypes without parsing strings.

So this post has two halves:

Fix up the link layer.
Then build ARP on top of it.

Fixing the link layer

Let’s go through the issues one at a time.

First, the read_frame() implementation itself. Right now it’s a procedure that issues one read(), parses exactly one packet and returns that frame, so anything else the kernel put in the buffer is lost:


        ┌────────┬─────────────────────┬───┬────────┬─────────────────────┬───┬────────┬─────────────────────┬───┐   
        │bpf_hdr0│       Packet        │ ◊ │bpf_hdr1│        Packet       │ ◊ │bpf_hdr2│        Packet       │ ◊ │   
        └────────┴─────────────────────┴───┴────────┴─────────────────────┴───┴────────┴─────────────────────┴───┘   
        ◄─── parsed ──────────────────►    ◄────────── thrown away ──────────────────────────────────────────────►

That was a simplification on our end. BPF’s read API fundamentally hands back N packets per call, so the correct way to handle a read is to treat the buffer as a list: walk a cursor, parse a packet, advance to the next one.

-proc read_frame*(self: var BPFLinkLayer): FrameData =
-  var buffer: array[4096, uint8]
-  let bytesRead = read(cint(self.bpf_fd), addr buffer[0], buffer.len)
-  ...
-  let bh = cast[ptr BPFHeader](addr buffer[0])[]
-  let pktStart = bh.bh_hdrlen.int
-  ...
-  result = FrameData(...)
+iterator read_frames*(self: var BPFLinkLayer): FrameData =
+  var buf: array[BPF_BUF_SIZE, uint8]
+  while true:
+    let bytesRead = read(cint(self.bpf_fd), addr buf[0], buf.len)
+    ...
+    var pos = 0
+    while pos < bytesRead:
+      let bh = cast[ptr BPFHeader](addr buf[pos])[]
+      ...
+      yield FrameData(...)
+      let totalLen = bh.bh_hdrlen.int + bh.bh_caplen.int
+      pos += (totalLen + BPF_ALIGNMENT - 1) and not (BPF_ALIGNMENT - 1)

BPF_ALIGNMENT is the boundary the kernel pads each packet to inside the buffer (4 bytes on macOS). Because of that padding, the next bpf_hdr always lands on an aligned offset, so userspace can cast the pointer and read the header fields directly.

The math behind that step – (totalLen + BPF_ALIGNMENT - 1) and not (BPF_ALIGNMENT - 1) – is the standard “round up to the next multiple of N” trick. We add align - 1 to push past the boundary if we’re below it, then mask off the low bits to land back on it.
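To make the rounding concrete, here’s a tiny standalone sketch. alignUp is a helper name made up for this example (it’s not part of the stack’s source), the sizes are just sample values, and the mask trick assumes the alignment is a power of two.

```nim
# Round n up to the next multiple of `align`.
# Assumption: `align` is a power of two, otherwise the mask is wrong.
proc alignUp(n, align: int): int =
  (n + align - 1) and not (align - 1)

# Sample values: an 18-byte header plus 60 captured bytes is 78 bytes,
# which is not a multiple of 4, so the next record starts at offset 80.
echo alignUp(78, 4)   # 80
echo alignUp(80, 4)   # 80 (already aligned, stays put)
```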

So the rewritten version is an iterator that yields each frame as we go, and it can be used like this:

for frame in bpf.read_frames():
    ...
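If you want to play with the cursor walk without opening a BPF device, the same pattern can be sketched over an in-memory buffer. Everything here is a stand-in: the 4-byte toy header (caplen, hdrlen, two padding bytes) is hypothetical and much simpler than the real bpf_hdr, and appendRecord and records are names invented for the example.

```nim
const
  Alignment = 4   # assumed padding boundary, as on macOS
  HdrLen = 4      # hypothetical toy header: caplen, hdrlen, 2 padding bytes

proc alignUp(n: int): int =
  (n + Alignment - 1) and not (Alignment - 1)

# Append one record: toy header, payload, then zero padding up to the boundary.
proc appendRecord(buf: var seq[uint8], payload: openArray[uint8]) =
  buf.add uint8(payload.len)   # caplen (demo payloads are shorter than 256 bytes)
  buf.add uint8(HdrLen)        # hdrlen
  buf.add 0'u8
  buf.add 0'u8
  buf.add payload
  while buf.len mod Alignment != 0:
    buf.add 0'u8

# Walk the buffer the way read_frames() does: parse, yield, advance.
iterator records(buf: seq[uint8]): seq[uint8] =
  var pos = 0
  while pos + HdrLen <= buf.len:
    let caplen = int(buf[pos])
    let hdrlen = int(buf[pos + 1])
    let start = pos + hdrlen
    if hdrlen < HdrLen or start + caplen > buf.len:
      break                    # truncated or corrupt record, stop walking
    yield buf[start ..< start + caplen]
    pos += alignUp(hdrlen + caplen)

var buf: seq[uint8]
buf.appendRecord([1'u8, 2, 3])
buf.appendRecord([4'u8, 5, 6, 7, 8])
buf.appendRecord([9'u8])

for payload in records(buf):
  echo payload.len, " bytes: ", payload
```

The bounds check before each yield is the part worth copying: a cursor that trusts the length fields blindly will happily walk off the end of the buffer.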

Another thing that we need to change is the way we pass around Ethernet frames:

type
    FrameData* = object
-    dest_mac*: string
-    src_mac*: string
-    eth_type*: string
-    eth_type_name*: string
+    dest_mac*: MacAddr
+    src_mac*: MacAddr
+    eth_type*: uint16
    payload*: seq[uint8]

Sure, passing MAC addresses around as strings made debugging and printing easier, but it pushes a parsing problem onto every upper layer. Better to hold the raw bytes and format them only when something is being printed. So let’s define a MacAddr type, and make eth_type a plain 16-bit number, since it’s a 2-byte field on the wire:

type
  MacAddr* = array[6, uint8]  # six raw octets
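Formatting then becomes a small helper that only the printing code calls. A minimal sketch, where formatMac is a hypothetical name rather than something from the post’s source:

```nim
import std/strutils

type
  MacAddr = array[6, uint8]

# Hypothetical helper: render a MAC as colon-separated lowercase hex.
proc formatMac(mac: MacAddr): string =
  var parts: seq[string]
  for octet in mac:
    parts.add toHex(octet, 2).toLowerAscii
  parts.join(":")

let mac: MacAddr = [0x66'u8, 0x65, 0x74, 0x68, 0x00, 0x01]
echo formatMac(mac)   # 66:65:74:68:00:01
```

Comparing two addresses stays a plain == on the arrays, with no string handling involved.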

Also, let’s fix our BIOCIMMEDIATE flag setup and correctly handle any errors:

-discard ioctl(cint(self.bpf_fd), culong(BIOCIMMEDIATE), cint(1))
+var enable: cuint = 1
+if ioctl(cint(self.bpf_fd), culong(BIOCIMMEDIATE), addr enable) != 0:
+  raise newException(OSError, "BIOCIMMEDIATE failed")

What we missed is that the third argument to ioctl for BIOCIMMEDIATE is a pointer to a u_int, not the integer itself. The old call passed the value 1 where a pointer was expected, the kernel failed to copy from that bogus address, and because we discarded the return value the failure went unnoticed. In general, we should check the return value of every ioctl() call:

-discard ioctl(cint(self.bpf_fd), culong(BIOCSETIF), addr ifreq)
+if ioctl(cint(self.bpf_fd), culong(BIOCSETIF), addr ifreq) != 0:
+  raise newException(OSError, fmt"BIOCSETIF failed for interface {self.iface}")
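Since every one of these checks has the same shape, they could be folded into one helper that also records errno, the way send_frame() does further down. This is only a sketch: checkSyscall is a hypothetical name, and the demo uses close() on an invalid descriptor as a stand-in for a failing ioctl, because a real BPF call needs root and a device.

```nim
import std/[posix, strformat]

# Hypothetical helper: turn a non-zero syscall result into an OSError
# that carries errno and its description.
proc checkSyscall(rc: cint, what: string) =
  if rc != 0:
    let err = errno                  # read errno right away, before other calls
    let reason = $strerror(err)
    raise newException(OSError, fmt"{what} failed: errno={err} ({reason})")

# Stand-in for a failing ioctl: closing an invalid descriptor returns -1.
try:
  checkSyscall(close(cint(-1)), "close")
except OSError as e:
  echo e.msg
```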

Before we start implementing ARP, let’s add a few building blocks to the link layer that ARP, and later layers as well, will rely on.

First, we need a way to send frames back out through BPF. For that, let’s define a send_frame() procedure:

proc send_frame*(self: var BPFLinkLayer, frame: openArray[uint8]) =
  let n = write(cint(self.bpf_fd), addr frame[0], frame.len)
  if n < 0:
    let err = errno
    raise newException(IOError, fmt"BPF write failed: errno={err} ({$strerror(err)})")
  if n != frame.len:
    raise newException(IOError, fmt"BPF short write: {n}/{frame.len}")

It’s just a regular write() call on the same BPF file descriptor: we pass the address of the frame’s first byte and the frame’s length.

But send_frame() only writes raw bytes – we still need a way to produce those bytes. Hand-assembling an Ethernet frame at every call site would get old fast, so let’s add a small helper:

proc build_ethernet_frame*(dest_mac: MacAddr, src_mac: MacAddr, eth_type: uint16, payload: openArray[uint8]): seq[uint8] =
  var frame: seq[uint8] = @[]
  frame.setLen(14 + payload.len)
  copyMem(addr frame[0], addr dest_mac[0], 6)
  copyMem(addr frame[6], addr src_mac[0], 6)
  bigEndian16(addr frame[12], addr eth_type)
  if payload.len > 0:
    copyMem(addr frame[14], addr payload[0], payload.len)

  frame

It’s the mirror image of the parser from Part 1: 6 bytes for the destination MAC, 6 bytes for the source MAC, 2 bytes for the EtherType (big-endian), then the payload – total length 14 + payload.len.
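To double-check the layout, here’s an alternative pointer-free sketch that produces the same bytes with add and shifts instead of copyMem and bigEndian16. buildFrame is a name made up for this example, not the post’s build_ethernet_frame.

```nim
type
  MacAddr = array[6, uint8]

# Same 14-byte header layout, assembled without raw pointers.
proc buildFrame(dst, src: MacAddr, ethType: uint16,
                payload: openArray[uint8]): seq[uint8] =
  result = newSeqOfCap[uint8](14 + payload.len)
  result.add dst                        # bytes 0..5: destination MAC
  result.add src                        # bytes 6..11: source MAC
  result.add uint8(ethType shr 8)       # byte 12: high byte first (big-endian)
  result.add uint8(ethType and 0xff)    # byte 13: low byte
  result.add payload                    # bytes 14 and up

let broadcast: MacAddr = [0xff'u8, 0xff, 0xff, 0xff, 0xff, 0xff]
let ourMac: MacAddr = [0x66'u8, 0x65, 0x74, 0x68, 0x00, 0x01]

# A 28-byte (all-zero) payload, the size of an ARP packet.
let frame = buildFrame(broadcast, ourMac, 0x0806'u16, newSeq[uint8](28))
echo frame.len        # 42
echo frame[12 .. 13]
```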

OK, with that out of the way, let’s add one more missing piece. ARP needs to know our own MAC address, so let’s go over the function that gets it.

On macOS/BSD we need the getifaddrs() function. It collects every address on every network interface on the machine and hands the result back as a linked list. We need to traverse that list and find the MAC address of the interface we bound BPF to.

Let’s prepare some types that we’ll need:

type
  Sockaddr {.importc: "struct sockaddr", header: "<sys/socket.h>", bycopy.} = object
    sa_len: uint8
    sa_family: uint8

  IfAddrs {.importc: "struct ifaddrs", header: "<ifaddrs.h>", bycopy.} = object
    ifa_next: ptr IfAddrs
    ifa_name: cstring
    ifa_flags: cuint
    ifa_addr: ptr Sockaddr

  SockaddrDl {.importc: "struct sockaddr_dl", header: "<net/if_dl.h>", bycopy.} = object
    sdl_len: uint8
    sdl_family: uint8
    sdl_index: uint16
    sdl_type: uint8
    sdl_nlen: uint8
    sdl_alen: uint8
    sdl_slen: uint8
    sdl_data: array[12, char]

const
  AF_LINK = 18'u8
  IFT_ETHER = 0x06'u8

These mirror the C structs from the system headers. Each one needs to match the C layout exactly, which is why we use these pragmas.

Let’s quickly recap the pragmas we saw earlier, when dealing with the link layer:

{.importc: "struct sockaddr", header: "<sys/socket.h>".} – this tells the compiler “when generating C code, use the existing struct sockaddr type from the system headers, don’t generate a new one”; header just points to the file that declares it.
bycopy – instructs the compiler to always pass the type by value. Without it, Nim is free to pass a large object parameter by reference behind the scenes when that is faster.

IfAddrs is the linked list node that holds the address data. ifa_next is a pointer to the next node in the list. ifa_name is a cstring that holds the interface name, like en0. ifa_addr points to a generic Sockaddr, which carries the length of the address and the address family (AF_INET for IPv4, AF_LINK for the link layer, etc.).

The generic Sockaddr lets us identify the address type. When sa_family is AF_LINK, the bytes behind the pointer are actually a SockaddrDl, so we cast to that type to access the MAC address.

Declare the libc functions (both are declared in <ifaddrs.h>):

proc getifaddrs(ifap: ptr ptr IfAddrs): cint {.importc, header: "<ifaddrs.h>".}
proc freeifaddrs(ifap: ptr IfAddrs) {.importc, header: "<ifaddrs.h>".}

And we’re ready to write our function to get the MAC address:

proc get_my_mac_addr(self: var BPFLinkLayer): MacAddr =
  var ifap: ptr IfAddrs
  if getifaddrs(addr ifap) != 0:
    raise newException(Exception, "Failed to get interface addresses")

  defer: freeifaddrs(ifap)

  var curr: ptr IfAddrs = ifap
  while curr != nil:
    if curr.ifa_addr != nil and curr.ifa_addr.sa_family == AF_LINK and curr.ifa_name == self.iface.cstring:
      let sdl = cast[ptr SockaddrDl](curr.ifa_addr)
      if sdl.sdl_type == IFT_ETHER and sdl.sdl_alen == 6:
        let macPtr = cast[ptr UncheckedArray[uint8]](
          cast[uint](addr sdl.sdl_data) + sdl.sdl_nlen.uint
        )
        var mac: array[6, uint8]
        for i in 0..5:
          mac[i] = macPtr[i]
        return mac

    curr = curr.ifa_next
  
  raise newException(Exception, "No MAC address found for interface")

Let’s break it down step by step.

We start by declaring a pointer ifap (Nim initializes it to nil) and pass its address into getifaddrs. The function takes a ptr ptr IfAddrs (a pointer to our pointer) so it can write the address of the freshly allocated list back into our variable – this is the standard C idiom for “give me back a new pointer.” On failure it returns non-zero, so we bail out with an exception.

Right after that, we set up cleanup with defer. getifaddrs allocated a linked list for us, and we’re responsible for freeing it. defer runs freeifaddrs(ifap) at the end of the scope no matter how we exit – basically a try/finally.

Now we walk the linked list. Standard traversal: start at the head, follow ifa_next until we hit nil:

var curr: ptr IfAddrs = ifap
while curr != nil:
  ...
  curr = curr.ifa_next

One thing worth noting – getifaddrs returns one node per (interface, address) pair, not one per interface. So en0 will appear multiple times in the list. That’s why we filter by both name and address family:

if curr.ifa_addr != nil and
   curr.ifa_addr.sa_family == AF_LINK and
   curr.ifa_name == self.iface.cstring:

Once we’ve found the right node, we know the bytes behind ifa_addr are actually a sockaddr_dl, so we cast:

let sdl = cast[ptr SockaddrDl](curr.ifa_addr)

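Before extracting the MAC, two additional checks:

if sdl.sdl_type == IFT_ETHER and sdl.sdl_alen == 6:

AF_LINK covers more than Ethernet – loopback, tunnels, etc. – so we require the Ethernet type IFT_ETHER. And sdl_alen == 6 makes sure the link-layer address really is six bytes long before we copy it into a MacAddr.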

Now the trickiest part. sockaddr_dl packs the interface name and the MAC into a single sdl_data buffer, with the name first:

       ◄── sdl_nlen ──►◄── sdl_alen ──►◄── sdl_slen ──►
      ┌────────────────┬───────────────┬───────────────┐
      │ interface name │  link address │   selector    │
      │   ("en0")      │    (the MAC)  │  (often empty)│
      └────────────────┴───────────────┴───────────────┘

So to find where the MAC starts, we have to skip past the name, which means jumping sdl_nlen bytes past the start of sdl_data:

let macPtr = cast[ptr UncheckedArray[uint8]](
  cast[uint](addr sdl.sdl_data) + sdl.sdl_nlen.uint
)
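The offset arithmetic is easy to try on a simulated buffer, without touching getifaddrs at all. In this sketch data plays the role of sdl_data and nlen/alen play sdl_nlen/sdl_alen; macFromSdlData is a hypothetical helper that adds a bounds check the pointer version doesn’t have.

```nim
type
  MacAddr = array[6, uint8]

# `data` stands in for sdl_data: interface name first, then the link address.
proc macFromSdlData(data: openArray[uint8], nlen, alen: int): MacAddr =
  if alen != 6 or nlen < 0 or nlen + alen > data.len:
    raise newException(ValueError, "no 6-byte link address in sdl_data")
  for i in 0 ..< 6:
    result[i] = data[nlen + i]    # skip the name, then copy six bytes

# Simulated sdl_data for "en0": three name bytes followed by the MAC.
var data: seq[uint8]
for c in "en0":
  data.add uint8(c)
data.add [0x66'u8, 0x65, 0x74, 0x68, 0x00, 0x01]

echo macFromSdlData(data, nlen = 3, alen = 6)
```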

How did I know this? Claude told me. I mean, the header says char sdl_data[12] with a comment “contains both if name and ll address” and that’s it. A few years BC (Before Claude) you’d find this by reading Stevens’ UNIX Network Programming or grepping through ifconfig.c. I did neither.

Finally, we copy the six bytes into a fixed-size array we own and return:

var mac: array[6, uint8]
for i in 0..5:
  mac[i] = macPtr[i]
return mac

ARP implementation

Let’s move on to the ARP implementation.

So what does an ARP exchange actually look like on the wire?

Picture two machines on the same LAN. Host A (192.168.0.1) wants to send an IP packet to Host B (192.168.0.2). It knows the IP, but the Ethernet layer needs a destination MAC. Host A doesn’t know it yet.

Host A broadcasts an ARP request:

“Who has 192.168.0.2? Tell 192.168.0.1, my MAC is aa:bb:cc:dd:ee:ff.”

Every host on the segment sees it. Host B sees its own IP in the question and replies – this time as a unicast straight back to A:

“192.168.0.2 is at MAC 66:65:74:68:00:01.”

Host A caches the answer and moves on. Subsequent IP packets to .2 use that MAC directly.

That’s the whole protocol. Two opcodes (request and reply), one wire format for both, and an in-memory cache so we don’t broadcast for the same answer over and over.

It’s time to dissect what an ARP packet (ethertype 0x0806) actually looks like:


  ┌───────┬────────┬──────┬─────────────────────────────────────────────────┐
  │ Field │ Offset │ Size │                     Meaning                     │
  ├───────┼────────┼──────┼─────────────────────────────────────────────────┤
  │ htype │ 0      │ 2 B  │ Hardware type. 1 = Ethernet.                    │
  ├───────┼────────┼──────┼─────────────────────────────────────────────────┤
  │ ptype │ 2      │ 2 B  │ Protocol type. 0x0800 = IPv4.                   │
  ├───────┼────────┼──────┼─────────────────────────────────────────────────┤
  │ hlen  │ 4      │ 1 B  │ Hardware address length. 6 for MAC.             │
  ├───────┼────────┼──────┼─────────────────────────────────────────────────┤
  │ plen  │ 5      │ 1 B  │ Protocol address length. 4 for IPv4.            │
  ├───────┼────────┼──────┼─────────────────────────────────────────────────┤
  │ op    │ 6      │ 2 B  │ Operation. 1 = request, 2 = reply.              │
  ├───────┼────────┼──────┼─────────────────────────────────────────────────┤
  │ sha   │ 8      │ 6 B  │ Sender hardware address (MAC of who's asking).  │
  ├───────┼────────┼──────┼─────────────────────────────────────────────────┤
  │ spa   │ 14     │ 4 B  │ Sender protocol address (IPv4 of who's asking). │
  ├───────┼────────┼──────┼─────────────────────────────────────────────────┤
  │ tha   │ 18     │ 6 B  │ Target hardware address (zero in a request).    │
  ├───────┼────────┼──────┼─────────────────────────────────────────────────┤
  │ tpa   │ 24     │ 4 B  │ Target protocol address (the IPv4 we're after). │
  └───────┴────────┴──────┴─────────────────────────────────────────────────┘

htype, ptype, and op are big-endian on the wire – we’ll use bigEndian16 to read and write them, same as we did for the ethertype.
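As a quick refresher on what bigEndian16 from std/endians does, here’s a standalone round trip on the two ethertype bytes. The variable names are just for the example.

```nim
import std/endians

# Two bytes as they appear on the wire: the ARP ethertype 0x0806.
var wire = [0x08'u8, 0x06]

var value: uint16
bigEndian16(addr value, addr wire[0])   # wire order -> host order
echo value == 0x0806'u16

var back: array[2, uint8]
bigEndian16(addr back[0], addr value)   # host order -> wire order
echo back == wire
```

The same call works in both directions because it converts between big-endian and whatever the host uses: on a little-endian machine it swaps the bytes, on a big-endian one it’s a plain copy.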

Mirror the layout in Nim and add a couple of ARP-specific constants:

  const
    ArpHwEthernet = 1'u16
    ArpProtoIPv4  = 0x0800'u16
    ArpOpRequest  = 1'u16
    ArpOpReply    = 2'u16
    EthertypeArp* = 0x0806'u16

  type
    ArpFrame* = object
      htype*: uint16        # 1 = Ethernet
      ptype*: uint16        # 0x0800 = IPv4
      hlen*:  uint8
      plen*:  uint8
      op*:    uint16        # 1 = request, 2 = reply
      sha*:   MacAddr       # sender MAC
      spa*:   IPv4Addr      # sender IPv4
      tha*:   MacAddr       # target MAC
      tpa*:   IPv4Addr      # target IPv4

The 16-bit fields are plain uint16, not 2-byte arrays. Same reasoning as eth_type: the wire is bytes, but once parsed we want a number we can compare against constants.

IPv4Addr is array[4, uint8] – same logic as MacAddr.
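As with MACs, the dotted-quad form only matters at the edges, when printing or reading config. Two small conversion helpers could look like this; formatIPv4 and parseIPv4 are hypothetical names, not part of the post’s source.

```nim
import std/strutils

type
  IPv4Addr = array[4, uint8]

# Hypothetical helper: raw octets -> "a.b.c.d".
proc formatIPv4(ip: IPv4Addr): string =
  var parts: seq[string]
  for octet in ip:
    parts.add $int(octet)
  parts.join(".")

# Hypothetical helper: "a.b.c.d" -> raw octets, ValueError on bad input.
proc parseIPv4(s: string): IPv4Addr =
  let parts = s.split('.')
  if parts.len != 4:
    raise newException(ValueError, "expected four octets: " & s)
  for i, part in parts:
    let n = parseInt(part)          # raises ValueError on non-numeric input
    if n < 0 or n > 255:
      raise newException(ValueError, "octet out of range: " & part)
    result[i] = uint8(n)

let ip = parseIPv4("192.168.42.2")
echo formatIPv4(ip)   # 192.168.42.2
```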

We also need an in-memory cache to track what we’ve learned, plus some context (the BPF handle and our IP address):

type
  ArpCache* = Table[IPv4Addr, MacAddr]

  ArpCtx* = object
    bpf*:   BPFLinkLayer
    cache*: ArpCache
    my_ip*: IPv4Addr

The cache is a plain Nim Table. my_ip is the address our stack claims, so we’ll reply to ARP requests for it, and ignore everything else.
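Arrays of bytes work as Table keys out of the box, since std/hashes can hash them. A minimal sketch of storing and looking up an entry, where lookup is a hypothetical helper that returns an Option instead of raising KeyError on a miss:

```nim
import std/[tables, hashes, options]

type
  MacAddr = array[6, uint8]
  IPv4Addr = array[4, uint8]
  ArpCache = Table[IPv4Addr, MacAddr]

# Hypothetical helper: a miss is an ordinary outcome, not an error.
proc lookup(cache: ArpCache, ip: IPv4Addr): Option[MacAddr] =
  if ip in cache:
    some(cache[ip])
  else:
    none(MacAddr)

let kernelIp: IPv4Addr = [192'u8, 168, 42, 1]
let kernelMac: MacAddr = [0x66'u8, 0x65, 0x74, 0x68, 0x00, 0x00]

var cache: ArpCache = initTable[IPv4Addr, MacAddr]()
cache[kernelIp] = kernelMac

echo lookup(cache, kernelIp).isSome
echo lookup(cache, [192'u8, 168, 42, 99]).isSome
```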

Each FrameData we get from the read_frames() iterator carries a payload, and for an ARP frame the first 28 bytes of that payload are our ArpFrame:

proc parse_arp_frame*(frame: link_layer.FrameData): Option[ArpFrame] =
  if frame.payload.len < 28:
    return none(ArpFrame)

  var arp_frame: ArpFrame
  let p = frame.payload
  bigEndian16(addr arp_frame.htype, addr p[0])
  bigEndian16(addr arp_frame.ptype, addr p[2])
  arp_frame.hlen = p[4]
  arp_frame.plen = p[5]
  bigEndian16(addr arp_frame.op, addr p[6])
  copyMem(addr arp_frame.sha[0], addr p[8],  6)
  copyMem(addr arp_frame.spa[0], addr p[14], 4)
  copyMem(addr arp_frame.tha[0], addr p[18], 6)
  copyMem(addr arp_frame.tpa[0], addr p[24], 4)

  if arp_frame.htype != ArpHwEthernet or
      arp_frame.ptype != ArpProtoIPv4 or
      arp_frame.hlen  != 6'u8 or
      arp_frame.plen  != 4'u8:
    return none(ArpFrame)

  some(arp_frame)

We return Option[ArpFrame] rather than raising. ARP is best-effort, so any malformed packets are just ignored.

We also need to be able to build an ARP frame, so we can send a reply back:

proc build_arp_frame*(arp_frame: ArpFrame): seq[uint8] =
  result.setLen(28)
  bigEndian16(addr result[0], addr arp_frame.htype)
  bigEndian16(addr result[2], addr arp_frame.ptype)
  result[4] = arp_frame.hlen
  result[5] = arp_frame.plen
  bigEndian16(addr result[6], addr arp_frame.op)
  copyMem(addr result[8],  addr arp_frame.sha[0], 6)
  copyMem(addr result[14], addr arp_frame.spa[0], 4)
  copyMem(addr result[18], addr arp_frame.tha[0], 6)
  copyMem(addr result[24], addr arp_frame.tpa[0], 4)

Again, just a mirror image of the parser.
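A cheap way to gain confidence in a parser/builder pair is a round trip: build, parse, compare. The sketch below is a condensed, pointer-free re-implementation written only for this check (serialize, deserialize, put16 and get16 are hypothetical names), so it runs without the link layer or a BPF device.

```nim
type
  MacAddr = array[6, uint8]
  IPv4Addr = array[4, uint8]
  ArpFrame = object
    htype, ptype: uint16
    hlen, plen: uint8
    op: uint16
    sha: MacAddr
    spa: IPv4Addr
    tha: MacAddr
    tpa: IPv4Addr

# Big-endian 16-bit helpers using shifts instead of bigEndian16.
proc put16(buf: var seq[uint8], offset: int, v: uint16) =
  buf[offset] = uint8(v shr 8)
  buf[offset + 1] = uint8(v and 0xff)

proc get16(buf: openArray[uint8], offset: int): uint16 =
  (uint16(buf[offset]) shl 8) or uint16(buf[offset + 1])

proc serialize(a: ArpFrame): seq[uint8] =
  result = newSeq[uint8](28)
  put16(result, 0, a.htype)
  put16(result, 2, a.ptype)
  result[4] = a.hlen
  result[5] = a.plen
  put16(result, 6, a.op)
  for i in 0 ..< 6: result[8 + i] = a.sha[i]
  for i in 0 ..< 4: result[14 + i] = a.spa[i]
  for i in 0 ..< 6: result[18 + i] = a.tha[i]
  for i in 0 ..< 4: result[24 + i] = a.tpa[i]

proc deserialize(p: openArray[uint8]): ArpFrame =
  if p.len < 28:
    raise newException(ValueError, "ARP payload shorter than 28 bytes")
  result.htype = get16(p, 0)
  result.ptype = get16(p, 2)
  result.hlen = p[4]
  result.plen = p[5]
  result.op = get16(p, 6)
  for i in 0 ..< 6: result.sha[i] = p[8 + i]
  for i in 0 ..< 4: result.spa[i] = p[14 + i]
  for i in 0 ..< 6: result.tha[i] = p[18 + i]
  for i in 0 ..< 4: result.tpa[i] = p[24 + i]

# "Who has 192.168.42.2? Tell 192.168.42.1" (tha stays all zeros).
let request = ArpFrame(
  htype: 1, ptype: 0x0800, hlen: 6, plen: 4, op: 1,
  sha: [0x66'u8, 0x65, 0x74, 0x68, 0x00, 0x00],
  spa: [192'u8, 168, 42, 1],
  tpa: [192'u8, 168, 42, 2])

let bytes = serialize(request)
echo bytes.len
echo deserialize(bytes) == request
```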

Handling incoming ARP

Now the actual process: what do we do when an ARP packet arrives?

proc handle_arp_ingress*(ctx: var ArpCtx, frame: link_layer.FrameData) =
  if frame.eth_type != EthertypeArp:
    return
  let arpOpt = parse_arp_frame(frame)
  if arpOpt.isNone:
    return
  let arp = get(arpOpt)

  if arp.sha == ctx.bpf.my_mac:
    return  # ignore our own outbound frames captured via BPF

  ctx.cache[arp.spa] = arp.sha

  case arp.op
  of ArpOpRequest:
    if arp.tpa == ctx.my_ip:
      send_arp_reply(ctx, arp)
  of ArpOpReply:
    discard  # cache already updated above
  else:
    discard  # unknown opcode (RARP, InARP, etc.) – ignore

First, we drop the frame if it’s not ARP. We’re not going to process a malformed frame either – skip it.

The next check is subtle. BPF captures both directions on the interface, so when we send_frame() an ARP reply, we’ll see our own packet come back through read_frames() a moment later:

if arp.sha == ctx.bpf.my_mac:
  return

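One thing worth pointing out – we update the cache for every valid ARP packet we see, requests included, which keeps our information up to date. This is a bit more liberal than RFC 826, which refreshes an existing entry on any packet but only creates a new entry when the packet is addressed to us.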
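For comparison, here’s a minimal sketch of the reception rule as RFC 826 describes it: refresh a sender we already know, and learn a new sender only when we are the target. rfc826Merge and the sample addresses are made up for this example; the handler above simply stores every sender.

```nim
import std/[tables, hashes]

type
  MacAddr = array[6, uint8]
  IPv4Addr = array[4, uint8]
  ArpCache = Table[IPv4Addr, MacAddr]

# RFC 826 style update: refresh known senders, learn new ones only
# when the packet is addressed to us.
proc rfc826Merge(cache: var ArpCache, myIp, spa: IPv4Addr,
                 sha: MacAddr, tpa: IPv4Addr) =
  var merged = false
  if spa in cache:
    cache[spa] = sha        # known sender: always refresh the mapping
    merged = true
  if tpa == myIp and not merged:
    cache[spa] = sha        # unknown sender talking to us: learn it

let myIp: IPv4Addr = [192'u8, 168, 42, 2]
let peerIp: IPv4Addr = [192'u8, 168, 42, 1]
let otherIp: IPv4Addr = [192'u8, 168, 42, 3]
let macA: MacAddr = [0x66'u8, 0x65, 0x74, 0x68, 0x00, 0x00]

var cache: ArpCache = initTable[IPv4Addr, MacAddr]()
rfc826Merge(cache, myIp, peerIp, macA, otherIp)   # not for us: ignored
echo cache.len
rfc826Merge(cache, myIp, peerIp, macA, myIp)      # for us: learned
echo cache.len
```

The practical difference is cache size: on a busy segment, storing every sender fills the table with hosts we never talk to.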

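When we get a request for our IP, we build a reply by swapping sender and target around (in the source file send_arp_reply has to be declared above handle_arp_ingress, since Nim requires a proc to be declared before its first use):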

proc send_arp_reply*(ctx: var ArpCtx, arp: ArpFrame) =
  var reply: ArpFrame
  reply.htype = ArpHwEthernet
  reply.ptype = ArpProtoIPv4
  reply.hlen  = 6
  reply.plen  = 4
  reply.op    = ArpOpReply

  reply.sha   = ctx.bpf.my_mac     # we are the sender now
  reply.spa   = ctx.my_ip
  reply.tha   = arp.sha            # the original requester is the target
  reply.tpa   = arp.spa

  let arpBytes = build_arp_frame(reply)
  let eth = build_ethernet_frame(arp.sha, ctx.bpf.my_mac,
                                  EthertypeArp, arpBytes)
  send_frame(ctx.bpf, eth)
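The swap itself is pure data shuffling, so it can be pulled into a function that doesn’t touch BPF and is easy to test. This is a sketch of one possible refactoring, not how the post’s code is organized; makeReply is a hypothetical name.

```nim
type
  MacAddr = array[6, uint8]
  IPv4Addr = array[4, uint8]
  ArpFrame = object
    htype, ptype: uint16
    hlen, plen: uint8
    op: uint16
    sha: MacAddr
    spa: IPv4Addr
    tha: MacAddr
    tpa: IPv4Addr

const
  ArpOpRequest = 1'u16
  ArpOpReply = 2'u16

# Build the reply for a request: we become the sender,
# the original requester becomes the target.
proc makeReply(req: ArpFrame, myMac: MacAddr, myIp: IPv4Addr): ArpFrame =
  result = req              # htype, ptype, hlen, plen carry over unchanged
  result.op = ArpOpReply
  result.sha = myMac
  result.spa = myIp
  result.tha = req.sha
  result.tpa = req.spa

let req = ArpFrame(
  htype: 1, ptype: 0x0800, hlen: 6, plen: 4, op: ArpOpRequest,
  sha: [0x66'u8, 0x65, 0x74, 0x68, 0x00, 0x00],
  spa: [192'u8, 168, 42, 1],
  tpa: [192'u8, 168, 42, 2])

let reply = makeReply(req, [0x66'u8, 0x65, 0x74, 0x68, 0x00, 0x01],
                      [192'u8, 168, 42, 2])
echo reply.op == ArpOpReply
echo reply.tha == req.sha
```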

Everything together:

when isMainModule:
  var ctx = ArpCtx(
    bpf: BPFLinkLayer(iface: "feth1"),
    cache: initTable[IPv4Addr, MacAddr](),
    my_ip: [192'u8, 168, 42, 2],
  )
  ctx.bpf.open_bpf()
  for frame in ctx.bpf.read_frames():
    handle_arp_ingress(ctx, frame)

Open BPF on some interface, claim 192.168.42.2 as our address, then read frames forever and feed each one into the handler.

A small note on the address itself: we’ve hardcoded 192.168.42.2 rather than picking it up from the kernel, and that’s deliberate.

In a real TCP/IP stack you’d get your address from DHCP, link-local autoconfiguration, or static config. We’re not there yet – we’re building just enough to demonstrate ARP.

Testing

Now that everything is connected, it’s time to test our ARP implementation. You might have noticed that the code uses the feth1 interface instead of en0, and you might ask why.

Two reasons:

The kernel competes. Our OS already runs a TCP/IP stack on en0, and for any IP it owns there, it answers ARP requests before our userspace stack can.
The loopback. Even setting that aside, you can’t drive an ARP exchange to your own IP from your own machine: the kernel short-circuits traffic to its own addresses through the loopback interface, so no ARP request ever goes out.

So, in order to test our ARP implementation and future layers, we need to have an isolated interface to test against. We’ll create our own pair of virtual interfaces and let the kernel route between them. One interface for the kernel to own, one for us:

sudo ifconfig feth0 create  # kernel's end
sudo ifconfig feth1 create  # our end
sudo ifconfig feth0 peer feth1 # The call that turns them into a pair
sudo ifconfig feth0 inet 192.168.42.1/24 up
sudo ifconfig feth1 up

inet 192.168.42.1/24 up – give feth0 an IP and bring it up. This is what makes the rest work – assigning the address triggers the kernel to install a route: 192.168.42.0/24 via feth0. Now anything on the host trying to reach .42.x knows exactly how to get there.
feth1 up – bring up our end without an IP. Deliberate: no kernel claim on 192.168.42.2, so the loopback shortcut doesn’t apply, and nothing on the host competes with our stack to answer ARP for it.

Now we can go ahead and test it. First we delete any stale entry for 192.168.42.2 from the kernel’s ARP cache, just in case:

sudo arp -d 192.168.42.2 2>/dev/null

Then we can trigger our ARP request:

ping -c 1 -W 1000 192.168.42.2

It should time out and that’s fine. We haven’t built ICMP yet – that’s a future post.

PING 192.168.42.2 (192.168.42.2): 56 data bytes

--- 192.168.42.2 ping statistics ---
1 packets transmitted, 0 packets received, 100.0% packet loss

But before timing out, the kernel needs a MAC for .2 to send the echo to, so it ARPs first. With our stack already running in another terminal, that request shows up in its output:

sudo nim c -r src/arp.nim

Successfully opened /dev/bpf0 with fd 3
Opened bpf device
My MAC address: 66:65:74:68:00:01
Successfully bound to interface
ARP REQ: 192.168.42.1 (66:65:74:68:00:00) -> 192.168.42.2
Sending ARP reply
ARP ingress handled

Wrap-up

That’s ARP done. We patched a handful of issues in the link layer along the way (multi-packet BPF reads, raw frame types, sending frames, fetching our own MAC) and built a small ARP responder on top. A few things we deliberately left out of our implementation:

Cache expiration. Entries in our cache live forever. Real ARP caches expire entries after a few minutes, so stale mappings don’t linger when MACs change. We’d add a timestamp per entry and evict old ones on a timer.
Sending requests. Our stack only answers requests, never sends them. A complete implementation also needs to originate an ARP request when it wants to reach an IP whose MAC it doesn’t know yet.
Gratuitous ARP. When a host claims an IP, it’s polite to announce the mapping unsolicited so neighbors populate their caches without having to ask. Useful when an interface comes up or a MAC changes.
IPv6 support. We’re IPv4-only here.
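To give an idea of what the first item would take, here’s a rough sketch of a cache with per-entry timestamps. All names are hypothetical, the 300-second lifetime is an arbitrary choice for the example, and the caller passes the current time in, so the logic stays independent of any particular clock.

```nim
import std/[tables, hashes]

type
  MacAddr = array[6, uint8]
  IPv4Addr = array[4, uint8]
  CacheEntry = object
    mac: MacAddr
    learnedAt: int64            # seconds, from whatever clock the caller uses
  ExpiringArpCache = Table[IPv4Addr, CacheEntry]

const MaxAgeSeconds = 300'i64   # arbitrary lifetime chosen for the example

proc learn(cache: var ExpiringArpCache, ip: IPv4Addr, mac: MacAddr, now: int64) =
  cache[ip] = CacheEntry(mac: mac, learnedAt: now)

# Collect stale keys first, then delete: a table must not be
# modified while it is being iterated.
proc evictExpired(cache: var ExpiringArpCache, now: int64) =
  var stale: seq[IPv4Addr]
  for ip, entry in cache:
    if now - entry.learnedAt > MaxAgeSeconds:
      stale.add ip
  for ip in stale:
    cache.del ip

var cache: ExpiringArpCache = initTable[IPv4Addr, CacheEntry]()
let ipA: IPv4Addr = [192'u8, 168, 42, 1]
let ipB: IPv4Addr = [192'u8, 168, 42, 3]
let mac: MacAddr = [0x66'u8, 0x65, 0x74, 0x68, 0x00, 0x00]

cache.learn(ipA, mac, now = 0)
cache.learn(ipB, mac, now = 200)
cache.evictExpired(now = 301)   # ipA is 301 s old, ipB only 101 s
echo cache.len
```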

For now what we have is enough to demonstrate the protocol.

Next post: IPv4 and ICMP, so the ping from this post actually returns a reply instead of timing out.

All the code is on GitHub.
