User Mode

Running programs in user mode is one of the most important features of an operating system. It provides a controlled environment for programs to run in, and prevents them from interfering with each other or the kernel. This is done by restricting the instructions that can be executed, and the memory that can be accessed. Once in user mode, a program can only return to kernel mode by executing a system call or through an interrupt (e.g. a timer interrupt). Even exiting the program requires a system call. We won't be implementing system calls in this section. We'll just focus on switching from kernel mode to user mode. The user program won't be able to do anything useful for now, but we should have a minimal user mode environment to build on later.

The main way to switch to user mode is to manually create an interrupt stack frame, as if the user program had just been interrupted. It should look like this:

            Stack
                     ┌──── stack bottom
    ┌────────────────▼─┐
    │        SS        │ +32  ◄── Data segment selector
    ├──────────────────┤
    │        RSP       │ +24  ◄── User stack pointer
    ├──────────────────┤
    │       RFLAGS     │ +16  ◄── CPU flags with IF=1
    ├──────────────────┤
    │        CS        │ +8   ◄── User code segment selector
    ├──────────────────┤
    │        RIP       │ 0    ◄── User code entry point
    ├────────────────▲─┤
    │                └──── stack top
    ├──────────────────┤

Then we can use the iretq instruction to switch to user mode. The iretq instruction pops the stack frame, loads the SS and RSP registers to switch to the user stack, loads the RFLAGS register, and loads the CS and RIP registers to switch to the user code entry point. The RFLAGS value should have the IF flag set to 1, which enables interrupts. This is important, because it allows the kernel to take control back from the user program when an interrupt occurs.
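
To make the layout concrete, here is a minimal sketch that describes the same five slots as a Nim object and checks the offsets from the diagram. The InterruptStackFrame type is a hypothetical name used only for illustration; the kernel code later in this section writes the values into the stack page directly.

```nim
# Hypothetical type for illustration only (not part of the kernel source).
type
  InterruptStackFrame = object
    rip: uint64     # +0   user code entry point (popped first by iretq)
    cs: uint64      # +8   user code segment selector
    rflags: uint64  # +16  CPU flags, with IF=1
    rsp: uint64     # +24  user stack pointer
    ss: uint64      # +32  user data segment selector (popped last)

# The offsets match the diagram: five 8-byte slots, 40 bytes in total.
doAssert offsetOf(InterruptStackFrame, rip) == 0
doAssert offsetOf(InterruptStackFrame, cs) == 8
doAssert offsetOf(InterruptStackFrame, rflags) == 16
doAssert offsetOf(InterruptStackFrame, rsp) == 24
doAssert offsetOf(InterruptStackFrame, ss) == 32

echo "frame size in bytes: ", sizeof(InterruptStackFrame)
```

Since every slot is a full 8 bytes, the selectors in CS and SS are stored zero-extended to 64 bits.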

An important thing to note is that, since this stack frame is at the bottom of the stack (i.e. the highest address in the page where the stack is mapped), the user stack is empty once iretq has popped the frame. If the user program returns from the entry point, the ret instruction will try to pop a return address from above the stack, and a page fault will occur, since that area is unmapped. As mentioned earlier, the only way to return to kernel mode is through a system call or an interrupt. So, for the purpose of this section, we'll just create a user program that never returns (until we implement system calls).

Preparing for User Mode

So far, the virtual memory mapping we have is for kernel space only. We need to create a different mapping for user space so that the user program can access it. This includes mapping of the user code, data, and stack regions, as well as the kernel space (which is protected since it's marked as supervisor only). Mapping the kernel space in the user page table is necessary, since interrupts and system calls cause the CPU to jump to kernel code without switching page tables. Also, many system calls will need access to data in user space.

Since we don't have the ability in the kernel to access disks and filesystems yet, we won't load the user program from disk. What we can do is build the user program separately, and copy it alongside the kernel image, and let the bootloader load it for us. So here's the plan to get user mode working:

1. Create a program that we want to run in user mode.
2. Build the program and copy it to the efi\fusion directory (next to the kernel image).
3. In the bootloader, load the user program into memory, and pass its physical address and size to the kernel.
4. Allocate memory for the user stack.
5. Create a new page table for user space.
6. Map the user code and stack regions to user space.
7. Copy the kernel space page table entries to the user page table.
8. Craft an interrupt stack frame that will switch to user mode. Place it at the bottom of the user stack (i.e. the top of the mapped stack region).
9. Change the rsp register to point to the top of the interrupt stack frame (i.e. the last pushed value).
10. Load the user page table physical address into the cr3 register.
11. Use the iretq instruction to pop the interrupt stack frame and switch to user mode.

User Program

Let's start by creating a new module in src/user/utask.nim for the user code, and defining a function that we want to run in user mode. We'll call it UserMain.

# src/user/utask.nim

{.used.}

proc NimMain() {.importc.}

proc UserMain() =
  NimMain()

  asm """
  .loop:
    pause
    jmp .loop
  """

After calling NimMain to initialize the Nim runtime, the function just executes the pause instruction in a loop. The pause instruction is a hint to the CPU that the code is in a spin loop, allowing it to greatly reduce the processor's power consumption.

Let's create a linker script to define the layout of the user code and data sections. It's very similar to the kernel linker script, except we link the user program at a virtual address in user space, instead of kernel space (it doesn't matter where in user space, as long as it's mapped).

/* src/user/utask.ld */

SECTIONS
{
  . = 0x00000000040000000; /* 1 GiB */
  .text   : {
    *utask*.o(.*text.UserMain)
    *utask*.o(.*text.*)
    *(.*text*)
  }
  .rodata : { *(.*rodata*) }
  .data   : { *(.*data) *(.*bss) }

  .shstrtab : { *(.shstrtab) } /* cannot be discarded */
  /DISCARD/ : { *(*) }
}

Now, let's add a nim.cfg file to the src/user directory to configure the Nim compiler for the user program. It should be very similar to the kernel nim.cfg file.

# src/user/nim.cfg
amd64.any.clang.linkerexe="ld.lld"
--passc:"-target x86_64-unknown-none"
--passc:"-ffreestanding"
--passc:"-ffunction-sections"
--passc:"-mcmodel=large"
--passl:"-nostdlib"
--passl:"-T src/user/utask.ld"
--passl:"-entry=UserMain"
--passl:"-Map=build/utask.map"
--passl:"--oformat=binary"

Let's update our justfile to build the user program and copy it in place.

...

user_nim := "src/user/utask.nim"
user_out := "utask.bin"

...

user:
  nim c {{nimflags}} --out:build/{{user_out}} {{user_nim}}

run *QEMU_ARGS: bootloader kernel user
  mkdir -p {{disk_image_dir}}/efi/boot
  mkdir -p {{disk_image_dir}}/efi/fusion
  cp build/{{boot_out}} {{disk_image_dir}}/efi/boot/{{boot_out}}
  cp build/{{kernel_out}} {{disk_image_dir}}/efi/fusion/{{kernel_out}}
  cp build/{{user_out}} {{disk_image_dir}}/efi/fusion/{{user_out}}

  @echo ""
  qemu-system-x86_64 \
    -drive if=pflash,format=raw,file={{ovmf_code}},readonly=on \
    -drive if=pflash,format=raw,file={{ovmf_vars}} \
    -drive format=raw,file=fat:rw:{{disk_image_dir}} \
    -machine q35 \
    -net none \
    -debugcon stdio {{QEMU_ARGS}}

clean:
  rm -rf build
  rm -rf {{disk_image_dir}}/efi/boot/{{boot_out}}
  rm -rf {{disk_image_dir}}/efi/fusion/{{kernel_out}}
  rm -rf {{disk_image_dir}}/efi/fusion/{{user_out}}

Finally, let's build the user program and check the linker map.

$ just user

$ head -n 20 build/utask.map
        VMA              LMA     Size Align Out     In      Symbol
          0                0 40000000     1 . = 0x00000000040000000
   40000000         40000000     9fec    16 .text
   40000000         40000000       59    16         .../fusion/build/@mutask.nim.c.o:(.ltext.UserMain)
   40000000         40000000       59     1                 UserMain
   40000060         40000060       9b    16         .../fusion/build/@mutask.nim.c.o:(.ltext.nimFrame)
   40000060         40000060       9b     1                 nimFrame
   40000100         40000100       12    16         .../fusion/build/@mutask.nim.c.o:(.ltext.PreMainInner)
   40000100         40000100       12     1                 PreMainInner
   40000120         40000120       1e    16         .../fusion/build/@mutask.nim.c.o:(.ltext.PreMain)

This looks good. The UserMain function is linked first and starts at 0x40000000, which is what we asked for.

Loading the User Program

Now, let's try to load the user program in the bootloader. We'll do the same thing we did for the kernel, except we'll load the user program to an arbitrary physical address, instead of a specific address. We'll mark this region of memory as UserCode so that it's not considered free.

In the bootloader, we already have code that loads the kernel image. Let's refactor it into a loadImage proc, and use that proc for both the kernel and the user task.

# src/boot/bootx64.nim

proc loadImage(
  imagePath: WideCString,
  rootDir: ptr EfiFileProtocol,
  memoryType: EfiMemoryType,
  loadAddress: Option[EfiPhysicalAddress] = none(EfiPhysicalAddress),
): tuple[base: EfiPhysicalAddress, pages: uint64] =
  # open the image file
  var file: ptr EfiFileProtocol

  consoleOut "boot: Opening image: "
  consoleOut imagePath
  checkStatus rootDir.open(rootDir, addr file, imagePath, 1, 1)

  # get file size
  var fileInfo: EfiFileInfo
  var fileInfoSize = sizeof(EfiFileInfo).uint

  consoleOut "boot: Getting file info"
  checkStatus file.getInfo(
    file, addr EfiFileInfoGuid, addr fileInfoSize, addr fileInfo
  )
  echo &"boot: Image file size: {fileInfo.fileSize} bytes"

  var imageBase: EfiPhysicalAddress
  let imagePages = (fileInfo.fileSize + 0xFFF).uint div PageSize.uint # round up to nearest page

  consoleOut &"boot: Allocating memory for image"
  if loadAddress.isSome:
    imageBase = cast[EfiPhysicalAddress](loadAddress.get)
    checkStatus uefi.sysTable.bootServices.allocatePages(
      AllocateAddress,
      memoryType,
      imagePages,
      cast[ptr EfiPhysicalAddress](imageBase.addr)
    )
  else:
    checkStatus uefi.sysTable.bootServices.allocatePages(
      AllocateAnyPages,
      memoryType,
      imagePages,
      cast[ptr EfiPhysicalAddress](imageBase.addr)
    )

  # read the image into memory
  consoleOut "boot: Reading image into memory"
  checkStatus file.read(file, cast[ptr uint](addr fileInfo.fileSize), cast[pointer](imageBase))

  # close the image file
  consoleOut "boot: Closing image file"
  checkStatus file.close(file)

  result = (imageBase, imagePages.uint64)

The proc allows loading an image at a specific address (if loadAddress is provided), or at any address (if loadAddress is none). The former is useful for loading the kernel image at a specific address. The latter is useful for loading the user program at any address.
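
The page count passed to allocatePages comes from rounding the file size up to a whole number of pages. Here is a small standalone sketch of that arithmetic, assuming 4 KiB pages (which the 0xFFF constant implies). The pagesFor name is hypothetical and only used for illustration.

```nim
const PageSize = 4096'u64  # assumption: 4 KiB pages, as implied by 0xFFF

# Hypothetical helper, mirroring the rounding done inside loadImage.
proc pagesFor(sizeInBytes: uint64): uint64 =
  ## Number of whole pages needed to hold `sizeInBytes` bytes.
  (sizeInBytes + PageSize - 1'u64) div PageSize

echo pagesFor(1'u64)     # a single byte still needs one full page
echo pagesFor(4096'u64)  # exactly one page
echo pagesFor(4097'u64)  # one byte over spills into a second page
```

Adding PageSize - 1 before the integer division is what turns truncation into rounding up.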

Let's now update EfiMainInner by replacing that section of the code with two calls to loadImage.

# src/boot/bootx64.nim
...

proc EfiMainInner(imgHandle: EfiHandle, sysTable: ptr EfiSystemTable): EfiStatus =
  ...

  # open the root directory
  var rootDir: ptr EfiFileProtocol

  consoleOut "boot: Opening root directory"
  checkStatus fileSystem.openVolume(fileSystem, addr rootDir)

  # load kernel image
  let (kernelImageBase, kernelImagePages) = loadImage(
    imagePath = W"efi\fusion\kernel.bin",
    rootDir = rootDir,
    memoryType = OsvKernelCode,
    loadAddress = KernelPhysicalBase.EfiPhysicalAddress.some
  )

  # load user task image
  let (userImageBase, userImagePages) = loadImage(
    imagePath = W"efi\fusion\utask.bin",
    rootDir = rootDir,
    memoryType = OsvUserCode,
  )

  # close the root directory
  consoleOut "boot: Closing root directory"
  checkStatus rootDir.close(rootDir)

Notice that I added a new value to the EfiMemoryType enum called OsvUserCode. This is just a value that we'll use to mark the user code region as used. Here's the updated enum:

# src/common/uefi.nim
...

  EfiMemoryType* = enum
    ...
    OsvKernelCode = 0x80000000
    OsvKernelData = 0x80000001
    OsvKernelStack = 0x80000002
    OsvUserCode = 0x80000003
    EfiMaxMemoryType

Let's map this value to a new UserCode value in our MemoryType enum (which is what we pass to the kernel as part of the memory map). While we're here, I'm going to also add values for UserData and UserStack (which we'll use later).

# src/common/bootinfo.nim
...

type
  MemoryType* = enum
    Free
    KernelCode
    KernelData
    KernelStack
    UserCode
    UserData
    UserStack
    Reserved

Let's also update convertUefiMemoryMap to account for the new memory type.

# src/boot/bootx64.nim

proc convertUefiMemoryMap(...): seq[MemoryMapEntry] =
   ...

  for i in 0 ..< uefiNumMemoryMapEntries:
    ...
    let memoryType =
      if uefiEntry.type in FreeMemoryTypes:
        Free
      elif uefiEntry.type == OsvKernelCode:
        KernelCode
      elif uefiEntry.type == OsvKernelData:
        KernelData
      elif uefiEntry.type == OsvKernelStack:
        KernelStack
      elif uefiEntry.type == OsvUserCode:
        UserCode
      else:
        Reserved
    ...

And finally, we need to tell the kernel where to find the user task image in memory. Let's add a couple of fields to BootInfo to store the user image physical address and number of pages.

# src/common/bootinfo.nim
...

  BootInfo* = object
    physicalMemoryMap*: MemoryMap
    virtualMemoryMap*: MemoryMap
    physicalMemoryVirtualBase*: uint64
    userImagePhysicalBase*: uint64
    userImagePages*: uint64

And to populate the fields, we'll update createBootInfo to take the values returned by loadImage as parameters.

# src/boot/bootx64.nim

proc createBootInfo(
  bootInfoBase: uint64,
  kernelImagePages: uint64,
  physMemoryPages: uint64,
  physMemoryMap: seq[MemoryMapEntry],
  virtMemoryMap: seq[MemoryMapEntry],
  userImageBase: uint64,
  userImagePages: uint64,
): ptr BootInfo =
  ...

  bootInfo.userImagePhysicalBase = userImageBase
  bootInfo.userImagePages = userImagePages

  result = bootInfo

...

proc EfiMain(imgHandle: EfiHandle, sysTable: ptr EfiSystemTable): EfiStatus {.exportc.} =
  ...

  let bootInfo = createBootInfo(
    bootInfoBase,
    kernelImagePages,
    physMemoryPages,
    physMemoryMap,
    virtMemoryMap,
    userImageBase,
    userImagePages,
  )
  ...

Let's test it out by printing the user image physical address and number of pages in the kernel.

# src/kernel/main.nim
...

proc KernelMainInner(bootInfo: ptr BootInfo) =
  ...

  debugln &"kernel: User image physical address: {bootInfo.userImagePhysicalBase:#010x}"
  debugln &"kernel: User image pages: {bootInfo.userImagePages}"

If we build and run the kernel, we should see the following output:

kernel: Fusion Kernel
...
kernel: Initializing GDT [success]
kernel: Initializing IDT [success]
kernel: User image physical address: 0x06129000
kernel: User image pages: 268

It seems like it's working. The user image is loaded at some address allocated by the bootloader. The kernel now knows where to find the user image, and should be able to map it to user space.

User Page Table

Now, let's create a new PML4Table for the user page table. We'll copy the kernel page table entries to the user page table, and map the user code and stack regions to user space.

# src/kernel/main.nim
...

const
  UserImageVirtualBase = 0x0000000040000000
  UserStackVirtualBase = 0x0000000050000000

...

proc KernelMainInner(bootInfo: ptr BootInfo) =
  debugln ""
  debugln "kernel: Fusion Kernel"

  ...

  debugln "kernel: Initializing user page table"
  var upml4 = cast[ptr PML4Table](new PML4Table)

  debugln "kernel:   Copying kernel space user page table"
  var kpml4 = getActivePML4()
  for i in 256 ..< 512:
    upml4.entries[i] = kpml4.entries[i]

  debugln &"kernel:   Mapping user image ({UserImageVirtualBase:#x} -> {bootInfo.userImagePhysicalBase:#x})"
  mapRegion(
    pml4 = upml4,
    virtAddr = UserImageVirtualBase.VirtAddr,
    physAddr = bootInfo.userImagePhysicalBase.PhysAddr,
    pageCount = bootInfo.userImagePages,
    pageAccess = paReadWrite,
    pageMode = pmUser,
  )

  # allocate and map user stack
  let userStackPhysAddr = pmAlloc(1).get
  debugln &"kernel:   Mapping user stack ({UserStackVirtualBase:#x} -> {userStackPhysAddr.uint64:#x})"
  mapRegion(
    pml4 = upml4,
    virtAddr = UserStackVirtualBase.VirtAddr,
    physAddr = userStackPhysAddr,
    pageCount = 1,
    pageAccess = paReadWrite,
    pageMode = pmUser,
  )

This should be straightforward. A few things to note:

We don't physically copy the kernel page table structures to the user page table. We just set the PML4 entries to point to the same page table structures as the kernel page table. This makes the kernel space portion of the user page table dynamic, so that if we change the kernel page table, the user page table will automatically reflect the changes (unless we map new PML4 entries in the kernel page table, which we won't do for now).
We're setting the pageMode to pmUser for the user code and stack regions.
We allocate one page for the user stack, and map it to the virtual address 0x50000000, so the stack region will be 0x50000000 to 0x50001000 (end address is exclusive).
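
To see why copying entries 256 to 511 covers kernel space, it helps to look at how a virtual address selects a PML4 entry. The sketch below is illustrative only: pml4Index, pdptIndex, and HigherHalfStart are hypothetical names, and it assumes 4-level paging with 48-bit virtual addresses, where the upper canonical half starts at 0xFFFF800000000000.

```nim
# Hypothetical helpers for illustration (not part of the kernel source).
proc pml4Index(virtAddr: uint64): uint64 =
  ## Bits 39..47 of a virtual address select the PML4 entry.
  (virtAddr shr 39) and 0x1FF'u64

proc pdptIndex(virtAddr: uint64): uint64 =
  ## Bits 30..38 select the entry in the next level (the PDPT).
  (virtAddr shr 30) and 0x1FF'u64

const
  HigherHalfStart = 0xFFFF800000000000'u64  # assumption: start of the upper canonical half
  UserImageVirtualBase = 0x0000000040000000'u64
  UserStackVirtualBase = 0x0000000050000000'u64

# The first higher-half address lands in entry 256, so entries
# 256..511 together cover the whole upper half.
doAssert pml4Index(HigherHalfStart) == 256'u64

# Both user regions fall under PML4 entry 0, which is not copied
# from the kernel page table.
doAssert pml4Index(UserImageVirtualBase) == 0'u64
doAssert pml4Index(UserStackVirtualBase) == 0'u64

echo "user image PDPT index: ", pdptIndex(UserImageVirtualBase)
echo "user stack PDPT index: ", pdptIndex(UserStackVirtualBase)
```

Both user regions lie between 1 GiB and 2 GiB, so they also share the same PDPT entry.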

Interrupt Stack Frame

Now, in order to switch to user mode, we'll create an interrupt stack frame, as if the user program had just been interrupted. We'll populate five entries at the bottom of the stack: RIP, CS, RFLAGS, RSP, and SS.

# src/kernel/main.nim
...

proc KernelMainInner(bootInfo: ptr BootInfo) =
  ...

  debugln "kernel: Creating interrupt stack frame"
  let userStackBottom = UserStackVirtualBase + PageSize
  let userStackPtr = cast[ptr array[512, uint64]](p2v(userStackPhysAddr))
  userStackPtr[^1] = cast[uint64](DataSegmentSelector) # SS
  userStackPtr[^2] = cast[uint64](userStackBottom) # RSP
  userStackPtr[^3] = cast[uint64](0x202) # RFLAGS
  userStackPtr[^4] = cast[uint64](UserCodeSegmentSelector) # CS
  userStackPtr[^5] = cast[uint64](UserImageVirtualBase) # RIP
  debugln &"            SS: {userStackPtr[^1]:#x}"
  debugln &"           RSP: {userStackPtr[^2]:#x}"
  debugln &"        RFLAGS: {userStackPtr[^3]:#x}"
  debugln &"            CS: {userStackPtr[^4]:#x}"
  debugln &"           RIP: {userStackPtr[^5]:#x}"

  let rsp = cast[uint64](userStackBottom - 5 * 8)

Stack terminology can be confusing. The stack grows downwards, so the bottom of the stack is the highest address. This is why we set userStackBottom to the highest address of the stack region. Now, to manipulate the stack region from the kernel, we reverse-map the stack's physical address to a virtual address, and cast it to a pointer to an array of 512 uint64 values (remember that UserStackVirtualBase is valid only in the user page table, not the kernel page table). We then populate the five entries at the bottom of the stack, and set rsp to point to the top entry. This simulates pushing the interrupt stack frame on the stack.
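
The address arithmetic above can be checked in isolation. This is a standalone sketch using the same constants. FrameSlots, SlotsPerPage, ripIndex, and Rflags are hypothetical names introduced for illustration, and PageSize is assumed to be 4096.

```nim
import std/strutils

const
  PageSize = 4096'u64                  # assumption: 4 KiB pages
  UserStackVirtualBase = 0x50000000'u64
  FrameSlots = 5'u64                   # RIP, CS, RFLAGS, RSP, SS
  SlotsPerPage = PageSize div 8'u64    # 512 uint64 slots in one page

let userStackBottom = UserStackVirtualBase + PageSize
let rsp = userStackBottom - FrameSlots * 8'u64

# Index of the lowest frame slot (RIP), i.e. the slot that [^5] refers to.
let ripIndex = SlotsPerPage - FrameSlots

# The user-space address of that slot is exactly the value loaded into rsp.
doAssert UserStackVirtualBase + ripIndex * 8'u64 == rsp

# RFLAGS = 0x202: bit 9 is IF; bit 1 is reserved and always set.
const Rflags = 0x202'u64
doAssert (Rflags and (1'u64 shl 9)) != 0'u64
doAssert (Rflags and (1'u64 shl 1)) != 0'u64

echo "rsp = 0x", toHex(rsp, 8)  # rsp = 0x50000FD8
```

The kernel writes the frame through its own mapping of the physical page, but the RSP value stored in the frame and the rsp value computed here are user-space addresses, since they only take effect after the user page table is active.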

Switching to User Mode

We're finally ready to switch to user mode. We'll activate the user page table, set the rsp register to point to the interrupt stack frame, and use the iretq instruction to switch to user mode.

# src/kernel/main.nim

proc KernelMainInner(bootInfo: ptr BootInfo) =
  ...

  debugln "kernel: Switching to user mode"
  setActivePML4(upml4)
  asm """
    mov rbp, 0
    mov rsp, %0
    iretq
    :
    : "r"(`rsp`)
  """

If we did everything correctly, we should see the following output:

kernel: Fusion Kernel
...
kernel: Initializing user page table
kernel:   Copying kernel space user page table
kernel:   Mapping user image (0x40000000 -> 0x6129000)
kernel:   Mapping user stack (0x50000000 -> 0x3000)
kernel: Creating interrupt stack frame
            SS: 0x13
           RSP: 0x50001000
        RFLAGS: 0x202
            CS: 0x1b
           RIP: 0x40000000
kernel: Switching to user mode
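
The SS and CS values in this output are segment selectors, which pack a descriptor table index together with a requested privilege level (RPL). Here is a small sketch that decodes them. The Selector type and decodeSelector proc are hypothetical names used only for illustration.

```nim
# Hypothetical helper for illustration (not part of the kernel source).
type
  Selector = tuple[index: uint16, ti: uint16, rpl: uint16]

proc decodeSelector(sel: uint16): Selector =
  ## Split a segment selector into its three fields.
  (index: sel shr 3,              # bits 3..15: descriptor index
   ti: (sel shr 2) and 1'u16,     # bit 2: 0 = GDT, 1 = LDT
   rpl: sel and 3'u16)            # bits 0..1: requested privilege level

let cs = decodeSelector(0x1b'u16)
let ss = decodeSelector(0x13'u16)

# Both selectors refer to the GDT and request ring 3.
doAssert cs.index == 3'u16 and cs.ti == 0'u16 and cs.rpl == 3'u16
doAssert ss.index == 2'u16 and ss.ti == 0'u16 and ss.rpl == 3'u16

echo "CS index: ", cs.index, ", RPL: ", cs.rpl
echo "SS index: ", ss.index, ", RPL: ", ss.rpl
```

In other words, 0x1b is GDT entry 3 (byte offset 0x18) with RPL=3, and 0x13 is GDT entry 2 (byte offset 0x10) with RPL=3.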

How do we know we're in user mode? Well, we can't really tell from the output, so let's use QEMU's monitor to check the CPU registers.

(qemu) info registers

CPU#0
RAX=0000000000000000 RBX=0000000000000000 RCX=0000000050000fc8 RDX=0000000000000000
RSI=0000000000000001 RDI=0000000050000fc8 RBP=0000000050000ff8 RSP=0000000050000fc8
R8 =ffff800100003c30 R9 =0000000000000001 R10=000000000636d001 R11=0000000000000004
R12=0000000000000000 R13=0000000006bb1588 R14=0000000000000000 R15=0000000007ebf1e0
RIP=000000004000004c RFL=00000206 [-----P-] CPL=3 II=0 A20=1 SMM=0 HLT=0
ES =0013 0000000000000000 000fffff 000ff300 DPL=3 DS   [-WA]
CS =001b 0000000000000000 000fffff 002ffa00 DPL=3 CS64 [-R-]
SS =0013 0000000000000000 000fffff 000ff300 DPL=3 DS   [-WA]
DS =0013 0000000000000000 000fffff 000ff300 DPL=3 DS   [-WA]
FS =0013 0000000000000000 000fffff 000ff300 DPL=3 DS   [-WA]
GS =0013 0000000000000000 000fffff 000ff300 DPL=3 DS   [-WA]
LDT=0000 0000000000000000 0000ffff 00008200 DPL=0 LDT
TR =0000 0000000000000000 0000ffff 00008b00 DPL=0 TSS64-busy
GDT=     ffff800000226290 0000001f
IDT=     ffff8000002262b0 00000fff
...

We can see that CPL=3, which means we're in user mode! The CS register is 0x1b, which is the user code segment selector (0x18 with RPL=3). The RIP register is 0x4000004c, which is several instructions into the UserMain function. Let's try to disassemble the code at the entry point.

(qemu) x /15i 0x40000000
0x40000000:  55                       pushq    %rbp
0x40000001:  48 89 e5                 movq     %rsp, %rbp
0x40000004:  48 83 ec 30              subq     $0x30, %rsp
0x40000008:  48 b8 8b a6 00 40 00 00  movabsq  $0x4000a68b, %rax
0x40000010:  00 00
0x40000012:  48 89 45 d8              movq     %rax, -0x28(%rbp)
0x40000016:  48 b8 7d a5 00 40 00 00  movabsq  $0x4000a57d, %rax
0x4000001e:  00 00
0x40000020:  48 89 45 e8              movq     %rax, -0x18(%rbp)
0x40000024:  48 c7 45 e0 00 00 00 00  movq     $0, -0x20(%rbp)
0x4000002c:  66 c7 45 f0 00 00        movw     $0, -0x10(%rbp)
0x40000032:  48 b8 70 00 00 40 00 00  movabsq  $0x40000070, %rax
0x4000003a:  00 00
0x4000003c:  48 8d 7d d0              leaq     -0x30(%rbp), %rdi
0x40000040:  ff d0                    callq    *%rax
0x40000042:  48 c7 45 e0 06 00 00 00  movq     $6, -0x20(%rbp)
0x4000004a:  f3 90                    pause
0x4000004c:  e9 f9 ff ff ff           jmp      0x4000004a

Looks like we're executing the UserMain function! Notice that the last two instructions are a pause instruction and a jump to the pause instruction. This is the loop we created in the UserMain function. We can also see that the RIP register is set to 0x4000004c, which is the address of the jmp instruction. Everything seems to be working as expected.
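
As a sanity check on the disassembly, we can decode the jump target by hand. The e9 opcode takes a signed 32-bit displacement relative to the address of the next instruction. The sketch below redoes that calculation; the constant names are hypothetical and the byte values are copied from the listing above.

```nim
import std/strutils

const
  JmpAddress = 0x4000004c'i64  # address of the e9 opcode
  JmpLength = 5'i64            # 1 opcode byte + 4 displacement bytes

# Displacement bytes f9 ff ff ff (little-endian) as a signed 32-bit value.
let displacement = cast[int32](0xfffffff9'u32)

# The displacement is applied to the address of the next instruction.
let target = JmpAddress + JmpLength + displacement.int64

doAssert displacement == -7'i32
doAssert target == 0x4000004a'i64

echo "jmp target = 0x", toHex(target, 8)  # jmp target = 0x4000004A
```

The result is the address of the pause instruction, which confirms that the loop jumps back to where we expect.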

This is another big milestone! We now have a minimal user mode environment. It's not very useful yet, but it gives us something to build on. System calls are the natural next step, but before we get to them, we need to allow the CPU to switch back to kernel mode. This requires something called the Task State Segment (TSS), which we'll cover in the next section.