show/tell me about DEC hardware (and x86 memory management)

eschaton: Mar 7, 2007; Don't you just hate when you wind up in a store with people who are in a socioeconomic class that is pretty obviously about two levels lower than your own?

also I/O is handled as part of the instruction set so you basically have putc() and getc() in hardware which means you can even get pretty far in (1) loading, (2) running, and (3) reporting the results from the actual DEC diagnostic tests in those couple hours

# ? Jan 12, 2022 07:30

Good Sphere: Jun 16, 2018

my dad worked there for 20+ years, before moving onto Intel when they were bought up

i had a poster on my bedroom wall of the alpha processor

his mother asked him to look for a job. he opened up a phone book and saw digital equipment corporation and said "oh computers. i've heard of these". he first worked with hardware assembly, then was an instructor for various things, including using oscilloscopes, until he went onto technical writing. he made a silly (but very inspiring) video on a weekend there in his early days. there was a production rental department where he got cameras, and made a video about him stuck in a microprocessor, a journey through the manufacturing plant, and something with someone dressed in a gorilla costume. gotta post it someday

# ? Jan 12, 2022 20:13

cheque_some: Dec 6, 2006; The Wizard of Menlo Park

Good Sphere posted:

my dad worked there for 20+ years, before moving onto Intel when they were bought up

i had a poster on my bedroom wall of the alpha processor

his mother asked him to look for a job. he opened up a phone book and saw digital equipment corporation and said "oh computers. i've heard of these". he first worked with hardware assembly, then was an instructor for various things, including using oscilloscopes, until he went onto technical writing. he made a silly (but very inspiring) video on a weekend there in his early days. there was a production rental department where he got cameras, and made a video about him stuck in a microprocessor, a journey through the manufacturing plant, and something with someone dressed in a gorilla costume. gotta post it someday

someday soon plz

# ? Jan 13, 2022 04:29

Kazinsal: Dec 13, 2011

Captain Foo posted:

go on about VAX architecture plz

the VAX ISA is fuckin wild. it takes the idea of a complex instruction set computer and adds a dash of purestrain late 70s "dude this blow is amazing we need to fly to colombia for the weekend more often". it's a 16-register machine with a similar register layout to ARM, a base ISA that's more or less the PDP-11 ISA extended to 32 bits, a four-ring protection level system, four one-gigabyte segments, and a whole bunch of insane extra instructions for making assembly programming "easier". want instructions to implement doubly-linked ring buffer operations with optional multiprocessor-safe locking in a single instruction? VAX has those. want two- and extended three-operand versions of the C standard library string functions as microcoded instructions? VAX has that. need an instruction to do CRC16 and/or CRC32 on an arbitrary length string of bytes? VAX has one. have you ever wanted an instruction that's a whole implementation of a stream editor like sed? look no further than VAX! and if that's not enough and you want to emulate your own instruction set extensions in software, VAX has a mechanism for that.

later versions of the VAX ISA describe an alternate larger-exponent-smaller-mantissa 64-bit floating point format and a super large 128-bit floating point format, and extend all the instructions to allow use of those anywhere the normal 32-bit and 64-bit floats are usable. there's also a full set of instructions to convert between every type of float to each other, and to convert different sized integers to different sized floats and back, with different rounding/truncation options, and each option is its own opcode.

oh and of course since it's a full 32-bit minicomputer it's got all the nice fancy memory management features you'd want out of something that will have dozens of simultaneous users and hundreds of processes running like paging and per-page privilege level checking and segment limits so the OS can do things like auto-grow process stacks. and separate machine-level registers for setting the base and limit of the system segment on a context switch so you could implement syscall trampolines if you wanted to

Kazinsal fucked around with this message at 09:31 on Jan 13, 2022

# ? Jan 13, 2022 09:26

echinopsis: Apr 13, 2004; by Fluffdaddy

man I want to try cocaine

# ? Jan 13, 2022 09:36

Kazinsal: Dec 13, 2011

other awesome wild VAX stuff: since DEC was the only company making VAXen the spec has a whole bunch of standard requirements for VAX machines built in that the operating system can expect. things like clocks, built-in timers, PDP-11 emulation, and a proto-"lights out console" attached to the system firmware that can be used to debug the machine and single-step through code in the event of a crash. the first multiprocessor VAX came out in 1982 and the first line of SMP VAXes were released in 1986. by the early 90s you could get VAXes with up to 8 processors and support for a gig of RAM if you wanted to drop some serious cash. and all your software from 1978 would conveniently still work!

# ? Jan 13, 2022 10:04

git apologist: Jun 4, 2003

eschaton posted:

I have several, though not a lot of pics

here's my DEC 3000 AXP Model 400, from the first series of Alpha workstations and servers, booted into OpenVMS 7.1 as a workstation

I also have a couple of AlphaStation systems, an AlphaServer 1000A, and an adorable little AlphaServer DS10e which I finally got a video card for so it can be used as a workstation

as for what they were like to use, they were "just" a workstation, but compared to their contemporaries they felt really loving fast

what was more unique about them was that DECwindows was a complete environment very early on (pre-CDE) and reasonably productivity-focused so it even had tools like a paint app and DEC even made a MacWrite clone (DECwrite)

DECwindows started life on the VAXstation workstations, but was kept high-level and built for both VMS and Ultrix on them and so was easy to bring to the MIPS and then Alpha lines too, and DEC really thought they'd be able to put X terminals on the desks of non-technical staff to run productivity software instead of using PCs, which wasn't that insane when PCs were also a couple grand each and ran DOS and the idea of a network was new, but by 1992 the writing was on the wall

it still meant that OEMs using things like an Alpha workstation or server as a PostScript rasterizer for phototypesetting would actually build monitoring applications that fit into the environment, instead of having their own wacky interface. that's what my 400 was used for before I got it

oh yeah this is the poo poo *exhales cigarette*

# ? Jan 13, 2022 11:08

AlbertFlasher: Feb 14, 2006; Hulk Hogan and the Wrestling Boot Band

Gentle Autist posted:

oh yeah this is the poo poo *exhales cigarette*

# ? Jan 13, 2022 15:47

Best Bi Geek Squid: Mar 25, 2016

shame the market was filled with anti-VAXers

# ? Jan 13, 2022 15:56

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

thanks for the info dump, seems neat

# ? Jan 13, 2022 16:23

Good Sphere: Jun 16, 2018

Kazinsal posted:

the VAX ISA is fuckin wild. it takes the idea of a complex instruction set computer and adds a dash of purestrain late 70s "dude this blow is amazing we need to fly to colombia for the weekend more often". it's a 16-register machine with a similar register layout to ARM, a base ISA that's more or less the PDP-11 ISA extended to 32 bits, a four-ring protection level system, four one-gigabyte segments, and a whole bunch of insane extra instructions for making assembly programming "easier". want instructions to implement doubly-linked ring buffer operations with optional multiprocessor-safe locking in a single instruction? VAX has those. want two- and extended three-operand versions of the C standard library string functions as microcoded instructions? VAX has that. need an instruction to do CRC16 and/or CRC32 on an arbitrary length string of bytes? VAX has one. have you ever wanted an instruction that's a whole implementation of a stream editor like sed? look no further than VAX! and if that's not enough and you want to emulate your own instruction set extensions in software, VAX has a mechanism for that.

later versions of the VAX ISA describe an alternate larger-exponent-smaller-mantissa 64-bit floating point format and a super large 128-bit floating point format, and extend all the instructions to allow use of those anywhere the normal 32-bit and 64-bit floats are usable. there's also a full set of instructions to convert between every type of float to each other, and to convert different sized integers to different sized floats and back, with different rounding/truncation options, and each option is its own opcode.

oh and of course since it's a full 32-bit minicomputer it's got all the nice fancy memory management features you'd want out of something that will have dozens of simultaneous users and hundreds of processes running like paging and per-page privilege level checking and segment limits so the OS can do things like auto-grow process stacks. and separate machine-level registers for setting the base and limit of the system segment on a context switch so you could implement syscall trampolines if you wanted to

when you say segments, what are you referring to exactly? i'm not up to par on this processor lingo

# ? Jan 15, 2022 01:42

Kazinsal: Dec 13, 2011

Good Sphere posted:

when you say segments, what are you referring to exactly? i'm not up to par on this processor lingo

basically separate sections of the virtual address space that are mapped to different regions for different purposes. in the VAX world the 4 GB virtual address space is split up into four segments of a max size of 1 GB that each have their own permissions and addressing limit. if code reaches the segment limit a fault fires and the kernel can deal with it in an appropriate manner (eg. map more stack if it's the stack segment or map more heap if it's the heap region, or terminate the process if it's some kind of fatal segmentation fault).

on VAX the implicit segments are P0 (starts at 0x00000000 and goes to 0x3FFFFFFF, intended for user-space code and data, grows upwards), P1 (0x40000000 to 0x7FFFFFFF, intended for user-space stack and data, grows downwards because that's how the stack works), SYSTEM (0x80000000 to 0xBFFFFFFF intended for system code and data, grows upwards), and the System Reserved area (0xC0000000 to 0xFFFFFFFF). each segment has its own page table, which is a virtual address mapping method where every access to any given virtual address is checked against the page table entry for the page (VAX pages are 512 bytes; x86 pages are 4096 bytes for standard sized pages -- large and huge pages don't exist in VAX) to determine if there's memory there, if the code is allowed to access that memory, and what physical address to translate the virtual address to. there's also a few extra bits in page table entries that are used by the OS to implement things like copy on write etc.
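a rough sketch of how that region split falls out of the address bits, in python. the region names and ranges are the ones listed above; the function names are made up for illustration, and this only classifies an address, it doesn't model the page tables or the per-region length checks.

```python
PAGE_SIZE = 512  # VAX page size in bytes (9 offset bits)

# the top two bits of a 32-bit virtual address pick the region
REGIONS = ["P0", "P1", "SYSTEM", "RESERVED"]


def vax_region(vaddr):
    """Return the name of the region a 32-bit virtual address falls in."""
    return REGIONS[(vaddr >> 30) & 0x3]


def vax_split(vaddr):
    """Split an address into (region, page number within region, byte offset)."""
    offset = vaddr & (PAGE_SIZE - 1)        # low 9 bits
    page = (vaddr & 0x3FFFFFFF) >> 9        # the 21 bits in between
    return vax_region(vaddr), page, offset


print(vax_split(0x00000200))   # ('P0', 1, 0)
print(vax_split(0x7FFFFFFF))   # ('P1', 2097151, 511)
```

each 1 GB region works out to 2^21 pages of 512 bytes, which is why the page number field is 21 bits wide.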

in the x86 world there's also segmentation but it's a lot more freeform and was originally designed to deal with the fact that the 8086 was a 16-bit machine with a 20-bit physical address space. 32-bit x86 still has segmentation but it was more intended for separating out address spaces by privilege level. 64-bit x86 doesn't allow you to have any segments with bases other than 0 and limits other than 0xFFFF[...]FFFF, but there's a special model-specific register (internal processor registers only accessible via kernel-exclusive read/write instructions) for setting one of the segmentation registers' base address on the fly to make thread-local storage really easy. page tables on x86 are also more or less global; you slap the physical address of your highest level page table pointer (you need a lot of tables for a modern virtual address space so it's practically a page table pointer pointer pointer pointer these days) into a control register and the processor invalidates the translation lookaside buffers and boom, your virtual address space mappings are refreshed.

paging on VAX gets a bit more complex sometimes because some I/O devices on VAX use the VAX MMU which means they're subject to paging, whereas on x86 the device bus always has direct access to the physical address space

# ? Jan 15, 2022 02:13

Neslepaks: Sep 3, 2003

incredible effortposting itt. 5.

# ? Jan 15, 2022 14:30

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

Neslepaks posted:

incredible effortposting itt. 5.

# ? Jan 15, 2022 15:37

An cruiscin lan: Mar 4, 2020

The_Franz posted:

hp did

https://www.youtube.com/watch?v=VLTh4uVJduI

you have to wonder who the target demographic for this was, as these were workstations that cost thousands (or 10s of thousands) of 1990s us dollars and were mainly sold to companies where everyone wore a tie. then again, it probably really stood out in a sea of the typical business-to-business sales presentations that consisted of some stuffed shirt reading a checklist of product points in a monotone voice

HAha this ownes!

# ? Jan 15, 2022 23:36

Kazinsal: Dec 13, 2011

Neslepaks posted:

incredible effortposting itt. 5.

weird low level computer poo poo is one of the few things I can just blather on about for hours. I'll probably get baked and write something about the insanity of x86 memory management later tonight

# ? Jan 16, 2022 00:51

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

Kazinsal posted:

weird low level computer poo poo is one of the few things I can just blather on about for hours. I'll probably get baked and write something about the insanity of x86 memory management later tonight

# ? Jan 16, 2022 00:54

echinopsis: Apr 13, 2004; by Fluffdaddy

Kazinsal posted:

weird low level computer poo poo is one of the few things I can just blather on about for hours. I'll probably get baked and write something about the insanity of x86 memory management later tonight

please do

idk why but I love it. make me wish I worked at that level.

# ? Jan 16, 2022 01:01

akadajet: Sep 14, 2003

kitten smoothie posted:

wish I could find the Huey Lewis knockoff song called "Power of Sun" but I think it must have been memory holed

https://www.youtube.com/watch?v=Tzeu-gqMy0A

first result I got

# ? Jan 16, 2022 04:51

Kazinsal: Dec 13, 2011

okay, time to talk about x86 memory management. this post is going to get a bit out of hand so I'm breaking it into sections.

way back before the 8086 the 8080 family had a 16-bit address space but no internal mechanism to manage it, so if you wanted more than 64KiB of RAM/ROM/MMIO you needed an external chip that you could frob to switch what a certain range of the address bus actually pointed at. this sucked but a lot of 8-bit CPU families did this and the chips are fairly simple. you hook them up to say a 16 KiB window and expose a couple I/O ports to the CPU so software can switch that 16 KiB bank to a different slice of whatever RAM chip is hooked up to it through the bank selector chip.
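a toy model of that bank switching setup, in python. everything here is an assumption for illustration: the window address, the number of banks, and the class itself are made up, and real bank selector chips differ in the details. the point is just that one CPU address can land on different physical bytes depending on what was last written to the bank select port.

```python
BANK_SIZE = 16 * 1024      # the 16 KiB window from the example above
WINDOW_BASE = 0x8000       # assumption: where the window sits in the CPU's 64 KiB space


class BankedRam:
    """Toy model of a RAM chip seen through one bank-switched window."""

    def __init__(self, num_banks):
        self.num_banks = num_banks
        self.ram = bytearray(num_banks * BANK_SIZE)
        self.bank = 0  # contents of the (hypothetical) bank select port

    def select_bank(self, bank):
        """What an I/O write to the bank select port would do."""
        if not 0 <= bank < self.num_banks:
            raise ValueError("no such bank")
        self.bank = bank

    def _index(self, cpu_addr):
        if not WINDOW_BASE <= cpu_addr < WINDOW_BASE + BANK_SIZE:
            raise ValueError("address is outside the banked window")
        return self.bank * BANK_SIZE + (cpu_addr - WINDOW_BASE)

    def read(self, cpu_addr):
        return self.ram[self._index(cpu_addr)]

    def write(self, cpu_addr, value):
        self.ram[self._index(cpu_addr)] = value & 0xFF


ram = BankedRam(num_banks=8)   # 128 KiB of RAM behind a 16 KiB window
ram.write(0x8000, 0x11)        # lands in bank 0
ram.select_bank(3)
ram.write(0x8000, 0x33)        # same CPU address, different physical byte
ram.select_bank(0)
print(hex(ram.read(0x8000)))   # 0x11
```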

so when Intel designed the 8086 they realized how badly this sucked and even though they were still making a 16-bit CPU they stuck 4 more address lines onto it so they could basically do a sort of pseudo bank switching system in the standard address decoding logic. the 8086 has four 16-bit segment registers that are used in the address decoding logic to create a 20-bit address from a 16-bit segment and a 16-bit offset (usually formatted in documentation as eg. 1234:CDEF, where the word before the colon is the segment and the word after it is the offset). the logic is pretty simple: the segment word is shifted four bits to the left and the offset is added to it. in the above example we'd get this:

code:

segment << 4 | 12340
    + offset |  CDEF
------------ | -----
   = address | 1F12F

this is super advantageous over the bank switching chip thing for two reasons: one, you get full access to the whole 20-bit address space of the system at all times, and two, your code can always assume it starts at 0000 and goes to however much memory it needs in a 16-bit space since the OS can just give you a block of memory and say "your segment is 0x1234". all your indexes and offsets are from 0, no matter where in the 20-bit address space you actually are! and you get separate selectors for code (CS), data (DS and ES), and stack (SS), and segment override prefixes so you can access up to four 64K segments at any time without reloading the segment registers
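the same shift-and-add in python, as a minimal sketch. the function name is made up; the mask models the 8086 having only 20 address lines, so sums past 1 MiB wrap around to zero.

```python
def real_mode_address(segment, offset):
    """20-bit physical address from a 16-bit segment and a 16-bit offset."""
    # shift the segment left 4 bits, add the offset, keep 20 bits
    return ((segment << 4) + offset) & 0xFFFFF


print(hex(real_mode_address(0x1234, 0xCDEF)))   # 0x1f12f
print(hex(real_mode_address(0x1F12, 0x000F)))   # 0x1f12f
print(hex(real_mode_address(0xFFFF, 0x0010)))   # 0x0
```

the second line shows a side effect of the scheme: lots of different segment:offset pairs alias the same physical byte, since segments overlap every 16 bytes.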

now, there's no memory protection on the 8086 so any code can run over any memory at any time, and there's no concept of privilege levels so any code can use any instructions including mov Sreg, r/m16 to change what's in your segment registers, which is one of the main reasons why no one really ever cared too much about multi-user multitasking OSes for the 8086. the other main issue is that the most common 8086 out there was actually the 8088, which only had an 8-bit data bus instead of a 16-bit one so full word memory and I/O accesses needed two cycles instead of one, but that's something for a different post or a sidebar

sidebar: the 8086 was also interesting compared to a lot of other general-purpose 16-bit CPUs in that it carried over the separate I/O port space from the 8080/Z80. it uses its own in and out instructions and has a 16-bit address space of its own. it was relatively fast on real 8086es because it only took a couple clock cycles but it's still about the same speed on modern machines because the I/O port space is emulated in System Management Mode these days and anything it's accessing is an ISA device which is also either emulated by SMM, by a SuperIO chip, or by an actual 4.77 MHz ISA device going through half a dozen different bridge chips

so now we come to the 80286. it was still a 16-bit CPU but it had a 24-bit address bus and added a new CPU mode called protected mode that changed how memory management worked significantly. when you're in protected mode, instead of pointing directly to a segment base address, you point at an index into the Global Descriptor Table, which is an array of 48-bit fields aligned to 64 bits that describe the 24-bit base address for the segment (the value you write into a segment register to pick a GDT entry is called a selector), the segment limit (up to 0xFFFF; it's still a 16-bit machine at this point), and a flags field for things like minimum privilege level and some things mostly related to hardware task switching. the only problem with protected mode was that on the 286, you couldn't *leave* protected mode without resetting the CPU, and the BIOS's built-in driver routines were all only available in real address mode (the retronym for the 8086's operating mode). in order to go back you'd have to set the reset vector to your 16-bit real mode trampoline code and then intentionally catastrophically fault the CPU (by, say, setting the GDT or the IDT -- Interrupt Descriptor Table; same idea as the GDT but for interrupt vectors -- to a null pointer and causing an interrupt/exception, which would cause a recursive exception and "triple fault" the CPU, resetting it).

needless to say the 286 protected mode didn't see a lot of use on common systems like DOS and early Windows. Windows/286 2.x existed but no one really wrote programs for it because RAM was still pretty expensive and by that point there were bank switching systems for using more than 640KiB in real mode so you could sorta use big chunks of RAM 64K at a time in DOS. when the PC/AT debuted with an 80286, most of the computing world treated it as just a faster 8086 (as it ran at either 6 or 8 MHz depending on what options you bought from IBM instead of 4.77 MHz).

however this did technically give the 286 the capability for multitasking with process isolation! the GDT could technically have up to 8192 entries if you really wanted to, and because the 8-byte-aligned nature of GDT entries gave Intel 3 bits at the bottom of the selector index to work with, they assigned two of those bits to the "requested privilege level" (RPL) -- if your RPL was numerically higher (further from kernel mode) than the privilege level the GDT entry demanded, the CPU would generate a protection fault -- and one to "use Local Descriptor Table". you could then give each process its own LDT, load it into the LDT pointer register before switching to that process, and then the process could use that LDT for all of its selectors if it wanted to instead of using global selectors. this meant you could assign multiple chunks of memory to a process without needing to perform a costly system call back to kernel mode and rewrite the contents of a GDT entry every time a process wanted to access a different 64K chunk.
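a quick sketch of how a 16-bit selector breaks down, in python. the helper names are made up. the layout is 13 bits of table index, one table indicator bit (GDT vs LDT), and two privilege bits at the bottom, which Intel's manuals call the requested privilege level (RPL).

```python
def decode_selector(selector):
    """Split a 16-bit selector into (table index, which table, privilege bits)."""
    index = selector >> 3                       # 13 bits, so up to 8192 entries
    table = "LDT" if selector & 0x4 else "GDT"  # bit 2: table indicator
    rpl = selector & 0x3                        # bits 0-1: 0 is kernel, 3 is user
    return index, table, rpl


def make_selector(index, use_ldt=False, rpl=0):
    """Build a selector value to load into a segment register."""
    return (index << 3) | (0x4 if use_ldt else 0) | (rpl & 0x3)


print(decode_selector(0x0008))   # (1, 'GDT', 0)
print(decode_selector(0x001B))   # (3, 'GDT', 3)
print(decode_selector(0x000F))   # (1, 'LDT', 3)
```

since descriptors are 8 bytes each, masking off the low three bits of a selector also gives you the byte offset of the descriptor within its table.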

a couple years later the 80386 came out. this was a big fuckin deal as it was a full fat 32-bit machine with a 32-bit physical address bus, a 32-bit virtual address size, and a 32-bit data bus (16-bit on the 80386SX). to fit in this 32-bit setup, Intel used the 16 bits of padding in GDT entries for a 16/32-bit flag, shoved in another 8 bits to make the base address 32-bit instead of 24-bit, and another 4 bits for the segment limit. why 4 bits? well, when you set the granularity flag (which also lives in those new bits), the 20-bit limit gets shifted 12 bits to the left and ORed with 0xFFF, so it can cover the whole 4 GiB in 4 KiB steps. this seems like a bit of a granularity issue but it's because the 386 was really designed to be used less as a machine with segmentation-based memory management and more like a real machine with full paging capabilities. the 386 did add two more segment registers (FS and GS) however, which ended up being used to easily implement thread-local storage when multi-threaded application programming became common.
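the limit scaling as a python sketch. the function name is made up; the behaviour modelled is that the descriptor stores a 20-bit limit, and when its granularity bit is set the limit counts 4 KiB units: shift left 12 and fill the low 12 bits with ones (an OR with 0xFFF).

```python
def effective_limit(raw_limit, granular):
    """Highest valid offset in a segment, given the 20-bit limit field."""
    raw_limit &= 0xFFFFF                    # the field is only 20 bits wide
    if granular:
        return (raw_limit << 12) | 0xFFF    # 4 KiB units, last byte of the last unit
    return raw_limit                        # byte units, tops out at 1 MiB - 1


print(hex(effective_limit(0xFFFFF, granular=True)))    # 0xffffffff
print(hex(effective_limit(0xFFFFF, granular=False)))   # 0xfffff
print(hex(effective_limit(0x00000, granular=True)))    # 0xfff
```

the last line is the granularity issue in a nutshell: with the bit set, the smallest segment you can describe is one 4 KiB page.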

x86 has a base page size of 4096 bytes, and the 80386 had a two-level page table system wherein your top-level page table pointer (held in CR3 -- Control Register 3) would be a "page directory" composed of pointers to first-level page tables and flags for how they're allowed to be accessed (minimum privilege level, read/write or read only, cache enable/disable, and a few other flags that were later added), and then each page table was composed of pointers to the upper 20 bits of physical addresses to translate the matching virtual address to ORed with another set of the same general flags, just this time for that specific page. the granularity on this is awesome and it's part of how x86 became a real 32-bit workhorse in the late 80s. the formula for determining how to translate a virtual address to a physical address is fairly simple: take the top 10 bits of the virtual address; this is your index into the page directory. dereference that to get your page table (or throw a page fault if the access flags don't match your current system state). take the next 10 bits of the virtual address; this is your index into the page table. dereference that to get the physical address to replace the top 20 bits with (or throw a page fault if the access flags don't match your current system state).
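here's that walk as a python sketch. big assumptions: physical memory is faked as a dict from a table's physical address to a list of 1024 entries, only the present bit is checked (no privilege or read/write checks), and all names are made up for illustration.

```python
PRESENT = 0x001   # bit 0 of a directory or table entry


class PageFault(Exception):
    pass


def translate(memory, cr3, vaddr):
    """Walk a two-level 386-style page table and return the physical address."""
    dir_index = (vaddr >> 22) & 0x3FF     # top 10 bits
    table_index = (vaddr >> 12) & 0x3FF   # next 10 bits
    offset = vaddr & 0xFFF                # low 12 bits pass through untouched

    pde = memory[cr3][dir_index]
    if not pde & PRESENT:
        raise PageFault(hex(vaddr))
    pte = memory[pde & 0xFFFFF000][table_index]
    if not pte & PRESENT:
        raise PageFault(hex(vaddr))
    return (pte & 0xFFFFF000) | offset


# one page directory at 0x1000, one page table at 0x2000
page_table = [0] * 1024
page_table[0x345] = 0x00ABC000 | PRESENT
page_dir = [0] * 1024
page_dir[0x012] = 0x00002000 | PRESENT
memory = {0x1000: page_dir, 0x2000: page_table}

vaddr = (0x012 << 22) | (0x345 << 12) | 0x678
print(hex(vaddr), "->", hex(translate(memory, 0x1000, vaddr)))
```

the entries keep the frame address in the upper 20 bits and flags in the lower 12, which is why both the address and the flags can be pulled out with a mask.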

paging was great for simplifying memory protection and mapping because instead of each process needing a bunch of selectors, as far as they were concerned they owned the whole 32-bit address space except where the kernel was (and they couldn't read/write kernel space). every process could use the same ring 3 (privilege level) selectors and the OS would just slap the process's page directory address into CR3 before a context switch. and of course, you can just grab whatever free slab of memory you find first when you need to allocate memory to a process because with virtual address translation the actual physical location of a page doesn't matter to the process at all. Intel finally had a real multitasking x86 processor and it was still backwards compatible with original 8086 real mode code as well as 286 16-bit protected mode if you really wanted it (with the added bonus of allowing you to switch between all these modes at will instead of needing to reset the CPU).

sidebar: yes, you could mix paging with arbitrary segment bases and limits. pretty much no one ever did this and while I don't have the 80386 programmer's manual handy I'm pretty sure it says that it's a bad idea and the 386 can't efficiently cache translations if you gently caress around with weird segment bases and paging at the same time. the only one I can think of off the top of my head is OpenBSD/i386, which uses the segment limit on code selectors to implement W^X on pre-64-bit machines.

along with paging a couple extra instructions for it were added for dealing with cache control registers and the like, because the 386 cached a bunch more stuff than the 286 did. if you modified a page table entry you'd need to reload the CR3 register with the same pointer (eg. mov eax, cr3; mov cr3, eax) and the 386 would invalidate the whole translation lookaside buffer. later, the 486 added an instruction to invalidate a single page; executing an invlpg address instruction would remove the cached translation for the page that address is in.
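a toy model of the difference between the two invalidation styles, in python. this is nothing like a real TLB (no capacity limit, no replacement policy) and the class is made up; it only shows that reloading CR3 drops every cached translation while invlpg drops just the one page.

```python
class ToyTLB:
    """Cache of virtual page number -> physical frame address."""

    def __init__(self):
        self.entries = {}

    def fill(self, vaddr, frame):
        self.entries[vaddr >> 12] = frame

    def lookup(self, vaddr):
        return self.entries.get(vaddr >> 12)   # None means a TLB miss

    def reload_cr3(self):
        """386 style: writing CR3 throws away every cached translation."""
        self.entries.clear()

    def invlpg(self, vaddr):
        """486 style: drop only the page containing vaddr."""
        self.entries.pop(vaddr >> 12, None)


tlb = ToyTLB()
tlb.fill(0x1000, 0xA000)
tlb.fill(0x2000, 0xB000)
tlb.invlpg(0x1234)          # any address inside the page will do
print(tlb.lookup(0x1000))   # None
print(tlb.lookup(0x2FFF))   # 45056 (0xB000), untouched
```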

pushing onwards a bit past the 486 and Pentium we reach the Pentium Pro, which is our next stop here because it added a new awesome thing: Physical Address Extension (PAE). Intel looked at the 286's address bus compared to the 8086's, said "let's do that again", and slapped on another four bits. now, the problem here is that the page entries were already chock full of bits so what Intel did was add a flag to a control register to enable PAE mode, in which page table and directory entries became 64 bits wide instead of 32 bits wide to fit the extra bits and some additional flags for the future, and the CR3 register now pointed at a four-entry Page Directory Pointer Table where each entry would point to a page directory that controlled 1 GiB of the 32-bit virtual address space.
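how a 32-bit virtual address gets carved up in PAE mode, sketched in python (function name made up). with 8-byte entries a 4 KiB table only holds 512 of them, so each level indexes with 9 bits instead of 10, and the 2 bits left over pick one of the four PDPT entries.

```python
ENTRIES_PER_TABLE = 4096 // 8   # 512 eight-byte entries per 4 KiB table


def pae_indices(vaddr):
    """Split a 32-bit address into (PDPT index, directory index, table index, offset)."""
    pdpt = (vaddr >> 30) & 0x3      # 2 bits: which 1 GiB quarter
    pd = (vaddr >> 21) & 0x1FF      # 9 bits
    pt = (vaddr >> 12) & 0x1FF      # 9 bits
    offset = vaddr & 0xFFF          # 12 bits
    return pdpt, pd, pt, offset


print(pae_indices(0xC0201234))      # (3, 1, 1, 564)
print(2 ** 36 // 2 ** 30, "GiB")    # 64 GiB of physical space with 36 address bits
```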

if you're having a "wait, this reminds me of the VAX post" moment, welcome to the magic of x86. it was basically designed stealing all the best parts of other architectures and steadfastly refusing to throw away any of your old legacy cruft.

another cool thing that was added on the original Pentium but wasn't used too often at the time was the page size flag. in a page directory entry, you could flip the page size flag (formerly a must-be-zero reserved bit) from 0 to 1 and instead of pointing at a page table, that page directory entry would be a direct mapping to a 4 MiB virtual slab of the address space (2 MiB in PAE mode), effectively making the 10 bits (9 in PAE mode) normally used as an index into a page table just part of the virtual address that wasn't translated by the memory management unit. now you could either slam a huge chunk of RAM into a process's address space or map a block as "not present" and use the fault it generated as a way to signal the kernel to do some other operation like disk I/O or whatnot.
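a sketch of what translation through a large-page directory entry looks like, in python. assumptions: the helper name is made up, the page size flag is bit 7 of the directory entry, and the entry has no flag bits set above the frame address (so no NX bit and none of the later extensions for wider physical addresses).

```python
PAGE_SIZE_FLAG = 0x80   # bit 7 of a page directory entry


def large_page_translate(pde, vaddr, pae=False):
    """Translate vaddr through a directory entry that maps a large page directly."""
    if not pde & PAGE_SIZE_FLAG:
        raise ValueError("entry points at a page table, not a large page")
    offset_bits = 21 if pae else 22          # 2 MiB or 4 MiB worth of offset
    offset_mask = (1 << offset_bits) - 1
    base = pde & ~offset_mask                # drop the flag bits below the frame address
    return base | (vaddr & offset_mask)      # the low bits are not translated at all


pde = 0x00C00000 | PAGE_SIZE_FLAG | 0x1      # large page at physical 0xC00000, present
print(hex(large_page_translate(pde, 0x12345678)))             # 0xf45678
print(hex(large_page_translate(pde, 0x12345678, pae=True)))   # 0xd45678
```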

shoving on ahead to 64-bit x86 we now have 64 bits of virtual address space as an extension of PAE mode, right?

kinda.

we actually have 48 bits. in 64-bit pointers. when you enter 64-bit long mode, CR3, which was extended to being 64 bits wide, now points at the Page Map Level 4 (PML4), which has 512 (9 bits' worth) entries, each of which points at a page directory pointer table, which has 512 entries, each of which points at a page directory, which has 512 entries, each of which points at a page table, which has 512 entries, each of which points at a page. in order to make the address "canonical" though, the upper 16 bits must be sign-extended from bit 47. so, your canonical addresses suddenly jump from 0x00007FFFFFFFFFFF to 0xFFFF800000000000. this is fine, whatever, you're basically splitting a 256 TiB address space into two 128 TiB address spaces. this is still an enormous amount of address space.

also, at this point, segmentation is dead. hooray! you are literally not allowed to use segments with bases other than 0 and limits other than "all of it" in 64-bit mode. but to make thread-local storage easier, AMD and Intel agreed that it would be a good idea to have hidden processor registers called FSBASE and GSBASE that... readded a base address to any memory accesses made against the FS and GS selectors.

a lot of OSes use that canonical-address split for convenient separation of user and kernel space. user space gets the bottom half, and the kernel gets the upper half. this is, even as of 2022, still enough on all but the most ridiculously big hypercomputing systems to map every byte of physical RAM and every byte of memory-mapped I/O space into kernel space if you wanted to (and most do! you've got all that space and it's advantageous to be able to easily access any physical address by just slapping a prefix onto it to make it a virtual address). so, naturally, Intel went a step further a couple years ago with the Ice Lake processor generation and added a sub-mode of long mode called 5-level paging mode wherein CR3 points at the Page Map Level 5 (PML5), which has 512 entries, each of which points at a PML4. the rest of the virtual address translation works the same as in normal 4-level paging long mode, but this does change the way canonical addresses are generated so the kernel needs to be aware of how to work in 5-level paging mode. I think most kernels do support it at this point because it doesn't require too much extra work to implement, but depending on the underlying kernel space memory mapping implementation it may require a kernel linked to a different address -- this is how Windows handles it; instead of ntoskrnl.exe you'd see your kernel image being ntkrla57.exe.
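some quick arithmetic on what the extra level buys you, as a python sketch with made-up helper names. each paging level adds 9 bits on top of the 12-bit page offset, and the upper half starts wherever sign extension from the top implemented bit kicks in.

```python
def virtual_address_bits(levels):
    """12 offset bits plus 9 index bits per paging level."""
    return 12 + 9 * levels


def first_upper_half_address(levels):
    """Lowest canonical address in the upper (kernel) half."""
    bits = virtual_address_bits(levels)
    return (0xFFFFFFFFFFFFFFFF << (bits - 1)) & 0xFFFFFFFFFFFFFFFF


def half_size_tib(levels):
    """Size of each half of the address space, in TiB."""
    return 2 ** (virtual_address_bits(levels) - 1) // 2 ** 40


for levels in (4, 5):
    print(levels, virtual_address_bits(levels),
          hex(first_upper_half_address(levels)), half_size_tib(levels))
# 4 48 0xffff800000000000 128
# 5 57 0xff00000000000000 65536
```

so 5-level paging goes from two 128 TiB halves to two 64 PiB halves, and the start of the upper half moves, which is why a kernel has to know which mode it's in.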

later 64-bit x86 microarchitectures added huge pages (the cowards at intel refuse to call it this, but everyone else including their competitors with similar implementations do), wherein a PDPT entry points at a 1 GiB block of memory like the 2 MiB large page. this simplifies and speeds up address translation when you know you can just pre-allocate an enormous block of RAM to a process. database servers and the like love huge pages and heavily randomly-accessed databases can genuinely sometimes be more performant with huge pages because each gigabyte of RAM only needs one TLB entry, so the TLB isn't constantly being rewritten and flushed with all the accesses. there's also a security/caching feature called process-context identifiers (PCID) that lets the kernel tag a paging structure with a 12-bit value of its choice, which will propagate to the TLB entries the processor creates, allowing it to only do lookups based on the PCID of the current paging structure. this both speeds up lookups since the TLB is built on content-addressable memory and the PCID just becomes part of the TLB entry address, and is good for security because the processor isn't reading through other threads' TLB entries to get to the right one.
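back-of-the-envelope numbers for why databases like huge pages, as a python sketch. the 64 GiB working set is an arbitrary example and the function name is made up; real TLBs have separate, much smaller entry counts per page size, so this only shows how many translations would be needed to cover the memory at each page size.

```python
KIB, MIB, GIB = 2 ** 10, 2 ** 20, 2 ** 30


def translations_needed(working_set_bytes, page_size):
    """Number of distinct page translations needed to cover a working set."""
    return -(-working_set_bytes // page_size)   # ceiling division


working_set = 64 * GIB   # hypothetical database buffer pool
for name, size in (("4 KiB", 4 * KIB), ("2 MiB", 2 * MIB), ("1 GiB", GIB)):
    print(name, translations_needed(working_set, size))
# 4 KiB 16777216
# 2 MiB 32768
# 1 GiB 64
```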

there's some deeper stuff that's pretty much irrelevant these days that I could talk about in more detail like hardware task switching and call gates but those haven't been used since around the 486 era on account of them becoming gradually too slow to be worth it compared to the kernel saving and loading thread/process contexts in software. there's also I/O permission bitmaps that live in the hardware task switching system's Task State Segment (each process in a hardware task switching OS had its own TSS that its context was stored in) and allow you to let specific ISA I/O ports be used by processes in user mode. the TSS does still exist (you need to have one TSS for the processor to dump user thread context in when doing a user/kernel mode transition, but the kernel then just copies the stuff that matters back to and from its own thread structure). I know there's also a few additional things like nested paging for virtualization but I honestly don't know much about the VMX instructions and their associated stuff used for building hypervisors so I won't butcher all that.

anyways there's my giant effortpost on x86 memory management. if I think of anything else re: the x86 that could be fun and/or I could link back to DEC hardware (sorry for fuckin up your thread OP) I'll put together a batch of thoughts and post 'em. but for now I'm going to stop typing and realize that this kind of poo poo is exactly the reason I'm single

# ? Jan 16, 2022 10:27

You Am I: May 20, 2001; Me @ your poasting

I wonder what Scott McNealy is up to these days.

I remember reading in a book about the funny/off the wall side of the early days of Silicon Valley that one April Fool's Day prank the Sun workers pulled on Scott was to set up his desk in a lift. They managed to get both power and network working fine in there.

# ? Jan 16, 2022 10:40

Tankakern: Jul 25, 2007

Kazinsal posted:

okay, time to talk about x86 memory management. this post is going to get a bit out of hand so I'm breaking it into sections.

surely you mean segments

:haw:

# ? Jan 16, 2022 10:52

Kazinsal: Dec 13, 2011

Tankakern posted:

surely you mean segments

gently caress

# ? Jan 16, 2022 10:58

Tankakern: Jul 25, 2007

extremely good post though!

there were some talks not that long ago of fsbase / gsbase support for something or other in the linux kernel, but i cant remember what it was

# ? Jan 16, 2022 11:38

Tankakern: Jul 25, 2007

A possible end to the FSGSBASE saga

# ? Jan 16, 2022 11:41

Cybernetic Vermin: Apr 18, 2005

getting that fixed to enable the one real usecase, sgx, just in time for the deprecation of sgx.

# ? Jan 16, 2022 13:30

FalseNegative: Jul 24, 2007; 2>/dev/null

This thread is incredible! Thank you for all the posts.

# ? Jan 16, 2022 15:31

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

hell yes

# ? Jan 16, 2022 21:05

epitaph: Dec 31, 2008

lost in the shuffle: virtual 8086 mode which was introduced with the 386. because people didn't want to rewrite their real mode apps intel introduced a mode with similar addressing only translated by the mmu. this enabled things like ems (remember emm386.sys?) where a small window into higher memory ("page frame") was introduced and could be shifted as needed. allegedly even bill gates called it a terrible hack, but oh well.

# ? Jan 16, 2022 21:10

Captain Foo: May 11, 2004; we vibin'
we slidin'
we breathin'
we dyin'

epitaph posted:

lost in the shuffle: virtual 8086 mode which was introduced with the 386. because people didn't want to rewrite their real mode apps intel introduced a mode with similar addressing only translated by the mmu. this enabled things like ems (remember emm386.sys?) where a small window into higher memory ("page frame") was introduced and could be shifted as needed. allegedly even bill gates called it a terrible hack, but oh well.

remember having to care about expanded vs extended memory

# ? Jan 16, 2022 21:12

git apologist: Jun 4, 2003

Kazinsal posted:

okay, time to talk about x86 memory management. this post is going to get a bit out of hand so I'm breaking it into sections.

way back before the 8086 the 8080 family had a 16-bit address space but no internal mechanism to manage it, so if you wanted more than 64KiB of RAM/ROM/MMIO you needed an external chip that you could frob to switch what a certain range of the address bus actually pointed at. this sucked but a lot of 8-bit CPU families did this and the chips are fairly simple. you hook them up to say a 16 KiB window and expose a couple I/O ports to the CPU so software can switch that 16 KiB bank to a different slice of whatever RAM chip is hooked up to it through the bank selector chip.
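
a toy model of that kind of bank switcher, purely as a sketch: the window address (0x8000), the 16 KiB bank size and the single bank-select register are assumptions for illustration, not any specific chip.

```python
WINDOW_BASE = 0x8000        # hypothetical location of the 16 KiB window
BANK_SIZE = 16 * 1024

class BankedMemory:
    def __init__(self, banks):
        self.fixed = bytearray(64 * 1024)   # everything outside the window
        self.banks = [bytearray(BANK_SIZE) for _ in range(banks)]
        self.selected = 0                   # last value written to the bank-select port

    def select_bank(self, n):
        """What software does by writing to the bank selector's I/O port."""
        self.selected = n % len(self.banks)

    def _resolve(self, addr):
        addr &= 0xFFFF                      # the CPU only has 16 address bits
        if WINDOW_BASE <= addr < WINDOW_BASE + BANK_SIZE:
            return self.banks[self.selected], addr - WINDOW_BASE
        return self.fixed, addr

    def read(self, addr):
        mem, off = self._resolve(addr)
        return mem[off]

    def write(self, addr, value):
        mem, off = self._resolve(addr)
        mem[off] = value & 0xFF
```

the same CPU address reads back different bytes depending on which bank was selected last, which is exactly the part that sucks to program around.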

so when Intel designed the 8086 they realized how badly this sucked and even though they were still making a 16-bit CPU they stuck 4 more address lines onto it so they could basically do a sort of pseudo bank switching system in the standard address decoding logic. the 8086 has four 16-bit segment registers that are used in the address decoding logic to create a 20-bit address from a 16-bit segment and a 16-bit offset (usually formatted in documentation as eg. 1234:CDEF, where the word before the colon is the segment and the word after it is the offset). the logic is pretty simple: the segment word is shifted four bits to the left and the offset is added to it. in the above example we'd get this:
code:
segment << 4 | 12340
    + offset |  CDEF
------------ | -----
   = address | 1F12F
this is super advantageous over the bank switching chip thing for two reasons: one, you get full access to the whole 20-bit address space of the system at all times, and two, your code can always assume it starts at 0000 and goes to however much memory it needs in a 16-bit space since the OS can just give you a block of memory and say "your segment is 0x1234". all your indexes and offsets are from 0, no matter where in the 20-bit address space you actually are! and you get separate selectors for code (CS), data (DS and ES), and stack (SS), and segment override prefixes so you can access up to four 64K segments at any time without reloading the segment registers
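
the shift-and-add above as a small sketch. `real_mode_address` is just an illustrative name, and the final mask models the 8086's 20 address lines, where a sum past 1 MiB wraps around.

```python
def real_mode_address(segment, offset, address_bits=20):
    """Physical address for segment:offset on an 8086-style CPU."""
    linear = ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)
    return linear & ((1 << address_bits) - 1)   # wrap at the top of the address bus

if __name__ == "__main__":
    print(hex(real_mode_address(0x1234, 0xCDEF)))
```

one side effect worth noticing: lots of different segment:offset pairs land on the same byte, eg. 1234:CDEF and 1F12:000F.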

now, there's no memory protection on the 8086 so any code can run over any memory at any time, and there's no concept of privilege levels so any code can use any instructions including mov Sreg, r/m16 to change what's in your segment registers, which is one of the main reasons why no one really ever cared too much about multi-user multitasking OSes for the 8086. the other main issue is that the most common 8086 out there was actually the 8088, which only had an 8-bit data bus instead of a 16-bit one so full word memory and I/O accesses needed two cycles instead of one, but that's something for a different post or a sidebar

sidebar: the 8086 was also interesting compared to a lot of other general-purpose 16-bit CPUs in that it carried over the separate I/O port space from the 8080/Z80. it uses its own in and out instructions and has a 16-bit address space of its own. it was relatively fast on real 8086es because it only took a couple clock cycles but it's still about the same speed on modern machines because the I/O port space is emulated in System Management Mode these days and anything it's accessing is an ISA device which is also either emulated by SMM, by a SuperIO chip, or by an actual 4.77 MHz ISA device going through half a dozen different bridge chips

so now we come to the 80286. it was still a 16-bit CPU but it had a 24-bit address bus and added a new CPU mode called protected mode that changed how memory management worked significantly. when you're in protected mode, instead of pointing directly to a segment base address, you point at an index into the Global Descriptor Table, which is an array of 48-bit fields aligned to 64 bits that describe the 24-bit base address for the segment, the segment limit (up to 0xFFFF; it's still a 16-bit machine at this point), and a flags field for things like minimum privilege level and some things mostly related to hardware task switching. the value you write into a segment register to pick a GDT entry is called a selector. the only problem with protected mode was that on the 286, you couldn't *leave* protected mode without resetting the CPU, and the BIOS's built-in driver routines were all only available in real address mode (the retronym for the 8086's operating mode). in order to go back you'd have to set the reset vector to your 16-bit real mode trampoline code and then intentionally catastrophically fault the CPU (by, say, setting the GDT or the IDT -- Interrupt Descriptor Table; same idea as the GDT but for interrupt vectors -- to a null pointer and causing an interrupt/exception, which would cause a recursive exception and "triple fault" the CPU, resetting it).

needless to say the 286 protected mode didn't see a lot of use on common systems like DOS and early Windows. Windows/286 2.x existed but no one really wrote programs for it because RAM was still pretty expensive and by that point there were bank switching systems for using more than 640KiB in real mode so you could sorta use big chunks of RAM 64K at a time in DOS. when the PC/AT debuted with an 80286, most of the computing world treated it as just a faster 8086 (as it ran at either 6 or 8 MHz depending on what options you bought from IBM instead of 4.77 MHz).

however this did technically give the 286 the capability for multitasking with process isolation! the GDT could technically have up to 8192 entries if you really wanted to, and because the 8-byte-aligned nature of GDT entries gave Intel 3 bits at the bottom of the selector index to work with, they assigned two of those bits to the "requested privilege level" (RPL) -- if your RPL was numerically higher (further from kernel mode) than the privilege level stored in the GDT entry (its descriptor privilege level, or DPL), the CPU would generate a protection fault -- and one to "use Local Descriptor Table". you could then give each process its own LDT, load it into the LDT pointer register before switching to that process, and then the process could use that LDT for all of its selectors if it wanted to instead of using global selectors. this meant you could assign multiple chunks of memory to a process without needing to perform a costly system call back to kernel mode and rewrite the contents of a GDT entry every time a process wanted to access a different 64K chunk.
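
a quick sketch of how those 16 selector bits break down: 13 bits of table index, one table-indicator bit (0 for GDT, 1 for LDT) and a 2-bit privilege level. the function names are made up for illustration.

```python
def decode_selector(selector):
    """Split a 16-bit protected mode selector into its three fields."""
    return {
        "index": (selector >> 3) & 0x1FFF,   # which descriptor (up to 8192 of them)
        "use_ldt": bool(selector & 0x4),     # table indicator: 0 = GDT, 1 = LDT
        "rpl": selector & 0x3,               # privilege level, 0 = kernel .. 3 = user
    }

def encode_selector(index, use_ldt=False, rpl=0):
    return ((index & 0x1FFF) << 3) | (0x4 if use_ldt else 0) | (rpl & 0x3)

if __name__ == "__main__":
    print(decode_selector(0x1B))
```

since each descriptor is 8 bytes, masking off the low three bits of a selector gives you the descriptor's byte offset into its table for free.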

a couple years later the 80386 came out. this was a big fuckin deal as it was a full fat 32-bit machine with a 32-bit physical address bus, a 32-bit virtual address size, and a 32-bit data bus (16-bit on the 80386SX). to fit in this 32-bit setup, Intel used the 16 bits of padding in GDT entries for a 16/32-bit flag, a granularity flag, another 8 bits to make the base address 32-bit instead of 24-bit, and another 4 bits for the segment limit. why 4 bits? well, when you set the granularity flag, the 20-bit limit gets shifted 12 bits to the left and ORed with 0xFFF, so it counts in 4 KiB units and can cover the whole 4 GiB. this seems like a bit of a granularity issue but it's because the 386 was really designed to be used less as a machine with segmentation-based memory management and more like a real machine with full paging capabilities. the 386 did add two more segment registers (FS and GS) however, which ended up being used to easily implement thread-local storage when multi-threaded application programming became common.
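
a minimal sketch of how the 20-bit limit field turns into a byte limit on the 386, depending on the descriptor's granularity flag. `effective_limit` is an illustrative name, not anything from Intel's docs.

```python
def effective_limit(raw_limit, granular):
    """Last valid byte offset in a segment, given the 20-bit limit field."""
    raw_limit &= 0xFFFFF
    if granular:
        # limit counts 4 KiB units: shift left 12 and fill the low bits with ones
        return (raw_limit << 12) | 0xFFF
    return raw_limit

if __name__ == "__main__":
    print(hex(effective_limit(0xFFFFF, True)))
```

with the flag clear you get byte-precise segments of up to 1 MiB, and with it set you get up to 4 GiB in 4 KiB steps.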

x86 has a base page size of 4096 bytes, and the 80386 had a two-level page table system wherein your top-level page table pointer (held in CR3 -- Control Register 3) would be a "page directory" composed of pointers to first-level page tables and flags for how they're allowed to be accessed (minimum privilege level, read/write or read only, cache enable/disable, and a few other flags that were later added), and then each page table was composed of entries holding the upper 20 bits of the physical address to translate the matching virtual address to, ORed with another set of the same general flags, just this time for that specific page. the granularity on this is awesome and it's part of how x86 became a real 32-bit workhorse in the late 80s. the formula for determining how to translate a virtual address to a physical address is fairly simple: take the top 10 bits of the virtual address; this is your index into the page directory. dereference that to get your page table (or throw a page fault if the access flags don't match your current system state). take the next 10 bits of the virtual address; this is your index into the page table. dereference that to get the physical address to replace the top 20 bits with (or throw a page fault if the access flags don't match your current system state). the bottom 12 bits pass through untouched as the offset into the page.
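
the walk described above as a sketch. physical memory is modelled as a dict of table address to a 1024-entry list, which is an assumption purely for illustration, and only the present bit is checked; a real MMU also checks the user/supervisor and read/write bits.

```python
PRESENT = 0x1
FRAME_MASK = 0xFFFFF000          # upper 20 bits of an entry

class PageFault(Exception):
    pass

def split_virtual_address(va):
    """10 bits of directory index, 10 bits of table index, 12 bits of offset."""
    return (va >> 22) & 0x3FF, (va >> 12) & 0x3FF, va & 0xFFF

def translate(va, cr3, memory):
    dir_index, table_index, offset = split_virtual_address(va)
    pde = memory[cr3 & FRAME_MASK][dir_index]
    if not pde & PRESENT:
        raise PageFault(hex(va))
    pte = memory[pde & FRAME_MASK][table_index]
    if not pte & PRESENT:
        raise PageFault(hex(va))
    return (pte & FRAME_MASK) | offset
```

each table is 1024 entries of 4 bytes, so a page directory and every page table are themselves exactly one 4096-byte page.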

paging was great for simplifying memory protection and mapping because instead of each process needing a bunch of selectors, as far as they were concerned they owned the whole 32-bit address space except where the kernel was (and they couldn't read/write kernel space). every process could use the same ring 3 (privilege level) selectors and the OS would just slap the process's page directory address into CR3 before a context switch. and of course, you can just grab whatever free slab of memory you find first when you need to allocate memory to a process because with virtual address translation the actual physical location of a page doesn't matter to the process at all. Intel finally had a real multitasking x86 processor and it was still backwards compatible with original 8086 real mode code as well as 286 16-bit protected mode if you really wanted it (with the added bonus of allowing you to switch between all these modes at will instead of needing to reset the CPU).

sidebar: yes, you could mix paging with arbitrary segment bases and limits. pretty much no one ever did this and while I don't have the 80386 programmer's manual handy I'm pretty sure it says that it's a bad idea and the 386 can't efficiently cache translations if you gently caress around with weird segment bases and paging at the same time. the only one I can think of off the top of my head that does is OpenBSD/i386, which uses the segment limit on code selectors to implement W^X on pre-64-bit machines.

along with paging a couple extra instructions for it were added for dealing with cache control registers and the like, because the 386 cached a bunch more stuff than the 286 did. if you modified a page table entry you'd need to reload the CR3 register with the same pointer (eg. mov eax, cr3; mov cr3, eax) and the 386 would invalidate the whole translation lookaside buffer. later, the 486 added an instruction to invalidate a single page; executing an invlpg address instruction would remove the cached translation for the page that address is in.
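
a toy model of the two invalidation styles, just to show the difference in blast radius. the class and method names are made up, and a real TLB is a small fixed-size hardware cache, not a dict.

```python
class ToyTLB:
    def __init__(self):
        self.entries = {}                     # virtual page number -> physical frame number

    def fill(self, va, pa):
        self.entries[va >> 12] = pa >> 12

    def lookup(self, va):
        frame = self.entries.get(va >> 12)
        if frame is None:
            return None                       # miss: hardware would walk the page tables
        return (frame << 12) | (va & 0xFFF)

    def reload_cr3(self):
        """386 style: rewriting CR3 throws away every cached translation."""
        self.entries.clear()

    def invlpg(self, va):
        """486 style: drop only the translation for the page containing va."""
        self.entries.pop(va >> 12, None)
```

after changing one page table entry, `invlpg` costs you one future miss while a CR3 reload costs you one per page you touch next.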

pushing onwards a bit past the 486 and Pentium we reach the Pentium Pro, which is our next stop here because it added a new awesome thing: Physical Address Extension (PAE). Intel looked at the 286's address bus compared to the 8086's, said "let's do that again", and slapped on another four bits. now, the problem here is that the page entries were already chock full of bits so what Intel did was add a flag to a control register to enable PAE mode, in which page table and directory entries became 64 bits wide instead of 32 bits wide to fit the extra bits and some additional flags for the future, and the CR3 register now pointed at a four-entry Page Directory Pointer Table where each entry would point to a page directory that controlled 1 GiB of the 32-bit virtual address space.
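
a sketch of how a 32-bit virtual address gets carved up in PAE mode. since entries are 8 bytes now, a 4096-byte table only holds 512 of them, so the split goes from 10/10/12 to 2/9/9/12. `split_pae` is an illustrative name.

```python
ENTRY_SIZE = 8
ENTRIES_PER_TABLE = 4096 // ENTRY_SIZE        # 512, i.e. 9 bits of index

def split_pae(va):
    """Return (PDPT index, page directory index, page table index, offset)."""
    return (
        (va >> 30) & 0x3,      # 2 bits: one of four page directories
        (va >> 21) & 0x1FF,    # 9 bits
        (va >> 12) & 0x1FF,    # 9 bits
        va & 0xFFF,            # 12 bits
    )

if __name__ == "__main__":
    print(split_pae(0xC0201004))
```

each PDPT entry covers 2**30 bytes, which is where the 1 GiB per page directory comes from.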

if you're having a "wait, this reminds me of the VAX post" moment, welcome to the magic of x86. it was basically designed stealing all the best parts of other architectures and steadfastly refusing to throw away any of your old legacy cruft.

another cool thing that was added on the original Pentium but wasn't used too often at the time was the page size flag. in a page directory entry, you could flip the page size flag (formerly a must-be-zero reserved bit) from 0 to 1 and instead of pointing at a page table, that page directory entry would be a direct mapping to a 4 MiB virtual slab of the address space (2 MiB in PAE mode), effectively making the 10 bits (9 in PAE mode) normally used as an index into a page table just part of the virtual address that wasn't translated by the memory management unit. now you could either slam a huge chunk of RAM into a process's address space or map a block as "not present" and use the fault it generated as a way to signal the kernel to do some other operation like disk I/O or whatnot.
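
a sketch of the non-PAE 4 MiB case. the page size flag is bit 7 of the page directory entry, and with it set the low 22 bits of the virtual address pass straight through as the offset. `translate_4mib` is a made-up name and this ignores the later extensions that stuff extra physical address bits into the entry.

```python
PRESENT = 0x01
PAGE_SIZE_FLAG = 0x80                # bit 7 of a page directory entry

def translate_4mib(va, pde):
    """Translate through a page directory entry that maps a 4 MiB page directly."""
    if not (pde & PRESENT and pde & PAGE_SIZE_FLAG):
        raise ValueError("not a present 4 MiB mapping")
    base = pde & 0xFFC00000          # top 10 bits: the 4 MiB-aligned physical base
    offset = va & 0x003FFFFF         # low 22 bits: untranslated
    return base | offset

if __name__ == "__main__":
    print(hex(translate_4mib(0x00412345, 0x00800000 | PAGE_SIZE_FLAG | PRESENT)))
```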

shoving on ahead to 64-bit x86 we now have 64 bits of virtual address space as an extension of PAE mode, right?

kinda.

we actually have 48 bits. in 64-bit pointers. when you enter 64-bit long mode, CR3, which was extended to being 64 bits wide, now points at the Page Map Level 4 (PML4), which has 512 (9 bits' worth) entries, each of which points at a page directory pointer table, which has 512 entries, each of which points at a page directory, which has 512 entries, each of which points at a page table, which has 512 entries, each of which points at a page. in order to make the address "canonical" though, the upper 16 bits must be sign-extended from bit 47. so, your canonical addresses suddenly jump from 0x00007FFFFFFFFFFF to 0xFFFF800000000000. this is fine, whatever, you're basically splitting a 256 TiB address space into two 128 TiB address spaces. this is still an enormous amount of address space.
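
a sketch of both halves of that: the 9/9/9/9/12 index split and the canonical check. function names are illustrative only.

```python
def split_long_mode(va):
    """Return (PML4, PDPT, page directory, page table) indexes plus the page offset."""
    return (
        (va >> 39) & 0x1FF,
        (va >> 30) & 0x1FF,
        (va >> 21) & 0x1FF,
        (va >> 12) & 0x1FF,
        va & 0xFFF,
    )

def is_canonical(va, bits=48):
    """True if every bit above bit (bits - 1) is a copy of bit (bits - 1)."""
    top = (va & 0xFFFFFFFFFFFFFFFF) >> (bits - 1)
    all_ones = (1 << (64 - bits + 1)) - 1
    return top == 0 or top == all_ones

if __name__ == "__main__":
    print(is_canonical(0x0000800000000000))
```

the first address of the upper half lands on PML4 entry 256, so the two halves split the top-level table right down the middle.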

also, at this point, segmentation is dead. hooray! you are literally not allowed to use segments with bases other than 0 and limits other than "all of it" in 64-bit mode. but to make thread-local storage easier, AMD and Intel agreed that it would be a good idea to have hidden processor registers called FSBASE and GSBASE that... re-added a base address to any memory accesses made against the FS and GS selectors.

a lot of OSes use the two canonical halves for convenient separation of user and kernel space. user space gets the bottom half, and the kernel gets the upper half. this is, even as of 2022, still enough on all but the most ridiculously big hypercomputing systems to map every byte of physical RAM and every byte of memory-mapped I/O space into kernel space if you wanted to (and most do! you've got all that space and it's advantageous to be able to easily access any physical address by just slapping a prefix onto it to make it a virtual address). so, naturally, Intel went a step further a couple years ago with the Ice Lake processor generation and added a sub-mode of long mode called 5-level paging mode wherein CR3 points at the Page Map Level 5 (PML5), which has 512 entries, each of which points at a PML4. the rest of the virtual address translation works the same as in normal 4-level paging long mode, but this does change the way canonical addresses are generated so the kernel needs to be aware of how to work in 5-level paging mode. I think most kernels do support it at this point because it doesn't require too much extra work to implement, but depending on the underlying kernel space memory mapping implementation it may require a kernel linked to a different address -- this is how Windows handles it; instead of ntoskrnl.exe you'd see your kernel image being ntkrla57.exe.
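
a sketch of what the extra level changes: one more 9-bit index, so 57 bits of virtual address instead of 48, and the canonical hole moves accordingly. the names here are illustrative only.

```python
def canonical_bounds(bits):
    """Return (last address of the lower half, first address of the upper half)."""
    half = 1 << (bits - 1)
    return half - 1, (1 << 64) - half

def split_5_level(va):
    """Return the PML5, PML4, PDPT, page directory, page table indexes and the offset."""
    return tuple((va >> shift) & 0x1FF for shift in (48, 39, 30, 21, 12)) + (va & 0xFFF,)

if __name__ == "__main__":
    for bits in (48, 57):
        low, high = canonical_bounds(bits)
        print(bits, hex(low), hex(high))
```

in 5-level mode each half is 2**56 bytes (64 PiB), and the kernel half starts at 0xFF00000000000000 instead of 0xFFFF800000000000, which is why a kernel that hardcodes the old split needs to know about the change.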

later 64-bit x86 microarchitectures added huge pages (the cowards at intel refuse to call it this, but everyone else including their competitors with similar implementations do), wherein a PDPT entry points at a 1 GiB block of memory like the 2 MiB large page. this simplifies and speeds up address translation when you know you can just pre-allocate an enormous block of RAM to a process. database servers and the like love huge pages and heavily randomly-accessed databases can genuinely sometimes be more performant with huge pages because each gigabyte of RAM only needs one TLB entry, so the TLB isn't constantly being rewritten and flushed with all the accesses. there's also a security/caching feature called process-context identifiers (PCID) that lets the kernel tag a paging structure with a 12-bit value of its choice, which will propagate to the TLB entries the processor creates, allowing it to only do lookups based on the PCID of the current paging structure. this both speeds up lookups since the TLB is built on content-addressable memory and the PCID just becomes part of the TLB entry address, and is good for security because the processor isn't reading through other threads' TLB entries to get to the right one.

there's some deeper stuff that's pretty much irrelevant these days that I could talk about in more detail like hardware task switching and call gates but those haven't been used since around the 486 era on account of them becoming gradually too slow to be worth it compared to the kernel saving and loading thread/process contexts in software. there's also I/O permission bitmaps that live in the hardware task switching system's Task State Segment (each process in a hardware task switching OS had its own TSS that its context was stored in) and let you allow specific ISA I/O ports to be used by processes in user mode. TSSes do still exist (you need to have one so the processor knows which kernel stack to switch to when doing a user/kernel mode transition, and the kernel then just saves and restores the stuff that matters to and from its own thread structure). I know there's also a few additional things like nested paging for virtualization but I honestly don't know much about the VMX instructions and their associated stuff used for building hypervisors so I won't butcher all that.

anyways there's my giant effortpost on x86 memory management. if I think of anything else re: the x86 that could be fun and/or I could link back to DEC hardware (sorry for fuckin up your thread OP) I'll put together a batch of thoughts and post 'em. but for now I'm going to stop typing and realize that this kind of poo poo is exactly the reason I'm single

same

# ? Jan 16, 2022 21:22

carry on then: Jul 10, 2010; by VideoGames
(and can't post for 10 years!)

Captain Foo posted:

remember having to care about expanded vs extended memory

there are old issues of byte in the internet archive and the ads and reviews for things like dos memory managers make the experience of using a pc back then sound dire. there's no way i wouldn't have been a mac user if i were born early enough to have to make that decision back then

# ? Jan 16, 2022 22:19

Cybernetic Vermin: Apr 18, 2005

intel (and to some extent microsoft) were geniuses for realizing just how much real value exists in "legacy" software, at every stage dragging every little thing along, keeping people's and businesses' things ticking along. which on the theme of the thread is interesting, because as great as the alpha was, maybe the world would have looked really different if dec had done a pentium pro for vax, or motorola had done a pentium pro for the 68k.

afaik there is no real reason it wasn't perfectly doable, dec had some of the pieces already in the rather performant nvax, and the 68k had different challenges than x86 but i don't see that they were worse.

# ? Jan 16, 2022 22:48

ultravoices: May 10, 2004; You are about to embark on a great journey. Are you ready, my friend?

carry on then posted:

there are old issues of byte in the internet archive and the ads and reviews for things like dos memory managers make the experience of using a pc back then sound dire. there's no way i wouldn't have been a mac user if i were born early enough to have to make that decision back then

i still don't know the difference between ems and xms

# ? Jan 16, 2022 23:27

Kazinsal: Dec 13, 2011

glad to see my awful ramblings are appreciated. I'll do my best to not add anything else onto the thread title (partially because I don't know any other architectures well enough, but mostly because stuff like the history of interrupt routing on x86 isn't nearly as fascinating as the myriad memory management modes, though it is pretty batshit)

epitaph posted:

lost in the shuffle: virtual 8086 mode which was introduced with the 386. because people didn't want to rewrite their real mode apps intel introduced a mode with similar addressing only translated by the mmu. this enabled things like ems (remember emm386.sys?) where a small window into higher memory ("page frame") was introduced and could be shifted as needed. allegedly even bill gates called it a terrible hack, but oh well.

thought I had forgotten something! I never actually wrote a v86 monitor but I know a couple people who implemented one for the purpose of doing BIOS calls from within a 32-bit kernel and really, it's a bad idea, don't do it ever. common wisdom is that if you *have* to use the BIOS for anything eg. getting a memory map, do it before you move to protected or long mode and just shove the structs the BIOS hands you somewhere in memory and let your kernel parse them early on.

Cybernetic Vermin posted:

intel (and to some extent microsoft) were geniuses for realizing just how much real value exists in "legacy" software, at every stage dragging every little thing along, keeping people's and businesses' things ticking along. which on the theme of the thread is interesting, because as great as the alpha was, maybe the world would have looked really different if dec had done a pentium pro for vax, or motorola had done a pentium pro for the 68k.

afaik there is no real reason it wasn't perfectly doable, dec had some of the pieces already in the rather performant nvax, and the 68k had different challenges than x86 but i don't see that they were worse.

the 68020 was kind of motorola's "okay guys it's time to do things a bit differently wait what's all this weird hacky code you're writing" moment because the 68000 and 68010 had a 24-bit address bus and iirc it just ignored the upper 8 bits of pointers so you could store tag information and stuff in there. in the apple world this became known as "32-bit dirty". a couple early '020 macintosh models claimed to have 32-bit clean ROMs but actually didn't so someone wrote a system extension to patch that and it was so useful that apple just bought the rights to it and made it free lol
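
a small sketch of why that trick blew up. the tag value and function names are made up; the point is just that a CPU with 24 address lines throws away the top byte of a pointer and one with 32 doesn't.

```python
def tag_pointer(address, tag):
    """Stash an 8-bit tag in the top byte of a 32-bit pointer ("32-bit dirty")."""
    return ((tag & 0xFF) << 24) | (address & 0x00FFFFFF)

def bus_address_24(pointer):
    """What a 24-bit address bus actually sees: the top byte is ignored."""
    return pointer & 0x00FFFFFF

def bus_address_32(pointer):
    """What a full 32-bit address bus sees: the tag is now part of the address."""
    return pointer & 0xFFFFFFFF

if __name__ == "__main__":
    p = tag_pointer(0x00123456, 0x80)
    print(hex(bus_address_24(p)), hex(bus_address_32(p)))
```

the same tagged pointer that works fine on a 24-bit bus points somewhere completely different once the upper address lines are wired up.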

I think the 68k's big challenge was that most of motorola's customers dried up (Atari died, Sun invented SPARC, Amiga died, SGI moved on to MIPS) so they teamed up with Apple and IBM to bring POWER to the desktop. there's a few post-Apple 68ks that are interesting like the 68060 which brought it up to roughly Pentium-class microarchitectural performance but at that point the writing was on the wall for CISC designs. x86 powered through just through sheer brute force and dumb luck (and in a few cases, by implementing the architecture as a virtual machine in custom RISC microcode)

VAX could have lived longer if DEC just extended it to 64 bits and kept cranking out microarchitectural improvements. process improvements on their own would have pushed clock speeds up and iirc NVAX was pushing close to 200 MHz at the end of its life. I guess it just wasn't cost-competitive compared to a Pentium Pro

# ? Jan 17, 2022 00:30

Achmed Jones: Oct 16, 2004

i don't have anything to add here, i only know enough about particular cpus to competently pop shells. but i am enjoying these posts a lot please continue making them

# ? Jan 17, 2022 00:55

Farmer Crack-Ass: Jan 2, 2001; this is me posting irl

Kazinsal posted:

anyways there's my giant effortpost on x86 memory management. if I think of anything else re: the x86 that could be fun and/or I could link back to DEC hardware (sorry for fuckin up your thread OP) I'll put together a batch of thoughts and post 'em.

apology refused. this post was excellent and i would love to see more like it. :swoon:

# ? Jan 17, 2022 01:12

echinopsis: Apr 13, 2004; by Fluffdaddy

Kazinsal posted:

okay, time to talk about x86 memory management. this post is going to get a bit out of hand so I'm breaking it into sections.

way back before the 8086 the 8080 family had a 16-bit address space but no internal mechanism to manage it, so if you wanted more than 64KiB of RAM/ROM/MMIO you needed an external chip that you could frob to switch what a certain range of the address bus actually pointed at. this sucked but a lot of 8-bit CPU families did this and the chips are fairly simple. you hook them up to say a 16 KiB window and expose a couple I/O ports to the CPU so software can switch that 16 KiB bank to a different slice of whatever RAM chip is hooked up to it through the bank selector chip.

so when Intel designed the 8086 they realized how badly this sucked and even though they were still making a 16-bit CPU they stuck 4 more address lines onto it so they could basically do a sort of pseudo bank switching system in the standard address decoding logic. the 8086 has four 16-bit segment registers that are used in the address decoding logic to create a 20-bit address from a 16-bit segment and a 16-bit offset (usually formatted in documentation as eg. 1234:CDEF, where the word before the colon is the segment and the word after it is the offset). the logic is pretty simple: the segment word is shifted four bits to the left and the offset is added to it. in the above example we'd get this:
code:
segment << 4 | 12340
    + offset |  CDEF
------------ | -----
   = address | 1F12F
this is super advantageous over the bank switching chip thing for two reasons: one, you get full access to the whole 20-bit address space of the system at all times, and two, your code can always assume it starts at 0000 and goes to however much memory it needs in a 16-bit space since the OS can just give you a block of memory and say "your segment is 0x1234". all your indexes and offsets are from 0, no matter where in the 20-bit address space you actually are! and you get separate selectors for code (CS), data (DS and ES), and stack (SS), and segment override prefixes so you can access up to four 64K segments at any time without reloading the segment registers

now, there's no memory protection on the 8086 so any code can run over any memory at any time, and there's no concept of privilege levels so any code can use any instructions including mov Sreg, r/m16 to change what's in your segment registers, which is one of the main reasons why no one really ever cared too much about multi-user multitasking OSes for the 8086. the other main issue is that the most common 8086 out there was actually the 8088, which only had an 8-bit data bus instead of a 16-bit one so full word memory and I/O accesses needed two cycles instead of one, but that's something for a different post or a sidebar

sidebar: the 8086 was also interesting compared to a lot of other general-purpose 16-bit CPUs in that it carried over the separate I/O port space from the 8080/Z80. it uses its own in and out instructions and has a 16-bit address space of its own. it was relatively fast on real 8086es because it only took a couple clock cycles but it's still about the same speed on modern machines because the I/O port space is emulated in System Management Mode these days and anything it's accessing is an ISA device which is also either emulated by SMM, by a SuperIO chip, or by an actual 4.77 MHz ISA device going through half a dozen different bridge chips

so now we come to the 80286. it was still a 16-bit CPU but it had a 24-bit address bus and added a new CPU mode called protected mode that changed how memory management worked significantly. when you're in protected mode, instead of pointing directly to a segment base address, you point at an index into the Global Descriptor Table, which is an array of 48-bit fields aligned to 64 bits that describe the 24-bit base address for the segment, the segment limit (up to 0xFFFF; it's still a 16-bit machine at this point), and a flags field for things like minimum privilege level and some things mostly related to hardware task switching. the value you write into a segment register to pick a GDT entry is called a selector. the only problem with protected mode was that on the 286, you couldn't *leave* protected mode without resetting the CPU, and the BIOS's built-in driver routines were all only available in real address mode (the retronym for the 8086's operating mode). in order to go back you'd have to set the reset vector to your 16-bit real mode trampoline code and then intentionally catastrophically fault the CPU (by, say, setting the GDT or the IDT -- Interrupt Descriptor Table; same idea as the GDT but for interrupt vectors -- to a null pointer and causing an interrupt/exception, which would cause a recursive exception and "triple fault" the CPU, resetting it).

needless to say the 286 protected mode didn't see a lot of use on common systems like DOS and early Windows. Windows/286 2.x existed but no one really wrote programs for it because RAM was still pretty expensive and by that point there were bank switching systems for using more than 640KiB in real mode so you could sorta use big chunks of RAM 64K at a time in DOS. when the PC/AT debuted with an 80286, most of the computing world treated it as just a faster 8086 (as it ran at either 6 or 8 MHz depending on what options you bought from IBM instead of 4.77 MHz).

however this did technically give the 286 the capability for multitasking with process isolation! the GDT could technically have up to 8192 entries if you really wanted to, and because the 8-byte-aligned nature of GDT entries gave Intel 3 bits at the bottom of the selector index to work with, they assigned two of those bits to the "requested privilege level" (RPL) -- if the privilege level you asked for was numerically higher (further from kernel mode) than the privilege level required by the GDT entry, the CPU would generate a protection fault -- and one to "use Local Descriptor Table". you could then give each process its own LDT, load it into the LDT pointer register before switching to that process, and then the process could use that LDT for all of its selectors if it wanted to instead of using global selectors. this meant you could assign multiple chunks of memory to a process without needing to perform a costly system call back to kernel mode and rewrite the contents of a GDT entry every time a process wanted to access a different 64K chunk.
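
the selector bit layout is easy to show in a few lines of python. this is just a sketch; decode_selector and make_selector are names I made up, and "rpl" is what Intel's manuals call the two privilege bits at the bottom.

```python
def decode_selector(selector):
    """Split a 16-bit selector into table index, LDT flag, and privilege bits."""
    return {
        "index": selector >> 3,            # upper 13 bits: which table entry
        "use_ldt": bool(selector & 0x4),   # bit 2: 0 = GDT, 1 = LDT
        "rpl": selector & 0x3,             # bits 1..0: privilege level
    }

def make_selector(index, use_ldt, rpl):
    """Inverse of decode_selector."""
    if not 0 <= index < 8192:
        raise ValueError("13 index bits give at most 8192 entries")
    return (index << 3) | (0x4 if use_ldt else 0) | (rpl & 0x3)

def descriptor_offset(selector):
    """Byte offset of the entry inside its table (entries are 8 bytes each)."""
    return (selector >> 3) * 8
```

because entries are 8 bytes, masking off the bottom three bits of a selector gives you the byte offset into the table directly, which is why those bits were free to reuse.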

a couple years later the 80386 came out. this was a big fuckin deal as it was a full fat 32-bit machine with a 32-bit physical address bus, a 32-bit virtual address size, and a 32-bit data bus (16-bit on the 80386SX). to fit in this 32-bit setup, Intel used the 16 bits of padding in GDT entries for a 16/32-bit flag and a granularity flag, shoved in another 8 bits to make the base address 32-bit instead of 24-bit, and another 4 bits for the segment limit. why 4 bits? well, when you set the granularity flag, the limit gets shifted 12 bits to the left and ORed with 0xFFF, so a 20-bit limit can cover the whole 4 GiB in 4 KiB steps. this seems like a bit of a granularity issue but it's because the 386 was really designed to be used less as a machine with segmentation-based memory management and more like a real machine with full paging capabilities. the 386 did add two more segment registers (FS and GS) however, which ended up being used to easily implement thread-local storage when multi-threaded application programming became common.
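
the scaled limit is a one-liner. a minimal sketch, with effective_limit being a made-up name; note the low 12 bits come out as all ones, so the limit always lands on the last byte of a 4 KiB page.

```python
def effective_limit(raw_limit, granular):
    """Turn the 20-bit limit field into the last valid byte offset in the segment.
    With the page-granularity flag set, the field counts 4 KiB units."""
    if not 0 <= raw_limit <= 0xFFFFF:
        raise ValueError("the limit field is 20 bits on the 386")
    if granular:
        return (raw_limit << 12) | 0xFFF
    return raw_limit

flat_4gib = effective_limit(0xFFFFF, granular=True)   # the usual "flat" segment
one_page = effective_limit(0, granular=True)          # smallest page-granular segment
```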

x86 has a base page size of 4096 bytes, and the 80386 had a two-level page table system wherein your top-level page table pointer (held in CR3 -- Control Register 3) would be a "page directory" composed of pointers to first-level page tables and flags for how they're allowed to be accessed (minimum privilege level, read/write or read only, cache enable/disable, and a few other flags that were later added), and then each page table was composed of entries holding the upper 20 bits of the physical address to translate the matching virtual address to, ORed with another set of the same general flags, just this time for that specific page. the granularity on this is awesome and it's part of how x86 became a real 32-bit workhorse in the late 80s. the formula for determining how to translate a virtual address to a physical address is fairly simple: take the top 10 bits of the virtual address; this is your index into the page directory. dereference that to get your page table (or throw a page fault if the access flags don't match your current system state). take the next 10 bits of the virtual address; this is your index into the page table. dereference that to get the physical address to replace the top 20 bits with (or throw a page fault if the access flags don't match your current system state). the bottom 12 bits pass through untouched as the offset within the page.
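
here's the walk as a python sketch. it's a simplified model, not how any real kernel or emulator does it: translate, PageFault and read_u32 (a callback that reads a 32-bit word of "physical memory") are all my own names, and the permission check only covers the present, writable and user bits.

```python
PRESENT, WRITABLE, USER = 0x1, 0x2, 0x4
FRAME_MASK = 0xFFFFF000   # upper 20 bits of an entry

class PageFault(Exception):
    pass

def _check(entry, write, user):
    if not entry & PRESENT:
        raise PageFault("not present")
    if user and not entry & USER:
        raise PageFault("supervisor-only")
    # simplification: read-only is only enforced for user-mode accesses here
    if user and write and not entry & WRITABLE:
        raise PageFault("read-only")

def translate(read_u32, cr3, vaddr, write=False, user=False):
    """Walk a 386-style two-level page table and return the physical address."""
    dir_index = (vaddr >> 22) & 0x3FF     # top 10 bits
    table_index = (vaddr >> 12) & 0x3FF   # next 10 bits
    offset = vaddr & 0xFFF                # bottom 12 bits, untranslated

    pde = read_u32((cr3 & FRAME_MASK) + dir_index * 4)
    _check(pde, write, user)
    pte = read_u32((pde & FRAME_MASK) + table_index * 4)
    _check(pte, write, user)
    return (pte & FRAME_MASK) | offset

# tiny fake physical memory: page directory at 0x1000, one page table at 0x2000
memory = {
    0x1000 + 1 * 4: 0x2000 | PRESENT | WRITABLE | USER,
    0x2000 + 3 * 4: 0x5000 | PRESENT | WRITABLE | USER,
}
physical = translate(lambda addr: memory.get(addr, 0), 0x1000, 0x00403ABC)
```

anything not in the fake memory dict reads back as 0, which has the present bit clear, so unmapped addresses fault like they should.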

paging was great for simplifying memory protection and mapping because instead of each process needing a bunch of selectors, as far as they were concerned they owned the whole 32-bit address space except where the kernel was (and they couldn't read/write kernel space). every process could use the same ring 3 (privilege level) selectors and the OS would just slap the process's page directory address into CR3 before a context switch. and of course, you can just grab whatever free slab of memory you find first when you need to allocate memory to a process because with virtual address translation the actual physical location of a page doesn't matter to the process at all. Intel finally had a real multitasking x86 processor and it was still backwards compatible with original 8086 real mode code as well as 286 16-bit protected mode if you really wanted it (with the added bonus of allowing you to switch between all these modes at will instead of needing to reset the CPU).

sidebar: yes, you could mix paging with arbitrary segment bases and limits. pretty much no one ever did this and while I don't have the 80386 programmer's manual handy I'm pretty sure it says that it's a bad idea and the 386 can't efficiently cache translations if you gently caress around with weird segment bases and paging at the same time. the only OS I can think of off the top of my head that does is OpenBSD/i386, which uses the segment limit on code selectors to implement W^X on pre-64-bit machines.

along with paging came a few extra instructions and control registers for managing it, because the 386 cached a bunch more stuff than the 286 did. if you modified a page table entry you'd need to reload the CR3 register with the same pointer (eg. mov eax, cr3; mov cr3, eax) and the 386 would invalidate the whole translation lookaside buffer. later, the 486 added an instruction to invalidate a single page; executing an invlpg address instruction would remove the cached translation for the page that address is in.
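
the difference between the two invalidation styles is easier to see with a toy model. this is a made-up ToyTLB class keyed by virtual page number, nothing like the real hardware structure, just enough to show "flush everything" versus "drop one page".

```python
class ToyTLB:
    """Toy translation cache: virtual page number -> physical frame number."""

    def __init__(self):
        self.entries = {}

    def fill(self, vaddr, paddr):
        self.entries[vaddr >> 12] = paddr >> 12

    def lookup(self, vaddr):
        frame = self.entries.get(vaddr >> 12)
        if frame is None:
            return None                      # miss: a real CPU would walk the tables
        return (frame << 12) | (vaddr & 0xFFF)

    def reload_cr3(self):
        """386 style: rewriting CR3 throws away every cached translation."""
        self.entries.clear()

    def invlpg(self, vaddr):
        """486 style: drop only the translation for the page containing vaddr."""
        self.entries.pop(vaddr >> 12, None)

tlb = ToyTLB()
tlb.fill(0x00403000, 0x5000)
tlb.fill(0x00404000, 0x9000)
tlb.invlpg(0x00403ABC)    # only the first page's translation goes away
```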

pushing onwards a bit past the 486 and Pentium we reach the Pentium Pro, which is our next stop here because it added a new awesome thing: Physical Address Extension (PAE). Intel looked at the 286's address bus compared to the 8086's, said "let's do that again", and slapped on another four bits. now, the problem here is that the page entries were already chock full of bits so what Intel did was add a flag to a control register to enable PAE mode, in which page table and directory entries became 64 bits wide instead of 32 bits wide to fit the extra bits and some additional flags for the future, and the CR3 register now pointed at a four-entry Page Directory Pointer Table where each entry would point to a page directory that controlled 1 GiB of the 32-bit virtual address space.
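
since the entries doubled in size, a 4 KiB table only holds 512 of them, so the virtual address splits 2/9/9/12 instead of 10/10/12. a quick sketch (split_pae is my own name for it):

```python
def split_pae(vaddr):
    """Split a 32-bit virtual address the way PAE paging indexes its tables."""
    return (
        (vaddr >> 30) & 0x3,     # 2 bits: which of the 4 PDPT entries (1 GiB each)
        (vaddr >> 21) & 0x1FF,   # 9 bits: index into the page directory
        (vaddr >> 12) & 0x1FF,   # 9 bits: index into the page table
        vaddr & 0xFFF,           # 12 bits: offset within the page
    )

ENTRIES_PER_TABLE = 4096 // 8          # 64-bit entries in a 4 KiB table
PHYSICAL_SPACE = 1 << 36               # 32 + 4 address bits

pdpt_index, dir_index, table_index, offset = split_pae(0xC0403ABC)
```

virtual addresses are still 32 bits here; only the physical side grew.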

if you're having a "wait, this reminds me of the VAX post" moment, welcome to the magic of x86. it was basically designed stealing all the best parts of other architectures and steadfastly refusing to throw away any of your old legacy cruft.

another cool thing that was added on the original Pentium but wasn't used too often at the time was the page size flag. in a page directory entry, you could flip the page size flag (formerly a must-be-zero reserved bit) from 0 to 1 and instead of pointing at a page table, that page directory entry would be a direct mapping to a 4 MiB virtual slab of the address space (2 MiB in PAE mode), effectively making the 10 bits (9 in PAE mode) normally used as an index into a page table just part of the virtual address that wasn't translated by the memory management unit. now you could either slam a huge chunk of RAM into a process's address space or map a block as "not present" and use the fault it generated as a way to signal the kernel to do some other operation like disk I/O or whatnot.
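
a sketch of what the large-page case does to the address math. translate_large is a made-up helper and it's simplified: it assumes a 32-bit physical address without PAE and 36 bits with it, and ignores every flag except page size.

```python
PAGE_SIZE_FLAG = 0x80   # bit 7 of a page directory entry

def translate_large(pde, vaddr, pae=False):
    """Translate through a page directory entry that maps a large page directly."""
    if not pde & PAGE_SIZE_FLAG:
        raise ValueError("entry points at a page table, not a large page")
    offset_bits = 21 if pae else 22      # 2 MiB or 4 MiB worth of untranslated bits
    phys_bits = 36 if pae else 32        # assumption: original PAE address width
    offset_mask = (1 << offset_bits) - 1
    base = pde & ((1 << phys_bits) - 1) & ~offset_mask
    return base | (vaddr & offset_mask)

four_mib = translate_large(0x00800083, 0x00412345)
two_mib = translate_large(0x340200083, 0x00412345, pae=True)
```

compared to the normal walk, the page table index bits just become more offset bits and one level of lookup disappears.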

shoving on ahead to 64-bit x86 we now have 64 bits of virtual address space as an extension of PAE mode, right?

kinda.

we actually have 48 bits. in 64-bit pointers. when you enter 64-bit long mode, CR3, which was extended to being 64 bits wide, now points at the Page Map Level 4 (PML4), which has 512 (9 bits' worth) entries, each of which points at a page directory pointer table, which has 512 entries, each of which points at a page directory, which has 512 entries, each of which points at a page table, which has 512 entries, each of which points at a page. in order to make the address "canonical" though, the upper 16 bits must be sign-extended from bit 47. so, your canonical addresses suddenly jump from 0x00007FFFFFFFFFFF to 0xFFFF800000000000. this is fine, whatever, you're basically splitting a 256 TiB address space into two 128 TiB address spaces. this is still an enormous amount of address space.
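
the sign-extension rule and the 9/9/9/9/12 split in python, as a rough sketch (canonicalize, is_canonical and split_4level are names I picked):

```python
def canonicalize(addr, va_bits=48):
    """Sign-extend the top implemented bit through bit 63."""
    low_mask = (1 << va_bits) - 1
    addr &= low_mask
    if addr & (1 << (va_bits - 1)):
        addr |= ((1 << 64) - 1) & ~low_mask   # set every bit above va_bits
    return addr

def is_canonical(addr, va_bits=48):
    return canonicalize(addr, va_bits) == addr

def split_4level(addr):
    """Return (PML4, PDPT, page directory, page table) indexes plus page offset."""
    return (
        (addr >> 39) & 0x1FF,
        (addr >> 30) & 0x1FF,
        (addr >> 21) & 0x1FF,
        (addr >> 12) & 0x1FF,
        addr & 0xFFF,
    )

first_upper_half = canonicalize(0x0000800000000000)
```

everything between the two halves is non-canonical, and touching it faults before the page tables are even consulted.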

also, at this point, segmentation is dead. hooray! you are literally not allowed to use segments with bases other than 0 and limits other than "all of it" in 64-bit mode. but to make thread-local storage easier, AMD and Intel agreed that it would be a good idea to have hidden processor registers called FSBASE and GSBASE that... readded a base address to any memory accesses made against the FS and GS selectors.

a lot of OSes use the canonical address split for convenient separation of user and kernel space. user space gets the bottom half, and the kernel gets the upper half. this is, even as of 2022, still enough on all but the most ridiculously big hypercomputing systems to map every byte of physical RAM and every byte of memory-mapped I/O space into kernel space if you wanted to (and most do! you've got all that space and it's advantageous to be able to easily access any physical address by just slapping a prefix onto it to make it a virtual address). so, naturally, Intel went a step further a couple years ago with the Ice Lake processor generation and added a sub-mode of long mode called 5-level paging mode wherein CR3 points at the Page Map Level 5 (PML5), which has 512 entries, each of which points at a PML4. the rest of the virtual address translation works the same as in normal 4-level paging long mode, but this does change the way canonical addresses are generated so the kernel needs to be aware of how to work in 5-level paging mode. I think most kernels do support it at this point because it doesn't require too much extra work to implement, but depending on the underlying kernel space memory mapping implementation it may require a kernel linked to a different address -- this is how Windows handles it; instead of ntoskrnl.exe you'd see your kernel image being ntkrla57.exe.
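
one more level of 9 bits takes you from 48 to 57 bits of virtual address, which moves where the two canonical halves start and end. a small sketch (canonical_halves and split_5level are made-up names):

```python
def canonical_halves(va_bits):
    """Return (last address of the lower half, first address of the upper half)."""
    half = 1 << (va_bits - 1)
    return half - 1, (1 << 64) - half

def split_5level(addr):
    """Return (PML5, PML4, PDPT, page directory, page table) indexes plus offset."""
    return (
        (addr >> 48) & 0x1FF,
        (addr >> 39) & 0x1FF,
        (addr >> 30) & 0x1FF,
        (addr >> 21) & 0x1FF,
        (addr >> 12) & 0x1FF,
        addr & 0xFFF,
    )

four_level = canonical_halves(48)
five_level = canonical_halves(57)
total_5level_bytes = 1 << 57      # 128 PiB of virtual address space
```

the different upper-half start address is why a kernel linked for 4-level paging can't just be dropped into 5-level mode unchanged.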

later 64-bit x86 microarchitectures added huge pages (the cowards at intel refuse to call it this, but everyone else including their competitors with similar implementations do), wherein a PDPT entry points at a 1 GiB block of memory like the 2 MiB large page. this simplifies and speeds up address translation when you know you can just pre-allocate an enormous block of RAM to a process. database servers and the like love huge pages and heavily randomly-accessed databases can genuinely sometimes be more performant with huge pages because each gigabyte of RAM only needs one TLB entry, so the TLB isn't constantly being rewritten and flushed with all the accesses. there's also a security/caching feature called process-context identifiers (PCID) that lets the kernel tag a paging structure with a 12-bit value of its choice, which will propagate to the TLB entries the processor creates, allowing it to only do lookups based on the PCID of the current paging structure. this both speeds up lookups since the TLB is built on content-addressable memory and the PCID just becomes part of the TLB entry address, and is good for security because the processor isn't reading through other threads' TLB entries to get to the right one.
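
to put numbers on the TLB pressure argument, here's the arithmetic for how many translations it takes to cover a working set at each page size. the 64 GiB figure is just an example I picked, not anything from a real workload.

```python
KIB, MIB, GIB = 2**10, 2**20, 2**30

def translations_needed(working_set_bytes, page_size):
    """Number of distinct page translations needed to cover the working set."""
    return -(-working_set_bytes // page_size)   # ceiling division

working_set = 64 * GIB   # hypothetical database buffer pool
needed = {
    "4 KiB": translations_needed(working_set, 4 * KIB),
    "2 MiB": translations_needed(working_set, 2 * MIB),
    "1 GiB": translations_needed(working_set, 1 * GIB),
}
```

with 1 GiB pages the whole thing fits in a few dozen entries, versus millions at 4 KiB, which no TLB can hold at once.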

there's some deeper stuff that's pretty much irrelevant these days that I could talk about in more detail like hardware task switching and call gates but those haven't been used since around the 486 era on account of them becoming gradually too slow to be worth it compared to the kernel saving and loading thread/process contexts in software. there are also I/O permission bitmaps that live in the hardware task switching system's Task State Segment (each process in a hardware task switching OS had its own TSS that its context was stored in) and let you allow specific ISA I/O ports to be used by processes in user mode. those do still exist (you need to have one TSS for the processor to dump user thread context in when doing a user/kernel mode transition, but the kernel then just copies the stuff that matters to and from its own thread structure). I know there's also a few additional things like nested paging for virtualization but I honestly don't know much about the VMX instructions and their associated stuff used for building hypervisors so I won't butcher all that.
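
the I/O permission bitmap check is simple enough to sketch. port_access_allowed is a made-up helper; it models the rule that a clear bit means the port is allowed, that every byte of a multi-byte access has to be allowed, and that ports past the end of the bitmap are denied.

```python
def port_access_allowed(bitmap, port, width=1):
    """Check a user-mode access of `width` bytes starting at `port`
    against a TSS-style I/O permission bitmap (one bit per port)."""
    for p in range(port, port + width):
        byte_index, bit = divmod(p, 8)
        if byte_index >= len(bitmap):
            return False                      # beyond the bitmap: denied
        if bitmap[byte_index] & (1 << bit):
            return False                      # bit set: denied
    return True

# deny everything, then open up the eight ports at 0x3F8..0x3FF (the usual COM1 range)
bitmap = bytearray(b"\xff" * 8192)
bitmap[0x3F8 // 8] = 0x00
```

a 16-bit access at 0x3FF would spill into port 0x400, so it gets denied even though 0x3FF itself is open.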

anyways there's my giant effortpost on x86 memory management. if I think of anything else re: the x86 that could be fun and/or I could link back to DEC hardware (sorry for fuckin up your thread OP) I'll put together a batch of thoughts and post 'em. but for now I'm going to stop typing and realize that this kind of poo poo is exactly the reason I'm single

digital cocaine 😍

# ? Jan 17, 2022 01:57

Best Bi Geek Squid: Mar 25, 2016

computers is rocks we taught tricks

# ? Jan 17, 2022 05:38

rjmccall: Sep 7, 2007; no worries friend; Fun Shoe

these are good posts

# ? Jan 17, 2022 07:47
