 
echinopsis
Apr 13, 2004

by Fluffdaddy
yeah kaz should find some amphetamines and go to town writing about ISAs


echinopsis
Apr 13, 2004

by Fluffdaddy

rjmccall posted:

op, the dec alpha has a fully weak memory ordering model which requires read barriers on atomic loads of objects even if all the subsequent loads from the object are value-dependent on the loaded pointer. intriguingly, it is my understanding that the weakness of the alpha's memory ordering model is purely theoretical, and all shipping alpha hardware in fact uses a stronger model which does guarantee that dependent loads will be properly ordered. nonetheless, because this is not architecturally guaranteed, any lock-free code must ensure that it uses proper barriers around atomic loads if it is ever ported to alpha. in principle, systems programming languages such as c and c++ would allow programmers to clearly state their requirements and then compile it optimally for the given platform, potentially avoiding load barriers when compiling for systems other than the alpha. unfortunately, inventing a sound formal definition of the concept of value-dependence that still admits reasonable optimizations in code that may not even be aware of the fact that it's carrying a dependency on an atomic load has proven to be an exceptionally tricky problem. even now, a full ten years after the introduction of atomics to c and c++, many compilers do not compile the so-called "consume" memory ordering optimally. this problem would be entirely defined away if processors were instead as overtly hostile as the theoretical but not actual memory ordering model of the dec alpha, op

i’m jealous of people who understand this poo poo


maybe I should just play TIS-100 or Shenzhen I/O some more

echinopsis
Apr 13, 2004

by Fluffdaddy
cool thread

echinopsis
Apr 13, 2004

by Fluffdaddy

The_Franz posted:

hp did

https://www.youtube.com/watch?v=VLTh4uVJduI

you have to wonder who the target demographic for this was, as these were workstations that cost thousands (or 10s of thousands) of 1990s us dollars and were mainly sold to companies where everyone wore a tie. then again, it probably really stood out in a sea of the typical business-to-business sales presentations that consisted of some stuffed shirt reading a checklist of product points in a monotone voice

holy poo poo lol that's a sight to behold

echinopsis
Apr 13, 2004

by Fluffdaddy
man I want to try cocaine

echinopsis
Apr 13, 2004

by Fluffdaddy

Kazinsal posted:

:tipshat:

weird low level computer poo poo is one of the few things I can just blather on about for hours. I’ll probably get baked and write something about the insanity of x86 memory management later tonight

please do

idk why but I love it. make me wish I worked at that level.

echinopsis
Apr 13, 2004

by Fluffdaddy

Kazinsal posted:

okay, time to talk about x86 memory management. this post is going to get a bit out of hand so I'm breaking it into sections.


way back before the 8086 the 8080 family had a 16-bit address space but no internal mechanism to manage it, so if you wanted more than 64KiB of RAM/ROM/MMIO you needed an external chip that you could frob to switch what a certain range of the address bus actually pointed at. this sucked but a lot of 8-bit CPU families did this and the chips are fairly simple. you hook them up to say a 16 KiB window and expose a couple I/O ports to the CPU so software can switch that 16 KiB bank to a different slice of whatever RAM chip is hooked up to it through the bank selector chip.

so when Intel designed the 8086 they realized how badly this sucked and even though they were still making a 16-bit CPU they stuck 4 more address lines onto it so they could basically do a sort of pseudo bank switching system in the standard address decoding logic. the 8086 has four 16-bit segment registers that are used in the address decoding logic to create a 20-bit address from a 16-bit segment and a 16-bit offset (usually formatted in documentation as eg. 1234:CDEF, where the word before the colon is the segment and the word after it is the offset). the logic is pretty simple: the segment word is shifted four bits to the left and the offset is added to it. in the above example we'd get this:

code:
segment << 4 | 12340
    + offset |  CDEF
------------ | -----
   = address | 1F12F
this is super advantageous over the bank switching chip thing for two reasons: one, you get full access to the whole 20-bit address space of the system at all times, and two, your code can always assume it starts at 0000 and goes to however much memory it needs in a 16-bit space since the OS can just give you a block of memory and say "your segment is 0x1234". all your indexes and offsets are from 0, no matter where in the 20-bit address space you actually are! and you get separate selectors for code (CS), data (DS and ES), and stack (SS), and segment override prefixes so you can access up to four 64K segments at any time without reloading the segment registers

now, there's no memory protection on the 8086 so any code can run over any memory at any time, and there's no concept of privilege levels so any code can use any instructions including mov Sreg, r/m16 to change what's in your segment registers, which is one of the main reasons why no one really ever cared too much about multi-user multitasking OSes for the 8086. the other main issue is that the most common 8086 out there was actually the 8088, which only had an 8-bit data bus instead of a 16-bit one so full word memory and I/O accesses needed two cycles instead of one, but that's something for a different post or a sidebar

sidebar: the 8086 was also interesting compared to a lot of other general-purpose 16-bit CPUs in that it carried over the separate I/O port space from the 8080/Z80. it uses its own in and out instructions and has a 16-bit address space of its own. it was relatively fast on real 8086es because it only took a couple clock cycles but it's still about the same speed on modern machines because the I/O port space is emulated in System Management Mode these days and anything it's accessing is an ISA device which is also either emulated by SMM, by a SuperIO chip, or by an actual 4.77 MHz ISA device going through half a dozen different bridge chips


so now we come to the 80286. it was still a 16-bit CPU but it had a 24-bit address bus and added a new CPU mode called protected mode that changed how memory management worked significantly. when you're in protected mode, instead of pointing directly to a segment base address, you point at an index into the Global Descriptor Table, which is an array of 48-bit fields aligned to 64 bits that describe the 24-bit base address for the selector (the name for the GDT entry whose index you write into a segment register), the segment limit (up to 0xFFFF; it's still a 16-bit machine at this point), and a flags field for things like minimum privilege level and some things mostly related to hardware task switching. the only problem with protected mode was that on the 286, you couldn't *leave* protected mode without resetting the CPU, and the BIOS's built-in driver routines were all only available in real address mode (the retronym for the 8086's operating mode). in order to go back you'd have to set the reset vector to your 16-bit real mode trampoline code and then intentionally catastrophically fault the CPU (by, say, setting the GDT or the IDT -- Interrupt Descriptor Table; same idea as the GDT but for interrupt vectors -- to a null pointer and causing an interrupt/exception, which would cause a recursive exception and "triple fault" the CPU, resetting it).

needless to say the 286 protected mode didn't see a lot of use on common systems like DOS and early Windows. Windows/286 2.x existed but no one really wrote programs for it because RAM was still pretty expensive and by that point there were bank switching systems for using more than 640KiB in real mode so you could sorta use big chunks of RAM 64K at a time in DOS. when the PC/AT debuted with an 80286, most of the computing world treated it as just a faster 8086 (as it ran at either 6 or 8 MHz depending on what options you bought from IBM instead of 4.77 MHz).

however this did technically give the 286 the capability for multitasking with process isolation! the GDT could technically have up to 8192 entries if you really wanted to, and because the 8-byte-aligned nature of GDT entries gave Intel 3 bits at the bottom of the selector index to work with, they assigned two of those bits to the "requested privilege level" -- if the privilege you were running at (or requesting) was numerically higher, i.e. less privileged, than the minimum privilege level of the GDT entry, the CPU would generate a protection fault -- and one to "use Local Descriptor Table". you could then give each process its own LDT, load it into the LDT pointer register before switching to that process, and then the process could use that LDT for all of its selectors if it wanted to instead of using global selectors. this meant you could assign multiple chunks of memory to a process without needing to perform a costly system call back to kernel mode and rewrite the contents of a GDT entry every time a process wanted to access a different 64K chunk.


a couple years later the 80386 came out. this was a big fuckin deal as it was a full fat 32-bit machine with a 32-bit physical address bus, a 32-bit virtual address size, and a 32-bit data bus (16-bit on the 80386SX). to fit in this 32-bit setup, Intel used the 16 bits of padding in GDT entries for a 16/32-bit flag and a granularity flag, shoved in another 8 bits to make the base address 32-bit instead of 24-bit, and another 4 bits for the segment limit. why 4 bits? well, when you set the granularity flag, the limit gets shifted 12 bits to the left and ORed with 0xFFF. this seems like a bit of a granularity issue but it's because the 386 was really designed to be used less as a machine with segmentation-based memory management and more like a real machine with full paging capabilities. the 386 did add two more segment registers (FS and GS) however, which ended up being used to easily implement thread-local storage when multi-threaded application programming became common.

x86 has a base page size of 4096 bytes, and the 80386 had a two-level page table system wherein your top-level page table pointer (held in CR3 -- Control Register 3) would point at a "page directory" composed of pointers to first-level page tables and flags for how they're allowed to be accessed (minimum privilege level, read/write or read only, cache enable/disable, and a few other flags that were later added), and then each page table was composed of entries holding the upper 20 bits of the physical address to translate the matching virtual address to, ORed with another set of the same general flags, just this time for that specific page. the granularity on this is awesome and it's part of how x86 became a real 32-bit workhorse in the late 80s. the formula for determining how to translate a virtual address to a physical address is fairly simple: take the top 10 bits of the virtual address; this is your index into the page directory. dereference that to get your page table (or throw a page fault if the access flags don't match your current system state). take the next 10 bits of the virtual address; this is your index into the page table. dereference that to get the physical address to replace the top 20 bits with (or throw a page fault if the access flags don't match your current system state). the remaining 12 bits are the offset into the 4 KiB page and pass through untranslated.

paging was great for simplifying memory protection and mapping because instead of each process needing a bunch of selectors, as far as they were concerned they owned the whole 32-bit address space except where the kernel was (and they couldn't read/write kernel space). every process could use the same ring 3 (privilege level) selectors and the OS would just slap the process's page directory address into CR3 before a context switch. and of course, you can just grab whatever free slab of memory you find first when you need to allocate memory to a process because with virtual address translation the actual physical location of a page doesn't matter to the process at all. Intel finally had a real multitasking x86 processor and it was still backwards compatible with original 8086 real mode code as well as 286 16-bit protected mode if you really wanted it (with the added bonus of allowing you to switch between all these modes at will instead of needing to reset the CPU).

sidebar: yes, you could mix paging with arbitrary segment bases and limits. pretty much no one ever did this and while I don't have the 80386 programmer's manual handy I'm pretty sure it says that it's a bad idea and the 386 can't efficiently cache translations if you gently caress around with weird segment bases and paging at the same time. the only real-world user I can think of off the top of my head is OpenBSD/i386, which uses the segment limit on code selectors to implement W^X on pre-64-bit machines.

along with paging a couple extra instructions for it were added for dealing with cache control registers and the like, because the 386 cached a bunch more stuff than the 286 did. if you modified a page table entry you'd need to reload the CR3 register with the same pointer (eg. mov eax, cr3; mov cr3, eax) and the 386 would invalidate the whole translation lookaside buffer. later, the 486 added an instruction to invalidate a single page; executing an invlpg address instruction would remove the cached translation for the page that address is in.


pushing onwards a bit past the 486 and Pentium we reach the Pentium Pro, which is our next stop here because it added a new awesome thing: Physical Address Extension (PAE). Intel looked at the 286's address bus compared to the 8086's, said "let's do that again", and slapped on another four bits. now, the problem here is that the page entries were already chock full of bits so what Intel did was add a flag to a control register to enable PAE mode, in which page table and directory entries became 64 bits wide instead of 32 bits wide to fit the extra address bits and some additional flags for the future, and the CR3 register now pointed at a four-entry Page Directory Pointer Table where each entry would point to a page directory that controlled 1 GiB of the 32-bit virtual address space.

if you're having a "wait, this reminds me of the VAX post" moment, welcome to the magic of x86. it was basically designed stealing all the best parts of other architectures and steadfastly refusing to throw away any of your old legacy cruft.

another cool thing that was added on the original Pentium but wasn't used too often at the time was the page size flag. in a page directory entry, you could flip the page size flag (formerly a must-be-zero reserved bit) from 0 to 1 and instead of pointing at a page table, that page directory would be a direct mapping to a 4 MiB virtual slab of the address space (2 MiB in PAE mode), effectively making the 10 bits (9 in PAE mode) normally used as an index into a page table just part of the virtual address that wasn't translated by the memory management unit. now you could either slam a huge chunk of RAM into a process's address space or map a block as "not present" and use the fault it generated as a way to signal the kernel to do some other operation like disk I/O or whatnot.


shoving on ahead to 64-bit x86 we now have 64 bits of virtual address space as an extension of PAE mode, right?

kinda.

we actually have 48 bits. in 64-bit pointers. when you enter 64-bit long mode, CR3, which was extended to being 64 bits wide, now points at the Page Map Level 4 (PML4), which has 512 (9 bits' worth) entries, each of which points at a page directory pointer table, which has 512 entries, each of which points at a page directory, which has 512 entries, each of which points at a page table, which has 512 entries, each of which points at a page. in order to make an address "canonical" though, the upper 16 bits must be sign-extended from bit 47. so, your canonical addresses suddenly jump from 0x00007FFFFFFFFFFF straight to 0xFFFF800000000000. this is fine, whatever, you're basically splitting a 256 TiB address space into two 128 TiB address spaces. this is still an enormous amount of address space.

also, at this point, segmentation is dead. hooray! you are literally not allowed to use segments with bases other than 0 and limits other than "all of it" in 64-bit mode. but to make thread-local storage easier, AMD and Intel agreed that it would be a good idea to have hidden processor registers called FSBASE and GSBASE that... readded a base address to any memory accesses made against the FS and GS selectors.

a lot of OSes use this for convenient separation of user and kernel space. user space gets the bottom half, and the kernel gets the upper half. this is, even as of 2022, still enough on all but the most ridiculously big hypercomputing systems to map every byte of physical RAM and every byte of memory-mapped I/O space into kernel space if you wanted to (and most do! you've got all that space and it's advantageous to be able to easily access any physical address by just slapping a prefix onto it to make it a virtual address). so, naturally, Intel went a step further a couple years ago with the Ice Lake processor generation and added a sub-mode of long mode called 5-level paging mode wherein CR3 points at the Page Map Level 5 (PML5), which has 512 entries, each of which points at a PML4. the rest of the virtual address translation works the same as in normal 4-level paging long mode, but this does change the way canonical addresses are generated so the kernel needs to be aware of how to work in 5-level paging mode. I think most kernels do support it at this point because it doesn't require too much extra work to implement, but depending on the underlying kernel space memory mapping implementation it may require a kernel linked to a different address -- this is how Windows handles it; instead of ntoskrnl.exe you'd see your kernel image being ntkrla57.exe.

later 64-bit x86 microarchitectures added huge pages (the cowards at intel refuse to call it this, but everyone else including their competitors with similar implementations do), wherein a PDPT entry points at a 1 GiB block of memory like the 2 MiB large page. this simplifies and speeds up address translation when you know you can just pre-allocate an enormous block of RAM to a process. database servers and the like love huge pages and heavily randomly-accessed databases can genuinely sometimes be more performant with huge pages because each gigabyte of RAM only needs one TLB entry, so the TLB isn't constantly being rewritten and flushed with all the accesses. there's also a security/caching feature called process-context identifiers (PCID) that lets the kernel tag a paging structure with a 12-bit value of its choice, which will propagate to the TLB entries the processor creates, allowing it to only do lookups based on the PCID of the current paging structure. this both speeds up lookups since the TLB is built on content-addressable memory and the PCID just becomes part of the TLB entry address, and is good for security because the processor isn't reading through other threads' TLB entries to get to the right one.


there's some deeper stuff that's pretty much irrelevant these days that I could talk about in more detail like hardware task switching and call gates but those haven't been used since around the 486 era on account of them becoming gradually too slow to be worth it compared to the kernel saving and loading thread/process contexts in software. there's also I/O Privilege Level maps that live in the hardware task switching system's Task State Segment (each process in a hardware task switching OS had its own TSS that its context was stored in) and allow you to let specific ISA I/O ports be used by processes in user mode, which do still exist (you need to have one TSS for the processor to dump user thread context in when doing a user/kernel mode transition, but the kernel then just copies the stuff that matters back to and from its own thread structure). I know there's also a few additional things like nested paging for virtualization but I honestly don't know much about the VMX instructions and their associated stuff used for building hypervisors so I won't butcher all that.

anyways there's my giant effortpost on x86 memory management. if I think of anything else re: the x86 that could be fun and/or I could link back to DEC hardware (sorry for fuckin up your thread OP) I'll put together a batch of thoughts and post 'em. but for now I'm going to stop typing and realize that this kind of poo poo is exactly the reason I'm single

digital cocaine 😍

echinopsis
Apr 13, 2004

by Fluffdaddy

rjmccall posted:

these are good posts

echinopsis
Apr 13, 2004

by Fluffdaddy
I never understood the difference between chip and fast ram on the amiga

echinopsis
Apr 13, 2004

by Fluffdaddy

ultravoices posted:

chip ram is the stuff that is shared between the cpu and the video hardware via DMA.

fast ram was the next block of ram that was directly addressable only by the CPU.

fast ram sucks it seems

echinopsis
Apr 13, 2004

by Fluffdaddy
:worship:

echinopsis
Apr 13, 2004

by Fluffdaddy

Gentle Autist posted:

if you scroll real fast through kazinsal’s posts in greenpos it’s like that thing in popular science fiction movie “the matrix”

echinopsis
Apr 13, 2004

by Fluffdaddy

Kazinsal posted:

- anecdotally, the Windows kernel team told Intel to make resets via triple faulting on the 80286 faster after getting their hands on some engineering samples because they were using it as a hack to enable multi-tasking with MS-DOS programs under Windows/286, since the 286 didn't actually have a native way to return to real mode from protected mode. the Intel engineers didn't believe them at first until they showed them how it was working and they were reportedly both impressed and horrified enough to go back and do exactly that, so the Windows 2.x and 3.x kernels in 286 protected mode are constantly resetting the CPU. there's a vague reference to this in the 80286 programmer's manual and Larry Osterman has reported it to have been a meeting that actually happened so I'm inclined to believe it

sick


echinopsis
Apr 13, 2004

by Fluffdaddy
i read female dating strategies so I know lvm means low value male (or man) but llvm? no idea
