|
op, the dec alpha has a fully weak memory ordering model which requires read barriers on atomic loads of objects even if all the subsequent loads from the object are value-dependent on the loaded pointer. intriguingly, it is my understanding that the weakness of the alpha's memory ordering model is purely theoretical, and all shipping alpha hardware in fact uses a stronger model which does guarantee that dependent loads will be properly ordered. nonetheless, because this is not architecturally guaranteed, any lock-free code must ensure that it uses proper barriers around atomic loads if it is ever ported to alpha.

in principle, systems programming languages such as c and c++ would allow programmers to clearly state their ordering requirements and let the compiler translate them optimally for the given platform, potentially avoiding load barriers when compiling for systems other than the alpha. unfortunately, inventing a sound formal definition of the concept of value-dependence that still admits reasonable optimizations in code that may not even be aware of the fact that it's carrying a dependency on an atomic load has proven to be an exceptionally tricky problem.

even now, a full ten years after the introduction of atomics to c and c++, many compilers do not compile the so-called "consume" memory ordering optimally. this problem would be entirely defined away if processors were instead as overtly hostile as the theoretical but not actual memory ordering model of the dec alpha, op
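for the curious, here's a minimal sketch of what the consume/release pairing looks like with c++11 atomics (the `node`/`publish`/`grab` names are made up for illustration):

```cpp
#include <atomic>
#include <cassert>

// illustrative type: its fields are reached only through the loaded pointer,
// so every read of payload is value-dependent on that pointer
struct node { int payload; };

std::atomic<node*> g_node{nullptr};

// writer side: the release store makes the node's contents visible
// to any reader that observes the pointer
void publish(node* n) {
    g_node.store(n, std::memory_order_release);
}

// reader side: consume only promises ordering for loads that carry a data
// dependency on the returned pointer (like p->payload). on a theoretical
// alpha this would still need a barrier; on most other hardware it could in
// principle be a plain load, but compilers today just promote it to acquire
node* grab() {
    return g_node.load(std::memory_order_consume);
}
```

single-threaded use is enough to show the shape: after `publish(&n)`, a reader does `node* p = grab();` and then the dependent load `p->payload` is guaranteed to see the published value.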
|
# ¿ Jan 11, 2022 03:18 |
|
|
# ¿ May 22, 2024 16:45 |
|
these are good posts
|
# ¿ Jan 17, 2022 07:47 |
|
the times just had a clue like “aunt and uncle’s little girl” for “niece”, so
|
# ¿ Jan 18, 2022 17:52 |
|
DuckConference posted:
> i think the pointer authentication stuff on arm uses the whole space

the kernel tells the processor how many bits to use. it also conditionally honors tbi, which allows programmers to use the top 8 bits without faulting
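a sketch of what using those top 8 bits looks like (the helper names are mine, not any real api): with tbi the hardware ignores the top byte on loads and stores, so you can stash a tag there; without tbi you'd have to strip it before dereferencing

```cpp
#include <cstdint>

// top byte of a 64-bit pointer, which aarch64 tbi ignores on access
constexpr int kTagShift = 56;
constexpr std::uintptr_t kTagMask = std::uintptr_t{0xff} << kTagShift;

// pack an 8-bit tag into the top byte of a pointer value
std::uintptr_t tag_pointer(std::uintptr_t p, std::uint8_t tag) {
    return (p & ~kTagMask) | (std::uintptr_t{tag} << kTagShift);
}

// recover the tag from a tagged pointer value
std::uint8_t pointer_tag(std::uintptr_t p) {
    return static_cast<std::uint8_t>(p >> kTagShift);
}

// clear the top byte; required before dereferencing on non-tbi hardware
std::uintptr_t strip_tag(std::uintptr_t p) {
    return p & ~kTagMask;
}
```

the bit-twiddling runs anywhere; actually dereferencing a tagged pointer without stripping it only works where tbi is enabled.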
|
# ¿ Jan 24, 2022 22:36 |
|
as far as the isa goes, itanium is really weird

first, it’s got a poo poo-ton of registers, both integer and floating-point. there are 32 static registers, meant as scratch between calls, and then a window of 96 more registers that you can rotate during calls to save data without spilling to the stack (at least, not directly in the compiler; if you have enough calls to overflow the register window, it still has to spill, of course). that’s almost an unreasonable number of registers. and that’s for each of integer and fp, and then there are a bunch of specialized registers for things like conditions. it’s really a lot

the bigger thing is that itanium wants to run instructions in parallel by default. instructions are 41 bits, and there are three packed into a 128-bit instruction bundle. instructions within a bundle always run in parallel, so if you have dependencies, like if one instruction adds an offset to a pointer and then another loads from that pointer, they need to go in separate bundles. but they can’t just go in separate bundles: you need the second bundle to say specifically that it has a dependency on a previous bundle and so cannot be run in parallel

so it turns out that superscalar architectures are pretty good at running things in parallel already. the itanium approach is better in some ways: superscalar architectures can definitely suffer from false dependencies, especially with memory, and itanium can communicate that the processor doesn’t have to worry about that. but to do that, the compiler also has to know that there isn’t a dependency. for simple data dependencies in registers, this is straightforward. for memory, it usually means doing an alias analysis. this is a lot easier in fortran, which has very limited pointers and very strong default assumptions about aliasing, than it is in (say) c. but also the compiler has to reorder a lot of stuff just to try to fill bundles so that you don’t end up with appallingly bad code density

the problem is that superscalar architectures get most of the potential value here without all the extra pain. and probably itanium would still have benefitted from using standard superscalar techniques to recognize potential for parallelism even when the instruction stream said there might be a dependency; i don’t know how much of that they did

anyway, below the isa level, itanium was positioned as a server / hpc architecture, so its chipsets were designed for beefy systems. in particular, they had a lot of memory bandwidth. so yeah, databases usually aren’t computation-bound, but they can absolutely be memory-bound depending on the workload, and itanium machines were good at that. and they could also be very good at certain hpc workloads that did a million things in parallel over big datasets
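the aliasing problem in c is easy to see in a toy example (mine, not the poster's): a compiler, or an itanium bundle scheduler trying to pack these statements for parallel issue, can't reorder or overlap the two writes unless it proves `a` and `b` never point to the same thing

```cpp
// two accesses through possibly-aliasing pointers: the compiler must keep
// them in program order unless it proves a != b (or the programmer promises
// it, e.g. with restrict). fortran's default rules let it assume no aliasing,
// which is why filling itanium bundles was so much easier there.
void scale_then_bump(int* a, int* b) {
    *a = *a * 2;  // if a == b, this store feeds the next statement's load
    *b = *b + 1;
}
```

called as `scale_then_bump(&x, &x)` with `x == 3` this has to produce 7; a scheduler that wrongly assumed no aliasing and read `*b` before the first store would produce 4 instead.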
|
# ¿ Mar 4, 2022 11:02 |