|
JawnV6 posted:Put the jne on a different page requiring a fault or better yet straddling the canonical boundary. Everyone gets that edge wrong. I imagine those situations will never get fused though, the FE will recognize it needs to issue the fault before the decoder ever sees the pair. Vaguely related: I know of at least three recent CPUs that are broken in various fun ways when variable length instructions cross a page boundary
|
# ? Feb 14, 2017 22:03 |
|
rjmccall posted:who gives a poo poo if someone manages to find a contrived use for one and it has terrible performance? Your competitors' salespeople? Someone started a slapfight in the SHSC AMD thread over AVX performance which I'm sure tens of people might have a valid reason to care about in the expected lifetime of the hardware but it's going to influence purchasing decisions somewhere.
|
# ? Feb 14, 2017 22:17 |
|
omeg posted:I only write low level stuff in C nowadays. I've seen Seriously though, goto has its place. Yes, in structured languages there are alternatives to any goto, but they sometimes require splitting a function where it doesn't otherwise make sense to do so, result in code duplication (usually condition checks), loop break abuse, or things like that. Goto is fine when it makes code concise and enhances its readability. Also, it doesn't really make sense to gratuitously use gotos in modern languages the way people did in the past, so there's not really that much risk of abuse. In old dialects of Basic or Fortran, goto was often used because C-style control structures simply didn't exist, or because the environment didn't allow for free-form editing and if you needed to insert code in the middle of a routine, you often had to use a pair of unconditional gotos to splice it in.
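A minimal sketch of the kind of goto use described above, where the alternative would be duplicated cleanup code or nested conditions. The function name `read_prefix` and its behavior are invented for illustration and are not taken from the thread.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads up to `len` bytes from the start of a file into
   a freshly allocated, NUL-terminated buffer stored in *out.
   Returns 0 on success, -1 on failure (*out is left untouched on failure). */
int read_prefix(const char *path, size_t len, char **out)
{
    int rc = -1;
    char *buf = NULL;
    size_t n;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        goto done;          /* nothing to clean up yet */

    buf = malloc(len + 1);
    if (buf == NULL)
        goto close_file;    /* only the file needs releasing */

    n = fread(buf, 1, len, f);
    if (ferror(f))
        goto free_buf;      /* release the buffer, then the file */

    buf[n] = '\0';
    *out = buf;             /* ownership passes to the caller */
    rc = 0;
    goto close_file;        /* success path skips the free */

free_buf:
    free(buf);
close_file:
    fclose(f);
done:
    return rc;
}
```

Each failure jumps to the label that undoes exactly what has been set up so far, so the cleanup is written once and runs in reverse order of acquisition.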
|
# ? Feb 15, 2017 02:31 |
|
JawnV6 posted:Ok, interrupt deferral makes sense. But you can still fault after the cmp and before the second op though? Put the jne on a different page requiring a fault or better yet straddling the canonical boundary. Everyone gets that edge wrong. I imagine those situations will never get fused though, the FE will recognize it needs to issue the fault before the decoder ever sees the pair. Making the fusing logic very simple and conservative also means that it takes less power, less die area, and is easier to do timing for. It wouldn't be shocking to see rules like "we only fuse if one of the instructions is a MOV and the other instruction is a non-memory-op and both instructions are coming out of the stream cache" still being good enough to see 1% improvement on some benchmark that some client cares about. If you can get a 1% improvement for what might as well be free then you're going to take it, especially if it's the sort of thing that you can potentially tune and improve in future generations to be even better. JawnV6 posted:There's no good way to fuse up a zero-length call though. The STA/STD pair required for the call's implicit push getting fused and not offering an instruction boundary between them, if the implicit stack location had to be paged in just to be marked dirty, etc. Too much going on there.
|
# ? Feb 15, 2017 03:31 |
|
Munkeymon posted:Your competitors' salespeople? Nah. Vector unit performance matters for general-purpose hardware because sometimes people want to do big things that can be meaningfully vectorized, and the people who wrote that code probably cared enough to consider using AVX even if the vast majority of programmers will never have to. If nothing else, some of that code is in SPEC. In contrast, making a true call to the next instruction is not a real thing, just like repeatedly loading words from absolute address -1 is not a real thing.
|
# ? Feb 15, 2017 05:11 |
|
It could be a code size minimizing optimization for a function that ends for _ = 1 to 2 { ... } while under register pressure?
|
# ? Feb 15, 2017 05:21 |
|
sarehu posted:It could be a code size minimizing optimization for a function that ends for _ = 1 to 2 { ... } while under register pressure? If you're under register pressure, you're almost certainly using callee-save registers in the loop, which will break the pattern unless you do another local call first. But yes, that's cute. If your function ends in a loop with a power-of-two trip count, and you're on an x86-like platform that pushes the PC on call, and the loop doesn't need any callee-save registers or the stack pointer, and there's no reason you need to keep the stack aligned, then you can emit the loop using chained calls and rets and all of your branches will be perfectly predicted. Hmm, if you're willing to do another interior call as set-up, you can not only lift the CSR restriction but also do this at an arbitrary position in the function. So the only real restriction is that you need to not use the stack pointer directly. And of course you can do this on an LR architecture by just pushing and popping the PC yourself in your sub-function. rjmccall fucked around with this message at 08:39 on Feb 15, 2017 |
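To see why chained zero-length calls give a power-of-two trip count, here is a rough C simulation rather than real x86. It models a tiny program of `depth` "call next instruction" ops followed by the loop body and a single `ret`. The op layout and the name `run_chained_calls` are assumptions made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated program layout (pc values):
     0 .. depth-1 : call to the very next instruction (pushes pc + 1)
     depth        : the loop body
     depth + 1    : ret (pops a return address and jumps to it)
   Returns how many times the body executed before control went back
   to the simulated caller. */
unsigned run_chained_calls(unsigned depth)
{
    enum { MAX_DEPTH = 16, CALLER = -1 };
    int stack[MAX_DEPTH + 1];
    size_t sp = 0;
    unsigned body_runs = 0;
    int body_pc;
    int pc = 0;

    assert(depth <= MAX_DEPTH);
    body_pc = (int)depth;
    stack[sp++] = CALLER;            /* return address pushed by the real caller */

    while (pc != CALLER) {
        if (pc < body_pc) {          /* zero-length call: push next pc, fall through */
            stack[sp++] = pc + 1;
            pc = pc + 1;
        } else if (pc == body_pc) {  /* loop body */
            body_runs++;
            pc++;
        } else {                     /* ret: resume at whatever was pushed last */
            pc = stack[--sp];
        }
    }
    return body_runs;
}
```

With this model, `run_chained_calls(1)` runs the body twice and `run_chained_calls(2)` four times: each extra call doubles the count, because everything after it executes once on fall-through and once more when the `ret` lands there.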
# ? Feb 15, 2017 08:33 |
|
JawnV6 posted:rjmccall alluded to it, for position-independent code like libraries sometimes you need to grab the PC to figure out where "here" is so you can jump "there." I mean I wrote one of those and I can't remember for the life of me how I figured that offset. quote:Also memory is changed, if the chip got too clever and tried to skip that bit it would cause other issues. dougdrums fucked around with this message at 12:45 on Feb 15, 2017 |
# ? Feb 15, 2017 12:41 |
|
You don't usually need a PIC base for jumps and calls, because those instructions can take relative offsets on every architecture I've ever seen. If you don't even have a relative offset, e.g. because the symbol resolves outside of the current image, the linker will use the offset of a stub that materializes the address in some other way — typically the stub loads from a global that the loader initializes, but on some targets the loader just directly rewrites the stub. A similar technique gets used when the instruction encoding doesn't allow a big enough immediate offset to reach an arbitrary place in the image, which is common with fixed-width instruction encodings (e.g. ARM64 allows an offset of ±128MB, but images can reasonably get bigger than that): the compiler optimistically uses an instruction that uses a relative offset, and if the linker can't make that work, it just makes the instruction go to a stub. The place where you need a PIC base is when you need the true address of some global and the target doesn't directly support PC-relative addressing; that means just i386 these days. In that case, you explicitly compute your PIC base and then add a relative offset to that, assuming you have one. In theory, on a target like i386 where MOV and LEA can take a 32-bit immediate absolute address, you could have the compiler just emit that instruction and tell the linker to fill it in during load. Unfortunately, that would be a disaster for launch times and general memory performance, because accessing global memory is common enough that it would dirty almost every page in the text segment. Rewriting a bunch of stubs doesn't have that problem because they're densely packed together, so you're only dirtying a page or two. rjmccall fucked around with this message at 17:13 on Feb 15, 2017 |
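As a sketch of the "use a relative branch if it reaches, otherwise go through a stub" decision described above: ARM64 `B`/`BL` encode a signed 26-bit offset counted in 4-byte words, which is where the ±128MB figure comes from. The function names and the single-stub model here are invented for illustration and are not any real linker's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if a direct ARM64 B/BL at address `from` can reach `to`.
   The byte offset must be a multiple of 4 in [-2^27, 2^27 - 4]. */
bool arm64_branch_reaches(uint64_t from, uint64_t to)
{
    const int64_t limit = (int64_t)1 << 27;   /* 128MB */
    int64_t delta = (int64_t)(to - from);

    assert((from & 3) == 0);                  /* instructions are 4-byte aligned */
    if (((to - from) & 3) != 0)
        return false;
    return delta >= -limit && delta <= limit - 4;
}

/* Hypothetical linker choice: branch straight to the destination when it is
   in range, otherwise branch to a nearby stub that materializes the address. */
uint64_t pick_branch_target(uint64_t from, uint64_t to, uint64_t stub)
{
    return arm64_branch_reaches(from, to) ? to : stub;
}
```

A real linker also has to place the stub itself within range of the branch, which this sketch does not model.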
# ? Feb 15, 2017 17:11 |
|
My register pressure assumption was because with a register free you could use 5 bytes this way instead: code:
|
# ? Feb 15, 2017 19:06 |
|
sarehu posted:My register pressure assumption was because with a register free you could use 5 bytes this way instead: call rel16 is 3 bytes and has the advantage of allowing all branches to be perfectly predicted. EDIT: Oh, four bytes, I guess, because you'd need to put an operand-size override prefix on it. I think that's one of those tricks that you only do in code-size-at-all-costs mode. rjmccall fucked around with this message at 20:39 on Feb 15, 2017 |
# ? Feb 15, 2017 20:32 |
|
ExcessBLarg! posted:Also, it doesn't really make sense to gratuitously use gotos in modern languages the way people did in the past, so there's not really that much risk of abuse. In old dialects of Basic or Fortran, goto was often used because C-style control structures simply didn't exist, or because the environment didn't allow for free-form editing and if you needed to insert code in the middle of a routine, you often had to use a pair of unconditional gotos to splice it in. I think this was pointed out earlier in this very thread but Knuth's original "goto considered harmful" was written to admonish a bunch of prehistoric programmers to start using such new-fangled inventions as "if", "while", and "for". While there are very few legit uses for goto these days there's also barely anybody left who would even consider goto a viable control structure. It's just not a thing to care about anymore.
|
# ? Feb 15, 2017 21:36 |
Dr Monkeysee posted:I think this was pointed out earlier in this very thread but Knuth's original "goto considered harmful" That was Dijkstra.
|
|
# ? Feb 15, 2017 21:39 |
|
Oops. I couldn't remember which one wrote it and my quick googling led me astray.
|
# ? Feb 15, 2017 21:41 |
|
rjmccall posted:call rel16 is 3 bytes and has the advantage of allowing all branches to be perfectly predicted. Oooh. Edit: You'll still have F'd up prediction unless you do 66 e8 01 00 90 though (jump 1 byte ahead, over a nop). (Unfortunately jumping backward 1 or 2 bytes can't be made to help.) sarehu fucked around with this message at 22:42 on Feb 15, 2017 |
# ? Feb 15, 2017 22:11 |
|
ShoulderDaemon posted:Making the fusing logic very simple and conservative also means that it takes less power, less die area, and is easier to do timing for. It wouldn't be shocking to see rules like "we only fuse if one of the instructions is a MOV and the other instruction is a non-memory-op and both instructions are coming out of the stream cache" still being good enough to see 1% improvement on some benchmark that some client cares about. If you can get a 1% improvement for what might as well be free then you're going to take it, especially if it's the sort of thing that you can potentially tune and improve in future generations to be even better. Back in my day I filed a sub-1% perf bug on recursive traces ShoulderDaemon posted:Off the top of my head, I can't think of any way that you'd be able to win by fusing a CALL/POP pair outside of something esoteric like binary translation where you're dynamically recompiling the program stream in large blocks. rjmccall posted:You don't usually need a PIC base for jumps and calls, because those instructions can take relative offsets on every architecture I've ever seen. sarehu posted:Oooh.
|
# ? Feb 15, 2017 23:16 |
|
JawnV6 posted:This is where it's sorta obvious my info is 5+ years out of date, I didn't know MOV was viable for any fusing. Register renaming makes some instructions quite light, but I thought that mechanism was separate from fusing. Stream cache also implies all goofy page/memory boundaries aren't going to be relevant. JawnV6 posted:Hmph. Binary translation shouldn't be that esoteric. C'mon, get things in gear over there.
|
# ? Feb 16, 2017 00:54 |
|
JawnV6 posted:Don't be afraid to jump into the middle of instructions. EB FF will jump to the FF, which can be a DEC. If you try that, 66 e8 fe ff doesn't work and 66 e8 ff ff __ will have to burn a byte anyway. And there's not really any practical choice for that byte.
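The byte sequences being traded here are easier to follow with the arithmetic written out. x86 relative branches are measured from the end of the branch instruction, so the target is address + length + displacement. A small sketch, with helper names made up for this example:

```c
#include <stdint.h>

/* Target of a relative jmp/call: the displacement is added to the address
   of the NEXT instruction, i.e. insn_addr + insn_len. `rel` is the
   already sign-extended displacement. */
uint32_t branch_target(uint32_t insn_addr, uint32_t insn_len, int32_t rel)
{
    return insn_addr + insn_len + (uint32_t)rel;
}

/* Same, for a branch with a 16-bit operand size (the 66 prefix in 32-bit
   code). As I understand the architecture manual, the resulting EIP is
   also truncated to 16 bits, hence the mask. */
uint32_t branch_target16(uint32_t insn_addr, uint32_t insn_len, int32_t rel)
{
    return branch_target(insn_addr, insn_len, rel) & 0xFFFFu;
}
```

Placing each sequence at address 0: `EB FF` (2 bytes, rel8 = -1) targets offset 1, its own `FF` byte. `66 E8 01 00` (4 bytes, rel16 = +1) targets offset 5, one past the `90` at offset 4. `66 E8 FE FF` (rel16 = -2) targets offset 2, the `FE` inside the call itself, and `66 E8 FF FF` (rel16 = -1) targets offset 3, the final `FF`, which then needs a following byte to complete an instruction.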
|
# ? Feb 16, 2017 01:14 |
|
JawnV6 posted:Don't be afraid to jump into the middle of instructions. EB FF will jump to the FF, which can be a DEC. Is your name Mel?
|
# ? Feb 16, 2017 01:54 |
|
sarehu posted:Oooh. Why? The forward branches are unconditional and immediately resolvable and the backward branches are returns. The return address predictor isn't, like, keyed by anything.
|
# ? Feb 16, 2017 02:16 |
|
rjmccall posted:Why? The forward branches are unconditional and immediately resolvable and the backward branches are returns. The return address predictor isn't, like, keyed by anything. The return address predictor would ignore the zero length call, right? And then, an unbalanced ret. My comment was under that assumption.
|
# ? Feb 16, 2017 03:34 |
|
Ah. This entire digression was about whether it was reasonable for a processor to implement that special case by just assuming that calls with zero offset never happened in real code, so no, I was analyzing under the hypothesis that a processor wouldn't want to do that. Obviously, if the processor takes special care to make an instruction less efficient, you should not use that instruction.
|
# ? Feb 16, 2017 05:22 |
|
It's possible to get the same code to run twice with different values after a 0 length call, but there's certainly easier ways to do it with 7 bytes. code:
dec [esp] makes the opcode a 00, which is fairly useless targeting memory indexed by eax. Other manipulations of [esp] might prove useful, but most of the time the modrm byte ends up being 0x24 and still targeting [esp] on the second time through, which will muck up the real return address. inc/add al thankfully leave EIP in the same place which greatly simplifies the reasoning.
|
# ? Feb 17, 2017 04:24 |
|
It's been ages since I touched asm in college so I have no clue what the hell you all are talking about but I'm pretty sure it still counts as a horror given that you're talking about jumping into the middle of instructions.
|
# ? Feb 17, 2017 05:18 |
|
vOv posted:It's been ages since I touched asm in college so I have no clue what the hell you all are talking about but I'm pretty sure it still counts as a horror given that you're talking about jumping into the middle of instructions. Don't kinkshame.
|
# ? Feb 17, 2017 05:20 |
|
vOv posted:It's been ages since I touched asm in college so I have no clue what the hell you all are talking about but I'm pretty sure it still counts as a horror given that you're talking about jumping into the middle of instructions. Having only ever touched assembly on tiny embedded platforms (avr, arm thumb ) every new fact I learn about modern x86 is more horrifying than the last.
|
# ? Feb 17, 2017 05:24 |
|
Hey now, x86 is a reasonably decent bytecode format. Not as good as Java or .NET CLI, but they have 40 years of legacy to deal with.
|
# ? Feb 17, 2017 05:55 |
|
vOv posted:It's been ages since I touched asm in college so I have no clue what the hell you all are talking about but I'm pretty sure it still counts as a horror given that you're talking about jumping into the middle of instructions. The real horror is that Jawn is now talking about modifying the instruction stream.
|
# ? Feb 17, 2017 07:02 |
|
rjmccall posted:The real horror is that Jawn is now talking about modifying the instruction stream. Don't cross the streams!
|
# ? Feb 17, 2017 07:03 |
|
I don't think the Guardian's web people have discovered the wonders of version control: https://guardiannewsampampmedia.formstack.com/forms/js.php/untitled_form_19_copy_5_copy_2_copy_copy_copy_1_copy_2_copy_1_copy_1_copy_1_copy_1_copy_1_copy_copy_1_copy_1_copy_1_copy_copy_copy_copy_1_copy_4_copy_copy_copy_copy
|
# ? Feb 17, 2017 09:20 |
|
rjmccall posted:The real horror is that Jawn is now talking about modifying the instruction stream. code:
|
# ? Feb 17, 2017 17:49 |
|
Sometimes it's the little things: C# code:
|
# ? Feb 17, 2017 22:47 |
|
JawnV6 posted:It's not SMC. The write is hitting the return IP stored on the stack. Oh right, of course. ...although I do have to note that "add al, 24" does not compute eax+24.
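For anyone skimming, the reason "add al, 24" is not the same as eax+24: AL is only the low 8 bits of EAX, and a carry out of bit 7 goes to the carry flag rather than into the upper bytes. A small C model, with the function name `add_al` invented for this sketch:

```c
#include <stdint.h>

/* Models the effect of `add al, imm8` on the full 32-bit EAX value:
   only the low byte changes, and any carry out of it is dropped
   (on real hardware it lands in CF, not in AH). */
uint32_t add_al(uint32_t eax, uint8_t imm)
{
    uint8_t al = (uint8_t)(eax & 0xFFu);
    al = (uint8_t)(al + imm);
    return (eax & 0xFFFFFF00u) | al;
}
```

So `add_al(0x10, 24)` gives 0x28, matching eax+24, while `add_al(0xF0, 24)` gives 0x08 rather than 0x108. The two agree only when the low byte does not overflow.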
|
# ? Feb 17, 2017 22:56 |
|
redleader posted:Sometimes it's the little things: LOL, I've seen people abuse foreach/for (x : S) with an additional running index, but this one doesn't even refer to thing.
|
# ? Feb 18, 2017 01:51 |
|
Qwertycoatl posted:I don't think the Guardian's web people have discovered the wonders of version control: https://guardiannewsampampmedia.formstack.com/forms/js.php/untitled_form_19_copy_5_copy_2_copy_copy_copy_1_copy_2_copy_1_copy_1_copy_1_copy_1_copy_1_copy_copy_1_copy_1_copy_1_copy_copy_copy_copy_1_copy_4_copy_copy_copy_copy I'm not hip on all the new webdev microservices implementations, but what the hell does this file do that an HTML file couldn't?
|
# ? Feb 18, 2017 02:51 |
|
None of that is hip if it helps
|
# ? Feb 18, 2017 03:16 |
|
I'm js.php
|
# ? Feb 18, 2017 12:13 |
|
rjmccall posted:Oh right, of course. It's x86 so it's true ~86% of the time.
|
# ? Feb 19, 2017 07:22 |
|
JawnV6 posted:It's x86 so it's true ~86% of the time.
|
# ? Feb 19, 2017 18:43 |
|
The above just made me smile after I made a funny typo while using dotnet. I actually typed in dotnot and had to laugh when the program complained and I actually realised my mistake. You know: dot NOT. HAH. Okay I'm sorry.
|
# ? Feb 20, 2017 15:43 |