b0lt
Apr 29, 2005

JawnV6 posted:

Put the jne on a different page requiring a fault or better yet straddling the canonical boundary. Everyone gets that edge wrong. I imagine those situations will never get fused though, the FE will recognize it needs to issue the fault before the decoder ever sees the pair.

Vaguely related: I know of at least three recent CPUs that are broken in various fun ways when variable length instructions cross a page boundary

Munkeymon
Aug 14, 2003

Motherfucker's got an
armor-piercing crowbar! Rigoddamndicu𝜆ous.



rjmccall posted:

who gives a poo poo if someone manages to find a contrived use for one and it has terrible performance?

Your competitors' salespeople?

Someone started a slapfight in the SHSC AMD thread over AVX performance which I'm sure tens of people might have a valid reason to care about in the expected lifetime of the hardware but it's going to influence purchasing decisions somewhere.

ExcessBLarg!
Sep 1, 2001

omeg posted:

I only write low level stuff in C nowadays. I've seen
code:
do {
   ...
   if (error) break;
} while (0); 
used for error handling. Thoughts? :v:
I think that's Pascal's version of goto, and it's terrible. Also if the loop doesn't return from the function then you have to re-check the error condition following the loop anyways.

Seriously though, goto has its place. Yes, in structured languages there's alternatives to any goto, but they sometimes require splitting a function where it doesn't otherwise make sense to do so, results in code duplication (usually condition checks), loop break abuse, or things like that. Goto is fine when it makes code concise and enhances its readability.

Also, it doesn't really make sense to gratuitously use gotos in modern languages the way people did in the past, so there's not really that much risk of abuse. In old dialects of Basic or Fortran, goto was often used because C-style control structures simply didn't exist, or because the environment didn't allow for free-form editing and if you needed to insert code in the middle of a routine, you often had to use a pair of unconditional gotos to splice it in.
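
For what it's worth, here is a minimal sketch of the goto-cleanup shape in C. The function name read_first_line and its parameters are made up for illustration. The only point is that every failure path jumps to one label, so nothing has to be re-checked after a fake loop.

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical example: copy the first line of `path` into `out`.
   Returns 0 on success, -1 on any failure. */
int read_first_line(const char *path, char *out, size_t cap)
{
    int rc = -1;            /* assume failure until every step succeeds */
    FILE *f = NULL;
    char *scratch = NULL;

    if (cap == 0 || cap > INT_MAX)
        goto done;

    f = fopen(path, "r");
    if (f == NULL)
        goto done;

    scratch = malloc(cap);
    if (scratch == NULL)
        goto done;

    if (fgets(scratch, (int)cap, f) == NULL)
        goto done;

    /* fgets NUL-terminates within cap bytes, so this copy fits in out. */
    memcpy(out, scratch, strlen(scratch) + 1);
    rc = 0;

done:
    /* Single exit path: free(NULL) is a no-op, fclose(NULL) is not. */
    free(scratch);
    if (f != NULL)
        fclose(f);
    return rc;
}
```

The do { ... } while (0) version would need each goto replaced by a break, and the caller-visible result still has to be carried out of the block in a variable like rc.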

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

JawnV6 posted:

Ok, interrupt deferral makes sense. But you can still fault after the cmp and before the second op though? Put the jne on a different page requiring a fault or better yet straddling the canonical boundary. Everyone gets that edge wrong. I imagine those situations will never get fused though, the FE will recognize it needs to issue the fault before the decoder ever sees the pair.
Yeah, you're probably going to fuse only when you have actual instruction bytes available, or something equivalent (like you're getting fused ops from a stream cache). If you have to stall to fetch bytes for instruction 2, you're going to issue instruction 1 immediately instead of waiting to fuse it. Fusion is an opportunistic optimization that helps, but isn't essential; in practice you can get away with a conservative approach of only fusing when it's easy to prove safety and still see enough reasonable performance gains to justify the effort.

Making the fusing logic very simple and conservative also means that it takes less power, less die area, and is easier to do timing for. It wouldn't be shocking to see rules like "we only fuse if one of the instructions is a MOV and the other instruction is a non-memory-op and both instructions are coming out of the stream cache" still being good enough to see 1% improvement on some benchmark that some client cares about. If you can get a 1% improvement for what might as well be free then you're going to take it, especially if it's the sort of thing that you can potentially tune and improve in future generations to be even better.

JawnV6 posted:

There's no good way to fuse up a zero-length call though. The STA/STD pair required for the call's implicit push getting fused and not offering an instruction boundary between them, if the implicit stack location had to be paged in just to be marked dirty, etc. Too much going on there.
Off the top of my head, I can't think of any way that you'd be able to win by fusing a CALL/POP pair outside of something esoteric like binary translation where you're dynamically recompiling the program stream in large blocks. You're just going to special case CALL 0 in your call/return predictor, and otherwise not do anything special. The POP immediately following the CALL will get its result out of whatever store-forwarding or memory-renaming structures you have, and it should be no more or less performant than any other random top-of-stack manipulation. This is one of those things that is certainly kind of weird, but it's similar enough to all the other weird crap that you have to deal with all the time that your existing microarchitecture should handle it fine.

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

Munkeymon posted:

Your competitors' salespeople?

Someone started a slapfight in the SHSC AMD thread over AVX performance which I'm sure tens of people might have a valid reason to care about in the expected lifetime of the hardware but it's going to influence purchasing decisions somewhere.

Nah. Vector unit performance matters for general-purpose hardware because sometimes people want to do big things that can be meaningfully vectorized, and the people who wrote that code probably cared enough to consider using AVX even if the vast majority of programmers will never have to. If nothing else, some of that code is in SPEC. In contrast, making a true call to the next instruction is not a real thing, just like repeatedly loading words from absolute address -1 is not a real thing.

sarehu
Apr 20, 2007

(call/cc call/cc)
It could be a code size minimizing optimization for a function that ends for _ = 1 to 2 { ... } while under register pressure?

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

sarehu posted:

It could be a code size minimizing optimization for a function that ends for _ = 1 to 2 { ... } while under register pressure?

If you're under register pressure, you're almost certainly using callee-save registers in the loop, which will break the pattern unless you do another local call first. But yes, that's cute. If your function ends in a loop with a power-of-two trip count, and you're on an x86-like platform that pushes the PC on call, and the loop doesn't need any callee-save registers or the stack pointer, and there's no reason you need to keep the stack aligned, then you can emit the loop using chained calls and rets and all of your branches will be perfectly predicted.

Hmm, if you're willing to do another interior call as set-up, you can not only lift the CSR restriction but also do this at an arbitrary position in the function. So the only real restriction is that you need to not use the stack pointer directly. And of course you can do this on an LR architecture by just pushing and popping the PC yourself in your sub-function.
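
To make the trip count concrete, here is a toy model in C, not real x86. The three opcodes and the function name are invented for the sketch. It lays out n zero-length calls followed by the loop body and a ret, seeds the stack with the real caller's return address, and counts how many times the body runs.

```c
#include <stddef.h>

/* Invented opcodes for a toy machine; this is not an x86 emulator. */
enum op { OP_CALL_NEXT, OP_BODY, OP_RET };

/* Program layout: n_calls x CALL_NEXT, then BODY, then RET.
   Returns how many times BODY executes before control returns
   to the (simulated) caller. Returns 0 if n_calls is too large. */
unsigned chained_call_trip_count(unsigned n_calls)
{
    enum { MAX_CALLS = 16 };
    enum op program[MAX_CALLS + 2];
    int stack[MAX_CALLS + 1];
    size_t sp = 0;
    unsigned body_runs = 0;
    unsigned i;
    int pc = 0;

    if (n_calls > MAX_CALLS)
        return 0;

    for (i = 0; i < n_calls; i++)
        program[i] = OP_CALL_NEXT;
    program[n_calls] = OP_BODY;
    program[n_calls + 1] = OP_RET;

    stack[sp++] = -1;               /* the real caller's return address */

    while (pc >= 0) {
        switch (program[pc]) {
        case OP_CALL_NEXT:          /* push the next pc, then go there */
            stack[sp++] = pc + 1;
            pc = pc + 1;
            break;
        case OP_BODY:
            body_runs++;
            pc++;
            break;
        case OP_RET:                /* pop whatever is on top */
            pc = stack[--sp];
            break;
        }
    }
    return body_runs;
}
```

Each extra call doubles the count, so n calls give 2^n trips, which is where the power-of-two restriction comes from.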

rjmccall fucked around with this message at 08:39 on Feb 15, 2017

dougdrums
Feb 25, 2005
CLIENT REQUESTED ELECTRONIC FUNDING RECEIPT (FUNDS NOW)

JawnV6 posted:

rjmccall alluded to it, for position-independent code like libraries sometimes you need to grab the PC to figure out where "here" is so you can jump "there."
That makes sense, I'm sure I've done that before and not taken notice. I was hoping there was some hot mess that did it as a matter of operation. If I had some sort of dynamic compiler supervisor, it would only need to figure out where it is once to offset to code pages. So the only reason to do it repeatedly is if you have some code that loads some code then that code loads even more code ... maybe that's a real thing too though ...

I mean I wrote one of those and I can't remember for the life of me how I figured that offset.

quote:

Also memory is changed, if the chip got too clever and tried to skip that bit it would cause other issues.
I thought this might be the case, that you could just ignore writing it, but I see how you can't really guarantee that it won't be read again.

dougdrums fucked around with this message at 12:45 on Feb 15, 2017

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
You don't usually need a PIC base for jumps and calls, because those instructions can take relative offsets on every architecture I've ever seen. If you don't even have a relative offset, e.g. because the symbol resolves outside of the current image, the linker will use the offset of a stub that materializes the address in some other way — typically the stub loads from a global that the loader initializes, but on some targets the loader just directly rewrites the stub. A similar technique gets used when the instruction encoding doesn't allow a big enough immediate offset to reach an arbitrary place in the image, which is common with fixed-width instruction encodings (e.g. ARM64 allows an offset of ±128MB, but images can reasonably get bigger than that): the compiler optimistically uses an instruction that uses a relative offset, and if the linker can't make that work, it just makes the instruction go to a stub.
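
As a rough sketch of the "can the linker make that work" check for the ARM64 case: B and BL carry a 26-bit signed immediate scaled by 4, which is where ±128MB comes from. The function name below is made up, and a real linker does a lot more than this.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: can a direct B/BL placed at `from` reach `to`?
   The encodable byte offset is a multiple of 4 in [-2^27, 2^27 - 4]. */
bool a64_branch_reaches(uint64_t from, uint64_t to)
{
    const int64_t limit = INT64_C(1) << 27;     /* 128 MiB */
    uint64_t udelta = to - from;                /* wraps modulo 2^64 */
    int64_t delta;

    if ((udelta & 3u) != 0)
        return false;                           /* offsets are word-scaled */

    /* Reinterpret as signed; two's complement on every mainstream target. */
    delta = (int64_t)udelta;
    return delta >= -limit && delta < limit;
}
```

When this returns false, the branch would have to be pointed at a nearby stub instead.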

The place where you need a PIC base is when you need the true address of some global and the target doesn't directly support PC-relative addressing; that means just i386 these days. In that case, you explicitly compute your PIC base and then add a relative offset to that, assuming you have one. In theory, on a target like i386 where MOV and LEA can take a 32-bit immediate absolute address, you could have the compiler just emit that instruction and tell the linker to fill it in during load. Unfortunately, that would be a disaster for launch times and general memory performance, because accessing global memory is common enough that it would dirty almost every page in the text segment. Rewriting a bunch of stubs doesn't have that problem because they're densely packed together, so you're only dirtying a page or two.

rjmccall fucked around with this message at 17:13 on Feb 15, 2017

sarehu
Apr 20, 2007

(call/cc call/cc)
My register pressure assumption was because with a register free you could use 5 bytes this way instead:

code:
mov al, 2  ; 2 bytes
top:
...
dec eax  ; 1 byte
jpo top  ; 2 bytes; PF only looks at al: al=1 is odd parity so loop again, al=0 is even so fall through
I wasn't thinking about callee-save registers, but without using them, you still save a register. So... maybe it's "caller-save register pressure."

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

sarehu posted:

My register pressure assumption was because with a register free you could use 5 bytes this way instead:

code:
mov al, 2  ; 2 bytes
top:
...
dec eax  ; 1 byte
jpo top  ; 2 bytes; PF only looks at al: al=1 is odd parity so loop again, al=0 is even so fall through
I wasn't thinking about callee-save registers, but without using them, you still save a register. So... maybe it's "caller-save register pressure."

call rel16 is 3 bytes and has the advantage of allowing all branches to be perfectly predicted.

EDIT: Oh, four bytes, I guess, because you'd need to put an operand-size override prefix on it. I think that's one of those tricks that you only do in code-size-at-all-costs mode.

rjmccall fucked around with this message at 20:39 on Feb 15, 2017

Dr Monkeysee
Oct 11, 2002

just a fox like a hundred thousand others
Nap Ghost

ExcessBLarg! posted:

Also, it doesn't really make sense to gratuitously use gotos in modern languages the way people did in the past, so there's not really that much risk of abuse. In old dialects of Basic or Fortran, goto was often used because C-style control structures simply didn't exist, or because the environment didn't allow for free-form editing and if you needed to insert code in the middle of a routine, you often had to use a pair of unconditional gotos to splice it in.

I think this was pointed out earlier in this very thread but Knuth's original "goto considered harmful" was written to admonish a bunch of prehistoric programmers to start using such new-fangled inventions as "if", "while", and "for". While there's very few legit uses for goto these days there's also barely anybody left who would even consider goto a viable control structure.

It's just not a thing to care about anymore.

nielsm
Jun 1, 2009



Dr Monkeysee posted:

I think this was pointed out earlier in this very thread but Knuth's original "goto considered harmful"

That was Dijkstra.

Dr Monkeysee
Oct 11, 2002

just a fox like a hundred thousand others
Nap Ghost
Oops. I couldn't remember which one wrote it and my quick googling led me astray.

sarehu
Apr 20, 2007

(call/cc call/cc)

rjmccall posted:

call rel16 is 3 bytes and has the advantage of allowing all branches to be perfectly predicted.

EDIT: Oh, four bytes, I guess, because you'd need to put an operand-size override prefix on it. I think that's one of those tricks that you only do in code-size-at-all-costs mode.

Oooh.

Edit: You'll still have F'd up prediction unless you do 66 e8 01 00 90 though (jump 1 byte ahead, over a nop).

(Unfortunately jumping backward 1 or 2 bytes can't be made to help.)

sarehu fucked around with this message at 22:42 on Feb 15, 2017

JawnV6
Jul 4, 2004

So hot ...

ShoulderDaemon posted:

Making the fusing logic very simple and conservative also means that it takes less power, less die area, and is easier to do timing for. It wouldn't be shocking to see rules like "we only fuse if one of the instructions is a MOV and the other instruction is a non-memory-op and both instructions are coming out of the stream cache" still being good enough to see 1% improvement on some benchmark that some client cares about. If you can get a 1% improvement for what might as well be free then you're going to take it, especially if it's the sort of thing that you can potentially tune and improve in future generations to be even better.
This is where it's sorta obvious my info is 5+ years out of date, I didn't know MOV was viable for any fusing. Register renaming makes some instructions quite light, but I thought that mechanism was separate from fusing. Stream cache also implies all goofy page/memory boundaries aren't going to be relevant.

Back in my day I filed a sub-1% perf bug on recursive traces :v:

ShoulderDaemon posted:

Off the top of my head, I can't think of any way that you'd be able to win by fusing a CALL/POP pair outside of something esoteric like binary translation where you're dynamically recompiling the program stream in large blocks.
Hmph. Binary translation shouldn't be that esoteric. C'mon, get things in gear over there.

rjmccall posted:

You don't usually need a PIC base for jumps and calls, because those instructions can take relative offsets on every architecture I've ever seen.
...
The place where you need a PIC base is when you need the true address of some global and the target doesn't directly support PC-relative addressing
Thanks for the correction & detail. I'm really, truly glad to have left x86 (mostly) behind.

sarehu posted:

Oooh.

Edit: You'll still have F'd up prediction unless you do 66 e8 01 00 90 though (jump 1 byte ahead, over a nop).

(Unfortunately jumping backward 1 or 2 bytes can't be made to help.)
Don't be afraid to jump into the middle of instructions. EB FF will jump to the FF, which can be a DEC.

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

JawnV6 posted:

This is where it's sorta obvious my info is 5+ years out of date, I didn't know MOV was viable for any fusing. Register renaming makes some instructions quite light, but I thought that mechanism was separate from fusing. Stream cache also implies all goofy page/memory boundaries aren't going to be relevant.

JawnV6 posted:

Hmph. Binary translation shouldn't be that esoteric. C'mon, get things in gear over there.
Please interpret my examples as demonstrative of a general idea and not as "here is what current Core microarchitecture actually does". I can't possibly disclose actual fusion rules from our microarchitectures. Similarly, please understand "esoteric" to be within the context of "weirder than what most people think of as a normal CPU" and not "weirder than what Core microarchitectures may or may not genuinely do".

sarehu
Apr 20, 2007

(call/cc call/cc)

JawnV6 posted:

Don't be afraid to jump into the middle of instructions. EB FF will jump to the FF, which can be a DEC.

If you try that, 66 e8 fe ff doesn't work and 66 e8 ff ff __ will have to burn a byte anyway. And there's not really any practical choice for that byte.

beuges
Jul 4, 2005
fluffy bunny butterfly broomstick

JawnV6 posted:

Don't be afraid to jump into the middle of instructions. EB FF will jump to the FF, which can be a DEC.

Is your name Mel?

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

sarehu posted:

Oooh.

Edit: You'll still have F'd up prediction unless you do 66 e8 01 00 90 though (jump 1 byte ahead, over a nop).

Why? The forward branches are unconditional and immediately resolvable and the backward branches are returns. The return address predictor isn't, like, keyed by anything.

sarehu
Apr 20, 2007

(call/cc call/cc)

rjmccall posted:

Why? The forward branches are unconditional and immediately resolvable and the backward branches are returns. The return address predictor isn't, like, keyed by anything.

The return address predictor would ignore the zero length call, right? And then, an unbalanced ret. My comment was under that assumption.

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
Ah. This entire digression was about whether it was reasonable for a processor to implement that special case by just assuming that calls with zero offset never happened in real code, so no, I was analyzing under the hypothesis that a processor wouldn't want to do that. Obviously, if the processor takes special care to make an instruction less efficient, you should not use that instruction.

JawnV6
Jul 4, 2004

So hot ...
It's possible to get the same code to run twice with different values after a 0 length call, but there's certainly easier ways to do it with 7 bytes
code:
0:  66 e8 00 00             call   L0
L0:
4:  ff 04 24                inc    DWORD PTR [esp]
On the first pass, the stored EIP is incremented to 5. The ret takes you to the second byte of the inc, where 04 24 decodes as "add al, 24", and then the code runs again with eax+24. The next time the ret is hit it consumes the proper return address.

dec [esp] makes the opcode a 00, which is fairly useless targeting memory indexed by eax. Other manipulations of [esp] might prove useful, but most of the time the modrm byte ends up being 0x24 and still targeting [esp] on the second time through, which will muck up the real return address. inc/add al thankfully leave EIP in the same place which greatly simplifies the reasoning.
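
A toy decoder in C shows the overlap. It recognizes only the two encodings in play here (FF 04 24 and 04 ib) and assumes 32-bit mode. All the names are invented, and anything else comes back as unknown.

```c
#include <stddef.h>
#include <stdint.h>

/* Invented names; this handles exactly two encodings and nothing else. */
enum toy_insn { TOY_UNKNOWN, TOY_INC_DWORD_ESP, TOY_ADD_AL_IMM8 };

struct toy_decoded {
    enum toy_insn kind;
    size_t length;      /* bytes consumed, 0 if unknown */
    uint8_t imm8;       /* only meaningful for TOY_ADD_AL_IMM8 */
};

struct toy_decoded toy_decode(const uint8_t *code, size_t size, size_t offset)
{
    struct toy_decoded d = { TOY_UNKNOWN, 0, 0 };
    const uint8_t *p;
    size_t left;

    if (offset >= size)
        return d;
    p = code + offset;
    left = size - offset;

    /* FF /0 with ModRM 04 (SIB follows) and SIB 24 (base=esp, no index):
       inc dword ptr [esp] */
    if (left >= 3 && p[0] == 0xFF && p[1] == 0x04 && p[2] == 0x24) {
        d.kind = TOY_INC_DWORD_ESP;
        d.length = 3;
        return d;
    }

    /* 04 ib: add al, imm8 */
    if (left >= 2 && p[0] == 0x04) {
        d.kind = TOY_ADD_AL_IMM8;
        d.length = 2;
        d.imm8 = p[1];
        return d;
    }

    return d;
}
```

Decoding the same three bytes at offset 0 and at offset 1 gives two different instructions that end at the same byte, which is why execution rejoins the original stream cleanly.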

vOv
Feb 8, 2014

It's been ages since I touched asm in college so I have no clue what the hell you all are talking about but I'm pretty sure it still counts as a horror given that you're talking about jumping into the middle of instructions.

Absurd Alhazred
Mar 27, 2010

by Athanatos

vOv posted:

It's been ages since I touched asm in college so I have no clue what the hell you all are talking about but I'm pretty sure it still counts as a horror given that you're talking about jumping into the middle of instructions.

Don't kinkshame.

hobbesmaster
Jan 28, 2008

vOv posted:

It's been ages since I touched asm in college so I have no clue what the hell you all are talking about but I'm pretty sure it still counts as a horror given that you're talking about jumping into the middle of instructions.

Having only ever touched assembly on tiny embedded platforms (avr, arm thumb ) every new fact I learn about modern x86 is more horrifying than the last.

pseudorandom name
May 6, 2007

Hey now, x86 is a reasonably decent bytecode format.

Not as good as Java or .NET CLI, but they have 40 years of legacy to deal with.

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

vOv posted:

It's been ages since I touched asm in college so I have no clue what the hell you all are talking about but I'm pretty sure it still counts as a horror given that you're talking about jumping into the middle of instructions.

The real horror is that Jawn is now talking about modifying the instruction stream.

Absurd Alhazred
Mar 27, 2010

by Athanatos

rjmccall posted:

The real horror is that Jawn is now talking about modifying the instruction stream.

Don't cross the streams!

Qwertycoatl
Dec 31, 2008

I don't think the Guardian's web people have discovered the wonders of version control: https://guardiannewsampampmedia.formstack.com/forms/js.php/untitled_form_19_copy_5_copy_2_copy_copy_copy_1_copy_2_copy_1_copy_1_copy_1_copy_1_copy_1_copy_copy_1_copy_1_copy_1_copy_copy_copy_copy_1_copy_4_copy_copy_copy_copy

JawnV6
Jul 4, 2004

So hot ...

rjmccall posted:

The real horror is that Jawn is now talking about modifying the instruction stream.
It's not SMC. The write is hitting the return IP stored on the stack.
code:
v-- inc [esp]
FF 04 24
   ^--- add al, 0x24
Not modifying, just re-indexing the instruction stream, would still pass something like W^X.

redleader
Aug 18, 2005

Engage according to operational parameters
Sometimes it's the little things:

C# code:
int i = 0;
foreach (Thing thing in response.Things) {
       Assert.AreEqual(result[i], response.Things[i].ThingID);
       Assert.AreEqual("foo", response.Thing[i].Type);
       i++;
}

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

JawnV6 posted:

It's not SMC. The write is hitting the return IP stored on the stack.
code:
v-- inc [esp]
FF 04 24
   ^--- add al, 0x24
Not modifying, just re-indexing the instruction stream, would still pass something like W^X.

Oh right, of course.

...although I do have to note that "add al, 24" does not compute eax+24.
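
A small C model of the difference, with a made-up function name: the 04 ib form adds into the low byte only, with no carry into the rest of eax, and the immediate in that byte sequence is 0x24 (36), not decimal 24.

```c
#include <stdint.h>

/* Hypothetical model of "add al, imm8": only bits 0..7 of eax change,
   and the carry out of bit 7 is not propagated into bits 8..31. */
uint32_t add_al_imm8(uint32_t eax, uint8_t imm)
{
    uint8_t al = (uint8_t)(eax & 0xFFu);
    al = (uint8_t)(al + imm);               /* wraps modulo 256 */
    return (eax & 0xFFFFFF00u) | al;
}
```

So with al at 0xF0 the result's low byte wraps to 0x14, where a full 32-bit add would have produced 0x114.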

Absurd Alhazred
Mar 27, 2010

by Athanatos

redleader posted:

Sometimes it's the little things:

C# code:
int i = 0;
foreach (Thing thing in response.Things) {
       Assert.AreEqual(result[i], response.Things[i].ThingID);
       Assert.AreEqual("foo", response.Thing[i].Type);
       i++;
}

LOL, I've seen people abuse foreach/for (x : S) with an additional running index, but this one doesn't even refer to thing. :psyduck:

Dr. Stab
Sep 12, 2010
👨🏻‍⚕️🩺🔪🙀😱🙀

I'm not hip on all the new webdev microservices implementations, but what the hell does this file do that an HTML file couldn't?

necrotic
Aug 2, 2005
I owe my brother big time for this!
None of that is hip if it helps :shrug:

The MUMPSorceress
Jan 6, 2012


^SHTPSTS

Gary’s Answer
I'm js.php

JawnV6
Jul 4, 2004

So hot ...

rjmccall posted:

Oh right, of course.

...although I do have to note that "add al, 24" does not compute eax+24.

It's x86 so it's true ~86% of the time.

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

JawnV6 posted:

It's x86 so it's true ~86% of the time.

:2bong:

Tank Boy Ken
Aug 24, 2012
J4G for life
Fallen Rib

The above just made me smile after I made a funny typo while using dotnet. I actually typed in dotnot and had to laugh when the program complained and I actually realised my mistake. You know: dot NOT. HAH. Okay I'm sorry.
