Bloody
Mar 3, 2013

someone tell me a little bit about cuda vs opencl

i think i will soon have a data problem where they may be applicable (doing idk what with a lot of time series data) and idk anything about them or their differences or w/e

Athas
Aug 6, 2007

fuck that joker

Bloody posted:

someone tell me a little bit about cuda vs opencl

i think i will soon have a data problem where they may be applicable (doing idk what with a lot of time series data) and idk anything about them or their differences or w/e

CUDA is much nicer to use in hand-written code. If you can use something like PyOpenCL, then there is less difference.

CUDA is limited to NVIDIA hardware, while OpenCL can target many kinds of GPUs (including NVIDIA), multicore CPUs, and even exotic things like FPGAs (although I haven't tried that myself).

OpenCL is much nicer as a code generation target.
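
For a feel of what the PyOpenCL route looks like, here is a minimal sketch (assuming pyopencl and numpy are installed; the vector-add kernel is just a placeholder, nothing specific to time series):

code:
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()          # picks whatever OpenCL device is available
queue = cl.CommandQueue(ctx)

a = np.random.rand(1_000_000).astype(np.float32)
b = np.random.rand(1_000_000).astype(np.float32)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, """
__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)

out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_buf)    # copy the result back to the host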

JawnV6
Jul 4, 2004

So hot ...

rjmccall posted:

for me it's the insistence that his idea solves every problem that ever was and ever will be
observation: people don't write overflow handlers
proposed mitigation: invent new representation that lacks HW support and may return a range instead of a discrete value. requires several new handlers, new machinery for common operations, and as a bonus opens up new failure modes

rjmccall posted:

i have no idea why he thinks that forcing the processor to do arbitrary-precision fp arithmetic will "save power" vs. an implementation with known operand sizes
i think it's literally less bits=less power. all the extra control flow to figure out where to mask before compute? trivial compared to the millijoules spent pulling 64 whole bits out of RAM

Brain Candy
May 18, 2006

rjmccall posted:

reading between the lines, i think he is envisaging a scheme where the processor implicitly manages these and can allocate these to the heap, as if implicit dynamic allocation would actually be acceptable in any of the environments that keep getting brought up as gross failures of floating point

i work in simulation and i'd take a 100-fold slowdown in a heartbeat if it increased my chances of a correct result

MeruFM
Jul 27, 2010
you can get the way more expensive video card for those purposes

VikingofRock
Aug 24, 2008




Brain Candy posted:

i work in simulation and i'd take a 100-fold slowdown in a heartbeat if it increased my chances of a correct result

Out of curiosity, what keeps stuff like MPFR (or a rational number implementation) from working in your use case?

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

Brain Candy posted:

i work in simulation and i'd take a 100-fold slowdown in a heartbeat if it increased my chances of a correct result

the reason dynamic memory allocation is forbidden in control systems isn't because it's slow

Bloody
Mar 3, 2013

JawnV6 posted:

observation: people don't write overflow handlers
proposed mitigation: invent new representation that lacks HW support and may return a range instead of a discrete value. requires several new handlers, new machinery for common operations, and as a bonus opens up new failure modes

i think it's literally less bits=less power. all the extra control flow to figure out where to mask before compute? trivial compared to the millijoules spent pulling 64 whole bits out of RAM

if it was millijoules then maybe

(gigabits per second times millijoules per bit = rip u)
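
a sketch of that arithmetic, with a 10 Gbit/s memory interface picked out of thin air for scale:

code:
# if each bit really cost millijoules, a 10 Gbit/s memory interface would dissipate:
bits_per_second = 10e9
joules_per_bit = 1e-3                      # "millijoules per bit"
print(bits_per_second * joules_per_bit)    # 1e7 W, i.e. 10 megawatts
# real DRAM access energy is ballpark tens of picojoules per bit, ~9 orders of magnitude less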

Bloody
Mar 3, 2013

Athas posted:

CUDA is much nicer to use in hand-written code. If you can use something like PyOpenCL, then there is less difference.

CUDA is limited to NVIDIA hardware, while OpenCL can target many kinds of GPUs (including NVIDIA), multicore CPUs, and even exotic things like FPGAs (although I haven't tried that myself).

OpenCL is much nicer as a code generation target.

cool thanks I'll look in the opencl direction.

iirc altera has the opencl to fpga flow

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





use up/down instead of master/slave imo, or top/bottom if you're at cambridge

Brain Candy
May 18, 2006

rjmccall posted:

the reason dynamic memory allocation is forbidden in control systems isn't because it's slow

fair, i didn't explain

the fake control systems i use have the same algorithms as the real systems, except i don't have to worry about realtime guarantees. those are exactly the kinds of problems i, a moron, smash myself into. i am not as smart as Kahan, and if i could get some hints as to what was going wrong where, i'd certainly appreciate it

whether unums would do this in practice, i don't know, but the thought is appealing

Brain Candy
May 18, 2006

VikingofRock posted:

Out of curiosity, what keeps stuff like MPFR (or a rational number implementation) from working in your use case?

i use java with no jni, the worst possible configuration for numeric computing :unsmigghh:

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

Brain Candy posted:

fair, i didn't explain

the fake control systems i use have the same algorithms as the real systems, except i don't have to worry about realtime guarantees. those are exactly the kinds of problems i, a moron, smash myself into. i am not as smart as Kahan, and if i could get some hints as to what was going wrong where, i'd certainly appreciate it

whether unums would do this in practice, i don't know, but the thought is appealing

is it the same algorithm but somehow a completely different implementation? because if it's supposed to be simulating the real system, i assume it should also be simulating the fp problems that the real system would have, but maybe that's not the point

Brain Candy
May 18, 2006

rjmccall posted:

is it the same algorithm but somehow a completely different implementation? because if it's supposed to be simulating the real system, i assume it should also be simulating the fp problems that the real system would have, but maybe that's not the point

yes, that level of detail isn't necessary. i wouldn't need the Kalman filter, but i might need a Kalman filter

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
gotcha

yeah, i can see how having a simulator that warns about running into precision problems would be useful

you could probably just do that with an ordinary software float library, though
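
a minimal python sketch of that idea: shadow the float computation with a higher-precision decimal reference and warn when they diverge (the summation and the tolerance are made up purely for illustration):

code:
from decimal import Decimal, getcontext

getcontext().prec = 50                       # "software float" reference precision

def shadowed_sum(values, rel_tol=Decimal("1e-9")):
    f = 0.0                                  # what the hardware floats would do
    d = Decimal(0)                           # higher-precision shadow of the same sum
    for v in values:
        f += v
        d += Decimal(v)                      # Decimal(float) is exact, no rounding here
    if d != 0 and abs((Decimal(f) - d) / d) > rel_tol:
        print("precision warning: float =", f, "reference =", d)
    return f

# classic cancellation example: the 1.0 gets swallowed, the shadow catches it
shadowed_sum([1e16, 1.0, -1e16])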

Dylan16807
May 12, 2010

rjmccall posted:

i have no idea why he thinks that forcing the processor to do arbitrary-precision fp arithmetic will "save power" vs. an implementation with known operand sizes

he has some numbers he quotes, off a 32nm process

6 picojoules to do a register store/load

64 picojoules to do a 64 bit multiply+add

4200 picojoules per 64 bits read out of main memory

there's a lot of room to save power if you can pack your numbers tighter. though that doesn't actually require you to do arbitrary-precision calculations. you could keep working values in a fixed-width format, while keeping track of precision, and then truncate your floats when storing them
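
a minimal numpy sketch of the "compute wide, store narrow" idea (array size is arbitrary, and the energy math just reuses the 4200 pJ per 64 bits figure above for scale):

code:
import numpy as np

work = np.random.rand(1_000_000)       # working values stay float64
stored = work.astype(np.float32)       # truncate to half the bits on the way out

pj_per_64_bits = 4200                  # quoted DRAM read cost, 32nm process
full = work.nbytes   * 8 / 64 * pj_per_64_bits
half = stored.nbytes * 8 / 64 * pj_per_64_bits
print(full / 1e6, "uJ vs", half / 1e6, "uJ")   # memory-traffic energy roughly halves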

sarehu
Apr 20, 2007

(call/cc call/cc)

the talent deficit posted:

use up/down instead of master/slave imo, or top/bottom if you're at cambridge

red/black

VikingofRock
Aug 24, 2008




the talent deficit posted:

top/bottom if you're at cambridge

please don't kink shame

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

Dylan16807 posted:

he has some numbers he quotes, off a 32nm process

6 picojoules to do a register store/load

64 picojoules to do a 64 bit multiply+add

4200 picojoules per 64 bits read out of main memory

there's a lot of room to save power if you can pack your numbers tighter. though that doesn't actually require you to do arbitrary-precision calculations. you could keep working values in a fixed-width format, while keeping track of precision, and then truncate your floats when storing them

this makes some sense, although i will note that fp conversion is not necessarily an efficient operation — it can be, but it's often not optimized because hardware designers reasonably assume that it's not a common operation in well-written fp code, and it's hard enough to optimize the well-written code without worrying about the lovely stuff

the main thing is that these power savings (compared to doing everything in full-precision double, i guess?) assume that the format actually shrinks the amount of memory touched. compressing a double to only occupy 3 bytes out of an 8-byte allocation makes very little difference for memory performance when the bus does almost everything in units of 64 bytes

now maybe you can shrink the format to use fewer "inline" bytes (4?) and overflow to a side allocation when necessary, and that way you could fit a lot more values per cache line. but even ignoring questions like "how is that side allocation managed at all", that still means using at least twice as much memory when the overflow occurs, with really poor locality. it wouldn't take many overflows for that to completely cancel any savings

(also as jawn says you have to ignore the fact that the individual operations on these values would take considerably more power. like i feel like i'm supposed to come away saying "man, what if we could do a dynamically smaller multiply" and ignore that verifying that costs something. theoretically an fp unit could already do smaller multiplies when fewer significand bits are set; checking that almost certainly wouldn't be worth it, and that's with the bits always in the same place, unlike this format)
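
back-of-envelope on the cache line point (64-byte lines, a 4-byte inline format with an 8-byte side allocation; the overflow probability is a made-up knob):

code:
LINE_BYTES = 64
doubles_per_line = LINE_BYTES // 8     # 8 plain doubles per cache line
inline4_per_line = LINE_BYTES // 4     # 16 inline slots per line if nothing overflows

def lines_touched(n_values, p_overflow):
    base = n_values / inline4_per_line
    # pessimistic: each overflowed value drags in roughly one extra line
    return base + n_values * p_overflow

print(1000 / doubles_per_line)         # 125.0 lines for plain doubles
print(lines_touched(1000, 0.0))        # 62.5 lines, the best case
print(lines_touched(1000, 0.07))       # 132.5 lines: ~6-7% overflow already loses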

Wheany
Mar 17, 2006

Spinyahahahahahahahahahahahaha!

Doctor Rope

the talent deficit posted:

use up/down instead of master/slave imo, or top/bottom if you're at cambridge

powerbottom/top

JawnV6
Jul 4, 2004

So hot ...

Dylan16807 posted:

there's a lot of room to save power if you can pack your numbers tighter. though that doesn't actually require you to do arbitrary-precision calculations. you could keep working values in a fixed-width format, while keeping track of precision, and then truncate your floats when storing them

there's a lot of extra complexity just figuring out what's relevant. seems like a lot of extra branching especially if sizes are being adjusted often. is the construction of a power virus generally understood? the only implementations are in python and julia, which seem poorly suited to fine grained power analysis

the scratchpad area is a whole other level of complexity, seems tantamount to transactional memory in terms of how much the application would have to change

pseudorandom name
May 6, 2007

I think you're all talking about unum 1.0, unum 2.0 is the latest version.

http://www.johngustafson.net/presentations/Unums2.0slides-withNotes.pdf
http://www.johngustafson.net/pubs/RadicalApproach.pdf

JawnV6
Jul 4, 2004

So hot ...
hm, cook accounts linked that interview

i lost the paper somewhere on adding bits and it got to gigabytes of lookup tables somehow and the presentation is careening through the circle of fifths

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

what did i just read

JawnV6
Jul 4, 2004

So hot ...
ive printed out hard copies for train reading

there's this tiny lil nod to 'power saving' right at the start where he derives a 4x4 truth table and claims it's implementable in 88 "ROM transistors", then implementation is chucked out the window to talk about the format's accuracy

i believe and/or don't care that it's accurate, i just wanna see linpack do less loads

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
the decimal parts are amazing

fritz
Jul 26, 2003

Wheany posted:

powerbottom/top

ironmaiden/powerslave

fritz
Jul 26, 2003

fritz posted:

i agree that it's a pretty dumb idea but im not getting a feeling of "crank" out of the guy behind it

obsessed, sure, crank, no

gonna back track on this a little bit after reading the new stuff

HappyHippo
Nov 19, 2003
Do you have an Air Miles Card?
http://math.ucr.edu/home/baez/crackpot.html

suffix
Jul 27, 2013

Wheeee!

JawnV6 posted:

anyway, this END OF NUMERIC ERROR article has been floating around. it seems to be to IEEE 754 what the Mill is to x86. probably genius, would've been neat if that's how things had settled 40 years ago, but reality picked a dodgy implementation and cost/benefit is going to keep us there for a while




VikingofRock
Aug 24, 2008




fritz posted:

ironmaiden/powerslave

JawnV6
Jul 4, 2004

So hot ...
the last 2 slides of the presentation give away the big hints

1) 32/64b implementations undefined, 'RLE' pitched as magic bullet, unknown if table lookup methods scale that far
2) the money quote:

quote:

I admit that I smuggle mathematical correctness into the computing business inside of a Trojan horse of energy and power savings.
which would behoove one to, you know, actually go about proving energy and power savings

Soricidus
Oct 21, 2010
freedom-hating statist shill
mathematical correctness? on computers? it seems so obvious with hindsight, why did nobody think of this before?!

JawnV6
Jul 4, 2004

So hot ...
i thiiink i can float the stone on unums now. i know everyone else is done, idgaf

it's not some general purpose thing like 754. you're expected to:
1) gaze deeply into the application
2) determine a useful u-lattice
3) compute tables for +-*/
4) reduce tables

there's scant guidance for 2, and error cases for bad ones include "every result comes back as 'between 4 billion and infinity'". there's one worked example of 3, and 'compute' includes determining whether entries are exactly representable in the u-lattice, so i can't tell if i'm supposed to do that with 754 or symbolically or what (toy sketch of 2 and 3 after this post). 4 is asserted to be possible, but the tradeoff of moving lookups into control flow is presumed to compress 32GB down to megabytes without proof or example. runtime is assumed to have some tables in a small ROM or something that doesn't cost the same as DDR loads

the crankiest part is willful blindness to how much hardware implementations of 754 already do. a comparison of a table implementation with the division hints already present would make a fantastic case
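
to make 2 and 3 concrete, a toy python sketch (nothing like gustafson's actual format: a tiny "lattice" and a brute-forced addition table, just to show how non-representable results turn into ranges and how a bad lattice hands you the "between N and infinity" answers):

code:
import itertools, math

LATTICE = [0.0, 0.5, 1.0, 2.0, 4.0, math.inf]   # step 2: pick representable points

def bound(x):
    # snap an exact result to the tightest enclosing pair of lattice points
    lo = max(v for v in LATTICE if v <= x)
    hi = min(v for v in LATTICE if v >= x)
    return (lo, hi)

# step 3: precompute the addition table over all pairs
ADD = {(a, b): bound(a + b) for a, b in itertools.product(LATTICE, repeat=2)}

print(ADD[(0.5, 1.0)])   # (1.0, 2.0): 1.5 isn't representable, so you get a range
print(ADD[(4.0, 4.0)])   # (4.0, inf): the "between N and infinity" failure mode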

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

fritz posted:

ironmaiden/powerslave

pseudorandom name
May 6, 2007

no, the crankiest part is the anecdote about his defeat of Kahan in this slide deck: https://www.slideshare.net/mobile/insideHPC/unum-computing-an-energy-efficient-and-massively-parallel-approach-to-valid-numerics

Sapozhnik
Jan 2, 2005

Nap Ghost

KAHAAAAAAAN!!!!!!!!!!

abraham linksys
Sep 6, 2010

:darksouls:
huh, tc39 is thinking of adding TCO to JavaScript via a new manual syntax instead of automatic TCO like they'd originally planned, weird https://github.com/tc39/proposal-ptc-syntax/blob/master/README.md

is there prior art on this in any other languages?

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





i hope none of the people who complained about missing stack frames use for loops ever

more like dICK
Feb 15, 2010

This is inevitable.
Clojure needs a recur call as the final form in a tail recursive function, since the JVM doesn't do TCO.
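
for comparison, python is in the same boat (no TCO), and the manual fix is the same idea as recur: rewrite the tail call as a loop that reuses the frame. toy sketch:

code:
def fact_rec(n, acc=1):
    # tail-recursive on paper, but every call still burns a stack frame
    return acc if n <= 1 else fact_rec(n - 1, acc * n)

def fact_loop(n, acc=1):
    # the manual "recur": rebind the arguments and loop instead of calling
    while n > 1:
        n, acc = n - 1, acc * n
    return acc

print(fact_loop(10_000) > 0)   # fine
# fact_rec(10_000) raises RecursionError at CPython's default limit (~1000)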
