Bloody
Mar 3, 2013

someone tell me a little bit about cuda vs opencl

i think i will soon have a data problem where they may be applicable (doing idk what with a lot of time series data) and idk anything about them or their differences or w/e

Athas
Aug 6, 2007

fuck that joker

Bloody posted:

someone tell me a little bit about cuda vs opencl

i think i will soon have a data problem where they may be applicable (doing idk what with a lot of time series data) and idk anything about them or their differences or w/e

CUDA is much nicer to use in hand-written code. If you can use something like PyOpenCL, then there is less difference.

CUDA is limited to NVIDIA hardware, while OpenCL can target many kinds of GPUs (including NVIDIA), multicore CPUs, and even exotic things like FPGAs (although I haven't tried that myself).

OpenCL is much nicer as a code generation target.
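
For a feel of what the PyOpenCL route looks like, here is a minimal sketch (assuming pyopencl and numpy are installed; the vector-add kernel is just a placeholder, nothing specific to time series):

code:
import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()          # picks whatever OpenCL device is available
queue = cl.CommandQueue(ctx)

a = np.random.rand(1_000_000).astype(np.float32)
b = np.random.rand(1_000_000).astype(np.float32)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

prg = cl.Program(ctx, """
__kernel void add(__global const float *a,
                  __global const float *b,
                  __global float *out) {
    int gid = get_global_id(0);
    out[gid] = a[gid] + b[gid];
}
""").build()

prg.add(queue, a.shape, None, a_buf, b_buf, out_buf)

out = np.empty_like(a)
cl.enqueue_copy(queue, out, out_buf)    # copy the result back to the host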

JawnV6
Jul 4, 2004

So hot ...

rjmccall posted:

for me it's the insistence that his idea solves every problem that ever was and ever will be
observation: people don't write overflow handlers
proposed mitigation: invent new representation that lacks HW support and may return a range instead of a discrete value. requires several new handlers, new machinery for common operations, and as a bonus opens up new failure modes

rjmccall posted:

i have no idea why he thinks that forcing the processor to do arbitrary-precision fp arithmetic will "save power" vs. an implementation with known operand sizes
i think it's literally less bits=less power. all the extra control flow to figure out where to mask before compute? trivial compared to the millijoules spent pulling 64 whole bits out of RAM

Brain Candy
May 18, 2006

rjmccall posted:

reading between the lines, i think he is envisaging a scheme where the processor implicitly manages these and can allocate these to the heap, as if implicit dynamic allocation would actually be acceptable in any of the environments that keep getting brought up as gross failures of floating point

i work in simulation and i'd take a 100-fold slowdown in a heartbeat if it increased my chances of a correct result

MeruFM
Jul 27, 2010
you can get the way more expensive video card for those purposes

VikingofRock
Aug 24, 2008




Brain Candy posted:

i work in simulation and i'd take a 100-fold slowdown in a heartbeat if it increased my chances of a correct result

Out of curiosity, what keeps stuff like MPFR (or a rational number implementation) from working in your use case?

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

Brain Candy posted:

i work in simulation and i'd take a 100-fold slowdown in a heartbeat if it increased my chances of a correct result

the reason dynamic memory allocation is forbidden in control systems isn't because it's slow

Bloody
Mar 3, 2013

JawnV6 posted:

observation: people don't write overflow handlers
proposed mitigation: invent new representation that lacks HW support and may return a range instead of a discrete value. requires several new handlers, new machinery for common operations, and as a bonus opens up new failure modes

i think it's literally less bits=less power. all the extra control flow to figure out where to mask before compute? trivial compared to the millijoules spent pulling 64 whole bits out of RAM

if it was millijoules then maybe

(gigabits per second times millijoules per bit = rip u)
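
a sketch of that arithmetic, with a 10 Gbit/s memory interface picked out of thin air for scale:

code:
# if each bit really cost millijoules, a 10 Gbit/s memory interface would dissipate:
bits_per_second = 10e9
joules_per_bit = 1e-3                      # "millijoules per bit"
print(bits_per_second * joules_per_bit)    # 1e7 W, i.e. 10 megawatts
# real DRAM access energy is ballpark tens of picojoules per bit, ~9 orders of magnitude less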

Bloody
Mar 3, 2013

Athas posted:

CUDA is much nicer to use in hand-written code. If you can use something like PyOpenCL, then there is less difference.

CUDA is limited to NVIDIA hardware, while OpenCL can target many kinds of GPUs (including NVIDIA), multicore CPUs, and even exotic things like FPGAs (although I haven't tried that myself).

OpenCL is much nicer as a code generation target.

cool thanks I'll look in the opencl direction.

iirc altera has the opencl to fpga flow

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





use up/down instead of master/slave imo, or top/bottom if you're at cambridge

Brain Candy
May 18, 2006

rjmccall posted:

the reason dynamic memory allocation is forbidden in control systems isn't because it's slow

fair, i didn't explain

the fake control systems i use have the same algorithms as the real systems, except i don't have to worry about realtime guarantees. those are exactly the kinds of problems i, a moron, smash myself into. i am not as smart as Kahan, and if i could get some hints as to what was going wrong where, i'd certainly appreciate it

whether unums would do this in practice, i don't know, but the thought is appealing

Brain Candy
May 18, 2006

VikingofRock posted:

Out of curiosity, what keeps stuff like MPFR (or a rational number implementation) from working in your use case?

i use java with no jni, the worst possible configuration for numeric computing :unsmigghh:

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

Brain Candy posted:

fair, i didn't explain

the fake control systems i use have the same algorithms as the real systems, except i don't have to worry about realtime guarantees. those are exactly the kinds of problems i, a moron, smash myself into. i am not as smart as Kahan, and if i could get some hints as to what was going wrong where, i'd certainly appreciate it

whether unums would do this in practice, i don't know, but the thought is appealing

is it the same algorithm but somehow a completely different implementation? because if it's supposed to be simulating the real system, i assume it should also be simulating the fp problems that the real system would have, but maybe that's not the point

Brain Candy
May 18, 2006

rjmccall posted:

is it the same algorithm but somehow a completely different implementation? because if it's supposed to be simulating the real system, i assume it should also be simulating the fp problems that the real system would have, but maybe that's not the point

yes, that level of detail isn't necessary. i wouldn't need the Kalman filter, but i might need a Kalman filter

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
gotcha

yeah, i can see how having a simulator that warns about running into precision problems would be useful

you could probably just do that with an ordinary software float library, though
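
a minimal python sketch of that idea: shadow the float computation with a higher-precision decimal reference and warn when they diverge (the summation and the tolerance are made up purely for illustration):

code:
from decimal import Decimal, getcontext

getcontext().prec = 50                       # "software float" reference precision

def shadowed_sum(values, rel_tol=Decimal("1e-9")):
    f = 0.0                                  # what the hardware floats would do
    d = Decimal(0)                           # higher-precision shadow of the same sum
    for v in values:
        f += v
        d += Decimal(v)                      # Decimal(float) is exact, no rounding here
    if d != 0 and abs((Decimal(f) - d) / d) > rel_tol:
        print("precision warning: float =", f, "reference =", d)
    return f

# classic cancellation example: the 1.0 gets swallowed, the shadow catches it
shadowed_sum([1e16, 1.0, -1e16])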

Dylan16807
May 12, 2010

rjmccall posted:

i have no idea why he thinks that forcing the processor to do arbitrary-precision fp arithmetic will "save power" vs. an implementation with known operand sizes

he has some numbers he quotes, off a 32nm process

6 picojoules to do a register store/load

64 picojoules to do a 64 bit multiply+add

4200 picojoules per 64 bits read out of main memory

there's a lot of room to save power if you can pack your numbers tighter. though that doesn't actually require you to do arbitrary-precision calculations. you could keep working values in a fixed-width format, while keeping track of precision, and then truncate your floats when storing them
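
a minimal numpy sketch of the "compute wide, store narrow" idea (array size is arbitrary, and the energy math just reuses the 4200 pJ per 64 bits figure above for scale):

code:
import numpy as np

work = np.random.rand(1_000_000)       # working values stay float64
stored = work.astype(np.float32)       # truncate to half the bits on the way out

pj_per_64_bits = 4200                  # quoted DRAM read cost, 32nm process
full = work.nbytes   * 8 / 64 * pj_per_64_bits
half = stored.nbytes * 8 / 64 * pj_per_64_bits
print(full / 1e6, "uJ vs", half / 1e6, "uJ")   # memory-traffic energy roughly halves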

sarehu
Apr 20, 2007

(call/cc call/cc)

the talent deficit posted:

use up/down instead of master/slave imo, or top/bottom if you're at cambridge

red/black

VikingofRock
Aug 24, 2008




the talent deficit posted:

top/bottom if you're at cambridge

please don't kink shame

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

Dylan16807 posted:

he has some numbers he quotes, off a 32nm process

6 picojoules to do a register store/load

64 picojoules to do a 64 bit multiply+add

4200 picojoules per 64 bits read out of main memory

there's a lot of room to save power if you can pack your numbers tighter. though that doesn't actually require you to do arbitrary-precision calculations. you could keep working values in a fixed-width format, while keeping track of precision, and then truncate your floats when storing them

this makes some sense, although i will note that fp conversion is not necessarily an efficient operation — it can be, but it's often not optimized because hardware designers reasonably assume that it's not a common operation in well-written fp code, and it's hard enough to optimize the well-written code without worrying about the lovely stuff

the main thing is that these power savings (compared to doing everything in full-precision double, i guess?) assume that the format actually shrinks the amount of memory touched. compressing a double to only occupy 3 bytes out of an 8-byte allocation makes very little difference for memory performance when the bus does almost everything in units of 64 bytes

now maybe you can shrink the format to use fewer "inline" bytes (4?) and overflow to a side allocation when necessary, and that way you could fit a lot more values per cache line. but even ignoring questions like "how is that side allocation managed at all", that still means using at least twice as much memory when the overflow occurs, with really poor locality. it wouldn't take many overflows for that to completely cancel any savings

(also as jawn says you have to ignore the fact that the individual operations on these values would take considerably more power. like i feel like i'm supposed to come away saying "man, what if we could do a dynamically smaller multiply" and ignore that verifying that costs something. theoretically an fp unit could already do smaller multiplies when fewer significand bits are set; checking that almost certainly wouldn't be worth it, and that's with the bits always in the same place, unlike this format)
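
back-of-envelope on the cache line point (64-byte lines, a 4-byte inline format with an 8-byte side allocation; the overflow probability is a made-up knob):

code:
LINE_BYTES = 64
doubles_per_line = LINE_BYTES // 8     # 8 plain doubles per cache line
inline4_per_line = LINE_BYTES // 4     # 16 inline slots per line if nothing overflows

def lines_touched(n_values, p_overflow):
    base = n_values / inline4_per_line
    # pessimistic: each overflowed value drags in roughly one extra line
    return base + n_values * p_overflow

print(1000 / doubles_per_line)         # 125.0 lines for plain doubles
print(lines_touched(1000, 0.0))        # 62.5 lines, the best case
print(lines_touched(1000, 0.07))       # 132.5 lines: ~6-7% overflow already loses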

Wheany
Mar 17, 2006

Spinyahahahahahahahahahahahaha!

Doctor Rope

the talent deficit posted:

use up/down instead of master/slave imo, or top/bottom if you're at cambridge

powerbottom/top

JawnV6
Jul 4, 2004

So hot ...

Dylan16807 posted:

there's a lot of room to save power if you can pack your numbers tighter. though that doesn't actually require you to do arbitrary-precision calculations. you could keep working values in a fixed-width format, while keeping track of precision, and then truncate your floats when storing them

there's a lot of extra complexity just figuring out what's relevant. seems like a lot of extra branching especially if sizes are being adjusted often. is the construction of a power virus generally understood? the only implementations are in python and julia, which seem poorly suited to fine grained power analysis

the scratchpad area is a whole other level of complexity, seems tantamount to transactional memory in terms of how much the application would have to change

pseudorandom name
May 6, 2007

I think you're all talking about unum 1.0, unum 2.0 is the latest version.

http://www.johngustafson.net/presentations/Unums2.0slides-withNotes.pdf
http://www.johngustafson.net/pubs/RadicalApproach.pdf

JawnV6
Jul 4, 2004

So hot ...
hm, cook accounts linked that interview

i lost the paper somewhere on adding bits and it got to gigabytes of lookup tables somehow and the presentation is careening through the circle of fifths

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

what did i just read

JawnV6
Jul 4, 2004

So hot ...
ive printed out hard copies for train reading

there's this tiny lil nod to 'power saving' right at the start where he derives a 4x4 truth table and claims it's implementable in 88 "ROM transistors", then implementation is chucked out the window to talk about the format's accuracy

i believe and/or don't care that it's accurate, i just wanna see linpack do less loads

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
the decimal parts are amazing

fritz
Jul 26, 2003

Wheany posted:

powerbottom/top

ironmaiden/powerslave

fritz
Jul 26, 2003

fritz posted:

i agree that it's a pretty dumb idea but im not getting a feeling of "crank" out of the guy behind it

obsessed, sure, crank, no

gonna back track on this a little bit after reading the new stuff

HappyHippo
Nov 19, 2003
Do you have an Air Miles Card?
http://math.ucr.edu/home/baez/crackpot.html

suffix
Jul 27, 2013

Wheeee!

JawnV6 posted:

anyway, this END OF NUMERIC ERROR article has been floating around. it seems to be to IEEE 754 what the Mill is to x86. probably genius, would've been neat if that's how things had settled 40 years ago, but reality picked a dodgy implementation and cost/benefit is going to keep us there for a while




VikingofRock
Aug 24, 2008




fritz posted:

ironmaiden/powerslave

JawnV6
Jul 4, 2004

So hot ...
the last 2 slides of the presentation give away the big hints

1) 32/64b implementations undefined, 'RLE' pitched as magic bullet, unknown if table lookup methods scale that far
2) the money quote:

quote:

I admit that I smuggle mathematical correctness into the computing business inside of a Trojan horse of energy and power savings.
which would behoove one to, you know, actually go about proving energy and power savings

Soricidus
Oct 21, 2010
freedom-hating statist shill
mathematical correctness? on computers? it seems so obvious with hindsight, why did nobody think of this before?!

JawnV6
Jul 4, 2004

So hot ...
i thiiink i can float the stone on unums now. i know everyone else is done, idgaf

it's not some general purpose thing like 754. you're expected to:
1) gaze deeply into the application
2) determine a useful u-lattice
3) compute tables for +-*/
4) reduce tables

there's scant guidance for 2, and error cases for bad ones include "every result comes back as 'between 4 billion and infinity'". there's one worked example of 3, and 'compute' includes determining whether entries are exactly representable in the u-lattice, so i can't tell if i'm supposed to do that with 754 or symbolically or what (toy sketch of 2 and 3 after this post). 4 is asserted to be possible, but the tradeoff of moving lookups into control flow is presumed to compress 32GB down to megabytes without proof or example. runtime is assumed to have some tables in a small ROM or something that doesn't cost the same as DDR loads

the crankiest part is willful blindness to how much hardware implementations of 754 already do. a comparison of a table implementation with the division hints already present would make a fantastic case
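
to make 2 and 3 concrete, a toy python sketch (nothing like gustafson's actual format: a tiny "lattice" and a brute-forced addition table, just to show how non-representable results turn into ranges and how a bad lattice hands you the "between N and infinity" answers):

code:
import itertools, math

LATTICE = [0.0, 0.5, 1.0, 2.0, 4.0, math.inf]   # step 2: pick representable points

def bound(x):
    # snap an exact result to the tightest enclosing pair of lattice points
    lo = max(v for v in LATTICE if v <= x)
    hi = min(v for v in LATTICE if v >= x)
    return (lo, hi)

# step 3: precompute the addition table over all pairs
ADD = {(a, b): bound(a + b) for a, b in itertools.product(LATTICE, repeat=2)}

print(ADD[(0.5, 1.0)])   # (1.0, 2.0): 1.5 isn't representable, so you get a range
print(ADD[(4.0, 4.0)])   # (4.0, inf): the "between N and infinity" failure mode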

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'

fritz posted:

ironmaiden/powerslave

pseudorandom name
May 6, 2007

no, the crankiest part is the anecdote about his defeat of Kahan in this slide deck: https://www.slideshare.net/mobile/insideHPC/unum-computing-an-energy-efficient-and-massively-parallel-approach-to-valid-numerics

Sapozhnik
Jan 2, 2005

Nap Ghost

KAHAAAAAAAN!!!!!!!!!!

abraham linksys
Sep 6, 2010

:darksouls:
huh, tc39 is thinking of adding TCO to JavaScript via a new manual syntax instead of automatic TCO like they'd originally planned, weird https://github.com/tc39/proposal-ptc-syntax/blob/master/README.md

is there prior art on this in any other languages?

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





i hope none of the people who complained about missing stack frames use for loops ever

more like dICK
Feb 15, 2010

This is inevitable.
Clojure needs a recur call as the final form in a tail recursive function, since the JVM doesn't do TCO.
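
for comparison, python is in the same boat (no TCO), and the manual fix is the same idea as recur: rewrite the tail call as a loop that reuses the frame. toy sketch:

code:
def fact_rec(n, acc=1):
    # tail-recursive on paper, but every call still burns a stack frame
    return acc if n <= 1 else fact_rec(n - 1, acc * n)

def fact_loop(n, acc=1):
    # the manual "recur": rebind the arguments and loop instead of calling
    while n > 1:
        n, acc = n - 1, acc * n
    return acc

print(fact_loop(10_000) > 0)   # fine
# fact_rec(10_000) raises RecursionError at CPython's default limit (~1000)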
