|
someone tell me a little bit about cuda vs opencl i think i will soon have a data problem where i think they may be applicable (doing idk what with a lot of time series data) and idk anything about them or differences or w/e
|
# ? Apr 27, 2016 20:51 |
|
|
Bloody posted:someone tell me a little bit about cuda vs opencl

CUDA is much nicer to use in hand-written code. If you can use something like PyOpenCL, then there is less difference. CUDA is limited to NVIDIA hardware; OpenCL can use many kinds of GPU (including NVIDIA's), multicore CPUs, and even exotic things like FPGAs (although I haven't tried that myself). OpenCL is much nicer as a code generation target.
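for concreteness, here's a minimal PyOpenCL sketch of an elementwise vector add — the kernel is plain OpenCL C either way, and the host side assumes `pip install pyopencl` plus some OpenCL driver (NVIDIA, AMD, or an Intel CPU runtime), so it's kept behind a function:

```python
# hedged sketch: vector addition via PyOpenCL. the kernel source is
# ordinary OpenCL C; the host code only runs if you actually call it
# on a machine with an OpenCL runtime installed.
KERNEL_SRC = """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int i = get_global_id(0);   /* one work-item per element */
    out[i] = a[i] + b[i];
}
"""

def run_vadd(a, b):
    # requires pyopencl + numpy + an OpenCL driver
    import numpy as np
    import pyopencl as cl
    import pyopencl.array as cl_array

    ctx = cl.create_some_context()      # picks any available device
    queue = cl.CommandQueue(ctx)
    prog = cl.Program(ctx, KERNEL_SRC).build()

    a_d = cl_array.to_device(queue, np.asarray(a, dtype=np.float32))
    b_d = cl_array.to_device(queue, np.asarray(b, dtype=np.float32))
    out_d = cl_array.empty_like(a_d)

    prog.vadd(queue, a_d.shape, None, a_d.data, b_d.data, out_d.data)
    return out_d.get()
```

the equivalent CUDA kernel is nearly identical syntax; the practical difference is mostly in the host API and which devices will run it.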
|
# ? Apr 27, 2016 23:17 |
|
rjmccall posted:for me it's the insistence that his idea solves every problem that ever was and ever will be

proposed mitigation: invent a new representation that lacks HW support and may return a range instead of a discrete value. requires several new handlers, new machinery for common operations, and as a bonus opens up new failure modes

rjmccall posted:i have no idea why he thinks that forcing the processor to do arbitrary-precision fp arithmetic will "save power" vs. an implementation with known operand sizes
|
# ? Apr 27, 2016 23:27 |
|
rjmccall posted:reading between the lines, i think he is envisaging a scheme where the processor implicitly manages these and can allocate these to the heap, as if implicit dynamic allocation would actually be acceptable in any of the environments that keep getting brought up as gross failures of floating point

i work in simulation and i'd take a 100-fold slowdown in a heartbeat if it increased my chances of a correct result
|
# ? Apr 28, 2016 01:53 |
|
you can get the way more expensive video card for those purposes
|
# ? Apr 28, 2016 02:00 |
Brain Candy posted:i work in simulation and i'd take a 100 fold slowdown in a heartbeat if it increased my chances of a correct result

Out of curiosity, what keeps stuff like MPFR (or a rational number implementation) from working in your use case?
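as a point of comparison for the MPFR suggestion: even the stdlib shows what exact arithmetic buys you. a sketch with Python's `fractions` (MPFR itself would be something like the `gmpy2` bindings):

```python
from fractions import Fraction

# the classic binary-float surprise:
assert 0.1 + 0.2 != 0.3

# ...which goes away with exact rationals. the price is unbounded
# denominator growth, which is why this can get slow in long
# simulation loops.
assert Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10)
```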
|
|
# ? Apr 28, 2016 02:23 |
|
Brain Candy posted:i work in simulation and i'd take a 100 fold slowdown in a heartbeat if it increased my chances of a correct result

the reason dynamic memory allocation is forbidden in control systems isn't because it's slow
|
# ? Apr 28, 2016 02:27 |
|
JawnV6 posted:observation: people don't write overflow handlers

if it was millijoules then maybe (gigabits per second times millijoules per bit = rip u)
|
# ? Apr 28, 2016 02:47 |
|
Athas posted:CUDA is much nicer to use in hand-written code. If you can use something like PyOpenCL, then there is less difference.

cool thanks I'll look in the opencl direction. iirc altera has the opencl to fpga flow
|
# ? Apr 28, 2016 02:48 |
|
use up/down instead of master/slave imo, or top/bottom if you're at cambridge
|
# ? Apr 28, 2016 02:49 |
|
rjmccall posted:the reason dynamic memory allocation is forbidden in control systems isn't because it's slow

fair, i didn't explain. the fake control systems i use have the same algorithms as the real systems, except i don't have to worry about realtime guarantees. those kinds of problems are exactly the kind of problems i, a moron, smash myself into. i am not as smart as Kahan, and if i could get some hints as to what was going wrong where, i'd certainly appreciate it. whether unums would do this in practice, i don't know, but the thought is appealing
|
# ? Apr 28, 2016 03:07 |
|
VikingofRock posted:Out of curiosity, what keeps stuff like MPFR (or a rational number implementation) from working in your use case?

i use java with no jni, the worst possible configuration for numeric computing
|
# ? Apr 28, 2016 03:11 |
|
Brain Candy posted:fair, i didn't explain

is it the same algorithm but somehow a completely different implementation? because if it's supposed to be simulating the real system i assume it should also be simulating the fp problems that the real system would have, but maybe that's not the point
|
# ? Apr 28, 2016 03:15 |
|
rjmccall posted:is it the same algorithm but somehow a completely different implementation? because if it's supposed to be simulating the real system i assume it should also be simulating the fp problems that the real system would have, but maybe that's not the point

yes, that level of detail isn't necessary. i wouldn't need the Kalman filter, but i might need a Kalman filter
|
# ? Apr 28, 2016 03:31 |
|
gotcha yeah, i can see how having a simulator that warns about running into precision problems would be useful. you could probably just do that with an ordinary software float library, though
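the "ordinary software float library" version of that warning can even be faked with the stdlib. a sketch using `decimal`'s Inexact trap, which raises whenever a result had to be rounded:

```python
from decimal import Decimal, Inexact, localcontext

with localcontext() as ctx:
    ctx.traps[Inexact] = True   # raise whenever a result gets rounded

    exact = Decimal(1) / Decimal(4)      # 0.25 is exact in decimal: fine
    try:
        Decimal(1) / Decimal(3)          # 0.333... must round: trap fires
        rounded_detected = False
    except Inexact:
        rounded_detected = True

assert exact == Decimal("0.25")
assert rounded_detected
```

that gives you "tell me exactly where precision was lost" at pure-software speeds, which is more or less the debugging workflow Brain Candy is asking for.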
|
# ? Apr 28, 2016 03:38 |
|
rjmccall posted:i have no idea why he thinks that forcing the processor to do arbitrary-precision fp arithmetic will "save power" vs. an implementation with known operand sizes

he has some numbers he quotes, off a 32nm process:
6 picojoules to do a register store/load
64 picojoules to do a 64 bit multiply+add
4200 picojoules per 64 bits read out of main memory

there's a lot of room to save power if you can pack your numbers tighter. though that doesn't actually require you to do arbitrary-precision calculations. you could keep working values in a fixed-width format, while keeping track of precision, and then truncate your floats when storing them
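the "compute wide, store narrow" idea is already expressible today. a sketch using the stdlib's IEEE half-precision pack format (`'e'`):

```python
import struct

x = 1.0 / 3.0                       # computed as a full 8-byte double

packed = struct.pack("<e", x)       # stored as IEEE binary16: 2 bytes
assert len(packed) == 2

(roundtrip,) = struct.unpack("<e", packed)
# binary16 keeps ~11 significand bits, so the relative error is
# on the order of 2**-12 -- tiny storage, bounded precision loss
assert abs(roundtrip - x) < 1e-3
```

same principle as ML-style mixed precision: the expensive thing (memory traffic) shrinks 4x while the arithmetic stays full-width and fixed-size.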
|
# ? Apr 28, 2016 04:06 |
|
the talent deficit posted:use up/down instead of master/slave imo, or top/bottom if you're at cambridge

red/black
|
# ? Apr 28, 2016 04:10 |
the talent deficit posted:top/bottom if you're at cambridge

please don't kink shame
|
|
# ? Apr 28, 2016 04:44 |
|
Dylan16807 posted:he has some numbers he quotes, off a 32nm process

this makes some sense, although i will note that fp conversion is not necessarily an efficient operation — it can be, but it's often not optimized because hardware designers reasonably assume that it's not a common operation in well-written fp code, and it's hard enough to optimize the well-written code without worrying about the lovely stuff

the main thing is that these power savings (compared to doing everything in full-precision double, i guess?) assume that the format actually shrinks the amount of memory touched. compressing a double to only occupy 3 bytes out of an 8-byte allocation makes very little difference for memory performance when the bus does almost everything in units of 64 bytes

now maybe you can shrink the format to use fewer "inline" bytes (4?) and overflow to a side allocation when necessary, and that way you could fit a lot more values per cache line. but even ignoring questions like "how is that side allocation managed at all", that still means using at least twice as much memory when the overflow occurs, with really poor locality. it wouldn't take many overflows for that to completely cancel any savings

(also as jawn says you have to ignore the fact that the individual operations on these values would take considerably more power. like i feel like i'm supposed to come away saying "man, what if we could do a dynamically smaller multiply" and ignore that verifying that costs something. theoretically an fp unit could already do smaller multiplies when fewer significand bits are set; checking that almost certainly wouldn't be worth it, and that's with the bits always in the same place, unlike this format)
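to put numbers on the cache-line argument, a back-of-envelope calc using the 32nm figures quoted upthread — the per-value saving only materializes if proportionally more values actually fit per line:

```python
PJ_PER_64BIT_DRAM_READ = 4200   # figure quoted upthread, 32nm process
CACHE_LINE_BYTES = 64

# a line is 8 x 64-bit words, so pulling one line from DRAM costs:
line_pj = PJ_PER_64BIT_DRAM_READ * (CACHE_LINE_BYTES // 8)   # 33600 pJ

# per-value DRAM energy falls only in proportion to values-per-line
pj_per_double = line_pj / (CACHE_LINE_BYTES // 8)   # 8 doubles/line
pj_per_half = line_pj / (CACHE_LINE_BYTES // 2)     # 32 halves/line

assert line_pj == 33_600
assert pj_per_double == 4_200.0
assert pj_per_half == 1_050.0
```

so a double squeezed to 3 bytes inside an 8-byte slot saves exactly nothing per line, while one overflow to a side allocation pulls a whole second 33600 pJ line.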
|
# ? Apr 28, 2016 07:31 |
|
the talent deficit posted:use up/down instead of master/slave imo, or top/bottom if you're at cambridge

powerbottom/top
|
# ? Apr 28, 2016 07:46 |
|
Dylan16807 posted:there's a lot of room to save power if you can pack your numbers tighter. though that doesn't actually require you to do arbitrary-precision calculations. you could keep working values in a fixed-width format, while keeping track of precision, and then truncate your floats when storing them

the scratchpad area is a whole other level of complexity, seems tantamount to transactional memory in terms of how much the application would have to change
|
# ? Apr 28, 2016 08:36 |
|
I think you're all talking about unum 1.0, unum 2.0 is the latest version.
http://www.johngustafson.net/presentations/Unums2.0slides-withNotes.pdf
http://www.johngustafson.net/pubs/RadicalApproach.pdf
|
# ? Apr 28, 2016 09:08 |
|
hm, cook accounts linked that interview. i lost the paper somewhere on adding bits and it got to gigabytes of lookup tables somehow, and the presentation is careening through the circle of fifths
|
# ? Apr 28, 2016 09:43 |
|
pseudorandom name posted:I think you're all talking about unum 1.0, unum 2.0 is the latest version.

what did i just read
|
# ? Apr 28, 2016 16:55 |
|
ive printed out hard copies for train reading

there's this tiny lil nod to 'power saving' right at the start where he derives a 4x4 truth table and claims it's implementable in 88 "ROM transistors", then implementation is chucked out the window to talk about the format's accuracy. i believe and/or don't care that it's accurate, i just wanna see linpack do less loads
|
# ? Apr 28, 2016 19:05 |
|
the decimal parts are amazing
|
# ? Apr 28, 2016 19:10 |
|
Wheany posted:powerbottom/top ironmaiden/powerslave
|
# ? Apr 28, 2016 19:39 |
|
fritz posted:i agree that it's a pretty dumb idea but im not getting a feeling of "crank" out of the guy behind it

gonna back track on this a little bit after reading the new stuff
|
# ? Apr 28, 2016 19:54 |
|
http://math.ucr.edu/home/baez/crackpot.html
|
# ? Apr 28, 2016 19:57 |
|
JawnV6 posted:anyway, this END OF NUMERIC ERROR article has been floating around. it seems to be to IEEE754 what the Mill is to x86. probably genius, would've been neat if that's how things settled 40 years ago, but reality picked a dodgy implementation and cost/benefit is going to keep us there for a while

pseudorandom name posted:I think you're all talking about unum 1.0, unum 2.0 is the latest version.
|
# ? Apr 28, 2016 20:26 |
fritz posted:ironmaiden/powerslave
|
|
# ? Apr 28, 2016 20:38 |
|
the last 2 slides of the presentation give away the big hints

1) 32/64b implementations undefined, 'RLE' pitched as magic bullet, unknown if table lookup methods scale that far
2) the money quote:

quote:I admit that I smuggle mathematical correctness into the computing business inside of a Trojan horse of energy and power savings.
|
# ? Apr 29, 2016 00:43 |
|
mathematical correctness? on computers? it seems so obvious with hindsight, why did nobody think of this before?!
|
# ? Apr 29, 2016 00:48 |
|
i thiiink i can float the stone on unums now. i know everyone else is done, idgaf

it's not some general purpose thing like 754. you're expected to:
1) gaze deeply into the application
2) determine a useful u-lattice
3) compute tables for +-*/
4) reduce tables

there's scant guidance for 2, and error cases for bad ones include "every result comes back as 'between 4 billion and infinity'". there's one worked example of 3, and 'compute' includes determining if entries are exactly representable in the u-lattice, so i can't tell if im supposed to do that with 754 or symbolically or what. 4 is asserted to be possible, but the tradeoff of moving lookups into control flow is presumed to compress 32GB down to megabytes without proof or example. runtime is assumed to have some tables in a small ROM or something that doesn't cost the same as DDR loads

the crankiest part is willful blindness to how much hardware implementations of 754 already do. a comparison of a table implementation with the division hints already present would make a fantastic case
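for anyone who hasn't read the slides, steps 2 and 3 look roughly like this toy sketch — the lattice here is entirely made up for illustration (the paper picks lattices per-application), but it shows what "compute tables for +" means and why the table size is the scaling worry:

```python
from fractions import Fraction
from itertools import product

# toy "u-lattice": a few exact points; any result falling between two
# neighbours is represented as the open interval bracketing it.
LATTICE = [Fraction(0), Fraction(1, 2), Fraction(1), Fraction(2)]

def snap(x):
    """Round an exact rational result onto the lattice."""
    if x in LATTICE:
        return ("exact", x)
    below = max((p for p in LATTICE if p < x), default=None)
    above = min((p for p in LATTICE if p > x), default=None)
    return ("open", below, above)     # None = unbounded on that side

# step 3: precompute the '+' table over exact lattice points.
# this quadratic table is the thing that explodes as the lattice grows.
ADD_TABLE = {(a, b): snap(a + b) for a, b in product(LATTICE, repeat=2)}

assert ADD_TABLE[(Fraction(1, 2), Fraction(1, 2))] == ("exact", Fraction(1))
# 1 + 2 = 3 falls off the top of this bad lattice:
# "between 2 and infinity", i.e. the failure mode described above
assert ADD_TABLE[(Fraction(1), Fraction(2))] == ("open", Fraction(2), None)
```

even this 4-point lattice needs a 16-entry table per operation; a 2^16-point lattice needs 2^32 entries per op, which is where the "32GB of tables" complaint comes from.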
|
# ? Apr 29, 2016 18:31 |
|
fritz posted:ironmaiden/powerslave
|
# ? Apr 29, 2016 20:29 |
|
no, the crankiest part is the anecdote about his defeat of Kahan in this slide deck: https://www.slideshare.net/mobile/insideHPC/unum-computing-an-energy-efficient-and-massively-parallel-approach-to-valid-numerics
|
# ? Apr 29, 2016 23:32 |
|
pseudorandom name posted:no, the crankiest part is the anecdote about his defeat of Kahan in this slide deck: https://www.slideshare.net/mobile/insideHPC/unum-computing-an-energy-efficient-and-massively-parallel-approach-to-valid-numerics

KAHAAAAAAAN!!!!!!!!!!
|
# ? Apr 30, 2016 01:34 |
|
huh, tc39 is thinking of adding TCO to JavaScript via a new manual syntax instead of automatic TCO like they'd originally planned, weird

https://github.com/tc39/proposal-ptc-syntax/blob/master/README.md

is there prior art on this in any other languages?
|
# ? May 4, 2016 23:33 |
|
i hope none of the people who complained about missing stack frames use for loops ever
|
# ? May 4, 2016 23:50 |
|
|
Clojure needs a recur call as the final form in a tail recursive function, since the JVM doesn't do TCO.
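mechanically, Clojure's `recur` (and Scheme-style named lets compiled for a TCO-less target) amount to the trampoline pattern: the tail call is returned as a thunk and a driver loop unwinds it, so the stack stays flat. a Python-flavoured sketch:

```python
import sys

def trampoline(fn, *args):
    # keep calling returned thunks until a non-callable value comes back
    result = fn(*args)
    while callable(result):
        result = result()
    return result

def countdown(n):
    if n == 0:
        return "done"
    # the tail call is returned as a thunk, not made as a real call,
    # so the Python stack never grows past one frame
    return lambda: countdown(n - 1)

# far deeper than the recursion limit, with no RecursionError
assert trampoline(countdown, sys.getrecursionlimit() * 10) == "done"
```

the tc39 proposal's explicit syntax plays the same role as `recur`: the programmer marks the tail call, so it's a compile error (rather than a silent stack leak) when the call isn't actually in tail position.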
|
# ? May 4, 2016 23:50 |