Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
What's the most comfortable pattern for doing I/O stream filtering work, like compression or deserialization, on a separate thread? Ideally I'm looking for something like a Boost filtering_streambuf that you can easily pass a boost::iostreams::gzip_compressor() to in order to compress or decompress a stream, but with that work scheduled on a different thread than the one writing to or reading from the I/O stream.
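For concreteness, here's roughly the single-threaded baseline I mean (file name made up), with the compression happening inline on whatever thread is doing the writing; the question is how to push that filter work onto another thread:

C++ code:
#include <boost/iostreams/filter/gzip.hpp>
#include <boost/iostreams/filtering_stream.hpp>
#include <fstream>

int main() {
    std::ofstream file("out.gz", std::ios::binary);  // made-up output path
    boost::iostreams::filtering_ostream out;
    out.push(boost::iostreams::gzip_compressor());   // compression runs inline, on the writing thread
    out.push(file);
    out << "hello, compressed world\n";
}   // filters flush and the file closes as everything goes out of scope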

The sanest way that jumps out at me is using a concurrent queue, and having the reader or writer just read or write from the queue and the actual source or sink has the filtering_streambuf, but then I'm adding another layer of buffering and batching to worry about.

I'd also appreciate a lay of the land for concurrent queues. In Java land I default to a basic BlockingQueue if performance isn't critical and the LMAX Disruptor if it is, but in C++ I have no idea what reputation boost::lockfree::queue or a plain std::queue behind a mutex has.


Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Is there any widely-used and ergonomic facility for zero-copy views into const strings? I feel like https://en.cppreference.com/w/cpp/string/basic_string_view is what I'm reaching for, but it looks like I'd need to touch raw pointers by hand to split a string. I have an application that's spending most of its time reading in input lines from a textual tabular format, but the actual useful work is pretty fast. It's not blocked on disk I/O, and right now it's using getline to read whole lines in, passing a pointer to that line through a queue to a pool of worker threads, each of which is wrapping a complete line in an istringstream and reading from it with >> in a loop, which is slow as balls.

The profiler shows that 90% of total CPU time is spent in basic_istream::sentry::sentry, meaning this approach is really expensive. As a first shot I'm just going to do boost::algorithm::split and split into a vector of strings, but there's got to be a better way to do this. A regex iterator would also copy, and since I'm just reading from a const string, there must be a zero-copy way to do this that doesn't involve falling back to manually walking it and saving a collection of pointers. Even if I give up and roll this C-style, I couldn't use strtok if I wanted to, because there are between 8 and 64 threads doing this work at any given time and strtok isn't thread safe.
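To make it concrete, something shaped like this is roughly what I'm after (a sketch assuming tab-separated fields); the views are only valid while the underlying line buffer is alive:

C++ code:
#include <string_view>
#include <vector>

// Split a line into zero-copy views over the original buffer.
std::vector<std::string_view> split_fields(std::string_view line, char sep = '\t') {
    std::vector<std::string_view> fields;
    size_t start = 0;
    while (true) {
        size_t end = line.find(sep, start);
        if (end == std::string_view::npos) {
            fields.push_back(line.substr(start));
            break;
        }
        fields.push_back(line.substr(start, end - start));
        start = end + 1;
    }
    return fields;
}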

Edit: This is basically what I'm wanting to do: https://techoverflow.net/2017/01/23/zero-copy-in-place-string-splitting-in-c/.

Edit again: ohhhh this looks right up my alley: https://www.boost.org/doc/libs/1_70_0/libs/tokenizer/doc/tokenizer.htm

I swear, I am unable to find solutions until after I've made a dumb post about it.

Twerk from Home fucked around with this message at 04:30 on Jul 17, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Xarn posted:

Just two notes on this:

Small strings don't allocate memory, so depending on your data you might be fine either way.
If you need to send the substrings into an API that needs zero termination, you are doomed to a copy anyway.

Thanks to everyone who threw in suggestions, it was extremely helpful. I'm still copying, since I parse the lines into my own data structure, but swapping out istringstream with operator>> for boost::tokenizer cut wall-clock runtime to a tenth of what it was before, and CPU time to a twentieth. That's a 10-20x improvement! The worker threadpool can now be much smaller, or even just one thread, which will let me accept other types of files more easily, with a single thread filling a different, more complicated data structure that would be harder to share between threads.
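For anyone hitting the same wall, the shape of the change was roughly this (separator characters made up), replacing the istringstream >> loop:

C++ code:
#include <boost/tokenizer.hpp>
#include <string>

// Walk one input line in place, handing back one field at a time.
void parse_line(const std::string &line) {
    boost::char_separator<char> sep("\t ");
    boost::tokenizer<boost::char_separator<char>> tokens(line, sep);
    for (const auto &token : tokens) {
        // ... convert and store the field ...
        (void)token;
    }
}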

I'm sure that the mmap'd file could make this thing really rip and get me another 4x performance, but if I mmaped the file then it'd be more complicated to read in compressed input, and I'd be having to copy out uncompressed lines somewhere anyway. Here's an article I stumbled on mentioning how optimized Perl can be for some plain file reading / string parsing situations and comparing it to mmap-ing: https://yongweiwu.wordpress.com/2017/09/14/c-cxx-performance-mmap-and-string_view/

Next issue on the plate for me is that this application and several others my group works with are and have been using unsynchronized bools as flags for communication between threads. I'm pretty sure that these need to be std::atomic<bool>, or protected with mutexes or something, but it's been working like this for years. I assume we've just gotten lucky with undefined behavior so far? Also, I'm glad I had a reference book around, because my Java-brained rear end was about to slap "volatile" on there, but it turns out that volatile does not create memory fences in C++.

Twerk from Home fucked around with this message at 19:21 on Jul 18, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Beef posted:

The non-atomic bools might be working because they are write-once? It's not a bad thing if it is. That is, you are not implementing critical sections with it, because the lack of pipeline flushes and compiler barrier will gently caress that up.

They're being used to tell threads that work is done, in a really hacky way. Multiple threads read from a queue, and once all work has been submitted and the queue empties, the publisher thread sets a shared bool (through a bool*) to true, which all of the threads see and use as a signal to stop polling the queue. The threads poll in a loop and sleep whenever the queue was empty and they didn't get work.

I'm guessing a better approach would be actually using a condition_variable on the lock with wait_for, but then I'd still need to periodically check the flag to see if it should shut down, so in the real world it's not much different, other than a thread getting notified when something new comes through the queue and waking up sooner than with the current sleep-based waiting.

Edit: Actually, I guess I could get rid of timers entirely by having all the threads wait() on the lock protecting the queue, and then notify_one when I put something into the queue; when it's time to shut everything down, set the flag and then notify_all.
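A rough sketch of what I mean, with made-up names: workers block in pop() instead of polling on a timer, shutdown() sets the flag and wakes everyone, and a pop() that returns empty means it's time to exit the worker loop.

C++ code:
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class WorkQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push(std::move(item)); }
        cv_.notify_one();
    }
    void shutdown() {
        { std::lock_guard<std::mutex> lock(mutex_); done_ = true; }
        cv_.notify_all();
    }
    // Blocks until an item is available; returns nullopt only after shutdown()
    // has been called and the queue has drained.
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [&] { return done_ || !queue_.empty(); });
        if (queue_.empty()) return std::nullopt;
        T item = std::move(queue_.front());
        queue_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
    bool done_ = false;
};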

Twerk from Home fucked around with this message at 21:54 on Jul 19, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

nielsm posted:

If you need to notify of/wait for job completion, a std::promise might be the most appropriate way of expressing that.

Yeah, to translate from Java, I'm really missing having a ThreadPoolExecutor that manages its own queue of incoming tasks, can be .shutdown() into a state that allows it to complete all of the tasks currently in the queue, and returns promises. It looks like Folly has some pretty nice ThreadPoolExecutors available: https://github.com/facebook/folly/blob/main/folly/docs/Executors.md and I also see that Boost has a thread_pool: https://www.boost.org/doc/libs/1_79_0/doc/html/boost_asio/reference/thread_pool.html. That's certainly the way we'll go in the future, where appropriate. At a glance, Boost.Asio is header-only, which means it'd be a pretty easy dependency to introduce.

What this application has right now is a std::queue and worker threads polling it in a loop based on a timer, with a shared boolean controlling the loop, as I mentioned. I do want a way to make the queue have a fixed maximum size, to provide backpressure so that if worker threads are not keeping up with the input thread, it will slow adding to the queue. If I do a relatively straightforward bounded blocking queue type solution like this one, then I'd need some way to interrupt threads that are blocked waiting for an item to be added to the queue when it's time to shut down.

It looks like Boost has a solution for that as well, but Boost.Thread is new to me and we're just using std::thread right now. I'm also not 100% confident that I understand the docs fully. It looks to me like boost::thread::interrupt will interrupt threads that are wait()-ing by throwing a boost::thread_interrupted exception, but not threads that are currently doing work, which is pretty cool. That makes it sound like once all work has been submitted to the queue and the queue is empty, I could call .interrupt() on all of the threads, and the next time they hit the wait() in the loop, an exception would be thrown.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
What are everyone's preferred high-speed non-cryptographically secure PRNGs? I've got a hot loop in an application that is now spending >50% of its time in std::uniform_int_distribution<IntType>::operator() with a std::mt19937_64.

There's a ton of options, but I'd appreciate some nuance from the devs here.

https://github.com/lemire/testingRNG
https://lemire.me/blog/2019/03/19/the-fastest-conventional-random-number-generator-that-can-pass-big-crush/

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Xarn posted:

Is the cost mostly in the RNG, or in the distribution? Based on what range you are using, you might be able to bypass the distribution, or use a better implementation.


Anyway, I use PCG for my projects & if it was performance critical, I would replace the std:: one with a better one (Incidentally, Lemire has a blog post about that as well)

Aw jeez, Xarn is our winner. It was a bad misread of profiler results on my part: the time isn't spent getting the next random number, but in translating it to the desired int range. Also, it looks like Lemire got most of his super-fast approach into GNU libstdc++!

Thanks for the tips.

https://lemire.me/blog/2019/06/06/nearly-divisionless-random-integer-generation-on-various-systems/
https://lemire.me/blog/2019/09/28/doubling-the-speed-of-stduniform_int_distribution-in-the-gnu-c-library/
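For posterity, the core trick in those posts is roughly this multiply-shift mapping; it's a sketch only, since the real versions add a rejection loop to remove the slight bias, and __uint128_t is a GCC/Clang extension:

C++ code:
#include <cstdint>

// Map a full-range 64-bit random value into [0, range) without a modulo.
// Slightly biased whenever range doesn't divide 2^64; see the linked posts
// for the (nearly divisionless) rejection step that fixes that.
uint64_t bounded(uint64_t random64, uint64_t range) {
    return static_cast<uint64_t>((static_cast<__uint128_t>(random64) * range) >> 64);
}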

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Is there a convenient, portable C++ way to read in values that were written little endian? I'm aware of endian.h on Linux, but if I wanted something that worked on Macs too what would the simplest option be?

My specific need at the moment is just validating that the first 4 bytes of a file are a magic number, which I can do by reading 4 bytes into an unsigned char array and comparing arrays so that I don't have to muck with endianness. But I'm realizing I don't have a lightweight, general-purpose solution for this handy.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Jabor posted:

Do you need this to be portable to systems with unusual values for CHAR_BIT?

Nope, absolutely not. I am assuming 8 bit bytes.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

pseudorandom name posted:

std::endian and (if you live far enough in the future) std::byteswap

Thanks all!

Bit shifts have always worked and will continue to work, but I bet there was some shiny whizz-bang addition to the stdlib!
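For my future self, the shift version I had in mind looks roughly like this; it needs nothing past old-school C++ and doesn't care about the host's byte order:

C++ code:
#include <cstdint>

// Read a little-endian uint32_t from a byte buffer.
uint32_t read_le32(const unsigned char *p) {
    return static_cast<uint32_t>(p[0])
         | (static_cast<uint32_t>(p[1]) << 8)
         | (static_cast<uint32_t>(p[2]) << 16)
         | (static_cast<uint32_t>(p[3]) << 24);
}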

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
I just watched this talk about strings at Facebook, and their successful / unsuccessful attempts at improving string performance in C++.

https://www.youtube.com/watch?v=kPR8h4-qZdk

Starting at about 21-22 minutes, there's a discussion about their failed attempt to eliminate the null terminator until calling c_str(), at which point it would be added. The core bug that prevented their optimization has to do with:



1. malloc data that spans pages A and B, and only write the bytes that fall within page A; no bytes are ever written to the part of data that's in page B.
2. Another malloc / free happens in another part of Page B, Page B is conditionally returned to the kernel because the part of data that's in Page B has never been written to, and the other memory allocated in B was freed.
3. c_str() is called, and data[size()] is read from the part of data in Page B, which was conditionally returned to the kernel, and thus uninitialized. This is undefined behavior and the kernel returns a 0.
4. Another malloc happens in Page B, and when written to, all of Page B comes back in its previous state, including the uninitialized garbage that was actually at data[size()], so now data[size()] is not 0 even though nothing has written to it.
5. That's a c string that's not null terminated now!

The only discussion I found around it is https://stackoverflow.com/questions/46123029/reading-from-uninitialized-memory-returns-different-answers-every-time, where the answers completely miss the point and assume that Facebook's senior engineers don't know what they're doing and that it's a simpler, stupid bug.

My big question: if you're malloc-ing 256 bytes, writing 128 bytes, and then reading the 129th byte's value, wasn't that always undefined behavior in the first place because you're reading uninitialized memory? I would expect that this problem could have happened without the extra malloc/free/return-page-to-kernel cycle, because data[128] was never written to, so it's fine for the first read of it to return 0 and the second read to return nonzero. If you malloc something and don't write to it, reading it is always undefined, right?

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

pseudorandom name posted:

I don't know why they're doing if data[size()] != '\0' data[size()] = '\0', instead of just unconditionally writing the 0. I skipped the beginning of the video, but I assume they're trying to avoid invalidating cache lines on other CPU cores via unnecessary writes -- if there is already a 0 at that memory location then they don't need to write a 0 at that location again, which means the current CPU core doesn't need to take exclusive ownership of that cache line and cause every other CPU core to evict that line from their cache.

This is apparently exactly what was happening: the build group was calling c_str() on const strings shared by many threads, so the unconditional write was murdering performance.



quote:

The situation I think he's attempting to describe is as follows:

1) jemalloc allocates page B, the program writes data to it, then the program frees the data in page B and jemalloc marks it as DONTNEED because no allocations are using it.
2) the kernel wires up the zero page in at page B, but never gets around to actually freeing the original page B
3) the program allocates the std::string and just the right circumstances result in the string spanning pages A and B, but the string data is too short for it to write to page B
4) c_str() gets called, the std::string reads the first byte of page B and sees that it is 0, so doesn't bother to write the 0 byte as the aforementioned cache optimization
5) the char* gets passed around the program, assumed to be safely NUL terminated
6) the program mallocs more memory, jemalloc returns the unused portion of page B
7) the program writes to the newly-allocated memory in page B, a page fault happens, the kernel does its bookkeeping and sees that page B was marked DONTNEED but is still around, it puts the original page B back into that location
8) the original page B had a non-zero value at byte offset zero due to it being used to store data in step 1, so that cache optimization back in step 4 was wrong
9) now that char* isn't NUL terminated

This looks like a flow that could actually cause it, given the specific behaviors of jemalloc on Linux! The Stack Overflow post I linked mentions that this bug couldn't happen if allocating the string is the first time page B is used, because jemalloc currently returns zeroed pages anyway, so to trigger the bug you'd need a page that was already in use and had some junk in it, and the front of that page would have to be freed before being allocated as part of data.

quote:

Now, jemalloc's only mmap() call uses the flags MAP_PRIVATE | MAP_ANONYMOUS with the occasional inclusion of MAP_FIXED, and in particular it doesn't use MAP_UNINITIALIZED. This means that pages are always zero-initialized when allocated.

Additionally, even madvise() with MADV_DONTNEED will, for anonymous mappings, return "zero-fill-on-demand pages" for anonymous mappings, which I read as "zero-initialized pages."

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Does anyone have recommendations for a static analysis tool that can help identify when transitive #includes are being used within a file, and also mark unused #includes for elimination? Maybe clang-tidy does this? I'll give that a shot to start. I haven't run clang-tidy in a while and remember it being a bit finicky to set up; what's the cleanest way to run it?

I've got a bunch of hosed up scientific computing code that's been depending on transitive includes and specifically ordered includes for years, and I'd like a tool assisted way to rapidly unfuck that problem.

Also, does anyone have a textbook recommendation for CMake? I've just been fumbling my way around, and most of what's out there seems to use old, deprecated ways of doing things; I'd love to get up to date with modern CMake.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

rjmccall posted:

Google made a tool for this called “include what you use” which seems to still be maintained, probably because it was pushed by a very senior employee who didn’t seem to have much else going on. Very nice guy.

Thanks!! This is exactly up my alley. The first time I read your post, I thought you were sarcastically telling me that Google engineers were too smart to make this mistake, and a specific person would yell at people who did.

https://github.com/include-what-you-use/include-what-you-use

This is going to be a time saver once I get it set up. These kinds of problems exist in a ton of the software we use: I've already found 3 projects that got blown up by a change in GCC 11 where including <algorithm> no longer pulls in <limits>, among other things. Bugs like this are what made conda-forge decide to stay on a CentOS 6 sysroot instead of risking the change to CentOS 7.

I had been poking at things manually with g++ -H, which was already finding lots of places that have only been working due to the order of #includes.

Edit: Sweet, it's even packaged in Debian. Thanks LLVM team! https://packages.debian.org/bullseye/iwyu

Twerk from Home fucked around with this message at 04:33 on Aug 30, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Xarn posted:

IWYU is useful, but takes too much manual shepherding for me. It doesn't properly model stdlib, so you will have it tell you to include <iosfwd> to get size_t, include bits/foo (internal libstdc++ header) and so on.

Wow, you were not kidding about this. There's no way that this is how it's supposed to work; I'm going to read the docs / config a bit in the hopes of being able to use its automated fixup tools on a whole lot of projects. There must be a way to make it understand the standard library's public API.

Its current insane suggestions are:
code:
#include <__algorithm/copy.h>            // for copy
#include <__algorithm/find.h>            // for find
#include <__algorithm/max.h>             // for max
#include <__algorithm/min.h>             // for min
#include <__functional/operations.h>     // for plus
#include <__memory/unique_ptr.h>         // for unique_ptr
#include <__utility/forward.h>           // for forward
#include <__utility/integer_sequence.h>  // for index_sequence, make_index_s...
#include <__string>                      // for char_traits<>::pos_type
#include <__tuple>                       // for tuple_size
It did find a whole bunch of missing includes as well though!

Maybe this is a mismatch of iwyu and my local clang environment? I'm on clang-13 at the moment, but using iwyu 0.18. I'll try fixing that first.

Twerk from Home fucked around with this message at 15:58 on Aug 30, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
A couple of solutions to this, in decreasing order of sanity, would be:

1. Finer grained headers. It sounds like you may be growing a gigantic header?
2. The module system introduced in C++20, specifically designed to fix this: https://en.cppreference.com/w/cpp/language/modules
3. Just locally forward declare things that you don't need the full definition of (rough sketch below).
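For option 3, that looks roughly like this (names made up):

C++ code:
// The header only needs a declaration, so it doesn't have to #include the heavy header at all.
class BigDependency;                // forward declaration

class Wrapper {
public:
    void process();                 // defined in the .cpp, which includes BigDependency's real header
private:
    BigDependency *dep_ = nullptr;  // pointers and references only need the declaration
};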

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Does anyone have a microbenchmarking library that they'd recommend? A quick search found:

Google Benchmark: https://github.com/google/benchmark
Nanobench: https://github.com/martinus/nanobench
Picobench: https://github.com/iboB/picobench
Criterion: https://github.com/p-ranav/criterion

I've got one hot function that's driving 70% of my application's runtime and want to be able to quickly experiment with different micro-optimizations, because I've removed everything that I can and want to try some different stuff to get more cache hits.
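In case it helps steer recommendations, the kind of harness I'm picturing is roughly the Google Benchmark shape below; hot_function here is a made-up stand-in for my real one:

C++ code:
#include <benchmark/benchmark.h>
#include <numeric>
#include <vector>

// Stand-in for the real hot function being tuned.
static int hot_function(const std::vector<int> &values) {
    return std::accumulate(values.begin(), values.end(), 0);
}

static void BM_HotFunction(benchmark::State &state) {
    std::vector<int> data(state.range(0), 1);
    for (auto _ : state) {
        benchmark::DoNotOptimize(hot_function(data));  // keep the result from being optimized away
    }
}
BENCHMARK(BM_HotFunction)->Arg(1 << 10)->Arg(1 << 20);
BENCHMARK_MAIN();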

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Thanks everyone! I set up Google Benchmark and it's fine. I'm walking face-first into the pitfalls of microbenchmarking, because getting work out of my hot path means finding another place to do it, or copying a data structure into a more efficient form before running the inner loop, and then I end up having to measure how expensive that stuff is in context.

This application's code is being published when the paper is published anyway, so I'd appreciate a second set of eyes on any way that I can make this poo poo faster, because I'm staring down a choice between cache friendliness vs significantly more instructions. This still remains >60% of total runtime, even after massaging it. We're budgeting millions of hours of CPU time in December to run this, so even incremental improvements would help.

Phenotypes is a vector of int8_t that are usually 0 or 1. There's a handful of -1s, and extremely rarely something else. The size of phenotypes ranges from tens of thousands to almost a million, and is likely to grow to low millions in future usage. Pairs is commonly a single digit multiple of the size of phenotypes, extremely sparse compared to every possible combination of pairs. Because pairs is so much bigger, we're iterating over pairs sequentially, and randomly accessing phenotypes. Pairs is sorted in an attempt to improve locality of phenotype accesses, like [{0,3}, {0, 200}, {2,4}, {2, 19}] and so on.

The lowest hanging fruit that did make huge improvements were:

1. Using a smaller int type for phenotypes. Originally phenotypes was a vector<int32_t>. We only really need 4 values, and 2 of them are exceptional cases.
2. Doing += 0 or 1 where we originally had an if. The compiler wasn't smart enough to eliminate the branch, and we were getting slaughtered by branch misses.

C++ code:
#include <cstdint>
#include <utility>
#include <vector>

// Handles the rare values outside {0, 1}; defined elsewhere, signature assumed from the call site.
void handle_exceptional_case(std::pair<size_t, size_t> &p, int8_t x, int8_t y);

double calculate(std::vector<int8_t> &phenotypes, std::vector<std::pair<size_t, size_t>> &pairs) {
    int64_t cscs = 0;
    int64_t cscn = 0;
    int64_t cncn = 0;

    for (auto &p : pairs) {
        auto &[p1, p2] = p;
        auto x = phenotypes[p1];
        auto y = phenotypes[p2];

        if (x < -1 || x > 1 || y < -1 || y > 1) {
            handle_exceptional_case(p, x, y);
        }

        // Branchless counting: each bool comparison contributes 0 or 1.
        cscs += ((x == 1) && (y == 1));
        cscn += ((x == 0) && (y == 1));
        cscn += ((x == 1) && (y == 0));
        cncn += ((x == 0) && (y == 0));
    }

    // The actual math done here is different, but it's derived from all three
    return cscs / cscn + cncn;
}
I've been poking at it in Godbolt too: https://godbolt.org/z/dKfcPdYj3

The further paths I can see to optimize are:

1. Make a single pass through phenotypes before this to check for values not 0 or 1. There will typically be none, and if we find any we can handle them outside and pass calculate a collection of pairs that only has 0 or 1 values.
2. That lets us eliminate checking from the hot path, and also instead of checking (x == 1) && (y == 1), we can just do x && y, x ^ y, !(x || y), which is faster.
3. Improve cache friendliness by storing the pair indexes as uint32_t instead of size_t and cover this in warnings that it only handles up to 2^32-1 phenotypes. This won't limit us anytime soon.

The stuff that I'm not sure about at all is:
1. Now that phenotypes can only be 0 or 1 here, I could use vector<bool> or some kind of bitset. Better for cache, but many more instructions to pull bools out of it. This is probably a win on big datasets, and a loss on small. I think we're willing to make that tradeoff, or I might have both available and do a heuristic based on dataset size.
2. It's obvious to me that I could add to cscs, cscn, and cncn in batches by reading many pairs into two bitsets and then doing bitwise &, ^, and ~(x_bitset | y_bitset), then += by count() (rough sketch after this list). However, I don't know a fast way to loop and add bools to a bitset that wouldn't involve some bit-manipulation overhead. I was thinking I would just unroll the loop by my bitset size, gathering 64 pairs into two bitsets and then doing the bitwise operations and adding them to the accumulators.
3. There must be some way to vectorize this whole thing, but I'm not seeing it. I need a fast gather operation for bytes. I could over-retrieve and then mask and pack, of course, but that manipulation is overhead that could be spent just loading and adding bytes.
4. Integer math is 1 cycle on Intel, right? So I shouldn't need to add more accumulators to avoid a loop-carried data dependency?
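Here's the rough scalar sketch of the bitset idea from point 2 (C++20 <bit> for std::popcount, exceptional values assumed to be already filtered out, and the pair count assumed to be a multiple of 64 just to keep the sketch short):

C++ code:
#include <bit>
#include <cstdint>
#include <utility>
#include <vector>

// Pack 64 pairs' 0/1 phenotype values into two 64-bit words, then let
// popcount do the counting for the whole batch at once.
void count_batched(const std::vector<int8_t> &phenotypes,
                   const std::vector<std::pair<size_t, size_t>> &pairs,
                   int64_t &cscs, int64_t &cscn, int64_t &cncn) {
    for (size_t i = 0; i < pairs.size(); i += 64) {
        uint64_t x_bits = 0, y_bits = 0;
        for (int b = 0; b < 64; ++b) {
            x_bits |= static_cast<uint64_t>(phenotypes[pairs[i + b].first]) << b;
            y_bits |= static_cast<uint64_t>(phenotypes[pairs[i + b].second]) << b;
        }
        cscs += std::popcount(x_bits & y_bits);
        cscn += std::popcount(x_bits ^ y_bits);
        cncn += std::popcount(~(x_bits | y_bits));
    }
}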

This is only targeting Skylake-server, so AVX-512 is on the table, and the 32-bit gather instructions are also faster than scalar loads: https://stackoverflow.com/questions/24756534/in-what-situation-would-the-avx2-gather-instructions-be-faster-than-individually. It's also likely to only run on Skylake-server for the next 5 years, so doing performance things that are not portable is on the table. L2 is 1MB!

Also, if this isn't welcome here I'll take it to stack overflow. I'm just out of my depth but I know there's some type of bitset shortcut here to sum things up much faster.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
You all had some incredible advice. If you want to see the non-vectorized progression in quick-bench: https://quick-bench.com/q/X4b-hgBsM6HUUK8DBInJadWJ0lY. Quick-bench started running out of memory, so that doesn't even include the final solution I landed on, which is ~15% faster than the best one in there. It's not news to anyone here, but vector<bool> is insanely slow: even for very large datasets that spill out of L2, vector<bool> was consistently slower than a vector<int8_t> holding only 1s and 0s.

I couldn't get quick-bench to do AVX-512, but the AVX-512 implementation was able to do the counting and accumulating faster than the scalar data accesses alone! However, on the real-world, complete application, the down-clocking from AVX instructions was so significant that a scalar implementation ended up faster for the whole program, even though the hot loop itself was a solid 25% faster vectorized than the fastest scalar solution. This was a very surprising result to me!!

Xerophyte posted:

Caveat: I'm not an expert on vectorization, especially not avx512.

I think your current code can be reasonably vectorized by using _mm512_i32gather_epi32 and _mm512_cvtepi32_epi8 to gather chunks of left-pairs and right-pairs to vectorize the arithmetic over, assuming it is safe to read 3 bytes past the end of the phenotype arrays. You have the requisite address offset array already in pairs, albeit in AoS. Unsure if it will be worth using the downconverts to pack 64 int8s instead of just vectorizing over 16 int32s, both are options.

This was a great vectorized solution that was fastest on this loop by a considerable amount. Gather 16 int32s, pack into 16 int8s, AND and XOR the matching pairs, and then use POPCNT on them: https://stackoverflow.com/questions/17354971/fast-counting-the-number-of-set-bits-in-m128i-register. I was surprised that vectorized popcnt doesn't exist until Ice Lake; it's not there on Skylake-X, so I couldn't use it.

cheetah7071 posted:

you could eliminate a few instructions per loop by not calculating cscn, since it will always be equal to pairs.size() - cscs - cncn - (# of exceptional cases). The exceptional case function will have to track the number of times it's been called, but that should be fairly cheap if it's as rare as you say.

This was another big win that I had overlooked. With all of the exceptional cases handled before this loop, only 0/1 values meant that I only needed to accumulate 2 of the 3 sets. I went with cscs and cscn because they were & and ^, respectively.


Zopotantor posted:

How are you sorting? If it's lexicographically, maybe try z-order or tiling to try further improving locality.

Z order sorting helped!

Beef posted:

Most compilers have an option to spit out a vectorization report, an annotated version of your code. I had better results on HPC and genomics codes by making simple loops guided by that auto vectorization feedback. The Intel compiler used to be better for that, but I haven't touched it in a while.

I think that this is a great plan in general and usually works out with sequential data access, but these scatter/gather instructions are so finicky that I think it's unclear to the compiler whether they'd help. These instructions, and AVX-512 in general, are getting much better in newer Intel chips, and I wouldn't be surprised if future compilers start using them more often as gather latencies get lower and downclocking becomes less of a big deal. We have some Ice Lake machines coming in, and I'm going to test on one just to see.

This was the microbenchmark winner, by a mile (25%!!!)
C++ code:
    // Make sure there are at least 3 readable bytes past the end of phenotypes_,
    // since each 32-bit gather reads 4 bytes starting at a byte index.
    if (phenotypes_.capacity() < phenotypes_.size() + 3) {
        phenotypes_.resize(phenotypes_.size() + 3);
    }

    for (; i + 15 < pairs.first.size(); i += 16) {
        // Load 16 left indexes and 16 right indexes.
        auto left_addresses = _mm512_loadu_si512(&pairs.first[i]);
        auto right_addresses = _mm512_loadu_si512(&pairs.second[i]);

        // Gather one phenotype per lane (as a 32-bit load at scale 1)...
        auto lefts = _mm512_i32gather_epi32(left_addresses, phenotypes_.data(), 1);
        auto rights = _mm512_i32gather_epi32(right_addresses, phenotypes_.data(), 1);

        // ...then truncate each 32-bit lane down to its low byte.
        auto left_packed = _mm512_cvtepi32_epi8(lefts);
        auto right_packed = _mm512_cvtepi32_epi8(rights);

        auto cscs_batch = _mm_and_si128(left_packed, right_packed);
        auto cscn_batch = _mm_xor_si128(left_packed, right_packed);

        cscs += popcnt128(cscs_batch);
        cscn += popcnt128(cscn_batch);
    }
But when push comes to shove, in the context of the whole application, the higher clocks from scalar instructions ended up winning on Xeon 6130s. Winning by a hair, less than 5%, but also more portable:
C++ code:
        for (size_t i = 0; i + 1 < pairs.size(); i += 2) {

            const auto [p1_a, p1_b] = pairs[i];
            const auto x1 = phenotypes_[p1_a];
            const auto y1 = phenotypes_[p1_b];

            cscs += x1 & y1;
            cscn += x1 ^ y1;

            const auto [p2_a, p2_b] = pairs[i + 1];
            auto x2 = phenotypes_[p2_a];
            auto y2 = phenotypes_[p2_b];

            cscs2 += x2 & y2;
            cscn2 += x2 ^ y2;
        }
I was pretty surprised that unrolling the loop and adding a second pair of accumulators was more than a 15% win. CPUs are really wide! Also, side note, because I couldn't resist trying this on my MacBook Air: Apple CPUs are the widest of all. With two accumulators like that, my MacBook does all of that counting and summing in identical time to the data access alone, just pulling x and y and not doing anything with them. Also, clearly these all created some extra work, since I have separate loops to handle the leftover elements that the unrolled loop doesn't hit.

Twerk from Home fucked around with this message at 18:25 on Nov 8, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
C++ ergonomics question: How can I return an instance of a derived class from a function that returns a type of a base class without using pointers, and without copying?

C++ code:
struct Base {
    virtual ~Base() = default;
    virtual void foo() = 0;
};

struct Derived : Base {
    ~Derived() = default;
    void foo() override;
};

Base buildABase() {
    return Derived();
}
This doesn't work, because Base is an abstract class, and I can't return an instance of an abstract class.

C++ code:
Base& buildABase() {
    return Derived();
}
Doesn't work because I'm returning a reference to a temporary, or worse I'm returning a reference to something on the stack if I make a non-temporary.

How I can hack around this with what I know:

C++ code:
std::unique_ptr<Base> buildABase() {
    return std::make_unique<Derived>();
}
Also, because I'm a novice at this, if I can get by with default destructors then I want a virtual default destructor on Base and Derived, right? I was expecting to have a pure virtual destructor on the Base type, because Base doesn't have any members of its own, but it looks like pure virtual destructors need a body, and my types can all get away with default destructors.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
I have so many Java habits to unlearn. Building on the above, some more things surprised me:

C++ code:
std::unique_ptr<Base> buildABase() {
    return std::make_unique<Derived>(Derived());
}

struct Consumer {
    Consumer(std::unique_ptr<Base> &&inputBase) : myMember{std::move(inputBase)} {};
    std::unique_ptr<Base> myMember;
};
I have a Consumer that wants a working instance of Base, which is actually a Derived class. I feel like I'm having to repeat myself in the initializer list: even with the parameter taken as an rvalue reference, it doesn't compile if I just say myMember{inputBase}; I have to explicitly tell it to move.

This makes me realize that there's absolutely some extraneous copies elsewhere in these applications, and I've been assuming the compiler is better at figuring out when things can be moved than it really is.

Twerk from Home fucked around with this message at 05:11 on Nov 16, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

cheetah7071 posted:

this is probably not doing what you want. It's calling the constructor for Derived which takes the output of Derived() as an argument: in other words, the copy constructor. If you want to construct in place with the default constructor, just do std::make_unique<Derived>()

The syntax for what you're trying to do in Consumer is a little weird because unique_ptr is specifically for cases where only one portion of the code ever needs access to the object. If the calling function constructs the unique_ptr and then passes it to Consumer's constructor, then two parts of the code have access to it simultaneously, which breaks the core guarantee of the class, so it doesn't compile. The two options are to use a shared_ptr instead (if you really want to allow both the calling function and the Consumer object to access it) or to have the constructor for Consumer instead take the arguments you need for it to construct its own unique_ptr

If you really need to do complicated manipulations on the object in the calling function, so passing an actual pointer to a Base object is really needed, and also you really want the functionality of unique_ptr, you do have an option, but it's kind of ugly. You can do something like:

C++ code:
struct Consumer {
	Consumer(Base* po_base) : myMember(po_base) {};
	std::unique_ptr<Base> myMember;
};
In this version, the constructor takes a raw pointer, builds a unique pointer around it, and then assumes responsibility for it. If you wanted to get really fancy you could do something like:

C++ code:
struct Consumer {
	Consumer(Base*& po_base) : myMember(po_base) {po_base = nullptr;}
	std::unique_ptr<Base> myMember;
};
where you take control of the pointer and delete the caller's copy of it to really emphasize that it's yours now

this is still kind of dangerous because if the pointer refers to a stack-allocated object this doesn't do at all what you want and in 99% of cases it'll just be easier to avoid wanting to do this than to try to make it work

Thank you! Don't these two do the same thing? I thought that by calling std::move on the unique_ptr passed in, this is basically what I was doing:

C++ code:
    Consumer(std::unique_ptr<Base> &&inputBase) : myMember{std::move(inputBase)} {};
    Consumer(Base*& po_base) : myMember(po_base) {po_base = nullptr;}
Also, I thought that compilers were smart enough to automatically use the move constructor if you pass a temporary into the constructor and a move constructor exists? I'll be more careful about make_unique for sure.

In my specific case, to get concrete, I have something that's processing streaming data, and I don't want to put the whole "is it compressed? We support multiple types of compression. Is it even coming from a file?" logic into the Consumer. Instead I want to pass in a ready-to-go std::istream, and I do want ownership of it to move to the consumer. This way we can have the logic to turn paths or sockets or whatever into working istreams in one place, and the different ways to actually consume that data in another.

cheetah7071 posted:

to maybe emphasize a bit more clearly why the code snippet you posted is something that's kind of awkward, you're envisioning a scenario where the Consumer object is responsible for deleting the Base object, but where the Base object already exists in the calling function. Like, you can make it work with careful coding, but it's awkward


Yes, I'm setting up istreams and passing them into things that will read through all of the istreams and then they can go away once the Consumer is done with it. The consumer internally is breaking the data into chunks and then passing a unique_ptr to each chunk onto a synchronized queue, that lots of worker threads are pulling from, so we're already doing something somewhat like this elsewhere in the application. Things get std::move-ed into the queue, and std::move-d out.

Twerk from Home fucked around with this message at 05:32 on Nov 16, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Xerophyte posted:

Edit: for the use case at hand

1: Moving ownership of istreams is ... fraught, and not generally possible. Their move constructor and move assign are protected for good reasons, and trying to work around that by making a unique_ptr to the stream is in general a footgun. What if you want to take piped input from cin, for example?

Options:
- Make consuming data a function call that takes a reference to the stream, rather than have a persistent Consumer object that takes ownership. Having a Consumer class can be an implementation detail, one that doesn't need ownership.
- Just read data to buffers, then pass buffers to Consumers.
- Manage the lifetimes elsewhere and just let the Consumer take a reference to the stream, if you really have to.

Depends on the particulars of your situation and data, I suppose.

2: Not sure how relevant this is but be aware that extending the streaming libraries is as I understand it quite tricky if that's a thing you think you might need. You need to create a class derived from std::basic_streambuf, then construct a std::istream from that class. It's not straight-forward.

I'm going to have a think about the factoring of this application and see how I can move less poo poo around. What if I want to take piped input from cin? cin is an istream, so I could either pass it in directly, or more likely add cin as a source for a boost filtering_stream, because we have compressed data flowing around. I'm guessing your concern is that it'd be easy to accidentally end up with multiple places reading from cin at the same time and blow my face off? I'm wanting the consumer to manage the stream because one part of the application is opening the multiple files that we're reading from, and the consumers get used in different places to read through those files.


Xarn posted:

Finally there is a discussion on whether you should accept unique_ptr (and other move-only types) by value or by &&. I am firmly in the side of value, but using && isn't actually wrong either. You just need to be careful about the difference between

C++ code:
template <typename T>
void sink1(std::unique_ptr<T>&& input);

template <typename T>
void sink2(T&& input);
because they are very different.

Whether you accept them by value or by &&, don't you have to call std::move at the call site anyway? Calling sink1(myFoo) with a std::unique_ptr<Foo> and a sink1 that accepts by value will try to call the deleted copy constructor, hence the need for sink1(std::move(myFoo)), which makes me feel like I should be accepting a && anyway.

giogadi posted:

Here’s a great treatment on the subject of polymorphic vectors:

https://johnysswlab.com/process-polymorphic-classes-in-lightning-speed/

But my main point is to wonder whether there are language features that could make this kind of thing easier - all of these solutions, while clever and effective, take a lot of work to implement

I'm a little surprised at just how clunky basic dependency injection feels. I get that most C++ work at this point is done in the context of older codebases with a ton of internal libraries available to solve most of these problems, but the language and stdlib itself could absolutely offer some more ergonomic ways to do this.

Ihmemies's point about unordered_map::contains not existing until C++20 is such a great example of the C++ stdlib being huge but still missing basics.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Xerophyte posted:

My concern is that you said "I want to pass in a ready-to-go std::istream, and I do want ownership of it to move to the consumer." istream has disabled moves specifically to ensure it is not possible to do that, because it is not in general a safe operation. An istream may be a system resource that cannot be locally owned. Moving is only enabled for classes derived from istream that represent something that can be owned and have its ownership transferred, notably ifstreams.

The discussion gave me the impression that you intended to get around this by passing around std::unique_ptr<std::istream>s, which technically will work as long as you only use them for streams that can be moved (or, well, constructed/deleted locally). It smells and I would try to not do that, though.

When I came into this, the proof of concept was plain std::ifstream only, no polymorphism. The whole world handles these files gzip compressed, and my lab uses a lot of zstd compression, so I used boost::iostreams::filtering_stream, but wanted to be able to leave parts of the application untouched and still using plain ifstreams. boost filtering_streams are istreams, but not ifstreams, and all that I need from any of them is the istream interface.

You explained pretty clearly why I can't assume every istream is move-able, and I guess I should just change all of these to boost filtering_streams instead. I just came in with Java brain and saw "We are only using these as istream? Let's just type the whole thing std::istream." They're all only std::ifstream or boost::iostreams::filtering_istreams anyway.

Side note: Is boost::iostreams still the right thing to reach for when doing basic compression / decompression of gzip and zstd? Bonus points for something that uses Intel isa-l instead of zlib, as it's night-and-day faster and we use igzip. I've been thinking about writing a boost filter that links against isa-l myself, but my group tends to just use zstd instead. I feel like other users that are using gzip would sure appreciate it, though.

Xarn posted:

Calling sink1 will not always zero-out the passed pointer, because std::unique_ptr<Data>&& is a reference and thus the caller's unique ptr instance will only be modified after the coinflip. This is different from sink2 where caller's pointer is always moved-out, because it has to be moved into the function before the coinflip happens.

This is why I prefer taking these types by value, because then I can be sure that when the function exits, no matter the way it does so, my input has been already taken care of and I can't see the old value. This makes it harder to get to inconsistent state.

Oh wow, I would never want a pointer to not be moved into the function when calling it with std::move, so you've convinced me that value semantics are safer for this situation. I can already think of a handful of exciting ways that && and not moving it could explode spectacularly.


cheetah7071 posted:

I'll put up with all the C++ nonsense just to have namespaces and std::vector

Tired: C-with-classes style code
Wired: C with vector

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Ihmemies posted:

So if the program quits they stay in memory? What kind of sense that makes?

No, if you close the process, it will go away.

However, you will leak memory while it is running: any cycle of shared_ptrs pointing at each other will keep the pointed-to objects alive as long as the process runs. You have to manually break cycles with a weak_ptr somewhere in the cycle, which will not keep the pointed-at object alive.
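A minimal illustration of the cycle problem and the weak_ptr fix:

C++ code:
#include <memory>

// If both links were shared_ptr, these two nodes would keep each other alive
// for the life of the process; the weak_ptr back link breaks the cycle.
struct Node {
    std::shared_ptr<Node> next;  // owning link
    std::weak_ptr<Node> prev;    // non-owning back link
};

int main() {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->next = b;
    b->prev = a;  // does not keep 'a' alive
}   // both nodes are destroyed here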

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Ihmemies posted:

So, I got feedback about a project and it said I need to initialize class variables. I was like ?????

Use initializer lists in your constructors, like this: https://en.cppreference.com/w/cpp/language/constructor
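i.e., something shaped like this (made-up members):

C++ code:
// Members are initialized in the initializer list rather than assigned afterwards.
struct Widget {
    int count;
    double ratio;
    Widget(int c, double r) : count(c), ratio(r) {}
};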

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Is there any easy way that I'm missing in C++ to quickly allocate memory for a vector of vectors? I'm doing a lot of relatively small allocations and it's slow.

I've got a genomics HPC application that I've been tuning to run better, that's what all the posting about vectorization was earlier. After making a pretty big win with massif to decrease peak memory usage, we're chasing better total throughput by launching many processes at the same time now that it uses less memory and we can fit more into a single worker node.

The application has three big phases: preparing a large, read-only data structure that's made up of a vector of 100k-1 million vectors of bytes, each 50k-1m bytes long; then spawning threads to do all its work, with a single parser thread throwing lines from the input onto a synchronized queue and all the worker threads pulling from it and writing to a single output thread via another synchronized queue; and then at the end, sorting the output on a single thread using a fixed amount of memory.

We're running into Amdahl's law, where we waste more scheduled CPU time as we use more worker threads, and I've gotten the memory usage down to basically just the single large data structure + less than 100MB of other stuff, so I'm trying to request less memory and fewer threads and pack a bunch of these onto a single worker node at once. The unexpected bottleneck is that the more processes we have in the allocation phase at once, the longer the allocation phase takes for all of them, in a pretty big way! I checked I/O metrics and we are not limited on disk I/O or apparently exhausting memory bandwidth (as far as I can see via a naive check), so my question is:

Is there a system-wide memory allocation bottleneck? What could I do to alleviate this with a vector of vectors? Each inner vector is copy-assigned from an existing vector and then mutated without ever adding more items, which should only allocate once. The set I'm testing this on only allocates about 100k vectors of 60k bytes each, but with 20 processes sitting on a single node, it takes much longer than forecast on a less-loaded node.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

nielsm posted:

Which language version are you on? On the newer versions it should be relatively painless to make an arena allocator where you can allocate one huge slab of memory and tell all the inner vectors to take their memory via that allocator. Those arena allocators can then be per-thread or whatever.
And also pre-allocate with vector.reserve(), instead of growing dynamically, if you know how many items you need.

Thanks for this, an arena allocator is exactly the kind of solution here, where I have a big vector of vectors of known size (at runtime) and the same lifetime.



I'm calling reserve on the vector that holds the vectors, and each subsequent vector is copy-assigned from an existing vector of the final length, which I hope will only allocate once. It's just that doing 100k-1m separate allocations is slow, because I have 100k-1m separate vectors.
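For the archives, the rough shape of what I'm going to try, sketched with C++17 std::pmr and made-up sizes: one big upfront buffer backs every inner vector, so filling the structure doesn't hit the global allocator hundreds of thousands of times.

C++ code:
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <vector>

int main() {
    constexpr std::size_t outer = 1000, inner = 60'000;
    std::pmr::monotonic_buffer_resource arena(outer * inner + 4096);

    // The outer vector's allocator propagates to each inner vector it constructs.
    std::pmr::vector<std::pmr::vector<std::int8_t>> data(&arena);
    data.reserve(outer);
    for (std::size_t i = 0; i < outer; ++i) {
        data.emplace_back(inner, std::int8_t{0});  // allocates from the arena, not the global heap
    }
}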

I've got a big templated function that will have some small internal differences in flow based on what the template type actually is. Am I going to hate my life for using if constexpr, like

C++ code:
if constexpr (std::is_same<T, int8_t>::value) {
    // Specific code for T being a signed char
}
or is that a relatively safe and sanity-preserving path to go down? The alternative I see is duplicating a bunch of code by having a separate specialization of the function for int8_t. What scares me is that I know typeid(T) and std::is_same<T, whatever> behave differently, and typeid(T) is runtime-only.
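A rough sketch of the shape I mean (made-up function); the nice part is that the discarded branch is never instantiated, so the int8_t-only code doesn't have to compile for the other instantiations:

C++ code:
#include <cstdint>
#include <cstring>
#include <type_traits>

template <typename T>
T read_value(const unsigned char *p) {
    if constexpr (std::is_same_v<T, std::int8_t>) {
        // Single byte: nothing to assemble.
        return static_cast<std::int8_t>(p[0]);
    } else {
        T value;
        std::memcpy(&value, p, sizeof(T));  // wider types: bulk copy, native byte order assumed
        return value;
    }
}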

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

cheetah7071 posted:

So, in the template solution, I'd have a template which essentially boils down to a boolean "is this being run in a real environment, or a test environment" on every class dependent on user parameters?

If I'm understanding correctly then your real code would use Foo<Singleton> and your test code would use Foo<Spoofer>, and then you just call T::getSingleton() (and update spoofer to have the same signature as your real singleton).

I had not thought about or ever seen templates for compile-time DI, but my code is going to have fewer virtual functions in the future thanks to this.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Just here to echo that the landmines around ever subtracting from a size_t are real and one of the handful of moments that actually made me swear out loud during my return to C++ over the last few months.

It should have been signed. I guess that we're still living with the decisions made when pointers were much smaller, and will be forever.
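For anyone who hasn't stepped on it yet, the whole landmine in a few lines:

C++ code:
#include <cstddef>
#include <iostream>

int main() {
    std::size_t a = 2, b = 5;
    // Unsigned subtraction wraps instead of going negative:
    std::cout << a - b << '\n';  // prints 18446744073709551613 on a 64-bit system

    // The classic variant: i >= 0 is always true for an unsigned i, so a loop like
    //   for (std::size_t i = v.size() - 1; i >= 0; --i) { ... }
    // never terminates (and underflows when v is empty).
}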

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Ihmemies posted:

Does anyone use Qt creator? I want to limit the clang to specific C standard: https://clang.llvm.org/c_status.html

It says I can use something like -std=c90

Qt says it's an invalid option and discards it if I try to use it:



Any pro tips? Do I need to put it somewhere else?

What C standard are you wanting to use? C90 is not a standard; as the clang docs you linked mention, you probably want c89 if you're intentionally writing C that builds with ancient or tiny compilers, but I'm curious what those are? C11 is really widely supported at this point.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
I'm dumb, ANSI C is apparently also called c90 as well as c89, my bad.

Yes, you want your compiler in that mode, and then just turn up the warnings.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
What are the downsides of statically linking libstdc++ or libgcc? I'd appreciate an authoritative source on static vs dynamic linking of these libraries in general, because I interact with tons of tools that have very questionable build processes, frequently including statically linking glibc, libgcc, and libstdc++ with -static, -static-libgcc, or -static-libstdc++. Does anyone have any textbooks and/or posts they'd recommend on this subject?

I know that statically linking glibc is a bad idea because it will still need to load the system's glibc anyway, and it's possible to get weird failures. I've had a decent track record of getting small improvement PRs accepted to these kinds of tools, and I'm wondering if I should focus my efforts only on dynamically linking glibc, or if it's worth trying to find a portable way to dynamically link libgcc and libstdc++ too?

The best thoughts I've found on this are a decade old Stack Overflow discussion that links a blog post from 2005: https://stackoverflow.com/questions/13636513/linking-libstdc-statically-any-gotchas.

As much as I would like efforts to move towards shipping your dependencies as dynamic libraries and using LD_LIBRARY_PATH to use them, I feel like baby steps are the easiest place to start. Hell, I also have started seeing more and more Docker containers that just have statically linked binaries in them, but that's a slightly different thing.

Edit: Related, I've never really worked with RPATHs or been the one to set them or think about them, but that behavior mentioned where you can set RPATH to $ORIGIN to make a binary search for .sos next to it sounds like something that I'd want most of the time. I've always envied how Windows handles .dlls compared to having to use environment variables to manage finding libraries on Linux.

Twerk from Home fucked around with this message at 19:53 on Oct 27, 2023

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Are you submitting code and the professor chooses how to compile it and with what flags, or is there room for somebody to forget to pass flags to optimize and get a bad grade?

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Xarn posted:

So I have this in my code

C++ code:
#if defined(__cpp_lib_uncaught_exceptions)
#  define YEP_WE_CAN_USE_UNCAUGHT_EXCEPTIONS
#endif // __cpp_lib_uncaught_exceptions
and later,
C++ code:
bool uncaught_exceptions() {
#if defined(YEP_WE_CAN_USE_UNCAUGHT_EXCEPTIONS)
        return std::uncaught_exceptions() > 0;
#else
        return std::uncaught_exception();
#endif
}
(names changed slightly :v:)

Today I got a bug report from someone building for MacOS, targetting 10.9, that the build fails with

code:
error: 'uncaught_exceptions' is unavailable: introduced in macOS 10.12 - see https://conda-forge.org/docs/maintainer/knowledge_base.html#newer-c-features-with-old-sdk
        return std::uncaught_exceptions() > 0;
$BUILD_PREFIX/bin/../include/c++/v1/exception:185:63: note: 'uncaught_exceptions' has been explicitly marked unavailable here
_LIBCPP_FUNC_VIS _LIBCPP_AVAILABILITY_UNCAUGHT_EXCEPTIONS int uncaught_exceptions() _NOEXCEPT;
And I want to know, how is MacOS a real platform real people build for? Why the gently caress does the library define the feature macro, just to mark the function as "UNAVAILABLE, gently caress YOU FOR EVEN TRYING"?

Did you read the docs that were linked? That person isn't just building on a Mac, they're building inside of conda, which is its own incredibly hosed up environment. On Linux I've run into a whole host of problems because Conda by default ships and compiles against a CentOS 6 sysroot, which isn't actually C++11 compatible. They considered moving to a CentOS 7 sysroot last year in 2022, and decided that it was too soon and to stick with CentOS 6.

The docs on that link say that the build failure only happened because

quote:

The libc++ library uses Clang availability annotations to mark certain symbols as unavailable when targeting versions of macOS that ship with a system libc++ that do not contain them. Clang always assumes that the system libc++ is used.

However, since conda-forge ships its own (modern) libcxx we can ignore these checks because these symbols are in fact available. To do so, add _LIBCPP_DISABLE_AVAILABILITY to the defines.

It looks like conda on macOS as configured uses the system Clang but ships its own libc++. The system Clang pre-emptively tells you that uncaught_exceptions is not available because it isn't in the system libc++, but if you're building in conda you aren't using the system libc++. Conda has so many nasty edge cases; welcome to this one.

Also, Mac OS 10.9 launched in 2013 and support ended in 2016, so tell them good luck with their problems on an OS that has been out of support for 7 years.

Edit: All this to say, if you want this to work in conda on Mac OS, add -D_LIBCPP_DISABLE_AVAILABILITY to your CXXFLAGS. I've also had a lot of maintainers just tell me "conda isn't supported, gently caress off, use a real distro" because at its core, conda has turned into the stupidest possible rolling release distro of linux.
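In a conda-forge recipe that would just be something like this (a sketch, assuming a build.sh-style recipe where CXXFLAGS is already exported by the activation scripts):

code:
# build.sh: tell the conda-provided libc++ headers to skip the Apple availability
# annotations, since we link against conda's libc++ rather than the system one.
export CXXFLAGS="${CXXFLAGS} -D_LIBCPP_DISABLE_AVAILABILITY"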

Second edit: The other way that I would tackle this is that you can also use the non-Apple vanilla clang compiler on Mac OS in Conda by adding "llvm-openmp" to your dependencies in the conda meta.yaml. Apple clang does a lot of weird stuff so people tend not to use the system clang in order to make things work more consistently across Mac OS / Linux.

Twerk from Home fucked around with this message at 16:16 on Dec 11, 2023

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Xarn posted:

I've read that specific section.

But as I understand it, the sequence of events goes
* Conda targets some ancient version of MacOS. Not great, but also not a crime.
* libc++ has the ability to tell you that such and such API is not available in the target version of MacOS, because it knows it is tightly bound to specific version. This is actually really nice.
* libc++ happily defines the macro for a feature it will not let you use when targeting older versions. <-- this is actually terrible and can gently caress right off.

I haven't run into this landmine personally, but my impression of what's happening is:


* Conda targets some ancient version of MacOS. Not great, but also not a crime.
* Apple clang has the ability to tell you that such and such API is not available in the target version of MacOS's system libc++, because it knows the system libc++ is tightly bound to a specific OS version. This is actually really nice.
* You aren't actually using the system libc++! You are linking against a newer libc++ that conda-forge provided, which does have the features you want. The macro is defined because those symbols are there and usable. However, the system clang doesn't know you're using a different libc++, so it reports the function as unavailable before the link step would ever have gone looking for the symbol.

Edit: the suggested fix is just telling clang not to check, because when you link it will actually work.
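If you'd rather not trust the feature macro alone on Apple targets, one possible hedge (a sketch, not something I've shipped; __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ is clang's predefined deployment-target macro, where 101200 means 10.12) would be to also check the deployment target before defining your internal macro:

C++ code:
// Only claim std::uncaught_exceptions() support when the feature-test macro is
// set AND we are not deploying to a macOS older than 10.12, where the system
// libc++ lacks the symbol. A build that knowingly bundles its own libc++ can
// still force it on with -D_LIBCPP_DISABLE_AVAILABILITY as above; everything
// else falls back to the old std::uncaught_exception() path.
#if defined(__cpp_lib_uncaught_exceptions) && \
    (defined(_LIBCPP_DISABLE_AVAILABILITY) || \
     !defined(__ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__) || \
     __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ >= 101200)
#  define YEP_WE_CAN_USE_UNCAUGHT_EXCEPTIONS
#endif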

Twerk from Home fucked around with this message at 17:00 on Dec 11, 2023

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

Plorkyeran posted:

This specific problem is Conda's fault. They're shipping their own copy of libc++ to enable backdeployment to older OS versions, including their own copy of the headers, but they haven't updated the headers to reflect that and instead the headers are configured for using the system deployment. Bundling libc++ isn't a particularly exotic thing and they're just doing it wrong; they're supposed to be defining _LIBCPP_HAS_NO_VENDOR_AVAILABILITY_ANNOTATIONS.

This would have to go into the headers that conda-forge bundles with their newer libc++? That's probably an easy fix, and I've gotten them to accept all sorts of other stuff too. The conda-forge project just sort of happened; nobody is really making sure that everything works or is done the right way.


rjmccall posted:

If you’re building with the macOS SDK and using a macOS target triple but not actually targeting macOS, that seems like a You Problem.

Do you mean compiling stuff on Mac OS in the conda-forge environment? They ship their own libcxx, headers, all libs, and sometimes their own compilers but sometimes the system compiler. I am way past my depth here (like everyone who touches conda, it seems) and don't know quite what you mean.

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
I've been having a fun time running some micro-benchmarks on my M1 Air with Apple Clang and marveling at how differently Apple Clang on Mac OS on Apple Silicon behaves vs GCC on Linux on x86. I realize that the three layers of differences make it tough to isolate why something is radically faster, slower, or just plain different, but it's still been interesting and enjoyable just to see.

Initial observations are that the Accelerate BLAS library is blazing fast, likely because of the Apple AMX coprocessor, but vector performance seems generally worse than what you get on modern x86 systems. This could be some combination of auto-vectorization being worse for clang on aarch64-apple-darwin, a less vectorized standard library, or 128-bit-wide NEON just lacking both the width and the instructions of AVX2.

I've also seen an unexpected optimization happen repeatedly that surprised me: what looks like automatic caching of function call results. Can someone explain this to me? I'm pretty surprised that C++ optimizers are this ambitious. It's definitely not just the branch predictor learning the branches; the compiler seems to be optimizing out repeated calls to a function with the same arguments, which breaks the benchmark entirely.

Here's an example of both things I'm seeing, that is, slow vectorization and the call to the function under benchmark apparently being optimized out. The full example, ready to compile, is here:
https://lemire.me/blog/2019/06/18/how-fast-is-getline-in-c/
https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/blob/master/2019/06/18/getline.cpp
If anybody wants to go deep, here it is in Godbolt with clang on aarch64

The relevant bits are here:
C++ code:
__attribute__ ((noinline))
size_t count_returns(char * data, size_t length) {
    size_t answer = 0;
    for(size_t i = 0; i < length; i++) {
      if (data[i] == '\n') answer++;
    }
    return answer;
}

void count_lines_benchmark(size_t size) {
    printf(" %s\n", __PRETTY_FUNCTION__);
    bool verbose = true;
    printf("generating and parsing input having %zu bytes.\n", size);
    int repeat = 10;
    std::vector<char> blob(size + 1, 'a');
    blob[size] = '\0';
    // average line length should be 80
    for(size_t br = 0; br < size / 80; br++) {
      blob[rand() % size] = '\n';
    }
    size_t c = 0;
    BEST_TIME_NS((c = count_returns(blob.data(), blob.size())), , repeat, size, verbose);
    printf("bogus = %zu \n", c);
}

int main(int argc, char** argv) {
  count_lines_benchmark(10000000);
  get_line_benchmark(10000000);
  get_line_predictable_benchmark(10000000);
  return EXIT_SUCCESS;
}
When I run this with repeat set to any number greater than 1, I get impossibly fast results:
pre:
 void count_lines_benchmark(size_t)
generating and parsing input having 10000000 bytes.
(c = count_returns(blob.data(), blob.size()))               :  41 ns total,  0.00 ns per byte  or 227151.85 GB/s 
bogus = 124230 
 void get_line_benchmark(size_t)
generating and parsing input having 10000000 bytes.
sum_line_lengths(ss)                                        :  28198375 ns total,  2.82 ns per byte  or 0.33 GB/s 
 void get_line_predictable_benchmark(size_t)
generating and parsing input having 10000000 bytes.
sum_line_lengths(ss)                                        :  28372542 ns total,  2.84 ns per byte  or 0.33 GB/s 
When I set repeat to 1, so that the function only runs once and there is no repeated call to optimize out, I see saner numbers:
pre:
 void count_lines_benchmark(size_t)
generating and parsing input having 10000000 bytes.
(c = count_returns(blob.data(), blob.size()))               :  4142333 ns total,  0.41 ns per byte  or 2.25 GB/s 
bogus = 124230 
 void get_line_benchmark(size_t)
generating and parsing input having 10000000 bytes.
sum_line_lengths(ss)                                        :  28193042 ns total,  2.82 ns per byte  or 0.33 GB/s 
 void get_line_predictable_benchmark(size_t)
generating and parsing input having 10000000 bytes.
sum_line_lengths(ss)                                        :  28329542 ns total,  2.83 ns per byte  or 0.33 GB/s 
For reference, these are the numbers that I'm seeing on GCC 9.4.0 on a decade old Core i5-4250U, with 10 repeats:
pre:
 void count_lines_benchmark(size_t)
generating and parsing input having 10000000 bytes.
(c = count_returns(blob.data(), blob.size()))               :  3473586 ns total,  0.35 ns per byte  or 2.68 GB/s 
bogus = 124250 
 void get_line_benchmark(size_t)
generating and parsing input having 10000000 bytes.
sum_line_lengths(ss)                                        :  6283024 ns total,  0.63 ns per byte  or 1.48 GB/s 
 void get_line_predictable_benchmark(size_t)
generating and parsing input having 10000000 bytes.
sum_line_lengths(ss)                                        :  3708638 ns total,  0.37 ns per byte  or 2.51 GB/s 
Am I misunderstanding what's happening? Why and how is it optimizing out the call to the function under benchmark?

Edit: and I just saw the inline assembly in the benchmark code. That would do it. It's just hosed on ARM.
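For what it's worth, the usual portable replacement for that kind of x86-only inline asm is an empty asm block that takes the result as an input, which works with GCC and clang on both x86 and AArch64. This is roughly the trick from Chandler Carruth's microbenchmarking talk / Google Benchmark's DoNotOptimize, sketched here rather than the macro from Lemire's harness:

C++ code:
// Forces the compiler to treat `value` as used, so the computation that
// produced it cannot be discarded; the "memory" clobber also prevents the
// optimizer from caching memory contents across the barrier. GCC/clang only.
template <class T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Usage inside the timing loop:
//   c = count_returns(blob.data(), blob.size());
//   do_not_optimize(c);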

Twerk from Home fucked around with this message at 21:07 on Dec 26, 2023

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

seiken posted:

Assuming this isn't some kind of school assignment, I can recommend https://github.com/cameron314/concurrentqueue

This is crazy overkill for this use case vs. a std::queue with a std::mutex. I'd do something a lot more like this: https://morestina.net/blog/1400/minimalistic-blocking-bounded-queue-for-c.

I guess if this is for a school assignment, that might be looking at the answer key, though.


Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.

giogadi posted:

I always find it sad when we have to give context like “it’s for an assignment” to justify doing something in a weird (or worse) way. Surely if something is worth learning about in school, then there must be some way to apply it that actually makes sense in practice too.

I think all the time about how I’d teach stuff if I were a teacher, but I wonder how much is out of the teachers’ control due to constraints on curriculum, etc

For what it's worth, I wouldn't reach for lockfree as a default option for a low-contention shared queue in the real world either. It seems insane to me that there isn't a simple blocking synchronized bounded queue in the stdlib or boost.
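Something like this is all I mean: a minimal sketch of a mutex + condition_variable bounded queue (no close/shutdown signalling, no timeouts, so don't treat it as production-ready):

C++ code:
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <utility>

// Bounded blocking queue: push blocks while full, pop blocks while empty.
template <class T>
class blocking_bounded_queue {
public:
    explicit blocking_bounded_queue(std::size_t capacity) : capacity_(capacity) {}

    void push(T item) {
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [&] { return queue_.size() < capacity_; });
        queue_.push(std::move(item));
        not_empty_.notify_one();
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [&] { return !queue_.empty(); });
        T item = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();
        return item;
    }

private:
    std::size_t capacity_;
    std::queue<T> queue_;
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::condition_variable not_full_;
};
Under low contention this is plenty fast, and the bounded capacity gives you backpressure for free, which is usually what you actually want in a producer/consumer pipeline.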
