Sweeper
Nov 29, 2007
The Joe Buck of Posting
Dinosaur Gum

giogadi posted:

Sorry, I think I was being unclear. This is what I'm suggesting:

code:
// foo.cpp
int HowdyPardner() {
    return 5;
}
code:
// main.cpp
extern int HowdyPardner();

int main() {
    int x = HowdyPardner();
    return x;
}
This avoids the use of a header file for main.cpp to use the functions in foo.cpp. If I add more functions to foo.cpp, it won't require recompiles of main.cpp. Like I said, this of course has issues like requiring to keep the extern declarations in-sync with the definitions. My question is: are there any other downsides to this approach that I'm not aware of? Obviously it becomes more problematic with lots of users of the code in foo.cpp, because if a function header changes, I'd have to go through and change not only callsites, but also extern declarations.

We do this, but we put the extern declarations in the header of the module the function belongs to. It cuts down on compile times and makes it clear which parts of the different modules we're using.


Methanar
Sep 26, 2013

by the sex ghost

giogadi posted:

I do dev on an old laptop and my compilation times in c++ are just excruciating.

Is this the sort of problem that's worth throwing money at?

If you're doing dev work for any commercial reason whatsoever, a current 2000 dollar workstation is pennies and pays for itself in like a week.

giogadi
Oct 27, 2009

You’re 100% right that the simplest answer is to just get a faster machine, but in my naive heart it just feels wasteful to get a new computer when this 9 year old MacBook Pro is super fast still! Just compile times suck when making a change to a commonly included header.

Overall I tend to hold onto hardware until it’s actually unusable, and part of my reason for developing on this old thing is to ensure that other folks on old hardware will be able to run and enjoy my game

e: just to be even more annoying: I wonder all the time whether compilers, applications, games, etc would be better if dev shops just didn’t exclusively use the fastest machines available

giogadi fucked around with this message at 21:18 on Sep 2, 2022

leper khan
Dec 28, 2010
Honest to god thinks Half Life 2 is a bad game. But at least he likes Monster Hunter.

giogadi posted:

You’re 100% right that the simplest answer is to just get a faster machine, but in my naive heart it just feels wasteful to get a new computer when this 9 year old MacBook Pro is super fast still! Just compile times suck when making a change to a commonly included header.

Overall I tend to hold onto hardware until it’s actually unusable, and part of my reason for developing on this old thing is to ensure that other folks on old hardware will be able to run and enjoy my game

e: just to be even more annoying: I wonder all the time whether compilers, applications, games, etc would be better if dev shops just didn’t exclusively use the fastest machines available

If you can't play it on a PowerBook G4 it's not worth playing. :colbert:

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
Tracking symbol-by-symbol dependencies well enough to figure out what changed and what needs to get recompiled is a ton of work, and it’s really easy to either miss a dependency (and thus miscompile a lot, training users to do clean rebuilds and defeating all the work you put into it) or register a ton of false dependencies (so that in practice you’re frequently recompiling almost as much as you would with a simple file-tracking model, but spending a lot of overhead to do it) or both. And then at the end of the day you might still be stuck generating whole object files instead of just replacing a single function.

I’ve definitely heard of compiler projects trying it and then years later ripping it out and saying it was a waste of their time.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

giogadi posted:

e: just to be even more annoying: I wonder all the time whether compilers, applications, games, etc would be better if dev shops just didn’t exclusively use the fastest machines available
There's some of that, but I think for the most part everything is bloated because people are jerks. Like back in the day you could write a basic web server from scratch (well, other than using standard TCP/IP system libraries) that would compile to about a 10K binary, would take a fraction of a second to compile, and worked fine. But now, because people are jerks, a functioning webserver has to build all of openssl, a functioning email service has to include all the crypto poo poo for DKIM signatures, etc. etc. even before you get into the whole mess of nailing databases to your web poo poo.

For games it's a bit different, but similar, in that now your game probably has to use at least OpenGL so as to be compatible with the various hardware, and you probably want to use one of the major things (Unity, Godot etc.) so you get cross-platform compatibility and relatively easy integration with Steam etc. So not driven by jerks, but driven by jerk-adjacent properties. Also now it has to still look nice in 4k so unless you're doing "retro" all your textures and poo poo have to be ten times larger than they used to be, which means taking ten times as long to load...

So even if you're not developing on high-end machines, making your thing 'visually compatible' with high-end machines ironically means either making it slower, or making it larger (by providing different versions of content for lower resolutions / lesser hardware).

more falafel please
Feb 26, 2005

forums poster

giogadi posted:

Sorry, I think I was being unclear. This is what I'm suggesting:

code:
// foo.cpp
int HowdyPardner() {
    return 5;
}
code:
// main.cpp
extern int HowdyPardner();

int main() {
    int x = HowdyPardner();
    return x;
}
This avoids the use of a header file for main.cpp to use the functions in foo.cpp. If I add more functions to foo.cpp, it won't require recompiles of main.cpp. Like I said, this of course has issues like requiring to keep the extern declarations in-sync with the definitions. My question is: are there any other downsides to this approach that I'm not aware of? Obviously it becomes more problematic with lots of users of the code in foo.cpp, because if a function header changes, I'd have to go through and change not only callsites, but also extern declarations.

With C++ name mangling you'd at least get a linker error if the function parameters don't match the extern declarations, but in general I don't care for this approach. If you've got compile-time issues, splitting up headers so you don't have any mega-headers that get included all over the place feels like a better option. Also, old standbys like precompiled headers or unity/blob builds don't solve the problem of one header change triggering a big recompilation, but they do improve build times significantly.


pseudorandom name
May 6, 2007

None of the C++ name mangling methods are sufficient to detect e.g. struct member changes.

Languages like Rust include hashes of some kind of recursive normalized form of the types in their mangling schemes for this reason.

VikingofRock
Aug 24, 2008




pseudorandom name posted:

Languages like Rust include hashes of some kind of recursive normalized form of the types in their mangling schemes for this reason.

I wonder if there's a possibility for an attack here, where a hash collision results in a malicious recompilation.

pseudorandom name
May 6, 2007

If you're importing untrusted code you already have problems.

more falafel please
Feb 26, 2005

forums poster

pseudorandom name posted:

None of the C++ name mangling methods are sufficient to detect e.g. struct member changes.

Languages like Rust include hashes of some kind of recursive normalized form of the types in their mangling schemes for this reason.

I'm sure I'm missing something, but what's the scenario for that? When a struct is declared differently in different translation units?

I guess that would do it, but again I'll say, I loving hate that idea. Don't declare structs in multiple places, christ, you're asking for trouble. What if different translation units use different packing? You shouldn't do that either without a good reason and a strategy, but poo poo, someone's gonna mess it up, and it's gonna be an absolute pain in the rear end to find that bug.

pseudorandom name
May 6, 2007

You can't link objects with mismatched ABI because the names are mangled differently.

more falafel please
Feb 26, 2005

forums poster

pseudorandom name posted:

You can't link objects with mismatched ABI because the names are mangled differently.

Different calling conventions, but if one declaration of a struct is aligned to 8-byte boundaries and another is aligned to 16-byte boundaries (possibly because of different headers included in different TUs), I don't think that gets mangled differently. Similarly, I'm pretty sure structs as function parameters get mangled as their name, not their contents, so if the contents of a struct differ between TUs, it won't cause a linker error, it'll just gently caress you at runtime.

pseudorandom name
May 6, 2007

I was talking about languages other than C++ which recursively mangle the types themselves into the names.

repiv
Aug 13, 2009

Twerk from Home posted:

2. The module system introduced in C++20, specifically designed to fix this: https://en.cppreference.com/w/cpp/language/modules

Is anyone here using modules in anger yet? They sound nice on paper but whenever I check on their status it's just endless people complaining about compiler crashes, miscompilation and broken tooling

Xarn
Jun 26, 2015
Are you prepared to discover brand new compiler bugs for every project you port to modules? If not, don't use them.

The build perf improvements are real though.

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed
Clang 16 is finally shipping module support but the documentation makes it pretty clear that you shouldn't actually use it yet.

Nalin
Sep 29, 2007

Hair Elf
I've been playing around with modules in one of my hobby projects. I've ICE'd the compiler on multiple versions so far and even logged an EDG (IntelliSense) bug. The first mistake you'll make is trying to import <stl>; you'll try over and over to make it work but eventually you'll just give up. And lol get hosed if you try to use boost.

If you want to try modules, it's probably best to just use MSVC 2022; it's the furthest along of all of them. AFAIK, CMake still hasn't figured out how it'll do modules yet.

cheetah7071
Oct 20, 2010

honk honk
College Slice
I just want to confirm that, in windows, there's no way for a program to be compiled in a way that straddles the boundary between a console program and a window program--i.e., a program that doesn't pop up a background console window if you just double-click the exe, but also has a functional cout if you do happen to call it from the command line. Is that right?

I might end up having to compile two versions of this program because I expect both the GUI and command-line versions to be useful to users

Foxfire_
Nov 8, 2010

cheetah7071 posted:

I just want to confirm that, in windows, there's no way for a program to be compiled in a way that straddles the boundary between a console program and a window program--i.e., a program that doesn't pop up a background console window if you just double-click the exe, but also has a functional cout if you do happen to call it from the command line. Is that right?
All /SUBSYSTEM:CONSOLE tells Visual Studio to do is some extra setup for you. You can manually call AllocConsole to create a console if the process doesn't already have one, or AttachConsole if you want to attach to the parent process's console (or any other console you have permission to attach to).

cheetah7071
Oct 20, 2010

honk honk
College Slice
ah, AttachConsole is probably exactly what I wanted! The ideal behavior is that, if there is a parent console (i.e., it was called from the command line), it does the command line behavior, otherwise it does the gui behavior. Thanks for the help!

vote_no
Nov 22, 2005

The rush is on.

cheetah7071 posted:

I might end up having to compile two versions of this program because I expect both the GUI and command-line versions to be useful to users

Incidentally, if you can use Qt it handles this for you (with CONFIG += console)

Foxfire_
Nov 8, 2010

Is there a parsing equivalent to std::format() or is my best bet to just build something out of std::regex and std::stoi()? (I'm dealing with a preexisting linux kernel driver that returns hex text to userspace and want to get it back to numbers)

Zopotantor
Feb 24, 2013

...und ist er drin dann lassen wir ihn niemals wieder raus...

Foxfire_ posted:

Is there a parsing equivalent to std::format() or is my best bet to just build something out of std::regex and std::stoi()? (I'm dealing with a preexisting linux kernel driver that returns hex text to userspace and want to get it back to numbers)

Maybe just use iostreams?

qsvui
Aug 23, 2003
some crazy thing
This might be helpful: https://scnlib.readthedocs.io/en/v1.0/index.html

Foxfire_
Nov 8, 2010

scnlib was the kind of thing I was looking for, thanks. Now to decide if I care enough to replace the regex+stoi one I already have working :|

Ihmemies
Oct 6, 2012

FML. I have a map where keys are sorted with std::less<string> to alphabetical order.

NOW I realized this motherfucking exercise wants them in ASCII order.

So instead of:

code:
elastinen
Finlanders
Happoradio
That poo poo should be printed in

code:
Finlanders
Happoradio
elastinen
Order. :aaaaa:

Is there some automatic way to sort it or must I do some manual fuckery? :v:

Also why this stupid rear end project requires C++11 in compilation. I can't use contains() with map so my code is full of functions like

code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    auto ret = artists[artist]["Gigs"].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}


bool Gigs::artists_contains(string key_to_find) {
    auto ret = artists.count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}

bool Gigs::stages_contains(string city, string stage, string key_to_find) {
    auto ret = stages[city][stage].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}
So.. great.................... :eng99:

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


string::compare uses ASCII order so you can just write your own comparison object based on that. It's not completely automatic but it's only a line or two.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

Ihmemies posted:

code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    auto ret = artists[artist]["Gigs"].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}
wtf, why not
code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    auto ret = artists[artist]["Gigs"].count(key_to_find);
    return ret > 0;
}
or
code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    return artists[artist]["Gigs"].find(key_to_find) != artists[artist]["Gigs"].end();
}

cheetah7071
Oct 20, 2010

honk honk
College Slice
you can also just use count in place of contains, because non-zero numbers implicitly convert to true in if statements and zero converts to false

contains provides readability more than functionality

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

cheetah7071 posted:

you can also just use count in place of contains because non-zero numbers implicitly convert to true in if statements and zero implicitly converts to false

contains more provides readability than functionality
There is a performance difference in some structures (if there are multiple things matching then count has to look at all of them, versus contains just has to see the first one and can stop). But yeah, in a container that can't have multiple of the same key, it's basically identical.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Ihmemies posted:

FML. I have a map where keys are sorted with std::less<string> to alphabetical order.

NOW I realized this motherfucking exercise wants them in ASCII order.

So instead of:

code:
elastinen
Finlanders
Happoradio
That poo poo should be printed in

code:
Finlanders
Happoradio
elastinen
Order. :aaaaa:

Is there some automatic way to sort it or must I do some manual fuckery? :v:

Also why this stupid rear end project requires C++11 in compilation. I can't use contains() with map so my code is full of functions like

code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    auto ret = artists[artist]["Gigs"].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}


bool Gigs::artists_contains(string key_to_find) {
    auto ret = artists.count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}

bool Gigs::stages_contains(string city, string stage, string key_to_find) {
    auto ret = stages[city][stage].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}
So.. great.................... :eng99:

Careful using the [] map operator. I would expect “contains” style methods to be const (and yours probably should be), but [] has “create if doesn’t exist” semantics.

If it were my code I’d probably just use .find anyway. That’s compatible, no?

code:
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

using ValByKey = std::map<std::string, int64_t>;
using ValByKeyByStage = std::map<std::string, ValByKey>;
using ValByKeyByStageByCity = std::map<std::string, ValByKeyByStage>;

class Gigs {
  public:
    Gigs() {
        map_["a"]["b"]["c"] = 1;
    }
    ~Gigs() = default;

    bool MapContains(std::string city, std::string stage, std::string key_to_find) const;

  private:
    ValByKeyByStageByCity map_;
};

bool Gigs::MapContains(std::string city, std::string stage, std::string key_to_find) const {
    // Look for the entry for `city`.
    auto by_key_by_stage_it = map_.find(city);
    if (by_key_by_stage_it != map_.end()) {
        // Found the entry for `city`, look in that entry for `stage`.
        const ValByKeyByStage& by_key_by_stage = by_key_by_stage_it->second;
        auto by_key_it = by_key_by_stage.find(stage);
        if (by_key_it != by_key_by_stage.end()) {
            // Found the entry for `stage`, look in that entry for `key_to_find`.
            const ValByKey& by_key = by_key_it->second;
            auto it = by_key.find(key_to_find);
            if (it != by_key.end()) {
                // It exists!
                return true;
            }
        }
    }
    return false;
}

int main()
{
    Gigs gigs;

    if (gigs.MapContains("a", "b", "c")) {
        std::cout << "it does";
    } else {
        std::cout << "it does not";
    }

    return 0;
}
You could probably shorten this quite a bit, but hopefully you get the idea.

You could also write your own templates to do Contains and/or FindOrNullptr style functionality.

code:

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

using ValByKey = std::map<std::string, int64_t>;
using ValByKeyByStage = std::map<std::string, ValByKey>;
using ValByKeyByStageByCity = std::map<std::string, ValByKeyByStage>;

// Returns true if `key` exists in the map, else returns false.
template <typename mapType, typename keyType>
bool Contains(const mapType& map, keyType key) {
    auto it = map.find(key);
    if (it != map.end()) return true;
    return false;
}

// Returns a pointer to the value at `key`, or nullptr if that key doesn't exist.
template <typename mapType, typename keyType>
const typename mapType::mapped_type* FindOrNullptr(const mapType& map, keyType key) {
    auto it = map.find(key);
    if (it != map.end()) return &it->second;
    return nullptr;
}

class Gigs {
  public:
    Gigs() {
        map_["a"]["b"]["c"] = 1;
    }
    ~Gigs() = default;

    bool MapContains(std::string city, std::string stage, std::string key_to_find) const;

  private:
    ValByKeyByStageByCity map_;
};

bool Gigs::MapContains(std::string city, std::string stage, std::string key_to_find) const {
    // Look for the entry for `city`.
    const ValByKeyByStage* key_by_stage = FindOrNullptr(map_, city);
    if (key_by_stage) {
        // Found the entry for `city`, look in that entry for `stage`.
        const ValByKey* by_key = FindOrNullptr(*key_by_stage, stage);
        if (by_key) {
            // Found the entry for `stage`, look in that entry for `key_to_find`.
            return Contains(*by_key, key_to_find);
        }
    }
    return false;
}

int main()
{
    Gigs gigs;

    if (gigs.MapContains("a", "b", "c")) {
        std::cout << "it does";
    } else {
        std::cout << "it does not";
    }

    return 0;
}

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Oh also, you probably want to switch your parameters to const string& instead of string. Unless you enjoy making copies of strings every time you aren't using an rvalue or temporary.

Ihmemies
Oct 6, 2012

Thanks. This is my first c++ course and I really don't have that much of an idea what I'm actually doing :v: The previous one was in Python..

Even the data structure used was like "do whatever you want" so I just semi randomly figured out something which seems to work.

Ihmemies fucked around with this message at 07:25 on Oct 20, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Does anyone have a microbenchmarking library that they'd recommend? A quick search found:

Google Benchmark: https://github.com/google/benchmark
Nanobench: https://github.com/martinus/nanobench
Picobench: https://github.com/iboB/picobench
Criterion: https://github.com/p-ranav/criterion

I've got one hot function that's driving 70% of my application's runtime and want to be able to quickly experiment with different micro-optimizations, because I've removed everything that I can and want to try some different stuff to get more cache hits.

Xarn
Jun 26, 2015
You can't really go wrong with google bench, even if it does a bunch of stuff I don't like.

I usually use the benchmarking support in Catch2, because I already use Catch2 for my tests, so adding new benchmarks is trivial. I also like that you can intersperse benchmarks and assertions, so you skip running the benchmark if the test fails because its output is invalid.

Beef
Jul 26, 2004
Also give Intel VTune a try to analyse your performance benchmark. Linux's perf does something similar, but doesn't help you interpret the results.

qsvui
Aug 23, 2003
some crazy thing
Here's a web page for benchmarking if you're feeling particularly lazy: https://quick-bench.com/

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Thanks everyone! I set up Google Benchmark and it's fine. I'm walking face-first into the pitfalls of microbenchmarking, because getting work out of my hot path means finding another place to do it, or copying a data structure into a more efficient form before running the inner loop, and then I end up having to measure how expensive that stuff is in context.

This application's code is being published when the paper is published anyway, so I'd appreciate a second set of eyes on any way that I can make this poo poo faster, because I'm staring down a choice between cache friendliness vs significantly more instructions. This still remains >60% of total runtime, even after massaging it. We're budgeting millions of hours of CPU time in December to run this, so even incremental improvements would help.

Phenotypes is a vector of int8_t that are usually 0 or 1. There's a handful of -1s, and extremely rarely something else. The size of phenotypes ranges from tens of thousands to almost a million, and is likely to grow to low millions in future usage. Pairs is commonly a single digit multiple of the size of phenotypes, extremely sparse compared to every possible combination of pairs. Because pairs is so much bigger, we're iterating over pairs sequentially, and randomly accessing phenotypes. Pairs is sorted in an attempt to improve locality of phenotype accesses, like [{0,3}, {0, 200}, {2,4}, {2, 19}] and so on.

The lowest hanging fruit that did make huge improvements were:

1. Using a smaller int type for phenotypes. Originally phenotypes was a vector<int32_t>. We only really need 4 values, and 2 of them are exceptional cases.
2. Doing += 0 or 1 where we originally had an if. The compiler wasn't smart enough to eliminate the branch, and we were getting slaughtered by branch misses.

C++ code:
double calculate(std::vector<int8_t> &phenotypes, std::vector<std::pair<size_t, size_t>> &pairs) {
    int64_t cscs = 0;
    int64_t cscn = 0;
    int64_t cncn = 0;


    for (auto &p : pairs) {
        auto &[p1, p2] = p;
        auto x = phenotypes[p1];
        auto y = phenotypes[p2];

        if (x < -1 || x > 1 || y < -1 || y > 1) {
            handle_exceptional_case(p, x, y);
        }

        cscs += ((x == 1) && (y == 1));
        cscn += ((x == 0) && (y == 1));
        cscn += ((x == 1) && (y == 0));
        cncn += ((x == 0) && (y == 0));
    }

    // The actual math done here is different, but it's derived from all three
    return cscs / cscn + cncn;
}
I've been poking at it in Godbolt too: https://godbolt.org/z/dKfcPdYj3

The further paths I can see to optimize are:

1. Make a single pass through phenotypes before this to check for values not 0 or 1. There will typically be none, and if we find any we can handle them outside and pass calculate a collection of pairs that only has 0 or 1 values.
2. That lets us eliminate checking from the hot path, and also instead of checking (x == 1) && (y == 1), we can just do x && y, x ^ y, !(x || y), which is faster.
3. Improve cache friendliness by storing the pair indexes as uint32_t instead of size_t and cover this in warnings that it only handles up to 2^32-1 phenotypes. This won't limit us anytime soon.

The stuff that I'm not sure about at all are:
1. Now that phenotypes can only be 0 or 1 here, I could use vector<bool> or some kind of bitset. Better for cache, but many more instructions to pull bools out of it. This is probably a win on big datasets, and a loss on small. I think we're willing to make that tradeoff, or I might have both available and do a heuristic based on dataset size.
2. It's obvious to me that I could add together sets of cscs ,cscn, cncn by reading many pairs into two bitsets, and then doing bitwise &, ^, and !(x_bitset | y_bitset) and then += by count(). However, I don't know a fast way to loop and add bools to a bitset that wouldn't involve some bit manipulation overhead. I was thinking that I would just unroll the loop by my bitset size, gathering 64 pairs into two bitsets and then doing the bitwise operations and adding them to the accumulators.
3. There must be some way to vectorize this whole thing, but I'm not seeing it. I need a fast gather operation for bytes. I could over-retrieve and then mask and pack, of course, but that manipulation is overhead that could be spent just loading and adding bytes.
4. Integer math is 1 cycle on Intel, right? So I shouldn't need to add more accumulators to avoid a loop-carried data dependency?

This is only targeting Skylake-server, so AVX-512 is on the table, and also the 32-bit gather instructions are faster than scalar loads: https://stackoverflow.com/questions/24756534/in-what-situation-would-the-avx2-gather-instructions-be-faster-than-individually. It's also likely to only run on Skylake-server for the next 5 years, so doing performance things that are not portable are on the table. L2 is 1MB!

Also, if this isn't welcome here I'll take it to stack overflow. I'm just out of my depth but I know there's some type of bitset shortcut here to sum things up much faster.


Sweeper
Nov 29, 2007
The Joe Buck of Posting
Dinosaur Gum

Twerk from Home posted:

Thanks everyone! I set up Google Benchmark and it's fine. I'm walking face-first into the pitfalls of microbenchmarking, because getting work out of my hot path means finding another place to do it, or copying a data structure into a more efficient form before running the inner loop, and then I end up having to measure how expensive that stuff is in context.

This application's code is being published when the paper is published anyway, so I'd appreciate a second set of eyes on any way that I can make this poo poo faster, because I'm staring down a choice between cache friendliness vs significantly more instructions. This still remains >60% of total runtime, even after massaging it. We're budgeting millions of hours of CPU time in December to run this, so even incremental improvements would help.

Phenotypes is a vector of int8_t that are usually 0 or 1. There's a handful of -1s, and extremely rarely something else. The size of phenotypes ranges from tens of thousands to almost a million, and is likely to grow to low millions in future usage. Pairs is commonly a single digit multiple of the size of phenotypes, extremely sparse compared to every possible combination of pairs. Because pairs is so much bigger, we're iterating over pairs sequentially, and randomly accessing phenotypes. Pairs is sorted in an attempt to improve locality of phenotype accesses, like [{0,3}, {0, 200}, {2,4}, {2, 19}] and so on.

The lowest hanging fruit that did make huge improvements were:

1. Using a smaller int type for phenotypes. Originally phenotypes was a vector<int32_t>. We only really need 4 values, and 2 of them are exceptional cases.
2. Doing += 0 or 1 where we originally had an if. The compiler wasn't smart enough to eliminate the branch, and we were getting slaughtered by branch misses.

C++ code:
double calculate(std::vector<int8_t> &phenotypes, std::vector<std::pair<size_t, size_t>> &pairs) {
    int64_t cscs = 0;
    int64_t cscn = 0;
    int64_t cncn = 0;


    for (auto &p : pairs) {
        auto &[p1, p2] = p;
        auto x = phenotypes[p1];
        auto y = phenotypes[p2];

        if (x < -1 || x > 1 || y < -1 || y > 1) {
            handle_exceptional_case(p, x, y);
        }

        cscs += ((x == 1) && (y == 1));
        cscn += ((x == 0) && (y == 1));
        cscn += ((x == 1) && (y == 0));
        cncn += ((x == 0) && (y == 0));
    }

    // The actual math done here is different, but it's derived from all three
    return cscs / cscn + cncn;
}
I've been poking at it in Godbolt too: https://godbolt.org/z/dKfcPdYj3

The further paths I can see to optimize are:

1. Make a single pass through phenotypes before this to check for values not 0 or 1. There will typically be none, and if we find any we can handle them outside and pass calculate a collection of pairs that only has 0 or 1 values.
2. That lets us eliminate the checking from the hot path, and instead of testing (x == 1) && (y == 1) and friends, we can just do x & y, x ^ y, and !(x | y), which is faster.
3. Improve cache friendliness by storing the pair indexes as uint32_t instead of size_t and cover this in warnings that it only handles up to 2^32-1 phenotypes. This won't limit us anytime soon.
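To make steps 1 and 2 concrete, here's a rough sketch of what I mean. The function names (prefilter, count_clean) are made up for illustration; the real pipeline would handle the filtered-out pairs through our existing exceptional-case path:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical pre-pass: keep only pairs whose phenotypes are 0 or 1, so the
// hot loop below needs no range checks. Anything else (including the rare
// -1s) gets handled separately, outside the hot path.
std::vector<std::pair<std::size_t, std::size_t>> prefilter(
    const std::vector<int8_t>& phenotypes,
    const std::vector<std::pair<std::size_t, std::size_t>>& pairs) {
    std::vector<std::pair<std::size_t, std::size_t>> clean;
    clean.reserve(pairs.size());
    for (const auto& [p1, p2] : pairs) {
        auto x = phenotypes[p1];
        auto y = phenotypes[p2];
        if ((x == 0 || x == 1) && (y == 0 || y == 1))
            clean.push_back({p1, p2});
        // else: route the pair to the exceptional-case handling
    }
    return clean;
}

struct Counts { int64_t cscs = 0, cscn = 0, cncn = 0; };

// Branch-free counting on the cleaned pairs: x and y are guaranteed 0/1,
// so the three categories reduce to AND, XOR, and NOR.
Counts count_clean(const std::vector<int8_t>& phenotypes,
                   const std::vector<std::pair<std::size_t, std::size_t>>& clean) {
    Counts c;
    for (const auto& [p1, p2] : clean) {
        int x = phenotypes[p1];
        int y = phenotypes[p2];
        c.cscs += x & y;      // both 1
        c.cscn += x ^ y;      // exactly one is 1
        c.cncn += !(x | y);   // both 0
    }
    return c;
}
```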

The stuff that I'm not sure about at all are:
1. Now that phenotypes can only be 0 or 1 here, I could use vector<bool> or some kind of bitset. Better for cache, but many more instructions to pull bools out of it. This is probably a win on big datasets, and a loss on small. I think we're willing to make that tradeoff, or I might have both available and do a heuristic based on dataset size.
2. It's obvious to me that I could add to cscs, cscn, and cncn in batches by reading many pairs into two bitsets, then doing bitwise &, ^, and ~(x_bitset | y_bitset), and += by count(). However, I don't know a fast way to loop and add bools to a bitset that wouldn't involve some bit-manipulation overhead. I was thinking I would just unroll the loop by my bitset size, gathering 64 pairs into two bitsets, then doing the bitwise operations and adding them to the accumulators.
3. There must be some way to vectorize this whole thing, but I'm not seeing it. I need a fast gather operation for bytes. I could over-retrieve and then mask and pack, of course, but that manipulation is overhead that could be spent just loading and adding bytes.
4. Integer math is 1 cycle on Intel, right? So I shouldn't need to add more accumulators to avoid a loop-carried data dependency?
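For point 2, here's a sketch of what the 64-pair batching might look like in plain scalar code, assuming phenotypes are already cleaned to 0/1 and using two uint64_t as the "bitsets" (the names and the GCC/Clang __builtin_popcountll are my choices, not anything we have today); each batch costs one bitwise op plus one popcount per category, and the stragglers fall through to a scalar tail:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct Counts { int64_t cscs = 0, cscn = 0, cncn = 0; };

// Sketch of batching: gather 64 pairs' phenotype values (assumed 0/1) into
// two 64-bit masks, then count each category with a bitwise op + popcount.
Counts count_batched(const std::vector<int8_t>& phenotypes,
                     const std::vector<std::pair<std::size_t, std::size_t>>& pairs) {
    Counts c;
    std::size_t i = 0;
    const std::size_t nbatch = pairs.size() / 64 * 64;
    for (; i < nbatch; i += 64) {
        uint64_t xm = 0, ym = 0;
        for (int b = 0; b < 64; ++b) {  // pack one bit per pair
            xm |= uint64_t(phenotypes[pairs[i + b].first])  << b;
            ym |= uint64_t(phenotypes[pairs[i + b].second]) << b;
        }
        c.cscs += __builtin_popcountll(xm & ym);     // both 1
        c.cscn += __builtin_popcountll(xm ^ ym);     // exactly one
        c.cncn += __builtin_popcountll(~(xm | ym));  // both 0
    }
    for (; i < pairs.size(); ++i) {  // scalar tail for the last < 64 pairs
        int x = phenotypes[pairs[i].first];
        int y = phenotypes[pairs[i].second];
        c.cscs += x & y;
        c.cscn += x ^ y;
        c.cncn += !(x | y);
    }
    return c;
}
```

The inner packing loop is still one load and one shift/or per pair, so this only wins if the compiler turns it into something wider; it's mainly a correctness baseline to compare an intrinsics version against.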

This is only targeting Skylake-server, so AVX-512 is on the table, and the 32-bit gather instructions are faster than scalar loads: https://stackoverflow.com/questions/24756534/in-what-situation-would-the-avx2-gather-instructions-be-faster-than-individually. It's also likely to only run on Skylake-server for the next 5 years, so non-portable performance tricks are on the table too. L2 is 1MB!
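The uint32_t-index idea from point 3 ties into that 1 MB L2. A sketch, with a hypothetical PackedPair struct in place of std::pair<size_t, size_t>:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical packed pair: two 32-bit indices instead of two size_t.
// Caps us at 2^32 - 1 phenotypes, but halves the bytes streamed per pair
// (8 instead of 16), so twice as many pairs fit in Skylake-SP's 1 MB L2.
struct PackedPair {
    uint32_t a;
    uint32_t b;
};

static_assert(sizeof(PackedPair) == 8, "expect two pairs per 16-byte load");
```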

Also, if this isn't welcome here I'll take it to stack overflow. I'm just out of my depth but I know there's some type of bitset shortcut here to sum things up much faster.

just off the top of my head (hard to mess around with this having no data): if you want to try avx, you can run the two comparisons (<= 1, >= -1) [or just == 1? not sure what -1 is for] on 64 packed 8-bit ints to produce two 64-bit masks, and bitwise-and them (there is a mask & instruction, so you can stay in the avx registers) for both x and y. then you can create masks for 'x nonzero xor y nonzero' and 'x nonzero and y nonzero', sum the bits in each (I'm sure someone has a fast bitwise sum in avx on the internet), add those into cscn and cscs, and take a total count to figure out cncn at the end. not sure if this works, but you should be able to do the entire thing within avx registers doing 64-wide computations. you can handle the exceptional cases before or after; you don't have to change the data, it will get skipped via the mask.

for something simpler you could replace the boolean tests and build cscs/cscn/cncn out of plain additions (plus one cheap AND per pair), like:
code:
    int64_t both = 0;
    int64_t x_only = 0;
    int64_t y_only = 0;
    int64_t exceptional_count = 0;

    for (auto &p : pairs) {
        auto &[p1, p2] = p;
        auto x = phenotypes[p1];
        auto y = phenotypes[p2];

        bool exceptional = x > 1 || y > 1 || x < 0 || y < 0;
        if (__builtin_expect(exceptional, 0)) {
            ++exceptional_count;
            continue;  // keep exceptional values out of the sums
        }

        both += (x & y);
        x_only += x;
        y_only += y;
    }

    // We actually do some different math derived from all three counts
    // and other variables
    int64_t total = pairs.size() - exceptional_count;
    int64_t cscs = both;
    int64_t cscn = x_only + y_only - 2 * both;
    int64_t cncn = total - (x_only + y_only) + both;
    return cscs / cscn + cncn;
maybe? I didn't think about this too hard, so who knows if it makes sense. how exceptional are the -1 cases?
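to sanity-check the sum idea: with x and y restricted to 0/1, the three counts fall out of two running sums plus one AND per pair. a quick check of those identities on made-up data (the function name is just for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Counts { int64_t cscs, cscn, cncn; };

// With x, y in {0, 1}:
//   cscs = sum(x & y)                      (both 1)
//   cscn = x_only + y_only - 2 * cscs      (exactly one of x, y is 1)
//   cncn = total - x_only - y_only + cscs  (both 0)
Counts counts_from_sums(const std::vector<int>& xs, const std::vector<int>& ys) {
    int64_t x_only = 0, y_only = 0, both = 0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        x_only += xs[i];
        y_only += ys[i];
        both   += xs[i] & ys[i];
    }
    int64_t total = static_cast<int64_t>(xs.size());
    return {both, x_only + y_only - 2 * both, total - x_only - y_only + both};
}
```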

in general I've not found compilers to be that good about using avx; you generally have to do it yourself. it's important to keep everything in the avx registers for the entire computation. also be careful with avx instruction usage: some instructions will downclock the cpu, which can make things unexpectedly slower, although I think with the latest chips (newer than skylake) most of avx512 runs at full speed?

Sweeper fucked around with this message at 15:53 on Nov 7, 2022
