Sweeper
Nov 29, 2007
The Joe Buck of Posting
Dinosaur Gum

giogadi posted:

Sorry, I think I was being unclear. This is what I'm suggesting:

code:
// foo.cpp
int HowdyPardner() {
    return 5;
}
code:
// main.cpp
extern int HowdyPardner();

int main() {
    int x = HowdyPardner();
    return x;
}
This avoids the use of a header file for main.cpp to use the functions in foo.cpp. If I add more functions to foo.cpp, it won't require recompiles of main.cpp. Like I said, this of course has issues like requiring to keep the extern declarations in-sync with the definitions. My question is: are there any other downsides to this approach that I'm not aware of? Obviously it becomes more problematic with lots of users of the code in foo.cpp, because if a function header changes, I'd have to go through and change not only callsites, but also extern declarations.

We do this, but we put the extern declarations in the header of the module the function belongs to. It cuts down on compile times and makes it clear which parts of the different modules we're using.


Methanar
Sep 26, 2013

by the sex ghost

giogadi posted:

I do dev on an old laptop and my compilation times in c++ are just excruciating.

Is this the sort of problem that's worth throwing money at?

If you're doing dev work for any commercial reason whatsoever, a current 2000 dollar workstation is pennies and pays for itself in like a week.

giogadi
Oct 27, 2009

You’re 100% right that the simplest answer is to just get a faster machine, but in my naive heart it just feels wasteful to get a new computer when this 9 year old MacBook Pro is super fast still! Just compile times suck when making a change to a commonly included header.

Overall I tend to hold onto hardware until it’s actually unusable, and part of my reason for developing on this old thing is to ensure that other folks on old hardware will be able to run and enjoy my game

e: just to be even more annoying: I wonder all the time whether compilers, applications, games, etc would be better if dev shops just didn’t exclusively use the fastest machines available

giogadi fucked around with this message at 21:18 on Sep 2, 2022

leper khan
Dec 28, 2010
Honest to god thinks Half Life 2 is a bad game. But at least he likes Monster Hunter.

giogadi posted:

You’re 100% right that the simplest answer is to just get a faster machine, but in my naive heart it just feels wasteful to get a new computer when this 9 year old MacBook Pro is super fast still! Just compile times suck when making a change to a commonly included header.

Overall I tend to hold onto hardware until it’s actually unusable, and part of my reason for developing on this old thing is to ensure that other folks on old hardware will be able to run and enjoy my game

e: just to be even more annoying: I wonder all the time whether compilers, applications, games, etc would be better if dev shops just didn’t exclusively use the fastest machines available

If you can't play it on a PowerBook G4 it's not worth playing. :colbert:

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe
Tracking symbol-by-symbol dependencies well enough to figure out what changed and what needs to get recompiled is a ton of work, and it’s really easy to either miss a dependency (and thus miscompile a lot, training users to do clean rebuilds and defeating all the work you put into it) or register a ton of false dependencies (so that in practice you’re frequently recompiling almost as much as you would with a simple file-tracking model, but spending a lot of overhead to do it) or both. And then at the end of the day you might still be stuck generating whole object files instead of just replacing a single function.

I’ve definitely heard of compiler projects trying it and then years later ripping it out and saying it was a waste of their time.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

giogadi posted:

e: just to be even more annoying: I wonder all the time whether compilers, applications, games, etc would be better if dev shops just didn’t exclusively use the fastest machines available
There's some of that, but I think for the most part everything is bloated because people are jerks. Like back in the day you could write a basic web server from scratch (well, other than using standard TCP/IP system libraries) that would compile to about a 10K binary, would take a fraction of a second to compile, and worked fine. But now, because people are jerks, a functioning webserver has to build all of openssl, a functioning email service has to include all the crypto poo poo for DKIM signatures, etc. etc. even before you get into the whole mess of nailing databases to your web poo poo.

For games it's a bit different, but similar, in that now your game probably has to use at least OpenGL so as to be compatible with the various hardware, and you probably want to use one of the major things (Unity, Godot etc.) so you get cross-platform compatibility and relatively easy integration with Steam etc. So not driven by jerks, but driven by jerk-adjacent properties. Also now it has to still look nice in 4k so unless you're doing "retro" all your textures and poo poo have to be ten times larger than they used to be, which means taking ten times as long to load...

So even if you're not developing on high-end machines, making your thing 'visually compatible' with high-end machines ironically means either making it slower, or making it larger (by providing different versions of content for lower resolutions / lesser hardware).

more falafel please
Feb 26, 2005

forums poster

giogadi posted:

Sorry, I think I was being unclear. This is what I'm suggesting:

code:
// foo.cpp
int HowdyPardner() {
    return 5;
}
code:
// main.cpp
extern int HowdyPardner();

int main() {
    int x = HowdyPardner();
    return x;
}
This avoids the use of a header file for main.cpp to use the functions in foo.cpp. If I add more functions to foo.cpp, it won't require recompiles of main.cpp. Like I said, this of course has issues like requiring to keep the extern declarations in-sync with the definitions. My question is: are there any other downsides to this approach that I'm not aware of? Obviously it becomes more problematic with lots of users of the code in foo.cpp, because if a function header changes, I'd have to go through and change not only callsites, but also extern declarations.

With C++ name mangling you'd at least get a linker error if the function parameters don't match the extern declarations, but in general I don't care for this approach. If you've got compile-time issues, splitting up headers so you don't have any mega-headers that get included all over the place feels like a better option. Also, old standbys like precompiled headers or unity/blob builds don't solve the problem of one header change triggering a big recompilation, but they do improve build times significantly.


pseudorandom name
May 6, 2007

None of the C++ name mangling methods are sufficient to detect e.g. struct member changes.

Languages like Rust include hashes of some kind of recursive normalized form of the types in their mangling schemes for this reason.

VikingofRock
Aug 24, 2008




pseudorandom name posted:

Languages like Rust include hashes of some kind of recursive normalized form of the types in their mangling schemes for this reason.

I wonder if there's a possibility for an attack here, where a hash collision results in a malicious recompilation.

pseudorandom name
May 6, 2007

If you're importing untrusted code you already have problems.

more falafel please
Feb 26, 2005

forums poster

pseudorandom name posted:

None of the C++ name mangling methods are sufficient to detect e.g. struct member changes.

Languages like Rust include hashes of some kind of recursive normalized form of the types in their mangling schemes for this reason.

I'm sure I'm missing something, but what's the scenario for that? When a struct is declared differently in different translation units?

I guess that would do it, but again I'll say, I loving hate that idea. Don't declare structs in multiple places, christ, you're asking for trouble. What if different translation units use different packing? You shouldn't do that either without a good reason and a strategy, but poo poo, someone's gonna mess it up, and it's gonna be an absolute pain in the rear end to find that bug.

pseudorandom name
May 6, 2007

You can't link objects with mismatched ABI because the names are mangled differently.

more falafel please
Feb 26, 2005

forums poster

pseudorandom name posted:

You can't link objects with mismatched ABI because the names are mangled differently.

Different calling conventions, but if one declaration of a struct is aligned to 8-byte boundaries and another is aligned to 16-byte boundaries (possibly because of different headers included in different TUs), I don't think that gets mangled differently. Similarly, I'm pretty sure structs as function parameters get mangled as their name, not their contents, so if the contents of a struct differ between TUs, it won't cause a linker error, it'll just gently caress you at runtime.

pseudorandom name
May 6, 2007

I was talking about languages other than C++ which recursively mangle the types themselves into the names.

repiv
Aug 13, 2009

Twerk from Home posted:

2. The module system introduced in C++20, specifically designed to fix this: https://en.cppreference.com/w/cpp/language/modules

Is anyone here using modules in anger yet? They sound nice on paper but whenever I check on their status it's just endless people complaining about compiler crashes, miscompilation and broken tooling

Xarn
Jun 26, 2015
Are you prepared to discover brand new compiler bugs for every project you port to modules? If not, don't use them.

The build perf improvements are real though.

Plorkyeran
Mar 22, 2007

To Escape The Shackles Of The Old Forums, We Must Reject The Tribal Negativity He Endorsed
Clang 16 is finally shipping module support but the documentation makes it pretty clear that you shouldn't actually use it yet.

Nalin
Sep 29, 2007

Hair Elf
I've been playing around with modules in one of my hobby projects. I've ICE'd the compiler on multiple versions so far and even logged an EDG (IntelliSense) bug. The first mistake you'll make is trying to import <stl>; you'll try over and over to make it work but eventually you'll just give up. And lol get hosed if you try to use boost.

If you want to try modules, it's probably best to just use MSVC 2022; it's the furthest along of all of them. AFAIK, CMake still hasn't figured out how it'll do modules yet.

cheetah7071
Oct 20, 2010

honk honk
College Slice
I just want to confirm that, in windows, there's no way for a program to be compiled in a way that straddles the boundary between a console program and a window program--i.e., a program that doesn't pop up a background console window if you just double-click the exe, but also has a functional cout if you do happen to call it from the command line. Is that right?

I might end up having to compile two versions of this program because I expect both the GUI and command-line versions to be useful to users

Foxfire_
Nov 8, 2010

cheetah7071 posted:

I just want to confirm that, in windows, there's no way for a program to be compiled in a way that straddles the boundary between a console program and a window program--i.e., a program that doesn't pop up a background console window if you just double-click the exe, but also has a functional cout if you do happen to call it from the command line. Is that right?
All /SUBSYSTEM:CONSOLE tells Visual Studio to do is some extra setup for you. You can manually call AllocConsole to create a console if the process doesn't already have one, or AttachConsole if you want to attach to the parent process's console (or any other console you have permission to attach to).

cheetah7071
Oct 20, 2010

honk honk
College Slice
ah, AttachConsole is probably exactly what I wanted! The ideal behavior is that, if there is a parent console (i.e., it was called from the command line), it does the command line behavior, otherwise it does the gui behavior. Thanks for the help!

vote_no
Nov 22, 2005

The rush is on.

cheetah7071 posted:

I might end up having to compile two versions of this program because I expect both the GUI and command-line versions to be useful to users

Incidentally, if you can use Qt it handles this for you (with CONFIG += console)

Foxfire_
Nov 8, 2010

Is there a parsing equivalent to std::format() or is my best bet to just build something out of std::regex and std::stoi()? (I'm dealing with a preexisting linux kernel driver that returns hex text to userspace and want to get it back to numbers)

Zopotantor
Feb 24, 2013

...und ist er drin dann lassen wir ihn niemals wieder raus...

Foxfire_ posted:

Is there a parsing equivalent to std::format() or is my best bet to just build something out of std::regex and std::stoi()? (I'm dealing with a preexisting linux kernel driver that returns hex text to userspace and want to get it back to numbers)

Maybe just use iostreams?

qsvui
Aug 23, 2003
some crazy thing
This might be helpful: https://scnlib.readthedocs.io/en/v1.0/index.html

Foxfire_
Nov 8, 2010

scnlib was the kind of thing I was looking for, thanks. Now to decide if I care enough to replace the regex+stoi one I already have working :|

Ihmemies
Oct 6, 2012

FML. I have a map where keys are sorted with std::less<string> to alphabetical order.

NOW I realized this motherfucking exercise wants them in ASCII order.

So instead of:

code:
elastinen
Finlanders
Happoradio
That poo poo should be printed in

code:
Finlanders
Happoradio
elastinen
Order. :aaaaa:

Is there some automatic way to sort it or must I do some manual fuckery? :v:

Also why this stupid rear end project requires C++11 in compilation. I can't use contains() with map so my code is full of functions like

code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    auto ret = artists[artist]["Gigs"].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}


bool Gigs::artists_contains(string key_to_find) {
    auto ret = artists.count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}

bool Gigs::stages_contains(string city, string stage, string key_to_find) {
    auto ret = stages[city][stage].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}
So.. great.................... :eng99:

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


string::compare uses ASCII order so you can just write your own comparison object based on that. It's not completely automatic but it's only a line or two.

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

Ihmemies posted:

code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    auto ret = artists[artist]["Gigs"].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}
wtf, why not
code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    auto ret = artists[artist]["Gigs"].count(key_to_find);
    return ret > 0;
}
or
code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    return artists[artist]["Gigs"].find(key_to_find) != artists[artist]["Gigs"].end();
}

cheetah7071
Oct 20, 2010

honk honk
College Slice
you can also just use count in place of contains, because non-zero numbers implicitly convert to true in if statements and zero converts to false

contains provides readability more than functionality

roomforthetuna
Mar 22, 2005

I don't need to know anything about virii! My CUSTOM PROGRAM keeps me protected! It's not like they'll try to come in through the Internet or something!

cheetah7071 posted:

you can also just use count in place of contains because non-zero numbers implicitly convert to true in if statements and zero implicitly converts to false

contains more provides readability than functionality
There is a performance difference in some structures (if there are multiple things matching then count has to look at all of them, versus contains just has to see the first one and can stop). But yeah, in a container that can't have multiple of the same key, it's basically identical.

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Ihmemies posted:

FML. I have a map where keys are sorted with std::less<string> to alphabetical order.

NOW I realized this motherfucking exercise wants them in ASCII order.

So instead of:

code:
elastinen
Finlanders
Happoradio
That poo poo should be printed in

code:
Finlanders
Happoradio
elastinen
Order. :aaaaa:

Is there some automatic way to sort it or must I do some manual fuckery? :v:

Also why this stupid rear end project requires C++11 in compilation. I can't use contains() with map so my code is full of functions like

code:
bool Gigs::artists_contains_gig(string artist, string key_to_find) {
    auto ret = artists[artist]["Gigs"].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}


bool Gigs::artists_contains(string key_to_find) {
    auto ret = artists.count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}

bool Gigs::stages_contains(string city, string stage, string key_to_find) {
    auto ret = stages[city][stage].count(key_to_find);
    while (ret) {
        return true;
        ret -= 1;
    }
    return false;
}
So.. great.................... :eng99:

Careful using the [] map operator. I would expect “contains” style methods to be const (and yours probably should be), but [] has “create if doesn’t exist” semantics.

If it were my code I’d probably just use .find anyway. That’s compatible, no?

code:
#include <cstdint>
#include <iostream>
#include <map>
#include <string>

using ValByKey = std::map<std::string, int64_t>;
using ValByKeyByStage = std::map<std::string, ValByKey>;
using ValByKeyByStageByCity = std::map<std::string, ValByKeyByStage>;

class Gigs {
  public:
    Gigs() {
        map_["a"]["b"]["c"] = 1;
    }
    ~Gigs() = default;

    bool MapContains(std::string city, std::string stage, std::string key_to_find) const;

  private:
    ValByKeyByStageByCity map_;
};

bool Gigs::MapContains(std::string city, std::string stage, std::string key_to_find) const {
    // Look for the entry for `city`.
    auto by_key_by_stage_it = map_.find(city);
    if (by_key_by_stage_it != map_.end()) {
        // Found the entry for `city`, look in that entry for `stage`.
        const ValByKeyByStage& by_key_by_stage = by_key_by_stage_it->second;
        auto by_key_it = by_key_by_stage.find(stage);
        if (by_key_it != by_key_by_stage.end()) {
            // Found the entry for `stage`, look in that entry for `key_to_find`.
            const ValByKey& by_key = by_key_it->second;
            auto it = by_key.find(key_to_find);
            if (it != by_key.end()) {
                // It exists!
                return true;
            }
        }
    }
    return false;
}

int main()
{
    Gigs gigs;

    if (gigs.MapContains("a", "b", "c")) {
        std::cout << "it does";
    } else {
        std::cout << "it does not";
    }

    return 0;
}
You could probably shorten this quite a bit, but hopefully you get the idea.

You could also write your own templates to do Contains and/or FindOrNullptr style functionality.

code:

#include <cstdint>
#include <iostream>
#include <map>
#include <string>

using ValByKey = std::map<std::string, int64_t>;
using ValByKeyByStage = std::map<std::string, ValByKey>;
using ValByKeyByStageByCity = std::map<std::string, ValByKeyByStage>;

// Returns true if `key` exists in the map, else returns false.
template <typename mapType, typename keyType>
bool Contains(const mapType& map, keyType key) {
    auto it = map.find(key);
    if (it != map.end()) return true;
    return false;
}

// Returns a pointer to the value at `key`, or nullptr if that key doesn't exist.
template <typename mapType, typename keyType>
const typename mapType::mapped_type* FindOrNullptr(const mapType& map, keyType key) {
    auto it = map.find(key);
    if (it != map.end()) return &it->second;
    return nullptr;
}

class Gigs {
  public:
    Gigs() {
        map_["a"]["b"]["c"] = 1;
    }
    ~Gigs() = default;

    bool MapContains(std::string city, std::string stage, std::string key_to_find) const;

  private:
    ValByKeyByStageByCity map_;
};

bool Gigs::MapContains(std::string city, std::string stage, std::string key_to_find) const {
    // Look for the entry for `city`.
    const ValByKeyByStage* key_by_stage = FindOrNullptr(map_, city);
    if (key_by_stage) {
        // Found the entry for `city`, look in that entry for `stage`.
        const ValByKey* by_key = FindOrNullptr(*key_by_stage, stage);
        if (by_key) {
            // Found the entry for `stage`, look in that entry for `key_to_find`.
            return Contains(*by_key, key_to_find);
        }
    }
    return false;
}

int main()
{
    Gigs gigs;

    if (gigs.MapContains("a", "b", "c")) {
        std::cout << "it does";
    } else {
        std::cout << "it does not";
    }

    return 0;
}

Ruggan
Feb 20, 2007
WHAT THAT SMELL LIKE?!


Oh also, you probably want to switch your parameters to const string& instead of string. Unless you enjoy making copies of strings every time you aren't using an rvalue or temporary.

Ihmemies
Oct 6, 2012

Thanks. This is my first c++ course and I really don't have that much of an idea what I'm actually doing :v: The previous one was in Python..

Even the data structure used was like "do whatever you want" so I just semi randomly figured out something which seems to work.

Ihmemies fucked around with this message at 07:25 on Oct 20, 2022

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Does anyone have a microbenchmarking library that they'd recommend? A quick search found:

Google Benchmark: https://github.com/google/benchmark
Nanobench: https://github.com/martinus/nanobench
Picobench: https://github.com/iboB/picobench
Criterion: https://github.com/p-ranav/criterion

I've got one hot function that's driving 70% of my application's runtime and want to be able to quickly experiment with different micro-optimizations, because I've removed everything that I can and want to try some different stuff to get more cache hits.

Xarn
Jun 26, 2015
You can't really go wrong with google bench, even if it does a bunch of stuff I don't like.

I usually use the benchmarking support in Catch2, because I already use Catch2 for my tests, so adding new benchmarks is trivial. I also like that you can intersperse benchmarks and assertions, so you skip running the benchmark if the test fails because its output is invalid.

Beef
Jul 26, 2004
Also give Intel VTune a try to analyse your performance benchmark. Linux's perf does something similar, but doesn't help you interpret the results.

qsvui
Aug 23, 2003
some crazy thing
Here's a web page for benchmarking if you're feeling particularly lazy: https://quick-bench.com/

Twerk from Home
Jan 17, 2009

This avatar brought to you by the 'save our dead gay forums' foundation.
Thanks everyone! I set up Google Benchmark and it's fine. I'm walking face-first into the pitfalls of microbenchmarking, because getting work out of my hot path means finding another place to do it, or copying a data structure into a more efficient form before running the inner loop, and then I end up having to measure how expensive that stuff is in context.

This application's code is being published when the paper is published anyway, so I'd appreciate a second set of eyes on any way that I can make this poo poo faster, because I'm staring down a choice between cache friendliness vs significantly more instructions. This still remains >60% of total runtime, even after massaging it. We're budgeting millions of hours of CPU time in December to run this, so even incremental improvements would help.

Phenotypes is a vector of int8_t that are usually 0 or 1. There's a handful of -1s, and extremely rarely something else. The size of phenotypes ranges from tens of thousands to almost a million, and is likely to grow to low millions in future usage. Pairs is commonly a single digit multiple of the size of phenotypes, extremely sparse compared to every possible combination of pairs. Because pairs is so much bigger, we're iterating over pairs sequentially, and randomly accessing phenotypes. Pairs is sorted in an attempt to improve locality of phenotype accesses, like [{0,3}, {0, 200}, {2,4}, {2, 19}] and so on.

The lowest hanging fruit that did make huge improvements were:

1. Using a smaller int type for phenotypes. Originally phenotypes was a vector<int32_t>. We only really need 4 values, and 2 of them are exceptional cases.
2. Doing += 0 or 1 where we originally had an if. The compiler wasn't smart enough to eliminate the branch, and we were getting slaughtered by branch misses.

C++ code:
double calculate(std::vector<int8_t> &phenotypes, std::vector<std::pair<size_t, size_t>> &pairs) {
    int64_t cscs = 0;
    int64_t cscn = 0;
    int64_t cncn = 0;


    for (auto &p : pairs) {
        auto &[p1, p2] = p;
        auto x = phenotypes[p1];
        auto y = phenotypes[p2];

        if (x < -1 || x > 1 || y < -1 || y > 1) {
            handle_exceptional_case(p, x, y);
        }

        cscs += ((x == 1) && (y == 1));
        cscn += ((x == 0) && (y == 1));
        cscn += ((x == 1) && (y == 0));
        cncn += ((x == 0) && (y == 0));
    }

    // The actual math done here is different, but it's derived from all three
    return cscs / cscn + cncn;
}
I've been poking at it in Godbolt too: https://godbolt.org/z/dKfcPdYj3

The further paths I can see to optimize are:

1. Make a single pass through phenotypes before this to check for values not 0 or 1. There will typically be none, and if we find any we can handle them outside and pass calculate a collection of pairs that only has 0 or 1 values.
2. That lets us eliminate checking from the hot path, and also instead of checking (x == 1) && (y == 1), we can just do x && y, x ^ y, !(x || y), which is faster.
3. Improve cache friendliness by storing the pair indexes as uint32_t instead of size_t and cover this in warnings that it only handles up to 2^32-1 phenotypes. This won't limit us anytime soon.

The stuff that I'm not sure about at all are:
1. Now that phenotypes can only be 0 or 1 here, I could use vector<bool> or some kind of bitset. Better for cache, but many more instructions to pull bools out of it. This is probably a win on big datasets, and a loss on small. I think we're willing to make that tradeoff, or I might have both available and do a heuristic based on dataset size.
2. It's obvious to me that I could add together sets of cscs ,cscn, cncn by reading many pairs into two bitsets, and then doing bitwise &, ^, and !(x_bitset | y_bitset) and then += by count(). However, I don't know a fast way to loop and add bools to a bitset that wouldn't involve some bit manipulation overhead. I was thinking that I would just unroll the loop by my bitset size, gathering 64 pairs into two bitsets and then doing the bitwise operations and adding them to the accumulators.
3. There must be some way to vectorize this whole thing, but I'm not seeing it. I need a fast gather operation for bytes. I could over-retrieve and then mask and pack, of course, but that manipulation is overhead that could be spent just loading and adding bytes.
4. Integer math is 1 cycle on Intel, right? So I shouldn't need to add more accumulators to avoid a loop-carried data dependency?

This is only targeting Skylake-server, so AVX-512 is on the table, and also the 32-bit gather instructions are faster than scalar loads: https://stackoverflow.com/questions/24756534/in-what-situation-would-the-avx2-gather-instructions-be-faster-than-individually. It's also likely to only run on Skylake-server for the next 5 years, so doing performance things that are not portable are on the table. L2 is 1MB!

Also, if this isn't welcome here I'll take it to stack overflow. I'm just out of my depth but I know there's some type of bitset shortcut here to sum things up much faster.


Sweeper
Nov 29, 2007
The Joe Buck of Posting
Dinosaur Gum

Twerk from Home posted:

Thanks everyone! I set up Google Benchmark and it's fine. I'm walking face-first into the pitfalls of microbenchmarking, because getting work out of my hot path means finding another place to do it, or copying a data structure into a more efficient form before running the inner loop, and then I end up having to measure how expensive that stuff is in context.

This application's code is being published when the paper is published anyway, so I'd appreciate a second set of eyes on any way that I can make this poo poo faster, because I'm staring down a choice between cache friendliness vs significantly more instructions. This still remains >60% of total runtime, even after massaging it. We're budgeting millions of hours of CPU time in December to run this, so even incremental improvements would help.

Phenotypes is a vector of int8_t that are usually 0 or 1. There's a handful of -1s, and extremely rarely something else. The size of phenotypes ranges from tens of thousands to almost a million, and is likely to grow to low millions in future usage. Pairs is commonly a single digit multiple of the size of phenotypes, extremely sparse compared to every possible combination of pairs. Because pairs is so much bigger, we're iterating over pairs sequentially, and randomly accessing phenotypes. Pairs is sorted in an attempt to improve locality of phenotype accesses, like [{0,3}, {0, 200}, {2,4}, {2, 19}] and so on.

The lowest hanging fruit that did make huge improvements were:

1. Using a smaller int type for phenotypes. Originally phenotypes was a vector<int32_t>. We only really need 4 values, and 2 of them are exceptional cases.
2. Doing += 0 or 1 where we originally had an if. The compiler wasn't smart enough to eliminate the branch, and we were getting slaughtered by branch misses.

C++ code:
double calculate(std::vector<int8_t> &phenotypes, std::vector<std::pair<size_t, size_t>> &pairs) {
    int64_t cscs = 0;
    int64_t cscn = 0;
    int64_t cncn = 0;


    for (auto &p : pairs) {
        auto &[p1, p2] = p;
        auto x = phenotypes[p1];
        auto y = phenotypes[p2];

        if (x < -1 || x > 1 || y < -1 || y > 1) {
            handle_exceptional_case(p, x, y);
        }

        cscs += ((x == 1) && (y == 1));
        cscn += ((x == 0) && (y == 1));
        cscn += ((x == 1) && (y == 0));
        cncn += ((x == 0) && (y == 0));
    }

    // The actual math done here is different, but it's derived from all three
    return cscs / cscn + cncn;
}
I've been poking at it in Godbolt too: https://godbolt.org/z/dKfcPdYj3

The further paths I can see to optimize are:

1. Make a single pass through phenotypes before this to check for values not 0 or 1. There will typically be none, and if we find any we can handle them outside and pass calculate a collection of pairs that only has 0 or 1 values.
2. That lets us eliminate the checking from the hot path, and instead of testing (x == 1) && (y == 1) and friends, we can just do x & y, x ^ y, and !(x | y), which is faster.
3. Improve cache friendliness by storing the pair indexes as uint32_t instead of size_t and cover this in warnings that it only handles up to 2^32-1 phenotypes. This won't limit us anytime soon.
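To make steps 1 and 2 concrete, here's a rough sketch of what I mean. The function names (prefilter, count_clean) are made up for illustration; the real pipeline would handle the filtered-out pairs through our existing exceptional-case path:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical pre-pass: keep only pairs whose phenotypes are 0 or 1, so the
// hot loop below needs no range checks. Anything else (including the rare
// -1s) gets handled separately, outside the hot path.
std::vector<std::pair<std::size_t, std::size_t>> prefilter(
    const std::vector<int8_t>& phenotypes,
    const std::vector<std::pair<std::size_t, std::size_t>>& pairs) {
    std::vector<std::pair<std::size_t, std::size_t>> clean;
    clean.reserve(pairs.size());
    for (const auto& [p1, p2] : pairs) {
        auto x = phenotypes[p1];
        auto y = phenotypes[p2];
        if ((x == 0 || x == 1) && (y == 0 || y == 1))
            clean.push_back({p1, p2});
        // else: route the pair to the exceptional-case handling
    }
    return clean;
}

struct Counts { int64_t cscs = 0, cscn = 0, cncn = 0; };

// Branch-free counting on the cleaned pairs: x and y are guaranteed 0/1,
// so the three categories reduce to AND, XOR, and NOR.
Counts count_clean(const std::vector<int8_t>& phenotypes,
                   const std::vector<std::pair<std::size_t, std::size_t>>& clean) {
    Counts c;
    for (const auto& [p1, p2] : clean) {
        int x = phenotypes[p1];
        int y = phenotypes[p2];
        c.cscs += x & y;      // both 1
        c.cscn += x ^ y;      // exactly one is 1
        c.cncn += !(x | y);   // both 0
    }
    return c;
}
```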

The stuff that I'm not sure about at all are:
1. Now that phenotypes can only be 0 or 1 here, I could use vector<bool> or some kind of bitset. Better for cache, but many more instructions to pull bools out of it. This is probably a win on big datasets, and a loss on small. I think we're willing to make that tradeoff, or I might have both available and do a heuristic based on dataset size.
2. It's obvious to me that I could add to cscs, cscn, and cncn in batches by reading many pairs into two bitsets, then doing bitwise &, ^, and ~(x_bitset | y_bitset), and += by count(). However, I don't know a fast way to loop and add bools to a bitset that wouldn't involve some bit-manipulation overhead. I was thinking I would just unroll the loop by my bitset size, gathering 64 pairs into two bitsets, then doing the bitwise operations and adding them to the accumulators.
3. There must be some way to vectorize this whole thing, but I'm not seeing it. I need a fast gather operation for bytes. I could over-retrieve and then mask and pack, of course, but that manipulation is overhead that could be spent just loading and adding bytes.
4. Integer math is 1 cycle on Intel, right? So I shouldn't need to add more accumulators to avoid a loop-carried data dependency?
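For point 2, here's a sketch of what the 64-pair batching might look like in plain scalar code, assuming phenotypes are already cleaned to 0/1 and using two uint64_t as the "bitsets" (the names and the GCC/Clang __builtin_popcountll are my choices, not anything we have today); each batch costs one bitwise op plus one popcount per category, and the stragglers fall through to a scalar tail:

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

struct Counts { int64_t cscs = 0, cscn = 0, cncn = 0; };

// Sketch of batching: gather 64 pairs' phenotype values (assumed 0/1) into
// two 64-bit masks, then count each category with a bitwise op + popcount.
Counts count_batched(const std::vector<int8_t>& phenotypes,
                     const std::vector<std::pair<std::size_t, std::size_t>>& pairs) {
    Counts c;
    std::size_t i = 0;
    const std::size_t nbatch = pairs.size() / 64 * 64;
    for (; i < nbatch; i += 64) {
        uint64_t xm = 0, ym = 0;
        for (int b = 0; b < 64; ++b) {  // pack one bit per pair
            xm |= uint64_t(phenotypes[pairs[i + b].first])  << b;
            ym |= uint64_t(phenotypes[pairs[i + b].second]) << b;
        }
        c.cscs += __builtin_popcountll(xm & ym);     // both 1
        c.cscn += __builtin_popcountll(xm ^ ym);     // exactly one
        c.cncn += __builtin_popcountll(~(xm | ym));  // both 0
    }
    for (; i < pairs.size(); ++i) {  // scalar tail for the last < 64 pairs
        int x = phenotypes[pairs[i].first];
        int y = phenotypes[pairs[i].second];
        c.cscs += x & y;
        c.cscn += x ^ y;
        c.cncn += !(x | y);
    }
    return c;
}
```

The inner packing loop is still one load and one shift/or per pair, so this only wins if the compiler turns it into something wider; it's mainly a correctness baseline to compare an intrinsics version against.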

This is only targeting Skylake-server, so AVX-512 is on the table, and the 32-bit gather instructions are faster than scalar loads: https://stackoverflow.com/questions/24756534/in-what-situation-would-the-avx2-gather-instructions-be-faster-than-individually. It's also likely to only run on Skylake-server for the next 5 years, so non-portable performance tricks are on the table too. L2 is 1MB!
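The uint32_t-index idea from point 3 ties into that 1 MB L2. A sketch, with a hypothetical PackedPair struct in place of std::pair<size_t, size_t>:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical packed pair: two 32-bit indices instead of two size_t.
// Caps us at 2^32 - 1 phenotypes, but halves the bytes streamed per pair
// (8 instead of 16), so twice as many pairs fit in Skylake-SP's 1 MB L2.
struct PackedPair {
    uint32_t a;
    uint32_t b;
};

static_assert(sizeof(PackedPair) == 8, "expect two pairs per 16-byte load");
```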

Also, if this isn't welcome here I'll take it to stack overflow. I'm just out of my depth but I know there's some type of bitset shortcut here to sum things up much faster.

just off the top of my head (hard to mess around with this having no data): if you want to try avx, you can run the two comparisons (<= 1, >= -1) [or just == 1? not sure what -1 is for] on 64 packed 8-bit ints to produce two 64-bit masks, and bitwise-and them (there is a mask & instruction, so you can stay in the avx registers) for both x and y. then you can create masks for 'x nonzero xor y nonzero' and 'x nonzero and y nonzero', sum the bits in each (I'm sure someone has a fast bitwise sum in avx on the internet), add those into cscn and cscs, and take a total count to figure out cncn at the end. not sure if this works, but you should be able to do the entire thing within avx registers doing 64-wide computations. you can handle the exceptional cases before or after; you don't have to change the data, it will get skipped via the mask.

for something simpler you could replace the boolean tests and build cscs/cscn/cncn out of plain additions (plus one cheap AND per pair), like:
code:
    int64_t both = 0;
    int64_t x_only = 0;
    int64_t y_only = 0;
    int64_t exceptional_count = 0;

    for (auto &p : pairs) {
        auto &[p1, p2] = p;
        auto x = phenotypes[p1];
        auto y = phenotypes[p2];

        bool exceptional = x > 1 || y > 1 || x < 0 || y < 0;
        if (__builtin_expect(exceptional, 0)) {
            ++exceptional_count;
            continue;  // keep exceptional values out of the sums
        }

        both += (x & y);
        x_only += x;
        y_only += y;
    }

    // We actually do some different math derived from all three counts
    // and other variables
    int64_t total = pairs.size() - exceptional_count;
    int64_t cscs = both;
    int64_t cscn = x_only + y_only - 2 * both;
    int64_t cncn = total - (x_only + y_only) + both;
    return cscs / cscn + cncn;
maybe? I didn't think about this too hard, so who knows if it makes sense. how exceptional are the -1 cases?
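to sanity-check the sum idea: with x and y restricted to 0/1, the three counts fall out of two running sums plus one AND per pair. a quick check of those identities on made-up data (the function name is just for illustration):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Counts { int64_t cscs, cscn, cncn; };

// With x, y in {0, 1}:
//   cscs = sum(x & y)                      (both 1)
//   cscn = x_only + y_only - 2 * cscs      (exactly one of x, y is 1)
//   cncn = total - x_only - y_only + cscs  (both 0)
Counts counts_from_sums(const std::vector<int>& xs, const std::vector<int>& ys) {
    int64_t x_only = 0, y_only = 0, both = 0;
    for (std::size_t i = 0; i < xs.size(); ++i) {
        x_only += xs[i];
        y_only += ys[i];
        both   += xs[i] & ys[i];
    }
    int64_t total = static_cast<int64_t>(xs.size());
    return {both, x_only + y_only - 2 * both, total - x_only - y_only + both};
}
```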

in general I've not found compilers to be that good about using avx; you generally have to do it yourself. it's important to keep everything in the avx registers for the entire computation. also be careful with avx instruction usage: some instructions will downclock the cpu, which can make things unexpectedly slower, although I think with the latest chips (newer than skylake) most of avx512 runs at full speed?

Sweeper fucked around with this message at 15:53 on Nov 7, 2022
