  • Locked thread
VikingofRock
Aug 24, 2008




So I'm trying to track down a bug in a multi-threaded program, which I believe comes from a third-party library I am using. After spending a significant amount of time in lldb, I realized that what is going on is that a vector of strings is being allocated over the same memory as an object in another thread, and then they mess with each other and eventually this causes a segfault. This is clear from looking at the addresses of the vector elements and the object, and from the fact that the object's members clearly contain data from the vector (e.g. vector contains "poo", object has an int member which is 0x6f6f7003 at same address). The vector of strings is global static, and gets initialized at runtime with some code that looks like this:

C++ code:
#define N_ITEMS 13
static std::vector<std::string> s_items;

// ...

void initialize_items() {
    if (s_items.empty()) {
        s_items.resize(N_ITEMS);
        s_items[0] = "poo";
        s_items[1] = "fart";
        // ...
        s_items[N_ITEMS - 1] = "butt";  // last valid index after resize(N_ITEMS)
    }
}
initialize_items() gets called from every object that wants to use s_items before said use (side note: this code is very bad). Now, obviously this is not thread safe, since you can get cases like the following:

  1. Thread 1 got to initialize_items() first, and has just re-sized s_items. So s_items is not empty, but the thread has not actually filled s_items yet.
  2. Thread 2 comes by, sees that s_items is not empty, and then happily chugs along and uses an invalid s_items.

But I'm not quite seeing how this ends up actually over-writing another thread's object. My best guess is that the call to resize causes the data associated with s_items to be re-allocated, and that the thread doing that re-allocation then allocates that memory over the object from the other thread. But, I thought that heap allocation was thread-aware, so it seems like that sort of thing should be impossible.

So my question is, can the fact that the s_items is global static mess with the heap allocator's thread awareness, thus failing to prevent a collision? Or is something else going on? This is all being compiled with clang++, if that matters.
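
As a sanity check on the 0x6f6f7003 observation above, here is a minimal sketch that splits a 32-bit value into its bytes in memory order. It assumes a little-endian machine (lowest-order byte stored first), and decode_le32 is a hypothetical helper name, not anything from the library in question.

```cpp
#include <cassert>   // only needed if you want to assert on the result
#include <cstddef>
#include <string>

// Splits the low 32 bits of value into four bytes, lowest-order byte first,
// which is how a little-endian machine lays an int out in memory.
// decode_le32 is a made-up name for illustration.
std::string decode_le32(unsigned long value) {
    std::string bytes;
    for (std::size_t i = 0; i < 4; ++i) {
        bytes.push_back(static_cast<char>((value >> (8 * i)) & 0xFF));
    }
    return bytes;
}
```

For 0x6f6f7003 this yields the byte 0x03 followed by 'p', 'o', 'o', which matches the string contents showing up inside the int. What the leading 0x03 byte means depends on the std::string layout in use, so no claim is made about it here.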

OzyMandrill
Aug 12, 2013

Look upon my words
and despair

hackbunny posted:

correct! but the microsoft C runtime (until the "universal runtime" refactoring, which split the core C runtime from the compiler intrinsics library) was never meant to be used by other compilers, and they freely changed standard conformance and even the ABI from one version to the next. the "funniest" (in the "funniest home videos" sense of a football hitting someone in the crotch) one is probably msvcrt.dll, which while continuously upgraded with the latest bells and whistles, has to retain backwards compatibility with Visual Studio 6.

i was once working on a soccer game for pc with dx5 or 6, and we came across a bug where the anim system would animate all the players apart from their left foot, which would stay at 0,0,0 - the floor under the players center of mass. but only on some pcs. it was eventually tracked down to the version of MSVCRT.dll - if we used the latest version, all the left feet stayed under the player while they ran. if we used the previous dll, it worked fine.
i have no idea why or how it could affect what was essentially a for loop, but if theres a way to bugger up your code when you're not looking, vstudio 6 would do it.
we shipped with statically linked CRT instead.

gonadic io
Feb 16, 2011

>>=
having fun with the good old nullable boolean rn. hurray mysql.

FlapYoJacks
Feb 12, 2009

gonadic io posted:

having fun with the good old nullable boolean rn. hurray mysql.

Use MongoDB.


And by "use MongoDB", I mean PostgreSQL

cinci zoo sniper
Mar 15, 2013




ratbert90 posted:

Use MongoDB.


And by "use MongoDB", I mean PostgreSQL

hoo gently caress i was ready to start fighting :laffo:

quiggy
Aug 7, 2010

[in Russian] Oof.


VikingofRock posted:

C++ code:
#define N_ITEMS 13

jesus christ why are you doing this in c++ code, make that a const size_t (or maybe a static const size_t depending on context)
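
A minimal sketch of what that suggestion could look like, assuming the constant only needs to be visible within one translation unit. The names s_slots and slot_count are made up for illustration.

```cpp
#include <cassert>   // only needed if you want to assert on the result
#include <cstddef>

// Typed replacement for "#define N_ITEMS 13". Unlike a macro it obeys
// scoping rules and shows up with a type in the debugger.
static const std::size_t N_ITEMS = 13;

// A const integer initialized with a constant is still a constant
// expression in C++98, so it can size an array just like the macro could.
static int s_slots[N_ITEMS];

// Hypothetical helper that reports how many elements the array has.
std::size_t slot_count() {
    return sizeof(s_slots) / sizeof(s_slots[0]);
}
```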

DONT THREAD ON ME
Oct 1, 2002

by Nyc_Tattoo
Floss Finder

quiggy posted:

jesus christ why are you doing this in c++ code, make that a const size_t (or maybe a static const size_t depending on context)

hi quiggy hope you're good

quiggy
Aug 7, 2010

[in Russian] Oof.


also the answer to your question is that thread support in c++ is janky at best and you shouldn't be invoking undefined behavior

quiggy
Aug 7, 2010

[in Russian] Oof.


MALE SHOEGAZE posted:

hi quiggy hope you're good

i am well, thank you friend

cinci zoo sniper
Mar 15, 2013




im starting to be really surprised with popularity of python in data analysis. more specifically, for some reason i assumed that pandas is much better than it actually is - im not sure its good for anything other than very small, low-dimensional datasets

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

VikingofRock posted:

So my question is, can the fact that the s_items is global static mess with the heap allocator's thread awareness, thus failing to prevent a collision? Or is something else going on? This is all being compiled with clang++, if that matters.

sometimes these races can result in memory being freed on one thread while another thread is still using it. that can screw up the allocator's data structures, but more likely in this case the memory is just being reallocated for a new purpose, allocators like to return memory that's just been freed

libc++ or libstdc++ would be the more important difference because of the difference in the layout of std::string. libc++ uses a small-string optimization that sometimes puts character data directly in the std::string object, libstdc++ doesn't (at least in older versions). when the string data is out-of-line, racy assignments can trigger malloc/free problems
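
One rough way to see which case applies is to check whether a string's character buffer lives inside the std::string object itself. This is only a sketch: stored_inline is a hypothetical helper, and the answer for short strings depends on the standard library and its version.

```cpp
#include <cassert>   // only needed if you want to assert on the result
#include <functional>
#include <string>

// Returns true when the character data of s sits inside the std::string
// object itself (small-string optimization), false when it is out-of-line.
bool stored_inline(const std::string& s) {
    const char* obj_begin = reinterpret_cast<const char*>(&s);
    const char* obj_end = obj_begin + sizeof(std::string);
    const char* data = s.data();
    // std::less gives a well-defined ordering even for unrelated pointers.
    std::less<const char*> before;
    return !before(data, obj_begin) && before(data, obj_end);
}
```

A 1000-character string can never fit inside the object, so stored_inline returns false for it on any implementation. For something like "poo" the result is the interesting part, and it is deliberately not assumed here.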

VikingofRock
Aug 24, 2008




quiggy posted:

jesus christ why are you doing this in c++ code, make that a const size_t (or maybe a static const size_t depending on context)

quiggy posted:

also the answer to your question is that thread support in c++ is janky at best and you shouldn't be invoking undefined behavior

For the record, this is all in a third-party library that I am using (and now debugging). It's full of coding horrors, and the whole thing is written as "object oriented C" instead of C++. Shockingly the library actually seems to work when things are single-threaded, and it's only been recently when I've been scaling up my concurrency that I've started getting random segfaults coming from it.

There's a C library which does the same thing as this C++ library, which I will probably be switching to, since that library has been re-written in the past few years to take concurrency into account. That library is a horrifying maze of #ifdefs, but at least it seems a little more battle-tested than this one, since it's one of the most widely-used libraries in astronomy. So at this point I am mostly just trying to figure this out so I can be a good astronomy citizen and submit a good bug report / patch to the C++ library people at NASA.


rjmccall posted:

sometimes these races can result in memory being freed on one thread while another thread is still using it. that can screw up the allocator's data structures, but more likely in this case the memory is just being reallocated for a new purpose, allocators like to return memory that's just been freed

libc++ or libstdc++ would be the more important difference because of the difference in the layout of std::string. libc++ uses a small-string optimization that sometimes puts character data directly in the std::string object, libstdc++ doesn't (at least in older versions). when the string data is out-of-line, racy assignments can trigger malloc/free problems

This is using libc++, and the small-string optimization seems to definitely be in effect: when I hexdump the vector data, I can see the contents of the strings. My guess is that this is related to the former problem that you mention. In any case, this is all good enough for a solid bug report and suggested fix at this point.

For when I go to submit a bug report / patch: Is this a good way to fix this in C++98? 99% of the C++ I've written has been C++11 or later, so I can never remember the idiomatic way to do things pre-C++11.

C++ code:
std::vector<std::string> make_s_items() {
    std::vector<std::string> items;
    items.reserve(N_ITEMS);
    items.push_back("poo");
    items.push_back("fart");
    // ...
    items.push_back("butt");
    return items;
}

static const std::vector<std::string> s_items = make_s_items();

// remove old calls to initialize_items() throughout code
Now that I think about it, the last thing that might be relevant here is that s_items is actually a member of a class. Not sure if that changes things (other than the syntax, slightly).
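
For the class-member case, a sketch of how the syntax might change. Catalog is a hypothetical class name standing in for the real one, and the list is cut down to three items.

```cpp
#include <cassert>   // only needed if you want to assert on the result
#include <string>
#include <vector>

// Hypothetical class standing in for the library's real one.
class Catalog {
public:
    static const std::vector<std::string> s_items;  // declaration only
};

// Builds the vector in a local and returns it by value, which works in C++98.
static std::vector<std::string> make_s_items() {
    std::vector<std::string> items;
    items.reserve(3);
    items.push_back("poo");
    items.push_back("fart");
    items.push_back("butt");
    return items;
}

// The definition goes in exactly one .cxx file and is initialized
// during static initialization, before main() runs.
const std::vector<std::string> Catalog::s_items = make_s_items();
```

One caveat with this approach: static initializers in other translation units should not read Catalog::s_items, since the order of initialization across translation units is not specified.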

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe

cinci zoo sniper posted:

im starting to be really surprised with popularity of python in data analysis. more specifically, for some reason i assumed that pandas is much better than it actually is - im not sure its good for anything other than very small, low-dimensional datasets

well, a lot of data scientists will do their work on a much smaller representative data set, because a lot of the popular tools work well, but only with machine-size sets

then they bring in the systems people to scale all that up

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe
and then of course you have all the tooling for people that have the resources to run it like databricks and zeppelin and hue that let them screw around with larger datasets to begin with

cinci zoo sniper
Mar 15, 2013




lancemantis posted:

well, a lot of data scientists will do their work on a much smaller representative data set, because a lot of the popular tools work well, but only with machine-size sets

then they bring in the systems people to scale all that up

lancemantis posted:

and then of course you have all the tooling for people that have the resources to run it like databricks and zeppelin and hue that let them screw around with larger datasets to begin with

i mean sure, if you've got the resource you can work with it, but for how much everyone is touting data science i did imagine that ~the package~ defining not exclusively numerical analysis would be able to not be somewhat hamstrung to two-dimensional data and datasets of about total_ram*0.2

maybe i vastly overestimate how much data people use on average (since in my domain answer always is a fuckton (the representative dataset = the dataset) and you are mathematically wrong if you do anything else) or how dimensional it is or vastly underestimate the clouds/serverfarms at disposal of an average pandas user

cinci zoo sniper fucked around with this message at 21:16 on Sep 29, 2017

cinci zoo sniper
Mar 15, 2013




not to say i dislike it or think its poo poo, im just mildly disappointed that i can make a strong case for using r over python in my jerb without really trying or pulling arguments by their ears

quiggy
Aug 7, 2010

[in Russian] Oof.


VikingofRock posted:

For when I go to submit a bug report / patch: Is this a good way to fix this in C++98? 99% of the C++ I've written has been C++11 or later, so I can never remember the idiomatic way to do things pre-C++11.

if you're just trying to initialize your vector with the empty string, you should be able to do

C++ code:
std::vector<std::string> x(N_ITEMS, "");
and not have the nasty make_s_items() function at all. if you're trying to initialize the vector with set values that are different for each value, then yeah you'll need to do it like you just did. you can skip the call to std::vector::reserve() with the constructor like this

C++ code:
std::vector<std::string> x(5);
x[0] = "hello";
x[1] = "world";
x[2] = "I";
x[3] = "am";
x[4] = "gay";
return x;
(there's no real difference in the result between the first and second vector constructor: the second one default-constructs its 5 strings, and a default-constructed std::string is empty, the same as the "" the first one copies in)

sadly c++98/03 don't have the c++11 initializer list syntax so you have to do it this ugly way instead
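
One C++98-friendly alternative worth sketching is to list the values in a plain array and hand it to the vector's iterator-range constructor. ITEM_NAMES, N_ITEM_NAMES and s_item_names are made-up names, and the list is cut down to three entries.

```cpp
#include <cassert>   // only needed if you want to assert on the result
#include <cstddef>
#include <string>
#include <vector>

// Plain array of string literals; aggregate initialization works in C++98.
static const char* const ITEM_NAMES[] = { "poo", "fart", "butt" };

// Element count derived from the array itself, so it cannot drift
// out of sync with the list the way a separate N_ITEMS could.
static const std::size_t N_ITEM_NAMES =
    sizeof(ITEM_NAMES) / sizeof(ITEM_NAMES[0]);

// Iterator-range constructor: builds one std::string per array element.
static const std::vector<std::string> s_item_names(
    ITEM_NAMES, ITEM_NAMES + N_ITEM_NAMES);
```

This gets close to the initializer list look without a helper function, and the vector can be const.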

quiggy
Aug 7, 2010

[in Russian] Oof.


also if you have any control over it please do not #define N_ITEMS, that's horrible bullshit and you shouldn't even be doing it in remotely modern c let alone c++

JewKiller 3000
Nov 28, 2006

by Lowtax
c++ can std::suck my std::dick

Sapozhnik
Jan 2, 2005

Nap Ghost

JewKiller 3000 posted:

c++ can std::suck my std::dick

quiggy
Aug 7, 2010

[in Russian] Oof.


JewKiller 3000 posted:

c++ can std::suck my std::dick

i think these are from boost actually

Arcsech
Aug 5, 2008

JewKiller 3000 posted:

c++ can std::suck my std::dick

remember to always #include<protection> in your fun time, nobody wants an std::dick

rjmccall
Sep 7, 2007

no worries friend
Fun Shoe

VikingofRock posted:

For when I go to submit a bug report / patch: Is this a good way to fix this in C++98? 99% of the C++ I've written has been C++11 or later, so I can never remember the idiomatic way to do things pre-C++11.

moving to a global initializer is definitely better if they're fine with the initializer being executed eagerly during load. if they really want to make it lazy, they should move it into a static local variable and make sure they're compiling with thread-safe statics, which are the default on most compilers
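
A minimal sketch of the lazy variant, with hypothetical names (build_items, items) and a shortened list. C++11 guarantees that the initialization of a static local is thread-safe; in C++98 mode that guarantee comes from the compiler instead (GCC and Clang provide it unless -fno-threadsafe-statics is passed).

```cpp
#include <cassert>   // only needed if you want to assert on the result
#include <string>
#include <vector>

// Builds the full list in a local, so no other thread can ever
// observe a half-filled vector.
static std::vector<std::string> build_items() {
    std::vector<std::string> v;
    v.reserve(3);
    v.push_back("poo");
    v.push_back("fart");
    v.push_back("butt");
    return v;
}

// Lazy accessor: the static local is initialized the first time
// control reaches its declaration, and only once.
const std::vector<std::string>& items() {
    static const std::vector<std::string> instance = build_items();
    return instance;
}
```

Callers would use items() wherever they previously called initialize_items() and then touched s_items, so the empty-check race goes away.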

VikingofRock posted:

Now that I think about it, the last thing that might be relevant here is that s_items is actually a member of a class. Not sure if that changes things (other than the syntax, slightly).

being a static class member shouldn't make any difference vs. being a true global

Arcteryx Anarchist
Sep 15, 2007

Fun Shoe

cinci zoo sniper posted:

not to say i dislike it or think its poo poo, im just mildly disappointed that i can make a strong case for using r over python in my jerb without really trying or pulling arguments by their ears

Yeah, some of the cluster frameworks support R as well, so unless you drive them towards ones that are python/java/scala only they can always argue to continue on that path

Zemyla
Aug 6, 2008

I'll take her off your hands. Pleasure doing business with you!

rjmccall posted:

being a static class member shouldn't make any difference vs. being a true global
Yeah, it's basically like poor man's namespacing. "Classes" don't exist at runtime, and it'd exist in the DATA or BSS segment like a real global.

Powerful Two-Hander
Mar 10, 2004

Mods please change my name to "Tooter Skeleton" TIA.


cinci zoo sniper posted:

im starting to be really surprised with popularity of python in data analysis. more specifically, for some reason i assumed that pandas is much better than it actually is - im not sure its good for anything other than very small, low-dimensional datasets

there's some pretty nice looking "live notepad" type deal that hooks into data sources and lets you just gently caress around with python queries against boxed datasets or something that I've seen people at work using but I haven't had time to try it.

i mean, it's just like having ssms and a proper data source instead of 8 billion csv files but I guess without using God's own language aka SQL

JawnV6
Jul 4, 2004

So hot ...
im a terrible (old) programmer and whenever i see BSS the first expansion that comes to mind is The Verve's

VikingofRock
Aug 24, 2008




rjmccall posted:

moving to a global initializer is definitely better if they're fine with the initializer being executed eagerly during load. if they really want to make it lazy, they should move it into a static local variable and make sure they're compiling with thread-safe statics, which are the default on most compilers


being a static class member shouldn't make any difference vs. being a true global

That's exactly what I thought, but I wasn't sure. Cool cool. And now that I think about it, I think this vector only gets used in the .cxx file, so I think I can actually just remove it as a class member altogether.

Thanks for your help, everyone.

cinci zoo sniper
Mar 15, 2013




Powerful Two-Hander posted:

there's some pretty nice looking "live notepad" type deal that hooks into data sources and lets you just gently caress around with python queries against boxed datasets or something that I've seen people at work using but I haven't had time to try it.

i mean, it's just like having ssms and a proper data source instead of 8 billion csv files but I guess without using God's own language aka SQL

that's jupyter. r has something similar, and they both are equally worthless to do actual work. if you're 13.5x coding megaburrito or something that's your lazy coverup for presentation, but that's pretty much where it ends

as for tons of csv files, i can imagine with some nosql garbage. our mongo currently works like a flipcoin if you're querying more than 2 weeks of data (a few hundo megabytes) at once, whereas postgres/mysql is what you would expect, you can poo poo out entire database into a single csv if you want

cinci zoo sniper
Mar 15, 2013




also not sure where i stand on the overall scale of sql usage in data analysis. all my coworkers like to write sql scripts hundreds or thousands of lines long and do everything in them, and i just dont get the point of doing so if you aren't limited to sql and excel. i just pull the data with a few lines of sql and janitor up something actually maintainable in r instead, at a fraction of the time or effort.

JewKiller 3000
Nov 28, 2006

by Lowtax
the problem is that r is garbage while sql is extremely cool and good

cinci zoo sniper
Mar 15, 2013




JewKiller 3000 posted:

the problem is that r is garbage while sql is extremely cool and good

for data my work deals in sql is poo poo too, you're limited to the most trivial of operations you must do and insistence to try to make an analytical tool out of sql will just lead to dumb poo poo like 80 kilobyte sql scripts my coworker writes for a single calculation that takes uhh, 100 lines in r?

cinci zoo sniper
Mar 15, 2013




"what do you mean saying you dont want to use my script"

Powerful Two-Hander
Mar 10, 2004

Mods please change my name to "Tooter Skeleton" TIA.


my experience of R is some guy kept emailing our team edl demanding that we install R for him and we got so fed up telling him to gently caress off we just deleted the edl

but that was years ago and I think now you can integrate it into SQL server 2016 or something so who knows!

cinci zoo sniper
Mar 15, 2013




Powerful Two-Hander posted:

my experience of R is some guy kept emailing our team edl demanding that we install R for him and we got so fed up telling him to gently caress off we just deleted the edl

but that was years ago and I think now you can integrate it into SQL server 2016 or something so who knows!

microsoft bought out an r vendor and started strapping lots of poo poo for analytics together using the vendors stuff and their inhouse tooling/db stuff, but i wouldnt risk putting it all together. getting r to point where it is worthy of a non-local environment (or dedicated computing farm) involves disproportionate effort

there also are libraries for r that allow you to query db directly, but as you imagine, they are absolutely inferior to using something like sqlworkbench/j or datagrip or what have you, any db tool with decent developer.

closest direct sql and r intersection that i can admit being legit useful is the libraries that allow writing sql queries inside r environment, that can be useful if you are used to sql stuff. other than that, imo, a separation between r and rest of the world is due

big smart r people still seem to have troubles figuring out this whole "reproducible analytics" thing for a reason

JawnV6
Jul 4, 2004

So hot ...
is the new coding horror horror poster another how!! rereg?

fritz
Jul 26, 2003

cinci zoo sniper posted:

im starting to be really surprised with popularity of python in data analysis. more specifically, for some reason i assumed that pandas is much better than it actually is - im not sure its good for anything other than very small, low-dimensional datasets

ultimately i think the question is 'what else are you gonna use', python's got a long history in scientific computing and sure it could be better but it's not loving matlab

and now if you're gonna ask why python's got that history, again go back to 1997 and tell me 'what else are you gonna use', and remember you gotta make it palatable to scientists used to fortran and matlab, and consider the alternate reality in which the other major scripting language of the day won in more fields besides bioinformatics

fritz
Jul 26, 2003

JawnV6 posted:

is the new coding horror horror poster another how!! rereg?

i like how!!

Maluco Marinero
Jan 18, 2001

Damn that's a
fine elephant.
i remember how!!

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

JawnV6 posted:

is the new coding horror horror poster another how!! rereg?

I have so far resisted the urge to reply and yell a lot

I am proud of me
