|
Thermopyle posted:Yep. This is an issue across a lot of fields in science and is part of the reason there's a crisis of sorts with regards to reproducibility. I'm in particular thinking about the more smooshy medical/health/social sciences. It's a general crisis? I thought I was the only one who cared about it. I'm a computer science student, and my eyes have only recently been opened to the utter atrocity that is the lack of reproducibility in computer science papers. How can this even be science? How can someone publish a benchmark result, be vague about the specifics of their algorithms and implementation, yet not provide any code? Not only can I not reproduce their results, I cannot even reasonably compare my own to theirs! Even when code is available, it is often environment-dependent (with said environment left undocumented) to such a degree that you have no chance of getting it to work. Reproducibility in computer science should be trivial, but instead it is non-existent. I really wish there was a journal, or just a paper database, that only accepted submissions that lived up to some standard of reproducibility.
|
# ? Apr 14, 2013 12:04 |
|
Athas posted:It's a general crisis? I thought I was the only one who cared about it. I'm a computer science student, and my eyes have only recently been opened to the utter atrocity that is the lack of reproducibility in computer science papers. How can this even be science? How can someone publish a benchmark result, be vague about the specifics of their algorithms and implementation, yet not provide any code? Not only can I not reproduce their results, I cannot even reasonably compare my own to theirs! Even when code is available, it is often environment-dependent (with said environment left undocumented) to such a degree that you have no chance of getting it to work. Reproducibility in computer science should be trivial, but instead it is non-existent. I really wish there was a journal, or just a paper database, that only accepted submissions that lived up to some standard of reproducibility. In the specific field of software engineering and maintenance, my thesis work involves studying papers covering the last ~10 years and examining what techniques and procedures are commonly used in research experiments. From the results of the study, we created a software library containing (well-documented) implementations of these techniques. Our work is in collaboration with the people at TraceLab, which is a tool that lets you create experiments and share them, with the idea that all of the settings, data, implementations, etc. are packaged up and run exactly the same on any computer. TraceLab's grant alone is $2,000,000 from the NSF. If you add up the grants of all the collaborators it is even more. So yeah, reproducibility is a major issue in computer science.
|
# ? Apr 14, 2013 13:43 |
|
QuarkJets posted:It's exactly for the reasons that you say; the older crowd wants to keep using FORTRAN and IDL because that's what they've always used, whereas the younger crowd wants to use Python and C++ because it's what they learned in school (and a lot of other reasons) Long story short, we eventually got it all converted to C. It still wasn't pretty, but it was better. We offered the C version back to the group who provided the original Fortran source. They didn't want it.
|
# ? Apr 14, 2013 17:51 |
|
You have to realize that anyone still working in FORTRAN has been told by well-meaning individuals who don't actually know what they are talking about that "C (or C++ or Java or Python or Boost) is better" for literally 40 years now.
|
# ? Apr 14, 2013 18:30 |
|
Oh, on the topic of FORTRAN people. I did some research (briefly, stuff came up in my life) for a professor in the department who used to use FORTRAN. She'd somehow been convinced to switch over to MATLAB. Her FORTRAN code was all good and worked wonderfully and quickly but her MATLAB was actually written largely by some compsci grad student and was strangely awful, took forever to run, and based on my brief time there I guess it had some poorly defined limits because it would give output that went outside of the range of the input data. Poor lady. I hope she got it all working because jesus.
|
# ? Apr 14, 2013 18:40 |
|
Athas posted:It's a general crisis? I thought I was the only one who cared about it. I'm a computer science student, and my eyes have only recently been opened to the utter atrocity that is the lack of reproducibility in computer science papers. How can this even be science? How can someone publish a benchmark result, be vague about the specifics of their algorithms and implementation, yet not provide any code? Not only can I not reproduce their results, I cannot even reasonably compare my own to theirs! Even when code is available, it is often environment-dependent (with said environment left undocumented) to such a degree that you have no chance of getting it to work. Reproducibility in computer science should be trivial, but instead it is non-existent. I really wish there was a journal, or just a paper database, that only accepted submissions that lived up to some standard of reproducibility. If you'd like to have your mind blown, read this.
|
# ? Apr 14, 2013 21:08 |
|
Shugojin posted:Oh, on the topic of FORTRAN people. I did some research (briefly, stuff came up in my life) for a professor in the department who used to use FORTRAN. That makes me really sad. They're not even in the same ball pit, really. FORTRAN is what you use when you need good computational code that runs incredibly fast and you don't need to write your own libraries. It's a computational programming language. MATLAB is more of a computational toolbox; it has a lot of useful algorithms that can make your job easy if you don't already have other implementations, but it's slow as gently caress, has awful syntax and even worse documentation. For an example of horrible code that makes me cry, in MATLAB all errors that occur during onCleanup are transformed into warnings. Hope you caught all possible failure points!
|
# ? Apr 15, 2013 05:20 |
|
One of our developers implemented a one-to-many relationship in an SQL database by storing multiple FK values as a comma-separated string. Then, when it came to pulling the data out, he used a function that converted the string into XML by using REPLACE() to insert <t> </t> in place of the commas, then returned the individual values from the built XML using XPath queries. What the gently caress. His reasoning? "Well it was created as a one to one relationship but then I had to change it and I didn't have much time". Right.
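For illustration only, here's the shape of that anti-pattern next to the conventional fix, sketched against SQLite from Python (table and column names are invented; the original was presumably SQL Server, where the CSV-to-XML trick lives):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The anti-pattern: a "one to many" relationship stored as a CSV string
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item_ids TEXT)")
conn.execute("INSERT INTO orders VALUES (1, '101,102,103')")

# Pulling the FKs back out means string surgery instead of a join
row = conn.execute("SELECT item_ids FROM orders WHERE id = 1").fetchone()
item_ids = [int(x) for x in row[0].split(",")]
print(item_ids)  # [101, 102, 103]

# The conventional fix: a junction table the database can join and index
conn.execute("CREATE TABLE order_items (order_id INTEGER, item_id INTEGER)")
conn.executemany("INSERT INTO order_items VALUES (1, ?)",
                 [(i,) for i in item_ids])
joined = [r[0] for r in conn.execute(
    "SELECT item_id FROM order_items WHERE order_id = 1 ORDER BY item_id")]
print(joined)  # [101, 102, 103]
```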
|
# ? Apr 15, 2013 11:16 |
|
In our login.sql:

SET @LOGINSQL += ' WHERE (user.[Username] = ''' + @username + ''' OR user.[Email_Address] = ''' + @username + ''') AND u.[Password] = ''' + password + ''' AND ... AND ...'
exec(@LOGINSQL)
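For anyone who hasn't seen why that's a horror: concatenating user input into SQL text is a textbook injection hole. A minimal demonstration, using SQLite from Python rather than T-SQL (schema and values invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hunter2')")

# String concatenation, as in the login.sql above: the "username" becomes
# part of the SQL itself, so an attacker can rewrite the query
username = "' OR '1'='1' --"
query = ("SELECT COUNT(*) FROM users WHERE username = '" + username
         + "' AND password = 'wrong'")
matched = conn.execute(query).fetchone()[0]
print(matched)  # 1 -- "logged in" without knowing any password

# Parameterized query: the input stays data, never SQL
safe = conn.execute(
    "SELECT COUNT(*) FROM users WHERE username = ? AND password = ?",
    (username, "wrong"),
).fetchone()[0]
print(safe)  # 0
```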
|
# ? Apr 15, 2013 11:25 |
|
Thermopyle posted:If you'd like to have your mind blown, read this. Alright, now that is some well documented poo poo. I am still working through the first few paragraphs. Takes me back to my college days. Can't wait to finish it all...it sparked a lot of conversation within our group.
|
# ? Apr 15, 2013 16:58 |
|
QuarkJets posted:You are exactly right. Even today most graduate physics/astronomy students have maybe one computational physics course before entering grad school, and most grad programs may only offer one additional computational course. There's not even an introduction to programming in these programs, you're told about these tools that are necessary for solving certain problems but nobody explains how to actually produce good code or anything like that. My graduate level computational class was basically just a class on Mathematica and was completely useless Maybe I'm too used to reading code written by scientists for scientists, but generally speaking, I don't think I'd call it "terrible". Indeed, most of the code I deal with heavily depends on the user knowing what the gently caress they are doing, and there's no "customer" where we have to worry about every little thing. I don't think that makes it "terrible" but yeah, it's intended only for one purpose, often for only one person or group. There's often not too many fail safes if you enter the wrong inputs, but part of science is knowing to "sanity check" your own results. NEVER trust a black box. That all said, I would like to point out that you are correct that most physics and astrophysics people are NOT taught good coding practice. We have to learn ourselves, and even I only learned what I did because I follow threads here a lot and read a LOT of stack overflow when I'm looking for a function. I had only ONE computer coding class and that was a grad level computational physics class. We were thrown right into "here's the math" of how to do algorithms and expected to just know the rest (or learn it on our own). I do know groups of guys (usually the astrophysics guys who run hardcore star simulations) who know lots of great coding techniques but that's not the norm. In any case, yeah, physics should include more coding methods classes. 
I know nothing about memory management for example, and I have a rudimentary knowledge of how to OOP (because OOP is often unneeded for science stuff). So yeah, in the sense that it's designed to get a question answered and that's about it, I guess you'd call it "terrible", but I find it more readable than most of the well written code I come across.
|
# ? Apr 15, 2013 17:03 |
|
evensevenone posted:You have to realize that anyone still working in FORTRAN has been told by well-meaning individuals who don't actually know what they are talking about that "C (or C++ or Java or Python or Boost) is better" for literally 40 years now.
|
# ? Apr 15, 2013 17:33 |
|
JetsGuy posted:(because OOP is often unneeded for science stuff). You could say OOP is "unneeded" for everything.
|
# ? Apr 15, 2013 17:33 |
|
Hitch posted:Alright, now that is some well documented poo poo. I am still working through the first few paragraphs. Takes me back to my college days. Can't wait to finish it all...it sparked a lot of conversation within our group. To be fair, he's making a minor statistical blunder himself. It's expected that many publishable results will be false positives. Say 100 researchers each test a different hypothesis, only 5% of which are actually true, and there's a 5% false positive rate. Then about 10 of the 100 find a positive result and get published, and roughly half of those findings are false positives. But that's fine, because now those 10% are going to have their experiments repeated, and only the true positives are going to be repeatable. Now, as you look for really small signals you are going to run into problems, but the false-positive rate of front-line research is not really a good measure.
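The arithmetic behind that claim, as a quick sanity check (numbers taken straight from the example above, assuming perfect statistical power):

```python
n = 100          # researchers, one experiment each
p_true = 0.05    # fraction of tested hypotheses that are actually true
alpha = 0.05     # false positive rate
power = 1.0      # assume every true effect is detected

true_pos = n * p_true * power         # 5.0 genuine discoveries
false_pos = n * (1 - p_true) * alpha  # 4.75 spurious ones
published = true_pos + false_pos

print(published)              # 9.75 -> about 10% get a publishable result
print(false_pos / published)  # ~0.49 -> about half of those are false
```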
|
# ? Apr 15, 2013 17:37 |
|
hobbesmaster posted:You could say OOP is "unneeded" for everything. True. The larger point I was making was that in scientific programming the only concern is "does it work", and beyond that there's not much concern about writing it in a certain "style". SOMETIMES there is the question of "can it work faster", but that's more when you're dealing with heavy loads of data, and oftentimes when you're doing parallel processing to begin with (so you're already ahead of the game in terms of scientific programmers). In the end, scientists are happy when the code works the way it is intended to. No one but them and their collaborators will likely ever use or see the code, so there's no reason to do much beyond that. I find that OOP code is much more readable than procedural code (and of course, much easier to fix), so it's what I do now, but a lot of programs I run across from collaborators are written procedurally just because that does what is required.
|
# ? Apr 15, 2013 18:02 |
|
JetsGuy posted:In the end, scientists are happy when the code works the way it is intended to. No one but them and collaborators will likely ever use or see the code so there's no reason to do much beyond that. ok but this is the sentiment of the software industry at large circa 1985 or so, there's a reason all these programming practices were invented.
|
# ? Apr 15, 2013 18:18 |
|
JetsGuy posted:True. The larger point I was making was that in scientific programming the only concern is does it work, and beyond that there's not much concern about writing it in a certain "style". A very important (and often overlooked) corollary to "does it work" is "when it fails, does it do so in an obvious, interpretable way?" The author of the Abandon Matlab blog wrote a pretty thorough post about this. This is one of the reasons why I dislike the usage of Perl in bioinformatics analysis pipelines -- the language seems to encourage logic that plays fast and loose with data types and correctness of the results. If (for example) I'm adjusting base positions because of a difference in 0-based and 1-based indexing, I want to know right away if there are problems with the input data. If a file is corrupt and I try to increment a string, Perl's magic ++ can happily turn "hello" into "hellp" (and "hello" + 1 silently evaluates to 1). Python requires you to convert strings to integers first, and you'll receive a very nice ValueError: invalid literal for int() with base 10: 'hello' when you try to do so.
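A minimal sketch of the Python side of that argument (the helper name and coordinate value are invented for the example):

```python
# A hypothetical 0-based -> 1-based coordinate fix-up: convert first, so a
# corrupt field fails loudly instead of sliding through as a mangled string
def shift_position(field, offset=1):
    return int(field) + offset

print(shift_position("41196821"))  # 41196822

try:
    shift_position("hello")
except ValueError as e:
    print(e)  # invalid literal for int() with base 10: 'hello'
```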
|
# ? Apr 15, 2013 18:21 |
|
Bunny Cuddlin posted:ok but this is the sentiment of the software industry at large circa 1985 or so, there's a reason all these things programming practices were invented. but you see it works.
|
# ? Apr 15, 2013 18:23 |
|
JetsGuy posted:Maybe I'm too used to reading code written by scientists for scientists, but generally speaking, I don't think I'd call it "terrible". Indeed, most of the code I deal with heavily depends on the user knowing what the gently caress they are doing, and there's no "customer" where we have to worry about every little thing. I don't think that makes it "terrible" but yeah, it's intended only for one purpose, often for only one person or group. There's often not too many fail safes if you enter the wrong inputs, but part of science is knowing to "sanity check" your own results. NEVER trust a black box. Sure, I wouldn't say that all science code is terrible. You're just a lot more likely to find terrible code in science projects because so few scientists receive any formal computer science training. There are still plenty of scientific projects with well-written code for one reason or another, whether that's due to luck or due to some of the coders having exposure to real computer science training or even just exposure to examples of good code. I took a C course from the engineering department when I was a physics undergrad before I took computational physics, and it was a huge boon to the entire class to have someone who knew how to code. Everyone else was completely lost for that first week, having never coded before, and we spent some long hours together learning how to write simple C programs. It's insane that this was normal practice in physical science education less than a decade ago (and I guess still is). Fast forward to grad school, a few more years of computational programming under my belt, and it was the same story as undergrad: most of the grad students in our graduate-level computational course had either never touched code before or had only coded in their undergrad-level computational course and had barely struggled through it and hated the whole experience. JetsGuy posted:True. 
The larger point I was making was that in scientific programming the only concern is does it work, and beyond that there's not much concern about writing it in a certain "style". And that's part of the problem. By the time that I was well into grad school I had finally realized that OOP had a bunch of advantages even if the code itself wasn't going to be read by anyone else. In my third year I wrote a set of classes that trivialized all of the extra work that I was doing on a regular basis. For instance, in ROOT I would typically make stacked histogram+data plots in a very specific way, so I wrote a class in Python that would do all of this extra work for me. When I moved onto new projects that required very similar plots with minor tweaks, I used the same class. For my fourth year of grad school I spent about 40% of my waking hours drinking, versus only 10% for my third year, thanks to OOP.
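The pattern is easy to sketch outside of ROOT. This is a made-up, minimal version of the idea (class and method names invented): the class owns the repetitive binning-and-stacking setup once, and each new project only supplies data and minor tweaks.

```python
class StackedHistPlot:
    """Owns the boilerplate of a stacked-histogram comparison plot."""

    def __init__(self, bins, labels, title=""):
        self.bins = bins        # shared bin edges for every component
        self.labels = labels
        self.title = title
        self.stacks = []

    def add_component(self, values):
        # Bin one background/signal component into the shared binning
        counts = [0] * (len(self.bins) - 1)
        for v in values:
            for i in range(len(self.bins) - 1):
                if self.bins[i] <= v < self.bins[i + 1]:
                    counts[i] += 1
                    break
        self.stacks.append(counts)
        return counts

    def totals(self):
        # Stacked total per bin, ready to draw or compare against data
        return [sum(col) for col in zip(*self.stacks)]

plot = StackedHistPlot(bins=[0, 1, 2, 3], labels=["bkg", "sig"])
plot.add_component([0.5, 1.5, 1.7])
plot.add_component([2.5])
print(plot.totals())  # [1, 2, 1]
```

The drawing call in the real thing would hand `totals()` to whatever plotting backend is in use; the point is that the setup lives in one place instead of being copy/pasted per project.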
|
# ? Apr 15, 2013 18:47 |
|
evensevenone posted:To be fair he's making a minor statistical blunder himself. It's expected that many of the publishable results will be false positives. Say you have 100 researchers looking for something that happens under 5% of conditions, and there's a 5% false positive rate. So 10% of researchers find it and get published, and half are false positives. I know the author so I forwarded your comment along. Here's his response: gwern branwen posted:> But that's fine, because now those 10% are going to have their experiments repeated, and only the true positives are going to be repeatable. This is getting tangentially off topic as poor quality code is only one of the reasons replication is not possible, so if anyone else is interested we should probably take it to a thread in SAL. Is anyone interested in such a thread?
|
# ? Apr 15, 2013 19:17 |
|
QuarkJets posted:Sure, I wouldn't say that all science code is terrible. You're just a lot more likely to find terrible code in science projects because so few scientists receive any formal computer science training. There are still plenty of scientific projects with well-written code for one reason or another, whether that's due to luck or due to some of the coders having exposure to real computer science training or even just exposure to examples of good code. I completely agree with your points here. I think some people are misinterpreting my "does it work" statement as being dismissive of good coding practices. I'm not; I'm just trying to explain why scientific code is often... raw. There's a certain element of "why bother" that a lot of scientists have when it comes to making the code cleaner or, at the very least, repeatable. Your example here is great, by the way. I actually spent a lot of the last few months finally writing an interactive program to do 95% of my plots. I got tired of copy/pasting all the code I needed to make a plot over and over for each project. Now I just have my codes output the data in a standardized format that a dedicated plotting program reads. It's not **quite** to where I want it for integrating into other programs, but right now it's a great standalone. More importantly, any time a colleague says "hey, plot X v Y", I can make a good-looking, easy-to-interpret graph in seconds. So I agree that scientific coding requires a lot more of a culture shift, but I'm just trying to tell people why it is what it is. It all really ties into the "it was good enough for me, it's good enough for you" attitude that pervades all of academia. It's why we get people thrown into the fire, and things like departments demanding 100 hours a week out of their grad students, or else they're lazy assholes who should leave. 
There is a huge cultural problem in the whole "all students should suffer" mentality that many established scientists tend to have. The argument, however, is if you love your field, it's not a burden to learn everything yourself.
|
# ? Apr 15, 2013 20:24 |
|
Another thing to remember, if other scientists can download your code and get the same results, that is not "repeatable" from a scientific perspective. For something to be repeatable, they should be able to take their own data, apply the analytic processes as described in your publication, and get results which agree. The publication is the matter of record, not the source code. Someone should never even have to see your code to get the same results.
|
# ? Apr 15, 2013 20:39 |
|
But if it doesn't check out, it could be helpful to see if it was an error in the code versus a statistical anomaly or malfeasance.
|
# ? Apr 15, 2013 20:53 |
|
What do you guys think about R? I've seen reasons why Maple and MATLAB are horrors (and I've struggled with them myself), but R seems to be straightforward and easy to use.
|
# ? Apr 16, 2013 00:27 |
|
I've never used R, but rants about how awful Matlab is often end in comments about how much better R is so I assume it's decent.
|
# ? Apr 16, 2013 00:42 |
|
I don't use R, but http://www.talyarkoni.org/blog/2012/06/08/r-the-master-troll-of-statistical-languages/
|
# ? Apr 16, 2013 00:52 |
|
I've not done much in it but I can say that its inline help actually works and is helpful, which is way more than can be said for most things.
|
# ? Apr 16, 2013 00:55 |
|
I use R quite a bit, and have taught some short courses on it for genetics grad students at my university. It's a very neat language; you can get a lot done without much code and the language's syntax is quite flexible. I use it for virtually all of my statistical and numerical analysis; I recently got a lot of use out of the survival and rpart packages.
|
# ? Apr 16, 2013 01:55 |
|
abiogenesis posted:What do you guys think about R? I've seen reasons about why MAPLE and MATLAB are horrors (and I've struggled with them myself) but R seems to be straightforward and easy to use. R has its own set of issues but consider this: the alternative is usually SAS.
|
# ? Apr 16, 2013 02:00 |
|
R is also like 20 years newer than MATLAB and has about 1/20th of the functionality. There are a lot of horrors in MATLAB, but a lot of them just have to do with how much crap has been added over the years.
|
# ? Apr 16, 2013 02:08 |
|
Where does S+ fit in all of this? I had to use it in a stats class once and remember it being somehow related to R.
|
# ? Apr 16, 2013 02:15 |
|
evensevenone posted:R is also like 20 years newer than MATLAB and has about 1/20th of the functionality. There are a lot of horrors in MATLAB, but a lot of them just have to do with how much crap has been added over the years. There's a shitload of packages on cran tho. Main thing I can think of that matlab has going for it is simulink, but I've never had any use for it in my stuff. Most of the complaints on the 'Abandon Matlab' blog I remember running into back when I was matlabbing in the 90s.
|
# ? Apr 16, 2013 02:23 |
|
Simulink is what you use to program guidance systems for missiles so I'm pretty sure that's not an R package.
|
# ? Apr 16, 2013 03:30 |
|
Is there someplace that I can read about R, comparing it to MATLAB? I know nothing about R. Generally I like to use NumPy and matplotlib, only falling back on our MATLAB site license if there's some sort of legacy code that I don't have the time to replicate. On topic, one of the engineers at the place where I work insists that you should only use 'new' and 'delete' for classes, never for primitives. If you want an array of ints, then you should only create it with malloc. Is this just crazy talk or is there a legitimate reason for this notion?
|
# ? Apr 16, 2013 05:59 |
|
R is apparently great for stats work but I absolutely despise it as a general programming language. My job is to do general "hey can you build us some software" programming for researchers, and whenever they say "well this is what I did in R" my reaction is what the gently caress. QuarkJets posted:On topic, one of the engineers at the place where I work insists that you should only use 'new' and 'delete' for classes, never for primitives. If you want an array of ints, then you should only create it with malloc. Is this just crazy talk or is there a legitimate reason for this notion? You'll get an authoritative answer in the C/C++ thread, but that sounds like horseshit, especially for modern compilers. (On the other hand, it's completely plausible that he had to use some terrible embedded compiler and this kind of thing was essentially a workaround for some bug in it.) e: The rule of thumb is "don't use malloc and new in the same codebase without a drat good reason," so see if you can get one out of him.
|
# ? Apr 16, 2013 06:08 |
|
malloc() is pretty much verboten in embedded circles, to say nothing of new. I think my avr-gcc even turns free() into no-ops by default.
|
# ? Apr 16, 2013 06:50 |
|
evensevenone posted:malloc() is pretty much verboten in embedded circles, to say nothing of new. I think my avr-gcc even turns free() into no-ops by default. what....what? Is everything just required to be statically allocated or what?
|
# ? Apr 16, 2013 07:22 |
|
What didn't you know you needed?
|
# ? Apr 16, 2013 07:26 |
|
ultramiraculous posted:what....what? It's part of several embedded programming standards, including DO-178C, which are the requirements for aviation software. If you can guarantee that your program takes up some amount of memory, that eliminates a large amount of failures, and lessens the chance of an airplane falling out of the loving sky because of some weird memory fragmentation issue.
|
# ? Apr 16, 2013 07:30 |
|
I think a lot of them also disallow recursive functions for the same reason.
|
# ? Apr 16, 2013 07:36 |