pokeyman
Nov 26, 2006

That elephant ate my entire platoon.
Anytime someone says the format is "CSV", save yourself a headache and ask "which CSV?"

bob dobbs is dead
Oct 8, 2017

I love peeps
Nap Ghost
you don't think of it as "portable", you think of it as "more portable than excel, lol"

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
honestly xlsx might actually be more portable, lmao

Carbon dioxide
Oct 9, 2012

pokeyman posted:

Anytime someone says the format is "CSV", save yourself a headache and ask "which CSV?"

Do you mean comma-separated comma-separated values, or semi-colon comma-separated values?

Ola
Jul 19, 2004

Carriage return separated values.

NihilCredo
Jun 6, 2011

iram omni possibili modo preme:
plus una illa te diffamabit, quam multæ virtutes commendabunt

I usually see pipes as a separator when the fields contain user-inputted text, which is most of the time
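Whatever the separator ends up being, Python's csv module will happily write it - a quick illustration (nothing here comes from any real system):

code:

import csv, io

row = ["plain", "has, comma", 'has "quotes"']

# the same row written with three different "CSV" dialects
for delim in (",", ";", "|"):
    buf = io.StringIO()
    csv.writer(buf, delimiter=delim).writerow(row)
    print(repr(buf.getvalue()))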

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
it's also bizarre since ASCII already includes Field Separator and Record Separator characters... nobody wants to use them
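if you did want to use them, python's csv module doesn't object - throwaway demo (0x1F is the ASCII unit separator, 0x1E the record separator):

code:

import csv, io

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\x1f", lineterminator="\x1e")
writer.writerow(["no, quoting", "needed for commas", "here"])
print(repr(buf.getvalue()))  # fields split on 0x1f, record ends with 0x1e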

necrotic
Aug 2, 2005
I owe my brother big time for this!
Harder to mash those in with notepad.txt

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

canis minor posted:

This is what I wanted to hear, thank you very much! It's a good thing that I'm leaving this place and won't be dealing with engineering this (as tracking who/what/when edited the CSVs, or how many CSVs there are, is information that needs to be tracked in a DB :v: )

so let's be charitable and give your engineers the benefit of the doubt.

conceivably the way this would be done is that the original data would be in the database, and there would be a middle-tier api to serialize the data as CSV when appropriate. i can imagine it would look like:

a) db with the data model correctly in tables/etc
b) middle tier with an api for getting data science results as csv (something like getResults()).
c) python client that does the data science stuff

The advantages of having an api are:
a) The client probably shouldn't be able to execute whatever queries it wants against the database.
b) The client doesn't have to care about serialization or deserialization.

A more conventional approach would be to just make a RESTful api that gives back the results as JSON/XML, but if the module you're plugging the data into is expecting CSV, it might not make sense to do that conversion on the client - since the data isn't in json internally, it would be wasteful to have the server serialize it as json, send it across the wire, have the client deserialize it, and then convert it into CSV. For relatively small datasets, that wouldn't be a problem, but if we're talking 20+ MB of results, it could be seconds of wasted time.
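To sketch what that middle tier could look like - a hypothetical flask + sqlite version, where the table and column names are all made up:

code:

# Hypothetical getResults endpoint: the client never touches the db,
# it just gets CSV back over HTTP. Flask/sqlite and all names are invented.
import csv
import io
import sqlite3

from flask import Flask, Response

app = Flask(__name__)

@app.route("/results.csv")
def get_results():
    conn = sqlite3.connect("science.db")
    try:
        cur = conn.execute("SELECT run_id, feature, value FROM results")
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow([col[0] for col in cur.description])  # header row
        writer.writerows(cur)                                  # data rows
    finally:
        conn.close()
    return Response(buf.getvalue(), mimetype="text/csv")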

Your way with a database tracking csv files is totally crazy and I can't imagine anyone stupid enough to try something like that.

QuarkJets
Sep 8, 2008

jesus at least use HDF5

canis minor
May 4, 2011


As far as I know there will be a CSV for every time the data is fed into machine learning, with the files being tracked (what/why needs to be tracked, idk)

The way I'd do it is to not have the CSV part at all - if the algorithm is expecting data, why do DB -> API -> CSV -> algo when you can go straight DB -> algo? I guess the algorithm might be expecting CSV, but refactoring the code to work with a DB connection shouldn't be a problem. Why create that DB -> CSV endpoint at all?

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


You can store the CSVs from a given run, which makes reproducing/debugging significantly easier. If you're going to talk to the database directly, you'll need some way to recreate the state of the database at the time a model was run to reproduce the run.

Kilson
Jan 16, 2003

I EAT LITTLE CHILDREN FOR BREAKFAST !!11!!1!!!!111!
https://gizmodo.com/a-google-engineer-discovered-a-vulnerability-letting-hi-1828787568

canis minor
May 4, 2011

ultrafilter posted:

You can store the CSVs from a given run, which makes reproducing/debugging significantly easier. If you're going to talk to the database directly, you'll need some way to recreate the state of the database at the time a model was run to reproduce the run.

Ok, yes, though you could achieve the same thing by keeping previous states as snapshots of the DB, just like copying the CSVs :v:. I imagine, as well, that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time).
The person who will be implementing this, though, was handwaving about performance, about CSV being the standard in Python machine learning (I don't know anything about the subject), and about how you should always be using numpy + pandas + CSV over anything else, even local sqlite - hence the "why" question.

Don't get me wrong, I do appreciate all the points made - I'd happily accept if I was told "I want to open this dataset in Excel, see if it makes sense, maybe run some function on it that I know how to write in Excel" or "I want to operate on CSVs because that's what I'm accustomed to and seeing text files makes me happy".

boo_radley
Dec 30, 2005

Politeness costs nothing
Enterprise: the language spec

https://github.com/joaomilho/Enterprise/blob/master/README.md

quote:

The line comment is useful when you want to restate what the next line does. Here's an idiomatic example:


code:

// adds one to counter
counter++;;;


block comment


code:

/* this is a block comment */


The block comment is useful when a comment is long, like explaining some implementation:


code:

/* The International Enterprise™ Association only certifies code with a block comment that exceeds three lines, so this comment guarantees our future certification. */


On top of these, Enterprise™ adds:

copyright comment


code:

/© This code is property of ACME™ studios 2017. ©/


Every Enterprise™ program must begin with a copyright notice, else it will not compile and fail with an UnexpectedNonDisruptiveOpenSourceException error.

It's customary to cover any non trivial implementation in Enterprise™ with a copyright (and a comment). On top of that add an NDA comment (see below).

NDA comment


code:

/© This following code implements a "Web Dropdown Menu", copyright number 9283F3. ©/ 
/NDA The following code can only be read if you signed NDA 375-1. If you happen to read it by mistake, send a written letter to our legal department with two attached copies immediately. NDA/

boo_radley fucked around with this message at 00:29 on Sep 4, 2018

Absurd Alhazred
Mar 27, 2010

by Athanatos

:bisonyes:

QuarkJets
Sep 8, 2008

canis minor posted:

Ok, yes, though you could achieve the same thing by keeping previous states as snapshots of the DB, just like copying the CSVs :v:. I imagine, as well, that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time).
The person who will be implementing this, though, was handwaving about performance, about CSV being the standard in Python machine learning (I don't know anything about the subject), and about how you should always be using numpy + pandas + CSV over anything else, even local sqlite - hence the "why" question.

Don't get me wrong, I do appreciate all the points made - I'd happily accept if I was told "I want to open this dataset in Excel, see if it makes sense, maybe run some function on it that I know how to write in Excel" or "I want to operate on CSVs because that's what I'm accustomed to and seeing text files makes me happy".

my two cents, he's kind of wrong, CSV is common but HDF5 is common too and way superior, for all reasons except "can be opened with notepad"

Anyone who says "CSV should obviously be the standard" for what often amounts to GB or sometimes even TB of numerical data is a fool
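rough illustration of the size difference (numpy + h5py, arbitrary shape, exact numbers depend on the data):

code:

import numpy as np
import h5py

data = np.random.rand(1_000_000, 10)

np.savetxt("data.csv", data, delimiter=",")   # hundreds of MB of ASCII text
with h5py.File("data.h5", "w") as f:
    f.create_dataset("data", data=data)       # ~80 MB of binary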

QuarkJets fucked around with this message at 00:46 on Sep 4, 2018

Scikar
Nov 20, 2005

5? Seriously?

If someone said to me 'CSV should be the standard' I would hear it as 'I want to turn a small problem for me into a big problem for you'. Especially when people start complaining because they ran their analysis on a CSV file that turned out to be out of date and try to pin it on you for not synchronising them in a way that guarantees consistency.

MrMoo
Sep 14, 2000

xtal posted:

If you used a local database or SQLite they would probably be approximately equivalent in terms of speed, but much, much more powerful. You could even query the database for the results as CSV and pipe that in to your program instead and the slowdown would be minimal. Whether or not the entire dataset fits in memory is a big difference, but orthogonal to the question of CSV vs. RDBMS.

Not really, I would guess the main target is OLAP cube processing in ways that SQL is pretty awful at. Convention at Bank of America, for example, is to download hundreds of MB of data on each query from KDB into a local cube and manipulate it locally in Python.

QuarkJets
Sep 8, 2008

Also CSV is not a good format for portability, as your files are needlessly huge. Being able to open the file in Notepad and read through its contents is not a portability feature, it's an interpretability feature (and a pointless one at that)

MrMoo
Sep 14, 2000

Yet historical tick data, i.e. quotes & trades, is always stored and transferred as CSVs - like gigabytes a day.

redleader
Aug 18, 2005

Engage according to operational parameters

QuarkJets posted:

Being able to open the file in Notepad and read through its contents is not a portability feature, it's an interpretability feature (and a pointless one at that)

Why do you think that's pointless? I've found it fantastically useful to be able to open and gently caress around with CSVs in any old editor.

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

QuarkJets posted:

Also CSV is not a good format for portability, as your files are needlessly huge.

This is a weird use of the term "portability" in this context. Also surely CSV gzips ok? Boom fixed.
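e.g. pandas handles the gzip transparently (toy frame, made-up filename):

code:

import pandas as pd

df = pd.DataFrame({"x": range(100_000), "y": [0.5] * 100_000})
df.to_csv("results.csv.gz", index=False)  # compression inferred from the extension
back = pd.read_csv("results.csv.gz")      # decompressed transparently on read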

Carbon dioxide
Oct 9, 2012


This link made my Linux Firefox tab crash.

boo_radley
Dec 30, 2005

Politeness costs nothing

Carbon dioxide posted:

This link made my Linux Firefox tab crash.

Please try using an Enterprise web browser.

QuarkJets
Sep 8, 2008

redleader posted:

Why do you think that's pointless? I've found it fantastically useful to be able to open and gently caress around with CSVs in any old editor.

In the context of data science, you're maybe going to want to directly view a vanishingly small number of these files, to spot check them or when things go wrong. When that happens, there's no material difference between opening them in emacs and opening them in whatever program supports your data format, and it's likely that your org actually has a tool for doing that kind of diagnosis (and if they don't, you should write one, because it saves a lot of time when you're dealing with a lot of data volume).

There have been plenty of times where I've been loving around with a research project and just used CSVs because that was quicker to implement but those files have no business in an operational environment where performance matters.

streetlamp
May 7, 2007

Danny likes his party hat
He does not like his banana hat
uh why would someone write a little JS snippet like this

code:
document.write("<scr" + "ipt language='JavaScr" + "ipt' type='text\/javascr" + "ipt' src='https://notmyworkplace.com/cgi-bin/cgiwrap/getcaldetail.pl?ID="+pair[1]+"'><\/scr" + "ipt>");

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Old browsers were really bad and would parse <script>"<script>"</script> as a nested script tag, which is sort of weird. The text\/javascript escaping is either there to get around some extremely basic filter/block, or a misunderstanding of what was going on with the "scr" + "ipt" thing. Note that this hack hasn't been necessary since IE5.

streetlamp
May 7, 2007

Danny likes his party hat
He does not like his banana hat
i work in higher ed so yeah we lag a bit behind on that timeline of things

JawnV6
Jul 4, 2004

So hot ...

canis minor posted:

Ok, yes, though you could achieve the same thing by keeping previous states as snapshots of the DB, just like copying the CSVs :v:. I imagine, as well, that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time).
Or if for some wild reason the data in the DB was subject to change and the person wanted something reproducible. Like the previous page had someone with this issue:

canis minor posted:

So, it's a matter of transferring datasets, which I guess, yes, there's a point there. I guess, if it ran on the same box, would there still be much difference? (let's assume that the data is read-only for the machine learning algorithms, but will indeed change at any point after the algorithms are run)

canis minor
May 4, 2011

QuarkJets posted:

my two cents, he's kind of wrong, CSV is common but HDF5 is common too and way superior, for all reasons except "can be opened with notepad"

Anyone who says "CSV should obviously be the standard" for what often amounts to GB or sometimes even TB of numerical data is a fool

Will keep that in mind - I'd think that if you want to look at the data, you'd want to perform some operations (aggregate, filter, etc.), which I don't see how CSV is suited for - you're essentially using Excel to do stuff to that dataset, so you end up with "SQL", so why not use SQL to begin with? At my place they bought a water-cooled PC to deal with Excel parsing through this CSV, which is just :stonklol: (but again, not something I deal with)

JawnV6 posted:

Or if for some wild reason the data in the DB was subject to change and the person wanted something reproducible. Like the previous page had someone with this issue:

I'm sorry, but I don't understand the point you're making - I'd assume that if you want something reproducible you'd create a dump, then run the algorithm. Let's say a day passes and the data changes - so you create a dump and run your algorithm again. If you want to get back to the previous day, you load the previous dump. You'd have reproducible snapshots, and using CSV in that scenario isn't better than using a DB - you can still do that with DBs.

I don't know if at any point you can do without the previous snapshots, but I'd hope that at a certain point your algorithm works and doesn't need to be fine-tuned to the provided data, so you can drop them (I'm sure I'm wrong :D).

canis minor fucked around with this message at 17:50 on Sep 4, 2018

JawnV6
Jul 4, 2004

So hot ...

canis minor posted:

I'm sorry, but I don't understand the point you're making - I'd assume that if you want something reproducible you'd create a dump, then run the algorithm. Let's say a day passes and the data changes - so you create a dump and run your algorithm again. If you want to get back to the previous day, you load the previous dump. You'd have reproducible snapshots, and using CSV in that scenario isn't better than using a DB - you can still do that with DBs.

I don't know if at any point you can do without the previous snapshots, but I'd hope that at a certain point your algorithm works and doesn't need to be fine-tuned to the provided data, so you can drop them (I'm sure I'm wrong :D).
Okay, did the folks you're actually working with bring this use case to your attention, or was it just my shitpost? Do you naturally set up a DB to provide point-in-time snapshots on queries like this? You didn't bother to distinguish between "the data I pulled has changed" and "the data I pulled has not changed, but there is additional data relevant to the query" so I'm not that confident you understand what you're doing over there.

"I ran my algo yesterday and got an amazing model! Then I ran my training again today and the model is crap. Can you help me debug this?" - This is the guy who's going to tell you CSV's would have saved him, who wasn't sophisticated enough to put things into DB-snapshot terms because frankly that seems like your job.

canis minor
May 4, 2011

I don't have anything to do with this and have no idea / no say in how it will be set up - I've got bits of information like "we're going to use CSVs, because they've got better performance" and "we're going to track what is happening to the CSVs in the database", and I'm trying to assess how much sense it makes and whether the people who will be doing this know what they're doing.

canis minor fucked around with this message at 18:42 on Sep 4, 2018

Bruegels Fuckbooks
Sep 14, 2004

Now, listen - I know the two of you are very different from each other in a lot of ways, but you have to understand that as far as Grandpa's concerned, you're both pieces of shit! Yeah. I can prove it mathematically.

JawnV6 posted:

Okay, did the folks you're actually working with bring this use case to your attention, or was it just my shitpost? Do you naturally set up a DB to provide point-in-time snapshots on queries like this? You didn't bother to distinguish between "the data I pulled has changed" and "the data I pulled has not changed, but there is additional data relevant to the query" so I'm not that confident you understand what you're doing over there.

"I ran my algo yesterday and got an amazing model! Then I ran my training again today and the model is crap. Can you help me debug this?" - This is the guy who's going to tell you CSV's would have saved him, who wasn't sophisticated enough to put things into DB-snapshot terms because frankly that seems like your job.

so here's an idea:

if the data can be represented as a csv, it shouldn't be a problem to model it in the database - you make one table that looks like:

run_id, creation_date, table_name
-----

you make an api "addRun"

your addRun method
1. Makes a new table (possibly named run_table_{guid}) with column names from the csv (if present)
2. Populates the table with the contents of all the rows in the csv.
3. Adds an entry to the run table, indicating that the run can be found in table x.

Voila - now you can keep all the results of every run, forever, without jamming the whole loving thing into the database as a string or storing a billion flat-file csvs, and you can retain all results - you can even do poo poo like run SQL statements on the individual rows. woo magic. (of course, if all the tables have the same columns, then actually model the data properly rather than doing it this way)
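a rough sketch of that addRun idea with sqlite - every name in here is a placeholder and error handling is omitted:

code:

# Sketch only: one "runs" table plus one table per run, created from the
# CSV header. Assumes a well-formed CSV with a header row.
import csv
import sqlite3
import uuid

def add_run(db_path, csv_path):
    run_id = uuid.uuid4().hex
    run_table = "run_table_" + run_id
    with open(csv_path, newline="") as f, sqlite3.connect(db_path) as conn:
        reader = csv.reader(f)
        header = next(reader)
        cols = ", ".join('"%s" TEXT' % c for c in header)
        conn.execute("CREATE TABLE %s (%s)" % (run_table, cols))
        placeholders = ", ".join("?" for _ in header)
        conn.executemany(
            "INSERT INTO %s VALUES (%s)" % (run_table, placeholders), reader)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS runs "
            "(run_id TEXT, creation_date TEXT, table_name TEXT)")
        conn.execute(
            "INSERT INTO runs VALUES (?, datetime('now'), ?)",
            (run_id, run_table))
    return run_table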

Bruegels Fuckbooks fucked around with this message at 18:40 on Sep 4, 2018

JawnV6
Jul 4, 2004

So hot ...
Again, I'm glad DB folks have words for this, maybe you could speak to the domain expert to tease out these details instead of relying on SA shitposts?

Just a thought! I just sorta start with "what if the domain experts aren't total wankers who just want to jerk me around and actually have requirements even if they aren't eloquently stated in my domain's jargon" and go from there.

QuarkJets
Sep 8, 2008

canis minor posted:

Will keep that in mind - I'd think that if you want to look at the data, you'd want to perform some operations (aggregate, filter, etc.), which I don't see how CSV is suited for - you're essentially using Excel to do stuff to that dataset, so you end up with "SQL", so why not use SQL to begin with? At my place they bought a water-cooled PC to deal with Excel parsing through this CSV, which is just :stonklol: (but again, not something I deal with)

lol

return0
Apr 11, 2007

canis minor posted:

Ok, yes, though you could achieve the same thing by keeping previous states as snapshots of the DB, just like copying the CSVs :v:. I imagine, as well, that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time).
The person who will be implementing this, though, was handwaving about performance, about CSV being the standard in Python machine learning (I don't know anything about the subject), and about how you should always be using numpy + pandas + CSV over anything else, even local sqlite - hence the "why" question.

Don't get me wrong, I do appreciate all the points made - I'd happily accept if I was told "I want to open this dataset in Excel, see if it makes sense, maybe run some function on it that I know how to write in Excel" or "I want to operate on CSVs because that's what I'm accustomed to and seeing text files makes me happy".

It’s totally reasonable to abstract the model training implementation from the DB or API by using an exported immutable snapshot as its input. I’d personally use JSON lines or any other splittable format, but whatever.

Connecting directly to the DB is bad. Keeping snapshot archives as a DB dump is bad. Having splittable input is good.
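For example (trivial sketch, invented filename):

code:

import json

rows = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]

# One JSON object per line: workers can each take a byte range and split
# on newlines, no shared parser state needed.
with open("snapshot.jsonl", "w") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")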

You’ve characterised their response to your objection as hand waving. Perhaps that’s because this stuff is fairly standard good practice, and they didn’t understand where you were coming from.

ryde
Sep 9, 2011

God I love young girls

canis minor posted:

Ok, yes, though you could achieve the same thing as having previous states as snapshots of the DB, just like copying the CSVs :v:. I imagine as well, that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time).

You are always tweaking it. We want to "re-train" ML algorithms periodically on new data, in part because we're detecting behavior and it changes over time.

Likewise you may have to re-train a previous version of your model. A lot of ML tools, such as scikit, expect you to just serialize the whole object graph using stuff like Java serialization/Dill and then deserialize it for prod. This can obviously break across library versions or code changes. So you might have to rebuild your serialized object graph at some point, and you want the original data around.
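To make that concrete, a minimal sketch of what "serialize the whole object graph" looks like with scikit-learn and joblib (toy data, invented filename):

code:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
import joblib

X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

joblib.dump(model, "model.joblib")      # training time: pickle the fitted object
restored = joblib.load("model.joblib")  # later: can break if library versions drift
print(restored.predict(X[:5]))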

If you're lucky you have a somewhat implementation-independent format for describing the models and retraining isn't necessarily a thing, but you also have a bunch of transformations before you even feed data into the model, and you may have to retrain when *that* changes too.

Edit: to put it another way, it's not handwaving. Versioning ML algorithms involves building a version over the input data, the transforms on the data, the training code, and the library support code. Hard to version data that you don't snapshot.

Edit2: There's also no reason it has to be CSV, except for tooling. We would prefer something like Parquet or Avro, but tooling can be limited, so we end up with JSON more often than not.

ryde fucked around with this message at 21:52 on Sep 4, 2018

canis minor
May 4, 2011

return0 posted:

It’s totally reasonable to abstract the model training implementation from the DB or API by using an exported immutable snapshot as its input. I’d personally use JSON lines or any other splittable format, but whatever.

Connecting directly to the DB is bad. Keeping snapshot archives as a DB dump is bad. Having splittable input is good.

You’ve characterised their response to your objection as hand waving. Perhaps that’s because this stuff is fairly standard good practice, and they didn’t understand where you were coming from.

I agree with you that I don't know enough about it to recognize what is and isn't best practice, hence the questions. I can have a read about it if you've got resources on why connecting directly to the DB is bad, even if it's local. Valid reasons have already been brought up - people wanting to look at the data in question and not knowing SQL, or the delay of transferring data over the wire (not applicable here, but still valid in general applications). I'm genuinely curious as to why it is a standard.

edit:

ryde posted:

You are always tweaking it. We want to "re-train" ML algorithms periodically on new data, in part because we're detecting behavior and it changes over time.

Likewise you may have to re-train a previous version of your model. A lot of ML tools, such as scikit, expect you to just serialize the whole object graph using stuff like Java serialization/Dill and then deserialize it for prod. This can obviously break across library versions or code changes. So you might have to rebuild your serialized object graph at some point, and you want the original data around.

If you're lucky you have a somewhat implementation-independent format for describing the models and retraining isn't necessarily a thing, but you also have a bunch of transformations before you even feed data into the model, and you may have to retrain when *that* changes too.

Edit: to put it another way, it's not handwaving. Versioning ML algorithms involves building a version over the input data, the transforms on the data, the training code, and the library support code. Hard to version data that you don't snapshot.

Edit2: There's also no reason it has to be CSV, except for tooling. We would prefer something like Parquet or Avro, but tooling can be limited, so we end up with JSON more often than not.

I see - thank you; this definitely explains the need for maintaining the snapshots, but doesn't explain why it can't be DB snapshots. CSV - ok, it's readable, but where does the performance bit kick in? I can see the point of the JSON and Parquet / Avro you've mentioned to address the issues that were raised.

canis minor fucked around with this message at 22:44 on Sep 4, 2018

QuarkJets
Sep 8, 2008

The performance hit for using CSV is that the files are really big, because they're ASCII text. That's okay if you don't have much data or are training on one computer, but a lot of training is done on supercomputers, and all of the nodes you're using need access to the data, so your performance depends on how much data needs to be moved around. Distributed file systems like GPFS and HDFS try to make that process as straightforward as possible for the user, but there are still performance ramifications under the hood.

The performance hit for using SQL is similar but different, depending on the implementation. If the nodes all need to query a central database for data, then that's a bottleneck. If you just use the db to create a CSV (or even a more compact format) and then use that, then that's a bottleneck too. If you're just using the db to log changes to CSV files, then that seems pretty innocuous to me, but I wonder if something like sticking the CSVs into a git repo might be better.
