|
Anytime someone says the format is "CSV", save yourself a headache and ask "which CSV?"
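For what it's worth, the Python stdlib can take a heuristic guess at "which CSV" with `csv.Sniffer`; a minimal sketch (the sample data is made up):

```python
import csv

# Two files that are both "CSV": one comma-delimited, one semicolon-delimited
# (common in locales where the comma is the decimal separator).
comma_sample = "name,score\nalice,1.5\nbob,2.5\n"
semi_sample = "name;score\nalice;1,5\nbob;2,5\n"

def detect_delimiter(sample: str) -> str:
    """Guess the delimiter using the stdlib's heuristic sniffer."""
    return csv.Sniffer().sniff(sample, delimiters=",;|\t").delimiter

print(detect_delimiter(comma_sample))  # ,
print(detect_delimiter(semi_sample))   # ;
```

The sniffer is only a heuristic, so in practice you still want the producer to tell you the dialect.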
|
# ? Sep 3, 2018 17:04 |
|
you don't think of it as "portable", you think of it as "more portable than excel, lol"
|
# ? Sep 3, 2018 17:15 |
|
honestly xlsx might actually be more portable, lmao
|
# ? Sep 3, 2018 17:33 |
|
pokeyman posted:Anytime someone says the format is "CSV", save yourself a headache and ask "which CSV?" Do you mean comma-separated comma-separated values, or semicolon-separated comma-separated values?
|
# ? Sep 3, 2018 17:59 |
|
Carriage return separated values.
|
# ? Sep 3, 2018 18:08 |
|
I usually see pipes as a separator when the fields contain user-inputted text, which is most of the time
|
# ? Sep 3, 2018 19:38 |
|
it's also bizarre since ASCII already includes unit separator and record separator characters... nobody wants to use them
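Those control characters are trivial to use, too; a small sketch using the ASCII unit separator (0x1F) between fields and the record separator (0x1E) between rows, which sidesteps quoting entirely:

```python
# ASCII defines control characters for exactly this job:
# 0x1F Unit Separator (fields) and 0x1E Record Separator (records).
FS = "\x1f"  # unit separator, used here as the field delimiter
RS = "\x1e"  # record separator

def encode(rows):
    return RS.join(FS.join(fields) for fields in rows)

def decode(blob):
    return [record.split(FS) for record in blob.split(RS)]

rows = [["name", "note"], ["alice", "likes, commas; and semicolons"]]
assert decode(encode(rows)) == rows  # no quoting or escaping needed
```

The catch, as noted below, is that no text editor shows these characters sensibly.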
|
# ? Sep 3, 2018 20:03 |
|
Harder to mash those in with notepad.txt
|
# ? Sep 3, 2018 20:27 |
|
canis minor posted:This is what I wanted to hear, thank you very much! It's a good thing that I'm leaving this place and won't be dealing with engineering this (as tracking who/what/when edited the CSVs, or how many CSVs there are, is information that needs to be tracked in a DB)

so let's be charitable and give your engineers the benefit of the doubt. conceivably the way this would be done is that the original data would be in the database, and there would be a middle tier api to serialize the data as CSV when appropriate. i can imagine it would look like:
a) db with data model correctly in tables/etc
b) middle tier with api for getting data science results as csv (something like getResults())
c) python client that does the data science stuff

The advantages of having an api are:
a) Client probably shouldn't be able to execute whatever queries against the database.
b) Client doesn't have to care about serialization or deserialization.

A more conventional approach would be to just make a restful api that gives back the results as JSON/XML, but if the module you're plugging the data into is expecting CSV, it might not make sense to do that conversion on the client - since the data isn't in json internally, it would be wasteful to have the server serialize it as json, send it across the wire, have the client deserialize it, and convert it back into csv. For relatively small datasets that wouldn't be a problem, but if we're talking 20+ MB of results, it could be seconds of wasted time.

Your way with a database tracking csv files is totally crazy and I can't imagine anyone stupid enough to try something like that.
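That middle tier could look roughly like this sketch; the in-memory SQLite database, table, and column names are made up for illustration, and only the `getResults()` name comes from the post:

```python
import csv
import io
import sqlite3

# Stand-in for the real database: the data lives in proper tables,
# and CSV only exists at the serialization boundary.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE results (run_id INTEGER, score REAL)")
db.executemany("INSERT INTO results VALUES (?, ?)", [(1, 0.91), (2, 0.87)])

def getResults() -> str:
    """Serialize the results table as CSV on demand."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["run_id", "score"])  # header row
    writer.writerows(db.execute("SELECT run_id, score FROM results"))
    return buf.getvalue()

print(getResults())
```

The client never issues SQL and never sees the schema; it just asks for CSV.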
|
# ? Sep 3, 2018 20:29 |
|
jesus at least use HDF5
|
# ? Sep 3, 2018 21:50 |
|
As far as I know there will be a CSV for every time it's being fed into machine learning, with the files being tracked (what/why needs to be tracked, idk). The way I'd do it is not have the CSV part at all - if the algorithm is expecting data, why do DB -> API -> CSV -> algo if you can go straight DB -> algo? I guess the algorithm might be expecting CSV, but refactoring the code to work with a DB socket shouldn't be a problem. Why create that DB -> CSV endpoint at all?
|
# ? Sep 3, 2018 22:17 |
|
You can store the CSVs from a given run, which makes reproducing/debugging significantly easier. If you're going to talk to the database directly, you'll need some way to recreate the state of the database at the time a model was run to reproduce the run.
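One cheap way to keep per-run CSVs tied to the exact data used, assuming you're free to pick the snapshot naming scheme, is to content-address them (the naming convention here is hypothetical):

```python
import hashlib

# Name each stored snapshot after a hash of its contents, so identical
# inputs map to the same file and any change in the data is immediately
# visible in the run's recorded snapshot name.
def snapshot_name(csv_bytes: bytes) -> str:
    digest = hashlib.sha256(csv_bytes).hexdigest()[:12]
    return f"run_{digest}.csv"

a = snapshot_name(b"id,score\n1,0.9\n")
b = snapshot_name(b"id,score\n1,0.9\n")
c = snapshot_name(b"id,score\n1,0.8\n")
assert a == b and a != c  # content-addressed: same data, same name
```

Reproducing a run then means reading back exactly the file the run recorded, with no need to reconstruct database state.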
|
# ? Sep 3, 2018 22:21 |
|
https://gizmodo.com/a-google-engineer-discovered-a-vulnerability-letting-hi-1828787568
|
# ? Sep 3, 2018 22:21 |
|
ultrafilter posted:You can store the CSVs from a given run, which makes reproducing/debugging significantly easier. If you're going to talk to the database directly, you'll need some way to recreate the state of the database at the time a model was run to reproduce the run. Ok, yes, though you could achieve the same thing by keeping previous states as snapshots of the DB, just like copying the CSVs. I imagine as well that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time). The person that will be implementing this, though, was handwaving about performance, and about using CSV as a standard in Python machine learning (I don't know anything about the subject), and how you should always be using numpy + pandas + CSV over anything else, even local sqlite - hence the "why" question. Don't get me wrong, I do appreciate all the points made - I'd happily accept it if I was told "I want to open this dataset in Excel, see if it makes sense, maybe run some function on it that I know how to write in Excel" or "I want to operate on CSVs because that's what I'm accustomed to and seeing text files makes me happy".
|
# ? Sep 3, 2018 23:06 |
|
Enterprise: the language spec https://github.com/joaomilho/Enterprise/blob/master/README.md quote:The line comment is useful when you want to restate what the next line does. Here's an idiomatic example: boo_radley fucked around with this message at 00:29 on Sep 4, 2018 |
# ? Sep 3, 2018 23:54 |
|
boo_radley posted:Enterprise: the language spec
|
# ? Sep 4, 2018 00:16 |
|
canis minor posted:Ok, yes, though you could achieve the same thing as having previous states as snapshots of the DB, just like copying the CSVs . I imagine as well, that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time). my two cents, he's kind of wrong, CSV is common but HDF5 is common too and way superior, for all reasons except "can be opened with notepad" Anyone who says "CSV should obviously be the standard" for what often amounts to GB or sometimes even TB of numerical data is a fool QuarkJets fucked around with this message at 00:46 on Sep 4, 2018 |
# ? Sep 4, 2018 00:42 |
|
If someone said to me 'CSV should be the standard' I would hear it as 'I want to turn a small problem for me into a big problem for you'. Especially when people start complaining because they ran their analysis on a CSV file that turned out to be out of date and try to pin it on you for not synchronising them in a way that guarantees consistency.
|
# ? Sep 4, 2018 01:17 |
|
xtal posted:If you used a local database or SQLite they would probably be approximately equivalent in terms of speed, but much, much more powerful. You could even query the database for the results as CSV and pipe that in to your program instead and the slowdown would be minimal. Whether or not the entire dataset fits in memory is a big difference, but orthogonal to the question of CSV vs. RDBMS. Not really, I would guess the main target is OLAP cube processing in ways that SQL is pretty awful at. Convention at Bank of America, for example, is to download hundreds of MB of data on each query from KDB into a local cube and manipulate it locally in Python.
|
# ? Sep 4, 2018 01:34 |
|
Also CSV is not a good format for portability, as your files are needlessly huge. Being able to open the file in Notepad and read through its contents is not a portability feature, it's an interpretability feature (and a pointless one at that)
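A rough illustration of the size point, comparing doubles stored as CSV-style text against the packed binary layout a format like HDF5 uses (pure stdlib here, since HDF5 itself needs a third-party library):

```python
import struct

# 100k doubles, once as repr'd decimal text (one per line, as CSV would
# store them) and once as packed 8-byte IEEE 754 binary.
values = [i / 7.0 for i in range(100_000)]

csv_bytes = "\n".join(repr(v) for v in values).encode("ascii")
bin_bytes = struct.pack(f"{len(values)}d", *values)

print(len(csv_bytes), len(bin_bytes))  # text is roughly double the binary size
assert len(bin_bytes) == 8 * len(values)
assert len(csv_bytes) > len(bin_bytes)
```

Full-precision decimal text needs up to 17 significant digits per double, versus a fixed 8 bytes in binary, before you even count parsing cost.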
|
# ? Sep 4, 2018 01:56 |
|
Yet historical tick data, i.e. quotes & trades, is always stored and transferred as CSVs - like gigabytes a day.
|
# ? Sep 4, 2018 03:05 |
|
QuarkJets posted:Being able to open the file in Notepad and read through its contents is not a portability feature, it's an interpretability feature (and a pointless one at that) Why do you think that's pointless? I've found it fantastically useful to be able to open and gently caress around with CSVs in any old editor.
|
# ? Sep 4, 2018 03:37 |
|
QuarkJets posted:Also CSV is not a good format for portability, as your files are needlessly huge. This is a weird use of the term "portability" in this context. Also surely CSV gzips ok? Boom fixed.
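It does gzip well, for what that's worth; a quick stdlib sketch on made-up repetitive data:

```python
import gzip

# CSV compresses well because it's highly repetitive ASCII text.
rows = "\n".join(f"user{i},2018-09-04,0.5" for i in range(10_000)).encode()

packed = gzip.compress(rows)
assert gzip.decompress(packed) == rows  # lossless round trip
assert len(packed) < len(rows) // 3     # several-fold smaller on data like this
print(len(rows), len(packed))
```

Of course a compressed CSV gives up the one thing CSV had going for it: you can no longer open it in any old editor without decompressing first.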
|
# ? Sep 4, 2018 04:02 |
|
boo_radley posted:Enterprise: the language spec This link made my Linux Firefox tab crash.
|
# ? Sep 4, 2018 06:57 |
|
Carbon dioxide posted:This link made my Linux Firefox tab crash. Please try using an Enterprise web browser.
|
# ? Sep 4, 2018 07:45 |
|
redleader posted:Why do you think that's pointless? I've found it fantastically useful to be able to open and gently caress around with CSVs in any old editor. In the context of data science, you're maybe going to want to directly view an infinitesimally small number of these files, to spot check them, or when things go wrong. When that happens, there's no material difference between opening them in emacs or opening them in whatever program supports your data format, and it's likely that your org actually has a tool for doing that kind of diagnosis (and if they don't have one then you should write one because it saves a lot of time when you're dealing with a lot of data volume). There have been plenty of times where I've been loving around with a research project and just used CSVs because that was quicker to implement but those files have no business in an operational environment where performance matters.
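The kind of tiny spot-check tool mentioned above is only a few lines; a hypothetical sketch that peeks at the first few records without loading the whole file:

```python
import csv
import io
from itertools import islice

def head(csv_file, n=5):
    """Return the first n parsed rows of a CSV, reading lazily."""
    return list(islice(csv.reader(csv_file), n))

# io.StringIO stands in for an open file handle here.
sample = io.StringIO("id,score\n1,0.9\n2,0.8\n3,0.7\n")
print(head(sample, 2))  # [['id', 'score'], ['1', '0.9']]
```

The same shape works for any row-oriented format once you swap in the right reader, which is the point: the viewer tool, not the file's textiness, is what makes inspection easy.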
|
# ? Sep 4, 2018 09:08 |
|
uh why would someone write a little JS snippet like this code:
|
# ? Sep 4, 2018 16:06 |
|
Old browsers were really bad and would parse a literal "&lt;script&gt;" inside a script block's string as a nested script tag, which is sort of weird - hence splitting the string up. The text/javascript replacement was either to get around some extremely basic filter/block or a misunderstanding of what was going on with the "scr" + "ipt" thing. Note that this hack hasn't been necessary since IE5.
|
# ? Sep 4, 2018 16:08 |
|
i work in higher ed so yeah we lag behind on that timeline of things
|
# ? Sep 4, 2018 16:22 |
|
canis minor posted:Ok, yes, though you could achieve the same thing as having previous states as snapshots of the DB, just like copying the CSVs . I imagine as well, that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time). canis minor posted:So, it's the matter of transferring datasets, which I guess, yes, there's a point there. I guess, if it ran on the same box, would there still be much difference? (let's assume that data is read-only for the machine learning algorithms, but will indeed change at any point after the algorithms are run)
|
# ? Sep 4, 2018 16:46 |
|
QuarkJets posted:my two cents, he's kind of wrong, CSV is common but HDF5 is common too and way superior, for all reasons except "can be opened with notepad" Will keep that in mind - I'd think that if you want to look at the data, you'd want to perform some operations (aggregate, filter, etc.), which I don't see CSV being suited for - you're essentially using Excel to do stuff to that dataset, so you end up with "SQL", so why not use SQL to begin with. At my place they bought a water-cooled PC to deal with Excel parsing through this CSV, which is just (but again, not something I deal with) JawnV6 posted:Or if for some wild reason the data in the DB was subject to change and the person wanted something reproducible. Like the previous page had someone with this issue: I'm sorry but I don't understand the point you're making - I'd assume that if you wanted something reproducible you'd create a dump, then run the algorithm. Let's say a day passes and the data changes - so you create a dump and run your algorithm again. If you want to get back to the previous day you load the previous dump. You'd have reproducible snapshots, and using a CSV in that scenario isn't better than using a DB - you can still do that using DBs. I don't know if at any point you can do without the previous snapshots, but I'd hope at a certain point your algorithm works and doesn't need to be fine-tuned to the provided data, so you can do without previous snapshots (I'm sure that I'm wrong ). canis minor fucked around with this message at 17:50 on Sep 4, 2018
# ? Sep 4, 2018 17:37 |
|
canis minor posted:I'm sorry but, I don't understand the point you're making - I'd assume that if you'd want something reproducible you'd create a dump, run algorithm. Let's say day passes, data changes - so you create a dump, run your algorithm again. If you want to get back to previous day you load the previous dump. You'd have reproducible snapshots and using a CSV in that scenario isn't better than using a DB - you can still do that using DBs. "I ran my algo yesterday and got an amazing model! Then I ran my training again today and the model is crap. Can you help me debug this?" - This is the guy who's going to tell you CSVs would have saved him, who wasn't sophisticated enough to put things into DB-snapshot terms because frankly that seems like your job.
|
# ? Sep 4, 2018 18:17 |
|
I don't have anything to do with this and have no idea / no say on how it will be set up - I've got bits of information like "we're going to use CSVs, because it's got better performance" and "we're going to track what is happening to the CSVs in the database", and I'm trying to assess how much sense it makes and whether the people that will be doing this know too.
canis minor fucked around with this message at 18:42 on Sep 4, 2018 |
# ? Sep 4, 2018 18:36 |
|
JawnV6 posted:Okay, did the folks you're actually working with bring this use case to your attention, or was it just my shitpost? Do you naturally set up a DB to provide point-in-time snapshots on queries like this? You didn't bother to distinguish between "the data I pulled has changed" and "the data I pulled has not changed, but there is additional data relevant to the query" so I'm not that confident you understand what you're doing over there. so here's an idea: if the data can be represented as a csv, it shouldn't be a problem to model it in the database - you make one table that looks like:

run_id, creation_date, table_name

you make an api "addRun". your addRun method:
1. Makes a new table (possibly named run_table_{guid}) with column names from the csv {if present}
2. Populates the table with the content of all the rows in the csv.
3. Adds an entry to the run table, indicating that the run can be found in table x.

Voila - now you can keep all the results of every run, forever, without jamming the whole loving thing into the database as a string, or storing a billion flat file csvs, and you can retain all results - and you can even do poo poo like run SQL statements on the individual rows too. woo magic. (of course if all the tables have the same columns, well, then actually model the data rather than doing it this way) Bruegels Fuckbooks fucked around with this message at 18:40 on Sep 4, 2018
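The addRun idea sketched above, using in-memory SQLite; table and column names are illustrative:

```python
import csv
import io
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
# Master table: one row per run, pointing at that run's data table.
db.execute("CREATE TABLE runs (run_id TEXT, created TEXT, table_name TEXT)")

def add_run(csv_text: str) -> str:
    """Load one run's CSV into its own table and register it in `runs`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    run_id = uuid.uuid4().hex
    table = f"run_table_{run_id}"                       # step 1: new table
    cols = ", ".join(f'"{c}"' for c in header)
    db.execute(f"CREATE TABLE {table} ({cols})")
    marks = ", ".join("?" for _ in header)              # step 2: load rows
    db.executemany(f"INSERT INTO {table} VALUES ({marks})", data)
    db.execute("INSERT INTO runs VALUES (?, datetime('now'), ?)",
               (run_id, table))                         # step 3: register
    return table

t = add_run("id,score\n1,0.9\n2,0.8\n")
print(db.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0])  # 2
```

Every run's rows stay queryable with SQL, and the `runs` table gives you the audit trail the database was supposedly there for in the first place.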
# ? Sep 4, 2018 18:37 |
|
Again, I'm glad DB folks have words for this, maybe you could speak to the domain expert to tease out these details instead of relying on SA shitposts? Just a thought! I just sorta start with "what if the domain experts aren't total wankers who just want to jerk me around and actually have requirements even if they aren't eloquently stated in my domain's jargon" and go from there.
|
# ? Sep 4, 2018 18:43 |
|
canis minor posted:Will keep that in mind - I'd think that if you want to look at the data, you'd want to perform some operations (aggregate, filter, etc.) which I don't see how CSV is suited for - you're essentially using Excel to do stuff to that dataset, so you end up with "SQL", so why not use SQL to begin with. At my place they bought water cooled PC to deal with Excel parsing through this CSV, which is just (but again, not something I deal with) lol
|
# ? Sep 4, 2018 20:14 |
|
canis minor posted:Ok, yes, though you could achieve the same thing as having previous states as snapshots of the DB, just like copying the CSVs . I imagine as well, that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time). It’s totally reasonable to abstract the model training implementation from the DB or API by using an exported immutable snapshot as its input. I’d personally use JSON lines or any other splittable format, but whatever. Connecting directly to the DB is bad. Keeping snapshot archives as a DB dump is bad. Having splittable input is good. You’ve characterised their response to your objection as hand waving. Perhaps that’s because this stuff is fairly standard good practice, and they didn’t understand where you were coming from.
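The snapshot shape being suggested is simple: JSON Lines is just one JSON object per line, so a big file can be split at any newline and the shards processed independently. A minimal sketch with made-up records:

```python
import json

records = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.8}]

# Export: one self-contained JSON object per line.
snapshot = "\n".join(json.dumps(r) for r in records)

# Import: each line parses on its own, with types intact
# (no CSV-style everything-is-a-string problem).
restored = [json.loads(line) for line in snapshot.splitlines()]
assert restored == records
```

Unlike CSV, there is no quoting/delimiter ambiguity and no separate header to keep in sync with the data rows.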
|
# ? Sep 4, 2018 21:36 |
|
canis minor posted:Ok, yes, though you could achieve the same thing as having previous states as snapshots of the DB, just like copying the CSVs . I imagine as well, that you'd only want such behaviour if you're tweaking your algorithm (unless you're tweaking it all the time). You are always tweaking it. We want to "re-train" ML algorithms periodically on new data, in part because we're detecting behavior and it changes over time. Likewise you may have to re-train a previous version of your model. A lot of ML tools, such as scikit, expect you to just serialize the whole object graph using stuff like Java serialization/Dill and then deserialize it for prod. This can obviously break with library versions or code change. So you might have to rebuild your serialized object graph at some point, and you want the original data around. If you're lucky you have a somewhat implementation-independent format for describing the models and retraining isn't necessarily a thing, but you also have a bunch of transformations before you even feed data into the model, and you may have to retrain when *that* changes too. Edit: to put it another way, it's not handwaving. Versioning ML algorithms involves building a version over the input data, transforms on the data, the training code, and library support code. Hard to version data that you don't snapshot. Edit2: There's also no reason it has to be CSV, except for tooling. We would prefer something like Parquet or Avro, but tooling can be limited so end up with JSON more often than not. ryde fucked around with this message at 21:52 on Sep 4, 2018
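A toy stand-in (deliberately not a real scikit-learn model) for why the serialized-object-graph approach is fragile: the pickle stream records the class's module and name, not its code, so loading it later requires compatible code to still exist.

```python
import pickle

# Hypothetical minimal "model" - real ML objects carry far bigger graphs.
class TinyModel:
    def __init__(self, weights):
        self.weights = weights

    def predict(self, x):
        return sum(w * v for w, v in zip(self.weights, x))

blob = pickle.dumps(TinyModel([0.5, 2.0]))  # what you'd ship to prod
model = pickle.loads(blob)                  # breaks if the class has changed
assert model.predict([2, 1]) == 3.0
```

Rename `weights`, change the constructor signature, or bump a library the class depends on, and old blobs stop loading cleanly - which is exactly why you want the original training data snapshotted so the graph can be rebuilt.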
# ? Sep 4, 2018 21:49 |
|
return0 posted:It’s totally reasonable to abstract the model training implementation from the DB or API by using an exported immutable snapshot as its input. I’d personally use JSON lines or any other splittable format, but whatever. I agree that I don't know enough about it to recognize what is and what isn't best practice, hence the questions. I can have a read about it if you've got resources on why connecting directly to the DB is bad, even if it's local. There were already valid reasons brought up - people wanting to look at the data in question and not knowing SQL, or delay when transferring data over the wire (not applicable here, but still valid in general applications). I'm genuinely curious as to why it is a standard. edit: ryde posted:You are always tweaking it. We want to "re-train" ML algorithms periodically on new data, in part because we're detecting behavior and it changes over time. I see - thank you; this definitely explains the need for maintaining the snapshots, but doesn't explain why it can't be DB snapshots. CSV - ok, it's readable, but where does the performance bit kick in? I can see the point in JSON and the Parquet / Avro you've mentioned to address the issues that were brought up canis minor fucked around with this message at 22:44 on Sep 4, 2018
# ? Sep 4, 2018 22:11 |
|
The performance hit for using CSV is that the files are really big, because they're ASCII text. That's okay if you don't have much data or are training on one computer, but a lot of training is done with supercomputers, and all of the nodes you're using need access to the data, so your performance is dependent on how much data needs to be moved around. Distributed file systems like GPFS and HDFS try to make that process as straightforward as possible for the user, but there are still performance ramifications under the hood.

The performance hit for using SQL is similar but different, depending on implementation. If the nodes all need to query a central database for data, then that is a bottleneck. If you just use the db to create a CSV (or even a more compact format) and then use that, then that's a bottleneck. If you're just using the db to log changes to CSV files then that seems pretty innocuous to me, but I wonder if something like sticking the CSVs into a git repo might be better
|
# ? Sep 4, 2018 23:08 |