|
I have a quasi-religious (i.e. style) question, since I briefly paused today when writing something to consider what would be the most natural way for other people. The background is how to write/compose a set of transformations, e.g. funneling a dict through a series of functions which each take a dict, operate on it, and return a dict.
The reason I paused and wondered is that in e.g. Clojure, I would probably write the whole chain as a single threaded pipeline.
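For a concrete sketch of that kind of dict funnel in plain Python (add_total and drop_b are invented stand-in transforms, not the original code):

```python
from functools import reduce

# Invented stand-in transforms: each takes a dict and returns a new dict.
def add_total(d):
    return {**d, "total": d["a"] + d["b"]}

def drop_b(d):
    return {k: v for k, v in d.items() if k != "b"}

def thread(data, *funcs):
    """Feed data through each function in turn, like Clojure's -> macro."""
    return reduce(lambda acc, f: f(acc), funcs, data)

result = thread({"a": 1, "b": 2}, add_total, drop_b)
```

The reduce-over-functions version reads left to right like Clojure's threading, at the cost of being less idiomatic Python.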
Hollow Talk fucked around with this message at 18:35 on Apr 28, 2020 |
# ? Apr 28, 2020 18:32 |
|
|
Hollow Talk posted: Opinions? Flip off Guido and import hy?
|
# ? Apr 28, 2020 18:44 |
|
It kinda depends pretty heavily on what your actual use case is. In your trivial example I'd almost prefer plain Python with a single list comprehension.
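For illustration, the single-comprehension version of a transform pipeline might look like this (the transforms are invented examples, not the original code):

```python
# Invented stand-in transforms: each takes a row dict and returns a new one.
def normalize(row):
    return {k.lower(): v for k, v in row.items()}

def add_flag(row):
    return {**row, "flagged": row.get("score", 0) > 10}

rows = [{"Score": 12}, {"Score": 3}]

# The whole pipeline as a single comprehension, innermost transform first.
result = [add_flag(normalize(row)) for row in rows]
```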
|
# ? Apr 28, 2020 18:44 |
|
Phobeste posted: It kinda depends pretty heavily on what your actual use case is. In your trivial example I'd almost prefer...

If it was very simple, this. If it wasn't extremely obvious, the first or second, depending on the day of the week.
|
# ? Apr 28, 2020 19:03 |
|
I'm in the "it doesn't matter" camp, but if row were a class, a chain of method calls might be easier to read.
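A rough sketch of that method-chaining idea, with Row and its transforms invented here for illustration:

```python
class Row:
    """Invented example class wrapping a dict so transforms can chain."""

    def __init__(self, data):
        self.data = dict(data)

    # Each transform returns a new Row, so calls read left to right.
    def normalize(self):
        return Row({k.lower(): v for k, v in self.data.items()})

    def add_flag(self):
        return Row({**self.data, "flagged": self.data.get("score", 0) > 10})

row = Row({"Score": 12}).normalize().add_flag()
```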
|
# ? Apr 28, 2020 19:10 |
|
1. With no other facts at hand, the advice is to do what is most idiomatic in Python, and that's list comprehensions.
2. If you're working in an existing code base or in an organization, you should match what is idiomatic in that code base or what the organization's style guides indicate.
3. If no one else is looking at the code except for you, it's tempting to do whatever you want. I'd caution that it's bad to build up habits and preferences at odds with the larger Python universe.
4. Sometimes it's OK to break idioms, conventions, and guides if it makes a big difference in readability for a specific case.
|
# ? Apr 28, 2020 19:24 |
|
Thermopyle posted: do what is most idiomatic in Python and that's list comprehensions.
|
# ? Apr 28, 2020 19:26 |
|
Not saying it's the way to go, but I can't resist suggesting toolz.
There are other functional programming packages in Python, but they all seem too "magic" for me. On the other hand, toolz is elegantly simple. The only exception might be curried functions. SurgicalOntologist fucked around with this message at 19:49 on Apr 28, 2020 |
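toolz's pipe is essentially just left-to-right function application; a stdlib-only approximation (not the real toolz implementation) looks like this:

```python
def pipe(data, *funcs):
    """Roughly what toolz.pipe does: pass data through funcs left to right."""
    for f in funcs:
        data = f(data)
    return data

result = pipe(
    {"a": 1},
    lambda d: {**d, "b": d["a"] + 1},  # add a derived key
    lambda d: sum(d.values()),         # collapse to a number
)
```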
# ? Apr 28, 2020 19:46 |
|
Thanks for all the replies so far. For the record, I tend to use list or generator comprehensions for these things as well, but something in my brain had a "what if" moment earlier. The toolz syntax is nice for data pipelining, but it's another dependency.
|
# ? Apr 28, 2020 20:02 |
|
Personally, I'd prefer a list comprehension in the most abstract sense, where we're just talking about the example code. If this is a series of transforms that is applied repeatedly in the same order, I'd prefer a function. And if you have multiple lists like this that need some set of transforms applied to them in different orders, I'd prefer a class of some kind. Removing all of the context of what the list is and what the transformations are kind of makes the question meaningless: I'm happy to do something a little less idiomatic if it is significantly more comprehensible to the people who have to deal with it in the future (myself included). Or, you know, what Thermopyle said, but less concisely.
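A minimal sketch of the class-of-transforms idea (all names here are invented for illustration):

```python
class TransformPipeline:
    """Invented sketch: a fixed, reusable sequence of transforms."""

    def __init__(self, *funcs):
        self.funcs = funcs

    def __call__(self, item):
        for f in self.funcs:
            item = f(item)
        return item

    def map(self, items):
        return [self(item) for item in items]

# Invented stand-in transforms
def double(d):
    return {**d, "x": d["x"] * 2}

def label(d):
    return {**d, "big": d["x"] > 5}

clean = TransformPipeline(double, label)
out = clean.map([{"x": 2}, {"x": 4}])
```

The same pipeline object can then be applied to any number of lists, which is the case where a class starts to pay for itself.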
|
# ? Apr 28, 2020 21:58 |
|
Count me in for a single list comprehension, like the one Phobeste posted
|
# ? Apr 28, 2020 22:07 |
|
I have what is essentially a ~100gb Python dictionary: 200 million 4kb float32 arrays, each with a string key. I need to provide this to some people who will be using Python to randomly access these. This is a better plan than having them generate the arrays on demand, because they are costly to generate (~half a second to generate a single array). I'm looking for suggestions on the best way to distribute these.

My options seem to be to create a database with a traditional database server and have to host it (either SQL or, more likely, a dedicated key-value store), or some sort of file-based setup (sqlite, hdf5, bdb). I would really prefer the latter from a cost perspective. My goal is to minimize the effort and code required to access the data. Something that lets you create a Python object by handing it a file/folder name and then access it just like a dictionary would be the ideal solution. sqlitedict is available, and dbm/bdb seems to be in the standard library and works this way, but may not be portable. Has anyone done something like this and have good/bad experience to share?
|
# ? Apr 30, 2020 20:44 |
|
EDIT NVM dumb answer
|
# ? Apr 30, 2020 20:57 |
|
OnceIWasAnOstrich posted: I have what is essentially a ~100gb Python dictionary, 200 million 4kb float32 arrays each with a string key. I need to provide this to some people who will be using Python to randomly access these. This is a better plan than having them generate them on-demand because they are costly to generate (~half a second to generate a single array).

Would pickle or shelve work for you? I haven't used them for that much data... Both are in the standard library. Of course, the users of your data have to trust you, since pickle (and shelve, since it uses pickle under the hood) can run arbitrary code when unpickling.
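A minimal shelve sketch, with a plain list standing in for a float32 array (the path and key are invented):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "arrays_demo")

# Writing: shelve pickles each value into an on-disk dbm database.
with shelve.open(path) as db:
    db["key1"] = [0.5, 1.5, 2.5]  # stand-in for a 4kb float32 array

# Reading, possibly much later or in another process: dict-style access.
with shelve.open(path, flag="r") as db:
    values = db["key1"]
```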
|
# ? Apr 30, 2020 20:58 |
|
Yeah, I would go with the latter option: if there is a de-facto modern Python k-v store format, use that. Otherwise pick your favorite that is well supported; I think sqlite would be, at least. If you can give them an example Python program, it should be OK (sqlite3 is in the standard library, so there's nothing to pip install). You don't want to host a db server, and they don't want to use your db server either.
|
# ? Apr 30, 2020 21:09 |
|
Thermopyle posted: Would pickle or shelve work for you? I haven't used them for that much data...

Don't you have to load an entire pickle into memory? Since they don't know what's going to be loaded ahead of time, he likely can't chunk it, leaving him trying to load 100GB into memory.
|
# ? Apr 30, 2020 21:15 |
|
Thermopyle posted: Would pickle or shelve work for you? I haven't used them for that much data...

Pickle alone doesn't solve my problem, but shelve might. It seems to sometimes be backed by a Python dict, but it has options to act basically as a wrapper around dbm, which is definitely on-disk. I'll need to check whether the underlying files are sufficiently portable, but that does solve the issue of not needing to distribute extra code or install libraries. Since everything is effectively already bytes, or can be interpreted as bytes, I suppose I could use the dbm module directly and avoid the pickle security issues. Those won't be a big deal in this instance, because they are already trusting me for more than that. I vaguely remember having technical or portability issues with that module years ago, but that was back in the 2.7 days.
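Using dbm directly with raw bytes might look like this; the stdlib array module stands in for numpy here, and the path and key are invented:

```python
import array
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "arrays_db")

# Stand-in for a numpy float32 array: the stdlib array module.
values = array.array("f", [0.5, 1.5, 2.5])

# Store the raw bytes under a string key -- no pickle anywhere.
with dbm.open(path, "c") as db:
    db["key1"] = values.tobytes()

# Read the bytes back and reinterpret them as float32s.
with dbm.open(path, "r") as db:
    restored = array.array("f")
    restored.frombytes(db["key1"])
```

With numpy, the read side would be `np.frombuffer(db[key], dtype=np.float32)` instead; either way the pickle layer drops out entirely.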
|
# ? Apr 30, 2020 22:20 |
|
I am not a fan of using shelve for data quantities approaching "massive"; by that point I just use HDF5. Store the arrays as datasets, using their dict key as the dataset name. There, you're done; this should take fewer than 10 lines with a for loop. If you want to split the data across multiple files, that's just a flag passed to h5py's File constructor. Users fetch the list of dataset names from the file, which is equivalent to fetching a list of keys from the dictionary. Then they load the dataset with square-bracket syntax, just like a dict. You can enable gzip compression on the datasets with literally one line, if that's desirable, and users don't have to do anything extra to read the data; HDF5 performs the decompression automatically.
|
# ? May 1, 2020 00:00 |
|
QuarkJets posted: I am not a fan of using shelve for data quantities approaching "massive" amounts, by that point I just use HDF5.

I was curious about this when I was looking into HDF5. The "dataset" terminology made me worry it would choke when I made hundreds of millions of datasets.

edit: It seems like there is a practical limit somewhere below a million datasets in one group: https://forum.hdfgroup.org/t/limit-on-the-number-of-datasets-in-one-group/5892 I could split up my items/datasets into groups, maybe by prefix, and write a little wrapper script to hide that without too much effort. I'm also a little skeptical of shelve being able to handle that many keys without choking, unless it happens to use a really scalable search-tree setup. OnceIWasAnOstrich fucked around with this message at 00:25 on May 1, 2020 |
# ? May 1, 2020 00:16 |
|
Drat, I did not realize that HDF5 had trouble with millions of datasets in a group. A simple workaround might be to split the datasets across arbitrarily-named groups (say 10k datasets per group), and then, since you know your own hierarchy, you can create a pair of datasets that provide the mapping from keys to dataset paths. Sort of like this:

Data/                    # Group
Data/DataGroup1/         # Group
Data/DataGroup1/key_1    # Dataset
Data/DataGroup1/key_2    # Dataset
...
Data/DataGroupM/         # Group
Data/DataGroupM/key_N    # Dataset

etc., replacing key_N with the actual key names. And then you'd have a pair of index datasets that look like this:

keys:  key_1, key_2, ..., key_N
paths: Data/DataGroup1/key_1, Data/DataGroup1/key_2, ..., Data/DataGroupM/key_N

Now you fetch the list of keys from the keys dataset and look up each key's dataset path in paths.
QuarkJets fucked around with this message at 01:45 on May 1, 2020 |
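A small h5py sketch of that grouping-plus-index layout (sizes shrunk and names invented for illustration; assumes h5py and numpy are installed):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
keys = [f"key_{i}" for i in range(25)]
GROUP_SIZE = 10  # cap datasets per group to stay below the practical limit

with h5py.File(path, "w") as f:
    paths = []
    for i, key in enumerate(keys):
        dset_path = f"Data/DataGroup{i // GROUP_SIZE}/{key}"
        f.create_dataset(dset_path, data=np.arange(4, dtype=np.float32))
        paths.append(dset_path)
    # The key -> dataset-path index, stored as two parallel datasets.
    f.create_dataset("keys", data=[k.encode() for k in keys])
    f.create_dataset("paths", data=[p.encode() for p in paths])

# A reader rebuilds the mapping once, then uses dict-style access.
with h5py.File(path, "r") as f:
    index = dict(zip(f["keys"][...], f["paths"][...]))
    arr = f[index[b"key_3"].decode()][...]
```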
# ? May 1, 2020 01:42 |
|
OnceIWasAnOstrich posted: I have what is essentially a ~100gb Python dictionary, 200 million 4kb float32 arrays each with a string key. I need to provide this to some people who will be using Python to randomly access these. This is a better plan than having them generate them on-demand because they are costly to generate (~half a second to generate a single array).

Use a filesystem, one array per file. If there's only one level of hierarchy, let the kernel deal with it / use directories; store an index somewhere if you must. Or hdf5. Pickling is bad and dumb; don't use it unless you have to, and densely packed 4kb float arrays are not such a case. Malcolm XML fucked around with this message at 02:27 on May 1, 2020 |
# ? May 1, 2020 02:15 |
|
pickling is fine if you're using it correctly. I don't think I'd use it for this just because it's so big.
|
# ? May 1, 2020 02:41 |
|
Yeah, pickling is fine. I wind up using HDF5 even for small things just because I'm so familiar with it; it's becoming pretty widespread in astronomy circles. Plenty of greybeards still using FITS, though, which was probably a pretty huge breakthrough back in the 1980s.
|
# ? May 1, 2020 05:08 |
|
Thermopyle posted: pickling is fine if you're using it correctly.

The set of valid cases for pickling that don't lead to tears is vanishingly small. Unless you need to serialize _arbitrary_ Python objects, there are usually better, faster, smaller ways that don't restrict you to only Python.
|
# ? May 1, 2020 06:28 |
|
Hollow Talk posted: I have a quasi-religious (i.e. style) question, since I briefly paused today when writing something to consider what would be the most natural way for other people. Background is how to write/compose a set of transformations, e.g. funneling a dict through a series of functions which each take a dict, operate on it, and return a dict.
|
# ? May 1, 2020 12:36 |
|
Malcolm XML posted: the valid cases for pickling that don't lead to tears is vanishingly small

This is not a vanishingly small use case, it's an extremely common one, especially in data science. E.g.:
- using your trained ML model in production
- having a DataFrame containing Python objects that you want to export
- very quickly loading/writing a few GB of mixed-dtype table
The last one can be accomplished using other formats, though somewhat slower. CarForumPoster fucked around with this message at 13:32 on May 1, 2020 |
# ? May 1, 2020 13:27 |
|
Malcolm XML posted: the valid cases for pickling that don't lead to tears is vanishingly small

Really, there are technically better ways to do almost everything that pickle does. But surprisingly often, the betterness of those other ways is trumped by the fact that pickle is in the standard library and takes almost zero time to implement. You should default to the easy, low-cost way and re-evaluate when it becomes apparent you need something more. You've lost almost nothing and quite possibly saved yourself a lot. Of course, this requires knowing what pickle does and its shortcomings. There's a very wide gray area where it just might work fine.
|
# ? May 1, 2020 15:55 |
|
Every one of those cases has awful gotchas if pushed beyond one-offs, and after fixing many cases of pickling gone wrong: if you can't spare 10 seconds to find a better domain-specific solution, that's about the only case to use it. Lmao at deploying pickles; that's another case where it'll explode in tears (and it's bad engineering too, since it can easily break when Python, your own code, or library code changes). It's dog slow, breaks on common object structures, and is a giant security hole if you load untrusted pickles, since the pickle VM allows arbitrary code execution by default. Arrow/Parquet is far better for data frames. There's a lot of crap in the Python stdlib.
|
# ? May 1, 2020 17:05 |
|
Malcolm XML posted: Every one of those cases has awful gotchas if pushed beyond one-offs, and after fixing many cases of pickling gone wrong: if you can't spare 10 seconds to find a better domain-specific solution, that's about the only case to use it
|
# ? May 1, 2020 17:17 |
|
One-offs are surprisingly common. I certainly wouldn't advocate using it for anything other than that. One-offs like distributing a data structure to a handful of known systems are a perfect use case for pickle. Claiming that pickling is bad and dumb isn't really the right way to say it, if only because it's obviously wrong depending on the use case. Claiming that pickling is bad and dumb for X, Y, and Z is a much more defensible position. The standard library is an awful mess. However, adding dependencies is quite often worse. A big problem with pickle is that it's so easy to use that 9 times out of 10 that it gets used, it's the inappropriate tool. This colors everyone's perception of the thing. Thermopyle fucked around with this message at 18:48 on May 1, 2020 |
# ? May 1, 2020 18:39 |
|
Thermopyle posted:
This is exactly why it's bad and dumb: if it has to have a giant red box in the docs calling out its insecurity, it probably shouldn't be shipped in the stdlib. It's a giant footgun. Sure, there's a very tiny use case for it, but 99% of the time use something else. If 99% of the time it's bad and dumb, I think it's completely worth calling it bad and dumb in general. If you are advanced enough to know how the pickle VM and protocol work, and where they don't, you can make up your own mind. Most people do not.

Good use case: "I want to rehydrate some complex object state that isn't accidentally circular, for something exploratory, and that I don't rely on to work in exactly the same environment as when I saved it." It's exactly the same as Java serialization: more problems than benefits.

quote: One-offs like distributing a data structure to a handful of known systems are a perfect use case for pickle.

This is exactly when you run into some of the dumb fuckin' issues of compatibility and have to use a dependency like dill to debug it.

Anyway, for this guy's use of it in production: numpy will wrap arrays into pickle using save, but just use parquet and a filesystem arrangement. You're then not locked into Python if you need to transfer the data elsewhere, and you don't have to discover __reduce_ex__. Also you get statistics and various computations for free, plus compression. Malcolm XML fucked around with this message at 21:47 on May 1, 2020 |
# ? May 1, 2020 21:44 |
|
Malcolm XML posted: This is exactly why it's bad and dumb, e.g., if it has to have a giant red box in the docs calling out insecurity it probably shouldn't be shipped in the stdlib.

I think that stupid red box is a consequence of the rest of the pickle docs being bad, not a consequence of something bad being included in the stdlib. Anyway...

Malcolm XML posted: I think it's completely worth calling it bad and dumb in general.

A more accurate description would be "don't use it unless you're very sure of the consequences". I think we'll just have to disagree on this one.
|
# ? May 1, 2020 23:35 |
|
What should and shouldn't be in the std lib is a subjective can of worms. Python (initially, at least) positioned itself as a batteries-included language, and has a deliberately broad std lib (which is now in varying states of quality, maintenance, and API style). This happens to be one of the sharpest tools in the lib.
|
# ? May 1, 2020 23:38 |
|
Malcolm XML posted: Anyway, for this guy's use of it in production: numpy will wrap arrays into pickle using save, but just use parquet and a filesystem arrangement. You're then not locked into Python if you need to transfer the data elsewhere, and you don't have to discover __reduce_ex__. Also you get statistics and various computations for free, plus compression.
|
# ? May 2, 2020 00:00 |
|
Thermopyle posted: I think that stupid red box is a consequence of the rest of the pickle docs being bad and not a consequence of something bad being included in the stdlib.

pickle is inherently designed insecurely, and it's not at all obvious that you are running an unrestricted bytecode interpreter to reconstruct the object.

Dominoes posted: What should and shouldn't be in the std lib is a subjective can of worms. ... This happens to be one of the sharpest tools in the lib.

I think we can all agree that a tool that's virtually impossible to use safely and correctly is not a great choice for the stdlib, which gives it a seal of quality it doesn't deserve. It should've been removed in Python 3 and made an optional lib, so that you have to opt in very explicitly. https://www.python.org/dev/peps/pep-0594/ cannot get here quickly enough, though. In this case the battery is leaking acid all over the place.

Yeah, parquet is a beast, but it's the common format for easy analytics data transfer (other than hdf5 or .mat, lol). It's also ~big data~ compliant, so you can throw spark/presto/hadoop/etc. at it without any real work.
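The "unrestricted interpreter" point is easy to demonstrate: __reduce__ lets a pickle name any callable to be invoked at load time (eval here is a benign stand-in for something like os.system):

```python
import pickle

class Innocent:
    # __reduce__ tells pickle how to rebuild an object: any callable + args.
    # A hostile pickle can therefore run arbitrary code at *load* time.
    def __reduce__(self):
        return (eval, ("6 * 7",))  # could just as easily be os.system(...)

payload = pickle.dumps(Innocent())
result = pickle.loads(payload)  # evaluates the expression during unpickling
```

Nothing about `payload` looks suspicious to the loader, which is why untrusted pickles are unsalvageable by design.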
|
# ? May 2, 2020 00:03 |
|
Malcolm XML posted: It should've been removed in Python 3 and made an optional lib, so that you have to opt in very explicitly.
|
# ? May 2, 2020 00:09 |
|
I've played around a bit with using whatever filesystem someone happens to be running and storing individual binary files with raw bytes. Putting all 200M files in a single directory is a no-go; it causes everything I tried (ext4, xfs, btrfs) to blow up. Apparently even the B-trees can't handle that. It works reasonably well if I create a directory hierarchy per character, since I have a max of 36 possibilities [A-Z0-9] for each character and a max of 10 characters, so the hierarchy doesn't end up too deep or with too many files in a single folder, just an insane number of folders, but well within the capabilities of most filesystems.

This makes copying it an exercise in "you better loving have an SSD". It's probably not so bad once I manage to create a tar, and the writing shouldn't be too random. I guess if I just made it its own FS, copied it at a block level, and distributed an image, that would solve the random I/O problem but add in requiring people to deal with a filesystem image. This is basically the same thing I did with HDF5 (and a recursive version of what QuarkJets suggested) but with the folders replaced by HDF5 groups. HDF5 has the benefit of being a single file object I can read/write with sequential I/O. I guess this is kind of the same thing as a filesystem image, except that the code to deal with it is in the HDF5 library instead of the kernel.

For practical purposes, this is going to be used by a dozen or fewer of my students for research purposes for now, so I can make them do whatever I want as long as I teach them how to do it. This is all making putting it all in a real database (and maybe creating a database dump) and a docker image more and more appealing. OnceIWasAnOstrich fucked around with this message at 01:35 on May 2, 2020 |
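That per-character sharding can be captured in a tiny helper (key_to_path and the .bin suffix are invented for illustration):

```python
import os

def key_to_path(root, key, depth=3):
    """Invented helper: shard [A-Z0-9] keys into per-character directories,
    e.g. 'ABC123' -> root/A/B/C/ABC123.bin, keeping each directory small."""
    return os.path.join(root, *key[:depth], key + ".bin")

p = key_to_path("store", "ABC123")
```

A depth of 3 caps any one directory at 36^1 subdirectories and spreads 200M keys across up to 36^3 ≈ 47k leaf directories.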
# ? May 2, 2020 01:31 |
|
Anyone know how the subprocess library works? I'm doing some latency-measurement stuff across two computers, running a shell command on my receiver through Python's subprocess library, and I've noticed that if I run the command in an actual terminal, the latency is a more-or-less stable oscillation like I'm expecting, but running it from Python as a subprocess causes my measured latency to slowly increase toward infinity. I suspect it has something to do with the subprocess receiving fewer CPU resources, but I am not certain.
|
# ? May 5, 2020 19:47 |
|
It should have the same resources available to it; it's just another process. Does the script generate a lot of output to either stderr or stdout? Is that perhaps getting buffered in the Python process?
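One way to rule out pipe-buffer stalls is to consume the child's stdout as it is produced; a minimal sketch (the echoing child here is a stand-in for the real command):

```python
import subprocess
import sys

# Stand-in child process that writes to stdout; the real command would
# be the receiver-side shell command being timed.
child = [sys.executable, "-u", "-c", "print('tick'); print('tock')"]

proc = subprocess.Popen(child, stdout=subprocess.PIPE, text=True)

# Read line by line as output arrives, so the OS pipe buffer never fills
# up and blocks the child (a full, unread pipe will stall the writer).
lines = [line.strip() for line in proc.stdout]
proc.wait()
```

If the child writes faster than the parent reads, the pipe fills, the child's writes block, and any timing it does drifts, which would match the slowly growing latency described above.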
|
# ? May 5, 2020 20:54 |
|
|
Yeah, what's Python being used for here? Just launching a subprocess and nothing else? Also, there are many ways to create a subprocess; which are you using, and with what options?
|
# ? May 6, 2020 00:46 |