|
I am completely baffled by this thing I have just discovered. I am loving around with making NNTP proxies (yeah yeah, don't start) and I'm modifying an old package called Papercut which came with a buggy mostly-formed proxy hidden in there as a plugin. The basic idea is that this simply uses nntplib to re-create the requests the server receives and then reformat them back into the raw NNTP packets and send them back. I ran into an issue where articles would become slightly corrupted while doing this. I was making nntplib.NNTP.article() requests. This function returns a 4-tuple, where the 4th item is a list containing the lines in the article. I started comparing this list of bytestrings between raw nntplib requests to the underlying NNTP server vs my proxied server, and found a few lines in a few articles that were subtly different. In particular, if the line bytestring began with the two bytes 2e2e (hex format, 2e==".") both the proxy and the raw NNTP query would receive the line properly, but after being re-sent through the proxy you would lose one of those bytes, making the total string one byte shorter with no other changes. After testing this on a large number of articles I have not found any other corruption at all. Why on earth is the SocketServer.StreamRequestHandler or the nntplib slightly mangling my data when there is a newline followed by two periods? I threw in a loop to add an additional period to any strings starting with two periods...and it worked, no more corruption. Code to pull data and reformat into one long bytestring Python code:
RFC977 posted:Text is sent only after a numeric status response line has been sent. I guess this answers why I have to do this. The nntplib client has stripped the extra periods from the lines, and I need to regenerate them before re-sending. I guess I don't need to post this, but…
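Since the post's original code didn't survive the archive, here is a minimal sketch of the fix described above: re-applying the RFC 977 "dot-stuffing" that nntplib strips on receipt. The function names are mine.

```python
def dot_stuff(lines):
    """Re-apply RFC 977 dot-stuffing before sending lines back over the
    wire. nntplib removes the escape dot on receipt, so any line that
    still starts with b'.' needs one added back."""
    return [b"." + line if line.startswith(b".") else line for line in lines]


def serialize_article(lines):
    """Join the stuffed lines with CRLF and append the terminating lone-dot
    line that marks the end of an NNTP text response."""
    return b"\r\n".join(dot_stuff(lines)) + b"\r\n.\r\n"
```

So a line that arrived from nntplib as `b'.hidden'` goes back out as `b'..hidden'`, and the client on the other end strips it down to `.hidden` again.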
|
# ¿ Apr 8, 2017 01:02 |
|
|
Ghost of Reagan Past posted:What's a good library for drawing an array of pixels? They would change over time (I'm not animating figures, though), and I plan on eventually feeding into some LEDs, but for now I'd rather prototype without wiring up a bunch of LEDs. I've thought of just doing it with Flask + Javascript but I'm wondering if there's some other good solution. numpy for the array and matplotlib imshow() to render?
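A minimal sketch of that suggestion: numpy holds the "LED" array, imshow() renders it. The Agg backend and helper name are my own choices for a headless prototype; for live updating you would keep the AxesImage around and call `set_data()` on it in a loop.

```python
import io

import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; swap for an interactive one when prototyping
import matplotlib.pyplot as plt


def render_frame(pixels):
    """Render an (H, W, 3) uint8 array as an image and return PNG bytes."""
    fig, ax = plt.subplots()
    ax.imshow(pixels, interpolation="nearest")  # 'nearest' keeps LED pixels blocky
    ax.axis("off")
    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()


# A toy 8x8 "LED grid" frame
frame = np.zeros((8, 8, 3), dtype=np.uint8)
frame[::2, ::2] = [255, 0, 0]  # light every other LED red
png = render_frame(frame)
```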
|
# ¿ Apr 8, 2017 01:44 |
|
Data Graham posted:
Yes. Longer answer: Yes. If you want to do something like multithreading and it uses any significant amount of CPU time in the interpreter itself, you simply cannot do it without more processes. If you are doing something that involves a lot of IO, theoretically you can use threads, but it is still a huge pain in the rear end to actually work with the GIL, and you may be better off using greenlets or eventlets or something where someone did the hard work of making sure the IO could actually run concurrently, and where the patterns can make it easier to avoid dumping something blocking into the coroutine. If you are rate-limited by your parsing and not the IO itself, none of those will help though. Maybe some of the Python3 async stuff might be simpler for this purpose but I haven't gotten a chance to mess with that enough. OnceIWasAnOstrich fucked around with this message at 13:37 on Aug 1, 2017 |
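For the IO-bound case, the stdlib's ThreadPoolExecutor is often enough without reaching for greenlets. A toy sketch, where `time.sleep` stands in for blocking network IO (it releases the GIL the same way a blocking socket read does):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for an IO-bound call; sleep releases the GIL like socket IO."""
    time.sleep(0.1)
    return f"payload from {url}"


urls = [f"https://example.com/{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start
# With 8 workers the 8 sleeps overlap, so this finishes in roughly 0.1s, not 0.8s
```

If `fetch` instead burned CPU inside the interpreter, the threads would serialize on the GIL and you would see no speedup, which is the "more processes" half of the answer.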
# ¿ Aug 1, 2017 13:28 |
|
Ugh asyncio. I felt the need to build a nice fast socket-based server app and decided to do it "correctly". Started using asyncio and man is this a mess. Why are both generator coroutines and async functions a thing? Why are there multiple keywords that all do something very slightly different, with seemingly arbitrary restrictions on which types of functions they work on? I do actually know the answer to these questions. I'm just annoyed because the mix of info and examples I find even in the 3.6 documentation is baffling to use as an intro, because it is not at all obvious which type of coroutines different bits of the asyncio library work with, at least at first. Trying to figure out whether/how you can schedule? run? new async def functions inside an event loop which is already running a run_until_complete streaming server is weirdly difficult when examples are using two different APIs and two different forms of coroutines and older info uses newer terminology differently.
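For what it's worth, one way to schedule a new `async def` inside an already-running loop is `asyncio.ensure_future` (or `create_task` on 3.7+). A hedged sketch, using the modern `asyncio.run` entry point rather than the 3.6-era `loop.run_until_complete`:

```python
import asyncio


async def background_job(n):
    """Stand-in for work kicked off while the server keeps running."""
    await asyncio.sleep(0.01)
    return n * 2


async def main():
    # Inside a running loop, ensure_future (create_task on 3.7+) schedules
    # a new coroutine without blocking the current one.
    task = asyncio.ensure_future(background_job(21))
    # ...the server would keep serving connections here...
    return await task


result = asyncio.run(main())
```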
|
# ¿ Oct 10, 2017 15:42 |
|
Jose Cuervo posted:Question: Is there a standard / simple way in the Notebook to ensure that everyone uses the same version of Python, and the same version of the packages being imported? Run a JupyterHub server and set everyone to use one consistent environment?
|
# ¿ Sep 27, 2018 00:13 |
|
A triangular distribution doesn't sound like it would effectively approximate what really sounds like some sort of bimodal distribution. Maybe model it using two or more distributions (one for high usage, one for low usage) and a latent variable describing which state you are in. You could then either sequentially draw X days' worth of numbers from your current state, where X is Poisson or binomial or something, then another random number of days from the other state, and keep switching. Alternatively you draw one day at a time and also draw a Bernoulli at some low probability that tells you whether to switch states.
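A toy simulation of the one-day-at-a-time variant; the state distributions and switch probability are made-up placeholders.

```python
import random


def simulate_usage(days, p_switch=0.05, high=(50.0, 10.0), low=(5.0, 2.0)):
    """Draw one day at a time from a two-state (high/low usage) model,
    flipping state via a low-probability Bernoulli draw each day.
    high/low are (mean, stddev) pairs for a Gaussian in each state."""
    state = "low"
    series = []
    for _ in range(days):
        mu, sigma = high if state == "high" else low
        series.append(max(0.0, random.gauss(mu, sigma)))  # clamp at zero usage
        if random.random() < p_switch:  # Bernoulli switch draw
            state = "low" if state == "high" else "high"
    return series


random.seed(0)
usage = simulate_usage(365)
```

The expected run length in each state is 1/p_switch days, so p_switch is the one knob you fit to how long the high- and low-usage stretches actually last.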
|
# ¿ Oct 29, 2018 19:54 |
|
There are some situations where it will be very handy, and a whole bunch of examples that make me angry to look at.
|
# ¿ Mar 27, 2019 02:13 |
|
the yeti posted:Are there any libs that can actually create encrypted zips? Seems like zipfile and others are explicitly decrypt only (due to...licensing? Technical hurdles?) I think libarchive can do that and there is a library with ctypes bindings to libarchive, though I don't know that those function bindings are implemented. Is there a reason you can't call out to a binary for it?
|
# ¿ Jul 25, 2019 02:11 |
|
Python code:
Edit: Doing this also exposes the weird Python float magic. Python code:
OnceIWasAnOstrich fucked around with this message at 16:58 on Sep 24, 2019 |
# ¿ Sep 24, 2019 16:37 |
|
pmchem posted:can confirm this is often how science is done Had a collaboration where we had a pipeline that generated a ton of numerical data tables and were trying to standardize storage onto some Apache Arrow based format, either Feather or Parquet. This was vetoed by some PIs because they wanted "something standard that we can open up in Excel when you all leave". Massive CSV files it is.
|
# ¿ Jan 17, 2020 17:05 |
|
Is it possible to use Jupyter on a machine where the home folder is an NFS/SMB-mounted network share? It seems to extensively use sqlite databases for random things, which causes issues doing something as simple as running an ipython/jupyter console when the home folder is network-mapped.
|
# ¿ Feb 13, 2020 21:56 |
|
I have what is essentially a ~100gb Python dictionary, 200 million 4kb float32 arrays each with a string key. I need to provide this to some people who will be using Python to randomly access these. This is a better plan than having them generate them on-demand because they are costly to generate (~half a second to generate a single array). I'm looking for suggestions on the best way to distribute these. My options seem to be to create a database with a traditional database server and have to host it (either SQL or more likely a dedicated key-value store) or some sort of file-based setup (sqlite, hdf5, bdb). I would really prefer the latter from a cost perspective. My goal is to minimize the effort and code required to access the data. Something that lets you just create a Python object by handing it a file/folder name and access it just like a dictionary would be the ideal solution. sqlitedict is available, and gdbm/bdb seems to be in the standard library and works this way, but may not be portable. Has anyone done something like this and have good/bad experience to share?
|
# ¿ Apr 30, 2020 20:44 |
|
Thermopyle posted:Would pickle or shelve work for you? I haven't used them for that much data... Pickle alone doesn't solve my problem, but shelve might. It seems to sometimes be backed by a Python dict, but has options to basically act as a wrapper around dbm which is definitely on-disk. I'll need to check if the underlying files are sufficiently-portable, but that does solve the not needing to distribute any extra code or install libraries issue. Since everything is effectively already bytes or can be interpreted as bytes I suppose I could use the dbm module directly and avoid pickle security issues. Those won't be a big deal in this instance because they are already trusting me for more than that. I vaguely remember having technical or portability issues with that module years ago but that was back in the 2.7 days.
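A sketch of the pickle-free dbm route, with a made-up key and payload. Note that which backend `dbm.open` picks (gdbm, ndbm, or the pure-Python dumb fallback) is platform-dependent, which is exactly the portability worry mentioned above.

```python
import dbm
import os
import tempfile

# Store raw bytes values under string keys with the stdlib dbm module,
# skipping pickle entirely since the values are already bytes.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "arrays.db")

with dbm.open(path, "c") as db:          # "c" creates the file if needed
    db["chr1:12345"] = b"\x00\x01\x02\x03"  # hypothetical key and payload

with dbm.open(path, "r") as db:          # keys can be str or bytes
    payload = db[b"chr1:12345"]
```

`dbm.whichdb(path)` will tell you which backend actually wrote a given file, which is worth checking before shipping files to other machines.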
|
# ¿ Apr 30, 2020 22:20 |
|
QuarkJets posted:I am not a fan of using shelve for data quantities approaching "massive" amounts, by that point I just use HDF5. I was curious about this when I was looking into HDF5. The "dataset" terminology made me worry it would choke when I made hundreds of millions of datasets. edit: It seems like there is a practical limit somewhere below a million datasets: https://forum.hdfgroup.org/t/limit-on-the-number-of-datasets-in-one-group/5892 I could split up my items/datasets into groups, maybe by prefix, and write a little wrapper script to hide that without too much effort. I'm also a little skeptical of shelve being able to handle that many keys without choking, unless it happens to use a really scalable search tree setup. OnceIWasAnOstrich fucked around with this message at 00:25 on May 1, 2020 |
# ¿ May 1, 2020 00:16 |
|
I've played around a bit with using whatever filesystem someone happens to be running to handle this and storing individual binary files with raw bytes. Putting all 200m in a single directory is a no-go, it causes everything I tried (ext4, xfs, btrfs) to blow up. Apparently even the B-trees can't handle that. It works reasonably well if I create a directory hierarchy per-character since I have a max of 36 possibilities [A-Z0-9] for each character and a max of 10 characters so the hierarchy doesn't end up too deep or with too many files in a single folder, just an insane amount of folders but well within the capabilities of most filesystems. This makes copying this an exercise in "you better loving have an SSD". It's probably not so bad once I manage to create a tar and the writing shouldn't be too random. I guess if I just made it its own FS and copied it at a block level and distributed an image that would solve the random I/O problem but add in requiring people to deal with a filesystem image. This is basically the same thing I did with HDF5 (and a recursive version of what QuarkJets suggested) but with the folders replaced by HDF5 groups. HDF5 has the benefit of being a solid file object I can read/write with sequential I/O. I guess this is kind of the same thing as a filesystem image except that the code to deal with it is in the HDF5 library instead of the kernel. For practical purposes, this is going to be used by a dozen or less of my students for research purposes for now, so I can make them do whatever I want as long as I teach them how to do it. This is all making putting this all in a real database (and maybe creating a database dump) and a docker image more and more appealing. OnceIWasAnOstrich fucked around with this message at 01:35 on May 2, 2020 |
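The per-character directory hierarchy described above can be a one-liner. A sketch with a hypothetical root and key; the `.bin` suffix is my own convention.

```python
import os


def key_to_path(root, key):
    """Map a key like 'AB12C' to root/A/B/1/2/C/AB12C.bin, one directory
    level per character, so no single directory holds millions of files."""
    return os.path.join(root, *key, key + ".bin")


path = key_to_path("/data/arrays", "AB12C")
```

With 36 possible characters per level, each directory holds at most 36 subdirectories plus whatever files bottom out there, which stays comfortably inside what ext4/xfs/btrfs handle well.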
# ¿ May 2, 2020 01:31 |
|
Bad Munki posted:e: Yeah, it's the darnedest thing. If I sit here with my return key mashed, the whole thing functions exactly as intended, albeit with a few blank lines from the server console because I'm constantly entering blank commands, with the occasional actual command output interspersed as my proxy sends the command but is unable to execute it. The fix here is, obviously, to leave a paperweight on my return key indefinitely. I think you need to set bufsize=1 and universal_newlines=True. edit: Although maybe not? I would have thought a manual flush() would also have done the trick.
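What that suggestion looks like in code. The child here is a throwaway Python process; note that the child side also has to flush (hence `-u`), which is usually the real source of the "output only arrives when I mash return" symptom.

```python
import subprocess
import sys

# bufsize=1 plus universal_newlines=True puts the parent's pipe in
# line-buffered text mode, so each line arrives as the child emits it
# instead of sitting in a block buffer until it fills.
child = subprocess.Popen(
    [sys.executable, "-u", "-c", "print('ready'); print('done')"],
    stdout=subprocess.PIPE,
    bufsize=1,
    universal_newlines=True,
)
first = child.stdout.readline().strip()  # available immediately
rest = child.communicate()[0]
```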
|
# ¿ Jun 15, 2020 18:27 |
|
abelwingnut posted:i have, effectively: Whether or not this is the best way to do this, you can do: Python code:
|
# ¿ Sep 23, 2020 23:35 |
|
dragon enthusiast posted:Wouldn't using *args require that foo and bar be named and assigned to outside of the function? You can just iterate over the args list and access the property from the list item without knowing its original variable name, assuming I'm understanding what you are very non-specifically trying to do. Python code:
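A minimal illustration of that point, with a hypothetical Result class standing in for whatever the objects actually are:

```python
class Result:
    """Hypothetical object with a .value attribute."""
    def __init__(self, value):
        self.value = value


def collect_values(*args):
    # The callers' variable names (foo, bar, ...) are irrelevant inside
    # the function; args is just a tuple we can iterate over.
    return [item.value for item in args]


foo = Result(1)
bar = Result(2)
values = collect_values(foo, bar)
```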
OnceIWasAnOstrich fucked around with this message at 18:28 on Oct 2, 2020 |
# ¿ Oct 2, 2020 18:08 |
|
Dominoes posted:One minor, tangental* step: I'm going to move the python process manager I wrote towards only supporting the latest python minor version. I've been neglecting this project for want of spare time, but this change may simplify the codebase and API. Python 3 is now old enough that scientific libraries have started accumulating dumb poo poo that only works with 3.5 but not newer because of dependency hell. I'm thinking of the difficulty of getting PyQT4 working with 3.6+ to use with visualization libraries that only work with PyQT4 but not PyQT5. With the exception of legacy C++ codebases like that, there isn't much reason that I've run into.
|
# ¿ Oct 10, 2020 05:31 |
|
Mirconium posted:Hello thread, my old friend Consider taking a look at FAISS https://github.com/facebookresearch/faiss. It can do a lot of things but the core of it is extremely fast kNN searches on extremely large datasets, with a variety of strategies for speeding things up on hideously big datasets. 20k is honestly pretty small (exhaustive search gets infeasible around 1M vectors with that code) so you can just use the flat index for exact results and not rely on the index tricks for approximate results. Since you refer to scipy's distance function I'm assuming you are using something normal like Euclidean distance, so it probably includes a C-implementation of what you are using already. edit: That said, 20k is not a lot of vectors at all and I've used the sklearn BallTree with no problems for both Euclidean and Cosine distance metrics on similarly sized datasets. The scipy pdist() function only takes about 10 seconds on my machine to compute the full distance matrix for 20k vectors of 100 dimensions each. Don't give it a callable metric, that will ruin things with Python overhead. If you can't use any of the included metrics you'll need to construct your method with not-Python in some way. OnceIWasAnOstrich fucked around with this message at 15:45 on Oct 15, 2020 |
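For the exact-search case, even plain numpy can do a vectorized brute-force kNN without FAISS. A sketch using the squared-norm expansion of Euclidean distance, which avoids exactly the per-pair Python callable overhead warned about above:

```python
import numpy as np


def knn_euclidean(queries, corpus, k):
    """Exact brute-force kNN: one vectorized (n_q, n_c) distance matrix,
    no Python-level callable per pair."""
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, computed for all pairs at once
    d2 = (
        (queries ** 2).sum(1)[:, None]
        + (corpus ** 2).sum(1)[None, :]
        - 2 * queries @ corpus.T
    )
    return np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest corpus rows


rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 100)).astype(np.float32)
neighbors = knn_euclidean(corpus[:5], corpus, k=3)
```

At 20k vectors the distance matrix is 20k x 20k floats (about 1.6 GB as float32), so this is fine on a workstation; it is around the 1M mark that the quadratic memory and time make the FAISS index tricks necessary.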
# ¿ Oct 15, 2020 15:27 |
|
Mirconium posted:For multiprocessing pools, should I be spawning one any time I want to async-map over an iterator, or can I just create a single pool for the arena that I am working with and then summon it by reference any time I need to go over an iterator? I'm not planning to actually asynchronously run anything, I just want to parallelize some iterative computations. You have to close the pool one way or another, it won't happen if it is an attribute of an object that gets garbage collected. The easiest way is to open pools with context managers and the with keyword. This means you can re-use a pool like you mention to save spinning up new processes, but you need to keep track of it and ensure you close it eventually. If your parallel processes take long enough just re-make the pool, but if spinning up a new pool takes enough time it is worth it to keep it around, especially if you use a lengthy pool initializer function. You could create the pool in an outside context with a context manager and have the entire lifecycle of your object happen inside of that I suppose.
|
# ¿ Nov 10, 2020 21:45 |
|
Mirconium posted:So what about the destructor strategy? Like if I just add pool.terminate() to __del__ are there potential issues with that? (I guess potentially if __del__ doesn't get called, which I assume can arise from crashes or Exceptions or something?) Destructor would be fine if the object gets garbage collected properly. If you have a crash at the wrong time your pool will hang around afterward regardless of any of this (you will just be slightly more likely for this to happen if it is alive for the entire script lifetime instead of just during computation). I don't remember clearly what guarantees CPython has with respect to garbage collection and exceptions, but you would still probably want to wrap your function with a context manager and use __exit__ instead, or use a try/finally block. OnceIWasAnOstrich fucked around with this message at 15:31 on Nov 11, 2020 |
# ¿ Nov 11, 2020 15:27 |
|
Mirconium posted:Also python visualizations are all awful, especially for making actual nice looking plots that do unusual formatting, as presumably would be needed in data journalism, DOUBLE especially for making them HTML-friendly. I personally have had good luck with auto-generating javascript and html5. It's an added layer of learning curve, but when you get good at python, remember that as an option. I can't really agree with this, although default matplotlib and some of its wrappers can be awful especially if you want non-raster renderers. There are plenty of nice HTML-friendly ways to do very nice visualizations in Python including but not limited to Plotly and Bokeh. Rolling your own Javascript and HTML generation seems like an amazing amount of work for something that is probably going to be uglier and way harder to use than the Python plotly.js interface. Dash/Plotly is a great resource for data journalism-type stuff where you want fancy/nice/interactable/web-friendly visualizations.
|
# ¿ Nov 12, 2020 23:42 |
|
Zoracle Zed posted:I'd recommend the grouper iterator but god drat it's annoying itertools has a "recipes" section in the documentation instead of just putting code in the library where it'd be useful The number of times I have had to Google and copy-paste the exact same function off of either Stackoverflow or the itertools doc, depending on which one shows up first, is just so frustrating. Put it in the drat library already. I don't need to keep the best way to write that memorized, taking up space in my brain. Who do we need to bother to make this happen? I know I can install more-itertools or whatever but I don't want a whole extra dependency when that is an incredibly common and simple need that would fit well in itertools.
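For anyone about to Google it yet again, the recipe in question, in the spirit of the itertools-docs version:

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """The itertools-docs 'grouper' recipe: collect data into fixed-length
    chunks, padding the last one with fillvalue."""
    args = [iter(iterable)] * n  # n references to ONE iterator, so zip advances it n at a time
    return zip_longest(*args, fillvalue=fillvalue)


chunks = list(grouper("ABCDEFG", 3, "x"))
# [('A', 'B', 'C'), ('D', 'E', 'F'), ('G', 'x', 'x')]
```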
|
# ¿ Nov 13, 2020 19:53 |
|
duck monster posted:I am currently overbrimming with the desire to beat Homebrew's python maintainers to death (with angry words). As much as I used it for years and got my whole PhD doing work on a Mac Pro primarily with Homebrew...it is a trash package manager. You can pin a formula but the Homebrew rule that you can't build against outdated formulas really works against Python here. I don't think there is a good way to maintain Python on Homebrew as it exists when by rule you can't NOT upgrade dependencies. I believe if you pin a particular formula it will at least ask for confirmation before upgrading it. I believe you could also alias --confirm onto your brew commands to get it to behave more like a sane package manager by default. Use anything else to manage your Python environment on macOS, whether it is pyenv or conda or whatever else.
|
# ¿ Nov 17, 2020 15:38 |
|
Epsilon Plus posted:ooooh, I didn't even think that would be a serious issue Yep, creating and working with massive Python strings is definitely going to add a lot of overhead compared to simple integer or even bignum math. Depending on what you are curious about, using log10 gets you the same info without that particular performance hit although with exponentially-increasing bignums you'll hit some performance problems soon enough anyway.
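A sketch of the log10 trick versus the string round-trip; beware that float rounding can bite right at powers of ten for truly enormous inputs, so treat this as the quick version rather than a bulletproof one.

```python
import math


def digit_count(n):
    """Number of decimal digits of a positive int, without building the
    expensive decimal string representation."""
    return int(math.log10(n)) + 1


big = 2 ** 100
# Same answer as len(str(big)), but no giant string is materialized.
```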
|
# ¿ Dec 9, 2020 20:25 |
|
CarForumPoster posted:When I share results I output the notebook to PDF. The few times I’ve needed to do this it’s always with nontechnical people. Yeah the promise of notebooks being useful to share with non-technical people was a little oversold for me. Anyone non-technical isn't going to 1) install Python and associated libraries 2) be able to run one of my notebooks 3) care about the code. I can make a PDF but it doesn't really get me much when with just a little more work I can make a Shiny/Dash app and get actual interaction. I know there are services to kinda-sorta host them but . I've experimented a bit with using them to quickly prototype a bit of an overall analysis pipeline and then integrate it into a bigger Snakemake pipeline and that worked well enough. Nothing groundbreaking but it probably did save me some time and if I need to dig in more I can pretty easily. I use them extensively for loving around and experimenting since having visualizations or big chunks of data pop out semi-formatted in a browser cell is (usually) more useful than having to write extra code for it in a REPL via SSH or having to use a separate program or X server to let matplotlib poo poo appear over SSH. These days this usually goes hand-in-hand with using it to test out changes to an actual library that I'm writing in a real IDE. Problems with multiprocessing or async issues do pop up in certain contexts that cause me to have to go entirely out of the notebook environment though. I've also found myself extensively using Colab for teaching students since neither they nor I have to set anything up and they can't possibly break anything that can't be fixed by hitting "Factory Reset Runtime". If I can get them to not overwrite the code cells that worked it helps a little bit early on because previous work doesn't "disappear" in their brains the way it seems to on the CLI or REPL.
|
# ¿ Dec 19, 2020 18:55 |
|
You can also getattr the class name out of the module that the class is defined in. If you need to pull specific classes based on something like a text config file this works well. Assuming all of your models are present in the models module: Python code:
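A minimal sketch of that lookup, using the stdlib `collections` module and `OrderedDict` as stand-ins for your models module and the class name coming out of a config file:

```python
import importlib

# Hypothetical: a text config names which class to instantiate.
configured_class_name = "OrderedDict"   # stands in for e.g. "MyModel"

models = importlib.import_module("collections")  # stands in for your models module
cls = getattr(models, configured_class_name)
instance = cls(a=1)
```

If the config might contain arbitrary strings, it is worth validating the name against an allowlist before the `getattr`, since this otherwise lets the config reach anything defined in the module.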
|
# ¿ Dec 28, 2020 19:40 |
|
Mursupitsku posted:After running some tests it seems that the classifier is the only thing taking a significant amount of time to run. I haven't yet tested if dropping features would increase the performance. In any case there is only 17 features total and my gut tells me I can't really drop any of them. Part of the "eXtremeness" of XGBoost is that it does scale pretty well with more hardware and threads. XGBoost is pretty well optimized already, but perhaps you are using an excessively-large model. 17 is a decent number of features and it could be worth actually doing some feature selection testing. You may also be using default hyperparameters that make a more-complex-than-necessary model and could reduce the tree depth, maximum number of trees, or number of boosting rounds.
|
# ¿ Jan 10, 2021 17:21 |
|
CarForumPoster posted:If you wanna be lazy AND cheap, deploy your Flask app with Zappa to AWS lambda. gently caress servers. Because I have a ton of free GCP credits I do this with Cloud Run and it's been working great for a quickie Flask app this past week that I didn't want to bother a server with. It feels a bit more like a natural fit for Flask containers than Lambda with the configurable per-container concurrency, although Lambda can probably do that too by now. OnceIWasAnOstrich fucked around with this message at 16:19 on Jan 28, 2021 |
# ¿ Jan 28, 2021 16:16 |
|
Bad Munki posted:I don't think I'm overloading anything? I've got a function that'll take a list of ints/ranges, and as a convenience, if your list is only one item long, you can just give it the item alone. What's so weird about that? I'm sure it's 99% likely I hosed up the hint above, so that's what I'm trying to get right. It is the heterogeneous list that is messing your type checking up as far as I can tell, not the value being either an int or a list, and I think your type hints are about as good as it will get. I'm guessing (without checking) that PyCharm isn't handling covariant typing on your Iterable and is assuming invariant type, checking the first item, and assuming the entire list is ints. What happens if your first list element is a string? If you really care about that and wanted a cleaner function you could accept kwargs for lists, with one for int-lists and one for range-lists, since presumably you have code to separate the items out anyway and handle them differently in your function already. edit: I'm having a hard time figuring out whether the PEP/mypy actually allow for covariant type lists. I think maybe the type system doesn't allow that for mutable types? I think maybe PyCharm sees list, the type system specifies lists can only be invariant, so it sees a List[int], converts to Iterable[int] and calls it a day? OnceIWasAnOstrich fucked around with this message at 23:19 on Feb 22, 2021 |
# ¿ Feb 22, 2021 23:16 |
|
I've recently done quite a bit with ML at various levels in Python. The Scikit-Learn version is very simple, easy to use, but there is a lot buried in the many, many arguments to some methods and the defaults are very frequently going to be unhelpful to you. You might not have the same choice in optimization methods you might use otherwise, and the API isn't really designed for internal-to-the-model modularity like that, so it is up to whoever wrote that model to give you all the options or you need to customize the model yourself (lots of code). PyTorch is very much for creating models and methods, not really for use as a simple API for using established/implemented methods like scikit. That said, the type/size of models you want PyTorch for tend to have a lot of extra complication in setting up massive parallel training that isn't really conducive to something super-simple like the Scikit API, although stuff like pytorch-lightning gets close. I've done a lot with sequence learning using models that use the HuggingFace-style API for Transformers models implemented in either PyTorch or Tensorflow. That API is a lot closer to what you might expect for direct use, although it is clearly evolving rapidly as more and more methods get churned out, gradually expanding the range of stuff the coordinated API might need to do.
|
# ¿ Mar 9, 2021 22:51 |
|
Rocko Bonaparte posted:I tend to look at neuron activation as a threshold instead of an intercept or bias so I wonder if I'm interpreting the intercept_ attribute incorrectly. The coefs_ fields really do look like regular old weights across different layers; it adjusts based on hidden_layer_sizes and the data I fit. I run the fit() method to get the initial topology and then blow it over. A multilayer perceptron works differently than the example you have. In your example, neurons output a binary 0/1 value directly. In the MLP scheme that Scikit-Learn uses, you have a non-linear activation function that maps to a specific range for each neuron. To implement the classifier, the MLPClassifier takes the output of the final layer, which I believe will be of shape [num_samples, num_classes], and uses the softmax function to normalize that output to a probability distribution over your classes for each sample. The classifier will then output the class identity with the highest probability. Dominoes posted:You can classify a decision tree as ML if you want, or not. It's an easy-to-grasp, but powerful tool for creating complex behavior. I'm curious, would you maybe draw a line at a non-ML decision tree being human-interpretable? I definitely agree that there is obviously a ton of hype and my personal hand-wavy boundary is that ML models are models where there is no attempt to make the model structure reflect how the modeled phenomenon actually works, and the point isn't to understand the phenomenon through the model, just to make a good [insert goal here]. Clearly decision trees in certain incarnations are ML, especially ensemble tree methods. One of the more powerful and "successful" big machine learning models isn't an ANN but is instead a complicated method of creating ensembles of decision trees. OnceIWasAnOstrich fucked around with this message at 00:09 on Mar 12, 2021 |
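The softmax step described above, as a small numpy sketch (the function name is mine):

```python
import numpy as np


def softmax(logits):
    """Normalize a [num_samples, num_classes] array of final-layer outputs
    into one probability distribution per sample."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


logits = np.array([[2.0, 1.0, 0.1]])   # one sample, three classes
probs = softmax(logits)
predicted_class = probs.argmax(axis=1)  # the classifier reports this class
```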
# ¿ Mar 11, 2021 23:59 |
|
Dominoes posted:I don't have a reason to draw a line; categorization is a tool you can apply to a problem. Maybe you have a reason to draw a line for DTs as ML or not. Sorry, I didn't mean to make you draw a line. My point was that, personally, I would never say a type of model is or isn't ML. From my perspective ML is more of the approach or philosophy to problem solving. To me, a linear regression could be (and is) used as machine learning, but can just as easily not be. Also, Rocko, your issue with applying the MLP classifier to the XOR problem can be illustrated this way. All of the methods used for optimization of model weights are based on gradients (SGD/Adam more so than LBFGS). You can easily end up in a situation where your optimizer gets stuck in a particular region of parameter space and needs to propose much larger changes to the parameter to improve the gradient than it is capable of making. If you use SGD or Adam as your optimizer, you can visualize this: Python code:
You can see that about a quarter of these manage to converge on the correct solution, a loss value near zero. MLPClassifier uses log-loss/cross-entropy. If you leave the learning_rate_init default at 0.001, you should get a lot of warnings telling you it didn't converge with 200 iterations and a plot like this: With the default learning rate, it can't make changes to the parameters fast enough to reach the correct solution in the default maximum number of iterations. I am not sure why L-BFGS does as poorly as it does here, I get 28% properly fit, most of the models converge on similar incorrect solutions. There aren't quite enough knobs to tweak with that particular optimizer here. This is one of those issues I was referring to when I say that the defaults for the Scikit models are not always sane and definitely not always suitable. Making these stochastic optimizers converge properly and quickly by tweaking both model and optimizer hyperparameters is where much of the "art" comes into it.
|
# ¿ Mar 12, 2021 01:09 |
|
Cyril Sneer posted:I have a situation where I'm extracting two sets of byte arrays from a larger byte array buffer as follows: I am missing something. Why can't you just frombuffer() as a np.int16 dtype? Is the original data two 8-bit signed ints sequentially or something? If it's something like that, just cast to 16-bit, multiply your more-significant part by the appropriate factor (or bit-shift), then add them together. OnceIWasAnOstrich fucked around with this message at 18:25 on Apr 30, 2021 |
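A sketch of both routes, under the assumption that the buffer really is a high byte followed by a low byte per sample: the manual shift-and-or, and letting `frombuffer()` read int16 directly.

```python
import numpy as np

# Hypothetical buffer: two big-endian 16-bit samples, 0x1234 and 0x5678
raw = bytes([0x12, 0x34, 0x56, 0x78])

# Manual route: split into high/low byte streams, widen, then combine.
# The low byte must be read as UNSIGNED to avoid sign extension.
hi = np.frombuffer(raw[0::2], dtype=np.int8).astype(np.int16)
lo = np.frombuffer(raw[1::2], dtype=np.uint8).astype(np.int16)
samples = (hi << 8) | lo

# Direct route: when the bytes are already contiguous int16, just tell
# frombuffer the dtype and endianness ('>i2' is big-endian int16).
direct = np.frombuffer(raw, dtype=">i2")
```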
# ¿ Apr 30, 2021 18:22 |
|
If anyone has created a CLI tool with potentially dozens of configurable options before, I have some questions about how you like to structure this. I have inherited a tool with a CLI interface with a moderately complex argparse setup. We have a base parser and 3-4 specialized parsers for different invocations. Right now this gets processed by a main.py with functions run for the different CLI invocations. These functions do the basic argparse preparation and parsing, and simply invoke a specific function in another module with all of the parsed arguments given to the function as named keyword arguments. This means that the function signatures for those actual core processing functions are massive. Hypothetically someone might use them without the CLI interface from Python, so they have the full set of keyword arguments with the same default values as are set in argparse, although that isn't enforced in any way by code and just feels bad. How would you go about structuring this?
Is there some sort of intended solution that exists for this?
|
# ¿ May 2, 2021 17:17 |
|
QuarkJets posted:A common way to reduce the length of a function signature is to define a dataclass that contains the configurable parameters, then the function just accepts an instance of that class. If multiple functions can use the same dataclass then that's great, or you can define multiple dataclasses too - it's often better to have tailored instances than trying to create one big class that does everything Yesss. Dataclasses were what I was thinking of but couldn't remember exactly because I have (regretfully) only used them once or twice a while back. It ends up basically identical to using the argparse-returned Namespace directly as far as the function is concerned, but it will require the same level of argument definition duplication unless I do something like generate the argparse parser from the dataclass. It does solve my problem of having it extremely difficult to use the functions outside of the CLI though. Click seems nice enough but mostly as a better argparse replacement, and I'm not struggling with that aspect right now, the actual interface is pretty nice and simple at the moment, outside of the huge number of optional arguments. Maybe it has a feature that would help me avoid the huge function signature, but it seems like it would just make me have a huge signature in addition to a huge number of decorators all stacked in the same place. OnceIWasAnOstrich fucked around with this message at 01:14 on May 3, 2021 |
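One way the generate-argparse-from-the-dataclass idea can look, with a made-up option set; the parser is built from the dataclass fields so names and defaults are defined exactly once, and the core function takes the dataclass instead of a massive signature.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ProcessOptions:
    """Hypothetical subset of the tool's options; the single source of
    truth for names, types, and defaults."""
    min_length: int = 0
    verbose: bool = False
    output: str = "out.txt"


def build_parser():
    # Generate one CLI flag per dataclass field, reusing its default.
    parser = argparse.ArgumentParser()
    for f in fields(ProcessOptions):
        flag = "--" + f.name.replace("_", "-")
        if f.type is bool:
            parser.add_argument(flag, action="store_true", default=f.default)
        else:
            parser.add_argument(flag, type=f.type, default=f.default)
    return parser


def process(opts: ProcessOptions):
    """The core function takes one options object, not dozens of kwargs."""
    return f"processing {opts.output} (min_length={opts.min_length})"


args = build_parser().parse_args(["--min-length", "5"])
opts = ProcessOptions(**vars(args))   # argparse dests match the field names
```

Library users skip the parser entirely and just build `ProcessOptions(min_length=5)` themselves, which removes the argparse/function-signature default duplication.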
# ¿ May 3, 2021 01:10 |
|
QuarkJets posted:I would be hesitant to do that for any corporate codebase that's not open source I wouldn't put anything on SomethingAwful that I wouldn't put on Pastebin. At least with Pastebin it isn't automatically indexed if you set it to unlisted.
|
# ¿ Jun 9, 2021 18:09 |
|
mr_package posted:Is there a way to use subprocess for internal functions? Or, a way to set timeout value to a function that's fairly simple (in the standard library) maybe with multiprocessing module? I'm running into a fun issue where DNS lookups for .local are ignoring socket.timeout value on Mac (probably due to bonjour integration / special handling), and the default timeout is 5 seconds. This is an automated test that hits failure case if I forgot to connect to VPN so I just want it to use a very short timeout instead of sitting there for 5 seconds and then failing. You can create a multiprocessing.Process with the target= argument as the callable you want to run, ideally using a Pipe or Queue to feed your data back to the main process. You then call .start() and then .join(timeout) with whatever timeout you want before checking the Pipe/Queue and calling it a day. If your underlying callable releases GIL you can even do this with threading.Thread instead.
|
# ¿ Jun 9, 2021 22:56 |
|
|
Deadite posted:I'm having an issue trying to add a matplotlib graph to a tkinter GUI where the legend to the graph is getting cut off if I move it to below the chart. I've never had this problem in the context of another GUI but I've definitely run into similar issues with bits of a matplotlib figure getting rendered outside the bounds of an image. It is usually something that a call to tight_layout() or other layout-modifying functions can address.
|
# ¿ Jun 17, 2021 14:57 |