OnceIWasAnOstrich
Jul 22, 2006

I am completely baffled by this thing I have just discovered. I am loving around with making NNTP proxies (yeah yeah, don't start) and I'm modifying an old package called Papercut which came with a buggy mostly-formed proxy hidden in there as a plugin.

The basic idea is that this simply uses nntplib to re-create the requests the server receives and then reformat them back into the raw NNTP packets and send them back. I ran into an issue where articles would become slightly corrupted while doing this. I was making nntplib.NNTP.article() requests. This function returns a 4-tuple, where the 4th item is a list containing the lines in the article. I started comparing this list of bytestrings between raw nntplib requests to the underlying NNTP server vs my proxied server, and found a few lines in a few articles that were subtly different. In particular, if the line bytestring began with the two bytes 2e2e (hex format, 2e==".") both the proxy and the raw NNTP query would receive the line properly, but after being re-sent through the proxy you would lose one of those bytes, making the total string one byte shorter with no other changes. After testing this on a large number of articles I have not found any other corruption at all.

Why on earth is the SocketServer.StreamRequestHandler or the nntplib slightly mangling my data when there is a newline followed by two periods? I threw in a loop to add an additional period to any strings starting with two periods...and it worked, no more corruption.

Code to pull data and reformat into one long bytestring
Python code:
    def get_ARTICLE(self, group_name, id):
        #WHYYY?
        resp, nr, id, headerlines = self.nntp.head(id)
        resp, nr, id, articlelines = self.nntp.article(id)
        #Get rid of the headers so we can jam them back on later?
        while articlelines:
            line = articlelines.pop(0)
            if line == "":
                break
        #What the gently caress am I doing here? This poo poo is bananas.
        #(nntplib collapsed the doubled leading periods on receipt, so put them back;
        # startswith avoids an IndexError on empty lines)
        for i in xrange(len(articlelines)):
            if articlelines[i].startswith('.'):
                articlelines[i] = '.' + articlelines[i]

        return ("\r\n".join(headerlines), "\r\n".join(articlelines))
Code that handles the server response. This is a method in a SocketServer.StreamRequestHandler object. send_response() uses self.wfile, which is the object provided by the StreamRequestHandler for outputting data. I don't see any way that this gets modified.
Python code:
    def do_ARTICLE(self):
        result = None  # avoid a NameError if no backend matches below

        if len(self.tokens) == 2 and self.tokens[1].find('<') != -1:
            # Message ID specified
            for b in backends.values():
                self.tokens[1] = self.get_number_from_msg_id(self.tokens[1], b)
                result = b.get_ARTICLE(self.selected_group, self.tokens[1])
                if result:
                    backend = b
                    # article_info = b.get_article_number(self.tokens[1])
                    article_info = (self.selected_group, self.tokens[1])
                    break

        if result is None:
            self.send_response(ERR_NOSUCHARTICLENUM)
        else:
            response = STATUS_ARTICLE % (article_info[0], article_info[1])
            self.send_response("%s\r\n%s\r\n\r\n%s\r\n." % (response, result[0], result[1]))

    def send_response(self, message):
        self.wfile.write(message + "\r\n")
        self.wfile.flush()

While writing this up I decided to investigate, and found this:

RFC977 posted:

Text is sent only after a numeric status response line has been sent
that indicates that text will follow. Text is sent as a series of
successive lines of textual matter, each terminated with CR-LF pair.
A single line containing only a period (.) is sent to indicate the
end of the text (i.e., the server will send a CR-LF pair at the end
of the last line of text, a period, and another CR-LF pair).

If the text contained a period as the first character of the text
line in the original, that first period is doubled. Therefore, the
client must examine the first character of each line received, and
for those beginning with a period, determine either that this is the
end of the text or whether to collapse the doubled period to a single
one.

I guess this answers why I have to do this. The nntplib client has stripped extra periods from the lines, and I need to regenerate them. I guess I don't need to post this, but :justpost:.
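For reference, the RFC 977 dot-stuffing round trip is easy to sketch. This is a minimal illustration with made-up helper names, operating on lists of line strings like the ones nntplib hands back:

```python
def dot_stuff(lines):
    """Escape lines before sending: a leading '.' is doubled (RFC 977)."""
    return ['.' + line if line.startswith('.') else line for line in lines]


def dot_unstuff(lines):
    """Undo the escaping on receipt: a doubled leading '..' collapses to '.'."""
    return [line[1:] if line.startswith('..') else line for line in lines]


stuffed = dot_stuff(['..x', '.y', 'z'])      # ['...x', '..y', 'z']
restored = dot_unstuff(stuffed)              # ['..x', '.y', 'z']
```

nntplib does the unstuffing for you on the way in, which is exactly why the proxy has to re-stuff on the way out.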


OnceIWasAnOstrich
Jul 22, 2006

Ghost of Reagan Past posted:

What's a good library for drawing an array of pixels? They would change over time (I'm not animating figures, though), and I plan on eventually feeding into some LEDs, but for now I'd rather prototype without wiring up a bunch of LEDs. I've thought of just doing it with Flask + Javascript but I'm wondering if there's some other good solution.

I'm semi-willing to use some other language, but Python's the language I use daily so I'd rather just stick with that unless I can't find an adequate solution.

numpy for the array and matplotlib imshow() to render?
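Something like this, as a rough sketch (the array shape and update pattern are made up; the Agg backend is used here so it runs headless, but you'd drop that line and add plt.pause() when prototyping interactively):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless; remove for an interactive window
import matplotlib.pyplot as plt

# A small "LED matrix" as an RGB float array, values in [0, 1]
pixels = np.zeros((8, 8, 3))

fig, ax = plt.subplots()
im = ax.imshow(pixels, interpolation="nearest")

for frame in range(10):
    pixels[frame % 8, frame % 8] = (1.0, 0.0, 0.0)  # light one "LED" per frame
    im.set_data(pixels)        # update in place instead of re-creating the image
    fig.canvas.draw_idle()
```

set_data() keeps the same AxesImage around, so redrawing stays cheap as the array changes over time.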

OnceIWasAnOstrich
Jul 22, 2006

Data Graham posted:


So um. Does this mean the GIL is really that much of a bastard and I should just have shot myself rather than try multithreading in Python in the first place?

Yes

Longer answer: Yes. If you want to do something like multithreading and it uses any significant amount of CPU time in the interpreter itself, you simply cannot do it without more processes. If you are doing something that involves a lot of IO, you can theoretically use threads, but it is still a huge pain in the rear end to actually work around the GIL. You may be better off using greenlets or eventlets or something similar, where someone did the hard work of making sure the IO can actually run concurrently; those patterns can also make it easier to avoid dumping something blocking into a coroutine. If you are rate-limited by your parsing and not the IO itself, none of those will help though.

Maybe some of the Python3 async stuff might be simpler for this purpose but I haven't gotten a chance to mess with that enough.

OnceIWasAnOstrich fucked around with this message at 13:37 on Aug 1, 2017

OnceIWasAnOstrich
Jul 22, 2006

Ugh, asyncio. I felt the need to build a nice fast socket-based server app and decided to do it "correctly". Started using asyncio and man, is this a mess. Why are both generator coroutines and async functions a thing? Why are there multiple keywords that all do something very slightly different, with seemingly arbitrary restrictions on which types of functions they work on?

I do actually know the answer to these questions. I'm just annoyed because the mix of info and examples I find even in the 3.6 documentation is baffling to use as an intro because it is not at all obvious which type of coroutines different bits of the asyncio library work with, at least at first. Trying to figure out whether/how you can schedule? run? new async def functions inside an eventloop which is already running a run_until_complete streaming server is weirdly difficult when examples are using two different APIs and two different forms of coroutines and older info uses newer terminology differently.
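For what it's worth, the bridging call I eventually landed on is asyncio.ensure_future, which accepts both async def coroutines and the old generator-based kind. A sketch in the 3.6-era API (the job itself is made up; the threadsafe variant in the comment is what you'd use from outside the loop's thread):

```python
import asyncio


async def side_job():
    await asyncio.sleep(0)
    return 42


async def main():
    # Inside an already-running loop: ensure_future schedules a new coroutine
    # alongside whatever run_until_complete is currently driving.
    task = asyncio.ensure_future(side_job())
    return await task


loop = asyncio.new_event_loop()
result = loop.run_until_complete(main())
loop.close()

# From a *different* thread, the equivalent is:
#     asyncio.run_coroutine_threadsafe(side_job(), loop)
```

run_coroutine_threadsafe is the one that's safe to call while the loop is blocked in run_until_complete on another thread; ensure_future/create_task are only for code already running on the loop.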

OnceIWasAnOstrich
Jul 22, 2006

Jose Cuervo posted:

Question: Is there a standard / simple way in the Notebook to ensure that everyone uses the same version of Python, and the same version of the packages being imported?

Run a JupyterHub server and set everyone to use one consistent environment?

OnceIWasAnOstrich
Jul 22, 2006

A triangular distribution doesn't sound like it would effectively approximate what really sounds like some sort of bimodal distribution. Maybe model using two or more distributions (one for high usage one for low usage) and a latent variable describing which state you are in. You could then either sequentially draw X numbers from your current state where X is Poisson or binomial or something, and then another random number of days for the other and keep switching. Alternatively you draw one day at a time and also draw a Bernoulli at some low probability that tells you whether to switch states.
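The Bernoulli-switch variant is only a few lines to simulate. A stdlib-only sketch where the state distributions, means, and switch probability are all made-up placeholders for whatever fits the real data:

```python
import random


def simulate_usage(days, p_switch=0.05, seed=0):
    """Daily usage from a two-state model: each day draws from the current
    state's distribution, then the state flips with small probability."""
    rng = random.Random(seed)
    state = "low"
    draws = []
    for _ in range(days):
        if state == "low":
            draws.append(rng.gauss(2.0, 0.5))    # low-usage days
        else:
            draws.append(rng.gauss(20.0, 3.0))   # high-usage days
        if rng.random() < p_switch:              # Bernoulli switch between states
            state = "high" if state == "low" else "low"
    return draws


usage = simulate_usage(365)
```

Swapping the Bernoulli switch for a drawn run-length (the Poisson/binomial version) just means drawing X up front and staying in the state for X days before flipping.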

OnceIWasAnOstrich
Jul 22, 2006

There are some situations where it will be very handy, and a whole bunch of examples that make me angry to look at.

OnceIWasAnOstrich
Jul 22, 2006

the yeti posted:

Are there any libs that can actually create encrypted zips? Seems like zipfile and others are explicitly decrypt only (due to...licensing? Technical hurdles?)

I think libarchive can do that and there is a library with ctypes bindings to libarchive, though I don't know that those function bindings are implemented.

Is there a reason you can't call out to a binary for it?

OnceIWasAnOstrich
Jul 22, 2006

Python code:
from decimal import Decimal, ROUND_UP

x = Decimal(1.23)
x.quantize(Decimal('0.1'), rounding=ROUND_UP)
Not exactly convenient.

Edit:

Doing this also exposes the weird Python float magic.

Python code:
x = 1.23
str(x)
#'1.23'
Decimal(x)
#Decimal('1.229999999999999982236431605997495353221893310546875')

OnceIWasAnOstrich fucked around with this message at 16:58 on Sep 24, 2019

OnceIWasAnOstrich
Jul 22, 2006

pmchem posted:

can confirm this is often how science is done

Had a collaboration where we had a pipeline that generated a ton of numerical data tables and were trying to standardize storage onto some Apache Arrow based format, either Feather or Parquet. This was vetoed by some PIs because they wanted "something standard that we can open up in Excel when you all leave". Massive CSV files it is.

OnceIWasAnOstrich
Jul 22, 2006

Is it possible to use Jupyter on a machine where the home folder is an NFS/SMB-mounted network share? It seems to extensively use sqlite databases for random things, which causes issues doing something as simple as running an ipython/jupyter console when the home folder is network-mapped.
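One workaround I've seen is pointing the state directories at local disk so the sqlite files (IPython's history.sqlite, Jupyter's nbsignatures.db) stay off the network mount. The paths below are illustrative; IPYTHONDIR, JUPYTER_DATA_DIR, and JUPYTER_RUNTIME_DIR are the relevant knobs:

```shell
# Keep IPython/Jupyter sqlite state on local disk instead of the NFS home
export IPYTHONDIR=/tmp/$USER/ipython
export JUPYTER_DATA_DIR=/tmp/$USER/jupyter/data
export JUPYTER_RUNTIME_DIR=/tmp/$USER/jupyter/runtime
```

sqlite locking over NFS is the usual culprit, so anything that relocates those files tends to fix the console hangs.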

OnceIWasAnOstrich
Jul 22, 2006

I have what is essentially a ~100gb Python dictionary, 200 million 4kb float32 arrays each with a string key. I need to provide this to some people who will be using Python to randomly access these. This is a better plan than having them generate them on-demand because they are costly to generate (~half a second to generate a single array).

I'm looking for suggestions on the best way to distribute these. My options seem to be to create a database with a traditional database server and have to host it (either a SQL or more likely a dedicated key-value store) or some sort of file-based setup (sqlite, hdf5, bdb). I would really prefer the latter from a cost perspective.

My goal is to minimize the effort and code required to access the data. Something that lets you create a Python object by handing it a file/folder name and then access it just like a dictionary would be the ideal solution. sqlitedict is available, and gdbm/bdb seems to be in the standard library and works this way, but may not be portable. Has anyone done something like this and have good/bad experience to share?

OnceIWasAnOstrich
Jul 22, 2006

Thermopyle posted:

Would pickle or shelve work for you? I haven't used them for that much data...

Both are in the standard library.

Of course, the users of your data have to trust you since pickle (and shelve since it uses pickle under the hood) can run arbitrary code when unpickling.

Pickle alone doesn't solve my problem, but shelve might. It seems to sometimes be backed by a Python dict, but has options to basically act as a wrapper around dbm which is definitely on-disk. I'll need to check if the underlying files are sufficiently-portable, but that does solve the not needing to distribute any extra code or install libraries issue.

Since everything is effectively already bytes or can be interpreted as bytes I suppose I could use the dbm module directly and avoid pickle security issues. Those won't be a big deal in this instance because they are already trusting me for more than that. I vaguely remember having technical or portability issues with that module years ago but that was back in the 2.7 days.
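The dbm-direct version is short. A sketch with the stdlib array module packing float32 values to raw bytes, so no pickle is involved anywhere (the key name and temp path are just for illustration):

```python
import array
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "vectors.db")

# Write: str keys are encoded to bytes automatically; values are raw float32 bytes
with dbm.open(path, "c") as db:
    vec = array.array("f", [1.0, 2.5, -3.25])
    db["SOMEKEY"] = vec.tobytes()

# Read back and reinterpret the bytes as float32
with dbm.open(path, "r") as db:
    out = array.array("f")
    out.frombytes(db["SOMEKEY"])
```

The portability caveat still applies: dbm.open picks whichever backend is available (gdbm, ndbm, or the pure-Python dumb fallback), and their on-disk formats are not interchangeable, so you'd want to pin one backend before shipping files around.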

OnceIWasAnOstrich
Jul 22, 2006

QuarkJets posted:

I am not a fan of using shelve for data quantities approaching "massive" amounts, by that point I just use HDF5.

Store the arrays as datasets using their dict key as the dataset name. There, you're done. This should take less than 10 lines using a for loop. If you want to split the data across multiple files that's just a flag passed to h5py's File constructor

Users fetch the list of dataset names from the file, which is equivalent to fetching a list of keys from the dictionary. Then they load the dataset with square bracket syntax, just like a dict.

You can implement gzip compression on the datasets with literally 1 line, if that's desirable, and users don't have to do anything extra to read the data; HDF5 performs the decompression automatically.

I was curious about this when I was looking into HDF5. The "dataset" terminology made me worry it would choke when I made hundreds of millions of datasets.

edit: It seems like there is a practical limit somewhere below a million datasets: https://forum.hdfgroup.org/t/limit-on-the-number-of-datasets-in-one-group/5892 I could split up my items/datasets into groups, maybe by prefix, and write a little wrapper script to hide that without too much effort. I'm also a little skeptical of shelve being able to handle that many keys without choking, unless it happens to use a really scalable search tree setup.

OnceIWasAnOstrich fucked around with this message at 00:25 on May 1, 2020

OnceIWasAnOstrich
Jul 22, 2006

I've played around a bit with using whatever filesystem someone happens to be running to handle this and storing individual binary files with raw bytes. Putting all 200m in a single directory is a no-go, it causes everything I tried (ext4, xfs, btrfs) to blow up. Apparently even the B-trees can't handle that. It works reasonably well if I create a directory hierarchy per-character since I have a max of 36 possibilities [A-Z0-9] for each character and a max of 10 characters so the hierarchy doesn't end up too deep or with too many files in a single folder, just an insane amount of folders but well within the capabilities of most filesystems. This makes copying this an exercise in "you better loving have an SSD". It's probably not so bad once I manage to create a tar and the writing shouldn't be too random.

I guess if I just made it its own FS and copied it at a block level and distributed an image that would solve the random I/O problem but add in requiring people to deal with a filesystem image. This is basically the same thing I did with HDF5 (and a recursive version of what QuarkJets suggested) but with the folders replaced by HDF5 groups. HDF5 has the benefit of being a solid file object I can read/write with sequential I/O. I guess this is kind of the same thing as a filesystem image except that the code to deal with it is in the HDF5 library instead of the kernel.
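The per-character fan-out itself is trivial to sketch, for what it's worth (function and extension names are made up):

```python
import os


def shard_path(root, key, depth=3):
    """Map a key like 'ABC123' to root/A/B/C/ABC123.bin, one directory level
    per leading character, so no single directory holds too many files."""
    parts = list(key[:depth])
    return os.path.join(root, *parts, key + ".bin")


p = shard_path("/data/store", "ABC123")
```

With 36 possible characters per level, depth 3 caps any leaf directory at roughly 200M / 36^3 ≈ 4300 files, which the B-tree directories handle fine.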



For practical purposes, this is going to be used by a dozen or less of my students for research purposes for now, so I can make them do whatever I want as long as I teach them how to do it. This is all making putting this all in a real database (and maybe creating a database dump) and a docker image more and more appealing.

OnceIWasAnOstrich fucked around with this message at 01:35 on May 2, 2020

OnceIWasAnOstrich
Jul 22, 2006

Bad Munki posted:

e: Yeah, it's the darnedest thing. If I sit here with my return key mashed, the whole thing functions exactly as intended, albeit with a few blank lines from the server console because I'm constantly entering blank commands, with the occasional actual command output interspersed as my proxy sends the command but is unable to execute it. The fix here is, obviously, to leave a paperweight on my return key indefinitely.

I think you need to set bufsize=1 and universal_newlines=True.

edit: Although maybe not? I would have thought a manual flush() would also have done the trick.

OnceIWasAnOstrich
Jul 22, 2006

abelwingnut posted:

i have, effectively:

listLocation = [ x.append(y) for x in list1 for y in list2]

Whether or not this is the best way to do this, you can do:

Python code:
listLocation = [ x + [y] for x, y in zip(list1, list2) ]

OnceIWasAnOstrich
Jul 22, 2006

dragon enthusiast posted:

Wouldn't using *args require that foo and bar be named and assigned to outside of the function?

I guess in my specific use case is an object that could contain properties obj.foo and obj.bar. I'm trying to figure out if there's any syntactic sugar to express something like obj.action(foo,bar,piss) as concisely as possible. R supports that kind of syntax for example.

You can just iterate over the resulting args list and access the property from each list item without knowing its original variable name, assuming I'm understanding what you are very non-specifically trying to do.

Python code:
def fn(*args):
  for arg in args:
    print(arg.foo)
edit: Oh you want the name of the variable that you pass in as a non-keyword argument to be available in the function's scope. That is definitely not the way anyone would expect a Python function to work so you should probably use the suggestion to pass in a dictionary which is intended for that sort of name-value relationship.

OnceIWasAnOstrich fucked around with this message at 18:28 on Oct 2, 2020

OnceIWasAnOstrich
Jul 22, 2006

Dominoes posted:

One minor, tangental* step: I'm going to move the python process manager I wrote towards only supporting the latest python minor version. I've been neglecting this project for want of spare time, but this change may simplify the codebase and API.

With that in mind: If you're using 3.6, 3.7 etc, why? Would switching to 3.8 break anything for you?

* Relevant in that it reduces a system state degree-of-freedom.

Python 3 is now old enough that scientific libraries have started accumulating dumb poo poo that only works with 3.5 but not newer because of dependency hell. I'm thinking of the difficulty of getting PyQt4 working with 3.6+ to use with visualization libraries that only work with PyQt4 but not PyQt5. With the exception of legacy C++ codebases like that, there isn't much reason that I've run into.

OnceIWasAnOstrich
Jul 22, 2006

Mirconium posted:

Hello thread, my old friend

I have a need to compute k nearest neighbors on mid-to-large datasets (20k samples and above)

Default implementations of exact kNN via ball tree and the like in sklearn stall out if you actually attempt to do this because they rely on computing the entire pairwise distance matrix for the whole dataset using scipy's single-threaded sadgasm of a distance function.

Wishing to avoid approximate nearest neighbors, I think it may be worth it just to roll my own fast knn with a lazily evaluated distance metric. This way we can split up the dataset into small chunks, and by computing the one vs all distance of a single sample in the chunk, we can use the triangle inequality to guarantee no sample outside the chunk is closer than some distance. After that we can quickly solve a small knn problem.

My problem now is how best to go about writing a performant, lazily evaluated, cached distance function in python? Ideally it would be something dictionary or hashmap-esque, where the key is a pair of indices representing two samples and the value is evaluated once when called, then stored for future reference. Unfortunately python dictionaries get sort of slow if they get really huge. Does anyone have suggestions?

(The temptation to do this in rust and then try to get it to be callable from python is overwhelming, but I know that that way lies only madness)

Consider taking a look at FAISS https://github.com/facebookresearch/faiss. It can do a lot of things but the core of it is extremely fast kNN searches on extremely large datasets with a variety of strategies to speeding things up on hideously big datasets. 20k is honestly pretty small (exhaustive search gets infeasible around 1M vectors with that code) so you can just use the flat index for exact results and not rely on the index tricks for approximate results. Since you refer to scipy's distance function I'm assuming you are using something normal like Euclidean distance so it probably includes a C-implementation of what you are using already.

edit: That said, 20k is not a lot of vectors at all and I've used the sklearn BallTree with no problems for both Euclidean or Cosine distance metrics on similarly sized datasets. The scipy pdist() function only takes about 10 seconds on my machine to compute the full distance matrix for 20k vectors of 100 dimensions each. Don't give it a callable metric, that will ruin things with Python overhead. If you can't use any of the included metrics you'll need to construct your method with not-Python in some way.
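If you do want the chunked exact search, it's short in plain numpy. A sketch (function name made up; squared Euclidean, and it assumes points are distinct so the self-match is always column 0 after sorting):

```python
import numpy as np


def exact_knn(X, k, chunk=2048):
    """Exact k-nearest-neighbor indices for every row of X (n, d), computed
    chunk-by-chunk so the full n x n distance matrix never materializes."""
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    out = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk):
        stop = min(start + chunk, n)
        # squared Euclidean distances: |a-b|^2 = |a|^2 - 2*a.b + |b|^2
        d2 = sq[start:stop, None] - 2.0 * (X[start:stop] @ X.T) + sq[None, :]
        order = np.argsort(d2, axis=1)
        out[start:stop] = order[:, 1:k + 1]  # column 0 is the point itself
    return out


X = np.array([[0.0], [1.0], [10.0], [11.0]])
neighbors = exact_knn(X, k=1)
```

The matrix multiply is what makes this fast: BLAS does the heavy lifting in threaded C instead of a per-pair Python callable.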

OnceIWasAnOstrich fucked around with this message at 15:45 on Oct 15, 2020

OnceIWasAnOstrich
Jul 22, 2006

Mirconium posted:

For multiprocessing pools, should I be spawning one any time I want to async-map over an iterator, or can I just create a single pool for the arena that I am working with and then summon it by reference any time I need to go over an iterator? I'm not planning to actually asynchronously run anything, I just want to parallelize some iterative computations.

Also the documentation makes OMINOUS ALLUSIONS to the fact that you can't rely on garbage collection for pools and you have to close them manually. So if I store a pool, do I like... have to actually write a destructor for whatever object encloses it telling it to terminate the pool?

You have to close the pool one way or another, it won't happen if it is an attribute of an object that gets garbage collected. The easiest way is to open pools with context managers and the with keyword.

This means you can re-use a pool like you mention to save spinning up new processes, but you need to keep track of it and ensure you close it eventually. If your parallel tasks take long enough, just re-make the pool; but if spinning up a new pool takes enough time, it is worth it to keep it around, especially if you use a lengthy pool initializer function. You could create the pool in an outside context with a context manager and have the entire lifecycle of your object happen inside of that, I suppose.

OnceIWasAnOstrich
Jul 22, 2006

Mirconium posted:

So what about the destructor strategy? Like if I just add pool.terminate() to __del__ are there potential issues with that? (I guess potentially if __del__ doesn't get called, which I assume can arise from crashes or Exceptions or something?)

Destructor would be fine if the object gets garbage collected properly. If you have a crash at the wrong time, your pool will hang around afterward regardless of any of this (you will just be slightly more likely to have this happen if the pool is alive for the entire script lifetime instead of just during computation). I don't remember clearly what guarantees CPython has with respect to garbage collection and exceptions, but you would still probably want to wrap your function with a context manager and use __exit__ instead, or use a try/finally block.

OnceIWasAnOstrich fucked around with this message at 15:31 on Nov 11, 2020

OnceIWasAnOstrich
Jul 22, 2006

Mirconium posted:

Also python visualizations are all awful, especially for making actual nice looking plots that do unusual formatting, as presumably would be needed in data journalism, DOUBLE especially for making them HTML-friendly. I personally have had good luck with auto-generating javascript and html5. It's an added layer of learning curve, but when you get good at python, remember that as an option.

I can't really agree with this, although default matplotlib and some of its wrappers can be awful especially if you want non-raster renderers. There are plenty of nice HTML-friendly ways to do very nice visualizations in Python including but not limited to Plotly and Bokeh. Rolling your own Javascript and HTML generation seems like an amazing amount of work for something that is probably going to be uglier and way harder to use than the Python plotly.js interface.

Dash/Plotly is a great resource for data journalism-type stuff where you want fancy/nice/interactable/web-friendly visualizations.

OnceIWasAnOstrich
Jul 22, 2006

Zoracle Zed posted:

I'd recommend the grouper iterator but god drat it's annoying itertools has a "recipes" section in the documentation instead of just putting code in the library where it'd be useful

The number of times I have had to Google and copy-paste the exact same function off of either Stack Overflow or the itertools docs, depending on which one shows up first, is just so frustrating. Put it in the drat library already. I don't need the best way to write that memorized, taking up space in my brain. Who do we need to bother to make this happen?

I know I can install more-itertools or whatever but I don't want a whole extra dependency when that is an incredibly common and simple need that would fit well in itertools.
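For anyone about to go Google it yet again, the recipe in question (per the itertools docs' recipes section) is:

```python
from itertools import zip_longest


def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks:
    grouper('ABCDEFG', 3, 'x') -> ('A','B','C'), ('D','E','F'), ('G','x','x')."""
    args = [iter(iterable)] * n          # n references to the SAME iterator
    return zip_longest(*args, fillvalue=fillvalue)


chunks = list(grouper("ABCDEFG", 3, "x"))
```

The trick is that all n positions of zip_longest pull from one shared iterator, so each tuple advances it n steps.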

OnceIWasAnOstrich
Jul 22, 2006

duck monster posted:

I am currently overbrimming with the desire to beat Homebrew's python maintainers to death (with angry words).

YET AGAIN, a random non python install just randomly upgrades python breaking the gently caress out of every goddamn virtualenv on my machine. People have been banging on about this being unnaceptable on variour homebrew related git issues for years, and ..... nothing gets done. The irritating thing is python has *multiple* methods of letting multiple pythons co-exist. Theres no need to uninstall the old versions.... Gah...

At least a warning that "Hey this package will destroy your dev environment, just so you know" and an option to pull the eject lever.....

< / rant >

As much as I used it for years and got my whole PhD doing work on a Mac Pro primarily with Homebrew...it is a trash package manager. You can pin a formula but the Homebrew rule that you can't build against outdated formulas really works against Python here. I don't think there is a good way to maintain Python on Homebrew as it exists when by rule you can't NOT upgrade dependencies. I believe if you pin a particular formula it will at least ask for confirmation before upgrading it. I believe you could also alias --confirm onto your brew commands to get it to behave more like a sane package manager by default.

Use anything else to manage your Python environment on macOS, whether it is pyenv or conda or whatever else.

OnceIWasAnOstrich
Jul 22, 2006

Epsilon Plus posted:

ooooh, I didn't even think that would be a serious issue

Yep, creating and working with massive Python strings is definitely going to add a lot of overhead compared to simple integer or even bignum math. Depending on what you are curious about, using log10 gets you the same info without that particular performance hit although with exponentially-increasing bignums you'll hit some performance problems soon enough anyway.
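To make the log10 version concrete, a sketch (helper name made up; note that for astronomically large n, float rounding in log10 can be off by one near exact powers of ten, so treat it as an approximation at that scale):

```python
import math


def digit_count(n):
    """Decimal digits in a positive integer, without building the string."""
    return math.floor(math.log10(n)) + 1


big = 2 ** 100
fast = digit_count(big)   # O(1)-ish float math
slow = len(str(big))      # builds the whole decimal string first
```

For curiosity-scale questions like "how many digits does this have now?", the log10 version sidesteps the quadratic cost of repeated int-to-string conversion.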

OnceIWasAnOstrich
Jul 22, 2006

CarForumPoster posted:

When I share results I output the notebook to PDF. The few times I’ve needed to do this it’s always with nontechnical people.

Yeah the promise of notebooks being useful to share with non-technical people was a little oversold for me. Anyone non-technical isn't going to 1) install Python and associated libraries 2) be able to run one of my notebooks 3) care about the code. I can make a PDF but it doesn't really get me much when with just a little more work I can make a Shiny/Dash app and get actual interaction. I know there are services to kinda-sorta host them but :effort:.

I've experimented a bit with using them to quickly prototype a bit of an overall analysis pipeline and then integrate it into a bigger Snakemake pipeline and that worked well enough. Nothing groundbreaking but it probably did save me some time and if I need to dig in more I can pretty easily.

I use them extensively for loving around and experimenting, since having visualizations or big chunks of data pop out semi-formatted in a browser cell is (usually) more useful than having to write extra code for it in a REPL via SSH, or having to use a separate program or X server to let matplotlib poo poo appear over SSH. These days this usually goes hand-in-hand with using it to test out changes to an actual library that I'm writing in a real IDE. Problems with multiprocessing or async issues do pop up in certain contexts, though, and those cause me to have to go entirely out of the notebook environment.

I've also found myself extensively using Colab for teaching students since neither they nor I have to set anything up and they can't possibly break anything that can't be fixed by hitting "Factory Reset Runtime". If I can get them to not overwrite the code cells that worked it helps a little bit early on because previous work doesn't "disappear" in their brains the way it seems to on the CLI or REPL.

OnceIWasAnOstrich
Jul 22, 2006

You can also getattr the class name out of the module that the class is defined in. If you need to pull specific classes based on something like a text config file this works well.

Assuming all of your models are present in the models module:

Python code:
from mypackage import models
predictor = 'modelA'
model_cls = getattr(models, predictor)
model = model_cls()

OnceIWasAnOstrich
Jul 22, 2006

Mursupitsku posted:

After running some tests it seems that the classifier is the only thing taking a significant amount of time to run. I haven't yet tested if dropping features would increase the performance. In any case there is only 17 features total and my gut tells I cant really drop any of them.

Would just throwing more computing power at it work? Atm I'm running it on an semi old laptop.

What are my options to optimize the classifier itself?

Part of the "eXtremeness" of XGBoost is that it scales pretty well with more hardware and threads. XGBoost is pretty well optimized already, but perhaps you are using an excessively large model. 17 is a decent number of features, and it could be worth actually doing some feature-selection testing. You may also be using default hyperparameters that make a more-complex-than-necessary model; you could reduce the tree depth, maximum number of trees, or number of boosting rounds.
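To make that concrete, here are the usual complexity knobs as they appear in XGBoost's sklearn-style wrapper. The values are illustrative only, not recommendations, and the commented constructor call assumes you have the xgboost package and training data around:

```python
# Parameters that trade model complexity for speed (names from the
# xgboost sklearn wrapper; values here are placeholders to tune, not advice)
params = {
    "max_depth": 4,           # shallower trees: cheaper and less overfit-prone
    "n_estimators": 200,      # cap the number of boosted trees
    "subsample": 0.8,         # row subsampling per boosting round
    "colsample_bytree": 0.8,  # feature subsampling per tree
    "n_jobs": -1,             # use all available threads
}

# with xgboost installed, something like:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**params).fit(X_train, y_train)
```

Cutting max_depth tends to give the biggest wall-clock win, since tree cost grows with depth and so does the number of candidate splits evaluated.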

OnceIWasAnOstrich
Jul 22, 2006

CarForumPoster posted:

If you wanna be lazy AND cheap, deploy your Flask app with Zappa to AWS lambda. gently caress servers.

Because I have a ton of free GCP credits, I do this with Cloud Run and it's been working great for a quickie Flask app this past week that I didn't want to bother a server with. It feels a bit more like a natural fit for Flask containers than Lambda, with the configurable per-container concurrency, although Lambda can probably do that too by now.

OnceIWasAnOstrich fucked around with this message at 16:19 on Jan 28, 2021

OnceIWasAnOstrich
Jul 22, 2006

Bad Munki posted:

I don't think I'm overloading anything? I've got a function that'll take a list of ints/ranges, and as a convenience, if your list is only one item long, you can just give it the item alone. What's so weird about that? I'm sure it's 99% likely I hosed up the hint above, so that's what I'm trying to get right.

In this case, a real example would be that the user is searching for frames of data from a satellite, and they often have a list of frames they want, which may include, say, frames 100, 150, 200, and everything from 300-400. Another user just wants to search for a single frame, 100. Forcing them to provide a list of all 104 values for the sake of purity of data type seems silly, as does providing multi_frame_search() and single_frame_search() variants.

If it were SUPER offensive as-is and the cops are already on their way, I would consider forcing it to always be a list. But it's still gonna be a list of ints and/or ranges. That part's a requirement. Making the list-ness of the input optional just seems polite.

It is the heterogeneous list that is messing your type checking up as far as I can tell, not the argument being either an int or a list, and I think your type hints are about as good as it will get. I'm guessing (without checking) that PyCharm isn't handling covariant typing on your Iterable and is assuming an invariant type, checking the first item and assuming the entire list is ints. What happens if your first list element is a string?

If you really care about that and wanted a cleaner function you could accept kwargs for lists with one for int-lists and one for range-lists since presumably you have code to separate the items out anyway and handle them differently in your function already.

edit: I'm having a hard time figuring out whether the PEP/mypy actually allow for covariant type lists. I think maybe the type system doesn't allow that for mutable types? I think maybe PyCharm sees list, the type system specifies lists can only be invariant, so it sees a List[int], converts to Iterable[int] and calls it a day?
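For what it's worth, normalizing the argument at the top of the function keeps the hint honest without splitting the API. A sketch, where frame_search and its behavior are made up to match the description in the post:

```python
from typing import Iterable, List, Union

Frame = Union[int, range]

def frame_search(frames: Union[Frame, Iterable[Frame]]) -> List[int]:
    # normalize the "polite" single-item form into a list up front,
    # so everything downstream only ever sees an iterable of Frames
    if isinstance(frames, (int, range)):
        frames = [frames]
    result: List[int] = []
    for f in frames:
        if isinstance(f, range):
            result.extend(f)  # expand 300-400 style ranges
        else:
            result.append(f)
    return result
```

The checker then only has to accept the Union at the boundary, and the body works with one concrete shape.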

OnceIWasAnOstrich fucked around with this message at 23:19 on Feb 22, 2021

OnceIWasAnOstrich
Jul 22, 2006

I've recently done quite a bit with ML at various levels in Python. The Scikit-Learn version is very simple and easy to use, but there is a lot buried in the many, many arguments to some methods, and the defaults are very frequently going to be unhelpful to you. You might not have the same choice of optimization methods you would have otherwise, and the API isn't really designed for internal-to-the-model modularity like that, so it is up to whoever wrote that model to expose all the options, or you need to customize the model yourself (lots of code).

PyTorch is very much for creating models and methods, not really for use as a simple API for using established/implemented methods like scikit. That said, the type/size of models you want PyTorch for tend to have a lot of extra complication in setting up massive parallel training that isn't really conducive to something super-simple like the Scikit API, although stuff like pytorch-lightning gets close. I've done a lot with sequence learning using models that use the HuggingFace-style API for Transformers models implemented in either PyTorch or Tensorflow. That API is a lot closer to what you might expect for direct use, although it is clearly evolving rapidly as more and more methods get churned out, gradually expanding the range of stuff the coordinated API might need to do.

OnceIWasAnOstrich
Jul 22, 2006

Rocko Bonaparte posted:

I tend to look at neuron activation as a threshold instead of an intercept or bias so I wonder if I'm interpreting the intercept_ attribute incorrectly. The coefs_ fields really do look like regular old weights across different layers; it adjusts based on hidden_layer_sizes and the data I fit. I run the fit() method to get the initial topology and then blow it over.

A multilayer perceptron works differently from the example you have. In your example, neurons output a binary 0/1 value directly. In the MLP scheme that Scikit-Learn uses, you have a non-linear activation function that maps to a specific range for each neuron. To implement the classifier, the MLPClassifier takes the output of the final layer, which I believe will be of shape [num_samples, num_classes], and uses the softmax function to normalize that output to a probability distribution over your classes for each sample. The classifier will then output the class identity with the highest probability.
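That softmax step can be sketched in a few lines of numpy; the logits below are made-up final-layer outputs, not from an actual fitted model:

```python
import numpy as np

def softmax(z):
    # subtract the row max for numerical stability, then normalize
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# hypothetical final-layer outputs: 3 samples, 2 classes
logits = np.array([[2.0, 0.5],
                   [0.1, 1.9],
                   [1.0, 1.0]])
probs = softmax(logits)           # each row now sums to 1
predicted = probs.argmax(axis=1)  # class with the highest probability
```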

Dominoes posted:

You can classify a decision tree as ML if you want, or not. It's an easy-to-grasp, but powerful tool for creating complex behavior.

I'm curious, would you maybe draw a line at a non-ML decision tree being human-interpretable? I definitely agree that there is obviously a ton of hype and my personal hand-wavy boundary is that ML models are models where there is no attempt to make the model structure reflect how the modeled phenomenon actually works, and the point isn't to understand the phenomenon through the model, just to make a good [insert goal here]. Clearly decision trees in certain incarnations are ML, especially ensemble tree methods. One of the more powerful and "successful" big machine learning models isn't an ANN but is instead a complicated method of creating ensembles of decision trees.

OnceIWasAnOstrich fucked around with this message at 00:09 on Mar 12, 2021

OnceIWasAnOstrich
Jul 22, 2006

Dominoes posted:

I don't have a reason to draw a line; categorization is a tool you can apply to a problem. Maybe you have a reason to draw a line for DTs as ML or not.

In the same sense, choose a tool suitable for the problem you're working with. Maybe it's something categorized as ML. I reject choosing ML when it's the wrong tool.

Sorry, I didn't mean to make you draw a line. My point was that, personally, I would never say a type of model is or isn't ML. From my perspective ML is more of the approach or philosophy to problem solving. To me, a linear regression could be (and is) used as machine learning, but can just as easily not be.

Also, Rocko, your issue with applying the MLP classifier to the XOR problem can be illustrated this way. All of the methods used for optimization of model weights are based on gradients (SGD/Adam more so than L-BFGS). You can easily end up in a situation where your optimizer gets stuck in a particular region of parameter space and needs to propose much larger changes to the parameters to improve the gradient than it is capable of making.

If you use SGD or Adam as your optimizer, you can visualize this:

Python code:
import seaborn
import matplotlib.pyplot as plt
from sklearn.neural_network import MLPClassifier

# the XOR problem: four points that are not linearly separable
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]

correct = 0
for x in range(100):
    classifier = MLPClassifier(hidden_layer_sizes=(2,),
                               learning_rate_init=0.1,
                               random_state=x)
    model = classifier.fit(X, y)
    if all(model.predict(X) == y):
        correct += 1
    seaborn.lineplot(x=range(len(model.loss_curve_)), y=model.loss_curve_)
plt.xlabel('iterations')
plt.ylabel('loss')
print(f'Correct: {correct}%')


You can see that about a quarter of these manage to converge on the correct solution, a loss value near zero (MLPClassifier uses log-loss/cross-entropy). If you leave learning_rate_init at its default of 0.001, you should get a lot of warnings telling you it didn't converge within 200 iterations, and a plot like this:



With the default learning rate, it can't change the parameters fast enough to reach the correct solution within the default maximum number of iterations. I am not sure why L-BFGS does as poorly as it does here: I get 28% properly fit, and most of the models converge on similar incorrect solutions. There aren't quite enough knobs to tweak with that particular optimizer here.

This is one of those issues I was referring to when I said that the defaults for the Scikit models are not always sane and definitely not always suitable. Making these stochastic optimizers converge properly and quickly by tweaking both model and optimizer hyperparameters is where much of the "art" comes into it.

OnceIWasAnOstrich
Jul 22, 2006

Cyril Sneer posted:

I have a situation where I'm extracting two sets of byte arrays from a larger byte array buffer as follows:

code:
msb = np.frombuffer( data[0::2] , dtype=np.int8, count=64)
lsb = np.frombuffer( data[1::2] , dtype=np.int8, count=64)
At this point, msb and lsb are two bytearray objects of length 64. What I want to do is somehow "match" the two bytes together to form a 16 bit integer. That is, my first 16 bit integer would be formed by merging msb[0] and lsb[0]. The second 16 bit integer would be formed by merging msb[1] and lsb[1], etc. Any clever ways to do this?

Edit:

Even better, can I directly interpret my original 128-byte byte array as 64 16-bit integers, rather than my split-and-combine approach? Looking through the struct package, there are some unpack functions that seem to be what I'm looking for?

I am missing something. Why can't you just frombuffer() with a np.int16 dtype? Is the original data two 8-bit signed ints sequentially or something? If it's something like that, just cast to 16-bit, multiply your more-significant part by the appropriate factor (or bit-shift), then add them together.
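A sketch of both routes, assuming the first byte of each pair is the most significant (big-endian) and the values are signed; the buffer here is made up to stand in for your real data:

```python
import numpy as np

# hypothetical 128-byte buffer standing in for the real data
data = bytes((i * 37) % 256 for i in range(128))

# route 1: interpret the whole buffer directly as 64 big-endian int16s
vals = np.frombuffer(data, dtype='>i2', count=64)

# route 2: split-and-combine, shifting the more-significant byte up;
# the low byte must be read as unsigned for the OR to work correctly
msb = np.frombuffer(data[0::2], dtype=np.int8, count=64)
lsb = np.frombuffer(data[1::2], dtype=np.uint8, count=64)
combined = (msb.astype(np.int16) << 8) | lsb
```

If the real data is little-endian, '<i2' and swapping which slice is the high byte does the same job.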

OnceIWasAnOstrich fucked around with this message at 18:25 on Apr 30, 2021

OnceIWasAnOstrich
Jul 22, 2006

If anyone has created a CLI tool with potentially dozens of configurable options before, I have some questions about how you like to structure this. I have inherited a tool with a CLI interface with a moderately complex argparse setup. We have a base parser and 3-4 specialized parsers for different invocations. Right now this gets processed by a main.py with functions run for the different CLI invocations. These functions do the basic argparse preparation and parsing, and simply invoke a specific function in another module with all of the parsed arguments given to the function as named keyword arguments.

This means that the function signatures for those actual core processing functions are massive. Hypothetically someone might use them without the CLI interface from Python, so they have the full set of keyword arguments with the same default values as are set in argparse, although that isn't enforced in any way by code and just feels bad. How would you go about structuring this?

  • I could hand off the parsed arguments to the function instead and require anyone using it via Python to manually construct a Namespace with the appropriate values, which makes that basically a non-starter because it would have to be done manually and doesn't benefit from any defaults. I would end up pulling specific values out of the namespace in that function, but anyone looking at it would have to go backward and consult the argparse code to figure out what is available and what the defaults are. This is easy enough: just replace all the arguments with a single Namespace arg and put args. in front of every usage of one. It does break a lot of my nice IDE features, because they can't know what is and isn't going to be in the Namespace.
  • I could create some sort of config class that gets instantiated with all of the arguments and passed around, although that just results in my duplicated argument definitions moving from a function to a dedicated class, which would still probably be an improvement. It would make it easier to check for errors, because then it's easy to check whether something should actually exist in the class.
  • I could automatically create a config class with some code that constructs it from the argparse interface so that there is no duplication, but then people who aren't insane will have no idea what the weird meta-code constructing that class is, and I'm not sure how to get it documented properly. This always makes me feel smart but is terrible. It probably still won't help with pre-runtime inference of what items are actually in that class.
  • I could hypothetically have the main.py function split the arguments into more-logical groups and merge them into dicts or config classes based on what they end up being used for as some sort of hybrid so I don't need a monolithic config class. I'm not sure this solves any problems though, just moves some code around, maybe in a helpful way.

Is there some sort of intended solution that exists for this?

OnceIWasAnOstrich
Jul 22, 2006

QuarkJets posted:

A common way to reduce the length of a function signature is to define a dataclass that contains the configurable parameters, then the function just accepts an instance of that class. If multiple functions can use the same dataclass then that's great, or you can define multiple dataclasses too - it's often better to have tailored instances than trying to create one big class that does everything

Yesss. Dataclasses were what I was thinking of but couldn't remember exactly, because I have (regretfully) only used them once or twice a while back. It ends up basically identical to using the argparse-returned Namespace directly as far as the function is concerned, but it will require the same level of argument-definition duplication unless I do something like generate the argparse parser from the dataclass. It does solve my problem of it being extremely difficult to use the functions outside of the CLI, though. Click seems nice enough, but mostly as a better argparse replacement, and I'm not struggling with that aspect right now; the actual interface is pretty nice and simple at the moment, outside of the huge number of optional arguments. Maybe it has a feature that would help me avoid the huge function signature, but it seems like it would just give me a huge signature in addition to a huge number of decorators all stacked in the same place.
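One way that generate-the-parser-from-the-dataclass idea could look, so the defaults live in exactly one place; the config fields and names here are hypothetical, not the real tool's options:

```python
import argparse
from dataclasses import dataclass, fields

@dataclass
class SearchConfig:
    # hypothetical options standing in for the real CLI's arguments
    input_path: str = "data.txt"
    max_frames: int = 100
    verbose: bool = False

def parser_from_dataclass(cls) -> argparse.ArgumentParser:
    # each dataclass field becomes one --flag with the same default
    parser = argparse.ArgumentParser()
    for f in fields(cls):
        flag = "--" + f.name.replace("_", "-")
        if f.type is bool:
            parser.add_argument(flag, action="store_true", default=f.default)
        else:
            parser.add_argument(flag, type=f.type, default=f.default)
    return parser

# the CLI entry point builds the config from argv; Python callers just
# instantiate SearchConfig directly with keyword arguments
args = parser_from_dataclass(SearchConfig).parse_args(["--max-frames", "5"])
config = SearchConfig(**vars(args))
```

The IDE and type checker see a plain dataclass, while the parser is derived rather than hand-maintained. It won't cover argparse subtleties like choices or mutually exclusive groups without extra metadata, though.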

OnceIWasAnOstrich fucked around with this message at 01:14 on May 3, 2021

OnceIWasAnOstrich
Jul 22, 2006

QuarkJets posted:

I would be hesitant to do that for any corporate codebase that's not open source

I wouldn't put anything on SomethingAwful that I wouldn't put on Pastebin. At least with Pastebin it isn't automatically indexed if you set it to unlisted.

OnceIWasAnOstrich
Jul 22, 2006

mr_package posted:

Is there a way to use subprocess for internal functions? Or, a way to set timeout value to a function that's fairly simple (in the standard library) maybe with multiprocessing module? I'm running into a fun issue where DNS lookups for .local are ignoring socket.timeout value on Mac (probably due to bonjour integration / special handling), and the default timeout is 5 seconds. This is an automated test that hits failure case if I forgot to connect to VPN so I just want it to use a very short timeout instead of sitting there for 5 seconds and then failing.

As a workaround I've put a function that returns the value of socket.gethostbyname() into a separate file and the very first test is now a fixture that calls subprocess.check_output() with timeout=1 parameter on that file. Works perfectly, saving me four seconds. Amazing. Just wondering if there's a way to make subprocess/standard library do this with local functions as opposed to calling separate python shell command.

You can create a multiprocessing.Process with the target= argument as the callable you want to run, ideally using a Pipe or Queue to feed your data back to the main process. You then call .start() and then .join(timeout) with whatever timeout you want before checking the Pipe/Queue and calling it a day. If your underlying callable releases the GIL, you can even do this with threading.Thread instead.
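A minimal sketch of that Process-plus-Queue pattern; the wrapper name is made up, and on macOS/Windows spawn-based start methods you'd want the usual __main__ guard around the calling code:

```python
import multiprocessing as mp
import socket

def _lookup(host, queue):
    # worker: push the result (or the error) back through the queue
    try:
        queue.put(socket.gethostbyname(host))
    except OSError as exc:
        queue.put(exc)

def gethostbyname_timeout(host, timeout=1.0):
    queue = mp.Queue()
    proc = mp.Process(target=_lookup, args=(host, queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # still running: the lookup blew past the timeout, so kill it
        proc.terminate()
        proc.join()
        raise TimeoutError(f"lookup of {host!r} exceeded {timeout}s")
    result = queue.get(timeout=1.0)
    if isinstance(result, Exception):
        raise result
    return result
```

Unlike a thread, the terminated process actually stops the blocked resolver call instead of leaving it running in the background.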


OnceIWasAnOstrich
Jul 22, 2006

Deadite posted:

I'm having an issue trying to add a matplotlib graph to a tkinter GUI where the legend to the graph is getting cut off if I move it to below the chart.

Does anyone know a way to display the legend when it is outside of the chart? FigureCanvasTkAgg doesn't have a height argument so I can't just stretch the viewable area. It looks like it is some kind of automatic resizing problem that I can't figure out a way around.

I've never had this problem in the context of another GUI, but I've definitely run into similar issues with bits of a matplotlib figure getting rendered outside the bounds of an image. It is usually something that a call to tight_layout() or other layout-modifying functions can address.
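A sketch of the usual fix for a below-the-axes legend, with made-up data; the key point for a GUI canvas is that subplots_adjust() changes the live figure layout (so it applies to FigureCanvasTkAgg too), whereas bbox_inches="tight" only affects saved files:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, just for this sketch
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [1, 3, 2], label="made-up series")

# put the legend below the axes...
legend = ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.15))

# ...then shrink the axes region so the figure has room for it;
# this is what keeps the legend inside a live canvas
fig.subplots_adjust(bottom=0.3)

# for files, bbox_inches="tight" expands the saved area instead
fig.savefig("chart.png", bbox_inches="tight")
```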
