SurgicalOntologist
Jun 17, 2004

The only way to tell is to test. In this case I would guess it's faster: read_csv uses a C implementation, so if that part is slow there should be an improvement over the builtin csv module. The indexing itself should be fairly well optimized too, and if speed is an issue, sorting the frame or the lookup keys might help.

Generally, the "pandas gives you a slowdown" complaint is about pandas versus numpy, not pandas versus pure Python.
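
For example, here's the kind of lookup I have in mind (just a sketch; the file and column names are hypothetical):
Python code:
import pandas as pd

df = pd.read_csv('data.csv', index_col='key')  # uses the C parser by default
df = df.sort_index()  # a sorted index can speed up repeated lookups

row = df.loc['some_key']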

SurgicalOntologist
Jun 17, 2004

Yeah, the indexing speed considerations would depend on the type of keys. If they are consecutive ints this would be doable (and probably fastest) in numpy, but with arbitrary keys it's either pandas or a dict, and pandas might be faster in some cases and shouldn't be noticeably slower in others.

SurgicalOntologist
Jun 17, 2004

unpacked robinhood posted:

Obfuscated is inaccurate. I'd say it's not meant to be parsable, at least.

I'd like to read toll prices from an online calculator. The values I'm interested in are displayed but don't appear in the page source.
Using the Inspector I've found them in a JSON response to a request that has an authkey and a bunch of other values in its parameters.

Copy-pasting the request URL into a "fresh" browser (no cookies etc.) seems to be enough to get an answer; however, removing the authkey value gives an "Access denied" message, as does trying to fetch the file with curl, for example.

I haven't tried making the request with a made-up User-Agent, but I'm concerned the authkey value may expire.

If Selenium could load the page, fill the form, catch the URL of the request I'm interested in, and dump the JSON somewhere clean, that would be nice.

In these scenarios I often find it's not that hard to mimic the underlying API calls rather than use Selenium. Often you can find additional internal data as well.

You seem to have started in that direction; you just need to figure out how to get the authkey. To do that, figure out what API calls are made when you log in, then use a requests.Session to maintain your cookies/headers. Likely you won't even have to find the authkey manually; it will be stored in the session's headers automatically. For example:
Python code:
import requests

with requests.Session() as session:
    session.headers.update({'User-Agent': '...'})  # may or may not be necessary

    # use dev tools to find out what you need to post
    response = session.post(login_url, data=dict(username='...', password='...'))
    response.raise_for_status()

    response = session.get(data_url)
    response.raise_for_status()
    data = response.json()
If that doesn't work, the next thing to try is to GET the login page before POSTing to the login API; that's sometimes necessary when the website expects your session to begin that way. In any case, you usually don't need much more than the lines above, and it's usually much more reliable and faster than Selenium. I maintain a lot of scrapers and I've switched them all over to this method.
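
That variant is just one extra request before the login POST (a sketch; login_page_url is a hypothetical URL like the others here):
Python code:
session.get(login_page_url)  # some sites set cookies here that the login POST expects
response = session.post(login_url, data=dict(username='...', password='...'))
response.raise_for_status()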

SurgicalOntologist fucked around with this message at 17:46 on Jan 10, 2018

SurgicalOntologist
Jun 17, 2004

If you're looping and plotting I don't think it's going to matter that much how you loop. The plotting is likely many orders of magnitude slower than the looping.

In any case it sounds like the issue isn't speed but just understanding of looping.

SurgicalOntologist fucked around with this message at 18:25 on Jan 10, 2018

SurgicalOntologist
Jun 17, 2004

Munkeymon posted:

There are roughly three possibilities here:

They might just have that one hard-coded auth key, in which case you just keep using it and it's fine until they change something that'd probably break your scraper anyway.

The key might be loaded into some DOM node and read by a script, so you can probably do the same given that static page and a little legwork. Open the Chrome dev tools, hit Ctrl[Option] + Shift[Command] + F to search all sources, and look for the key to see if it's just getting squirted out into a script element or something (this is likely).

It's derived from something the server sends. You're probably better off using Selenium in this case because they're being clever and paranoid.

In my experience it is almost always the last one, and it happens simply because they have authentication on their backend--nothing especially clever or paranoid; any number of frameworks/packages would give that behavior out of the box. All you have to do is mimic the login API calls just as you're mimicking the actual data-fetching API calls. Usually opening a requests.Session and posting credentials to the login URL will put the authentication token in your session automatically. You typically don't even have to find it--just post your credentials (inside a session so you have persistent headers) and from then on you're authenticated.

Sometimes for the authentication you will find a situation like your second one, where you have to fetch the login page, find some token, then include it when you post your credentials.

Edit: actually I've been assuming that this is a website that requires a login. If it doesn't then I agree, you're probably in the second scenario there.

SurgicalOntologist fucked around with this message at 20:22 on Jan 11, 2018

SurgicalOntologist
Jun 17, 2004

You're getting that error in particular because [df['Price']] is a list containing one item, the pandas column. You want just df['Price'].

However, that won't actually get the behavior you want, because just as you can't compare a list and an int, you can't compare a dataframe column and an int (well, technically you can, but you won't get a single True or False, you'll get a column of True/False values, which won't work in that context). Notice that you're only setting this cl variable once, whereas you want it to take a different value for every row.

Your options are either:
1. You could set the color inside the loop by putting all that if/elif stuff inside the loop, thereby checking for a new color every time. You will need to change a couple of other things, which I leave as an exercise.
2. You could set all the colors ahead of time with something like pd.cut(df['Price'], [0, 1000, 2000, 3000, float('inf')], labels=['green', 'yellow', 'orange', 'red']) (note that four labels require five bin edges).
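
Option 2 in full, as a sketch (the prices and thresholds here are made up):
Python code:
import pandas as pd

df = pd.DataFrame({'Price': [500, 1500, 2500, 3500]})
df['color'] = pd.cut(df['Price'], [0, 1000, 2000, 3000, float('inf')],
                     labels=['green', 'yellow', 'orange', 'red'])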

SurgicalOntologist
Jun 17, 2004

The loop is the part that starts with for and ends when the indentation returns to the previous level.

SurgicalOntologist
Jun 17, 2004

Hadlock posted:

Is there a package/library similar to Go's "Cobra"? A command-line flagging / code organizational thing.

I personally like docopt, although many prefer click. For docopt, you write the help message first (as a docstring), and docopt parses it. Tigren's example would look something like this (modified to show positional args and flagged options):
Python code:
"""hello.py

Simple program that greets <name> for a total of N times.

Usage:
  hello.py [OPTIONS] <name>


Options:
  --help -h      Show this message.
  --count=N       Number of greetings [default: 1].

"""
from docopt import docopt


def hello(name, count=1):
    for x in range(count):
        print(f'Hello {name}!')


def main():
    args = docopt(__doc__)
    hello(args['-<name>], count=int(args['--count']))


if __name__ == '__main__':
    main()
As you can see, you have to do some conversion yourself (e.g. the int call), but I find the ease of use makes up for it.

SurgicalOntologist
Jun 17, 2004

Hadlock posted:

Has anyone ever mixed and matched flask with click? I have a bunch of baby services I'll need to write for a project and my coworkers are still stuck doing cli stuff and haven't graduated fully to api. I thought maybe I could setup the app as a flask api service, but could also be run from the command line if well documented to give my coworkers a chance to come up to speed.

Click is built into flask already: http://flask.pocoo.org/docs/0.12/cli/ (same author I think)
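
A minimal sketch of what that looks like (hypothetical command and file names, following the linked docs):
Python code:
import click
from flask import Flask

app = Flask(__name__)


@app.cli.command()
@click.argument('name')
def greet(name):
    """Greet someone from the command line instead of over HTTP."""
    click.echo(f'Hello {name}!')

# run with: FLASK_APP=hello.py flask greet World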

SurgicalOntologist
Jun 17, 2004

ArcticZombie posted:

I'm doing something wrong/misunderstanding how to use asyncio. I want to read from stdin, start a task based on that input, and be able to receive more input and start more tasks while waiting for that task to complete. But the first task seems to be blocking and I don't understand why. Here is a distilled example (with load simulating a task that takes some time):

Python code:
import asyncio
import sys


async def load():
    await asyncio.sleep(3)
    return "done"


async def listen(loop):
    while True:
        recv = await loop.run_in_executor(None, sys.stdin.readline)
        print(f"Received: {recv}", end="")
        res = await load()
        print(f"Result {recv.strip()}: {res}")

loop = asyncio.get_event_loop()
loop.run_until_complete(listen(loop))
loop.close()

A coroutine blocks every time you use await. Although awaiting allows other coroutines to run in the meantime, the coroutine itself still runs top-to-bottom. So your while loop never returns to the top until the second await (the load call) finishes.

For your desired behavior, the call to readline must be the only await within the loop. You need to schedule the execution of a coroutine without awaiting it. This can be done with futures. Something like:
Python code:
import asyncio
from functools import partial
import sys


async def load():
    await asyncio.sleep(3)
    return "done"


def print_result(input_, future):
    print(f"Result {input_.strip()}: {future.result()}")


async def listen(loop):
    while True:
        recv = await loop.run_in_executor(None, sys.stdin.readline)
        print(f"Received: {recv}", end="")
        future = asyncio.ensure_future(load(), loop=loop)
        future.add_done_callback(partial(print_result, recv))

loop = asyncio.get_event_loop()
loop.run_until_complete(listen(loop))
loop.close()
Note that if your slow operation is CPU-bound rather than IO-bound there may be a better solution than making it a coroutine, but this is how you get your desired behavior at least.
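
If it is CPU-bound, one option is to send it to a process pool (a sketch; cpu_bound_load is a hypothetical blocking function, and print_result is the callback defined above):
Python code:
from concurrent.futures import ProcessPoolExecutor
from functools import partial
import sys

executor = ProcessPoolExecutor()


async def listen(loop):
    while True:
        recv = await loop.run_in_executor(None, sys.stdin.readline)
        print(f"Received: {recv}", end="")
        # run_in_executor wraps the call in an asyncio.Future, so the callback works the same
        future = loop.run_in_executor(executor, cpu_bound_load)
        future.add_done_callback(partial(print_result, recv))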

SurgicalOntologist
Jun 17, 2004

The only things I would add to that are:

- you can usually just use subprocess.run as a convenience function (everything QuarkJets said still holds)
- if you are dealing with pipes of text instead of bytes, you may need text=True. Older guides/posts may suggest universal_newlines=True; these are the same thing.

I would also emphasize QuarkJets' point A. I would say you basically never need shell=True; avoiding it just requires some understanding of what the shell does, because anything the shell is doing you can do more explicitly and securely in Python. For example, to expand wildcards use Path.glob or glob.glob; for environment variables use os.environ; etc.
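
For example, a sketch of replacing a shell wildcard (the wc command and the .txt pattern are just for illustration):
Python code:
import subprocess
from pathlib import Path

# expand the wildcard in Python instead of letting a shell do it
files = [str(p) for p in Path('.').glob('*.txt')]

result = subprocess.run(['wc', '-l', *files], capture_output=True, text=True, check=True)
print(result.stdout)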

SurgicalOntologist
Jun 17, 2004

After you activate, use "pip" instead of "pip3". Activating a conda environment points pip at the right executable, but doesn't bother with pip3, which stays pointed at your system Python.

SurgicalOntologist
Jun 17, 2004

Also, although it appears that python3 points to the right place, I would just use the plain python executable. Once you've activated an env, further disambiguation is not necessary. Executables like python2, python3, pip3, etc. are awkward workarounds for working with multiple pythons without environments. With environments you don't need to worry about that and should use the canonical version of everything, IMO.

SurgicalOntologist
Jun 17, 2004

Check the orient argument of to_json; pandas offers several different ways to organize the JSON. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_json.html
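
For instance, a quick sketch of two of the orients:
Python code:
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

df.to_json(orient='records')  # '[{"a":1,"b":3},{"a":2,"b":4}]'
df.to_json(orient='columns')  # '{"a":{"0":1,"1":2},"b":{"0":3,"1":4}}' (the default)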

SurgicalOntologist
Jun 17, 2004

You can also press Alt+Enter as you're writing code (referring to an object that you haven't imported yet) and PyCharm will give you a selection of possible imports and automatically add it to the top of the file.

SurgicalOntologist
Jun 17, 2004

I've got a SQL database on a NAS that I'm trying to query through SQLAlchemy. A NAS is obviously not ideal for running queries, but I'm just trying to do something simple. Unfortunately the process runs out of memory (1 GB plus 4 GB swap) just running this:
Python code:
user = session.query(User).one()
I just want to grab one user instance for experimentation, but this takes a couple of hours and then crashes. Meanwhile I can do simple groupby-count queries no problem (although they are also slow). Any ideas for how to make this work?

SurgicalOntologist
Jun 17, 2004

:doh:

Thanks!

SurgicalOntologist
Jun 17, 2004

If you really want something different, check out hypothesis. I use it whenever I can (with pytest as the test runner). Given a domain of inputs, it tries to find values that break your tests. The stateful testing module is especially neat; when I used it, it found some super obscure bugs in my code that only occurred after a very specific sequence of events.
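
A tiny example of the basic idea (the property being tested is just an illustration):
Python code:
from hypothesis import given
from hypothesis import strategies as st


@given(st.lists(st.integers()))
def test_sorting_is_idempotent(xs):
    once = sorted(xs)
    assert sorted(once) == once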

SurgicalOntologist
Jun 17, 2004

From what I understand, yes, it allows you to run stuff in PyCharm, for example via test runners. I find it useful when I'm starting a project and it hasn't matured to the point where I'm bothering to put together a setup.py and install it (at which point running pytest from the command line will work just as well). I'm not sure if the resources root comes into play outside of specific contexts in web apps. I can't imagine PyCharm would manipulate all paths in Python code to be relative to a resource root, because that would create more problems than it solves. It's more for paths in HTML templates, for example (at least I think that's the case).

SurgicalOntologist
Jun 17, 2004

Dominoes posted:

Any thoughts on why setup.py's install_requires can't be removed in lieu of Pipfile?

install_requires is for abstract requirements; Pipfile is for concrete requirements. The former is best practice for library code, the latter is recommended for applications.
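
To illustrate the distinction, a library's setup.py declares version ranges (hypothetical names):
Python code:
from setuptools import setup

setup(
    name='mylib',
    install_requires=['requests>=2.0'],  # abstract: any compatible version
)
A Pipfile, on the other hand, gets locked to exact versions (in Pipfile.lock) so an application deploys reproducibly.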

SurgicalOntologist
Jun 17, 2004

I think what's often missing in these discussions is the difference between applications and libraries. I didn't read the whole reddit thread, but KR does mention that pipenv is only for applications; I'm just not sure it got much attention.

This reminded me that I've never used pipenv and though I usually write library code I have one or two applications that I could port. I currently use a conda requirements file for those. Should I switch?

SurgicalOntologist fucked around with this message at 02:48 on May 15, 2018

SurgicalOntologist
Jun 17, 2004

I see. I guess the reason I've never bothered with something like that is that I never navigate to the project directory to run a CLI application. That doesn't make sense to me: the source code is one thing, but I'm usually using the application to interact with files somewhere else (e.g. a data store), or not interacting with files at all, in which case I won't leave my home directory. So with conda my environments are selected by name rather than by path, which fits my workflow. Am I weird for not wanting to navigate to my source code to run it?

All that said, it does look simple, and you're right, Thermopyle, it's low-cost and low-risk, and there's no harm in one more tool in my belt, so I'll at least give it a shot.

While we're on the topic I'll mention a couple things I like about conda (both of which clearly have pros and cons). It can handle non-Python dependencies, which has come in handy several times. And if you install the same package version in multiple environments, it hard-links them if your file system supports it, saving space.

SurgicalOntologist
Jun 17, 2004

Master_Odin posted:

Except what about developing libraries? I think it's more that including/excluding setup.py should be determined by whether you're writing an application or a library (exclude for the former, include for the latter), but you should probably always have a Pipfile and Pipfile.lock for ease of development? KR has certainly included them in requests, at the very least.

Genuinely asking: what are the benefits for ease of development? Once I have a setup.py, if I want to develop on a new machine I just run pip install -e . and I'm ready to go.

Also, exclude setup.py for apps? My approach is the opposite: I always include setup.py, but for apps I also include a file that specifies concrete dependencies. Setup.py, for example, will create entry points for me and lets me import app functions during development.
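
For example, the entry-point part of what I mean (all names hypothetical):
Python code:
from setuptools import setup

setup(
    name='myapp',
    packages=['myapp'],
    entry_points={
        'console_scripts': ['myapp = myapp.cli:main'],
    },
)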

SurgicalOntologist fucked around with this message at 04:33 on May 15, 2018

SurgicalOntologist
Jun 17, 2004

Lpzie posted:

Curious if anyone has an idea how to approach something I want to code.

I have data that look like this:

[image]

Ultimately, I want to give the user the ability to select a region on the map and calculate the RMS. For now, I need to figure out how to draw the circle (or rect) and retain the coordinates. I've been looking at the matplotlib event handling stuff but I'm not quite following it.

Check out holoviews and/or Bokeh.

http://holoviews.org/reference/apps/bokeh/selection_stream.html#bokeh-gallery-selection-stream
https://demo.bokehplots.com/apps/selection_histogram

SurgicalOntologist
Jun 17, 2004

YMMV but I've always found it easier to reverse engineer the backend API by checking out the AJAX calls in Chrome Developer Tools. Extra bonus is that you sometimes uncover extra internal data that they don't bother to expose in the front-end.

SurgicalOntologist
Jun 17, 2004

There are ways to reload, but the easiest is just to restart the kernel. Kernel > Restart / Restart and run all.

SurgicalOntologist
Jun 17, 2004

I didn't realize you meant in the console... that's a fantastic idea, if it existed I would use it all the time. That's worth writing a %magic command for or at least a feature proposal to the developers.

SurgicalOntologist
Jun 17, 2004

I've enjoyed developing Bokeh apps, and there's also HoloViews which (sort of) provides a higher-level interface over Bokeh.

I believe there's a smaller project that's more directly comparable to Shiny but it's escaping me atm.

And there's also the option of just using widgets in Jupyter Notebook--depending on the use case this may or may not be an option.

SurgicalOntologist
Jun 17, 2004

I mean, it's not perfect and it's still in active development, although I think 1.0 is coming soon. I've always been impressed with the improvements every time I come back to it, but it's not surprising that one would choose another direction. And yes, you may have had to do some JS depending on your needs.

SurgicalOntologist
Jun 17, 2004

I think you can make that a linear programming problem. With Python, the best option (last time I looked into it) was PuLP, which has a terrible interface but does work. You could make a matrix of variables where each slot is one person in one program, taking the value 0 or 1. Add a constraint for each program (e.g. row in the matrix) that it sums to 2 (2 volunteers in each slot), and constraints that the columns (volunteers) sum to 1 for the day rows and 1 for the night rows. "Don't pair up the same volunteers twice" is going to be hard to formulate linearly, but it may be doable combinatorially (something like "the sum of these four cells must be < 4; the sum of those four cells must be < 4"; etc.). That might be too many constraints... in which case maybe there's a more clever way to do it, or you could make the pairing concept part of the objective function (and then I think you'd have to look into a solver that can handle a nonlinear objective function... but still, something more direct than a GA should work).

When I solved a problem like this, I put the LpVariable objects (representing the 0-or-1 values) in an actual numpy array and used dot products to build the constraints. Another approach would be to store the LpVariables in a dict keyed by (volunteer, program) tuples and build all the constraints by looping.
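
A minimal sketch of the latter approach (the volunteer/program names and preference scores are made up, and the pairing constraints are left out):
Python code:
import pulp

volunteers = ['alice', 'bob', 'carol', 'dave']
programs = ['day1', 'day2']
pref = {(v, p): 1 for v in volunteers for p in programs}  # hypothetical preference scores

# one binary variable per (volunteer, program) slot
x = pulp.LpVariable.dicts('assign', (volunteers, programs), cat='Binary')

prob = pulp.LpProblem('scheduling', pulp.LpMaximize)
prob += pulp.lpSum(pref[v, p] * x[v][p] for v in volunteers for p in programs)

for p in programs:  # 2 volunteers in each program
    prob += pulp.lpSum(x[v][p] for v in volunteers) == 2
for v in volunteers:  # each volunteer assigned exactly once
    prob += pulp.lpSum(x[v][p] for p in programs) == 1

prob.solve()
assignments = [(v, p) for v in volunteers for p in programs if x[v][p].value() == 1]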

SurgicalOntologist
Jun 17, 2004

Proteus Jones posted:

I’m not sure a 0 or 1 will work with the program selection, since it’s implied that it’s more or less a gradient from Most Preferred to Least Preferred. And the goal is to get everyone as close to Most Preferred as you can.

The 0s and 1s are the variables that the solver is solving for; the objective function is the dot product of those variables and a vector of preference strengths.

Edit: at least, that's the base of the objective function. You could add other considerations as long as they can be linearized (or quadratic with the right solver, etc.).

SurgicalOntologist fucked around with this message at 03:47 on Jun 2, 2018

SurgicalOntologist
Jun 17, 2004

It could be that something you're doing is trying to use an extra worker that isn't provisioned.

I had a similar problem with a Flask app once; it ended up being a utility that bakes CSS into emails trying to use a second process. It would hang with only one worker (the default for the Flask dev server) but worked fine with 2 or more.

SurgicalOntologist
Jun 17, 2004

You could do it with apply too, just need to set up the function so it takes a row as input.

SurgicalOntologist
Jun 17, 2004

code:

import pandas as pd
import qrgen as qr


df = pd.read_excel("cases.xlsx")


def qr_from_row(row):
    return qr.genqr(row.url, row.filename)


# axis='columns' passes each row to the function (axis='index' would pass each column)
df.apply(qr_from_row, axis='columns')

SurgicalOntologist
Jun 17, 2004

Honestly, in this case I'm not sure there's an efficiency difference, since the function isn't vectorized. The loop is moved into a pandas function but not into a C extension (at least I don't think so). Still, I find apply more readable than iterrows, especially when it's chained with other operations. It also encourages you to encapsulate the function, which makes it testable.

I don't mind resorting to iterrows if another formulation doesn't come to me, but it's my last resort. It could be that a decade of scientific computing and avoiding loops like the plague has broken my brain, though.

SurgicalOntologist
Jun 17, 2004

huhu posted:

Not trying to start some argument, genuinely curious... The constructs I use exclusively are for item in array, for item in enumerate(array), and for item in range. Are there any good uses of the basic for loop with i=0, i++?

No. In Python you should never have to manage your iteration index yourself. Maybe in a while loop or some other place where, in some sense, you're managing the loop logic yourself. But 99% of the time when I do that, I realize there's a better way before I finish coding the loop.

SurgicalOntologist fucked around with this message at 17:01 on Jun 10, 2018

SurgicalOntologist
Jun 17, 2004

cinci zoo sniper posted:

Apply() is slightly more complicated than that. The method itself will evaluate what is happening inside the function, and correspondingly execute it in a "normal Python environment" or in a "Cython environment". The former will be a standard for loop with function-call overhead; the latter will be slightly slower than a directly vectorised operation - it is all fairly dependent on your lambda function and the axis choice. The only hard rule here is that iterrows() is bad, since on top of every other overhead you'll also incur Pandas casting each row into a Series.

Yeah, good point about apply. I think it evaluates the function on the first element more than once, in order to figure out the fastest way to execute it. As a result it might not be the best choice for a function with side effects, like the QR-code example. It shouldn't be an issue on most filesystems, but nevertheless that might be a reason to go with iterrows in this case.

Edit: or itertuples

SurgicalOntologist fucked around with this message at 05:08 on Jun 11, 2018

SurgicalOntologist
Jun 17, 2004

Is there a construct or idiom in SQLAlchemy for "fetch an instance with these attributes; if one doesn't exist, create it"? I find myself needing this a lot. I'll write one for my own uses but maybe it already exists.

Edit: https://bitbucket.org/zzzeek/sqlalchemy/issues/3942/add-a-way-to-fetch-a-model-or-create-one
So let me revise my question: anyone know of a third-party utility library that provides this?
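
For reference, a minimal sketch of the pattern I mean (no handling of the race condition discussed in the linked issue):
Python code:
def get_or_create(session, model, **kwargs):
    instance = session.query(model).filter_by(**kwargs).one_or_none()
    if instance is None:
        instance = model(**kwargs)
        session.add(instance)
    return instance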

SurgicalOntologist fucked around with this message at 02:58 on Jun 12, 2018

SurgicalOntologist
Jun 17, 2004

Boris Galerkin posted:

I have a numpy.ndarray of coordinate points, i.e.,

code:
x = np.array([[1.2, 0.2, 0], [1.3, 0.3, 0], [1.3, 0.0, 1], ...])
And what I want to do is to find the n closest indices where y == 0 but also where z == 0. In the example above, the 2 closest points for this would be at index 0 and 1, because x[2] does not meet the requirement z == 0.

Do you mean the closest points to y == 0? Because in your example there are no points with y == 0...

SurgicalOntologist
Jun 17, 2004

Boris Galerkin posted:

vvv Yeah that's what I meant. vvv

Do you want the closest indices to y == 0 where z == 0? Or do you want the closest indices to (y, z) == (0, 0)?
