Hollow Talk
Feb 2, 2014
Not having read the problem description or the replies (much), I'm going to :justpost: and say that SQLAlchemy is the bee's knees. Basically, use SQLAlchemy Core, ignore that the ORM exists, and mix and match textual SQL, reflection, transactions etc. to your heart's delight. This probably won't solve your problems (especially not if you actually have 200k columns and not rows -- for comparison, the maximum number of columns for a PostgreSQL table is ~1600), but it's nice software.
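
A minimal sketch of that mix-and-match (the connection URL, table and column names are all invented):

Python code:
from sqlalchemy import MetaData, Table, create_engine, select, text

engine = create_engine('postgresql://localhost/calls')  # hypothetical URL

# Reflection: let SQLAlchemy read the table definition from the database
meta = MetaData()
calls = Table('calls', meta, autoload=True, autoload_with=engine)

# Transactions, expression-language queries and textual SQL side by side
with engine.begin() as conn:
    high_priority = conn.execute(
        select([calls]).where(calls.c.priority == 1)
    ).fetchall()
    per_region = conn.execute(
        text("SELECT region, count(*) FROM calls GROUP BY region")
    ).fetchall()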

This is also a great snype...


Spaced God
Feb 8, 2014

All torment, trouble, wonder and amazement
Inhabits here: some heavenly power guide us
Out of this fearful country!



DoctorTristan posted:

I can only give vague advice on this kind of vague info, but it sounds like you are trying to do something quite difficult and have a moderate-to-severe case of running before you can walk.

A few things stand out:

When you say that your data is 60*200000, did you get that the wrong way round or do you really have 60 rows and 200,000 columns?

What does each row in your data frame represent? Is it an event where someone is called out, with date/time/location?

Can you do more basic queries, such as ‘how many call outs were there on June 19 2019?’ Or ‘how many call outs were there in total on region X during 2018?’ Can you do visualisations of these?

It sounds like your extraction, data cleaning and analysis steps are getting quite mixed up with each other - this is invariably a bad idea.

Yeah this is definitely being popped out of the womb and being told to fly a spaceship. A lot of my time has been spent on Stack Exchange or deep in the confines of the pandas reference pages.

To answer your question, 6AM me got it backwards, it's 200,000 long with 60 columns. Each row represents one call, and columns provide lat/long of the address they were going to, time the phone was picked up, time the unit was sent, time the unit got there, time the unit left, if the unit was canceled, in what priority order they arrived, and a bunch of other things. The dataframe represents the entirety of one year, as they want an analysis done on a yearly basis.

The output ideally should be "during this year, here are the locations that had the most of these saturated incidents" and would most likely be tabular because my main gig is geospatial analysis so I can super easily just plunk that into ArcGIS and run with it.

Thankfully, my internship folks prefaced giving me this project with "we definitely can't do it, and you don't have access to the VPN during quarantine, so maybe you might be able to try?" and they're optimistic about a final product but understand if I don't get too much done, what with not really knowing Python beyond the basics.

KICK BAMA KICK
Mar 2, 2009

Anybody using rq -- I've got a few jobs that can get a few types of errors from the API they use, and each might need to be handled differently (some jobs should just be requeued if the error looks like a transient "servers down" thing; another is scheduled to repeat anyway and can just wait till next time; and then there are errors, like bad credentials, that should put everything on hold until I check the situation out).

I was going to attach a dict with instructions about how to handle each case to the metadata you can define on the job decorator, and evaluate that in the function you can designate as the job failure handler; just checking if there's a better strategy I'm overlooking cause this seems like a common situation somebody here might already have solved.

Macichne Leainig
Jul 26, 2012

by VG

CarForumPoster posted:

You'll likely enjoy this part of the fast.ai deep learning course as they cover image localization/segmentation:

https://www.youtube.com/watch?v=MpZxV6DVsmM

Bonus that you can compare it to segmentation_models for output, ease of use, etc. It's PyTorch-based rather than TF and is also pretty easy to use.

Well the sound chip in my laptop took a poo poo, I guess, so I had to switch to the aux out on my DisplayPort monitor.

Once I got that figured out yeah, it is a pretty good video and project. My boss has a bit of a hardon for TensorFlow unfortunately and I’m trying to convince him just to use PyTorch even for experimental purposes.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

KICK BAMA KICK posted:

Anybody using rq -- I've got a few jobs that can get a few types of errors from the API they use, and each might need to be handled differently (some jobs should just be requeued if the error looks like a transient "servers down" thing; another is scheduled to repeat anyway and can just wait till next time; and then there are errors, like bad credentials, that should put everything on hold until I check the situation out).

I was going to attach a dict with instructions about how to handle each case to the metadata you can define on the job decorator, and evaluate that in the function you can designate as the job failure handler; just checking if there's a better strategy I'm overlooking cause this seems like a common situation somebody here might already have solved.

https://python-rq.org/docs/exceptions/

Depending exactly on what your jobs look like, I'd consider using exception handlers for your workers.

KICK BAMA KICK
Mar 2, 2009

Thermopyle posted:

https://python-rq.org/docs/exceptions/

Depending exactly on what your jobs look like, I'd consider using exception handlers for your workers.
Yeah, I mean that exception handler would use the metadata on the job to determine whether it should be requeued or just fail or what, like:
Python code:
def the_exception_handler_registered_with_rq(job, exc_type, exc_value, traceback):
    if isinstance(exc_value, SomeAPIsException):
        error_kind: EnumClassifyingThoseErrorsMoreUsefullyThanTheAPIDoes = classify_exception(exc_value)
        instructions = job.meta['on_error'].get(error_kind)
        if instructions == 'retry':
            # Maybe count retries up to a maximum value, or sleep for a specified interval before requeueing
            job.requeue()
            return False  # In rq, returning False stops the handler chain, so the job isn't also failed
        elif instructions == 'ignore':
            return True  # Fall through to rq's default handler, which puts it on the FailedJobRegistry
        elif instructions == 'stop_everything':
            shut_down_all_those_jobs()
        else:
            return True  # Pass to the next handler on the list, if there is one (True means fall through)
And then on each relevant job I would have like:
Python code:
error_instructions = {
    'on_error':
        {
            SomeErrorKind: 'retry',
            AnotherErrorKind: 'ignore'
        }
}

@django_rq.job(meta=error_instructions)
def some_job():
    ...
and the instructions might vary by job.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

That seems fine.

Another option is using different queues for the different types of jobs.

Then you start a worker for each type of queue and the worker has its own exception handling routine for the types of jobs it runs. This way you're not doing a bunch of branching and isinstance-ing in your handlers.
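
Roughly like this (queue name and handler body are placeholders):

Python code:
from redis import Redis
from rq import Queue, Worker

redis = Redis()
api_queue = Queue('api_jobs', connection=redis)

def api_error_handler(job, exc_type, exc_value, traceback):
    ...  # retry/ignore logic that only has to know about API jobs

# This worker only consumes api_jobs, with its own handler chain
worker = Worker([api_queue], connection=redis,
                exception_handlers=[api_error_handler])
worker.work()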

Sorry if I'm not clear or not understanding your question very well. I'm in a funk today!

KICK BAMA KICK
Mar 2, 2009

Thermopyle posted:

Sorry if I'm not clear or not understanding your question very well. I'm in a funk today!
No worries, you're on point, splitting the queues for this kind of job and the other kinds was on my to-do list for a few reasons, and that's another good one. Thanks!

hhhmmm
Jan 1, 2006
...?

Spaced God posted:

Can someone help point me in the direction of some good resources on using SQL in Python, or alternatively give some advice?

My internship's big project for me involves a ~60x200,000 Excel spreadsheet that requires very specific, weird analysis that no one has really wrapped their head around how to do, specifically involving simultaneous spatial and temporal analysis between different entries, so they tossed it to me. Originally I tried loving with it in Pandas but I'm running out of ways to break it over my knee in a way that gets what I want out of it, so I thought about turning to SQL or some other relational DB. My background is in GIS and I've more or less been learning Python and pandas on the fly.

I would fire up a docker image with the database of my choice, then connect to that using some python client.

I've had really bad experiences with sqlalchemy and prefer db-native clients, but maybe that's just me.
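
e.g. something like this (image tag, password and database are all placeholders):

Python code:
# Assuming Postgres was started with something like:
#   docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=secret postgres:12
import psycopg2

conn = psycopg2.connect(host='localhost', port=5432, user='postgres',
                        password='secret', dbname='postgres')
with conn.cursor() as cur:
    cur.execute('SELECT version()')
    print(cur.fetchone())
conn.close()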

Hollow Talk
Feb 2, 2014

hhhmmm posted:

I would fire up a docker image with the database of my choice, then connect to that using some python client.

I've had really bad experiences with sqlalchemy and prefer db-native clients, but maybe that's just me.

What were the experiences? I'm generally (and genuinely) curious about this.

SQLAlchemy Core (go die in a ditch, ORM) can be really unobtrusive, and it can provide raw connections to the underlying DBAPI things (cursors etc.) if needed, e.g. for COPY etc.
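
e.g. a minimal sketch of dropping to the raw psycopg2 connection for COPY (URL, table and columns invented):

Python code:
import io

from sqlalchemy import create_engine

engine = create_engine('postgresql://localhost/mydb')  # hypothetical URL

buf = io.StringIO('1,foo\n2,bar\n')
raw = engine.raw_connection()  # the underlying DBAPI (psycopg2) connection
try:
    with raw.cursor() as cur:
        cur.copy_expert('COPY my_table (id, name) FROM STDIN WITH CSV', buf)
    raw.commit()
finally:
    raw.close()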

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

ORMs are good though.

Hollow Talk
Feb 2, 2014

Thermopyle posted:

ORMs are good though.

For OLTP-type work, sure.

NinpoEspiritoSanto
Oct 22, 2013




ORMs are poo poo and sqlalchemy core is fuckin amazing.

Two hills I've got headstones on.

The Fool
Oct 16, 2003


Last two projects I’ve been able to get away with using pandas as an ORM and it’s been great.

Going to be super annoyed the next time I need to do something more complicated.
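
By "pandas as an ORM" I mean roughly this kind of round trip (connection string and table names invented):

Python code:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///app.db')  # hypothetical

# Read a table into a dataframe, munge it, write the result back
df = pd.read_sql('SELECT * FROM users WHERE active = 1', engine)
df['last_seen'] = pd.Timestamp.utcnow()
df.to_sql('users_snapshot', engine, if_exists='replace', index=False)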

Macichne Leainig
Jul 26, 2012

by VG
The last ORM I used was admittedly EF for .NET and all I remember is it being hell trying to get the EntityState set correctly for items that were modified. Though I think it was largely self-inflicted as we had some kind of object state of our own to manage as well.

I don't see why they wouldn't be fine for smaller projects if you just have a few tables you need to grab data out of, but if I was starting a brand-new, large-scale app I'd probably go a different route. I'd never heard of SQLAlchemy (Core or ORM) before, but both look pretty dang solid.

As I transition into more Python based projects I should probably familiarize myself with things like that, using pymysql with cursors for everything is going to bite me in the rear end eventually.

Macichne Leainig fucked around with this message at 22:36 on Apr 8, 2020

Hollow Talk
Feb 2, 2014
SQLAlchemy Core + psycopg2 all day, everyday!

(Works for Postgres and Redshift :getin:)

Rocko Bonaparte
Mar 12, 2002

Every day is Friday!
Is there a way with type checking to enforce the signature on overridden implementations in subclasses?

Wallet
Jun 19, 2006

Protocol7 posted:

As I transition into more Python based projects I should probably familiarize myself with things like that, using pymysql with cursors for everything is going to bite me in the rear end eventually.

Been working on a Flask (+Flask-RESTful) project using Postgres with SQLAlchemy and Marshmallow recently and it's been pretty painless.

Macichne Leainig
Jul 26, 2012

by VG

Wallet posted:

Been working on a Flask (+Flask-RESTful) project using Postgres with SQLAlchemy and Marshmallow recently and it's been pretty painless.

I have the same Flask setup but with PyMySQL. It’s only like three endpoints so it’s nothing for now, but I know it’s going to get larger.

I’ll definitely check that out, thanks.

punished milkman
Dec 5, 2018

would have won
suddenly got the urge to make some lovely algorithmic music. what is the absolute easiest Python API out there for turning code into music? Presumably it would be some sort of MIDI controller thing, but I have no idea what to expect

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Every time I've pinned down the exact reason someone thinks ORMs are poo poo (which usually takes days of back and forth) it is because they're using the wrong tool for the job or just an ORM that is bad.

mr_package
Jun 13, 2000
Hmmmm I just use pyodbc and make SQL queries directly, what am I missing out on?

mr_package
Jun 13, 2000

punished milkman posted:

suddenly got the urge to make some lovely algorithmic music. what is the absolute easiest python API out there for turning code into music? Presumably it would be some sort of MIDI controller thing, but I have no idea what to expect

SuperCollider? Never tried it but heard good things. There was a recent Hacker News thread; I did a quick Google and it seemed to have good Python bindings/modules/whatever.
edit: thread was about CSound, but SuperCollider was mentioned in comments https://news.ycombinator.com/item?id=22787566

Another possibility that I think is underutilized is that UVI Falcon supports Lua (well, technically a subset of Lua they call 'uviscript'). I have to wonder what interesting things could be coded up there. It's designed for programming instruments more than algorithmic music but still, gotta be some interesting potential there. Ditto KSP for Kontakt, I suppose. Perhaps I'm wrong though; if it was worth doing, people would probably be doing it.

mr_package fucked around with this message at 06:09 on Apr 9, 2020

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?

Spaced God posted:

Yeah this is definitely being popped out of the womb and being told to fly a spaceship. A lot of my time has been spent on Stack Exchange or deep in the confines of the pandas reference pages.

To answer your question, 6AM me got it backwards, it's 200,000 long with 60 columns. Each row represents one call, and columns provide lat/long of the address they were going to, time the phone was picked up, time the unit was sent, time the unit got there, time the unit left, if the unit was canceled, in what priority order they arrived, and a bunch of other things. The dataframe represents the entirety of one year, as they want an analysis done on a yearly basis.

The output ideally should be "during this year, here are the locations that had the most of these saturated incidents" and would most likely be tabular because my main gig is geospatial analysis so I can super easily just plunk that into ArcGIS and run with it.

Thankfully, my internship folks prefaced giving me this project with "we definitely can't do it, and you don't have access to the VPN during quarantine, so maybe you might be able to try?" and they're optimistic about a final product but understand if I don't get too much done, what with not really knowing Python beyond the basics.

It sounds like if you can identify (from lat/long co-ordinates) when a call was out of area, then you get a decent chunk of the way there? You could add an extra column ‘is_out_of_area’ to the dataframe, then reuse that in later queries?

One additional bit of advice I’d give is: don’t try to do too much in a single step, and don’t spend too long scouring the pandas docs hoping to find the one function that does exactly what you want. There’s no shame in just iterating over the dataframe row by row to do calculations, particularly if you’re new to both pandas and Python.
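
A rough sketch of that flag-column idea (the file name, column names and bounding box are all invented; a real service-area check might be a polygon test in your GIS tooling):

Python code:
import pandas as pd

df = pd.read_excel('calls_2019.xlsx', parse_dates=['call_time'])

# Hypothetical rectangular service area
LAT_MIN, LAT_MAX = 33.0, 34.0
LON_MIN, LON_MAX = -87.0, -86.0

df['is_out_of_area'] = ~(df['lat'].between(LAT_MIN, LAT_MAX)
                         & df['lon'].between(LON_MIN, LON_MAX))

# The flag can then be reused in later queries, e.g. out-of-area calls per month
monthly = df[df['is_out_of_area']].groupby(df['call_time'].dt.month).size()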

NinpoEspiritoSanto
Oct 22, 2013




Thermopyle posted:

Every time I've pinned down the exact reason someone thinks ORMs are poo poo (which usually takes days of back and forth) it is because they're using the wrong tool for the job or just an ORM that is bad.

The relational model is not the object model, regardless of how well you manage to fit the bits that look similar together with an ORM. I'm not arguing they are useless per se but the usefulness is limited particularly if you're working with a relational design that didn't grow in lockstep with whatever code you're using an ORM with now.

Like using Google Translate to have a conversation in French: it will get you so far, but eventually you need to learn to speak some French. The best middle ground I've found is SQLAlchemy Core, as you still get useful tools without having to write queries and data structures by hand all the time (a problem ORMs tackled).

SurgicalOntologist
Jun 17, 2004

Rocko Bonaparte posted:

Is there a way with type checking to enforce the signature on overridden implementations in subclasses?

mypy checks this: https://mypy.readthedocs.io/en/stable/class_basics.html#overriding-statically-typed-methods
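
e.g. mypy will flag an incompatible override in a made-up example like this:

Python code:
class Base:
    def handle(self, n: int) -> str:
        return str(n)


class Derived(Base):
    # error: Argument 1 of "handle" is incompatible with supertype "Base"
    def handle(self, n: str) -> str:
        return n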

hhhmmm
Jan 1, 2006
...?

Hollow Talk posted:

What were the experiences? I'm generally (and genuinely) curious about this.

SQLAlchemy Core (go die in a ditch, ORM) can be really unobtrusive, and it can provide raw connections to the underlying DBAPI things (cursors etc.) if needed, e.g. for COPY etc.

For postgres, very slow compared to using psycopg2 directly.

Hollow Talk
Feb 2, 2014

hhhmmm posted:

For postgres, very slow compared to using psycopg2 directly.

If you haven't seen this already, try to specify additional options when you create your engine for "executemany": https://docs.sqlalchemy.org/en/13/dialects/postgresql.html#psycopg2-connect-arguments

e.g.:
code:
engine = create_engine(config.db_url, executemany_mode='values', executemany_values_page_size=10000)
That should use psycopg2's optimisations for inserting/executing multiple rows at once (because, by default, it inserts each row separately, which is indeed rubbish). It's not COPY, but it's something. :D

Rocko Bonaparte
Mar 12, 2002

Every day is Friday!

Ahh I thought I automatically had all this in PyCharm but it turns out I don't.

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

You shouldn't use ORMs when the friction caused by the relational-OOP impedance mismatch is too high. This doesn't mean ORMs are bad; it means you, someone above you, or someone before you made a bad decision.

Note that "too high" doesn't just mean "I had to think more about how to do this thing".

You "just" need to choose the right tool for the job. Sometimes you may not know what the right tool will be so you might use sqlalchemy core. Sometimes you understand the problem space well enough that you can decide up front that some ORM will or will not meet your needs.

In the general case, if you just default to Method X because you think it's a good middle ground, then you're never using the best tool for the job. It's like a carpenter just taking a hammer to work because it can drive nails, pry boards, beat a hole in a wall, and scoop their potato salad for lunch.

Of course, some people work on a narrow range of project types and they might never, or very rarely, benefit from an ORM or from not-an-ORM and it's just easier for them to default to whatever.

IME, much of the time people say ORMs are bad they've just internalized how ORMs apply to the type of work they were doing when they formed their ORM opinion, or they're parroting what their DB instructor told them in school, or something like that. Some people then take this "lesson" they think they've learned about ORMs and build or select projects in a way that benefits from whatever they think the lesson taught them, and this just reinforces what they think they know. Other people just work in a field that benefits from an ORM and then don't see how people could not be using an ORM...or the opposite.

People are very often not using an ORM when they should, or are using an ORM when they shouldn't, and everyone has Very Strong Opinions about the subject. This is a shame. Sometimes when you're using the not-ideal tool and have strong opinions about it, it prevents you from learning about, thinking about, and making the best of that tool.

FWIW, with no scientific basis, no survey of my past projects, going completely by my gut impression of my career: 40% of projects I've been significantly involved in have been best served by an ORM.

Rocko Bonaparte posted:

Ahh I thought I automatically had all this in PyCharm but it turns out I don't.

I usually run mypy in addition to using PyCharm's type checker. Neither mypy nor PyCharm seems to be a subset of the other.

Thermopyle fucked around with this message at 17:49 on Apr 9, 2020

Macichne Leainig
Jul 26, 2012

by VG
I decided to switch my project over from pymysql to SQLAlchemy ORM + marshmallow and holy poo poo that's a dream.

The DB schema is just a few objects without much relation so I felt like an ORM would be appropriate, and I love it so far.

Hollow Talk
Feb 2, 2014
edit: nvm

Hollow Talk fucked around with this message at 20:40 on Apr 9, 2020

Jose Cuervo
Aug 25, 2004
I am trying to parallelize some code I have. I have a dictionary with 5500 entries (but this number will grow to about 20000 entries later on), where each entry is a pandas series that is 288 rows in length. I have to run a function Y on each combination of entries in the dictionary, where Y computes the euclidean distance between the two series and is a fairly computationally inexpensive task. However, even with 5500 entries this comes out to approximately 15 million calls to function Y and I would like to parallelize it. I am looking into using joblib to make use of all 6 cores on my computer and I am wondering if

1. The best way to do this is to split the list of combinations into 6 approximately equal sublists, and then use joblib to run the computation for each list on a single core, and
2. Is it possible to have a single copy of the dictionary that can be accessed by each process, rather than having six copies of this large dictionary?

QuarkJets
Sep 8, 2008

Don't try to optimize the scheduling yourself, e.g. by creating N sublists. Libraries like multiprocessing and joblib can handle that for you; you provide one iterable of data that needs processing, and let the library deal with giving the data to the workers as they become available.

I'm not familiar with joblib, but it looks like it's built on top of multiprocessing, and multiprocessing has a Manager class that can share things like dicts between processes.
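
The shape of it would be something like this (series_by_key and pair_distance are made-up names; joblib batches the iterable across workers itself):

Python code:
from itertools import combinations

import numpy as np
from joblib import Parallel, delayed


def pair_distance(a, b):
    return np.linalg.norm(a - b)


# series_by_key: your dict of key -> pandas Series
pairs = combinations(series_by_key.items(), 2)
distances = Parallel(n_jobs=6)(
    delayed(pair_distance)(a.to_numpy(), b.to_numpy())
    for (key_a, a), (key_b, b) in pairs
)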

susan b buffering
Nov 14, 2016

Splitting up the work that way should be fairly ideal in terms of speed-up.

Number 2 might be possible but I wouldn’t count on it being simple if it is. You would have a much better time if those series were stored in an array like a pandas dataframe or just a plain numpy 2d array, even if just for the computation you’re doing.

I’d also recommend looking into using Numba for what you’re doing instead of joblib. It’s a JIT compiler that also handles threading and works particularly well with numpy arrays.
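
For the array version, scipy will even do all the pairwise distances in one vectorized call (series_by_key is a made-up name for your dict):

Python code:
import numpy as np
from scipy.spatial.distance import pdist, squareform

keys = list(series_by_key)
X = np.vstack([series_by_key[k].to_numpy() for k in keys])  # shape (n, 288)

condensed = pdist(X, metric='euclidean')  # all ~15M pairwise distances
dist_matrix = squareform(condensed)       # symmetric (n, n) matrix

(At 20,000 entries the condensed result is around 200 million floats, so keep an eye on memory.)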

Wallet
Jun 19, 2006

Unimportant question time:

Now and then I have a file with a lot of things that aren't broken into clearly defined classes, functions, etc—stuff like config values or dictionaries of error definitions. I try to keep those values organized in a sane way and I usually put little signpost comments in there:
Python code:
errors_dict = {
    # Global errors
    'UNKNOWN_ERROR': {
        'message': 'Pretend this is a real error.',
        'status': 500
    },
    'BACK_INCOMPATIBLE': {
        'message': 'This one also.',
        'status': 400
    },

    # Auth errors
    'INVALID_SESSION': {
        'message': 'Nothing to see here.',
        'status': 403
    },
    'AS_FORBIDDEN': {
        'message': 'Or here.',
        'status': 403
    }
}
If one of those files gets big enough I always end up feeling like those signposts need emphasis because they become really easy to miss when scrolling through a file looking for something (ctrl+f only works if you know the name of the thing you are looking for).

I've seen people do this, but every time I see it I feel like I've stepped into a readme from 1995:
Python code:
    ####################################################################################################################
    # Auth errors
    ####################################################################################################################
I've also seen people do this, which is hacky and doesn't get highlighted as a comment because it isn't one:
Python code:
    """
    Auth errors
    """
Do I just add a bunch of spaces and call it a day?
Python code:
    # A U T H    E R R O R S
Am I the only human being who cares about this? Is there some pythonic method of handling this buried in a PEP article somewhere?

a foolish pianist
May 6, 2007

(bi)cyclic mutation

Split it into different dicts, I think, then merge them back into a single errors_dict:

code:

global_errors = {...}

auth_errors = {...}

errors_dict = {**global_errors, **auth_errors}

Dominoes
Sep 20, 2007

Python code:
from dataclasses import dataclass
from enum import Enum, auto


@dataclass
class ErrorData:
    message: str
    status: int


class GlobalError(Enum):
    UNKNOWN_ERROR = auto()
    BACK_INCOMPATIBLE = auto()

    def data(self) -> ErrorData:
        if self == GlobalError.UNKNOWN_ERROR:
            return ErrorData('Pretend this is a real error.', 500)
        elif self == GlobalError.BACK_INCOMPATIBLE:
            return ErrorData('This one also.', 400)


class AuthError(Enum):
    INVALID_SESSION = auto()
    AS_FORBIDDEN = auto()

    def data(self) -> ErrorData:
        ...


an_error = GlobalError.UNKNOWN_ERROR.data()
This is more robust, and doesn't repeat the key names: you avoid typos in the dict keys, wrong types if you set this up programmatically and pass the wrong variable, passing an error into the wrong function (and vice versa), and passing the wrong category of error.

Dominoes fucked around with this message at 20:33 on Apr 10, 2020

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

Dominoes posted:

This is a case where I wish Python would allow enums to hold data.

Python code:
@dataclass
class GlobalError:
    message: str
    status: int


class ErrorType(Enum): 
    UNKNOWN_ERROR = auto()
    BACK_INCOMPATIBLE = auto()
    INVALID_SESSION = auto()
    AS_FORBIDDEN = auto()
    # Sematically clear, but currently not supported i-wish example, where each variant wraps a `GlobalError`:
    # UNKNOWN_ERROR(GlobalError) = auto()

    # Instead, we'll use a method
    def data(self) -> GlobalError:
        if self == ErrorType.UNKNOWN_ERROR:
            return GlobalError('Pretend this is a real error.', 500)
        if self == ErrorType.BACK_INCOMPATIBLE:
            ...  # etc


an_error = ErrorType.UNKNOWN_ERROR.data()
This is more robust, and doesn't repeat the key names: you avoid typos in the dict keys, wrong types if you set this up programmatically and pass the wrong variable, and passing the error into the wrong function (and vice versa).

I mean they can with mixin classes
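
For instance, reusing the names from the quoted example, each member's tuple value gets unpacked into __init__:

Python code:
from enum import Enum


class ErrorType(Enum):
    UNKNOWN_ERROR = ('Pretend this is a real error.', 500)
    BACK_INCOMPATIBLE = ('This one also.', 400)

    def __init__(self, message: str, status: int):
        self.message = message
        self.status = status


assert ErrorType.UNKNOWN_ERROR.status == 500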


E: oh real Union types ... nope :(


Dominoes
Sep 20, 2007

Ninja-edited since I missed the point.

Dominoes fucked around with this message at 20:33 on Apr 10, 2020
