Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Are there any modern libraries in roughly the same vein as Pydantic, but without the harm that comes from smushing every data serialization/deserialization/validation/normalization concern together in the same kitchen sink? Plain dataclasses are bad. Marshmallow is bad. Mashumaro looks promising, but I haven't gone too deep yet.

Vulture Culture fucked around with this message at 22:26 on Oct 10, 2023


Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Falcon2001 posted:

It might help if you explained more about the problem you have with those solutions.
I'm mostly looking for inspo on how people are solving the problem of taking in data, based on some kind of protocol or standard, and shuttling it between the different intermediate domain representations in their programs/services. Some of the key problems that bug me day to day are:

  • What the standard says the data should look like and the normalized version of the data you want to work with often aren't the same thing
  • Schema validation and semantic validation are totally different things and should be handled at different layers
  • Serialization/deserialization shouldn't be tightly coupled to the transformations between those data representations
  • Going between those data representations should be consistent and rule-based instead of based on arbitrary, implicit, and often undocumented conversions
  • The above should still be DRY and easy as much as possible; it shouldn't require either re-declaring those rules for every single field or ad-hoc, reflection-driven handling of an entire model
  • The hardest part of managing data models is usually handling discriminated unions

A good example of where Pydantic falls down is on both implicit and explicit conversions. When you're interpreting JSON input with a defined representation, the standard will usually mandate things like a specific datetime format or a particular method of encoding binary data. Pydantic handles these kinds of use cases so badly that I'd rather not even try.
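To make that concrete, here's the kind of explicit, rule-based decoding I want kept separate from the model itself. This is a hypothetical sketch: the field names and the wire format are made up, and it's all stdlib.

Python code:
import base64
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Event:
    # Normalized domain representation: real types, no wire-format concerns
    occurred_at: datetime
    payload: bytes

def decode_event(raw: dict) -> Event:
    # Hypothetical wire format: the standard mandates a compact UTC timestamp
    # ("20231010T222600Z") and base64-encoded binary payloads
    occurred_at = datetime.strptime(
        raw["occurredAt"], "%Y%m%dT%H%M%SZ"
    ).replace(tzinfo=timezone.utc)
    return Event(occurred_at=occurred_at, payload=base64.b64decode(raw["payload"]))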

Vulture Culture fucked around with this message at 17:26 on Oct 12, 2023

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

His Divine Shadow posted:

Apparently the Python library oauth2 is actually an OAuth 1 library. Well, that was a frustrating hour wasted. Only found out due to a Stack Overflow post.

Just felt like writing something out.
OAuth2 and OIDC are surprisingly simple to implement, and JWT-handling libraries like python-jose make it a lot harder to do unsafe things with JWTs. I recently did async client and server implementations of the auth code and device authorization flows as a toy project. I found py_simple_openid_connect to be a useful reference for my own implementation, which took about a day.
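For example, the token-validation side with python-jose ends up being roughly this (a sketch: the issuer, audience, and wherever you fetch the JWKS from are placeholders for your own IdP's config):

Python code:
from jose import jwt, JWTError  # pip install python-jose

def validate_access_token(token: str, jwks: dict) -> dict:
    # jwks is the provider's published key set, fetched from its jwks_uri
    try:
        return jwt.decode(
            token,
            jwks,
            algorithms=["RS256"],
            audience="my-api",                 # placeholder
            issuer="https://idp.example.com",  # placeholder
        )
    except JWTError as exc:
        raise PermissionError(f"token rejected: {exc}") from exc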

pyoidc is a solid implementation if you're looking for a library rather than a full batteries-included framework for Django, FastAPI, etc.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

His Divine Shadow posted:

The real PITA is dealing with the backend stuff on Microsoft's crappy Azure environment. But I'm not feeling kindly disposed towards OAuth in general at the moment; feels like total overkill for my purposes.
If you need to run your own IdP for a webapp, and you're not federating data access with external apps over some kind of API, then OAuth and JWTs are probably overkill vs. session cookies and a user database. OAuth2 only really removes complexity if you get to avoid handling logins at all, either because you've outsourced everything to one external IdP or you have a library like authlib that makes it easy to integrate multiple external IdPs. If for some reason I had to keep my own user store, but also integrate with external IdPs, that's the point where I'd try to implement my local login flow as an OIDC provider, because it at least keeps everything along the same flow.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Qtamo posted:

Knew I forgot something :doh: The original db tables stay in our internal environment, but the transformed data is a part of a larger dataset that gets sent to another party, and we can't guarantee that the data is safe there (well, can't really guarantee it in our own environment either but you get my point). Of course there's contract stipulations etc., but we'd rather keep things as safe as reasonably possible, since this other party doesn't need the original identifiers (but needs to know which have identical identifiers).
If anonymization is a problem you actually have, then every developer and developer client you have is also a risk. You should probably consider Baffle or Immuta or something and just export your normal tokenized fields when sending data to third parties.
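If a dedicated product is overkill, even a keyed hash gets you deterministic pseudonyms: identical identifiers still map to identical tokens, but the original value can't be recovered without the key. Rough sketch (key management is the actual hard part, and it's hand-waved here):

Python code:
import hashlib
import hmac

SECRET_KEY = b"load-this-from-your-secrets-manager"  # placeholder, never hard-code

def tokenize(identifier: str) -> str:
    # Same input + same key -> same token, so "which rows share an identifier"
    # still works downstream without exposing the identifier itself
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()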

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

StumblyWumbly posted:

Is there any real difference or preference between threading.Lock and asyncio.Lock? At a low level, they must be doing the same thing, but I could imagine differences in what happens when you need to wait for the lock.
Async locks are for when you have multiple waiters in the same asyncio event loop. Trying to access any async resource (lock, task result, etc.) from another event loop will typically generate an exception.

StumblyWumbly posted:

I feel like I'm clear on the differences between threads and async.
Honestly, no. They couldn't be more different. asyncio is a model for running multiple tasks in a single thread, to ensure the thread always has something to do even when a task is waiting for a result. This keeps the runtime, not the operating system, in charge of when control of the event loop is yielded from one task to another. This is the reason locking primitives are so lightweight in the async model: it's literally impossible for two tasks in the same event loop to concurrently access the same memory, because they cannot ever run at the same time.
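A minimal illustration of the same-event-loop constraint (sketch, nothing here beyond the stdlib):

Python code:
import asyncio

async def worker(name: str, lock: asyncio.Lock) -> None:
    # Both workers run in the same thread; the lock only serializes the
    # critical section across await points, it never blocks the thread
    async with lock:
        print(f"{name} holds the lock")
        await asyncio.sleep(0.1)

async def main() -> None:
    lock = asyncio.Lock()  # created inside the running event loop
    await asyncio.gather(worker("a", lock), worker("b", lock))

asyncio.run(main())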

Vulture Culture fucked around with this message at 17:34 on Dec 4, 2023

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

StumblyWumbly posted:

Thanks for the help, that clarifies things. I think I expected async to be more just because it's new.

Is it true that if I have a thread lock protecting a resource that ends up getting used in async, things are protected? Or is there a potential issue because an async lock would automatically yield or await, but the thread lock will not? This question is purely academic.
You need to be really careful addressing synchronization in programs that are simultaneously using asyncio and worker threads. Trying to acquire a locked lock will block the thread, right? And as a general rule, you don't want to put blocking operations into the event loop. If you spin up worker threads for every synchronous task that could possibly hit a blocking operation, including working with thread-locked resources, you're fine.

So consider what Lock and RLock each do. Lock will block the thread until another thread releases the lock, but all the other tasks on your event loop are running in the same thread, so now as soon as one of them tries to wait on the lock, no other task will ever run again and you're deadlocked. RLock is re-entrant, meaning the same thread can call it multiple times and safely re-acquire the lock, which puts you in the opposite situation: attempts to acquire the lock from other coroutines in the loop, which all run in the same thread as the one you locked from, will always succeed and your lock will never, ever wait.
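The usual escape hatch, if a coroutine really does need a resource guarded by a threading.Lock, is to push the blocking work into a worker thread rather than acquiring the lock in the event loop. A sketch, assuming Python 3.9+ for asyncio.to_thread:

Python code:
import asyncio
import threading

resource_lock = threading.Lock()

def use_shared_resource() -> str:
    # Ordinary blocking code: fine in a worker thread, never in the event loop
    with resource_lock:
        return "did some work while holding the lock"

async def handler() -> None:
    # to_thread runs the blocking call in the default thread pool, so the
    # event loop keeps servicing other tasks while we wait for the lock
    result = await asyncio.to_thread(use_shared_resource)
    print(result)

asyncio.run(handler())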

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Jose Cuervo posted:

SQLAlchemy question here. I have two models:

...

I now want to write a query which returns the most recent visit for each subject in the database.
This is the query I have so far, but it does not work (results in a DetachedInstanceError error message):
Python code:
    q = select(Visit) \
        .join(Subject.visits) \
        .order_by(Visit.date.desc()) \
        .limit(1)
Any thoughts on what I am doing wrong?
DetachedInstanceError usually results from trying to query without a session, or using a session that's already been closed. What's your full error text, and what's the code that actually executes this statement?
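For reference, this is roughly how I'd expect the statement to be executed so the returned objects stay attached while you're using them (a sketch assuming 2.0-style Session usage and the model names from your post; the connection URL is a placeholder):

Python code:
from sqlalchemy import create_engine, select
from sqlalchemy.orm import Session

engine = create_engine("sqlite:///example.db")  # placeholder URL

stmt = (
    select(Visit)
    .join(Subject.visits)
    .order_by(Visit.date.desc())
    .limit(1)
)

with Session(engine) as session:
    latest_visit = session.scalars(stmt).first()
    # Touch any lazy-loaded attributes here, inside the session;
    # doing it after the session closes is what raises DetachedInstanceError
    print(latest_visit.date)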

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

The March Hare posted:

Did Poetry ever fix their resolver being insanely slow? I used it on a project that didn't get super big and it was a really significant portion of our build time. I was dying looking around for alternatives when I left. We had some other really niche edge-case issues with how Poetry was handling editable installs and maybe a couple of other things that were bothering me.

I felt bad since I was the one who initially made the switch over to Poetry because I had some (positive up to that point) experience with it, but it really broke down in actual business context for me. Bummed me out.

Was looking into PDM when I left and haven't really messed around in that world since then.
The entire mechanism of Python package management would need to be rewritten from scratch to make any backtracking dependency solver performant, because dependency resolution is dynamic, system/installation-dependent, and requires downloading each package just to generate its dependency list. You can use good, robust, slow solvers or naive, sometimes nondeterministic, fast solvers. This is the burden we bear.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

jaete posted:

Are there even any systems which would have fully deterministic dependency resolution for packages? I mean, how exactly would that work? When someone says "foo install some-package==1.23", the system would... download the static set of that package's dependencies... which can never change? So every package's maintainer would have to first specify the exact version of every dependency? I'm not sure what we're talking about here
If you provide the same set of package dependencies, and you have the same set of packages available, you get the same resulting set of installed packages.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

QuarkJets posted:

The problem is that package dependencies aren't always specific, usually because they don't have to be. If I have a package that requires "pandas" without specifying a version number then dependency resolution will give me whatever the latest version of pandas is, which changes over time. This is deterministic but the answer does change over time.

If you use a requirements.txt produced by `pip freeze` then you'll get the same set of installed packages each time, but pip is no longer really performing dependency resolution at that point
I get that. The issue is that the old pip resolver, the one prior to the backtracker, did not do this. On upgrades (and in rare cases, installs), it would come up with completely different depth-first resolution orders for the same underlying dependency specification. This resulted in divergent version preferences, due to the widespread reliance on data structures with unstable ordering (sometimes in packages' setup.py themselves). But it did it really fast! And this totally broken implementation is what most folks are comparing the modern resolvers (Poetry, pipenv, conda, recent pip) to.

Vulture Culture fucked around with this message at 20:49 on Jan 31, 2024

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

QuarkJets posted:

imo the actual solution is outright bad code, these things need to go or be changed:
- Argument name "str"
- String concatenation in a for loop
- Bullshit function name
- No docstring
Other than the input variable shadowing a builtin, which has the potential to create gnarly bugs later on, these aren't the kinds of things new devs, or arguably any devs, really need to spend cognitive energy on. (IMO, all internal utility function names are bullshit no matter how much you pretend they aren't. Beyond a small handful of developers, none of them will ever see actual reuse.)

Beyond the function name, these are all things that would get caught by a linter like flake8 or pylint, and it's a good idea to get in the habit of running code through these sorts of tools to see where they deviate from accepted style or hit things that, at scale, would become obvious performance torpedoes.
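The builtin-shadowing one is worth spelling out, because it bites silently. A hypothetical function of the same shape as the one being discussed:

Python code:
def stringify_items(items, str):  # the parameter shadows the builtin str
    output = ""
    for item in items:
        # Looks innocuous, but str here is whatever the caller passed in,
        # not the builtin -- this raises TypeError instead of converting item
        output += str(item)
    return output

# stringify_items([1, 2], ", ")  -> TypeError: 'str' object is not callable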

Vulture Culture fucked around with this message at 13:43 on Feb 16, 2024

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

QuarkJets posted:

Doesn't saying "flake8 will see some of these and then spit in your face" dispute the notion that new developers don't need to think about them?
Someone's job with new devs, whether an engineering manager or a computer science teacher or an effective mentor, is to get their brains into something resembling a flow state, not to avoid triggering senior engineers' pet peeves. Clean it up later (code review, fast follow, maybe never) if it matters. There's a reason these contrived problems have so many sub-optimal solutions being published, and it's not that everyone involved is dumb.

QuarkJets posted:

The missing docstring and bad function name are probably the two most important items on that list, new developers should definitely be thinking about these things when writing a new function. Every time.
There's value in building high-quality libraries, but I also seriously question the time management of anyone who overinvests in these things for irrelevant little internal functions that only exist to reduce LOC in some actually-important function or another. If you have a real problem with comments, to the extent that you need to measure the percentage of functions with docstrings and ensure the number only ever goes up, sure, do what you've got to do to make that happen. Pretending everyone has this problem doesn't make a piss function missing a docstring "outright bad code".

Vulture Culture fucked around with this message at 13:40 on Feb 20, 2024

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Falcon2001 posted:

I agree. IMO trying to do this in Pandas is the wrong approach; part of your ETL pipeline is sanitizing and standardizing your data before loading it, so you should just handle that in code first before feeding it over to pandas. Especially if this is an ad-hoc spreadsheet. Here's an extremely naive example:

Python code:
from csv import DictReader

with open('file_path', 'r') as csvfile:
    reader = DictReader(csvfile)
    data = [row for row in reader]

out_data = []
current_category = ""
for row in data:
    if row['category'] and not (row['names'] or row['dates']):
        current_category = row['category']
    else:
        row['category'] = current_category
        out_data.append(row)
# Assumptions made:
# the category identifier row always shows up before the rows in that category
I agree that it should usually be a separate pipeline step, but you should feel fine doing whatever is idiomatically familiar to the people who need to maintain it, as long as it's reasonably performant.

The nice thing about your data cleaning approach is that you can easily farm out bugs in the data cleaning to people who are basically complete strangers to data engineering, but if this isn't a problem you have, there's nothing wrong with vectorizing if it makes sense. The Pandas approach will probably, ironically, make less sense to people who are very familiar with Python, but it will feel extremely natural to people looking at Python data engineering from an R background.
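For comparison, a rough pandas equivalent of the same cleanup. This is a sketch that borrows the column names and the "category header row comes first" assumption from the example above:

Python code:
import pandas as pd

df = pd.read_csv("file_path")

# Header rows carry only a category label and no names/dates
is_header = df["category"].notna() & df["names"].isna() & df["dates"].isna()

# Propagate each header's category down to the data rows beneath it,
# then drop the header rows themselves
df["category"] = df["category"].where(is_header).ffill()
df = df[~is_header]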

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Oysters Autobio posted:

Notebooks entirely live in a different use of coding.

There's using code to make a tool, package or something reusable. It's creating a system, package etc

Then there's using code to analyze data or explore / test or share lessons and learning. It's to create a product. Something to be consumed once by any number of people.

But after recently working in a data platform designed by DS' who were inspired by Netflix circa 2015, I think it's hell. Every ETL job is mashed into a notebook. Parameterized notebooks that generate other notebooks. Absolutely zero tests.

If I'm writing an analytical report it's fantastic because I can weave in visualizations and text and images.

Or often if I'm testing and building a number of functions, it's nice to call in the REPL on demand and debug things that way. But once that's finished, it goes into a Python file. VS Code has a great extension that can convert .ipynb to a .py file.

But for straight code it's a mess and frankly I find it slows you down. With a plain .py file I just navigate up and down, comment as I need, etc.

Finally once you've ever tried a feature IDE like vscode, you'll never want to go back to jupyterlab. The tab completion is way snappier, you've got great variable viewers and can leverage a big ecosystem of extensions

I'm a complete amateur at python and am only a data analyst, but I'm so glad I moved to VS Code.
Nothing wrong with using both. Set up %autoreload 2 as the first step in your notebook, then modularize any part of your analysis whose correctness you actually care about as you go.

I usually end up with the data-transformation parts of my code living as Python modules in VSCode, and the interactive visualization parts in Jupyter, where the ecosystem is just a lot better.
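Concretely, the notebook side of that setup looks something like this (the module and function names are placeholders for whatever you've factored out):

Python code:
# First cell of the notebook
%load_ext autoreload
%autoreload 2

# Anything whose correctness matters lives in a real module on disk and is
# re-imported automatically whenever the file changes
from analysis.transforms import load_visits, normalize_dates  # hypothetical module

df = normalize_dates(load_visits("visits.csv"))
df.plot(x="date", y="count")  # the interactive/visual bits stay in the notebook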


Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Oysters Autobio posted:

Echoing this and expanding it even further: don't touch anything related to XML as a beginner.

While I'm absolutely sure it has some great features about it, I'm finding it's very much a "big boy" format and not beginner friendly. Maybe it's not the format necessarily, but the ecosystem of tools in Python for it is really dense.

Hell, just look at the name "lxml" as a package. Gonna throw out a dumb hot take that I literally put no thought into: Acronyms should be banned from package naming.
Yeah, the big issue is that XML was built as a markup language, not a language for representing data structures or configuration, which are the two things developers between 1995 and 2005 really liked to pretend it was good at.

If you can make guarantees about the documents you're loading, like "text will never contain other elements," then it gets a lot easier to work with and enables much more straightforward, Pydantic-style APIs.
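For example, when every element is either pure structure or pure text, even stdlib ElementTree maps cleanly onto a dict and you never have to think about mixed content (toy document):

Python code:
import xml.etree.ElementTree as ET

doc = """
<user>
  <name>Ada</name>
  <email>ada@example.com</email>
</user>
"""

root = ET.fromstring(doc)
# Safe only because we can guarantee leaf elements contain text and nothing else
user = {child.tag: child.text for child in root}
print(user)  # {'name': 'Ada', 'email': 'ada@example.com'}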
