LightRailTycoon
Mar 24, 2017
Deptry will tell you which files import a module

The March Hare
Oct 15, 2006

Je rêve d'un
Wayne's World 3
Buglord
Did Poetry ever fix their resolver being insanely slow? I used it on a project that didn't get super big and it was a really significant portion of our build time. I was dying looking around for alternatives when I left. We had some other really niche edge-case issues with how Poetry was handling editable installs and maybe a couple of other things that were bothering me.

I felt bad since I was the one who initially made the switch over to Poetry because I had some (positive up to that point) experience with it, but it really broke down in actual business context for me. Bummed me out.

Was looking into PDM when I left and haven't really messed around in that world since then.

OnceIWasAnOstrich
Jul 22, 2006

The March Hare posted:

Did Poetry ever fix their resolver being insanely slow?

Hmm, not sure. It's not a problem I've run into in a way that bothered me; it seems roughly equivalent to pip in speed? We have a few projects with a couple dozen dependencies each, and while solving isn't instant or anything, it isn't an issue I face much. We don't use open-ended version specifications for dependencies, so this only really matters when we upgrade them, which doesn't need to be fast. If you do use open-ended specs (or worse, no version restrictions), keeping them restrictive and recent goes a long way toward keeping things brisk.

If that is a big blocker though, PDM is the way to go and is another good option in general, though I still prefer Poetry.

QuarkJets
Sep 8, 2008

Pip has undergone a significant downgrade in performance over the last year or two, especially for packages with a lot of unpinned or loosely pinned dependencies. If I were concerned about Poetry's performance, I would look at pip first.

Hed
Mar 31, 2004

Fun Shoe

QuarkJets posted:

Pip has undergone a significant downgrade in performance over the last year or two, especially for packages with a lot of unpinned or loosely pinned dependencies. If I were concerned about Poetry's performance, I would look at pip first.

Am I misunderstanding or did you mean it’s gotten faster?

QuarkJets
Sep 8, 2008

Hed posted:

Am I misunderstanding or did you mean it’s gotten faster?

I mean that it's gotten much slower. Doesn't poetry ultimately use pip, like, under the hood? I thought it was basically a pip manager

WHERE MY HAT IS AT
Jan 7, 2011
Poetry has its own resolver and doesn't rely on pip; pipenv does just punt to pip under the hood and is even slower. The real problem is that setup.py is a non-deterministic way of specifying dependencies, so you can't have a truly "offline" package resolver: you have to actually install stuff to do lock-file generation.

Generic Monk
Oct 31, 2011

I've got an interview in about a week and they've asked me to do a couple of things beforehand. One is to write some ETL stuff in Python to download a dataset as a CSV from the web, filter it, join it to something else, and load it into a database. I've written data processing stuff with Pandas before so I'm confident I can do this, but I'm not really familiar with ETL conventions in Python.

Anyone know what kind of thing they would be expecting? I'm leaning toward just something simple with pandas, since that's what I'm most familiar with and would be the most likely to be understood, rather than some specialist ETL tool. I'm relatively good at structuring my scripts well, but I don't know what they would expect, so I don't want to show something super weird. I normally split stuff into logical functions, but it seems like if it's just a linear ETL for one thing, maybe a really simple script is better? Help me seem like I know what I'm talking about :''')

The Fool
Oct 16, 2003


just use pandas

Tayter Swift
Nov 18, 2002

Pillbug
I use Pandas for my ETL all the time :shrug: Although I'm switching over to Polars now.

They might want to know if you know how to do proper method chaining and convert datetimes. And they might be sick of seeing the Titanic dataset if that was your plan.
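
By "proper method chaining" I mean roughly the kind of shape below; the URL, column names, and lookup file are all made up for illustration, so don't copy it literally:
Python code:
# Made-up dataset/columns, just to show the chained extract-filter-join-load shape
import sqlite3

import pandas as pd

df = (
    pd.read_csv("https://example.com/dataset.csv")                 # extract
      .rename(columns=str.lower)                                   # tidy headers
      .assign(event_date=lambda d: pd.to_datetime(d["event_date"]))
      .query("status == 'active'")                                 # filter
      .merge(pd.read_csv("lookup.csv"), on="region_id")            # join
)

with sqlite3.connect("warehouse.db") as con:                       # load
    df.to_sql("events", con, if_exists="replace", index=False)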

Generic Monk
Oct 31, 2011

Tayter Swift posted:

I use Pandas for my ETL all the time :shrug: Although I'm switching over to Polars now.

They might want to know if you know how to do proper method chaining and convert datetimes. And they might be sick of seeing the Titanic dataset if that was your plan.

It's a specific dataset they wanted that's related to what the organisation does so mercifully that choice is out of my hands. The method chaining and datetime conversion and similar I'm very familiar with. I just mainly use SQL for my ETL work so I don't want to turn out something wildly off base in terms of structure I guess. Seems like for something this simple I can just make a bare script and then maybe turn it into a jupyter notebook if I'm feeling fancy idk

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I have a weird question. I'm working with a system that holds formulas in text and I'd like to interpret those myself.

So something like:
code:
( Date1-CPU / Date2-CPU ) * OtherFactor / (Factor1 / Factor2)
Extracting the 'words' conveniently isn't too bad, and if I can do that the rest isn't so hard, but the actual formula could be different from place to place. Is eval the right option here? I've only heard footgun noises about that.

12 rats tied together
Sep 7, 2006

I would probably build a lookup table of symbol to function and then use some combination of operator.methodcaller / operator.attrgetter to call them explicitly after parsing the text.
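
Something in this direction, maybe; everything below (the token regex, the assumption that hyphenated words like Date1-CPU are single variable names, the example values) is invented for illustration rather than matched to your actual format:
Python code:
import operator
import re

# Symbol -> function lookup table, per the suggestion above.
BINOPS = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}

# Assumption: hyphenated words like Date1-CPU are single variable names,
# so '-' only means subtraction when it stands alone between tokens.
TOKEN = re.compile(r"\s*(?:(\d+(?:\.\d+)?)|([A-Za-z_][\w-]*)|([()*/+-]))")

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise ValueError(f"bad input at {text[pos:]!r}")
        number, name, sym = m.groups()
        if number:
            tokens.append(("num", float(number)))
        elif name:
            tokens.append(("name", name))
        else:
            tokens.append(("sym", sym))
        pos = m.end()
    return tokens

def evaluate(tokens, variables):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def expr():  # term (('+' | '-') term)*
        nonlocal pos
        value = term()
        while peek() in (("sym", "+"), ("sym", "-")):
            op = tokens[pos][1]
            pos += 1
            value = BINOPS[op](value, term())
        return value

    def term():  # factor (('*' | '/') factor)*
        nonlocal pos
        value = factor()
        while peek() in (("sym", "*"), ("sym", "/")):
            op = tokens[pos][1]
            pos += 1
            value = BINOPS[op](value, factor())
        return value

    def factor():  # number | name | '(' expr ')'
        nonlocal pos
        kind, val = tokens[pos]
        pos += 1
        if kind == "num":
            return val
        if kind == "name":
            return variables[val]  # KeyError on unknown names: fail loudly
        if (kind, val) == ("sym", "("):
            value = expr()
            if peek() != ("sym", ")"):
                raise ValueError("expected ')'")
            pos += 1
            return value
        raise ValueError(f"unexpected token {val!r}")

    return expr()

formula = "( Date1-CPU / Date2-CPU ) * OtherFactor / (Factor1 / Factor2)"
values = {"Date1-CPU": 80.0, "Date2-CPU": 40.0,
          "OtherFactor": 3.0, "Factor1": 6.0, "Factor2": 2.0}
print(evaluate(tokenize(formula), values))  # (80/40) * 3 / (6/2) -> 2.0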

This seems like a bad idea, but I'm sure you know that already. :cheers:

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

12 rats tied together posted:

I would probably build a lookup table of symbol to function and then use some combination of operator.methodcaller / operator.attrgetter to call them explicitly after parsing the text.

This seems like a bad idea, but I'm sure you know that already. :cheers:

The upside is that this is a script that calls an expected API, so we're not taking in like...unvalidated user input from the internet or anything, so the security implications are less dire.

hey mom its 420
May 12, 2007

This isn't really a Python-specific question, but I didn't know where else to post it. We have a Django app connected to a RabbitMQ queue. There's a separate service that gets that message, generates a PDF with react-pdf, uploads it to S3, and then creates a success message (containing the path to the uploaded file). Then there's a manage command that runs on a separate pod that reads those success messages and updates DB records with the S3 path.

All well and good. But now I'd like to be able to say: generate a PDF, and then, when it's done, send an email from the Django app. What do you think would be the best way to achieve this? I don't know how these things are usually handled. My intuition is that the best way is to encode what we want to happen with the generated PDF inside the message itself, so that when we send the message I can put in `["send_pdf_email_from_msg", "update_db_record_from_msg"]` and the manage command can read those and know what to do.

Seems like it should work, but it's kind of hard to follow from the perspective of someone just looking at the code; you'd have to know the whole setup specifically to know what happens.
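
Roughly what I'm picturing on the manage command side (the handler names and message shape here are just placeholders):
Python code:
import json

def send_pdf_email_from_msg(msg):
    ...  # look up the recipient, link the S3 path, send the email

def update_db_record_from_msg(msg):
    ...  # existing behaviour: write msg["s3_path"] onto the DB record

# Explicit lookup table, so an unknown action name blows up loudly.
HANDLERS = {
    "send_pdf_email_from_msg": send_pdf_email_from_msg,
    "update_db_record_from_msg": update_db_record_from_msg,
}

def handle_success_message(raw_body: bytes) -> None:
    # e.g. {"s3_path": "...", "actions": ["send_pdf_email_from_msg"]}
    msg = json.loads(raw_body)
    for action in msg.get("actions", []):
        HANDLERS[action](msg)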

lazerwolf
Dec 22, 2009

Orange and Black

hey mom its 420 posted:

This isn't really a Python-specific question, but I didn't know where else to post it. We have a Django app connected to a RabbitMQ queue. There's a separate service that gets that message, generates a PDF with react-pdf, uploads it to S3, and then creates a success message (containing the path to the uploaded file). Then there's a manage command that runs on a separate pod that reads those success messages and updates DB records with the S3 path.

All well and good. But now I'd like to be able to say: generate a PDF, and then, when it's done, send an email from the Django app. What do you think would be the best way to achieve this? I don't know how these things are usually handled. My intuition is that the best way is to encode what we want to happen with the generated PDF inside the message itself, so that when we send the message I can put in `["send_pdf_email_from_msg", "update_db_record_from_msg"]` and the manage command can read those and know what to do.

Seems like it should work, but it's kind of hard to follow from the perspective of someone just looking at the code; you'd have to know the whole setup specifically to know what happens.

Have you looked into using signals? You could use a post_save signal after the S3 path is saved to trigger a send-email function.
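
Something like this, as a rough sketch (Document, s3_path, and send_pdf_email are made-up names standing in for whatever your model and helpers are actually called):
Python code:
# Hypothetical model/field/helper names, just to show the shape of the receiver
from django.db.models.signals import post_save
from django.dispatch import receiver

from myapp.models import Document
from myapp.emails import send_pdf_email


@receiver(post_save, sender=Document)
def email_when_pdf_ready(sender, instance, created, **kwargs):
    # Only fire once the worker has written the S3 path back onto the record.
    if instance.s3_path:
        send_pdf_email(instance)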

Hed
Mar 31, 2004

Fun Shoe

lazerwolf posted:

Have you looked into using signals? You could use a post_save signal after the S3 path is saved to trigger a send-email function.

This will work, but I think overriding the save() method is typically viewed as better than attaching signals, since signals invert control / make it less obvious what's happening.
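
Rough sketch of what I mean, reusing the same made-up Document / send_pdf_email names from the signal example:
Python code:
# Same hypothetical names as above; only the control flow differs
from django.db import models

from myapp.emails import send_pdf_email


class Document(models.Model):
    s3_path = models.CharField(max_length=1024, blank=True)

    def save(self, *args, **kwargs):
        # Detect the "no path yet" -> "path set" transition right where the
        # write happens, instead of in a signal registered somewhere else.
        had_path = False
        if self.pk:
            had_path = bool(
                Document.objects.filter(pk=self.pk)
                .values_list("s3_path", flat=True)
                .first()
            )
        super().save(*args, **kwargs)
        if self.s3_path and not had_path:
            send_pdf_email(self)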

hey mom its 420
May 12, 2007

I don't like using signals; our team used to use them a lot, and for some old pieces of code you just can't follow what's happening because the signals keep ping-ponging up and down.

Anyway, a signal would be OK if I always wanted to send an email after generating a PDF. But sometimes we just want to generate the PDF, and sometimes we want to do that plus some other stuff, so I'm left wondering what kind of control flow to use for that. Another option I see would maybe be writing some sort of rule in Redis before sending the signal that says: OK, if this is the ID of the generated document, then also do this and that. And then check that when we get the success message in the queue, I guess?

lazerwolf
Dec 22, 2009

Orange and Black

Hed posted:

This will work, but I think overriding the save() method is typically viewed as better than attaching signals, since signals invert control / make it less obvious what's happening.

I’m not a big fan of overriding save methods to do side effects like sending emails or calling other functions. If you want to create a custom label post save, yeah overriding save makes sense. That’s just my personal opinion.

hey mom its 420 posted:

I don't like using signals; our team used to use them a lot, and for some old pieces of code you just can't follow what's happening because the signals keep ping-ponging up and down.

Anyway, a signal would be OK if I always wanted to send an email after generating a PDF. But sometimes we just want to generate the PDF, and sometimes we want to do that plus some other stuff, so I'm left wondering what kind of control flow to use for that. Another option I see would maybe be writing some sort of rule in Redis before sending the signal that says: OK, if this is the ID of the generated document, then also do this and that. And then check that when we get the success message in the queue, I guess?

Is it unreasonable to build a REST api to call these various downstream functions?

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

The March Hare posted:

Did Poetry ever fix their resolver being insanely slow? I used it on a project that didn't get super big and it was a really significant portion of our build time. I was dying looking around for alternatives when I left. We had some other really niche edge-case issues with how Poetry was handling editable installs and maybe a couple of other things that were bothering me.

I felt bad since I was the one who initially made the switch over to Poetry because I had some (positive up to that point) experience with it, but it really broke down in actual business context for me. Bummed me out.

Was looking into PDM when I left and haven't really messed around in that world since then.

The entire mechanism of Python package management would need to be rewritten from scratch to make any backtracking dependency solver performant, because dependency resolution is dynamic, system/installation-dependent, and requires downloading each package to even generate its dependency list. You can use good, robust, slow solvers or naive, sometimes nondeterministic, fast solvers. This is the burden we bear.

Oysters Autobio
Mar 13, 2017
Question for experienced folks here. I'm not a dev, just a data analyst who started picking up python for pandas and pyspark.

These past few months, beyond the initial basic tutorials and now using Python basically for interactive work in a Jupyter notebook, I feel like I'm sort of stuck in tutorial hell again when it comes to building better skills at making Python code and packages for *other* people to also use.

Sort of a mix between analysis paralysis (of too many options and freedom) and a good hefty dose of imposter syndrome / anxiety over how other "real" developers and data scientists at work will judge my code.

I know it's all irrational, but one practical part I'm having trouble with is the framing and design part of things. Having ADHD doesn't help either, because it's easy to accidentally fall into a rabbit hole and suddenly you're reading about pub/sub and Kafka when all you were trying to look up was how to split a column by delimiter in pandas, lol. Outside of Python I turn heavily to structure, to-do lists, framing / templates and the like to stay on track, but I'm having trouble applying any of that to Python work.

For example, I love design patterns, but all of them seem to be oriented towards "non-data-engineering" problems, or I can't figure out how to apply them to my problems. Like, I love the idea of TDD and red/green but have no clue how I would build testing functions for data analysis or ETL.

I can't seem to find more opinionated, detailed, step-by-step models for creating data pipelines, ETL, or EDA packages to generate reports. A lot of stuff feels like "step one: draw two circles, step two: draw the rest of the loving owl".

A lot of just venting here so please ignore if it's obnoxious or vague, but any advice or thoughts on this would be great.

ComradePyro
Oct 6, 2009
I've been writing code professionally for over a year now. I still don't know what I'm doing, but I've learned enough to think I was a total moron a year ago. give it time

jaete
Jun 21, 2009


Nap Ghost

Vulture Culture posted:

The entire mechanism of Python package management would need to be rewritten from scratch to make any backtracking dependency solver performant, because dependency resolution is dynamic, system/installation-dependent, and requires downloading each package to even generate its dependency list. You can use good, robust, slow solvers or naive, sometimes nondeterministic, fast solvers. This is the burden we bear

Are there even any systems which would have fully deterministic dependency resolution for packages? I mean, how exactly would that work? When someone says "foo install some-package==1.23", the system would... download the static set of that package's dependencies... which can never change? So every package's maintainer would have to first specify the exact version of every dependency? I'm not sure what we're talking about here

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

jaete posted:

Are there even any systems which would have fully deterministic dependency resolution for packages? I mean, how exactly would that work? When someone says "foo install some-package==1.23", the system would... download the static set of that package's dependencies... which can never change? So every package's maintainer would have to first specify the exact version of every dependency? I'm not sure what we're talking about here

If you provide the same set of package dependencies, and you have the same set of packages available, you get the same resulting set of installed packages.

The March Hare
Oct 15, 2006

Je rêve d'un
Wayne's World 3
Buglord

Vulture Culture posted:

The entire mechanism of Python package management would need to be rewritten from scratch to make any backtracking dependency solver performant, because dependency resolution is dynamic, system/installation-dependent, and requires downloading each package to even generate its dependency list. You can use good, robust, slow solvers or naive, sometimes nondeterministic, fast solvers. This is the burden we bear

For sure, processes take time, but sometimes they take more time than they need to. That's an important lesson to keep in mind, lest you fall into the trap of the Poetry dev team, where you handwave performance concerns away for years by saying "oh, but dep solving just takes time!" without recognizing that your particular implementation could use some improving.

Poetry used to be (still is!?) particularly bad at handling certain conditions (common ones, judging by the number of GitHub issues about them). It seems they've been accepting some significant performance patches of late, though.

This one from May of last year, for example: https://github.com/python-poetry/poetry/pull/7950

"On my machine with fully populated caches, the performance of poetry update urllib3 (in a repo where this causes conflicts with boto3/botocore) goes from ~58 minutes to ~3 minutes with these changes."

They've also accepted a few other patches recently to cut down on unnecessary deepcopying during resolution, and this one which changes the order in which deps are resolved: https://github.com/python-poetry/poetry/pull/8256

"Thank you for this! Last week I updated transient dependencies for a project using poetry update <package>, and the command took over 3 minutes each time. With Poetry 1.6.0 it instead takes 8 seconds 😄"

"Awesome work! I was about to drop poetry since it took ~10 minutes for every dependency change. Updating it to check how much it improves 🚀"

The March Hare fucked around with this message at 19:01 on Jan 31, 2024

QuarkJets
Sep 8, 2008

Vulture Culture posted:

If you provide the same set of package dependencies, and you have the same set of packages available, you get the same resulting set of installed packages

The problem is that package dependencies aren't always specific, usually because they don't have to be. If I have a package that requires "pandas" without specifying a version number then dependency resolution will give me whatever the latest version of pandas is, which changes over time. This is deterministic but the answer does change over time.

If you use a requirements.txt produced by `pip freeze` then you'll get the same set of installed packages each time, but pip is no longer really performing dependency resolution at that point

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

QuarkJets posted:

The problem is that package dependencies aren't always specific, usually because they don't have to be. If I have a package that requires "pandas" without specifying a version number then dependency resolution will give me whatever the latest version of pandas is, which changes over time. This is deterministic but the answer does change over time.

If you use a requirements.txt produced by `pip freeze` then you'll get the same set of installed packages each time, but pip is no longer really performing dependency resolution at that point

I get that. The issue is that the old pip resolver, the one prior to the backtracker, did not do this. On upgrades (and in rare cases, installs), it would come up with completely different depth-first resolution orders for the same underlying dependency specification. This resulted in divergent version preferences, due to the widespread reliance on data structures with unstable ordering (sometimes in packages' setup.py themselves). But it did it really fast! And this totally broken implementation is what most folks are comparing the modern resolvers (Poetry, pipenv, conda, recent pip) to.

Vulture Culture fucked around with this message at 20:49 on Jan 31, 2024

fatelvis
Mar 21, 2010

Hughmoris posted:

For those of you who fiddle with Python at home, what does your dev environment look like? Do you have Python installed directly to your desktop? Utilizing containers, or a remote environment?

I just use pipenv. Haven't had any issues with it really.

Oysters Autobio
Mar 13, 2017
Sorry for the crosspost but I realized this question might be better suited here.

I'm trying to set up my local VS Code (launched from Anaconda on Windows) to use the notebook kernel and terminal from a JupyterHub instance we use at work. Reason being that this Hub has a default conda env and kernel that's set up to abstract away working with our Spark and Hadoop setup.

I followed some online tutorials which had me generate a token from my JupyterHub account and then simply add it to a Jupyter extension setting that lets you set a remote Jupyter server to use as a notebook kernel. Great, it works.

Problem is that this seems to only connect the notebook to the server.

The vscode terminal is still just my local machine. So how do I connect that to the remote JupyterHub?

The goal here is to just fully replicate my work from the JupyterHub instance in VS Code, to leverage all the IDE tools it has. The notebook works fine, but I don't want to have to work between both VS Code and JupyterLab in Chrome to use all the functionality.

Side note, I use the "Show Contextual Help" feature in Jupyterlab a lot since it shows docstrings and info for any given object in the IDE. What's the equivalent for this in vs code?

OnceIWasAnOstrich
Jul 22, 2006

Oysters Autobio posted:

The vscode terminal is still just my local machine. So how do I connect that to the remote JupyterHub?

Hmm, this is kind of a big ask. Your setup works by connecting to the remote Jupyter kernel over HTTP when you are using a notebook in VS Code via the JupyterHub extension. You can of course get a remote terminal via Jupyter (JupyterLab does it), but it does it with (I think) XTerm.js or something similar running on the host, communicating over websockets. To make this work over a Jupyter kernel link, you would have to rewrite (or re-implement in the form of an extension) the VS Code terminal to use the Jupyter kernel protocol, and possibly add software to your kernel environments to run the terminal session and transfer data over websockets or some other protocol. The extension just isn't set up to do that, as far as I know.

You can, of course also do remote terminals (and editors, and extensions, and everything else), but those are implemented over SSH. To make that work you would not really be using JupyterHub and would instead connect to the host running the Jupyter kernels over SSH and then access both the terminal and notebook kernels over that tunnel.

The latter is possible, but maybe not with your JupyterHub if you can't make SSH connections to the underlying machines. The former is possible but I don't think it is currently implemented, at least I don't know of any implementations.

Oysters Autobio posted:

Side note, I use the "Show Contextual Help" feature in Jupyterlab a lot since it shows docstrings and info for any given object in the IDE. What's the equivalent for this in vs code?

Individual extensions for various languages provide this via language servers. This works in the normal editor interface, but with notebooks I imagine it would be the responsibility of the notebook extension. I don't use that; perhaps there is a feature in settings you could enable? That feature might just not be available in notebooks.

--

Another thing I know is possible is to run a web version of VS Code (or code-server) remotely and access it via the Jupyter proxies, something like https://github.com/betatim/vscode-binder or https://github.com/victor-moreno/jupyterhub-deploy-docker-VM/tree/master/singleuser/srv/jupyter_codeserver_proxy. You've still got a separate interface, but at least they're both running on the same host in the same browser.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Oysters Autobio posted:

Question for experienced folks here. I'm not a dev, just a data analyst who started picking up python for pandas and pyspark.

These past few months, beyond the initial basic tutorials and now using Python basically for interactive work in a Jupyter notebook, I feel like I'm sort of stuck in tutorial hell again when it comes to building better skills at making Python code and packages for *other* people to also use.

Sort of a mix between analysis paralysis (of too many options and freedom) and a good hefty dose of imposter syndrome / anxiety over how other "real" developers and data scientists at work will judge my code.

I know it's all irrational, but one practical part I'm having trouble with is the framing and design part of things. Having ADHD doesn't help either, because it's easy to accidentally fall into a rabbit hole and suddenly you're reading about pub/sub and Kafka when all you were trying to look up was how to split a column by delimiter in pandas, lol. Outside of Python I turn heavily to structure, to-do lists, framing / templates and the like to stay on track, but I'm having trouble applying any of that to Python work.

For example, I love design patterns, but all of them seem to be oriented towards "non-data-engineering" problems, or I can't figure out how to apply them to my problems. Like, I love the idea of TDD and red/green but have no clue how I would build testing functions for data analysis or ETL.

I can't seem to find more opinionated, detailed, step-by-step models for creating data pipelines, ETL, or EDA packages to generate reports. A lot of stuff feels like "step one: draw two circles, step two: draw the rest of the loving owl".

A lot of just venting here so please ignore if it's obnoxious or vague, but any advice or thoughts on this would be great.


Wanted to jump back to this question, as I was in a somewhat similar place, and definitely have the same ADHD coping mechanisms and problems. Here's some notes:

  • Software development is, at its core, all about combining building blocks of code together to do things. Oftentimes, when you run into that problem of 'Step 1: Circle, Step 2: Rest of the loving owl', it ends up being that you're missing lower-level context, since a good tutorial doesn't cover all the messy middle bits most of the time. Once you have more stuff in your personal toolkit, these jumps will get less and less frustrating, and these docs/articles/etc will be more useful because they cover context and goals rather than just handholding you through the entirety of the process.
  • To address that last point, I would recommend working on foundational software dev stuff. For me, code katas such as Codewars or Leetcode were the real clicking point for my ADHD brain, so I'd recommend you try them out - that or something like Advent of Code, just anything that gives you a reasonably small problem space to work on.
  • Design patterns in particular aren't exclusive to non-data-engineering problems, although many of them just aren't applicable. However, that's true of any software design challenge - the patterns from the Gang of Four aren't universal to every application, and also aren't universal truths - they're more like 'there's a million ways to cook a hamburger, but there's four really solid ones that you might serve at this restaurant'. Don't treat them as dogma, treat them as tools that may or may not be relevant.
  • If you're like me, you won't understand some of these things (my example is dependency injection) until you actually make some mistakes (for me, using the singleton pattern everywhere and then trying to figure out how to test my mess). That's okay and good, and part of the learning experience, and you'll make fewer and less impactful mistakes as you learn.

I managed a piece of software that was similar to an ETL pattern, so I can give some general advice, but there's also been some talk about it in this thread if you go back a few pages.
  • For pipelines like this, I recommend trying to follow functional programming ideas more than OOP ones. What this means in general is 'avoid using classes for stuff other than data structure and storage as much as possible'. That's not a commandment btw, just a recommendation - I would say that in my service most of the classes with methods were just for holding shared behaviors to apply (different sorting methods for example) rather than the classic OOP 'load up an object with data and methods and then mutate the state around a bunch'. The biggest reason is that pipelines, by their structure, hold up very well to classic FP approaches because you expect that for a given input you'll always get the same output; there's no user input/etc to have to sidestep or integrate along the way.
  • Side note: if you're just doing the 'well, I have a set of data I want to keep in a class instead of a dictionary' thing, that's a great idea, and dataclasses are your friend (or one of the fancier versions like Pydantic/etc). The point is that these are essentially just a way to hold data in a structured way for access, and you shouldn't have a lot of class methods that mutate or do actions on that data in that class. Again, not an absolute approach, but a recommendation. I often have little helper properties in classes like that for deterministic stuff: for example, if you're holding information about an order, you might have a property to calculate something like 'total cost' if that isn't already part of your dataset. The trick is that this should be deterministic (i.e. total cost = obj.tax_price + obj.line_item_price or something like that); there's a tiny sketch of this at the bottom of the post.
  • I'll dig up my post from a bit ago with more details on this. Edit: https://forums.somethingawful.com/showthread.php?threadid=3812541&userid=65908&perpage=40&pagenumber=6#post536720636

Falcon2001 posted:

So, I did something vaguely related before, and we didn't use any off the shelf stuff. (Probably because as mentioned this wasn't precisely the same thing.) We were combining multiple types of data sources to form a unified 'source list' for use in an auditing application, basically - so more ET and very little L. So here's the useful things I learned from that project, and maybe it's useful to know about. It's also entirely possible I'm full of poo poo so if others have more context and I'm wrong, speak up.

  • Most important: work backwards from the requirements to figure out how the data should look at the end. If you don't have a set of requirements, talk to whoever's going to use the data. If you don't have a clear list there, sit down yourself before you start and write up a list of requirements. This will help you avoid some of the more insane data spaghetti I've seen before.
  • Make sure to normalize your data. This is, from what I've seen, a foundational part of ETL systems in the transform step and also a general programming practice, but it's extremely important when dealing with dissimilar data sources. For example, if you're combining data from System A and System B, you should end up with an object that allows you to directly compare the two. This means that you might have to basically make a brand new schema for the objects that allows you to transform things together.
  • Another way of putting this is that I wouldn't recommend treating Source A as a subclass of TargetData and Source B as another subclass; you should just have multiple instances of TargetData, each with an attribute that identifies the source.
  • In contrast, it may be useful to have metadata/other source data available. For example, we had a source_data attribute that contained a flexibly structured JSON blob of the raw source data. This is likely to be the sort of thing that's highly domain dependent - you can argue that ideally you shouldn't need to preserve any of this, but I suspect that's also a Perfect World argument and doesn't really pan out. I would recommend making this consistent to find, so you're always looking in a certain place, and from there you can query it when required. Ideally, you can use this to identify places you're having to do these edge cases and work them back into your basic schema instead.
  • ETL is a domain that I think highly benefits from a more functional programming approach instead of a classic OO style - build your transform pipelines and other actions from atomic functions that get chained together and try and avoid having to worry about state as much as possible. If necessary, I'd try and keep it high level like a state machine or something of that nature instead of more flexible mutable objects.
  • Coincident with the previous suggestion, for god's sake please use type hints as much as possible. If your data requires moving complex stuff around, use objects such as dataclasses instead of complex dictionaries. Functional approaches move a LOT of data around from function to function quickly, and one of my biggest nightmares was this massive complex nested dictionary the previous dev had implemented that we just passed around everywhere and had to throw print statements at to figure out what the hell was in it.
  • Come up with a way to test this early on and make sure it stays up to date. Unit testing is great, but you also need a way to run your workflow on sample source data with an attached debugger. This will absolutely save you a shitton of time later on.
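
To make the dataclass-with-deterministic-properties point concrete, here's a tiny made-up example (Order and its fields are invented, nothing from the real service):
Python code:
from dataclasses import dataclass


@dataclass
class Order:
    line_item_price: float
    tax_price: float
    source: str = "SystemA"  # which upstream system the record came from

    @property
    def total_cost(self) -> float:
        # Deterministic: derived purely from the fields above, no hidden state.
        return self.line_item_price + self.tax_price


order = Order(line_item_price=40.0, tax_price=3.2)
print(order.total_cost)  # 43.2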

Oysters Autobio
Mar 13, 2017

Falcon2001 posted:

Wanted to jump back to this question, as I was in a somewhat similar place, and definitely have the same ADHD coping mechanisms and problems. Here's some notes...

This is really fantastic info, thanks so much for sharing. Your point about functional programming makes a lot of sense, and I think is probably key to some of my problems finding applicable ETL and data engineering related patterns. You've already provided way more than enough but if you or anyone has some good resources for functional programming for data engineering in python that'd be great.

Also your specific advice all makes a lot of sense, and I've always just been really bad at breaking things down into chunks for anything so this probably seems like the best place to start.


OnceIWasAnOstrich posted:

Hmm, this is kind of a big ask. Your setup works by connecting to the remote Jupyter kernel over HTTP when you are using a notebook in VS Code via the JupyterHub extension. You can of course get a remote terminal via Jupyter (JupyterLab does it), but it does it with (I think) XTerm.js or something similar running on the host, communicating over websockets. To make this work over a Jupyter kernel link, you would have to rewrite (or re-implement in the form of an extension) the VS Code terminal to use the Jupyter kernel protocol, and possibly add software to your kernel environments to run the terminal session and transfer data over websockets or some other protocol. The extension just isn't set up to do that, as far as I know.

Thanks for explaining, yeah, reworking the extension the way you describe sounds far too onerous. It does sort of leave me scratching my head that no one else has tackled this, though, because if you can only connect a remote notebook but not a terminal/console, it seems like you may as well not even bother connecting to a remote Jupyter server through VS Code to begin with.

quote:

You can, of course also do remote terminals (and editors, and extensions, and everything else), but those are implemented over SSH. To make that work you would not really be using JupyterHub and would instead connect to the host running the Jupyter kernels over SSH and then access both the terminal and notebook kernels over that tunnel.

The latter is possible, but maybe not with your JupyterHub if you can't make SSH connections to the underlying machines. The former is possible but I don't think it is currently implemented, at least I don't know of any implementations.


So this kind of setup was what I was thinking was more doable, but I don't know what access I have in terms of SSH, so I'll see what I can do. It doesn't help that my VS Code is on a Windows machine, because our workstations aren't connected to the Windows app store for me to install a WSL Linux distro. Adding PuTTY to the mix isn't appealing.

quote:

Individual extensions for various languages provide this via language servers. This works in the normal editor interface, but with notebooks I imagine it would be the responsibility of the notebook extension. I don't use that; perhaps there is a feature in settings you could enable? That feature might just not be available in notebooks.


Thanks, that makes sense. I'll dig into the notebook extension settings and see if I can find anything about a language server.

--

quote:

Another thing I know is possible is to run a web version of VS Code (or code-server) remotely and access it via the Jupyter proxies, something like https://github.com/betatim/vscode-binder or https://github.com/victor-moreno/jupyterhub-deploy-docker-VM/tree/master/singleuser/srv/jupyter_codeserver_proxy. You've still got a separate interface, but at least they're both running on the same host in the same browser.
Will also definitely check this out too, thanks

OnceIWasAnOstrich
Jul 22, 2006

Oysters Autobio posted:

So this kind of setup was what I was thinking was more doable, but I don't know what access I have in terms of SSH, so I'll see what I can do. It doesn't help that my VS Code is on a Windows machine, because our workstations aren't connected to the Windows app store for me to install a WSL Linux distro. Adding PuTTY to the mix isn't appealing.

On this particular point, Windows doesn't matter. Although I do use WSL 2 some, I do most of my work at work on a Windows laptop I don't have admin or store access on. Most of that development work is done in VS Code with remote sessions on some Linux machine that I may or may not have root on. I only have PuTTY installed so I can test things for PuTTY users; I otherwise do everything with the OpenSSH that is built into Win10 systems.

This does require SSH access to the remote systems, something admins may be trying to avoid by providing JupyterHub instead. In general, if you are using remote compute, even non-admin Windows isn't going to be the dealbreaker, or even necessarily an inconvenience.

I hesitate to suggest this, because it could be an end-run around restrictions and you should check policies, but you could absolutely run the VS Code or code-server CLI from within a Jupyter notebook and set up a tunnel for your local VS Code client that way. It is kind of a manual version of one of the links I provided.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Oysters Autobio posted:

This is really fantastic info, thanks so much for sharing. Your point about functional programming makes a lot of sense, and I think is probably key to some of my problems finding applicable ETL and data engineering related patterns. You've already provided way more than enough but if you or anyone has some good resources for functional programming for data engineering in python that'd be great.

So unfortunately I don't have any particularly good single source of advice, because if you search 'functional programming' you're going to run into a lot of intense and opaque vocabulary. The basic idea of functional programming is that you're just connecting functions together to form your program, with no main 'driver code' or anything like that, and with absolutely no state management.

Realistically this...is really hard to wrap your head around. You can see more about this in the video below:

https://www.youtube.com/watch?v=nuML9SmdbJ4

The for-loop example is a great illustration of why pure FP seems kind of batshit crazy to most seasoned software devs. Sure, we can probably understand it, but why the gently caress would you do it that way?

The important point is to take away the good parts of functional programming - specifically that building your code out of atomic functions is clean and effective - and leave the really insane poo poo behind. Python lets you treat functions as first class objects that can be passed around/etc, so it's a very good language for this approach - you can also just dump functions into a file and access them without even the sort of 'static classmethod' approach that Java/C# uses. Use enough OOP approaches to maintain some sane level of state that makes your code readable and quickly understandable, but avoid complex and hefty objects that mutate over the course of your codebase as much as possible.
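
As a tiny illustration of the 'chain small pure functions together' idea (the record shape and the steps are made up):
Python code:
from functools import reduce
from typing import Callable, Iterable

Record = dict  # stand-in for a dataclass/Pydantic model

def drop_missing(rows: Iterable[Record]) -> list[Record]:
    # Pure function: new list out, nothing mutated, no state held anywhere.
    return [r for r in rows if r.get("value") is not None]

def normalize_units(rows: Iterable[Record]) -> list[Record]:
    return [{**r, "value": float(r["value"]) / 100} for r in rows]

def pipeline(*steps: Callable):
    # Compose the steps left to right: pipeline(f, g)(x) == g(f(x))
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)

run = pipeline(drop_missing, normalize_units)
print(run([{"value": 250}, {"value": None}]))  # [{'value': 2.5}]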

Oysters Autobio
Mar 13, 2017
Thanks for that. I watched a YT series which went over standard FP tools like filter, map, and reduce, and it really clicked for me how this could be useful in ETL or data work.

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

Falcon2001 posted:

you can also just dump functions into a file and access them without even the sort of 'static classmethod' approach that Java/C# uses.

Protip, purely for readability, my last position used a singleton rather than static methods. We had a ton of bespoke code that was constantly being fixed/updated and a lot of hands in the cookie jar reading unfamiliar code. So instead of just seeing methods and needing to check "and where is this guy coming from" in your IDE it's much nicer to be able to read it. You can get the same effect from imports I guess, but this enforced it for us.

code:
from module import singleton as feature
feature.methodname()

Fender fucked around with this message at 22:29 on Feb 3, 2024

QuarkJets
Sep 8, 2008

Fender posted:

Protip, purely for readability, my last position used a singleton rather than static methods. We had a ton of bespoke code that was constantly being fixed/updated and a lot of hands in the cookie jar reading unfamiliar code. So instead of just seeing methods and needing to check "and where is this guy coming from" in your IDE it's much nicer to be able to read it. You can get the same effect from imports I guess, but this enforced it for us.

code:
from module import singleton as feature
feature.methodname()

Do you mean module imports? E.g.

code:
import os
x = os.path.join("foo", "bar")

If you truly mean a singleton then I don't understand

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

QuarkJets posted:

Do you mean module imports? E.g.

code:
import os
x = os.path.join("foo", "bar")
If you truly mean a singleton then I don't understand

You know, as I was typing that out I was thinking about how pointlessly opinionated it sounded. I was mostly just piping up because it's how we blended our functional programming desires with our older OOP code. And for some reason it was universally loved.

Just put all your methods in a class and instantiate it at the end of the module, and then import that. Not a perfect singleton, but it was essentially the same thing for this purpose. You can make it purely functional, or, since it's versatile, you can refactor older code into one of these and it can keep doing whatever OOP nonsense it might still be required to do, or maybe you have to pass it around somewhere. Whatever. But if you stick to the pattern, you can do whatever wild crap you have to do in these supporting functional modules and it always looks the same and acts the same. Nothing earth-shattering, but mildly topical, and I (like everyone else there) really liked it, so I preach.

Fender fucked around with this message at 01:13 on Feb 4, 2024

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Fender posted:

You know, as I was typing that out I was thinking about how pointlessly opinionated it sounded. I was mostly just piping up because it's how we blended our functional programming desires with our older OOP code. And for some reason it was universally loved.

Just put all your methods in a class and instantiate it at the end of the module, and then import that. Not a perfect singleton, but it was essentially the same thing for this purpose. You can make it purely functional, or, since it's versatile, you can refactor older code into one of these and it can keep doing whatever OOP nonsense it might still be required to do, or maybe you have to pass it around somewhere. Whatever. But if you stick to the pattern, you can do whatever wild crap you have to do in these supporting functional modules and it always looks the same and acts the same. Nothing earth-shattering, but mildly topical, and I (like everyone else there) really liked it, so I preach.

Yeah, I mean you're essentially doing a from module import * sort of deal, except you're tucking it all under a specific class. But for what I'm talking about a singleton is actually totally unnecessary, because you shouldn't have any shared state - so there's no need to actually create a singleton or anything like that, you could instantiate new versions all day long and it wouldn't change anything.

QuarkJets
Sep 8, 2008

Fender posted:

You know, as I was typing that out I was thinking about how pointlessly opinionated it sounded. I was mostly just piping up because it's how we blended our functional programming desires with our older OOP code. And for some reason it was universally loved.

Just put all your methods in a class and instantiate it at the end of the module, and then import that. Not a perfect singleton, but it was essentially the same thing for this purpose. You can make it purely functional, or, since it's versatile, you can refactor older code into one of these and it can keep doing whatever OOP nonsense it might still be required to do, or maybe you have to pass it around somewhere. Whatever. But if you stick to the pattern, you can do whatever wild crap you have to do in these supporting functional modules and it always looks the same and acts the same. Nothing earth-shattering, but mildly topical, and I (like everyone else there) really liked it, so I preach.

I'm sorry I still don't understand.

What you're describing sounds like this?
Python code:
# myfile.py
# imports go here

class AllTheThings:
    @staticmethod
    def some_great_method():
        print('butts')

    # And then some other methods that do other things
    ...

great_methods = AllTheThings()
And in this example if you wrote some code that needed to use "some_great_method" then you'd import "from myfile import great_methods"? And you'd call "great_methods.some_great_method()"? Did I misread?

  • Reply