Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

spiritual bypass posted:

Data ingestion is a common problem and CSV is always the solution

Speaking of which, and this isn't really a Python question, but is there an open source CSV editor that has some of the functionality of excel without all the...overhead? It'd be nice to find something I can use to just muck around quickly with tabular data without having to constantly be like 'no, don't accidentally make it an xlsx, stop formatting it' etc - not that it's a super big problem or anything.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

DoctorTristan posted:

VSCode and PyCharm both have plugins that make csv editing less painful

Maybe this is the best answer, I'm not working with massive csvs or anything right now.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

BUUNNI posted:

I've been copying and running all the code you guys have provided and it's interesting to see how different every answer is.

It's worth noting this phenomenon, because this is a very important lesson for two reasons:

1. There's no such thing as 'the only way to do something', even in opinionated languages. Software design is a bit of an art form, and so you can fulfill the same requirement in a bunch of different ways. That being said...
2. Writing code is not the hard part of professional software development; requirements are the hard part. It can be jarring if you've only ever worked on personal projects or school projects that are highly structured, but in business, you'll often be dealing with people who deliver very vague requirements, and they might simply not have the context to understand what's missing. This is an incredibly important part of software development, and is one of the major skills to pick up as you progress in your career. You can tell two devs that you want X, and they might deliver two ENTIRELY DIFFERENT SOLUTIONS because they both are going to fill in the blanks on what you asked for based on their own judgment.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Anyone have any experience with one of the Python-only frontend frameworks like Anvil or JustPy? I have some internal tooling I'd like to build with not a lot of time, and all things being equal, it'd be nice to not have to learn JS and a modern frontend library. The alternative would just be a bunch of CLI stuff using Click/etc, which isn't awful (or Textual if I decide to get really fancy) but it'd be nice to be able to just link someone to a URL.

Like sure, at some point I'll come back to it but I'd like to be able to slap together some basic UIs in front of some automation/diagnostic stuff. My only real requirements would be something that lets you build the whole UX in Python (so not just Django/Flask), and ideally something with a decent default style setup so stuff looks reasonably 'built this century'.

Falcon2001 fucked around with this message at 08:11 on Nov 19, 2023

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I asked over in the CI/CD thread, but asking here as well: Anyone have experience with implementing Jupyter Notebooks as operational runbooks? Netflix apparently uses this, as do some other teams internally, but there's enough odd edges around it that I figure I should keep asking for advice.

Some minor notes:
- Our infra is mostly internal and proprietary (FAANG company) so we can't use most of the out-of-the-box integration stuff I see in other 'ops runbook' software/services. We have Python clients for all of it, which is where Jupyter started looking pretty positive.
- I'm reasonably confident I can tie it to organizational goals; we have a bunch of 'improve runbooks' and 'better impact tracking' goals that are all pretty reasonable.
- The sorts of runbooks I'm looking to implement would mostly be incident response stuff at first, focusing on automating the 'gather information' section first and then allowing incremental moves towards more advanced runbooks/etc.
- We would be hosting some sort of centralized JupyterHub or other solution for this, rather than individual dev envs.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

Using git with jupyter notebooks is a huge clusterfuck, that's enough to keep me away from trying to use notebooks for ops

Ugh yeah, I saw this. I heard there's a plugin or something to store them in a more sane format but haven't dug into it yet.

Edit: Really, the problem here is that I need about half of the stuff Jupyter provides, so maybe it's better to just say that I'd like a platform that does the following:
- Web UI (so you can link to a specific runbook/etc from tickets or other documentation, but also so we have a centralized controlled environment for this, no risking individual devs not having their pipelines updated/etc)
- Must allow for some form of authentication. I'll have to figure out how to tie in our internal auth systems no matter what, so I assume I can hack in most stuff as long as it's not hardcoded to things like GitHub/etc.
- Can execute Python code inline with some form of documentation, Markdown preferred.
- Has the ability to set cells to read-only, while others can be writable (Doesn't matter if you can overcome the read-only via override steps, I just want to ensure there's a basic guardrail against accidental changes.)
- Must be OSS / Self-hostable
- Readable diffs and source control

It's the arbitrary code execution portion that's the trickiest, because most documentation platforms are going to be extremely wary of getting anywhere near that, for obvious reasons.

I looked at a couple of platforms, and while there are OSS options, a lot of them assume an agent setup/etc for the infrastructures they support, and we basically already have solutions for most of that; monitoring/etc are all solved problems that are not really a good idea to try and replace internally. Fiberplane and Runme.dev both basically fall into this bucket.

I know some internal teams are already doing this, so I'm going to get some time with the people running their instances and get some feedback on the pros and cons they've run into and how they've solved them.

Falcon2001 fucked around with this message at 22:37 on Nov 26, 2023

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

sugar free jazz posted:

I'm trying to automate some analyses in UCINET6 on Windows to streamline a data analysis process and have no idea what to do at this point since this kind of automation isn't something i'm really experienced with. I've fiddled with powershell scripts, python, and AHK, but I can't get it to actually do anything after opening the application and I'm not sure how to even go about figuring that out outside of just emailing the guys who wrote it since it's a relatively small, technical piece of software and I'm not actually good at this type of programming. Any suggestions on what kind of avenues to go down to start figuring this out in Python?

If you're looking for a really hacky solution, then you could use something like https://pyautogui.readthedocs.io/en/latest/ but frankly that's not a particularly effective way of approaching the problem unless this is a personal workflow you can hand-manage.

I would look at whether there's any way to run this from a command line/etc or otherwise invoke the program in such a way that you can tell it to do something consistently. I don't see a documented setup for that from a quick google search, so reaching out to the support DL might not be a bad idea.

I'd approach this from the perspective of 'how can I make this program do its analysis with no human input, given appropriate files/etc in the right places'. Once you figure that out, Python is an excellent language for setting up the files/etc in the right places, invoking UCINET, and then analysing/manipulating the output.
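
If it does turn out there's some way to kick it off from a command line, the Python side would look roughly like this (the exe path and flags are entirely made up - you'd need whatever UCINET actually supports):

Python code:
import subprocess
from pathlib import Path

input_file = Path("C:/Data/ucinet_in/network.csv")    # hypothetical staging location
output_dir = Path("C:/Data/ucinet_out")
output_dir.mkdir(parents=True, exist_ok=True)

# Hypothetical invocation - check what flags (if any) UCINET's exe actually accepts
result = subprocess.run(
    ["C:/Program Files/UCINET/ucinet.exe", "--analysis", str(input_file), "--out", str(output_dir)],
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout)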

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

saintonan posted:

Is there a tutorial somewhere on automating web scraping where I need the automaton to log in to an account, navigate to a page, then manipulate a GUI to produce a series of tables that the automaton scrapes and saves as a csv? I did this years ago using iMacros, but figured there'd be a relatively straightforward Python/Chromium solution.

Look into Selenium, it's the go-to for this sort of thing.
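
Something roughly like this - the URL and element selectors are placeholders for whatever the real site uses:

Python code:
import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # Selenium 4.6+ can fetch a matching chromedriver for you
driver.get("https://example.com/login")  # placeholder URL

# Placeholder selectors - inspect the real page to find the right ones
driver.find_element(By.NAME, "username").send_keys("me@example.com")
driver.find_element(By.NAME, "password").send_keys("hunter2")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# ...navigate / click whatever produces the tables...

rows = driver.find_elements(By.CSS_SELECTOR, "table#results tr")
with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for row in rows:
        writer.writerow([cell.text for cell in row.find_elements(By.TAG_NAME, "td")])

driver.quit()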

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

sugar free jazz posted:

this is in fact a personal workflow i can hand manage! pyautogui also feels very familiar since i've done some screen scraping with selenium, so this will hopefully work. wish it could be done quietly but don't let the perfect be the enemy of the good etc. appreciate it!

FWIW there's nothing wrong with hacky/scrappy solutions, as long as you go in open eyed and aware of the limitations. I'm a big fan of weird lovely scripts, because the alternative is often 'nothing gets even halfway automated'.

The trick is knowing when to say 'okay cool this has gotten too big, this needs to be refactored properly', which often means not letting management know about your scrappy bullshit, because they often don't know the difference.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Here's a sort of minimal example:

Python code:
from pathlib import Path
filepath = Path("C:/Code/Butts/buttfile.txt")

def load_file_to_str(filename: Path) -> str:
    # This will load a text file (assuming default decoding) and return it as a big string
    with open(filename, 'r') as txtfile:
        return txtfile.read()

print(load_file_to_str(filepath))
This would return the whole file as a single string (line breaks included) and then print it out, which should basically match what's in your text file. There are other methods too, such as .readlines(), that split it on line breaks instead.

I. M. Gei posted:

y'all I am very bad at programming, and my brain is the kind that's not good at filling in blanks when it's told to do something.

https://automatetheboringstuff.com/#toc is what I would recommend if you're coming at this from a non-programmer perspective. This book is free online and is basically written to say 'Hey, you're not a developer, but you'd like to automate simple boring things', and I found it extremely helpful when I was getting started.

Falcon2001 fucked around with this message at 23:22 on Nov 29, 2023

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Oysters Autobio posted:

The Python front-end world seems pretty experimental to me at this point. Would absolutely love to be proven wrong because as someone who loves Python but also loves UI/UX and data viz, I'm sort of just delaying learning any JS frameworks.

Are Jupyter notebooks at all in your POC here? Cause there's some half decent looking Vuetify extension for ipywidgets.

Otherwise the only other one I'm familiar with is Plotly Dash which was made for data science dashboards rather than any UI.

Seems all the cool kids who want to avoid the JavaScript ecosystem are using HTMX + Flask/Django. Dunno how well that would work for an internal tool.

Jupyter is on my list but more as a different project I posted about in here before for automating runbooks. I'll take a look at HTMX again and see about using that.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I had to use a third party regex library that allows for overlapping results; I had the exact same issue.
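
I don't remember the exact one, but the third-party regex package (literally named regex on PyPI) supports this via an overlapped flag that the stdlib re module doesn't have:

Python code:
import regex  # pip install regex - the third-party module, not the stdlib re

text = "ababab"
# re.findall would only return the non-overlapping match here
print(regex.findall(r"aba", text, overlapped=True))  # ['aba', 'aba']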

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Anyone familiar with Click know if there's a way to implement argument parsing into a shared dataclass or something of that nature?

Basically I have code that looks like this - it's a CLI that's mostly invoked via runbook / change management process and has a lot of required arguments and the signatures are starting to get ridiculous; I think we're up to like 10 common options, but individual commands need unique ones as well for the given workflow.

Python code:
@cli.command(name='example_cli')
@click.option('--common-one')
@click.option('--common-two')
@click.option('--common-three')
@click.option('--unique-one')
def launch_command(
    common_one,
    common_two,
    common_three,
    unique_one
):
    ...
You can group several options into a decorator, so I was already working to do the below:
Python code:
@cli.command(name='example_cli')
@common_options
@click.option('--unique-one')
def launch_command(
    common_one,
    common_two,
    common_three,
    unique_one
):
    ...
However, what I'd really like to do would be to build the common options into a shared dataclass or something, so that the individual functions are all a lot cleaner looking and don't have a ton of duplicated code in the signatures/decorators.

Python code:
@cli.command(name='example_cli')
@common_options
@click.option('--unique-one')
def launch_command(
    common_options: CommonOptions,
    unique_one
):
    ...
I can't seem to figure out a way to handle this without overriding a ton of the core Click logic. Anyone familiar with this have any suggestions?

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Zoracle Zed posted:

have you read the building a git clone documentation? I've used that approach a lot, where the base command has the @pass_context decorator and stashes a custom dataclass in ctx.obj, and then the subcommands have the @pass_obj decorator to receive the dataclass object directly.

I'll take another look. I thought this method involved having group options which necessitated breaking your cli args up in a weird way ex: (command group --opt-one subcommand --opt-two) but I might have misread it.

I'd like to keep all the actual command parsing and opts at the lowest level so the help etc doesn't get janky, but I'll reread that in depth
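
For my own notes, I think the shape being described is roughly this (untested sketch with made-up option names):

Python code:
from dataclasses import dataclass
import click

@dataclass
class CommonOptions:
    common_one: str
    common_two: str
    common_three: str

@click.group()
@click.option('--common-one')
@click.option('--common-two')
@click.option('--common-three')
@click.pass_context
def cli(ctx, common_one, common_two, common_three):
    # The group just collects the shared options and stashes them on the context
    ctx.obj = CommonOptions(common_one, common_two, common_three)

@cli.command(name='example_cli')
@click.option('--unique-one')
@click.pass_obj  # receives ctx.obj (the CommonOptions instance) as the first argument
def launch_command(common_options: CommonOptions, unique_one):
    click.echo(f"{common_options.common_one} / {unique_one}")

if __name__ == '__main__':
    cli()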

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Zoracle Zed posted:

yeah you end up with an api like

code:
butt --fart loud --n-turds 3 poo poo --no-flush
where the base 'butt' command doesn't really do anything except collect the common arguments and the subcommand 'poo poo' does the actual work and accepts the specialized args. I think git is the only real-world cli I use that works like that, but i find it pretty natural, ymmv

Yeah I mean maybe that is fine. I might just avoid this whole thing for now, the feature I'm working on has enough other stuff going on I shouldn't try and fix this.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Jose Cuervo posted:

Is there a way to have a Python script run automatically when a file (or really a set of files) is placed in a folder? I think I have to set up some sort of process which listens(?) for files being placed in the folder, and then once that is detected I can trigger the script, but I don't know what keywords to look for to start researching the solution.

Fuller story:
I have a script that cleans / formats 4 or sometimes 5 different CSV files. This process occurs 4 times during the first four days of the month. Right now I am the only one who runs the script, and the two people who rely on the output are not able to run the script. I want to avoid running into an issue where I am sick (or hit by a bus) and they are unable to access the data they need. I want to be able to designate a folder to them, and have them copy-paste the raw CSV files into the folder. Once they have done that, ideally something(?) would start the script running to clean / format the CSV files, and output the cleaned / formatted files into a different folder.

Easiest option is to have something running constantly in the background with a sleep timer of some duration; it wakes up, checks for files in the folder/etc, and takes action if appropriate. There is file-watching stuff, but it's finicky from what I've read previously.
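
A bare-bones sketch of the polling version (the folder paths are placeholders; the watchdog package is the usual answer if you want real filesystem events instead):

Python code:
import time
from pathlib import Path

watch_dir = Path("C:/Data/incoming")     # hypothetical drop folder
done_dir = Path("C:/Data/processed")
done_dir.mkdir(parents=True, exist_ok=True)

def clean_and_format(csv_path: Path) -> None:
    ...  # your existing cleaning/formatting logic goes here

while True:
    for csv_path in watch_dir.glob("*.csv"):
        clean_and_format(csv_path)
        # Move the file out of the watch folder so it isn't processed twice
        csv_path.rename(done_dir / csv_path.name)
    time.sleep(60)  # wake up once a minute and check again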

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

The Iron Rose posted:

My team of devops/platform/infra engineers are building and distributing our first internal Python library, which is going to provide helper functions and custom span processors for opentelemetry. We’re using semantic versioning and are publishing to an internal shared pypi repo.

Unfortunately, we don’t have a lot of SWEs on the team! We’ve got one SWE who comes from a Java background, and I’m a comfortable programmer but come from an ops background. The rest of my team are largely ops people who write scripts here and there.

What are some best practices we should consider as we transition from a team that maintains lovely little lambda functions to a team providing a library our production services will depend on?

We’ve got semantic versioning, conventional commits, and some basic unit tests so far. I’m pushing for choosing a shared linter and enforcing that in our gitlab CI pipelines and precommit hooks (which incidentally I’ll happily take recommendations on). We’ve got a mix of pycharm and vs code users.

What are some other best practices we should consider? Both for library distribution to the organization, but also policies you feel are helpful for improving code quality.

A shared linter and enforcement is good; I highly recommend it. Our system works by using Black as our formatter (run as a command in the environment, not in your IDE), and the enforcement is 'did you run Black and get no changes? Then you can proceed.'

My best practices recommendations would be to think about it from your customer's point of view. Some things I'd consider:
  • Semantic Versioning is good, but you should specifically consider your API as needing to be backwards compatible outside of certain changes; most software uses major versions to denote times to make breaking changes. So 2.5 -> 2.6 could add features, but you would only make a breaking change (like removing something) going from 2.x -> 3.0, for example.
  • This goes double if your company's build system would auto-update software/etc. You need a way to ensure that you can make breaking changes without accidentally breaking builds/production/etc.
  • On that topic, if your security team doesn't handle this you might want to consider a way to identify and notify teams if an older version has some sort of vulnerability and they're still using it.

Generally just think about 'If I was depending on a library, what would piss me off the most?' and that's a good greatest hits list of things to solve for. Otherwise the rest all sounds good and you're thinking through the major code quality stuff already.

Ninja Edit: You may want to consider including MyPy in your package and the linter/enforcement checks, or at least evaluate if it's appropriate. MyPy raises the bar significantly on your code quality by basically faking type enforcement, but because of the intricacies of Python and the libraries you use, this is sometimes a MASSIVE amount of work.

If you don't use it, my recommendation for any professional Python dev is to approach typing like you're working in Java or C# or another statically typed language and use type hints/etc everywhere possible. I think this is the single biggest easily-avoided Python footgun: You can churn out lovely, unreadable code incredibly quickly; with type hints you can at least help discovery on otherwise lovely code. Forcing yourself to do things like type hints and docstrings/etc are all going to mean that your code is easier to read later, which is the most important single thing for high quality code. That's not to say it's the only thing - but from an effort perspective it's such a huge return on investment.
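
As a tiny, made-up illustration of the difference:

Python code:
from dataclasses import dataclass

# Without hints, you have to go read every call site to figure out what 'span' even is
def duration(span):
    return span["end"] - span["start"]

# With hints (and a dataclass instead of a bare dict), the signature documents itself
@dataclass
class Span:
    start: float
    end: float

def duration_hinted(span: Span) -> float:
    return span.end - span.start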

Falcon2001 fucked around with this message at 19:42 on Dec 14, 2023

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Oysters Autobio posted:

Any good resources on managing ETL with Python?

So, I did something vaguely related before, and we didn't use any off-the-shelf stuff. (Probably because as mentioned this wasn't precisely the same thing.) We were combining multiple types of data sources to form a unified 'source list' for use in an auditing application, basically - so more ET and very little L. So here are the useful things I learned from that project, and maybe it's useful to know about. It's also entirely possible I'm full of poo poo so if others have more context and I'm wrong, speak up.

  • Most important: work backwards from the requirements to figure out how the data should look at the end. If you don't have a set of requirements, talk to whoever's going to use the data. If you don't have a clear list there, sit down yourself before you start and write up a list of requirements. This will help you avoid some of the more insane data spaghetti I've seen before.
  • Make sure to normalize your data. This is, from what I've seen, a foundational part of ETL systems in the transform step and also a general programming practice, but it's extremely important when dealing with dissimilar data sources. For example, if you're combining data from System A and System B, you should end up with an object that allows you to directly compare the two. This means that you might have to basically make a brand new schema for the objects that allows you to transform things together (there's a rough sketch of this after the list).
  • Another way of putting this is that I wouldn't recommend treating Source A as a subclass of TargetData and Source B as another subclass; instead, you should have multiple instances of TargetData, each with an attribute that identifies the source.
  • In contrast, it may be useful to have metadata/other source data available. For example, we had a source_data attribute that contained a flexibly structured JSON blob of the raw source data. This is likely to be the sort of thing that's highly domain dependent - you can argue that ideally you shouldn't need to preserve any of this, but I suspect that's also a Perfect World argument and doesn't really pan out. I would recommend making this consistent to find, so you're always looking in a certain place, and from there you can query it when required. Ideally, you can use this to identify places you're having to do these edge cases and work them back into your basic schema instead.
  • ETL is a domain that I think highly benefits from a more functional programming approach instead of a classic OO style - build your transform pipelines and other actions from atomic functions that get chained together and try and avoid having to worry about state as much as possible. If necessary, I'd try and keep it high level like a state machine or something of that nature instead of more flexible mutable objects.
  • Coincident with the previous suggestion, for god's sake please use type hints as much as possible. If your data requires moving complex stuff around, use objects such as dataclasses instead of complex dictionaries. Functional approaches move a LOT of data around from function to function quickly, and one of my biggest nightmares was this massive complex nested dictionary the previous dev had implemented that we just passed around everywhere and had to throw print statements at to figure out what the hell was in it.
  • Come up with a way to test this early on and make sure it stays up to date. Unit testing is great, but you also need a way to run your workflow against sample source data with a debugger attached. This will absolutely save you a shitton of time later on.
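
To make the normalization point a bit more concrete, here's the rough sketch mentioned above (field names invented for illustration, not our actual schema):

Python code:
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TargetData:
    # The single normalized shape that every source gets transformed into
    name: str
    owner: str
    source_system: str                                          # which system this came from
    source_data: dict[str, Any] = field(default_factory=dict)   # raw blob kept around just in case

def from_system_a(raw: dict[str, Any]) -> TargetData:
    # One small, pure transform function per source
    return TargetData(
        name=raw["resourceName"],
        owner=raw["ownerEmail"],
        source_system="system_a",
        source_data=raw,
    )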

Falcon2001 fucked around with this message at 21:39 on Dec 23, 2023

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Oysters Autobio posted:

I am a bit stuck in tutorial-hell on this project at the moment and struggling to start writing actual code because I don't really know what the end state should "look" like in terms of structure.

(Again, not a super professional at this, but here's some general advice for high level stuff.)

For the end state, you should again figure out what your requirements are - what will it be used for, where does it go, etc. Get feedback on this phase from the people who will use the data. From there, sketch out a basic structure for how you'd like the data to look that fits with wherever it's being stored. I wouldn't over-worry about perfection, just get something to start with as a target goal. You can iterate as you work through the design and implementation process.

From there, you have a starting point (the data comes in as X) and an end point (I want the data to all look like Y); put those down in a flowchart (Draw.io or whatever) and then start adding steps in between. At this phase you're literally just trying to make sure you cover the bases and then validate that the transitions work/etc. Get advice from others at this stage for sure; for this flowchart that would mostly be domain experts, so other devs, but also anyone who would catch that certain things aren't happening.

From there you have a general design. Designs aren't set in stone, they're designs, but the point is to try and figure out as much as possible before you start writing big chunks of code you might have to throw away.

After all that, you can start coding. (You can do some of your design in pseudocode or code if it makes you more comfortable but don't start writing functions/etc yet). Hopefully that helps get you through this phase.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I have a weird question. I'm working with a system that holds formulas in text and I'd like to interpret those myself.

So something like:
code:
( Date1-CPU / Date2-CPU ) * OtherFactor / (Factor1 / Factor2)
Extracting the 'words' conveniently isn't too bad, but the actual formula could be different from place to place. Is eval the right option here? I've only heard footgun noises about that.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

12 rats tied together posted:

I would probably build a lookup table of symbol to function and then use some combination of operator.methodcaller / operator.attrgetter to call them explicitly after parsing the text.

This seems like a bad idea, but I'm sure you know that already. :cheers:

The upside is that this is a script that calls an expected API, so we're not taking in like...unvalidated user input from the internet or anything, so the security implications are less dire.
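
If I do end up steering clear of eval, my rough thinking is something like this (untested sketch; it assumes the variable values get substituted into the text first, longest names first):

Python code:
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def eval_formula(text: str, values: dict[str, float]) -> float:
    # Substitute variable names (longest first, since things like Date1-CPU aren't valid identifiers)
    for name in sorted(values, key=len, reverse=True):
        text = text.replace(name, repr(values[name]))

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"Unsupported syntax: {ast.dump(node)}")

    return _eval(ast.parse(text, mode="eval"))

print(eval_formula("( Date1-CPU / Date2-CPU ) * OtherFactor",
                   {"Date1-CPU": 10, "Date2-CPU": 4, "OtherFactor": 2}))  # 5.0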

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Oysters Autobio posted:

Question for experienced folks here. I'm not a dev, just a data analyst who started picking up python for pandas and pyspark.

Past few months i feel like beyond initial basic tutorials and now using Python basically for interactive work in a jupyter notebook, I'm sort of stuck again in tutorial hell when it comes to wanting to build better skills at making python code and packages for *other* people to also use.

Sort of a mix between analysis paralysis (of too many options and freedom) and good hefty dose of imposter syndrome / anxiety over how other "real" developers and data scientists at work will judge my code.

I know it's all irrational, but one practical part I'm having trouble with is the framing and design part of things. Having ADHD doesn't help either because it's easy to accidentally fall into a rabbit hole and suddenly you're reading about pub/sub and Kafka when all you were trying to lookup was how to split a column by delimiter in pandas, lol. Outside of Python I turn heavily to structure, to do lists, framing / templates and the like to stay on track, but I'm having trouble applying any of that for Python work.

For example, I love design patterns, but all of them seem to be oriented towards "non data engineering" related problems or I can't figure out how to apply them with my problems. Like, I love the idea of TDD and red/green but have no clue how I would build testing functions for data analysis or ETL.

I can't seem to find more opinionated and detailed and opinionated step by step models for creating data pipelines, ETL or creating EDA packages to generate reports. A lot of stuff feels like the "step one draw two circles, step two draw the rest of the loving owl".

A lot of just venting here so please ignore if it's obnoxious or vague, but any advice or thoughts on this would be great.


Wanted to jump back to this question, as I was in a somewhat similar place, and definitely have the same ADHD coping mechanisms and problems. Here's some notes:

  • Software development is, at its core, all about combining building blocks of code together to do things. Oftentimes when you run into that problem of 'Step 1: Circle, Step 2: Rest of the loving owl' it ends up being that you're missing lower context, as a good tutorial doesn't cover all the messy middle bits most of the time. Once you have more stuff in your personal toolkit, these jumps will get less and less frustrating and these docs/articles/etc will be more useful as they cover the context and goals over just handholding you through the entirety of the process.
  • To address that last point, I would recommend working on foundational software dev stuff. For me, code katas such as Codewars or Leetcode were the real clicking point for my ADHD brain, so I'd recommend you try them out - that or something like Advent of Code, just anything that gives you a reasonably small problem space to work on.
  • Design patterns in particular are not exclusively a non-data-engineering thing, although many of them just aren't applicable. However, that's true of any software design challenge - the patterns from Gang of Four aren't universal to every application, and also aren't universal truths - they're more like 'there's a million ways to cook a hamburger, but here are four really solid ones that you might serve at this restaurant'. Don't treat them as dogma, treat them as tools that may or may not be relevant.
  • If you're like me, you won't understand some of these things (my example is dependency injection) until you actually make some mistakes (for me, using singleton pattern everywhere and then trying to figure out how to test my mess). That's okay and good, and part of the learning experience, and you'll make fewer and less impactful mistakes as you learn.

I managed a piece of software that was similar to an ETL pattern, so I can give some general advice, but there's also been some talk about it in this thread if you go back a few pages.
  • For pipelines like this, I recommend trying to follow functional programming ideas more than OOP ones. What this means in general is 'avoid using classes for stuff other than data structure and storage as much as possible'. That's not a commandment btw, just a recommendation - I would say that in my service most of the classes with methods were just for holding shared behaviors to apply (different sorting methods for example) rather than the classic OOP 'load up an object with data and methods and then mutate the state around a bunch'. The biggest reason is that pipelines, by their structure, hold up very well to classic FP approaches because you expect that for a given input you'll always get the same output; there's no user input/etc to have to sidestep or integrate along the way.
  • Side note: if you're just doing the 'well I have a set of data I want to keep in a class instead of a dictionary' thing, that's a great idea, and dataclasses are your friend (or one of the fancier versions like Pydantic/etc) - the point is that these are essentially just a way to hold data in a structured way for access, and you shouldn't have a lot of class methods that mutate or do actions on that data in that class. Again, not an absolute approach, but a recommendation. I often have little helper properties in classes like that for deterministic stuff - for example, if you're holding information about an order, you might have properties to calculate things like 'total cost' if that isn't already part of your dataset. The trick is that this should be deterministic (ie: total cost = obj.tax_price + obj.line_item_price or something like that).
  • I'll dig up my post from a bit ago with more details on this. Edit: https://forums.somethingawful.com/showthread.php?threadid=3812541&userid=65908&perpage=40&pagenumber=6#post536720636

Falcon2001 posted:

So, I did something vaguely related before, and we didn't use any off the shelf stuff. (Probably because as mentioned this wasn't precisely the same thing.) We were combining multiple types of data sources to form a unified 'source list' for use in an auditing application, basically - so more ET and very little L. So here's the useful things I learned from that project, and maybe it's useful to know about. It's also entirely possible I'm full of poo poo so if others have more context and I'm wrong, speak up.

  • Most important: work backwards from the requirements to figure out how the data should look at the end. If you don't have a set of requirements, talk to whoever's going to use the data. If you don't have a clear list there, sit down yourself before you start and write up a list of requirements. This will help you avoid some of the more insane data spaghetti I've seen before.
  • Make sure to normalize your data. This is, from what I've seen, a foundational part of ETL systems in the transform step and also a general programming practice, but it's extremely important when dealing with dissimilar data sources. For example, if you're combining data from System A and System B, you should end up with an object that allows you to directly compare the two. This means that you might have to basically make a brand new schema for the objects that allows you to transform things together.
  • Another way of putting this is that I wouldn't recommend treating Source A as a subclass of TargetData and Source B as another subclass; instead, you should have multiple instances of TargetData, each with an attribute that identifies the source.
  • In contrast, it may be useful to have metadata/other source data available. For example, we had a source_data attribute that contained a flexibly structured JSON blob of the raw source data. This is likely to be the sort of thing that's highly domain dependent - you can argue that ideally you shouldn't need to preserve any of this, but I suspect that's also a Perfect World argument and doesn't really pan out. I would recommend making this consistent to find, so you're always looking in a certain place, and from there you can query it when required. Ideally, you can use this to identify places you're having to do these edge cases and work them back into your basic schema instead.
  • ETL is a domain that I think highly benefits from a more functional programming approach instead of a classic OO style - build your transform pipelines and other actions from atomic functions that get chained together and try and avoid having to worry about state as much as possible. If necessary, I'd try and keep it high level like a state machine or something of that nature instead of more flexible mutable objects.
  • Coincident with the previous suggestion, for god's sake please use type hints as much as possible. If your data requires moving complex stuff around, use objects such as dataclasses instead of complex dictionaries. Functional approaches move a LOT of data around from function to function quickly, and one of my biggest nightmares was this massive complex nested dictionary the previous dev had implemented that we just passed around everywhere and had to throw print statements at to figure out what the hell was in it.
  • Come up with a way to test this early on and make sure it stays up to date. Unit testing is great, but you also need a way to run your workflow against sample source data with a debugger attached. This will absolutely save you a shitton of time later on.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Oysters Autobio posted:

This is really fantastic info, thanks so much for sharing. Your point about functional programming makes a lot of sense, and I think is probably key to some of my problems finding applicable ETL and data engineering related patterns. You've already provided way more than enough but if you or anyone has some good resources for functional programming for data engineering in python that'd be great.

So unfortunately I don't have any particularly good single sources of advice, because if you search 'functional programming' you're going to run into a lot of intense and opaque vocabulary. The basic idea of functional programming is that you're just connecting functions together to form your program, with no main 'driver code' or anything like that, and with absolutely no state management.

Realistically this...is really hard to wrap your head around. You can see more about this in the video below:

https://www.youtube.com/watch?v=nuML9SmdbJ4

The for loop example is a great example of why pure FP is kind of batshit crazy to most seasoned software devs. Sure, we can probably understand it but why the gently caress would you do it that way?

The important point is to take away the good parts of functional programming - specifically that building your code out of atomic functions is clean and effective - and leave the really insane poo poo behind. Python lets you treat functions as first class objects that can be passed around/etc, so it's a very good language for this approach - you can also just dump functions into a file and access them without even the sort of 'static classmethod' approach that Java/C# uses. Use enough OOP approaches to maintain some sane level of state that makes your code readable and quickly understandable, but avoid complex and hefty objects that mutate over the course of your codebase as much as possible.
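
As a small, made-up example of what chaining atomic functions looks like in practice:

Python code:
from functools import reduce
from typing import Callable

# Each step is a small pure function: same shape in, same shape out, no hidden state
def normalize_name(record: dict) -> dict:
    return {**record, "name": record["name"].strip().title()}

def add_total(record: dict) -> dict:
    return {**record, "total": record["price"] * record["qty"]}

def pipeline(*steps: Callable[[dict], dict]) -> Callable[[dict], dict]:
    # Compose the steps left to right into one callable
    return lambda record: reduce(lambda acc, step: step(acc), steps, record)

clean = pipeline(normalize_name, add_total)
print(clean({"name": "  bob smith ", "price": 2.5, "qty": 4}))
# {'name': 'Bob Smith', 'price': 2.5, 'qty': 4, 'total': 10.0}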

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Fender posted:

You know, as I was typing that out I was thinking about how pointlessly opinionated it sounded. I was mostly just piping up because it's how we blended our functional programming desires with our older OOP code. And for some reason it was universally loved.

Just put all your methods in a class and instantiate it at the end of the module and then import that. Not a perfect singleton, but it was essentially the same thing for this. You can make it purely functional or it's versatile and you can refactor older code into one of these and it can keep doing whatever OOP nonsense it might still be required to do, or maybe you have to pass it around somewhere. Whatever. But if you stick to the pattern you can do whatever wild crap you have to do in these supporting functional modules and it always looks the same and acts the same. Nothing earth shattering, but mildly topical and I (like everyone else there) really liked it so I preach.

Yeah, I mean you're essentially doing a from module import * sort of deal, except you're tucking it all under a specific class. But for what I'm talking about a singleton is actually totally unnecessary, because you shouldn't have any shared state - so there's no need to actually create a singleton or anything like that, you could instantiate new versions all day long and it wouldn't change anything.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Slimchandi posted:

Have you considered running WSL2 instead? I use it for dev with poetry and works great.

Why would you set up a whole OS layer for Python installs? Like, WSL is great and all, but there's nothing about WSL you can't do with Windows as far as Python installs and stuff go.

Edit: Forgot the person you were replying to specifically said they were familiar with Linux. Ignore me.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Well I'll weigh in then. On my home computer I use exclusively Windows for dev. I've used both poetry and venv without any trouble, and have used pyenv to manage my various Python base installs too. That one took some extra setup but still worked pretty well.

I dev on Linux at work so I guess I could have used WSL, but it just seemed unnecessary. All the confusing parts for me were getting VSCode to find the install, which isn't an OS-specific problem as far as I can tell.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I dunno, sounds like a container might be perfect if you have weird DLLs and stuff. No experience with building that into venvs etc.

If a solution works it's probably not that stupid.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

It's easy to say that anything is bad if you "overinvest" in it, that's not a real argument. Writing concise function names and concise docstrings are fundamental coding skills, they're as important as knowing how to define a function in the first place, and you get better at those skills with practice and review. You're not doing a new developer any favors by letting them skimp

I agree with this. I also agree that it's fine for them to backfill that during code review. I think it's challenging for noobs to really realize how frustrating debugging or improving code without docstrings etc can be, especially in Python where type hinting is optional.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

CarForumPoster posted:

Stupid simple option if they’re grouped in order
Get list of your categories
Save it as a CSV
Use the category list to break it into chunks for each category.
Make each chunk df, adding a column for the category
pd.concat

Otherwise I’d try to read the excel spreadsheet such that the index is ordered numerically and get the indexes that contain the categories (eg by filtering to where the other columns have nans) then just map the category to its applicable data rows between each category

I agree. IMO trying to do this in pandas is the wrong approach; part of your ETL pipeline is sanitizing and standardizing your data before loading it, so you should just handle that in code first before feeding it over to pandas. Especially if this is an ad hoc spreadsheet. Here's an extremely naive example:

Python code:
from csv import DictReader

with open('file_path', 'r') as csvfile:
    reader = DictReader(csvfile)
    data = [row for row in reader]

out_data = []
current_category = ""
for row in data:
    if row['category'] and not (row['names'] or row['dates']):
        current_category = row['category']
    else:
        row['category'] = current_category
        out_data.append(row)
# Assumptions made:
# the category identifier row always shows up before the rows in that category

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

Other comments (comments that aren't basically a docstring) may or may not be necessary. Comments can be useful for explaining some tricky behavior, but they shouldn't be there if the behavior is obvious.

I've heard the most common case for this is new devs who add comments to everything except the parts that don't make sense, so you have stuff like # iterate through list.

Personally, the best explanation for comments I've found is that comments should tell you why the code is written this way, not what the code is doing - unless your code has some reason to be unreadable, in which case, sure... but if I'm reviewing your code, you should explain why you had to do that. The why should be more like business logic, or the reason you chose a particular approach over another, and not simply an explanation of basic algorithms.

Vulture Culture posted:

I agree that it should usually be a separate pipeline step, but you should feel fine doing whatever is idiomatically familiar to the people who need to maintain it while being reasonably performant

Yeah, I suppose 'the wrong approach' was too opinionated on my part; I agree with the rest of your post.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

DoctorTristan posted:

Personally I’m extremely anti using pandas in anything even resembling a pipeline since the devs absolutely love introducing breaking changes.

It shouldn't matter as long as you aren't blindly merging new changes to production without testing. If you are, pandas isn't the only thing that's going to screw your day up.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
So I write a fair amount of CLI tools, and I'm getting a little frustrated by the debug pattern for that in VSCode, was wondering if anyone had any advice or tips. Let me explain a bit.

So let's say I have a series of different scenarios I need to run - either in a runbook or just for manual testing/debug purposes. In VSCode, each and every one of those needs to be its own JSON blob, which requires going into the file, finding the right JSON dict, and editing or copy/pasting it.

From a human perspective, these are just...multiple strings, really. cmd --arg_one x --arg_two y, etc. They're reasonably easy to swap things around between/etc; but in debug configurations, each one has to be a separate JSON list of strings, which is just kind of a pain in the rear end extra step. Also the debug name dropdown just...is too small? So the names are cut off? I realize this is a minor complaint but it's kind of just one more thing going on.

Now, I can just go to where my argparser is parsed and then construct a fake one using a scratch driver script or something like that, but that has its own downsides: namely that it gets a lot trickier if I'm using Click or another framework, and also that I could accidentally fail to check some odd argparser edge case.

Is there any better way of handling these? An extension or something that would improve this behavior? I'd frankly love if I could start a debug session and have VSCode prompt me for the args as a string or something as a specific debug setup.
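
For reference, the scratch-driver idea would look roughly like this (module and function names made up; if the tool is on Click, click.testing.CliRunner can invoke it from a list of args the same way):

Python code:
import shlex
from mytool.cli import build_parser   # hypothetical - wherever your argparse parser actually lives

# Each scenario is just a human-readable string instead of a JSON list-of-strings
scenario = "--arg_one x --arg_two y"

parser = build_parser()
args = parser.parse_args(shlex.split(scenario))  # shlex handles quoting the same way a shell would
print(args)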

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

shame on an IGA posted:

try using promptString input variables inside launch.json

I don't think this will help much, since the args need to be split into a list. I'll poke around though.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I'm trying to add a simple caching mechanism to a couple of different client libraries, and was wondering if there's any easier way to do this. For context: I'm working on a system where I'm making calls whose results are generally safe to cache for a day or more, and when running it several times in a row it can be extremely slow, so I wanted to add in some local caching. I did this already, but I did it after making the calls, so I'm caching the results directly - this works, but I was thinking that caching directly at the client might be the simplest answer and produce the most straightforward code. The data returned is all easily serializable.

So the clients all are something along the lines of this example - they're all very straightforward and contain a number of methods that make calls to a given service.

Python code:
class SvcClient:

    def api_one(self, args) -> ApiResult:
        ...
    
    def api_two(self, args) -> ApiResult2:
        ...
I'd like to basically extend these classes along the lines of this:

Python code:
class CachedClient:
    def api_one(self, args, clear_cache: bool = False) -> ApiResult:
        if args not in self.cache:
            self.cache[args] = _call_api(args)
        return self.cache[args]
Obviously barebones and I'd want to have a little bit more, but essentially I want to make a class that's the same as an underlying class, except modifying each of the methods a little bit. I can obviously do this by hand, but is there any way to do this programmatically, preferably that would inherit the docstrings of the base class? Mostly just trying to avoid having to do a ton of copy/pasting.
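
One shape I'm considering is a generic wrapper that delegates via __getattr__ and memoizes the calls - rough, untested sketch; it only works if the arguments are hashable, and it won't help IDE autocomplete or type checkers since the methods don't exist statically:

Python code:
import functools

class CachedClient:
    """Wraps an existing client instance and caches the results of its method calls."""

    def __init__(self, client):
        self._client = client
        self._cache = {}

    def __getattr__(self, name):
        attr = getattr(self._client, name)   # raises AttributeError normally if missing
        if not callable(attr):
            return attr

        @functools.wraps(attr)   # carries over the wrapped method's name and docstring
        def cached(*args, clear_cache: bool = False, **kwargs):
            key = (name, args, tuple(sorted(kwargs.items())))
            if clear_cache or key not in self._cache:
                self._cache[key] = attr(*args, **kwargs)
            return self._cache[key]

        return cached

# usage: client = CachedClient(SvcClient()); client.api_one(args)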

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

Are you able to use functools.lru_cache? You can manually clear the cache of the decorated function with cache_clear(), you could use sched to schedule a function to run after 24h that just clears the cache and then reschedules itself in another 24h (or whatever interval you need). Or if you have your own event loop already just put the cache clearing in there

I realize I was unclear - I'm not trying to modify the existing libraries themselves (they're not mine) or necessarily fork them if I can avoid it; I'm just trying to wrap the client.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Yeah, it also doesn't help that I'm talking about a persistent offline local cache, so lru_cache isn't really a fit. I think maybe the approach I took in the first place, which is to just call the client and then cache the results by entity name, might be a simpler option than reworking the entire client. I'll just slap a retrieved-at timestamp on there and then set a cache expiry duration that's checked on launch or something like that. This is a proof of concept anyway so I've got room to tinker.

Edit: I wish that there was something simpler for drop-in serialization and deserialization of dataclasses, especially for any situation where you have a display element or something like that - for example, a CSV where the headers should be "Experiment Name" instead of experiment_name. There's also the other stuff like bools and datetimes/etc, or nested dataclasses. Pydantic/etc seem to include a lot more overhead, but maybe I'll check out marshmallow.
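
For the header-name part at least, dataclass field metadata gets most of the way there - a quick sketch with a made-up Experiment class:

Python code:
from dataclasses import dataclass, field, fields
from typing import Any

@dataclass
class Experiment:
    experiment_name: str = field(metadata={"display": "Experiment Name"})
    run_count: int = field(metadata={"display": "Run Count"})

def to_display_row(obj: Any) -> dict:
    # Map attribute names to their display headers (falling back to the raw name)
    return {f.metadata.get("display", f.name): getattr(obj, f.name) for f in fields(obj)}

print(to_display_row(Experiment(experiment_name="trial_a", run_count=3)))
# {'Experiment Name': 'trial_a', 'Run Count': 3}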

Falcon2001 fucked around with this message at 04:57 on Mar 11, 2024

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Hughmoris posted:

Rookie question here:

What is the easiest way to convert/print a Jupyter notebook to PDF that will look good? I need to submit it for a class.

The other answers are probably fine, but if you're doing this for a class I'd suggest asking the teacher / TA if they have a preferred method, since you're ultimately trying to make them happy, and depending on your school, many professors could just fail your submission outright for failing to meet some arcane formatting requirement.

If they're expecting you to use Jupyter, then this should be a pretty trivial question; if they're not, then 'what do you need out of my answers?' is something they should be able to answer anyway, and then it's just a matter of working backwards from there.

Edit:

Oysters Autobio posted:

Yeah sorry, don't take my advice as reason to buck against whatever stupid arbitrary thing you're asked to do.

nbconvert is exactly what you're looking for exporting pdf or other formats.

I wouldn't have said anything if it wasn't for school, because you're right that in most environments it wouldn't matter (or the dumb reasons would be something it's your job to work through). I've just had enough professors with very weird requirements to advise caution. Knew a guy who got an assignment failed for forgetting to set his Word font to Times New Roman instead of Calibri or whatever the default was at the time, and the professor just straight up gave him 0 credit because "it was in the syllabus!". I don't think he failed the whole course - he just got a B instead of an A or whatever - but yeah... academia be weird sometimes.

Falcon2001 fucked around with this message at 01:19 on Mar 12, 2024

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

StumblyWumbly posted:

Does anyone have a good best practice for type hinting on widely shared stuff? It definitely clears things up, but a lot of it is so new and it seems like the choices are to have people always on the latest version or not use stuff like the Filename type for a few years.
Oh never mind it sounds like Filename is our own specific type. I'm still interested in the question, but it's easier since type hints aren't changing as much as I thought

I'm a little confused by this - type hinting is widely supported and stable, it's in the PEP. For anything widely shared, I think it's extremely important. You should basically do it 100% of the time - the fact that you didn't know that Filename was your own class is a great indication of where type hinting would have helped, for example. One of Python's biggest weaknesses in practice is that you can write extremely unreadable unintelligible code extremely quickly, and type hinting helps offset that significantly. At my work I very routinely reject PRs until type hinting is added, and I think it's more important than docstrings (although realistically there's some overlap between the two).

That being said, there's plenty of times where type hinting becomes super unwieldy, and mostly it comes down to edge case stuff like dictionary dispatch using functions or heterogenous dictionaries or things like that (or just super complex nested dicts). On one hand, that's a good reason to avoid some of those things sometimes - if it's complicated to document, it's going to be complicated to unravel later. I think ultimately you should use them where appropriate internally and don't expose them to the outside directly, rather than avoiding them altogether - dogma in development is rarely ever true 100% of the time.

Except ISO-8601 - that's the one true path and anyone who uses another datetime format is a heretic and does not deserve to draw breath.

Edit:

Also, the PEP for type hinting (PEP 484) is from 2014, so it's definitely not 'new': https://peps.python.org/pep-0484/. Obviously support for it isn't as old as the actual PEP, but at this point every major IDE and analysis tool I've seen supports it.

Falcon2001 fucked around with this message at 00:18 on Mar 23, 2024

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

I'm right and I'll die on this hill, surrounded by the bodies of anyone who thought other datetime formats for programming/data were a good idea. :colbert:

I 100% agree with the rest of your post though.

96 spacejam posted:

This is a silly question but in reading the last 100 or so pages I couldn’t put together the ideal tools for a true greenhorn fob to Python. I have an iPad Air 5, broken Chromebook, and Rii4 that needs to be reconfigured.

Should I just fix those?

If you're just looking for a 'how do I do practical things' angle, then I recommend https://automatetheboringstuff.com/, which has a free-to-read online version, or you can get the book. I recommend this generally for entry to Python, but it's kind of pointed directly at the idea of 'what can I do with this language that isn't some huge computer science thing'.

Falcon2001 fucked around with this message at 03:27 on Mar 23, 2024

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Data classes rule. Use them everywhere.
