WHERE MY HAT IS AT
Jan 7, 2011
Python is bad because its type system is based on pinkie promises and even mypy cannot save you. At my company, at least, once a week we have some dumb Sentry issue because someone has done
code:
blah = foo.bar.baz 
without realizing that bar is nullable and generating a “'NoneType' object has no attribute 'baz'” exception.
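A minimal sketch of that failure mode (Foo and Bar are made-up names); nothing at runtime enforces the annotation, so the access blows up, and mypy only catches it when bar is actually annotated:

Python code:
from typing import Optional

class Bar:
    baz: str = "value"

class Foo:
    bar: Optional[Bar] = None  # nullable, and easy to forget to check

foo = Foo()
blah = foo.bar.baz  # AttributeError at runtime; mypy flags this line
                    # only because bar is annotated as Optional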

WHERE MY HAT IS AT
Jan 7, 2011
We’re pretty good about hinting in first-party code, but it all falls down as soon as you use a third-party library for which there are no types. Fine for Flask/Django/whatever other popular library, but just lol if I try to get anyone to write our own type packages for the third-party deps we use.
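For the record, writing your own stubs isn't complicated, just tedious. A minimal sketch, with somelib as a made-up untyped dep:

Python code:
# stubs/somelib/__init__.pyi -- hypothetical stubs for an untyped dep;
# point mypy_path (or MYPYPATH) at stubs/ so mypy picks them up
def fetch(url: str, timeout: float = ...) -> bytes: ...

class Client:
    def __init__(self, api_key: str) -> None: ...
    def get(self, path: str) -> dict[str, object]: ...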

WHERE MY HAT IS AT
Jan 7, 2011

Falcon2001 posted:

Familiar with Frozen dataclasses, but this solves a problem I'm not actually trying to fix. The issue isn't 'you can reassign/update attributes after creation', the problem is 'certain attributes make calls out to clients to populate data, and we need to ensure that doesn't happen.'

The idea is that all the dependencies/etc. of a Cluster might be very complex and big, and span multiple other systems. This framework was built as an accelerator for other projects, so it basically works as glue between multiple infrastructure technologies - when you ask a cluster about network details, instead of just giving you a pointer, it actually calls and loads the object from that other client, and then it's available to you cached locally from there on out. (See the example of the cluster_children note above)

The problem is that I basically want a way of saying 'hey, just return a version of yourself that contains only the data and won't try and fetch anything else - we either initialized it correctly, or datasets will be empty.'

One approach would be to construct complex dataclasses and have each of these objects basically turn themselves into those objects, but that's a huge amount of basically copy/paste work there, so the idea is how to build it into the class itself in the least weird way possible.

There's some really long ways of going about it - basically writing two almost identical classes that both fulfill the same interface - so you have a ClusterPlus object and then a ClusterPlusButThisOneDoesntMakeCalls object - the second of which is basically just a dataclass representation of what the first one looks like after you fetched all the data you need. But that's a ton of duplicate code, not to mention the entire problem of drift.

Assuming there's a naming convention for the underlying properties, you could make a class that just dynamically returns those when fetching a property. Something like:

Python code:
class MyClientClass:
    def __init__(self):
        self._fooprop = None

    @property
    def foo(self):
        # lazily fetch and cache on first access
        if self._fooprop is None:
            self._fooprop = "EXPENSIVE OPERATION"
        return self._fooprop

class DisabledProxy:
    def __init__(self, proxied):
        self._proxied = proxied

    def __getattr__(self, name):
        # read the cached backing attribute directly, bypassing the
        # property so nothing ever gets fetched
        return getattr(self._proxied, f"_{name}prop")

foo = MyClientClass()
proxy = DisabledProxy(foo)

print(proxy.foo)  # prints None: foo was never fetched on the original
This gets tricky if you have nested props/sub-objects you also need to disable, but it makes the simple case easy, as long as people follow the naming convention or you find some other way to map property names to backing attributes, like a dict.

WHERE MY HAT IS AT
Jan 7, 2011
Edit: ^^^ :argh:

Dataclasses aren't immutable unless you do @dataclass(frozen=True).
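For example:

Python code:
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    x: int

p = Point(1)
p.x = 2  # raises dataclasses.FrozenInstanceError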

You're still relying on this global variable in create_dict, which you don't want, and your Parser (I'd probably call it Config or something, to make it clearer that it doesn't do any parsing itself) is tightly coupled to CLI args. Maybe today you only take CLI args to configure this, but someday the config may come from env vars, or a config file, or a network call, etc. What you're aiming for is to separate "the config" from "where the config comes from" so that most of the code doesn't have to care.

Python code:
@dataclass(frozen=True)
class Config:
    filename: str
    short_path: str
    site: str
    sort: int

    @property
    def full_path(self) -> str:
        return self.short_path + self.site + "/"

def parse_config_from_cli_args() -> Config:
    parser = argparse.ArgumentParser(...)
    ...
    args = parser.parse_args()
    # whatever validation of those args needs to happen
    return Config(filename=args.filename, short_path=args.short_path, ...)

def create_dict(config: Config) -> dict:
    file = config.full_path
    ...
    return data_dict

def data_display(data: dict, config: Config):
    # assuming the dict values are the rows to display
    thd_list = sorted(data.values(), key=lambda x: x[config.sort])
    for line in thd_list:
        print(line[0], line[1], line[2], line[3])

def main():
    config = parse_config_from_cli_args()
    data_dict = create_dict(config)
    data_display(data_dict, config)
Now all your CLI arg validation is in one place; if you want to swap the config source later down the road, you can do so without disturbing create_dict or data_display; and since they both take everything they need as arguments, it's easy to write tests for them without setting up a bunch of global state.

Personally, I would avoid passing the entire config object to data_display since all it needs out of it is the sort config. Passing around a big bundle of state like that can make it hard to reason about what exactly a function relies on and has led me to some hair-pulling debugging sessions in the past. Or put another way, if a year from now you need to make some changes or fixes to data_display (or call it from some other part of the codebase), which of these function signatures tells you the most about how it works?

Python code:
# all globals!
def data_display():
    ...

# Global config
def data_display(data_dict: dict):
    ...

# Big bundle o' state
def data_display(data_dict: dict, config: Config):
    ...

# Pull out the sort
def data_display(data_dict: dict, sort: int):
    ...

WHERE MY HAT IS AT fucked around with this message at 16:45 on Sep 15, 2023

WHERE MY HAT IS AT
Jan 7, 2011
I can only assume they're just too close to retirement to want to learn something new, even if it's demonstrably better. Not that people earlier in their careers can't be stubborn or cling to garbage, either. Just IME that's been where I've seen it.

WHERE MY HAT IS AT
Jan 7, 2011

QuarkJets posted:

I especially hate package names that install a completely different module name. Pillow is the big offender for me, I hate that the module you import is actually named PIL. May as well call it PISS

At least Pillow has the excuse of being a fork of PIL, so the import is for backwards-compatibility/historical reasons. Beautiful Soup 4 with its dumb "install beautifulsoup4 (but not beautifulsoup, because that's beautifulsoup 3!) and then import bs4" can go to hell, though.

WHERE MY HAT IS AT
Jan 7, 2011

teen phone cutie posted:

i have a general question that i'm just now starting to look into, but would be great if someone could point me in the right direction:

I have a standalone API with endpoints that issue users JWTs through either /login or /register. These JWTs are signed to expire in 7 days and are never stored in the database, except when users hit /logout, which adds the token to a blacklist table.

That's probably more information than needed, but I'm looking for a way to compile the list of active users within the last 15 minutes, and I'm not sure of the best way to do it. I also want to get a count of people who hit the API while not logged in. My API is a Flask application, so I'm assuming there's some way to intercept all endpoints and add callers to "active users" and "guests" lists in application state if they don't already exist and timestamp them, but I'm not sure if there's a Flask-y way to do this?

If you want to execute this logic on each request so you can gather stats, what you're after is the before_request decorator. It doesn't pass any params, but you can import flask.request to have access to a request object to inspect.

As for the actual stat storage, you haven't said anything about what your backing data store is, but I'd probably do something like: for each request, extract the user ID out of the JWT and upsert a row in a DB table with the user ID and current timestamp + 15 minutes. Then, when you want to view active users, query the table for rows with a timestamp greater than current. Depending on space constraints, you'll likely want to clean this up periodically by dropping all rows with an expired timestamp.
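A minimal sketch of that flow, assuming SQLite as the store (decode_jwt is a hypothetical stand-in for whatever JWT library you're already using, and the table assumes user_id is its primary key):

Python code:
import sqlite3
import time
from flask import Flask, request

app = Flask(__name__)

@app.before_request
def track_active_user():
    auth = request.headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        return  # not logged in; count as a guest instead
    user_id = decode_jwt(auth.removeprefix("Bearer "))["sub"]  # hypothetical helper
    expires_at = int(time.time()) + 15 * 60
    with sqlite3.connect("stats.db") as conn:
        conn.execute(
            "INSERT INTO active_users (user_id, expires_at) VALUES (?, ?) "
            "ON CONFLICT(user_id) DO UPDATE SET expires_at = excluded.expires_at",
            (user_id, expires_at),
        )
Querying active users is then just selecting rows where expires_at is greater than the current timestamp.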

Tracking users who aren't logged in depends on how you want to use that data. Do you need to bucket it day by day? As a percentage of total requests? Just an absolute count? It's harder to suggest an approach without more details, but maybe just a table of unique IPs + day of access would suffice.

WHERE MY HAT IS AT
Jan 7, 2011
Curious to hear about other folks' solutions! I used a trie for part two so I could iterate over the string and check substring prefixes as I went along until I hit a valid word.

However, I hadn't implemented a trie since I was in school, which was close to a decade ago now, so re-learning it all burned me out on puzzles and I haven't bothered doing part two of any of the other days.
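For reference, a minimal sketch of the trie approach (the word list and input are placeholders):

Python code:
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False

def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def prefixes_in(s, root):
    """Yield every prefix of s that is a valid word in the trie."""
    node = root
    for i, ch in enumerate(s):
        node = node.children.get(ch)
        if node is None:
            return
        if node.is_word:
            yield s[:i + 1]

trie = build_trie(["cat", "cattle"])
print(list(prefixes_in("cattle drive", trie)))  # ['cat', 'cattle']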

WHERE MY HAT IS AT
Jan 7, 2011
If I can keep making this the ETL thread for a little longer, what would you all suggest for a totally greenfield project?

I do some contracting for my wife’s company, and inherited a mishmash of make.com no-code projects, single-purpose API endpoints in Heroku, and Airtable automations which sync data to Google Sheets.

They’re starting to hit a scale where this is a problem because things are unreliable, go out of sync, certain data exists only in some data stores, etc. The goal is to get them set up with a proper data store that can act as a single source of truth, and an actual ETL platform where I can write real code or use premade connectors to shift data around.

I did a bit of digging and something like Airbyte + Databricks looks nice, but maybe it’s overkill for what they need? Think thousands of rows a day rather than millions, and they only want to be able to do dashboarding and ad-hoc querying in Tableau. Would I regret just doing a managed Airflow and a Postgres instance at this point? I don’t want to have to redo everything in a year or two.

WHERE MY HAT IS AT
Jan 7, 2011
Thanks! Those both look interesting; leaning towards Dagster, just because Mage doesn’t have a managed hosted setup and I want to minimize the time I spend on this.


Say I have a Pydantic model that represents my incoming data from a third-party API (in this case a warehouse management system): what are folks using to actually write that to a raw table for transformation with dbt later? At work we use SQLAlchemy for all our DB interactions, but that seems heavy-handed, especially if I’ve already got a list of models parsed from JSON or whatever. I could just hand-roll a parameterized SQL statement, but surely there’s a library out there that will do what I need, right?
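For the hand-rolled route, a minimal sketch assuming Pydantic v2 and sqlite3 (InventoryRow and the table name are made up):

Python code:
import sqlite3
from pydantic import BaseModel

class InventoryRow(BaseModel):
    sku: str
    quantity: int

def write_raw(rows: list[InventoryRow], conn: sqlite3.Connection) -> None:
    # executemany with named placeholders takes the dicts from model_dump() directly
    conn.executemany(
        "INSERT INTO raw_inventory (sku, quantity) VALUES (:sku, :quantity)",
        [row.model_dump() for row in rows],
    )
    conn.commit()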

Edit: looks like Dagster can do this natively with a data frame, never mind!

WHERE MY HAT IS AT fucked around with this message at 10:23 on Dec 27, 2023

WHERE MY HAT IS AT
Jan 7, 2011
Yeah, they’re probably several years out from needing a full-time engineer, if they ever get that far. So I’m it, and the less time spent on maintenance the better for both sides. I’ll start with Postgres since I have experience scaling that pretty far at work, and plan to move to BigQuery someday if they need it.

A data engineering thread sounds like a good idea, I can work on an OP this week since I’m off on PTO and my nephews gave me the plague or something anyways.

WHERE MY HAT IS AT
Jan 7, 2011
Poetry has its own resolver and doesn’t rely on pip; pipenv does just punt to pip under the hood and is even slower. The real problem is that setup.py is a non-deterministic way of specifying dependencies, so you can’t have a truly “offline” package resolver - you have to actually install stuff to do lock file generation.
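A minimal sketch of the problem, with a made-up package: the dependency list only exists once setup.py executes, so no resolver can know it statically.

Python code:
import sys
from setuptools import setup

# dependencies computed at install time -- invisible until this file runs
deps = ["requests"]
if sys.version_info < (3, 8):
    deps.append("importlib-metadata")

setup(name="example", version="0.1", install_requires=deps)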

WHERE MY HAT IS AT
Jan 7, 2011
There's also Pex: https://pex.readthedocs.io/en/v2.1.163/

It requires that an interpreter matching your constraints be present on the system already, but otherwise, it includes all runtime dependencies and acts as a hermetic environment like a venv does.

WHERE MY HAT IS AT
Jan 7, 2011
I don't think I'd call myself an "expert" so maybe someone will weigh in with a better solution, but using ffill seems like it would work:

code:
import pandas as pd

data = {'Category': ['Southwest', '1', '2', '3', 'West', '1', '2', '3', '4', 'North', '1', '2']}
df = pd.DataFrame(data)

# Mask the numeric rows so only region names remain
df['Category'] = df['Category'].where(~df['Category'].str.isnumeric())
# Forward-fill the region names over the masked rows
df['Category'] = df['Category'].ffill()

print(df)
This replaces all the numbers with null values, and then ffill (forward fill) propagates each remaining value down the sequence until it hits the next non-null.

WHERE MY HAT IS AT
Jan 7, 2011
There's a burgeoning data engineering thread here that might get you a better answer, but I think the crossover of posters is high anyways: https://forums.somethingawful.com/showthread.php?threadid=4050611
