|
Also the ‘connectionless commits’ in SQLAlchemy (your df.to_sql(con=engine, …)) are deprecated; you should open an explicit connection or transaction and pass that in instead.
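A minimal sketch of the 2.0-style pattern (table and column names here are made up, and the sqlite URL is just a stand-in for whatever database you’re actually writing to):

```python
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///:memory:")  # stand-in URL
df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# open an explicit transaction and hand the connection to to_sql,
# instead of passing the engine itself
with engine.begin() as conn:
    df.to_sql("my_table", con=conn, if_exists="replace", index=False)

with engine.connect() as conn:
    count = conn.execute(sa.text("SELECT COUNT(*) FROM my_table")).scalar()
print(count)
```

The `engine.begin()` block commits on success and rolls back on an exception, which is exactly the behaviour the old implicit autocommit used to paper over.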
# ¿ Dec 9, 2022 06:54 |
|
Also as a very general rule I’d avoid using pandas for data pipeline work like this because the ‘helpful’ inferring of column data types will frequently gently caress you over in exactly this way. Generally the safer way is to upload the data as text into a staging table and then do type conversions on the database side. (The best method is to use proper ELT tools, but from the sound of things that isn’t an option)
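To sketch the staging-table approach (file contents and names are invented for illustration - the point is `dtype=str` plus doing the casts in SQL):

```python
import io
import pandas as pd
import sqlalchemy as sa

# stand-in for the exported file; note the leading zeros we want to keep
raw = io.StringIO("id,amount\n001,1.50\n002,2.00\n")

# dtype=str stops pandas from inferring types (e.g. stripping the
# leading zeros from id, or prematurely converting amount)
df = pd.read_csv(raw, dtype=str)

engine = sa.create_engine("sqlite:///:memory:")
with engine.begin() as conn:
    # everything lands in the staging table as text
    df.to_sql("staging", con=conn, index=False)
    # type conversion happens on the database side, where it's explicit
    conn.execute(sa.text(
        "CREATE TABLE final AS "
        "SELECT id, CAST(amount AS REAL) AS amount FROM staging"
    ))

with engine.connect() as conn:
    rows = conn.execute(sa.text("SELECT id, amount FROM final")).all()
print(rows)
```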
|
# ¿ Dec 9, 2022 11:09 |
|
I think those are python objects - specifically a list of BounceEvent instances. Presumably that’s a class defined by SalesForce and op got the output above by print()-ing the output while paused in a debugger.

It’s a bit difficult to give good advice here because I’m still a bit puzzled about the final objective. IIRC the point of all this was that your boss wants you to download the data from salesforce because he’s too cheap to keep paying for the service. So, what do you want to do once you’ve downloaded it? Is it enough to shove it in a text file somewhere and go “ok I’ve archived it, go turn the service off now cheapskate”? Or are you expected to make it continuously available in some other form in an on-prem database? Also, would you have to keep that database up to date in future - presumably that data is a record of past business activities, and if salesforce isn’t updating it then do you need to do that as well?

To get back to your original question - *if* downloading the data and parsing it via pandas is the right idea then there are still some answers we need before we can really tell you what you should do next. Firstly, how much data are you trying to pull down via this API - is it a few hundred of these entries or a few billion? Secondly, is the schema always the same - does the data stream only contain BounceEvents, or could there be other types mixed in there?
|
# ¿ Jan 5, 2023 22:09 |
|
C2C - 2.0 posted:I just finished building a project for my class; it's essentially a Tkinter interface with a bunch of inputs for my music collection that saves the inputs to a .json file.

I can’t tell exactly what the issue is from what you’ve posted, but this line:
The error message tells you that at some point in the invocation it tried to call a render() method on some object, but that object turned out to be None instead of what it was expecting so it couldn’t continue. Unfortunately that’s not enough information to solve the issue - python is sadly notorious for unhelpful error messages like this. Also the show() function in that snippet looks like it was originally a member function that you copy pasted from some class defined elsewhere. That is almost certainly the wrong thing to do, though as I mentioned above that function isn’t currently doing anything in your code.
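For illustration, this is the general shape of that failure - a lookup that quietly returns None instead of the object you expected, and the error only surfacing later when a method is called on it (class and function names here are invented):

```python
class Widget:
    def render(self):
        return "<widget>"

def get_widget(found):
    # hypothetical lookup that returns None on failure instead of raising;
    # this is what sets up the confusing error later
    return Widget() if found else None

w = get_widget(False)
try:
    w.render()
except AttributeError as e:
    # 'NoneType' object has no attribute 'render'
    print(e)
```

The fix is usually upstream of the traceback: find out why the lookup returned None, or make it raise immediately so the error points at the real cause.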
|
# ¿ Mar 1, 2023 00:02 |
|
QuarkJets posted:Whoa, that is not my experience at all. The problem is not that the message is vague; in fact the message states exactly what the problem was, and the specific line that led to the problem is in the printed traceback but was just left out of the post. This is way more useful than the C standard of simply seg faulting. The problem here is that the OP left out most of the traceback.

Maybe I’m projecting my own experience of using pandas and its truly incomprehensible traces, but I frequently find that the combination of dUcK tYpInG plus a general design approach in third-party libraries of throwing as late as possible means that the trace points to something long after the bug that actually caused the error. Here I agree it’s fairly easy for an experienced python person to figure out, though I really wish argument validation in top-level functions were more widespread than it currently is.
|
# ¿ Mar 1, 2023 13:54 |
|
By default .apply() on a DataFrame acts column-wise; if you want it to act on every row you’ll need to do .apply(…, axis=1).
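A quick illustration with a made-up two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# axis=1 passes each *row* to the function as a Series;
# without it, the function would receive each column instead
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)
print(row_sums.tolist())
```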
|
# ¿ Apr 11, 2023 17:30 |
|
duck monster posted:This is why you, don't let ChatGPT generate content for your website without review. I’m looking forward to the future where chatGPT12 has been trained on an internet largely populated by this kind of shite
|
# ¿ May 16, 2023 13:36 |
|
eXXon posted:You'll enjoy this, then: I’m guessing the original is due to string interning and the rest are compile time optimisations?
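The quoted snippet isn’t shown above, but assuming it was the usual `is`-comparison surprises, explicit interning is the guaranteed form of the string case - a minimal sketch:

```python
import sys

# sys.intern guarantees one shared object per distinct string value,
# so identity comparison succeeds; implicit interning of literals is a
# CPython implementation detail and only kicks in for some strings
a = sys.intern("hello world!")
b = sys.intern("hello world!")
print(a is b)  # True
```

The analogous integer behaviour (the -5..256 small-int cache, plus constant folding at compile time) is likewise an implementation detail rather than something the language promises.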
|
# ¿ May 19, 2023 10:31 |
|
Oysters Autobio posted:Sometimes I want to go back to R and it's tidyverse but everything at work is python so I know I don't have the same luxury.

I'm having a hard time parsing what you're actually trying to do here because this post is jumping around all over the place. Test-driven development is about defining the desired behaviour of your code by creating tests before you begin writing the code. Determining exactly what should constitute a 'testable unit' is a skill in itself, but as a simple example: if part of the 'T' in your ETL involves ingesting a pandas dataframe with a string column and converting that column to floats, you might create a test that passes mock input data to (part of) your transformation code and succeeds if the column in the output has the correct type. Similarly, you might want to verify how your code handles invalid inputs, so you might create further tests that pass in mock data containing errors and succeed iff the error is handled correctly.

Applying this paradigm to ETL is in principle no different - you define the desired behaviour through creating tests for that behaviour. One complication, though, is that it can be (very) difficult to define useful testable units where you are dependent on external systems - the E and the L in ETL - and I for one usually don't bother. AFAIK there is no generic package that saves you the work of doing this yourself.
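To make the string-column example concrete, here’s roughly what that pair of tests could look like (the transform function and column name are hypothetical - yours will differ):

```python
import pandas as pd

def convert_amount_to_float(df: pd.DataFrame) -> pd.DataFrame:
    # hypothetical 'T' step: parse the 'amount' string column into floats
    out = df.copy()
    out["amount"] = out["amount"].astype(float)
    return out

def test_amount_column_is_float():
    # happy path: mock input in, correct dtype out
    mock = pd.DataFrame({"amount": ["1.5", "2.0"]})
    result = convert_amount_to_float(mock)
    assert result["amount"].dtype == "float64"

def test_invalid_amount_raises():
    # error path: bad data should fail loudly, not pass through silently
    bad = pd.DataFrame({"amount": ["1.5", "not a number"]})
    try:
        convert_amount_to_float(bad)
    except ValueError:
        pass
    else:
        raise AssertionError("expected a ValueError on bad input")
```

Under pytest both functions would be picked up automatically; the point is that the tests pin down the behaviour before (or at least independently of) the implementation.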
Oysters Autobio posted:cue Hank Hill meme: If you want to work with web APIs you need to know what JSON is.
|
# ¿ Jun 7, 2023 07:17 |
|
I can’t really add much to what the two posters above me have said - using a loop to iterate over the rows is how I’d do it as well. The only thing I’d add is that if you want to add the results back into the data frame as a new column, don’t do that during the loop itself - store them in a list, then add that as a new column once you’ve processed every row.
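Something like this (the per-row work here is a made-up placeholder):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

results = []
for _, row in df.iterrows():
    results.append(row["x"] * 10)  # whatever per-row computation you need

# assign once, after the loop, rather than mutating df row by row
df["result"] = results
print(df["result"].tolist())
```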
|
# ¿ Jun 13, 2023 11:37 |
|
Oysters Autobio posted:Absolutely nothing wrong with sharing (at least to me that is, lol) advice on "how do you prefer to write your pandas". It's much appreciated.

Sounds like something to write a unit test for. Less flippantly, if you create the new column data by iterating over the rows in the data frame df and don’t perform any operation on df that changes the row order, then the ordering of both will be consistent and it should be safe to assign the new column to the dataframe. There’s no need to create a primary key column in a dataframe as pandas already does that for you (it’s called the index).

Lastly, I would warn that there are some downsides to using pandas in ETL pipelines; some of its convenience features around inferring data types can behave inconsistently even on very similar datasets, and the developers absolutely love introducing breaking changes. If you do want to use it then at least make sure you’ve pinned the version in whatever environment you’re running it in.
|
# ¿ Jun 15, 2023 09:14 |
|
duck monster posted:Reminder: Zed Shaw's next edition of "Python the hard way" is going to be 100% about data science lol This could also be that close cousin of the humblebrag: the false self-effacing statement on social media made for engagement/publicity. Like when a techbro millionaire goes on about the setbacks he suffered when he was fired from a summer job in a warehouse, or when I talk about how I’m a big dumb idiot and coincidentally am just about to release a book aimed at big dumb idiots.
|
# ¿ Jul 25, 2023 06:15 |
|
Bemused Observer posted:I think I get the reasoning, but it's weird when it becomes "I don't understand this fundamental concept in the area I'm about to teach you about". If he said he didn't understand IP addresses or the Linux file system, I'd have no problem with that Yeah normally those posts go ‘setback->triumph->grift’ but unless there were followup tweets to that one it looks like he forgot to do the last two lol.
|
# ¿ Jul 25, 2023 10:51 |
|
Cyril Sneer posted:Can you elaborate on this? Even if I don't ultimately go in this direction I'd still like to learn/try it.

Basically, if you actually want a singleton (and I agree with other posters that for this problem you don’t) you should make it literally impossible to create more than one instance. IIRC the usual way to implement this in Python is by overriding the __new__() method and checking if cls._instance is None, then either returning the sole existing instance of the class or creating and returning a new one. (I have never actually needed to do this in the wild so might have got a couple of these names wrong.)

Singletons do have their uses (e.g. loggers are frequently implemented using a singleton), but those are fairly specialised and limited and you should probably think carefully about whether you actually need one.
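For the record, the usual `__new__`-based sketch looks like this (same caveat as above - this is the textbook pattern, not something copied from a particular library):

```python
class Singleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # create the sole instance on first call; every later call
        # returns that same object
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a = Singleton()
b = Singleton()
print(a is b)  # True: both names refer to the one instance
```

Note that `__init__` still runs on every call, which is one of several sharp edges that make people reach for a module-level instance instead.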
|
# ¿ Aug 9, 2023 18:52 |
|
Do any of the tasks themselves submit to the ThreadPoolExecutor, or are they waiting on the results of another task? Either of those could cause a deadlock.
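A sketch of the hazard (function names invented): a task that submits to the same pool and blocks on the result only completes while a spare worker exists - shrink the pool to one worker and it deadlocks, because the blocked task occupies the only worker that could run the inner one.

```python
from concurrent.futures import ThreadPoolExecutor

def inner():
    return 42

def outer(pool):
    # submits another task to the same pool and blocks on it;
    # safe here only because a second worker is free to run inner()
    return pool.submit(inner).result()

with ThreadPoolExecutor(max_workers=2) as pool:
    result = pool.submit(outer, pool).result()

print(result)  # with max_workers=1 this .result() would never return
```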
|
# ¿ Aug 18, 2023 06:41 |
|
I’ve always wondered what goes through the mind of someone who decides to build their operations around some obscure fork that hasn’t seen a release in a decade
|
# ¿ Sep 19, 2023 17:42 |
|
Looks like that third method (using str.split() and isdigit()) will parse “2022-2022” as a valid date, and only the datetime.strptime() method will reject invalid months like “2022-17”, so it’s apples and oranges.
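The original snippets aren’t shown, but reconstructing the two approaches roughly (function names are mine) makes the difference obvious:

```python
from datetime import datetime

def parse_split(s):
    # naive check: split on '-' and verify both pieces are digits;
    # happily accepts impossible values like month 17 or year-year pairs
    y, m = s.split("-")
    return y.isdigit() and m.isdigit()

def parse_strptime(s):
    # strptime actually validates the month range
    try:
        datetime.strptime(s, "%Y-%m")
        return True
    except ValueError:
        return False

print(parse_split("2022-17"))     # True - bogus month slips through
print(parse_strptime("2022-17"))  # False
```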
|
# ¿ Oct 6, 2023 07:40 |
|
KICK BAMA KICK posted:Do you guys have any tips for using Python to do insane amounts of crime Obfuscate better than def do_crimes()
|
# ¿ Oct 8, 2023 00:11 |
|
The most obvious approach imo is to keep the shift+user data in database tables and have your app run queries against it. Is there any reason that wouldn’t work?
|
# ¿ Oct 15, 2023 21:26 |
|
So it sounds like your data is a small collection of records with no general inter-record relationships other than ‘I want to consider these things together’. I’d just keep that as a bunch of dataclasses/namedtuples in a list unless/until you run into a reason not to (nothing you have said above seems like a reason not to). One caveat to that is if you’re planning on using any of the numerous python dataviz frameworks then you’ll probably want to keep the data in a pandas dataframe instead since that’s what those frameworks expect.
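As a sketch of what I mean (field names are made up - substitute whatever your records actually contain):

```python
from dataclasses import dataclass

@dataclass
class Record:
    # hypothetical fields for illustration
    name: str
    value: float

records = [
    Record("a", 1.0),
    Record("b", 2.0),
]

# plain list operations cover most 'consider these together' needs
total = sum(r.value for r in records)
print(total)  # 3.0
```

If a dataviz framework later wants a dataframe, `pd.DataFrame(records)` converts a list of dataclasses directly, so starting simple doesn’t paint you into a corner.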
|
# ¿ Oct 15, 2023 22:27 |
|
Falcon2001 posted:Speaking of which, and this isn't really a Python question, but is there an open source CSV editor that has some of the functionality of excel without all the...overhead? It'd be nice to find something I can use to just muck around quickly with tabular data without having to constantly be like 'no, don't accidentally make it an xlsx, stop formatting it' etc - not that it's a super big problem or anything. VSCode and PyCharm both have plugins that make csv editing less painful
|
# ¿ Nov 7, 2023 18:40 |
|
Personally I’m extremely anti using pandas in anything even resembling a pipeline since the devs absolutely love introducing breaking changes.
|
# ¿ Feb 27, 2024 06:20 |
|
|
# ¿ May 14, 2024 04:28 |
|
Dash is a lot more flexible than Streamlit
|
# ¿ Mar 20, 2024 10:13 |