DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?
Also the ‘connectionless commits’ in SQLAlchemy (your df.to_sql(con=engine,…)) are deprecated; you should do this instead:

code:
with engine.connect() as conn:
    df.to_sql(con=conn,…)
Edit: from a very brief scan of the error message I’d venture that the issue is going to turn out to be that the source data contains some text in brackets that pandas is either parsing into a tuple or not quoting properly and is loving up the sql insert statements. The way to figure it out is to iteratively narrow down the problem row/column, as QuarkJets suggested above.
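A quick sketch of that narrowing-down approach. Everything here is hypothetical (find_bad_row and fake_insert are made up), and it assumes a single offending row and that inserting only the good rows succeeds:

```python
# Hypothetical helper: bisect the rows to locate the first one that fails
# to insert. `try_insert` stands in for whatever raises on the bad data
# (e.g. a wrapper around df.iloc[lo:hi].to_sql(...)).
def find_bad_row(rows, try_insert):
    lo, hi = 0, len(rows)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        try:
            try_insert(rows[lo:mid])
            lo = mid      # first half inserts cleanly; problem is later
        except Exception:
            hi = mid      # failure reproduced; narrow to the first half
    return lo             # index of the offending row

# Toy demonstration with a plain list standing in for dataframe rows:
def fake_insert(chunk):
    if "bad" in chunk:
        raise ValueError("insert failed")

bad_index = find_bad_row([1, 2, 3, "bad", 5], fake_insert)   # 3
```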

DoctorTristan fucked around with this message at 07:07 on Dec 9, 2022

DoctorTristan
Mar 11, 2006

Also as a very general rule I’d avoid using pandas for data pipeline work like this because the ‘helpful’ inferring of column data types will frequently gently caress you over in exactly this way. Generally the safer way is to upload the data as text into a staging table and then do type conversions on the database side.

(The best method is to use proper ELT tools, but from the sound of things that isn’t an option)
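A minimal sketch of the staging-table approach, using an in-memory sqlite3 database as a stand-in for the real target (table and column names are made up for illustration):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # stand-in for the real target database
raw = pd.DataFrame({"order_id": ["1", "2"], "amount": ["1.50", "2.75"]})

# 1) Load everything as text into a staging table; pandas does no inference.
raw.astype(str).to_sql("staging_orders", conn, index=False)

# 2) Convert types on the database side, where the rules are explicit
#    and failures surface in the SQL rather than in pandas' guesswork.
conn.execute("""
    CREATE TABLE orders AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount   AS REAL)    AS amount
    FROM staging_orders
""")
rows = conn.execute("SELECT order_id, amount FROM orders").fetchall()
```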

DoctorTristan
Mar 11, 2006

I think those are Python objects - specifically a list of BounceEvent instances. Presumably that’s a class defined by SalesForce, and the OP got the output above by print()-ing it while paused in a debugger.

It’s a bit difficult to give good advice here because I’m still a bit puzzled about the final objective. IIRC the point of all this was that your boss wants you to download the data from Salesforce because he’s too cheap to keep paying for the service. So, what do you want to do once you’ve downloaded it? Is it enough to shove it in a text file somewhere and go “ok I’ve archived it, go turn the service off now cheapskate”? Or are you expected to make it continuously available in some other form in an on-prem database? And would you have to keep that database up to date in future? Presumably that data is a record of past business activities, and if Salesforce isn’t updating it any more then do you need to do that as well?

To get back to your original question - *if* downloading the data and parsing it via pandas is the right idea then there are still some answers we need before we can really tell you what you should do next. Firstly, how much data are you trying to pull down via this API - is it a few hundred of these entries or a few billion? Secondly is the schema always the same - does the data stream only contain BounceEvents or could there be other types mixed in there?

DoctorTristan
Mar 11, 2006


C2C - 2.0 posted:

I just finished building a project for my class; it's essentially a Tkinter interface with a bunch of inputs for my music collection that saves the inputs to a .json file.

I'm having a go at using PyVis to map out some of the relationships of the items in the collection. Here's my code:

Here's the error I'm getting:
code:
AttributeError: 'NoneType' object has no attribute 'render'
music.html

I can’t tell exactly what the issue is from what you’ve posted, but the line

code:
g.show('music.html')
isn’t calling your show() function - it’s calling a class method on the object g (an instance of the Network class) that also happens to be named show(). That might or might not be the method you want to call, but assuming it is, it looks like you’re not calling it correctly - either an argument is missing or is of the wrong type. You’ll need to look more into the Network class and its show() method to figure out what’s going wrong here.

The error message tells you that at some point in the invocation it tried to call a render() method on some object, but that object turned out to be None instead of what it was expecting so it couldn’t continue. Unfortunately that’s not enough information to solve the issue - python is sadly notorious for unhelpful error messages like this.

Also the show() function in that snippet looks like it was originally a member function that you copy pasted from some class defined elsewhere. That is almost certainly the wrong thing to do, though as I mentioned above that function isn’t currently doing anything in your code.
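A stripped-down illustration of that name shadowing (this is not pyvis itself - the Network class and show() function here are stand-ins):

```python
# A module-level show() function, like the one copy-pasted into the script:
def show(filename):
    return f"module-level show({filename})"

# A class that defines its own show() method, like pyvis' Network:
class Network:
    def show(self, filename):
        return f"Network.show({filename})"

g = Network()
# Attribute lookup on g resolves to the class method; the module-level
# function above is never consulted:
result = g.show("music.html")   # "Network.show(music.html)"
```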

DoctorTristan
Mar 11, 2006


QuarkJets posted:

Whoa, that is not my experience at all. The problem is not that the message is vague, in fact the message states exactly what the problem was and the specific line that led to the problem is in the printed traceback but was just left out of the post. This is way more useful than the C standard of simply seg faulting. The problem here is that the OP left out most of the traceback

Maybe I’m projecting my own experience of using pandas and its truly incomprehensible traces, but I frequently find that the combination of dUcK tYpInG plus a general design approach in third-party libraries of throwing as late as possible means that the trace points to something long after the bug that actually caused the error. Here I agree it’s fairly easy for an experienced python person to figure out, though I really wish argument validation in top-level functions were more widespread than it currently is.

DoctorTristan
Mar 11, 2006

If you want .apply() to act on every row of that dataframe you’ll need to do .apply(…, axis=1).
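For instance, a tiny illustration of the axis=1 difference (toy dataframe, made-up column names):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [10, 20]})

# Without axis=1, DataFrame.apply hands each *column* to the function;
# with axis=1 it hands each *row* as a Series, which is usually what
# you want when writing row-wise logic.
row_sums = df.apply(lambda row: row["a"] + row["b"], axis=1)
# row_sums.tolist() == [11, 22]
```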

DoctorTristan
Mar 11, 2006


duck monster posted:

This is why you don't let ChatGPT generate content for your website without review.

I present: "The goto statement in python", including information about the "comehere" statement. Not an April 1st post.

https://www.upgrad.com/blog/goto-statement-in-python/

Well it's either ChatGPT or "loving idiot". There is no goto in python.

I’m looking forward to the future where chatGPT12 has been trained on an internet largely populated by this kind of shite

DoctorTristan
Mar 11, 2006


eXXon posted:

You'll enjoy this, then:

Python code:
a = "abc!"; b = "abc!"; a is b
> True
... and this:
Python code:
if True:
    a = "wtf!"
    b = "wtf!"
    a is b
> True
... but what about this?
Python code:
if True:
    a = "wtf!"
    b = "wtf!"
a is b
> True in script/ipython, SyntaxError in base python repl?!

I’m guessing the original is due to string interning and the rest are compile time optimisations?
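The interning part at least is easy to demonstrate with sys.intern (whether two *non*-interned equal strings share an object is a CPython implementation detail):

```python
import sys

# sys.intern guarantees one shared object per string value:
a = sys.intern("wtf!")
b = sys.intern("wtf!")
shared = a is b          # True: both names point at the interned object

# A string built at runtime is equal but a distinct object, since join()
# results are not interned:
c = "".join(["wtf", "!"])
equal_but_distinct = (c == a) and (c is not a)   # True
```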

DoctorTristan
Mar 11, 2006


Oysters Autobio posted:

Sometimes I want to go back to R and its tidyverse, but everything at work is python so I know I don't have the same luxury.

Two questions here, first, anyone have good resources for applying TDD for data analysis or ETL workflows? Specifically here with something that is dataframe friendly. Love the idea of TDD, but every tutorial I've found on testing is focused on objects and other applications so I'd rather not try and reinvent my own data testing process if there's a really great and easy way to do it while staying in the dataframe abstraction or pandas-world.

Second question: is there a pandas dataframe-friendly wrapper someone has made for interacting with APIs, like the requests library but abstracted with some helper functions for iterating through columns? Or are people just using pd.apply() and pd.map() on a dataframe column they've written into a payload?

Still rather new with pandas and the little footguns I can see with accidentally using a method that results in a column being processed without maintaining order or key:value relationship with a unique identifier.

If there was a declarative dataframe-oriented package that just let me do something like:

Python code:
url = 'https://api.foo.baz.com'

processed_df = dataframerequests(df['column2'], append=True)

processed_df.head()
Where 'column2' is a column from pandas dataframe that I want to POST to an API for it to process in some way (translation or whatever) and the append boolean tells it to append a new column rather than replacing it with the results.

With requests I always feel uneasy or unsure of how I'm POSTing a dataframe column as a dict, then adding that results dict back as a new column.

Totally get why from a dev perspective all this 'everything as a REST API' microservices makes sense, but been finding it difficult for us dumbass data analysts who mainly work with SQL calls to a db to adapt.

cue Hank Hill meme:

"Do I look like I know what the hell a JSON is? I just want a table of a goddamn hotdog"

I'm having a hard time parsing what you're actually trying to do here because this post is jumping around all over the place.

Test-driven development is about defining the desired behaviour of your code by creating tests before you begin writing the code. Determining exactly what should constitute a 'testable unit' is a skill in itself, but as a simple example if part of the 'T' in your ETL involves ingesting a pandas dataframe with a string column and converting that column to floats, you might create a test that passes in mock input data to (part of) your transformation code and succeeds if the column in the output has the correct type. Similarly you might want to verify how your code handles invalid inputs, so you might create further tests that pass in mock data that contains errors and succeed iff the error is handled correctly.

Applying this paradigm to ETL is in principle no different - you define the desired behaviour through creating tests for that behaviour. One complication though is that it can be (very) difficult to define useful testable units where you are dependent on external systems - the E and the L in ETL - and I for one usually don't bother.
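To make the string-to-float example concrete, here's a rough sketch of what writing the tests first might look like (prices_to_float and the column name are invented for illustration):

```python
import pandas as pd

def prices_to_float(df):
    """Transform step under test: convert the 'price' column to float."""
    out = df.copy()
    out["price"] = out["price"].astype(float)
    return out

# Test written first: valid input yields a float column.
def test_valid_prices():
    result = prices_to_float(pd.DataFrame({"price": ["1.5", "2.0"]}))
    assert result["price"].dtype == float

# Test written first: invalid input raises instead of passing silently.
def test_invalid_prices():
    try:
        prices_to_float(pd.DataFrame({"price": ["1.5", "not a number"]}))
        assert False, "expected a ValueError"
    except ValueError:
        pass
```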

AFAIK there is no generic package that saves you the work of doing

Python code:
import requests
import pandas as pd

url = "https://fart.butt.com"
response = requests.get(url)
df = pd.read_json(response.text)
It would be very difficult to create such an interface - the JSON structure of an API response is very flexible and will not generally be convertible to a dataframe. You need to know the specific data structure returned by the API in order to parse it correctly.
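Once you do know the structure, pd.json_normalize gets you most of the way - here with a hand-written stand-in payload rather than a real API response:

```python
import pandas as pd

# A stand-in for an API response whose schema you already know:
payload = {
    "results": [
        {"id": 1, "user": {"name": "alice"}},
        {"id": 2, "user": {"name": "bob"}},
    ]
}

# json_normalize flattens nested records into dotted columns like 'user.name':
df = pd.json_normalize(payload["results"])
```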

Oysters Autobio posted:

cue Hank Hill meme:

"Do I look like I know what the hell a JSON is? I just want a table of a goddamn hotdog"

If you want to work with web APIs you need to know what JSON is.

DoctorTristan
Mar 11, 2006


I can’t really add much to what the two posters above me have said - using a loop to iterate over the rows is how I’d do it as well. The only thing I’d add is that if you want to add the results back into the data frame as a new column, don’t do that during the loop itself - store them in a list, then add that as a new column once you’ve processed every row.
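Roughly like this, with enrich() standing in for whatever per-row API call you're making (column names invented):

```python
import pandas as pd

def enrich(value):
    # stand-in for a per-row API call
    return value.upper()

df = pd.DataFrame({"word": ["ham", "spam"]})

results = []
for row in df.itertuples():          # iterate in row order
    results.append(enrich(row.word)) # collect results; don't mutate df here

df["word_upper"] = results           # one assignment, after the loop
```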

DoctorTristan
Mar 11, 2006


Oysters Autobio posted:

Absolutely nothing wrong with sharing (at least to me that is, lol) advice on "how do you prefer to write your pandas". It's much appreciated.

The one part I'm not sure of is how to ensure or test to make sure that the new column I've created is "properly lined up" with the original input rows. Can I somehow assign some kind of primary key to these rows so then when my new column is added I can quickly see "ok awesome, input row 43 from original column lines up with output row 43."

Does that make sense at all? Or am I being a bit too overly concerned here? I'm still fairly new to Python coming from mainly a SQL background, so the thought of generating a list based off inputs from a column and then attaching the new outputs column and knowing 100% that they match is for some reason a big concern of mine.

Sounds like something to write a unit test for.

Less flippantly, if you create the new column data by iterating over the rows in the data frame df and don’t perform any operation on df that changes the row order, then the ordering of both will be consistent and it should be safe to assign the new column to the dataframe.

There’s no need to create a primary key column in a dataframe as pandas already does that for you (it’s called the index).

Lastly I would warn that there are some downsides to using pandas in ETL pipelines; some of its convenience features around inferring data types can behave inconsistently even on very similar datasets, and the developers absolutely love introducing breaking changes. If you do want to use it then at least make sure you’ve pinned the version in whatever environment you’re running it in.

DoctorTristan
Mar 11, 2006


duck monster posted:

Reminder: Zed Shaw's next edition of "Python the hard way" is going to be 100% about data science lol

https://twitter.com/lzsthw/status/1682407572479852545

This is going to be a train wreck.

This could also be that close cousin of the humblebrag: the false self-effacing statement on social media made for engagement/publicity. Like when a techbro millionaire goes on about the setbacks he suffered when he was fired from a summer job in a warehouse, or when I talk about how I’m a big dumb idiot and coincidentally am just about to release a book aimed at big dumb idiots.

DoctorTristan
Mar 11, 2006


Bemused Observer posted:

I think I get the reasoning, but it's weird when it becomes "I don't understand this fundamental concept in the area I'm about to teach you about". If he said he didn't understand IP addresses or the Linux file system, I'd have no problem with that

Yeah normally those posts go ‘setback->triumph->grift’ but unless there were followup tweets to that one it looks like he forgot to do the last two lol.

DoctorTristan
Mar 11, 2006


Cyril Sneer posted:

Can you elaborate on this? Even if I don't ultimately go in this direction I'd still like to learn/try it.

Basically if you actually want a singleton (and I agree with other posters that for this problem you don’t) you should make it literally impossible to create more than one instance.

IIRC the usual way to implement this in Python is by overriding the __new__() method: check whether cls._instance is None, and either create, store and return a new instance or return the existing one. (I have never actually needed to do this in the wild so might have got a couple of these names wrong.)
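Something like this, as a minimal sketch of the __new__()-based pattern (untested in anger, per the caveat above):

```python
class Singleton:
    _instance = None

    def __new__(cls, *args, **kwargs):
        # Create the sole instance on first call; hand it back ever after.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

a = Singleton()
b = Singleton()
same = a is b   # True: "both" constructions yield the one instance
```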

Singletons do have their uses (eg loggers are frequently implemented using a singleton), but those are fairly specialised and limited and you should probably think carefully about whether you actually need one.

DoctorTristan
Mar 11, 2006

Do any of the tasks themselves submit to the ThreadPoolExecutor, or are they waiting on the results of another task? Either of those could cause a deadlock.
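The first case is easy to reproduce with a single-worker executor - the timeouts are only there so this sketch terminates instead of hanging forever:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

ex = ThreadPoolExecutor(max_workers=1)

def outer():
    # This task submits more work to the same single-worker executor...
    inner = ex.submit(lambda: 42)
    # ...then blocks waiting on it. The worker running outer() is the only
    # one available, so inner can never start: a deadlock. The timeout
    # turns the hang into a TimeoutError for demonstration purposes.
    return inner.result(timeout=1)

f = ex.submit(outer)
err = f.exception(timeout=5)   # the TimeoutError raised inside outer()
ex.shutdown(wait=True)
```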

DoctorTristan
Mar 11, 2006

I’ve always wondered what goes through the mind of someone who decides to build their operations around some obscure fork that hasn’t seen a release in a decade

DoctorTristan
Mar 11, 2006

Looks like that third method (using str.split() and isdigit()) will parse “2022-2022” as a valid date, and only the datetime.strptime() method will reject invalid months like “2022-17”, so it’s apples and oranges.
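The difference is easy to check (I'm assuming the format in question is %Y-%m; the function names here are mine):

```python
from datetime import datetime

def naive_check(s):
    # the str.split()/isdigit() approach: accepts any pair of digit runs
    parts = s.split("-")
    return len(parts) == 2 and all(p.isdigit() for p in parts)

def strict_check(s):
    # strptime actually validates that the month is a real month
    try:
        datetime.strptime(s, "%Y-%m")
        return True
    except ValueError:
        return False

# naive_check("2022-17") and naive_check("2022-2022") both pass;
# strict_check rejects both.
```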

DoctorTristan
Mar 11, 2006


KICK BAMA KICK posted:

Do you guys have any tips for using Python to do insane amounts of crime
https://x.com/molly0xFFF/status/1710718416724595187?s=20

Obfuscate better than def do_crimes()

DoctorTristan
Mar 11, 2006

The most obvious approach imo is to keep the shift+user data in database tables and have your app run queries against it. Is there any reason that wouldn’t work?
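As a sketch of what I mean, with an in-memory sqlite3 database and a made-up schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the app's real database
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shifts (id INTEGER PRIMARY KEY,
                         user_id INTEGER REFERENCES users(id),
                         shift_date TEXT);
    INSERT INTO users  VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO shifts VALUES (1, 1, '2024-01-02'), (2, 2, '2024-01-02');
""")

# The app just runs queries instead of juggling in-memory structures:
rows = conn.execute("""
    SELECT u.name
    FROM shifts s JOIN users u ON u.id = s.user_id
    WHERE s.shift_date = '2024-01-02'
    ORDER BY u.name
""").fetchall()
```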

DoctorTristan
Mar 11, 2006

So it sounds like your data is a small collection of records with no general inter-record relationships other than ‘I want to consider these things together’. I’d just keep that as a bunch of dataclasses/namedtuples in a list unless/until you run into a reason not to (nothing you have said above seems like a reason not to).

One caveat to that is if you’re planning on using any of the numerous python dataviz frameworks then you’ll probably want to keep the data in a pandas dataframe instead since that’s what those frameworks expect.
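For example (the Record class and its fields are invented):

```python
from dataclasses import dataclass, asdict

@dataclass
class Record:          # hypothetical record shape
    name: str
    score: float

records = [Record("a", 1.0), Record("b", 2.5)]

# Plain list operations cover most needs at this scale:
best = max(records, key=lambda r: r.score)

# And if a dataviz framework wants a dataframe later, conversion is one
# line: pd.DataFrame(asdict(r) for r in records)
rows = [asdict(r) for r in records]
```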

DoctorTristan
Mar 11, 2006


Falcon2001 posted:

Speaking of which, and this isn't really a Python question, but is there an open source CSV editor that has some of the functionality of excel without all the...overhead? It'd be nice to find something I can use to just muck around quickly with tabular data without having to constantly be like 'no, don't accidentally make it an xlsx, stop formatting it' etc - not that it's a super big problem or anything.

VSCode and PyCharm both have plugins that make CSV editing less painful.

DoctorTristan
Mar 11, 2006

Personally I’m extremely anti using pandas in anything even resembling a pipeline since the devs absolutely love introducing breaking changes.

DoctorTristan
Mar 11, 2006

Dash is a lot more flexible than Streamlit
