|
So here I am once again attempting to pull salesforce data via API into a dataframe and then to a db. The context is different this time, though. I'm pulling from their marketing cloud API and I actually managed to grab the data I need! Huzzah! However, the format looks a bit weird: code:
Now that I think of it, is there a better way to do this kind of thing than dataframes? Or is pandas pretty much the king of tabular mountain?

edit: oh wait, if I just convert the "=" to ":", will that get me a valid json? or is there a catch?

Seventh Arrow fucked around with this message at 20:01 on Jan 5, 2023 |
# ? Jan 5, 2023 19:59 |
|
|
# ? May 15, 2024 03:20 |
|
Seventh Arrow posted:edit: oh wait, if I just convert the "=" to ":", will that get me a valid json? or is there a catch?

You should set this line of thinking aside and step back a little bit. You've correctly identified that the data doesn't look like JSON. It's also not a format I recognize immediately other than to say it's structured and probably from the SOAP API as opposed to the newer REST API. Are you using the Python client for this API? If so you should have Python objects. How did you get to this format?
|
# ? Jan 5, 2023 21:13 |
|
nullfunction posted:You should set this line of thinking aside and step back a little bit.

Hey, thanks for the reply, I appreciate it. I think that you're correct that it's SOAP. The API is called FuelSDK and it's specifically for interacting with the salesforce marketing cloud - specifically audit events like unsubs, clicks, etc. I've been looking through some of their documentation about how to put some of this into an object, but everything they use is just print statements. As far as how I got there once you get past the authentication fluff, this is where I'm at: code:
|
# ? Jan 5, 2023 21:38 |
|
I think those are python objects - specifically a list of BounceEvent instances. Presumably that’s a class defined by SalesForce and op got the output above by print() -ing the output while paused in a debugger.

It’s a bit difficult to give good advice here because I’m still a bit puzzled about the final objective. IIRC the point of all this was that your boss wants you to download the data from salesforce because he’s too cheap to keep paying for the service. So, what do you want to do once you’ve downloaded it? Is it enough to shove it in a text file somewhere and go “ok I’ve archived it, go turn the service off now cheapskate”? Or are you expected to make it continuously available in some other form in an on-prem database? Also, would you have to keep that database up to date in future - presumably that data is a record of past business activities and if salesforce isn’t updating it then do you need to do that as well?

To get back to your original question - *if* downloading the data and parsing it via pandas is the right idea then there are still some answers we need before we can really tell you what you should do next. Firstly, how much data are you trying to pull down via this API - is it a few hundred of these entries or a few billion? Secondly is the schema always the same - does the data stream only contain BounceEvents or could there be other types mixed in there?
|
# ? Jan 5, 2023 22:09 |
|
DoctorTristan posted:I think those are python objects - specifically a list of BounceEvent instances. Presumably that’s a class defined by SalesForce and op got the output above by print() -ing the output while paused in a debugger.

Thanks DoctorTristan for your reply. This is actually slightly different than that other project. The company that I work for helps non-profits get set up in salesforce, so in a way all my work is salesforce-related in one way or another. In the other project, yes, my manager was too cheap to have a dedicated SQL client and opted for MySQL. This is for a client, though, and they have their own Azure SQL instance to put stuff in.

So in my mind, the process would go something like: python pulls data from salesforce marketing cloud ==> python somehow gets that weirdly-formatted data into a dataframe ==> pandas pushes the data using to_sql to the client's Azure SQL DB ==> buy cigars and booze and chill for the rest of the day

For now, they just want a sampling of the data to show the client so they can 'ooh' and 'aah' about our cool pipeline. I would need all of the columns, but just a handful of rows. And yes, this script is just for BounceEvents, but I would also want to generate data for ClickEvents, UnsubEvents, etc. But once I have the logic for BounceEvents, I can simply apply it to everything else. I hope that makes sense and is not too long-winded!

The more I look into this, the more it seems like this API is kind of difficult to work with and seems sort of abandoned when I look at their issues page. There seems to be an official API, but that is also SOAP and I'm wary of trying to learn it and getting the same result.
|
# ? Jan 5, 2023 22:52 |
|
As an aside, I've developed a severe aversion to saying, "this next task should be pretty easy."
|
# ? Jan 5, 2023 22:56 |
|
DoctorTristan posted:I think those are python objects - specifically a list of BounceEvent instances. Presumably that’s a class defined by SalesForce and op got the output above by print() -ing the output while paused in a debugger.

This is the key. I had figured you were print()ing the result, and there's a pretty important thing to call out here that will probably help you in many ways going forward, especially working with unfamiliar APIs. When you print() an object, all you're doing is asking that object for a string representation of itself. There's a default representation that gives you something like <__main__.Foo object at 0x0000deadbeef> if you choose not to implement your own __repr__() but it can be literally any string representation that the developer wants, including JSON or some weird string like you have here. Here's a small contrived example to hopefully illustrate: Python code:
Python code:
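(The original snippet didn't survive the archive; here's a minimal reconstruction of the idea being described - one class with the default representation, one with a custom __repr__ that returns an arbitrary string:)

```python
class Foo:
    pass  # no __repr__: printing falls back to the default representation

class Bar:
    def __repr__(self):
        # repr can return literally any string the developer wants,
        # including something that merely *looks* like structured data
        return "(Bar){ Name = 'example' }"

default_repr = repr(Foo())   # something like <__main__.Foo object at 0x...>
custom_repr = repr(Bar())    # whatever __repr__ chose to return
```

Neither string is something you should try to parse; it's just a representation for humans.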
Unless there's a particularly mature and well-maintained third party API that does exactly what you need it to (simple-salesforce is great but doesn't cover everything!), I'd go with the official API. You'd need something really compelling to convince me that adding a third party dependency over the official dependency is the right move for this, especially if the APIs they expose are similar. Ultimately you're going to have to do the work of mapping the columns you want (object data) to the right format (pandas dataframe) so that you can use the to_sql call regardless, so you may as well do it against the official API.

Seventh Arrow posted:As an aside, I've developed a severe aversion to saying, "this next task should be pretty easy."

Congrats, you can call yourself a programmer now.
|
# ? Jan 6, 2023 01:36 |
|
Thanks much nullfunction for the detailed explanation. I will check out dir() and see where that leads. With regard to

quote:Ultimately you're going to have to do the work of mapping the columns you want (object data) to the right format (pandas dataframe) so that you can use the to_sql call regardless...

can you point me to any documents or tutorials that might help with this?
|
# ? Jan 6, 2023 02:41 |
|
I had a look at the SDK link you posted and it's the same one linked from the Salesforce page, so I guess you're using the only choice. I can speak to what you need to do conceptually, but I haven't used pandas, all I've done is have a quick look at what classmethods exist to create a dataframe. I know it's a super popular library so I'm certain you can google "how do I load a pandas dataframe" and get to the same pages I would see. Just don't take this as an optimized solution, it's something to get you started.

I gather that you're new to writing code in general, so I'll try to break it down as far as I can, but you should go through the exercise of actually implementing all of this if it's gonna be your job. You're gonna be doing stuff like this a lot, and frankly Python's a great language to do it in. Part of navigating a huge ecosystem like Python's is going to be learning to use some of the built-in tools to self-service when exploring unfamiliar objects, as documentation isn't always available or useful when it does exist.

I'm going to have to make some assumptions because I don't have access to Salesforce data, but let's take the repr from the BounceEvent you posted previously as a guide to what might exist on that object:

quote:(BounceEvent){

The structure of the "Client" field suggests that the ClientID is a separate object or something more complex that you'll need to figure out and map accordingly, but overall this looks fairly straightforward in terms of mapping, provided you can locate the properties to get these values out of the object. I can see that this is a single bounced email, so I know that each BounceEvent maps to a "row" in pandas. I know that Pandas can accept a list of dicts (where each dict is a row, and each key in the dict is a column) into a dataframe, and you've indicated you want to use a dataframe to load the data into SQL so let's design with that goal in mind... we need a list of dicts to feed into pandas.
pre:python pulls data from salesforce marketing cloud ==> python somehow gets that weirdly-formatted data into a dataframe ==> pandas pushes the data using to_sql to the client's Azure SQL DB

Python code:
Now that you have all this code in a convenient function, it's a good time to fire up the REPL, paste the function in, and grab an object to play around with: code:
Python code:
Noodling around with dir(), calling methods, reading properties, using type() to map out the object, taking notes of what calls you need to make to get all your data out... these are things you'll want to get good at, you'll probably do them a lot. They're also super useful in a pinch when debugging. At this point, you should have a function that you can feed a BounceEvent into, and get a dict out of that has everything you need for pandas. If you can get something loaded into a dataframe for one example BounceEvent, you've got a pretty good chance that at least one other BounceEvent will work too, so when exploring this way, I always try to start getting a single example of something working, then feeding it more data to try to validate that it works with more than just the one event you're testing. Python code:
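(The code blocks above didn't survive the archive. A hedged sketch of the list-of-dicts approach being described - FakeBounceEvent and its attribute names are made up from the repr posted earlier, not the real FuelSDK class:)

```python
import pandas as pd

class FakeBounceEvent:
    """Stand-in for the SDK's BounceEvent; attribute names are assumptions."""
    def __init__(self, subscriber_key, event_date, bounce_type):
        self.SubscriberKey = subscriber_key
        self.EventDate = event_date
        self.BounceType = bounce_type

def event_to_row(event):
    # One event object -> one dict -> one row in the dataframe
    return {
        "SubscriberKey": event.SubscriberKey,
        "EventDate": event.EventDate,
        "BounceType": event.BounceType,
    }

events = [
    FakeBounceEvent("abc-123", "2023-01-05", "Hard bounce"),
    FakeBounceEvent("def-456", "2023-01-05", "Soft bounce"),
]
df = pd.DataFrame([event_to_row(e) for e in events])
# From here it's df.to_sql("bounce_events", connection) to land it in the DB
```

Get event_to_row working for one real event first, then feed it the whole list.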
A strikingly large portion of the programs you'll write follow the same basic pattern as what you're trying to do above, so much so that entire industries have grown up around it. Have a look at the wiki page for ETL as it may give you some starting points for stuff to research further, vocab, etc.
|
# ? Jan 6, 2023 21:11 |
|
Thanks again, nullfunction, I truly appreciate it - I only glanced over your post and will pore over it more in a bit; but for now, I took your advice and ran dir() and it turns out that it does indeed have a json attribute:

quote:print(dir())

I even found someone who made a module to wring a prettified json out of the API response: https://github.com/emailrhoads/sfmc_response_to_json/blob/master/sfmc_to_json/formatter.py (although granted it might have been designed with python 2.x in mind) I copy-pasted each code block in jupyter to see what would happen but I think my ancient brain has forgotten how to pass parameters to a function. I think the last 2 functions are supposed to be what finally gets you a dict that can then be used with json_normalize: code:
|
# ? Jan 6, 2023 21:35 |
|
To pass parameters to a function, you include them in the parentheses that indicate to Python that you would like to call the function. So, in the REPL, Python code:
When you call "dir()" like that, it's just examining the current local scope. I would guess that there exists a "json" label in this scope because you've imported the json module. To have "dir()" inspect an object, you need to feed it a label that points to that object, which you do by including the label in the parentheses. The official documentation (above link) is a good source for this and should help clear up anything else you run into as well.

12 rats tied together fucked around with this message at 22:47 on Jan 6, 2023 |
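For instance, the two calls look like this side by side:

```python
import json

scope_names = dir()        # no argument: names in the current scope,
                           # which is why "json" shows up after the import
module_attrs = dir(json)   # with an argument: *that object's* attributes
```

The second form is the one you want when poking at an unfamiliar API object.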
# ? Jan 6, 2023 22:01 |
|
Woops you're right, I did import the json module. I thought I was onto something there, haha. It's good to know that the json module can interact with the stuff in the API, though. Still working through nullfunction's code...this is good learning material!
|
# ? Jan 6, 2023 22:12 |
|
Anyone have any experience of using Dagster they could share here? Moving to a new place in the next couple months and they need some decent ETL tooling putting in place. My current place was always promising Airflow, but never materialised, so we just stuck with a list of python files and notebooks Any reason why I should be considering Airflow over something more modern like Dagster / Prefect? The functionality seems pretty great.
|
# ? Jan 18, 2023 23:06 |
|
Hi there. Fairly new to python so maybe this is incredibly obvious but I've been noodling around and I can't think of a good way to accomplish this. I'm running a system command and capturing its output. Output is a few items separated by fixed-length runs of spaces, not tabs. os.popen(mycommand).readlines() will give me an example list of, let's say code:
code:
Any help appreciated, thanks! And obviously if I've made some colossal blunder in anything I said I'll take feedback too.
|
# ? Jan 19, 2023 01:31 |
|
Python code:
Python code:
The March Hare fucked around with this message at 02:04 on Jan 19, 2023 |
# ? Jan 19, 2023 01:59 |
|
the trick here is that python's split method, when called with no separator argument, splits by "one or more" whitespace characters. just in case you ever had to do this yourself, you could write a loop that iterates through the string and finds "the first instance of the character to split on", and then just keep iterating forward until you find something else. this is usually a lot easier than hoping that e.g. string index number 21 is the start of the next item.
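for example:

```python
row = "ITEM1     ITEM2  ITEM3"
# with no separator argument, split() treats any run of whitespace
# as a single break, however wide the gap is
parts = row.split()
```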
|
# ? Jan 19, 2023 02:05 |
|
Switched teams recently, was trying to modify some stuff and went to check unit tests; most of the packages I'm touching don't have unit tests (and I don't have time to overhaul them) but one did, so I was like 'oh okay cool, I can add some tests here to check things'. However, when I looked more closely, I realized that all the tests were some variation of the below. See if you can spot the problem. Python code:
|
# ? Jan 19, 2023 02:56 |
|
The March Hare posted:
Using split for fixed-width columns is dangerous: if a "cell" is ever blank then unexpected things happen. But pandas can take care of this problem: https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
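A small made-up sample showing both points - three 10-character columns, with the middle "cell" of the second row left blank (the case that throws plain split() off by one):

```python
import io
import pandas as pd

# Row 2 has an empty middle column: 5 chars of data, 15 spaces, 5 chars
raw = "ITEM1     ITEM2     ITEM3\n" + "ITEM4" + " " * 15 + "ITEM6\n"

# read_fwf slices by column position, so the blank cell becomes NaN
# instead of shifting everything left
df = pd.read_fwf(io.StringIO(raw), widths=[10, 10, 10], header=None)
```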
|
# ? Jan 19, 2023 03:09 |
|
Thanks a million everyone. This gives me a lot of good reading, but I think I erred and provided a bad example that you guys still picked up on. One thing I should have specified or provided a better example of.. I put “ITEM”, “ITEM2”, etc as the standing example, but the string could actually be “THIS IS ITEM1”, “BLA BLA ITEM 2”, etc. So I imagine split will just split on whitespaces. Can I tell split to explicitly take X characters instead of a delimiter?

More generalized, if I have a list which contains a number of 60 character strings, I’d like to split each 60 character string into three sub-arrays/lists of 20 characters each. Even better if I can specify the length, but that may be optimistic.

Sounds like pandas does what I need so I think that’s my path forward. Was hoping to do it without invoking any external dependencies but I’m happy I can do it at all! Much appreciate the assist!

some kinda jackal fucked around with this message at 03:28 on Jan 19, 2023 |
# ? Jan 19, 2023 03:21 |
|
If you don't want to add dependencies (I don't blame you at all, pandas is a big module) then it'd be easy to roll your own. Something like this should work, I didn't test it though:

Python code:
- Since you described fixed-width columns that all have the same width, let's define a function that takes advantage of that. If you need to provide varied column widths then you could do that instead, it wouldn't look much different
- Determine N number of columns from the args (in our example, this will be 3)
- Start iterating over all lines in the file
- Given a line, use a generator expression to yield a tuple from the N entries in the row, where N == number of columns. We use strip() to remove any whitespace from the edges of each entry, but our column widths are always the same for the line extraction
- Then we use zip() with the splat operator to convert those tuples of rows into tuples of columns. These are separate and distinct tuples, do whatever you want with them from here - turn them into lists, convert them to arrays, etc.

e: Ah you're not using a file, you're using popen. This should still work, you'll just use popen instead of open; I'm pretty sure a popen can be context managed like this but I'm not certain

QuarkJets fucked around with this message at 05:39 on Jan 19, 2023 |
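Since the original snippet was lost to the archive, here's a sketch in the same spirit as the steps above (fixed 20-character columns, three per line; function name is made up):

```python
def split_fixed(lines, width, ncols):
    """Cut each fixed-width line into ncols chunks of `width` characters,
    strip each chunk, then flip the row tuples into column tuples."""
    rows = (
        tuple(line[i * width:(i + 1) * width].strip() for i in range(ncols))
        for line in lines
    )
    return list(zip(*rows))   # zip with the splat operator: rows -> columns

# 60-character lines built with ljust() so the widths are exact
line1 = "THIS IS ITEM1".ljust(20) + "BLA BLA ITEM 2".ljust(20) + "LAST ITEM".ljust(20)
line2 = "AAA".ljust(20) + "BBB".ljust(20) + "CCC".ljust(20)
cols = split_fixed([line1, line2], width=20, ncols=3)
```

Slicing by position means multi-word entries like "THIS IS ITEM1" survive intact, which plain split() can't guarantee.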
# ? Jan 19, 2023 05:28 |
|
Continued thanks! I think I came across something similar on one of the stack* sites after you guys gave me some inspiration, so this is probably what I'm going to try. Has a few concepts I wasn't familiar with but otherwise seems fairly straightforward. Apropos of nothing, honestly disappointed in myself that I didn't give python more of a look in over the last 5 or so years. Was initially very put off by the indentation scoping but once I just told myself to heckin' deal with it I've managed to find a lot of convenience just banging out small python things. If I had been a little less snobby at the start I'd be so much further along now
|
# ? Jan 19, 2023 12:22 |
|
Slimchandi posted:Anyone have any experience of using Dagster they could share here? Moving to a new place in the next couple months and they need some decent ETL tooling putting in place. My current place was always promising Airflow, but never materialised, so we just stuck with a list of python files and notebooks

I don't have anything to add but would love an answer to this too. The data science thread might be a better place for it, I've asked stuff like this before but it's kind of adjacent to a lot of things.
|
# ? Jan 19, 2023 15:48 |
|
QuarkJets posted:If you don't want to add dependencies (I don't blame you at all, pandas is a big module) then it'd be easy to roll your own. Something like this should work, I didn't test it though:

I just thought of a revision: Python code:
|
# ? Jan 19, 2023 20:16 |
Looking for help parsing .csv data. I'm trying to create maps (either in TkinterMapView or Folium; haven't figured out which yet) that will show different types of data from the .csv input (heatmaps, location owner, etc.). The .csv files are provided by the city of New Orleans. I'm trying to work off of GPS coordinates, and that column in the .csv is formatted like a tuple: code:
Here's what I've done so far without attempting the above: Python code:
|
|
# ? Jan 19, 2023 21:50 |
|
Continuing to get a ton out of these suggestions, thanks. I have to go back to the docs to understand a few concepts but I think I can follow this. e: Exception catching question figured out some kinda jackal fucked around with this message at 23:56 on Jan 19, 2023 |
# ? Jan 19, 2023 22:20 |
|
C2C - 2.0 posted:This works but it just seems like I could shave time off the map plotting if I didn't have to go thru the Geopy coversion first.

Yeah, if you have lat/lon coordinates from the query anyway then you should definitely use them. What's holding you up?
|
# ? Jan 20, 2023 04:35 |
|
Slimchandi posted:Anyone have any experience of using Dagster they could share here? Moving to a new place in the next couple months and they need some decent ETL tooling putting in place. My current place was always promising Airflow, but never materialised, so we just stuck with a list of python files and notebooks

Just about everyone in the data engineering space is going coo-coo over DBT, so that might be worth a look. Although ironically people are pairing it with Airflow because it can't do everything. Azure Data Factory is also pretty slick, if your company wants a ritzy cloud solution.
|
# ? Jan 20, 2023 05:00 |
|
In terms of programming best practices, if a called func is expected to return two items, but raises an exception during its processing and can't produce any meaningful/useful data to return, should I return SOMETHING? I know there is a non-zero chance of the called func failing due to external dependencies outside of the python script. I wrap the actual func() call in main() in a try/except, so I suspect there's little value in returning anything as the whole exercise has failed and I won't be doing anything meaningful. Right now that's how my code works. I actually don't know why I'm asking since the answer feels really obvious now that I've typed it out, just wondering if there's some style guide/best practice I'm breaking (which, who cares, ultimately) -- so more out of curiosity. Only thing I can think is that the concept of this func is that it returns two things, so I should get into the habit of returning what I said I would, regardless of whether I'm bailing out or not, or whether it'll actually be used.
|
# ? Jan 20, 2023 13:53 |
Yeah, I think that's how I think of it — just because Python isn't strongly typed doesn't mean you shouldn't impose contracts on your code yourself (e.g. with type hinting). Which means your function should only do one of two things: a) return two things, or b) raise an exception. Otherwise the calling code has to concern itself with all the weird possibilities of things the function might return, which defeats the whole purpose of factoring complex logic into simple callable methods.
|
|
# ? Jan 20, 2023 14:24 |
QuarkJets posted:Yeah, if you have lat/lon coordinates from the query anyway then you should definitely use them. What's holding you up?

The lat/long in the .csv are in a single column, like so: (29.84387, -90.94854) and the column name is Location. Initially, using TkinterMapView, I thought the coordinates had to be parsed individually (a separate value for lat & long). Poking around the docs for Folium this morning, it looks like it can set markers from a list of GPS coords. Knowing that, I'm gonna' try to read the .csv with Pandas & create a list just from that field and send it to Folium. Sorry if this is beginner-level stuff; I'm new to programming/Python having just started learning in earnest at the beginning of the year.
|
|
# ? Jan 20, 2023 17:27 |
|
C2C - 2.0 posted:The lat/long in the .csv are in a single column, like so: (29.84387, -90.94854) and the column name is Location. Initially, using TkinterMapView, I thought the coordinates had to be parsed individually (a separate value for lat & long). Poking around the docs for Folium this morning, it looks like it can set markers from a list of GPS coords. Knowing that, I'm gonna' try to read the .csv with Pandas & create a list just from that field and send it to Folium.

No problem, you can use strip() to remove whitespace and the parens from the beginning and end, then split() the result to 2 variables (lat and lon). Python code:
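On one of the sample values from the CSV, that looks like:

```python
raw = "(29.84387, -90.94854)"
# strip(" ()") removes any leading/trailing spaces and parens in one pass
lat_text, lon_text = raw.strip(" ()").split(",")
# float() tolerates the leftover space in front of the longitude
lat, lon = float(lat_text), float(lon_text)
```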
|
# ? Jan 20, 2023 17:52 |
QuarkJets posted:No problem, you can use strip() to remove whitespace and the parens from the beginning and end, then split() the result to 2 variables (lat and lon).

So I reworked the script a bit; dropped Pandas and just used the csv module: Python code:
code:
Python code:
code:
|
|
# ? Jan 20, 2023 19:27 |
|
It's sort of a detour, but when you have chained strip() calls like that, you can generally condense them into one call with the characters you want stripped passed in as the argument. So string.strip(" ()") should remove spaces and parentheses.

Edit: It also looks like your "north_east_corner = min(str_locs)" should be using max()

marshalljim fucked around with this message at 19:53 on Jan 20, 2023 |
# ? Jan 20, 2023 19:44 |
|
C2C - 2.0 posted:, then Pycharm returns:

I think the problem here is you want to unpack each element of the list as a tuple and you're unpacking the list itself as if it were a tuple instead. I don't remember if you can just surround it with parentheses to make it clearer that you're unpacking each object, or if you just want to do something like code:
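A hypothetical version of the marker loop (the map call is a stand-in, not the real TkinterMapView/Folium API): unpack each (lat, lon) element one at a time rather than unpacking the whole list at once.

```python
locs = [(29.84387, -90.94854), (29.9511, -90.0715)]
markers = []
for lat, lon in locs:          # unpacks one 2-tuple per iteration
    markers.append({"lat": lat, "lon": lon})   # stand-in for set_marker(lat, lon)
```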
|
# ? Jan 20, 2023 19:49 |
|
Oh hey, are we talking about pandas?!? I'm not even sure what I'm looking for can be done and it's difficult to google/ChatGPT for. I have one table of metadata that provides the allowable length of a given field. For example, the 'Account ID' field can only be 18 characters long; the 'Account Name' field can only be 255 characters long, etc. This has to be compared against the client's data in a separate table:

Now that I think about it, python wouldn't need to iterate the 'label' column. It does, however, need to iterate over Account ID and compare each value to row 1 of 'length', then iterate over 'Account Name' and compare each value to row 4 of 'length', and so on. And of course do some sort of 'if len('Account ID') > length' type operation.

If I had to hand code everything one by one, it might look something like: code:
Wait, maybe I could have a function do a bunch of the legwork: code:
But then I would need something that keeps feeding it the necessary values, like:

check_length_not_that_size_matters_or_anything_haha(df['Account ID'], df2.at[0, 'length'])
check_length_not_that_size_matters_or_anything_haha(df['Account Name'], df2.at[3, 'length'])
check_length_not_that_size_matters_or_anything_haha(df['Account Type'], df2.at[4, 'length'])

Is there a way to iterate that as well, maybe?

Seventh Arrow fucked around with this message at 21:45 on Jan 20, 2023 |
# ? Jan 20, 2023 21:28 |
|
Data Graham posted:Yeah, I think that's how I think of it — just because Python isn't strongly typed doesn't mean you shouldn't impose contracts on your code yourself (e.g. with type hinting). Which means your function should only do one of two things: a) return two things, or b) raise an exception. Otherwise the calling code has to concern itself with all the weird possibilities of things the function might return, which defeats the whole purpose of factoring complex logic into simple callable methods.

Python is strongly typed but it's also dynamically typed - a label can point to a thing that is a different type than what it pointed to previously. Type systems have a lot of dimensions and they can get pretty confusing, for example, python is also "gradually typed" these days, to some degree it is "structurally typed" due to Protocols and also "duck typed". But yes, you should absolutely use type hints when they matter or increase the readability of a particular section of code. Good, relevant example here is tuple unpacking: Python code:
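(The original snippet was lost; a minimal sketch of that point, with a made-up function name:)

```python
def fetch_pair() -> tuple[int, str]:
    """The annotation is the contract: always exactly two things back."""
    return 42, "ok"

# Unpacking at the call site is safe precisely because the function
# either returns the annotated pair or raises - never something else
code, message = fetch_pair()
```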
|
# ? Jan 20, 2023 22:09 |
|
Seventh Arrow posted:Oh hey, are we talking about pandas?!?

You can do some clever things with masks to return only the rows that exceed your limits, and also tell you which columns in every row has a problem. Python code:
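Biffmotron's original code didn't survive the archive; here's a hedged sketch of the mask idea on made-up data (it uses .any() where the original reportedly used sum().astype(bool) - same effect):

```python
import pandas as pd

# Made-up sample data and limits, mirroring the metadata-table setup above
df = pd.DataFrame({
    "Account ID": ["A" * 18, "B" * 25],
    "Account Name": ["Fine", "X" * 300],
})
limits = {"Account ID": 18, "Account Name": 255}

# One boolean per cell: True wherever a value exceeds its column's limit.
# apply() hands each column Series to the lambda; col.name looks up the limit.
mask = df.apply(lambda col: col.str.len() > limits[col.name])

# Collapse the mask across columns (axis=1) to flag offending rows
bad_rows = df[mask.any(axis=1)]
```

mask tells you which cells are too long; bad_rows gives you just the offending records, so there's no need to hand-write one check per column.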
|
# ? Jan 20, 2023 22:23 |
|
C2C - 2.0 posted:So I reworked the script a bit; dropped Pandas and just used the csv module: Since you are using pycharm, try running your code in debug mode so that you can inspect your objects when an error occurs This error message is giving an important clue: it is saying that "row" is a dict, not a string. This is because csv.DictReader returns a dict for each row of data. The keys in the dict are the column names. Python code:
Then later you can iterate over these strings and do something with them:

Python code:
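A compact stand-in for the city CSV showing the dict-per-row behavior (the header and values here are invented):

```python
import csv
import io

# DictReader keys each row by the header names, so you index by column
# name ("Location") rather than by position
data = io.StringIO('Name,Location\nCafe,"(29.84387, -90.94854)"\n')
locations = [row["Location"] for row in csv.DictReader(data)]
```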
|
# ? Jan 20, 2023 22:35 |
|
Thank you for your efforts Biffmotron! It will take my brain a bit of time to digest it, but I think I get the general idea. The use of a lambda function is very cool.
|
# ? Jan 20, 2023 22:36 |
|
|
|
You're welcome. I spend a lot of time doing stuff in Pandas, which is kind of finicky and arcane in a lot of ways, but has become standardized. It could be worse, could be matplotlib. Pandas does some pretty heavy lifting under the hood, but it's basically all numpy arrays.

This code relies on a couple of intermediate Pandas tricks, which can be understood by inspecting the intermediary objects m and bc. The first one is masking. If you send an index mask, a list of bools the same length as the array/Series/DataFrame, to a DataFrame via df[mask], you get back the rows in the DataFrame where the mask is True. The second trick is using axis=1 in several places. By default, Pandas does all its operations along the columns, which is axis=0. axis=1 does an operation along rows, which combined with mask.sum(axis=1).astype(bool) makes an index mask.

df.apply() is a very useful pattern. Generally, iterating over DataFrame rows yourself is slow; letting Pandas drive the loop, and using truly vectorized numpy operations where you can, is often orders of magnitude faster. df['new_col'] = df['column'].apply(lambda x: func(x)) is super useful for simple data transforms. More complex data transforms can be done using apply and axis. This function takes in a row and does some arbitrary math, but it's easy to imagine some convoluted business logic that has to be applied to a row in a table to get a result. Python code:
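A tiny made-up example of the row-wise form (column names invented):

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0], "qty": [3, 5]})

def total(row):
    # with axis=1, apply hands each *row* to the function instead of each column
    return row["price"] * row["qty"]

df["total"] = df.apply(total, axis=1)
```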
|
# ? Jan 21, 2023 21:55 |