Seventh Arrow
Jan 26, 2005

So here I am once again attempting to pull salesforce data via API into a dataframe and then to a db. The context is different this time, though. I'm pulling from their marketing cloud API and I actually managed to grab the data I need! Huzzah! However, the format looks a bit weird:

code:
[(BounceEvent){
   Client = 
      (ClientID){
         ID = client_id_here
      }
   PartnerKey = None
   ObjectID = None
   SendID = 4016
   SubscriberKey = "subscriber_key_here"
   EventDate = 2021-05-12 12:53:38.780000
   EventType = "OtherBounce"
   TriggeredSendDefinitionObjectID = None
   BatchID = 4
 }, (BounceEvent){
   Client = 
      (ClientID){
etc, etc
...]
At first I thought it was json contained in a list, but I believe json uses colon for key-value separation, not equals. So I probably won't be able to use json_normalize or anything like that. Trying to filter out the unnecessary characters and get the keys to be columns seems like messy work. What's the best way to proceed?

Now that I think of it, is there a better way to do this kind of thing than dataframes? Or is pandas pretty much the king of tabular mountain?

edit: oh wait, if I just convert the "=" to ":", will that get me a valid json? or is there a catch?



nullfunction
Jan 24, 2005


Seventh Arrow posted:

edit: oh wait, if I just convert the "=" to ":", will that get me a valid json? or is there a catch?

You should set this line of thinking aside and step back a little bit.

You've correctly identified that the data doesn't look like JSON. It's also not a format I recognize immediately other than to say it's structured and probably from the SOAP API as opposed to the newer REST API. Are you using the Python client for this API? If so you should have Python objects. How did you get to this format?

Seventh Arrow
Jan 26, 2005

nullfunction posted:

You should set this line of thinking aside and step back a little bit.

You've correctly identified that the data doesn't look like JSON. It's also not a format I recognize immediately other than to say it's structured and probably from the SOAP API as opposed to the newer REST API. Are you using the Python client for this API? If so you should have Python objects. How did you get to this format?

Hey, thanks for the reply, I appreciate it. I think you're correct that it's SOAP. The SDK is called FuelSDK and it's specifically for interacting with the Salesforce Marketing Cloud - specifically audit events like unsubs, clicks, etc. I've been looking through some of their documentation about how to put some of this into an object, but everything they use is just print statements. As far as how I got there: once you get past the authentication fluff, this is where I'm at:

code:
# Create an instance of the object type we want to work with:

list = FuelSDK.ET_List()  # (note: this name shadows the built-in list type)

# Associate the ET_Client object using the auth_stub property:

list.auth_stub = myClient

# Utilize one of the ET_List methods:

response = list.get()

# Print out the results for viewing
print('Post Status: ' + str(response.status))
print('Code: ' + str(response.code))
print('Message: ' + str(response.message))
print('Result Count: ' + str(len(response.results)))

debug = False

getBounceEvent = ET_Client.ET_BounceEvent()
getBounceEvent.auth_stub = myClient
getResponse = getBounceEvent.get()

print(getResponse.results)
So the last line gets me the output in my original post. The trick is getting it into a format that pandas will recognize. Admittedly I'm unfamiliar with APIs and am in some places just blatantly copy-pasting from elsewhere.

DoctorTristan
Mar 11, 2006

I think those are Python objects - specifically a list of BounceEvent instances. Presumably that’s a class defined by Salesforce and the OP got the output above by print()-ing the output while paused in a debugger.

It’s a bit difficult to give good advice here because I’m still a bit puzzled about the final objective. IIRC the point of all this was that your boss wants you to download the data from salesforce because he’s too cheap to keep paying for the service. So, what do you want to do once you’ve downloaded it? Is it enough to shove it in a text file somewhere and go “ok I’ve archived it, go turn the service off now cheapskate”? Or are you expected to make it continuously available in some other form in an on-prem database? Also, would you have to keep that database up to date in future - presumably that data is a record of past business activities and if salesforce isn’t updating it then do you need to do that as well?

To get back to your original question - *if* downloading the data and parsing it via pandas is the right idea then there are still some answers we need before we can really tell you what you should do next. Firstly, how much data are you trying to pull down via this API - is it a few hundred of these entries or a few billion? Secondly is the schema always the same - does the data stream only contain BounceEvents or could there be other types mixed in there?

Seventh Arrow
Jan 26, 2005

DoctorTristan posted:

I think those are Python objects - specifically a list of BounceEvent instances. Presumably that’s a class defined by Salesforce and the OP got the output above by print()-ing the output while paused in a debugger.

It’s a bit difficult to give good advice here because I’m still a bit puzzled about the final objective. IIRC the point of all this was that your boss wants you to download the data from salesforce because he’s too cheap to keep paying for the service. So, what do you want to do once you’ve downloaded it? Is it enough to shove it in a text file somewhere and go “ok I’ve archived it, go turn the service off now cheapskate”? Or are you expected to make it continuously available in some other form in an on-prem database? Also, would you have to keep that database up to date in future - presumably that data is a record of past business activities and if salesforce isn’t updating it then do you need to do that as well?

To get back to your original question - *if* downloading the data and parsing it via pandas is the right idea then there are still some answers we need before we can really tell you what you should do next. Firstly, how much data are you trying to pull down via this API - is it a few hundred of these entries or a few billion? Secondly is the schema always the same - does the data stream only contain BounceEvents or could there be other types mixed in there?

Thanks DoctorTristan for your reply. This is actually slightly different than that other project. The company that I work for helps non-profits get set up in Salesforce, so in a way all my work is Salesforce-related in one way or another. In the other project, yes, my manager was too cheap to have a dedicated SQL client and opted for MySQL.

This is for a client, though, and they have their own Azure SQL instance to put stuff in. So in my mind, the process would go something like: python pulls data from salesforce marketing cloud ==> python somehow gets that weirdly-formatted data into a dataframe ==> pandas pushes the data using to_sql to the client's Azure SQL DB ==> buy cigars and booze and chill for the rest of the day

For now, they just want a sampling of the data to show the client so they can 'ooh' and 'aah' about our cool pipeline. I would need all of the columns, but just a handful of rows. And yes, this script is just for BounceEvents, but I would also want to generate data for ClickEvents, UnsubEvents, etc. But once I have the logic for BounceEvents, I can simply apply it to everything else. I hope that makes sense and is not too long-winded!

The more I look into this, the more it seems like this API is kind of difficult to work with and seems sort of abandoned when I look at their issues page. There seems to be an official API, but that is also SOAP and I'm wary of trying to learn it and getting the same result.

Seventh Arrow
Jan 26, 2005

As an aside, I've developed a severe aversion to saying, "this next task should be pretty easy."

nullfunction
Jan 24, 2005


DoctorTristan posted:

I think those are Python objects - specifically a list of BounceEvent instances. Presumably that’s a class defined by Salesforce and the OP got the output above by print()-ing the output while paused in a debugger.

This is the key.

I had figured you were print()ing the result, and there's a pretty important thing to call out here that will probably help you in many ways going forward, especially working with unfamiliar APIs.

When you print() an object, all you're doing is asking that object for a string representation of itself. There's a default representation that gives you something like <__main__.Foo object at 0x0000deadbeef> if you choose not to implement your own __repr__() but it can be literally any string representation that the developer wants, including JSON or some weird string like you have here.

Here's a small contrived example to hopefully illustrate:

Python code:
from dataclasses import dataclass

@dataclass
class Rectangle:
    height: int
    width: int
    
    @property
    def area(self) -> int:
        return self.height * self.width
    
    def __repr__(self) -> str:
        return f"A rectangle (area {self.area})"

r = Rectangle(3,4)
print(r) # A rectangle (area 12)
print(r.height) # 3
print(r.width) # 4
What you probably intend to use is dir(), which tells you what your options are:

Python code:
['__annotations__', '__class__', '...snip...', 'area', 'height', 'width']
Using dir() like that can help you identify what properties or methods you might be able to use to get all of the column data out of the object, which I think is probably more aligned with what you were originally trying to do. Maybe you luck out and there's a to_json() method hiding among the entries, but more likely you'll get all of the various properties that you need to plug into the next step in the process.

Unless there's a particularly mature and well-maintained third party API that does exactly what you need it to (simple-salesforce is great but doesn't cover everything!), I'd go with the official API. You'd need something really compelling to convince me that adding a third party dependency over the official dependency is the right move for this, especially if the APIs they expose are similar.

Ultimately you're going to have to do the work of mapping the columns you want (object data) to the right format (pandas dataframe) so that you can use the to_sql call regardless, so you may as well do it against the official API.

Seventh Arrow posted:

As an aside, I've developed a severe aversion to saying, "this next task should be pretty easy."

Congrats, you can call yourself a programmer now.

Seventh Arrow
Jan 26, 2005

Thanks much nullfunction for the detailed explanation. I will check out dir() and see where that leads. With regard to

quote:

Ultimately you're going to have to do the work of mapping the columns you want (object data) to the right format (pandas dataframe) so that you can use the to_sql call regardless...

can you point me to any documents or tutorials that might help with this?

nullfunction
Jan 24, 2005

I had a look at the SDK link you posted and it's the same one linked from the Salesforce page, so I guess you're using the only choice.

I can speak to what you need to do conceptually, but I haven't used pandas; all I've done is have a quick look at what classmethods exist to create a dataframe. I know it's a super popular library, so I'm certain you can google "how do I load a pandas dataframe" and get to the same pages I would see. Just don't take this as an optimized solution, it's something to get you started.

I gather that you're new to writing code in general, so I'll try to break it down as far as I can, but you should go through the exercise of actually implementing all of this if it's gonna be your job. You're gonna be doing stuff like this a lot, and frankly Python's a great language to do it in. Part of navigating a huge ecosystem like Python's is learning to use some of the built-in tools to self-service when exploring unfamiliar objects, since documentation isn't always available, or useful when it does exist.

I'm going to have to make some assumptions because I don't have access to Salesforce data, but let's take the repr from the BounceEvent you posted previously as a guide to what might exist on that object:

quote:

(BounceEvent){
   Client =
      (ClientID){
         ID = client_id_here
      }
   PartnerKey = None
   ObjectID = None
   SendID = 4016
   SubscriberKey = "subscriber_key_here"
   EventDate = 2021-05-12 12:53:38.780000
   EventType = "OtherBounce"
   TriggeredSendDefinitionObjectID = None
   BatchID = 4
}

The structure of the "Client" field suggests that the ClientID is a separate object or something more complex that you'll need to figure out and map accordingly, but overall this looks fairly straightforward in terms of mapping, provided you can locate the properties to get these values out of the object. I can see that this is a single bounced email, so I know that each BounceEvent maps to a "row" in pandas. I know that Pandas can accept a list of dicts (where each dict is a row, and each key in the dict is a column) into a dataframe, and you've indicated you want to use a dataframe to load the data into SQL so let's design with that goal in mind... we need a list of dicts to feed into pandas.

pre:
python pulls data          python somehow gets                  pandas pushes the data
from salesforce    ==>     that weirdly-formatted data  ==>     using to_sql to the
marketing cloud            into a dataframe                     client's Azure SQL DB
It seems like you've got the first part of this working from the examples you found, so let's just wrap it in a function, and annotate that it returns a list:

Python code:
def get_bounce_events() -> list:
    # snip -- example code you posted earlier goes here -- getResponse.results contains a list of BounceEvent objects
    getResponse = getBounceEvent.get()
    bounce_events = getResponse.results
    return bounce_events
Boom, first section done. Sort of. Maybe you need to pull out some of the authentication stuff (don't hard-code auth!), maybe it needs an argument to provide some further query info to narrow down the bounce_events you get, but that's the gist of it. We've put the logic relating to the retrieval of data in its own function away from everything else. This makes it easier to change in the future when someone inevitably asks you to make a tweak to how it works.
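On the don't-hard-code-auth point, one simple approach (variable names made up here, adapt to whatever FuelSDK actually reads from its config) is to pull the secrets out of environment variables:

Python code:
import os

# hypothetical names -- set these in your shell or deployment environment
client_id = os.environ["SFMC_CLIENT_ID"]
client_secret = os.environ["SFMC_CLIENT_SECRET"]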

Now that you have all this code in a convenient function, it's a good time to fire up the REPL, paste the function in, and grab an object to play around with:

code:
>>> # paste your code, hopefully no errors
>>> bounce_events = get_bounce_events() # invoking the function you pasted
>>> bounce_event = bounce_events[0] # grab the first one from the list
>>> # now, you can easily do things like
>>> dir(bounce_event)
['__magic_methods__','UsefulProperty','SomeJunkTooProbably',...]
>>> type(bounce_event.UsefulProperty)
<class 'str'>
>>> print(bounce_event.UsefulProperty)
Some string data that interests you
Using dir() and type() you should be able to locate the properties that correspond to the values in the columns for this row. Once you understand how to get the data out of the object, you can write a function that accepts one of these objects as an argument, and outputs a dict:

Python code:
def transform_bounce_event(bounce_event: object) -> dict:
    return {
        "client_id": bounce_event.ClientID.ID, # or however this is obtained, maybe it's bounce_event['ClientID']['ID']? use the repl to figure out how to access the underlying data!
        "partner_key": bounce_event.PartnerKey,
        "send_id": bounce_event.SendID,
        # etc
    }
Now, that's probably fine until you get to EventDate. If you're lucky, the BounceEvent object will give you a native Python datetime, and I'd be shocked if pandas didn't recognize and convert native datetimes to whatever it uses internally. The worst case scenario is that you can coax a string out of the object and use strptime to parse the date and time from that string. Familiarize yourself with strptime's format codes, you'll use them again, even if you lucked out this time. Either way, if it needs to end up in SQL, you need to figure out how to get it into the dict in a format that pandas likes, so that it can go into the dataframe.
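For example, if the string you get back looks like the EventDate in your repr, something like this should parse it (the format string is a guess based on that output):

Python code:
from datetime import datetime

raw = "2021-05-12 12:53:38.780000"  # copied from the repr above
event_date = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S.%f")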

Noodling around with dir(), calling methods, reading properties, using type() to map out the object, taking notes of what calls you need to make to get all your data out... these are things you'll want to get good at, you'll probably do them a lot. They're also super useful in a pinch when debugging.

At this point, you should have a function that you can feed a BounceEvent into, and get a dict out of that has everything you need for pandas. If you can get something loaded into a dataframe for one example BounceEvent, you've got a pretty good chance that at least one other BounceEvent will work too, so when exploring this way, I always try to start getting a single example of something working, then feeding it more data to try to validate that it works with more than just the one event you're testing.

Python code:
import pandas

# pull data from salesforce marketing cloud
bounce_events = get_bounce_events() # maybe there is an argument here for a campaign or date range or whatever
# get that weirdly-formatted data into something appropriate for a dataframe
transformed_events = [transform_bounce_event(e) for e in bounce_events]
# make the dataframe from the transformed data
df = pandas.DataFrame.from_records(data=transformed_events, ...)
# sql magic here, no idea how pandas connects to azure but it happens last
do_some_sql_magic_with_dataframe(df)
Thinking about separating your code like you did is a great start, it gives you clues as to how you can structure the code. Use functions, they're great for abstracting away complexity, providing convenient places to swap code in and out, and just generally keeping things tidy. I didn't go too wild with type annotations above but I highly recommend looking into them further and using them everywhere you can; your IDE can interpret them and shout at you when you try to do things you shouldn't. In Python they're suggestions, not enforced, but more and more libraries are using them and the tooling has improved drastically over the last few years, there's never been a better time to get in the habit.

A strikingly large portion of the programs you'll write follow the same basic pattern as what you're trying to do above, so much so that entire industries have grown up around it. Have a look at the wiki page for ETL as it may give you some starting points for stuff to research further, vocab, etc.

Seventh Arrow
Jan 26, 2005

Thanks again, nullfunction, I truly appreciate it - I only glanced over your post and will pore over it more in a bit; but for now, I took your advice and ran dir() and it turns out that it does indeed have a json attribute:

quote:

print(dir())

['ET_Client', 'FuelSDK', 'In', 'Out', '_', '__', '___', '__builtin__', '__builtins__', '__doc__', '__loader__', '__name__', '__package__', '__spec__', '_dh', '_i', '_i1', '_i10', '_i11', '_i12', '_i2', '_i3', '_i4', '_i5', '_i6', '_i7', '_i8', '_i9', '_ih', '_ii', '_iii', '_oh', 'auth_base_url', 'clientid', 'clientsecret', 'debug', 'exit', 'getBounceEvent', 'getResponse', 'get_ipython', 'json', 'list', 'myClient', 'pd', 'quit', 'response', 'rest_url', 'soap_url', 'subdomain']

I even found someone who made a module to wring a prettified json out of the API response: https://github.com/emailrhoads/sfmc_response_to_json/blob/master/sfmc_to_json/formatter.py

(although granted it might have been designed with python 2.x in mind)

I copy-pasted each code block in jupyter to see what would happen but I think my ancient brain has forgotten how to pass parameters to a function. I think the last 2 functions are supposed to be what finally gets you a dict that can then be used with json_normalize:

code:
def _items(sobject):
    """
    Extract the I{items} from a suds object much like the
    items() method works on I{dict}.
    @param sobject: A suds object
    @type sobject: L{Object}
    @return: A list of items contained in I{sobject}.
    @rtype: [(key, value),...]
    """
    for item in sobject:
        # print(item)
        yield item

def _asdict(sobject):
    """
    Convert a sudsobject into a dictionary.
    @param sobject: A suds object
    @type sobject: L{Object}
    @return: A python dictionary containing the
        items contained in I{sobject}.
    @rtype: dict
    """
    items = _items(sobject)
    #if len(list(items)) == 0:
    #    items.append("Null")
    return dict(items)
Maybe once I read over your post a few times, I will be able to piece things together.

12 rats tied together
Sep 7, 2006

To pass parameters to a function, you include them in the parentheses that indicate to Python that you would like to call the function.

so, in the REPL,
Python code:
# take whatever the label "json" maps to right now and give me the result of executing its  __repr__ method
>>> json  

# take whatever the label "json" maps to and call it with no arguments/parameters
>>> json() 

# same as above but call it and pass it a string object that contains the characters in the word 'something'
>>> json('something')
(edited for formatting)

When you call "dir()" like that, it's just examining the current local scope. I would guess that there exists a "json" label in this scope because you've imported the json module.

To have "dir()" inspect an object, you need to feed it a label that points to that object, which you do by including the label in the parenthesis. The official documentation (above link) is a good source for this and should help clear up anything else you run into as well.


Seventh Arrow
Jan 26, 2005

Whoops, you're right, I did import the json module. I thought I was onto something there, haha. It's good to know that the json module can interact with the stuff in the API, though.

Still working through nullfunction's code...this is good learning material!

Slimchandi
May 13, 2005
Anyone have any experience of using Dagster they could share here? Moving to a new place in the next couple months and they need some decent ETL tooling putting in place. My current place was always promising Airflow, but never materialised, so we just stuck with a list of python files and notebooks :)

Any reason why I should be considering Airflow over something more modern like Dagster / Prefect? The functionality seems pretty great.

some kinda jackal
Feb 25, 2003

 
 
Hi there. Fairly new to python so maybe this is very incredibly obvious but I've been noodling around and I can't think of a good way to accomplish this.

I'm running a system command and capturing its output. The output is a few items separated by fixed-length runs of spaces, not tabs.

os.popen(mycommand).readlines() will give me an example list of, let's say

code:
[ 'ITEM                TIME                STATUS              \n', 'ITEM2               TIME2               STATUS2            \n', etc ]
In this hypothetical example each item is padded out to 20 characters, next one starts exactly 20+1. Ideally in my mind I would like to turn this into a list of

code:
[ [ 'ITEM', 'TIME', 'STATUS' ] , [ 'ITEM2', 'TIME2', 'STATUS2' ], etc ]
I'm just struggling with how I'd iterate through the original list to split each item out into its own array/list of 3 items by subsets of 0-19, 20-39, 40-59. I feel it should be fairly simple but I also feel like my eyes are glazing over python documentation and google searches right now and I'm spinning.

Any help appreciated, thanks! And obviously if I've made some colossal blunder in anything I said I'll take feedback too.

The March Hare
Oct 15, 2006

Python code:
unwanted = [ 'ITEM                TIME                STATUS              \n', 'ITEM2               TIME2               STATUS2            \n']
wanted = [line.split() for line in unwanted]

print(wanted)
or, if a list comprehension is something you aren't familiar with yet, this will do the same thing:

Python code:
unwanted = [ 'ITEM                TIME                STATUS              \n', 'ITEM2               TIME2               STATUS2            \n']
wanted = []

for line in unwanted:
    wanted.append(line.split())

print(wanted)
Knowing common little helper methods like .split() (and uncommon ones later!) is part of what makes Python so nice to work with. Unfortunately, becoming familiar with them is a pain in the rear end. The standard library is vast and not all created equal.


12 rats tied together
Sep 7, 2006

the trick here is that python's split method splits on "one or more" whitespace characters. just in case you ever have to do this yourself, you could write a loop that iterates through the string, finds "the first instance of the character to split on", and then keeps iterating forward until you find something else.

this is usually a lot easier than hoping that e.g. string index number 21 is the start of the next item. :)
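a rough sketch of that hand-rolled loop, just to spell it out (plain illustration, str.split() already does this for you):

Python code:
def split_on_runs(line, sep=" "):
    """Split on runs of `sep`, roughly what str.split() does with no argument."""
    parts, current = [], []
    for ch in line:
        if ch == sep:
            if current:                      # just finished a field, flush it
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts

print(split_on_runs("ITEM                TIME                STATUS"))
# ['ITEM', 'TIME', 'STATUS']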

Falcon2001
Oct 10, 2004

Switched teams recently, was trying to modify some stuff and went to check unit tests; most of the packages I'm touching don't have unit tests (and I don't have time to overhaul them) but one did, so I was like 'oh okay cool, I can add some tests here to check things'.

However, when I looked more closely, I realized that all the tests were some variation of the below. See if you can spot the problem.

Python code:
def test_get_widget():
	expected_response = ["OUTCOME"]
	api_to_test.get_widget = Mock(response=expected_response)
	assert api_to_test.get_widget(test_parameters) == expected_response
This is also the package where I added the most actual functionality that actually could use some tests...so...welp. Looks like I get to rewrite all the unit tests tomorrow.

QuarkJets
Sep 8, 2008

The March Hare posted:

Python code:
unwanted = [ 'ITEM                TIME                STATUS              \n', 'ITEM2               TIME2               STATUS2            \n']
wanted = [line.split() for line in unwanted]

print(wanted)
or, if a list comprehension is something you aren't familiar with yet, this will do the same thing:

Python code:
unwanted = [ 'ITEM                TIME                STATUS              \n', 'ITEM2               TIME2               STATUS2            \n']
wanted = []

for line in unwanted:
    wanted.append(line.split())

print(wanted)
Knowing common little helper methods like .split() (and uncommon ones later!) is part of what makes Python so nice to work with. Unfortunately, becoming familiar with them is a pain in the rear end. The standard library is vast and not all created equal.

Using split for fixed-width columns is dangerous: if a "cell" is ever blank then unexpected things happen.

But pandas can take care of this problem: https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
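Untested, but a minimal read_fwf sketch with the example lines from above would look something like:

Python code:
import io

import pandas as pd

raw = ('ITEM                TIME                STATUS              \n'
       'ITEM2               TIME2               STATUS2            \n')
# explicit widths mean a blank "cell" stays a blank cell instead of shifting everything over
df = pd.read_fwf(io.StringIO(raw), widths=[20, 20, 20], header=None,
                 names=["item", "time", "status"])
print(df)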

some kinda jackal
Feb 25, 2003

 
 
Thanks a million everyone. This gives me a lot of good reading, but I think I erred and provided a bad example that you guys still picked up on.

One thing I should have specified or provided a better example of... I put “ITEM”, “ITEM2”, etc as the standing example, but the string could actually be “THIS IS ITEM1”, “BLA BLA ITEM 2”, etc. So I imagine split will just split on whitespace. Can I tell split to explicitly take X characters instead of a delimiter?

More generalized, if I have a list which contains a number of 60 character strings, I’d like to split each 60 character string into three sub-arrays/lists of 20 characters each. Even better if I can specify the length, but that may be optimistic.

Sounds like pandas does what I need so I think that’s my path forward. Was hoping to do it without invoking any external dependencies but I’m happy I can do it at all!

Much appreciate the assist!


QuarkJets
Sep 8, 2008

If you don't want to add dependencies (I don't blame you at all, pandas is a big module) then it'd be easy to roll your own. Something like this should work, I didn't test it though:

Python code:
def parse_lines(file_path, col_width, full_width):
    """Yield tuples for each row of a file."""
    # Use smart_open if your file lives on s3 or something
    num_cols = full_width // col_width
    with open(file_path, 'r') as fi:
        for line in fi:
            yield tuple(line[i*col_width:(i+1)*col_width].strip() for i in range(num_cols))

col_width = 20
full_width = 60
col1, col2, col3 = zip(*(parse_lines(some_file, col_width, full_width)))
Explanation:
- Since you described fixed-width columns that all have the same width, let's define a function that takes advantage of that. If you need to provide varied column widths then you could do that instead, it wouldn't look much different
- Determine N number of columns from the args (in our example, this will be 3)
- Start iterating over all lines in the file
- Given a line, use a generator expression to yield a tuple from the N entries in the row, where N == number of columns. We use strip() to remove any whitespace from the edges of each entry, but our column widths are always the same for the line extraction
- Then we use zip() with the splat operator to convert those tuples of rows into tuples of columns. These are separate and distinct tuples, do whatever you want with them from here - turn them into lists, convert them to arrays, etc.

e: Ah you're not using a file, you're using popen. This should still work, you'll just use popen instead of open; I'm pretty sure a popen can be context managed like this but I'm not certain


some kinda jackal
Feb 25, 2003

 
 
Continued thanks! I think I came across something similar on one of the stack* sites after you guys gave me some inspiration, so this is probably what I'm going to try. Has a few concepts I wasn't familiar with but otherwise seems fairly straightforward.

Apropos of nothing, honestly disappointed in myself that I didn't give python more of a look in over the last 5 or so years. Was initially very put off by the indentation scoping but once I just told myself to heckin' deal with it I've managed to find a lot of convenience just banging out small python things. If I had been a little less snobby at the start I'd be so much further along now :haw:

Hed
Mar 31, 2004


Slimchandi posted:

Anyone have any experience of using Dagster they could share here? Moving to a new place in the next couple months and they need some decent ETL tooling putting in place. My current place was always promising Airflow, but never materialised, so we just stuck with a list of python files and notebooks :)

Any reason why I should be considering Airflow over something more modern like Dagster / Prefect? The functionality seems pretty great.

I don't have anything to add but would love an answer to this too. The data science thread might be a better place for it, I've asked stuff like this before but it's kind of adjacent to a lot of things.

QuarkJets
Sep 8, 2008

QuarkJets posted:

If you don't want to add dependencies (I don't blame you at all, pandas is a big module) then it'd be easy to roll your own. Something like this should work, I didn't test it though:

Python code:
def parse_lines(file_path, col_width, full_width):
    """Yield tuples for each row of a file."""
    # Use smart_open if your file lives on s3 or something
    num_cols = full_width // col_width
    with open(file_path, 'r') as fi:
        for line in fi:
            yield tuple(line[i*col_width:(i+1)*col_width].strip() for i in range(num_cols))

col_width = 20
full_width = 60
col1, col2, col3 = zip(*(parse_lines(some_file, col_width, full_width)))
Explanation:
- Since you described fixed-width columns that all have the same width, let's define a function that takes advantage of that. If you need to provide varied column widths then you could do that instead, it wouldn't look much different
- Determine N number of columns from the args (in our example, this will be 3)
- Start iterating over all lines in the file
- Given a line, use a generator expression to yield a tuple from the N entries in the row, where N == number of columns. We use strip() to remove any whitespace from the edges of each entry, but our column widths are always the same for the line extraction
- Then we use zip() with the splat operator to convert those tuples of rows into tuples of columns. These are separate and distinct tuples, do whatever you want with them from here - turn them into lists, convert them to arrays, etc.

e: Ah you're not using a file, you're using popen. This should still work, you'll just use popen instead of open; I'm pretty sure a popen can be context managed like this but I'm not certain

I just thought of a revision:

Python code:
def fixed_width_lines(col_width, full_width):
    """Return a parser function that splits a line into a tuple of fixed-width strings."""
    num_cols = full_width // col_width
    def _parse(line):
        """Split a line into columns with a predefined width."""
        return tuple(line[i*col_width:(i+1)*col_width].strip() for i in range(num_cols))
    return _parse

def parse_lines_from_file(file_path, line_parser):
    """Yield parsed results for each row of a file."""
    # Use smart_open if your file lives on s3 or something
    with open(file_path, 'r') as fi:
        for line in fi:
            yield line_parser(line)

col_width = 20
full_width = 60
line_parser = fixed_width_lines(col_width, full_width)
col1, col2, col3 = zip(*(parse_lines_from_file(some_file, line_parser)))
This uses dependency injection, which I've been trying to get into the habit of implementing more. A good development principle is to have each function you write "do one thing very well". Our old function was doing two things: iterating over lines in a file and parsing those lines into columns. Now if you need to iterate over the stdout of a subprocess you'd just need to implement "parse_stdout_from_process" and give it the same parser that we've already defined. If you need to parse lines in a different way (maybe you need to skip the first 3 rows, or you have variable-width columns) you'd just write a new parser.
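And since the original data is coming from a command rather than a file, the popen-flavored sibling might look something like this (untested sketch, command name made up, reusing the line_parser from above):

Python code:
import subprocess

def parse_lines_from_command(command, line_parser):
    """Yield parsed results for each line of a command's stdout."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    for line in result.stdout.splitlines():
        yield line_parser(line)

col1, col2, col3 = zip(*parse_lines_from_command(["some_command"], line_parser))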

C2C - 2.0
May 14, 2006

Looking for help parsing .csv data. I'm trying to create maps (either in TkinterMapView or Folium; haven't figured out which yet) that will show different types of data from the .csv input (heatmaps, location owner, etc.).

The .csv files are provided by the city of New Orleans; I'm trying to work off of GPS coordinates, and that column in the .csv is formatted like a tuple:

code:
Location
(29.951065, -90.071533)
There's also an Address column; I've been converting the physical addresses to a list of corresponding GPS coordinates using Geopy; but to avoid hitting rate-limits I've got to set a timeout for each request (there's around 2600 addresses). I'm wondering if there's a way to take the Location column & somehow split that into a dictionary (lat/long) so I can skip the Geopy conversion altogether & just feed one of the map libraries the dict instead.

Here's what I've done so far without attempting the above:

Python code:
import pandas as pd
import tkinter
from geopy.geocoders import Nominatim
from tkintermapview import TkinterMapView

# Read CSV file
df = pd.read_csv("data.csv")

# Create an empty list to store the GPS coordinates
coordinates = []

# Use geopy to convert addresses to GPS coordinates
geolocator = Nominatim(
    timeout=10,
    user_agent="PFD")
for address in df["Address"]:
    location = geolocator.geocode(address)
    if location:
        coordinates.append((location.latitude, location.longitude))

# Create a Tkinter map view.
root_tk = tkinter.Tk()
root_tk.geometry(f"{3024}x{1964}")
root_tk.title("NOLA Short-Term Rentals.py")

map_widget = TkinterMapView(
    root_tk,
    width=3000,
    height=1900,
    corner_radius=0,
)
map_widget.place(
    relx=0.5,
    rely=0.5,
    anchor=tkinter.CENTER
)
# Set current widget position and zoom.
map_widget.set_position(29.951065, -90.071533)  # New Orleans
map_widget.set_zoom(15)

# Add markers for each GPS coordinate to the map.
for coord in coordinates:
    map_widget.set_marker(
        deg_x=coord[0],
        deg_y=coord[1],
    )

# Display the map.
map_widget.mainloop()
This works but it just seems like I could shave time off the map plotting if I didn't have to go thru the Geopy conversion first.

some kinda jackal
Feb 25, 2003

 
 
Continuing to get a ton out of these suggestions, thanks. I have to go back to the docs to understand a few concepts but I think I can follow this.

e: Exception catching question figured out


QuarkJets
Sep 8, 2008

C2C - 2.0 posted:

This works but it just seems like I could shave time off the map plotting if I didn't have to go thru the Geopy conversion first.

Yeah, if you have lat/lon coordinates from the query anyway then you should definitely use them. What's holding you up?

Seventh Arrow
Jan 26, 2005

Slimchandi posted:

Anyone have any experience of using Dagster they could share here? Moving to a new place in the next couple months and they need some decent ETL tooling putting in place. My current place was always promising Airflow, but never materialised, so we just stuck with a list of python files and notebooks :)

Any reason why I should be considering Airflow over something more modern like Dagster / Prefect? The functionality seems pretty great.

Just about everyone in the data engineering space is going coo-coo over DBT, so that might be worth a look. Although ironically people are pairing it with Airflow because it can't do everything.

Azure Data Factory is also pretty slick, if your company wants a ritzy cloud solution.

some kinda jackal
Feb 25, 2003

 
 
In terms of programming best practices, if a called func is expected to return two items, but raises an exception during its processing and can't produce any meaningful/useful data to return, should I return SOMETHING?

I know there is a non-zero chance of the called func failing due to external dependencies outside of the python script. I wrap the actual func() call in main() in a try/except, so I suspect there's little value in returning anything as the whole exercise has failed and I won't be doing anything meaningful. Right now that's how my code works.

I actually don't know why I'm asking since the answer feels really obvious now that I've typed it out, just wondering if there's some style guide/best practice I'm breaking (which, who cares, ultimately) -- so more out of curiosity. Only thing I can think is that the concept of this func is that it returns two things, so I should get into the habit of returning what I said I would, regardless of whether I'm bailing out or not, or whether it'll actually be used.

Data Graham
Dec 28, 2009




Yeah, I think that's how I think of it — just because Python isn't strongly typed doesn't mean you shouldn't impose contracts on your code yourself (e.g. with type hinting). Which means your function should only do one of two things: a) return two things, or b) raise an exception. Otherwise the calling code has to concern itself with all the weird possibilities of things the function might return, which defeats the whole purpose of factoring complex logic into simple callable methods.
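To make that concrete with a made-up example, the contract with type hints reads something like:

Python code:
def read_pair(path: str) -> tuple[str, str]:
    """Either returns exactly two strings or raises -- never a partial result."""
    with open(path) as f:
        first, second = f.readline().strip(), f.readline().strip()
    if not first or not second:
        raise ValueError(f"{path} did not contain two values")
    return first, second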

C2C - 2.0
May 14, 2006


QuarkJets posted:

Yeah, if you have lat/lon coordinates from the query anyway then you should definitely use them. What's holding you up?

The lat/long in the .csv are in a single column, like so: (29.84387, -90.94854) and the column name is Location. Initially, using TkinterMapView, I thought the coordinates had to be parsed individually (a separate value for lat & long). Poking around the docs for Folium this morning, it looks like it can set markers from a list of GPS coords. Knowing that, I'm gonna' try to read the .csv with Pandas & create a list just from that field and send it to Folium.

Sorry if this is beginner-level stuff; I'm new to programming/Python having just started learning in earnest at the beginning of the year.

QuarkJets
Sep 8, 2008

C2C - 2.0 posted:

The lat/long in the .csv are in a single column, like so: (29.84387, -90.94854) and the column name is Location. Initially, using TkinterMapView, I thought the coordinates had to be parsed individually (a separate value for lat & long). Poking around the docs for Folium this morning, it looks like it can set markers from a list of GPS coords. Knowing that, I'm gonna' try to read the .csv with Pandas & create a list just from that field and send it to Folium.

Sorry if this is beginner-level stuff; I'm new to programming/Python having just started learning in earnest at the beginning of the year.

No problem, you can use strip() to remove whitespace and the parens from the beginning and end, then split() the result into 2 variables (lat and lon).

Python code:

for row in location_column:
    lat, lon = row.strip().strip('(').strip(')').split(',')
    lat = float(lat)
    lon = float(lon)
    # do something with them 

C2C - 2.0
May 14, 2006


QuarkJets posted:

No problem, you can use strip() to remove whitespace and the parens from the beginning and end, then split() the result into 2 variables (lat and lon).

Python code:
for row in location_column:
    lat, lon = row.strip().strip('(').strip(')').split(',')
    lat = float(lat)
    lon = float(lon)
    # do something with them 

So I reworked the script a bit; dropped Pandas and just used the csv module:

Python code:
import csv
import folium

with open('data.csv', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)

    count = 0
    str_locs = []

    for row in reader:
        count = count + 1
        str_locs.append(row['Location'])
        lat, lon = str_locs(row.strip().strip('(').strip(')').split(','))
        lat = float(lat)
        lon = float(lon)

folium_map = folium.Map()

south_west_corner = min(str_locs)
north_east_corner = min(str_locs)

for lat, lon in str_locs:
    folium.Marker(location=[lat[0], lon[1]]).add_to(folium_map)
    folium_map.fit_bounds(south_west_corner, north_east_corner)

folium_map.save("str-current.html")
Using a single for loop to iterate thru the .csv file, Pycharm outputs:
code:
 lat, lon = str_locs(row.strip().strip('(').strip(')').split(','))
                        ^^^^^^^^^
AttributeError: 'dict' object has no attribute 'strip'
If I split it into 2 for loops like so:
Python code:
import csv
import folium

with open('data.csv', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)

    count = 0
    str_locs = []

    for row in reader:  # Loop 1
        count = count + 1
        str_locs.append(row['Location'])

    for row in reader:  # Loop 2
        lat, lon = str_locs(row.strip().strip('(').strip(')').split(','))
        lat = float(lat)
        lon = float(lon)

folium_map = folium.Map()

south_west_corner = min(str_locs)
north_east_corner = min(str_locs)

for lat, lon in str_locs:
    folium.Marker(location=[lat[0], lon[1]]).add_to(folium_map)
    folium_map.fit_bounds(south_west_corner, north_east_corner)

folium_map.save("str-current.html")
, then Pycharm returns:
code:
    for lat, lon in str_locs:
        ^^^^^^^^
ValueError: too many values to unpack (expected 2)
I need to do some more reading, I suppose. It's possible that I've misunderstood how to pass values to Folium when the latitude/longitude are for multiple locations.

marshalljim
Mar 6, 2013

It's sort of a detour, but when you have chained strip() calls like that, you can generally condense them into one call with the characters you want stripped passed in as the argument.

So string.strip(" ()") should remove spaces and parentheses.

Edit: It also looks like your "north_east_corner = min(str_locs)" should be using max()


Macichne Leainig
Jul 26, 2012


C2C - 2.0 posted:

, then Pycharm returns:
code:
    for lat, lon in str_locs:
        ^^^^^^^^
ValueError: too many values to unpack (expected 2)
I need to do some more reading, I suppose. It's possible that I've misunderstood how to pass values to Folium when the latitude/longitude are for multiple locations.

I think the problem here is you want to unpack each element of the list as a tuple and you're unpacking the list itself as if it were a tuple instead. I don't remember if you can just surround it with parentheses to make it clearer that you're unpacking each object, or if you just want to do something like

code:
    for loc in str_locs:
        lat, long = loc

Seventh Arrow
Jan 26, 2005

Oh hey, are we talking about pandas?!? :buddy:

I'm not even sure what I'm looking for can be done and it's difficult to google/ChatGPT for.

I have one table of metadata that provides the allowable length of a given field. For example, the 'Account ID' field can only be 18 characters long; the 'Account Name' field can only be 255 characters long, etc.



This has to be compared against the client's data in a separate table:



Now that I think about it, python wouldn't need to iterate the 'label' column. It does, however, need to iterate over Account ID and compare each value to row 1 of 'length', then iterate over 'Account Name' and compare each value to row 4 of 'length', and so on. And of course do some sort of 'if len('Account ID') > length' type operation.

If I had to hand code everything one by one, it might look something like:

code:
account_id_length = df2.at[0, 'length']
for a in df['Account ID']:
    if len(a) > account_id_length:
        print(f'Account ID {a} exceeds maximum length. Prepare to be vaporized')
account_name_length = df2.at[3, 'length']
for b in df['Account Name']:
    if len(b) > account_name_length:
        print(f'Account Name {b} exceeds maximum length. Prepare to be vaporized')
And so on. I'm getting exhausted just looking at it.

Wait, maybe I could have a function do a bunch of the legwork:

code:
def check_length_not_that_size_matters_or_anything_haha(column_name, length):
    for i in df[column_name]:
        if len(i) > length:
            print(f'{i} exceeds maximum length. Prepare to be vaporized.')
Something like that.

But then I would need something that keeps feeding it the necessary values, like:

code:
check_length_not_that_size_matters_or_anything_haha(df['Account ID'], df2.at[0, 'length'])
check_length_not_that_size_matters_or_anything_haha(df['Account Name'], df2.at[3, 'length'])
check_length_not_that_size_matters_or_anything_haha(df['Account Type'], df2.at[4, 'length'])

Is there a way to iterate that as well, maybe?


12 rats tied together
Sep 7, 2006

Data Graham posted:

Yeah, I think that's how I think of it — just because Python isn't strongly typed doesn't mean you shouldn't impose contracts on your code yourself (e.g. with type hinting). Which means your function should only do one of two things: a) return two things, or b) raise an exception. Otherwise the calling code has to concern itself with all the weird possibilities of things the function might return, which defeats the whole purpose of factoring complex logic into simple callable methods.

Python is strongly typed but it's also dynamically typed - a label can point to a thing that is a different type than what it pointed to previously. Type systems have a lot of dimensions and they can get pretty confusing, for example, python is also "gradually typed" these days, to some degree it is "structurally typed" due to Protocols and also "duck typed".

But yes, you should absolutely use type hints when they matter or increase the readability of a particular section of code. Good, relevant example here is tuple unpacking:
Python code:
    lat: str
    lon: str
    for loc in str_locs:
        lat, lon = loc

Biffmotron
Jan 12, 2007

Seventh Arrow posted:

Oh hey, are we talking about pandas?!? :buddy:

I'm not even sure what I'm looking for can be done and it's difficult to google/ChatGPT for.

I have one table of metadata that provides the allowable length of a given field. For example, the 'Account ID' field can only be 18 characters long; the 'Account Name' field can only be 255 characters long, etc.



This has to be compared against the client's data in a separate table:



Now that I think about it, python wouldn't need to iterate the 'label' column. It does, however, need to iterate over Account ID and compare each value to row 1 of 'length', then iterate over 'Account Name' and compare each value to row 4 of 'length', and so on. And of course do some sort of 'if len('Account ID') > length' type operation.

Is there a way to iterate that as well, maybe?

You can do some clever things with masks to return only the rows that exceed your limits, and also tell you which columns in every row has a problem.

Python code:
import pandas as pd
import random
import string

# dict of column names and max character length
col_len_map = {'a':5,
            'b':6,
            'c':7}

# function to generate some random strings
def gen_string(low=4, high=8):
    i = random.randint(low, high+1)
    output = []
    for _ in range(i):
        output.append(random.choice(string.ascii_lowercase))
    return ''.join(output)
    
# make a DataFrame
# this only works if the DataFrame is all strings
df = pd.DataFrame({'a':[gen_string() for _ in range(3)],
                  'b':[gen_string() for _ in range(3)],
                  'c':[gen_string() for _ in range(3)]}
                 )
# we do some iffy things with global scope and passing col_len_map and col to this function
# but it makes the apply easier
def check_length_not_that_size_matters_or_anything_haha(x):
    if len(x) > col_len_map[col]:
        return True
    else:
        return False

# initialize an empty mask dataframe
mask = pd.DataFrame()
for col in df.columns:
    # m is a series of boolean values, True where len exceeds the max
    m = df[col].apply(lambda x: check_length_not_that_size_matters_or_anything_haha(x))
    #and concat to build the mask for all columns
    mask = pd.concat([mask, m], axis=1)

# we run through every index in the DataFrame and get which columns are bad as a string
# this is slower, but hopefully these datasets aren't huge
bad_cols = []
for _ in range(df.shape[0]):
    bc = ', '.join(df.columns[mask.loc[_]])
    bad_cols.append(bc)

# add to DataFrame
df['bad_cols'] = bad_cols

# and then we sum the mask across each row, convert to bool, and this tells us which rows and columns exceed our limits
df[mask.sum(axis=1).astype(bool)]

QuarkJets
Sep 8, 2008

C2C - 2.0 posted:

So I reworked the script a bit; dropped Pandas and just used the csv module:

Python code:
import csv
import folium

with open('data.csv', encoding='utf-8') as csvfile:
    reader = csv.DictReader(csvfile)

    count = 0
    str_locs = []

    for row in reader:
        count = count + 1
        str_locs.append(row['Location'])
        lat, lon = str_locs(row.strip().strip('(').strip(')').split(','))
        lat = float(lat)
        lon = float(lon)

folium_map = folium.Map()

south_west_corner = min(str_locs)
north_east_corner = min(str_locs)

for lat, lon in str_locs:
    folium.Marker(location=[lat[0], lon[1]]).add_to(folium_map)
    folium_map.fit_bounds(south_west_corner, north_east_corner)

folium_map.save("str-current.html")
Using a single for loop to iterate thru the .csv file, Pycharm outputs:
code:
 lat, lon = str_locs(row.strip().strip('(').strip(')').split(','))
                        ^^^^^^^^^
AttributeError: 'dict' object has no attribute 'strip'

Since you are using PyCharm, try running your code in debug mode so that you can inspect your objects when an error occurs.

This error message is giving an important clue: it is saying that "row" is a dict, not a string. This is because csv.DictReader returns a dict for each row of data. The keys in the dict are the column names.

Python code:

# Access the column labeled "Location" 
location = row["Location"]
# Strip away characters we do not want 
location = location.strip(' ()')
# store the remaining string in a list
locations.append(location) 

Then later you can iterate over these strings and do something with them

Python code:

for location in locations:
    lat, lon = location.split(",")
    print(lat)
    print(lon) 

Seventh Arrow
Jan 26, 2005

Thank you for your efforts Biffmotron! It will take my brain a bit of time to digest it, but I think I get the general idea. The use of a lambda function is very cool.


Biffmotron
Jan 12, 2007

You're welcome. I spend a lot of time doing stuff in Pandas, which is kind of finicky and arcane in a lot of ways, but has become standardized. It could be worse, could be matplotlib.

Pandas does some pretty heavy lifting under the hood, but it's basically all numpy arrays.

This code relies on a couple of intermediate Pandas tricks, which can be understood by inspecting the intermediary objects m and bc. The first one is masking. If you send an index mask, a list of bools the same length as the array/Series/DataFrame, to a DataFrame via df[mask], you get back the rows in the DataFrame where the mask is True.

The second trick is using axis=1 in several places. By default, Pandas does all its operations along the columns, which is axis=0. axis=1 does an operation along rows, which combined with mask.sum(axis=1).astype(bool) makes an index mask.
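A tiny self-contained illustration of both tricks, with a made-up frame:

Python code:
import pandas as pd

tiny = pd.DataFrame({'a': ['ok', 'way too long'], 'b': ['fine', 'ok']})
tiny_mask = tiny.applymap(len) > 4                # elementwise: True where a cell is too long
print(tiny[tiny_mask.sum(axis=1).astype(bool)])   # keeps only rows with at least one bad cell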

df.apply() is a very useful pattern. Generally, iterating over DataFrame rows is slow. Apply invokes numpy vectorization which can be orders of magnitude faster. df['new_col'] = df['column'].apply(lambda x: func(x)) is super useful for simple data transforms.

More complex data transforms can be done using apply and axis. This function takes in a row and does some arbitrary math, but it's easy to imagine some convoluted business logic that has to be applied to a row in a table to get a result.

Python code:
df = pd.DataFrame({'a':[random.randint(1,20) for _ in range(3)],
                  'b':[random.randint(1,20) for _ in range(3)],
                  'c':[random.randint(1,20) for _ in range(3)]}
                 )

def row_func(row):
    a = row['a']
    b = row['b']
    c = row['c']
    
    return a**b-c

df['math'] = df.apply(row_func, axis=1)
df
