Python

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python


necrotic: Aug 2, 2005; I owe my brother big time for this!

Do any scripting languages allow using main as an entry point, without explicitly calling it?

# ? Dec 18, 2018 21:31

Dominoes: Sep 20, 2007

Nippashish posted:

You could also just commit to not import your scripts as modules and then you don't need any double underscore shenanigans.

I like this approach.

# ? Dec 18, 2018 22:27

Cock Democracy: Jan 1, 2003; Now that is the finest piece of chilean sea bass I have ever smelled

I'm looking for some advice on how to refactor a giant ETL/data pipeline script. Here's a summary of what it does currently:

- Download files from various partners (some use FTP, others are HTTP)
- Load the downloaded files into MySQL temp tables
- De-dupe records (some records appear in the feeds of multiple partners)
- Take the combined records, create a CSV and upload that to an API to get some additional data, then parse the results
- Prepare the final output as MySQL tables + Elasticsearch index used by the Django app
- Verify that the new record counts are reasonable, then put the final output on production
- Save logs to filesystem and a table

The requirements are nothing too complicated, but the code now is a bit of a mess. It's all procedural, has tons of exceptions and hacks for certain partners, and is generally just not very extendable. When it fails, our devs have a hell of a time debugging it. It's sort of fallen under my ownership and I'm not happy about its current state.

Are there any libraries or PEPs to help guide me in rewriting this beast? My thought is I'd like a setup like Django migrations, where each "step" is its own file, containing a class that follows a standard format. Then there could be a file that lists all the classes and what order they run in. Hell, some of the steps could run concurrently, so it would be awesome to support that.
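
The "one class per step, plus a file listing the order" idea can be sketched in plain Python before reaching for a framework. This is only a rough sketch: Step, Download, Dedupe, run_pipeline and the partner names are all made-up illustrations, and the download step returns canned records instead of doing real FTP/HTTP.

```python
from concurrent.futures import ThreadPoolExecutor


class Step:
    """Base class: every pipeline step implements run(context)."""

    def run(self, context):
        raise NotImplementedError


class Download(Step):
    def __init__(self, partner):
        self.partner = partner

    def run(self, context):
        # Placeholder for the real FTP/HTTP fetch (hypothetical data).
        return [f"{self.partner}-record-1", "shared-record"]


class Dedupe(Step):
    def run(self, context):
        # context["download"] holds one list of records per download step.
        merged = {record for records in context["download"] for record in records}
        return sorted(merged)


def run_pipeline(stages):
    """Run stages in order; steps inside one stage run concurrently."""
    context = {}
    with ThreadPoolExecutor() as pool:
        for name, steps in stages:
            futures = [pool.submit(step.run, context) for step in steps]
            # result() re-raises any exception from the step right here,
            # so a failure points at the stage that caused it.
            context[name] = [future.result() for future in futures]
    return context


# The "file that lists all the classes and what order they run in".
STAGES = [
    ("download", [Download("partner_a"), Download("partner_b")]),
    ("dedupe", [Dedupe()]),
]

result = run_pipeline(STAGES)
print(result["dedupe"])
```

Partner-specific hacks would then live in their own Download subclasses rather than in branches scattered through one script.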

# ? Dec 19, 2018 13:00

cinci zoo sniper: Mar 15, 2013

Cock Democracy posted:

I'm looking for some advice on how to refactor a giant ETL/data pipeline script. Here's a summary of what it does currently:

- Download files from various partners (some use FTP, others are HTTP)
- Load the downloaded files into MySQL temp tables
- De-dupe records (some records appear in the feeds of multiple partners)
- Take the combined records, create a CSV and upload that to an API to get some additional data, then parse the results
- Prepare the final output as MySQL tables + Elasticsearch index used by the Django app
- Verify that the new record counts are reasonable, then put the final output on production
- Save logs to filesystem and a table

The requirements are nothing too complicated, but the code now is a bit of a mess. It's all procedural, has tons of exceptions and hacks for certain partners, and is generally just not very extendable. When it fails, our devs have a hell of a time debugging it. It's sort of fallen under my ownership and I'm not happy about its current state.

Are there any libraries or PEPs to help guide me in rewriting this beast? My thought is I'd like a setup like Django migrations, where each "step" is its own file, containing a class that follows a standard format. Then there could be a file that lists all the classes and what order they run in. Hell, some of the steps could run concurrently, so it would be awesome to support that.

Luigi or Airflow for libraries. Generally speaking, you want to separate each letter in ETL structurally, and then just use something to orchestrate correct, step-wise execution.

# ? Dec 19, 2018 13:19

Malcolm XML: Aug 8, 2009; I always knew it would end like ｔｈｉｓ．

Airflow is the worst out there except for everything else

Luigi doesn't do scheduling, just sequencing, IIRC.

Godspeed goon.

# ? Dec 19, 2018 18:53

cinci zoo sniper: Mar 15, 2013

https://lwn.net/SubscriberLink/775105/5db16cfe82e78dc3/

# ? Dec 20, 2018 17:55

Thermopyle: Jul 1, 2003; ...the stupid are cocksure while the intelligent are full of doubt. - Bertrand Russell

Dominoes posted:

Does anyone else dislike the if __name__ == "__main__" syntax? I still have to look it up every time.

There's tons of stuff like this in every language. I just get used to all of these weird things in every language.

FWIW, I just type "main" and press TAB and Pycharm types it all for me.
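
For reference, the guard in question looks roughly like this. A minimal sketch: the main function, its argv parameter and the printed message are illustrative, not anything Python requires.

```python
import sys


def main(argv=None):
    """Illustrative entry point; returns an exit code."""
    args = sys.argv[1:] if argv is None else list(argv)
    print(f"running with {len(args)} argument(s)")
    return 0


# __name__ is "__main__" only when this file is executed directly;
# when the file is imported as a module, it is the module's name instead.
if __name__ == "__main__":
    main()
```

Python has no implicit main entry point, so without the guard (or a bare main() call at the bottom) nothing would run.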

# ? Dec 20, 2018 20:00

Dominoes: Sep 20, 2007

Thermopyle posted:

I just type "main" and press TAB and Pycharm types it all for me.

Neat!

# ? Dec 20, 2018 20:07

keyframe: Sep 15, 2007; I have seen things

Guys I was wondering what is a good way to see how long it takes a python script to run and how much memory/resources it is using?

I just discovered timeit, but all the examples I have seen for it seem to run on single-line statements, with some weird syntax where everything is enclosed in a quoted string.

# ? Dec 24, 2018 00:47

cinci zoo sniper: Mar 15, 2013

keyframe posted:

Guys I was wondering what is a good way to see how long it takes a python script to run and how much memory/resources it is using?

I just discovered timeit, but all the examples I have seen for it seem to run on single-line statements, with some weird syntax where everything is enclosed in a quoted string.

Official cProfile is a good start.

E: For memory you can try package called memory-profiler.
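
A rough sketch of what that can look like using only the standard library. build_squares is a made-up stand-in workload, and tracemalloc is swapped in here for memory-profiler simply because it ships with Python; it only tracks memory allocated by Python objects.

```python
import cProfile
import io
import pstats
import timeit
import tracemalloc


def build_squares(n=10_000):
    """Stand-in workload; replace with the function you want to measure."""
    return [i * i for i in range(n)]


# timeit also accepts a callable, so the code does not have to be a string.
seconds = timeit.timeit(build_squares, number=20)

# cProfile: per-function call counts and timings.
profiler = cProfile.Profile()
profiler.enable()
build_squares()
profiler.disable()
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(5)

# tracemalloc: current and peak memory allocated while tracing.
tracemalloc.start()
squares = build_squares()
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"20 runs took {seconds:.4f}s, peak memory {peak_bytes / 1024:.0f} KiB")
print(report.getvalue())
```

To profile a whole script without editing it, python -m cProfile -s cumulative your_script.py does the same job from the command line.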

cinci zoo sniper fucked around with this message at 01:11 on Dec 24, 2018

# ? Dec 24, 2018 01:08

keyframe: Sep 15, 2007; I have seen things

cinci zoo sniper posted:

Official cProfile is a good start.

E: For memory you can try package called memory-profiler.

Thanks man!

cProfile was exactly what I was looking for.

# ? Dec 24, 2018 02:58

KICK BAMA KICK: Mar 2, 2009

Just to make sure, cause this behavior could actually be useful to shorten some conditionals into one-liners if it wasn't bananas: everyone would consider it extremely weird and unPythonic to access an item of a two-item sequence using a Boolean value silently evaluated to 0 or 1 as the index, i.e. 'ab'[False] == 'a' or 'ab'[True] == 'b' right?

# ? Dec 26, 2018 08:01

QuarkJets: Sep 8, 2008

KICK BAMA KICK posted:

Just to make sure, cause this behavior could actually be useful to shorten some conditionals into one-liners if it wasn't bananas: everyone would consider it extremely weird and unPythonic to access an item of a two-item sequence using a Boolean value silently evaluated to 0 or 1 as the index, i.e. 'ab'[False] == 'a' or 'ab'[True] == 'b' right?

True and False evaluating to 1 and 0 isn't bananas, that's been a common feature for languages going back decades. It's fine to use that behavior, booleans are just a special type of integer, the bool class even inherits from int!

However, don't do what you explicitly wrote in your post; instead of writing 'ab'[True] == 'b' you should just write 'ab'[1] == 'b'. I can see no benefit to writing out the boolean value instead of its integer equivalent
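
A quick check of the claims above, as a small sketch (the flags list is just an illustration):

```python
# bool really is a subclass of int, and its two values compare equal to 1 and 0.
assert issubclass(bool, int)
assert True == 1 and False == 0

# So a bool is accepted anywhere an integer index is.
assert "ab"[False] == "a" and "ab"[True] == "b"

# A use of this that most people do consider fine:
# summing bools counts how many of them are True.
flags = [True, False, True, True]
true_count = sum(flags)
print(true_count)
```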

# ? Dec 26, 2018 08:24

KICK BAMA KICK: Mar 2, 2009

Yeah that was just an illustration, the case I'm thinking of is more like

code:

if b:
    x = foo
else:
    x = bar

being condensed into

code:

x = (bar, foo)[b]

That doesn't raise any eyebrows if b is clearly a bool?

# ? Dec 26, 2018 08:33

breaks: May 12, 2001

I think that

code:

x = foo if b else bar

is the less head exploding way to write that in one line.

In general relying on False == 0 and True == 1 is bad.
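
Readability aside, the two one-liners also behave differently when the branches are expensive or have side effects. A small sketch, with foo and bar as throwaway functions that record when they are called:

```python
calls = []


def foo():
    calls.append("foo")
    return "foo result"


def bar():
    calls.append("bar")
    return "bar result"


b = True

# Tuple indexing builds the whole tuple first, so both functions run.
x_tuple = (bar(), foo())[b]
calls_tuple = list(calls)

calls.clear()

# The conditional expression only evaluates the branch it selects.
x_ternary = foo() if b else bar()
calls_ternary = list(calls)

print(calls_tuple, calls_ternary)
```

Both forms produce the same value here, but only the conditional expression skips the unused branch.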

# ? Dec 26, 2018 09:12

QuarkJets: Sep 8, 2008

KICK BAMA KICK posted:

Yeah that was just an illustration, the case I'm thinking of is more like
code:
if b:
    x = foo
else:
    x = bar
being condensed into
code:
x = (bar, foo)[b]
That doesn't raise any eyebrows if b is clearly a bool?

What breaks wrote is the pythonic way to do this. Creating a tuple just to immediately index into it... that's no good

# ? Dec 26, 2018 10:32

KICK BAMA KICK: Mar 2, 2009

breaks posted:

QuarkJets posted:

What breaks wrote is the pythonic way to do this. Creating a tuple just to immediately index into it... that's no good

D'oh, thanks! No idea why that escaped me (I really like using that ternary construction!) or what it was that got me hung up on the idea of using a bool as an index in the first place. (As an aside, it turns out that Python 3 guarantees True == 1 and False == 0 because True and False are keywords; in Python 2 they were ordinary names that could be reassigned.)

# ? Dec 26, 2018 18:22

NtotheTC: Dec 31, 2007

I've seen that pattern used once in the wild, where it was something like

Python code:

string_from_bool = ("falsily", "verily",)

>>> print(string_from_bool[False])
falsily

>>> print(string_from_bool[True])
verily

# ? Dec 26, 2018 21:33

cinci zoo sniper: Mar 15, 2013

NtotheTC posted:

I've seen that pattern used once in the wild, where it was something like
Python code:
string_from_bool = ("falsily", "verily",)

>>> print(string_from_bool[False])
falsily

>>> print(string_from_bool[True])
verily

loving hell, why?!

# ? Dec 26, 2018 21:37

NtotheTC: Dec 31, 2007

cinci zoo sniper posted:

loving hell, why?!

Well that example is slightly simplified, the two strings were actually entire sentences, and were stored in a config file and often accessed, so it was neater than writing

Python code:

my_string_response = STORED_TRUTHY_STRING if some_flag else STORED_FALSEY_STRING

or similar, i dunno it was several jobs ago but I never found it particularly egregious

NtotheTC fucked around with this message at 21:47 on Dec 26, 2018

# ? Dec 26, 2018 21:43

QuarkJets: Sep 8, 2008

NtotheTC posted:

Well that example is slightly simplified, the two strings were actually entire sentences, and were stored in a config file and often accessed, so it was neater than writing
Python code:
my_string_response = STORED_TRUTHY_STRING if some_flag else STORED_FALSEY_STRING
or similar, i dunno it was several jobs ago but I never found it particularly egregious

I don't know man, it seems like the thing you just wrote is way neater than creating a tuple and then indexing into it with True/False

# ? Dec 26, 2018 22:05

Nippashish: Nov 2, 2005; Let me see you dance!

It should be a dictionary.
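
Something along these lines, presumably. A sketch only: STRING_FROM_BOOL and describe are made-up names, and the two strings are borrowed from the example above.

```python
STRING_FROM_BOOL = {False: "falsily", True: "verily"}


def describe(flag):
    # bool() normalizes any truthy/falsy value to one of the two keys.
    return STRING_FROM_BOOL[bool(flag)]


print(describe(True), describe(0))
```

The lookup reads as "map this flag to its string" rather than as an index trick, and the mapping makes the False/True order explicit.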

# ? Dec 26, 2018 22:07

cinci zoo sniper: Mar 15, 2013

Nippashish posted:

It should be a dictionary.

# ? Dec 26, 2018 23:44

NtotheTC: Dec 31, 2007

I'm not trying to nominate it for code of the year, just pointing out a real-world example of it

# ? Dec 26, 2018 23:47

SurgicalOntologist: Jun 17, 2004

The only case where I would find that type of indexing acceptable is in a vectorized environment where you wanted to do something equivalent to:

Python code:

[a if condition else b for condition in conditions]

but vectorized with numpy it could be

Python code:

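np.array([b, a])[conditions.astype(int)]

(Note the order: False maps to index 0, so b has to come first.) I would only accept it with the astype(int) to make it clear that this isn't typical boolean indexing. The cast is also required: without it, numpy treats conditions as a boolean mask instead of as integer indices.

And I would still probably try to find another solution. But the numpy version would be faster, I think.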

Edit: duh, there is a better solution

Python code:

np.where(conditions, a, b)

I was overthinking it.
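
A small check that the three spellings agree, assuming numpy is installed. The values of a, b and conditions are made up for the example. Note the [b, a] order in the indexing version: False becomes index 0, so b has to sit first.

```python
import numpy as np

conditions = np.array([True, False, True])
a, b = "yes", "no"

# Plain-Python reference result.
expected = [a if condition else b for condition in conditions]

# Integer indexing: False -> 0 selects b, True -> 1 selects a.
by_index = np.array([b, a])[conditions.astype(int)]

# np.where picks from a where the condition holds, otherwise from b.
by_where = np.where(conditions, a, b)

print(expected, by_index.tolist(), by_where.tolist())
```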

# ? Dec 27, 2018 00:30

Ghost of Reagan Past: Oct 7, 2003; rock and roll fun

NtotheTC posted:

I'm not trying to nominate it for code of the year, just pointing out a real-world example of it

We've all done code crimes with Python but this is a real mess.

EDIT: I think my greatest Python code crime was erasing exceptions deep in some library and baffling myself and others about how the gently caress it was breaking. It made sense at the time but holy gently caress it's a mess.
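
For anyone who hasn't been bitten by this yet, a sketch of the difference. parse_port_silently and parse_port are invented names for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port_silently(value):
    """The code crime: the error vanishes and the caller just gets None."""
    try:
        return int(value)
    except Exception:
        return None


def parse_port(value):
    """Log with context, then re-raise so the failure stays visible."""
    try:
        return int(value)
    except ValueError:
        logger.exception("could not parse port from %r", value)
        raise
```

With the first version, the None surfaces much later and far away from the real cause; with the second, the traceback points at the bad value.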

# ? Dec 29, 2018 04:35

keyframe: Sep 15, 2007; I have seen things

Hi guys,

I am trying to scrape craigslist for apartment prices. I am having a problem where when I scrape the prices in $xxxx I get a lot of duplicates which poo poo on my plan on making a urllink/price dictionary with zip(). Any idea why the duplicates are happening? I am using requests/beautiful soup with html.parser.

craigslist link if anyone wants to try themselves:

https://vancouver.craigslist.org/d/apts-housing-for-rent/search/apa

# ? Dec 31, 2018 05:15

KICK BAMA KICK: Mar 2, 2009

keyframe posted:

Hi guys,

I am trying to scrape craigslist for apartment prices. I am having a problem where when I scrape the prices in $xxxx I get a lot of duplicates which poo poo on my plan on making a urllink/price dictionary with zip(). Any idea why the duplicates are happening? I am using requests/beautiful soup with html.parser.

craigslist link if anyone wants to try themselves:

https://vancouver.craigslist.org/d/apts-housing-for-rent/search/apa

At a glance, it looks like each listing's "result-price" is repeated under a  tag. Grab your data only once per entry from either that or its parent.

# ? Dec 31, 2018 05:39

keyframe: Sep 15, 2007; I have seen things

KICK BAMA KICK posted:

At a glance, it looks like each listing's "result-price" is repeated under a  tag. Grab your data only once per entry from either that or its parent.

Ahh I see what you mean. I have no idea how to do that though, any pointers?

# ? Dec 31, 2018 06:19

KICK BAMA KICK: Mar 2, 2009

keyframe posted:

Ahh I see what you mean. I have no idea how to do that though, any pointers?

Don't remember the BeautifulSoup syntax off the top of my head (and I only glanced at the page source the one time) so this is all approximate, but instead of searching for all class="result-price" tags on the page you'll want to search for only the class="result-price" tags that are children of class="result-meta" tags. I put it that way because I'm pretty sure you can find the tags matching both those conditions in one search and then iterate over those to process them the same way you were doing before. If you can't do that, or if that doesn't make sense, search for all the class="result-meta" tags, iterate over those and find the one class="result-price" tag underneath each. Poke around in the BeautifulSoup docs a little bit, this should be pretty easy.

I guess the absolute simplest way if you want to immediately solve the problem that might get the correct result but is not a great way to do it would be to step through the list of results you've retrieved with whatever line searches for those "result-price" tags by twos with a [::2] slice.

# ? Dec 31, 2018 06:53

keyframe: Sep 15, 2007; I have seen things

KICK BAMA KICK posted:

Don't remember the BeautifulSoup syntax off the top of my head (and I only glanced at the page source the one time) so this is all approximate, but instead of searching for all class="result-price" tags on the page you'll want to search for only the class="result-price" tags that are children of class="result-meta" tags. I put it that way because I'm pretty sure you can find the tags matching both those conditions in one search and then iterate over those to process them the same way you were doing before. If you can't do that, or if that doesn't make sense, search for all the class="result-meta" tags, iterate over those and find the one class="result-price" tag underneath each. Poke around in the BeautifulSoup docs a little bit, this should be pretty easy.

I guess the absolute simplest way if you want to immediately solve the problem that might get the correct result but is not a great way to do it would be to step through the list of results you've retrieved with whatever line searches for those "result-price" tags by twos with a [::2] slice.

Thanks. I got some ways to go before I wrap my head around beautiful soup but somehow solved it with this while messing around trying different things.

code:

price = craigs.find_all('span', {'class': 'result-meta'})

for i in price:
    print(i.find('span').text)

This gives me the span class=result-price that is directly under result-meta and not the other result-price, so I don't get duplicates. I don't know why it's not giving all the other span tags below the price span tag though. It is the result I am after but I don't fully understand how I got it. EDIT: oh, I guess it is because I am using find and not find_all, so it just returns the first span it finds under result-meta. Finally getting it a bit.

Here is the snapshot of the html btw:

Thanks again for the help.

keyframe fucked around with this message at 07:41 on Dec 31, 2018

# ? Dec 31, 2018 07:34

KICK BAMA KICK: Mar 2, 2009

Awesome! Exactly, that works for now because the price happens to be in the first span tag under result-meta; to make it more robust you will want to specify the class="result-price" in the find, however that works.

# ? Dec 31, 2018 07:54

breaks: May 12, 2001

Just use the soup's select method with a ".result-meta>.result-price" selector and call it a day. IMO always pick elements using CSS selectors up until they don't support what you need, that's what they were made for.

(On that page, there are 221 matches for .result-price but 120 for .result-meta>.result-price. There are 120 items per page, and at first glance the lower-than-expected number of .result-price results seems to be due to posts lacking pics only having the price displayed once, or something like that.)

breaks fucked around with this message at 08:13 on Dec 31, 2018

# ? Dec 31, 2018 07:59

keyframe: Sep 15, 2007; I have seen things

breaks posted:

Just use the soup's select method with a ".result-meta>.result-price" selector and call it a day. IMO always pick elements using CSS selectors up until they don't support what you need, that's what they were made for.

(On that page, there are 221 matches for .result-price but 120 for .result-meta>.result-price. There are 120 items per page, and at first glance the lower-than-expected number of .result-price results seems to be due to posts lacking pics only having the price displayed once, or something like that.)

Thanks for the tip! Just read up on that in the docs and you are right.

code:

css = craigs.select('p > span.result-meta > span.result-price')

for i in css:
    print(i.text)

This gives the same result in much fewer lines without needing to do nested for loops.
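
For the original goal of a link/price dictionary, a sketch that walks one listing at a time instead of zipping two separately scraped lists, so a listing with an extra or missing price can't shift everything out of alignment. The HTML below is a made-up, simplified stand-in for the real page: the result-row, result-title and result-image classes, the hrefs and the prices are assumptions for illustration. It needs the beautifulsoup4 package.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the first listing repeats its price
# inside the image link, the second (no photo) shows it only once.
HTML = """
<ul>
  <li class="result-row">
    <a class="result-image" href="/apa/1.html"><span class="result-price">$1500</span></a>
    <p class="result-info">
      <a class="result-title" href="/apa/1.html">Bright 1BR</a>
      <span class="result-meta"><span class="result-price">$1500</span><span class="housing">1br</span></span>
    </p>
  </li>
  <li class="result-row">
    <p class="result-info">
      <a class="result-title" href="/apa/2.html">Studio, no photo</a>
      <span class="result-meta"><span class="result-price">$1100</span></span>
    </p>
  </li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

all_prices = soup.select(".result-price")                  # includes the duplicate
meta_prices = soup.select(".result-meta > .result-price")  # one per listing

# Build the dictionary per listing, so each link is paired with its own price.
prices_by_link = {}
for row in soup.select(".result-row"):
    link = row.select_one("a.result-title")
    price = row.select_one(".result-meta > .result-price")
    if link is not None and price is not None:
        prices_by_link[link["href"]] = price.text

print(len(all_prices), len(meta_prices), prices_by_link)
```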

# ? Dec 31, 2018 08:11

Hollow Talk: Feb 2, 2014

Malcolm XML posted:

Airflow is the worst out there except for everything else

Luigi doesn't do scheduling, just sequencing, IIRC.

Godspeed goon.

I'm using Airflow in a customer project, and by God, everything about it is slow and I hate "packaging" the DAGs.

I hands down prefer Luigi+Cron. Unfortunately, Luigi's web interface is functionally useless, so if you need any kind of actual information from your jobs, Airflow it is. :saddowns:

# ? Dec 31, 2018 13:52

Thermopyle: Jul 1, 2003; ...the stupid are cocksure while the intelligent are full of doubt. - Bertrand Russell

Humble Bundle has a bunch of python books.

# ? Dec 31, 2018 20:18

QuarkJets: Sep 8, 2008

Looks like a Qt5 book and a Pandas book in there, those sound useful

but then there's "Learn Python by Building a Blockchain and Cryptocurrency" which is just :barf:

# ? Jan 1, 2019 07:09

NinpoEspiritoSanto: Oct 22, 2013

Thermopyle posted:

Humble Bundle has a bunch of python books.

Packt currently have an everything for five bucks deal going (Packt usually provide the Humble Python books) if you want to cherry pick. Also hi, thread newcomer

Some resources I often find useful that I didn't notice in the OP:

Writing Idiomatic Python:
https://jeffknupp.com/writing-idiomatic-python-ebook/

This guy is great, if you're poor just write to him and he'll give you a copy of the book for free.

Nedbat's Big O presentation:
https://nedbatchelder.com/text/bigo.html

Nedbat's talk on Pragmatic Unicode:
https://nedbatchelder.com/text/unipain.html

The Clean Architecture in Python (also a nice watch for anyone interested in FP approaches in Python):
https://www.youtube.com/watch?v=DJtef410XaM

If you don't like configparser and/or YAML, consider TOML:
https://github.com/toml-lang/toml

I'll share more as I come across them in my bookmarks. I've worked with Python (3) for about five years now and I figured I'd see if there was anything goon related

# ? Jan 1, 2019 11:52

Thermopyle: Jul 1, 2003; ...the stupid are cocksure while the intelligent are full of doubt. - Bertrand Russell

Bundy posted:

Also hi, thread newcomer

According to the forums search, I've posted in the current incarnation of this thread 242 times!

# ? Jan 1, 2019 21:54

QuarkJets: Sep 8, 2008

Bundy posted:

Packt currently have an everything for five bucks deal going (Packt usually provide the Humble Python books) if you want to cherry pick. Also hi, thread newcomer

Some resources I often find useful that I didn't notice in the OP:

Writing Idiomatic Python:
https://jeffknupp.com/writing-idiomatic-python-ebook/

This guy is great, if you're poor just write to him and he'll give you a copy of the book for free.

Nedbat's Big O presentation:
https://nedbatchelder.com/text/bigo.html

Nedbat's talk on Pragmatic Unicode:
https://nedbatchelder.com/text/unipain.html

The Clean Architecture in Python (also a nice watch for anyone interested in FP approaches in Python):
https://www.youtube.com/watch?v=DJtef410XaM

If you don't like configparser and/or YAML, consider TOML:
https://github.com/toml-lang/toml

I'll share more as I come across them in my bookmarks. I've worked with Python (3) for about five years now and I figured I'd see if there was anything goon related

Thanks for recommending this! I didn't know about this place (Packt), they have a ton of cheap bundles!

QuarkJets fucked around with this message at 22:19 on Jan 1, 2019

# ? Jan 1, 2019 22:16
