necrotic
Aug 2, 2005
I owe my brother big time for this!
Do any scripting languages allow using main as an entry point, without explicitly calling it?


Dominoes
Sep 20, 2007

Nippashish posted:

You could also just commit to not import your scripts as modules and then you don't need any double underscore shenanigans.
I like this approach.

Cock Democracy
Jan 1, 2003

Now that is the finest piece of chilean sea bass I have ever smelled
I'm looking for some advice on how to refactor a giant ETL/data pipeline script. Here's a summary of what it does currently:

- Download files from various partners (some use FTP, others are HTTP)
- Load the downloaded files into MySQL temp tables
- De-dupe records (some records appear in the feeds of multiple partners)
- Take the combined records, create a CSV and upload that to an API to get some additional data, then parse the results
- Prepare the final output as MySQL tables + Elasticsearch index used by the Django app
- Verify that the new record counts are reasonable, then put the final output on production
- Save logs to filesystem and a table

The requirements are nothing too complicated, but the code now is a bit of a mess. It's all procedural, has tons of exceptions and hacks for certain partners, and is generally just not very extendable. When it fails, our devs have a hell of a time debugging it. It's sort of fallen under my ownership and I'm not happy about its current state.

Are there any libraries or PEPs to help guide me in rewriting this beast? My thought is I'd like a setup like Django migrations, where each "step" is its own file, containing a class that follows a standard format. Then there could be a file that lists all the classes and what order they run in. Hell, some of the steps could run concurrently, so it would be awesome to support that.

cinci zoo sniper
Mar 15, 2013




Cock Democracy posted:

I'm looking for some advice on how to refactor a giant ETL/data pipeline script. Here's a summary of what it does currently:

- Download files from various partners (some use FTP, others are HTTP)
- Load the downloaded files into MySQL temp tables
- De-dupe records (some records appear in the feeds of multiple partners)
- Take the combined records, create a CSV and upload that to an API to get some additional data, then parse the results
- Prepare the final output as MySQL tables + Elasticsearch index used by the Django app
- Verify that the new record counts are reasonable, then put the final output on production
- Save logs to filesystem and a table

The requirements are nothing too complicated, but the code now is a bit of a mess. It's all procedural, has tons of exceptions and hacks for certain partners, and is generally just not very extendable. When it fails, our devs have a hell of a time debugging it. It's sort of fallen under my ownership and I'm not happy about its current state.

Are there any libraries or PEPs to help guide me in rewriting this beast? My thought is I'd like a setup like Django migrations, where each "step" is its own file, containing a class that follows a standard format. Then there could be a file that lists all the classes and what order they run in. Hell, some of the steps could run concurrently, so it would be awesome to support that.

Luigi or Airflow for libraries. Generally speaking, you want to separate each letter in ETL structurally, and then just use something to orchestrate correct, step-wise execution.
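
For a sense of what "one class per step, plus a file listing the order" might look like before reaching for Luigi or Airflow, here is a minimal standard-library sketch. Every name in it (Step, DownloadPartnerA, PIPELINE, run_pipeline) is hypothetical, and the downloads are stubbed with in-memory records; it only illustrates the structure, not either library's API.

```python
from concurrent.futures import ThreadPoolExecutor


class Step:
    """Base class: every pipeline step exposes the same run(context) method."""

    name = "step"

    def run(self, context):
        raise NotImplementedError


class DownloadPartnerA(Step):
    name = "download_a"

    def run(self, context):
        # Stand-in for an FTP or HTTP download.
        return [{"id": 1}, {"id": 2}]


class DownloadPartnerB(Step):
    name = "download_b"

    def run(self, context):
        return [{"id": 2}, {"id": 3}]


class Dedupe(Step):
    name = "dedupe"

    def run(self, context):
        # Keep the first record seen for each id.
        seen = {}
        for record in context["download_a"] + context["download_b"]:
            seen.setdefault(record["id"], record)
        return list(seen.values())


# Each inner list is a stage. Steps within a stage are independent of each
# other and run concurrently; stages run in order.
PIPELINE = [
    [DownloadPartnerA(), DownloadPartnerB()],
    [Dedupe()],
]


def run_pipeline(stages):
    context = {}
    with ThreadPoolExecutor() as pool:
        for stage in stages:
            futures = {step.name: pool.submit(step.run, context) for step in stage}
            # result() re-raises a step's exception here, so a failure is
            # reported against a named step.
            results = {name: future.result() for name, future in futures.items()}
            context.update(results)
    return context


if __name__ == "__main__":
    print(run_pipeline(PIPELINE)["dedupe"])
```

Partner-specific hacks then live inside that partner's step class instead of being scattered through one long script.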

Malcolm XML
Aug 8, 2009

I always knew it would end like this.
Airflow is the worst out there except for everything else

Luigi doesn't do scheduling, just sequencing, IIRC.

Godspeed goon.

cinci zoo sniper
Mar 15, 2013




https://lwn.net/SubscriberLink/775105/5db16cfe82e78dc3/

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Dominoes posted:

Does anyone else dislike the if __name__ == "__main__" syntax? I still have to look it up every time.

There's tons of stuff like this in every language. I just get used to all of these weird things in every language.

FWIW, I just type "main" and press TAB and PyCharm types it all for me.
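
For reference, the idiom under discussion looks roughly like this. The main() function and its return value are only illustrative; the guard line is the part people have to look up.

```python
def main():
    # Real work goes here; returning a value makes the function easy to test.
    print("running as a script")
    return 0


# True only when the file is executed directly (python script.py),
# not when it is imported as a module.
if __name__ == "__main__":
    main()
```

Importing the file from another module defines main() without running it, which is the whole point of the guard.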

Dominoes
Sep 20, 2007

Thermopyle posted:

I just type "main" and press TAB and PyCharm types it all for me.
Neat!

keyframe
Sep 15, 2007

I have seen things
Guys I was wondering what is a good way to see how long it takes a python script to run and how much memory/resources it is using?

I just discovered timeit, but all the examples I have seen for it seem to time single-line statements, with a weird syntax where everything is enclosed in a quoted string.
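
On the string syntax: timeit also accepts a plain callable, so the code under test does not have to be quoted. A small sketch, where build_squares is just a made-up workload:

```python
import time
import timeit


def build_squares(n=1000):
    return [i * i for i in range(n)]


# Pass the function object itself; timeit calls it `number` times
# and returns the total elapsed seconds.
runs = 200
total = timeit.timeit(build_squares, number=runs)
print(f"{total / runs * 1e6:.1f} microseconds per call")

# For a one-off wall-clock measurement, perf_counter around the work is enough.
start = time.perf_counter()
build_squares(100_000)
elapsed = time.perf_counter() - start
print(f"{elapsed:.4f} seconds")
```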

cinci zoo sniper
Mar 15, 2013




keyframe posted:

Guys I was wondering what is a good way to see how long it takes a python script to run and how much memory/resources it is using?

I just discovered timeit, but all the examples I have seen for it seem to time single-line statements, with a weird syntax where everything is enclosed in a quoted string.

Official cProfile is a good start.

E: For memory you can try a package called memory-profiler.

cinci zoo sniper fucked around with this message at 01:11 on Dec 24, 2018
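
A minimal sketch of the cProfile suggestion. For memory it uses tracemalloc from the standard library rather than memory-profiler, purely so the example has no third-party dependency; work() is a made-up workload. For a whole script, running `python -m cProfile -s cumulative script.py` needs no code changes at all.

```python
import cProfile
import io
import pstats
import tracemalloc


def work():
    data = [str(i) for i in range(50_000)]
    return len("".join(data))


profiler = cProfile.Profile()
tracemalloc.start()
profiler.enable()
result = work()
profiler.disable()
# get_traced_memory returns (current, peak) in bytes since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# Collect the five most expensive entries by cumulative time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
print(f"peak traced memory: {peak / 1024:.0f} KiB")
```

Running both tools at once slows the code down noticeably, so treat the timings as relative rather than absolute.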

keyframe
Sep 15, 2007

I have seen things

cinci zoo sniper posted:

Official cProfile is a good start.

E: For memory you can try a package called memory-profiler.

Thanks man!

cProfile was exactly what I was looking for.

KICK BAMA KICK
Mar 2, 2009

Just to make sure, cause this behavior could actually be useful to shorten some conditionals into one-liners if it wasn't bananas: everyone would consider it extremely weird and unPythonic to access an item of a two-item sequence using a Boolean value silently evaluated to 0 or 1 as the index, i.e. 'ab'[False] == 'a' or 'ab'[True] == 'b' right?

QuarkJets
Sep 8, 2008

KICK BAMA KICK posted:

Just to make sure, cause this behavior could actually be useful to shorten some conditionals into one-liners if it wasn't bananas: everyone would consider it extremely weird and unPythonic to access an item of a two-item sequence using a Boolean value silently evaluated to 0 or 1 as the index, i.e. 'ab'[False] == 'a' or 'ab'[True] == 'b' right?

True and False evaluating to 1 and 0 isn't bananas, that's been a common feature for languages going back decades. It's fine to use that behavior, booleans are just a special type of integer, the bool class even inherits from int!

However, don't do what you explicitly wrote in your post; instead of writing 'ab'[True] == 'b' you should just write 'ab'[1] == 'b'. I can see no benefit to writing out the boolean value instead of its integer equivalent
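
The inheritance claim is easy to check in an interpreter. The last part shows one use of it that tends to read fine, counting matches; the values list is only sample data.

```python
# bool really is a subclass of int.
assert issubclass(bool, int)
assert isinstance(True, int)
assert True == 1 and False == 0

# A common, readable use: counting how many items satisfy a condition.
values = [3, -1, 4, -1, 5]
positives = sum(v > 0 for v in values)
print(positives)  # 3
```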

KICK BAMA KICK
Mar 2, 2009

Yeah that was just an illustration, the case I'm thinking of is more like
code:
if b:
    x = foo
else:
    x = bar
being condensed into
code:
x = (bar, foo)[b]
That doesn't raise any eyebrows if b is clearly a bool?

breaks
May 12, 2001

I think that

code:
x = foo if b else bar
is the less head exploding way to write that in one line.

In general relying on False == 0 and True == 1 is bad.

QuarkJets
Sep 8, 2008

KICK BAMA KICK posted:

Yeah that was just an illustration, the case I'm thinking of is more like
code:
if b:
    x = foo
else:
    x = bar
being condensed into
code:
x = (bar, foo)[b]
That doesn't raise any eyebrows if b is clearly a bool?

What breaks wrote is the pythonic way to do this. Creating a tuple just to immediately index into it... that's no good
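
Beyond style there is a behavioral difference: the tuple is built before it is indexed, so both branches get evaluated, while the conditional expression only evaluates the branch it picks. A small sketch with hypothetical foo() and bar() functions that record when they are called:

```python
calls = []


def foo():
    calls.append("foo")
    return "foo"


def bar():
    calls.append("bar")
    return "bar"


b = True

# Tuple indexing: both functions run before the index is applied.
x = (bar(), foo())[b]
calls_tuple = list(calls)

calls.clear()

# Conditional expression: only the selected branch runs.
y = foo() if b else bar()
calls_ternary = list(calls)

print(calls_tuple, calls_ternary)
```

That matters as soon as either branch is expensive or has side effects.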

KICK BAMA KICK
Mar 2, 2009

QuarkJets posted:

What breaks wrote is the pythonic way to do this. Creating a tuple just to immediately index into it... that's no good
D'oh, thanks! No idea why that escaped me (I really like using that ternary construction!) or what it was that got me hung up on the idea of using a bool as an index in the first place. (As an aside, it turns out Python 3 guarantees True == 1 and False == 0 because True and False are keywords; in Python 2 they were ordinary built-in names that could be reassigned.)

NtotheTC
Dec 31, 2007


I've seen that pattern used once in the wild, where it was something like

Python code:
string_from_bool = ("falsily", "verily",)

>>> print(string_from_bool[False])
falsily

>>> print(string_from_bool[True])
verily

cinci zoo sniper
Mar 15, 2013




NtotheTC posted:

I've seen that pattern used once in the wild, where it was something like

Python code:
string_from_bool = ("falsily", "verily",)

>>> print(string_from_bool[False])
falsily

>>> print(string_from_bool[True])
verily

loving hell, why?!

NtotheTC
Dec 31, 2007


cinci zoo sniper posted:

loving hell, why?!

Well that example is slightly simplified, the two strings were actually entire sentences, and were stored in a config file and often accessed, so it was neater than writing

Python code:
my_string_response = STORED_TRUTHY_STRING if condition else STORED_FALSEY_STRING
or similar, i dunno it was several jobs ago but I never found it particularly egregious

NtotheTC fucked around with this message at 21:47 on Dec 26, 2018

QuarkJets
Sep 8, 2008

NtotheTC posted:

Well that example is slightly simplified, the two strings were actually entire sentences, and were stored in a config file and often accessed, so it was neater than writing

Python code:
my_string_response = STORED_TRUTHY_STRING if condition else STORED_FALSEY_STRING
or similar, i dunno it was several jobs ago but I never found it particularly egregious

I don't know man, it seems like the thing you just wrote is way neater than creating a tuple and then indexing into it with True/False

Nippashish
Nov 2, 2005

Let me see you dance!
It should be a dictionary.
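
A sketch of the dictionary version, reusing the two strings from the example above. RESPONSES is a made-up name, and string_from_bool is rewritten here as a function.

```python
RESPONSES = {True: "verily", False: "falsily"}


def string_from_bool(flag):
    # bool() normalizes any truthy or falsy value to one of the two keys,
    # so the lookup cannot raise KeyError.
    return RESPONSES[bool(flag)]


print(string_from_bool(True))
print(string_from_bool(""))
```

The mapping from value to text is spelled out explicitly, so nobody has to remember which tuple position means what.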

cinci zoo sniper
Mar 15, 2013




Nippashish posted:

It should be a dictionary.

NtotheTC
Dec 31, 2007


I'm not trying to nominate it for code of the year, just pointing out a real-world example of it

SurgicalOntologist
Jun 17, 2004

The only case where I would find that type of indexing acceptable is in a vectorized environment where you wanted to do something equivalent to:
Python code:
[a if condition else b for condition in conditions]
but vectorized with numpy it could be
Python code:
np.array([b, a])[conditions.astype(int)]
I would only accept it with the astype(int) to make it clear that this isn't typical boolean indexing. Even if it works without it.

And I would still probably try to find another solution. But the numpy version would be faster I think.

Edit: duh, there is a better solution
Python code:
np.where(conditions, a, b)
I was overthinking it.
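
A quick check that the two spellings agree, with made-up values for a, b and conditions. For integer indexing to match the list comprehension, the array has to be ordered [b, a], because True becomes index 1.

```python
import numpy as np

conditions = np.array([True, False, True])
a, b = 10, 20

# np.where picks a where the condition is True and b elsewhere.
picked = np.where(conditions, a, b)

# Integer-index equivalent: False -> index 0 -> b, True -> index 1 -> a.
indexed = np.array([b, a])[conditions.astype(int)]

print(picked.tolist(), indexed.tolist())
```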

Ghost of Reagan Past
Oct 7, 2003

rock and roll fun

NtotheTC posted:

I'm not trying to nominate it for code of the year, just pointing out a real-world example of it
We've all done code crimes with Python but this is a real mess.

EDIT: I think my greatest Python code crime was erasing exceptions deep in some library and baffling myself and others about how the gently caress it was breaking. It made sense at the time but holy gently caress it's a mess.
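
For anyone who has not been bitten by this yet, a minimal illustration of what erasing an exception looks like next to a version that keeps the trail. Both functions are hypothetical examples, not code from any real library.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port_swallowing(text):
    try:
        return int(text)
    except Exception:
        # The error vanishes; callers get None and fail somewhere else later.
        return None


def parse_port(text):
    try:
        return int(text)
    except ValueError as exc:
        # Log it, then re-raise with the original attached as __cause__.
        logger.warning("invalid port %r", text)
        raise ValueError(f"invalid port: {text!r}") from exc
```

With the second version the traceback points at the bad input instead of at whatever code later trips over the None.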

keyframe
Sep 15, 2007

I have seen things
Hi guys,

I am trying to scrape craigslist for apartment prices. I am having a problem where when I scrape the prices in <span class="result-price">$xxxx</span> I get a lot of duplicates which poo poo on my plan on making a urllink/price dictionary with zip(). Any idea why the duplicates are happening? I am using requests/beautiful soup with html.parser.

craigslist link if anyone wants to try themselves:

https://vancouver.craigslist.org/d/apts-housing-for-rent/search/apa

KICK BAMA KICK
Mar 2, 2009

keyframe posted:

Hi guys,

I am trying to scrape craigslist for apartment prices. I am having a problem where when I scrape the prices in <span class="result-price">$xxxx</span> I get a lot of duplicates which poo poo on my plan on making a urllink/price dictionary with zip(). Any idea why the duplicates are happening? I am using requests/beautiful soup with html.parser.

craigslist link if anyone wants to try themselves:

https://vancouver.craigslist.org/d/apts-housing-for-rent/search/apa
At a glance, it looks like each listing's "result-price" is repeated under a <span class="result-meta"> tag. Grab your data only once per entry from either that or its parent.

keyframe
Sep 15, 2007

I have seen things

KICK BAMA KICK posted:

At a glance, it looks like each listing's "result-price" is repeated under a <span class="result-meta"> tag. Grab your data only once per entry from either that or its parent.

Ahh I see what you mean. I have no idea how to do that though, any pointers?

KICK BAMA KICK
Mar 2, 2009

keyframe posted:

Ahh I see what you mean. I have no idea how to do that though, any pointers?

Don't remember the BeautifulSoup syntax off the top of my head (and also just glanced at the page source the one time) so this is all approximate, but instead of searching for all class="result-price" tags on the page you'll want to start by searching for only the class="result-price" tags that are children of class="result-meta" tags. I put it that way cause I'm pretty sure you can find the tags matching both those conditions in one search and then iterate over those to process them the same way you were doing before; if you can't do that, or if you don't follow, search for all the class="result-meta" tags, iterate over those and find the one class="result-price" tag underneath each. Poke around in the BeautifulSoup docs a little bit, this should be pretty easy.

I guess the absolute simplest way if you want to immediately solve the problem that might get the correct result but is not a great way to do it would be to step through the list of results you've retrieved with whatever line searches for those "result-price" tags by twos with a [::2] slice.

keyframe
Sep 15, 2007

I have seen things

KICK BAMA KICK posted:

Don't remember the BeautifulSoup syntax off the top of my head (and also just glanced at the page source the one time) so this is all approximate, but instead of searching for all class="result-price" tags on the page you'll want to start by searching for only the class="result-price" tags that are children of class="result-meta" tags. I put it that way cause I'm pretty sure you can find the tags matching both those conditions in one search and then iterate over those to process them the same way you were doing before; if you can't do that, or if you don't follow, search for all the class="result-meta" tags, iterate over those and find the one class="result-price" tag underneath each. Poke around in the BeautifulSoup docs a little bit, this should be pretty easy.

I guess the absolute simplest way if you want to immediately solve the problem that might get the correct result but is not a great way to do it would be to step through the list of results you've retrieved with whatever line searches for those "result-price" tags by twos with a [::2] slice.

Thanks. I got some ways to go before I wrap my head around beautiful soup but somehow solved it with this while messing around trying different things.


code:
price = craigs.find_all('span', {'class':'result-meta'})

for i in price:

    print(i.find('span').text)
This gives me the span class=result-price that is directly under result-meta and not the other result-price so I don't get duplicates. I don't know why it's not giving all the other span tags below the price span tag though. It is the result I am after but I don't fully understand how I got it. EDIT: oh I guess it is because I am using find and not find_all so it just gives the direct child. Finally getting it a bit.

Here is the snapshot of the html btw:



Thanks again for the help.

keyframe fucked around with this message at 07:41 on Dec 31, 2018

KICK BAMA KICK
Mar 2, 2009

Awesome! Exactly, that works for now because the price happens to be in the first span tag under result-meta; to make it more robust you will want to specify the class="result-price" in the find, however that works.
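
One way the more robust find could look. The HTML below is a hand-written stand-in modeled on the description in this thread (price repeated under the image link, once more under result-meta), not a copy of the real page, and the housing span is invented to show the class filter doing its job.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second listing has no image, so its price appears once.
HTML = """
<ul>
  <li class="result-row">
    <a class="result-image" href="/apa/1.html"><span class="result-price">$1500</span></a>
    <p class="result-info">
      <span class="result-meta"><span class="housing">1br</span><span class="result-price">$1500</span></span>
    </p>
  </li>
  <li class="result-row">
    <p class="result-info">
      <span class="result-meta"><span class="result-price">$2200</span></span>
    </p>
  </li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

all_prices = soup.find_all("span", class_="result-price")

prices = []
for meta in soup.find_all("span", class_="result-meta"):
    # Ask for the class explicitly instead of relying on span order.
    price_tag = meta.find("span", class_="result-price")
    if price_tag is not None:
        prices.append(price_tag.get_text(strip=True))

print(len(all_prices), prices)
```

In the first listing a plain meta.find('span') would have returned the housing span, which is the fragility being pointed out.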

breaks
May 12, 2001

Just use the soup's select method with a ".result-meta>.result-price" selector and call it a day. IMO always pick elements using CSS selectors up until they don't support what you need, that's what they were made for.

(On that page, there are 221 matches for .result-price but 120 for .result-meta>.result-price. There are 120 items per page and at first glance the lower than expected number of .result-price results seems to be due to posts lacking pics only having the price displayed once? or something like that.)

breaks fucked around with this message at 08:13 on Dec 31, 2018

keyframe
Sep 15, 2007

I have seen things

breaks posted:

Just use the soup's select method with a ".result-meta>.result-price" selector and call it a day. IMO always pick elements using CSS selectors up until they don't support what you need, that's what they were made for.

(On that page, there are 221 matches for .result-price but 120 for .result-meta>.result-price. There are 120 items per page and at first glance the lower than expected number of .result-price results seems to be due to posts lacking pics only having the price displayed once? or something like that.)

Thanks for the tip! Just read up on that in the docs and you are right.

code:
css = craigs.select('p > span.result-meta > span.result-price')

for i in css:

    print(i.text)
This gives the same result in far fewer lines without needing to do nested for loops.
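
To get from there to the original goal of a link/price dictionary, one option is to select each listing first and pull both values out of the same element, so they cannot drift out of step the way two zipped lists can. The markup and the li.result-row and a.result-title selectors are assumptions for illustration, not verified against the live page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for two search results.
HTML = """
<ul>
  <li class="result-row">
    <a class="result-image" href="/apa/1.html"><span class="result-price">$1500</span></a>
    <p class="result-info">
      <a class="result-title" href="/apa/1.html">Sunny 1BR</a>
      <span class="result-meta"><span class="result-price">$1500</span></span>
    </p>
  </li>
  <li class="result-row">
    <p class="result-info">
      <a class="result-title" href="/apa/2.html">2BR near park</a>
      <span class="result-meta"><span class="result-price">$2200</span></span>
    </p>
  </li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

listings = {}
for row in soup.select("li.result-row"):
    link = row.select_one("a.result-title")
    # Child combinator: only the price directly under result-meta.
    price = row.select_one(".result-meta > .result-price")
    if link is not None and price is not None:
        listings[link["href"]] = price.get_text(strip=True)

print(listings)
```

A listing with no price is simply skipped instead of shifting every later price onto the wrong link.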

Hollow Talk
Feb 2, 2014

Malcolm XML posted:

Airflow is the worst out there except for everything else

Luigi doesn't do scheduling, just sequencing, IIRC.

Godspeed goon.

I'm using Airflow in a customer project, and by God, everything about it is slow and I hate "packaging" the DAGs.

I hands down prefer Luigi+Cron. Unfortunately, Luigi's web interface is functionally useless, so if you need any kind of actual information from your jobs, Airflow it is. :saddowns:

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Humble Bundle has a bunch of python books.

QuarkJets
Sep 8, 2008

Looks like a Qt5 book and a Pandas book in there, those sound useful

but then there's "Learn Python by Building a Blockchain and Cryptocurrency" which is just :barf:

NinpoEspiritoSanto
Oct 22, 2013





Packt currently have an everything for five bucks deal going (Packt usually provide the Humble Python books) if you want to cherry pick. Also hi, thread newcomer :)

Some resources I often find useful that I didn't notice in the OP:

Writing Idiomatic Python:
https://jeffknupp.com/writing-idiomatic-python-ebook/

This guy is great, if you're poor just write to him and he'll give you a copy of the book for free.

Nedbat's Big O presentation:
https://nedbatchelder.com/text/bigo.html

Nedbat's talk on Pragmatic Unicode:
https://nedbatchelder.com/text/unipain.html

The Clean Architecture in Python (also a nice watch for anyone interested in FP approaches in Python):
https://www.youtube.com/watch?v=DJtef410XaM

If you don't like configparser and/or YAML, consider TOML:
https://github.com/toml-lang/toml

I'll share more as I come across them in my bookmarks. I've worked with Python (3) for about five years now and I figured I'd see if there was anything goon related :)
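
For a taste of TOML from Python: newer interpreters (3.11 and later) ship a read-only parser as tomllib, which postdates this post; on older versions the third-party tomli package offers the same interface. The config contents below are invented for the example.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with the same API

# Hypothetical config file contents.
CONFIG_TEXT = """
[database]
host = "localhost"
port = 3306

[pipeline]
partners = ["alpha", "beta"]
"""

# loads() parses a string; load() takes a file opened in binary mode.
config = tomllib.loads(CONFIG_TEXT)
print(config["database"]["port"], config["pipeline"]["partners"])
```

Values come back as native types (int, list, str), so there is no string-to-int conversion step like with configparser.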

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

Bundy posted:

Also hi, thread newcomer :)

According to the forums search, I've posted in the current incarnation of this thread 242 times!


QuarkJets
Sep 8, 2008

Bundy posted:

Packt currently have an everything for five bucks deal going (Packt usually provide the Humble Python books) if you want to cherry pick. Also hi, thread newcomer :)

Some resources I often find useful that I didn't notice in the OP:

Writing Idiomatic Python:
https://jeffknupp.com/writing-idiomatic-python-ebook/

This guy is great, if you're poor just write to him and he'll give you a copy of the book for free.

Nedbat's Big O presentation:
https://nedbatchelder.com/text/bigo.html

Nedbat's talk on Pragmatic Unicode:
https://nedbatchelder.com/text/unipain.html

The Clean Architecture in Python (also a nice watch for anyone interested in FP approaches in Python):
https://www.youtube.com/watch?v=DJtef410XaM

If you don't like configparser and/or YAML, consider TOML:
https://github.com/toml-lang/toml

I'll share more as I come across them in my bookmarks. I've worked with Python (3) for about five years now and I figured I'd see if there was anything goon related :)

Thanks for recommending this! I didn't know about this place (Packt), they have a ton of cheap bundles!

QuarkJets fucked around with this message at 22:19 on Jan 1, 2019
