|
Do any scripting languages allow using main as an entry point, without explicitly calling it?
|
# ? Dec 18, 2018 21:31 |
|
|
# ? May 16, 2024 10:42 |
|
Nippashish posted:You could also just commit to not import your scripts as modules and then you don't need any double underscore shenanigans.
|
# ? Dec 18, 2018 22:27 |
|
I'm looking for some advice on how to refactor a giant ETL/data pipeline script. Here's a summary of what it does currently: - Download files from various partners (some use FTP, others are HTTP) - Load the downloaded files into MySQL temp tables - De-dupe records (some records appear in the feeds of multiple partners) - Take the combined records, create a CSV and upload that to an API to get some additional data, then parse the results - Prepare the final output as MySQL tables + Elasticsearch index used by the Django app - Verify that the new record counts are reasonable, then put the final output on production - Save logs to filesystem and a table The requirements are nothing too complicated, but the code now is a bit of a mess. It's all procedural, has tons of exceptions and hacks for certain partners, and is generally just not very extendable. When it fails, our devs have a hell of a time debugging it. It's sort of fallen under my ownership and I'm not happy about its current state. Are there any libraries or PEPs to help guide me in rewriting this beast? My thought is I'd like a setup like Django migrations, where each "step" is its own file, containing a class that follows a standard format. Then there could be a file that lists all the classes and what order they run in. Hell, some of the steps could run concurrently, so it would be awesome to support that.
|
# ? Dec 19, 2018 13:00 |
Cock Democracy posted:I'm looking for some advice on how to refactor a giant ETL/data pipeline script. Here's a summary of what it does currently: Luigi or Airflow for libraries. Generally speaking, you want to separate each letter in ETL structurally, and then just use something to orchestrate correct, step-wise execution.
|
|
# ? Dec 19, 2018 13:19 |
|
Airflow is the worst out there except for everything else Luigi doesn't do scheduling just sequencing Iirc Godspeed goon.
|
# ? Dec 19, 2018 18:53 |
https://lwn.net/SubscriberLink/775105/5db16cfe82e78dc3/
|
|
# ? Dec 20, 2018 17:55 |
|
Dominoes posted:Does anyone else dislike the if __name__ == __main__ syntax? I still have to look it up every time. There's tons of stuff like this in every language. I just get used to all of these weird things in every language. FWIW, I just type "main" and press TAB and Pycharm types it all for me.
|
# ? Dec 20, 2018 20:00 |
|
Thermopyle posted:I just type "main" and press TAB and Pycharm types it all for me.
|
# ? Dec 20, 2018 20:07 |
|
Guys I was wondering what is a good way to see how long it takes a python script to run and how much memory/resources it is using? I just discovered timeit but all the examples I have seen for that seems to be run on single line comments with some weird have everything enclosed in a string quote syntax.
|
# ? Dec 24, 2018 00:47 |
keyframe posted:Guys I was wondering what is a good way to see how long it takes a python script to run and how much memory/resources it is using? Official cProfile is a good start. E: For memory you can try package called memory-profiler. cinci zoo sniper fucked around with this message at 01:11 on Dec 24, 2018 |
|
# ? Dec 24, 2018 01:08 |
|
cinci zoo sniper posted:Official cProfile is a good start. Thanks man! cProfile was exactly what I was looking for.
|
# ? Dec 24, 2018 02:58 |
|
Just to make sure, cause this behavior could actually be useful to shorten some conditionals into one-liners if it wasn't bananas: everyone would consider it extremely weird and unPythonic to access an item of a two-item sequence using a Boolean value silently evaluated to 0 or 1 as the index, i.e. 'ab'[False] == 'a' or 'ab'[True] == 'b' right?
|
# ? Dec 26, 2018 08:01 |
|
KICK BAMA KICK posted:Just to make sure, cause this behavior could actually be useful to shorten some conditionals into one-liners if it wasn't bananas: everyone would consider it extremely weird and unPythonic to access an item of a two-item sequence using a Boolean value silently evaluated to 0 or 1 as the index, i.e. 'ab'[False] == 'a' or 'ab'[True] == 'b' right? True and False evaluating to 1 and 0 isn't bananas, that's been a common feature for languages going back decades. It's fine to use that behavior, booleans are just a special type of integer, the bool class even inherits from int! However, don't do what you explicitly wrote in your post; instead of writing 'ab'[True] == 'b' you should just write 'ab'[1] == 'b'. I can see no benefit to writing out the boolean value instead of its integer equivalent
|
# ? Dec 26, 2018 08:24 |
|
Yeah that was just an illustration, the case I'm thinking of is more likecode:
code:
|
# ? Dec 26, 2018 08:33 |
|
I think thatcode:
In general relying on False == 0 and True == 1 is bad.
|
# ? Dec 26, 2018 09:12 |
|
KICK BAMA KICK posted:Yeah that was just an illustration, the case I'm thinking of is more like What breaks wrote is the pythonic way to do this. Creating a tuple just to immediately index into it... that's no good
|
# ? Dec 26, 2018 10:32 |
|
QuarkJets posted:What breaks wrote is the pythonic way to do this. Creating a tuple just to immediately index into it... that's no good
|
# ? Dec 26, 2018 18:22 |
|
I've seen that pattern used once in the wild, where it was something likePython code:
|
# ? Dec 26, 2018 21:33 |
NtotheTC posted:I've seen that pattern used once in the wild, where it was something like loving hell, why?!
|
|
# ? Dec 26, 2018 21:37 |
|
cinci zoo sniper posted:loving hell, why?! Well that example is slightly simplified, the two strings were actually entire sentences, and were stored in a config file and often accessed, so it was neater than writing Python code:
NtotheTC fucked around with this message at 21:47 on Dec 26, 2018 |
# ? Dec 26, 2018 21:43 |
|
NtotheTC posted:Well that example is slightly simplified, the two strings were actually entire sentences, and were stored in a config file and often accessed, so it was neater than writing I don't know man, it seems like the thing you just wrote is way neater than creating a tuple and then indexing into it with True/Fase
|
# ? Dec 26, 2018 22:05 |
|
It should be a dictionary.
|
# ? Dec 26, 2018 22:07 |
Nippashish posted:It should be a dictionary.
|
|
# ? Dec 26, 2018 23:44 |
|
I'm not trying to nominate it for code of the year, just pointing out a real-world example of it
|
# ? Dec 26, 2018 23:47 |
|
The only case where I would find that type of indexing acceptable is in a vectorized environment where you wanted to do something equivalent to:Python code:
Python code:
And I still would probably to try to find another solution. But the numpy version would be faster I think. Edit: duh, there is a better solution Python code:
|
# ? Dec 27, 2018 00:30 |
|
NtotheTC posted:I'm not trying to nominate it for code of the year, just pointing out a real-world example of it EDIT: I think my greatest Python code crime was erasing exceptions deep in some library and baffling myself and others about how the gently caress it was breaking. It made sense at the time but holy gently caress it's a mess.
|
# ? Dec 29, 2018 04:35 |
|
Hi guys, I am trying to scrape craigslist for apartment prices. I am having a problem where when I scrape the prices in <span class="result-price">$xxxx</span> I get a lot of duplicates which poo poo on my plan on making a urllink/price dictionary with zip(). Any idea why the duplicates are happening? I am using requests/beautiful soup with html.parser. craigslist link if anyone wants to try themselves: https://vancouver.craigslist.org/d/apts-housing-for-rent/search/apa
|
# ? Dec 31, 2018 05:15 |
|
keyframe posted:Hi guys,
|
# ? Dec 31, 2018 05:39 |
|
KICK BAMA KICK posted:At a glance, it looks like each listing's "result-price" is repeated under a <span class="result-meta"> tag. Grab your data only once per entry from either that or its parent. Ahh I see what you mean. I have no idea how to do that though, any pointers?
|
# ? Dec 31, 2018 06:19 |
|
keyframe posted:Ahh I see what you mean. I have no idea how to do that though, any pointers? Don't remember the BeautifulSoup syntax off the top of my head (and also just glanced at the page source the one time) so this is all approximate but instead of searching for all class="result-price" tags on the page you'll want to start by searching for only the the class="result-price" tags that are children of class="result-meta" tags. I put it that way cause I'm pretty sure you can find the tags matching both those conditions in one search and then iterate over those to process them the same way you were doing before; if you can't do that or if you don't follow search for all the class="result-meta" tags, iterate over those and find the one class="result-price" tag underneath each. Poke around in the BeautifulSoup docs a little bit, this should be pretty easy. I guess the absolute simplest way if you want to immediately solve the problem that might get the correct result but is not a great way to do it would be to step through the list of results you've retrieved with whatever line searches for those "result-price" tags by twos with a [::2] slice.
|
# ? Dec 31, 2018 06:53 |
|
KICK BAMA KICK posted:Don't remember the BeautifulSoup syntax off the top of my head (and also just glanced at the page source the one time) so this is all approximate but instead of searching for all class="result-price" tags on the page you'll want to start by searching for only the the class="result-price" tags that are children of class="result-meta" tags. I put it that way cause I'm pretty sure you can find the tags matching both those conditions in one search and then iterate over those to process them the same way you were doing before; if you can't do that or if you don't follow search for all the class="result-meta" tags, iterate over those and find the one class="result-price" tag underneath each. Poke around in the BeautifulSoup docs a little bit, this should be pretty easy. Thanks. I got some ways to go before I wrap my head around beautiful soup but somehow solved it with this while messing around trying different things. code:
Here is the snapshot of the html btw: Thanks again for the help. keyframe fucked around with this message at 07:41 on Dec 31, 2018 |
# ? Dec 31, 2018 07:34 |
|
Awesome! Exactly, that works for now because the price happens to be in the first span tag under result-meta; to make it more robust you will want to specify the class="result-price" in the find, however that works.
|
# ? Dec 31, 2018 07:54 |
|
Just use the soup's select method with a ".result-meta>.result-price" selector and call it a day. IMO always pick elements using CSS selectors up until they don't support what you need, that's what they were made for. (On that page, there 221 matches for .result-price but 120 for .result-meta>.result-price. There are 120 items per page and at first glance the lower than expected number of .result-price results seems to be due to posts lacking pics only having the price displayed once? or something like that.) breaks fucked around with this message at 08:13 on Dec 31, 2018 |
# ? Dec 31, 2018 07:59 |
|
breaks posted:Just use the soup's select method with a ".result-meta>.result-price" selector and call it a day. IMO always pick elements using CSS selectors up until they don't support what you need, that's what they were made for. Thanks for the tip! Just read up on that in the docs and you are right. code:
|
# ? Dec 31, 2018 08:11 |
|
Malcolm XML posted:Airflow is the worst out there except for everything else I'm using Airflow in a customer project, and by God, everything about it is slow and I hate "packaging" the DAGs. I hands down prefer Luigi+Cron. Unfortunately, Luigi's web interface is functionally useless, so if you need any kind of actual information from your jobs, Airflow it is.
|
# ? Dec 31, 2018 13:52 |
|
Humble Bundle has a bunch of python books.
|
# ? Dec 31, 2018 20:18 |
|
Looks like a Qt5 book and a Pandas book in there, those sound useful but then there's "Learn Python by Building a Blockchain and Cryptocurrency" which is just
|
# ? Jan 1, 2019 07:09 |
Packt currently have an everything for five bucks deal going (Packt usually provide the Humble Python books) if you want to cherry pick. Also hi, thread newcomer Some resources I often find useful that I didn't notice in the OP: Writing Idiomatic Python: https://jeffknupp.com/writing-idiomatic-python-ebook/ This guy is great, if you're poor just write to him and he'll give you a copy of the book for free. Nedbat's Big O presentation: https://nedbatchelder.com/text/bigo.html Nedbat's talk on Pragmatic Unicode: https://nedbatchelder.com/text/unipain.html The Clean Architecture in Python (also a nice watch for anyone interested in FP approaches in Python): https://www.youtube.com/watch?v=DJtef410XaM If you don't like configparser and/or YAML, consider TOML: https://github.com/toml-lang/toml I'll share more as I come across them in my bookmarks. I've worked with Python (3) for about five years now and I figured I'd see if there was anything goon related
|
|
# ? Jan 1, 2019 11:52 |
|
Bundy posted:Also hi, thread newcomer According to the forums search, I've posted in the current incarnation of this thread 242 times!
|
# ? Jan 1, 2019 21:54 |
|
|
# ? May 16, 2024 10:42 |
|
Bundy posted:Packt currently have an everything for five bucks deal going (Packt usually provide the Humble Python books) if you want to cherry pick. Also hi, thread newcomer Thanks for recommending this! I didn't know about this place (Packt), they have a ton of cheap bundles! QuarkJets fucked around with this message at 22:19 on Jan 1, 2019 |
# ? Jan 1, 2019 22:16 |