|
CarForumPoster posted:Ya do this. It returns a list tho so be aware of that. Getting good at xpath is all kinds of useful in my current job. One thing I don't see mentioned a lot is that you can type xpath directly into the Chrome console and figure out exactly what you're selecting. No add-ons required. Type this in the Chrome Console: code:
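The code block from the original post didn't survive the thread, but the trick being described is Chrome's Console Utilities `$x()` helper, which evaluates an XPath expression against the current page and returns an Array of matching nodes (the selectors below are hypothetical):

```javascript
// Type these directly into the Chrome DevTools Console.
// $x() returns an Array, so index into it or iterate as usual.
$x("//a")                     // every <a> element on the page
$x("//a/@href")               // just their href attributes
$x("//div[@class='post']")    // elements matching a predicate
$x("//a")[0]                  // first match, since it's an Array
```

This only works in the DevTools console (it's a console utility, not part of the page's JavaScript), which is exactly why it's handy for interactively refining a selector before putting it in a scraper.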
|
# ¿ Apr 20, 2022 04:07 |
|
|
D34THROW posted:As an aside, from the more experienced day-to-day Python devs out there, how readable is my code that you've seen? I run black and flake8 and try to be as descriptive as I can; I'm not comfortable sharing the github quite yet but I'm sure I will at some point. This is probs more specific to my team/company, but we'd want more types explicitly stated. Most of your stuff in that example is pretty obvious, but something like the following could be an integer, but it could easily be a tuple (x,y) or a list (the plural makes it seem possible). The general rule here is that if you can't tell what something is on the line where it is declared, then you should add a type. You can still pretty easily suss out that it's an integer by reading on for a bit, but it's nicer for other devs to just have it stated explicitly. Python code:
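The snippet being discussed didn't survive the thread; a hypothetical reconstruction of the ambiguity, assuming a made-up `get_starting_coordinates` helper:

```python
def get_starting_coordinates():
    """Hypothetical helper; the name alone doesn't say what it returns."""
    return (0, 0)

# On the line where it's declared, this could be an int, a tuple (x, y),
# or a list -- a reader has to go find the helper to know for sure.
coordinates = get_starting_coordinates()

# With the type stated explicitly, other devs don't have to read ahead:
coordinates: tuple[int, int] = get_starting_coordinates()
```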
|
# ¿ Apr 29, 2022 02:25 |
|
samcarsten posted:Can anyone recommend a good tutorial on how to get Python to work with HTML? For my class's capstone project, we're making a self-hosted calendar app and we're using python. I need to figure out how to link buttons and text boxes to python variables. If you go with Django, I used to give this tutorial to students when I worked at a bootcamp https://tutorial.djangogirls.org/en/
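If the class ends up on Flask instead, a minimal sketch of wiring a text box to a Python variable (assuming Flask; the route and field names here are made up): the `<input name="...">` attribute is what ties the form field to a key in `request.form` on the Python side.

```python
from flask import Flask, request

app = Flask(__name__)

# A handwritten form; normally this would live in a template file.
FORM = """
<form method="post">
  <input type="text" name="event_name">
  <button type="submit">Add event</button>
</form>
"""

@app.route("/", methods=["GET", "POST"])
def calendar_home():
    if request.method == "POST":
        event_name = request.form["event_name"]  # text box -> Python variable
        return f"Added: {event_name}"
    return FORM
```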
|
# ¿ Feb 17, 2023 07:38 |
|
Falcon2001 posted:you can also just dump functions into a file and access them without even the sort of 'static classmethod' approach that Java/C# uses. Protip, purely for readability, my last position used a singleton rather than static methods. We had a ton of bespoke code that was constantly being fixed/updated and a lot of hands in the cookie jar reading unfamiliar code. So instead of just seeing methods and needing to check "and where is this guy coming from" in your IDE it's much nicer to be able to read it. You can get the same effect from imports I guess, but this enforced it for us. code:
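The code block from this post is gone, but a sketch of the pattern described (the module and method names are hypothetical): group the functions as methods on a class, instantiate it once at the bottom of the module, and have everyone import the instance.

```python
class _StringHelpers:
    """Grouped helpers; the class mostly exists so call sites read well."""

    def normalize(self, name: str) -> str:
        return name.strip().lower()

    def label(self, name: str) -> str:
        return f"report:{self.normalize(name)}"

# Instantiated once at the bottom of the module. Callers do
#   from string_helpers import helpers
# and every call site reads as helpers.normalize(...), so it's
# immediately obvious where the method is coming from.
helpers = _StringHelpers()
```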
Fender fucked around with this message at 22:29 on Feb 3, 2024 |
# ¿ Feb 3, 2024 22:27 |
|
QuarkJets posted:Do you mean module imports? E.g. You know, as I was typing that out I was thinking about how pointlessly opinionated it sounded. I was mostly just piping up because it's how we blended our functional programming desires with our older OOP code. And for some reason it was universally loved. Just put all your methods in a class and instantiate it at the end of the module and then import that. Not a perfect singleton, but it was essentially the same thing for this. You can make it purely functional, or, because it's versatile, you can refactor older code into one of these and it can keep doing whatever OOP nonsense it might still be required to do, or maybe you have to pass it around somewhere. Whatever. But if you stick to the pattern you can do whatever wild crap you have to do in these supporting functional modules and it always looks the same and acts the same. Nothing earth shattering, but mildly topical and I (like everyone else there) really liked it so I preach. Fender fucked around with this message at 01:13 on Feb 4, 2024
# ¿ Feb 4, 2024 01:09 |
|
Oysters Autobio posted:Just getting into learning flask and my only real question is how people generally start on their HTML templates. Are folks just handwriting these? Or am I missing something here with flask? Chat-GPT will also happily spit out some Flask boilerplate templates for you. I've never tried to get it to do anything more complex than laying out some really basic crud stuff, but it does ok-ish.
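For what it's worth, Flask's template layer is just Jinja2, so a first handwritten template can be a very small HTML skeleton with placeholders. A minimal sketch using Jinja2 directly (the template content is made up):

```python
from jinja2 import Template

# templates/index.html in a Flask app looks much like this;
# {{ ... }} interpolates values and {% ... %} is control flow.
page = Template(
    "<html><body>"
    "<h1>{{ title }}</h1>"
    "<ul>{% for item in items %}<li>{{ item }}</li>{% endfor %}</ul>"
    "</body></html>"
)

html = page.render(title="My Events", items=["Standup", "Demo"])
```

In an actual Flask app you'd save the HTML under `templates/` and call `render_template("index.html", title=..., items=...)` instead of constructing a `Template` by hand.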
|
# ¿ Mar 20, 2024 20:44 |
|
The Fool posted:Yes, use playwright Totally, but with ~2700 day cares, it'll be slooooow. You can use requests to get your data much quicker, then dump it into whatever scraper engine you want to parse it. The site looks like it does all the pagination on the front-end, so the html actually has all ~2700 results in it. I haven't gone further to double-check, but the last item in the search is "Your Child's Place" and that does appear at the bottom of the html. So make the following request, take the response.text and dump it into whatever parser and then you can get all the urls you need to make ~3000 more requests to check the details on each one. Edit: I was curious, so I did a little bit more coding for parsing. The final parsed_urls list here has a length of 2691, which is the same as the search results. So you're left with a url that'll take you to the details for every daycare in Virginia. From there you do more of the same, looking for a value and if you see it, get the url and go look at it. Now you can do things like async requests and can speed up this whole process. I included my own scraper method since I find it's cleaner than bringing in an entire thing like BeautifulSoup when you just need to select some elements. Python code:
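The original code block didn't make it into the thread; here's a sketch of the kind of lxml convenience method being described (the sample HTML and XPath are hypothetical stand-ins for the real site):

```python
from lxml import html

def select(page_source: str, xpath: str) -> list:
    """Parse raw HTML once and return whatever the XPath matches."""
    return html.fromstring(page_source).xpath(xpath)

# With requests you'd fetch the page first, e.g.:
#   response = requests.get(search_url)      # hypothetical URL
#   parsed_urls = select(response.text, "//a[@class='result']/@href")

sample = (
    '<div><a class="result" href="/facility/1">A Child Care</a>'
    '<a class="result" href="/facility/2">Your Child\'s Place</a></div>'
)
parsed_urls = select(sample, "//a[@class='result']/@href")
# parsed_urls is a plain Python list of href strings to request next
```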
Fender fucked around with this message at 19:43 on Apr 2, 2024 |
# ¿ Apr 2, 2024 17:46 |
|
Jose Cuervo posted:Great thank you! I will look into this and post back if I have issues. Good luck! And just because I am one of those freshly unemployed devs with a shoddy portfolio, I took this even further and tossed it up on Github: https://github.com/mcleeder/virginia_dss_scraper. I hope you don't mind. I added the async code and enough parsing to winnow your results down to a list of urls for any daycare inspection with a violation >=2022. I noticed that the site drops connections fairly regularly, so I added some retry logic. For my connection, that resolved the issue and I'm able to get all 2691 daycares back. YMMV. From this point in the code, you have a list of urls for any inspection that resulted in a violation. You can use the same fetch_urls() method to go gather all of those up as well, then parse them looking for the codes you care about.
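A sketch of the retry idea (not the repo's actual code; the names, retry counts, and the fake fetcher are made up), using only asyncio so it stands alone:

```python
import asyncio

async def fetch_with_retry(fetch, url, retries=3, delay=0.01):
    """Await fetch(url), retrying when the connection is dropped."""
    for attempt in range(retries):
        try:
            return await fetch(url)
        except OSError:  # ConnectionError and friends
            if attempt == retries - 1:
                raise
            await asyncio.sleep(delay)

# Demo with a fake fetcher that drops the first two connections:
attempts = 0

async def flaky_fetch(url):
    global attempts
    attempts += 1
    if attempts < 3:
        raise ConnectionError("connection dropped")
    return f"<html>{url}</html>"

result = asyncio.run(fetch_with_retry(flaky_fetch, "/daycare/1"))
```

In the real scraper the `fetch` argument would be a coroutine doing an aiohttp request; keeping the retry loop separate means the same wrapper works for the search page, the detail pages, and the inspection pages.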
|
# ¿ Apr 2, 2024 23:31 |
|
Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land. We had a product that was an API mostly running python scrapers in the backend. I don't know if it was ever explained to us why we used lxml. BeautifulSoup uses lxml as its parser by default when it's installed, so I think we just cut out the middleman. I always assumed it was just an attempt at resource savings at a large scale. Two years of that and I'm a super good scraper and I can get a lot done with just a few convenience methods for lxml and some xpath. And I have no idea how to use BeautifulSoup.
|
# ¿ Apr 4, 2024 18:44 |
|
|
# ¿ Apr 28, 2024 23:24 |
|
Jose Cuervo posted:Great. I saw that in the documentation and thought that is what it meant but I wanted to be sure. For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work: Python code:
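The code block from this post is missing; a sketch of what adjusting the limit looks like, assuming aiohttp's documented `TCPConnector` (the URL list is a hypothetical placeholder):

```python
import asyncio
import aiohttp

async def fetch_all(urls):
    # TCPConnector's limit defaults to 100 simultaneous connections;
    # pass a smaller limit to be gentler on a site that drops connections.
    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        async def fetch(url):
            async with session.get(url) as resp:
                return await resp.text()
        return await asyncio.gather(*(fetch(u) for u in urls))

# pages = asyncio.run(fetch_all(list_of_detail_urls))  # hypothetical list
```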
|
# ¿ Apr 4, 2024 22:50 |