Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

CarForumPoster posted:

Yeah, do this. It returns a list though, so be aware of that.

When I'm scraping HTML tables and need fairly fast code, though, I'll use lxml's etree and for-loop over it, grabbing whatever I want. E.g.

for tr in trs: where trs is a list of tr HTML elements I got using their xpath. Then get the nested tds from each tr and for-loop over them again. This is very slow in Selenium but fast in lxml.

Also this page is super helpful for understanding xpaths https://www.guru99.com/xpath-selenium.html

Getting good at xpath is all kinds of useful in my current job. One thing I don't see mentioned a lot is that you can type xpath directly into the Chrome console and figure out exactly what you're selecting. No add-ons required.

Type this in the Chrome Console:
code:
$x("//div[@id='postpreview']")


Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

D34THROW posted:

As an aside, for the more experienced day-to-day Python devs out there, how readable is the code of mine that you've seen? I run black and flake8 and try to be as descriptive as I can; I'm not comfortable sharing the GitHub quite yet, but I'm sure I will at some point.

This is probs more specific to my team/company, but we'd want more types explicitly stated. Most of your stuff in that example is pretty obvious, but something like the following could be an integer, or it could just as easily be a tuple (x, y) or a list (the plural makes it seem possible). The general rule here is that if you can't tell what something is on the line where it's declared, then you should add a type. You can still pretty easily suss out that it's an integer by reading on for a bit, but it's nicer for other devs to have it stated explicitly.

Python code:
term_size = get_terminal_size().columns
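
Concretely, that's just adding the annotation (a sketch, assuming this is the usual os.get_terminal_size):

Python code:
from os import get_terminal_size

# the annotation is the only change: a reader knows it's an int at a glance
term_size: int = get_terminal_size().columns
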
We're hyper-aware of this sort of stuff because we write a lot of fragile/tightly-coupled integration code that breaks easily when other platforms change. So hotfixing, and thus reading someone else's code, is a constant duty. But it motivates us to write code that is as easy as possible for other devs to sort out as quickly as possible.

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

samcarsten posted:

Can anyone recommend a good tutorial on how to get Python to work with HTML? For my class's capstone project, we're making a self-hosted calendar app and we're using Python. I need to figure out how to link buttons and text boxes to Python variables.

If you go with Django, here's the tutorial I used to give to students when I worked at a bootcamp: https://tutorial.djangogirls.org/en/

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

Falcon2001 posted:

You can also just dump functions into a file and access them without even the sort of 'static classmethod' approach that Java/C# use.

Protip: purely for readability, my last position used a singleton rather than static methods. We had a ton of bespoke code that was constantly being fixed/updated and a lot of hands in the cookie jar reading unfamiliar code. Instead of just seeing bare methods and having to ask your IDE "where is this guy coming from?", it's much nicer to be able to read it right off the call site. You can get the same effect from imports, I guess, but this enforced it for us.

code:
from module import singleton as feature
feature.methodname()

Fender fucked around with this message at 22:29 on Feb 3, 2024

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

QuarkJets posted:

Do you mean module imports? E.g.

code:
import os
x = os.path.join("foo", "bar")
If you truly mean a singleton then I don't understand

You know, as I was typing that out I was thinking about how pointlessly opinionated it sounded. I was mostly just piping up because it's how we blended our functional programming desires with our older OOP code. And for some reason it was universally loved.

Just put all your methods in a class, instantiate it at the end of the module, and then import that instance. Not a perfect singleton, but it was essentially the same thing for our purposes. It's versatile: you can keep it purely functional, you can refactor older code into one of these and let it keep doing whatever OOP nonsense it might still be required to do, or maybe you have to pass it around somewhere. Whatever. But if you stick to the pattern, you can do whatever wild crap you have to do in these supporting modules and it always looks the same and acts the same. Nothing earth-shattering, but mildly topical, and I (like everyone else there) really liked it, so I preach.
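
To make that concrete, a rough sketch of the pattern (names made up to line up with my earlier snippet); importers then do the from module import singleton as feature bit from the post above:

Python code:
# module.py
class _Feature:
    """A grab bag of related helpers; stateless today, but it can grow state later."""

    def methodname(self, value: str) -> str:
        return value.upper()

# instantiate once at the bottom of the module; everyone imports this instance
singleton = _Feature()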

Fender fucked around with this message at 01:13 on Feb 4, 2024

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

Oysters Autobio posted:

Just getting into learning Flask, and my only real question is how people generally start on their HTML templates. Are folks just handwriting these? Or am I missing something here with Flask?

Basically I'm looking for any good resources out there for HTML templates to use for scaffolding your initial templates. I tried searching for HTML templates but generally could only find paid products for entire websites, whereas I just want something like an HTML component library with existing HTML, e.g. here's a generic dashboard page, here's a web form, here's a basic website with a top navbar, etc.

Bonus points if it already has Jinja too, but even if these were just plain HTML it would be awesome.

ChatGPT will also happily spit out some Flask boilerplate templates for you. I've never tried to get it to do anything more complex than laying out some really basic CRUD stuff, but it does OK-ish.
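
For a sense of what that boilerplate looks like, here's a hand-written sketch (not actual ChatGPT output). render_template_string keeps the Jinja inline for the example; normally the template would live under templates/ and you'd use render_template:

Python code:
from flask import Flask, render_template_string

app = Flask(__name__)

# a bare-bones page: a navbar plus a loop over whatever you pass in
PAGE = """
<nav>My Calendar</nav>
<ul>
  {% for item in items %}<li>{{ item }}</li>{% endfor %}
</ul>
"""

@app.route("/")
def dashboard():
    return render_template_string(PAGE, items=["first thing", "second thing"])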

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

Totally, but with ~2700 daycares it'll be slooooow. You can use requests to get your data much faster, then dump it into whatever parsing engine you want.

The site looks like it does all the pagination on the front-end, so the HTML actually has all ~2700 results in it. I haven't gone further to double-check, but the last item in the search is "Your Child's Place" and that does appear at the bottom of the HTML. So make the following request, take the response.text, dump it into whatever parser, and then you have all the urls you need to make ~2700 more requests to check the details on each one.

Edit: I was curious, so I did a little bit more coding for the parsing. The final parsed_urls list here has a length of 2691, which matches the search results. So you're left with a url for the details page of every daycare in Virginia. From there you do more of the same: look for a value, and if you see it, grab the url and go look at it. Now you can do things like async requests to speed up the whole process.

I included my own little helper method since I find it cleaner than bringing in an entire library like BeautifulSoup when you just need to select some elements.

Python code:
import requests
from lxml import html
from lxml.html import HtmlElement

def elements(tree: HtmlElement, xpath: str) -> list[HtmlElement]:
    # tiny convenience wrapper; xpath() already returns a list of matching elements
    return list(tree.xpath(xpath))

data_url = "https://www.dss.virginia.gov/facility/search/cc2.cgi"

form_data = {
    "rm": "Search",
    "search_keywords_name": None,
    "search_exact_fips": None,
    "search_contains_zip": None,
    "search_require_client_code-2101": 1,
}

response = requests.post(url=data_url, data=form_data)

html_tree = html.fromstring(response.content)

raw_anchor_tags = elements(html_tree, "//a[contains(@href, 'code-2101')]")

parsed_urls = [f"https://www.dss.virginia.gov{x.attrib.get('href')}" for x in raw_anchor_tags]

Fender fucked around with this message at 19:43 on Apr 2, 2024

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

Jose Cuervo posted:

Great thank you! I will look into this and post back if I have issues.

Good luck! And just because I am one of those freshly unemployed devs with a shoddy portfolio, I took this even further and tossed it up on GitHub: https://github.com/mcleeder/virginia_dss_scraper. I hope you don't mind.

I added the async code and enough parsing to winnow your results down to a list of urls for any daycare inspection with a violation in 2022 or later. I noticed that the site drops connections fairly regularly, so I added some retry logic. For my connection, that resolved the issue and I'm able to get all 2691 daycares back. YMMV.

From this point in the code, you have a list of urls for any inspection that resulted in a violation. You can use the same fetch_urls() method to go gather all of those up as well, then parse them looking for the codes you care about.
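
If it helps to see the shape of it, the async fetching boils down to something like this (a simplified sketch, not the exact repo code; fetch_url/fetch_urls are the same names I used, but the details differ):

Python code:
import asyncio

import aiohttp

async def fetch_url(session: aiohttp.ClientSession, url: str) -> str:
    # one request; the real version wraps this in retry logic
    async with session.get(url) as response:
        return await response.text()

async def fetch_urls(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch_url(session, url) for url in urls))

# violation_pages = asyncio.run(fetch_urls(parsed_urls))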

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land.

We had a product that was an API mostly running Python scrapers in the backend. I don't know if it was ever explained to us why we used lxml. BeautifulSoup will use lxml as its parser when it's installed anyway, so I think we just cut out the middleman. I always assumed it was an attempt at resource savings at scale.

Two years of that and I'm a super good scraper and I can get a lot done with just a few convenience methods for lxml and some xpath. And I have no idea how to use BeautifulSoup.


Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

Jose Cuervo posted:

Great. I saw that in the documentation and thought that is what it meant but I wanted to be sure.

Another related question: I have never built a scraper before, but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 urls with violations). Is the aiohttp stuff 'clever' enough to not make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website if I call the fetch_urls function with a list of 12,000 urls?

Finally, sometimes the response that is returned is null (when I save it as a JSON file). Does this just indicate that the fetch_url function ran out of retries?

For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work:

Python code:
connector = aiohttp.TCPConnector(limit_per_host=10)
aiohttp.ClientSession(connector=connector)
Yes, the fetch_url method will return None if it fails after 3 retries. I noticed that each url has an id number for the daycare in the params, so you could log which daycares you didn't get a response for and follow up later. Just add something after the while loop; the code only gets there if all retries fail. You could also adjust the retry interval. I left it at 1 second, but a longer delay might help.
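
The retry loop itself is shaped roughly like this (a sketch of the idea, not the repo code verbatim):

Python code:
import asyncio

import aiohttp

async def fetch_url(session: aiohttp.ClientSession, url: str, retries: int = 3) -> str | None:
    attempt = 0
    while attempt < retries:
        try:
            async with session.get(url) as response:
                return await response.text()
        except aiohttp.ClientError:
            attempt += 1
            await asyncio.sleep(1)  # bump this if the site keeps dropping connections
    # only reached when every retry failed; the daycare id is in the url params
    print(f"no response for {url}")
    return None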
