Jose Cuervo
Aug 25, 2004

12 rats tied together posted:

it depends but short answer probably not. if you wanted it to be accessible on that network you'd want to run it bound to an IP address in that network.

if pycharm, specifically, binds to 127.0.0.1, it's not routable, since that means "local computer" and it's ~not possible for a packet to arrive at your computer with that destination address.

if pycharm binds to 0.0.0.0 you can still access it through 127.0.0.1, but it would also be accessible on other addresses, which means it could theoretically be routed to you, which means it could theoretically be accessed by another device in that network. (if nothing blocked it first, like a firewall)

Do you know how I would check if pycharm specifically binds to 127.0.0.1? (I don't want anyone to be able to access the website while I am working on it).
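
One way to check is to look at which address the run configuration's port is listening on. A minimal sketch using psutil (a third-party package, not mentioned in the thread; the port number below is just a placeholder for whatever PyCharm's run console shows, and on some systems this needs elevated privileges):

Python code:
import psutil  # pip install psutil

PORT = 8000  # placeholder: use the port your dev server reports on startup

# 127.0.0.1 means loopback only (not reachable from the network);
# 0.0.0.0 or :: means the server is listening on all interfaces.
for conn in psutil.net_connections(kind="inet"):
    if conn.status == psutil.CONN_LISTEN and conn.laddr.port == PORT:
        print(f"pid={conn.pid} listening on {conn.laddr.ip}:{conn.laddr.port}")

The same information is available from the command line with netstat -an (or ss -ltn on Linux): find the row for your port and check whether the local address column shows 127.0.0.1 or 0.0.0.0.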


Jose Cuervo
Aug 25, 2004
The following search page https://www.dss.virginia.gov/facility/search/cc2.cgi can be used to find a list of all the licensed child day care centers in Virginia. Clicking on a search result takes you to the page for that child day care (e.g., this is the page for the first result: https://www.dss.virginia.gov/facility/search/cc2.cgi?rm=Details;ID=35291;search_require_client_code-2101=1), and on that page is a table with a list of inspections and a 'Violations' column. If there was a violation, the 'Yes' is a link to a page with a more in-depth description of the violation, including the standard # that was violated (e.g., https://www.dss.virginia.gov/facility/search/cc2.cgi?rm=Inspection;Inspection=141676;ID=35291;search_require_client_code-2101=1;#Violations).

I have a list of 5 standards and I would like to get the description text from all centers which had a violation of any of the 5 standards in 2023 or 2024.

Is this something that I can automate with Python? If so, a general plan of attack would be appreciated.

Jose Cuervo
Aug 25, 2004

Fender posted:

Totally, but with ~2700 day cares, it'll be slooooow. You can use requests to get your data much quicker, then dump it into whatever scraper engine you want to parse it.

The site looks like it does all the pagination on the front-end, so the html actually has all ~2700 results in it. I haven't gone further to double-check, but the last item in the search is "Your Child's Place" and that does appear at the bottom of the html. So make the following request, take the response.text and dump it into whatever parser and then you can get all the urls you need to make ~3000 more requests to check the details on each one.

Edit: I was curious, so I did a little bit more coding for parsing. The final parsed_urls list here has a length of 2691, which is the same as the search results. So you're left with a url that'll take you to the details for every daycare in Virginia. From there you do more of the same, looking for a value and if you see it, get the url and go look at it. Now you can do things like async requests and can speed up this whole process.

I included my own scraper method since I find it's cleaner than bringing in an entire thing like BeautifulSoup when you just need to select some elements.

Python code:
import requests
from lxml import html
from lxml.html import HtmlElement

def elements(tree: HtmlElement, xpath: str) -> list[HtmlElement]:
    # Run an XPath query against the parsed tree and return the matching elements.
    return list(tree.xpath(xpath))

data_url = "https://www.dss.virginia.gov/facility/search/cc2.cgi"

form_data = {
    "rm" : "Search",
    "search_keywords_name" : None,
    "search_exact_fips": None,
    "search_contains_zip" : None,
    "search_require_client_code-2101" : 1,
}

response = requests.post(url = data_url, data=form_data)

html_tree = html.fromstring(response.content)

raw_anchor_tags = elements(html_tree, "//a[contains(@href, 'code-2101')]")

parsed_urls = [f"https://www.dss.virginia.gov{x.attrib.get('href')}" for x in raw_anchor_tags]

Great, thank you! I will look into this and post back if I have issues.

Jose Cuervo
Aug 25, 2004

Fender posted:

Good luck! And just because I am one of those freshly unemployed devs with a shoddy portfolio, I took this even further and tossed it up on Github: https://github.com/mcleeder/virginia_dss_scraper. I hope you don't mind.

I added the async code and enough parsing to winnow your results down to a list of urls for any daycare inspection with a violation >=2022. I noticed that the site drops connections fairly regularly, so I added some retry logic. For my connection, that resolved the issue and I'm able to get all 2691 daycares back. YMMV.

From this point in the code, you have a list of urls for any inspection that resulted in a violation. You can use the same fetch_urls() method to go gather all of those up as well, then parse them looking for the codes you care about.

I do not mind at all; in fact, I really appreciate it.

While I do not fully understand how the async stuff works, I can follow the logic of your code. I wrote a simple function with BeautifulSoup (I could not easily understand the documentation for lxml) that goes through the responses and, if the violation is of one of the standards I care about, pulls out the description and plan of correction.

I would not have been able to get this far this quickly without your help, so thank you very much!
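
For readers following along, a rough sketch of what such a BeautifulSoup pass might look like. The standard numbers, tag names, and class names below are purely illustrative, since the actual markup of the inspection pages is not shown in the thread:

Python code:
from bs4 import BeautifulSoup

STANDARDS = {"standard-1", "standard-2"}  # placeholders: the 5 standard numbers you care about

def extract_violations(html_text: str) -> list[dict]:
    # The selectors below are guesses and need to be adjusted to the real page markup.
    soup = BeautifulSoup(html_text, "html.parser")
    results = []
    for block in soup.find_all("div", class_="violation"):   # assumed container element
        standard = block.find("span", class_="standard")      # assumed standard-number element
        if standard is None or standard.get_text(strip=True) not in STANDARDS:
            continue
        description = block.find("p", class_="description")   # assumed description element
        plan = block.find("p", class_="plan-of-correction")   # assumed plan element
        results.append({
            "standard": standard.get_text(strip=True),
            "description": description.get_text(strip=True) if description else None,
            "plan_of_correction": plan.get_text(strip=True) if plan else None,
        })
    return results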

Jose Cuervo
Aug 25, 2004
I am trying to understand the async portion of the code provided, in particular, in the following function:
Python code:
async def fetch_urls(urls):
    tasks = []
    for url in urls:
        task = asyncio.create_task(fetch_url(url))
        tasks.append(task)

    responses = await asyncio.gather(*tasks)
    return responses
I believe that the calls to fetch_url() placed in the tasks list are not necessarily executed in the order they were placed in the list, but am I correct in saying that the responses list contains the responses in the same order as the tasks list?

Jose Cuervo
Aug 25, 2004

QuarkJets posted:

That's right, the responses will have the same order as the list of tasks provided to gather() even if the tasks happen to execute out of order. From the documentation, "If all awaitables are completed successfully, the result is an aggregate list of returned values. The order of result values corresponds to the order of awaitables."

Great. I saw that in the documentation and thought that was what it meant, but I wanted to be sure.

Another related question - I have never built a scraper before but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 urls with violations). Is the aiohttp stuff 'clever' enough to not make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website if I call the fetch_urls function with a list of 12,000 urls?

Finally, sometimes the response which is returned is None (null when I save it as a JSON file). Does this just indicate that the fetch_url function ran out of retries?
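
A tiny self-contained demo of the ordering guarantee discussed a couple of posts up (toy code, no real requests): the tasks finish in a scrambled order, but gather() returns the results in the order the tasks were created.

Python code:
import asyncio
import random

async def fake_fetch(url: str) -> str:
    # Random sleep so the tasks finish in a scrambled order.
    await asyncio.sleep(random.random())
    return f"response for {url}"

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(5)]
    tasks = [asyncio.create_task(fake_fetch(url)) for url in urls]
    responses = await asyncio.gather(*tasks)
    # Always prints the responses for pages 0..4 in that order,
    # no matter which task finished first.
    print(responses)

asyncio.run(main())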

Jose Cuervo
Aug 25, 2004

Fender posted:

For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work:

Python code:
connector = aiohttp.TCPConnector(limit_per_host=10)
aiohttp.ClientSession(connector=connector)
Yes, the fetch_url method will result in None if it fails after 3 retries. I noticed that each url has an id number for the daycare in the params, so you could log which daycares you didn't get a response for and follow up later. Just add something outside the while loop; the code only gets there if all retries fail. You could also adjust the retry interval. I left it at 1 second, but a longer delay might help.

Thank you! I am saving the center ID and inspection ID for the requests which fail to get a response, and I plan to try them again.
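
For reference, a minimal sketch of the retry-and-log pattern described above. The name fetch_url, the 3 retries, and the 1-second interval come from the discussion; the body (and the explicit session argument) is an assumption, not the code from the linked repo.

Python code:
import asyncio
import aiohttp

failed_ids: list[str] = []  # daycare/inspection IDs that never got a response

async def fetch_url(session: aiohttp.ClientSession, url: str) -> str | None:
    retries = 3
    while retries > 0:
        try:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()
        except aiohttp.ClientError:
            retries -= 1
            await asyncio.sleep(1)  # retry interval; a longer delay might help
    # Only reached if every retry failed: remember the ID from the url params
    # (assumes the "ID=..." format shown earlier in the thread).
    failed_ids.append(url.split("ID=")[-1].split(";")[0])
    return None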

Jose Cuervo
Aug 25, 2004
I have a SQLAlchemy model called Subject, and each Subject model has an attribute named 'heart_rate': Mapped[list['HeartRateValue']], where the HeartRateValue model stores the time_stamp and value of each heart rate value. I know if I have the ID of the subject I can use

session.get(Subject, subject_id)

to get the Subject object where the subject ID is subject_id. Is there a way with SQLAlchemy to then query the Subject object for the heart rate values which fall into a certain time interval (say start_date_time, end_date_time)?

Jose Cuervo
Aug 25, 2004

Jose Cuervo posted:

I have a SQLAlchemy model called Subject, and each Subject model has an attribute named 'heart_rate': Mapped[list['HeartRateValue']], where the HeartRateValue model stores the time_stamp and value of each heart rate value. I know if I have the ID of the subject I can use

session.get(Subject, subject_id)

to get the Subject object where the subject ID is subject_id. Is there a way with SQLAlchemy to then query the Subject object for the heart rate values which fall into a certain time interval (say start_date_time, end_date_time)?

Figured out how to do this:
Python code:
(
    session.get(Subject, subject_id)
    .hr_values
    .where(HeartRateValue.datetime >= start_dt)
    .where(HeartRateValue.datetime
           < pd.to_datetime(start_dt) + pd.DateOffset(hours=tss_length_hrs))
    .all()
)
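
An alternative sketch that queries the HeartRateValue table directly instead of chaining off the Subject relationship. It assumes HeartRateValue has a subject_id foreign-key column and that end_dt has already been computed, neither of which is shown above:

Python code:
from sqlalchemy import select

stmt = (
    select(HeartRateValue)
    .where(HeartRateValue.subject_id == subject_id)  # assumed foreign-key column name
    .where(HeartRateValue.datetime >= start_dt)
    .where(HeartRateValue.datetime < end_dt)
)
hr_values = session.scalars(stmt).all()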


Jose Cuervo
Aug 25, 2004

BAD AT STUFF posted:

Are you familiar with SQL? Python libraries with DataFrames (like pandas, polars, or pyspark) use a lot of SQL idioms. You don't need a for loop because a select statement applies to all of your rows.

Python code:
df.select(pl.concat_str('month', 'year').cast(pl.UInt32) * pl.col('quantity'))

Do you have a recommended resource to learn some basics of SQL?
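
To make the quoted idea concrete, here is a toy example with made-up values (the column names come from the snippet above); the comment shows roughly the SQL statement the expression corresponds to:

Python code:
import polars as pl

# Toy data; the values are made up purely for illustration.
df = pl.DataFrame({
    "month": ["01", "02"],
    "year": ["2023", "2023"],
    "quantity": [3, 5],
})

# Roughly: SELECT CAST(month || year AS INT) * quantity FROM df;
# no explicit loop, the expression is applied to every row at once.
result = df.select(
    (pl.concat_str("month", "year").cast(pl.UInt32) * pl.col("quantity")).alias("value")
)
print(result)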
