|
12 rats tied together posted:It depends, but the short answer is probably not. If you wanted it to be accessible on that network you'd want to run it bound to an IP address in that network. Do you know how I would check if PyCharm specifically binds to 127.0.0.1? (I don't want anyone to be able to access the website while I am working on it.)
|
# ¿ Mar 13, 2024 12:07 |
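One way to check is from the outside: a server bound to 127.0.0.1 accepts connections on the loopback interface only, so it is invisible to other machines on the network. The sketch below is a stand-in for whatever dev server PyCharm launches (the server and port here are made up): it binds a throwaway HTTP server to 127.0.0.1 and shows that only loopback can reach it. On the command line, `netstat -an` (or `ss -tln` on Linux) will likewise show the bound address next to the listening port.

```python
import http.server
import socket
import threading

# Throwaway server bound to loopback only, standing in for a dev server
# (port 0 means the OS picks a free port).
server = http.server.HTTPServer(
    ("127.0.0.1", 0), http.server.SimpleHTTPRequestHandler
)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Reachable via loopback...
socket.create_connection(("127.0.0.1", port), timeout=1).close()
print(f"reachable on 127.0.0.1:{port}")

# ...but a machine on the LAN would have to use this host's LAN address,
# and the server never bound to that interface, so that connection fails.
try:
    lan_ip = socket.gethostbyname(socket.gethostname())
    if not lan_ip.startswith("127."):
        socket.create_connection((lan_ip, port), timeout=1).close()
        print(f"also reachable on {lan_ip}:{port}")
except OSError:
    print("not reachable via the LAN address")

server.shutdown()
```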
|
|
# ¿ May 22, 2024 10:30 |
|
The following search page https://www.dss.virginia.gov/facility/search/cc2.cgi can be used to find a list of all the licensed child day care centers in Virginia. If you click on a search result it takes you to the page for that child day care center (e.g., this is the page for the first result: https://www.dss.virginia.gov/facility/search/cc2.cgi?rm=Details;ID=35291;search_require_client_code-2101=1), and on that page is a table with a list of inspections and a 'Violations' column. If there was a violation, the 'Yes' is a link to a page with a more in-depth description of the violation, including the standard # that was violated (e.g., https://www.dss.virginia.gov/facility/search/cc2.cgi?rm=Inspection;Inspection=141676;ID=35291;search_require_client_code-2101=1;#Violations). I have a list of 5 standards, and I would like to get the description text from all centers which had a violation of any of the 5 standards in 2023 or 2024. Is this something I can automate with Python? If so, a general plan of attack would be appreciated.
|
# ¿ Apr 2, 2024 14:53 |
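A workable plan of attack, sketched under the assumption that the URL patterns in the example links above hold for every center: (1) walk the search results to collect center IDs, (2) fetch each center's Details page and pull the inspection IDs whose Violations column says 'Yes' and whose date falls in 2023 or 2024, (3) fetch each of those inspection pages and keep the ones citing your 5 standards. The helper below only builds the URLs from the IDs; the fetching and parsing are discussed in the later posts.

```python
# The query-string patterns below are copied from the example links in the
# post; whether they hold for every center is an assumption.
BASE = "https://www.dss.virginia.gov/facility/search/cc2.cgi"

def details_url(center_id: int) -> str:
    # Center page: holds the inspections table with the 'Violations' column.
    return f"{BASE}?rm=Details;ID={center_id};search_require_client_code-2101=1"

def inspection_url(center_id: int, inspection_id: int) -> str:
    # Violation page: full description plus the standard # that was violated.
    return (
        f"{BASE}?rm=Inspection;Inspection={inspection_id};"
        f"ID={center_id};search_require_client_code-2101=1;#Violations"
    )

print(details_url(35291))
print(inspection_url(35291, 141676))
```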
|
Fender posted:Totally, but with ~2700 day cares, it'll be slooooow. You can use requests to get your data much quicker, then dump it into whatever scraper engine you want to parse it. Great, thank you! I will look into this and post back if I have issues.
|
# ¿ Apr 2, 2024 20:14 |
|
Fender posted:Good luck! And just because I am one of those freshly unemployed devs with a shoddy portfolio, I took this even further and tossed it up on Github: https://github.com/mcleeder/virginia_dss_scraper. I hope you don't mind. I do not mind at all; in fact, I really appreciate it. While I do not fully understand how the async stuff works, I can follow the logic of your code. I wrote a simple function with BeautifulSoup (I could not easily understand the lxml documentation) that goes through the responses, and if the violation is of one of the standards I care about, it pulls out the description and plan of correction. I would not have been able to get this far this quickly without your help, so thank you very much!
|
# ¿ Apr 3, 2024 03:35 |
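A function along those lines might look like the sketch below. The HTML and the standard numbers here are invented (the real inspection pages' markup, column order, and standard numbering will differ, so the selectors are assumptions); the shape of the logic is the point.

```python
from bs4 import BeautifulSoup

# Invented markup standing in for an inspection page's Violations section.
# The real pages' structure is an assumption here, not the site's actual HTML.
SAMPLE = """
<div id="Violations">
  <table>
    <tr><th>Standard #</th><th>Description</th><th>Plan of Correction</th></tr>
    <tr><td>8VAC20-780-60</td><td>Staff file incomplete.</td><td>File updated.</td></tr>
    <tr><td>8VAC20-780-340</td><td>Supervision ratio exceeded.</td><td>Additional staff scheduled.</td></tr>
  </table>
</div>
"""

# Placeholder standard numbers; substitute the 5 you actually care about.
STANDARDS_OF_INTEREST = {"8VAC20-780-340"}

def extract_violations(html: str, standards: set[str]) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    hits = []
    for row in soup.select("#Violations table tr")[1:]:  # [1:] skips the header row
        standard, description, plan = (
            td.get_text(strip=True) for td in row.find_all("td")
        )
        if standard in standards:
            hits.append({"standard": standard,
                         "description": description,
                         "plan_of_correction": plan})
    return hits

print(extract_violations(SAMPLE, STANDARDS_OF_INTEREST))
```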
|
I am trying to understand the async portion of the code provided, in particular the following function: Python code:
|
# ¿ Apr 4, 2024 03:39 |
|
QuarkJets posted:That's right, the responses will have the same order as the list of tasks provided to gather() even if the tasks happen to execute out of order. From the documentation, "If all awaitables are completed successfully, the result is an aggregate list of returned values. The order of result values corresponds to the order of awaitables." Great. I saw that in the documentation and thought that was what it meant, but I wanted to be sure. Another related question: I have never built a scraper before, but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 URLs with violations). Is the aiohttp stuff 'clever' enough not to make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website when I call the fetch_urls function with a list of 12,000 URLs? Finally, sometimes the response which is returned is null (when I save it as a JSON file). Does this just indicate that the fetch_url function ran out of retries?
|
# ¿ Apr 4, 2024 21:35 |
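All three points can be shown in one self-contained sketch. The fetch_url/fetch_urls below are hypothetical stand-ins for the scraper's real ones (which use aiohttp and the network; here the request is simulated with a sleep): gather() returns results in input order regardless of completion order, a Semaphore caps how many coroutines run at once, and a fetch that exhausts its retries returns None, which serializes to null in JSON.

```python
import asyncio
import random

async def fetch_url(url, sem, retries=3):
    # Hypothetical stand-in for an aiohttp request with retries.
    async with sem:  # the semaphore limits how many run concurrently
        for _ in range(retries):
            await asyncio.sleep(random.uniform(0, 0.01))  # simulated latency
            if "bad" not in url:  # simulate a request that succeeds
                return f"body of {url}"
        return None  # retries exhausted -> this becomes null in the JSON dump

async def fetch_urls(urls, max_in_flight=3):
    sem = asyncio.Semaphore(max_in_flight)
    # gather() preserves input order even when tasks finish out of order.
    return await asyncio.gather(*(fetch_url(u, sem) for u in urls))

urls = [f"https://example.com/{i}" for i in range(5)] + ["https://example.com/bad"]
results = asyncio.run(fetch_urls(urls))
print(results[0])   # body of https://example.com/0 -- order matches input
print(results[-1])  # None -- the failing URL ran out of retries
```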
|
Fender posted:For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work: Thank you! I am saving the center IDs and inspection IDs that fail to get a response and plan to try them again.
|
# ¿ Apr 5, 2024 00:38 |
|
I have a SQLAlchemy model called Subject, and each Subject model has an attribute named 'heart_rate': Mapped[list['HeartRateValue']], where the HeartRateValue model stores the time_stamp and value of each heart rate value. I know if I have the ID of the subject I can use session.get(Subject, subject_id) to get the Subject object where the subject ID is subject_id. Is there a way with SQLAlchemy to then query the Subject object for the heart rate values which fall into a certain time interval (say start_date_time, end_date_time)?
|
# ¿ May 10, 2024 19:30 |
|
Jose Cuervo posted:I have a SQLAlchemy model called Subject, and each Subject model has an attribute named 'heart_rate': Mapped[list['HeartRateValue']]... Is there a way with SQLAlchemy to query the Subject object for the heart rate values which fall into a certain time interval? Figured out how to do this: Python code:
|
# ¿ May 12, 2024 19:57 |
|
|
BAD AT STUFF posted:Are you familiar with SQL? Python libraries with DataFrames (like pandas, polars, or pyspark) use a lot of SQL idioms. You don't need a for loop because a select statement applies to all of your rows. Do you have a recommended resource to learn some basics of SQL?
|
# ¿ May 21, 2024 23:46 |
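To make that point concrete, the pandas expression below (toy data, invented column name) filters every row at once, the way a SQL WHERE clause would, with no explicit loop:

```python
import pandas as pd

df = pd.DataFrame({"heart_rate": [58, 72, 95, 110]})

# SQL equivalent: SELECT * FROM df WHERE heart_rate > 90
# The comparison applies to the whole column at once; no for loop needed.
high = df[df["heart_rate"] > 90]
print(high["heart_rate"].tolist())  # [95, 110]
```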