Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Oysters Autobio posted:

Echoing this and expanding it even further: don't touch anything related to XML as a beginner.

While I'm sure it has some great features, I'm finding it's very much a "big boy" format and not beginner friendly. Maybe it's not the format itself so much as the ecosystem of Python tools for it, which is really dense.

Hell, just look at the name "lxml" as a package. Gonna throw out a dumb hot take that I literally put no thought into: Acronyms should be banned from package naming.
Yeah, the big issue is that XML was built as a markup language, not a language for representing data structures or configuration, which are the two things developers between 1995 and 2005 really liked to pretend it was good at.

If you can make guarantees about the documents you're loading, like "text will never contain other elements," then XML gets a lot easier to work with and enables much more straightforward APIs like Pydantic.
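(For instance, here's a minimal sketch of that idea; the <inspection> document and the Inspection model are made up for illustration, not anything from the thread. Once every element is guaranteed to hold nothing but text, flattening it into a Pydantic model takes a couple of lines.)

Python code:
import xml.etree.ElementTree as ET

from pydantic import BaseModel


class Inspection(BaseModel):  # hypothetical schema, just for the example
    center_id: int
    result: str


doc = "<inspection><center_id>42</center_id><result>pass</result></inspection>"
root = ET.fromstring(doc)

# Safe only because we assume each child element contains plain text
record = Inspection(**{child.tag: child.text for child in root})
print(record)  # center_id=42 result='pass'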


Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum
Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land.

We had a product that was an API mostly running Python scrapers in the backend. I don't know if it was ever explained to us why we used lxml. When lxml is installed, BeautifulSoup uses it as its parser by default, so I think we just cut out the middleman. I always assumed it was an attempt at resource savings at scale.
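(For reference, this is just what parser selection looks like in BS4; a throwaway snippet, not code from that product:)

Python code:
from bs4 import BeautifulSoup

# If you omit the parser argument, BS4 picks the "best" parser installed,
# which is lxml whenever lxml is available.
soup = BeautifulSoup("<p>hello</p>", "lxml")
print(soup.p.text)  # hello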

Two years of that and I'm a super good scraper, and I can get a lot done with just a few convenience methods for lxml and some XPath. And I have no idea how to use BeautifulSoup.

Jose Cuervo
Aug 25, 2004

QuarkJets posted:

That's right, the responses will have the same order as the list of tasks provided to gather() even if the tasks happen to execute out of order. From the documentation, "If all awaitables are completed successfully, the result is an aggregate list of returned values. The order of result values corresponds to the order of awaitables."
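(A quick self-contained sketch of that guarantee, if you want to see it in action; the tasks below are deliberately arranged to finish out of order, but the results still come back in the order they were passed to gather():)

Python code:
import asyncio


async def work(i: int) -> int:
    await asyncio.sleep(0.01 * (5 - i))  # later tasks finish first
    return i


async def main() -> None:
    results = await asyncio.gather(*(work(i) for i in range(5)))
    print(results)  # [0, 1, 2, 3, 4] -- matches the order passed to gather()


asyncio.run(main())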

Great. I saw that in the documentation and thought that was what it meant, but I wanted to be sure.

Another related question: I have never built a scraper before, but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 URLs with violations). Is the aiohttp stuff 'clever' enough not to make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website when I call the fetch_urls function with a list of 12,000 URLs?

Finally, sometimes the response that is returned is null (when I save it as a JSON file). Does this just indicate that the fetch_url function ran out of retries?

Fender
Oct 9, 2000
Mechanical Bunny Rabbits!
Dinosaur Gum

Jose Cuervo posted:

Great. I saw that in the documentation and thought that was what it meant, but I wanted to be sure.

Another related question: I have never built a scraper before, but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 URLs with violations). Is the aiohttp stuff 'clever' enough not to make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website when I call the fetch_urls function with a list of 12,000 URLs?

Finally, sometimes the response that is returned is null (when I save it as a JSON file). Does this just indicate that the fetch_url function ran out of retries?

For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work:

Python code:
import aiohttp

connector = aiohttp.TCPConnector(limit_per_host=10)  # default connector limit is 100 connections
session = aiohttp.ClientSession(connector=connector)
Yes, the fetch_url method will return None if it fails after 3 retries. I noticed that each URL has an ID number for the daycare in the params, so you could log which daycares you didn't get a response for and follow up later. Just add something after the while loop; the code only gets there if all the retries fail. You could also adjust the retry interval. I left it at 1 second, but a longer delay might help.
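(The actual fetch_url from earlier in the thread isn't shown here, so this is just a guess at its rough shape, with the failure logging bolted on after the while loop; the "daycare_id" param name is made up:)

Python code:
import asyncio

import aiohttp

failed_ids = []  # daycare ids that never got a response, for a follow-up pass


async def fetch_url(session: aiohttp.ClientSession, url: str, params: dict,
                    retries: int = 3, delay: float = 1.0):
    attempt = 0
    while attempt < retries:
        try:
            async with session.get(url, params=params) as resp:
                resp.raise_for_status()
                return await resp.json()
        except aiohttp.ClientError:
            attempt += 1
            await asyncio.sleep(delay)  # bump this up if the site struggles
    # Only reached when every retry failed
    failed_ids.append(params.get("daycare_id"))  # hypothetical param name
    return None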

Jose Cuervo
Aug 25, 2004

Fender posted:

For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work:

Python code:
import aiohttp

connector = aiohttp.TCPConnector(limit_per_host=10)  # default connector limit is 100 connections
session = aiohttp.ClientSession(connector=connector)
Yes, the fetch_url method will return None if it fails after 3 retries. I noticed that each URL has an ID number for the daycare in the params, so you could log which daycares you didn't get a response for and follow up later. Just add something after the while loop; the code only gets there if all the retries fail. You could also adjust the retry interval. I left it at 1 second, but a longer delay might help.

Thank you! I am saving the center ID and inspection ID for each request that fails to get a response, and I plan to try them again.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Fender posted:

Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land.

We had a product that was an API mostly running Python scrapers in the backend. I don't know if it was ever explained to us why we used lxml. When lxml is installed, BeautifulSoup uses it as its parser by default, so I think we just cut out the middleman. I always assumed it was an attempt at resource savings at scale.

Two years of that and I'm a super good scraper, and I can get a lot done with just a few convenience methods for lxml and some XPath. And I have no idea how to use BeautifulSoup.

I use lxml when I need to iterate over huge lists of scraped data via XPath. It seems to be the fastest and it ain't that hard. Selenium is slow at finding elements via XPath once you start needing to find hundreds of individual elements.

Also, if you're using Selenium, lxml code can kinda look similar.

I spent multiple years writing and maintaining web scrapers and basically never used BS4.
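(Something in this spirit; the HTML and the XPaths below are made up purely to show the shape of lxml + XPath iteration:)

Python code:
from lxml import html

# Hypothetical scraped page; the table structure and class names are invented
PAGE = """
<table id="violations">
  <tr><td class="center">Sunny Days</td><td class="code">V-101</td></tr>
  <tr><td class="center">Little Sprouts</td><td class="code">V-202</td></tr>
</table>
"""

tree = html.fromstring(PAGE)
for row in tree.xpath('//table[@id="violations"]/tr'):
    center = row.xpath('./td[@class="center"]/text()')[0]
    code = row.xpath('./td[@class="code"]/text()')[0]
    print(center, code)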


rich thick and creamy
May 23, 2005

To whip it, Whip it good
Pillbug
Has anyone played around with Rye yet? I just found it yesterday and am giving it a spin. So far it seems like a pretty nice Poetry alternative.

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
Fun little learning project I want to do, but I need some direction. I want to extract all the video transcripts from a particular YouTube channel and make them both keyword and semantically searchable, returning the relevant video timestamps.

I've got the scraping/extraction part working. Each video transcript is returned as a list of dictionaries, where each dictionary contains the timestamp and a (roughly) sentence's worth of text:

code:
{
    'text': 'replace the whole thing anyways right so',
    'start': 1331.08,
    'duration': 4.28
}

I don't really know how YT breaks up the text, but I don't think it really matters. Anyway, I obviously don't want to re-extract the transcripts every time, so I need to store everything in some kind of database -- and in a manner amenable to reasonably speedy keyword searching. If we call this checkpoint 1, I don't have a good sense of what this solution would look like.

Next, I want to make the corpus of text (is that the right term?) semantically searchable. This part is even foggier. Do I train my own LLM from scratch? Do some kind of transfer learning thing (i.e., take an existing model and provide my text as additional training data)? Can I just point ChatGPT at it (lol)?

I want to eventually wrap it in a web UI, but I can handle that part. Thanks goons! This will be a neat project.



PierreTheMime
Dec 9, 2004

Hero of hormagaunts everywhere!
Buglord

Cyril Sneer posted:

Fun little learning project I want to do, but I need some direction. I want to extract all the video transcripts from a particular YouTube channel and make them both keyword and semantically searchable, returning the relevant video timestamps.

I've got the scraping/extraction part working. Each video transcript is returned as a list of dictionaries, where each dictionary contains the timestamp and a (roughly) sentence's worth of text:

code:
{
    'text': 'replace the whole thing anyways right so',
    'start': 1331.08,
    'duration': 4.28
}

I don't really know how YT breaks up the text, but I don't think it really matters. Anyway, I obviously don't want to re-extract the transcripts every time, so I need to store everything in some kind of database -- and in a manner amenable to reasonably speedy keyword searching. If we call this checkpoint 1, I don't have a good sense of what this solution would look like.

Next, I want to make the corpus of text (is that the right term?) semantically searchable. This part is even foggier. Do I train my own LLM from scratch? Do some kind of transfer learning thing (i.e., take an existing model and provide my text as additional training data)? Can I just point ChatGPT at it (lol)?

I want to eventually wrap it in a web UI, but I can handle that part. Thanks goons! This will be a neat project.

This sounds like a good use case for a vector database and retrieval-augmented generation (RAG) and/or semantic search. You can use your dialog text as the target material and the rest as metadata you retrieve on a match. There are a number of free options for the database, including local ChromaDB instances (which use SQLite) or free-tier Pinecone.io, which has good library support and a decent web UI.
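(A minimal sketch of the ChromaDB route, reusing the transcript chunk format from above; the collection name, video id, and query string are made up, and Chroma's default embedding model is assumed:)

Python code:
import chromadb

client = chromadb.PersistentClient(path="./chroma_store")  # on-disk store
collection = client.get_or_create_collection("yt_transcripts")

# One transcript chunk, stored with its video id and timestamp as metadata
chunk = {'text': 'replace the whole thing anyways right so',
         'start': 1331.08, 'duration': 4.28}
collection.add(
    documents=[chunk['text']],
    metadatas=[{'video_id': 'abc123', 'start': chunk['start']}],
    ids=['abc123-1331.08'],
)

# Semantic query; returns the closest chunks plus their metadata (timestamps)
hits = collection.query(query_texts=["when do they replace the part?"], n_results=3)
print(hits['metadatas'])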
