|
Oysters Autobio posted:Echoing this and expanding it even further: don't touch anything related to XML as a beginner. If you can make guarantees about the documents you're loading like "text will never contain other elements" then it gets a lot easier to work with and enables much more straightforward APIs like Pydantic
|
# ? Apr 4, 2024 17:47 |
|
|
# ? Jun 5, 2024 08:49 |
|
Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land. We had a product that was an API mostly running python scrapers in the backend. I don't know if it was ever explained to us why we used lxml. By default BeautifulSoup uses lxml as its parser, so I think we just cut out the middleman. I always assumed it was just an attempt at resource savings at a large scale. Two years of that and I'm a super good scraper and I can get a lot done with just a few convenience methods for lxml and some xpath. And I have no idea how to use BeautifulSoup.
|
# ? Apr 4, 2024 18:44 |
|
QuarkJets posted:That's right, the responses will have the same order as the list of tasks provided to gather() even if the tasks happen to execute out of order. From the documentation, "If all awaitables are completed successfully, the result is an aggregate list of returned values. The order of result values corresponds to the order of awaitables." Great. I saw that in the documentation and thought that is what it meant, but I wanted to be sure. Another related question - I have never built a scraper before, but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 urls with violations). Is the aiohttp stuff 'clever' enough not to make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website if I call the fetch_urls function with a list of 12,000 urls? Finally, sometimes the response returned is null (when I save it as a JSON file). Does this just indicate that the fetch_url function ran out of retries?
|
# ? Apr 4, 2024 21:35 |
|
Jose Cuervo posted:Great. I saw that in the documentation and thought that is what it meant but I wanted to be sure. For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work: Python code:
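The quoted code block didn't survive the archive. In aiohttp the usual knob is the connector passed to the session (e.g. `aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=100))`). Since the exact snippet is lost, here is a stdlib-only sketch of the same throttling idea using `asyncio.Semaphore`; the function names mirror the posts and the sleep stands in for the real HTTP request:

```python
import asyncio

async def fetch_url(url, sem):
    # In the real scraper this would be an aiohttp request inside the session;
    # the semaphore caps how many coroutines run this section at once.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the network call
        return f"response for {url}"

async def fetch_urls(urls, limit=100):
    sem = asyncio.Semaphore(limit)
    # gather() preserves the order of the awaitables it was given
    return await asyncio.gather(*(fetch_url(u, sem) for u in urls))

results = asyncio.run(
    fetch_urls([f"https://example.com/{i}" for i in range(5)], limit=2)
)
```

Either way, the 12,000 requests are queued rather than fired simultaneously, and the results still come back in input order.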
|
# ? Apr 4, 2024 22:50 |
|
Fender posted:For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work: Thank you! I am saving the center ID and inspection ID which fail to get a response and plan to try them again.
|
# ? Apr 5, 2024 00:38 |
|
Fender posted:Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land. I use lxml when I need to iterate over huge lists via XPath from scraped data. It seems to be the fastest and it ain't that hard. Selenium is slow at finding elements via XPath once you start needing to find hundreds of individual elements. Also, if you're using Selenium, lxml code can look kind of similar. I spent multiple years writing and maintaining web scrapers and basically never used BS4. CarForumPoster fucked around with this message at 10:13 on Apr 5, 2024 |
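For a concrete flavor of the iterate-over-elements pattern being described: lxml isn't guaranteed to be installed, so this sketch uses the stdlib `xml.etree.ElementTree`, which supports a small XPath subset; with `lxml.etree` you'd get the full XPath language and much better speed on huge documents. The toy table stands in for scraped data:

```python
import xml.etree.ElementTree as ET

# Toy document standing in for a scraped page fragment
doc = ET.fromstring("""
<table>
  <tr><td class="sym">AAPL</td><td>189.84</td></tr>
  <tr><td class="sym">MSFT</td><td>407.54</td></tr>
</table>
""")

# findall/find accept limited XPath; lxml would accept full expressions
rows = [
    (tr.find("td[@class='sym']").text, float(tr[1].text))
    for tr in doc.findall(".//tr")
]
```

The same loop in lxml looks nearly identical, which is why switching between the two feels cheap once you know one.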
# ? Apr 5, 2024 10:10 |
|
Has anyone played around with Rye yet? I just found it yesterday and am giving it a spin. So far it seems like a pretty nice Poetry alternative.
|
# ? Apr 13, 2024 02:19 |
|
Fun little learning project I want to do but need some direction. I want to extract all the video transcripts from a particular YouTube channel and make them both keyword and semantically searchable, returning the relevant video timestamps. I've got the scraping/extraction part working. Each video transcript is returned as a list of dictionaries, where each dictionary contains the timestamp and (roughly) a sentence's worth of text: code:
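The original snippet was lost in archiving; based on the description, the shape is something like the following (field names are assumed, the real payload may differ). A few lines also show how cheap the "checkpoint 1" keyword search is once the data is stored:

```python
# Hypothetical shape of one scraped transcript (field names assumed)
transcript = [
    {"start": 0.0, "duration": 4.2, "text": "welcome back to the channel"},
    {"start": 4.2, "duration": 3.1, "text": "today we are talking about pickling"},
]

def keyword_search(segments, term):
    """Return the timestamp of every segment containing the keyword."""
    return [seg["start"] for seg in segments
            if term.lower() in seg["text"].lower()]
```

Dumped into SQLite (one row per segment, indexed on the video id), this stays plenty fast for keyword lookups at single-channel scale.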
I don't really know how YT breaks up the text, but I don't think it really matters. Anyway, I obviously don't want to re-extract the transcripts every time, so I need to store everything in some kind of database -- and in a manner amenable to reasonably speedy keyword searching. If we call this checkpoint 1, I don't have a good sense of what this solution would look like. Next, I want to make the corpus of text (is that the right term?) semantically searchable. This part is even foggier. Do I train my own LLM from scratch? Do some kind of transfer learning thing (i.e., take an existing model and provide my text as additional training data)? Can I just point ChatGPT at it (lol)? I want to eventually wrap it in a web UI, but I can handle that part. Thanks goons! This will be a neat project. Cyril Sneer fucked around with this message at 03:46 on Apr 17, 2024 |
# ? Apr 17, 2024 03:42 |
|
Cyril Sneer posted:Fun little learning project I want to do but need some direction. I want to extract all the video transcripts from a particular youtube channel and make them both keyword and semantically searchable, returning the relevant video timestamps. This sounds like a good use case for a vector database and retrieval-augmented generation (RAG) and/or semantic search. You can use your dialog text as the target material and the rest as metadata you can retrieve on match. There are a number of free options for the database, including local ChromaDB instances (which use SQLite) or free-tier Pinecone.io, which has good library support and a decent web UI.
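To make the "target material plus metadata you retrieve on match" idea concrete without pulling in ChromaDB or Pinecone, here is a dependency-free sketch. The "embedding" is a toy bag-of-words counter; a real pipeline would swap in a sentence-embedding model and let the vector DB do the similarity search, but the store-text-with-metadata-and-return-metadata-on-match flow is the same:

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy "embedding": bag-of-words counts. A real setup would call a
    # sentence-embedding model and store the vectors in ChromaDB/Pinecone.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Each entry: the searchable text plus the metadata handed back on a match
docs = [
    {"text": "how to pickle a python object", "meta": {"video": "abc123", "ts": 12.0}},
    {"text": "baking sourdough bread at home", "meta": {"video": "def456", "ts": 90.5}},
]
index = [(embed(d["text"]), d["meta"]) for d in docs]

def search(query):
    q = embed(query)
    return max(index, key=lambda pair: cosine(q, pair[0]))[1]
```

No training from scratch needed: the off-the-shelf embedding model does the semantic part, and the database does nearest-neighbor retrieval.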
|
# ? Apr 17, 2024 11:49 |
|
I just wanted to say I appreciated this error message when I forgot to put in the '-r'. Bash code:
|
# ? Apr 19, 2024 17:49 |
|
I love the "consider using". It feels very much like "you don't have to go home, but you can't stay here"
|
# ? Apr 19, 2024 18:15 |
|
I have a case where I create two instances of an object via a big configuration dictionary. The difference between the two objects is a single, different value for one key. So, this works: code:
|
# ? Apr 24, 2024 20:07 |
|
You might want to do deepcopy instead of just dict() to copy, but that's all I'd change. EDIT: If you wanted to be real safe, you'd make them frozen too.
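The deepcopy point matters as soon as the configuration dict nests other dicts; `dict()` only copies the top level. A small demonstration (the config keys here are invented):

```python
from copy import deepcopy

# Hypothetical config; the nested dict is what makes dict() alone unsafe
base = {"name": "sensor-a", "limits": {"low": 0, "high": 10}}

shallow = dict(base)
shallow["limits"]["high"] = 99        # oops: this also mutates base["limits"]
shallow_leak = base["limits"]["high"]  # now 99

base["limits"]["high"] = 10           # reset for the deep-copy demo
deep = deepcopy(base)
deep["limits"]["high"] = 99           # base is untouched this time
```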
|
# ? Apr 24, 2024 20:20 |
|
If I want to take a base dict and change a value in one line, I'll usually use a spread operator if the structure and changes are simple. Python code:
Python code:
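The lost snippets presumably showed dict unpacking; the pattern looks like this (the config keys are invented for illustration):

```python
# "Spread" the base dict into a new one, then override a single key
base = {"host": "localhost", "port": 5432, "name": "primary"}
replica = {**base, "name": "replica"}
```

Note this is a shallow copy, same as `dict(base)`: nested structures are still shared between the two.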
boofhead fucked around with this message at 21:00 on Apr 24, 2024 |
# ? Apr 24, 2024 20:53 |
|
boofhead posted:If I want to take a base dict and change a value in one line, I'll usually use a spread operator if the structure and changes are simple Python code:
|
# ? Apr 24, 2024 21:32 |
|
If your things are dataclasses you can neatly dictionary-ize one while creating the other. Python code:
Python code:
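The stdlib helpers `dataclasses.replace` and `dataclasses.asdict` cover both halves of this; the class below is invented to show the shape:

```python
from dataclasses import dataclass, asdict, replace

@dataclass(frozen=True)
class Config:
    name: str
    retries: int = 3

a = Config(name="first")
b = replace(a, name="second")   # new instance differing in exactly one field
d = asdict(a)                   # dictionary-ize the original
```

`frozen=True` gives you the "real safe" immutability mentioned upthread for free.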
|
# ? Apr 26, 2024 08:32 |
Wait a minute, I was just using dataclasses for something similar and was going to mention something annoying about them, which is: Python code:
|
|
# ? Apr 26, 2024 10:25 |
|
Data classes rule. Use them everywhere.
|
# ? Apr 26, 2024 20:29 |
|
Falcon2001 posted:Data classes rule. Use them everywhere. I’ve met like three functions that should be a class.
|
# ? Apr 26, 2024 20:35 |
|
Data Graham posted:Wait a minute, I was just using dataclasses for something similar and was going to mention something annoying about them, which is You can assign defaults in the dataclass definition, see my example
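The defaults-in-the-definition approach looks like this (class and field names invented); note that mutable defaults like lists need `field(default_factory=...)` rather than a bare `[]`:

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    priority: int = 0                          # plain default in the definition
    tags: list = field(default_factory=list)   # mutable defaults need a factory

j = Job(name="cleanup")  # only the required field has to be supplied
```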
|
# ? Apr 26, 2024 22:42 |
Oh nice, I should have figured. Thanks! e: wow I have no idea why it never even occurred to me to try that. Some days my stupidity leaves me breathless Data Graham fucked around with this message at 03:06 on Apr 27, 2024 |
|
# ? Apr 26, 2024 23:04 |
|
I'm hoping that pyspark is python-adjacent enough to be appropriate for this thread (since the Data Engineering thread doesn't seem to be responding). I'm applying for a job and Spark is one of the required skills, but I'm fairly rusty. They sprang an assignment on me where they wanted me to take the MovieLens dataset and calculate:
- The most common tag for a movie title, and
- The most common genre rated by a user
After lots of time on Stack Overflow and YouTube, this is the script that I came up with. At first, I had something much simpler that just did the assigned task, but I figured that I would also add commenting, error checking, and unit testing because rumor has it that this is what professionals actually do. I've tested it and know it works, but I'm wondering if it's a bit overboard? Feel free to roast. code:
|
# ? Apr 26, 2024 23:19 |
|
Seventh Arrow posted:At first, I had something much simpler that just did the assigned task, but I figured that I would also add commenting, error checking, and unit testing because rumor has it that this is what professionals actually do. I've tested it and know it works but I'm wondering if it's a bit overboard? Feel free to roast. I'll preface this with an acknowledgement that I'm not a pyspark toucher so I'm not going to really focus on that. From the point of view of someone who is reviewing your submission, I'm very happy to see comments, error checking, and unit tests! They give me additional insight into how you communicate information about your code and how you go about validating your designs. However, if the assignment was supposed to take you 4 hours and you turn in something that looks like it's had 40 put into it, that isn't necessarily a plus. Since you're offering it up for a roast, here are some things to consider:
It's clear that you've seen good code before and have some idea of what it should look like when trying to write it for yourself, but you're missing fundamentals and experience that will allow you to actually write it. To be clear, this is a fine place to be for a junior. If this is for a junior role, submitted as-is it's a hire from me, but there's a lot of headroom for another junior to impress me over this submission.
|
# ? Apr 27, 2024 00:36 |
|
I don't like how the function code is all wrapped in large try/except blocks that raise new errors. That's not really error checking, it's error obfuscation; yes, you print the caught exception object, but it's hiding the stack trace for no good reason. If you absolutely felt like you had to add more context to exceptions, like if you wanted to use logging or send a notification to someone, then you could use exception chaining to raise the original exception. Python code:
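The chaining pattern being described looks like this (the function and message are illustrative); `raise ... from err` keeps the original traceback attached as `__cause__` instead of swallowing it:

```python
import logging

def load_config(path):
    try:
        with open(path) as f:
            return f.read()
    except OSError as err:
        # Add context (log, notify, etc.) without hiding the stack trace:
        # `from err` attaches the original exception as __cause__
        logging.error("could not load config from %s", path)
        raise RuntimeError(f"config load failed: {path}") from err
```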
Don't run pip install commands in your notebook code. A comment that describes what's needed is fine.
|
# ? Apr 27, 2024 03:15 |
|
Thanks guys, much appreciated! I will look into your suggestions.
|
# ? Apr 27, 2024 06:03 |
|
Any recommendations for places to start with interfacing DLLs and Python? I'm interested in playing around with this and trying to make some file-parsing stuff faster by reading binary data in C, then passing it to Python for the higher-level stuff. I've used ctypes before, but I'm under the impression DLL stuff has changed in the last 4 versions or so, and I'm worried searching will suggest bad habits.
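For reference, the core ctypes pattern hasn't changed much: load the library, declare argument and return types, call it. A minimal sketch against the C math library (for your own DLL you'd pass its path to `CDLL` instead):

```python
import ctypes
import ctypes.util

# Load the C math library; for a custom DLL: ctypes.CDLL("path/to/parser.dll")
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.sqrt.restype = ctypes.c_double      # declare the return type...
libm.sqrt.argtypes = [ctypes.c_double]   # ...and argument types up front,
root = libm.sqrt(9.0)                    # or ctypes will guess (badly)
```

For a new project, cffi or (if you can write Rust) PyO3 are also worth a look before committing to raw ctypes.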
|
# ? Apr 27, 2024 15:33 |
|
I am putting together a lesson plan for myself to get from absolute newbie to somewhat competent. I've been sourcing the books, webpages, and courses that are mentioned frequently in this thread and r/learnpython, then having ChatGPT take a few of them and spin up a couple-month lesson plan. It just occurred to me to ask if anyone had put something like this together already and saw success with it; I would love if you could share it, if only to cross-reference and make sure mine is on a seemingly well-thought-out path.
|
# ? Apr 28, 2024 20:54 |
|
Not a lesson plan, but I slapped together a list of resources a while ago: https://forums.somethingawful.com/showthread.php?threadid=2672629&userid=0&perpage=40&pagenumber=299#post533605221
|
# ? Apr 29, 2024 02:47 |
|
Seventh Arrow posted:Not a lesson plan, but I slapped together a list of resources a while ago: This is helpful, thank you. I'm going to include those 5 YT channels. Below are the resources I am going to give to ChatGPT and have it generate a day-by-day lesson plan. It does a really good job at creating said plan, but of course said plan is only as good as the resources it was given. And a newbie is feeding it resources.

Books: Think Python; Automate the Boring Stuff
PDF: TokyoEdtech Intro to Python https://drive.google.com/file/d/1ajYJZLGUaVmNbuG98LnRfHMTzvnZx9el/view?pli=1
Courses: Sololearn https://www.sololearn.com/en/learn/courses/python-introduction and Codecademy https://www.codecademy.com/catalog/...MBoC3h8QAvD_BwE
YT: 12 Hour Course https://www.youtube.com/watch?v=WGJJIrtnfpk + the 5 you've presented.

A few of the resources above seemed controversial, like the 12-hour video, whereas others swore by it. When it comes to the courses, that one is really tough because I'd imagine if I knew which one was the most intuitive for a newbie I'd go with that. Otherwise Sololearn and Codecademy are the two I selected based on lots of reading. Am I missing something wildly important? And any last minute tips lmao edit: swapped Sololearn for Coursera spacejam fucked around with this message at 05:35 on Apr 29, 2024 |
# ? Apr 29, 2024 04:52 |
|
I'm searching for a job right now (and it sucks). I was most recently working in fintech, so I applied at a place doing some bank data automation work looking for a Python engineer. They asked me to do a coding challenge and I accepted. It wasn't too bad of a task, just writing a parser for some messy data in three different formats. The super tricky part was finding all the hidden landmines in the data and instructions. The whole thing was littered with traps. I was rejected right away and couldn't get any feedback out of them. It was surprising because I thought I had done fairly well, at least well enough to earn an interview. So I humbly submit my code here to see if anyone has time for some feedback. I know I messed a few things up. I did the whole thing in one 5-hour sprint and I could tell my attention was slipping by the end. The thing had a few specific requirements that explain why it is the way it is:
I am sure I messed up:
Here is the code. There is also a readme.md there with the instructions. Fender fucked around with this message at 06:02 on Apr 29, 2024 |
# ? Apr 29, 2024 05:43 |
|
Fender posted:I'm searching for a job right now (and it sucks). I was most recently working in fintech, so I applied at a place doing some bank data automation work looking for a Python engineer. They asked me to do a coding challenge and I accepted. It wasn't too bad of a task, just writing a parser for some messy data in three different formats. The super tricky part was finding all the hidden landmines in the data and instructions. The whole thing was littered with traps. I was rejected right away and couldn't get any feedback out of them. It was surprising because I thought I had done fairly well, at least well enough to earn an interview. So I humbly submit my code here to see if anyone has time for some feedback. I know I messed a few things up. I did the whole thing in one 5-hour sprint and I could tell my attention was slipping by the end. I'll dig through. Notes as I go, in no particular order. I didn't dig deeply into the problem, so my notes below are on your code alone, and I only dug into the problem to confirm details.
Falcon2001 fucked around with this message at 07:32 on Apr 29, 2024 |
# ? Apr 29, 2024 07:17 |
|
The commit message is "challenge completed"; one of the requirements is to write a concise commit message, but I think that's probably too concise. After looking it over a little I realized that the code isn't PEP 8-compliant, which is another requirement. A common tool to verify this is the standalone utility named "pycodestyle", which used to literally be called "pep8". An even better tool is "flake8", which combines pycodestyle and some other linters into one utility. Try installing flake8 (with pip) and running it on your script (`flake8 challenge.py`); it'll print out a list of line numbers with specific issues to fix. You can also run pycodestyle in the same way if you want to see only the PEP 8 problems. Linting is integrated directly into most modern IDEs, but every one has its own default flavor of what's "right" and not all of them are PEP 8-compliant; lesson learned, run flake8 on your code before you commit it. (You should also look up pre-commit hooks: during a git commit they can automatically fix a lot of common issues like unnecessary whitespace at the end of lines, and you can incorporate a flake8 pre-commit stage that causes the commit to be rejected if flake8 has complaints. This enforces PEP 8 compliance on anything that you try to commit; it's very handy!) Those two issues (PEP 8 compliance, vague commit message) plus the result not being sorted by zip code is 3 failed requirements. Those are probably the big issues here. I have some other suggestions for your code that you can think about, because I like providing code review, but I don't think these matter as much: The `Address` class would be a lot more concise if it was a dataclass. Then the `view_dict` method reduces to 2 lines (one of which is the asdict function from the dataclasses module). As it stands, `Address` also has no documentation. A class-based approach like this has fallen out of favor with a lot of Python developers; the methods in `Parser` could have all been functions.
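A minimal sketch of the suggested dataclass rewrite (the field names are assumed from the assignment, not taken from the actual submission):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Address:
    """A parsed address record (field names assumed for illustration)."""
    name: Optional[str] = None
    organization: Optional[str] = None
    street: Optional[str] = None
    city: Optional[str] = None
    zip: Optional[str] = None

    def view_dict(self):
        # mirror the example output: keys with no value are dropped
        return {k: v for k, v in asdict(self).items() if v is not None}
```

The dataclass decorator generates `__init__`, `__repr__`, and `__eq__`, which is most of what the hand-written class was doing.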
`file_path[-3:].lower()` is less effective at extracting file extensions than `os.path.splitext(file_path)[-1]`. Technically it satisfies the assignment, but this line is a brittle implementation that could require rework if it had to be extended to other file formats in the future. In a code review I'd reject this line. The rest of the implementation in `parse_file` looks good. A dictionary dispatch approach would be equally valid. I'm not great at XML parsing, but when you extract the street components I think you're missing some logic to correctly handle street names. Here's your approach to concatenating these: Python code:
Python code:
XML code:
Python code:
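The quoted block was lost in archiving; a sketch of the filter-and-join pattern it suggested, where each element is stripped once via a walrus assignment and only non-empty results are kept (the variable names are assumed):

```python
# Element values as they might come back from the XML (names are assumed)
house, predirection, street_name, street_type = "42", None, " WASSAMATTA ", "AVE"

parts = [
    text
    for elem in (house, predirection, street_name, street_type)
    # walrus: strip once, keep only non-empty results
    if elem is not None and (text := elem.strip())
]
street = " ".join(parts)
```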
- We use a walrus operator to only keep elements that have length > 0
- We join together the remaining elements with spaces

Moving beyond the issues with street parsing in XML, I'm not enamored with the "x = y if y is not None else None" approach in general. 90% of this code is duplicate effort. I think this could be a lot cleaner with a looped approach. I'm imagining a dict comprehension, but even a for loop could be cleaner, I think. The postal code parser is a little too complicated and I think it doesn't work right with XML values. The Zip+4 values in the XML are formatted as "NNNNN - MMMM" but your regex is `r"^\d{5}-\d{4}$"`. That lack of a space means that your parser assumes it's not a Zip+4 value, so it always returns the first 5 digits. This would work better: Python code:
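A sketch of the more tolerant postal-code parse (the function name is made up): the regex allows optional spaces around the dash, so both "NNNNN-MMMM" and "NNNNN - MMMM" normalize to the same Zip+4 form, and anything else falls back to the first five digits:

```python
import re

def parse_postal_code(raw):
    # tolerate optional spaces around the dash: "NNNNN-MMMM" and "NNNNN - MMMM"
    m = re.match(r"^(\d{5})\s*-\s*(\d{4})$", raw.strip())
    if m:
        return f"{m.group(1)}-{m.group(2)}"
    return raw.strip()[:5]  # plain 5-digit zips and junk like "00000-"
```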
In the tsv parser I think you were meant to infer that sometimes an Organization label is actually in the "last" column; it looks like you're parsing "Central Trading Company Ltd." as a last name for a person with no first name. If a middle name is "N/M/N" you should probably just not include that value. E.g. "no middle name" == "N/A" == don't bother keeping this value. Probably don't want to leave a "Hello, world!" docstring in your code.
|
# ? Apr 29, 2024 08:08 |
|
Thanks both of you, that was very helpful. And yeah, this was all done in public. With a bit of searching you can go see the PRs and the code I was up against. There are a lot of people writing Python so much worse than mine (and a few writing Python so much better). There is some good feedback in here. Some of it I feel would be appropriate for a conversation. Like why I check for None a lot: it's just a habit I learned in my last role, where we parsed a lot of unreliable data sourced from scrapers. I would happily explain my position there. And stuff like the Address class having that weird view_dict method -- I would love to chat about that. It is like that because, while they never said it, the examples both showed the printed JSON ignoring any keys without values, so I went out of my way to provide that, since all the example data had different fields present/missing depending on the format. I was really getting in my own head by hour number 4. But there is much more mechanically wrong than I thought. You both also pointed out a really neat bug in my XML parsing. It's working... but it shouldn't be. And the zip code bug, I didn't catch that one either. That was just them coming in formats of 00000-0000, 00000, 00000-, and 00000 -. I totally missed the ones coming in as 00000 - 0000, so my code is slicing those down to 5 digits. Needed a better solution there. Again, thank you both. I really appreciate the time you took on this. Also, I enjoyed the comments on my handling of the two organization/last tabs in the tsv file. That was one of the last data issues to kick me in the pants and was super annoying. The handling you see fits the pattern in the data, where data that belonged in the organization tab was often (but not always) in the lastname tab. Just another gotcha. My primary takeaways so far:
Fender fucked around with this message at 19:05 on Apr 29, 2024 |
# ? Apr 29, 2024 19:00 |
|
Kind of a weird request, but I'm trying to think of a good approach for batch-removing the theme song from a show I have on my plex server, but audio fingerprinting (with a database etc.) seems like major overkill for a project like this. Is there any simpler approach that could work that I'm just not thinking of?
|
# ? Apr 30, 2024 13:48 |
|
If the opening theme is always at the same time into the video, you could just clip it out based on that timestamp. If it's at an unpredictable time (e.g. there's a cold open of varying length before the titles) - well, you're going to have to figure out where the opening theme is on a video-by-video basis. It sounds like you'd pretty much need to do some audio and/or video recognition to make that happen.
|
# ? Apr 30, 2024 13:55 |
|
Chillmatic posted:Kind of a weird request, but I'm trying to think of a good approach for batch-removing the theme song from a show I have on my plex server, but audio fingerprinting (with a database etc.) seems like major overkill for a project like this. Is there any simpler approach that could work that I'm just not thinking of? I don't want to dissuade from what could be a cool project if you're trying to do something neat in Python, but for this case I would simply pay for Plex Pass and get the "Skip Intro" button from then on.
|
# ? Apr 30, 2024 14:37 |
|
Hed posted:I don't want to dissuade from what could be a cool project if you're trying to do something neat in Python, but for this case I would simply pay for Plex Pass and get the "Skip Intro" button from then on. Oh, I love Plex Pass. The issue is that when setting up TV "Channels" with Ersatz there's no option to skip intros because the broadcast is 'live', so the only option to never have to hear that drat music is cutting the files themselves. I'm really, really trying to avoid doing this by hand because there's several seasons and apparently zero simple video editors that would make that a quick/simple process.
|
# ? Apr 30, 2024 14:51 |
|
That makes sense! In that case I would spend some time researching how you can pull the fingerprinting out of the Plex database -- it has to know time offsets somehow! Then feed those timestamps into your crop commands for ffmpeg or whatever you're using. edit: found an old schema. I'd look at metadata_item_clusters or something else that has duration/starts_at/ends_at columns to see if it's in there before trying to analyze your own files. Hed fucked around with this message at 16:15 on Apr 30, 2024 |
# ? Apr 30, 2024 15:47 |
|
Chillmatic posted:Oh, I love Plex Pass. The issue is that when setting up TV "Channels" with Ersatz there's no option to skip intros because the broadcast is 'live', so the only option to never have to hear that drat music is cutting the files themselves. I'm really, really trying to avoid doing this by hand because there's several seasons and apparently zero simple video editors that would make that a quick/simple process. "Simple" being relative, FFMPEG can do this if you know the timestamps.
|
# ? Apr 30, 2024 19:42 |
|
|
The March Hare posted:"Simple" being relative, FFMPEG can do this if you know the timestamps. I wrote a script to get all the XML files from the show in question, and then iterate through them, using the intro marker data in those XML files to decide where to make the edits with ffmpeg. Except now I'm running into losing the subtitles from the original files, and converting them is a whole thing because they're in PGS format.
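One possible dodge, sketched as a helper that extends an existing ffmpeg argument list: `-map 0:s?` selects any subtitle streams if present and `-c:s copy` stream-copies them, which an .mkv output accepts for PGS bitmaps with no conversion. These are real ffmpeg flags, but note a caveat: copied subtitles keep their original timing, so after cutting a segment they'll drift unless you also shift them, which this sketch does not attempt:

```python
def with_subtitle_passthrough(cmd):
    """Insert subtitle stream-copy flags before the output file of an
    ffmpeg command list (assumes the output path is the last argument)."""
    return cmd[:-1] + ["-map", "0:s?", "-c:s", "copy"] + cmd[-1:]
```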
|
# ? Apr 30, 2024 19:57 |