Python

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

Vulture Culture: Jul 14, 2003; I was never enjoying it. I only eat it for the nutrients.

Oysters Autobio posted:

Echoing this and expanding it even further: don't touch anything related to XML as a beginner.

While I'm absolutely sure it has some great features about it, I'm finding it's very much a "big boy" format and not beginner friendly. Maybe not the format necessarily but the ecosystem of tools in Python for it are really dense.

Hell, just look at the name "lxml" as a package. Gonna throw out a dumb hot take that I literally put no thought into: Acronyms should be banned from package naming.

Yeah, the big issue is that XML was built as a markup language, not a language for representing data structures or configuration, the two things developers between 1995 and 2005 really like to pretend it was ever good at

If you can make guarantees about the documents you're loading like "text will never contain other elements" then it gets a lot easier to work with and enables much more straightforward APIs like Pydantic

# ? Apr 4, 2024 17:47

Adbot: ADBOT LOVES YOU

# ? Jun 5, 2024 08:49

Fender: Oct 9, 2000; Mechanical Bunny Rabbits!; Dinosaur Gum

Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land.

We had a product that was an API mostly running python scrapers in the backend. I don't know if it was ever explained to us why we used lxml. By default BeautifulSoup uses lxml as its parser, so I think we just cut out the middleman. I always assumed it was just an attempt at resource savings at a large scale.

Two years of that and I'm a super good scraper and I can get a lot done with just a few convenience methods for lxml and some xpath. And I have no idea how to use BeautifulSoup.

# ? Apr 4, 2024 18:44

Jose Cuervo: Aug 25, 2004

QuarkJets posted:

That's right, the responses will have the same order as the list of tasks provided to gather() even if the tasks happen to execute out of order. From the documentation, "If all awaitables are completed successfully, the result is an aggregate list of returned values. The order of result values corresponds to the order of awaitables."

Great. I saw that in the documentation and thought that is what it meant but I wanted to be sure.

Another related question - I have never built a scraper before but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 urls with violations). Is the aiohttp stuff 'clever' enough to not make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website if I call the fetch_urls function with a list of 12,000 urls?

Finally, sometimes the response which is returned is Null (when I save it as a json file). Does this just indicate that the fetch_url function ran out of retries?

# ? Apr 4, 2024 21:35

Fender: Oct 9, 2000; Mechanical Bunny Rabbits!; Dinosaur Gum

Jose Cuervo posted:

Great. I saw that in the documentation and thought that is what it meant but I wanted to be sure.

Another related question - I have never built a scraper before but from the initial results it looks like I will have to make about 12,000 requests (i.e., there are about 12,000 urls with violations). Is the aiohttp stuff 'clever' enough to not make all the requests at the same time, or is that something I have to code in so that it does not overwhelm the website if I call the fetch_urls function with a list of 12,000 urls?

Finally, sometimes the response which is returned is Null (when I save it as a json file). Does this just indicate that the fetch_url function ran out of retries?

For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work:

Python code:

connector = aiohttp.TCPConnector(limit_per_host=10)
aiohttp.ClientSession(connector=connector)

Yes, the fetch_url method will result in None if it fails after 3 retries. I noticed that each url has an id number for the daycare in the params, so you could log which daycares you didn't get a response for and follow-up later. Just add something outside the while loop, the code only gets there if all retries fail. You could also adjust the retry interval. I left it at 1 second but a longer delay might help.

# ? Apr 4, 2024 22:50

Jose Cuervo: Aug 25, 2004

Fender posted:

For your first question, it looks like the default behavior for aiohttp.ClientSession is to do 100 simultaneous connections. If you want to adjust it, something like this will work:
Python code:
connector = aiohttp.TCPConnector(limit_per_host=10)
aiohttp.ClientSession(connector=connector)
Yes, the fetch_url method will result in None if it fails after 3 retries. I noticed that each url has an id number for the daycare in the params, so you could log which daycares you didn't get a response for and follow-up later. Just add something outside the while loop, the code only gets there if all retries fail. You could also adjust the retry interval. I left it at 1 second but a longer delay might help.

Thank you! I am saving the center ID and inspection ID which fail to get a response and plan to try them again.

# ? Apr 5, 2024 00:38

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Fender posted:

Chiming in about how someone gets wormy brains to the point where they use lxml. In short, fintech startup-land.

We had a product that was an API mostly running python scrapers in the backend. I don't know if it was ever explained to us why we used lxml. By default BeautifulSoup uses lxml as its parser, so I think we just cut out the middleman. I always assumed it was just an attempt at resource savings at a large scale.

Two years of that and I'm a super good scraper and I can get a lot done with just a few convenience methods for lxml and some xpath. And I have no idea how to use BeautifulSoup.

I use lxml when needing to iterate over huge lists via xPaths from scraped data. Seems to be the fastest and it ain�t that hard. Selenium is slow at finding elements via xpath when you start needing to find hundreds of individual elements.

Also if you�re using selenium, lxml code can kinda look similar.

I spent multiple years writing and maintaining web scrapers and basically never used BS4.

CarForumPoster fucked around with this message at 10:13 on Apr 5, 2024

# ? Apr 5, 2024 10:10

rich thick and creamy: May 23, 2005; To whip it, Whip it good; Pillbug

Has anyone played around with Rye yet? I just found it yesterday and am giving it a spin. So far it seems like a pretty nice Poetry alternative.

# ? Apr 13, 2024 02:19

Cyril Sneer: Aug 8, 2004; Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

Fun little learning project I want to do but need some direction. I want to extract the all the video transcripts from a particular youtube channel and make them both keyword and semantically searchable, returning the relevant video timestamps.

I've got the scraping/extraction part working. Each video transcript is returned as a list of dictionaries, where each dictionary contains the timestamp and a (roughly) sentence-worth of text:

code:

    {
    'text': 'replace the whole thing anyways right so',
     'start': 1331.08,
     'duration': 4.28
    }

I don't really know how YT breaks up the text, but I don't think it really matters. Anyway, I obviously don't want to re-extract the transcripts every time so I need to store everything in some kind of database -- and in manner amenable to reasonably speedy keyword searching. If we call this checkpoint 1, I don't have a good sense of what this solution would look like.

Next, I want to make the corpus of text (is that the right term?) semantically searchable. This part is even foggier. Do I train my own LLM from scratch? Do some kind of transfer learning thing (i.e., take existing model and provide my text as additional training data?) Can I just point chatGPT at it (lol)?

I want to eventually wrap it in a web UI, but I can handle that part. Thanks goons! This will be a neat project.

Cyril Sneer fucked around with this message at 03:46 on Apr 17, 2024

# ? Apr 17, 2024 03:42

PierreTheMime: Dec 9, 2004; Hero of hormagaunts everywhere!; Buglord

Cyril Sneer posted:

Fun little learning project I want to do but need some direction. I want to extract the all the video transcripts from a particular youtube channel and make them both keyword and semantically searchable, returning the relevant video timestamps.

I've got the scraping/extraction part working. Each video transcript is returned as a list of dictionaries, where each dictionary contains the timestamp and a (roughly) sentence-worth of text:
code:
    {
    'text': 'replace the whole thing anyways right so',
     'start': 1331.08,
     'duration': 4.28
    }
I don't really know how YT breaks up the text, but I don't think it really matters. Anyway, I obviously don't want to re-extract the transcripts every time so I need to store everything in some kind of database -- and in manner amenable to reasonably speedy keyword searching. If we call this checkpoint 1, I don't have a good sense of what this solution would look like.

Next, I want to make the corpus of text (is that the right term?) semantically searchable. This part is even foggier. Do I train my own LLM from scratch? Do some kind of transfer learning thing (i.e., take existing model and provide my text as additional training data?) Can I just point chatGPT at it (lol)?

I want to eventually wrap it in a web UI, but I can handle that part. Thanks goons! This will be a neat project.

This sounds like a good use-case for a vectored database and retrieval-augmented generation (RAG) and/or semantic search. You can use your dialog text as the target material and the rest as metadata you can retrieve on match. Theres a number of free options for database, including local ChromaDB instances (which use SQLite) or free-tier Pinecone.io which has good library support and a decent web UI.

# ? Apr 17, 2024 11:49

Hed: Mar 31, 2004; Fun Shoe

I just wanted to say I appreciated this error message when I forgot to put in the '-r'.

Bash code:

(venv) $ pip install requirements.txt
ERROR: Could not find a version that satisfies the requirement requirements.txt (from versions: none)
HINT: You are attempting to install a package literally named "requirements.txt" (which cannot exist). Consider using the '-r' flag to install the packages listed in requirements.txt

It sounds like something I would write.

# ? Apr 19, 2024 17:49

boofhead: Feb 18, 2021

I love the "consider using". It feels very much like "you don't have to go home, but you can't stay here"

# ? Apr 19, 2024 18:15

Cyril Sneer: Aug 8, 2004; Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

I have a case where I create two instances of an object via a big configuration dictionary. The difference between the two objects is a single, different value for one key. So, this works:

code:

big_config_dict = { .... }

B = dict(big_config_dict)
B['color'] = 'blue' #this one key is the only difference

thingA = Thing(big_config_dict) #default values
thingB = Thing(B) #single modified value

...but feels clunky. Am I missing some simpler way to do this?

# ? Apr 24, 2024 20:07

Chin Strap: Nov 24, 2002; I failed my TFLC Toxx, but I no longer need a double chin strap; Pillbug

You might want to do deepcopy instead of just dict() to copy but that's all I'd change

EDIT: If you wanted to be real safe you'd make them frozen too.

# ? Apr 24, 2024 20:20

boofhead: Feb 18, 2021

If I want to take a base dict and change value in one line, I'll usually use a spread operator if the structure and changes are simple

Python code:

config_1 = {'val1': 100, 'val2': 200}
# {'val1': 100, 'val2': 200}

config_2 = {**config_1, 'val2': 0}
# {'val1': 100, 'val2': 0}

you can also use it to combine/update multiple dicts, but the more complex the dictionaries or conflicts or changes are, the less appropriate it becomes, e.g.

Python code:

default_person = {"name": "john doe", "age": 50, "mother": {"name": "jane doe", "age": 80}}
default_hobbies = {"hobbies": ["going to church", "reading the bible"]}

sample_person = {**default_person, **default_hobbies}
# {'name': 'john doe', 'age': 50, 'mother': {'name': 'jane doe', 'age': 80}, 'hobbies': ['going to church', 'reading the bible']}

sample_stoner = {**default_person, **default_hobbies, "mother": {"name": "janet doe"}, "hobbies": ["smoking weed"]}
# {'name': 'john doe', 'age': 50, 'mother': {'name': 'janet doe'}, 'hobbies': ['smoking weed']}
# see how it replaces, it doesn't update partials/nested

it's not perfect but i quite like spread operators if the usage is clear and concise. works for lists too, just use one asterisk

boofhead fucked around with this message at 21:00 on Apr 24, 2024

# ? Apr 24, 2024 20:53

Cyril Sneer: Aug 8, 2004; Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

[quote="boofhead" post="539140163"]
If I want to take a base dict and change value in one line, I'll usually use a spread operator if the structure and changes are simple

Python code:

config_1 = {'val1': 100, 'val2': 200}
# {'val1': 100, 'val2': 200}

config_2 = {**config_1, 'val2': 0}
# {'val1': 100, 'val2': 0}

oooh, okay yeah this is perfect.

# ? Apr 24, 2024 21:32

QuarkJets: Sep 8, 2008

If your things are dataclasses you can neatly dictionary-ize one while creating the other

Python code:

@dataclass
class Thing:
    a: int
    b: int
    c: int

thing1 = Thing(a=1, b=2, c=3)
thing2 = Thing(**{**asdict(thing1), 'c': 4})

Or if they're really default values you can skip using dicts entirely:

Python code:

@dataclass
class Thing:
    a: int = 1
    b: int = 2
    c: int = 3

thing1 = Thing()
thing2 = Thing(c=4)

# ? Apr 26, 2024 08:32

Data Graham: Dec 28, 2009; 📈📊🍪😋

Wait a minute, I was just using dataclasses for something similar and was going to mention something annoying about them, which is

Python code:

----> 1 thing1 = Thing()

TypeError: __init__() missing 3 required positional arguments: 'a', 'b', and 'c'

# ? Apr 26, 2024 10:25

Falcon2001: Oct 10, 2004; Eat your hamburgers, Apollo.; Pillbug

Data classes rule. Use them everywhere.

# ? Apr 26, 2024 20:29

CarForumPoster: Jun 26, 2013; â¡POWERâ¡

Falcon2001 posted:

Data classes rule. Use them everywhere.

I�ve met like three functions that should be a class.

# ? Apr 26, 2024 20:35

QuarkJets: Sep 8, 2008

Data Graham posted:

Wait a minute, I was just using dataclasses for something similar and was going to mention something annoying about them, which is
Python code:
----> 1 thing1 = Thing()

TypeError: __init__() missing 3 required positional arguments: 'a', 'b', and 'c'

You can assign defaults in the dataclass definition, see my example

# ? Apr 26, 2024 22:42

Data Graham: Dec 28, 2009; 📈📊🍪😋

Oh nice, I should have figured. Thanks!

e: wow I have no idea why it never even occurred to me to try that. Some days my stupidity leaves me breathless

Data Graham fucked around with this message at 03:06 on Apr 27, 2024

# ? Apr 26, 2024 23:04

Seventh Arrow: Jan 26, 2005

I'm hoping that pyspark is python-adjacent enough to be appropriate for this thread (since the Data Engineering thread doesn't seem to be responding)

I'm applying for a job and Spark is one of the required skills, but I'm fairly rusty. They sprang an assignment on me where they wanted me to take the MovieLens dataset and calculate:

- The most common tag for a movie title and
- The most common genre rated by a user

After lots of time on Stack Overflow and Youtube, this is the script that I came up with. At first, I had something much simpler that just did the assigned task, but I figured that I would also add commenting, error checking, and unit testing because rumor has it that this is what professionals actually do. I've tested it and know it works but I'm wondering if it's a bit overboard? Feel free to roast.

code:

# Set up the Google Colab environment for pyspark, including integration with Google Drive

from google.colab import drive
drive.mount('/content/drive')

!pip install pyspark

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, row_number, split, explode
from pyspark.sql.window import Window
import unittest

# Initialize the Spark Session for Google Colab
def initialize_spark_session():
    """Initialize and return a Spark session configured for Google Colab"""
    return SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

"""
This code will ingest data from the MovieLens 20M dataset. The goal will be to use Spark to find the following:

- The most common tag for each movie title:
  Utilizing Spark's DataFrame operations, the script joins the movies dataset with the tags dataset on the movie ID.
  After joining, it aggregates tag data by movie title to count occurrences of each tag, then employs a window function
  to rank tags for each movie. The most frequent tag for each movie is identified and selected for display.

- The most common genre rated by each user:
  Similarly, the script joins the movies dataset, which includes genre information, with the ratings dataset on the movie ID.
  It then groups the data by user ID and genre to count how many times each user has rated movies of each genre.
  A window function is again used to rank the genres for each user based on these counts. The top genre for each user
  is then extracted and presented.


"""

def load_dataset(spark, file_path, has_header=True, infer_schema=True):
    """Load the need files, with some error checking for good measure"""
    try:
        df = spark.read.csv(file_path, header=has_header, inferSchema=infer_schema)
        if df.head(1):
            return df
        else:
            raise ValueError("DataFrame is empty, check the dataset or path.")
    except Exception as e:
        raise IOError(f"Failed to load data: {e}")

def calculate_most_common_tag(movies_df, tags_df):
    """Calculates the most common tag for each movie title."""
    try:
        movies_tags_df = movies_df.join(tags_df, "movieId")
        tag_counts = movies_tags_df.groupBy("title", "tag").agg(count("tag").alias("tag_count"))
        window_spec = Window.partitionBy("title").orderBy(col("tag_count").desc())
        most_common_tags = tag_counts.withColumn("rank", row_number().over(window_spec)) \
                             .filter(col("rank") == 1) \
                             .drop("rank") \
                             .orderBy(col("tag_count").desc())
        return most_common_tags
    except Exception as e:
        raise RuntimeError(f"Error calculating the most common tag: {e}")

def calculate_most_common_genre(ratings_df, movies_df):
    """Calculates the most common genre rated by each user."""
    try:
        movies_df = movies_df.withColumn("genre", explode(split(col("genres"), "[|]")))
        ratings_genres_df = ratings_df.join(movies_df, "movieId")
        genre_counts = ratings_genres_df.groupBy("userId", "genre").agg(count("genre").alias("genre_count"))
        window_spec = Window.partitionBy("userId").orderBy(col("genre_count").desc())
        most_common_genres = genre_counts.withColumn("rank", row_number().over(window_spec)) \
                                 .filter(col("rank") == 1) \
                                 .drop("rank") \
                                 .orderBy(col("genre_count").desc())
        return most_common_genres
    except Exception as e:
        raise RuntimeError(f"Error calculating the most common genre: {e}")

def main():
    # Calling Spark session info
    spark = initialize_spark_session()
    # Load CSV files from MovieLens
    movies_df = load_dataset(spark, "/content/drive/My Drive/spark_project/movies.csv")
    tags_df = load_dataset(spark, "/content/drive/My Drive/spark_project/tags.csv")
    ratings_df = load_dataset(spark, "/content/drive/My Drive/spark_project/ratings.csv")

    # Perform analysis
    most_common_tag = calculate_most_common_tag(movies_df, tags_df)
    most_common_genre = calculate_most_common_genre(ratings_df, movies_df)

    # Displaying results
    print("Most Common Tag for Each Movie Title:")
    most_common_tag.show()
    print("Most Common Genre Rated by User:")
    most_common_genre.show()

class TestMovieLensAnalysis(unittest.TestCase):
    '''This will perform some unittests on a truncated set of files in a dedicated testing directory'''
    def setUp(self):
        self.spark = initialize_spark_session()
        self.movies_path = "/content/drive/My Drive/spark_testing/movies.csv"
        self.tags_path = "/content/drive/My Drive/spark_testing/tags.csv"
        self.ratings_path = "/content/drive/My Drive/spark_testing/ratings.csv"

    def test_load_dataset(self):
        # Assuming fake paths to simulate failure
        with self.assertRaises(IOError):
            load_dataset(self.spark, "fakepath.csv")

    def test_calculate_most_common_tag(self):
        movies_df = load_dataset(self.spark, self.movies_path)
        tags_df = load_dataset(self.spark, self.tags_path)
        result_df = calculate_most_common_tag(movies_df, tags_df)
        self.assertIsNotNone(result_df.head(1))

    def test_calculate_most_common_genre(self):
        ratings_df = load_dataset(self.spark, self.ratings_path)
        movies_df = load_dataset(self.spark, self.movies_path)
        result_df = calculate_most_common_genre(ratings_df, movies_df)
        self.assertIsNotNone(result_df.head(1))

if __name__ == "__main__":
    main()
    unittest.main(argv=['first-arg-is-ignored'], exit=False)

The line "unittest.main(argv=['first-arg-is-ignored'], exit=False)" is apparently a necessary quirk when working in a notebook format

# ? Apr 26, 2024 23:19

nullfunction: Jan 24, 2005; Nap Ghost

Seventh Arrow posted:

At first, I had something much simpler that just did the assigned task, but I figured that I would also add commenting, error checking, and unit testing because rumor has it that this is what professionals actually do. I've tested it and know it works but I'm wondering if it's a bit overboard? Feel free to roast.

I'll preface this with an acknowledgement that I'm not a pyspark toucher so I'm not going to really focus on that. From the point of view of someone who is reviewing your submission, I'm very happy to see comments, error checking, and unit tests! They give me additional insight into how you communicate information about your code and how you go about validating your designs. However, if the assignment was supposed to take you 4 hours and you turn in something that looks like it's had 40 put into it, that isn't necessarily a plus.

Since you're offering it up for a roast, here are some things to consider:

Your error handling choices only look good on the surface. Yes, you've made an error message slightly more fancy by adding some text to it, yes, the functions will always raise errors of a consistent type. They also don't react in any meaningful way to handle the errors that might be raised or enrich any of the error messages with context that would be useful to an end user (or logging system). You could argue that they make the software worse because they swallow the stack trace that might contain something meaningful (because they don't raise from the base exception).
Docstrings for each function are a good practice. Some docstring formats have a description for each argument, raised exception, and a description of the return value, and I tend to like these because it's helpful to tie relevant info directly to an argument. The docstrings you wrote contain the function's name, just reworded, and are not useful. In fact, you could take the tiny bit of extra information from the docstring and put it back into the function name and have a function name that is even better than the one you started with. calculate_most_common_genre_by_user is better than calculate_most_common_genre (but I would probably go with top_genre_by_user personally).
Normalize your use of single or double quotes. Run it through a formatter like black or ruff. The best case is that the reviewer doesn't notice or thinks it's a bit sloppy, the worst case is that they assume you copied it from two different websites.
You don't have any type annotations in your arguments or on your return values. Help your IDE help you.
Your unit tests check that an answer was returned, but stop short of actually seeing if that answer is correct. Constructing a tiny fake dataset with some known answers for your unit tests is a great way to validate that, and it seems like that's what you did from the filenames, but it seems unfinished.

It's clear that you've seen good code before and have some idea of what it should look like when trying to write it for yourself, but you're missing fundamentals and experience that will allow you to actually write it. To be clear this is a fine place to be for a junior. If this is for a junior role, submitted as-is it's a hire from me but there's a lot of headroom for another junior to impress me over this submission.

# ? Apr 27, 2024 00:36

QuarkJets: Sep 8, 2008

I don't like how the function code is all wrapped in large try/except blocks that raise new errors. That's not really error checking, it's error obfuscation; yes, you print the caught exception object, but it's hiding the stack trace for no good reason. If you absolutely felt like you had to add more context to exceptions, like if you wanted to use logging or send a notification to someone, then you could use exception chaining to raise the original exception

Python code:

def test():
    try:
        raise RuntimeError('foo')
    except RuntimeError as e:
        logging.warning(f'gently caress, I saw exeption {e}!!!')
        raise  # <-- This simple line raises the original exception as though it wasn't caught at all

But I think you should ditch the try/except blocks entirely, they're not helping you and bare Exception catching (where you catch the base Exception type instead of a more specific type) is a code smell.

Don't run pip install commands in your notebook code. A comment that describes what's needed is fine.

# ? Apr 27, 2024 03:15

Seventh Arrow: Jan 26, 2005

Thanks guys, much appreciated! I will look into your suggestions.

# ? Apr 27, 2024 06:03

StumblyWumbly: Sep 12, 2007; Batmanticore!

Any recommendations for places to start with interfacing dlls and python?
I'm interested in playing around with this and trying to make some file parsing stuff faster by reading binary data in c, then passing it to python for the higher level stuff.
I've used ctypes before, but I'm under the impression dll stuff has changed in the last 4 versions or so, and I'm worried searching will suggest bad habits.

# ? Apr 27, 2024 15:33

96 spacejam: Dec 4, 2009

I am putting together a lesson plan for myself to get from absolute newbie to somewhat competent. I've been sourcing the books, webpages and courses that are mentioned frequently between this thread, r/learnpython and then having ChatGPT take a few of them and spin up a couple month lesson plan.

Just occurred to me to ask if anyone had put something like this together already and saw success with it, would love if you could share it, if only to cross reference to make sure mine is on a seemingly well-thought out path.

# ? Apr 28, 2024 20:54

Seventh Arrow: Jan 26, 2005

Not a lesson plan, but I slapped together a list of resources a while ago:

https://forums.somethingawful.com/showthread.php?threadid=2672629&userid=0&perpage=40&pagenumber=299#post533605221

# ? Apr 29, 2024 02:47

96 spacejam: Dec 4, 2009

Seventh Arrow posted:

Not a lesson plan, but I slapped together a list of resources a while ago:

https://forums.somethingawful.com/showthread.php?threadid=2672629&userid=0&perpage=40&pagenumber=299#post533605221

This is helpful, thank you.

I'm going to include those 5 YT channels.

Below are the resources I am going to give to Chatgpt and have it generate a day-by-day lesson plan. It does a really good job at creating said plan but of course said plan is only as good as the resources it was given. And a newbie is feeding it resources.

Books:
Think Python
Automate the Boring

PDF:
TokyoEdtech Intro to Python https://drive.google.com/file/d/1ajYJZLGUaVmNbuG98LnRfHMTzvnZx9el/view?pli=1

Course:
Sololearn https://www.sololearn.com/en/learn/courses/python-introduction
Codeacademy https://www.codecademy.com/catalog/...MBoC3h8QAvD_BwE

YT:
12 Hour Course thing https://www.youtube.com/watch?v=WGJJIrtnfpk
+ the 5 you've presented.

A few of the resources above seemed controversial like the 12 hour video, whereas others swore by it.

When it comes to the courses that one is really tough because Id imagine if I knew which one was the most intuitive for a newbie I'd go with that. Otherwise Solo and Code are the two I selected based on lots of reading.

Am I missing something wildly important? And any last minute tips lmao

edit: swapped Sololearn for Coursera

96 spacejam fucked around with this message at 05:35 on Apr 29, 2024

# ? Apr 29, 2024 04:52

Fender: Oct 9, 2000; Mechanical Bunny Rabbits!; Dinosaur Gum

I'm searching for a job right now (and it sucks). I was most recently working in fintech, so I applied at a place doing some bank data automation work looking for a Python engineer. They asked me to do a coding challenge and I accepted. It wasn't too bad of a task, just writing a parser for some messy data in three different formats. The super tricky part was finding all the hidden landmines in the data and instructions. The whole thing was littered with traps. I was rejected right away and couldn't get any feedback out of them. It was surprising because I though I had done fairly well, at least well enough to earn an interview. So I humbly submit my code here to see if anyone has time for some feedback. I know I messed a few things up. I did the whole thing in one 5-hour sprint and I could tell my attention was slipping by the end.

The thing had a few specific requirements that explain why it is the way it is:

<24 hours to complete
One file only
Standard libraries only
Only one commit

I am sure I messed up:

Forgot the sort by zip code requirement from the first paragraph
Didn't add 'strict' to the tsv file loader to act as a validator for the requirement in bullet points
There is a typo on line 85, easy to miss as the code for it never runs (that element is empty in the example data)

Here is the code. There is also a readme.md there with the instructions.

Fender fucked around with this message at 06:02 on Apr 29, 2024

# ? Apr 29, 2024 05:43

Falcon2001: Oct 10, 2004; Eat your hamburgers, Apollo.; Pillbug

Fender posted:

I'm searching for a job right now (and it sucks). I was most recently working in fintech, so I applied at a place doing some bank data automation work looking for a Python engineer. They asked me to do a coding challenge and I accepted. It wasn't too bad of a task, just writing a parser for some messy data in three different formats. The super tricky part was finding all the hidden landmines in the data and instructions. The whole thing was littered with traps. I was rejected right away and couldn't get any feedback out of them. It was surprising because I though I had done fairly well, at least well enough to earn an interview. So I humbly submit my code here to see if anyone has time for some feedback. I know I messed a few things up. I did the whole thing in one 5-hour sprint and I could tell my attention was slipping by the end.

The thing had a few specific requirements that explain why it is the way it is:

<24 hours to complete

One file only

Standard libraries only

Only one commit

I am sure I messed up:

Forgot the sort by zip code requirement from the first paragraph

Didn't add 'strict' to the tsv file loader to act as a validator for the requirement in bullet points

There is a typo on line 85, easy to miss as the code for it never runs (that element is empty in the example data)

Here is the code. There is also a readme.md there with the instructions.

I'll dig through, but notably you probably don't want to publish the name of the company along with these details, they might get pretty upset about it because it could let future people prep for the question ahead of time or just find a solution. I would recommend obfuscating it and possibly rewriting the instructions if you're going to show people. Actually it looks like this might just be a public repo it's based on? In which case, who cares.

Notes as I go through, in no particular order. I didn't dig deeply into the problem, so my notes below are on your code alone, and I only dug into the problem to confirm details.

Your XML parsing solution seems odd and overly verbose. I haven't done much XML parsing work before, but I'd be surprised if there wasn't a pattern that avoided the weird ent.find('X').text.strip() if ent.find('X') is not None else None part.
Additionally, you have a bug in this section - you combine the street parts, but if street_1 exists but 2 and 3 don't, you'll crash out, because you can't concatenate a string and a None.
Your Address class is a much bigger red flag - you shouldn't ever have a class that hides its signature; instead of providing it a dictionary, you should just provide it with arguments - conveniently, you can unpack a dictionary into arguments with **.
The Address class would also be a good place for someone to use a Dataclass instead. It's essentially just a data storage class, so dataclasses are perfect for it, and you could get rid of the view_dict part as well)
In general, the task is asking you to load three different kinds of files, and then perform the same *action* on each of them, so I would expect to see code that flowed into more of a 'load file (depends on type) -> parse contents -> return result'. It could be that there's simply too much difference in the file format, but it would be my first approach.

Falcon2001 fucked around with this message at 07:32 on Apr 29, 2024

# ? Apr 29, 2024 07:17

QuarkJets: Sep 8, 2008

The commit message is "challenge completed"; one of the requirements is to write a concise commit message, but I think that's probably too concise.

After looking it over a little I realized that the code isn't pep8-compliant, which is another requirement. A common tool to verify this is the standalone utility named "pycodestyle", which used to literally be called "pep8". An even better tool is "flake8", which combines pycodestyle and some other linters into one utility. Try installing flake8 (with pip) and running it on your script (`flake8 challenge.py`), it'll print out a list of line numbers with specific issues to fix. You can also run pycodestyle in the same way if you want to see only the pep8 problems. Linting is integrated directly into most modern IDEs but every one has its own default flavor of what's "right" and not all of them are pep8-compliant; lesson learned, run flake8 on your code before you commit it
(you should look up pre-commit-hooks, during a git commit they can not only automatically fix a lot of common issues like unnecessary whitespace at the end of lines, you can also incorporate a flake8 pre-commit stage that will cause the commit to be rejected if flake8 has complaints; this enforces pep8 compliance on anything that you try to commit, it's very handy!)

Those two issues (PEP8 compliance, vague commit message) + the result not being sorted by zip code is 3 failed requirements. Those are probably the big issues here. I have some other suggestions for your code that you can think about, because I like providing code review, but I don't think these matter as much:

The `Address` class would be a lot more concise if it was a dataclass. Then the `view_dict` method reduces to 2 lines (one of which is the asdict function from the dataclasses module). As it stands, `Address` also has no documentation

A class-based approach like this has fallen out of favor with a lot of Python developers. The methods in `Parser` could have all been functions.

`file_path[-3:].lower()` is less effective at extracting file extensions than `os.path.splitext(file_path)[-1]`. Technically it satisfies the assignment, but this line is a brittle implementation that could require rework if it had to be extended to other file formats in the future. In a code review I'd reject this line. The rest of the implementation in `parse_file` looks good. A dictionary dispatch approach would be equally valid

I'm not great at XML parsing. But when you extract the street components I think you're missing some logic to correctly handle street names. Here's your approach to concatenating these:

Python code:

            street_1 = ent.find("STREET").text.strip() if ent.find("STREET") is not None else None
            street_2 = ent.find("STREET_2").text.strip() if ent.find("STREET_2") is not None else None
            street_3 = ent.find("STREET_3").text.strip() if ent.find("STREET_3") is not None else None
            street = street_1 + street_2 + street_3

First, it doesn't make sense that you're allowing assignment to None. If any of these are None then an exception is going to be raised when you try to add them together, so a lot of this code either doesn't do anything or will fail. I think that you could just use findtext() to provide empty strings, then you could concatenate those together safely even when one of the tags isn't present:

Python code:

            street_1 = ent.findtext("STREET", "").strip()
            street_2 = ent.findtext("STREET_2", "").strip()
            street_3 = ent.findtext("STREET_3", "").strip()
            street = street_1 + street_2 + street_3

But the concatenation I think is also wrong; for instance here's some of the example data:

XML code:

      <STREET>135 West 20th Street</STREET>
      <STREET_2>Suite 201</STREET_2>
      <STREET_3> </STREET_3>

Since you just add these together, street would be "135 West 20th StreetSuite 201" right? That's a problem. Check out this approach:

Python code:

    street_components = (s for t in ("STREET", "STREET_2", "STREET_3") if (s := ent.findtext(t, "").strip()))
    street = " ".join(street_components)

- The list of valid tag names is a concise tuple of elements
- We use a walrus operator to only keep elements that have length > 0
- We join together the remaining elements with spaces

Moving beyond the issues with start parsing in xml, I'm not enamored with the "x = y if y is not None else None" approach in general. 90% of this code is duplicate effort. I think this could be a lot cleaner with a looped approach. I'm imagining a dict comprehension but even a for loop could be cleaner, I think

The postal code parser is a little too complicated and I think it doesn't work right with XML values. The Zip+4 values in the XML are formatted as "NNNNN - MMMM" but your regex is `r"^\d{5}-\d{4}$`. That lack of a space means that your parser assumes it's not a Zip+4 value, so it always returns the first 5 digits. This would work better:

Python code:

def parse_zip(zip):
    """Given a zip code, remove spaces and strip off trailing dashes."""
    return zip.replace(' ','').strip('-')

I think there's a lot of unnecessary None-checking in this code. For instance parse_zip shouldn't be receiving None objects, you should be eliminating that possibility sooner. This is kind of a nitpicky suggestion, it's fine to use None but I personally don't seeing it used this often

In the tcv parser I think you were meant to infer that sometimes an Organization label is actually in the "last" column; it looks like you're parsing "Central Trading Company Ltd." as a last name for a person with no first name.

If a middle name is "N/M/N" you should probably just not include that value. E.g. "no middle name" == "N/A" == don't bother keeping this vlaue

Probably don't want to leave a "Hello, world!" docstring in your code

# ? Apr 29, 2024 08:08

Fender: Oct 9, 2000; Mechanical Bunny Rabbits!; Dinosaur Gum

Thanks both of you, that was very helpful. And yeah, this was all done in public. With a bit of searching you can go see the PRs and the code I was up against. There are a lot of people writing Python so much worse than mine (and a few writing Python so much better).

There is some good feedback in here. Some of it I feel would be appropriate for a conversation. Like, why I check for None a lot. It's just a habit I learned in my last role where we parsed a lot of unreliable data sourced from scrapers. I would happily explain my position there. And stuff like the Address class having that weird view dict method, would love to chat about that. That is like that bc while they never said it, the examples both showed the printed json ignoring any keys without values, so I went out of my way to provide that since all the example data had different fields present/missing depending on the format. I was really getting in my own head by hour number 4.

But there is much more mechanically wrong than I thought. You both also pointed out a really neat bug in my xml parsing. It's working... but it shouldn't be. And the zip code bug, I didn't catch that one either. That was just them coming in formats of 00000-0000, 00000, 00000-, and 00000 -. I totally missed the ones coming in as 00000 - 0000, so my code is slicing those down to 5 digits. Needed a better solution there.

Again, thank you both. I really appreciate the time you took on this.

Also, I enjoyed the comments on my handling the two organization/last tabs in the tsv file. That was one of the last data issues to kick me in the pants and was super annoying. The handling you see fits the pattern in the data, where data that belonged in the organization tab was often (but not always) in the lastname tab. Just another gotcha.

My primary take-aways so far:

The mechanical code feedback from here. Good stuff.
If you have 24h, just chill and submit it the next morning. I would have caught so many issues that way.
The 1-commit thing messed me up. I should have made a branch, and then squashed and merged back into main to get it down to one commit. I wrote & fixed so many bugs by not having my usual Git habits. I am sure that's how "strict" got removed from my tsv import code.
The ticket/readme broke my brain (like I'm sure it was designed to). I should have written a list of tasks/requirements to be my guide. Instead I kept going back to the readme, and eventually fixated on the bullet points at the end until I totally forgot about the paragraph at the top.

Fender fucked around with this message at 19:05 on Apr 29, 2024

# ? Apr 29, 2024 19:00

Chillmatic: Jul 25, 2003; always seeking to survive and flourish

Kind of a weird request, but I'm trying to think of a good approach for batch-removing the theme song from a show I have on my plex server, but audio fingerprinting (with a database etc.) seems like major overkill for a project like this. Is there any simpler approach that could work that I'm just not thinking of?

# ? Apr 30, 2024 13:48

Jabor: Jul 16, 2010; #1 Loser at SpaceChem

If the opening theme is always at the same time into the video, you could just clip it out based on that timestamp.

If it's at an unpredictable time (e.g. there's a cold open of varying length before the titles) - well, you're going to have to figure out where the opening theme is on a video-by-video basis. It sounds like you'd pretty much need to do some audio and/or video recognition to make that happen.

# ? Apr 30, 2024 13:55

Hed: Mar 31, 2004; Fun Shoe

Chillmatic posted:

Kind of a weird request, but I'm trying to think of a good approach for batch-removing the theme song from a show I have on my plex server, but audio fingerprinting (with a database etc.) seems like major overkill for a project like this. Is there any simpler approach that could work that I'm just not thinking of?

I don't want to dissuade from what could be a cool project if you're trying to do something neat in Python, but for this case I would simply pay for Plex Pass and get the "Skip Intro" button from then on.

# ? Apr 30, 2024 14:37

Chillmatic: Jul 25, 2003; always seeking to survive and flourish

Hed posted:

I don't want to dissuade from what could be a cool project if you're trying to do something neat in Python, but for this case I would simply pay for Plex Pass and get the "Skip Intro" button from then on.

Oh, I love Plex Pass. The issue is that when setting up TV "Channels" with Ersatz there's no option to skip intros because the broadcast is 'live', so the only option to never have to hear that drat music is cutting the files themselves. I'm really, really trying to avoid doing this by hand because there's several seasons and apparently zero simple video editors that would make that a quick/simple process.

# ? Apr 30, 2024 14:51

Hed: Mar 31, 2004; Fun Shoe

That makes sense! In that case I would spend some time researching how you can pull the fingerprinting out of the Plex database -- it has to know time offsets somehow! Then feed those timestamps into your crop commands for ffmpeg or whatever you're using.

edit: found an old schema I'd look at metadata_item_clusters or something else that has duration/starts_at/ends_at columns to see if it's in there before trying to analyze your own files.

Hed fucked around with this message at 16:15 on Apr 30, 2024

# ? Apr 30, 2024 15:47

The March Hare: Oct 15, 2006; _{Je r�ve d'un}
Wayne's World 3; Buglord

Chillmatic posted:

Oh, I love Plex Pass. The issue is that when setting up TV "Channels" with Ersatz there's no option to skip intros because the broadcast is 'live', so the only option to never have to hear that drat music is cutting the files themselves. I'm really, really trying to avoid doing this by hand because there's several seasons and apparently zero simple video editors that would make that a quick/simple process.

"Simple" being relative, FFMPEG can do this if you know the timestamps.

# ? Apr 30, 2024 19:42

Adbot: ADBOT LOVES YOU

# ? Jun 5, 2024 08:49

Chillmatic: Jul 25, 2003; always seeking to survive and flourish

The March Hare posted:

"Simple" being relative, FFMPEG can do this if you know the timestamps.

I wrote a script to get all the xml files from the show in question, and then iterate through them using the intro marker data in those xml files to decide where to make the edits using ffmpeg, except now I'm running into losing the subtitles from the original files, and converting them is a whole thing because they're PGS format. :negative:

# ? Apr 30, 2024 19:57

The Something Awful Forums > Discussion > Serious Hardware/Software Crap > The Cavern of COBOL > Python

«‹›231 »