Seventh Arrow
Jan 26, 2005

Also, if you're going to get into data engineering, it'd probably be a good idea to get a certification for one of the three big cloud platforms: Amazon's AWS, Microsoft's Azure, or Google's GCP.

However, don't let job interviewers intimidate you if you have an AWS cert and they're an Azure shop. It's all the same stuff with different labels slapped on, and you need to show them that.

Seventh Arrow
Jan 26, 2005

If there are two things you need to have a lock on in the DE space, it's Python and SQL. Here are some Python learning resources I posted in another thread:

Our humble SA thread

https://forums.somethingawful.com/showthread.php?threadid=3812541

Reddit

https://www.reddit.com/r/inventwithpython/
https://www.reddit.com/r/learnpython/
https://www.reddit.com/r/Python/
https://www.reddit.com/r/PythonNoobs/

Books

Python Crash Course
Automate the Boring Stuff with Python
Think Python

YouTube

Indently
Tech With Tim
mCoding
ArjanCodes
Python Simplified

Online Learning

Codecademy
Real Python
The Official Guide
DataCamp

Discussion

Official Python IRC
Discord

Brain-busters

LeetCode
Practice Python
HackerRank
Edabit

Seventh Arrow
Jan 26, 2005

The easiest way to do anything with tabular data in Python is to use pandas DataFrames. Import your CSV into a DataFrame, then export it to SQLite. Boom, done. There are a zillion pandas tutorials out there; it won't be hard if you search for things like "import csv into pandas dataframe" or "pandas dataframe export to database", etc.
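
A minimal sketch of that flow, assuming pandas is installed (the file, table, and database names here are just placeholders):

code:
import sqlite3

import pandas as pd

# Read the CSV into a DataFrame (hypothetical file name)
df = pd.read_csv("input.csv")

# Write the DataFrame out as a SQLite table in one call
con = sqlite3.connect("output.db")
df.to_sql("my_table", con, if_exists="replace", index=False)
con.close()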

Seventh Arrow
Jan 26, 2005

Also, when I originally had thoughts of creating a data engineering thread, the subtitle was going to be "do you want that in CSV, CSV, or CSV?"

Seventh Arrow
Jan 26, 2005

Right now I'm working with the three technologies I hate the most: Salesforce, Informatica, and Oracle. However, the team is great so I don't mind. They're in the process of migrating Informatica to GCP, so I'm currently working on getting my Google cert. I've also let my python and SQL skills lag, so I need to pick that up again in case I have to do interviews when my contract is done.

Seventh Arrow
Jan 26, 2005

Currently we're using Oracle as a dumping ground for Salesforce data so fortunately I don't have to interact with it much - do some queries, dump metadata to a spreadsheet, that kind of thing.

What really does drive me up the wall, however, is the query language that Salesforce uses...specifically, how limited it is. You can't do joins, you can't use aliases, you can't do subqueries, no window functions. You can't even do "select *", lol. You have to include each and every field/column by name instead. You can't even do "distinct" on specific fields/columns. Someone wanted me to do some aggregate functions on specific fields with a where clause, so I had to export the whole object to SQL and query it there ಠ_ಠ
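
That's the same pandas-to-SQLite trick from a few posts up. Roughly what the workaround looked like, with the object, field, and file names all invented for illustration:

code:
import sqlite3

import pandas as pd

# Load the exported Salesforce object (hypothetical export file)
df = pd.read_csv("opportunity_export.csv")

# Dump it into SQLite so we can run real SQL against it
con = sqlite3.connect("sf_dump.db")
df.to_sql("opportunity", con, if_exists="replace", index=False)

# The aggregate-with-a-WHERE-clause query that SOQL wouldn't cooperate on
# (assumes CloseDate was exported as an ISO-format date string)
query = """
SELECT StageName,
       COUNT(*)    AS n_rows,
       AVG(Amount) AS avg_amount
FROM opportunity
WHERE CloseDate >= '2023-01-01'
GROUP BY StageName
"""
print(pd.read_sql(query, con))
con.close()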

Seventh Arrow
Jan 26, 2005

Ok so my contract with TELUS ended abruptly (because of budget lockdowns, not because I deleted production tables or something) so I'm on the hunt for a new job. Here is my resume, feel free to laugh and make fun of it but also offer critiques.

Mind you, I will usually tailor it to the job I'm applying for, but it's fairly feature-complete as-is (I think).

My lack of experience with Airflow, DBT, and Snowflake is a bit of a weakness, but I'm hoping to make up for it with project work. I'm also hoping to get a more recent cloud certification than my expired AWS thang (maybe Azure).

Seventh Arrow
Jan 26, 2005

There was a huge upswell in demand for bootcamps back when every tech company was thirsty for more data scientists, but nowadays it seems like it's largely a wasteland of hucksters and profiteers. If there's a bootcamp that you think will teach you something that you need to know, then sure, proceed with caution. But under no circumstances should you go to a bootcamp under the impression that it'll give you resume clout. It won't.

A degree in computer science will probably help you understand data structures and algorithms, but there are other ways of learning those things. The resume clout thing is similar, I think - every job opening says they want some sort of CompSci diploma, but I've had loads of interviews with companies that said exactly that, and I don't have a degree.

Really, I think that having lots of personal projects is what makes your application pop. I would say that many (but not all) of them should focus on ETL stuff: build a data pipeline with some combination of code, cloud, and some of the toys that every employer seems to want - see the sketch below. If you're stuck for ideas, check out Darshil Parmar and Seattle Data Guy for starters.
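
To make "data pipeline" concrete, here's a toy DAG skeleton of the shape such a project might take, assuming Airflow 2.x is installed - a sketch only, with the task logic stubbed out and every name made up:

code:
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Pull raw data (an API, a CSV dump, whatever) into a staging area
    ...

def transform():
    # Clean and reshape the staged data, e.g. with pandas
    ...

def load():
    # Write the result to the destination (warehouse, SQLite, etc.)
    ...

with DAG(
    dag_id="toy_etl_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_transform >> t_load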

Seventh Arrow
Jan 26, 2005

To any Spark experts whose eyeballs may be glancing at this thread:

I'm applying for a job and Spark is one of the required skills, but I'm fairly rusty. They sprang an assignment on me where they wanted me to take the MovieLens dataset and calculate:

- The most common tag for a movie title and
- The most common genre rated by a user

After lots of time on Stack Overflow and YouTube, this is the script I came up with. At first, I had something much simpler that just did the assigned task, but I figured I would also add comments, error checking, and unit testing, because rumor has it that this is what professionals actually do. I've tested it and know it works, but I'm wondering if it's a bit overboard. Feel free to roast.

code:
# Set up the Google Colab environment for pyspark, including integration with Google Drive

from google.colab import drive
drive.mount('/content/drive')

!pip install pyspark

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, row_number, split, explode
from pyspark.sql.window import Window
import unittest

# Initialize the Spark Session for Google Colab
def initialize_spark_session():
    """Initialize and return a Spark session configured for Google Colab"""
    return SparkSession.builder\
        .master("local")\
        .appName("Colab")\
        .config('spark.ui.port', '4050')\
        .getOrCreate()

"""
This code will ingest data from the MovieLens 20M dataset. The goal will be to use Spark to find the following:

- The most common tag for each movie title:
  Utilizing Spark's DataFrame operations, the script joins the movies dataset with the tags dataset on the movie ID.
  After joining, it aggregates tag data by movie title to count occurrences of each tag, then employs a window function
  to rank tags for each movie. The most frequent tag for each movie is identified and selected for display.

- The most common genre rated by each user:
  Similarly, the script joins the movies dataset, which includes genre information, with the ratings dataset on the movie ID.
  It then groups the data by user ID and genre to count how many times each user has rated movies of each genre.
  A window function is again used to rank the genres for each user based on these counts. The top genre for each user
  is then extracted and presented.
"""

def load_dataset(spark, file_path, has_header=True, infer_schema=True):
    """Load the need files, with some error checking for good measure"""
    try:
        df = spark.read.csv(file_path, header=has_header, inferSchema=infer_schema)
        if df.head(1):
            return df
        else:
            raise ValueError("DataFrame is empty, check the dataset or path.")
    except Exception as e:
        raise IOError(f"Failed to load data: {e}")

def calculate_most_common_tag(movies_df, tags_df):
    """Calculates the most common tag for each movie title."""
    try:
        movies_tags_df = movies_df.join(tags_df, "movieId")
        tag_counts = movies_tags_df.groupBy("title", "tag").agg(count("tag").alias("tag_count"))
        window_spec = Window.partitionBy("title").orderBy(col("tag_count").desc())
        most_common_tags = tag_counts.withColumn("rank", row_number().over(window_spec)) \
                             .filter(col("rank") == 1) \
                             .drop("rank") \
                             .orderBy(col("tag_count").desc())
        return most_common_tags
    except Exception as e:
        raise RuntimeError(f"Error calculating the most common tag: {e}")

def calculate_most_common_genre(ratings_df, movies_df):
    """Calculates the most common genre rated by each user."""
    try:
        movies_df = movies_df.withColumn("genre", explode(split(col("genres"), "[|]")))
        ratings_genres_df = ratings_df.join(movies_df, "movieId")
        genre_counts = ratings_genres_df.groupBy("userId", "genre").agg(count("genre").alias("genre_count"))
        window_spec = Window.partitionBy("userId").orderBy(col("genre_count").desc())
        most_common_genres = genre_counts.withColumn("rank", row_number().over(window_spec)) \
                                 .filter(col("rank") == 1) \
                                 .drop("rank") \
                                 .orderBy(col("genre_count").desc())
        return most_common_genres
    except Exception as e:
        raise RuntimeError(f"Error calculating the most common genre: {e}")

def main():
    # Start the Spark session
    spark = initialize_spark_session()
    # Load CSV files from MovieLens
    movies_df = load_dataset(spark, "/content/drive/My Drive/spark_project/movies.csv")
    tags_df = load_dataset(spark, "/content/drive/My Drive/spark_project/tags.csv")
    ratings_df = load_dataset(spark, "/content/drive/My Drive/spark_project/ratings.csv")

    # Perform analysis
    most_common_tag = calculate_most_common_tag(movies_df, tags_df)
    most_common_genre = calculate_most_common_genre(ratings_df, movies_df)

    # Displaying results
    print("Most Common Tag for Each Movie Title:")
    most_common_tag.show()
    print("Most Common Genre Rated by User:")
    most_common_genre.show()

class TestMovieLensAnalysis(unittest.TestCase):
    """Run some unit tests against a truncated set of files in a dedicated testing directory"""
    def setUp(self):
        self.spark = initialize_spark_session()
        self.movies_path = "/content/drive/My Drive/spark_testing/movies.csv"
        self.tags_path = "/content/drive/My Drive/spark_testing/tags.csv"
        self.ratings_path = "/content/drive/My Drive/spark_testing/ratings.csv"

    def test_load_dataset(self):
        # Use a nonexistent path to make load_dataset fail
        with self.assertRaises(IOError):
            load_dataset(self.spark, "fakepath.csv")

    def test_calculate_most_common_tag(self):
        movies_df = load_dataset(self.spark, self.movies_path)
        tags_df = load_dataset(self.spark, self.tags_path)
        result_df = calculate_most_common_tag(movies_df, tags_df)
        self.assertIsNotNone(result_df.head(1))

    def test_calculate_most_common_genre(self):
        ratings_df = load_dataset(self.spark, self.ratings_path)
        movies_df = load_dataset(self.spark, self.movies_path)
        result_df = calculate_most_common_genre(ratings_df, movies_df)
        self.assertIsNotNone(result_df.head(1))

if __name__ == "__main__":
    main()
    unittest.main(argv=['first-arg-is-ignored'], exit=False)
