ConanThe3rd
Mar 27, 2009
In a world with Copilot in it, any hint that a company thinks like that is a red flag.


boofhead
Feb 18, 2021

I think I posted in this thread when I was doing the coding challenge for my current job a couple years ago, freaking out because the key logic of the challenge was a knapsack problem that expected dynamic programming, which I never studied in my ancient history degree. I came up with something dodgy that was "good enough for the requirements as they stand", added some explanations for how I'd clarify the specs and actual use cases and so forth, and the only feedback I got (apart from being hired, which I guess is the main thing) was that I needed to add more and better comments.

And then I saw the leads/hiring devs code and i gotta say, people in glass houses buddy

e: or maybe it was in the getting a job thread, or general programming, I don't remember. But it is very stressful and imposter syndrome-y, until you actually see what 99% of your actual workload in a real job is like. Then you just have imposter syndrome for "literally any job but the one I have currently", which is an improvement at least

boofhead fucked around with this message at 08:25 on Apr 19, 2023

SurgicalOntologist
Jun 17, 2004

boofhead posted:

i, too, couldnt help myself

Yours will get the wrong answer if the input list has 0 in it. :viggo:

boofhead
Feb 18, 2021

SurgicalOntologist posted:

Yours will get the wrong answer if the input list has 0 in it. :viggo:

Please create a bug report ticket and assign it to the garbage bin icon and I'll add an "if n == 0 or" before EOY

But yea I dragged myself out of bed 3 minutes previously to put on coffee and take my ADHD medication, I am surprised it even worked for the sample list without accidentally formatting my C:

Seventh Arrow
Jan 26, 2005

boofhead posted:


e: ive never done like a live coding test beyond 5 minutes of basic SQL maybe 6 years ago for a PM position, you can still google and check documentation, right?

It totally, completely depends on the interviewer. I had one recently where I could do a search on the SQL test but not on the python test :confused: the SQL test was pretty easy too, although the time limit was 20 minutes and there were 5 queries of increasing difficulty. I don't think there was an expectation of finishing all of them. The python one started off with softball questions like giving the differences between a list, a dictionary, and a tuple. The coding part had a list of numbers and I had to find the index of the second 5. I can't remember if he wanted the index of the second 5 specifically, or of any 5 that might occur after the first one. Anyways, my answer was a mess of nested for loops and enumerate() abuse but I more or less got it.

In theory live coding tests are supposed to be about seeing your logic, communication and understanding the general concepts, but I'm pretty sure the person who gets the job is the one who nails their puzzle even if they're an antisocial turbonerd.

edit: I just got called for a second interview with the company that gave me the odd/even test, so I must have done something right. Thanks for low standards!

Seventh Arrow fucked around with this message at 15:45 on Apr 19, 2023

ConanThe3rd
Mar 27, 2009
It might even be that they're not testing for aptitude as much as ability to soldier on in the face of demands.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I've taken interview training at one of the BigTech companies and the bigger point in an interview should be how you're thinking about the problem, not necessarily the right answer.

Like, if an interviewee just silently slams out an algorithm correctly and refuses to elaborate, that's a worse interview than someone who struggles a little to remember a random standard library function but clearly talks through their approach and why they'd do X over Y, because the point of an interview is to evaluate someone's skills. Anyone can just google stuff, so the 'how do you think about a problem' is much more important.

That being said, it's not like interviewers are a monolith and I guarantee there's people at my company who would take option A over B in the above example, whereas I'd either think that they were way overskilled for the question or just googling it on the side if we weren't in the same room. Ninja Edit: I wouldn't fail the person in the above example, but I'd definitely lean into 'alright, let's review your solution, walk me through how you arrived at this'.

Seventh Arrow posted:

In theory live coding tests are supposed to be about seeing your logic, communication and understanding the general concepts, but I'm pretty sure the person who gets the job is the one who nails their puzzle even if they're an antisocial turbonerd.

Yeah this part specifically should be wrong for any company that is actually thinking about their interview process and not just cargo-culting practices they half-read about at FAANG

Edit:
One of the better interview questions I've seen around (and got a variation of myself) was 'Look over this code, tell me what it does, and explain how you'd improve it' because it's pretty open ended and gives a lot of space to discuss actual code behavior/etc, and probably tells you more about how someone actually writes code and approaches being a developer than 'invert this list on a whiteboard plz'.
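As a made-up illustration (not a real snippet from any interview I've run), the code for that kind of question can be something short that works but leaves plenty to talk about:

Python code:
# hypothetical review snippet - runs fine, but invites discussion:
# index-based loop, "== True", building a list you could write as a comprehension
def get_active_users(users):
    active_users = []
    for i in range(len(users)):
        if users[i]["active"] == True:
            active_users.append(users[i])
    return active_users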

Falcon2001 fucked around with this message at 22:01 on Apr 20, 2023

QuarkJets
Sep 8, 2008

Falcon2001 posted:

Edit:
One of the better interview questions I've seen around (and got a variation of myself) was 'Look over this code, tell me what it does, and explain how you'd improve it' because it's pretty open ended and gives a lot of space to discuss actual code behavior/etc, and probably tells you more about how someone actually writes code and approaches being a developer than 'invert this list on a whiteboard plz'.

I love this because it is basically asking you if you know how to shitpost with tact

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

I love this because it is basically asking you if you know how to shitpost with tact

This has been basically 90% of my professional career so it probably should be part of any interview.

Seventh Arrow
Jan 26, 2005

I had a live coding test for a Walmart interview today and I was initially terrified because the recruiter mentioned brushing up on data structures and algorithms. Here I was thinking that the solution was going to involve pan-dimensional hashmaps and backwards-dynamic programming and all that, but it was pretty simple:

quote:

# you are given an array of integers, write a method to return boolean if there are 2 numbers x and y in the array such that x=2*y, otherwise return false

My needlessly elaborate answer was:

code:
def value_of_x(lst):
    list_times_two = []
    for y in lst:
        times_two = y * 2
        list_times_two.append(times_two)
    for x in list_times_two:
        if x in lst:
            return True
    else:
        return False
I knew that it could be more efficient, especially due to all the iteration going on, but he was willing to accept it. He started talking about big O notation and it took all of earthly power to keep from zoning out and going into a coma.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Seventh Arrow posted:

I had a live coding test for a Walmart interview today and I was initially terrified because the recruiter mentioned brushing up on data structures and algorithms. Here I was thinking that the solution was going to involve pan-dimensional hashmaps and backwards-dynamic programming and all that, but it was pretty simple:

My needlessly elaborate answer was:

code:
def value_of_x(lst):
    list_times_two = []
    for y in lst:
        times_two = y * 2
        list_times_two.append(times_two)
    for x in list_times_two:
        if x in lst:
            return True
    else:
        return False
I knew that it could be more efficient, especially due to all the iteration going on, but he was willing to accept it. He started talking about big O notation and it took all of earthly power to keep from zoning out and going into a coma.

Here's my smarmy one-line take on it, thanks to the Python standard library.

Python code:
from itertools import permutations
from typing import List

def value_of_x(lst: List[int]) -> bool:
    return any([bool(y == x*2) for x,y in permutations(lst, 2)])
Technically you could make it slightly faster by using combinations instead and making sure that X is lower than Y.
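For illustration, a combinations-based sketch of that idea could look like the following; note that combinations only yields each pair once, so you end up checking both directions rather than relying on element order:

Python code:
from itertools import combinations
from typing import List

def value_of_x(lst: List[int]) -> bool:
    # each unordered pair appears once, so test both x == 2*y and y == 2*x
    return any(x == 2*y or y == 2*x for x, y in combinations(lst, 2))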

An approach without itertools could look like this:

Python code:
from typing import List

def value_of_x(lst: List[int]) -> bool:
    for x in lst:
        for y in lst:
            if x == (y*2):
                return True
    return False
A way I'd talk about this, for context, would be that the first approach reuses the standard library and is reasonably straightforward. The second approach has a possible advantage in time in that it returns True as soon as any valid case is met, as opposed to iterating over all possible options. Both are also reasonably naive approaches; I would be up front about it and would probably spend any additional time thinking out loud about possible improvements.

QuarkJets posted:

What was the optimal solution according to the interviewer? I might turn the list into a set so that the membership testing is faster, caveating that that's actually a poor answer if the list size is very small

This is actually a good example of another thing: you probably shouldn't be afraid in an interview to ask questions about the requirements (because requirements defining is basically half of your job as a developer) - for example, are the values unique? (as I assumed in my response) If so, passing to a set won't do anything useful, but otherwise it's a very useful starting point before trying permutations or combinations/etc.

Falcon2001 fucked around with this message at 04:01 on Apr 21, 2023

QuarkJets
Sep 8, 2008

What was the optimal solution according to the interviewer? I might turn the list into a set so that the membership testing is faster, caveating that that's actually a poor answer if the list size is very small

Precambrian Video Games
Aug 19, 2002



Yeah I probably would have said:

Python code:
def has_double(lst):
    x = set(lst)
    return len(x.intersection({2*y for y in x})) > 0
... and sweated about whether it missed some corner case.

Zugzwang
Jan 2, 2005

You have a kind of sick desperation in your laugh.


Ramrod XTreme

Falcon2001 posted:

Here's my smarmy one-line take on it, thanks to the Python standard library.

Python code:
from itertools import permutations
from typing import List

def value_of_x(lst: List[int]) -> bool:
    return any([bool(y == x*2) for x,y in permutations(lst, 2)])
You could also cut the square brackets from any(). This way you get a lazy generator and don't first have to instantiate a list of all those permutations/combinations.
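i.e. something like this:

Python code:
from itertools import permutations
from typing import List

def value_of_x(lst: List[int]) -> bool:
    # generator expression: any() can short-circuit without building the full list first
    return any(y == x*2 for x, y in permutations(lst, 2))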

Anyway itertools owns and I am constantly trying to learn more about how to apply its dark magic.

Seventh Arrow
Jan 26, 2005

Falcon2001 posted:

An approach without itertools could look like this:

Python code:
from typing import List

def value_of_x(lst: List[int]) -> bool:
    for x in lst:
        for y in lst:
            if x == (y*2):
                return True
    return False

Oh I like this, it's nice and snappy. I usually send interviewers a thank you note, maybe I should see what he thinks of this solution.


QuarkJets posted:

What was the optimal solution according to the interviewer? I might turn the list into a set so that the membership testing is faster, caveating that that's actually a poor answer if the list size is very small

While he was going on about big O notation, he only kind of hinted at what sort of not-as-much-iteration approach he would have taken. Then we started joking about how hackerrank lists Fortran as one of its available coding languages.

QuarkJets
Sep 8, 2008

eXXon posted:

Yeah I probably would have said:

Python code:
def has_double(lst):
    x = set(lst)
    return len(x.intersection({2*y for y in x})) > 0
... and sweated about whether it missed some corner case.

Now that I have a keyboard instead of a phone, this was going to be my take:

Python code:
def contains_doubled_value(input_list):
    input_set = set(input_list)
    return any(2*x in input_set for x in input_set)
This is clean and fairly fast: it only iterates once over the entire list, and because a generator expression is used it can finish the double-checking early.

But this is a faster answer I think:
Python code:
import numba as nb

@nb.njit
def contains_doubled_value(input_list: list[int]) -> bool:
    unique_values = set()
    for x in input_list:
        if 2*x in unique_values:
            return True
        unique_values.add(x)

    for x in unique_values:
        if 2*x in unique_values:
            return True
    return False

contains_doubled_value([1, 1, 1, 1, 2, 4, 1, 1])  # returns True
- Makes use of a compiler to make the looping fast during set creation
- We can quit iteration early during set creation if the list order happens to expedite finding the answer
- The second iteration is only over the values in the set, so we'll never check the same value twice

Personally I prefer the much cleaner set + generator expression, this numba solution is ugly

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I will also note that basically everything someone posted here is probably fine code, as long as it accomplishes the requirements.

There's a certain trap in development where you drift toward code golf or get obsessed with the 'cleverest' answer and end up writing code that is hard to read. Code is read much more than it's written, so oftentimes "suboptimal" code from a performance perspective is perfectly fine if the weakest member of your team can quickly read and understand what it's doing. For example, you might gain 24ms over 10,000 iterations with solution A vs Solution B, but...if you only do this action once a week, why bother optimizing it, especially if Solution B is much cleaner to read and parse? Like in the previous response, I'd pick the first option every single time unless I had a very good reason to have the crazier one.

I think Python might lend itself more towards 'weird clever solutions' because of how much control it gives you; I had a coworker on my last team who had written some incredibly bizarre code making use of a bunch of oddities of __new__ and other magic methods because it was 'so convenient', but I had no idea what he was doing, and I generally don't struggle with picking up on code architecture given a little bit of time.

Edit: Does anyone else know people that seem obsessed with non-Python design patterns? I have a couple coworkers that clearly used Java previously, because the first thing they do in any class is set up a bunch of getters and setters (with zero side effects) and private variables exposed through them. When I poked at it the guy was like 'well this doesn't evaluate at runtime' and I'm like 'yeah uh you're just setting an empty list on a class that only gets instantiated once, is this actually a problem?'. Am I missing something here where that's actually useful?

Again, just bare getter/setter properties, nothing fancy or doing fun side effect stuff.
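For anyone who hasn't run into it, the pattern I'm complaining about vs. the plainer alternative looks roughly like this (class and attribute names made up):

Python code:
class JobConfig:
    """Java-ish style: a private attribute hidden behind a do-nothing property pair."""

    def __init__(self):
        self._retries = 0

    @property
    def retries(self):
        return self._retries

    @retries.setter
    def retries(self, value):
        self._retries = value


class PlainJobConfig:
    """The usual Python take: just use the attribute, and only swap in a
    @property later if validation or side effects ever become necessary."""

    def __init__(self):
        self.retries = 0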

Falcon2001 fucked around with this message at 07:49 on Apr 21, 2023

QuarkJets
Sep 8, 2008

Yeah I often see code where I think the developer's background was either Java, Matlab, or (lol) Fortran. The Fortran ones are funny because they're the closest to being pythonic but use the most god awful bullshit variable and function names imaginable, which all in all isn't so bad, at least they're using broadcasting syntax a lot of the time. Matlab is like Python's idiot uncle who keeps getting DUIs but you can at least see the family resemblance if you squint. The Java examples just look completely alien, like you have to wonder whether the developer is being paid per character in their commit history

pmchem
Jan 22, 2010


Some of my best work used my Python wrapper just to call my Fortran code.

Foxfire_
Nov 8, 2010

Unlike Python, Java is optimizable, so trivial getters/setters are (nearly) entirely free. It also has binaries and changing something from an attribute to a property (function call) is an ABI-breaking change. Most classes are never going to need to be changed without forcing their users to recompile, but people propertyize reflexively as a habit anyway

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
More relevantly for Java, changing from a raw field to a property getter is a source-breaking change, so you also need to go through and change every other file that touches that field.

spiritual bypass
Feb 19, 2008

Grimey Drawer
Do IDEs include a refactoring for that?

QuarkJets
Sep 8, 2008

Foxfire_ posted:

Unlike Python, Java is optimizable, so trivial getters/setters are (nearly) entirely free. It also has binaries and changing something from an attribute to a property (function call) is an ABI-breaking change. Most classes are never going to need to be changed without forcing their users to recompile, but people propertyize reflexively as a habit anyway

I care more about having a bunch of getters and setters that just expose an attribute than performance, that's a lot of useless cruft code. I love being able to flexibly create properties when they're needed but don't want to do it sooner than that

QuarkJets fucked around with this message at 01:57 on Apr 22, 2023

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

I care more about having a bunch of getters and setters that just expose an attribute than performance, that's a lot of useless cruft code. I love being able to flexibly create properties when they're needed but don't want to do it sooner than that

Yeah this was my issue - basically a page or two of code that does nothing at all. Like at least it's in a nice tidy block of useless, but like...why. I wish Black had a detector for it and could just remove it.

Falcon2001 fucked around with this message at 07:45 on Apr 22, 2023

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.
Does anyone have a good resource for dask that can be understood by an idiot? I've been struggling with the library for way too long and I still have no idea what I'm doing.

It's really frustrating to think you're running code in parallel, only to find out that you're not actually using all the threads in your processor unless you set the config to either 'multiprocessing' or 'distributed' and I don't know the difference between them. All I know is that 'distributed' runs faster than 'multiprocessing' but also causes my computer to restart with larger files. It also produces cryptic messages like this:

code:
distributed.nanny - WARNING - Worker process still alive after 4 seconds, killing
Why'd you have to kill my worker? He was only four seconds old!

Anyway I feel like I need a better foundation and reading the documentation is getting me nowhere.

QuarkJets
Sep 8, 2008

It's really impossible to say without knowing more about your configuration and your workload. You might consider going through some basic tutorials. Have you tried watching these videos? https://www.dask.org/get-started

In HPC parlance, "distributed" computing means using two or more workers simultaneously to solve a task. These workers may be on the same node (a computer) or spread across multiple nodes (a cluster of computers). "Multiprocessing" falls under the umbrella of "distributed" computing, it means "this node has multiple processes working together at the same time to accomplish some task". Dask has a "distributed" scheduler that exists to facilitate this; you describe the tasks and the kinds of resources that you want to use to run those tasks, and then the scheduler makes that happen. Describing the resources that the scheduler is allowed to use is an essential part of the process - you may have 4 cores in one node and 6 cores in another, but the scheduler needs to be told about those resources and potentially how to access them. Dask also has a "multiprocessing" scheduler that's one step simpler than the "distributed" scheduler, it's just built on top of ProcessPoolExecutor from Python's concurrent.futures module. If you only have 1 node (1 computer with however many cores) then the multiprocessing scheduler and the distributed scheduler are doing roughly the same thing. The docs go through some of the differences between these and other schedulers: https://docs.dask.org/en/stable/scheduler-overview.html.
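For a bare-bones illustration of switching schedulers on a toy task (not tuned to any real workload), something like this:

Python code:
import dask
from dask import delayed
from dask.distributed import Client


@delayed
def double(x):
    return 2 * x


results = [double(i) for i in range(10)]

# Single-threaded in the current process - handy for debugging
dask.compute(*results, scheduler='synchronous')

# Local process pool via the multiprocessing scheduler
dask.compute(*results, scheduler='processes', num_workers=4)

# The distributed scheduler; Client() with no address spins up a LocalCluster
# on this machine, and the active client becomes the default scheduler
client = Client(n_workers=4, threads_per_worker=1)
dask.compute(*results)
client.close()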

It's very easy to use Dask for synchronous workloads, and it sounds like that's what you wound up doing by accident - but that's actually an important step in developing a distributed application: you should design simple tests that exercise your workflow in a synchronous way, then verify that you get the same results when you go to a distributed implementation.

Deadite
Aug 30, 2003

A fat guy, a watermelon, and a stack of magazines?
Family.
So here's my dask test case, and it's a little misleading because when I check the times the compute() without a LocalCluster/Client runs much, much faster for the example than it does with the program I'm actually building. The real program runs faster without a LocalCluster until the file gets to be around 1GB in size, then the LocalCluster distributed compute starts being faster. I can't seem to recreate some of the errors I'm seeing with large files with the example code though. It does top out at 5 million rows before I get this error, which I don't get with my real program. This whole thing is so confusing.

code:
ValueError: 3713192179 exceeds max_bin_len(2147483647)
Python code:
import pandas as pd
import numpy as np
from datetime import datetime
from dask import delayed, compute
from dask.distributed import Client, LocalCluster
import multiprocessing
import platform

if platform.system().lower() == 'windows':
    import multiprocessing.popen_spawn_win32
else:
    import multiprocessing.popen_spawn_posix
    
cols = 100

for rows in range(1000000,11000000,1000000):
    # Create dataframe (single column with integers that match row number)
    df = pd.DataFrame(np.arange(rows).reshape(rows,1))
    df.columns = ['col1']

    # Duplicate that column to number set in 'cols'
    for i in range(cols):
        df['col'+str(i)] = df['col1']
        
        
    for splits in range(1,5):
        
        # Split dataframe by columns
        k, m = divmod(len(df.columns), splits)
        list_of_split_dfs = list((df[df.columns[i*k+min(i, m):(i+1)*k+min(i+1, m)]] for i in range(splits)))

        # Function that accepts dataframe and returns a dictionary 
        # with column name as key, and a dictionary with min/max as a value
        def find_minmax(df):

            results = {}
            for col in df.columns:
                results[col] = {}
                results[col]['min'] = df[col].min()
                results[col]['max'] = df[col].max()

            return results

        # Function to combine the list of dictionaries into one
        def combine_dicts(list_of_dicts):
            results = {}
            for d in list_of_dicts:
                results.update(d)
            return results

        # Create a list of delayed find_minmax functions
        list_of_dicts = []
        for df in list_of_split_dfs:
            list_of_dicts.append(delayed(find_minmax)(df))

        # Submit the delayed find_minmax functions to combine_dicts function
        combined_dictionary = delayed(combine_dicts)(list_of_dicts)
        start = datetime.now()
        combined_dictionary.compute()
        time_base = (datetime.now()-start).total_seconds()

        # Submit the function to a multiprocessing client
        start = datetime.now()
        with LocalCluster(n_workers=splits, dashboard_address=None) as cluster, Client(cluster) as client:
            combined_dictionary.compute(scheduler='multiprocessing')
        time_mp = (datetime.now()-start).total_seconds()

        # Submit the function to a distributed client
        start = datetime.now()
        with LocalCluster(n_workers=splits, dashboard_address=None) as cluster, Client(cluster) as client:
            combined_dictionary.compute(scheduler='distributed')
        time_dist = (datetime.now()-start).total_seconds()

        line = ', '.join([str(rows),str(cols),str(splits),str(time_base),str(time_mp),str(time_dist)])
        print(line)

QuarkJets
Sep 8, 2008

Deadite posted:

So here's my dask test case, and it's a little misleading because when I check the times the compute() without a LocalCluster/Client runs much, much faster for the example than it does with the program I'm actually building. The real program runs faster without a LocalCluster until the file gets to be around 1GB in size, then the LocalCluster distributed compute starts being faster. I can't seem to recreate some of the errors I'm seeing with large files with the example code though. It does top out at 5 million rows before I get this error, which I don't get with my real program. This whole thing is so confusing.


Multiprocessing entails launching a new Python process (1 per worker) and then copying any memory that the worker needs, including any data objects and executable code from the main process. You could use dask's visualizer to inspect what's happening in your task graph, but we can just look at the pieces that are being scheduled:

1. On the host, split the dataframe into chunks
2. For each dataframe chunk, have a worker calculate the min and max of each column and return a dictionary of results. Repeat for all chunks
3. Combine all of the dictionaries

Since each Python process needs to be given the data, the second step involves a huge copy operation as data is moved from the host process to each of the workers. It's unclear how much of this data is going to be cached for subsequent calls; since the task graph is used a second time with a different scheduler, it's possible that there's a big difference in processing time because the input data has already been copied to the workers. Or maybe not, I'm unsure - but it'd probably be good to rule out that effect by completely isolating these experiments, e.g. by only using one of the schedulers per execution.

It'd also be good to implement better scope isolation - if this was just using `concurrent.futures` multiprocessing then all of the code in the global scope (including the for loops and the creation of the pandas dataframes) would be getting re-executed in each spawned worker, but I'm not sure how much exposure you have to that risk since dask uses task graphs - maybe it's fine? Still, it's better to organize the code into discrete functions that do simple things; in a distributed codebase that's going to be less prone to issues.

So I re-wrote your code a little; I especially want to draw your attention to the use of f-strings instead of str() concatenation. f-strings rule, use f-strings

Python code:
import pandas as pd
import numpy as np
from datetime import datetime
import dask
from dask.distributed import Client, LocalCluster
import platform
from time import sleep

if platform.system().lower() == 'windows':
    import multiprocessing.popen_spawn_win32
else:
    import multiprocessing.popen_spawn_posix


def create_dataframe(rows, cols):
    """Create a pandas dataframe with a specific number of rows and duplicate columns."""
    df = pd.DataFrame(np.arange(rows).reshape(rows,1))
    df.columns = ['col1']
    # Duplicate that column to number set in 'cols'
    for i in range(cols):
        df[f'col{i}'] = df['col1']
    return df


def split_dataframe(df, num_splits):
    """Split a dataframe and return its chunks in a list."""
    k, m = divmod(len(df.columns), num_splits)
    split_dfs = list((df[df.columns[i*k+min(i, m):(i+1)*k+min(i+1, m)]]
                      for i in range(num_splits)))
    return split_dfs


@dask.delayed
def find_minmax(df):
    """Return a dictionary containing the min and max of each column in a dataframe."""
    results = {}
    for col in df.columns:
        results[col] = {}
        results[col]['min'] = df[col].min()
        results[col]['max'] = df[col].max()
    return results


@dask.delayed
def combine_dicts(*dicts):
    """Combine results from multiple dictionaries."""
    results = {}
    for another_dict in dicts:
        results.update(another_dict)
    return results


def run_experiment(rows, cols, num_splits, num_workers, scheduler):
    """Run a dataframe distribution experiment."""
    dask.config.set(scheduler=scheduler, num_workers=num_workers) 
    df = create_dataframe(rows, cols)
    split_dfs = split_dataframe(df, num_splits)

    # Create a list of delayed find_minmax functions
    list_of_dicts = []
    for df in split_dfs:
        list_of_dicts.append(find_minmax(df))

    # Submit the delayed find_minmax functions to combine_dicts function
    combined_dictionary = combine_dicts(*list_of_dicts)
    start = datetime.now()
    combined_dictionary.compute()
    exec_time = (datetime.now()-start).total_seconds()

    print(f'{rows=}, {cols=}, {num_splits=}, {exec_time=}, {scheduler=}')


if __name__ == '__main__':
    rows = 1000000
    cols = 100
    num_splits = 4
    num_workers = 4
    scheduler = 'synchronous'  # synchronous, multiprocessing
    run_experiment(rows, cols, num_splits, num_workers, scheduler)

This runs way faster in synchronous mode than with processes (e.g. multiprocessing). Why? Probably because the operations are bottlenecked by copying a ton of data from the host to each process, whereas synchronous mode doesn't need to do that. There's an easy way to test this: reduce the dataset size (rows=100) and instead put a sleep(2) in find_minmax(), to simulate a costly data processing operation. This brings the difference way down, since the synchronous scheduler has to work through those `sleep(2)` calls one at a time whereas they run simultaneously with `processes`.
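For reference, the simulated-cost variant of find_minmax described above (dropped into the script as-is) is just:

Python code:
@dask.delayed
def find_minmax(df):
    """Return a dictionary containing the min and max of each column in a dataframe."""
    sleep(2)  # stand-in for an expensive per-chunk computation
    results = {}
    for col in df.columns:
        results[col] = {'min': df[col].min(), 'max': df[col].max()}
    return results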

The dask docs make some additional notes about this:

the dask docs (https://docs.dask.org/en/stable/shared.html) posted:
The multiprocessing scheduler must serialize functions between workers, which can fail
The multiprocessing scheduler must serialize data between workers and the central process, which can be expensive
The multiprocessing scheduler cannot transfer data directly between worker processes; all data routes through the main process.

Ick, so the main process is the hub for all of the data needed by each subprocess. That can be very costly.

What if we have each process load its own data? This is a much more common way to handle distributed computation, since it eliminates the bottleneck of having to serialize a bunch of data between the host and the workers.

Python code:
import pandas as pd
import numpy as np
from datetime import datetime
import dask
from dask.distributed import Client
import platform
from time import sleep

if platform.system().lower() == 'windows':
    import multiprocessing.popen_spawn_win32
else:
    import multiprocessing.popen_spawn_posix


def create_dataframe(rows, cols):
    """Create a pandas dataframe with a specific number of rows and duplicate columns."""
    df = pd.DataFrame(np.arange(rows).reshape(rows,1))
    df.columns = ['col1']
    # Duplicate that column to number set in 'cols'
    for i in range(cols):
        df[f'col{i}'] = df['col1']
    return df


def split_dataframe(df, num_splits):
    """Split a dataframe and return its chunks in a list."""
    k, m = divmod(len(df.columns), num_splits)
    split_dfs = list((df[df.columns[i*k+min(i, m):(i+1)*k+min(i+1, m)]]
                      for i in range(num_splits)))
    return split_dfs


@dask.delayed
def create_dataframe_chunks(rows, cols, num_splits, split_number):
    """Create a given dataframe split."""
    df = pd.DataFrame(np.arange(rows).reshape(rows,1))
    cols_per_split = cols // num_splits
    col_start = cols_per_split * split_number
    col_end = min(col_start + cols_per_split, cols)
    col_name_start = f'col{col_start}'
    df.columns = [col_name_start]
    for i in range(col_start+1, col_end):
        df[f'col{i}'] = df[col_name_start]
    return df

@dask.delayed
def find_minmax(df):
    """Return a dictionary containing the min and max of each column in a dataframe."""
    results = {}
    for col in df.columns:
        results[col] = {}
        results[col]['min'] = df[col].min()
        results[col]['max'] = df[col].max()
    return results


@dask.delayed
def combine_dicts(list_of_dicts):
    """Combine results from multiple dictionaries."""
    results = {}
    for another_dict in list_of_dicts:
        results.update(another_dict)
    return results


def run_experiment(rows, cols, num_splits, num_workers, scheduler):
    """Run a dataframe distribution experiment."""
    dask.config.set(scheduler=scheduler, num_workers=num_workers)
    if scheduler == 'distributed':
        Client()

    split_dfs = [create_dataframe_chunks(rows, cols, num_splits, split_number)
                 for split_number in range(num_splits)]
    list_of_dicts = [find_minmax(split_df) for split_df in split_dfs]
    combined_dictionary = combine_dicts(list_of_dicts)

    start = datetime.now()
    combined_dictionary.compute(num_workers=num_workers)
    exec_time = (datetime.now()-start).total_seconds()

    print(f'{rows=}, {cols=}, {num_splits=}, {exec_time=}, {scheduler=}')


if __name__ == '__main__':
    rows = 1000000
    cols = 100
    num_splits = 20
    num_workers = 4
    scheduler = 'processes'  # synchronous, processes, distributed
    run_experiment(rows, cols, num_splits, num_workers, scheduler)
I also increased the number of splits to 20, so each task is basically dealing with 5 columns of data. I still don't see much benefit to using processes, whereas distributed definitely runs faster than synchronous. Something funny is going on here with how dask handles a multiprocessing situation, and I don't know what it is.

As a simple experiment, I also tried just taking off the dask decorators and using concurrent.futures directly.
Python code:
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def split_tasks(rows, cols, num_splits, split_number):
    """Perform all of the tasks required of a process."""
    df = create_dataframe_chunks(rows, cols, num_splits, split_number)
    minmax = find_minmax(df)
    return minmax


def run_multiprocessing_experiment(rows, cols, num_splits, num_workers):
    start = datetime.now()
    create_split_result = partial(split_tasks, rows, cols, num_splits)
    with ProcessPoolExecutor(max_workers=num_workers) as pool:
        minmax_dicts = pool.map(create_split_result, range(num_splits))
    #minmax_dicts = map(create_split_result, range(num_splits))
    combined_dictionary = combine_dicts(minmax_dicts)
    exec_time = (datetime.now()-start).total_seconds()
    print(f'{rows=}, {cols=}, {num_splits=}, {num_workers=}, {exec_time=}')


if __name__ == '__main__':
    rows = 10000
    cols = 100
    num_splits = 4
    num_workers = 1
    run_multiprocessing_experiment(rows, cols, num_splits, num_workers)
Keeping the sleep(2) in find_minmax to simulate expensive processing, I get 4x faster processing with 4 workers than with 1, and the 1-worker baseline is the same whether I use pool.map() or a simple map() (which would mean not using a process pool at all and just running the code sequentially).

So the tl;dr is that I don't really understand what dask does with its 'multiprocessing' / 'processes' scheduler, it doesn't seem to be doing what I'd expect it to do when comparing to a direct concurrent.futures.ProcessPoolExecutor implementation. There's probably some bit of magic sauce that I'm missing. But "distributed" seems to work well regardless and is where they put the bulk of their efforts anyway, and that's what you should use since that's designed for task graphs whereas concurrent.futures is not.

As for the exception that you saw, are you maybe trying to serialize too much data at once? Alternatively, I googled that error and found a dask github post claiming that this was a bug that got fixed, so maybe you can just update to a newer version of dask?

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
This is less 'a thing I'm working on' and more 'cool sorcery' but: https://textual.textualize.io/ seems pretty interesting, and it taught me something about SSH I didn't know.

TL;DR It's a text-based UI framework for rapid development of terminal programs; it looks pretty slick and has mouse controls. Since I work on a lot of those, that's very interesting to me. At first I was like "well that's cool and all but you're going to lose all the mouse functionality over SSH", so I installed it on my server and SSH'd over...and all the mouse stuff just worked???

This might just be common knowledge but it was super wild to me. Anyway the framework looks neat, I might try playing around with it in the next few weeks.

https://unix.stackexchange.com/questions/418901/how-some-applications-accept-mouse-click-in-bash-over-ssh is apparently how it works; I didn't realize mouse clicks were actually sent over SSH at all!
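For anyone curious, a hello-world with it is tiny. This is a rough sketch assuming a reasonably recent Textual release, going from their docs rather than anything I've shipped:

Python code:
from textual.app import App, ComposeResult
from textual.widgets import Button, Footer, Header


class DemoApp(App):
    """Minimal Textual app: a clickable button that works locally or over SSH."""

    def compose(self) -> ComposeResult:
        yield Header()
        yield Button("Click me")
        yield Footer()


if __name__ == "__main__":
    DemoApp().run()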

12 rats tied together
Sep 7, 2006

SSH is extremely good. Another neat library in the realm of "python and ssh" is mitogen

ziasquinn
Jan 1, 2006

Fallen Rib

StumblyWumbly posted:

Do you have a project you'd like to try doing? Like automating some spreadsheet work or renaming files or doing math?
And what's your programming background? There's no one size fits all.

Not a lot really, I just know some real basic stuff. I do IT work generally.

Ben Nerevarine posted:

I usually do recommend the Automate the Boring Stuff book that you mentioned, but not necessarily reading it front to back. I think you’d have more success if you knew what you wanted out of a small project rather than try and learn a bunch of boring stuff THEN come up with a project idea. So what I recommend is skimming the book (and I’m talking, like, read the table of contents and peek at some chapters that stick out as interesting or useful) to at least get a sense of the kinds of things it touches on, things everyone does every day, like file manipulation for example. Moving files around, creating directories, appending to files, etc. are operations you’re likely to need across many of the projects you work on. Get a sense of the building blocks that are there, then think about something you would like to do (again, sticking to small projects at this stage), then think about how you’d put them together with those building blocks. That way you have reasons for reading particular chapters in depth rather than slogging through the whole thing at a go, which will likely only serve to bore and demotivate you

Ok good idea. I own this book already so I'll do that. (It's digital so I haven't even cracked it I think...)

Falcon2001 posted:

IMO as someone who was really terrible in school I found Code Katas like I mentioned above to be a very useful setup for this - you just have to solve one discrete problem to get the little 'I did a thing' feeling, not build a whole program and slog through all the stuff surrounding it. And they start off really easy, like 'reverse a list' easy.

Protip: Try and solve the problem, and if you get stuck, just google the answer. You're not taking a test, you're learning, and half of development is googling stuff or looking up the docs anyway.

Thank you all you ppl. All very true and helpful advice. I knew this was the truth. I just haven't ever really needed to "develop something" because I almost always can just find it already created somewhere :(, but obviously I should just determine such projects to work on and do them. The Code Katas sound really helpful so I'll look into that too.

ziasquinn fucked around with this message at 02:46 on May 7, 2023

AfricanBootyShine
Jan 9, 2006

Snake wins.

Anyone have luck using ChatGPT to generate code for importing badly formatted CSV/XML data? Every instrument I use exports data in a different broken way and I spend 3-4 hours a week manually cleaning the data into a format that I can properly import it into. I've written scripts that automate it for one instrument but some are so broken in ways I can't wrap my mind around.

One particular instrument is awful as you can only properly export data from an experiment as an XML file which requires a $10,000 software package to analyse. People have scrabbled together code for importing older data formats into R but I can't figure out how to get it to work for numpy.

QuarkJets
Sep 8, 2008

Instead of manually cleaning data each week you should spend some time writing a function that cleans up the data automatically. Your code should identify the instrument type and then call the required transformations to make the data fit into your data model.
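The shape I have in mind is something like this, with hypothetical instrument names and made-up cleanup steps:

Python code:
import pandas as pd


def clean_spectrometer(path):
    # hypothetical: this instrument writes a 3-line junk banner and uses semicolons
    return pd.read_csv(path, sep=';', skiprows=3)


def clean_plate_reader(path):
    # hypothetical: comma-separated but padded with stray whitespace and empty rows
    return pd.read_csv(path, skipinitialspace=True).dropna(how='all')


CLEANERS = {
    'spectrometer': clean_spectrometer,
    'plate_reader': clean_plate_reader,
}


def load_instrument_file(instrument, path):
    """Dispatch to the per-instrument cleanup and return a tidy dataframe."""
    return CLEANERS[instrument](path)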

I haven't tried using gpt to write any code

CarForumPoster
Jun 26, 2013

⚡POWER⚡

AfricanBootyShine posted:

Anyone have luck using ChatGPT to generate code for importing badly formatted CSV/XML data? Every instrument I use exports data in a different broken way and I spend 3-4 hours a week manually cleaning the data into a format that I can properly import it into. I've written scripts that automate it for one instrument but some are so broken in ways I can't wrap my mind around.

One particular instrument is awful as you can only properly export data from an experiment as an XML file which requires a $10,000 software package to analyse. People have scrabbled together code for importing older data formats into R but I can't figure out how to get it to work for numpy.


I have definitely used ChatGPT to extract stuff, make regexes, etc.

I usually start with importing CSVs to dataframes with pandas. The distance that can get you doesn't require ChatGPT, as the pandas docs are about as simple as can be: simply pd.read_csv with whatever params you want. Once it's in a dataframe it's easy enough to get into an array, though YMMV if the data is quite large, as pandas will make huge dataframes if it doesn't know a type for the columns.
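Something along these lines (file and column names made up):

Python code:
import pandas as pd

# declaring dtypes up front avoids type-guessing and keeps memory use sane
df = pd.read_csv('instrument_dump.csv', dtype={'channel': 'int32', 'reading': 'float32'})
values = df['reading'].to_numpy()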

For wrangling XML/HTML, I've certainly had ChatGPT spit out code using lxml. I imagine this use case happens frequently enough that it'll make something close-ish.
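The kind of thing it tends to produce looks like this (element and attribute names made up for illustration):

Python code:
from lxml import etree

tree = etree.parse('experiment_export.xml')
# pull every <measurement value="..."> attribute out as a float
readings = [float(node.get('value')) for node in tree.xpath('//measurement')]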

QuarkJets posted:

Instead of manually cleaning data each week you should spend some time writing a function that cleans up the data automatically. Your code should identify the instrument type and then call the required transformations to make the data fit into your data model.

I haven't tried using gpt to write any code

I think your suggestion is exactly what he's asking about. Also you're missing out with ChatGPT. It's good enough at it that it's making me a horrible, lazy programmer. Literally retarding my programming skills because it writes the code for me, usually with some small errors for me to fix, and it even writes decent docstrings, type hints, comments, etc.

CarForumPoster fucked around with this message at 23:15 on May 8, 2023

StumblyWumbly
Sep 12, 2007

Batmanticore!
Using ChatGPT for regex is great; AIs are very good at that kind of translation.
But if you can't figure out the rules you need for reformatting the data, you are pretty much doomed. Since you've been cleaning the data, you probably know the rules, you just haven't formalized them. Hopefully you can get a program to do most of the work, and then handle the rest manually, e.g. deleting garbage data at the start of the recording.

Out of curiosity how are you controlling these instruments now? VISA bus? Or truly custom software?

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Yeah, I think if you can't generate it with ChatGPT the bigger problem is you don't understand the issue you're trying to solve (which is half of dev anyway). The goal is the same though: write some stuff to sanitize/standardize the data that comes off the instruments before loading it in.


AfricanBootyShine posted:

Anyone have luck using ChatGPT to generate code for importing badly formatted CSV/XML data? Every instrument I use exports data in a different broken way and I spend 3-4 hours a week manually cleaning the data into a format that I can properly import it into. I've written scripts that automate it for one instrument but some are so broken in ways I can't wrap my mind around.

One particular instrument is awful as you can only properly export data from an experiment as an XML file which requires a $10,000 software package to analyse. People have scrabbled together code for importing older data formats into R but I can't figure out how to get it to work for numpy.

Can you add some more details about what you're trying to change about the data? Is it actually breaking CSV convention or just weird? etc etc.

FWIW I also haven't had much luck with ChatGPT doing anything other than bog-standard Regex stuff, like phone numbers, emails, etc.

CarForumPoster
Jun 26, 2013

⚡POWER⚡
Btw never regex your xml that is extremely haram use lxml like a grownup

Hughmoris
Apr 21, 2007
Let's go to the abyss!

CarForumPoster posted:

Btw never regex your xml that is extremely haram use lxml like a grownup

The <center> cannot hold

Data Graham
Dec 28, 2009

📈📊🍪😋



But not if you’re on windows and have to install some decrepit unsupported version of Visual Studio to compile it during pip install or else look for a prebuilt wheel on some page at a university cluster in Sweden that predates x64 architectures


QuarkJets
Sep 8, 2008

Data Graham posted:

But not if you’re on windows and have to install some decrepit unsupported version of Visual Studio to compile it during pip install or else look for a prebuilt wheel on some page at a university cluster in Sweden that predates x64 architectures

Use conda instead of doing any of that poo poo
