Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

eXXon posted:

You can make do_something an abstractmethod and move the implementation to Bob.

Unless you really desperately need these things to be quasi-singletons, I would not make do_something a classmethod. For one, it makes it difficult to have more than one around, whereas I don't see why users should be forbidden from doing so. If you want a default set of Bob/Sam/whatever instances, you can define that in a module.

It's the second part - I want to provide a default set and those should be the only ones available to the user. They're really a set of pre-defined ETL operations.

To elaborate, I have a bunch of files, containing tabular data, but structured differently. So I want the user to be able to invoke calls like:

code:
RecordsA.load (filepath1, a_config_dict)
RecordsA.load (filepath2, a_config_dict)
RecordsA.load (filepath3, a_config_dict)

RecordsB.load (filepath4, b_config_dict)
RecordsB.load (filepath5, b_config_dict)
RecordsB.load (filepath6, b_config_dict)
where the load method loads the file and, using the config_dict, extracts the relevant info from a table and appends it to a pandas dataframe. The exact processing steps may vary, so a single fixed function won't work.

I only want the user to access the provided Records interfaces.


12 rats tied together
Sep 7, 2006

I think I would probably model that as
Python code:
RecordsLoader.load(file="some_path", accumulator=lambda x, y: x + y, config=a_config_dict)
# or maybe
class AdditionAccumulator:
    vals = []

    @classmethod
    def accumulate(cls, x, y):
        cls.vals.append(x + y)


class MultiplicationAccumulator:
    vals = []

    @classmethod
    def accumulate(cls, x, y):
        cls.vals.append(x * y)


RecordsLoader.load(file="some_path", accumulator=AdditionAccumulator, config=b_config_dict)
Instead of creating a full class for a "named thing" that holds some behavior, create a class that holds the behavior directly and is named after what it does. Users can then load records by composing various behaviors. Importantly, the "default behavior" becomes a fully realized thing - maybe it's just called the DefaultAccumulator - instead of being a read-between-the-lines side effect of Bob subclassing aName and not overriding a method.
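
To spell out that last part, here's a rough sketch of what the explicit default could look like (RecordsLoader is still just a placeholder, and the config is stubbed out):
Python code:
class DefaultAccumulator:
    """The addition behavior from the example above, as an explicit, named thing."""
    vals = []

    @classmethod
    def accumulate(cls, x, y):
        cls.vals.append(x + y)


class RecordsLoader:
    @staticmethod
    def load(file, config, accumulator=DefaultAccumulator):
        # placeholder: read `file`, apply `config`, and feed values
        # into accumulator.accumulate(...)
        pass


RecordsLoader.load(file="some_path", config={})  # the default behavior is named, not implied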

QuarkJets
Sep 8, 2008

Cyril Sneer posted:

Looking for advice on the following example code I've been playing around with:

code:
class aName:
    vals = []
    @classmethod
    def do_something (cls, x,y):
        output = x + y
        cls.vals.append( output )
        print(f'appended {output} in {cls}')
        
    @classmethod
    def get_vals(cls):
        return cls.vals
    
class Bob (aName):
    vals = []
    
    
class Sam (aName):
    vals = []
    @classmethod
    def do_something (cls, x,y):
        output = x * y
        cls.vals.append( output )
        print(f'appended {output} in {cls}')
    
    

Bob.do_something (1,3)
Bob.do_something (2,3)
Bob.do_something (3,3)

Sam.do_something(1,3)
Sam.do_something(2,3)
Sam.do_something(3,3)

Bob.get_vals() # returns [4, 5, 6]
Sam.get_vals() # returns [3, 6, 9]
It's easiest to understand if you start from the bottom, where you'll see my desired calling pattern. I want to create different "static" classes that can accumulate different results depending on which is called. You'll see the Bob class doesn't override anything and so preserves the addition operation. In my Sam class, I've modified the do_something function to perform multiplication instead of addition.

The above code does actually work the way I want it to, I just don't know if it's the best way to do it (or whether it might in fact be considered a bad way!).

My main gripe is the need to define that Bob class, which doesn't override any behaviour, and only serves to re-scope the class variable.

There's a lot of duplicate code between aName, Bob, and Sam. You could use inversion of control here to simplify the implementation a lot; it's a lot of work to define a class just to override a single method of its parent, when you could instead just accept a function object that defines the modular behavior (e.g. a summing function, a multiplying function, etc.).

Since we're no longer using inheritance and method overriding, we can just define one class and instantiate it a couple of times.

Why does get_vals() exist? It's pointless, ditch it. Getter methods that just return an attribute should be avoided.

Python code:
class aName:
    def __init__(self, fun=lambda x,y: x+y):
        self.fun = fun
        self.vals = [] 
    
    def do_something(self, x, y):
        output = self.fun(x, y)
        self.vals.append(output)

 
Bob = aName()
Sam = aName(lambda x,y: x*y)

Bob.do_something(1,3)
Bob.do_something(2,3)
Bob.do_something(3,3)

Sam.do_something(1,3)
Sam.do_something(2,3)
Sam.do_something(3,3)

Bob.vals  # returns [4, 5, 6]
Sam.vals  # returns [3, 6, 9]

QuarkJets fucked around with this message at 07:01 on Aug 9, 2023

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Another perhaps minor note, but I'm reasonably sure that this:

Python code:
class ButtsMaker:
	vals = []
and this:

Python code:
class ButtsMaker:
	def __init__(self):
		self.vals = []
are different: the first makes vals a single attribute shared by the entire class, which can result in some very strange behavior. Most of the time you want the second, where each object gets its own fresh list.
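
A quick demo of the difference (made-up classes, just to show the gotcha):
Python code:
class SharedButtsMaker:
    vals = []           # one list shared by every instance


class PerInstanceButtsMaker:
    def __init__(self):
        self.vals = []  # a fresh list per instance


a, b = SharedButtsMaker(), SharedButtsMaker()
a.vals.append(1)
print(b.vals)  # [1] -- b sees a's append, because vals lives on the class

c, d = PerInstanceButtsMaker(), PerInstanceButtsMaker()
c.vals.append(1)
print(d.vals)  # [] -- each instance keeps its own list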

I also agree with QuarkJets' approach: if the lambda part seems confusing (and fwiw, I still find lambdas confusing), just have it be various full functions (or classes if you really want) that users can pass in, so you have something like this:
Python code:

from typing import Callable


def Adder(x, y) -> int:
    return x + y


def Multiplier(x, y) -> int:
    return x * y


# You can use a function as a default value directly, since functions are objects.

class aName:
    def __init__(self, fun: Callable = Adder):
        self.fun = fun
        self.vals = []
    
    def do_something(self, x, y):
        output = self.fun(x, y)
        self.vals.append(output)

 
Bob = aName()
Sam = aName(Multiplier)

Bob.do_something(1,3)
Bob.do_something(2,3)
Bob.do_something(3,3)

Sam.do_something(1,3)
Sam.do_something(2,3)
Sam.do_something(3,3)

Bob.vals  # returns [4, 5, 6]
Sam.vals  # returns [3, 6, 9]
You can also make use of factory functions that just...make the thing the way you want. Just don't make it part of your default __init__.
Python code:
def get_aname_multiplier() -> aName:
	return aName(Multiplier)

def get_aname_adder() -> aName:
	return aName()

Precambrian Video Games
Aug 19, 2002



You're kind of missing the part where Cyril wanted them to behave like singletons, which is why vals is a class attribute rather than an instance attribute. The best advice I've read in that regard is to only try to make something a singleton if your program cannot possibly work otherwise and I don't think that's the case here. To that end,

Cyril Sneer posted:

It's the second part - I want to provide a default set and those should be the only ones available to the user. They're really a set of pre-defined ETL operations.

To elaborate, I have a bunch of files, containing tabular data, but structured differently. So I want the user to be able to invoke calls like:

code:
RecordsA.load (filepath1, a_config_dict)
RecordsA.load (filepath2, a_config_dict)
RecordsA.load (filepath3, a_config_dict)

RecordsB.load (filepath4, b_config_dict)
RecordsB.load (filepath5, b_config_dict)
RecordsB.load (filepath6, b_config_dict)
where the load method loads the file and, using the config_dict, extracts the relevant info from a table and appends it to a pandas dataframe. The exact processing steps may vary, so a single fixed function won't work.

I only want the user to access the provided Records interfaces.

... why can't there be more than one RecordsA? If these are chunks of a larger dataset, couldn't a user conceivably want filename1+2 separately from filename3?

And if you're ultimately creating a dataframe from this, are you planning to protect it from mutation somehow while the Records classes provide some subset of dataframe functionality? Otherwise I'm not sure what you mean by only allowing the user to access the Records interfaces.

joebuddah
Jan 30, 2005
Thanks for the assist.

There isn't always a decimal or a comma.
I am supposed to allow for values between 0.001 and 200,000.001

I think you are correct and I should ignore typos.

Chin Strap
Nov 24, 2002

I failed my TFLC Toxx, but I no longer need a double chin strap :buddy:
Pillbug

joebuddah posted:

Thanks for the assist.

There isn't always a decimal or a comma.
I am supposed to allow for values between 0.001 and 200,000.001

I think you are correct and I should ignore typos.

Isn't this indistinguishable on its own?

200,002

Doesn't tell me if it is 200 with a decimal portion of 002 or 200,002 with a decimal portion of 0

boofhead
Feb 18, 2021

yeah unless they consistently pass the value with 3 decimal places (so 1234 -> 1,234.000 or 1.234,000) you'd need either some specific flag or enough contextual data to infer which format it should use (e.g. values from this source or with this pattern of metadata are always US)

sometimes the source data is just too crap to do anything with and they gotta fix the problem before they pass the data to you

spiritual bypass
Feb 19, 2008

Grimey Drawer
You could read backward through the string to see if there's punctuation before you've passed 3 places. Then, use that knowledge to add zeroes to the end. That ought to at least normalize the decimal part length.

Chin Strap
Nov 24, 2002

I failed my TFLC Toxx, but I no longer need a double chin strap :buddy:
Pillbug

spiritual bypass posted:

You could read backward through the string to see if there's punctuation before you've passed 3 places. Then, use that knowledge to add zeroes to the end. That ought to at least normalize the decimal part length.

But there potentially isn't a decimal

spiritual bypass
Feb 19, 2008

Grimey Drawer
If you've passed three spots without punctuation, you know that's where it would go. Add zeroes, then strip all punctuation and you have yourself a nice, safe integer that you can reformat for display.
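
Something like this, roughly (untested sketch; it assumes at most three decimal places, and that anything with a thousands separator also carries a decimal part):
Python code:
def to_thousandths(raw: str) -> int:
    """Normalize '1.234,5', '1,234.5', '200,000.001', or '1234' to an
    integer count of thousandths by scanning from the right."""
    decimals = 0
    for i, ch in enumerate(reversed(raw)):
        if ch in ",.":
            if i <= 3:
                decimals = i  # i digits sit to the right of the separator
            break
    digits = "".join(ch for ch in raw if ch.isdigit())
    return int(digits) * 10 ** (3 - decimals)


print(to_thousandths("1,234.5"))      # 1234500, i.e. 1234.500
print(to_thousandths("1.234,5"))      # 1234500, i.e. 1234.500
print(to_thousandths("200,000.001"))  # 200000001, i.e. 200000.001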

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

eXXon posted:

You're kind of missing the part where Cyril wanted them to behave like singletons, which is why vals is a class attribute rather than an instance attribute. The best advice I've read in that regard is to only try to make something a singleton if your program cannot possibly work otherwise and I don't think that's the case here. To that end,

... why can't there be more than one RecordsA? If these are chunks of a larger dataset, couldn't a user conceivably want filename1+2 separately from filename3?

And if you're ultimately creating a dataframe from this, are you planning to protect it from mutation somehow while the Records classes provide some subset of dataframe functionality? Otherwise I'm not sure what you mean by only allowing the user to access the Records interfaces.

I should clarify on the files a bit. I have a giant pool of files named as -

code:
recordsfile_TypeA_00001.xlsx
recordsfile_TypeA_00002.xlsx
recordsfile_TypeA_00003.xlsx
recordsfile_TypeB_00001.xlsx
recordsfile_TypeB_00002.xlsx
recordsfile_TypeC_00001.xlsx
recordsfile_TypeC_00002.xlsx
recordsfile_TypeC_00003.xlsx
So there are multiple record file types (A, B, C here), and multiple cases of each type. I want all Type A records accumulated into one dataframe, all Type B records into another, etc. My plan was that I would iterate through each file name, extract the record type (easy), then, via a dictionary look-up, call the appropriate (class) function, e.g., RecordsA.load(recordsfile_TypeA_xxx, a_config_dict). The actual processing steps could be the same between, say, TypeA and TypeB, aside from some custom details fed in through the two different config dicts. So this is equivalent to my empty Bob class that overrides nothing but re-scopes the class variable. In TypeC, the processing steps differ, so I override the load method with the different steps.
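
Roughly, the dispatch loop I have in mind looks like this (pseudocode; RecordsC, its config dict, and the directory name are placeholders following the same pattern):
Python code:
from pathlib import Path

LOADERS = {
    "TypeA": (RecordsA.load, a_config_dict),
    "TypeB": (RecordsB.load, b_config_dict),
    "TypeC": (RecordsC.load, c_config_dict),
}

for path in Path("records_dir").glob("recordsfile_*.xlsx"):
    record_type = path.stem.split("_")[1]  # "recordsfile_TypeA_00001" -> "TypeA"
    load, config = LOADERS[record_type]
    load(path, config)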


Does that make sense?

QuarkJets
Sep 8, 2008

eXXon posted:

You're kind of missing the part where Cyril wanted them to behave like singletons, which is why vals is a class attribute rather than an instance attribute.

I tried rereading the post but I don't see this singleton requirement stated anywhere - I think instances should work fine for their use case.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Even if it was supposed to be a Singleton, I wouldn't implement it with just a class-level attribute. You should be enforcing the Singleton pattern via some init fuckery or via a factory.

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

Falcon2001 posted:

Even if it was supposed to be a Singleton, I wouldn't implement it with just a class-level attribute. You should be enforcing the Singleton pattern via some init fuckery or via a factory.

Can you elaborate on this? Even if I don't ultimately go in this direction I'd still like to learn/try it.

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?

Cyril Sneer posted:

Can you elaborate on this? Even if I don't ultimately go in this direction I'd still like to learn/try it.

Basically if you actually want a singleton (and I agree with other posters that for this problem you don’t) you should make it literally impossible to create more than one instance.

IIRC the usual way to implement this in Python is by altering the __new__() method and checking if cls._instance is None before returning either the sole existing instance of the class or creating and returning a new one (I have never actually needed to do this in the wild so might have got a couple of these names wrong)

Singletons do have their uses (eg loggers are frequently implemented using a singleton), but those are fairly specialised and limited and you should probably think carefully about whether you actually need one.

joebuddah
Jan 30, 2005

Chin Strap posted:

Isn't this indistinguishable on its own?

200,002

Doesn't tell me if it is 200 with a decimal portion of 002 or 200,002 with a decimal portion of 0


All of our values would have at least one decimal place, except for Boolean values, so I use the pseudocode below.


This is how I'm testing the values going in

If there is a decimal and a comma

If the decimal index =<2 it's in EU

If the comma index =<2 then it's US

If there is only one
If there is a comma it's EU
If there is a period it's US
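
In Python that works out to roughly the function below (for the both-present case it uses "whichever separator appears last is the decimal point" rather than the index check, and it assumes every non-Boolean value has at least one decimal place):
Python code:
def detect_locale(value: str) -> str:
    """Guess whether a numeric string is EU- or US-formatted."""
    has_dot, has_comma = "." in value, "," in value
    if has_dot and has_comma:
        # Whichever separator appears last is the decimal separator.
        return "US" if value.rfind(".") > value.rfind(",") else "EU"
    if has_comma:
        return "EU"
    if has_dot:
        return "US"
    return "US"  # bare digits: genuinely ambiguous, so pick a default


print(detect_locale("1.234,56"))  # EU
print(detect_locale("1,234.56"))  # US
print(detect_locale("200,002"))   # EU (a lone comma is read as the decimal point)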

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
So here's a simple way to do it as a class instance:

Python code:
class SingleButt:
    _instance = None  # holds the sole instance once it exists

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance


a = SingleButt()
b = SingleButt()
a == b  # True: both names refer to the same instance
And here's the better way to do it as a factory function:
Python code:
def get_single_butt(_butt = SingleButt()) -> SingleButt:
	return _butt

a = get_single_butt()
b = get_single_butt()
a == b
Or, my preferred method, which is to create a class for the factory, which lets you store things a little more cleanly as class attributes. This pattern isn't so bad for vending things like clients, but we'll get into why this can be a problem:

Python code:
class Butt:
    def __init__(self, args1, args2) -> None:
        self.args1 = args1
        self.args2 = args2


class ButtsFactory:
    _single_butt = None  # cached instance

    @classmethod
    def get_single_butt(cls) -> Butt:
        if cls._single_butt is None:
            cls._single_butt = Butt("Your Args 1 goes here", "Your args 2 goes here")
        return cls._single_butt


a = ButtsFactory.get_single_butt()
b = ButtsFactory.get_single_butt()
a == b  # True: the factory always hands back the same Butt
So why don't we use singletons? Because it's a very easy way to gently caress with dependency injection, which then makes testing a massive pain in the rear end involving a lot of patching, as just a starting point for the issues.

For example: one very common pattern people use singletons for is client objects, such as S3 or whatever; this isn't necessarily a bad idea in most cases, because your program may never need to talk to more than one AWS account, so why bother with multiple clients?

The issue comes up when suddenly you realize 'oh poo poo, instead of initializing the client at the beginning and passing it down via dependency injection, I'm just requesting that singleton specifically in my code wherever it's needed.' That's not a problem...until you need to patch 50 different places you call S3.
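
To make the testing point concrete, a toy sketch (made-up client, nothing AWS-specific):
Python code:
class FakeS3Client:
    def upload(self, report):
        print(f"uploading {report}")


_client = FakeS3Client()  # the "singleton"


def get_client():
    return _client


# Hard to test: this function reaches out and grabs the singleton itself, so
# every test that touches it has to patch get_client().
def upload_report_via_singleton(report):
    get_client().upload(report)


# Easier to test: the caller injects the client, so a test just passes a fake.
def upload_report(report, client):
    client.upload(report)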

In my uneducated opinion, singletons share a lot of code smell space with global variables - they're not always the wrong idea, but they're certainly something worth being careful about, and more often than not they're the wrong choice. For example, I'm using one in a current program I'm working on, because the specific use case is essentially a logging system of sorts. It does not take any additional side effect actions of any kind, it just stores data in a globally available place for later retrieval, and the architecture of our software makes initializing that up top and passing it down somewhat complex, and definitely much more verbose. All of those things mean that patching is not required - the bits that actually take actions based on this singleton's data use dependency injection correctly, allowing us to pass in appropriate objects there. This was a tradeoff for complexity, but one made after discussing it with teammates and having a quick design discussion.

Falcon2001 fucked around with this message at 07:53 on Aug 10, 2023

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
I skipped over them, but I wanted to comment on the "normal" class-based solutions from QuarkJets and Falcon2001. I understand both of these, and I agree they work, but I'm somewhat dissatisfied with this approach because I don't really want the user to have to instantiate the classes themselves, and since the instances are liable to be accessed in various other functions, they'd have to live as globals (or be passed around, which would be annoying). Whereas with static class methods in a module, I can call them from wherever and nothing has to be instantiated.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Cyril Sneer posted:

I skipped over them, but I wanted to comment on the "normal" class-based solutions from QuarkJets and Falcon2001. I understand both of these, and I agree they work, but I'm somewhat dissatisfied with this approach because I don't really want the user to have to instantiate the classes themselves, and since the instances are liable to be accessed in various other functions, they'd have to live as globals (or be passed around, which would be annoying). Whereas with static class methods in a module, I can call them from wherever and nothing has to be instantiated.

I guess my question is why? Like, is it just 'I want users to get a pre-configured object they don't have to think about?' Because the right answer to that is a factory function where they just call 'get_butts_factory()' and it returns the right thing to them, preconfigured for whatever scenario is appropriate.

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.

Falcon2001 posted:

For example, I'm using one in a current program I'm working on, because the specific use case is essentially a logging system of sorts. It does not take any additional side effect actions of any kind, it just stores data in a globally available place for later retrieval,

I would argue that this is actually similar to what I'm doing, except replacing the "logging" with "tables" (data that has been extracted from the various files).

The vals list in my example code would be the equivalent of your logging accumulation.

QuarkJets
Sep 8, 2008

Cyril Sneer posted:

I skipped over them, but I wanted to comment on the "normal" class-based solutions from QuarkJets and Falcon2001. I understand both of these, and I agree they work, but I'm somewhat dissatisfied with this approach because I don't really want the user to have to instantiate the classes themselves, and since the instances are liable to be accessed in various other functions, they'd have to live as globals (or be passed around, which would be annoying). Whereas with static class methods in a module, I can call them from wherever and nothing has to be instantiated.

If I am loading and transforming some data then my workflow should have full custody of that data at every step. If users are meant to use the data stored in these classes, then my opinion is that this is a design that's prone to "side effects". There are a zillion words written on the subject of why you should try to avoid side effects in code, but I'll just point back to my first sentence: if I don't pass some data to a function, then it's a cardinal sin for that data to get transformed by that function, and this design facilitates exactly that kind of situation. It's creating a kind of pitfall.

Cyril Sneer
Aug 8, 2004

Life would be simple in the forest except for Cyril Sneer. And his life would be simple except for The Raccoons.
Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram:

https://ibb.co/Fgy2f26

I'm working on a project to extract data from a bunch of production-related excel files. Individual files consist of two sheets - a cover sheet and a report sheet. The cover sheet has certain fields whose values I extract and the report sheet contains tabular data records. This tabular data gets extracted, possibly cleaned, then merged with the cover fields.

The blocks in the black circles can be considered stable/fixed, meaning the same code works for all file types. The red circles represent places where the code may vary. For example, for some file types, the clean block has to have a few lines of code to deal with merged cells.

We can think of there being three file types. FileTypeA and FileTypeB require the same processing steps, with only certain options in a configuration dictionary that need changing (column names, desired fields, that sort of thing). However, they are different datasets and should be aggregated separately. A third file type, FileTypeC, requires some different processing in the Clean module.

Normal classes at first pass seem like an obvious solution. I can define standard behaviors for those 5 blocks, and aggregate the results to each class instance. Then, I can subclass the blocks when/if needed (i.e., to handle FileTypeC). The thing that doesn't sit well with me here is that none of these blocks actually require any state information. They can all be standalone functions. This was partially why I explored the singleton approach.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

Cyril Sneer posted:

Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram:

https://ibb.co/Fgy2f26

I'm working on a project to extract data from a bunch of production-related excel files. Individual files consist of two sheets - a cover sheet and a report sheet. The cover sheet has certain fields whose values I extract and the report sheet contains tabular data records. This tabular data gets extracted, possibly cleaned, then merged with the cover fields.

The blocks in the black circles can be considered stable/fixed, meaning the same code works for all file types. The red circles represent places where the code may vary. For example, for some file types, the clean block has to have a few lines of code to deal with merged cells.

We can think of there being three file types. FileTypeA and FileTypeB require the same processing steps, with only certain options in a configuration dictionary that need changing (column names, desired fields, that sort of thing). However, they are different datasets and should be aggregated separately. A third file type, FileTypeC, requires some different processing in the Clean module.

Normal classes at first pass seem like an obvious solution. I can define standard behaviors for those 5 blocks, and aggregate the results to each class instance. Then, I can subclass the blocks when/if needed (i.e., to handle FileTypeC). The thing that doesn't sit well with me here is that none of these blocks actually require any state information. They can all be standalone functions. This was partially why I explored the singleton approach.

The diagram helps; I would say before going too deep that this might be a small enough thing that breaking convention will not bite you in the rear end too hard.

That being said, I would consider composition over inheritance for solving this.

For example:
Python code:
from typing import Callable

class DataSource:

    table_clean_func: Callable
    merge_func: Callable

    def __init__(self, table_clean_func: Callable, merge_func: Callable, inputs) -> None:
        self.table_clean_func = table_clean_func
        self.merge_func = merge_func
        self.inputs = inputs

    def extract_fields(self):
        pass
        # common extraction stuff goes here

    def extract_table(self):
        pass
        # common extract table stuff goes here

    # etc etc

def table_clean_a():
    # Clean code for type A goes here
    pass

def table_clean_b():
    # Clean code for type B goes here
    pass

def merge_a():
    # Merge code for condition A goes here
    pass

def merge_b():
    # Merge code for condition B goes here
    pass

a = DataSource(table_clean_func=table_clean_a, merge_func=merge_a, inputs=inputs)
b = DataSource(table_clean_func=table_clean_b, merge_func=merge_a, inputs=inputs)
Hopefully this pile of pseudocode gets across the idea, and that I didn't typo anything. Instead of using inheritance etc., you can simply pass functions in to compose the various steps (you can also enclose those in a proper class if it feels nicer or there's enough going on there; this is very common with Config objects in various languages). It's the opposite idea of using an abstract base class and inheritance, but works well for situations where you need to mix and match things instead of simply being like 'everything must have strict inheritance'.

How I'd approach it would be having a pipeline setup like this:

Step 1: For all inputs, have a function that can read them, determine which type they should be, and build the appropriate DataSource object, then put those in a big queue for processing. This would cover your Extract Fields / Extract Table step in your diagram.
Step 2: Once your inputs are normalized, do a for data_source in data_sources: data_source.clean_table() sort of step, relying on each object to have been created with the correct clean_table function.
Step 3: Same, but with merge.
Step 4: Once this is completed, I presume you have a list of dataframes, so I'd aggregate by whatever data you desire; a rough loop sketch of the whole thing follows below.
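
Very roughly, the driver for those steps might look like this (build_data_source, input_paths, and aggregate are hypothetical helpers; DataSource is the class sketched above):
Python code:
data_sources = [build_data_source(path) for path in input_paths]   # Step 1

for data_source in data_sources:
    data_source.clean_table()   # Step 2: each object carries the clean func it was built with

for data_source in data_sources:
    data_source.merge()         # Step 3: same idea for the merge func

result = aggregate([ds.dataframe for ds in data_sources])           # Step 4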

I could write this up into a Gist or something if you want more details, but I actually owned a pretty similar pipeline on a previous project and we used a lot of the same concepts. We had taken it a step further, with our MergeStrategy being an abstract base class where we had multiple different MergeStrategy options with a similar interface, but those were all passed into a specific merge step of our pipeline that took them in.

Other ways to possibly approach this:

- Instead of building the table_clean/etc. functions into an object that represents the files, just have an enum assigned to the files during Step 1 above and then use a mapping later on in Steps 2 and 3 to determine which clean/merge functions to use; this concept is called dictionary dispatch and is a pretty handy pattern.

Falcon2001 fucked around with this message at 02:26 on Aug 15, 2023

I. M. Gei
Jun 26, 2005

CHIEFS

BITCH



This question is dumb as hell — I haven't coded anything in almost 10 years and I am still a complete noob at Python — but why might my break points not be working in PyCharm?

The program is supposed to pause when it gets to a line with a red circle and wait for me to click a button to keep going, is it not?

Because mine keeps running my code without stopping, and I don't think the tutorial has an answer as to why (or maybe it does and I just didn't see it?).

I. M. Gei fucked around with this message at 02:42 on Aug 15, 2023

Data Graham
Dec 28, 2009

📈📊🍪😋



Breakpoints/debug mode only work if you run your code through a run configuration (the toolbar in the middle right at the top). If you just run python at the terminal it won't be in the context of the debugger.

If you have a single .py file that you're trying to run as a script, try selecting "Current File" from the Configurations menu and hit Debug. Your breakpoints should work then.

I. M. Gei
Jun 26, 2005

CHIEFS

BITCH



Data Graham posted:

Breakpoints/debug mode only work if you run your code through a run configuration (the toolbar in the middle right at the top). If you just run python at the terminal it won't be in the context of the debugger.

If you have a single .py file that you're trying to run as a script, try selecting "Current File" from the Configurations menu and hit Debug. Your breakpoints should work then.

I'll try that when I get back to my computer and report back. Thanks!



EDIT: I tried selecting Current File and hitting Debug, and the breakpoints still don't work. Also I'm getting a message on the left side of my screen that says "Connection to Python debugger failed. Attempt timed out".

What the gently caress?

I. M. Gei fucked around with this message at 06:16 on Aug 15, 2023

QuarkJets
Sep 8, 2008

Cyril Sneer posted:

Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram:

https://ibb.co/Fgy2f26

I'm working on a project to extract data from a bunch of production-related excel files. Individual files consist of two sheets - a cover sheet and a report sheet. The cover sheet has certain fields whose values I extract and the report sheet contains tabular data records. This tabular data gets extracted, possibly cleaned, then merged with the cover fields.

The blocks in the black circles can be considered stable/fixed, meaning the same code works for all file types. The red circles represent places where the code may vary. For example, for some file types, the clean block has to have a few lines of code to deal with merged cells.

We can think of there being three file types. FileTypeA and FileTypeB require the same processing steps, with only certain options in a configuration dictionary that need changing (column names, desired fields, that sort of thing). However, they are different datasets and should be aggregated separately. A third file type, FileTypeC, requires some different processing in the Clean module.

Normal classes at first pass seem like an obvious solution. I can define standard behaviors for those 5 blocks, and aggregate the results to each class instance. Then, I can subclass the blocks when/if needed (i.e., to handle FileTypeC). The thing that doesn't sit well with me here is that none of these blocks actually require any state information. They can all be standalone functions. This was partially why I explored the singleton approach.

Inheritance should not be used for code reuse.

I think you should just use dictionary dispatch, maybe with an Enum if you really feel strongly that you need to have a class somewhere in your codebase.
Python code:
def process_worksheets(file_name):
    file_type = return_file_type(file_name)
    clean_table = {FileType.A: clean_table_a,
                   FileType.B: clean_table_b,
                   FileType.C: clean_table_c}[file_type]
    merge_table = {FileType.A: merge_table_a,
                   FileType.B: merge_table_b,
                   FileType.C: merge_table_c}[file_type]
    sheets = extract_sheets(file_name)
    config = read_config()
    df = sheets_to_dataframe(sheets, config, clean_table, merge_table)
    aggregate_dataframe(df)


def sheets_to_dataframe(sheets, config, clean_table, merge_table):
    df = extract_report_table(sheets, config)
    df = clean_table(df)
    fields = extract_fields(sheets, config)
    df = merge_table(df, fields)
    return df


def clean_table_a():
    pass


def clean_table_b():
    pass

# etc.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

I think you should just use dictionary dispatch, maybe with an Enum if you really feel strongly that you need to have a class somewhere in your codebase.

This is cleaner than my example and covers the edit I put in; I'd say go with something like this.

In my opinion, classes in Python are best when you need to handle some sort of shared state consistently, or when holding data in a structured way since Python doesn't really have a Struct concept.

Data Graham
Dec 28, 2009

📈📊🍪😋



I. M. Gei posted:

I'll try that when I get back to my computer and report back. Thanks!



EDIT: I tried selecting Current File and hitting Debug, and the breakpoints still don't work. Also I'm getting a message on the left side of my screen that says "Connection to Python debugger failed. Attempt timed out".

What the gently caress?

Do you have a Python interpreter set for your project? Bottom right edge of the window, to the left of the git/source control info.

If not, you should probably set one; ideally make a virtual environment (it will do this for you; "Add New Interpreter" > "Add Local Interpreter" > virtualenv environment).

CompeAnansi
Feb 1, 2011

I respectfully decline
the invitation to join
your hallucination

Cyril Sneer posted:

Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram:

https://ibb.co/Fgy2f26

I'm working on a project to extract data from a bunch of production-related excel files. Individual files consist of two sheets - a cover sheet and a report sheet. The cover sheet has certain fields whose values I extract and the report sheet contains tabular data records. This tabular data gets extracted, possibly cleaned, then merged with the cover fields.

The blocks in the black circles can be considered stable/fixed, meaning the same code works for all file types. The red circles represent places where the code may vary. For example, for some file types, the clean block has to have a few lines of code to deal with merged cells.

We can think of there being three file types. FileTypeA and FileTypeB require the same processing steps, with only certain options in a configuration dictionary that need changing (column names, desired fields, that sort of thing). However, they are different datasets and should be aggregated separately. A third file type, FileTypeC, requires some different processing in the Clean module.

Normal classes at first pass seem like an obvious solution. I can define standard behaviors for those 5 blocks, and aggregate the results to each class instance. Then, I can subclass the blocks when/if needed (i.e., to handle FileTypeC). The thing that doesn't sit well with me here is that none of these blocks actually require any state information. They can all be standalone functions. This was partially why I explored the singleton approach.

This is a data engineering task. Generally, we don't go nuts on using classes or OOP-style approaches when writing data pipelines, especially if it's a one-off task. The most recent code from QuarkJets using dictionary dispatch is closer to the code I'd write for this than the other class-based proposals.

If I were handed this task, I'd write a function for each step and then an overall pipeline function that strings them together passing data from one step to the next. To provide some background here, generally, data pipelines have three discrete steps: extract, transform, and load. For your case, that means that you'd start by extracting the data from the excel files into an in-memory object. Then you'd do whatever cleaning procedures, transformations, and/or formatting changes needed for the load step. Finally, you'd load it into the destination, which usually means writing it to a table or loading it into s3.

Your diagram in this post covers the first two steps and then, from prior posts, it seems your load step is unioning together the cleaned dataframes from each file into a single dataframe for each type. Not sure what you're ultimately doing with those giant dataframes though... If you're loading them into a table, then you don't need to combine the dataframes first. You can just append the rows from each dataframe to the table directly as you process them.

For handling the various kinds of transforms you have to do depending on the base data, I think the suggestions so far overcomplicate things unless the different transforms have basically no overlap between them. If they overlap but, say, some types skip some steps while others add some steps, then I'd just gate certain steps within the transform function behind conditionals that test against values passed into the function rather than breaking each transform "type" out into its own separate function.
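
To sketch what I mean (the flag names and the merged-cell fix are made up):
Python code:
import pandas as pd


def transform(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """One transform for every file type; optional steps are gated by flags
    carried in the per-type config instead of living in separate functions."""
    if config.get("fix_merged_cells"):
        df = df.ffill()  # e.g. forward-fill the blanks that merged cells leave behind
    df = df.rename(columns=config.get("column_map", {}))
    if config.get("drop_empty_rows", True):
        df = df.dropna(how="all")
    return df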

CompeAnansi fucked around with this message at 23:07 on Aug 16, 2023

I. M. Gei
Jun 26, 2005

CHIEFS

BITCH



Data Graham posted:

Do you have a Python interpreter set for your project? Bottom right edge of the window, to the left of the git/source control info.

If not, you should probably set one; ideally make a virtual environment (it will do this for you; "Add New Interpreter" > "Add Local Interpreter" > virtualenv environment).

Either I already have an interpreter set up, or it's not letting me set one up for some reason.

Here is what my screen looks like when I click Add Local Interpreter.

Data Graham
Dec 28, 2009

📈📊🍪😋



Yeah, you've already got one; down in the lower right it's saying you've got a virtualenv with 3.11.

That being the case, I'm afraid I'm fresh out of ideas as to why it's not cooperating; I haven't seen that "timeout" error.

I. M. Gei
Jun 26, 2005

CHIEFS

BITCH



Data Graham posted:

Yeah, you've already got one; down in the lower right it's saying you've got a virtualenv with 3.11.

That being the case, I'm afraid I'm fresh out of ideas as to why it's not cooperating; I haven't seen that "timeout" error.

I don't know if it'll help but I can post a screenshot of the error message next time I run my code.

In the meantime, Google is saying I can type "breakpoint()" anywhere in my code and it should work the same way as a red circle breakpoint. I can give that a try until I get the circles working.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

I. M. Gei posted:

I don't know if it'll help but I can post a screenshot of the error message next time I run my code.

In the meantime, Google is saying I can type "breakpoint()" anywhere in my code and it should work the same way as a red circle breakpoint. I can give that a try until I get the circles working.

I haven't used PyCharm in a while but in general, I'll at least reassure you that sometimes debugger setups are kind of weird and not well explained and cause more consternation than most new devs are expecting.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Falcon2001 posted:

I haven't used PyCharm in a while but in general, I'll at least reassure you that sometimes debugger setups are kind of weird and not well explained and cause more consternation than most new devs are expecting.

Seems like debugging .py files in every Python IDE is so bad that it inspired someone to make Jupyter

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
Ehhhh, Jupyter is a very different beast and has a lot more to do with cached cell-by-cell execution, etc.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Falcon2001 posted:

Ehhhh, Jupyter is a very different beast and has a lot more to do with cached cell-by-cell execution, etc.

:thejoke:

StumblyWumbly
Sep 12, 2007

Batmanticore!
I've used multiple iterations of PyCharm, and they've all just worked for me. One drawback to PyCharm is that there are so many releases that advice doesn't always work for your particular version.
Is something weird about this computer? Did you install python 2.7 or muck around with the path? Can you run scripts and get them to just print stuff? Is there weird security stuff or some kind of Python framework thing?
Are you sure you are pushing the right buttons and setting break points in the right files?
Hopefully those questions spark something because without sitting at the computer it'll be hard to debug.

Adbot
ADBOT LOVES YOU

QuarkJets
Sep 8, 2008

Anyone else ever experience a ThreadPoolExecutor that just like... gives up? I have a weird edge case where the number of workers is 1 and the ThreadPoolExecutor just seems to stop accepting new tasks after a while. I mean, I can submit to it successfully, and all previous tasks finished, but the next submission never actually starts and I'm not sure why.

E: I have confirmed that none of the prior futures raised an exception, by checking the relevant method of each future object

QuarkJets fucked around with this message at 04:11 on Aug 18, 2023
