|
eXXon posted:You can make do_something an abstractmethod and move the implementation to Bob. It's the second part - I want to provide a default set and those should be the only ones available to the user. They're really a set of pre-defined ETL operations. To elaborate, I have a bunch of files, containing tabular data, but structured differently. So I want the user to be able to invoke calls like: code:
I only want the user to access the provided Records interfaces.
|
# ? Aug 9, 2023 05:12 |
|
|
# ? May 30, 2024 13:52 |
|
I think I would probably model that as Python code:
|
# ? Aug 9, 2023 05:19 |
|
Cyril Sneer posted:Looking for advice on the following example code I've been playing around with: There's a lot of duplicate code between Name, Bob, and Sam. You could try control inversion here to simplify the implementation a lot; it's a lot of work to define a class just to overload a single method of its parent. It'd be a lot easier to just accept a function object that defines the modular behavior (e.g. a summing function, a multiplying function, etc.). Since we're no longer using inheritance and method overloading, we can just define one class and instantiate it a couple of times. Why does get_vals() exist? It's pointless; ditch it. Getter methods that just return an attribute should be avoided. Python code:
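A minimal sketch of this control-inversion approach, with illustrative names (Accumulator, apply, etc. are my own, not from the original post):

```python
from typing import Callable

class Accumulator:
    """One class replaces the Name/Bob/Sam subclasses; the varying
    behavior is passed in as a function object instead of overloaded."""
    def __init__(self, operation: Callable[[int, int], int], initial: int = 0):
        self.operation = operation
        self.value = initial

    def apply(self, x: int) -> None:
        # Fold the new value into the running result using the injected op
        self.value = self.operation(self.value, x)

# Instantiate once per behavior instead of defining a subclass per behavior
summer = Accumulator(lambda acc, x: acc + x)
multiplier = Accumulator(lambda acc, x: acc * x, initial=1)

for n in (2, 3, 4):
    summer.apply(n)
    multiplier.apply(n)

print(summer.value)      # 9
print(multiplier.value)  # 24
```

No get_vals() getter: the value attribute is read directly, which is the idiomatic Python choice.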
QuarkJets fucked around with this message at 07:01 on Aug 9, 2023 |
# ? Aug 9, 2023 05:46 |
|
Another perhaps minor note, but I'm reasonably sure that this: Python code:
Python code:
I also agree with QuarkJets' approach: if the lambda part seems confusing (and fwiw, I still find lambdas confusing), just have it be various full functions (or classes if you really want) that users can pass in, so you have something like this: Python code:
Python code:
|
# ? Aug 9, 2023 06:59 |
|
You're kind of missing the part where Cyril wanted them to behave like singletons, which is why vals is a class attribute rather than an instance attribute. The best advice I've read in that regard is to only try to make something a singleton if your program cannot possibly work otherwise, and I don't think that's the case here. To that end,Cyril Sneer posted:It's the second part - I want to provide a default set and those should be the only ones available to the user. They're really a set of pre-defined ETL operations. ... why can't there be more than one RecordsA? If these are chunks of a larger dataset, couldn't a user conceivably want filename1+2 separately from filename3? And if you're ultimately creating a dataframe from this, are you planning to protect it from mutation somehow while the Records classes provide some subset of dataframe functionality? Otherwise I'm not sure what you mean by only allowing the user to access the Records interfaces.
|
# ? Aug 9, 2023 12:19 |
|
Thanks for the assist. There isn't always a decimal or a comma. I am supposed to allow for values between 0.001 and 200,000.001. I think you are correct, and I should ignore typos.
|
# ? Aug 9, 2023 13:37 |
|
joebuddah posted:Thanks for the assist. Isn't this indistinguishable on its own? 200,002 doesn't tell me if it is 200 with a decimal portion of 002 or 200,002 with a decimal portion of 0
|
# ? Aug 9, 2023 14:18 |
|
Yeah, unless they consistently pass the value with 3 decimal places (so 1234 -> 1,234.000 or 1.234,000), you'd need either some specific flag or enough contextual data to infer which format it should use (e.g. values from this source or with this pattern of metadata are always US). Sometimes the source data is just too crap to do anything with, and they gotta fix the problem before they pass the data to you.
|
# ? Aug 9, 2023 14:32 |
|
You could read backward through the string to see if there's punctuation before you've passed 3 places. Then, use that knowledge to add zeroes to the end. That ought to at least normalize the decimal part length.
|
# ? Aug 9, 2023 14:37 |
|
spiritual bypass posted:You could read backward through the string to see if there's punctuation before you've passed 3 places. Then, use that knowledge to add zeroes to the end. That ought to at least normalize the decimal part length. But there potentially isn't a decimal
|
# ? Aug 9, 2023 14:51 |
|
If you've passed three spots without punctuation, you know that's where it would go. Add zeroes, then strip all punctuation and you have yourself a nice, safe integer that you can reformat for display.
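The backward-scan idea described above could be sketched like this (a rough illustration; the normalize name and the integer-thousandths convention are my own assumptions, and a separator sitting exactly 3 digits in, like 200,002, stays inherently ambiguous, as noted earlier in the thread):

```python
def normalize(raw: str) -> int:
    """Return the value in integer thousandths using the backward-scan
    rule: walk from the end of the string; punctuation found before
    passing 3 digits marks the decimal separator. Pad with zeroes to
    3 decimal places, then strip all punctuation."""
    decimals = 0
    for i, ch in enumerate(reversed(raw)):
        if i >= 3:
            break  # passed 3 places with no punctuation: no decimal part
        if ch in ".,":
            decimals = i  # i digits follow the separator
            break
    padded = raw + "0" * (3 - decimals)
    return int("".join(c for c in padded if c.isdigit()))

print(normalize("1,5"))   # 1500  (i.e. 1.5)
print(normalize("2,75"))  # 2750  (i.e. 2.75)
print(normalize("1234"))  # 1234000
```

The normalized integers can then be reformatted for display in whichever locale style is wanted.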
|
# ? Aug 9, 2023 15:04 |
|
eXXon posted:You're kind of missing the part where Cyril wanted them to behave like singletons, which is why vals is a class attribute rather than an instance attribute. The best advice I've read in that regard is to only try to make something a singleton if your program cannot possibly work otherwise and I don't think that's the case here. To that end, I should clarify on the files a bit. I have a giant pool of files named as - code:
Does that make sense?
|
# ? Aug 9, 2023 15:25 |
|
eXXon posted:You're kind of missing the part where Cyril wanted them to behave like singletons, which is why vals is a class attribute rather than an instance attribute. I tried rereading the post but I don't see this singleton requirement stated anywhere - I think instances should work fine for their use case.
|
# ? Aug 9, 2023 15:54 |
|
Even if it was supposed to be a Singleton, I wouldn't just implement one class-level attribute. You should be enforcing the Singleton pattern via some init fuckery or via a factory.
|
# ? Aug 9, 2023 16:06 |
|
Falcon2001 posted:Even if it was supposed to be a Singleton, I wouldn't just implement one class level attribute. You should be enforcing the Singleton pattern via some init fuckery or via a factory. Can you elaborate on this? Even if I don't ultimately go in this direction I'd still like to learn/try it.
|
# ? Aug 9, 2023 17:03 |
|
Cyril Sneer posted:Can you elaborate on this? Even if I don't ultimately go in this direction I'd still like to learn/try it. Basically, if you actually want a singleton (and I agree with other posters that for this problem you don't), you should make it literally impossible to create more than one instance. IIRC the usual way to implement this in Python is by altering the __new__() method and checking if cls._instance is None before returning either the sole existing instance of the class or creating and returning a new one (I have never actually needed to do this in the wild so might have got a couple of these names wrong). Singletons do have their uses (e.g. loggers are frequently implemented using a singleton), but those are fairly specialised and limited, and you should probably think carefully about whether you actually need one.
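The __new__()/_instance pattern described above looks roughly like this (a sketch; the Logger class and messages attribute are illustrative):

```python
class Logger:
    """Minimal singleton via __new__: every construction attempt
    returns the same sole instance."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.messages = []  # one-time initialization
        return cls._instance

a = Logger()
b = Logger()
print(a is b)  # True: both names refer to the one instance
```

Note there's deliberately no __init__ here; with one, the body would re-run on every Logger() call even though the same instance is returned, which is a classic gotcha with this pattern.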
|
# ? Aug 9, 2023 18:52 |
|
Chin Strap posted:Isn't this indistinguishable on its own? All of our values would have at least 1 decimal location, except for Boolean values. So I use the pseudocode below. This is how I'm testing the values going in:

If there is a decimal and a comma:
    If the decimal index <= 2, it's EU
    If the comma index <= 2, it's US

If there is only one:
    If there is a comma, it's EU
    If there is a period, it's US
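That heuristic could be sketched as below. One hedge: for dual-separator values I've generalized the "index <= 2" test to "whichever separator appears first is the thousands separator", since a literal index check stops working for values like 200,000.001; everything else follows the pseudocode, and detect_format is my own name.

```python
def detect_format(raw: str) -> str:
    """Classify a numeric string as 'EU' or 'US' by separator position.
    Assumes every non-Boolean value carries at least one decimal place."""
    dot, comma = raw.find("."), raw.find(",")
    if dot != -1 and comma != -1:
        # The earlier separator groups thousands; the later one is decimal
        return "EU" if dot < comma else "US"
    if comma != -1:
        return "EU"  # lone comma: decimal separator
    return "US"      # lone period (or no separator at all)

print(detect_format("200.000,001"))  # EU
print(detect_format("200,000.001"))  # US
print(detect_format("2,75"))         # EU
```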
|
# ? Aug 9, 2023 23:15 |
|
So here's a simple way to do it as a class instance:Python code:
Python code:
Python code:
For example: one very common pattern people use singletons for is for client objects, such as S3 or whatever; this isn't necessarily a bad idea in most cases because your program may never need to call more than one AWS account, so why bother having multiple singletons? The issue comes up when suddenly you realize 'oh poo poo, instead of initializing the client at the beginning and passing it down via dependency injection, I'm instead just requesting that singleton specifically in my code where it's needed.' That's not a problem...until you need to patch 50 different places you call S3. In my uneducated opinion, singletons share a lot of code smell space with Global variables - they're not always the wrong idea, but they're certainly something worth being careful about using and are probably the wrong idea. For example, I'm using one in a current program I'm working on, because the specific use case is essentially a logging system of sorts. It does not take any additional side effect actions of any kind, it just stores data in a globally available place for later retrieval, and the architecture of our software makes initializing that up top and passing it down somewhat complex, and definitely much more verbose. All of those things mean that patching is not required - the bits that actually take actions based on this singleton's data use dependency injection correctly, allowing us to pass in appropriate objects there. This was a tradeoff for complexity, but was one made after discussing it with team mates and having a quick design discussion. Falcon2001 fucked around with this message at 07:53 on Aug 10, 2023 |
# ? Aug 9, 2023 23:17 |
|
I skipped over them, but I wanted to comment on the "normal" class-based solutions from QuarkJets and Falcon2001. I understand both of these, and I agree they work, but I'm somewhat dissatisfied with this approach as I don't really want the user to have to instantiate the classes, and since they're liable to be accessed in various other functions, they'd have to live as globals (or be passed around, which would be annoying). Whereas using static class methods in a module, I can call them from wherever and they don't have to be instantiated.
|
# ? Aug 10, 2023 03:18 |
|
Cyril Sneer posted:I skipped over them, but I wanted to comment on the "normal" class-based solutions from QuarkJets and Falcon2001. I understand both of these, and I agree they work, but I'm somewhat dissatisfied with this approach as I don't really want the user to have to instantiate the classes, and since they're liable to be accessed in various other functions, they'd have to live as globals (or be passed around, which would be annoying). Whereas using static class methods in a module, I can call them from wherever and they don't have to be instantiated. I guess my question is why? Like, is it just 'I want users to get a pre-configured object they don't have to think about?' Because the right answer to that is a factory function where they just call 'get_butts_factory()' and it returns the right thing to them, preconfigured for whatever scenario is appropriate.
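In the spirit of the hypothetical get_butts_factory() above, a factory sketch might look like this (Records, get_records, and the config contents are all illustrative assumptions):

```python
# Module-private configs: the "default set" lives here, and users only
# ever get preconfigured objects out of the factory.
_CONFIGS = {
    "A": {"columns": ["id", "value"], "sheet": "report"},
    "B": {"columns": ["id", "amount"], "sheet": "report"},
}

class Records:
    def __init__(self, columns, sheet):
        self.columns = columns
        self.sheet = sheet

def get_records(file_type: str) -> Records:
    """Return a Records instance preconfigured for the given file type."""
    try:
        return Records(**_CONFIGS[file_type])
    except KeyError:
        raise ValueError(f"unknown file type: {file_type!r}") from None

recs = get_records("A")
print(recs.columns)  # ['id', 'value']
```

Users never touch the constructor; they ask the factory by file type, and unknown types fail loudly instead of producing a half-configured object.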
|
# ? Aug 10, 2023 03:25 |
|
Falcon2001 posted:For example, I'm using one in a current program I'm working on, because the specific use case is essentially a logging system of sorts. It does not take any additional side effect actions of any kind, it just stores data in a globally available place for later retrieval, I would argue that this is actually similar to what I'm doing, except replacing the "logging" with "tables" (of data that has been extracted from the various files). The vals list in my example code would be the equivalent of your logging accumulation.
|
# ? Aug 10, 2023 03:30 |
|
Cyril Sneer posted:I skipped over them, but I wanted to comment on the "normal" class-based solutions from QuarkJets and Falcon2001. I understand both of these, and I agree they work, but I'm somewhat dissatisfied with this approach as I don't really want the user to have to instantiate the classes, and since they're liable to be accessed in various other functions, they'd have to live as globals (or be passed around, which would be annoying). Whereas using static class methods in a module, I can call them from wherever and they don't have to be instantiated. If I am loading and transforming some data, then my workflow should have full custody of that data at every step. If users are meant to use the data stored in these classes, then my opinion is that this is a design that's prone to "side effects". There are a zillion words written on the subject of why you should try to avoid side effects in code, but I'll just point back to my first sentence: if I don't pass some data to a function, then it's a cardinal sin for that data to get transformed by that function, and this design facilitates that kind of situation. It's creating a kind of pitfall.
|
# ? Aug 10, 2023 06:57 |
|
Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram: https://ibb.co/Fgy2f26 I'm working on a project to extract data from a bunch of production-related excel files. Individual files consist of two sheets - a cover sheet and a report sheet. The cover sheet has certain fields whose values I extract, and the report sheet contains tabular data records. This tabular data gets extracted, possibly cleaned, then merged with the cover fields. The blocks in the black circles can be considered stable/fixed, meaning the same code works for all file types. The red circles represent places where the code may vary. For example, for some file types, the clean block has to have a few lines of code to deal with merged cells.

We can think of there being 3 file types. FileTypeA and FileTypeB require the same processing steps, with only certain options in a configuration dictionary that need changing (column names, desired fields, that sort of thing). However, they are different datasets and should be separately aggregated. A 3rd file type, FileTypeC, requires some different processing in the Clean module.

Normal classes at first pass seem like an obvious solution. I can define standard behaviors for those 5 blocks, and aggregate the results to each class instance. Then, I can subclass the blocks when/if needed (i.e., to handle FileTypeC). The thing that doesn't sit well with me here is that none of these blocks actually require any state information. They can all be standalone functions. This was partially why I explored the singleton approach.
|
# ? Aug 15, 2023 00:30 |
|
Cyril Sneer posted:Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram: The diagram helps; I would say before going too deep that this might be a small enough thing that breaking convention will not bite you in the rear end too hard. That being said, I would consider composition over inheritance for solving this. For example: Python code:
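A sketch of that composition-over-inheritance idea: one DataSource class with the varying clean/merge steps injected as callables rather than subclassed. Rows are dicts here for brevity (in the real pipeline they'd be dataframes), and all the names are illustrative, not from the original post.

```python
from dataclasses import dataclass
from typing import Callable

Rows = list  # list of dict rows, standing in for a dataframe

@dataclass
class DataSource:
    rows: Rows
    cover_fields: dict
    clean: Callable[[Rows], Rows]
    merge: Callable[[Rows, dict], Rows]

    def process(self) -> Rows:
        # The pipeline shape is fixed; only the injected steps vary
        return self.merge(self.clean(self.rows), self.cover_fields)

def default_clean(rows):
    return [r for r in rows if any(r.values())]  # drop empty rows

def default_merge(rows, fields):
    return [{**r, **fields} for r in rows]  # attach cover-sheet fields

src = DataSource(
    rows=[{"id": 1}, {"id": None}],
    cover_fields={"batch": "B7"},
    clean=default_clean,
    merge=default_merge,
)
print(src.process())  # [{'id': 1, 'batch': 'B7'}]
```

FileTypeC would just get built with a different clean callable (e.g. one that forward-fills merged cells) instead of a new subclass.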
How I'd approach it would be having a pipeline setup like this:

Step 1: For all inputs, have a function that can read them, determine which type they should be, and build the appropriate DataSource object, then put those in a big queue for processing. This would cover the Extract Fields / Extract Table step in your diagram.
Step 2: Once your inputs are normalized, do a "for data_source in data_sources: data_source.clean_table()" sort of step, relying on each object to have been created with the correct clean_table function.
Step 3: Same, but with merge.
Step 4: Once this is completed, I presume you have a list of dataframes, so I'd aggregate by whatever data you desire.

I could write this up into a Gist or something if you want more details, but I actually owned a pretty similar pipeline on a previous project and we used a lot of the same concepts. We had taken it a step further, with our MergeStrategy being an abstract base class where we had multiple different MergeStrategy options with a similar interface, but those were all passed into a specific merge step of our pipeline that took them in.

Other ways to possibly approach this: instead of building the clean_table/etc. into an object that represents the files, just have an enum assigned to the files during Step 1 above and then use a mapping later on in Steps 2 and 3 to determine which clean/merge functions to use; this concept is called dictionary dispatch and is a pretty handy pattern. Falcon2001 fucked around with this message at 02:26 on Aug 15, 2023 |
# ? Aug 15, 2023 02:22 |
|
This question is dumb as hell — I haven't coded anything in almost 10 years and I am still a complete noob at Python — but why might my break points not be working in PyCharm? The program is supposed to pause when it gets to a line with a red circle and wait for me to click a button to keep going, is it not? Because mine keeps running my code without stopping, and I don't think the tutorial has an answer as to why (or maybe it does and I just didn't see it?). I. M. Gei fucked around with this message at 02:42 on Aug 15, 2023 |
# ? Aug 15, 2023 02:39 |
Breakpoints/debug mode only work if you run your code through a run configuration (the toolbar in the middle right at the top). If you just run python at the terminal it won't be in the context of the debugger. If you have a single .py file that you're trying to run as a script, try selecting "Current File" from the Configurations menu and hit Debug. Your breakpoints should work then.
|
|
# ? Aug 15, 2023 02:43 |
|
Data Graham posted:Breakpoints/debug mode only work if you run your code through a run configuration (the toolbar in the middle right at the top). If you just run python at the terminal it won't be in the context of the debugger. I'll try that when I get back to my computer and report back. Thanks! EDIT: I tried selecting Current File and hitting Debug, and the breakpoints still don't work. Also I'm getting a message on the left side of my screen that says "Connection to Python debugger failed. Attempt timed out". What the gently caress? I. M. Gei fucked around with this message at 06:16 on Aug 15, 2023 |
# ? Aug 15, 2023 04:53 |
|
Cyril Sneer posted:Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram: Inheritance should not be used for code reuse. I think you should just use dictionary dispatch, maybe with an Enum if you really feel strongly that you need to have a class somewhere in your codebase. Python code:
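A hedged sketch of that dictionary-dispatch idea: map each file type to the clean function it needs, with no class hierarchy at all. The function names, the string keys, and the merged-cell handling are all illustrative assumptions.

```python
def clean_standard(rows):
    return [r for r in rows if r]  # drop empty rows

def clean_merged_cells(rows):
    """FileTypeC-style cleaning: forward-fill gaps left by merged cells."""
    filled, last = [], None
    for r in rows:
        last = r if r else last
        if last is not None:
            filled.append(last)
    return filled

# The dispatch table: file type -> cleaning behavior
CLEANERS = {
    "A": clean_standard,
    "B": clean_standard,
    "C": clean_merged_cells,
}

def clean(file_type, rows):
    return CLEANERS[file_type](rows)

print(clean("C", ["x", None, "y"]))  # ['x', 'x', 'y']
```

Adding a new file type is one dictionary entry, and an Enum can replace the string keys if typo-safety matters.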
|
# ? Aug 15, 2023 07:47 |
|
QuarkJets posted:I think you should just use dictionary dispatch, maybe with an Enum if you really feel strongly that you need to have a class somewhere in your codebase. This is cleaner than my example and covers the edit I put in; I'd say go with something like this. In my opinion, classes in Python are best when you need to handle some sort of shared state consistently, or when holding data in a structured way since Python doesn't really have a Struct concept.
|
# ? Aug 15, 2023 16:46 |
I. M. Gei posted:I'll try that when I get back to my computer and report back. Thanks! Do you have a Python interpreter set for your project? Bottom right edge of the window, to the left of the git/source control info. If not, you should probably set one; ideally make a virtual environment (it will do this for you; "Add New Interpreter" > "Add Local Interpreter" > virtualenv environment).
|
|
# ? Aug 15, 2023 17:03 |
|
Cyril Sneer posted:Thanks for all the responses. So this doesn't turn into an XY problem, maybe I'll just start from the top and explain what I'm trying to do. See the following diagram: This is a data engineering task. Generally, we don't go nuts on using classes or use OOP-style approaches when writing data pipelines especially if it's a one-off task. The most recent code from QuarkJets using the dictionary dispatch is closer to the code I'd write for this than the other class based proposals. If I were handed this task, I'd write a function for each step and then an overall pipeline function that strings them together passing data from one step to the next. To provide some background here, generally, data pipelines have three discrete steps: extract, transform, and load. For your case, that means that you'd start by extracting the data from the excel files into an in-memory object. Then you'd do whatever cleaning procedures, transformations, and/or formatting changes needed for the load step. Finally, you'd load it into the destination, which usually means writing it to a table or loading it into s3. Your diagram in this post covers the first two steps and then, from prior posts, it seems your load step is unioning together the cleaned dataframes from each file into a single dataframe for each type. Not sure what you're ultimately doing with those giant dataframes though... If you're loading them into a table, then you don't need to combine the dataframes first. You can just append the rows from each dataframe to the table directly as you process them. For handling the various kinds of transforms you have to do depending on the base data, I think the suggestions so far overcomplicate things unless the different transforms have basically no overlap between them. 
If they overlap but, say, some types skip some steps while others add some steps, then I'd just gate certain steps within the transform function behind conditionals that test against values passed into the function rather than breaking each transform "type" out into its own separate function. CompeAnansi fucked around with this message at 23:07 on Aug 16, 2023 |
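The function-per-step style with conditionally gated transforms could be sketched like this (a rough illustration; FakeTable stands in for a real table or s3 load, and all names are my own):

```python
def forward_fill(rows):
    """Optional step: fill gaps left by merged cells."""
    out, last = [], None
    for r in rows:
        last = r if r is not None else last
        out.append(last)
    return out

def transform(rows, *, fix_merged_cells=False):
    # Gate optional steps behind flags instead of separate "type" functions
    if fix_merged_cells:
        rows = forward_fill(rows)
    return [r for r in rows if r is not None]

def load(rows, table):
    table.append_rows(rows)  # append directly; no giant combined dataframe

class FakeTable:
    def __init__(self):
        self.rows = []
    def append_rows(self, rows):
        self.rows.extend(rows)

def run_pipeline(raw_rows, table, **transform_opts):
    """String the steps together, passing data from one to the next."""
    load(transform(raw_rows, **transform_opts), table)

t = FakeTable()
run_pipeline(["a", None, "b"], t, fix_merged_cells=True)
print(t.rows)  # ['a', 'a', 'b']
```

Each file's rows get appended as they're processed, so nothing needs to be held in module-level or class-level state between steps.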
# ? Aug 15, 2023 22:04 |
|
Data Graham posted:Do you have a Python interpreter set for your project? Bottom right edge of the window, to the left of the git/source control info. Either I already have an interpreter set up, or it's not letting me set one up for some reason. Here is what my screen looks like when I click Add Local Interpreter.
|
# ? Aug 17, 2023 00:43 |
Yeah, you've already got one; down in the lower right it's saying you've got a virtualenv with 3.11. That being the case, I'm afraid I'm fresh out of ideas as to why it's not cooperating; I haven't seen that "timeout" error.
|
|
# ? Aug 17, 2023 00:49 |
|
Data Graham posted:Yeah, you've already got one; down in the lower right it's saying you've got a virtualenv with 3.11. I don't know if it'll help but I can post a screenshot of the error message next time I run my code. In the meantime, Google is saying I can type "breakpoint()" anywhere in my code and it should work the same way as a red circle breakpoint. I can give that a try until I get the circles working.
|
# ? Aug 17, 2023 01:49 |
|
I. M. Gei posted:I don't know if it'll help but I can post a screenshot of the error message next time I run my code. I haven't used PyCharm in a while but in general, I'll at least reassure you that sometimes debugger setups are kind of weird and not well explained and cause more consternation than most new devs are expecting.
|
# ? Aug 17, 2023 02:52 |
|
Falcon2001 posted:I haven't used PyCharm in a while but in general, I'll at least reassure you that sometimes debugger setups are kind of weird and not well explained and cause more consternation than most new devs are expecting. Seems like debugging of .py files is so bad in every Python IDE that it inspired someone to make Jupyter
|
# ? Aug 17, 2023 04:05 |
|
Ehhhh, Jupyter is a very different beast and has a lot more to do with cached cell-by-cell execution, etc.
|
# ? Aug 17, 2023 04:14 |
|
Falcon2001 posted:Ehhhh, Jupyter is a very different beast and has a lot more to do with cached cell-by-cell execution, etc.
|
# ? Aug 17, 2023 04:15 |
|
I've used multiple iterations of PyCharm, and they've all just worked for me. One drawback to PyCharm is that there are so many releases that advice doesn't always work for your particular version. Is something weird about this computer? Did you install Python 2.7 or muck around with the path? Can you run scripts and get them to just print stuff? Is there weird security stuff or some kind of Python framework thing? Are you sure you are pushing the right buttons and setting breakpoints in the right files? Hopefully those questions spark something, because without sitting at the computer it'll be hard to debug.
|
# ? Aug 17, 2023 05:10 |
|
|
# ? May 30, 2024 13:52 |
|
Anyone else ever experience a ThreadPoolExecutor that just like... gives up? I have a weird edge case where the number of workers is 1 and the ThreadPoolExecutor just seems to stop accepting new tasks after a while. I mean, I can submit to it successfully, and all previous tasks finished, but then the next submission never actually starts, and I'm not sure why. E: I have confirmed that none of the prior futures raised an exception, by checking the relevant method of each future object QuarkJets fucked around with this message at 04:11 on Aug 18, 2023 |
# ? Aug 18, 2023 03:19 |