Macichne Leainig
Jul 26, 2012

by VG
Not sure if this is better suited as a Python or general programming question, but I'll ask here since I'm using Python and pytesseract specifically.

Anyone have experience with OCR? I think I'm just misunderstanding something about Tesseract, but I'm at my wit's end trying to get this to work here.

I have a bunch of images like this - pre-processed, binary black and white images of numbers of interest:



I need to extract these numbers. I know it's possible because if I drop this image into the web demo of Tesseract here, it picks it up fine:



This is my call to PyTesseract, and I do not get any usable results out of it. I've tried other PSM modes as well. I have Tesseract v5.0.0-alpha.20210811 installed locally. I figured PSM 5 should be ideal because it's described as "a single uniform block of vertically aligned text," which this is, is it not?

Python code:
pytesseract.image_to_string(Image.open(f"results/{basename}_ocr.jpg"), config='digits --psm 5')
I'm almost sure I have Tesseract configured incorrectly because as seen above, Tesseract can clearly handle these images - just not with any of the PSM modes I've tried (and I did try them all, just as a sanity test...).

mr_package
Jun 13, 2000
How much should I worry about setting things in os.environ? There's a warning in the docs about memory leaks, but it's unclear to me how much it matters in practice. If you dig into it, the warning is about assigning values of different lengths/sizes. Is this really a problem in real-world usage? https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/putenv.3.html

quote:

BUGS
Successive calls to setenv() or putenv() assigning a differently sized value to the same name will result in a memory leak. The FreeBSD semantics for these functions (namely, that the contents of value are copied and that old values remain accessible indefinitely) make this bug unavoidable. Future versions may eliminate one or both of these semantic guarantees in order to fix the bug.

I think my case, where I make a dictionary copy and then pass it to subprocess.run, is probably fine anyway (it should clean up when the subprocess finishes, yes?)

code:
e = os.environ.copy()
e["DEVELOPER_DIR"] = "path/to/xcode/dir"
subprocess.run([python, "build_macos.py"], env=e)
It's not clear to me whether there's a workaround, such as removing the value entirely: the Python docs indicate unsetenv() is called when deleting from os.environ, but nothing suggests that works around this memory leak. The explicit error condition is "successive calls to setenv() or putenv()", so it's sort of implied, but I don't know enough about BSD guts to say.

I'm thinking we're talking about a few bytes per day, so odds are it would take years to notice anyway?

edit: someone please tell me how to generate that awesome linted "python code" quoted text above. I've seen it a few times but it's not in the documented PHPBB codes is it?

mr_package fucked around with this message at 19:05 on Sep 13, 2021

Wallet
Jun 19, 2006

mr_package posted:

How much should I worry about setting things in os.environ?
I think something like this probably gets around the weirdness of copying the environment itself:
Python code:
subprocess.run([python, "build_macos.py"], env={**os.environ, "DEVELOPER_DIR": "path/to/xcode/dir"})

mr_package posted:

edit: someone please tell me how to generate that awesome linted "python code" quoted text above. I've seen it a few times but it's not in the documented PHPBB codes is it?

Instead of regular code tags, you use code=python to get the highlights; you can see it if you quote someone who's using it.

Wallet fucked around with this message at 19:19 on Sep 13, 2021

QuarkJets
Sep 8, 2008

mr_package posted:

How much should I worry about setting things in os.environ? There's a warning in the docs about memory leaks but it's unclear to me how much this can matter. The warning if you dig into it is about assigning values with different length/sizes. Is this really a problem in real world usage? https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man3/putenv.3.html

I think in my case where I am making a dictionary copy and then passing it to subprocess.run is probably fine anyway (it should clean up when subprocess finishes, yes?)

The memory leak only applies to successive calls to setenv() setting the same key with differently sized values; in other words, the old value persists even after the variable is deleted or reassigned. In practice this is an insane edge case with no meaningful impact: if you are repeatedly setting the env of a continuous process then you're doing something weird and there's almost certainly a better way. You are not doing that; each new process gets its env set once, and that env disappears when the process ends.
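To illustrate (with a hypothetical DEVELOPER_DIR value): the merged dict only ever exists in Python, the parent never calls setenv()/putenv(), and the child process still sees the variable.

```python
import os
import subprocess
import sys

# Build the child's environment without ever touching os.environ itself,
# so the parent process never calls setenv()/putenv() at all.
child_env = {**os.environ, "DEVELOPER_DIR": "path/to/xcode/dir"}  # hypothetical value

# The child sees the extra variable; the parent's environment is unchanged.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEVELOPER_DIR'])"],
    env=child_env,
    capture_output=True,
    text=True,
)
```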

QuarkJets
Sep 8, 2008

Loezi posted:

I need to associate Things (below: strings) with float ranges, s.t. some of the ranges go to either negative or positive infinity. The "trivial" implementation would probably be something like this:

Python code:
thresholds = {
    "low": (float('-inf'), 0),
    "medium": (0, 10),
    "high": (10, float('inf'))
}

def find(x: float) -> str:
    for key, (lower_bound, upper_bound) in thresholds.items():
        if lower_bound <= x < upper_bound: 
            return key
But I dislike find having to know stuff about what I feel like are the internals of thresholds. I could, instead, do this with lambdas:

Python code:
thresholds = {
    "low": lambda x: x < 0,
    "medium": lambda x: 0 <= x < 10,
    "high": lambda x: 10 <= x
}

def find(x: float) -> str:
    for key, check in thresholds.items():
        if check(x): 
            return key
but I'm not thrilled about that either, because lambdas seem like a great recipe to ensure that I get some hard-to-debug bug later on.

Naturally, I could add a custom class, perhaps along the lines of following:
Python code:
class Bounds:
    def __init__(self, lower_bound, upper_bound) -> None:
        self.lower_bound = lower_bound if lower_bound is not None else float('-inf')
        self.upper_bound = upper_bound if upper_bound is not None else float('inf')

    def __contains__(self, value: float) -> bool:
        return self.lower_bound <= value < self.upper_bound

thresholds = {
    "low": Bounds(None, 0),
    "medium": Bounds(0, 10),
    "high": Bounds(10, None)
}

def find(x: float) -> str:
    for key, bounds in thresholds.items():
        if x in bounds: 
            return key
This feels like I'm making a custom implementation of range which I dislike immensely.

Is there some part of the stdlib that I'm missing here, that would make this pattern nice and concise without going all lambda?

This feels like something you should just use a numpy array for, using dtype='object' (assuming you're dealing with actual objects and not just strings):

code:
import numpy as np

x = np.array(['medium' for _ in somearray], dtype='object')
x[somearray < 0] = "low"
x[somearray >= 10] = "high"

Loezi
Dec 18, 2012

Never buy the cheap stuff

cinci zoo sniper posted:

PEP-636 if you’re on 3.10, otherwise I would subclass dictionary to implement range checking inside dictionary key.

Thanks, PEP636 seems worth keeping in mind. Playing around a bit with a dictionary subclass, this seems like a neat approach, allowing me to write code like this:

Python code:
class BoundMapping(dict):

    def __init__(self, *args, **kwargs):
        self.update(*args, **kwargs)

    def __getitem__(self, key):
        for (lb, up), value in self.items():
            if lb <= key < up:
                return value

    def __setitem__(self, key, val):
        key = (
            key[0] if key[0] is not None else float('-inf'),
            key[1] if key[1] is not None else float('inf')
        )
        dict.__setitem__(self, key, val)

    def update(self, *args, **kwargs):
        for k, v in dict(*args, **kwargs).items():
            self[k] = v


bound_mapping = BoundMapping({
    (None, 0): 'low',
    (0, 10): 'medium',
    (10, None): 'high',
})

def get_label(value):
    return bound_mapping[value]
The dict subclass is, naturally, quite fugly, but I like the fact that this hides all the nastiness in that one class def while the rest of the code is super clean :cheers:


QuarkJets posted:

This feels like something you should just use a numpy array for, using dtype='object' assuming you're dealing with actual objects and not just strings

code:
import numpy as np

x = np.array(['medium' for _ in somearray], dtype='object')
x[somearray < 0] = "low"
x[somearray >= 10] = "high"

Not too hot on adding a dependency to numpy just to do something as simple as this.

Loezi fucked around with this message at 10:44 on Sep 14, 2021

Hed
Mar 31, 2004

Fun Shoe
Wow this caused me to look at 3.10 and the structural pattern matching looks neat.

Coming from doing a bit of Rust lately, the lack of an always-required catch-all makes me nervous, but it could lead to some “interesting” uses.

QuarkJets
Sep 8, 2008

Loezi posted:


Not too hot on adding a dependency to numpy just to do something as simple as this.

If that's the case then I would use map:

Python code:

def label_mapper(x):
    if x < 0:
        return "low" 
    if x < 10:
        return "medium" 
    return "high" 

mapped = map(label_mapper, some_other_iterable) 
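Incidentally, the stdlib piece Loezi was asking about might be bisect: keep the finite boundaries in a sorted list and index a label list with the insertion point. A minimal sketch, using the same thresholds as above:

```python
from bisect import bisect_right

# Sorted interior boundaries and one label per interval:
# (-inf, 0) -> "low", [0, 10) -> "medium", [10, inf) -> "high"
boundaries = [0, 10]
labels = ["low", "medium", "high"]

def find(x: float) -> str:
    # bisect_right counts how many boundaries are <= x,
    # which is exactly the index of x's interval.
    return labels[bisect_right(boundaries, x)]
```

This keeps the open-ended ranges implicit (no float('inf') anywhere) and the lookup is O(log n) in the number of boundaries.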

D34THROW
Jan 29, 2012

RETAIL RETAIL LISTEN TO ME BITCH ABOUT RETAIL
:rant:
Hopefully this makes sense as I'm writing it, like it does in my head.

I have a QWidget window with two QLineEdit boxes and four QRadioButtons, grouped into two QButtonGroups of two each. I want to verify, before enabling the "Finish" button, that both QLineEdits are populated and both QButtonGroups have a checkedId() less than -1.

Where I'm running into an issue is finding some sort of editingFinished or focusOut type event for a QButtonGroup. I need self.check_complete() to run on that event for the QButtonGroups.

Python code:
    # Focus-out checks for completeness.
    self.ui.line_width.editingFinished.connect(self.check_complete)
    self.ui.line_projection.editingFinished.connect(self.check_complete)
...
def check_complete(self):
    if (self.ui.grp_fasciasides.checkedId() < -1 and
            self.ui.grp_covers.checkedId() < -1 and
            self.ui.line_width.text() != "" and
            self.ui.line_projection.text() != ""):
        self.ui.btn_finish.setEnabled(True)
    else:
        self.ui.btn_finish.setEnabled(False)
EDIT: Never mind, clicked did the trick.

D34THROW fucked around with this message at 19:48 on Sep 14, 2021

HappyHippo
Nov 19, 2003
Do you have an Air Miles Card?

Loezi posted:

I need to associate Things (below: strings) with float ranges, s.t. some of the ranges go to either negative or positive infinity.

Is there some part of the stdlib that I'm missing here, that would make this pattern nice and concise without going all lambda?

I feel like you're overthinking this and the first solution is completely fine.

Mycroft Holmes
Mar 26, 2010

by Azathoth
I'm attempting to program adding and removing items from an array for a class assignment. I'm getting an error when I attempt to grow the array. My array is [3,77,2,1,0] and I am attempting to add 88 at position 2. My insert code is this:
code:
   def insert(self, index, newItem):
        """Inserts item at index in the array."""
        # grow if array is full
        if index > self.logicalSize:
            self.grow()
            for i in range(self.size(), index, -1):
                self.items[i] = self.items[i-1]
            self.items[index] = newItem
            self.logicalSize += 1
        else:
            for i in range(self.size(), index, -1):
                self.items[i] = self.items[i-1]
            self.items[index] = newItem
            self.logicalSize += 1
What am I doing wrong? When I modify the code to this:
code:
  def insert(self, index, newItem):
        """Inserts item at index in the array."""
        # grow if array is full
        if index >= self.logicalSize:
            self.grow()
            for i in range(self.size(), index, -1):
                self.items[i] = self.items[i-1]
            self.items[index] = newItem
            self.logicalSize += 1
        else:
            for i in range(self.size(), index, -1):
                self.items[i] = self.items[i-1]
            self.items[index] = newItem
            self.logicalSize += 1
It grows before it is supposed to.

Mycroft Holmes fucked around with this message at 23:54 on Sep 14, 2021

QuarkJets
Sep 8, 2008

Mycroft Holmes posted:

I'm attempting to program adding and removing items from an array for class. I'm throwing up an error when I attempt to grow the array. My array is [3,77,2,1,0] and I am attempting to add 88 at position 2.

What am I doing wrong?

You should post the specific exception, along with the traceback lines that Python prints when the exception is raised.

But look at this:

Python code:
for i in range(self.size(), index, -1):
    self.items[i] = self.items[i-1]

Presuming that size() is the length of your list (or whatever it is), you are immediately going out of bounds.

QuarkJets
Sep 8, 2008

Also, a lot of your code is repeated; you could do this:

code:
def insert(self, index, newItem):
    """Inserts item at index in the array."""
    # grow if array is full
    if index > self.logicalSize:
        self.grow()

    for i in range(self.size(), index, -1):
        self.items[i] = self.items[i-1]
    self.items[index] = newItem
    self.logicalSize += 1

Same problem though: e.g. if the size is 5, then index 5 is out of bounds.
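For comparison, here's a standalone toy sketch (not the actual Array class from the assignment) that grows only when physically full and shifts right starting from the last occupied slot, so the highest write is items[logical_size], which stays in bounds after growing:

```python
class SimpleArray:
    """Toy fixed-capacity array to illustrate insert without going out of bounds."""

    def __init__(self, capacity=5):
        self.items = [None] * capacity
        self.logical_size = 0

    def grow(self):
        # Double the physical capacity.
        self.items.extend([None] * len(self.items))

    def insert(self, index, new_item):
        # Grow only when the array is physically full.
        if self.logical_size == len(self.items):
            self.grow()
        # Shift right from the last occupied slot down to index;
        # the first write targets items[logical_size], which is in bounds.
        for i in range(self.logical_size, index, -1):
            self.items[i] = self.items[i - 1]
        self.items[index] = new_item
        self.logical_size += 1
```

With [3, 77, 2, 1, 0] filling the capacity of 5, inserting 88 at position 2 triggers a grow and yields [3, 77, 88, 2, 1, 0, ...].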

Da Mott Man
Aug 3, 2012


Unless I'm missing something, the easy way to do what you want is this; list manipulation functions are built in.

Python code:
class Thing:
    def __init__(self):
        self.items = [3,77,2,1,0]

    def insert(self, index, new_item):
        self.items.insert(index, new_item)

    def remove_by_index(self, index):
        self.items.pop(index)

    def remove_by_value(self, value):
        self.items.remove(value)

thing = Thing()  # avoid shadowing the builtin name "object"

thing.insert(2, 88)

print(thing.items)

thing.remove_by_index(2)

print(thing.items)

thing.remove_by_value(3)

print(thing.items)

Da Mott Man fucked around with this message at 04:28 on Sep 15, 2021

Loezi
Dec 18, 2012

Never buy the cheap stuff

HappyHippo posted:

I feel like you're overthinking this and the first solution is completely fine.

The toy example I've been using is naturally just that, a toy example. There's definitely value in hiding most of the logic re: processing the upper and lower bounds to a separate class in the actual thing I'm doing, rather than replicating that same logic in a billion places.

That being said, it might very well turn out that I was overthinking this in the long run. For now, I'm rather happy with the dictionary-based implementation for a "thing that represents potentially unbounded number ranges that I can query for membership, each range associated with a label".

D34THROW
Jan 29, 2012

RETAIL RETAIL LISTEN TO ME BITCH ABOUT RETAIL
:rant:
Okay, now I have a real question.

I have a main menu wherein the user can select one of the calculators, or exit the program. The feature QPushButtons are in a QButtonGroup.

I have a constant list declared at the top of my guiQt module, ENABLED_FEATURES, which contains boolean values to control, in the MainMenu class, whether or not each button is enabled.

Python code:
ENABLED_FEATURES = [True, False, True, False]
...
class MainMenu(QtWidgets.QDialog):
    def __init__(self):
    ...
    # Set enabled buttons.
    self.enable_by_feature()

    def enable_by_feature(self):
        # Set up the id for each button.
        self.ui.grp_features.setId(self.ui.button_panroof, 0)
        self.ui.grp_features.setId(self.ui.button_comproof, 1)
        self.ui.grp_features.setId(self.ui.button_stormprotsf, 2)
        self.ui.grp_features.setId(self.ui.button_stormpanel, 3)
        log_debug(f"")
        for button in self.ui.grp_features.findChildren(
            PyQt5.QtWidgets.QPushButton):
            button.setEnabled(ENABLED_FEATURES[button.id()])
This is the code as it stands now - the intent is to set the ID of each QPushButton in the QButtonGroup, then loop through the QButtonGroup by id and perform setEnabled() based on the id's position in the ENABLED_FEATURES list. However, it is not doing...jack poo poo. And the internet is little help.

Thinking about it a second time, the easier solution would be to make ENABLED_FEATURES a dict with button names as the keys and bools as the values. The goal is so that I can just add it to the dict at the top of the code rather than hard-coding it into enable_by_feature(), further down the class - probably with a default value of True if the button can't be found in the list.

Wallet
Jun 19, 2006

D34THROW posted:

Thinking about it a second time, the easier solution would be to make ENABLED_FEATURES a dict with button names as the keys and bools as the values. The goal is so that I can just add it to the dict at the top of the code rather than hard-coding it into enable_by_feature(), further down the class - probably with a default value of True if the button can't be found in the list.

I can't speak to Qt much/at all but are you sure it's redrawing the buttons after you're disabling them?

Also, yeah, doing it by the index in a list of booleans without context is basically the same as just hard-coding the config values anyway. configparser is pretty quick to set up for this kind of thing to get them out of code entirely.
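To sketch the configparser route (the file name and keys here are made up), the feature flags would live in an ini file, with unknown buttons defaulting to enabled:

```python
import configparser

# Hypothetical features.ini:
#
# [features]
# button_panroof = yes
# button_comproof = no
# button_stormprotsf = yes
# button_stormpanel = no

config = configparser.ConfigParser()
config.read("features.ini")  # silently skipped if the file doesn't exist

def is_enabled(button_name: str) -> bool:
    # getboolean understands yes/no, true/false, 1/0; the fallback
    # keeps a button enabled when it isn't listed in the file.
    return config.getboolean("features", button_name, fallback=True)
```

The loop body in enable_by_feature() could then become button.setEnabled(is_enabled(button.objectName())), with no hard-coded list in the code at all.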

AfricanBootyShine
Jan 9, 2006

Snake wins.

I have what I think is a really simple question with NumPy. I finally have a job where I can actually use it for work, but it's been years since I did any real Python work, so I'm a bit lost.

I have a csv that contains readings for a bunch of samples at different wavelengths. I've pasted an example portion of it below. Normally it'll go all the way down to 300 nm. But I've trimmed it for everyone's sanity.

code:
Baseline 100%T,,SampleOx,,SampleOx1,,SampleOx2,,SampleRed,,SampleRed1,,SampleRed2,
Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs,Wavelength (nm),Abs
700,2.521076918,700,0.051371451,700,0.020255247,700,-0.000277047,700,-0.013994155,700,-0.040811472,700,-0.046730809
699,2.515768766,699,0.056336451,699,0.021696234,699,0.002584572,699,-0.014951141,699,-0.038384374,699,-0.042782523
698,2.51525569,698,0.054913107,698,0.020626975,698,0.005365098,698,-0.013208756,698,-0.039243225,698,-0.044276398
697,2.517320871,697,0.051321168,697,0.018043108,697,-0.001523819,697,-0.01844346,697,-0.039591964,697,-0.044799961
696,2.516803503,696,0.048457876,696,0.016133199,696,-0.003205611,696,-0.019673269,696,-0.042768963,696,-0.048874158
The first row has the sample names, which are spaced out because each sample contributes two columns: a wavelength and an absorbance reading. I want to make a 3D array so that I can easily pick the absorbance at 400 nm for SampleRed. What's the easiest way to feed this into a 3D array, but still retain info like the sample names and the wavelengths?

I want to build this to be extensible, as I will be taking readings using this system for the next few years.

QuarkJets
Sep 8, 2008

That csv layout is janky as gently caress and I suspect that you will need to write something custom to deal with it. It feels like you want a pandas multiindex dataframe for this but I don't think that the pandas csv reader will be able to easily figure out the layout on its own

E: although first thing you should do is try the pandas csv reader and see what it does
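For instance, a dependency-free reshaping pass (a sketch that assumes the layout shown above: a sample name over every pair of Wavelength/Abs columns) gets you a nested dict you can index by sample and wavelength:

```python
import csv
import io

def parse_export(text: str) -> dict:
    """Reshape the instrument export into {sample: {wavelength: absorbance}}."""
    rows = list(csv.reader(io.StringIO(text)))
    # Row 0 holds sample names in every other column; row 1 is the
    # repeated "Wavelength (nm),Abs" header, which we skip.
    samples = [name for name in rows[0] if name]
    readings = {name: {} for name in samples}
    for row in rows[2:]:
        for i, name in enumerate(samples):
            wavelength = float(row[2 * i])
            readings[name][wavelength] = float(row[2 * i + 1])
    return readings
```

From there, readings['SampleRed'][400.0] is a single lookup, and a pandas DataFrame can be built straight from the dict if you want tables.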

QuarkJets fucked around with this message at 19:48 on Sep 17, 2021

CarForumPoster
Jun 26, 2013

⚡POWER⚡

QuarkJets posted:

That csv layout is janky as gently caress and I suspect that you will need to write something custom to deal with it. It feels like you want a pandas multiindex dataframe for this but I don't think that the pandas csv reader will be able to easily figure out the layout on its own

E: although first thing you should do is try the pandas csv reader and see what it does

This looks pretty easy, though I'm not sure I fully understand the file layout. My first instinct is: ignore the first two lines of the file with pandas read_csv() and supply it with column headings. It looks like it's one row, one data set.

e.g. columns: [Baseline 100%T, SampleOx_Wavelength (nm), SampleOx_Abs, and so on]

QuarkJets
Sep 8, 2008

Yeah, it's easy enough to create a 2D table with labels; I just don't think there's an obvious way to make one that's 3D without some twiddling.

DoctorTristan
Mar 11, 2006

I would look up into your lifeless eyes and wave, like this. Can you and your associates arrange that for me, Mr. Morden?

AfricanBootyShine posted:

I have a csv that contains readings for a bunch of samples at different wavelengths. I want to make a 3D array so that I can easily pick the absorbance at 400 nm for SampleRed. What's the easiest way to feed this into a 3D array, but still retain info like the sample names and the wavelengths?

I want to build this to be extensible, as I will be taking readings using this system for the next few years.

As posters above have commented, this is easy enough to turn into a pandas DataFrame via pandas.read_csv(). That is almost certainly what you actually want to do: retrieving the correct data from a NumPy 3D array is going to be a lot more awkward.

Your data structure does look quite odd, though. Is there any reason you've arranged things as

code:
|    Baseline 100%T     |       SampleOx        |      SampleOx1        |   ...
| Wavelength (nm) | Abs | Wavelength (nm) | Abs | Wavelength (nm) | Abs |   ...
|       700       |     |       700       |     |       700       |     |   ...
|       699       |     |       699       |     |       699       |     |   ...
|       698       |     |       698       |     |       698       |     |   ...

All of the wavelength entries in each row seem identical, so wouldn't it make much more sense (and be much easier to work with) to arrange the data like this:

code:
| Wavelength (nm) | Baseline 100%T Abs | SampleOx Abs | SampleOx1 Abs | ...
|       700       |                    |              |               | ...
|       699       |                    |              |               | ...
|       698       |                    |              |               | ...
?

Biffmotron
Jan 12, 2007

I'm going to third that this is a really weird data format. Unless there's a very good reason to do otherwise, data should be tidy: every row is a unique observation and every column is a measurement of the same type across observations. If your data is tidy, analysis and plotting become much easier. If not, you're fighting the data at almost every step.

Original Data with slightly renamed columns
code:
	WL	Abs	WL.1	Abs.1	WL.2	Abs.2	WL.3	Abs.3	WL.4	Abs.4	WL.5	Abs.5	WL.6	Abs.6
0	700	2.52107691	700	0.051371451	700	0.020255247	700	-0.000277047	700	-0.013994155	700	-0.040811472	700	-0.046730809
1	699	2.51576876	699	0.056336450        699	0.021696234	699	0.002584572	699	-0.014951141	699	-0.038384374	699	-0.042782522
2	698	2.51525569	698	0.054913107	698	0.020626975	698	0.005365098	698	-0.013208756	698	-0.039243225	698	-0.044276398
3	697	2.51732087	697	0.051321168	697	0.018043108	697	-0.001523819	697	-0.018443460	697	-0.039591964	697	-0.044799961
4	696	2.51680350	696	0.048457876	696	0.016133198	696	-0.003205610	696	-0.019673269	696	-0.042768963	696	-0.048874158
Same data, but tidy
code:
	Sample		700_nm			699_nm			698_nm			697_nm			696_nm
0	Baseline 	2.521076918	2.515768766	2.51525569		2.517320871	2.516803503
1	SampleOx	0.051371451	0.056336450	0.054913107	0.051321168	0.048457876
2	SampleOx1	0.020255247	0.021696234	0.020626975	0.018043108	0.016133198999999997
3	SampleOx2	-0.000277047	0.002584572	0.005365098	-0.001523819	-0.0032056109999999997
4	SampleRed	-0.013994155	-0.014951141	-0.013208756	-0.01844346		-0.019673269
5	SampleRed1	-0.040811472	-0.038384374	-0.039243225	-0.039591964	-0.042768963
6	SampleRed2	-0.046730809	-0.042782522	-0.044276398	-0.044799961	-0.048874158
There are multiple ways to do this, but the one that I like is reading in the data starting with the second row as dfx, and then constructing a new dataframe from the columns. The dataframe is built from a dictionary with keys that become the column names, and then a list of cell entries as the values. The list of cell entries is created by selecting the Abs.* columns in each row from the raw data. This can be turned into a function that takes the names of the samples and the corresponding Abs.* columns as arguments.

Python code:
dfx = pd.read_csv('pth/your_file.csv', header=1)

dfy = pd.DataFrame({'Sample':['Baseline 100%T','SampleOx','SampleOx1','SampleOx2','SampleRed','SampleRed1','SampleRed2',],
					'700_nm':dfx[['Abs', 'Abs.1', 'Abs.2', 'Abs.3', 'Abs.4', 'Abs.5', 'Abs.6']].loc[0],
					'699_nm':dfx[['Abs', 'Abs.1', 'Abs.2', 'Abs.3', 'Abs.4', 'Abs.5', 'Abs.6']].loc[1],
					'698_nm':dfx[['Abs', 'Abs.1', 'Abs.2', 'Abs.3', 'Abs.4', 'Abs.5', 'Abs.6']].loc[2],
					'697_nm':dfx[['Abs', 'Abs.1', 'Abs.2', 'Abs.3', 'Abs.4', 'Abs.5', 'Abs.6']].loc[3],
					'696_nm':dfx[['Abs', 'Abs.1', 'Abs.2', 'Abs.3', 'Abs.4', 'Abs.5', 'Abs.6']].loc[4]}).reset_index(drop=True)

AfricanBootyShine
Jan 9, 2006

Snake wins.

Thanks for all the help. Looks like I need to sit down with pandas for a few hours.

I agree that the format of the data is incredibly goofy. It's what the instrument spits out when data is exported, so tidying the dataset is something I'd like to write a script to automate.

The initial analysis is dead simple; I can do it in Excel in ten minutes. But I also need to do some deconvolution on the spectra, which requires some real tools.

SurgicalOntologist
Jun 17, 2004

It sort of looks like a pandas MultiIndex as columns, except the labels aren't repeated. I would suggest "manually" constructing a MultiIndex for the column axis. Then you can stack or similar to tidy the dataset.

If you really want to try the 3D thing, the library you want is xarray. But I don't think it would actually help with reading the data, just manipulating it, depending on what you need to do. And it's probably overkill in this case; it really shines with data on a grid (e.g. volumetric).
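
Rough sketch of the MultiIndex idea (sample names are assumptions, and this is untested against the real file):

```python
import pandas as pd

# Stand-in for the raw export: repeated (WL, Abs) column pairs.
raw = pd.DataFrame(
    [[700, 2.52, 700, 0.051],
     [699, 2.51, 699, 0.056]],
)
samples = ["Baseline", "SampleOx"]  # hypothetical sample names

# "Manually" label the columns with a (sample, field) MultiIndex.
raw.columns = pd.MultiIndex.from_product([samples, ["WL", "Abs"]])

# Keep only the Abs columns; use the first sample's wavelengths as the index.
tidy = raw.xs("Abs", axis=1, level=1)
tidy.index = raw[(samples[0], "WL")].rename("Wavelength (nm)")
```

From there, `stack` on the sample level would give one long tidy frame instead of one column per sample.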

Dawncloack
Nov 26, 2007
ECKS DEE!
Nap Ghost
I have a question about python, but it's more of a strategic question than a specific doubt.

I'm slowly working towards the point where I can jump into computer touching if (when) my current career becomes a gently caress. I am following a little study plan gently provided by a goon, which recommended I learn Python as a scripting language.

I got a book and started working through it, and I am handy with the basics, to the point where I made myself a script that backs up my files in a specific manner (tons of directory management and such, essentially a reimplementation of rsync). Now I'm working on parallelism.

In your opinion, what areas of python are important to learn, resume-wise? At what point can I say I know python without it being a massive lie?
Or perhaps that is the wrong question and I'm just going to be dropped in an unknown area and have to google stuff, so what areas are useful? I imagine parallelism and networking?

This, I bet, is a highly subjective question but I'd appreciate opinions.

Thanks in advance.

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Dawncloack posted:

I got a book and started working through it, and I am handy with the basics, to the point where I made myself a script that backs up my files in a specific manner (tons of directory management and such, essentially a reimplementation of rsync). Now I'm working on parallelism.

In your opinion, what areas of python are important to learn, resume-wise? At what point can I say I know python without it being a massive lie?
Or perhaps that is the wrong question and I'm just going to be dropped in an unknown area and have to google stuff, so what areas are useful? I imagine parallelism and networking?

This, I bet, is a highly subjective question but I'd appreciate opinions.

Thanks in advance.

Writing Python in a list of skills won’t provide any credibility that you can do work as a programmer. It’s not really the right question or method to build a resume.

You need to show that you can complete work and projects. Pick a thing you wish existed and build a project, deploy it and put the code on GitHub. Include your GitHub on your resume.

Some suggestions:
-Pick an open source Python package you like and ask the devs what the process to make changes and pull requests is. See if you can contribute a PR. This is something that can go on a resume.
-Pick a topic you like and deploy a flask/django/dash web app about that thing. Put the code on GitHub with a working demo. One which makes some API calls and does some business process. Also popular: Build an ML/AI project in a Jupyter notebook. Then, deploy the model you built with a web app.


Have working demos, it will put you above the many, many, many other just starting out Python coders without engineering degrees.

Dawncloack
Nov 26, 2007
ECKS DEE!
Nap Ghost
I had no idea! Thanks a bunch!

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Dawncloack posted:

I had no idea! Thanks a bunch!

You're welcome. Feel free to stop by the resume thread in BFC or the YOSPOS interviewing thread for advice from a broader audience. Several people, including myself, who hire computer touchers regularly post there.

Here is the breakdown of me hiring an entry-level python person late 2020 at a startup to give you an idea of competition. Going rate is $20-30/hr depending on job and just how entry level they are.

CarForumPoster posted:

Yep, single position in 3 weeks. Here was the breakdown from my 91 applicants for an entry-level python job.

I was very curious what the breakdown was, since the advice I am leaning toward giving strawberrymoose is that he'd better have some good projects to show off, because he's gonna have a tough go at coding jobs.

~40% were Indian developers with hilariously bullshit resumes that didn't provide a GitHub, despite me saying it was the one hard qualification requirement in bold and including it as a required question to apply.

Of the remaining ~60% I went through and wrote their degrees. I may have missed a couple. I separated degrees into the tier I view them in for entry-level developers. Bold are the ones I interviewed or I at least wrote some favorable notes on. Ugrad means they're currently in undergrad, otherwise they'd graduated. The places in parenthesis are where their undergrad was from.





P.S. An open source package that has very little support and a creator who actively wants help is moviepy. It's useful, but it has been stalled on a major version update for a long time. It has 5k+ stars, so it's decently popular.

CarForumPoster fucked around with this message at 17:58 on Sep 18, 2021

SirPablo
May 1, 2004

Pillbug
Any recommendations of a pythonic way to make a density map of hundreds of small shapefiles on to a lat/lon grid? I found one option, geocube, but the code below just returns a single coverage and not a density.

code:
import geopandas as gpd
from geocube.api.core import make_geocube

wwas = gpd.read_file(f'https://mesonet.agron.iastate.edu/pickup/wwa/2021_tsmf_sbw.zip')
wwas = wwas[(wwas.PHENOM=='FF')&(wwas.STATUS=='NEW')&(wwas.SIG=='W')]
wwas['Z'] = 1

C = make_geocube(
    vector_data=wwas,
    measurements=['Z'],
    resolution=(-0.01, 0.01))

Biffmotron
Jan 12, 2007

I last did this three years ago, but the package I found most helpful was geopandas, which you're already using, plus Bokeh for the plotting. Bokeh is nice because it makes it easy to add geographic tiles, so you get streets and other features below your data. Bokeh is kinda weird, but there are plenty of tutorials floating around for pretty similar cases.

SirPablo
May 1, 2004

Pillbug

Biffmotron posted:

I last did this three years ago, but the package I found most helpful was geopandas, which you're already using, plus Bokeh for the plotting. Bokeh is nice because it makes it easy to add geographic tiles, so you get streets and other features below your data. Bokeh is kinda weird, but there are plenty of tutorials floating around for pretty similar cases.

Not sure that is quite what I'm aiming at. Here's an example of one polygon that is rasterized at 0.01°×0.01°. I'd like to do this for hundreds of similar polygons, but the step I'm scratching my head on is counting them up grid by grid on a much larger domain, thus giving me a polygon density.

Only registered members can see post attachments!

SirPablo
May 1, 2004

Pillbug
Now with a white background. edit damnit lol

Only registered members can see post attachments!

accipter
Sep 12, 2003

SirPablo posted:

Not sure that is quite what I'm aiming at. Here's an example of one polygon that is rasterized at 0.01°×0.01°. I'd like to do this for hundreds of similar polygons, but the step I'm scratching my head on is counting them up grid by grid on a much larger domain, thus giving me a polygon density.



Make a raster of the entire area with a value of zero. Loop over each polygon, and increment all points in it by one. You should be able to do this with rasterio.
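
A dependency-free sketch of that loop — for real shapefiles `rasterio.features.rasterize` would replace the hypothetical ray-casting helper below, which only handles simple polygons:

```python
# Assumed helper: ray-casting point-in-polygon test (simple polygons only).
def point_in_polygon(x, y, poly):
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Toggle on each edge the horizontal ray from (x, y) crosses.
        if (y1 > y) != (y2 > y) and x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
            inside = not inside
    return inside

def polygon_density(polygons, lons, lats):
    # Zero raster over the grid, then +1 for every polygon covering a cell.
    counts = [[0] * len(lons) for _ in lats]
    for poly in polygons:
        for j, lat in enumerate(lats):
            for i, lon in enumerate(lons):
                if point_in_polygon(lon, lat, poly):
                    counts[j][i] += 1
    return counts

# Two overlapping squares counted on a coarse 3x3 grid of cell centers:
polys = [[(0, 0), (2, 0), (2, 2), (0, 2)], [(1, 1), (3, 1), (3, 3), (1, 3)]]
counts = polygon_density(polys, [0.5, 1.5, 2.5], [0.5, 1.5, 2.5])
```

The overlap cell ends up with a count of 2, the non-overlapping covered cells with 1, and everything else 0, which is exactly the density surface you're after.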

Bad Munki
Nov 4, 2008

We're all mad here.


You could also do it through something like qgis pretty readily. That’s what we do, composing hundreds of thousands up to millions of shapes, via a Python script that makes a few basic qgis calls.

SirPablo
May 1, 2004

Pillbug
Here's what I ended up doing.

code:
# Imports (added for completeness)
import numpy as np
import geopandas as gpd
from geocube.api.core import make_geocube

# Download data
wwas = gpd.read_file('https://mesonet.agron.iastate.edu/pickup/wwa/2021_tsmf_sbw.zip')
wwas['Z'] = 1

# Make array to add counts
lons, lats = np.meshgrid(np.arange(-120, -70, 0.1), np.arange(25, 60, 0.1))
shape = lons.shape
lons = lons.ravel()
lats = lats.ravel()
counts = lons * 0

# Loop through each FFW
for w in range(wwas.shape[0]):

  # Rasterize (pass a one-row GeoDataFrame, not a list of rows)
  C = make_geocube(
      vector_data=wwas.iloc[[w]],
      measurements=['Z'],
      resolution=(-0.01, 0.01))

  # Get geometry and values
  xs, ys = np.meshgrid(C.x.values, C.y.values)
  xs, ys = xs.flatten().round(1), ys.flatten().round(1)
  # nan= must be a keyword: the second positional arg of nan_to_num is copy
  Zs = np.nan_to_num(C.Z.values.flatten(), nan=0)

  # Add to density array
  before = counts.copy()
  for i in zip(xs, ys, Zs):
    iz = np.argmin((lons - i[0])**2 + (lats - i[1])**2)
    counts[iz] += i[2]

  # Clip to make sure we aren't adding to too many grids
  after = counts.copy()
  delta = np.clip(after - before, 0, 1)
  counts = before + delta
  del before, delta

Ranzear
Jul 25, 2013

Just found out dictionaries are now ordered in 3.7+, but they didn't add any native way to sort by key. That's the most pythonian thing ever.

QuarkJets
Sep 8, 2008

Ranzear posted:

Just found out dictionaries are now ordered in 3.7+, but they didn't add any native way to sort by key. That's the most pythonian thing ever.

I thought sorted already did that


nullfunction
Jan 24, 2005

Nap Ghost

QuarkJets posted:

I thought sorted already did that

sorted() gives you the sorted keys, not key-value pairs.

You can do something like this but I'm not aware of a native method on dict that would do this for you.

Python code:
>>> d = {"a": 1, "c": 3, "b": 2}
>>> s = {k: d[k] for k in sorted(d)}
>>> s
{'a': 1, 'b': 2, 'c': 3}
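
One near-native spelling, leaning on that 3.7+ insertion-order guarantee, is rebuilding the dict straight from its sorted items:

```python
d = {"a": 1, "c": 3, "b": 2}

# dicts preserve insertion order in 3.7+, so rebuilding from sorted
# items yields a key-sorted dict without reaching for OrderedDict.
s = dict(sorted(d.items()))
print(s)  # {'a': 1, 'b': 2, 'c': 3}

# Sorting by value is the same idea with a key function:
by_value = dict(sorted(d.items(), key=lambda kv: kv[1]))
```

Still not a method on dict itself, but it's a one-liner.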
