Eela6
May 25, 2007
Shredded Hen
Hi. I'm trying to learn NumPy as a replacement for MATLAB. I'm implementing an image processing routine I wrote that detects copy-paste forgeries.

image is an mxnx3 array of unsigned 8-bit integers.
mask is an mxn array of booleans.
separator is an mx8x3 array that's BLUE (i.e, each pixel is [0, 0, 255])

I want to create a masked image that is the original image on the left, the blue separator, and a pseudo-grayscale version on the right, e.g.:



But I'm still having some trouble with basic array manipulation. Here's what I did in MATLAB. How would I do this in numpy?
code:
RED = 1
GREEN = 2 
BLUE = 3

maskedImage = rgb2gray(image)			% set to grayscale
maskedImage = repmat(maskedImage, 1, 1, 3)	% make 'pseudo-rgb'

% I need help with this:
maskedImage(:, :, RED) =  maskedImage + mask*255	
maskedImage(:, :, BLUE) = maskedImage + mask*-255	
maskedImage(:, :, GREEN) = maskedImage + mask*-255
maskedImage(maskedImage <= 0) = 0
maskedImage(maskedImage >= 255) = 255
% keep in uint8 space. this might not be necessary
% but i don't understand numpy's typing very well yet

% and this:
outputImage = [image, separator, maskedImage]	
%bmat('image', 'separator', 'maskedImage') seems like the right command,	
%except it seems to only work specifically on matrices, not n-dimensional numpy arrays.

Thanks for your time!

Eela6 fucked around with this message at 20:38 on May 10, 2016


Eela6
May 25, 2007
Shredded Hen

QuarkJets posted:

What problems are you running into, specifically? Can you post your Python code?

QuarkJets posted:

Here's also a helpful cheat sheet for converting between Matlab and Python syntax:

http://mathesaurus.sourceforge.net/matlab-numpy.html

Thank you for your advice. I am aware of the MATLAB-numPy cheat sheet & looked there first.

Edit: No I didn't. I was aware of ~a~ MATLAB-numPy cheat sheet. Let me do some digging in this one.

I seem to have the hang of 2d arrays in numpy (though I am still making the occasional off-by-one error adjusting to the zero-based indexing).

My problem is that I don't seem to understand how to deal with 3+ dimension arrays.
EG,
code:
import numpy as np
from skimage import io
RED = 0
# i want an image where the top left corner is RED
# and the rest is black; just as a toy case.

redMask = np.zeros((128, 128, 3), 'uint8')
redMask[0:63][0:63][:, RED] = 255
redMask = np.uint8(redMask)
io.imshow(redMask)
# but this does not do what I want it to do.
# ... :(
A snippet of the relevant part of my real code:
code:
import numpy as np
from skimage import color
def create_mask(blocks, img, init):
    # create mask: this is a 2d boolean matrix of the same dimensions as the 
    # original image, set to True in blocks that are connected to other blocks
    # (i.e, considered possibly copy-paste forged) and false elsewhere
    size = np.shape(img)
    rows = size[0]
    cols = size[1]
    mask = np.zeros((rows, cols), dtype=bool)
    for block in blocks:
        row = block.row
        col = block.col
        # slice ends are exclusive in Python, so no -1 is needed here
        rowEnd = row + init.blockSize
        colEnd = col + init.blockSize
        mask[row:rowEnd, col:colEnd] = True
    return mask
    
def write_mask(mask, img, init):
    # we create an image image_out which is the original image on the left, a
    # 8-row BLUE (0, 0, 255) barrier, and then a MASKED image on the right.
    # The MASKED image is a grayscale version of the original image, except that
    # it is set to RED (255, 0, 0) where the MASK is True
    # (i.e., blocks considered modified)
    cols = np.shape(img)[1]
    # we want a 'color' grayscale version of the original image:
    imgGray = color.gray2rgb(color.rgb2gray(img))


    """
create separator, maskedImage
we want to add maskedImage*255 to the red channel of imgGray
maskedImage*(-255) to the green channel
maskedImage*(-255) to the blue channel
then cast back to a uint8
    """
# this should work?
    imgOut =  np.hstack((image, separator, maskedImage))
    return imgOut
And the relevant code in MATLAB:
code:
function [imgMasked, imgOut] = write_mask(mask, imgIn)
% use a mask (matrix of same size comprised of natural numbers)
% to 'write over' image: 

%imgMasked is the image in naive grayscale except
% RED   (255, 0, 0) where mask >= 1

%imgOut is the original m x n image (on LEFT)
% a m x 8 separation of BLUE (0, 0, 255)
% and then imageMasked on RIGHT

% create RED mask
mask = repmat(mask, 1, 1, 3);
to_red = (mask > 0);
mask(:, :, 1) = to_red*512;
mask(:, :, 2) = to_red*(-1024);
mask(:, :, 3) = to_red*(-1024);

% every element with a 1 will be > 256        
imgGray = rgb2gray(imgIn);
imgMasked = zeros(size(mask));  % mask is already m x n x 3 after repmat
imgMasked(:, :, 1) = imgGray;
imgMasked(:, :, 2) = imgGray;
imgMasked(:, :, 3) = imgGray;
imgMasked = imgMasked + mask;
% every element in imgMasked will have
% blue and green channel < 0
separation = zeros([size(mask, 1), 8, 3]);
separation(:, :, 3) = 255;

imgOut = uint8([imgIn, separation, imgMasked]);
% every element in imgMasked will be grayscale except where mask has nonzeros
end
Thank you again for your time. Sorry if these are silly questions. I am very new to NumPy, but trying to get better: I enjoy working in Python!

Eela6 fucked around with this message at 03:54 on May 11, 2016

Eela6
May 25, 2007
Shredded Hen
Thank you for your help, everyone! I think I've got it.

code:
import numpy as np
from skimage import io
RED = 0
GREEN = 1
BLUE = 2
ADD_CHANNEL = 8
REMOVE_CHANNEL = 16
testImg = io.imread('test.png')
mask = np.zeros((256, 256, 3), 'uint8')

""" set mask """
mask[0:64, 0:64, RED] = ADD_CHANNEL
mask[0:64, 0:64, GREEN] = REMOVE_CHANNEL
mask[0:64, 0:64, BLUE] = REMOVE_CHANNEL

mask[0:64, 64:128, RED] = REMOVE_CHANNEL
mask[0:64, 64:128, BLUE] = ADD_CHANNEL
mask[0:64, 64:128, GREEN] = REMOVE_CHANNEL

mask[64:128, 0:64, RED] = REMOVE_CHANNEL
mask[64:128, 0:64, BLUE] = REMOVE_CHANNEL
mask[64:128, 0:64, GREEN] = ADD_CHANNEL

mask[64:128, 64:128, RED] = ADD_CHANNEL
mask[64:128, 64:128, BLUE]= ADD_CHANNEL
mask[64:128, 64:128, GREEN] = REMOVE_CHANNEL

mask[32:96, 32:96, RED] = ADD_CHANNEL
mask[32:96, 32:96, GREEN] = ADD_CHANNEL
mask[32:96, 32:96, BLUE] = ADD_CHANNEL

mask[48:80, 48:80, RED] = REMOVE_CHANNEL
mask[48:80, 48:80, BLUE] = REMOVE_CHANNEL
mask[48:80, 48:80, GREEN] = REMOVE_CHANNEL

""" mask over image """
testImg[mask==ADD_CHANNEL] = 255
testImg[mask==REMOVE_CHANNEL] = 0  # uint8 cannot hold -255; 0 removes the channel
io.imshow(testImg)


Edit: I was able to fully re-implement my program. Thanks again for the help!
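
For anyone who finds this thread with the same question: below is a minimal sketch of the original goal (image on the left, blue separator, red-marked grayscale copy on the right). It is not the re-implemented program mentioned above. The function name write_mask_sketch, the luminance weights, and the sep_width parameter are assumptions made for illustration; skimage's color.rgb2gray would be the library route for the grayscale step.

```python
import numpy as np

RED, GREEN, BLUE = 0, 1, 2

def write_mask_sketch(img, mask, sep_width=8):
    """img: (m, n, 3) uint8 array; mask: (m, n) boolean array."""
    # Naive grayscale via a weighted channel sum (the weights are an assumption).
    gray = (img @ np.array([0.299, 0.587, 0.114])).round().astype(np.uint8)
    # Stack the single gray channel three times to get a 'pseudo-RGB' image.
    masked = np.stack([gray, gray, gray], axis=-1)
    # A 2-D boolean mask indexes the first two axes, so whole pixels turn red.
    masked[mask] = (255, 0, 0)
    # Blue separator with the same number of rows as the image.
    separator = np.zeros((img.shape[0], sep_width, 3), dtype=np.uint8)
    separator[:, :, BLUE] = 255
    # hstack joins along axis 1 (columns) for arrays with 2 or more dimensions.
    return np.hstack((img, separator, masked))

img = np.full((4, 6, 3), 100, dtype=np.uint8)
mask = np.zeros((4, 6), dtype=bool)
mask[0, 0] = True
out = write_mask_sketch(img, mask)
print(out.shape, out.dtype)
```

The key difference from the earlier toy case is that all axes are indexed inside one pair of brackets (or with one boolean mask) rather than by chaining [..][..].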

Eela6 fucked around with this message at 20:34 on May 12, 2016

Eela6
May 25, 2007
Shredded Hen
I've been reading Fluent Python by Luciano Ramalho and really enjoying it. The section on operator overloading has let me get a lot more out of my custom classes (e.g., defining __len__ so that len(foo) returns something sensible). Just wanted to give it a plug.
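
A small sketch of the kind of thing meant here. The Playlist class is a made-up example for illustration, not something taken from the book.

```python
class Playlist:
    """Hypothetical container used only for illustration."""

    def __init__(self, songs):
        self._songs = list(songs)

    def __len__(self):
        # len(playlist) calls this method
        return len(self._songs)

    def __getitem__(self, index):
        # enables playlist[0], slicing, and plain for-loop iteration
        return self._songs[index]

playlist = Playlist(['a', 'b', 'c'])
print(len(playlist))
print(playlist[0])
```

With just these two methods the class already works with len(), indexing, and for loops.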

Eela6
May 25, 2007
Shredded Hen

Zero Gravitas posted:

It is, and it does, thankyou.:facepalm:

Is there a way of solving that kind of system to find x and y given x and x^2 terms?

Or will I need to repeat this process with the derivation and rearranging to isolate my term of interest?

Hard to tell from the snippet of mathematics you've posted, but this looks like partial differential equations and they are pretty much always a nightmare.

Eela6
May 25, 2007
Shredded Hen
I don't know how your data framework works, but this is the most 'Pythonic' way I can think of to handle the problem without needing to import anything. Hope this helps some!

code:
database = ['111111', '111111_222222', '141561', '123123']
def assign_parent_child(database):
    """assign tuples to database of index, unique id, and parent or child"""
    KEYLEN = 6
    # create dictionary of children
    childDict = {}  
    for data in database:
        if len(data) > KEYLEN:
            parent = (data[0:KEYLEN])
            childDict[parent] = data[KEYLEN+1:]
    
    # go through index and reassign in-place
    for index, data in enumerate(database):
        if len(data) > KEYLEN:
            parentChild = 'child'
        elif data in childDict:
            parentChild = 'parent'
        else:
            parentChild = None
            
        database[index] = (index, data, parentChild)
    return database

dataOut = assign_parent_child(database)
print(dataOut)
[(0, '111111', 'parent'), (1, '111111_222222', 'child'), (2, '141561', None), (3, '123123', None)]

Eela6 fucked around with this message at 05:41 on Jul 2, 2016

Eela6
May 25, 2007
Shredded Hen
This seems like a numerical analysis problem.

Note that 0.999999999999999999999999999999 == 1 in standard floating-point arithmetic.

Since arcsin is only defined on -1 <= x <= 1, an argument that rounding pushes even slightly outside that interval gives a nan.

code:
0.999999999999
Out[43]: 0.999999999999

0.99999999999999999
Out[44]: 1.0
Not sure what your code is supposed to do. Remember that angles are modulo 2pi; you shouldn't be able to end up with infinite angles.
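
A quick sketch of the rounding issue using only the standard library. Note that math.asin raises ValueError for an out-of-range argument, where NumPy's arcsin returns nan with a warning. The safe_asin helper is hypothetical, and whether clamping is the right fix depends on where the out-of-range value comes from.

```python
import math

print(0.99999999999999999 == 1.0)   # the literal rounds to exactly 1.0

def safe_asin(x):
    """Clamp x into [-1, 1] before calling asin (hypothetical helper)."""
    return math.asin(max(-1.0, min(1.0, x)))

slightly_over = 1.0 + 2 ** -52      # the smallest double above 1.0

try:
    math.asin(slightly_over)
except ValueError as err:
    print('math.asin refused the input:', err)

print(safe_asin(slightly_over))     # clamped to asin(1.0)
```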

Eela6
May 25, 2007
Shredded Hen
LochNess, glad to see you're enjoying python. It is by far my favorite language.

As a small piece of advice, try and avoid the construct
code:

except Exception:
#code
unless you're planning to re-raise the exception. Catching and handling specific errors is OK, but generally ignoring ALL errors is asking for trouble later on. This is the sort of thing that seems expedient at the time but can make debugging your program down the line basically impossible.
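
A minimal sketch of what catching a specific error looks like. The parse_port function and its default value are invented for illustration.

```python
def parse_port(text, default=8080):
    """Hypothetical helper: parse a port number, falling back to a default."""
    try:
        return int(text)
    except ValueError:
        # Only the failure we expect is handled here. Anything else
        # (e.g. a TypeError from passing None) still propagates.
        return default

print(parse_port('80'))
print(parse_port('not a number'))
```

A bad string falls back to the default, while a genuinely unexpected input such as None still fails loudly instead of being silently swallowed.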

Good luck with your coding!

Eela6
May 25, 2007
Shredded Hen
Awesome. The posters above explained it better than I could have. As a quick note, if you're trying to debug, there's nothing wrong with putting off generic errors long enough to get some info first, e.g.:

Python code:
try:
    ...  # some problematic code
except Exception as err:
    ...  # some additional debugging code that provides you info about the error
    raise err

Later on you'll probably want to factor that out, but it can help you understand what's going on when your code behaves unexpectedly.

Eela6 fucked around with this message at 17:37 on Sep 23, 2016

Eela6
May 25, 2007
Shredded Hen

quote:


I'll also add that I usually re-raise an exception with a bare "raise" rather than "raise err". I think this provides a better traceback, but I was not able to confirm this with a quick Google.
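I was curious about this so I did a brief experiment. They appear to be nearly identical in CPython 3.5: the only difference is that 'raise err' adds one extra traceback entry for the re-raise line itself.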

unnamed input
Python code:
def raise_unnamed_exception():
    def internal_unnamed_raise():
        try:
            raise ValueError("basic exception")
        except Exception as err:
            raise
            
    internal_unnamed_raise()

raise_unnamed_exception()
unnamed output
Python code:
runfile('C:/eela6_online_example/raise_unnamed_exception.py', wdir='C:/eela6_online_example')
Traceback (most recent call last):

  File "<ipython-input-15-83da24a50a8d>", line 1, in <module>
    runfile('C:/eela6_online_example/raise_unnamed_exception.py', wdir='C:/eela6_online_example')

  File "D:\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
    execfile(filename, namespace)

  File "D:\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 89, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/eela6_online_example/raise_unnamed_exception.py", line 17, in <module>
    raise_unnamed_exception()

  File "C:/eela6_online_example/raise_unnamed_exception.py", line 15, in raise_unnamed_exception
    internal_unnamed_raise()

  File "C:/eela6_online_example/raise_unnamed_exception.py", line 11, in internal_unnamed_raise
    raise ValueError("basic exception")

ValueError: basic exception
named input:
Python code:
def raise_named_exception():
    def internal_named_raise():
        try:
            raise ValueError("basic exception")
        except Exception as err:
            raise err
            
    internal_named_raise()

raise_named_exception()
named output:
Python code:
runfile('C:/eela6_online_example/raise_named_exception.py', wdir='C:/eela6_online_example')
Traceback (most recent call last):

  File "<ipython-input-16-280b30bc959d>", line 1, in <module>
    runfile('C:/eela6_online_example/raise_named_exception.py', wdir='C:/eela6_online_example')

  File "D:\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 714, in runfile
    execfile(filename, namespace)

  File "D:\Anaconda3\lib\site-packages\spyderlib\widgets\externalshell\sitecustomize.py", line 89, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/eela6_online_example/raise_named_exception.py", line 17, in <module>
    raise_named_exception()

  File "C:/eela6_online_example/raise_named_exception.py", line 15, in raise_named_exception
    internal_named_raise()

  File "C:/eela6_online_example/raise_named_exception.py", line 13, in internal_named_raise
    raise err

  File "C:/eela6_online_example/raise_named_exception.py", line 11, in internal_named_raise
    raise ValueError("basic exception")

ValueError: basic exception

Eela6 fucked around with this message at 22:07 on Sep 23, 2016

Eela6
May 25, 2007
Shredded Hen
Are you in python 2 or python 3?

Eela6
May 25, 2007
Shredded Hen
Your problem is that python 2 treats strings by default as raw bytes (implicitly, ASCII), but you have unicode somewhere in there.

When you used the default 'with' context manager and writer, you told it to start writing ASCII, then fed it a character it couldn't handle.

The simplest solution seems to be the following:

Python code:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
The hypertext above links to a guy who seemed to have the same problem.

This is a common Python 2 'gotcha' and a bit of a pain.

Edit: If someone more familiar with python 2 and unicode wrangling could weigh in I'd appreciate it, this is a little outside my wheelhouse.

Eela6 fucked around with this message at 18:28 on Sep 27, 2016

Eela6
May 25, 2007
Shredded Hen

Death Zebra posted:

Yep, both of these seem to work. For some reason some of the items in the list were unicode and not the ones I expected. Another takeaway from this is that encode doesn't seem to work on strings that are hybrids of string and unicode.

A common idiom in text processing is the 'Unicode sandwich' . The idea is that whatever your input and output are going to look like, the very first thing you do with your input is convert uniformly to Unicode, and if for whatever reason your output needs to be non-unicode, you do that last.

I've found it helpful to keep in mind.
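
A tiny sketch of the sandwich, written in Python 3 syntax (where str is Unicode and bytes is raw bytes). The shout function is a made-up example.

```python
def shout(raw_bytes, encoding='utf-8'):
    """Hypothetical example: bytes in, text in the middle, bytes out."""
    text = raw_bytes.decode(encoding)   # decode as early as possible
    text = text.upper()                 # all real work happens on text
    return text.encode(encoding)        # encode as late as possible

result = shout('café'.encode('utf-8'))
print(result.decode('utf-8'))
```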

Eela6
May 25, 2007
Shredded Hen
The best way to do this is by maintaining a sort - here, by time. This is an O(n*log(n)) solution.
I like namedtuples for easy syntax.
Python code:
from collections import namedtuple
from datetime import timedelta

database = []  # mock for whatever you have: an iterable of (name, isFrom, to, time) rows
Transfer = namedtuple('Transfer', ['name', 'isFrom', 'to', 'time'])
FrequentFlier = namedtuple('FrequentFlier', ['name', 'departed', 'returned'])
# e.g., james = Transfer('James', 'ICU', '2E', <datetime obj>)

def within_24_hours(departed, returned):
    # both arguments are datetime objects
    return returned - departed < timedelta(hours=24)

transfers = (Transfer(*row) for row in database)
transfers = sorted(transfers, key = lambda x: x.time)

def generate_frequent_fliers():
    seen = {}
    for transfer in transfers:
        patient = transfer.name
        if transfer.isFrom == 'ICU':
            seen[patient] = transfer
            # because of our sort, this is always the most recent transfer out time
        elif patient in seen and transfer.to == 'ICU':
            departed = seen[patient].time
            returned = transfer.time
            if within_24_hours(departed, returned):
                yield FrequentFlier(patient, departed, returned)

frequentFliers = generate_frequent_fliers()
for frequentFlier in frequentFliers:
   print('patient {0} left at {1} but came back at {2}'.format(
         frequentFlier.name, frequentFlier.departed, frequentFlier.returned))

Eela6 fucked around with this message at 04:50 on Oct 6, 2016

Eela6
May 25, 2007
Shredded Hen
OK I'm not going to lie that pandas solution looks slick.

Eela6
May 25, 2007
Shredded Hen
Premature abstraction is as bad as premature optimization, imo. Write them separately. If you can then take a step back and refactor it to something pretty, great!

Eela6
May 25, 2007
Shredded Hen
For some reason I had that icu code thing stuck in my head so I fleshed it out the rest of the way.
Python code:

from collections import namedtuple
import csv
import datetime

""" 
helper functions start here
"""

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'  # assumed timestamp layout; adjust to match your CSV

def import_transfers(filename):
    """ import transfers from CSV, convert to namedtuple, then read into memory """
    Transfer = namedtuple('Transfer', ['name', 'isFrom', 'to', 'time'])
    asDateTime = datetime.datetime.strptime
    with open(filename, newline='') as file:
        database = csv.reader(file)
        transfers = [Transfer(*line[:3], asDateTime(line[3], TIME_FORMAT))
            for line in database]
        return transfers

def generate_frequent_fliers(transfers):
    """ generator function that yields patients who have transferred
    out of ICU, and then return within 24 hours"""
    FrequentFlier = namedtuple('FrequentFlier', ['name', 'left', 'returned'])
    TWENTY_FOUR_HOURS = datetime.timedelta(seconds = 60) * 60 * 24
    seen = {}
    for transfer in transfers:
        patient = transfer.name
        if transfer.isFrom == 'ICU':
            seen[patient] = transfer
            # because of our sort, this is always the most recent transfer out time
        elif patient in seen and transfer.to == 'ICU':
            departed = seen[patient]
            returned = transfer
            if returned.time - departed.time < TWENTY_FOUR_HOURS:
                yield FrequentFlier(patient, departed.time, returned.time)

def export_frequent_fliers(OUTPUT_FILEPATH, frequentFliers):
    """ writes list of frequent_fliers as CSV to output file in the following form:
    NAME, TIME IN, TIME OUT """
    with open(OUTPUT_FILEPATH, 'w', newline='') as file:
        writer = csv.writer(file)
        for frequentFlier in frequentFliers:
            writer.writerow(frequentFlier)

"""
helper functions end here
"""

#this is what actually runs
INPUT_FILEPATH = 'testin.csv'
OUTPUT_FILEPATH = 'testout.csv'
transfers = import_transfers(INPUT_FILEPATH)
transfers = sorted(transfers, key = lambda x: x.time)
frequentFliers = generate_frequent_fliers(transfers)
export_frequent_fliers(OUTPUT_FILEPATH, frequentFliers)

Eela6 fucked around with this message at 22:51 on Oct 6, 2016

Eela6
May 25, 2007
Shredded Hen
Subway masturbator is 100% correct. To further expand on his point, you can also handle positional and keyword arguments with *args and **kwargs!

I made a small example for a strategy game below.
EG:
Python code:
class Unit:
     def __init__(self):
          pass
     

class Footman(Unit):
     def __init__(self, currentHp = 5, power = 5, maxHp = 5):
          self.currentHp = currentHp
          self.maxHp = maxHp
          self.power = power
          self.attackRange = 1
          self.armor = 2

     def __repr__(self):
          return 'Footman(currentHp = {0}, power = {1}, maxHp = {2})'.format(
                          self.currentHp, self.power, self.maxHp)

     def takeDamage(self, damage):
          self.currentHp -= max(damage-self.armor, 0)

def makeUnit(unitType, *args, **kwargs):
     return unitType(*args, **kwargs)

steve = makeUnit(Footman)
superSteve = makeUnit(Footman, 50, 100, maxHp = 250)
print(repr(steve))
steve.takeDamage(4)
print(repr(steve))
print(repr(superSteve))
-- >
code:
Footman(currentHp = 5, power = 5, maxHp = 5)
Footman(currentHp = 3, power = 5, maxHp = 5)
Footman(currentHp = 50, power = 100, maxHp = 250)
For anyone who wants to expand their python knowledge, I highly recommend 'Fluent Python' by Luciano Ramalho, it taught me nearly all my tricks.

Eela6 fucked around with this message at 20:55 on Oct 7, 2016

Eela6
May 25, 2007
Shredded Hen
In enums, self.name is reserved: that's the name you chose, e.g., 'thumb'. You don't need to specify it because you get it 'for free' as part of an Enum. Check it out!

Python code:

from enum import Enum

class Fingers(Enum):
    thumb = 1
    index = 2
    middle = 3
    ring = 4
    pinky = 5
    
    def __init__(self,num):
        self.num = num

print([x for x in Fingers])

for finger in Fingers:
     print(finger.name)
Results
code:
[<Fingers.thumb: 1>, <Fingers.index: 2>, <Fingers.middle: 3>, <Fingers.ring: 4>, <Fingers.pinky: 5>]
thumb
index
middle
ring
pinky

Eela6
May 25, 2007
Shredded Hen
As tempting as it is to use identifiers in Linear A, it's widely considered non-pythonic.

Eela6
May 25, 2007
Shredded Hen
Tons of problems involve people importing data from a CSV where the first row is the LABELS of the CSV.

I always find myself converting these labels into namedtuples so I can access the CSV's info either positionally or by keyword.

So I wrote some small helper functions to easily read and write labeled csvs.

This is not high-level python, but hopefully it's useful or interesting to someone!
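
The helper functions themselves did not survive in this post. What follows is only a sketch of what such helpers might look like, not the originals: the names read_labeled_csv and write_labeled_csv and the default typename 'Row' are assumptions.

```python
import csv
import io
from collections import namedtuple

def read_labeled_csv(file, typename='Row'):
    """Yield one namedtuple per data row, using the first row as field names."""
    reader = csv.reader(file)
    # rename=True swaps labels that are not valid identifiers for _0, _1, ...
    Row = namedtuple(typename, next(reader), rename=True)
    for line in reader:
        yield Row(*line)

def write_labeled_csv(file, rows):
    """Write namedtuples back out, with their field names as the first row."""
    rows = list(rows)
    writer = csv.writer(file)
    if rows:
        writer.writerow(rows[0]._fields)
        writer.writerows(rows)

source = io.StringIO('name,age\nAda,36\nBob,41\n')
rows = list(read_labeled_csv(source))
print(rows[0].name, rows[0][1])   # access by keyword or by position
```

With real files, open them with newline='' as the csv module documentation recommends. All values come back as strings, so any type conversion is left to the caller.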

Eela6
May 25, 2007
Shredded Hen
If you're looking for more information on Python re: concurrency, threads, async, and parallelism, the Python Cookbook (3rd ed), Fluent Python, and Essential Python all cover it pretty well.

Eela6
May 25, 2007
Shredded Hen
With regard to Jose Cuervo's question:

This is how I would do it. Included is a test for your example. This is pure python - not familiar with the library mentioned. (Please indulge my love for namedtuples.)

Python code:
from collections import defaultdict, namedtuple

def count_machine_intervals(machines):
    START = 'START'
    END = 'END'
    Event = namedtuple('Event', ['time', 'type'])
    Interval = namedtuple('Interval', ['start', 'end'])
    def _generate_events():
        for machine in machines:
            for startTime in machine[0]:
                yield Event(startTime, START)
            for endTime in machine[1]:
                yield Event(endTime, END)

    events = sorted(_generate_events(), key = lambda x: x.time)
    count = 0
    previousTime = 0
    machineCountIntervals = defaultdict(list)
    def _update_machine_count_intervals():

        if interval.start == interval.end:
            return
        if machineCountIntervals[count]:
            # try merging intervals if appropriate
            lastSeen = machineCountIntervals[count][-1]
            if lastSeen.end == interval.start:
                machineCountIntervals[count][-1] = Interval(lastSeen.start, interval.end)
                return
        machineCountIntervals[count].append(interval)
        
    for event in events:
        interval = Interval(previousTime, event.time)
        _update_machine_count_intervals()

        if event.type == START:
            count += 1
        else:
            count -= 1
        previousTime = interval.end
    return machineCountIntervals

def test():
    machine_00 = [2, 7], [6, 11]
    machine_01 = [1, 5, 11], [3, 10, 13]
    machine_02 = [2, 6], [5, 8]
    machines = (machine_00, machine_01, machine_02)
    count = count_machine_intervals(machines)
    for key in count:
        print(key, count[key])

test()
output:
code:
0 [Interval(start=0, end=1)]
1 [Interval(start=1, end=2), Interval(start=10, end=13)]
2 [Interval(start=3, end=7), Interval(start=8, end=10)]
3 [Interval(start=2, end=3), Interval(start=7, end=8)]
edit: some small formatting changes.

Eela6 fucked around with this message at 20:46 on Oct 22, 2016

Eela6
May 25, 2007
Shredded Hen
Makes sense! A well-regarded library is always a good way to go. Besides being a little bulky, there is probably some corner case I've missed in my implementation. That said, I enjoyed writing it and figuring out the 'tricks' (like ignoring zero-width intervals and merging intervals of the form n:(a, b), n:(b, c) --> n:(a, c)).

Eela6
May 25, 2007
Shredded Hen
Overager's solution is good but creates a new list. You can use a generator comprehension to lazily yield new entries, which avoids a lot of extra overhead from creating an intermediate list. If you do need an explicit list rather than iterator, you can just call list() on the generator, or use a list comprehension later.

This is accomplished by replacing the brackets [] with parentheses ()

Python code:
def gen_x_of_every_y(iterable, x, y):
    # 'iterable' rather than 'iter', so the built-in iter() is not shadowed
    return (j for (i, j) in enumerate(iterable) if (i%y) < x)
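
A short usage sketch showing the lazy behaviour. The function is repeated so the snippet runs on its own, and the variable name first_two_of_every_five is just for illustration.

```python
def gen_x_of_every_y(iterable, x, y):
    # same generator expression as above, repeated so this runs standalone
    return (j for (i, j) in enumerate(iterable) if (i % y) < x)

first_two_of_every_five = gen_x_of_every_y(range(20), 2, 5)
print(next(first_two_of_every_five))   # 0: items are produced only on demand
print(list(first_two_of_every_five))   # [1, 5, 6, 10, 11, 15, 16]: whatever is left
```

Note that a generator can only be consumed once. After the list() call above it is exhausted.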

Eela6
May 25, 2007
Shredded Hen
Cingulate, the answer is 'it depends'. If there's some name that accurately describes your container's contents, use that. On the other hand, sometimes your container explains itself.

EG:
Python code:
# good - makes it clear why we chose these elements
uniformColors = {'black', 'blue', 'white'}
icecreamFlavors = {'chocolate', 'vanilla', 'strawberry'}

# debatable - probably useful. saves a little bit of brain space when reading over the code.
powersOfFive = {5**n for n in range(10)}

# silly
yesOrNo = {'yes', 'no'}
abcde = set('abcde')
It's generally a matter of best practice to do membership testing with sets. As mentioned above, you can create sets with the set() constructor, or directly with curly braces.

If the number of possible elements is very small, it's OK to test membership in a list or tuple. But it's good to get in the habit of testing membership in sets, because this will save you headaches down the line.

Testing membership in a list or tuple is O(n), whereas testing membership in a set via its hash function is O(1) on average.

Python code:
# equivalent:
membership = {'a', 'b', 'c'}
membership = set(('a', 'b', 'c')) 
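
If you want to see the difference for yourself, here is a rough timing sketch using timeit. The sizes and repeat count are arbitrary choices, and the actual numbers will vary by machine.

```python
import timeit

haystack_list = list(range(100000))
haystack_set = set(haystack_list)
needle = 99999   # worst case for the list: the last element

# Same question asked of both containers; only the lookup cost differs.
list_time = timeit.timeit(lambda: needle in haystack_list, number=200)
set_time = timeit.timeit(lambda: needle in haystack_set, number=200)
print('list: {:.6f}s  set: {:.6f}s'.format(list_time, set_time))
```

For a handful of elements the gap is negligible. It only starts to matter as the container grows or the test runs inside a hot loop.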

Eela6 fucked around with this message at 03:25 on Nov 10, 2016

Eela6
May 25, 2007
Shredded Hen

Thermopyle posted:

The chances are super high that you should use whatever you want because the performance differences are going to be minuscule for your application.

Yes. I tend to emphasize 'always test membership in sets!' because most of the Python programmers I know are academics who are not professional programmers. Teaching them this way avoids the worst-case scenarios.

Eela6
May 25, 2007
Shredded Hen
You want to use numpy / MATLAB style logical indexing.

Be careful with the bitwise operators like '&'. On boolean arrays they do work elementwise (in both numpy and pandas), but they bind more tightly than comparisons, so each comparison needs its own parentheses, e.g. (A > 3) & (A < 20).

numpy has a number of formal logic operators that are what you want, called logical_and, logical_not, logical_xor, etc...

It's easiest to understand given an example. You might already know this, but it's always nice to have a refresher.

IN:
Python code:
import numpy as np

A = np.array([2, 5, 8, 12, 20])
print(A)
between_twenty_and_three = np.logical_and(A>3, A<20)

print(between_twenty_and_three)
A[between_twenty_and_three] = 500

print(A)
OUT:
Python code:
[ 2  5  8 12 20]
[False  True  True  True False]
[  2 500 500 500  20]
Specifically, for your question:
IN:
Python code:
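import numpy as np
import pandas as pd

def update_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    # take New where it exists, otherwise fall back to Old
    # (avoids chained assignment, which pandas warns about)
    df['Current'] = df['New'].fillna(df['Old'])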
    return df
    
def test_update_dataframe():
    df = pd.DataFrame([[1, 2,np.nan],
                   [3, 2,np.nan],
                   [7, np.nan,np.nan], 
                   [np.nan, 8,np.nan]],
                  columns=['Old', 'New', 'Current'])
    print('old')
    print(df)
    df = update_dataframe(df)
    print('new')
    print(df)    
    
test_update_dataframe()
OUT:
code:
old
   Old  New  Current
0  1.0  2.0      NaN
1  3.0  2.0      NaN
2  7.0  NaN      NaN
3  NaN  8.0      NaN
new
   Old  New  Current
0  1.0  2.0      2.0
1  3.0  2.0      2.0
2  7.0  NaN      7.0
3  NaN  8.0      8.0

Eela6 fucked around with this message at 23:11 on Nov 18, 2016

Eela6
May 25, 2007
Shredded Hen
That makes sense! It seems like a reasonable overload of those operators (they mean the first thing I would guess, which is generally a good sign for overloaded operators). I tried to do exactly that when I first switched from MATLAB to numpy.

(Whenever I get heavy into numerics, I occasionally still find myself using container(key) instead of container[key]. At least I've beat zero-based indexing into my skull. )

Eela6
May 25, 2007
Shredded Hen

Jose Cuervo posted:

OK, makes sense. Thanks.

Next question:

I am looking at the code found in the accepted answer to this question

http://stackoverflow.com/questions/32791911/fast-calculation-of-pareto-front-in-python

...

Why not just use the function without passing it in since it is in the same module?
This is a matter of style and there's nothing wrong with doing it the way you have suggested. Some programmers prefer to make as many parameters used by a function explicit as possible. I prefer having functions with as few arguments as possible. The important part is coherency and, where possible, consistency.

Implementation note: According to Luciano Ramalho in Fluent Python, within CPython, explicitly passing the function is ever-so-slightly faster because the passed function is then in the local namespace, which speeds up lookup on the interpreter's end. However, the speed gains are negligible in almost every case.

As a matter of style, if I wanted to make it clear that dominates is within cull, I would make it a subfunction of cull. This inherits the speed gains of explicit passing but, again, these are not important.

i.e,

Python code:
def cull(pts):

    def dominates(row, rowCandidate):
        return all(r >= rc for r, rc in zip(row, rowCandidate))
        
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated
RE: your multiprocessing question. Are you sure you need to use multiprocessing? Have you profiled your code? For a single desktop or laptop, parallelization is generally the 'last gasp' of optimization. At best, the speedup is roughly the number of cores you have (eight times on an eight-core machine, say). By comparison, if you can eliminate cache misses in the 'hot' part of your code you can gain 100-400x speed without needing to muck around with multiprocessing.

Eela6
May 25, 2007
Shredded Hen

Jose Cuervo posted:

I have never thought about structuring code this way, but this makes a lot of sense (especially given how small the function dominates is).

Each simulation replication takes 3-5 seconds to run. The desktop I use at work has the equivalent of 24 cores so running 25 replications in parallel takes 10-11 seconds. Running 25 replications in series takes about 2 minutes. This is why I chose to use multiprocessing.

However, I too am interested in a) what a cache miss is, and b) how I would find and eliminate them in my code.
Congratulations, you have a real use case for parallelism! Carry on :). In response to your questions:

A: I do not have a formal computer science or engineering background (I did math), so I would appreciate if someone with a stronger understanding of hardware could give a better explanation. I will try, though:
In an extremely general sense, a cache miss is when your processor tries to access data in the (extremely fast) L1 cache but can't find it. It then has to look in the next cache level (typically L2), which is larger but slower. If it's not in that cache, it has to look in the next one (typically L3)... and if it's not in any of them, it then has to access RAM (which is very slow in comparison). Just like reading data from a hard drive is very slow compared to RAM, reading data from RAM is slow compared to the cache.


B: This is the subject of a small talk I am going to give at my local python developers' meeting. Once I've finished my research and slides, I will present it here, too! But to give an idea of the basics, you can avoid cache misses by structuring your code to use memory more efficiently. Basically, you want to be able to do your work without constantly loading things into and out of the cache. This means avoiding unnecessary data structures & copying. As an extremely general rule of thumb, every place where you use return to hand back a whole collection when you could yield items one at a time is an opportunity to use memory more efficiently.

Generators, coroutines, and functional-style programming are your friends, and they are often appropriate for simulations. (Not every function should be replaced by a generator, and not every list comprehension should be a generator comprehension. But you would be surprised how many can and should.)
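Here's a minimal sketch of the return-versus-yield idea. The function simulate_step is a made-up stand-in for one unit of real work, and the sizes printed will vary by machine, so treat this as an illustration rather than a benchmark.

```python
import sys


def simulate_step(i):
    """Hypothetical stand-in for one unit of simulation work."""
    return i * i


def results_as_list(n):
    # Builds the entire list in memory before the caller sees anything.
    return [simulate_step(i) for i in range(n)]


def results_as_generator(n):
    # Hands back one result at a time; no intermediate list is stored.
    for i in range(n):
        yield simulate_step(i)


n = 100000
total_from_list = sum(results_as_list(n))
total_from_generator = sum(results_as_generator(n))

# Same answer either way; only the memory held at one time differs.
print(total_from_list == total_from_generator)
print(sys.getsizeof(results_as_list(n)), "bytes for the list object")
print(sys.getsizeof(results_as_generator(n)), "bytes for the generator object")
```

Note that sys.getsizeof only measures the container itself, not the integers inside the list, so the real gap is larger than the printed numbers suggest.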



Even more importantly, you have to know what the slow part of your code is before you bother spending time optimizing. This is what profiling is for. First get your code to work, then find out whether it's actually too slow. If it is, find out why and where. Oftentimes a small subsection of your code takes 95%+ of execution time. If you can optimize THAT part of your code, you are done. It's easy to spend a dozen man-hours 'optimizing' something that takes <1% of runtime. Don't do that.
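As a rough sketch of what that looks like in practice, here is the standard library's cProfile run against a toy program. The functions hot_loop, cheap_setup, and main are invented for the example; in real code you would profile your own entry point instead.

```python
import cProfile
import io
import pstats


def hot_loop(n):
    # Deliberately wasteful: nearly all of the runtime is spent here.
    total = 0
    for i in range(n):
        total += i % 7
    return total


def cheap_setup():
    # Trivial work that is not worth optimizing.
    return list(range(10))


def main():
    cheap_setup()
    return hot_loop(200000)


profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

# Collect the report in a string, sorted by cumulative time per function.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)
print(report.getvalue())
```

The report lists each function with its call count and time, which should make it obvious that hot_loop, not cheap_setup, is the place to spend effort. For a whole script, running python -m cProfile -s cumulative your_script.py gives a similar table without touching the code.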

I am not an expert. There are many great PyCon talks on code profiling & optimization by experts in the field that can give you better instruction than I can. The talk I'm planning to give is basically just cherry-picking bits and pieces from their excellent instruction.

As a starting place, this talk is a little long, but it's an excellent overview of the topic of speed in Python.

https://www.youtube.com/watch?v=JDSGVvMwNM8

Eela6 fucked around with this message at 00:45 on Nov 22, 2016

Eela6
May 25, 2007
Shredded Hen
It is bad style. Whether it's a common style is something you'll have to ask the Coding Horrors thread. :)

Eela6
May 25, 2007
Shredded Hen

Ghost of Reagan Past posted:

Here's a question about what people prefer

Suppose you were looking at a database interface that claimed to be Pythonic, and you wanted to query a table for a value. Which syntax would be preferable, in your mind, for a simple SQL "SELECT * FROM TABLE WHERE COLUMN = VALUE"?


If you think some other option is better, let me know, I know I haven't exhausted all the reasonable options.

Just assume that it also supports DB-API.


I prefer numpy style, i.e.,
Python code:
someValue = database['table', 'column', 'value']
someColumn = database['table', 'column']
someTable = database['table'] 

everyValueInAColumn = database['table', 'column', :]
I really dislike chained brackets. I come from a weird programming background, though, and I've, uh, never actually used SQL, so I might not even understand what that query means :doh:
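For what it's worth, here is a toy sketch of how that indexing style could be wired up with __getitem__. ToyDatabase is entirely hypothetical (it's an in-memory dict of row lists, not a real database library), and the meaning I've assigned to each form of the key is my own guess at sensible behavior.

```python
class ToyDatabase:
    """Hypothetical in-memory stand-in; not a real database library."""

    def __init__(self, tables):
        # tables maps a table name to a list of row dicts.
        self._tables = tables

    def __getitem__(self, key):
        # database['t'] passes a str; database['t', 'c'] passes a tuple.
        if not isinstance(key, tuple):
            key = (key,)
        table = self._tables[key[0]]
        if len(key) == 1:
            return table
        column = key[1]
        if len(key) == 2 or isinstance(key[2], slice):
            # Assumed meaning: every value in the column.
            return [row[column] for row in table]
        # Assumed meaning: SELECT * FROM table WHERE column = value.
        return [row for row in table if row[column] == key[2]]


database = ToyDatabase({
    "users": [
        {"name": "alice", "age": 31},
        {"name": "bob", "age": 27},
    ]
})

matching_rows = database["users", "name", "bob"]
all_names = database["users", "name", :]
whole_table = database["users"]
print(matching_rows, all_names, len(whole_table))
```

The trick is just that Python packs comma-separated subscripts into a single tuple, so one __getitem__ can dispatch on its length.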

Eela6 fucked around with this message at 17:25 on Nov 22, 2016

Eela6
May 25, 2007
Shredded Hen
Fluent Python is the best book on python programming I've ever read, and I take any opportunity to advertise it. I can also vouch for the Python Cookbook: it's not simply recipes, it has decent explanations, too. Make sure you get the latest edition; it's substantially different from the python 2-focused versions. Effective Python is good too. It's very concise and doesn't have as much theory or explanation, so it might not be what you're looking for. The author of that text, Brett Slatkin, has a number of tutorials on YouTube that are worthwhile in their own right.

YouTube is a great resource for python knowledge, actually. Most of the presentations from the last four years or so of PyCons are available. Maybe start with Raymond Hettinger, who is a python core developer and one of the best python educators I've ever seen. 'Beyond PEP 8' is a great piece on 'pythonic style', and his recent talk in Russia about async and parallelism was really good too, if you're looking for a place to start with that.

If you don't have a formal CS background and like theory, you can't do better than the big white book, Introduction to Algorithms. A real programmer should never spend too long without reviewing the core material.

If you aren't familiar with these libraries, get familiar:
collections, functools, itertools

Eela6 fucked around with this message at 23:30 on Dec 15, 2016

Eela6
May 25, 2007
Shredded Hen

Boris Galerkin posted:

Got any examples for this numpy stuff?

https://www.youtube.com/watch?v=EEUXKG97YRw

Eela6
May 25, 2007
Shredded Hen

Suspicious Dish posted:

I mean, it's nice, but at the same time, now Python has three string formatting options without any recommendations or standards for which ones you should use where.

I fully intend to use f-strings exclusively going forward. I like the syntax a lot and I think they are easiest to understand. They work "pretty much how you would expect", which is generally a sign of good code. I don't think it's a ton of extra mental overhead for python programmers to understand.

That said, I get that this is messier than it could be. It's not exactly living up to "there should be one way to do it".
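To make the comparison concrete, here are the three options side by side, producing the same string. The variable names are just for illustration, and the last line needs Python 3.6 or later.

```python
name = "Guido"
version = 3.6

# 1. printf-style % formatting
old_style = "Hello %s, welcome to Python %.1f" % (name, version)

# 2. str.format()
format_method = "Hello {}, welcome to Python {:.1f}".format(name, version)

# 3. f-string: the expressions sit right where they are used
f_string = f"Hello {name}, welcome to Python {version:.1f}"

print(f_string)
```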

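With the impending release of 3.6, I am most excited for the new dict implementation. Dicts are now more compact and preserve insertion order (an implementation detail of CPython 3.6, not a language guarantee), so basically all dicts behave like faster OrderedDicts. Since classes and functions use dictionaries internally, this opens up a lot of fun areas for metaprogramming!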
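A quick sketch of the ordering behavior, assuming CPython 3.6 or later (on older interpreters the plain dict may come back in a different order). The show_kwargs helper is made up for the example.

```python
from collections import OrderedDict

plain = {}
ordered = OrderedDict()
for key in ["zebra", "apple", "mango"]:
    plain[key] = len(key)
    ordered[key] = len(key)


def show_kwargs(**kwargs):
    # kwargs is an ordinary dict, so it inherits the same ordering.
    return list(kwargs)


# Both iterate in insertion order on CPython 3.6+.
print(list(plain) == list(ordered))
print(show_kwargs(first=1, second=2, third=3))
```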

Eela6 fucked around with this message at 06:44 on Dec 18, 2016

Eela6
May 25, 2007
Shredded Hen

Thermopyle posted:

I mean, people are going to write unreadable code no matter what tools you give them. Here's a couple of your choices:

1. Leave string formatting as it is and bad coders will make unreadable messes of their strings in the various ways .format() lets you do and good coders will have strings that look ok but have a lot of repetition.
2. Introduce f-strings and bad coders will make unreadable messes right in the middle of their strings and good coders will have strings that look great in a succinct manner.

Personally, I prefer option #2.

Python gives you a lot of power and freedom. It lets you do things the way you want to. The good side of this is that well-written Python is some of the cleanest and easiest to understand code out there. The bad side of this is that there is no floor on just how bad Python coding can get.

There are plenty of languages that attempt to raise the floor by adopting a lot of restrictions. Golang strikes me as an extreme example. Java doesn't have operator overloading for the same reason: abuses of operator overloading can make code almost impossible to understand. I think that's a valid approach for a language, but I like the way python is. It makes well-written libraries a pleasure to use.

For example, let's take numpy. An expression like
Python code:
A[A < 20] = 0
is pretty easy to understand, but would be a pain to express in Java, because it uses not one but two overloaded operators: __lt__ (for the comparison) and __setitem__ (for the masked assignment).

[If you're reading the thread and not familiar with numpy or MATLAB, this means 'set every item in A that's less than 20 to zero']
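If you want to see the mechanics without installing numpy, here's a toy class that supports the same expression. ToyArray is hypothetical and one-dimensional, and it is certainly not how numpy implements this; it only shows which special methods Python calls for the comparison and the assignment.

```python
class ToyArray:
    """Hypothetical 1-D stand-in for a numpy array; not numpy's implementation."""

    def __init__(self, values):
        self.values = list(values)

    def __lt__(self, threshold):
        # A < 20 returns a boolean mask instead of a single bool.
        return [v < threshold for v in self.values]

    def __setitem__(self, mask, new_value):
        # A[mask] = x overwrites every position where the mask is True.
        self.values = [new_value if hit else v
                       for v, hit in zip(self.values, mask)]


A = ToyArray([5, 25, 19, 40, 20])
A[A < 20] = 0
print(A.values)  # [0, 25, 0, 40, 20]
```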

Obviously a new way to format strings is not quite the same as operator overloading, but I think the same basic ideas apply.

I've always enjoyed this aphorism and think it applies: "Python is a language for consenting adults."

Eela6 fucked around with this message at 22:32 on Dec 18, 2016

Eela6
May 25, 2007
Shredded Hen
Just a guess, but the class constructor might be the wrong approach. I'm on my phone, but try


code:
OrderedDict: [str,  Numbers.Integral] 


This might be barking completely up the wrong tree, though.

Eela6
May 25, 2007
Shredded Hen
I am unreasonably excited for python 3.6. f-strings! Asynchronous generators and comprehensions! Underscores in numeric literals! Preservation of class attribute definition order and of **kwargs order! Low-level optimizations I don't fully understand!

Python is a great language that's getting better all the time. It's certainly not perfect and it's not appropriate for everything, but it's so good most of the time. I enjoy programming in python and the python community and I hope you all do too.


Eela6
May 25, 2007
Shredded Hen

Master_Odin posted:

Are f-strings going to be back ported in some way to older versions of Python or is it 3.6 only?

As far as I'm aware, 3.6 and above only.
