Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!

I'm trying to figure out the least-dumb way to assign a service tech based on a map of highlighted states. Each rep has one or more states that they are responsible for, and I want to email the responsible tech when a ticket comes in from a store in their state. I'm new to Python.

code:
    states = ['TX', 'OK']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'jim'

    states = ['AR', 'LA', 'MS', 'AL', 'GA', 'TN']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'mark'

    states = ['FL']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'dave'

    states = ['MI', 'IL', 'IN', 'OH', 'NC', 'SC']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'tom'

    states = ['KY', 'WV', 'VA', 'MD', 'DE', 'NJ', 'CT', 'RI', 'MA', 'ME', 'NH', 'VT',
        'NY', 'PA']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'bill'

    states = ['WA', 'OR', 'ID', 'MT', 'WY', 'UT', 'CO', 'NM', 'ND', 'SD', 'NE', 'KS',
        'MN', 'IA', 'MO', 'WI']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'joe'
Would some kind of 2D array work better? I'm trying not to repeat myself, and to make it easy to change techs/states later down the road.


Jeb Bush 2012
Apr 4, 2007

A mathematician, like a painter or poet, is a maker of patterns. If his patterns are more permanent than theirs, it is because they are made with ideas.

Bob Morales posted:

I'm trying to figure out the least-dumb way to assign a service tech based on a map of highlighted states. Each rep has one or more states that they are responsible for, and I want to email the responsible tech when a ticket comes in from a store in their state. I'm new to Python.

code:
    states = ['TX', 'OK']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'jim'

    states = ['AR', 'LA', 'MS', 'AL', 'GA', 'TN']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'mark'

    states = ['FL']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'dave'

    states = ['MI', 'IL', 'IN', 'OH', 'NC', 'SC']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'tom'

    states = ['KY', 'WV', 'VA', 'MD', 'DE', 'NJ', 'CT', 'RI', 'MA', 'ME', 'NH', 'VT',
        'NY', 'PA']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'bill'

    states = ['WA', 'OR', 'ID', 'MT', 'WY', 'UT', 'CO', 'NM', 'ND', 'SD', 'NE', 'KS',
        'MN', 'IA', 'MO', 'WI']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'joe'
Would some kind of 2D array work better? I'm trying not to repeat myself, and to make it easy to change techs/states later down the road.

You probably want to maintain a dictionary mapping states to the techs that are responsible for them.

Jose Cuervo
Aug 25, 2004

Bob Morales posted:

I'm trying to figure out the least-dumb way to assign a service tech based on a map of highlighted states. Each rep has one or more states that they are responsible for, and I want to email the responsible tech when a ticket comes in from a store in their state. I'm new to Python.

code:
    states = ['TX', 'OK']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'jim'

    states = ['AR', 'LA', 'MS', 'AL', 'GA', 'TN']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'mark'

    states = ['FL']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'dave'

    states = ['MI', 'IL', 'IN', 'OH', 'NC', 'SC']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'tom'

    states = ['KY', 'WV', 'VA', 'MD', 'DE', 'NJ', 'CT', 'RI', 'MA', 'ME', 'NH', 'VT',
        'NY', 'PA']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'bill'

    states = ['WA', 'OR', 'ID', 'MT', 'WY', 'UT', 'CO', 'NM', 'ND', 'SD', 'NE', 'KS',
        'MN', 'IA', 'MO', 'WI']
    if d.get('State Abbreviation') in states:
      responsibleTech = 'joe'
Would some kind of 2D array work better? I'm trying not to repeat myself, and to make it easy to change techs/states later down the road.

You could create a dictionary with each state mapped to the tech responsible:
Python code:
state_to_tech = {'TX': 'jim',
                 'OK': 'jim',
                 'AR': 'mark',
                 # ... and so on for each state
                 }
and then use the following line
Python code:
responsible_tech = state_to_tech[d.get('State Abbreviation')]
This only works if each state is taken care of by a single tech.

Jeb Bush 2012
Apr 4, 2007

A mathematician, like a painter or poet, is a maker of patterns. If his patterns are more permanent than theirs, it is because they are made with ideas.

Jose Cuervo posted:

This only works if each state is taken care of by a single tech.

If there are multiple techs per state, just make the dict values lists, and then send an e-mail to each of them (or the first not occupied, or whatever your desired behaviour is).
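
For illustration, a minimal sketch of that shape (not code from the thread; send_email stands in for whatever notification mechanism is actually used):
Python code:
state_to_techs = {
    'TX': ['jim'],
    'OK': ['jim'],
    'AR': ['mark', 'sue'],   # example of a state covered by two techs
}

def send_email(tech):
    print('emailing %s' % tech)   # placeholder for the real notification

def notify(state_abbreviation):
    # Email every tech responsible for the state (empty list if unknown).
    for tech in state_to_techs.get(state_abbreviation, []):
        send_email(tech)

notify('AR')   # emails mark and sue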

Jose Cuervo
Aug 25, 2004
I have a 'Factory' class that creates instances of the 'Tanker' class. I want to keep track of the order in which tanker instances are created by all instances of the 'Factory' class. For example if there are two factories, and the first factory creates one tanker at time 4 and one tanker at time 10, and the second factory creates one tanker at time 7, I want the tanker created at time 4 to have the ID 1, tanker created at time 7 to have the ID 2, and the tanker created at time 10 to have the ID 3.

It seems like creating a global variable would accomplish this, but is there a better way to keep track of this? Perhaps a variable in the factory class that is common to all instances?

EDIT: Turns out it is called a static variable: http://stackoverflow.com/questions/68645/static-class-variables-in-python
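
For illustration, a minimal sketch of the class-variable ("static") approach; the names are made up rather than taken from the actual simulation code:
Python code:
class Tanker(object):
    def __init__(self, tanker_id, created_at):
        self.tanker_id = tanker_id
        self.created_at = created_at

class Factory(object):
    next_id = 1   # shared by all Factory instances

    def create_tanker(self, time):
        tanker = Tanker(Factory.next_id, time)
        Factory.next_id += 1
        return tanker

f1, f2 = Factory(), Factory()
a = f1.create_tanker(4)    # gets ID 1
b = f2.create_tanker(7)    # gets ID 2
c = f1.create_tanker(10)   # gets ID 3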

Jose Cuervo fucked around with this message at 21:36 on Mar 20, 2015

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

I'm starting a little project writing a wrapper for a web REST(ish) API. Anyone have any examples of any wrappers out there that they like in particular that I can steal ideas from?

Particularly I'm looking for wrappers that do a good job of any of these (and more): python-izing the API, making good objects out of API results, maps saving objects to POSTing objects in a transparent-as-possible manner, caching related objects, rate limiting, all of that stuff.

Bob Morales
Aug 18, 2006


Just wear the fucking mask, Bob

I don't care how many people I probably infected with COVID-19 while refusing to wear a mask, my comfort is far more important than the health and safety of everyone around me!

Jose Cuervo posted:

You could create a dictionary with each state mapped to the tech responsible:
Python code:
state_to_tech = {'TX': 'jim',
                 'OK': 'jim',
                 'AR': 'mark',
                 # ... and so on for each state
                 }
and then use the following line
Python code:
responsible_tech = state_to_tech[d.get('State Abbreviation')]
This only works if each state is taken care of by a single tech.

We use 1 tech per state right now.

I thought about that, but then if we fire a tech and hire a new one I have to change the tech assigned to like, 15 states in some cases.

Cingulate
Oct 23, 2012

by Fluffdaddy

SurgicalOntologist posted:

The only thing "wrong" with a functional style is it's not mainstream Python, so your typical Python programmer is not likely to have encountered it and will be confused.

Cingulate, I don't know about identifying a style (it's not really clear cut, even my version isn't functional in the more academic sense), but it's worth pointing out that there's no list comprehension, but a generator expression. You shouldn't have 75% list comprehensions, but a lot of generator expressions/comprehensions of all kinds is usually considered the best way to write Python code.
Ah thanks.

BigRedDot posted:

I haven't used map or filter in years. List comprehensions and generator expressions
  • are declarative (almost always the best option when it is available)
  • are non-trivially more efficient (dispatch immediately to Python C API)
As for the style, it's sometimes referred to as "declarative". In the manner of Prolog, the idea is to tell the computer, directly, what you want, rather than a bunch of steps for how to do it. The implication being that you always have to know what you want, but that giving the steps requires converting what you want into the sequence of steps, and that eliminating this conversion process leaves less room for errors.
Multiprocessing only does .map though, there is no "parallel list comprehension" right?
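
For reference, a short sketch of Pool.map in that role -- it covers the ground a "parallel list comprehension" would, but the work has to live in a named function (square here is just an example):
Python code:
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(4)
    results = pool.map(square, range(10))   # parallel counterpart of [square(x) for x in range(10)]
    pool.close()
    pool.join()
    print(results)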

Cingulate
Oct 23, 2012

by Fluffdaddy

Bob Morales posted:

We use 1 tech per state right now.

I thought about that, but then if we fire a tech and hire a new one I have to change the tech assigned to like, 15 states in some cases.
You could do that in a single line though.

[state_to_tech.__setitem__(state, new_technician) for state, technician in zip(state_to_tech.keys(), state_to_tech.values()) if state_to_tech[state] == old_technician]
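
For the record, a dict comprehension does the same reassignment without calling __setitem__ inside a list comprehension (same hypothetical old_technician/new_technician names as the line above):
Python code:
state_to_tech = {state: (new_technician if tech == old_technician else tech)
                 for state, tech in state_to_tech.items()}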

vikingstrike
Sep 23, 2007

whats happening, captain
You could always use a DataFrame and have name, email, and state columns. You can easily filter based on state or name to slice it however you need. If you start assigning techs to multiple states you can create variables as needed or just add additional rows.
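
A rough sketch of that layout with pandas (the column names and addresses here are just examples):
Python code:
import pandas as pd

techs = pd.DataFrame({
    'name':  ['jim', 'jim', 'mark'],
    'email': ['jim@example.com', 'jim@example.com', 'mark@example.com'],
    'state': ['TX', 'OK', 'AR'],
})

# Who covers Texas?
print(techs.loc[techs['state'] == 'TX', 'email'].tolist())   # ['jim@example.com']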

duck monster
Dec 15, 2004

Or invert the problem

tech = {}
tech['jim'] = ('WA', 'CA', 'TX')   # etc.
tech['bill'] = ('AK', 'AZ')        # etc.

Then just iterate over the techs looking for the state.
code:
for guy, states in tech.items():
    if state in states:
        email(guy)

duck monster fucked around with this message at 00:05 on Mar 21, 2015

underage at the vape shop
May 11, 2011

by Cyrano4747
I'm doing a python course and I'm completely stumped on how to do a question in an assignment.

It's a beginner course and they've given us some predefined stuff to use.

This is an example of what it's supposed to look like:


This is what I've currently managed to do:



This is my code:
code:
def display_maxs(stations, dates, data, start_date, end_date):
    sdpos = dates.index(start_date)
    edpos = dates.index(end_date)
    i=0
    k= edpos-sdpos
    display_stations(stations, 'Date')
    while i<=k:
        print(dates[sdpos+i])
        for z in range(len(data)):
            display_temp(data[z][i])
        i=i+1
I have to use that display_temp function; it's a predefined one they want us to use:

code:
def display_temp(temp):
    """Display the temperature in an appropriate format. Dashes are displayed
    for invalid temperatures.

    display_temp(float) -> None
    """
    if temp == UNKNOWN_TEMP:
        print("{:<15}".format(" ----"), end='')
    else:
        print("{:<15}".format("{:5.1f}".format(temp)), end='')
How do I fix my mess to match their stuff? The display_stations call is what gives you the top line.

E: I fixed the numbers being wrong after I posted this, instead of
display_temp(data[z][i])
it should be
display_temp(data[z][sdpos+i])

E: Somehow I got it to work, kinda, it looks like this now:

underage at the vape shop fucked around with this message at 12:11 on Mar 21, 2015

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

You probably want to have a look at string formatting.
(If it's a bit much to take in, have a look at the examples -- there are some using the exact same alignment trick as the method you're calling.)

Basically the output is tabulated: each temperature their method prints is padded to a fixed width, so the next one starts in the right place. Your issue is that your dates aren't padded, so the first temperature is printed too far to the left, and all the rest are too. Look at their printing code and steal it!
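
To make the padding concrete, here is a tiny standalone example in the same style as the provided display_temp (the 15-character column width is taken from that function; the sample row is made up):
Python code:
row_date = '2015-03-21'
row_temps = [12.3, None, 18.0]          # None stands in for an unknown temperature

print("{:<15}".format(row_date), end='')        # pad the date column too
for temp in row_temps:
    if temp is None:
        print("{:<15}".format(" ----"), end='')
    else:
        print("{:<15}".format("{:5.1f}".format(temp)), end='')
print()                                          # finish the row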

underage at the vape shop
May 11, 2011

by Cyrano4747


It worked, thanks!

I just need to finish making the interact function that lets the user put in commands (aka calling the functions I just wrote), then the comments, and then I'll have done this whole assignment in about 6 hours.

Dominoes
Sep 20, 2007

Hey dudes, Numba's pretty cool. I mean fast.

Would there be any interest in a pypi module that contains some basic numerical functions implemented with it? I.e. mean, sum, std, variance, bisect, interp, etc.

Maybe call the module 'fast', and you could just call 'fast.sum()', 'fast.interp()', etc., and it would do the same as the standard/math/numpy libraries, but faster? Seems like it could be a drop-in replacement.

Background: Numba's a Continuum Analytics module that lets you write code that runs about as fast as C. You write a Python function with a Numba decorator, but you have to give up Python's niceties in that function. It's still super convenient, because once you've made a Numba function, you can call it from full-up Python.

That said, one caveat I need to confirm: I think some of the benefit of Numba comes from joining as much as you can get away with into a single loop, and if you split the code into standalone functions, you may lose some of the performance benefit.

E.g. a Pearson correlation function that runs way faster than the scipy one:
Python code:
@numba.jit
def corr_opt(data1, data2):
    """Optimized pearson correlation test, similar to scipy.stats.pearsonr."""
    M = data1.size

    sum1 = 0.
    sum2 = 0.
    for i in range(M):
        sum1 += data1[i]
        sum2 += data2[i]
    mean1 = sum1 / M
    mean2 = sum2 / M

    var_sum1 = 0.
    var_sum2 = 0.
    cross_sum = 0.
    for i in range(M):
        var_sum1 += (data1[i] - mean1) ** 2
        var_sum2 += (data2[i] - mean2) ** 2
        cross_sum += (data1[i] * data2[i])

    std1 = (var_sum1 / M) ** .5
    std2 = (var_sum2 / M) ** .5
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)
I lumped several things together, and I feel like breaking it up into functions might slow it down, but haven't tested yet. I could still provide multiple levels of funcs, ie providing a correlation func that looks like the above, and still have separate mean/std funcs without calling them from it.

Dominoes fucked around with this message at 18:34 on Mar 22, 2015

Dominoes
Sep 20, 2007

Did some homework: It looks like splitting loops does cause slowdowns (Although still much faster than equivalent numpy/scipy funcs), but you can still split up some of the funcs. Ie:

These basic funcs that work together can't really be made to share loops, so there's no benefit to mushing them together:
Python code:
@numba.jit(nopython=True)
def sum_(data):
    sum__ = 0.
    for i in range(data.size):
        sum__ += data[i]
    return sum__


@numba.jit(nopython=True)
def mean(data):
    M = data.size
    sum__ = sum_(data)
    return sum__ / M


@numba.jit(nopython=True)
def var(data):
    M = data.size

    mean_ = mean(data)
    var_sum = 0.
    for i in range(M):
        var_sum += (data[i] - mean_) ** 2
    return var_sum / M


@numba.jit(nopython=True)
def std(data):
    return var(data) ** .5
But this alternative to the pearson implementation I posted above that uses those funcs is more than twice as slow:


Python code:
@numba.jit(nopython=True)
def corr_broken(data1, data2):
    """Optimized pearson correlation test, similar to scipy.stats.pearsonr."""
    M = data1.size

    mean1 = mean(data1)
    mean2 = mean(data2)

    std1 = std(data1)
    std2 = std(data2)

    cross_sum = 0.
    for i in range(M):
        cross_sum += (data1[i] * data2[i])
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)

nonathlon
Jul 9, 2004
And yet, somehow, now it's my fault ...
A bit of an odd situation: I'm developing a website with Pelican and testing it locally using the builtin webserver via SimpleHTTPServer. This works fine on one machine, but when I do it on my laptop the webserver repeatedly refuses to launch. Python swallowed the error, but it turned out to be a socket error: "Address already in use". netstat doesn't show the address in use, which is weird. When I change the port, it works fine the first time but on subsequent runs reverts to the error. So it's like the port isn't being released, but over quite a long time period. If I run the line that calls the server separately (python -m pelican.server 8000), things seem to be fine.

Python 2.7.9, OSX 10.9, Pelican 3.5 for what it's worth.
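
That symptom (the port works once, then reports "Address already in use" for a while) is commonly a socket left in TIME_WAIT. A sketch of the usual Python 2 workaround is below -- setting allow_reuse_address, which turns on SO_REUSEADDR -- though whether it can be applied through Pelican's own server wrapper is an assumption:
Python code:
# Python 2 sketch: serve the current directory with SO_REUSEADDR enabled.
import SocketServer
import SimpleHTTPServer

SocketServer.TCPServer.allow_reuse_address = True   # allow rebinding a port stuck in TIME_WAIT
httpd = SocketServer.TCPServer(("", 8000), SimpleHTTPServer.SimpleHTTPRequestHandler)
httpd.serve_forever()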

Dominoes
Sep 20, 2007

setuptools question.

Let's say I have a module called 'quick', and ran python setup.py develop on it. Inside the main directory is a subdirectory called quick, and a file in it called funcs.py.

funcs.py has some functions called 'mean', 'var', etc.

I want to use this like so:

Python code:
import quick

quick.var()
No worky. I have to do this currently:

Python code:
import quick.funcs

quick.funcs.var()
I looked at some setup.pys on github and found something like this in requests':
from .funcs import sum_ as sum, mean, var
No worky. Any idea?

SurgicalOntologist
Jun 17, 2004


I never really had a need to speed up my numerical code so I haven't looked much into numba or other optimization techniques and I can't address your questions. I just wanted to point out that it is possible to compute variance in a single pass. That could improve things. Also, be sure to test with a variety of data sizes, as that may affect which styles are faster.
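
For illustration, a minimal single-pass sketch (Welford's online algorithm, not code from the thread) that produces the mean and the population variance in one loop:
Python code:
def mean_and_var(data):
    """One-pass (Welford) mean and population variance."""
    n = 0
    mean = 0.0
    m2 = 0.0          # running sum of squared deviations from the current mean
    for x in data:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / n

print(mean_and_var([1.0, 2.0, 3.0, 4.0]))   # (2.5, 1.25)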

Dominoes posted:

setuptools question.

Let's say I have a module called 'quick', and ran python setup.py develop on it. Inside the main directory is a subdirectory called quick, and a file in it called funcs.py.

funcs.py has some functions called 'mean', 'var', etc.

I want to use this like so:

Python code:
import quick

quick.var()
No worky. I have to do this currently:

Python code:
import quick.funcs

quick.funcs.var()
I looked at some setup.pys on github and found something like this in requests':
from .funcs import sum_ as sum, mean, var
No worky. Any idea?


Put any module-level imports that you want to be accessible directly at the library level in __init__.py.

So, quick/__init__.py:
Python code:
from quick.funcs import var
Many tutorials/examples will instead recommend from .funcs import var, but my understanding is that since Python 3, relative imports are not recommended.

SurgicalOntologist fucked around with this message at 19:26 on Mar 22, 2015

Dominoes
Sep 20, 2007

Thanks bro! I did notice that the speed improvement over numpy/scipy gets lower with higher .sizes, so that makes sense. E.g. something might have 100x performance with sample size 100, but only 2x when you add a few more zeros. I'll see if I can figure out a single-pass variance. (I assume this means calculating the mean while you're calculating the variance?)

Hey turns out my prob with imports was that I had double-extensioned the __init__.py. Changed the .funcs to quick.funcs based on your advice.

Also if you have no tolerance, caffeine is a hell of a drug.

Dominoes fucked around with this message at 19:30 on Mar 22, 2015

QuarkJets
Sep 8, 2008

Dominoes posted:

Did some homework: It looks like splitting loops does cause slowdowns (Although still much faster than equivalent numpy/scipy funcs), but you can still split up some of the funcs. Ie:

These basic funcs that work together can't really be made to share loops, so there's no benefit to mushing them together:
Python code:
@numba.jit(nopython=True)
def sum_(data):
    sum__ = 0.
    for i in range(data.size):
        sum__ += data[i]
    return sum__


@numba.jit(nopython=True)
def mean(data):
    M = data.size
    sum__ = sum_(data)
    return sum__ / M


@numba.jit(nopython=True)
def var(data):
    M = data.size

    mean_ = mean(data)
    var_sum = 0.
    for i in range(M):
        var_sum += (data[i] - mean_) ** 2
    return var_sum / M


@numba.jit(nopython=True)
def std(data):
    return var(data) ** .5
But this alternative to the pearson implementation I posted above that uses those funcs is more than twice as slow:


Python code:
@numba.jit(nopython=True)
def corr_broken(data1, data2):
    """Optimized pearson correlation test, similar to scipy.stats.pearsonr."""
    M = data1.size

    mean1 = mean(data1)
    mean2 = mean(data2)

    std1 = std(data1)
    std2 = std(data2)

    cross_sum = 0.
    for i in range(M):
        cross_sum += (data1[i] * data2[i])
    cross_mean = cross_sum / M

    return (cross_mean - mean1 * mean2) / (std1 * std2)

Well, yeah. Think about the operations that you're performing here. In the first example that you posted (the one not quoted here), you calculated the mean, then used that in your variance calculation, and then used those variances to immediately spit out standard deviations. And since you were merging loops, you only actually looped over the data twice (once for the mean, once for the variance).

In this new example, your functions are unaware of the previous results, and the loops are split. So you calculated the mean, and then in calculating the std you wound up recalculating the mean. That's three loops, plus extra function overhead, plus extra compilation overhead since each of those functions gets compiled separately the first time that they're called. Naturally, this runs slower than the two-loop implementation that is all encapsulated in a single function, for many of the same reasons that the numpy implementation is also slower.

Have you actually compared the numpy mean() and sum() method against your compiled mean() and sum() functions? I would have guessed that the numpy functions are about the same speed in the single-array case, since it's using compiled Fortran (faster) with a bunch of extra features (slower). But once you start doing more complex things, like calculating correlations, numpy's tendency to create temporary arrays would bite you in the rear end and the numba implementation would become faster.

Dominoes
Sep 20, 2007

Your explanation of why the first correlation example was faster makes sense.

QuarkJets posted:

Have you actually compared the numpy mean() and sum() method against your compiled mean() and sum() functions? I would have guessed that the numpy functions are about the same speed in the single-array case, since it's using compiled Fortran (faster) with a bunch of extra features (slower). But once you start doing more complex things, like calculating correlations, numpy's tendency to create temporary arrays would bite you in the rear end and the numba implementation would become faster.

Yep. Performance increase is inversely-proportional to data size for the basic funcs.
Python code:
x = np.random.random(100)

%timeit np.sum(x)
1000000 loops, best of 3: 1.58 µs per loop

%timeit quick.sum(x)
1000000 loops, best of 3: 223 ns per loop

%timeit np.mean(x)
100000 loops, best of 3: 5.52 µs per loop

%timeit quick.mean(x)
1000000 loops, best of 3: 230 ns per loop
Python code:
xbig = np.random.random(100000000)

%timeit np.sum(xbig)
10 loops, best of 3: 85.6 ms per loop

%timeit quick.sum(xbig)
10 loops, best of 3: 75.2 ms per loop

%timeit np.mean(xbig)
10 loops, best of 3: 85.1 ms per loop

%timeit quick.mean(xbig)
10 loops, best of 3: 76 ms per loop
You're also right about performance on complex functions; the increase is more dramatic.
Python code:
x = np.random.random(100)

y = np.random.random(100)

%timeit scipy.stats.pearsonr(x, y)
10000 loops, best of 3: 34.9 µs per loop

%timeit quick.corr(x, y)
1000000 loops, best of 3: 419 ns per loop
Python code:
xbig = np.random.random(100000000)

ybig = np.random.random(100000000)

%timeit scipy.stats.pearsonr(xbig, ybig)
1 loops, best of 3: 1.78 s per loop

%timeit quick.corr(xbig, ybig)
1 loops, best of 3: 238 ms per loop

Dominoes fucked around with this message at 20:12 on Mar 22, 2015

SurgicalOntologist
Jun 17, 2004

This brings up an interesting point. We often have situations where an intermediate result of one algorithm is useful on its own. Variance and mean is an obvious, and simple, example, but I can recall encountering the same issue before with more complex algorithms. Is there a standard for dealing with this?

It seems that if someone was really concerned about speed, and needed both the variance and the mean, they would end up implementing it themselves rather than use a library version which will inevitably compute the mean twice. In this day and age I can't see anyone re-implementing mean and variance as anything except wasted effort. What can library authors do about this?

I see two options, neither of which are very attractive (and neither of which I've seen done). You could have the variance function spit out the mean as well. Which would work but would create some awkward code. Alternatively, the variance function could take the mean as an optional input, which if passed causes the mean computation step to be skipped. This would also work but relies on an invariant that could break down.
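
For illustration, a sketch of the second option; the function name and signature are made up rather than taken from any particular library:
Python code:
def variance(data, mean=None):
    """Population variance; accepts a precomputed mean so it isn't computed twice."""
    if mean is None:
        mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / len(data)

values = [1.0, 2.0, 3.0, 4.0]
m = sum(values) / len(values)     # the caller already needs the mean...
v = variance(values, mean=m)      # ...so pass it in and skip the extra pass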

Is there a more general term for this problem in computer science? Basically the phenomenon whereby clean encapsulation is at odds with doing each calculation only once.

nonathlon
Jul 9, 2004
And yet, somehow, now it's my fault ...

SurgicalOntologist posted:

Many tutorials/examples will instead recommend from .funcs import var, but my understanding is that since Python 3, relative imports are not recommended.

Kinda glad to hear it: I've written a lot of Python code, but relative imports have always given me trouble. I just avoid them by using the absolute path all the time.

QuarkJets
Sep 8, 2008

SurgicalOntologist posted:

It seems that if someone was really concerned about speed, and needed both the variance and the mean, they would end up implementing it themselves rather than use a library version which will inevitably compute the mean twice. In this day and age I can't see anyone re-implementing mean and variance as anything except wasted effort. What can library authors do about this?

This is probably a topic that should be brought over to the scientific computing thread, where it's more relevant, but:

You're absolutely right, in computation-heavy fields it's common to reimplement basic library features for the sake of speed. Even in very fast computational languages (C and Fortran) it is common for computational experts to reimplement basic functions instead of using standard functions, much as Dominoes has done here. Even a speed increase of a few milliseconds per function call can have huge implications for code that normally takes days to run on a large supercomputer.

And I don't think that there's anything that library maintainers can really do about this in a lot of cases. On one hand, you want features that are as fast as possible. On the other hand, you also want features that are user-friendly and adaptable, which sometimes comes at the expense of speed. Consider FFTW: the core code is extremely fast, but people who have never used FFTW before may need some time in order to use it effectively. Reimplementations of FFTW (the fft features of MATLAB or Scipy/Numpy) provide a trivial-to-use interface, but using the default parameters usually results in a slower FFT. Or, as we've seen here, the numpy mean() function is compiled but runs slower than the simple mean() function that someone can code up in a few seconds. The numpy mean function takes 4 optional arguments that are useful in a variety of specific use cases but might not be necessary in most cases.

e: I guess that you could write simple specific-case functions and just make their uses very specific ("mean_float32 returns the float32 flattened mean of an array of float32 values", "mean_int64 returns the int64 flattened mean of an array of int64 values", etc). That's appealing to people solving hard computational problems due to the speed involved, but it's not very appealing to the computer scientists that create standard libraries (for maintainability and user friendliness reasons) or to the average user of these libraries. Going back to FFTW, that's why FFTW has so many separate but almost-equivalent functions (this one takes a complex array and modifies it in place, this one takes a real array and returns a complex array, etc). The maintainers want a library that is optimally fast, and they chose to sacrifice some user friendliness in order to maintain that speed.

QuarkJets fucked around with this message at 21:37 on Mar 22, 2015

Dominoes
Sep 20, 2007

QuarkJets posted:

e: I guess that you could write simple specific-case functions and just make their uses very specific ("mean_float32 returns the float32 flattened mean of an array of float32 values", "mean_int64 returns the int64 flattened mean of an array of int64 values", etc). That's appealing to people solving hard computational problems due to the speed involved, but it's not very appealing to the computer scientists that create standard libraries (for maintainability and user friendliness reasons) or to the average user of these libraries. Going back to FFTW, that's why FFTW has so many separate but almost-equivalent functions (this one takes a complex array and modifies it in place, this one takes a real array and returns a complex array, etc). The maintainers want a library that is optimally fast, and they chose to sacrifice some user friendliness in order to maintain that speed.
Coincidentally, that's the solution I've come to for implementing a linear interpolation with Numba: one function where x is a single value, and another if it's an array.

BigRedDot
Mar 6, 2008

SurgicalOntologist posted:

This brings up an interesting point. We often have situations where an intermediate result of one algorithm is useful on its own. Variance and mean is an obvious, and simple, example, but I can recall encountering the same issue before with more complex algorithms. Is there a standard for dealing with this?

It seems that if someone was really concerned about speed, and needed both the variance and the mean, they would end up implementing it themselves rather than use a library version which will inevitably compute the mean twice. In this day and age I can't see anyone re-implementing mean and variance as anything except wasted effort. What can library authors do about this?

I see two options, neither of which are very attractive (and neither of which I've seen done). You could have the variance function spit out the mean as well. Which would work but would create some awkward code. Alternatively, the variance function could take the mean as an optional input, which if passed causes the mean computation step to be skipped. This would also work but relies on an invariant that could break down.

Is there a more general term for this problem in computer science? Basically the phenomenon whereby clean encapsulation is at odds with doing each calculation only once.

There is another approach, deferred execution, and that is the path that Blaze (and dynd to some degree) are trying to follow. Basically, instead of immediately executing expressions, you keep track of all the expressions that are used and collect them into a big expression graph. Only when you actually want a "realized" result is the entire expression graph executed. But at that point you have a lot more information: you can optimize the expression graph to coalesce duplicate computations, remove unnecessary temporaries, use the most efficient access patterns, and only compute enough to actually provide what was asked for. If you are thinking that sounds like another compiler, well, that's because basically it is. If you combine this with things like Numba, it becomes even more powerful. Our vision for scientific and analytical compute is really about being able to spell things at a high level, but push efficient execution down to the metal and across clusters.

BTW Dominoes I share a new office with most of the Numba devs. If you have any specific ideas you want me to pass on please feel free to let me know here or pm.

BigRedDot fucked around with this message at 21:50 on Mar 22, 2015

Dominoes
Sep 20, 2007

BigRedDot posted:

BTW Dominoes I share a new office with most of the Numba devs. If you have any specific ideas you want me to pass on please feel free to let me know here or pm.

Allow for Python function annotations.

Method one: Ignore them, like the python interpreter does, instead of throwing an error. Should be easy to implement.

Method two: Use them as an alternative syntax for function signatures, like with mypy.

from Numba's documentation:
Python code:
from numba import jit, int32

@jit(int32(int32, int32))
def f(x, y):
    # A somewhat trivial example
    return x + y
Make this work too:
Python code:
from numba import jit, int32

@jit
def f(x: int32, y: int32) -> int32:
    # A somewhat trivial example
    return x + y

Dominoes
Sep 20, 2007

Quick on Github. Rough. Needs work and more functionality before I put it on pypi.

Not sure how useful it will be to others, since the performance increase is only notable for number crunching. The idea is to install with pip install quick, and use its numerical functions as faster drop-in replacements for builtins or numpy's. Might fill a niche between using numpy functions, and writing custom optimized code.

Dominoes fucked around with this message at 22:46 on Mar 22, 2015

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
Still a programming newbie. Would like some help with enumerate(). When I try to learn stuff on my own, it's hard to follow because of the different variables they are using.
I'm reading a txt file into a list. I need to number each line, then print the even-numbered lines. I made it work like this, but I have a feeling that enumerate works better. (I discovered it afterwards.)
code:
or_file = open("test5.txt", 'r')
outtext = or_file.readlines()
or_file.close()
f_text = list()
x = 1
newtxt = list()

for line in outtext:
	y = str(x)
	newtxt.append(y + " " + line)
	x = x + 1

for line in newtxt:
	f = int(line[0])
	
	if (f % 2 == 0):
		f_text.append(line[:])
		
	else: continue
	
print f_text
I can probably get rid of 10 lines or more by doing one crazy trick, and I bet it's simple and everyone is laughing at me.

KICK BAMA KICK
Mar 2, 2009

Didn't test this:
Python code:
with open('test5.txt', 'r') as in_file:
    # This is the preferred way to use 'open';
    # it automatically closes the file when exiting this block
    # and even does so if an error occurs inside
    lines = in_file.readlines()

output = []
for i, line in enumerate(lines, start=1):  # Specify 'start' because it defaults to 0
    if not (i % 2):  # Zero evaluates to False, so not zero is true -- we catch the even lines
        new_line = '%d %s' % (i, line)  # Old-style string formatting; could also use the .format() string method
        output.append(new_line)

print('\n'.join(output))  # Join creates a string consisting of each string in the list we give it separated by the given string, in this case, a newline
That's a very readable way to do it. Probably superior to what I'm about to show you, but just to show a different approach, you could code golf that loop down into a list comprehension along the lines of:
Python code:
output = ['%d %s' % (i, line) for i, line in enumerate(lines, start=1) if not (i % 2)]
but I think most would agree that's a bit much for anything other than a one-off script no one -- including you -- will ever have to read or edit again.
I think in Python 2 (again, not testing) you could use a slice on the output of enumerate to avoid the condition altogether:
Python code:
for i, line in enumerate(lines, start=1)[1::2]:  # Start at the second item (indexed from zero but #2 by our numbering), proceed to the end by twos
    new_line = '%d %s' % (i, line)  # So we're only getting every second line
but that doesn't work in Python 3 because enumerate returns a generator or iterator or something like that instead of a plain old list.

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS

KICK BAMA KICK posted:

CODE AND poo poo

The first example is great. I understand it and it works perfectly.
%d and the like are confusing to me for some reason.
Also, I haven't gotten to join yet, but seems pretty straightforward.

Thanks much.

This is for an early Rosalind problem. Has anyone gone through these problems before?
It seems fun and not terribly boring.

KICK BAMA KICK
Mar 2, 2009

jimcunningham posted:

%d and the like are confusing to me for some reason.
That's the old-style string formatting a.k.a. interpolation. Inside the string you mark the things to be replaced with specifiers like %d for integers or %s for strings (in practice, I think you can use %s for just about anything) and then you use the modulus operator against the string with a tuple of the values you want to fill in the blanks -- in your example, the line number and the contents of the corresponding line from the input file. An alternative, probably preferable but I tend to use the former for simple cases, is the new-style .format method all strings have.
Python code:
new_line = '{} {}'.format(i, line)
or more explicitly
Python code:
new_line = '{number} {text}'.format(number=i, text=line)
You can get a lot more specific in how you format those strings.
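
A few quick format-spec examples (the widths and precisions here are just illustrative):
Python code:
temp = 12.345
print('{:<15}|'.format('2015-03-21'))           # left-align in a 15-character column
print('{:5.1f}'.format(temp))                   # ' 12.3' -- minimum width 5, one decimal place
print('{number:>4d} {text}'.format(number=7, text='lines'))   # '   7 lines'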

ButtWolf
Dec 30, 2004

by Jeffrey of YOSPOS
Is the old style a no-no, or just preference?

SurgicalOntologist
Jun 17, 2004

BigRedDot posted:

There is another approach, deferred execution, and that is the path that Blaze (and dynd to some degree) are trying to follow. Basically, instead of immediately executing expressions, you keep track of all the expressions that are used and collect them into a big expression graph. Only when you actually want a "realized" result is the entire expression graph executed. But at that point you have a lot more information: you can optimize the expression graph to coalesce duplicate computations, remove unnecessary temporaries, use the most efficient access patterns, and only compute enough to actually provide what was asked for. If you are thinking that sounds like another compiler, well, that's because basically it is. If you combine this with things like Numba, it becomes even more powerful. Our vision for scientific and analytical compute is really about being able to spell things at a high level, but push efficient execution down to the metal and across clusters.

That's super cool. I've been following Matthew Rocklin's blog so this is all familiar, but it didn't occur to me as a resolution to this issue. Good stuff. I should probably start using blaze one of these days.

BigRedDot
Mar 6, 2008

KICK BAMA KICK posted:

but that doesn't work in Python 3 because enumerate returns a generator or iterator or something like that instead of a plain old list.
You can "slice" generators using itertools:
code:
In [12]: stuff = "abcdefghijklmnop"

In [13]: from itertools import islice

In [14]: for x in islice(enumerate(stuff), 0, None, 2): print(x)
(0, 'a')
(2, 'c')
(4, 'e')
(6, 'g')
(8, 'i')
(10, 'k')
(12, 'm')
(14, 'o')
I personally like this much better than using the modulus operator. Though also note range is a special case in python 3, you can slice it directly:
code:
In [15]: range(10)[::2]
Out[15]: range(0, 10, 2)

KICK BAMA KICK
Mar 2, 2009

BigRedDot posted:

You can "slice" generators using itertools:
Ha, discovered that the other day to solve a problem and totally forgot about it in the meantime. itertools really does solve everything.

luchadornado
Oct 7, 2004

A boombox is not a toy!

SurgicalOntologist posted:

This brings up an interesting point. We often have situations where an intermediate result of one algorithm is useful on its own. Variance and mean is an obvious, and simple, example, but I can recall encountering the same issue before with more complex algorithms. Is there a standard for dealing with this?

It seems that if someone was really concerned about speed, and needed both the variance and the mean, they would end up implementing it themselves rather than use a library version which will inevitably compute the mean twice. In this day and age I can't see anyone re-implementing mean and variance as anything except wasted effort. What can library authors do about this?

I see two options, neither of which are very attractive (and neither of which I've seen done). You could have the variance function spit out the mean as well. Which would work but would create some awkward code. Alternatively, the variance function could take the mean as an optional input, which if passed causes the mean computation step to be skipped. This would also work but relies on an invariant that could break down.

Is there a more general term for this problem in computer science? Basically the phenomenon whereby clean encapsulation is at odds with doing each calculation only once.

If the results are deterministic, you can do something called memoization, which is basically caching intermediate values for functions where you know the output won't change.
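
For illustration, a minimal memoization sketch using Python 3's functools.lru_cache (the mean example is made up; the inputs have to be hashable, hence the tuple):
Python code:
from functools import lru_cache

@lru_cache(maxsize=None)
def mean(values):
    print('computing...')            # visible only on a cache miss
    return sum(values) / len(values)

data = (1.0, 2.0, 3.0, 4.0)
mean(data)    # computes and caches the result
mean(data)    # cache hit: returns immediately without recomputing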

But really, there is almost always a trade off between performance, flexibility, and ease of use. I'm writing GIS software from scratch right now, and I've got the standard ConvertLatLonToScreenCoords() and DrawLine() type things for one-off tasks or user-defined things where performance isn't a big deal. The core methods for drawing roads or cities or whatever else are all custom and repeat a ton of code. It's harder to maintain, it creates bigger binaries, etc. but performance is the most critical thing when rendering out 100,000 polylines to the screen, so it's a necessary evil.

Abstract things out and make code easy to read and maintain. Once you hit a brick wall, start considering other options like cutting down on how frequently your function is called, pre-fetching it, caching it, etc. I'd probably save writing a one-off function for last when all else has failed, but don't be ashamed when you have to do it.

SurgicalOntologist
Jun 17, 2004

Yeah I almost added memoization as another option but I decided I had already written enough :v:. Anyways, for me it's mostly academic as I've rarely if ever had to seriously optimize my code, so I've always been on the "ease of use" side of the tradeoff.


QuarkJets
Sep 8, 2008

jimcunningham posted:

Is the old style a no-no, or just preference?

I think it's just preference, but the old-style is so much less flexible that I basically never use it anymore
