vikingstrike
Sep 23, 2007

whats happening, captain

Eela6 posted:

You want to use numpy / MATLAB style logical indexing.

Remember not to use the bitwise operators like '&' unless you are actually working bitwise. Apparently this is a difference between numpy and pandas.

numpy has a number of element-wise logic functions that do what you want, called logical_and, logical_or, logical_not, logical_xor, etc.

It's easiest to understand given an example. You might already know this, but it's always nice to have a refresher.

IN:
Python code:
A = np.array([2, 5, 8, 12, 20])
print(A)
between_three_and_twenty = np.logical_and(A > 3, A < 20)

print(between_three_and_twenty)
A[between_three_and_twenty] = 500

print(A)
OUT:
Python code:
[ 2  5  8 12 20]
[False  True  True  True False]
[  2 500 500 500  20]
Specifically, for your question:
IN:
Python code:
def update_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    df['Current'] = df['New']
    # fill the still-missing entries from Old (label-based, avoids chained assignment)
    df.loc[df['New'].isnull(), 'Current'] = df['Old']
    return df
    
def test_update_dataframe():
    df = pd.DataFrame([[1, 2, np.nan],
                       [3, 2, np.nan],
                       [7, np.nan, np.nan],
                       [np.nan, 8, np.nan]],
                      columns=['Old', 'New', 'Current'])
    print('old')
    print(df)
    df = update_dataframe(df)
    print('new')
    print(df)    
    
test_update_dataframe()
OUT:
code:
old
   Old  New  Current
0  1.0  2.0      NaN
1  3.0  2.0      NaN
2  7.0  NaN      NaN
3  NaN  8.0      NaN
new
   Old  New  Current
0  1.0  2.0      2.0
1  3.0  2.0      2.0
2  7.0  NaN      7.0
3  NaN  8.0      8.0

BTW, touching on your post before the edit, I believe this is something the pandas devs do on purpose to make it align better with other data analysis platforms.

Eela6
May 25, 2007
Shredded Hen
That makes sense! It seems like a reasonable overload of those operators (they mean the first thing I would guess, which is generally a good sign for overloaded operators). I tried to do exactly that when I first switched from MATLAB to numpy.

(Whenever I get heavy into numerics, I occasionally still find myself using container(key) instead of container[key]. At least I've beaten zero-based indexing into my skull.)

Jose Cuervo
Aug 25, 2004
I am using the joblib library to implement running simulations in parallel. My code is structured as follows:

Python code:
import cfg

cfg.policy_history = []

for policy in set_of_policies:
    # Code that runs simulations of a given policy using joblib is called here 
    # Note that cfg.policy_history is accessed (but not modified) while the simulations are being run

    # Code that appends policy results to cfg.policy_history goes here
I am using cfg.py to store policy_history so that I can use policy_history as a global variable in different modules.

The code runs just fine when I set the number of cores to 1; however, when it is set to anything more than 1 the code does not work. The reason seems to be that the simulation runs for the second policy still believe that cfg.policy_history is an empty list, even though results were appended to it after the first policy was done being simulated.

Thoughts on how to overcome this issue?

QuarkJets
Sep 8, 2008

I know nothing about joblib, but if it works anything like multiprocessing then it's creating forked jobs that don't have access to the memory of the other jobs. You can get around this (in multiprocessing anyway) in a few ways, an easy one being to create a shared memory array with some primitive type and some fixed size.

If subsequent jobs need output from previous jobs then it doesn't make much sense to me that you're trying to run them in parallel. If you just want a running history of what each job did then that's something you can do with multiprocessing (by just having each job be a function that returns a thing, then you gather all of the things)
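
To make the shared-memory option concrete, here's a minimal multiprocessing-only sketch (the worker function and N_SLOTS are made up for illustration; I don't know whether joblib exposes the same machinery):

Python code:
import multiprocessing as mp

N_SLOTS = 4  # fixed size chosen up front

def worker(idx, shared):
    # each process writes into its own slot of the shared buffer
    shared[idx] = idx * 10.0

if __name__ == '__main__':
    shared = mp.Array('d', N_SLOTS)  # 'd' = C double, zero-initialized, lock included
    procs = [mp.Process(target=worker, args=(i, shared)) for i in range(N_SLOTS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(shared[:])  # [0.0, 10.0, 20.0, 30.0]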

QuarkJets fucked around with this message at 13:14 on Nov 19, 2016

Jose Cuervo
Aug 25, 2004

QuarkJets posted:

I know nothing about joblib, but if it works anything like multiprocessing then it's creating forked jobs that don't have access to the memory of the other jobs. You can get around this (in multiprocessing anyway) in a few ways, an easy one being to create a shared memory array with some primitive type and some fixed size.

If subsequent jobs need output from previous jobs then it doesn't make much sense to me that you're trying to run them in parallel. If you just want a running history of what each job did then that's something you can do with multiprocessing (by just having each job be a function that returns a thing, then you gather all of the things)

I guess I didn't explain myself clearly.

I have a for loop that iterates over different policies. For each policy in the for loop I want 25 simulation replications, where the code for a single replication looks up values stored in cfg.policy_history. Since each replication is independent, I am using multiprocessing (through joblib) to run the replications in parallel.

So once I finish simulating the first policy I calculate some metrics and store them in cfg.policy_history. Then on the second policy (the second iteration of the for loop), the code for a single replication should be able to look up the values stored in cfg.policy_history that were stored from the first policy.

The problem seems to be that, when I start the multiprocessing the second time, cfg.policy_history still seems to be the empty list that it was initialized to. So my question is, why is cfg.policy_history not updated when I start the second round of multiprocessing?

EDIT: Here is the code in case that makes it clearer. The function single_simulation_replication is the one that contains the line that looks up values in cfg.policy_history, and both single_simulation_replication and analyse_policy_results are located in the same module as the for loop.

Python code:
import cfg
import joblib
import multiprocessing
import policy

cfg.policy_history = []

for policy_idx in range(10):
    pol = policy.Policy(policy_idx)
    seed_range = range(25)

    cores_to_use = max(1, (multiprocessing.cpu_count() - 1))
    parallel_results = joblib.Parallel(n_jobs=cores_to_use)(
        joblib.delayed(single_simulation_replication)(pol, the_seed) for the_seed in seed_range)
    rep_dict = dict(parallel_results)

    cfg.policy_history.append(analyse_policy_results(rep_dict))

Jose Cuervo fucked around with this message at 16:39 on Nov 19, 2016

Xeno54
Apr 25, 2004
bread

axolotl farmer posted:

I'm using pandas to update a column in a dataframe.

My rules are if there is a value in the New column, that becomes the Current value.

If there is a NaN in the New column, the value in the Old column becomes the Current value

code:
In[]: df = pd.DataFrame([[1, 2,np.nan],[3, 2,np.nan],[7, np.nan,np.nan], [np.nan, 8,np.nan]], columns=['Old', 'New', 'Current'])
In[]: df
Out[]: 

   Old  New  Current
0  1.0  2.0      NaN
1  3.0  2.0      NaN
2  7.0  NaN      NaN
3  NaN  8.0      NaN
I try to put in the values from New and then replace the NaN from Old.

code:
In[]df.Current=df.New
In[]df.Current=df.Current.loc[(df.Current.isnull() & (df.Old.notnull()))] = df.Old
In[]df
Out[]: 

   Old  New  Current
0  1.0  2.0      1.0
1  3.0  2.0      3.0
2  7.0  NaN      7.0
3  NaN  8.0      NaN
Welp, this just replaces all the values in Current with Old.

Please help, I'm bad and new at this.

What you're describing is just forward filling, but along the row axis, so you can use ffill:

code:
df['Current'] = df.ffill(axis=1)['Current']
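
For comparison, the stated rule ("use New unless it's NaN, then fall back to Old") can also be written with fillna; just a sketch, but it only touches the one column:

code:
df['Current'] = df['New'].fillna(df['Old'])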

QuarkJets
Sep 8, 2008

Jose Cuervo posted:

I guess I didn't explain myself clearly.

I have a for loop that iterates over different policies. For each policy in the for loop I want 25 simulation replications, where the code for a single replication looks up values stored in cfg.policy_history. Since each replication is independent, I am using multiprocessing (through joblib) to run the replications in parallel.

So once I finish simulating the first policy I calculate some metrics and store them in cfg.policy_history. Then on the second policy (the second iteration of the for loop), the code for a single replication should be able to look up the values stored in cfg.policy_history that were stored from the first policy.

The problem seems to be that, when I start the multiprocessing the second time, cfg.policy_history still seems to be the empty list that it was initialized to. So my question is, why is cfg.policy_history not updated when I start the second round of multiprocessing?

EDIT: Here is the code in case that makes it clearer. The function single_simulation_replication is the one that contains the line that looks up values in cfg.policy_history, and both single_simulation_replication and analyse_policy_results are located in the same module as the for loop.

Python code:
import cfg
import joblib
import multiprocessing
import policy

cfg.policy_history = []

for policy_idx in range(10):
    pol = policy.Policy(policy_idx)
    seed_range = range(25)

    cores_to_use = max(1, (multiprocessing.cpu_count() - 1))
    parallel_results = joblib.Parallel(n_jobs=cores_to_use)(
        joblib.delayed(single_simulation_replication)(pol, the_seed) for the_seed in seed_range)
    rep_dict = dict(parallel_results)

    cfg.policy_history.append(analyse_policy_results(rep_dict))

Pass in cfg.policy_history as an input to single_simulation_replication
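
i.e., something along these lines (a sketch; it assumes single_simulation_replication is changed to accept the history as a third parameter):

Python code:
# pass the current history in explicitly; each worker gets its own pickled copy of it
parallel_results = joblib.Parallel(n_jobs=cores_to_use)(
    joblib.delayed(single_simulation_replication)(pol, the_seed, cfg.policy_history)
    for the_seed in seed_range)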

Jose Cuervo
Aug 25, 2004

QuarkJets posted:

Pass in cfg.policy_history as an input to single_simulation_replication

Why does this work when what I had before doesn't?

baka kaba
Jul 19, 2003

PLEASE ASK ME, THE SELF-PROFESSED NO #1 PAUL CATTERMOLE FAN IN THE SOMETHING AWFUL S-CLUB 7 MEGATHREAD, TO NAME A SINGLE SONG BY HIS EXCELLENT NU-METAL SIDE PROJECT, SKUA, AND IF I CAN'T PLEASE TELL ME TO
EAT SHIT

Multiprocessing (which joblib looks like it uses) creates separate processes for each worker, and they use separate memory from the main process that's creating them. So your worker processes can't see that list; it's created and manipulated in the main process

You have a few options for giving them access - passing it as an argument to the process function creates a copy for them, you can set up message-passing so processes can talk to each other, and you can mess around with shared memory too. Depending on how much data you're working with the 'pass it as an argument' approach might be fine
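
To give an idea of what the message-passing option looks like, here's a minimal sketch with multiprocessing.Queue (the worker function and the squaring work are made up):

Python code:
import multiprocessing as mp

def worker(task_queue, result_queue):
    # pull work items until the producer sends the None sentinel
    for item in iter(task_queue.get, None):
        result_queue.put(item * item)

if __name__ == '__main__':
    tasks, results = mp.Queue(), mp.Queue()
    procs = [mp.Process(target=worker, args=(tasks, results)) for _ in range(2)]
    for p in procs:
        p.start()
    for i in range(10):
        tasks.put(i)
    for _ in procs:
        tasks.put(None)  # one sentinel per worker
    gathered = [results.get() for _ in range(10)]
    for p in procs:
        p.join()
    print(sorted(gathered))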

Parallel processing is awkward, basically. There's a lot to trip you up, and tools you need to get around that

(that's my understanding of it anyway, I don't python on this level)

baka kaba fucked around with this message at 07:53 on Nov 21, 2016

axolotl farmer
May 17, 2007

Now I'm going to sing the Perry Mason theme




Thanks a lot for your help :tipshat:

Jose Cuervo
Aug 25, 2004

baka kaba posted:

Multiprocessing (which joblib looks like it uses) creates separate processes for each worker, and they use separate memory from the main process that's creating them. So your worker processes can't see that list; it's created and manipulated in the main process

You have a few options for giving them access - passing it as an argument to the process function creates a copy for them, you can set up message-passing so processes can talk to each other, and you can mess around with shared memory too. Depending on how much data you're working with the 'pass it as an argument' approach might be fine

Parallel processing is awkward, basically. There's a lot to trip you up, and tools you need to get around that

(that's my understanding of it anyway, I don't python on this level)

OK, makes sense. Thanks.

Next question:

I am looking at the code found in the accepted answer to this question

http://stackoverflow.com/questions/32791911/fast-calculation-of-pareto-front-in-python

Python code:
def dominates(row, rowCandidate):
    return all(r >= rc for r, rc in zip(row, rowCandidate))

def cull(pts, dominates):
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated
I notice that the answer uses the word 'dominates' as both a function name AND the parameter passed in to the function cull. In PyCharm, this brings up the 'Shadows name 'dominates' from outer scope' warning. Is it wrong to use 'dominates' here in this way? I think the author is trying to emphasize that the parameter passed in to cull() is actually a function itself, but why is he passing in the function? Why not just use the function without passing it in since it is in the same module?

Eela6
May 25, 2007
Shredded Hen

Jose Cuervo posted:

OK, makes sense. Thanks.

Next question:

I am looking at the code found in the accepted answer to this question

http://stackoverflow.com/questions/32791911/fast-calculation-of-pareto-front-in-python

...

Why not just use the function without passing it in since it is in the same module?
This is a matter of style and there's nothing wrong with doing it the way you have suggested. Some programmers prefer to make as many of a function's inputs explicit parameters as possible. I prefer having functions with as few arguments as possible. The important part is coherency and, where possible, consistency.

Implementation note: According to Luciano Ramalho in Fluent Python, within CPython, explicitly passing the function is ever-so-slightly faster because the passed function becomes a local name - this speeds up lookup on the interpreter's end. However, the speed gains are negligible in almost every case.
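
(If you're curious, you can see the local-vs-global lookup difference with a quick, machine-dependent micro-benchmark; use_global and use_local here are just throwaway names:)

Python code:
import timeit

def use_global(n=100000):
    total = 0
    for i in range(n):
        total += abs(-i)  # abs resolved through the global/builtin namespace each pass
    return total

def use_local(n=100000, abs=abs):
    total = 0
    for i in range(n):
        total += abs(-i)  # abs bound to a local name, which is slightly faster to look up
    return total

print(timeit.timeit(use_global, number=100))
print(timeit.timeit(use_local, number=100))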

As a matter of style, if I wanted to make it clear that dominates is only used within cull, I would make it a nested function of cull. This inherits the speed gains of explicit passing but, again, these are not important.

i.e.,

Python code:
def cull(pts):

    def dominates(row, rowCandidate):
        return all(r >= rc for r, rc in zip(row, rowCandidate))
        
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated
RE: your multiprocessing question. Are you sure you need to use multiprocessing? Have you profiled your code? For a single desktop or laptop, parallelization is generally the 'last gasp' of optimization: at best you get a speedup on the order of your core count. For example, if you can eliminate cache misses in the 'hot' part of your code you can gain 100-400x speed without needing to muck around with multiprocessing.

Cingulate
Oct 23, 2012

by Fluffdaddy

Eela6 posted:

If you can eliminate cache misses in the 'hot' part of your code
... how would one go about that?

Jose Cuervo
Aug 25, 2004

Eela6 posted:

As a matter of style, if I wanted to make it clear that dominates is only used within cull, I would make it a nested function of cull. This inherits the speed gains of explicit passing but, again, these are not important.

i.e.,

Python code:
def cull(pts):

    def dominates(row, rowCandidate):
        return all(r >= rc for r, rc in zip(row, rowCandidate))
        
    dominated = []
    cleared = []
    remaining = pts
    while remaining:
        candidate = remaining[0]
        new_remaining = []
        for other in remaining[1:]:
            [new_remaining, dominated][dominates(candidate, other)].append(other)
        if not any(dominates(other, candidate) for other in new_remaining):
            cleared.append(candidate)
        else:
            dominated.append(candidate)
        remaining = new_remaining
    return cleared, dominated
I have never thought about structuring code this way, but this makes a lot of sense (especially given how small the function dominates is).

quote:

RE: your multiprocessing question. Are you sure you need to use multiprocessing? Have you profiled your code? For a single desktop or laptop, parallelization is generally the 'last gasp' of optimization: at best you get a speedup on the order of your core count. For example, if you can eliminate cache misses in the 'hot' part of your code you can gain 100-400x speed without needing to muck around with multiprocessing.
Each simulation replication takes 3-5 seconds to run. The desktop I use at work has the equivalent of 24 cores so running 25 replications in parallel takes 10-11 seconds. Running 25 replications in series takes about 2 minutes. This is why I chose to use multiprocessing.

However, I too am interested in a) what a cache miss is, and b) how I would find and eliminate them in my code.

Jose Cuervo fucked around with this message at 21:30 on Nov 21, 2016

Kit Walker
Jul 10, 2010
"The Man Who Cannot Deadlift"

So I'm pretty new to coding in general but I started learning Python a couple weeks ago. I've gone through the Codecademy course and I'm nearly done with Learn Python the Hard Way. So far it's been pretty easy and understandable. I just don't really know where to go from here, either in terms of developing my skills with Python or as a programmer in general. Any advice?

Tigren
Oct 3, 2003

Kit Walker posted:

So I'm pretty new to coding in general but I started learning Python a couple weeks ago. I've gone through the Codecademy course and I'm nearly done with Learn Python the Hard Way. So far it's been pretty easy and understandable. I just don't really know where to go from here, either in terms of developing my skills with Python or as a programmer in general. Any advice?

Why are you learning to code? After Codecademy and LPtHW, you should have the skills to start taking a crack at building whatever it was you hoped to build. Start on that project.

champagne posting
Apr 5, 2006

YOU ARE A BRAIN
IN A BUNKER

Kit Walker posted:

So I'm pretty new to coding in general but I started learning Python a couple weeks ago. I've gone through the Codecademy course and I'm nearly done with Learn Python the Hard Way. So far it's been pretty easy and understandable. I just don't really know where to go from here, either in terms of developing my skills with Python or as a programmer in general. Any advice?

Make a Twitter bot which spews garbage at celebrities. Extra points for having people respond without knowing it's a bot.

Kit Walker
Jul 10, 2010
"The Man Who Cannot Deadlift"

Tigren posted:

Why are you learning to code? After Codecademy and LPtHW, you should have the skills to start taking a crack at building whatever it was you hoped to build. Start on that project.

I actually had no real vision in mind. I just kinda wanted to learn to code, learn more about the inner workings of computers and the internet, and see where it takes me. If I can develop my skills to the point that I can actually do it for a living, that would be a nice bonus.

Boiled Water posted:

Make a Twitter bot which spews garbage at celebrities. Extra points for having people respond without knowing it's a bot.

lol, why not? That's something I can work towards

Eela6
May 25, 2007
Shredded Hen

Jose Cuervo posted:

I have never thought about structuring code this way, but this makes a lot of sense (especially given how small the function dominates is).

Each simulation replication takes 3-5 seconds to run. The desktop I use at work has the equivalent of 24 cores so running 25 replications in parallel takes 10-11 seconds. Running 25 replications in series takes about 2 minutes. This is why I chose to use multiprocessing.

However, I too am interested in a) what a cache miss is, and b) how I would find and eliminate them in my code.
Congratulations, you have a real use case for parallelism! Carry on :). In response to your questions:

A: I do not have a formal computer science or engineering background (I did math), so I would appreciate it if someone with a stronger understanding of hardware could give a better explanation. I will try, though:
In an extremely general sense, a cache miss is when your processor tries to access data in the (extremely fast) L1 cache but can't find it. It then has to look in a higher-level cache. If it's not in that cache, it has to look in a higher-level cache... and if it's not in any of them, it then has to access RAM (which is very slow in comparison). Just like reading data from a hard drive is very slow compared to RAM, reading data from RAM is slow compared to the cache.


B: This is the subject of a small talk I am going to give at my local python developers' meeting. Once I've finished my research and slides, I will present it here, too! But to give an idea of the basics, you can avoid cache misses by structuring your code to use memory more efficiently. Basically, you want to be able to do your work without constantly loading things into and out of the cache. This means avoiding unnecessary data structures and copying. As an extremely general rule of thumb, every place where you use return when you could use yield is an opportunity to use memory more efficiently.

Generators, coroutines, and functional-style programming are your friends, and they are often appropriate for simulations. (Not every function should be replaced by a generator, and not every list comprehension should be a generator expression. But you would be surprised how many can and should be.)
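
A toy illustration of the return-a-list vs. yield point (a sketch; the generator never materializes the whole result in memory):

Python code:
def squares_list(n):
    # builds the entire list before returning it
    result = []
    for i in range(n):
        result.append(i * i)
    return result

def squares_gen(n):
    # hands back one value at a time; nothing is stored
    for i in range(n):
        yield i * i

print(sum(squares_list(10**7)))  # allocates ~10 million ints up front
print(sum(squares_gen(10**7)))   # same answer, roughly constant memory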



Even more importantly, you have to know what the slow part of your code is before you bother spending time optimizing. This is what profiling is for. First get your code to work, then find out whether it's actually too slow. If it is, find out why and where. Oftentimes a small subsection of your code takes 95%+ of execution time. If you can optimize THAT part of your code, you are done. It's easy to spend a dozen man-hours 'optimizing' something that takes <1% of runtime. Don't do that.
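
For the 'find out why and where' step, the standard library's cProfile is usually enough to get started; a minimal sketch (main is whatever your entry point happens to be):

Python code:
import cProfile
import pstats

def main():
    pass  # your actual entry point goes here

cProfile.run('main()', 'profile.out')
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(20)  # the 20 most expensive call paths
Or, from the command line: python -m cProfile -s cumulative your_script.py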

I am not an expert. There are many great PyCon talks on code profiling & optimization by experts in the field that can give you better instruction than I can. The talk I'm planning to give is basically cherry-picked bits and pieces from that excellent instruction.

As a starting place, this talk is a little long but gives an excellent overview of the topic of speed in Python.

https://www.youtube.com/watch?v=JDSGVvMwNM8

Eela6 fucked around with this message at 00:45 on Nov 22, 2016

Ghost of Reagan Past
Oct 7, 2003

rock and roll fun

Kit Walker posted:

I actually had no real vision in mind. I just kinda wanted to learn to code, learn more about the inner workings of computers and the internet, and see where it takes me. If I can develop my skills to the point that I can actually do it for a living, that would be a nice bonus.


lol, why not? That's something I can work towards
When I started down this road I ended up in a similar boat. Just find something that you'd be happy to build. Doesn't need to be fancy, or great, or even interesting to anyone but you. Just make things and you'll get better and learn a lot.

Now of course it's easier said than done to come up with things when you've just started. But here are some ideas:

1. Build a website that lets you list junk you want to give away and allows people to claim it.
2. Make an app that suggests drinks based on your current Spotify artist, a la http://drinkify.org/ but in real time. Maybe hook it up to http://www.thecocktaildb.com/?
3. Build a text analyzer that reads text in and spits out some sort of analysis--maybe you want to see how positive and negative Pitchfork reviews are, or whatever.

Ghost of Reagan Past
Oct 7, 2003

rock and roll fun
Here's a question about what people prefer

Suppose you were looking at a database interface that claimed to be Pythonic, and you wanted to query a table for a value. Which syntax would be preferable, in your mind, for a simple SQL "SELECT * FROM table WHERE column = value"?

code:
database['table']['column']['value']
code:
database.table.column['value']
code:
database['table']['column'].get('value')
If you think some other option is better, let me know, I know I haven't exhausted all the reasonable options.

Just assume that it also supports DB-API.

SurgicalOntologist
Jun 17, 2004

Python code:
database['table'].where(column='value')
Edit: you need to separate selecting on a column vs selecting that column. E.g.

Python code:
database['table'].where(column='value')['column']
to select on a column and also select it.

VikingofRock
Aug 24, 2008




Eela6 posted:

Python code:
[new_remaining, dominated][dominates(candidate, other)].append(other)

Is this line common python style? I've never seen that before and it took me a minute or two to puzzle through what it does.

Eela6
May 25, 2007
Shredded Hen
It is bad style. Whether it's a common style is something you'll have to ask the Coding Horrors thread. :)

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe
What if it were

(new_remaining if dominates(candidate, other) else dominated).append(other)

Eela6
May 25, 2007
Shredded Hen

Ghost of Reagan Past posted:

Here's a question about what people prefer

Suppose you were looking at a database interface that claimed to be Pythonic, and you wanted to query a table for a value. Which syntax would be preferable, in your mind, for a simple SQL "SELECT * FROM table WHERE column = value"?


If you think some other option is better, let me know, I know I haven't exhausted all the reasonable options.

Just assume that it also supports DB-API.


I prefer numpy style, i.e.,
Python code:
someValue = database['table', 'column', 'value']
someColumn = database['table', 'column']
someTable = database['table'] 

everyValueInAColumn = database['table', 'column', :]
I really dislike chained brackets. I come from a weird programming background, though, and I've, uh, never actually used SQL, so I might not even understand what that query means :doh:
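
For what it's worth, supporting that syntax is mostly a matter of accepting tuple keys in __getitem__; a toy sketch (the Database class and its dict-of-dicts storage are invented for illustration):

Python code:
class Database:
    def __init__(self, tables):
        # toy storage: {table: {column: {key: value}}}
        self._tables = tables

    def __getitem__(self, key):
        # database['table', 'column', 'value'] arrives here as a tuple
        if not isinstance(key, tuple):
            key = (key,)
        node = self._tables
        for part in key:
            if isinstance(part, slice) and part == slice(None):
                return list(node.values())  # the ':' case: everything at this level
            node = node[part]
        return node

db = Database({'users': {'name': {1: 'ada', 2: 'grace'}}})
print(db['users', 'name', 1])  # 'ada'
print(db['users', 'name', :])  # ['ada', 'grace']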

Eela6 fucked around with this message at 17:25 on Nov 22, 2016

Dominoes
Sep 20, 2007

Eela6 posted:

I really dislike chained brackets. I come from a weird programming background, though, and I've, uh, never actually used SQL, so I might not even understand what that query means :doh:
Why can't we chain brackets for normal Python indexing? It's always bothered me!

Ghost of Reagan Past
Oct 7, 2003

rock and roll fun

SurgicalOntologist posted:

Python code:
database['table'].where(column='value')
Edit: you need to separate selecting on a column vs selecting that column. E.g.

Python code:
database['table'].where(column='value')['column']
to select on a column and also select it.
I like this; it facilitates a lot of Django-style tricks, with column names being passed as keyword arguments, AND it tells me exactly how I should return collections of rows.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe

Hammerite posted:

What if it were

(new_remaining if dominates(candidate, other) else dominated).append(other)

You got it backwards.

Python code:
if dominates(candidate, other):
    dominated.append(other)
else:
    new_remaining.append(other)

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

Suspicious Dish posted:

You got it backwards.

Python code:
if dominates(candidate, other):
    dominated.append(other)
else:
    new_remaining.append(other)

Well I guess the fact that I hosed it up is evidence it shouldn't be used. Or at least evidence it shouldn't be used by me.

huhu
Feb 24, 2006
What Python Debugger do you guys use for Linux?

Thermopyle
Jul 1, 2003

...the stupid are cocksure while the intelligent are full of doubt. —Bertrand Russell

huhu posted:

What Python Debugger do you guys use for Linux?

PyCharm or ipdb.

huhu
Feb 24, 2006

Thermopyle posted:

PyCharm or ipdb.

Awesome, thanks.

New question: is there a cleaner way to write this?

code:
columns_int = self.columns_int[:]
columns_int.remove(self.column_with_id_int)
self.columns_int_no_id = columns_int

Hammerite
Mar 9, 2007

And you don't remember what I said here, either, but it was pompous and stupid.
Jade Ear Joe

huhu posted:

Awesome, thanks.

New question: is there a cleaner way to write this?

code:
columns_int = self.columns_int[:]
columns_int.remove(self.column_with_id_int)
self.columns_int_no_id = columns_int

Assuming these are lists,

self.columns_int_no_id = [col for col in self.columns_int if col != self.column_with_id_int]

(you might need to express the if-condition in some other way, depending on how exactly columns are represented and whether comparisons work on them)

Dominoes
Sep 20, 2007

Hey dudes. If you've used pandas, you may have noticed that working with DataFrames and Series is very slow compared to working with plain arrays. Do y'all know why Pandas doesn't have speedier ways to work with data? I've found I can get big performance boosts, while keeping a DataFrame's robust features, by creating arrays and then implementing conversion functions to do what I need. For example, indexing:

Python code:
row_indexes = {row: i for i, row in enumerate(df.index)}
column_indexes = {col: i for i, col in enumerate(df.columns)}

df_array = df.values

%timeit df[col1][row1]  # 126 µs per loop
%timeit df_array[row_indexes[row1], column_indexes[col1]]   # 175 ns per loop


That generic example shows a roughly 700x speed increase; if you index a lot, that adds up. I'd bet you could make similar funcs for most of Pandas' features. Is there a reason these aren't included in Pandas?


Here's a class to make the syntax better, and optionally get rid of PD's timestamps. Caveat: while this and a native DF's loc method show the same performance boost as above, they're both slower than the direct indexing approach above.

Python code:
class FastDf:
    def __init__(self, df: pd.DataFrame, convert_to_date: bool=False):
        """Used to improve DataFrame index speed by converting to an array. """
        if convert_to_date:
            # Convert from PD timestamps to datetime.date.
            self.row_indexes = {row.date(): i for i, row in enumerate(df.index)}
        else:
            self.row_indexes = {row: i for i, row in enumerate(df.index)}
        self.column_indexes = {col: i for i, col in enumerate(df.columns)}
        self.array = df.values

    def loc(self, row, col):
        return self.array[self.row_indexes[row], self.column_indexes[col]]

fast_df = FastDf(df)

%timeit df.loc[row1, col1]  # 473 µs per loop
%timeit fast_df.loc(row1, col1)  # 336 ns per loop

Dominoes fucked around with this message at 15:53 on Nov 24, 2016

Dominoes
Sep 20, 2007

Looking for help understanding sklearn's SVC's confidence info: decision_function and predict_proba (docs). The results aren't matching up, and I can't find a pattern.

For example, the first prediction is 2. decision_function shows the highest value in its third column (what?), and predict_proba in its second (correct?).
The fourth from the bottom predicts 3. decision_function shows the highest in its third column (correct?) and predict_proba in its second (what?).

What's going on, and how can I assess prediction confidence?

code:
# decision_function
[[-0.4187  0.7601  1.0747]
 [-0.3199  0.4868  0.6394]
 [-0.6452  0.5399  0.8732]
 [-1.178   0.7251  1.1044]
 [-0.6136  0.7484  0.7548]
 [-0.297  -0.2525 -0.3202]
 [-0.7869 -0.1434 -0.0221]
 [-0.1882 -0.445  -0.5064]
 [-0.4259 -0.6125 -0.6949]
 [ 1.0001  1.2336 -0.8605]]

# predict_proba
[[ 0.3483  0.367   0.2848]
 [ 0.3488  0.3503  0.3009]
 [ 0.3322  0.3735  0.2943]
 [ 0.3035  0.4102  0.2864]
 [ 0.3374  0.3688  0.2938]
 [ 0.3365  0.3229  0.3406]
 [ 0.3142  0.3539  0.3319]
 [ 0.338   0.3128  0.3493]
 [ 0.3238  0.3177  0.3585]
 [ 0.4303  0.2511  0.3186]]

# predict
[2 2 2 2 2 3 3 3 3 1]

QuarkJets
Sep 8, 2008

Dominoes posted:

Hey dudes. If you've used pandas, you may have noticed that working with DataFrames and Series is very slow compared to working with plain arrays. Do y'all know why Pandas doesn't have speedier ways to work with data? I've found I can get big performance boosts, while keeping a DataFrame's robust features, by creating arrays and then implementing conversion functions to do what I need. For example, indexing:

The Pandas developers would be the right ones to ask. Maybe try posting to or searching their github issue tracker; they likely have a tag for Performance issues

Nippashish
Nov 2, 2005

Let me see you dance!

Dominoes posted:

Looking for help understanding sklearn's SVC's confidence info: decision_function and predict_proba (docs). The results aren't matching up, and I can't find a pattern.

The relevant part of the documentation for decision_function is here: http://scikit-learn.org/stable/modules/svm.html#multi-class-classification.

Assuming you have not explicitly changed decision_function_shape, then the SVC class is actually training 3 different binary SVMs to distinguish 1v2, 1v3 and 2v3. decision_function is showing you the decision function for each of these models, and predict is showing the result of majority voting across the three different binary decisions. You can relate decision_function and predict like this:

code:
    1v2      1v3     2v3
[[-0.4187  0.7601  1.0747] -> [2 1 2] -> 2
 [-0.3199  0.4868  0.6394] -> [2 1 2] -> 2
 [-0.6452  0.5399  0.8732] -> [2 1 2] -> 2
 [-1.178   0.7251  1.1044] -> [2 1 2] -> 2
 [-0.6136  0.7484  0.7548] -> [2 1 2] -> 2
 [-0.297  -0.2525 -0.3202] -> [2 3 3] -> 3
 [-0.7869 -0.1434 -0.0221] -> [2 3 3] -> 3
 [-0.1882 -0.445  -0.5064] -> [2 3 3] -> 3
 [-0.4259 -0.6125 -0.6949] -> [2 3 3] -> 3
 [ 1.0001  1.2336 -0.8605]]-> [1 1 3] -> 1
(I took some guesses about signs here but everything matches so I assume I'm right).
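
If you want one decision-function column per class (so the columns line up with predict the way Dominoes expected), you can ask SVC for the one-vs-rest shape explicitly; a sketch using the iris data as a stand-in:

Python code:
from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target

clf = SVC(decision_function_shape='ovr')  # aggregate the pairwise votes into one score per class
clf.fit(X, y)

scores = clf.decision_function(X)  # shape (n_samples, n_classes)
print(clf.classes_[scores.argmax(axis=1)][:10])
print(clf.predict(X)[:10])  # should (almost always) agree with the line above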

Understanding predict_proba is much weirder. The relevant part of the documentation http://scikit-learn.org/stable/modules/svm.html#scores-and-probabilities says they use the method from this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf. If you trace through the code a bit you can figure out that section 8 of this document http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf describes the precise formulation that sklearn eventually calls. The short answer is there is probably not an intuitive way to map these back to predict and decision_function.

quote:

What's going on, and how can I assess prediction confidence?

If you really care about confidence then I would suggest not using SVMs. The formulation of SVMs explicitly doesn't care about encoding confidences in its decision surface, so all methods to get confidences out of them are necessarily post-hoc. Use something like random forests or kernelized logistic regression which can provide confidence scores without an auxiliary calibration model.
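
To give an idea of what that looks like in practice, here's a minimal random-forest sketch (the iris data as a stand-in for your problem; in real use you'd score held-out data, not the training set):

Python code:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)

proba = clf.predict_proba(X)  # shape (n_samples, n_classes); columns follow clf.classes_
print(proba[:5])
print(clf.predict(X)[:5])     # the argmax of each row of proba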

Nippashish fucked around with this message at 22:40 on Nov 24, 2016

Dominoes
Sep 20, 2007

Great reply; thanks!

Xeno54
Apr 25, 2004
bread

Dominoes posted:

Hey dudes. If you've used pandas, you may have noticed that working with DataFrames and Series is very slow compared to working with plain arrays. Do y'all know why Pandas doesn't have speedier ways to work with data? I've found I can get big performance boosts, while keeping a DataFrame's robust features, by creating arrays and then implementing conversion functions to do what I need. For example, indexing:

In general, loc and iloc should only be used if you want to select multiple row/column subsets of your DataFrame. If you just want to get a single value from a DataFrame you should use get_value or at:
code:
%timeit df.get_value(row1, col1)  # 2.72 µs per loop
%timeit df.at[row1, col1]  # 5.89 µs per loop
%timeit fast_df.loc(row1, col1)  # 560 ns per loop
Operating on the underlying numpy array will always be slightly faster than the pandas equivalent, as there's less overhead from pandas specific safeguards. However, if you "index a lot" by repeatedly getting single values from a DataFrame (or even a numpy array), you're almost certainly using a suboptimal pattern. If that's the case, you should probably be querying for all of the values you want at once using loc/iloc, via Boolean indexing or some other manner. It will still be slightly slower than the equivalent numpy method (not repeated calls for single values), but you won't be magnifying the pandas overhead, and your code will be more readable.
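
i.e., instead of a loop of single-value lookups, one vectorized query (the column names here are invented):

code:
wanted = df.loc[(df['price'] > 100) & (df['volume'] > 0), ['price', 'volume']]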

Also, note that there is some overhead in building your FastDf wrapper, e.g. for a (10**5, 10) sized DataFrame:
code:
%timeit FastDf(df)  # 20.4 ms per loop
So you could essentially do ~7500/3500 get_value/at lookups in the time it takes to create fast_df, and unless you have some really unique use case, you probably shouldn't need to query for single values that many times.

As a side note, pandas indexing should be faster whenever the new version of pandas with reimplemented internals (written in C++, decoupled from numpy) is released, although I doubt it will be any time soon.

Xeno54 fucked around with this message at 22:50 on Nov 25, 2016
