|
Eela6 posted:You want to use numpy / MATLAB style logical indexing. BTW, touching on your post before the edit, I believe that this is something the pandas devs do on purpose to make it better align with other data analysis platforms.
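For readers following along, a minimal sketch of the logical (boolean) indexing being described; the array values and the column name `x` are invented for illustration:

```python
import numpy as np
import pandas as pd

# numpy: a boolean mask array selects the matching elements
arr = np.array([1, 5, 2, 8, 3])
mask = arr > 2
print(arr[mask])  # [5 8 3]

# pandas uses the same idiom on a DataFrame column
df = pd.DataFrame({"x": [1, 5, 2, 8, 3]})
print(df[df["x"] > 2]["x"].tolist())  # [5, 8, 3]
```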
|
# ? Nov 19, 2016 00:13 |
|
|
That makes sense! It seems like a reasonable overload of those operators (they mean the first thing I would guess, which is generally a good sign for overloaded operators). I tried to do exactly that when I first switched from MATLAB to numpy. (Whenever I get heavy into numerics, I occasionally still find myself using container(key) instead of container[key]. At least I've beat zero-based indexing into my skull.)
|
|
# ? Nov 19, 2016 01:35 |
|
I am using the joblib library to run simulations in parallel. My code is structured as follows: Python code:
The code runs just fine when I set the number of cores to use to 1, however when it is set to anything more than 1 the code does not work. The reason for this seems to be that the simulation runs for the second policy still believe that cfg.policy_history is an empty list, even though the results were appended to cfg.policy_history after the first policy was done being simulated. Thoughts on how to overcome this issue?
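The actual code block didn't survive the quote, so here is a hedged sketch of the structure being described; the names `cfg`, `single_simulation_replication`, and `analyse_policy_results` come from the posts that follow, but the bodies are placeholders:

```python
from joblib import Parallel, delayed

class Config:
    def __init__(self):
        self.policy_history = []

cfg = Config()

# Placeholder body: the real replication reads values from the history.
# Passing the history in as an argument (rather than reading a module-level
# cfg inside the worker) is what makes this work across processes.
def single_simulation_replication(policy, history):
    return policy * 2 + len(history)

def analyse_policy_results(results):
    return sum(results) / len(results)

for policy in [1, 2, 3]:
    results = Parallel(n_jobs=2)(
        delayed(single_simulation_replication)(policy, cfg.policy_history)
        for _ in range(4)
    )
    cfg.policy_history.append(analyse_policy_results(results))

print(cfg.policy_history)  # [2.0, 5.0, 8.0]
```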
|
# ? Nov 19, 2016 01:57 |
|
I know nothing about joblib, but if it works anything like multiprocessing then it's creating forked jobs that don't have access to the memory of the other jobs. You can get around this (in multiprocessing anyway) in a few ways, an easy one being to create a shared memory array with some primitive type and some fixed size. If subsequent jobs need output from previous jobs then it doesn't make much sense to me that you're trying to run them in parallel. If you just want a running history of what each job did then that's something you can do with multiprocessing (by just having each job be a function that returns a thing, then you gather all of the things) QuarkJets fucked around with this message at 13:14 on Nov 19, 2016 |
# ? Nov 19, 2016 05:04 |
|
QuarkJets posted:I know nothing about joblib, but if it works anything like multiprocessing then it's creating forked jobs that don't have access to the memory of the other jobs. You can get around this (in multiprocessing anyway) in a few ways, an easy one being to create a shared memory array with some primitive type and some fixed size. I guess I didn't explain myself clearly. I have a for loop that iterates over different policies. For each policy in the for loop I want 25 simulation replications, where the code for a single replication looks up values stored in cfg.policy_history. Since each replication is independent, I am using multiprocessing (through joblib) to run the replications in parallel. So once I finish simulating the first policy I calculate some metrics and store them in cfg.policy_history. Then on the second policy (the second iteration of the for loop), the code for a single replication should be able to look up the values stored in cfg.policy_history that were stored from the first policy. The problem seems to be that, when I start the multiprocessing the second time, cfg.policy_history still seems to be the empty list that it was initialized to. So my question is, why is cfg.policy_history not updated when I start the second round of multiprocessing? EDIT: Here is the code in case that makes it clearer. The function single_simulation_replication is the one that contains the line that looks up values in cfg.policy_history, and both single_simulation_replication and analyse_policy_results are located in the same module as the for loop. Python code:
Jose Cuervo fucked around with this message at 16:39 on Nov 19, 2016 |
# ? Nov 19, 2016 16:31 |
|
axolotl farmer posted:I'm using pandas to update a column in a dataframe. What you're describing is just forward filling, but along the row axis, so you can use ffill: code:
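A quick sketch of forward-filling along the row axis (the data here is invented):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan, np.nan, 4.0],
                   [5.0, 6.0, np.nan, np.nan]])
filled = df.ffill(axis=1)  # carry the last seen value forward along each row
print(filled.values.tolist())  # [[1.0, 1.0, 1.0, 4.0], [5.0, 6.0, 6.0, 6.0]]
```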
|
# ? Nov 20, 2016 00:15 |
|
Jose Cuervo posted:I guess I didn't explain myself clearly. Pass in cfg.policy_history as an input to single_simulation_replication
|
# ? Nov 20, 2016 00:50 |
|
QuarkJets posted:Pass in cfg.policy_history as an input to single_simulation_replication Why does this work when what I had before didn't?
|
# ? Nov 21, 2016 04:30 |
|
Multiprocessing (which joblib looks like it uses) creates separate processes for each worker, and they use separate memory from the main process that's creating them. So your worker processes can't see that list, it's created and manipulated in the main process You have a few options for giving them access - passing it as an argument to the process function creates a copy for them, you can set up message-passing so processes can talk to each other, and you can mess around with shared memory too. Depending on how much data you're working with the 'pass it as an argument' approach might be fine Parallel processing is awkward, basically. There's a lot to trip you up, and tools you need to get around that (that's my understanding of it anyway, I don't python on this level) baka kaba fucked around with this message at 07:53 on Nov 21, 2016 |
# ? Nov 21, 2016 07:51 |
|
Thanks a lot for your help
|
# ? Nov 21, 2016 13:33 |
|
baka kaba posted:Multiprocessing (which joblib looks like it uses) creates separate processes for each worker, and they use separate memory from the main process that's creating them. So your worker processes can't see that list, it's created and manipulated in the main process OK, makes sense. Thanks. Next question: I am looking at the code found in the accepted answer to this question http://stackoverflow.com/questions/32791911/fast-calculation-of-pareto-front-in-python Python code:
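The linked code didn't survive the quote, so here is a hedged, simplified Pareto cull in the same spirit (not the exact Stack Overflow answer); "lower is better" on every axis is assumed:

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    # keep only the points that no other point dominates
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 4), (2, 2), (3, 1), (2, 3), (4, 4)]
print(pareto_front(pts))  # [(1, 4), (2, 2), (3, 1)]
```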
|
# ? Nov 21, 2016 17:54 |
Jose Cuervo posted:OK, makes sense. Implementation note: According to Luciano Ramalho in Fluent Python, within CPython, explicitly passing the function is ever-so-slightly faster because then the passed function is in the local namespace, which speeds up lookup on the interpreter's end. However, the speed gains are negligible in almost every case. As a matter of style, if I wanted to make it clear that dominates is within cull, I would make it a subfunction of cull. This inherits the speed gains of explicit passing but, again, these are not important. i.e., Python code:
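A sketch of the nesting style being described, with hypothetical minimal bodies: `dominates` becomes a closure inside `cull`, so the relationship is explicit and the name is looked up in the enclosing scope rather than the module globals:

```python
def cull(points):
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and a != b

    # keep only the points no other point dominates
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

print(cull([(1, 2), (2, 3), (0, 0)]))  # [(0, 0)]
```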
|
|
# ? Nov 21, 2016 19:52 |
|
Eela6 posted:If you can eliminate cache misses in the 'hot' part of your code
|
# ? Nov 21, 2016 21:14 |
|
Eela6 posted:As a matter of style, if I wanted to make it clear that dominates is within cull, I would make it a subfunction of cull. This inherits the speed gains of explicit passing but, again, these are not important. quote:RE: your multiprocessing question. Are you sure you need to use multiprocessing? Have you profiled your code? For a single desktop or laptop, parallelization is generally the 'last gasp' of optimization. You get, at best, eight times the performance. For example, If you can eliminate cache misses in the 'hot' part of your code you can gain 100-400x speed without needing to muck around with multiprocessing. However, I too am interested in a) what a cache miss is, and b) how I would find and eliminate them in my code. Jose Cuervo fucked around with this message at 21:30 on Nov 21, 2016 |
# ? Nov 21, 2016 21:25 |
|
So I'm pretty new to coding in general but I started learning Python a couple weeks ago. I've gone through the Codecademy course and I'm nearly done with Learn Python the Hard Way. So far it's been pretty easy and understandable. I just don't really know where to go from here, either in terms of developing my skills with Python or as a programmer in general. Any advice?
|
# ? Nov 21, 2016 23:30 |
|
Kit Walker posted:So I'm pretty new to coding in general but I started learning Python a couple weeks ago. I've gone through the Codecademy course and I'm nearly done with Learn Python the Hard Way. So far it's been pretty easy and understandable. I just don't really know where to go from here, either in terms of developing my skills with Python or as a programmer in general. Any advice? Why are you learning to code? After Codecademy and LPtHW, you should have the skills to start taking a crack at building whatever it was you hoped to build. Start on that project.
|
# ? Nov 21, 2016 23:57 |
|
Kit Walker posted:So I'm pretty new to coding in general but I started learning Python a couple weeks ago. I've gone through the Codecademy course and I'm nearly done with Learn Python the Hard Way. So far it's been pretty easy and understandable. I just don't really know where to go from here, either in terms of developing my skills with Python or as a programmer in general. Any advice? Make a twitter bot which spews garbage at celebrities. Extra points for having people respond without knowing it's a bot.
|
# ? Nov 22, 2016 00:10 |
|
Tigren posted:Why are you learning to code? After Codecademy and LPtHW, you should have the skills to start taking a crack at building whatever it was you hoped to build. Start on that project. I actually had no real vision in mind. I just kinda wanted to learn to code and learn more about the inner workings of computers and the internet and see where it takes me. If I can develop my skills to the point that I can actually do it for a living that would be a nice bonus Boiled Water posted:Make a twitter bot which spews garbage at celebrities. Extra points for having people respond without knowing it's a bot. lol, why not? That's something I can work towards
|
# ? Nov 22, 2016 00:34 |
Jose Cuervo posted:I have never thought about structuring code this way, but this makes a lot of sense (especially given how small the function dominates is). A: I do not have a formal computer science or engineering background (I did math), so I would appreciate if someone with a stronger understanding of hardware could give a better explanation. I will try, though: In an extremely general sense, a cache miss is when your processor tries to access data in the (extremely fast) L1 cache but can't find it. It then has to look in a higher-level cache. If it's not in that cache, it has to look in a higher-level cache... and if it's not in any of them, it then has to access RAM (which is very slow in comparison). Just like reading data from a hard drive is very slow compared to RAM, reading data from RAM is slow compared to the cache. B: This is the subject of a small talk I am going to give at my local python developers' meeting. Once I've finished my research and slides, I will present it here, too! But to give an idea of the basics, you can avoid cache misses by structuring your code to use memory more efficiently. Basically, you want to be able to do your work without constantly loading things into and out of the cache. This means avoiding unnecessary data structures & copying. As an extremely general rule of thumb, every place where you are using return when you could be using yield is a chance to use memory more efficiently. Generators, coroutines, and functional-style programming are your friends, and they are often appropriate for simulations. (Not every function should be replaced by a generator, and not every list comprehension should be a generator expression. But you would be surprised how many can and should.) Even more importantly, you have to know what the slow part of your code is before you bother spending time optimizing. This is what profiling is for. First get your code to work, then find out whether it's fast enough.
If it's too slow, find out why and where. Oftentimes a small subsection of your code takes 95%+ of execution time. If you can optimize THAT part of your code, you are done. It's easy to spend a dozen man-hours 'optimizing' something that takes <1% of runtime. Don't do that. I am not an expert. There are many great PyCon talks on code profiling & optimization by experts in the field that can give you better instruction than I can. The talk I'm planning to give is basically just cherry-picking bits and pieces from these excellent talks. As a starting place, this talk is a little long but an excellent overview of the topic of speed in Python. https://www.youtube.com/watch?v=JDSGVvMwNM8 Eela6 fucked around with this message at 00:45 on Nov 22, 2016 |
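To make the return-versus-yield point concrete, a small invented example: the generator streams values one at a time instead of materializing a million-element list, so the working set stays tiny. (The stdlib's cProfile module is the usual starting point for the profiling step mentioned above.)

```python
import sys

def squares_list(n):
    return [i * i for i in range(n)]  # builds the whole list in memory

def squares_gen(n):
    for i in range(n):
        yield i * i  # produces one value at a time

n = 10**6
assert sum(squares_gen(n)) == sum(squares_list(n))
# the generator object is a few dozen bytes; the list is megabytes
print(sys.getsizeof(squares_gen(n)) < sys.getsizeof(squares_list(n)))  # True
```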
|
# ? Nov 22, 2016 00:36 |
|
Kit Walker posted:I actually had no real vision in mind. I just kinda wanted to learn to code and learn more about the inner workings of computers and the internet and see where it takes me. If I can develop my skills to the point that I can actually do it for a living that would be a nice bonus Now of course it's easier said than done to come up with things when you've just started. But here are some ideas 1. Build a website that lets you list junk you want to give away and allows people to claim it. 2. Make an app that suggests drinks based on your current Spotify artist, ala http://drinkify.org/ but real time. Maybe hook it up to http://www.thecocktaildb.com/? 3. Build a text analyzer that reads text in and spits out some sort of analysis--maybe you want to see how positive and negative Pitchfork reviews are, or whatever.
|
# ? Nov 22, 2016 01:10 |
|
Here's a question about what people prefer. Suppose you were looking at a database interface that claimed to be Pythonic, and you wanted to query a table for a value. Which syntax would be preferable, in your mind, for a simple SQL "SELECT * FROM table WHERE column = value"? code:
code:
code:
Just assume that it also supports DB-API.
|
# ? Nov 22, 2016 02:03 |
|
Python code:
Python code:
|
# ? Nov 22, 2016 02:27 |
Eela6 posted:
Is this line common Python style? I've never seen that before and it took me a minute or two to puzzle through what it does.
|
|
# ? Nov 22, 2016 09:21 |
It is bad style. Whether it's a common style is something you'll have to ask the Coding Horrors thread.
|
|
# ? Nov 22, 2016 16:12 |
|
What if it were (new_remaining if dominates(candidate, other) else dominated).append(other)
|
# ? Nov 22, 2016 16:19 |
Ghost of Reagan Past posted:Here's a question about what people prefer I prefer numpy style. I.e, Python code:
Eela6 fucked around with this message at 17:25 on Nov 22, 2016 |
|
# ? Nov 22, 2016 17:17 |
|
Eela6 posted:I really dislike chained brackets. I come from a weird programming background, though, and I've, uh, never actually used SQL, so I might not even understand what that query means
|
# ? Nov 22, 2016 18:52 |
|
SurgicalOntologist posted:
|
# ? Nov 22, 2016 21:11 |
|
Hammerite posted:What if it were You got it backwards. Python code:
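For anyone else puzzling over the order: indexing a pair with a bool relies on False == 0 and True == 1, so the "false" target must come first. A toy reconstruction (the surrounding names are from the thread; the values and the dominates body are invented):

```python
def dominates(a, b):
    return all(x <= y for x, y in zip(a, b)) and a != b

dominated, new_remaining = [], []
candidate, other = (1, 1), (2, 2)

# equivalent to: (dominated if dominates(candidate, other) else new_remaining).append(other)
(new_remaining, dominated)[dominates(candidate, other)].append(other)
print(dominated)  # [(2, 2)] -- candidate dominates other, so other is culled
```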
|
# ? Nov 23, 2016 08:21 |
|
Suspicious Dish posted:You got it backwards. Well I guess the fact that I hosed it up is evidence it shouldn't be used. Or at least evidence it shouldn't be used by me.
|
# ? Nov 23, 2016 12:11 |
|
What Python Debugger do you guys use for Linux?
|
# ? Nov 23, 2016 16:16 |
|
huhu posted:What Python Debugger do you guys use for Linux? PyCharm or ipdb.
|
# ? Nov 23, 2016 16:59 |
|
Thermopyle posted:PyCharm or ipdb. Awesome, thanks. New question: is there a cleaner way to write this? [code]
columns_int = self.columns_int[:]
columns_int.remove(self.column_with_id_int)
self.columns_int_no_id = columns_int
[/code]
|
# ? Nov 23, 2016 20:25 |
|
huhu posted:Awesome Thanks. Assuming these are lists, self.columns_int_no_id = [col for col in self.columns_int if col != self.column_with_id_int] (you might need to express the if-condition in some other way, depending on how exactly columns are represented and whether comparisons work on them)
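Side by side, the copy-and-remove original and the comprehension, with invented values:

```python
columns_int = [0, 1, 2, 3]
column_with_id_int = 2

# original approach: copy the list, then remove the id column in place
copied = columns_int[:]
copied.remove(column_with_id_int)

# comprehension: build the filtered list in one expression
filtered = [col for col in columns_int if col != column_with_id_int]

print(copied, filtered)  # [0, 1, 3] [0, 1, 3]
```

One behavioral difference worth knowing: .remove drops only the first occurrence and raises ValueError if the value is missing, while the comprehension drops every occurrence and never raises.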
|
# ? Nov 23, 2016 21:59 |
|
Hey dudes. If you've used pandas, you may have noticed that working with DFs and series' is very slow compared to working with arrays. Do y'all know why Pandas doesn't have speedier ways to work with data? I've found I can get big performance boosts, while maintaining DF's robust features by creating arrays, then implementing conversion functions to do what I need. For example, indexing:Python code:
Here's a class to make the syntax better, and optionally get rid of PD's timestamps. Caveat; while this and a native DF's loc methods show the same performance boost as above, they're both slower than the direct indexing approaches above. Python code:
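The posted class didn't survive the quote, but a hedged, minimal version of the idea looks like this: pull `.values` out once, build label-to-position dicts, then index the ndarray directly:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000, 4), columns=list("abcd"))

# one-time conversion: ndarray plus label -> position lookup tables
values = df.values
row_pos = {label: i for i, label in enumerate(df.index)}
col_pos = {label: j for j, label in enumerate(df.columns)}

def fast_lookup(row, col):
    # plain ndarray indexing, skipping pandas' per-call overhead
    return values[row_pos[row], col_pos[col]]

print(fast_lookup(3, "b") == df.loc[3, "b"])  # True
```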
Dominoes fucked around with this message at 15:53 on Nov 24, 2016 |
# ? Nov 24, 2016 15:24 |
|
Looking for help understanding sklearn's svc's confidence info: decision_function, and predict_proba. docs. The results aren't matching up, and I can't find a pattern. For example, the first prediction is 2. decision_func shows the highest value in its third col (what?), and predict_proba in its second (correct?). The fourth from bottom predicts 3. decision_func shows the highest in the third col (correct?) and predict_proba in its second (what?). What's going on, and how can I assess prediction confidence? code:
|
# ? Nov 24, 2016 17:28 |
|
Dominoes posted:Hey dudes. If you've used pandas, you may have noticed that working with DFs and series' is very slow compared to working with arrays. Do y'all know why Pandas doesn't have speedier ways to work with data? I've found I can get big performance boosts, while maintaining DF's robust features by creating arrays, then implementing conversion functions to do what I need. For example, indexing: The Pandas developers would be the right ones to ask. Maybe try posting to or searching their github issue tracker; they likely have a tag for Performance issues
|
# ? Nov 24, 2016 20:53 |
|
Dominoes posted:Looking for help understanding sklearn's svc's confience info: decision_function, and predict_proba. docsThe results aren't matching up, and I can't find a pattern. The relevant part of the documentation for decision_function is here: http://scikit-learn.org/stable/modules/svm.html#multi-class-classification. Assuming you have not explicitly changed decision_function_shape, then the SVC class is actually training 3 different binary SVMs to distinguish 1v2, 1v3 and 2v3. decision_function is showing you the decision function for each of these models, and predict is showing the result of majority voting across the three different binary decisions. You can relate decision_function and predict like this: code:
Understanding predict_proba is much weirder. The relevant part of the documentation http://scikit-learn.org/stable/modules/svm.html#scores-and-probabilities says they use the method from this paper: http://www.csie.ntu.edu.tw/~cjlin/papers/svmprob/svmprob.pdf. If you trace through the code a bit you can figure out that section 8 of this document http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf describes the precise formulation that sklearn eventually calls. The short answer is there is probably not an intuitive way to map these back to predict and decision_function. quote:What's going on, and how can I assess prediction confidence? If you really care about confidence then I would suggest not using SVMs. The formulation of SVMs explicitly doesn't care about encoding confidences in its decision surface, so all methods to get confidences out of them are necessarily post-hoc. Use something like random forests or kernelized logistic regression which can provide confidence scores without an auxiliary calibration model. Nippashish fucked around with this message at 22:40 on Nov 24, 2016 |
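The snippet relating the two was lost in quoting, so here is a hedged reconstruction of the vote counting for the 3-class one-vs-one case (iris stands in for the original data; sklearn breaks vote ties with the pairwise confidences, so this matches predict only up to tie-breaking):

```python
import numpy as np
from sklearn import datasets, svm

X, y = datasets.load_iris(return_X_y=True)
clf = svm.SVC(decision_function_shape="ovo").fit(X, y)

d = clf.decision_function(X[:5])  # columns are the pairwise decisions 0v1, 0v2, 1v2
pairs = [(0, 1), (0, 2), (1, 2)]
votes = np.zeros((len(d), 3), dtype=int)
for col, (i, j) in enumerate(pairs):
    # a positive pairwise score is a vote for the first class of the pair
    for row, win in enumerate(np.where(d[:, col] > 0, i, j)):
        votes[row, win] += 1

print((votes.argmax(axis=1) == clf.predict(X[:5])).all())  # True
```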
# ? Nov 24, 2016 22:38 |
|
Great reply; thanks!
|
# ? Nov 25, 2016 00:23 |
|
|
|
Dominoes posted:Hey dudes. If you've used pandas, you may have noticed that working with DFs and series' is very slow compared to working with arrays. Do y'all know why Pandas doesn't have speedier ways to work with data? I've found I can get big performance boosts, while maintaining DF's robust features by creating arrays, then implementing conversion functions to do what I need. For example, indexing: In general, loc and iloc should only be used if you want to select multiple row/column subsets of your DataFrame. If you just want to get a single value from a DataFrame you should use get_value or at: code:
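A tiny check of the scalar accessor mentioned above, with made-up data (get_value is the older spelling, deprecated in newer pandas; at does the same job):

```python
import pandas as pd

df = pd.DataFrame({"a": [10, 20], "b": [30, 40]}, index=["x", "y"])

# `at` fetches one scalar by labels, with less overhead than `loc`
print(df.at["y", "b"], df.loc["y", "b"])  # 40 40
```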
Also, note that there is some overhead to build your DataFrame implementation, e.g. for a (10**5, 10) sized DataFrame: code:
As a side note, pandas indexing should be faster whenever the new version of pandas with reimplemented internals (written in c++, decoupled from numpy) is released, although I doubt it will be any time soon. Xeno54 fucked around with this message at 22:50 on Nov 25, 2016 |
# ? Nov 25, 2016 22:41 |