|
Seventh Arrow posted:I've been learning about scopes and namespaces (and maybe not paying enough attention!), so I'm trying to understand how this works: The 3rd statement passes the function called "add_tip" to the other function. The other function receives it and executes it, binding the result to a variable called "total". This is called "first-class functions" or "higher-order functions", or maybe some other things. e: adding quote for new page, and i have attempted to shittily mspaint this for you: the key insight here is that a function is just an object like anything else in python, which means you can pass it around to other functions so it can be called later 12 rats tied together fucked around with this message at 05:20 on Aug 12, 2022 |
# ? Aug 12, 2022 05:05 |
|
|
|
Also, the fact that there is a variable with the same name in both functions doesn't matter; they are independent of each other. E.g. "total" from "total_bill" could've been named "bill" or something and nothing would've changed. What's important is that the "total" in "total_bill" gets bound to whatever "add_tip(100)" returns, which is then returned and printed.
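A minimal reconstruction of the example being discussed — the function bodies and the flat $20 tip are assumptions, only the names come from the thread:

```python
def add_tip(amount):
    return amount + 20       # assumed: add a flat $20 tip

def total_bill(tip_function):
    # the function object itself was passed in; it's called here
    total = tip_function(100)
    return total

print(total_bill(add_tip))   # 120
```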
|
# ? Aug 12, 2022 05:29 |
|
Ok thanks, I think I see - so it's kind of a cross-pollination that's going on. I will definitely look up higher order functions, since my search terms didn't yield great results.
|
# ? Aug 12, 2022 06:54 |
|
Biffmotron posted:I'm not entirely following what the code does. I get that you're choosing one of option_X from probability distribution prob_Y, but you define draw_options() with arguments and then don't pass anything to them. Foxfire_ posted:Can you set up some smaller dummy thing that duplicates the problem? The previous examples weren't the best. Here's an example that actually runs and duplicates the problem. Method 1 takes more than twice as long to run as method 2. Python code:
Python code:
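A sketch of the kind of comparison being described — the option names and probabilities are made up, and the assumption (supported by the reply below) is that the two methods differ only in which choice function gets called per draw:

```python
import random
import timeit
import numpy as np

options = ["option1", "option2", "option3"]
probs = [0.2, 0.3, 0.5]

def method1(n=1000):
    # np.random.choice carries heavy per-call overhead for single draws
    return [np.random.choice(options, p=probs) for _ in range(n)]

def method2(n=1000):
    # random.choices is much lighter when asked for one draw at a time
    return [random.choices(options, weights=probs)[0] for _ in range(n)]

print(timeit.timeit(method1, number=10))
print(timeit.timeit(method2, number=10))
```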
|
# ? Aug 12, 2022 07:58 |
|
Mursupitsku posted:The previous examples weren't the best. Here's an example that actually runs and duplicates the problem. Method 1 takes more than twice as long to run as method 2. Biffmotron posted:I'm not entirely following what the code does. I get that you're choosing one of option_X from probability distribution prob_Y, but you define draw_options() with arguments and then don't pass anything to them. It's definitely just coming down to the choice of choice function
|
# ? Aug 12, 2022 08:41 |
|
My first thought when I see an iterative problem like yours that needs to go faster is "Is there some way we can turn this into a matrix manipulation problem?" Imagine you have an options matrix
pre:
1,    2,    3,    4
5,    6,    7,    8
...
1001, 1002, 1003, 1004
and a boolean mask marking each row's chosen option
pre:
False, False, True,  False
False, False, False, True
True,  False, False, False
Python code:
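Filling in that idea as a runnable sketch — the shapes and values are illustrative, not the original code:

```python
import numpy as np

n_rows, n_cols = 1001, 4
# options matrix: row i holds the candidate options for trial i
options = np.arange(1, n_rows * n_cols + 1).reshape(n_rows, n_cols)

# boolean mask with exactly one True per row marking the drawn option
chosen_col = np.random.randint(0, n_cols, size=n_rows)
mask = np.zeros((n_rows, n_cols), dtype=bool)
mask[np.arange(n_rows), chosen_col] = True

chosen = options[mask]   # one chosen option per row, selected in one shot
```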
|
# ? Aug 12, 2022 18:02 |
|
A less dramatic rearranging. This is trading memory for time by getting rid of python objects and python computation. 1x python saying "Numpy, please generate me 400 choices using these probabilities" is much faster than 400x python saying "Numpy/random.choices, please generate me 1 choice using these probabilities" Python code:
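The batching idea as a sketch (option names and probabilities assumed):

```python
import numpy as np

options = np.array(["option1", "option2", "option3"])
probabilities = [0.2, 0.3, 0.5]

# One call asking numpy for 400 choices at once...
draws = np.random.choice(options, size=400, p=probabilities)

# ...instead of 400 separate one-choice calls:
# draws = [np.random.choice(options, p=probabilities) for _ in range(400)]
```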
Python is very, very slow. If you're doing numerical computing, you want to make sure as little implemented-in-python code as possible runs. e: the matrix thing Biffmotron is doing is essentially the same idea. Doing it like that is theoretically worse/slower (needs more scratch RAM, worse cache locality), but it moves the bulk of the executing code from python to numpy-implemented-in-C and that time savings more than makes up for doing the calculation suboptimally. The best-in-abstract way to do it would be to do it like you originally had it with loops that generate, use, and discard state as soon as possible so that the state is most likely to fit in cache, but python-slowness and diffuseness (PyObjects are individually allocated on the heap and aren't necessarily close to each other for caching) outweighs that Foxfire_ fucked around with this message at 22:50 on Aug 12, 2022 |
# ? Aug 12, 2022 20:41 |
|
Thanks to both Foxfire_ and Biffmotron for the help. I have been playing around with the code snippets given and they make a lot of sense to me. I was vaguely aware that going full numpy would be a lot faster but it is still a bit out of my realm of knowledge. However I haven't been able to implement the solutions to my actual simulation yet. I've also started to wonder if the whole simulation could be turned into a matrix manipulation problem and made faster as a whole? Below is a running and simplified example of the simulation I'm running. It's a simulation of a game that is played until either player reaches 11 rounds won. If both players reach 10, an overtime is played repeatedly until either player reaches 4 overtime rounds won. At the start of each round both players choose an option that heavily alters the win probabilities of the round. The probability of which option is chosen depends on the player's previous round's option and the outcome of the 2 previous rounds. Maybe the code also explains why I'm struggling to implement the solutions presented. Python code:
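A stripped-down skeleton of the match loop being described — the real win probabilities depend on both players' options and the last two outcomes, which a flat 0.5 stands in for here:

```python
import random

def simulate_overtime():
    # First to 4 overtime rounds wins; 0.5 stands in for the real
    # option-dependent round win probability.
    o1 = o2 = 0
    while o1 < 4 and o2 < 4:
        if random.random() < 0.5:
            o1 += 1
        else:
            o2 += 1
    return o1 > o2               # True if player 1 wins

def simulate_game():
    p1 = p2 = 0
    while p1 < 11 and p2 < 11:
        if p1 == 10 and p2 == 10:   # 10-10: the match goes to overtime
            return simulate_overtime()
        if random.random() < 0.5:   # stand-in round win probability
            p1 += 1
        else:
            p2 += 1
    return p1 > p2

print(sum(simulate_game() for _ in range(10_000)) / 10_000)
```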
Mursupitsku fucked around with this message at 20:45 on Aug 14, 2022 |
# ? Aug 14, 2022 20:42 |
|
If the probabilities, choices, and round end depend on previous rounds, that gets a lot harder to solve as a matrix manipulation. One thing to consider is balancing developer time versus computer time. Your time is expensive (even if this is hobby programming, you could be doing something else) and computer time is cheap, up to a point. Getting a task that scales in O(n^3) to behave efficiently is worthwhile; eking out performance gains on an O(n) problem probably isn't, intrinsic value of learning something aside. But it's more interesting than cleaning my kitchen, which is what I'd be otherwise doing today. One simple improvement is to use multiprocessing Pools to calculate on all CPU cores at once, instead of just one core. To make this work, I rewrote the interior part of the loop as a function, simulate_match(), which takes an unused parameter x, because Pool.map expects a function and a list of inputs to be parallelized. There are probably more elegant ways to do this, but this is one I know. I ran this with a CPU monitoring program open and I could actually see activity spike across all cores with the Pool enabled, as compared to just one core running single-threaded. And as an aside, Pool and jupyter notebooks don't play nicely together, but this will run in a terminal or VSCode. Python code:
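A sketch of the Pool.map pattern described — simulate_match's body here is a stand-in for the real simulation:

```python
import random
from multiprocessing import Pool

def simulate_match(x):
    # x is an unused placeholder; Pool.map wants a function plus an iterable
    return random.random() < 0.5   # stand-in for one full simulated match

if __name__ == "__main__":
    with Pool() as pool:                      # one worker per CPU core
        results = pool.map(simulate_match, range(10_000))
    print(sum(results) / len(results))        # fraction of player-1 wins
```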
|
# ? Aug 14, 2022 23:23 |
|
Mursupitsku posted:Thanks to both Foxfire_ and Biffmotron for the help. I have been playing around with the code snippets given and they make a lot of sense to me. I was vaguely aware that going full numpy would be a lot faster but it is still a bit out of my realm of knowledge. Foxfire_ is recommending that you perform several random draws at a time. A common performance trick is to try to eliminate Python for and while loops as much as possible, deferring to numpy vectorization, numba loops, and comprehensions for much faster iteration. An easy optimization here would be to pre-generate your draws and then just iterate over them. Let's start by looking at your overtime scenario. Here's a copy-paste of just that block, but I added 2 lines at the start to set the scores to 10 so that every game immediately goes to overtime: Python code:
1. Set both player overtime scores to 0 2. Set both player options to "option4" 3. Generate 6 outcomes. If player1 wins 3 times, repeat this step. 4. Compare the player1 and player2 scores to determine a winner. The probability is static; both options are always set to "option4". Therefore, you could pregenerate some large number of overtime outcomes, and then just draw from that pool of results as-needed. Let's put all of this logic into its own block of code. Then, let's pregenerate a ton of results and just draw from those as-needed. Consider this generator: Python code:
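A sketch of the generator being described — 0.5 stands in for the static option4-vs-option4 win probability, whose real value comes from the simulation's probability table:

```python
import numpy as np

def overtime_outcomes(n=200_000):
    p_win = 0.5   # stand-in for the real static win probability
    # n x 6 random draws compared against p_win -> per-round win/loss,
    # then summed along the 6-round axis to get wins per overtime block
    wins_per_block = (np.random.random((n, 6)) < p_win).sum(axis=1)
    for wins in wins_per_block:
        if wins == 3:        # a 3-3 tie means the overtime repeats; skip it
            continue
        yield wins > 3       # True: player 1 took the overtime

ot = overtime_outcomes()
player1_won = next(ot)       # one pregenerated overtime result per call
```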
1. Fetch our static win probability 2. Generate a 200k x 6 array of random numbers (why 200k? For safety - we need at least 20k results because right now I'm forcing every game into overtime, and any time the overtime result is a 3-3 tie we need to skip that result, so 200k seems plenty safe) 3. Compare that array with the win probability, converting the array to a 200k x 6 array of win/loss (True or False) states 4. Sum along the length-6 axis (axis 1) to get a 200k-length array of the number of wins. This will be between 0 and 6. 5. Iterate over this array, skipping any results where the outcome is exactly 3 (meaning we'd just repeat the overtime run) 6. In each successful iteration, yield whether the number of wins exceeds half of the number of tries (i.e. greater than 3). If the result is greater than 3, then player 1 wins, so we are returning True; otherwise we are returning False. Since this is a generator, all of the logic before the loop only occurs for the first call, and each call thereafter the next element in the array is provided. We can create an instance pointing to the generator and then use next() to fetch elements from it. Using this generator: Python code:
You might say "well I probably won't go into overtime every time." That's true. So instead of 200k outcomes, maybe you just generate 1k at a time, and do so forever. Here's an example of that implementation: Python code:
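A sketch of that chunked version — same stand-in 0.5 probability, but it generates 1k results at a time and never runs out:

```python
import numpy as np

def overtime_outcomes_forever(chunk=1000):
    p_win = 0.5   # stand-in for the real static win probability
    while True:   # regenerate a fresh chunk whenever the last one is used up
        wins = (np.random.random((chunk, 6)) < p_win).sum(axis=1)
        for w in wins:
            if w != 3:         # skip 3-3 ties (the overtime would repeat)
                yield w > 3

ot = overtime_outcomes_forever()
results = [next(ot) for _ in range(5000)]   # spans several chunks, never exhausts
```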
That's just optimizing the overtime portion, by pregenerating overtime results. Now consider the rest of your code. Without actually doing the work for you, can you repeat what was done here to pregenerate results for the rest of your code? You have a set of player1WinProbability values that are a function of player1 options, player2 options, and outcome_of_last_2_rounds. Since a generator can accept arguments, you could define 1 generator that yields outcomes for each of those scenarios and then create however many instances of this generator that you'll need. QuarkJets fucked around with this message at 00:59 on Aug 15, 2022 |
# ? Aug 15, 2022 00:56 |
|
Following the example given by QuarkJets I came up with the code below. It's about twice as fast as the original example! Is there still something obvious that could be improved? Also I'm getting a slightly different result from the code below compared to the original (~0.43 vs ~0.4) and I can't figure out why. Python code:
|
# ? Aug 15, 2022 12:28 |
|
Is there a reason why I'd get "IndexError: single positional indexer is out-of-bounds" when I'm referencing df.iloc[0][0] in a small table? I have a bunch of multi-day jobs running on parallel Jupyter notebooks on a server and this randomly happens to some of the jobs. It's not an issue with the input data because if I restart the job it doesn't happen again (at least not for a little while). Here's the relevant bit of code: Python code:
Josh Lyman fucked around with this message at 21:35 on Aug 17, 2022 |
# ? Aug 17, 2022 19:33 |
|
You probably want .iloc[0, 0] instead of [0][0] - since you rename the columns away from 0 and 1? If this only happens randomly, have you checked if len(mydf1) and len(mydf2) are > 0?
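A minimal illustration of both suggestions, on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame(columns=["a", "b"])   # empty frame: no rows at all
# df.iloc[0][0]  # would raise IndexError: single positional indexer is out-of-bounds

if len(df) > 0:                  # guard against the empty case first
    value = df.iloc[0, 0]        # preferred over chained df.iloc[0][0]
```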
dorkanoid fucked around with this message at 20:11 on Aug 17, 2022 |
# ? Aug 17, 2022 20:08 |
|
dorkanoid posted:You probably want .iloc[0, 0] instead of [0][0] - since you rename the columns away from 0 and 1? If this only happens randomly, have you checked if len(mydf1) and len(mydf2) are > 0? It just happened with one of the notebooks. It appears inputdata is empty, which is a nested list that comes from an API call. I'm not sure there's a good solution. edit: I wonder if this is the result of me running 10 concurrent notebooks, so that when too many of them make the API call at the same time, the database doesn't respond properly. Josh Lyman fucked around with this message at 21:59 on Aug 17, 2022 |
# ? Aug 17, 2022 21:28 |
|
It sounds like there's a problem with the input data; you can just check that it's not empty prior to doing something with it
|
# ? Aug 17, 2022 21:47 |
|
I got this off of LinkedIn, thought it was neat: quote:You can literally convert a number from one numeral system to another using f-strings.
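The trick in question — f-string format specs convert between bases:

```python
n = 255
print(f"{n:b}")    # 11111111  (binary)
print(f"{n:o}")    # 377       (octal)
print(f"{n:x}")    # ff        (hexadecimal)
print(f"{n:#x}")   # 0xff      (hex with prefix)
print(int("ff", 16))   # 255, converting back the other way
```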
|
# ? Aug 18, 2022 00:58 |
|
Pandas question: If I have a dataframe, what's a good way to get the column name and its associated datatype in a list of lists and in a printable format? code:
quote:[('Zipcode', Int), ('FirstName', String), ('LastName', String)]
|
# ? Aug 18, 2022 01:52 |
|
Hughmoris posted:Pandas question: You can have columns that all have the same data type, or you can have mixed data types in a column, in which case the column will show as an object. df.dtypes https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html will give you a Series: code:
code:
list(zip(df.columns, df.dtypes.apply(str).values)) CarForumPoster fucked around with this message at 02:48 on Aug 18, 2022 |
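Put together on a hypothetical frame matching Hughmoris's example:

```python
import pandas as pd

df = pd.DataFrame({"Zipcode": [90210],
                   "FirstName": ["Ada"],
                   "LastName": ["Lovelace"]})
# pair each column name with its dtype rendered as a string
pairs = list(zip(df.columns, df.dtypes.apply(str).values))
print(pairs)   # e.g. [('Zipcode', 'int64'), ('FirstName', 'object'), ('LastName', 'object')]
```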
# ? Aug 18, 2022 02:43 |
|
QuarkJets posted:It sounds like there's a problem with the input data, you can just check that it's not empty prior to doing something with it
|
# ? Aug 18, 2022 02:47 |
|
Status report: I found out that map and filter can be combined! There was an exercise for the filter function using lambda where the goal was to take a list of names and filter out the ones starting with the letter 'm'. Their solution was as follows: code:
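A plausible reconstruction of that solution — the exercise's actual names list is assumed:

```python
names = ["mary", "anna", "mateo"]   # hypothetical exercise input
# keep only names that don't start with 'm'
result = list(filter(lambda name: not name.startswith("m"), names))
print(result)   # ['anna']
```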
Looking at it, I started wondering if there was a way to instead convert everything to lowercase and then filter out the 'm' names and after a bit of googling, this is what I came up with: code:
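The lowercase-then-filter combination, on an assumed mixed-case list:

```python
names = ["Mary", "mike", "Anna", "Mateo"]
# map runs first, feeding lowercased names into filter
result = list(filter(lambda name: not name.startswith("m"),
                     map(str.lower, names)))
print(result)   # ['anna']
```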
I'm a bit confused as to why it processes the map first and the filter second, though. Also, I wonder if it's a bit pointless. I was just doing it to experiment, but my 'solution' seems to use as much code as the original.
|
# ? Aug 18, 2022 04:53 |
|
Seventh Arrow posted:Status report: I found out that map and filter can be combined! That's an order of operations thing; filter is the outermost function, so it completes last I've never developed any appreciation for the filter function. I like list comprehensions: Python code:
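The same filtering as a list comprehension (assumed names list):

```python
names = ["Mary", "mike", "Anna", "Mateo"]
result = [name.lower() for name in names
          if not name.lower().startswith("m")]
print(result)   # ['anna']
```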
QuarkJets fucked around with this message at 05:20 on Aug 18, 2022 |
# ? Aug 18, 2022 05:15 |
|
That's because the filter function takes as its first argument a function that resolves to true or false and an iterable as the second argument whose elements are run through the first argument function. Since the second argument in your code is the map function, that has to be run first to get elements to feed into the first argument function. https://docs.python.org/3/library/functions.html#filter e:f,b. Yeah and that too.
|
# ? Aug 18, 2022 05:19 |
|
CarForumPoster posted:
That is exactly what I needed. I was missing the apply(str) part. Thanks!
|
# ? Aug 18, 2022 14:02 |
|
QuarkJets posted:That's an order of operations thing; filter is the outermost function, so it completes last QuarkJets posted:I like list comprehensions: I was going to say: usually when you combine map and filter you perform the filter operation first to cull membership of your result set, then call map on it to transform it. If you call map first, then you're potentially performing the transform unnecessarily on members that will be filtered out anyways, unless the behavior of the filter depends on the transform which in this case it does. To make it equivalent you have to do something like this: Python code:
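One way to write that equivalence — transform in an inner generator so the filter can see the lowercased value without calling lower() twice (assumed names list):

```python
names = ["Mary", "mike", "Anna", "Mateo"]
result = [lowered
          for lowered in (name.lower() for name in names)  # transform first
          if not lowered.startswith("m")]                  # then filter on it
print(result)   # ['anna']
```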
ExcessBLarg! fucked around with this message at 14:15 on Aug 18, 2022 |
# ? Aug 18, 2022 14:11 |
|
that's what the walrus operator is for code:
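The walrus version — the assignment goes in the if clause, which is evaluated before the output expression (assumed names list):

```python
names = ["Mary", "mike", "Anna", "Mateo"]
result = [lowered for name in names
          if not (lowered := name.lower()).startswith("m")]
print(result)   # ['anna']
```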
|
# ? Aug 18, 2022 15:23 |
|
Interesting. I tried using the walrus operator but it didn't work. You have to follow the order in which the list comprehension clauses are evaluated then, which isn't left-to-right.
|
# ? Aug 18, 2022 17:18 |
|
Nobody likes startswith()? Python code:
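Presumably something like (assumed names list):

```python
names = ["Mary", "mike", "Anna", "Mateo"]
result = [name for name in names if not name.lower().startswith("m")]
print(result)   # ['Anna']
```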
|
# ? Aug 18, 2022 19:09 |
|
|
# ? Aug 18, 2022 21:07 |
|
KICK BAMA KICK posted:
lower() is O(n), so yes, for large numbers of very long strings you may see a large benefit with this approach. I just learned that startswith can accept a tuple of strings, so you can skip the lower() and just Python code:
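Skipping lower() with the tuple form (assumed names list):

```python
names = ["Mary", "mike", "Anna", "Mateo"]
# startswith accepts a tuple: match either case without lowering the string
result = [name for name in names if not name.startswith(("m", "M"))]
print(result)   # ['Anna']
```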
|
# ? Aug 18, 2022 23:09 |
|
I have a 2 column dataframe where I'm trying to find the column value that's >= a threshold and extract the index. This is done multiple times for different thresholds. I've tried it using a for loop as well as .idxmin and it's not clear that .idxmin is faster. In fact, it might even be slower. Is there a better way to do this? Maybe using numpy? Josh Lyman fucked around with this message at 00:50 on Aug 19, 2022 |
# ? Aug 19, 2022 00:28 |
|
Josh Lyman posted:I have a 2 column dataframe where I'm trying to find the column value that's >= a threshold and extract the index. This is done multiple times for different thresholds. I've tried it using a for loop as well as .idxmin and it's not clear that .idxmin is faster. In fact, it might even be slower. Are there ever multiple column values such that you’re extracting the indices? Can you post some example code?
|
# ? Aug 19, 2022 01:16 |
|
CarForumPoster posted:Are there ever multiple column values such that you’re extracting the indices? To follow up with an issue I asked a couple days ago, I have 10 concurrent Jupyter notebooks running and they each make API calls about once every 1.5-2 seconds. The server will randomly return an empty list. I thought this was a rate limit issue and put in error handling so that if the server returns an empty list, I sleep and retry the call. I've tried sleep intervals from 3 seconds to 2 minutes but the server will keep returning empty lists. Yet when I stop the code and immediately restart it, everything works again. This makes me wonder if it's not actually a rate limit issue but something to do with JupyterLab. The issue has also occurred when I was only running 5 notebooks. I suppose I could try running the jobs from the terminal. Thoughts? Josh Lyman fucked around with this message at 08:04 on Aug 19, 2022 |
# ? Aug 19, 2022 08:02 |
|
I can’t imagine how I’d diagnose what a server is returning to you without seeing code…and even then if you mean the server response is literally an empty list Eg if using requests: r=requests.get(args) And r.content() is [] Then nah don’t know how to help. Though my first thought was “is the server returning tokens they should use for the next request and they’re not being used correctly?”
|
# ? Aug 19, 2022 12:21 |
|
Looking at job postings, it seems like every data analyst with a butthole should have experience with machine learning. I feel like I'm being left behind. In the ocean of resources of available on the topic, can anyone recommend some that can help build up practical experience using Python?
|
# ? Aug 19, 2022 15:28 |
|
Hughmoris posted:Looking at job postings, it seems like every data analyst with a butthole should have experience with machine learning. I feel like I'm being left behind. scikit-learn is my favorite Python machine learning library because they cover a ton of common algorithms and use cases for machine learning, and also have great docs and tutorials https://scikit-learn.org/stable/tutorial/index.html I put together an active-ish thread about some generic machine learning stuff, might get some better suited folks to help you there too: https://forums.somethingawful.com/showthread.php?threadid=3993118 But I'm also happy to discuss here, this thread is definitely more active. Fair warning that I'm definitely a bit behind in what's cool with Tensorflow/PyTorch/Jax/whatever these days.
|
# ? Aug 19, 2022 19:22 |
|
CarForumPoster posted:I can’t imagine how I’d diagnose what a server is returning to you without seeing code…and even then if you mean the server response is literally an empty list Eg if using requests: [[weights for month] , [calories for month]] If I request a username that doesn't exist, the API returns [] which I can handle fine. The problem I'm trying to understand is when it returns [[] , []]. This is where stopping the notebook and rerunning the cell in Jupyter "fixes" the issue and the nested list will be correctly populated. If it were strictly a rate limit issue, you'd think a 5 minute sleep before the code block tries the request again should work but it doesn't.
|
# ? Aug 19, 2022 19:27 |
|
Protocol7 posted:scikit-learn is my favorite Python machine learning library because they cover a ton of common algorithms and use cases for machine learning, and also have great docs and tutorials Thanks for these!
|
# ? Aug 19, 2022 21:43 |
|
Josh Lyman posted:Yeah it's tricky. Not sure if this will help but I'll try to describe it better: Imagine a dataset that has the daily weight and calories of every goon and I can request them by username and month. The server normally returns this as a nested list To be clear, restarting the client code is what solves it? I mean, you are not restarting the server or anything I presume. Are you doing anything in the client code such as using sessions? Or any other state that gets cleared when you restart the kernel? It must be something like that. There's no reason the same request should get a different result unless something changes about the request itself. The server doesn't know the environment that the request was made in unless it's passed into the request. Which could happen for example with sessions. You might be able to inspect these kinds of things in the request object, or even check the traffic with wireshark and see what's different in the requests that don't work vs. the next one after restarting the client. Edit: reread the post, if you're not even restarting the kernel but just re-running the cell... that's potentially even weirder although depending on what's in the cell I suppose it could be equivalent. In the end it comes down to what's in the cell. I'd put my money on you building up state somehow. Edit2: just to elaborate on what I would do to debug, specifically inspecting the requests object. Detect when the situation happens but break the loop instead of sleeping. Take a look at r, headers, cookies, and especially r.request (which encapsulates what you're sending to the server). In the next cell make another request but don't overwrite r, call it r2 or whatever. Assuming that request works, compare them side by side. Find the difference in what you are sending to the server. SurgicalOntologist fucked around with this message at 23:44 on Aug 19, 2022 |
# ? Aug 19, 2022 23:31 |
|
SurgicalOntologist posted:To be clear, restarting the client code is what solves it? I mean, you are not restarting the server or anything I presume. Are you doing anything in the client code such as using sessions? Or any other state that gets cleared when you restart the kernel? It must be something like that. And yes, I’m just re-running the cell. It’s just a script that loops through the list of goons and pulls their weight and calories. There’s a bunch of calculation that happens with that information and then I write it to a csv, but all those arrays are reset to 0 at the beginning of each loop. Edit: Regarding your edit2 , that was one of the first things I tried and figured out the data exists. So when I see the script is inside the sleep loop, I’ll stop the notebook and manually run the API call in another cell using the “current” parameter values and it works. Josh Lyman fucked around with this message at 00:28 on Aug 20, 2022 |
# ? Aug 20, 2022 00:02 |
|
|
|
Josh Lyman posted:What do you mean by building up state? Josh Lyman posted:Edit: Regarding your edit2, that was one of the first things I tried and figured out the data exists. So when I see the script is inside the sleep loop, I’ll stop the notebook and manually run the API call in another cell using the “current” parameter values and it works. Yeah, but what I'm suggesting is taking that "r" object you get, specifically r.request, which is an object of type requests.PreparedRequest that encapsulates all the information you're sending to the server, and taking a close look. See if r.request and r2.request are different in any way, assuming r is a failed response and r2 is a successful one. Are the request headers exactly the same? Is the request body exactly the same? To put it another way, if you send the same exact set of bytes to the server, it shouldn't matter whether you send it from one cell or another (given that you seem to have ruled out rate-limiting on the server side with your sleep experiment). Much more likely than anything else I can think of, you are not actually sending the same bytes, but somehow the context of the cell (i.e. the loop) is changing what bytes you send to the server. Inspecting your PreparedRequest will help you figure out if that's the case. Also, other fields of the response (r) itself might be interesting too, perhaps there is a clue in the status code or the headers returned from the server.
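One offline way to see exactly what a request would send is to build and prepare it without firing it — the URL and parameters here are hypothetical stand-ins for the real API call:

```python
import requests

req = requests.Request("GET", "https://api.example.com/stats",
                       params={"user": "goon", "month": "2022-08"})
prepared = req.prepare()   # the on-the-wire view of the request

print(prepared.url)        # full URL including the query string
print(prepared.headers)    # exactly the headers that would be sent
# For a live call, compare r.request.headers and r.request.body between a
# failing response r and a succeeding one r2.
```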
|
# ? Aug 20, 2022 01:18 |