12 rats tied together
Sep 7, 2006

Seventh Arrow posted:

I've been learning about scopes and namespaces (and maybe not paying enough attention!), so I'm trying to understand how this works:



:siren::siren::siren: without getting into an internet slapfight about the practice of tipping :siren::siren::siren:, I'm not sure how the add_tip function is able to access the 'total' variable from total_bill. Isn't 'total' local to total_bill and thus inaccessible to add_tip? Shouldn't the parameter for add_tip be 'def add_tip(total_bill)'?

the 3rd statement passes the function called "add_tip" to the other function. the other function receives it and executes it, binding the result to a variable called "total"

this is called "first class functions" or "higher order functions" or maybe some other things

e: adding quote for new page, and i have attempted to shittily mspaint this for you:


the key insight here is that a function is just an object like anything else in python, which means you can pass it around to other functions so it can be called later
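
the snippet from the quoted post didn't survive here, so this is only a hypothetical reconstruction of the pattern being described. the function names follow the discussion, but the parameter names and the 20% tip rate are assumptions made up for illustration:

```python
# Hypothetical reconstruction: the quoted snippet is missing, so the
# parameter names and the 20% tip rate below are assumptions.
def add_tip(total):
    # this "total" is add_tip's own parameter, local to add_tip
    return total + total * 0.2

def total_bill(func, value):
    # func is whatever function object was passed in; call it like any other
    total = func(value)  # a separate "total", local to total_bill
    return total

# add_tip is passed without parentheses: the function itself, not its result
result = total_bill(add_tip, 100)
print(result)
```

total_bill never touches add_tip's local variable. it only receives the return value of the call and binds that to its own local name.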

12 rats tied together fucked around with this message at 05:20 on Aug 12, 2022


Deffon
Mar 28, 2010

Also, the fact that there is a variable with the same name in both functions doesn't matter; they are independent of each other.
e.g. "total" from "total_bill" could've been named "bill" or something and nothing would've changed.
What's important is that the "total" in "total_bill" gets bound to whatever "add_tip(100)" returns, which is then returned and printed.

Seventh Arrow
Jan 26, 2005

Ok thanks, I think I see - so it's kind of a cross-pollination that's going on. I will definitely look up higher order functions, since my search terms didn't yield great results.

Mursupitsku
Sep 12, 2011

Biffmotron posted:

I'm not entirely following what the code does. I get that you're choosing one of option_X from probability distribution prob_Y, but you define draw_options() with arguments and then don't pass anything to them.

I ran some profiles with %timeit. random.choice is reasonably performant, but random.choices scales poorly.

random.choice
  • 10 items, 189 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
  • 10**6 items, 362 ns ± 4.58 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
  • 10**8 items, 421 ns ± 24.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

random.choices
  • 10 items, 889 ns ± 12.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
  • 10**6 items, 32.9 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • 10**8 items, 3.8 s ± 82.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Foxfire_ posted:

Can you set up some smaller dummy thing that duplicates the problem?

When I run this:
Python code:
# Make a bigish dict indexed by a tuple of strings
probs = {}
for i in range(1000):
    probs[(f"thing1_{i}", f"thing2_{i}", f"thing3_{i}")] = i

def profile():
    """ The code we're interested in timing """
    fake_work = 0
    for _ in range(200_000):
        # Dict lookup, pretty sure CPython doesn't cache anything so every call does the full lookup
        fake_work += probs[("thing1_300", "thing2_300", "thing3_300")]
    return fake_work
%timeit profile() is telling me it's only tens of ms to run profile(). Doesn't seem like the dict lookup should matter much for runtime, unless your probs is a lot bigger than 1000 things

The previous examples weren't the best. Here's an example that actually runs and duplicates the problem. Method 1 takes more than twice as long to run as method 2.

Python code:
import random
import time

##Method 1

probs = {}
keys = []

for i in range(5000):
    keys.append((f"condition1_{i}", f"condition2_{i}", f"condition3_{i}"))
    probs[(f"condition1_{i}", f"condition2_{i}", f"condition3_{i}")] = [0.25, 0.25, 0.25, 0.25]

def draw_options(condition1, condition2, condition3):
    probabilities = probs[(condition1, condition2, condition3)]
    options = ['option 1', 'option 2', 'option 3', 'option 4']
    result = random.choices(options, probabilities)[0]
    return result
    
start = time.time()

for _ in range(400):
    for key in keys:
        option = draw_options(key[0], key[1], key[2])

end = time.time()
print(end - start)

##Method 2

probs = {}
keys = []

for i in range(5000):
    keys.append((f"condition1_ {i}", f"condition2_ {i}", f"condition3_ {i}"))
    probs[(f"condition1_ {i}", f"condition2_ {i}", f"condition3_ {i}")] = ['option 1', 'option 2', 'option 3', 'option 4']


def draw_options(condition1, condition2, condition3):
    result = random.choice(probs[(condition1, condition2, condition3)])
    return result
    
start = time.time()

for _ in range(400):
    for key in keys:
        option = draw_options(key[0], key[1], key[2])
    
end = time.time()
print(end - start)
Also, I noticed that in my actual code I had the distributions as numpy arrays, and that makes the first method take more than 4 times as long as method 2.

Python code:
import random
import time
import numpy as np

##Method 1

probs = {}
keys = []

for i in range(5000):
    keys.append((f"condition1_{i}", f"condition2_{i}", f"condition3_{i}"))
    probs[(f"condition1_{i}", f"condition2_{i}", f"condition3_{i}")] = np.array([0.25, 0.25, 0.25, 0.25])

def draw_options(condition1, condition2, condition3):
    probabilities = probs[(condition1, condition2, condition3)]
    options = ['option 1', 'option 2', 'option 3', 'option 4']
    result = random.choices(options, probabilities)[0]
    return result
    
start = time.time()

for _ in range(400):
    for key in keys:
        option = draw_options(key[0], key[1], key[2])

end = time.time()
print(end - start)

##Method 2

probs = {}
keys = []

for i in range(5000):
    keys.append((f"condition1_ {i}", f"condition2_ {i}", f"condition3_ {i}"))
    probs[(f"condition1_ {i}", f"condition2_ {i}", f"condition3_ {i}")] = ['option 1', 'option 2', 'option 3', 'option 4']


def draw_options(condition1, condition2, condition3):
    result = random.choice(probs[(condition1, condition2, condition3)])
    return result
    
start = time.time()

for _ in range(400):
    for key in keys:
        option = draw_options(key[0], key[1], key[2])
    
end = time.time()
print(end - start)
Any ideas on how to make the first method faster?

QuarkJets
Sep 8, 2008

Mursupitsku posted:

The previous examples weren't the best. Here's an example that actually runs and duplicates the problem. Method 1 takes more than twice as long to run as method 2.

Biffmotron posted:

I'm not entirely following what the code does. I get that you're choosing one of option_X from probability distribution prob_Y, but you define draw_options() with arguments and then don't pass anything to them.

I ran some profiles with %timeit. random.choice is reasonably performant, but random.choices scales poorly.

random.choice
  • 10 items, 189 ns ± 1.33 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)
  • 10**6 items, 362 ns ± 4.58 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
  • 10**8 items, 421 ns ± 24.6 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)

random.choices
  • 10 items, 889 ns ± 12.2 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
  • 10**6 items, 32.9 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
  • 10**8 items, 3.8 s ± 82.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

It's definitely just coming down to the choice of choice function
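
A rough way to check this on your own machine is to time the two calls in isolation with the 4-option setup from the example. This is only a sketch: the helper names draw_weighted and draw_uniform are made up here, and the absolute numbers will vary by machine and Python version.

```python
import random
import timeit

options = ['option 1', 'option 2', 'option 3', 'option 4']
weights = [0.25, 0.25, 0.25, 0.25]

def draw_weighted():
    # random.choices accumulates the weights on every call before drawing
    return random.choices(options, weights)[0]

def draw_uniform():
    # random.choice just picks a random index
    return random.choice(options)

n = 100_000
t_weighted = timeit.timeit(draw_weighted, number=n)
t_uniform = timeit.timeit(draw_uniform, number=n)
print(f"random.choices: {t_weighted:.3f} s, random.choice: {t_uniform:.3f} s")
```

If the weights for a key never change, random.choices also accepts precomputed cumulative weights through its cum_weights parameter, which may shave off part of the per-call overhead. I have not measured how much that helps for this case.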

Biffmotron
Jan 12, 2007

My first thought when I see an iterative problem like yours that needs to go faster is "Is there some way we can turn this into a matrix manipulation problem?"

imagine you have an options matrix
pre:
   1,   2,   3,   4
   5,   6,   7,   8
...
1001,1002,1003,1004
and you want to select one option from each row, returning an output like 3,8... 1001. In Python you can do that with numpy and boolean array masks by doing U[V], where U is an array of options and V is a boolean array of the same dimensions. The boolean array which would get us 3,8... 1001 is

pre:
False, False, True, False
False, False, False, True
True, False, False, False
With that as the goal, the code is pretty simple.

Python code:
import random
import time
from collections import Counter

import numpy as np

n_samples = 200000
#dict of probability distributions, with keys as options and values as weights
prob_dists = {(1,2,3,4):3, #more of this dist
                        (1,1,2,2):2, 
                        (3,3,3,4):1} #fewer of this one


start = time.time()
# error check that all distributions have the same dimensions
dim_classes = set([len(x) for x in prob_dists.keys()])
assert len(dim_classes) == 1 

n_classes = list(dim_classes)[0]

# build options matrix U
U = []
for _ in range(n_samples):
    k = list(prob_dists.keys())
    v = list(prob_dists.values())
    U.append(*random.choices(k,v))
U = np.array(U)


# create choices matrix V
# I found this code on stack overflow, there are other ways to generate V
midpoint = time.time()
V = np.zeros((n_samples, n_classes))
J = np.random.choice(n_classes, n_samples)
V[np.arange(n_samples), J] = 1
# set to binary mask
V = np.array(V, dtype=bool)

# use binary mask to get output
output = Counter(U[V])
end = time.time()

print('total time', end - start)
print('probability time', end- midpoint)
The whole thing runs in 270 ms for me, and just 16 ms for everything from the midpoint on, compared to 760 ms for Method2. Since the options matrix U can be reused and Counters are compact objects that are easy to combine, this'll scale to any reasonable sample size.
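
To make the U[V] step concrete, here is a tiny version of the same masking trick using the three rows from the example above. The column choices in J are fixed by hand here purely for illustration (in the real code they come from np.random.choice).

```python
import numpy as np

# Three rows of options, matching the example matrix above
U = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [1001, 1002, 1003, 1004]])

# Chosen column for each row, hard-coded for illustration
J = np.array([2, 3, 0])

# Build the boolean mask: exactly one True per row
V = np.zeros(U.shape, dtype=bool)
V[np.arange(len(U)), J] = True

# Boolean masking returns the selected elements in row order: 3, 8, 1001
picked = U[V]

# The same selection without building a mask at all
direct = U[np.arange(len(U)), J]
print(picked, direct)
```

The integer-index form in the last step skips allocating V entirely, so it may be worth trying if memory becomes a concern at larger sample sizes.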

Foxfire_
Nov 8, 2010

A less dramatic rearranging. This is trading memory for time by getting rid of python objects and python computation.


1x python saying "Numpy, please generate me 400 choices using these probabilities" is much faster than 400 x python saying "Numpy/random.choices, please generate me 1 choice using these probabilities"

Python code:
In [86]: def rearrange(num_trials_per_key, include_py_lookup):
    ...:     probs = {}
    ...:     keys = []
    ...:
    ...:     for i in range(5000):
    ...:         keys.append((f"condition1_{i}", f"condition2_{i}", f"condition3_{i}"))
    ...:         probs[(f"condition1_{i}", f"condition2_{i}", f"condition3_{i}")] = np.array([0.25, 0.25, 0.25, 0.25])
    ...:
    ...:     possible_options = ['option 1', 'option 2', 'option 3', 'option 4']
    ...:
    ...:     for key in keys:
    ...:         probabilities = probs[key]
    ...:         chosen_option_indices = np.random.choice(
    ...:             a=np.arange(len(possible_options)),
    ...:             size=num_trials_per_key,
    ...:             p=probabilities,
    ...:         )
    ...:
    ...:
    ...:         if include_py_lookup:
    ...:             for option_index in chosen_option_indices:
    ...:                 # Do something with the choice
    ...:                 _ = possible_options[option_index]
    ...:

In [87]: %timeit rearrange(400, True)
218 ms ± 4.79 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [88]: %timeit rearrange(400, False)
117 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [89]: %timeit rearrange(10_000, True)
3.4 s ± 66.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [90]: %timeit rearrange(10_000, False)
1.06 s ± 27.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It's still dominated by python lookup, not the random choice numpy part. Doing 5000 x 10_000 python list index lookups takes longer than doing 5000 x "numpy generate 10_000 random choices".
Python is very, very slow. If you're doing numerical computing, you want to make sure as little implemented-in-python code as possible runs.
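
A sketch of what keeping the lookup out of python could look like, assuming the "do something with the choice" step can work on whole arrays. It uses np.random.default_rng rather than the np.random.choice call above, and the seed is only there so the sketch is repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
possible_options = np.array(['option 1', 'option 2', 'option 3', 'option 4'])
probabilities = np.array([0.25, 0.25, 0.25, 0.25])

# One call draws all 400 option indices for a key
indices = rng.choice(len(possible_options), size=400, p=probabilities)

# One fancy-indexing call maps indices to labels, no python-level loop
chosen = possible_options[indices]

# Tallying how often each option came up also stays inside numpy
counts = np.bincount(indices, minlength=len(possible_options))
print(counts)
```

Whether this helps depends on what the per-choice work actually is. If each choice feeds into python logic that can't be expressed on arrays, the loop comes back.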


e:
the matrix thing Biffmotron is doing is essentially the same idea. Doing it like that is theoretically worse/slower (needs more scratch RAM, worse cache locality), but it moves the bulk of the executing code from python to numpy-implemented-in-C and that time savings more than makes up for doing the calculation suboptimally.

The best-in-abstract way to do it would be to do it like you originally had it with loops that generate, use, and discard state as soon as possible so that the state is most likely to fit in cache, but python-slowness and diffuseness (PyObjects are individually allocated on the heap and aren't necessarily close to each other for caching) outweighs that

Foxfire_ fucked around with this message at 22:50 on Aug 12, 2022

Mursupitsku
Sep 12, 2011
Thanks to both Foxfire_ and Biffmotron for the help. I have been playing around with the code snippets given and they make a lot of sense to me. I was vaguely aware that going full numpy would be a lot faster but it is still a bit out of my realm of knowledge.

However, I haven't been able to apply the solutions to my actual simulation yet. I've also started to wonder if the whole simulation could be turned into a matrix manipulation problem and made faster as a whole?

Below is a running, simplified example of the simulation I'm running. It's a simulation of a game that is played until either player reaches 11 rounds won. If both players reach 10, an overtime is played repeatedly until either player reaches 4 overtime rounds won. At the start of each round both players choose an option that heavily alters the win probabilities of the round. The probability of which option is chosen depends on the player's option in the previous round and the outcome of the 2 previous rounds.

Maybe the code also explains why I'm struggling to implement the solutions presented.

Python code:
import random

#Win probabilities of player1 based on chosen option and opponent option
player1_win_probabilities = {"option1": {"option1": 0.5,
                                       "option2": 0.2,
                                       "option3": 0.06,
                                       "option4": 0.05},
                           "option2": {"option1": 0.7,
                                       "option2": 0.5,
                                       "option3": 0.15,
                                       "option4": 0.1},
                           "option3": {"option1": 0.8,
                                       "option2": 0.7,
                                       "option3": 0.5,
                                       "option4": 0.45},
                           "option4": {"option1": 0.96,
                                       "option2": 0.85,
                                       "option3": 0.65,
                                       "option4": 0.5}}
                                     
options = ["option1", "option2", "option3", "option4"]

outcome_of_last_2_rounds = ["_", "_W", "_L", "WW", "LL", "WL", "LW"]

option_probabilities = {}

for option in options:
    for outcome in outcome_of_last_2_rounds:
        option_probabilities[(option, outcome)] = [0.25, 0.25, 0.25, 0.25]

def draw_option(previous_option, outcome_of_rounds):
    
    
    probabilities = option_probabilities[(previous_option, outcome_of_rounds)]
    option = random.choices(options, probabilities)[0]
    
    return option

player1MatchWins = []

for _ in range(20000):
    player1Score = 0
    player2Score = 0
    player1PreviousOption = "option1"
    player2PreviousOption = "option1"
    player1RoundOutcomes = "_"
    player2RoundOutcomes = "_"


    while player1Score != 11 and player2Score != 11:
        #A match is won if a player reaches a score of 11. If both players reach a score of 10 an overtime is played.
        
        player1Option = draw_option(player1PreviousOption, player1RoundOutcomes[-2:])
        player2Option = draw_option(player2PreviousOption, player2RoundOutcomes[-2:])
        
        player1PreviousOption = player1Option
        player2PreviousOption = player2Option
        
        player1WinProbability = player1_win_probabilities[player1Option][player2Option]
        
        p = random.random()
        
        if p < player1WinProbability:
            player1Score += 1
            player1RoundOutcomes += "W"
            player2RoundOutcomes += "L"
        else:
            player2Score += 1
            player2RoundOutcomes += "W"
            player1RoundOutcomes += "L"
            
        if player1Score == 10 and player2Score == 10:
            #"Overtime", best of 7
            player1OvertimeScore = 0
            player2OvertimeScore = 0
            player1Option = "option4"
            player2Option = "option4"
            
            while player1OvertimeScore != 4 and player2OvertimeScore != 4:
            
                player1WinProbability = player1_win_probabilities[player1Option][player2Option]
                
                p = random.random()
                
                if p < player1WinProbability:
                    player1OvertimeScore += 1

                else:
                    player2OvertimeScore += 1
                    
                if player1OvertimeScore == 3 and player2OvertimeScore == 3:
                    #New overtime
                    player1OvertimeScore = 0
                    player2OvertimeScore = 0
            
            if player1OvertimeScore > player2OvertimeScore:
                player1Score += 1
            else:
                player2Score += 1
               
    if player1Score == 11:
        player1MatchWins.append(1)
    else:
        player1MatchWins.append(0)
        
print("player1 wins with a probability of: ", sum(player1MatchWins)/len(player1MatchWins))

Mursupitsku fucked around with this message at 20:45 on Aug 14, 2022

Biffmotron
Jan 12, 2007

If the probabilities, choices, and round end depend on previous rounds, that gets a lot harder to solve as a matrix manipulation. One thing to consider is balancing developer time versus computer time. Your time is expensive (even if this is hobby programming, you could be doing something else) and computer time is cheap, up to a point. Getting a task that scales in O(n^3) to behave efficiently is worthwhile; eking out performance gains on an O(n) problem probably isn't, intrinsic value of learning something aside.

But it's more interesting than cleaning my kitchen, which is what I'd be otherwise doing today. One simple improvement is to use multiprocessing Pools to calculate on all CPU cores at once, instead of just one core. To make this work, I rewrote the interior part of the loop as a function, simulate_match(), which takes an unused parameter x, because Pool.map expects a function and a list of inputs to be parallelized. There are probably more elegant ways to do this, but this is one I know. I ran this with a CPU monitoring program open and I could actually see activity spike across all cores with the Pool enabled, as compared to just one core when running single-threaded.

And as an aside, Pool and jupyter notebooks don't play nicely together, but this will run in a terminal or VSCode.

Python code:
import random
from multiprocessing import Pool
import time

#Win probabilities of player1 based on chosen option and opponent option
player1_win_probabilities = {"option1": {"option1": 0.5,
                                       "option2": 0.2,
                                       "option3": 0.06,
                                       "option4": 0.05},
                           "option2": {"option1": 0.7,
                                       "option2": 0.5,
                                       "option3": 0.15,
                                       "option4": 0.1},
                           "option3": {"option1": 0.8,
                                       "option2": 0.7,
                                       "option3": 0.5,
                                       "option4": 0.45},
                           "option4": {"option1": 0.96,
                                       "option2": 0.85,
                                       "option3": 0.65,
                                       "option4": 0.5}}
                                     
options = ["option1", "option2", "option3", "option4"]

outcome_of_last_2_rounds = ["_", "_W", "_L", "WW", "LL", "WL", "LW"]

option_probabilities = {}

for option in options:
    for outcome in outcome_of_last_2_rounds:
        option_probabilities[(option, outcome)] = [0.25, 0.25, 0.25, 0.25]

def draw_option(previous_option, outcome_of_rounds):
    
    
    probabilities = option_probabilities[(previous_option, outcome_of_rounds)]
    option = random.choices(options, probabilities)[0]
    
    return option

player1MatchWins = []

def simulate_match(x):
    player1Score = 0
    player2Score = 0
    player1PreviousOption = "option1"
    player2PreviousOption = "option1"
    player1RoundOutcomes = "_"
    player2RoundOutcomes = "_"


    while player1Score != 11 and player2Score != 11:
        #A match is won if a player reaches a score of 11. If both players reach a score of 10 an overtime is played.
        
        player1Option = draw_option(player1PreviousOption, player1RoundOutcomes[-2:])
        player2Option = draw_option(player2PreviousOption, player2RoundOutcomes[-2:])
        
        player1PreviousOption = player1Option
        player2PreviousOption = player2Option
        
        player1WinProbability = player1_win_probabilities[player1Option][player2Option]
        
        p = random.random()
        
        if p < player1WinProbability:
            player1Score += 1
            player1RoundOutcomes += "W"
            player2RoundOutcomes += "L"
        else:
            player2Score += 1
            player2RoundOutcomes += "W"
            player1RoundOutcomes += "L"
            
        if player1Score == 10 and player2Score == 10:
            #"Overtime", best of 7
            player1OvertimeScore = 0
            player2OvertimeScore = 0
            player1Option = "option4"
            player2Option = "option4"
            
            while player1OvertimeScore != 4 and player2OvertimeScore != 4:
            
                player1WinProbability = player1_win_probabilities[player1Option][player2Option]
                
                p = random.random()
                
                if p < player1WinProbability:
                    player1OvertimeScore += 1

                else:
                    player2OvertimeScore += 1
                    
                if player1OvertimeScore == 3 and player2OvertimeScore == 3:
                    #New overtime
                    player1OvertimeScore = 0
                    player2OvertimeScore = 0
            
            if player1OvertimeScore > player2OvertimeScore:
                player1Score += 1
            else:
                player2Score += 1
               
    if player1Score == 11:
        return True
    else:
        return False

if __name__ == '__main__':
    start = time.time()

    n_sim= 100000

    with Pool(8) as p:
        wins = sum(p.map(simulate_match, range(n_sim)))

    end = time.time()
    print('run time:', end-start)
    print('win percentage:', wins/n_sim)

QuarkJets
Sep 8, 2008

Mursupitsku posted:

Thanks to both Foxfire_ and Biffmotron for the help. I have been playing around with the code snippets given and they make a lot of sense to me. I was vaguely aware that going full numpy would be a lot faster but it is still a bit out of my realm of knowledge.

However, I haven't been able to apply the solutions to my actual simulation yet. I've also started to wonder if the whole simulation could be turned into a matrix manipulation problem and made faster as a whole?

Below is a running, simplified example of the simulation I'm running. It's a simulation of a game that is played until either player reaches 11 rounds won. If both players reach 10, an overtime is played repeatedly until either player reaches 4 overtime rounds won. At the start of each round both players choose an option that heavily alters the win probabilities of the round. The probability of which option is chosen depends on the player's option in the previous round and the outcome of the 2 previous rounds.

Maybe the code also explains why I'm struggling to implement the solutions presented.

Foxfire_ is recommending that you perform several random draws at a time. A common performance trick is to try to eliminate Python for and while loops as much as possible, deferring to numpy vectorization, numba loops, and comprehensions for much faster iteration.

An easy optimization here would be to pre-generate your draws and then just iterate over them. Let's start by looking at your overtime scenario. Here's a copy-paste of just that block, but I added 2 lines at the start to set the scores to 10 so that every game immediately goes to overtime:
Python code:
        # TEMP: Force us into overtime
        player1Score = 10
        player2Score = 10
        if player1Score == 10 and player2Score == 10:
            #"Overtime", best of 7
            player1OvertimeScore = 0
            player2OvertimeScore = 0
            player1Option = "option4"
            player2Option = "option4"
            
            while player1OvertimeScore != 4 and player2OvertimeScore != 4:
            
                player1WinProbability = player1_win_probabilities[player1Option][player2Option]
                
                p = random.random()
                
                if p < player1WinProbability:
                    player1OvertimeScore += 1

                else:
                    player2OvertimeScore += 1
                    
                if player1OvertimeScore == 3 and player2OvertimeScore == 3:
                    #New overtime
                    player1OvertimeScore = 0
                    player2OvertimeScore = 0
            
            if player1OvertimeScore > player2OvertimeScore:
                player1Score += 1
            else:
                player2Score += 1

Let's break down the logic
1. Set both player overtime scores to 0
2. Set both player options to "option4"
3. Generate 6 outcomes. If player1 wins exactly 3 times (a 3-3 tie), repeat this step.
4. Compare the player1 and player2 scores to determine a winner.

The probability is static; both options are always set to "option4". Therefore, you could pregenerate some large number of overtime outcomes, and then just draw from that pool of results as-needed.
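
As a side note that isn't from the original code: because the probability is static, the 4 steps above also have a closed form, which can be handy as a sanity check on the simulated results. A minimal sketch, where overtime_win_probability is a made-up helper name and num_tries is assumed to be even so that an exact tie is possible:

```python
from math import comb

def overtime_win_probability(p, num_tries=6):
    """Chance that player 1 wins an overtime that is replayed on an exact tie.

    p is player 1's per-round win probability; num_tries is assumed even.
    """
    half = num_tries // 2
    # Probability that a block of num_tries rounds ends tied (e.g. 3-3)
    p_tie = comb(num_tries, half) * p**half * (1 - p)**(num_tries - half)
    # Probability that player 1 wins more than half of the rounds
    p_more = sum(comb(num_tries, k) * p**k * (1 - p)**(num_tries - k)
                 for k in range(half + 1, num_tries + 1))
    # Tied blocks are simply replayed, so condition on "not a tie"
    return p_more / (1 - p_tie)

print(overtime_win_probability(0.5))
```

With option4 against option4 the per-round probability is 0.5, so this comes out to an even coin flip, which is what you'd expect by symmetry.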

Let's put all of this logic into its own block of code. Then, let's pregenerate a ton of results and just draw from those as-needed. Consider this generator:
Python code:
import numpy as np

def overtime(num_tries=6, max_draws=200000):
    player1Option = "option4"
    player2Option = "option4"
    win_probability = player1_win_probabilities[player1Option][player2Option]
    # One batch of random numbers: each row is one overtime of num_tries rounds
    chances = np.random.random((max_draws, num_tries))
    win_outcomes = chances < win_probability
    num_wins = np.sum(win_outcomes, axis=1)
    for overtime_outcome in num_wins:
        # A 3-3 tie means the overtime would be replayed, so skip that row
        if overtime_outcome != 3:
            yield overtime_outcome > num_tries // 2

overtime_results = overtime()
Given some static win probability, this performs the same 4 steps as above but is a little different:

1. Fetch our static win probability
2. Generate a 200k x 6 array of random numbers (why 200k? For safety - we need at least 20k results because right now I'm forcing every game into overtime, and any time an overtime ends with exactly 3 wins we need to skip that result, so 200k seems plenty safe)
3. Compare that array with the win probability, converting the array to a 200k x 6 array of win/loss (True or False) states
4. Sum along the length-6 axis (axis 1) to get a 200k-length array of the number of wins. Each value will be between 0 and 6.
5. Iterate over this array, skipping any results where the outcome is exactly 3 (meaning we'd just repeat the overtime run)
6. In each successful iteration, yield whether the number of wins exceeds half of the number of tries (i.e. greater than 3). If the result is greater than 3, then player 1 wins, so we are returning True; otherwise we are returning False.

Since this is a generator, all of the logic before the loop only occurs for the first call, and each call thereafter the next element in the array is provided. We can create an instance pointing to the generator and then use next() to fetch elements from it. Using this generator:
Python code:
        # TEMP: Force us into overtime
        player1Score = 10
        player2Score = 10
        if player1Score == 10 and player2Score == 10:
            if next(overtime_results):
                player1Score += 1
            else:
                player2Score += 1
When I run this through timeit, with 20k cases all forced into overtime and 200k overtime scenarios pregenerated, I get a 20% performance improvement with the generator compared to the original code. This is also a lot cleaner to read; we've pulled the overtime logic into its own generator, separate from the main workflow.
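If the "logic before the loop only runs once" part is unclear, here's a toy sketch with no numpy involved. The names (`pooled_squares`, `calls`) are made up for illustration and aren't part of the code above:

```python
calls = []  # records when the setup code runs (illustration only)

def pooled_squares(pool_size=3):
    # Everything before the loop runs once, on the first next() call
    calls.append("setup")
    pool = [i * i for i in range(pool_size)]
    for value in pool:
        # Each later next() resumes here and hands out the next element
        yield value

squares = pooled_squares()   # nothing has run yet
first = next(squares)        # runs the setup, then yields 0
second = next(squares)       # resumes inside the loop, yields 1
print(first, second, calls)  # 0 1 ['setup']
```

The overtime generator works the same way: the random array is built on the first next(), and every later next() just hands out the next precomputed result.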

You might say "well I probably won't go into overtime every time." That's true. So instead of 200k outcomes, maybe you just generate 1k at a time, and do so forever. Here's an example of that implementation:

Python code:
def overtime(num_tries=6, max_draws=1000):
    player1Option = "option4"
    player2Option = "option4"
    win_probability = player1_win_probabilities[player1Option][player2Option]
    while True:
        chances = np.random.random((max_draws, num_tries))
        win_outcomes = chances < win_probability
        num_wins = np.sum(win_outcomes, axis=1)
        for overtime_outcome in num_wins:
            # Skip 3-3 ties, which would just replay the overtime
            if overtime_outcome != 3:
                yield overtime_outcome > num_tries // 2

overtime_results = overtime()
This is 25% faster than the original code, a nice little additional performance optimization that uses less memory and simultaneously can never run out of results (because it's generating new ones as-needed, 1000 at a time, in an infinite loop).

That's just optimizing the overtime portion, by pregenerating overtime results. Now consider the rest of your code. Without actually doing the work for you, can you repeat what was done here to pregenerate results for the rest of your code? You have a set of player1WinProbability values that are a function of player1 options, player2 options, and outcome_of_last_2_rounds. Since a generator can accept arguments, you could define 1 generator that yields outcomes for each of those scenarios and then create however many instances of this generator that you'll need.

QuarkJets fucked around with this message at 00:59 on Aug 15, 2022

Mursupitsku
Sep 12, 2011
Following the example given by QuarkJets I came up with the code below. It's about twice as fast as the original example! Is there still something obvious that could be improved? Also I'm getting a slightly different result from the code below compared to the original (~0.43 vs ~0.4) and I can't figure out why.

Python code:
import random
import numpy as np
import time

#Win probabilities of player1 based on chosen option and opponent option
player1_win_probabilities = {"option1": {"option1": 0.5,
                                       "option2": 0.2,
                                       "option3": 0.06,
                                       "option4": 0.05},
                           "option2": {"option1": 0.7,
                                       "option2": 0.5,
                                       "option3": 0.15,
                                       "option4": 0.1},
                           "option3": {"option1": 0.8,
                                       "option2": 0.7,
                                       "option3": 0.5,
                                       "option4": 0.45},
                           "option4": {"option1": 0.96,
                                       "option2": 0.85,
                                       "option3": 0.65,
                                       "option4": 0.5}}
                                     
options = ["option1", "option2", "option3", "option4"]

outcome_of_last_2_rounds = ["_", "_W", "_L", "WW", "LL", "WL", "LW"]

option_probabilities = {}

for option in options:
    for outcome in outcome_of_last_2_rounds:
        option_probabilities[(option, outcome)] = [0.25, 0.25, 0.25, 0.25]


def simulate_round(previousOptionPlayer1, previousOptionPlayer2, Player1Outcome, Player2Outcome, max_draws=1000):
    player1OptionProbs = option_probabilities[(previousOptionPlayer1, Player1Outcome)]
    player2OptionProbs = option_probabilities[(previousOptionPlayer2, Player2Outcome)]
    possible_options = ['option1', 'option2', 'option3', 'option4']
    while True:
        player1ChosenOptionIndices = np.random.choice(a=np.arange(len(possible_options)),size=max_draws,p=player1OptionProbs)
        player2ChosenOptionIndices = np.random.choice(a=np.arange(len(possible_options)),size=max_draws,p=player2OptionProbs)
        win_probabilities = [player1_win_probabilities[possible_options[player1Index]][possible_options[player2Index]] for player1Index, player2Index in zip(player1ChosenOptionIndices, player2ChosenOptionIndices)]
        chances = np.random.random(max_draws)
        win_outcomes = chances < win_probabilities
        for round_outcome, player1Option, player2Option in zip(win_outcomes, player1ChosenOptionIndices, player2ChosenOptionIndices):
            yield round_outcome, possible_options[player1Option], possible_options[player2Option]

def overtime(num_tries=6, max_draws=1000):
    player1Option = "option4"
    player2Option = "option4"
    win_probability = player1_win_probabilities[player1Option][player2Option]
    while True:
        chances = np.random.random((max_draws, num_tries))
        win_outcomes = chances < win_probability
        num_wins = np.sum(win_outcomes, axis=1)
        for overtime_outcome in num_wins:
            yield overtime_outcome > num_tries // 2

overtime_results = overtime()

round_results = {}

player1MatchWins = []

start = time.time()

for _ in range(20000):
    player1Score = 0
    player2Score = 0
    player1PreviousOption = "option1"
    player2PreviousOption = "option1"
    player1RoundOutcomes = "_"
    player2RoundOutcomes = "_"


    while player1Score != 11 and player2Score != 11:
        #A match is won if a player reaches a score of 11. If both players reach a score of 10 an overtime is played.
        
        if ((player1PreviousOption, player2PreviousOption, player1RoundOutcomes[-2:], player2RoundOutcomes[-2:])) not in round_results:
            round_results[(player1PreviousOption, player2PreviousOption, player1RoundOutcomes[-2:], player2RoundOutcomes[-2:])] = simulate_round(player1PreviousOption, player2PreviousOption, player1RoundOutcomes[-2:], player2RoundOutcomes[-2:])
        
        team1Win, player1PreviousOption, player2PreviousOption = next(round_results[(player1PreviousOption, player2PreviousOption, player1RoundOutcomes[-2:], player2RoundOutcomes[-2:])])
        
        if team1Win:
            player1Score += 1
            player1RoundOutcomes += "W"
            player2RoundOutcomes += "L"
        else:
            player2Score += 1
            player2RoundOutcomes += "W"
            player1RoundOutcomes += "L"
            
        if player1Score == 10 and player2Score == 10:
            if next(overtime_results):
                player1Score += 1
            else:
                player2Score += 1
               
    if player1Score == 11:
        player1MatchWins.append(1)
    else:
        player1MatchWins.append(0)
        
end = time.time()
print("run took ", end-start, " seconds")
print("player1 wins with a probability of: ", sum(player1MatchWins)/len(player1MatchWins))

Josh Lyman
May 24, 2009


Is there a reason why I'd get "IndexError: single positional indexer is out-of-bounds" when I'm referencing df.iloc[0][0] in a small table? I have a bunch of multi-day jobs running on parallel Jupyter notebooks on a server and this randomly happens to some of the jobs. It's not an issue with the input data because if I restart the job it doesn't happen again (at least not for a little while).

Here's the relevant bit of code
Python code:
myarray1 = np.array([[level[0], level[1]] for level in inputdata[0]]).T
mydf1 = pd.DataFrame.from_records(myarray1).T
mydf1 = mydf1.rename(index=str, columns={0: "var1", 1: "var2"})

myarray2 = np.array([[level[0], level[1]] for level in inputdata[1]]).T
mydf2 = pd.DataFrame.from_records(myarray2).T
mydf2 = mydf2.rename(index=str, columns={0: "var1", 1: "var2"})

myvar = (mydf1.iloc[0][0] + mydf2.iloc[0][0])/2

Josh Lyman fucked around with this message at 21:35 on Aug 17, 2022

dorkanoid
Dec 21, 2004

You probably want .iloc[0, 0] instead of [0][0] - since you rename the columns away from 0 and 1? If this only happens randomly, have you checked if len(mydf1) and len(mydf2) are > 0?

dorkanoid fucked around with this message at 20:11 on Aug 17, 2022

Josh Lyman
May 24, 2009


dorkanoid posted:

You probably want .iloc[0, 0] instead of [0][0] - since you rename the columns away from 0 and 1? If this only happens randomly, have you checked if len(mydf1) and len(mydf2) are > 0?
.iloc[0,0] gives me the same output as [0][0], though for readability I could just do mydf1.var1[0]

It just happened with one of the notebooks. It appears inputdata is empty, which is a nested list that comes from an API call. I'm not sure there's a good solution.

edit: I wonder if this is the result of me running 10 concurrent notebooks, so that if too many of them make the API call at the same time the database doesn't respond properly.

Josh Lyman fucked around with this message at 21:59 on Aug 17, 2022

QuarkJets
Sep 8, 2008

It sounds like there's a problem with the input data, you can just check that it's not empty prior to doing something with it
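Something like this, as a rough sketch. `first_level_average` is a hypothetical helper, and I'm assuming inputdata is the nested list from the API with each level being an indexable pair:

```python
def first_level_average(inputdata):
    """Average the first value of the first level in each of two series.

    Returns None when the API handed back nothing usable, so the caller
    can decide whether to retry, skip, or log it.
    """
    # Covers [], [[]], [[], []] and anything with fewer than two series
    if len(inputdata) < 2 or not inputdata[0] or not inputdata[1]:
        return None
    return (inputdata[0][0][0] + inputdata[1][0][0]) / 2

print(first_level_average([[(1.0, 5)], [(3.0, 7)]]))  # 2.0
print(first_level_average([[], []]))                  # None
print(first_level_average([]))                        # None
```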

Seventh Arrow
Jan 26, 2005

I got this off of LinkedIn, thought it was neat:

quote:

You can literally convert a number from one system to another using f-strings.

`b` for Binary
`o` for Octal
`x` for Hexadecimal
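
A quick sketch of what that looks like; the number 255 is just an example value picked for illustration:

```python
number = 255

binary = f"{number:b}"       # '11111111'
octal = f"{number:o}"        # '377'
hexadecimal = f"{number:x}"  # 'ff'

# Adding '#' keeps the 0b / 0o / 0x prefix
prefixed = f"{number:#x}"    # '0xff'

print(binary, octal, hexadecimal, prefixed)
```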


Hughmoris
Apr 21, 2007
Let's go to the abyss!
Pandas question:

If I have a dataframe, what's a good way to get the column name and its associated datatype in a list of lists and in a printable format?

code:
ZipCode | FirstName | LastName |
55631     |  Sally          | Ralphael    |
Ideally I'd get something like this:

quote:

[('Zipcode', Int), ('FirstName', String), ('LastName', String)]

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Hughmoris posted:

Pandas question:

If I have a dataframe, what's a good way to get the column name and its associated datatype in a list of lists and in a printable format?

code:
ZipCode | FirstName | LastName |
55631     |  Sally          | Ralphael    |
Ideally I'd get something like this:

You can have columns that all have the same data type, or you can have mixed data types in a column, in which case the column will show as an object.

df.dtypes https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html will give you a Series:

code:
df = pd.DataFrame({'float': [1.0],
                   'int': [1],
                   'datetime': [pd.Timestamp('20180310')],
                   'string': ['foo']})

df.dtypes

float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object
Then just get them however you want for example:
code:
list(zip(df.columns, df.dtypes.values))

[('float', dtype('float64')),
 ('int', dtype('int64')),
 ('datetime', dtype('<M8[ns]')),
 ('string', dtype('O'))]
EDIT: Or if you want the types as strings

list(zip(df.columns, df.dtypes.apply(str).values))

CarForumPoster fucked around with this message at 02:48 on Aug 18, 2022

Josh Lyman
May 24, 2009


QuarkJets posted:

It sounds like there's a problem with the input data, you can just check that it's not empty prior to doing something with it
Yeah I ended up putting in try/except error handling. It ended up being more difficult than expected since there's a different type of error outside the try/except and the handling was different.

Seventh Arrow
Jan 26, 2005

Status report: I found out that map and filter can be combined!

There was an exercise for the filter function using lambda where the goal was to take a list of names and filter out the ones starting with the letter 'm'. Their solution was as follows:

code:
names = ["margarita", "Linda", "Masako", "Maki", "Angela"]
 
M_names = filter(lambda name: name[0] == "M" or name[0] == "m", names) 
 
print(list(M_names))
Output: ['margarita', 'Masako', 'Maki']

Looking at it, I started wondering if there was a way to instead convert everything to lowercase and then filter out the 'm' names and after a bit of googling, this is what I came up with:

code:
names = ["margarita", "Linda", "Masako", "Maki", "Angela"]
 
m_names = filter(lambda name: name[0] == "m", map(lambda name: name.lower(), names))

print(list(m_names))
Output: ['margarita', 'masako', 'maki']

I'm a bit confused as to why it processes the map first and the filter second, though. Also, I wonder if it's a bit pointless. I was just doing it to experiment, but my 'solution' seems to use as much code as the original.

QuarkJets
Sep 8, 2008

Seventh Arrow posted:

Status report: I found out that map and filter can be combined!

There was an exercise for the filter function using lambda where the goal was to take a list of names and filter out the ones starting with the letter 'm'. Their solution was as follows:

code:
names = ["margarita", "Linda", "Masako", "Maki", "Angela"]
 
M_names = filter(lambda name: name[0] == "M" or name[0] == "m", names) 
 
print(list(M_names))
Output: ['margarita', 'Masako', 'Maki']

Looking at it, I started wondering if there was a way to instead convert everything to lowercase and then filter out the 'm' names and after a bit of googling, this is what I came up with:

code:
names = ["margarita", "Linda", "Masako", "Maki", "Angela"]
 
m_names = filter(lambda name: name[0] == "m", map(lambda name: name.lower(), names))

print(list(m_names))
Output: ['margarita', 'masako', 'maki']

I'm a bit confused as to why it processes the map first and the filter second, though. Also, I wonder if it's a bit pointless. I was just doing it to experiment, but my 'solution' seems to use as much code as the original.

That's an order of operations thing; filter is the outermost function, so it completes last

I've never developed any appreciation for the filter function. I like list comprehensions:

Python code:
names = ["margarita", "Linda", "Masako", "Maki", "Angela"]
 
m_names = [name for name in names if name[0].lower() == "m"]

print(m_names)

QuarkJets fucked around with this message at 05:20 on Aug 18, 2022

Edward IV
Jan 15, 2006

That's because the filter function takes as its first argument a function that resolves to true or false and an iterable as the second argument whose elements are run through the first argument function. Since the second argument in your code is the map function, that has to be run first to get elements to feed into the first argument function.
https://docs.python.org/3/library/functions.html#filter

e:f,b. Yeah and that too.
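
One wrinkle worth knowing: in Python 3 both map and filter are lazy, so "map runs first" really means each element goes through the map function and then the filter function, one at a time. A small sketch that logs the calls (the `log` list and the helper names are just for illustration):

```python
log = []

def lower(name):
    log.append(f"map {name}")
    return name.lower()

def starts_with_m(name):
    log.append(f"filter {name}")
    return name[0] == "m"

names = ["margarita", "Linda", "Masako"]
m_names = filter(starts_with_m, map(lower, names))

assert log == []         # nothing has run yet; both objects are lazy
result = list(m_names)   # consuming the filter pulls items through map
print(result)            # ['margarita', 'masako']
print(log)
```

The log alternates map/filter per element rather than showing all the map calls followed by all the filter calls.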

Hughmoris
Apr 21, 2007
Let's go to the abyss!

CarForumPoster posted:


EDIT: Or if you want the types as strings

list(zip(df.columns, df.dtypes.apply(str).values))

That is exactly what I needed. I was missing the apply(str) part. Thanks!

ExcessBLarg!
Sep 1, 2001

QuarkJets posted:

That's an order of operations thing; filter is the outermost function, so it completes last
I've said it before, but I think that map/filter being built-ins (instead of methods on all iterables) and the order of their arguments limits their usefulness particularly as you can't method chain them. But as you say, Python has never really embraced them, much preferring:

QuarkJets posted:

I like list comprehensions:
But these are not equivalent!

I was going to say: usually when you combine map and filter you perform the filter operation first to cull membership of your result set, then call map on it to transform it. If you call map first, then you're potentially performing the transform unnecessarily on members that will be filtered out anyways, unless the behavior of the filter depends on the transform which in this case it does. To make it equivalent you have to do something like this:
Python code:
names = ["margarita", "Linda", "Masako", "Maki", "Angela"]
 
m_names = [n for n in [name.lower() for name in names] if n[0] == "m"]

print(m_names)

ExcessBLarg! fucked around with this message at 14:15 on Aug 18, 2022

Zoracle Zed
Jul 10, 2001
that's what the walrus operator is for

code:
[n for name in names if (n := name.lower())[0] == "m"]

ExcessBLarg!
Sep 1, 2001
Interesting. I tried using the walrus operator but it didn't work. You have to follow the order in which the list comprehension clauses are evaluated then, which isn't left-to-right.
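
For reference, a sketch of why the walrus has to go in the condition: the `if` clause runs before the output expression for each element, so a name bound there is available on the left. Putting `:=` in the output expression instead means the condition reads the name before it has been bound. The variable names here (`low`, `failed`) are only for illustration:

```python
names = ["margarita", "Linda", "Masako", "Maki", "Angela"]

# Works: n is bound in the condition, then used in the output expression
m_names = [n for name in names if (n := name.lower())[0] == "m"]
print(m_names)  # ['margarita', 'masako', 'maki']

# Fails: the condition is evaluated first, before low has been assigned
failed = False
try:
    broken = [(low := name.lower()) for name in names if low[0] == "m"]
except NameError as exc:
    failed = True
    print("NameError:", exc)
```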

nullfunction
Jan 24, 2005

Nap Ghost
Nobody likes startswith()?

Python code:
[name for name in names if name.lower().startswith("m")]

KICK BAMA KICK
Mar 2, 2009

Too lazy and I forget how to use timeit up correctly Phoneposting or I'd test but: sure it's nbd so just hypothetically, is s.lower() O(n)? and worth avoiding if the strings could be long in favor of like s[0] in "Mm"?

nullfunction
Jan 24, 2005

Nap Ghost

KICK BAMA KICK posted:

Too lazy and I forget how to use timeit up correctly Phoneposting or I'd test but: sure it's nbd so just hypothetically, is s.lower() O(n)? and worth avoiding if the strings could be long in favor of like s[0] in "Mm"?

lower() is O(n), so yes, for large numbers of very long strings you may see a large benefit with this approach.

I just learned that startswith can accept a tuple of strings, so you can skip the lower() and just

Python code:
[name for name in names if name.startswith(("M", "m"))]
and it's nearly as performant as grabbing the 0th index, while not throwing exceptions if it hits an empty string. Pretty readable too.
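
If anyone wants to time it themselves, here's a rough timeit sketch. The test data (10,000 copies of a 1,001-character string) is made up, and the numbers will vary by machine so I'm not quoting any:

```python
import timeit

# Hypothetical worst case for lower(): lots of long strings
names = ["M" + "x" * 1000] * 10_000

def with_lower():
    return [n for n in names if n.lower().startswith("m")]

def with_tuple():
    return [n for n in names if n.startswith(("M", "m"))]

def with_index():
    # n[:1] instead of n[0] so an empty string doesn't raise IndexError
    return [n for n in names if n[:1] in ("M", "m")]

for func in (with_lower, with_tuple, with_index):
    seconds = timeit.timeit(func, number=20)
    print(f"{func.__name__}: {seconds:.4f}s for 20 runs")
```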

Josh Lyman
May 24, 2009


I have a 2 column dataframe where I'm trying to find the column value that's >= a threshold and extract the index. This is done multiple times for different thresholds. I've tried it using a for loop as well as .idxmin and it's not clear that .idxmin is faster. In fact, it might even be slower.

Is there a better way to do this? Maybe using numpy?

Josh Lyman fucked around with this message at 00:50 on Aug 19, 2022

CarForumPoster
Jun 26, 2013

⚡POWER⚡

Josh Lyman posted:

I have a 2 column dataframe where I'm trying to find the column value that's >= a threshold and extract the index. This is done multiple times for different thresholds. I've tried it using a for loop as well as .idxmin and it's not clear that .idxmin is faster. In fact, it might even be slower.

Is there a better way to do this? Maybe using numpy?

Are there ever multiple column values such that you’re extracting the indices?

Can you post some example code?

Josh Lyman
May 24, 2009


CarForumPoster posted:

Are there ever multiple column values such that you’re extracting the indices?

Can you post some example code?
I ended up rewriting my entire code away from lists and dataframes into numpy arrays so I'm good to go there.

To follow up with an issue I asked a couple days ago, I have 10 concurrent Jupyter notebooks running and they each make API calls about once every 1.5-2 seconds. The server will randomly return an empty list.

I thought this was a rate limit issue and put in error handling so that if the server returns an empty list, I sleep and retry the call. I've tried sleep intervals from 3 seconds to 2 minutes but the server will keep returning empty lists. Yet when I stop the code and immediately restart it, everything works again.

This makes me wonder if it's not actually a rate limit issue but something to do with JupyterLab. The issue has also occurred when I was only running 5 notebooks. I suppose I could try running the jobs from the terminal. Thoughts?

Josh Lyman fucked around with this message at 08:04 on Aug 19, 2022

CarForumPoster
Jun 26, 2013

⚡POWER⚡
I can’t imagine how I’d diagnose what a server is returning to you without seeing code…and even then if you mean the server response is literally an empty list Eg if using requests:

r=requests.get(args)

And r.json() is []

Then nah don’t know how to help.

Though my first thought was “is the server returning tokens they should use for the next request and they’re not being used correctly?”

Hughmoris
Apr 21, 2007
Let's go to the abyss!
Looking at job postings, it seems like every data analyst with a butthole should have experience with machine learning. I feel like I'm being left behind.

In the ocean of resources available on the topic, can anyone recommend some that can help build up practical experience using Python?

Macichne Leainig
Jul 26, 2012

by VG

Hughmoris posted:

Looking at job postings, it seems like every data analyst with a butthole should have experience with machine learning. I feel like I'm being left behind.

In the ocean of resources available on the topic, can anyone recommend some that can help build up practical experience using Python?

scikit-learn is my favorite Python machine learning library because they cover a ton of common algorithms and use cases for machine learning, and also have great docs and tutorials

https://scikit-learn.org/stable/tutorial/index.html

I put together an active-ish thread about some generic machine learning stuff, might get some better suited folks to help you there too:

https://forums.somethingawful.com/showthread.php?threadid=3993118

But I'm also happy to discuss here, this thread is definitely more active. Fair warning that I'm definitely a bit behind in what's cool with Tensorflow/PyTorch/Jax/whatever these days.

Josh Lyman
May 24, 2009


CarForumPoster posted:

I can’t imagine how I’d diagnose what a server is returning to you without seeing code…and even then if you mean the server response is literally an empty list Eg if using requests:

r=requests.get(args)

And r.json() is []

Then nah don’t know how to help.

Though my first thought was “is the server returning tokens they should use for the next request and they’re not being used correctly?”
Yeah it's tricky. Not sure if this will help but I'll try to describe it better: Imagine a dataset that has the daily weight and calories of every goon and I can request them by username and month. The server normally returns this as a nested list

[[weights for month] , [calories for month]]

If I request a username that doesn't exist, the API returns [] which I can handle fine.

The problem I'm trying to understand is when it returns [[] , []]. This is where stopping the notebook and rerunning the cell in Jupyter "fixes" the issue and the nested list will be correctly populated. If it were strictly a rate limit issue, you'd think a 5 minute sleep before the code block tries the request again should work but it doesn't.

Hughmoris
Apr 21, 2007
Let's go to the abyss!

Protocol7 posted:

scikit-learn is my favorite Python machine learning library because they cover a ton of common algorithms and use cases for machine learning, and also have great docs and tutorials

https://scikit-learn.org/stable/tutorial/index.html

I put together an active-ish thread about some generic machine learning stuff, might get some better suited folks to help you there too:

https://forums.somethingawful.com/showthread.php?threadid=3993118

But I'm also happy to discuss here, this thread is definitely more active. Fair warning that I'm definitely a bit behind in what's cool with Tensorflow/PyTorch/Jax/whatever these days.

Thanks for these!

SurgicalOntologist
Jun 17, 2004

Josh Lyman posted:

Yeah it's tricky. Not sure if this will help but I'll try to describe it better: Imagine a dataset that has the daily weight and calories of every goon and I can request them by username and month. The server normally returns this as a nested list

[[weights for month] , [calories for month]]

If I request a username that doesn't exist, the API returns [] which I can handle fine.

The problem I'm trying to understand is when it returns [[] , []]. This is where stopping the notebook and rerunning the cell in Jupyter "fixes" the issue and the nested list will be correctly populated. If it were strictly a rate limit issue, you'd think a 5 minute sleep before the code block tries the request again should work but it doesn't.

To be clear, restarting the client code is what solves it? I mean, you are not restarting the server or anything I presume. Are you doing anything in the client code such as using sessions? Or any other state that gets cleared when you restart the kernel? It must be something like that.

There's no reason the same request should get a different result unless something changes about the request itself. The server doesn't know the environment that the request was made in unless it's passed into the request. Which could happen for example with sessions. You might be able to inspect these kinds of things in the request object, or even check the traffic with wireshark and see what's different in the requests that don't work vs. the next one after restarting the client.

Edit: reread the post, if you're not even restarting the kernel but just re-running the cell... that's potentially even weirder although depending on what's in the cell I suppose it could be equivalent. In the end it comes down to what's in the cell. I'd put my money that you're building up state somehow.

Edit2: just to elaborate on what I would do to debug, specifically inspecting the requests object. Detect when the situation happens but break the loop instead of sleeping. Take a look at r, headers, cookies, and especially r.request (which encapsulates what you're sending to the server). In the next cell make another request but don't overwrite r, call it r2 or whatever. Assuming that request works, compare them side by side. Find the difference in what you are sending to the server.

SurgicalOntologist fucked around with this message at 23:44 on Aug 19, 2022

Josh Lyman
May 24, 2009


SurgicalOntologist posted:

To be clear, restarting the client code is what solves it? I mean, you are not restarting the server or anything I presume. Are you doing anything in the client code such as using sessions? Or any other state that gets cleared when you restart the kernel? It must be something like that.

There's no reason the same request should get a different result unless something changes about the request itself. The server doesn't know the environment that the request was made in unless it's passed into the request. Which could happen for example with sessions. You might be able to inspect these kinds of things in the request object, or even check the traffic with wireshark and see what's different in the requests that don't work vs. the next one after restarting the client.

Edit: reread the post, if you're not even restarting the kernel but just re-running the cell... that's potentially even weirder although depending on what's in the cell I suppose it could be equivalent. In the end it comes down to what's in the cell. I'd put my money that you're building up state somehow.

Edit2: just to elaborate on what I would do to debug, specifically inspecting the requests object. Detect when the situation happens but break the loop instead of sleeping. Take a look at r, headers, cookies, and especially r.request (which encapsulates what you're sending to the server). In the next cell make another request but don't overwrite r, call it r2 or whatever. Assuming that request works, compare them side by side. Find the difference in what you are sending to the server.
What do you mean by building up state?

And yes, I’m just re-running the cell. It’s just a script that loops through the list of goons and pulls their weight and calories. There’s a bunch of calculation that happens with that information and then I write it to a csv, but all those arrays are reset to 0 at the beginning of each loop.

Edit: Regarding your edit2 , that was one of the first things I tried and figured out the data exists. So when I see the script is inside the sleep loop, I’ll stop the notebook and manually run the API call in another cell using the “current” parameter values and it works.

Josh Lyman fucked around with this message at 00:28 on Aug 20, 2022

SurgicalOntologist
Jun 17, 2004

Josh Lyman posted:

What do you mean by building up state?
I guess I mean, whatever you're doing in the cell, it's causing one request and another to not actually be the same despite appearing to have the same parameters. The actual request you're sending might be different, in some way, depending on the state of all that code you're running in that cell. It's the only way I can make sense of it.

Josh Lyman posted:

Edit: Regarding your edit2 , that was one of the first things I tried and figured out the data exists. So when I see the script is inside the sleep loop, I’ll stop the notebook and manually run the API call in another cell using the “current” parameter values and it works.

Yeah, but what I'm suggesting is taking that "r" object you get, specifically r.request, which is an object of type requests.PreparedRequest that encapsulates all the information you're sending to the server, and taking a close look. See if r.request and r2.request are different in any way, assuming r is a failed response and r2 is a successful one. Are the request headers exactly the same? Is the request body exactly the same?

To put it another way, if you send the same exact set of bytes to the server, it shouldn't matter whether you send it from one cell or another (given that you seem to have ruled out rate-limiting on the server side with your sleep experiment). Much more likely than anything else I can think of, you are not actually sending the same bytes, but somehow the context of the cell (i.e. the loop) is changing what bytes you send to the server. Inspecting your PreparedRequest will help you figure out if that's the case. Also, other fields of the response (r) itself might be interesting too, perhaps there is a clue in the status code or the headers returned from the server.
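
A rough sketch of that comparison. `diff_requests` is a helper I'm making up here; it only assumes the two objects have `method`, `url`, `headers` and `body` attributes, which `requests.PreparedRequest` does. The stand-in objects at the bottom are fake so the snippet runs without hitting any server; in the notebook you'd pass `r.request` and `r2.request` instead:

```python
from types import SimpleNamespace

def diff_requests(a, b):
    """Return {field: (a_value, b_value)} for every field that differs."""
    differences = {}
    for field in ("method", "url", "body"):
        if getattr(a, field) != getattr(b, field):
            differences[field] = (getattr(a, field), getattr(b, field))
    # Compare headers key by key so a single stale cookie or token stands out
    a_headers, b_headers = dict(a.headers), dict(b.headers)
    for key in sorted(set(a_headers) | set(b_headers)):
        if a_headers.get(key) != b_headers.get(key):
            differences[f"header:{key}"] = (a_headers.get(key), b_headers.get(key))
    return differences

# Fake stand-ins for r.request (failed) and r2.request (worked)
failed = SimpleNamespace(method="GET", url="https://example.invalid/goons?month=8",
                         headers={"Cookie": "session=old"}, body=None)
worked = SimpleNamespace(method="GET", url="https://example.invalid/goons?month=8",
                         headers={"Cookie": "session=new"}, body=None)

print(diff_requests(failed, worked))
# {'header:Cookie': ('session=old', 'session=new')}
```

An empty dict back means the two requests really are identical on those fields, which would point the finger back at the server.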
