Foxfire_
Nov 8, 2010

Best guess is reimporting like other people have alluded to.

A statement like `import params` tells the interpreter:
- Check in your table of already-loaded modules to see if `params` is present
- If so, do nothing
- If not, go load it

This is generally desirable, otherwise every module that's doing stuff like `import collections` is reloading stuff, then the reloaded stuff reloads more stuff, ... until every import is reloading large parts of the standard library. A better design for the language would involve hashing the disk file and giving a warning/error if it has changed, but CPython is lazy

If you executed `import params` in your notebook kernel while the file-on-disk didn't have those values yet, the module that gets loaded won't have them. Executing `import params` again will do nothing because `params` is already loaded, even if the file-on-disk has changed. Either use importlib.reload() to explicitly reload it, or change the structure of what you're doing.
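A minimal sketch of the reload route, assuming `params` is the module you edited on disk:

Python code:
import importlib

import params

# ... edit params.py on disk, then re-run this cell ...
importlib.reload(params)  # re-executes params.py and updates the existing module object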

Foxfire_ fucked around with this message at 23:47 on Sep 13, 2022

Jose Cuervo
Aug 25, 2004
Based on the replies what I wrote was confusing.

I have a module named params.py which I was using to store the names of variables which are used by an algorithm so that in the algorithm (defined in another module named algorithm.py which imports params.py) I can simply write params.lag whenever I wish to use the lag variable. In my Jupyter notebook I import both params.py and algorithm.py at the very top, then in the following cell set the value of each parameter, and then finally call the algorithm (e.g., algorithm.run_alg()).

I think the explanation by Foxfire_ is what is happening and is why this is not working. I also agree with QuarkJets that using a dataclass is what I should do and just pass the parameters as an argument.

Macichne Leainig
Jul 26, 2012

by VG
Yesterday I discovered Fixes Text For You (ftfy) and it has been a godsend. I often deal with importing text data from various sources, so being able to unfuck encoding errors such as the ones that show up in my title is wonderful
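For anyone curious, usage is about as simple as it gets (the garbled string here is just an example of mojibake):

Python code:
import ftfy

print(ftfy.fix_text("The Mona Lisa doesnâ€™t have eyebrows."))
# The Mona Lisa doesn’t have eyebrows.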

punished milkman
Dec 5, 2018

would have won
let’s say you’re working with a legacy codebase (python 3, untyped) that relies heavily on shuffling around various huge nested dictionaries across your functions. these dictionaries usually follow an internally consistent structure, but knowing what their contents are is a huge pain in the rear end. e.g i regularly encounter situations where i’m like okay i’ve got the “config” dictionary… how the hell do i dig out the s3 path for item x? these things are very opaque and you need to either a) refer to old code that gets the item you want or b) log out the contents of the whole dict and figure out how to dig through it. not ergonomic at all.

so the question is, how would you refactor this codebase to make the structure of the dictionaries more well known to the developers? typed dicts? dataclasses? just wondering what i can do to make developing less of a headache in the future (ideally without having to rewrite thousands of lines of legacy code)

QuarkJets
Sep 8, 2008

punished milkman posted:

let’s say you’re working with a legacy codebase (python 3, untyped) that relies heavily on shuffling around various huge nested dictionaries across your functions. these dictionaries usually follow an internally consistent structure, but knowing what their contents are is a huge pain in the rear end. e.g i regularly encounter situations where i’m like okay i’ve got the “config” dictionary… how the hell do i dig out the s3 path for item x? these things are very opaque and you need to either a) refer to old code that gets the item you want or b) log out the contents of the whole dict and figure out how to dig through it. not ergonomic at all.

so the question is, how would you refactor this codebase to make the structure of the dictionaries more well known to the developers? typed dicts? dataclasses? just wondering what i can do to make developing less of a headache in the future (ideally without having to rewrite thousands of lines of legacy code)

Shouldn't you have one (or more) of these dictionaries as an input to your unit tests, hanging around in your test directory as maybe a .json? I'd use that as a reference for navigating around.

The problem you're encountering probably won't be solved by using typed dicts or dataclasses, because at the end of the day you'll still have a nested data structure where it's not obvious how you'd access some specific element 4 levels down without referring to an example. What you need is documentation for the data structure. If that doesn't exist, then you need to write it. For common but tricky access patterns, write functions that capture the pattern for future development.
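Something like this, where the key names are made up but the digging lives in one named, documented place:

Python code:
def get_s3_path(config: dict, item_name: str) -> str:
    """Return the s3 path for one item out of the big nested config dict."""
    return config["items"][item_name]["storage"]["s3_path"]

# usage, given an already-loaded config dict:
# s3_path = get_s3_path(config, "item_x")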

punished milkman
Dec 5, 2018

would have won

QuarkJets posted:

Shouldn't you have one (or more) of these dictionaries as an input to your unit tests, hanging around in your test directory as maybe a .json? I'd use that as a reference for navigating around.

The problem you're encountering probably won't be solved by using typed dicts or dataclasses, because at the end of the day you'll still have a nested data structure where it's not obvious how you'd access some specific element 4 levels down without referring to an example. What you need is documentation for the data structure. If that doesn't exist, then you need to write it. For common but tricky access patterns, write functions that capture the pattern for future development.

lol, I wish this codebase had unit tests. Instead we have a few end-to-end tests that take an obscene amount of time to execute.

thanks for the advice. makes sense, I should just spearhead writing some tests and document the shape of these things.

Foxfire_
Nov 8, 2010

punished milkman posted:

so the question is, how would you refactor this codebase to make the structure of the dictionaries more well known to the developers? typed dicts? dataclasses? just wondering what i can do to make developing less of a headache in the future (ideally without having to rewrite thousands of lines of legacy code)

Dataclasses or something else discoverable by your IDE is the best I can think of without changing lots of stuff. Finding config.thing1.aws.path is a lot easier if each level is an autocomplete from a list of choices instead of a hand-typed string from nothing. And you want typos to be statically findable by a linter instead of being a runtime KeyError.
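Something in this direction, sketched with made-up names:

Python code:
from dataclasses import dataclass

@dataclass
class AwsConfig:
    path: str

@dataclass
class Thing1Config:
    aws: AwsConfig

@dataclass
class Config:
    thing1: Thing1Config

config = Config(thing1=Thing1Config(aws=AwsConfig(path="s3://bucket/key")))
print(config.thing1.aws.path)  # every level autocompletes; a typo is a linter error, not a runtime KeyError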

Seventh Arrow
Jan 26, 2005

This is a neat coincidence because I had an interview with Reddit the other day where I had to sum certain values from a bunch of nested dictionaries.

So as not to create a wall of text, I'll post the dataset elsewhere:

https://pastebin.com/z8L0eUdC

As you can see, at the bottom I was going to iterate through every dictionary level to get to the pageviews but this struck me as being inefficient (also I don't think I got to the right "level" of dictionary).

I thought of maybe popping the sub-sub-dictionary and putting the results in a new, blank dictionary to make it easier to parse. I ran out of time, however, so I wasn't able to pursue this - so I'm wondering what's the most efficient, non-memory-eating, non-bandwidth-consuming way to do this?

The thought of using a regex just occurred to me, but that might be even worse.

QuarkJets
Sep 8, 2008

Foxfire_ posted:

Dataclasses or something else discoverable by your IDE is the best I can think of without changing lots of stuff. Finding config.thing1.aws.path is a lot easier if each level is an autocomplete from a list of choices instead of a hand-typed string from nothing. And you want typos to be statically findable by a linter instead of being a runtime KeyError.

These are examples of the benefits of dataclasses, which are generally just way superior to dictionaries. But I think the OP's problem isn't solved by just converting nested dicts to nested dataclasses; what they have is a data structure design issue, where it's difficult to find specific fields relative to other fields because there are both too many structure layers and the connections between layers are difficult to understand.

QuarkJets
Sep 8, 2008

Seventh Arrow posted:

This is a neat coincidence because I had an interview with Reddit the other day where I had to sum certain values from a bunch of nested dictionaries.

So as not to create a wall of text, I'll post the dataset elsewhere:

https://pastebin.com/z8L0eUdC

As you can see, at the bottom I was going to iterate through every dictionary level to get to the pageviews but this struck me as being inefficient (also I don't think I got to the right "level" of dictionary).

I thought of maybe popping the sub-sub-dictionary and putting the results in a new, blank dictionary to make it easier to parse. I ran out of time, however, so I wasn't able to pursue this - so I'm wondering what's the most efficient, non-memory-eating, non-bandwidth-consuming way to do this?

The thought of using a regex just occurred to me, but that might be even worse.

Popping the dictionaries would be more expensive than just iterating over the nested input, and the iterating idea boils down to just creating a generator expression. Have you used generator expressions before? Off the top of my head that's probably the most efficient way to solve this problem, maybe someone else will bring up a better alternative.

Python code:
data = {
  "2022-09-03": {
    "android": {
      "pageviews": 420993,
      "uniques": 272844,
    },
    "ios": {
      "pageviews": 265939,
      "uniques": 174860,
    },
  },
  "2022-09-04": {
    "android": {
      "pageviews": 420993,
      "uniques": 272844,
    },
    "ios": {
      "pageviews": 265939,
      "uniques": 174860,
    },
    # etc.
  }
}

def sum_page_views(input_data, min_date):
    pageviews = (input_data[date][platform]["pageviews"]
                 for date in input_data
                 if date >= min_date
                 for platform in input_data[date])
    return sum(pageviews)

result = sum_page_views(data, "2022-09-09")
print(result)
I'm certain that a regex would be a bad choice

QuarkJets fucked around with this message at 00:23 on Sep 18, 2022

Seventh Arrow
Jan 26, 2005

I haven't used generator expressions before but thank you, I will look them up.

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

punished milkman posted:

let’s say you’re working with a legacy codebase (python 3, untyped) that relies heavily on shuffling around various huge nested dictionaries across your functions. these dictionaries usually follow an internally consistent structure, but knowing what their contents are is a huge pain in the rear end. e.g i regularly encounter situations where i’m like okay i’ve got the “config” dictionary… how the hell do i dig out the s3 path for item x? these things are very opaque and you need to either a) refer to old code that gets the item you want or b) log out the contents of the whole dict and figure out how to dig through it. not ergonomic at all.

so the question is, how would you refactor this codebase to make the structure of the dictionaries more well known to the developers? typed dicts? dataclasses? just wondering what i can do to make developing less of a headache in the future (ideally without having to rewrite thousands of lines of legacy code)

This is my life right now and I absolutely hate it and am trying to fix it up. (I also don't have a handy test dictionary to refer to either, since this part of the codebase isn't tested).

I agree with QuarkJets that dataclasses aren't a silver bullet for this situation, but I do think they help a ton, just like FoxFire_ mentioned.

A thing to keep in mind though: refactoring away from dictionaries is pretty difficult, because it's very easy to miss adding or removing elements somewhere along the chain if they're not well handled, and if you don't have tests that's gonna be a problem.

IMO here's what my plan is for doing this stuff, and what I've done for the little portions I've worked out so far: (Do one dictionary at a time, btw)

- First, the easiest way to solve this in the short term is to (as QuarkJets mentioned) just go dig back until you find where the dictionary is instantiated, then open your favorite notetaking application and start jotting down what's in there. You can probably even throw a breakpoint in and print the contents in the debug console wherever you're working.

- Now that you have a convenient schema for it, work through and make sure the dictionary isn't getting arbitrarily updated anywhere. IDE searches help a lot for this, but you can also just walk along the code manually if you need to. If the schema ever gets mutated, note where and how and what.

- Create a dataclass to replace your dictionary. As a general rule, make sure you're declaring all your attributes in the dataclass (so for example, if you don't instantiate with 'USER_ID' but add it later, still declare it in your dataclass class definition; just set it to None or an appropriate default value on creation).

Python code:
from dataclasses import dataclass
from enum import Enum

class ButtColor(Enum):  # stand-in Enum so the example runs
    BLUE = "blue"
    RED = "red"

@dataclass
class Butts:
    """This is a Butt. It does ButtStuff."""
    butt_length: int
    butt_depth: int
    butt_color: ButtColor  # This is an Enum!
    butt_id: int = 0  # This is not known during creation, so we default to 0

new_butt = Butts(10, 10, ButtColor.BLUE)

Now, start swapping things out. If you don't have test cases, I sure hope you're good at manually rerunning your code to cover your bases, because this is tricky since you're moving stuff around a lot.

Once you're done with one dataclass, make sure to commit your changes before moving onto the next. I'm by no means an expert, but this process works out alright for me so far.

QuarkJets
Sep 8, 2008

Also if at some point you find yourself needing a dictionary form of your dataclasses, there's a built-in function that does that. IIRC it's dataclasses.asdict()
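It's a module-level function rather than a method, and it recurses into nested dataclasses, lists, and dicts. A tiny example:

Python code:
from dataclasses import dataclass, asdict

@dataclass
class Point:
    x: int
    y: int

print(asdict(Point(1, 2)))  # {'x': 1, 'y': 2}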

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

QuarkJets posted:

Also if at some point you find yourself needing a dictionary form of your dataclasses, there's a built-in function that does that. IIRC it's dataclasses.asdict()

Yep - and if you need special handling to serialize your data (for example, if you have classes that don't handily serialize), there's a dict_factory argument that lets you pass in a function to handle the processing.
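A rough sketch of that, converting Enum members on the way out (the names here are made up):

Python code:
from dataclasses import dataclass, asdict
from enum import Enum

class Color(Enum):
    BLUE = "blue"

@dataclass
class Widget:
    name: str
    color: Color

def enum_safe(pairs):
    # dict_factory is called with a list of (key, value) pairs at each level
    return {k: (v.value if isinstance(v, Enum) else v) for k, v in pairs}

print(asdict(Widget("knob", Color.BLUE), dict_factory=enum_safe))
# {'name': 'knob', 'color': 'blue'}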

duck monster
Dec 15, 2004

Foxfire_ posted:

This is generally desirable, otherwise every module that's doing stuff like `import collections` is reloading stuff, then the reloaded stuff reloads more stuff, ... until every import is reloading large parts of the standard library. A better design for the language would involve hashing the disk file and giving a warning/error if it has changed, but CPython is lazy

It's not just laziness. Under normal circumstances NOT reloading is absolutely best practice, especially where object-oriented code (or really anything with structured data) is concerned.

If you instantiate a class from that imported module, then change that module, does your instantiated object reflect the new definition or remain an instance of the old definition? If it's updated, does __init__ get called again to update the members? Is that wise? Where do the missing values come from, and where do the now-undeclared values go?

Tracking these changes is a huge and deeply problematic task, so the answer would be to leave old instances pointing to a (presumably in-memory) copy of the old definition. But now we have new problems. Is the old class equal to the new class? What happens with inheritance? Can static members of the new class comprehend the old class? On it goes.

Yeah, Django and many frameworks do in fact do hot-reloading, but I'd argue that's only fine for in-dev code: when crashes happen there, the workloads are transactional HTTP requests, so the chance for failure is significantly reduced. Dear god you would not want this for live code, and since Jupyter isn't transactional (in the HTTP sense), I'd argue the chance of genuinely baffling conflicts arising out of it is dramatically higher.

If you *really* must, there are libraries that implement hot-reload, and I dare say these probably work in Jupyter kernels too. But remember, you may well be opening up a hell portal. (There are also built-in/standard-library ways of forcing a reload, but buyer beware.)
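A quick illustration of the stale-instance problem with the standard-library route (mymodule is a hypothetical module you edited):

Python code:
import importlib

import mymodule  # hypothetical module defining class Thing

obj = mymodule.Thing()

# ... edit mymodule.py on disk, then:
importlib.reload(mymodule)

print(isinstance(obj, mymodule.Thing))  # False: obj still points at the old class object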

duck monster fucked around with this message at 01:18 on Sep 19, 2022

Foxfire_
Nov 8, 2010

The laziness I was talking about was CPython not detecting and warning/erroring when it encounters an import statement for a module whose disk content doesn't match the already-loaded content. I agree reloading is untenable. I think a good language/implementation should be yelling at you because 99% of the time, a difference between the on-disk and the in-use module is a sign that your program is about to do something wrong and confusing, and it's an easy situation for the interpreter to detect.

darkforce898
Sep 11, 2007

Falcon2001 posted:

This is my life right now and I absolutely hate it and am trying to fix it up. (I also don't have a handy test dictionary to refer to either, since this part of the codebase isn't tested).

I agree with QuarkJets that dataclasses aren't a silver bullet for this situation, but I do think they help a ton, just like FoxFire_ mentioned.

There is also this package that I have used before that could be useful

"marshmallow: simplified object serialization — marshmallow 3.18.0 documentation" https://marshmallow.readthedocs.io/en/stable/

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

darkforce898 posted:

There is also this package that I have used before that could be useful

"marshmallow: simplified object serialization — marshmallow 3.18.0 documentation" https://marshmallow.readthedocs.io/en/stable/

I considered using Pydantic, which seems similar-ish, but I was waiting on them to fix some sort of vulnerability I saw on our internal repo mirror. I'll go back and look again, as I'm about to go do another round of this stuff.

Edit:

On the topic of documenting data structures, is that something that should live outside of your code base or just properly documented/annotated within it? This is assuming that we're talking about something that isn't exposed to customers or somewhere else you'd need to make sure it's available to outside folks.

Falcon2001 fucked around with this message at 07:57 on Sep 20, 2022

duck monster
Dec 15, 2004

Falcon2001 posted:


On the topic of documenting data structures, is that something that should live outside of your code base or just properly documented/annotated within it? This is assuming that we're talking about something that isn't exposed to customers or somewhere else you'd need to make sure it's available to outside folks.

Confluence writeups for *everything* is my philosophy. But ultimately how much is enough is a pretty subjective matter. Pydantic is pretty self-documenting, but documentation ought to be more than what the field names and types are; it should also cover what those structs are used for and how the usual business logic of the app creates, consumes, mutates, stores and destroys them.

Well in my view anyway. But I'm old-school about this stuff. Because I'm dumb and old, I really like documentation when I start new jobs because I'm too stupid to learn fast like the kids, so I like to document for when the next guy is hired. YMMV.

duck monster
Dec 15, 2004

Has anyone got any experience with django-ninja and django-ninja-extras?

I really like the idea of a dataclasses/pydantic-structured OpenAPI using our existing django backend, but I seem to be unable to get both pagination and sane exception handling at the same time. It just seems to either poo poo the bed or create undefined behaviors (like feeding the pk into every field or other weird poo poo) when exceptions are raised.

The documentation seems to have examples of exception handling, or examples of pagination, but as soon as you combine them the whole thing catches on fire.

I'm sure I'm doing something wrong, but what? It's just not documented! It's like if pagination is added the only schema it can output is pagination, and surely that's an oversight that an entire-rear end open source project would have anticipated, because literally every major api imaginable has both exception handling [unless written by incompetents] AND pagination.

FredMSloniker
Jan 2, 2008

Why, yes, I do like Kirby games.
I'm generating an SQLite database as a way of not having to store a very large dictionary in memory simultaneously. If I use isolation_level = None, it runs very slowly because of all the disk access. If I use commit() every five seconds or so, it runs much faster. But I'm wondering if there's a way to do it better. Is there some way to say something like:
code:
if size_of_transaction(con) > maximum_size_of_transaction:
    con.commit()

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

FredMSloniker posted:

I'm generating an SQLite database as a way of not having to store a very large dictionary in memory simultaneously. If I use isolation_level = None, it runs very slowly because of all the disk access. If I use commit() every five seconds or so, it runs much faster. But I'm wondering if there's a way to do it better. Is there some way to say something like:
code:
if size_of_transaction(con) > maximum_size_of_transaction:
    con.commit()

What package/etc are you using for this?

FredMSloniker
Jan 2, 2008

Why, yes, I do like Kirby games.

Falcon2001 posted:

What package/etc are you using for this?

I'm not sure what you mean. I'm using 3.10.1 for Windows, downloaded straight from their site, and the sqlite3 module. Is that what you wanted, or...?

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug

FredMSloniker posted:

I'm not sure what you mean. I'm using 3.10.1 for Windows, downloaded straight from their site, and the sqlite3 module. Is that what you wanted, or...?

I forgot there was a core library sqlite option, my bad.

Foxfire_
Nov 8, 2010

FredMSloniker posted:

I'm generating an SQLite database as a way of not having to store a very large dictionary in memory simultaneously. If I use isolation_level = None, it runs very slowly because of all the disk access. If I use commit() every five seconds or so, it runs much faster. But I'm wondering if there's a way to do it better. Is there some way to say something like:
code:
if size_of_transaction(con) > maximum_size_of_transaction:
    con.commit()

The purpose of transactions is data integrity, not performance. A feature like "automatically commit the previous transaction and start a new one at unpredictable, size-based times" wouldn't make sense.

It sounds like you don't care about this data across crashes, so either:
pragma synchronous off to disable all waiting for disk flushing
or
Write-ahead log mode so that there is only one physical write for each logical write
might help you

e:

also, if you use an empty string as the database name, you will get a temporary database that will mostly live in memory cache except for the parts that don't fit, then get deleted when the db connection is closed
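With the standard-library sqlite3 module that looks roughly like this (the filename is just a placeholder):

Python code:
import sqlite3

con = sqlite3.connect("states.db")
con.execute("PRAGMA journal_mode = WAL")   # write-ahead log: one physical write per logical write
con.execute("PRAGMA synchronous = OFF")    # don't wait for the OS to flush to disk

# or, for a throwaway database that lives mostly in memory and vanishes on close:
tmp = sqlite3.connect("")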

Foxfire_ fucked around with this message at 21:00 on Sep 28, 2022

FredMSloniker
Jan 2, 2008

Why, yes, I do like Kirby games.

Foxfire_ posted:

The purpose of transactions is data integrity, not performance. A feature like "automatically commit the previous transaction and start a new one at unpredictable, size-based times" wouldn't make sense.

Fair enough. I was just wondering if there was some way to check how big the pending transaction was.

Foxfire_ posted:

It sounds like you don't care about this data across crashes, so either:
pragma synchronous off to disable all waiting for disk flushing
or
Write-ahead log mode so that there is only one physical write for each logical write
might help you

e:

also, if you use an empty string as the database name, you will get a temporary database that will mostly live in memory cache except for the parts that don't fit, then get deleted when the db connection is closed

All of those look like potential good options for my use case (namely, being able to create a very large dictionary). Thanks for the tip! I'm still a novice with Python, and I'm literally looking up SQLite stuff as I go.

Foxfire_
Nov 8, 2010

You may want to rethink what exactly you're doing if it needs to be fast. Python code is very slow to execute and python objects are very bloated in terms of size. The usual way you operate with large datasets in python is by using python code to glue together pieces written in something else.

For example, if you had a time series of a million numbers to manipulate, you'd make a million-long numpy array. From python's point of view, the array is a single PyObject that was created by the numpy C library and has some opaque internal structure. When you do something like adding two arrays, that's a single python call operating on two PyObject (that gets implemented as a loop across the internal structure in the C library). You want to avoid keeping data in python objects or ever causing them to be created (i.e. a python loop that iterates across a numpy array is very bad because each iteration is "Make a function call asking numpy to create a brand new PyObject that has the same value as that array slot")

A million-element float64 numpy array is about 8 MB of RAM. A million python float objects in a list is about 32 MB. Telling numpy to find the max value in that array takes 220us on my laptop. Having python iterate through the million python floats in a list takes 28ms, about two orders of magnitude slower.
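For a concrete feel (timings will vary by machine):

Python code:
import numpy as np

values = np.random.rand(1_000_000)   # one PyObject wrapping ~8 MB of raw float64s
fast = values.max()                  # single Python call; the loop runs inside numpy's C code

as_list = values.tolist()            # one million separate Python float objects
slow = max(as_list)                  # Python-level loop over PyObjects, orders of magnitude slower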

FredMSloniker
Jan 2, 2008

Why, yes, I do like Kirby games.
Okay, here's what I'm doing.

I'm working on a program to solve a game book, one that involves randomness so you can't just pick the optimal choice and win every time. With that in mind, I want to score every position you can be in, or 'game state', giving them expected values based on the randomness yet to be faced. I have a function that takes a game state and returns the possible game states you can go to from there, along with their weights if the choices in that game state are made randomly.

I'm using sqlite3 to create a table containing each possible game state as well as its expected score. Win states and lose states have 1 and 0 as their score, respectively. If a choice can be made freely, that state's score is the best score of its result states. If a choice is made randomly, that state's score is the weighted average of its result states' scores. Scores are worked back from the win and lose states to the states that go to them, and the states that go to them, and so on and so forth until I know the score of the beginning state, i.e. how likely you are to win the game book with perfect play.

I need to save that information so I can show the thread, for instance, that there's a 70% chance of victory if we take one path, but only a 40% chance of victory if we take the other, so that I can show them what the optimal path looks like.

I'm using Python because I already (sort of) know it and because it has the built-in sqlite3 module. If there's a language you think would be better suited for this task, let me know.

timbro
Jul 18, 2004
I haven’t managed to figure out exactly what you’re doing but maybe something in this helps, or not.

I was doing a thing recently where I had to calculate a lot of states for a dice game.

Two things that sped it up massively (like a 5 min runtime to a few seconds).

Realizing I could shove multiple calculations into a vector and work through numpy to do about 100 at once (numpy is optimised for this) rather than looping through each one.

Caching a load of values so I didn't have to keep looking them up: in my case a dict which I calculated ONCE rather than each time. You did say you didn't want to hold everything in memory, but I wonder if you can pull sections of your database in depending on the part of the game, so that you make the minimum number of transactions?

FredMSloniker
Jan 2, 2008

Why, yes, I do like Kirby games.
It's like this.

At some point in the book, I have 7/10 STAMINA, 7/7 SKILL, and 11/11 LUCK. I have a bow and quiver, a fine sword, an iron amulet, a lucky charm, and nose filters, as well as four Gold Pieces, two Silver Pieces, a torch, and two arrows. Also, I foiled a robbery at the warehouse of Venquis Glint, and it's Seaday. I'm reading Section 41d.

Calling get_state_complex(state_string), where state_string is a serialized representation of all of that information, returns the following information:
  • The decision we're making here is not a random one; we can choose from the options freely.
  • The text of the section is:

    Section 41d posted:

    Later, you hurry through the mud-slick streets into town to the Bitterford Council chambers, to see what labours need completing on this rather inauspicious day. Although Gilfas the Scribe peers disapprovingly at you over the top of his glasses when he sees how much mud you have tracked in with you, he still runs a finger down the page of his book to see what tasks lie therein.

    ‘Aha!’ he barks. ‘Two things today! Mistress Jhaleph has discovered a ruin of sorts – she thinks it is a crypt or mausoleum – lurking in the depths of the swamp on her lands, and she requires someone to help clear it of potential danger. She will pay 3 Gold Pieces for a successful job, but you will have to provide your own torch.’

    ‘And the other...’ you ask hopefully, for wading through a partially submerged tomb for three Gold Pieces and having to supply your own torch sounds like a bad idea. Unless of course there is burial treasure present...

    ‘Hmm...’ says Gilfas, peering at the scratchy writing on the page of his book. ‘My understudy Par-Quan wrote this, and his handwriting borders on the barbarian. Apparently, the farmer Norbet has found a long-abandoned well – I think that is what it says – on his land.’

    Gilfas pauses, attempting to decipher the rest of the message.

    ‘If what Par-Quan writes is true, Norbet believes there is, hmm, something, at the bottom of the well and will pay 3 Gold Pieces for you to dispose of it for him. Again, though, you will have to supply your own torch.’

    Alternatively, you could always go fishing, but only if you have fishing gear written down in the Equipment section of your Adventure Sheet.

    Finally, if you have neither torches nor fishing gear (or if you do not care for any of the other options), you may choose to spend the entire day back at your hut.
  • The section offers the following choices:
    • "Explore the sunken crypt: turn to 16." Choosing this changes the game state's section to 16.
    • "Investigate the abandoned well: turn to 26." Choosing this changes the game state's section to 26.
    • "Spend the day fishing in the Red River. You can't do that because you don't have fishing gear." This can't be chosen and is marked as such.
    • "Spend the entire Seaday at your hut. Turn to 45." Choosing this changes the game state's section to 45.
Note that this is a pretty simple example; choices can change stats, add or remove gear, set or clear special flags, and even kill the character! And if the current choice is a random one, I need the relative weights of each choice.

Now. How likely am I to beat the game book for each of those options?

My current approach is that, if I can make a free choice in a given game state, the score of that state is the best score of the available choices. If the choice is random, it's the weighted average of the available choices. However, some game states are winning or losing game states. Instead of that information, get_state_complex(state_string) returns the score of the state and the text of the winning/losing section. I can work backward from the winning and losing sections to determine the scores of the sections that visit them, then back to determine the scores of the sections that visit them, and so on and so forth back to the initial game state; if I save all of this information, I now know, given any possible game state, what my expected score is for each of the choices on offer. I also know what my expected score for the entire game is, i.e. how likely it is to beat the book with perfect play and fair rolling.

I'm using Python because it lets me have a big ball of game state stuff and not have to go through and define the shape of state data bit by bit. It also gives me sqlite3, which I can use to store all the possible game states and their scores without my computer's RAM spontaneously combusting. However, I would like to have the data come out this decade, so if I have to learn some other language... well, point me in the direction of the best tool for the job, and I'll dig out its instruction manual.
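A rough sketch of that scoring rule (the memoisation assumes the state graph has no cycles, and I'm pretending get_state_complex hands back (next_state, weight) pairs for the non-terminal case, which may not match the real return format):

Python code:
def score(state_string, cache):
    """Expected win probability of a state, working backward from terminal states."""
    if state_string in cache:
        return cache[state_string]
    is_random, text, results = get_state_complex(state_string)  # results assumed: [(next_state, weight), ...]
    if not isinstance(is_random, bool):      # terminal: is_random holds the score (1 win, 0 loss)
        value = is_random
    elif is_random:                          # random choice: weighted average of successors
        total = sum(w for _, w in results)
        value = sum(w / total * score(s, cache) for s, w in results)
    else:                                    # free choice: best available successor
        value = max(score(s, cache) for s, _ in results)
    cache[state_string] = value
    return value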

timbro
Jul 18, 2004
Ok so I’m not that good at this. But it sounds like if we simplify this then you are in a state x and there are possible states to go to from here.

Are you using value/policy iteration?

Falcon2001
Oct 10, 2004

Eat your hamburgers, Apollo.
Pillbug
I wonder if looking into how people build chess engines might be useful insight, since it sounds at least vaguely analogous.

Jabor
Jul 16, 2010

#1 Loser at SpaceChem
Python seems fine for this. If I remember my Fighting Fantasy correctly, you're looking at maybe a couple of hundred sections per book? This is on the level where getting "good enough" performance means putting your algorithm together in an effective way, rather than writing it in the most high-performance language.

The big concern I have is stat checks - for example, fights could leave you with a whole bunch of different stamina values at the end of them, and if you go and construct an entirely separate graph of followup states and duplicate all the work going forwards for every single possible fight outcome, then your solution is likely to be too slow no matter what language you write it in.

But the thing about stat checks is that they don't change the shape of the graph at all, they just change the odds of each outcome. You could instead construct and simplify the graph first, reducing each possible path to the series of fights and stat checks it will require, and then go through and evaluate the actual probabilities based on the simplified graph.

FredMSloniker
Jan 2, 2008

Why, yes, I do like Kirby games.

timbro posted:

Ok so I’m not that good at this. But it sounds like if we simplify this then you are in a state x and there are possible states to go to from here.

Are you using value/policy iteration?

I don't even know what that means. :downs:

Falcon2001 posted:

I wonder if looking into how people build chess engines might be useful insight, since it sounds at least vaguely analogous.

I know chess engines can't enumerate the entire state space like I'm trying to, because it's just too big. I don't think the game book's state space is that big, but I've been wrong before. And I do want to get exact probability of success if possible, not just 'this is probably the best move'.

Jabor posted:

The big concern I have is stat checks - for example, fights could leave you with a whole bunch of different stamina values at the end of them, and if you go and construct an entirely separate graph of followup states and duplicate all the work going forwards for every single possible fight outcome, then your solution is likely to be too slow no matter what language you write it in.

Yeah, I'm trying to shrink the state space as much as possible. For instance, if I only have one choice at the current state, either because it's a simple 'turn to 21' or because the other choices aren't available to me for whatever reason, I'm doing what I'm calling eliding that state into the choice's state, essentially assigning the second state's choices to the first state and returning the text from both. I'm also being careful to delete information from the state when it's no longer relevant, because the only difference between us having a dagger and selling it and us never having a dagger is the amount of money we have.

And stat checks only have two outcomes, the success outcome and the failure outcome; they're weighted appropriately for the probability of success. There's still room for improvement in combat, but it's something I'll have to put my big brain hat on to do right, and I want to see first if there's anything that's going to cause me to want to dump a bunch of code and start over.

Foxfire_
Nov 8, 2010

I would approach the problem like this:

For explanation, consider games of increasing complexity:

Game Version I: No items or combat, just free or random choices
The game can be modeled as a directed graph. We want to label each node with the probability of winning from there. The winning node can be trivially labeled with p=1 and all death nodes with p=0.

Then repeat for each upstream node:
- If this node's exits are free choices, label this node with whichever of the exits has higher probability (If section 100 has choices to section 110 and section 120, and we already know that 110 has p=0.75 and 120 has p=0.8, label 100 with p=0.8 because we can always go the more successful direction).
- If this node's exits are random choices, label this node with the exit probabilities combined with the choice probabilities. (If section 100 has a 1/4 chance to go to section 110, 2/4 to go to section 120, and 1/4 to just die, 100 gets p = (0.25*0.75) + (0.50*0.8) + (0.25*0) = 0.5875)
- If the upstream node is already labeled (there is another way we've seen to win from there), pick whichever label is the higher probability

Repeat for all upstream nodes until you get to the start node. Its label is the winning probability.

Game Version II: Add combat
Each node no longer has the same winning probability every time we hypothetically enter it. Instead there's a MAXSKILL x MAXSTAMINA matrix of probabilities where each entry is "If you entered this node with these skill/stamina values, this is your winning probability". Using 4-byte floats and letting each number go from 0-31, that's 4 KiB per node, so not big.

Make a combat oracle subprogram that answers the question "If I start this combat with X skill and Y stamina, what are my chances of (1) dying, (2) living with Y stamina, (3) Y-1 stamina, (4) Y-2 stamina, ...". Either Monte Carlo it, or I think you could also build up a table for it with precise answers (each answer is either the odds of winning, or the odds of winning with player/enemy having smaller health numbers, so you can start from all the health=1 entries which are easy to do directly, then table upward from there).

The bigger game node table filling in part works similarly to before:
- All matrix entries are 1 for the winning node and 0 for death nodes (you win/lose for any skill/stamina values)
- To fill in a combat node, you go through each (skill, stamina) possibility:
* Ask the combat oracle for the odds of each outcome
* For each of those outcomes, multiply the odds of it happening by the win probability for that cell in the exit node's table

Like if 5 skill, 4 stamina on page 200 gives: 10% death, 10% 1 stamina left, 20% 2 stamina left, 30% 3 stamina left, 30% 4 stamina left and survival takes you to page 300, you'd label the P200[5, 4] matrix entry with (10%*0) + (10% * P300[5,1]) + (20% * P300[5,2]) + (30% * P300[5,3]) + (30% * P300[5,4])
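In numpy terms, that one entry works out like this (the oracle output is made up for illustration):

Python code:
import numpy as np

p300 = np.zeros((32, 32))  # win probabilities for the exit node (page 300), indexed [skill, stamina]

# hypothetical combat-oracle output for entering page 200's fight at skill 5, stamina 4:
# None = death; otherwise the key is the stamina remaining on survival
outcomes = {None: 0.10, 1: 0.10, 2: 0.20, 3: 0.30, 4: 0.30}

p200_5_4 = sum(prob * (0.0 if stam is None else p300[5, stam])
               for stam, prob in outcomes.items())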

Adding luck can be done by adding another dimension, a 0-31 luck value is still only: 4 bytes * 32 * 32 *32 = 0.125 MiB

Game Version III: Items
The direct way to mix in items would be to extend the win matrix by a dimension per item you have/don't have, but that explodes quickly. Adding 20 possible items to our 32x32x32 per node win matrix to make it 32x32x32x2^20 balloons it to 128 GiB.

We can work around this because most gamebook items are just plot locks. If Lone Wolf doesn't go through a page giving him the Iron Key, he has a blocked choice on the page with the Square Lock. At the start, unfold the graph for all of that. At the node where the key is offered, have edges to both a "Picked up the key" and "Didn't pick up the key" node of the following pages. There aren't usually that many items that actually do stuff (erase all the red herring ones upfront), so this shouldn't be that many doublings. And crucially, you're not doubling the combat matrix size you have to deal with at once

For the remaining items that do touch combat, add them to the win matrices as additional dimensions. Hope that there aren't that many; 10 binary options brings us up to 128 MiB per node, and it rapidly becomes untenable beyond that, but I think that's an overestimate for most books. You're also only handling a couple of those matrices at once, since most nodes only lead to a handful of others and you can compute all of this node's win probabilities from its direct neighbors.

Herbs can be done with non-binary dimensions. Like if you want to contemplate up to 5 laumspur, have a laumspur dimension with the win probabilities for if you start the fight with 0, 1, 2, 3, 4, or 5 (skill X stamina X luck X #laumspurs you have X have mindblast or not X have sommersord or not X ...).

Skills you can kind of cheat at because they're 'items' that you have for the entire game and can't drop. So you could eliminate them from the problem so they're not adding combinatorial explosion to the internal table sizes, then run the whole thing for every combination of initial skill choices. 19 skills choose 5 is still only 12,000 or so repetitions (and you could probably skip the stupider weaponskill choices upfront).

Gold and Having a Bow and Wanting To Use it are problematic, but could probably be adapted in with more thought.


How are you organizing your stuff now? It sounds like you're trying for a heuristic scoring instead of a precise probability?

Foxfire_ fucked around with this message at 01:54 on Sep 29, 2022

QuarkJets
Sep 8, 2008

What's the scale here? Do you need a 10 GB table written to disk? 100 GB? 1000 GB?

I'm questioning the need to write to disk at all

FredMSloniker
Jan 2, 2008

Why, yes, I do like Kirby games.

QuarkJets posted:

What's the scale here? Do you need a 10 GB table written to disk? 100 GB? 1000 GB?

I'm questioning the need to write to disk at all

I need the scorer to write the table to disk because I'm using a separate program to display the data from it.

I realized something last night. My laptop has a fan that spins up when the computer is working particularly hard, but I hadn't heard it during my run of the enumerator. A quick check told me that Python was only using one core, and a quick Google led me to the multiprocessing module. My current code looks a bit like this:
code:
# code used to import, define functions, et cetera, and load things into the to_enumerate list omitted
while len(to_enumerate) > 0:
    # code used to show a progress report omitted
    state_string = to_enumerate.pop(0)
    if state_string: # if not, this isn't a valid state string. TODO: don't put invalid state strings in to_enumerate in the first place.
        already_there = con.execute("SELECT * FROM state_scores WHERE state_string = ?", (state_string,)).fetchall()
        if not already_there:
            state_complex = get_state_complex(state_string)
            if state_complex: # if not, this game state isn't defined, probably because I haven't put every page in the gamebook in yet.
                is_random, text, result_keys = state_complex
                if type(is_random) == bool: # if not, is_random is actually the score of this position, which is either winning or losing.
                    # the keys of result_keys include information not relevant to the enumerator.
                    for result_key in result_keys:
                        to_enumerate.append(loads(result_key)[0])
                    thing_to_store = (state_string, minimum_score, maximum_score)
                else:
                    thing_to_store = (state_string, is_random, is_random)
                con.execute("INSERT INTO state_scores VALUES(?, ?, ?)", thing_to_store)
    # code used to show a progress report and to commit roughly every five seconds omitted.
con.commit() # must commit before closing!
How can I speed this up with multiprocessing? My first thought is to do this:
code:
# code used to import, define functions, et cetera, and load things into the to_enumerate DICTIONARY omitted. Using dictionary instead of list to avoid putting duplicate entries into to_enumerate.
while len(to_enumerate) > 0:
    # code used to show a progress report omitted
    new_to_enumerate = {} # to keep each pass through the states to enumerate separate, so two processes aren't fighting over the same state
    with multiprocessing.Pool(processes = max_cpus) as pool: # max_cpus will be defined as multiprocessing.cpu_count() multiplied by some constant <= 1
        for state_string in to_enumerate:
            # The code that was previously in the block under 'if state_string' is now in the function process_state_string, with the bug of adding invalid state strings removed.
            pool.apply_async(process_state_string, state_string, new_to_enumerate) # passing new_to_enumerate so process_state_string can update it
            # code used to show a progress report omitted.
        pool.close()
        pool.join()
    con.commit()
    to_enumerate = new_to_enumerate
Is that correct? Are there other things I should be doing?

Foxfire_
Nov 8, 2010

You shouldn't be trying to reuse the same database connection in multiple processes, that is likely to either crash, deadlock, or corrupt the database file.

The function that you are telling multiprocessing to run in many processes should be a pure function; it should never touch any global state and all of its parameters should be copyable without any reference to anything else in the program. You can get away with having it read global state if you have to, but there are more footguns there.

What conceptually happens inside multiprocessing is:
- The calling interpreter serializes all the arguments for the call
- A new interpreter is started
- The new interpreter imports the same modules as the old one had imported
- The new interpreter deserializes the arguments
- The new interpreter runs the code
- The new interpreter serializes the return
- The calling interpreter deserializes the return

1) Global stuff besides what is done by import won't exist in the new interpreter and any changes it makes to global state won't be copied back.
2) All the arguments to the parallel function need to be serializable and any modification won't be written back. Like if you passed in a list and the function changed it, the calling process won't see those changes because they were made on a copy.


Also, because of a historical bug and lots of programs that rely on the illegal behavior, multiprocessing's default implementation on linux is broken. Python on windows and mac default to the correct behavior.

To fix it on linux, you must call multiprocessing.set_start_method('spawn') exactly once before doing anything that could cause multiprocessing to start a new process. Otherwise, the child processes may randomly crash/deadlock. (multiprocessing tries to use a fork() call as a parallelism construct in a way that has always been forbidden by POSIX, but will usually not break dramatically if nothing anywhere in the calling process happened to be holding a lock at the instant of the call)
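Putting that together, the safe shape looks roughly like this (the worker body is a stand-in for your real per-state work; all database writes stay in the parent process):

Python code:
import multiprocessing as mp

def process_state_string(state_string):
    # pure worker: picklable argument in, picklable result out, no globals, no db connection
    return {state_string + "-child": True}   # placeholder for real successor states

if __name__ == "__main__":
    mp.set_start_method("spawn")             # safe on every platform; the mandatory fix on linux
    to_enumerate = ["a", "b", "c"]
    with mp.Pool() as pool:
        results = pool.map(process_state_string, to_enumerate)
    new_to_enumerate = {k: v for d in results for k, v in d.items()}
    # do the INSERTs and con.commit() here, in the parent, from new_to_enumerate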

FredMSloniker
Jan 2, 2008

Why, yes, I do like Kirby games.
Good advice should I take this to Linux, but I'm running Windows. As for the database connection... would this work?
code:
def make_table_entry(state_string):
    new_enumerate_entries = {}
    cur = con.cursor()
    already_there = cur.execute("SELECT * FROM state_scores WHERE state_string = ?", (state_string,)).fetchall() # is this safe?
    if not already_there:
        state_complex = get_state_complex(state_string)
        if state_complex: # if not, this game state isn't defined, probably because I haven't put every page in the gamebook in yet.
            is_random, text, result_keys = state_complex
            if type(is_random) == bool: # if not, is_random is actually the score of this position, which is either winning or losing.
                # the keys of result_keys include information not relevant to the enumerator.
                for result_key in result_keys:
                    new_enumerate_entries[result_key[0]] = True
                thing_to_store = (state_string, minimum_score, maximum_score)
            else:
                thing_to_store = (state_string, is_random, is_random)
            cur.execute("INSERT INTO state_scores VALUES(?, ?, ?)", thing_to_store) # is this safe?
    return new_enumerate_entries

Then, in the main loop, so far I have:
code:
while len(to_enumerate) > 0:
    # code used to show a progress report omitted
    new_to_enumerate = {} # to keep each pass through the states to enumerate separate, so two processes aren't fighting over the same state
    with multiprocessing.Pool(processes = max_cpus) as pool: # max_cpus will be defined as multiprocessing.cpu_count() multiplied by some constant <= 1
    
        # code goes here
        
        # code used to show a progress report omitted.
        pool.close()
        pool.join()
    con.commit()
    to_enumerate = new_to_enumerate
I'm not sure what to put in code goes here, though. If I'm reading the documentation correctly, I could use:
code:
    with multiprocessing.Pool(processes = max_cpus) as pool:
        for new_enumerate_entries in pool.map(make_table_entry, to_enumerate, int(len(to_enumerate) / max_cpus)): # that's the right chunk size for this use, right?
            for new_enumerate_entry in new_enumerate_entries:
                new_to_enumerate[new_enumerate_entry] = True
But that sits on the line with pool.map until the whole thing's processed, right? I'd like to have some sort of progress report as it goes. I thought about doing:
code:
    with multiprocessing.Pool(processes = max_cpus) as pool:
        pool.map_async(make_table_entry, to_enumerate, len(to_enumerate) / max_cpus, callback_function)
        while condition: # condition being 'we haven't collected every result from map_async yet'
            if time_up(): # a function I wrote that returns True when it's been about a second since the last time it returned True
                # code goes here to show my progress so far
But while callback_function can take new_enumerate_entries as an input, it can't access new_to_enumerate for writing, right? And I don't know what to use for condition.

As you can see, I've done as much as I can with my current knowledge - indeed, I've probably done some dumb stuff without knowing better. Suggestions?

Wallet
Jun 19, 2006

FredMSloniker posted:

As you can see, I've done as much as I can with my current knowledge - indeed, I've probably done some dumb stuff without knowing better. Suggestions?

The simple way to do multithreaded processing here would be to basically pass a list of jobs that will create those table entries (along with whatever data they need to do that) and then have your main thread deal with writing what they calculate when they return it. If your bottleneck is literally processor time, parallelizing it will help performance, but it's also going to introduce a lot of complexity. Unless you already have a working example of what you want to do that isn't performant enough when single-threaded, I wouldn't bother at this point.

Trying to have multiple threads acting on your database at the same time on their own connections is going to create weird poo poo you have to deal with, particularly if you are trying to calculate states that depend on the calculation of other states, as it sounds like you are.

Are there too many states for you to just hold the data you're operating on in RAM? If there aren't I'd just do whatever you need to do and dump it all at the end.

Wallet fucked around with this message at 21:36 on Sep 29, 2022
