NovemberMike
Dec 28, 2008

PilotEvan posted:

Okay, I have a pretty specific question. Say you had a huge amount of data with a tree-like relationship, maybe for example a list of all species names (and their respective genera, families, etc.). And say you wanted to make queries like, "Output all children (and grandchildren and so on) of the class Mammalia" (or phylum Arthropoda or whatever the hell). What kind of data structure would you want for that? I'm completely inexperienced with hash tables but I'm pretty sure that's the right answer. A straight-up tree data structure seems like it might make traversing and outputting the data simpler/faster, but after doing some research it looks like it might be better to do some kind of hash list or even a hash tree to have pointers(?) to a child's parents or adjacent siblings (assuming there is ordering between siblings), with an added bonus of finding the starting node I want to begin with in a reasonable amount of time. If there aren't any changes made to the data set whatsoever, it makes me think that some augmented hash table is probably the way to go, but I'm still unsure. If hash tables do the trick I'm open to any book recommendations or other references that might put me on the right track. :)

Well, one of the big things you want is the path to the origin. With something like a list of species it might make sense to force the user to specify the path to the origin, but if you can't do that then it might make sense to make a hash set that you can use to find a reference to the origin node. Once you have the origin node you just recurse through the children.

The big question is what your constraints are. If you're storage limited that's a different problem than if you need it to be as fast as possible.
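A minimal sketch of that idea in Python, with invented node names: build a name-to-node map once, then recurse from whatever origin the lookup returns.

```python
# Sketch of "hash lookup to find the origin, then recurse through the
# children". The Node class and the sample taxonomy are invented.

class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def build_index(root):
    """Map every node name to its node so lookups are O(1) on average."""
    index = {}
    stack = [root]
    while stack:
        node = stack.pop()
        index[node.name] = node
        stack.extend(node.children)
    return index

def descendants(node):
    """Yield every node below `node`, recursively."""
    for child in node.children:
        yield child
        yield from descendants(child)

# Usage: find the origin node by name, then walk its subtree.
root = Node("Animalia", [Node("Mammalia", [Node("Primates"), Node("Carnivora")])])
index = build_index(root)
print([n.name for n in descendants(index["Mammalia"])])
```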


coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY

If there aren't any changes after initialization, and if the dataset is small enough that you can keep a couple of pointers for each node in memory, my (completely uneducated) approach would be
  • Build a tree (with parent pointers as well as child)
  • Store the tree as an array. Just do an in-order traversal of the tree, throwing things into the array as you go.
  • Sort the array by key. Easiest method would be a selection sort, updating the pointers of each node (and those of its parent/children) as it's selected. You could use cycle sort if you want to minimize the amount of messing with pointers, or any in-place O(n lg n) sort (merge? heap?) if you want speed.
A node lookup costs O(lg(n)) rather than a hash table's O(1), but lg(n) grows so slowly and indexing into an array is so cheap that hidden constants are likely to make the array faster. As an added bonus, you don't have to gently caress around with hashes, woo.

Warning: if there's even the slightest chance that somewhere down the road, users will want to add stuff to the structure, go with a hash table/tree/whatever.
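For what it's worth, the lookup half of that scheme is just a binary search over the key-sorted array. A toy sketch with invented keys and values:

```python
import bisect

# Sketch: nodes stored as a key-sorted list, looked up by binary search.
# The (key, value) layout here is invented for illustration.
nodes = sorted([("Aves", 2), ("Mammalia", 0), ("Reptilia", 1)])
keys = [k for k, _ in nodes]

def lookup(key):
    """O(lg n) lookup into the sorted array, no hashing required."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return nodes[i][1]
    raise KeyError(key)

print(lookup("Mammalia"))
```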

coffeetable fucked around with this message at 05:08 on Sep 17, 2013

double riveting
Jul 5, 2013

look at them go

PilotEvan posted:

"Output all children (and grandchildren and so on) of the class Mammalia" (or phylum Arthropoda or whatever the hell). What kind of data structure would you want for that?
When you say this, are you thinking of having the node whose subtree you want as a string (i.e. user typed it or something), or are you getting the node directly, say by the user clicking through the tree?

For the tree itself, the data structure you want is called a rose tree, i.e. a tree with an arbitrary number of children from each node. Each node contains an element and a list (singly linked list will do fine) of subtrees. (Leaf nodes simply have an empty list of children.) Once you have a node in the tree you can recurse over its children easily.

Getting to a tree node, unless you are navigating the tree from the top, is a separate problem, for which you want an index: a table or something that lets you find a tree node from whatever kind of query you give it. If you want actual Google-like search, you will have to learn how search indexes work. If you are content with exact string matching on the name of the node, a hash table is indeed what you want. (A hash table is always good if you want to map arbitrary strings to arbitrary values.) So you would map string names to tree nodes.
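A rose tree really is that small. A toy sketch (the taxonomy and the tuple representation are improvised):

```python
# A rose tree is just a value plus a list of subtrees. Leaf nodes have
# an empty child list. The sample taxonomy below is invented.
def tree(value, *children):
    return (value, list(children))

def walk(node, depth=0, out=None):
    """Recurse over a node's children, collecting one indented line each."""
    out = [] if out is None else out
    value, children = node
    out.append(" " * depth + value)
    for child in children:
        walk(child, depth + 1, out)
    return out

t = tree("Arthropoda", tree("Insecta", tree("Coleoptera")), tree("Arachnida"))
print("\n".join(walk(t)))
```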

Done.

Vanadium
Jan 8, 2005

But he's using %d anyway, on what platforms does that print more than, like, 2^31? There's plenty of space, clearly.

shrughes
Oct 11, 2008

(call/cc call/cc)

Suspicious Dish posted:

The smallest 64-bit signed integer takes up 20 characters, including the negative sign. So unless you have some magical 65-bit signed integers, I'm not seeing the overflow.

Oh poo poo, I was typing in 2^64. Well never mind. I guess %d was the real problem.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Yeah, it's clearly badly written code and the '2^64+1' part makes no sense, but at least it isn't exploitably bad code.

Vanadium
Jan 8, 2005

What's he supposed to do in ANSI C tho :(

raminasi
Jan 25, 2005

a last drink with no ice
I feel really dumb for not being able to figure this out, but how the hell do I elegantly compare two periodic numbers (specifically angles) for equality? The inputs can be any real numbers, I don't get to restrict to positive numbers or numbers between 0 and 2 pi or anything.

JawnV6
Jul 4, 2004

So hot ...
mod them down to be between 0 and 2pi, subtract, divide by one of the inputs to get the relative error and compare the result to an epsilon. Check both against 0.0 to make sure you won't get NaN at some point.

Real math won't ever be 'elegant'.

Vanadium
Jan 8, 2005

You could modulo them to both be between 0 and 2 pi before comparing them, I think. Look up how your language's modulo operation works on negative numbers or w/e but you should be able to get angles normalized like that.

Seashell Salesman
Aug 4, 2005

Holy wow! That "Literally A Person" sure is a cool and good poster. He's smart and witty and he smells like a pure mountain stream. I posted in his thread and I got a FANCY NEW AVATAR!!!!

Vanadium posted:

You could modulo them to both be between 0 and 2 pi before comparing them, I think. Look up how your language's modulo operation works on negative numbers or w/e but you should be able to get angles normalized like that.

Rather, lots of languages have a remainder operator and no modulo operator. ((X % m) + m) % m should get you where you wanna go.

e: actually maybe I'm dumb and drawing a distinction that doesn't exist. Either way, if you want the positive modulo number that your lecturer would make you turn in on an exam, then you can use the above formula.
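For example, in Python — where % already returns a nonnegative result for a positive modulus — math.fmod keeps the sign of x like C's remainder does, so it can stand in to show the formula at work:

```python
import math

def positive_mod(x, m):
    """((x % m) + m) % m, written with fmod, which (like C's %) keeps
    the sign of x. The result is always in [0, m)."""
    return math.fmod(math.fmod(x, m) + m, m)

two_pi = 2 * math.pi
print(positive_mod(-math.pi / 2, two_pi))  # same angle as 3*pi/2
```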

Seashell Salesman fucked around with this message at 18:21 on Sep 17, 2013

raminasi
Jan 25, 2005

a last drink with no ice
e: whoops nm

the littlest prince
Sep 23, 2006


Is anyone else going through the exercises on hackermeter.com? A friend posted the link a while back on facebook (he doesn't appear to be the one who wrote/manages it) and I've found it kind of interesting. My job hasn't had a lot of challenging problems in the last few years and it's refreshing to work on some. I'm currently working through the dynamic programming problems.

It's probably useless for its intended purpose, since it isn't widely known, but it is kind of fun.

raminasi
Jan 25, 2005

a last drink with no ice

JawnV6 posted:

mod them down to be between 0 and 2pi, subtract, divide by one of the inputs to get the relative error and compare the result to an epsilon. Check both against 0.0 to make sure you won't get NaN at some point.

Real math won't ever be 'elegant'.

Won't this not work when comparing 2pi - (small number) and 2pi + (small number)? Their difference is almost 2pi.

JawnV6
Jul 4, 2004

So hot ...
Test again in the range of pi to 3pi and OR the results?
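An alternative to testing two ranges (not what was suggested above, just another common trick): normalize the difference instead of the inputs, and take the shorter way around the circle. A sketch with an absolute tolerance:

```python
import math

def angles_close(a, b, eps=1e-9):
    """True if a and b name the same direction, up to eps radians.
    Handles the wraparound case of 2*pi - tiny vs 2*pi + tiny."""
    two_pi = 2 * math.pi
    d = abs(math.fmod(a - b, two_pi))  # remainder may be negative, so abs
    d = min(d, two_pi - d)             # shorter arc between the two angles
    return d <= eps

print(angles_close(2 * math.pi - 1e-12, 2 * math.pi + 1e-12))  # the tricky case
```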

PilotEvan
Jun 27, 2008

WHAT?

Sinestro posted:

You wouldn't want to grab all the data and then futz with it. Use a search algorithm to find the item you are starting from, then recurse over the children of the item.

Well, say I had all the data in a .txt file using the format "Root(Child1(GC1,GC2)Child2(...". Would straight-up searching through that be efficient enough, or is there some other way I should hold the data?

NovemberMike posted:

Well, one of the big things you want is the path to the origin. With something like a list of species it might make sense to force the user to specify the path to the origin, but if you can't do that then it might make sense to make a hash set that you can use to find a reference to the origin node. Once you have the origin node you just recurse through the children.

The big question is what your constraints are. If you're storage limited that's a different problem than if you need it to be as fast as possible.

I would assume storage is limited and that time isn't a huge deal. Assuming I hold the data using the method above, having the user specify the path seems reasonable... Navigating from the top, you would essentially just have to count parentheses, right? I'm not sure how efficient that is, but it would work. There's probably a better way to do it that I'm not seeing, though.

coffeetable posted:

If there aren't any changes after initialization, and if the dataset is small enough that you can keep a couple of pointers for each node in memory, my (completely uneducated) approach would be
  • Build a tree (with parent pointers as well as child)
  • Store the tree as an array. Just do an in-order traversal of the tree, throwing things into the array as you go.
  • Sort the array by key. Easiest method would be a selection sort, updating the pointers of each node (and those of its parent/children) as it's selected. You could use cycle sort if you want to minimize the amount of messing with pointers, or any in-place O(n lg n) sort (merge? heap?) if you want speed.
A node lookup costs O(lg(n)) rather than a hash table's O(1), but lg(n) grows so slowly and indexing into an array is so cheap that hidden constants are likely to make the array faster. As an added bonus, you don't have to gently caress around with hashes, woo.

Warning: if there's even the slightest chance that somewhere down the road, users will want to add stuff to the structure, go with a hash table/tree/whatever.

There won't be any changes whatsoever, but I'm thinking storage constraints would make having the entire tree in memory too difficult unfortunately.


double riveting posted:

When you say this, are you thinking of having the node whose subtree you want as a string (i.e. user typed it or something), or are you getting the node directly, say by the user clicking through the tree?

For the tree itself, the data structure you want is called a rose tree, i.e. a tree with an arbitrary number of children from each node. Each node contains an element and a list (singly linked list will do fine) of subtrees. (Leaf nodes simply have an empty list of children.) Once you have a node in the tree you can recurse over its children easily.

Getting to a tree node, unless you are navigating the tree from the top, is a separate problem, for which you want an index: a table or something that lets you find a tree node from whatever kind of query you give it. If you want actual Google-like search, you will have to learn how search indexes work. If you are content with exact string matching on the name of the node, a hash table is indeed what you want. (A hash table is always good if you want to map arbitrary strings to arbitrary values.) So you would map string names to tree nodes.

Done.

Both ways of finding the node seem like something I might want to implement. Given space constraints, clicking through the tree from the top and building the tree layer by layer seems like a good idea, but again I'm not sure exactly how fast finding adjacent children will be using the storage method I described above, even though time isn't a huge constraint... There's probably a better way to hold the data but I'm not aware of one...

Exact string matching seems like something I would be happy with. Since there won't ever be any changes, would a perfect hash function be something I could do? I mean, I could simply list every string in the code to be mapped, but there are better ways to do that, right?

coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY

PilotEvan posted:

There won't be any changes whatsoever, but I'm thinking storage constraints would make having the entire tree in memory too difficult unfortunately.

:stare:

Let's take the species name example. If we limit each token in the name to say 64 bytes, and we assume you use 8 byte addresses to other nodes, you can fit ~10 million nodes in a gigabyte. You're seriously playing with orders of magnitude more data than that? Or are the tokens/keys much bigger than 64 bytes?

How many levels from the root would you have to descend before each individual subtree could fit in memory? How many levels could you descend before the tree above that level wouldn't fit in memory?

e: poo poo gets very complicated when you try to optimize data structures that involve calls to the hard drive, because they're insanely expensive and random access assumptions go out the window. We really need to know more about the structure of the data you're working with (like the distribution of the node degrees), the kinds of queries, the frequency of each kind of query, and the hardware that you're dealing with.

coffeetable fucked around with this message at 23:39 on Sep 17, 2013

PilotEvan
Jun 27, 2008

WHAT?

coffeetable posted:

:stare:

Let's take the species name example. If we limit each token in the name to say 64 bytes, and we assume you use 8 byte addresses to other nodes, you can fit ~10 million nodes in a gigabyte. You're seriously playing with orders of magnitude more data than that? Or are the tokens/keys much bigger than 64 bytes?

How many levels from the root would you have to descend before each individual subtree could fit in memory? How many levels could you descend before the tree above that level wouldn't fit in memory?

e: poo poo gets very complicated when you try to optimize data structures that involve calls to the hard drive, because they're insanely expensive and random access assumptions go out the window. We really need to know more about the structure of the data you're working with (like the distribution of the node degrees), the kinds of queries, the frequency of each kind of query, and the hardware that you're dealing with.

The space constraint I'm worried about comes from: 1. I'm aiming to develop for tablets. 2. The search terms won't be long, but the data might be larger than 64 bytes, since it will include several web addresses. There will be trimming here and there depending on what I decide to include. 3. I'm planning on using Unity3D, and there will be 3D rendering going on meanwhile (though I don't think it will be intense; this is the constraint I'm mostly unsure about since I haven't done 3D design before, not to mention on a tablet).

The species example is quite exaggerated, but there are tens of thousands of leaves (the intermediate node count is unknown to me at the moment). Queries will be infrequent thankfully - minutes between them - but I don't want the user to wait more than a few seconds for results. Again, unfortunately the 3D stuff going on meanwhile makes this a huge gray area and I'm not sure how it will turn out exactly.

nielsm
Jun 1, 2009



Okay, so say you have about 50k nodes on a 32-bit system, requiring 3 pointers per node, but round it up to 4 (one in the parent pointing to it, one in the node pointing to the parent, and some auxiliary). That's 800 KB for the base tree structure. You could then spend two extra pointers per node to point into the raw source data where the text for each node can be found. That adds the storage required for the raw data.

All in all, a few megabytes at most for the live structure. That should be okay also for mobile device use.
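A quick sanity check of that arithmetic (the 50k/4-pointer figures are nielsm's; the script is just a calculator):

```python
# Recompute the estimate: 50k nodes, 4 pointers per node, 4-byte pointers
# on a 32-bit system.
nodes = 50_000
pointers_per_node = 4
pointer_size = 4  # bytes
tree_bytes = nodes * pointers_per_node * pointer_size
print(tree_bytes)  # 800000 bytes, i.e. the 800 KB figure
```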

If you have the option to store the data however you like and it won't be user generated, you can instead develop a file format that lets you do direct lookup in it. Augment the data in whatever convenient way to make lookups faster.

double riveting
Jul 5, 2013

look at them go
I'd just loving slurp the file on start and forget about the disk. As for the format, how about this:

code:
mammals
 apes
  gorillas
  humans
   clowns
fishies
 salmon
 tuna
(one line per entry, number of leading spaces = depth in tree)
Might be easiest to hand-roll a parser for.
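A hand-rolled parser for that format only takes a few lines. Here's a Python sketch; it assumes depth grows exactly one space per level, as in the sample, and the (name, children) tuple representation is improvised:

```python
def parse(text):
    """Parse 'leading spaces = depth' lines into (name, children) nodes.
    Assumes depth increases by exactly one space per level."""
    root = ("", [])          # dummy root so the file can have several top levels
    stack = [root]           # stack[d] is the most recent node seen at depth d
    for line in text.splitlines():
        if not line.strip():
            continue
        depth = len(line) - len(line.lstrip(" "))
        node = (line.strip(), [])
        del stack[depth + 1:]           # pop back to the parent's level
        stack[depth][1].append(node)    # attach to parent
        stack.append(node)
    return root[1]

forest = parse("mammals\n apes\n  gorillas\n  humans\n   clowns\nfishies\n salmon\n tuna")
print([top[0] for top in forest])
```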

Hughlander
May 11, 2005

PilotEvan posted:

The space constraint I'm worried about comes from: 1. I'm aiming to develop for tablets.

Normalize the data, throw it into a SQLite database, and write relational queries on the data in a non-rendering thread?

Seashell Salesman
Aug 4, 2005

Holy wow! That "Literally A Person" sure is a cool and good poster. He's smart and witty and he smells like a pure mountain stream. I posted in his thread and I got a FANCY NEW AVATAR!!!!

Hughlander posted:

Normalize the data, throw it into a SQLite database, and write relational queries on the data in a non-rendering thread?

This is a much better idea than inventing your own plain-text file format and parser. It also separates your business logic from the data backend in case you ever want to switch the two out independently of one another.
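For the curious, a sketch of what that looks like with Python's bundled sqlite3 module. The schema and names are invented, and the recursive query needs SQLite 3.8.3 or newer:

```python
import sqlite3

# Adjacency-list schema: each row points at its parent. Names invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE taxon (id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER);
    INSERT INTO taxon VALUES
        (1, 'Animalia', NULL),
        (2, 'Mammalia', 1),
        (3, 'Primates', 2),
        (4, 'Hominidae', 3),
        (5, 'Arthropoda', 1);
""")

# All descendants of Mammalia, via a recursive common table expression.
rows = conn.execute("""
    WITH RECURSIVE sub(id) AS (
        SELECT id FROM taxon WHERE name = 'Mammalia'
        UNION ALL
        SELECT t.id FROM taxon t JOIN sub ON t.parent_id = sub.id
    )
    SELECT name FROM taxon
    WHERE id IN (SELECT id FROM sub) AND name != 'Mammalia'
    ORDER BY id;
""").fetchall()
print([r[0] for r in rows])
```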

1337JiveTurkey
Feb 17, 2005

PostgreSQL supports recursive common table expressions which can be used to perform hierarchical searches. Oracle also supports it with CONNECT BY syntax.

edit: Nevermind, thought this would be running server-side.

coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY

PilotEvan posted:

The species example is quite exaggerated, but there are tens of thousands of leaves (the intermediate node count is unknown to me at the moment). Queries will be infrequent thankfully - minutes between them - but I don't want the user to wait more than a few seconds for results. Again, unfortunately the 3D stuff going on meanwhile makes this a huge gray area and I'm not sure how it will turn out exactly.

<100k nodes with the output being fed straight to the user? Christ, you can structure it however you want and it'll return instantly from the user's perspective. Hand-crafting a data structure for it would be serious premature optimization. Like Hughlander said, just use a SQLite database.

Also, the megabyte or so it'd take to hold it in memory is tiny even by phone standards, so go nuts. And don't worry about graphics performance - all that stuff will be carted off to the GPU core of the chip. It won't even notice a handful of database queries.

double riveting
Jul 5, 2013

look at them go

Hughlander posted:

Normalize the data, throw it into a SQLite database, and write relational queries on the data in a non-rendering thread?
Ah, I missed the part about tablets. In that environment, SQLite is indeed a much better idea. Especially because the "disk" will actually be some fast flash.

I was prematurely guessing this was probably for some academic stuff where every extra dependency would just annoy others and RAM is free.

PilotEvan
Jun 27, 2008

WHAT?

Everyone posted:

SQLite

That sounds pretty much perfect. Thanks for the help guys! Yall are awesome. :cheers:

WarpedLichen
Aug 14, 2008


Not sure if this is a "programming" question or a math question, but since this is algorithmic complexity related, I figured this place is closer.

Everybody knows the flash game, cookie clicker right?

What is the algorithmic complexity of optimizing the total cash earned at a given time x in that game? If only I could remember what I learned about P and NP problems back in college.

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


I don't know the game, and I'm probably not the only one. Please explain it for us.

In particular, how do you measure the size of a given instance of the puzzle?

Jewel
May 2, 2009

ultrafilter posted:

I don't know the game, and I'm probably not the only one. Please explain it for us.

In particular, how do you measure the size of a given instance of the puzzle?

It's not a puzzle. Also, it's a free web game that you can play, and generally it was a massive, massive hit for reasons nobody can really understand yet.

I'm wondering if you could make a formula for potential money at any given time then differentiate to get the maximum?

WarpedLichen
Aug 14, 2008


Cookie clicker is a basic optimization game, where the objective is to generate cookies. The most basic way of generating cookies is clicking on the giant cookie, where each click gives you one cookie. With cookies, you can buy contraptions that generate more cookies over time. Every object generates infinite cookies, but the pricing and cookie generation rate varies.

Here's a link to the games thread

The theoretical naive approach would be an algorithm that takes a set of options and constructs a tree where each node in the search graph is an action taken during a one-second payout tick, and simply does the max-cost graph search. You can see that the size of this graph would depend on the number of options P and the maximum time T to search.

Given a constant time T and a setup of arbitrary complexity P, I suppose that approach would yield a complexity of P^T, which is pretty awful.

The reason I ask is because this reminds me of the classic knapsack problem, so my naive solution might be wrong. Is there a canonical form of this class of problem and what is its computational complexity?

double riveting
Jul 5, 2013

look at them go
i suppose some contraptions require other contraptions to be built first? and possibly also consume other contraptions to build?

nielsm
Jun 1, 2009



double riveting posted:

i suppose some contraptions require other contraptions to be built first? and possibly also consume other contraptions to build?

There are certain upgrades that only become available once you have a certain amount of something else (e.g. a second "factories double production" upgrade is available when you have 50 factories), but the only resource used to purchase things is cookies. I.e. there is a tech tree and one resource.
Writing some sort of simulation taking "steps" (one step being one purchase) might work, simulating multiple branches at once and using a heuristic to give up on a branch at some point.

coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY

WarpedLichen posted:

What is the algorithmic complexity of optimizing the total cash earned at a given time x in that game? If only I could remember what I learned about P and NP problems back in college.

It takes constant time (O(1)). Numbers in Cookie Clicker are all 64 bits at most, so the number of cookies is bounded. All we need to do is enumerate every possible configuration of purchases (of which there are a constant number 'cause of the 64 bit thing), and then simulate each until it's accumulated 2^64 - 1 cookies (which takes a constant amount of time). We now have an answer for the maximal number of cookies that can be generated at each unit time, and we just print whichever value is asked for.

double riveting
Jul 5, 2013

look at them go

coffeetable posted:

It takes constant time (O(1)). Numbers in Cookie Clicker are all 64 bits at most, so the number of cookies is bounded. All we need to do is enumerate every possible configuration of purchases (of which there are a constant number 'cause of the 64 bit thing), and then simulate each until it's accumulated 2^64 - 1 cookies (which takes a constant amount of time). We now have an answer for the maximal number of cookies that can be generated at each unit time, and we just print whichever value is asked for.

Sounds reasonable. Get to it!

coffeetable
Feb 5, 2006

TELL ME AGAIN HOW GREAT BRITAIN WOULD BE IF IT WAS RULED BY THE MERCILESS JACKBOOT OF PRINCE CHARLES

YES I DO TALK TO PLANTS ACTUALLY

double riveting posted:

Sounds reasonable. Get to it!

You didn't ask for a reasonable answer :smug:

Slightly less smart-arse: it sounds like a prime suspect for dynamic programming, because an optimal solution for time T should contain the optimal solutions for many earlier times too.

I'm too busy to think about it in depth, but you should start by throwing away status effects and upgrades and everything and just considering cursors alone. How would you calculate the optimal number of cookies if your only choice at each timestep was whether or not to buy a cursor?

e: What distinguishes this from Knapsack is that in knapsack items are arbitrarily sized and there are arbitrarily many types of them. Here the items are of fixed size and there are a fixed number of types.
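A sketch of that cursors-only warm-up as a dynamic program: the state is (cursors owned) per one-second timestep, the value is the max cookies banked. COST, RATE and BASE are invented placeholder numbers, not the game's real ones:

```python
# DP over (time, cursors owned) -> max cookies banked. One decision per
# second: buy a cursor or don't. These constants are invented, not
# Cookie Clicker's real numbers.
COST = 15     # cookies per cursor
RATE = 0.5    # extra cookies/second per cursor
BASE = 1.0    # cookies/second from clicking

def best_cookies(T):
    """Max cookies bankable after T seconds, cursors being the only item."""
    NEG = float("-inf")
    dp = {0: 0.0}                      # cursors owned -> max cookies banked
    for _ in range(T):
        nxt = {}
        for k, cookies in dp.items():
            earned = cookies + BASE + k * RATE       # wait out this second
            nxt[k] = max(nxt.get(k, NEG), earned)
            if earned >= COST:                       # or buy a cursor now
                nxt[k + 1] = max(nxt.get(k + 1, NEG), earned - COST)
        dp = nxt
    return max(dp.values())

print(best_cookies(60))
```

With these numbers, never buying yields exactly T cookies, and the DP beats that once the 15-cookie cursor has time to pay for itself.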

coffeetable fucked around with this message at 13:20 on Sep 19, 2013

unixbeard
Dec 29, 2004

At every step you have to decide: do I spend some cookies to increase my output, or do I keep harvesting cookies to buy more output?

Each output has a reward (increase in cps) and a cost (# of cookies required to buy, which is really a function of the current cps rate).

It's kinda Markov in that your available cookies go to zero once you buy output. Also the costs change each iteration (i.e. after a purchase), so what was optimal at one stage will change. What's optimal at time t is independent of the answer at time t-1, in some sense.

Not having thought about it too much, you would basically weigh up your options, e.g. I can buy a cursor for 15 now which will increase my cps by y, or wait x seconds based on my current cps and buy a grandma for whatever, which will increase my cps by some other value. So if you're just getting started you're better off buying cursors, but at some point you will be better off waiting a bit longer to buy a grandma. You would just evaluate all your options after each purchase of output.

Jewel
May 2, 2009

You'd have to weigh up what's faster, waiting to buy a grandma, or buying n cursors then buying a grandma. If you can calculate the time of those (fairly easily) then the lowest time should in theory be the most optimal approach. You could repeat that down the chain I think, and rule out the lower ones when you've gotten enough cp/s to afford the next step up in < 1s or something.

Suspicious Dish
Sep 24, 2011

2020 is the year of linux on the desktop, bro
Fun Shoe
Let us know when your Cookie Clicker TAS comes out, WarpedLichen!

unixbeard
Dec 29, 2004

You can look at it starting from 0 cookies each time. Your current cps is x. The present value of every possible purchase is the increase in cps it brings and some function of the time in seconds it would take to get enough cookies to buy it.

So if you're starting the game fresh, a factory would have a huge increase in cps but take friggin ages = high cost, while a cursor has a relatively low cps increase but a low time cost, so it would be optimal. At some point you get enough cursors that the time taken to get a grandma and the cps increase it brings outweigh the value of a cursor. The nominal amount of cookies isn't the real cost; it's the time taken to generate the cookies to make the next purchase.

You could also evaluate all the bonus items you could buy with a similar pov.
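That weighing-up can be written down directly: for each option, compute when you could afford it at the current rate and what you'd have banked by some horizon if you bought it as soon as possible. All prices and rates below are invented placeholders, not the game's actual numbers:

```python
# Greedy "present value" comparison. OPTIONS maps name -> (cost, cps gain);
# every number here is an invented placeholder. Assumes cps > 0.
OPTIONS = {"cursor": (15, 0.1), "grandma": (100, 0.5), "factory": (3000, 4.0)}

def best_buy(cookies, cps, horizon=600.0):
    """Pick the option maximizing cookies banked by `horizon` seconds,
    assuming exactly one purchase, made as soon as it's affordable."""
    def value(cost, dcps):
        wait = max(0.0, (cost - cookies) / cps)   # seconds until affordable
        if wait >= horizon:
            return float("-inf")                  # can't even buy it in time
        # Earn at cps until the purchase, then at cps + dcps afterwards.
        return cookies + cps * wait - cost + (cps + dcps) * (horizon - wait)
    return max(OPTIONS, key=lambda name: value(*OPTIONS[name]))

print(best_buy(cookies=0.0, cps=1.0, horizon=120.0))
```

With these placeholder numbers the short horizon favors the cheap cursor and a long horizon favors the grandma, which matches the intuition above.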


unixbeard
Dec 29, 2004

The optimal decision is to get back to work on whatever you are procrastinating about.
