Hanpan
Dec 5, 2004

I have a database question I hope someone can help me with. I have a Python service which connects to Twitter and stores tweet data in a database. The Twitter consumer updates a particular track's 'count' column every time a new tweet comes in, then stores the overall tweet data in a separate database on an interval to keep write speed down. Considering a track like 'bieber' or 'obama' can have around 10 tweets per second, I need a database platform that allows me to update that specific track's 'count' column at a decent enough speed. I also need to monitor the count in realtime, so updating the count on an interval is not an option.

A web app will then be monitoring this database for changes... whenever the count is updated for a specific track, I am going to use node.js to notify any client who is subscribed to that particular track in realtime using socket.io.

At the moment I am using MySQL; it seems to be handling the consumer's UPDATE query OK, but I don't want to add even more stress by using node to poll it every second for changes (for each track). Depending on how many clients are connected, I'd need to poll the database every second * the number of clients connected. Nasty.

Doing some research, it seems I have a few options:

1. Use a NoSQL database like Mongo or Couch to keep the tracks' count data. If I used Couch, I could also subscribe to the _changes API so I wouldn't have to poll it for changes. I'm not sure how fast this is though, and the client needs updates to be close to realtime. Apparently Redis would be ideal for this too, but I'm struggling to make sense of how that works.

2. Use an in-memory MySQL table and copy the data to a persistent table on a set interval via a cron job or something to prevent data loss.

3. Use "pub/sub mq". I have no idea what this is but someone suggested it to me.

Does anyone have any suggestions as to which method would be best to use? I'm not asking for code examples; I'm just unsure which method is best suited to my needs. I'm most comfortable with SQL, so ideally I'd stick to that, but it does seem as though NoSQL is a better fit.


Contra Duck
Nov 4, 2004

#1 DAD
Don't try to use your database for things that aren't related to storing or retrieving data. Your friend was right: pub/sub, aka publish/subscribe messaging, is pretty much perfect for your purposes. In a pub/sub system you've got two types of users: publishers, who provide data, and subscribers, who receive all data matching some criteria. In your case your feed from Twitter would act as a publisher, pushing messages into the pub/sub system, and your clients would choose to receive all messages that match a specific pattern, i.e. on a given track.

A quick google finds this python pubsub thing that might be worth looking at, though I have no experience with it myself. Assuming it's OK, it should dramatically simplify and improve your system.
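If it helps to see the shape of it, here's a minimal sketch using Redis pub/sub (which was mentioned above) rather than the linked library; the channel naming and wiring are just illustrative, not a recommendation of any particular broker:
code:
# Sketch only: Redis pub/sub via the redis-py client. Channel names are made up.
import redis

r = redis.Redis()

# Publisher side: the Twitter consumer pushes each tweet onto its track's channel.
def publish_tweet(track, tweet_text):
    r.publish("track:" + track, tweet_text)

# Subscriber side: e.g. the process feeding socket.io clients.
def listen(track):
    p = r.pubsub()
    p.subscribe("track:" + track)
    for message in p.listen():
        if message["type"] == "message":
            yield message["data"]
The subscriber blocks on the channel instead of polling, which is the whole point.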

Contra Duck fucked around with this message at 10:05 on May 27, 2011

Zombywuf
Mar 29, 2008

Hanpan posted:

Does anyone have any suggestions as to which method would be best to use? I'm not asking for code examples; I'm just unsure which method is best suited to my needs. I'm most comfortable with SQL, so ideally I'd stick to that, but it does seem as though NoSQL is a better fit.

Here is the schema I would recommend for this:
code:
Tracks (
  track_id int not null primary key,
  track_name VarChar(max) not null
)

Tweets (
  tweet_id int not null primary key,
  tweet_text VarChar(max) not null
)

TrackTweets (
  tweet_id int not null references Tweets (tweet_id),
  track_id int not null references Tracks (track_id)
)
This is only an outline but hopefully it should be clear that you can then obtain the count with a query like:

code:
select
  Count(*)
from
  Tracks t
  inner join TrackTweets tt
    on t.track_id = tt.track_id
where
  t.track_name = 'obama'
You can set the lowest transaction isolation level for this and it will still be atomic. As the TrackTweets table is only 8 bytes wide, gathering the counts for anything less than a million hits should be very quick.

EDIT: also yes, use something outside the database for change notifications. You could add datetimes to the tweets to gather things that have arrived since a certain date though (which is the one true way to do polling).

EDIT2: If you were to use CouchDB you'd do it the same way.
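To make the datetime-polling idea from the first edit concrete, here's a rough sketch; the created_at column and the DB-API wiring are my own illustration, not part of the schema above:
code:
# Sketch only: poll for tweets newer than the last timestamp we saw.
# Assumes a hypothetical created_at column on Tweets and any DB-API connection
# (MySQLdb-style %s placeholders).
def poll_new_tweets(conn, track_name, last_seen):
    cur = conn.cursor()
    cur.execute(
        """
        select tw.tweet_id, tw.tweet_text, tw.created_at
        from Tweets tw
        inner join TrackTweets tt on tw.tweet_id = tt.tweet_id
        inner join Tracks t on t.track_id = tt.track_id
        where t.track_name = %s and tw.created_at > %s
        order by tw.created_at
        """,
        (track_name, last_seen),
    )
    rows = cur.fetchall()
    if rows:
        last_seen = rows[-1][2]  # advance the high-water mark
    return rows, last_seen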

Zombywuf fucked around with this message at 10:13 on May 27, 2011

tef
May 30, 2004

-> some l-system crap ->

Contra Duck posted:

Don't try to use your database for things that aren't related to storing or retrieving data. Your friend was right: pub/sub, aka publish/subscribe messaging, is pretty much perfect for your purposes.

Except, there is really only one producer and one consumer. The producer is the stream and the consumer is the database. If you are doing trend analysis/reporting you don't want to throw away historical data. Also - message queues have a knack for either running full or empty; their only function in life often seems to be creating a backlog* :v:

The clients (on the web) have to access the counts by polling. But he wants to track this data even if no end client is connected. So you'd have to have a Twitter bot pushing into a number of queues, a bunch of processes subscribing to them and writing them to disk for persistence, and then another system to aggregate these statistics and serve them over http.


(Also, people who recommend pub/sub normally recommend a message broker like RabbitMQ, or a messaging protocol like 0MQ, rather than the first thing that appears on Google.)




* I kid, mostly.

shrughes
Oct 11, 2008

(call/cc call/cc)
The free version of RethinkDB will let you track your counts nicely. It will store them on-disk and serve them out to lots of connections efficiently. You could also use it to store the tweets themselves and build an efficient product out of that, but I wouldn't quite recommend that because the first version doesn't support replication. (To be fair, a lot of key-value stores that "support" replication actually suck at it and lose consistency in the event of errors. Or so I'm told.) Disclaimer: I work at RethinkDB.

fankey
Aug 31, 2001

Although this is more of a graphics question, it deals with asset generation, so I'm hoping someone here has experience with this sort of thing.

I'm trying to create 3d rendered image assets with objects that have 0 alpha colors. By this I don't mean that they show what's behind them in the 3d scene - I want their color to be rendered as a color with 0 alpha. Here's an example of what I'm going for



To get the above image I rendered the chair in SketchUp with a 100% red color and then wrote a little program that takes any color that is a pure shade of red, changes it to black, and sets its alpha. I can then easily change the color of the image by putting a colored rectangle behind it. Notice that the chair is transparent but still has shading applied to it. The anti-aliasing is also correctly 'alphaed' so there isn't any fringing.



That works for simple images, but I need to be careful that the image doesn't otherwise use red, and the transparent area can't interact with colored areas.

Does any 3d software have a concept of an alpha color like this? I emailed SketchUp (back before they were bought by Google) and I think I ended up confusing the developer I was talking to.

nielsm
Jun 1, 2009



Is there any reason you can't instead render the object on a transparent background but make the object entirely white? When the object is pure white (and a shade of gray for shadows) you can use multiplication blending mode to mix a colour or texture onto it while keeping a transparency mask around the object. Basically the palette swapping of 256 colour VGA taken to the modern age.
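For illustration, a small sketch of that multiply trick using Pillow; the file names and tint colour are made up:
code:
# Sketch only: tint a white-on-transparent render by channel-wise multiplication.
from PIL import Image, ImageChops

# White object (grey where shaded) on a transparent background, as RGBA.
render = Image.open("chair_white.png").convert("RGBA")

# Solid, fully opaque colour layer the same size as the render.
tint = Image.new("RGBA", render.size, (200, 30, 30, 255))

# Multiply: white takes on the tint, shading darkens it, and the render's
# alpha channel is preserved (a * 255 / 255 == a), so the mask survives.
coloured = ImageChops.multiply(render, tint)
coloured.save("chair_red.png")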

fankey
Aug 31, 2001

nielsm posted:

Is there any reason you can't instead render the object on a transparent background but make the object entirely white? When the object is pure white (and a shade of gray for shadows) you can use multiplication blending mode to mix a colour or texture onto it while keeping a transparency mask around the object. Basically the palette swapping of 256 colour VGA taken to the modern age.
It's a little more complicated than I originally wrote. Our application allows users to import graphics (png, jpg, etc) as well as draw polygonal regions that trigger events when clicked on. One way for our users to create sophisticated user interfaces is to create PNGs with transparency and place them at a higher z-order than the polygon buttons. Currently our software doesn't support anything beyond simple transparency. I'm trying to come up with a few different techniques our users can use to create these images without opening them up in Photoshop and doing a bunch of channel math.

FSMC
Apr 27, 2003
I love to live this lie
I have a question about FAT or how I can have more control over how files are written.

Background: I have a micro-controller that needs to read lots of data fast from a uSD card. I've spent a bit of time trying to port a FAT filesystem to the microcontroller. I managed to port DOSFS to it but it was very slow. So I worked out why it was slow and "fixed it". The files would be ~150 KB in size. I'm using FAT16 but could use FAT32.

I've implemented reading such that I only look at the position of the first sector and then just read the next x sectors.

So for my implementation to work the file needs to start on a sector boundary and be stored in order. I'm not too sure if this will always be the case and it feels like a bad idea, but I don't really know much at all about file systems. Would something such as a simple defrag make sure my uSD card is readable by my implementation? Or is there another way to ensure files are stored contiguously on a FAT file system? Do I even need to worry?

mcw
Jul 28, 2005
edit: nm

mcw fucked around with this message at 02:11 on May 30, 2011

nielsm
Jun 1, 2009



Unparagoned posted:

I've implemented reading such that I only look at the position of the first sector and then just read the next x sectors.

This is a really bad idea and you explained why yourself.

You haven't told us much about what you're doing right now or what your constraints are, so I'll just make some guesses and assumptions.

If you can afford the memory for it, you're probably best off reading the entire file allocation table into memory for constant reference. If you want to read an entire file, what I'd suggest doing would then be to build a list of cluster numbers you need to read, possibly coalescing it into runs of contiguous clusters if it's faster to make requests for contiguous regions.

Having the entire FAT in memory should also allow you to more efficiently search it for free clusters when you need to write, although if you need to write often, it might be smarter to build a free-list when initialising the FS driver. (Again, that obviously needs memory you might not have.) If you know you will be writing fixed-size files you could even construct your free-list based on that knowledge, so you start with contiguous runs of the appropriate number of clusters, pre-calculated, and place all free space that would require you to fragment at the end of the free-list.
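To make the cluster-run idea concrete, a rough sketch (FAT16, in Python just to show the shape; names and layout details are illustrative):
code:
# Sketch only: walk a FAT16 cluster chain and coalesce it into contiguous runs,
# so each run can be fetched with a single multi-cluster read.
# `fat` is the in-memory FAT as a list of 16-bit entries; values >= 0xFFF8
# mark end-of-chain.

def cluster_chain(fat, first_cluster):
    """Yield the clusters of a file, in order, by following the FAT."""
    cluster = first_cluster
    while cluster < 0xFFF8:
        yield cluster
        cluster = fat[cluster]

def coalesce_runs(clusters):
    """Turn [5, 6, 7, 42, 43] into [(5, 3), (42, 2)] as (start, length) runs."""
    runs = []
    for c in clusters:
        if runs and c == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs

def read_file(fat, first_cluster, read_clusters):
    """read_clusters(start, count) is the device's contiguous read primitive."""
    data = bytearray()
    for start, count in coalesce_runs(cluster_chain(fat, first_cluster)):
        data += read_clusters(start, count)
    return bytes(data)
A contiguous, unfragmented file collapses to a single run and therefore a single request.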

FSMC
Apr 27, 2003
I love to live this lie

nielsm posted:

This is a really bad idea and you explained why yourself.

You haven't told us much about what you're doing right now or what your constraints are, so I'll just make some guesses and assumptions.

If you can afford the memory for it, you're probably best off reading the entire file allocation table into memory for constant reference. If you want to read an entire file, what I'd suggest doing would then be to build a list of cluster numbers you need to read, possibly coalescing it into runs of contiguous clusters if it's faster to make requests for contiguous regions.

Thanks nielsm. The application is to play raw video. The chip isn't powerful enough to decompress video, but obviously with raw video I need enough bandwidth. I will just be reading on the device so there will be no writing. The main factor in speed is the number of requests made for the data; it's about 3000% faster to request 300 contiguous sectors at once than to request 300 sectors one at a time.

I think the file allocation table is in memory, or at least the part I need for the file is. So if I'm reading you right, I should do everything "correctly" while working out where the file is, sort that into contiguous areas, then do the read(s) at the end. That sounds like it would work.

It may be better to front-load the work. So how would I make a file system like FAT except that the files would be stored contiguously? Would I be able to make a program that would do that? Could I just find a way to have a cluster size of 192 KB?

rolleyes
Nov 16, 2006

Sometimes you have to roll the hard... two?
If you're just reading on the device, does this mean you've got your own custom file system running on another machine somewhere to write the files in the first place? To me it sounds like you're talking about building a file system which automatically defragments itself. That's possible in theory, and it would be simple so long as you never needed to delete anything, since you could just write everything in sequence. Once you throw in deletions you have to decide whether you're going to defragment on delete or on write. Either way, the performance of the chosen operation would be terrible, but maybe you're willing to live with that for this application.

nielsm
Jun 1, 2009



There are two (three) ways to keep a FAT system unfragmented when you don't know what will be writing to it.
All of them depend on Win95 long file names not being written since that can bloat the directory entries a whole lot more, causing those to fragment.

One would be to make sure you only ever have one file open for writing at a time, never write to files that have been closed once, and never delete files. Then all your files should be contiguous.

Another would be to have all files be exactly the same size, and pre-allocate the files when you create them. Then you can create and delete files as long as they keep their size.

Last, which doesn't really count, is to only ever have one single, huge file.


On the other hand, if your device will be the one writing to the file system, then you're free to implement your file system driver as you wish, to avoid fragmentation in whatever way that might be most convenient for you.
The most important element in avoiding fragmentation is to plan ahead, using knowledge of the access pattern. When you need to write, if you know ahead of time what amounts of data you will be writing to a file, you can find a sufficiently large contiguous free area in the FAT and write your data there. A well-designed free-list can help in that operation.


Again, when reading, if you have enough memory you might be able to get away with throwing that at your problem: Implement a large-block cache below the FS driver. E.g. read aligned 4 MB blocks and keep them around for a while, then just read into those when you need. That can solve problems of files being scattered around, for example two files that were written in an interleaved fashion.

"I have a slow storage" -> throw faster memory at the problem

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.
Can anyone recommend any particularly well-designed asynchronous object-relational mappers, written in any language? I don't particularly want to use one, but I would really like to see how other people have designed their APIs in this space. I'm writing an event-driven Twisted application in Python and I'll definitely need to write something to simplify my interactions with MongoDB.

(Okay, since MongoDB isn't relational, I'm really looking for object persistence. Sue me.)

Haystack
Jan 23, 2005





Misogynist posted:

Can anyone recommend any particularly well-designed asynchronous object-relational mappers, written in any language? I don't particularly want to use one, but I would really like to see how other people have designed their APIs in this space. I'm writing an event-driven Twisted application in Python and I'll definitely need to write something to simplify my interactions with MongoDB.

(Okay, since MongoDB isn't relational, I'm really looking for object persistence. Sue me.)

Well, if persisting python objects is what you're after, you could cut out the middleman and switch to an object database, namely ZODB.

Vulture Culture
Jul 14, 2003

I was never enjoying it. I only eat it for the nutrients.

Haystack posted:

Well, if persisting python objects is what you're after, you could cut out the middleman and switch to an object database, namely ZODB.
ZODB isn't asynchronous (if it can do it, its API certainly isn't built for it) and it fulfills almost none of my requirements. I'm writing a high-throughput application that's going to be processing tons of performance data, and:

  • I need this to be fully asynchronous, as the majority of this stuff is happening inside of a Twisted reactor loop.
  • Lots of the data I'm interacting with needs to be fire-and-forget (i.e. I don't care if individual values even make it into persistent storage on the other end). My framework needs to ensure that, if data is flagged for this, it doesn't waste any time waiting on return values for inserts or other things I don't actually care about. I should be able to scale up to hundreds of writes per second from a single thread without any issues.
  • If this can't get me the performance I want, I need to be able to break out of it and just start making GBS threads data into Mongo as fast as Python can manage.

Since I'm already working with a document-oriented database, this thing basically just needs to make my data interaction code more readable (especially when I'm dealing with versioned schemas or references to other documents), since most of the serialization/deserialization is already happening in BSON on the way to/from MongoDB.

Edit: gently caress it, I'm rewriting MongoKit to use txmongo
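(For the fire-and-forget part, here's a sketch of the idea with plain pymongo and an unacknowledged write concern; it's not txmongo or Twisted, just an illustration of not waiting on the insert's return value, and the names are made up.)
code:
# Sketch only: w=0 writes mean the client doesn't wait for the server to
# acknowledge each insert, so individual samples may be silently dropped.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient()
metrics = client.perf.metrics.with_options(write_concern=WriteConcern(w=0))

def record(sample):
    metrics.insert_one(sample)  # returns immediately; no round-trip wait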

Vulture Culture fucked around with this message at 23:03 on May 31, 2011

Meliv
Nov 1, 2008
I just graduated from Uni and am looking around job sites for, well, jobs. I'm seeing that a lot of companies require VB6 programmers to convert their systems to VB.Net. I have a fair amount of experience in VB.Net, so how quickly can I pick up VB6? Is it pretty much identical? It seems that way.

rolleyes
Nov 16, 2006

Sometimes you have to roll the hard... two?
Well, most obviously, VB.NET uses the .NET runtime and libraries while VB6 doesn't, so there's going to be a hell of a lot of rewriting of calls to libraries which don't exist in VB.NET, although I believe the upgrade tool may do some of that where possible. There are also some significant syntactical differences to enable all the new features which weren't present in VB6. See here for a brief rundown.

Scaramouche
Mar 26, 2001

SPACE FACE! SPACE FACE!

Meliv posted:

I just graduated from Uni and am looking around job sites for, well, jobs. I'm seeing that a lot of companies require VB6 programmers to convert their systems to VB.Net. I have a fair amount of experience in VB.Net, so how quickly can I pick up VB6? Is it pretty much identical? It seems that way.

To do this you'll basically be doing two things:
1. Figuring out VS project/binary/compilation problems
2. Finding VB6 specific syntax and converting it to VB.NET

The first part is a nightmare, especially if 'must have' third-party components were never upgraded.

loinburger
Jul 10, 2004
Sweet Sauce Jones
I'm converting one XML file to another using XSLT 1.0 in Visual Studio 2010. If need be I may be able to use the Saxon XSLT 2.0 API instead, but the XSLT needs to be imported into Dynamics AX so I'm not able to embed C# code in it.

The input looks like:

code:
{level-1-element}

  {level-2-element "1"}
      A1
      A2
  {/level-2-element}

  {level-2-element "2"}
      B1
      B2
  {/level-2-element}

  {level-2-element "3"}
      C1
      C2
  {/level-2-element}

{/level-1-element}
The output looks like
code:
{level-1-element}

  {level-2-element "1"}
      A1
      B1
      C1
  {/level-2-element}

  {level-2-element "2"}
      A2
      B2
      C2
  {/level-2-element}

{/level-1-element}
I can build this element-by-element with
code:
{ns0:level-1-element}
  {ns0:level-2-element}
    {value-of "./level-2-element[1]/[1]"}
    {value-of "./level-2-element[2]/[1]"}
    {value-of "./level-2-element[3]/[1]"}
  {/ns0:level-2-element}
  {ns0:level-2-element}
    {value-of "./level-2-element[1]/[2]"}
    {value-of "./level-2-element[2]/[2]"}
    {value-of "./level-2-element[3]/[2]"}
  {/ns0:level-2-element}
{/ns0:level-1-element}
The problem is that I don't know how many triplets are going to be in the input XML - could be 2, could be a dozen. I can copy-paste my XSLT 40 times (it is extremely unlikely that there will be more than 40 triplets in the input) and increment the final index on each copy-paste, but if possible I'd prefer to be able to loop through the triplets with a {for-each} or something along those lines. However, if I'm using a {for-each} on the first level-2-element (with A(1) through A(N)) I don't know how to write XSLT that will access the appropriate element in the second and third level-2-elements.

Zombywuf
Mar 29, 2008

loinburger posted:

The problem is that I don't know how many triplets are going to be in the input XML - could be 2, could be a dozen. I can copy-paste my XSLT 40 times (it is extremely unlikely that there will be more than 40 triplets in the input) and increment the final index on each copy-paste, but if possible I'd prefer to be able to loop through the triplets with a {for-each} or something along those lines. However, if I'm using a {for-each} on the first level-2-element (with A(1) through A(N)) I don't know how to write XSLT that will access the appropriate element in the second and third level-2-elements.

Assuming all the elements are at the same level, try something like for-each select="./level-2-element[position() mod 3 = 0]" and use the following-sibling axis to get the 2 level-2-elements after each one.

loinburger
Jul 10, 2004
Sweet Sauce Jones
Thanks, that did the trick

gonadic io
Feb 16, 2011

>>=
Does anybody who knows the language feel like giving my Haskell implementation of hangman a quick look over? It's not quite finished (mostly just more feedback to do) but I'm more looking for advice on style and implementation.

http://pastebin.com/pyv2au4F

e: cleaned up the code a bit

gonadic io fucked around with this message at 19:53 on Jun 3, 2011

ShoulderDaemon
Oct 9, 2003
support goon fund
Taco Defender

AlsoD posted:

Does anybody who knows the language feel like giving my Haskell implementation of hangman a quick look over? It's not quite finished (mostly just more feedback to do) but I'm more looking for advice on style and implementation.

http://pastebin.com/pyv2au4F

e: cleaned up the code a bit

It's not bad, but I would probably say it lacks elegance. It reads like you are doing a lot of extra work, and you are jumping around and being somewhat too verbose in your coding.

Case statements inside do notation, especially if nested, should probably be taken as opportunities to reword into different functions. Remember that you can use pattern guards to split a function definition into separate cases where appropriate. The library has a great many useful functions and datatypes which you should work to familiarize yourself with.

Here's how I would start something like this:

code:
import Control.Monad
import Data.Char
import qualified Data.Set as S
import System.IO
import System.Random

type Word = String
type Guesses = S.Set Char

showWord :: Word -> Guesses -> String
showWord w g = map guessedLetter w
  where
    guessedLetter c | c `S.member` g = c 
                    | otherwise      = '.' 

mistakenGuesses :: Word -> Guesses -> Int 
mistakenGuesses w g = S.size $ g `S.difference` S.fromList w

complete :: Word -> Guesses -> Bool
complete w g = all (`S.member` g) w

playRound :: Word -> Guesses -> IO ()
playRound w g | complete w g            = putStrLn $ w ++ "\nYou won!"
              | mistakenGuesses w g > 6 = putStrLn $ "You lost!\nWord was " ++ show w ++ "." 
              | otherwise               = do
  putStrLn $ showWord w g 
  putStrLn $ "You have guessed: " ++ S.toList g ++ " (" ++ show (mistakenGuesses w g) ++ " mistaken guesses)"
  putStr "? "
  l <- getLine
  case l of
    "" -> do
      putStrLn $ "Giving up. Word was " ++ show w ++ "." 
      putStrLn ""
    [c] | c `S.member` g -> do
          putStrLn $ "You have already guessed " ++ show c ++ "."
          putStrLn ""
          playRound w g
        | otherwise -> do
          putStrLn ""
          playRound w $ S.insert c g
    _ -> do
      putStrLn "You can only guess a single letter!"
      putStrLn ""
      playRound w g 

game :: Word -> IO ()
game w = do
  putStrLn ""
  putStrLn "Let's play hangman!"
  playRound w S.empty

main :: IO ()
main = do
  hSetBuffering stdout NoBuffering
  dictionary <- fmap (filter (all isLower) . filter ((> 4) . length) . lines) $ readFile "/usr/share/dict/words"
  putStrLn $ "Loaded " ++ show (length dictionary) ++ " words from dictionary."
  g <- newStdGen
  let selections = map (dictionary !!) $ randomRs (0, length dictionary - 1) g
  mapM_ game selections
Here I'm using existing Set support to manage the guessed letters, and computing everything I need on-the-fly from the word and the set of guessed letters. With the selection of helper functions I wrote, most of the game logic can be kept pure and in pattern guards, leaving only the input handling stuck in the IO monad. The main function remains messy, but acceptably so, as it is all one-time initialization code.

Orzo
Sep 3, 2004

IT! IT is confusing! Say your goddamn pronouns!
Here's an algorithm question I can't find the answer to because I don't know what it's traditionally called.

You are placed on an infinitely long line and told there's a treasure somewhere on it. You don't know which direction to go in. What is the most efficient way to find the treasure?

Example solution: go 1 meter left, then 2 meters right, then 4 meters left, then 8 meters right, etc, zigzagging. Eventually you'll find the treasure.

What is the name of this problem and what is the correct solution?

shrughes
Oct 11, 2008

(call/cc call/cc)

Orzo posted:

Here's an algorithm question I can't find the answer to because I don't know what it's traditionally called.

You are placed on an infinitely long line and told there's a treasure somewhere on it. You don't know which direction to go in. What is the most efficient way to find the treasure?

Example solution: go 1 meter left, then 2 meters right, then 4 meters left, then 8 meters right, etc, zigzagging. Eventually you'll find the treasure.

What is the name of this problem and what is the correct solution?

Define "correct".

Define "most efficient".

Mr.Hotkeys
Dec 27, 2008

you're just thinking too much

shrughes posted:

Define "correct".

Define "most efficient".

Since the question is "what way is most efficient", the correct answer is what's most efficient. "Most efficient" is fairly obvious.

pokeyman
Nov 26, 2006

That elephant ate my entire platoon.

Orzo posted:

You are placed on an infinitely long line and told there's a treasure somewhere on it. You don't know which direction to go in. What is the most efficient way to find the treasure?

I've heard of this problem generally called the 'lost cow' or 'cow-path' problem; you're describing a specific case of it.

qntm
Jun 17, 2009

Orzo posted:

Here's an algorithm question I can't find the answer to because I don't know what it's traditionally called.

You are placed on an infinitely long line and told there's a treasure somewhere on it. You don't know which direction to go in. What is the most efficient way to find the treasure?

Example solution: go 1 meter left, then 2 meters right, then 4 meters left, then 8 meters right, etc, zigzagging. Eventually you'll find the treasure.

What is the name of this problem and what is the correct solution?

To answer this question we need to know the probability distribution which was used to drop the treasure on the line. We can't assume a "uniform" distribution because there's no such thing on the real line, or indeed on the integers.

Orzo
Sep 3, 2004

IT! IT is confusing! Say your goddamn pronouns!

qntm posted:

To answer this question we need to know the probability distribution which was used to drop the treasure on the line. We can't assume a "uniform" distribution because there's no such thing on the real line, or indeed on the integers.
Yeah, this is actually the conclusion I came to after thinking about it a bit: if you assume it's 'anywhere' then by definition you'll never find it anyway. If it's anywhere between where you are and 10,000 miles away, it's probably best to just guess a direction and exhaust it, then turn around. If there's a table of distributions you could probably write a corresponding search.

shrughes posted:

Define "correct".

Define "most efficient".
'Correct' = guarantees finding the treasure
'Efficient' = The algorithm has the lowest average expected time; if you were to play this game millions of times against another program, you'd take less time.

Although I suspect you knew that since, as Mr.Hotkeys pointed out, it's fairly self-evident. Thanks for the useful post, qntm.

ultrafilter
Aug 23, 2007

It's okay if you have any questions.


qntm posted:

To answer this question we need to know the probability distribution which was used to drop the treasure on the line. We can't assume a "uniform" distribution because there's no such thing on the real line, or indeed on the integers.

Who says the treasure is deposited at random?

Edit: The scenario I have in mind is one where the treasure is placed deterministically on the line by an entity who knows your algorithm and wants to see you perform badly. There are a lot of problems like this where you can do pretty well by using a randomized algorithm, and this could be one of them.

shrughes
Oct 11, 2008

(call/cc call/cc)

qntm posted:

To answer this question we need to know the probability distribution which was used to drop the treasure on the line. We can't assume a "uniform" distribution because there's no such thing on the real line, or indeed on the integers.

There kind of is. From the perspective of a self-similar algorithm like one where you double your distance walked each time you turn around, you could, for example, assume the distribution is uniform between two adjacent points at which one turns around.

It's not hard to make a sequence of nice-looking probability distributions that converge to this property.

Heavy_D
Feb 16, 2002

"rararararara" contains the meaning of everything, kept in simple rectangular structures

shrughes posted:

It's not hard to make a sequence of nice-looking probability distributions that converge to this property.

But what it converges to isn't a probability distribution. The simplest way of creating a converging distribution Xn would be

P(Xn=x) = 1/(2n+1) for |x|<=n

The problem is that in the limit those probabilities all go to zero.

In general, the proof goes:

Suppose there exists a uniform probability distribution X on the integers.

P(X=i) = P(X=j) for all i,j by uniformity. Let p denote this probability.

Suppose p > 0. Then the sum of P(X=i) over all i is infinite, which contradicts X being a distribution since they should sum to 1.
Suppose p = 0. Then the sum of P(X=i) over all i is zero, which contradicts X being a distribution.

kimbo305
Jun 9, 2007

actually, yeah, I am a little mad
Maybe the problem could be constrained to a finite range, with uniform distribution of the treasure on that range. Then the question of efficiency is the expected cost (as a proportion of the range size) in traversed distance before hitting the treasure.
...though my gut says running all the way out to one end (and back if needed) might be the most efficient way.

Nippashish
Nov 2, 2005

Let me see you dance!

ultrafilter posted:

Who says the treasure is deposited at random?

Edit: The scenario I have in mind is one where the treasure is placed deterministically on the line by an entity who knows your algorithm and wants to see you perform badly. There are a lot of problems like this where you can do pretty well by using a randomized algorithm, and this could be one of them.
I think this approach is the most reasonable, since it entirely sidesteps the problematic issue of defining a probability distribution over the possible goal locations.

This also seems to be how other people approach this problem. The first google result for "cow path problem" shows that the obvious strategy of alternating directions and doubling how far you walk each time is 9-competitive, and it cites Searching in an Unknown Environment: An Optimal Randomized Algorithm for the Cow-Path Problem for a more general discussion. This paper mentions that the optimal deterministic strategy is 9-competitive (so you can't do better than doubling without randomization), and also gives a randomized algorithm which has an optimal competitive ratio for any number of paths. Their algorithm is ~4.591-competitive for two paths, so ultrafilter's intuition that randomization can win big here is correct.
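Out of curiosity, here's a tiny sketch of what the doubling strategy costs (my own illustration, not from the paper); the worst-case ratio creeps up toward the 9-competitive bound:
code:
# Sketch only: total distance walked by the doubling zig-zag before reaching a
# treasure at signed position `treasure` (start at 0, first leg of length 1 to
# the right; the choice of first direction is arbitrary).
def doubling_search_cost(treasure, max_legs=64):
    pos, walked, leg, direction = 0.0, 0.0, 1.0, 1
    for _ in range(max_legs):
        target = direction * leg
        if min(pos, target) <= treasure <= max(pos, target):
            return walked + abs(treasure - pos)  # found on this leg
        walked += abs(target - pos)
        pos, leg, direction = target, leg * 2, -direction
    raise ValueError("treasure not reached within max_legs")

# Worst case: treasure sits just past a turnaround point you walked away from.
d = 1024.5
print(doubling_search_cost(d) / d)  # ~8.99, approaching the competitive ratio of 9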

shrughes
Oct 11, 2008

(call/cc call/cc)

Heavy_D posted:

The problem is that in the limit those probabilities all go to zero.

That's not a problem.

Zombywuf
Mar 29, 2008

shrughes posted:

That's not a problem.

Pretty sure it is. You could make p range over the surreals, but that gets complicated.

shrughes
Oct 11, 2008

(call/cc call/cc)
Uh no it's not a problem that we're considering a sequence of probability distributions that converge pointwise to zero.

The question in consideration is whether it could possibly make sense to assume the distribution is uniform between two adjacent points at which one turns around, when one doubles the distance walked each time.

Generally speaking it's trivially easy to have this property: you've got a bunch of such intervals, so make the pdf uniform on each of these segments, though of course it gets smaller with each segment. If the first segment walked is of length 1, then you get segment sizes of

1,1,2,4,8,16,32,...

So you just pick a piecewise pdf that is uniform on the respective segments.

The only thing is that you'd like the pdf to be resilient to the initial interval size. We want something equivalent to this for all scaling factors.

The question is, how flat can your distribution be? I was pointing out that you can get arbitrarily flat. For example, the distributions of the form f_p(x) = 1/s * { 1, if |x| < 1; 1/|x|^p, if |x| >= 1 }, where s is the appropriate scaling factor, get flatter and flatter. Any value of p greater than 1 gives a valid probability distribution, and smaller values of p give flatter ones. These distributions converge pointwise to zero. Obviously any succession of flatter and flatter distributions will do so. That doesn't matter.

The reason I'm wrong is that you can't actually get arbitrarily flat. The pdf must generally decrease by more than half between any two successive turnaround points; if it decreases by half or less each time, you won't get a convergent integral.

So at least there are no nice families of continuous probability distributions that switch from increasing to decreasing exactly once.

I mean obviously it's wrong to look at individual probability distributions because for any individual probability distribution it's trivial to pick a step size large enough such that the first two segments cover 90% of the interval, and then twiddling the scaling factor will show something not resembling a uniform distribution. So we can either look at the limit as initial segment size approaches zero or look at sequences of ever-wider probability distributions.

shrughes fucked around with this message at 11:39 on Jun 6, 2011


Zombywuf
Mar 29, 2008

shrughes posted:

Uh no it's not a problem that we're considering a sequence of probability distributions that converge pointwise to zero.

I thought we were talking about the prior pdf for the position of the cow over the entire set of integers. There is no appropriate scaling factor for your function in that case.
