Dr. Erlang Love or: How I Learned to Stop Worrying and Love Crashing

THE PLATFORM MASTER: Jun 3, 2008

LYSE is awesome, thanks MononcQc! I first toyed with Erlang a couple years ago but never had a project to do anything serious with it. Now I've got a system with tons of crash-happy workers talking to clients and Erlang is making dealing with it and doing failover a breeze. Cannot imagine how painful it'd be to be doing this in C#.

Question:
In my program I have continuously running bots that all register themselves with a router, and then the router routes requests to bots that aren't busy. It's trivial to make the router handle a bot dying, but I'm unsure of what to do when a router dies.

I've had a couple of meh ideas:
- When the router starts, it could call supervisor:which_children, query each child, and try to reconstruct its state.
- Bots erlang:monitor the router and, once they are both idle and the router is dead, commit suicide. Upon restart they'll reconnect with the router.
- Bots erlang:monitor the router and periodically check whether the router has come back yet (not even sure how to do this). When it has, they re-register.

What do you recommend? The middle seems the most Erlangy, but all my bots killing themselves and attempting reconnect (both to the router and an external system) at the exact same time is concerning. A random back-off time might help, but it seems like there's a better solution somewhere.
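
For the random back-off idea, here is a minimal sketch of what a jittered delay could look like. The module name backoff, the function delay/3, and its parameters are all hypothetical, and the exponential growth is an added assumption on top of the plain random delay mentioned above. It only relies on rand:uniform/1 from the standard library.

```erlang
-module(backoff).
-export([delay/3]).

%% Delay in milliseconds for the given attempt (1, 2, 3, ...).
%% The upper bound doubles with each attempt, starting at BaseMs and
%% capped at MaxMs. The result is a uniformly random integer between
%% 1 and that bound, so bots restarting together spread out their
%% reconnects. BaseMs and MaxMs are assumed to be positive integers.
delay(Attempt, BaseMs, MaxMs) when Attempt >= 1 ->
    Cap = min(MaxMs, BaseMs bsl (Attempt - 1)),
    rand:uniform(Cap).
```

A bot could call timer:sleep(backoff:delay(Attempt, 100, 5000)) before each reconnect attempt. The numbers are placeholders, not tuned values.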

Jul 7, 2014 04:26

THE PLATFORM MASTER: Jun 3, 2008

MononcQc posted:

A few counter-questions:

- What happens when a router goes down? I'm guessing input stops coming in, and so does output, connections are lost, etc. Is there any state that is assumed to be there that will leave the program in an inconsistent state if the routers come back up without the bots having noticed?

Input stops coming in; however, if a request has already been sent to a bot, that bot keeps running. Basically the architecture is: requests come into the router, get sent to a bot, and then the bot handles all further communication.

The only assumed state that would be a problem is that bots are connected to an outside service with a username and password. If I try to bring up a new bot with the same username and password while an old one is running they'll just disconnect each other and both be useless.

quote:

- How do the bots currently connect to the router? How does the router react when one of the bots crashes and comes back?

The router is registered with {local, ?MODULE} (is this something people do?) and the bots do router:register(self()). Within the router's gen_server the router does an erlang:monitor and knows to remove the bot from the pool of live bots when it dies.
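
For reference, a minimal sketch of the setup described above: a gen_server registered as {local, ?MODULE} that monitors each bot and drops it from the pool on 'DOWN'. The message shapes, the map used as state, and the bots/0 helper are assumptions made for illustration, not the actual code.

```erlang
-module(router).
-behaviour(gen_server).

-export([start_link/0, register/1, bots/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

%% Start the router under a locally registered name, so bots can
%% reach it without knowing its pid.
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Called by a bot as router:register(self()).
register(BotPid) ->
    gen_server:call(?MODULE, {register, BotPid}).

%% Hypothetical helper for inspection: pids of the bots seen as alive.
bots() ->
    gen_server:call(?MODULE, bots).

%% State (an assumption): map of monitor reference => bot pid.
init([]) ->
    {ok, #{}}.

handle_call({register, BotPid}, _From, Bots) ->
    Ref = erlang:monitor(process, BotPid),
    {reply, ok, Bots#{Ref => BotPid}};
handle_call(bots, _From, Bots) ->
    {reply, maps:values(Bots), Bots}.

handle_cast(_Msg, Bots) ->
    {noreply, Bots}.

%% A monitored bot died: remove it from the pool of live bots.
handle_info({'DOWN', Ref, process, _Pid, _Reason}, Bots) ->
    {noreply, maps:remove(Ref, Bots)};
handle_info(_Other, Bots) ->
    {noreply, Bots}.
```

Note that this sketch creates a new monitor on every register call, so a bot that registers twice shows up twice. A real router would probably want to deduplicate by pid.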

quote:

- How often do you expect the router to go down, and for what reasons?

My main concern is I write lovely code, so I expect it to crash from being in untested states. This is a rare occurrence so far but puts my system in a really bad/silent state.

quote:

- How do the routers know which bots are or aren't busy?
- How many routers are there?

Right now only one router. When it assigns a request it assumes the bot has become busy, and then bots do a gen_server:cast to the router when they're done. It would be pretty trivial to shard requests and have multiple routers (each with their own bots), though I don't see that happening for a while. Seems like a solution here would be a solution there too.
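
The busy/idle bookkeeping described above could be sketched as pure functions over the router's state. The module name bot_pool, its function names, and the {Idle, Busy} pair of lists are hypothetical; the real router may keep this state differently.

```erlang
-module(bot_pool).
-export([new/0, add/2, assign/1, done/2, remove/2]).

%% State is {Idle, Busy}: two lists of bot pids.
new() ->
    {[], []}.

%% A bot registered: it starts out idle.
add(Bot, {Idle, Busy}) ->
    {[Bot | Idle], Busy}.

%% Pick an idle bot for a request and assume it is busy from now on.
assign({[], _Busy}) ->
    none;
assign({[Bot | Idle], Busy}) ->
    {ok, Bot, {Idle, [Bot | Busy]}}.

%% The bot reported that it finished (for example via gen_server:cast).
%% Reports from bots that are not marked busy are ignored.
done(Bot, {Idle, Busy}) ->
    case lists:member(Bot, Busy) of
        true -> {[Bot | Idle], lists:delete(Bot, Busy)};
        false -> {Idle, Busy}
    end.

%% The bot died: forget it, whether it was idle or busy.
remove(Bot, {Idle, Busy}) ->
    {lists:delete(Bot, Idle), lists:delete(Bot, Busy)}.
```

Keeping this logic free of message passing makes it easy to test on its own, and sharding across several routers would just mean one such state per router.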

Jul 7, 2014 16:48

THE PLATFORM MASTER: Jun 3, 2008

MononcQc posted:

Okay so my understanding there at this point is that the router is required to get the request, but not to handle the response.

[good stuff]

Ah that's really nice, definitely makes more sense. So if I understand this right, every time you become available for a request you register yourself. This way, if the router is dead and you're busy then you don't notice until you're available and then you just loop until it comes back. Cool!
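
A rough sketch of that loop as I understand it. The quoted advice is elided above, so this is only one reading of the summary, not the suggested implementation. The module name bot_register, the function register_when_available/2, and the {register, Pid} message are hypothetical. It relies on gen_server:call/2 exiting when the named process does not exist or dies during the call.

```erlang
-module(bot_register).
-export([register_when_available/2]).

%% Call this each time the bot becomes available for a request.
%% If the router is down, gen_server:call/2 exits (for example with
%% noproc), so wait RetryMs milliseconds and try again until the
%% router is back and answers ok.
register_when_available(RouterName, RetryMs) ->
    try gen_server:call(RouterName, {register, self()}) of
        ok -> ok
    catch
        exit:_ ->
            timer:sleep(RetryMs),
            register_when_available(RouterName, RetryMs)
    end.
```

A busy bot never runs this loop, so it only notices a dead router once it is idle again, which matches the behaviour described above.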

Jul 7, 2014 19:25
