Shaggar
Apr 26, 2006
git is garbage for idiots.


PleasureKevin
Jan 2, 2011

Captain Foo
May 11, 2004

we vibin'
we slidin'
we breathin'
we dyin'


hi luke v

theadder
Dec 30, 2011



this name deceit gave me no pleasure luke

Bloody
Mar 3, 2013


"null" == null

true

Carthag Tuek
Oct 15, 2005

Times shall come,
times shall pass away,
generation shall follow generations' course



if ("true" == 0) { hosed up }

suffix
Jul 27, 2013

Wheeee!

MononcQc posted:

Of course if you have that mechanism, you can start thinking about sets! How do I model a set? Ah well, there are optimized ways, but I can use these primitive counters.

[{element1, {+: {...}, -: {...}}}, {element2, ...}, ...]

Or if you want to make it a bit easier:

{[{element1+: {...}}, ...], [{element1-: {...}}, ...]}

is that safe without timestamps?

what do you do if two nodes have added an element, and another node wants to remove it?

JewKiller 3000
Nov 28, 2006

by Lowtax

Bloody posted:

"null" == null

compile-time type error

Snapchat A Titty posted:

if ("true" == 0) { compile-time type error }

ftfy

MononcQc
May 29, 2007

suffix posted:

is that safe without timestamps?

what do you do if two nodes have added an element, and another node wants to remove it?

Ah yes. So that's a fun distinction. Depending on the implementation of the set, you have the following options:
  • Grow-only set: you can only add elements, never remove them; you only need the + side of the set.
  • Two-phase set: when an element is added then removed, it can never be added again.
  • Last-Write-Wins element set: this one uses a timestamp to figure out what "last write wins" means (see the sketch after this list). The timestamp in the original paper is of an unspecified type, and could be a physical timestamp as well as a logical one. You can add and delete elements multiple times that way, and there are very fancy ways to deal with timestamps if you want, such as Hybrid Logical Clocks, which mix lamport clocks and NTP time to make sure you get both a causal link and a time-based link with an interval of confidence on the timestamp accuracy.
  • PN Set: the counter I showed with {+: ..., -: ...} is called a PN counter. The set built from that (the example I gave) is a PN Set. In this case, an element is considered in if the + counter is greater than the - counter, and deleted otherwise.
  • Observed-Remove set: a variation of the PN-set, where causality is preserved in how updates are dispatched, which lets you avoid the case where 1 node adds the same element 5 times, deletes it once, and it still shows up as there. This set has two subtypes: add wins or remove wins when causality conflicts are detected.
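
To make the LWW element set concrete, here's a minimal python sketch (the names are mine, not from Riak or any real library): one map from element to latest add timestamp, one for removes, merged with a pointwise max.

code:
# LWW-element-set sketch: merge is pointwise max on both maps, which
# is commutative, associative and idempotent, so replicas converge.
class LWWSet:
    def __init__(self):
        self.adds = {}      # element -> timestamp of latest add
        self.removes = {}   # element -> timestamp of latest remove

    def add(self, element, ts):
        self.adds[element] = max(ts, self.adds.get(element, ts))

    def remove(self, element, ts):
        self.removes[element] = max(ts, self.removes.get(element, ts))

    def __contains__(self, element):
        if element not in self.adds:
            return False
        # >= biases towards add on a timestamp tie; > would be remove-wins
        return self.adds[element] >= self.removes.get(element, float("-inf"))

    def merge(self, other):
        for e, ts in other.adds.items():
            self.adds[e] = max(ts, self.adds.get(e, ts))
        for e, ts in other.removes.items():
            self.removes[e] = max(ts, self.removes.get(e, ts))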

So a good CRDT library (say Riak?) should implement many of these, and you could pick the set semantics that suit your problem. Other fancypants types exist (registers, graphs, sequences, etc.) and some interesting work has been done with L-Vars + CRDTs allowing you to give an 'at least N' semantic to any structure, which could let you do funky things like implement some quorum-like mechanism over CRDTs.

It goes deep deep deep.

Edit: I should say that while I kind of demoed a simple "let's assemble primitive types" approach to building sets, people who design that poo poo for a living end up doing so by bypassing the primitives and defining more efficient structures that still respect the requirements to be considered CRDTs.

Edit2: I have a hard time reading the mathy stuff and my understanding of some of the structures might be off.

MononcQc fucked around with this message at 06:05 on Nov 22, 2014

Don Mega
Nov 26, 2005

Shaggar posted:

git is garbage for idiots.
since ur a windows fan i assume any tool that doesn't have a gui is too difficult for you to comprehend?

Shaggar
Apr 26, 2006
there are guis for git. visual studio itself will do git. but git is bad because it is poorly designed and distributed version control is a bad idea.

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





tef posted:

anyone want to talk about crdts ?

i am all about crdts. let's do this

distributed systems (edit: good ones at least) rely on leslie lamport's state machine replication approach. you start each node in the same initial state and then ensure they make the same state transitions in the same order on all nodes and they will all have the same state progression. ignoring time deltas they will all be in the same state

ensuring they make the same transitions in the same order turns out to be really hard. you need linearizable transitions (strict ordering, basically). paxos is the only algorithm (ok, maaaaaybe raft and maybe zookeeper isn't actually paxos) widely considered to be sufficient to handle that requirement. in practice this reduces most distributed systems to be only kinda paxos (like spanner and zookeeper) or only consistent if you ignore all the ways they can fail to be consistent (which is a lot, see basically any discussion of the CAP theorem for a tiny subset of them)

crdts are a way to sidestep the ordering part of the problem. instead of state machines with strictly ordered state transitions you instead work with a set (in the mathematical sense) of operations (analogous to state transitions) and an algorithm to reduce that set to some output. that reduces the consistency problem to merging sets

you can't use just any old algorithm for the reduction of your set of course. you need an algorithm that's commutative, idempotent and associative. (the cmrdts mononcqc mentions relax this further and eliminate the requirement that the reduce algorithm be commutative, but require updates to be causally ordered (via vector clocks or whatever))

with a little bit of work you can define counters, registers (a type that contains a single other type that can be replaced or read), sets, graphs, sequences and anything you can assemble from those. if that seems familiar it's because those are the same basic types most programming languages give you
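
a concrete sketch of the smallest of those, in python (my own naming, not from any particular library): a grow-only counter where each node bumps its own slot and merge is pointwise max. the merge is commutative, associative and idempotent, so replicas can gossip state in any order, any number of times, and still converge

code:
# G-Counter sketch: node_id -> count of increments from that node.
# the value is the sum of all slots; merge is pointwise max.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {node_id: 0}

    def increment(self):
        self.counts[self.node_id] += 1

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(count, self.counts.get(node, 0))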

tef
May 30, 2004

-> some l-system crap ->

MononcQc posted:

So a good CRDT library (say Riak?) should implement many of these, and you could pick the set semantics that suit your problem. Other fancypants types exist (registers, graphs, sequences, etc.) and some interesting work has been done with L-Vars + CRDTs allowing you to give an 'at least N' semantic to any structure, which could let you do funky things like implement some quorum-like mechanism over CRDTs.

I'm actually leaning towards the opposite approach, that it is probably better to have a handful of high level data structures, rather than building everything from 2P-Sets. For example: roshi just has one data structure, a last write wins timeline, which has add/select/delete.

Riak goes for the full coverage, and as a result the api is less 'get'/'set'/'timeline' and more buckets, bucket config, and operations within the bucket. Dynamo, the amazon system which riak is built on, has never been exposed directly to aws customers, but they've built many services atop of it.

I think application writers would prefer a handful of well defined /opinionated/ structures rather than a toolkit that requires expert knowledge to construct safely. I'm not sure what these structures are yet but i imagine a redis-alike with last-write-wins would be a lot more useful than redis' "lol what's a distributed system" or riak's "time to implement your own conflict management".

Something with the interface of redis: here are five data structures with distinct operations, but unlike redis, and more like riak, built around the assumption it will be running as part of a larger cluster. I've seen more systems built atop redis than riak_core and it makes me sad.



MononcQc posted:

So GC? Not that much for CRDTs, as far as I know. Either you can't afford it (because you'd lose information), or it's baked into the middleware. For vector-clocks and other mechanisms where you go the causal way and expect conflicts from time to time, GC can be done as a gamble.

The two problems of CRDTs (unbounded growth, and the need for synchronisation to compact without loss) can be avoided by abandoning persistence: i.e. be a cache. If all data is ephemeral, you can avoid unbounded growth. You can also compact by expiring old data.

MononcQc
May 29, 2007

the talent deficit posted:

paxos is the only algorithm (ok, maaaaaybe raft and maybe zookeeper isn't actually paxos) widely considered to be sufficient to handle that requirement. in practice this reduces most distributed systems to be only kinda paxos (like spanner and zookeeper) or only consistent if you ignore all the ways they can fail to be consistent (which is a lot, see basically any discussion of the CAP theorem for a tiny subset of them)

Paxos is an example of a Quorum system; it's the first big canonical one, but Raft and ZAB (Zookeeper's protocol) both implement quorums in a different but sound manner. Saying "paxos is the only algorithm to be linearizable" is a bit like saying "mergesort is the only true sorting algorithm, others are just equivalent".

I'm not fully sure you need linearizability for the state machine approach -- I'm a bit rusty. I'm thinking serializability (i.e. ACID) might be enough as long as you have a single master that ensures the only order taken is the one that is replicated elsewhere, but it gets to be interesting to go there. In any case, quorum systems are more tolerant of individual member failures so far.

the talent deficit posted:

you can't use just any old algorithm for the reduction of your set of course. you need an algorithm that's commutative, idempotent and associative. (the cmrdt's mononoqc mentions relax this further and eliminate the requirement the reduce algorithm be commutative but require updates are causually ordered (via vector clocks or whatever))
The updates to the datastructure in an op-based CRDT still require commutativity, idempotence (which is achieved per-message through the vclock), and so on, because you don't know who's gonna communicate a given update to you, or whom it's gonna come from. The difference is that op-based CRDTs don't need to be convergent iirc, but in practice they're just plain equivalent.

tef
May 30, 2004

-> some l-system crap ->

the talent deficit posted:

i am all about crdts. let's do this

do you know anything else like roshi? (like mononcqc mentions, I think it would be a nice idea to add hybrid logical clocks to it, or any timestamp based set) (i have finally got hybrid logical clocks and they are very very clever)

quote:

in practice this reduces most distributed systems to be only kinda paxos (like spanner and zookeeper) or only consistent if you ignore all the ways they can fail to be consistent (which is a lot, see basically any discussion of the CAP theorem for a tiny subset of them)

fwiw i think we don't do ourselves a favour by defining distributed systems canonically as distributed systems with shared state, even if the latter is the hardest bit. there's a poo poo ton of massively distributed systems which don't use paxos (any p2p network). and of the ones that share state, it's actually a small subset of these that require linearisability. i think the reason these shared state systems are in vogue is that they allow the application developer to pretend that they're not dealing with a distributed system, with all the mess of latency and availability/consistency tradeoffs.

i think we do ourselves a disservice by not pushing people towards statelessness and idempotency.

ps: raft is more like paxos, they are both network consensus algorithms. zab on the other hand is atomic broadcast. :3:

quote:

with a little bit of work you can define counters, registers (a type that contains a single other type that can be replaced or read), sets, graphs, sequences and anything you can assemble from those. if that seems familiar it's because those are the same basic types most programming languages give you

they're familiar because we have similar concepts: a usenet server is in some ways a crdt: you can send, remove postings, these operations are both idempotent and can occur in any order, linked together by a shared message-id. there is probably a back catalogue of ad-hoc crdts in the industry.

i'm kinda hoping crdts mean we get standard eventually consistent datastructures. i think they work better for ap than cp systems though; for cp you have to go into the world of paxos *whipcrack*

tef
May 30, 2004

-> some l-system crap ->

MononcQc posted:

The difference is that op-based CRDTs don't need to be convergent iirc, but in practice they're just plain equivalent.

For me the distinction is very akin to snapshots vs write ahead logs, and tbh the difference seems to be mostly around implementation: if you send out everything you have, it's state based, if you send out partial information, it's operation based.

I think any of the op based ones can be used as a state based one, by sending out the entire op set rather than individual operations, and doing a merge of the sets. Similarly you should be able to convert a state based one into an operation log, preserving any causal semantics.

The thing seems to be that sometimes a state is smaller than a list of operations, and sometimes it's the other way around. If you only require a count of operations, and the operations carry no state, it's cheaper to store a number and sum them to merge. On the other hand, sometimes the ordering of the operations matters, so you need to store the ops themselves, and you might as well send out partial updates.

The distinction seems to mostly benefit academia as it gives them two papers to publish :3:

For programmers the questions should be more "Am I storing aggregates of operations, or am I storing the operations themselves" and "Do I send out the entirety of the data structure to merge, or do I send partial updates?"
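
To make that last question concrete, a rough python sketch (my framing, not from any paper): the same counter with ops shipped and unioned, versus the aggregate state shipped and merged.

code:
# the same counter, two framings. ops are (node, seq) pairs, so the
# union is idempotent; state maps node -> highest count seen.
ops = {("node_a", 1), ("node_a", 2), ("node_b", 1)}
state = {"node_a": 2, "node_b": 1}

def merge_ops(mine, theirs):
    return mine | theirs  # union the op sets, count them to read

def merge_state(mine, theirs):
    return {node: max(mine.get(node, 0), theirs.get(node, 0))
            for node in mine.keys() | theirs.keys()}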

MononcQc
May 29, 2007

tef posted:

I think application writers would prefer a handful of well defined /opinionated/ structures rather than a toolkit that requires expert knowledge to construct safely. I'm not sure what these structures are yet but i imagine a redis-alike with last-write-wins would be a lot more useful than redis' "lol what's a distributed system" or riak's "time to implement your own conflict management".

Something with the interface of redis: here are five data structures with distinct operations, but unlike redis, and more like riak, built around the assumption it will be running as part of a larger cluster. I've seen more systems built atop redis than riak_core and it makes me sad.

As you said though, Riak covers them all, or is on its way to. Types still in the original paper are:
  • Add-only DAG: good old graphs
  • Add/remove partial order: the graph is now a timeline with events
  • Replicated Growable Arrays: you're restricted in where you can append, and it's more of a linked list built on the graph
  • Trees (acyclic DAGs, heh)
  • Registers (shopping carts)

So the stuff Roshi does? Likely the add/remove partial order or a derivative.

As you said though, Riak's tied to its architecture a lot, but all of the CRDTs above could be used in a 'redis-like' data structure DB.

Redis has hashes, counters, lists, k/v, sorted sets, etc.

You could arguably represent the same interface with CRDTs. Riak doesn't, but could; you could probably provide the abstraction on top of it, too.

tef posted:

The two problems of CRDTs (unbounded growth, and the need for synchronisation to compact without loss) can be avoided by abandoning persistence: i.e. be a cache. If all data is ephemeral, you can avoid unbounded growth. You can also compact by expiring old data.

You're probably in a fun spot when you need the strong eventual consistency of a CRDT but can be okay with eventually losing that data. Not sure what your use case would be to warrant the effort over more regular causality stuff.

tef
May 30, 2004

-> some l-system crap ->

MononcQc posted:

You're probably in a fun spot when you need the strong eventual consistency of a CRDT but can be okay with eventually losing that data. Not sure what your use case would be to warrant the effort over more regular causality stuff.

it's being able to delete data before expiration that makes it necessary.

essentially, you want to be able to cache some expensive data, however, you need to insert tombstones to be able to delete things, and then timestamps to be able to add them back again, etc, and now you have a lww crdt. now we have a cache which we can remove things from.

i am writing a messaging system, and i want to allow people to be a part of group messages, and leave them. i have some big database somewhere, but a lot of the time it's really expensive to work out who is in which group, as people are in lots of groups, and they're often short lived. with crdt semantics i can cache the group membership, but when someone leaves the group, they might get some messages, but they'll eventually get dropped.

in another case i am writing a messaging system, but each person can write on their own stream and subscribe to many other people's, and i need to cache it, but i want to let people delete their messages if they made a mistake. with a cache, it's harder to force eviction without crdts.

MononcQc
May 29, 2007

tef posted:

it's being able to delete data before expiration that makes it necessary.

essentially, you want to be able to cache some expensive data, however, you need to insert tombstones to be able to delete things, and then timestamps to be able to add them back again, etc, and now you have a lww crdt. now we have a cache which we can remove things from.

Ok, yeah, I see.

tef posted:

i am writing a messaging system, and i want to allow people to be a part of group messages, and leave them. i have some big database somewhere, but a lot of the time it's really expensive to work out who is in which group, as people are in lots of groups, and they're often short lived. with crdt semantics i can cache the group membership, but when someone leaves the group, they might get some messages, but they'll eventually get dropped.

in another case i am writing a messaging system, but each person can write on their own stream and subscribe to many other people's, and i need to cache it, but i want to let people delete their messages if they made a mistake. with a cache, it's harder to force eviction without crdts.

For your messaging cases, funnily enough, it's what the guys at League Of Legends ended up doing for their chat system: http://highscalability.com/blog/2014/10/13/how-league-of-legends-scaled-chat-to-70-million-players-it-t.html

I know the lead on their team (former co-worker); I could ask if they reused the Riak data type library or spun their own.

tef
May 30, 2004

-> some l-system crap ->
i think i've finally gotten my head around hybrid logical clocks and i'd like someone to check my working

the paper is a bit awful to skim and requires closer reading than you expect :v: http://www.cse.buffalo.edu/tech-reports/2014-04.pdf


# event handling

let's look at a cluster of machines sending and receiving timestamped messages to each other. unfortunately, the clocks on these machines drift apart, so there will be some timestamps which are misleading, and worse still, sometimes clocks go backwards.

we'd like to be able to tell which messages happened before other messages, across the cluster, so instead of timestamping with the machine's clock, we can timestamp messages with the highest timestamp we've seen so far:

code:
var max_timestamp = time.now()

def send(message) {
   max_timestamp = max(time.now(), max_timestamp)
   message.max_timestamp = max_timestamp
   io.send(message)
}

def receive(message) {
   # reject messages whose timestamp is wildly out from our clock
   if |message.max_timestamp - time.now()| > skew { error }
   max_timestamp = max(time.now(), max_timestamp, message.max_timestamp)
   message.max_timestamp = max_timestamp
   handler(message)
}
this means that for a given time we can reasonably tell which events happened after it, but it isn't without flaws:

if i receive a message with a timestamp that's ahead of me, all messages will be sent with the same timestamp until my clock catches up. if my clock is low granularity, i send a lot of messages with the same timestamp too.

we could abandon physical time and go for a counter instead, but now we have different problems: the counter gives us ordering information but we no longer can match it up to physical time.

code:
var counter = 0

def send(m) {
   counter = counter + 1
   m.counter = counter
   io.send(m)
}

def receive(m) {
   # let's assume we do skew checking elsewhere
   counter = max(m.counter, counter) + 1
   m.counter = counter
   handle(m)
}
(this is actually called a logical clock, as it keeps time but nothing in relation to physical time)

so let's keep a maximum timestamp, and a counter too, but we'll have to build up the logic in parts.

sending is easy: like before we keep a maximum timestamp seen, but we add one to the counter if
our clock is behind, and reset the counter so it doesn't grow forever.

code:
var counter = 0
var max_ts = time.now()

def send(m) {
   t = time.now()
   if (t > max_ts) { # if we're ahead, reset the counter
       max_ts, counter = t, 0
   } else {
       counter++; # otherwise, increment it
   }
        
   m.max_ts, m.counter = max_ts, counter
   io.send(m)
}
but receiving is much harder. a first attempt would be to combine the two:

code:
def receive(m) {
   counter = max(m.counter, counter) + 1 
   max_ts = max(time.now(), m.max_ts, max_ts)
   m.counter, m.max_ts = counter, max_ts
   handle(m)
}
this will work, but we increment the counter far more than we need to. unless a node sends, its internal counter will grow and never be reset. instead of taking the maximum counter, we can break it down into a number of cases - we really only care about counters when the timestamps are the same.

we can reset the counter when our current time is greater than both max_ts and message.max_ts; otherwise we must increment the counter. we can split this into the cases where message.max_ts is greater than, less than, or equal to max_ts
code:
def receive(m) {
   current, old_max = time.now(), max_ts
 
   max_ts = max(current, old_max, m.max_ts)

   if (max_ts == old_max == m.max_ts) {
      counter = max(m.counter, counter) + 1 # new counter must be bigger than both
   } else if (max_ts == m.max_ts) {
      counter = m.counter + 1 # only has to be bigger than the message's
   } else if (max_ts == old_max) {
      counter = counter + 1 
   } else  { # current is largest timestamp seen
      counter = 0
   } 

   m.max_ts, m.counter = max_ts, counter
   handle(m)
}
we now have a logical timestamp, with an attached maximum physical time seen, which allows us to do wondrous things: timestamps that stay reasonably close to physical time, but can handle skew and drift, and even a machine's clock going backwards.

i think this sort of system would allow us to retain causality even with leap seconds :swoon:
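
(comparing two of these timestamps is then just lexicographic order on the (max_ts, counter) pair; a quick python sketch, field names mine:)

code:
from collections import namedtuple

HLC = namedtuple("HLC", ["max_ts", "counter"])

def happened_before(a, b):
    # lexicographic: compare physical max first, counter breaks ties.
    # identical pairs from different nodes are concurrent / unordered.
    return (a.max_ts, a.counter) < (b.max_ts, b.counter)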

Notorious b.s.d.
Jan 25, 2003

by Reene
so in this system, the physical timestamp is just for laffs?

markerstore
Dec 5, 2003
Canny!

Shaggar posted:

distributed version control is a bad idea.

how do you work on multiple things at once

Brain Candy
May 18, 2006

Notorious b.s.d. posted:

so in this system, the physical timestamp is just for laffs?

i can see it being useful for logs/history. nobody is going to tell you that the service went down at 0xAFD10A12

tef
May 30, 2004

-> some l-system crap ->

Notorious b.s.d. posted:

so in this system, the physical timestamp is just for laffs?


?

in this system, it augments physical time, well, replaces it with a maximum seen and a counter

the thing is, it lets you have consistent timestamps across the network, without having to compensate for skew.

if you have a system where they make lots of calls to each other, you can use these instead of wall clock, and lo and behold, every response will have a later timestamp than the request it was for

it's a timestamp for distributed systems that's monotonic, resilient to clock skew both forwards and backwards, and it's also possible to do range operations, i.e. give me all of the events that happened before this one was sent and received, and you get a pretty good picture.

it'll let your clocks drift by a specified amount and still preserve a lot of causality

sorry, not causality: "happened before", it's a much weaker guarantee

tef fucked around with this message at 19:14 on Nov 22, 2014

tef
May 30, 2004

-> some l-system crap ->
the physical time serves two purposes:

- actually knowing when things roughly happen, as rough as you get with clock skew
- bounding the counter so it doesn't explode

the counter meanwhile gives monotonically incrementing timestamps when the physical clock is behind the currently seen maximum time.

tef
May 30, 2004

-> some l-system crap ->
it would be nice if, when i merge my log files together from multiple machines, the logs had some ordering such that the log order of calls between machines is preserved, within some reasonable clock skew. tracing rpc interactions is nicer

it would also be nice if, when i have a last-write-wins set and i issue a delete operation after an add, the delete were guaranteed to have a later timestamp, unlike with physical time. because it's monotonic, it makes it a little easier to take a snapshot at a given time window, and always get the same result back.

and there is some possibility of using it as an ad-hoc true time api, if you can guarantee your clock skew is very small, to give finer granularity

it's just a real nice replacement for physical timestamps

tef
May 30, 2004

-> some l-system crap ->
also, leap seconds :3:

suffix
Jul 27, 2013

Wheeee!
so if you have two messages with the same time and counter, they came from different servers and are unordered?

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





Notorious b.s.d. posted:

so in this system, the physical timestamp is just for laffs?

no it's like a best effort thing. event α probably happened at time t but definitely happened before event β. the causal part (α → β) is critical but the timestamp being a wall clock time is nice to have

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





tef posted:

the thing is, it lets you have consistent timestamps across the network, without having to compensate for skew.

it doesn't really do this i don't think. not without some kind of synchronization on top (paxos or some kind of write throttling that prevents writes that fall outside some tolerance bracket). i mean, they're consistent in the sense they all make some sort of sense but they don't necessarily reflect reality. different nodes can have wildly different skew. you only get happened before relationships on a single node

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





tef posted:

I'm actually leaning towards the opposite approach, that it is probably better to have a handful of high level data structures, rather than building everything from 2P-Sets. For example: roshi just has one data structure, a last write wins timeline, which has add/select/delete.

I think application writers would prefer a handful of well defined /opinionated/ structures rather than a toolkit that requires expert knowledge to construct safely. I'm not sure what these structures are yet but i imagine a redis-alike with last-write-wins would be a lot more useful than redis' "lol what's a distributed system" or riak's "time to implement your own conflict management".

i think the next step forward in distributed systems will be some sort of platform agnostic crdt api such that you can write things that look like declarative/imperative code but run in a distributed context. less a database/storage layer and more a computation layer. hadoop but not limited to mapreduce

if you can make counters, registers, maps, graphs and sequences efficient (both space and time) you have enough basic types to assemble just about anything you could ask for. you end up with an environment something like MUMPS where you just perform operations on variables that exist in the storage layer without having to worry about synchronization or consistency

suffix
Jul 27, 2013

Wheeee!
crdts sound cool but irl i think i'd prefer hbase distributed to riak distributed most of the time.

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





MononcQc posted:

Paxos is an example of a Quorum system; it's the first big canonical one, but Raft and ZAB (Zookeeper's protocol) both implement quorums in a different but sound manner. Saying "paxos is the only algorithm to be linearizable" is a bit like saying "mergesort is the only true sorting algorithm, others are just equivalent".

paxos is the only linearizable algorithm in common use. i know etcd and consul (and maybe something else i'm forgetting) are using raft but they are both very new. spanner and zookeeper aren't strictly paxos (spanner uses crazy gps synchronized atomic clocks to skip some steps and zookeeper requires your transport make a bunch of guarantees to do the same) but they are based on paxos

MononcQc posted:

I'm not fully sure you need linearizability for the state machine approach -- I'm a bit rusty. I'm thinking serializability (i.e. ACID) might be enough as long as you have a single master that ensures the only order taken is the one that is replicated elsewhere, but it gets to be interesting to go there. In any case, quorum systems are more tolerant of individual member failures so far.

the sacrifice crdts make over paxos based systems is linearizability. they're serialized but not linear. if you apply the entirety of some set α you get result β but there's no guarantee that applying a partial set of α is consistent with any other partial set of α. in practice that means there's no read consistency except when you can apply some total ordering, i.e. linearize. single nodes get read consistency but reading from two different nodes can give completely different results. two nodes may start in the same state and end in the same state in a serialized system but their intermediate states could be completely divergent

MononcQc posted:

The updates to the datastructure in an op-based CRDT still require commutativity, idempotence (which is achieved per-message through the vclock), and so on, because you don't know who's gonna communicate a given update to you, or whom it's gonna come from. The difference is that op-based CRDTs don't need to be convergent iirc, but in practice they're just plain equivalent.

in an op based crdt (cmrdt) you can establish a causal relationship (i mean 'happened before') on events from a single node. that lets you relax commutativity because you can control the order of ops. you still need associativity for the reason you state (other nodes can essentially insert operations between yours). cmrdts are really just a specialization of crdts in which you can make the causal guarantee

gonadic io
Feb 16, 2011

>>=
i think i feel how tbc feels when i talk about monads

tef
May 30, 2004

-> some l-system crap ->
side note:

so you know consistent hashing? to match a message to a server, hash(message), and find the first hash(server) greater than it

and uh, rendezvous hashing? to match a message to a server, you do hash(message, server) for every server, and uh, pick the lowest one. this is a bit slower than consistent hashing, but uses less memory.

well, there is also kademlia, which seems to have the properties of both: hash(message, server) is defined by hash(message) xor hash(server). so like rendezvous hashing, you get good balancing without vnodes, and like consistent hashing, you're matching up by a distance, but xor this time. the trick seems to be that if you store your hash(server)s in a critbit tree, you should be able to do the lowest-xor lookup very, very quickly.
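
(a quick python sketch of the three lookups side by side, just to make the distances concrete; the hash and server names are made up for illustration:)

code:
import hashlib

def h(*parts):
    # toy 64-bit hash for illustration
    data = b"|".join(p.encode() for p in parts)
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def consistent(servers, message):
    # first server hash >= message hash, wrapping around the ring
    ring = sorted(servers, key=h)
    target = h(message)
    for server in ring:
        if h(server) >= target:
            return server
    return ring[0]

def rendezvous(servers, message):
    # hash the message with every server, pick the lowest
    return min(servers, key=lambda server: h(message, server))

def kademlia(servers, message):
    # lowest xor distance wins
    return min(servers, key=lambda server: h(message) ^ h(server))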

Malcolm XML
Aug 8, 2009

I always knew it would end like this.

tef posted:

For me the distinction is very akin to snapshots vs write ahead logs, and tbh the difference seems to be mostly around implementation: if you send out everything you have, it's state based, if you send out partial information, it's operation based.

I think any of the op based ones can be used as a state based one, by sending out the entire op set rather than individual operations, and doing a merge of the sets. Similarly you should be able to convert a state based one into an operation log, preserving any causal semantics.

The thing seems to be that sometimes a state is smaller than a list of operations, and sometimes it's the other way around. If you only require a count of operations, and the operations carry no state, it's cheaper to store a number and sum them to merge. On the other hand, sometimes the ordering of the operations matters, so you need to store the ops themselves, and you might as well send out partial updates.

The distinction seems to mostly benefit academia as it gives them two papers to publish :3:

For programmers the questions should be more "Am I storing aggregates of operations, or am I storing the operations themselves" and "Do I send out the entirety of the data structure to merge, or do I send partial updates?"

do I church encode or

tef posted:

side note:

so you know consistent hashing? to match a message to a server, hash(message), and find the first hash(server) greater than it

and uh, rendezvous hashing? to match a message to a server, you do hash(message, server) for every server, and uh, pick the lowest one. this is a bit slower than consistent hashing, but uses less memory.

well, there is also kademlia, which seems to have the properties of both: hash(message, server) is defined by hash(message) xor hash(server). so like rendezvous hashing, you get good balancing without vnodes, and like consistent hashing, you're matching up by a distance, but xor this time. the trick seems to be that if you store your hash(server)s in a critbit tree, you should be able to do the lowest-xor lookup very, very quickly.

dhts solve a lot of common use cases

Orleans uses them internally

suffix
Jul 27, 2013

Wheeee!
so you randomly order the servers based on their hash, but the hash of the message decides the sorting order of each bit.

do you need a lot of hash bits for the kademlia thing to work well? i tried with a 64-bit hash, and the distribution is very uneven

suffix
Jul 27, 2013

Wheeee!
nope. tried with 128 bit keys and still getting ridiculous distributions like
kademlia:
Counter({b'endpoint7': 249270, b'endpoint0': 125624, b'endpoint3': 124987, b'endpoint6': 124666, b'endpoint9': 62954, b'endpoint4': 62699, b'endpoint5': 62601, b'endpoint1': 62481, b'endpoint8': 62426, b'endpoint2': 62292})


i think it tends toward skewed distributions and isn't good for load balancing like consistent hashing is

the talent deficit
Dec 20, 2003

self-deprecation is a very british trait, and problems can arise when the british attempt to do so with a foreign culture





suffix posted:

nope. tried with 128 bit keys and still getting ridiculous distributions like
kademlia:
Counter({b'endpoint7': 249270, b'endpoint0': 125624, b'endpoint3': 124987, b'endpoint6': 124666, b'endpoint9': 62954, b'endpoint4': 62699, b'endpoint5': 62601, b'endpoint1': 62481, b'endpoint8': 62426, b'endpoint2': 62292})


i think it tends toward skewed distributions and isn't good for load balancing like consistent hashing is

that's the distribution you'd expect. kademlia is used when the network is too large or the member nodes too ephemeral to keep track of them in an efficient manner. instead you try to maintain routes to nodes close to you in the network (where close is defined as your id xor their id). your (virtual) network topology looks like a tree with your node at the root. you know routes/ids of only a handful of nodes (k nodes in kademlia) in half the tree. the same goes for each subtree in the other half. you get a distribution where you know a lot about a tiny part of the global tree. each node has a unique view of this global tree

so half the messages you see are bound for the half of the tree you know nothing about but that's ok because you can send it to the few nodes you do know over there and they know a lot about their side and can efficiently deliver your message


MononcQc
May 29, 2007

tef posted:

(this is actually called a logical clock, as it keeps time but nothing in relation to physical time)

Specifically, a Lamport clock. It's the one that tracks causality the least well, in that it may or may not tell you there was a conflict unless the timestamps are exactly equal. Lamport clocks have that kind of implicit LWW in them.

tef posted:

sending is easy: like before we keep a maximum timestamp seen, but we add one to the counter if
our clock is behind, and reset the counter so it doesn't grow forever.

The real reason you grow the clock is that hybrid clocks assume they run on NTP (or any other similar mechanism), which gives you an interval of confidence outside of which you know there's enough difference that your operation was definitely ahead.

So if you have a clock with an accuracy of ±5ms, you will have to increment the lamport clock whenever there's 0..10ms of difference between the previous timestamp and yours. Anything greater and the confidence you have in the accuracy of the clock can be used to make your order well-defined.

So what you're gonna do is, every time you're ahead of the last clock, you optimistically reset the lamport clock. When you're behind, you need the lamport clock to compensate for that interval.

If you receive messages where the physical clock is ahead by more than the interval of confidence, you ignore it and wait for the sync mechanism (NTP) to kick in and stabilize things.

tef posted:

i think this sort of system would allow us to retain causality even with leap seconds :swoon:

It's a weak form of causality though. For example, assume the following time line:

code:
a --x1-----x2-x3--x4--x5--x6--x7------x7---------
      \                              /
b -----x2----x4-------------------x5-------------
         \  /
c ----x2--x3-------------------------------------
The problem being that in cases such as b merging its state into a, a will not detect the issue because the lamport clock is strictly monotonic and doesn't carry that much causal information.

A vector clock or version vector, on the other hand, would let you know that this merge is wrong and should be detected as a conflict.

The causality for the hybrid clock works because the mechanism is strictly a Last-Write-Wins one where you expect that kind of overwrite to happen rather arbitrarily.
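
A quick python sketch of the comparison a version vector gives you (helper names are mine); the 'concurrent' case is exactly what the lamport/hybrid clock flattens away:

code:
def dominates(v1, v2):
    # v1 causally follows v2: >= on every entry, > on at least one
    keys = v1.keys() | v2.keys()
    return (all(v1.get(k, 0) >= v2.get(k, 0) for k in keys)
            and any(v1.get(k, 0) > v2.get(k, 0) for k in keys))

def concurrent(v1, v2):
    # neither side dominates and they differ: a real conflict to surface
    return v1 != v2 and not dominates(v1, v2) and not dominates(v2, v1)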

  • Reply