[erlang-questions] auto-syncing mnesia after a network split

Joel Reymont

unread,
Dec 2, 2008, 12:04:34 PM12/2/08
to erlang-q...@erlang.org
What about keeping a version counter on each mnesia node in the
cluster and adding a version number to every record?

Alternatively, the version counter could be kept per table or a
timestamp could be used instead.

The version number would be bumped on every insert or update and the
node or table with the highest vnum would be chosen as the master
after nodes rejoin.

The goal is to pick the latest record as the true one.
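
Roughly, the bump-on-write part might look like this (a sketch only; the
record and field names are made up for illustration, this is not an
existing mnesia feature):

    %% Hypothetical record: every row carries a version counter.
    -record(account, {id, balance, vsn = 0}).

    %% Bump the version on every insert or update, inside the transaction.
    write_versioned(Id, Balance) ->
        mnesia:transaction(fun() ->
            Vsn = case mnesia:read(account, Id, write) of
                      [#account{vsn = V}] -> V + 1;
                      []                  -> 1
                  end,
            mnesia:write(#account{id = Id, balance = Balance, vsn = Vsn})
        end).

    %% After a rejoin, the replica holding the higher vsn for a key wins.
    pick_winner(R1 = #account{vsn = V1}, #account{vsn = V2}) when V1 >= V2 -> R1;
    pick_winner(_, R2) -> R2.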

What do you think?

Sent from my iPhone
_______________________________________________
erlang-questions mailing list
erlang-q...@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions

Alex

unread,
Dec 2, 2008, 12:18:40 PM12/2/08
to Joel Reymont, erlang-q...@erlang.org
What happens when you have multiple updates to both sides of the split? If you just pick the highest vnum, you lose all the transactions from the other side of the split when it rejoins.

Joel Reymont

unread,
Dec 2, 2008, 3:03:51 PM12/2/08
to Alex, Erlang Questions
Alex,

On Dec 2, 2008, at 5:18 PM, Alex wrote:

> what happens when you have multiple updates to both sides of the
> split? if you just pick the highest vnum, you lose all the
> transactions from the other side of the split when it rejoins.


You can pick up new (inserted) records by doing a diff of primary keys
for each table.

You cannot do anything about deleted records, I think, so you'll just
have to delete those again somehow. You could assume that the table
replica with the latest timestamp is the right one and just delete the
extra records from the other table.
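
As a rough sketch, the primary-key diff could look something like this
(dirty reads for brevity; assumes the other replica is reachable as an
ordinary Erlang node):

    %% Keys present on the remote replica but missing locally, and vice versa.
    key_diff(Tab, RemoteNode) ->
        Local  = ordsets:from_list(mnesia:dirty_all_keys(Tab)),
        Remote = ordsets:from_list(rpc:call(RemoteNode, mnesia,
                                            dirty_all_keys, [Tab])),
        {ordsets:subtract(Remote, Local),   %% inserted there (or deleted here)
         ordsets:subtract(Local, Remote)}.  %% inserted here (or deleted there)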

Imagine a bank account that's distributed across the split nodes,
where a customer deposits money twice and the deposits are split
across the nodes. You'll pick up the latest deposit on one node and
miss the other deposit.

I think you can overcome this programmatically, with a timestamp _and_
a version number. You can have a version table per node with three
columns: table name, vnum and timestamp. The rest of the tables would
have just the vnum in their records.

When updating table T, you will first update the version table by
storing the current time and bumping the vnum for the key T. You will
then store the vnum in the record of table T that you are updating.

You will be able to find the split time by looking at the version
tables and figuring out when the vnums started to diverge. You can
then invoke a merge function that figures out, for example, how to
merge a bunch of bank deposit transactions into a single balance.

You will know the vnum at split time and will only need to consider
the transactions that happened after. Shouldn't be a lot of
transactions for a short split time.
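
A sketch of what that bookkeeping could look like (the record names and
the vnum field are made up for illustration):

    -record(tab_version, {table, vnum = 0, stamp}).

    %% Bump table Tab's vnum, note the wall-clock time, and return the new
    %% vnum so the caller can put it into the record it is about to write.
    %% Must be called inside a mnesia transaction.
    bump_version(Tab) ->
        Vnum = case mnesia:read(tab_version, Tab, write) of
                   [#tab_version{vnum = V}] -> V + 1;
                   []                       -> 1
               end,
        mnesia:write(#tab_version{table = Tab, vnum = Vnum,
                                  stamp = erlang:now()}),
        Vnum.

    %% Example use, with a hypothetical 'deposit' record carrying a vnum field:
    %%
    %%   mnesia:transaction(fun() ->
    %%       V = bump_version(deposit),
    %%       mnesia:write(#deposit{id = Id, amount = Amt, vnum = V})
    %%   end).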

What do you think?

--
http://twitter.com/wagerlabs

David Mercer

unread,
Dec 2, 2008, 3:24:18 PM12/2/08
to Joel Reymont, Alex, Erlang Questions
On Tuesday, December 02, 2008, Joel Reymont wrote:

> Imagine a bank account that's distributed across the split nodes,
> where a customer deposits money a 2 times and the deposits are split
> across the nodes. You'll pick up the latest deposit on one node and
> miss the other deposit.
>
> I think you can overcome this programmatically, with a timestamp _and_
> a version number. You can have a version table per node with three
> columns: table name, vnum and timestamp. The rest of the tables would
> have just the vnum in their records.
>
> When updating table T, you will first update the version table by
> storing the current time and bumping the vnum for the key T. You will
> then store the vnum in the record of table T that you are updating.
>
> You will be able to find the split time by looking at the version
> tables and figuring out when the vnums started to diverge. You can
> then invoke a merge function that figures out, for example, how to
> merge a bunch of bank deposit transactions into a single balance.
>
> You will know the vnum at split time and will only need to consider
> the transactions that happened after. Shouldn't be a lot of
> transactions for a short split time.
>

> What do you think?

How do you handle unsynched clocks on the two nodes?

How about two different transactions that occur at the same time on the two
nodes right after the split?

And if an account with balance x before the split on one node has a balance
of x + d1 after a deposit, while the other node has balance x + d2 after
another deposit, all you have during the merge is the balances x + d1 and x
+ d2, and no way to identify d1 and d2 without knowing x. Isn't this what
databases use transaction logs for, and if so, can we learn anything from
how they handle this situation? Anyone have that expertise?

Cheers,

David

Rick Pettit

unread,
Dec 2, 2008, 3:30:05 PM12/2/08
to Joel Reymont, Erlang Questions

The ideas are interesting, though I pray my bank never adopts such software.

I don't think bank software can continue to allow transactions like
deposit and withdrawal during a network partition--I just don't see how
that can be made to work while maintaining a consistent view of the
various accounts across disconnected nodes (e.g. how can a bank ATM allow
me to withdraw funds if it cannot reach its peer node(s) at my bank to
determine the availability of such funds?).

Most systems I work with implement a recovery procedure similar to what
Ulf has posted in the past on this list. This works in my _special case_
because I am tracking real-time telephony stats used to route calls (vs.
manage bank account information).

Because the systems I am referring to prioritize high availability over 100%
data consistency, this is perfectly ok (and works quite well). With issues
like telecom "glare" I couldn't be 100% accurate all the time anyway.

So, to recover from a partition it is enough to pick any functioning node
as the new "master" and have others restart and/or force load tables from
it. The entire time clients keep pushing new stats into the system, so
everything "converges on reality" in the end following a recovery attempt
anyway.

This system works extremely well, but again I wouldn't dream of using it
to implement ATM software for managing bank accounts.

-Rick

Joel Reymont

unread,
Dec 2, 2008, 3:34:02 PM12/2/08
to dme...@alum.mit.edu, Erlang Questions

On Dec 2, 2008, at 8:24 PM, David Mercer wrote:

> How do you handle unsynched clocks on the two nodes?

I don't know. Should they go out of sync during the split if the split
is short?

> How about two different transactions that occur at the same time on
> the two
> nodes right after the split?

That would be a transaction that

1) happens at the same time, down to a microsecond in Erlang,
2) updates the same table,
3) updates the same primary key of said table,
4) whilst no other transactions have happened, so that the version
number is the same.

I think all of the above is highly unlikely.

> And if an account with balance x before the split on one node has a
> balance
> of x + d1 after a deposit, while the other node has balance x + d2
> after
> another deposit, all you have during the merge is the balances x +
> d1 and x
> + d2, and no way to identify d1 and d2 without knowing x.

You would code defensively, obviously, and would have a separate
ledger for deposits and withdrawals that you could always reconcile to
arrive at the proper balance.
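
For example (a sketch; the ledger table and entry ids are made up, the id
just needs to be unique across nodes):

    -record(ledger, {id, account, amount}).   %% amount < 0 for withdrawals

    deposit(Account, Amount) ->
        mnesia:transaction(fun() ->
            mnesia:write(#ledger{id = {node(), erlang:now()},
                                 account = Account, amount = Amount})
        end).

    %% The balance is never stored as the authoritative value; it is always
    %% derived from the append-only entries, so merging two replicas after a
    %% split is just taking the union of their entries and recomputing.
    balance(Account) ->
        {atomic, Sum} =
            mnesia:transaction(fun() ->
                mnesia:foldl(fun(#ledger{account = A, amount = Amt}, Acc)
                                   when A =:= Account -> Acc + Amt;
                                (#ledger{}, Acc) -> Acc
                             end, 0, ledger)
            end),
        Sum.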

--
http://twitter.com/wagerlabs

Joel Reymont

unread,
Dec 2, 2008, 3:40:13 PM12/2/08
to rpe...@vailsys.com, Erlang Questions
Rick,

On Dec 2, 2008, at 8:30 PM, Rick Pettit wrote:

> (e.g. how can a bank ATM allow
> me to withdraw funds if it cannot reach its peer node(s) at my bank to
> determine the availability of such funds?).

In my scenario a bank ATM would have an internal Mnesia table with the
balance :-). The ATM would clearly be part of a cluster of ATMs,
replicating their transactions and balances to all other ATMs in the
cluster.

> Most systems I work with implement a recovery procedure similar to
> what
> Ulf has posted in the past on this list.

Would you kindly post a link to that procedure in this thread, for
easier reference?

> Because the systems I am referring to require high-availability over
> 100%
> data consistency, this is perfectly ok (and works quite well). With
> issues
> like telecom "glare" I couldn't be 100% accurate all the time anyway.

What's telecom glare?

> So, to recover from a partition it is enough to pick any functioning
> node
> as the new "master" and have others restart and/or force load tables
> from
> it. The entire time clients keep pushing new stats into the system, so
> everything "converges on reality" in the end following a recovery
> attempt
> anyway.


I understand that Mnesia was designed for telco ops but I want to run
my social network on top of it. I did a search before and all the
solutions were along the lines of "I'm dealing with telco stuff or I
can just throw that data out". I don't have such luxury and don't want
to throw Mnesia out in favor of PostgreSQL until I absolutely have to.

Thanks, Joel

--
http://twitter.com/wagerlabs

David Mercer

unread,
Dec 2, 2008, 3:45:18 PM12/2/08
to rpe...@vailsys.com, Joel Reymont, Erlang Questions
On Tuesday, December 02, 2008, Rick Pettit wrote:

> The ideas are interesting, though I pray my bank never adopts such
> software.
>
> I don't think bank software can continue to allow transactions like
> deposit and withdrawal during a network partition--I just don't see how
> that can be made to work while maintaining a consistent view of the
> various accounts across disconnected nodes (e.g. how can a bank ATM allow
> me to withdraw funds if it cannot reach its peer node(s) at my bank to
> determine the availability of such funds?).

I was thinking the opposite. I would presume banks would have to continue
to function through network problems, so these must be solved problems,
backed by good solid theory.

(Now, I doubt banks just record account balances, since they also need
transaction details, so the master balance of an account is not the
'account_balances' table but the sum of all transactions pertaining to that
account. (In practice, I imagine they reconcile the transactions into a
balance summary periodically.) Therefore, all transactions are inserts, and
balances are updated asynchronously.)

> With issues
> like telecom "glare" I couldn't be 100% accurate all the time anyway.

What is telecom "glare"?

Cheers,

David

Joel Reymont

unread,
Dec 2, 2008, 3:49:19 PM12/2/08
to dme...@alum.mit.edu, Erlang Questions

On Dec 2, 2008, at 8:24 PM, David Mercer wrote:

> Isn't this what
> databases use transaction logs for, and if so, can we learn anything
> from
> how they handle this situation? Anyone have that expertise?


Hell, yes! You could add timestamps and version numbers to Mnesia
logs. Then you could autosync for real!

--
http://twitter.com/wagerlabs

Rick Pettit

unread,
Dec 2, 2008, 4:09:27 PM12/2/08
to Joel Reymont, Erlang Questions
On Tue, December 2, 2008 2:40 pm, Joel Reymont wrote:
> Rick,
>
> On Dec 2, 2008, at 8:30 PM, Rick Pettit wrote:
>
>> (e.g. how can a bank ATM allow
>> me to withdraw funds if it cannot reach its peer node(s) at my bank to
>> determine the availability of such funds?).
>
> In my scenario a bank ATM would have an internal Mnesia table with the
> balance :-). The ATM would clearly be part of a cluster of ATM,
> replicating their transactions and balances to all other ATMS in the
> cluster.

Supposing there were 2 ATM machines in my neighborhood, each with a table
containing bank account balances, mine included.

Now suppose the first ATM became disconnected from its cluster, which
included the second ATM. Suppose I visit the first ATM and withdraw all
the funds from my account. The first ATM records this in a transaction log
but cannot replicate the transaction due to the network partition.

So I make my way down the street to the second ATM, which also has a
record of my balance _before_ I made the withdrawal at the first ATM. So I
again withdraw all my funds.

I am in this particular case quite happy, though I suspect my bank might
not be when they work to mend the network partition. And no, they can't
have their, er, my money back :-)

>> Most systems I work with implement a recovery procedure similar to
>> what
>> Ulf has posted in the past on this list.
>
> Would you kindly post a link to that procedure in this thread, for
> easier reference?

No problem:

http://www.erlang.org/pipermail/erlang-questions/2001-January/002484.html
http://www.erlang.org/pipermail/erlang-questions/2006-February/019092.html
http://www.erlang.org/pipermail/erlang-questions/2007-January/024716.html

>> Because the systems I am referring to require high-availability over
>> 100%
>> data consistency, this is perfectly ok (and works quite well). With
>> issues
>> like telecom "glare" I couldn't be 100% accurate all the time anyway.
>
> What's telecom glare?

A quick google search turned up the following:

glare telecom definition

The condition that arises when a telephone line or trunk is seized at both
ends for different reasons, perhaps causing the collision between an
incoming call and an outgoing call, for example. Glare is a phenomenon
associated with loop start signaling used to support single-line
telephones, multi-line telephones, and key telephone systems (KTSs). When
the handset of the telephone is lifted, the electrical loop is completed
and current flows across the circuit. The central office switch detects
that fact and returns dial tone for an outgoing call, or connects an
incoming call, as appropriate. If the user picks up the handset to place
an outgoing call at the same time that the central office switch is
attempting to connect an incoming call, a collision, or glare condition,
occurs. See also ground start, loop, and loop start.

===

This affects my software in that there is the chance of sending a call out
a trunk (during periods of peak call volume) which may actually be in use
by the time the call lands there. The system has a means of detecting and
handling this outside of any mnesia code (call retries, etc).

>> So, to recover from a partition it is enough to pick any functioning
>> node
>> as the new "master" and have others restart and/or force load tables
>> from
>> it. The entire time clients keep pushing new stats into the system, so
>> everything "converges on reality" in the end following a recovery
>> attempt
>> anyway.
>
>
> I understand that Mnesia was designed for telco ops but I want to run
> my social network on top of it. I did a search before and all the
> solutions were along the lines of "I'm dealing with telco stuff or I
> can just throw that data out". I don't have such luxury and don't want
> to throw Mnesia out in favor of PostgreSQL until I absolutely have to.

Agreed.

This is a really tough problem to solve, for sure. If I didn't have the
luxury of potentially losing transactions, my solution would not work.

I'll continue to follow the thread and add anything I think might be
useful if it comes to me.

-Rick

Joel Reymont

unread,
Dec 2, 2008, 4:18:20 PM12/2/08
to dme...@alum.mit.edu, Erlang Questions

On Dec 2, 2008, at 8:24 PM, David Mercer wrote:

> How do you handle unsynched clocks on the two nodes?

> ...


> Isn't this what
> databases use transaction logs for, and if so, can we learn anything
> from
> how they handle this situation? Anyone have that expertise?


I think the only problem to solve here is that of keeping clocks in
sync.

If the clocks are in sync _and_ the Mnesia transaction log is
timestamped, then you can auto-sync by applying transactions in order
once the nodes rejoin.
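
Mnesia does not expose its log for this today, so purely as an
illustration: if each side kept its own list of {Timestamp, WriteFun}
entries recorded since the split, replaying them in timestamp order after
the rejoin could be as simple as:

    %% LogA, LogB :: [{Timestamp, WriteFun}], each already sorted by timestamp.
    replay_merged(LogA, LogB) ->
        Merged = lists:keymerge(1, LogA, LogB),  %% order by the timestamp element
        mnesia:transaction(fun() ->
            lists:foreach(fun({_Stamp, WriteFun}) -> WriteFun() end, Merged)
        end).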

Right?

--
http://twitter.com/wagerlabs

David Mercer

unread,
Dec 2, 2008, 4:30:36 PM12/2/08
to Joel Reymont, dme...@alum.mit.edu, Erlang Questions
On Tuesday, December 02, 2008, Joel Reymont wrote:

> On Dec 2, 2008, at 8:24 PM, David Mercer wrote:
>
> > How do you handle unsynched clocks on the two nodes?
>
> I don't know. Should they go out of sync during the split if the split
> is short?

Aren't clocks always out of synch? I thought it was impossible to perfectly
synch clocks due to signal transmission times, at least over a standard
network. Even if the clocks were precisely in-synch, same problem if the
same transaction is received by the two nodes a microsecond apart: they'll
have different timestamps. I wonder if adding some sort of GUID would solve
the problem... Maybe a client-generated ID.

Another thought is that two transactions can be received by different nodes
in different order, resulting in different vnums for the same transactions
on different nodes.

Cheers,

David.

Florian Zumbiehl

unread,
Dec 2, 2008, 4:42:48 PM12/2/08
to erlang-q...@erlang.org
Hi,

> I think the only problem to solve here is that of keeping clocks in
> sync.

What does that mean in this context?

Florian

Felix Hamilton

unread,
Dec 2, 2008, 4:47:19 PM12/2/08
to dme...@alum.mit.edu, Erlang Questions
I have used NTP (the Network Time Protocol) quite effectively to sync
timestamps in widely distributed systems with a fairly high degree of
accuracy.

It may be that using an NTP server and clients to provide time syncing
for Mnesia may be a valuable addition - NTP resolution can get quite
high (read way more than you need) if you have redundant servers.

Is there an NTP project for Erlang as yet? If not, that might be a
project that would provide quite a few benefits to the community ...

/Felix

Joel Reymont

unread,
Dec 2, 2008, 4:52:27 PM12/2/08
to Felix Hamilton, Erlang Questions

On Dec 2, 2008, at 9:47 PM, Felix Hamilton wrote:

> Is there an NTP project for Erlang as yet? If not, that might be a
> project that would provide quite a few benefits to the community ...


I thought that so long as the system clock is in sync, the Erlang
clock would be in sync. Isn't it a matter of running NTP side by side
with Erlang?

--
http://twitter.com/wagerlabs

Joel Reymont

unread,
Dec 2, 2008, 4:53:39 PM12/2/08
to Florian Zumbiehl, erlang-q...@erlang.org

On Dec 2, 2008, at 9:42 PM, Florian Zumbiehl wrote:

> Hi,
>
>> I think the only problem to solve here is that of keeping clocks in
>> sync.
>
> What does that mean in this context?


I meant to say that if the clocks are synchronized and Mnesia
transaction logs are timestamped, then it's possible to apply
transactions in proper order on each node once the cluster rejoins.

Is this a better explanation?

--
http://twitter.com/wagerlabs

Felix Hamilton

unread,
Dec 2, 2008, 4:56:25 PM12/2/08
to Joel Reymont, Erlang Questions
Certainly that is what I have done previously. But in this context it
might make sense to implement an NTP server / client within Erlang so
as to minimize interaction with the host node operating system. As
soon as your deployed application starts requiring operating system
customization, it becomes significantly more painful. It would be nice
to be able to use NTP with an Erlang app without having to deal with
anything at the OS level.

And my suspicion is that Erlang may be quite suited for a robust
distributed NTP implementation in any case ...

/Felix

Joel Reymont

unread,
Dec 2, 2008, 4:59:20 PM12/2/08
to Felix Hamilton, Erlang Questions

On Dec 2, 2008, at 9:56 PM, Felix Hamilton wrote:

> It would nice to
> be able to use NTP with an Erlang app without having to deal with
> anything at the OS level.


I don't understand. My Mac runs the NTP daemon just fine, keeping my
clock synchronized with europe.apple.com or something like that.

I don't see the benefit of implementing NTP in Erlang since the two
are completely orthogonal. Am I missing something?

Mihai Balea

unread,
Dec 2, 2008, 5:13:24 PM12/2/08
to Erlang Questions

On Dec 2, 2008, at 3:24 PM, David Mercer wrote:
> How do you handle unsynched clocks on the two nodes?
>
> How about two different transactions that occur at the same time on
> the two
> nodes right after the split?
>
> And if an account with balance x before the split on one node has a
> balance
> of x + d1 after a deposit, while the other node has balance x + d2
> after
> another deposit, all you have during the merge is the balances x +
> d1 and x
> + d2, and no way to identify d1 and d2 without knowing x. Isn't
> this what
> databases use transaction logs for, and if so, can we learn anything
> from
> how they handle this situation? Anyone have that expertise?


How about Vector Clocks (http://en.wikipedia.org/wiki/Vector_clocks)?
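
For reference, a minimal sketch of the idea (each node increments its own
entry on every local write; one version "descends from" another only if it
is >= in every entry, otherwise the two writes are concurrent and need
application-level merging):

    %% A vector clock represented as a list of {Node, Counter} pairs.
    increment(Node, VC) ->
        case lists:keyfind(Node, 1, VC) of
            {Node, C} -> lists:keystore(Node, 1, VC, {Node, C + 1});
            false     -> [{Node, 1} | VC]
        end.

    %% true if VC1 has seen everything VC2 has seen (>= entry by entry).
    descends(VC1, VC2) ->
        lists:all(fun({Node, C2}) ->
                      case lists:keyfind(Node, 1, VC1) of
                          {Node, C1} -> C1 >= C2;
                          false      -> false
                      end
                  end, VC2).

    %% Neither descends from the other: concurrent updates, merge by hand.
    concurrent(VC1, VC2) ->
        not descends(VC1, VC2) andalso not descends(VC2, VC1).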

Mihai

Felix Hamilton

unread,
Dec 2, 2008, 5:22:12 PM12/2/08
to Joel Reymont, Erlang Questions
Well, it depends. If you have complete control over the OS used by
each node in your distributed system and can ensure that they are all
using the same NTP hosts and have sync, sure, relying on an existing
NTP daemon and using the system time to generate your timestamps works
fine. Basically as accurate as you need it to be.

If not, and timestamp accuracy is important, you want to make sure
that you keep an 'internal' application clock that is separate from
(though of course related to) the OS clock on each of your nodes.

Otherwise, you rely on the folks that have OS level control of your
nodes to a) have an NTP daemon running and properly configured, b) not
make manual changes to their system time that override NTP sync, and
c) maintain their NTP config correctly. I have found that these tend
to be unwarranted assumptions. ;-)

An Erlang NTP implementation might be one way to maintain an internal
application clock on a widely distributed system of nodes that avoids
reliance on system time. This is the first time I have considered the
prospect of embedding an NTP implementation in an Erlang application,
however, so take it for what it's worth.

I will say that maintaining NTP on large distributed systems can be
quite a pain even when you control things at the OS level.

Vector clocks do look interesting, but it is not clear that
partial ordering would be enough in some cases.

/Felix

mats cronqvist

unread,
Dec 2, 2008, 5:33:26 PM12/2/08
to Felix Hamilton, Erlang Questions
"Felix Hamilton" <fham...@gmail.com> writes:

> I have used NTP (the Network Time Protocol) quite effectively to sync
> timestamps in widely distributed systems with a fairly high degree of
> accuracy.

i personally would not bet a dirty sock on ordering events by time
stamps (in a distributed system). and for once Leslie Lamport seems
to agree with me.

http://en.wikipedia.org/wiki/Lamport_timestamps
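
The scheme itself is tiny: each node keeps a counter, bumps it on every
local event, and takes max(own, received) + 1 when a message arrives,
which gives an ordering consistent with causality but says nothing about
wall-clock time. A sketch:

    %% Local event on this node: just advance the counter.
    tick(Clock) ->
        Clock + 1.

    %% A message arrives stamped with the sender's clock.
    on_receive(LocalClock, MsgClock) when MsgClock > LocalClock ->
        MsgClock + 1;
    on_receive(LocalClock, _MsgClock) ->
        LocalClock + 1.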

Felix Hamilton

unread,
Dec 2, 2008, 5:49:25 PM12/2/08
to mats cronqvist, Erlang Questions
Actually in the distributed system to which I am referring we used a
combination of 'real' time based on NTP and a type of logical
operation ordering, then discarded and demanded repetition of those
operations which we could not reconcile based on our two ordering
systems. And this was a very large scale distributed archival system
with *no* margin for commit errors.

So in some sense it depends on what you are doing. But I suggest that
having some type of relatively consistent 'real' time is still useful.

/Felix

Joel Reymont

unread,
Dec 2, 2008, 5:53:50 PM12/2/08
to mats cronqvist, Erlang Questions

On Dec 2, 2008, at 10:33 PM, mats cronqvist wrote:

> i personally would not bet a dirty sock on ordering events by time
> stamps (in a distributed system). and for once Leslie Lamport seems
> to agreewith me.
>
> http://en.wikipedia.org/wiki/Lamport_timestamps

Would implementing vector clocks or lamport timestamps for Mnesia
solve the problem of ordering events?

Has anyone attempted this previously?

--
http://twitter.com/wagerlabs

Holger Hoffstätte

unread,
Dec 3, 2008, 12:53:17 AM12/3/08
to erlang-q...@erlang.org
Joel Reymont wrote:
> On Dec 2, 2008, at 10:33 PM, mats cronqvist wrote:
>
>> i personally would not bet a dirty sock on ordering events by time
>> stamps (in a distributed system). and for once Leslie Lamport seems
>> to agreewith me.
>>
>> http://en.wikipedia.org/wiki/Lamport_timestamps

+1

Reliance on physical clocks has long been known to be unreliable at best. HW
clock skew and DST (daylight saving time) back/forward warps do the rest.

> Would implementing vector clocks or lamport timestamps for Mnesia
> solve the problem of ordering events?
>
> Has anyone attempted this previously?

http://code.google.com/p/distributerl/

-h

mats cronqvist

unread,
Dec 8, 2008, 7:47:32 AM12/8/08
to Claes Wikström, Erlang Questions
Claes Wikström <kla...@hyber.org> writes:

> mats cronqvist wrote:
>> "Felix Hamilton" <fham...@gmail.com> writes:
>>
>>> I have used NTP (the Network Time Protocol) quite effectively to sync
>>> timestamps in widely distributed systems with a fairly high degree of
>>> accuracy.
>>
>> i personally would not bet a dirty sock on ordering events by time
>> stamps (in a distributed system). and for once Leslie Lamport seems
>> to agreewith me.
>>
>

> I think I would. In retrospect I think the worst part of mnesia (apart
> from the sucky dets module I once wrote) is how we chose to deal with
> partitioned networks.
>
> We should have chosen to rely on the system clock. For example if
> all nodes make a persistent note of when they lost contact with other
> nodes. Once they reunite they could compare timestamps, and also compare
> timestamps on the last committed transaction.
> In a vast majority of cases it would be possible to automatically chose
> a winner.

and what would happen in the small minority of cases?


> Assuming the clocks on all involved hosts are sufficiently synchronized
> by NTP that is.
>
> Furthermore - since this only affects recovery after partitioning - this
> can still be added.
>
>
> /klacke

Chandru

unread,
Dec 8, 2008, 7:52:13 AM12/8/08
to mats cronqvist, Erlang Questions
2008/12/8 mats cronqvist <ma...@kreditor.se>:

> Claes Wikström <kla...@hyber.org> writes:
>
>> mats cronqvist wrote:
>>> "Felix Hamilton" <fham...@gmail.com> writes:
>>>
>>>> I have used NTP (the Network Time Protocol) quite effectively to sync
>>>> timestamps in widely distributed systems with a fairly high degree of
>>>> accuracy.
>>>
>>> i personally would not bet a dirty sock on ordering events by time
>>> stamps (in a distributed system). and for once Leslie Lamport seems
>>> to agreewith me.
>>>
>>
>> I think I would. In retrospect I think the worst part of mnesia (apart
>> from the sucky dets module I once wrote) is how we chose to deal with
>> partitioned networks.
>>
>> We should have chosen to rely on the system clock. For example if
>> all nodes make a persistent note of when they lost contact with other
>> nodes. Once they reunite they could compare timestamps, and also compare
>> timestamps on the last committed transaction.
>> In a vast majority of cases it would be possible to automatically chose
>> a winner.
>
> and what would happen in the small minority of cases?

I presume it would be whatever happens today -- nothing.

Chandru

Claes Wikström

unread,
Dec 7, 2008, 5:57:29 PM12/7/08
to mats cronqvist, Erlang Questions
mats cronqvist wrote:
> "Felix Hamilton" <fham...@gmail.com> writes:
>
>> I have used NTP (the Network Time Protocol) quite effectively to sync
>> timestamps in widely distributed systems with a fairly high degree of
>> accuracy.
>
> i personally would not bet a dirty sock on ordering events by time
> stamps (in a distributed system). and for once Leslie Lamport seems
> to agreewith me.
>

I think I would. In retrospect I think the worst part of mnesia (apart
from the sucky dets module I once wrote) is how we chose to deal with
partitioned networks.

We should have chosen to rely on the system clock. For example, if
all nodes make a persistent note of when they lost contact with other
nodes, then once they reunite they could compare those timestamps, and
also compare timestamps on the last committed transaction.
In the vast majority of cases it would be possible to automatically
choose a winner.

Assuming the clocks on all involved hosts are sufficiently synchronized
by NTP that is.

Furthermore - since this only affects recovery after partitioning - this
can still be added.
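
A sketch of the bookkeeping this would need, using the real mnesia system
events (assumes a disc_copies table for the peer_status record; the
last_commit_stamp/0 call and the my_recovery module are hypothetical,
something the application would maintain alongside its own writes):

    -record(peer_status, {node, lost_at}).

    %% A small process that persistently records when contact with a peer
    %% was lost.  mnesia:subscribe(system) delivers {mnesia_down, Node}
    %% events to the subscribing process.
    start() ->
        mnesia:subscribe(system),
        loop().

    loop() ->
        receive
            {mnesia_system_event, {mnesia_down, Node}} ->
                mnesia:dirty_write(#peer_status{node = Node,
                                                lost_at = erlang:now()}),
                loop();
            _Other ->
                loop()
        end.

    %% On reunion: let the node whose last committed transaction is newest win.
    pick_master(Nodes) ->
        Stamps = [{rpc:call(N, my_recovery, last_commit_stamp, []), N}
                  || N <- Nodes],
        {_Latest, Winner} = lists:max(Stamps),
        Winner.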


/klacke

mats cronqvist

unread,
Dec 8, 2008, 9:47:30 AM12/8/08
to Chandru, Erlang Questions
Chandru <chandrashekha...@gmail.com> writes:

> 2008/12/8 mats cronqvist <ma...@kreditor.se>:
>> Claes Wikström <kla...@hyber.org> writes:
>>
>>> mats cronqvist wrote:
>>>> "Felix Hamilton" <fham...@gmail.com> writes:
>>>>
>>>>> I have used NTP (the Network Time Protocol) quite effectively to sync
>>>>> timestamps in widely distributed systems with a fairly high degree of
>>>>> accuracy.
>>>>
>>>> i personally would not bet a dirty sock on ordering events by time
>>>> stamps (in a distributed system). and for once Leslie Lamport seems
>>>> to agreewith me.
>>>>
>>>
>>> I think I would. In retrospect I think the worst part of mnesia (apart
>>> from the sucky dets module I once wrote) is how we chose to deal with
>>> partitioned networks.
>>>
>>> We should have chosen to rely on the system clock. For example if
>>> all nodes make a persistent note of when they lost contact with other
>>> nodes. Once they reunite they could compare timestamps, and also compare
>>> timestamps on the last committed transaction.
>>> In a vast majority of cases it would be possible to automatically chose
>>> a winner.
>>
>> and what would happen in the small minority of cases?
>
> I presume it would be whatever happens today -- nothing.

that would be nice.

but i was kind of assuming that everyone would reach the same
conclusion as I did: we pick the wrong winner.

Joel Reymont

unread,
Dec 8, 2008, 9:55:48 AM12/8/08
to Claes Wikström, Erlang Questions, mats cronqvist

On Dec 7, 2008, at 10:57 PM, Claes Wikström wrote:

> Assuming the clocks on all involved hosts are sufficiently
> synchronized
> by NTP that is.


Does this matter if all the nodes are aware of the clocks of all the
other nodes?

--
http://wagerlabs.com

Joel Reymont

unread,
Dec 8, 2008, 10:27:38 AM12/8/08
to mats cronqvist, Erlang Questions

On Dec 8, 2008, at 3:23 PM, mats cronqvist wrote:

>> Does this matter if all the nodes are aware of the clocks of all the
>> other nodes?
>

> isn't that the same thing as saying that network partitioning is
> not a problem as long as the network is not partitioned?


I mean to say that if I am node A and I know what the clock is on
nodes B, C and D then it doesn't matter if our clocks aren't perfectly
synchronized by NTP. I will know the skew and once node D splits off
and rejoins, I'll be able to compare the timestamp on my transactions
vs those on node D using that skew.

Claes Wikstrom

unread,
Dec 8, 2008, 10:30:20 AM12/8/08
to mats cronqvist, Erlang Questions
mats cronqvist wrote:

> and what would happen in the small minority of cases?
>

If the clocks are sufficiently close, the minority would be
so small that it can be ignored.

What will your HA computer system do if suddenly both HA components
in your HA system fail? Or all three?

Claes Wikstrom

unread,
Dec 8, 2008, 10:31:57 AM12/8/08
to Joel Reymont, mats cronqvist, Erlang Questions
Joel Reymont wrote:
>
>
> I mean to say that if I am node A and I know what the clock is on nodes
> B, C and D then it doesn't matter if our clocks aren't perfectly
> synchronized by NTP. I will know the skew and once node D splits off and
> rejoins, I'll be able to compare the timestamp on my transactions vs
> those on node D using that skew.
>

No,

/klacke

Joel Reymont

unread,
Dec 8, 2008, 10:36:04 AM12/8/08
to Claes Wikstrom, mats cronqvist, Erlang Questions

On Dec 8, 2008, at 3:31 PM, Claes Wikstrom wrote:

> Joel Reymont wrote:
>>
>> I mean to say that if I am node A and I know what the clock is on
>> nodes B, C and D then it doesn't matter if our clocks aren't
>> perfectly synchronized by NTP. I will know the skew and once node D
>> splits off and rejoins, I'll be able to compare the timestamp on my
>> transactions vs those on node D using that skew.
>
> No,


Why?

--
http://wagerlabs.com

Jim McCoy

unread,
Dec 8, 2008, 12:19:42 PM12/8/08
to Joel Reymont, Erlang Questions
Vector clocks don't really work that way; you can only update the
vectors at the point where messages pass between nodes. Using your
example, if D splits off, reboots and has its clock reset slightly,
records a few local transactions using this new clock, and then
rejoins, you will need to figure out how to order the local
transactions against the state of the rest of the nodes.

Consistency, availability, partition-tolerance: pick two.

Similarly, Claes is incorrect in assuming that you could use the system
clock for anything other than printing a nice little display on the
screen to remind the user that they have a meeting scheduled for some
point in the near future. There is _a lot_ of literature out there on
distributed transaction fault-tolerance and it gets pretty complex in
a hurry.

To maintain the current properties of mnesia I believe the only option
would be to add some complexity on the mechanics of the join mechanism
(e.g. a quorum system like paxos to decide membership and agreement
among the nodes as to the minimum quorum below which a partitioned
subset is read-only) and a bit of work to change a minority subset
into read-only mode when a partition is discovered.
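
The quorum test itself is the easy part; a sketch (the hard part, wiring
this into mnesia's join/recovery machinery and actually forcing the
minority read-only, is glossed over here):

    %% AllNodes is the full agreed cluster membership; we can only see the
    %% nodes in our own partition.
    have_quorum(AllNodes) ->
        Visible = [N || N <- AllNodes, lists:member(N, [node() | nodes()])],
        2 * length(Visible) > length(AllNodes).

    %% Writes would be funnelled through something like this:
    maybe_write(AllNodes, Record) ->
        case have_quorum(AllNodes) of
            true  -> mnesia:transaction(fun() -> mnesia:write(Record) end);
            false -> {aborted, read_only_minority}
        end.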

Joel Reymont

unread,
Dec 8, 2008, 12:37:44 PM12/8/08
to Jim McCoy, Erlang Questions

On Dec 8, 2008, at 5:19 PM, Jim McCoy wrote:

> To maintain the current properties of mnesia I believe the only option
> would be to add some complexity on the mechanics of the join mechanism
> (e.g. a quorum system like paxos to decide membership and agreement
> among the nodes as to the minimum quorum below which a partitioned
> subset is read-only) and a bit of work to change a minority subset
> into read-only mode when a partition is discovered.


I'm willing to take a shot at the work, with proper instruction and
guidance.

I really really want a fix for this issue. I don't see how Mnesia can
be used as the backend for an internet site otherwise. It's currently
impossible to run an ecommerce or other internet site on top of Mnesia
because of the split/rejoin issue. Please show me where I am mistaken.

My understanding is that this issue can be dealt with in the telecom
industry due to the transient nature of the data. I'm guessing that
really important data always goes into Oracle, PostgreSQL, etc.

Yes, I can use PostgreSQL or MySQL instead of Mnesia but then I have
to deal with replication issues on that end. Does anyone know how
other databases deal with network splits in a multi-master scenario?

Thanks, Joel

Andrew Stone

unread,
Dec 8, 2008, 2:00:14 PM12/8/08
to Joel Reymont, Jim McCoy, Erlang Questions
Another question to ask is whether you actually need a relational DB at all for your internet site. CouchDB seems to have solved the replication issue quite elegantly.

-Andrew

Dave Smith

unread,
Dec 8, 2008, 4:46:16 PM12/8/08
to Joel Reymont, Erlang Questions
On Mon, Dec 8, 2008 at 10:37 AM, Joel Reymont <joe...@gmail.com> wrote:
>
> On Dec 8, 2008, at 5:19 PM, Jim McCoy wrote:
>
>> To maintain the current properties of mnesia I believe the only option
>> would be to add some complexity on the mechanics of the join mechanism
>> (e.g. a quorum system like paxos to decide membership and agreement
>> among the nodes as to the minimum quorum below which a partitioned
>> subset is read-only) and a bit of work to change a minority subset
>> into read-only mode when a partition is discovered.
>
>
> I'm willing to take a shot at the work, with proper instruction and
> guidance.
>
> I really really want a fix for this issue. I don't see how Mnesia can
> be used as the backend for an internet site otherwise. It's currently
> impossible to run an ecommerce or other internet site on top of Mnesia
> because of the split/rejoin issue. Please show me where I am mistaken.
>
> My understanding is that this issue can be dealt with in the telecom
> industry due to the transient nature of the data. I'm guessing that
> really important data always goes into Oracle, PostgreSQL, etc.
>
> Yes, I can use PostgreSQL or MySQL instead of Mnesia but then I have
> to deal with replication issues on that end. Does anyone know how
> other databases deal with network splits in a multi-master scenario?

I _strongly_ suggest reading some of the research literature on this
subject, particularly the work done by Amazon with their Dynamo
system. That's one of the most practical papers I've read on these
topics. With a little elbow grease, you could probably even implement
such a system based on that paper... :)

D.

Claes Wikström

unread,
Dec 8, 2008, 4:07:52 PM12/8/08
to Jim McCoy, Erlang Questions
Jim McCoy wrote:

>
> Consistence, availability, partition-tolerance: pick two.
>
> Simiarly, Claes is incorrect in assuming that you could use the system
> clock for anything other than printing a nice little display on the
> screen to remind the user that they have a meeting scheduled for some
> point in the near future. There is _a lot_ of literature out there on
> distributed transaction fault-tolerance and it gets pretty complex in
> a hurry.


I'm not wrong if, instead of choosing consistency and availability,
I choose availability and partition-tolerance.

I.e. sacrifice consistency. NTP works sufficiently well IMHO.
If system clocks are synchronized, how big are the chances of
picking the wrong new master? Sufficiently slim, I'd say.

>
> To maintain the current properties of mnesia I believe the only option


I never said that - I suggested the properties were wrongly chosen.
I/We should have sacrificed consistency - 10 years ago.


> would be to add some complexity on the mechanics of the join mechanism
> (e.g. a quorum system like paxos to decide membership and agreement


A lot of the mnesia clusters only have 2 nodes - thus making
quorums a no-goer. Typical HA telecoms chassis systems have exactly
two management blades.


/klacke

Jim McCoy

unread,
Dec 8, 2008, 6:39:52 PM12/8/08
to Claes Wikström, Erlang Questions
On Mon, Dec 8, 2008 at 1:07 PM, Claes Wikström <kla...@hyber.org> wrote:
> Jim McCoy wrote:
>
>>
>> Consistence, availability, partition-tolerance: pick two.
>>[...]

>
> I'm not wrong if instead of choosing Consistence, availability
> I choose availability, partition-tolerance

My bad. I had probably missed a point and sort of assumed we were
talking about maintaining the current set of properties. I am a big
fan of eventual consistency, but it seems to me that this pushes a lot
of the responsibility for conflict resolution back to the DB client --
not necessarily bad, but something that would probably have
made mnesia even less accessible than it currently is to the casual
programmer.

>> would be to add some complexity on the mechanics of the join mechanism
>> (e.g. a quorum system like paxos to decide membership and agreement
>
>
> A lot of the mnesia clusters only have 2 nodes - thus making
> quorums a no-goer. typical HA telcoms chassis systems have exactly
> two management blades.

Two is just a special-case of N. If you wanted a practical solution
for 2 nodes or any even number of nodes you could elect a leader at
each membership change (systems like paxos require the existence of
such a leader even if they do nothing more than initiate the voting
protocol) and declare that a partition with half the nodes that
included the previous leader could continue writing.
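
In other words, something like this (a sketch; leader election itself is
left out):

    %% Visible is the set of nodes we can currently see (ourselves included),
    %% AllNodes the full membership, Leader the node elected before the split.
    may_write(Visible, AllNodes, Leader) ->
        N = length(AllNodes),
        V = length(Visible),
        if
            2 * V > N   -> true;                          %% strict majority
            2 * V =:= N -> lists:member(Leader, Visible); %% even split: leader's side wins
            true        -> false
        end.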

jim

mats cronqvist

unread,
Dec 9, 2008, 3:59:07 PM12/9/08
to Claes Wikström, Erlang Questions
Claes Wikström <kla...@hyber.org> writes:

> Jim McCoy wrote:
>
>> would be to add some complexity on the mechanics of the join mechanism
>> (e.g. a quorum system like paxos to decide membership and agreement
>
>
> A lot of the mnesia clusters only have 2 nodes - thus making
> quorums a no-goer. typical HA telcoms chassis systems have exactly
> two management blades.

i find it quite humorous that the telecoms industry is paying
thousands of people to work on solving this problem (how to reliably
rejoin a partitioned network consisting of two nodes); a problem
that has been proven (over 30 years ago) to have no solution.

30,000 man-years... that's a chunk of change(*). certainly a lot more
expensive than adding a third node.

it's less humorous (at least to me) that i personally spent a few
of those man-years, when all I had to do was google on "leslie
lamport."

mats

(*) yes, i made that figure up. but i do believe it's the right
magnitude.
