These notes are just a rough cut and if ever I get time will try to make them more legible, just thought of putting them to help others.
Cassandra Notes
---------------
Before reading this doc please please please go through
http://wiki.apache.org/cassandra/Operations
As I was told in the forums, read this page till your eyes bleed.
###########################
Selection of partitioner
###########################
From the wiki "When the RandomPartitioner is used, Tokens are integers from 0 to 2**127. Keys are converted to this range by MD5 hashing for comparison with Tokens. (Thus, keys are always convertible to Tokens, but the reverse is not always true.) "
For Order Preserving Partioner the keys themself serve as the tokens and therefore it is much difficult to manage the tokens for OPP.
There are some great links to read more on this:
http://spyced.blogspot.com/2009/05/consistent-hashing-vs-order-preserving.html
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
###########################
Selection of initial tokens
###########################
Initially generate tokens using the python function, and then add new tokens as per that function. But you will need to manually move the old tokens and then do a node repair on each of the machines.Basically follow this staregy :
Generate tokens.. Add node with the new token to the end.. Move other nodes by one to the new token position(i.e. using the python function generate new range of tokens, add the new token to the last machine, and then move the old machines to the new token value ) counter clock wise.. Then repair all the nodes
In case equal number of machines are being added as the original then all the new tokens can be half of the old tokens and can be interleaved easily.
From the wiki: Using a strong hash function means RandomPartitioner keys will, on average, be evenly spread across the Token space, but you can still have imbalances if your Tokens do not divide up the range evenly, so you should specify InitialToken to your first nodes as i * (2**127 / N) for i = 0 .. N-1.
With order preserving partitioners, your key distribution will be application-dependent. You should still take your best guess at specifying initial tokens (guided by sampling actual data, if possible), but you will be more dependent on active load balancing (see below) and/or adding new nodes to hot spots.
*******Once data is placed on the cluster, the partitioner may not be changed without wiping and starting over.
##################################
Selection of replication strategy
##################################
RackUnaware Strategy
----
In case you are using RackUnaware Strategy then your replicas are placed in the next N-1 machines in the ring starting from the node where the first write happens(the first write node is called first replica). The replication depends on Replication Factor and Consistency Level chosen.
In the begining I had a basic confusion regarding this, not sure if you have the same, but I will write about it anyways:
So I have a ring of says 8 nodes, with Node 1 up, 2 down and so on.. So I had basically 2 understanding problems
1)My RF is 3 that means my replication should happen in 3 and I was writing with Quorum CL so basically as soon as write is done in 2 nodes the call should return, but all my writes were failing. I was thinking that since I had 4 nodes up and 4 down and as 4 nodes up > RF 3, so my writes should pass. Later I talked about this at forums and I figured out that my writes were failing because of the following reason, say my first replica for a key is node 2, which is down so the write will try to happen on next 3 nodes and the 3rd write will be a hinted write. But since node 4 is also down the CL of Quorum will never be attained and hence the writes will fail. Please note that this will happen only for certain writes for which the primary replica node is even, if my primary replica would have been odd then my write should have passed.
Basically for each write there are nodes in which the given number of replicas will be written, till CL number of nodes out of the nodes where the write is supposed to happen are up we are good.
2) I forgot the 2nd question :( will update once I remember.
So now RackAware Strategy, I haven't actually tried it out much but here is waht I have learned about it so far. Also it sounds better to me compared to RackUnaware strategy if you are considering more reliable setup with your data spread across multiple DC's.
Here is what wiki says about RackAware Strategy:
RackAwareStrategy:
-----
replica 2 is placed in the first node along the ring the belongs in another data center than the first; the remaining N-2 replicas, if any, are placed on the first nodes along the ring in the same rack as the first
I was confused about the RackAware first and used to think that this will not be best if you want to tolerate your entire DC going down. The reason that this strategy will for sure will make a copy in another rack and if the entire DC is down that copy will fail resulting in failure of writing data. Anyways I discussed this at IRC and I was told that this won't be the case, say again my CL is Quorum and RF 3, and all the nodes in DC 1 are up and all nodes in DC 2 are down. If the primary replica of a write is supposed to be on DC1 then again since 2 nodes are up(in DC 1) and only 1 write is supposed to happen in DC 2 my writes will still pass and a hint will be written to one node in DC 1 and will be passed back to DC 2 once it comes back.
I am still confused how things will work with RackAware strategy when no of DC > 2, need to research more on this.
I think there is something called DataCenter strategy, the corresponding class in cassandra source code I think is DatacenterShardStrategy(will need to confirm, but it sounds s.th like this). This strategy is much better and you can actually specify the replication per DC. So this acts in combination with your RF, i.e. say you ave a RF =10 and you want 2 replicas in DC 1, 5 in DC 2 and 3in DC 3; then this is the startegy you want. But again I need to read more on this.
Also there is a file of interest called rack.properties for DCAwareStrategy and RackAware Strategy, using which you can tell cassandra which nodes are in which rack
http://svn.apache.org/repos/asf/cassandra/tags/cassandra-0.6.1/contrib/property_snitch/
This is what wiki has to say about it:
Besides datacenters, you can also tell Cassandra which nodes are in the same rack within a datacenter. Cassandra will use this to route both reads and data movement for Range changes to the nearest replicas. This is configured by a user-pluggable EndpointSnitch class in the configuration file.
EndpointSnitch is related to, but distinct from, replication strategy itself: RackAwareStrategy needs a properly configured Snitch to place replicas correctly, but even absent a Strategy that cares about datacenters, the rest of Cassandra will still be location-sensitive.
I also think that you can attain a reliable config in RackUnaware by interleaving nodes by selecting tokens carefully, but this is much painful, though you have more control. I need more info to decide on which is better and which one to use, so any help will be appreciated.
There is another thing in wiki, but I am not sure if I clearly understand it:
Note that with RackAwareStrategy, succeeding nodes along the ring should alternate data centers to avoid hot spots. For instance, if you have nodes A, B, C, and D in increasing Token order, and instead of alternating you place A and B in DC1, and C and D in DC2, then nodes C and A will have disproportionately more data on them because they will be the replica destination for every Token range in the other data center.
I think what is being said is that ABCD are in increasing token range i.e. token A < token B < token C < token D. If that is the case and A and B are in DC 1 and C and D are in DC2 then say we write to either A or B ie any node in DC 1, and a write is supposed to happen in DC 2 because our strategy was RackAware, then always node C will be picked. Similary if a write is supposed to happen in DC2 then always node A from DC 1 will get chosen.(I am hoping that I am correct)
So say now you choose to lay out the nodes as A(DC1), B(DC2), C(DC1), D(DC2) and again token A < token B < token C < token D. Then a write to A in DC1 will get B from node 2, a write to C from DC1 will get D from DC 2 and so on.
Again wiki talks about reducing RF and changing Replication startegy but I need to read more on it. Plus I need to read more on what various options in nodetool mean, like cleanup, repair etc. I know somewhat but I need to confident about thinngs before I write about them.
I think some work is going on about initial tokens, read s.th on the IRC but not sure.
#############
Bootstrapping
#############
Its the process of adding a new node to a ring. If the autobootstrap parameter is set as true then:
1) If initial token is specified then the node will take that initial token and will place itslef accordingly in the ring. If the initial token is not specified then this new node will find the most loaded node and will try to share the load the of this node by splitting the range into half and taking half of this node's range.
2)Once placed in the ring the process of copying the keys will start. I need to read more on this, as in how you can figure out from the logs that how long the process is going to take etc, steps in the process and figruing out where exactly are you in the process from the logs. Anyways the wiki provides useful info on it.
Also note that in case you want to manually control the bootstrapping, then keep the autobootstrap as false. Place the node and then call nodetool repair, that should fix things for you.
So again.. if you are manually adding nodes specifying tokens yourself, then add the node, move the other nodes, call repair one by one on each of the nodes.
Please do read the important things to note in the bootstrap section of the wiki:
1) You should wait long enough for all the nodes in your cluster to become aware of the bootstrapping node via gossip before starting another bootstrap. The new node will log "Bootstrapping" when this is safe, 2 minutes after starting. (90s to make sure it has accurate load information, and 30s waiting for other nodes to start sending it inserts happening in its to-be-assumed part of the token ring.)
So basically once the new nodes logs start showing bootstrapping then only you should bootstrap another node.
2)Relating to point 1, one can only boostrap N nodes at a time with automatic token picking, where N is the size of the existing cluster. If you need to more than double the size of your cluster, you have to wait for the first N nodes to finish until your cluster is size 2N before bootstrapping more nodes. So if your current cluster is 5 nodes and you want add 7 nodes, bootstrap 5 and let those finish before boostrapping the last two.
Not sure what exactly this point means.. as anyways point 1 says that wait for every node to finish before bootstrapping a new node.
3)As a safety measure, Cassandra does not automatically remove data from nodes that "lose" part of their Token Range to a newly added node. Run nodetool cleanup on the source node(s) (neighboring nodes that shared the same subrange) when you are satisfied the new node is up and working. If you do not do this the old data will still be counted against the load on that node and future bootstrap attempts at choosing a location will be thrown off.
This is interesting, so cassandra basically will not delete the load on the previous nodes that were having keys which are now taken up by the new node. Also explains the purpose of cleanup argument of nodetool.
4)When bootstrapping a new node, existing nodes have to divide the key space before beginning replication. This can take awhile, so be patient.
Again not sure patient with what, I think it means that the Bootstraping in log will take some time to appear. I need to read more on the steps that occur when a node is bootstrapped.
5)During bootstrap, a node will drop the Thrift port and will not be accessible from nodetool.
Think this is for the node that is being bootstrapped. So even though the logs will start ticking it still won't mean that you'll be able to use the nodetool for this node, i.e. until the bootstrapping thing appears on the logs.
6)Bootstrap can take many hours when a lot of data is involved. See Streaming for how to monitor progress.
This is the most imp. Need to read more on this.
Cassandra is smart enough to transfer data from the nearest source node(s), if your EndpointSnitch is configured correctly. So, the new node doesn't need to be in the same datacenter as the primary replica for the Range it is bootstrapping into, as long as another replica is in the datacenter with the new one.
Ok about this I think this is what it means. So say you are adding a new node in DC2 and your primary replica is in DC1. Then if you say have a node that is replica of the older node in DC2 then the new old will get data trasferred from node in DC2 rather than DC1, basically cassandra is smart enough to figure out which is the closest node to transfer the data. I have a question though that even though I saw
Another thing to note is that there is another snitch called PropertyFileEndPointSnitch, you can use the same to specify the rack and DC of your nodes in the property file and cassandra will read it from here. My question is that what happens when I change the property file and add a new node, I think cassandra should read the property file everytime it is supposed to bootstrap a node but i need to read the source code more to figure that out.
Bootstrap progress can be monitored using nodetool with the streams argument.
During bootstrap nodetool may report that the new node is not receiving nor sending any streams, this is because the sending node will copy out locally the data they will send to the receiving one, which can be seen in the sending node through the the "AntiCompacting... AntiCompacted" log messages.
I need to try out the first line and read more on the second one before I can say anything about it.
##############
Removing Nodes
##############
A live node can be removed using nodetool decommision and a dead node can be removed using nodetoll removetoken (need to read the syntax more carefully as in the host name that should be provided and the token that should be given. I guess hostname can be anything and the token is the token that needs to be removed, but need to cross check).
As the wiki says:
If decommission is used, the data will stream from the decommissioned node. If removetoken is used, the data will stream from the remaining replicas.
This means that on decomission the nodetool will take some time to finish (need to figure out more on what exactly does each line in the log will mean during this time.. will be nice if a progress bar kind of thing is displayed so that one is able to figure out that things are not stuck).
While the removetoken will finish quickly as the work will happen in the background(Just not sure if again you will have to do a nodetool repair once you remove nodes.. need to ask this in the forums)
*No data is removed automatically from the node being decommissioned, so if you want to put the node back into service at a different token on the ring, it should be removed manually. (How do you do this, by removing data and storage directory.. need to look into this)
This is the wiki link that talks about monitoring progress of such operations but will need to try it and read more of this
http://wiki.apache.org/cassandra/Streaming
I have already talked about moving nodes so won't talk more on that.
###############
Load Balancing
###############
Well every 1 at forums seem to say that this is crappy and doesn't work the way it is intended to, so I am skipping this one. Please let me know in case you differ with me on this.
The wiki says s.th like:
The status of move and balancing operations can be monitored using nodetool with the streams argument.
But I haven't done this and will need to read more on this.
######################################
Repairing missing or inconsistent data
######################################
As per the wiki the repair happens in 2 ways:
1)Read Repair: So every time a request is made for a read, the primary replica tries to read from all the machines that are storing the replica of the given key, but as soon as nodes with given CL return the response, the response is sent back to the client. But in the background the remaining replies are checked for consistency and are repaired.This is similar to state machine in dynamo.Read the dynamo paper to get a better idea.
2)Anti Entropy: The wiki is pretty self explanatory on this:
when nodetool repair is run, Cassandra computes a Merkle tree of the data on that node, and compares it with the versions on other replicas, to catch any out of sync data that hasn't been read recently. This is intended to be run infrequently (e.g., weekly) since computing the Merkle tree is relatively expensive in disk i/o and CPU, since it scans ALL the data on the machine (but it is is very network efficient).
Running nodetool repair: Like all nodetool operations, repair is non-blocking; it sends the command to the given node, but does not wait for the repair to actually finish. You can tell that repair is finished when (a) there are no active or pending tasks in the CompactionManager, and after that when (b) there are no active or pending tasks on o.a.c.concurrent.AE-SERVICE-STAGE, or o.a.c.service.StreamingService.
Repair should be run against one machine at a time. (This limitation will be fixed in 0.7.)
(Need to read more on this)
#################
Handling Failures
#################
-- for temporary node down
The wiki says s.th like :
If a node goes down and comes back up, the ordinary repair mechanisms will be adequate to deal with any inconsistent data.
What I think it means is that when your node comes up you should run a nodetool repair as this will ensure that any updates that have been missed by this nodes gets pushed back to it. Also as noted above please don't run nodetool repair too frequently.
-- for permanent node down
In additon the wiki says:
Remember though that if a node misses updates and is not repaired for longer than your configured GCGraceSeconds (default: 10 days), it could have missed remove operations permanently. Unless your application performs no removes, you should wipe its data directory, re-bootstrap it, and removetoken its old entry in the ring (see below).
What it means is if your node has been down for > 10 days (ie GCGraceSeconds) then it is better to remove the storage data directory, call remove token for this node and rebootstrap it. This step is identical to as if you are adding a new node.
The wiki suggests 2 approaches to deal with an ENTIRELY DEAD node(please note that this node is not coming back ever and this is not the same as node going down and then coming back up again)
1) Recommeneded approaach:
Get a replacement node with the new IP(not sure why can't this be the same IP as old) and autobootstrap as true. I think you should keep the initial token as blank so that the node gets place in the ring automatically(I am not sure if i can provide a token diff from the failed node's token, think even that should do).This will place the node in the ring appropriately and bootstrap will start. While bootstrap is running no reads will happen from this node(Not sure but what about writes, whether writes will also be kept on hold).Once this process is finished on the replacement node, run nodetool removetoken once, supplying the token of the dead node(this will remove the dead node from the ring, please note that as described earlier in this case you have to manually run repair, but since our autobootstrap was true the repair has already happened), and nodetool cleanup on each node(cleanup removes any old HintedHandoffs stored for the dead node).
2)Alternative approach:
Bring up a replacement node with the same IP (not sure why the same IP, what difference will that make)and token as the old(same token means that the node will take the position of the old nodes), and run nodetool repair(this will take time to finish). Until the repair process is complete, clients reading only from this node may get no data back. Using a higher ConsistencyLevel on reads will avoid this.
I am not sure why is the first recommended approach. I think both are equally good but will need to read more for this.
###############
Backing up data
###############
Wiki is pretty clear on how to backup data. I have just one doubt and I think its pretty stupid. I think the snapshot tool takes snapshot per node and not of the entire cluster, I think I am correct but I am not sure.
Import/Export
-------------
Need to read and experiment more with the json2sstable and sstable2json
####
FAQ (Need to update these and add like tonns more here)
####
Q) What about location of my data and storage directory, should they be in a separate disk and why?
Q)What exactly does CL and RF mean and how are they dependent on each other
Q)I am using RF of 3 and CL QUORUM, and have multiple nodes up, why are my writes still failing? If I have any 2 nodes up, shouldn't all my reads and writes should pass?
Q)What does autobootstrap mean and what should its value be under various conditions?s
Q)What do tokens actually mean and how are tokens and range related?
A)From the wiki : Each Cassandra server [node] is assigned a unique Token that determines what keys it is the first replica for. If you sort all nodes' Tokens, the Range of keys each is responsible for is (PreviousToken, MyToken], that is, from the previous token (exclusive) to the node's token (inclusive). The machine with the lowest Token gets both all keys less than that token, and all keys greater than the largest Token; this is called a "wrapping Range."
Q)What happens if I specify a different replication strategy, replication factor, other such params like type of partinoner per node in the conf/storage.xml?
A)I think all these are communicated to nodes using gossip, but I am not sure on this. So even though we might by accidently specify different values the one picked in the begining will be used.. Will need to read cassandra source code more and discuss in forums for this.
Q)I need a situation where my entire datacenter can go down and I should still be able to write data: What kind of RF should I use, kind of replication strategy, how should i lay down my nodes, minimum no. of nodes that I should use.
Q)Can I figure out on what all nodes my data got stored?
A)Not sure, though I didn't find any code in any of the clients to do the same, it should be pretty easy. In case of OPP, since your key maps to your token directly this should be pretty straighforward.
No comments:
Post a Comment