User Tools

Site Tools


scylladb

ScyllaDB

Return to ScyllaDB in Action by Bo Ingram

1 Introducing ScyllaDB This chapter covers ScyllaDB and what it is ScyllaDB versus other databases How ScyllaDB leverages being a distributed system ScyllaDB is a distributed NoSQL database designed to be a more-performant rewrite of Apache Cassandra. Although it rhymes with Godzilla and has an adorable creature as a mascot, it’s designed to not be monstrous to operate.

Compared with relational databases, ScyllaDB brings two big weapons to the Great Database Battle Royale — scalability and fault tolerance. ScyllaDB runs as a distributed system, running multiple nodes to store and serve data. This distribution simplifies scalability; to add additional capacity, operators only need to add more nodes. By providing users the capability to tune how many nodes respond to a query, it also provides fault tolerance, because the system can handle the loss of a configurable amount of nodes before being unable to serve requests, as seen in figure 1.1.

Figure 1.1 ScyllaDB is a distributed database that provides scalability and fault tolerance. CH01 F08 Ingram This distributed design impacts everything around it — it affects how you design applications, how you query data, how you monitor the database, and how you recover the system during an outage. We’ll explore all of these areas, showing how ScyllaDB can be the practical distributed database for any application. Let’s dive in!

livebook features: highlight, annotate, and bookmark

 
Select a piece of text and click the appropriate icon to annotate, bookmark, or highlight (you can also use keyboard shortcuts - h to highlight, b to bookmark, n to create a note).

You can automatically highlight by performing the text selection while keeping the alt/ key pressed. 1.1 ScyllaDB, a different database ScyllaDB is a database — it says it in its name! Users give it data; the database gives it back when asked. This very basic and oversimplified interface isn’t too dissimilar from popular relational databases like PostgreSQL and MySQL. ScyllaDB, however, is not a relational database, eschewing joins and relational data modeling to provide a different set of benefits. To illustrate these, let’s take a look at a fictitious example.

1.1.1 Hypothetical databases Let’s imagine you’ve just moved to a new town, and as you go to new restaurants, you want to remember what you ate so that you can order it or avoid it next time. You could write it down in a journal or save it in the notes app on your phone, but you hear about a new business model where people remember information you send them. Your friend Robert has just started a similar venture: Robert’s Rememberings.

Robert’s Rememberings Robert’s business (figure 1.2) is straightforward: you can text Robert’s phone number, and he will remember whatever information you send him. He’ll also retrieve information for you, so you won’t need to remember everything you’ve eaten in your new town. That’s Robert’s job.

Figure 1.2 Robert’s Rememberings has a seemingly simple plan. CH01 F02 Ingram The plan works swimmingly at first, but issues begin to appear. Once, you text him, and he doesn’t respond. He apologizes later and says he had a doctor’s appointment. Not unreasonable, you want your friend to be healthy. Another time, you text him about a new meal, and it takes him several minutes to reply instead of his usual instant response. He says that business is booming, and he’s been inundated with requests — response time has suffered. He reassures you and says not to worry, he has a plan.

Figure 1.3 Robert adds a friend to his system to solve problems, but it introduces complications CH01 F03 Ingram Robert has hired a friend to help him out. He sends you the new updated rules for his system. If you only want to ask a question, you can text his friend, Rosa. All updates are still sent to Robert; he will send everything you save to her, so she’ll have an up-to-date copy. At first, you slip up a few times and still ask Robert questions, but it seems to work well. No longer is Robert overwhelmed, and Rosa’s responses are prompt.

One day, you realize that when you asked Rosa a question, she texted back an old review that you had previously overwritten. You message Robert about this discrepancy, worried that your review of the much-improved tacos at Main Street Tacos is lost forever. Robert tells you there was an issue within the system where Rosa hadn’t been receiving messages from Robert but was still able to get requests from customers. Your request hasn’t been lost, and they’re reconciling to get back in sync.

You wanted to be able to answer one question: is the food here good or not? Now, you’re worrying about contacting multiple people depending on whether you’re reading a review or writing a review, whether data is in sync, and whether your friend’s system can scale to satisfy all of their users' requests. What happens if Robert can’t handle people only saving their information? When you begin brainstorming intravenous energy drink solutions, you realize that it’s time to consider other options.

ABC Data: A different approach Your research leads you to another business – ABC Data. They tell you that their system is a little different: they have three people – Alice, Bob, and Charlotte – and any of them can save information or answer questions. They communicate with each other to ensure each of them has the latest data, as shown in figure 1.4. You’re curious what happens if one of them is unavailable, and they say they provide a cool feature: because there are multiple of them, they coordinate within themselves to provide redundancy for your data and increased availability. If Charlotte is unavailable, Alice and Bob will receive the request and answer. If Charlotte returns later, Alice and Bob will get Charlotte back up to speed on the latest changes.

Figure 1.4 ABC Data’s approach is designed to meet the scaling challenges that Robert encountered. CH01 F04 Ingram This setup is impressive, but because each request can lead to additional requests, you’re worried this system might get overwhelmed even easier than Robert’s. This, they tell you, is the beauty of their system. They take the data set and create multiple copies of it. They then divide this redundant data amongst themselves. If they need to expand, they only need to add additional people, who take over some of the existing slices of data. When a hypothetical fourth person, Diego, joins, one customer’s data might be owned by Alice, Charlotte, and Diego, whereas Bob, Charlotte, and Diego might own other data.

Because they allow you to choose how many people should respond internally for a successful request, ABC Data gives you control over availability and correctness. If you want to always have the most up-to-date data, you can require all three holders to respond. If you want to prioritize getting an answer, even if it isn’t the most recent one, you can require only one holder to respond. You can balance these properties by requiring two holders to respond — you can tolerate the loss of one, but you can ensure that a majority of them have seen the most up-to-date data, so you should get the most recent information.

Figure 1.5 ABC Data’s approach gives us control over availability and correctness. CH01 F05 Ingram You’ve learned about two imaginary databases here — one that seems straightforward but introduces complexity as requests grow, and another with a more complex implementation that attempts to handle the drawbacks of the first system. Before beginning to contemplate the awkwardness of telling a friend you’re leaving his business for a competitor, let’s snap back to reality and translate these hypothetical databases to the real world.

1.1.2 Real-world databases Robert’s database is a metaphorical relational database, such as PostgreSQL or MySQL. They’re relatively straightforward to run, fit a multitude of use cases, and are quite performant, and their relational data model has been used in practice for more than 50 years. Very often, a relational database is a safe and strong option. Accordingly, developers tend to default toward these systems. But, as demonstrated, they also have their drawbacks. Availability is often all-or-nothing. Even if you run with a read replica, which in Robert’s database would be his friend, Rosa, you would potentially only be able to do reads if you had lost your primary instance. Scalability can also be tricky – a server has a maximum amount of compute resources and memory. Once you hit that, you’re out of room to grow. It is through these drawbacks that ScyllaDB differentiates itself.

The ABC Data system is ScyllaDB. Like ABC Data, ScyllaDB is a distributed database that replicates data across its nodes to provide both scalability and fault tolerance. Scaling is straightforward – you add more nodes. This elasticity in node count extends to queries. ScyllaDB lets you decide how many replicas are required to respond for a successful query, giving your application room to handle the loss of a server.

1.1.3 Unpacking the definition ScyllaDB is commonly described as a distributed wide-column NoSQL database and is a rewrite of the popular Cassandra database, which, as you might imagine shares similar properties. This definition demonstrates how Scylla differentiates itself from other databases. It aims to be both more scalable than a relational database and more performant than Cassandra. This positioning is typified by ScyllaDB’s description as a NoSQL database. PostgreSQL and MySQL, as their names suggest, are classified as SQL databases. They use SQL (Structured Query Language) to query a relational database schema. NoSQL has become a catch-all term to describe databases that do not conform to this model. A broad array of databases fall under this model – from our ScyllaDB to document stores like MongoDB to “not-only SQL” databases like CockroachDB.

WHAT’S A WIDE-COLUMN DATABASE? ScyllaDB and Cassandra are often described as wide-column databases. In this type of database, data can be thought of as a multidimensional map or a key-key value store, where tables have columns, but aren’t required to have values for every column. These tables, or column families, as they’re also called in Cassandra, are stored together on disk. This approach contrasts with a columnar database, where a given column is stored together. The way I try to remember it is that columns can be arbitrarily wide, therefore, it’s a wide-column store.

NoSQL databases tend to emphasize scalability and fault tolerance over total correctness and accuracy of the data within the database, a property called consistency. This might sound like a ridiculous tradeoff, but you’ll get to examine it closely throughout the book. In practice, Scylla works to be eventually consistent, converging towards correctness over time. To achieve its desired scalability and fault tolerance, ScyllaDB runs multiple instances of itself within a cluster.

Figure 1.6 ScyllaDB is a distributed database that provides scalability and fault tolerance. CH01 F01 Ingram There is no overarching all-powerful leader; each node is just as important as any other node. Not only are there multiple nodes in the system, but data is distributed across all these nodes. ScyllaDB isn’t a distributed database because distributed systems are cool; it’s distributed because it was designed to make a more reliable and scalable database. If you distribute data across all nodes in a cluster, what happens if you lose one node? ScyllaDB stores multiple copies of the data, and by letting you choose how many replicas are required to respond to a query, picking any number fewer than the maximum lets the database tolerate node failure. This distribution also helps with scalability. If one node is taking a large amount of traffic, the rest of the cluster won’t be impacted. Requests that don’t hit our one heavily-trafficked node won’t be affected by any overburdening of another node. This fault tolerance is critical to ScyllaDB’s design. Instead of putting all of your eggs in one basket, you can have many eggs in many baskets. If you lose a basket, you still have lots of eggs!

livebook features: discuss

 
Ask a question, share an example, or respond to another reader. Start a thread by selecting any piece of text and clicking the discussion icon. Get ScyllaDB in Action 1.2 ScyllaDB, a distributed database ScyllaDB runs multiple nodes, making it a distributed system. By spreading its data across its deployment, it uses that to achieve its desired availability and consistency, which, when combined, differentiates the database from other systems.

1.2.1 Distributing data All distributed systems have a bar to meet: they must deliver enough value to overcome the introduced complexity. ScyllaDB, designed to be a distributed system, achieves its scalability and fault tolerance through this design.

When users write data to ScyllaDB, they start by contacting any node. Many systems follow a leader-follower topology, where one node is designated as a leader, giving it special responsibilities within the system. If the leader dies, a new leader is elected, and the system continues operating.

ScyllaDB does not follow this model; each node is as special as any other. Without a centralized coordinator deciding who stores what, each node must know where any given piece of data should be stored. Internally, Scylla can map a given key to the node that owns it, forwarding requests to the appropriate nodes.

To provide fault tolerance, ScyllaDB not only distributes data but replicates it across multiple nodes. The database stores a row in multiple locations – the amount depends upon the configured replication factor. In a perfect world, each node acknowledges every request instantly every time, but what happens if it doesn’t? To help with unexpected trouble, the database provides tunable consistency.

How you query data is dependent on what degree of consistency you’re looking to get. ScyllaDB is an eventually consistent database, and you perhaps will see inconsistent data as the system converges toward consistency. Developers must keep this eventual consistency in mind when working with the database. To facilitate the various needs of consistency, ScyllaDB provides a variety of consistency levels for queries, including those listed in table 1.1.

Table 1.1 Sample of consistency level options, assuming a three-node cluster Consistency level Description Number required to succeed Failures tolerated ALL

Requires all nodes to succeed

3

0

QUORUM

Requires a majority of replicas to succeed

2

1

ONE

Requires a single replica to succeed

1

2

With a consistency level of ALL, you can require that all replicas for a key acknowledge a query, but this setting harms availability. You can no longer tolerate the loss of a node. With a consistency level of ONE, you require a single replica for a key to acknowledge a query, but this greatly increases our chances of inconsistent results.

Luckily, some options aren’t as extreme. ScyllaDB lets you tune consistency via the concept of quorums. A quorum is when a group has a majority of members. Legislative bodies, such as the US Senate, do not operate when the number of members present is below the quorum threshold. When applied to ScyllaDB, you can achieve intermediate forms of consistency.

With a QUORUM consistency level, the database requires a majority of replicas for a key to acknowledge a query. If you have three replicas, two of them must accept every read and every write. If you lose one node, you can still rely on the other two to keep serving traffic. You additionally guarantee that a majority of our nodes get every update, making it much more challenging to read inconsistent data if you follow the same quorum consistency.

Once you have picked our consistency level, you know how many replicas we need to execute a successful query. A client sends a request to a node, which serves as the coordinator for that query. Our coordinator node reaches out to the replicas for the given key, including itself if it is a replica. Those replicas return results to the coordinator, and the coordinator evaluates them according to our consistency. If it finds the result satisfies the consistency requirements, it returns the result to the caller.

The CAP theorem classifies distributed systems by saying that they cannot provide all three of these properties – consistency, availability, and network partition tolerance, as seen in figure 1.7. Consistency is a measure of correctness. For the CAP theorem’s purposes, we define consistency as every request reading the most recent write. Availability is whether the system can serve requests, and network partition tolerance is the ability to handle a disconnected node.

Figure 1.7 The CAP theorem says a database can only provide two of three properties — consistency, availability, and partition tolerance. ScyllaDB is classified as an AP system. CH01 F09 Ingram A distributed system must have partition tolerance, so it ultimately chooses between consistency and availability. If a system is consistent, it must be impossible to read inconsistent data. To achieve consistency, it must ensure that all nodes receive all necessary copies of data. This requirement means it cannot tolerate the loss of a node, therefore losing availability.

ScyllaDB is classified as an AP system. When encountered with a network partition, it chooses to sacrifice consistency and maintain availability. You can see this in its design – ScyllaDB repeatedly makes choices, via quorums and eventual consistency, to keep the system up and running in exchange for potentially weaker consistency. By emphasizing availability, you see one of ScyllaDB’s differentiators against other databases.

1.2.2 ScyllaDB versus relational databases I’ve introduced ScyllaDB by describing its features in comparison with relational databases, but we’ll examine in closer detail here the differences. Relational databases such as PostgreSQL and MySQL are the standard for data storage in software applications, and they’re almost always the default choice for a new developer looking to build an application. Relational databases are a very strong option for many use cases, but that doesn’t mean they’re a strong option for every use case.

ScyllaDB is a distributed NoSQL wide-column store. By distributing data across a cluster, ScyllaDB unlocks better availability when nodes go awry than a single-node all-or-nothing relational database. PostgreSQL and MySQL can run in a distributed mode, but that is either powered through extensions or newer storage engines and not the primary native mode of the database. This distribution is native to ScyllaDB and the bedrock of its design.

By running as a distributed system, ScyllaDB empowers horizontal scalability. Many relational databases are only vertically scalable – you can only add more resources by running it on a bigger server. With horizontal scalability, you can add additional nodes to a system to increase its capacity. ScyllaDB supports this expansion; administrators can add more nodes, and the cluster will rebalance itself, offloading data to the new cluster member.

ScyllaDB does not provide a relational database’s ACID (atomicity, consistency, isolation, and durability) guarantees, instead opting for a softer model called BASE (Basic Availability, Soft-state, and Eventual consistency), where the database has basic availability and is eventually consistent. This decision leads to faster writes than a relational database, which has to validate the consistency of the database after every write, whereas ScyllaDB only needs to save the write since it doesn’t promise that degree of correctness. The tradeoff, though, is that developers need to consider ScyllaDB’s weaker consistency.

ACID VS BASE ACID provides a set of guarantees for transactions, one or more statements applied to a database. When developers refer to a transaction in a database, they are almost always referring to ACID transactions. ACID provides:

Atomicity - All statements in the transaction succeed together or fail together. Consistency - The database is in a valid state after every transaction Isolation - A transaction cannot interfere with a concurrently executing transaction Durability - Any change in a transaction will be persisted. I like to think of ACID as how you would expect a database to run. You want a consistent database, and you’d like your writes to be durable.

ScyllaDB provides a softer set of guarantees called BASE. Softer isn’t bad though; it’s these guarantees that let ScyllaDB more easily provide scalability and fault tolerance. BASE provides:

Basic availability - The database is basically available. Some portions of the database might be down, but overall, the system is available. Soft state - Every node in the database doesn’t have to be consistent at every moment in time. Eventually consistent - The database will be consistent at some moment in time. While I remain convinced that the designer of BASE named the property “soft state” to make the acronym work, it does accurately describe ScyllaDB’s benefits. It can tolerate the loss of a node and remain available, but to do this, it has to weaken consistency. Nevertheless, it should strive and converge toward consistency. We’ll discuss in further chapters these properties, how they affect ScyllaDB and your usage of it, and how the system’s architecture provides them.

Ultimately, ScyllaDB versus relational databases is a foundational and philosophical decision. They operate so differently and provide such varying guarantees to their clients that picking one over the other has large effects on an application. If you’re looking for availability and scalability in your database, ScyllaDB is a strong option.

1.2.3 ScyllaDB versus Cassandra ScyllaDB is a rewrite of Apache Cassandra. It is frequently described as “a more performant Cassandra” or “Cassandra but in C++”. ScyllaDB is designed to be compatible with Cassandra: it uses a compatible API, query language, on-disk storage format, and hash ring architecture. Like Cassandra, but better, is ScyllaDB’s goal; it makes some improvements to accomplish this.

The choice of language in the rewrite immediately unlocks better performance. Cassandra is written in Java, which leverages a garbage collector to perform memory management. Because objects get loaded into memory, at some point, they will need to be removed. Java’s garbage collection algorithms handle this removal, but it comes at the cost of compute. Time spent garbage collecting is time Cassandra can’t spend executing queries. If garbage collection reaches a certain threshold, the Java Virtual Machine will pause all execution for a brief time while it cleans up memory, referred to as a “stop the world” pause. Even if it’s just for milliseconds, that pause can be painful to clients. Although Java exposes many configuration knobs and improves the garbage collector with each release, it’s a tax that all Java-based applications have to pay.

ScyllaDB avoids this tax because it is implemented in C++, which requires manual memory management. By having full control of memory allocation and cleanup, ScyllaDB doesn’t need to let a garbage collector perform this functionality on an application-wide scale. It avoids “stop the world” pauses and can dedicate its compute time to executing queries.

ScyllaDB’s biggest architectural difference is its shard-per-core architecture (figure 1.8). Both Cassandra and ScyllaDB shard a data set across its various nodes via placement in a hash ring, which you’ll learn more about in chapter 3. ScyllaDB takes this further by leveraging the Seastar framework (https://seastar.io/) to shard data within a node, splitting it up per CPU-core and giving each shard its own CPU, memory, and network bandwidth allocation.

Figure 1.8 ScyllaDB shards data not only within the cluster, but also within each instance. CH01 F07 Ingram This sharding further limits the blast radius due to hot traffic patterns – the damage is limited to just that shard on that node. Cassandra does not follow this paradigm, however, and limits the sharding to only per node. If a data partition gets a large amount of requests, it can overwhelm the node, leading to cluster-wide struggles.

Performance justifies the rewrite. Both in benchmarks (https://thenewstack.io/benchmarking-apache-cassandra-40-nodes-vs-scylladb-4-nodes/) and in the wild (https://discord.com/blog/how-discord-stores-trillions-of-messages), ScyllaDB is faster, more consistent, and requires fewer servers to operate than Cassandra.

1.2.4 ScyllaDB versus Amazon Aurora / Google Cloud Spanner / Google AlloyDB I’ve lumped a few similar systems together here – Amazon Aurora, Google Cloud Spanner, and Google AlloyDB. They can be generally described as scalable cloud-hosted databases. They aim to take a relational data model and provide greater scalability than out-of-the-box PostgreSQL or MySQL. This effort accentuates a need in the market for scalable databases, showing the value of ScyllaDB.

These systems have two related drawbacks – vendor lock-in and cost. As cloud providers provide these databases, they run in only that specific vendor’s cloud environment. We can’t run Google Cloud Spanner in Amazon Web Services. If your application is heavily dependent on one of these systems, there can be a high engineering cost if you decide to switch cloud providers. If you’re not using that provider (or any provider), these options aren’t even on the table for you. And by using a cloud provider, companies pay money for these services. Operating and maintaining a database is challenging (which is partly why you’re reading this book), and although these cloud vendors provide solutions to make it potentially simpler, that can get quite expensive for clients. Of course, operating a database yourself can also be costly.

ScyllaDB, however, can be run anywhere. Companies are running it on-premises or within various cloud providers. It provides a scalable and fault-tolerance database that you can take to any hosting solution.

1.2.5 ScyllaDB versus document stores I’m not talking about Google Drive here, but instead, databases that store unstructured documents by a given key, such as MongoDB. These systems support querying these documents, allowing users to access arbitrary document fields without defining a database schema.

ScyllaDB eschews this flexibility to provide (relatively) predictable performance. By requiring users to define their schema up front, it clarifies to both users and the system how data is distributed across the cluster. By forcing users to query data in patterns that match this distribution, ScyllaDB can limit the number of nodes involved in a query, preventing surprisingly expensive queries.

Document stores, on the other hand, tend to bias toward initial ease of use. In MongoDB, no schema definition is required, but users still need to consider the design of their data to query it effectively. MongoDB runs as a distributed system, but unlike ScyllaDB, it doesn’t out-of-the-box attempt to minimize inefficient queries that hit more than the expected number of nodes, leading to potential performance surprises.

In the CAP theorem, MongoDB is a CP (consistent and partition-tolerant) system. Writes require the presence of a primary node and are blocked until a new primary is elected in the event of a network partition. ScyllaDB, however, prioritizes availability, keeping the system up and relying on its tunable consistency.

1.2.6 ScyllaDB versus distributed relational databases One interesting development for databases over the past few years has been the growth of distributed transactional databases. These systems — such as CockroachDB, TiDB, and YugabyteDB — focus on improving the availability of a traditional relational database like PostgreSQL while still offering strong consistency. In the CAP theorem’s classifications, they’re CP systems; they prefer consistency over availability. By emphasizing correctness, they need a quorum of nodes to respond to successfully complete a query; if quorum is lost, the database loses availability. ScyllaDB, however, provides tunable consistency to dodge this problem. By allowing weaker consistency levels, such as ONE, Scylla can handle a greater loss of availability to preserve functionality.

In a relational database, writes are the computationally intensive operation. The database needs to validate its consistency on every write. Scylla, on the other hand, skips this verification, opting for speed and simplicity when writing data. The tradeoff, however, is that reads in Scylla will be slower than writes, as you need to gather data from multiple nodes, that have data stored in different places on disk. You’ll learn a lot more about this behavior in later chapters, but the big takeaway is that writes in Scylla will be faster than in these systems.

1.2.7 When to prefer other databases I’ve described ScyllaDB’s benefits relative to other databases, but sometimes, I admit, it’s not the best tool for the job. I can’t describe it as a unique database because of the Cassandra rewrite approach, but it does trade operational and design complexity for more graceful failure modes. Choosing Scylla requires you to design applications differently and adds more complexity than something like a cloud-hosted PostgreSQL server. If you don’t need ScyllaDB’s horizontal scalability and nuanced availability, the increased operational overhead might not be worth it. If your application is small, makes few requests, and isn’t expected to grow over time, ScyllaDB might be overkill. A database backing comments on your blog probably doesn’t need a ScyllaDB cluster, unless, like many of us, you wanting that as an excuse to try it out.

Operating and maintaining a ScyllaDB cluster isn’t a hands-off exercise. If you can’t dedicate time to operating and maintaining a cluster, that is another signal that a managed offering might be preferable for you. Teams must choose wisely about how they spend their time and their money on what they do; choosing a less hands-on is a valid decision.

One thing you’ll see about Scylla in upcoming chapters is that, with data modeling, it can be inflexible to change your database’s design. Adding new query patterns that don’t fit in with your initial design can be challenging. While there are ways to work around it, other databases can potentially give you more flexibility when you’re in the prototyping and learning stage of building features for an application.

Lastly, some use cases might prefer a stronger transactional model like ACID. If you’re working with financial data, you probably want to use a relational database so that you can have isolation in your operations. One popular example to demonstrate the importance of ACID transactions is concurrent access to bank accounts. Without isolation, you run the risk of concurrent operations causing a mismatch between how much money the database thinks you have and how much money you actually have. Accountants traditionally prefer accuracy in these areas, so you might prefer a relational database when working with something that needs stronger database transactions. While scaling a relational database has its challenges, they might be preferable to take on than surrendering ACID’s guarantees.

livebook features: settings

 
Update your profile, view your dashboard, tweak the text size, or turn on dark mode. 1.3 ScyllaDB, a practical database We’ve talked about what exactly ScyllaDB is, and its differences with other systems, but how does it run in practice? In this section, we’ll look at it as a real, deployed system and show that ScyllaDB isn’t just a distributed database, but a practical one too.

1.3.1 Fault tolerance If you’re on-call for a database, you want it to handle failures gracefully to avoid the dreaded 3 AM alert and to have a night of undisturbed rest. ScyllaDB is designed to be a fault-tolerant database to give you a good night’s sleep. Through its tunable consistency model, it is capable of surviving spontaneous downtime without any effect on queries. By leveraging quorum consistency, not every node needs to be up and running to serve traffic. A server can crash and if the underlying hardware self-recovers, the ScyllaDB process can start, rejoin the cluster, receive any data it missed, and return to serving traffic, with you asleep and none the wiser.

That recovery process simplifies complicated operations. When a ScyllaDB node perishes, other nodes in the cluster perform a process called hinted handoff. When nodes need to replicate data to the downed node, they locally save a “hint” instead. When the downed node recovers, nodes with hints resend them to the recovered node, allowing it to replay missed writes and become consistent with the rest of the cluster.

If for some unfortunate reason, a node is unable to recover, we don’t need to execute a complicated operation like restoring from backups. We can provision a new node and tell the cluster that our new node is a replacement for the old node. Because ScyllaDB replicates data across the cluster, our new node takes the place of the old node, and other replicas stream data to it until the node has caught up and joined the cluster, serving traffic.

1.3.2 Scalability Whether you’re hitting the limits of your existing data store, or handling growth is important to you, scalability is one of the prime reasons people choose ScyllaDB. Thankfully, in ScyllaDB, scalability is straightforward – you add more nodes! Even with terabytes of data, adding a node should take no more than a few hours.

Similar to when you replace a node in the cluster, adding a node involves joining the cluster, signing up for what slices of data the node will own, and then receiving data from other replicas until the node is caught up. While this bootstrapping process is limited to one node at a time, it is a simple operation to execute.

1.3.3 Used in production Software developers tend to be conservative in their choice of data store, and with good reason: a database is the base for all of your data. A database might meet all of the requirements, but it’s scary to be the only one running something. As a field, software development moves forward as more people use things, discovering bugs and finding pain points and solutions to them. A big question you’ll frequently hear when considering a less-ubiquitous solution: “Does anyone actually use this thing?”

Yes! ScyllaDB is a database used in real-life production systems and is growing in popularity. Several companies use it today:

Discord stores their trillions of messages inside ScyllaDB Epic Games uses ScyllaDB as a cache for binary assets Comcast stores DVR data for their X1 cable platform in ScyllaDB They’ve built systems that leverage Scylla because they want a scalable and fault-tolerant database. Each of these use cases involves highly distributed reads serving important functionality to their systems.

As a reader of this book, you might be considering building a similar system using ScyllaDB. My goal is to get to that point by the end of this book – by learning how to structure schemas, query the database, and operate it, you will gain the knowledge to go off and build your own system. I’ve spent a lot of time introducing ScyllaDB – let’s dive in and query the thing!

livebook features: highlight, annotate, and bookmark

 
Select a piece of text and click the appropriate icon to annotate, bookmark, or highlight (you can also use keyboard shortcuts - h to highlight, b to bookmark, n to create a note).

You can automatically highlight by performing the text selection while keeping the alt/ key pressed. 1.4 Summary ScyllaDB is a distributed NoSQL database compatible with Apache Cassandra’s API. ScyllaDB distributes data across multiple nodes to provide horizontal scalability and fault tolerance. ScyllaDB provides tunable consistency to allow users to adjust their desired data consistency and fault tolerance thresholds. sitemap add to cart for $47.99 $33.59 (pdf + ePub + kindle + liveBook )

Prev
Next Chapter cover

ScyllaDB in Action Welcome 1 Introducing ScyllaDB 1.1 ScyllaDB, a different database 1.1.1 Hypothetical databases 1.1.2 Real-world databases 1.1.3 Unpacking the definition 1.2 ScyllaDB, a distributed database 1.2.1 Distributing data 1.2.2 ScyllaDB versus relational databases 1.2.3 ScyllaDB versus Cassandra 1.2.4 ScyllaDB versus Amazon Aurora / Google Cloud Spanner / Google AlloyDB 1.2.5 ScyllaDB versus document stores 1.2.6 ScyllaDB versus distributed relational databases 1.2.7 When to prefer other databases 1.3 ScyllaDB, a practical database 1.3.1 Fault tolerance 1.3.2 Scalability 1.3.3 Used in production 1.4 Summary 2 Touring ScyllaDB 3 Data Modeling in ScyllaDB 4 Data Types in ScyllaDB 5 Tables in ScyllaDB Up next… 2 Touring ScyllaDB Running ScyllaDB locally with Docker Using nodetool to view operational details of the cluster at the command line Creating a table and reading and writing data Experimenting with failures and changing consistency levels

scylladb.txt · Last modified: 2024/02/16 19:15 by 127.0.0.1