database - What scalability problems have you encountered using a NoSQL data store?

ID : 20316

viewed : 33

Tags : databasenosqlkey-value-storegraph-databasesdistributed-databasedatabase

Top 5 Answer for database - What scalability problems have you encountered using a NoSQL data store?

vote vote

96

My current project actually.

Storing 18,000 objects in a normalised structure: 90,000 rows across 8 different tables. Took 1 minute to retrieve and map them to our Java object model, that's with everything correctly indexed etc.

Storing them as key/value pairs using a lightweight text representation: 1 table, 18,000 rows, 3 seconds to retrieve them all and reconstruct the Java objects.

In business terms: first option was not feasible. Second option means our app works.

Technology details: running on MySQL for both SQL and NoSQL! Sticking with MySQL for good transaction support, performance, and proven track record for not corrupting data, scaling fairly well, support for clustering etc.

Our data model in MySQL is now just key fields (integers) and the big "value" field: just a big TEXT field basically.

We did not go with any of the new players (CouchDB, Cassandra, MongoDB, etc) because although they each offer great features/performance in their own right, there were always drawbacks for our circumstances (e.g. missing/immature Java support).

Extra benefit of (ab)using MySQL - the bits of our model that do work relationally can be easily linked to our key/value store data.

Update: here's an example of how we represented text content, not our actual business domain (we don't work with "products") as my boss'd shoot me, but conveys the idea, including the recursive aspect (one entity, here a product, "containing" others). Hopefully it's clear how in a normalised structure this could be quite a few tables, e.g. joining a product to its range of flavours, which other products are contained, etc

Name=An Example Product Type=CategoryAProduct Colour=Blue Size=Large Flavours={nice,lovely,unpleasant,foul} Contains=[ Name=Product2 Type=CategoryBProduct Size=medium Flavours={yuck} ------ Name=Product3 Type=CategoryCProduct Size=Small Flavours={sublime} ] 
vote vote

89

I've switched a small subproject from MySQL to CouchDB, to be able to handle the load. The result was amazing.

About 2 years ago, we've released a self written software on http://www.ubuntuusers.de/ (which is probably the biggest German Linux community website). The site is written in Python and we've added a WSGI middleware which was able to catch all exceptions and send them to another small MySQL powered website. This small website used a hash to determine different bugs and stored the number of occurrences and the last occurrence as well.

Unfortunately, shortly after the release, the traceback-logger website wasn't responding anymore. We had some locking issues with the production db of our main site which was throwing exceptions nearly every request, as well as several other bugs, which we haven't explored during the testing stage. The server cluster of our main site, called the traceback-logger submit page several k times per second. And that was a way too much for the small server which hosted the traceback logger (it was already an old server, which was only used for development purposes).

At this time CouchDB was rather popular, and so I decided to try it out and write a small traceback-logger with it. The new logger only consisted of a single python file, which provided a bug list with sorting and filter options and a submit page. And in the background I've started a CouchDB process. The new software responded extremely quickly to all requests and we were able to view the massive amount of automatic bug reports.

One interesting thing is, that the solution before, was running on an old dedicated server, where the new CouchDB based site on the other hand was only running on a shared xen instance with very limited resources. And I haven't even used the strength of key-values stores to scale horizontally. The ability of CouchDB / Erlang OTP to handle concurrent requests without locking anything was already enough to serve the needs.

Now, the quickly written CouchDB-traceback logger is still running and is a helpful way to explore bugs on the main website. Anyway, about once a month the database becomes too big and the CouchDB process gets killed. But then, the compact-db command of CouchDB reduces the size from several GBs to some KBs again and the database is up and running again (maybe i should consider adding a cronjob there... 0o).

In a summary, CouchDB was surely the best choice (or at least a better choice than MySQL) for this subproject and it does its job well.

vote vote

79

Todd Hoff's highscalability.com has a lot of great coverage of NoSQL, including some case studies.

The commercial Vertica columnar DBMS might suit your purposes (even though it supports SQL): it's very fast compared with traditional relational DBMSs for analytics queries. See Stonebraker, et al.'s recent CACM paper contrasting Vertica with map-reduce.

Update: And Twitter's selected Cassandra over several others, including HBase, Voldemort, MongoDB, MemcacheDB, Redis, and HyperTable.

Update 2: Rick Cattell has just published a comparison of several NoSQL systems in High Performance Data Stores. And highscalability.com's take on Rick's paper is here.

vote vote

64

We moved part of our data from mysql to mongodb, not so much for scalability but more because it is a better fit for files and non-tabular data.

In production we currently store:

  • 25 thousand files (60GB)
  • 130 million other "documents" (350GB)

with a daily turnover of around 10GB.

The database is deployed in a "paired" configuration on two nodes (6x450GB sas raid10) with apache/wsgi/python clients using the mongodb python api (pymongo). The disk setup is probably overkill but thats what we use for mysql.

Apart from some issues with pymongo threadpools and the blocking nature of the mongodb server it has been a good experience.

vote vote

54

I apologize for going against your bold text, since I don't have any first-hand experience, but this set of blog posts is a good example of solving a problem with CouchDB.

CouchDB: A Case Study

Essentially, the textme application used CouchDB to deal with their exploding data problem. They found that SQL was too slow to deal with large amounts of archival data, and moved it over to CouchDB. It's an excellent read, and he discusses the entire process of figuring out what problems CouchDB could solve and how they ended up solving them.

Top 3 video Explaining database - What scalability problems have you encountered using a NoSQL data store?

Related QUESTION?