The problem is that "denormalization" is an overloaded term. Often, it's used to...

graffitici · on Feb 25, 2015

Can you recommend a resource (book, blog article, website) for learning more about NoSQL database design and denormalization? The issue, of course, is not that I have trouble finding such resources, but that there are way too many.

andrewstuart2 · on Feb 25, 2015

> Metaphors like "data junkyard" don't add anything productive to the discussion.

Yeah, I probably went a bit overboard. It's just very different from the way I'm used to thinking about data, so it seems very disorganized to me, though I'm sure it's not when done well.

I was just imagining an RDBMS with a million columns. Even in DW scenarios, I'm pretty sure that's a bad plan, though I may be wrong there. I'd definitely cringe, though.

teraflop · on Feb 25, 2015

Yeah, it's just a totally different model, and maybe the problem is that we're using the word "column" to mean totally different things in different contexts. What HBase calls a "column" is really more like part of a composite key, and nobody gets upset by a composite key that has millions of distinct values.

In an RDBMS, a table with millions of columns would be unmanageable for a bunch of reasons:

- If only a few columns were set in any given row, you'd waste a ton of space storing all the NULL values.

- Modifying one column in a row would probably require reading and re-writing the entire row.

- There's no good way to retrieve a large-ish subset of columns that you're interested in; you'd have to either specify them all by name in your query, or fetch the entire row.

None of those downsides apply to the Bigtable data model (which includes HBase, Cassandra and a few other similar projects). Null columns are free (since they don't exist on disk in the first place), writes are cheap no matter the row size, and you can filter by columns in interesting ways.
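To make that concrete, here's a rough toy model of the idea in Python. It is only a sketch, not the HBase or Cassandra API: the `SparseTable` class, its `put` and `scan` methods, and the row and column names are all made up for illustration. The assumption is that each cell is stored under a sorted `(row, column)` key, so the "column" really is just the second half of a composite key.

```python
from bisect import bisect_left, insort


class SparseTable:
    """Toy in-memory model of a Bigtable-style table (hypothetical, not a real API)."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column) composite keys
        self._cells = {}  # (row, column) -> value; absent cells take no space

    def put(self, row, column, value):
        # A write touches exactly one cell; the rest of the row is never read.
        key = (row, column)
        if key not in self._cells:
            insort(self._keys, key)
        self._cells[key] = value

    def scan(self, row, column_prefix=""):
        # Seek to the first key >= (row, column_prefix), then walk forward
        # while keys still belong to this row and match the prefix.
        i = bisect_left(self._keys, (row, column_prefix))
        result = {}
        while i < len(self._keys):
            r, c = self._keys[i]
            if r != row or not c.startswith(column_prefix):
                break
            result[c] = self._cells[(r, c)]
            i += 1
        return result


table = SparseTable()
table.put("user1", "clicks:2015-02-24", 3)
table.put("user1", "clicks:2015-02-25", 7)
table.put("user1", "profile:name", "Ann")
table.put("user2", "profile:name", "Bob")

# Fetch only the "clicks:" columns of one row without naming each one.
clicks = table.scan("user1", "clicks:")
```

Here `user2` has no `clicks:` columns at all, and nothing is stored for them. The prefix scan works because keys that share a prefix sit next to each other in sorted order, which is roughly why filtering by column is cheap in this model.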

You're not wrong about the potential for messiness, though. The biggest drawback (IMO) of the Bigtable model is that the database server doesn't know anything about the structure of your data. If you're used to having a SQL prompt where you can examine and manipulate data in interesting ways, HBase's "shell" is a huge step backward. If you want to have any kind of useful visibility into your data, you have to build those tools yourself.