
The problem is that "denormalization" is an overloaded term.

Often, it's used to mean storing multiple copies of data; for instance, to work around a key-value store's lack of secondary indexing. In that case, you're paying a penalty in terms of space (and code complexity) to make certain operations faster.
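To make that first meaning concrete, here is a minimal sketch in Python. The two dicts stand in for a key-value store with no secondary indexes; the names (users_by_id, user_ids_by_email, put_user) are made up for illustration and don't come from any real store's API.

```python
# Toy in-memory stand-in for a key-value store with no secondary indexes.
users_by_id = {}
user_ids_by_email = {}  # hand-maintained second copy, keyed by email


def put_user(user_id, email, name):
    """Write the primary record and keep the email lookup in sync."""
    old = users_by_id.get(user_id)
    if old is not None and old["email"] != email:
        # The code-complexity cost: stale index entries must be removed by hand.
        del user_ids_by_email[old["email"]]
    users_by_id[user_id] = {"email": email, "name": name}
    user_ids_by_email[email] = user_id


def get_user_by_email(email):
    """Look up by email without scanning every record."""
    user_id = user_ids_by_email.get(email)
    return None if user_id is None else users_by_id[user_id]
```

The email is stored twice (the space penalty), and every write has to touch both maps, but the lookup by email no longer needs a full scan.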

In other cases, denormalization just means structuring your data non-relationally. For example, you might want to add a set-valued field to one of your tables. The relational way would be to split that field into a separate table and access it with a join operation, but it's not obvious that that's more maintainable or efficient than storing the set inline.
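A rough side-by-side of the two layouts, as a sketch only: the posts/post_tags schema and the tag values are invented for the example, with SQLite standing in for the RDBMS and a plain dict standing in for a store that allows set-valued fields.

```python
import sqlite3

# Relational layout: the set lives in its own table and is reached via a join.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE post_tags (post_id INTEGER, tag TEXT, PRIMARY KEY (post_id, tag));
    INSERT INTO posts VALUES (1, 'Wide rows');
    INSERT INTO post_tags VALUES (1, 'hbase');
    INSERT INTO post_tags VALUES (1, 'nosql');
""")
rows = db.execute(
    "SELECT t.tag FROM posts p JOIN post_tags t ON t.post_id = p.id WHERE p.id = ?",
    (1,),
).fetchall()
relational_tags = {tag for (tag,) in rows}

# Inline layout: the set is just another field on the record.
posts = {1: {"title": "Wide rows", "tags": {"hbase", "nosql"}}}
inline_tags = posts[1]["tags"]
```

Both end up with the same set of tags; the difference is whether reading it takes a join or a single record fetch.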

HBase's support for wide rows is just a mechanism that gives you that kind of flexibility in organizing your data. As norkakn alluded to, the distinction between "row" and "column" in HBase isn't nearly as fundamental as in an RDBMS. Data is indexed by a tuple of (row, column family, column), where the row determines atomicity and the column family controls storage locality.
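If it helps, the addressing scheme can be modeled as a toy in Python. This is not the HBase client API, just an illustration: put and get_row are hypothetical helpers, and real HBase also versions each cell by timestamp, which is left out here.

```python
from collections import defaultdict

# One store per column family: cells in the same family are kept together,
# which is the toy version of "column family controls storage locality".
stores = defaultdict(dict)  # family -> {(row, column): value}


def put(row, family, column, value):
    """Write one cell, addressed by (row, column family, column)."""
    stores[family][(row, column)] = value


def get_row(row, family):
    """Return every column of one row within one family, ordered by column name."""
    cells = stores[family]
    return {col: cells[(r, col)] for (r, col) in sorted(cells) if r == row}


put("user1", "follows", "bob", "1")
put("user1", "follows", "alice", "1")
put("user1", "info", "name", "User One")
```

Here the "columns" alice and bob are really just the tail end of the key, so a row can hold as many of them as you like without any schema change.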

There are solid technical reasons for wanting to go with an RDBMS or a distributed key-value store in different situations. Metaphors like "data junkyard" don't add anything productive to the discussion.



Can you recommend a resource (book, blog article, website) for learning more about NoSQL database design and denormalization? The issue, of course, isn't that I have trouble finding such resources, but that there are way too many.


> Metaphors like "data junkyard" don't add anything productive to the discussion.

Yeah, I probably went a bit overboard. It's just very much different from the way I'm used to thinking about data, so it seems very disorganized to me, though I'm sure it's not when done well.

I was just imagining an RDBMS with a million columns. Even in DW scenarios, I'm pretty sure that's a bad plan, though I may be wrong there. I'd definitely cringe, though.


Yeah, it's just a totally different model, and maybe the problem is that we're using the word "column" to mean totally different things in different contexts. What HBase calls a "column" is really more like part of a composite key, and nobody gets upset by a composite key that has millions of distinct values.

In an RDBMS, a table with millions of columns would be unmanageable for a bunch of reasons:

- If only a few columns were set in any given row, you'd waste a ton of space storing all the NULL values.

- Modifying one column in a row would probably require reading and re-writing the entire row.

- There's no good way to retrieve a large-ish subset of columns that you're interested in; you'd have to either specify them all by name in your query, or fetch the entire row.

None of those downsides apply to the Bigtable data model (which includes HBase, Cassandra and a few other similar projects). Null columns are free (since they don't exist on disk in the first place), writes are cheap no matter the row size, and you can filter by columns in interesting ways.
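Here is a small sketch of why those points hold, assuming columns are stored sorted by name and only when they have been written. WideRow and scan_prefix are made-up names for illustration, not anything from HBase or Cassandra, and a prefix scan is just one example of column filtering.

```python
from bisect import bisect_left


class WideRow:
    """Toy sparse row: unset columns take no space because they are never stored."""

    def __init__(self):
        self._names = []   # column names, kept sorted
        self._values = {}  # column name -> value

    def put(self, column, value):
        """Write one column without touching any of the others."""
        if column not in self._values:
            self._names.insert(bisect_left(self._names, column), column)
        self._values[column] = value

    def scan_prefix(self, prefix):
        """Return (column, value) pairs whose name starts with prefix, in order."""
        out = []
        i = bisect_left(self._names, prefix)
        while i < len(self._names) and self._names[i].startswith(prefix):
            name = self._names[i]
            out.append((name, self._values[name]))
            i += 1
        return out


row = WideRow()
row.put("msg:2011-03-02", "second")
row.put("msg:2011-03-01", "first")
row.put("profile:name", "someone")
messages = row.scan_prefix("msg:")
```

The row only holds the three columns that were written, and pulling out the "msg:" subset doesn't require naming each column or reading the rest of the row.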

You're not wrong about the potential for messiness, though. The biggest drawback (IMO) of the Bigtable model is that the database server doesn't know anything about the structure of your data. If you're used to having a SQL prompt where you can examine and manipulate data in interesting ways, HBase's "shell" is a huge step backward. If you want to have any kind of useful visibility into your data, you have to build those tools yourself.



