Why You Need a Data Catalog

Have you ever been in a situation where you’re looking into a dataset for the first time? It usually goes something like this:

  • You infer a data model from some inspection of a single datum, or a few data.
  • You start by assuming that data model holds for everything in the dataset. So you query the dataset using that model as an assumption.
  • This inevitably fails due to lack of handling for message types, extra fields, union fields, etc.
  • So you start the process of exhaustively searching for all the various possible data models a single datum can exhibit. This takes a while.
  • Finally, you have a complete data model. Now you query your dataset to answer your question. But some of the data is bad, and you find errors in your output.
  • Now you have to figure out how to account for data errors that show up seemingly randomly in the data.
  • After all this, you can finally run your query.
  • Then you find out that someone you know already did all this and wrote some code to handle it, but never documented any of it. ARGH!
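
To make the first few steps concrete, here is a minimal Python sketch of that kind of exploration. The records and field names are made up for illustration, and infer_schema is a hypothetical helper written for this example, not part of any particular catalog tool. It compares the model you would guess from a single record against the one you get from scanning everything.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are invented for illustration.
records = [
    {"type": "click", "user_id": 1, "x": 10, "y": 20},
    {"type": "purchase", "user_id": 2, "amount": 9.99},
    {"type": "click", "user_id": "3", "x": 5, "y": None},
]


def infer_schema(rows):
    """Map each field name to the set of Python type names seen for it."""
    schema = defaultdict(set)
    for row in rows:
        for field, value in row.items():
            schema[field].add(type(value).__name__)
    return dict(schema)


# Step 1: the model you would guess from looking at a single record.
naive = infer_schema(records[:1])

# Step 2: the model you only get after scanning every record.
full = infer_schema(records)

# Fields the single-record model never saw.
missing_fields = sorted(set(full) - set(naive))

# Fields that show up with more than one type (union fields).
union_fields = sorted(f for f, types in full.items() if len(types) > 1)

print("missing from naive model:", missing_fields)
print("fields with multiple types:", union_fields)
```

In this toy dataset the single-record model misses the amount field and has no idea that user_id and y arrive in more than one type. That is the kind of knowledge everyone ends up rediscovering when it isn't written down anywhere.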

Have you ever thought there might be a better way? So have a lot of people. That’s why data catalogs exist.