Skip to main content

15.2 How Databases Work

Learning objective

As a product manager, you'll interact with data in a variety of ways. Some companies expect product managers to have an intimate knowledge of data. In other companies, you will work alongside a data scientist or data analyst who will handle your data needs.

In either case, knowing the basics about databases and being familiar with data tools like SQL is going to be tremendously helpful to your career. It will allow you to understand critical aspects of the products you'll manage and have intelligent discussions with your collaborators. And not only that, but it will make you a much more attractive candidate to most companies in today's data-driven business environment.

By the end of this checkpoint, you should be able to do the following:

  • Explain what databases and relational databases are
  • Explain the difference between SQL and NoSQL databases
  • Explain why SQL is important to product management



What is SQL​

Structured Query Language, or SQL (pronounced See-Quill) as it is better known, is a programming language commonly used for retrieving data from relational databases such as SQL Server, MySQL, MariaDB, and PostgreSQL. What is a relational database? Simply put, it's a collection of related information contained in data tables. If all of this sounds foreign and confusing for now, don't worry. By the end of this checkpoint, it won't be.

What are data tables?​

A data table is like a self-contained spreadsheet. It is grouped by rows and columns. Different data tables will be different in size. Some will have hundreds of thousands of rows and only a few columns, and some will have thousands of columns and only a few rows. Data tables are used to store important data waiting to be retrieved by a computer command. For example, in this data table, the information stored is based on where your customers live:

Customer IDCustomer nameCustomer addressCustomer stateCustomer zip
1Fred Jones123 Main RoadArizona12232
2Monica James344 St. Charles Ave Apt 3Georgia45466
3Lania Helms45 New Haven St. Unit 224New Jersey30308
4Jorge Castillo555 1st StreetArizona12256
5Ronnell Jameson1212 Arlo LaneTexas23200
6James Smith65 Mansley Avenue Unit 3BNorth Dakota68234

So if you wanted to know which customers live in Arizona, it would be easy to retrieve that information to find out that customers 1 and 4 live in Arizona.

Relational databases​

A relational database is, simply put, a collection of data tables about information that is related. A company could manage hundreds of thousands of data tables, each with hundreds of rows and columns. You can probably imagine how managing that much data and using it well could be a daunting task. Without data management programs, everyone would be lost trying to find a single piece of useful information in the vastness of information collected.

Here's the tricky part: not all data points are going to reside in one data table. Data is spread out into various tables that make sense in their specific business context. It would be impractical for all data to reside in one massive table that displayed information about various areas of your business.

For instance, a table called customers would have data relevant to those customers, like their ID and maybe their region or customer type (enterprise, small business, etc.) while a table called customer_orders may be more focused on specific information (like when an order was placed, what the order was, and how many units were purchased).

For example, what if you want to know how many of your customers buy over 2,000 dollars worth of your product and are located in Texas? In order to find that information, you can't only look up the table from earlier; you need to also look up (and cross-reference) another table that has different information relating to order cost and order amounts. That means you are using related information from your relational database to retrieve data that answers a specific question.

Unfortunately, SQL programs don't understand questions posed the way we speak (yet). Therefore, data tables are organized in a way that is easy for a computer program to retrieve and for human beings to read. SQL is the tool that allows you to translate your question into a language the computer will understand.

How relational databases work​

A database consists of tables containing columns and rows with data in them. In SQL, these are commonly referred to as fields and records, and both follow certain rules and best practices.

Fields​

A field is the data-table column. You'll see a field being called a column and vice versa—this isn't meant to be confusing; it's just because data analysts and SQL gurus use the terms interchangeably. Every field includes two pieces of information:

  1. A name: A good column name is legible and logically related to the data stored in that column. A name must be a string of 63 characters or fewer. It must start with a letter or an underscore, and the remaining characters can be letters, digits, or underscores.
  2. A column type: In a SQL table, you must pick a column type so that all the rows in the column will be the same type of data. Even though there are plenty of exotic SQL data types (see PostgreSQL documentation for a complete list), here are the most common ones you will pick from when using SQL:
Data typeDescriptionExamples
IntegerIntegers can be positive or negative, but they must be whole numbers. (In other words, they can't have decimals.)1, -3, 454555
CharacterThis is text data, such as "Hello" or "Hello World!" A character data type can take strings of any length; this data type is often known as varchar or varying or variable character.123 Main Street, Debra, Enterprise
BooleanThis is the standard True–False flag. This is used often to delineate a binary set. For example, if you want to know if a customer has purchased a specific product, the column would be named Has product? and a True or False would be listed in the row. Sometimes a 0 will represent False and a 1 will represent True.True,
False,
1,
0
DecimalThis is your run-of-the-mill decimal number.1.1, 1.3, 3.0005555
DateThis type expresses a date in year-month-day calendar format.1999-01-08,
1/8/1999,
08-Jan-1999
TimeThis type expresses a time of day, with or without a time zone.04:05 AM,
04:05:06,
04:05:06 PST
TimestampThis type expresses both a date and a time, with or without a time zone.1997-12-17 07:37:16-08,
12/17/1997 07:37:16.00 PST,
Wed Dec 17 07:37:16 1997 PST

Requiring that a given field conforms to a certain data type ensures that the data is clean, regular, and predictable. A two-letter state abbreviation is a great example of this. Because it is a standardized, reproducible naming convention, anyone in the United States knows that NY stands for New York when it is placed in the context of an address field. Similarly, some numbers only make sense as whole numbers; for example, fractions don't make sense for the total number of customer orders in a month. So setting a field type to numeric adds some logic to that field's design and saves memory.

Records​

Records are the data in your table (the rows). Once the field structure is established, you can add as many records as you want. A database table is designed as a perfect grid; every record has a value for every column—in other words, nothing is left blank. While this is similar to the basic row-and-column setup of a spreadsheet, a database table has a much more precise structure.

Database normalization​

Before diving into SQL proper, it's important to cover one more topic: database normalization. As a product manager, normalizing data may not be a large part of your job. However, you will hear the term tossed around, and it's worth understanding why it's important. Additionally, engineers work hard to optimize and organize the database efficiently, so you need to understand normalization in order to effectively communicate with them about this work.

Database normalization is a process in which data is organized and cleaned up so that it becomes more efficient and easier to understand. In other words, the goal of database normalization is to reduce and even eliminate data redundancy so that programmers and data scientists can access the data in an efficient manner, knowing that identical data is not stored in multiple data tables across databases.

For example, if I want to know how many orders a particular customer placed in the last year, you want to be sure that the customer ID is only stored in one place. If the information is stored in various places, there could be data corruption and other difficulties in trying to extract accurate information.




How SQL works​

SQL stands for Structured Query Language. Consider each of these elements:

  • Structured: It requires queries to be written a certain way in order to retrieve data properly.
  • Query: SQL is used to generate requests for the system to retrieve information.
  • Language: Much like other programming languages, SQL requires knowledge of its syntax.

Imagine you are ordering a meal at a restaurant. If you ask the waiter (the database) for dessert, then soup, then water, then meatloaf in that order (the query), you're going to have a strange meal. However, if you ask for water alongside your meal and soup followed by meatloaf followed by dessert, you'll have a more pleasant experience. The same applies to data. Knowing how to query data is extremely important, and SQL is the most commonly used programming language to retrieve data from data tables.

Why learn SQL?​

Depending on the company you end up working for, you'll either need a deep understanding of data yourself, or you'll need to understand what data your programmers and/or data scientists have access to. Understanding data will help you solve problems and form useful questions, and that will, in turn, enhance product development.

As a bonus, if you can write your own SQL queries and speak the lingo, you'll gain your fellow data workers' trust and increase your credibility. You'll also be able to answer your own questions and pull information yourself without having to rely on a data scientist or programmer to help you.

Finally, SQL has become so ubiquitous in technology companies that learning its inner workings is sure to make you more employable. Even if you never have to write a query or look at a data table, familiarity with SQL shows employers that you are skilled at analytically approaching problems and using data to answer questions.

NoSQL​

Not all databases are relational. A NoSQL database is one that stores its records without using rigid tables and columns. Records in NoSQL databases are often in documents similar to relational database records—like user documents, order documents, and product documents.

NoSQL databases were developed because relational databases are often very difficult to scale or perform poorly when they get too large. That's why companies like Google and Facebook use NoSQL databases for their extremely large datasets; these would be too slow in relational databases. On the other hand, relational databases are better when you need strong consistency and error checking in your data.

NoSQL databases usually don't use SQL to access the records; instead, they rely on their own querying language. If you're analyzing a NoSQL database, you'll need to learn the specific query language for that database or use business intelligence tools (which you will learn about later in the program). For now, it's helpful just to know that NoSQL exists. You may hear about a few popular NoSQL tools, such as MongoDB, Cassandra, Redis, and Amazon DynamoDB; just being familiar with the names and the concept of NoSQL could set you apart in an interview.

In the next checkpoint, you will start putting this knowledge into context and writing your own SQL queries. Throughout this module, you can reach out to fellow PMcademy students, including those from other programs, to discuss the joys and struggles of learning SQL via the #general-discussion channel on Slack.