Amazon Redshift has proven to be a trustworthy, scalable data warehouse solution. According to Gartner’s release of the “2016 Magic Quadrant for Data Warehouse and Data Management Solutions for Analytics,” there is an urgency for IT market leaders, such as IBM, Microsoft, Oracle, Teradata, and SAP to fully understand the competition facing them as data warehousing continues to move into the cloud. This competition comes, in particular, from Amazon Redshift, who we see has gained great traction.
Redshift is a fully managed, highly scalable, columnar database built on top of the Postgres engine. It supports standard SQL so that your existing application is not impacted. In addition, it is fast, thanks to its columnar data storage technique and parallel execution using multiple nodes supported by high-performant SSDs and disks.
Fast data loading is the key to making quick decisions and Redshift can load terabytes of data from files on Amazon S3 in seconds. It allows you to load compressed data directly into Redshift tables, saving time on transport and load. When you need to handle an increased throughput requirement, you are able to scale with just a few clicks. Plus, Redshift supports running read queries on hundreds of gigabytes of datasets without pre-aggregation. Redshift allows you to start with one node cluster and scale up to 128 node clusters as the demand grows, saving on provisioning in advance.
In this article, we will outline a plan for setting up a fully functional data warehouse, including planning the underlying infrastructure, optimizing data loading and modeling, as well as maintenance and monitoring of the clusters.
Moving Your Data Warehouse to Redshift
Redshift is not the same relational database that you were using for your traditional DWH. Careful planning is required before you start loading your data to it. You will need to plan your infrastructure, first of all, and the transition might also require a tweak to your data model to make full use of the features Redshift offers. There are five key areas to consider before starting to build your data warehouse on Redshift. Let’s review them below.
Planning and Provisioning
Resource provisioning requires you to look back at your existing DWH infrastructure to set a baseline for the required nodes and storage. Understanding the architecture of Redshift is important to set up a very high-performant data warehouse. A Redshift cluster consists of nodes, basically a compute engine that has its own storage, CPU, and RAM of the configuration you select from DC and DS types.
The tables below show the multiple options AWS supports.
If you have a high storage requirement, select the dense storage node (i.e., ds2) cluster. If you have a high performance throughput requirement, then dense compute (i.e., dc1) is the most suitable type of node running on an SSD disk. Even if you select dense compute, keep in mind that Redshift provides columnar compression up to three times the normal data you may have.
A Redshift cluster may have one node (not a best practice for production deployment) or more. In addition, Redshift supports multi-node clusters so when your requirement grows, you can scale by just adding a node. You also get a leader node compute engine without any extra cost when you run your data warehouse in a multi-node cluster. Storage of each node is used for storing data in a distributed fashion to achieve high degree of parallel processing. The Leader Node is responsible for collating data from different nodes to serve user requests, as well as distributing data across nodes during data loading.
When you move to AWS Redshift, there are key data model changes that you need to perform so your data warehouse is able to make full use of Redshift features. This may not be a huge change to your functional data model and mostly impacts technical aspects related to distribution and compression.
Distribution of data is key to performance. While creating tables in Redshift, make sure that proper distribution style is set. DISTSTYLE key will decide how your data is distributed across nodes. There are three distribution styles: EVEN, KEY, and ALL. As a rule of thumb, you should use KEY for FACT tables on date or range key columns that are used for distribution of data logically by period or range, and ALL for dimensions. This distribution will avoid transport of data between nodes for joins and aggregations, allowing each node to utilize optimal and near equal resources and allowing the leader node to run a balanced compute operation.
SORTKEYS: To further enhance query performance, you should set Redshift SORTKEYS to take the place of indexes. If there was a table with proper indexing on the source database and the data was properly utilized, then use those columns as SORTKEYS when creating the table in Redshift. You can replace your composite indexes with compound SORTKEYS. Consider creating interleaved SORTKEYS in case some of the columns are not used in join because an interleaved sort gives equal weight to each column, or a subset of columns, in the sort key.
Compression: As with every database, performance is also impacted if your read query runs on a large amount of data. To reduce the amount of data your query needs to process, use column level compressionencoding to downsize the amount of data up to three times. This provides significant performance benefits because you’ll be reading 9 to 12 percent of the original amount of data that an RDBMS would generally read. Keep the key columns uncompressed to avoid overhead of the decompression operation during query execution.
Now that your cluster is ready, let’s look at how data is loaded and how to build your ETL processes on top of Redshift.
Redshift allows you to load 100s of GBs of data in minutes. Redshift uses Amazon S3 storage as a staging area. Data from multiple sources with multiple formats can be copied over to an S3 bucket and from there be loaded to Redshift. To load data from S3, Redshift supports a COPY command.
The COPY command copies data from the S3 bucket to a target Redshift table. COPY command supports loading CSV or JSON format, compressed and uncompressed files (e.g, gzip). You can load zipped files directly without extracting it. When loading GBs of data, it is most recommended to split the files in chunks of same prefix to utilize high degree of parallelism. DOP is decided based on number of slices your cluster has. It’s also important to note that Redshift does not support procedural language like PLSQL or PSQL, therefore, your existing ETL procedures will need to be transformed or changed to standard SQL.
There are various tools and services that allow for easy integration of multiple data sources. Let’s look at some of them.
One is Glue, an AWS-native, fully managed ETL service. It simplifies data discovery, conversion, mapping, and job scheduling tasks. You can also connect to external jdbc-compliant data sources, as well as to Amazon RDS and S3. Glue also may be used for scheduling ETL jobs. As Glue is currently “in preview,” you can access it by submitting a preview request.
Additionally, there are other third-party tools built especially for Redshift, such as Matillion. It can connect to multiple data sources and extract data with just a few clicks. Matillion provides a rich data orchestration environment to build and control the ETL/ELT and allows you to perform data model changes directly. It’s an enterprise-ready platform that is easily integrated with AWS tools, including CloudWatch, SNS, and others for monitoring, logging, and alerting.
Monitoring and Maintenance
Amazon Redshift provides performance metrics and data so that you can track the health and performance of your clusters and databases. Amazon CloudWatch generates metrics to help you monitor the underlying physical aspects of your cluster, such as CPU utilization, latency, and resource throughput. This data is aggregated in the Amazon Redshift console to help you easily correlate and track performance history of the underlying infrastructure with a specific database query and load events.
Database tables may get fragmented due to data refresh, reload, and delete activity so it is important to note that with Redshift, rows are not hard deleted. Subsequently, it is important that tables are defragmented and statistics are gathered periodically using VACUUM and ANALYZE respectively. Performing these operations will also re-sort the data, maintaining performance over time.
You should also use Redshift Workload Manager (WLM) to manage priorities of workloads. For instance, you may have users running queries that should be killed if the query takes more than five minutes , or you just want to timeout a user session at certain intervals. But a session that is running ETL should not get killed and should not eat up all the resources, so allow ETL sessions to use unlimited timeout, but limit the memory it can use.
In addition to optimizing storage running costs, you should leverage Redshift Spectrum, which allows you to move old, rarely accessed data to S3 and still query it using the same Redshift SQL queries.
Connecting with BI Tools
The beauty of working with Redshift is the ability to leverage the business intelligence (BI) or data discovery tools you’re currently using. Its main strength is the ability to “query petabytes of structured data” in a matter of seconds. AWS offers a cloud BI service called QuickSight that is native to AWS and very cost-effective. You can connect QuickSight to Redshift and start building dashboards and reports with a few clicks. You may also connect data from endless data sources and start visualizing quickly. According to Amazon, QuickSight costs 1/10 the price of standard, more traditional BI tools.
Final Note: The Cloud Data Engineer
The job of data engineer is certainly not the same today as it was when there were only traditional databases. The modern cloud data engineer requires a new set of skills to attend to the changed scope of responsibility. Here are the five main roles of the cloud data engineer:
- Planning and architecting to support the required demand (i.e., scale) and performance
- Choosing and integrating the right ETL tools
- Preparing the BI structures and integrating them with BI solutions
- Optimizing queries and table structures
- Managing the underlying clusters and mitigating underlying limitations
Cloud data engineers work with other members of the organization, not just IT professionals, and are required to deliver processed data from multiple sources to executives and analysts. They also need to get involved in different phases from understanding requirements and infrastructure, to development and deployment.
With all its great benefits, Redshift requires a dedicated engineer to manage it. In addition to the technical skills and proficiency required, it is very important that the data engineer is also capable of handling changing business requirements to apply the right DWH features and capabilities to serve the business’ goals.