Sunday, October 8, 2017

Using Amazon Redshift with Looker

This post is an attempt to share some lessons I learned while migrating a Looker instance to Amazon Redshift.

It turns out that Looker isn't just great for Business Users to avoid having to write SQL, it's also great for DevOps to create Admin Dashboards into Redshift that everyone can use to avoid manually writing many complex queries against Redshift. I'll start by setting up Looker/Redshift connections, then Admin dashboards, then end with optimizing it all.

Redshift currently makes sense to use with Looker because it's an OLAP (Online Analytical Processing) database, as opposed to OLTP DBs like MySQL. If you plan to use other AWS services, such as Kinesis Firehose to store event data in Redshift, using Redshift as your data warehouse lets you perform JOINs on that event data. Redshift already supports window functions (great for more real-time/ELT operations with Looker Derived Tables), CTEs (for more readable queries), etc. MySQL 8 will support many of these features, but Redshift is still designed as a columnar database, which keeps the analytical edge.

The cloud-database space moves rapidly, so keep an eye on Google, Oracle, and others as they compete with Amazon. If you have a cloud-native app that uses another provider, that's fine; you can extract all that data via ETL/ELT. The main thing is to consolidate into a single data warehouse, because Looker should only connect to one warehouse.

Migrating

Fortunately, you don't need to recreate your existing Looker account just to switch connections -- as long as the table and field names will (at least mostly) be the same between environments. To switch connections go to Admin>Database>Connections and create a new connection (leaving your old connection there for now):

  • The connection name has to be different. (I appended -redshift to the name)
  • For Dialect put "Amazon Redshift"
  • For Host:Port just put the host part of your Redshift endpoint, e.g., foo.us-east-1.redshift.amazonaws.com (the port is typically 5439)
  • For Database enter the name of the production Redshift DB to connect to
  • Enter the Schema you will connect to. (schemas are just like in Postgres)
  • You can probably leave the other options as defaults for now

At this point click `test connection`. If it says "Can connect", great; it should be nearly instant. If it takes a while and then fails with "Cannot connect: connection refused: Java::OrgPostgresqlUtil::PSQLException: Connection refused...", it means Looker doesn't have access to your cluster yet.
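If the test fails, it can help to rule out credential problems by connecting with the same details from a machine that already has network access to the cluster. A minimal sketch, assuming the example endpoint above and the default port 5439:

$ psql -h foo.us-east-1.redshift.amazonaws.com -p 5439 -d <your_database> -U looker

If that works but Looker still can't connect, the issue is almost certainly network access, which the next step addresses.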

If your production Redshift cluster is locked down such that Looker.com can't connect to it yet, whitelist Looker's IPs inside your Redshift VPC Security Group (VPC Security Groups work just like EC2 Security Groups). Looker required 5 IPs to be whitelisted. Adding them to the existing Inbound section worked, but it made it harder to tell what each IP was for, and we already had another rule in there for Kinesis that looked like Type=Redshift, Protocol=TCP, Port Range=5439, Source=<ip>/27. So instead of adding more IPs to the Inbound section, I created a new VPC Security Group called "Looker" and whitelisted those 5 IPs in there:

Redshift TCP 5439 <ip>/32
Redshift TCP 5439 <ip>/32
Redshift TCP 5439 <ip>/32
Redshift TCP 5439 <ip>/32
Redshift TCP 5439 <ip>/32

(Get the current list of IPs to whitelist from Looker's documentation.)
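If you prefer to script this rather than clicking through the console, the equivalent AWS CLI call per IP looks roughly like this (the group ID and address are placeholders):

$ aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 5439 --cidr 203.0.113.10/32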

I attached that new security group to the Redshift cluster. This is the key step: the new "Looker" security group has to be attached to the cluster itself. You can't grant access just by referencing its group ID or name from the existing Redshift security group; that would effectively act only like a tag.

In the Develop menu dropdown, you should have a menu option called Content Validator (if you don't, ask Looker support to add access). Go into it and make sure your existing content validates cleanly against your current connection. You will use this to find any errors when you switch over to your new Redshift connection and update your LookML.

BTW, I highly recommend considering hosting your own Looker Git repo, so that you can view all of the diffs your team makes to the LookML.

IMPORTANT: before you can switch connections, you'll have to be in development mode and changing connections will BREAK PRODUCTION Looker for now. So don't attempt this until you can bring Looker down for maintenance!

To switch DBs, go to Develop>Manage LookML Projects. Click 'Edit Configuration' for the project you want to change connections for, choose the new connection, and click Save. It will probably complain that model '<your model>' is not allowed to use connection '<your connection>', because you still need to update the model file(s) to use your new connection. Go to the model file from Develop>Project. Where it says connection: "<your connection>", change it to your new Redshift connection and click Save.
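For reference, the top of the model file ends up looking something like this (file and connection names here are hypothetical):

# <your model>.model.lkml
connection: "my-connection-redshift"

include: "*.view.lkml"
include: "*.dashboard.lookml"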

Next you'll need to update any table or field names in the View files to reference your new name. For example, if your old instances of sql_table_name were referencing a MySQL DB name, they will now need to reference your Redshift schema. Use search to find any other old references to change.
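A minimal sketch of that change in a View file, with hypothetical schema/table names:

view: orders {
  # Before (MySQL):   sql_table_name: my_mysql_db.orders ;;
  # After (Redshift): reference the schema you set on the connection
  sql_table_name: my_schema.orders ;;
}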

Now you should just need to port any SQL syntax to Redshift. For example, if your old connection was MySQL, look for errors such as "The Amazon Redshift database encountered an error while running this query ..." and convert the offending SQL to Redshift syntax. One example I encountered was a query that used SELECT AVG(some_table.paid), which no longer works because Redshift won't do an AVG on a bool (whereas the 'paid' field in MySQL was a TINYINT(1)). So I went into the View file where a Measure defined the 'paid' field and changed sql: ${paid} ;; to sql: ${paid}::int::double precision ;;. Remember: Redshift is not exactly the same as Postgres, where you could have just cast it to an int; Redshift's AVG() gives you an int result if the input is an int. If you're used to TINYINT(1) bools in MySQL, you can use a Dimension of type number (sql: case when ${TABLE}.paid then 1 else 0 end ;;) rather than a yesno, so that a sum Measure can just reference ${paid} and an average Measure can use ${paid}::double precision, as sketched below. YMMV, but keep this in mind.
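Here's that pattern in LookML, assuming a boolean 'paid' column in Redshift (field names are illustrative):

dimension: paid {
  type: number
  sql: CASE WHEN ${TABLE}.paid THEN 1 ELSE 0 END ;;
}

measure: total_paid {
  type: sum
  sql: ${paid} ;;
}

measure: average_paid {
  type: average
  # cast so Redshift's AVG() doesn't return a truncated integer
  sql: ${paid}::double precision ;;
}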

If you have a lot of old LookML, keep going through and fix all the syntax issues the switch to Redshift caused. If your old connection was Postgres, there's not as much to change as if it were MySQL. For example, MySQL date functions such as now() will need to be changed to getdate(), period_diff() to datediff(), date_format() to to_char(), etc.
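A few of those conversions written out as Redshift SQL (table and column names are made up):

-- MySQL: NOW()
SELECT GETDATE();

-- MySQL: DATE_FORMAT(created_at, '%Y-%m-%d')
SELECT TO_CHAR(created_at, 'YYYY-MM-DD') FROM orders;

-- MySQL: PERIOD_DIFF() on YYYYMM values, roughly a difference in months
SELECT DATEDIFF(month, start_date, end_date) FROM subscriptions;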

Redshift Admin Dashboards

Looker puts out really cool admin "blocks" that allow both admins and regular users to get insights into things such as ETL cycles, recent load failures, table architectures, and more; all within Looker without having to grant everyone access to the AWS Console.

The recommended way to create a Redshift Admin dashboard and Performance Overview in Looker is to follow these instructions, but I'll clarify some points that were missing for me. Start by creating a separate DB connection to your Redshift cluster just for this block. This step isn't absolutely necessary, but it's a good idea: the Admin Block uses PDTs, so a separate connection keeps that extra query load off your main connection. The new connection will mostly be a duplicate, except that you'll create a new user just for it, since this user needs permissions on metadata tables that the regular Looker user shouldn't have. Everything else can be duplicated: the database and even the scratch schema.

I created a user called "looker" as my main connection's user. Don't use an admin user for Looker, or it will have access to everything. For instructions on how to do this, see this article. I found their explanation of granting SELECT on all tables to Looker inadequate, so I ran GRANT SELECT ON ALL TABLES IN SCHEMA <my_schema> TO looker; instead (see below).
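Putting it together, setting up the main connection's user looked roughly like this (password and schema are placeholders):

psql> CREATE USER looker WITH PASSWORD 'some_password';
psql> GRANT USAGE ON SCHEMA <my_schema> TO looker;
psql> GRANT SELECT ON ALL TABLES IN SCHEMA <my_schema> TO looker;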

If you haven't already, create a scratch schema that your PDTs, including the Redshift Admin Block's, will use:

psql> CREATE SCHEMA looker_scratch AUTHORIZATION looker;
psql> ALTER USER looker SET search_path TO '$user',looker_scratch,<my_schema>,public;

In Admin>Connections select your connection and click the checkbox to enable PDTs. Under Temp Database enter 'looker_scratch' and click 'Update Connection'. Then click 'test connection'. If you get an error like Failed to create/write to pdt connection registration table looker_scratch.connection_reg_r3, then run:

psql> select has_table_privilege('looker', 'looker_scratch.connection_reg_r3', 'insert');

If that says 'f', then run:

psql> \dt looker_scratch.*

If that shows the owner is someone else, what likely happened is that your first Looker connection used a different DB user, and that user created the connection_reg_r3 table first, so it's still the owner. One way to fix that is:

psql> DROP SCHEMA looker_scratch CASCADE;
psql> CREATE SCHEMA looker_scratch AUTHORIZATION looker;
psql> ALTER USER looker SET search_path TO '$user',looker_scratch,<my_schema>,public;

Now when you click 'test connection' with the 'looker' user, the looker user will be the owner of the looker_scratch tables, and the test should pass with: Can use persistent derived tables in "looker_scratch"

We'll now create a new Redshift user called 'looker_admin' for this admin block to use, and give that user access to the Redshift system tables and the scratch schema it needs:

psql> CREATE USER looker_admin WITH PASSWORD 'some_password';

psql> GRANT SELECT ON TABLE STV_WLM_SERVICE_CLASS_CONFIG TO looker_admin;
psql> GRANT SELECT ON TABLE SVV_TABLE_INFO TO looker_admin;
psql> GRANT SELECT ON TABLE STV_TBL_PERM TO looker_admin;
psql> GRANT SELECT ON TABLE STV_BLOCKLIST TO looker_admin;
psql> GRANT SELECT ON TABLE STL_LOAD_COMMITS TO looker_admin;
psql> GRANT SELECT ON TABLE STL_LOAD_ERRORS TO looker_admin;
psql> GRANT SELECT ON TABLE pg_class TO looker_admin;
psql> GRANT SELECT ON TABLE pg_namespace TO looker_admin;
psql> GRANT SELECT ON TABLE pg_database TO looker_admin;

psql> GRANT ALL ON SCHEMA looker_scratch TO looker_admin;
psql> GRANT ALL ON ALL TABLES IN SCHEMA looker_scratch TO looker_admin;

Looker's SQL Runner runs as the 'looker' user, not 'looker_admin'. So don't be confused if you use it to run queries like SELECT COUNT(*) FROM stl_load_commits and get a count of zero: these system tables only show all rows to users with superuser privileges. Redshift's convention is to make rows generated by another user invisible, rather than raising an error.

You're now ready to start making the Redshift Admin block work. This requires adding 3 files: A Model called redshift_model, a Dashboard called redshift_admin, and a View called redshift_views.

Make sure the model's 'connection' string points to the separate connection we made above for the looker_admin user. Also make sure to go into Develop>Manage LookML Projects and click 'Edit Configuration' next to the redshift_model and select the same connection for use with it. You should now see the admin dashboard when you go to the dashboard section of your user. It should show you sections like Table Load Summary, Recent Files Loaded, Recent Load Errors, and Database Consumption.

But wait, there are more useful dashboards to add: redshift_performance and redshift_query_inspection. Edit both dashboards' LookML if you need to change query_timezone from the default America/Los_Angeles to something else.

If you end up seeing duplicate dashboards for all your models, it's probably because your other model still does:

include: "*.view.lkml"
include: "*.dashboard.lookml"

which makes it include every dashboard, including your new Redshift ones. There currently isn't a way to exclude redshift_* from the includes, so come up with a naming convention of your own, like the redshift_ LookML does (see the sketch below). I changed the include in my main model to use a prefix: include: "somePrefix_*.dashboard.lookml". That meant creating my LookML dashboards with that prefix from now on, and for consistency I did the same for the Views, which in turn meant going through my main Model and changing all references to the views to use the prefix. It's really annoying, but not as annoying to me as having duplicate dashboards.
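For example, the includes in the two models can end up looking roughly like this (the prefix is whatever you choose):

# main model
include: "somePrefix_*.view.lkml"
include: "somePrefix_*.dashboard.lookml"

# redshift_model keeps its own redshift_ prefix
include: "redshift_*.view.lkml"
include: "redshift_*.dashboard.lookml"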

After doing that, some of your existing Looks and Dashboards in the 'Browse' section will likely be broken. Go into Develop>Content Validator and click 'Replace' next to any errors to rename the explores to your new prefix. Once you commit, fix any LookML errors before you push. For example, I had a couple of Views referencing ${foo} that now needed to be ${somePrefix_foo}. Don't blindly do that for every ${foo}, just the errors: most ${foo}'s are references to Dimensions, which don't need prefixes, but sometimes a ${foo} is actually a reference to a View in another file. Once you've resolved these, click 'Validate Again' until the errors are gone.

Finally, you might want to adjust the time of day these Redshift Admin PDTs rebuild. Go into 'redshift_views' and change the value of sql_trigger_value.
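For example, to rebuild once per day shortly after 4am cluster time (the 4-hour offset is arbitrary; line it up with your ETL schedule):

derived_table: {
  # the PDT rebuilds whenever this value changes, i.e. once per day at ~04:00
  sql_trigger_value: SELECT DATEADD(hour, -4, GETDATE())::date ;;
}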

Optimizing

Let me start by saying that I am not a Redshift expert and that the following are from my own understanding of the docs and my own experiments. Nonetheless, it may help your understanding because these are the topics that tend to trip up developers the most.

Whether or not you're using a star-schema design with Redshift, you can still leverage Looker and optimize your tables in similar ways. For example, we can see which sortkeys and distkeys a table has with:

$ psql [cluster credentials]
# set search_path to '$user', public, <my_schema>;
# select "column", type, encoding, distkey, sortkey from pg_table_def where tablename = 'foo';

However, it's useful to first know which queries are actually slow. To do this, log into the AWS Console and go to your Redshift cluster and then the Queries tab. When you find a slow query, try running it directly, e.g. in Looker's SQL Runner or via the psql command. However, keep in mind that first runs of your query will be slower because Redshift compiles and then caches them. Therefore, you'll want to do your benchmarking against the warm cache versions.
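If you'd rather find slow queries with SQL instead of the console, something like this against SVL_QLOG works (elapsed is in microseconds):

SELECT query, starttime, elapsed / 1000000.0 AS seconds, substring
FROM svl_qlog
WHERE starttime > DATEADD(day, -1, GETDATE())
ORDER BY elapsed DESC
LIMIT 20;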

While you're still in the Queries tab, click on the actual query, and under Query Execution Details you'll see the "Explain plan". Look for the tab called Actual. Now you can click on the plan's steps to drill down further and see whether there are any warnings, and how many rows are being scanned. A high row count means your query is reading too many rows from disk, so you may need to change your sortkeys or run a VACUUM. You may also need to adjust your query to filter better...

Redshift is a columnar database, which means that each block holds data for only a single column. The "zone map" is held separately from the block, like an index. The zone map holds only two data points per block, the highest and lowest values in the block. Redshift consults the zone map when executing queries, and excludes the blocks that the zone map indicates won’t be returned by the WHERE clause filter.

You might notice that your ad-hoc queries run perfectly fast in warm cache, but perhaps your Look or Dashboard in Looker is much slower. This is probably because Looker's visualizations can be slow to render in the browser. Fortunately, once you (or anyone else) loads something in Looker, it will remain cached until your caching-policy resets the cache, e.g. once per day after ETL run.

Aside from incorrect queries, the first place to look when queries are slow in Redshift is usually the distkey/diststyle, and the second is the sortkeys. A diststyle of EVEN is the default and means rows are spread round-robin across all the nodes. The default is generally fine: you avoid bottlenecks like 90% of the disk and CPU used on one node while other nodes sit empty, and it makes all nodes participate to the best of their ability. However, joins can take longer.

Unfortunately, it's not easy (or even realistic) to know ahead of time which distkeys and sortkeys to add, because you have no way of knowing which queries people will write against your tables a year from now. And since distkeys and sortkeys are set at table creation time, there are two main things you'll need to do: 1) make it as easy as possible to change your schema, so you can edit the keys and repopulate the data (see the deep-copy sketch below); 2) use EXPLAIN on the queries you do have to figure out which columns could benefit from different keys.
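Here's what "repopulating" can look like in practice, as a deep copy with hypothetical table/column names and key choices:

-- recreate the table with new keys, copy the data, then swap names
CREATE TABLE users_new
  DISTKEY (g_id)
  SORTKEY (created_at)
AS SELECT * FROM users;

ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_new RENAME TO users;
DROP TABLE users_old;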

When you run JOINs, think in terms of distributed joins. Say you have a query like:

EXPLAIN SELECT groups.name, COUNT(DISTINCT u_id)
FROM groups JOIN users ON groups.g_id = users.g_id
GROUP BY 1;

and the EXPLAIN output shows Hash Full Join DS_DIST_BOTH. DS_DIST_BOTH means both the outer and inner tables get redistributed, with different ranges of the data allocated to different nodes. This is often what you get in Redshift if you haven't optimized, and it's one of the least efficient patterns, so you'll want to add distkeys and sortkeys. You may also see BCAST (broadcast), which means "copy the entire table to all the nodes"; it isn't seen as often, yet some queries do need a broadcast. For a small table that keeps getting broadcast, it's generally better to use diststyle ALL, which stores a copy on every node once rather than broadcasting on every query run (see the sketch below).
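For example, a small, rarely-changing dimension table that keeps showing up as a broadcast can be declared with DISTSTYLE ALL (hypothetical table):

CREATE TABLE groups (
  g_id INT,
  name VARCHAR(255)
)
DISTSTYLE ALL;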

Why is this so important? Redshift is not designed to run on a single node; it's designed to run as a sharded cluster, and you should expect very bad numbers on only one node. This is one reason you'll get slow queries. In MPP databases, data is partitioned across multiple servers and nodes, which communicate over a network rather than through disk. The nice thing is that your SQL queries still look as though you're selecting from a single database, thanks to the "distribution key".

Say you commonly JOIN or GROUP BY using the 'state' column of your table. You may choose to distribute on that column. However, if you set the DISTKEY to the state column, the distribution won't be very even, since some states have far more rows than others. In a large table, that amount of distribution "skew" can have an adverse impact on query processing. From the columns used in your queries, choose a column that causes the least amount of skew as the DISTKEY. A column with many distinct values, like a timestamp, would be a good first choice. Avoid columns with few distinct values, such as credit card types, error messages, or days of the week.
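After picking a DISTKEY, you can sanity-check the result with SVV_TABLE_INFO, where skew_rows is the ratio of rows on the fullest slice to the emptiest (closer to 1 is better):

SELECT "table", diststyle, skew_rows
FROM svv_table_info
ORDER BY skew_rows DESC;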

While the general rule of thumb is to add distkeys to the fields you'll be JOIN'ing on the most and to add sortkeys to fields you'll be filtering the most in WHERE clauses, you need to benchmark before/after and really look into the EXPLAIN output to be sure you're not making things worse. There is more to consider than just query speed, such as CPU usage and other things that can cause problems or drive up your AWS bill.

Pick a few important queries you want to optimize your databases for. You can’t optimize your table for all queries, unfortunately. Even though it will rarely be the best performance, a table with no DISTKEY/SORTKEY can be an ok all-round performer. It’s a good option not to define DISTKEY and SORTKEY until you really understand the nature of your data and queries.

That all said, sortkeys are less likely to backfire. Without sortkeys, every query has to read every block from disk (like a full table scan) before it can decide whether the data is relevant; in this sense, sortkeys are comparable to indexes. With sortkeys, the min/max values tracked for each 1 MB block become much more meaningful, so Redshift can skip blocks: when a WHERE clause filters by some value, Redshift doesn't have to read everything to apply it. Take:

SELECT COUNT(1) FROM users WHERE state = 'California'

Most of this query's time is spent in 'Seq Scan on users'.

The Redshift table STL_EXPLAIN stores the actual EXPLAIN output of all your queries. This is very useful because you can audit that list rather than hunting down every query and manually prefixing it with EXPLAIN. It shows the cost of each query plan step. Don't pay too much attention to the absolute number, but you can use it to reason about which step is probably the most time-consuming; a sample query against STL_EXPLAIN follows the example. The cost at the top includes the costs of the steps beneath it:

Plan node           | cost  | info
Aggregate           | 68718 |
  Seq Scan on users | 62500 | Filters: state = 'California'
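To pull the stored plan for a specific query id (for example, one you spotted in the console's Queries tab), something like this works:

SELECT nodeid, parentid, plannode, info
FROM stl_explain
WHERE query = 123456   -- replace with a real query id
ORDER BY nodeid;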

See also the STL_SCAN table. It helps you figure out which columns will contribute the most to optimizing: you can see which tables spend the most time being scanned and therefore have the greatest potential for improvement. Fortunately, the Admin Dashboards we created earlier in this post let you peer into all of this more easily.
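A rough way to rank tables by how many rows their scans process, straight from STL_SCAN:

SELECT perm_table_name,
       COUNT(DISTINCT query) AS queries,
       SUM(rows) AS total_rows_scanned
FROM stl_scan
WHERE perm_table_name NOT LIKE 'Internal Worktable%'
GROUP BY 1
ORDER BY total_rows_scanned DESC
LIMIT 20;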
