Top 5 SQL tips I didn’t get in DS bootcamp
Like many new data scientists, I went to a bootcamp to start my career. Mine was a great program run by Galvanize — I graduated well prepared to build models at my first job. The bootcamp spent about a week on SQL before moving on to the serious DS topics. My first job was 75% SQL 25% modeling — I needed to get a lot better than one week of training at SQL if I was going to do the majority of my job well.
I don’t think the bootcamp should have focused more on SQL; we didn’t need those awesome teachers to go through the minutiae of Redshift documentation. The only option to learn those details is to do tons of querying yourself — in real data.
This list is meant to give you tools to solve common problems that I found while working professionally in dozens of databases over the last 5 years. These tips should serve as the building blocks of bigger queries; try using one of these the next time you need to break a big problem down into steps. Happy querying!
ROW_NUMBER() to dedupe data or select only the rows you want. I had very limited exposure to window functions at my bootcamp, but they are enormously useful for selecting the right data. They are great with time stamps— think first or last — or picking out the largest or smallest in a group. They are also great for finding and choosing the best among duplicates; window functions can have very complex
2. Calculate percentage columns during EDA
COUNT(*) / SUM(COUNT(*)) OVER() AS pct_of_total
When you are doing quick data investigation, it can be really helpful to compare groups as a percent of total rather than raw numbers. It’s considerably faster than eyeballing the raw data, especially for things with 15 or 20 groups. Another great example of a window function, this one runs very quickly as well. It also leads into the next tip…
3. User ORDER and LIMIT
ORDER BY 1 ASC LIMIT 15
Most people before their first day will not have worked on a scale database. They might be surprised to find that
SELECT * takes 15 minutes to run. When exploring a DB for the first time, think about what you are looking for and make sure the results are going to be displayable. This means ordering by the appropriate column and using either
DESC. Limit rows to less than a silly number of results. Try not to hog the database with huge queries when you only need to run tests.
Bonus tip! In the
GROUP clauses you can refer to columns by their order. Very handy shorthand if you have something like
GROUP BY 1,2,3,4,5,6 ORDER BY 1,2,3,4,5,6
4. Make use of the runtime indicator on your SQL host
Target this number to make sure you are running efficient queries. If a particular query takes a long time to run, note that down. Often only one or two tables will be ones you need to be really careful with; query time will let you know that you are asking for a lot of data.
This can also help you catch mistakes like cross-joins if something takes unusually long. If your simple query takes a lot longer than expected, start scanning it for typos.
Sarcastic bonus tip! Use the company chat app to complain that coworkers are slowing down the DB and demand special time to run your hulking DB torpedos.
5 Fix the cause of a million unhelpful error messages
Did you know that “double quotes” are used to wrap table and column names in the case of a poorly formatted name, like a space or a special character? ‘Single quotes’ wrap strings. I see this error all the time but it is very difficult to research online. If you copy and paste the exact error you get no results since the error includes your specialized title.
Become a hero of your coworkers by fixing a lot of previously ungooglable issues just by using this double quote trick.
That is my top 5 list. Let me know if you have any other tips that you’ve only gotten after years mining DBs!