Replacing a SQL analyst with 26 recursive GPT prompts

When I was at Square and the team was smaller we had a dreaded "analytics on-call" rotation. It was strictly rotated on a weekly basis, and if it was your turn up you knew you would get very little "real" work done that week and spend most of your time fielding ad-hoc questions from the various product and operations teams at the company (SQL monkeying, we called it). There was cutthroat competition for manager roles on the analytics team, and I think this was entirely the result of managers being exempted from this rotation: no status prize could rival the carrot of not doing on-call work.

So the idea that something like analytics on-call could be entirely replaced by a next-token optimizer like GPT-3 is one I've been excited to try for a long time now. I finally got a chance to sit down and try it out this week. We're about to kick off our fundraising process, so to test the idea I attempted to build a free-form question answering analytics bot on top of the Crunchbase data set of investors, companies, and fundraising rounds. I call it CrunchBot.
Read on to get the details, but here is a quick example of it getting a simple thing exactly right. And an example of it getting something more complex exactly right. And an example of it getting something complex terrifically wrong (watch your join conditions, davinci-003!). And something of medium complexity mostly right (it's so close here; it needs an ilike instead of like, which it messed up despite a specific prompt telling it to watch out for this mistake).

Overall I was dumbfounded by the quality of the results. LLMs are a continual source of blown minds, but this is shockingly close to replacing an entire role at companies with only a couple hours of effort.

How I built it in 2 hours

My plan was to build it in Patterns Studio. There were mostly four parts:

1. Build the prompt from the user's question, schemas and sample data of the available tables, and clear directions
2. Run it through various GPT models and get 5+ completions of raw SQL
3. Execute the SQL against the relevant tables and pick the best result
4. Hook it up to a Slack bot

By the end I had a much more complex pipeline, as I kept quickly finding prompt improvements. Specifically, the SQL generation and execution part became a loop:

1. Generate a candidate query
2. Quality-check the SQL with GPT itself, asking it to spot common errors (NULLS LAST, for instance) and produce a corrected version of the query
3. Run the SQL against the tables
4. If there was an error or no result, ask GPT to fix the query to produce a correct result and repeat the loop
5. Otherwise, return the result

This led to completion chains of over 20 calls to GPT for a single user question. There feels like some logarithmic improvement in each GPT call: you can continue to add more context and checks with every completion, exploring different versions and iterating on the result. This is the same process a junior analyst would use to arrive at the answer; in this case it takes 15 seconds and costs $1 in credits vs. $50 and an hour for the analyst.
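The generate, check, run, repair loop described above can be sketched roughly as follows. This is a sketch of the control flow, not the post's actual code: `complete` and `execute` are hypothetical stand-ins for a GPT completion call and a Postgres client, injected so the loop itself is easy to test.

```python
def answer_question(question, build_prompt, complete, execute, max_attempts=5):
    """Return (sql, result) for the first candidate query that runs and
    returns rows, asking the model to repair failures along the way."""
    # 1. Generate a candidate query
    sql = complete(build_prompt(question))
    # 2. Quality-check with GPT itself, asking it to spot common errors
    sql = complete(
        "Double-check this Postgres query for common errors "
        "(e.g. NULLS LAST, ilike vs like) and return a corrected version:\n"
        + sql
    )
    for _ in range(max_attempts):
        try:
            # 3. Run the SQL against the tables
            result = execute(sql)
        except Exception as err:
            # 4. On error, feed the failure back to GPT and retry
            sql = complete(f"This query failed with: {err}\nFix it:\n{sql}")
            continue
        if result:
            # 5. Non-empty result: return it
            return sql, result
        # No rows is also treated as a failure worth repairing
        sql = complete(f"This query ran but returned no rows; rewrite it:\n{sql}")
    raise RuntimeError("no working query found")
```

With a repair budget of 5 attempts plus the generation and quality-check steps, a single question can plausibly fan out into the 20+ completions the post describes, especially if several candidate queries are generated in parallel.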
You have a LOT of leeway there to use even more crazy prompt pipelines before the ROI gets bad.

Anyways, here's the step-by-step of how I built it. You can follow along in this template app on Patterns that has all the code (no data in it; Crunchbase is a proprietary data set, so reach out if you are interested in exploring this specific data set).

1. Get the Crunchbase data

The full Crunchbase dataset includes 2.4m organizations and 510k funding rounds. We ingest the full CSV dump into our Postgres instance every 24 hours. For this analysis we restricted it to three tables: organizations, funding_rounds, and investments.

2. Build the initial prompt

The initial SQL generation prompt includes three basic elements: a summary of the tables and data available to query, the user's question, and a prompt asking GPT to write a correct Postgres query. Here's the exact template we used:

```python
prompt = f"""{tables_summary}

As a senior analyst, given the above schemas and data, write a detailed and correct Postgres sql query to answer the analytical question:

"{question}"

Comment the query with your logic."""
```

Which results in this full prompt, for example:

Schema for table: organizations
  uuid Text
  name Text
  roles Text
  country_code Text
  region Text
  city Text
  status Text
  short_description Text
  category_list Text
  num_funding_rounds Float
  total_funding_usd Float
  founded_on Date
  employee_count Text
  email Text
  primary_role Text

Data for table: organizations:
                                     uuid                        name    roles  \
0    ac323097-bdd0-4507-9cbc-6186e61c47a5       Bootstrap Enterprises  company
1    717ce629-38b6-494d-9ebf-f0eeb51506f8                  Campanizer  company
2    c8cbaa69-c9db-44e2-9ffa-eb4722a62fe3                       Cambr  company
3    5ab1ae3d-c3a1-4268-a532-b500d3dd6182                  CallMeHelp  company
4    143f840b-551c-4dbd-a92b-0804d654b5cf  California Cannabis Market  company

    country_code      region           city     status  \
0                                             operating
1            USA    Colorado        Boulder  operating
2            USA    New York       New York  operating
3            GBR   Stockport      Stockport  operating
4            USA  California  San Francisco     closed
                                     short_description  \
0    Bootstrap Enterprises is an organic waste mana...
1    Campanizer organizes schedule and coordinates ...
2    Cambr enables companies to build and scale fin...
3    CallMeHelp provides early warning and care ove...
4    California Cannabis Market is an information t...

                                       category_list  num_funding_rounds  \
0                Consulting,Organic,Waste Management                 NaN
1                  Information Technology,Scheduling                 NaN
2                         Banking,Financial Services                 NaN
3                       Fitness,Health Care,Wellness                 NaN
4    B2B,Information Services,Information Technology                 NaN

     total_funding_usd founded_on employee_count email  \
0                  NaN        NaT        unknown
1                  NaN 2017-01-01           1-10
2                  NaN        NaT        unknown
3                  NaN 2017-01-01           1-10
4                  NaN 2018-01-01           1-10

    primary_role
0        company
1        company
2        company
3        company
4        company

Schema for table: investments
  uuid Text
  name Text
  funding_round_uuid Text
  funding_round_name Text
  investor_uuid Text
  investor_name Text
  investor_type Text
  is_lead_investor Boolean

Data for table: investments:
                                     uuid  \
0    524986f0-3049-54a4-fa72-f60897a5e61d
1    6556ab92-6465-25aa-1ffc-7f8b4b09a476
2    0216e06a-61f8-9cf1-19ba-20811229c5
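A `tables_summary` block in the format shown above could be generated along these lines. This is a sketch under assumptions, not the post's actual code: the `summarize_table` and `build_tables_summary` helpers are hypothetical, and the dtype-to-label mapping is inferred from the Text/Float/Date/Boolean labels visible in the example prompt. It uses pandas, whose default wrapped `to_string()` layout matches the sample-rows formatting in the prompt.

```python
import pandas as pd

# Assumed mapping from pandas dtypes to the type labels seen in the prompt.
DTYPE_NAMES = {
    "object": "Text",
    "float64": "Float",
    "datetime64[ns]": "Date",
    "bool": "Boolean",
}

def summarize_table(name: str, df: pd.DataFrame, n_rows: int = 5) -> str:
    """Render one table as a schema listing plus a few sample rows."""
    schema = "\n".join(
        f"  {col} {DTYPE_NAMES.get(str(dtype), 'Text')}"
        for col, dtype in df.dtypes.items()
    )
    sample = df.head(n_rows).to_string()  # pandas wraps wide frames by default
    return (f"Schema for table: {name}\n{schema}\n\n"
            f"Data for table: {name}:\n{sample}")

def build_tables_summary(tables: dict[str, pd.DataFrame]) -> str:
    """Concatenate per-table summaries into the prompt's context block."""
    return "\n\n".join(summarize_table(n, df) for n, df in tables.items())
```

Keeping the sample to a handful of rows matters here: the schemas and sample data for all three tables have to fit in the model's context window alongside the question and the instructions.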