Regular Expression Matching with a Trigram Index
or How Google Code Search Worked

Russ Cox
rsc@swtch.com
January 2012

Introduction

In the summer of 2006, I was lucky enough to be an intern at Google. At the time, Google had an internal tool called gsearch that acted as if it ran grep over all the files in the Google source tree and printed the results. Of course, that implementation would be fairly slow, so what gsearch actually did was talk to a bunch of servers that kept different pieces of the source tree in memory: each machine did a grep through its memory, and then gsearch merged the results and printed them. Jeff Dean, my intern host and one of the authors of gsearch, suggested that it would be cool to build a web interface that, in effect, let you run gsearch over the world's public source code. I thought that sounded fun, so that's what I did that summer. Due primarily to an excess of optimism in our original schedule, the launch slipped to October, but on October 5, 2006, we did launch (by then I was back at school but still a part-time intern).

I built the earliest demos using Ken Thompson's Plan 9 grep, because I happened to have it lying around in library form. The plan had been to switch to a “real” regexp library, namely PCRE, probably behind a newly written, code-reviewed parser, since PCRE's parser was a well-known source of security bugs.
The only problem was my then-recent discovery that none of the popular regexp implementations - not Perl, not Python, not PCRE - used real automata. This was a surprise to me, and even to Rob Pike, the author of the Plan 9 regular expression library. (Ken was not yet at Google to be consulted.) I had learned about regular expressions and automata from the Dragon Book, from theory classes in college, and from reading Rob's and Ken's code. The idea that you wouldn't use the guaranteed linear time algorithm had never occurred to me. But it turned out that Rob's code in particular used an algorithm only a few people had ever known, and the others had forgotten about it years earlier. We launched with the Plan 9 grep code; a few years later I did replace it, with RE2.

Code Search was Google's first and only search engine to accept regular expression queries, which was geekily great but a very small niche. The sad fact is that many programmers can't write regular expressions, let alone correct ones. When we started Code Search, a Google search for “regular expression search engine” turned up sites where you typed “phone number” and got back “\(\d{3}\) \d{3}-\d{4}”.

Google open sourced the regular expression engine I wrote for Code Search, RE2, in March 2010. Code Search and RE2 have been a great vehicle for educating people about how to do regular expression search safely. In fact, Tom Christiansen recently told me that even people in the Perl community use it (perl -Mre::engine::RE2), to run regexp search engines (the real kind) on the web without opening themselves up to trivial denial of service attacks.

In October 2011, Google announced that it would shut down Code Search as part of its efforts to refocus on higher-impact products, and now Code Search is no longer online. To mark the occasion, I thought it would be appropriate to write a little about how Code Search worked.
The actual Code Search was built on top of Google's world-class document indexing and retrieval tools; this article is accompanied by an implementation that works well enough to index and search large code bases on a single computer.

Indexed Word Search

Before we can get to regular expression search, it helps to know a little about how word-based full-text search is implemented. The key data structure is called a posting list or inverted index, which lists, for every possible search term, the documents that contain that term.

For example, consider these three very short documents:

    (1) Google Code Search
    (2) Google Code Project Hosting
    (3) Google Web Search

The inverted index for these three documents looks like:

    Code: {1, 2}
    Google: {1, 2, 3}
    Hosting: {2}
    Project: {2}
    Search: {1, 3}
    Web: {3}

To find all the documents that contain both Code and Search, you load the index entry for Code {1, 2} and intersect it with the list for Search {1, 3}, producing the list {1}. To find documents that contain Code or Search (or both), you union the lists instead of intersecting them. Since the lists are sorted, these operations run in linear time.

To support phrases, full-text search implementations usually record each occurrence of a word in the posting list, along with its position:

    Code: {(1, 2), (2, 2)}
    Google: {(1, 1), (2, 1), (3, 1)}
    Hosting: {(2, 4)}
    Project: {(2, 3)}
    Search: {(1, 3), (3, 4)}
    Web: {(3, 2)}

To find the phrase “Code Search”, an implementation first loads the list for Code and then scans the list for Search to find entries that are one word past entries in the Code list. The (1, 2) entry in the Code list and the (1, 3) entry in the Search list are from the same document (1) and have consecutive word numbers (2 and 3), so document 1 contains the phrase “Code Search”.

An alternate way to support phrases is to treat them as AND queries to identify a set of candidate documents and then filter out non-matching documents after loading the document bodies from disk.
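The linear-time AND and OR operations on sorted posting lists can be sketched as follows. This is an illustrative sketch, not the actual Code Search code:

```python
def intersect(a, b):
    """AND query: documents present in both sorted posting lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR query: documents present in either sorted posting list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # One list may have a tail left over; it belongs in the union.
    out.extend(a[i:])
    out.extend(b[j:])
    return out

# Posting lists from the example index above.
code, search = [1, 2], [1, 3]
```

With the example lists, intersect(code, search) yields [1] and union(code, search) yields [1, 2, 3]. Each call touches every element of each list at most once, which is why these operations run in linear time.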
In practice, phrases built out of common words like “to be or not to be” make this approach unattractive. Storing the position information in the index entries makes the index bigger but avoids loading a document from disk unless it is guaranteed to be a match.

Indexed Regular Expression Search

There is too much source code in the world for Code Search to have kept it all in memory and run a regexp search over it all for every query, no matter how fast the regexp engine. Instead, Code Search used an inverted index to identify candidate documents to be searched, scored, ranked, and eventually shown as results.

Regular expression matches do not always line up nicely on word boundaries, so the inverted index cannot be based on words as in the previous example. Instead, we can use an old information retrieval trick and build an index of n-grams, substrings of length n. This sounds more general than it is. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is.

Continuing the example from the last section, the document set:

    (1) Google Code Search
    (2) Google Code Project Hosting
    (3) Google Web Search

has this trigram index:

    _Co: {1, 2}     Sea: {1, 3}     e_W: {3}          ogl: {1, 2, 3}
    _Ho: {2}        Web: {3}        ear: {1, 3}       oje: {2}
    _Pr: {2}        arc: {1, 3}     eb_: {3}          oog: {1, 2, 3}
    _Se: {1, 3}     b_S: {3}        ect: {2}          ost: {2}
    _We: {3}        ct_: {2}        gle: {1, 2, 3}    rch: {1, 3}
    Cod: {1, 2}     de_: {1, 2}     ing: {2}          roj: {2}
    Goo: {1, 2, 3}  e_C: {1, 2}     jec: {2}          sti: {2}
    Hos: {2}        e_P: {2}        le_: {1, 2, 3}    t_H: {2}
    Pro: {2}        e_S: {1}        ode: {1, 2}       tin: {2}

(The _ character serves here as a visible representation of a space.)

Given a regular expression such as /Google.*Search/, we can build a query of ANDs and ORs that gives the trigrams that must be present in any text matching the regular expression.
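Building a trigram index like the one above is a matter of sliding a three-character window over each document. A minimal sketch (the function names here are illustrative, not from the actual Code Search implementation):

```python
def trigrams(text):
    """All distinct length-3 substrings of text, spaces included."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the sorted list of document ids containing it."""
    index = {}
    for docid, text in docs.items():
        for t in trigrams(text):
            index.setdefault(t, []).append(docid)
    for posting in index.values():
        posting.sort()  # keep posting lists sorted for linear-time AND/OR
    return index

docs = {
    1: "Google Code Search",
    2: "Google Code Project Hosting",
    3: "Google Web Search",
}
index = build_index(docs)
```

Running this over the three example documents reproduces the table above: index["Goo"] is [1, 2, 3], index["ode"] is [1, 2], index["Web"] is [3], and the index has 36 distinct trigrams in total.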
In this case, the query is

    Goo AND oog AND ogl AND gle AND Sea AND ear AND arc AND rch

We can run this query against the trigram index to identify a set of candidate documents and then run the full regular expression search against only those documents.

The conversion to a query is not simply a matter of pulling out the text strings and turning them into AND expressions, although that is part of it. A regular expression using the | operator will lead to a query with OR terms, and parenthesized subexpressions complicate the process of turning strings into AND expressions. The full rules compute five results from each regular expression r: whether the empty string is a matching string, the exact set of matchin
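For the simple case shown above, where the required trigrams come from literal substrings of the pattern, the two-phase search can be sketched as follows. This handles only plain AND queries over literals, not the full conversion rules; names and structure are illustrative, not the actual Code Search code:

```python
import re

def trigrams(s):
    """All length-3 substrings of a literal string."""
    return [s[i:i + 3] for i in range(len(s) - 2)]

def candidates(index, literals, all_ids):
    """AND together the posting lists for every trigram of every literal."""
    ids = set(all_ids)
    for lit in literals:
        for t in trigrams(lit):
            ids &= set(index.get(t, []))
    return ids

docs = {
    1: "Google Code Search",
    2: "Google Code Project Hosting",
    3: "Google Web Search",
}
index = {}
for docid, text in docs.items():
    for t in set(trigrams(text)):
        index.setdefault(t, []).append(docid)

# /Google.*Search/ requires the literal substrings "Google" and "Search",
# which expand to: Goo AND oog AND ogl AND gle AND Sea AND ear AND arc AND rch.
cand = candidates(index, ["Google", "Search"], docs)

# Phase two: run the real regexp only over the candidate documents.
matches = sorted(d for d in cand if re.search(r"Google.*Search", docs[d]))
```

Here the trigram query narrows the three documents down to the candidate set {1, 3}, and the regexp pass confirms that both candidates actually match. The point of the index is that in a large corpus the candidate set is usually a tiny fraction of the documents, so the expensive regexp scan runs over very little text.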