Can sanitizers find the two bugs I wrote in C++?

A few days ago I published a short post about two bugs I wrote while developing the C++ external scanner for my TLA⁺ tree-sitter grammar. Reactions were mixed! Many people were supportive, but there were of course the usual drive-by claims by developers that the bugs were trivial, they would’ve found & fixed them inside of 20 minutes, and I was laughably incompetent for having written them in the first place. Maybe so! I’m a fan of formal methods primarily so I don’t have to be a genius to write correct code.

In that same vein of building tools to save us from ourselves, one user suggested building the tree-sitter grammar with the LLVM address & undefined behavior sanitizers enabled. I’d used valgrind a long time ago but had never played around with sanitizers. I was also doing some closely-associated work to build the grammar for fuzzing with LLVM’s libFuzzer, so it seemed a fun detour to check whether those sanitizers would have saved me days of debugging pain!

The build process

Unless you want to enable this on your own tree-sitter grammar you can skip this section.

Usually tree-sitter grammars are built automatically when running tree-sitter test or tree-sitter parse. However, the compilation flags are hardcoded in the CLI so I needed to write a script to build it with the flags I wanted. Thankfully tree-sitter grammars consist of only three relevant files: a gigantic generated src/parser.c file, a hand-written src/ file, and a header file src/tree_sitter/parser.h.
These are all compiled into a shared library which is loaded by the consuming program. It was easy enough to compile with clang, along with the -fsanitize=address,undefined -fno-omit-frame-pointer flags. Supposedly the environment variables TREE_SITTER_DIR or TREE_SITTER_LIBDIR can be used to set where tree-sitter CLI looks for the file, but I couldn’t get those to work so just copied it to the default location ~/.cache/tree-sitter/lib/ (as a funny aside, while writing this up in the build script I managed to instead create a directory named ~ so had to figure out how to delete that without nuking home).

Time to let it rip! I ran node_modules/.bin/tree-sitter test, and saw:

    Error opening dynamic library "/home/ahelwer/.cache/tree-sitter/lib/"
    Caused by:
        /home/ahelwer/.cache/tree-sitter/lib/ undefined symbol: __asan_report_store4

Alas! The tree-sitter CLI is a Rust program that dynamically loads the file, and it wasn’t loading ASAN. I added the tree-sitter repo itself as a submodule of my grammar and set to figuring this out. Unfortunately after bashing my head against it for over an hour I couldn’t figure out how to load ASAN from Rust; if you know, please answer this StackOverflow question!

Instead, I whipped up a quick C++ program that links against the tree-sitter C library directly and is itself compiled with sanitizers. This is unfortunate; the tree-sitter CLI contains some finicky logic for parsing & running the test corpus, which I am not going to replicate in C++, so my program only reads a TLA⁺ file, parses it, and prints out the parse tree. You can see the build script and C++ front end here.

Initial cleanup

Before I could get started replicating my bugs I had to fix all the other bugs the sanitizers found! One class of issue came from this line in the std::vector::data() docs:

    Notes: If size() is 0, data() may or may not return a null pointer.
Indeed my empty vector was returning a null pointer, which I then passed into memcpy as either the source or the destination. This only happened when the number of bytes to copy was also zero, so no harm done I thought. On the memcpy docs page though, we see:

    If either dest or src is an invalid or null pointer, the behavior is undefined, even if count is zero.

Well thank goodness it didn’t decide the vector should contain the entire contents of my address space again. Wrapping memcpy in an if-statement fixed the problems.

Surprisingly, that was the end of it! I ran my program against the entire tlaplus/examples test corpus without any additional issues popping up. Time to reproduce those bugs.

Bug #1: the perils of null-terminated strings

Recall this one happened because I passed a pointer into atoi(), and std::vector doesn’t null-terminate its data (why would it?). My hypothesis was this would be easily detected even when adjacent memory was unused, because the null byte atoi() reads to terminate the “string” is itself uninitialized memory. To fix this bug I had written my own version of atoi() to parse the raw_level char vector:

    level = 0;
    int32_t multiplier = 1;
    for (size_t i = 0; i < raw_level.size(); i++) {
      const size_t index = raw_level.size() - i - 1;
      const int8_t digit_value = - 48;
      level += digit_value * multiplier;
      multiplier *= 10;
    }

At this point in the code the contents of raw_level were all guaranteed to be ASCII digits from 0-9, so not too bad. I replaced it with our cursed line of code:

    level = atoi(;

It’s worth looking at how this bug manifests; we can parse the following TLA⁺ proof:

    ---- MODULE Proof ----
    THEOREM P ⇒ Q
    PROOF
    <1> P
    <1> Q
      PROOF BY P
    <1> QED
    ======================

Our buggy code is concerned with parsing the number in the <1> proof step IDs, which is the proof level. Proofs in TLA⁺ are hierarchical, with various steps themselves having sub-proofs to prove their correctness.
Ordinarily, this should give the following parse tree:

    (source_file
      (module
        (header_line)
        (identifier)
        (header_line)
        (theorem
          (bound_infix_op (identifier_ref) (implies) (identifier_ref))
          (non_terminal_proof
            (proof_step
              (proof_step_id (level) (name))
              (suffices_proof_step (identifier_ref)))
            (proof_step
              (proof_step_id (level) (name))
              (suffices_proof_step
                (identifier_ref)
                (terminal_proof (use_body (use_body_expr (identifier_ref))))))
            (qed_step (proof_step_id (level) (name)))))
        (double_line)))

However, about one in twenty times it generates a parse error because one of the proof steps is read as having a completely different level. At the time my initial thoughts were along two tracks: one, the logic in the newly-written proof step ID handler was quite complicated and an obvious source of possible bugs; two, the external scanner had previously exhibited nondeterministic behavior when incorrectly initializing its state during deserialization. Unfortunately both of these hypotheses were way off the mark, which led to a lot of frustration & wasted time.

Could sanitizers have come to the rescue? They should point us to line 275 in src/ . Let’s check!

    $ test/sanitize/out/parse_tlaplus Test.tla
    ==929719==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x602000000031 at pc 0x55ba5836b6f4 bp 0x7ffdd637d2b0 sp 0x7ffdd637ca70
    READ of size 2 at 0x602000000031 thread T0
        #0 0x55ba5836b6f3 in StrtolFixAndCheck(void*, char const*, char**, char*, int) asan_interceptors.cpp.o
        #1 0x55ba5836bbd4 in __interceptor_strtol (/home/ahelwer/src/tlaplus/tree-sitter-tlaplus/test/sanitize/out/parse_tlaplus+0xffbd4) (BuildId: 1061028f7f02d346004ffa4692ca74e9d92b5cad)
        #2 0x55ba5845e1f2 in atoi /usr/include/stdlib.h:364:16
        #3 0x55ba5845e1f2 in (anonymous namespace)::ProofStepId::ProofStepId(std::vector> const&) /home/ahelwer/src/tlaplus/tree-sitter-tlaplus/src/
        ...

Easy-peasy. Sanitizers would have led me right to it!
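
For anyone who wants to avoid this class of bug without hand-rolling a digit loop, here is a minimal sketch of one alternative, assuming a C++17 compiler. It is not the fix used in the scanner. std::from_chars takes an explicit end pointer, so it never goes looking for a terminating null byte. The name parse_level is hypothetical; raw_level is borrowed from the snippet above.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <system_error>
#include <vector>

// Hypothetical helper (not from the original scanner): parses a decimal
// number stored in a std::vector<char> that is NOT null-terminated.
std::optional<int32_t> parse_level(const std::vector<char>& raw_level) {
  if (raw_level.empty()) {
    return std::nullopt;  // data() may be null for an empty vector
  }
  const char* first = raw_level.data();
  const char* last = first + raw_level.size();
  int32_t level = 0;
  // std::from_chars reads only the range [first, last).
  const auto result = std::from_chars(first, last, level);
  // Reject partial parses (e.g. "1a"), non-numeric input, and overflow.
  if (result.ec != std::errc() || result.ptr != last) {
    return std::nullopt;
  }
  return level;
}
```

Calling parse_level on a vector holding the single character '1' yields 1, while an empty vector or one containing a stray letter yields std::nullopt instead of reading past the end of the buffer.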

Bug #2: undefined behavior creating a black hole

Recall this one resulted from calling std::vector::pop_back() on an empty vector, which is undefined behavior. The undefined behavior was given a somewhat hilarious definition by declaring the vector to then include the entire computer address space. The actual crash happened later on in a memcpy when I tried to serialize this impossibly-large vector, now part of the external scanner state.

The circumstances leading to calling pop_back() on an empty vector resulted from insufficiently hardening my external scanner against invalid syntax. Tree-sitter grammars are designed to be error-tolerant so they can keep functioning even while the user is in the middle of typing some code. This extends to your external scanner code, so you have to be careful about making assumptions about parser state when you encounter a given keyword. In this case the bug came from this function