getJsonToken assumes underlying string is null-terminated but requires end pointer

Brotcrunsher commented at 7:32 pm on August 11, 2023: contributor

Is there an existing issue for this?

I have searched the existing issues

Current behaviour

When a function requires a const char *end, one might assume that the string doesn’t need to be null terminated. getJsonToken requires this parameter:

0enum jtokentype getJsonToken(std::string& tokenVal, unsigned int& consumed,
1                            const char *raw, const char *end)

However, the function does assume that the underlying memory block is null-terminated as is visible here:

0        if (!strncmp(raw, "null", 4)) {
1            //...
2        } else if (!strncmp(raw, "true", 4)) {
3            //...
4        } else if (!strncmp(raw, "false", 5)) {
5            //...

and here:

0if ((*firstDigit == '0') && json_isdigit(firstDigit[1]))

and possibly others.

If the string isn’t null terminated, then we run risk of reading uninitialized memory, which could lead to a crash.

In practice this currently does not seem to lead to any issues, as all usages that I could find used underlying null-terminated blocks. It is however an uncomfortable pitfall for future development.

Expected behaviour

The function should either make sure to never access anything beyond or at the end pointer, or should document this requirement properly.

Steps to reproduce

0char innocentChar = 'n';
1getJsonToken(/*...*/, &innocentChar, (&innocentChar) + 1);

Relevant log output

No response

How did you obtain Bitcoin Core

Other

What version of Bitcoin Core are you using?

master@3654d84

Operating system and version

Common code issue, os irrelevant

Machine specifications

No response

Brotcrunsher renamed this:
~~getJsonToken assumes underlying string is null terminated~~
getJsonToken assumes underlying string is null-terminated but requires end pointer
on Aug 11, 2023

willcl-ark added the label Utils/log/libs on Jan 24, 2024

daniel-btc-nym commented at 2:38 am on February 2, 2024: none

I’m working on it. Plan is to harden the function to not read at end and after. I have some initial small draft changes, and want to add testing. Maybe, we can even highlight the issue via additional fuzz+msan tests.

maflcko commented at 12:09 pm on February 5, 2024: member

I guess this can be fixed by using std::string_view::starts_with

daniel-btc-nym commented at 3:18 am on February 8, 2024: none

The bug here isn’t actually that bad:

It uses strncmp(…, 4) and strncmp(…, 5) respectively: so, in worst case it will read 3 or 5 bytes after end.
Similar for the number parsing.

In any case, I have a change that fixes both (adding guards for reading past end, and making the code pretty / easy to read again). I have unit tests for the keywords (true, false, null) but want to add some for the numbers.

Haven’t made any progress on MSAN. But if we would poison data around begin and end, we could write some exhaustive tests for small inputs. If we restrict to (printable characters, or these mentioned in the code), we could then be pretty certain the new code is correct.

Touching code is always a risk, so I am trying to minimize it.

FWIW, my code so far: https://github.com/bitcoin/bitcoin/commit/b77afc50b14ea1af61ea3b32e75d2aa974b07844

maflcko commented at 8:37 am on February 8, 2024: member

Haven’t made any progress on MSAN.

You can copy the bytes into a vector first and call shrink_to_fit, which should be enough to tell msan where the data really ends?

daniel-btc-nym commented at 1:22 pm on February 8, 2024: none

I still need to look into how MSAN is integrated into the current unit test suite. Questions like: how can I run the unit tests I added under MSAN? Are all unit tests run under MSAN with a certain make target … the basics. I am very new to this repository here :)

maflcko commented at 1:25 pm on February 8, 2024: member

It is probably not worth it to try to setup msan, as it requires building llvm from scratch.

It is easier to just use valgrind. For example: valgrind ./src/test/test_bitcoin -t vector_tests

daniel-btc-nym commented at 3:40 am on February 9, 2024: none

Yeah, I ran some valgrind tests before and after my changes and it clearly showed the “uninitialized” accesses when we copy the data first into a tight vector as you suggested.

I added more test coverage and think this should be enough for this issue.

What do you think about the changes?

If they look OK, I can merge them into a single commit – (I suppose by forking again?) – and then making a pull-request?

maflcko commented at 8:45 am on February 9, 2024: member

I like bugfixes the most, when they are touching only a single line of code. But if more changes are required, then that is fine, too.

I checked that this is only a theoretical issue in Bitcoin Core, and can not be hit in production code, but it may still be good to fix it.

It may be good to update the fuzz target to catch the issue here: src/test/fuzz/parse_univalue.cpp

daniel-btc-nym commented at 4:50 pm on February 17, 2024: none

I like bugfixes the most, when they are touching only a single line of code. But if more changes are required, then that is fine, too.

My changes have a few orthogonal, but related fixes and test additions. We could make different pull-requests for them. Let me know what you prefer.

For example:

Check strncmp will not go out of bound. This is lines 99-115 in univalue_read.cpp ; unit tests that call no_reading_beyond_end_test
Change parsing of numbers, where there were two places that might read past the end. This is lines 126-143 in univalue_read.cpp ; unit test that call: parsing_numbers_beyond_end_test.
Added exhaustive_short_string_test unit test, which was previously causing errors when run in valgrind; now runs cleanly.
Added a couple more unit tests such as parsing the empty buffer.
Fixed: fail18.json which is a test where the nesting is too deep.
Added pass5.json which is a test where the nesting is just shallow enough to pass.

It may be good to update the fuzz target to catch the issue here: src/test/fuzz/parse_univalue.cpp

I added it in, and ran the fuzzer using HONGGFUZZ. Since this isn’t a “runaway” overflow, we don’t get any crashes. When running the entire fuzzer under valgrind, we seem to get all sorts of (seemingly unrelated) problems.

I thus added a mini-exhaustive test that runs quickly and cleanly under valgrind now, but didn’t before.

A few more questions:

Are we adhering to any line lengths conventions?
Are we adhering to conventions for variable names and function names?
Do we have an auto-formatter that takes care of basic formatting such as where open-braces go, how many spaces to indent, and maybe even the line length?
My changes do change the semantics for some corner-cases where we were reading past the end of the buffer before: For example, previously: char buf[] = "true"; getJsonToken(tokenVal, consumed, buf, buf + 1); would parse as JTOK_KW_TRUE and now returns JTOK_ERR. There are also the opposite: previously, char buf[] = "01"; getJsonToken(tokenVal, consumed, buf, buf + 1); would error since we do not allow leading 0s, but now, this would parse as “0”.
Do we have some place where we could dump outputs from valgrind that I could link to in the commit messages or pull-request explanation? I would like to show before / after? Or basically, anywhere that creates durable data that can easily be linked to?
What do you suggest my next steps would be? I am fairly happy with the code.

maflcko commented at 9:30 am on February 19, 2024: member