Sunday, February 19, 2017

Safer S...

I want, of course, to talk about "Safer Strings" today.


TL;DR: Add /DwxNO_UNSAFE_WXSTRING_CONV=1 to your compiler options today.


wxWidgets has had implicit conversion of wxString to const char* since the dawn of time (or about 1992, at any rate). This was always dangerous, as it allowed someone to accidentally write:

void show_and_free(const char* p) { ...; free(p); }
wxString s("...");
show_and_free(s);
with catastrophic consequences, but such situations were relatively rare and it was thought that the convenience of having this implicit conversion overweighted the dangers. This is also why when we added Unicode support later, we also added implicit conversion to const wchar_t* and, when we added "STL" build mode, in which interoperability with the standard library is increased further even at the price of backwards compatibility, we added implicit conversions to std::string and std::wstring as well.

Unfortunately, with the merge of ANSI and Unicode build modes in wxWidgets 3, another, much more dangerous, problem has appeared because in the new combined mode we can now have a string containing Unicode characters not representable in the current locale encoding. And converting such strings to either char* or std::string inevitably results in a loss of data in this case, e.g.

double convert_temperature_to_celsius(const char* p) {
    const char* end;
    double t = strtod(p, &end);
    return 5.*(t - 32)/9.;
}

wxString s = wxGetTextFromUser("Enter temperature");
convert_temperature_to_celsius(s);
could, confusingly, result in always returning -17.77777, corresponding to 0°F, if the user decided to terminate the temperature entry with "°F" to explicitly indicate the scale used and the current encoding couldn't represent the degree symbol (which is the case of e.g. CP1250 under Microsoft Windows). In this case, conversion of wxString to char* would fail and p would be just empty.

Of course, this wouldn't happen if the code just used wxString::ToDouble() directly, or used wxChar and wxStrtod(), or used UTF-8, capable of representing any Unicode character, as encoding (which is practically always the case under Unix systems nowadays). So there are a lot of ways to write this code correctly, but, unfortunately, it was still too simple to write it wrongly accidentally lose the data entered by the user in this case. Clearly, implicit conversions potentially losing data are a bad idea, but we couldn't just turn them off in wxWidgets 3, as it would have broken almost all the existing programs which, empirically, all used these conversions in many places.

For the same reason, we still won't be able to turn this conversion off by default, even in wxWidgets 3.2. However now we at least provide a way to opt-in into safer behaviour. The arguably less interesting part of the changes is that you can now change the value of the compile-time wxUSE_UNSAFE_WXSTRING_CONV option when building the library. It is set to 1 by default, for compatibility, but if you build wxWidgets for the use in your own project, you are strongly advised to set it to 0 to permanently disable the unsafe, in the sense described above, implicit conversions.

Many people, however, don't build their own library, but use the one provided by their package manager under Unix/macOS or download our MSW binaries. These official binaries will continue to provide the unsafe conversions for compatibility, but you can define wxNO_UNSAFE_WXSTRING_CONV when building your own project to disable their use in your code without rebuilding the library. This symbol can be just #define'd before including any wxWidgets headers, but it is better to define it globally, in the compiler options in your make- or project file: just add /DwxNO_UNSAFE_WXSTRING_CONV=1 to it. And the main point of this long post is to convince you that you NEED TO DO just that: please define wxNO_UNSAFE_WXSTRING_CONV for your code and fix the resulting compilation errors to ensure that you don't lose any data entered by the user.

Fixing the compilation errors will, generally speaking, involve doing one of two things:

  • Either stop using char* (or std::string in the STL build) entirely and use wxString directly.
  • Or convert it to wchar_t* (or std::wstring) or convert wxString to UTF-8 encoding which will never lose data, using methods such as utf8_str(), which is a convenient synonym for mb_str(wxConvUTF8), or ToStdString(wxConvUTF8).
Of course, if you really need to use the current locale encoding, e.g. because you call a C standard library function using it, you will still need to perform the conversion to it, using just plain mb_str() and there will still be a possibility of the conversion to it failing, but at least now it won't happen implicitly.

Thanks for reading all this and, if you jumped to the end, hoping to quickly find the conclusion instead of reading this wall of text, please see the conclusion in the very beginning!