Tuesday, August 15, 2017

Story of One Exception or This is How We Have to Debug Other People's Code

by Paul Eremeev and Sergey Vasiliev 
Using third-party libraries lets you get the functionality you want without spending time on developing the corresponding logic. Take it and use it! Of course, this approach is not all merits; it has a "dark" side as well. One of the problems inherent in using third-party libraries is the lack of control over what goes on inside them. It all started with a user who wrote to us about an unhandled exception that appeared when checking a C# project...
Before moving on to the debriefing, one has to understand, at least roughly, how PVS-Studio, Roslyn and MSBuild interact. In short: to open C++ and C# projects, PVS-Studio uses MSBuild libraries, and for C# projects it additionally uses Roslyn libraries (which, in turn, use MSBuild as well). If you would like more details about the interaction of these components, you can find them in the article: "Support of Visual Studio 2017 and Roslyn 2.0 in PVS-Studio: sometimes it's not that easy to use ready-made solutions as it may seem".
As I've already said, it all started with a crash with this error message:
Unhandled Exception: System.IO.FileNotFoundException: Could not load 
file or assembly 'Microsoft.Build.Framework, Version=15.1.0.0, 
Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a' or one of its 
dependencies. The system cannot find the file specified. ---> 
System.IO.FileNotFoundException: Could not load file or assembly
'Microsoft.Build.Framework, Version=4.0.0.0, Culture=neutral,
PublicKeyToken=b03f5f7f11d50a3a' or one of its dependencies. The 
system cannot find the file specified.
The most interesting thing here is that PVS-Studio is not mentioned in the error message, which made us think that we were not directly related to the crash. That's good. But PVS-Studio still crashes when checking the project, which is bad. Moreover, the exception was thrown in such a way that it couldn't be caught and handled properly on our side. So we had to investigate and dig deep, for instance by setting up debug projects to explore the source code of the Roslyn and MSBuild libraries.
It was clear at once that the exception was coming from within Roslyn while it was trying to open a project, but the mention of Antlr4.Build.Tasks suggested that the problem was more specific.
Gradually, by exploring the code and making our way through the wilds of third-party dependencies connected with PVS-Studio, we reconstructed this call chain:
PVS-Studio -> Roslyn -> MSBuild -> Antlr4
We should explain how Antlr4 dependencies got here if PVS-Studio does not use them. The project file whose parsing triggered the exception uses the Antlr4 code generator task, which is a user-defined MSBuild task. Such a task can be imported into any standard MSBuild project and called as a separate build step. Thus, to open the project, we find ourselves tied not only to Roslyn and MSBuild but also to whatever third-party components the build requires. Since Roslyn, in addition to evaluating the project, runs various build steps to open it, code from Antlr4 is also executed inside the process of our analyzer.
The source code of the task that throws the exception is available as open source on GitHub. This allowed us to rebuild the task's .dll with debug symbols and see what was going on inside.
The most interesting (and useful for us) finding was the way in which an external process is launched: this happened in another application domain. While instantiating a type containing a method for processing the stdout of that external process, the task requested the Microsoft.Build.Framework.dll library. Here comes the most interesting part. This library is already loaded into the process of our analyzer - it is needed both by us (for parsing MSBuild files) and by Roslyn. However, as you may remember, we are now in another AppDomain, and the library has not been loaded into it yet. The crash occurs while attempting to load this very library.
An attentive reader may immediately ask: why is there a problem with loading this library now, if it loaded just fine in the main AppDomain? The answer lies in how this AppDomain was created. The directory passed as the base one when the domain was created was the directory containing the task's dll, not the directory with the PVS-Studio_Cmd.exe executable. This, in turn, changed the behavior of the Fusion subsystem, which is responsible for finding and loading dependent .NET assemblies. Fusion searches for an assembly in the application base directory (by default, the directory of the executable file, which is usually an exe) and in the Global Assembly Cache (GAC). Creating an AppDomain with a different base directory made Fusion look for the dll there rather than in the directory of PVS-Studio_Cmd.exe. It is worth noting that the PVS-Studio distribution includes a significant number of MSBuild libraries (including the one we are talking about) - the reasons for this can also be found in the article about our support of Visual Studio 2017 mentioned above. At the same time, starting with MSBuild version 15, its libraries are no longer registered in the GAC. These two factors (the absence of the library from the GAC and the changed AppDomain base directory, in which Fusion looks for dependencies) led to the crash of the program.
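The effect of the base directory can be sketched in a few lines. This is a minimal illustration for .NET Framework (AppDomain.CreateDomain is not supported on .NET Core and later), not the actual code of the Antlr4 task: the class name, the "TaskDomain" name and the use of the temp folder as a stand-in for the directory with the task's dll are assumptions made for the example.

```csharp
using System;
using System.IO;

public static class AppDomainBaseDemo
{
    // Creates a secondary AppDomain rooted at the given directory (a hypothetical
    // stand-in for the folder with the task's dll) and returns the directory
    // from which assembly probing starts in that domain.
    public static string GetProbingBase(string taskDirectory)
    {
        var setup = new AppDomainSetup { ApplicationBase = taskDirectory };
        AppDomain domain = AppDomain.CreateDomain("TaskDomain", null, setup);
        try
        {
            // BaseDirectory is what the assembly resolver uses for this domain.
            return domain.BaseDirectory;
        }
        finally
        {
            AppDomain.Unload(domain);
        }
    }

    public static void Main()
    {
        string taskDirectory = Path.GetTempPath();
        Console.WriteLine("Default domain base: " + AppDomain.CurrentDomain.BaseDirectory);
        Console.WriteLine("Task domain base:    " + GetProbingBase(taskDirectory));
    }
}
```

The two printed directories differ, so an assembly that sits next to the executable and is absent from the GAC is not found by default probing from the second domain.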
As the exception occurred in another AppDomain, it is not possible to catch and handle it in the analyzer's application domain. More precisely, it is possible, but the application will still crash. The most we can do is make the crash softer, for example, by reporting the source of the problem (which is what we did).
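A "softer crash" of this kind might look roughly like the sketch below. It is not the PVS-Studio implementation: the class, the method names and the message texts are hypothetical. It relies only on the standard AppDomain.UnhandledException event, which notifies the default domain about the exception but does not stop the process from terminating.

```csharp
using System;
using System.IO;

public static class SoftCrash
{
    // Builds a short report; names the assembly that failed to load, if known.
    // The wording of the messages is an assumption made for this example.
    public static string Describe(object exceptionObject)
    {
        var ex = exceptionObject as Exception;
        if (ex == null)
            return "Analysis aborted by an unhandled non-exception object.";

        // Walk the InnerException chain looking for a failed file or assembly load.
        for (Exception cur = ex; cur != null; cur = cur.InnerException)
        {
            var notFound = cur as FileNotFoundException;
            if (notFound != null && !string.IsNullOrEmpty(notFound.FileName))
                return "Analysis aborted: a build task could not load '" + notFound.FileName + "'.";
        }
        return "Analysis aborted by an unhandled exception: " + ex.Message;
    }

    // Subscribes in the current (default) domain. The handler only reports;
    // the runtime still terminates the process afterwards.
    public static void Install()
    {
        AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
            Console.Error.WriteLine(Describe(e.ExceptionObject));
    }

    public static void Main()
    {
        Install();
        var sample = new FileNotFoundException("Could not load file or assembly.", "Microsoft.Build.Framework");
        Console.WriteLine(Describe(sample));
    }
}
```

Keeping the message construction in a separate method makes it possible to check the report without actually crashing a process.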
By the way, the same error occurs in the Microsoft Visual Studio 2017 development environment when trying to build this project. Most likely, the bug had existed in the Antlr4 task for a while but went unnoticed, because previous versions of MSBuild registered their libraries in the GAC, and Fusion found them there. At the moment, the bug is fixed in the Antlr4 repository. Do you know how the Antlr4 task developers decided to fix this problem? They stopped creating a separate AppDomain and now do all the necessary operations in the default one. Why the logic had to be complicated in the first place is an open question.
Overall, it turned out to be very ironic. The AppDomain technology, created to increase the fault tolerance of applications by providing isolated areas of execution, eventually left us unable to protect ourselves in this situation. Of course, the same could happen with an unhandled exception in a thread created by a third-party library, even when this thread is created in the default domain.
Anyway, we can learn several lessons from this situation:
  • There is no need to reinvent the wheel. The more complex the code, the more likely you are to make a mistake.
  • Be ready to pay a certain price for using third-party libraries. Sometimes it can be quite high.
  • Even if the problem is not in your code but in one of the dependencies, it automatically becomes your headache once it reveals itself.
  • The world is not perfect - unfortunately, some things are out of our control. For example, exceptions thrown from other application domains or threads.
