Friday, September 3, 2010

My Thanks to the Profiler

I recently had an issue at work with one of our kernel mode file system filters for Windows.   The CPU was spending way too much time executing our drivers code.  I ran the kernrate sampling profiler tool from the Windows Driver Kit on our driver to see what code was taking so long to execute.  The profiler revealed that over 50% of the time that our driver was executing in a function named RtlUnicodeStStri.  This is a function that I wrote that is similar to the strstr function in the C standard library with the exception that it is case-insensitive and it works on UNICODE_STRING structures. In the Windows kernel, UNICODE_STRING structures are used in place of character arrays.  The structure is defined like this:

typedef struct _UNICODE_STRING
{
    USHORT Length;
    USHORT MaximumLength;
    PWCH Buffer;
} UNICODE_STRING, *PUNICODE_STRING;   


Windows includes type definitions of USHORT as unsigned short and PWCH as wchar_t*. When using UNICODE_STRING structures you have the length of the string (in bytes) and the maximum length (i.e. the buffer size) of the string (in bytes).  This provides many benefits: you can safely detect if any operation will overrun your string buffer, you can compare lengths before comparing strings, etc…  The strings stored in the Buffer member of the UNICODE_STRING structure are not guaranteed to be NULL terminated and many times they are not.  Thus you can not use the standard C library functions on the Buffer member.
Although the Windows kernel internally uses the UNICODE_STRING, there is not a very comprehensive library for handling UNICODE_STRINGS (The string handling functions are in the Rtl library).  This causes many developers like myself to write functions like RtlUnicodeStrStri.  Below is the original version of RtlUnicodeStrStri that I had written. 
 1 __drv_maxIRQL(APC_LEVEL)
 2 NTSTATUS
 3 RtlUnicodeStrStri(
 4     __in PCUNICODE_STRING  Str,
 5     __in PCUNICODE_STRING  SubStr,
 6     __out_opt PUNICODE_STRING  Result,
 7     __out_opt int* MatchStartIndex
 8     )
 9 {
10     USHORT l1=0, l2=0;
11     USHORT start = 0;
12     
13     //  Translate all counts to character counts.
14     const USHORT StrCharLen = Str->Length / sizeof(WCHAR);
15     const USHORT SubStrCharLen = SubStr->Length / sizeof(WCHAR);
16 
17     ASSERT_VALID_STRING(Str);
18     ASSERT_VALID_STRING(SubStr);
19     
20     while (start < StrCharLen)
21     {
22         for (l1=start,l2=0; l1<StrCharLen && l2<SubStrCharLen; ++l1, ++l2)
23         {
24             if (RtlUpcaseUnicodeChar(Str->Buffer[l1]) != 
25                 RtlUpcaseUnicodeChar(SubStr->Buffer[l2]))
26             {
27                 break;
28             }
29         }
30     }
31 
32     //    other code removed...
33 
34 } // RtlUnicodeStrStri

At first glance, I thought that the function was fairly straightforward.  Then I noticed one particular optimization that should have been made.  In the worst case, when the sub-string is not present, the algorithm continues to look through the entire source string even when the remaining length of the source string is too short to hold the sub-string.  This is an optimization that can not be made in the C version of strstr because you have no idea in advance how long the source string.  The full code is given below.


 1 __drv_maxIRQL(APC_LEVEL)
 2 NTSTATUS
 3 RtlUnicodeStrStri(
 4     __in PCUNICODE_STRING  Str,
 5     __in PCUNICODE_STRING  SubStr,
 6     __out_opt PUNICODE_STRING  Result,
 7     __out_opt int* MatchStartIndex
 8     )
 9 /*++
10 
11 Routine Description:
12 
13     This routine search Str for the first occurrence of the string SubStr
14     in a case insensitive manner.  If Result is provided and the search is
15     successful, it will be filled in with the string starting at the matched 
16     sub-string.
17 
18 Arguments:
19 
20     Str - The string to search.
21     
22     SubStr - The string to search for.
23 
24     Result - Result of the search starting at the sub-string and 
25              continuing through the rest of Str. This structure 
26              justs points to the string in Str.
27 
28     MatchStartIndex - The starting index within Str at which SubStr
29                       appears.                                   
30 
31 Return Value:
32 
33     STATUS_SUCCESS if SubStr was found in Str, 
34     STATUS_OBJECT_NAME_NOT_FOUND otherwise.
35 
36 Unit test:
37     
38     UnitTest4() via RtlWStrStri
39 
40 --*/
41 {
42     USHORT l1=0, l2=0;
43     USHORT start = 0;
44     
45     //  Translate all counts to character counts.
46     const USHORT StrCharLen = Str->Length / sizeof(WCHAR);
47     const USHORT SubStrCharLen = SubStr->Length / sizeof(WCHAR);
48 
49     ASSERT_VALID_STRING(Str);
50     ASSERT_VALID_STRING(SubStr);
51     
52     // Run the loop while the length of the sub-string (SubStr) is less
53     // than the remaining length of the input string being searched (Str). 
54     while (start < StrCharLen && (SubStrCharLen <= (StrCharLen - start)))
55     {
56         for (l1=start,l2=0; l1<StrCharLen && l2<SubStrCharLen; ++l1, ++l2)
57         {
58             if (RtlUpcaseUnicodeChar(Str->Buffer[l1]) != 
59                 RtlUpcaseUnicodeChar(SubStr->Buffer[l2]))
60             {
61                 break;
62             }
63         }
64 
65         if (l2 == SubStrCharLen)
66         {
67             if (ARGUMENT_PRESENT(Result))
68             {
69                 //  Translate start back to byte count.
70                 l1 = start * sizeof(WCHAR);
71 
72                 Result->Length = Str->Length - l1;
73                 Result->MaximumLength = Str->MaximumLength - l1;
74                 Result->Buffer = &Str->Buffer[start];
75             }
76 
77             if (ARGUMENT_PRESENT(MatchStartIndex))
78             {
79                 *MatchStartIndex = start;
80             }
81 
82             return STATUS_SUCCESS;
83         }
84 
85         ++start;
86     }
87 
88     return STATUS_OBJECT_NAME_NOT_FOUND;
89 
90 } // RtlUnicodeStrStri

This optimization dropped our drivers execution time dramatically, to an acceptable level, though RtlUnicodeStrStri was still responsible for about %30 of our drivers execution time.  I tried to optimize the calls to RtlUpcaseUnicodeChar (A Windows API) by doing quick translations for English ASCII codes (including ignoring chars that were already up-cased).  It turns out that RtlUpcaseUnicodeChar is very very good and any optimization attempt I made just made the driver’s execution time worse.

It turns out that our programmer’s (including myself) just got lazy and overused RtlUnicodeStrStri.  I mean, why parse a string and do a compare on the part of interest when you can just call RtlUnicodeStrStri?  So I wrote many more UNICODE_STRING handling functions like RtlStringEndsWithSuffix that can replace many calls to RtlUnicodeStrStri and more clearly capture the programmer’s intention in the code.

No comments:

Post a Comment