Programming Bits: My Thanks to the Profiler

I recently had an issue at work with one of our kernel mode file system filters for Windows.   The CPU was spending way too much time executing our drivers code. I ran the kernrate sampling profiler tool from the Windows Driver Kit on our driver to see what code was taking so long to execute. The profiler revealed that over 50% of the time that our driver was executing in a function named RtlUnicodeStStri. This is a function that I wrote that is similar to the strstr function in the C standard library with the exception that it is case-insensitive and it works on UNICODE_STRING structures. In the Windows kernel, UNICODE_STRING structures are used in place of character arrays. The structure is defined like this:

typedef struct _UNICODE_STRING
{
    USHORT Length;
    USHORT MaximumLength;
    PWCH Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

Windows includes type definitions of USHORT as unsigned short and PWCH as wchar_t*. When using UNICODE_STRING structures you have the length of the string (in bytes) and the maximum length (i.e. the buffer size) of the string (in bytes). This provides many benefits: you can safely detect if any operation will overrun your string buffer, you can compare lengths before comparing strings, etc… The strings stored in the Buffer member of the UNICODE_STRING structure are not guaranteed to be NULL terminated and many times they are not. Thus you can not use the standard C library functions on the Buffer member.
Although the Windows kernel internally uses the UNICODE_STRING, there is not a very comprehensive library for handling UNICODE_STRINGS (The string handling functions are in the Rtl library). This causes many developers like myself to write functions like RtlUnicodeStrStri. Below is the original version of RtlUnicodeStrStri that I had written.

 1 __drv_maxIRQL(APC_LEVEL)
 2 NTSTATUS
 3 RtlUnicodeStrStri(
 4     __in PCUNICODE_STRING  Str,
 5     __in PCUNICODE_STRING  SubStr,
 6     __out_opt PUNICODE_STRING  Result,
 7     __out_opt int* MatchStartIndex
 8     )
 9 {
10     USHORT l1=0, l2=0;
11     USHORT start = 0;
12     
13     //  Translate all counts to character counts.
14     const USHORT StrCharLen = Str->Length / sizeof(WCHAR);
15     const USHORT SubStrCharLen = SubStr->Length / sizeof(WCHAR);
16 
17     ASSERT_VALID_STRING(Str);
18     ASSERT_VALID_STRING(SubStr);
19     
20     while (start < StrCharLen)
21     {
22         for (l1=start,l2=0; l1<StrCharLen && l2<SubStrCharLen; ++l1, ++l2)
23         {
24             if (RtlUpcaseUnicodeChar(Str->Buffer[l1]) != 
25                 RtlUpcaseUnicodeChar(SubStr->Buffer[l2]))
26             {
27                 break;
28             }
29         }
30     }
31 
32     //    other code removed...
33 
34 } // RtlUnicodeStrStri

At first glance, I thought that the function was fairly straightforward. Then I noticed one particular optimization that should have been made. In the worst case, when the sub-string is not present, the algorithm continues to look through the entire source string even when the remaining length of the source string is too short to hold the sub-string. This is an optimization that can not be made in the C version of strstr because you have no idea in advance how long the source string. The full code is given below.

 1 __drv_maxIRQL(APC_LEVEL)
 2 NTSTATUS
 3 RtlUnicodeStrStri(
 4     __in PCUNICODE_STRING  Str,
 5     __in PCUNICODE_STRING  SubStr,
 6     __out_opt PUNICODE_STRING  Result,
 7     __out_opt int* MatchStartIndex
 8     )
 9 /*++
10 
11 Routine Description:
12 
13     This routine search Str for the first occurrence of the string SubStr
14     in a case insensitive manner.  If Result is provided and the search is
15     successful, it will be filled in with the string starting at the matched 
16     sub-string.
17 
18 Arguments:
19 
20     Str - The string to search.
21     
22     SubStr - The string to search for.
23 
24     Result - Result of the search starting at the sub-string and 
25              continuing through the rest of Str. This structure 
26              justs points to the string in Str.
27 
28     MatchStartIndex - The starting index within Str at which SubStr
29                       appears.                                   
30 
31 Return Value:
32 
33     STATUS_SUCCESS if SubStr was found in Str, 
34     STATUS_OBJECT_NAME_NOT_FOUND otherwise.
35 
36 Unit test:
37     
38     UnitTest4() via RtlWStrStri
39 
40 --*/
41 {
42     USHORT l1=0, l2=0;
43     USHORT start = 0;
44     
45     //  Translate all counts to character counts.
46     const USHORT StrCharLen = Str->Length / sizeof(WCHAR);
47     const USHORT SubStrCharLen = SubStr->Length / sizeof(WCHAR);
48 
49     ASSERT_VALID_STRING(Str);
50     ASSERT_VALID_STRING(SubStr);
51     
52     // Run the loop while the length of the sub-string (SubStr) is less
53     // than the remaining length of the input string being searched (Str). 
54     while (start < StrCharLen && (SubStrCharLen <= (StrCharLen - start)))
55     {
56         for (l1=start,l2=0; l1<StrCharLen && l2<SubStrCharLen; ++l1, ++l2)
57         {
58             if (RtlUpcaseUnicodeChar(Str->Buffer[l1]) != 
59                 RtlUpcaseUnicodeChar(SubStr->Buffer[l2]))
60             {
61                 break;
62             }
63         }
64 
65         if (l2 == SubStrCharLen)
66         {
67             if (ARGUMENT_PRESENT(Result))
68             {
69                 //  Translate start back to byte count.
70                 l1 = start * sizeof(WCHAR);
71 
72                 Result->Length = Str->Length - l1;
73                 Result->MaximumLength = Str->MaximumLength - l1;
74                 Result->Buffer = &Str->Buffer[start];
75             }
76 
77             if (ARGUMENT_PRESENT(MatchStartIndex))
78             {
79                 *MatchStartIndex = start;
80             }
81 
82             return STATUS_SUCCESS;
83         }
84 
85         ++start;
86     }
87 
88     return STATUS_OBJECT_NAME_NOT_FOUND;
89 
90 } // RtlUnicodeStrStri

This optimization dropped our drivers execution time dramatically, to an acceptable level, though RtlUnicodeStrStri was still responsible for about %30 of our drivers execution time. I tried to optimize the calls to RtlUpcaseUnicodeChar (A Windows API) by doing quick translations for English ASCII codes (including ignoring chars that were already up-cased). It turns out that RtlUpcaseUnicodeChar is very very good and any optimization attempt I made just made the driver’s execution time worse.

It turns out that our programmer’s (including myself) just got lazy and overused RtlUnicodeStrStri. I mean, why parse a string and do a compare on the part of interest when you can just call RtlUnicodeStrStri? So I wrote many more UNICODE_STRING handling functions like RtlStringEndsWithSuffix that can replace many calls to RtlUnicodeStrStri and more clearly capture the programmer’s intention in the code.

Programming Bits

Friday, September 3, 2010

My Thanks to the Profiler

No comments:

Post a Comment

Search This Blog

Labels

Blog Archive

Followers

About Me