Browse code

Loosen restrictions on embedded file identification

In regression testing against a large sample set, I found that strictly
disallowing any embedded file identification if any previous layer was
an embedded file resulted in missed detections.

Specifically, I found an MSEXE file which has an embedded RAR, which in
turn had another MSEXE that itself had an embedded 7ZIP containing... malware.
sha256: c3cf573fd3d1568348506bf6dd6152d99368a7dc19037d135d5903bc1958ea85

To make it so ClamAV can extract all that, we must loosen the
restriction and allow prior layers to be embedded, just not the current
layer.

I've also added some logic to prevent attempting to extract an object at
the same offset twice. The `fpt->offset`s appear in order, but if you
have multiple file type magic signatures match on the same address, like
maybe you accidentally load two different .ftm files, then you'd get
duplicates and double-extraction.
As a bonus, I found I could also skip over offsets within a valid ZIP,
ARJ, and CAB since we now have the length of those from the header check
and as I know we don't want to extract embedded contents in that way.

Val S. authored on 2025/10/11 11:45:52
Showing 1 changed files
... ...
@@ -3670,6 +3670,8 @@ static cl_error_t scanraw(cli_ctx *ctx, cli_file_t type, uint8_t typercg, cli_fi
3670 3670
     // In allmatch-mode, ret will never be CL_VIRUS, so ret may be used exclusively for file type detection and for terminal errors.
3671 3671
     // When not in allmatch-mode, it's more important to return right away if ret is CL_VIRUS, so we don't care if file type matches were found.
3672 3672
     if (ret >= CL_TYPENO) {
3673
+        size_t last_offset = 0;
3674
+
3673 3675
         // Matched 1+ file type signatures. Handle them.
3674 3676
         found_type = (cli_file_t)ret;
3675 3677
 
... ...
@@ -3678,11 +3680,16 @@ static cl_error_t scanraw(cli_ctx *ctx, cli_file_t type, uint8_t typercg, cli_fi
3678 3678
         fpt = ftoffset;
3679 3679
 
3680 3680
         while (fpt) {
3681
-            if (fpt->offset > 0) {
3681
+            if ((fpt->offset > 0) &&
3682
+                // Only handle each offset once to prevent duplicate processing like if two signatures are found at the same offset.
3683
+                ((size_t)fpt->offset > last_offset)) {
3684
+
3682 3685
                 bool type_has_been_handled = true;
3683 3686
                 bool ancestor_was_embedded = false;
3684 3687
                 size_t i;
3685 3688
 
3689
+                last_offset = (size_t)fpt->offset;
3690
+
3686 3691
                 /*
3687 3692
                  * First, use "embedded type recognition" to identify a file's actual type.
3688 3693
                  * (a.k.a. not embedded files, but file type detection corrections)
... ...
@@ -3874,83 +3881,9 @@ static cl_error_t scanraw(cli_ctx *ctx, cli_file_t type, uint8_t typercg, cli_fi
3874 3874
                 }
3875 3875
 
3876 3876
                 /*
3877
-                 * Only scan embedded files if we are not already in an embedded context.
3878
-                 * That is, if this or a previous layer was identified with embedded file type recognition, then we do
3879
-                 * not scan for embedded files again.
3880
-                 *
3881
-                 * This restriction will prevent detecting the same embedded content more than once when recursing with
3882
-                 * embedded file type recognition deeper within the same buffer.
3883
-                 *
3884
-                 * This is necessary because we have no way of knowing the length of a file for many formats and cannot
3885
-                 * prevent a search for embedded files from finding the same embedded content multiple times (like a LOT
3886
-                 * of times).
3887
-                 *
3888
-                 * E.g. if the file is like this:
3889
-                 *
3890
-                 *          [ data ] [ embedded file ] [ data ] [ embedded file ]
3891
-                 *
3892
-                 * The first time we do it we'll find "two" embedded files, like this:
3893
-                 *
3894
-                 *  Emb. File #1:    [ embedded file ] [ data ] [ embedded file ]
3895
-                 *  Emb. File #2:                               [ embedded file ]
3896
-                 *
3897
-                 * We must not scan Emb. File #1 again for embedded files, because it would double-extract Emb. File #2.
3898
-                 *
3899
-                 * There is a flaw in this logic, though. Suppose that we actually have:
3900
-                 *
3901
-                 *          [ data ] [ compressed file w. recognizable magic bytes ]
3902
-                 *
3903
-                 * A first pass of the above will again identify "two" embedded files:
3904
-                 *
3905
-                 *  Emb. File #1:    [ compressed archive w. recognizable magic bytes ]
3906
-                 *  Emb. File #2:                               [ magic bytes         ] <- Compressed data/Not real file
3907
-                 *
3908
-                 * In this case, the magic bytes of a contained, compressed file is somehow still identifiable despite
3909
-                 * compression. The result is the Emb. File #2 will fail to be parsed and when we decompress Emb. File
3910
-                 * #1, then we maybe get something like this:
3911
-                 *
3912
-                 *  Decompressed:    [ data                   ] [ embedded file ]
3913
-                 *
3914
-                 * So if this happened... then we WOULD want to scan the decompressed file for embedded files.
3915
-                 * The problem is, we have no way of knowing how long embedded files are.
3916
-                 * We don't know if we have:
3917
-                 *
3918
-                 * A.       [ data ] [ embedded file ] [ data ] [ embedded file ]
3919
-                 *  or
3920
-                 * B.       [ data ] [ embedded compressed archive w. recognizable magic bytes ]
3921
-                 *  or
3922
-                 * C.       [ data ] [ embedded uncompressed archive w. multiple file entries [ file 1 ] [ file 2 ] [ file 2 ] ]
3923
-                 *
3924
-                 * Some ideas for a more accurate solution:
3925
-                 *
3926
-                 * 1. Record the offset and size of each file extracted by the parsers.
3927
-                 *    Then, when we do embedded file type recognition, we can check if the offset and size of the
3928
-                 *    embedded file matches the offset and size of a file that was extracted by a parser.
3929
-                 *    This falls apart a little bit for multiple layers of archives unless we also compare offsets within
3930
-                 *    each layer. We could do that, but it would be a lot of work. And we'd probably want to take into
3931
-                 *    consideration if files were decompressed or decrypted. ... I don't know a clean solution.
3932
-                 *
3933
-                 * 2. Have all parsers to run before embedded file type recognition and they each determine the length
3934
-                 *    of the file they parsed, so we can differentiate between embedded files and appended files.
3935
-                 *    For appended files, we would know they weren't extracted by a parser module and the parser for
3936
-                 *    each of those would report the length of the file it parsed so we can use that to mitigate
3937
-                 *    overlapping embedded file type recognition.
3938
-                 *    But I highly doubt all file types can be parsed to determine the correct length of the file.
3939
-                 */
3940
-                for (i = ctx->recursion_level; i > 0; i--) {
3941
-                    if (ctx->recursion_stack[i].attributes & LAYER_ATTRIBUTES_EMBEDDED) {
3942
-                        // Found an ancestor that was embedded.
3943
-                        // Do not scan embedded files again.
3944
-                        ancestor_was_embedded = true;
3945
-                        break;
3946
-                    }
3947
-                }
3948
-
3949
-                /*
3950 3877
                  * Next, check for actual embedded files.
3951 3878
                  */
3952
-                if ((false == ancestor_was_embedded) &&
3953
-                    (false == type_has_been_handled)) {
3879
+                if (false == type_has_been_handled) {
3954 3880
                     cli_dbgmsg("%s signature found at %u\n", cli_ftname(fpt->type), (unsigned int)fpt->offset);
3955 3881
 
3956 3882
                     type_has_been_handled = true;
... ...
@@ -4003,6 +3936,9 @@ static cl_error_t scanraw(cli_ctx *ctx, cli_file_t type, uint8_t typercg, cli_fi
4003 4003
                                     break;
4004 4004
                                 }
4005 4005
 
4006
+                                // Increment last_offset to ignore any file type matches that occured within this legitimate archive.
4007
+                                last_offset += zip_size - 1; // Note: size is definitely > 0 because header_check succeeded.
4008
+
4006 4009
                                 nret = cli_magic_scan_nested_fmap_type(
4007 4010
                                     ctx->fmap,
4008 4011
                                     fpt->offset,
... ...
@@ -4025,6 +3961,9 @@ static cl_error_t scanraw(cli_ctx *ctx, cli_file_t type, uint8_t typercg, cli_fi
4025 4025
                                     break;
4026 4026
                                 }
4027 4027
 
4028
+                                // Increment last_offset to ignore any file type matches that occured within this legitimate archive.
4029
+                                last_offset += cab_size - 1; // Note: size is definitely > 0 because header_check succeeded.
4030
+
4028 4031
                                 nret = cli_magic_scan_nested_fmap_type(
4029 4032
                                     ctx->fmap,
4030 4033
                                     fpt->offset,
... ...
@@ -4048,6 +3987,9 @@ static cl_error_t scanraw(cli_ctx *ctx, cli_file_t type, uint8_t typercg, cli_fi
4048 4048
                                     break;
4049 4049
                                 }
4050 4050
 
4051
+                                // Increment last_offset to ignore any file type matches that occured within this legitimate archive.
4052
+                                last_offset += arj_size - 1; // Note: size is definitely > 0 because header_check succeeded.
4053
+
4051 4054
                                 nret = cli_magic_scan_nested_fmap_type(
4052 4055
                                     ctx->fmap,
4053 4056
                                     fpt->offset,