update documentation. Part I, more to come. (bb #554).

git-svn: trunk@3508

Török Edvin authored on 2008/01/19 00:28:05
Showing 4 changed files
... ...
@@ -1,3 +1,8 @@
1
+Fri Jan 18 17:01:25 EET 2008 (edwin)
2
+------------------------------------
3
+  * docs/phishsigs_howto.tex/.pdf: update documentation. Part I, more to come.
4
+  (bb #554).
5
+
1 6
 Fri Jan 18 12:13:16 CET 2008 (acab)
2 7
 -----------------------------------
3 8
   * test: Storing the testifles byteswapped to avoid detection of the tarball.
4 9
deleted file mode 100644
... ...
@@ -1,1363 +0,0 @@
1
-#LyX 1.4.2 created this file. For more info see http://www.lyx.org/
2
-\lyxformat 245
3
-\begin_document
4
-\begin_header
5
-\textclass article
6
-\language english
7
-\inputencoding auto
8
-\fontscheme pslatex
9
-\graphics default
10
-\paperfontsize default
11
-\spacing single
12
-\papersize a4paper
13
-\use_geometry false
14
-\use_amsmath 1
15
-\cite_engine basic
16
-\use_bibtopic false
17
-\paperorientation portrait
18
-\secnumdepth 3
19
-\tocdepth 3
20
-\paragraph_separation indent
21
-\defskip medskip
22
-\quotes_language english
23
-\papercolumns 1
24
-\papersides 1
25
-\paperpagestyle default
26
-\tracking_changes false
27
-\output_changes false
28
-\end_header
29
-
30
-\begin_body
31
-
32
-\begin_layout Title
33
-
34
-\family roman
35
-\series medium
36
-\shape up
37
-\size normal
38
-\emph off
39
-\bar no
40
-\noun off
41
-\color none
42
-Phishing signatures creation HOWTO
43
-\end_layout
44
-
45
-\begin_layout Author
46

                
47
-\end_layout
48
-
49
-\begin_layout Section
50
-Database file format
51
-\end_layout
52
-
53
-\begin_layout Standard
54
-The database file format is common for the whitelist (.wdb), and domainlist
55
- (.pdb), and it consists of (multiple) lines of form:
56
-\end_layout
57
-
58
-\begin_layout Standard
59
-
60
-\series bold
61
-Flags\InsetSpace ~
62
-RealURL\InsetSpace ~
63
-DisplayedURL
64
-\end_layout
65
-
66
-\begin_layout Itemize
67
-Where 
68
-\noun on
69
-Flags
70
-\noun default
71
- is:
72
-\end_layout
73
-
74
-\begin_deeper
75
-\begin_layout Itemize
76
-an (optional) character : 
77
-\end_layout
78
-
79
-\begin_deeper
80
-\begin_layout Description
81
-R regex, has to match entire url, see section 
82
-\end_layout
83
-
84
-\begin_layout Description
85
-H has to match the host part of url only (a simple pattern, i.e.
86
- it is matched literally)
87
-\end_layout
88
-
89
-\begin_layout Description
90
-no\InsetSpace ~
91
-character matches the entire url, but as a simple pattern (non-regex)
92
-\end_layout
93
-
94
-\end_deeper
95
-\begin_layout Itemize
96
-followed by an (optional) 3-digit hexadecimal number representing flags
97
- that should be filtered.
98
-\end_layout
99
-
100
-\begin_deeper
101
-\begin_layout Itemize
102
-flag filtering only makes sense in .pdb files, (however clamav won't complain
103
- if you put flags in .wdb files, it just won't use them)
104
-\end_layout
105
-
106
-\begin_layout Itemize
107
-for details on how to construct a flag number see section 
108
-\begin_inset LatexCommand \prettyref{sec:Flags}
109
-
110
-\end_inset
111
-
112
-
113
-\end_layout
114
-
115
-\end_deeper
116
-\end_deeper
117
-\begin_layout Itemize
118
-
119
-\noun on
120
-RealURL 
121
-\noun default
122
-is the URL the user is sent to
123
-\end_layout
124
-
125
-\begin_layout Itemize
126
-
127
-\noun on
128
-displayedURL
129
-\noun default
130
- is the URL description displayed to the user, that is where it is 
131
-\emph on
132
-claimed
133
-\emph default
134
- they are sent, the most obvious example is that of an html anchor (<a>tag):
135
- its href attribute is the 
136
-\noun on
137
-realURL
138
-\noun default
139
-, and its contents is the 
140
-\noun on
141
-displayedURL
142
-\end_layout
143
-
144
-\begin_layout Itemize
145
-see section 
146
-\begin_inset LatexCommand \vref{sub:Extraction-of-realURL,}
147
-
148
-\end_inset
149
-
150
- for more details on what 
151
-\noun on
152
-realURL/displayedURL
153
-\noun default
154
- is
155
-\end_layout
156
-
157
-\begin_layout Standard
158
-Note: The spaces are mandatory, and empty lines are skipped.
159
-\end_layout
160
-
161
-\begin_layout Standard
162
-If any of the lines of daily.wdb/daily.pdb don't conform to the above file
163
- format, the loading of the file shall fail, and whitelist/domainlist feature
164
- will be disabled.
165
- If the loading of the whitelist fails, the phishing checks will be disabled
166
- entirely.
167
-\end_layout
168
-
169
-\begin_layout Standard
170
-Therefore it is important to test the daily.wdb/daily.pdb before packing it
171
- into daily.cvd!
172
-\end_layout
173
-
174
-\begin_layout Subsubsection
175
-Example
176
-\end_layout
177
-
178
-\begin_layout Standard
179
-The following line:
180
-\end_layout
181
-
182
-\begin_layout Standard
183
-
184
-\emph on
185
-R http://www
186
-\backslash
187
-.google
188
-\backslash
189
-.(com|ro|it) www
190
-\backslash
191
-.google
192
-\backslash
193
-.com
194
-\end_layout
195
-
196
-\begin_layout Standard
197
-Means: 
198
-\emph on
199
-\noun on
200
-R
201
-\emph default
202
- 
203
-\noun default
204
-- this is a regex.
205
- 
206
-\end_layout
207
-
208
-\begin_layout Standard
209
-Example of url pairs matching: http://www.google.com www.google.com, http://www.googl
210
-e.it www.google.com.
211
-\end_layout
212
-
213
-\begin_layout Standard
214
-Example of url pairs not matching: http://www.google.c0m www.google.com
215
-\end_layout
216
-
217
-\begin_layout Subsection
218
-How matching works
219
-\end_layout
220
-
221
-\begin_layout Subsubsection
222
-RealURL, displayedURL concatenation
223
-\begin_inset LatexCommand \label{sub:RealURL,-displayedURL-concatenation}
224
-
225
-\end_inset
226
-
227
-
228
-\end_layout
229
-
230
-\begin_layout Standard
231
-The phishing detection module processes pairs of realURL/displayedURL, and
232
- the matching against daily.wdb/daily.pdb is done as follows: the realURL
233
- is concatenated with a space, and with the displayedURL, then that 
234
-\emph on
235
-line 
236
-\emph default
237
-is matched against the lines in daily.wdb/daily.pdb
238
-\end_layout
239
-
240
-\begin_layout Standard
241
-So if you have a line like
242
-\end_layout
243
-
244
-\begin_layout Standard
245
-
246
-\shape italic
247
-\InsetSpace ~
248
-www.google.ro\InsetSpace ~
249
-www.google.com
250
-\end_layout
251
-
252
-\begin_layout Standard
253
-and a href like: 
254
-\emph on
255
-<a href=
256
-\begin_inset Quotes erd
257
-\end_inset
258
-
259
-http://www.google.ro
260
-\begin_inset Quotes erd
261
-\end_inset
262
-
263
->www.google.com</a>, 
264
-\emph default
265
-then it will match, but: 
266
-\emph on
267
-<a href=
268
-\begin_inset Quotes erd
269
-\end_inset
270
-
271
-http://images.google.com
272
-\begin_inset Quotes erd
273
-\end_inset
274
-
275
->www.google.com</a>
276
-\emph default
277
- will not match.
278
-\end_layout
279
-
280
-\begin_layout Standard
281
-If you use the 
282
-\series bold
283
-\noun on
284
-H
285
-\noun default
286
- 
287
-\series default
288
-flag, then the 2nd href will match too.
289
-\end_layout
290
-
291
-\begin_layout Subsubsection
292
-What happens when a match is found
293
-\end_layout
294
-
295
-\begin_layout Standard
296
-In the case of the whitelist, a match means that the realURL/displayedURL
297
- combination is considered 
298
-\noun on
299
-clean
300
-\noun default
301
-, and no further checks are performed on it.
302
-\end_layout
303
-
304
-\begin_layout Standard
305
-In the case of the domainlist, a match means that the realURL/displayedURL
306
- is going to be checked for phishing attempts.
307
- This is only done if you don't run clamav with the 
308
-\emph on
309
-alldomains
310
-\emph default
311
- option (since then all urls are checked).
312
- Furthermore you can restrict what checks are to be performed by specifying
313
- the 3-digit hexnumber.
314
-\end_layout
315
-
316
-\begin_layout Subsubsection
317
-Extraction of 
318
-\noun on
319
-realURL
320
-\noun default
321
-, 
322
-\noun on
323
-displayedURL
324
-\noun default
325
- from HTML tags
326
-\begin_inset LatexCommand \label{sub:Extraction-of-realURL,}
327
-
328
-\end_inset
329
-
330
-
331
-\end_layout
332
-
333
-\begin_layout Standard
334
-The html parser extracts pairs of 
335
-\noun on
336
-realURL
337
-\noun default
338
-/
339
-\noun on
340
-displayedURL
341
-\noun default
342
- based on the following rules:
343
-\end_layout
344
-
345
-\begin_layout Description
346
-a (anchor) the 
347
-\emph on
348
-href
349
-\emph default
350
- is the 
351
-\noun on
352
-realURL
353
-\noun default
354
-, its 
355
-\emph on
356
-contents
357
-\emph default
358
- is the 
359
-\noun on
360
-displayedURL
361
-\end_layout
362
-
363
-\begin_deeper
364
-\begin_layout Description
365
-contents is the tag-stripped contents of the <a> tags, so for example <b>
366
- tags are stripped (but not their contents)
367
-\end_layout
368
-
369
-\begin_layout Standard
370
-nesting another <a> tag withing an <a> tag (besides being invalid html)
371
- is treated as a </a><a..
372
-\end_layout
373
-
374
-\end_deeper
375
-\begin_layout Description
376
-form the 
377
-\emph on
378
-action 
379
-\emph default
380
-attribute is the 
381
-\noun on
382
-realURL
383
-\noun default
384
-, and a nested <a> tag is the 
385
-\noun on
386
-displayedURL
387
-\end_layout
388
-
389
-\begin_layout Description
390
-img/area if nested within an
391
-\emph on
392
- <a>
393
-\emph default
394
- tag, the 
395
-\noun on
396
-realURL
397
-\noun default
398
- is the 
399
-\emph on
400
-href
401
-\emph default
402
- of the a tag, and the 
403
-\emph on
404
-src/dynsrc/area
405
-\emph default
406
- is the 
407
-\noun on
408
-displayedURL
409
-\noun default
410
- of the img 
411
-\end_layout
412
-
413
-\begin_deeper
414
-\begin_layout Standard
415
-if nested withing a 
416
-\emph on
417
-form
418
-\emph default
419
- tag, then the action attribute of the 
420
-\emph on
421
-form
422
-\emph default
423
- tag is the 
424
-\noun on
425
-realURL
426
-\noun default
427
- 
428
-\end_layout
429
-
430
-\end_deeper
431
-\begin_layout Description
432
-iframe if nested withing an 
433
-\emph on
434
-<a>
435
-\emph default
436
- tag the 
437
-\emph on
438
-src
439
-\emph default
440
- attribute is the displayedURL, and the 
441
-\emph on
442
-href
443
-\emph default
444
- of its parent
445
-\emph on
446
- a
447
-\emph default
448
- tag is the 
449
-\noun on
450
-realURL
451
-\end_layout
452
-
453
-\begin_deeper
454
-\begin_layout Standard
455
-if nested withing a 
456
-\emph on
457
-form
458
-\emph default
459
- tag, then the action attribute of the 
460
-\emph on
461
-form
462
-\emph default
463
- tag is the 
464
-\noun on
465
-realURL
466
-\end_layout
467
-
468
-\end_deeper
469
-\begin_layout Subsubsection
470
-Example
471
-\end_layout
472
-
473
-\begin_layout Standard
474
-Consider this html file:
475
-\end_layout
476
-
477
-\begin_layout Quote
478
-
479
-\emph on
480
-<a href=
481
-\begin_inset Quotes erd
482
-\end_inset
483
-
484
-evilurl
485
-\begin_inset Quotes erd
486
-\end_inset
487
-
488
->www.paypal.com</a>
489
-\end_layout
490
-
491
-\begin_layout Quote
492
-
493
-\emph on
494
-<a href=
495
-\begin_inset Quotes erd
496
-\end_inset
497
-
498
-evilurl2
499
-\begin_inset Quotes erd
500
-\end_inset
501
-
502
- title=
503
-\begin_inset Quotes erd
504
-\end_inset
505
-
506
-www.ebay.com
507
-\begin_inset Quotes erd
508
-\end_inset
509
-
510
->click here to sign in</a>
511
-\end_layout
512
-
513
-\begin_layout Quote
514
-
515
-\emph on
516
-<form action=
517
-\begin_inset Quotes erd
518
-\end_inset
519
-
520
-evilurl_form
521
-\begin_inset Quotes erd
522
-\end_inset
523
-
524
->
525
-\end_layout
526
-
527
-\begin_layout Quote
528
-
529
-\emph on
530
-Please sign in to <a href=
531
-\begin_inset Quotes erd
532
-\end_inset
533
-
534
-cgi.ebay.com
535
-\begin_inset Quotes erd
536
-\end_inset
537
-
538
->Ebay</a> using this form
539
-\end_layout
540
-
541
-\begin_layout Quote
542
-
543
-\emph on
544
-<input type='text' name='username'>Username</input>
545
-\end_layout
546
-
547
-\begin_layout Quote
548
-
549
-\emph on
550
-....
551
-\end_layout
552
-
553
-\begin_layout Quote
554
-
555
-\emph on
556
-</form>
557
-\end_layout
558
-
559
-\begin_layout Quote
560
-
561
-\emph on
562
-<a href=
563
-\begin_inset Quotes erd
564
-\end_inset
565
-
566
-evilurl
567
-\begin_inset Quotes erd
568
-\end_inset
569
-
570
-><img src=
571
-\begin_inset Quotes erd
572
-\end_inset
573
-
574
-images.paypal.com/secure.jpg
575
-\begin_inset Quotes erd
576
-\end_inset
577
-
578
-></a>
579
-\end_layout
580
-
581
-\begin_layout Standard
582
-The resulting 
583
-\noun on
584
-realURL/displayedURL
585
-\noun default
586
- pairs will be (note that one tag can generate multiple pairs):
587
-\end_layout
588
-
589
-\begin_layout Itemize
590
-evilurl / www.paypal.com
591
-\end_layout
592
-
593
-\begin_layout Itemize
594
-evilurl2 / click here to sign in
595
-\end_layout
596
-
597
-\begin_layout Itemize
598
-evilurl2 / www.ebay.com
599
-\end_layout
600
-
601
-\begin_layout Itemize
602
-evilurl_form / cgi.ebay.com
603
-\end_layout
604
-
605
-\begin_layout Itemize
606
-cgi.ebay.com / Ebay
607
-\end_layout
608
-
609
-\begin_layout Itemize
610
-evilurl / image.paypal.com/secure.jpg
611
-\end_layout
612
-
613
-\begin_layout Subsection
614
-Simple patterns
615
-\begin_inset LatexCommand \label{sec:Simple-patterns}
616
-
617
-\end_inset
618
-
619
-
620
-\end_layout
621
-
622
-\begin_layout Standard
623
-Simple patterns are matched literally, i.e.
624
- if you say: 
625
-\end_layout
626
-
627
-\begin_layout Quote
628
-www.google.com
629
-\end_layout
630
-
631
-\begin_layout Standard
632
-it is going to match 
633
-\emph on
634
-www.google.com
635
-\emph default
636
-, and only that.
637
- The 
638
-\emph on
639
-.
640
- (dot)
641
-\emph default
642
- character has no special meaning (see the section on regexes 
643
-\begin_inset LatexCommand \vref{sec:Regular-expressions}
644
-
645
-\end_inset
646
-
647
- for how the 
648
-\emph on
649
-.(dot)
650
-\emph default
651
- character behaves there)
652
-\end_layout
653
-
654
-\begin_layout Subsection
655
-Regular expressions
656
-\begin_inset LatexCommand \label{sec:Regular-expressions}
657
-
658
-\end_inset
659
-
660
-
661
-\end_layout
662
-
663
-\begin_layout Standard
664
-POSIX regular expressions are supported, and you can consider that internally
665
- it is wrapped by 
666
-\emph on
667
-^
668
-\emph default
669
-, and 
670
-\emph on
671
-$.
672
- 
673
-\emph default
674
-In other words, this means that the regular expression has to match the
675
- entire concatenated (see section 
676
-\begin_inset LatexCommand \vref{sub:RealURL,-displayedURL-concatenation}
677
-
678
-\end_inset
679
-
680
- for details on concatenation) url.
681
-\end_layout
682
-
683
-\begin_layout Standard
684
-It is recomended that you read section 
685
-\begin_inset LatexCommand \vref{sec:Introduction-to-regular}
686
-
687
-\end_inset
688
-
689
- to learn how to write regular expressions, and then come back and read
690
- this for hints.
691
-\end_layout
692
-
693
-\begin_layout Standard
694
-Be advised that clamav contains an internal, very basic regex matcher to
695
- reduce the load on the regex matching core.
696
- Thus it is recomended that you avoid using regex syntax not supported by
697
- it at the very beginning of regexes (at least the first few characters).
698
-\end_layout
699
-
700
-\begin_layout Standard
701
-Currently the clamav regex matcher supports:
702
-\end_layout
703
-
704
-\begin_layout Itemize
705
-.
706
- (dot) character
707
-\end_layout
708
-
709
-\begin_layout Itemize
710
-
711
-\backslash
712
- (escaping special characters)
713
-\end_layout
714
-
715
-\begin_layout Itemize
716
-| (pipe) alternatives
717
-\end_layout
718
-
719
-\begin_layout Itemize
720
-[] (character classes)
721
-\end_layout
722
-
723
-\begin_layout Itemize
724
-() (paranthesis for grouping, but no group extraction is performed)
725
-\end_layout
726
-
727
-\begin_layout Itemize
728
-other non-special characters
729
-\end_layout
730
-
731
-\begin_layout Standard
732
-Thus the following are not supported:
733
-\end_layout
734
-
735
-\begin_layout Itemize
736
-+ repetition
737
-\end_layout
738
-
739
-\begin_layout Itemize
740
-* repetition
741
-\end_layout
742
-
743
-\begin_layout Itemize
744
-{} repetition
745
-\end_layout
746
-
747
-\begin_layout Itemize
748
-backreferences
749
-\end_layout
750
-
751
-\begin_layout Itemize
752
-lookaround
753
-\end_layout
754
-
755
-\begin_layout Itemize
756
-other 
757
-\begin_inset Quotes eld
758
-\end_inset
759
-
760
-advanced
761
-\begin_inset Quotes erd
762
-\end_inset
763
-
764
- features not listed in the supported list ;)
765
-\end_layout
766
-
767
-\begin_layout Standard
768
-This however shouldn't discourage you from using the 
769
-\begin_inset Quotes eld
770
-\end_inset
771
-
772
-not directly supported features 
773
-\begin_inset Quotes eld
774
-\end_inset
775
-
776
-, because if the internal engine encounters unsupported syntax, it passes
777
- it on to the POSIX regex core (beginning from the first unsupported token,
778
- everything before that is still processed by the internal matcher).
779
- An example might make this more clear:
780
-\end_layout
781
-
782
-\begin_layout Standard
783
-
784
-\emph on
785
-www
786
-\backslash
787
-.google
788
-\backslash
789
-.(com|ro|it) ([a-zA-Z])+
790
-\backslash
791
-.google
792
-\backslash
793
-.(com|ro|it)
794
-\end_layout
795
-
796
-\begin_layout Standard
797
-Everything till 
798
-\emph on
799
-([a-zA-Z])+
800
-\emph default
801
- is processed internally, that paranthesis (and everything beyond) is processed
802
- by the posix core.
803
-\end_layout
804
-
805
-\begin_layout Standard
806
-Examples of url pairs that match: 
807
-\end_layout
808
-
809
-\begin_layout Itemize
810
-
811
-\emph on
812
-www.google.ro images.google.ro
813
-\end_layout
814
-
815
-\begin_layout Itemize
816
-www.google.com images.google.ro
817
-\end_layout
818
-
819
-\begin_layout Standard
820
-Example of url pairs that don't match:
821
-\end_layout
822
-
823
-\begin_layout Itemize
824
-www.google.ro images1.google.ro
825
-\end_layout
826
-
827
-\begin_layout Itemize
828
-images.google.com image.google.com
829
-\end_layout
830
-
831
-\begin_layout Subsection
832
-Flags
833
-\begin_inset LatexCommand \label{sec:Flags}
834
-
835
-\end_inset
836
-
837
-
838
-\end_layout
839
-
840
-\begin_layout Standard
841
-Flags are a binary OR of the following numbers:
842
-\end_layout
843
-
844
-\begin_layout Description
845
-HOST_SUFFICIENT 1
846
-\end_layout
847
-
848
-\begin_layout Description
849
-DOMAIN_SUFFICIENT 2
850
-\end_layout
851
-
852
-\begin_layout Description
853
-DO_REVERSE_LOOKUP 4
854
-\end_layout
855
-
856
-\begin_layout Description
857
-CHECK_REDIR 8
858
-\end_layout
859
-
860
-\begin_layout Description
861
-CHECK_SSL 16 
862
-\end_layout
863
-
864
-\begin_layout Description
865
-CHECK_CLOAKING 32
866
-\end_layout
867
-
868
-\begin_layout Description
869
-CLEANUP_URL 64 
870
-\end_layout
871
-
872
-\begin_layout Description
873
-CHECK_DOMAIN_REVERSE 128 
874
-\end_layout
875
-
876
-\begin_layout Description
877
-CHECK_IMG_URL 256 
878
-\end_layout
879
-
880
-\begin_layout Description
881
-DOMAINLIST_REQUIRED 512 
882
-\end_layout
883
-
884
-\begin_layout Standard
885
-The names of the constants are self-explanatory.
886
-\end_layout
887
-
888
-\begin_layout Standard
889
-These constants are defined in libclamav/phishcheck.h, you can check there
890
- for the latest flags.
891
-\end_layout
892
-
893
-\begin_layout Standard
894
-There is a default set of flags that are enabled, these are currently: (CLEANUP_
895
-URL|DOMAIN_SUFFICIENT|CHECK_SSL|CHECK_CLOAKING|DOMAINLIST_REQUIRED|CHECK_IMG_URL
896
-), ssl checking is performed only for a tags currently.
897
-\end_layout
898
-
899
-\begin_layout Standard
900
-You must decide for each line in the domainlist if you want to filter any
901
- flags (that is you don't want certain checks to be done), and then calculate
902
- the binary OR of those constants, and then convert it into a 3-digit hexnumber.
903
- For example you devide that domain_sufficient shouldn't be used for ebay.com,
904
- and you don't want to check images either, so you come up with this flag
905
- number: 
906
-\begin_inset Formula $2|256\Rightarrow$
907
-\end_inset
908
-
909
-258
910
-\begin_inset Formula $(decimal)\Rightarrow102(hexadecimal)$
911
-\end_inset
912
-
913
-
914
-\end_layout
915
-
916
-\begin_layout Standard
917
-So you add this line to daily.wdb:
918
-\end_layout
919
-
920
-\begin_layout Itemize
921
-R102\InsetSpace ~
922
-www.ebay.com\InsetSpace ~
923
-.+
924
-\end_layout
925
-
926
-\begin_layout Section
927
-Introduction to regular expressions
928
-\begin_inset LatexCommand \label{sec:Introduction-to-regular}
929
-
930
-\end_inset
931
-
932
-
933
-\end_layout
934
-
935
-\begin_layout Standard
936
-Recomended reading:
937
-\end_layout
938
-
939
-\begin_layout Itemize
940
-http://www.regular-expressions.info/quickstart.html
941
-\end_layout
942
-
943
-\begin_layout Itemize
944
-http://www.regular-expressions.info/tutorial.html
945
-\end_layout
946
-
947
-\begin_layout Itemize
948
-regex(7) man-page: http://www.tin.org/bin/man.cgi?section=7&topic=regex
949
-\end_layout
950
-
951
-\begin_layout Subsection
952
-Special characters
953
-\end_layout
954
-
955
-\begin_layout Description
956
-[ the opening square bracket - it marks the beginning of a character class,
957
- see section
958
-\begin_inset LatexCommand \vref{sub:Character-classes}
959
-
960
-\end_inset
961
-
962
-
963
-\end_layout
964
-
965
-\begin_layout Description
966
-
967
-\backslash
968
- the backslash - escapes special characters, see section 
969
-\begin_inset LatexCommand \vref{sub:Escaping}
970
-
971
-\end_inset
972
-
973
-
974
-\end_layout
975
-
976
-\begin_layout Description
977
-\i \^{ }
978
- the caret - matches the beginning of a line (not needed in clamav regexes,
979
- this is implied)
980
-\end_layout
981
-
982
-\begin_layout Description
983
-$ the dollar sign - matches the end of a line (not needed in clamav regexes,
984
- this is implied)
985
-\end_layout
986
-
987
-\begin_layout Description
988
-\i \.{ }
989
- the period or dot - matches 
990
-\emph on
991
-any
992
-\emph default
993
- character
994
-\end_layout
995
-
996
-\begin_layout Description
997
-| the vertical bar or pipe symbol - matches either of the token on its left
998
- and right side, see section
999
-\begin_inset LatexCommand \vref{sub:Alternation}
1000
-
1001
-\end_inset
1002
-
1003
-
1004
-\end_layout
1005
-
1006
-\begin_layout Description
1007
-? the question mark - matches optionally the left-side token, see section
1008
-\begin_inset LatexCommand \vref{sub:Optional-matching,-and}
1009
-
1010
-\end_inset
1011
-
1012
-
1013
-\end_layout
1014
-
1015
-\begin_layout Description
1016
-* the asterisk or star - matches 0 or more occurences of the left-side token,
1017
- see section 
1018
-\begin_inset LatexCommand \vref{sub:Optional-matching,-and}
1019
-
1020
-\end_inset
1021
-
1022
-
1023
-\end_layout
1024
-
1025
-\begin_layout Description
1026
-+ the plus sign - matches 1 or more occurences of the left-side token, see
1027
- section 
1028
-\begin_inset LatexCommand \vref{sub:Optional-matching,-and}
1029
-
1030
-\end_inset
1031
-
1032
-
1033
-\end_layout
1034
-
1035
-\begin_layout Description
1036
-( the opening round bracket - \i \c{m}
1037
-arks beginning of a group, see section 
1038
-\begin_inset LatexCommand \vref{sub:Groups}
1039
-
1040
-\end_inset
1041
-
1042
-
1043
-\end_layout
1044
-
1045
-\begin_layout Description
1046
-) the closing round bracket - marks end of a group, see section
1047
-\begin_inset LatexCommand \vref{sub:Groups}
1048
-
1049
-\end_inset
1050
-
1051
-
1052
-\end_layout
1053
-
1054
-\begin_layout Subsection
1055
-Character classes
1056
-\begin_inset LatexCommand \label{sub:Character-classes}
1057
-
1058
-\end_inset
1059
-
1060
-
1061
-\end_layout
1062
-
1063
-\begin_layout Subsection
1064
-Escaping
1065
-\begin_inset LatexCommand \label{sub:Escaping}
1066
-
1067
-\end_inset
1068
-
1069
-
1070
-\end_layout
1071
-
1072
-\begin_layout Standard
1073
-Escaping has two purposes: 
1074
-\end_layout
1075
-
1076
-\begin_layout Itemize
1077
-it allows you to actually match the special characters themselves, for example
1078
- to match the literal 
1079
-\emph on
1080
-+
1081
-\emph default
1082
-, you would write 
1083
-\emph on
1084
-
1085
-\backslash
1086
-+
1087
-\end_layout
1088
-
1089
-\begin_layout Itemize
1090
-it also allows you to match non-printable characters, such as the tab (
1091
-\emph on
1092
-
1093
-\backslash
1094
-t
1095
-\emph default
1096
-), newline (
1097
-\emph on
1098
-
1099
-\backslash
1100
-n
1101
-\emph default
1102
-), ..
1103
-\end_layout
1104
-
1105
-\begin_layout Standard
1106
-However since non-printable characters are not valid inside an url, you
1107
- won't have a reason to use them.
1108
-\end_layout
1109
-
1110
-\begin_layout Subsection
1111
-Alternation
1112
-\begin_inset LatexCommand \label{sub:Alternation}
1113
-
1114
-\end_inset
1115
-
1116
-
1117
-\end_layout
1118
-
1119
-\begin_layout Subsection
1120
-Optional matching, and repetition
1121
-\begin_inset LatexCommand \label{sub:Optional-matching,-and}
1122
-
1123
-\end_inset
1124
-
1125
-
1126
-\end_layout
1127
-
1128
-\begin_layout Subsection
1129
-Groups
1130
-\begin_inset LatexCommand \label{sub:Groups}
1131
-
1132
-\end_inset
1133
-
1134
-
1135
-\end_layout
1136
-
1137
-\begin_layout Standard
1138
-Groups are usually used together with repetition, or alternation.
1139
- For example: 
1140
-\emph on
1141
-(com|it)+
1142
-\emph default
1143
- means: match 1 or more repetitions of 
1144
-\emph on
1145
-com
1146
-\emph default
1147
- or 
1148
-\emph on
1149
-it,
1150
-\emph default
1151
- that is it matches: com, it, comcom, comcomcom, comit, itit, ititcom,...
1152
- you get the idea.
1153
-\end_layout
1154
-
1155
-\begin_layout Standard
1156
-Groups can also be used to extract substring, but this is not supported
1157
- by the clam engine, and not needed either in this case.
1158
-\end_layout
1159
-
1160
-\begin_layout Section
1161
-How to create database files
1162
-\end_layout
1163
-
1164
-\begin_layout Subsection
1165
-How to create and maintain the whitelist (daily.wdb)
1166
-\end_layout
1167
-
1168
-\begin_layout Standard
1169
-If the phishing code claims that a certain mail is phishing, but its not,
1170
- you have 2 choices:
1171
-\end_layout
1172
-
1173
-\begin_layout Itemize
1174
-examine your rules daily.pdb, and fix them if necessary (see: section
1175
-\begin_inset LatexCommand \vref{sub:How-to-create}
1176
-
1177
-\end_inset
1178
-
1179
-)
1180
-\end_layout
1181
-
1182
-\begin_layout Itemize
1183
-add it to the whitelist (discussed here)
1184
-\end_layout
1185
-
1186
-\begin_layout Standard
1187
-Lets assume you are having problems because of links like this in a mail:
1188
-\end_layout
1189
-
1190
-\begin_layout Quote
1191
-<a href=
1192
-\begin_inset Quotes erd
1193
-\end_inset
1194
-
1195
-http://69.0.241.57/bCentral/L.asp?L=XXXXXXXX
1196
-\begin_inset Quotes erd
1197
-\end_inset
1198
-
1199
->http://www.bcentral.it/</a>
1200
-\end_layout
1201
-
1202
-\begin_layout Standard
1203
-After investigating those sites further, you decide they are no threat,
1204
- and create a line like this in daily.wdb:
1205
-\end_layout
1206
-
1207
-\begin_layout Quote
1208
-R http://www
1209
-\backslash
1210
-.bcentral
1211
-\backslash
1212
-.it/.+ http://69
1213
-\backslash
1214
-.0
1215
-\backslash
1216
-.241
1217
-\backslash
1218
-.57/bCentral/L
1219
-\backslash
1220
-.asp?L=.+ 
1221
-\end_layout
1222
-
1223
-\begin_layout Standard
1224
-Note: urls like the above can be used to track unique mail recipients, and
1225
- thus know if somebody actually reads mails (so they can send more spam).
1226
- However since this site required no authentication information, it is safe
1227
- from a phishing point of view.
1228
-\end_layout
1229
-
1230
-\begin_layout Subsection
1231
-How to create and maintain the domainlist (daily.pdb)
1232
-\begin_inset LatexCommand \label{sub:How-to-create}
1233
-
1234
-\end_inset
1235
-
1236
-
1237
-\end_layout
1238
-
1239
-\begin_layout Standard
1240
-When not using --phish-scan-alldomains (production environments for example),
1241
- you need to decide which urls you are going to check.
1242
- 
1243
-\end_layout
1244
-
1245
-\begin_layout Standard
1246
-Although at a first glance it might seem a good idea to check everything,
1247
- it would produce false positives.
1248
- Particularly newsletters, ads, etc.
1249
- are likely to use URLs that look like phishing attempts.
1250
-\end_layout
1251
-
1252
-\begin_layout Standard
1253
-Lets assume that you've recently seen many phishing attempts claiming they
1254
- come from Paypal.
1255
- Thus you need to add paypal to daily.pdb:
1256
-\end_layout
1257
-
1258
-\begin_layout Quote
1259
-R .+ .+
1260
-\backslash
1261
-.paypal
1262
-\backslash
1263
-.com
1264
-\end_layout
1265
-
1266
-\begin_layout Standard
1267
-The above line will block (detect as phishing) mails that contain urls that
1268
- claim to lead to paypal, but they don't in fact.
1269
-\end_layout
1270
-
1271
-\begin_layout Standard
1272
-Be carefull not to create regexes that match a too broad range of urls though.
1273
-\end_layout
1274
-
1275
-\begin_layout Subsection
1276
-Dealing with false positives, and undetected phishing mails
1277
-\end_layout
1278
-
1279
-\begin_layout Subsubsection
1280
-False positives
1281
-\end_layout
1282
-
1283
-\begin_layout Standard
1284
-Whenever you see a false positive (mail that is detected as phishing, but
1285
- its not), you need to examine 
1286
-\emph on
1287
-why
1288
-\emph default
1289
- clamav decided that its phishing.
1290
- You can do this easily by building clamav with debugging (./configure --enable-e
1291
-xperimental --enable-debug), and then running a tool:
1292
-\end_layout
1293
-
1294
-\begin_layout Quote
1295
-$contrib/phishing/why.py phishing.eml
1296
-\end_layout
1297
-
1298
-\begin_layout Standard
1299
-This will show the url that triggers the phish verdict, and a reason why
1300
- that url is considered phishing attempt.
1301
-\end_layout
1302
-
1303
-\begin_layout Standard
1304
-Once you know the reason, you might need to modify daily.pdb (if one of yours
1305
- rules inthere are too broad), or you need to add the url to daily.wdb.
1306
- If you think the algorithm is incorrect, please file a bugreport on bugzilla.cla
1307
-mav.net, including the output of 
1308
-\emph on
1309
-why.py
1310
-\emph default
1311
-.
1312
-\end_layout
1313
-
1314
-\begin_layout Subsubsection
1315
-Undetected phish mails
1316
-\end_layout
1317
-
1318
-\begin_layout Standard
1319
-Using why.py doesn't help here unfortunately (it will say: clean), so all
1320
- you can do is:
1321
-\end_layout
1322
-
1323
-\begin_layout Quote
1324
-$clamscan/clamscan --phish-scan-alldomains undetected.eml
1325
-\end_layout
1326
-
1327
-\begin_layout Standard
1328
-And see if the mail is detected, if yes, then you need to add an appropiate
1329
- line to daily.pdb (see section 
1330
-\begin_inset LatexCommand \vref{sub:How-to-create}
1331
-
1332
-\end_inset
1333
-
1334
-).
1335
-\end_layout
1336
-
1337
-\begin_layout Standard
1338
-If the mail is not detected, then try using:
1339
-\end_layout
1340
-
1341
-\begin_layout Quote
1342
-$clamscan/clamscan --debug undetected.eml|less
1343
-\end_layout
1344
-
1345
-\begin_layout Address
1346
-Then see what urls are being checked, see if any of them is in a whitelist,
1347
- see if all urls are detected, etc.
1348
-\end_layout
1349
-
1350
-\begin_layout Section
1351
-Hints and recomandations
1352
-\end_layout
1353
-
1354
-\begin_layout Section
1355
-Examples
1356
-\end_layout
1357
-
1358
-\begin_layout Standard
1359
-
1360
-\end_layout
1361
-
1362
-\end_body
1363
-\end_document
1364 1
Binary files a/docs/phishsigs_howto.pdf and b/docs/phishsigs_howto.pdf differ
1365 2
new file mode 100644
... ...
@@ -0,0 +1,491 @@
0
+%% LyX 1.5.3 created this file.  For more info, see http://www.lyx.org/.
1
+%% Do not edit unless you really know what you are doing.
2
+\documentclass[a4paper,english]{article}
3
+\usepackage{mathptmx}
4
+\usepackage[T1]{fontenc}
5
+\usepackage{varioref}
6
+\usepackage{prettyref}
7
+\usepackage{amssymb}
8
+\usepackage{pslatex}
9
+\usepackage[dvips]{graphicx}
10
+\usepackage{wrapfig}
11
+\usepackage{url}
12
+\date{}
13
+
14
+\begin{document}
15
+
16
+\title{{\huge Phishing signatures creation HOWTO}}
17
+\author{T\"or\"ok Edwin}
18
+\maketitle
19
+
20
+\section{Database file format}
21
+
22
+\subsection{PDB format}
23
+This file contains URLs/hosts that are targets of phishing attempts.
24
+It contains lines in the following format:
25
+\begin{verbatim}
26
+R[Filter]:RealURL:DisplayedURL[:FuncLevelSpec]
27
+H[Filter]:DisplayedHostname[:FuncLevelSpec]
28
+\end{verbatim}
29
+
30
+\begin{description}
31
+ \item [{R}] regular expression, for the concatenated URL
32
+ \item [{H}] matches the \verb+DisplayedHostname+ as a simple pattern (literally, no regular expression)
33
+ 	\begin{itemize}
34
+ 		\item the pattern can match either the full hostname
35
+ 		\item or a subdomain of the specified hostname
36
+ 		\item to avoid false matches in case of subdomain matches, the engine checks that there is a dot (\verb+.+) or a space (\verb+ +) before the matched portion
37
+	\end{itemize}
38
+ \item [{Filter}] an (optional) 3-digit hexadecimal number representing flags that should be filtered.
39
+	\begin{itemize}
40
+ 		\item flag filtering only makes sense in .pdb files (clamav won't complain if you put flags in .wdb files, however; it will just skip them)
41
+ 		\item for details on how to construct a flag number see \prettyref{sec:Flags}
42
+	\end{itemize}
43
+
44
+ \item [{RealURL }] is the URL the user is sent to
45
+ \item [{DisplayedURL}] is the URL description displayed to the user, that is, where it is \emph{claimed} they are sent. The most obvious example is an HTML anchor (\verb+<a>+ tag): its href attribute is the \textsc{realURL}, and its contents are the \textsc{displayedURL}
46
+ \item [{DisplayedHostname}] is the hostname portion of the \textsc{displayedURL}
47
+ \item [{FuncLevelSpec}] an (optional) functionality level, 2 formats are possible:
48
+	\begin{itemize}
49
+ 		\item \verb+minlevel+ all engines having functionality level $\geq$ \verb+minlevel+ will load this line
50
+ 		\item \verb+minlevel-maxlevel+ engines with functionality level $\geq$ \verb+minlevel+, and $<$ \verb+maxlevel+ will load this line
51
+	\end{itemize}
52
+\end{description}
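The FuncLevelSpec rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not clamav's loader; the function names `parse_func_level_spec` and `line_is_loaded` are made up for this example, and the exclusive upper bound is taken from the description above.

```python
def parse_func_level_spec(spec):
    """Split 'minlevel' or 'minlevel-maxlevel' into (minlevel, maxlevel).

    maxlevel is None when the spec gives no upper bound.
    """
    low, sep, high = spec.partition("-")
    return int(low), (int(high) if sep else None)


def line_is_loaded(spec, engine_level):
    """Return True if an engine with this functionality level loads the line."""
    minlevel, maxlevel = parse_func_level_spec(spec)
    if engine_level < minlevel:
        return False
    # The upper bound is exclusive, following the description above.
    return maxlevel is None or engine_level < maxlevel
```

For example, a line ending in `:20-30` would be loaded by an engine at level 29 but not by one at level 30.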
53
+
54
+\subsection{WDB format}
55
+This file contains whitelisted URL pairs.
56
+It contains lines in the following format:
57
+\begin{verbatim}
58
+X:RealURL:DisplayedURL[:FuncLevelSpec]
59
+M:RealHostname:DisplayedHostname[:FuncLevelSpec]
60
+\end{verbatim}
61
+
62
+\begin{description}
63
+ \item [{X}] regular expression, for the \textsc{entire URL}, not just the hostname
64
+ \begin{itemize}
65
+  \item The regular expression is by default anchored to start-of-line and end-of-line, as if you have used \verb+^RegularExpression$+
66
+  \item A trailing \verb+/+ is automatically added both to the regex, and the input string to avoid false matches
67
+  \item The regular expression matches the \textsc{concatenation} of RealURL, a colon(\verb+:+), and DisplayedURL as a single string. It doesn't separately match RealURL and DisplayedURL!
68
+ \end{itemize}
69
+ \item [{M}] matches hostname, or subdomain of it, see notes for \textsc{H} above
70
+\end{description}
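The hostname matching used by H and M lines (full hostname, or a subdomain with a dot or space just before the matched portion) might look roughly like the following Python sketch. It only mirrors the notes above; `host_matches` is a hypothetical helper, not a function from libclamav.

```python
def host_matches(pattern, displayed_hostname):
    """Literal match of the full hostname, or of a subdomain of `pattern`."""
    if displayed_hostname == pattern:
        return True
    if not displayed_hostname.endswith(pattern):
        return False
    # A suffix match counts only if the character just before the matched
    # portion is a dot or a space; this rejects e.g. "fakeebay.com".
    preceding = displayed_hostname[-len(pattern) - 1]
    return preceding in ". "
```

With the pattern `ebay.com`, this accepts `ebay.com` and `cgi.ebay.com`, and rejects `fakeebay.com`.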
71
+
72
+\subsection{Hints}
73
+
74
+\begin{itemize}
75
+ \item empty lines are ignored
76
+ \item the colons are mandatory
77
+ \item Don't leave extra spaces on the end of a line!
78
+ \item if any of the lines don't conform to this format, clamav will abort with a Malformed Database Error
79
+ \item see section \vref{sub:Extraction-of-realURL,} for more details on \textsc{realURL/displayedURL}
80
+\end{itemize}
81
+
82
+%TODO: give up-to-date examples
83
+
84
+\subsubsection{Example}
85
+
86
+The following line:
87
+
88
+\emph{R http://www\textbackslash{}.google\textbackslash{}.(com|ro|it)
89
+www\textbackslash{}.google\textbackslash{}.com}
90
+
91
+Means: \emph{\textsc{R}}\textsc{ }- this is a regex. 
92
+
93
+Example of url pairs matching: http://www.google.com www.google.com,
94
+http://www.google.it www.google.com.
95
+
96
+Example of url pairs not matching: http://www.google.c0m www.google.com
97
+
98
+
99
+\subsection{How matching works}
100
+
101
+
102
+\subsubsection{RealURL, displayedURL concatenation\label{sub:RealURL,-displayedURL-concatenation}}
103
+
104
+The phishing detection module processes pairs of realURL/displayedURL,
105
+and the matching against daily.wdb/daily.pdb is done as follows: the
106
+realURL is concatenated with a space, and with the displayedURL, then
107
+that \emph{line} is matched against the lines in daily.wdb/daily.pdb
108
+
109
+So if you have a line like
110
+
111
+\textit{~www.google.ro~www.google.com}
112
+
113
+and a href like: \emph{<a href=''http://www.google.ro''>www.google.com</a>,}
114
+then it will match, but: \emph{<a href=''http://images.google.com''>www.google.com</a>}
115
+will not match.
116
+
117
+If you use the \textbf{\textsc{H}} flag, then the 2nd href will match
118
+too.
119
+
120
+
121
+\subsubsection{What happens when a match is found}
122
+
123
+In the case of the whitelist, a match means that the realURL/displayedURL
124
+combination is considered \textsc{clean}, and no further checks are
125
+performed on it.
126
+
127
+In the case of the domainlist, a match means that the realURL/displayedURL
128
+is going to be checked for phishing attempts. This is only done if
129
+you don't run clamav with the \emph{alldomains} option (since then
130
+all urls are checked). Furthermore you can restrict what checks are
131
+to be performed by specifying the 3-digit hexnumber.
132
+
133
+
134
+\subsubsection{Extraction of \textsc{realURL}, \textsc{displayedURL} from HTML tags\label{sub:Extraction-of-realURL,}}
135
+
136
+The html parser extracts pairs of \textsc{realURL}/\textsc{displayedURL}
137
+based on the following rules:
138
+
139
+\begin{description}
140
+\item [{a}] (anchor) the \emph{href} is the \textsc{realURL}, its \emph{contents}
141
+is the \textsc{displayedURL}
142
+
143
+\begin{description}
144
+\item [{contents}] is the tag-stripped contents of the <a> tags, so for
145
+example <b> tags are stripped (but not their contents)
146
+\end{description}
147
+nesting another <a> tag within an <a> tag (besides being invalid
148
+html) is treated as a </a><a..
149
+
150
+\item [{form}] the \emph{action} attribute is the \textsc{realURL}, and a
151
+nested <a> tag is the \textsc{displayedURL}
152
+\item [{img/area}] if nested within an \emph{<a>} tag, the \textsc{realURL}
153
+is the \emph{href} of the a tag, and the \emph{src/dynsrc/area} is
154
+the \textsc{displayedURL} of the img 
155
+
156
+
157
+if nested within a \emph{form} tag, then the action attribute of
158
+the \emph{form} tag is the \textsc{realURL} 
159
+
160
+\item [{iframe}] if nested within an \emph{<a>} tag the \emph{src} attribute
161
+is the displayedURL, and the \emph{href} of its parent \emph{a} tag
162
+is the \textsc{realURL}
163
+
164
+
165
+if nested within a \emph{form} tag, then the action attribute of
166
+the \emph{form} tag is the \textsc{realURL}
167
+
168
+\end{description}
169
+
170
+\subsubsection{Example}
171
+
172
+Consider this html file:
173
+
174
+\begin{quote}
175
+\emph{<a href=''evilurl''>www.paypal.com</a>}
176
+
177
+\emph{<a href=''evilurl2'' title=''www.ebay.com''>click here to
178
+sign in</a>}
179
+
180
+\emph{<form action=''evilurl\_form''>}
181
+
182
+\emph{Please sign in to <a href=''cgi.ebay.com''>Ebay</a> using
183
+this form}
184
+
185
+\emph{<input type='text' name='username'>Username</input>}
186
+
187
+\emph{....}
188
+
189
+\emph{</form>}
190
+
191
+\emph{<a href=''evilurl''><img src=''images.paypal.com/secure.jpg''></a>}
192
+\end{quote}
193
+The resulting \textsc{realURL/displayedURL} pairs will be (note that
194
+one tag can generate multiple pairs):
195
+
196
+\begin{itemize}
197
+\item evilurl / www.paypal.com
198
+\item evilurl2 / click here to sign in
199
+\item evilurl2 / www.ebay.com
200
+\item evilurl\_form / cgi.ebay.com
201
+\item cgi.ebay.com / Ebay
202
+\item evilurl / images.paypal.com/secure.jpg
203
+\end{itemize}
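As a rough illustration of the anchor rule only (not the parser clamav actually uses), the sketch below collects realURL/displayedURL pairs from <a> tags with Python's standard `html.parser`. The form, img/area and iframe rules and the title attribute are left out, and the names `AnchorPairs` and `anchor_pairs` are made up for this example.

```python
from html.parser import HTMLParser


class AnchorPairs(HTMLParser):
    """Collect (realURL, displayedURL) pairs from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._cur_href = None   # href of the <a> tag currently open, if any
        self._cur_text = []     # text seen inside that <a> tag

    def _close_anchor(self):
        if self._cur_href is not None:
            displayed = "".join(self._cur_text).strip()
            self.pairs.append((self._cur_href, displayed))
        self._cur_href = None
        self._cur_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # A nested <a> is treated as </a><a ...>.
            self._close_anchor()
            self._cur_href = dict(attrs).get("href")
        # Other tags (e.g. <b>) are dropped; their contents are kept.

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_anchor()

    def handle_data(self, data):
        if self._cur_href is not None:
            self._cur_text.append(data)


def anchor_pairs(html):
    parser = AnchorPairs()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Feeding it the first anchor of the example above yields the pair evilurl / www.paypal.com.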
204
+
205
+\subsection{Simple patterns\label{sec:Simple-patterns}}
206
+
207
+Simple patterns are matched literally, i.e. if you say: 
208
+
209
+\begin{quote}
210
+www.google.com
211
+\end{quote}
212
+it is going to match \emph{www.google.com}, and only that. The \emph{.
213
+(dot)} character has no special meaning (see the section on regexes
214
+\vref{sec:Regular-expressions} for how the \emph{.(dot)} character
215
+behaves there).
216
+
217
+
218
+\subsection{Regular expressions\label{sec:Regular-expressions}}
219
+
220
+POSIX regular expressions are supported, and you can consider that
221
+internally it is wrapped by \emph{\textasciicircum{}}, and \emph{\$.}
222
+In other words, this means that the regular expression has to match
223
+the entire concatenated (see section \vref{sub:RealURL,-displayedURL-concatenation}
224
+for details on concatenation) url.
225
+
226
+It is recommended that you read section \vref{sec:Introduction-to-regular}
227
+to learn how to write regular expressions, and then come back and
228
+read this for hints.
229
+
230
+Be advised that clamav contains an internal, very basic regex matcher
231
+to reduce the load on the regex matching core. Thus it is recommended
232
+that you avoid using regex syntax not supported by it at the very
233
+beginning of regexes (at least the first few characters).
234
+
235
+Currently the clamav regex matcher supports:
236
+
237
+\begin{itemize}
238
+\item . (dot) character
239
+\item \textbackslash{} (escaping special characters)
240
+\item | (pipe) alternatives
241
+\item {[}] (character classes)
242
+\item () (parentheses for grouping, but no group extraction is performed)
243
+\item other non-special characters
244
+\end{itemize}
245
+Thus the following are not supported:
246
+
247
+\begin{itemize}
248
+\item + repetition
249
+\item {*} repetition
250
+\item \{\} repetition
251
+\item backreferences
252
+\item lookaround
253
+\item other {}``advanced'' features not listed in the supported list ;)
254
+\end{itemize}
255
+This however shouldn't discourage you from using the {}``not directly
256
+supported features'', because if the internal engine encounters
257
+unsupported syntax, it passes it on to the POSIX regex core (beginning
258
+from the first unsupported token, everything before that is still
259
+processed by the internal matcher). An example might make this more
260
+clear:
261
+
262
+\emph{www\textbackslash{}.google\textbackslash{}.(com|ro|it) ({[}a-zA-Z])+\textbackslash{}.google\textbackslash{}.(com|ro|it)}
263
+
264
+Everything till \emph{({[}a-zA-Z])+} is processed internally, that
265
+parenthesis (and everything beyond) is processed by the POSIX core.
266
+
267
+Examples of url pairs that match: 
268
+
269
+\begin{itemize}
270
+\item \emph{www.google.ro images.google.ro}
271
+\item www.google.com images.google.ro
272
+\end{itemize}
273
+Examples of url pairs that don't match:
274
+
275
+\begin{itemize}
276
+\item www.google.ro images1.google.ro
277
+\item images.google.com image.google.com
278
+\end{itemize}
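The matching and non-matching pairs above can be checked with a short Python sketch. Note the assumptions: Python's `re` module stands in for the POSIX regex core (the syntax used here behaves the same in both), `re.fullmatch` plays the role of the implied \^{} and \$ anchors, and the pair is joined with a space as in this example. `pair_matches` is a hypothetical helper, not part of clamav.

```python
import re

# The regex from the example above, written as a Python raw string.
PATTERN = r"www\.google\.(com|ro|it) ([a-zA-Z])+\.google\.(com|ro|it)"


def pair_matches(pattern, real_url, displayed_url, sep=" "):
    """Match the pattern against the whole concatenated url pair."""
    line = real_url + sep + displayed_url
    # fullmatch requires the entire string to match, like ^pattern$.
    return re.fullmatch(pattern, line) is not None
```

`images1.google.ro` is rejected because the digit falls outside the character class, and `images.google.com image.google.com` is rejected because the first url does not start with `www`.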
279
+
280
+\subsection{Flags\label{sec:Flags}}
281
+
282
+Flags are a binary OR of the following numbers:
283
+
284
+\begin{description}
285
+\item [{HOST\_SUFFICIENT}] 1
286
+\item [{DOMAIN\_SUFFICIENT}] 2
287
+\item [{DO\_REVERSE\_LOOKUP}] 4
288
+\item [{CHECK\_REDIR}] 8
289
+\item [{CHECK\_SSL}] 16 
290
+\item [{CHECK\_CLOAKING}] 32
291
+\item [{CLEANUP\_URL}] 64 
292
+\item [{CHECK\_DOMAIN\_REVERSE}] 128 
293
+\item [{CHECK\_IMG\_URL}] 256 
294
+\item [{DOMAINLIST\_REQUIRED}] 512 
295
+\end{description}
296
+The names of the constants are self-explanatory.
297
+
298
+These constants are defined in libclamav/phishcheck.h, you can check
299
+there for the latest flags.
300
+
301
+There is a default set of flags that are enabled, these are currently:
302
+(CLEANUP\_URL|DOMAIN\_SUFFICIENT|CHECK\_SSL|CHECK\_CLOAKING|DOMAINLIST\_REQUIRED|CHECK\_IMG\_URL),
303
+SSL checking is currently performed only for \emph{a} tags.
304
+
305
+You must decide for each line in the domainlist if you want to filter
306
+any flags (that is you don't want certain checks to be done), and
307
+then calculate the binary OR of those constants, and then convert
308
+it into a 3-digit hexnumber. For example you decide that domain\_sufficient
309
+shouldn't be used for ebay.com, and you don't want to check images
310
+either, so you come up with this flag number: $2|256\Rightarrow258\mbox{ (decimal)}\Rightarrow102\mbox{ (hexadecimal)}$
311
+
312
+So you add this line to daily.pdb:
313
+
314
+\begin{itemize}
315
+\item R102~www.ebay.com~.+
316
+\end{itemize}
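The arithmetic for a filter number can be double-checked with a small Python sketch. The flag values are copied from the list in this section; the helper name `filter_number` is made up for this example and is not part of clamav.

```python
# Flag values as listed in this section (see libclamav/phishcheck.h
# for the authoritative, current definitions).
FLAGS = {
    "HOST_SUFFICIENT": 1,
    "DOMAIN_SUFFICIENT": 2,
    "DO_REVERSE_LOOKUP": 4,
    "CHECK_REDIR": 8,
    "CHECK_SSL": 16,
    "CHECK_CLOAKING": 32,
    "CLEANUP_URL": 64,
    "CHECK_DOMAIN_REVERSE": 128,
    "CHECK_IMG_URL": 256,
    "DOMAINLIST_REQUIRED": 512,
}


def filter_number(*names):
    """Binary OR of the named flags, as a 3-digit hexadecimal string."""
    value = 0
    for name in names:
        value |= FLAGS[name]
    return format(value, "03x")
```

Calling `filter_number("DOMAIN_SUFFICIENT", "CHECK_IMG_URL")` reproduces the 102 used in the example above.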
317
+
318
+\section{Introduction to regular expressions\label{sec:Introduction-to-regular}}
319
+
320
+Recommended reading:
321
+
322
+\begin{itemize}
323
+\item http://www.regular-expressions.info/quickstart.html
324
+\item http://www.regular-expressions.info/tutorial.html
325
+\item regex(7) man-page: http://www.tin.org/bin/man.cgi?section=7\&topic=regex
326
+\end{itemize}
327
+
328
+\subsection{Special characters}
329
+
330
+\begin{description}
331
+\item [{{[}}] the opening square bracket - it marks the beginning of a
332
+character class, see section \vref{sub:Character-classes}
333
+\item [{\textbackslash{}}] the backslash - escapes special characters,
334
+see section \vref{sub:Escaping}
335
+\item [{\^{ }}] the caret - matches the beginning of a line (not needed
336
+in clamav regexes, this is implied)
337
+\item [{\$}] the dollar sign - matches the end of a line (not needed in
338
+clamav regexes, this is implied)
339
+\item [{\.{ }}] the period or dot - matches \emph{any} character
340
+\item [{|}] the vertical bar or pipe symbol - matches either of the tokens
341
+on its left and right side, see section \vref{sub:Alternation}
342
+\item [{?}] the question mark - matches optionally the left-side token,
343
+see section \vref{sub:Optional-matching,-and}
344
+\item [{{*}}] the asterisk or star - matches 0 or more occurrences of the
345
+left-side token, see section \vref{sub:Optional-matching,-and}
346
+\item [{+}] the plus sign - matches 1 or more occurrences of the left-side
347
+token, see section \vref{sub:Optional-matching,-and}
348
+\item [{(}] the opening round bracket - marks the beginning of a group,
349
+see section \vref{sub:Groups}
350
+\item [{)}] the closing round bracket - marks the end of a group, see section \vref{sub:Groups}
351
+\end{description}
352
+
353
+\subsection{Character classes\label{sub:Character-classes}}
354
+
355
+
356
+\subsection{Escaping\label{sub:Escaping}}
357
+
358
+Escaping has two purposes: 
359
+
360
+\begin{itemize}
361
+\item it allows you to actually match the special characters themselves,
362
+for example to match the literal \emph{+}, you would write \emph{\textbackslash{}+}
363
+\item it also allows you to match non-printable characters, such as the
364
+tab (\emph{\textbackslash{}t}), newline (\emph{\textbackslash{}n}),
365
+etc.
366
+\end{itemize}
367
+However since non-printable characters are not valid inside an url,
368
+you won't have a reason to use them.
369
+
370
+
371
+\subsection{Alternation\label{sub:Alternation}}
372
+
373
+
374
+\subsection{Optional matching, and repetition\label{sub:Optional-matching,-and}}
375
+
376
+
377
+\subsection{Groups\label{sub:Groups}}
378
+
379
+Groups are usually used together with repetition, or alternation.
380
+For example: \emph{(com|it)+} means: match 1 or more repetitions of
381
+\emph{com} or \emph{it,} that is it matches: com, it, comcom, comcomcom,
382
+comit, itit, ititcom,... you get the idea.
383
+
384
+Groups can also be used to extract substrings, but this is not supported
385
+by the clam engine, and not needed either in this case.
386
+
387
+
388
+\section{How to create database files}
389
+
390
+
391
+\subsection{How to create and maintain the whitelist (daily.wdb)}
392
+
393
+If the phishing code claims that a certain mail is phishing, but it's
394
+not, you have 2 choices:
395
+
396
+\begin{itemize}
397
+\item examine your rules in daily.pdb, and fix them if necessary (see section \vref{sub:How-to-create})
398
+\item add it to the whitelist (discussed here)
399
+\end{itemize}
400
+Let's assume you are having problems because of links like this in
401
+a mail:
402
+
403
+\begin{quote}
404
+<a href=''http://69.0.241.57/bCentral/L.asp?L=XXXXXXXX''>http://www.bcentral.it/</a>
405
+\end{quote}
406
+After investigating those sites further, you decide they are no threat,
407
+and create a line like this in daily.wdb:
408
+
409
+\begin{quote}
410
+R http://www\textbackslash{}.bcentral\textbackslash{}.it/.+ http://69\textbackslash{}.0\textbackslash{}.241\textbackslash{}.57/bCentral/L\textbackslash{}.asp?L=.+ 
411
+\end{quote}
412
+Note: urls like the above can be used to track unique mail recipients,
413
+and thus know if somebody actually reads mails (so they can send more
414
+spam). However since this site required no authentication information,
415
+it is safe from a phishing point of view.
416
+
417
+
418
+\subsection{How to create and maintain the domainlist (daily.pdb)\label{sub:How-to-create}}
419
+
420
+When not using \verb+--phish-scan-alldomains+ (production environments for
421
+example), you need to decide which urls you are going to check. 
422
+
423
+Although at first glance it might seem a good idea to check everything,
424
+it would produce false positives. Particularly newsletters, ads, etc.
425
+are likely to use URLs that look like phishing attempts.
426
+
427
+Let's assume that you've recently seen many phishing attempts claiming
428
+they come from Paypal. Thus you need to add paypal to daily.pdb:
429
+
430
+\begin{quote}
431
+R .+ .+\textbackslash{}.paypal\textbackslash{}.com
432
+\end{quote}
433
+The above line will block (detect as phishing) mails that contain
434
+urls that claim to lead to paypal, but in fact do not.
435
+
436
+Be careful not to create regexes that match too broad a range of
437
+urls though.
438
+
439
+
440
+\subsection{Dealing with false positives, and undetected phishing mails}
441
+
442
+
443
+\subsubsection{False positives}
444
+
445
+Whenever you see a false positive (mail that is detected as phishing,
446
+but it's not), you need to examine \emph{why} clamav decided that it's
447
+phishing. You can do this easily by building clamav with debugging
448
+(\verb+./configure --enable-experimental --enable-debug+), and then running
449
+a tool:
450
+
451
+\begin{quote}
452
+\$contrib/phishing/why.py phishing.eml
453
+\end{quote}
454
+This will show the url that triggers the phish verdict, and a reason
455
+why that url is considered a phishing attempt.
456
+
457
+Once you know the reason, you might need to modify daily.pdb (if one
458
+of your rules in there is too broad), or you need to add the url
459
+to daily.wdb. If you think the algorithm is incorrect, please file
460
+a bug report on bugzilla.clamav.net, including the output of \emph{why.py}.
461
+
462
+
463
+\subsubsection{Undetected phish mails}
464
+
465
+Using why.py doesn't help here unfortunately (it will say: clean),
466
+so all you can do is:
467
+
468
+\begin{quote}
469
+\verb+$ clamscan/clamscan --phish-scan-alldomains undetected.eml+
470
+\end{quote}
471
+Then see if the mail is detected; if it is, you need to add an appropriate
472
+line to daily.pdb (see section \vref{sub:How-to-create}).
473
+
474
+If the mail is not detected, then try using:
475
+
476
+\begin{quote}
477
+\verb+$ clamscan/clamscan --debug undetected.eml | less+
478
+\end{quote}
479
+
480
+Then see what urls are being checked, see if any of them is in a
481
+whitelist, see if all urls are detected, etc.
482
+
483
+
484
+\section{Hints and recommendations}
485
+
486
+
487
+\section{Examples}
488
+
489
+
490
+\end{document}