git-svn: trunk@3508
Török Edvin authored on 2008/01/19 00:28:05... | ... |
@@ -1,3 +1,8 @@ |
1 |
+Fri Jan 18 17:01:25 EET 2008 (edwin) |
|
2 |
+------------------------------------ |
|
3 |
+ * docs/phishsigs_howto.tex/.pdf: update documentation. Part I, more to come. |
|
4 |
+ (bb #554). |
|
5 |
+ |
|
1 | 6 |
Fri Jan 18 12:13:16 CET 2008 (acab) |
2 | 7 |
----------------------------------- |
3 | 8 |
* test: Storing the testfiles byteswapped to avoid detection of the tarball. |
4 | 9 |
deleted file mode 100644 |
... | ... |
@@ -1,1363 +0,0 @@ |
1 |
-#LyX 1.4.2 created this file. For more info see http://www.lyx.org/ |
|
2 |
-\lyxformat 245 |
|
3 |
-\begin_document |
|
4 |
-\begin_header |
|
5 |
-\textclass article |
|
6 |
-\language english |
|
7 |
-\inputencoding auto |
|
8 |
-\fontscheme pslatex |
|
9 |
-\graphics default |
|
10 |
-\paperfontsize default |
|
11 |
-\spacing single |
|
12 |
-\papersize a4paper |
|
13 |
-\use_geometry false |
|
14 |
-\use_amsmath 1 |
|
15 |
-\cite_engine basic |
|
16 |
-\use_bibtopic false |
|
17 |
-\paperorientation portrait |
|
18 |
-\secnumdepth 3 |
|
19 |
-\tocdepth 3 |
|
20 |
-\paragraph_separation indent |
|
21 |
-\defskip medskip |
|
22 |
-\quotes_language english |
|
23 |
-\papercolumns 1 |
|
24 |
-\papersides 1 |
|
25 |
-\paperpagestyle default |
|
26 |
-\tracking_changes false |
|
27 |
-\output_changes false |
|
28 |
-\end_header |
|
29 |
- |
|
30 |
-\begin_body |
|
31 |
- |
|
32 |
-\begin_layout Title |
|
33 |
- |
|
34 |
-\family roman |
|
35 |
-\series medium |
|
36 |
-\shape up |
|
37 |
-\size normal |
|
38 |
-\emph off |
|
39 |
-\bar no |
|
40 |
-\noun off |
|
41 |
-\color none |
|
42 |
-Phishing signatures creation HOWTO |
|
43 |
-\end_layout |
|
44 |
- |
|
45 |
-\begin_layout Author |
|
46 | ||
47 |
-\end_layout |
|
48 |
- |
|
49 |
-\begin_layout Section |
|
50 |
-Database file format |
|
51 |
-\end_layout |
|
52 |
- |
|
53 |
-\begin_layout Standard |
|
54 |
-The database file format is common for the whitelist (.wdb), and domainlist |
|
55 |
- (.pdb), and it consists of (multiple) lines of form: |
|
56 |
-\end_layout |
|
57 |
- |
|
58 |
-\begin_layout Standard |
|
59 |
- |
|
60 |
-\series bold |
|
61 |
-Flags\InsetSpace ~ |
|
62 |
-RealURL\InsetSpace ~ |
|
63 |
-DisplayedURL |
|
64 |
-\end_layout |
|
65 |
- |
|
66 |
-\begin_layout Itemize |
|
67 |
-Where |
|
68 |
-\noun on |
|
69 |
-Flags |
|
70 |
-\noun default |
|
71 |
- is: |
|
72 |
-\end_layout |
|
73 |
- |
|
74 |
-\begin_deeper |
|
75 |
-\begin_layout Itemize |
|
76 |
-an (optional) character : |
|
77 |
-\end_layout |
|
78 |
- |
|
79 |
-\begin_deeper |
|
80 |
-\begin_layout Description |
|
81 |
-R regex, has to match entire url, see section |
|
82 |
-\end_layout |
|
83 |
- |
|
84 |
-\begin_layout Description |
|
85 |
-H has to match the host part of url only (a simple pattern, i.e. |
|
86 |
- it is matched literally) |
|
87 |
-\end_layout |
|
88 |
- |
|
89 |
-\begin_layout Description |
|
90 |
-no\InsetSpace ~ |
|
91 |
-character matches the entire url, but as a simple pattern (non-regex) |
|
92 |
-\end_layout |
|
93 |
- |
|
94 |
-\end_deeper |
|
95 |
-\begin_layout Itemize |
|
96 |
-followed by an (optional) 3-digit hexadecimal number representing flags |
|
97 |
- that should be filtered. |
|
98 |
-\end_layout |
|
99 |
- |
|
100 |
-\begin_deeper |
|
101 |
-\begin_layout Itemize |
|
102 |
-flag filtering only makes sense in .pdb files, (however clamav won't complain |
|
103 |
- if you put flags in .wdb files, it just won't use them) |
|
104 |
-\end_layout |
|
105 |
- |
|
106 |
-\begin_layout Itemize |
|
107 |
-for details on how to construct a flag number see section |
|
108 |
-\begin_inset LatexCommand \prettyref{sec:Flags} |
|
109 |
- |
|
110 |
-\end_inset |
|
111 |
- |
|
112 |
- |
|
113 |
-\end_layout |
|
114 |
- |
|
115 |
-\end_deeper |
|
116 |
-\end_deeper |
|
117 |
-\begin_layout Itemize |
|
118 |
- |
|
119 |
-\noun on |
|
120 |
-RealURL |
|
121 |
-\noun default |
|
122 |
-is the URL the user is sent to |
|
123 |
-\end_layout |
|
124 |
- |
|
125 |
-\begin_layout Itemize |
|
126 |
- |
|
127 |
-\noun on |
|
128 |
-displayedURL |
|
129 |
-\noun default |
|
130 |
- is the URL description displayed to the user, that is where it is |
|
131 |
-\emph on |
|
132 |
-claimed |
|
133 |
-\emph default |
|
134 |
- they are sent, the most obvious example is that of an html anchor (<a>tag): |
|
135 |
- its href attribute is the |
|
136 |
-\noun on |
|
137 |
-realURL |
|
138 |
-\noun default |
|
139 |
-, and its contents is the |
|
140 |
-\noun on |
|
141 |
-displayedURL |
|
142 |
-\end_layout |
|
143 |
- |
|
144 |
-\begin_layout Itemize |
|
145 |
-see section |
|
146 |
-\begin_inset LatexCommand \vref{sub:Extraction-of-realURL,} |
|
147 |
- |
|
148 |
-\end_inset |
|
149 |
- |
|
150 |
- for more details on what |
|
151 |
-\noun on |
|
152 |
-realURL/displayedURL |
|
153 |
-\noun default |
|
154 |
- is |
|
155 |
-\end_layout |
|
156 |
- |
|
157 |
-\begin_layout Standard |
|
158 |
-Note: The spaces are mandatory, and empty lines are skipped. |
|
159 |
-\end_layout |
|
160 |
- |
|
161 |
-\begin_layout Standard |
|
162 |
-If any of the lines of daily.wdb/daily.pdb don't conform to the above file |
|
163 |
- format, the loading of the file shall fail, and whitelist/domainlist feature |
|
164 |
- will be disabled. |
|
165 |
- If the loading of the whitelist fails, the phishing checks will be disabled |
|
166 |
- entirely. |
|
167 |
-\end_layout |
|
168 |
- |
|
169 |
-\begin_layout Standard |
|
170 |
-Therefore it is important to test the daily.wdb/daily.pdb before packing it |
|
171 |
- into daily.cvd! |
|
172 |
-\end_layout |
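A minimal sketch of how a database file could be sanity-checked before packing, based only on the line format described above. This is not ClamAV's own loader: the names `parse_line` and `check_file` are hypothetical, and the sketch assumes that everything after the first space is the pattern (RealURL, a space, DisplayedURL).

```python
import re

# Optional R or H character, optional 3-digit hex number, then the mandatory space.
_LINE_RE = re.compile(r"^([RH]?)([0-9A-Fa-f]{3})? (.+)$")


def parse_line(line):
    """Split one database line into (kind, flags, pattern).

    Returns None for an empty line; raises ValueError for a malformed one.
    """
    line = line.rstrip("\r\n")
    if not line:
        return None  # empty lines are skipped
    match = _LINE_RE.match(line)
    if match is None:
        raise ValueError(f"malformed line: {line!r}")
    kind, hex_flags, pattern = match.groups()
    if " " not in pattern:
        # RealURL and DisplayedURL must be separated by a space.
        raise ValueError(f"missing DisplayedURL: {line!r}")
    flags = int(hex_flags, 16) if hex_flags else None
    return kind, flags, pattern


def check_file(lines):
    """Return the 1-based numbers of the lines that fail parse_line."""
    bad = []
    for number, line in enumerate(lines, start=1):
        try:
            parse_line(line)
        except ValueError:
            bad.append(number)
    return bad
```

An empty result from `check_file` only means the lines have the expected shape; it does not prove that ClamAV will load the file, so testing with ClamAV itself is still needed.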
|
173 |
- |
|
174 |
-\begin_layout Subsubsection |
|
175 |
-Example |
|
176 |
-\end_layout |
|
177 |
- |
|
178 |
-\begin_layout Standard |
|
179 |
-The following line: |
|
180 |
-\end_layout |
|
181 |
- |
|
182 |
-\begin_layout Standard |
|
183 |
- |
|
184 |
-\emph on |
|
185 |
-R http://www |
|
186 |
-\backslash |
|
187 |
-.google |
|
188 |
-\backslash |
|
189 |
-.(com|ro|it) www |
|
190 |
-\backslash |
|
191 |
-.google |
|
192 |
-\backslash |
|
193 |
-.com |
|
194 |
-\end_layout |
|
195 |
- |
|
196 |
-\begin_layout Standard |
|
197 |
-Means: |
|
198 |
-\emph on |
|
199 |
-\noun on |
|
200 |
-R |
|
201 |
-\emph default |
|
202 |
- |
|
203 |
-\noun default |
|
204 |
-- this is a regex. |
|
205 |
- |
|
206 |
-\end_layout |
|
207 |
- |
|
208 |
-\begin_layout Standard |
|
209 |
-Example of url pairs matching: http://www.google.com www.google.com, http://www.googl |
|
210 |
-e.it www.google.com. |
|
211 |
-\end_layout |
|
212 |
- |
|
213 |
-\begin_layout Standard |
|
214 |
-Example of url pairs not matching: http://www.google.c0m www.google.com |
|
215 |
-\end_layout |
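The matching and non-matching pairs above can be checked with a short sketch. It assumes the example pattern reads `http://www\.google\.(com|ro|it) www\.google\.com` (the host part is inferred from the example pairs) and uses Python's `re` module as a stand-in for the POSIX matcher, so it illustrates the whole-line rule rather than ClamAV's exact behaviour. The helper name `pair_matches` is hypothetical.

```python
import re

# Assumed pattern: the host part ".google" is inferred from the example pairs.
PATTERN = r"http://www\.google\.(com|ro|it) www\.google\.com"


def pair_matches(real_url, displayed_url, pattern=PATTERN):
    """Join the pair with one space and require the regex to match all of it."""
    line = f"{real_url} {displayed_url}"
    return re.fullmatch(pattern, line) is not None


print(pair_matches("http://www.google.com", "www.google.com"))
print(pair_matches("http://www.google.it", "www.google.com"))
print(pair_matches("http://www.google.c0m", "www.google.com"))
```

`re.fullmatch` plays the role of the implied `^` and `$` anchors: a pattern that matches only part of the concatenated line is not a match.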
|
216 |
- |
|
217 |
-\begin_layout Subsection |
|
218 |
-How matching works |
|
219 |
-\end_layout |
|
220 |
- |
|
221 |
-\begin_layout Subsubsection |
|
222 |
-RealURL, displayedURL concatenation |
|
223 |
-\begin_inset LatexCommand \label{sub:RealURL,-displayedURL-concatenation} |
|
224 |
- |
|
225 |
-\end_inset |
|
226 |
- |
|
227 |
- |
|
228 |
-\end_layout |
|
229 |
- |
|
230 |
-\begin_layout Standard |
|
231 |
-The phishing detection module processes pairs of realURL/displayedURL, and |
|
232 |
- the matching against daily.wdb/daily.pdb is done as follows: the realURL |
|
233 |
- is concatenated with a space, and with the displayedURL, then that |
|
234 |
-\emph on |
|
235 |
-line |
|
236 |
-\emph default |
|
237 |
-is matched against the lines in daily.wdb/daily.pdb |
|
238 |
-\end_layout |
|
239 |
- |
|
240 |
-\begin_layout Standard |
|
241 |
-So if you have a line like |
|
242 |
-\end_layout |
|
243 |
- |
|
244 |
-\begin_layout Standard |
|
245 |
- |
|
246 |
-\shape italic |
|
247 |
-\InsetSpace ~ |
|
248 |
-www.google.ro\InsetSpace ~ |
|
249 |
-www.google.com |
|
250 |
-\end_layout |
|
251 |
- |
|
252 |
-\begin_layout Standard |
|
253 |
-and a href like: |
|
254 |
-\emph on |
|
255 |
-<a href= |
|
256 |
-\begin_inset Quotes erd |
|
257 |
-\end_inset |
|
258 |
- |
|
259 |
-http://www.google.ro |
|
260 |
-\begin_inset Quotes erd |
|
261 |
-\end_inset |
|
262 |
- |
|
263 |
->www.google.com</a>, |
|
264 |
-\emph default |
|
265 |
-then it will match, but: |
|
266 |
-\emph on |
|
267 |
-<a href= |
|
268 |
-\begin_inset Quotes erd |
|
269 |
-\end_inset |
|
270 |
- |
|
271 |
-http://images.google.com |
|
272 |
-\begin_inset Quotes erd |
|
273 |
-\end_inset |
|
274 |
- |
|
275 |
->www.google.com</a> |
|
276 |
-\emph default |
|
277 |
- will not match. |
|
278 |
-\end_layout |
|
279 |
- |
|
280 |
-\begin_layout Standard |
|
281 |
-If you use the |
|
282 |
-\series bold |
|
283 |
-\noun on |
|
284 |
-H |
|
285 |
-\noun default |
|
286 |
- |
|
287 |
-\series default |
|
288 |
-flag, then the 2nd href will match too. |
|
289 |
-\end_layout |
|
290 |
- |
|
291 |
-\begin_layout Subsubsection |
|
292 |
-What happens when a match is found |
|
293 |
-\end_layout |
|
294 |
- |
|
295 |
-\begin_layout Standard |
|
296 |
-In the case of the whitelist, a match means that the realURL/displayedURL |
|
297 |
- combination is considered |
|
298 |
-\noun on |
|
299 |
-clean |
|
300 |
-\noun default |
|
301 |
-, and no further checks are performed on it. |
|
302 |
-\end_layout |
|
303 |
- |
|
304 |
-\begin_layout Standard |
|
305 |
-In the case of the domainlist, a match means that the realURL/displayedURL |
|
306 |
- is going to be checked for phishing attempts. |
|
307 |
- This is only done if you don't run clamav with the |
|
308 |
-\emph on |
|
309 |
-alldomains |
|
310 |
-\emph default |
|
311 |
- option (since then all urls are checked). |
|
312 |
- Furthermore you can restrict what checks are to be performed by specifying |
|
313 |
- the 3-digit hexnumber. |
|
314 |
-\end_layout |
|
315 |
- |
|
316 |
-\begin_layout Subsubsection |
|
317 |
-Extraction of |
|
318 |
-\noun on |
|
319 |
-realURL |
|
320 |
-\noun default |
|
321 |
-, |
|
322 |
-\noun on |
|
323 |
-displayedURL |
|
324 |
-\noun default |
|
325 |
- from HTML tags |
|
326 |
-\begin_inset LatexCommand \label{sub:Extraction-of-realURL,} |
|
327 |
- |
|
328 |
-\end_inset |
|
329 |
- |
|
330 |
- |
|
331 |
-\end_layout |
|
332 |
- |
|
333 |
-\begin_layout Standard |
|
334 |
-The html parser extracts pairs of |
|
335 |
-\noun on |
|
336 |
-realURL |
|
337 |
-\noun default |
|
338 |
-/ |
|
339 |
-\noun on |
|
340 |
-displayedURL |
|
341 |
-\noun default |
|
342 |
- based on the following rules: |
|
343 |
-\end_layout |
|
344 |
- |
|
345 |
-\begin_layout Description |
|
346 |
-a (anchor) the |
|
347 |
-\emph on |
|
348 |
-href |
|
349 |
-\emph default |
|
350 |
- is the |
|
351 |
-\noun on |
|
352 |
-realURL |
|
353 |
-\noun default |
|
354 |
-, its |
|
355 |
-\emph on |
|
356 |
-contents |
|
357 |
-\emph default |
|
358 |
- is the |
|
359 |
-\noun on |
|
360 |
-displayedURL |
|
361 |
-\end_layout |
|
362 |
- |
|
363 |
-\begin_deeper |
|
364 |
-\begin_layout Description |
|
365 |
-contents is the tag-stripped contents of the <a> tags, so for example <b> |
|
366 |
- tags are stripped (but not their contents) |
|
367 |
-\end_layout |
|
368 |
- |
|
369 |
-\begin_layout Standard |
|
370 |
-nesting another <a> tag within an <a> tag (besides being invalid html) |
|
371 |
- is treated as a </a><a.. |
|
372 |
-\end_layout |
|
373 |
- |
|
374 |
-\end_deeper |
|
375 |
-\begin_layout Description |
|
376 |
-form the |
|
377 |
-\emph on |
|
378 |
-action |
|
379 |
-\emph default |
|
380 |
-attribute is the |
|
381 |
-\noun on |
|
382 |
-realURL |
|
383 |
-\noun default |
|
384 |
-, and a nested <a> tag is the |
|
385 |
-\noun on |
|
386 |
-displayedURL |
|
387 |
-\end_layout |
|
388 |
- |
|
389 |
-\begin_layout Description |
|
390 |
-img/area if nested within an |
|
391 |
-\emph on |
|
392 |
- <a> |
|
393 |
-\emph default |
|
394 |
- tag, the |
|
395 |
-\noun on |
|
396 |
-realURL |
|
397 |
-\noun default |
|
398 |
- is the |
|
399 |
-\emph on |
|
400 |
-href |
|
401 |
-\emph default |
|
402 |
- of the a tag, and the |
|
403 |
-\emph on |
|
404 |
-src/dynsrc/area |
|
405 |
-\emph default |
|
406 |
- is the |
|
407 |
-\noun on |
|
408 |
-displayedURL |
|
409 |
-\noun default |
|
410 |
- of the img |
|
411 |
-\end_layout |
|
412 |
- |
|
413 |
-\begin_deeper |
|
414 |
-\begin_layout Standard |
|
415 |
-if nested within a |
|
416 |
-\emph on |
|
417 |
-form |
|
418 |
-\emph default |
|
419 |
- tag, then the action attribute of the |
|
420 |
-\emph on |
|
421 |
-form |
|
422 |
-\emph default |
|
423 |
- tag is the |
|
424 |
-\noun on |
|
425 |
-realURL |
|
426 |
-\noun default |
|
427 |
- |
|
428 |
-\end_layout |
|
429 |
- |
|
430 |
-\end_deeper |
|
431 |
-\begin_layout Description |
|
432 |
-iframe if nested within an |
|
433 |
-\emph on |
|
434 |
-<a> |
|
435 |
-\emph default |
|
436 |
- tag the |
|
437 |
-\emph on |
|
438 |
-src |
|
439 |
-\emph default |
|
440 |
- attribute is the displayedURL, and the |
|
441 |
-\emph on |
|
442 |
-href |
|
443 |
-\emph default |
|
444 |
- of its parent |
|
445 |
-\emph on |
|
446 |
- a |
|
447 |
-\emph default |
|
448 |
- tag is the |
|
449 |
-\noun on |
|
450 |
-realURL |
|
451 |
-\end_layout |
|
452 |
- |
|
453 |
-\begin_deeper |
|
454 |
-\begin_layout Standard |
|
455 |
-if nested within a |
|
456 |
-\emph on |
|
457 |
-form |
|
458 |
-\emph default |
|
459 |
- tag, then the action attribute of the |
|
460 |
-\emph on |
|
461 |
-form |
|
462 |
-\emph default |
|
463 |
- tag is the |
|
464 |
-\noun on |
|
465 |
-realURL |
|
466 |
-\end_layout |
|
467 |
- |
|
468 |
-\end_deeper |
|
469 |
-\begin_layout Subsubsection |
|
470 |
-Example |
|
471 |
-\end_layout |
|
472 |
- |
|
473 |
-\begin_layout Standard |
|
474 |
-Consider this html file: |
|
475 |
-\end_layout |
|
476 |
- |
|
477 |
-\begin_layout Quote |
|
478 |
- |
|
479 |
-\emph on |
|
480 |
-<a href= |
|
481 |
-\begin_inset Quotes erd |
|
482 |
-\end_inset |
|
483 |
- |
|
484 |
-evilurl |
|
485 |
-\begin_inset Quotes erd |
|
486 |
-\end_inset |
|
487 |
- |
|
488 |
->www.paypal.com</a> |
|
489 |
-\end_layout |
|
490 |
- |
|
491 |
-\begin_layout Quote |
|
492 |
- |
|
493 |
-\emph on |
|
494 |
-<a href= |
|
495 |
-\begin_inset Quotes erd |
|
496 |
-\end_inset |
|
497 |
- |
|
498 |
-evilurl2 |
|
499 |
-\begin_inset Quotes erd |
|
500 |
-\end_inset |
|
501 |
- |
|
502 |
- title= |
|
503 |
-\begin_inset Quotes erd |
|
504 |
-\end_inset |
|
505 |
- |
|
506 |
-www.ebay.com |
|
507 |
-\begin_inset Quotes erd |
|
508 |
-\end_inset |
|
509 |
- |
|
510 |
->click here to sign in</a> |
|
511 |
-\end_layout |
|
512 |
- |
|
513 |
-\begin_layout Quote |
|
514 |
- |
|
515 |
-\emph on |
|
516 |
-<form action= |
|
517 |
-\begin_inset Quotes erd |
|
518 |
-\end_inset |
|
519 |
- |
|
520 |
-evilurl_form |
|
521 |
-\begin_inset Quotes erd |
|
522 |
-\end_inset |
|
523 |
- |
|
524 |
-> |
|
525 |
-\end_layout |
|
526 |
- |
|
527 |
-\begin_layout Quote |
|
528 |
- |
|
529 |
-\emph on |
|
530 |
-Please sign in to <a href= |
|
531 |
-\begin_inset Quotes erd |
|
532 |
-\end_inset |
|
533 |
- |
|
534 |
-cgi.ebay.com |
|
535 |
-\begin_inset Quotes erd |
|
536 |
-\end_inset |
|
537 |
- |
|
538 |
->Ebay</a> using this form |
|
539 |
-\end_layout |
|
540 |
- |
|
541 |
-\begin_layout Quote |
|
542 |
- |
|
543 |
-\emph on |
|
544 |
-<input type='text' name='username'>Username</input> |
|
545 |
-\end_layout |
|
546 |
- |
|
547 |
-\begin_layout Quote |
|
548 |
- |
|
549 |
-\emph on |
|
550 |
-.... |
|
551 |
-\end_layout |
|
552 |
- |
|
553 |
-\begin_layout Quote |
|
554 |
- |
|
555 |
-\emph on |
|
556 |
-</form> |
|
557 |
-\end_layout |
|
558 |
- |
|
559 |
-\begin_layout Quote |
|
560 |
- |
|
561 |
-\emph on |
|
562 |
-<a href= |
|
563 |
-\begin_inset Quotes erd |
|
564 |
-\end_inset |
|
565 |
- |
|
566 |
-evilurl |
|
567 |
-\begin_inset Quotes erd |
|
568 |
-\end_inset |
|
569 |
- |
|
570 |
-><img src= |
|
571 |
-\begin_inset Quotes erd |
|
572 |
-\end_inset |
|
573 |
- |
|
574 |
-images.paypal.com/secure.jpg |
|
575 |
-\begin_inset Quotes erd |
|
576 |
-\end_inset |
|
577 |
- |
|
578 |
-></a> |
|
579 |
-\end_layout |
|
580 |
- |
|
581 |
-\begin_layout Standard |
|
582 |
-The resulting |
|
583 |
-\noun on |
|
584 |
-realURL/displayedURL |
|
585 |
-\noun default |
|
586 |
- pairs will be (note that one tag can generate multiple pairs): |
|
587 |
-\end_layout |
|
588 |
- |
|
589 |
-\begin_layout Itemize |
|
590 |
-evilurl / www.paypal.com |
|
591 |
-\end_layout |
|
592 |
- |
|
593 |
-\begin_layout Itemize |
|
594 |
-evilurl2 / click here to sign in |
|
595 |
-\end_layout |
|
596 |
- |
|
597 |
-\begin_layout Itemize |
|
598 |
-evilurl2 / www.ebay.com |
|
599 |
-\end_layout |
|
600 |
- |
|
601 |
-\begin_layout Itemize |
|
602 |
-evilurl_form / cgi.ebay.com |
|
603 |
-\end_layout |
|
604 |
- |
|
605 |
-\begin_layout Itemize |
|
606 |
-cgi.ebay.com / Ebay |
|
607 |
-\end_layout |
|
608 |
- |
|
609 |
-\begin_layout Itemize |
|
610 |
-evilurl / images.paypal.com/secure.jpg |
|
611 |
-\end_layout |
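As a rough illustration of the anchor rule only (href is the realURL, the tag-stripped contents are the displayedURL, and a nested `<a>` closes the previous one), here is a sketch built on Python's standard `html.parser`. It is not ClamAV's HTML parser, it ignores the title, form, img/area and iframe rules, and the names `AnchorPairs` and `extract_pairs` are hypothetical.

```python
from html.parser import HTMLParser


class AnchorPairs(HTMLParser):
    """Collect (realURL, displayedURL) pairs from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text seen inside that <a>

    def _close_anchor(self):
        if self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # A nested <a> is treated as "</a><a ...>".
            self._close_anchor()
            self._href = dict(attrs).get("href")
        # Other tags (for example <b>) are dropped; their text is still kept.

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_anchor()

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)


def extract_pairs(html):
    """Return the list of (realURL, displayedURL) pairs found in html."""
    parser = AnchorPairs()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

For the first anchor of the example file, `extract_pairs` yields the pair `evilurl` / `www.paypal.com`; the additional pairs listed above (title attribute, form action, image source) would need the other rules.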
|
612 |
- |
|
613 |
-\begin_layout Subsection |
|
614 |
-Simple patterns |
|
615 |
-\begin_inset LatexCommand \label{sec:Simple-patterns} |
|
616 |
- |
|
617 |
-\end_inset |
|
618 |
- |
|
619 |
- |
|
620 |
-\end_layout |
|
621 |
- |
|
622 |
-\begin_layout Standard |
|
623 |
-Simple patterns are matched literally, i.e. |
|
624 |
- if you say: |
|
625 |
-\end_layout |
|
626 |
- |
|
627 |
-\begin_layout Quote |
|
628 |
-www.google.com |
|
629 |
-\end_layout |
|
630 |
- |
|
631 |
-\begin_layout Standard |
|
632 |
-it is going to match |
|
633 |
-\emph on |
|
634 |
-www.google.com |
|
635 |
-\emph default |
|
636 |
-, and only that. |
|
637 |
- The |
|
638 |
-\emph on |
|
639 |
-. |
|
640 |
- (dot) |
|
641 |
-\emph default |
|
642 |
- character has no special meaning (see the section on regexes |
|
643 |
-\begin_inset LatexCommand \vref{sec:Regular-expressions} |
|
644 |
- |
|
645 |
-\end_inset |
|
646 |
- |
|
647 |
- for how the |
|
648 |
-\emph on |
|
649 |
-.(dot) |
|
650 |
-\emph default |
|
651 |
- character behaves there) |
|
652 |
-\end_layout |
|
653 |
- |
|
654 |
-\begin_layout Subsection |
|
655 |
-Regular expressions |
|
656 |
-\begin_inset LatexCommand \label{sec:Regular-expressions} |
|
657 |
- |
|
658 |
-\end_inset |
|
659 |
- |
|
660 |
- |
|
661 |
-\end_layout |
|
662 |
- |
|
663 |
-\begin_layout Standard |
|
664 |
-POSIX regular expressions are supported, and you can consider that internally |
|
665 |
- it is wrapped by |
|
666 |
-\emph on |
|
667 |
-^ |
|
668 |
-\emph default |
|
669 |
-, and |
|
670 |
-\emph on |
|
671 |
-$. |
|
672 |
- |
|
673 |
-\emph default |
|
674 |
-In other words, this means that the regular expression has to match the |
|
675 |
- entire concatenated (see section |
|
676 |
-\begin_inset LatexCommand \vref{sub:RealURL,-displayedURL-concatenation} |
|
677 |
- |
|
678 |
-\end_inset |
|
679 |
- |
|
680 |
- for details on concatenation) url. |
|
681 |
-\end_layout |
|
682 |
- |
|
683 |
-\begin_layout Standard |
|
684 |
-It is recommended that you read section |
|
685 |
-\begin_inset LatexCommand \vref{sec:Introduction-to-regular} |
|
686 |
- |
|
687 |
-\end_inset |
|
688 |
- |
|
689 |
- to learn how to write regular expressions, and then come back and read |
|
690 |
- this for hints. |
|
691 |
-\end_layout |
|
692 |
- |
|
693 |
-\begin_layout Standard |
|
694 |
-Be advised that clamav contains an internal, very basic regex matcher to |
|
695 |
- reduce the load on the regex matching core. |
|
696 |
- Thus it is recommended that you avoid using regex syntax not supported by |
|
697 |
- it at the very beginning of regexes (at least the first few characters). |
|
698 |
-\end_layout |
|
699 |
- |
|
700 |
-\begin_layout Standard |
|
701 |
-Currently the clamav regex matcher supports: |
|
702 |
-\end_layout |
|
703 |
- |
|
704 |
-\begin_layout Itemize |
|
705 |
-. |
|
706 |
- (dot) character |
|
707 |
-\end_layout |
|
708 |
- |
|
709 |
-\begin_layout Itemize |
|
710 |
- |
|
711 |
-\backslash |
|
712 |
- (escaping special characters) |
|
713 |
-\end_layout |
|
714 |
- |
|
715 |
-\begin_layout Itemize |
|
716 |
-| (pipe) alternatives |
|
717 |
-\end_layout |
|
718 |
- |
|
719 |
-\begin_layout Itemize |
|
720 |
-[] (character classes) |
|
721 |
-\end_layout |
|
722 |
- |
|
723 |
-\begin_layout Itemize |
|
724 |
-() (parentheses for grouping, but no group extraction is performed) |
|
725 |
-\end_layout |
|
726 |
- |
|
727 |
-\begin_layout Itemize |
|
728 |
-other non-special characters |
|
729 |
-\end_layout |
|
730 |
- |
|
731 |
-\begin_layout Standard |
|
732 |
-Thus the following are not supported: |
|
733 |
-\end_layout |
|
734 |
- |
|
735 |
-\begin_layout Itemize |
|
736 |
-+ repetition |
|
737 |
-\end_layout |
|
738 |
- |
|
739 |
-\begin_layout Itemize |
|
740 |
-* repetition |
|
741 |
-\end_layout |
|
742 |
- |
|
743 |
-\begin_layout Itemize |
|
744 |
-{} repetition |
|
745 |
-\end_layout |
|
746 |
- |
|
747 |
-\begin_layout Itemize |
|
748 |
-backreferences |
|
749 |
-\end_layout |
|
750 |
- |
|
751 |
-\begin_layout Itemize |
|
752 |
-lookaround |
|
753 |
-\end_layout |
|
754 |
- |
|
755 |
-\begin_layout Itemize |
|
756 |
-other |
|
757 |
-\begin_inset Quotes eld |
|
758 |
-\end_inset |
|
759 |
- |
|
760 |
-advanced |
|
761 |
-\begin_inset Quotes erd |
|
762 |
-\end_inset |
|
763 |
- |
|
764 |
- features not listed in the supported list ;) |
|
765 |
-\end_layout |
|
766 |
- |
|
767 |
-\begin_layout Standard |
|
768 |
-This however shouldn't discourage you from using the |
|
769 |
-\begin_inset Quotes eld |
|
770 |
-\end_inset |
|
771 |
- |
|
772 |
-not directly supported features |
|
773 |
-\begin_inset Quotes erd |
|
774 |
-\end_inset |
|
775 |
- |
|
776 |
-, because if the internal engine encounters unsupported syntax, it passes |
|
777 |
- it on to the POSIX regex core (beginning from the first unsupported token, |
|
778 |
- everything before that is still processed by the internal matcher). |
|
779 |
- An example might make this more clear: |
|
780 |
-\end_layout |
|
781 |
- |
|
782 |
-\begin_layout Standard |
|
783 |
- |
|
784 |
-\emph on |
|
785 |
-www |
|
786 |
-\backslash |
|
787 |
-.google |
|
788 |
-\backslash |
|
789 |
-.(com|ro|it) ([a-zA-Z])+ |
|
790 |
-\backslash |
|
791 |
-.google |
|
792 |
-\backslash |
|
793 |
-.(com|ro|it) |
|
794 |
-\end_layout |
|
795 |
- |
|
796 |
-\begin_layout Standard |
|
797 |
-Everything till |
|
798 |
-\emph on |
|
799 |
-([a-zA-Z])+ |
|
800 |
-\emph default |
|
801 |
- is processed internally, that parenthesis (and everything beyond) is processed |
|
802 |
- by the posix core. |
|
803 |
-\end_layout |
|
804 |
- |
|
805 |
-\begin_layout Standard |
|
806 |
-Examples of url pairs that match: |
|
807 |
-\end_layout |
|
808 |
- |
|
809 |
-\begin_layout Itemize |
|
810 |
- |
|
811 |
-\emph on |
|
812 |
-www.google.ro images.google.ro |
|
813 |
-\end_layout |
|
814 |
- |
|
815 |
-\begin_layout Itemize |
|
816 |
-www.google.com images.google.ro |
|
817 |
-\end_layout |
|
818 |
- |
|
819 |
-\begin_layout Standard |
|
820 |
-Example of url pairs that don't match: |
|
821 |
-\end_layout |
|
822 |
- |
|
823 |
-\begin_layout Itemize |
|
824 |
-www.google.ro images1.google.ro |
|
825 |
-\end_layout |
|
826 |
- |
|
827 |
-\begin_layout Itemize |
|
828 |
-images.google.com image.google.com |
|
829 |
-\end_layout |
|
830 |
- |
|
831 |
-\begin_layout Subsection |
|
832 |
-Flags |
|
833 |
-\begin_inset LatexCommand \label{sec:Flags} |
|
834 |
- |
|
835 |
-\end_inset |
|
836 |
- |
|
837 |
- |
|
838 |
-\end_layout |
|
839 |
- |
|
840 |
-\begin_layout Standard |
|
841 |
-Flags are a binary OR of the following numbers: |
|
842 |
-\end_layout |
|
843 |
- |
|
844 |
-\begin_layout Description |
|
845 |
-HOST_SUFFICIENT 1 |
|
846 |
-\end_layout |
|
847 |
- |
|
848 |
-\begin_layout Description |
|
849 |
-DOMAIN_SUFFICIENT 2 |
|
850 |
-\end_layout |
|
851 |
- |
|
852 |
-\begin_layout Description |
|
853 |
-DO_REVERSE_LOOKUP 4 |
|
854 |
-\end_layout |
|
855 |
- |
|
856 |
-\begin_layout Description |
|
857 |
-CHECK_REDIR 8 |
|
858 |
-\end_layout |
|
859 |
- |
|
860 |
-\begin_layout Description |
|
861 |
-CHECK_SSL 16 |
|
862 |
-\end_layout |
|
863 |
- |
|
864 |
-\begin_layout Description |
|
865 |
-CHECK_CLOAKING 32 |
|
866 |
-\end_layout |
|
867 |
- |
|
868 |
-\begin_layout Description |
|
869 |
-CLEANUP_URL 64 |
|
870 |
-\end_layout |
|
871 |
- |
|
872 |
-\begin_layout Description |
|
873 |
-CHECK_DOMAIN_REVERSE 128 |
|
874 |
-\end_layout |
|
875 |
- |
|
876 |
-\begin_layout Description |
|
877 |
-CHECK_IMG_URL 256 |
|
878 |
-\end_layout |
|
879 |
- |
|
880 |
-\begin_layout Description |
|
881 |
-DOMAINLIST_REQUIRED 512 |
|
882 |
-\end_layout |
|
883 |
- |
|
884 |
-\begin_layout Standard |
|
885 |
-The names of the constants are self-explanatory. |
|
886 |
-\end_layout |
|
887 |
- |
|
888 |
-\begin_layout Standard |
|
889 |
-These constants are defined in libclamav/phishcheck.h, you can check there |
|
890 |
- for the latest flags. |
|
891 |
-\end_layout |
|
892 |
- |
|
893 |
-\begin_layout Standard |
|
894 |
-There is a default set of flags that are enabled, these are currently: (CLEANUP_ |
|
895 |
-URL|DOMAIN_SUFFICIENT|CHECK_SSL|CHECK_CLOAKING|DOMAINLIST_REQUIRED|CHECK_IMG_URL |
|
896 |
-), SSL checking is currently performed only for <a> tags. |
|
897 |
-\end_layout |
|
898 |
- |
|
899 |
-\begin_layout Standard |
|
900 |
-You must decide for each line in the domainlist if you want to filter any |
|
901 |
- flags (that is you don't want certain checks to be done), and then calculate |
|
902 |
- the binary OR of those constants, and then convert it into a 3-digit hexnumber. |
|
903 |
- For example you decide that domain_sufficient shouldn't be used for ebay.com, |
|
904 |
- and you don't want to check images either, so you come up with this flag |
|
905 |
- number: |
|
906 |
-\begin_inset Formula $2|256\Rightarrow$ |
|
907 |
-\end_inset |
|
908 |
- |
|
909 |
-258 |
|
910 |
-\begin_inset Formula $(decimal)\Rightarrow102(hexadecimal)$ |
|
911 |
-\end_inset |
|
912 |
- |
|
913 |
- |
|
914 |
-\end_layout |
|
915 |
- |
|
916 |
-\begin_layout Standard |
|
917 |
-So you add this line to daily.pdb: |
|
918 |
-\end_layout |
|
919 |
- |
|
920 |
-\begin_layout Itemize |
|
921 |
-R102\InsetSpace ~ |
|
922 |
-www.ebay.com\InsetSpace ~ |
|
923 |
-.+ |
|
924 |
-\end_layout |
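The arithmetic above can be sketched as follows. The constant values are the ones listed in this section; the `PhishFlag` class and the `flag_field` helper are hypothetical names for illustration, not part of libclamav.

```python
from enum import IntFlag


class PhishFlag(IntFlag):
    """Flag values as listed in this section (see libclamav/phishcheck.h)."""
    HOST_SUFFICIENT = 1
    DOMAIN_SUFFICIENT = 2
    DO_REVERSE_LOOKUP = 4
    CHECK_REDIR = 8
    CHECK_SSL = 16
    CHECK_CLOAKING = 32
    CLEANUP_URL = 64
    CHECK_DOMAIN_REVERSE = 128
    CHECK_IMG_URL = 256
    DOMAINLIST_REQUIRED = 512


def flag_field(*flags):
    """OR the given flags together and format them as a 3-digit hex number."""
    value = 0
    for flag in flags:
        value |= flag
    return f"{int(value):03x}"


# Filter DOMAIN_SUFFICIENT and CHECK_IMG_URL: 2 | 256 = 258 decimal = 102 hex.
print(flag_field(PhishFlag.DOMAIN_SUFFICIENT, PhishFlag.CHECK_IMG_URL))
```

The zero-padded format keeps the field at three digits even for small values, for example `002` for DOMAIN_SUFFICIENT alone.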
|
925 |
- |
|
926 |
-\begin_layout Section |
|
927 |
-Introduction to regular expressions |
|
928 |
-\begin_inset LatexCommand \label{sec:Introduction-to-regular} |
|
929 |
- |
|
930 |
-\end_inset |
|
931 |
- |
|
932 |
- |
|
933 |
-\end_layout |
|
934 |
- |
|
935 |
-\begin_layout Standard |
|
936 |
-Recommended reading: |
|
937 |
-\end_layout |
|
938 |
- |
|
939 |
-\begin_layout Itemize |
|
940 |
-http://www.regular-expressions.info/quickstart.html |
|
941 |
-\end_layout |
|
942 |
- |
|
943 |
-\begin_layout Itemize |
|
944 |
-http://www.regular-expressions.info/tutorial.html |
|
945 |
-\end_layout |
|
946 |
- |
|
947 |
-\begin_layout Itemize |
|
948 |
-regex(7) man-page: http://www.tin.org/bin/man.cgi?section=7&topic=regex |
|
949 |
-\end_layout |
|
950 |
- |
|
951 |
-\begin_layout Subsection |
|
952 |
-Special characters |
|
953 |
-\end_layout |
|
954 |
- |
|
955 |
-\begin_layout Description |
|
956 |
-[ the opening square bracket - it marks the beginning of a character class, |
|
957 |
- see section |
|
958 |
-\begin_inset LatexCommand \vref{sub:Character-classes} |
|
959 |
- |
|
960 |
-\end_inset |
|
961 |
- |
|
962 |
- |
|
963 |
-\end_layout |
|
964 |
- |
|
965 |
-\begin_layout Description |
|
966 |
- |
|
967 |
-\backslash |
|
968 |
- the backslash - escapes special characters, see section |
|
969 |
-\begin_inset LatexCommand \vref{sub:Escaping} |
|
970 |
- |
|
971 |
-\end_inset |
|
972 |
- |
|
973 |
- |
|
974 |
-\end_layout |
|
975 |
- |
|
976 |
-\begin_layout Description |
|
977 |
-\i \^{ } |
|
978 |
- the caret - matches the beginning of a line (not needed in clamav regexes, |
|
979 |
- this is implied) |
|
980 |
-\end_layout |
|
981 |
- |
|
982 |
-\begin_layout Description |
|
983 |
-$ the dollar sign - matches the end of a line (not needed in clamav regexes, |
|
984 |
- this is implied) |
|
985 |
-\end_layout |
|
986 |
- |
|
987 |
-\begin_layout Description |
|
988 |
-\i \.{ } |
|
989 |
- the period or dot - matches |
|
990 |
-\emph on |
|
991 |
-any |
|
992 |
-\emph default |
|
993 |
- character |
|
994 |
-\end_layout |
|
995 |
- |
|
996 |
-\begin_layout Description |
|
997 |
-| the vertical bar or pipe symbol - matches either of the tokens on its left |
|
998 |
- and right side, see section |
|
999 |
-\begin_inset LatexCommand \vref{sub:Alternation} |
|
1000 |
- |
|
1001 |
-\end_inset |
|
1002 |
- |
|
1003 |
- |
|
1004 |
-\end_layout |
|
1005 |
- |
|
1006 |
-\begin_layout Description |
|
1007 |
-? the question mark - matches optionally the left-side token, see section |
|
1008 |
-\begin_inset LatexCommand \vref{sub:Optional-matching,-and} |
|
1009 |
- |
|
1010 |
-\end_inset |
|
1011 |
- |
|
1012 |
- |
|
1013 |
-\end_layout |
|
1014 |
- |
|
1015 |
-\begin_layout Description |
|
1016 |
-* the asterisk or star - matches 0 or more occurrences of the left-side token, |
|
1017 |
- see section |
|
1018 |
-\begin_inset LatexCommand \vref{sub:Optional-matching,-and} |
|
1019 |
- |
|
1020 |
-\end_inset |
|
1021 |
- |
|
1022 |
- |
|
1023 |
-\end_layout |
|
1024 |
- |
|
1025 |
-\begin_layout Description |
|
1026 |
-+ the plus sign - matches 1 or more occurrences of the left-side token, see |
|
1027 |
- section |
|
1028 |
-\begin_inset LatexCommand \vref{sub:Optional-matching,-and} |
|
1029 |
- |
|
1030 |
-\end_inset |
|
1031 |
- |
|
1032 |
- |
|
1033 |
-\end_layout |
|
1034 |
- |
|
1035 |
-\begin_layout Description |
|
1036 |
-( the opening round bracket - marks |
|
1037 |
-beginning of a group, see section |
|
1038 |
-\begin_inset LatexCommand \vref{sub:Groups} |
|
1039 |
- |
|
1040 |
-\end_inset |
|
1041 |
- |
|
1042 |
- |
|
1043 |
-\end_layout |
|
1044 |
- |
|
1045 |
-\begin_layout Description |
|
1046 |
-) the closing round bracket - marks end of a group, see section |
|
1047 |
-\begin_inset LatexCommand \vref{sub:Groups} |
|
1048 |
- |
|
1049 |
-\end_inset |
|
1050 |
- |
|
1051 |
- |
|
1052 |
-\end_layout |
|
1053 |
- |
|
1054 |
-\begin_layout Subsection |
|
1055 |
-Character classes |
|
1056 |
-\begin_inset LatexCommand \label{sub:Character-classes} |
|
1057 |
- |
|
1058 |
-\end_inset |
|
1059 |
- |
|
1060 |
- |
|
1061 |
-\end_layout |
|
1062 |
- |
|
1063 |
-\begin_layout Subsection |
|
1064 |
-Escaping |
|
1065 |
-\begin_inset LatexCommand \label{sub:Escaping} |
|
1066 |
- |
|
1067 |
-\end_inset |
|
1068 |
- |
|
1069 |
- |
|
1070 |
-\end_layout |
|
1071 |
- |
|
1072 |
-\begin_layout Standard |
|
1073 |
-Escaping has two purposes: |
|
1074 |
-\end_layout |
|
1075 |
- |
|
1076 |
-\begin_layout Itemize |
|
1077 |
-it allows you to actually match the special characters themselves, for example |
|
1078 |
- to match the literal |
|
1079 |
-\emph on |
|
1080 |
-+ |
|
1081 |
-\emph default |
|
1082 |
-, you would write |
|
1083 |
-\emph on |
|
1084 |
- |
|
1085 |
-\backslash |
|
1086 |
-+ |
|
1087 |
-\end_layout |
|
1088 |
- |
|
1089 |
-\begin_layout Itemize |
|
1090 |
-it also allows you to match non-printable characters, such as the tab ( |
|
1091 |
-\emph on |
|
1092 |
- |
|
1093 |
-\backslash |
|
1094 |
-t |
|
1095 |
-\emph default |
|
1096 |
-), newline ( |
|
1097 |
-\emph on |
|
1098 |
- |
|
1099 |
-\backslash |
|
1100 |
-n |
|
1101 |
-\emph default |
|
1102 |
-), .. |
|
1103 |
-\end_layout |
|
1104 |
- |
|
1105 |
-\begin_layout Standard |
|
1106 |
-However since non-printable characters are not valid inside an url, you |
|
1107 |
- won't have a reason to use them. |
|
1108 |
-\end_layout |
|
1109 |
- |
|
1110 |
-\begin_layout Subsection |
|
1111 |
-Alternation |
|
1112 |
-\begin_inset LatexCommand \label{sub:Alternation} |
|
1113 |
- |
|
1114 |
-\end_inset |
|
1115 |
- |
|
1116 |
- |
|
1117 |
-\end_layout |
|
1118 |
- |
|
1119 |
-\begin_layout Subsection |
|
1120 |
-Optional matching, and repetition |
|
1121 |
-\begin_inset LatexCommand \label{sub:Optional-matching,-and} |
|
1122 |
- |
|
1123 |
-\end_inset |
|
1124 |
- |
|
1125 |
- |
|
1126 |
-\end_layout |
|
1127 |
- |
|
1128 |
-\begin_layout Subsection |
|
1129 |
-Groups |
|
1130 |
-\begin_inset LatexCommand \label{sub:Groups} |
|
1131 |
- |
|
1132 |
-\end_inset |
|
1133 |
- |
|
1134 |
- |
|
1135 |
-\end_layout |
|
1136 |
- |
|
1137 |
-\begin_layout Standard |
|
1138 |
-Groups are usually used together with repetition, or alternation. |
|
1139 |
- For example: |
|
1140 |
-\emph on |
|
1141 |
-(com|it)+ |
|
1142 |
-\emph default |
|
1143 |
- means: match 1 or more repetitions of |
|
1144 |
-\emph on |
|
1145 |
-com |
|
1146 |
-\emph default |
|
1147 |
- or |
|
1148 |
-\emph on |
|
1149 |
-it, |
|
1150 |
-\emph default |
|
1151 |
- that is it matches: com, it, comcom, comcomcom, comit, itit, ititcom,... |
|
1152 |
- you get the idea. |
|
1153 |
-\end_layout |
|
1154 |
- |
|
1155 |
-\begin_layout Standard |
|
1156 |
-Groups can also be used to extract substring, but this is not supported |
|
1157 |
- by the clam engine, and not needed either in this case. |
|
1158 |
-\end_layout |
|
1159 |
- |
|
1160 |
-\begin_layout Section |
|
1161 |
-How to create database files |
|
1162 |
-\end_layout |
|
1163 |
- |
|
1164 |
-\begin_layout Subsection |
|
1165 |
-How to create and maintain the whitelist (daily.wdb) |
|
1166 |
-\end_layout |
|
1167 |
- |
|
1168 |
-\begin_layout Standard |
|
1169 |
-If the phishing code claims that a certain mail is phishing, but its not, |
|
1170 |
- you have 2 choices: |
|
1171 |
-\end_layout |
|
1172 |
- |
|
1173 |
-\begin_layout Itemize |
|
1174 |
-examine your rules daily.pdb, and fix them if necessary (see: section |
|
1175 |
-\begin_inset LatexCommand \vref{sub:How-to-create} |
|
1176 |
- |
|
1177 |
-\end_inset |
|
1178 |
- |
|
1179 |
-) |
|
1180 |
-\end_layout |
|
1181 |
- |
|
1182 |
-\begin_layout Itemize |
|
1183 |
-add it to the whitelist (discussed here) |
|
1184 |
-\end_layout |
|
1185 |
- |
|
1186 |
-\begin_layout Standard |
|
1187 |
-Let's assume you are having problems because of links like this in a mail: |
|
1188 |
-\end_layout |
|
1189 |
- |
|
1190 |
-\begin_layout Quote |
|
1191 |
-<a href= |
|
1192 |
-\begin_inset Quotes erd |
|
1193 |
-\end_inset |
|
1194 |
- |
|
1195 |
-http://69.0.241.57/bCentral/L.asp?L=XXXXXXXX |
|
1196 |
-\begin_inset Quotes erd |
|
1197 |
-\end_inset |
|
1198 |
- |
|
1199 |
->http://www.bcentral.it/</a> |
|
1200 |
-\end_layout |
|
1201 |
- |
|
1202 |
-\begin_layout Standard |
|
1203 |
-After investigating those sites further, you decide they are no threat, |
|
1204 |
- and create a line like this in daily.wdb: |
|
1205 |
-\end_layout |
|
1206 |
- |
|
1207 |
-\begin_layout Quote |
|
1208 |
-R http://www |
|
1209 |
-\backslash |
|
1210 |
-.bcentral |
|
1211 |
-\backslash |
|
1212 |
-.it/.+ http://69 |
|
1213 |
-\backslash |
|
1214 |
-.0 |
|
1215 |
-\backslash |
|
1216 |
-.241 |
|
1217 |
-\backslash |
|
1218 |
-.57/bCentral/L |
|
1219 |
-\backslash |
|
1220 |
-.asp?L=.+ |
|
1221 |
-\end_layout |
|
1222 |
- |
|
1223 |
-\begin_layout Standard |
|
1224 |
-Note: urls like the above can be used to track unique mail recipients, and |
|
1225 |
- thus know if somebody actually reads mails (so they can send more spam). |
|
1226 |
- However since this site required no authentication information, it is safe |
|
1227 |
- from a phishing point of view. |
|
1228 |
-\end_layout |
|
1229 |
- |
|
1230 |
-\begin_layout Subsection |
|
1231 |
-How to create and maintain the domainlist (daily.pdb) |
|
1232 |
-\begin_inset LatexCommand \label{sub:How-to-create} |
|
1233 |
- |
|
1234 |
-\end_inset |
|
1235 |
- |
|
1236 |
- |
|
1237 |
-\end_layout |
|
1238 |
- |
|
1239 |
-\begin_layout Standard |
|
1240 |
-When not using --phish-scan-alldomains (production environments for example), |
|
1241 |
- you need to decide which urls you are going to check. |
|
1242 |
- |
|
1243 |
-\end_layout |
|
1244 |
- |
|
1245 |
-\begin_layout Standard |
|
1246 |
-Although at a first glance it might seem a good idea to check everything, |
|
1247 |
- it would produce false positives. |
|
1248 |
- Particularly newsletters, ads, etc. |
|
1249 |
- are likely to use URLs that look like phishing attempts. |
|
1250 |
-\end_layout |
|
1251 |
- |
|
1252 |
-\begin_layout Standard |
|
1253 |
-Let's assume that you've recently seen many phishing attempts claiming they |
|
1254 |
- come from Paypal. |
|
1255 |
- Thus you need to add paypal to daily.pdb: |
|
1256 |
-\end_layout |
|
1257 |
- |
|
1258 |
-\begin_layout Quote |
|
1259 |
-R .+ .+ |
|
1260 |
-\backslash |
|
1261 |
-.paypal |
|
1262 |
-\backslash |
|
1263 |
-.com |
|
1264 |
-\end_layout |
|
1265 |
- |
|
1266 |
-\begin_layout Standard |
|
1267 |
-The above line will block (detect as phishing) mails that contain urls that |
|
1268 |
- claim to lead to paypal, but they don't in fact. |
|
1269 |
-\end_layout |
|
1270 |
- |
|
1271 |
-\begin_layout Standard |
|
1272 |
-Be careful not to create regexes that match too broad a range of urls though. |
|
1273 |
-\end_layout |
|
1274 |
- |
|
1275 |
-\begin_layout Subsection |
|
1276 |
-Dealing with false positives, and undetected phishing mails |
|
1277 |
-\end_layout |
|
1278 |
- |
|
1279 |
-\begin_layout Subsubsection |
|
1280 |
-False positives |
|
1281 |
-\end_layout |
|
1282 |
- |
|
1283 |
-\begin_layout Standard |
|
1284 |
-Whenever you see a false positive (mail that is detected as phishing, but |
|
1285 |
- its not), you need to examine |
|
1286 |
-\emph on |
|
1287 |
-why |
|
1288 |
-\emph default |
|
1289 |
- clamav decided that its phishing. |
|
1290 |
- You can do this easily by building clamav with debugging (./configure --enable-e |
|
1291 |
-xperimental --enable-debug), and then running a tool: |
|
1292 |
-\end_layout |
|
1293 |
- |
|
1294 |
-\begin_layout Quote |
|
1295 |
-$contrib/phishing/why.py phishing.eml |
|
1296 |
-\end_layout |
|
1297 |
- |
|
1298 |
-\begin_layout Standard |
|
1299 |
-This will show the url that triggers the phish verdict, and a reason why |
|
1300 |
- that url is considered phishing attempt. |
|
1301 |
-\end_layout |
|
1302 |
- |
|
1303 |
-\begin_layout Standard |
|
1304 |
-Once you know the reason, you might need to modify daily.pdb (if one of yours |
|
1305 |
- rules inthere are too broad), or you need to add the url to daily.wdb. |
|
1306 |
- If you think the algorithm is incorrect, please file a bugreport on bugzilla.cla |
|
1307 |
-mav.net, including the output of |
|
1308 |
-\emph on |
|
1309 |
-why.py |
|
1310 |
-\emph default |
|
1311 |
-. |
|
1312 |
-\end_layout |
|
1313 |
- |
|
1314 |
-\begin_layout Subsubsection |
|
1315 |
-Undetected phish mails |
|
1316 |
-\end_layout |
|
1317 |
- |
|
1318 |
-\begin_layout Standard |
|
1319 |
-Using why.py doesn't help here unfortunately (it will say: clean), so all |
|
1320 |
- you can do is: |
|
1321 |
-\end_layout |
|
1322 |
- |
|
1323 |
-\begin_layout Quote |
|
1324 |
-$clamscan/clamscan --phish-scan-alldomains undetected.eml |
|
1325 |
-\end_layout |
|
1326 |
- |
|
1327 |
-\begin_layout Standard |
|
1328 |
-and see whether the mail is detected. If it is, you need to add an appropriate |
|
1329 |
- line to daily.pdb (see section |
|
1330 |
-\begin_inset LatexCommand \vref{sub:How-to-create} |
|
1331 |
- |
|
1332 |
-\end_inset |
|
1333 |
- |
|
1334 |
-). |
|
1335 |
-\end_layout |
|
1336 |
- |
|
1337 |
-\begin_layout Standard |
|
1338 |
-If the mail is not detected, then try using: |
|
1339 |
-\end_layout |
|
1340 |
- |
|
1341 |
-\begin_layout Quote |
|
1342 |
-$clamscan/clamscan --debug undetected.eml|less |
|
1343 |
-\end_layout |
|
1344 |
- |
|
1345 |
-\begin_layout Address |
|
1346 |
-Then see what urls are being checked, see if any of them is in a whitelist, |
|
1347 |
- see if all urls are detected, etc. |
|
1348 |
-\end_layout |
|
1349 |
- |
|
1350 |
-\begin_layout Section |
|
1351 |
-Hints and recommendations |
|
1352 |
-\end_layout |
|
1353 |
- |
|
1354 |
-\begin_layout Section |
|
1355 |
-Examples |
|
1356 |
-\end_layout |
|
1357 |
- |
|
1358 |
-\begin_layout Standard |
|
1359 |
- |
|
1360 |
-\end_layout |
|
1361 |
- |
|
1362 |
-\end_body |
|
1363 |
-\end_document |
1365 | 2 |
new file mode 100644 |
... | ... |
@@ -0,0 +1,491 @@ |
0 |
+%% LyX 1.5.3 created this file. For more info, see http://www.lyx.org/. |
|
1 |
+%% Do not edit unless you really know what you are doing. |
|
2 |
+\documentclass[a4paper,english]{article} |
|
3 |
+\usepackage{mathptmx} |
|
4 |
+\usepackage[T1]{fontenc} |
|
5 |
+\usepackage{varioref} |
|
6 |
+\usepackage{prettyref} |
|
7 |
+\usepackage{amssymb} |
|
8 |
+\usepackage{pslatex} |
|
9 |
+\usepackage[dvips]{graphicx} |
|
10 |
+\usepackage{wrapfig} |
|
11 |
+\usepackage{url} |
|
12 |
+\date{} |
|
13 |
+ |
|
14 |
+\begin{document} |
|
15 |
+ |
|
16 |
+\title{{\huge Phishing signatures creation HOWTO}} |
|
17 |
+\author{T\"or\"ok Edwin} |
|
18 |
+\maketitle |
|
19 |
+ |
|
20 |
+\section{Database file format} |
|
21 |
+ |
|
22 |
+\subsection{PDB format} |
|
23 |
+This file contains urls/hosts that are targets of phishing attempts. |
|
24 |
+It contains lines in the following format: |
|
25 |
+\begin{verbatim} |
|
26 |
+R[Filter]:RealURL:DisplayedURL[:FuncLevelSpec] |
|
27 |
+H[Filter]:DisplayedHostname[:FuncLevelSpec] |
|
28 |
+\end{verbatim} |
|
29 |
+ |
|
30 |
+\begin{description} |
|
31 |
+ \item [{R}] regular expression, for the concatenated URL |
|
32 |
+ \item [{H}] matches the \verb+DisplayedHostname+ as a simple pattern (literally, no regular expression) |
|
33 |
+ \begin{itemize} |
|
34 |
+ \item the pattern can match either the full hostname |
|
35 |
+ \item or a subdomain of the specified hostname |
|
36 |
+ \item to avoid false matches in case of subdomain matches, the engine checks that there is a dot(\verb+.+) or a space(\verb+ +) before the matched portion |
|
37 |
+ \end{itemize} |
|
38 |
+ \item [{Filter}] an (optional) 3-digit hexadecimal number representing flags that should be filtered. |
|
39 |
+ \begin{itemize} |
|
40 |
+ \item flag filtering only makes sense in .pdb files. (however clamav won't complain if you put flags in .wdb files, it will just skip them) |
|
41 |
+ \item for details on how to construct a flag number see section \prettyref{sec:Flags} |
|
42 |
+ \end{itemize} |
|
43 |
+ |
|
44 |
+ \item [{RealURL }] is the URL the user is sent to |
|
45 |
+ \item [{DisplayedURL}] is the URL description displayed to the user, that is, where it is \emph{claimed} they are sent. The most obvious example is an html anchor (<a> tag): its href attribute is the \textsc{realURL}, and its contents are the \textsc{displayedURL} |
|
46 |
+ \item [{DisplayedHostname}] is the hostname portion of the \textsc{displayedURL} |
|
47 |
+ \item [{FuncLevelSpec}] an (optional) functionality level, 2 formats are possible: |
|
48 |
+ \begin{itemize} |
|
49 |
+ \item \verb+minlevel+ all engines having functionality level >= \verb+minlevel+ will load this line |
|
50 |
+ \item \verb+minlevel-maxlevel+ engines with functionality level $>= $ \verb+minlevel+, and $< $ \verb+maxlevel+ will load this line |
|
51 |
+ \end{itemize} |
|
52 |
+\end{description} |
|
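The FuncLevelSpec rule described above can be sketched in a few lines of Python. This is only an illustration of the two documented forms (`minlevel` and `minlevel-maxlevel`). The function name `line_is_loaded` is hypothetical, and treating a missing spec as "always load" is an assumption, not something taken from the ClamAV sources.

```python
def line_is_loaded(func_level_spec, engine_level):
    """Return True if an engine at engine_level would load the line.

    Hypothetical helper that mirrors the two FuncLevelSpec forms above.
    """
    if not func_level_spec:
        # Assumption: a line without a FuncLevelSpec is loaded by every engine.
        return True
    if "-" in func_level_spec:
        minlevel, maxlevel = (int(part) for part in func_level_spec.split("-", 1))
        # Form "minlevel-maxlevel": minlevel <= level < maxlevel.
        return minlevel <= engine_level < maxlevel
    # Form "minlevel": level >= minlevel.
    return engine_level >= int(func_level_spec)
```

For instance, a line ending in `:20-25` would be loaded by an engine at level 24 but skipped by one at level 25, because the upper bound is exclusive.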
53 |
+ |
|
54 |
+\subsection{WDB format} |
|
55 |
+This file contains whitelisted url pairs. |
|
56 |
+It contains lines in the following format: |
|
57 |
+\begin{verbatim} |
|
58 |
+X:RealURL:DisplayedURL[:FuncLevelSpec] |
|
59 |
+M:RealHostname:DisplayedHostname[:FuncLevelSpec] |
|
60 |
+\end{verbatim} |
|
61 |
+ |
|
62 |
+\begin{description} |
|
63 |
+ \item [{X}] regular expression, for the \textsc{entire URL}, not just the hostname |
|
64 |
+ \begin{itemize} |
|
65 |
+ \item The regular expression is by default anchored to start-of-line and end-of-line, as if you have used \verb+^RegularExpression$+ |
|
66 |
+ \item A trailing \verb+/+ is automatically added both to the regex, and the input string to avoid false matches |
|
67 |
+ \item The regular expression matches the \textsc{concatenation} of RealURL, a colon(\verb+:+), and DisplayedURL as a single string. It doesn't separately match RealURL and DisplayedURL! |
|
68 |
+ \end{itemize} |
|
69 |
+ \item [{M}] matches hostname, or subdomain of it, see notes for \textsc{H} above |
|
70 |
+\end{description} |
|
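The three notes on X entries can be made concrete with a small sketch. It uses Python's `re` module as a stand-in for the POSIX regex core, which is an assumption that only holds for simple patterns like the one shown. The helper name `x_entry_matches` and the `example.com` URLs are hypothetical, and the sketch assumes the trailing `/` is appended once, at the very end of both the regex and the concatenated string.

```python
import re


def x_entry_matches(regex, real_url, displayed_url):
    """Check one X entry against a realURL/displayedURL pair (illustrative only)."""
    # RealURL, a colon, and DisplayedURL are matched as a single string.
    subject = real_url + ":" + displayed_url
    # Assumption: the trailing "/" is added to the end of regex and input alike.
    # fullmatch() gives the implicit ^...$ anchoring.
    return re.fullmatch(regex + "/", subject + "/") is not None


# Hypothetical entry: one regex covers both halves of the pair.
ENTRY = r"http://www\.example\.(com|org):www\.example\.com"
```

With this entry, the pair `http://www.example.org` / `www.example.com` matches, while `http://www.example.org` / `www.example.com.evil.test` does not, because the whole concatenated string has to match.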
71 |
+ |
|
72 |
+\subsection{Hints} |
|
73 |
+ |
|
74 |
+\begin{itemize} |
|
75 |
+ \item empty lines are ignored |
|
76 |
+ \item the colons are mandatory |
|
77 |
+ \item Don't leave extra spaces on the end of a line! |
|
78 |
+ \item if any of the lines don't conform to this format, clamav will abort with a Malformed Database Error |
|
79 |
+ \item see section \vref{sub:Extraction-of-realURL,} for more details on \textsc{realURL/displayedURL} |
|
80 |
+\end{itemize} |
|
81 |
+ |
|
82 |
+%TODO: give up-to-date examples |
|
83 |
+ |
|
84 |
+\subsubsection{Example} |
|
85 |
+ |
|
86 |
+The following line: |
|
87 |
+ |
|
88 |
+\emph{R http://www\textbackslash{}.google\textbackslash{}.(com|ro|it) |
|
89 |
+www\textbackslash{}.google\textbackslash{}.com} |
|
90 |
+ |
|
91 |
+Means: \emph{\textsc{R}}\textsc{ }- this is a regex. |
|
92 |
+ |
|
93 |
+Example of url pairs matching: http://www.google.com www.google.com, |
|
94 |
+http://www.google.it www.google.com. |
|
95 |
+ |
|
96 |
+Example of url pairs not matching: http://www.google.c0m www.google.com |
|
97 |
+ |
|
98 |
+ |
|
99 |
+\subsection{How matching works} |
|
100 |
+ |
|
101 |
+ |
|
102 |
+\subsubsection{RealURL, displayedURL concatenation\label{sub:RealURL,-displayedURL-concatenation}} |
|
103 |
+ |
|
104 |
+The phishing detection module processes pairs of realURL/displayedURL, |
|
105 |
+and the matching against daily.wdb/daily.pdb is done as follows: the |
|
106 |
+realURL is concatenated with a space, and with the displayedURL, then |
|
107 |
+that \emph{line} is matched against the lines in daily.wdb/daily.pdb. |
|
108 |
+ |
|
109 |
+So if you have a line like |
|
110 |
+ |
|
111 |
+\textit{~www.google.ro~www.google.com} |
|
112 |
+ |
|
113 |
+and a href like: \emph{<a href=''http://www.google.ro''>www.google.com</a>,} |
|
114 |
+then it will match, but: \emph{<a href=''http://images.google.com''>www.google.com</a>} |
|
115 |
+will not match. |
|
116 |
+ |
|
117 |
+If you use the \textbf{\textsc{H}} flag, then the 2nd href will match |
|
118 |
+too. |
|
119 |
+ |
|
120 |
+ |
|
121 |
+\subsubsection{What happens when a match is found} |
|
122 |
+ |
|
123 |
+In the case of the whitelist, a match means that the realURL/displayedURL |
|
124 |
+combination is considered \textsc{clean}, and no further checks are |
|
125 |
+performed on it. |
|
126 |
+ |
|
127 |
+In the case of the domainlist, a match means that the realURL/displayedURL |
|
128 |
+is going to be checked for phishing attempts. This is only done if |
|
129 |
+you don't run clamav with the \emph{alldomains} option (since then |
|
130 |
+all urls are checked). Furthermore you can restrict what checks are |
|
131 |
+to be performed by specifying the 3-digit hexnumber. |
|
132 |
+ |
|
133 |
+ |
|
134 |
+\subsubsection{Extraction of \textsc{realURL}, \textsc{displayedURL} from HTML tags\label{sub:Extraction-of-realURL,}} |
|
135 |
+ |
|
136 |
+The html parser extracts pairs of \textsc{realURL}/\textsc{displayedURL} |
|
137 |
+based on the following rules: |
|
138 |
+ |
|
139 |
+\begin{description} |
|
140 |
+\item [{a}] (anchor) the \emph{href} is the \textsc{realURL}, its \emph{contents} |
|
141 |
+is the \textsc{displayedURL} |
|
142 |
+ |
|
143 |
+\begin{description} |
|
144 |
+\item [{contents}] is the tag-stripped contents of the <a> tags, so for |
|
145 |
+example <b> tags are stripped (but not their contents) |
|
146 |
+\end{description} |
|
147 |
+nesting another <a> tag within an <a> tag (besides being invalid |
|
148 |
+html) is treated as a </a><a.. |
|
149 |
+ |
|
150 |
+\item [{form}] the \emph{action} attribute is the \textsc{realURL}, and a |
|
151 |
+nested <a> tag is the \textsc{displayedURL} |
|
152 |
+\item [{img/area}] if nested within an \emph{<a>} tag, the \textsc{realURL} |
|
153 |
+is the \emph{href} of the a tag, and the \emph{src/dynsrc/area} is |
|
154 |
+the \textsc{displayedURL} of the img |
|
155 |
+ |
|
156 |
+ |
|
157 |
+if nested within a \emph{form} tag, then the action attribute of |
|
158 |
+the \emph{form} tag is the \textsc{realURL} |
|
159 |
+ |
|
160 |
+\item [{iframe}] if nested within an \emph{<a>} tag the \emph{src} attribute |
|
161 |
+is the displayedURL, and the \emph{href} of its parent \emph{a} tag |
|
162 |
+is the \textsc{realURL} |
|
163 |
+ |
|
164 |
+ |
|
165 |
+if nested within a \emph{form} tag, then the action attribute of |
|
166 |
+the \emph{form} tag is the \textsc{realURL} |
|
167 |
+ |
|
168 |
+\end{description} |
|
169 |
+ |
|
170 |
+\subsubsection{Example} |
|
171 |
+ |
|
172 |
+Consider this html file: |
|
173 |
+ |
|
174 |
+\begin{quote} |
|
175 |
+\emph{<a href=''evilurl''>www.paypal.com</a>} |
|
176 |
+ |
|
177 |
+\emph{<a href=''evilurl2'' title=''www.ebay.com''>click here to |
|
178 |
+sign in</a>} |
|
179 |
+ |
|
180 |
+\emph{<form action=''evilurl\_form''>} |
|
181 |
+ |
|
182 |
+\emph{Please sign in to <a href=''cgi.ebay.com''>Ebay</a> using |
|
183 |
+this form} |
|
184 |
+ |
|
185 |
+\emph{<input type='text' name='username'>Username</input>} |
|
186 |
+ |
|
187 |
+\emph{....} |
|
188 |
+ |
|
189 |
+\emph{</form>} |
|
190 |
+ |
|
191 |
+\emph{<a href=''evilurl''><img src=''images.paypal.com/secure.jpg''></a>} |
|
192 |
+\end{quote} |
|
193 |
+The resulting \textsc{realURL/displayedURL} pairs will be (note that |
|
194 |
+one tag can generate multiple pairs): |
|
195 |
+ |
|
196 |
+\begin{itemize} |
|
197 |
+\item evilurl / www.paypal.com |
|
198 |
+\item evilurl2 / click here to sign in |
|
199 |
+\item evilurl2 / www.ebay.com |
|
200 |
+\item evilurl\_form / cgi.ebay.com |
|
201 |
+\item cgi.ebay.com / Ebay |
|
202 |
+\item evilurl / images.paypal.com/secure.jpg |
|
203 |
+\end{itemize} |
|
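To see how such pairs can come out of an HTML document, here is a rough sketch built on Python's standard `html.parser`. It covers only the plain `<a>` rule (href as realURL, tag-stripped contents as displayedURL, a nested `<a>` treated as `</a><a ...>`). The `title`, `form`, `img/area` and `iframe` rules are left out, and the class and function names are hypothetical. It is not how the clamav html parser is implemented.

```python
from html.parser import HTMLParser


class AnchorPairs(HTMLParser):
    """Collect (realURL, displayedURL) pairs from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text seen inside it, with inner tags stripped

    def _close_anchor(self):
        if self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
        self._href, self._text = None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # A nested <a> is treated as </a><a ...>.
            self._close_anchor()
            self._href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_anchor()

    def handle_data(self, data):
        # Tags such as <b> are dropped, but their text content is kept.
        if self._href is not None:
            self._text.append(data)


def extract_pairs(html):
    parser = AnchorPairs()
    parser.feed(html)
    parser.close()
    return parser.pairs
```

Feeding it `<a href="evilurl"><b>www.paypal</b>.com</a>` gives the single pair `evilurl` / `www.paypal.com`, matching the first item in the list above.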
204 |
+ |
|
205 |
+\subsection{Simple patterns\label{sec:Simple-patterns}} |
|
206 |
+ |
|
207 |
+Simple patterns are matched literally, i.e. if you say: |
|
208 |
+ |
|
209 |
+\begin{quote} |
|
210 |
+www.google.com |
|
211 |
+\end{quote} |
|
212 |
+it is going to match \emph{www.google.com}, and only that. The \emph{. |
|
213 |
+(dot)} character has no special meaning (see the section on regexes |
|
214 |
+\vref{sec:Regular-expressions} for how the \emph{.(dot)} character |
|
215 |
+behaves there) |
|
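The difference between a simple pattern and a regex is easiest to see side by side. The sketch below compares a literal string comparison with a regex match on the same pattern. It uses Python's `re` module for the regex half, and it ignores the subdomain handling of H/M entries. Both function names are hypothetical.

```python
import re

PATTERN = "www.google.com"


def simple_match(pattern, hostname):
    # Simple pattern: plain string comparison, no character is special.
    return hostname == pattern


def regex_match(pattern, hostname):
    # Regex: an unescaped dot matches any single character.
    return re.fullmatch(pattern, hostname) is not None
```

Here `simple_match(PATTERN, "wwwxgoogle.com")` is false, while `regex_match(PATTERN, "wwwxgoogle.com")` is true. That is why dots have to be escaped when the same hostname is written as a regex.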
216 |
+ |
|
217 |
+ |
|
218 |
+\subsection{Regular expressions\label{sec:Regular-expressions}} |
|
219 |
+ |
|
220 |
+POSIX regular expressions are supported, and you can consider that |
|
221 |
+internally it is wrapped by \emph{\textasciicircum{}}, and \emph{\$.} |
|
222 |
+In other words, this means that the regular expression has to match |
|
223 |
+the entire concatenated (see section \vref{sub:RealURL,-displayedURL-concatenation} |
|
224 |
+for details on concatenation) url. |
|
225 |
+ |
|
226 |
+It is recommended that you read section \vref{sec:Introduction-to-regular} |
|
227 |
+to learn how to write regular expressions, and then come back and |
|
228 |
+read this for hints. |
|
229 |
+ |
|
230 |
+Be advised that clamav contains an internal, very basic regex matcher |
|
231 |
+to reduce the load on the regex matching core. Thus it is recommended |
|
232 |
+that you avoid using regex syntax not supported by it at the very |
|
233 |
+beginning of regexes (at least the first few characters). |
|
234 |
+ |
|
235 |
+Currently the clamav regex matcher supports: |
|
236 |
+ |
|
237 |
+\begin{itemize} |
|
238 |
+\item . (dot) character |
|
239 |
+\item \textbackslash{} (escaping special characters) |
|
240 |
+\item | (pipe) alternatives |
|
241 |
+\item {[}] (character classes) |
|
242 |
+\item () (parentheses for grouping, but no group extraction is performed) |
|
243 |
+\item other non-special characters |
|
244 |
+\end{itemize} |
|
245 |
+Thus the following are not supported: |
|
246 |
+ |
|
247 |
+\begin{itemize} |
|
248 |
+\item + repetition |
|
249 |
+\item {*} repetition |
|
250 |
+\item \{\} repetition |
|
251 |
+\item backreferences |
|
252 |
+\item lookaround |
|
253 |
+\item other {}``advanced'' features not listed in the supported list ;) |
|
254 |
+\end{itemize} |
|
255 |
+This however shouldn't discourage you from using the {}``not directly |
|
256 |
+supported features'', because if the internal engine encounters |
|
257 |
+unsupported syntax, it passes it on to the POSIX regex core (beginning |
|
258 |
+from the first unsupported token, everything before that is still |
|
259 |
+processed by the internal matcher). An example might make this more |
|
260 |
+clear: |
|
261 |
+ |
|
262 |
+\emph{www\textbackslash{}.google\textbackslash{}.(com|ro|it) ({[}a-zA-Z])+\textbackslash{}.google\textbackslash{}.(com|ro|it)} |
|
263 |
+ |
|
264 |
+Everything till \emph{({[}a-zA-Z])+} is processed internally, that |
|
265 |
+parenthesis (and everything beyond) is processed by the POSIX core. |
|
266 |
+ |
|
267 |
+Examples of url pairs that match: |
|
268 |
+ |
|
269 |
+\begin{itemize} |
|
270 |
+\item \emph{www.google.ro images.google.ro} |
|
271 |
+\item www.google.com images.google.ro |
|
272 |
+\end{itemize} |
|
273 |
+Example of url pairs that don't match: |
|
274 |
+ |
|
275 |
+\begin{itemize} |
|
276 |
+\item www.google.ro images1.google.ro |
|
277 |
+\item images.google.com image.google.com |
|
278 |
+\end{itemize} |
|
279 |
+ |
|
280 |
+\subsection{Flags\label{sec:Flags}} |
|
281 |
+ |
|
282 |
+Flags are a binary OR of the following numbers: |
|
283 |
+ |
|
284 |
+\begin{description} |
|
285 |
+\item [{HOST\_SUFFICIENT}] 1 |
|
286 |
+\item [{DOMAIN\_SUFFICIENT}] 2 |
|
287 |
+\item [{DO\_REVERSE\_LOOKUP}] 4 |
|
288 |
+\item [{CHECK\_REDIR}] 8 |
|
289 |
+\item [{CHECK\_SSL}] 16 |
|
290 |
+\item [{CHECK\_CLOAKING}] 32 |
|
291 |
+\item [{CLEANUP\_URL}] 64 |
|
292 |
+\item [{CHECK\_DOMAIN\_REVERSE}] 128 |
|
293 |
+\item [{CHECK\_IMG\_URL}] 256 |
|
294 |
+\item [{DOMAINLIST\_REQUIRED}] 512 |
|
295 |
+\end{description} |
|
296 |
+The names of the constants are self-explanatory. |
|
297 |
+ |
|
298 |
+These constants are defined in libclamav/phishcheck.h, you can check |
|
299 |
+there for the latest flags. |
|
300 |
+ |
|
301 |
+There is a default set of flags that are enabled, these are currently: |
|
302 |
+(CLEANUP\_URL|DOMAIN\_SUFFICIENT|CHECK\_SSL|CHECK\_CLOAKING|DOMAINLIST\_REQUIRED|CHECK\_IMG\_URL), |
|
303 |
+SSL checking is currently performed only for \emph{a} tags. |
|
304 |
+ |
|
305 |
+You must decide for each line in the domainlist if you want to filter |
|
306 |
+any flags (that is you don't want certain checks to be done), and |
|
307 |
+then calculate the binary OR of those constants, and then convert |
|
308 |
+it into a 3-digit hexnumber. For example you decide that domain\_sufficient |
|
309 |
+shouldn't be used for ebay.com, and you don't want to check images |
|
310 |
+either, so you come up with this flag number: $2|256\Rightarrow258$ (decimal) $\Rightarrow102$ (hexadecimal) |
|
311 |
+ |
|
312 |
+So you add this line to daily.pdb: |
|
313 |
+ |
|
314 |
+\begin{itemize} |
|
315 |
+\item R102~www.ebay.com~.+ |
|
316 |
+\end{itemize} |
|
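The flag arithmetic is easy to get wrong by hand, so a short sketch may help. The constant values are copied from the list above. The dictionary and the helper `filter_field` are hypothetical names used for illustration only; libclamav/phishcheck.h remains the authoritative source for the values.

```python
FLAGS = {
    "HOST_SUFFICIENT": 1,
    "DOMAIN_SUFFICIENT": 2,
    "DO_REVERSE_LOOKUP": 4,
    "CHECK_REDIR": 8,
    "CHECK_SSL": 16,
    "CHECK_CLOAKING": 32,
    "CLEANUP_URL": 64,
    "CHECK_DOMAIN_REVERSE": 128,
    "CHECK_IMG_URL": 256,
    "DOMAINLIST_REQUIRED": 512,
}


def filter_field(*names):
    """Binary-OR the named flags and format the result as 3 hex digits."""
    value = 0
    for name in names:
        value |= FLAGS[name]
    return format(value, "03x")


# The example from the text: filter DOMAIN_SUFFICIENT and CHECK_IMG_URL.
EBAY_FILTER = filter_field("DOMAIN_SUFFICIENT", "CHECK_IMG_URL")
```

`EBAY_FILTER` comes out as `102`, the number used in the line above. The `03x` format pads smaller values with leading zeros, so filtering only CHECK\_SSL would give `010`.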
317 |
+ |
|
318 |
+\section{Introduction to regular expressions\label{sec:Introduction-to-regular}} |
|
319 |
+ |
|
320 |
+Recommended reading: |
|
321 |
+ |
|
322 |
+\begin{itemize} |
|
323 |
+\item http://www.regular-expressions.info/quickstart.html |
|
324 |
+\item http://www.regular-expressions.info/tutorial.html |
|
325 |
+\item regex(7) man-page: http://www.tin.org/bin/man.cgi?section=7\&topic=regex |
|
326 |
+\end{itemize} |
|
327 |
+ |
|
328 |
+\subsection{Special characters} |
|
329 |
+ |
|
330 |
+\begin{description} |
|
331 |
+\item [{{[}}] the opening square bracket - it marks the beginning of a |
|
332 |
+character class, see section \vref{sub:Character-classes} |
|
333 |
+\item [{\textbackslash{}}] the backslash - escapes special characters, |
|
334 |
+see section \vref{sub:Escaping} |
|
335 |
+\item [{\^{ }}] the caret - matches the beginning of a line (not needed |
|
336 |
+in clamav regexes, this is implied) |
|
337 |
+\item [{\$}] the dollar sign - matches the end of a line (not needed in |
|
338 |
+clamav regexes, this is implied) |
|
339 |
+\item [{\.{ }}] the period or dot - matches \emph{any} character |
|
340 |
+\item [{|}] the vertical bar or pipe symbol - matches either the token |
|
341 |
+on its left or the one on its right, see section \vref{sub:Alternation} |
|
342 |
+\item [{?}] the question mark - matches optionally the left-side token, |
|
343 |
+see section \vref{sub:Optional-matching,-and} |
|
344 |
+\item [{{*}}] the asterisk or star - matches 0 or more occurrences of the |
|
345 |
+left-side token, see section \vref{sub:Optional-matching,-and} |
|
346 |
+\item [{+}] the plus sign - matches 1 or more occurrences of the left-side |
|
347 |
+token, see section \vref{sub:Optional-matching,-and} |
|
348 |
+\item [{(}] the opening round bracket - marks the beginning of a group, |
|
349 |
+see section \vref{sub:Groups} |
|
350 |
+\item [{)}] the closing round bracket - marks the end of a group, see section \vref{sub:Groups} |
|
351 |
+\end{description} |
|
352 |
+ |
|
353 |
+\subsection{Character classes\label{sub:Character-classes}} |
|
354 |
+ |
|
355 |
+ |
|
356 |
+\subsection{Escaping\label{sub:Escaping}} |
|
357 |
+ |
|
358 |
+Escaping has two purposes: |
|
359 |
+ |
|
360 |
+\begin{itemize} |
|
361 |
+\item it allows you to actually match the special characters themselves, |
|
362 |
+for example to match the literal \emph{+}, you would write \emph{\textbackslash{}+} |
|
363 |
+\item it also allows you to match non-printable characters, such as the |
|
364 |
+tab (\emph{\textbackslash{}t}), newline (\emph{\textbackslash{}n}), |
|
365 |
+.. |
|
366 |
+\end{itemize} |
|
367 |
+However since non-printable characters are not valid inside an url, |
|
368 |
+you won't have a reason to use them. |
|
369 |
+ |
|
370 |
+ |
|
371 |
+\subsection{Alternation\label{sub:Alternation}} |
|
372 |
+ |
|
373 |
+ |
|
374 |
+\subsection{Optional matching, and repetition\label{sub:Optional-matching,-and}} |
|
375 |
+ |
|
376 |
+ |
|
377 |
+\subsection{Groups\label{sub:Groups}} |
|
378 |
+ |
|
379 |
+Groups are usually used together with repetition, or alternation. |
|
380 |
+For example: \emph{(com|it)+} means: match 1 or more repetitions of |
|
381 |
+\emph{com} or \emph{it,} that is it matches: com, it, comcom, comcomcom, |
|
382 |
+comit, itit, ititcom,... you get the idea. |
|
383 |
+ |
|
384 |
+Groups can also be used to extract substrings, but this is not supported |
|
385 |
+by the clam engine, and not needed either in this case. |
|
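The \emph{(com|it)+} example can be tried out directly. The sketch below uses Python's `re` module purely as a convenient way to experiment, and `fullmatch()` plays the role of the implicit anchoring that clamav applies. The helper name `matches` is hypothetical.

```python
import re

GROUP = re.compile(r"(com|it)+")


def matches(text):
    # The whole string must consist of one or more "com"/"it" repetitions.
    return GROUP.fullmatch(text) is not None


ACCEPTED = ["com", "it", "comcom", "comcomcom", "comit", "itit", "ititcom"]
REJECTED = ["", "co", "comi", "com.it"]
```

Every string in `ACCEPTED` matches and every string in `REJECTED` fails, since the group has to repeat as a whole unit.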
386 |
+ |
|
387 |
+ |
|
388 |
+\section{How to create database files} |
|
389 |
+ |
|
390 |
+ |
|
391 |
+\subsection{How to create and maintain the whitelist (daily.wdb)} |
|
392 |
+ |
|
393 |
+If the phishing code claims that a certain mail is phishing, but it is |
|
394 |
+not, you have 2 choices: |
|
395 |
+ |
|
396 |
+\begin{itemize} |
|
397 |
+\item examine your rules in daily.pdb, and fix them if necessary (see: section \vref{sub:How-to-create}) |
|
398 |
+\item add it to the whitelist (discussed here) |
|
399 |
+\end{itemize} |
|
400 |
+Let's assume you are having problems because of links like this in |
|
401 |
+a mail: |
|
402 |
+ |
|
403 |
+\begin{quote} |
|
404 |
+<a href=''http://69.0.241.57/bCentral/L.asp?L=XXXXXXXX''>http://www.bcentral.it/</a> |
|
405 |
+\end{quote} |
|
406 |
+After investigating those sites further, you decide they are no threat, |
|
407 |
+and create a line like this in daily.wdb: |
|
408 |
+ |
|
409 |
+\begin{quote} |
|
410 |
+R http://www\textbackslash{}.bcentral\textbackslash{}.it/.+ http://69\textbackslash{}.0\textbackslash{}.241\textbackslash{}.57/bCentral/L\textbackslash{}.asp\textbackslash{}?L=.+ |
|
411 |
+\end{quote} |
|
412 |
+Note: urls like the above can be used to track unique mail recipients, |
|
413 |
+and thus know if somebody actually reads mails (so they can send more |
|
414 |
+spam). However since this site required no authentication information, |
|
415 |
+it is safe from a phishing point of view. |
|
416 |
+ |
|
417 |
+ |
|
418 |
+\subsection{How to create and maintain the domainlist (daily.pdb)\label{sub:How-to-create}} |
|
419 |
+ |
|
420 |
+When not using --phish-scan-alldomains (production environments for |
|
421 |
+example), you need to decide which urls you are going to check. |
|
422 |
+ |
|
423 |
+Although at a first glance it might seem a good idea to check everything, |
|
424 |
+it would produce false positives. Particularly newsletters, ads, etc. |
|
425 |
+are likely to use URLs that look like phishing attempts. |
|
426 |
+ |
|
427 |
+Let's assume that you've recently seen many phishing attempts claiming |
|
428 |
+they come from Paypal. Thus you need to add paypal to daily.pdb: |
|
429 |
+ |
|
430 |
+\begin{quote} |
|
431 |
+R .+ .+\textbackslash{}.paypal\textbackslash{}.com |
|
432 |
+\end{quote} |
|
433 |
+The above line will block (detect as phishing) mails that contain |
|
434 |
+urls that claim to lead to paypal, but in fact do not. |
|
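It helps to test a domainlist regex against a few url pairs before adding it. The sketch below does that for the paypal rule, using Python's `re` module as an approximation of the POSIX regex core and a space as the separator, as in the example line. The urls and the helper `selected_for_checking` are made up for illustration.

```python
import re

RULE = r".+ .+\.paypal\.com"


def selected_for_checking(real_url, displayed_url, rule=RULE):
    """True if the pair matches the domainlist rule (illustrative only)."""
    return re.fullmatch(rule, real_url + " " + displayed_url) is not None
```

`selected_for_checking("http://evil.example/login", "www.paypal.com")` is true, but the same call with the displayed url `paypal.com` is false, because `.+\.paypal\.com` needs at least one character and a dot in front of `paypal`. If the bare domain should be covered too, one possible variant is `.+ (.+\.)?paypal\.com`.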
435 |
+ |
|
436 |
+Be careful not to create regexes that match too broad a range of |
|
437 |
+urls though. |
|
438 |
+ |
|
439 |
+ |
|
440 |
+\subsection{Dealing with false positives, and undetected phishing mails} |
|
441 |
+ |
|
442 |
+ |
|
443 |
+\subsubsection{False positives} |
|
444 |
+ |
|
445 |
+Whenever you see a false positive (mail that is detected as phishing, |
|
446 |
+but it is not), you need to examine \emph{why} clamav decided that it is |
|
447 |
+phishing. You can do this easily by building clamav with debugging |
|
448 |
+(./configure --enable-experimental --enable-debug), and then running |
|
449 |
+a tool: |
|
450 |
+ |
|
451 |
+\begin{quote} |
|
452 |
+\$contrib/phishing/why.py phishing.eml |
|
453 |
+\end{quote} |
|
454 |
+This will show the url that triggers the phish verdict, and a reason |
|
455 |
+why that url is considered a phishing attempt. |
|
456 |
+ |
|
457 |
+Once you know the reason, you might need to modify daily.pdb (if one |
|
458 |
+of your rules in there is too broad), or you need to add the url |
|
459 |
+to daily.wdb. If you think the algorithm is incorrect, please file |
|
460 |
+a bug report on bugzilla.clamav.net, including the output of \emph{why.py}. |
|
461 |
+ |
|
462 |
+ |
|
463 |
+\subsubsection{Undetected phish mails} |
|
464 |
+ |
|
465 |
+Using why.py doesn't help here unfortunately (it will say: clean), |
|
466 |
+so all you can do is: |
|
467 |
+ |
|
468 |
+\begin{quote} |
|
469 |
+\$clamscan/clamscan --phish-scan-alldomains undetected.eml |
|
470 |
+\end{quote} |
|
471 |
+and see whether the mail is detected. If it is, you need to add an appropriate |
|
472 |
+line to daily.pdb (see section \vref{sub:How-to-create}). |
|
473 |
+ |
|
474 |
+If the mail is not detected, then try using: |
|
475 |
+ |
|
476 |
+\begin{quote} |
|
477 |
+\$clamscan/clamscan --debug undetected.eml|less |
|
478 |
+\end{quote} |
|
479 |
+ |
|
480 |
+Then see what urls are being checked, see if any of them is in a |
|
481 |
+whitelist, see if all urls are detected, etc. |
|
482 |
+ |
|
483 |
+ |
|
484 |
+\section{Hints and recommendations} |
|
485 |
+ |
|
486 |
+ |
|
487 |
+\section{Examples} |
|
488 |
+ |
|
489 |
+ |
|
490 |
+\end{document} |