update documentation. Part I, more to come. (bb #554).

git-svn: trunk@3508

Török Edvin authored on 2008/01/19 00:28:05
Showing 4 changed files

ChangeLog index e2f3654..da7dc6c 100644
docs/phishsigs_howto.lyx index ebed15c..0000000
docs/phishsigs_howto.pdf index 8cb95cf..fe13655 100644
docs/phishsigs_howto.tex index 0000000..dfc8aa5

@@ -1,3 +1,8 @@
                     +Fri Jan 18 17:01:25 EET 2008 (edwin)
                     +------------------------------------
                     +  * docs/phishsigs_howto.tex/.pdf: update documentation. Part I, more to come.
                     +  (bb #554).
+                    +
                      Fri Jan 18 12:13:16 CET 2008 (acab)
                      -----------------------------------
                        * test: Storing the testifles byteswapped to avoid detection of the tarball.

docs/phishsigs_howto.lyx

File at commit 0d615f7

                     deleted file mode 100644
@@ -1,1363 +0,0 @@
                     -#LyX 1.4.2 created this file. For more info see http://www.lyx.org/
                     -\lyxformat 245
                     -\begin_document
                     -\begin_header
                     -\textclass article
                     -\language english
                     -\inputencoding auto
                     -\fontscheme pslatex
                     -\graphics default
                     -\paperfontsize default
                     -\spacing single
                     -\papersize a4paper
                     -\use_geometry false
                     -\use_amsmath 1
                     -\cite_engine basic
                     -\use_bibtopic false
                     -\paperorientation portrait
                     -\secnumdepth 3
                     -\tocdepth 3
                     -\paragraph_separation indent
                     -\defskip medskip
                     -\quotes_language english
                     -\papercolumns 1
                     -\papersides 1
                     -\paperpagestyle default
                     -\tracking_changes false
                     -\output_changes false
                     -\end_header
+                    -
                     -\begin_body
+                    -
                     -\begin_layout Title
+                    -
                     -\family roman
                     -\series medium
                     -\shape up
                     -\size normal
                     -\emph off
                     -\bar no
                     -\noun off
                     -\color none
                     -Phishing signatures creation HOWTO
                     -\end_layout
+                    -
                     -\begin_layout Author
                     -\end_layout
+                    -
                     -\begin_layout Section
                     -Database file format
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -The database file format is common for the whitelist (.wdb), and domainlist
                     - (.pdb), and it consists of (multiple) lines of form:
                     -\end_layout
+                    -
                     -\begin_layout Standard
+                    -
                     -\series bold
                     -Flags\InsetSpace ~
                     -RealURL\InsetSpace ~
                     -DisplayedURL
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -Where
                     -\noun on
                     -Flags
                     -\noun default
                     - is:
                     -\end_layout
+                    -
                     -\begin_deeper
                     -\begin_layout Itemize
                     -an (optional) character :
                     -\end_layout
+                    -
                     -\begin_deeper
                     -\begin_layout Description
                     -R regex, has to match entire url, see section
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -H has to match the host part of url only (a simple pattern, i.e.
                     - it is matched literally)
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -no\InsetSpace ~
                     -character matches the entire url, but as a simple pattern (non-regex)
                     -\end_layout
+                    -
                     -\end_deeper
                     -\begin_layout Itemize
                     -followed by an (optional) 3-digit hexadecimal number representing flags
                     - that should be filtered.
                     -\end_layout
+                    -
                     -\begin_deeper
                     -\begin_layout Itemize
                     -flag filtering only makes sense in .pdb files, (however clamav won't complain
                     - if you put flags in .wdb files, it just won't use them)
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -for details on how to construct a flag number see section
                     -\begin_inset LatexCommand \prettyref{sec:Flags}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\end_deeper
                     -\end_deeper
                     -\begin_layout Itemize
+                    -
                     -\noun on
                     -RealURL
                     -\noun default
                     -is the URL the user is sent to
                     -\end_layout
+                    -
                     -\begin_layout Itemize
+                    -
                     -\noun on
                     -displayedURL
                     -\noun default
                     - is the URL description displayed to the user, that is where it is
                     -\emph on
                     -claimed
                     -\emph default
                     - they are sent, the most obvious example is that of an html anchor (<a>tag):
                     - its href attribute is the
                     -\noun on
                     -realURL
                     -\noun default
                     -, and its contents is the
                     -\noun on
                     -displayedURL
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -see section
                     -\begin_inset LatexCommand \vref{sub:Extraction-of-realURL,}
+                    -
                     -\end_inset
+                    -
                     - for more details on what
                     -\noun on
                     -realURL/displayedURL
                     -\noun default
                     - is
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Note: The spaces are mandatory, and empty lines are skipped.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -If any of the lines of daily.wdb/daily.pdb don't conform to the above file
                     - format, the loading of the file shall fail, and whitelist/domainlist feature
                     - will be disabled.
                     - If the loading of the whitelist fails, the phishing checks will be disabled
                     - entirely.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Therefore it is important to test the daily.wdb/daily.pdb before packing it
                     - into daily.cvd!
                     -\end_layout
+                    -
                     -\begin_layout Subsubsection
                     -Example
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -The following line:
                     -\end_layout
+                    -
                     -\begin_layout Standard
+                    -
                     -\emph on
                     -R http://www
                     -\backslash
                     -.google
                     -\backslash
                     -.(com|ro|it) www
                     -\backslash
                     -.google
                     -\backslash
                     -.com
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Means:
                     -\emph on
                     -\noun on
                     -R
                     -\emph default
+                    -
                     -\noun default
                     -- this is a regex.
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Example of url pairs matching: http://www.google.com www.google.com, http://www.googl
                     -e.it www.google.com.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Example of url pairs not matching: http://www.google.c0m www.google.com
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -How matching works
                     -\end_layout
+                    -
                     -\begin_layout Subsubsection
                     -RealURL, displayedURL concatenation
                     -\begin_inset LatexCommand \label{sub:RealURL,-displayedURL-concatenation}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -The phishing detection module processes pairs of realURL/displayedURL, and
                     - the matching against daily.wdb/daily.pdb is done as follows: the realURL
                     - is concatenated with a space, and with the displayedURL, then that
                     -\emph on
                     -line
                     -\emph default
                     -is matched against the lines in daily.wdb/daily.pdb
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -So if you have a line like
                     -\end_layout
+                    -
                     -\begin_layout Standard
+                    -
                     -\shape italic
                     -\InsetSpace ~
                     -www.google.ro\InsetSpace ~
                     -www.google.com
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -and a href like:
                     -\emph on
                     -<a href=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -http://www.google.ro
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     ->www.google.com</a>,
                     -\emph default
                     -then it will match, but:
                     -\emph on
                     -<a href=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -http://images.google.com
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     ->www.google.com</a>
                     -\emph default
                     - will not match.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -If you use the
                     -\series bold
                     -\noun on
                     -H
                     -\noun default
+                    -
                     -\series default
                     -flag, then the 2nd href will match too.
                     -\end_layout
+                    -
                     -\begin_layout Subsubsection
                     -What happens when a match is found
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -In the case of the whitelist, a match means that the realURL/displayedURL
                     - combination is considered
                     -\noun on
                     -clean
                     -\noun default
                     -, and no further checks are performed on it.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -In the case of the domainlist, a match means that the realURL/displayedURL
                     - is going to be checked for phishing attempts.
                     - This is only done if you don't run clamav with the
                     -\emph on
                     -alldomains
                     -\emph default
                     - option (since then all urls are checked).
                     - Furthermore you can restrict what checks are to be performed by specifying
                     - the 3-digit hexnumber.
                     -\end_layout
+                    -
                     -\begin_layout Subsubsection
                     -Extraction of
                     -\noun on
                     -realURL
                     -\noun default
                     -,
                     -\noun on
                     -displayedURL
                     -\noun default
                     - from HTML tags
                     -\begin_inset LatexCommand \label{sub:Extraction-of-realURL,}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -The html parser extracts pairs of
                     -\noun on
                     -realURL
                     -\noun default
                     -/
                     -\noun on
                     -displayedURL
                     -\noun default
                     - based on the following rules:
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -a (anchor) the
                     -\emph on
                     -href
                     -\emph default
                     - is the
                     -\noun on
                     -realURL
                     -\noun default
                     -, its
                     -\emph on
                     -contents
                     -\emph default
                     - is the
                     -\noun on
                     -displayedURL
                     -\end_layout
+                    -
                     -\begin_deeper
                     -\begin_layout Description
                     -contents is the tag-stripped contents of the <a> tags, so for example <b>
                     - tags are stripped (but not their contents)
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -nesting another <a> tag withing an <a> tag (besides being invalid html)
                     - is treated as a </a><a..
                     -\end_layout
+                    -
                     -\end_deeper
                     -\begin_layout Description
                     -form the
                     -\emph on
                     -action
                     -\emph default
                     -attribute is the
                     -\noun on
                     -realURL
                     -\noun default
                     -, and a nested <a> tag is the
                     -\noun on
                     -displayedURL
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -img/area if nested within an
                     -\emph on
                     - <a>
                     -\emph default
                     - tag, the
                     -\noun on
                     -realURL
                     -\noun default
                     - is the
                     -\emph on
                     -href
                     -\emph default
                     - of the a tag, and the
                     -\emph on
                     -src/dynsrc/area
                     -\emph default
                     - is the
                     -\noun on
                     -displayedURL
                     -\noun default
                     - of the img
                     -\end_layout
+                    -
                     -\begin_deeper
                     -\begin_layout Standard
                     -if nested withing a
                     -\emph on
                     -form
                     -\emph default
                     - tag, then the action attribute of the
                     -\emph on
                     -form
                     -\emph default
                     - tag is the
                     -\noun on
                     -realURL
                     -\noun default
+                    -
                     -\end_layout
+                    -
                     -\end_deeper
                     -\begin_layout Description
                     -iframe if nested withing an
                     -\emph on
                     -<a>
                     -\emph default
                     - tag the
                     -\emph on
                     -src
                     -\emph default
                     - attribute is the displayedURL, and the
                     -\emph on
                     -href
                     -\emph default
                     - of its parent
                     -\emph on
                     - a
                     -\emph default
                     - tag is the
                     -\noun on
                     -realURL
                     -\end_layout
+                    -
                     -\begin_deeper
                     -\begin_layout Standard
                     -if nested withing a
                     -\emph on
                     -form
                     -\emph default
                     - tag, then the action attribute of the
                     -\emph on
                     -form
                     -\emph default
                     - tag is the
                     -\noun on
                     -realURL
                     -\end_layout
+                    -
                     -\end_deeper
                     -\begin_layout Subsubsection
                     -Example
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Consider this html file:
                     -\end_layout
+                    -
                     -\begin_layout Quote
+                    -
                     -\emph on
                     -<a href=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -evilurl
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     ->www.paypal.com</a>
                     -\end_layout
+                    -
                     -\begin_layout Quote
+                    -
                     -\emph on
                     -<a href=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -evilurl2
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     - title=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -www.ebay.com
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     ->click here to sign in</a>
                     -\end_layout
+                    -
                     -\begin_layout Quote
+                    -
                     -\emph on
                     -<form action=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -evilurl_form
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     ->
                     -\end_layout
+                    -
                     -\begin_layout Quote
+                    -
                     -\emph on
                     -Please sign in to <a href=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -cgi.ebay.com
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     ->Ebay</a> using this form
                     -\end_layout
+                    -
                     -\begin_layout Quote
+                    -
                     -\emph on
                     -<input type='text' name='username'>Username</input>
                     -\end_layout
+                    -
                     -\begin_layout Quote
+                    -
                     -\emph on
                     -....
                     -\end_layout
+                    -
                     -\begin_layout Quote
+                    -
                     -\emph on
                     -</form>
                     -\end_layout
+                    -
                     -\begin_layout Quote
+                    -
                     -\emph on
                     -<a href=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -evilurl
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -><img src=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -images.paypal.com/secure.jpg
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -></a>
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -The resulting
                     -\noun on
                     -realURL/displayedURL
                     -\noun default
                     - pairs will be (note that one tag can generate multiple pairs):
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -evilurl / www.paypal.com
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -evilurl2 / click here to sign in
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -evilurl2 / www.ebay.com
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -evilurl_form / cgi.ebay.com
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -cgi.ebay.com / Ebay
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -evilurl / image.paypal.com/secure.jpg
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Simple patterns
                     -\begin_inset LatexCommand \label{sec:Simple-patterns}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Simple patterns are matched literally, i.e.
                     - if you say:
                     -\end_layout
+                    -
                     -\begin_layout Quote
                     -www.google.com
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -it is going to match
                     -\emph on
                     -www.google.com
                     -\emph default
                     -, and only that.
                     - The
                     -\emph on
                     -.
                     - (dot)
                     -\emph default
                     - character has no special meaning (see the section on regexes
                     -\begin_inset LatexCommand \vref{sec:Regular-expressions}
+                    -
                     -\end_inset
+                    -
                     - for how the
                     -\emph on
                     -.(dot)
                     -\emph default
                     - character behaves there)
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Regular expressions
                     -\begin_inset LatexCommand \label{sec:Regular-expressions}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -POSIX regular expressions are supported, and you can consider that internally
                     - it is wrapped by
                     -\emph on
                     -^
                     -\emph default
                     -, and
                     -\emph on
                     -$.
+                    -
                     -\emph default
                     -In other words, this means that the regular expression has to match the
                     - entire concatenated (see section
                     -\begin_inset LatexCommand \vref{sub:RealURL,-displayedURL-concatenation}
+                    -
                     -\end_inset
+                    -
                     - for details on concatenation) url.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -It is recomended that you read section
                     -\begin_inset LatexCommand \vref{sec:Introduction-to-regular}
+                    -
                     -\end_inset
+                    -
                     - to learn how to write regular expressions, and then come back and read
                     - this for hints.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Be advised that clamav contains an internal, very basic regex matcher to
                     - reduce the load on the regex matching core.
                     - Thus it is recomended that you avoid using regex syntax not supported by
                     - it at the very beginning of regexes (at least the first few characters).
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Currently the clamav regex matcher supports:
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -.
                     - (dot) character
                     -\end_layout
+                    -
                     -\begin_layout Itemize
+                    -
                     -\backslash
                     - (escaping special characters)
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -| (pipe) alternatives
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -[] (character classes)
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -() (paranthesis for grouping, but no group extraction is performed)
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -other non-special characters
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Thus the following are not supported:
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -+ repetition
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -* repetition
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -{} repetition
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -backreferences
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -lookaround
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -other
                     -\begin_inset Quotes eld
                     -\end_inset
+                    -
                     -advanced
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     - features not listed in the supported list ;)
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -This however shouldn't discourage you from using the
                     -\begin_inset Quotes eld
                     -\end_inset
+                    -
                     -not directly supported features
                     -\begin_inset Quotes eld
                     -\end_inset
+                    -
                     -, because if the internal engine encounters unsupported syntax, it passes
                     - it on to the POSIX regex core (beginning from the first unsupported token,
                     - everything before that is still processed by the internal matcher).
                     - An example might make this more clear:
                     -\end_layout
+                    -
                     -\begin_layout Standard
+                    -
                     -\emph on
                     -www
                     -\backslash
                     -.google
                     -\backslash
                     -.(com|ro|it) ([a-zA-Z])+
                     -\backslash
                     -.google
                     -\backslash
                     -.(com|ro|it)
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Everything till
                     -\emph on
                     -([a-zA-Z])+
                     -\emph default
                     - is processed internally, that paranthesis (and everything beyond) is processed
                     - by the posix core.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Examples of url pairs that match:
                     -\end_layout
+                    -
                     -\begin_layout Itemize
+                    -
                     -\emph on
                     -www.google.ro images.google.ro
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -www.google.com images.google.ro
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Example of url pairs that don't match:
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -www.google.ro images1.google.ro
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -images.google.com image.google.com
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Flags
                     -\begin_inset LatexCommand \label{sec:Flags}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Flags are a binary OR of the following numbers:
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -HOST_SUFFICIENT 1
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -DOMAIN_SUFFICIENT 2
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -DO_REVERSE_LOOKUP 4
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -CHECK_REDIR 8
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -CHECK_SSL 16
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -CHECK_CLOAKING 32
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -CLEANUP_URL 64
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -CHECK_DOMAIN_REVERSE 128
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -CHECK_IMG_URL 256
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -DOMAINLIST_REQUIRED 512
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -The names of the constants are self-explanatory.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -These constants are defined in libclamav/phishcheck.h, you can check there
                     - for the latest flags.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -There is a default set of flags that are enabled, these are currently: (CLEANUP_
                     -URL|DOMAIN_SUFFICIENT|CHECK_SSL|CHECK_CLOAKING|DOMAINLIST_REQUIRED|CHECK_IMG_URL
                     -), ssl checking is performed only for a tags currently.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -You must decide for each line in the domainlist if you want to filter any
                     - flags (that is you don't want certain checks to be done), and then calculate
                     - the binary OR of those constants, and then convert it into a 3-digit hexnumber.
                     - For example you decide that domain_sufficient shouldn't be used for ebay.com,
                     - and you don't want to check images either, so you come up with this flag
                     - number:
                     -\begin_inset Formula $2|256\Rightarrow$
                     -\end_inset
+                    -
                     -258
                     -\begin_inset Formula $(decimal)\Rightarrow102(hexadecimal)$
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -So you add this line to daily.wdb:
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -R102\InsetSpace ~
                     -www.ebay.com\InsetSpace ~
                     -.+
                     -\end_layout
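The flag arithmetic described above (OR the constants you want filtered, then write the result as a 3-digit hex number) can be sketched in Python. This is only an illustration: the numeric values are copied from the list in the text, while the `FLAGS` table and the `filter_field` helper are hypothetical names, not part of clamav.

```python
# Flag values as listed in the text (the text says they are defined in
# libclamav/phishcheck.h; check there for the current values).
FLAGS = {
    "HOST_SUFFICIENT": 1,
    "DOMAIN_SUFFICIENT": 2,
    "DO_REVERSE_LOOKUP": 4,
    "CHECK_REDIR": 8,
    "CHECK_SSL": 16,
    "CHECK_CLOAKING": 32,
    "CLEANUP_URL": 64,
    "CHECK_DOMAIN_REVERSE": 128,
    "CHECK_IMG_URL": 256,
    "DOMAINLIST_REQUIRED": 512,
}


def filter_field(*names):
    """Hypothetical helper: OR the named flags and format as 3 hex digits."""
    value = 0
    for name in names:
        value |= FLAGS[name]
    return f"{value:03x}"


# The ebay.com example from the text: 2 | 256 = 258 decimal = 102 hexadecimal.
print(filter_field("DOMAIN_SUFFICIENT", "CHECK_IMG_URL"))
```

The result is the number that follows the leading `R` in the example line above.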
+                    -
                     -\begin_layout Section
                     -Introduction to regular expressions
                     -\begin_inset LatexCommand \label{sec:Introduction-to-regular}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Recommended reading:
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -http://www.regular-expressions.info/quickstart.html
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -http://www.regular-expressions.info/tutorial.html
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -regex(7) man-page: http://www.tin.org/bin/man.cgi?section=7&topic=regex
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Special characters
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -[ the opening square bracket - it marks the beginning of a character class,
                     - see section
                     -\begin_inset LatexCommand \vref{sub:Character-classes}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Description
+                    -
                     -\backslash
                     - the backslash - escapes special characters, see section
                     -\begin_inset LatexCommand \vref{sub:Escaping}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -\i \^{ }
                     - the caret - matches the beginning of a line (not needed in clamav regexes,
                     - this is implied)
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -$ the dollar sign - matches the end of a line (not needed in clamav regexes,
                     - this is implied)
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -\i \.{ }
                     - the period or dot - matches
                     -\emph on
                     -any
                     -\emph default
                     - character
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -| the vertical bar or pipe symbol - matches either the token on its left
                     - or the token on its right, see section
                     -\begin_inset LatexCommand \vref{sub:Alternation}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -? the question mark - matches optionally the left-side token, see section
                     -\begin_inset LatexCommand \vref{sub:Optional-matching,-and}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -* the asterisk or star - matches 0 or more occurrences of the left-side token,
                     - see section
                     -\begin_inset LatexCommand \vref{sub:Optional-matching,-and}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -+ the plus sign - matches 1 or more occurrences of the left-side token, see
                     - section
                     -\begin_inset LatexCommand \vref{sub:Optional-matching,-and}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -( the opening round bracket - marks beginning of a group, see section
                     -\begin_inset LatexCommand \vref{sub:Groups}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Description
                     -) the closing round bracket - marks end of a group, see section
                     -\begin_inset LatexCommand \vref{sub:Groups}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Character classes
                     -\begin_inset LatexCommand \label{sub:Character-classes}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Escaping
                     -\begin_inset LatexCommand \label{sub:Escaping}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Escaping has two purposes:
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -it allows you to actually match the special characters themselves, for example
                     - to match the literal
                     -\emph on
                     -+
                     -\emph default
                     -, you would write
                     -\emph on
+                    -
                     -\backslash
                     -+
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -it also allows you to match non-printable characters, such as the tab (
                     -\emph on
+                    -
                     -\backslash
                     -t
                     -\emph default
                     -), newline (
                     -\emph on
+                    -
                     -\backslash
                     -n
                     -\emph default
                     -), ..
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -However since non-printable characters are not valid inside an url, you
                     - won't have a reason to use them.
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Alternation
                     -\begin_inset LatexCommand \label{sub:Alternation}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Optional matching, and repetition
                     -\begin_inset LatexCommand \label{sub:Optional-matching,-and}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Groups
                     -\begin_inset LatexCommand \label{sub:Groups}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Groups are usually used together with repetition, or alternation.
                     - For example:
                     -\emph on
                     -(com|it)+
                     -\emph default
                     - means: match 1 or more repetitions of
                     -\emph on
                     -com
                     -\emph default
                     - or
                     -\emph on
                     -it,
                     -\emph default
                     - that is it matches: com, it, comcom, comcomcom, comit, itit, ititcom,...
                     - you get the idea.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Groups can also be used to extract substrings, but this is not supported
                     - by the clam engine, and is not needed in this case either.
                     -\end_layout
+                    -
                     -\begin_layout Section
                     -How to create database files
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -How to create and maintain the whitelist (daily.wdb)
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -If the phishing code claims that a certain mail is phishing, but it's not,
                     - you have 2 choices:
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -examine your rules in daily.pdb, and fix them if necessary (see: section
                     -\begin_inset LatexCommand \vref{sub:How-to-create}
+                    -
                     -\end_inset
+                    -
                     -)
                     -\end_layout
+                    -
                     -\begin_layout Itemize
                     -add it to the whitelist (discussed here)
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Let's assume you are having problems because of links like this in a mail:
                     -\end_layout
+                    -
                     -\begin_layout Quote
                     -<a href=
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     -http://69.0.241.57/bCentral/L.asp?L=XXXXXXXX
                     -\begin_inset Quotes erd
                     -\end_inset
+                    -
                     ->http://www.bcentral.it/</a>
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -After investigating those sites further, you decide they are no threat,
                     - and create a line like this in daily.wdb:
                     -\end_layout
+                    -
                     -\begin_layout Quote
                     -R http://www
                     -\backslash
                     -.bcentral
                     -\backslash
                     -.it/.+ http://69
                     -\backslash
                     -.0
                     -\backslash
                     -.241
                     -\backslash
                     -.57/bCentral/L
                     -\backslash
                     -.asp?L=.+
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Note: urls like the above can be used to track unique mail recipients, and
                     - thus know if somebody actually reads mails (so they can send more spam).
                     - However since this site required no authentication information, it is safe
                     - from a phishing point of view.
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -How to create and maintain the domainlist (daily.pdb)
                     -\begin_inset LatexCommand \label{sub:How-to-create}
+                    -
                     -\end_inset
+                    -
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -When not using --phish-scan-alldomains (production environments for example),
                     - you need to decide which urls you are going to check.
+                    -
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Although at a first glance it might seem a good idea to check everything,
                     - it would produce false positives.
                     - Particularly newsletters, ads, etc.
                     - are likely to use URLs that look like phishing attempts.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Let's assume that you've recently seen many phishing attempts claiming they
                     - come from Paypal.
                     - Thus you need to add paypal to daily.pdb:
                     -\end_layout
+                    -
                     -\begin_layout Quote
                     -R .+ .+
                     -\backslash
                     -.paypal
                     -\backslash
                     -.com
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -The above line will block (detect as phishing) mails that contain urls that
                     - claim to lead to paypal, but in fact do not.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Be careful not to create regexes that match too broad a range of urls though.
                     -\end_layout
+                    -
                     -\begin_layout Subsection
                     -Dealing with false positives, and undetected phishing mails
                     -\end_layout
+                    -
                     -\begin_layout Subsubsection
                     -False positives
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Whenever you see a false positive (mail that is detected as phishing, but
                     - it is not), you need to examine
                     -\emph on
                     -why
                     -\emph default
                     - clamav decided that it is phishing.
                     - You can do this easily by building clamav with debugging (./configure --enable-e
                     -xperimental --enable-debug), and then running a tool:
                     -\end_layout
+                    -
                     -\begin_layout Quote
                     -$contrib/phishing/why.py phishing.eml
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -This will show the url that triggers the phish verdict, and a reason why
                     - that url is considered a phishing attempt.
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Once you know the reason, you might need to modify daily.pdb (if one of your
                     - rules in there is too broad), or you need to add the url to daily.wdb.
                     - If you think the algorithm is incorrect, please file a bugreport on bugzilla.cla
                     -mav.net, including the output of
                     -\emph on
                     -why.py
                     -\emph default
                     -.
                     -\end_layout
+                    -
                     -\begin_layout Subsubsection
                     -Undetected phish mails
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -Using why.py doesn't help here unfortunately (it will say: clean), so all
                     - you can do is:
                     -\end_layout
+                    -
                     -\begin_layout Quote
                     -$clamscan/clamscan --phish-scan-alldomains undetected.eml
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -And see if the mail is detected; if yes, then you need to add an appropriate
                     - line to daily.pdb (see section
                     -\begin_inset LatexCommand \vref{sub:How-to-create}
+                    -
                     -\end_inset
+                    -
                     -).
                     -\end_layout
+                    -
                     -\begin_layout Standard
                     -If the mail is not detected, then try using:
                     -\end_layout
+                    -
                     -\begin_layout Quote
                     -$clamscan/clamscan --debug undetected.eml|less
                     -\end_layout
+                    -
                     -\begin_layout Address
                     -Then see what urls are being checked, see if any of them is in a whitelist,
                     - see if all urls are detected, etc.
                     -\end_layout
+                    -
                     -\begin_layout Section
                     -Hints and recommendations
                     -\end_layout
+                    -
                     -\begin_layout Section
                     -Examples
                     -\end_layout
+                    -
                     -\begin_layout Standard
+                    -
                     -\end_layout
+                    -
                     -\end_body
                     -\end_document

docs/phishsigs_howto.pdf


Binary files a/docs/phishsigs_howto.pdf and b/docs/phishsigs_howto.pdf differ

docs/phishsigs_howto.tex


                     new file mode 100644
@@ -0,0 +1,491 @@
                     +%% LyX 1.5.3 created this file.  For more info, see http://www.lyx.org/.
                     +%% Do not edit unless you really know what you are doing.
                     +\documentclass[a4paper,english]{article}
                     +\usepackage{mathptmx}
                     +\usepackage[T1]{fontenc}
                     +\usepackage{varioref}
                     +\usepackage{prettyref}
                     +\usepackage{amssymb}
                     +\usepackage{pslatex}
                     +\usepackage[dvips]{graphicx}
                     +\usepackage{wrapfig}
                     +\usepackage{url}
                     +\date{}
+                    +
                     +\begin{document}
+                    +
                     +\title{{\huge Phishing signatures creation HOWTO}}
                     +\author{T\"or\"ok Edwin}
                     +\maketitle
+                    +
                     +\section{Database file format}
+                    +
                     +\subsection{PDB format}
                     +This file contains urls/hosts that are targets of phishing attempts.
                     +It contains lines in the following format:
                     +\begin{verbatim}
                     +R[Filter]:RealURL:DisplayedURL[:FuncLevelSpec]
                     +H[Filter]:DisplayedHostname[:FuncLevelSpec]
                     +\end{verbatim}
+                    +
                     +\begin{description}
                     + \item [{R}] regular expression, for the concatenated URL
                     + \item [{H}] matches the \verb+DisplayedHostname+ as a simple pattern (literally, no regular expression)
                     + 	\begin{itemize}
                     + 		\item the pattern can match either the full hostname
                     + 		\item or a subdomain of the specified hostname
                     + 		\item to avoid false matches in case of subdomain matches, the engine checks that there  is a dot(\verb+.+) or a space(\verb+ +) before the matched portion
                     +	\end{itemize}
                     + \item [{Filter}] an (optional) 3-digit hexadecimal number representing flags that should be filtered.
                     +	\begin{itemize}
                     + 		\item flag filtering only makes sense in .pdb files. (however clamav won't complain if you put flags in .wdb files, it will just skip them)
                     + 		\item for details on how to construct a flag number see section \prettyref{sec:Flags}
                     +	\end{itemize}
+                    +
                     + \item [{RealURL }] is the URL the user is sent to
                     + \item [{DisplayedURL}] is the URL description displayed to the user, that is, where it is \emph{claimed} they are sent. The most obvious example is an html anchor (<a> tag): its href attribute is the \textsc{realURL}, and its contents are the \textsc{displayedURL}
                     + \item [{DisplayedHostname}] is the hostname portion of the \textsc{displayedURL}
                     + \item [{FuncLevelSpec}] an (optional) functionality level, 2 formats are possible:
                     +	\begin{itemize}
                     + 		\item \verb+minlevel+ all engines having functionality level $\geq$ \verb+minlevel+ will load this line
                     + 		\item \verb+minlevel-maxlevel+ engines with functionality level $\geq$ \verb+minlevel+ and $<$ \verb+maxlevel+ will load this line
                     +	\end{itemize}
                     +\end{description}
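The two FuncLevelSpec forms described above can be sketched as a small check. This is a hedged illustration only: `level_allows` is a hypothetical helper, and treating an empty spec as "always load" is an assumption based on the field being optional.

```python
def level_allows(spec, engine_level):
    """Return True if an engine at engine_level would load a line with this spec.

    spec is either "" (field omitted), "minlevel", or "minlevel-maxlevel".
    """
    if not spec:
        # Assumption: an omitted FuncLevelSpec places no restriction.
        return True
    if "-" in spec:
        minlevel, maxlevel = spec.split("-", 1)
        # Range form: level >= minlevel and level < maxlevel.
        return int(minlevel) <= engine_level < int(maxlevel)
    # Single-value form: level >= minlevel.
    return engine_level >= int(spec)


print(level_allows("20", 25))
print(level_allows("20-25", 25))
```

Note that the upper bound is exclusive, so a `20-25` line is not loaded by an engine at level 25.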
+                    +
                     +\subsection{WDB format}
                     +This file contains whitelisted url pairs.
                     +It contains lines in the following format:
                     +\begin{verbatim}
                     +X:RealURL:DisplayedURL[:FuncLevelSpec]
                     +M:RealHostname:DisplayedHostname[:FuncLevelSpec]
                     +\end{verbatim}
+                    +
                     +\begin{description}
                     + \item [{X}] regular expression, for the \textsc{entire URL}, not just the hostname
                     + \begin{itemize}
                     +  \item The regular expression is by default anchored to start-of-line and end-of-line, as if you had used \verb+^RegularExpression$+
                     +  \item A trailing \verb+/+ is automatically added both to the regex, and the input string to avoid false matches
                     +  \item The regular expression matches the \textsc{concatenation} of RealURL, a colon(\verb+:+), and DisplayedURL as a single string. It doesn't separately match RealURL and DisplayedURL!
                     + \end{itemize}
                     + \item [{M}] matches hostname, or subdomain of it, see notes for \textsc{H} above
                     +\end{description}
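The three notes on X entries (implicit anchoring, the added trailing slash, and matching the colon-joined pair as one string) can be sketched as follows. This is an approximation under stated assumptions: Python's `re` module stands in for clamav's regex engine, `x_entry_matches` is a hypothetical name, and exactly how clamav appends the trailing `/` is assumed, not taken from the source code.

```python
import re


def x_entry_matches(regex, real_url, displayed_url):
    """Approximate how an X line is matched against one URL pair."""
    # The pair is matched as a single string: RealURL, a colon, DisplayedURL.
    # Assumption: a trailing "/" is appended to the input and to the regex.
    subject = f"{real_url}:{displayed_url}/"
    # fullmatch gives the implied ^...$ anchoring; the non-capturing group
    # keeps a top-level alternation from swallowing the appended "/".
    return re.fullmatch(f"(?:{regex})/", subject) is not None


pattern = r"http://www\.google\.(com|ro|it):www\.google\.com"
print(x_entry_matches(pattern, "http://www.google.it", "www.google.com"))
print(x_entry_matches(pattern, "http://www.google.c0m", "www.google.com"))
```

Because the whole concatenation must match, a RealURL with extra text after the hostname is not accepted by this pattern.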
+                    +
                     +\subsection{Hints}
+                    +
                     +\begin{itemize}
                     + \item empty lines are ignored
                     + \item the colons are mandatory
                     + \item Don't leave extra spaces on the end of a line!
                     + \item if any of the lines don't conform to this format, clamav will abort with a Malformed Database Error
                     + \item see section \vref{sub:Extraction-of-realURL,} for more details on \textsc{realURL/displayedURL}
                     +\end{itemize}
+                    +
                     +%TODO: give up-to-date examples
+                    +
                     +\subsubsection{Example}
+                    +
                     +The following line:
+                    +
                     +\emph{R http://www\textbackslash{}.google\textbackslash{}.(com|ro|it)
                     +www\textbackslash{}.google\textbackslash{}.com}
+                    +
                     +Means: \emph{\textsc{R}}\textsc{ }- this is a regex.
+                    +
                     +Example of url pairs matching: http://www.google.com www.google.com,
                     +http://www.google.it www.google.com.
+                    +
                     +Example of url pairs not matching: http://www.google.c0m www.google.com
+                    +
+                    +
                     +\subsection{How matching works}
+                    +
+                    +
                     +\subsubsection{RealURL, displayedURL concatenation\label{sub:RealURL,-displayedURL-concatenation}}
+                    +
                     +The phishing detection module processes pairs of realURL/displayedURL,
                     +and the matching against daily.wdb/daily.pdb is done as follows: the
                     +realURL is concatenated with a space, and with the displayedURL, then
                     +that \emph{line} is matched against the lines in daily.wdb/daily.pdb
+                    +
                     +So if you have a line like
+                    +
                     +\textit{~www.google.ro~www.google.com}
+                    +
                     +and a href like: \emph{<a href=''http://www.google.ro''>www.google.com</a>,}
                     +then it will match, but: \emph{<a href=''http://images.google.com''>www.google.com</a>}
                     +will not match.
+                    +
                     +If you use the \textbf{\textsc{H}} flag, then the 2nd href will match
                     +too.
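The literal hostname matching described for H entries (full hostname, or a subdomain, with a dot or space required before the matched portion) can be sketched like this. The function name `host_matches` is hypothetical and the logic is an illustration of the rule as stated in the format description, not clamav's actual code.

```python
def host_matches(pattern, displayed_host):
    """Literal match of pattern against a hostname or one of its subdomains."""
    if displayed_host == pattern:
        return True  # full hostname match
    if displayed_host.endswith(pattern):
        # Subdomain match: the character just before the matched portion
        # must be a dot or a space, to avoid hits like "notgoogle.com".
        before = displayed_host[-len(pattern) - 1]
        return before in ". "
    return False


print(host_matches("google.com", "images.google.com"))
print(host_matches("google.com", "notgoogle.com"))
```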
+                    +
+                    +
                     +\subsubsection{What happens when a match is found}
+                    +
                     +In the case of the whitelist, a match means that the realURL/displayedURL
                     +combination is considered \textsc{clean}, and no further checks are
                     +performed on it.
+                    +
                     +In the case of the domainlist, a match means that the realURL/displayedURL
                     +is going to be checked for phishing attempts. This is only done if
                     +you don't run clamav with the \emph{alldomains} option (since then
                     +all urls are checked). Furthermore you can restrict what checks are
                     +to be performed by specifying the 3-digit hexnumber.
+                    +
+                    +
                     +\subsubsection{Extraction of \textsc{realURL}, \textsc{displayedURL} from HTML tags\label{sub:Extraction-of-realURL,}}
+                    +
                     +The html parser extracts pairs of \textsc{realURL}/\textsc{displayedURL}
                     +based on the following rules:
+                    +
                     +\begin{description}
                     +\item [{a}] (anchor) the \emph{href} is the \textsc{realURL}, its \emph{contents}
                     +is the \textsc{displayedURL}
+                    +
                     +\begin{description}
                     +\item [{contents}] is the tag-stripped contents of the <a> tags, so for
                     +example <b> tags are stripped (but not their contents)
                     +\end{description}
                     +nesting another <a> tag within an <a> tag (besides being invalid
                     +html) is treated as a </a><a..
+                    +
                     +\item [{form}] the \emph{action} attribute is the \textsc{realURL}, and a
                     +nested <a> tag is the \textsc{displayedURL}
                     +\item [{img/area}] if nested within an \emph{<a>} tag, the \textsc{realURL}
                     +is the \emph{href} of the a tag, and the \emph{src/dynsrc/area} is
                     +the \textsc{displayedURL} of the img
+                    +
+                    +
                     +if nested within a \emph{form} tag, then the action attribute of
                     +the \emph{form} tag is the \textsc{realURL}
+                    +
                     +\item [{iframe}] if nested within an \emph{<a>} tag, the \emph{src} attribute
                     +is the \textsc{displayedURL}, and the \emph{href} of its parent \emph{a} tag
                     +is the \textsc{realURL}
+                    +
+                    +
                     +if nested within a \emph{form} tag, then the action attribute of
                     +the \emph{form} tag is the \textsc{realURL}
+                    +
                     +\end{description}
+                    +
                     +\subsubsection{Example}
+                    +
                     +Consider this html file:
+                    +
                     +\begin{quote}
                     +\emph{<a href=''evilurl''>www.paypal.com</a>}
+                    +
                     +\emph{<a href=''evilurl2'' title=''www.ebay.com''>click here to
                     +sign in</a>}
+                    +
                     +\emph{<form action=''evilurl\_form''>}
+                    +
                     +\emph{Please sign in to <a href=''cgi.ebay.com''>Ebay</a> using
                     +this form}
+                    +
                     +\emph{<input type='text' name='username'>Username</input>}
+                    +
                     +\emph{....}
+                    +
                     +\emph{</form>}
+                    +
                     +\emph{<a href=''evilurl''><img src=''images.paypal.com/secure.jpg''></a>}
                     +\end{quote}
                     +The resulting \textsc{realURL/displayedURL} pairs will be (note that
                     +one tag can generate multiple pairs):
+                    +
                     +\begin{itemize}
                     +\item evilurl / www.paypal.com
                     +\item evilurl2 / click here to sign in
                     +\item evilurl2 / www.ebay.com
                     +\item evilurl\_form / cgi.ebay.com
                     +\item cgi.ebay.com / Ebay
                     +\item evilurl / images.paypal.com/secure.jpg
                     +\end{itemize}
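A rough Python sketch of the extraction rules, using the standard html.parser module, is given here for illustration. It is not clamav's parser: the class and function names are hypothetical, only the a, form and img rules are covered, and the title attribute, area and iframe handling are deliberately left out.

```python
from html.parser import HTMLParser


class PairExtractor(HTMLParser):
    """Collect (realURL, displayedURL) pairs from <a>, <form> and <img> tags."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self.form_action = None   # action of the enclosing <form>, if any
        self.a_href = None        # href of the currently open <a>, if any
        self.a_text = []          # text seen inside the open <a>

    def _close_anchor(self):
        # Emit href / tag-stripped contents, then forget the anchor.
        if self.a_href is not None:
            text = " ".join("".join(self.a_text).split())
            if text:
                self.pairs.append((self.a_href, text))
        self.a_href = None
        self.a_text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._close_anchor()  # a nested <a> is treated as </a><a ...
            self.a_href = attrs.get("href")
            if self.a_href is not None and self.form_action is not None:
                self.pairs.append((self.form_action, self.a_href))
        elif tag == "form":
            self.form_action = attrs.get("action")
        elif tag == "img":
            src = attrs.get("src") or attrs.get("dynsrc")
            real = self.a_href if self.a_href is not None else self.form_action
            if src and real is not None:
                self.pairs.append((real, src))

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_anchor()
        elif tag == "form":
            self.form_action = None

    def handle_data(self, data):
        # Inner tags such as <b> are dropped, but their text is kept.
        if self.a_href is not None:
            self.a_text.append(data)


def extract_pairs(html):
    parser = PairExtractor()
    parser.feed(html)
    parser.close()
    return parser.pairs


sample = (
    '<a href="evilurl">www.<b>paypal</b>.com</a>'
    '<form action="evilurl_form">Please sign in to '
    '<a href="cgi.ebay.com">Ebay</a> using this form</form>'
    '<a href="evilurl"><img src="images.paypal.com/secure.jpg"></a>'
)
for real, displayed in extract_pairs(sample):
    print(real, "/", displayed)
```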
+                    +
                     +\subsection{Simple patterns\label{sec:Simple-patterns}}
+                    +
                     +Simple patterns are matched literally, i.e. if you say:
+                    +
                     +\begin{quote}
                     +www.google.com
                     +\end{quote}
                     +it is going to match \emph{www.google.com}, and only that. The \emph{.
                     +(dot)} character has no special meaning (see the section on regexes
                     +\vref{sec:Regular-expressions} for how the \emph{.(dot)} character
                     +behaves there)
+                    +
+                    +
                     +\subsection{Regular expressions\label{sec:Regular-expressions}}
+                    +
                     +POSIX regular expressions are supported, and you can consider that
                     +internally it is wrapped by \emph{\textasciicircum{}}, and \emph{\$.}
                     +In other words, this means that the regular expression has to match
                     +the entire concatenated (see section \vref{sub:RealURL,-displayedURL-concatenation}
                     +for details on concatenation) url.
+                    +
                     +It is recommended that you read section \vref{sec:Introduction-to-regular}
                     +to learn how to write regular expressions, and then come back and
                     +read this for hints.
+                    +
                     +Be advised that clamav contains an internal, very basic regex matcher
                     +to reduce the load on the regex matching core. Thus it is recommended
                     +that you avoid using regex syntax not supported by it at the very
                     +beginning of regexes (at least the first few characters).
+                    +
                     +Currently the clamav regex matcher supports:
+                    +
                     +\begin{itemize}
                     +\item . (dot) character
                     +\item \textbackslash{} (escaping special characters)
                     +\item | (pipe) alternatives
                     +\item {[}] (character classes)
                     +\item () (parentheses for grouping, but no group extraction is performed)
                     +\item other non-special characters
                     +\end{itemize}
                     +Thus the following are not supported:
+                    +
                     +\begin{itemize}
                     +\item + repetition
                     +\item {*} repetition
                     +\item \{\} repetition
                     +\item backreferences
                     +\item lookaround
                     +\item other {}``advanced'' features not listed in the supported list ;)
                     +\end{itemize}
                     +This however shouldn't discourage you from using the {}``not directly
                     +supported features'', because if the internal engine encounters
                     +unsupported syntax, it passes it on to the POSIX regex core (beginning
                     +from the first unsupported token; everything before that is still
                     +processed by the internal matcher). An example might make this
                     +clearer:
+                    +
                     +\emph{www\textbackslash{}.google\textbackslash{}.(com|ro|it) ({[}a-zA-Z])+\textbackslash{}.google\textbackslash{}.(com|ro|it)}
+                    +
                     +Everything up to \emph{({[}a-zA-Z])+} is processed internally; that
                     +parenthesis (and everything beyond) is processed by the POSIX core.
+                    +
                     +Examples of url pairs that match:
+                    +
                     +\begin{itemize}
                     +\item \emph{www.google.ro images.google.ro}
                     +\item www.google.com images.google.ro
                     +\end{itemize}
                     +Example of url pairs that don't match:
+                    +
                     +\begin{itemize}
                     +\item www.google.ro images1.google.ro
                     +\item images.google.com image.google.com
                     +\end{itemize}
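The four url pairs above can be checked with a short Python sketch. Python's re module stands in for the POSIX regex core here (an assumption; for the constructs used in this pattern the two behave alike), and re.fullmatch plays the role of the implied anchors at both ends. The helper name is hypothetical.

```python
import re

# The example regex from this section; fullmatch() supplies the implied
# anchors, so the whole concatenated "realURL displayedURL" line must match.
PATTERN = re.compile(
    r"www\.google\.(com|ro|it) ([a-zA-Z])+\.google\.(com|ro|it)"
)


def line_matches(real_url, displayed_url):
    return PATTERN.fullmatch(real_url + " " + displayed_url) is not None


print(line_matches("www.google.ro", "images.google.ro"))     # True
print(line_matches("www.google.com", "images.google.ro"))    # True
print(line_matches("www.google.ro", "images1.google.ro"))    # False: digit
print(line_matches("images.google.com", "image.google.com")) # False: not www
```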
+                    +
                     +\subsection{Flags\label{sec:Flags}}
+                    +
                     +Flags are a bitwise OR of the following numbers:
+                    +
                     +\begin{description}
                     +\item [{HOST\_SUFFICIENT}] 1
                     +\item [{DOMAIN\_SUFFICIENT}] 2
                     +\item [{DO\_REVERSE\_LOOKUP}] 4
                     +\item [{CHECK\_REDIR}] 8
                     +\item [{CHECK\_SSL}] 16
                     +\item [{CHECK\_CLOAKING}] 32
                     +\item [{CLEANUP\_URL}] 64
                     +\item [{CHECK\_DOMAIN\_REVERSE}] 128
                     +\item [{CHECK\_IMG\_URL}] 256
                     +\item [{DOMAINLIST\_REQUIRED}] 512
                     +\end{description}
                     +The names of the constants are self-explanatory.
+                    +
                     +These constants are defined in libclamav/phishcheck.h; you can check
                     +there for the latest flags.
+                    +
                     +There is a default set of flags that are enabled; these are currently:
                     +(CLEANUP\_URL|DOMAIN\_SUFFICIENT|CHECK\_SSL|CHECK\_CLOAKING|DOMAINLIST\_REQUIRED|CHECK\_IMG\_URL).
                     +SSL checking is currently performed only for \emph{a} tags.
+                    +
                     +You must decide for each line in the domainlist if you want to filter
                     +any flags (that is, you don't want certain checks to be done), then
                     +calculate the bitwise OR of those constants, and convert the result
                     +into a 3-digit hexnumber. For example, you decide that domain\_sufficient
                     +shouldn't be used for ebay.com, and you don't want to check images
                     +either, so you come up with this flag number: $2|256\Rightarrow258\ (\mathrm{decimal})\Rightarrow102\ (\mathrm{hexadecimal})$
+                    +
                     +So you add this line to daily.pdb:
+                    +
                     +\begin{itemize}
                     +\item R102~www.ebay.com~.+
                     +\end{itemize}
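The flag arithmetic can be double-checked with a small Python sketch. The numeric values are the ones listed above; the class and helper names are hypothetical and are not part of clamav.

```python
from enum import IntFlag


class PhishFlag(IntFlag):
    HOST_SUFFICIENT = 1
    DOMAIN_SUFFICIENT = 2
    DO_REVERSE_LOOKUP = 4
    CHECK_REDIR = 8
    CHECK_SSL = 16
    CHECK_CLOAKING = 32
    CLEANUP_URL = 64
    CHECK_DOMAIN_REVERSE = 128
    CHECK_IMG_URL = 256
    DOMAINLIST_REQUIRED = 512


# The default set quoted in the text above.
DEFAULT = (PhishFlag.CLEANUP_URL | PhishFlag.DOMAIN_SUFFICIENT
           | PhishFlag.CHECK_SSL | PhishFlag.CHECK_CLOAKING
           | PhishFlag.DOMAINLIST_REQUIRED | PhishFlag.CHECK_IMG_URL)


def to_hex3(flags):
    """Format a flag combination as the 3-digit hexnumber used in the database."""
    return format(int(flags), "03x")


def flag_names(value):
    """List the names of the flags that are set in an integer value."""
    return [flag.name for flag in PhishFlag if value & flag]


mask = PhishFlag.DOMAIN_SUFFICIENT | PhishFlag.CHECK_IMG_URL
print(int(mask), to_hex3(mask))    # 258 102
print(flag_names(int("102", 16)))  # ['DOMAIN_SUFFICIENT', 'CHECK_IMG_URL']
```

Decoding an existing 3-digit value with flag\_names is a quick way to review what a line in the database filters.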
+                    +
                     +\section{Introduction to regular expressions\label{sec:Introduction-to-regular}}
+                    +
                     +Recommended reading:
+                    +
                     +\begin{itemize}
                     +\item http://www.regular-expressions.info/quickstart.html
                     +\item http://www.regular-expressions.info/tutorial.html
                     +\item regex(7) man-page: http://www.tin.org/bin/man.cgi?section=7\&topic=regex
                     +\end{itemize}
+                    +
                     +\subsection{Special characters}
+                    +
                     +\begin{description}
                     +\item [{{[}}] the opening square bracket - it marks the beginning of a
                     +character class, see section \vref{sub:Character-classes}
                     +\item [{\textbackslash{}}] the backslash - escapes special characters,
                     +see section \vref{sub:Escaping}
                     +\item [{\^{ }}] the caret - matches the beginning of a line (not needed
                     +in clamav regexes, this is implied)
                     +\item [{\$}] the dollar sign - matches the end of a line (not needed in
                     +clamav regexes, this is implied)
                     +\item [{.}] the period or dot - matches \emph{any} character
                     +\item [{|}] the vertical bar or pipe symbol - matches either the token
                     +on its left or the token on its right, see section \vref{sub:Alternation}
                     +\item [{?}] the question mark - optionally matches the left-side token,
                     +see section \vref{sub:Optional-matching,-and}
                     +\item [{{*}}] the asterisk or star - matches 0 or more occurrences of the
                     +left-side token, see section \vref{sub:Optional-matching,-and}
                     +\item [{+}] the plus sign - matches 1 or more occurrences of the left-side
                     +token, see section \vref{sub:Optional-matching,-and}
                     +\item [{(}] the opening round bracket - marks the beginning of a group,
                     +see section \vref{sub:Groups}
                     +\item [{)}] the closing round bracket - marks the end of a group, see section \vref{sub:Groups}
                     +\end{description}
+                    +
                     +\subsection{Character classes\label{sub:Character-classes}}
+                    +
+                    +
                     +\subsection{Escaping\label{sub:Escaping}}
+                    +
                     +Escaping has two purposes:
+                    +
                     +\begin{itemize}
                     +\item it allows you to actually match the special characters themselves,
                     +for example to match the literal \emph{+}, you would write \emph{\textbackslash{}+}
                     +\item it also allows you to match non-printable characters, such as the
                     +tab (\emph{\textbackslash{}t}), newline (\emph{\textbackslash{}n}),
                     +etc.
                     +\end{itemize}
                     +However since non-printable characters are not valid inside an url,
                     +you won't have a reason to use them.
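The effect of escaping can be seen in a short Python sketch. Python's re module is used as a stand-in for the POSIX regex core (an assumption; the dot and the plus sign behave the same way in both), and the helper name is hypothetical.

```python
import re


def full(pattern, text):
    """True if the pattern matches the whole text (anchors implied)."""
    return re.fullmatch(pattern, text) is not None


# An unescaped dot matches any character, so this regex is too permissive.
print(full(r"www.google.com", "wwwXgoogle.com"))    # True
# An escaped dot only matches a literal dot.
print(full(r"www\.google\.com", "wwwXgoogle.com"))  # False
print(full(r"www\.google\.com", "www.google.com"))  # True
# The same applies to the plus sign.
print(full(r"a\+b", "a+b"))                         # True
print(full(r"a+b", "a+b"))                          # False
```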
+                    +
+                    +
                     +\subsection{Alternation\label{sub:Alternation}}
+                    +
+                    +
                     +\subsection{Optional matching, and repetition\label{sub:Optional-matching,-and}}
+                    +
+                    +
                     +\subsection{Groups\label{sub:Groups}}
+                    +
                     +Groups are usually used together with repetition, or alternation.
                     +For example: \emph{(com|it)+} means: match 1 or more repetitions of
                     +\emph{com} or \emph{it,} that is it matches: com, it, comcom, comcomcom,
                     +comit, itit, ititcom,... you get the idea.
+                    +
                     +Groups can also be used to extract substrings, but this is not supported
                     +by the clam engine, and is not needed in this case either.
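The group example can be tried out with this small Python sketch (again using Python's re module as a stand-in for the POSIX core, which is an assumption; the helper name is hypothetical):

```python
import re

GROUP = re.compile(r"(com|it)+")


def whole_match(text):
    """True if the text consists of one or more repetitions of com or it."""
    return GROUP.fullmatch(text) is not None


for candidate in ["com", "it", "comcom", "comit", "ititcom", "co", "comi", ""]:
    print(repr(candidate), whole_match(candidate))
```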
+                    +
+                    +
                     +\section{How to create database files}
+                    +
+                    +
                     +\subsection{How to create and maintain the whitelist (daily.wdb)}
+                    +
                     +If the phishing code claims that a certain mail is phishing, but it's
                     +not, you have 2 choices:
+                    +
                     +\begin{itemize}
                     +\item examine your rules in daily.pdb, and fix them if necessary (see section \vref{sub:How-to-create})
                     +\item add it to the whitelist (discussed here)
                     +\end{itemize}
                     +Let's assume you are having problems because of links like this in
                     +a mail:
+                    +
                     +\begin{quote}
                     +<a href=''http://69.0.241.57/bCentral/L.asp?L=XXXXXXXX''>http://www.bcentral.it/</a>
                     +\end{quote}
                     +After investigating those sites further, you decide they are no threat,
                     +and create a line like this in daily.wdb:
+                    +
                     +\begin{quote}
                     +R http://69\textbackslash{}.0\textbackslash{}.241\textbackslash{}.57/bCentral/L\textbackslash{}.asp\textbackslash{}?L=.+ http://www\textbackslash{}.bcentral\textbackslash{}.it/.+
                     +\end{quote}
                     +Note: urls like the above can be used to track unique mail recipients,
                     +and thus know if somebody actually reads mails (so they can send more
                     +spam). However since this site required no authentication information,
                     +it is safe from a phishing point of view.
+                    +
+                    +
                     +\subsection{How to create and maintain the domainlist (daily.pdb)\label{sub:How-to-create}}
+                    +
                     +When not using --phish-scan-alldomains (production environments for
                     +example), you need to decide which urls you are going to check.
+                    +
                     +Although at a first glance it might seem a good idea to check everything,
                     +it would produce false positives. Particularly newsletters, ads, etc.
                     +are likely to use URLs that look like phishing attempts.
+                    +
                     +Let's assume that you've recently seen many phishing attempts claiming
                     +they come from Paypal. Thus you need to add paypal to daily.pdb:
+                    +
                     +\begin{quote}
                     +R .+ .+\textbackslash{}.paypal\textbackslash{}.com
                     +\end{quote}
                     +The above line will block (detect as phishing) mails that contain
                     +urls that claim to lead to paypal but in fact don't.
+                    +
                     +Be careful not to create regexes that match too broad a range of
                     +urls though.
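Which url pairs the paypal line selects can be explored with a short Python sketch. Python's re module stands in for the POSIX regex core (an assumption), the helper name is hypothetical, and evil.example.org is a placeholder host. A match only means that the pair is selected for phishing checks, not that it is reported as phishing.

```python
import re

# The domainlist regex from above: any realURL, a space, then a
# displayedURL that ends in ".paypal.com".
PAYPAL_LINE = re.compile(r".+ .+\.paypal\.com")


def selected_for_check(real_url, displayed_url):
    return PAYPAL_LINE.fullmatch(real_url + " " + displayed_url) is not None


print(selected_for_check("evil.example.org", "www.paypal.com"))  # True
print(selected_for_check("www.paypal.com", "www.paypal.com"))    # True
print(selected_for_check("evil.example.org", "www.ebay.com"))    # False
# The bare domain has nothing in front of ".paypal.com", so it is missed.
print(selected_for_check("evil.example.org", "paypal.com"))      # False
```

The last case shows how a regex can also be too narrow: a displayed url of just paypal.com would need a separate line or a looser pattern.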
+                    +
+                    +
                     +\subsection{Dealing with false positives, and undetected phishing mails}
+                    +
+                    +
                     +\subsubsection{False positives}
+                    +
                     +Whenever you see a false positive (mail that is detected as phishing,
                     +but it's not), you need to examine \emph{why} clamav decided that it's
                     +phishing. You can do this easily by building clamav with debugging
                     +(./configure --enable-experimental --enable-debug), and then running
                     +a tool:
+                    +
                     +\begin{quote}
                     +\$contrib/phishing/why.py phishing.eml
                     +\end{quote}
                     +This will show the url that triggers the phish verdict, and the reason
                     +why that url is considered a phishing attempt.
+                    +
                     +Once you know the reason, you might need to modify daily.pdb (if one
                     +of your rules in there is too broad), or you need to add the url
                     +to daily.wdb. If you think the algorithm is incorrect, please file
                     +a bugreport on bugzilla.clamav.net, including the output of \emph{why.py}.
+                    +
+                    +
                     +\subsubsection{Undetected phish mails}
+                    +
                     +Using why.py doesn't help here unfortunately (it will say: clean),
                     +so all you can do is:
+                    +
                     +\begin{quote}
                     +\$clamscan/clamscan --phish-scan-alldomains undetected.eml
                     +\end{quote}
                     +and see if the mail is detected. If it is, you need to add an appropriate
                     +line to daily.pdb (see section \vref{sub:How-to-create}).
+                    +
                     +If the mail is not detected, then try using:
+                    +
                     +\begin{quote}
                     +\$clamscan/clamscan --debug undetected.eml|less
                     +\end{quote}
+                    +
                     +Then see what urls are being checked, see if any of them is in a
                     +whitelist, see if all urls are detected, etc.
+                    +
+                    +
                     +\section{Hints and recommendations}
+                    +
+                    +
                     +\section{Examples}
+                    +
+                    +
                     +\end{document}