#LyX 1.4.2 created this file. For more info see http://www.lyx.org/ \lyxformat 245 \begin_document \begin_header \textclass article \language english \inputencoding auto \fontscheme pslatex \graphics default \paperfontsize default \spacing single \papersize a4paper \use_geometry false \use_amsmath 1 \cite_engine basic \use_bibtopic false \paperorientation portrait \secnumdepth 3 \tocdepth 3 \paragraph_separation indent \defskip medskip \quotes_language english \papercolumns 1 \papersides 1 \paperpagestyle default \tracking_changes false \output_changes false \end_header \begin_body \begin_layout Title \family roman \series medium \shape up \size normal \emph off \bar no \noun off \color none Phishing signatures creation HOWTO \end_layout \begin_layout Author Török Edwin \end_layout \begin_layout Section Database file format \end_layout \begin_layout Standard The database file format is the same for the whitelist (.wdb) and the domainlist (.pdb); each file consists of one or more lines of the form: \end_layout \begin_layout Standard \series bold Flags\InsetSpace ~ RealURL\InsetSpace ~ DisplayedURL \end_layout \begin_layout Itemize Where \noun on Flags \noun default is: \end_layout \begin_deeper \begin_layout Itemize an (optional) character: \end_layout \begin_deeper \begin_layout Description R regex, has to match the entire url, see section \begin_inset LatexCommand \vref{sec:Regular-expressions} \end_inset \end_layout \begin_layout Description H has to match the host part of the url only (a simple pattern, i.e. it is matched literally) \end_layout \begin_layout Description no\InsetSpace ~ character matches the entire url, but as a simple pattern (non-regex) \end_layout \end_deeper \begin_layout Itemize followed by an (optional) 3-digit hexadecimal number representing flags that should be filtered. 
\end_layout \begin_deeper \begin_layout Itemize flag filtering only makes sense in .pdb files (clamav won't complain if you put flags in .wdb files, it just won't use them) \end_layout \begin_layout Itemize for details on how to construct a flag number see section \begin_inset LatexCommand \prettyref{sec:Flags} \end_inset \end_layout \end_deeper \end_deeper \begin_layout Itemize \noun on RealURL \noun default is the URL the user is sent to \end_layout \begin_layout Itemize \noun on DisplayedURL \noun default is the URL description displayed to the user, that is, where it is \emph on claimed \emph default they are sent; the most obvious example is an html anchor (a tag): its href attribute is the \noun on realURL \noun default , and its contents are the \noun on displayedURL \end_layout \begin_layout Itemize see section \begin_inset LatexCommand \vref{sub:Extraction-of-realURL,} \end_inset for more details on what \noun on realURL/displayedURL \noun default is \end_layout \begin_layout Standard Note: The spaces are mandatory, and empty lines are skipped. \end_layout \begin_layout Standard If any line of daily.wdb/daily.pdb doesn't conform to the above file format, loading of the file fails, and the whitelist/domainlist feature is disabled. If loading of the whitelist fails, the phishing checks are disabled entirely. \end_layout \begin_layout Standard Therefore it is important to test daily.wdb/daily.pdb before packing them into daily.cvd! \end_layout \begin_layout Subsubsection Example \end_layout \begin_layout Standard The following line: \end_layout \begin_layout Standard \emph on R http://www \backslash .google \backslash .(com|ro|it) www \backslash .google \backslash .com \end_layout \begin_layout Standard Means: \emph on \noun on R \emph default \noun default - the rest of the line is a regex. \end_layout \begin_layout Standard Example of url pairs matching: http://www.google.com www.google.com, http://www.google.it www.google.com. 
\end_layout \begin_layout Standard Example of url pairs not matching: http://www.google.c0m www.google.com \end_layout \begin_layout Subsection How matching works \end_layout \begin_layout Subsubsection RealURL, displayedURL concatenation \begin_inset LatexCommand \label{sub:RealURL,-displayedURL-concatenation} \end_inset \end_layout \begin_layout Standard The phishing detection module processes pairs of realURL/displayedURL, and the matching against daily.wdb/daily.pdb is done as follows: the realURL is concatenated with a space, and with the displayedURL, then that \emph on line \emph default is matched against the lines in daily.wdb/daily.pdb \end_layout \begin_layout Standard So if you have a line like \end_layout \begin_layout Standard \shape italic \InsetSpace ~ www.google.ro\InsetSpace ~ www.google.com \end_layout \begin_layout Standard and a href like: \emph on www.google.com, \emph default then it will match, but: \emph on www.google.com \emph default will not match. \end_layout \begin_layout Standard If you use the \series bold \noun on H \noun default \series default flag, then the 2nd href will match too. \end_layout \begin_layout Subsubsection What happens when a match is found \end_layout \begin_layout Standard In the case of the whitelist, a match means that the realURL/displayedURL combination is considered \noun on clean \noun default , and no further checks are performed on it. \end_layout \begin_layout Standard In the case of the domainlist, a match means that the realURL/displayedURL is going to be checked for phishing attempts. This is only done if you don't run clamav with the \emph on alldomains \emph default option (since then all urls are checked). Furthermore you can restrict what checks are to be performed by specifying the 3-digit hexnumber. 
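\end_layout \begin_layout Standard The concatenation step and the two kinds of match described above can be summarised in a small Python sketch. It is an illustration only, not clamav's implementation: the function names are hypothetical, both lists are modelled as plain sets of simple (literal) lines, and regex entries, the \noun on H \noun default flag and the 3-digit flag number are left out. \end_layout \begin_layout LyX-Code
```python
def concatenate(real_url, displayed_url):
    """Join a URL pair the way a database line is laid out (hypothetical helper)."""
    # realURL, one space, displayedURL
    return real_url + " " + displayed_url


def needs_phishing_check(real_url, displayed_url, whitelist, domainlist,
                         alldomains=False):
    """Decide whether a pair would go on to the phishing checks.

    whitelist and domainlist are sets of literal lines; this is an
    assumption made to keep the sketch short.
    """
    line = concatenate(real_url, displayed_url)
    if line in whitelist:
        # A whitelist match means the pair is considered clean.
        return False
    if alldomains:
        # With the alldomains option every remaining pair is checked.
        return True
    # Otherwise only pairs listed in the domainlist are checked.
    return line in domainlist


whitelist = {"www.google.ro www.google.com"}
domainlist = {"evilurl www.paypal.com"}

print(needs_phishing_check("www.google.ro", "www.google.com", whitelist, domainlist))
print(needs_phishing_check("evilurl", "www.paypal.com", whitelist, domainlist))
print(needs_phishing_check("evilurl", "www.ebay.com", whitelist, domainlist))
print(needs_phishing_check("evilurl", "www.ebay.com", whitelist, domainlist,
                           alldomains=True))
```
\end_layout \begin_layout Standard In this model the whitelisted pair is never checked, the pair listed in the domainlist is checked, and the unlisted pair is checked only when the alldomains option is on. 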
\end_layout \begin_layout Subsubsection Extraction of \noun on realURL \noun default , \noun on displayedURL \noun default from HTML tags \begin_inset LatexCommand \label{sub:Extraction-of-realURL,} \end_inset \end_layout \begin_layout Standard The html parser extracts pairs of \noun on realURL \noun default / \noun on displayedURL \noun default based on the following rules: \end_layout \begin_layout Description a (anchor) the \emph on href \emph default is the \noun on realURL \noun default , its \emph on contents \emph default is the \noun on displayedURL \end_layout \begin_deeper \begin_layout Description contents is the tag-stripped contents of the \emph on a \emph default tags, so any tags nested inside them are stripped (but not their contents) \end_layout \begin_layout Standard nesting another \emph on a \emph default tag within an \emph on a \emph default tag (besides being invalid html) is treated as a tag is the \noun on displayedURL \end_layout \end_deeper \begin_layout Description img/area if nested within an \emph on a \emph default tag, the \noun on realURL \noun default is the \emph on href \emph default of the \emph on a \emph default tag, and the \emph on src/dynsrc/area \emph default is the \noun on displayedURL \noun default of the img \end_layout \begin_deeper \begin_layout Standard if nested within a \emph on form \emph default tag, then the action attribute of the \emph on form \emph default tag is the \noun on realURL \noun default \end_layout \end_deeper \begin_layout Description iframe if nested within an \emph on a \emph default tag, the \emph on src \emph default attribute is the \noun on displayedURL \noun default , and the \emph on href \emph default of its parent \emph on a \emph default tag is the \noun on realURL \end_layout \begin_deeper \begin_layout Standard if nested within a \emph on form \emph default tag, then the action attribute of the \emph on form \emph default tag is the \noun on realURL \end_layout \end_deeper \begin_layout Subsubsection Example \end_layout \begin_layout Standard Consider this html file: \end_layout \begin_layout Quote \emph on www.paypal.com 
\end_layout \begin_layout Quote \emph on click here to sign in \end_layout \begin_layout Quote \emph on
\end_layout \begin_layout Quote \emph on Please sign in to Ebay using this form \end_layout \begin_layout Quote \emph on Username \end_layout \begin_layout Quote \emph on .... \end_layout \begin_layout Quote \emph on
\end_layout \begin_layout Quote \emph on \end_layout \begin_layout Standard The resulting \noun on realURL/displayedURL \noun default pairs will be (note that one tag can generate multiple pairs): \end_layout \begin_layout Itemize evilurl / www.paypal.com \end_layout \begin_layout Itemize evilurl2 / click here to sign in \end_layout \begin_layout Itemize evilurl2 / www.ebay.com \end_layout \begin_layout Itemize evilurl_form / cgi.ebay.com \end_layout \begin_layout Itemize cgi.ebay.com / Ebay \end_layout \begin_layout Itemize evilurl / image.paypal.com/secure.jpg \end_layout \begin_layout Subsection Simple patterns \begin_inset LatexCommand \label{sec:Simple-patterns} \end_inset \end_layout \begin_layout Standard Simple patterns are matched literally, i.e. if you say: \end_layout \begin_layout Quote www.google.com \end_layout \begin_layout Standard it is going to match \emph on www.google.com \emph default , and only that. The \emph on . (dot) \emph default character has no special meaning (see the section on regexes \begin_inset LatexCommand \vref{sec:Regular-expressions} \end_inset for how the \emph on . (dot) \emph default character behaves there). \end_layout \begin_layout Subsection Regular expressions \begin_inset LatexCommand \label{sec:Regular-expressions} \end_inset \end_layout \begin_layout Standard POSIX regular expressions are supported, and each expression is treated as if it were wrapped in \emph on ^ \emph default and \emph on $. \emph default In other words, the regular expression has to match the entire concatenated url (see section \begin_inset LatexCommand \vref{sub:RealURL,-displayedURL-concatenation} \end_inset for details on concatenation). \end_layout \begin_layout Standard It is recommended that you read section \begin_inset LatexCommand \vref{sec:Introduction-to-regular} \end_inset to learn how to write regular expressions, and then come back and read this for hints. 
\end_layout \begin_layout Standard Be advised that clamav contains an internal, very basic regex matcher to reduce the load on the regex matching core. Thus it is recommended that you avoid using regex syntax not supported by it at the very beginning of regexes (at least the first few characters). \end_layout \begin_layout Standard Currently the clamav regex matcher supports: \end_layout \begin_layout Itemize . (dot) character \end_layout \begin_layout Itemize \backslash (escaping special characters) \end_layout \begin_layout Itemize | (pipe) alternatives \end_layout \begin_layout Itemize [] (character classes) \end_layout \begin_layout Itemize () (parentheses for grouping, but no group extraction is performed) \end_layout \begin_layout Itemize other non-special characters \end_layout \begin_layout Standard Thus the following are not supported: \end_layout \begin_layout Itemize + repetition \end_layout \begin_layout Itemize * repetition \end_layout \begin_layout Itemize {} repetition \end_layout \begin_layout Itemize backreferences \end_layout \begin_layout Itemize lookaround \end_layout \begin_layout Itemize other \begin_inset Quotes eld \end_inset advanced \begin_inset Quotes erd \end_inset features not listed in the supported list ;) \end_layout \begin_layout Standard This however shouldn't discourage you from using the \begin_inset Quotes eld \end_inset not directly supported features \begin_inset Quotes erd \end_inset , because if the internal engine encounters unsupported syntax, it passes it on to the POSIX regex core (beginning from the first unsupported token; everything before that is still processed by the internal matcher). 
An example might make this clearer: \end_layout \begin_layout Standard \emph on www \backslash .google \backslash .(com|ro|it) ([a-zA-Z])+ \backslash .google \backslash .(com|ro|it) \end_layout \begin_layout Standard Everything up to \emph on ([a-zA-Z])+ \emph default is processed internally; that parenthesized group (and everything beyond it) is processed by the POSIX core. \end_layout \begin_layout Standard Examples of url pairs that match: \end_layout \begin_layout Itemize \emph on www.google.ro images.google.ro \end_layout \begin_layout Itemize www.google.com images.google.ro \end_layout \begin_layout Standard Examples of url pairs that don't match: \end_layout \begin_layout Itemize www.google.ro images1.google.ro \end_layout \begin_layout Itemize images.google.com image.google.com \end_layout \begin_layout Subsection Flags \begin_inset LatexCommand \label{sec:Flags} \end_inset \end_layout \begin_layout Standard Flags are a bitwise OR of the following numbers: \end_layout \begin_layout Description HOST_SUFFICIENT 1 \end_layout \begin_layout Description DOMAIN_SUFFICIENT 2 \end_layout \begin_layout Description DO_REVERSE_LOOKUP 4 \end_layout \begin_layout Description CHECK_REDIR 8 \end_layout \begin_layout Description CHECK_SSL 16 \end_layout \begin_layout Description CHECK_CLOAKING 32 \end_layout \begin_layout Description CLEANUP_URL 64 \end_layout \begin_layout Description CHECK_DOMAIN_REVERSE 128 \end_layout \begin_layout Description CHECK_IMG_URL 256 \end_layout \begin_layout Description DOMAINLIST_REQUIRED 512 \end_layout \begin_layout Standard The names of the constants are self-explanatory. \end_layout \begin_layout Standard These constants are defined in libclamav/phishcheck.h; check there for the latest flags. 
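\end_layout \begin_layout Standard Combining the constants listed above into the 3-digit hexadecimal field can be sketched in Python as follows. The values are copied from the list; the helper name flag_field is hypothetical, and writing the hexadecimal digits in upper case is an assumption, since this document does not say which case the database parser expects. \end_layout \begin_layout LyX-Code
```python
# Values copied from the list of constants above.
FLAGS = {
    "HOST_SUFFICIENT": 1,
    "DOMAIN_SUFFICIENT": 2,
    "DO_REVERSE_LOOKUP": 4,
    "CHECK_REDIR": 8,
    "CHECK_SSL": 16,
    "CHECK_CLOAKING": 32,
    "CLEANUP_URL": 64,
    "CHECK_DOMAIN_REVERSE": 128,
    "CHECK_IMG_URL": 256,
    "DOMAINLIST_REQUIRED": 512,
}


def flag_field(*names):
    """Return the 3-digit hexadecimal field for the given flag names.

    Hypothetical helper, not part of clamav.
    """
    value = 0
    for name in names:
        value |= FLAGS[name]  # bitwise OR of the selected constants
    # Zero-padded to three hexadecimal digits (upper case is an assumption).
    return format(value, "03X")


print(flag_field("DOMAIN_SUFFICIENT", "CHECK_IMG_URL"))  # 2 | 256 = 258 = 0x102
print(flag_field("CHECK_SSL", "CHECK_CLOAKING"))         # 16 | 32 = 48 = 0x030
```
\end_layout \begin_layout Standard The two calls print 102 and 030; the second one shows why the zero padding matters, since the field is always three digits wide. 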
\end_layout \begin_layout Standard There is a default set of flags that are enabled, these are currently: (CLEANUP_URL|DOMAIN_SUFFICIENT|CHECK_SSL|CHECK_CLOAKING|DOMAINLIST_REQUIRED|CHECK_IMG_URL); ssl checking is currently performed only for a tags. \end_layout \begin_layout Standard You must decide for each line in the domainlist if you want to filter any flags (that is, you don't want certain checks to be done), then calculate the bitwise OR of those constants, and convert the result into a 3-digit hexadecimal number. For example you decide that domain_sufficient shouldn't be used for ebay.com, and you don't want to check images either, so you come up with this flag number: \begin_inset Formula $2\,|\,256=258\;\mathrm{(decimal)}=102\;\mathrm{(hexadecimal)}$ \end_inset \end_layout \begin_layout Standard So you add this line to daily.pdb: \end_layout \begin_layout Itemize R102\InsetSpace ~ www.ebay.com\InsetSpace ~ .+ \end_layout \begin_layout Section Introduction to regular expressions \begin_inset LatexCommand \label{sec:Introduction-to-regular} \end_inset \end_layout \begin_layout Standard Recommended reading: \end_layout \begin_layout Itemize http://www.regular-expressions.info/quickstart.html \end_layout \begin_layout Itemize http://www.regular-expressions.info/tutorial.html \end_layout \begin_layout Itemize regex(7) man-page: http://www.tin.org/bin/man.cgi?section=7&topic=regex \end_layout \begin_layout Subsection Special characters \end_layout \begin_layout Description [ the opening square bracket - it marks the beginning of a character class, see section \begin_inset LatexCommand \vref{sub:Character-classes} \end_inset \end_layout \begin_layout Description \backslash the backslash - escapes special characters, see section \begin_inset LatexCommand \vref{sub:Escaping} \end_inset \end_layout \begin_layout Description \i \^{ } the caret - matches the beginning of a line (not needed in clamav regexes, this is implied) \end_layout 
\begin_layout Description $ the dollar sign - matches the end of a line (not needed in clamav regexes, this is implied) \end_layout \begin_layout Description \i \.{ } the period or dot - matches \emph on any \emph default character \end_layout \begin_layout Description | the vertical bar or pipe symbol - matches either the token on its left side or the token on its right side, see section \begin_inset LatexCommand \vref{sub:Alternation} \end_inset \end_layout \begin_layout Description ? the question mark - optionally matches the left-side token, see section \begin_inset LatexCommand \vref{sub:Optional-matching,-and} \end_inset \end_layout \begin_layout Description * the asterisk or star - matches 0 or more occurrences of the left-side token, see section \begin_inset LatexCommand \vref{sub:Optional-matching,-and} \end_inset \end_layout \begin_layout Description + the plus sign - matches 1 or more occurrences of the left-side token, see section \begin_inset LatexCommand \vref{sub:Optional-matching,-and} \end_inset \end_layout \begin_layout Description ( the opening round bracket - marks the beginning of a group, see section \begin_inset LatexCommand \vref{sub:Groups} \end_inset \end_layout \begin_layout Description ) the closing round bracket - marks the end of a group, see section \begin_inset LatexCommand \vref{sub:Groups} \end_inset \end_layout \begin_layout Subsection Character classes \begin_inset LatexCommand \label{sub:Character-classes} \end_inset \end_layout \begin_layout Subsection Escaping \begin_inset LatexCommand \label{sub:Escaping} \end_inset \end_layout \begin_layout Standard Escaping has two purposes: \end_layout \begin_layout Itemize it allows you to actually match the special characters themselves, for example to match the literal \emph on + \emph default , you would write \emph on \backslash + \end_layout \begin_layout Itemize it also allows you to match non-printable characters, such as the tab ( \emph on \backslash t \emph default ), newline ( \emph on \backslash n 
\emph default ), etc. \end_layout \begin_layout Standard However since non-printable characters are not valid inside a url, you won't have a reason to use them. \end_layout \begin_layout Subsection Alternation \begin_inset LatexCommand \label{sub:Alternation} \end_inset \end_layout \begin_layout Subsection Optional matching, and repetition \begin_inset LatexCommand \label{sub:Optional-matching,-and} \end_inset \end_layout \begin_layout Subsection Groups \begin_inset LatexCommand \label{sub:Groups} \end_inset \end_layout \begin_layout Standard Groups are usually used together with repetition or alternation. For example: \emph on (com|it)+ \emph default means: match 1 or more repetitions of \emph on com \emph default or \emph on it, \emph default that is, it matches: com, it, comcom, comcomcom, comit, itit, ititcom,... you get the idea. \end_layout \begin_layout Standard Groups can also be used to extract substrings, but this is not supported by the clamav engine, and is not needed in this case either. 
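\end_layout \begin_layout Standard The behaviour of the group example above can be tried out with a few lines of Python. This is only a sketch: Python's re module is not the POSIX regex core used by clamav, although for this particular pattern the two agree, and fullmatch stands in for the implied ^ and $. The helper name matches_whole is hypothetical. \end_layout \begin_layout LyX-Code
```python
import re

# (com|it)+ : one or more repetitions of "com" or "it".
GROUP_PATTERN = re.compile("(com|it)+")


def matches_whole(text):
    """True if the whole text consists of repetitions of com/it."""
    # fullmatch requires the pattern to cover the entire string,
    # which mirrors the implied ^ and $ of clamav regexes.
    return GROUP_PATTERN.fullmatch(text) is not None


for candidate in ["com", "it", "comcom", "comit", "ititcom", "", "co", "com.it"]:
    print(repr(candidate), matches_whole(candidate))
```
\end_layout \begin_layout Standard The first five candidates match; the empty string, co and com.it do not, because at least one complete repetition is required and nothing else may appear in the string. 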
\end_layout \begin_layout Section How to create database files \end_layout \begin_layout Subsection How to create and maintain the whitelist (daily.wdb) \end_layout \begin_layout Standard If the phishing code claims that a certain mail is phishing, but it's not, you have 2 choices: \end_layout \begin_layout Itemize examine your rules in daily.pdb, and fix them if necessary (see section \begin_inset LatexCommand \vref{sub:How-to-create} \end_inset ) \end_layout \begin_layout Itemize add it to the whitelist (discussed here) \end_layout \begin_layout Standard Let's assume you are having problems because of links like this in a mail: \end_layout \begin_layout Quote http://www.bcentral.it/ \end_layout \begin_layout Standard After investigating those sites further, you decide they are no threat, and create a line like this in daily.wdb: \end_layout \begin_layout Quote R http://www \backslash .bcentral \backslash .it/.+ http://69 \backslash .0 \backslash .241 \backslash .57/bCentral/L \backslash .asp \backslash ?L=.+ \end_layout \begin_layout Standard Note: urls like the above can be used to track unique mail recipients, and thus know if somebody actually reads mails (so they can send more spam). However since this site required no authentication information, it is safe from a phishing point of view. \end_layout \begin_layout Subsection How to create and maintain the domainlist (daily.pdb) \begin_inset LatexCommand \label{sub:How-to-create} \end_inset \end_layout \begin_layout Standard When not using --phish-scan-alldomains (production environments for example), you need to decide which urls you are going to check. \end_layout \begin_layout Standard Although at first glance it might seem a good idea to check everything, it would produce false positives. Newsletters, ads, etc. in particular are likely to use URLs that look like phishing attempts. \end_layout \begin_layout Standard Let's assume that you've recently seen many phishing attempts claiming they come from Paypal. 
Thus you need to add paypal to daily.pdb: \end_layout \begin_layout Quote R .+ .+ \backslash .paypal \backslash .com \end_layout \begin_layout Standard The above line will block (detect as phishing) mails that contain urls that claim to lead to paypal, but in fact don't. \end_layout \begin_layout Standard Be careful not to create regexes that match too broad a range of urls though. \end_layout \begin_layout Subsection Dealing with false positives, and undetected phishing mails \end_layout \begin_layout Subsubsection False positives \end_layout \begin_layout Standard Whenever you see a false positive (mail that is detected as phishing, but it's not), you need to examine \emph on why \emph default clamav decided that it's phishing. You can do this easily by building clamav with debugging (./configure --enable-experimental --enable-debug), and then running a tool: \end_layout \begin_layout Quote $ contrib/phishing/why.py phishing.eml \end_layout \begin_layout Standard This will show the url that triggers the phish verdict, and a reason why that url is considered a phishing attempt. \end_layout \begin_layout Standard Once you know the reason, you might need to modify daily.pdb (if one of your rules in there is too broad), or you need to add the url to daily.wdb. If you think the algorithm is incorrect, please file a bug report on bugzilla.clamav.net, including the output of \emph on why.py \emph default . \end_layout \begin_layout Subsubsection Undetected phish mails \end_layout \begin_layout Standard Using why.py doesn't help here unfortunately (it will say: clean), so all you can do is: \end_layout \begin_layout Quote $ clamscan/clamscan --phish-scan-alldomains undetected.eml \end_layout \begin_layout Standard And see if the mail is detected; if yes, then you need to add an appropriate line to daily.pdb (see section \begin_inset LatexCommand \vref{sub:How-to-create} \end_inset ). 
\end_layout \begin_layout Standard If the mail is not detected, then try using: \end_layout \begin_layout Quote $ clamscan/clamscan --debug undetected.eml | less \end_layout \begin_layout Address Then see what urls are being checked, see if any of them is in a whitelist, see if all urls are detected, etc. \end_layout \begin_layout Section Hints and recommendations \end_layout \begin_layout Section Examples \end_layout \begin_layout Standard \end_layout \end_body \end_document