%% LyX 1.5.3 created this file.  For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[a4paper,english,10pt]{article}
\usepackage{amssymb}
\usepackage{pslatex}
\usepackage[T1]{fontenc}
\usepackage[dvips]{graphicx}
\usepackage{url}
\usepackage{fancyhdr}
\usepackage{varioref}
\usepackage{prettyref}
\date{}

\begin{document}

\title{{\huge Phishing signatures creation HOWTO}}
\author{T\"or\"ok Edwin}
\maketitle

%TODO: define a LaTeX command, instead of using \textsc{RealURL} each time

\section{Database file format}

\subsection{PDB format}

This file contains URLs/hosts that are targets of phishing attempts. It contains lines in the following format:

\begin{verbatim}
R[Filter]:RealURL:DisplayedURL[:FuncLevelSpec]
H[Filter]:DisplayedHostname[:FuncLevelSpec]
\end{verbatim}

\begin{description}
\item [{R}] regular expression, for the concatenated URL
\item [{H}] matches the \verb+DisplayedHostname+ as a simple pattern (literally, not as a regular expression)
\begin{itemize}
\item the pattern can match either the full hostname
\item or a subdomain of the specified hostname
\item to avoid false matches in case of subdomain matches, the engine checks that there is a dot (\verb+.+) or a space (\verb+ +) before the matched portion
\end{itemize}
\item [{Filter}] is ignored for R and H for compatibility reasons
\item [{\textsc{RealURL}}] is the URL the user is sent to, for example the \emph{href} attribute of an HTML anchor (\emph{<a> tag})
\item [{\textsc{DisplayedURL}}] is the URL description displayed to the user, where it is \emph{claimed} they are sent, for example the contents of an HTML anchor (\emph{<a> tag})
\item [{DisplayedHostname}] is the hostname portion of the \textsc{DisplayedURL}
\item [{FuncLevelSpec}] an (optional) functionality level; two formats are possible:
\begin{itemize}
\item \verb+minlevel+ all engines having functionality level $\geq$ \verb+minlevel+ will load this line
\item \verb+minlevel-maxlevel+ engines with functionality level
$\geq$ \verb+minlevel+, and $<$ \verb+maxlevel+ will load this line
\end{itemize}
\end{description}

\subsection{GDB format}

This file contains URL hashes in the following format:

\begin{verbatim}
S:P:HostPrefix[:FuncLevelSpec]
S:F:Sha256hash[:FuncLevelSpec]
S1:P:HostPrefix[:FuncLevelSpec]
S1:F:Sha256hash[:FuncLevelSpec]
S2:P:HostPrefix[:FuncLevelSpec]
S2:F:Sha256hash[:FuncLevelSpec]
S:W:Sha256hash[:FuncLevelSpec]
\end{verbatim}

\begin{description}
\item [{S:}] These are hashes for Google Safe Browsing - malware sites, and should not be used for other purposes.
\item [{S2:}] These are hashes for Google Safe Browsing - phishing sites, and should not be used for other purposes.
\item [{S1:}] Hashes for blacklisting phishing sites. Virus name: Phishing.URL.Blacklisted
\item [{S:W}] Locally whitelisted hashes.
\item [{HostPrefix}] 4-byte prefix of the sha256 hash of the last 2 or 3 components of the hostname. If the prefix doesn't match, no further lookups are performed.
\item [{Sha256hash}] sha256 hash of the canonicalized URL, or a sha256 hash of its prefix/suffix according to the Google Safe Browsing ``Performing Lookups'' rules. There should be a corresponding \verb+:P:HostPrefix+ entry for the hash to be taken into consideration.
\end{description}

To see which hash/URL matched, run \verb+clamscan --debug+ and look for the following strings in its output: \verb+Looking up hash+, \verb+prefix matched+, and \verb+Hash matched+.

Local whitelisting of .gdb entries can be done by creating a local.gdb file and adding a line of the form \verb+S:W:Sha256hash+.
\subsection{WDB format}

This file contains whitelisted URL pairs. It contains lines in the following format:

\begin{verbatim}
X:RealURL:DisplayedURL[:FuncLevelSpec]
M:RealHostname:DisplayedHostname[:FuncLevelSpec]
\end{verbatim}

\begin{description}
\item [{X}] regular expression, for the \emph{entire URL}, not just the hostname
\begin{itemize}
\item The regular expression is by default anchored to start-of-line and end-of-line, as if you had used \verb+^RegularExpression$+
\item A trailing \verb+/+ is automatically added both to the regex and to the input string, to avoid false matches
\item The regular expression matches the \emph{concatenation} of the \textsc{RealURL}, a colon (\verb+:+), and the \textsc{DisplayedURL} as a single string. It doesn't separately match \textsc{RealURL} and \textsc{DisplayedURL}!
\end{itemize}
\item [{M}] matches the hostname, or a subdomain of it; see the notes for {H} above
\end{description}

\subsection{Hints}

\begin{itemize}
\item empty lines are ignored
\item the colons are mandatory
\item Don't leave extra spaces at the end of a line!
\item if any line doesn't conform to this format, ClamAV will abort with a Malformed Database Error
\item see section \vref{sub:Extraction-of-realURL,} for more details on \textsc{realURL/displayedURL}
\end{itemize}

\subsection{Examples of PDB signatures}

To check for phishing mails that target amazon.com, or subdomains of amazon.com:

\begin{verbatim}
H:amazon.com
\end{verbatim}

To do the same, but for amazon.co.uk:

\begin{verbatim}
H:amazon.co.uk
\end{verbatim}

To limit the signatures to certain engine versions:

\begin{verbatim}
H:amazon.co.uk:20-30
H:amazon.co.uk:20-
H:amazon.co.uk:0-20
\end{verbatim}

First line: engine versions 20, 21, ..., 29 can load it

Second line: engine versions $\geq$ 20 can load it

Third line: engine versions $<$ 20 can load it

In a real situation, you'd probably use the second form.
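The three range forms can also be checked mechanically. The following Python sketch only illustrates the FuncLevelSpec rules as this document states them (lower bound inclusive, upper bound exclusive, a missing upper bound meaning no limit). The function names \verb+parse_func_level_spec+ and \verb+engine_loads+ are hypothetical and are not part of ClamAV.

```python
from typing import Optional, Tuple


def parse_func_level_spec(spec: str) -> Tuple[int, Optional[int]]:
    """Split a FuncLevelSpec into (minlevel, maxlevel).

    maxlevel is None when the spec has no upper bound ("20" or "20-").
    Hypothetical helper for illustration; not ClamAV code.
    """
    low, sep, high = spec.partition("-")
    minlevel = int(low)
    maxlevel = int(high) if (sep and high) else None
    return minlevel, maxlevel


def engine_loads(spec: str, engine_level: int) -> bool:
    """Return True if an engine at engine_level would load the line."""
    minlevel, maxlevel = parse_func_level_spec(spec)
    if engine_level < minlevel:
        return False
    # The upper bound is exclusive: "20-30" covers levels 20..29.
    return maxlevel is None or engine_level < maxlevel


if __name__ == "__main__":
    for spec in ("20-30", "20-", "0-20"):
        loaded = [lvl for lvl in (19, 20, 29, 30) if engine_loads(spec, lvl)]
        print(spec, loaded)
```

With these definitions, \verb+engine_loads("20-30", 29)+ is true and \verb+engine_loads("20-30", 30)+ is false, which agrees with the description of the first example line.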
The second form is appropriate when your signature uses a feature that is not available in earlier versions, or when earlier versions have bugs with your signature. Neither is the case here; the above examples are for illustrative purposes only.

\subsection{Examples of WDB signatures}

To allow amazon's country-specific domains and amazon.com to mix domain names in \textsc{DisplayedURL} and \textsc{RealURL}:

\begin{verbatim}
X:.+\.amazon\.(at|ca|co\.uk|co\.jp|de|fr)([/?].*)?:.+\.amazon\.com([/?].*)?:17-
\end{verbatim}

Explanation of this signature:

\begin{description}
\item [{X:}] this is a regular expression
\item [{:17-}] load signature only for engines with functionality level $\geq$ 17 (recommended for type X)
\end{description}

The regular expression is the following (\verb+X:+ and \verb+:17-+ stripped, and a \verb+/+ appended):

\begin{verbatim}
.+\.amazon\.(at|ca|co\.uk|co\.jp|de|fr)([/?].*)?:.+\.amazon\.com([/?].*)?/
\end{verbatim}

Explanation of this regular expression (note that it is a single regular expression, not two regular expressions split at the {:}):

\begin{itemize}
\item \verb;.+; any subdomain of
\item \verb;\.amazon\.; domain we are whitelisting (\textsc{RealURL} part)
\item \verb;(at|ca|co\.uk|co\.jp|de|fr); country domains: at, ca, co.uk, co.jp, de, fr
\item \verb;([/?].*)?; recommended way to end the real URL part of a whitelist entry; this protects against embedded URLs (evilurl.example.com/amazon.co.uk/)
\item \verb;:; \textsc{RealURL} and \textsc{DisplayedURL} are concatenated via a {:}, so match a literal {:} here
\item \verb;.+; any subdomain of
\item \verb;\.amazon\.com; whitelisted \textsc{DisplayedURL}
\item \verb;([/?].*)?; recommended way to end the displayed URL part, to protect against embedded URLs
\item \verb;/; automatically added to further protect against embedded URLs
\end{itemize}

When you whitelist an entry, make sure you check that both domains are owned by the same entity.
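The concatenation and anchoring rules for type X entries can be emulated in a few lines of Python. This is a rough sketch under stated assumptions, not ClamAV's implementation: it assumes that Python's \verb+re+ module interprets this particular pattern the same way as the engine's regex matcher, the helper name \verb+wdb_x_matches+ is hypothetical, and the sample URLs are made up for illustration.

```python
import re

# Body of the example WDB signature (the "X:" prefix and ":17-" suffix removed).
SIGNATURE = (
    r".+\.amazon\.(at|ca|co\.uk|co\.jp|de|fr)([/?].*)?"
    r":.+\.amazon\.com([/?].*)?"
)


def wdb_x_matches(signature: str, real_url: str, displayed_url: str) -> bool:
    """Emulate the matching rules described for type X entries.

    Assumptions (taken from the text, not from ClamAV source code):
    - RealURL and DisplayedURL are joined with a single ':'
    - a trailing '/' is appended to both the regex and the input
    - the regex is anchored at both ends
    """
    pattern = "^" + signature + "/" + "$"
    subject = real_url + ":" + displayed_url + "/"
    return re.match(pattern, subject) is not None


if __name__ == "__main__":
    samples = [
        ("www.amazon.de/gp/cart", "www.amazon.com"),
        ("evilurl.example.com/amazon.co.uk/", "www.amazon.com"),
        ("www.amazon.de.evil.example.com", "www.amazon.com"),
    ]
    for real, displayed in samples:
        print(real, displayed, wdb_x_matches(SIGNATURE, real, displayed))
```

In this sketch only the first sample pair is accepted. The other two fail because the text following the country domain is neither a \verb+/+, a \verb+?+, nor the joining colon, or because no \verb+.amazon.+ followed by a listed country domain appears at all.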
What this whitelist entry allows is: links that claim to point to amazon.com (\textsc{DisplayedURL}), but really go to a country-specific domain of amazon (\textsc{RealURL}).

\subsection{Example for how the URL extractor works}

Consider the following HTML file:

\begin{verbatim}
1.displayedurl.example.com 2 di

splayedurl.example.com 3.nested.example.com 4.displayedurl.example.com

sometext 5.form.nested.link-displayedurl.example.com 6.displ ayedurl.example.com