# PhishSigs

- [PhishSigs](#phishsigs)
- [Database file format](#database-file-format)
- [PDB format](#pdb-format)
- [GDB format](#gdb-format)
- [WDB format](#wdb-format)
- [Hints](#hints)
- [Examples of PDB signatures](#examples-of-pdb-signatures)
- [Examples of WDB signatures](#examples-of-wdb-signatures)
- [Example for how the URL extractor works](#example-for-how-the-url-extractor-works)
- [How matching works](#how-matching-works)
- [RealURL, displayedURL concatenation](#realurl-displayedurl-concatenation)
- [What happens when a match is found](#what-happens-when-a-match-is-found)
- [Extraction of realURL, displayedURL from HTML tags](#extraction-of-realurl-displayedurl-from-html-tags)
- [Example](#example)
- [Simple patterns](#simple-patterns)
- [Regular expressions](#regular-expressions)
- [Flags](#flags)
- [Introduction to regular expressions](#introduction-to-regular-expressions)
- [Special characters](#special-characters)
- [Character classes](#character-classes)
- [Escaping](#escaping)
- [Alternation](#alternation)
- [Optional matching, and repetition](#optional-matching-and-repetition)
- [Groups](#groups)
- [How to create database files](#how-to-create-database-files)
- [How to create and maintain the whitelist (daily.wdb)](#how-to-create-and-maintain-the-whitelist-dailywdb)
- [How to create and maintain the domainlist (daily.pdb)](#how-to-create-and-maintain-the-domainlist-dailypdb)
- [Dealing with false positives, and undetected phishing mails](#dealing-with-false-positives-and-undetected-phishing-mails)
- [False positives](#false-positives)
- [Undetected phish mails](#undetected-phish-mails)

# Database file format

## PDB format

This file contains URLs/hosts that are targets of phishing attempts.
It contains lines in the following format:

```
R[Filter]:RealURL:DisplayedURL[:FuncLevelSpec]
H[Filter]:DisplayedHostname[:FuncLevelSpec]
```

- `R` regular expression, for the concatenated URL
- `H` matches the `DisplayedHostname` as a simple pattern (literally, no regular expression)
    - the pattern can match either the full hostname
    - or a subdomain of the specified hostname
    - to avoid false matches in case of subdomain matches, the engine checks that there is a dot (`.`) or a space (` `) before the matched portion
- `Filter` is ignored for R and H for compatibility reasons
- `RealURL` is the URL the user is sent to, example: *href* attribute of an HTML anchor (`<a>` tag)
- `DisplayedURL` is the URL description displayed to the user, where it is *claimed* they are sent, example: contents of an HTML anchor (`<a>` tag)
- `DisplayedHostname` is the hostname portion of the DisplayedURL
- `FuncLevelSpec` an (optional) functionality level, 2 formats are possible:
    - `minlevel` all engines having functionality level >= `minlevel` will load this line
    - `minlevel-maxlevel` engines with functionality level >= `minlevel`, and < `maxlevel` will load this line

## GDB format

This file contains URL hashes in the following format:

    S:P:HostPrefix[:FuncLevelSpec]
    S:F:Sha256hash[:FuncLevelSpec]
    S1:P:HostPrefix[:FuncLevelSpec]
    S1:F:Sha256hash[:FuncLevelSpec]
    S2:P:HostPrefix[:FuncLevelSpec]
    S2:F:Sha256hash[:FuncLevelSpec]
    S:W:Sha256hash[:FuncLevelSpec]

- `S:` These are hashes for Google Safe Browsing malware sites, and should not be used for other purposes.
- `S2:` These are hashes for Google Safe Browsing phishing sites, and should not be used for other purposes.
- `S1:` Hashes for blacklisting phishing sites. Virus name: Phishing.URL.Blacklisted
- `S:W:` Locally whitelisted hashes.
- `HostPrefix` 4-byte prefix of the sha256 hash of the last 2 or 3 components of the hostname. If the prefix doesn’t match, no further lookups are performed.
- `Sha256hash` sha256 hash of the canonicalized URL, or a sha256 hash of its prefix/suffix according to the Google Safe Browsing “Performing Lookups” rules. There should be a corresponding `:P:HostPrefix` entry for the hash to be taken into consideration.

To see which hash/URL matched, look at the `clamscan --debug` output for the following strings: `Looking up hash`, `prefix matched`, and `Hash matched`.

Local whitelisting of .gdb entries can be done by creating a local.gdb file and adding a line of the form `S:W:Sha256hash`.

## WDB format

This file contains whitelisted URL pairs. It contains lines in the following format:

```
X:RealURL:DisplayedURL[:FuncLevelSpec]
M:RealHostname:DisplayedHostname[:FuncLevelSpec]
```

- `X` regular expression, for the *entire URL*, not just the hostname
    - The regular expression is by default anchored to start-of-line and end-of-line, as if you had used `^RegularExpression$`
    - A trailing `/` is automatically added both to the regex and to the input string to avoid false matches
    - The regular expression matches the *concatenation* of the RealURL, a colon (`:`), and the DisplayedURL as a single string. It doesn’t separately match RealURL and DisplayedURL!
- `M` matches hostname, or subdomain of it, see notes for H above

## Hints

- empty lines are ignored
- the colons are mandatory
- don’t leave extra spaces at the end of a line!
- if any of the lines don’t conform to this format, clamav will abort with a Malformed Database Error
- see section [Extraction of realURL, displayedURL from HTML tags](#extraction-of-realurl-displayedurl-from-html-tags) for more details on realURL/displayedURL

## Examples of PDB signatures

To check for phishing mails that target amazon.com, or subdomains of amazon.com:

```
H:amazon.com
```

To do the same, but for amazon.co.uk:

```
H:amazon.co.uk
```

To limit the signatures to certain engine versions (by functionality level):

```
H:amazon.co.uk:20-30
H:amazon.co.uk:20-
H:amazon.co.uk:0-20
```

- First line: engines with functionality level 20, 21, ..., 29 can load it
- Second line: engines with functionality level >= 20 can load it
- Third line: engines with functionality level < 20 can load it

In a real situation, you’d probably use the second form. You would need it if you are using a signature feature that is not available in earlier versions, or if earlier versions have bugs with your signature. Neither is the case here; the above examples are for illustrative purposes only.

## Examples of WDB signatures

To allow Amazon’s country-specific domains and amazon.com to be mixed between DisplayedURL and RealURL:

    X:.+\.amazon\.(at|ca|co\.uk|co\.jp|de|fr)([/?].*)?:.+\.amazon\.com([/?].*)?:17-

Explanation of this signature:

- `X:` this is a regular expression
- `:17-` load signature only for engines with functionality level >= 17 (recommended for type X)

The regular expression is the following (`X:` and `:17-` stripped, and a `/` appended):

```
.+\.amazon\.(at|ca|co\.uk|co\.jp|de|fr)([/?].*)?:.+\.amazon\.com([/?].*)?/
```

Explanation of this regular expression (note that it is a single regular expression, and not 2 regular expressions split at the `:`).
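
The single-expression behaviour can be checked with a minimal sketch in Python. It assumes only the rules stated in the WDB format section: implicit start/end anchoring, a `/` appended to both the regex and the input, and RealURL and DisplayedURL joined by a `:`. The function name `wdb_x_matches` and the sample URLs are hypothetical, and Python's `re` module stands in for ClamAV's own regex engine, so this illustrates the matching rule rather than the engine's actual code.

```python
import re

# Body of the example X: signature, with "X:" and ":17-" stripped.
SIGNATURE_BODY = (
    r".+\.amazon\.(at|ca|co\.uk|co\.jp|de|fr)([/?].*)?"
    r":"
    r".+\.amazon\.com([/?].*)?"
)


def wdb_x_matches(body: str, real_url: str, displayed_url: str) -> bool:
    """Hypothetical helper: apply an X-type whitelist regex to a URL pair.

    Assumptions taken from the format description:
      * RealURL and DisplayedURL are joined by ':' into one string,
      * a trailing '/' is appended to both the regex and the input,
      * the regex is anchored at both ends (fullmatch emulates ^...$).
    """
    subject = real_url + ":" + displayed_url + "/"
    return re.fullmatch("(?:" + body + ")/", subject) is not None


# Link displayed as amazon.com that really goes to amazon.co.uk: whitelisted.
print(wdb_x_matches(SIGNATURE_BODY, "www.amazon.co.uk", "www.amazon.com"))  # True

# Embedded-URL trick: the real host is not an amazon domain.
print(wdb_x_matches(SIGNATURE_BODY,
                    "evilurl.example.com/amazon.co.uk/",
                    "www.amazon.com"))  # False

# The pair is order-sensitive: swapping real and displayed does not match.
print(wdb_x_matches(SIGNATURE_BODY, "www.amazon.com", "www.amazon.de"))  # False
```

Because the two URLs are matched as one string, the order of the pair matters, and the `([/?].*)?` groups are what keep an embedded `amazon.co.uk` path component from matching. The individual pieces of the expression are:
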
- `.+` any subdomain of
- `\.amazon\.` domain we are whitelisting (RealURL part)
- `(at|ca|co\.uk|co\.jp|de|fr)` country-domains: at, ca, co.uk, co.jp, de, fr
- `([/?].*)?` recommended way to end the real URL part of the whitelist, this protects against embedded URLs (evilurl.example.com/amazon.co.uk/)
- `:` RealURL and DisplayedURL are concatenated via a `:`, so match a literal `:` here
- `.+` any subdomain of
- `\.amazon\.com` whitelisted DisplayedURL
- `([/?].*)?` recommended way to end the displayed URL part, to protect against embedded URLs
- `/` automatically added to further protect against embedded URLs

When you whitelist an entry, make sure that both domains are owned by the same entity. This whitelist entry allows links that claim to point to amazon.com (DisplayedURL) but really go to a country-specific Amazon domain (RealURL).

## Example for how the URL extractor works

Consider the following HTML file:

```html
1.displayedurl.example.com 2 di

splayedurl.example.com 3.nested.example.com 4.displayedurl.example.com

sometext 5.form.nested.link-displayedurl.example.com 6.displ ayedurl.example.com