The WebDruid - A web server log file analysis tool

Copyright 2003-2004 by Fabien Chevalier (fabien@juliana-multimedia.com)
Copyright 1997-2000 by Bradford L. Barrett (brad@mrunix.net)

Distributed under the GNU GPL. See the files "COPYING" and "Copyright"
supplied with the distribution for additional info.

Last updated: 28 Feb 2004 by Fabien Chevalier (fabien@juliana-multimedia.com)

What is The WebDruid?
---------------------

The WebDruid is a web site log file analysis program which produces usage
statistics in HTML format for viewing with a browser. The results are
presented in both columnar and graphical format, which facilitates
interpretation. Yearly, monthly, daily and hourly usage statistics are
presented, along with the ability to display usage by site, URL, referrer,
user agent (browser), search engine and search string, entry/exit page,
username and country (some information is only available if supported and
present in the log files being processed). Processed data may also be
exported into most database and spreadsheet programs that support tab
delimited data formats.

The WebDruid supports CLF (common log format) log files, as well as the
Combined log formats as defined by NCSA and others, and variations of these
which it attempts to handle intelligently. In addition, the W3C log format
is also supported, which allows The WebDruid to run against Microsoft IIS
logs.

Gzip compressed logs may be used as input directly. Any log filename that
ends with a '.gz' extension is assumed to be in gzip format and is
uncompressed on the fly as it is read.

In addition, The WebDruid also supports DNS lookup capabilities if enabled
at compile time. See the file DNS.README for additional information.

The WebDruid, when used with the graphviz tools, can produce path graphs
and users flow graphs. A path graph lets you see the preferred paths your
users take when browsing your web site.
A users flow graph draws lines between URLs, each line weighted by the
number of hits. For more information see 'Using path & users flow graphs'.

This documentation applies to The WebDruid version 0.5.0.

What is The WebDruid's history?
-------------------------------

Well, it's the story of a chance encounter... read on if you want to know
more, or jump directly to the next part if that bothers you ;-)

o June 2003

We're at the beginning of June 2003. I'm studying in an 'engineering
school' (a 100% French concept - I don't think it has any equivalent
anywhere else in the world). Here in France you are an engineer because of
your diploma, not because of your work, or your membership in any kind of
'engineering order'. But I digress! At this time I'm a second year student
(international equivalent: roughly one year before graduation). I have to
take a trainee course somewhere; it's my duty to find a company to work in
for the next two months. But we students don't interest companies - AT ALL!
It's damned hard to find something to do... and above all I want something
interesting!

o Mid-June 2003

A miracle happens. A web agency is looking for a student to enhance its
free software log analysis tool. And yes, I enjoy working with free
software... you work with people with extraordinary technical skills, and
have the opportunity to learn new concepts all the time, as there is no
limit to the understanding of the software you can reach. Only your brain
(or your stomach :-)) can prevent you from understanding 'a bit more' how
it works. I discovered free software & Linux somewhere around the year
2000. Since 2001 I have depended solely on free software for my day to day
computer use. It has been a great occasion to discover the various sides of
UNIX system programming, as well as typical UNIX services. But I digress
again! So I answer the offer, go meet the people, and... yes! I'm the right
man.
They take me on to try to enhance 'The Webalizer' (The WebDruid's big
brother - see http://www.webalizer.org).

o June 2003 - August 2003

So I'm working on the Webalizer, trying to improve it, and asking people
what they would like to see in it. I also look around a bit at what other
people have done (to be accurate, I even did that before accepting to work
on the Webalizer - I just like to see what something looks like before
engaging with it). From this work came two features of what would become
The WebDruid:

- path graphs & users flow
- reworked search engine support

At the same time, it seems that a lot of people use the Webalizer not as a
web server analysis tool, but for web site analysis. The enhancements I've
made don't fit well with the Webalizer author's view, but people seem
interested in the new features. I take the decision to 'fork' from the
Webalizer, and to build something to analyse the web site, not the web
server. I will learn later that this is called 'Web Mining'.

o September 2003 - January 2004

I'm back at school, but also working for a company rebuilding a computer
network from scratch. It's time for me to learn a bit more about Samba 3,
LDAP, PAM and everything required to get a mixed GNU/Linux // Win2K server
// Win XP Pro network up and running. Grrr... I don't have time to work on
this new project at all (except 3 days around November 11th, when I write
the first pieces of the XML backend).

o February 2004

I've spent a few days working on The WebDruid: implemented multiple log
file support, prepared the first tarball, cleaned things up a bit, fixed a
last minute bug. And here is the first public release...

Running the WebDruid
--------------------

The WebDruid was designed to be run from a Unix command line prompt or as a
cron job. There are several command line options which will modify the
results it produces, and configuration files can be used as well.
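Since the program is cron-friendly, a common setup is a nightly incremental
run. A sketch of such a crontab entry (the paths, schedule and log file
name here are illustrative, not defaults):

```
# Run The WebDruid every night at 00:30 in incremental mode (-p),
# quietly (-q), using the system-wide configuration file.
30 0 * * * /usr/local/bin/webdruid -p -q -c /etc/webdruid/webdruid.conf /var/log/apache/access.log
```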
The format of the command line is:

    webdruid [options ...] [log-file1] [log-file2] [log-file3] ...

Where 'options' can be one or more of the supported command line switches
described below. 'log-fileX' is the name of a log file to process (see
below for more detailed information). If a dash ("-") is specified for the
log-file name, STDIN will be used. More than one file can be specified; in
that case, The WebDruid sorts the records so it can process them 'in
order'.

Once executed, the general flow of the program follows:

o A default configuration file is scanned for. A file named 'webdruid.conf'
  is searched for in the current directory, and if found, its configuration
  data is parsed. If the file is not present in the current directory, the
  file '/etc/webdruid/webdruid.conf' is searched for and, if found, is used
  instead.

o Any command line arguments given to the program are parsed. This may
  include the specification of a configuration file, which is processed at
  the time it is encountered.

o If any log file was specified, it is opened and made ready for
  processing. If no log file was given, or the filename '-' is specified on
  the command line, STDIN is used for input.

o If an output directory was specified, the program does a 'chdir' to that
  directory in preparation for generating output. If no output directory
  was given, the current directory is used.

o If a non-zero number of DNS child processes was specified, they will be
  started, and the specified log files will be processed a first time,
  either creating or updating the specified DNS cache file.

o The global search engines file (usually
  /usr/local/share/webdruid/engines.list) is loaded and parsed.

o The local search engines file /etc/webdruid/engines.list is loaded and
  parsed.

o If no hostname was given, the program attempts to get the hostname using
  a uname system call. If that fails, 'localhost' is used.

o A history file is searched for.
  This file keeps the previous month totals used on the main index.html
  page. The file is named 'webalizer.hist', and is kept in the specified
  output directory.

o If incremental processing was specified, a data file is searched for and,
  if found, loaded. It contains the 'internal state' data of the program at
  the end of a previous run. The file is named 'incremental.1', and is kept
  in the specified output directory.

o Main processing begins on the log file. If the log spans multiple months,
  a separate HTML document is created for each month.

o After main processing, the main 'index.html' page is created, which has
  totals by month and links to each month's HTML document.

o A new history file is saved to disk, which includes totals generated by
  The WebDruid during the current run.

o If incremental processing was specified, a data file is written that
  contains the 'internal state' data at the end of this run.

Incremental Processing
----------------------

The WebDruid has incremental run capability. Simply put, this allows
processing large log files by breaking them up into smaller pieces, and
processing those pieces instead. What this means in real terms is that you
can rotate your log files as often as you want, and still be able to
produce monthly usage statistics without the loss of any detail. This is
accomplished by saving and restoring all relevant internal data to a disk
file between runs. Doing so allows the program to 'start where it left
off', so to speak, and preserves detail from one run to the next.

Some special precautions need to be taken when using the incremental run
capability of The WebDruid. Configuration options should not be changed
between runs, as that could cause corruption of the internally stored data.
For example, changing the MangleAgents level will cause different
representations of user agents to be stored, producing invalid results in
the user agents section of the report.
If you need to change configuration options, do it at the end of the month,
after normal processing of the previous month and before processing the
current month. You may also want to delete the 'incremental.1' file as well
(or whatever name was specified using the "IncrementalName" configuration
option).

The WebDruid also attempts to prevent data duplication by keeping track of
the timestamp of the last record processed. This timestamp is compared to
each record being processed, and any records that were logged before that
timestamp are ignored. This, in theory, should allow you to re-process logs
that have already been processed, or process logs that contain a mix of
processed and not yet processed records, without duplicating statistics.
The only time this may break is if you have duplicate timestamps in two
separate log files... any records in the second log file that have the same
timestamp as the last record in the previously processed log file will be
discarded as if they had already been processed. There are ways to prevent
this, however; for example, stopping the web server before rotating logs
will avoid this situation. This setup also requires that you always process
logs in chronological order, otherwise data loss will occur as a result of
the timestamp comparison.

Using path & users flow graphs
------------------------------

The graphs require the use of an external utility, 'dot', part of the
graphviz package. The source of graphviz can be downloaded here:
http://www.research.att.com/sw/tools/graphviz/

Since it has an unusual license, some people might prefer not to have it on
their systems. That's why it is NOT REQUIRED to run The WebDruid. If dot is
not found where it should be (see the DotLocation config file parameter),
path & users flow graphs will be automatically disabled.

Output Produced
---------------

The WebDruid produces several HTML reports and graphics for each month
processed.
In addition, a summary page is generated for the current and previous
months (up to 12), a history file is created, and, if incremental mode is
used, the current month's processed data is saved. The exact location and
names of these files can be changed using configuration files and command
line options. The files produced (default names) are:

index.html              - Main summary page (extension may be changed)
usage.png               - Yearly graph displayed on the main index page
usage_YYYYMM.html       - Monthly summary page (extension may be changed)
usage_YYYYMM.png        - Monthly usage graph for specified month/year
daily_usage_YYYYMM.png  - Daily usage graph for specified month/year
hourly_usage_YYYYMM.png - Hourly usage graph for specified month/year
path_graph_YYYYMM.png   - Path graph for specified month/year
users_flow_YYYYMM.png   - Users flow for specified month/year
site_YYYYMM.html        - All sites listing (if enabled)
url_YYYYMM.html         - All URLs listing (if enabled)
ref_YYYYMM.html         - All referrers listing (if enabled)
agent_YYYYMM.html       - All user agents listing (if enabled)
search_YYYYMM.html      - All search strings listing (if enabled)
webalizer.hist          - Previous month history
incremental.1           - Incremental data
site_YYYYMM.tab         - Tab delimited sites file
url_YYYYMM.tab          - Tab delimited URLs file
ref_YYYYMM.tab          - Tab delimited referrers file
agent_YYYYMM.tab        - Tab delimited user agents file
user_YYYYMM.tab         - Tab delimited usernames file
search_YYYYMM.tab       - Tab delimited search strings file

The yearly (index) report shows statistics for a 12 month period, and links
to each month. The monthly report has detailed statistics for that month,
with additional links to any URLs and referrers found. The various totals
shown are explained below.

Hits

Any request made to the server which is logged is considered a 'hit'. The
requests can be for anything... HTML pages, graphic images, audio files,
CGI scripts, etc. Each valid line in the server log is counted as a hit.
This number represents the total number of requests that were made to the
server during the specified report period.

Files

Some requests made to the server require that the server then send
something back to the requesting client, such as an HTML page or graphic
image. When this happens, it is considered a 'file' and the files total is
incremented. The relationship between 'hits' and 'files' can be thought of
as 'incoming requests' and 'outgoing responses'.

Pages

Pages are, well, pages! Generally, any HTML document, or anything that
generates an HTML document, is considered a page. This does not include the
other content that goes into a document, such as graphic images, audio
clips, etc. This number represents the number of 'pages' requested only,
and does not include the other 'stuff' that is in the page. What actually
constitutes a 'page' can vary from server to server. The default action is
to treat anything with the extension '.htm', '.html' or '.cgi' as a page.
A lot of sites will probably want to define other extensions, such as
'.phtml', '.php3' and '.pl', as pages as well. Some people consider this
number the number of 'pure' hits... I'm not sure I totally agree with that
viewpoint. Some other programs (and people :) refer to this as 'Pageviews'.

Sites

Each request made to the server comes from a unique 'site', which can be
referenced by a name or, ultimately, an IP address. The 'sites' number
shows how many unique IP addresses made requests to the server during the
reporting time period. This DOES NOT mean the number of unique individual
users (real people) that visited, which is impossible to determine using
just logs and the HTTP protocol (however, this number might be about as
close as you will get).

Visits

Whenever a request is made to the server from a given IP address (site),
the amount of time since that address's previous request is calculated (if
any).
If the time difference is greater than a pre-configured 'visit timeout'
value (or the address has never made a request before), it is considered a
'new visit', and this total is incremented (both for the site and for the
IP address). The default timeout value is 30 minutes (and can be changed),
so if a user visits your site at 1:00 in the afternoon and then returns at
3:00, two visits are registered.

Note: in the 'Top Sites' table, the visits total should be discounted on
'Grouped' records, and thought of instead as the "minimum number of visits"
that came from that grouping.

Note: Visits only occur on PageType requests, that is, for any request
whose URL is one of the 'page' types defined with the PageType option. Due
to the limitations of the HTTP protocol, log rotations and other factors,
this number should not be taken as absolutely accurate; rather, it should
be considered a pretty close "guess".

KBytes

The KBytes (kilobytes) value shows the amount of data, in KB, that was sent
out by the server during the specified reporting period. This value is
generated directly from the log file, so it is up to the web server to
produce accurate numbers in the logs (some web servers do stupid things
when it comes to reporting the number of bytes). In general, this should be
a fairly accurate representation of the amount of outgoing traffic the
server had, regardless of the web server's reporting quirks.

Note: A kilobyte is 1024 bytes, not 1000 :)

Top Entry and Exit Pages

The Top Entry and Exit tables give a rough estimate of which URLs are used
to enter your site, and which pages are the last ones viewed. Because of
limitations in the HTTP protocol, log rotations, etc., these numbers should
be considered a good "rough guess" of the actual numbers; however, they
give a good indication of the overall trend in where users enter, and exit,
your site.
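The visit-counting rule described above can be sketched in a few lines.
This is a simplified illustration, not The WebDruid's actual code (real
processing also applies the PageType rule noted above):

```python
# Simplified sketch of visit counting: a request starts a new visit when
# the site has not been seen before, or when the gap since its previous
# request exceeds the visit timeout (default 1800 seconds = 30 minutes).

VISIT_TIMEOUT = 1800  # seconds; configurable via VisitTimeout / -m

def count_visits(requests, timeout=VISIT_TIMEOUT):
    """requests: iterable of (ip_address, unix_timestamp) pairs,
    assumed to be in chronological order. Returns the total visits."""
    last_seen = {}   # ip -> timestamp of its previous request
    visits = 0
    for ip, ts in requests:
        prev = last_seen.get(ip)
        if prev is None or ts - prev > timeout:
            visits += 1
        last_seen[ip] = ts
    return visits

# A user arriving at 13:00 and returning at 15:00 counts as two visits:
log = [("1.2.3.4", 13 * 3600), ("1.2.3.4", 15 * 3600)]
print(count_visits(log))  # -> 2
```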
Command Line Options
--------------------

The WebDruid supports many different configuration options that alter the
way the program behaves and generates output. Most of these can be
specified on the command line, while some can only be specified in a
configuration file. The command line options are listed below, with
references to the corresponding configuration file keywords.

--------------------------------------------------------------------------

General Options
---------------

-h   Display all available command line options and exit the program.

-v   Display the program version and exit the program.

-d   Display additional 'debugging' information for errors and warnings
     produced during processing. This normally would not be used except to
     determine why you are getting all those errors and want to see the
     actual data. Normally The WebDruid will just tell you it found an
     error, not show the actual data. This option will display the data as
     well.
     Config file keyword: Debug

-F   Specify that the log being used is an ftp log. Normally, The WebDruid
     expects to find a valid CLF or Combined format web server log file.
     This option allows you to process wu-ftpd xferlogs as well.
     Config file keyword: LogType

-f   Fold out-of-sequence log records back into the analysis, by treating
     them as if they had the same date/time as the last good record.
     Normally, out-of-sequence log records are ignored. If you run Apache,
     don't worry about this.
     Config file keyword: FoldSeqErr

-i   Ignore the history file. USE WITH CAUTION. This causes The WebDruid to
     ignore any existing history file produced by previous runs and
     generate its output from scratch. The effect will be as if The
     WebDruid were being run for the first time, and any previous
     statistics will be lost from the main index.html (yearly) web page
     (although the HTML documents, if any, will not be deleted).
     Config file keyword: IgnoreHist

-p   Preserve state (incremental processing). This allows the processing of
     partial logs in increments.
     At the end of the program, all relevant internal data is saved, so
     that it may be restored the next time the program is run. This allows
     sites that must rotate their logs more than once a month to still be
     able to use The WebDruid, without having to gather and feed an entire
     month's logs to the program at the end of the month. See the section
     on "Incremental Processing" above for additional information. The
     default is to not perform incremental processing; use this command
     line option to enable the feature.
     Config file keyword: Incremental

-q   Quiet mode. Normally, The WebDruid produces various messages while it
     runs, letting you know what it's doing. This option suppresses those
     messages. It should be noted that this WILL NOT suppress errors and
     warnings, which are output to STDERR.
     Config file keyword: Quiet

-Q   ReallyQuiet mode. This suppresses _all_ messages generated by The
     WebDruid, including warnings and errors. Useful when The WebDruid is
     run as a cron job.
     Config file keyword: ReallyQuiet

-T   Display timing information. The WebDruid keeps track of the time it
     begins and ends processing, and normally displays the total processing
     time at the end of each run. If quiet mode (-q or 'Quiet yes' in the
     configuration file) is specified, this information is not displayed.
     This option forces the display of timing totals if quiet mode has been
     specified; otherwise it is redundant and has no effect.
     Config file keyword: TimeMe

-c file
     This option specifies a configuration file to use. Configuration files
     allow greater control over how The WebDruid behaves, and there are
     several ways to use them. The WebDruid searches for a default
     configuration file named "webdruid.conf" in the current directory,
     and if not found, searches the /etc/webdruid directory for a file of
     the same name. In addition, you may specify a configuration file to
     use with this command line option.
-n name
     This option specifies the hostname for the generated reports. The
     hostname is used in the title of all reports, and is also prepended to
     URLs in the reports. This allows The WebDruid to be run on log files
     for 'virtual' web servers, or web servers that are different from the
     machine the reports are located on, while still allowing clicking on
     the URLs to go to the proper location. If a hostname is not specified,
     either on the command line or in a configuration file, The WebDruid
     attempts to determine the hostname using a 'uname' system call. If
     this fails, "localhost" will be used as the hostname.
     Config file keyword: HostName

-o dir
     This option specifies the output directory for the reports. If not
     specified here or in a configuration file, the current directory will
     be used for output.
     Config file keyword: OutputDir

-x name
     This option allows the generated pages to have an extension other than
     '.html', which is the default. Do not include the leading period ('.')
     when you specify the extension.
     Config file keyword: HTMLExtension

-P name
     Specify the file extensions for 'pages'. Pages (sometimes called
     'pageviews') are normally HTML documents and CGI scripts that display
     the whole page, not just parts of it. Some systems will need to define
     a few more, such as 'phtml', 'php3' or 'pl', in order to have them
     counted as well. The default is 'htm*' and 'cgi' for web logs, and
     'txt' for ftp logs.
     Config file keyword: PageType

-t name
     This option specifies the title string for all reports. This string is
     used, in conjunction with the hostname (if not blank), to produce the
     actual title. If not specified, the default of "Usage Statistics for"
     will be used.
     Config file keyword: ReportTitle

-Y   Suppress the country graph. Normally, The WebDruid produces country
     statistics in both graph and columnar forms. This option suppresses
     the generation of the country graph.
     Config file keyword: CountryGraph

-G   Suppress the hourly graph.
     Normally, The WebDruid produces hourly statistics in both graph and
     columnar forms. This option suppresses the generation of the hourly
     graph only.
     Config file keyword: HourlyGraph

-H   Suppress hourly statistics. Normally, The WebDruid produces hourly
     statistics in both graph and columnar forms. This option suppresses
     the generation of the hourly statistics table only.
     Config file keyword: HourlyStats

-L   Disable graph legends. The color coded legends displayed on the
     in-line graphs can be disabled with this option. The default is to
     display the legends.
     Config file keyword: GraphLegend

-l num
     Graph lines. Specify the number of background reference lines
     displayed on the in-line graphics produced. The default is 2 lines,
     but the value can range anywhere from zero ('0') for no lines, up to
     20 lines (looks funny!).
     Config file keyword: GraphLines

-P name
     Page type. This is the extension of files you consider to be pages for
     the Pages calculation (sometimes called 'pageviews'). The default is
     'htm*' and 'cgi' (plus whatever HTMLExtension you specified, if it is
     different). Don't use a period!
     Config file keyword: PageType

-m num
     Specify a 'visit timeout'. Visits are calculated by looking at the
     time difference between the current and last request made by a
     specific host. If the difference is greater than the visit timeout
     value, the request is considered a new visit. This value is specified
     in seconds. The default is 30 minutes (1800).
     Config file keyword: VisitTimeout

-M num
     Mangle user agent names. Normally, The WebDruid keeps track of the
     user agent field verbatim. Unfortunately, there are a ton of different
     names that user agents go by, and the field also reports other items
     such as the machine type and OS used. For example, Netscape 4.03
     running on Windows 95 will report a different string than Netscape
     4.03 running on Windows NT, so even though they are the same browser
     type, they will be considered two totally different browsers by The
     WebDruid.
     For that matter, Netscape 4.0 running on Windows NT will report
     different names if one is run on an Alpha and the other on an Intel
     processor! Internet Exploder is even worse, as it reports itself as if
     it were Netscape, and you have to search the given string a little
     deeper to discover that it is really MSIE! In order to consolidate
     generic browser types, this option causes The WebDruid to 'mangle' the
     user agent field. There are 6 levels that can be specified, each
     producing a different level of detail. Level 5 displays only the
     browser name (MSIE or Mozilla) and the major version number. Level 4
     also displays the minor version number (single decimal place). Level 3
     displays the minor version number to two decimal places. Level 2 adds
     any sub-level designation (such as Mozilla/3.01Gold or MSIE 3.0b).
     Level 1 also attempts to add the system type. The default, level 0,
     disables name mangling and leaves the user agent field unmodified,
     producing the greatest amount of detail.
     Config file keyword: MangleAgents

-g num
     This option allows you to specify the level of domain name grouping to
     be performed. The numeric value represents the level of grouping, and
     can be thought of as the 'number of dots' to be displayed. The default
     value of 0 disables any domain name grouping.
     Config file keyword: GroupDomains

-D name
     This allows the specification of a DNS cache file name. This filename
     MUST be specified if you have DNS lookups enabled (using the -N
     command line switch or the DNSChildren configuration keyword). The
     filename is relative to the default output directory unless an
     absolute path is specified (ie: one starting with a leading '/'). This
     option is only available if DNS support was enabled at compile time;
     otherwise an 'Invalid Keyword' error will be generated. See the
     DNS.README file for additional information regarding DNS lookups.
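The level-5 mangling described under -M above can be illustrated with a
rough sketch. This is only an illustration of the idea, not The WebDruid's
actual parsing code:

```python
import re

# Rough sketch of level-5 agent mangling: keep only the browser name and
# its major version number. As noted above, MSIE hides inside a
# Mozilla-compatible string, so it must be checked for first.

def mangle_level5(agent):
    msie = re.search(r"MSIE (\d+)", agent)
    if msie:                      # really MSIE, despite the Mozilla prefix
        return "MSIE " + msie.group(1)
    moz = re.match(r"Mozilla/(\d+)", agent)
    if moz:
        return "Mozilla/" + moz.group(1)
    return agent                  # anything else is left untouched

print(mangle_level5("Mozilla/4.03 [en] (WinNT; I)"))                   # -> Mozilla/4
print(mangle_level5("Mozilla/4.0 (compatible; MSIE 3.0b; Windows 95)"))  # -> MSIE 3
```

With this rule, the Windows 95 and Windows NT variants of Netscape 4.03
both collapse to the single entry "Mozilla/4".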
-N num
     Number of DNS child processes to use for reverse DNS lookups. If
     specified, a DNS cache file name MUST be specified as well. If you do
     not wish a DNS cache file to be generated, specify a value of zero
     ('0') to disable it. This does not prevent using an existing cache
     file, only the generation of one at run time. See the DNS.README file
     for additional information regarding DNS lookups.

Hide Options
------------

The following options take a string argument to use as a comparison for
matching. Except for the IndexAlias option, the string argument can be
plain text, or plain text that either starts or ends with the wildcard
character '*'. For example: given the string "yourmama/was/here", the
arguments "was", "*here" and "your*" will all produce a match.

-a name
     This option allows hiding of user agents (browsers) from the "Top
     User Agents" table in the report. This option really isn't too useful,
     as there are a zillion different names that current browsers go by,
     depending on where they were obtained; however, you might have some
     particular user agents that hit your site a lot that you would like to
     exclude from the list. You must have a web server that includes user
     agents in its log files for this option to be of any use. In addition,
     it is also useless if you disable the user agent table in the report
     (see the -A command line option or the "TopAgents" configuration file
     keyword). You can specify as many of these as you want on the command
     line. The wildcard character '*' can be used either in front of or at
     the end of the string (ie: Mozilla/4.0* would match anything that
     starts with the string "Mozilla/4.0").
     Config file keyword: HideAgent

-r name
     This option allows hiding of referrers from the "Top Referrers" table
     in the report. Referrers are URLs, either on your own local site or on
     a remote site, that referred the user to a URL on your web server.
     This option is normally used to hide your own server from the table,
     as your own pages are usually the top referrers to your own pages
     (well, you get the idea). You must have a web server that includes
     referrer information in the log files for this option to be of any
     use. In addition, it is also useless if you disable the referrers
     table in the report (see the -R command line option or the
     "TopReferrers" configuration file keyword). You can specify as many of
     these as you like on the command line.
     Config file keyword: HideReferrer

-s name
     This option allows hiding of sites from the "Top Sites" table in the
     report. Normally, you will only want to hide your own domain name from
     the report, as it is usually one of the top sites to visit your web
     server. This option is of no use if you disable the top sites table in
     the report (see the -S command line option or the "TopSites"
     configuration file keyword).
     Config file keyword: HideSite

-X   This causes all individual sites to be hidden, which results in only
     grouped sites being displayed on the report.
     Config file keyword: HideAllSites

-u name
     This option allows hiding of URLs from the "Top URLs" table in the
     report. Normally, this option is used to hide images, audio files and
     other objects your web server dishes out that would otherwise clutter
     up the table. This option is of no use if you disable the top URLs
     table in the report (see the -U command line option or the "TopURLs"
     configuration file keyword).
     Config file keyword: HideURL

-I name
     This option allows you to specify additional index.html aliases. The
     WebDruid usually strips the string 'index.' from URLs before
     processing, which has the effect of turning a URL such as
     /somedir/index.html into just /somedir/, which is really the same URL
     and should be treated as such. This option allows you to specify
     _additional_ strings that are to be treated the same way. Use with
     care; improper use could cause unexpected results.
     For example, if you specify the alias string 'home', a URL such as
     /somedir/homepages/brad/home.html would be converted into just
     /somedir/, which probably isn't what was intended. This option is
     useful if your web server uses a default index page other than the
     standard 'index.html' or 'index.htm', such as 'home.html' or
     'homepage.html'. The string specified is searched for _anywhere_ in
     the URL, so "home.htm" would turn both "/somedir/home.htm" and
     "/somedir/home.html" into just "/somedir/". Go easy on this one; each
     string specified is scanned for in EVERY log record, so if you specify
     a bunch of these, you will notice degraded performance. Wildcards are
     not allowed on this one.
     Config file keyword: IndexAlias

Table Size Options
------------------

-e num
     This option specifies the number of entries to display in the "Top
     Entry Pages" table. To disable the table, use a value of zero (0).
     Config file keyword: TopEntry

-E num
     This option specifies the number of entries to display in the "Top
     Exit Pages" table. To disable the table, use a value of zero (0).
     Config file keyword: TopExit

-A num
     This option specifies the number of entries to display in the "Top
     User Agents" table. To disable the table, use a value of zero (0).
     Config file keyword: TopAgents

-C num
     This option specifies the number of entries to display in the "Top
     Countries" table. To disable the table, use a value of zero (0).
     Config file keyword: TopCountries

-R num
     This option specifies the number of entries to display in the "Top
     Referrers" table. To disable the table, use a value of zero (0).
     Config file keyword: TopReferrers

-S num
     This option specifies the number of entries to display in the "Top
     Sites" table. To disable the table, use a value of zero (0).
     Config file keyword: TopSites

-U num
     This option specifies the number of entries to display in the "Top
     URLs" table. To disable the table, use a value of zero (0).
Config file keyword: TopURLs -------------------------------------------------------------------------- CONFIGURATION FILES ------------------- The WebDruid allows configuration files to be used in order to simplify life for all. There are several ways that configuration files are accessed by the WebDruid. When The WebDruid first executes, it looks for a default configuration file named "webdruid.conf" in the current directory, and if not found there, will look for "/etc/webdruid/webdruid.conf". In addition, configuration files may be specified on the command line with the '-c' option. There are lots of different ways you can combine the use of configuration files and command line options to produce various results. The WebDruid always looks for and reads configuration options from a default configuration file before doing anything else. Because of this, you can override options found in the default file by use of additional configuration files specified on the command line or command line options themselves. If you specify a configuration file on the command line, you can override options in it by additional command line options which follow. For example, most users will want to create the default file /etc/webdruid/webdruid.conf and place options in it to specify the hostname, log file, table options, etc... At the end of the month when a different log file is to be used (the end of month log), you can run The WebDruid as usual, but put the different filename on the end of the command line, which will override the log file specified in the configuration file. It should be noted that you cannot override some configuration file options by the use of command line arguments. For example, if you specify "Quiet yes" in a configuration file, you cannot override this with a command line argument, as the command line option only _enables_ the feature (-q option).
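For example, a minimal default configuration file might look like the following sketch (the hostname and paths here are hypothetical, chosen only for illustration):

```
# /etc/webdruid/webdruid.conf -- read automatically before anything else
HostName   www.example.com
LogFile    /var/log/apache/access.log
OutputDir  /var/www/usage
Quiet      yes
```

Running 'webdruid /var/log/apache/access.log.1' would then use all of the options above but process the rotated end-of-month log instead, since a log file named on the command line overrides the LogFile keyword.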
The configuration files are standard ASCII text files that may be created or edited using any standard editor. Blank lines and lines that begin with a pound sign ('#') are ignored. Any other lines are considered to be configuration lines, and have the form "Keyword Value", where the 'Keyword' is one of the currently available configuration keywords defined below, and 'Value' is the value to assign to that particular option. Any text found after the keyword up to the end of the line is considered the keyword's value, so do not include anything after the actual value on the line that is not part of the value being assigned. The file "sample.conf" provided with the distribution contains lots of useful documentation and examples as well. It should be noted that you do not have to use any configuration files at all, in which case, default values will be used (which should be sufficient for most sites). -------------------------------------------------------------------------- General Configuration Keywords ------------------------------ LogFile This defines the log file to use. It should be a fully qualified name (ie: contain the path), but relative names will work as well. If not specified, the logfile defaults to STDIN. LogType This specifies the log file type being used. Normally, The WebDruid processes web logs in either CLF or Combined format. You may also process wu-ftpd xferlog formatted logs, or squid proxy logs by setting the appropriate type using this keyword. Values may be either 'clf', 'ftp' or 'squid'. Ensure that you specify the proper file type, otherwise you will be presented with a long stream of 'invalid record' messages ;) Command line argument: -F OutputDir This defines the output directory to use for the reports. If it is not specified, the current directory is used. Command line argument: -o FontFace This is the True Type font face name. This font is used for drawing text inside charts.
This is NOT the file name; it is the name that you would use in your favourite word processor when selecting this font. True Type fonts are searched for in the path defined by the GDFONTPATH environment variable if it is set. If GDFONTPATH is not set, The WebDruid will look for fonts in the following directories: - /usr/share/fonts/truetype/freefont - /usr/share/fonts/truetype The default fontname is "FreeSerif". It is part of the FreeFonts package, available at http://savannah.nongnu.org/projects/freefont/ ReportTitle This specifies the title to use for the generated reports. It is used in conjunction with the hostname (unless blank) to produce the final report titles. If not defined, the default of "Usage Statistics for" is used. Command line argument: -t HostName This defines the hostname. The hostname is used in the report title as well as being prepended to URL's in the "Top URL's" table. This allows The WebDruid to be run on "virtual" web servers, or servers that do not reside on the local machine, and allows clicking on the URL to go to the right place. If not specified, The WebDruid attempts to get the hostname via a 'uname' system call, and if that fails, will default to "localhost". Command line argument: -n UseHTTPS Causes the links in the 'Top URL's' table to use 'https://' instead of the default 'http://' prefix. Not much use if you run a mix of secure/insecure servers on your machine. Only useful if you run the analysis on a secure server's logs, and want the links in the table to work properly. Quiet This allows you to enable or disable informational messages while The WebDruid is running. The values for this keyword can be either 'yes' or 'no'. Using "Quiet yes" will suppress these messages, while "Quiet no" will enable them. The default is 'no' if not specified, which will allow The WebDruid to display informational messages. It should be noted that this option has no effect on Warning or Error messages that may be generated, as they go to STDERR.
Command line argument: -q TimeMe This allows you to display timing information regardless of any "quiet mode" specified. Useful only if you did in fact tell the WebDruid to be quiet either by using the -q command line option or the "Quiet" keyword, otherwise timing stats are normally displayed anyway. Values may be either 'yes' or 'no', with the default being 'no'. Command line argument: -T GMTTime This keyword allows timestamps to be displayed in GMT (UTC) time instead of local time. Normally, The WebDruid displays timestamps in the time-zone of the local machine (ie: PST or EDT). Values may be either 'yes' or 'no'. Default is 'no'. Debug This tells The WebDruid to display additional information when it encounters Warnings or Errors. Normally, The WebDruid will just tell you it found a bad record or field. This option will enable the display of the actual data that produced the Warning or Error as well. Useful only if you start getting lots of Warnings or Errors and want to determine the cause. Values may be either 'yes' or 'no', with the default being 'no'. Command line argument: -d IgnoreHist This suppresses the reading of a history file. USE WITH EXTREME CAUTION as the history file is how The WebDruid keeps track of previous months. The effect of this option is as if The WebDruid was being run for the very first time, and any previous data is discarded. Values may be either 'yes' or 'no', with the default being 'no'. Command line argument: -i FoldSeqErr Allows log records that are out of sequence to be folded back into the analysis, by treating them as if they had the same date/time as the last good record. Normally, out of sequence log records are simply ignored. If you run Apache, don't worry about this. VisitTimeout Set the 'visit timeout' value. Visits are determined by looking at the time difference between the current and last request made by a specific site.
If the difference in time is greater than the visit timeout value, the request is considered a new visit. The value is in number of seconds, and defaults to 30 minutes (1800). Command line argument: -m PageType Allows you to define the 'page' type extension. Normally, people consider HTML and CGI scripts as 'pages'. This option allows you to specify what extensions you consider a page. Default is 'htm*' and 'cgi' for web logs, and 'txt' for ftp logs. Command line argument: -P GraphLegend Enable/disable the display of color coded legends on the produced graphs. Default is 'yes', to display them. Command line argument: -L GraphLines Specify the number of background reference lines to display on produced graphs. The default is 2. To disable the use of background lines, use zero ('0'). Command line argument: -l PathGraph This keyword is used to either enable or disable the creation and display of the path graph. A path is...really a path, I mean a list of pages the user went to. Only paths that were taken more than once are displayed. The drawback of this criterion is that if 200 users follow exactly the same path in your site, except for the last pages, they won't appear here because their path was different at the end. If anyone has a good algorithm to solve this issue, he should drop me a note. Another issue arises from client-side and proxy caching. You can never know when a user goes back in your site. Values may be either 'yes' or 'no', with the default being 'yes'. Command line argument: none PathGraphMaxPaths This keyword is used to limit the number of paths displayed by the path graph. Value is a positive number. Default value: 10 Command line argument: none UsersFlow This keyword is used to either enable or disable the creation and display of the users flow graph. This graph tries to show you the flow of users through your web site. Each node represents a page, and each edge is weighted by the number of hits from a page to another page.
Values may be either 'yes' or 'no', with the default being 'yes'. Command line argument: none UsersFlowMaxEdges This keyword is used to limit the number of edges displayed by the users flow graph. Value is a positive number. Default value: 30 Command line argument: none DotLocation This keyword takes as argument the FULL PATH to the 'dot' utility. It needs to be set only if you use the path graph or users flow features and the default value (/usr/bin/dot) doesn't point to the right place. Command line argument: none CountryGraph This keyword is used to either enable or disable the creation and display of the Country Usage graph. Values may be either 'yes' or 'no', with the default being 'yes'. Command line argument: -Y DailyGraph This keyword is used to either enable or disable the creation and display of the Daily Usage graph. Values may be either 'yes' or 'no', with the default being 'yes'. DailyStats This keyword is used to either enable or disable the creation and display of the Daily Usage statistics table. Values may be either 'yes' or 'no', with the default being 'yes'. HourlyGraph This keyword is used to either enable or disable the creation and display of the Hourly Usage graph. Values may be either 'yes' or 'no', with the default being 'yes'. Command line argument: -G HourlyStats This keyword is used to either enable or disable the creation and display of the Hourly Usage statistics table. Values may be either 'yes' or 'no', with the default being 'yes'. Command line argument: -H IndexAlias This allows additional 'index.html' aliases to be defined. Normally, The WebDruid scans for and strips the string "index." from URL's before processing them. This turns a URL such as /somedir/index.html into just /somedir/ which is really the same URL. This keyword allows _additional_ names to be treated in the same fashion for sites that use different default names, such as "home.html".
The string is scanned for anywhere in the URL, so care should be used if and when you define additional aliases. For example, if you were to use an alias such as 'home', the URL /somedir/homepages/brad/home.html would be turned into just /somedir/ which probably isn't the intended result. Instead, you should have specified 'home.htm' which would correctly turn the URL into /somedir/homepages/brad/ as intended. It should also be noted that specified aliases are scanned for in EVERY log record... A bunch of aliases will noticeably degrade performance as each record has to be scanned for every alias defined. You don't have to specify 'index.' as it is always the default. Command line argument: -I MangleAgents The MangleAgents keyword specifies the level of user agent name mangling, if any. There are 6 levels that may be specified, each producing a different level of displayed detail. Level 5 displays only the browser name (MSIE or Mozilla) and the major version number. Level 4 adds the minor version (single decimal place). Level 3 adds the minor version to two decimal places. Level 2 will also add any sub-level designation (such as Mozilla/3.01Gold or MSIE 3.0b). Level 1 will also attempt to add the system type. The default level 0 will leave the user agent field unmodified and produces the greatest amount of detail. Command line argument: -M SearchEngine THIS KEYWORD IS NO LONGER SUPPORTED Incremental This allows incremental processing to be enabled or disabled. Incremental processing allows processing partial logs without the loss of detail data from previous runs in the same month. This feature saves the 'internal state' of the program so that it may be restored in following runs. See the section above titled "Incremental Processing" for additional information. The value may be 'yes' or 'no', with the default being 'no'. Command line argument: -p DNSCache Specifies the DNS cache filename.
This name is relative to the default output directory unless an absolute name is given (ie: starts with '/'). See the DNS.README file for additional information. DNSChildren The number of DNS child processes to run in order to create/update the DNS cache file. If specified, the DNS cache filename must also be specified (see above). Use a value of zero ('0') to disable. See the DNS.README file for additional information. Top Table Keywords ------------------ TopAgents This allows you to specify how many "Top" user agents are displayed in the "Top User Agents" table. The default is 15. If you do not want to display user agent statistics, specify a value of zero (0). The display of user agents will only work if your web server includes this information in its log file (ie: a combined log format file). Command line argument: -A AllAgents Will cause a separate HTML page to be generated for all normally visible User Agents. A link will be added to the bottom of the "Top User Agents" table if enabled. Value can be either 'yes' or 'no', with 'no' being the default. TopCountries This allows you to specify how many "Top" countries are displayed in the "Top Countries" table. The default is 30. If you want to disable the countries table, specify a value of zero (0). Command line argument: -C TopReferrers This allows you to specify how many "Top" referrers are displayed in the "Top Referrers" table. The default is 30. If you want to disable the referrers table, specify a value of zero (0). The display of referrer information will only work if your web server includes this information in its log file (ie: a combined log format file). Command line argument: -R AllReferrers Will cause a separate HTML page to be generated for all normally visible Referrers. A link will be added to the "Top Referrers" table if enabled. Value can be either 'yes' or 'no', with 'no' being the default. TopSites This allows you to specify how many "Top" sites are displayed in the "Top Sites" table.
The default is 30. If you want to disable the sites table, specify a value of zero (0). Command line argument: -S TopKSites Identical to TopSites, except for the 'by KByte' table. Default is 10. No command line switch for this one. AllSites Will cause a separate HTML page to be generated for all normally visible Sites. A link will be added to the bottom of the "Top Sites" table if enabled. Value can be either 'yes' or 'no', with 'no' being the default. TopURLs This allows you to specify how many "Top" URL's are displayed in the "Top URL's" table. The default is 30. If you want to disable the URL's table, specify a value of zero (0). Command line argument: -U TopKURLs Identical to TopURLs, except for the 'by KByte' table. Default is 10. No command line switch for this one. AllURLs Will cause a separate HTML page to be generated for all normally visible URLs. A link will be added to the bottom of the "Top URLs" table if enabled. Value can be either 'yes' or 'no', with 'no' being the default. TopEntry Allows you to specify how many "Top Entry Pages" are displayed in the table. The default is 10. If you want to disable the table, specify a value of zero (0). Command line argument: -e TopExit Allows you to specify how many "Top Exit Pages" are displayed in the table. The default is 10. If you want to disable the table, specify a value of zero (0). Command line argument: -E TopSearch Allows you to specify how many "Top Search Strings" are displayed in the table. The default is 20. If you want to disable the table, specify a value of zero (0). Only works if using combined log format (ie: contains referrer information). TopUsers This allows you to specify how many "Top" usernames are displayed in the "Top Usernames" table. Usernames are only available if you use http authentication on your web server, or when processing wu-ftpd xferlogs. The default value is 20. If you want to disable the Username table, specify a value of zero (0).
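As an illustrative configuration fragment (the values below are arbitrary, not recommendations), the table size keywords described above could be combined like this:

```
# Table sizes -- a value of zero (0) disables the table entirely
TopSites      30
TopURLs       30
TopReferrers  30
TopAgents     15
TopCountries  0
TopSearch     20
TopUsers      20
```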
AllUsers Will cause a separate HTML page to be generated for all normally visible usernames. A link will be added to the bottom of the "Top Usernames" table if enabled. Value can be either 'yes' or 'no', with 'no' being the default. AllSearchStr Will cause a separate HTML page to be generated for all normally visible Search Strings. A link will be added to the bottom of the "Top Search Strings" table if enabled. Value can be either 'yes' or 'no', with 'no' being the default. Hide Object Keywords -------------------- These keywords allow you to hide user agents, referrers, sites, URL's and usernames from the various "Top" tables. The values for these keywords are the same as those used in their command line counterparts. You can specify as many of these as you want without limit. Refer to the section above on "Command Line Options" for a description of the string formatting used as the value. Values cannot exceed 80 characters in length. HideAgent This allows specified user agents to be hidden from the "Top User Agents" table. Not very useful, since there are a zillion different names browsers go by today, but could be useful if there is a particular user agent (ie: robots, spiders, real-audio, etc..) that hits your site frequently enough to make it into the top user agent listing. This keyword is useless if 1) your log file does not provide user agent information or 2) you disable the user agent table. Command line argument: -a HideReferrer This allows you to hide specified referrers from the "Top Referrers" table. Normally, you would only specify your own web server to be hidden, as it is usually the top generator of references to your own pages. Of course, this keyword is useless if 1) your log file does not include referrer information or 2) you disable the top referrers table. Command line argument: -r HideSite This allows you to hide specified sites from the "Top Sites" table.
Normally, you would only specify your own web server or other local machines to be hidden, as they are usually the highest hitters of your web site, especially if you have their browsers' home page pointing to it. Command line argument: -s HideAllSites This allows hiding all individual sites from the display, which can be useful when a lot of groupings are being used (since grouped records cannot be hidden). It is particularly useful in conjunction with the GroupDomains feature, but it can be useful in other situations as well. Value can be either 'yes' or 'no', with 'no' the default. Command line argument: -X HideURL This allows you to hide URL's from the "Top URL's" table. Normally, this is used to hide items such as graphic files, audio files or other 'non-html' files that are transferred to the visiting user. Command line argument: -u HideUser This allows you to hide Usernames from the "Top Usernames" table. Usernames are only available if you use http-based authentication on your web server. Group Object Keywords --------------------- The Group* keywords allow object grouping based on Site, URL, Referrer, User Agent and Usernames. Combined with the Hide* keywords, you can customize exactly what will be displayed in the 'Top' tables. For example, to only display totals for a particular directory, use a GroupURL and HideURL with the same value (ie: '/help/*'). Group processing is only done after the individual record has been fully processed, so name mangling and site total updates have already been performed. Because of this, groups are not counted in the main site total (as that would cause duplication). Groups can be displayed in bold and shaded as well. Grouped records are not, by default, hidden from the report. This allows you to display a grouped total, while still being able to see the individual records, even if they are part of the group. If you want to hide the detail records, follow the Group* directive with a Hide* one using the same value.
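The Group*/Hide* combination described above can be sketched as a configuration fragment (the directory name '/help/*' follows the example in the text; the label is optional):

```
# Show one grouped total for the /help/ tree...
GroupURL  /help/*  Help System
# ...and hide the individual pages that make it up
HideURL   /help/*
```

Without the HideURL line, the grouped total and the individual /help/ pages would both appear in the "Top URL's" table.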
There are no command line switches for these keywords. The Group* keywords also accept an optional label to be displayed instead of the actual value used. This label should be separated from the value by at least one whitespace character, such as a space or tab character. See the sample.conf file for examples. GroupReferrer Allows grouping Referrers. Can be handy for some of the major search engines that have multiple host names a referral could come from. GroupURL This keyword allows grouping URL's. Useful for grouping complete directory trees. GroupSite This keyword allows grouping Sites. Mostly used for grouping top level domains and unresolved IP addresses for local dial-ups, etc... GroupAgent Groups User Agents. A handy example of how you could use this one is to use "Mozilla" and "MSIE" as the values for GroupAgent and HideAgent keywords. Make sure you put the "MSIE" one first. GroupDomains Allows automatic grouping of domains. The numeric value represents the level of grouping, and can be thought of as 'the number of dots' to display. A 1 will display second level domains only (xxx.xxx), a 2 will display third level domains (xxx.xxx.xxx) etc... The default value of 0 disables any domain grouping. Command line argument: -g GroupUser Allows grouping of usernames. Combined with a group name, this can be handy for displaying statistics on a particular group of users without displaying their real usernames. GroupShading Allows shading of table rows for groups. Value can be 'yes' or 'no', with the default being 'yes'. GroupHighlight Allows bolding of table rows for groups. Value can be 'yes' or 'no', with the default being 'yes'. Ignore/Include Object Keywords ------------------------------ These keywords allow you to completely ignore log records when generating statistics, or to force their inclusion regardless of ignore criteria. Records can be ignored or included based on site, URL, user agent, referrer and username.
Be aware that by choosing to ignore records, the accuracy of the generated statistics becomes skewed, making it impossible to produce an accurate representation of load on the web server. These keywords behave identically to the Hide* keywords above, where the value can have a leading or trailing wildcard '*'. These keywords, like the Hide* ones, have an absolute limit of 80 characters for their values. These keywords do not have any command line switch counterparts, so they may only be specified in a configuration file. It should also be pointed out that using the Ignore/Include combination to selectively exclude an entire site while including a particular 'chunk' is _extremely_ inefficient, and should be avoided. Try grep'ing the records into a separate file and processing it instead. IgnoreSite This allows specified sites to be completely ignored from the generated statistics. IgnoreURL This allows specified URL's to be completely ignored from the generated statistics. One use for this keyword would be to ignore all hits to a 'temporary' directory where development work is being done, but is not accessible to the outside world. IgnoreReferrer This allows records to be ignored based on the referrer field. IgnoreAgent This allows specified User Agent records to be completely ignored from the statistics. This may be useful if you really don't want to see all those hits from MSIE :) IgnoreUser This allows specified username records to be completely ignored from the statistics. Usernames can only be used if you use http authentication on your server. IncludeSite Force the record to be processed based on hostname. This takes precedence over the Ignore* keywords. IncludeURL Force the record to be processed based on URL. This takes precedence over the Ignore* keywords. IncludeReferrer Force the record to be processed based on referrer. This takes precedence over the Ignore* keywords. IncludeAgent Force the record to be processed based on user agent.
This takes precedence over the Ignore* keywords. IncludeUser Force the record to be processed based on username. Usernames are only available if you use http-based authentication on your server. This takes precedence over the Ignore* keywords. Dump Object Keywords -------------------- The Dump* keywords allow text files to be generated that can then be used for import into most database, spreadsheet and other external programs. The file is a standard tab delimited text file, meaning that each column is separated by a tab (0x09) character. A header record may be included if required, using the 'DumpHeader' keyword. Since these files contain all records that have been processed, including normally hidden records, an alternate location for the files can be specified using the 'DumpPath' keyword, otherwise they will be located in the default output directory. DumpPath Specifies an alternate location for the dump files. The default output location will be used otherwise. The value is the path portion to use, and normally should be an absolute path (ie: has a leading '/' character); however, relative path names can be used as well, and will be relative to the output directory location. DumpExtension Allows the dump filename extensions to be specified. The default extension is "tab", but it may be changed with this option. DumpHeader Allows a header record to be written as the first record of the file. Value can be either 'yes' or 'no', with the default being 'no'. DumpSites Dump tab delimited sites file. Value can be either 'yes' or 'no', with the default being 'no'. The filename used is site_YYYYMM.tab (YYYY=year, MM=month). DumpURLs Dump tab delimited URL file. Value can be either 'yes' or 'no', with the default being 'no'. The filename used is url_YYYYMM.tab (YYYY=year, MM=month). DumpReferrers Dump tab delimited referrer file. Value can be either 'yes' or 'no', with the default being 'no'. Filename used is ref_YYYYMM.tab (YYYY=year, MM=month).
Referrer information is only available if present in the log file (ie: combined web server log). DumpAgents Dump tab delimited user agent file. Value can be either 'yes' or 'no', with the default being 'no'. Filename used is agent_YYYYMM.tab (YYYY=year, MM=month). User agent information is only available if present in the log file (ie: combined web server log). DumpUsers Dump tab delimited username file. Value can be either 'yes' or 'no', with the default being 'no'. Filename used is user_YYYYMM.tab (YYYY=year, MM=month). The username data is only available if processing a wu-ftpd xferlog or http authentication is used on the web server and that information is present in the log. DumpSearchStr Dump tab delimited search string file. Value can be either 'yes' or 'no', with the default being 'no'. Filename used is search_YYYYMM.tab (YYYY=year, MM=month). The search string data is only available if referrer information is present in the log being processed and recognized search engines were found and processed. HTML Generation Keywords ------------------------ These keywords allow you to customize the HTML code that The WebDruid produces, such as adding a corporate logo or links to other web pages. You can specify as many of these keywords as you like, and they will be used in the order that they are found in the file. Values cannot exceed 80 characters in length, so you may have to break long lines up into two or more lines. There are no command line counterparts to these keywords. HTMLExtension Allows generated pages to use something other than the default 'html' extension for the filenames. Do not include the leading period ('.') when you specify the extension. Command line argument: -x HTMLPre Allows code to be inserted at the very beginning of the HTML files. Defaults to the standard HTML 3.2 DOCTYPE record. Be careful not to include any HTML here, as it is inserted _before_ the <HTML> tag in the file.
Use it for server-side scripting capabilities, such as php3, to insert scripting files and other directives. HTMLHead Allows you to insert HTML code between the <HEAD></HEAD> block. There is no default. Useful for adding scripts to the HTML page, such as Javascript or php3, or even just for adding a few META tags to the document. HTMLBody This keyword defines HTML code to be placed at the beginning of the <BODY> section of the report, just before the title and "summary period/generated on" lines. If used, the first HTMLBody line MUST include a <BODY> tag. Put whatever else you want in subsequent lines, but keep in mind the placement of this code in relation to the title and other aspects of the web page. Some typical uses are to change the page colors and possibly add a corporate logo (graphic) in the top right. If not specified, a default <BODY> tag is used that defines page color, text color and link colors (see "sample.conf" file for example). HTMLPost This keyword defines HTML code that is placed after the title and "summary period/generated on" lines, just before the initial horizontal rule
tag (<HR>). Normally this keyword isn't needed, but is provided in case you included a large graphic or some other weird formatting tag in the HTMLBody section that needs to be cleaned up or terminated before the main report section. HTMLTail This keyword defines HTML code that is placed at the bottom right side of the report. It is inserted in a section between table data tags, and is top and right aligned within the table. Normally this keyword is used to provide a link back to your home page or insert a small graphic at the bottom right of the page. HTMLEnd This allows insertion of closing code, at the very end of the page. The default is to put the closing </BODY> and </HTML> tags. If specified, you _must_ specify these tags yourself. -------------------------------------------------------------------------- Notes on Web Log Files ---------------------- The WebDruid supports CLF log formats, which should work for just about everyone. If you want User Agent or Referrer information, you need to make sure your web server supplies this information in its log file, and in a format that the WebDruid can understand. While The WebDruid will try to handle many of the subtle variations in log formats, some will not work at all. Most web servers output CLF format logs by default. For Apache, in order to produce the proper log format, add the following to the httpd.conf file: LogFormat "%h %l %u %t \"%r\" %s %b \"%{Referer}i\" \"%{User-agent}i\"" This instructs the Apache web server to produce a 'combined' log that includes the referrer and user agent information on the end of each record, enclosed in quotes (This is the standard recommended by both Apache and NCSA). Netscape and other web servers have similar capabilities to alter their log formats. (note: the above works for Apache servers up to V1.2. V1.3 and higher now have additional ways to specify log formats... refer to the included documentation). Notes on Referrers ------------------ Referrers are weird critters...
They take many shapes and forms, which makes them much harder to analyze than a typical URL, which at least has some standardization. What is contained in the referrer field of your log files varies depending on many factors, such as what site did the referral, what type of system it comes from and how the actual referral was generated. Why is this? Well, because a user can get to your site in many ways... They may have your site bookmarked in their browser, they may simply type your site's URL into their browser, they could have clicked on a link on some remote web page, or they may have found your site from one of the many search engines and site indexes found on the web. The WebDruid attempts to deal with all this variation in an intelligent way by doing certain things to the referrer string which make it easier to analyze. Of course, if your web server doesn't provide referrer information, you probably don't really care and are asking yourself why you are reading this section...

Most referrers will take the form "http://somesite.com/somepage.html", which is what you will get if the user clicks on a link somewhere on the web in order to get to your site. Some will be a variation of this, and look something like "file:/some/such/sillyname", which is a reference from an HTML document on the user's local machine. Several variations of this can be used, depending on what type of system the user has, whether he/she is on a local network, the type of network, etc... To complicate things even more, dynamic HTML documents and HTML documents that are generated by CGI scripts or external programs produce lots of extra information which is tacked on to the end of the referrer string in an almost infinite number of ways. If the user just typed your URL into their browser or clicked on a bookmark, there won't be any information in the referrer field, and it will take the form "-".

In order to handle all these variations, The WebDruid parses the referrer field in a certain way.
First, if the referrer string begins with "http", it assumes it is a normal referral and converts the "http://" and following hostname to lowercase in order to simplify hiding if desired. For example, the referrer "HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html" will become "http://www.myhost.com/This/Is/A/HTML/Document.html". Notice that only the "http://" and hostname are converted to lower case... the rest of the referrer field is left alone. This follows standard convention, as the method (HTTP) and hostname are always case insensitive, while the document name portion is case sensitive.

Referrers that come from search engines, dynamic HTML documents, CGI scripts and other external programs usually tack on additional information that was used to create the page. A common example of this can be found in referrals that come from the search engines and site indexes common on the web. Sometimes these referrer URLs can be several hundred characters long and include all the information that the user typed in to search for your site. The WebDruid deals with this type of referrer by stripping off all the query information, which starts with a question mark '?'. The referrer "http://search.yahoo.com/search?p=usa%26global%26link" will be converted to just "http://search.yahoo.com/search".

When a user comes to your site by using one of their bookmarks or by typing your URL directly into their browser, the referrer field is blank and looks like "-". Most sites will get more of these referrals than any other type. The WebDruid converts this type of referral into the string "- (Direct Request)". This is done in order to make it easier to hide via a command line or configuration file option. The character "-" is a valid character elsewhere in a referrer field, and if it were not turned into something unique, it could not be hidden without possibly hiding other referrers that shouldn't be.
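Taken together, these transformations can be sketched in shell. This is only an illustration of the behavior described above, not The WebDruid's actual code, and the function name is invented:

```shell
#!/bin/sh

# Sketch of the referrer clean-up described above (illustrative only):
# strip query information, lowercase the scheme and hostname, and turn
# a blank "-" referrer into the unique "- (Direct Request)" string.
normalize_referrer() {
    ref=$1
    ref=${ref%%\?*}                 # drop everything from the first '?'
    case $ref in
        [Hh][Tt][Tt][Pp]://*)
            rest=${ref#*://}
            host=${rest%%/*}        # hostname part only
            path=${rest#"$host"}    # document part, case preserved
            host=$(printf '%s' "$host" | tr 'A-Z' 'a-z')
            printf '%s\n' "http://$host$path"
            ;;
        -|'')
            printf '%s\n' '- (Direct Request)'
            ;;
        *)
            printf '%s\n' "$ref"    # file:, etc. left alone
            ;;
    esac
}

normalize_referrer 'HTTP://WWW.MyHost.Com/This/Is/A/HTML/Document.html'
# -> http://www.myhost.com/This/Is/A/HTML/Document.html
normalize_referrer 'http://search.yahoo.com/search?p=usa%26global%26link'
# -> http://search.yahoo.com/search
normalize_referrer '-'
# -> - (Direct Request)
```

Note that only the scheme and hostname are folded to lowercase; the document path keeps its case, matching the HTTP convention described above.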
Notes on Character Escaping
---------------------------

The HTTP protocol defines certain ways that URLs can look and behave. To some extent, referrer fields follow most of the same conventions. Character escaping is a technique by which non-printable or other non-ASCII (and even some ASCII) characters can be used in a URL. This is done by placing the hexadecimal value of the character in the URL, preceded by a percent sign '%'. Since hex values are made up of ASCII characters, any character can be escaped to ensure only printable ASCII characters are present in the URL. Some systems take this concept to the extreme and escape all sorts of stuff, even characters that don't need to be escaped. To deal with this, The WebDruid will un-escape URLs and referrers before they are processed. For example, the URL "/www.mrunix.net/%7Ebrad/resume.html" is the same URL as "/www.mrunix.net/~brad/resume.html", a very common form of URL used to access users' web pages. If the URLs were not un-escaped, they would be treated as two separate documents, even though they are really one and the same.

Search String Analysis
----------------------

Compared to The Webalizer, the search string analysis has been completely rewritten in The WebDruid. The SearchEngine keyword is not supported anymore. The search engines are now listed in a file called 'engines.list', located in /usr/local/share/webdruid (or wherever you installed The WebDruid). This file lists about 80 search engines, which should be enough for most cases. Users are encouraged to send more search engines to me (fabien@juliana-multimedia.com) if they think they would be useful for other people. The list is now internally stored in a sorted array, so it can grow without harming performance as it did before. The report layout has also been profoundly modified. The search strings are now grouped by search engine, and sorted by use count within these groups.
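As a rough command-line approximation of this kind of report, search terms can be pulled out of the referrer fields of a combined format log with standard tools. This sketch only matches engines whose query parameter is 'p' or 'q' (e.g. Yahoo, Google); The WebDruid itself matches each engine's own parameter via engines.list. GNU grep's '-o' option is assumed:

```shell
#!/bin/sh

# Count search terms found in the referrer fields of combined format
# log records read from stdin. Only the 'p=' and 'q=' query parameters
# are recognized here; this is NOT how The WebDruid itself does it.
top_search_terms() {
    grep -o '[?&][pq]=[^&" ]*' |   # pull out p=/q= query arguments
        sed 's/^[?&][pq]=//' |     # keep just the search term
        sort | uniq -c | sort -rn  # count and rank them
}
```

Typical use would be `top_search_terms < access.log`, which prints "count term" pairs, most frequent first. The terms come out still character-escaped; see the notes on character escaping above.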
This grouping lets you know which keywords search engines have 'caught' when they scanned your site. It will also give a good indication of what users were searching for when they stumbled across your site. When it is available (currently only with Google), The WebDruid detects the character set used by the user who typed the search string. It then translates the search string to UTF-8, and outputs it in UTF-8. For engines other than Google, the default is to take the charset of the system running The WebDruid as the input charset.

Notes on Visits/Entry/Exit Figures
----------------------------------

The majority of data analyzed and reported on by The WebDruid is as accurate and correct as possible based on the input log file. However, due to the limitations of the HTTP protocol, the use of firewalls, proxy servers, multi-user systems, the rotation of your log files, and a myriad of other conditions, some of these numbers cannot be calculated with absolute accuracy. In particular, Visits, Entry Pages and Exit Pages are subject to random errors due to the above and other conditions. The reason for this is twofold: 1) log files are finite in size and time interval, and 2) there is no way to tell multiple individual users apart given only an IP address. Because log files are finite, they have a beginning and an ending, which can be represented as a fixed time period. There is no way of knowing what happened previous to this time period, nor is it possible to predict future events based on it. Also, because it is impossible to tell individual users apart, multiple users that have the same IP address all appear to be a single user, and are treated as such. This is most common where corporate users sit behind a proxy/firewall to the outside world, and all requests appear to come from the same location (the address of the proxy/firewall itself).
Dynamic IP assignment (used with dial-up internet accounts) also presents a problem, since the same user will appear to come from multiple places. For example, suppose two users visit your server from XYZ company, which has its network connected to the Internet by a proxy server 'fw.xyz.com'. All requests from the network look as though they originated from 'fw.xyz.com', even though they were really initiated from two separate users on different PCs. The WebDruid would see these requests as coming from the same location, and would record only 1 visit, when in reality there were two. Because entry and exit pages are calculated in conjunction with visits, this situation would also record only 1 entry and 1 exit page, when in reality there should be 2.

As another example, say a single user at XYZ company is surfing around your website. They arrive at 11:52pm on the last day of the month, and continue surfing until 12:30am, which is now a new day (in a new month). Since a common practice is to rotate (save then clear) the server logs at the end of the month, you now have the user's visit logged in two different files (the current and previous months'). Because of this (and the fact that The WebDruid clears its history between months), the first page the user requests after midnight will be counted as an entry page. This is unavoidable, since it is the first request seen from that particular IP address in the new month.

For the most part, the numbers shown for visits, entry and exit pages are pretty good 'guesses', even though they may not be 100% accurate. They do provide a good indication of overall trends, and shouldn't be so far off from the real numbers as to matter much. You should probably consider them the 'minimum' amounts possible, since the actual (real) values should always be equal or greater in all cases.
Exporting WebDruid Data
-----------------------

The WebDruid now has the ability to dump all object tables to tab delimited ASCII text files, which can then be imported into most popular database and spreadsheet programs. The files are not normally produced, as on some sites they could become quite large; they are only enabled by the use of the Dump* configuration keywords. The filename extensions default to '.tab', but may be changed using the 'DumpExtension' keyword. Since this data contains all items, even those normally hidden, it may not be desirable to have the files located in the output directory, where they may be visible to normal web users. For this reason, the 'DumpPath' configuration keyword is available, which allows the placement of these files somewhere outside the normal web server document tree. An optional 'header' record may be written to these files as well, which is useful when the data is to be imported into a spreadsheet; databases will not normally need the header. If enabled, the header is simply the column names as the first record of the file, tab separated.

Log files and The WebDruid
--------------------------

Most sites will choose to have The WebDruid run from cron at specified intervals. Care should be taken to ensure that data is not lost as a result of log file rotations. A suggested practice is to rotate your web server logs at the end of each month as close to midnight as possible, then have The WebDruid process the 'end of month' log file before running statistics on the new, current log. On our systems, a shell script called 'rotate_logs' is run at midnight at the end of each month.
This script file looks like:

------------------------- file: rotate_logs ------------------------------

#!/bin/sh

# halt the server
kill `cat /var/lib/httpd/logs/httpd.pid`

# define backup names
OLD_ACCESS_LOG=/var/lib/httpd/logs/old/access_log.`date +%y%m%d-%H%M%S`
OLD_ERROR_LOG=/var/lib/httpd/logs/old/error_log.`date +%y%m%d-%H%M%S`

# make end of month copy for the analyzer
cp /var/lib/httpd/logs/access_log /var/lib/httpd/logs/access_log.backup

# move files to the archive directory
mv /var/lib/httpd/logs/access_log $OLD_ACCESS_LOG
mv /var/lib/httpd/logs/error_log  $OLD_ERROR_LOG

# restart the web server
/usr/sbin/httpd

# compress the archived files
/bin/gzip $OLD_ACCESS_LOG
/bin/gzip $OLD_ERROR_LOG

------------------------- end of file ------------------------------------

This script first stops the web server using a 'kill' command. Apache keeps the PID of the server in the file httpd.pid, so we use it as the argument for the kill. Next, it defines some names for the backup files, which are basically the names of the files with the date and time appended to the end. It then makes a copy of the log file, with '.backup' appended, in the log directory, moves the current log files to an archive directory (/var/lib/httpd/logs/old) and restarts the server. This setup allows the web server to be down for the minimum amount of time needed, which is important for busy sites. If you don't want to stop the server, you can remove the initial 'kill' command and replace the '/usr/sbin/httpd' line with a "kill -1 `cat /var/lib/httpd/logs/httpd.pid`" command instead. On most web servers, this will cause a restart of the server and create the new log files in the process... At this point, we have made copies of the previous month's logs, the web server is going about its business as usual, and we have all the time in the world to do any additional processing we want. The last two lines of the script compress the archived logs using the GNU zip program (gzip).
Remember, we still have a copy of the log which we can now run The WebDruid on without having to do any further processing. Next, we define two crontab entries. The first runs the above 'rotate_logs' script at midnight at the end of the month. The second runs The WebDruid on the '.backup' log file created above at 5 minutes after midnight. This gives other end of month processing jobs a chance to run so we don't bog the system down too much. If you have lots of end of month stuff going on, you can change the timing to suit your needs. The crontab entries look something like:

------------------------- crontab entries --------------------------------

# Rotate web server logs and run monthly analysis
0 0 1 * * /usr/local/adm/rotate_logs
5 0 1 * * /usr/bin/webdruid -Q /var/lib/httpd/logs/access_log.backup

------------------------- end of crontab ---------------------------------

As you can see, the log rotation occurs at midnight, and the analysis is done 5 minutes after. Once you verify that The WebDruid ran successfully, the access_log.backup file can be deleted, as it isn't needed any more. If you need to re-run the analysis, you still have the compressed archive copy that the shell script created. In order for the above analysis to work properly, you should have already created an /etc/webdruid/webdruid.conf configuration file suitable for your site, or otherwise specify configuration options or a configuration file on the crontab command line above. If you want The WebDruid to run more often than once a month, you can add additional crontab entries to do this as well. Care should be taken, however, to ensure that The WebDruid is not running when the end of month processing above occurs, or unpredictable results may happen (such as an inability to rotate the logs due to a file lock). The easiest way is to run it on the half hour with a crontab entry like:

30 * * * * /usr/bin/webdruid

Known Issues
------------

o Memory Usage.
The WebDruid makes liberal use of memory for internal data structures during analysis. A lack of real physical memory will noticeably degrade performance by causing lots of swapping between memory and disk. One user who had a rather large log file noticed that The WebDruid took over 7 hours to run with only 16 Meg of memory. Once memory was increased, the time was reduced to a few minutes.

o Performance.

The Hide*, Group*, Ignore*, Include* and IndexAlias configuration options can cause a performance decrease if lots of them are used. The reason for this is that every log record must be scanned for each item in each list. For example, if you are Hiding 20 objects, Grouping 20 more, and Ignoring 5, each record is scanned, at most, 46 times (20+20+5, plus an IndexAlias scan). On really large log files, this can have a profound impact. It is recommended that you use as few of these configuration options as you can, as it will greatly improve performance.

o Internationalisation.

The WebDruid is not as good as it should be in this area. Work is in progress, though. If you are interested in helping (either by sending translation .po files or by reporting issues), please drop me a note.

Final Notes
-----------

A lot of time and effort went into making The WebDruid, and into ensuring that the results are as accurate as possible. If you find any abnormalities or inconsistent results, bugs, errors, omissions or anything else that doesn't look right, please let me know so I can investigate the problem or correct the error. This goes for the minimal documentation as well. Suggestions for future versions are also welcome and appreciated.