NAME

http-analyze - a fast log analyzer for web servers

SYNOPSIS

http-analyze [-{hdmBVX}] [-3aefgknqvxyM] [-b bufsize] [-c cfgfile]
  [-i newcfg] [-l libdir] [-o outdir] [-p prvdir] [-s subopt,...]
  [-t num,...] [-u time] [-w hits] [-F format] [-L lang] [-C chrset]
  [-I date] [-E date] [-G suffix,...]  [-H idxfile,...] [-O vname,...]
  [-P prolog] [-R docroot] [-S srvname] [-T TLDfile] [-U srvurl]
  [-W 3Dwin] [-Z showdom] [logfile[...]]

DESCRIPTION

http-analyze analyzes the logfile of a web server and creates a detailed summary of the server's access load in graphical, tabular, and three-dimensional form. The analyzer does this by

  • reading all logfiles specified on the command line,
  • saving all unique (different) URLs, hostnames, referrer URLs and user agents,
  • accounting for hits (successful requests), files sent, files cached, data sent, etc.,
  • and finally creating a statistics report for the period detected in the logfile(s).

The resulting statistics report is a comprehensive view of the server's logfile. The server writes a logfile entry for every response it sends to a request from a browser or a forwarding system such as a proxy server. To understand the meaning of the terms in the report, you need a little knowledge about the type of data your web server records in its logfile.

    LOGFILE FORMATS

    NCSA Common Logfile Format (CLF)

    The basic logfile format supported by almost all servers is the NCSA Common Logfile Format. It contains the following information for each request (hit):

    dns-name - auth-user [date] "clf-request" clf-status ct-length
    

    where the fields have the following meaning:

    dns-name
    The IP number of the system accessing the web server. If there is an entry in the Domain Name System (DNS) for this IP number and the web server is configured to do DNS lookups, the corresponding hostname is logged instead.
    -
    Unused.
    auth-user
    The username provided by the client if authentication was required.
    [date]
    The date of the access in format [DD/MMM/YYYY:HH:MM:SS +-ZZZZ].
    clf-request
    The request in format "method URI proto", where method is one of GET, HEAD, POST, PUT, BROWSE, OPTIONS, DELETE or TRACE; URI is the Uniform Resource Identifier, and proto is the HTTP version number.
    clf-status
    The (numerical) response code from the server.
    ct-length
    Either the size of the document or the amount of data actually sent over the wire.

    The following is an example of an entry in NCSA Common Logfile Format:

    car.4rent.de - - [01/Aug/2000:00:00:02 +0100] "GET /doc.html HTTP/1.1" 200 393
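
As an illustration of the field layout above, the following Python sketch splits one CLF entry into its fields. It is not part of http-analyze; the names CLF_PATTERN and parse_clf are hypothetical, and treating a ct-length of `-' as zero is an assumption about servers that log a hyphen for an empty size.

```python
import re
from datetime import datetime

# One CLF entry: dns-name - auth-user [date] "clf-request" clf-status ct-length
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>.*?)" '
    r'(?P<status>\d{3}) (?P<length>\d+|-)'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None if the line does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # %b expects English month abbreviations (C/English locale).
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # Assumption: a hyphen in ct-length means no data was sent.
    entry["length"] = 0 if entry["length"] == "-" else int(entry["length"])
    return entry

example = 'car.4rent.de - - [01/Aug/2000:00:00:02 +0100] "GET /doc.html HTTP/1.1" 200 393'
entry = parse_clf(example)
print(entry["host"], entry["request"], entry["status"], entry["length"])
```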
    

    W3C Extended Logfile Format (ELF)

    The W3C Extended Logfile Format (ELF) is basically NCSA CLF plus user-agent and referrer URL information. http-analyze supports two variants of this extended format: DLF and ELF.

    The DLF format adds the referrer URL and the user-agent in this order with or without surrounding double quotes:

    CLF "referrer_URL" "user_agent"
    CLF referrer_URL user_agent
    

    This is an example of an entry in DLF format (wrapped on two lines for readability):

    car.4rent.de - - [01/Aug/2000:00:00:02 +0100] "GET /doc.html HTTP/1.1" 200 393
    "http://inet-tv.net/hot.html" "Mozilla/4.05 (X11; I; IRIX64 6.4 IP30)"
    

    The ELF format also adds the referrer URL and the user-agent, but in the opposite order and without the double quotes:

    CLF user_agent referrer_URL
    

    This is an example of an entry in ELF format (wrapped on two lines for readability):

    car.4rent.de - - [01/Aug/2000:00:00:02 +0100] "GET /doc.html HTTP/1.1" 200 393
    Mozilla/4.05 (X11; I; IRIX64 6.4 IP30) http://inet-tv.net/index.html
    

    The ELF variant is the preferred method to pass referrer URL and user-agent information. When this format is used, http-analyze searches backwards for the protocol specification of the referrer URL (to be precise, it looks for the colon in http:) and then for the preceding blank. This ensures that broken referrer URLs which contain blanks or double quotes are handled correctly.
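
The backward search described above can be sketched in Python as follows. This is a simplified illustration, not the analyzer's actual code: the function name split_elf_tail is hypothetical, the sketch looks for `://' rather than only the colon, and the fallback for entries without a URL (for example a referrer logged as `-') is an assumption.

```python
def split_elf_tail(tail):
    """Split the text following the CLF fields into (user_agent, referrer_url)."""
    tail = tail.strip()
    # Search backwards for the protocol specification of the referrer URL.
    scheme = tail.rfind("://")
    if scheme == -1:
        # Assumption: no URL present (e.g. "-"), so split at the last blank.
        agent, _, referrer = tail.rpartition(" ")
        return agent, referrer
    # Then search backwards for the blank preceding the protocol name.
    blank = tail.rfind(" ", 0, scheme)
    if blank == -1:
        return "", tail
    return tail[:blank], tail[blank + 1:]

agent, referrer = split_elf_tail(
    "Mozilla/4.05 (X11; I; IRIX64 6.4 IP30) http://inet-tv.net/index.html"
)
```

Because the split point is found from the end of the line, blanks inside the user-agent and blanks after the protocol in a broken referrer URL do not confuse the split. A referrer that itself embeds a second URL would defeat this simplified version.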

    To select either logfile format, edit the configuration file of your web server and define the fields to be logged. See the web server's documentation for information on how to customize logging.

    Automatic detection of the logfile format

    http-analyze tries to detect the correct logfile format automatically by analyzing the first few entries of a logfile (this works only if your server records a hyphen (`-') for empty referrer URL or user-agent fields). If http-analyze detects referrer URL and user-agent information, it assumes the ELF variant of the W3C Extended Logfile Format. To process the DLF variant, specify the logfile format explicitly using the option -F.
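
The following Python sketch shows one way such a detection could work. The decision rule is an assumption made for illustration (any text after the CLF fields counts as referrer URL and user-agent information), and the names CLF_PREFIX, guess_format and sample_size are hypothetical; the heuristics actually used by http-analyze are not documented here.

```python
import re
from itertools import islice

# Matches the CLF part of an entry up to and including ct-length.
CLF_PREFIX = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] ".*?" \d{3} (?:\d+|-)')

def guess_format(lines, sample_size=5):
    """Guess "CLF" or "ELF" from the first few entries of a logfile."""
    for line in islice(lines, sample_size):
        match = CLF_PREFIX.match(line)
        # Assumption: anything after ct-length is referrer/user-agent data.
        if match and line[match.end():].strip():
            return "ELF"
    return "CLF"
```

The sketch also shows why the hyphen matters: if a server logged nothing at all for empty referrer URL and user-agent fields, the first entries of an extended logfile could look exactly like plain CLF.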

    Logfile data used by http-analyze

    The statistics report shows a summary of the information which has been recorded into the logfile by the web server. For each logfile entry http-analyze processes the origin (sitename) and date of the request, the request method, the URL of the requested object, the server's response on behalf of the request, the size of the requested object and optionally the user-agent and the referrer URL if sent by the client.

    Note that http-analyze does not recognize visitors, email addresses of users visiting your server, the path a user took through your web site, the last page visited by a user before leaving your site, or anything else not recorded in the server's logfile. Although hostnames are recorded for each request, they do not necessarily correspond to the system actually used by a visitor; the request could be forwarded through a dialup service, for example. Furthermore, depending on the configuration of the visitor's browser, no request may get logged by your server at all while someone is surfing through cached copies of parts of your site.

    BASIC OPERATION

    By default, http-analyze creates a full statistics report for a whole month, which contains complete details for the period determined by the timestamps of the first and last logfile entry processed. It is therefore extremely important to always feed all logfiles for a whole month into http-analyze, no matter how frequently you rotate (save) the logfiles.

    The recommended way of providing an up-to-date statistics report for a web server is to have a script run http-analyze automatically on a regular basis, say twice per day, and have it process the current logfile of the web server from the beginning of the current month until today. On the first day of a new month, the logfile should be saved elsewhere and the web server should be restarted to create a new logfile for the new month. Then run http-analyze on the old (saved) logfile to create a final statistics report for the previous month. A history file is used to produce a summary for the last 12 months on the main page of the statistics report without having to analyze logfiles for those older periods again.

    If you rotate the logfiles more often so that you can compress them (for example, once per day), you must uncompress and concatenate all separate logfiles for the whole month into one chronologically ordered data stream, which can then be processed by http-analyze.
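
A minimal Python sketch of this step is shown below. It assumes the rotated logfiles are gzip-compressed and that each file is already in chronological order internally; the function names are hypothetical. The files are ordered by the timestamp of their first entry and written to one uncompressed file, which can then be passed to http-analyze as its logfile argument.

```python
import gzip
import re
import shutil
from datetime import datetime

DATE_FIELD = re.compile(r"\[([^\]]+)\]")

def first_timestamp(path):
    """Return the timestamp of the first entry in a gzip-compressed logfile."""
    with gzip.open(path, "rt", encoding="latin-1") as log:
        for line in log:
            match = DATE_FIELD.search(line)
            if match:
                return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
    raise ValueError(f"no log entry found in {path}")

def concatenate_logs(paths, target):
    """Uncompress the given logfiles into one file, oldest first."""
    with open(target, "wb") as out:
        for path in sorted(paths, key=first_timestamp):
            with gzip.open(path, "rb") as log:
                shutil.copyfileobj(log, out)
```

Sorting by the first timestamp rather than by filename avoids depending on a particular naming scheme for the rotated files.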

    Full statistics report

    For technical reasons, a full statistics report is not created before the second day of a new month, although the totals for the first day of the new month are updated on the summary main page of the report. A full statistics report contains a detailed summary including the following items (see the section Interpretation of the results for an explanation of the terms):

  • the number of hits, files sent/cached, pageviews, sessions and the amount of data sent
  • the total amount of data requested, transferred, and saved by caching mechanisms
  • the total number of unique URLs, sites, sessions, browser types and referrer URLs
  • the total number of all response codes other than Code 200 (OK)
  • the total number of requests which required authentication
  • the average load per week, day, hour, minute and second
  • the Top 7 days, 24 hours, 5 minutes and 5 seconds
  • the Top 30 most commonly accessed URLs (hits, files, pageviews, sessions, data sent)
  • the 10 least frequently accessed URLs (hits, files, pageviews, sessions, data sent)
  • the Top 30 client domains, browser types, and referrer hosts
  • an overview and a detailed list of all files, sitenames, browser types and referrer URLs
  • a list of all Code 404 (Not Found) responses

    Short statistics report

    Since analyzing the complete logfile for a whole month increases processing time on heavily accessed web servers, you can instruct http-analyze to create a short statistics report for the current day only. In this mode, http-analyze updates only the daily totals for the current month in the Hits by Day section of the report and saves the results in a history file. If the analyzer is then run a second time to update the short statistics report, it skips all logfile entries from the beginning of the month until it detects any entries for the current day, which are then processed to produce an up-to-date Hits by Day section in the statistics report.
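
The skipping behaviour described above can be illustrated with a short Python sketch. It only mimics the idea of ignoring entries until the first one for the current day; the function names are hypothetical, and the history file that http-analyze maintains is not modelled.

```python
import re
from itertools import dropwhile

# Captures the DD/MMM/YYYY part of the [date] field.
DATE_FIELD = re.compile(r"\[(\d{2}/\w{3}/\d{4}):")

def entries_from_day(lines, day):
    """Skip entries until the first one dated `day`, e.g. "15/Aug/2000"."""
    def before_day(line):
        match = DATE_FIELD.search(line)
        return match is None or match.group(1) != day
    return dropwhile(before_day, lines)

def count_entries(lines, day):
    """Count the entries from the first entry of `day` to the end of the log."""
    return sum(1 for _ in entries_from_day(lines, day))
```

Since the logfile is chronologically ordered, everything after the first matching entry belongs to the current day, so only that tail needs to be counted.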

    In short statistics mode, http-analyze needs only a fraction of the processing time required for a full statistics report, but it updates only a very small part of the report, so this mode should be considered an additional feature rather than a replacement for the full statistics mode. The recommended way to use this feature is to have http-analyze generate a full statistics report once per day or week, while generating an up-to-date short statistics report as often as once per hour or day.