Whether you're planning to redesign your site or just maintain it, careful examination of log files can point out intranet problems and opportunities.
There are nuggets of pure gold in your Web server logs, if you take the time to look for them.
Getting a comprehensive picture of intranet users and their behaviors is a particularly interesting challenge for organizations with disparate audiences. Task-based testing usually involves only a handful of representative users completing a few tasks. Yet, on a typical day, thousands of people may use the intranet in hundreds of ways. By its very nature, task-based testing creates an artificial environment that may not reflect user behavior under natural working conditions: a noisy office environment, multitasking with five applications open on the desktop, or scaled-down browser windows that cut off sections of the screen, to name a few.
Log files are often overlooked as sources of information to help you better understand and interact with users. Analyzing log files does not exclude doing audience analysis when planning for a new Web area. Nor does it supplant other investigative techniques like heuristic evaluation, user surveys, task-based testing, and field testing. The method of inquiry must fit the nature of the research.
Web log file analysis has its detractors. Many say that it is the least useful type of data for understanding users. You have no way of measuring outcomes: did the users find the information they were seeking or did they simply leave? What was the purpose of their visit? Some say that log files are only useful for capacity planning, to decide if it's time to add more servers or increase bandwidth. Others will tell you stories about how someone overrated the importance of certain log file measures such as hits. The key with log file analysis is to know what to look for and how to separate the wheat from the chaff. Log file analysis works best as one part of a well-thought-out program of usability research and Web site evaluation.
IDENTIFYING USER PATTERNS
Analyzing server logs uncovers oodles of interesting user patterns. Here are just a few things that might come to light.
* Who is using your site? Who never uses your site? What departments and occupational groups make the most use of the intranet? What are they doing?
* How much time is usually spent viewing a particular page? If a large number of user cases indicate that users typically spend an inordinately long time on a page before moving to the next, you can start to speculate that the page is either very confusing or particularly worthwhile. Either way it's worth looking at.
* Exit pages, the point where someone leaves your site, offer some interesting clues. Is the spot a logical leaping-off point like a page of related links, or are red flags going up that users are switching to greener pastures? Keep in mind that the last page in your log file might not be the exit page. The user may have used the Back button to return to a page in their browser's cache.
* Can employees complete transactions on the intranet such as requesting a purchase order, submitting a travel expense claim, or ordering new office supplies? What is the completion rate once a form has been requested? Are employees entering in bogus responses in form fields to circumvent bad design?
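The completion-rate question in the last bullet can be roughed out from the log itself. The following is a minimal sketch, assuming that visitors can be told apart (for example by authenticated username) and that the form and its confirmation page have distinct URLs. The paths, usernames, and the completion_rate function are hypothetical.

```python
def completion_rate(entries, form_path, done_path):
    """Share of visitors who requested the form and also reached the confirmation page."""
    started = {visitor for visitor, path in entries if path == form_path}
    finished = {visitor for visitor, path in entries if path == done_path}
    if not started:
        return None  # nobody requested the form in this sample
    return len(finished & started) / len(started)

# Hypothetical (visitor, requested path) pairs extracted from an access log.
entries = [
    ("jsmith", "/forms/travel-claim.html"),
    ("jsmith", "/forms/travel-claim-received.html"),
    ("mlee", "/forms/travel-claim.html"),
    ("akhan", "/forms/travel-claim.html"),
    ("rpatel", "/forms/travel-claim.html"),
    ("rpatel", "/index.html"),
]

rate = completion_rate(
    entries, "/forms/travel-claim.html", "/forms/travel-claim-received.html"
)
print(f"{rate:.0%} of visitors who opened the form submitted it")
```

A low rate does not say why people gave up; it only flags the form as worth a closer look with other methods.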
GETTING DOWN TO BASICS
Assuming logging is turned on, every time the Web server receives a request for a file, the request is noted in a log file. Access log files can be configured to capture a variety of specific details about the request. Two typical log formats are the common log format and the extended or combined log format. In the accompanying table, you can see entries from a combined Web log file. The first entry shows a page requested from a public Web site, while the second and third entries record Web pages retrieved from a password-controlled area of the Web site. We'll zoom in and look at some of the specific types of data collected.
IP Address or Hostname
IP addresses may be more useful on an intranet for learning about visitors than on a general Web site if you can link IP numbers to specific groups of staff or staff roles. In analyzing Web site traffic of a library site, ideally the use by library staff could be separated from accesses by research analysts in marketing or product development groups. Being able to isolate blocks of visitors by specific characteristics produces a clearer picture of usage. In some cases, the value of the IP variable is limited. Some corporate intranet users may be connecting to the intranet Web site via an ISP that provides a temporary IP address, or the use of a corporate proxy server may prevent unique identification. In these cases, you can't easily discern one person's requests from another.
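Linking IP numbers to groups of staff might look something like the sketch below. It assumes that each department sits on its own subnet, which will not hold on every network. The subnet ranges, group names, and sample addresses are invented for illustration.

```python
import ipaddress
from collections import Counter

# Hypothetical mapping of staff groups to the subnets they connect from.
GROUPS = {
    "Library staff": ipaddress.ip_network("10.1.20.0/24"),
    "Marketing": ipaddress.ip_network("10.1.30.0/24"),
}

def group_for(host):
    """Return the staff group for a logged address, or 'Unknown'."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        return "Unknown"  # the log recorded a hostname, not an IP address
    for name, network in GROUPS.items():
        if address in network:
            return name
    return "Unknown"

# Hypothetical client addresses taken from the first field of each log entry.
hosts = ["10.1.20.14", "10.1.30.7", "10.1.20.99", "172.16.5.1", "pc42.example.com"]
usage = Counter(group_for(host) for host in hosts)
print(usage.most_common())
```

A large "Unknown" count is itself a useful signal: it suggests proxy or dial-up traffic of the kind described above.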
[Table: Excerpt of a Web server log file]
Login or Identity
This field is typically empty. It was part of the early log standards and is unreliable and seldom used.
User Authentication
For certain types of authentication, a username is recorded whenever someone attempts to log into a password-protected Web area. Both valid and invalid usernames are recorded. Once logged in, the username will be recorded for subsequent files requested by this user.
Date and Time of the Request
This records when the file was requested.
Method and Path
This records the request method used by the client and the directory path to the file on the server.
Status Code
This variable indicates whether the transmission between the client and the Web server succeeded or failed. A status code such as "404" is one that we've all learned to recognize. Generally, codes in the 200s indicate successful requests by a client, the 300s report redirects, the 400s are used for client errors, and the 500s are used for server errors. For a complete description of status codes, consult the HTTP/1.1 Status Code Definitions at the W3C [www.w3.org/Protocols/rfc2616/rfc2616-sec10.html].
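Because the first digit of the code carries the meaning, a quick tally by class can show at a glance how healthy a site is. Here is a minimal sketch; the list of codes is made up, and the status_class helper is a name introduced for illustration.

```python
from collections import Counter

# First digit of the status code mapped to its general meaning.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}

def status_class(code):
    """Return the general class for an HTTP status code."""
    return CLASSES.get(code // 100, "other")

# Hypothetical status codes pulled from a batch of log entries.
codes = [200, 200, 304, 404, 500, 200, 404]
summary = Counter(status_class(code) for code in codes)
print(summary)
```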
Content Length
This refers to the number of bytes transferred.
Referrer
The referrer variable lists the page that the user just came from.
User agent
This records what type of browser or search engine made the request.
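If you want to pull these fields out yourself rather than rely on a packaged tool, a regular expression is usually enough. The sketch below assumes the server writes the combined log format in its common default layout; the sample entry is invented for illustration and is not taken from a real log.

```python
import re
from datetime import datetime

# Pattern for the combined log format: host, identity, user, timestamp,
# request line, status, bytes, referrer, and user agent.
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_entry(line):
    """Return the fields of one combined-format entry as a dict, or None."""
    match = COMBINED.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the bytes field means no content was sent.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

# Hypothetical entry from a password-protected area of an intranet.
SAMPLE = (
    '192.168.10.25 - jsmith [10/Jun/2003:14:02:11 -0600] '
    '"GET /library/databases.html HTTP/1.0" 200 5120 '
    '"http://intranet.example.com/index.html" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)

entry = parse_entry(SAMPLE)
print(entry["user"], entry["path"], entry["status"])
```

Lines that do not match return None, so malformed entries can be counted and set aside rather than stopping the analysis.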
ANALYZING LOG FILE DATA
A wide variety of software is available to process log files. There are free tools to run on your Web server, including Wusage [www.boutell.com/wusage] and Analog [www.analog.cx]. Other tools, such as WebTrends [www.Webtrends.com] and Sawmill [www.sawmill.net], range in price from a few hundred to thousands of dollars. Most of these products produce summary tables, charts, and graphs. For a listing of log analysis tools, check the Yahoo! directory [http://dir.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Servers/Log_Analysis_Tools].
For speedier and easier interpretation of log files and reports, make sure that your site uses short descriptive filenames. Cryptic filenames from database-driven sites like id=23421 yield little information at first glance and may require a second lookup.
WHAT YOUR ACCESS LOGS CAN TELL YOU
The following ideas demonstrate some ways that log files can be used to assess a Web site. One caveat to keep in mind is that the quality of the data in your log file will vary from one intranet to another and depends on a range of factors:
* Whether users are authenticated
* Whether users have unique IP addresses
* Whether caching servers are used
* Which method is used to pass parameters to dynamic pages or scripts
* Whether cookies or other mechanisms keep track of sessions
[Figure: Site Traffic Charts and Tables]
How visible are your links and menus?
Most good Webmasters are tinkerers at heart. We enjoy making small improvements and look for confirmation that the changes were beneficial. We try "moving things a little to the left, a little to the right" to watch what happens. This can come in handy in many situations. When you have launched a new area of the site, what types of links send the most traffic to the new area? Running a special referrer report for a new area can tell you just that. If performance is less than expected, try a different label, link style, or position.
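A special referrer report for a new area can be as simple as counting where requests for that area came from. This sketch assumes the referrer field is being logged and that the new area lives under a single directory. The /benefits/ prefix, host name, and sample entries are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def referrer_report(entries, area_prefix):
    """Count the pages that sent visitors into the area, most frequent first."""
    counts = Counter()
    for path, referrer in entries:
        if not path.startswith(area_prefix):
            continue  # request is outside the area being studied
        source = "(none)" if referrer in ("", "-") else urlparse(referrer).path
        if source.startswith(area_prefix):
            continue  # navigation within the area, not an arrival
        counts[source] += 1
    return counts.most_common()

# Hypothetical (requested path, referrer) pairs from an access log.
entries = [
    ("/benefits/index.html", "http://intranet.example.com/index.html"),
    ("/benefits/index.html", "http://intranet.example.com/index.html"),
    ("/benefits/dental.html", "http://intranet.example.com/benefits/index.html"),
    ("/benefits/index.html", "http://intranet.example.com/hr/index.html"),
    ("/benefits/index.html", "-"),
    ("/news/today.html", "http://intranet.example.com/index.html"),
]

report = referrer_report(entries, "/benefits/")
for source, count in report:
    print(f"{count:4d}  {source}")
```

Running the same report before and after a label or position change gives a rough before-and-after comparison, provided overall traffic is similar in both periods.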
On one Web site, we recently modified a section of links on a top-level page. Would the new organization work? Prior to the new layout, the main content area of the Web site looked like a crazy quilt of disjointed links. Links were in rows and uncategorized. Would users even notice the links in the new layout? This new design upped click-through rates dramatically on a number of links that had been "invisible" in the old design. Of course, we cannot be certain, based on log file data alone, that the users are more successful at meeting their information needs with the new layout. Maybe our new links are leading them astray, down the garden path. Log files only indicate that the link was "visible" to the extent that it was clicked more often. We need to use our experience, judgment, and heuristic evaluation techniques to assess whether the change is likely to be an improvement for users. A program like Click Tracks [www.clicktracks.com] is a nifty tool for visually displaying all the links on the page and percentage of visitors who click on them.
What clues can you pick up about effective navigation and labels?
Are employees able to get from here to there? In a recent Web team meeting, evaluating a site-wide menu, I pointed out that generic terms rather than "brand names" were more effective for site navigation. Without any hard evidence to back up my claim, the meeting moved on. After the meeting I had an "ah-ha" moment. Let's look at how many people actually selected this area from the home page based on the brand-name label rather than the generic term. A quick glance at the log file revealed that in the last 2 days, 200 accesses resulted from the brand-name label, while 1,000 accesses were recorded for a less prominent link with a generic label.
[Figure: Daily Statistics for June 2003]
We could have also looked at search log data to see which keywords related to this area. Obviously the "scent" of the first brand-name term is not nearly as strong as the everyday language of users, and logs provide some evidence towards supporting this hypothesis. Does this mean we should eliminate the brand-name label? Not necessarily, as redundancy can be a very effective strategy. However, only having that brand-name label could be detrimental. If the Web design team is short of real estate for menu options, which of these two labels would you throw off the boat first?
What about entry pages?
Not everyone starts on the home page. Knowing where users typically start can help you identify key places to add news and announcements to the site to ensure that they reach the broadest range of users.
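One rough way to find entry pages is to treat any request with no referrer, or with a referrer from another host, as the start of a visit. This is a heuristic rather than a true session analysis, since bookmarks, typed URLs, and stripped referrers all look the same. The host name and sample entries below are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

SITE_HOST = "intranet.example.com"  # hypothetical host name of the intranet

def is_entry(referrer):
    """True when the request did not come from another page on this site."""
    if referrer in ("", "-"):
        return True
    return urlparse(referrer).hostname != SITE_HOST

# Hypothetical (requested path, referrer) pairs from an access log.
entries = [
    ("/index.html", "-"),
    ("/hr/forms.html", "http://intranet.example.com/index.html"),
    ("/library/index.html", "-"),
    ("/library/index.html", "http://portal.example.org/links.html"),
    ("/news/today.html", "http://intranet.example.com/library/index.html"),
]

entry_pages = Counter(path for path, referrer in entries if is_entry(referrer))
print(entry_pages.most_common())
```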
What's hot and what's not?
Log analysis can tell you what pages are popular and show trends over time. Care must be taken not to leap from popular (frequently accessed) to an automatic assumption that the page or area is useful. In fact, 404 errors may be the most popular pages on a site filled with link rot, and we all know how satisfying it is to see one of those.
The trouble with user demand is that we can't know for certain the impact of use. What are the outcome measures? For example, if usage soars in an academic library for a full-text library database, is the quality of papers better? Or are students simply selecting the wrong database for their topic and hunting and pecking through low-relevance result sets? Log analysis can point out some areas that stick out, with less usage than expected or far more use. Other testing and evaluation methods are needed to discern whether the area is meeting user needs.
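A popularity count is more trustworthy if successful page views are kept apart from failed requests and from supporting files such as images. Here is a minimal sketch, assuming that pages can be recognized by their file extension. The suffix list and sample entries are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical rule for what counts as a page rather than an image or stylesheet.
PAGE_SUFFIXES = (".html", ".htm", "/")

def is_page(path):
    """True when the requested path looks like a page."""
    return path.endswith(PAGE_SUFFIXES)

# Hypothetical (requested path, status code) pairs from an access log.
entries = [
    ("/index.html", 200),
    ("/images/logo.gif", 200),
    ("/index.html", 200),
    ("/hr/forms.html", 200),
    ("/old/policy.html", 404),
    ("/old/policy.html", 404),
    ("/old/policy.html", 404),
]

views = Counter(path for path, status in entries if status == 200 and is_page(path))
missing = Counter(path for path, status in entries if status == 404)

print("Most viewed:", views.most_common(3))
print("Most requested but missing:", missing.most_common(3))
```

In this invented sample the missing page draws more requests than any real one, which is exactly the trap described above.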
When does content get used?
Do you wonder if anyone ever notices or uses the weekly cafeteria menu or news headlines? Should the Web team keep producing this type of content? Take a look at the server reports and see not only the number of views, but also the time of day for them. Does traffic surge on the day that the new issue of the newsletter is posted? Check reports for a few weeks and see if traffic peaks on those days for that page. This is learned behavior: users who value this information come back repeatedly to access it.
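If your reporting tool does not break views down this way, the timestamps in the log are enough to do it by hand. The following sketch assumes timestamps in the usual access log layout; the page path and sample entries are hypothetical.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def views_by_day_and_hour(entries, page):
    """Tally requests for one page by weekday and by hour of the day."""
    by_day, by_hour = Counter(), Counter()
    for stamp, path in entries:
        if path != page:
            continue
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        by_day[DAYS[when.weekday()]] += 1
        by_hour[when.hour] += 1
    return by_day, by_hour

# Hypothetical (timestamp, requested path) pairs from an access log.
entries = [
    ("09/Jun/2003:11:45:10 -0600", "/cafeteria/menu.html"),
    ("09/Jun/2003:11:52:31 -0600", "/cafeteria/menu.html"),
    ("10/Jun/2003:11:40:02 -0600", "/cafeteria/menu.html"),
    ("10/Jun/2003:15:05:00 -0600", "/index.html"),
    ("11/Jun/2003:08:15:44 -0600", "/cafeteria/menu.html"),
]

by_day, by_hour = views_by_day_and_hour(entries, "/cafeteria/menu.html")
print(by_day.most_common())
print(sorted(by_hour.items()))
```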
What Web browsers do you need to support?
In some organizations, you can be reasonably confident that you know what browsers employees are using to surf the intranet. In other free-flowing intranet cultures, server log reports can be the best method of determining what types of computers and Web browsers are being used by significant numbers of employees. This information helps with formulating cross browser compatibility requirements for the intranet.
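A tally of the user agent field gives a first approximation of the browser mix. User agent strings are notoriously inconsistent, so the matching rules in this sketch are a crude assumption that would need tuning against your own logs. The sample strings are illustrative.

```python
from collections import Counter

# Crude, ordered matching rules: the first token found in the string wins.
# Order matters because some browsers include other browsers' tokens.
RULES = [
    ("Opera", "Opera"),
    ("MSIE", "Internet Explorer"),
    ("Netscape", "Netscape"),
    ("Gecko", "Mozilla/Gecko"),
]

def browser_family(agent):
    """Return a rough browser family for a user agent string."""
    for token, family in RULES:
        if token in agent:
            return family
    return "Other"

# Hypothetical user agent strings from the last field of each log entry.
agents = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.4) Gecko/20030624 Netscape/7.1",
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.11 [en]",
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.4) Gecko/20030624",
]

families = Counter(browser_family(agent) for agent in agents)
for family, count in families.most_common():
    print(f"{family}: {count / len(agents):.0%}")
```

Counting requests over-weights heavy users; counting distinct usernames or addresses per family gives a fairer picture of how many employees rely on each browser.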
Walking in the shoes of your Web site visitors can be an eye opener.
Following a path or thread in your log file allows you to retrace the route a user took while navigating your site. You can discover:
* where they entered the site.
* the sequence of pages they viewed.
* in some cases how they went from one page to the next.
* data that may have been typed into a form.
* where they left the site.
Be forewarned that this type of analysis leads to more questions and lines of enquiry.
Some log analysis packages can summarize typical paths taken by visitors or drill down to the path of a single user. Keep in mind that caching and dynamic IP addresses can lead to erroneous reports.
Sometimes it is sufficient to extract a small sample of specific cases from a log file that represent single user sessions and retrace the footsteps of a few visitors. If you're looking at a handful of cases, you can load the relevant log file entries into Excel for manipulation. You may notice that there are gaps in the sequence of pages. This usually means that the Back button has been clicked. Long pauses between page requests are difficult to interpret: did the person really examine the page for a long time or were they chatting with a co-worker?
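For more than a handful of cases, the same retracing can be scripted. The sketch below groups requests by visitor and starts a new session after 30 minutes of silence. That cutoff is a common convention, not something the log records, and it assumes visitors can be told apart by username or address. The sample entries are hypothetical.

```python
from datetime import datetime, timedelta
from itertools import groupby

FORMAT = "%d/%b/%Y:%H:%M:%S %z"
SESSION_GAP = timedelta(minutes=30)  # assumed cutoff between separate visits

def sessions(entries):
    """Return (visitor, [(path, seconds until next request or None), ...]) per session."""
    parsed = sorted((v, datetime.strptime(t, FORMAT), p) for v, t, p in entries)
    result = []
    for visitor, rows in groupby(parsed, key=lambda row: row[0]):
        rows = list(rows)
        current = []
        for i, (_, when, path) in enumerate(rows):
            following = rows[i + 1][1] if i + 1 < len(rows) else None
            if following is not None and following - when <= SESSION_GAP:
                current.append((path, int((following - when).total_seconds())))
            else:
                # Last request of the session: no way to know how long it was viewed.
                current.append((path, None))
                result.append((visitor, current))
                current = []
    return result

# Hypothetical (visitor, timestamp, requested path) entries.
entries = [
    ("jsmith", "10/Jun/2003:14:02:11 -0600", "/index.html"),
    ("jsmith", "10/Jun/2003:14:02:40 -0600", "/hr/forms.html"),
    ("jsmith", "10/Jun/2003:14:09:40 -0600", "/hr/travel-claim.html"),
    ("jsmith", "10/Jun/2003:16:30:00 -0600", "/index.html"),
    ("mlee", "10/Jun/2003:09:00:00 -0600", "/library/index.html"),
]

paths = sessions(entries)
for visitor, steps in paths:
    print(visitor, steps)
```

The seconds between requests are only an upper bound on reading time, for the reasons given above, and the final page of every session has no timing at all.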
One of the fundamental challenges of designing Web sites is addressing the needs both of occasional users and frequent users. By contrasting the paths of experienced/frequent users with occasional users, we can see if there are particular features that appear to help one group which hinder or harm the other.
EVALUATING ERROR LOGS
Error logs are usually well utilized by Web teams, so I'll just briefly touch on how to use them to improve your Web site.
The error log file records everything that went wrong plus diagnostic messages such as stopping and restarting the Web server. Here are typical data capture elements:
* Date
* Error level
* Client IP address
* Error message and path to requested file
Often, the most common errors are 404 (Not Found) errors. A 404 error may be a sign of link rot, that someone mistyped the URL, or that the referring page has a typo in the link. If it's a case of link rot, a redirect should be set up to deliver users to a similar page in the new design.
User authentication failures are also recorded in error log files. By looking at the number of errors, you might determine that a "password reminder feature" is needed or login usernames should be case insensitive. The error log file is also a rich treasure-trove of data about why CGI scripts fail. Anything a CGI program writes to STDERR (standard error) gets dumped to the error log.
During beta testing and immediately after launching a new Web area, it's good practice to watch the error log in real time. To do this on an Apache Web server on a Unix platform:
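Telling these causes apart is easier when each missing page is listed with the pages that pointed to it. This sketch works from access log fields (path, status, and referrer) rather than the error log, and assumes the referrer is being recorded. The broken_links function and sample entries are hypothetical.

```python
from collections import Counter, defaultdict

def broken_links(entries):
    """Map each missing path to a count of the pages that referred visitors to it."""
    report = defaultdict(Counter)
    for path, status, referrer in entries:
        if status != 404:
            continue
        # No referrer usually means a typed URL or an old bookmark.
        source = "(no referrer)" if referrer in ("", "-") else referrer
        report[path][source] += 1
    return report

# Hypothetical (requested path, status code, referrer) entries.
entries = [
    ("/old/policy.html", 404, "http://intranet.example.com/hr/index.html"),
    ("/old/policy.html", 404, "http://intranet.example.com/hr/index.html"),
    ("/libary/index.html", 404, "-"),
    ("/index.html", 200, "-"),
]

report = broken_links(entries)
for path, sources in report.items():
    print(path, sources.most_common())
```

A missing page with one dominant referrer points to a bad link you can fix on that page, while one with no referrer is a better candidate for a redirect.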
tail -f /usr/local/apache/logs/error_log
LEARNING FROM WEB LOG FILES
The key strength of log files is also one of their weaknesses: an abundance of data that records the patterns of use by Web site visitors in the natural environment. Working with gigabytes of raw log data would be overwhelming, and analytic tools are a must. There's a lot you can learn about your users from the log files. You can get clues about parts of the Web site that are underperforming. By analyzing threads, click-through rates, and response times, you can begin to glean information about how users navigate your site.
Keep in mind that Web log files only tell part of the story. They are best used as part of an iterative process of usability testing, heuristic evaluations, and redesign to improve the overall effectiveness of an intranet.
Darlene Fichter
Northern Lights Internet Solutions, Ltd.
Darlene Fichter [fichter@lights.com] is president of Northern Lights Internet Solutions, Ltd.