Usability testing and user testing are slightly different, although both are testing techniques in HCI. User testing is more academic; it is used to categorize users or tasks and to improve HCI theory. Usability testing is usually done in industry by usability experts to discover usability problems in specific UIs. User testing is a more formal process, and the tasks performed by the participants are specific. Usability testing tends to be more informal, and the tasks performed by the participants are more general, but the tasks are specific about the usability concerns addressed in the UI. The analysis for user testing is very quantitative and uses advanced statistics, while the analysis for usability testing tends to be qualitative and uses simple statistics, i.e. frequencies and averages. The two approaches use many of the same techniques and share many of the same goals.
My expertise is in user testing. I hope that you will have an opportunity to participate in user tests and see the results. That participation can be your introduction to user testing. In this course you will perform usability testing.
Evaluation Paradigms and Techniques
A paradigm is a typical example or pattern. Preece, Rogers and Sharp propose the following evaluation paradigms:
- Quick and dirty – informal discussions with users at any time perhaps using prototypes.
- Usability Testing – observing the user and recording the session, for example by videotaping it
- Field studies – going to the users’ sites and using surveys or observing users using the UI.
- Predictive – experts using heuristic evaluations or formal models to evaluate the UI, generally at the developer's site
Summarizing the differences between the evaluation paradigms:
Evaluation paradigm: | Quick and dirty | Usability testing | Field studies | Predictive |
Role of the user in the evaluation | Natural behavior | To perform tasks | Natural behavior | None |
Who controls the evaluation | Evaluators have minimum control | Evaluators have strong control | Relationship between evaluators and customers | Expert evaluators |
Location of the evaluation | Natural environment or lab | Lab | Natural environment | Lab or on the premises |
When the evaluation is used | Any time | With prototype or product | Early | With prototype |
Type of data collected from the evaluation | Qualitative; informal discussions | Quantitative; statistical | Qualitative; sketches | List of problems |
How the data is fed back into the design | Sketches and quotes | Report on performance | Descriptions at workshops, reports and sketches | Report |
Philosophy or theory of the evaluation | User-centered | Scientific/experimental | Ethnographic | Theory |
They categorize usability testing as controlled testing of users performing tasks on a prototype in the laboratory. They also categorize usability testing as quantitative and based on scientific research. This is true compared to the other paradigms, but not true compared to user testing. In other words, compared to ‘quick and dirty’ and the other paradigms, usability testing is the most quantitative and scientific. Usability test results are reported in the academic literature. Usability tests report the results of user performance on the UI, but they also report users’ answers to questionnaires and interviews.
A technique is a specific way of performing a task. For evaluation, Preece, Rogers and Sharp propose these evaluation techniques:
- Observing users – using notes, audio, video and logging to record the use of a system
- Asking users – using interviews and surveys to get users’ opinions about the system
- Asking experts – for Heuristic Evaluation or Cognitive Walkthrough
- Testing user performance – in the lab or the field
- Predictive using task performance models – for example GOMS or HTA and Fitts’ Law
Summarizing the relationships between evaluation paradigms and techniques:
Technique\Paradigm | Quick and dirty | Usability testing | Field studies | Predictive |
Observing Users | Seeing how users behave in their environment | Video and interaction logs. Analyzed for errors, performance, route in UI, etc. | Ethnography is central to Field Studies | NA |
Asking Users | Discuss w/ potential users, individually or in focus groups | Pre and Post testing surveys. Structured Interviews | Interviews or discussions | NA |
Asking Experts | Provide usability critiques on prototypes | NA | NA | Heuristic Evaluation |
User testing | NA | Testing typical users on typical tasks. Central to Usability Testing | NA | NA |
Modeling | NA | In academia compare w/ theory. | In academia compare w/ theory | GOMS etc. |
NA = not applicable, not used
Usability testing uses all the techniques except ‘asking experts.’
Basic structure of a usability test session
- Pretest introductions and explain the UI and the tasks.
- Conduct and observe participant performing tasks
- Post-test questionnaires and/or interviews and/or structured discussions to get user feelings and opinions of the UI.
Planning a Usability Test
Preece, Rogers and Sharp use the acronym DECIDE to explain the steps for planning an evaluation:
- Determine the general goals of the test
- Explore the specific questions of the test
- Choose the evaluation paradigms and techniques
- Identify the specific practical issues, such as selecting participants and tasks
- Decide how to deal with ethical issues
- Evaluate, interpret, and present the data.
The acronym, ‘DECIDE’, is good because it stresses that preparation is about making decisions. You must decide on the goals and generate the specific usability concerns or questions. This is critical for usability testing. In user testing the scientists have a hypothesis in mind to test, but in usability testing usability experts have specific usability concerns that they are investigating.
The ‘I’ in DECIDE is for identifying the practical issues. (I would have forgotten this aspect of planning for a test.)
- Users: which users do you want to evaluate? What participants can you evaluate? How can you solicit the participants? How will they participate, e.g. what tasks?
- Facilities and equipment in the laboratory: What is the prototype? How will you gather the data?
- Schedule and budget constraints influence the number of participants and the procurement of new equipment, and also what and how much analysis the usability expert can perform.
- Expertise: What expertise does the evaluation team have? Can they perform the test and evaluation?
Ethical Issues:
Preece, Rogers and Sharp offer these procedures to ensure proper ethics.
- Tell the participants the goals of the test in language that they will understand, but not in so much detail that it will bias the results. Tell them, “We want to learn how well this UI works for you,” and not, “We want to know if you will miss seeing a button.”
- Explain the tasks clearly and without biasing the results.
- Be sure to tell them that all data will be confidential and promise anonymity.
- Tell the users that they can stop at any time, any portion or the entire test.
- Pay users when you can; this makes the relationship between you and the participant professional. You will not be able to pay participants for this course.
- In the report avoid quotes etc. that identify the user.
- Ask users in the future if you can quote them.
These are good procedures, but they do not replace the most important aspect: treat the participants with respect. (This is why I call them participants and not subjects.) Treat them with respect even while you are designing the test. I have learned that the better you treat the participants (before and during the test), the better the results will be.
Graduate students need to go through the CITI training.
Evaluate and interpret results:
Before the actual testing, evaluate the potential results. Consider:
- Reliability – are the results repeatable
- Validity – does the test correctly measure the usability aspects you want to investigate
- Biases – there can be some bias, but you should be aware of the biases and of whether they make the conclusions invalid.
- Scope of test – this is important; you cannot test everything.
- Ecological validity – environmental factors that might bias the results.
Usability Test Development
You should have a good idea of what a usability test is, but only a general idea of how to plan one. I’ll try to address some specific issues of developing a usability test, but I cannot go over everything. Preece, Sharp and Rogers in Interaction Design, chapters 12-15, discuss some of the specifics and give some examples. Barnum in Usability Testing and Research, chapters 5 through 7, gives a very detailed example of usability tests for the Medline and Hotmail websites.
Goals and Usability Concerns
Test goals and usability concerns of the UI are the most important aspect of designing a good test. Preece, Sharp and Rogers do not give much insight into how to generate goals and concerns. In industry, the client (a representative from a software company or design team) frequently gives very vague goals for the usability testing. It is your job to determine the goals for the test.
Barnum suggests answering these two questions:
- What do you want to learn from the test?
- How will you measure what you want to learn?
The list above begs the question, but it does point out that you can only learn from a test what can be measured or observed. Although you should consider what you can measure, I think it is better to first generate a list of questions about the use of the UI and then determine how you can make the measurements.
Rojek and Kanerva (“A data-collection strategy for usability tests,” IEEE Transactions on Professional Communication, 1994, cited in Barnum’s Usability Testing) give a list of questions:
- What do you want to learn from the test?
- What are your greatest fears for the product?
- Are there different opinions in the design team about an issue in the design?
- What can be changed about the design as a consequence of the test results?
- Has the design team made assumptions about the users?
- Are particular features of the design targeted for a specific issue?
- How will you know if the product is good?
This list of questions is good for industry, but maybe not so good for our course or for more experimental UIs. I suggest this list of questions to ask yourselves:
- Is there something unique about the UI that can be tested?
- Is there a concern about an interaction aspect of the UI that you can test?
- Is there a concern about the graphics or information displayed in the UI?
- Is there an ergonomics aspect of the device that is a concern?
- Is there an environmental aspect that is a concern?
- What will users think of the device or UI?
- What are the vertical and horizontal extents of the prototype?
Answer these questions and make a written list of goals for the test. If a heuristic evaluation was performed on the UI, you can look at it to generate some concerns, or you can first conduct a heuristic evaluation on the design.
Observations and Tasks
Your goal is to write a test plan. Part of generating a test plan is to design a task for the participant to perform, so consider what can be measured by observing the task (a small sketch of computing two of these measures follows the list):
- Time to perform a task
- Percentage of task completed
- Number of errors
- Time to recover from an error
- Number of repetitions or failed commands
- Number of features or commands not used
- Time spent navigating and/or searching for information
- Number of clicks or taps to perform a task
- Quality/Quantity of information found
- and more
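To make two of these measures concrete, here is a minimal Java sketch that computes the time to perform a task and the number of errors from a list of logged events; the Event class, the event codes, and the sample data are my own illustrative assumptions, not output from a real test:

import java.util.Arrays;
import java.util.List;

// Hedged sketch: turning logged events into two of the measures listed above,
// time to perform a task and number of errors. The Event class and the sample
// events are hypothetical; real data would come from your log files.
public class TaskMeasures {

    static class Event {
        final long timeMs;      // time stamp in milliseconds
        final String code;      // e.g. "start", "end", "error"

        Event(long timeMs, String code) {
            this.timeMs = timeMs;
            this.code = code;
        }
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event(1000, "start"),
                new Event(4200, "error"),
                new Event(9050, "error"),
                new Event(15500, "end"));

        long start = -1, end = -1;
        int errors = 0;
        for (Event e : events) {
            if (e.code.equals("start")) start = e.timeMs;
            if (e.code.equals("end")) end = e.timeMs;
            if (e.code.equals("error")) errors++;
        }

        System.out.println("Time to perform the task: " + (end - start) + " ms");
        System.out.println("Number of errors: " + errors);
    }
}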
The usability test can also include post-task questionnaires or interviews. What can you learn and measure from questionnaires?
- How does the participant feel about the product?
- Was the participant frustrated?
- Was the participant satisfied using the product?
- Was the participant amused using the product?
- What was the participant thinking while using the product?
Questionnaires can generate quantitative measures, especially if they use a Likert scale (see below), but they are also used for qualitative measures.
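To show how Likert-scale answers turn into the simple statistics mentioned earlier, here is a minimal Java sketch that computes frequency counts and a mean rating; the 1-to-5 scale and the responses are hypothetical:

import java.util.Arrays;

// Minimal sketch: summarizing Likert-scale answers with simple statistics
// (frequency counts and an average). The responses below are hypothetical
// examples, not data from an actual test.
public class LikertSummary {

    public static void main(String[] args) {
        // One question, responses from five participants on a 1..5 scale
        // (1 = strongly disagree, 5 = strongly agree).
        int[] responses = {4, 5, 3, 4, 2};

        int[] frequency = new int[6];          // indices 1..5 are used
        int sum = 0;
        for (int r : responses) {
            frequency[r]++;
            sum += r;
        }
        double mean = (double) sum / responses.length;

        System.out.println("Responses: " + Arrays.toString(responses));
        for (int scale = 1; scale <= 5; scale++) {
            System.out.println("Scale " + scale + ": " + frequency[scale] + " participant(s)");
        }
        System.out.printf("Mean rating: %.2f%n", mean);
    }
}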
Another technique is observing participants while they perform the task:
- Facial expression
- Vocalization
- Hand motion
- Body language
Current HCI research is trying to develop quantitative measures from these observations, but you can use them as qualitative measures.
Finally, usability testing can use the think-aloud protocol. Psychologists have formal techniques for analyzing think-aloud data, but these are involved and tedious. You could use ‘think aloud’ as an informal qualitative measure.
Triangulation
Frequently usability testing cannot use a single task or measurement to answer questions about a design concern. Consider the design concern or question: “Is the product easy to use?” The question is a legitimate usability concern, but how do you measure it? The time to perform a task is a good and easy quantitative measure, but how long is too long? Observation of the user may show facial expressions suggesting that the user is perplexed or frustrated. Participants may answer questions in a survey or interview in ways that indicate they were frustrated or thought that the task was hard. If you use all these measures, you have confirming evidence. Using multiple techniques to probe a usability concern is called triangulation.
You can compare your list of potential concerns with what can be measured and throw out concerns that cannot be measured. With the remaining list of concerns, you can generate test goals.
Test Plan and Scenarios
Your short-term goal is to generate a test plan composed of tasks. Each task or set of tasks has at least one test goal and frequently several associated measurements. Usability test plans generally are composed of several test scenarios. Test scenarios are short stories that you tell the participants before they perform the tasks. Test scenarios set the scene for the participant and suggest what the participants should do. In usability testing, in contrast to user testing, test scenarios are essential. The usability test administrator cannot explicitly tell participants what to do by saying, “Move the cursor to a button and click.” (A user test administrator could tell the participants exactly what to do.) So how does the usability test administrator explain to the participant what to do? The administrator tells a story like,
“You are a customer who would like to purchase a new broom to sweep the floor. Please find a broom on this website, choose a broom, and make the purchase.”
The scenario avoids explicitly telling the user what to do. Also note that this scenario contains several tasks. Using scenarios, usability testing can measure more than how long it takes to press a button. For example, what design concerns or test goals could the above scenario address? The scenarios can describe the environment and give a backstory to the participant. For example, you may want the participant to imagine that they are in a car using the device. Write these descriptions down so that you can repeat them exactly to each participant. Also, the test should impose appropriate environmental constraints, for example having the participant sit or stand.
Now you are ready to write a test plan. A typical test plan outline:
Test Plan Name
Scenario Name
Goals – what you want to learn from the test scenario
Quantitative measurement list – what measurements the loggers will record
Scenario – the actual story (By itself, on a separate sheet of paper)
Task list – short description of the actual tasks the user should perform
Qualitative measurement list
Potential observations of users
Post Scenario interview or questionnaire questions (By itself, on a separate sheet of paper)
Test set up details
For this course you should have at least two scenarios. In industry, the usability experts design enough scenarios to cover all the test goals, which hopefully address most of the design concerns.
Questionnaires and Interviews
Both questionnaires and interviews are lists of questions to ask the participants. It is possible to write bad questions. Bad questions have one of these aspects:
- long questions
- confusing questions
- elaborate sentence construction
- using jargon, especially technical terms
- leading questions
The difference between questionnaires and interviews is how the data is recorded. Participants write the answers to questionnaires, so recording is easy. But information can be missed by questionnaires, for example because a question was not anticipated. Interviews can capture answers that are not short and can probe for more information.
For both questionnaires and interviews, write the questions out and review them with the team, looking for the bad aspects listed above. Then consider the order of the questions. Answering one question can, for good or bad, lead to an answer to the next question. Consider what participants will be thinking when they answer a question, and the implications for their answers to the next question.
The types of interviews:
- Open-ended interview
- Unstructured interview
- Structured interview
- Semi-structured interviews
- Group interview
Open-ended and group interviews take a lot of skill to conduct and are hard to analyze. You should perform structured interviews.
During an interview:
- Introduce yourself and explain what the interview is about
- Ask a few easy warm up questions
- Ask your main questions
- Ask a few easy cool down questions
- Thank the participant
Be professional and dress similarly to the interviewee. If you are using a recorder, make sure it works before the interview. If you will be taking notes, try to write them down exactly; at the very least do not change the meaning of the answer, and be consistent about how you abbreviate answers.
Standard types of survey questions:
- Yes/No Maybe? questions
- Likert scales – strictly speaking, ‘Likert scale’ refers to the process of constructing the scale; here it means the form of the question
- Agree and disagree scale
- frequency of use
- Semantic differential scales – similar to a Likert scale, but pairs of adjectives describing a target are scaled.
- Check box options
- Comparison questions
- Ranking
- Short answers
Keep the questionnaire short, at most about 20 questions. The questionnaire should not take more than 10 minutes to complete. Be sure you really need the answers to the questions you ask and that you know how you will use the data. I have stopped answering many questionnaires because they went on and on. Another point, especially for online questionnaires or forms, is to be sure that there is an honest indicator of progress through the questionnaire.
Consider using ‘short answer’ questions. Although they are hard to quantify, they can give a lot of information that you would not predict. Equally important, they can help you identify bias or bad design in the questionnaire itself. Do not forget to ask, “Do you have suggestions on how to improve the …?” This shows respect for the participants’ input, and you will be surprised by their answers.
Observing and Recording Tests
The basic techniques for observing participants during usability testing:
- Notes
- Audio recording
- Still photographs
- Video
- Event logging software
The advantages and disadvantages of each technique might not be completely clear. Video and audio recordings can capture a lot of data, but they are hard to analyze and take a lot of expertise. Also, care must be taken to be sure that the participant appears in the field of view of the camera (frequently many cameras have to be used). During audio recording, care must be taken to ensure that the desired recording is not obscured by noise. Event logging software can efficiently capture a lot of data and make analysis easier, but it is either expensive or takes time to program. My user tests use logging software that I have written in Java; it has 16 ms time resolution. Note taking is a cheap and effective means of observing. The time resolution is lower than with event logging because the notes must be handwritten or typed. The time resolution can be improved by creating a shorthand and programming macros in a word processor. So with note taking you will want to measure tasks that take a longer time to perform, at least 5 seconds.
Logging by Notes
If anyone develops a set of macros for Word or a program for writing notes, please share them with me and the class. (See page 245 in Barnum’s Usability Testing and Research.) The document should help you generate a csv (comma-separated values) file. Actually, some other delimiter might be more useful. The columns in the log file only need to be line number, time stamp, event code and event description. So the macro could automatically generate the line number and time stamp after each line return; then all the logger needs to do is type the event code and description.
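As one possible starting point, here is a rough Java sketch of such a note-logging program; the file name, the header row, and the ‘q’ quit command are my own assumptions, so adapt them to your test:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Scanner;

// Minimal sketch of a note-logging program (an alternative to Word macros).
// The logger types an event code and a description; the program adds the line
// number and an elapsed-time stamp and appends a csv row. The file name and
// the "q" quit command are arbitrary choices for this sketch.
public class NoteLogger {

    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();       // synchronization point
        Scanner in = new Scanner(System.in);
        int lineNumber = 0;

        try (PrintWriter log = new PrintWriter(new FileWriter("notes-log.csv", true))) {
            log.println("line,timestamp_ms,event_code,description");
            System.out.println("Type: <event code> <description>   (q to quit)");
            while (in.hasNextLine()) {
                String entry = in.nextLine().trim();
                if (entry.equals("q")) break;
                long elapsed = System.currentTimeMillis() - start;
                String[] parts = entry.split("\\s+", 2);
                String code = parts[0];
                String description = parts.length > 1 ? parts[1] : "";
                lineNumber++;
                log.printf("%d,%d,%s,\"%s\"%n", lineNumber, elapsed, code, description);
                log.flush();                            // keep the file current during the session
            }
        }
    }
}

Because the program adds the line number and elapsed time itself, the logger only types the event code and a short description, which keeps the logging fast.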
You will want a list of event codes. How many and which events? This depends on the test and the analysis you expect to perform. The event code serves two purposes: a shorthand for logging and assistance in the analysis. So the event codes should be unique single characters that are easy to remember. Without too much practice a logger can remember about 5 event codes. That is not very many, so they cannot be too specific; for example ‘h’ = ‘selected the help menu’ or ‘e’ = ‘hit enter key’ might be too specific. But if they are too general, like ‘c’ = ‘made a command’ or ‘f’ = ‘user made a face’, they might not help much. The description can make the event code more specific. For example
c help
could mean ‘mouse clicked help menu.’ As discussed below, you will probably have more than one logger. Each logger can have their own set of event codes; 3 loggers would total approximately 15 event codes. Using multiple log files requires a synchronization event, such as a start event voiced by the test administrator; the time stamps can then be synchronized across the log files.
Software Event Logging
There are several options for recording UI events for Android UI development:
- Writing to the log file
- Writing into an SQLite database
- Writing to a file
Writing to the Grails or framework log file is easy to implement, but the log file is verbose and you will have to parse it. Also, the file will be written among the operating system's files. I do not recommend it because it is not much easier than writing to your own file.
Writing to a database would require additional Models or Domains. Logging to a database has the advantage that writes are fast and memory usage is minimal.
Writing to a file is simple. The file should be in comma-separated values (csv) format. The app will have to create a File:
File file = new File("path/log-<time>.csv");  // replace <time> with a timestamp identifying the session
A relative path will be in the web-app/ directory, which you can access. You’ll lose the file if you redeploy.
In the controller, writing to the file is easy:
file.write "$time, $event"   // write replaces the contents of the file
file << "$time, $event"      // << appends to the file
Design the format of the file. I think it is best to consider all actions on the UI as events that the evaluator could use:
"<event number>, <view>, <event name>, <event target>, <event time>, \"<details>\" \n"
where
<event number> = sequential number of events
<view> = activity or view name
<event name> = name of the event for example onClick or onCreate etc.
<event target> = the name of the button or item
<event time> = time
For example
1, MainActivity, onCreate, ,123456789, “started app”
2, MainActivity, onClick, MenuItem1, 123457123, “clicked button”
…
40, MainActivity, finish, ,123459123, “finished app”
This file format can easily be interpreted by Excel or statistics packages.
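If you want a starting point, here is a hedged Java sketch of a small logger that writes rows in the format above; the class name, the file name, and the idea of calling logEvent() from your UI callbacks (onCreate, onClick, a controller action) are illustrative assumptions rather than a required API:

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Sketch of a tiny event logger that writes the csv format described above.
// It is framework-neutral: call logEvent(...) from whatever UI callbacks you
// have. Class and file names are illustrative assumptions.
public class EventLogger {

    private final PrintWriter out;
    private int eventNumber = 0;

    public EventLogger(String path) throws IOException {
        out = new PrintWriter(new FileWriter(path, true)); // append so old rows are not clobbered
    }

    public synchronized void logEvent(String view, String eventName,
                                      String eventTarget, String details) {
        eventNumber++;
        long eventTime = System.currentTimeMillis();
        out.printf("%d, %s, %s, %s, %d, \"%s\"%n",
                eventNumber, view, eventName, eventTarget, eventTime, details);
        out.flush();   // flush every event so data survives a crash mid-session
    }

    public void close() {
        out.close();
    }

    // Example usage producing rows like the ones shown above.
    public static void main(String[] args) throws IOException {
        EventLogger logger = new EventLogger("log-session1.csv");
        logger.logEvent("MainActivity", "onCreate", "", "started app");
        logger.logEvent("MainActivity", "onClick", "MenuItem1", "clicked button");
        logger.logEvent("MainActivity", "finish", "", "finished app");
        logger.close();
    }
}

Flushing after every event is a deliberate choice: it is slightly slower, but the rows already written are kept even if the app crashes during a session.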
Writing events that occur in the controller is easy. Recording only controller events will allow you to record when participants enter a web page and when they have submitted an observation. This could be a useful performance measure. However, controller events will not record the times when participants enter text into a textbox. Writing these interaction events is harder: it requires using JavaScript, saving a JSON object, passing the JSON in a request to the controller (this could be a controller action) and then writing the JSON to the log file.
Audio Logging
Why use an audio recording unless you are interviewing? You could use participants’ groans for determining their mood, or you might be using the ‘think aloud’ protocol. Think aloud is when you ask the participants to vocalize their thoughts while they perform the tasks. This is an effective technique and only slows down the time to perform tasks a little. It is hard to analyze formally, but in the context of usability testing an informal analysis can give you some idea of what the participant was thinking. You may want to use a post-test interview to get an idea of what the user was thinking. Post-test interviews are not as reliable because the participants may not remember and may answer what they think they should have thought. So only ask about memorable events.
Usability Test Team
When you conduct the test, your team should have assigned roles. For example:
- Administrator/facilitator/briefer
- Recorder/logger
- Observer
Consider having several loggers splitting up the responsibilities of recording events. For example, one logger could track what the user does on the GUI and another logger could track facial expressions or body language. You can have more than one logger recording a single event to assure reliability or to refine what they are observing.
Loggers do not have time to make general observations, and the facilitator/briefer is too busy being attentive to the user to take notes or make observations. You may want a general-purpose observer, who makes notes on the general progress of the test and on whether anything unusual occurred.
Conducting Usability tests
The general procedure for the usability tests session:
- Prepare test room: make sure the programs and equipment work, and you have the forms and questionnaires.
- Greet the guest: introduce yourself and the other members of the team. Briefly describe what will happen and give the participant the consent form. Describe what is on the consent form so the participant does not have to read it if they do not want to.
- Pre-test questionnaire: includes demographics. This should be a separate piece of paper.
- Explain interface: or any other equipment.
- Tell the scenario: And any other specific instructions, such as tasks to be performed. This should be on a separate piece of paper. The participant should not see the whole test plan.
- Post scenario questionnaire: or interview
- Repeat: steps 4-6 for each scenario
- Post test questionnaire: This should be on a separate piece of paper.
- Thank the participant
- Organize the files
The test should be short, no longer than an hour, probably a half hour. Participants cannot concentrate for more than an hour.
Immediately after the test, organize your notes. Enter your notes into the computer as soon as possible; if you can, immediately after the test session.
Practice and Pilot Studies
Write a script and practice; practice your test among yourselves. I always practice the complete user test before administering it. This gives me an idea of how long the test will take and preliminary results. You get only one shot with a participant; do not lose that opportunity. I have felt sad whenever I have not been able to use data from a test because something trivial was wrong. It was a waste of time for the participant and for me. You probably cannot run pilot studies, but you can perform the test on yourself. You can determine how long the test will take and sometimes verify the correct parameters.
In industry the first couple of participants can be used as the pilot study. But this requires that there is time to change the test before the next scheduled participant. If the test has to be redesigned then the results from the pilot study are not used in the final analysis. If the test is not changed then the results from the pilot study can be used in the final analysis.
We will have a mock test day. Graduate students will work with the whole group going through the test. The application should be on the phone. This is a dress rehearsal.
Analyzing Test Results
General Procedure
Analyzing the test results can consist of:
- Collocate log files for each participant/scenario
- Summarize each scenario across all participants, including plots and trend analysis
- Quantitative measures
- Qualitative observations
- Questionnaire answers
- Make conclusions for each test goal
If there are many log files for each user/scenario, then you will want to collocate the files, meaning you correlate the events in the different files into a single file. If the log files are in csv format with time stamps, then this is easy and not too tedious (a code sketch of this merge follows the list below):
- load each file into a spread sheet
- Convert time stamp to relative time stamp using the synchronization event.
- Merge the files into a single spread sheet
- Sort by relative time stamp and add global line numbers
- Add additional columns and/or rows for summarizing results and make calculations
- Add questionnaire results; this is easy if you used a Likert scale.
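As promised above, here is a rough Java sketch of the collocation steps, as an alternative to doing them by hand in a spreadsheet; the column layout and the synchronization convention are assumptions stated in the comments:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hedged sketch of collocating several csv log files into one time-ordered
// listing for a single participant and scenario. Assumptions (adapt to your
// own format): each data row starts with a millisecond time stamp, the first
// parsable row in each file is the synchronization event, and the file names
// are given on the command line, e.g. java CollocateLogs gui.csv observer.csv
public class CollocateLogs {

    static class Row {
        long relativeTime;   // time stamp relative to the synchronization event
        String source;       // which log file the row came from
        String rest;         // the remaining columns, left as-is

        Row(long relativeTime, String source, String rest) {
            this.relativeTime = relativeTime;
            this.source = source;
            this.rest = rest;
        }
    }

    public static void main(String[] args) throws IOException {
        List<Row> merged = new ArrayList<>();

        for (String fileName : args) {
            Long syncTime = null;
            for (String line : Files.readAllLines(Paths.get(fileName))) {
                String[] parts = line.split(",", 2);
                long timestamp;
                try {
                    timestamp = Long.parseLong(parts[0].trim());
                } catch (NumberFormatException e) {
                    continue;                      // skip header rows and malformed lines
                }
                if (syncTime == null) {
                    syncTime = timestamp;          // first parsable row = synchronization event
                }
                merged.add(new Row(timestamp - syncTime, fileName,
                        parts.length > 1 ? parts[1] : ""));
            }
        }

        // Sort by relative time stamp and add global line numbers.
        merged.sort(Comparator.comparingLong(r -> r.relativeTime));
        int lineNumber = 0;
        for (Row row : merged) {
            lineNumber++;
            System.out.printf("%d,%d,%s,%s%n",
                    lineNumber, row.relativeTime, row.source, row.rest);
        }
    }
}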
Do this for each participant. You will have all the data from one participant and scenario together in one file; later, when you do trend analysis or look for corroborating results, you will be able to find them easily.
Summarizing the test results is now easy. For each scenario, look at the summarized log files and questionnaire data for each participant and make a table or plot. You should begin to draw conclusions, so compare answers from the questionnaires to the quantitative results. You can look for other observations in each participant's collocated file.
Using the summary results you should be able to make conclusions and address the test goals.
Outliers
In user test analysis, it is possible to throw out outliers, or, if the sample is large enough, the outliers will be washed out. In usability tests with a small number of participants, you cannot throw out outliers; they are significant. You should investigate carefully why a participant performed differently. You may be able to uncover a usability concern for a specific user type. At a minimum you should be able to write a story about the outlying participant.
Positive and Negative Findings
Record positive findings because:
- Everyone likes to hear good news.
- If you don’t document the positive aspects of the UI, they could be changed in the future.
Typically, the goals of a usability test include uncovering usability problems, so also report the negative findings (called ‘findings’ for short in industry). Barnum suggests analyzing negative findings by scope and severity. Scope is either global to the whole interface or local to a specific task or form. Severity is expressed by levels:
- Level 1: prevents task completion
- Level 2: creates significant delay and frustration
- Level 3: has minor effect on usability
- Level 4: subtle problem: points to a future enhancement
There are other scales; see page 270 in Barnum's Usability Testing.
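As an illustration only (the findings and their ratings are hypothetical), here is a small Java sketch of recording findings with a scope and a severity level and then tallying them by severity:

import java.util.ArrayList;
import java.util.List;

// Sketch of organizing negative findings by scope and severity so they can be
// counted and reported. Severity levels follow the list above; the sample
// findings are hypothetical.
public class FindingTally {

    enum Scope { GLOBAL, LOCAL }
    enum Severity { LEVEL_1, LEVEL_2, LEVEL_3, LEVEL_4 }

    static class Finding {
        final String description;
        final Scope scope;
        final Severity severity;

        Finding(String description, Scope scope, Severity severity) {
            this.description = description;
            this.scope = scope;
            this.severity = severity;
        }
    }

    public static void main(String[] args) {
        List<Finding> findings = new ArrayList<>();
        findings.add(new Finding("Purchase button is hidden below the fold", Scope.LOCAL, Severity.LEVEL_1));
        findings.add(new Finding("Label wording is inconsistent across forms", Scope.GLOBAL, Severity.LEVEL_3));
        findings.add(new Finding("Search results take a long time to load", Scope.GLOBAL, Severity.LEVEL_2));

        // Count findings per severity level.
        int[] counts = new int[Severity.values().length];
        for (Finding f : findings) {
            counts[f.severity.ordinal()]++;
        }
        for (Severity s : Severity.values()) {
            System.out.println(s + ": " + counts[s.ordinal()] + " finding(s)");
        }
    }
}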
Qualitative Analysis
(From Preece, Rogers and Sharp Interaction Design)
Qualitative analysis can be used to tell a story. It can also support quantitative results with examples.
Qualitative analysis involves grouping the data into collections so that they can be compared or reveal trends. Preece, Rogers and Sharp suggest using a team to perform the qualitative analysis in order to provide many perspectives.
Qualitative analysis can determine categorizations, for example the general types of usability problems. It can also reveal patterns, such as ‘if this happens, then a problem will occur.’ There are many formal methods for analyzing qualitative data:
- Activity Theory
- Content Analysis
- Discourse Analysis
- Conversation Analysis
- Think Aloud Protocol Analysis
I do not know them, and they would be a course in themselves.
Barnum does discuss two approaches to qualitative analysis: top-down categorization and affinity analysis. Top-down categorization means predefining categories (the Usability Yardstick, page 169 in Barnum) and then sorting the usability findings into those categories. Typically the categories are defined before the test, and a goal of the test is to measure the interface using these categories, so a count of the frequency of findings in a category can reveal a general usability problem. Affinity analysis is typically done with several analysts (page 250 in Barnum), iteratively arranging the findings into groups until there is consensus among the analysts. Only after the findings are grouped are the groups labeled.
Triangulation Analysis
Barnum suggests triangulation analysis, which is a comparison of:
- Performance measurements
- Subjective measurements from questionnaires and interviews
- Issue lists from the Top-down categorization of usability findings
Use this analysis to justify the usability problems and their severities.
Report
Barnum gives detailed descriptions and examples of usability reports. Usability test reports tend to be very long. They have a lot to report, and they present the results in several formats because many different people (executives, designers, and experts) will read the report or some part of it. The outline of the report:
- Cover letter
- Executive summary
- Introduction
- Methodology
- Results
- Recommendations/actions
- Appendices
Barnum also suggests previewing the report with a short document, called a roadrunner report (Harrison and Melton). The roadrunner report should:
- Use catchy graphics
- Be brief, one page
- Include charts
- Speak the reader's language
- Include user comments
- Include positive feedback
- Tie the results to the original usability goals
- Emphasize the need to read the final report for full results
- Include a short summary of the implications
For this course
I do not need, nor will I read, a lengthy report, but I also do not want a hyped report like the roadrunner. My outline:
Cover page: name the group and the usability expert (graduate student)
Introduction: Description of UI, test goals and brief description of tests (1 page)
Test Plans: the original you created for testing (approximately 1 page per scenario)
Results: These should be plots and charts with explanations (whatever it takes, ~2 pages)
Conclusions: Usability problems and suggestions for improving the UI (1 page)