As an example, a fictitious doctoral student wanted to derive definitions for varying "levels of saliency" of parental behavior intended to direct the attention of their children with autism. The parental behaviors (i.e., cues) used to direct the children's attention were gestures, words, actions, and the noises toys made when the parent activated them. The doctoral student hypothesized that more salient cues would be more successful in directing children's attention than less salient cues. She could not find an operational or sufficiently elaborated conceptual definition of "salience" in the extant literature to guide reliable judgments about parental attentional cues in a free-play session between parents and their children. However, she was confident that she "knew a salient cue when she saw one" and that other people who were familiar with young children with autism would be able to identify salient cues, too. Because she wanted a content-valid definition of saliency, she thought it important to rely on more than her own judgment of what saliency meant. She needed a panel of "experts."
The first step was to identify the "experts." The definition of expert varies with the content area (Hoffman et al., 1995). However, it is probably useful to select people with both explicit knowledge (those who teach the skill to others) and implicit knowledge (those who practice the art or skill being studied) of the context-dependent behavior or generalized characteristic of interest. The doctoral student selected faculty members responsible for teaching college students who were training to be early childhood educators of young children with autism. These faculty members had also spent at least 100 hr interacting with children with autism. Four faculty members were selected to allow evidence of convergence without relying on a sole expert.
Although there are many types of materials one can select to elicit information, one of the most efficient ways is to identify "test" or "tough" cases (Hoffman et al., 1995). However, such materials tend to elicit an uncomfortable feeling of being evaluated (Hoffman et al., 1995). In contrast, "familiar" materials tend to elicit rapid expert judgments and do not elicit anxiety from experts (Hoffman et al., 1995). Therefore, it has been suggested that a combination of familiar and test materials may be useful in eliciting cooperation and information (Hoffman et al., 1995). In our example, the doctoral student edited twelve 3- to 10-s video clips of parents directing their children's attention. The children's responses to the parental behavior were carefully removed. The doctoral student located and edited two clear (i.e., familiar) examples of each of the "high," "medium," and "low" saliency categories, based on her own intuitive definitions. She then selected three more "test" examples that she judged to fall somewhere between high and medium saliency. Finally, she selected three more test examples that she considered to fall somewhere between medium and low saliency.
Asking experts to sort materials into categories is one informative method for eliciting information (Hoffman et al., 1995). In our example, the doctoral student asked each expert, working independently, to sort each video clip into one of three categories that varied by saliency level. No time limit was imposed, and experts could change their minds. The sorting continued until each expert said that he or she was finished.
The relative value of individual versus group interviews is unknown. However, empirical studies indicate that when experts meet as a group, they tend to rapidly find and argue over the small number of points on which they disagree (Hoffman et al., 1995). Because it is consensus that we seek, we prefer that experts be interviewed independently.
The literature on eliciting expert knowledge indicates that structured interviews are more efficient than unstructured ones (Hoffman et al., 1995). In our example, the doctoral student examined the sorted materials (a) to identify how the test cases were sorted and (b) to identify any familiar cases that were classified in an unexpected category. Generic probes were posed about these items (Hoffman et al., 1995). The questions and answers were audio recorded for later analysis. For example, assume clip 2 is a test case for classification as either high or medium and clip 4 is a familiar case for high saliency. The doctoral student might ask, "Why did you classify clip number 2 as highly salient?" to elicit rules that might define high saliency. Note that both the clip and the category were indicated in the interviewer's question; this aids later analysis.
Further imagine that clip 4 was unexpectedly classified as medium saliency. The doctoral student might say, "I noticed you classified clip 4 as medium and clip 2 as high," to record on the tape how the clips were classified. Then the doctoral student might ask the expert, "What is different about clip 4 (i.e., the familiar clip that was expected to be classified as high but was classified as medium) compared with clip 2 (i.e., the test clip that was categorized as high)?" The latter question is meant to elicit particularly useful rules, including "conditional rules" (i.e., those that are used under some conditions but not others). If such conditional rules are uncovered, then the doctoral student might test her understanding of the rule's condition by asking, "If (the condition) were not present, would (the rule) apply?" This type of structured interviewing continues until the interviewer feels she understands the experts' rationale for their sorting.
After interviewing all experts and transcribing the interviews (or marking statements in wave files via computer software such as NVivo; Bazeley & Richards, 2000), commonalities among the experts' responses are analyzed. This is sometimes called "theme analysis" or "category analysis" (Bazeley, 2007). In our example, the doctoral student identified (a) the number of sensory modalities and (b) the number of behaviors as two themes that at least three of the four experts used to justify their choices. By "behaviors," we mean actions such as "gestures to," "moves," "talks about," and "operates" referent objects.
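For readers who tally such themes by script rather than by hand, the following is a minimal sketch of the "at least three of four experts" criterion. The expert labels, theme names, and threshold variable are invented for illustration; none comes from the study itself.

```python
# A minimal sketch (with invented data) of tallying which themes at least
# three of four experts invoked to justify their sorting.
from collections import Counter

# Hypothetical data: the set of themes each expert mentioned in interviews.
expert_themes = {
    "expert_1": {"number_of_modalities", "number_of_behaviors", "loudness"},
    "expert_2": {"number_of_modalities", "number_of_behaviors"},
    "expert_3": {"number_of_behaviors", "novelty"},
    "expert_4": {"number_of_modalities", "number_of_behaviors"},
}

MIN_EXPERTS = 3  # retain only themes invoked by at least three experts

counts = Counter(theme for themes in expert_themes.values() for theme in themes)
retained = sorted(t for t, n in counts.items() if n >= MIN_EXPERTS)
print(retained)  # ['number_of_behaviors', 'number_of_modalities']
```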
To "test" the accuracy with which the abstracted themes classified the 12 video clips, clips were assigned 3 through 1 to correspond with high through low classification, respectively. Average numerical scores were derived across experts to estimate "expert" classification of each of the 12 video clips. Using the two themes of number of actions and number of sensory modalities, and the average expert sorting of the 12 video clips, the doctoral student derived the following operational definitions for the three levels of saliency. These definitions classified the 12 video clips in accordance with the average expert sorting. High saliency was defined as using at least three behaviors to draw the child's attention to an object that appealed to at least two sensory modalities. Medium saliency was defined as using two behaviors to draw the child's attention to an object that appealed to two sensory modalities. Low saliency was defined as one behavior to draw the child's attention to an object that appealed to at most two sensory modalities. This complex definition classified all 12 clips reliably by 2 observers. Ultimately, the test of such a process is whether a study using a systematic observational measurement system relying on its derived definitions results in confirming hypotheses regarding saliency and the child's correct responses to parents' attentional directives.
Defining Segmenting Rules
Once it is clear what the lowest level of distinction will be, the onset (and, if needed, the offset) of instances of categories may need to be defined. If the number or duration of events will be the metric of interest, then segmenting rules are necessary. Interval coding and time-sampling methods of behavior sampling do not require segmenting rules, for reasons indicated in the next chapter. Briefly, interval coding means that one indicates whether at least one instance of the key behavior has occurred within a fixed interval of time (e.g., 10 s).
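To make the contrast with segmenting concrete, here is a minimal sketch of 10-s interval coding, assuming behavior onsets are available as times in seconds; the function and example values are illustrative only.

```python
# A minimal sketch of interval coding: each fixed interval is scored 1 if at
# least one key behavior began within it, and 0 otherwise.
def interval_code(onsets: list[float], session_length: float,
                  interval: float = 10.0) -> list[int]:
    n_intervals = int(session_length // interval)
    coded = [0] * n_intervals
    for t in onsets:
        idx = int(t // interval)
        if idx < n_intervals:
            coded[idx] = 1  # at least one instance occurred in this interval
    return coded

# Example: onsets at 2 s, 4 s, and 27 s in a 40-s session.
print(interval_code([2.0, 4.0, 27.0], session_length=40.0))
# [1, 0, 1, 0] -- the metric is intervals with behavior, not event counts
```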
One can see the need for segmenting rules when events occur close in time. For example, assume we are measuring the number of communication acts. The child reaches for a ball and looks at an adult. One second later, the child says "ah" while looking at the ball and then looks at the adult. Another second later, the child says "ball" and looks at the adult. Segmenting rules are needed to determine whether this cluster of child behaviors is one communication act or three. In-seat behavior is an example of a behavior for which duration is important. In addition to precise definitions of onset, definitions of offset are necessary to differentiate instances of in-seat behavior from near nonexamples of offset. For example, is momentarily lifting both buttocks from the chair seat sufficient to end in-seat behavior, or must contact cease for more than 1 s?
Segmenting rules almost always involve a certain amount of arbitrariness. For example, we may decide to treat the three potential clusters of behavior in our example above as one act because the onset of the first behavior occurred within 3 s of the onset of the last behavior. The use of a temporal criterion, the use of onsets rather than offsets as the boundaries for that criterion, and the decision to call the reach and gaze, the vocalization and gaze, and the word and gaze all "parts" of the same act are questionable but defensible choices. Our empirical and theoretical knowledge about most social phenomena of interest is almost never sufficiently specific to guide this level of decision making. But these decisions must be made; hence, there is some degree of arbitrariness. Usually, the best we can do is to make sure that our decisions are defensible and consistently applied (one way to enforce consistency is sketched below). If they are applied consistently, the potential loss of content validity will be worth the almost certain gain in reliability due to reduced interobserver disagreement on segmenting.
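Here is a minimal sketch of one defensible implementation of the 3-s onset-to-onset criterion described above, assuming behavior onsets recorded in seconds; the grouping logic and example values are ours, not a standard from the literature.

```python
# A minimal sketch of segmenting: behaviors count as one communication act
# when every onset falls within 3 s of the act's first onset.
def segment_acts(onsets: list[float], window: float = 3.0) -> list[list[float]]:
    acts: list[list[float]] = []
    for t in sorted(onsets):
        if acts and t - acts[-1][0] <= window:
            acts[-1].append(t)  # within 3 s of the current act's first onset
        else:
            acts.append([t])    # start a new communication act
    return acts

# The reach-and-gaze, vocalization, and word in the example, 1 s apart:
print(len(segment_acts([0.0, 1.0, 2.0])))  # 1 act, not 3
```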
Defining When to Start and Stop Coding
Having clearly defined the lowest level coding distinction, and with guidance about how to segment potential events, the next step is to decide when observers should begin and end their coding of an observation session. This section addresses two types of start-and-stop signals. First, there are signals at the beginning and end of the observation session. Because behavior varies across observational sessions, it is useful to make the signals for beginning and ending coding explicit.
One ill-advised way to do this is to ask the adult administering the procedure to say when to start and stop coding. Alternatively, a start signal might be when the clock of the media file turns from 0 to 1 s. This requires that the administrator or cameraperson be consistent in giving the verbal signal or in starting the clock relative to the onset of the observation procedure. The problem with these approaches is that they shift the responsibility for providing the signal consistently to the administrator or cameraperson, and investigators rarely check the consistency of the timing of such signals.
Instead, it is better to use a behavior that the participant or the examiner performs in the course of conducting the procedure to mark the beginning and end of coding. For example, one might indicate that coding begins when the adult first speaks about the objects in the session or first speaks to the child. Similarly, a stop signal might be the examiner removing all toys from the table. Whatever the signals, they need to occur in the recording of every session and at times that do not exclude many codeable acts.
Another use of start-and-stop signals is to define the duration of the codeable sections of the observation session. Such start-and-stop coding rules are provided to reduce measurement error due to unexpected events, or to events known to inhibit or interfere with the occurrence of key behaviors, in ways that do not reflect the phenomenon of interest. For example, the participant may move off-screen when the session is being recorded for later coding.
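As an illustration, the following minimal sketch applies such rules by retaining only events that fall within codeable windows (e.g., excluding an off-screen period); the window boundaries and event times are hypothetical.

```python
# A minimal sketch of trimming coded events to the codeable sections of a
# session defined by start-and-stop signals.
def codeable_events(onsets, windows):
    """Keep onsets that fall within any (start, stop) codeable window."""
    return [t for t in onsets
            if any(start <= t <= stop for start, stop in windows)]

# Hypothetical session: coding starts at 5 s, the child is off-screen from
# 40 s to 55 s, and coding stops at 90 s.
windows = [(5.0, 40.0), (55.0, 90.0)]
print(codeable_events([2.0, 12.0, 47.0, 60.0], windows))  # [12.0, 60.0]
print(sum(stop - start for start, stop in windows))       # 70.0 s codeable
```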