Compiling Legal Hypertexts

Martin Choquette, Daniel Poulin and Paul Bratley

1. Introduction

For fifteen years or so, from the middle of the 70s to the beginning of the 90s, research on systems for legal information retrieval and management made relatively little progress, at least if one looks only at the results that filtered through to users. This period of dormancy was broken in the last four or five years by the discovery, or rediscovery, of hypertext, and by the rapid increase in accessibility of standard electronic environments such as the Internet. Although from some points of view these technologies added little that was radically new, they breathed new life into the fields of information dissemination and retrieval. Their application in the field of legal information is particularly interesting: the sheer volume of the sources in this area challenges all the existing techniques for indexing, finding and presenting information. Furthermore, these sources are constantly being updated: the courts regularly add new decisions to the existing jurisprudence, while the legislature produces new laws and modifies those that already exist. Updating, finding and consulting the relevant parts of this living, growing mass of information are crucial problems.

Recent information retrieval approaches have tried to go beyond the limits imposed by the Boolean or the vector model. The Flexicon system [Gelbart 91] uses a concept tree to augment a vector retrieval mechanism and make queries more precise. However, the work involved in constructing the necessary concept hierarchies makes the system difficult to apply on a large scale. Quite recently there have been efforts to commercialise systems based on Bayesian networks [Turtle 91]. These undoubtedly help in retrieving information, although they do not offer the kind of hypertext facilities that are natural in the legal field. The appearance on the Internet of new distribution protocols such as HTTP has led to the creation of a number of servers offering hypertext interfaces to distributed legal sources, but these do not have sophisticated retrieval mechanisms [Berners-Lee 94]. Nevertheless, we believe that such servers offer an excellent opportunity to improve current techniques of disseminating legal documents. Were the hypertext environment they provide complemented by adequate construction procedures and retrieval mechanisms, the whole field would be transformed.

According to Shneiderman [Shneiderman 89], the golden rules of hypertext can be summed up as follows: hypertext is useful "[if] there is a large body of information organized into numerous fragments, if the fragments relate to each other, and if the user needs only a small fraction at any time". Even a superficial analysis will confirm that these conditions hold in the legal field. However, before we can thoroughly exploit the advantages offered by hypertext, two prior problems must be solved. The first is to improve the available methods of construction of large hypertext document archives; the second is to add sophisticated retrieval techniques to these hypertexts. In this paper we consider some solutions to the first problem. The following section is devoted to a survey of some of the approaches that have been proposed. Our own research on the automatic construction of hypertexts is presented in Section 3. Section 4 extends our ideas and presents a way to construct evolving hypertexts automatically. Section 5 sketches the next steps we hope to take.

2. Related work on automatic hypertext construction

To be of practical value on large corpora, the hypertext construction approach must be automated. For instance, in the legal domain, the conversion of a one megabyte document such as the Civil Code of Québec involves the creation of thousands of components. Such a repetitive task, when done manually, is not only tedious but error-prone as well. This is only exacerbated when one considers important--and frequently revised--collections of standard legal texts. The construction of parsers for such collections seems to be the only way to ensure the wider use of hypertext.

As far as automatic construction of hypertext is concerned, several experiments have been carried out with encouraging results, although they often concern rather small sets of documents. Furuta et al. [Furuta 89a, Furuta 89b] describe four projects involving the automatic construction of hypertexts starting with relatively small collections of documents and resulting in a Hyperties database. Some of the document collections were well-structured (for example, course descriptions), while others had to be edited manually. These researchers use an approach involving two converters: a parser guided by a deterministic finite automaton transforms the source document into a "linearized hypertext", and then a second program, independent of the source document, converts this linearized hypertext into a set of files forming the Hyperties database. Rada has studied a similar problem, namely the conversion of an existing textbook into hypertext [Rada 92]. He uses semantic nets to generate "first-order" links that are explicitly marked in the source text, and "second-order" ones that do not depend on mark-up. Other workers have carried out projects on a much more ambitious scale. Raymond and Tompa studied the automatic construction of a hypertext based on the Oxford English Dictionary [Raymond 88]. In a more specifically legal context Wilson [Wilson 90] and Greenleaf [Greenleaf 91] mention the use of automated conversion procedures for legal texts.

Automatic construction of hypertext is thus feasible. Still, the experiments mentioned above all have something of an ad hoc character. If automatic construction techniques are ever to be applied on a large number of document classes, they will have to be formalised. The following section is a step toward formalisation. It presents a grammar-based approach applicable to well-structured documents that are either logically or physically marked up.

3. Using grammars to convert legal texts into hypertext

The idea underlying the chosen approach is that it should be general and formal enough for tools to be developed that make it easier to write grammars (and hence converters) for new classes of documents. The approach has been applied successfully to statutes, decisions and scholarly legal texts [Choquette 94, Marcil 94].

To simplify the discussion, we assume here that conversion is done on a single linear text of arbitrary size. The result is an exploitable hypertext description defining nodes and links. A table of contents is created along with hierarchical and sequencing ("Next" and "Previous") links. When combined with information retrieval functions acting as a sophisticated index, these provide essentially the same kind of navigational aids found in books. But the types of link produced automatically are certainly not restricted to the two types mentioned above. Other possible types include cross-reference and glossary links as well as links for footnotes and bibliographic entries.

The resulting hypertext description takes various forms depending on the target system. The following discussion does not focus on this feature, however, since it has proved fairly easy to modify existing code to suit the needs of a particular system provided it has some kind of "import" function. In fact, the procedure described below was used to convert texts for three different hypertext environments in the course of its evolution. The first two were custom-built systems developed with Smalltalk-80 and HyperCard. The third is the WWW environment.

Leaving aside retrieval aspects, the complete conversion process can be broken up into four steps: text preparation, pattern collection, construction of tables of link end points, and node and link creation.

3.1 Text preparation

Text preparation is a generic term for the activities that yield an accurate and structurally consistent text file. Here, a "structurally consistent text" means one in which analogous parts are treated, to a certain extent, in the same manner. When using a highly expressive text representation format such as LaTeX or RTF, part of the preparation job also consists of translating characters, for instance from escape sequences to high ASCII codes, and removing useless formatting codes or blanks that would only make the specification of a grammar more obscure. This can be done automatically using pattern matching utilities or the search and replace commands provided by a word processor.
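
By way of illustration, the following minimal Python sketch cleans up a hypothetical RTF export of a statute; the escape sequences and control words handled, like the function name itself, are our own assumptions and would have to be adjusted to the format actually in hand.

    # A sketch of the text preparation step for an RTF-like export (illustrative only).
    import re

    def prepare(raw: str) -> str:
        # Translate a few escape sequences into the characters they stand for.
        for escape, char in {r"\'e9": "é", r"\'e8": "è", r"\'c9": "É"}.items():
            raw = raw.replace(escape, char)
        # Remove formatting codes that would only obscure the grammar.
        raw = re.sub(r"\\(?:pard|plain|f\d+|fs\d+)\b", "", raw)
        # Collapse superfluous blanks and blank lines.
        raw = re.sub(r"[ \t]+", " ", raw)
        raw = re.sub(r"\n{3,}", "\n\n", raw)
        return raw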

3.2 Pattern collection

Pattern collection consists of finding and collecting the various patterns signaling the nature of the text's components. This is an essential preparatory step to designing a grammar. The task is greatly simplified when the text is logically marked up. (Logical mark-up specifies the way a text is structured into parts, chapters, and so on.) It may even be possible to omit this step if the mark-up has been done exhaustively. However, most of the texts available these days are physically marked up. (Physical mark-up describes the appearance of a text: the fonts used, page size and spacing, and so on.) When faced with such texts, one must pay attention to typographical clues such as line justification, font styles, and so on, and to combinations of these. A centred and italicised line may denote, for instance, a division title. Other useful clues are embedded in the text itself: these may include uppercase letters, punctuation, numbering schemes, or special keywords. For example, a part may begin with the word "PART" followed by a Roman numeral.
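
The patterns collected at this stage can often be written down directly as regular expressions. The following Python fragment shows the kind of list that results; the particular patterns, and the <ci> tag assumed to survive from text preparation, are illustrative rather than drawn from any real collection.

    import re

    PATTERNS = {
        # A part begins with the keyword "PART" followed by a Roman numeral.
        "part": re.compile(r"^PART\s+([IVXLC]+)\b"),
        # A section begins with its number, e.g. "26." at the start of a line.
        "section": re.compile(r"^(\d+)\.\s"),
        # A lettered paragraph such as "(a)" at the start of a line.
        "paragraph": re.compile(r"^\(([a-z])\)\s"),
        # With physical mark-up, a centred italic line may denote a division title;
        # here we assume preparation left it wrapped in a <ci>...</ci> tag.
        "division title": re.compile(r"^<ci>(.+)</ci>$"),
    }

    def classify(line: str):
        for component_type, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                return component_type, match.group(1)
        return None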

3.3 Constructing tables of end points

Using the list of patterns assembled earlier and looking at how the components relate to each other, it should be fairly easy at this point to infer the main structure of a document. Of course, if the text is logically marked up and a definition of its structure is already at hand, there is no need to infer it. No matter how this structure is obtained, it serves as a frame for grammar composition, which is at the heart of the conversion procedure. In fact, a grammar is the specification of a parser, the latter being obtained by compiling the former. The first grammar to be written essentially describes the relative positions of the components that can be the end points of links, whether these are structural (chapters, parts, divisions,...) or free (non-structural elements such as cross-references and definitions). During parsing, these components will be recognised by means of their patterns. The grammar must therefore specify the patterns associated with a component as well as any actions to be performed once it has been recognised.
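
One convenient way to picture such a grammar rule is as a pairing of a pattern with an action, as in the toy Python rendering below; this is only a sketch of the idea, not the grammar formalism we actually use.

    import re
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Rule:
        component_type: str                          # e.g. "part", "section", "definition"
        pattern: re.Pattern                          # how the component is recognised
        action: Callable[[re.Match, dict], None]     # what to do once it is recognised

    def note_part(match: re.Match, state: dict) -> None:
        # Example action: record the parser's new position in the hierarchy.
        state["current_part"] = match.group(1)

    RULES = [Rule("part", re.compile(r"^PART\s+([IVXLC]+)\b"), note_part)]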

The first parse of a text is primarily exploratory, and indeed we refer to the first grammar as the "reconnaissance grammar". This parse must, in particular, provide all that is needed to build an intermediate table giving the types of the various components, the kernels of their names, their locations in the formal hierarchical tree, their associated node identifiers and their scopes. This table will later be rearranged into smaller tables (the end-point tables) that, as we shall see, serve to determine the proper end point of a link when given its starting point.

The fields forming a table entry merit a short description. The type of a component indicates whether it is a structural component, a definition, a footnote or any other type of component requiring specific treatment. This field is needed to produce the end-point tables. There is one such table for each possible treatment of a reference.

The kernel of a name is the characteristic part of a name used in the text to refer to a component. For example, both strings "section 26" and "s. 26" refer to section 26; the kernel of the name of this component is "26". For a second example, consider definitions: the kernel of the name of a definition is the expression actually being defined. A large number of these kernels will later be incorporated into the specification of a second grammar.

The organisation of the document's structural elements can be represented as a tree, which we call the formal hierarchical tree because it may not correspond to the original organisation (for example, when a new document is made up of many existing documents). The path leading to a vertex of the tree identifies a component regardless of the numbering used in the original text.

Although the segmentation of a text into nodes is not done at this stage, it must already be kept in mind, since node identifiers must be assigned to components. Once a segmentation scheme is adopted, it remains to embed in the grammar simple mechanisms for assigning a unique identifier to each node.

Finally, the scope of a component is the list of subtrees in which its content is valid. This is especially useful when dealing with multiply-defined words, to indicate which definition applies to a particular part of the text. In statutes, for instance, an expression can bear different meanings according to its location.
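
Gathering these five fields, an entry of the intermediate table might be sketched as the following Python record; the field names, and the representation of tree locations as tuples of name kernels, are choices made purely for the purpose of illustration.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Entry:
        component_type: str               # "section", "definition", "footnote", ...
        kernel: str                       # "26" for section 26, or the expression defined
        path: Tuple[str, ...]             # location in the formal hierarchical tree
        node_id: str                      # identifier of the node holding the component
        scope: List[Tuple[str, ...]] = field(default_factory=list)  # subtree roots where it applies

    # Example: a definition of "claimant" in subsection 2(1), valid for the whole text
    # (the empty path denotes the root of the formal hierarchical tree).
    example = Entry("definition", "claimant", ("I", "2", "(1)"), "node-0007", [()])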

On recognising a component, actions must be performed to update the intermediate table. These range from discarding a useless character to more sophisticated functions such as keeping track of the parser's position in the formal hierarchical tree or, as may be expected, adding appropriate information to the entries in the tables. We discovered while working on statutes, however, that some of these entries may be easier to fill in by hand afterwards. For instance, glossary entries have their scope specified in natural language. Since definitions appear in groups that occur infrequently in text, it often turns out to be less costly to assign their scope manually than to devise a way to do it automatically.

After parsing, the intermediate table is rearranged by a program into end-point tables, in which components of the same type are brought together. These tables have different forms according to the type of component involved and the needs of actions defined at the next step. For instance, the table for structural elements holds information on components' children and siblings while other tables do not.
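
A minimal sketch of this rearrangement, again in Python and reusing the illustrative Entry record above, groups entries by type and keys each table by the kernel of a name; the extra fields held by the table for structural elements (children and siblings) are elided.

    from collections import defaultdict

    def build_endpoint_tables(intermediate):
        tables = defaultdict(dict)
        for entry in intermediate:
            # A kernel may be shared by several components (e.g. a word defined twice),
            # so each kernel maps to a list of candidate end points.
            tables[entry.component_type].setdefault(entry.kernel, []).append(entry)
        return tables

    # tables["definition"]["claimant"] -> list of entries, each with its own scope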

3.4 Node and link creation

Having found all possible link end points and stored the required information about them in tables, we next find the link starting points and use these tables to complete the links accordingly. This is done by a second parse. Writing the second grammar (the production grammar) consists of augmenting the reconnaissance grammar by grafting onto it descriptions of the link starting points. Furthermore, new actions have to be associated with parsed components to build the table of contents, create the nodes of the hypertext and complete the links.

References to definitions and bibliographic entries require special attention since their forms vary considerably. Like cross-references to structural elements, most of them are composed of a keyword indicating the type of the components referred to and a list of the kernels of the names of these components. References to parts of other texts have the same form, but are usually either preceded or followed by an "external reference string". An example of such an external reference is "paragraphs 241(10)(a) and (b) of the Income Tax Act".[1]
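
A rough Python pattern for such reference expressions is shown below, recognising the keyword, the list of kernels and an optional external reference string; it is illustrative and far from covering all the forms found in real statutes.

    import re

    REFERENCE = re.compile(
        r"(?P<keyword>sections?|subsections?|paragraphs?)\s+"
        r"(?P<kernels>[\d()a-z.]+(?:\s*(?:,|and|or)\s*[\d()a-z.]+)*)"
        r"(?P<external>\s+of\s+the\s+[A-Z][\w ]+Act)?"
    )

    m = REFERENCE.search("paragraphs 241(10)(a) and (b) of the Income Tax Act")
    # m.group("keyword")  -> "paragraphs"
    # m.group("kernels")  -> "241(10)(a) and (b)"
    # m.group("external") -> " of the Income Tax Act"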

Actions to be performed on recognising a component vary greatly with the component's type. However, these actions are quite standard and do not vary much across classes of documents. Thus, once they are defined for a particular type of component, they can be reused for the conversion of many classes of documents.

Whenever a hierarchical component is recognised, its title must be added to the table of contents and a link from this entry to the corresponding node must be created. If the recognised component is made up of hierarchical subcomponents, its entry in the table of contents should somehow allow one to see the titles of these subcomponents. On the other hand, if the recognised component coincides with a node, node handling procedures must be called. These procedures are primarily responsible for updating and closing previously created nodes if this has not already been done, allocating expandable memory space for the new node and initialising it. In particular, closing and initialisation procedures must set the value of node attributes and, if necessary, create coordination links (for example, links to the table of contents, "Next" and "Previous" links, and so on).

The remaining links are created when the parser runs into a text passage corresponding to the starting point of a link. Provided the grammar gives the address of glossary and bibliographic reference expressions in a common end-point table (this can be done using actions), a single set of procedures can handle both kinds of reference. This address gives direct access to a list of the nodes defining the expression under consideration with their corresponding scopes. Since the scopes are given by a list of subtree roots in the formal hierarchical tree, determining the proper node to link to is only a matter of picking the closest ancestor of the node containing the starting point. The text segment in the destination node, to which the browser should jump when the starting point is activated, is also specified at this stage.
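
In Python, and still using the illustrative Entry record, this resolution step might look as follows: among the candidate definitions of an expression, keep the one whose scope root is the deepest ancestor of the node containing the starting point.

    def is_ancestor(root, path):
        # True if `root` is a prefix of `path` in the formal hierarchical tree.
        return path[:len(root)] == root

    def resolve(candidates, start_path):
        best, best_depth = None, -1
        for entry in candidates:              # entries from the common end-point table
            for root in entry.scope:          # each scope is a list of subtree roots
                if is_ancestor(root, start_path) and len(root) > best_depth:
                    best, best_depth = entry, len(root)
        return best                           # None if the expression is out of scope here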

Occasionally, links to footnotes can be generated in the same manner as for glossary and bibliographic references. However in most text representation formats, the text of a footnote is inserted exactly where the reference to this note must appear. It is then the responsibility of the output system (or, in this case, of the conversion procedure) to treat footnotes properly. Accordingly, footnote handling procedures must insert a footnote sign in the text, move the text of the footnote to wherever the designer decided (for example at the end of the current node, in a new node, or in a footnote node) and link the former to the latter.

The treatment of cross-references is more complex, especially in statutes, where the destination of a reference is context-dependent. This implies the use of "context variables" and a great deal of vigilance to keep them up to date. The examples below illustrate this. Unless explicitly stated, the context of a reference is its location in the hierarchy. For instance, the passage "For the purposes of subsection (4)..." in subsection 6(5) of the Unemployment Insurance Act refers to subsection (4) of the current section, that is, of section 6. An example of an explicitly stated context is "under paragraph 27(1)(c), (d) or (e)..."[2], where the reference string "(d)" should lead to paragraph (d) of subsection 27(1). To state this algorithmically: when a new cross-reference is encountered, the current context must be set to the location of the reference, and then actions must arrange for the current context to be overridden by any new explicitly stated one. Furthermore, for every new context thus computed, a link from the string that changed the context to the node it points to should be created using the proper end-point table.
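
A much simplified Python sketch of this context handling follows; it assumes that hierarchical locations are represented as tuples of name kernels and that the kernel being resolved names a sibling at the reference's own level, which glosses over the lookup in the structural end-point table.

    def resolve_cross_reference(kernel, reference_path, explicit_context=None):
        # The context defaults to the location of the reference itself;
        # an explicitly stated component (e.g. "27(1)") overrides it.
        context = explicit_context if explicit_context is not None else reference_path[:-1]
        return context + (kernel,)

    # "For the purposes of subsection (4)..." appearing in subsection 6(5):
    resolve_cross_reference("(4)", ("6", "(5)"))                    # -> ("6", "(4)")

    # "under paragraph 27(1)(c), (d) or (e)": the string "27(1)" sets the context,
    # so "(d)" resolves to paragraph (d) of subsection 27(1).
    resolve_cross_reference("(d)", ("6", "(5)"), ("27", "(1)"))     # -> ("27", "(1)", "(d)")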

This works properly with cross-references that are not external, which, in the Unemployment Insurance Act, accounts for eighty percent of the cross-references. However, it cannot be applied unaltered to external cross-references, for then erroneous and dangling links would result. One simple solution might be to skip external cross-references altogether. Judging from the first example of an external reference given earlier, mechanisms to do this seem easy to implement. However, passages like "Where the Minister imposes a requirement under subsection (10), subsections 224(2), (3), (4), (5) and (6) and subsection 227(10) of the Income Tax Act are applicable to..."[3], where the first reference should lead to subsection 57(10) and the others should be skipped, show that it is hard to decide by considering syntax alone whether a reference is external or not. For these admittedly unusual cases, semi-automated techniques might be envisaged. Section 4 of this paper presents an alternative way to manage external references.

Depending on the type of link generated, actions may also have to make provision for the creation of inverse links. Such links are useful, for instance, to show judgments citing a previous one. A simple strategy for handling inverse links is to gather them, together with any useful information, in a temporary memory space and to process them after text parsing.

At the end of this crucial step, a comprehensive hypertext description of the original text is available.

4. Text modules, compilers and link editors

The previous section explained how we look for structural links, finding first their possible destinations and then their starting points. The implicit assumption was that "a hypertext" is a complete whole, so that all (or nearly all) the required destinations are in fact present, and setting up the hypertext is essentially one monolithic operation, although for practical reasons it may be split into several stages. For a static text that will be intensively used by an uncritical group of users, this is no doubt the correct strategy. However, for a text that is subject to frequent updating, that is only infrequently consulted, or whose users may not all have the same needs, it is not evident that constructing a complete set of links once and for all is the best approach.

Suppose, for example, that the hypertext is a book of regulations that is frequently altered and brought up to date. The hypertext links must all be rebuilt every time the text changes. Even if the updated text is available in some easily accessible electronic form, this may present a steady drain on the support group's resources. Furthermore, if the text changes so frequently that each newly-revised hypertext receives little use before it joins its predecessors in the trash, most of these resources may well be wasted. Various measures to improve this situation can be imagined: we might, for example, confine ourselves to detecting link starting and end points. If a starting point is activated at browsing time, we can compute its destination on the fly, given an appropriate data structure.

The idea then is to use a hypertext "compiler" that scans the source text to detect link starting points and candidate end points, but does nothing further about them. We choose the word "compiler" by analogy with a compiler for a programming language. A Pascal compiler, for instance, detects calls on the execution-time system or on the user's own subroutines, but merely notes their existence. Its output is an object module that includes a list of external references, saying where the reference is made and what the reference is to. The work of satisfying these references is left to a subsequent link-editing pass. Similarly, it seems advantageous to use a hypertext compiler to scan a text to be linked into a hypertext base, and to produce an "object module" where the link starting points and their intended destinations have been noted, but the links are not completed.
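
As a sketch, such an object module might carry just two collections: the destinations the text exports and the references it leaves pending. The Python rendering below is ours; the field names are assumptions.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class PendingLink:
        start_node: str        # node containing the link's starting point
        start_offset: int      # position of the reference string within that node
        target_name: str       # e.g. "Income Tax Act, s. 227(10)"

    @dataclass
    class ObjectModule:
        name: str
        exports: Dict[str, str] = field(default_factory=dict)      # public name -> node id
        imports: List[PendingLink] = field(default_factory=list)   # links awaiting destinations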

Extending the analogy, a program object module usually includes not only a list of the external references it makes, but also a list of those references it is able to satisfy: a library module might announce, for example, that it includes subroutines to calculate square roots. The link editor's job is precisely to match requests in some modules with suitable subroutines in others. In the same way, our hypertext compiler should produce lists both of link starting points that need destinations, and of possible destinations that may be useful to other modules. Thus, when we compile just one section of a statute, say, the resulting module will include both unsatisfied links to articles in other sections, and also the list of cross-references that it is itself able to satisfy. The hypertext "link editor" then resolves links from one module to another.
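
The link editor itself then reduces to a matching pass over the modules in the base, as in this sketch (which simply leaves unmatched references dangling rather than, say, reporting them):

    def link_edit(modules):
        # Collect every destination exported by any module in the hypertext base.
        exports = {}
        for module in modules:
            for public_name, node_id in module.exports.items():
                exports[public_name] = (module.name, node_id)

        resolved, unresolved = [], []
        for module in modules:
            for pending in module.imports:
                destination = exports.get(pending.target_name)
                if destination is not None:
                    resolved.append((module.name, pending, destination))
                else:
                    unresolved.append((module.name, pending))
        return resolved, unresolved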

If texts can be compiled and then later linked into an existing hypertext base, it is no longer necessary to think of setting up a hypertext as essentially a single operation. Additional texts can be compiled and linked into the existing structure at any time. If only part of a complex structure changes, only the corresponding modules need be recompiled. Furthermore the same compiled module can be integrated into several different structures with little extra work. Imagine, for example, a group of scholars working on the same legal texts, but each with his own set of annotations, reflecting different interpretations or experience from his own practice. A new text, once compiled, could be linked into any of these sets, so the same link starting point could lead different users to different destinations. Conversely, a good glossary, designed perhaps for students, could be available on (or, with appropriate addressing, be shared by) several sites, so that users could link into it as routinely as programmers do into standard subroutine libraries.

It may be objected that such a scheme places an impossible burden on the hypertext compiler. Recognising link starting points of some common types is of course no harder than before. Text grammars and related techniques can still be used to identify cross-references to other laws or sections, references to precedents, and so on. Implementing a scheme for keeping external references in an explicit list available to a subsequent link-editing pass is not difficult. Similarly automatic analysis of the structure of a text can produce an explicit list of possible destinations to be used by other modules--essentially a table of contents--almost exactly as before. More difficult problems are posed by the notion of glossaries and similar reference tools, where it is not obvious how a text compiler can know which words figure in the glossary, and are therefore possible starting points for links. We take it for granted, of course, that in the glossary module itself the possible destinations (the targets of incoming links) are explicitly marked in some way. Unless a better strategy for glossary words can be devised, it looks therefore as if links to definitions must be restricted to definitions that already exist when the text is compiled. In many circumstances, this is not a severe constraint.

5. Transmutable links

Although the approach suggested in the previous section is viable, it is not entirely satisfactory. In a totally modular system, it may well be that at the moment when a user compiles a text to be added into the hypertext base, the corresponding glossary has not yet been installed, or perhaps not even written. Should we require the user to guess what it will eventually contain and mark his text accordingly? If not, how can we discover the possible starting points for links? Automatic recompilation of text modules whenever a new glossary is added to the system is possible but inelegant. What shall we do?

Space does not allow us to detail our answer to this question. The basic idea is, however, simple: instead of marking link starting points in the by-now-traditional way using colour or underlining, we propose to leave them (or at least defined words) unmarked. The user will be invited to click wherever he chooses in a text. If a link is there, control will follow it; if not, nothing will happen. This leaves us free to mix precompiled links with others that are completed at execution time in whatever way suits us best. Since links are not explicitly marked, nothing need be altered when a new module is added to the hypertext: only the execution-time system needs to know that something has changed. As the system evolves, the same link may not always lead to the same place. Such a system with transmutable links will be the subject of a subsequent paper.
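
To fix ideas, the execution-time behaviour might be sketched as follows; the attributes of precompiled links and the glossary lookup interface shown here are hypothetical.

    def follow(click_node, click_word, click_offset, precompiled_links, glossaries):
        # First, see whether a precompiled link starts at the clicked position.
        for link in precompiled_links.get(click_node, []):
            if link.start_offset <= click_offset < link.start_offset + len(link.text):
                return link.destination
        # Otherwise, try whatever glossaries are currently installed in the base.
        for glossary in glossaries:
            destination = glossary.lookup(click_word)
            if destination is not None:
                return destination
        return None   # no link here: the click has no effect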

References

Berners-Lee 94 Berners-Lee, Tim, Robert Cailliau, Ari Luotonen, Henrik Frystyk Nielsen and Arthur Secret. "The World Wide Web". Communications of the ACM, 37(8), August 1994, pp. 76-82.

Choquette 94 Choquette, Martin. L'application des systèmes hypertextes dans un cadre juridique. Masters Thesis, Département d'informatique et de recherche opérationnelle, Université de Montréal, 1994.

EIC 90 Unemployment Insurance Legislation and Relevant Employment Legislation. Employment and Immigration Canada, December 1990.

Furuta 89a Furuta, Richard, Catherine Plaisant and Ben Shneiderman. "Automatically Transforming Regularly Structured Linear Documents into Hypertext". Electronic Publishing, 2(4), December 1989, pp. 211-229.

Furuta 89b Furuta, Richard, Catherine Plaisant and Ben Shneiderman. "A Spectrum of Automatic Hypertext Constructions". Hypermedia, 1(2), 1989, pp. 179-195.

Gelbart 91 Gelbart, Daphne and Joseph C. Smith. "Beyond Boolean Search: FLEXICON, A Legal Text-Based Intelligent System". Proceedings of the Third International Conference on Artificial Intelligence and Law, St. Catherine's College, Oxford, 1991, ACM Press, pp. 225-234.

Greenleaf 91 Greenleaf, Graham, Andrew Mowbray and Alan Tyree. "The Datalex Legal Workstation -- Integrating Tools for Lawyers". Proceedings of the Third International Conference on Artificial Intelligence and Law, St. Catherine's College, Oxford, 1991, ACM Press, pp. 215-224.

Marcil 94 Marcil, François. Navigation et repérage d'information dans les hypertextes de grande taille. Masters Thesis, Département d'informatique et de recherche opérationnelle, Université de Montréal, 1994.

Rada 92 Rada, Roy. "Converting a Textbook to Hypertext". ACM Transactions on Information Systems, 10(3), July 1992, pp. 294-315.

Raymond 88 Raymond, Darrell R. and Frank W. Tompa. "Hypertext and the Oxford English Dictionary". Communications of the ACM, 31(7), July 1988, pp. 871-879.

Savoy 92 Savoy, Jacques. "Bayesian Inference Networks and Spreading Activation in Hypertext Systems". Information Processing & Management, 28(3), 1992, pp. 389-406.

Shneiderman 89 Shneiderman, Ben. "Reflections on Authoring, Editing, and Managing Hypertext". In The Society of Text (ed. E. Barrett), MIT Press, 1989.

Turtle 91 Turtle, Howard. Inference Networks for Document Retrieval. Doctoral dissertation, Computer Science Department, University of Massachusetts at Amherst, 1991.

Wilson 90 Wilson, Eve. "Links and Structures in Hypertext Databases for Law". Proceedings ECHT '90, Paris (France), November 1990, Cambridge University Press, pp. 194-211.