Document Properties: Links

The Links properties page allows you to specify how to handle hyperlinks.

Maximum Link Depth

You tell iSiloX how far to follow hyperlinks by specifying a value for the Maximum link depth. A new installation of iSiloX has this value initialized to a default of one. The root source files are considered to be at a depth of zero. Files to which they link are at a depth of one. Files to which those files link are at a depth of two, and so on.
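The depth numbering described above can be sketched as a breadth-first walk over a link graph. This is only an illustration of the counting rule, not iSiloX's actual implementation; the `links` dictionary, the file names, and the function name `files_within_depth` are hypothetical.

```python
from collections import deque

def files_within_depth(links, roots, max_depth):
    """Return {file: depth} for every file reachable within max_depth.

    links maps each file to the files it links to (hypothetical data).
    Root source files are at depth zero.
    """
    depth = {root: 0 for root in roots}
    queue = deque(roots)
    while queue:
        current = queue.popleft()
        if depth[current] == max_depth:
            continue  # links from this file would exceed the maximum depth
        for target in links.get(current, []):
            if target not in depth:
                depth[target] = depth[current] + 1
                queue.append(target)
    return depth

# Hypothetical link graph: index.htm is the root source file.
links = {
    "index.htm": ["a.htm", "b.htm"],
    "a.htm": ["c.htm"],
    "c.htm": ["d.htm"],
}
print(files_within_depth(links, ["index.htm"], 1))
```

With a maximum link depth of one, only index.htm, a.htm, and b.htm are included. Raising the value to two also brings in c.htm.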

Recommendation

If you are creating a document based on a Web site, it is best to leave the Maximum link depth value at one, because each additional increment in depth beyond one is likely to cause an exponential increase in the size of the document. For example, if the converted document is one megabyte at a link depth of one, it might be ten megabytes at a link depth of two and 100 megabytes at a link depth of three.
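The arithmetic behind this recommendation can be sketched as simple geometric growth. The factor of ten is only the illustrative figure from the example above; real sites vary widely, and the function name `estimated_size_mb` is hypothetical.

```python
def estimated_size_mb(size_at_depth_one_mb, growth_factor, depth):
    """Rough estimate: each extra level multiplies the size by growth_factor."""
    return size_at_depth_one_mb * growth_factor ** (depth - 1)

for depth in (1, 2, 3):
    print(depth, estimated_size_mb(1, 10, depth))
```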

Off-site Links

An off-site link is defined as a link to a target in a different domain. iSiloX treats all file paths as belonging to the same domain. For URLs, iSiloX treats the domain as the protocol (e.g., http://) and the hostname. To tell iSiloX to not follow links to targets in different domains, uncheck the Follow off-site links checkbox. This is useful to limit the amount of irrelevant content brought into the document.

iSiloX performs the off-site link check anew for each root source file. This means that you can have root source files in different domains. For example, you can have two root source files, one with the URL <http://www.iSilo.com> and another with the URL <http://www.palm.com>. Assuming that you have unchecked the option to follow off-site links, then when iSiloX converts the content at <http://www.iSilo.com>, it only follows links from there with target URLs that begin with <http://www.iSilo.com>. When iSiloX converts the content at <http://www.palm.com>, it only follows links from there with target URLs that begin with <http://www.palm.com>. If the content at <http://www.palm.com> has a link to <http://www.iSilo.com/whatsnew.htm>, iSiloX does not follow that link.
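The domain comparison for URLs can be sketched as follows. This is a minimal sketch that assumes the domain is the pair of protocol and hostname, as described above, and that the comparison ignores letter case. The function names `domain_of` and `is_off_site` are hypothetical, and file paths are not handled here.

```python
from urllib.parse import urlsplit

def domain_of(url):
    """Return (protocol, hostname) for a URL, both lowercased."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), (parts.hostname or "").lower())

def is_off_site(root_url, target_url):
    """True if the target is in a different domain than the root source."""
    return domain_of(root_url) != domain_of(target_url)

# The link from the palm.com root to iSilo.com is off-site.
print(is_off_site("http://www.palm.com", "http://www.iSilo.com/whatsnew.htm"))
# A link within the same protocol and hostname is not.
print(is_off_site("http://www.iSilo.com", "http://www.iSilo.com/whatsnew.htm"))
```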

Maximum off-site link depth

When you enable the option to follow off-site links, you can also specify how far to follow hyperlinks that go off-site. A value of zero is equivalent to unchecking the option to follow off-site links. The depth is relative to the source file containing the off-site link, rather than relative to the root source files.

If you uncheck the Follow off-site links option, then the Maximum off-site link depth setting has no effect.

Note that the value for the Maximum link depth setting still limits the total link depth. So if the maximum link depth is set to two and the maximum off-site link depth is set to one, and there is an off-site link from a source file at depth two, that link is not followed, although it is at a depth of one relative to the source file with that off-site link.

The maximum off-site link depth option is useful in the case where you specify a maximum depth value greater than one in order to include more content from a given site but want to allow links to off-site articles.
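The interaction between the two depth limits can be sketched as a single decision function. This is an illustration under stated assumptions rather than iSiloX's actual logic: the function and parameter names are hypothetical, and it assumes that every link followed from a file reached through an off-site link keeps counting against the off-site limit.

```python
def should_follow(source_depth, source_off_site_depth, link_is_off_site,
                  max_depth, max_off_site_depth, follow_off_site):
    """Decide whether to follow a link found in a source file.

    source_depth: depth of the source file relative to the root source files.
    source_off_site_depth: 0 if the source file is on-site, otherwise the
        number of levels since the off-site link that led to it (assumption).
    """
    # The Maximum link depth setting always limits the total depth.
    if source_depth + 1 > max_depth:
        return False
    if link_is_off_site or source_off_site_depth > 0:
        if not follow_off_site:
            return False  # the off-site depth setting has no effect
        return source_off_site_depth + 1 <= max_off_site_depth
    return True

# Maximum link depth 2, maximum off-site link depth 1:
print(should_follow(2, 0, True, 2, 1, True))  # off-site link from depth two
print(should_follow(1, 0, True, 2, 1, True))  # off-site link from depth one
```

The first call returns False because the target would be at a total depth of three, and the second returns True.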

Following Only Sub-Folder Links

Many websites are structured hierarchically in folders and sub-folders, and the URLs of their pages are usually organized the same way, with slashes separating the different levels of folders. For example, the iSiloX.com website has all support pages within a folder named "support". Within the support folder, there are sub-folders for different categories of support, such as a sub-folder named "manual" where the manuals are located. However, pages in such folders may also have links to pages outside of the folder. To limit followed links to the folders of the root source pages and their sub-folders, check the Follow only links that are sub-folders of the root source paths checkbox. iSiloX then follows only links that match one of the root source URLs up to that URL's last slash.

As an example, to get all the support pages from the iSiloX.com website, you might specify http://www.iSiloX.com/support/index.htm as the root source URL and check Follow only links that are sub-folders of the root source paths. The page http://www.iSiloX.com/support/index.htm has a reference to the home page of the site, http://www.iSiloX.com. Because you checked the Follow only links that are sub-folders of the root source paths option, that link is not followed. A link such as http://www.iSiloX.com/support/faq.htm, to the frequently asked questions page, is followed.
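The sub-folder rule can be sketched as a prefix comparison against each root source URL, cut off after its last slash. This is a minimal sketch; the function names are hypothetical, and the exact, case-sensitive string comparison is an assumption.

```python
def folder_prefix(root_url):
    """Return the root source URL up to and including its last slash."""
    return root_url[: root_url.rfind("/") + 1]

def in_root_sub_folder(target_url, root_urls):
    """True if the target matches any root source URL up to its last slash."""
    return any(target_url.startswith(folder_prefix(root)) for root in root_urls)

roots = ["http://www.iSiloX.com/support/index.htm"]
print(folder_prefix(roots[0]))
print(in_root_sub_folder("http://www.iSiloX.com", roots))
print(in_root_sub_folder("http://www.iSiloX.com/support/faq.htm", roots))
```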

Unresolved Link Detail

In most cases, since you can tell iSiloX to only follow links up to a given maximum depth and to not follow off-site links, you end up with a document that has hyperlinks to content not brought into the document. These hyperlinks are referred to as unresolved links. You can choose whether to include the target URLs of these unresolved links in the document or not by checking or unchecking the Include unresolved link detail checkbox.

Including unresolved link detail

If you choose to include the unresolved link detail, iSiloX creates a document with an additional page at the end that lists the URLs of all unresolved links. iSiloX sets the target of each unresolved link in the document to jump to its corresponding target URL on this last page. This is useful for later reference and for finding broken hyperlinks.

Not including unresolved link detail

If you choose not to include the unresolved link detail, the unresolved hyperlinks essentially have no target. When you view the document within a reader and attempt to follow such a hyperlink, the reader tells you that the hyperlink was unresolved, but gives no indication of the target URL.

Common sources of unresolved links

The most common sources of unresolved links are the following:

Links that are at a depth greater than that specified in the Maximum link depth setting.

Links that are outdated and thus are broken because the target has moved.

Links whose targets are specified incorrectly.

URL Filters

Click URL Filters to access the URL Filters dialog, where you specify patterns that exclude images and prevent links from being followed, based on the image URL or link target URL. URL filters are useful for excluding unwanted images and content and for reducing document sizes. Each filter is specified as either a wildcard pattern or a regular expression pattern.

Exclusion filters

If the URL of an image matches against one of the exclusion patterns, it is not included in the document. If the target URL of a link matches against one of the exclusion patterns, the link is not followed and hence the target content is not included in the document. Exceptions to exclusions can be specified using inclusion filters.

Adding an exclusion filter

Click Add Exclusion Filter to access the dialog for specifying a new exclusion filter. In the URL Filter dialog, select a pattern type of either Wildcard or Regular Expression:

Wildcard: A wildcard pattern provides a simple way to specify simple patterns. In such a pattern, the character '*' matches zero or more of any mix of characters and the character '?' matches exactly one of any character. A URL matches against a wildcard pattern if the pattern appears anywhere in the URL.
Regular Expression: Regular expression patterns use a powerful pattern matching language. This implementation uses the PCRE (Perl Compatible Regular Expressions) library, version 3.9. For more information about PCRE and the syntax for regular expressions, you can consult the PCRE website. In particular, follow the link there labeled "PCRE man page" and then go to the section with the heading "REGULAR EXPRESSION DETAILS".

In the Pattern field, enter the pattern to use. Check Case-sensitive to perform a case-sensitive match. By default, matching is case-insensitive, with the lowercase letters 'a' through 'z' matching the uppercase letters 'A' through 'Z'.

Deleting an exclusion filter

Select one or more exclusion filters, then click Delete Selected Exclusion Filters to delete them. You will be asked for confirmation before the filters are deleted.

Modifying an exclusion filter

Double-click an exclusion filter to modify it.

Inclusion filters

An inclusion filter serves as an exception to the exclusion filters. If a given URL matches against an exclusion filter, the inclusion filters are applied to the URL, and if there is a match against an inclusion filter, the URL is not excluded. Click Add Inclusion Filter to access the dialog for specifying a new inclusion filter. To delete one or more inclusion filters, select them, then click Delete Selected Inclusion Filters. To modify an inclusion filter, double-click it.

Example

This example specifies two exclusion filters and one inclusion filter.

First exclusion filter: A regular expression pattern of table[1-9].jpg
Second exclusion filter: A wildcard pattern of figures*plant?blue
Inclusion filter: A case-sensitive wildcard pattern of Table8

The first exclusion filter specifies a regular expression pattern that is case-insensitive. The pattern matches the text "table" followed by any digit character from '1' through '9' and then followed by the text ".jpg". (Strictly speaking, the unescaped period in the pattern matches any single character, not only a period.) So the pattern will match against any of the following:

Table1.jpg
http://www.acme.org/table3.jpg
c:\My Documents\table5.jpg
/home/acme/docs/TABLE9.jpg

But the pattern will not match against any of the following:

Table1.gif
http://www.acme.org/table0.jpg
c:\My Documents\table5
/home/acme/docs/tables.htm

The second exclusion filter is a wildcard pattern and is also case-insensitive. The pattern matches the text "figures" followed by zero or more of any mix of characters, followed by the text "plant", followed by any single character, and finally followed by the text "blue". The pattern will thus match against any of the following:

http://blueflowers.com/figures/plantablue.htm
http://blueflowers.com/figuresplant1blue.htm

But the pattern will not match against any of the following:

http://blueflowers.com/figures/plantabblue.htm
http://blueflowers.com/figuresplantblue.htm

The first exclusion filter would exclude the URL "http://www.acme.org/Table8.jpg". However, because the inclusion filter notes it as an exception, the URL would actually not be excluded. Note that this inclusion pattern specifies a case-sensitive match, and so "http://www.acme.org/table8.jpg" would not be noted as an exception.
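The example filters can be reproduced with a short sketch. Python's `re` module is used here only as a stand-in for the PCRE library, and the wildcard translation follows the rules stated earlier ('*' for zero or more characters, '?' for exactly one, match anywhere in the URL). The helper names `wildcard_to_regex`, `make_filter`, and `is_excluded` are hypothetical.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '*' and '?' wildcards into an unanchored regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # zero or more of any characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def make_filter(pattern, kind, case_sensitive=False):
    """Compile a filter; matching is case-insensitive unless requested."""
    source = wildcard_to_regex(pattern) if kind == "wildcard" else pattern
    return re.compile(source, 0 if case_sensitive else re.IGNORECASE)

def is_excluded(url, exclusions, inclusions):
    """A URL is excluded if an exclusion matches and no inclusion matches."""
    if not any(f.search(url) for f in exclusions):
        return False
    return not any(f.search(url) for f in inclusions)

exclusions = [
    make_filter(r"table[1-9].jpg", "regex"),
    make_filter("figures*plant?blue", "wildcard"),
]
inclusions = [make_filter("Table8", "wildcard", case_sensitive=True)]

for url in ("http://www.acme.org/table3.jpg",
            "http://www.acme.org/Table8.jpg",
            "http://www.acme.org/table8.jpg",
            "http://blueflowers.com/figuresplantblue.htm"):
    print(url, is_excluded(url, exclusions, inclusions))
```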

External Documents

Click External Documents to access the External Documents dialog to specify which links are to external documents. A document may have links to zero or more external documents.

In the External Documents dialog, the External document list lists the document name and link prefix fields of each external document specification for the document. Generally you will have one external document specification for each external document to which the document will link.

An external document specification consists of the following pieces of information, as shown in the External Document Specification box of the dialog:

Document name: This field gives the relative path to the external document as it will be when the user accesses it using a link. If the external document is a .pdb file, the .pdb extension is optional. The reader application will attempt to open the file with the exact path name you provide first. If the open is unsuccessful, another attempt is made to open it with the .pdb extension if it was not provided or without the .pdb extension if it was provided.
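The order of open attempts described above can be sketched as follows. This is an illustration only; the function name `open_attempts` is hypothetical, and treating the extension check as case-insensitive is an assumption.

```python
def open_attempts(path):
    """Return the path names to try, in order, for an external document."""
    if path.lower().endswith(".pdb"):
        # Extension was provided: try as given, then without it.
        return [path, path[:-4]]
    # Extension was not provided: try as given, then with it.
    return [path, path + ".pdb"]

print(open_attempts("Gulliver_s_Travels.pdb"))
print(open_attempts("Gulliver_s_Travels"))
```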

Note that on Palm OS®, if a file is stored in the internal storage memory, the document title serves as the file name, so when converting external documents, it is best to ensure that the document title and document file name are the same. Also, on Palm OS®, when a document is stored in the internal storage memory, any external documents to which it links must also be stored in the internal memory, and in this case, the reader application ignores the directory part of external document paths.

Version 4.3 and later of iSilo™ support searching for the first of multiple possibilities. You can specify multiple names to search for by enclosing each name within double-quote characters and separating each double-quote enclosed name from the next with a space. When you do this, iSilo™ opens the first document that it finds in the order listed. This is especially useful on Palm OS®, where a document in the internal database storage memory is identified by its internal database name, since there is no notion of a file name there, whereas a document on a memory card is identified by its file name.

Here is an example of specifying three different possibilities for the name: "Gulliver's Travels" "Gulliver_s_Travels.pdb" "Gul. Travels"
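How such a field breaks down into candidate names can be sketched with a small parser. This is only an illustration of the quoting rule, not the reader's actual parsing code. The function name `candidate_names` is hypothetical, and treating an unquoted field as a single name is an assumption.

```python
import re

def candidate_names(document_name):
    """Split a Document name field into the names to try, in listed order."""
    quoted = re.findall(r'"([^"]*)"', document_name)
    # A field without double quotes is treated as one name (assumption).
    return quoted if quoted else [document_name]

field = '"Gulliver\'s Travels" "Gulliver_s_Travels.pdb" "Gul. Travels"'
print(candidate_names(field))
```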

Link prefix: This field gives the text to match links against for identifying a link as one to an external document. When performing the match, the converter takes the URL of the link and removes all leading periods, forward slashes, and backslashes. Then it performs a case-insensitive comparison of the prefix string against the beginning of the remaining URL. A match indicates that the link is to that external document.
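The prefix match can be sketched in a few lines. This is a minimal sketch of the rule as described, not the converter's code; the function name `matches_link_prefix` and the sample paths are hypothetical.

```python
def matches_link_prefix(url, prefix):
    """Case-insensitive prefix match after stripping leading '.', '/', '\\'."""
    stripped = url.lstrip("./\\")
    return stripped.lower().startswith(prefix.lower())

print(matches_link_prefix("../DirB/page.htm", "dirb"))
print(matches_link_prefix("DirA/page.htm", "DirB"))
```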

Keep prefix: This field determines whether the link prefix is kept for lookup.

As an example of a scenario where the prefix should not be included in the lookup, consider two documents, call them document A and document B, that externally link to one another such that each document's content is wholly contained in its own directory. Say that the directory containing document A's content is named DirA and that the directory for document B's content is named DirB. In order for document A to link to document B, for document A, you would specify DirB as the prefix for identifying links as those to document B. For document B, you would specify DirA as the prefix for identifying links as those to document A. The target names within a given document are relative to the first source, which would presumably be some file immediately within the document's directory. Hence, the directory name would not be part of the target name and thus the prefix, which would be the same as the directory name, should not be included for lookup.

As an example of a scenario where the prefix should be included in the lookup, consider two documents, call them document A and document B, that externally link to one another such that each document's content is spread across two directories. Say that the directories containing document A's content are DirA1 and DirA2 and that the directories containing document B's content are named DirB1 and DirB2. Further, say that the directory containing all four directories is named DirAB. In addition, say that an index file immediately within DirAB links to content in all four subdirectories DirA1, DirA2, DirB1, and DirB2. To create the two documents that link externally to one another, for document A, you would specify two external document specifications, both for externally linking to document B. For the first specification, DirB1 would be the prefix. For the second specification, DirB2 would be the prefix. But since the index file is at the same level as those two directories, you would want to keep the prefix.
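The effect of the Keep prefix setting on the name used for lookup can be sketched as follows. This is a hedged illustration: the function name `lookup_target` is hypothetical, and dropping the separator that follows a removed prefix is an assumption that the text above does not confirm.

```python
def lookup_target(url, prefix, keep_prefix):
    """Return the target name used for lookup, or None if there is no match."""
    stripped = url.lstrip("./\\")
    if not stripped.lower().startswith(prefix.lower()):
        return None  # not a link to this external document
    if keep_prefix:
        return stripped
    # Assumption: the separator after the prefix is dropped as well.
    return stripped[len(prefix):].lstrip("/\\")

# First scenario: DirB is not part of the target names in document B.
print(lookup_target("../DirB/page.htm", "DirB", keep_prefix=False))
# Second scenario: the index file sits above DirB1, so the prefix is kept.
print(lookup_target("DirB1/page.htm", "DirB1", keep_prefix=True))
```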

Map file: This field gives the full path of the map file for the external document. When using the ID or Offset lookup methods, the map file is necessary for determining the target ID or target offset value for links to the external document. The path must be a full path and can be an HTTP URL. This latter capability allows the targets of a document to be easily made public.

When converting the target external document, be sure to use the option to generate the map file. If two documents link to one another, it is necessary to perform two conversion passes. The first pass generates the map files and the second pass uses the map files for looking up the associated target IDs or offsets.

Lookup by: The setting of this option determines the format in which the link information is stored as well as how the lookup is performed in the external document. Set it to one of the following:

- ID: A numeric value, also known as the target ID, that uniquely identifies the target is stored and used to identify the target location within the external document when a jump to the target occurs. A map file for the external document is needed to look up the target ID values of the external document during conversion.
- Offset: A numeric value, also known as the target offset, that represents the location of the target in the external document is stored and used when a jump to the target occurs. A map file for the external document is needed to look up the target offset values of the external document during conversion.

Lookup method tradeoffs

The lookup methods each have their own individual advantages and disadvantages.

For the document storage space tradeoffs among the methods, the Name method requires the largest amount of storage space in the linking document as well as in the targeted external document, unless the target names are few and short. The ID and Offset methods require approximately the same amount of storage space as each other in the linking document. In the targeted external document, the Offset method requires no additional storage space, while the ID method requires an amount of storage space that is generally less than the Name method.

In terms of the speed of performing the lookup when a jump occurs to an external document, the difference perceived by the user is probably negligible. Still, the Name method requires the most processing, the ID method comes next, and the Offset method requires the least.

The other important tradeoff among the methods concerns synchronization between a document and the external documents to which it links. For the purposes of this discussion, let us say that we have a document named DocSource that has links to an external document named DocTarget and that DocTarget is updated independently of DocSource. The content and targets in DocTarget change periodically such that content and targets may be added and removed. Assume though that the targets in DocTarget to which DocSource links are always there, though the specific location of the targets within the content of DocTarget may change.

Given the scenario just described, if the lookup method is Name, even though DocTarget may undergo many changes and DocSource stays the same, the links from DocSource to DocTarget will always work.

If the lookup method is ID, this may not be the case. The IDs assigned to each target within DocTarget depend to some extent on all other external targets within DocTarget. If DocTarget gets a new target or one is removed, the target IDs for the other targets may change. As a result, the target IDs stored in DocSource for the targets in DocTarget may become invalid. However, if only the content in DocTarget changes, the target IDs will still be valid.

If the lookup method is Offset, then neither the content nor the targets in DocTarget may change if the links from DocSource to DocTarget are to remain valid.

The Name lookup method, though requiring the most storage space, is the best method to use for documents that can change independent of one another. The Offset lookup method requires the least amount of storage space and is a good method to use for documents that will change together. The ID lookup method generally requires only a modest amount of storage space compared to the Name method and is a good method to use when only changes to the content, such as minor corrections, are expected to occur in an external document.

Adding a new external document specification

To add a new external document specification, fill in the fields in the External Document Specification box and then click Add.

Modifying an existing external document specification

In the External document list select the specification to modify. The fields in the External Document Specification box change to show the values for the selected specification. Make the modifications, then click Modify.

Deleting one or more external document specifications

To delete individual specifications, select them individually, then click Delete. To delete all specifications, click Delete All.

Changing the order of the specifications

The order of the specifications may be important for your document set. To change the order of a specification, select it and then use the Move Up and Move Down buttons to move the specification up and down, respectively, in the list. Specifications are applied in order from top to bottom. The first specification that matches a given link is the one used.
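Why the order can matter is easiest to see when one link prefix is itself the beginning of another. The sketch below applies the first-match rule to a list of specifications. The dictionary keys, document names, and prefixes are hypothetical, and the matching follows the link prefix rule described earlier.

```python
def first_matching_specification(url, specifications):
    """Return the first specification whose link prefix matches the URL."""
    stripped = url.lstrip("./\\").lower()
    for spec in specifications:  # applied from top to bottom
        if stripped.startswith(spec["link_prefix"].lower()):
            return spec
    return None

# The more specific prefix is listed first so that it is not shadowed.
specs = [
    {"document_name": "DocB-extras", "link_prefix": "DirB/extras"},
    {"document_name": "DocB", "link_prefix": "DirB"},
]
print(first_matching_specification("DirB/extras/x.htm", specs)["document_name"])
print(first_matching_specification("DirB/page.htm", specs)["document_name"])
```

If the two specifications were listed in the opposite order, the DirB prefix would match both links and the DirB/extras specification would never be used.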