This draft needs to be revised to take Corrigendum 2 of the standard into account. Due to pressure of other work, that has not yet been done. A great deal of hard work and serious debate has gone into improving the standard, so some of my remarks about the standard may now be unfair. (Except that char_conversion/2 always was an obviously bad idea and still isn't gone.)
The ISO Prolog standard does not adequately support the needs of Prolog users, because, as announced at the Logic Programming conference in Melbourne in 1985, its predecessor BSI Prolog was explicitly intended to be a minimal standard with essentially no library. For example, the standard deliberately omitted append/3, length/2, and member/2 (although an example on page 33 shows clauses for append/3 and an example on page 85 shows clauses for member/2).
[Ulrich Neumerkel has a draft Prologue for Prolog which includes append/3 (p.p.2), between/3 (p.p.4), length/2 (p.p.3), maplist/[2-8] (p.p.7), member/2 (p.p.1), select/3 (p.p.5), and succ/2 (p.p.6). We owe all the people involved in producing that draft and writing it up a vote of thanks. I wish I had heard about it earlier.]
The argument at the time was that these things would go in some library. I urged Quintus to make their library available. No luck, despite the fact that the very dated DEC-10 Prolog library on which it was based was available.
Since 1998 the Logtalk system has provided language extensions and a substantial library. There is abundant documentation, many examples, and the range of topics covered by the library considerably exceeds what this document covers. Logtalk is implemented in Prolog and runs on top of most if not all currently important Prolog systems. The author, Paulo Moura, has made a major contribution to the Prolog community by writing, maintaining, and freely providing the system and its library, and I was at fault for not mentioning him or it in earlier drafts.
What I wanted to accomplish in 1983 was to get community agreement on a useful library of data structures and algorithms for core Prolog. Many areas are crying out for standardisation.
This is an attempt to deal with basic data structures. Perhaps more important than the specific data structures and operations dealt with is the stress placed on consistent naming and argument order, and the use of projection as a means of ensuring that argument order is consistent.
This is not written in standardese. It is meant for Prolog users to discuss. The time to turn it into standardese is when we are confident about what we want to standardise.
I provide three kinds of declaration for the "base" predicates:

when-declarations, the most intuitive way to specify when a predicate should make progress. (If a goal is suspended, when should it be resumed? When the when declaration is satisfied.) Even in a non-coroutining Prolog (like ISO Prolog) the information is still useful: it describes the modes in which it makes sense to call the predicate.
mode declarations. The "mode" det would mean that such a query has a single solution and cannot fail; semidet means that there is at most one solution. I intend something stronger here: when a call has the described form, the goal should either fail in a finite amount of time or should succeed in a finite amount of time with the Prolog system "aware" that no choices remain to be explored. The bounded "mode", not found in Mercury, means that such a goal has a finite number of solutions (possibly zero), and that such a goal must be able to report all solutions in a finite amount of time and then fail in a finite amount of time without infinite backtracking.

The argument instantiation test list_skel(Var) is satisfied when Var is a proper list; the elements of the list are not constrained.
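To make the notation concrete, here is how the three kinds of declaration might read for member/2 and memberchk/2 (my illustration, not part of the proposal itself):

:- pred member(T, list(T)).
:- member(X, Xs) when Xs.                        % resume once the list is instantiated
:- member(X, Xs) is bounded when list_skel(Xs).  % finitely many solutions, then finite failure

:- pred memberchk(T, list(T)).
:- memberchk(X, Xs) when Xs.
:- memberchk(X, Xs) is semidet when list_skel(Xs).  % at most one solution, no choices left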
The sample code uses a convention taken from DEC-10 Prolog via C Prolog: the dollar sign is treated as a lower case letter, and predicates whose names contain a dollar sign are "implementation detail" not visible to the user. Such predicates are only for specification purposes, and need not conform to quality requirements for visible predicates such as steadfastness.
Perhaps the most important notational issue is that I give no indication in the running text of the module a type or predicate is defined in. It is clear that the four "basic" list processing predicates append/3, length/2, member/2, and memberchk/2 are so useful and so used that they had better be treated as part of the language. It would be quite appropriate if the others were all in a module named 'lists'. See plface.pl for one distribution.
I assume the following predefined types:
:- type list(T) ---> [] | [T|list(T)].
:- type order ---> (<) | (=) | (>).
:- type pair(K,T) ---> K-T.
:- type integer.
:- type float.
:- type number.
:- type atom.
:- type char = atom.      % such that atom_codes(It, [_])
:- type chars = list(char).
:- type code = integer.   % such that between(0, 16'10FFFF, It)
:- type codes = list(code).
Other types will be introduced where they are used.
With Prolog being used in e-business and stuff like that, we have to deal with HTML 4 (defined in terms of Unicode) and XML (defined in terms of Unicode). If we can't portably deal with Unicode, we can't portably deal with the Web. We can no longer afford to put up with Prolog systems that don't handle Unicode cleanly. Quintus Prolog handled 16-bit character sets via JIS and EUC nearly two decades ago. It was obvious then that the byte-at-a-time way that QP did so was a makeshift with a use-by date in the near future (now the rather less near past).
The ISO Prolog standard was published in 1995. In 1983, when I started work on an informal Prolog standard, it was obvious that 8-bit character sets were dead and Prolog would have to support 16-bit character sets. By 1993, it was obvious that there was a single "wide" character set to use in future language standards: ISO 10646/Unicode. Unicode 1.0.0 came out in October 1991. The reconciliation between ISO 10646 and Unicode resulted in Unicode 1.0.1, which came out in June 1992. This left the ISO Prolog committee with 3 years to reflect on the fact that there was now one ISO character set with round-trip conversion between the main national character set standards and itself.
This was certainly obvious to others. For example, the X/Open technical study E401, "Universal Multiple-Octet Coded Character Set coexistence and Migration" had been published the previous year. Of even greater relevance: a great deal of work had gone into updating Ada for what became Ada 95. By 1993, I already had a pile of Ada 95 background documentation and proposals in my office about 50cm high, and Unicode was clearly going to be the character basis of Ada 95 then.
The main reason why certain people at Quintus were not as hostile to the introduction of the single-character-atom design botch as they should have been was concern for EBCDIC on IBM mainframes. However, I have studied the current z/Architecture Principles of Operation, and note that current IBM mainframes have hardware support for Unicode.
With this background, we can see that the ISO Prolog committee made some questionable decisions:

- They stayed with a minimal character set, leaving no way of writing my father's name Æneas or of including my sister-in-law's name "Chéri Ford" in a string literal, in marked contrast to Ada, C99, and C++.
- They changed the meaning of atom_chars/2 and number_chars/2, in the interests of solving a problem which the adoption of Unicode as the character basis of Prolog would have completely exorcised, without the need for such wholesale breakage of existing working code.
- They adopted char_conversion/2.

"char_conversion must go!" This is not negotiable, for the reason that no sane programmer who is acquainted with the facts would wish to retain char_conversion.
That's strong language. Let's summarise what char_conversion/2 and current_char_conversion/2 do.

char_conversion('X', 'Y') replaces ConvC with µ(ConvC, x, y), that is, ConvC updated to map x to y, where x is the character code of X and y is the character code of Y.

current_char_conversion(A, B) is true when A and B are single-character atoms and b = ConvC(a), where a and b are the character codes of the first characters of the names of A and B respectively, provided (and this is the most confusing bit) a and b are different.
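To illustrate the intended behaviour (my reading of the two clauses above, taking fullwidth Ａ, U+FF21, as the character to be converted):

% Map fullwidth Ａ to plain A; unquoted occurrences of Ａ read in
% Prolog syntax are thereafter treated as A.
?- char_conversion('Ａ', 'A').
true.

% Because the two codes differ, the pair can now be enumerated:
?- current_char_conversion(X, 'A').
X = 'Ａ'.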
Concerning the excessive strength of char_conversion/2, consider the goal char_conversion('.', ','). If you give that as a query to the interactive top level, you haven't shot yourself in the foot, but in the head. There is no way to recover from this.

In fact it's worse than that. Section 5.5.3 reads "A processor may support some other initial value of ConvC ...", which means that there is no portable Prolog syntax at all. This implementation permission should have required all Prolog processors to start with ConvC mapping the characters used in the Core Prolog syntax to themselves. Indeed, it might have been advisable to decree that char_conversion should not be allowed to change the identity mappings for the core syntax characters.
The :- char_conversion(From, To) directive is unusable in portable programs, even programs intended to be used within a single locale using the same coded character set. The reason is that it is explicitly not specified whether char_conversion directives in a source file are limited in effect (the way it was desired by many that op directives should be) or unlimited in effect (the way op directives were in DEC-10 Prolog and C Prolog). More precisely, a program using this feature can be portable only if every source file spells out the same set of (typically 190) mappings: the fact that it is not specified whether the effect is limited or not means that a program cannot simply consult a single mapping file into each source file.
Concerning the weakness of char_conversion/2, we have to start by looking at the rationale, which speaks of character sets that contain more than one variant of the same character, to be reduced to a single form.
I take this to refer to the "full width" (zenkaku) and "half width" (hankaku) copies of ASCII present in some Japanese character sets (and therefore also present in Unicode).
The char_conversion/2 machinery does not satisfy the explicit requirement in the rationale that it should apply to Prolog data. For example, it does not apply to atom_chars/2, and there is no analogous built-in predicate which does character conversion. "Words" read from a file are only converted to a normal form by ConvC mapping if they are read in Prolog syntax using one of the read family.
There should surely have been

converted_atom_chars(Atom, Chars) :-
    (   atom(Atom) ->
        atom_chars(Atom, Name),
        $convert_chars(Name, Chars)
    ;   var(Atom) ->
        $convert_chars(Chars, Name),
        atom_chars(Atom, Name)
    ;   $error
    ).

$convert_chars([], []).
$convert_chars([Char|Chars], [Conv|Convs]) :-
    (   current_char_conversion(Char, Conv) -> true
    ;   Conv = Char
    ),
    $convert_chars(Chars, Convs).
An even worse way in which this misfeature fails to satisfy its explicit rationale is that character conversion does not apply to number_chars/2, and there is no analogous built-in predicate which does character conversion. Numbers read from a file are only converted to a normal form by ConvC mapping if they are read in Prolog syntax using one of the read family.
The failure to deal with run-time number conversion is a particular nuisance for people using Arabic or Indic scripts. As a preliminary sketch, it would be useful if something like the following predicates were adopted:
:- pred integer_codes(integer, codes, integer, code).

integer_codes(Integer, Codes, Base, Zero) :-
    Delta is Zero - 0'0,
    (   integer(Integer) ->
        integer_codes(Integer, ASCII, Base),
        $map(ASCII, Codes, Delta, 0'., 0'E)
    ;   var(Integer) ->
        $map(ASCII, Codes, Delta, 0'., 0'E),
        integer_codes(Integer, ASCII, Base)
    ;   $error
    ).

integer_codes(Integer, Codes, Base) :-
    between(2, 36, Base),
    like number_codes/2 but only for integers and
    using the first Base digits from "0..9A..Z".

integer_codes(Integer, Codes) :-
    integer_codes(Integer, Codes, 10).

:- type float_format
   ---> e(integer) | e | f(integer) | f | g(integer) | g
      | k(integer) | k | p.

:- pred float_codes(float, codes, float_format, code, code, code).

float_codes(Float, Codes, Format, Zero, Decimal, Exponent) :-
    Delta is Zero - 0'0,
    (   float(Float) ->
        float_codes(Float, ASCII, Format),
        $map(ASCII, Codes, Delta, Decimal, Exponent)
    ;   var(Float) ->
        $map(ASCII, Codes, Delta, Decimal, Exponent),
        float_codes(Float, ASCII, Format)
    ;   $error
    ).

float_codes(Float, Codes, Format, Zero, Decimal) :-
    float_codes(Float, Codes, Format, Zero, Decimal, 0'e).

float_codes(Float, Codes, Format, Zero) :-
    float_codes(Float, Codes, Format, Zero, 0'., 0'e).

float_codes(Float, Codes, Format) :-
    like number_codes/2 but only for floats,
    (   Format = e(P), integer(P), P >= 0 ->
        generate or recognise %.Pe format
    ;   Format = e ->
        generate: use default P, recognise: accept any %e
    ;   Format = f(P), integer(P), P >= 0 ->
        generate or recognise %.Pf format
    ;   Format = f ->
        generate: use default P, recognise: accept any %f
    ;   Format = g(P), integer(P), P >= 0 ->
        generate or recognise %.Pg format
    ;   Format = g ->
        generate: use default P, recognise: accept any float
    ;   Format = k(P), integer(P), P >= 0 ->
        like e(P), but with 1..3 digits before the decimal point
        so that the exponent is a multiple of 3; a very useful
        format when you have it
    ;   Format = k ->
        generate: use default P, recognise: accept any P but
        still allow only results of k(_) formatting
    ;   Format = p ->    % precise
        generate: same as e(P) with P large enough that full
        precision is generated, enough digits for round trip
        conversion to be one-to-one
    ;   $error
    ).

$map([], [], _, _, _).
$map([0'+|Xs], [0'+|Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([0'-|Xs], [0'-|Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([0'.|Xs], [ D |Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([0'e|Xs], [ E |Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([0'E|Xs], [ E |Ys], Z, D, E) :- !, $map(Xs, Ys, Z, D, E).
$map([ N |Xs], [ M |Ys], Z, D, E) :- !,
    plus(Z, N, M),
    between(0'0, 0'9, N),
    $map(Xs, Ys, Z, D, E).

With such predicates, we then define
:- pred number_codes(number, codes, code).

number_codes(Number, Codes, Zero) :-
    (   integer(Number) ->
        integer_codes(Number, Codes, 10, Zero)
    ;   float(Number) ->
        float_codes(Number, Codes, p, Zero)
    ;   var(Number) ->
        (   integer_codes(Number, Codes, 10, Zero) -> true
        ;   float_codes(Number, Codes, p, Zero)
        )
    ;   $error
    ).

number_codes(Number, Codes) :-
    number_codes(Number, Codes, 0'0).
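For example, with the sketched number_codes/3, and writing 0'٠ for ARABIC-INDIC DIGIT ZERO (U+0660), one would expect:

?- number_codes(N, "١٢٣", 0'٠).     % Arabic-Indic digits one, two, three
N = 123.

?- number_codes(423, Codes, 0'٠).
Codes = "٤٢٣".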
At last, 10 years late, this would give people using Arabic or Indic scripts (amongst others) the ability to convert numbers using their own characters, and it would, 10 years late, give Europeans, who favour a rather ambiguous use of commas, the ability to convert floating-point numbers using their preferred decimal point character.
There is another pressing reason why something like float_codes/3 needs to exist, and that is the popularity of JSON-encoded data on the Web. JSON is a (data) sublanguage of JavaScript, which has no integers, only floating-point numbers. To process JSON data consistently with the way JavaScript would process it, it is necessary both to accept all numeric forms that are allowed in JSON (including for example 1e10) and to read them as floating point, even ones that look like integers. The float_codes/3 predicate can do this; the number_codes/2 predicate cannot.
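Under the sketch above, a JSON reader might proceed as follows; I am assuming here that the g format, when recognising, accepts integer-looking input as well:

?- float_codes(X, "1e10", g).
X = 1.0e10.

?- float_codes(X, "42", g).      % looks like an integer, read as a float
X = 42.0.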
Is there any other way char_conversion/2 fails? Yes, there is. I shall take personal examples. The only languages that really matter to me personally (other than mattering because I think other people should be able to use their own language and script) are English, Latin, Greek, Hebrew, Serbo-Croatian, and Maori.

Writing in English, using ISO Latin 1, I would like to be able to use my father's name, Æneas. The letter ash is not one of the normal Prolog characters, so I would have to map it to two characters: A, e. But I can't do that. I cannot have the atom hæme mapped to haeme either, for the same reason. Since Latin 1 includes the characters "¼", "½", and "¾", which are part of the number syntax of my native locale, I would like to use them in numbers, so that ¼=0.25, 2½=2.5, and 10¾=10.75. This could be done if I could map "¼" to ".25" and so on. But with char_conversion/2 so limited, I can't. If char_conversion/2 can't even handle ISO Latin 1, what good is it? It has been defended on the grounds that it wasn't designed to handle these issues. But it was clear at the time that it should have been. And it is even clearer in 2013 with Unicode 6.3 that char_conversion/2 is inadequate for dealing with non-ASCII text.
I suppose I should have said "Croatian" rather than "Serbo-Croatian", but there was a generation of New Zealanders who knew that their grandparents were from Yugoslavia who had to look the natal village up on a map to figure out which side they would have been on when Yugoslavia broke up. As it turned out, my relatives were all on the side labelled "refugee", and we have lost touch; the natal village is a heap of rubble. The NZ branch of the family figured out what to call themselves by looking at letters my grandfather's brother had written: they used the Latin script, not the Cyrillic script. Except that Croatian doesn't quite use the Latin script. There are four letters (DZ, DZ with caron, LJ, and NJ) which look like two English letters but are regarded as single letters in the script. It would probably be a good idea to map DZ to D,Z; Dz to D,z; and dz to d,z. But I can't do that with char_conversion/2.
New Zealand has two official languages, English and Maori. For historical reasons, there are three spelling conventions for Maori: leaving vowel length unmarked, writing long vowels double (as in "Maaori"), and marking long vowels with a macron (with a diaeresis sometimes pressed into service where no macron was available).
Accordingly, in processing Maaori text in which vowel length is marked, the best scheme would be to map U+0100 (capital A with macron) and U+00C4 (capital A with diaeresis) to A,A, and so on for the other 9 upper and lower case vowels. But I can't do that with char_conversion.
Do I need to mention the Unicode compatibility mappings which map one character to several? Maybe I had better. Take the character U+00A8 "DIAERESIS", for example. Its compatibility mapping is "<compat> 0020 0308". A quick tally of the number of replacement characters in the Unicode 3.1.1 database yielded
Replacement length | Number of occurrences
---|---
1 | 3005
2 | 1585
3 | 360
4 | 62
5 | 15
6 | 2
8 | 1
18 | 1
This is of course a lower bound; the replacement rules in that file have to be applied recursively, so the average replacement length is somewhat longer.
In conclusion, the entire char_conversion/2 framework is too narrowly conceived, too little applied, and too dangerous in effect.
In order to deal effectively with Unicode (and it was the plain responsibility of the ISO Prolog committee to address this), some other means entirely will have to be found.
Unicode Standard Annex #15 defines four normalisation forms. Canonical Equivalence in Applications defines two more. We also need a "verbatim" normalisation form which acts as the identity function.
Name | Code | Source | What it is
---|---|---|---
verbatim | v | Here | No change
NFD | d | TR15 | Canonical decomposition
NFC | c | TR15 | Canonical decomposition then canonical composition
FCD | fcd | TN5 | "Fast C or D" decomposition
FCC | fcc | TN5 | FCC form
NFKD | kd | TR15 | Compatibility decomposition
NFKC | kc | TR15 | Compatibility decomposition then canonical composition
Canonical decomposition and (re)composition apply, in principle, to any character set containing both precomposed accented letters and floating diacriticals. Unicode was by no means the first such character set.
FCC and FCD forms are intermediate between NFC and NFD; they make sense whenever NFC and NFD make sense.
Compatibility decomposition applies, in principle, to any character set containing multiple representations of the "same" character. In particular, the Japanese zenkaku and hankaku are just such characters. In fact, compatibility decomposition is pretty much exactly what char_conversion/2 was supposed to be all about, except that it is by no means true that all compatibility mappings in all such character sets have single-character replacements. Basically, to really do the job that char_conversion/2 was supposed to do but doesn't, we need Normalisation Form KD or, even better, Normalisation Form KC.
We need two modules, one for each representation of characters. (Unicode-hacking life would be so much easier if we only had to deal with integers in the range 0..16'10FFFF. For example, while '\xD800\\xDC00\' is, according to Unicode rules, the encoding of one character, it is not a 'char' according to ISO Prolog rules.)
Some Prologs, notably SWI Prolog, make it about as convenient to process text represented by atoms as text represented by lists. Such Prologs could offer a third 'atom' module with basically the same interface.
The following type is shared by codes and chars:
:- type normalisation_form ---> v | c | d | fcc | fcd | kc | kd.
It is a matter for debate which normalisation forms should be supported by all Prolog systems. It is a practical fact that some normalisation forms should be supported by most Prolog systems. Whatever set of normalisation forms is eventually agreed on, implementors ought to be allowed to extend the set.
:- module(chars, [...]).

:- pred normalise(normalisation_form, chars, chars, chars).
:- normalise(NF, Xs, Ys0, Ys) when NF, Xs.
:- normalise(NF, Xs, Ys0, Ys) is semidet when ground(NF), ground(Xs).

normalise(NF, Xs, Ys0, Ys) :-
    Ys0\Ys is Xs in normalisation form NF.

normalise(NF, Xs, Ys) :-
    normalise(NF, Xs, Ys, []).

normalise(Xs, Ys) :-
    normalise(kd, Xs, Ys).

normalised_compare(NF, R, Xs, Ys) :-
    normalise(NF, Xs, Xn),
    normalise(NF, Ys, Yn),
    compare(R, Xn, Yn).

normalised_compare(R, Xs, Ys) :-
    normalised_compare(kd, R, Xs, Ys).

normalised(NF, Xs) :-
    normalise(NF, Xs, Xs).

normalised(Xs) :-
    normalise(Xs, Xs).

...
Unicode also offers a case folding standard. For details, see the discussion of CaseFolding.txt in Unicode Technical Report 44 and sections 3.13 (Default Case Algorithms), 4.2 (Case), and 5.18 (Case Mappings) of the Unicode 6.1 standard. The "Caseless Matching" subsection of 5.18 is our focus here: "Caseless matching is implemented using case folding, which is the process of mapping characters of different case to a single form, so that case differences in strings are erased. Case folding allows for fast caseless matches in lookups because only binary comparison is required. It is more than just conversion to lowercase. ... The CaseFolding.txt file in the Unicode Character Database is used to perform locale-independent case folding. ... The Unicode case folding algorithm is defined to be simpler and more efficient than case mappings. It is context-insensitive and language-independent (except ...). Caseless matching is only an approximation of the language-specific rules governing the strength of comparisons. ... In most environments, such as in file systems, text is not and cannot be tagged with language information. In such cases, the language-specific mappings must not be used. Otherwise, data structures such as B-trees might be built based on one set of case foldings and used based on a different set of case foldings. This discrepancy would cause those data structures to become corrupt. For such environments, a constant, language-independent, default case folding is required."
This case folding should be provided in Prolog. It is more appropriate for non-display purposes than the traditional convert-to-lower-case or convert-to-upper-case approaches.
The simplest interface I can think of is to extend the normalisation_form type with three more values:
:- type normalisation_form ---> v | c | d | fcc | fcd | kc | kd | ci | cic | cid.
where ci signifies Unicode case folding, cic signifies Unicode case folding followed by conversion to normalisation form NFKC, and cid signifies Unicode case folding followed by conversion to normalisation form NFKD. The cic and cid forms are explicitly suggested in section 5.18, without being named.
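For instance (my example; Unicode full case folding maps ß to ss, and every capital letter to its folded form):

?- normalised_compare(ci, R, "Straße", "STRASSE").
R = (=).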
What you can compare, you should probably be able to hash as well.
:- pred normalised_hash(normalisation_form, chars, integer, integer).
:- normalised_hash(NF, Xs, Modulus, H) when NF, Xs, Modulus.
:- normalised_hash(NF, Xs, Modulus, H) is semidet
       when ground(NF), ground(Xs), ground(Modulus).

normalised_hash(Xs, Modulus, H) :-
    normalised_hash(kd, Xs, Modulus, H).

For this predicate, Modulus must be a positive integer. A hash value in the range 1 to Modulus inclusive is returned. For any valid normalisation form NF it is required that
normalised_compare(NF, =, Xs, Ys) &
normalised_hash(NF, Xs, Modulus, HX) &
normalised_hash(NF, Ys, Modulus, HY)
    ==> HX = HY.
Please remember that these are specifications, not implementations. It is possible to implement normalised_compare/[3,4] so that they work in a single pass over their arguments without building any intermediate list. Similarly, it is possible to implement normalised/[1,2] and normalised_hash/[3,4] so that they work in a single pass over their string argument without building any intermediate list.

As specified above, normalised_compare/3 is defined only on character lists. It could be extended to all ground terms. I do not recommend this.
:- module(codes, [...]).

:- pred normalise(normalisation_form, codes, codes, codes).
:- normalise(NF, Xs, Ys0, Ys) when NF, Xs.
:- normalise(NF, Xs, Ys0, Ys) is semidet when ground(NF), ground(Xs).

normalise(NF, Xs, Ys0, Ys) :-
    Ys0\Ys is Xs in normalisation form NF.

normalise(NF, Xs, Ys) :-
    normalise(NF, Xs, Ys, []).

normalise(Xs, Ys) :-
    normalise(kd, Xs, Ys).

normalised_compare(NF, R, Xs, Ys) :-
    normalise(NF, Xs, Xn),
    normalise(NF, Ys, Yn),
    compare(R, Xn, Yn).

normalised_compare(R, Xs, Ys) :-
    normalised_compare(kd, R, Xs, Ys).

normalised(NF, Xs) :-
    normalise(NF, Xs, Xs).

normalised(Xs) :-
    normalise(Xs, Xs).

:- pred normalised_hash(normalisation_form, codes, integer, integer).
:- normalised_hash(NF, Xs, Modulus, H) when NF, Xs, Modulus.
:- normalised_hash(NF, Xs, Modulus, H) is semidet
       when ground(NF), ground(Xs), ground(Modulus).

normalised_hash(Xs, Modulus, H) :-
    normalised_hash(kd, Xs, Modulus, H).

...
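As an illustration of the intended behaviour (mine, not the draft's): 0'é is U+00E9, whose canonical decomposition is e followed by COMBINING ACUTE ACCENT (U+0301, written 16'0301 below):

?- normalise(d, [0'é], Xs).
Xs = [0'e, 16'0301].

?- normalised_compare(c, R, [0'e, 16'0301], [0'é]).
R = (=).

?- normalised(c, [0'é]).
true.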
As specified above, normalised_compare/3 is defined only on character code lists. It could not be extended to all ground terms, because there is no general way of telling which lists of positive integers are supposed to be Unicode strings and which are not.
There needs to be a stream property normalisation_form(NF) which can be used to ensure that characters delivered to a Prolog application are already normalised, and that characters generated by a Prolog application are appropriately normalised before being delivered to their destination.
Several Prolog implementations (Xerox Quintus Prolog and SWI Prolog amongst them) have a "string" data type. Such implementations should have a string module analogous to codes and chars. See section 10.
In the good old days, it was sufficient to have a library(ctypes) module. With ISO's thrice-accursed single-character-atom representation for characters, we need two modules, as noted in the previous section. So we might as well put the character classification predicates in 'chars' and 'codes'. The classification predicates in each module will have the same name and basically the same meaning, but different argument types.
The ISO Prolog standard uses the following character groups, for which I provide these classification predicates: is_space/1, is_solo/1, is_quote/2, is_csym/1, is_digit/[1,2], is_csymf/1, is_alpha/1, is_lower/1, is_upper/1, and is_symbol/1.
Unicode defines the term "Graphic Character" quite precisely: “A character with the General Category of Letter (L), Combining Mark (M), Number (N), Punctuation (P), Symbol (S), or Space Separator (Zs).” This makes the term "graphic character" a most unfortunate choice. For example, the comma is a Graphic Character according to Unicode, but it isn't a graphic character according to ISO Prolog (or to isgraph() or iswgraph() in ISO C). ISO Prolog "graphic characters" were previously known as "symbol characters" or "operator characters" or "agglutinating characters".
That's a hint. We want to be able to do what C does, plus we want to be able to cover the categories used in Prolog itself.
Fortunately, there's a freely downloadable table classifying over a hundred thousand characters, actively maintained by someone else. Even more fortunately, the categories it uses can be mapped onto the categories we want quite easily. The table is part of the Unicode data base.
I propose a single base predicate, from which the others are easily derived.
:- type unicode_category
   ---> lu | ll | lt | lm | lo
      | mn | mc | me
      | nd | nl | no
      | pc | pd | ps | pe | pi | pf | po
      | sm | sc | sk | so
      | zs | zl | zp
      | cc | cf | cs | co | cn.
These codes are simply the Unicode category names converted to lower case for convenience in Prolog.
:- pred is_unicode(code, unicode_category).
:- is_unicode(C, K) when C ; K.
:- is_unicode(C, K) is semidet when ground(C).
:- is_unicode(C, K) is bounded when true.

is_unicode(C, K) :-
    between(0, 16'10FFFF, C),
    UnicodeData.txt says C has category K,
    K \== cn.    % not assigned

:- pred is_digit(code, integer).
:- is_digit(C, V) when C ; V.
:- is_digit(C, V) is semidet when ground(C).
:- is_digit(C, V) is bounded when true.

is_digit(C, V) :-
    is_unicode(C, nd),
    UnicodeData.txt gives V as the decimal value of C.
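A few illustrative queries, with category data as given by UnicodeData.txt:

?- is_unicode(0'A, K).
K = lu.                 % Letter, uppercase

?- is_unicode(0'é, K).
K = ll.                 % Letter, lowercase

?- is_digit(0'٧, V).    % ARABIC-INDIC DIGIT SEVEN, U+0667
V = 7.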
The Unicode standard and associated reports explain implementation techniques for this. Even with just one byte per character, a simple implementation would take 1.1 megabytes. Using multilevel tables can compress this dramatically, not least because only 3 of the 17 planes currently have any characters at all.
is_unicode(C) :-
    is_unicode(C, _).

is_latin1(C, K) :-
    between(0, 16'FF, C),
    is_unicode(C, K).

is_latin1(C) :-
    is_latin1(C, _).

is_ascii(C, K) :-
    between(0, 16'7F, C),
    is_unicode(C, K).

is_ascii(C) :-
    is_ascii(C, _).

is_alnum(C) :-
    is_unicode(C, K),
    memberchk(K, [nd,lu,ll,lt,lm,lo]).

is_alpha(C) :-
    is_unicode(C, K),
    memberchk(K, [lu,ll,lt,lm,lo]).

is_blank(C) :-    % known as is_white/1 in Quintus Prolog
    is_unicode(C, zs).

is_bracket(L, R) :-
    is_unicode(L, ps),
    is_unicode(R, pe),
    R is the closing bracket corresponding to L.

is_cntrl(C) :-    % C0 and C1 controls + formats
    is_unicode(C, K),
    memberchk(K, [cc,cf]).

is_csym(C) :-
    is_unicode(C, K),
    memberchk(K, [pc,nd,lu,ll,lt,lm,lo]).

is_csymf(C) :-
    is_unicode(C, K),
    memberchk(K, [pc,lu,ll,lt,lm,lo]).

is_digit(C) :-
    is_unicode(C, nd).

is_endfile(-1).

is_endline(C) :-
    is_unicode(C, K),
    memberchk(K, [zl,zp]).

is_graph(C) :-
    is_unicode(C, K),
    nonmember(K, [zs,zl,zp,cc,cf]).

is_layout(C) :-    % assumed definition: spaces, separators, controls, formats
    is_unicode(C, K),
    memberchk(K, [zs,zl,zp,cc,cf]).

is_lower(C) :-
    is_unicode(C, ll).

is_print(C) :-
    is_unicode(C, K),
    nonmember(K, [zl,zp,cc,cf]).

is_punct(C) :-
    is_unicode(C, K),
    memberchk(K, [pc,pd,ps,pe,pi,pf,po,sm,sc,sk,so]).

is_title(C) :-
    is_unicode(C, lt).

is_quote(0''', 0''').
is_quote(0'", 0'").
is_quote(0'`, 0'`).
is_quote(L, R) :-
    is_unicode(L, pi),
    is_unicode(R, pf),
    R is the final quote corresponding to L.

is_space(C) :-
    is_unicode(C, K),
    memberchk(K, [zs,zl,zp]).

is_upper(C) :-
    is_unicode(C, lu).

is_xdigit(Base, C, V) :-
    between(2, 36, Base),
    Base1 is Base - 1,
    between(0, Base1, V),
    (   V >= 10 ->
        (   C is 0'A - 10 + V
        ;   C is 0'a - 10 + V
        )
    ;   is_digit(C, V)
    ).

is_xdigit(C, V) :-
    is_xdigit(16, C, V).

is_xdigit(C) :-
    is_xdigit(16, C, _).
There are two debatable decisions here. In this draft I have decided that the "Other, Format" characters of Unicode should be grouped with the "Other, Control" characters for is_cntrl/1, on the grounds that these things fairly obviously are control characters and that if you just want the C0 and C1 controls it is easy enough to test for 'cc' yourself. Indeed, is_ascii(C, cc) will check for the ASCII (C0) controls. I have also decided that the "csymf" characters should include all the "Punctuation, Connector" characters, not just the LOW LINE "_".
I have read through rather more material about wide character classification in C than was enjoyable, and found stunningly little guidance about how non-ASCII characters should be classified, so I must emphasise that the definitions above are a draft.
The is_bracket/2 and is_quote/2 predicates do not correspond to anything in C, but I found them very useful in Quintus Prolog and equivalents very useful in a text editor. Unicode 3.1.1 adds only single quotes, double quotes, single guillemets, double guillemets, and single and double high reversed-9 quotes to the Pi category, so is_quote/2 isn't much of a burden. For the brackets we'd need a somewhat larger table, but there are plenty of nice looking brackets in Unicode that it would be nice to recognise.
The "solo" characters in ISO Prolog are "!,;|%()[]{}". What is the appropriate generalisation for Unicode?
In ASCII, "()[]{}" is the full set of brackets. So Unicode solo characters should include all Unicode brackets. The other characters are all in the Po category, but there are many characters in the Po category which should be symbol characters, not solo characters. The simplest rule seems to be that a non-bracket character should count as a solo character if and only if it is in the Po category and is a script or compatibility variant of one of the existing non-bracket solo characters. In Unicode 3.1.1, that's just the following 24 characters:
It is possible to recognise this set of characters by pattern matching (carefully!) on the names, but since I do not propose access to the names, this has to be expressed by means of a table.
is_solo(C) :-
    is_unicode(C, K),
    memberchk(K, [po,ps,pe]),
    (   K = po ->
        memberchk(C, [16'0021,16'0025,16'002C,16'003B,16'055C,16'055D,
                      16'060C,16'061B,16'066A,16'1363,16'1364,16'1802,
                      16'1808,16'3001,16'FE50,16'FE51,16'FE54,16'FE57,
                      16'FE6A,16'FF01,16'FF05,16'FF0C,16'FF1B,16'FF64])
    ;   true
    ).
The so-called "meta" characters in ISO Prolog are the ASCII umlaut, acute accent, grave accent, and reverse solidus. This is a rather puzzling grouping. Three of them are used for quoting string-like tokens, one of them isn't. Unlike Lisp, the reverse solidus has no "meta" semantics in Prolog except as part of string-like tokens; when it occurs unquoted it is just a plain symbol character.
Since we already have is_quote/2, which covers the characters that begin string-like tokens, I do not think that an is_meta/1 character classification predicate is appropriate.
The confusingly named "graphic" characters of ISO Prolog are "#$&*+-./:<=>?@^~". This is especially confusing, because what we want to know is not "which characters stand for themselves in quoted tokens" (the answer is: anything except the corresponding closing quote, the reverse solidus, or a Zl or Zp character), but "which characters can be used in unquoted atoms that look vaguely like mathematical operators". That is, we need a character class that includes these characters and also includes the reverse solidus, for the sake of ISO Prolog atoms such as \=, \==, =\=, \+, /\, \/, and \ . This is in fact the set "graphic token char" of section 6.4.2.
If we look these characters up in the Unicode data base, we find that they fall into four groups: Pd (dash punctuation), Po (other punctuation), Sc (currency symbols), and Sm (mathematical symbols).
These groupings actually fit our intuitions about which characters should be allowed in unquoted atoms very well; when extended to Latin 1 they yield exactly the set of characters my Prolog-in-C parser has accepted since Latin 1 came out. The only problem is that some Po characters are not symbol characters, but we already have a recogniser for those, so:
is_symbol(C) :-
    is_unicode(C, K),
    memberchk(K, [pd,po,sc,sm]),
    \+ is_solo(C).
If Prolog uses Unicode internally, then we don't have to worry much about locales for character classification, so I'll leave this section empty for a while.
When processing text in any programming language, it is often convenient to skip or to pick up a run of characters belonging to a given set of characters, or to test whether a character belongs to a given set.
The traditional approach was to use a bitmap with one bit per possible character. This is compact and fast for 8-bit character sets. Alas, this is not practical for Unicode. In fact, there are several ways to implement Unicode character sets, and it's not our job to choose between them, only to warn that single character lookup should be fast.
Jocelyn Ireson-Paine provided a “sets as intervals” library, used for working with sets of characters, which is convenient and compact but takes time proportional to the number of intervals. Turning that into a binary search tree reduces the time to logarithmic in the number of intervals, but this is still too slow for high-frequency lookup.
There needs to be a clear separation between building a character set, where some human-oriented expression form might be a good idea, and using a character set, where we want an unspecific ground term that can work fast and does not have to look pretty. Here is one possible design.
:- type usable_character_set
   ---> uniset(not further specified).

:- type humane_character_set
   ---> {unicode_category}
      | {codes}
      | {[char|chars]}
      | humane_character_set + humane_character_set    % union
      | humane_character_set - humane_character_set    % difference
      | humane_character_set * humane_character_set    % intersection
      | not(humane_character_set)
      | {string}                                       % if strings are available
      | {usable_character_set}.

:- pred compile_uniset(humane_character_set, usable_character_set).
:- compile_uniset(H, U) when ground(H).

:- pred code_in_uniset(code, usable_character_set).
:- code_in_uniset(C, U) when U.

:- pred char_in_uniset(char, usable_character_set).
:- char_in_uniset(C, U) when U.
This should certainly be extended to include the derived character sets in the Unicode Database as named sources.
For example, suppose we want to handle "identifiers" which start with an Lu, Ll, or Lt character, and continue with any number of Lu, Ll, Lt, Nd, "_", "-", or "." characters. (A simplification of XML identifiers.)
:- dynamic my_sets/2.

:- initialization((
       compile_uniset({lu} + {ll} + {lt}, Start),
       compile_uniset({Start} + {nd} + {"_-."}, Rest),
       assert(my_sets(Start, Rest))
   )).

identifier([S|R]) -->
    [S],
    { my_sets(Start, Rest) },
    { code_in_uniset(S, Start) },
    rest_identifier(R, Rest).

rest_identifier([C|Cs], Rest) -->
    [C],
    { code_in_uniset(C, Rest) },
    !,
    rest_identifier(Cs, Rest).
rest_identifier([], _) --> [].
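Used with phrase/3, and assuming the initialization goal has run, one would expect:

?- phrase(identifier(Id), "Ident-1 rest", Rest).
Id = "Ident-1",
Rest = " rest".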
I beg your forgiveness for using the Pascal set operators + - * instead of Unicode's U+222A UNION (∪), U+2216 SET MINUS (∖), and U+2229 INTERSECTION (∩), but while the set minus character is not the same character as backslash, it looks a lot like it, and it still isn't as easy as it might be to type these symbols. (On a Mac running Mac OS X 10.7.5, clicking on the intersection sign in Edit|Special Characters takes you to union, and vice versa. This really doesn't help.)
In ASCII, every letter was either a lower case letter with a unique context-independent upper case letter or an upper case letter with a unique context-independent lower case letter, and this was the only variation between letter forms.
In the ISO 8859 family of 8-bit character sets, there could be letters which were neither upper nor lower case, and there could be letters which did belong to a case but did not have an opposite case equivalent (there are two of those in Latin-1), and there could be letters with positional variants (there are five of those in 8859-8). Worse still, there could be letters whose opposite case form was not the same length (ß).
Of course it's well known that Greek has two lower case forms for the letter "s" both corresponding to the same upper case form.
Interestingly enough, the Unicode 3.1.1 database defines case mappings, and its case mappings are one-to-one. The letter "ß" is handled by not mapping it to anything else at all, and the Greek letters are handled by mapping the capital "S" to the medial (not the final) letter "s", regardless of the context.
The Unicode standard points out that case mapping is locale-sensitive, and gives the example of Turkish.
Accordingly, we do need some kind of locale objects. This is a very preliminary draft here!
:- type character_locale.    % not specified

:- pred character_locale(atom, character_locale).
:- character_locale(Name, Locale) when Name.
:- character_locale(Name, Locale) is semidet when ground(Name).

character_locale(Name, Locale) :-
    Name is the name of a locale and
    Locale is a private representation of its character facet.
One special locale object is 'unicode', which represents the Unicode case conversion tables. We might also need 'posix' and/or 'C' locales for this. A locale name like 'en_NZ' might or might not be acceptable as a locale object at the implementor's choice.
We get two sets of case conversion predicates. None of them takes a single character as argument or delivers a single character as result. We have, in general, no right to expect that lc(x++y)=lc(x)++lc(y).
The arity four interfaces were said to be semidet because the Locale argument might have the right form to be a locale name but fail to name a known locale. After recent experience, discovering that my preferred locale is known as en_NZ.ISO8859-1 in some flavours of UNIX, as en_NZ.iso88591 in Linux, and is missing entirely from others, I believe that this was a mistake. An unknown locale must be reported as an exception, which in old Quintus-speak would have been called an existence error.
The arity four interfaces are still said to be semidet because the "outputs" might be guessed incorrectly, and that should be a simple failure, not an error. In all cases, if the Xs argument is not a list of valid character codes, that's an error deserving an exception; but there is nothing about a list of valid character codes as such that can cause these predicates to fail. In particular, passing a string that contains no upper- or title-case letters to locale_to_lower/[3,4] or to_lower/[2,3] is expected to return the original string. This is just like the way that tolower(ch) in C returns ch unchanged if ch is not an upper case letter.
:- pred locale_to_lower(character_locale, codes, codes, codes).
:- locale_to_lower(L, Xs, Ys0, Ys) when L, Xs.
:- locale_to_lower(L, Xs, Ys0, Ys) is semidet when ground(L), ground(Xs).

locale_to_lower(Locale, Xs, Ys0, Ys) :-
    Locale is a character locale and Xs is a string,
    Ys0\Ys is Xs converted to lower case according to Locale.

locale_to_lower(Locale, Xs, Ys) :-
    locale_to_lower(Locale, Xs, Ys, []).

to_lower(Xs, Ys0, Ys) :-
    locale_to_lower(unicode, Xs, Ys0, Ys).

to_lower(Xs, Ys) :-
    to_lower(Xs, Ys, []).

:- pred locale_to_title(character_locale, codes, codes, codes).
:- locale_to_title(L, Xs, Ys0, Ys) when L, Xs.
:- locale_to_title(L, Xs, Ys0, Ys) is semidet when ground(L), ground(Xs).

locale_to_title(Locale, Xs, Ys0, Ys) :-
    Locale is a character locale and Xs is a string,
    Ys0\Ys is Xs converted to title case according to Locale.

locale_to_title(Locale, Xs, Ys) :-
    locale_to_title(Locale, Xs, Ys, []).

to_title(Xs, Ys0, Ys) :-
    locale_to_title(unicode, Xs, Ys0, Ys).

to_title(Xs, Ys) :-
    to_title(Xs, Ys, []).

:- pred locale_to_upper(character_locale, codes, codes, codes).
:- locale_to_upper(L, Xs, Ys0, Ys) when L, Xs.
:- locale_to_upper(L, Xs, Ys0, Ys) is semidet when ground(L), ground(Xs).

locale_to_upper(Locale, Xs, Ys0, Ys) :-
    Locale is a character locale and Xs is a string,
    Ys0\Ys is Xs converted to upper case according to Locale.

locale_to_upper(Locale, Xs, Ys) :-
    locale_to_upper(Locale, Xs, Ys, []).

to_upper(Xs, Ys0, Ys) :-
    locale_to_upper(unicode, Xs, Ys0, Ys).

to_upper(Xs, Ys) :-
    to_upper(Xs, Ys, []).
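For example, using the Unicode tables and, hypothetically, 'tr_TR' as the name of a Turkish locale (locale names being implementation-defined, as noted above):

?- to_upper("i", U).          % Unicode default tables
U = "I".

?- character_locale(tr_TR, Turkish),
   locale_to_upper(Turkish, "i", U).
U = "İ".                      % Turkish dotted capital I, U+0130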
The title of this section refers to the corresponding section of the Unicode Standard, where rules are given for finding the boundaries of characters (grapheme clusters), words, sentences, and lines.
In Latin 1, there is a one to one correspondence between the thing you might call a `char' in a program and the thing a normal person looking at a printed text might call a `character'. The imaginary rock band ``Death-Töngue'' (from ``Bloom County'') clearly has 12 whatsits in its name on either construal. If we step through the Prolog list (of Latin 1 codes) "Death-Töngue" one element at a time, we shall visit one written character at a time.

This neat equivalence did not hold in ASCII, although that is not widely known. In ASCII, it was legitimate to encode <ö> as <o,BS,"> or <",BS,o>. This accounts for some of the compromise character shapes: U+0022 QUOTATION MARK had to serve for opening double quote, closing double quote, and diaeresis; U+0027 APOSTROPHE had to serve for apostrophe, right quote, and acute accent; U+002C COMMA had to serve for comma and cedilla; and so on. The result is that except for US English, the relationship between a sequence of characters and a sequence of chars was neither one to one nor unique.
Unicode brings back the complexities of ASCII, doubled and redoubled. There is in principle no upper bound to the number of diacritical characters which may be attached to a base character, hence no upper bound on the number of Unicode code elements which may be required to represent a single scripteme.
I propose the following predicates for inclusion in both the 'chars' module and the 'codes' module. I illustrate only the 'codes' versions, but the 'chars' versions are obvious.
:- pred next_character(normalisation_form, codes, codes, codes).
:- next_character(NF, Xs, Ys0, Ys) when NF, Ys0.
:- next_character(NF, Xs, Ys0, Ys) is semidet
       when ground(NF), ground(Xs), ground(Ys0).
:- next_character(NF, Xs, Ys0, Ys) is semidet when NF==v, ground(Ys0).
:- next_character(NF, Xs, Ys0, Ys) is bounded when ground(Ys0).

next_character(NF, Xs, Ys0, Ys) :-
    append(Zs, Ys, Ys0),
    Zs \== [],
    normalised_compare(NF, =, Xs, Zs),
    Zs is the longest prefix of Ys0 not containing a character boundary.

next_character(Xs, Ys0, Ys) :-
    next_character(v, Xs, Ys0, Ys).

next_character(Xs, Ys0) :-
    next_character(Xs, Ys0, _).

:- pred next_word(normalisation_form, codes, codes, codes).
:- next_word(NF, Xs, Ys0, Ys) when NF, Ys0.
:- next_word(NF, Xs, Ys0, Ys) is semidet
       when ground(NF), ground(Xs), ground(Ys0).
:- next_word(NF, Xs, Ys0, Ys) is semidet when NF==v, ground(Ys0).
:- next_word(NF, Xs, Ys0, Ys) is bounded when ground(Ys0).

next_word(NF, Xs, Ys0, Ys) :-
    append(Zs, Ys, Ys0),
    Zs \== [],
    normalised_compare(NF, =, Xs, Zs),
    Zs is the longest prefix of Ys0 not containing a word boundary.

next_word(Xs, Ys0, Ys) :-
    next_word(v, Xs, Ys0, Ys).

next_word(Xs, Ys0) :-
    next_word(Xs, Ys0, _).

:- pred next_sentence(normalisation_form, codes, codes, codes).
:- next_sentence(NF, Xs, Ys0, Ys) when NF, Ys0.
:- next_sentence(NF, Xs, Ys0, Ys) is semidet
       when ground(NF), ground(Xs), ground(Ys0).
:- next_sentence(NF, Xs, Ys0, Ys) is semidet when NF==v, ground(Ys0).
:- next_sentence(NF, Xs, Ys0, Ys) is bounded when ground(Ys0).

next_sentence(NF, Xs, Ys0, Ys) :-
    append(Zs, Ys, Ys0),
    Zs \== [],
    normalised_compare(NF, =, Xs, Zs),
    Zs is the longest prefix of Ys0 not containing a sentence boundary.

next_sentence(Xs, Ys0, Ys) :-
    next_sentence(v, Xs, Ys0, Ys).

next_sentence(Xs, Ys0) :-
    next_sentence(Xs, Ys0, _).

:- pred next_line(normalisation_form, codes, codes, codes).
:- next_line(NF, Xs, Ys0, Ys) when NF, Ys0.
:- next_line(NF, Xs, Ys0, Ys) is semidet
       when ground(NF), ground(Xs), ground(Ys0).
:- next_line(NF, Xs, Ys0, Ys) is semidet when NF==v, ground(Ys0).
:- next_line(NF, Xs, Ys0, Ys) is bounded when ground(Ys0).

next_line(NF, Xs, Ys0, Ys) :-
    append(Zs, Ys, Ys0),
    Zs \== [],
    normalised_compare(NF, =, Xs, Zs),
    Zs is the longest prefix of Ys0 not containing a line boundary.

next_line(Xs, Ys0, Ys) :-
    next_line(v, Xs, Ys0, Ys).

next_line(Xs, Ys0) :-
    next_line(Xs, Ys0, _).
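For instance, with the default Unicode word boundary rules one would expect (my illustration):

?- next_word(v, W, "Hello, world", Rest).
W = "Hello",
Rest = ", world".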
call/N

The ISO Prolog standard defines call/1 in section 7.8.3. Looking at the definition is like a slap in the face; the committee rejected my denotational semantics for Prolog on the grounds that it was written in Pascal, then when I pointed out that on the contrary it was written in a pure functional language that wasn't anything like Pascal, it was rejected on the grounds that it was too "implementation-oriented". If that doesn't describe the definition of call/1, I don't know what does.
If call/1 were not in the language, we could define it thus:

:- dynamic $call/1.

call(Goal) :-
    asserta(($call(Goal) :- retract(($call(_) :- _)), !, Goal)),
    $call(Goal).
In the Prolog type checking paper that Alan Mycroft and I wrote ["A polymorphic type system for Prolog", Artificial Intelligence 1984, volume 23, pages 295-307, a revision of work presented at the Logic Programming Workshop at Albufeira in 1983], we proposed the call/N family, in order to permit type-checking Prolog higher-order code. This family was also described in PLSTD.MSS, last revised on 23 July 1984, posted to comp.lang.prolog on 12 August 1984, and included in the BSI register of documents relating to the BSI Prolog standard as PS/6. Since then, some Prologs have implemented this feature in the library (notably DEC-10 Prolog, C Prolog, Quintus Prolog) with an upper bound on N, and some Prologs have built it into their language (SWI Prolog and, I believe, NU Prolog).
Roughly speaking, one expects

call(p(X1,X2,X3))   :- p(X1, X2, X3).
call(p(X1,X2), X3)  :- p(X1, X2, X3).
call(p(X1), X2, X3) :- p(X1, X2, X3).
call(p, X1, X2, X3) :- p(X1, X2, X3).
Indeed, one way to implement the call/N family is to start with a skeleton

call(P, ...) :- \+ callable(P), !, error(uncallable term).
[special purpose code for control structures if N <= 2]
[gap]
call(P, ...) :- error(undefined predicate).
and then, whenever a predicate p/N is defined, automatically add clauses to call/1, ..., call/N+1 in the appropriate gaps. In a WAM-based system, this can be done by exploiting switch-on-term without any choice-points being created, so call/N in the normal case is just a hash table look-up, a bit of argument shuffling, and a jump. The space requirement is O(P·A²), where P is the number of predicates and A is the arity. Since arities are usually fairly low, this is in effect linear in the size of the program. I have tried this, and found it to be a pleasantly efficient (as well as obvious) way to implement call/N.
There are other ways to implement call/N which are nearly as efficient, and which have only O(P) space overhead: the hash table needed to map from a predicate's functor to its address. It's worth noting that for a fixed k, call/k itself might be automatically generated when needed, in a manner similar to autoloading. Because autoloading is so useful, predicate_property/2 should be allowed to fail in general for predicates that are not defined yet but should not be user-redefined.
This family of predicates has proven its usefulness. It has been nearly 30 years. It's time they were in the community standard.
:- pred call(void(T1), T1).
:- pred call(void(T1,T2), T1, T2).
:- pred call(void(T1,T2,T3), T1, T2, T3).
:- pred call(void(T1,T2,T3,T4), T1, T2, T3, T4).
:- pred call(void(T1,T2,T3,T4,T5), T1, T2, T3, T4, T5).
:- pred call(void(T1,T2,T3,T4,T5,T6), T1, T2, T3, T4, T5, T6).

% Sample implementation.
% Beware: this is highly inefficient.  Don't do it this way.

call(P, Y1) :-
    P =.. L0, append(L0, [Y1], L1),
    Q =.. L1, call(Q).
call(P, Y1, Y2) :-
    P =.. L0, append(L0, [Y1,Y2], L2),
    Q =.. L2, call(Q).
call(P, Y1, Y2, Y3) :-
    P =.. L0, append(L0, [Y1,Y2,Y3], L3),
    Q =.. L3, call(Q).
call(P, Y1, Y2, Y3, Y4) :-
    P =.. L0, append(L0, [Y1,Y2,Y3,Y4], L4),
    Q =.. L4, call(Q).
call(P, Y1, Y2, Y3, Y4, Y5) :-
    P =.. L0, append(L0, [Y1,Y2,Y3,Y4,Y5], L5),
    Q =.. L5, call(Q).
call(P, Y1, Y2, Y3, Y4, Y5, Y6) :-
    P =.. L0, append(L0, [Y1,Y2,Y3,Y4,Y5,Y6], L6),
    Q =.. L6, call(Q).
meta_predicate

As soon as you have call/1, you have a problem. Cross-referencers don't work. You can fix this for the specific cases of call/1, findall/3, bagof/3, setof/3, and a few others, but as soon as a Prolog programmer writes

gcc(Goal) :- \+ \+ Goal.    % garbage-collecting call

the cross-referencer stops working. The C call graph tool on my workstation is similarly limited; if you have a function which is passed to another function and not called directly, it thinks that function is never called. That's not what I call a usable call graph tool; it is worse than having no call graph at all.
Unfortunately, the ISO committee considered meta_predicate to be about modules. It is, but only accidentally. The real function of meta_predicate declarations is to provide just enough type information for a cross-referencer to work. The idea of meta_predicate declarations was to start with existing mode declarations (so the arguments would be +, -, or ?) and change some of the arguments to something else to say "this argument is called".

Had call/1 been the only game in town, something like 'call' would have been adequate, so we'd have had

:- meta_predicate
       findall(?, call, -),    % not final version
       gcc(call).              % not final version
But the call/N family came into Prolog about the time the DEC-10 cross-referencer was written, and replaced prior code that did things like

% actual code from APPLIC.PL in the DEC-10 library
maplist(_, [], []).
maplist(Pred, [Old|Olds], [New|News]) :-
    apply(Pred, [Old,New]),
    maplist(Pred, Olds, News).
Very often, when an argument is passed to a meta-predicate, it is a closure, not a goal; the callee will supply more arguments (at the right) before calling it. This is information that a cross-referencer needs.
So, quite independently of any module system, we need meta-predicate declarations of this form:
meta_predicate_directive -->
    [(:-),(meta_predicate)],
    meta_predicate_declarator,
    rest_meta_predicate_declarators.

rest_meta_predicate_declarators --> ['.'].
rest_meta_predicate_declarators -->
    [','],
    meta_predicate_declarator,
    rest_meta_predicate_declarators.

meta_predicate_declarator --> atom(_).
meta_predicate_declarator -->
    atom(_),
    ['('],
    meta_predicate_argument,
    rest_meta_predicate_arguments.

rest_meta_predicate_arguments --> [')'].
rest_meta_predicate_arguments -->
    [','],
    meta_predicate_argument,
    rest_meta_predicate_arguments.

meta_predicate_argument --> [+].
meta_predicate_argument --> [-].
meta_predicate_argument --> [?].
meta_predicate_argument --> [:].    % for modules
meta_predicate_argument --> integer(N), {N >= 0}.
The meta_predicate arguments (+), (-), and (?) are ignored by the system, but may be processed by a mode inference program like Chris Mellish's, and indicate the intended use of the corresponding arguments.
The meta_predicate argument (:) is only useful when there is a module system, otherwise it is just like (+). It says that the corresponding argument will not be treated as a goal or closure, but should be subject to module name expansion anyway.
As a mode, the meta_predicate argument N has the same force as (+). But N tells the cross-referencer that the corresponding argument is to be a goal that is missing its rightmost N arguments.
With the aid of this declaration, we can write
:- meta_predicate
       findall(?, 0, ?),
       gcc(0),
       maplist(2, ?, ?).

maplist(_, [], []).
maplist(P, [X|Xs], [Y|Ys]) :-
    call(P, X, Y),
    maplist(P, Xs, Ys).
The ISO committee not only got hold of the wrong end of the stick, they broke it: they named the directive metapredicate, which is downright ugly, and its annotations say nothing about how many arguments a meta-called argument is missing. The result is that the ISO metapredicate directives are useless for cross-referencers. They are best regarded as obsolete.
But what if you don't care about cross-referencing? I guess you've never used Masterscope for Lisp, or cscope(1) for C, or the Smalltalk IDE, or ..., if you can ask such a question. Come to think of it, if you haven't used a reasonably reliable cross-referencer for Prolog, you don't realise what a useful debugging tool it can be. But let's suppose you don't care about cross-referencing, and you don't care about the added documentation value from stating the number of missing arguments. Is there any reason why you should care about meta_predicate?
Yes. From its inception (which I can speak about with authority, because I invented it), the intention was that the meta_predicate directive would licence a nonstandard translation of meta-calls. Take maplist/3 for example: maplist(append(Front), Xs, Ys) can be translated as maplist(<{address of append/3}, Front>, Xs, Ys), where <Address,Arg1,...> is a special kind of term that looks like a compound term, but has an address where the functor should be. This permits an implementation of call/3 which spreads the arguments and jumps straight to the given address. This eliminates the space overhead for meta-calls (there is no longer a need for a table of the addresses of all predicates; the compiler has arranged for the addresses of just the meta-called predicates to be available directly).
All right, all right; so that would also make cross-module meta-calls as efficient as any other meta-calls.
There is another thing that meta_predicate
does,
and it is related to the possibility of unusual implementations
of meta-calls, but it has a rather more profound theoretical reason.
If T1 and T2 are two terms representing goals, then T1 = T2 is,
strictly speaking, a second-order unification (or worse). But it
isn't implemented that way. In order to avoid the obvious soundness
and completeness problems, we want to check at compile time that
two meta-terms are unified only if at least one of them is
an unbound variable. This means that a meta-argument of a predicate
should only be a variable in the head, and that it should only be
unified with another variable, and that it should only be passed to
a meta-argument of some predicate. This is a type checking problem,
and it is again a very useful debugging tool. But you can't even
get started on it without meta_predicate; the ISO metapredicate abortion is quite useless for the job.
So even if we don't include modules in the community standard,
and even if we don't include call/N, we
still need either meta_predicate
or a richer
type declaration/checking system.
As David H. D. Warren showed many years ago, we don't strictly speaking need lambda-expressions in Prolog; ordinary closures suffice quite often, and when they don't we can always write an extra predicate. But then, lambda-lifting shows that you don't strictly speaking need lambda-expressions in functional languages either; you can always move the thing out to top level, give it a name, and partially apply it at the original place of use.
The rest of this section is temporarily removed until I've had time to digest Ulrich Neumerkel's proposal. See his proposal and his sample implementation.
meta_predicate directives, with 0 as a meta-argument annotation distinct from ':', are of great practical use. A non-module compiler can simply ignore such directives. A module compiler that cannot exploit the extra information can simply treat them as alternative syntax for metapredicate. The call/N family likewise: these predicates are trivial to implement (I do not say trivial to implement efficiently, but even that is not particularly hard) and very useful.
We can derive many of the operations in the Quintus Prolog library (which is a revision and extension of the DEC-10 Prolog library) by projection from a small number of predicates. These "base" predicates are specified in some detail, the specifications of the "derived" predicates are then obvious consequences. I do not claim that the "base" predicates are "basic" in any sense, only that they are "universal relations" from which a range of more useful "views" can be projected.
"Base" predicates have been chosen solely for convenience in
specification. I do not say that they are "basic" in any
way. They are generally harder to understand than some of their
derived predicates. They are not, as a rule, a good implementation
basis. (For example, it would be very silly to implement
append/3
the way it is specified here.) What makes a
predicate a "base" predicate is that it makes a good base for
specification, so that details of types and so on have to be
repeated as little as possible.
The notion of "projection" is closely related to the notion of "bootstrapping" (did Mrs Malaprop choose that word?) in the ISO Prolog standard. Many (but by no means all) of the "bootstrapped" predicates in the ISO Prolog standard are projections. Predicate P is a projection of predicate Q if and only if P is not only a special case of Q, but has its arguments in the same order as Q. The idea is that the strong semantic link between P and Q should make it easy for a programmer to memorise the argument order of Q and thereafter be able to correctly predict the argument order of P without any additional memory burden.
append_append_length/7
If any predicates have universal acceptance in Prolog, they must be append/3, member/2, and length/2. They are all projections of a single predicate.
:- pred append_append_length(list(X), list(X), list(X),
                             list(Y), list(Y), list(Y), integer).
:- append_append_length(A, B, AB, C, D, CD, N)
   when A ; AB ; C ; CD ; N.
:- append_append_length(A, B, AB, C, D, CD, N) is semidet
   when list_skel(A) ; list_skel(C) ; nonvar(N).
:- append_append_length(A, B, AB, C, D, CD, N) is bounded
   when list_skel(AB) ; list_skel(CD).

% Sample implementation.
append_append_length(A, B, AB, C, D, CD, N) :-
    (   var(N) ->
        $append_append_length(A, B, AB, C, D, CD, 0, N)
    ;   integer(N), N >= 0 ->
        $append_append_length(A, B, AB, C, D, CD, N)
    ;   $error
    ).

$append_append_length([], B, B, [], D, D, N, N).
$append_append_length([X|A], B, [X|AB], [Y|C], D, [Y|CD], N0, N) :-
    N1 is 1 + N0,
    $append_append_length(A, B, AB, C, D, CD, N1, N).

$append_append_length(A, B, AB, C, D, CD, N0) :-
    (   N0 =:= 0 ->
        A = [], B = AB, C = [], D = CD
    ;   N1 is N0 - 1,
        A = [X|A1], AB = [X|AB1],
        C = [Y|C1], CD = [Y|CD1],
        $append_append_length(A1, B, AB1, C1, D, CD1, N1)
    ).
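To make the relation concrete, here is how a query should behave under the sample implementation above; the particular lists are just for illustration:

?- append_append_length(A, B, [1,2,3], C, D, [a,b,c,d], N).
A = [], B = [1,2,3], C = [], D = [a,b,c,d], N = 0 ;
A = [1], B = [2,3], C = [a], D = [b,c,d], N = 1 ;
A = [1,2], B = [3], C = [a,b], D = [c,d], N = 2 ;
A = [1,2,3], B = [], C = [a,b,c], D = [d], N = 3 ;
false.

The shorter third argument bounds N, so the query terminates.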
Error conditions: One of the most important things defined in the standard and in extensions to it like the Prologue to Prolog is not the normal case, which is often obvious, but what should happen in abnormal cases. It is especially important in a logic programming language to be clear about which goals are not errors, just false. In this document I have previously finessed the issue by giving pred and when declarations which could be held to imply this. However, there is a subtle point which even the Prologue to Prolog overlooks.

In append_append_length(A, B, AB, C, D, CD, N), if N is neither a variable nor a non-negative integer, the outcome is an error, as the $error branch of the sample implementation shows. (Compatible with length/2 in the Prologue to Prolog.)

In append_append_length(A, B, AB, C, D, CD, N), if either A or C is neither a proper list nor a partial list, and there is nothing wrong with N, the outcome is a quiet failure. (Compatible with length/2 in the Prologue to Prolog.)
Cyclic terms are generally a pain to work with. Implementors, who may use undocumented or low-level features not accessible to ordinary programmers, find them very much easier to deal with than ordinary programmers do. Since cyclic terms are explicitly allowed (but not required) by ISO Prolog, it is helpful if built-in predicates do something sensible with them.
Semantics: append_append_length(A, B, AB, C, D, CD, N)
is true when length(A, N)
and length(C, N)
and
append(A, B, AB)
and append(C, D, CD)
. In
Mercury, where the compiler reorders code according to data flow, we could
write
append_append_length(A, B, AB, C, D, CD, N) :-
    length(A, N),
    length(C, N),
    append(A, B, AB),
    append(C, D, CD).
In ISO Prolog, however, this implementation is quite unsatisfactory.
?- append_append_length(A, "yz", "xyz", [X], R, "abc", N).
will find the (unique) solution, but if that is rejected, will then
backtrack forever, trying ever longer bindings for A.
There is, in fact, no fixed ordering of these goals which is satisfactory for all queries, where "satisfactory" is defined as finding all solutions and terminating in a finite amount of time, for a query having a finite number of solutions.
append_append(A, B, AB, C, D, CD) :-
    append_append_length(A, B, AB, C, D, CD, _).

append_length(A, B, AB, D, AD, N) :-
    append_append_length(A, B, AB, A, D, AD, N).

append(A, B, AB, D, AD) :-
    append_length(A, B, AB, D, AD, _).

append_length(A, B, AB, N) :-
    append_length(A, B, AB, B, AB, N).

append(A, B, AB) :-
    append_length(A, B, AB, _).

same_length(A, C, N) :-
    append_append_length(A, [], A, C, [], C, N).

same_length(A, C) :-
    same_length(A, C, _).

one_longer(A, C) :-
    same_length(A, [_|C]).

shorter_list(A, C) :-
    append_append([_|A], [], [_|A], _, _, C).

cons(H, T, L) :-
    append([H], T, L).
Note that the description of append/3 here is intended to be compatible with that in the Prologue to Prolog, except that the Prologue is silent about cyclic lists; this document explicitly requires

Ones = [1|Ones], append(Ones, [], Ones)

to quietly fail.
Once you can append two lists, it is interesting to append many lists. In Haskell and SML, this is called concat. In Prolog, it has traditionally been called append/2. Unlike append/3, it is not reversible, for the simple reason that even append(Xs, []) has infinitely many solutions. Why? Because given a value for Xs, inserting another [] anywhere in it will produce another solution. To get a reversible predicate, we have to insist on the elements of Xs being non-empty lists. The best name for that operation is partition/2, because if you think of the list as representing a natural number, it computes partitions of it.
:- pred append(list(list(T)), list(T)).
:- append(Lists, All) when list_skel(Lists).            % too weak
:- append(Lists, All) is semidet when ground(Lists).    % too strong

append([], []).
append([List|Lists], All) :-
    append(List, Rest, All),
    append(Lists, Rest).

:- pred partition(list(list(T)), list(T)).
:- partition(Lists, All) when list_skel(All) ; list_skel(Lists).  % too weak
:- partition(Lists, All) is semidet when ground(Lists).           % too strong
:- partition(Lists, All) is bounded when list_skel(All).

partition([], []).
partition([[X|List]|Lists], [X|All]) :-
    append(List, Rest, All),
    partition(Lists, Rest).
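For example, under the sample implementation above, partition/2 run backwards enumerates the ways of splitting a list into non-empty pieces (the solution order shown is simply what this implementation happens to produce):

?- partition(P, [1,1,1]).
P = [[1],[1],[1]] ;
P = [[1],[1,1]] ;
P = [[1,1],[1]] ;
P = [[1,1,1]] ;
false.

Four solutions: the four ordered ways of writing 3 as a sum of positive integers.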
The following predicates come from library(list_parts)
in the Quintus library. However, in order that the argument order of
these predicates should be consistent with the other list processing
predicates, it is necessary to switch the argument order from
Whole,Part to Part,Whole, because that's the way append/3
is.
head(H, L) :- cons(H, _, L).
tail(T, L) :- cons(_, T, L).
prefix(P, L) :- append(P, _, L).
proper_prefix(P, L) :- append(P, [_|_], L).
suffix(S, L) :- append(_, S, L).
proper_suffix(S, L) :- append([_|_], S, L).
The next group of predicates are already fully consistent with
append_append_length/7
.
length(L, N) :- append_length(L, [], L, N).
member(X, L) :- append(_, [X|_], L).
memberchk(X, L) :- member(X, L), !.

:- nonmember(X, L) when ground(X), ground(L).
:- nonmember(X, L) is semidet when ground(X), ground(L).
nonmember(X, L) :- \+ member(X, L).

select(X, L, R) :- append(_, [X|T], L, T, R).
selectchk(X, L, R) :- select(X, L, R), !.

select(X, Xs, Y, Ys) :- append(_, [X|T], Xs, [Y|T], Ys).
selectchk(X, L, Y, R) :- select(X, L, Y, R), !.

last(ButLast, X, L) :- append(ButLast, [X], L).
last(X, L) :- last(_, X, L).
nextto(X, Y, L) :- append(_, [X,Y|_], L).
Cyclic lists: Consistently with the Prologue to Prolog, member/2 and memberchk/2 (not actually included in the Prologue) do not raise any exceptions. However, the question of sensible handling of cyclic lists arises. Given the list Ones = [1|Ones], the query member(2, Ones) would not terminate, even though inspection of a finite amount of information would justify eventual failure. Suppose we have a list L such that L = F ++ B and B = C ++ B, for finite F and C. Then we would like member(X, L) to be equivalent to (member(X, F) ; member(X, C)).
It's not clear to me that we can get that at an affordable price.
Since a cyclic list need not have a unique
(F,C*) parse,
and since the uniqueness of the parse may depend on the instantiation
state of the elements, it is not clear to me that it even makes sense
to demand this. What we can demand is that a call to member/2 will terminate after examining some finite prefix of L. The same applies to memberchk/2.
The fact that sensible handling of cyclic lists is desirable,
and very difficult for programmers to achieve, would seem to be
sufficient justification for standardising both
member/2
and memberchk/2
.
select/3
(included in the Prologue to Prolog)
and selectchk/3
(not included) are similar to
member/2
and memberchk/2
. Because they
are asking about an unknown but finite prefix of the
two list arguments, they are not subject to the “cyclic lists
fail quietly” requirement of append/3
, but if
either the 2nd or the 3rd argument is cyclic, we do want them to
fail after examining a finite prefix.
Ulrich Neumerkel points out that it is undesirable for different Prolog systems to give different answers to queries involving standard predicates like member/2, and that it is highly undesirable for a single system to give different answers depending on how the arguments to the built-in were constructed.
If I ever find an affordable way to solve the Indirect Cycle Detection Problem (that is, given a list Front ++ cycle Loop, find a canonical parse for it in linear time), then it will be time to require member/2, memberchk/2, select/3, and selectchk/3 to terminate on cyclic lists. But not until then. Sigh.
The next predicate corresponds to the old correspond/4
predicate, but has a consistent argument order:
member_member(X, Xs, Y, Ys) :- append_append(_, [X|_], Xs, _, [Y|_], Ys).
One operation which is commonly asked for is finding the
Nth element of a list. Here, unfortunately, we have
two clashing archetypes: arg/3
and the
append_append_length/7
family. The Quintus library
assimilated "Nth element" to arg/3
,
with predicates
nth0(N, List, Member)
nth0(N, List, Member, Residue)
nth1(N, List, Member)
nth1(N, List, Member, Residue)
The clash with member(Member, List)
and
select(Member, List, Residue)
is unfortunate.
The fact that 0 and 1 have other meanings in the Quintus library
(and other meanings in connection with lists) is also unfortunate.
I am driven to the conclusion that the analogy with
arg(N, Term, Arg)
is bogus. In particular,
arg/3
only works one way around: it does not
solve for N
(although at the price of notably worse
generated code, it could) and it cannot solve for
Term
. This means that its first two arguments are,
on any understanding of the term, "inputs", and by the
inputs-before-outputs principle, should come first.
The "Nth element" predicates do not share
arg/3
's input/output restrictions. Indeed, they
are more likely to be used to find N for a known
Member than the other way around; it is tolerable
for linear search to take linear time, but not for indexing.
I use the name component "_offset" to refer to a number of elements to skip. You can think of this as 0-origin indexing if you like, but it is actually more fruitful to think of it quite literally as the length of some list segment. Indeed, "length" could be used to be consistent with the other "_length" predicates, except that in this context it might be confusing.
I use the name component "_index" to refer to a position in a list. Like argument positions in a term, member positions in a list always start with 1.
The Nth element predicates are
member_member_offset(X, Xs, Y, Ys, Offset) :-
    append_append_length(_, [X|_], Xs, _, [Y|_], Ys, Offset).

member_member_index(X, Xs, Y, Ys, Index) :-
    member_member_offset(X, [X|Xs], Y, [Y|Ys], Index),
    Index =\= 0.

member_offset(Member, List, Offset) :-
    append_length(_, [Member|_], List, Offset).

member_index(Member, List, Index) :-
    append_length(_, [Member|_], [Member|List], Index),
    Index =\= 0.

select_offset(X, Xs, Y, Ys, Offset) :-
    append_length(_, [X|T], Xs, [Y|T], Ys, Offset).

select_index(X, Xs, Y, Ys, Index) :-
    select_offset(X, [X|Xs], Y, [Y|Ys], Index),
    Index =\= 0.

select_offset(Member, List, Residue, Offset) :-
    append_length(_, [Member|Tail], List, Tail, Residue, Offset).

select_index(Member, List, Residue, Index) :-
    append_length(_, [Member|Tail], [Member|List],
                  Tail, [Member|Residue], Index),
    Index =\= 0.
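A few illustrative queries, assuming the definitions above, show the offset/index distinction:

?- member_offset(X, [a,b,c], 1).
X = b.

?- member_index(X, [a,b,c], 1).
X = a.

?- select_offset(b, [a,b,c], Residue, Offset).
Residue = [a,c],
Offset = 1.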
Prolog has traditionally had reverse/2. Common Lisp also provides REVAPPEND, and such a predicate is commonly part of the implementation of reverse/2.
:- pred reverse_append_length(list(T), list(T), list(T), integer).
:- reverse_append_length(L, B, RB, N) when L ; RB ; N.
:- reverse_append_length(L, B, RB, N) is semidet
   when list_skel(L) ; integer(N).
:- reverse_append_length(L, B, RB, N) is bounded
   when list_skel(RB).

% Sample implementation.
reverse_append_length(L, B, RB, N) :-
    append_append_length(L, [], L, _, B, RB, N),
    $reverse(L, B, RB).

$reverse([], B, B).
$reverse([X|L], B, RB) :-
    $reverse(L, [X|B], RB).
Semantics: reverse_append_length(L, B, RB, N) is true when RB = reverse(L)++B and length(L) = N.
Notation: There is no new notation here, but I feel that I must repeat the warning that the "sample implementations" of "base" predicates are crafted for exposition and specification. This is not the way that I would implement this predicate for practical use. What implementor would use a two-pass algorithm when a one-pass algorithm could be used?
Except that L is to be reversed, the argument order of this predicate is identical to the argument order of append_length/4, as the name suggests.
reverse_append(L, B, RB) :- reverse_append_length(L, B, RB, _).
reverse_length(L, R, N) :- reverse_append_length(L, [], R, N).
reverse(L, R) :- reverse_length(L, R, _).
It may be worth while exhibiting a direct implementation of
reverse_append/3
.
:- pred reverse_append(list(T), list(T), list(T)).
:- reverse_append(L, B, RB) when L ; RB.
:- reverse_append(L, B, RB) is semidet when list_skel(L).
:- reverse_append(L, B, RB) is bounded when list_skel(RB).

reverse_append(L, B, RB) :-
    $reverse_append(L, B, RB, RB).

$reverse_append([], RB, RB, _).
$reverse_append([X|L], B, RB, [_|Bound]) :-
    $reverse_append(L, [X|B], RB, Bound).
This illustrates the often useful technique of passing a second "copy" of an "output" argument to serve as a bound when you are running a predicate backwards.
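For example, with the skeleton of the third argument serving as the bound, running the predicate backwards enumerates finitely:

?- reverse_append(L, B, [x,y,z]).
L = [], B = [x,y,z] ;
L = [x], B = [y,z] ;
L = [y,x], B = [z] ;
L = [z,y,x], B = [] ;
false.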
permutation_length
:- pred permutation_length(list(T), list(T), integer).
:- permutation_length(A, B, N) is bounded
   when list_skel(A) ; list_skel(B) ; nonvar(N).

% Sample implementation.
permutation_length(A, B, N) :-
    same_length(A, B, N),
    $perm(A, B).

$perm([], []).
$perm([X|Xs], Ys1) :-
    $perm(Xs, Ys),
    append(_, R, Ys, [X|R], Ys1).
permutation(A, B) :- permutation_length(A, B, _).
pairs_keys_values/3
Given the great practical utility of keysort/2
,
it is useful to be able to build and dismantle the lists it
requires and produces.
I am not satisfied with the name that was used in the Quintus library (keys_and_values/3), because it does not make clear which argument is which. Nor is there any way to appeal to 'inputs before outputs', because the predicate is used both ways around. I am tentatively calling it pairs_keys_values/3.
Whatever it's called, it is a surprisingly useful predicate.
:- pred pairs_keys_values(list(pair(K,V)), list(K), list(V)).
:- pairs_keys_values(Pairs, Keys, Values)
   when Pairs ; Keys ; Values.
:- pairs_keys_values(Pairs, Keys, Values) is semidet
   when list_skel(Pairs) ; list_skel(Keys) ; list_skel(Values).

% Sample implementation.
pairs_keys_values([], [], []).
pairs_keys_values([K-V|Pairs], [K|Keys], [V|Values]) :-
    pairs_keys_values(Pairs, Keys, Values).
Note that the sample implementation doesn't actually satisfy the requirements. In a traditional first-argument-indexing Prolog, if the first argument is a variable but one of the others is a list skeleton, the determinism of the query will not be detected.
I'm not sure it's worth bothering with these.
pairs_keys(Pairs, Keys) :- pairs_keys_values(Pairs, Keys, _).
pairs_values(Pairs, Values) :- pairs_keys_values(Pairs, _, Values).
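Whatever is decided about those projections, pairs_keys_values/3 itself really is used both ways around, as a couple of illustrative queries show:

?- pairs_keys_values(Pairs, [a,b], [1,2]).
Pairs = [a-1, b-2].

?- pairs_keys_values([x-10, y-20], Keys, Values).
Keys = [x, y],
Values = [10, 20].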
The definition of the "delete" predicate is something I
struggled with for years. Part of the problem is that if list
R does not contain X, there are infinitely many lists L such
that deleting X from L yields R, so the predicate really only
works as a function. That tells us that L and X should precede
R in the argument order. Should it be
delete(Member, List, Residue)
to be consistent
with select/3
? Or, because delete/3
is dangerously different from select/3
, is it
actually a good thing if they are different?
Prolog has three versions of equality: (=)/2, (==)/2, and (=:=)/2. Which of them should be used? Or should we have three predicates, the way Scheme used to have DELQ!, DELV!, and DELETE!?
Should the item to be deleted be a strict input, or should
delete/3
be able to backtrack over the elements of
the input list as in Cugini's library?
If X does not occur in L, is that a good input, or should it fail?
There is a definition in the Quintus library. It's basically this:
/* Historic background, NOT for inclusion */
:- pred delete(list(T), T, list(T)).
:- delete([], _, _) when true.
:- delete([H|T], X, R) when ground(H), ground(X).
:- delete(L, X, R) is semidet when ground(L), ground(X).

% Sample implementation.
delete([], _, []).
delete([H|T], X, R) :-
    H == X,
    !,
    delete(T, X, R).
delete([H|T], X, [H|R]) :-
    delete(T, X, R).
I don't want to take delete/3
away from the
people who use it, if there are any, but it clearly doesn't
belong in even an informal standard. The problem is that there
are so many choices that none of them is going to be the right
one for the majority of uses (if uses there be); none of them
deserves the exclusive right to the name delete/3
.
The history of Scheme may be of interest here. There used to be
three delete-from-list functions in Scheme, but there aren't
any these days.
One of the questions about deleting from a list is how many
elements to delete. I note that when I want to delete a
known element from a list, I almost always want to delete just
one copy (usually because I know there is only one copy). For
that purpose, selectchk(X, L, R)
works very well.
The result is that despite writing delete/3, I cannot recall ever using it except to test it.
If delete/3 is not standard, what can programmers do instead? One easy answer is to use findall/3:
findall_member(Member, List, Test, Result) :-
    findall(Member,
            ( member(Member, List), \+ \+ call(Test) ),
            Result).

... findall_member(Y, L, \+ Y = X, R) ...
... findall_member(Y, L, Y \== X, R) ...
... findall_member(Y, L, Y =\= X, R) ...
Using findall_member/4 instead of delete/3 has some important advantages, not least that the equality test you mean is stated explicitly. Had I ever needed delete/3 in earnest, I would have used this.
The only bad thing about this predicate is efficiency.
But that's why I introduced findall_member/4
in
the first place, instead of calling findall/3
directly. It's comparatively simple for a compiler to recognise
calls to findall_member/4
and generate a three-clause
loop (essentially the structure of delete/3
shown
above), using a cache to ensure that the same code is not generated
repeatedly. Or for a compiler such as BinProlog, where
findall/3
is fast, the compiler could just convert
a findall_member/4
call to the corresponding
findall/3
code.
The findall/4
predicate, which combines the effect
of findall/3
with the effect of append/3
,
is unaccountably missing from ISO Prolog. When you know about it,
you find nearly as many uses for findall/4
as for
findall/3
. This suggests to me that the right "base"
predicate is not findall_member/4
but
findall_member/5
.
:- pred findall_member(T, list(T), void, list(T), list(T)).
:- findall_member(M, L, P, R0, R) is semidet
   when list_skel(L), nonvar(P).

% Sample implementation.
findall_member(_, [], _, R, R).
findall_member(M, [X|Xs], P, R0, R) :-
    \+ ( X = M, call(P) ),
    !,
    findall_member(M, Xs, P, R0, R).
findall_member(M, [X|Xs], P, [X|R0], R) :-
    findall_member(M, Xs, P, R0, R).
findall_member(M, Xs, P, R) :-
    findall_member(M, Xs, P, R, []).

delete(L, X, R) :-
    findall_member(M, L, X \== M, R).

disjoint(S1, S2) :-
    findall_member(M, S1, memberchk(M, S2), []).

intersection(S1, S2, S) :-
    findall_member(M, S1, memberchk(M, S2), S).

difference(S1, S2, S) :-
    findall_member(M, S1, nonmember(M, S2), S).

union(S1, S2, S) :-
    findall_member(M, S1, nonmember(M, S2), S, S2).

symdiff(S1, S2, S) :-
    findall_member(M, S1, nonmember(M, S2), S, S3),
    difference(S2, S1, S3).
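Some illustrative queries, assuming the definitions above:

?- delete([a,b,a,c], a, R).
R = [b,c].

?- union([a,b], [b,c], S).
S = [a,b,c].

?- symdiff([a,b], [b,c], S).
S = [a,c].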
I am not actually recommending any of the predicates in this section. In particular, I am not recommending any of the traditional quadratic-time set-as-unordered-list predicates for inclusion in the "community standard library". What I do recommend is that either no such predicates be included (letting people call findall/3 themselves, or write and document their own deletion predicates), or else one predicate be included in the library, and that predicate should be findall_member/5.
The Quintus library contains delete(L, X, N, R), which deletes the first N copies of X in L, or fewer if there aren't that many. To be honest, it is difficult to imagine any serious use of this. N=0 is just R=L. N=1 is just selectchk/3. N larger than the length of L is just delete/3. I never did find any other uses for it.
sort/2
and keysort/2
may not be the most
useful predicates in Prolog, but they are somewhere near the top. In
a language without mutable data structures, but with a total order on
the data structures it does have, sorting is often indispensable for
getting code that is usefully fast.
The notion of a sorted list is clearly spelled out in the Prolog
standard. Unbelievably, there are no means whatsoever of sorting a
general list yourself; the only use of the "sorted list" notion is in
setof/3
. What's more, the Prolog standard doesn't even
guarantee that the result of setof/3
will remain
sorted, or even be sorted when it is first provided: the
ordering on variables is only required to be consistent during the
construction of a sorted list. Does this mean that the Prolog standard
allows an implementation where
length(L, N), N > 0,
setof(X, member(X, L), S1),
setof(X, member(X, L), S2),
S1 == S2
may fail? Yes, it does.
In the same way that setof/3
needs sort/2
(yet the Prolog standard does not provide it), so bagof/3
needs keysort/2
, and the Prolog standard does not provide
it. While this makes it distressingly plain that the Prolog standardisers
did not do a very good job, is it anything more than a nuisance? After
all, it is quite easy to build a good sorting predicate on top of the
compare/3
primitive.
Oddly enough, while ISO Prolog provides (@=<)/2, (@<)/2, (@>=)/2, (@>)/2, (==)/2, and (\==)/2, it does not provide compare/3. If one wants a minimal standard, this is back to front.
Given compare/3
, the other six comparisons
can be implemented at low cost. (Although == and \== can
be done even faster.) But given the other six comparisons,
the best you can do for compare/3
is
compare(R, X, Y) :-
    (   X @> Y -> R = (>)
    ;   X == Y -> R = (=)
    ;   R = (<)
    ).
which would nearly double the cost of some sorting algorithms.
Using the six comparison operators instead of compare/3
also forces you to use cuts or if-then-else in many cases where
pure logic with no cuts of any kind could be used. This too is often
an efficiency problem.
There is an additional problem. In the presence of variables,
the ISO Prolog term ordering is not required to be stable.
It's required to be stable during a call to a built in predicate
such as (@=<)/2
or setof/3
, but it is
not required to be consistent from call to call. This means
that
X @< Y, X @< Y
X @< Y, setof(Z, (Z=X;Z=Y), [X,Y])
sort([X,p(X)], S1), sort([X,p(X)], S2), S1 == S2

need not succeed. In particular, programmers cannot even write sort/2 or keysort/2 for themselves in ISO Prolog. Now the proposal:
The built-in predicate compare/3 must be available with the same status as any other built-in predicate. For example, it should be in the same module as is/2 or clause/2.
The term ordering must be stable in the sense that only the disappearance of a variable (when it becomes bound to something else) can alter the order of two terms.
It is expected that compare(R,T1,T2)
is linear in the
size of its arguments (in fact, of the smaller of the two), and allocates
no memory. It is possible to implement Prolog atoms with a symbol table
so structured that atom comparisons take O(1) amortised time, but the
requirement is actually "linear in the number of characters".
keycompare(R, K1-_, K2-_) :- compare(R, K1, K2).

This is useful for specifying keysort/2 and a number of other predicates.
There are two ways to sort a list of items based on part of the information in (or implied by) each item. One is to write your own comparison predicate which extracts this information every time a comparison is done. This is the method used by C's qsort() function, for example. This requires about 2N lg N extractions to sort N items, but has few restrictions on what it can do.
The right names for these two predicates are sort/3 and msort/3. Unfortunately, some existing Prolog systems have pre-empted these names for other, sometimes quite strange, variations on sorting. Plagiarism is the sincerest form of flattery, so I propose to adapt the name of the Haskell sortBy function.
:- pred sort_by(+void(?order,T,T), +list(T), ?list(T)).
:- sort_by(P, L, S) is semidet.
:- pred msort_by(+void(?order,T,T), +list(T), ?list(T)).
:- msort_by(P, L, S) is semidet.
The sort_by/3
predicate sorts its second argument,
using the first argument to determine the order, and unifies its
third argument with the result. If P(X1,X2) fails for any X1,X2
in L, sort_by(P, L, S) may fail. If P is not a partial order, the
result is not defined. P may be called up to O(N.lgN) times, where
length(L, N). In addition, only O(N.lgN) other work may be done.
If two or more elements of L are equal according to
P, only the first (leftmost) element will be retained.
The msort_by/3
predicate is like sort_by/3
but does not remove duplicates. It must be stable: if
L = [...,X1,...,X2,...] and P(=,X1,X2), then S = [...,X1,...,X2,...].
There is at least one implementation
of these predicates that works by generating and caching specialised
code, so there is no implication that there need be any detectable
calls to call/3
made by these predicates.
The second way to sort based on part of the information in each item is to precompute, for each item, a key which will natively be sorted the way we want the items to be sorted. If the extraction amounts to selecting and reordering parts of a term, together with perhaps some arithmetic operations such as negation, this will work. It has the advantage of requiring only n extractions to sort n items.
The definitions here are meant to specify the relation these
predicates compute, and should not be treated as model implementations.
This approach was taken to make the definitions simple, especially
that of keyed_sort/3
. An actual implementation
should call Key_Extractor just once for each element of L.
keyed_sort(Key_Extractor, L, S) :-
    sort_by($keyed_compare(Key_Extractor), L, S).

keyed_msort(Key_Extractor, L, S) :-
    msort_by($keyed_compare(Key_Extractor), L, S).

$keyed_compare(Key_Extractor, R, X, Y) :-
    Key_Extractor(X, KX),
    Key_Extractor(Y, KY),
    compare(R, KX, KY).
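For example, taking the ISO built-in atom_length/2 as the key extractor, and assuming the definitions above:

?- keyed_msort(atom_length, [banana, fig, apple], S).
S = [fig, apple, banana].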
We can define the traditional sorting predicates in terms of the higher-order ones.
sort(L, S) :- sort_by(compare, L, S).
msort(L, S) :- msort_by(compare, L, S).
keysort(L, S) :- msort_by(keycompare, L, S).

keysort_by(P, L, S) :- msort_by($keycompare(P), L, S).

$keycompare(P, R, K1-_, K2-_) :- P(R, K1, K2).
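In particular keysort/2, being defined by msort_by/3, keeps duplicate keys and keeps them in their original order:

?- keysort([b-1, a-2, b-3], S).
S = [a-2, b-1, b-3].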
The sort/2
and keysort/2
predicates
are traditional. The need for msort/2
has often been
felt; it is present in SWI Prolog, for example. The need for generalised
versions has also been felt; LPA Prolog had a rather clumsy version
which the higher order sorting predicates can simulate. With
sort_by/3
we can do
sort_by(codes:normalised_compare(kc), L, R)
without requiring a special-purpose predicate.
The derived comparison and sorting predicates should be in the
same module as compare/3
.
SWI Prolog includes merge/3
, but not the obvious (and
sometimes useful) keymerge/3
; they are currently not
included in this proposal. It also includes a merge_set/3
predicate, which appears to be an alias for ord_union/3
.
Not only is it possible to have a stable sort with guaranteed worst case O(N.lgN), it is possible to have a stable sort which has linear cost when the input list is the concatenation of a small number of sorted lists (sorted by @=<, not just @<). For example,
merge(L1, L2, L) :- append(L1, L2, L12), msort(L12, L).
could be an O(|L|) implementation of merging. It is tempting to require that such an algorithm be used; this makes converting an already sorted list to ordered set form cheap, for example.
Term ordering in Prolog is based on the underlying coded
character set. It is not intended to conform to the collating order
of any natural language in any culture. Amongst other things,
compare/3
must never report two terms as equal unless
they are completely indistinguishable by Prolog. However, natural
language collating orders may do things like ignoring alphabetic case
so that 'bin' and 'BIN' sort together. Term ordering is also supposed
to be fast; locale-sensitive ordering can be quite costly.
When producing output for people, it can be useful to compare and
sort text in a way that people find reasonable. Just as ISO C has
both strcmp()
and strcoll()
, so ISO Prolog
needs both compare/3
and locale_compare/[3,4]
.
:- pred locale_compare(atom, order, T, T).
:- locale_compare(LocaleName, Order, T1, T2) is semidet
   when ground(LocaleName).

:- pred locale_compare(order, T, T).
:- locale_compare(Order, T1, T2) is semidet.

locale_compare(R, T1, T2) :-
    "L is the user's default locale",
    locale_compare(L, R, T1, T2).
These are defined in terms of locale names, not locale terms. Presumably they would be implemented by caching, so that sorting a list of a million terms would determine the locale's collation rules at most once.
locale_sort(R, S) :- sort_by(locale_compare, R, S).
locale_msort(R, S) :- msort_by(locale_compare, R, S).
locale_keysort(R, S) :- msort_by($locale_keycompare, R, S).

$locale_keycompare(R, K1-_, K2-_) :- locale_compare(R, K1, K2).

locale_sort(L, R, S) :- sort_by(locale_compare(L), R, S).
locale_msort(L, R, S) :- msort_by(locale_compare(L), R, S).
locale_keysort(L, R, S) :- msort_by($locale_keycompare(L), R, S).

$locale_keycompare(L, R, K1-_, K2-_) :- locale_compare(L, R, K1, K2).
People sometimes want a way to generate a subset of a given set. The Quintus Prolog predicates subseq/3, subseq0/2, and subseq1/2 address that. subseq(AB, A, B) is true when AB is an interleaving of A and B. If AB represents a set, A will be a subset of AB, and B will be AB\A. It is regarded as a list operation, rather than a set operation, because it preserves the order of the elements.
:- pred subseq(list(T), list(T), list(T)).

% A mode test to check for a full list skeleton.
$list_skel(V) :- var(V), !, fail.
$list_skel([]).
$list_skel([_|L]) :- $list_skel(L).

subseq(AB, A, B) :-
    $list_skel(AB),
    !,
    $subseq(AB, A, B).
subseq(AB, A, B) :-
    $list_skel(A),
    $list_skel(B),
    !,
    $subseq(AB, A, B).
subseq(AB, A, B) :-
    instantiation error.

$subseq([], [], []).
$subseq([X|AB], A, [X|B]) :- $subseq(AB, A, B).
$subseq([X|AB], [X|A], B) :- $subseq(AB, A, B).

subseq0(AB, A) :-
    $list_skel(AB),
    !,
    $subseq(AB, A, _).
subseq0(AB, A) :-
    instantiation error.

subseq1(AB, A) :-
    $list_skel(AB),
    !,
    $subseq(AB, A, _),
    A \== AB.
subseq1(AB, A) :-
    instantiation error.
The problem with this is that there are 2^|AB| solutions. That makes these predicates useless except for rather small lists. In most potential applications for this predicate, there is some kind of constraint that reduces the number of solutions drastically, which should be interwoven with the generator.
Do not include these predicates. In fact, think 50 times before including any predicates with exponential cost. Revisit this recommendation if and when coroutining or other constraint processing becomes sufficiently wide-spread or standard.
The following predicates from the DEC-10 and/or Quintus libraries have not been slotted into families yet.
{min,max}_{member([P,]X, L), select([P,]X, L, R)}. In particular, it is time that msort/2 was adopted.
What self-respecting list-processing language would be caught dead without a basic kit of higher order list processing operations? Only Prolog, I guess.
But not Logtalk, of course. The Logtalk
library includes a meta
object which has some of the
DEC-10 library higher-order list predicates and adds more original ones.
The interfaces here pre-date Logtalk; fortunately, where there are
argument order differences there are name differences as well, so the
two can co-exist, especially given Logtalk's better-than-ISO-modules
objects.
The notation p/N+k is used for a family of predicates iterating over N lists, having k other arguments.
The following predicate families come from DEC-10 Prolog and Quintus Prolog:
map/N+1:

maplist(P, [X11,...,X1n], ..., [Xm1,...,Xmn]) :-
    P(X11, ..., Xm1),
    ...
    P(X1n, ..., Xmn).
SWI Prolog has settled on the name 'foldl', so we may as well follow that here, and add 'foldr' as well.
There is an issue about argument order. In Haskell we find
foldl :: (a -> b -> a) -> a -> [b] -> a
foldl f a (x:xs) = foldl f (f a x) xs
foldl _ a []     = a

foldr :: (b -> a -> a) -> a -> [b] -> a
foldr f a (x:xs) = f x (foldr f a xs)
foldr _ a []     = a
In SML we find
val foldl : ('b * 'a -> 'a) -> 'a -> 'b list -> 'a
fun foldl f a (x::xs) = foldl f (f(x,a)) xs
  | foldl _ a []      = a

val foldr : ('b * 'a -> 'a) -> 'a -> 'b list -> 'a
fun foldr f a (x::xs) = f(x, foldr f a xs)
  | foldr _ a []      = a
If there is no unequivocal convention for us to imitate, we can try to come up with a principled argument order for ourselves. The core goal is consistency. The following guide-lines seem to be appropriate:
The interfaces presented here are consistent with those guidelines.
:- pred foldl(void(T1,...,Tm,R,R), list(T1), ..., list(Tm), R, R).
:- foldl(P, L1, ..., Lm, R0, R)
   when list_skel(L1) or ... or list_skel(Lm).

foldl(P, [X11,...,X1n], ..., [Xm1,...,Xmn], V0, Vn) :-
    P(X11, ..., Xm1, V0, V1),
    ...
    P(X1n, ..., Xmn, V', Vn).

:- pred foldr(void(T1,...,Tm,R,R), list(T1), ..., list(Tm), R, R).
:- foldr(P, L1, ..., Lm, R0, R)
   when list_skel(L1) or ... or list_skel(Lm).

foldr(P, [X11,...,X1n], ..., [Xm1,...,Xmn], V0, Vn) :-
    P(X1n, ..., Xmn, V0, V1),
    ...
    P(X11, ..., Xm1, V', Vn).
Note that the last two arguments form a typical difference pair, so at least when the predicate is just building data structure, it is common for the last argument to be the "input" (the trivial case of the data structure) and the second-to-last argument to be the "output" (the complete data structure). For example, to convert a list of goals to a conjunction, you might do
flip_conj(Goal, (Goal,Conj), Conj).

?- foldl(flip_conj, [p,q,r], Conj, true).
=> Conj = (p, q, r, true).
If append/2 did not exist, it could be defined using either of these families:
concat(Ls, L) :- foldl(flip(append), Ls, L, []).
concat(Ls, L) :- foldr(append, Ls, [], L).
These predicates are supposed to compose, so that you can process lists of lists conveniently without resorting to Currying. For example, suppose you have a matrix represented as a list of lists of numbers. Then
add(X, Y, Z) :- Z is X + Y.

?- Matrix = [[1,2,3],[4,5,6],[7,8,9]],
   foldl(foldl(add), Matrix, 0, Sum1),
   foldr(foldr(add), Matrix, 0, Sum2).
=> Sum1 = Sum2, Sum2 = 45.
scanl(P, [X11,...,X1n], ..., [Xm1,...,Xmn], V0, [V0,V1,...,Vn]) :-
    P(X11, ..., Xm1, V0, V1),
    ...
    P(X1n, ..., Xmn, V', Vn).
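With add/3 as defined earlier, scanl yields running totals; the seed V0 appears as the head of the result:

?- scanl(add, [1,2,3], 0, Vs).
Vs = [0, 1, 3, 6].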
some(P, [X11,...,X1n], ..., [Xm1,...,Xmn]) :-
    (   P(X11, ..., Xm1)
    ;   ...
    ;   P(X1n, ..., Xmn)
    ).
Why don't we need it? Suppose we generalise member/2 to member/2m:
member(X1, [X1|_], ..., Xm, [Xm|_]).
member(X1, [_|Xs1], ..., Xm, [_|Xsm]) :-
    member(X1, Xs1, ..., Xm, Xsm).
Then instead of
some(P, Xs1, ..., Xsm)
we can write
member(X1, Xs1, ..., Xm, Xsm), P(X1, ..., Xm)
and we not only avoid all the overhead of passing P and get to call it directly; if the query is true, we also get to find out which values of X1...Xm make it true.
If an analogue of some/N+1 is implementable, so is an analogue of member/2N.
somechk(P, [X11,...,X1n], ..., [Xm1,...,Xmn]) :-
    (   P(X11, ..., Xm1) -> true
    ;   ...
    ;   P(X1n, ..., Xmn) -> true
    ).
Similarly, instead of
somechk(P, Xs1, ..., Xsn)
we can write
(member(X1, Xs1, ..., Xn, Xsn), P(X1, ..., Xn) -> true)
and get to find out why P is true as well as that it is true.
include(P, [], ..., [], []).
include(P, [X1|Xs1], ..., [Xn|Xsn], Included) :-
    (   call(P, X1, ..., Xn) ->
        Included = [X1|Included1]
    ;   Included = Included1
    ),
    include(P, Xs1, ..., Xsn, Included1).
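For example, with the ISO built-in integer/1 as the test:

?- include(integer, [a, 1, b, 2], L).
L = [1, 2].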
Note that this really requires P to be determinate, but it does allow variables in P to be bound. For a ground P, we could manage without it: writing

findall(X1,
        ( member(X1, Xs1, ..., Xn, Xsn), P(X1, ..., Xn) ),
        Included)
would do the job. Some Prolog systems (BinProlog) may implement findall/3 so efficiently as to make this an attractive alternative.
exclude(P, [], ..., [], []).
exclude(P, [X1|Xs1], ..., [Xn|Xsn], Excluded) :-
    (   call(P, X1, ..., Xn) ->
        Excluded = Excluded1
    ;   Excluded = [X1|Excluded1]
    ),
    exclude(P, Xs1, ..., Xsn, Excluded1).
There were and are others, but this will do to start with.
The obvious missing predicates are
scanr(P, [X11,...,X1n], ..., [Xm1,...,Xmn], Vn, [V0,V1,...,Vn]) :-
    P(X1n, ..., Xmn, V', Vn),
    ...
    P(X11, ..., Xm1, V0, V1).
which will be added soon.
:- meta_predicate min(2, +, -, -).
:- pred min(void(T,number), list(T), T, number) is semidet.
:- meta_predicate $min_sans_default(+, 2, -, -).
:- pred $min_sans_default(list(T), void(T,number), T, number) is semidet.
:- meta_predicate $min_with_default(+, 2, +, -, +, -).
:- pred $min_with_default(list(T), void(T,number), T, T, number, number) is semidet.

min(Price, Candidates, Best, Cost) :-
    $min_sans_default(Candidates, Price, Best, Cost).

$min_sans_default([Candidate|Candidates], Price, Best, Cost) :-
    (   call(Price, Candidate, Candidate_Cost) ->
        $min_with_default(Candidates, Price, Candidate, Best,
                          Candidate_Cost, Cost)
    ;   $min_sans_default(Candidates, Price, Best, Cost)
    ).

$min_with_default([], _, Best, Best, Cost, Cost).
$min_with_default([Candidate|Candidates], Price, Best0, Best, Cost0, Cost) :-
    (   call(Price, Candidate, Candidate_Cost),
        Candidate_Cost < Cost0 ->
        $min_with_default(Candidates, Price, Candidate, Best,
                          Candidate_Cost, Cost)
    ;   $min_with_default(Candidates, Price, Best0, Best, Cost0, Cost)
    ).

:- meta_predicate max(2, +, -, -).
:- pred max(void(T,number), list(T), T, number) is semidet.
:- meta_predicate $max_sans_default(+, 2, -, -).
:- pred $max_sans_default(list(T), void(T,number), T, number) is semidet.
:- meta_predicate $max_with_default(+, 2, +, -, +, -).
:- pred $max_with_default(list(T), void(T,number), T, T, number, number) is semidet.

max(Value, Candidates, Best, Gain) :-
    $max_sans_default(Candidates, Value, Best, Gain).

$max_sans_default([Candidate|Candidates], Value, Best, Gain) :-
    (   call(Value, Candidate, Candidate_Gain) ->
        $max_with_default(Candidates, Value, Candidate, Best,
                          Candidate_Gain, Gain)
    ;   $max_sans_default(Candidates, Value, Best, Gain)
    ).

$max_with_default([], _, Best, Best, Gain, Gain).
$max_with_default([Candidate|Candidates], Value, Best0, Best, Gain0, Gain) :-
    (   call(Value, Candidate, Candidate_Gain),
        Candidate_Gain > Gain0 ->
        $max_with_default(Candidates, Value, Candidate, Best,
                          Candidate_Gain, Gain)
    ;   $max_with_default(Candidates, Value, Best0, Best, Gain0, Gain)
    ).
min(Price, Candidates, Best, Cost) is true when member(Best, Candidates) and Price(Best, Cost) and there is no Other such that member(Other, Candidates) and Price(Other, Other_Cost) and Other_Cost < Cost.
That is, given a pricing (partial) function and a list of candidates,
return a cheapest candidate and its cost. If no candidate is in the
domain of Price, there is no solution, so fail. If more than one
candidate has the minimal cost, it is not specified which you will
get; it may depend on the order of the candidates. You can rely on
the Price function being called once for each candidate, but not on
the order in which these calls are made.
max(Value, Candidates, Best, Gain) is true when member(Best, Candidates) and Value(Best, Gain) and there is no Other such that member(Other, Candidates) and Value(Other, Other_Gain) and Other_Gain > Gain.
That is, given an evaluation (partial) function and a list of
candidates, return a most valuable candidate and its gain.
If no candidate is in the domain of Value, there is no solution, so
fail. If more than one candidate has the maximal value, it is not
specified which you will get; it may depend on the order of the
candidates. You can rely on the Value function being called once for
each candidate, but not on the order in which these calls are made.
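A small illustration, with a hypothetical price/2 table invented for the example (cheese deliberately has no price, so it is skipped):

price(apple, 3).
price(bread, 2).
price(milk, 2).

?- min(price, [cheese, apple, bread, milk], Best, Cost).
Best = bread,
Cost = 2.

Bread rather than milk, because with a strict < the sample implementation keeps the first of two equally cheap candidates; as stated above, you may not rely on that.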
In 1989 I gave some lectures at Melbourne University in which I showed how these predicates could be eliminated using Burstall/Darlington partial execution to get first-order Prolog code; this was done by manually typing commands at a toolkit that carried out the transformations. This was obviously not a practical user interface.
I had previously seen (the outside of) an automatic partial execution tool for a commercial Prolog system, which failed to impress because it could not at that time handle recursive predicates, such as these.
The Logtalk library includes a
meta_compiler
object which automatically turns
suitable calls to the meta
operations into calls to
automatically generated auxiliary predicates, making these higher-order
list operations as cheap to use as hand-written code. This is not
unlike the way the Erlang compiler turns list comprehensions into
calls to automatically generated functions. It is just one of the
ways in which Logtalk tries to encourage better structured code.
SWI Prolog also has machinery adequate for in-line expansion of calls to typical higher-order predicates. While it can be clumsy, this is of more concern to people writing higher-order library predicates than to people using them.
To summarise, then, what should be adopted:
- maplist/N+1, either under that name for backwards compatibility, or under the name map/N+1 for cultural compatibility with the functional programming community.
- scanlist/N+3, but under the name foldl/N+3.
- cumlist/N+3, but under the name scanl/N+3.
- foldr/N+3 and scanr/N+3.
- include/N+2 and exclude/N+2, under those names; if we called one of them 'filter', what would we call the other?
- member/2N under some name for a useful range of N.
- but not some/N+1 or somechk/N+1, because it is easy to manage without them. I do not say that nobody should provide them, only that we should not require them of everyone.
This is very much a minimal set of higher-order list operations.
Many set representations can be used in Prolog: unordered lists, ordered lists, delta-coded ordered lists, bitstrings, binary search trees of various types, ternary search trees.
It is important that the commonly available implementation of sets should be efficient enough to use without hesitation. It is also important that the operation names should not conflict with widely used operation names for other set representations.
This means that either names such as union/3
must
be avoided, or set predicates must not be used without a module
prefix to identify the representation.
Without a change to ISO Prolog, we cannot enforce a requirement that module prefixes be used, so we have to avoid commonly used names.
The representation most widely used in Prolog textbooks is the unordered list representation. This is also the representation used in Common Lisp.
The ordered list representation has its flaws (adding, removing, or checking for the presence of a single element is O(|Set|) rather than O(lg|Set|)), but the unordered list representation has no virtues.
One of the flaws of any term comparison-based set representation
in ISO Prolog is going to be that it allows variables to be set
elements, so that two terms can be in a set because when they were
entered they were not identical, but subsequent variable bindings
can make them identical. The answer to this is to use a mode system
which requires that set elements are sufficiently instantiated,
perhaps via a "safe" version of compare/3
.
I recommend the ordered set representation. It is space efficient,
time efficient for bulk operations, and is already used in Prolog
(the sort/2
and setof/3
predicates deliver
results of this form, which we might as well exploit).
In order to preserve backwards compatibility and to avoid name clashes with the unordered list representation, all predicate names have the form ord_something.
The ordset representation cannot be implemented in Prolog without
the use of Prolog's term ordering.
As noted in the section on Sorting,
compare/3
is not provided, even though term
ordering is defined. That is really bad.
The compare/3
predicate must be available to Prolog programmers.
ord_op/5
:- type ordset(T) = list(T).

:- pred ord_op(integer, ordset(T), ordset(T), ordset(T), ordset(T)).
:- ord_op(M, Xs, Ys, S1, S2) when M, Xs, Ys.
:- ord_op(M, Xs, Ys, S1, S2) is semidet
   when ground(M), ground(Xs), ground(Ys).

ord_op(M, [], Ys, S1, S2) :-
    ( M /\ 2'000010 =:= 0 -> S1 = [] | S1 = Ys ),
    ( M /\ 2'010000 =:= 0 -> S2 = [] | S2 = Ys ).
ord_op(M, [X|Xs], Ys, S1, S2) :-
    $ord_op1(M, Xs, Ys, S1, S2, X).

$ord_op1(M, Xs, [], S1, S2, X) :-
    ( M /\ 2'000001 =:= 0 -> S1 = [] | S1 = [X|Xs] ),
    ( M /\ 2'001000 =:= 0 -> S2 = [] | S2 = [X|Xs] ).
$ord_op1(M, Xs, [Y|Ys], S1, S2, X) :-
    compare(R, X, Y),
    $ord_op12(R, Xs, Ys, S1, S2, X, Y, M).

$ord_op2(M, [], Ys, S1, S2, Y) :-
    ( M /\ 2'000010 =:= 0 -> S1 = [] | S1 = [Y|Ys] ),
    ( M /\ 2'010000 =:= 0 -> S2 = [] | S2 = [Y|Ys] ).
$ord_op2(M, [X|Xs], Ys, S1, S2, Y) :-
    compare(R, X, Y),
    $ord_op12(R, Xs, Ys, S1, S2, X, Y, M).

$ord_op12(<, Xs, Ys, S1, S2, X, Y, M) :-
    ( M /\ 2'000001 =:= 0 -> S1 = T1 | S1 = [X|T1] ),
    ( M /\ 2'001000 =:= 0 -> S2 = T2 | S2 = [X|T2] ),
    $ord_op2(M, Xs, Ys, T1, T2, Y).
$ord_op12(>, Xs, Ys, S1, S2, X, Y, M) :-
    ( M /\ 2'000010 =:= 0 -> S1 = T1 | S1 = [Y|T1] ),
    ( M /\ 2'010000 =:= 0 -> S2 = T2 | S2 = [Y|T2] ),
    $ord_op1(M, Xs, Ys, T1, T2, X).
$ord_op12(=, Xs, Ys, S1, S2, X, Y, M) :-
    ( M /\ 2'000100 =:= 0 -> S1 = T1 | S1 = [X|T1] ),
    ( M /\ 2'100000 =:= 0 -> S2 = T2 | S2 = [Y|T2] ),
    ord_op(M, Xs, Ys, T1, T2).
Looking at this predicate, we see that there is a clear notion of inputs (M, Xs, Ys) and outputs (S1, S2).
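A worked example, assuming the sample implementation above: with mask 2'001111 the first output collects all three regions (the union) and the second collects the elements only in Xs (the difference), which is exactly ord_union/4 below:

?- ord_op(2'001111, [a,b,d], [b,c], Union, OnlyInXs).
Union = [a,b,c,d],
OnlyInXs = [a,d].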
There are two ways to define is_ordset/1
. One
enforces the groundness of an ordered set. That is appropriate
for systems which can delay calls to (some analogue of)
compare/3
. The other is useful for hacking, and
allows set elements to be non-ground. At this stage I do not
wish to make any recommendation, so I merely note the two definitions:
is_ordset(S) :-                 % enforces groundness
    ground(S),
    sort(S, S).

is_ordset(S) :-                 % allows sets of variables
    nonvar(S),
    $is_ordset(S).

$is_ordset([]).
$is_ordset([X|Xs]) :-
    nonvar(Xs),
    $is_ordset(Xs, X).

$is_ordset([], _).
$is_ordset([Y|Ys], X) :-
    nonvar(Ys),
    X @< Y,
    $is_ordset(Ys, Y).
The list processing predicates, especially
length/2
, member/2
, and select/3
can be used with ordered sets. The memberchk/2
and
selectchk/3
predicates are rather more dubious, and
need their own implementations.
list_to_ordset(List, Set) :-
    sort(List, Set).

ord_op(M, Xs, Ys, S1) :-
    ord_op(M, Xs, Ys, S1, []).

ord_compare(R, S1, S2) :-
    ord_op(2'010001, S1, S2, D12, D21),
    $ord_compare(R, D12, D21).

$ord_compare(<, [], [_|_]).
$ord_compare(>, [_|_], []).
$ord_compare(=, [], []).

ord_disjoint(S1, S2) :- ord_op(2'100, S1, S2, []).
ord_disjoint_union(S1, S2, S) :- ord_op(2'100011, S1, S2, S, []).
ord_intersect(S1, S2) :- ord_op(2'100, S1, S2, [_|_]).
ord_intersection(S1, S2, S) :- ord_op(2'100, S1, S2, S).

ord_intersection([], []).
ord_intersection([S1|Ss], S) :-
    ord_intersection(Ss, S2),
    ord_intersection(S1, S2, S).

ord_memberchk(X, S) :- ord_intersect([X], S).
ord_nonmember(X, S) :- ord_disjoint([X], S).
ord_proper_subset(S1, S2) :- ord_op(2'010001, S1, S2, [], [_|_]).
ord_proper_superset(S1, S2) :- ord_op(2'001010, S1, S2, [], [_|_]).
ord_selectchk(X, S1, S2) :- ord_op(2'001010, [X], S1, S2, []).
ord_subset(S1, S2) :- ord_op(2'001, S1, S2, []).
ord_subtract(S1, S2, S) :- ord_op(2'001, S1, S2, S).
ord_superset(S1, S2) :- ord_op(2'010, S1, S2, []).
ord_symdiff(S1, S2, S) :- ord_op(2'011, S1, S2, S).

ord_union([], []).
ord_union([S1|Ss], S) :-
    ord_union(Ss, S2),
    ord_union(S1, S2, S).

ord_union(S1, S2, S) :- ord_op(2'111, S1, S2, S).
ord_union(S1, S2, U, D) :- ord_op(2'001111, S1, S2, U, D).
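A few illustrative queries, assuming the definitions above:

?- ord_intersection([1,3,5], [1,2,3], S).
S = [1,3].

?- ord_selectchk(3, [1,3,5], S).
S = [1,5].

?- ord_subset([1,3], [1,2,3]).
true.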
There are two names I am not happy with here.
I would prefer it if ord_subtract/3
had a less "imperative"
name, such as ord_difference/3
. It would also be clearer
if ord_union/4
were named ord_union_difference/4
.
However, that would break backwards compatibility, and that's bad manners.
There are four predicates missing from the old library.
ord_seteq/2 has been generalised to ord_compare/3.
ord_add_element(X, S0, S) is replaced by ord_union([X], S0, S).
ord_del_element(X, S0, S) is renamed to ord_selectchk(X, S0, S) to be compatible with the list operations.
ord_setproduct/3 preserves set ordering but does not depend on it. Its analogue probably belongs among the list operations.
Of course, nothing would prevent an implementation providing those predicates for backwards compatibility.
The higher order list operations can be applied to ordered sets.
Results will not necessarily be ordered sets, but can be converted
to ordered sets by sorting. (Except, of course, in the
thrice-accursed ISO "Prolog", where there is no "standard"
sort/2
.) It is useful that
[include,exclude]
/N+2 do not alter the order
of their 2nd argument and do not introduce duplicates, so filtering
an ordered set yields an ordered set.
Queues should be based on the q(Unary,L0,L) representation. Each basic operation is O(1).
A queue is a sequence, rather like a list. Adding items or lists of items at either end is rather like appending, and so is removing an item or list of items from the head end. (That is to say, what I'm calling a "queue" here is really an "output-restricted deque").
There are many ways to implement queues. A single list makes some operations cheap and other operations very expensive. A pair of back-to-back lists works nicely in a strict functional language, but since there is a slow rebalancing operation which may be executed repeatedly on backtracking, they are not efficient in Prolog. The representation I propose here was invented by someone at SRI. Fernando Pereira told me about it, and I am sorry to say that I have forgotten the inventor's name. This representation has the nice property that adding an element at either end or removing an element from the head end requires just a single unification. This makes it well suited to Prolog.
There is one flaw in this representation: it uses list differences so that appending an item at the tail end must be regarded as destroying the original queue. Queue uses should be single threaded. Mercury can enforce this, but does not support the partially instantiated terms the technique relies on. Prolog does support partially instantiated terms, but cannot enforce single threading.
We found a similar problem with sets; a representation which Prolog is good at has technical flaws which mean that it has to be used carefully, but the efficiency of the representation is such that we are unwilling to abandon it.
:- type queue(T) ---> q(unary, list(T), list(T)).
:- type unary ---> z ; s(unary).

is_queue(q(U,L0,L)) :-
    nonvar(U),
    nonvar(L0),
    $is_queue(U, L0, L).

$is_queue(z, L0, L) :-
    L0 == L.
$is_queue(s(U), [_|L1], L) :-
    nonvar(U),
    nonvar(L1),
    $is_queue(U, L1, L).

item_queue_append(X, q(U,L0,L), q(s(U),[X|L0],L)).

list_queue_append([], Q, Q).
list_queue_append([X|Xs], Q, Q0) :-
    item_queue_append(X, Q1, Q0),
    list_queue_append(Xs, Q, Q1).

queue_item_append(q(U,L0,[X|L]), X, q(s(U),L0,L)).

queue_list_append(Q, [], Q).
queue_list_append(Q0, [X|Xs], Q) :-
    queue_item_append(Q0, X, Q1),
    queue_list_append(Q1, Xs, Q).

queue_list(q(U,L0,L), Xs) :-
    $queue_list(U, L0, L, Xs).

$queue_list(z, L, L, []).
$queue_list(s(U), [X|L1], L, [X|Xs]) :-
    $queue_list(U, L1, L, Xs).

cons_queue(X, Q0, Q1) :- item_queue_append(X, Q0, Q1).
head_queue(X, Q1) :- item_queue_append(X, _, Q1).
tail_queue(Q0, Q1) :- item_queue_append(_, Q0, Q1).

member_queue_offset(X, q(s(U),L0,_), N) :-
    $member(X, U, L0, 0, N).

$member(X, _, [X|_], N, N).
$member(X, s(U), [_|L1], N0, N) :-
    N1 is N0 + 1,
    $member(X, U, L1, N1, N).

member_queue_index(X, q(s(U),L0,_), N) :-
    $member(X, U, L0, 1, N).

member_queue(X, Q) :- member_queue_index(X, Q, _).
memberchk_queue(X, Q) :- member_queue(X, Q), !.

% queue_list_length(Q, Xs, N): Q and Xs have the same elements, N of them.

queue_empty(Q) :- queue_list(Q, []).

queue_length(q(U,L0,L), N) :-
    (   integer(N) ->
        N >= 0,
        $queue_length(U, L0, L, N)
    ;   var(N) ->
        $queue_length(U, L0, L, 0, N)
    ;   $error
    ).

$queue_length(z, L, L, N, N).
$queue_length(s(U), [_|L1], L, N0, N) :-
    N1 is N0 + 1,
    $queue_length(U, L1, L, N1, N).

$queue_length(s(U), [_|L1], L, N0) :-
    N0 > 0,
    !,
    N1 is N0 - 1,
    $queue_length(U, L1, L, N1).
$queue_length(z, L, L, 0).

portray_queue(Q) :-
    (   write('[|'),
        member_queue_offset(X, Q, N),
        ( N =:= 0 -> true ; write(', ') ),
        write(X),
        fail
    ;   write('|]')
    ).
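A short session, assuming the definitions above, showing first-in first-out behaviour (the empty queue is q(z,L,L)):

?- Empty = q(z,L,L),
   queue_list_append(Empty, [a,b,c], Q1),  % enqueue a, b, c at the tail
   head_queue(X, Q1),                      % X = a: first in, first out
   tail_queue(Q2, Q1),
   queue_list(Q2, Rest).
X = a,
Rest = [b,c].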
I have omitted the following predicates from the Quintus library:
list_queue(Xs, Q) is queue_list(Q, Xs).
queue_last(Q, X) has the wrong argument order and is not a fast operation.
queue_last(Q0, X, Q) has the wrong argument order and is not a fast operation.
Since the Quintus module is called 'newqueues', this should not be a problem.
I think I may already have too many operations providing access to the elements of a queue other than by popping them. It doesn't feel right to propose any higher-order operations on them. The idea of having 2N map*/N+1 predicates,
map(P, L1, L2)
map_list_queue(P, L1, Q2)
map_queue_list(P, Q1, L2)
map_queue_queue(P, Q1, Q2)
doesn't appeal either. Once the higher order list operations are there, anyone who wants higher order queue operations can easily program them him/herself.
DEC-10 Prolog supported integers up to ±131071, so people had no trouble writing or reading numbers correctly. The ISO standard requires floating point numbers to be supported, so we find things like
pi(3.141592653589793238462643383279502884197169399375105821).
Many modern Prologs support big integers, so we can have occasion to read and write large integers like 8200794532637891559375 (the number of possible phylogenetic trees on 20 species). Even a simple bitmask written as a base 2 number with 64 bits is hard to get right. Writing and reading numbers with many digits is an old problem with an old solution: allow digits to be separated by a "digit separator". The traditional separator in European languages is either a dot or a comma. That conflicts with other uses of those characters in programming. Fortran and Algol allowed spaces. Since a number can never be followed by another number in Prolog,
pi(3.14159 26535 89793 23846 26433 83279 50288 41971 69399 37508).
would be unambiguous. However, it would be anomalous; no other unquoted tokens in Prolog allow embedded spaces. Ada has a very similar solution: use the underscore as digit separator. So we could write
phylogenies(20, 8_200_794_532_637_891_559_375).
Ada requires that there be a(n extended) digit on each side of an underscore in a number. This seems to work very well. But there is another problem with large numbers in Prolog. Some file systems require the size of a record to be limited. For example, RMS-11 in VMS didn't cope with records longer than 32,767 bytes. Sending source code through e-mail used to run into problems with mail transport agents truncating or garbling long lines. Pretty-printers try to keep line lengths reasonable for the sake of people reading code. Every other kind of token in Prolog is either short by nature or allows an embedded line break for presentation purposes, the famous C-style backslash-newline hack. We can't use that hack in Prolog numbers, because
p(1\ 2).
is legal syntax if \
is declared as an infix
operator.
I therefore propose that the syntax of numbers in the ISO standard should be revised to
integer constant = decimal digit char, {[digit separator], decimal digit char} ;
binary constant = binary constant indicator, binary digit char, {[digit separator], binary digit char} ;
octal constant = octal constant indicator, octal digit char, {[digit separator], octal digit char} ;
hexadecimal constant = hexadecimal constant indicator, hexadecimal digit char, {[digit separator], hexadecimal digit char} ;
fraction = decimal point char, decimal digit char, {[digit separator], decimal digit char} ;
digit separator = underscore char, {layout char}, [new line char, {layout char}] ;
Note that this uses {layout char}, not [layout text sequence]. This is deliberate; comments are not allowed inside numbers.
Is this backwards compatible with the current ISO standard? Yes: 1_234 currently looks like an integer 1 followed by a variable _234 and this is never legal in ISO Prolog syntax.
Note that the language should impose no fixed rule about where the underscores can go, other than requiring digits on both sides. For example, if you are copying a number out of a table you probably want to keep its division into groups, and a popular group size is 5, even though we use 3 normally in English. Also, an Australian would want to write 123_456_789, but an Indian writing in English would still prefer to write 12_34_56_789.
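Under the proposed syntax, all of the following would be read as numbers (these examples illustrate the proposal; they are not ISO syntax):

p(8_200_794_532_637_891_559_375).   % groups of three
p(12_34_56_789).                    % Indian grouping
p(0b1111_0000_1111_0000).           % a 16-bit mask in binary
p(3.14159_26535_89793).             % separators in a fraction too
% But not:
%   p(_123)   - a variable, not a number: no digit before the underscore
%   p(123_)   - a syntax error: no digit after the underscore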
IEEE floating point arithmetic was introduced in IEEE 754-1985, ten years before the Prolog standard came out. The current standard is IEEE 754-2008. When the Prolog standard came out, the personal computer and workstation market was overwhelmingly dominated by machines claiming to support IEEE floating-point. The best known holdouts were the VAX and the IBM mainframes.
The VAX is dead. Its successor, the Alpha, was designed from the
beginning to support IEEE arithmetic. The Alpha is dead. (Thanks, HP.)
IBM mainframes now support IEEE arithmetic and have for some time.
The Java language requires either IEEE arithmetic (under strictfp
)
or something close to it (otherwise). For the last 10 years we have had a
standard binding to even the more esoteric features of IEEE arithmetic
through C99.
But we can't do so for Prolog.
Why not?
Section 8.7, "Arithmetic comparison", says amongst other things
For IEEE arithmetic, +0.0 and -0.0 are distinct values with different
behaviour. They are readily distinguished using copysign()
,
for example. Because they behave differently, they must not unify.
But the IEEE standard requires that arithmetic comparison regard them
as numerically equal. So for IEEE we must have eqF(x, y) true for x = +0.0 and y = -0.0 even though x and y do not unify; the ISO Prolog standard says we can't have both.
Conversely, if a Prolog variable X is bound to a term representing an IEEE Not-a-Number value, X = X must be true for Prolog, but eqF(X, X) must be false for IEEE.
IEEE comparison is not a total order on floats. There are pairs of floats which are not comparable. The default behaviour for an IEEE system is that order predicates like < should raise an "unordered" exception in such cases and equality predicates should fail. This the Prolog standard does not allow: trying to order two floats is never allowed to raise an exception.
The standard lists, amongst others, axioms for arithmetic comparison which IEEE arithmetic cannot satisfy. These errors need to be fixed.
It is essential for the correctness of sorting algorithms that ordered comparisons (< =< > >=) should be transitive.
Sadly, the ISO Prolog arithmetic comparisons are nothing of the kind. The reason for this was well understood when the ISO Prolog standard was prepared. Indeed, I warned the committee about it. However, they stuck with it.
Here is the problem. The definition of (=<
)/2
uses leqFI(x,n), which is defined as
leqF(x, floatI->F(n))
if the intermediate result doesn't overflow, or as an overflow
exception if it does. That is, mixed comparison works by converting
integers to floats.
The problem with this is that it is possible to find integers x and z and a float y such that x < z but both convert to y; then z =< y and y =< x both hold, while z =< x does not, so =< is not transitive. Possible? It's easy: as soon as integers exceed the precision of the float format, distinct integers round to the same float.
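A concrete demonstration, assuming IEEE 64-bit doubles and integers of at least 54 bits (so not every conforming system will reproduce it):

?- X is 1 << 53,           % 9007199254740992
   Z is X + 1,             % 9007199254740993, so X < Z
   Y is float(X),
   Z =< Y,                 % true: Z converts to 9007199254740992.0
   Y =< X,                 % true: the conversion of X is exact
   \+ Z =< X.              % and yet Z =< X is false: =< is not transitive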
Transitivity is easily ensured by requiring comparison to work as if the integers were converted to a hypothetical floating point format with enough bits for the conversion to be precise. (Common cases can even be handled fast at assembly level by doing the integer to float conversions in hardware and trapping to slower code if the INEXACT flag is set.)
It is not possible to write a sorting predicate that correctly sorts mixed integer/float lists in ISO Prolog as it stands.
The change required has the effect of producing correct answers in more cases. Existing systems already have to be incompatible with the ISO Prolog standard's errors noted above if they are to give sensible results, so we should not let backwards compatibility with the standard bind us here. Mistakes really should not be forever.
There are three predicates which are so basic that the ISO Prolog
standard itself uses them. Sadly, it does not grant them to Prolog
programmers. We're lucky: we nearly didn't get
length/2
.
succ/2
provides the successor function on
natural numbers only. It is essential to the purposes
for which succ/2
was invented that it should not accept or
produce negative integers. It is also essential that it should not
accept or produce non-integral rational numbers.
This is in the Prologue to Prolog, and the error conditions
implied by the declarations here are believed to be identical to those
in the Prologue.
:- pred succ(integer, integer).
:- succ(X, Y) when X ; Y.
:- succ(X, Y) is semidet when X ; Y.

% Sample implementation
succ(X, Y) :-
    (   integer(X), integer(Y) ->
        X >= 0,
        Y is X + 1
    ;   integer(X), var(Y) ->
        X >= 0,
        Y is X + 1
    ;   var(X), integer(Y) ->
        Y > 0,
        X is Y - 1
    ;   var(X), var(Y) ->
        (   X == Y -> fail
        ;   $error              % instantiation fault
        )
    ;   $error
    ).
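Sample queries, following the sample implementation above:

?- succ(3, Y).    % Y = 4
?- succ(X, 4).    % X = 3
?- succ(X, 0).    % fails: 0 is not the successor of any natural number
?- succ(X, Y).    % instantiation fault: neither argument is bound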
plus/3
provides reversible arithmetic on exact numbers.
It was intended to work with rational numbers as well as integers.
ISO Prolog does not require rational number support. An ISO Prolog system
is not, for example, allowed to make X is 2/5
return a
rational number, since the standard is explicit that the result of
(/)/2
is always the result of a floating-point division.
However, despite there being no portable or semi-portable way to
exploit the fact, there are Prolog implementations that support
rational numbers. In the specification below, I use a type test predicate
rational/1
, which is intended to be true of all supported
rational numbers, including integers. In a Prolog system without
rational number support, read rational/1
as
integer/1
It is essential to the purposes for which plus/3
was
invented that it should be a proper logical predicate; the nature of
floating point arithmetic means that it cannot behave as intended
if it allows floating-point arguments.
It could be extended to Gaussian integers, though.
:- pred plus(rational, rational, rational).
:- plus(X, Y, Z) when X, Y ; Y, Z ; Z, X.
:- plus(X, Y, Z) is semidet when X, Y ; Y, Z ; Z, X.

% Sample implementation
plus(X, Y, Z) :-
    (   rational(X), rational(Y) -> Z is X + Y
    ;   rational(Y), rational(Z) -> X is Z - Y
    ;   rational(Z), rational(X) -> Y is Z - X
    ;   var(Y), Z == Y ->          % Ulrich's case: X + Y = Y, so X = 0
        (   X == 0 -> true
        ;   nonvar(X) -> fail
        ;   $error                 % instantiation fault
        )
    ;   $error
    ).
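Sample queries, following the sample implementation above:

?- plus(2, 3, Z).    % Z = 5
?- plus(X, 3, 5).    % X = 2
?- plus(2, Y, 5).    % Y = 3
?- plus(0, Y, Y).    % true: Ulrich's case, Y may remain unbound
?- plus(X, Y, Y).    % instantiation fault: X cannot be determined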
between/3
is astonishingly useful, not least because of
its frequent use in bounding the search space of other predicates. It
is the Prolog equivalent of a for
loop, and as such, it
is hard to write much Prolog without it, just as it is hard to write
much Prolog without append/3
or member/2
.
The description here is intended to be compatible with
section p.p.4 of
a Prologue for Prolog. ("greater or equal" and "less or equal"
in that document should be corrected to "greater than or equal to"
and "less than or equal to" respectively.) The code sketch here is
more complicated in part to show that the last successful exit from
a call to between/3 must not leave a choice point behind. (The
Prologue seems to imply that it must; I do not
believe that is intentional.) The error conditions at any rate are
meant to be identical. There is one possible exception: if
var(X) then between(X, X, Y) has one solution in logic
(X = Y with Y constrained to be an integer),
but the first two arguments of between/3
have
always been intended as strict inputs.
:- pred between(integer, integer, integer).
:- between(L, U, X) when L, U.
:- between(L, U, X) is semidet when ground(L), ground(U), ground(X).
:- between(L, U, X) is bounded when ground(L), ground(U).

% Sample implementation.
between(L, U, X) :-
    (   integer(L), integer(U) ->
        (   integer(X) -> L =< X, X =< U
        ;   var(X) -> L =< U, $between(L, U, X)
        ;   $error
        )
    ;   integer(L), integer(X), L > X -> fail
    ;   integer(U), integer(X), X > U -> fail
    ;   $error
    ).

$between(L, L, L) :- !.
$between(L, _, L).
$between(L, U, X) :-
    M is L + 1,
    $between(M, U, X).
The order in which the sample implementation enumerates solutions for X is part of the specification.
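A few sample queries, following the specification above:

?- between(1, 3, X).   % X = 1 ; X = 2 ; X = 3, in that order,
                       % with no choice point left behind after X = 3
?- between(1, 3, 2).   % succeeds deterministically
?- between(3, 1, X).   % fails: the range is empty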
There is of course a prettier alternative to between/3
,
and that is to extend the syntax of comparison.
Comparison form | between/3 equivalent
---|---
E =< X =< F | L is E, U is F, between(L, U, X)
E =< X < F | L is E, U is F-1, between(L, U, X)
E < X < F | L is E+1, U is F-1, between(L, U, X)
E < X =< F | L is E+1, U is F, between(L, U, X)
E >= X >= F | U is -F, L is -E, between(L, U, Z), X is -Z
E >= X > F | U is 1-F, L is -E, between(L, U, Z), X is -Z
E > X > F | U is 1-F, L is -1-E, between(L, U, Z), X is -Z
E > X >= F | U is -F, L is -1-E, between(L, U, Z), X is -Z
This requires "distfix" operators, which are easy enough to add to Prolog syntax and have been so added in the past. The reason I mention them is that cascaded comparison operations like this have been standard mathematical notation for a very long time. HP had a systems programming language with a double-comparison operator (and corresponding hardware instruction). One version of SETL used double-comparison operators for enumeration in this way, that's where I got the idea. Lisp also allows multiple comparison in a single form.
These operators resemble the existing arithmetic comparison operators in evaluating their "outside" arguments; unlike the existing arithmetic comparison operators they do not evaluate their "inside" argument and only accept integers and variables. This approach generalises neatly to multiple generated arguments, e.g., 1 =< X < Y =< N.
While I call this approach prettier, I do not recommend it for adoption.
In its present form, between/3
could not be usefully
extended to rationals.
However, a form between
(L, U, S, X)
meaning "L =< X, X =< U, S =\= 0, and X
is a multiple of S" is imaginable.
However, between/3
is the simplest thing that could work
and has been proven over many years, so let's stick with it.
The problem with integer division is that there are several different versions, and the one you get is seldom the one you need. We should consider seriously whether Common Lisp analogues such as
floor(N, D, Q, R)
ceiling(N, D, Q, R)
round(N, D, Q, R)
truncate(N, D, Q, R)
predicates with
floor(N, D), floor_remainder(N, D)
ceiling(N, D), ceiling_remainder(N, D)
round(N, D), round_remainder(N, D)
truncate(N, D), truncate_remainder(N, D)
as arithmetic functions, might not be a good idea. Obviously, one pair of these arithmetic functions would duplicate the effect of the existing // and mod, but can you instantly recall, without looking it up, which?
I brought this up in 1984 and with the BSI and with the ISO committee, to a deafening silence. Surely I can't be the only person who is never quite sure what // is going to do with negative numbers?
I note that although the ANSI Smalltalk standard is generally minimalist in approach, it includes both truncate-to-zero and round-to-minus-infinity division and remainder.
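To make the differences concrete, here is what the four proposed predicates would yield for a negative dividend, following the Common Lisp semantics (a sketch of the intent, not existing ISO functionality):

% N = -7, D = 2, so N/D = -3.5; in every case N =:= Q*D + R.
%   floor(-7, 2, Q, R)    =>  Q = -4, R =  1   (round toward minus infinity)
%   ceiling(-7, 2, Q, R)  =>  Q = -3, R = -1   (round toward plus infinity)
%   truncate(-7, 2, Q, R) =>  Q = -3, R = -1   (round toward zero)
%   round(-7, 2, Q, R)    =>  Q = -4, R =  1   (round to nearest, ties to even)
?- X is -7 mod 2.   % X = 1: ISO mod pairs with floor division
?- X is -7 rem 2.   % implementation-defined, since it follows //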
Chapter 9 of the ISO Prolog standard lists the following evaluable functors:
Functor | Note |
---|---|
X + Y | but not +X |
X - Y | |
- Y | |
X * Y | |
X / Y | It always delivers a floating-point result, even when X is divisible by Y. An implementation which has rational numbers is completely forbidden to return a rational answer to 1/2. This is most unfortunate, and complicates Prolog systems layered on top of Lisp or Smalltalk. |
X // Y | Yields an implementation-defined answer. |
X rem Y | yields the remainder corresponding to X // Y. |
X mod Y | yields X - Y*floor(X/Y). Knuth's strong and carefully reasoned advice that X mod 0 should be allowed and should equal X has been ignored, sadly. |
??? | There isn't any division corresponding to mod. Of course, this broke all the old Prolog code that relied on // and mod going together, but the ISO Prolog committee of the day explicitly rejected consideration for existing code as a criterion. |
abs(X) | |
sign(X) | Despite the overwhelming predominance of IEEE floating-point arithmetic on popular machines, there is no equivalent of IEEE's copysign(), nor is it clear that copysign() can be emulated. |
float_integer_part(X) | |
float_fractional_part(X) | |
float(I) | |
floor(X) | |
truncate(X) | |
round(X) | |
ceiling(X) | |
X ** Y | 2**2 is 4.0, not 4. |
sin(X) | |
cos(X) | |
atan(X) | The one version of atan() no sane programmer uses! Each year I have to explain to 1st year surveying students that although Excel has ATAN2, Visual Basic for Applications does not, and why this matters very much to them. |
exp(X) | |
log(X) | A pity this wasn't spelled ln(X). |
sqrt(X) | |
Bits << Shift | Dangerously incomplete definition. |
Bits >> Shift | Dangerously incomplete definition. |
X /\ Y | Dangerously incomplete definition. |
X \/ Y | Dangerously incomplete definition. |
\(X) | Inconsistent definition |
What can be my excuse for calling a definition "dangerously incomplete"? I mean that it can be extremely difficult for a Prolog programmer working on someone else's code to tell whether a use of one of these operations is in the defined part or not. This is in contrast to languages like Java, Lisp, and Smalltalk.
Let's start at the bottom and work up.
There are Prolog systems which support bignum arithmetic (much bigger than hardware registers); there are Prolog systems which only support integers smaller than hardware registers (RT-11 Prolog with 14-bit integers on 16-bit hardware, DEC-10 Prolog with 18-bit integers on 36-bit hardware, &c).
How, for example, could we do bitwise operations on sign-and-magnitude values as if they were twos-complement values? \(X) is simple; it is just -1-X. For other operations, the sign-and-magnitude bit pattern for the value X should be converted to the twos-complement bit pattern for the same value, operated on, and the result converted back.
sign_and_magnitude_to_twos_complement x =
    case compare x 0 of
        GT -> x
        EQ -> 0
        LT -> x `xor` MAXINT

twos_complement_to_sign_and_magnitude x =
    if x >= 0 then x else
    if x == MININT then error "overflow" else
    x `xor` MAXINT
Considering the other overheads in typical Prolog implementations, the conversion cost is tolerable; considering the rarity of bitwise operations compared with other operations (like procedure calls) the conversion cost isn't important; and considering the rarity of non-twos-complement machines the costs are overwhelmed by the improved portability of code.
I hasten to add that I like sign-and-magnitude and wish that it had prevailed over twos-complement. But it didn't.
Java programmers will remember that the Java designers found it advisable to have two right shift operators. With the prior art of Common Lisp before them, it was quite inexcusable for the ISO Prolog committee to insert a C tumour into poor old Prolog.
There really isn't any way of setting up a "logical" right shift
without exposing the width of integers. Assume for the moment that
twos-complement is used. What is the value of (-1)>>1
,
when that's an unsigned shift? It's INTMAX. Which means that it is
different for different Prolog systems, even on the same
hardware. For example, on 32-bit hardware, Prolog immediate integers
might be 24 bits, or 25 bits, or 28 bits, or 29 bits, or 30 bits,
or the Prolog system might support 32-bit integers, or even 64 bit
integers, or might have unbounded integers. The number of bits supported
might even change from one release to the next: one Prolog I know went
from 29-bit integers to 32-bit integers.
In other words, while a C programmer writing for a particular machine may have a use for unsigned shifts on signed numbers, a Prolog programmer writing for a particular machine cannot, because there isn't the slightest reason to expect two Prolog implementations on the same hardware to have the same integer width. The only kind of right shift that could possibly make sense for Prolog is signed right shift.
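A small illustration; the unsigned results below are hypothetical, assuming particular integer widths:

?- X is -1 >> 1.
% Signed (arithmetic) shift: X = -1 on every system.
% An unsigned shift would instead give 268435455 with 29-bit integers,
% 2147483647 with 32-bit integers, and so on: different on every system.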
atan
is provided, but tan
is not. I can't for the life of me think of any Prolog implementor whose
life would be simplified by leaving tan
out. The tangent
function is a standard part of Fortran, C, and Lisp.
atan2/2
is important because it
is under that name that the operation is widely documented:
atan/1
is C's atan()
and
atan2/2
is C's atan2()
. Why make the names
different if we don't have to?
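A quick illustration, assuming atan2/2 were provided as proposed (it is not in the 1995 standard):

?- X is atan2(1.0, -1.0).    % X = 2.356...: 3*pi/4, the correct quadrant
?- X is atan(1.0 / -1.0).    % X = -0.785...: -pi/4, quadrant information lost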
Apparently the standard was intended to allow IEEE floating point arithmetic, see section 5.5.10. However, that does not account for all the differences between IEEE arithmetic and its predecessors. It is certainly difficult to see how the difference between +0.0 and -0.0 can be accommodated. Read the details of the standard with close attention while asking "what about ±0? what about ±Inf? what about NaNs?" and you may be as puzzled as I am about whether IEEE arithmetic is truly allowed.
The minimum for what counts as "support" is an environment enquiry that tells you whether the system running your code offers IEEE arithmetic or not.
It is reasonable for the standard not to require IEEE floating-point arithmetic, although these days even IBM mainframes support it. But an environment enquiry is essential.
These days we have a new challenge: not only is there an IEEE standard for decimal floating point arithmetic, but there are at least two hardware lines (z/Series and POWER) that support it. We shall soon be facing a need to clarify whether a system provides binary floating point, decimal floating point, or both.
Having mod but no corresponding div is a nuisance. It means that you have to write
div(X, Y, Z) :- Z is (X - X mod Y) // Y.
The Prolog implementor could have done it for you, much more
efficiently. In fact, using div
for integer division
is older art in Prolog than using //.
There are some obvious things missing. If you are using trig functions, sooner or later you need pi. We might as well have pi/0 and e/0 as evaluable functors.
Maximum and minimum are extremely useful. They've been in Fortran for years. They're in Lisp. If it weren't for the IEEE details they would be easy to implement, but if you've ever tried to write a C99-compatible fmax() you never want to do it again and will be pained if your favourite language doesn't offer these functions.
It is a long-standing minor nuisance that there is no exclusive-OR operator. This really should have been fixed back in DEC-10 Prolog. The historic recommendation is to use >< as the exclusive-OR operator, with the same precedence as \/ but of type xfx to avoid confusion. This has not been widely taken up, sadly. Perhaps the simplest thing would be to spell this one out and not make it an operator at all: xor(X,Y) is at least tolerably clear.
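For example, assuming xor/2 as an evaluable functor (a proposal, not part of the standard):

?- X is xor(5, 3).          % X = 6
?- X is xor(xor(5, 3), 3).  % X = 5: applying the same mask twice restores the value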
To evaluate max(X, Y) or min(X, Y), X and Y are first evaluated; the value is then determined as described below.
Changes from previous drafts: this treatment now covers max/2 and min/2 as well.
What has not changed is that the result is uniquely determined by the input, even when the inputs are arithmetically equal but termwise distinct.
The C99 functions fmin*()
and fmax*()
treat a Not-a-Number argument as simply missing, so that
fmax(NaN, 0.0)
is 0.0. This is rather puzzling.
Consider the following C loop.
float x = 1.0f, y = 1.0f;
for (int i = 1; i <= 50; i++) {
    x /= 10.0f, y /= 10.0f;
    printf("%2d %e\n", i, fmaxf(x/y, 0.0f));
}
The output ends with
40 1.000000e+00
41 1.000000e+00
42 1.000000e+00
43 1.000000e+00
44 1.000000e+00
45 1.000000e+00
46 0.000000e+00
47 0.000000e+00
48 0.000000e+00
49 0.000000e+00
50 0.000000e+00
where the last five lines are silently wrong.
It seems better to rely on an exception being raised; the same
exception that would be raised by (<)/2
.
Given a correct implementation of (<)/2
,
max/2
and min/2
act like
calls to the predicates
$max(X, Y, Max) :-
    (   Y < X -> Max = X
    ;   X < Y -> Max = Y
    ;   \+ float(Y) -> Max = X
    ;   \+ float(X) -> Max = Y
    ;   X == -0.0, Y == 0.0 -> Max = Y
    ;   Max = X
    ).

$min(X, Y, Min) :-
    (   Y < X -> Min = Y
    ;   X < Y -> Min = X
    ;   \+ float(Y) -> Min = X
    ;   \+ float(X) -> Min = Y
    ;   X == -0.0, Y == 0.0 -> Min = X
    ;   Min = Y
    ).
except for the unordered_comparison
exception
from (<)/2
being reported as coming from max or
min.
In an actual implementation, we would expect the type of each argument to be tested at most once, and only one comparison to be made. The test for -0 and +0, given that X and Y are equal floats, can be done by testing for one variable being all bits 0 and the other being different, which are both fast tests.
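Reading the sketch above literally, some consequences (illustrative, not ISO behaviour):

?- X is max(1, 1.0).      % X = 1.0: on an arithmetic tie, the float is chosen
?- X is max(-0.0, 0.0).   % X = 0.0
?- X is min(-0.0, 0.0).   % X = -0.0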
To summarise, ISO Prolog arithmetic should acquire:
- a prefix + operator;
- a div infix operator;
- xor/2 as a functor but not an operator;
- tan(X), atan2(X,Y), acos(X), and asin(X);
- integer exponentiation (whether as (^)/2 or something else is less important; there are arguments either way);
- max(X,Y) and min(X,Y) as specified above;
- pi and e constants.
Recently my attention was drawn to the following facts:
:- include(TextUnit).
% 7.4.2.7
This is a directive, not a built-in predicate.
It may only be used statically, and in this form. That is,
it is not possible in an ISO Prolog program to have
code which computes TextUnit and then calls include/1. To
repeat, there is no include/1
built-in
predicate, only a compile-time :-include(_)
directive.
:- ensure_loaded(TextUnit).
% 7.4.2.8
This is a directive, not a built-in predicate.
It may only be used statically, and in this form. That is,
it is not possible in an ISO Prolog program to have
code which computes TextUnit and then calls ensure_loaded/1.
To repeat, there is no ensure_loaded/1
built-in predicate, only a compile-time
:-ensure_loaded(_)
directive.
There are additional problems with those directives.
There is no guarantee that a text unit name acceptable to these directives will be acceptable to the open/4 predicate. In fact there are excellent reasons to expect that it won't be: given 'foobar' it is useful for a Prolog system to try "foobar", "foobar.qof", "foobar.pl" in that order, or some such thing.
There is thus no guarantee that :-ensure_loaded('foobar.pl') will work at all.
Even if ensure_loaded/1 were available as a built-in predicate, the whole point of it is to load a file once and only once. That makes it useless for reloading a file after it has been changed.
A compile_predicate(Name/Arity) built-in predicate was proposed to the committee (by me) with a clear rationale. Here's a trivial implementation:
compile_predicate(Name/Arity) :-
    scratch_file_name(Foobar),
    telling(Old),
    tell(Foobar),
    listing(Name/Arity),
    tell(Old),
    close(Foobar),
    consult(Foobar),    % 'compile' if you have it.
    delete_file(Foobar).
That's not industrial-strength code, but it shows that all the Prolog systems listed above could easily support this operation.
The standard provides open/4 for opening a file, but there is no standard way to call that predicate. The format of the file name argument is implementation defined and there is no way for a program to discover what that format is. Contrary to the impression given by the examples in the standard, there is no requirement that any atoms ever be acceptable as file name arguments, nor, if accepted, that they be interpreted in accordance with the host file system.
open/4
should be defined as accepting an atom,
a non-empty list of atoms, or a non-empty list of character
codes as a file name, plus additional implementation-defined ground terms. An atom provides a sequence of characters
(its name). A non-empty list of atoms provides a sequence of
characters determined by concatenating the names of its elements.
A non-empty list of character codes provides a sequence of
characters corresponding to the codes. In these three cases,
the character sequence is taken verbatim and passed to the
host file system for interpretation.
:- include/1
should be defined as accepting an
atom, a non-empty list of atoms, or a non-empty list of
character codes as a text unit name, plus additional
implementation-defined ground terms. A text unit name of these
forms should always be interpreted as referring to the contents
of a host file system file. If and only if there is no host file
system file with exactly the name given, an implementation may try
implementation-defined transformations of the file name. Typical
transformations include adding a ".pl" or ".pro" suffix if there
is no suffix, removing all "_" characters, changing " " to "-"...
:- ensure_loaded/1
should be defined as accepting
the same kinds of argument as :- include/1
and as
interpreting them in the same way, except that it might try a
different series of transformations, such as trying ".qof" before
trying ".pl".
ensure_loaded/1
should also be available as a
built-in predicate.
It would also be good to provide include/1 as a built-in predicate, not entirely unlike the old consult/1. Unfortunately, consult/1 has been given too many meanings to be adopted.
See the previous section. There are at least two dimensions of portability: portability across operating systems (the same program working with one Prolog system on several operating systems), and portability across Prolog implementations (the same program working with several Prolog systems on one operating system).
In section 7.1 I complained that there is no way of specifying a
file name to open/4
which is portable in the second sense.
A Prolog implementor I shan't name has taken me to task over this,
saying that it is quite reasonable for there to be no "portable" file
name terms because only the first kind of portability matters, and
the file names would not be portable anyway.
He is wrong. For example, I have Quintus Prolog, SICStus Prolog, XSB Prolog, SWI Prolog, Ciao, ECLiPSe, and a couple of others (including a version of Poplog I can't actually get at directly, as part of the SPARK/Ada toolkit), all running on the same machine. If you are trying to write code for publication, you want to test it under several Prolog systems. If you are trying to write library code for general use, you want to test it under several Prolog systems. Encouraging portability between different language implementations is one of the major purposes of a language standard. The really annoying thing here is that those implementations do share a common file name term (an atom), so why can't the standard say so?
He is also wrong about the portability of file names. It is true that different operating systems have different file name syntax, but words matching the regular expression /[A-Z][A-Z0-9]{0,5}/ are portable between MVS, CMS, VMS, TOPS-10, TENEX, TOPS-20, RT-11, RSTS/E, all versions of UNIX from at least version 6 on, all versions of DOS, all versions of Windows, all versions of MacOS, the Burroughs (now Unisys) MCP, PR1MOS, ITS, and several other operating systems I haven't used. This is hardly surprising: that's the syntax for Fortran 66 identifiers, and Fortran programmers expected to be able to use Fortran identifiers as file names. The fact that some file names are not portable across a wide range of operating systems is no excuse for failing to provide standard access to file names that are portable across a wide range of operating systems.
In fact, there is a hidden assumption in the ISO Prolog standard which restricts ISO Prolog input/output to DOS, Windows, and POSIX-compatible systems such as Solaris and Mac OS X. Since DOS and Windows accept forward slashes in file names as well as reverse slashes, one can expect relative file names formed from the POSIX portable filename character set, using forward slash as the directory separator, with names restricted to "8+3" form, to be usable across all the file systems where ISO Prolog can be expected to work. What's more, while file names containing slashes are not legal file names in OpenVMS, TOPS-10, TENEX, or TOPS-20, they can easily be mapped to file names which are legal. It is quite reasonable for open/4 to take as host file names those character sequences which are possible host file names and to use some kind of POSIX-to-host mapping for character sequences which are not.
Section 3.94 of the ISO standard lists the input/output modes
read
, write
, and append
.
Section 7.10.1.1 repeats this.
It is not clear whether an implementation may
support additional input/output modes, such as rewrite
.
The relevant wording in section 7.10.1.1 is
    append -- Output. The source/sink is a sink. If the sink already exists then output shall start at the end of that sink, else an empty sink shall be created.
The problem is that "the end of the sink" is not a well defined term.
Stream options (section 7.10.2.11) include type(text)
or
type(binary)
. Now, in UNIX, Windows, and MacOS, the contents
of a file are a sequence of bytes. (In many other operating systems, they
are a sequence of records.) The end of a sequence of bytes is well defined.
But a text file is a sequence of characters encoded as a sequence of bytes.
For example, in TOPS-10 the last block of a file was typically padded with
ASCII NUL characters. Where is the end? Is it at the end of the block,
or is it at just after the last non-NUL character? Or consider DOS. It
was based on CP/M, where a file was a sequence of 128-byte records. When
you closed a character stream, if there was any space left in the last
record, you'd get a Ctrl-Z followed by as many NULs as there were room
for. Although Windows no longer uses this convention for output, it still
obeys it for input. So if a 2 MB file contains a Ctrl-Z as its first
character, a C program reading it as a text stream will think it is empty.
Where is "the end of that sink"? Is it at the logical file end as reported
by the file system, or is it at the first Ctrl-Z? This is actually a serious
problem, because if you start writing after the last byte in a file, the
characters you write may be completely invisible when you read back:
?- open('HAIRY', write, Bin, [type(binary)]),
   put_byte(Bin, 26),
   close(Bin),
   open('HAIRY', append, Txt, [type(text)]),
   put_code(Txt, 0'*),
   nl(Txt),
   close(Txt),
   open('HAIRY', read, Src, [type(text)]),
   get_code(Src, X),
   close(Src).
What is the value of X? Is it 0'*, or is it the end of file code?
To get text mode appending right, it is necessary to open the file for reading and writing, read and decode the current data, and then decide where the end of the text contents is found.
The matter of encoding is even more complicated than that. There is not a one-to-one correspondence between characters and bytes. There are over 98,000 characters in Unicode. To convert between a stream of characters and a stream of bytes it is necessary to use some kind of encoding. This encoding may be stateless (like UTF-8), or it may be stateful (like Unicode Technical Report 6), or for that matter like dynamic compression schemes such as dynamic Huffman encoding and Lempel-Ziv encoding. Stateless encodings are not a problem. Some stateful encodings can be restarted, by which I mean that you can put the encoder back into its initial state and keep or resume writing, and still recover the intended character stream without error or loss of information. Lempel-Ziv compression and Unicode Technical Report 6 are like that. If you create a file, write CS1, close it, open for append, write CS2, close it, and then read it back you will get CS1++CS2, even though you will not get the same byte stream that you would have got from writing CS1++CS2 in one go. But some stateful encodings cannot be restarted, because the original designer never thought of providing a code that says "forget what you know".
There is no way in ISO Prolog to get the effect of the O_APPEND flag in POSIX, which is something of a pity.
The behaviour of append
mode for text files needs to be
clarified.
When encodings are supported, the interaction between append
and encodings needs to be clarified.
An early draft of the ISO standard said that a stream position was an integer. In the days of ASCII/Latin-1 on UNIX and MacRoman on MacOS, the mapping between position in character sequence and position in byte sequence was the identity mapping. MS-DOS, with its CR-LF sequences, spoiled that, and meant that there was no general way to go from character position to byte position without reading all the previous bytes in the file. That was actually our good luck, because other encodings make it even harder.
However, the ISO standard makes stream positions harder to use than it needs to. A normal use of stream positions is to remember them as you generate a file, and save them out in some kind of index structure. Then to look in the file, you load the index structure and use that to tell you where to go.
That usage is not allowed in ISO Prolog. The ISO Prolog standard says, in section 7.10.2.8, that stream position terms are stable while the stream is open. A stream position could, for example, include the stream identifier, making it unusable with any other stream for the same file.
It should be said that a stream-position term uniquely identifies a particular position in a source/sink that is connected to a persistent file as long as the contents of that file preceding the position are not modified.
When encodings are supported, stream options such as encoding(EncName) and normalisation_form(NF) will be needed.

Even Prolog programs need to be tested. One of the most useful facilities in the Quintus library was its provision for generating random data structures of various sorts. The minimum requirement here is generating random numbers.
I note that the new C++ standard has extensive support for generating random numbers according to many distributions and using many generators. That would be nice to have, but one good generator is all we need to get started.
The basic requirements on the design are that it not require global mutable state (because some Prologs now support coroutining and/or multi-threading), that it should permit the efficient generation of large batches of random numbers, that it should allow random number states to be written and read back within a single system (but not across all systems), and that it should encourage the use of high quality generators.
:- type random.
A random generator state is a ground acyclic term which is not a list, not an atom, and not a floating-point number. It may be quite large; the Mersenne Twister requires 624 32-bit integers. Two random generator states should unify if and only if they represent the same abstract state. It should be safe to assert a clause containing a random generator state. It should be safe to write a random generator state to a text file, and it should read back as the same random generator state whatever the operator settings in force at either write time or read time might be. The written form of a random generator state should not end with a token that would accept "." as a continuation.
An implementation should state which random number generation algorithm it uses, and should document whether a random state is an integer (which might be reasonable in systems like SWI Prolog with bignum arithmetic) or if not what the principal functor is. The arguments need not be documented, but some indication of the number of bits in a random state should be given.
:- pred seed_random(+Seed: list(integer), ?Random: random).
This predicate is a pure deterministic function from lists of integers to random generator states. The Seed argument could be empty; it could contain a single integer of any size; it could be a list of character codes. Any list of integers should be accepted.
It is expected that the algorithm used to compute a random state from integers will be reasonably efficient. It need not be cryptographically strong.
This predicate is used when you want a reproducible sequence.
:- pred seed_random(?Random: random).
This predicate makes up a random state using any information it can
get, such as the time of day, /dev/random
, or any such thing.
This is used when you want a non-reproducible sequence.
:- pred floats_random(?List: list(float), ?Length: integer, +Random0: random, ?Random: random)
This is true when length(List, Length) and the elements of List are pseudo-random floats in the open interval (0,1), Random0 is the initial random generator state, and Random is the final random generator state. This predicate is a pure function from Length and Random0 to List and Random such that floats_random(Xs, L, R0, R1) and floats_random(Ys, M, R1, R2) and append(Xs, Ys, Zs) and plus(L, M, N) imply floats_random(Zs, N, R0, R2).
The argument order here is chosen so that (a) the input-output pair are at the end, as in a DCG, and (b) since length/2 is a projection of floats_random/4, the corresponding arguments match.
The random generator must have a period of at least 10^12. The implementor should be able to testify that it has passed the DieHard tests or better. Note that even a generator with a high period may produce duplicate floats in even a short list; it is the entire sequence that doesn't repeat too soon, not single elements.
There are good technical reasons to demand that neither 0.0 nor 1.0 should ever be returned by such a generator.
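As an illustration of the interface (not of a generator anyone should use), here is a minimal sketch. The lcg/1 state representation and the $seed/$floats helpers are assumptions of this sketch; the multiplier and increment are Knuth's MMIX linear congruential constants. It assumes bignum (or at least 64-bit) integer arithmetic, would not pass DieHard, and a real implementation should use something far stronger, such as the Mersenne Twister.

:- type random ---> lcg(integer).

seed_random(Integers, lcg(S)) :-        % a pure function of the integer list
    $seed(Integers, 88172645463325252, S).

$seed([], S, S).
$seed([I|Is], S0, S) :-
    S1 is (S0 * 31 + I) mod (1 << 64),
    $seed(Is, S1, S).

floats_random(List, Length, R0, R) :-
    (   integer(Length) -> length(List, Length)   % let Length drive the list
    ;   true
    ),
    $floats(List, 0, Length, R0, R).

$floats([], N, N, R, R).
$floats([F|Fs], N0, N, lcg(S0), R) :-
    S1 is (S0 * 6364136223846793005 + 1442695040888963407) mod (1 << 64),
    F is ((S1 >> 11) + 1) / ((1 << 53) + 2),      % strictly inside (0,1)
    N1 is N0 + 1,
    $floats(Fs, N1, N, lcg(S1), R).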
There are two issues here: what can we do generally, and what can we do for POSIX? We can't have a requirement for directories generally, any more than C can, because some systems that could host a Prolog (or C) system don't have directories. But there is quite a lot we could do for POSIX, and a POSIX interface is available for Windows, for IBM mainframe operating systems, and even for Unisys.
I endorse the POSIX interface proposed for ISO Prolog in its entirety.
That proposal is not a complete binding for POSIX, but it offers the most critical missing features. Nothing less than that will serve. I have carefully examined that document, and find that everything in it makes sense on Windows; some of the things you might expect not to are in fact simulated in Microsoft's C library. The discussion of file name parts needs a little more detail.
I claim that socket support doesn't belong in the POSIX binding because sockets are available for Windows and IBM mainframe operating systems and other operating systems, so we should see if a cross-platform interface can be developed.
The major feature missing from the proposed POSIX feature is the ability to run external programs.
:- type command        % this is a union of existing types
    = atom             % but not empty
    + codes            % but not empty
    + chars            % but not empty
    + string.          % but not empty

:- pred system.
:- pred system(+command).
:- pred system(+command, -integer).

system :-
    get_environment_variable('SHELL', Shell),
    system(Shell).

system(Command) :-
    system(Command, 0).

system(Command, Status) :-
    % In effect: convert Command to a C string, call the C library's
    % system() on that string, and unify Status with the result.
    % ($system/2 stands for that implementation-level primitive.)
    $system(Command, Status).
Why doesn't the definition of system/0
supply a
default, such as /bin/sh
, if there is no
$SHELL
? Because it can be useful for security
reasons to deny this facility to Prolog, and undefining
$SHELL
is a straightforward way to do it.
Why the restriction to commands that are not empty?
So that command text can be represented in any convenient way
without ambiguity. No matter what "" is taken to represent,
it's not allowed. Neither is '', so that the rule applies
uniformly. If someone does want a POSIX command that does
nothing, ' '
, ':'
, and
'true'
are all available.
As noted in section 7.1, it would be very useful if Prolog could read the output of a program. There is a standard POSIX facility for this, popen() and pclose(). It is in fact not terribly hard to build your own version of popen() on top of raw POSIX I/O, so providing this feature need not force implementors to use C stdio.
:- type filename
    = atom             % but not empty
    + codes            % but not empty
    + chars            % but not empty
    + string           % but not empty
    + pipe(command)
    + ... .

:- type mode ---> read | write | append.
:- type open_option ---> ... .
:- type stream.

:- pred open(+filename, +mode, -stream).
:- pred open(+filename, +mode, -stream, +list(open_option)).
The restriction to non-empty file names is actually a POSIX restriction. Traditional UNIX practice has been that "" is equivalent to "." as a file name, however section 3.170 of SUSv4 is quite explicit: "Filename. A name consisting of 1 to {NAME_MAX} bytes used to name a file. The characters composing the name may be selected from the set of all character values excluding the <slash> character and the null byte. The filenames dot and dot-dot have special meaning."
Given that POSIX does restrict filenames to be non-empty, we can take advantage of that. Is [] an empty codes or an empty chars? Who cares? Both are illegal. Whatever "" stands for, it's illegal. This does mean that '[]' is not a legal file name, but that's no real hardship; './[]' is legal. Note that any implementation has to be able to report some file names as malformed, so this restriction is consistent with ISO Prolog.
The actual novelty here is pipe(Command). Opening a stream this way is similar to calling popen() except that it need not involve a stdio FILE. Closing a stream this way is similar to calling pclose().
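A hypothetical example of use. The pipe/1 term is the proposal above, and read_line/2 is a helper invented for this sketch; get_code/2 returning -1 at end of stream is standard:

first_line_of_command(Command, Line) :-
    open(pipe(Command), read, Stream),
    read_line(Stream, Line),
    close(Stream).

% Collect character codes up to the first newline or end of stream.
read_line(Stream, Line) :-
    get_code(Stream, C),
    (   C =:= -1 -> Line = []
    ;   C =:= 0'\n -> Line = []
    ;   Line = [C|Cs],
        read_line(Stream, Cs)
    ).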
There are other and better ways to specify a command.
A model we should certainly follow later
is the posix_spawn()
function, which provides
control over streams, environment variables, and a bunch of
other stuff, and above all does not require obsessive care
with quoting the arguments of the program.
The first thing to be understood about strings is that
STRINGS ARE WRONG!
It is always important, and in Prolog extremely important, to bear in mind the distinction between an interface format for data and a processing format for data.
Interface formats are designed to make it easy for human beings or other programs to produce and/or consume. For example, the 11th of December 2013 might be represented for human consumption as "11/12/13". Very convenient for people. Very bad for computation. The textual format of XML is convenient for a number of purposes, but for processing, it is pretty bad. If you like working with XML held as plain old strings, you probably sleep on a bed of nails. If you can do this correctly, you are smart enough not to.
Processing formats are designed to make it easy to write correct and efficient code that does useful calculations with the data. They are not designed to be read or written in bulk by human beings or other programs, but for internal use.
This is basically the same as the distinction between surface structure and deep structure in Chomskian linguistics or between source code and abstract syntax trees in programming.
For almost any use case that involves some kind of processing, the only sensible thing to do with string data is to turn it into some kind of tree early. Prolog is brilliant with trees.
The bad case for strings is to say "programmers using other programming languages are used to strings so they need something they're familiar with in Prolog".
That is a bad argument for two key reasons. The first is that code written using strings in other languages is very often bad code even in terms of those languages. In anything more capable than AWK, turning a string into at least a sequence of tokens often lets you write far more efficient code. We should not be encouraging people to retain bad habits.
The second is that there are better tools in Prolog for processing data that a Perl programmer might represent as a string. People coming to Prolog should learn Prolog and how to use it effectively.
Strings are an interface data structure. There's a lot of stuff we want interfaces to. For example, Prolog implementations have been built on top of Interlisp, Common Lisp, Pop-11, Java, SML, and even JavaScript. All of those languages have strings. It would help in writing Prolog code that interfaces with code in the host language if Prolog and the host language could exchange strings as strings. At a minimum, the host language should be able to give Prolog a string, Prolog should stare at it in bafflement, and give it back.
Two particularly important interfaces are data base interfaces (especially SQL but also things like CouchDB) and JSON (which some data bases use for interfacing). In JSON, strings are pervasive: every key and every piece of textual data is a string.
For data bases, the issue is not constructing the query. This is the 21st century; the evidence is in; the right way to construct SQL queries is as trees, linearising them as strings only at the last minute. See Exploits of a Mom.
The issue is the flood of data coming in. Most of this data is either going to be thrown away or flood straight back out without any processing other than equality or perhaps ordering. (If you embed program-relevant structure in a relational column, you have sinned against the commandments of the Prophet, E. F. Codd.) This was already identified as an issue in the 1980s.
It is not the case that every Prolog program manipulates algebraic formulas, so not every program needs rational numbers. (Hmm. Maybe that is a bad example.) It is not the case that every Prolog program talks to a data base or a web socket, so not every program needs strings. Let's let little Prolog implementations stay little.
But if people are going to be dealing with strings in Prolog, they should do so the same way. What we need is a library interface for strings that lets people write useful code that is portable to those Prologs that provide strings at all.
Section 6.5.7 of the ISO Prolog standard defines a lexical syntax
for back quoted strings and ends with the sentence “It would be
a valid extension of this part of ISO/IEC 13211 to define a
back quoted string
as denoting a character string constant.”
Let it be so.
There should be a new flag back_quotes
.
Possible values: string
, chars
,
codes
, atom
, none
.
Changeable: "NO".
Description: This flag exists so that a program can discover what
the implementation will do with back quoted strings (ISO section 6.4.7).
- In a system where back quoted strings denote strings, the value shall be string.
- In a system where back quoted strings denote atoms, the value shall be atom.
- In a system where back quoted strings denote lists of characters, the value shall be whichever of chars or codes it offers as the default value of the double_quotes flag.
- In a system that does not accept back quoted strings, the value shall be none.

When the value is none, it shall be a syntax error if a back quoted string occurs in input. When the value is string, strings must be written as back quoted strings.
Ulrich Neumerkel pointed out that canonical syntax is important. Because strings are not a standard ISO Prolog data type, they currently do not have a canonical syntax, but back quoted strings are clearly there to supply it. Let us take that hint. I originally suggested that the flag should be changeable, but Ulrich is clearly right that allowing changes would mean that strings would not have a canonical syntax in systems that could offer one.
The flag values copied from double_quotes
are here to
provide some way of coping with data sent from a Prolog with strings
to a Prolog lacking them. A system that supports strings should not
use them.
What are we to do about Prolog systems where strings are written using double quotes? Wince, regret the JavaScript envy, and request a mode that is more Prolog-like.
:- type string.

:- pred string(+term).
:- string(X) when true.

:- pred string_length(+string, ?integer).

:- pred string_codes(string, codes).
:- pred string_codes(string, codes, integer).
:- pred string_codes(string, codes, integer, integer).
:- pred string_codes(string, codes, integer, integer, integer).

:- pred string_chars(string, chars).
:- pred string_chars(string, chars, integer).
:- pred string_chars(string, chars, integer, integer).
:- pred string_chars(string, chars, integer, integer, integer).

:- pred substring(string, string).
:- pred substring(string, string, integer).
:- pred substring(string, string, integer, integer).
:- pred substring(string, string, integer, integer, integer).

:- pred integer_string(integer, string, integer, string).
:- pred integer_string(integer, string, integer).
:- pred integer_string(integer, string).

:- pred float_string(float, string, float_format, string, string, string).
:- pred float_string(float, string, float_format, string, string).
:- pred float_string(float, string, float_format, string).
:- pred float_string(float, string, float_format).

:- pred number_string(number, string, string).
:- pred number_string(number, string).

:- pred atom_string(atom, string).

:- pred atom_append(atom, atom, atom).
:- pred string_append(string, string, string).

:- pred atomics_to_string(list(atomic), string).
:- pred atomics_to_string(list(atomic), string, string).

:- pred string_leading_count(string, set, integer).
:- pred string_leading_count(string, set, integer, integer).
:- pred string_trailing_count(string, set, integer).
:- pred string_trailing_count(string, set, integer, integer).
This is basically a subset of the predicates offered by SWI Prolog. In general, omission should simply be read as omission from this draft. The specifics are below.
The integer_string/[2-4]
,
float_string/[3-6]
,
number_string/[2-3]
,
and atom_string/2
predicates are analogues
of the conversion predicates in section 1,
except that they use strings everywhere that a single character
or character sequence would have appeared.
The atom_append/3
and string_append/3
predicates are close analogues of append/3
except for
working with atoms and strings respectively instead of lists.
string_codes(String, Codes, Before, Length, After)
is true when String is the concatenation of A++B++C, Before is the
length of A, Length is the length of B, and After is the length of C,
and Codes is B converted to a list of character codes. With two
arguments, it is equivalent to Before=0, var(Length), After=0.
This lets us extract part of a string as a list of codes,
which could be important if you have loaded a whole file as a string.
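For example, assuming back quoted strings denote strings as proposed above:

?- string_codes(`abcdef`, Codes, 2, 3, After).
%  Codes = [0'c, 0'd, 0'e], After = 1:
%  the string is "ab" ++ "cde" ++ "f", with Before = 2 and Length = 3.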
string_chars(String, Chars, Before, Length, After)
is the same except that Chars is B converted to a list of single-character
atoms.
substring(String, Part, Before, Length, After)
is basically the same except that Part is B as a string and omitting
Before and/or After is equivalent to leaving them unbound. This is
copied from Xerox Quintus Prolog; SWI Prolog has essentially the same
predicate but with the arguments in a weird order incompatible with
the usual rule for putting optional arguments last.
string_leading_count(String, Set[, Length0], Length)
examines a string. Let String = A++B++C such that Length0 (taken as
0 if omitted) is the length of A, every character of B is included in
the Set, and C does not begin with a character included in the Set.
Then Length is the length of A++B.
string_trailing_count(String, Set[, Length0], Length)
examines a string. Let String = A++B++C such that Length0 (taken as
0 if omitted) is the length of C, every character of B is included in
the Set, and A does not end with a character included in the Set.
Then Length is the length of B++C.
For example, suppose we want to remove both leading and trailing spaces and tabs from a string.
trim(String, Trimmed) :-
    Set = " \t",
    string_leading_count(String, Set, 0, Leading),
    (   string_length(String, Leading) ->
        Trailing = 0
    ;   string_trailing_count(String, Set, 0, Trailing)
    ),
    substring(String, Trimmed, Leading, _, Trailing).
atomics_to_string/[2,3]
comes straight from
SWI Prolog. It's basically what Xerox Quintus Prolog called
concat_string/[2,3]
. Avoiding anything that suggests
backtrackable concatenation may be an excellent thing here.
SWI Prolog's string_code/3 can then be defined as

string_code(Index, String, Code) :-
    (   var(Index) ->
        string_codes(String, [Code], Index0, 1),
        succ(Index0, Index)
    ;   succ(Index0, Index),
        string_codes(String, [Code], Index0, 1)
    ).
with the difference that a programmer won't expect string_codes/4 to be O(1).
Now consider read_string/5, discussed below. Suppose you want to read a line such as
Rêveॶd'aucuneॶautreॶmanière.ॶ
where ॶ is a blank. You can see this on your screen. You can count. You see 29 characters, including the blanks. So you ask for 29 characters. And what you get is missing the last two. Oh joy, the text wasn't normalised, so the accents counted as an extra character each. I haven't found it useful to read a fixed number of characters since I stopped using punched cards, and Unicode has made me dread ever doing so again. When I read "fields" from data, generally I need to specify a set of characters to skip, a set of characters to take, and an upper bound on the number of base characters.
read_string/5
provides no way to do this;
if I want characters trimmed from one end, they will be
trimmed from the other as well. The split_string/4
predicate has this defect as well.
Let me state the case against read_string/[3,5]
in
a very practical way. You have data in CSV form. You want to read
the next record as a list of strings. How does these predicates help?
What else is there to read a string?
There clearly needs to be some way to read strings, and
somebody clearly thought read_string/[3,5]
would be useful to somebody, so I'm not saying that these
operations should not be readily expressible. I'm saying that they
leave me reading any data I've ever had to deal with (since
I stopped using punched cards) one character at a time. Something
better is needed.
Lisp systems had "read tables" a long time ago, and I recall suggesting in 1984 that Prolog should have something similar. (Not terribly logical, but then, neither is I/O by side effect terribly logical.) Something not entirely unlike Java's StreamTokenizer, for example, but less clunky.
One very simple approach would be predicates like these:
skip_input([Input_Stream, ]Character_Set) :-
    % Input_Stream must be open for character input.
    % Character_Set must be a section 1.5 character set.
    % While the stream has a next character included in the set, discard it
    % and any subsequent floating diacriticals.

read_string([Input_Stream, ]Character_Set, String[, Bound, Bases])
    % Input_Stream must be open for character input.
    % Character_Set must be a section 1.5 character set.
    % While the stream has a next character included in the set, and either
    % there is no Bound or that many base characters have not been read,
    % read that character and any subsequent floating diacritical and
    % incorporate them into a string.
    % At the end, unify String with the string and Bases (if present)
    % with the number of base characters.
Note that the Bases value returned is not the length of
the string as defined by string_length/2
but the number
of base characters, which is invariant under normalisation, and the
Bound is a limit on this number. It fixes the dreaming-in-no-other-manner
problem. It is not quite powerful enough to emulate SWI's
read_string/5
, so there is still design work to be done.
This is a draft. It has been revised in the light of the Prolog Prologue and Corrigendum 1 but not yet Corrigendum 2.
More operations will be added, and community discussion — which is the point of offering this document — may result in adding, changing, or removing some of the operations listed here.
There is nothing here which was not obvious at least two years before the Prolog standard came out. In particular, at the Melbourne conference in 1985, I warned explicitly that a "minimal" standard was not at all a good idea, and by 1986 most of the predicates listed above could have been designed, had a thorough overhaul of the library been desired instead of minor revisions.
Over the years, a number of public-spirited people have donated libraries of Prolog code. See Jocelyn Ireson-Paine's list of several. These need to be reviewed to see what should be in here, or, if not, what support they need here or in ISO Prolog for modernised versions.