Feature #1490

Should there be a datatype for regular expressions?

Added by Lars Marius Garshol over 2 years ago. Updated over 2 years ago.

Status: Rejected
Start date: 2009-10-28
Priority: Normal
Due date:
Assignee: Lars Marius Garshol
% Done: 0%
Category: -
Target version: -

Description

Michael Quaas asks: "Shouldn't the regular-expression-constraint have an
occurrence of type regexp?"

Discussion thread: http://www.isotopicmaps.org/pipermail/sc34wg3/2009-October/004279.html

History

Updated by Lars Marius Garshol over 2 years ago

Hannes Niederhausen wrote:

Looking at http://isotopicmaps.org/tmcl/tmcl.html#sect-regexp-constraint it seems to have it already.

That's an occurrence type, not a datatype. :-)

Updated by X B over 2 years ago

There are multiple instances of what people usually call a "regular expression language". We should allow for supporting multiple instances of them (maybe not initially, but we should not close the path for the future).

For example, the current TMCL draft refers to http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#regexs as the definition of the regular expression language to be used. That very same draft also says that another regular expression language (http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap09.html) could be used. There are many other regular expression languages, such as Perl-Compatible Regular Expressions, Ruby regular expressions, Java regular expressions, and "glob" (e.g. UNIX shell filename expressions).

Some regular expression languages support grouping (e.g. the "(bar|baz)" in "fo(o(bar|baz)b)*la"), some don't. Some support Unicode better, some worse. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/#regexs itself says:

The ·regular expression· language defined here does not attempt to provide a general solution to "regular expressions" over UCS character sequences. In particular, it does not easily provide for matching sequences of base characters and combining marks. The language is targeted at support of "Level 1" features as defined in [Unicode Regular Expression Guidelines]. It is hoped that future versions of this specification will provide support for "Level 2" features.

This means we cannot be sure which regular expression language is the "best" one to be chosen. Moreover, because regular expression languages are still evolving, importing one merely by referencing its spec may leave it unclear which version we are importing, and whether we want to follow future developments of the chosen language.

Additionally, implementing a regular expression engine is not a trivial task. This gives vendors an incentive to simply use their "native" regular expression language in their implementation. Conversely, a feature-poor regular expression language (like XML Schema regular expressions) may be dismissed as too limited, which adds a further incentive to use a feature-rich language that is already available pre-built, free, and well tested. This means that it is likely that each vendor actually does support regular expressions, but in different forms and variants, creating an incompatible plethora of regular expression language usages.

We may not be able to prevent this entirely, but we can document it and alleviate the pain by encouraging (or enforcing) a declaration of exactly which regular expression language is used in each particular regular expression. The datatype is the right tool for denoting this. Just as

"3.0"^^xsd:string
has a different meaning than
"3.0"^^xsd:decimal
so
"fo(o(bar|baz)b)*la"^^tmcl:xmlschema-re
has a different meaning than
"fo(o(bar|baz)b)*la"^^tmcl:pcre
or
"fo(o(bar|baz)b)*la"^^tmcl:java-re

This still does not guarantee compatibility, given the forces that push toward incompatibility, but at least the tools processing a TMCL instance will know when they should fail. Otherwise (that is, if all regular expressions simply have datatype "xsd:string"), problems will arise such as "this TMCL schema works with vendor Y's validator but not with vendor Z's, and we do not know why". Or worse, TMCL schemas will be incompatible and silently fail to validate correctly, merely because of small differences between the regular expression engines of different vendors (none of which is restricted to the regular expression language defined by the TMCL spec; all of them offer "extra features").
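
A minimal sketch of the "know when to fail" behaviour, in Python. The datatype names `ex:python-re` and `ex:glob`, the `TypedRegex` class, and the `matches` function are hypothetical and chosen only for illustration; they are not part of TMCL or of any draft. The only point is that an engine can dispatch on the declared datatype and raise an error for a language it does not implement, rather than guessing.

```python
import fnmatch
import re
from dataclasses import dataclass


class UnsupportedRegexDatatype(Exception):
    """Raised when a regular expression declares a language this engine lacks."""


@dataclass(frozen=True)
class TypedRegex:
    pattern: str
    datatype: str  # hypothetical identifier, e.g. "ex:python-re"


# Matchers this (hypothetical) engine implements, keyed by declared datatype.
MATCHERS = {
    # Python's own `re` syntax; the whole value must match.
    "ex:python-re": lambda pattern, value: re.fullmatch(pattern, value) is not None,
    # Shell-style glob, case-sensitive.
    "ex:glob": lambda pattern, value: fnmatch.fnmatchcase(value, pattern),
}


def matches(regex: TypedRegex, value: str) -> bool:
    """Match value against regex, failing loudly on an unknown datatype."""
    try:
        matcher = MATCHERS[regex.datatype]
    except KeyError:
        raise UnsupportedRegexDatatype(regex.datatype) from None
    return matcher(regex.pattern, value)
```

With this shape, a schema that declares a language the engine does not know (say "tmcl:java-re") is rejected up front instead of being run through whatever engine happens to be available.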

Because the syntaxes of regular expression languages overlap heavily (that is, nearly every expression that is valid in one language is also valid in another, albeit with a different meaning), we cannot expect TMCL engines to detect the "wrong regular expression language used" problem through parser errors. Instead, regular expressions interpreted in the wrong language will silently "work", with a silently changed, unintended meaning. Such situations are to be prevented.
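
To illustrate the silent change of meaning, here is a small Python sketch that interprets one and the same pattern string in two languages: Python's `re` syntax and shell-style glob (via the standard `fnmatch` module). Neither of these is the language named in the TMCL draft; they are stand-ins that happen to ship with Python, and the helper names `as_regex` and `as_glob` are made up for this example.

```python
import fnmatch
import re

PATTERN = "a*b"  # syntactically valid in both languages


def as_regex(value: str) -> bool:
    """Interpret PATTERN as a Python regular expression: zero or more 'a', then 'b'."""
    return re.fullmatch(PATTERN, value) is not None


def as_glob(value: str) -> bool:
    """Interpret PATTERN as a glob: 'a', then any characters, then 'b'."""
    return fnmatch.fnmatchcase(value, PATTERN)


if __name__ == "__main__":
    for value in ("aaab", "b", "aXYZb"):
        print(value, as_regex(value), as_glob(value))
```

Both interpretations accept "aaab", but only the regular expression reading accepts "b" and only the glob reading accepts "aXYZb". Neither parser reports an error, so nothing signals that the pattern was read in the wrong language.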

Updated by Lars Marius Garshol over 2 years ago

X B wrote:

There are multiple instances of what people usually call a "regular expression language". We should allow for supporting multiple instances of them [...]

I couldn't disagree more! It's bad enough that we force TMCL implementors to implement one regular expression language. If we did this we'd effectively be forcing them to implement more than one language. In the interests of interoperability (ensuring that all TMCL schemas are interpreted the same way everywhere) and simplicity of implementation we should pick just one language and stick with it.

Some regular expression languages support grouping

True, but we don't need grouping in TMCL.

This means we cannot be sure which regular expression language is the "best" one to be chosen.

True. However, if you want a language which has a proper specification (which is an absolute requirement for us) there are not many to choose from. Since we are using XSD for our datatypes anyway it seems reasonable to adopt the regexp language used by XSD, particularly since this opens the way for using the XSD model for extensible datatype definitions. The XSD regexp language also has the benefits of being a true regexp language, not including more features than necessary, and having excellent Unicode support.

So overall it seems a good choice.

This means that it is likely that each vendor actually does support regular expressions, but in different forms and variants, creating an incompatible plethora of regular expression language usages.

This is definitely a danger, and this is why we want to pick a single well-defined regexp language.

Updated by Lars Marius Garshol over 2 years ago

  • Status changed from New to Resolved
  • Assignee set to Lars Marius Garshol

The Leipzig meeting resolved that a datatype for XSD regexps not be added.

Updated by Lars Marius Garshol over 2 years ago

  • Status changed from Resolved to Rejected

Wrong state.
