Good Ideas:

1. The lexical syntax of programs (i.e. POD and literal strings) should
   be declared separately from the runtime treatment of strings.

       # This is like putting binmode() on the source file itself
       # at parse time and should never propagate into runtime
       =!lang ja            (this is new)
       =!encoding Big5
       say "String"         # Japanese but encoded in Big5

   "locale" makes a fine runtime :lang/:encoding choice (i.e. read from
   LC_* or CP settings) but makes no sense at parse time.

   Just like how XML always knows about its xml:encoding and xml:lang,
   Perl string literals and documentation should always know their
   encoding/lang information for correct presentation by e.g. pod2html
   (which would depend on lang to render CJK fonts correctly).

   The "lang" should never be inferred from encoding -- it makes no
   sense because lang usage shifts with time: people are writing
   Trad. Chinese in GBK all the time now.

2. BOM sniffing of .pl files. Currently the set it knows is
       (UTF16[LB]E, UTF8+BOM, ASCII (really latin* as default))
   It should be:
       (UTF32[LB]E, UTF16[LB]E, UTF8 (default))

3. Per-handle stackable IO layers make sense. But they should allow
   introspection into different layer-chunks:

       # storage layers (:mmap)
       # textual transformation layers (:encoding, :crlf)
       # format (MIME Header, HTTP Header, XML)
       # semantic (:language(ja))
       $fh.layers.pop;

4. Per-string "SvUTF8" tag to denote strings from buffers.
   But buffers are currently too limited - they cannot be used as
   C buffers ($buf[3..10] should transparently work).

5. String offsets allow COWing:

       use Devel::Peek;
       my $string = "This is a large string";
       my $substr = substr($string, 0, 5);
       Dump($substr);   # reuse the PV inside the string with SvCUR
       $string = "Hello, Kitty";
       # $substr is COW'ed at this point
       # and still just "This "

   except it works even better now, because the fragment it refers to
   doesn't need to be invalidated if other fragments change.

6. Use UCM tables that allow bidirectional fallbacks via |1 |3 to
   maintain PUA fallback for round-trippability. By default, Perl 6
   should always use the same fallback semantics as Perl5's Encode.pm,
   instead of dying horribly like iconv.

=====================================================================
Bad Ideas

1. Autopromotion of buffers to strings via ${^ENCODING}

       $str = $buf.read(:encoding);   # Not Encode::decode()
       $str = $buf;                   # should simply die
       $buf = $str.write(...);

2. Fixed contiguous memory chunk for Strings is a bad idea
   (it works well for buffers)

       ### encoding != charset
       ###   -> semantics of sort() etc
       ###   -> P6 should just have one charset = UCS
       ###   -> to sort via stroke count etc, use Unicode Collation
       ###   -> to sort as bytes, well just use a buffer!

       # If this is just "cat"ing two files (one shiftjis, one euc-jp)
       # into one file
       #   $fhA is :encoding(shiftjis)
       #   $fhB is :encoding(euc-jp)
       $huge_string_a = slurp($fhA);  # *MUST* be unicode string here
       $huge_string_b = slurp($fhB);  # *MUST* be unicode string here
       my $string = "$huge_string_a$huge_string_b";  # huge malloc
       print $string;   # but if the output encoding is shiftjis
                        # then it just transcodes the euc-jp part

   $string should just have buffer fragments and how to "view" these
   buffers, namely $string should be a sparse array

       - Pointer --> $huge_string_a
           - at offset ...
       - Pointer --> $huge_string_b
           - at offset ...

   The IO actions like .slurp and .read should do validation (because
   we need the offset info anyway), by "preparing" the string to know
   how many "highest desirable units" are in there (i.e. it defaults to
   counting up to chars but not graphemes). This can be done very
   quickly for most DBCS with a one-pass scan (and for latin* with an
   O(1) scan). Then the buffer is tagged with the length info and
   original encoding -- if it's used as a unicode string for
   processing, decoding is done on the fly for the "buffer fragment"
   inside the string, i.e. substr($string, 0, 1) as chars should not
   decode $huge_string_b.
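
The fragment idea above can be modelled outside Perl. What follows is a minimal Python sketch, not the Perl 6 design itself; Python is used only because the syntax in these notes is speculative and does not run. `Fragment`, `FragmentedStr`, `nchars` and `decode_calls` are hypothetical names, and the character count is taken from a throwaway decode as a stand-in for the one-pass validation scan.

```python
class Fragment:
    """One buffer fragment: raw bytes tagged with encoding and char count."""

    def __init__(self, raw, encoding):
        self.raw = raw
        self.encoding = encoding
        # Stand-in for the one-pass validation scan done at read time.
        # A real implementation would count characters without keeping
        # (or even building) the decoded text.
        self.nchars = len(raw.decode(encoding))
        self.decode_calls = 0  # instrumentation for this sketch only

    def chars(self, start, stop):
        # Decode this fragment only. A real implementation would seek or
        # cache instead of decoding the whole fragment each time.
        self.decode_calls += 1
        return self.raw.decode(self.encoding)[start:stop]


class FragmentedStr:
    """A string that is a list of fragments, never one contiguous buffer."""

    def __init__(self, *fragments):
        self.fragments = list(fragments)

    def __len__(self):
        # Uses only the cached counts; nothing is decoded here.
        return sum(f.nchars for f in self.fragments)

    def substr(self, offset, length):
        end = offset + length
        out = []
        pos = 0  # character offset of the current fragment
        for frag in self.fragments:
            frag_end = pos + frag.nchars
            if frag_end > offset and pos < end:
                out.append(frag.chars(max(offset - pos, 0),
                                      min(end, frag_end) - pos))
            pos = frag_end
            if pos >= end:
                break  # later fragments are never touched
        return "".join(out)


a = Fragment("日本語".encode("shift_jis"), "shift_jis")
b = Fragment("テキスト".encode("euc_jp"), "euc_jp")
s = FragmentedStr(a, b)
first = s.substr(0, 1)
```

Taking the first character touches only the shiftjis fragment; the euc-jp fragment keeps its `decode_calls` at zero until a request actually reaches into it.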
   But you should _never_ be able to treat the string as one single
   buffer, because there may be multiple ones underneath -- the text
   semantics are encapsulated as one "Unicode text" string even though
   the underlying storage may be transcoded/cached into fixed
   ASCII/UTF16/UTF32 representations.

       utf8::downgrade()   # evil! should never allow this!
                           # esp. the utfX is not utf8 anyway
                           # so user should never see utfX
                           # (i.e. the internal representation)

       # Consequently:
       piconv -f eucjp -t eucjp
       # this should be very fast with one-pass validation

3. In Perl5, transcoding always goes through UCS (in-memory with
   UTF/X), which is unnecessary for e.g. eucjp->shiftjis. So
   transcoders should be able to convert directly, provided that they
   will _always_ produce the same result as if the text had
   round-tripped through UCS, for all valid inputs. This should be
   exposed to the user of from_to(), and also used during lazy
   transcoding in outputs.

4.
       # 0 and 1 below are "codepoint" in perl5
       # or "byte" if $string is Buf
       substr($string, 0, 1)   # take the first codepoint

       # All the string manipulation takes unit adverbs, which
       # default to the lexical scope's settings
       # The "$string" no longer has a say on how it should be
       # viewed -- the action takes a view that suits its purpose
       $string.substr(0, 1, :bytes)
       $string.index(5, :bytes)

   The 0 and 1 should be of the "position" type; if you write them out
   as literals, they respond to the lexical setting of the char unit:

       .bytes        # pretend strings are buffers
       .codepoints   # same as perl5 - not terribly useful
                     # - basically unsigned integers with 21 bits
       .characters   # this should be the default:
                     # - COMBINING MARKS
                     # - BOM (and other zero-width assertions)
       .graphemes    # visual rendering - includes metadata like
                     # - LANGUAGE TAG blocks
                     # - VARIATION SELECTOR
                     # - LTR/RTL SELECTOR
                     # - Act as pre-decomposed forms
                     #   (for canonical decomposition)

   The lexical pragma determines what 0 and 1 mean, but you can also
   construct them explicitly with "pos(0, :byte)" or "character(1)"
   (XXX the syntax needs work)

5. Treating Str as SvLV makes no sense at all. Currently P5 has one
   single form of treating Str as SvLV, namely lvalue substr:

       $string = "Hello, World!";
       my $lv = \substr($string, 1, 4);  # a shallow LV reference
                                         # to the Scalar
       $string = "Hi, Kitty!";
       $$lv = "oho";
       print $string;  # "Hohoitty!";

   This should totally die; then Str becomes immutable and can assume
   value semantics, i.e. shared/COWed across threads, etc.

   It may make sense to make buffers mutable (but not resizable), but
   strings should always be "constant" and it's the scalar container
   that changes -- exactly like integers. This also enables shared
   string tables a la Ruby, so "str".WHICH and "str".WHICH would
   always compare the same.

   In other words, you can ask a string to write itself into a buffer
   with some encoding/lang/format/blah combination, and if it happens
   to agree with how it's internally constructed, it could be an O(0)
   operation (and a transcoding otherwise), but the user may _never_
   ask a string "what encoding are you using" or any other question
   pertaining to the internal Buf layouts.

6. UTFX (UTF8+offset cache) should die as an internal representation
   because it allows no sane interaction with the C world, so I think
   internally a fixed-width Unicode representation should be
   preferred:

       - UTF/0bit    # uninitialized null buffer with a length
                     # - alloc(NULL, 100000000000);
                     # - not malloc()ed until populated
       - UTF/7bit    # ASCII
       - UTF/8bit    # LATIN1 (just internal, never exposed)
       - UTF/16bit   # UCS2
       - UTF/32bit   # UCS4

   to allow O(1) random access to codepoints, and to allow
   chars/graphemes to refer to codepoint units instead of raw UTF-X
   length (which had to be invalidated after each destructive
   operation).

   If you insert a UTF/8bit into the middle of a UTF/16bit (as in
   4-arg substr or s///), then the 8bit should be promoted to 16bit
   (very fast too) without invalidating any length caches.
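
The promotion step can be illustrated with a small model. This is a Python sketch under stated assumptions, not the proposed internals: `FixedWidthStr`, `width_for`, `widened` and `splice` are hypothetical names, the 7-bit and 8-bit cases are merged into one 1-byte width, the 0-bit case is left out, and units are stored little-endian purely for convenience.

```python
def width_for(text):
    """Smallest fixed unit width (in bytes) that holds every codepoint."""
    top = max(map(ord, text), default=0)
    return 1 if top < 0x100 else 2 if top < 0x10000 else 4


class FixedWidthStr:
    """Immutable string stored as fixed-width codepoint units."""

    def __init__(self, text=""):
        self.width = width_for(text)
        self.length = len(text)  # length cache, in codepoints
        self.buf = b"".join(ord(c).to_bytes(self.width, "little")
                            for c in text)

    def widened(self, width):
        """Return the buffer re-packed at `width` bytes per codepoint."""
        if width == self.width:
            return self.buf
        w = self.width
        pad = bytes(width - w)  # zero bytes: the "null-interleaving"
        return b"".join(self.buf[i:i + w] + pad
                        for i in range(0, len(self.buf), w))

    def codepoint_at(self, index):
        # O(1): one slice at a computed byte offset.
        w = self.width
        return chr(int.from_bytes(self.buf[index * w:(index + 1) * w],
                                  "little"))

    def splice(self, offset, length, other):
        """Model of 4-arg substr: returns a new string, promoting widths."""
        w = max(self.width, other.width)
        mine = self.widened(w)
        result = FixedWidthStr()
        result.width = w
        result.buf = (mine[:offset * w] + other.widened(w)
                      + mine[(offset + length) * w:])
        # The length cache is updated arithmetically, never rescanned.
        result.length = self.length - length + other.length
        return result


wide = FixedWidthStr("日本語")   # needs 2-byte units
narrow = FixedWidthStr("abc")    # fits in 1-byte units
spliced = wide.splice(1, 1, narrow)
```

Because every unit has the same width, the narrow string is widened by padding each unit with zero bytes, and the cached length of the result follows from simple arithmetic on the two cached lengths.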
   Fixed-width storage is safe now because utf8::downgrade is no
   longer exposed to the user, so C land may have macros like CHARS,
   which returns a (char*), or W_CHARS, which returns the 32-bit
   w_char, or whatever the native C library wants (e.g. ICU wants
   UTF16). These macros are all very fast (at most a
   null-interleaving) and can safely capture errors for invalid
   downgrading (e.g. viewing CJK as ASCII).

7. Then the default CHECK should _not_ be "substitution character",
   but rather a "soft failure" that is a "die" under fatal and can be
   handled with a definedness test -- i.e. the example below produces
   something "undef" instead of "???".

       # this is a "soft failure"
       my $x = encode("CJK", :encoding(ascii));
       print $x;   # without handling it, it's promoted to a die

       # but you can handle it with
       $x err die "...";
       if $x { ... }
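
The "soft failure" behaviour in item 7 can be modelled as a value that stays quiet while it is only tested, and raises once it is used. This is a hedged Python sketch of that idea, not the Encode.pm API or the eventual Perl 6 mechanism: `SoftFailure`, `soft_encode`, `defined` and the `fatal` flag are hypothetical names introduced for illustration.

```python
class SoftFailure:
    """Undefined-like result that raises only when used unhandled."""

    def __init__(self, error):
        self.error = error

    def __bool__(self):
        return False  # lets "if x:" act as a cheap check

    def __str__(self):
        raise self.error  # e.g. print(x) promotes the failure to a die

    def __bytes__(self):
        raise self.error


def defined(value):
    """Definedness test: distinguishes a failure from empty output."""
    return not isinstance(value, SoftFailure)


def soft_encode(text, encoding, fatal=False):
    """Encode strictly; on error, die under `fatal`, else fail softly."""
    try:
        return text.encode(encoding)  # strict: no "???" substitution
    except UnicodeEncodeError as err:
        if fatal:
            raise
        return SoftFailure(err)


x = soft_encode("日本語", "ascii")      # soft failure, nothing raised yet
ok = soft_encode("plain text", "ascii")
```

Testing `defined(x)` handles the failure explicitly, while `print(x)` on an unhandled failure raises the original `UnicodeEncodeError`. The separate `defined` helper exists because an empty byte string is also falsy in Python, so truthiness alone would not match a definedness test.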