Edit: Factor/GSoC/2010/Improve Unicode library

= Mentor =

[[Daniel Ehrenberg]]

= Skills required =

- Knowledge of at least one non-English language would be nice
- Non-ASCII, international character sets in general
- Reading specifications

= Skill level =

Advanced

= Technical outline =

Factor's Unicode library lives in the %basis/unicode/% directory, with encoding support in %core/io/encodings/% and %basis/io/encodings/%. The library is fairly complete, but a few tasks of varying complexity remain to be implemented. We would expect a student to at least attempt all of them over the summer, but depending on the student's skills, completing only a portion of the tasks would be acceptable.

=== Normalized output streams ===

We want a %normalized-stream% type that wraps an underlying character stream and converts output to normalization form C or normalization form D. Support for normalization is already in %unicode.normalization%, but the stream itself still needs to be written.

=== Unicode number input ===

Unicode defines various code points for digits other than the usual ASCII 0..9. For example, many Indic scripts define their own exact equivalents of 0-9, and these are still used in certain contexts. Parsing numbers written with these code points would be useful. Perhaps this could even be integrated with the %roman% library for a high-level number parser.

=== Encodings ===

The encoding API for converting strings to byte arrays and vice versa is mostly done ([[http://docs.factorcode.org/content/article-io.encodings.html]]). Two tasks remain:

- The ISO 2022 encoding, used for Japanese, is missing.
- Implement heuristics to auto-detect encodings. This is always unreliable but useful for client applications, such as an e-mail client, where you can auto-detect the encoding and give the user an option to change it manually.

=== Line and sentence breaks ===

The Unicode standard specifies complex rules for detecting the boundaries between sentences, and the possible and mandatory line breaks.
These rules are also modified for different locales, and some of the modifications are quite complex. For example, detecting possible line breaks in Thai requires a dictionary, since spaces are not normally used between words, yet a line break must not fall inside a word. Detecting sentence boundaries is useful for navigation in a text editor. Detecting line breaks is useful for rendering text.

=== BIDI ===

To support right-to-left scripts like Hebrew and Arabic in Factor's editor widget, Factor needs to use the Unicode Bi-Directional Text Algorithm. This algorithm specifies how left-to-right and right-to-left scripts are mixed.

=== Tailoring and CLDR ===

Many Unicode algorithms, such as collation and word break detection, should behave differently in different locales. For example, Swedes sort ö after z, whereas Germans sort it before p. The Common Locale Data Repository (CLDR) has information about how these algorithms should be tailored to fit different locales. The CLDR also has information about, for example, standard date formats in different locales, and certain pieces of text which are commonly localized in applications. It would be a huge benefit to Factor applications to have access to this information.

=== Performance ===

New and old parts of the Unicode library will need to be optimized for performance. In particular, the choices of data structures will have to be reexamined, and a compressed trie implementation may be useful. Generally, data structures used for Unicode should take advantage of Factor's new capabilities for packed memory structures, which were not available when the original library was written.

=== Internationalization framework ===

The UI and web frameworks could have a system for internationalization and localization where text strings are held in a resource file of some format. These resource files should be selected based on user request and seamlessly substituted in wherever localized text is needed.
The hard part will be designing the right API for using the localized strings and a good format for the resource files.

= Value to the student =

The student gains experience with internationalization and localization.

= Value to the community =

Factor would be better suited to writing applications which deal with non-ASCII text. All of the world's languages, including English, use non-ASCII text. Factor's Unicode support, which is already more advanced than that of most languages, would be world-class after these changes.
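
As a concrete illustration of the Unicode number input task in the technical outline, here is a minimal sketch of parsing an integer written with any script's decimal digits. It is written in Python rather than Factor, purely to show the intended behaviour. The function name %parse_unicode_int% is hypothetical, and the rule that all digits must come from the same script is an assumption made for this sketch, not a requirement stated in the proposal.

```python
import unicodedata


def parse_unicode_int(text: str) -> int:
    """Parse a non-negative integer written with Unicode decimal digits.

    Assumption for this sketch: digits from different scripts may not be
    mixed within one number.
    """
    if not text:
        raise ValueError("empty string")

    value = 0
    zero_code_point = None  # code point of the digit zero in the script seen so far

    for ch in text:
        # unicodedata.decimal raises ValueError if ch is not a decimal digit.
        digit = unicodedata.decimal(ch)

        # Each script's decimal digits occupy ten consecutive code points,
        # so subtracting the digit value gives that script's zero.
        script_zero = ord(ch) - digit
        if zero_code_point is None:
            zero_code_point = script_zero
        elif script_zero != zero_code_point:
            raise ValueError("digits from different scripts are mixed")

        value = value * 10 + digit

    return value
```

With this sketch, the Devanagari string %४२% and the ASCII string %42% both parse to the same integer, while a string that mixes ASCII and Arabic-Indic digits is rejected. A Factor implementation would presumably draw the digit values from the library's existing Unicode character data instead.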
