I recently had to convert a database of a large Greek website from single-byte Greek to Unicode (UTF-8).
One of the problems I faced was the stored PHP serialized data: since PHP stores the length of each string (in bytes) inside the serialized data, and Greek letters take two bytes each in UTF-8, the stored lengths no longer matched and the strings could not be unserialized after the conversion.
I didn't want anyone to go through the frustration I went through while searching for a solution, so here is a little function I wrote to recount the string lengths, since I couldn't find anything on this:
function recount_serialized_bytes($text) {
mb_internal_encoding("UTF-8");
mb_regex_encoding("UTF-8");
My initial approach was to do it with regular expressions, but the PHP serialized data format is not a regular language and cannot be properly parsed with regular expressions. All approaches fail on edge cases, and I had lots of edge cases in my data (I even had nested serialized strings!).
Note that this will only work when converting from single-byte encoded data, since it assumes the stored lengths are the string lengths in characters. Admittedly, it's not my best code; it could be optimized in many ways. It was something I had to write quickly and was only going to be used by me in a one-time conversion process. However, it works smoothly and has been tested with lots of different serialized data. I know that not many people will find it useful, but it's going to be a lifesaver for the few who need it.
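As a rough illustration of the recounting idea (a hypothetical sketch, not the original function), the code below walks the text once, reads each `s:N:"` header, takes N characters as the value, and rewrites N as the byte length. It assumes the input is valid UTF-8 and that every stored length is a character count. The name `recount_serialized_lengths` is made up for this example, and nested serialized strings are deliberately not handled.

```php
<?php
// Hypothetical sketch, not the author's original function.
// Assumes $text is valid UTF-8 and every stored length is a character count.
function recount_serialized_lengths(string $text): string
{
    $out = '';
    $pos = 0;                 // byte offset into $text
    $end = strlen($text);

    while ($pos < $end) {
        // \G anchors the match at $pos, so only a header starting right here counts.
        if (preg_match('/\Gs:(\d+):"/', $text, $m, 0, $pos)) {
            $start = $pos + strlen($m[0]);
            // Take the declared number of characters (not bytes) as the value.
            $value = mb_substr(substr($text, $start), 0, (int) $m[1], 'UTF-8');
            $after = $start + strlen($value);

            // Only rewrite the header if the value is closed the way serialize() closes it.
            if (substr($text, $after, 2) === '";') {
                $out .= 's:' . strlen($value) . ':"' . $value . '";';
                $pos = $after + 2;
                continue;
            }
        }
        // Anything else is copied through one byte at a time.
        $out .= $text[$pos];
        $pos++;
    }
    return $out;
}
```

For instance, `s:3:"αβγ";` (three characters, six bytes in UTF-8) should come back as `s:6:"αβγ";`. Because the value is skipped by its declared length instead of being searched for a closing quote, a string that happens to contain something like `";s:1:"` does not confuse the scan, which is where a plain regular expression tends to break.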
While exploring browser-supported Unicode characters, I noticed that apart from the usual @ and . (dot), there was another character that resembled an @ sign (0xFF20 or ＠) and various characters that resembled a period (I think 0x2024 or ․ is the closest, but feel free to argue).
I'm wondering whether one could use this as another way of hiding email addresses. It's almost as easy as the foo [at] bar [dot] com technique, with the advantage of being far less common (I've never seen it before, so there's a high chance that spambot developers haven't either), and I think that the end result is more easily understood by newbies. To encode foo@bar.com this way, we'd use (in an HTML page):
foo&#xFF20;bar&#x2024;com
and the result is: foo＠bar․com
I used that technique on the ligatweet page. Of course, if many people start using it, I guess spambot developers will notice, so it won't be a good idea any more. However, for some reason I don't think it will ever become that mainstream :P
By the way, if you're interested in other ways of email hiding, here's an extensive article on the subject that I came across after a quick Google search (to see if somebody else came up with this first; I didn't find anything).
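The substitution is simple enough to automate. Here is a minimal sketch, assuming the address is being written into an HTML page; the helper name `obfuscate_email` and the use of hexadecimal numeric entities are choices made for this example.

```php
<?php
// Hypothetical helper; the name and the hex-entity form are assumptions.
function obfuscate_email(string $email): string
{
    // Escape HTML-special characters first, then swap in the lookalikes.
    return strtr(htmlspecialchars($email), [
        '@' => '&#xFF20;', // U+FF20 FULLWIDTH COMMERCIAL AT
        '.' => '&#x2024;', // U+2024 ONE DOT LEADER
    ]);
}
```

Calling `obfuscate_email('foo@bar.com')` yields `foo&#xFF20;bar&#x2024;com`, which a browser renders with the lookalike characters. Note that a visitor who copies the rendered text gets the lookalikes too, so they would still have to retype the @ and the dot by hand.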
I recently wanted to post something on Twitter that was just slightly over the 140-character limit, and I didn't want to shorten it by cutting off characters (it was some lyrics from Pink Floyd's "Hey You" that expressed a particular thought I had at the moment; it would be barbaric to alter Roger Waters' lyrics in any way, wouldn't it? ;-)). I always knew there were some ligatures and digraphs in the Unicode table, so I thought that these might be used to shorten tweets; not only that particular one, of course, but any tweet. So I wrote a small script (warning: very rough around the edges) to explore the Unicode characters that browsers supported, find the replacement pairs and build the tweet-shortening script (I even thought of a name for it: ligatweet. LOL, I was never good at naming).
My observations were:
Different browsers support different Unicode characters. I think Firefox has the best support (more characters) and Chrome the worst. By the way, it's a shame that Chrome doesn't support the Braille characters.
The appearance of the same characters, using the same font, differs hugely across browsers. A large number of glyphs are completely different. This is very apparent on dingbats (around 0x2600-0x2800).
For some reason unknown to me, hinting suffers a great deal in the least popular characters (common examples are the unit ligatures). Lots of them looked terribly illegible and pixelated in small sizes (and only in small sizes!!). Typophiles, feel free to correct me if I'm mistaken, but judging by my brief experience with font design, I don't think bad hinting (or no hinting at all) can do that sort of thing to a glyph. These characters appeared without any anti-aliasing at all! Perhaps it has to do with ClearType or Windows (?). If anyone has any information about the cause of this issue, I would be greatly interested.
It's amazing what there is in the Unicode table! There are many dingbats and various symbols in it, and a lot of them work cross-browser! No need to be constrained by the small subset that HTML entities can produce!
I might as well write a bookmarklet in the future. However, I was a bit disappointed to find out that, even though I got a bit carried away when picking the replacement pairs, the gains are only around 6-12% for most tweets. (That's with case-sensitive replacement; case-insensitive replacement of course results in higher savings, but the result makes you look like a douchebag.) Still, I'm optimistic that as more pairs get added (feel free to suggest any, or improvements on the current ones) the savings will increase dramatically. And even if they don't, I really enjoyed the trip.
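To make the replacement idea concrete, here is a small sketch of case-sensitive pair substitution and of how the savings could be measured. The pairs listed are illustrative picks from the Unicode table, not the list ligatweet actually uses, and the function names are invented for the example. It also assumes the length limit counts characters rather than bytes.

```php
<?php
// Illustrative pairs only; this is NOT the list ligatweet uses.
const LIGATURE_PAIRS = [
    'ffi' => "\u{FB03}", // LATIN SMALL LIGATURE FFI
    'ffl' => "\u{FB04}", // LATIN SMALL LIGATURE FFL
    'ff'  => "\u{FB00}", // LATIN SMALL LIGATURE FF
    'fi'  => "\u{FB01}", // LATIN SMALL LIGATURE FI
    'fl'  => "\u{FB02}", // LATIN SMALL LIGATURE FL
    '...' => "\u{2026}", // HORIZONTAL ELLIPSIS
    '!!'  => "\u{203C}", // DOUBLE EXCLAMATION MARK
];

// Hypothetical helper: case-sensitive replacement of each pair.
function shorten_with_ligatures(string $text): string
{
    // With an array argument, strtr() tries the longest keys first,
    // so "ffi" wins over "ff" and "fi" at the same position.
    return strtr($text, LIGATURE_PAIRS);
}

// Hypothetical helper: percentage of characters saved.
function savings_percent(string $before, string $after): float
{
    $n = mb_strlen($before, 'UTF-8');
    return $n === 0 ? 0.0 : 100 * ($n - mb_strlen($after, 'UTF-8')) / $n;
}
```

With these pairs, `shorten_with_ligatures('office...')` turns nine characters into five. Ordinary sentences contain far fewer replaceable sequences than that contrived example, which fits the modest gains reported above.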
Also, exploring the Unicode table gave me lots of ideas about scripts utilizing it, some of which I consider far more useful than ligatweet (although I'm not sure if I'll ever find the time to code them; even ligatweet was only finished because I had no internet connection for a while tonight, so I couldn't work and I didn't feel like going to sleep).
By the way, in case you were wondering, I didn't post the tweet that inspired me to write the script. After coding for a while, it just didn't fit my mood any more. ;-)