Copying Schemas? Don’t Forget the Charset!

As usual on Wednsdays, we refresh the development schemas with data from production. Thursday morning, the users complained that one of their queries fail with:

ORA-600: [kafspa:columnBuffer2], [4001], [4000]

According to our good friends at Metalink, this means that we have 4001 characters in varchar 4000 field. cool! But how did they get in there?

Ah, maybe its a good time to mention that production uses WE8ISO8850P1 charset while the development schema is on a database that uses UTF8. Not brilliant. Even less brilliant is assuming, as I did, that since WE8ISO8850P1 is a subset of UTF8, I can just export and import and everything will be fine.  I assumed that because for the last 2 years we did just that and everything was fine. Turned out that we were lucky.

Oracle support found the segment, file, block and row where the problem occurred from my trace files (unfortunately they did not tell me how they did that). I went through a bit of trial and error to find the correct column (it had to be a table with over 20 varchar 4000 columns). Then I used the primary key of that row to find the right data in production (remember that I couldn’t query that row in development without getting ORA-600), removed few chars and reinserted it into development.

This solved the ORA-600. But what caused it anyway?

I took a long look at all 4000 characters of original data.  One of them was ‘¿’. Its decimal value is 191 in ASCII, but 49855 in UTF8. 49855 takes two bytes while 191 takes only one. I had no idea this could happen – I thought all ASCII chars retained their original value in UTF8. I also learned that when you say varchar (4000 char) it actually means (4000 bytes), so its a bit misleading.

So, import converted one character from one byte to two bytes, in a field that already had 4000 bytes (but should have had 4000 chars), and in the process corrupted my data. Isn’t this a bug?

At this point I did what I should have done few days ago – read what the documentation says about character set migration.  And the doc says:

“The Export and Import utilities can convert character sets from the original database character set to the new database character set. However, character set conversions can sometimes cause data loss or data corruption.”

and also:

“In most cases, a full export and import is recommended to properly convert all data to a new character set. It is important to be aware of data truncation issues, because columns with character datatypes may need to be extended before the import to handle an increase in size.”

Well, I still think that ORA-600 should have not happen, and as the doc says, the field should have been truncuated. However, I can’t complain, I should have been warned that data loss and corruption may happen.

This is a good reminder that things that bite you are the things you think you know when you don’t. I was so sure that moving from ISO8850 to UTF8 is safe, that I didn’t even bother to read the doc.

Advertisements

11 Comments on “Copying Schemas? Don’t Forget the Charset!”

  1. hi Chen,
    i’m not sure to have understood correctly your issue. First of all: what charset is ISO-8850?
    Do you use exp/imp or expdb/impdb? Doing same operations like you with exp/imp, to me has happened to have same problem caused by the nls_lenght_semantics=BYTE, so varchar2(100) is varchar2(100 bytes) and so if i import a string from a database with charset WE8ISO8859p1 to a database with charset AL32UTF8 and i’ve a column with exactly 100 chars and some of this chars on UTF8 are encoded with two bytes (e.g. “è”) i got errors during import of type “value to large”. Really i don’t remember if i have ever tried with varchar2(4000) that is the maximum value.
    I agree on the fact that if you got later an ORA-00600 it is a bug, as oracle says ORA-00600 happens on unespected situations.

  2. According to the doc,
    Independently of the maximum length in characters, the length of VARCHAR2 data cannot exceed 4000 bytes

    So if you got a VARCHAR2(3000 CHAR), you will be able to insert only 2000 chars maybe. If you get an ORA-600, yes, it is a bug, I am convinced you already reported it to Metalink, did not you ?

    :mrgreen:

  3. Dan Bernstein says:

    “I thought all ASCII chars retained their original value in UTF8”. That is true, but ¿ is not ASCII. ASCII defines 128 characters, encoded as 0-127, and the Unicode code points for those characters are indeed the same. This is true for all 256 characters in ISO-8859-1 (Latin 1), however Unicode code points 128 and higher do not have a 1-byte UTF-8 encoding.

  4. Al says:

    I had no idea this could happen – I thought all ASCII chars retained their original value in UTF8
    Well, as you have learned (but maybe not your readers), all ASCII characters in the 0-127 range retain their numerical value in UTF-8 (and thus their width).
    The inverted ? is an extended character (meaning its value is in the 128-255 range), defined in some ISO-8859-xxx character set, and there is no guarantee concerning its width in UTF-8. It could be1, 2, 4 bytes…

  5. Adi Stav says:

    Ah, now that’s a subtlety.

    Indeed, the representation of any ASCII string remains the same after you convert it to UTF-8. But character 191 is not in ASCII at all, as ASCII only goes up to 127. Character 191 is in ISO-8859-1, and is represented in UTF-8 as a multi-byte character (its position in Unicode is still 191, but this is encoded with two bytes). You couldn’t possibly represent all of ASCII *and* ISO-8859-1’s 256 different characters all as single-byte characters in UTF-8, because then how would you signal the beginning of a multi-byte character?

    What you are guaranteed, is that if you convert a ISO-8859-1 string to UTF-8 and back, you’d get the exact same data back (insufficient buffer space and hypothetical implementation bugs notwithstanding).

  6. prodlife says:

    Wow. I did not expect so many responses.
    Seems like I’m the only one who didn’t know that characters can grow when converted to UTF8 🙂

    @Cristian
    ISO-8850 is what Oracle calls WE8ISO8850P1, its what usually used in American databases.
    The problem I ran into is with IMPDP. Import seems to fail during the import, which is slightly nicer since it allows me to fix the issue before the customer gets ORA-600.

    @Laurent
    Metalink insisted that it is not a bug since documentation says that data can be corrupted and ORA-600 is a proper response for corruption. I gave up after a week.

    @Dan Bernstein
    Did you happen to be the same Dan Bernstein who wrote QMail?

  7. Dan Bernstein says:

    No.

  8. Sokrates says:

    prodlife, I think, there’s a typo in your post (I suspect, that’s the cause of Cristian’s question) and also in your reply to Cristian’s question:

    SQL> select ascii(convert(chr(49855), ‘WE8ISO8850P1’)) from dual;
    select ascii(convert(chr(49855), ‘WE8ISO8850P1’)) from dual
    *
    FEHLER in Zeile 1:
    ORA-01482: unsupported character set

    SQL> select ascii(convert(chr(49855), replace(‘WE8ISO8850P1’, ‘0’, ‘9’))) from dual;

    ASCII(CONVERT(CHR(49855),REPLACE(‘WE8ISO8850P1′,’0′,’9’)))
    ———————————————————-
    191

    nice observation anyway !

  9. > WE8ISO8850P1
    Surely a kezboard issue 🙂

    About data corruption, I can well read in the globalization support guide :
    > However, character set conversions can sometimes cause data loss or data corruption
    If Oracle means that the ORA-600 is a normal answer to data corruption, ask them to write Knowledge Content if none available !

  10. Diabolic! i don’t know well how to translate in english but we say something like “make mistakes is human, perservere is diabolik”.

    I think you are speaking about WE8ISO8859P1. on the web i haven’t found ISO-8850 but surely a similar typo error official documentation http://download.oracle.com/docs/cd/B14117_01/server.101/b10749/ch2charset.htm#sthref164

  11. […] una ricerca sul supporto Oracle ho trovato poco, facendone una Google sono arrivato qui, come dire: a volte ritornano. Infatti in coda al post che non ricordavo di aver gia letto ho […]


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s