Copying Schemas? Don’t Forget the Charset!Posted: January 13, 2009
As usual on Wednsdays, we refresh the development schemas with data from production. Thursday morning, the users complained that one of their queries fail with:
ORA-600: [kafspa:columnBuffer2], , 
According to our good friends at Metalink, this means that we have 4001 characters in varchar 4000 field. cool! But how did they get in there?
Ah, maybe its a good time to mention that production uses WE8ISO8850P1 charset while the development schema is on a database that uses UTF8. Not brilliant. Even less brilliant is assuming, as I did, that since WE8ISO8850P1 is a subset of UTF8, I can just export and import and everything will be fine. I assumed that because for the last 2 years we did just that and everything was fine. Turned out that we were lucky.
Oracle support found the segment, file, block and row where the problem occurred from my trace files (unfortunately they did not tell me how they did that). I went through a bit of trial and error to find the correct column (it had to be a table with over 20 varchar 4000 columns). Then I used the primary key of that row to find the right data in production (remember that I couldn’t query that row in development without getting ORA-600), removed few chars and reinserted it into development.
This solved the ORA-600. But what caused it anyway?
I took a long look at all 4000 characters of original data. One of them was ‘¿’. Its decimal value is 191 in ASCII, but 49855 in UTF8. 49855 takes two bytes while 191 takes only one. I had no idea this could happen – I thought all ASCII chars retained their original value in UTF8. I also learned that when you say varchar (4000 char) it actually means (4000 bytes), so its a bit misleading.
So, import converted one character from one byte to two bytes, in a field that already had 4000 bytes (but should have had 4000 chars), and in the process corrupted my data. Isn’t this a bug?
At this point I did what I should have done few days ago – read what the documentation says about character set migration. And the doc says:
“The Export and Import utilities can convert character sets from the original database character set to the new database character set. However, character set conversions can sometimes cause data loss or data corruption.”
“In most cases, a full export and import is recommended to properly convert all data to a new character set. It is important to be aware of data truncation issues, because columns with character datatypes may need to be extended before the import to handle an increase in size.”
Well, I still think that ORA-600 should have not happen, and as the doc says, the field should have been truncuated. However, I can’t complain, I should have been warned that data loss and corruption may happen.
This is a good reminder that things that bite you are the things you think you know when you don’t. I was so sure that moving from ISO8850 to UTF8 is safe, that I didn’t even bother to read the doc.