Convert VARBINARY CSV File contents back to VARCHAR
Hello,
For a specific usecase, I want users to be able to import a custom CSV file. The import of this file triggers inserts in more than one table. Therefore, the standard import function Thinkwise offers is not sufficient at the moment.
In order to tackle the problem. I have a start task which has 2 relevant task parameters:
file_name (data_type=NVARCHAR(512) and control=File upload). Here, the users selects a file from file explorer
file_data (data_type=VARBINARY_MAX). Here, the varbinary file data is stored
The file which is being uploaded is UTF-8 Encoded. After using
CAST(@csv_varbinary as varchar(max))
I can see the VARCHAR information. However, special characters like “ö” do not seem to work with this method: “ö” becomes: “ö”.
Any advice on how to proceed from here? After doing some research on the internet I tried the following statements already:
select @test_varb, CAST(@test_varb as varchar(max)), CAST(@test_varb as varchar(max)) COLLATE Latin1_General_100_BIN2_UTF8, CAST(@test_varb as varchar(max)) COLLATE Latin1_General_100_CI_AI_SC_UTF8, CONVERT(VARCHAR(MAX), @test_varb, 1208), CONVERT(VARCHAR(MAX), @test_varb, 1)
Page 1 / 1
Hi Nick, can you try converting it to nvarchar(max) instead and see what happens?
Hi Nick, can you try converting it to nvarchar(max) instead and see what happens?
Hi Anne,
Using NVARCHAR instead of VARCHAR results in human unreadable text (both for using CONVERT & CAST). My example is a .csv file which I open in Notepad. You can see the encoding at the bottom:
I only display the word “Abtlöbnitz” here because this is a typical example where it goed wrong in our case.
After uploading the file via Thinkwise, a task parameter gets the varbinary representation of the file and we store this varbinary field in the database. In my example this is variable @test_varb.
After your suggestion I tried the following query:
select @test_varb, CAST(@test_varb as varchar(max)) as cast_varchar, CAST(@test_varb as nvarchar(max)) as cast_nvarchar, CONVERT(VARCHAR(MAX), @test_varb) as convert_varchar, CONVERT(NVARCHAR(MAX), @test_varb) as convert_nvarchar, CONVERT(VARCHAR(MAX), @test_varb, 1) as convert_varchar_1, CONVERT(NVARCHAR(MAX), @test_varb, 1) as convert_nvarchar_1, CONVERT(VARCHAR(MAX), @test_varb, 1208) as convert_varchar_1208, CONVERT(NVARCHAR(MAX), @test_varb, 1208) as convert_nvarchar_1208, CAST(@test_varb as varchar(max)) COLLATE Latin1_General_100_BIN2_UTF8 as collate_Latin1_General_100_BIN2_UTF8_varchar, CAST(@test_varb as nvarchar(max)) COLLATE Latin1_General_100_BIN2_UTF8 as collate_Latin1_General_100_BIN2_UTF8_nvarchar, CAST(@test_varb as varchar(max)) COLLATE Latin1_General_100_CI_AI_SC_UTF8 as collate_Latin1_General_100_CI_AI_SC_UTF8_varchar, CAST(@test_varb as nvarchar(max)) COLLATE Latin1_General_100_CI_AI_SC_UTF8 as collate_Latin1_General_100_CI_AI_SC_UTF8_nvarchar
Most of the VARCHAR options are able to transform the varbinary representation back to a human readable strong, but unfortunately not one of the options correctly retrieves the character “ö”.
Using the NVARCHAR options results in something like this:
Extra information:
When I copy the value “Abtlöbnitz” directly to SQL Server Management Studio and do the following it works as expected:
Transform the string to VARBINARY
Transform it back to VARCHAR
So it appears Indicium (?) converts the string into a different VARBINARY representation?
SELECT @test, @test_varb, CAST(@test_varb AS VARCHAR(MAX)) as result
Result:
Hello @NickJanssen,
There are two ways to do this in SQL Server.
Method 1
create function ndbo].]convert_utf8_varbinary_to_nvarchar](@varbinary_source varbinary(max)) returns nvarchar(max) as begin declare @temp table(string varchar(max) collate Latin1_General_100_CI_AS_SC_UTF8); insert into @temp select @varbinary_source; return (select convert(nvarchar(max), string) from @temp) end
Method 2
create function ndbo].]convert_utf8_varbinary_to_nvarchar](@varbinary_source varbinary(max)) returns nvarchar(max) as begin return (select cast(concat('<?xml version="1.0" encoding="UTF-8" ?>', @varbinary_source) as xml).value('.','nvarchar(max)')) end
Method 1 might seem more convoluted, but I would actually recommend that one, because you won't have to deal with any XML escaping. I've created scalar functions for you so they're easy to reuse.
Regarding your comment:
So it appears Indicium (?) converts the string into a different VARBINARY representation?
Yes, but it is Indicium that uses UTF-8, whereas SQL Server uses UTF-16 (or UCS-2 specifically). By default, SQL Server assumes UCS-2 for all nvarchar<->varbinary conversions and if you declare the variable and assign the text value in SSMS, as you did, the encoding will be UCS-2 and so you will get the correct result. The issue is that the varbinary value supplied by Indicium is a UTF-8 representation of the text.
This is the binary value that Indicium will supply for the word Abtlöbnitz: 0x4162746CC3B6626E69747A .
You were close with your collation solution, honestly one would expect that to work, but sadly it doesn't. Luckily, inserting the value into a table on which the column is configured to use a UTF-8 collation and then selecting the value does work.
I hope this helps.
Hello @NickJanssen,
I wanted to add one more thing to my previous answer, because it's probably the best solution for your problem. You can also assign a nvarchar process variable to the file data output parameter of your task process action. This way, Indicium will handle the encoding itself. I expect that this should work without any problems in your case.
I do want to stress that using a text-based data type like nvarchar for a binary output parameter can only work for plain text files like .txt and .csv. In all other circumstances, varbinary is the only correct data type when dealing with reading files.
Hi @Vincent Doppenberg ,
Actually I think that for my use case, your suggested Method 1 is better.
If I were to use a process flow, the file will be saved into the storage location, which is not what I want: I want to capture the varbinary in a task parameter, convert it to back to string and store the fields in a target table.
Thanks a lot for your response, this helps perfectly!