Boy how I wish there weren't a difference between Unix and Windows line endings. I recently needed to search through a bunch of map reduce output (tons of reduce files in text format), so I decided to bulk insert them into a SQL database table.
I wrote a bulk insert statement for each of the reduce files (well, some SQL code wrote the bulk insert statements based on a list of file names), and I quickly encountered errors when attempting to execute the statements. The bulk insert statements looked something like this:
BULK INSERT ReduceData_View FROM 'path\part-r-00000'
WITH (FIRSTROW = 1,FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n');
The reduce files are rows of data that are tab separated. The frustrating part is that I had failed to remember that the map reduce job was done on a Linux machine, and therefore each file contained Unix style line endings. I had been viewing the files in Notepad++ (which will display different file formats without complaint), so it was confusing to me when I would see errors. I changed the ROWTERMINATOR value to '\r\n', and everything worked fine.
Lesson - pay attention to the source system type when processing data files. I didn't spend a long time on this, but it was still frustrating.
SQL Views
I had created a view to insert into, because I had an identity column on my target table. It seemed like an easy way to avoid the problem of bulk insert trying to insert into the identity column. The problem occurred when I ran the SQL to do the inserts, and some of the columns from the source files had values that were too wide for the table columns. I changed the target table columns to use varchar(max), since some of the source data had values that were larger than 4000 chars. I ran the bulk insert again and I still had errors due to the data being too large for one of the target columns. What the heck? What I hadn't realized is that the view I created had information for the columns in the target table from when the target table was first created. The table had the column width changed to varchar(max), but the view thought the table column width was varchar(4000). I ran the SQL to recreate the view and then the bulk insert worked just fine.
No comments:
Post a Comment