Solved

Best practice for inserting large amounts of rows into one table

  • 20 November 2019
  • 10 replies
  • 144 views

Userlevel 6
Badge +14

/-This is more of a SQL question-/

For the project I work on I'm in charge of realizing functionality to import large amounts of Product data (ASCII format / .txt file data). The database I work with is placed in a environment where multiple companies work on the same database. In the application, they only see their own data. For example, if 30 companies use the database and all have a pool of 1 million Products, then the 'Product’ table will have 30 million records in total but will only see their own 1 million products.

The data that I will import consists possibly over a million records and will be inserted into the single ‘Product’ table. What I'm worried about is that inserting 1 million products in one go will take a very long time, so I want to look at possible ways to accelerate this process and making sure that importing big files does not take hours.

I have read about removing indexes, parallel processes and such to speed up the process but I'm not fully convinced it would be the best solution to this particular case. 

Community, do you have suggestions for me how I can make this Import functionality as quick as it can be? 

Thanks in advance!

icon

Best answer by Erwin Ekkel 26 November 2019, 17:10

@Mark Jongeling based on what you mentioned, the temp table is probably your best bet. But try to do a select into for the new temp table, instead of insert. I think it's best to give this a try with some mock data and compare the execution plans and run time. You mainly want your primary key columns to be correct in the temp table. 

View original

10 replies

Userlevel 4
Badge +2

Hi Mark,

There are a couple of methods that you can use for inserting a large amount records in SQL Server. You can read about them here. Regardless of the rest of my answer, I recommend carefully reading this article, because all methods have their own advantages and disadvantages, otherwise there wouldn't be different methods. One of the key questions that needs to be answered before choosing a method is if the file has exactly the same column structure and data types as the table you are importing it into or if it requires some type of column mapping or data formatting.

As far as pure speed goes, I believe that BCP is the fastest method for inserting large amounts of data into a table, but it is an external utility tool. The BULK INSERT statement should be able to achieve a very similar level of performance.

Aside from the method you use there are several things to consider that will have a far greater impact on the performance, including things like:

  • Constraints and triggers on the target table.
  • Indexes on the target table. It might depend on the amount of data already in the table versus the amount of data you are importing whether it is smart to drop the indexes or not.
  • Using the TABLOCK query hint.

You should be able to find your answer in the article I linked. Maybe someone with more experience on the matter can give you a more definitive answer.

Userlevel 6
Badge +14

Hey Vincent, 

Thanks for your reply! Query hints is something I will try out and see if that gives the greatest performance in my situation. For the Bulk import, in this project we sadly can’t use it since the client does not want any data files to be stored on the Database server and has ensured the Database can’t read off the File server.

How I currently read the files is through the use of the Process Flow that reads files from the file server and puts it's data into a variable. After that I chop the data into pieces of certain lengths which I then can insert into a Global temporary table with all columns specified already. Using substring(variably, start, length) I can divide the variable and insert data into the right column.

So the insert will be done from the Global temporary table to the database table "Product”. 

Maybe this information can help in finding possible optimizations:smile:

 

Userlevel 5
Badge +6

There is no easy answer to this question. Because like with most things SQL related it depends on a lot of factors. Make sure you don’t have too many indexes and statistics on the table ‘Product’. Rule of thumb: max 5. Like Vincent said, you do not want too many triggers and constraints on this table.

Is all imported data new? Or could there be updates/deletes or rows ignored (since they already exist? If so, don't use the merge statement but use individual insert/update/delete statements (all set based of course).

If possible use select into instead of insert into for the global temp table. Also consider creating a (clustered) index on the global temp table based on your clustered index of the Product table. 

Userlevel 6
Badge +14

@Erwin Ekkel, sadly the data I import data / Global temporary table does not have the same table structure as the "Product” table.

Would you recommend me creating a local temporary table structured like the "Product” table, then insert the data from the global temporary table into there, enforce a Clustered index similar to the "Product” table clustered index, then insert the data from the local temporary table to the real table? In my mind it sounds like it probably would benefit the performance, but will it? Not sure if you know but asking is always possible :wink:

Userlevel 5
Badge +6

@Mark Jongeling What are the differences in the primary key columns in the global temp table? Is it just the data types or do you need to concatenate columns in the temp table to create the columns from the product table? If so it is best to fix these issues when you fill the global temp table with data. Then you can create an index and insert the data. 

 

 

Userlevel 6
Badge +14

@Erwin Ekkel, the global temporary tables are structured as according to the standard PAB 2.03 (a way of structuring article and product data), for example:

Veldnaam being the column name, Lengte being the length of the varchar.

However, the "Product” table looks like this (in short):

So they don't look alike and queries have to make sure the data gets transferred correctly. Currently there's no table in between the Global temp table and the real table meaning I transfer the data and do my magic on it to get it into the "Product” table successfully.

A new temp table wouldn't have no data in it at first but the real "Product” table currently already has millions of products in it. My thought is to create a Temp product table where I put the data in, all structured and such, then transfer it to the real "Product” table. Will/might this be better than the current solution?

Hope this clears things up.

Userlevel 5
Badge +6

@Mark Jongeling based on what you mentioned, the temp table is probably your best bet. But try to do a select into for the new temp table, instead of insert. I think it's best to give this a try with some mock data and compare the execution plans and run time. You mainly want your primary key columns to be correct in the temp table. 

Userlevel 6
Badge +4

Consider changing the recovery model of the database to BULK LOGGED during the data import operation. This will reduce the IO of the log file a lot. The article Vincent mentioned follows up on it a bit later as well.

Userlevel 5
Badge +6

Consider changing the recovery model of the database to BULK LOGGED during the data import operation. This will reduce the IO of the log file a lot. The article Vincent mentioned follows up on it a bit later as well.

@Anne Buit I assume this is only advised when the current recovery model is set to full model. 

Userlevel 6
Badge +4

The simple recovery model will truncate the log after a transaction savepoint is reached, but would still log every inserted record. I'm quite sure this would still cause more IO than the bulk model (logging for changes in table extents) when a large import is done in a single statement/transaction. But it would be interesting to put it to the test.

At any rate, I'd try to avoid using the simple recovery model in production environments, to be able to restore the database to an earlier point in time if needed.

Reply