Skip to content

The Basics

Tilo edited this page Feb 13, 2022 · 11 revisions

Vanilla CSV Files

If you have a vanilla CSV file which is comma separated, and has standard line endings.

$ cat  /tmp/test.csv
"CATEGORY " ," FIRST  NAME" , " AGE "
 "Red","John" , " 34 "

Reading and processing the CSV file is straight-forward, and you get an array of hashes containing the data:

data = SmarterCSV.process( filename )
 => [{:category=>"Red", :first_name=>"John", :age=>"34"}]

You will notice that the sample CSV file had a couple of extra spaces, which were stripped off, and the fields from the header line were converted into Ruby symbols.
Notice how the double-space in " FIRST NAME" becomes an underscore in :first_name.

All this is default behavior, assuming that you want to hand this data to an ORM, but these defaults can be overwritten.

Another Vanilla CSV File

This sample file has a few fields empty, and one row without any values.

$ cat /tmp/pets.csv
first name,last name,dogs,cats,birds,fish
Dan,McAllister,2,,,
,,,,,
Lucy,Laweless,,5,,
Miles,O'Brian,,,,21
Nancy,Homes,2,,1,

$ irb
 > require 'smarter_csv'
 > pets_by_owner = SmarterCSV.process('/tmp/pets.csv')
  => [ {:first_name=>"Dan", :last_name=>"McAllister", :dogs=>"2"}, 
       {:first_name=>"Lucy", :last_name=>"Laweless", :cats=>"5"}, 
       {:first_name=>"Miles", :last_name=>"O'Brian", :fish=>"21"}, 
       {:first_name=>"Nancy", :last_name=>"Homes", :dogs=>"2", :birds=>"1"}
     ]
 
 > SmarterCSV.warnings
  => {3=>["No data in line 3"]}
 > SmarterCSV.errors
  => {}

Another default behavior of SmarterCSV is that it will remove any key/value pairs from the result hash if the value is nil or an empty string. The reasoning here is that if you would update a database record containing valid data with an empty string, you would destroy data. SmarterCSV is trying to be safe here, and avoid this scenario by default. But this default behavior can be changed if needed.

You'll also notice that there is a way to get any errors or warnings which may occur during processing. In this case, there was no data in line 3 - all the values were empty.

But my CSV Files are not Comma-Separated

$ cat /tmp/test2.csv
"CATEGORY";"FIRST--NAME";"AGE"
"Red";"John";"35"

To read this file, we just need to tell SmarterCSV which column-separator col_sep to use.

data = SmarterCSV.process('/tmp/test.csv', {col_sep: ';'})
 => [{:category=>"Red", :first_name=>"John", :age=>"35"}]

Notice how the double-dash becomes an underscore in :first_name.

But I don't want Symbols as Keys

If you don't want symbols as Keys, you can just pass this option in:

data = SmarterCSV.process('/tmp/test.csv', {header_transformations: [:none, :keys_as_strings]})
 => [{"category"=>"Red", "first_name"=>"John", "age"=>"35"}]

Again, the default is to strip whitespaces and downcase the headers, because ORMs in Ruby have lower-case attribute names.

The keyword :none disables any defaults for header_transformations, before we specify :keys_as_strings.

But I have a crazy Binary Format

There are a couple of database dump formats, which are basically CSV files, but with pretty exotic values for column-separator and row-separator.

SmarterCSV can easily read these files, if configured correctly:

filename = '/tmp/strange_db_dump'   # a file with CRTL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes!)
options = {
  :col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/, :chunk_size => 100 
}
n = SmarterCSV.process(filename, options) do |chunk|
    Resque.enque( ResqueWorkerClass, chunk ) # pass chunks of CSV-data to Resque workers for parallel processing
end
=> returns number of chunks

This has an advanced option :chunk_size, but for the purpose of this example just note that the column-separator was set to "\cA" which is equivalent to \001, and the row-seprator was set to "\cB\n" which is equivalent to \002\n. We also ignore any lines which start with the comment regular expression /^#/.

But I want the RAW headers!

data = SmarterCSV.process('/tmp/test.csv', {header_transformations: [:none]})
 => [{"CATEGORY "=>"Red", " FIRST  NAME"=>"John", " AGE "=>"35"}]

Congrats! Now you have to strip those spaces yourself 😛 🎉