I don’t know exactly how many CSV files I’ve read or written to date, but I’m willing to bet it’s a lot. These kinds of files are a simple and common way to exchange data, and they’re interoperable with spreadsheet programs as well, making them more easily accessible to non-programmer types. There is some excellent support out there for reading and writing CSV files, so here, from simplest to most complex, are a few different ways to load and query the data trapped inside your CSV files. But first off…
CSV is Not XML
Or any other suitably constrained data format, for that matter. There are some pretty clear ‘rules’ out there for transporting data in this format, but of course hardly anybody listens to them; a quick Google search turns up a couple of good write-ups on the first page.
However, while looking for available sample material on the web to demonstrate with, I found many great examples of how to screw it up: not quoting empty strings, not escaping embedded line breaks, commas embedded in unquoted text, etc. All of these are things that prevent automation from properly interacting with the data. There are also, of course, the CSV ‘formats’ that insist on embedding comments and descriptions, trying to present more than one matrix of information or any of the other crazy things you can do when writing to a format with no real metadata layer.
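Just for reference, the tricky cases look something like this when they’re done right (the record below is entirely made up): fields containing commas, quotes or line breaks get wrapped in quotes, and embedded quotes are doubled. Even then, a naive comma split gets it wrong, which is a big part of why a real parser earns its keep.

[groovy language="true"]
//a made-up record exercising the awkward cases: an embedded comma,
//an embedded (doubled) quote and an embedded line break, all inside quoted fields
String trickyRecord = '"42","Smith, John","He said ""hi""","line one\nline two"'

//naively splitting on commas gets it wrong: five pieces for four fields
assert trickyRecord.split(',').size() == 5
[/groovy]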
</rant>
Unfortunately, there’s nothing as convenient as an XML schema available for describing what information is encoded in the document and, in general, the only potentially useful metadata in an individual file is the header line. The good news is that all of these deficiencies can be overcome if you know they exist, and there are some pretty nifty tools you can use to take out much of the guesswork. I’ll suggest a couple of ideas for how to deal with files that don’t adhere to the ‘rules’, but the main example I’ve decided on is a straightforward and well-formed set representing the periodic table of elements, provided generously by Akiscode. This file is directly machine readable (no headers or embedded comments), quotes all values and just follows all the ‘rules’. Here are the first five lines as an example.
[table id=4 /]
Groovy All by Itself
Groovy makes dealing with String objects pretty painless. Facilities like easy casting and find()/findAll() for quickly turning raw Strings into real data let you ask some fairly complex questions of the data. In this simple example I’m depending on all values being quoted and I am not protecting against casting problems. I’m running this through Maven, which really helps to keep test data organized; all you have to do is drop your test file in src/test/resources and it’s automagically available on the test classpath.
No specific handling is done for different types here; everything is read in as a String and cast to a more specific type as needed. In a ‘real’ application you’re very likely to encounter mixed alphanumeric data in any given column (‘UNKNOWN’ instead of a number in a column labeled ‘Quantity’, for instance) and will need to be more careful about casting.
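Purely as a sketch of what that might look like (the safeCastToDouble name and the null-for-junk behavior are hypothetical, not something the examples below rely on), a more defensive cast could check the value before converting it:

[groovy language="true"]
/**
 * A hypothetical defensive version of the cast: strip the quotes,
 * then only convert values that actually look like numbers.
 */
private Double safeCastToDouble(String raw)
{
    String cleaned = raw.replaceAll('"', '').trim()
    //return null (or a sentinel of your choosing) for non-numeric junk
    return cleaned.isDouble() ? cleaned.toDouble() : null
}

assert safeCastToDouble('"200.59"') == 200.59
assert safeCastToDouble('"UNKNOWN"') == null
[/groovy]

For the well-behaved periodic table data, though, the naive version is plenty: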
[groovy language="true"]
//load and split the file
InputStream inputFile = getClass().classLoader.getResourceAsStream(TEST_FILE_NAME)
String[] lines = inputFile.text.split('\n')
List<String[]> rows = lines.collect {it.split(',')}

/**
 * A little helper method to get rid of the quotes in the input
 * and cast values so they can be compared properly.
 */
private double castToDouble(string)
{
    return string.replaceAll('"', '').toDouble()
}

//OK, it's parsed - let's ask some questions
private static final int ATOMIC_MASS = 1

def elementsOver200Mass = rows.findAll {castToDouble(it[ATOMIC_MASS]) > 200}

def elementsBetween10And20 = rows.findAll { row ->
    double mass = castToDouble(row[ATOMIC_MASS])
    mass <= 20 && mass >= 10
}
[/groovy]
So this approach works fine for well-formed input, but falls apart quickly in other cases. For instance, empty values in the middle of a line come back from split() as empty Strings, but empty values at the end of a line are dropped entirely, leading to different-sized arrays being stored in the list. Similarly, you need to deal with quoted vs. non-quoted content, embedded line breaks and other issues individually. Fortunately, some nice guys have done that for us.
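Just to make that trailing-value behavior concrete before moving on (the sample strings here are made up):

[groovy language="true"]
//an empty value in the middle of a line survives as an empty String...
assert 'a,,c'.split(',').toList() == ['a', '', 'c']

//...but trailing empty values are silently dropped, so the row comes back shorter
assert 'a,b,'.split(',').toList() == ['a', 'b']
[/groovy]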
OpenCSV
Glen Smith’s OpenCSV library nicely abstracts away a lot of the minutiae of dealing with CSV. It properly deals with escaped characters, multiline input and a host of other issues for you. Give it pretty much anything that satisfies the Reader interface as an input and it’s good to go. The return type of the CSVReader.readAll() method also satisfies the same contract as the plain Groovy version shown above, so we can interact with the parsed results in exactly the same way as in the previous example.
All values are still typed as Strings, so we need to cast in order to compare numeric values.
[groovy language="true"]
import au.com.bytecode.opencsv.CSVReader

List<String[]> rows = new CSVReader(
    new InputStreamReader(getClass().classLoader.getResourceAsStream(TEST_FILE_NAME))).readAll()

//same finders as in the Groovy version - OpenCSV has already stripped the quotes,
//so a plain toDouble() is all the casting we need
def elementsOver200Mass = rows.findAll {it[ATOMIC_MASS].toDouble() > 200}

def elementsBetween10And20 = rows.findAll { row ->
    double mass = row[ATOMIC_MASS].toDouble()
    mass <= 20 && mass >= 10
}
[/groovy]
HsqlDB
HsqlDB allows flat files to be used as Text Tables, effectively turning a CSV file into a database table. Data values are cast to the types specified in your table definition, so any potential type cast failures happen right up front when loading the data. The benefit is that once the data is read in, no further manual processing is necessary. Note: as a security precaution, by default the CSV file and the database files must be located in the same directory (this can be overridden through configuration).
In addition, leveraging SQL makes querying the data extremely easy, and Groovy brings in the Sql class to abstract away a lot of the normal Java boilerplate you encounter when working with a ResultSet. I didn’t do any in-depth testing to prove it out, but this is also the only one of the strategies described here that doesn’t require holding the entire data set in memory in order to run arbitrary queries. That can be mitigated in the previously shown methods by processing files line by line rather than in bulk if memory usage is a concern; a rough sketch of that appears after the HsqlDB example.
[groovy language="true"]
import groovy.sql.Sql

//create the table definition to insert
String tableName = 'elements'
String tableDefinition = """CREATE TEXT TABLE $tableName (
    atomic_number INTEGER PRIMARY KEY,
    atomic_mass NUMERIC,
    name VARCHAR(255),
    symbol VARCHAR(3)
);"""

//create a new file database and a table corresponding to the csv file
Sql sql = Sql.newInstance("jdbc:hsqldb:file:${testdbDir.absolutePath}/testdb", 'sa', '',
    'org.hsqldb.jdbcDriver')
sql.execute(tableDefinition)

//set the source to the csv file
sql.execute("SET TABLE elements SOURCE '${TEST_FILE_NAME};all_quoted=true'".toString())

//querying the database that's wrapping our CSV file
def elementsOver200Mass = sql.rows("SELECT * FROM $tableName WHERE atomic_mass > ?", [200])

def elementsBetween10And20 = sql.rows(
    "SELECT * FROM $tableName WHERE atomic_mass <= ? AND atomic_mass >= ?", [20, 10])

//simple db aggregates
def count = 0
sql.eachRow("SELECT count(1) FROM $tableName WHERE atomic_mass <= ?", [20]) { row ->
    count = row[0]
}

def avg = 0
sql.eachRow("SELECT avg(atomic_mass) FROM $tableName".toString()) { row ->
    avg = row[0]
}
[/groovy]
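As promised, here’s a rough sketch of the line-by-line alternative for the in-memory approaches, using OpenCSV’s readNext(), which hands back one row at a time as a String[] and returns null at the end of the input. The filtering shown is just an illustration, not part of the examples above:

[groovy language="true"]
import au.com.bytecode.opencsv.CSVReader

def heavyElements = []
CSVReader csv = new CSVReader(
    new InputStreamReader(getClass().classLoader.getResourceAsStream(TEST_FILE_NAME)))
try {
    String[] row = csv.readNext()
    while (row != null) {
        //only hang on to the rows we actually care about
        if (row[ATOMIC_MASS].toDouble() > 200) {
            heavyElements << row
        }
        row = csv.readNext()
    }
} finally {
    csv.close()
}
[/groovy]

Of course, once you’re streaming you give up the ability to run arbitrary queries after the fact, which is exactly where the text table approach shines.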
Which One do I Use???
The real answer is to pick the one that best fits your use case. I didn’t know about the HsqlDB option until recently, but OpenCSV has been a personal standby for years. Then again, if you’re writing something quick and dirty at the script level, the simplicity of just applying a couple of split() operations is pretty appealing. Here’s how I generally decide.
[table id=5 /]