Finding Stuff in Big CSV Files

October 16th, 2017

If you have an activity based model OR big(ish) data, from time-to-time, you need to find something. One record, possibly one in half a million or one in a million. You need GNUWin tools for these if you’re on Windows.

Getting the First Line

Getting the first line is pretty easy with the head command:

>head -n 1 file.csv

>head -n 1 jointParticipantResults.csv
id,tourid,hhid,hhsize,purpose,partytype,participantNo,pNum,personType,HhJoint

If you want the last, record, replace ‘head’ with ‘tail’.

Getting the Number of Rows

This is a pretty simple awk script that returns the number of rows:

>awk 'END {print NR}' jointParticipantResults.csv

Getting a Specific Record

This is a simple awk script that returns the row where the  third field is 158568.  Looking at the first script above, the third field is the hhid field:

>awk '$3 == 158568 {print $0}' FS="," jointParticipantResults.csv

Note the FS part – that tells awk that the field separator is a comma.