Finding Stuff in Big CSV Files
October 16th, 2017
If you have an activity-based model or big(ish) data, from time to time you need to find something: one record out of half a million, or out of a million. If you’re on Windows, you need the GnuWin tools for the commands below.
Getting the First Line
Getting the first line is pretty easy with the head command:
>head -n 1 file.csv
>head -n 1 jointParticipantResults.csv
id,tourid,hhid,hhsize,purpose,partytype,participantNo,pNum,personType,HhJoint
If you want the last record, replace ‘head’ with ‘tail’.
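Counting commas by eye to work out a field's position gets old fast on a wide file. Here is a small sketch that numbers the header fields instead. It assumes a POSIX-style shell, and sample.csv is a made-up stand-in file with invented values, not the real model output:

```shell
# Tiny stand-in file with made-up values, so nothing real gets touched.
printf '%s\n' \
  'id,tourid,hhid,hhsize,purpose' \
  '1,10,158568,3,1' \
  '2,11,158570,2,4' > sample.csv

# Print each header name with its position, one per line.
numbered=$(head -n 1 sample.csv | awk -F',' '{ for (i = 1; i <= NF; i++) print i, $i }')
echo "$numbered"
```

On the stand-in file the third line of output pairs position 3 with hhid, and that position is the number to use in an awk filter.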
Getting the Number of Rows
This is a pretty simple awk script that returns the number of rows:
>awk 'END {print NR}' jointParticipantResults.csv
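One thing to keep in mind is that NR counts every line, including the header. If what you want is the number of data records, a sketch like the following subtracts one. It assumes a single header line and no line breaks inside quoted fields, and sample.csv is again a made-up stand-in file:

```shell
# Tiny stand-in file with made-up values: one header line plus two records.
printf '%s\n' \
  'id,tourid,hhid,hhsize,purpose' \
  '1,10,158568,3,1' \
  '2,11,158570,2,4' > sample.csv

# NR counts every line, header included, so subtract one for the data rows.
data_rows=$(awk 'END { print NR - 1 }' sample.csv)
echo "$data_rows"
```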
Getting a Specific Record
This is a simple awk script that returns every row where the third field is 158568. Looking at the header line printed by the head command above, the third field is hhid:
>awk '$3 == 158568 {print $0}' FS="," jointParticipantResults.csv
Note the FS part – that tells awk that the field separator is a comma.
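If the field position is likely to change between files, one option is to look the column up by name rather than by number. This is only a sketch: the col and val variables are names picked for illustration, sample.csv is a made-up stand-in file, and -F',' is just the command-line form of setting FS to a comma. It also assumes plain comma-separated fields, so quoted fields that contain commas would throw it off:

```shell
# Tiny stand-in file with made-up values; 158568 is the household we look for.
printf '%s\n' \
  'id,tourid,hhid,hhsize,purpose' \
  '1,10,158568,3,1' \
  '2,11,158570,2,4' > sample.csv

# On the header line, find which field is named col, print the header, and move on.
# On every other line, print the row if that field equals val.
matches=$(awk -F',' -v col="hhid" -v val="158568" '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; print; next }
  c && $c == val
' sample.csv)
echo "$matches"
```

The header is printed along with the matching rows so the output is still readable on its own. If the column name is not found, only the header comes back.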