High Performance Computing: Download and prepare data in a batch mode
Over the time, I need to manipulate a lot of data on a Linux cluster. Some of these manipulations actually read/write data, whereas some are essentially file system operations, such as downloading the files.
Here I present a list of similar operations suitable for HPC using pbs job approach whenever possible.
I do not attempt to include all possible methods but only the ones that I find useful and easy to prepare in seconds.
Download
The most efficient way to download MODIS alike data using HPC.
wget -r –no-parent -R “index.html” –retr-symlinks -A “.nc” ftp-url
wget -r –no-parent -R “index.html” -A “MOD17A2.A2000.hdf” -A “MOD17A2.A2000.xml” http-url
wget -r –no-parent -R “index.html” -A “MOD17.hdf” -A “MOD17.xml” http-url
You can basically setup filter for file type, year and granule id.
A live example:
///==========================================================
#!/bin/bash
#PBS -l nodes=1:ppn=1
#PBS -l naccesspolicy=singleuser
#PBS -l walltime=40:00:00
#PBS -M your email address
#PBS -m ae
#PBS -N download
#PBS -q standby
cd $PBS_O_WORKDIR
wget -r –no-parent -R “index.html” –retr-symlinks -A “.tar” ftp://somwhere
///==========================================================
Compress and extract
Examples:
///==========================================================
#!/bin/bash
#use this script to extract tar files under the sub directory
for dir in find -mindepth 1 -maxdepth 1 -type d
do
cd $dir
echo $dir
tar xf *.tar ./
cd ..
done
///==========================================================
#!/bin/bash
Pass the name of the file to unpack on the command line $1
for file in *.gz do gunzip -d “$file” done ///==========================================================
Search grep -rnw ‘/path/to/somewhere/’ -e “pattern” find . -maxdepth 1 -name “string” -print Comiple make &> results.txt
Count find . -name ‘*.cpp’ | xargs wc -l
Debug qsub -I -lnodes=1:ppn=20 -lwalltime=04:00:00 -q boss -X
Simply organize these above bash script and replace with commands, most file system related tasks can be resolved. I will add more related scripts later.