I have often needed to move very large files (ISOs or zips of 1-4 GB) when I don’t have a USB drive of that capacity and, for some reason, can’t do it over the network. Splitting also helps if you want to broadcast huge files peer-to-peer (think updating 200 machines simultaneously), especially if you want to replicate a managed BitTorrent-like environment. I have found some commercial file splitters out there, but they are too slow and clunky. There is no conceivable reason why they have to be so slow, or why I should live without options.
So I decided to write one from scratch; it also gave me a reason to refresh my NIO knowledge. With some tweaking and proper use of buffers and channels, I managed to get throughput in Java that is comparable to, or better than, the native operating system tools. I tested the integrity of the file and everything was OK.
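The post does not say how the integrity check was done. One common way is to compare a digest of the original file with a digest of the reassembled one. The following is a minimal sketch under that assumption; the class name `FileDigest` and the choice of SHA-256 are illustrative, not taken from the original tool.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: compares two files by their SHA-256 digests.
public class FileDigest {

    // Streams the file through a 128K buffer and returns the digest as lowercase hex.
    static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        ByteBuffer buff = ByteBuffer.allocate(128 * 1024);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buff) != -1) {
                buff.flip();
                digest.update(buff); // consumes everything between position and limit
                buff.clear();
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("FileDigest [original file] [reassembled file]");
            return;
        }
        String original = sha256(Paths.get(args[0]));
        String rebuilt = sha256(Paths.get(args[1]));
        System.out.println(original.equals(rebuilt) ? "Files match" : "Files differ");
    }
}
```

Running it against the original ISO and the output of the integrator should print "Files match" if the round trip preserved every byte.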
The amount of code needed for the splitter and the integrator is minuscule and quite straightforward. First the splitter:
package net.ahlawat.file;
import java.io.*;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
/**
* Program that splits the file
* User: Pranay Ahlawat
* Date: Jan 18, 2010
* Time: 8:14:03 PM
*/
public class Splitter {
static long BYTE_TO_MB = 1024 * 1024;
static long BUFER_SIZE = 128 * 1024;
public static void main(String[] args) throws Exception {
if (args.length < 3) {
System.out.println("splitter [fileName] [split size in MB] [out dir]");
System.exit(1);
}
//create the local variables to be used in the rest of the application
File inFile = new File(args[0]);
long partitionSize = Long.parseLong(args[1]) * BYTE_TO_MB;
File outDir = new File(args[2]);
//total size of the input file
final long totalFileSize = inFile.length();
//create the out dirs if they don't exist
if (!outDir.exists()) {
System.out.println("Creating directory : " + outDir.getName());
outDir.mkdirs();
}
FileChannel inChannel = new FileInputStream(inFile).getChannel();
long currentPosition = 0;
int ctr = 0;
ByteBuffer buff = ByteBuffer.allocate((int)BUFER_SIZE);
long start = System.currentTimeMillis();
while(currentPosition < totalFileSize) {
//get the out channel for the file - roughly is the "originalFileName.ext.n" where 'n' is the partition number
FileChannel outChannel = getChannel(inFile, outDir, ++ctr); //init the out channel
//the size of the nth partition
long size = currentPosition + partitionSize < totalFileSize? partitionSize : totalFileSize - currentPosition;
//print progress for this partition
System.out.print(String.format("Creating part %s of size %s MB", ctr, size/BYTE_TO_MB));
long start2 = System.currentTimeMillis();
//the end position of the nth partition w.r.t the entire file
long endPosition = currentPosition + size;
//write partition in BUFFER_SIZE chunks
while(currentPosition < endPosition) {
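//size of this chunk - never read past the end of the partition
long subSize = Math.min(BUFER_SIZE, endPosition - currentPosition);
buff.limit((int) subSize);
//read the chunk into the buffer - a read may return fewer bytes than requested
int bytesRead = inChannel.read(buff, currentPosition);
if (bytesRead < 0) {
throw new EOFException("Unexpected end of file at position " + currentPosition);
}
//prepare for writing
buff.flip();
//write - loop because a single write() may not drain the buffer
while (buff.hasRemaining()) {
outChannel.write(buff);
}
//advance by what was actually read
currentPosition += bytesRead;
//clear the buffer - so we can read into it again
buff.clear();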
}
outChannel.close(); //close
//print throughput for this file partition
double delta = (double)(System.currentTimeMillis() - start2)/1000;
System.out.println(String.format(" -> Transferred in %.2f s @ %.2f MB/s", delta,
(double) size/BYTE_TO_MB/delta));
}
//calculate time
double delta = (double)(System.currentTimeMillis() - start)/1000;
//print out the total throughput
System.out.println(String.format("Copied %.2f MB in %.2f s @ %.2f MB/s", (double)totalFileSize/BYTE_TO_MB, delta, (double)totalFileSize/BYTE_TO_MB/delta));
//finally close the channel
inChannel.close();
}
private static FileChannel getChannel(File inFile, File outDir, int ctr) throws FileNotFoundException {
return new FileOutputStream(new File(outDir, (inFile.getName() + "." + ctr))).getChannel();
}
}
There are a couple of things I would like to mention about this code. First, I tried a variety of approaches: memory-mapped buffers (MappedByteBuffer) did not give me good performance, so I reverted to vanilla byte buffers. Next, I tried a variety of buffer sizes. Unsurprisingly, too small a buffer means too many reads, and too large a buffer meant very slow buffer manipulation. Vanilla byte buffers of 128K seemed to be just right and gave me great speed and memory numbers.
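The buffer-size experiment is easy to repeat. Below is a rough sketch of such a sweep, assuming a plain whole-file copy is a fair stand-in for the splitter's inner loop; the class name `BufferSizeSweep` and the list of sizes are made up for illustration. Treat the numbers with caution, since the OS file cache will favor later passes over the first one.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical harness: times the same copy with several heap buffer sizes.
public class BufferSizeSweep {

    // Copies src to dst through a heap ByteBuffer of the given size; returns bytes written.
    static long copy(Path src, Path dst, int bufferSize) throws IOException {
        ByteBuffer buff = ByteBuffer.allocate(bufferSize);
        long copied = 0;
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            while (in.read(buff) != -1) {
                buff.flip();
                while (buff.hasRemaining()) { // a single write() may not drain the buffer
                    copied += out.write(buff);
                }
                buff.clear();
            }
        }
        return copied;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.out.println("BufferSizeSweep [fileName]");
            return;
        }
        Path src = Paths.get(args[0]);
        Path dst = Files.createTempFile("sweep", ".tmp");
        try {
            for (int kb : new int[] {4, 32, 128, 1024, 8192}) {
                long start = System.nanoTime();
                long bytes = copy(src, dst, kb * 1024);
                double seconds = (System.nanoTime() - start) / 1e9;
                System.out.printf("%5d KB buffer: %.2f MB/s%n",
                        kb, bytes / (1024.0 * 1024.0) / seconds);
            }
        } finally {
            Files.deleteIfExists(dst);
        }
    }
}
```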
The file under experiment was the OpenSolaris ISO, about 700 MB in size. Here is the output:
Creating part 1 of size 100 MB -> Transferred in 0.27 s @ 366.30 MB/s
Creating part 2 of size 100 MB -> Transferred in 0.25 s @ 403.23 MB/s
Creating part 3 of size 100 MB -> Transferred in 0.24 s @ 413.22 MB/s
Creating part 4 of size 100 MB -> Transferred in 0.25 s @ 406.50 MB/s
Creating part 5 of size 100 MB -> Transferred in 1.19 s @ 84.32 MB/s
Creating part 6 of size 100 MB -> Transferred in 2.16 s @ 46.38 MB/s
Creating part 7 of size 76 MB -> Transferred in 2.21 s @ 34.85 MB/s
Copied 676.99 MB in 6.69 s @ 101.21 MB/s
Not bad: I could split the file up in under 7 seconds, which is better throughput than what the native tool gives me. The result of this code was that the big file was split into 100 MB chunks (and change).
Next, the integrator:
package net.ahlawat.file;
import java.io.File;
import java.io.FileOutputStream;
import java.io.FileInputStream;
import java.nio.channels.FileChannel;
import java.nio.ByteBuffer;
import static net.ahlawat.file.Splitter.*;
/**
* Integrator - integrate files
* User: Pranay Ahlawat
* Date: Jan 18, 2010
* Time: 10:51:43 PM
*/
public class Integrator {
public static void main(String[] args) throws Exception {
if (args.length < 3) {
System.out.println("integrator [fileName] [dir] [out file name]");
System.exit(1);
}
//create core variables
File dir = new File(args[1]);
String baseFileName = args[0];
File outFile = new File(args[2]);
//create the out channel - to which the data will be written
FileChannel outChannel = new FileOutputStream(outFile).getChannel();
//core buffer
ByteBuffer buff = ByteBuffer.allocate((int)BUFER_SIZE);
int ctr = 0;
long start = System.currentTimeMillis();
while(true) {
//some profiling
long start2 = System.currentTimeMillis();
//create the file and test to see if it's there
File file = new File(dir, String.format("%s.%s", baseFileName, ++ctr));
if (!file.exists()) { //no the file 'n' does not exist - integration complete
break;
}
System.out.print(String.format("Integrating %s", file.getName()));
//create the in channel for the partitioned file 'n'
FileChannel inChannel = new FileInputStream(file).getChannel();
long currentPosition = 0;
long fileSize = file.length();
//read the file in chunks of BUFFER_SIZE
while(currentPosition < fileSize) {
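long chunkSize = Math.min(BUFER_SIZE, fileSize - currentPosition);
buff.limit((int) chunkSize);
//a read may return fewer bytes than requested, so advance by the actual count
int bytesRead = inChannel.read(buff, currentPosition);
if (bytesRead < 0) break; //part file is shorter than expected
currentPosition += bytesRead;
buff.flip(); //flip the buffer we are ready to write
while (buff.hasRemaining()) { //a single write() may not drain the buffer
outChannel.write(buff);
}
buff.clear(); //clear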
}
//close/flush the information
inChannel.close();
//print profiling information
double delta = (double) (System.currentTimeMillis() - start2)/1000;
System.out.println(String.format(" -> Integration complete in %.2f s @ %.2f MB/s",
delta, file.length()/BYTE_TO_MB/delta));
}
outChannel.close();
double delta = (double) (System.currentTimeMillis() - start)/1000;
System.out.println(String.format("Integration complete in %.2f @ %.2f MB/s", delta, outFile.length()/BYTE_TO_MB/delta));
}
}
Again, I tried outChannel.transferFrom(), but it just blew up: the performance was horrible. The best results were when I used vanilla buffers and manipulated them myself.
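The transferFrom() attempt itself is not shown in the post. For comparison, a variant along those lines might look roughly like the sketch below. The class name `TransferJoin` and its helper methods are hypothetical, and how it performs will depend on the OS and JVM, so it is worth timing on your own machine before drawing conclusions.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical alternative integrator built on FileChannel.transferFrom().
public class TransferJoin {

    // Appends the whole part file at the output channel's current position.
    static long append(FileChannel out, Path part) throws IOException {
        try (FileChannel in = FileChannel.open(part, StandardOpenOption.READ)) {
            long size = in.size();
            long done = 0;
            while (done < size) {
                // transferFrom() does not move out's position, so it is tracked by hand
                long n = out.transferFrom(in, out.position(), size - done);
                if (n <= 0) {
                    break; // nothing more could be transferred
                }
                out.position(out.position() + n);
                done += n;
            }
            return done;
        }
    }

    // Joins baseName.1, baseName.2, ... from dir into outFile; returns total bytes written.
    static long join(String baseName, Path dir, Path outFile) throws IOException {
        long total = 0;
        try (FileChannel out = FileChannel.open(outFile, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int ctr = 1; ; ctr++) {
                Path part = dir.resolve(baseName + "." + ctr);
                if (!Files.exists(part)) {
                    break; // part 'n' does not exist - integration complete
                }
                total += append(out, part);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.out.println("TransferJoin [fileName] [dir] [out file name]");
            return;
        }
        long bytes = join(args[0], Paths.get(args[1]), Paths.get(args[2]));
        System.out.println("Joined " + bytes + " bytes");
    }
}
```

It takes the same three arguments as the buffer-based integrator, so the two can be timed against the same set of part files.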
Here are the results for the buffer-based integrator:
Integrating osol.iso.1 -> Integration complete in 0.35 s @ 283.29 MB/s
Integrating osol.iso.2 -> Integration complete in 0.26 s @ 378.79 MB/s
Integrating osol.iso.3 -> Integration complete in 0.25 s @ 393.70 MB/s
Integrating osol.iso.4 -> Integration complete in 0.28 s @ 361.01 MB/s
Integrating osol.iso.5 -> Integration complete in 1.68 s @ 59.56 MB/s
Integrating osol.iso.6 -> Integration complete in 2.10 s @ 47.55 MB/s
Integrating osol.iso.7 -> Integration complete in 1.39 s @ 54.87 MB/s
Integration complete in 6.44 @ 104.97 MB/s
Not bad at all. Just to put how fast this is in perspective: using Cygwin, just copying the same ~700 MB file takes about 16 seconds.
deepti@aanyalaptop /cygdrive/c/test
$ time cp osol.iso cp_of_osol.iso
real 0m16.014s
user 0m0.031s
sys 0m1.825s
deepti@aanyalaptop /cygdrive/c/test
$
I also wrote this little batch script to measure the throughput of the native Windows command line.
prompt $d $t $_$P$G
copy osol.iso another_cp.iso
prompt $d $t $_$P$G
Here is the output.
C:\test>prompt $d $t $_$P$G
Tue 01/19/2010 1:07:48.86
C:\test>copy osol.iso another_cp.iso
1 file(s) copied.
Tue 01/19/2010 1:08:00.32
C:\test>prompt $d $t $_$P$G
Tue 01/19/2010 1:08:00.32
C:\test>
Which is approximately 12 seconds…
Java NIO rocks.
I will package this up with a UI and make it available as a tool on ahlawat.net soon for all interested.