CSc 460 Program #1: Creating and Interpolation–Searching a Binary File solved


Category: You will receive a download link of the .ZIP file upon Payment


5/5 - (1 vote)

Overview: In the not–to–distant future, you will be writing a program to create an index on a binary file. I
could just give you the binary file, or merge the two assignments, but creating the binary file from a formatted
text file makes for a nice “shake off the rust” assignment, plus it provides a gentle (?) introduction to binary
file processing.
A basic binary file contains information in the same format in which the information is held in memory. (In
a standard text file, all information is stored as ASCII or UNICODE characters.) As a result, binary files
are generally faster and easier for a program to read and write than are text files. When your data is wellstructured, doesn’t need to be read directly by people and doesn’t need to be ported to a different type of
system, binary files are usually the best way to store information.
For this program, we have lines of data items that we need to store, and each line’s items need to be stored
as a group (in a database, such a group of fields is called a record). Making this happen in Java requires a bit
of effort; Java wasn’t originally designed for this sort of task. By comparison, “systems” languages like C can
interact more directly with the operating system to provide more convenient file I/O. On the class web page
you’ll find a sample Java binary file I/O program, which should help get you started.
Assignment: To discourage procrastination, this assignment is in two parts, Part A and Part B:
Part A Available from lectura is a file named jan2018ontime.csv (see also the Data section, below). This
is a CSV (Comma–Separated Value) text file consisting of 570,119 lines of 19 values each about airline flights
in January of 2018.
Using Java 1.8, write a program named that creates a binary version of the text file’s content
that is sorted (using a stable sort) in ascending order by the fifth field of each record (the ‘FL NUM’ field).
Additional details:
• For an input file named file.csv, name the binary file file.bin. (That is, keep the file name, but
change the extension.) Don’t put a path on it; just create it in the current directory.
• Field types are limited to int , double, and String. Follow the lead of the data. For example, if values
of a field are in quotes, assume that the field is a string (and don’t store the quote characters).
• For each column, all values must consume the same quantity of bytes. This is easy for numeric columns,
but for alphanumeric columns we don’t want to waste storage by choosing a huge maximum length.
Instead, you need to determine the number of characters in each field’s longest value, and use that as
the length of each value in that field. This must be done for each execution of the program. (Why?
Your program needs to work on other flight data files, too. You may assume that the field order, types,
and quantity will not change.) When necessary, pad strings on the right with spaces to reach the needed
length(s). (For example, “abc “, where ” ” represents the space character.
• Because the maximum lengths of the text fields can be different for different input files, you will need to
store these lengths in the binary file so that your Part B program can access them. One possibility is to
store the maximum lengths at the end of the binary file (after the last data record).
Repeating the info at the top of this handout: Part A is due in just one week; start today!
Part B Write another complete, well-documented Java 1.8 program named that performs both
of the following tasks, in this order:
1. Reads and prints to the screen the UNIQUE CARRIER, FL NUM, ORIGIN, and DEST field values of
the first five records of data, the middle three records (or middle four records, if the quantity of records
is even), and the last five records of data in the binary file. Conclude the output with the total number
of records in the binary file, on a new line.
2. Using one or more FL NUM values given by the user, locates within the binary file (using interpolation
search; see below), and displays to the screen, the same four field values of all records having the given
FL NUM value.
Some output details:
• Display the data one record per line, with a single space between the field values.
• seek() the Java API and ye shall find a method that will help you read the middle and last records of
the binary file.
• Use a loop to prompt the user for the FL NUM values. Terminate the program when a zero is entered.
Data: Write your programs to accept the complete data file pathname as a command line argument (for
Prog1A, that will be the pathname of the data file, and for Prog1B, the pathname of binary file created
by Prog1A). The complete lectura pathname of our data file is /home/cs460/fall18/jan2018ontime.csv.
Prog1A (when running on lectura, of course) can read the file directly from that directory; just provide that
path when you run your program. (That is, there’s no reason to waste disk space by making a copy of the file
in your account.)
Each of the lines in the file contains 19 fields (columns) of information. Here is one of them:
Some file information:
• The field names can be found in the first line of the file. Because that line contains metadata, its content
must not be stored in the binary file.
• In some records, such as this sample, some field values are missing (see the adjacent commas?). However,
the binary file requires values to maintain the common record length. For missing numeric fields, store
-1. For missing strings, store a string containing the appropriate quantity of spaces.
• Please keep in mind that we have not combed through the data to see that it is all formatted perfectly.
This is completely intentional. Corrupt data is a huge problem in data management. We hope that this
file holds a few surprises. We want you to think about how to deal with malformed data, and to ask
questions of us as necessary.
Output: Output details for each program are stated in the Assignment section, above. Please ask (preferably
on Piazza) if you need additional details.
Hand In: You are required to submit your completed program files using the turnin facility on lectura. The
submission folder is cs460p1. Instructions are available from the document of submission instructions linked
to the class web page. In particular, because we will be grading your program on lectura, it needs to run
on lectura, so be sure to test it on lectura. Feel free to split up your code over additional files if doing so is
appropriate to achieve acceptable code modularity. Submit all files as-is, without packaging or compression.
Want to Learn More?
• — The provided sample data was downloaded from the U.S. Department of Transportation’s Bureau of Transportation Statistics web site. Other
fields are available. If you’re curious: Click the “On-Time” drop down, and choose “Airline On-Time
Performance Databases”. On the next page, click “Download”. On the following page, select the desired
fields and click “Download” to create CSV.
Other Requirements and Hints:
• Don’t “hard–code” values in your program if you can avoid it. For example, don’t assume a certain
number of records in the input file or the binary file. Your program should automatically adapt to simple
changes, such as more or fewer lines in a file, or changes to the file names and paths. For example, we
may test your program with a file of just two records, or even no records. We expect that your program
will handle such situations gracefully.
• Once in a while, a student will think that “create a binary file” means “convert all the data into the
characters ‘0’ and ‘1’.” Don’t do that! The binary I/O functions in Java will read/write the data in
binary format automatically. See the example program on the class web page.
• Try this: Comment your code according to the style guidelines as you write the code (not an hour before
the due date and time!). Explaining in words what your code must accomplish before you write that
code is likely to result in better code. The requirements and some examples are available from:
• You can make debugging easier by using only a few lines of data from the data file for your initial testing.
Try running the program on the complete file only when you can process a few reduced data files.
• Late days can be used on each part of the assignment, if necessary, but we are limiting you to at most
two late days on Part A. For example, you could burn one late day by turning in Part A 18 hours late,
and three more by turning in Part B two and a half days late. Of course, it’s best if you don’t use any.
• Finally: Start early! File processing can be tricky.
Interpolation Search
Interpolation Search is an enhanced binary search. To be most effective, these conditions must exist: (1) The
data is stored in a direct-access data structure (such as a binary file of uniformly-sized records), (2) the data
is in sorted order by the search key, (3) the data is uniformly distributed, and (4) there’s a lot of data. In such
situations, the reduction in quantity of probes over binary search is likely to be particularly beneficial given
the inherent delay that exists in file accesses. Our data falls short on (3) and (4), but that’s OK; the search
will still work.
Interpolation Search is just like binary search, with one change: Instead of probing the data at the midpoint
(one-half of low index plus high index), we use the following probe index calculation:
probe index = low index + 
target − key[low index]
key[high index] − key[low index] · (high index − low index)
For example, consider a binary file of 60,000 records (indices 0 through 59,999), with keys that range from
100 through 150,000, and a target of 125,000. Thus, low index = 0, high index = 59999, key[low index]
= 100, key[high index] = 150000, and target = 125000. Our first probe into the file would be into record
number 0 + l
150000−100 · (59999 − 0)m
= 49993.