Description
Write a Python program that delivers the requested results. Include appropriate documentation:
comments and docstrings.
1. HTML background
HTML (Hyper-Text Mark-up Language) files contain plain text which can be interpreted by
trained humans, by web browsers, and by analytical software. For more information about
HTML, please consult some of the resources listed at the end of this document. For simplicity,
we will categorize HTML content into
– mark-up (also called tags, or simply HTML), and
– marked-up text.
The mark-up and marked-up text which we will process in this assignment are
– headers and
– tables.
Header tags include numbered openers, such as “<h1>”, “<h2>”, and “<h3>”, along with their
corresponding closers: “</h1>”, “</h2>”, “</h3>”, etc. Table tags include “<table>”, “<tr>”,
and “<td>”, again with corresponding closers. The tag letters can be in uppercase or
lowercase, or even mixed case. The case of the closer should match the case of the opener.
(This rule was not in effect in early versions of HTML, by the way. Note also that some HTML
coders are sloppy and sometimes omit a closer entirely.)
Headers in HTML are fairly straightforward, and you might want to work on them first to build
your confidence. Tables can become quite complicated. For instance, they can have header
rows, footer rows, …
| 1666 | London  |
| 1871 | Chicago |
| 1904 | Toronto |
Run the generated program, and we should see…
Table of Important Fires
1666 London
1871 Chicago
1904 Toronto
Your program must work with simple HTML files, and some samples for testing will be
provided. You are encouraged to try to get your program to work on real-world examples.
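As a rough illustration of the example above, the sketch below pulls rows and cells out of a
small table with regular expressions and formats them one row per line. The sample markup is
an assumption (the real test files may be laid out differently), and the signatures chosen
here for extract_rows(), extract_cells(), and display_table() are only one possibility.

```python
import re

# Hypothetical sample markup; the real test files may be laid out differently.
SAMPLE_HTML = """
<h1>Table of Important Fires</h1>
<TABLE>
  <tr><td>1666</td><td>London</td></tr>
  <tr><td>1871</td><td>Chicago</td></tr>
  <tr><td>1904</td><td>Toronto</td></tr>
</TABLE>
"""

# \b stops "<tr" from matching longer tag names such as "<track>",
# and "<th" from matching "<thead>".
ROW_PATTERN = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL_PATTERN = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)


def extract_rows(html):
    """Return the inner markup of every table row found in html."""
    return ROW_PATTERN.findall(html)


def extract_cells(row):
    """Return the stripped text of every cell in one row."""
    return [cell.strip() for cell in CELL_PATTERN.findall(row)]


def display_table(heading, table):
    """Return the heading followed by one space-separated line per row."""
    lines = [heading] + [" ".join(row) for row in table]
    return "\n".join(lines)


table = [extract_cells(row) for row in extract_rows(SAMPLE_HTML)]
print(display_table("Table of Important Fires", table))
```

This sketch assumes that every row and cell has a closer. Files with omitted closers or
nested tables would need a more careful approach.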
INF1340H: Programming for Information Systems, Section 0101
8. tasks along the critical path
Please make sure (as soon as possible) that you are able to
– open and read a text file
– obtain a string from the command line
– download an HTML file from a server
– write (output) a text file
– write a .py file from a program and then run it
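The first four tasks in this list can be tried out with a few short functions, as in the
sketch below. Only get_HTML_lines() is a prescribed name. The other function names, the
signatures, and the assumption that files are encoded as UTF-8 are illustrative.

```python
import urllib.request


def get_url_from_args(argv):
    """Return the first command-line argument, or None if it is missing."""
    return argv[1] if len(argv) > 1 else None


def download_html(url):
    """Download url and return its contents as text (UTF-8 is assumed)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def write_text_file(filename, text):
    """Write text to filename, replacing any existing file."""
    with open(filename, "w", encoding="utf-8") as output_file:
        output_file.write(text)


def get_HTML_lines(filename):
    """Open and read a text file, returning its lines without line endings."""
    with open(filename, encoding="utf-8") as input_file:
        return input_file.read().splitlines()
```

A typical sequence would pass sys.argv to get_url_from_args(), hand the result to
download_html(), save the text with write_text_file(), and read it back with
get_HTML_lines().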
Your study and preparation for this assignment should include
– HTML basics
– HTML table basics
– regular expressions — because you might decide to use them
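If you do decide to use regular expressions, a backreference is one way to check that a
closer repeats its opener exactly, including the case of the letter. The sketch below is a
minimal example of that idea. The pattern and the signature of extract_heading() are
assumptions, not a required design.

```python
import re

# [hH] accepts either case for the opener; the backreference \1 then
# requires the closer to repeat exactly the same letter and digit.
# [^>]* allows for attributes inside the opener.
HEADER_PATTERN = re.compile(r"<([hH][1-6])\b[^>]*>(.*?)</\1>", re.DOTALL)


def extract_heading(html):
    """Return the text of the first well-formed header, or None."""
    match = HEADER_PATTERN.search(html)
    return match.group(2).strip() if match else None


print(extract_heading("<H2>Table of Important Fires</H2>"))  # Table of Important Fires
print(extract_heading("<h2>mismatched closer</H2>"))         # None
```

A header whose closer is missing or in a different case is skipped by this pattern. Whether
such headers should be rejected or repaired is a design decision for your program.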
9. more details
Numbered example HTML files can be found in qsand.com/1340/, e.g.
http://qsand.com/1340/example1.html
The minimum requirement for this assignment is to be able to process the first two examples
(example1.html and example2.html) properly. Those files are relatively simple and the
formatting is quite nice.
The other examples are not as nice, and they may be attempted if you have enough time. By
successfully processing higher-numbered examples, you can earn bonus points which could
offset deficiencies (if any) in your program.
Submit your work in a single file and call it a2HTML.py
Identify the starting point by using if __name__ == "__main__":
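A minimal sketch of such a starting point follows. The function name main(), its return
values, and the usage message are assumptions made for illustration. The real work of the
program would replace the placeholder line.

```python
import sys


def main(argv):
    """Run the program; argv is the full argument list, as in sys.argv."""
    if len(argv) < 2:
        print("usage: python a2HTML.py URL")
        return 1
    url = argv[1]
    print("would process", url)  # placeholder for the real work
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the work inside a function means the file can be imported for testing without
running anything.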
resources
https://www.w3.org/ The W3C sets standards for XML and has also published standards for HTML
Quercus/inf1340/Files/assignment for clarifications (Check at least once before you submit.)
to facilitate marking:
– Use the Quercus assignment dropbox, and if you worked as a team, submit only one copy per
pair.
– Provide your name and your partner’s name in a comment at the top of the program. Put your
name on line 4, and your partner’s name on line 5 (numbering lines from 1). Also, provide your
section number (L101 for Tuesday, L102 for Thursday).
– Give your program a title (also in a comment at the top), and indicate your revision number and
the date of the revision. Your program should be compatible with Python 3.5, and it is suggested
that you indicate version compatibility in a comment, too — i.e., which version(s) of Python will
your program run on?
– If you add extra features to your program, please ensure that they do not prevent smooth,
simple operation. Try not to confuse the markers. Do not ask the user any questions at
runtime. (Hint: If you need additional information from the user, use optional command-line
arguments.)
– If you know, at submission time, that your program does not run successfully, please make a
note of the known deficiencies in a comment at the top.
– Describe appropriate test cases in your docstring.
– Use prescribed names for constants, variables, and functions. Please draw upon this list as
needed:
# strings containing Python assignment statements:
HEADING_assignment
TABLE_assignment
get_HTML_lines()
extract_heading()
extract_table()
extract_rows()
extract_cells()
write_output_py_file()
display_table()
– If your program is not easy to read, add comments, whitespace, etc., as appropriate. If your
functions are long, rewrite them so that they are shorter, or split the code out into additional
functions. (Note that if you use programming tricks to shorten the code, you should add
comments to explain the tricks.)
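To show how some of the prescribed names might fit together, here is a minimal sketch that
writes a small .py file and then runs it. Treating HEADING_assignment and TABLE_assignment
as format templates, the names HEADING and TABLE inside the generated file, the helper
run_py_file(), and the signature of write_output_py_file() are all assumptions.

```python
import os
import subprocess
import sys
import tempfile

# Strings containing Python assignment statements. The {!r} placeholder is
# filled with the repr of the extracted data, which is a valid Python literal.
HEADING_assignment = "HEADING = {!r}\n"
TABLE_assignment = "TABLE = {!r}\n"

# Code appended to the generated file so that it prints its own data.
DISPLAY_CODE = (
    "print(HEADING)\n"
    "for row in TABLE:\n"
    "    print(' '.join(row))\n"
)


def write_output_py_file(filename, heading, table):
    """Write a small program that holds the extracted data and prints it."""
    with open(filename, "w", encoding="utf-8") as output_file:
        output_file.write(HEADING_assignment.format(heading))
        output_file.write(TABLE_assignment.format(table))
        output_file.write(DISPLAY_CODE)


def run_py_file(filename):
    """Run filename with the current interpreter and return what it prints."""
    return subprocess.check_output([sys.executable, filename]).decode("utf-8")


generated = os.path.join(tempfile.mkdtemp(), "generated.py")
write_output_py_file(generated, "Table of Important Fires",
                     [["1666", "London"], ["1871", "Chicago"]])
print(run_py_file(generated), end="")
```

Using sys.executable runs the generated file with the same Python version as the program
that wrote it.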
example of first lines in your .py file:
#
# inf1340, section L101
# assignment 2 – due 2018-11-12
# Steed, John << Names on lines 4 and 5
# Peel, Emma << Blank line if no partner
#
# HTML table extraction program
# v1.0.3 - 2018-11-11
# compatible with Python versions 3.4 - 3.7
# source file: a2HTML.py
#
scoring: (out of 100 possible)
declaration and use of constants as prescribed     5 marks
headers, docstrings, and bodies of functions:
  to obtain URL (and filename) from command line  10 marks
  to download HTML file for analysis              10 marks
  to parse the HTML file                          20 marks
  to write a context-specific .py file            15 marks
  to format and display the extracted data        10 marks
other code, including __main__                     5 marks
coding style and comments                         10 marks
Program produces correct output.                  10 marks
facilitation of marking (as described earlier)     5 marks
Don't be late! There is a 15% per day (or part of day) penalty for late submission.