Commit 9e26f03

committed
2016-01-20
1 parent aaa796d commit 9e26f03

20 files changed

Lines changed: 429 additions & 153 deletions

.gitignore

Lines changed: 4 additions & 2 deletions
@@ -4,13 +4,15 @@
 
 # File
 *.pyc
+*.log
+*.csv
+*.txt
+*.sqlite3
 
 # Directory
 build
 dist
 uszipcode.egg-info
-dataset
-prepare_dataset
 
 # =========================
 # Windows image file caches
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
+The data is originally from http://federalgovernmentzipcodes.us/, last updated on ``2012-01-22``.
+
+About the data fields:
+
+- Updated often, but not regularly.
+- 81,831 rows of data.
+- All 41,891 active zipcodes + 634 decommissioned zipcodes from the recent past.
+- All 80,673 active Primary (41,885), Acceptable (13,988), and Not Acceptable (24,800) placenames. Some additional placenames for decommissioned codes.
+- 29,971 Standard, 9,465 PO BOX, 2,437 Unique, and 649 Military codes.
+- 50 States, plus:
+- 361 AA Military - Americas
+- 38 AE Military - Europe
+- 164 AP Military - Pacific
+- 1 AS American Samoa
+- 290 DC Washington DC
+- 4 FM Federated States of Micronesia
+- 13 GU Guam
+- 2 MH Marshall Islands
+- 3 MP Northern Mariana Islands
+- 176 PR Puerto Rico
+- 2 PW Palau
+- 16 VI Virgin Islands
+
+Sources:
+
+- Current zipcodes, placenames, zipcode type (Standard, PO, Unique, Military), placename type (Primary, Acceptable, Not Acceptable): USPS
+- Military placenames (base or ship name): MPSA 2008 Election Ballot information
+- Tax returns filed, estimated population, total wages: IRS 2008
+- Latitude and longitude: National Weather Service, supplemented by Google Earth and Maps and occasionally other sources
+- Decommissioned zipcodes: our old database (usually quality sources, but not verifiable).
+
+Other Sources of zipcode information:
+
+- Placenames (cities, towns, geographic features) can be found in the US Geological Survey GNIS Dataset
+- The IRS has additional data fields for 2008 and is reviewing their publication procedures for later years. See http://www.irs.gov/taxstats/indtaxstats/article/0,,id=96947,00.html
+- The Census publishes data, but it uses Zipcode Tabulation Areas (ZCTAs), which 1) changed areas between the 2000 census and the 2010 census and 2) do not map well to USPS zipcodes. If needed, see http://www.census.gov/geo/ZCTA/zcta.html
+- Social Security recipients by zipcode: http://www.ssa.gov/policy/docs/statcomps/oasdi_zip/
+- For economic researchers and those who want tons of background on data sources by zipcode: University of Missouri OSEDA project
+
+Appendix:
+
+zipcode wiki: http://en.wikipedia.org/wiki/ZIP_code
+zipcode FAQ: http://www.zipboundary.com/zipcode_faqs.html
3.06 MB
Binary file not shown.

dataset/geocoded-data/about.rst

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+This is geocoded data for all addresses in free-zipcode-database-Primary.csv.
+Updated on 2016-01-20

dataset/step1_geocoding.py

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+Extract federalgovernmentzipcodes.zip and zcta2010.zip, find
+free-zipcode-database-Primary.csv and zcta2010.csv, and put them next to this script.
+
+Then run step1, step2, and step3 to get the zipcode sqlite database for
+uszipcode 0.0.8.
+"""
+
+from __future__ import print_function
+from geomate.tests import GOOGLE_API_KEYS
+from pprint import pprint as ppt
+import pandas as pd
+import geomate
+
+df = pd.read_csv("free-zipcode-database-Primary.csv", dtype={"Zipcode": str})
+todo = df["Zipcode"].tolist() # 42522 zipcodes
+
+googlegeocoder = geomate.GoogleGeocoder(api_keys=GOOGLE_API_KEYS)
+googlegeocoder.set_sleeptime(0.1)
+batch = geomate.BatchGeocoder(googlegeocoder, db_file="geocode.sqlite3")
+batch.process_this(todo, shuffle=True)
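
The batch step above depends on `geomate`, whose internals are not part of this commit. As a rough illustration of the pattern the script appears to follow (geocode each pending zipcode, pause between requests, and persist results so an interrupted run can resume), here is a minimal standard-library sketch. The `process_batch` function, the `geocode` callable, and the `geo_result(key, json)` column layout are assumptions made for this example and are not geomate's API.

```python
import json
import random
import sqlite3
import time


def process_batch(conn, todo, geocode, sleeptime=0.0, shuffle=True):
    """Geocode every item in `todo` that is not already stored in `conn`.

    `geocode` is a hypothetical callable: zipcode -> dict or None.
    Returns the number of items that were attempted in this run.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS geo_result (key TEXT PRIMARY KEY, json TEXT)"
    )
    # Skip anything a previous run already stored, so the job is resumable.
    done = {row[0] for row in conn.execute("SELECT key FROM geo_result")}
    pending = [item for item in todo if item not in done]
    if shuffle:
        random.shuffle(pending)
    for item in pending:
        result = geocode(item)
        if result is not None:
            conn.execute(
                "INSERT OR REPLACE INTO geo_result VALUES (?, ?)",
                (item, json.dumps(result)),
            )
            conn.commit()  # commit per item so progress survives a crash
        time.sleep(sleeptime)  # stay under the service's rate limit
    return len(pending)
```

Calling `process_batch` a second time with the same connection only attempts the zipcodes that were not stored by the first call.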

dataset/step2_construct_dataset.py

Lines changed: 116 additions & 0 deletions
@@ -0,0 +1,116 @@
+#!/usr/bin/env python
+# -*- coding: utf-8 -*-
+
+"""
+This script constructs a zipcode sqlite database for the uszipcode extension.
+"""
+
+from __future__ import print_function
+from pprint import pprint as ppt
+from sqlite4dummy import *
+import pandas as pd
+import json
+
+def titleize(text):
+    """Capitalizes all the words and replaces some characters in the string
+    to create a nicer looking title.
+    """
+    if len(text) == 0: # if empty string, return it
+        return text
+    else:
+        text = text.lower() # lower all char
+        # delete redundant empty space
+        chunks = [chunk[0].upper() + chunk[1:] for chunk in text.split(" ") if len(chunk) >= 1]
+        return " ".join(chunks)
+
+engine = Sqlite3Engine("geocode.sqlite3")
+metadata = MetaData()
+metadata.reflect(engine)
+geo_result = metadata.get_table("geo_result")
+
+# --- google geocoding result ---
+geocode_data = dict()
+for zipcode, json_text in engine.select(Select(geo_result.all)):
+    try:
+        zipcode = json.loads(zipcode)
+        json_dict = json.loads(json_text)
+        northeastbound_lat = json_dict["geometry"]["bounds"]["northeast"]["lat"]
+        northeastbound_lng = json_dict["geometry"]["bounds"]["northeast"]["lng"]
+        southwestbound_lat = json_dict["geometry"]["bounds"]["southwest"]["lat"]
+        southwestbound_lng = json_dict["geometry"]["bounds"]["southwest"]["lng"]
+        lat_google = json_dict["geometry"]["location"]["lat"]
+        lng_google = json_dict["geometry"]["location"]["lng"]
+        geocode_data[zipcode] = [northeastbound_lat, northeastbound_lng,
+                                 southwestbound_lat, southwestbound_lng,
+                                 lat_google, lng_google]
+    except Exception as e:
+        pass
+print("Got %s zipcode google geocoded." % len(geocode_data))
+
+# --- primary zipcode data ---
+primary_zipcode_data = dict()
+df = pd.read_csv("free-zipcode-database-Primary.csv", dtype={"Zipcode": str})
+for record in df.values:
+    (
+        Zipcode, ZipCodeType, City, State, LocationType, Lat, Long, Location,
+        Decommisioned, TaxReturnsFiled, EstimatedPopulation, TotalWages,
+    ) = record
+    primary_zipcode_data[Zipcode] = [
+        ZipCodeType, City, State, LocationType, Lat, Long, Location,
+        Decommisioned, TaxReturnsFiled, EstimatedPopulation, TotalWages,
+    ]
+print("Got %s zipcode primary zipcode." % len(primary_zipcode_data))
+
+# --- read zcta data, construct uszipcode data---
+uszipcode_data = list()
+df = pd.read_csv("zcta2010.csv", dtype={"ZCTA5": str})
+for record in df.values:
+    (
+        ZCTA5, LANDSQMT, WATERSQMT, LANDSQMI, WATERSQMI,
+        POPULATION, HSGUNITS, INTPTLAT, INTPTLON,
+    ) = record
+    if (ZCTA5 in geocode_data) and (ZCTA5 in primary_zipcode_data):
+        (
+            northeastbound_lat, northeastbound_lng,
+            southwestbound_lat, southwestbound_lng,
+            lat_google, lng_google,
+        ) = geocode_data[ZCTA5]
+
+        (
+            ZipCodeType, City, State, LocationType, Lat, Long, Location,
+            Decommisioned, TaxReturnsFiled, EstimatedPopulation, TotalWages,
+        ) = primary_zipcode_data[ZCTA5]
+
+        # calculate derived field
+        try:
+            Density = POPULATION / LANDSQMI
+        except Exception:
+            Density = None
+        try:
+            Wealthy = TotalWages / POPULATION
+        except Exception:
+            Wealthy = None
+
+        if lat_google:
+            Lat = lat_google
+        if lng_google:
+            Long = lng_google
+
+        ZipCodeType = titleize(ZipCodeType)
+        City = titleize(City)
+
+        uszipcode_data.append([
+            ZCTA5, ZipCodeType, City, State, POPULATION, Density,
+            TotalWages, Wealthy, HSGUNITS, LANDSQMI, WATERSQMI,
+            Lat, Long,
+            northeastbound_lat, northeastbound_lng, southwestbound_lat, southwestbound_lng,
+        ])
+print("Got %s zipcode for uszipcode database." % len(uszipcode_data))
+
+uszipcode_data = pd.DataFrame(uszipcode_data, columns=[
+    "Zipcode", "ZipcodeType", "City", "State", "Population", "Density",
+    "TotalWages", "Wealthy", "HouseOfUnits", "LandArea", "WaterArea",
+    "Latitude", "Longitude",
+    "NEBoundLatitude", "NEBoundLongitude", "SWBoundLatitude", "SWBoungLongitude",
+])
+uszipcode_data.to_csv("zipcode.txt", sep="\t", index=False)
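
The derived fields in the script above (population per square mile, and total wages per person) rely on `try`/`except` to handle bad inputs. A missing CSV value that pandas reads as NaN divides without raising, so it would pass through as NaN rather than `None`. The following is a minimal sketch of an explicit guard, assuming the intent is to store `None` whenever the ratio is undefined; `safe_ratio` is a hypothetical helper and does not appear in the commit.

```python
import math


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined.

    Undefined means: either value is None or NaN, or the denominator is zero.
    """
    for value in (numerator, denominator):
        if value is None:
            return None
        if isinstance(value, float) and math.isnan(value):
            return None
    if denominator == 0:
        return None
    return numerator / denominator


# Example values are made up for illustration.
density = safe_ratio(1200.0, 2.5)          # people per square mile
wealthy = safe_ratio(float("nan"), 1200)   # missing wages -> None
```

With a helper like this, the two `try` blocks could be replaced by `safe_ratio(POPULATION, LANDSQMI)` and `safe_ratio(TotalWages, POPULATION)`.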

dataset/step3_makedatabase.py

Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
#!/usr/bin/env python
2+
# -*- coding: utf-8 -*-
3+
4+
from __future__ import print_function
5+
from pprint import pprint as ppt
6+
from sqlite4dummy import *
7+
import pandas as pd
8+
9+
engine = Sqlite3Engine("geocode.sqlite3")
10+
metadata = MetaData()
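
The script above stops after creating the engine and metadata, so the loading step itself is not shown in this commit. One possible way to load the tab-separated `zipcode.txt` written by step 2 into SQLite, using only the standard library rather than `sqlite4dummy`, is sketched below. The `load_tsv` helper, the `zipcode` table name, and the sample rows are assumptions for illustration.

```python
import csv
import io
import sqlite3


def load_tsv(conn, table, tsv_file):
    """Create `table` from the header row of a tab-separated file and insert
    every data row. Returns the number of rows inserted.
    """
    reader = csv.reader(tsv_file, delimiter="\t")
    header = next(reader)
    # Columns are declared without a type, so SQLite stores the text as-is
    # and zipcodes such as "00601" keep their leading zeros.
    columns = ", ".join('"%s"' % name for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE IF NOT EXISTS "%s" (%s)' % (table, columns))
    rows = [row for row in reader if row]
    conn.executemany(
        'INSERT INTO "%s" VALUES (%s)' % (table, placeholders), rows
    )
    conn.commit()
    return len(rows)


# Illustrative sample standing in for zipcode.txt (only three columns shown).
sample = io.StringIO(
    "Zipcode\tCity\tState\n"
    "20001\tWashington\tDC\n"
    "00601\tAdjuntas\tPR\n"
)
conn = sqlite3.connect(":memory:")
inserted = load_tsv(conn, "zipcode", sample)
```

For the real file, `open("zipcode.txt", newline="")` would take the place of the in-memory sample, and numeric columns would likely need explicit types for range queries.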

dataset/zcta2010/about.rst

Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
1+
The data is originally from http://proximityone.com/cen2010_zcta_dp.htm
2+
3+
About the data fields:
4+
5+
- ZCTA5: zipcode
6+
- LANDSQMT: land square meters
7+
- WATERSQMT: water square meters
8+
- LANDSQMI: land square miles
9+
- WATERSQMI: water square miles
10+
- POPULATION: population
11+
- HSGUNITS: housing units
12+
- INTPTLAT: latitude
13+
- INTPTLON: longitude
14+
15+
2010 census data
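
The field list above implies two practical details: `ZCTA5` has to be read as text so that leading zeros survive, and the square-mile columns are a unit conversion of the square-meter columns. A minimal sketch follows, assuming the standard conversion of 1609.344 meters per mile. The helper name and the sample row are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# One square mile in square meters: 1609.344 m per mile, squared.
SQ_METERS_PER_SQ_MILE = 1609.344 ** 2


def sq_meters_to_sq_miles(sq_meters):
    """Convert an area in square meters to square miles."""
    return sq_meters / SQ_METERS_PER_SQ_MILE


# Made-up row using a subset of the columns described above.
sample = io.StringIO(
    "ZCTA5,LANDSQMT,POPULATION\n"
    "00601,2589988.110336,100\n"
)
rows = list(csv.DictReader(sample))

zcta = rows[0]["ZCTA5"]  # csv yields strings, so "00601" is preserved
land_sq_mi = sq_meters_to_sq_miles(float(rows[0]["LANDSQMT"]))
```

The step 2 script gets the same string handling from pandas by passing `dtype={"ZCTA5": str}` to `read_csv`.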

dataset/zcta2010/zcta2010.zip

938 KB
Binary file not shown.

long_description.rst

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,5 @@
11
Welcome to the uszipcode Documentation
2-
====================================================================================================
3-
2+
===================================================================================================
43
``uszipcode`` is the most powerful and easy to use zipcode information searchengine in Python. Besides geometry data (also boundary info), several useful census data points are also served: `population`, `population density`, `total wage`, `average annual wage`, `house of units`, `land area`, `water area`. The geometry and geocoding data I am using is from google map API on Oct 2015. To know more about the data, `click here <http://www.wbh-doc.com.s3.amazonaws.com/uszipcode/uszipcode/data/__init__.html#module-uszipcode.data>`_. `Another pupolar zipcode Python extension <https://pypi.python.org/pypi/zipcode>`_ has lat, lng accuracy issue, which doesn't give me reliable results of searching by coordinate and radius.
54

65
**Highlight**:
@@ -11,7 +10,7 @@ Welcome to the uszipcode Documentation
1110

1211
**Quick links**:
1312

14-
- `Home page <https://github.com/MacHu-GWU/uszipcode-project>`_
13+
- `GitHub Homepage <https://github.com/MacHu-GWU/uszipcode-project>`_
1514
- `Online Documentation <http://www.wbh-doc.com.s3.amazonaws.com/uszipcode/index.html>`_
1615
- `PyPI download <https://pypi.python.org/pypi/uszipcode>`_
1716
- `Install <install_>`_