Data exploration journal

Over the course of about a month, I worked through roughly 20 city datasets covering temperature, tree canopy, population, and more to arrive at the graphs published in my final data project. Many of those datasets were GeoJSON files, meaning they included geographic elements that allowed them to be mapped. That presented both advantages and difficulties, but those GeoJSON files were ultimately the basis of my project.

My final project sought to address the urban heat island effect in Boston, which, due to historical disinvestment, amplifies temperatures in lower-income neighborhoods. I wanted to represent this phenomenon graphically, so the first data I worked with was a set of heat data collected by a team of researchers in July 2019 and published in 2021. Researchers fanned out across the city, recording temperatures at 6 a.m., 3 p.m., and 7 p.m.

As I examined the information, I noticed the city broke down its geographic data into two different forms: small hexagons and census tracts.

It was not immediately clear which would be more helpful, so I decided to visualize the afternoon high temperature data both by hexagons and by census tracts and compare the results. The following depicts my first steps in visualizing temperature data using the hexagon GeoJSON file.

*Color added, and changed afternoon high temp value to average.*

*Updated color range to show more nuance in temp differences.*

*Neighborhood layer added, including labels.*

The hexagons provided an intricately detailed look at the data, one more granular than the census tracts.

*Same process repeated with census tracts.*

Comparing these maps initially, the hexagons were the clear winner. But as I continued to work, I realized that a key piece of my project would be determining temperature averages for each neighborhood. That data was not provided by the city. Instead, I would have to take either the hexagons or the census tracts and determine which specific data points fell within each neighborhood. This changed my thinking. Ultimately, I went with the census tract data because the city’s census tracts fall relatively neatly within the bounds of the neighborhoods. Grouping the hexagons into neighborhoods would have been more laborious, as they didn’t align as neatly with the borders. There is also generally more census tract data available beyond this subject matter (housing, population, etc.), which I felt would better lend itself to our ideas for visualizations.
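For readers curious how the question "which data points fall within each neighborhood" could be answered in code, here is a minimal sketch using the standard ray-casting point-in-polygon test. This is not the method used in this project (the grouping described later was done by eye in Tableau and by hand in Excel). The function names, the idea of testing a tract's representative point such as its centroid, and the square "neighborhoods" in the usage example are all assumptions made for illustration.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: return True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in order. A horizontal ray is
    cast to the right of the point; an odd number of edge crossings
    means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y value can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge meets the horizontal ray.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_neighborhood(point, neighborhoods):
    """Return the name of the first neighborhood containing `point`.

    `neighborhoods` maps a name to a list of boundary vertices.
    Returns None when no polygon contains the point.
    """
    for name, boundary in neighborhoods.items():
        if point_in_polygon(point[0], point[1], boundary):
            return name
    return None


# Hypothetical square boundaries, purely for illustration.
example_neighborhoods = {
    "Neighborhood A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "Neighborhood B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
print(assign_neighborhood((1, 1), example_neighborhoods))
print(assign_neighborhood((3, 1), example_neighborhoods))
```

A test like this treats each tract or hexagon as a single point, so shapes that straddle a border are assigned to one side only. That is part of why shapes that already sit neatly inside the neighborhood boundaries are easier to work with.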

My next step was to begin working with the city’s tree canopy data, which was a big part of the final project. Trees play a huge role in influencing the temperature of a given neighborhood, experts told me. Therefore, there should be a strong correlation between the city’s tree canopy and heat data. I wanted to see that for myself, so I started working with a GeoJSON file that mapped every tree in the city.

To see the correlation, I placed the tree canopy and the afternoon high temperatures by census tract as two separate layers on the same map. After playing around with the opacity, I had a map that depicted the correlation I was hoping for. It was clear that neighborhoods that get hotter than others, like Chinatown, had very little tree cover in almost every case. The opposite was true for cooler neighborhoods like Jamaica Plain.

While I thought the map depicted the correlation between tree canopy and heat pretty clearly, I wanted to make something stronger, something mathematically oriented: a graphic representation of the trend. Doing that would require plotting the temperature and canopy data of each individual census tract. I was also still hoping to calculate the average temperature of each neighborhood. To start, I downloaded the relevant GeoJSON files as CSVs and began cleaning.

The most laborious part of the process involved looking over a map of the census tracts that I created in Tableau and overlaid with the neighborhood boundaries. That allowed me to see which census tracts fell within which neighborhoods. I made lists of each neighborhood and its corresponding census tracts, double-checking my work to make sure I didn’t miss any tracts. I also had to reformat the census tract IDs in Excel, as the original CSV files attached additional digits that made the tracts difficult to decipher.

*Partially cleaned data. Added neighborhood column, and edited census tract column to reflect proper ID #s.*
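The ID cleanup could also be scripted. The sketch below assumes the extra digits were the standard 11-digit Census GEOID layout (2-digit state code, 3-digit county code, 6-digit tract code with two implied decimal places). The journal does not say exactly what the CSVs contained, so that layout, the function name `tract_label`, and the sample IDs are assumptions.

```python
def tract_label(geoid):
    """Convert an 11-digit tract GEOID into a readable tract number.

    Assumed layout: SSCCCTTTTTT, where SS is the state code, CCC the
    county code, and TTTTTT the tract code with two implied decimals.
    """
    geoid = str(geoid).strip()
    if len(geoid) != 11 or not geoid.isdigit():
        raise ValueError(f"expected an 11-digit GEOID, got {geoid!r}")
    tract = geoid[5:]          # drop the state and county prefix
    whole = int(tract[:4])     # leading zeros fall away
    suffix = tract[4:]         # two-digit decimal part
    return str(whole) if suffix == "00" else f"{whole}.{suffix}"


# Sample IDs in the assumed format, for illustration only.
print(tract_label("25025010103"))
print(tract_label("25025000100"))
```

Reading the ID as text rather than as a number matters here: spreadsheet tools tend to drop leading zeros or switch long IDs to scientific notation when they treat them as numbers.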

Once the census tracts were cleaned and my list of neighborhoods completed, I manually input the corresponding neighborhood of each census tract in Excel.

*Fully cleaned data, with neighborhoods added to corresponding census tracts.*

I added the fully cleaned data to Tableau, and began playing around with different charts. At first, I was working with bar graphs, setting the y-axis as the neighborhoods and the x-axis as the temperature. That showed me the average temperature of each neighborhood. I didn’t think those graphs quite told the story I wanted to tell, so I shifted my focus.

*First look at temperature data broken down by neighborhood.*
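The neighborhood averages shown in the bar graphs amount to a lookup followed by a grouped mean. Here is a minimal sketch of that calculation, assuming a hand-built tract-to-neighborhood lookup like the lists described above. The tract IDs, neighborhood names, and temperatures are placeholders, not values from the project data, and `average_by_neighborhood` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def average_by_neighborhood(rows, tract_to_neighborhood):
    """Average tract-level temperatures within each neighborhood.

    `rows` is an iterable of (tract_id, temperature) pairs.
    `tract_to_neighborhood` maps a tract_id to a neighborhood name.
    Tracts missing from the lookup are returned separately so they
    can be checked by hand instead of being silently dropped.
    """
    grouped = defaultdict(list)
    unmatched = []
    for tract_id, temperature in rows:
        neighborhood = tract_to_neighborhood.get(tract_id)
        if neighborhood is None:
            unmatched.append(tract_id)
        else:
            grouped[neighborhood].append(temperature)
    averages = {name: mean(temps) for name, temps in grouped.items()}
    return averages, unmatched


# Placeholder values for illustration only.
lookup = {"T1": "Neighborhood A", "T2": "Neighborhood A", "T3": "Neighborhood B"}
rows = [("T1", 90.0), ("T2", 92.0), ("T3", 85.0), ("T4", 88.0)]
averages, unmatched = average_by_neighborhood(rows, lookup)
print(averages)
print(unmatched)
```

This is an unweighted average: every tract counts equally regardless of its area or population. Weighting by either would be a different design choice and could shift a neighborhood's figure.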

I turned to using a scatter plot to graph the temperature and canopy coverage of each census tract. That would confirm the trend I saw in the maps.

*Tweaked the temperature range for a closer look.*

Indeed, the scatter plots confirmed what I had suspected: the areas with less canopy get hot early, are among the hottest in the afternoon, and remain hot well into the evening. Areas with a larger canopy stay cooler in the morning, reach an average temperature in the afternoon, and cool down rather quickly.
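One way to put a number on the trend in a scatter plot like this is the Pearson correlation coefficient, which ranges from -1 (perfect inverse relationship) to 1 (perfect direct relationship). The sketch below is an optional extension rather than a step taken in this project; the function name and the sample values are placeholders.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with 2+ values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a list is constant")
    return covariance / sqrt(spread_x * spread_y)


# Placeholder values: canopy share per tract and afternoon temperature.
canopy_share = [1, 2, 3]
afternoon_temp = [6, 4, 2]
r = pearson_r(canopy_share, afternoon_temp)
print(round(r, 3))
```

A value near -1 for canopy against temperature would match the pattern described above, where less canopy goes with more heat. Correlation alone does not show that the trees cause the cooler readings.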

Andrew Brinker