Archive Updatepalooza

Mood

mood: excited

In the Age of Slop, curated link directories keep us connected.

I've allowed myself to be weirdly obsessive about the archive the past few weeks and I'm finally done. For real this time.

tl;dr

BIGGER AND BETTER

Search Filtering, or The Utility of Tags

It's Dangerous to Go Alone, Take Tags

I regularly ponder how to improve the archive's usefulness for site and friend hunting, but due to the sheer volume of neighbors I'm generally limited by what's available through the API.

I previously attempted to set up a NeoCities directory using the citizen data I've collected, but I quickly ran into issues due to tagging inconsistencies. As a random example, take the zelda tag. Some users who use the zelda tag have Zelda fansites or write Zelda fanfiction but some apparently use that tag because they just Really Like Zelda. (Valid, but it complicates my sacred quest!)

Tag data may still be useful though, so the tags are listed in a link title popup. But in practice... that's not very useful either! Nobody wants to hover over every single button in the archive to find sites about strawberries or webmasters who like strawberries.

Solution: an in-page search filter that allows you to quickly find buttons by name OR tagged interest. This is what strawberry returns in the S page:

strawberry-filter
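The real filter presumably runs in the browser, but the matching rule itself is simple enough to sketch in Python. Everything here is hypothetical (the `buttons` data shape, the site names, the `filter_buttons` helper); it only illustrates the "name OR tag" idea, not the archive's actual code.

```python
# Hypothetical data shape: each archived button has a site name and a list of tags.
buttons = [
    {"name": "strawberrymilk", "tags": ["art", "cute"]},
    {"name": "sodapop", "tags": ["strawberry", "food"]},
    {"name": "zeldafan", "tags": ["zelda", "games"]},
]


def filter_buttons(buttons, query):
    """Return buttons whose name OR any tag contains the query (case-insensitive)."""
    q = query.strip().lower()
    if not q:
        # An empty search box shows everything.
        return list(buttons)
    return [
        b for b in buttons
        if q in b["name"].lower() or any(q in t.lower() for t in b["tags"])
    ]


print([b["name"] for b in filter_buttons(buttons, "strawberry")])
```

Substring matching is the design choice that makes a search for art also catch tags like pixelart, for better or worse.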

Neat Search Tricks

If you're looking for friends with shared interests, useful keyword searches in our community might be terms like art, cute, technology, and so on.

Problemos

Secrets of Tag Magic

...NeoCities's own tag search remains one of the best ways to find neighbors with shared interests within the domain. Site tags are powerful dark magic and some webmasters don't even use them! I'll say it again...

Site tags are powerful dark magic.

NeoCities site tags are essentially webrings that never break. The caveat is there's no moderation, so sites might end up in "rings" that don't necessarily reflect their content. But please, I beg you. Become awesome, terrible cyber wizards. Embrace the power of the hyperlink dark arts and fill 👏 in 👏 your 👏 tags.

Scrape Your Neighbors, Scrape Your Friends

I was playing around on eightyeightthirty.one, a site that maps all sites linking each other by 88x31 buttons, and thought, "Can I leverage this data somehow?"

NARRATOR: In fact, they could.

mr-burns-excellent

breq's post on 88x31 Buttons and Network Science pointed me toward https://eightyeightthirty.one/graph.json, which includes linksto, linksfrom, and images for all the sites. Yoink that .json file, search for all links with Neocities linksto domains, and compare those to whatever's in my database. If it's not in there, grab the image urls and save the buttons by username. Easy peasy, but for some reason I couldn't figure out how to get the filenames with extensions, so I downloaded them as files without extensions and used some silly Python voodoo and the handy module filetype to figure out what was what:

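import os
import filetype

# IMAGE_DIRECTORY is the folder holding the downloaded, extensionless buttons
for file in os.listdir(IMAGE_DIRECTORY):
    head, tail = os.path.splitext(file)
    if not tail:  # only touch files that have no extension yet
        src = os.path.join(IMAGE_DIRECTORY, file)
        ext = filetype.guess(src)  # returns None when the type can't be guessed
        if ext is None:
            continue
        dst = os.path.join(IMAGE_DIRECTORY, file + "." + ext.extension)
        if not os.path.exists(dst):  # don't clobber an existing file
            os.rename(src, dst)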

Which I'm sure is terrible code but it worked so I don't care. This pulled hundreds of sites I hadn't archived yet. Score!
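For the curious, the "find Neocities targets I don't have yet" step can be sketched roughly like this. Big caveat: I'm assuming the linksto data boils down to a mapping of domain to a list of bare linked domains, which may not match the real layout of graph.json. The helper names and sample domains are made up for illustration.

```python
def neocities_user(domain):
    """Return the username for 'user.neocities.org', or None for anything else."""
    domain = domain.lower().rstrip("/")
    suffix = ".neocities.org"
    if domain.endswith(suffix) and len(domain) > len(suffix):
        return domain[: -len(suffix)]
    return None


def missing_users(links_to, archived):
    """Collect Neocities usernames that are link targets but not yet archived."""
    found = set()
    for targets in links_to.values():
        for target in targets:
            user = neocities_user(target)
            if user and user not in archived:
                found.add(user)
    return sorted(found)


# Toy stand-in for the linksto data (assumed shape, invented domains).
sample_links_to = {
    "example.com": ["alice.neocities.org", "bob.neocities.org"],
    "alice.neocities.org": ["carol.neocities.org", "example.org"],
}
print(missing_users(sample_links_to, archived={"alice"}))
```

The `archived` set stands in for whatever the database lookup really is.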

I also decided to see if I could get different versions of buttons, using the Python ImageHash module to compare them. This was dumb. The reason it's dumb is because all the buttons I archive are heavily optimized, and that process changes the resulting hash. Here's an example where 3 copies of the same button returned 3 different hashes:

ff377788880080ff 2cool4fp_2.gif
ff377708888080ff 2cool4fp_blob1 (1).gif
ff377788888080ff 2cool4fp_blob1 (2).gif

Well mama didn't raise no quitter as you all know so I optimized all the scraped buttons, then compared them. It helped. A little. Sometimes. I was able to whittle 3 potentially different images down to 2 in this example.

ff377788880080ff 2cool4fp_2.gif
ff377788888080ff 2cool4fp_blob1 (1).gif
ff377788888080ff 2cool4fp_blob1 (2).gif

So I started out with 7,600 images and whittled those down to about 3,000, but there were still dupes 😭 so I had to do the final comparison the old-fashioned way, with actual human eyeballs. The things I do to slake my irrational thirst for buttons. I swear.
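One thing that might cut down on the eyeball work is comparing hashes by how many bits differ instead of demanding an exact match. Here's a sketch using the three pre-optimization hashes from the first example. The `hamming` helper and the `THRESHOLD` value are my own assumptions, not something I actually ran on the archive.

```python
def hamming(hex_a, hex_b):
    """Count the differing bits between two equal-length hex hash strings."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


# The three hashes from the pre-optimization example above.
hashes = {
    "2cool4fp_2.gif": "ff377788880080ff",
    "2cool4fp_blob1 (1).gif": "ff377708888080ff",
    "2cool4fp_blob1 (2).gif": "ff377788888080ff",
}

THRESHOLD = 2  # assumed cutoff; a real value would need tuning against known dupes

names = list(hashes)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = hamming(hashes[first], hashes[second])
        verdict = "probably the same" if d <= THRESHOLD else "different"
        print(f"{first} vs {second}: {d} bits -> {verdict}")
```

A loose threshold will also lump together buttons that merely look alike, so this would shrink the pile rather than replace the human pass.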

Having nabbed everything from the .json linksto, I now turned my hungry gaze upon linksfrom. By elimination, the resulting pool only contained NeoCities sites that link to other sites, but aren't linked back by anyone in turn. Don't cry! Obscurities are truly the jewels of any fine collection, and I endeavored to diligently and lovingly collect them all. I ended up grabbing 700 more citizens this way. I think I ended up getting about 2000 buttons total from the eightyeightthirty.one database.

button-jail

BUTTON JAIL

The scraper makes two assumptions. The first is that 88x31 buttons linked to NeoCities sites are actual site buttons. The second is that the linked url is correct. Sometimes, one or both of these turn out to be false.

Button jail is necessary because 😱 SOME OF YOU USE PLACEHOLDER IMAGES O. M. G. 😱 But there are also common spelling differences, like behaviour instead of behavior, and typos (my personal favorite is whitedessert (whitedesert is a great site btw, check it out!)). Sometimes people link sites with completely random 88x31 images. If Crawlie finds something sus she'll plug it into the Wayback Machine to see if there's a site history. But in the case of the eightyeightthirty.one database, I mostly have to go through them by hand.
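Typos like whitedessert are the kind of thing fuzzy string matching could flag before a human looks. A minimal sketch with Python's standard `difflib` follows. The `suggest_user` helper, the `cutoff` value, and the toy username list are all assumptions for illustration, and this is not how button jail actually gets sorted.

```python
import difflib


def suggest_user(linked_name, known_users, cutoff=0.85):
    """Return the closest known username for a possibly misspelled one, or None."""
    if linked_name in known_users:
        return linked_name
    matches = difflib.get_close_matches(linked_name, known_users, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Toy list standing in for the archive's known usernames.
known = ["whitedesert", "neonaut", "strawberry"]
print(suggest_user("whitedessert", known))
print(suggest_user("totallynewsite", known))
```

A suggestion is only a hint: two legitimately different sites can have near-identical names, so anything flagged this way would still deserve a manual check.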

My beloved neighbors, could I interest you in a CSS PLACEHOLDER button during these trying times?

    <style>
        .button {
            background-color: #FF0080;
            color: white;
            text-align: center;
            text-decoration: none;
            vertical-align:middle;
            display: inline-block;
            height:31px;
            width:88px;
            font-size: 11px;
            line-height: 31px;
            border-radius:2px;
        }
    </style>

<a href="#" class="button">PLACEHOLDER</a>

Crawlin' After Yo Links

I did make some crawler modifications, too. I expanded my keyword list to "link","button","friend","88x31","outgoing", "neighbor","affiliate","neocitizen", "wall","sites","bookmark","home", and "network". Whenever Crawlie finds a button for an unarchived user, she now grabs the user's homepage and any linked pages that match the keyword list.
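In spirit, the keyword check looks something like the sketch below. I'm assuming a case-insensitive substring match against the URL path only. Crawlie's real matching rule may differ, and `worth_crawling` is a made-up name.

```python
from urllib.parse import urlparse

KEYWORDS = [
    "link", "button", "friend", "88x31", "outgoing", "neighbor", "affiliate",
    "neocitizen", "wall", "sites", "bookmark", "home", "network",
]


def worth_crawling(url):
    """True if the URL's path mentions any keyword (case-insensitive substring)."""
    # Only the path is checked, so a keyword inside the hostname doesn't count.
    path = urlparse(url).path.lower()
    return any(keyword in path for keyword in KEYWORDS)


print(worth_crawling("https://mycoolsite.neocities.org/links.html"))
print(worth_crawling("https://mycoolsite.neocities.org/about.html"))
```

Matching on the path rather than the whole URL keeps a site like mycoolsite from matching "sites" on every single page.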

"BUT NEONAUT WHY U NO LINK ME AAAAAAAAAAaaaa"

Just ask! The other way is to make sure you link yourself, either on your homepage or a page named something like friends, links, neighbors, buttons, etc... Crawlie is looking for link images that explicitly reference the neocities domain, like this:

<a href="https://mycoolsite.neocities.org/"><img src="mycoolbutton.blah"></a>
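Spotting that pattern can be sketched with nothing but Python's standard library. This is an illustration under my own assumptions (the `ButtonFinder` class is hypothetical and only handles an image nested directly inside a link), not Crawlie's actual source.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

SUFFIX = ".neocities.org"


class ButtonFinder(HTMLParser):
    """Collect (username, image src) pairs for images inside Neocities links."""

    def __init__(self):
        super().__init__()
        self.current_user = None  # username of the <a> we are currently inside
        self.buttons = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            host = (urlparse(attrs.get("href") or "").hostname or "").lower()
            if host.endswith(SUFFIX) and len(host) > len(SUFFIX):
                self.current_user = host[: -len(SUFFIX)]
            else:
                self.current_user = None
        elif tag == "img" and self.current_user and attrs.get("src"):
            self.buttons.append((self.current_user, attrs["src"]))

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_user = None


finder = ButtonFinder()
finder.feed(
    '<a href="https://mycoolsite.neocities.org/"><img src="mycoolbutton.blah"></a>'
    '<a href="https://example.com/"><img src="other.gif"></a>'
)
print(finder.buttons)
```

Links that point somewhere other than a neocities.org subdomain are skipped, which is why linking yourself with the full Neocities URL matters.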

Crawlie cruises the activity feed at my discretion and covers 20 pages worth of updates per crawl, which is about a day's worth of activity feed updates. I do not have a way to reliably grab variant buttons. There's not a good way to automate this, so it's still a manual process.

Anyway, enough of that. For real this time!