Home About Me

Building a Java Short URL Service That Can Handle Tens of Millions of Links

Short URL systems are usually built in one of two ways: hash the original URL into a compact identifier, or issue a sequential ID and encode it into a shorter string. Both approaches work, but they lead to very different trade-offs in collision handling, predictability, storage design, and performance under load.

Two common ways to generate short links

Hash-based generation

The first approach takes a long URL, runs it through a hash function, and turns the result into a short code. The main challenge is collision handling: different long URLs can, in theory, end up with the same short value.

A lot of people instinctively think of MD5 or SHA for hashing, but those are cryptographic algorithms. For a short-link service, reverse decryption resistance is not the main concern. What matters more is speed and a low collision rate. Using a heavyweight cryptographic hash here is usually unnecessary and comes with a performance cost.

A better fit is MurmurHash, a non-cryptographic hash function designed for general hash lookup scenarios. Compared with many widely used alternatives, it tends to produce better distribution for structured keys, and because it is not meant for encryption, it is much faster than algorithms like MD5 or SHA. In practice, its performance is often more than an order of magnitude higher. That combination of speed and distribution quality is why it has been widely adopted in systems such as Redis, MemCache, Cassandra, HBase, and Lucene.

Making the hash shorter

MurmurHash32 produces a 32-bit value, and MurmurHash64 produces a 64-bit value. To make the result suitable for a short URL, the numeric hash can be converted into Base62.

Why Base62 instead of Base64? Because Base62 uses exactly these characters:

  • a-z
  • A-Z
  • 0-9

That makes the generated short code compact while staying URL-friendly.

Dealing with collisions

No hash function is completely collision-free, even a very good one. So collision handling has to be part of the design.

A straightforward table structure looks like this:

CREATE TABLE `short_url` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `lurl` varchar(150) NOT NULL,
  `surl` varchar(10) NOT NULL,
  `gmt_create` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT '创建时间',
  PRIMARY KEY (`id`),
  UNIQUE KEY `idx_surl` (`surl`),
  KEY `idx_lurl` (`lurl`)
) ENGINE=InnoDB AUTO_INCREMENT=15536 DEFAULT CHARSET=utf8;

A basic collision-resolution flow is:

  1. Take the long URL, hash it with murmur64, encode it with Base62, and keep the first 6 characters.
  2. Check whether that short code already exists in the short_url table.
  3. If it does not exist, insert the mapping.
  4. If it already exists, append a random suffix to the original long string, hash again, and repeat until a free value is found.

That version works, but it is not ideal. Inserting one record requires two SQL operations: one lookup, then one insert. Under high concurrency, that extra round trip becomes a bottleneck.

A better design is to let the database help with collision detection:

  1. Add a unique index on surl.
  2. Skip the pre-check and insert directly after hashing.
  3. If insertion fails because of the unique constraint, treat that as a collision, re-hash with a randomized variant, and try again.

This means the normal path touches the database only once. Since MurmurHash collisions are rare, the cost of retrying only appears in exceptional cases, which makes this trade-off practical.

There is another issue: if the same long URL is submitted repeatedly in a short time, repeated generation requests can still cause avoidable work. To reduce that, an LRU cache can be introduced. If the long URL is already in cache, return the cached short URL immediately. Otherwise, generate it once and put the result into the cache.

That combination—single database write on the happy path plus LRU caching—improves throughput significantly.

ID generator approach

The second approach is to maintain a globally increasing ID such as 1, 2, 3.... Each new long URL receives an ID, and that ID is converted to Base62 and appended to the short-link domain.

This method is simple, but it has several drawbacks:

  • it requires a globally coordinated incrementing ID source;
  • the same long URL may receive different short URLs on different requests;
  • the resulting short codes follow a visible pattern and are easier to guess.

Common ways to generate those IDs include:

  • UUID
  • Redis counters
  • Snowflake IDs
  • MySQL auto-increment primary keys

Among these options, Redis-based counters and Snowflake-style IDs are generally more practical choices.

Why choose the hash-based route here

The design here uses hash mapping to generate short links.

For storage, MySQL is used as the system of record. With database sharding, table partitioning, and read/write splitting, MySQL can still deliver very good efficiency at scale.

Why not Redis or HBase as the primary store?

  • Using Redis alone as durable storage risks data loss when cache data is lost.
  • HBase may introduce less predictable efficiency characteristics for this use case.

That is why MySQL is chosen as the underlying persistent store.

Collision test results

The hash function testing results are fairly convincing.

With 1 million records, using murmur32 (a 32-bit hash), the system sees about 121 collisions:

i = 100000(10W), conflictSize = 1 i = 200000(20W), conflictSize = 6 i = 300000(30W), conflictSize = 12 i = 400000(40W), conflictSize = 19 i = 500000(50W), conflictSize = 32 i = 600000(60W), conflictSize = 46 i = 700000(70W), conflictSize = 54 i = 800000(80W), conflictSize = 76 i = 900000(90W), conflictSize = 94 i = 1000000(100W), conflictSize = 121

After switching to murmur64, the result is much better:

  • 1 million records: 0 collisions
  • 5 million records: 0 collisions

Based on that result, murmur64 is the recommended choice.

Core algorithm for generating short URLs

The critical part is not just hashing itself, but how collisions are handled and how the LRU cache avoids repeated work for hot URLs.

public String generateShortUrl(String longUrl) {
    if (StringUtils.isEmpty(longUrl)) {
        throw new RuntimeException("longUrl 不能为空");
    }

    String shortUrl = CacheUtils.get(MapConstants.longCache, longUrl);
    if (StringUtils.isNotEmpty(shortUrl)) {
        return shortUrl;
    }

    return getShortUrl(longUrl, getLongUrlRandom(longUrl));
}

private String getShortUrl(String rawUrl, String longUrl) {
    long hash = HashUtil.murmur64(longUrl.getBytes());
    String base62 = Base62.encode(hash + "");
    log.info("longUrl = {}, hash = {}, base62 = {}", longUrl, hash, base62);
    if (StringUtils.isEmpty(base62)) {
        throw new RuntimeException("hash 算法有误");
    }

    String shortUrl = StringUtils.substring(base62, 6);
    ShortUrl url = new ShortUrl(rawUrl, shortUrl);
    try {
        int insert = shortUrlDAO.insert(url); // 这里进行分库分表 提高性能
        if (insert == 1) {
            CacheUtils.put(MapConstants.longCache, rawUrl, shortUrl);
        }
    } catch (DuplicateKeyException  e) {
        // Hash冲突
        log.warn("hash冲突 触发唯一索引 rawUrl = {}, longUrl = {}, shortUrl = {}, e = {}", rawUrl, longUrl, shortUrl, e.getMessage(), e);
        CacheUtils.put(MapConstants.hashFailMap, rawUrl, shortUrl);
        return getShortUrl(rawUrl, getLongUrlRandom(shortUrl));
    } catch (Exception e) {
        log.error("未知错误 e = {}", e.getMessage(), e);
        throw new RuntimeException("msg = " + e.getMessage());
    }

    return shortUrl;
}

private String getLongUrlRandom(String longUrl) {
    return longUrl + RandomUtil.randomString(6);  // 解决冲突多的问题,随机字符串
}

A few implementation details stand out here:

  • Empty input is rejected immediately.
  • The system first checks an LRU cache keyed by the long URL.
  • If there is no cache hit, it hashes the URL with murmur64.
  • The hash is Base62-encoded, and a 6-character substring is used as the short code.
  • The mapping is inserted directly into MySQL.
  • If the insert triggers DuplicateKeyException, that means the unique index has caught a collision.
  • On collision, a random string is appended and the process retries recursively.

This avoids the classic “query first, insert second” pattern and makes the normal request path more efficient.

Core algorithm for resolving short URLs

The lookup path is also simple and optimized around caching:

public String getLongUrl(String shortUrl) {
    if (StringUtils.isEmpty(shortUrl)) {
        throw new RuntimeException("shortUrl 不能为空");
    }

    String longUrl = CacheUtils.get(MapConstants.shortCache, shortUrl);
    if (StringUtils.isNotEmpty(longUrl)) {
        return longUrl;
    }

    LambdaQueryWrapper<ShortUrl> wrapper = new QueryWrapper<ShortUrl>().lambda().eq(ShortUrl::getSUrl, shortUrl);
    ShortUrl url = shortUrlDAO.selectOne(wrapper);
    CacheUtils.put(MapConstants.shortCache, shortUrl, url.getLUrl());
    return url.getLUrl();
}

The read path works like this:

  • reject empty short codes;
  • check the LRU cache first;
  • if not found, query the database by surl;
  • cache the long URL and return it.

So in the normal case, generating a short URL requires only one database write, and resolving a short URL requires only one database read. That makes the service fast enough for high-volume traffic.

Key optimization points

Several design choices make this approach effective at larger scale:

  1. Only one database interaction is needed when generating a short URL.
    Instead of querying first and then deciding whether to insert, the system inserts directly and relies on the unique index to detect collisions.

  2. An LRU cache keeps recently generated mappings in memory.
    If the same long URL appears again soon after, the service can return the same short URL immediately.

  3. A random suffix is appended after a collision.
    This reduces the chance of repeated collisions for the same input.

  4. murmur64 is chosen over weaker or heavier alternatives.
    The test results show much better collision behavior than murmur32 in this scenario.

  5. Hot read traffic is absorbed by the cache.
    When resolving popular short links, the LRU cache can serve requests directly and prevent the database from being overwhelmed.

Final thoughts

A short-link service looks simple on the surface, but the details matter: hash choice, collision strategy, encoding, cache design, and the storage layer all shape the final result.

This design favors a hash-based scheme using murmur64, Base62 encoding, MySQL as durable storage, and LRU caching on both the write and read paths. The result is a service that keeps the common path extremely lean while still having a practical answer for rare collisions.

If you want to understand distributed systems and performance engineering in a hands-on way, building a short URL service is a very worthwhile exercise.