The test set for most of these experiments consists of RDF BIBFRAME graphs transformed from the MARC records of Colorado College's most popular materials using the Library of Congress's marc2bibframe conversion program. Some properties of this sample set are:
MARC Records | Triples | Works | Instances | Size (MB) |
---|---|---|---|---|
17,144 | 2,539,234 | 33,535 | 17,144 | 625.05 |
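Throughout these tests, we assume each triple is stored under a key built from the SHA1 digests of its subject, predicate, and object, which is the pattern the glob expressions below match against. Here is a minimal sketch of that scheme; the helper name make_triple_key is ours, purely for illustration:

import hashlib

def make_triple_key(subject, predicate, obj):
    # hypothetical helper: build the subject-sha1:predicate-sha1:object-sha1
    # key pattern that the KEYS glob expressions below match against
    return ":".join(hashlib.sha1(term.encode()).hexdigest()
                    for term in (subject, predicate, obj))

Retrieving every triple for a subject then reduces to matching the pattern subject-sha1:*:*.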
To test the raw performance of retrieving a single RDF graph for a particular work, we'll use the Redis KEYS command, which supports glob-style pattern matching across the entire datastore. We'll time the calls with Python's timeit module and graph the results with Matplotlib, simulating a linear increase in usage by scaling the number of runs from 1 to 10:
import timeit

redis_setup_stmt = """import redis
cc_popular = redis.StrictRedis()"""
# the glob pattern matches every triple key for this subject's SHA1 digest
redis_stmt = """cc_popular.keys("17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:*:*")"""
redis_trials = []
for count in range(1, 11):
    redis_timer = timeit.timeit(stmt=redis_stmt, setup=redis_setup_stmt, number=count)
    redis_trials.append(redis_timer)
Now that we have a baseline for the raw performance of the KEYS command, we'll run the same test, but this time use the requests library to connect to a running Falcon API instance to see what overhead a REST service adds on top of Redis:
setup_stmt = """import requests"""
# the subject URI is sent as form data to the Falcon service
falcon_stmt = """requests.get("http://localhost:18150", data={"s": "http://catalog.coloradocollege.edu/58052458"})"""
falcon_trials = []
for count in range(1, 11):
    falcon_timer = timeit.timeit(stmt=falcon_stmt, setup=setup_stmt, number=count)
    falcon_trials.append(falcon_timer)
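For context, the Falcon resource answering this request might look something like the hypothetical sketch below. The actual service code isn't reproduced here, so the lookup logic and parameter handling (reading the subject as a query parameter and digesting it with SHA1) are assumptions:

import hashlib
import falcon
import redis

class SubjectResource:
    # hypothetical resource: return every triple key for a subject URI
    def __init__(self):
        self.redis = redis.StrictRedis()

    def on_get(self, req, resp):
        subject = req.get_param("s")
        subject_sha1 = hashlib.sha1(subject.encode()).hexdigest()
        resp.media = {"keys": [key.decode()
                               for key in self.redis.keys("{}:*:*".format(subject_sha1))]}

app = falcon.App()  # falcon.API() in older Falcon releases
app.add_route("/", SubjectResource())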
Here is the test setup code for the Asynco server:
setup_stmt = """import requests"""
# the Asynco server takes the subject URI as a query parameter
asynco_stmt = """requests.get("http://localhost:7000?s=http://catalog.coloradocollege.edu/58052458")"""
asynco_trials = []
for count in range(1, 11):
    asynco_timer = timeit.timeit(stmt=asynco_stmt, setup=setup_stmt, number=count)
    asynco_trials.append(asynco_timer)
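With all three trial lists collected, the Matplotlib graphing mentioned earlier can be a simple line plot of execution time against the number of runs:

import matplotlib.pyplot as plt

runs = list(range(1, 11))
plt.plot(runs, redis_trials, marker="o", label="Direct Redis")
plt.plot(runs, falcon_trials, marker="s", label="Falcon API")
plt.plot(runs, asynco_trials, marker="^", label="Asynco")
plt.xlabel("# of runs")
plt.ylabel("execution time (seconds)")
plt.legend()
plt.show()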
# of Runs | Direct Redis (s) | Falcon API (s) | Asynco (s) |
---|---|---|---|
1 | 1.2281699029845186 | 1.2685371820116416 | 1.3175828389939852 |
2 | 2.6082807540078647 | 2.768362994014751 | 3.0104295710334554 |
3 | 3.5183271229616366 | 3.5287525659659877 | 3.708122154988814 |
4 | 4.681248573004268 | 4.727095118025318 | 4.8970251709688455 |
5 | 5.818655456008855 | 5.897049093968235 | 6.1605203920044005 |
6 | 7.248136567999609 | 7.0755964029813185 | 7.360615084005985 |
7 | 8.514002248994075 | 8.32872693700483 | 8.533214915019926 |
8 | 9.544208430976141 | 9.583168470009696 | 9.845916612015571 |
9 | 10.48881931899814 | 10.6294925850234 | 11.09655712498352 |
10 | 13.434723928978201 | 13.046950068033766 | 12.24355677596759 |
From this first pass, we see that regardless of the test, execution time increased in roughly linear fashion as the number of trials increased. This is encouraging and what we would expect. Second, neither Falcon nor the Asynco HTTP server added much overhead to the execution of the code. Finally, there are no appreciable differences between our Falcon and Asynco server code, making the choice between the two more a matter of style and ease of maintenance than of performance.
But before optimizing our choice of server technology, Falcon or Asynco, we'll first look at how we can improve the Redis backend implementation to use something other than the KEYS command. We should also be careful about drawing too many conclusions from a single test run; it should be repeated a few more times to make sure our results stay consistent over time, as in the sketch below.
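One lightweight way to check that consistency is timeit.repeat, which collects several independent samples of the same trial; a minimal sketch against the direct Redis statement from above:

# five independent samples of the 10-run KEYS trial
samples = timeit.repeat(stmt=redis_stmt, setup=redis_setup_stmt,
                        repeat=5, number=10)
print("min={:.2f}s max={:.2f}s".format(min(samples), max(samples)))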
As mentioned in the issues section of our first experiment, using the Redis KEYS command is discouraged in production. Redis provides a non-blocking alternative to KEYS called SCAN, which has O(1) time complexity per call. The SCAN command takes a cursor and performs one iteration per call; calling SCAN repeatedly eventually iterates through the entire Redis keyspace. The number of keys examined per iteration can be adjusted with the COUNT parameter.
In the test setup code below, we keep the number of runs consistent at 10 but vary the size of the COUNT parameter. Remember from our last test that executing 10 runs of the KEYS command with a direct Redis call took roughly 13 seconds.
redis_setup = """import redis
cache_redis = redis.StrictRedis()
def test_redis_scan(count):
    cur = 0
    output = []
    while 1:
        # SCAN returns the next cursor plus a batch of matching keys;
        # a cursor of 0 means the full keyspace has been iterated
        cur, results = cache_redis.scan(cur, match="17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:*:*", count=count)
        if len(results) > 0:
            output.extend(results)
        if not cur:
            break
    return output"""
scan_trials = []
for count in [1000, 2500, 5000, 10000, 25000, 50000, 75000, 100000, 250000, 500000]:
    redis_stmt = "test_redis_scan({})".format(count)
    redis_timer = timeit.timeit(stmt=redis_stmt, setup=redis_setup, number=10)
    scan_trials.append((count, redis_timer))
Count | Time (seconds) |
---|---|
1000 | 93.26906220900128 |
2500 | 102.64958874002332 |
5000 | 96.19367762497859 |
10000 | 94.76345123304054 |
25000 | 91.95245929301018 |
50000 | 97.29142061399762 |
75000 | 90.68025794799905 |
100000 | 91.00262290996034 |
250000 | 96.62939810799435 |
500000 | 88.64146210596664 |
Even the best result, a COUNT size of 500,000 for 10 trials, was 88 seconds, significantly worse than our KEYS test from above. Are there any alternative approaches we can examine? Yes! One of the nice features of Redis is that we can use a different data structure to improve performance.
An alternative to constructing Redis triple keys from the three SHA1 digests of each subject-predicate-object triple is to use Redis hashes to store the triple information. In a first pass at this approach, all of the predicate-object pairs for a subject are stored in a hash whose key is the subject's SHA1 digest with a suffix of :predicate-objects. In this model the hash fields still use the "predicate-sha1-digest:object-sha1-digest" pattern, with a field value of the integer 1. For the subject used in our example, the companion Redis key would be 17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:predicate-objects.
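Loading a triple into this companion hash is a single HSET call. The helper below is a hypothetical sketch, assuming each term is digested with SHA1 as in the key scheme above; the hgetall call that follows shows the resulting hash contents:

import hashlib
import redis

local_redis = redis.StrictRedis()

def add_predicate_object(subject, predicate, obj):
    # hypothetical helper: register one predicate-object pair in the
    # subject's :predicate-objects hash, mirroring the layout shown below
    s_sha1, p_sha1, o_sha1 = (hashlib.sha1(term.encode()).hexdigest()
                              for term in (subject, predicate, obj))
    local_redis.hset("{}:predicate-objects".format(s_sha1),
                     "{}:{}".format(p_sha1, o_sha1), 1)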
>>> local_redis.hgetall('17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:predicate-objects')
{b'f610f749c5c2eaf6718eb2bc24bf74559d14637d:1509401ff5fecad20101ebb147f77d82407b75ae': b'1',
b'a548a25005963f85daa1215ad90f7f1a97fbe749:899c924e4dc30d79cb6d11b5d03542e1e3d8c381': b'1',
b'a20301af19937f3787275c059dae953eaff2cb5f:262798fe370ba8f32d0aa9b40381c1cab4b8ae22': b'1',
b'a548a25005963f85daa1215ad90f7f1a97fbe749:8eda00df792b5ab31f14febf8fbae9f78bf820e7': b'1',
b'3c197cb1f6842dc41aa48dc8b9032284bcf39a27:5d1377f4476a1cbfb3caea106dc6b0a7d086410a': b'1',
b'0ba43b890c824baa68a0d304bbd0baab6bff921a:9c1020be297eed7dc8106954d1600ecd68448a64': b'1',
b'38e711cd1701b5f99ae70f2c9974aae8a24944d1:23b04598de49ffeeddf305afbdea4d7fa3127299': b'1',
b'0f08c96e756a4fa720257bf3090efdf76b5d3acc:0672781095db8ffcb1977d58687125df74c3e5c7': b'1',
b'1800d68bd2be5785c5505e09329f3d37b02c9d01:978170a2adcba8e0f7c41158c420d642110cd59e': b'1',
b'3c197cb1f6842dc41aa48dc8b9032284bcf39a27:0139c97cff6054f868eadb4b0e70a58da37238bd': b'1'}
Now we'll run this test code and see what the results are:
redis_setup = """import redis
cache_redis = redis.StrictRedis()"""
# a single HGETALL retrieves every predicate-object pair for the subject
redis_stmt = """cache_redis.hgetall('17e3f5ff4da952ec9d9f850a571a8d9cf5b1ec69:predicate-objects')"""
redis_hash_trials = []
for count in range(1, 11):
    hash_timer = timeit.timeit(stmt=redis_stmt, setup=redis_setup, number=count)
    redis_hash_trials.append(hash_timer)
# of Runs | Time (seconds) |
---|---|
1 | 0.00512834801338613 |
2 | 0.002946369000710547 |
3 | 0.002188334008678794 |
4 | 0.0025909639662131667 |
5 | 0.0024501450243405998 |
6 | 0.0027079840074293315 |
7 | 0.0031070280238054693 |
8 | 0.003147128038108349 |
9 | 0.0033804820268414915 |
10 | 0.004014718986582011 |
Even running timeit's default of 1,000,000 runs results in respectable performance using hashes:
>>> hash_timer = timeit.timeit(stmt=redis_stmt, setup=redis_setup)
>>> print(hash_timer)
243.62794216600014
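That works out to roughly 0.24 milliseconds per HGETALL call, compared with over a second per KEYS call in the earlier tests.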