jcjones / ct-mapreduce

Map/Reduce functions for processing Certificate Transparency. Used for https://LetsEncrypt.org/stats

Home Page: https://ct.tacticalsecret.com/

Go 100.00%

ct-mapreduce's People

Contributors

jcjones, pateldt


Forkers

hunslater

ct-mapreduce's Issues

cannot parse dnsName "vmext21-065.gwdg.de."

groovetop:ct-mapreduce lcrouch$ ct-fetch -config ~/.ct-fetch.conf
Saving to disk at /tmp/ct
[https://ct.googleapis.com/rocketeer] Starting download.
[https://ct.googleapis.com/rocketeer] Fetching signed tree head...
[https://ct.googleapis.com/rocketeer] Counting existing entries...
[https://ct.googleapis.com/rocketeer] 227988705 total entries at Thu Mar 15 10:00:55 2018
[https://ct.googleapis.com/rocketeer] Going from 0 to 227988705
| 0.0% (2711 of 227988705) Rate: 5287/minute (718h39m0s remaining)
[https://ct.googleapis.com/rocketeer] Download halting, error caught: failed to parse certificate in MerkleTreeLeaf for index 5067: x509: cannot parse dnsName "vmext21-065.gwdg.de."
[ct.googleapis.com/rocketeer] Saved state. MaxEntry=4096, LastEntryTime=2014-09-09 08:29:53.000000573 -0500 CDT

~/.ct-fetch.conf:

issuerCNList = DigiCert
logList = https://ct.googleapis.com/rocketeer # DigiCert is in here
certPath = /tmp/ct

downloadCTRangeToChannel drops entries on contention

This is pretty much my fault for not being more thorough in e09f015. Issue #2 predicted this, calling the back-off logic unnecessary, which it really is.

The issue here is that the select statement in downloadCTRangeToChannel provides ways out that neither abort nor pass the CT entry on for evaluation:

// Are there waiting signals?
select {
case sig := <-sigChan:
	glog.Infof("[%s] Signal caught: %s", lw.LogURL, sig)
	return index, lastEntryTimestamp, nil
case entryChan <- CtLogEntry{logEntry, lw.LogURL}:
	lastEntryTimestamp = uint64ToTimestamp(logEntry.Leaf.TimestampedEntry.Timestamp)
	lw.Backoff.Reset()
case <-lw.SaveTicker.C:
	lw.saveState(index, lastEntryTimestamp)
default:
	// Channel full, retry
	duration := lw.Backoff.Duration()
	metrics.IncrCounter([]string{"downloadCTRangeToChannel", "channelFull"}, 1)
	metrics.AddSample([]string{"downloadCTRangeToChannel", "channelFullBackoff"},
		float32(duration.Milliseconds()))
	time.Sleep(duration)
}

Both the SaveTicker path and the default case drop the entry held in logEntry into the ether and then proceed back to the top of the loop, grabbing a new entry.
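
One possible restructuring, as a minimal sketch reusing the names from the snippet above (lw, sigChan, entryChan, CtLogEntry): retry the same entry until it is actually handed off, and drop the default case so a full channel blocks instead of discarding the entry.

delivered := false
for !delivered {
	select {
	case sig := <-sigChan:
		glog.Infof("[%s] Signal caught: %s", lw.LogURL, sig)
		return index, lastEntryTimestamp, nil
	case entryChan <- CtLogEntry{logEntry, lw.LogURL}:
		lastEntryTimestamp = uint64ToTimestamp(logEntry.Leaf.TimestampedEntry.Timestamp)
		lw.Backoff.Reset()
		delivered = true
	case <-lw.SaveTicker.C:
		// Persist progress, but keep retrying the same logEntry.
		lw.saveState(index, lastEntryTimestamp)
	}
}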

ct-fetch should periodically save its state on long downloads

Currently ct-fetch persists its log state only when a log download completes or catches up. That's fine for maintenance, but during an initial sync, which can take many days, the log state won't be persisted unless the user manually issues a SIGTERM or Ctrl-C and then restarts.

Really, we should download smaller batches and persist state in between them.
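
A minimal sketch of that batching, assuming a tree size taken from the signed tree head and a helper that downloads one contiguous range. The names and the downloadCTRangeToChannel signature below are illustrative, not the actual ct-fetch API:

const batchSize = 4096

for start := resumeIndex; start < treeSize; start += batchSize {
	end := start + batchSize
	if end > treeSize {
		end = treeSize
	}
	// Hypothetical call standing in for the existing range download.
	lastIndex, lastTime, err := lw.downloadCTRangeToChannel(start, end, entryChan, sigChan)
	if err != nil {
		return err
	}
	// Persist progress between batches, not only when the log catches up.
	lw.saveState(lastIndex, lastTime)
}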

Segment cache data on expiration date+hour

The number of bytes per cert is directly related to the number of certs in a key set: on a set of a few thousand certs it's 24 bytes/cert; on one of a few hundred thousand, it's 45 bytes/cert.

Let’s Encrypt’s certs expiring on December 30th - 1509338 certs - is 159937855 bytes, or 105.96 bytes per cert.

To improve space utilization, I can easily segment the in-cache data by issuer/date/hour-of-day, because I always know the hour-of-day and it doesn't affect the final filter. This would also let me do hourly revocation removals, which is a long-term goal.
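
As a hypothetical illustration of that key layout (not the current schema), the cache key could carry the issuer and the expiration date+hour directly:

// serialSetKey returns a hypothetical Redis key segmented by issuer and
// expiration date+hour, e.g. "serials::<issuerID>::2019-12-30-13".
// Assumes "fmt" and "time" are imported.
func serialSetKey(issuerID string, notAfter time.Time) string {
	return fmt.Sprintf("serials::%s::%s", issuerID, notAfter.UTC().Format("2006-01-02-15"))
}

Each key set then stays small, and whole hours can be expired or swept for revocations independently.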

ct-fetch should verify certs

Around these lines:

if len(ep.LogEntry.Chain) < 1 {
	glog.Warningf("[%s] No issuer known for certificate precert=%v index=%d serial=%s subject=%+v issuer=%+v",
		ep.LogURL, precert, ep.LogEntry.Index, storage.NewSerial(cert).String(), cert.Subject, cert.Issuer)
	continue
}
issuingCert, err := x509.ParseCertificate(ep.LogEntry.Chain[0].Data)
if err != nil {
	glog.Errorf("[%s] Problem decoding issuing certificate: index: %d error: %s", ep.LogURL, ep.LogEntry.Index, err)
	continue
}
metrics.MeasureSince([]string{"insertCTWorker", "ParseCertificates"}, parseTime)

ct-fetch should verify that the certificate was signed by its issuer, to ensure it's a real certificate. This is important in the event that a CT log is coerced to log an invalid certificate.

If the certificate is valid but from an unknown issuer, tools can more readily handle that via whitelisting. But it's much better to ensure that we never record certificates that are themselves actively fraudulent.
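
A minimal sketch of that check, assuming the x509 package in use exposes CheckSignatureFrom (both crypto/x509 and the certificate-transparency-go fork do); it would slot in right after issuingCert is parsed in the snippet above:

// Verify the leaf was actually signed by the first certificate in its chain
// before accepting it for storage.
if err := cert.CheckSignatureFrom(issuingCert); err != nil {
	glog.Errorf("[%s] Certificate not signed by its issuer: index: %d error: %s",
		ep.LogURL, ep.LogEntry.Index, err)
	continue
}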

iniflags: unknown flag names

With the configuration as suggested, ct-fetch doesn't start:

2017/08/16 19:55:33 iniflags: unknown flag name=[geoipDbPath] found at line [2] of file [./ct-fetch.conf]
2017/08/16 19:55:33 iniflags: unknown flag name=[runForever] found at line [10] of file [./ct-fetch.conf]

After commenting out those two lines, it runs.

HTTP Status 429 should back-off, not wait the whole restart period

[https://ct.googleapis.com/skydiver/] downloadCTRangeToChannel exited with an error: got HTTP Status "429 Too Many Requests", finalIndex=115867925, finalTime=2019-02-16 22:12:58.000000089 +0000 UTC

This error should be caught inside downloadCTRangeToChannel, which should use the backoff logic to retry.
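
As a sketch only (the names below are hypothetical stand-ins, not the actual ct-fetch internals), the retry could look like this inside the download loop:

// fetchBatch is a stand-in for the actual get-entries call.
entries, err := fetchBatch(ctx, start, end)
if err != nil {
	if strings.Contains(err.Error(), "429") {
		// Rate limited: back off and retry the same range instead of halting.
		duration := lw.Backoff.Duration()
		glog.Warningf("[%s] HTTP 429, backing off %s", lw.LogURL, duration)
		time.Sleep(duration)
		continue
	}
	return index, lastEntryTimestamp, err
}
// ... process entries as before ...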

TTLs not always set for serials lists

Some combination of ct-fetch and ct-reprocess-known-certs fails to set a TTL on all Redis cache keys. A recent fixup caught 437 keys with no TTL set at all that had not yet expired.

`ct-fetch` sleeps most of the time

Because LogDownloader.EntryChan is an unbuffered channel, and LogDownloader.downloadCTRangeToChannel has some weird backoff logic, a ct-fetch process ends up sleeping most of the time.

Making LogDownloader.EntryChan a buffered channel of the size of the batch increases the ct-fetch performance 10x in our environment:

diff --git a/cmd/ct-fetch/main.go b/cmd/ct-fetch/main.go
index 7bc12b1..f543574 100644
--- a/cmd/ct-fetch/main.go
+++ b/cmd/ct-fetch/main.go
@@ -71,7 +71,7 @@ type LogDownloader struct {
 func NewLogDownloader(db storage.CertDatabase) *LogDownloader {
        return &LogDownloader{
                Database:            db,
-               EntryChan:           make(chan CtLogEntry),
+               EntryChan:           make(chan CtLogEntry, 1024),
                Display:             utils.NewProgressDisplay(),
                ThreadWaitGroup:     new(sync.WaitGroup),
                DownloaderWaitGroup: new(sync.WaitGroup),

I'm not opening this as a PR, because I sense that the backoff logic should probably be removed too. The producer can, and probably just should, block on the channel when it is full.

Support multithreaded download per log

Currently we read logs linearly, which limits us to single-threaded log download throughput. Most logs will accept simultaneous readers, and for catching up it'd be nice to take buckets of, say, 1M entries and process them in parallel.
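
A rough sketch of the bucketed approach, with hypothetical names (downloadBucket, firstIndex, treeSize) standing in for whatever the real plumbing ends up being:

const bucketSize = 1000000 // ~1M entries per bucket

var wg sync.WaitGroup
for start := firstIndex; start < treeSize; start += bucketSize {
	end := start + bucketSize
	if end > treeSize {
		end = treeSize
	}
	wg.Add(1)
	go func(start, end uint64) {
		defer wg.Done()
		// Each bucket is fetched independently; entries still funnel into
		// the shared entryChan for processing.
		if err := downloadBucket(start, end, entryChan); err != nil {
			glog.Errorf("bucket [%d, %d) failed: %s", start, end, err)
		}
	}(start, end)
}
wg.Wait()

State saving would then need to track per-bucket completion, so a restart doesn't skip a bucket that never finished.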

Unexpected errors


LoadCertificatePEM unexpected error, issuer=YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg= expDate=2020-01-18, serial=035c5c3d88d9c2ca42fbe6204eccd5169348 time=1m0.051885077s skipping: Couldn't get document snapshot for ct/2020-01-18/issuer/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=/certs/A1xcPYjZwspC--YgTszVFpNI: rpc error: code = Unavailable desc = The datastore operation timed out, or the data was temporarily unavailable.

LoadCertificatePEM unexpected error, issuer=YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg= expDate=2020-01-03, serial=035bfa58626bca42724556637ffd30fb00c7 time=1m0.128993994s skipping: Couldn't get document snapshot for ct/2020-01-03/issuer/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=/certs/A1v6WGJrykJyRVZjf_0w-wDH: rpc error: code = Unavailable desc = The datastore operation timed out, or the data was temporarily unavailable.

Show a progress bar of 'state of CT'

It'd be nice to have a total-percentage-of-CT figure that sums all logs and their maximum entry counts, just as a talking point and a general total-time-to-sync mechanism.

Over time, somehow multiple threads for a single log started

yeti2019.ct.digicert.com/log/  [-------] 2 %      13h4m10s   5157404 / 277040229
yeti2019.ct.digicert.com/log/  [-------] 3 %      8h52m45s   7253760 / 271896381
yeti2019.ct.digicert.com/log/  [==>---] 57 %      4h11m48s 151620453 / 264662809

No idea how this happened; perhaps redirects or something similar (though probably not, since the internal state wouldn't have moved).

Re-analyze context deadlines

StreamSerialsForExpirationDateAndIssuer iter.Next unexpected code Unknown aborting: (2019-11-15/hOF1upiO-xevnDEQ13g3aY3M3Uqa44RN5rVlwvU2WC8=) (total time: 1m21.131437568s) (count=0) (offset=24576) err=context deadline exceeded

after many

LoadCertificatePEM failed, issuer=YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg= expDate=2019-11-08, serial=034b33c63cf4212dd42fa582ef6e79789e7c Couldn't get document snapshot for ct/2019-11-08/issuer/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=/certs/A0szxjz0IS3UL6WC7255eJ58: context deadline exceeded

The PEM-loading method should be much more forgiving of congestion now, as it's happening in its own thread.

ApproximateMostRecentUpdateTimestamp is really slow

ApproximateMostRecentUpdateTimestamp, added in #41, uses Redis SCAN, which is just stupidly slow even for a really narrow key scope.

Most of #41 should be reverted in favor of analyzing shared state from the LogSyncEngine's LogWorkers, which it currently does not track but should.

func (ld *LogSyncEngine) ApproximateMostRecentUpdateTimestamp() time.Time {
	var mostRecent *storage.CertificateLog
	for _, log := range ld.database.GetAllLogStates() {
		if mostRecent == nil || log.LastUpdateTime.After(mostRecent.LastUpdateTime) {
			mostRecent = log
		}
	}
	if mostRecent == nil {
		// No log states yet; avoid a nil dereference below.
		return time.Time{}
	}
	glog.V(4).Infof("Most recently updated log was %+v", mostRecent)
	return mostRecent.LastUpdateTime
}

Remove Firestore support

Firestore is just not great. I'd rather see a backend implementation using Google Cloud Storage if we need bulk PEM data again. In the meantime, once Firestore is phased out of the CRLite deployment, the code will rot, and I don't think it should remain in-tree.

structure error: E: integer not minimally-encoded

Behind #4 lurks this error:

groovetop:ct-mapreduce lcrouch$ ct-fetch -config ~/.ct-fetch.conf --offset 5068
Saving to disk at /tmp/ct
[https://ct.googleapis.com/rocketeer] Starting download.
[https://ct.googleapis.com/rocketeer] Fetching signed tree head...
[https://ct.googleapis.com/rocketeer] Starting from offset 5068
[https://ct.googleapis.com/rocketeer] 227985342 total entries at Thu Mar 15 09:02:00 2018
[https://ct.googleapis.com/rocketeer] Going from 5068 to 227985342
| 0.0% (17408 of 227980274) Rate: 4069/minute (933h41m0s remaining)
[https://ct.googleapis.com/rocketeer] Download halting, error caught: failed to parse certificate in MerkleTreeLeaf for index 24462: asn1: structure error: E: integer not minimally-encoded
[ct.googleapis.com/rocketeer] Saved state. MaxEntry=23500, LastEntryTime=2014-09-09 08:42:29.000000919 -0500 CDT

Same config as #4.

diskdatabase.go should be threadsafe

All the good thread-safety machinery is missing, since we're forced to a single worker thread right now. diskdatabase.go should choose a mechanism to ensure thread safety for individual files.
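
A minimal sketch of one such mechanism (a hypothetical type, not existing code): a mutex per file path, so multiple workers can write different files concurrently but never the same file at once. Assumes "sync" is imported.

type fileLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

// forPath returns the mutex guarding a given on-disk file, creating it lazily.
func (f *fileLocks) forPath(path string) *sync.Mutex {
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.locks == nil {
		f.locks = make(map[string]*sync.Mutex)
	}
	if _, ok := f.locks[path]; !ok {
		f.locks[path] = new(sync.Mutex)
	}
	return f.locks[path]
}

// Usage: lock := locks.forPath(certPath); lock.Lock(); defer lock.Unlock(); write the file.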

ct-fetch time estimates are always wildly optimistic

ct.googleapis.com/logs/argon2019/ [==================================>-------------------------------------] 49 %           10s 94976 / 192723

That won't complete within 10 minutes, let alone 10 seconds. We should unbreak that.

TTLs are off by one hour

At 23:01 UTC aggregate-known from CRLite printed warnings:
No cached certificates for issuer=CN=Go Daddy Secure Certificate Authority - G2,OU=http://certs.godaddy.com/repository/,O=GoDaddy.com\, Inc.,L=Scottsdale,ST=Arizona,C=US (8Rw90Ej3Ttt8RRkrg-WYDS9n7IS03bk5bjP_UXPtaY8=) expDate=2019-11-26-23, but the loader thought there should be. (current count this worker=4782155)
No cached certificates for issuer=CN=cPanel\, Inc. Certification Authority,O=cPanel\, Inc.,L=Houston,ST=TX,C=US (hOF1upiO-xevnDEQ13g3aY3M3Uqa44RN5rVlwvU2WC8=) expDate=2019-11-26-23, but the loader thought there should be. (current count this worker=4406637)

The Redis cache expired those because hour 23 started, but they should have stuck around until hour 23 ended, so we have a fencepost issue somewhere. It's possible this is in my repair script, not in the Go implementation. (See #32, which indicates why I have a repair script.)
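
For reference, the intended arithmetic is simply that a key for hour 23 should live until hour 23 ends. A sketch with a hypothetical helper, assuming "time" is imported:

// ttlFor returns how long a key for the given expiration date+hour should
// live: until the end of that hour, not its start.
func ttlFor(expDateHour, now time.Time) time.Duration {
	endOfHour := expDateHour.Truncate(time.Hour).Add(time.Hour)
	return endOfHour.Sub(now)
}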

ct-mapreduce-map does not handle CT pre-certificates

Pre-certificates, such as:

-----BEGIN CERTIFICATE-----
MIIEI6ADAgECAhADktkhTsTVqOlF0JrhFs/ZMA0GCSqGSIb3DQEBCwUAME0xCzAJ
BgNVBAYTAlVTMRUwEwYDVQQKEwxEaWdpQ2VydCBJbmMxJzAlBgNVBAMTHkRpZ2lD
ZXJ0IFNIQTIgU2VjdXJlIFNlcnZlciBDQTAeFw0xODAzMTYwMDAwMDBaFw0xODA2
MTIxMjAwMDBaMIGDMQswCQYDVQQGEwJTRTESMBAGA1UECBMJU3RvY2tob2xtMRIw
EAYDVQQHEwlTdG9ja2hvbG0xETAPBgNVBAoTCEVyaWNzc29uMRQwEgYDVQQLEwtJ
VCBTRVJWSUNFUzEjMCEGA1UEAxMaYXR0d2lmaS5kcml2ZS5lcmljc3Nvbi5uZXQw
ggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDcWzCAXYfzaz3hzzbtkUJW
N32EzDzNzipCdPirv5dlvJicbh8+rUwTK37jkq+pHtcCLf+gJqgTXsMyB1znYizc
zH2HxZEh8TgMr5/B0VPU/xEysPyioRkDBzHBqXx2WJPZrZuyvK7hmVHHragmHOZa
tHO7zzF/rDMInOGNoZ1IRCpfMi9jMKuWcahCHQ4A9ipgRB0dBOEhvbT7Yg9jfyu4
yUexh2aNbM7ZxZrl8FPhlPgnvJdzaWecDF8BrYgidBtXhhfjDiGgukQg7T2DAzqz
hfbFBqLN4dbHjLrLWst9Z+MZvg67rHWdpKREBF16zeP36j6/Shg6pph9vDQLqCyl
AgMBAAGjggHeMIIB2jAfBgNVHSMEGDAWgBQPgGEcgjFh1S8o541GOLQs4cbZ4jAd
BgNVHQ4EFgQUjVxUn3OkhHByp9sET7m0CNAiXHAwJQYDVR0RBB4wHIIaYXR0d2lm
aS5kcml2ZS5lcmljc3Nvbi5uZXQwDgYDVR0PAQH/BAQDAgWgMB0GA1UdJQQWMBQG
CCsGAQUFBwMBBggrBgEFBQcDAjBrBgNVHR8EZDBiMC+gLaArhilodHRwOi8vY3Js
My5kaWdpY2VydC5jb20vc3NjYS1zaGEyLWc2LmNybDAvoC2gK4YpaHR0cDovL2Ny
bDQuZGlnaWNlcnQuY29tL3NzY2Etc2hhMi1nNi5jcmwwTAYDVR0gBEUwQzA3Bglg
hkgBhv1sAQEwKjAoBggrBgEFBQcCARYcaHR0cHM6Ly93d3cuZGlnaWNlcnQuY29t
L0NQUzAIBgZngQwBAgIwfAYIKwYBBQUHAQEEcDBuMCQGCCsGAQUFBzABhhhodHRw
Oi8vb2NzcC5kaWdpY2VydC5jb20wRgYIKwYBBQUHMAKGOmh0dHA6Ly9jYWNlcnRz
LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydFNIQTJTZWN1cmVTZXJ2ZXJDQS5jcnQwCQYD
VR0TBAIwAA==
-----END CERTIFICATE-----

prompt errors like:
/tmp/x/2020-05-16/bob.pem:0 Unable to load certificate

Upgrading to Python Cryptography 2.0 provides some CT support, but not enough to avoid having this line fail:

cert = x509.load_der_x509_certificate(der_data, default_backend())

Handle too-large known certificates in Firestore

Error writing known certificates 2020-03-07::gxeKFFaZ2HFJIsTdTjEl6nVo3ckTCX-qzRMqb9Xoa1w=: rpc error: code = InvalidArgument desc = A document cannot be written because it exceeds the maximum size allowed."

The issuer's known-certs cache document was lost. This is pretty critical. The max size for a document is 1 MB, and a back-of-the-envelope estimate suggested that was large enough, but it clearly isn't.
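
One way out, sketched with hypothetical names: shard each issuer/date's serial list across several numbered chunk documents, each safely below the 1 MB limit.

const maxSerialsPerDoc = 20000 // conservative, to stay well under Firestore's 1 MB document limit

// chunkSerials splits a known-serials list into pieces small enough to write
// as separate documents (e.g. .../knownCerts/0, /1, /2, ...).
func chunkSerials(serials []string) [][]string {
	var chunks [][]string
	for len(serials) > maxSerialsPerDoc {
		chunks = append(chunks, serials[:maxSerialsPerDoc])
		serials = serials[maxSerialsPerDoc:]
	}
	return append(chunks, serials)
}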

Close this repository

Cross-referencing mozilla/crlite#119 ... Basically, I no longer map/reduce CT, and I'm not sure this library could even do it at this point. The ct-fetch tool and its associated pieces should just move into CRLite, and everything not CRLite-related should be removed. This repo should get an update to its README marking it out of use and pointing to CRLite.

De-duplicate with multiple logs

The on-disk storage for ct-fetch serializes certificates straight to disk; it does not maintain any state about certificates to know whether they've already been written. So right now, if you change logs or have multiple logs, you'll get duplicates.

The FQDN and RegDom map/reduce functions will probably handle that OK, but it might inflate the cert count since that counting is done in a simple fashion. Also, it's wasteful of disk space.

Each day's metadata should maintain a list of seen issuer/serial combinations that can be used to de-dupe and decide whether we should re-serialize a cert as it's encountered in CT, as sketched below.
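
A minimal sketch of that metadata, with hypothetical types: an issuer/serial set consulted before serializing.

// seenSet tracks issuer/serial pairs already written for a given day.
type seenSet map[string]struct{}

func seenKey(issuerID, serial string) string {
	return issuerID + "::" + serial
}

// insertIfNew records the pair and reports whether it was unseen; callers
// skip re-serializing the certificate when it returns false.
func (s seenSet) insertIfNew(issuerID, serial string) bool {
	k := seenKey(issuerID, serial)
	if _, ok := s[k]; ok {
		return false // duplicate; don't re-serialize
	}
	s[k] = struct{}{}
	return true
}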

Don't require a PEM storage backend

It should be possible to update only the memorycache (Redis). Most notably, this means CT log metadata also needs to be stored in Redis.

Unexpected fatals

Fatals received: StreamSerialsForExpirationDateAndIssuer iter.Next unexpected code Internal aborting: (2020-01-21/hOF1upiO-xevnDEQ13g3aY3M3Uqa44RN5rVlwvU2WC8=) (total time: 27m24.196755175s) (count=6069) (offset=160314) err=rpc error: code = Internal desc = unexpected EOF
StreamSerialsForExpirationDateAndIssuer iter.Next unexpected code Internal aborting: (2019-12-21/YLh1dUR9y6Kja30RrAn7JKnbQG_uEtLMkBgFF2Fuihg=) (total time: 4h37m30.615102801s) (count=2807) (offset=841319) err=rpc error: code = Internal desc = unexpected EOF
